
Generalized Linear Models

In [1]: import numpy as np

In [2]: import statsmodels.api as sm

In [3]: from scipy import stats

In [4]: from matplotlib import pyplot as plt

Example of using GLM on binomial response data. The input response vector in this case is N by 2: (number of successes, number of failures). These data are taken with permission from Jeff Gill (2000), Generalized Linear Models: A Unified Approach.

The response variable is (number of students above the national math median, number of students below).

The explanatory variables are (in column order):
The proportion of low income families, "LOWINC"
The proportions of minority students, "PERASIAN", "PERBLACK", "PERHISP"
The percentage of minority teachers, "PERMINTE"
The median teacher salary including benefits in $1000s, "AVSALK"
The mean teacher experience in years, "AVYRSEXP"
The per-pupil expenditures in $1000s, "PERSPENK"
The pupil-teacher ratio, "PTRATIO"
The percent of students taking college credit courses, "PCTAF"
The percentage of charter schools in the district, "PCTCHRT"
The percent of schools in the district operating year round, "PCTYRRND"
The following are interaction terms: "PERMINTE_AVYRSEXP", "PERMINTE_AVSAL",
"AVYRSEXP_AVSAL", "PERSPEN_PTRATIO", "PERSPEN_PCTAF", "PTRATIO_PCTAF",
"PERMINTE_AVYRSEXP_AVSAL", "PERSPEN_PTRATIO_PCTAF"

In [5]: data = sm.datasets.star98.load()

In [6]: data.exog = sm.add_constant(data.exog)
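
To cross-check the column ordering against the list above, the dataset's variable names can be printed (a quick check outside the transcript, assuming the usual statsmodels Dataset attribute exog_name):

# names of the explanatory columns, in order; the constant added above is
# not in this list (in this transcript it is appended last, as the const
# row at the bottom of the summary table below confirms)
print(data.exog_name)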

The response variable is (success, failure). E.g., the first observation is

In [7]: print(data.endog[0])
[ 452.  355.]

The total number of trials for this observation is the row sum, as the quick check below shows.
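
A one-liner outside the transcript:

# total trials = successes + failures = 452 + 355 = 807
print(data.endog[0].sum())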

In [8]: glm_binom = sm.GLM(data.endog, data.exog, family=sm.families.Binomial())

In [9]: binom_results = glm_binom.fit()

The fitted parameters are

In [10]: print(binom_results.params)
[-0.01681504  0.00992548 -0.01872421 -0.01423856  0.25448717  0.24069366
  0.08040867 -1.9521605  -0.33408647 -0.16902217  0.0049167  -0.00357996
 -0.01407656 -0.00400499 -0.0039064   0.0917143   0.04898984  0.00804074
  0.00022201 -0.00224925  2.95887793]

The corresponding t-values are

In [11]: print(binom_results.tvalues)
[-38.74908321  16.50473627 -25.1821894  -32.81791308   8.49827113
   4.21247925   5.7749976   -6.16191078  -5.45321673  -5.16865445
   3.92119964 -15.87825999  -7.39093058  -8.44963886  -4.05916246
   6.3210987    6.57434662   5.36229044   7.42806363  -6.44513698
   1.91301155]

It is common in GLMs with interactions to compare first differences. We are interested in the difference in the impact of an explanatory variable on the response variable. This example uses the interquartile difference for the percentage of low income households, while holding the other values constant at their means.

In [12]: means = data.exog.mean(axis=0)

In [13]: means25 = means.copy()

In [14]: means25[0] = stats.scoreatpercentile(data.exog[:,0], 25)

In [15]: means75 = means.copy()

In [16]: means75[0] = stats.scoreatpercentile(data.exog[:,0], 75)

In [17]: resp_25 = binom_results.predict(means25)

In [18]: resp_75 = binom_results.predict(means75)

In [19]: diff = resp_75 - resp_25

The interquartile first difference for the percentage of low income households in a school district is

In [20]: print(diff*100)
-11.8752509918

In [21]: means0 = means.copy()

In [22]: means100 = means.copy()

In [23]: means0[0] = data.exog[:,0].min()

In [24]: means100[0] = data.exog[:,0].max()

In [25]: resp_0 = binom_results.predict(means0)

In [26]: resp_100 = binom_results.predict(means100)

In [27]: diff_full = resp_100 - resp_0

In [28]: print("The full range difference is %2.4f %%" % (diff_full*100))
The full range difference is -36.1636 %
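
The same computation generalizes to any column and any pair of percentiles. A minimal sketch (the helper first_difference is hypothetical, not part of statsmodels):

# Hypothetical helper: first difference in the predicted response when one
# explanatory variable moves between two percentiles, with the other
# variables held at their means.
def first_difference(results, exog, col, lo=25, hi=75):
    base = exog.mean(axis=0)
    x_lo, x_hi = base.copy(), base.copy()
    x_lo[col] = stats.scoreatpercentile(exog[:, col], lo)
    x_hi[col] = stats.scoreatpercentile(exog[:, col], hi)
    return results.predict(x_hi) - results.predict(x_lo)

# reproduces the interquartile difference computed above
print(first_difference(binom_results, data.exog, 0) * 100)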

In [29]: nobs = binom_results.nobs

In [30]: y = data.endog[:,0]/data.endog.sum(1)

In [31]: yhat = binom_results.mu

Plot of yhat vs. y

In [32]: plt.figure()
Out[32]: <matplotlib.figure.Figure at 0xb229ee4c>

In [33]: plt.scatter(yhat, y)
Out[33]: <matplotlib.collections.PathCollection at 0xb232768c>

In [34]: line_fit = sm.OLS(y, sm.add_constant(yhat)).fit().params

In [35]: fit = lambda x: line_fit[1]+line_fit[0]*x # intercept + slope*x; the constant was appended last

In [36]: plt.plot(np.linspace(0,1,nobs), fit(np.linspace(0,1,nobs)))
Out[36]: [<matplotlib.lines.Line2D at 0xb1a9702c>]

In [37]: plt.title('Model Fit Plot')
Out[37]: <matplotlib.text.Text at 0xb22539ec>

In [38]: plt.ylabel('Observed values')
Out[38]: <matplotlib.text.Text at 0xb22e732c>

In [39]: plt.xlabel('Fitted values')
Out[39]: <matplotlib.text.Text at 0xb22e4a6c>
[Figure: Model Fit Plot (glm_fitted.png)]

Plot of yhat vs. Pearson residuals

In [40]: plt.figure()
Out[40]: <matplotlib.figure.Figure at 0xb20f4dcc>

In [41]: plt.scatter(yhat, binom_results.resid_pearson)
Out[41]: <matplotlib.collections.PathCollection at 0xb231da8c>

In [42]: plt.plot([0.0, 1.0],[0.0, 0.0], 'k-')
Out[42]: [<matplotlib.lines.Line2D at 0xb2e70ecc>]

In [43]: plt.title('Residual Dependence Plot')
Out[43]: <matplotlib.text.Text at 0xb232766c>

In [44]: plt.ylabel('Pearson Residuals')
Out[44]: <matplotlib.text.Text at 0xb2317bac>

In [45]: plt.xlabel('Fitted values')
Out[45]: <matplotlib.text.Text at 0xb226b86c>
[Figure: Residual Dependence Plot (glm_resids.png)]

Histogram of standardized deviance residuals

In [46]: plt.figure()
Out[46]: <matplotlib.figure.Figure at 0xb21190cc>

In [47]: res = binom_results.resid_deviance.copy()

In [48]: stdres = (res - res.mean())/res.std()

In [49]: plt.hist(stdres, bins=25)
Out[49]: 
(array([  3.,   2.,   4.,  11.,  11.,  20.,  27.,  31.,  33.,  41.,  28.,
         26.,  23.,  17.,  11.,   4.,   2.,   3.,   1.,   2.,   1.,   0.,
          1.,   0.,   1.]),
 array([-2.54284768, -2.27049293, -1.99813819, -1.72578344, -1.45342869,
        -1.18107394, -0.9087192 , -0.63636445, -0.3640097 , -0.09165496,
         0.18069979,  0.45305454,  0.72540929,  0.99776403,  1.27011878,
         1.54247353,  1.81482827,  2.08718302,  2.35953777,  2.63189252,
         2.90424726,  3.17660201,  3.44895676,  3.7213115 ,  3.99366625,
         4.266021  ]),
 <a list of 25 Patch objects>)

In [50]: plt.title('Histogram of standardized deviance residuals')
Out[50]: <matplotlib.text.Text at 0xb2f6326c>
[Figure: Histogram of standardized deviance residuals (glm_hist_res.png)]

QQ Plot of Deviance Residuals

In [51]: plt.figure()
Out[51]: <matplotlib.figure.Figure at 0xb22e01cc>

In [52]: res.sort()

In [53]: p = np.linspace(0 + 1./(nobs-1), 1-1./(nobs-1), nobs)

In [54]: quants = np.zeros_like(res)

In [55]: for i in range(nobs):
   ....:     quants[i] = stats.scoreatpercentile(res, p[i]*100)
   ....: 
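The loop can also be vectorized, since np.percentile accepts an array of percentiles (an equivalent one-liner, assuming the same p as above):

# empirical quantiles of the deviance residuals in one call
quants = np.percentile(res, p * 100)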

In [56]: mu = res.mean()

In [57]: sigma = res.std()

In [58]: y = stats.norm.ppf(p, loc=mu, scale=sigma)

In [59]: plt.scatter(y, quants)
Out[59]: <matplotlib.collections.PathCollection at 0xb21c524c>

In [60]: plt.plot([y.min(),y.max()],[y.min(),y.max()],'r--')
Out[60]: [<matplotlib.lines.Line2D at 0xb21cfc6c>]

In [61]: plt.title('Normal - Quantile Plot')
Out[61]: <matplotlib.text.Text at 0xb2077a0c>

In [62]: plt.ylabel('Deviance Residuals Quantiles')
Out[62]: <matplotlib.text.Text at 0xb1ffb28c>

In [63]: plt.xlabel('Theoretical Normal Quantiles')
Out[63]: <matplotlib.text.Text at 0xb2f639ac>

In [64]: from statsmodels import graphics

In [65]: img = graphics.gofplots.qqplot(res, line='r')
[Figure: Normal Quantile Plot (glm_qqplot.png)]
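
The same plot is also available from the top-level API (equivalent, assuming statsmodels exposes qqplot there, as recent versions do):

# shortcut for the graphics.gofplots call above
img = sm.qqplot(res, line='r')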

Example of using a Gamma GLM for a proportional count response. A brief description of the data and design:

In [66]: print(sm.datasets.scotland.DESCRLONG)

This data is based on the example in Gill and describes the proportion of
voters who voted Yes to grant the Scottish Parliament taxation powers.
The data are divided into 32 council districts.  This example's explanatory
variables include the amount of council tax collected in pounds sterling as
of April 1997 per two adults before adjustments, the female percentage of
total claims for unemployment benefits as of January, 1998, the standardized
mortality rate (UK is 100), the percentage of labor force participation,
regional GDP, the percentage of children aged 5 to 15, and an interaction term
between female unemployment and the council tax.

The original source files and variable information are included in
/scotland/src/


In [67]: data2 = sm.datasets.scotland.load()

In [68]: data2.exog = sm.add_constant(data2.exog)

In [69]: glm_gamma = sm.GLM(data2.endog, data2.exog, family=sm.families.Gamma())

In [70]: glm_results = glm_gamma.fit()
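
The fitted Gamma model can be inspected the same way as the binomial results further below (output omitted here):

# summary table for the Gamma fit (output not shown)
print(glm_results.summary())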

Example of a Gaussian GLM with a noncanonical (log) link

In [71]: nobs2 = 100

In [72]: x = np.arange(nobs2)

In [73]: np.random.seed(54321)

In [74]: X = np.column_stack((x,x**2))

In [75]: X = sm.add_constant(X)

In [76]: lny = np.exp(-(.03*x + .0001*x**2 - 1.0)) + .001 * np.random.rand(nobs2)

In [77]: gauss_log = sm.GLM(lny, X, family=sm.families.Gaussian(sm.families.links.log))

In [78]: gauss_log_results = gauss_log.fit()
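
A quick sanity check compares the model's fitted mean to the simulated target (a minimal sketch; fittedvalues is the standard GLMResults attribute):

# maximum absolute gap between the fitted mean and the noisy target;
# small values indicate the log-link Gaussian fit recovers the signal
print(np.max(np.abs(gauss_log_results.fittedvalues - lny)))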

Check the summary of the binomial model results:

In [79]: print(binom_results.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:           ['y1', 'y2']   No. Observations:                  303
Model:                            GLM   Df Residuals:                      282
Model Family:                Binomial   Df Model:                           20
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -2998.6
Date:                Sat, 27 Aug 2016   Deviance:                       4078.8
Time:                        10:00:51   Pearson chi2:                 4.05e+03
No. Iterations:                     6                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1            -0.0168      0.000    -38.749      0.000        -0.018    -0.016
x2             0.0099      0.001     16.505      0.000         0.009     0.011
x3            -0.0187      0.001    -25.182      0.000        -0.020    -0.017
x4            -0.0142      0.000    -32.818      0.000        -0.015    -0.013
x5             0.2545      0.030      8.498      0.000         0.196     0.313
x6             0.2407      0.057      4.212      0.000         0.129     0.353
x7             0.0804      0.014      5.775      0.000         0.053     0.108
x8            -1.9522      0.317     -6.162      0.000        -2.573    -1.331
x9            -0.3341      0.061     -5.453      0.000        -0.454    -0.214
x10           -0.1690      0.033     -5.169      0.000        -0.233    -0.105
x11            0.0049      0.001      3.921      0.000         0.002     0.007
x12           -0.0036      0.000    -15.878      0.000        -0.004    -0.003
x13           -0.0141      0.002     -7.391      0.000        -0.018    -0.010
x14           -0.0040      0.000     -8.450      0.000        -0.005    -0.003
x15           -0.0039      0.001     -4.059      0.000        -0.006    -0.002
x16            0.0917      0.015      6.321      0.000         0.063     0.120
x17            0.0490      0.007      6.574      0.000         0.034     0.064
x18            0.0080      0.001      5.362      0.000         0.005     0.011
x19            0.0002   2.99e-05      7.428      0.000         0.000     0.000
x20           -0.0022      0.000     -6.445      0.000        -0.003    -0.002
const          2.9589      1.547      1.913      0.057        -0.073     5.990
==============================================================================
