Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share

I am trying to fit a multivariable linear regression on a dataset to find out how well the model explains the data. My predictors have 120 dimensions and I have 177 samples:

X.shape=(177,120), y.shape=(177,)

Using statsmodels, I get a very good R-squared of 0.76 with a Prob(F-statistic) of 0.06 which trends towards significance and indicates a good model for the data.

When I use scikit-learn's linear regression and try to compute 5-fold cross validation r2 score, I get an average r2 score of -5.06 which shows very poor generalization performance.

The two models should be exactly the same, since their training r2 scores are identical. So why are the performance evaluations from these two libraries so different? Which one should I use? I would greatly appreciate your comments on this.

Here is my code for your reference:

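    # using statsmodels:
    import statsmodels.api as sm
    X = sm.add_constant(X)  # prepend an intercept column; X is now (177, 121)
    est = sm.OLS(y, X)
    est2 = est.fit()
    print(est2.summary())  # the R-squared reported here is computed on the training data

    # using scikit-learn:
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score
    lin_reg = LinearRegression()
    lin_reg.fit(X, y)
    print('train r2 score:', lin_reg.score(X, y))
    # 5-fold cross-validation: each fold is scored on data the model was not fit to
    cv_results = cross_val_score(lin_reg, X, y, cv=5, scoring='r2')
    msg = "%s: %f (%f)" % ('r2 score', cv_results.mean(), cv_results.std())
    print(msg)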

1 Answer

The difference in R-squared arises because statsmodels evaluates the model on the training sample, while scikit-learn's cross-validation evaluates it on left-out samples.

You are most likely overfitting strongly, with 121 regressors (including the constant), only 177 observations, and no regularization or variable selection.
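To get a feel for how much in-sample R-squared this many regressors can produce on their own, here is a minimal sketch that fits OLS to pure noise with the same shape as in the question. The data, the seed, the number of repetitions, and the helper name `in_sample_r2` are all made up for illustration. Under the usual Gaussian assumptions, the expected in-sample R-squared with no real relationship at all is p / (n - 1), which is roughly 0.68 for p = 120 and n = 177.

```python
import numpy as np

def in_sample_r2(X, y):
    """R^2 of an OLS fit (with intercept), scored on the data it was fit to."""
    X1 = np.column_stack([np.ones(len(y)), X])      # add the constant column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)   # least-squares coefficients
    resid = y - X1 @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility only
n, p = 177, 120                  # same shape as in the question

r2_values = []
for _ in range(50):              # 50 repetitions is an arbitrary choice
    X_noise = rng.standard_normal((n, p))
    y_noise = rng.standard_normal(n)   # generated independently of X_noise
    r2_values.append(in_sample_r2(X_noise, y_noise))

mean_r2 = float(np.mean(r2_values))
expected_r2 = p / (n - 1)
# Adjusted R^2 corrects for the number of regressors used
adj_r2 = 1.0 - (1.0 - mean_r2) * (n - 1) / (n - p - 1)

print("mean in-sample R^2 on pure noise:", round(mean_r2, 3))
print("theoretical expectation p/(n-1): ", round(expected_r2, 3))
print("adjusted R^2 at the mean:        ", round(adj_r2, 3))
```

Seen against this baseline, a training R-squared of 0.76 is not far above what noise alone would give, so the adjusted R-squared in the statsmodels summary is probably the more informative in-sample number.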

Statsmodels reports R-squared (R2) only for the training sample; it does no cross-validation. Scikit-learn has to reduce the training sample size for cross-validation, which makes the overfitting even worse.
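The arithmetic behind that reduction can be sketched as follows, assuming the unshuffled 5-fold split that `cv=5` gives for a regressor, where the first `n % k` folds receive one extra sample. The variable names are illustrative only.

```python
n_samples, n_params, n_folds = 177, 121, 5   # 120 predictors plus the constant

# Fold sizes: the first (n_samples % n_folds) folds get one extra sample
base, extra = divmod(n_samples, n_folds)
test_sizes = [base + 1 if i < extra else base for i in range(n_folds)]
train_sizes = [n_samples - t for t in test_sizes]

# Residual degrees of freedom left after estimating all parameters
full_df = n_samples - n_params
fold_df = [t - n_params for t in train_sizes]

print("held-out fold sizes: ", test_sizes)
print("training sizes:      ", train_sizes)
print("residual df, full:   ", full_df)
print("residual df, by fold:", fold_df)
```

Each fold is therefore fit on about 141 observations for 121 parameters, which leaves only around 20 residual degrees of freedom, compared with 56 when all 177 observations are used.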

A low cross-validation score as reported by scikit-learn then means that the overfitted estimates do not generalize to the left-out data; the model is matching idiosyncratic features of the training sample.
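The gap between the two scores can be reproduced on synthetic data. The sketch below is not the asker's data: the data-generating process (5 informative predictors out of 120, noise standard deviation 3), the seed, and the `Ridge` penalty `alpha=100.0` are all assumptions chosen for illustration. It compares the training R-squared with the 5-fold cross-validated R-squared for plain OLS and for a ridge regression, as one possible form of regularization.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)            # arbitrary seed
n, p = 177, 120                           # same shape as in the question

# Hypothetical data: only the first 5 predictors carry signal
X_demo = rng.standard_normal((n, p))
true_coef = np.zeros(p)
true_coef[:5] = 1.0
y_demo = X_demo @ true_coef + 3.0 * rng.standard_normal(n)

results = {}
models = [("OLS", LinearRegression()), ("Ridge", Ridge(alpha=100.0))]
for name, model in models:
    # R^2 on the same data the model was fit to
    train_r2 = model.fit(X_demo, y_demo).score(X_demo, y_demo)
    # R^2 on held-out folds; cross_val_score refits a clone on each training split
    cv_r2 = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")
    results[name] = (train_r2, cv_r2.mean())
    print("%s: train r2 = %.3f, cv r2 = %.3f (%.3f)"
          % (name, train_r2, cv_r2.mean(), cv_r2.std()))
```

In a setup like this, the OLS training score should look respectable while its cross-validated score falls far below it, and the regularized model should give up some training fit in exchange for better held-out performance. In practice the penalty would be tuned, for example by cross-validation, rather than fixed by hand.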

