In the spirit of my last post, I want to continue talking about some common mistakes I see among machine learning practitioners. Last time, we saw how covariate shift can be accidentally introduced by (seemingly harmlessly) applying a fit_transform to your test data. This time, I want to cover an equally egregious practice: data dredging.

What's data dredging?

Also commonly called "p-hacking," data dredging is essentially the practice of allowing your test or validation set to inform decisions around your model-building or hyper-parameter tuning. In a practical sense, it's when you repeatedly expose your holdout set to your model while continuing to make adjustments.

Most of the time I see data dredging, it's in the context of evaluating a grid search. Here's a quick problem setup:

```python
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import KFold
from sklearn.ensemble import GradientBoostingRegressor
from scipy.stats import randint
import numpy as np

# Load the data
X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Define (roughly) our hyper parameters
# (illustrative values; the original search space didn't survive formatting)
hyper = {
    'n_estimators': randint(50, 500),
    'max_depth': randint(2, 8),
    'learning_rate': np.linspace(0.01, 0.2, 20).tolist(),
}

# Define our CV class (remember to always shuffle!)
cv = KFold(shuffle=True, n_splits=3, random_state=1)

# Define our estimator
search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=42),
    scoring='neg_mean_squared_error',
    n_iter=25,
    param_distributions=hyper,
    cv=cv,
    random_state=12,
    n_jobs=4,
)

# Fit the grid search
search.fit(X_train, y_train)
```

At this point, we've fit a valid model, and we want to know how it performs. So what do most people do? They score the model against their holdout set:

```python
from sklearn.metrics import mean_squared_error

# Evaluate:
print("Test MSE: %.3f" % mean_squared_error(y_test, search.predict(X_test)))
```

But this is a dangerous practice! By introducing your holdout set too early, your design decisions may reflect what you've learned about the model's performance. Maybe you re-fit, trying more estimators or a steeper learning rate. In either case, you only did so because your model didn't perform well enough against the holdout set, and what ends up happening is that you slowly tailor the model until it scores well enough against your test set. In a sense, you end up inadvertently fitting your test set.
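So what's the alternative? A minimal sketch, reusing the objects above: make your iteration decisions from the cross-validated score alone (which never touches the test set), and spend your single look at the holdout only after every design decision is final.

```python
# Iterate using the cross-validated score only. best_score_ is the
# mean CV score of the best parameter setting; it's negated here
# because we scored with 'neg_mean_squared_error'.
print("CV MSE: %.3f" % -search.best_score_)

# ... tweak the search space, re-fit, repeat as needed ...

# Only once the model is frozen do we take our one look at the holdout:
print("Test MSE: %.3f" % mean_squared_error(y_test, search.predict(X_test)))
```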
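And if you want an honest estimate of the tuning procedure itself, again without consulting the holdout, one common remedy is nested cross-validation: wrap the (unfitted) search object in an outer CV loop, so each candidate is scored on data its inner search never saw. A rough sketch under the same setup (the outer-loop seed and fold count here are arbitrary choices of mine, not from the original post):

```python
from sklearn.model_selection import cross_val_score

# Outer loop: each fold clones and re-runs the full RandomizedSearchCV
# (the inner loop) on its training portion, then scores the winning
# model on the fold that was held out from the search entirely.
outer_cv = KFold(shuffle=True, n_splits=3, random_state=7)
nested_scores = cross_val_score(
    search, X_train, y_train,
    scoring='neg_mean_squared_error',
    cv=outer_cv,
)
print("Nested CV MSE: %.3f" % -nested_scores.mean())
```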