machine learning

Incorrect feature preprocessing is a source of train-test leakage

Feature selection should be done after the train-test split, so that no information from the test set leaks into the training pipeline. For the same reason, feature selection should be done within each fold of cross-validation, not once beforehand. This sounds obvious, but it goes wrong easily and often. Especially when the feature …
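To make the point about per-fold feature selection concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the post's own code: the data are synthetic pure noise, and the helper names (`make_noise_data`, `select_top_k`, `nearest_centroid_predict`, `cv_accuracy`), the mean-difference ranking, and the nearest-centroid classifier are all hypothetical choices made for brevity.

```python
import random


def make_noise_data(n_samples=40, n_features=300, seed=0):
    """Gaussian noise features with shuffled binary labels (assumed toy data):
    by construction, no feature is truly predictive."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]
    rng.shuffle(y)
    return X, y


def select_top_k(X, y, rows, k):
    """Rank features by absolute difference of class means, using ONLY `rows`."""
    pos = [i for i in rows if y[i] == 1]
    neg = [i for i in rows if y[i] == 0]

    def score(j):
        mean_pos = sum(X[i][j] for i in pos) / len(pos)
        mean_neg = sum(X[i][j] for i in neg) / len(neg)
        return abs(mean_pos - mean_neg)

    return sorted(range(len(X[0])), key=score, reverse=True)[:k]


def nearest_centroid_predict(X, y, train, test, feats):
    """Fit class centroids on the training rows, then classify the test rows."""
    centroids = {}
    for label in (0, 1):
        members = [i for i in train if y[i] == label]
        centroids[label] = [sum(X[i][j] for i in members) / len(members) for j in feats]
    preds = []
    for i in test:
        dist = {
            label: sum((X[i][j] - c) ** 2 for j, c in zip(feats, centroids[label]))
            for label in (0, 1)
        }
        preds.append(min(dist, key=dist.get))
    return preds


def cv_accuracy(X, y, k, n_folds=5, select_inside_fold=True):
    """Cross-validated accuracy, selecting features per fold or once up front."""
    n = len(X)
    all_rows = list(range(n))
    folds = [all_rows[f::n_folds] for f in range(n_folds)]
    # Leaky variant: features are chosen once, using rows that are later tested on.
    leaky_feats = None if select_inside_fold else select_top_k(X, y, all_rows, k)
    correct = 0
    for test in folds:
        held_out = set(test)
        train = [i for i in all_rows if i not in held_out]
        # Correct variant: selection sees only this fold's training rows.
        feats = select_top_k(X, y, train, k) if select_inside_fold else leaky_feats
        preds = nearest_centroid_predict(X, y, train, test, feats)
        correct += sum(p == y[i] for p, i in zip(preds, test))
    return correct / n


X, y = make_noise_data()
leaky = cv_accuracy(X, y, k=10, select_inside_fold=False)
honest = cv_accuracy(X, y, k=10, select_inside_fold=True)
print(f"selection before cross-validation (leaky): {leaky:.2f}")
print(f"selection inside each fold:                {honest:.2f}")
```

Because the labels are random, an unbiased estimate should sit near chance (0.5). Selecting features on all rows first typically pushes the cross-validated accuracy well above that, which is exactly the leakage described above. In practice, a library pipeline such as scikit-learn's `Pipeline` passed to its cross-validation helpers achieves the per-fold behaviour without hand-written loops.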
