Sue Gonella presented on efficiencies in building predictive scorecards. In particular, she compared the use of sampled data against using all records in a model development exercise.
Rather than using all records, she advocated stratified random sampling, where a sample from each group of interest is used to build and validate the models. This improves turn-around times and makes experimentation easier. She demonstrated that predictive power is comparable when roughly 10,000 records per performance group are used, so there is no loss of accuracy if this is done right.
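The talk did not include code, but stratified random sampling of this kind is easy to sketch in pandas. The column names, group labels, and the 10,000-per-group size below are illustrative assumptions, not details from the presentation:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a full modelling dataset (hypothetical columns)
rng = np.random.default_rng(0)
n = 1_000_000
df = pd.DataFrame({
    "score_var": rng.normal(size=n),
    "performance_group": rng.choice(
        ["good", "bad", "indeterminate"], size=n, p=[0.80, 0.15, 0.05]
    ),
})

# Stratified random sample: 10,000 records from each performance group
sample = df.groupby("performance_group", group_keys=False).sample(
    n=10_000, random_state=42
)

print(sample["performance_group"].value_counts())
```

Model development then proceeds against `sample` (30,000 rows here) rather than the full million-row dataset, which is what drives the turn-around-time savings she described.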
She walked through an example showing that, for the same model performance, she could save more than 99% of the time involved. This enabled much more experimentation, as most changes to the model made little or no difference to the time taken when only 10,000 sample records were being manipulated (whereas the same changes would have made the full dataset run even slower). Similarly, demoting predictors that make zero contribution, so that they don't affect subsequent iterations, improves performance further with little or no impact on predictive power.
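The idea of demoting zero-contribution predictors can also be sketched. The presentation did not specify a contribution measure, so the snippet below uses a simple absolute correlation with the target as a stand-in; the predictor names, threshold, and synthetic data are all illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Synthetic data: two predictors drive the outcome, two are pure noise
rng = np.random.default_rng(1)
n = 10_000
X = pd.DataFrame({
    "useful_1": rng.normal(size=n),
    "useful_2": rng.normal(size=n),
    "noise_1": rng.normal(size=n),
    "noise_2": rng.normal(size=n),
})
y = (X["useful_1"] + 0.5 * X["useful_2"]
     + rng.normal(scale=0.5, size=n) > 0).astype(int)

# Crude contribution measure: absolute correlation with the target
contribution = X.apply(lambda col: abs(np.corrcoef(col, y)[0, 1]))

# Demote predictors whose contribution is effectively zero so they
# no longer slow down subsequent modelling iterations
threshold = 0.05
kept = contribution[contribution >= threshold].index.tolist()
print("kept predictors:", kept)
```

In a real scorecard workflow the contribution measure would more likely be information value or a model-based importance score, but the principle is the same: compute it once, then drop the zero-contribution predictors from every later iteration.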
Clearly, taking these steps – stratified random sampling and the elimination of zero-contribution predictors – makes for MUCH faster iteration in model development and thus better models. She also pointed out that, even if you are required to use all records in the final model, you can do much of the development work with the sample data to improve performance and so allow many more iterations.