John Elder was next. John, a well-known data mining expert who runs Elder Research Inc., presented on Top 10 Data Dangers When Discovering Business Rules. His focus was on data mining and other analytics techniques, and he walked through his top 10 mistakes:
- Lack Data
Fraud cases, for instance, can be so rare in a dataset that it is very hard to find good rules: one government-contracting fraud project had only 12 known cases out of millions of records, so it took a lot of work to make even small progress. In contrast, a particular kind of tax fraud had a surprising number of known cases, and the team was able to develop effective rules that reduced 100 possibles to 4 likelys and so focus investigative effort. Some projects should gather data first by running a sample and seeing what happens.
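To make the rarity concrete, here is a minimal sketch (with an assumed distribution echoing the talk's 12-in-millions example) of checking class prevalence before modeling:

```python
# Hypothetical check of how rare the target class is before any modeling.
# The label counts are invented to echo the "12 cases out of millions" example.
from collections import Counter

labels = ["ok"] * 999_988 + ["fraud"] * 12
counts = Counter(labels)
prevalence = counts["fraud"] / len(labels)
print(f"fraud prevalence: {prevalence:.6%}")  # ~0.0012% -- far too rare to learn rules directly
```

A check like this, done first, tells you whether you need to gather more labeled cases before any rule discovery is worthwhile.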
- Focus on Training data
If you only care about fitting your training data, you could just use a lookup table! One trick is to include the case number as a feature and see if it shows up in the model; if it does, that typically signals a problem: perhaps the record order matters, or the sample is biased. It is really important not to overfit to the training data, so always hold some data back and check the results against it.
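The lookup-table point can be sketched in a few lines; the rows below are made up, and the "model" is literally a dictionary of memorized training cases:

```python
# A "lookup table" model memorizes training rows: perfect on training data,
# useless on anything it has not seen. Toy rows invented for illustration.
train = {(0, 1): "A", (1, 1): "B", (1, 0): "A"}
test = [((0, 0), "A"), ((1, 1), "B")]

def lookup_model(x):
    return train.get(x, "unknown")  # no generalization beyond memorized cases

train_acc = sum(lookup_model(x) == y for x, y in train.items()) / len(train)
test_acc = sum(lookup_model(x) == y for x, y in test) / len(test)
print(train_acc, test_acc)  # 1.0 on training, 0.5 on held-out data
```

The gap between the two numbers is exactly what the held-back data is there to expose.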
- Rely on one technique
At least check your new method against a basic technique. Ideally, try several different techniques and compare and merge the results; each new technique adds perhaps 5-10% to the work. At a minimum, build a 2×2 matrix with results on the training set and the evaluation set for both the new technique and the basic technique. Combining techniques works much better.
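As a sketch of the comparison idea, assuming invented scores and labels, here is a baseline versus a new technique on the same evaluation set, plus a simple blend of their scores:

```python
# Hypothetical comparison of a basic technique and a new one on an evaluation
# set, then a combination by averaging scores. All numbers are invented.
def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

labels_eval    = [1, 0, 1, 1, 0]
baseline_probs = [0.6, 0.4, 0.3, 0.7, 0.2]  # basic technique's scores
new_probs      = [0.9, 0.6, 0.8, 0.6, 0.1]  # new technique's scores

baseline_preds = [int(p > 0.5) for p in baseline_probs]
new_preds      = [int(p > 0.5) for p in new_probs]
combined       = [int((a + b) / 2 > 0.5) for a, b in zip(baseline_probs, new_probs)]

print(accuracy(baseline_preds, labels_eval))  # 0.8
print(accuracy(new_preds, labels_eval))       # 0.8
print(accuracy(combined, labels_eval))        # 1.0 -- the blend beats either alone
```

Doing the same on the training set fills out the 2×2 matrix the talk recommends.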
- Ask the Wrong Question
You must aim at the right target. Instead of looking for fraud examples, for instance, develop a model of normal behavior and look for variations from it. You also want the model to be penalized and rewarded the same way you are: standard error rates treat under- and over-errors the same, whereas people often treat them quite differently.
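The asymmetric-cost point can be illustrated with invented cost figures, where a missed fraud case (false negative) is assumed to cost far more than a false alarm (false positive):

```python
# Sketch of asymmetric error costs. The cost figures are invented; the point
# is that two models with the same error *count* can differ wildly in cost.
COST_FN = 100.0  # cost of missing a fraud case (false negative)
COST_FP = 1.0    # cost of investigating an innocent case (false positive)

def expected_cost(fn, fp):
    return fn * COST_FN + fp * COST_FP

# Two hypothetical models, each making 10 errors in total:
model_a = expected_cost(fn=8, fp=2)   # 802.0
model_b = expected_cost(fn=1, fp=9)   # 109.0
print(model_a, model_b)  # same error rate, very different real-world cost
```

A plain error rate would rank these models as equal; a cost-weighted target does not.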
- Listen (only) to the data
Data can contain noise that happens to predict well, so test and check thoroughly.
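One way to see this is purely synthetic: generate many random noise features and note that some will agree with the target on a small sample just by chance.

```python
# Noise can "predict" by chance: among many random features, some will
# correlate with a small target sample. Purely synthetic illustration.
import random

random.seed(0)
target = [random.randint(0, 1) for _ in range(20)]

best_agreement = 0.0
for _ in range(500):  # try 500 features of pure coin-flip noise
    feature = [random.randint(0, 1) for _ in range(20)]
    agreement = sum(f == t for f, t in zip(feature, target)) / 20
    best_agreement = max(best_agreement, agreement)

print(best_agreement)  # some pure-noise feature scores well -- test on fresh data!
```

On fresh data, that "best" feature would collapse back to chance, which is why the checking matters.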
- Accept Leaks from the future
Examples include survivor bias, and data that has changed since the point at which the decision was made when you are looking at historical records. You may need several passes to eliminate features that seem to work too well.
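A toy sketch of a leak, with invented records: a field that is only updated after the outcome is known "predicts" the outcome perfectly, which is itself the warning sign.

```python
# A "leak": a field updated *after* the outcome was known perfectly predicts
# it. Records and field names are invented; the leaky field would not have
# existed at decision time.
records = [
    {"amount": 120, "closed_by_fraud_team": True,  "is_fraud": True},
    {"amount": 80,  "closed_by_fraud_team": False, "is_fraud": False},
    {"amount": 95,  "closed_by_fraud_team": False, "is_fraud": False},
    {"amount": 300, "closed_by_fraud_team": True,  "is_fraud": True},
]

leaky_acc = sum(r["closed_by_fraud_team"] == r["is_fraud"] for r in records) / len(records)
print(leaky_acc)  # 1.0 -- "too good" accuracy is a red flag to audit the feature
```

This is the kind of thing the "several passes" are for: anything scoring suspiciously well gets audited for whether it postdates the decision.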
- Discount pesky cases
Outliers should be considered and explained, not discounted; the telling reaction is not “aha” but “that’s odd”. Visualization can help spot them.
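One simple way to surface pesky points for inspection, as a sketch with made-up values, is to flag anything far from the mean (a z-score check); real work would plot the data too:

```python
# Flag values far from the mean for human inspection. The data is invented;
# flagged points are candidates to *explain*, not to discard.
from statistics import mean, stdev

values = [10.1, 9.8, 10.3, 10.0, 9.9, 42.0]  # one suspicious point
mu, sigma = mean(values), stdev(values)
outliers = [v for v in values if abs(v - mu) / sigma > 2]
print(outliers)  # [42.0] -- worth explaining before any modeling
```

Whether 42.0 is a data-entry error or the most interesting case in the file is exactly the question to answer before discarding it.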
- Extrapolate
Early experiences can strongly shape your approach, and it can be hard to eliminate early factoids. Verbal explanation and team settings work well for countering this. Selective breeding is a better metaphor than creating life!
- Answer every inquiry
“Don’t know” is a valid answer for a model. Forcing an answer for every case is not always helpful, as some cases are very hard to answer accurately.
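A minimal sketch of this idea, with an assumed confidence threshold: the model abstains whenever its score falls in the uncertain middle band.

```python
# Sketch of a classifier that may answer "don't know" when confidence is low.
# The threshold and scores are invented for illustration.
def classify(score, threshold=0.8):
    if score >= threshold:
        return "fraud"
    if score <= 1 - threshold:
        return "ok"
    return "don't know"  # abstain rather than guess

print(classify(0.95))  # fraud
print(classify(0.10))  # ok
print(classify(0.55))  # don't know
```

The cases that get handed to a human are precisely the ones the model would otherwise have answered least accurately.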
- Believe the best model
Sometimes you have to be able to explain the model, and that constrains which models you can use. Often, though, you only need accuracy, not explicability. Multiple models typically combine to produce the best answer.
He pointed out that there are many tools but a core set of techniques. He recommends Decision Trees and Nearest Neighbor (less accurate, but with clean cut-offs), Neural Networks, Kernel methods, and Delaunay Triangles (nonlinear or smooth-curve techniques), along with, of course, the old standby of Regression. He also noted that the prediction should be managed separately from the actions and their consequences.
This was a great overview of the questions to ask of someone who is doing data mining for you!