Salford Systems was founded in 1983 and initially developed add-on procedures for SAS. The company then worked with a PC-based statistics package before moving solidly into data mining and predictive analytics after meeting the inventors of the CART algorithm around 1990. The product range now consists of CART®, TreeNet® stochastic gradient boosting, MARS®, RandomForests®, GPS™ (Generalized PathSeeker), ISLE™ (Importance Sampled Learning Ensembles) and the RuleFit™ rule extraction engine, all of which are also part of the Salford Data Mining Suite along with a Scoring Server. Salford Systems has a long history of awards and honors from the KDD Cup and other events for its software, all of which is based on code and algorithms available only from Salford Systems.
Salford Systems aims to deliver a number of key features in the suite, including a high degree of automation, minimal need for data preparation and automated selection of predictive data elements – all tasks that chew up a lot of time in a typical data mining or predictive analytics modeling effort. The tools aim to be very robust in the face of outliers and dirty data, automatically handle missing values and offer powerful scripting with complete audit trails. The latest version of the tools is 64-bit and multi-threaded, supports multi-core machines, and a single server can handle 1TB+ of data. The tool has a classic Windows look and feel and runs as a pure client or as a server installation.
Taking the CART tool as an example, data can be loaded using ODBC or any of 50+ commonly used file formats. Data can be graphed, analyzed and viewed. Modelers can interact with the data (eliminating variables they don't want to include, for instance) or can simply select a dependent variable they wish to predict and kick off the tool. In the case of CART, the algorithm builds a decision tree automatically until it is larger than is useful and then prunes it back to the optimum result. Beyond the core algorithm, the tool allows the modeler to automatically select those trees that are sufficiently similar to the optimum and then use the simplest tree within that set. Trees can also be analyzed to see how performance and lift line up between training and test data, flagging nodes whose direction or ranking differs to reveal overfitting, for instance. TreeNet uses a different approach: it grows a small conventional tree, grows a second tree based on the residuals from the first, a third based on the residuals from the first two, and so on. Many, many such trees can be built, and each tree contains terminal nodes that contribute positively or negatively to the prediction. The final prediction is the sum of the contributions from the terminal node reached in each tree. These more complex tree approaches can boost predictive power as well as identify small segments that behave very differently from others. You can generate rules from TreeNet trees and pick out the rules that are most interesting, controlling the complexity of the rules through the complexity of the trees.
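To make the grow-then-prune sequence concrete, here is a minimal sketch of the same idea using scikit-learn's cost-complexity pruning as a stand-in for Salford's proprietary CART implementation. The one-standard-error selection mirrors "use the simplest tree within the set of near-optimal trees"; the cross-validation and random-state settings are illustrative assumptions, not Salford defaults.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def grow_and_prune(X, y):
    """Grow an oversized tree, then prune it back with cost-complexity
    pruning, choosing the simplest tree whose cross-validated score is
    within one standard error of the best."""
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
    results = []
    for alpha in path.ccp_alphas:
        scores = cross_val_score(
            DecisionTreeClassifier(random_state=0, ccp_alpha=alpha), X, y, cv=5)
        results.append((alpha, scores.mean(), scores.std() / np.sqrt(len(scores))))
    _, best_mean, best_se = max(results, key=lambda r: r[1])
    # Largest alpha (that is, the simplest tree) still within one SE of the optimum.
    alpha = max(a for a, m, _ in results if m >= best_mean - best_se)
    return DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
```

And a similarly hedged sketch of the residual-boosting idea behind TreeNet; the learning rate, tree depth and tree count here are again illustrative choices, not Salford's.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, n_trees=200, depth=3, rate=0.1):
    """Fit a sequence of small trees, each trained on the residuals of
    the ensemble built so far."""
    baseline = y.mean()
    prediction = np.full(len(y), baseline)
    trees = []
    for _ in range(n_trees):
        residuals = y - prediction                    # what is still unexplained
        tree = DecisionTreeRegressor(max_depth=depth).fit(X, residuals)
        prediction += rate * tree.predict(X)          # nudge toward the target
        trees.append(tree)
    return baseline, trees

def score(baseline, trees, X, rate=0.1):
    """The final prediction sums the contribution of the terminal node
    reached in each tree, on top of the baseline."""
    return baseline + rate * sum(tree.predict(X) for tree in trees)
```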
Normally, users of SPM have organized and prepared their data for analysis in data management systems dedicated to the task, such as SQL Server. However, a moderate amount of data preparation and manipulation can be accomplished within SPM using its built-in programming module. The data preparation module offers a standard version of the BASIC language, permitting the creation of new columns of data using simple or complex transformations. Users have a collection of about 50 operators and can reference data across multiple rows for aggregation. These features are intended to let users modify their data conveniently on the fly, in the midst of an analysis session, without having to go back to their database management systems every time.
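Salford's BASIC dialect is proprietary, so rather than guess at its syntax, here is the same style of on-the-fly column derivation sketched in pandas. The column names are hypothetical, and the two operations – a single-row transformation and an aggregation that references other rows – stand in for what the module's operators would express.

```python
import numpy as np
import pandas as pd

# Hypothetical analysis data; the column names are illustrative only.
df = pd.DataFrame({
    "INCOME": [42000.0, 87000.0, 15000.0, 61000.0],
    "AGE":    [34, 51, 23, 45],
    "REGION": ["N", "S", "N", "S"],
})

# A new column from a simple transformation of an existing one.
df["LOG_INCOME"] = np.log(df["INCOME"])

# An aggregation referencing data across multiple rows: each customer's
# income relative to the mean income of their region.
df["INCOME_VS_REGION"] = df["INCOME"] / df.groupby("REGION")["INCOME"].transform("mean")
```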
All the tools include a range of automated model-improvement tools called the Battery. These allow models to be rebuilt in various ways to improve results. In CART, for instance, the tool can repeatedly eliminate the least predictive variables used in the tree, specify the minimum size of a terminal node (how many customers must be in each segment, for instance), run against multiple test data sets to see if the results vary (different variables showing up as most predictive, for instance), penalize variables with lots of missing values, and much more. The mindset of these tools is to take the steps an experienced modeler would follow and perform them automatically, systematically and completely. Each product has a range of similar improvement tools in its own Battery. All the Salford Systems tools are aimed at a relatively sophisticated user – someone who understands how the supported techniques improve the model. None of the tools require binning, and all allow modelers to work with their original data. If you do want to bin data, however, the new tools support six different ways to automatically bin and code data.
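As an illustration of the first of those Battery operations – repeatedly dropping the least predictive variable and refitting – here is a minimal sketch using scikit-learn feature importances as a stand-in; the tree depth and cross-validation settings are assumptions for the example, not Salford's.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def shave(X, y, names, min_vars=2):
    """Repeatedly drop the least important variable and refit,
    recording the cross-validated score of each reduced model."""
    keep = list(names)
    history = []
    while len(keep) >= min_vars:
        cols = [names.index(n) for n in keep]
        tree = DecisionTreeClassifier(max_depth=4).fit(X[:, cols], y)
        score = cross_val_score(tree, X[:, cols], y, cv=5).mean()
        history.append((list(keep), score))
        keep.pop(int(np.argmin(tree.feature_importances_)))
    return history  # inspect to pick the smallest model that still scores well
```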
Models built using the tools are saved as binary files that can be managed in file systems, databases or source code control systems. These files capture the final model and metadata about the process that produced it. You can begin new work from one of these saved models, allowing models to be updated and retuned, and scripting allows this to be automated to support large numbers of models.
Deployment options include scoring datasets on demand and outputting the results to a file, updating the database that stores the original data, generating SAS, C or Java code, and outputting PMML (Predictive Model Markup Language). Salford's scoring server also supports batch and interactive scoring using the various models.
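To give a feel for the file-in, file-out deployment style, here is a hedged sketch of batch scoring in Python. The file names, the joblib serialization and the assumption that the input file contains exactly the model's predictor columns are all illustrative, not details of Salford's scoring server.

```python
import joblib
import pandas as pd

# Load a previously fitted model (assumed saved with joblib.dump).
model = joblib.load("cart_model.bin")

# Score a batch of records; the CSV is assumed to contain exactly the
# predictor columns the model was trained on.
batch = pd.read_csv("customers_to_score.csv")
batch["SCORE"] = model.predict(batch)

# Write the scored dataset back out to a file.
batch.to_csv("scored_customers.csv", index=False)
```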
There is more information on Salford Systems on their website, where you can also download the software and see some case studies. Salford Systems is also one of the vendors in our Decision Management Systems Platform Technology report.