
First Look: DataRobot


DataRobot focuses on automated machine learning and on helping customers build an AI-driven business, especially through decisions that can be automated using machine learning and other AI technologies. DataRobot was founded in 2012 and currently has nearly 300 staff, including 150+ data scientists. Since its founding, well over 200M models have been built on the DataRobot cloud.

DataRobot’s core value proposition is that it can shorten the time needed to build and deploy custom machine learning models, deliver great accuracy “out of the box” and provide a simple UI that lets business analysts leverage machine learning without being data scientists. The technology can be used both to make data scientists more productive and to increase the range of people who can solve data science problems.

DataRobot runs either on AWS or on a customer’s hardware. Modeling-ready datasets can be loaded from ODBC databases, Hadoop, URLs or local files, and partnerships with companies like Alteryx support data preparation, blending etc. The software then automatically performs the kind of data transformations needed to make machine learning work: data cleansing and the feature engineering that the various machine learning algorithms require, such as scaling and converting data to match each algorithm. It does not currently generate domain-specific potential features/characteristics from raw data, instead making it easy for data and business analysts to create them and feed them into the modeling environment. Once data is loaded, some basic descriptive statistics are displayed and the tool recommends a measurement approach (used to select between algorithms) based on the kind of data/target.
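To make "scaling and converting data" concrete, here is a minimal sketch of two common transformations of that kind: standardizing a numeric column and one-hot encoding a categorical one. It is a generic illustration, not DataRobot's implementation; the function names and the sample columns are made up for the example.

```python
from statistics import mean, pstdev


def standardize(values):
    """Scale numeric values to zero mean and unit variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Convert a categorical column into one 0/1 indicator column per category."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}


# Hypothetical sample columns, purely for illustration.
ages = [20, 30, 40]
regions = ["north", "south", "north"]

scaled_ages = standardize(ages)
region_columns = one_hot(regions)
```

Which transformations are appropriate depends on the algorithm: distance-based and linear methods usually benefit from scaling, while tree-based methods often do not need it.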

DataRobot can apply a wide variety of machine learning algorithms to these datasets, for now almost exclusively supervised learning techniques where a specific target is selected by the user. Multiple algorithms are run, and DataRobot partitions the data automatically to keep holdout data for validation (to guard against overfitting), applies smart downsampling to improve the accuracy of algorithms and allows some other advanced parameters to be configured for specific kinds of data. Once started, DataRobot looks at the target variable, the dataset, its characteristics and combinations of characteristics, and selects a set of machine learning algorithms/configurations (blueprints) to run. These are then trained, and more “workers” can be configured to shorten the time to completion, essentially spinning up more capacity for a specific job.
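The idea of setting aside holdout data can be sketched in a few lines. This is a generic random split, assuming a 20% holdout fraction and a fixed seed chosen only for the example; how DataRobot actually partitions data is not described here.

```python
import random


def split_holdout(rows, holdout_fraction=0.2, seed=0):
    """Shuffle the rows and reserve a fraction that is never used for training."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = rows[:]         # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Hypothetical dataset: 100 row identifiers.
rows = list(range(100))
training, holdout = split_holdout(rows)
```

Because the holdout rows play no part in training, a model that scores well on the training rows but poorly on the holdout is likely overfitting.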

As the algorithms complete, the results are displayed on a leaderboard ranked by the measurement approach selected. DataRobot speeds this process by initially running the blueprints against only a subset of the data and then running the top ones against the full dataset. Users who are data scientists can investigate the blueprints and see exactly the approach each one takes in terms of algorithm configuration, data transformations etc. Key drivers (the features that make the most difference) are identified and a set of reason codes is generated for each entry in the dataset. Several other descriptive elements, such as word clouds for text analytics, are also generated to allow models to be investigated.
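The two-stage approach (score everything on a subset, then rerun only the leaders on the full data) can be sketched as follows. The "blueprints" here are toy threshold rules and the metric is plain accuracy; both are assumptions for illustration and do not reflect how DataRobot's blueprints or metrics work.

```python
def accuracy(predict, rows):
    """Fraction of (x, label) rows that the predictor gets right."""
    return sum(1 for x, label in rows if predict(x) == label) / len(rows)


def build_leaderboard(blueprints, rows, sample_size, keep_top):
    """Rank all candidates on a sample, then rescore the finalists on all rows."""
    sample = rows[:sample_size]  # assumes rows are already in random order
    first_round = sorted(
        blueprints, key=lambda name: accuracy(blueprints[name], sample), reverse=True
    )
    finalists = first_round[:keep_top]
    scored = [(name, accuracy(blueprints[name], rows)) for name in finalists]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy data: the label is 1 when x >= 5. Rows are interleaved to mimic shuffling.
rows = [(x, 1 if x >= 5 else 0) for x in [0, 9, 1, 8, 2, 7, 3, 6, 4, 5]]

# Hypothetical candidates standing in for blueprints.
blueprints = {
    "always_zero": lambda x: 0,
    "threshold_3": lambda x: 1 if x >= 3 else 0,
    "threshold_5": lambda x: 1 if x >= 5 else 0,
}

leaderboard = build_leaderboard(blueprints, rows, sample_size=6, keep_top=2)
```

In this toy run the two threshold rules tie on the sample and are only separated once they are scored on the full data, which is the reason for the second pass.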

The tool also has a UI for non-technical users. This skips the display of the leaderboard and internal status information and displays just a summary of the best model with its confusion matrix, lift and key drivers. A word cloud for text fields and point-and-click deployment of a scoring UI (for batch scoring of a data file or scoring a single hand-entered record) complete the process. More advanced users can interact with the same projects, allowing the full range of deployment and reuse for projects created this way.
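For readers unfamiliar with the summary measures mentioned above, here is a minimal sketch of how a confusion matrix and lift are typically computed for a binary model. The data, the 0.5 cut-off and the function names are assumptions for the example; the exact definitions DataRobot displays may differ.

```python
def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for 0/1 labels."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == 1:
            counts["tp" if a == 1 else "fp"] += 1
        else:
            counts["fn" if a == 1 else "tn"] += 1
    return counts


def lift_at(actual, scores, fraction):
    """Positive rate in the top-scored fraction divided by the overall positive rate."""
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(len(ranked) * fraction))
    top_rate = sum(a for _, a in ranked[:n_top]) / n_top
    base_rate = sum(actual) / len(actual)
    return top_rate / base_rate


# Hypothetical outcomes and model scores for ten records.
actual = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.6, 0.2, 0.1, 0.35, 0.05]
predicted = [1 if s >= 0.5 else 0 for s in scores]

matrix = confusion_matrix(actual, predicted)
lift_top_half = lift_at(actual, scores, fraction=0.5)
```

A lift above 1 means the records the model ranks highest contain more positives than a random selection of the same size would.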

Once a model is done, the best way to deploy it is to use the DataRobot API. A REST API endpoint is generated for each model and can be used to score a record. All the fields used in the sample are used to create the REST API, and the results come back with the reason codes generated. Everything to do with modeling is also available through an API, allowing customers to build applications that rebuild and monitor models. Users can also generate code for models, but this is discouraged.
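Scoring a single record against a REST endpoint might look roughly like the sketch below. Everything specific in it is a placeholder: the URL, the bearer-token header, the JSON payload shape and the field names are assumptions for illustration, not DataRobot's documented API, so consult the vendor documentation for the real values. The code only builds the request and does not send it.

```python
import json
import urllib.request

# Placeholder values, NOT real DataRobot endpoints or credentials.
ENDPOINT_URL = "https://example.invalid/models/MODEL_ID/predict"
API_TOKEN = "YOUR_API_TOKEN"


def build_scoring_request(record):
    """Build a POST request that carries one record as a JSON array."""
    body = json.dumps([record]).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + API_TOKEN,  # assumed auth scheme
        },
        method="POST",
    )


# Hypothetical record; a real one would carry the fields the model was trained on.
request = build_scoring_request({"age": 42, "region": "north"})

# Sending is left to the caller, for example:
#   with urllib.request.urlopen(request) as response:
#       result = json.load(response)
```

The response would then be expected to carry the score together with its reason codes, though the exact response format is not described here.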

You can get more information on DataRobot at http://datarobot.com