
First Look: Dataiku’s Data Science Studio


Dataiku, a company founded in 2013 and based in France, launched its product, Data Science Studio (DSS), in February 2014. DSS is a web-based analytic software platform designed for data scientists and analysts. It aims to improve the effectiveness and productivity of data teams, especially when it comes to turning raw data into an analytic API, and it focuses on creating and running analytic applications in a production environment. A few hundred people are using the product, which comes in both a free version, the Community Edition, and a subscription-based version, the Enterprise Edition. More than 15 companies are using the Enterprise Edition for production problems.

The product is web-based and designed to support the loading, preparation, analysis, and deployment of analytic models in a collaborative environment. When users sign in to their DSS company account, the initial environment displays a set of projects as tiles. In each project, the user's work with a set (or sets) of data is saved as a visual workflow. These workflows run from integration through cleansing and analysis to modeling (though there are some visualization and display options as well).

The data preparation elements of the tool allow a user to access data in a wide variety of formats: uploaded files (including Excel and CSV), Hadoop, relational and NoSQL databases, fabrics such as Cascading, and web services. The tool helps the user visually discover the kind of data involved by profiling it and by providing a set of automated tools to detect and apply typical data cleansing and enriching activities (geolocating IP addresses, parsing dates in text fields, merging datasets, etc.). The user can apply these transformations interactively, and the tool also recognizes obvious cleansing and enriching activities that the user can choose to apply automatically. The user can also inject his or her own code for a very specific cleaning or enriching feature. All the integration and cleaning activities can be saved as a recipe that can be included in the workflow and reused. Additional steps like joins, custom formulas, etc. can also be defined and saved to the workflow.
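To give a feel for the kind of user-written cleaning code mentioned above, here is a minimal sketch in plain Python of one of the listed activities, parsing dates held in text fields. It is not DSS's API: the function names (`parse_date`, `clean_rows`), the list of candidate formats, and the sample rows are all assumptions made for illustration.

```python
from datetime import datetime

# Candidate formats are an assumption for this illustration; a real
# dataset would need its own list, ordered by how likely each format is.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%b %d, %Y")


def parse_date(text):
    """Return an ISO date string, or None if no candidate format matches."""
    cleaned = text.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue  # try the next candidate format
    return None


def clean_rows(rows, column):
    """Return copies of the rows with `column` normalised to ISO dates."""
    return [{**row, column: parse_date(row[column])} for row in rows]


# Hypothetical sample data with the same date written three ways.
rows = [
    {"id": 1, "signup": "2014-02-03"},
    {"id": 2, "signup": " 03/02/2014 "},
    {"id": 3, "signup": "Feb 3, 2014"},
    {"id": 4, "signup": "unknown"},
]
cleaned = clean_rows(rows, "signup")
```

Values that match none of the formats become `None` rather than raising, so a later step can decide whether to drop or repair those rows.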

Once the data is ready, the workbench allows the user to develop a predictive model using a wide variety of machine learning algorithms. Thanks to the product’s connectors to analytical machine learning frameworks, the user can fine-tune the algorithms with a visual editor to build optimal models. Multiple results can be developed in parallel and compared to each other in order to see which one(s) yield the best results. A recipe – workflow – can be saved to create the predictions based on the selected approach. Additional workflows can be generated to periodically re-train the model. In this latter case the tool automatically identifies the date segmentation used in the model to see what new data should be considered.
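The idea of building several candidates and comparing them can be sketched in a few lines of plain Python. This is only an illustration of the general technique, not how DSS does it: the threshold "models", the `visits` feature, the holdout examples, and the use of accuracy as the comparison metric are all assumptions.

```python
def accuracy(model, examples):
    """Fraction of (features, label) pairs the model predicts correctly."""
    correct = sum(1 for features, label in examples if model(features) == label)
    return correct / len(examples)


def threshold_model(cutoff):
    """Hypothetical model: predict True when `visits` reaches the cutoff."""
    return lambda features: features["visits"] >= cutoff


# Candidate models to compare; names and cutoffs are illustrative only.
candidates = {
    "visits>=3": threshold_model(3),
    "visits>=5": threshold_model(5),
    "always_true": lambda features: True,
}

# Hypothetical held-out examples: (features, true label).
holdout = [
    ({"visits": 1}, False),
    ({"visits": 2}, False),
    ({"visits": 4}, True),
    ({"visits": 6}, True),
    ({"visits": 7}, True),
]

# Score every candidate on the same holdout set and keep the best one.
scores = {name: accuracy(model, holdout) for name, model in candidates.items()}
best = max(scores, key=scores.get)
```

Scoring every candidate against the same held-out data is what makes the comparison fair; in practice the metric would be chosen to suit the business problem.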

These workflows can be extended with custom steps that use Python, R, SQL, Hive or Pig scripts. The custom node provides an editor for each with some code validation based on the language selected. The workflows also support versioning, commenting, and sharing of datasets – the option to share and reuse recipes/workflow fragments is on the roadmap.

There is a cloud version of the product as well as an on-premise version (which can also be downloaded and tried for free).

You can get more information on Dataiku here and they will be included in a future release of our Decision Management Systems Platform Technology Report.

