First Look: Dataiku 2.0


I last got an update from Dataiku in November of 2014. Since then the company has raised money and opened an office in New York. New features and capabilities have been added to the product, and Dataiku is seeing good interest from US customers as it expands here.

The 2.0 version has been available since April of this year and has a significantly redesigned user interface. Much has not changed: projects still have a visual workflow to describe what’s going on, but it has been streamlined to make it easier to read. There is also a redesigned search and a sidebar palette of activities, including support for specific data manipulation activities and integration of Python and R.

Data can still be read from Hadoop, SQL and NoSQL databases. The interactive editor for datasets is largely unchanged, providing a rich set of tools for handling missing or bad data, fixing problems in the data, suggesting changes based on analysis of the data, etc. The user interface has been updated and some new transformations have been added, such as extracting numeric values from text columns.
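
To give a feel for what a transformation like extracting numeric values from text columns involves, here is a minimal Python sketch. It is not Dataiku's implementation or API; the `extract_number` helper, the regular expression and the sample column are all assumptions made up for illustration.

```python
import re

# Assumed pattern: optional sign, digits, optional decimal part.
_NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")


def extract_number(text):
    """Return the first number found in text as a float, or None if there is none."""
    if text is None:
        return None
    match = _NUMBER.search(text)
    return float(match.group()) if match else None


# Hypothetical text column with units, noise and missing values.
column = ["12 kg", "approx. 3.5 m", "n/a", None, "-7 degrees"]
cleaned = [extract_number(value) for value in column]
print(cleaned)
```

Values with no recognizable number come back as `None`, so they can then be treated like any other missing data.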

New in this release, you can create predictive models from this same data environment, allowing a modeler to move back and forth more easily between the data and the model development environment. Dataiku packages algorithms from various third parties, trying to offer users the best algorithms from the open source environment. It adds some expertise on parameter setting, training approaches, etc. to make these algorithms easier to use while still exposing some of these settings for customization. The tool holds out validation sets, etc., and as part of a general update in its results reporting now also offers K-fold cross testing (to test the variability of the scores being developed).
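
For readers unfamiliar with K-fold cross testing, the idea can be sketched in a few lines of plain Python: split the rows into K folds, hold each fold out in turn, train on the rest, and look at how much the K scores vary. This is a generic illustration, not how Dataiku implements it; the function names, the toy "predict the mean" model and the sample data are all assumptions.

```python
import random
import statistics


def k_fold_indices(n_rows, k, seed=0):
    """Shuffle row indices and deal them into k disjoint folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_test(rows, k, fit, score):
    """Train on k-1 folds, score on the held-out fold, and repeat for every fold."""
    scores = []
    for held_out in k_fold_indices(len(rows), k):
        held = set(held_out)
        train = [row for i, row in enumerate(rows) if i not in held]
        test = [rows[i] for i in held_out]
        scores.append(score(fit(train), test))
    return scores


# Toy model (assumption): always predict the mean target seen in training.
def fit_mean(train):
    return statistics.mean(y for _, y in train)


def mean_abs_error(model, test):
    return statistics.mean(abs(y - model) for _, y in test)


rows = [(x, 2.0 * x + 1.0) for x in range(20)]  # hypothetical (feature, target) pairs
scores = cross_test(rows, k=5, fit=fit_mean, score=mean_abs_error)
spread = statistics.stdev(scores)  # variability of the score across folds
print(scores, spread)
```

A small spread suggests the score is stable across different held-out samples; a large one suggests the reported score depends heavily on which rows happened to be held out.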

The resulting project can be executed on Hadoop as MapReduce jobs, and in 2.1 the model too will be executable using Spark and MLlib. Jobs can be scheduled so that data scientists and analysts can manage deployment of their projects. Most projects are still deployed in batch, but Dataiku also allows the project to be generated as code for deployment into a transactional environment. Future plans include a server for deploying projects for real-time scoring against production databases (which may not be the same data environment as that used for training).

More information on the product can be found here.
