As part of some ongoing research on support for PMML I recently spoke with Concurrent. Concurrent is an enterprise software company focused on simplifying Big Data development on Hadoop. The company’s core product is called Cascading. This is a free, open-source development framework for Apache Hadoop. It is designed to let developers build sophisticated data processing applications on top of Hadoop without requiring MapReduce knowledge, while working with the tools they already know. Cascading was released in 2007 and has been adopted by thousands of companies to bring Hadoop into their IT infrastructure. Customers range from well-known web companies to large Fortune 500 companies.
Cascading provides a concise set of Java APIs and also lets developers access Hadoop through an ANSI-compliant SQL interface. Concurrent has recently released Pattern, which extends Cascading to support PMML.
Pattern is open source software that brings PMML to Hadoop. Built on Cascading, it allows you to drop a PMML model developed in your favorite data mining or predictive analytics environment directly onto a Hadoop installation. Pattern takes the PMML models and executes them as MapReduce jobs, and a simple Java API can then be used to score new data using the model.
Pattern is focused on batch scoring: taking all the records you have stored in Hadoop and scoring them with your models. The scored data can be stored in Hadoop or pushed to another environment, such as a database or data warehouse, to support more interactive applications. The APIs involved are just like the regular Cascading APIs, making it easy for teams familiar with Cascading to adopt Pattern as well. Because Cascading supports both SQL and Java, it is also easy to use it for the pre-processing a typical model requires: teams can write code that handles the pre-processing at scale and feeds the resulting data into the PMML model for scoring.
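To make the pre-process-then-score flow concrete, here is a minimal sketch in plain Python. It does not use the Cascading or Pattern APIs. The model parameters, field names, default values and function names are all hypothetical, standing in for the intercept and coefficients that a PMML regression model would carry.

```python
from typing import Dict, Iterable, Iterator, Optional

# Hypothetical stand-in for a PMML regression model: an intercept plus one
# coefficient per input field. A real deployment would read these from PMML.
MODEL = {"intercept": 1.0, "coefficients": {"age": 0.5, "income": 0.25}}

# Hypothetical defaults used by the hand-written pre-processing step.
DEFAULTS = {"age": 30.0, "income": 4.0}


def preprocess(record: Dict[str, Optional[float]],
               defaults: Dict[str, float]) -> Dict[str, float]:
    """Replace missing (absent or None) field values with their defaults."""
    clean = {}
    for field, default in defaults.items():
        value = record.get(field)
        clean[field] = default if value is None else value
    return clean


def score(record: Dict[str, float], model: dict) -> float:
    """Apply the linear model: intercept + sum(coefficient * field value)."""
    total = model["intercept"]
    for field, coefficient in model["coefficients"].items():
        total += coefficient * record[field]
    return total


def score_batch(records: Iterable[Dict[str, Optional[float]]],
                model: dict,
                defaults: Dict[str, float]) -> Iterator[Dict[str, float]]:
    """Pre-process and score every stored record, one at a time."""
    for raw in records:
        clean = preprocess(raw, defaults)
        yield {**clean, "score": score(clean, model)}


if __name__ == "__main__":
    stored = [{"age": 40.0, "income": 8.0}, {"age": 40.0, "income": None}]
    for row in score_batch(stored, MODEL, DEFAULTS):
        print(row)
```

The sketch keeps pre-processing separate from scoring. That mirrors the division described above, where the team writes the pre-processing code and the model definition drives the scoring step.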
The current release supports most, though not quite all, of the core PMML scoring model types. It does not support the pre-processing added in PMML 4.0 (most teams still prefer to code their own pre-processing and use Cascading to execute it). Data scientists, data analysts, and developers can now quickly (in a matter of hours) operationalize their predictive models on Hadoop and deploy them at scale.