Part of the series Standards in Predictive Analytics
In this series so far we have discussed a number of well-established standards: R, PMML and Hadoop. Some future developments are also worth considering, specifically the emergence of the Decision Model and Notation (DMN) standard, growing acceptance of Hadoop 2 and planned updates to PMML.
As regular readers of this blog know, the Object Management Group recently accepted the Decision Model and Notation standard as a beta specification for finalization in 2014. DMN provides a common modeling notation, understandable by both business and technical users, that allows decision-making approaches to be precisely defined. These decision models can include the specification of detailed decision logic and can model the role of predictive analytics at a requirements level and at an implementation level through the inclusion of PMML models as functions. DMN will allow the modeling of automated and manual decisions in a way that shows how business rules and predictive analytics are being combined. It can also be used to describe the requirements for predictive analytic models. If you are working in predictive analytics, DMN is definitely worth a look.
Hadoop 2 – technically Apache Hadoop 2.2.0 – was released last October. Most Hadoop users are not using it yet, though, so it still counts as a future development. Hadoop 2 is really all about YARN, a resource management system that manages load across Hadoop nodes and supports processing approaches other than MapReduce. This allows Hadoop resources to be shared across jobs that use different approaches and, in particular, will allow more real-time processing. For instance, new engines that support SQL, stream processing and graph processing can be, and are being, developed and integrated into the Hadoop stack using this approach. This sets the stage for vendors to address some of the known limitations of Hadoop when it comes to predictive analytics.
Finally, PMML 4.2 is expected in the first half of 2014. As with 4.1, it is expected to improve support for post-processing, model types and model elements. Release 4.2 is particularly focused on improving support for predictive scorecards (especially those with complex partial scores), adding regular expressions as built-in functions, and continuing to expand support for different types, such as continuous input fields in Naïve Bayes models. These incremental updates to PMML are important because they make it harder to argue that PMML does not support one’s favorite model type or configuration option, which helps broaden support and adoption.
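For readers who have not worked with scorecards, here is a minimal Python sketch of the general idea behind partial scores. It is not PMML and not taken from the 4.2 specification. The class names, the `score` function and the example characteristics are all hypothetical, and the "complex" partial score is simply modeled as a function of the input record rather than a constant.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union

Record = Dict[str, float]
# A partial score is either a constant or, as a stand-in for a "complex"
# partial score, a function computed from the input record.
PartialScore = Union[float, Callable[[Record], float]]


@dataclass
class Attribute:
    matches: Callable[[Record], bool]  # predicate that selects this attribute
    partial_score: PartialScore


@dataclass
class Characteristic:
    name: str
    attributes: List[Attribute]


def score(record: Record, initial_score: float,
          characteristics: List[Characteristic]) -> float:
    """Sum the initial score and one partial score per characteristic.

    Assumption of this sketch: the first matching attribute wins, and a
    characteristic with no matching attribute contributes nothing.
    """
    total = initial_score
    for characteristic in characteristics:
        for attribute in characteristic.attributes:
            if attribute.matches(record):
                ps = attribute.partial_score
                total += ps(record) if callable(ps) else ps
                break
    return total


# Hypothetical scorecard with two characteristics.
CHARACTERISTICS = [
    Characteristic("age", [
        Attribute(lambda r: r["age"] < 30, 10.0),
        Attribute(lambda r: r["age"] >= 30, 25.0),
    ]),
    Characteristic("income", [
        Attribute(lambda r: r["income"] < 1000, 5.0),
        # Computed partial score: depends on the actual income value.
        Attribute(lambda r: r["income"] >= 1000,
                  lambda r: 0.01 * r["income"] + 3.0),
    ]),
]

if __name__ == "__main__":
    print(score({"age": 35, "income": 2000}, 100.0, CHARACTERISTICS))
```

The point of the computed case is that a partial score no longer has to be a fixed number per bin. It can vary with the input value, which is the kind of flexibility the improved scorecard support is aimed at.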
Next week I will wrap up the series and post a new white paper, sponsored by the Data Mining Group, Revolution Analytics and Zementis.