Continuing my series on standards in predictive analytics, I am going to talk first about PMML. PMML (the Predictive Model Markup Language) is an XML standard for the interchange of predictive analytic models, developed by the Data Mining Group (DMG). The basic structure is an XML document that contains a header, a data dictionary, data transformations and one or more models. Each model consists of a mining schema based on the type of model, a target and other outputs such as measures of accuracy. PMML started in 1998 and the most recent release was 4.1 in 2011. The 4.x releases marked a major milestone with support for pre- and post-processing, time series, explanations and ensembles. Support for PMML is widespread and growing: there is an active community, and many analytic vendors are either members of DMG or support the standard in their products.
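To make that structure concrete, here is a minimal sketch that parses a toy PMML 4.1 document with Python's standard library and lists its top-level sections. The model itself (the field names `age` and `spend`, the coefficients, the name `toy_model`) is invented purely for illustration, and the helper `summarize` is my own, not part of any PMML toolkit.

```python
import xml.etree.ElementTree as ET

# A hand-written toy document (illustrative assumption, not a real exported model).
# It follows the layout described above: header, data dictionary,
# transformations, then a model with its mining schema.
PMML_DOC = """<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <Header description="Toy linear regression (illustrative only)"/>
  <DataDictionary numberOfFields="2">
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="spend" optype="continuous" dataType="double"/>
  </DataDictionary>
  <TransformationDictionary/>
  <RegressionModel modelName="toy_model" functionName="regression">
    <MiningSchema>
      <MiningField name="age"/>
      <MiningField name="spend" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="10.0">
      <NumericPredictor name="age" coefficient="2.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""

# PMML elements live in a versioned XML namespace.
NS = {"p": "http://www.dmg.org/PMML-4_1"}


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.split("}", 1)[-1]


def summarize(pmml_text):
    """Return the PMML version, top-level sections and declared data fields."""
    root = ET.fromstring(pmml_text)
    sections = [local_name(child.tag) for child in root]
    fields = [
        field.get("name")
        for field in root.findall("p:DataDictionary/p:DataField", NS)
    ]
    return {"version": root.get("version"), "sections": sections, "fields": fields}


summary = summarize(PMML_DOC)
print(summary["sections"])
```

Real exported models are far larger, but the same top-level sections appear in the same order, which is what makes the format practical to exchange between tools.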
PMML has particular value for organizations as they move away from a batch scoring mindset to a more real-time scoring approach. When scoring was done in batch it was generally done using the same technology as was used to build the model. With real-time scoring it has become essential to be able to move models from their development environment to a more real-time, interactive scoring environment and PMML has emerged as the primary way to do this.
It should be noted that there is nothing inherently real-time about PMML. PMML can be, and is, used in batch scoring scenarios too. The focus of the standard is on interoperability, something that is valuable independent of an organization’s need to move to real-time scoring.
PMML offers an open, standards-based approach to operationalizing predictive analytics. This is a critical need for organizations looking to maximize the value of predictive analytics: unless predictive analytic models can be effectively operationalized, that is, injected into operational systems, the danger is that they will sit on the shelf and add no value. Similarly, if it takes too long to operationalize them (weeks or months, say) then the value of the model will degrade even as the cost to implement it rises. The wide range of deployment options for PMML models addresses these concerns. It also means that organizations can worry less about having multiple analytic development environments. If models can be generated as PMML, and that PMML can be executed widely, then it is possible to create an environment in which models are developed with any analytic tool and run anywhere.
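As a rough illustration of "develop anywhere, run anywhere", the sketch below scores a record directly from a PMML file, with no access to the tool that built the model. It is deliberately tiny: it assumes a single linear `RegressionTable` with plain numeric predictors, and the model content and the helper names `load_regression` and `score` are hypothetical. A production deployment would use a full PMML scoring engine rather than hand-written parsing like this.

```python
import xml.etree.ElementTree as ET

NS = {"p": "http://www.dmg.org/PMML-4_1"}

# Stand-in for a model exported by some analytic tool (invented for illustration).
EXPORTED_MODEL = """<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <Header description="Toy linear regression (illustrative only)"/>
  <DataDictionary numberOfFields="2">
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="spend" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel modelName="toy_model" functionName="regression">
    <MiningSchema>
      <MiningField name="age"/>
      <MiningField name="spend" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="10.0">
      <NumericPredictor name="age" coefficient="2.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""


def load_regression(pmml_text):
    """Read the intercept and coefficients from the first RegressionTable.

    Assumption: only linear terms are present; other PMML features
    (exponents, categorical predictors, transformations) are ignored here.
    """
    root = ET.fromstring(pmml_text)
    table = root.find("p:RegressionModel/p:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    coefficients = {
        predictor.get("name"): float(predictor.get("coefficient"))
        for predictor in table.findall("p:NumericPredictor", NS)
    }
    return intercept, coefficients


def score(model, record):
    """Apply intercept + sum(coefficient * value) to one input record."""
    intercept, coefficients = model
    return intercept + sum(c * record[name] for name, c in coefficients.items())


model = load_regression(EXPORTED_MODEL)
print(score(model, {"age": 40.0}))  # 10.0 + 2.5 * 40.0 = 110.0
```

The point is not the arithmetic but the separation: the scoring side needs only the PMML document, so the same function could sit behind a real-time service or run over a batch file.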
The primary challenge for PMML, as it is for any standard, is to get the vendor community to regard support for it as more than just a “check the box” capability. Market adoption of PMML has recently reached a tipping point where organizations are relying on PMML in critical production environments. These organizations are demanding strong PMML support from their suppliers which, in turn, is putting a great deal of pressure on vendors to provide solid support. This pressure also helps address the challenge of getting vendors to stay current and support the latest release. These challenges are typical for a standard and PMML is in the fortunate position of being the only standard for predictive models that is widely supported.
I’m a big fan of PMML and firmly believe that all organizations approaching predictive analytics should include PMML in their list of requirements for products. Selecting analytic tools that do a good job of generating and consuming PMML and identifying operational platforms that can consume and execute PMML just makes sense. Even organizations committed to a single vendor stack should consider PMML to give themselves the ability to bring models developed by a consortium or third party into their environment.