Table of contents for Standards in Predictive Analytics
With the fourth post in this series I am going to talk about Hadoop – something with even more hype than R or predictive analytics. As we all know, the era of Big Data has arrived. As anyone who reads the IT or business press knows, there is more data available today than ever before; this data is no longer just structured data suitable for storing in a relational database but is increasingly unstructured and semi-structured; and it is arriving at organizations more rapidly than ever. All this is putting pressure on traditional data infrastructures, and Hadoop has emerged as a key technology for dealing with these Big Data challenges. Grossly simplifying, Hadoop consists of two core elements: the Hadoop Distributed File System (HDFS) and the MapReduce programming framework. HDFS is a highly fault-tolerant distributed file system that runs on low-cost commodity hardware, while MapReduce is a programming framework that breaks large data processing problems into pieces so they can be executed in parallel on many machines, close to the data they need.
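To make the MapReduce idea concrete, here is a minimal single-process sketch in plain Python, using a hypothetical word-count job. This is a simulation of the map, shuffle, and reduce phases, not real Hadoop code; on a cluster each phase would run in parallel across many machines:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical input: lines of text, as HDFS would feed them to mappers.
lines = ["big data needs big tools", "hadoop handles big data"]

def mapper(line):
    # Map phase: emit a (key, value) pair for each word.
    for word in line.split():
        yield (word, 1)

# Shuffle phase: collect all pairs and sort by key so equal keys are adjacent.
pairs = sorted(pair for line in lines for pair in mapper(line))

def reducer(word, counts):
    # Reduce phase: combine all values emitted for one key.
    return (word, sum(counts))

counts = dict(reducer(word, (c for _, c in group))
              for word, group in groupby(pairs, key=itemgetter(0)))
```

The same map-then-reduce shape applies whether the job is counting words or pre-processing customer records, which is what makes the framework so general.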
Hadoop is well suited to the challenges of Big Data. It runs on commodity hardware, so it scales at low cost to handle Big Data volumes. It uses a file metaphor flexible enough to handle the wide variety of Big Data. And its streaming-centric approach lets it handle the velocity of Big Data. While some newer organizations are pursuing a pure Hadoop strategy, most are adding Hadoop to an existing infrastructure. This allows them to use Hadoop as a landing zone where data can be pre-processed (cleaned, aggregated) before being moved to a data warehouse, as an active archive for data used less often, and for rapid data exploration where many data sources must be quickly combined and analyzed. The explosive growth of Hadoop means that this open source project is supported by commercial organizations and that major vendors have integrated with Hadoop, providing the support needed for success.
Hadoop has many features that make it appealing, but it is not perfect. Like R, Hadoop is free, so projects can get started and Hadoop can become essential without organizations lining up the support they need to succeed. Hadoop is also a programmer-centric, somewhat techie environment with limited support for SQL, making it difficult to use with existing analytic tools. Hadoop is better at batch processing, so interactive systems can be a real challenge, for instance when a single record must be processed with minimal latency. There is also a lack of specific predictive analytics support, and scoring analytic models against Hadoop data means writing pre- and post-processing MapReduce scripts. These challenges are being addressed by additional projects that improve SQL support (opening standard tools to Hadoop data), while commercial analytic tools increasingly offer access to Hadoop for both modeling and scoring using both PMML and “in-database” techniques.
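To illustrate what such a scoring script involves, here is a hedged sketch of the mapper side of a batch-scoring job in Python. The model (a simple logistic regression), its coefficients, and the field names are all hypothetical, standing in for what a modeling tool might export via PMML:

```python
import math

# Hypothetical model: logistic regression coefficients as a modeling tool
# might export them (e.g. via PMML); names and values are illustrative only.
COEFFS = {"intercept": -2.0, "recency_days": -0.05, "purchases": 0.4}

def score_mapper(record):
    # Each mapper scores one record close to where the data lives,
    # emitting (customer_id, probability) for downstream processing.
    z = (COEFFS["intercept"]
         + COEFFS["recency_days"] * record["recency_days"]
         + COEFFS["purchases"] * record["purchases"])
    return (record["customer_id"], 1.0 / (1.0 + math.exp(-z)))

# A tiny batch of records standing in for one HDFS split.
records = [
    {"customer_id": "c1", "recency_days": 10, "purchases": 8},
    {"customer_id": "c2", "recency_days": 90, "purchases": 1},
]
scores = dict(score_mapper(r) for r in records)
```

The pre- and post-processing the post mentions would wrap this core: parsing raw records into fields before scoring, and routing the emitted scores to whatever system consumes them.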
It is possible to create Hadoop-sized problems by focusing on batch scoring. If an organization has many customers to score and thinks in terms of batch scoring—that every customer must be scored every day—then this can sound like a Hadoop problem. However, the scoring could be done on demand, in real time, for the much smaller number of customers who interact with the company on a typical day. This real-time scoring problem may be much smaller in scale and so not demand a Hadoop infrastructure.
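The sizing argument comes down to simple arithmetic; the customer count and daily interaction rate below are hypothetical, chosen only to show how large the gap between the two approaches can be:

```python
# Illustrative sizing only: both numbers below are hypothetical.
total_customers = 10_000_000     # scored nightly under a batch mindset
daily_interaction_rate = 0.01    # fraction of customers who interact per day

batch_scores_per_day = total_customers
on_demand_scores_per_day = int(total_customers * daily_interaction_rate)

# Scoring on demand shrinks the daily workload by the same factor as the
# interaction rate, here two orders of magnitude.
workload_ratio = batch_scores_per_day / on_demand_scores_per_day
```

Under these assumptions the on-demand workload is one hundredth the size of the batch one, which may well fit on a conventional infrastructure.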
While organizations should not begin by installing Hadoop and focusing on the technology, Hadoop does have potential for companies adopting predictive analytics. When the analytics required to make a decision more accurately need data that does not lend itself to a traditional data infrastructure, Hadoop may well offer a better solution. Using Hadoop to store and manage data that would otherwise not be available offers a compelling value proposition when that data can be tied to analytics and decisions that matter. Companies familiar with open source can use Hadoop directly, but others will likely find good commercial options for support, whether from a Hadoop pure-play or an existing vendor offering Hadoop services. Once Hadoop becomes part of your data infrastructure it is important that it is supported by your analytic tools and in-database analytics.