It’s been a while since I got an update from Cloudera (last time was in 2010, here). Since then, obviously, much has changed. Back in 2010 Cloudera was focused on storing data and running batch processing against it, with little focus on advanced analytics. Today, Cloudera has added support for real-time SQL queries through improvements in Apache Hive and support for Impala. This layers more real-time access onto the core capabilities without impacting the base high-performance batch processing.
The model-building element of predictive analytics and machine learning has an inherent batch focus: it needs to process a large amount of data to derive the analytic insight at the heart of the model. Cloudera’s support for these more advanced analytics has exploded in recent years with partnerships across the whole data mining and predictive analytics spectrum (R, SAS, IBM SPSS and more). With this focus on data science and predictive analytics running against the Cloudera storage layer, it is now possible to run these algorithms against larger volumes of data (avoiding the need to sample) and to store that data in cheaper Hadoop infrastructure. This is still mostly batch model building in an offline mode, where Hadoop makes a great processing fabric as well as a data storage layer.
Demand is now evolving towards both real-time scoring of analytic models (what is my prediction for this customer right now, based on everything I know about them at this moment) and real-time learning in systems (using the last result to adapt the model in real time). These demands are driving change in the Hadoop environment. They don’t eliminate the need for the batch model building that is well suited to Hadoop, but they do require new real-time capabilities for analytics in the same way that Impala provided real-time SQL.
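To make the idea of real-time learning concrete, here is a minimal sketch in Python of a model that scores a record and then adapts itself using the observed result. It assumes a simple logistic regression updated one record at a time by stochastic gradient descent; the function names, the learning rate and the sample record are all hypothetical, chosen only for illustration, and nothing here represents Cloudera's implementation.

```python
import math


def predict(weights, features):
    """Real-time scoring: probability of a positive outcome for one record."""
    z = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def update(weights, features, outcome, learning_rate=0.1):
    """Real-time learning: one gradient step using the latest observed result.

    learning_rate is an illustrative assumption, not a recommended value.
    """
    error = outcome - predict(weights, features)
    return [w + learning_rate * error * x for w, x in zip(weights, features)]


# Hypothetical record: two features for one customer, outcome observed as 1.
weights = [0.0, 0.0]
record = [1.0, 2.0]

before = predict(weights, record)      # score before seeing the outcome
weights = update(weights, record, 1)   # adapt the model to the new result
after = predict(weights, record)       # score moves towards the outcome
```

Each call to `update` touches only one record, which is what distinguishes this style of learning from rebuilding the model over the full data set in batch.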
These approaches also result in a split between the tasks of model building and model scoring/updating (one batch, the other more real-time). The two capabilities can be separated and each supported in a robust way, provided they can be connected. Cloudera sees increasing use of PMML (the Predictive Model Markup Language) in this context to connect the model building and model execution layers. But there is clearly more potential if the Hadoop environment can become better at the real-time end of this.
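The split between a batch model-building layer and a real-time execution layer, connected by a portable model document, can be sketched as follows. This is only an illustration under stated assumptions: a small JSON document stands in for PMML (real PMML is an XML format with a much richer schema), the model is a one-variable linear regression, and `build_model`, `load_scorer` and the sample data are hypothetical names and values.

```python
import json


def build_model(rows):
    """Batch layer: fit y = intercept + slope * x by ordinary least squares
    over the full history, then serialize the result as a portable document."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # JSON is an assumed stand-in for PMML in this sketch.
    return json.dumps(
        {"type": "linear_regression", "intercept": intercept, "slope": slope}
    )


def load_scorer(document):
    """Execution layer: needs only the model document, not the training data."""
    spec = json.loads(document)
    if spec["type"] != "linear_regression":
        raise ValueError("unsupported model type: " + spec["type"])
    return lambda x: spec["intercept"] + spec["slope"] * x


# Hypothetical history of (input, outcome) pairs processed in batch.
history = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
document = build_model(history)

# The scoring side could run in a separate system and score one record at a time.
score = load_scorer(document)
prediction = score(4.0)
```

Because the only thing shared between the two functions is the serialized document, the building side and the scoring side can be run, scaled and replaced independently, which is the role the post describes PMML playing.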
Cloudera is part of this “big learning” evolution, aiming to support the Hadoop compute/storage fabric that powers the number crunching and partnering with the major tool vendors so that companies can use their favorite analytic tool with Hadoop. Cloudera has a data science team to advise customers on how to put it all together. This team is actively building real-time, scalable learning on Hadoop now, and Cloudera expects to see more platform support for these scenarios going forward.