I got my first chance to catch up with the folks at Cloudera recently. Founded in 2008, Cloudera has nothing really to do with “cloud”; instead it focuses on “big data” – helping organizations capture, integrate and analyze new sources of detailed business data. Cloudera like to describe themselves as the Red Hat for Hadoop – they take the open source Hadoop project and a collection of related open source projects and wrap them into an Apache-licensed distribution called Cloudera’s Distribution for Hadoop. Of course they are not as big as Red Hat (45 people, 50 or so customers) but the analogy is a good one.
Cloudera has built a team with deep expertise in using Hadoop (several team members came from the original project, and the founders worked with Hadoop at Facebook, Yahoo, Google, etc.) along with enterprise software experience. Backed by Accel and Greylock, they have built a business providing support, training and professional services around Hadoop as well as their own software.
There is a lot of interest out there in Hadoop, with many companies starting to use it. Cloudera find that people download and try it in a test environment before calling and asking for production support. Most companies don’t have experience with Hadoop yet, obviously, so this makes sense – IT departments are willing to experiment with open source to see if it will solve their problem and then look for support as they move into production. Cloudera’s paying customers were initially web 2.0 companies, but now they are adding financial services companies such as Bank of America, which is using Hadoop to bring web logs and various internal data sources together thanks to its flexible storage options.
Hadoop consists of two main components – a scalable, fault-tolerant distributed file system (HDFS) that can store data of more or less any type, and MapReduce, a framework for fault-tolerant distributed processing of that data. Hadoop is very flexible (you need not define a schema to store or access the data), cheap (because it manages data across commodity hardware) and scalable.
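To make the programming model concrete, here is a toy, in-process sketch of the map → shuffle → reduce flow using a word count – the classic example. This is an illustration of the model only, not the actual Hadoop Java API; the function names and sample data are mine.

```python
from collections import defaultdict

# Toy illustration of the MapReduce programming model (not the real
# Hadoop API): count word occurrences across "blocks" of text.

def map_phase(record):
    # The map step emits (key, value) pairs -- here, (word, 1).
    for word in record.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Between the phases the framework groups all values by key;
    # this dict-of-lists stands in for that distributed shuffle.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # The reduce step combines the grouped values for one key.
    return key, sum(values)

blocks = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [pair for block in blocks for pair in map_phase(block)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["the"])  # → 3, one per block
```

In real Hadoop the blocks live on different HDFS nodes and the map tasks run where the data is stored, but the shape of the computation is the same.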
Typical problems addressed by Hadoop are those involving complex data, from lots of sources, in high volume (or at least some combination of these three). Many involve web log data (where Hadoop started), text and structured data. A lot of this data used to go to tape or not get captured at all – the price/storage characteristics of HDFS make storing this practical, often for the first time. Hadoop performs particularly well in batch scenarios – what can be learned from the data as a whole and over time. Often quite long running, these scenarios are easy to do in parallel – taking the computation to the data. Examples include text mining, index building, graph creation and analysis, pattern recognition, collaborative filtering, sentiment analysis, predictive models and risk assessment (especially portfolio risk).
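The “taking the computation to the data” pattern can be sketched as follows: each partition of a dataset is summarized independently (on a cluster, each partition would sit on a different node), and only the small per-partition summaries are combined at the end. The event log below is a made-up example, and the thread pool merely stands in for distributed workers.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical clickstream data, pre-split into partitions; on a real
# cluster each partition would live on a different HDFS node.
partitions = [
    ["login", "search", "logout"],
    ["search", "search", "purchase"],
    ["login", "purchase"],
]

def summarize(partition):
    # Local aggregation: count events within one partition only, so
    # no raw data needs to move -- just this small summary dict.
    counts = {}
    for event in partition:
        counts[event] = counts.get(event, 0) + 1
    return counts

def merge(summaries):
    # Cheap global step: combine the small per-partition summaries.
    total = {}
    for summary in summaries:
        for event, n in summary.items():
            total[event] = total.get(event, 0) + n
    return total

# The pool stands in for parallel workers running next to the data.
with ThreadPoolExecutor() as pool:
    totals = merge(pool.map(summarize, partitions))
print(totals["search"])  # → 3
```

Because the partitions are processed independently, the job scales out by adding nodes rather than by moving data to a central machine – which is why the long-running batch workloads listed above suit Hadoop so well.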
The biggest driver of Hadoop adoption is practicality – some big challenges are now possible for the first time at a reasonable price – but it is also true that many companies use Hadoop to reduce the cost of data capture and long term storage and analysis. Storing data in Hadoop can be as little as 10-20% of the cost of a data warehouse because it uses cheap, commodity hardware. Some problems are also faster when handled by Hadoop – HDFS replicates data multiple times and multiple jobs can be started to take advantage of this redundancy.
Hadoop is typically used alongside a data warehouse. Some companies use Hadoop for preloading – having it act as a landing zone for data that might end up in the data warehouse. For instance, they may take all the web logs and search logs and extract data about the patterns of behavior of customers (matching online sessions to customers in the CRM system, say) to load into the data warehouse. Some use it for long-term storage of data that would otherwise be purged – a financial services company storing 12 months of trade data for fraud detection, for example, or all the monitoring data from a smart grid. Because Hadoop/Hive supports “schema on read” you can manage and evolve metadata over time and come back to reapply a different metadata structure to the original data to support different analysis. The ability to store everything also helps with advanced analytics, avoiding the need to sample data when there are simply too many records to process. R, the open source analytics environment, has been integrated with Hadoop to support this kind of analysis.
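“Schema on read” is worth unpacking, since it is the opposite of the schema-on-write discipline of a data warehouse. The idea can be sketched in a few lines: raw records are stored untyped, and a schema is applied only when the data is read, so the same bytes can later be reinterpreted for a different analysis. The log format and field names below are hypothetical; in real Hive the schemas would be declared as table definitions rather than Python lists.

```python
# Raw, untyped records as they might land in HDFS -- no schema is
# imposed at write time. (Hypothetical pipe-delimited log format.)
raw_logs = [
    "2010-08-01|alice|/home|200",
    "2010-08-01|bob|/search|404",
    "2010-08-02|alice|/buy|200",
]

def read_with_schema(lines, schema):
    # The schema is applied here, at read time: each delimited line
    # is parsed into named fields for this particular analysis.
    for line in lines:
        yield dict(zip(schema, line.split("|")))

# One analysis reads the data as a four-field clickstream...
urls = [r["url"]
        for r in read_with_schema(raw_logs,
                                  ["date", "user", "url", "status"])]

# ...while a later analysis reapplies a different, coarser schema to
# the very same raw data, keeping only the fields it cares about.
by_day = {}
for r in read_with_schema(raw_logs, ["date", "user"]):
    by_day.setdefault(r["date"], set()).add(r["user"])
print(len(by_day["2010-08-01"]))  # → 2 distinct users that day
```

Nothing has to be migrated or reloaded when the second analysis comes along – which is what makes Hadoop attractive as cheap long-term storage for data whose future uses are not yet known.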
HDFS and MapReduce are very technical components, however, so there is a large and growing ecosystem of open source projects around them, such as Hive (a SQL-like interface for MapReduce), Pig (a scripting language) and HBase (row-level updates, e.g. for customer records, in addition to Hadoop’s block-level updates). All of these use HDFS as their ultimate storage and MapReduce for access. This ecosystem has challenges though – most of these are command-line tools, and the number of components makes it hard for companies to manage them and keep them synchronized. There can also be challenges with the interoperability of these components. Cloudera targets these issues by packaging UI, SDK, workflow, scheduling, metadata, data integration, row-level access and job coordination components into a single Apache-licensed stack. Cloudera can do this in part because it employs contributors or committers for all these components and has a certified committer for 70% of them. In fact, several of these components were actually created at Cloudera. Companies downloading the stack therefore get one coherent set of Hadoop-centric components and can also get training and support from Cloudera.
In June Cloudera launched a licensed product – Cloudera Enterprise. This consists of management tools for provisioning and authorization, resource management/SLAs, and integration, configuration and monitoring. This is licensed rather than open source and comes with all the normal kinds of enterprise support. Cloudera is seeing high worldwide demand for training and is also running Hadoop World 2010 in New York, where they are targeting 800 attendees.
Given the widespread use of Hadoop alongside data warehouses, Cloudera has partnerships with Netezza and Quest as well as a recently announced partnership with Teradata. In each case the intent is for Cloudera’s Hadoop stack to support a wide array of data for general-purpose analysis while the data warehouse provides more focused analysis.