Jim Falgout of Pervasive DataRush presented on best practices for custom big data applications. I have spoken to Jim before, when I reviewed DataRush. Jim sees the big data challenge as being driven by both complexity (high-performance computing problems like fluid dynamics and climate modeling) and data size (internet-scale data for web indexing and search, for instance). These have been addressed by supercomputers and large farms of servers, respectively. Enterprise problems are moving both toward internet-scale data volumes and toward high-performance computing problems, and the question is what the right platform is for this kind of analysis.
Pervasive DataRush is Pervasive’s solution for this problem. Jim describes DataRush as a parallel dataflow platform that eliminates performance bottlenecks. DataRush aims at:
- High throughput, by flowing data through parallelized I/O
- Cost efficiency, because it takes advantage of all the cores in commodity servers
- Ease of implementation, taking advantage of all the cores available on a machine without recoding
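The "parallel dataflow" idea behind these goals – independent operators connected by streams, each free to run on its own core – can be sketched in plain Java. This is a toy illustration using only `java.util.concurrent`, with made-up names; it is not the DataRush API:

```java
import java.util.concurrent.*;

// A toy two-stage dataflow: a reader stage feeds records through a queue
// to a downstream stage, so both stages run concurrently on separate cores.
public class MiniDataflow {
    static final String EOF = "__EOF__";  // sentinel marking end of stream

    // Runs the pipeline over n records and returns how many the
    // downstream stage processed.
    static int runPipeline(int n) throws Exception {
        BlockingQueue<String> channel = new LinkedBlockingQueue<>(1024);
        ExecutorService pool = Executors.newFixedThreadPool(2);

        // Stage 1: produce records (stand-in for parallelized I/O)
        pool.submit(() -> {
            for (int i = 0; i < n; i++) channel.put("record-" + i);
            channel.put(EOF);
            return null;
        });

        // Stage 2: consume records as they arrive (stand-in for cleansing)
        Future<Integer> cleaned = pool.submit(() -> {
            int count = 0;
            for (String r = channel.take(); !r.equals(EOF); r = channel.take()) {
                count++;  // a real stage would profile/match/cleanse here
            }
            return count;
        });

        int result = cleaned.get();
        pool.shutdown();
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("cleaned " + runPipeline(5) + " records");
        // prints "cleaned 5 records"
    }
}
```

The point of the pattern is that neither stage waits for the other to finish a whole batch: records stream through, and adding stages (or partitioning a stage) puts more cores to work without changing the application logic.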
DataRush is designed to eliminate some typical bottlenecks in data preparation – profiling, fuzzy matching, cleansing – and in analytics – understanding, modeling, predicting. DataRush can also work with various data warehouse and, increasingly, NoSQL sources. DataRush is very fast, based on some published benchmarks like MalStone B-10. This benchmark uses 10 billion rows and 1 TB of data. Pervasive ran their test on a single 4-way/32-core server and outperformed a 20-node cluster of 4-core machines running Hadoop/MapReduce. They were faster (31 minutes versus 891) and used dramatically less power. They also demonstrated pretty good scalability, with run time falling almost proportionally as cores were added – 3.2 hours on 4 cores and under 1 hour on 16, for instance.
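Those scaling figures imply close-to-linear speedup. As a back-of-the-envelope check using the numbers quoted above (my arithmetic, not an official Pervasive calculation):

```java
// Back-of-the-envelope parallel scaling check from the quoted figures:
// 3.2 hours on 4 cores versus ~1 hour on 16 cores.
public class SpeedupCheck {
    // speedup = time on fewer cores / time on more cores
    static double speedup(double baseHours, double scaledHours) {
        return baseHours / scaledHours;
    }

    // efficiency = speedup / core ratio; 1.0 would be perfectly linear scaling
    static double efficiency(double speedup, int baseCores, int scaledCores) {
        return speedup / ((double) scaledCores / baseCores);
    }

    public static void main(String[] args) {
        double s = speedup(3.2, 1.0);     // 3.2x faster
        double e = efficiency(s, 4, 16);  // on 4x the cores
        System.out.printf("speedup %.1fx, efficiency %.0f%%%n", s, 100 * e);
        // prints "speedup 3.2x, efficiency 80%"
    }
}
```

So a 4x increase in cores yields roughly a 3.2x speedup, i.e. about 80% parallel efficiency, which is good scaling for a data-intensive workload.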
At the core of DataRush is a patented parallel dataflow engine. Sitting on top is a Java SDK with Pervasive's core libraries for data preparation and analytics (as well as any user-defined libraries). Pervasive has built some modules on these – DataRush DataMatcher, Recommender and Profiler – which build on the core libraries and perform at very high speeds. The whole stack is designed to be embeddable, so you can build high performance applications wrapped around it. Pervasive has also partnered with KNIME, which lets you use a graphical composer to manage the DataRush technology – laying out the various elements of the modeling process graphically from a palette that includes various DataRush nodes. You can either let KNIME manage the flow, calling out to wrapped DataRush elements, or turn a whole flow over to DataRush.
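To give a feel for what "embeddable" composition looks like, here is a generic sketch of chaining data-preparation operators with plain `java.util.function`. The operator names and the composition style are illustrative only; the real SDK's operator classes will differ:

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// A generic operator chain: each step is a function over a batch of rows,
// and steps compose into a single flow an application can embed and reuse.
public class OperatorChain {
    // Illustrative cleansing steps (hypothetical, not DataRush operators)
    static Function<List<String>, List<String>> trim =
        rows -> rows.stream().map(String::trim).collect(Collectors.toList());
    static Function<List<String>, List<String>> dropEmpty =
        rows -> rows.stream().filter(r -> !r.isEmpty()).collect(Collectors.toList());

    // Compose the preparation steps into one reusable flow
    static Function<List<String>, List<String>> flow = trim.andThen(dropEmpty);

    public static void main(String[] args) {
        List<String> cleaned = flow.apply(List.of("  alpha ", "   ", "beta"));
        System.out.println(cleaned);  // prints "[alpha, beta]"
    }
}
```

The same composed flow could be wrapped as a node in a graphical tool like KNIME or invoked directly from application code, which is the design point of an embeddable engine.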
Jim did a demo with KNIME and DataRush based on a retail scenario – bringing in sales transactions, organizing them into baskets and various analysis dimensions, pushing the results into a warehouse, and running analytics against the data, such as which products sell well together. Obviously, being able to do this quickly is important, as it keeps the data being analyzed more current. In this scenario they took the TPC-H data. DataRush extracted and cleansed the data and definitions, enriched the data by joining it with definitions such as customer segment and product details, aggregated it into various dimensions like country and product category, and analyzed it using association rule data mining. All of this was set up in a single KNIME flow that took 10 minutes to run 120M line items/23GB on a 4×8 Intel box. Similar scenarios come up in healthcare, finance, utilities, etc.
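The "which products sell well together" step is classic association-rule mining. The arithmetic underneath – support and confidence for a rule like {bread} → {butter} – can be shown on toy baskets (this illustrates the standard measures, not the demo's actual DataRush/KNIME nodes):

```java
import java.util.List;
import java.util.Set;

// Support and confidence of a rule {a} -> {b} over a set of market baskets.
public class AssocRules {
    // support(a,b) = fraction of baskets containing both a and b
    static double support(List<Set<String>> baskets, String a, String b) {
        long both = baskets.stream()
            .filter(s -> s.contains(a) && s.contains(b)).count();
        return (double) both / baskets.size();
    }

    // confidence(a -> b) = P(basket contains b | basket contains a)
    static double confidence(List<Set<String>> baskets, String a, String b) {
        long withA = baskets.stream().filter(s -> s.contains(a)).count();
        long both = baskets.stream()
            .filter(s -> s.contains(a) && s.contains(b)).count();
        return withA == 0 ? 0.0 : (double) both / withA;
    }

    public static void main(String[] args) {
        List<Set<String>> baskets = List.of(
            Set.of("bread", "butter"),
            Set.of("bread", "butter", "milk"),
            Set.of("bread", "milk"),
            Set.of("milk"));
        System.out.printf("support=%.2f confidence=%.2f%n",
            support(baskets, "bread", "butter"),
            confidence(baskets, "bread", "butter"));
        // bread & butter appear together in 2 of 4 baskets (support 0.50);
        // of the 3 baskets with bread, 2 also have butter (confidence ~0.67)
    }
}
```

At the demo's scale the same counting has to happen over 120M line items, which is exactly where a parallel dataflow engine earns its keep.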
To get started with DataRush you can download it from pervasivedatarush.com (and KNIME from knime.org/download), get a quickstart consultation, etc. You can start on your existing commodity hardware and take advantage of all the cores before scaling out to multiple nodes with DataRush 5.