
Pervasive DataRush – an update


I have blogged about Pervasive DataRush before, and I got a quick update this week. Pervasive often talks about helping companies with “big data” issues. They see data volume as one dimension of difficulty and the complexity of the processing as the other. Some folks handle big data but don’t do very complex processing (web analytics, for instance). Others have complexity but not so much data (typical enterprise computing). As enterprises move to more data and more complexity, they have been forced to sample their data for analytics (to reduce volume), run processes overnight, adopt expensive clustering technology or do very expensive custom development. Pervasive is focused on using multicore processors (32 and 48 core machines are under testing right now, for instance) to address these challenges, making multicore processing power available to enterprises while taking advantage of the low TCO of multicore servers. They have two areas of focus: data preparation (de-duping, matching, cleansing) and analytics (understanding, modeling, predicting). Key selling points remain scalability, throughput, cost efficiency, ease of implementation and extensibility. One use of that extensibility has been their work with KNIME to integrate DataRush analytic functions into this open source workbench.

They have some great benchmarks, like a Malstone-B10 benchmark for 10B rows and 1 TB of web site log data. A published benchmark on a 20 node cluster (4 cores per node) took 14 hours, while DataRush on a single 32 core machine took 31 minutes. This is a 26x improvement in time but also a massive reduction in running costs like electricity. Interestingly, their work on this benchmark also showed some nice scalability: 3.2 hours for 4 cores, 1.5 hours for 8, under 1 hour for 16 and 31 minutes for 32. They have also worked on the Smith-Waterman algorithm (gene sequence alignment) and showed that code written for an 8 core machine scaled up to 384 cores without any changes. This is a nice future-proofing example: companies can use Pervasive DataRush to build for their current machines, confident that their code will scale as core counts increase over time.
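
The scaling figures quoted above can be turned into speedup and parallel-efficiency numbers with a few lines of Python. This is only a rough sketch: the function names are mine, the times are the rounded ones from the post converted to minutes, and the 16 core run was only described as "under 1 hr", so 60 minutes is used as an upper bound.

```python
# Run times in minutes as quoted in the post (3.2 h = 192 min, 1.5 h = 90 min).
# The 16-core entry is an upper bound, since the post only says "under 1 hr".
RUN_MINUTES = {4: 192, 8: 90, 16: 60, 32: 31}


def speedup(base_minutes, minutes):
    """How many times faster a run is than the baseline run."""
    return base_minutes / minutes


def efficiency(base_cores, base_minutes, cores, minutes):
    """Speedup divided by the increase in core count (1.0 = linear scaling)."""
    return speedup(base_minutes, minutes) / (cores / base_cores)


def scaling_table(runs, base_cores=4):
    """Map each core count to (speedup, efficiency) relative to base_cores."""
    base = runs[base_cores]
    return {
        cores: (speedup(base, m), efficiency(base_cores, base, cores, m))
        for cores, m in sorted(runs.items())
    }


if __name__ == "__main__":
    for cores, (s, e) in scaling_table(RUN_MINUTES).items():
        print(f"{cores:>2} cores: {s:.2f}x vs 4 cores, efficiency {e:.2f}")
```

An efficiency near 1.0 means the run time falls almost in proportion to the number of cores added. Because the quoted times are rounded, the results should be read as approximate.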

Since I last spoke to them they have acquired significantly more real customers, which is great. Some examples include a healthcare insurance provider who used a fuzzy matching algorithm based on DataRush to replace a bunch of stored procedures. This made it easier to target new prospects and meant that queries did not bring down the performance of the overall system. Another healthcare example was patient matching, where data from multiple sources had to be integrated into a single customer data source. Initially this customer was concerned only about accuracy, but they found that the increased speed let them iterate more rapidly, trying new fuzzy matching approaches (because a run took only minutes rather than overnight). A third example was post-payment claims analysis on very large claims files (tens of millions of records).
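
To give a flavor of what fuzzy matching on names involves, here is a minimal sketch using `difflib.SequenceMatcher` from the Python standard library. It is not the algorithm Pervasive or its customers use (I have no details on that); the function names, sample names and the 0.85 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(name.lower().split())


def similarity(a, b):
    """Similarity score between 0.0 and 1.0 for two names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(query, candidates, threshold=0.85):
    """Return (candidate, score) for the closest candidate, or None.

    The threshold is an illustrative assumption; a real matching job
    would tune it against known matches and non-matches.
    """
    scored = [(c, similarity(query, c)) for c in candidates]
    if not scored:
        return None
    candidate, score = max(scored, key=lambda pair: pair[1])
    return (candidate, score) if score >= threshold else None


if __name__ == "__main__":
    # Hypothetical patient names from a second data source.
    known = ["Mary Jones", "John Smith", "Joan Smythe"]
    print(best_match("Jon Smith", known))
    print(best_match("Zed Quux", known))
```

Every query is compared against every candidate, which is the kind of work that grows quickly with data volume. That is why cutting a run from overnight to minutes makes it practical to try different thresholds and matching approaches.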

The core DataRush Parallel Dataflow Engine is the base for the product, with a Java SDK on top. Pervasive have added Data Preparation and Core Analytics libraries (which you can extend) and then layered things like the KNIME and Data Matcher components on top. The KNIME integration provides a node for the drag and drop interface that uses DataRush under the covers. This allows an analyst to take advantage of DataRush without having to write any Java code, which is important as many analysts are not familiar with Java. Besides integration with KNIME and new operators, DataRush 4.4 has also added a JavaScript scripting language to make it quicker and easier to use DataRush without having to write a bunch of Java code. DataRush 5 is adding multi-node support, more analytics and extensions to the matching/data quality libraries.
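
The general idea of a parallel dataflow (split the data into partitions, run the same chain of operators on each partition, then merge) can be sketched in plain Python. This is not the DataRush API, which is Java-based and which I am not reproducing here; every name below is hypothetical. It uses a thread pool sized from `os.cpu_count()` to show the structure only, since CPython threads do not give real CPU parallelism for pure-Python work.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n_parts):
    """Split rows into n_parts round-robin slices (some may be empty)."""
    return [rows[i::n_parts] for i in range(n_parts)]


def clean(row):
    """Cleansing operator: trim whitespace and lowercase."""
    return row.strip().lower()


def process_partition(part):
    """Apply the operator chain (clean, then drop empties) to one partition."""
    cleaned = (clean(r) for r in part)
    return [r for r in cleaned if r]


def run_pipeline(rows, workers=None):
    """Clean and de-duplicate rows, one partition per worker.

    The worker count defaults to the number of cores the machine reports,
    so the calling code does not change when the core count does.
    """
    workers = workers or os.cpu_count() or 1
    parts = partition(rows, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(process_partition, parts))
    merged = set()
    for part_result in results:
        merged.update(part_result)  # merging into a set removes duplicates
    return sorted(merged)


if __name__ == "__main__":
    print(run_pipeline([" Alice ", "alice", "", "BOB"]))
```

The result does not depend on the number of workers, which is the property that lets the same program run unchanged on machines with more cores.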

