Big Data Analytics #sassb

March 4, 2013

in Analytics, Data Mining


Paul Kent, VP of Big Data, came up next and gave a whiteboard talk with no slides. His initial focus was on parallel processing for analytics, something that SAS has been working on for a while and that is key to their HPA and Big Data approaches. He points out that some problems (min is a simple example) are easy to parallelize while others (average, for example) are harder. As SAS developed parallel processing for their math routines, they found it was increasingly hard to push processing into SQL and thus into the database. It worked much better to do the parallel processing of analytic algorithms in memory rather than in the database.
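To make the min versus average point concrete, here is a minimal sketch of my own (not anything SAS showed). It assumes the data is already split into uneven partitions and uses Python threads as stand-ins for nodes; the names `partial_stats` and `combine` are made up for illustration. Min merges trivially, while average only comes out right if each partition reports its sum and count rather than its own average.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_stats(chunk):
    """Summarize one partition as (min, sum, count)."""
    return min(chunk), sum(chunk), len(chunk)

def combine(partials):
    """Merge per-partition summaries into a global min and mean."""
    partials = list(partials)
    global_min = min(p[0] for p in partials)   # min of mins is the min
    total = sum(p[1] for p in partials)        # sums add up
    count = sum(p[2] for p in partials)        # counts add up
    return global_min, total / count

# Hypothetical partitions, deliberately uneven in size.
partitions = [[4.0, 8.0], [1.0], [10.0, 2.0, 5.0]]

# Each partition is summarized independently, then the summaries are merged.
with ThreadPoolExecutor() as pool:
    global_min, global_mean = combine(pool.map(partial_stats, partitions))

# The tempting shortcut: average the per-partition averages.
naive_mean = sum(sum(p) / len(p) for p in partitions) / len(partitions)

print(global_min, global_mean, naive_mean)
```

The shortcut disagrees with the true mean whenever partitions differ in size, which is one small reason an average takes more care to parallelize than a min.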

Their initial attempt was a separate copy of data in memory with nodes shared between the database and the analytics. This was not all that popular with IT, so the current approach uses a dedicated appliance for running the analytics in parallel while accessing data that is also parallelized on a data infrastructure.

This new approach allows a clustered appliance to manage the data and another appliance or a set of racks to do the math. Working with Teradata, they developed TD Connect (available Q2) so that the math running on the SAS appliance accesses the data in Teradata completely in parallel using parallel datafeeds. The SAS nodes can talk to each other, which is essential to resolving the parallel algorithms, as well as to the parallel datafeeds. In addition, performance is good because each set of nodes focuses on what it does best: the data warehouse nodes do data while the SAS nodes just do math. A similar approach, with different networking obviously, works with Exadata, Greenplum, HANA etc.

However, many companies are slowing their data warehouse growth by adding Hadoop clusters. The SAS approach extends to this setup as it can extract parallel data streams from Hadoop too. In fact, the approach allows the same SAS nodes to access data from Hadoop and from one or more data warehouse appliances. This is the focus for 2013: parallel in-memory analytics accessing data from high performance appliances through parallel datafeeds, whether that data is coming from a data warehouse or from Hadoop Big Data storage.
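As a rough illustration of the idea of parallel datafeeds from mixed sources, here is a toy sketch. It is an assumption-laden stand-in, not SAS's or Teradata's implementation: `SOURCES`, `open_feeds` and `consume` are hypothetical names, in-memory lists play the role of warehouse and Hadoop partitions, and threads play the role of math nodes. Each worker drains exactly one feed and returns a partial result, and the partials are then merged.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for partitioned storage: each inner list is one
# partition held by one data node.
SOURCES = {
    "warehouse": [[3, 5], [7]],
    "hadoop": [[2, 2, 2], [9, 1]],
}

def open_feeds(sources):
    """Yield one (source name, partition id, row iterator) per partition."""
    for name, parts in sources.items():
        for pid, rows in enumerate(parts):
            yield name, pid, iter(rows)

def consume(feed):
    """One math worker drains one feed and returns a partial (sum, count)."""
    _name, _pid, rows = feed
    total = count = 0
    for value in rows:
        total += value
        count += 1
    return total, count

feeds = list(open_feeds(SOURCES))

# One worker per feed, so no single reader sits between storage and math.
with ThreadPoolExecutor(max_workers=len(feeds)) as pool:
    partials = list(pool.map(consume, feeds))

# Merge the partial results regardless of which source they came from.
total = sum(t for t, _ in partials)
count = sum(c for _, c in partials)
mean = total / count

print(len(feeds), total, count, mean)
```

The point of the sketch is only that the workers do not care where a feed originates; the merge step is the same whether a partition lives in a warehouse or in Hadoop.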

From a customer point of view the same tools (Enterprise Miner for instance) continue to work but have access to this massively increased power. SAS sees customers using this to:

  • Get more cycles from their analytic teams because the turnaround time is so much shorter, making better use of scarce resources
  • Use the same data but do more complex analysis of it, such as moving from analyzing individuals to analyzing networks, e.g. from a telco customer to a telco customer and the people they call regularly
  • Add new data sources, integrating Big Data into their analytics. Examples include an insurance company offering special pricing to people who install a logging device that generates a tremendous amount of log data, and a project that used elevator logs to predict office space needs, a situation where the value of the data was not clear initially
  • Use all their data: anything can be brought to bear, analyzed and applied because of the flexibility of the storage and its performance

A great tour de force from Paul Kent and his whiteboard.

** typo corrected 3/6

