Speed of Analytics: Why Infrastructure and Platforms Matter #BDA13

June 6, 2013

in Analytics, BI, Data Mining


Next up is a session on the new infrastructure and platforms for analytics. IBM’s view (and I would agree) is that use cases for analytics are evolving to increasingly combine traditional structured data with newer unstructured, more dynamic, “Big Data” sources. As customers compress the time frames in which they need to make decisions (more real time) and use a richer set of data (more streaming data, more unstructured data) to produce more sophisticated analytics, they need new kinds of infrastructure.

The range of analytics involved means that one size of infrastructure is not going to fit all – workloads range from memory-intensive operations like OLAP to very storage-intensive operations like analyzing sensor or weblog data. In between are streaming scenarios, operational Decision Management scenarios and more. Latency, bandwidth, concurrency, scalability and availability all need to be considered to deliver faster time to insight and value. The reality of multiple scenarios means that a different mix of cores, network bandwidth, storage and SCM is required in each case, and the right balance should accelerate the flow from data to insight to value.

IBM is looking at various technologies to improve performance. One example is the possible use of Field Programmable Gate Arrays (FPGAs), which allow low-latency, high-throughput parallelism with power efficiency and the ability to configure the logic at start time or even run time. FPGAs could be used for compression, sorting and analytics, among other things.

Another technology is, of course, flash memory. As SSD/flash memory costs fall it becomes more practical to use flash-based storage in place of (some) disk storage. Using automatic relocation, the busiest data can be moved from regular drives to solid state drives. Even with only a small percentage of the data moved, significant performance improvements are possible.
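The idea of automatic relocation can be sketched with a simple access-count heuristic – a minimal illustration, not IBM's actual tiering logic; the extent names and the SSD budget here are hypothetical:

```python
# Minimal sketch of automatic hot-data relocation: rank extents by recent
# I/O activity and move the hottest ones, up to the SSD tier's capacity.

def pick_extents_for_ssd(access_counts, ssd_budget):
    """Return the hottest extents that fit within the SSD capacity budget.

    access_counts: dict mapping extent id -> recent I/O count
    ssd_budget: number of extents the SSD tier can hold
    """
    hottest_first = sorted(access_counts, key=access_counts.get, reverse=True)
    return set(hottest_first[:ssd_budget])

# Example: 10 extents, SSD budget of 2 (a small percentage of the data).
counts = {f"extent{i}": hits
          for i, hits in enumerate([5, 900, 3, 7, 850, 2, 1, 4, 6, 8])}
print(pick_extents_for_ssd(counts, 2))  # the two busiest extents
```

Even this naive policy captures the key point: a disproportionate share of I/O typically hits a small fraction of the data, so a small flash tier absorbs most of the load.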

From a software perspective, IBM Platform Symphony helps with managing grids and workloads. It manages allocations across various grid technologies and can also handle Apache Hadoop workloads.

DB2 BLU is a critical component in the acceleration of analytics at IBM. I blogged about BLU when it was launched. Today BLU is focused on reporting, analytics and OLAP rather than OLTP. Key things to know about BLU:

  • BLU is a column store, with all the advantages that implies
  • BLU leverages existing DB2 infrastructure such as backup and buffer pools
  • BLU columnar tables can coexist in DB2 with row tables and be joined/queried with them
  • BLU tables are self-managing: tuning, compression, optimizing and indexing are all handled automatically, for both improved performance AND consistency
  • No indexing and better compression both contribute to the 10x improvement
  • Compression is focused on storing the most frequent values most efficiently to improve I/O, and on keeping encoded values register-friendly for efficient processing. It’s also order-preserving, so a lot of predicate logic can be run directly against compressed data for improved processing
  • BLU uses SIMD instructions effectively, processing multiple data elements in parallel in hardware. It’s also careful to parallelize queries so as to keep all the available cores busy
  • BLU has data skipping – it keeps track of the range of values stored in each data page and can therefore skip large sections of data for specific queries
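The data-skipping idea in that last bullet can be sketched as a per-page min/max synopsis – a minimal illustration under assumed page layout and predicate; BLU's actual synopsis tables are more sophisticated:

```python
# Minimal sketch of data skipping: consult each page's [min, max] summary
# and scan the page's values only if the range can contain a match.

def scan_with_skipping(pages, lo, hi):
    """Return values in [lo, hi], skipping pages whose range cannot match."""
    matches = []
    for page in pages:
        # Skip the whole page if its value range cannot overlap [lo, hi].
        if page["max"] < lo or page["min"] > hi:
            continue
        matches.extend(v for v in page["values"] if lo <= v <= hi)
    return matches

pages = [
    {"min": 1,   "max": 99,  "values": [5, 42, 99]},      # skipped
    {"min": 100, "max": 199, "values": [120, 150]},        # skipped
    {"min": 200, "max": 299, "values": [205, 250, 290]},   # scanned
]
print(scan_with_skipping(pages, 200, 260))  # → [205, 250]
```

For a selective predicate, two of the three pages are never read at all, which is exactly how large sections of data get skipped.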

The end result of all these things: take a 10TB query, reduce it to 1TB through compression, get down to 10GB by reading only the relevant column, use data skipping to focus on the 1GB that might matter, and parallelize that over 32 cores, each using SIMD, so that each core effectively only has to process around 10MB.
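That reduction chain can be checked with quick arithmetic. The 10x compression, the ~100-column table implied by the column-pruning step, and the 10x data-skipping factor are my assumptions to reproduce the figures quoted; SIMD then shrinks the effective per-core work further toward the ~10MB mentioned:

```python
# Quick arithmetic check of the reduction chain described above.
TB = 1000**4
GB = 1000**3
MB = 1000**2

data = 10 * TB        # the original 10TB query
data /= 10            # 10x compression              -> 1TB
data /= 100           # read 1 column of ~100        -> 10GB
data /= 10            # data skipping keeps ~10%     -> 1GB
per_core = data / 32  # parallelize over 32 cores

print(per_core / MB)  # → 31.25 (MB per core, before SIMD takes effect)
```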

