IBM has recently announced a new strategy for bringing Big Data to the enterprise. In particular this includes InfoSphere Streams v2 (announced April 12) and InfoSphere BigInsights 1.1, announced today. Big Data is an issue, of course, largely because the amount of data available to organizations is growing rapidly. Surveys show that many managers already make decisions based on data they don’t trust (1 in 3) and that many don’t have the data they need (1 in 2) – this is why 83% of CIOs cite business analytics as a top issue and 60% of CEOs think they need to make better use of data. IBM estimates that 44 times as much data will be created in the coming decade as is currently managed by enterprises. Much of this data, as with all data, is unstructured. This rising volume will make existing data challenges worse unless organizations can bring together more data from more sources to make better decisions.
Big Data, says IBM, is a challenge because of the three Vs:
- Variety: there are many different types of data
- Velocity: much of the data is streaming, moving around and changing quickly
- Volume: there is a tremendous amount of it
As a result IBM draws a distinction between “traditional analysis” and “big data analysis”. In traditional analysis, business users define a set of questions they need answered and IT then builds a system that answers those questions. Big Data analysis, IBM says, is about exploring the data more iteratively – IT delivers a platform that enables the business to explore more creatively.
IBM believes the solution is to develop a Big Data platform that deals with these issues and lets you analyze large amounts of streaming and unstructured data. One example of an early customer is a turbine operator using 6PB of data to decide where to position turbines, which requires considering the weather, the kind of turbine to use in each location and how to manage and maintain them – many different kinds of data that must all be analyzed to make the best decision. While IBM feels that a new platform is called for, it should not be a silo – Big Data should be a permanent part of the information architecture and should be used alongside more traditional data management and analysis tools. The key requirements for the platform then are:
- Support the variety, velocity and volume of Big Data
Examples include telemetry, schedule and weather in logistics (variety), 100k records/second in customer service (velocity) and 6PB of data in turbine analysis (volume)
- Provide analytics for data in its native format and adjust analysis automatically
Text, video, image, time series, statistics, data mining, geospatial and more. The platform must also support predictive analytics, and it allows all the data to be used to build a model rather than just a sample or the most recent records (this generally improves the accuracy of models – one customer went from using 30 days of data in its fraud models to using 7 years).
- Provide ease of use for developers and users
- Enterprise class with failure tolerance and scale
- Integration capabilities to bring in lots of sources and leverage existing integration technologies
- Support for governance and incorporation of Big Data insights in the data warehouse
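To make the “build models on all the data” point above concrete, here is a minimal sketch in Python. Everything in it – the channels, the fraud rates, the 100-transactions-per-day volume and the trivially simple “model” (an observed fraud rate per channel) – is invented for illustration and has nothing to do with IBM’s actual implementation. It compares estimates built from the last 30 days against estimates built from 7 years of history:

```python
import random
from collections import defaultdict

random.seed(42)

# Hypothetical "true" fraud rates per channel, assumed for the sketch.
TRUE_RATES = {"card_present": 0.001, "online": 0.02, "phone": 0.01}

def simulate(days, per_day=100):
    """Generate (channel, is_fraud) records for the given number of days."""
    records = []
    for _ in range(days * per_day):
        channel = random.choice(list(TRUE_RATES))
        records.append((channel, random.random() < TRUE_RATES[channel]))
    return records

def estimated_rates(records):
    """A trivially simple 'model': the observed fraud rate per channel."""
    counts, frauds = defaultdict(int), defaultdict(int)
    for channel, is_fraud in records:
        counts[channel] += 1
        frauds[channel] += is_fraud
    return {c: frauds[c] / counts[c] for c in counts}

history = simulate(days=7 * 365)   # "all the data": 7 years of records
recent = history[-30 * 100:]       # just the last 30 days

full_model = estimated_rates(history)
recent_model = estimated_rates(recent)

def error(model):
    """Total absolute error of the estimates against the true rates."""
    return sum(abs(model[c] - TRUE_RATES[c]) for c in TRUE_RATES)

print("error with 30 days of data:", round(error(recent_model), 4))
print("error with 7 years of data:", round(error(full_model), 4))
```

On a typical run the 7-year estimates land much closer to the true rates than the 30-day ones, which is the intuition behind that fraud-modeling customer moving from 30 days of history to 7 years.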
The platform is based on open source foundational components (Hadoop, HBase, Pig, Lucene, Jaql) with two Big Data Enterprise Engines – Streaming Analytics and Internet Scale Analytics – on top. User environments for administrators, developers and end users are layered on top of that, and all of this plugs into the usual integration products and solutions. IBM is contributing to various open source projects based on this work, notably Jaql. Specific products announced:
- InfoSphere BigInsights 1.1
There is a free “Basic” edition for which you can buy support, and an Enterprise Edition with provisioning, job management, more integrations, large scale indexing etc. This seems to target Cloudera pretty directly. The product has:
- Hadoop foundation with large scale indexing and integrated text analytics
- Provisioning, storage and advanced security
- Connectivity with DB2, InfoSphere Warehouse and IBM Smart Analytics system with third party products planned for the future
- InfoSphere Streams 2.0
- Runtime optimizations based on large numbers of Java virtual machines
- More operators and functions out of the box with analytics for text, data mining, statistics
- Monitoring and deployment flexibility have been improved
- Connectivity expanded to Netezza, Microsoft SQL Server, MySQL as well as DB2, Informix, solidDB and Oracle.
The last point made was a quick discussion of Watson. Watson used this kind of Big Data platform to manage its knowledge base (not the announced platform, but the same underlying approaches) and can consume data from InfoSphere BigInsights. I have blogged before about Watson and its relationship to broader themes in analytics and Decision Management.
IBM has clearly made a big investment in building out a powerful Big Data platform. It builds on open source components but adds IBM’s own research and its focus on enterprise platforms to deliver something that is not a standalone Big Data environment but something that can become part of the core information architecture of an enterprise.
Decision Management implications
What I found interesting about the Big Data story, however, is the mismatch between the “iterative, ad-hoc, creative” mantra and the examples I see of people getting value from Big Data. IBM, for instance, says that it sees the predominant use case today as a very ad-hoc, investigative one. Yet the examples IBM gave of customers getting value were often highly structured, operational decision-making systems – Decision Management systems.
One example IBM gave is of the value of a 360 degree view of a customer’s mood or sentiment. Website logs, call detail records, social media, call center and customer service emails and more feed into the platform. The derived sentiment can be fed into the existing data infrastructure (for use in analytics for campaigns for instance) or used to trigger events. A great example of Big Data powering an operational Decision Management system.
Others included getting a 360 degree view of customers in retail to identify opportunities for targeted marketing (improving customer micro decisions – clearly an operational system that requires automated decisioning), reducing churn in telco by analyzing call detail records (another operational decision), and analyzing streaming data to find patterns and generate rules as an outcome (rules that presumably get deployed into a rules-based operational system). Not “Aha” moments but Decision Management systems.
And this matches what I have been thinking about with Big Data. For all the focus on visualization and ad-hoc queries in Big Data systems, the end result is often going to be automation – a Decision Management system. Given the volumes and velocity of Big Data it is most unlikely that people will be able to stay plugged into the solution once it is up and running. Their role will be to do the analysis, make the judgments and set up a system to handle the transactions as they flow through. When you are talking about decisions that involve real-time, streaming data in huge volumes, you are talking about building systems to handle those decisions. Not visualizations or dashboards, but systems that handle things like multi-channel customer treatment decisions, detecting life-threatening situations in time to intervene, managing risk in real time and so on.
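A minimal sketch of what such a system looks like in code: a handful of analyst-authored rules applied automatically to every event as it streams through, with no human in the per-transaction loop. The event fields, rule names and thresholds here are all hypothetical – in practice the rules would come out of the analysis described above and the events from a streaming platform, not a Python list:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Event:
    customer_id: str
    channel: str          # e.g. "call_center", "web"
    sentiment: float      # -1.0 (angry) .. 1.0 (happy), from upstream analytics
    monthly_value: float  # revenue at risk

# A rule looks at one event and either returns an action or passes.
Rule = Callable[[Event], Optional[str]]

def churn_risk_rule(e: Event) -> Optional[str]:
    # High-value, unhappy customer: intervene immediately.
    if e.sentiment < -0.5 and e.monthly_value > 100:
        return "route_to_retention_team"
    return None

def goodwill_rule(e: Event) -> Optional[str]:
    # Mildly unhappy call-center contact: small gesture.
    if e.sentiment < 0 and e.channel == "call_center":
        return "offer_service_credit"
    return None

RULES: List[Rule] = [churn_risk_rule, goodwill_rule]  # first match wins

def decide(event: Event) -> str:
    """The automated decision: applied to every event, no human in the loop."""
    for rule in RULES:
        action = rule(event)
        if action:
            return action
    return "no_action"

# A few events flowing through the "stream":
stream = [
    Event("c1", "call_center", -0.8, 250.0),
    Event("c2", "call_center", -0.2, 40.0),
    Event("c3", "web", 0.6, 90.0),
]
for e in stream:
    print(e.customer_id, "->", decide(e))
# c1 -> route_to_retention_team
# c2 -> offer_service_credit
# c3 -> no_action
```

The analysts do the creative, iterative work of finding the patterns; the deployed rules then make the treatment decision on every transaction at stream speed – exactly the division of labor argued for above.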
Big Data for Decision Management.