I got a chance to catch up with Via Science recently [Post updated June 2014 to reflect some changes in terminology]. Via Science is focused on “Big Math” for “big data.” They see big data companies spanning from data collection to search/storage to analysis/visualization. Big Math – the category in which they like to put themselves – is “the use of leading edge mathematics optimized on a very large supercomputing platform to make predictions, recommendations, and explain how really complex systems work, directly from data”. Math, they say, makes data useful – it turns GPS data into directions, purchase history into recommendations, and delivers scale. Big Math skills are, however, not all that widespread in businesses. Given the volume of data and this lack of skills, business people need sophisticated tools to make it useful. Via Science applies Bayesian statistics and networks on a big computing platform (supercomputer) to process big data and small data.
Today Via Science builds models for users and deploys the results as a small Java application. This application can be run on commodity hardware and built into operational decision management systems as well as more traditional decision support – BI – systems. They have subsidiaries dedicated to healthcare and quantitative trading customers. They are now expanding beyond those sectors to retail, CPG, energy, and telecommunications.
They have a big focus on cause and effect relationships to make predictions, detect anomalies and create explanatory models. Even when causation is not essential, their approach provides a rich set of information about correlation and improves accuracy. They aim at three distinct differentiators:
- Handle big data and small data
Big data is generally long: lots of records. But many real-world problems involve wide data: lots of columns and relatively few rows. Long data is hard to process because you might have hundreds of millions of records that reflect what has happened and must be analyzed to predict what might happen next. Wide data is hard because you might have tens of thousands of variables for each entry. This is hard to process even if you only have a few thousand rows, because traditional frequentist statistics requires many more rows than columns to make accurate predictions.
- Causality, describing the way data interacts and leads to an outcome.
The network of causality, how the variables interact and in what “direction,” lets you ask when will something happen (predictions), why something happens (causal explanations) and how to make more or less of something happen (optimization).
- Handling volatility and uncertainty in data.
Standard statistical approaches work well for normal distributions and for historical trends that continue, while their approach will detect events or make predictions outside the bounds of historical analysis.
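To make the wide-data point above concrete, here is a minimal sketch in Python with NumPy. It is not anything from Via Science's platform. The helper name `has_unique_ols_solution` and the matrix sizes are my own assumptions for illustration. It shows that ordinary least squares, a classic frequentist workhorse, cannot pin down a unique set of coefficients once there are more columns than rows.

```python
import numpy as np


def has_unique_ols_solution(X):
    """Return True if least squares on X has a unique coefficient vector.

    That requires X to have full column rank, which is impossible when
    there are more columns (variables) than rows (records).
    Hypothetical helper for illustration only.
    """
    return np.linalg.matrix_rank(X) == X.shape[1]


rng = np.random.default_rng(0)

# "Long" data: many records, few variables.
long_data = rng.normal(size=(500, 20))

# "Wide" data: few records, many variables.
wide_data = rng.normal(size=(50, 200))

print("long data, unique fit possible:", has_unique_ols_solution(long_data))
print("wide data, unique fit possible:", has_unique_ols_solution(wide_data))
```

With wide data, infinitely many coefficient vectors fit the records equally well, so extra information is needed to choose among them. In a Bayesian approach that role is played by the prior.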
The math behind this, Bayesian statistics and networks, is well understood, and the challenge is how to scale it. The basis for this is the work of Judea Pearl, who developed a branch of mathematics for determining cause and effect relationships in data. The platform applies Pearl’s work (Bayesian networks) at massive scale on a hosted platform. The platform (REFS™) handles four steps:
- Network Fragment Identification where lots of network fragments are constructed and evaluated to see if they can explain a part of the picture. Probability scores for each fragment are calculated using the full distribution of data and a Bayesian approach to weed out meaningless correlations.
- Network Model Creation combines the most important fragments into an optimal ensemble. An ensemble Bayesian network is created using a Markov-Chain Monte Carlo algorithm. Initially networks are selected randomly, and random changes are made to the model at each stage of optimization to prevent local maxima becoming a problem. Large numbers of networks are generated, each with a different result and a weight reflecting the likelihood that it is correct. Very large numbers of networks might be generated and perhaps 500 to 1,000 of the best are selected. These are then combined to generate a weighted mean and standard deviation. This combined ensemble is what code is generated from.
- Deployment is performed once this is complete and REFS generates a small Java application for either local or cloud execution.
- Simulation is run on the ensemble of models to see how much impact you might expect for a change to each variable. This gives you a sense of the causal impact of each variable.
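The ensemble and simulation steps above can be sketched in a few lines of Python. This is only an illustration under loud assumptions, not how REFS works internally: three toy linear functions stand in for the 500 to 1,000 Bayesian networks, and the weights, the variable names (`price`, `promo`) and the function names are all invented for the example.

```python
import numpy as np


def ensemble_summary(predictions, weights):
    """Weighted mean and weighted standard deviation across ensemble members."""
    predictions = np.asarray(predictions, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize so the weights sum to 1
    mean = np.sum(weights * predictions)
    variance = np.sum(weights * (predictions - mean) ** 2)
    return mean, np.sqrt(variance)


# Hypothetical stand-ins for ensemble members: each maps inputs to a prediction.
models = [
    lambda v: 2.0 * v["price"] + 1.0 * v["promo"],
    lambda v: 1.5 * v["price"] + 2.0 * v["promo"],
    lambda v: 2.5 * v["price"] + 0.0 * v["promo"],
]
# Assumed weights reflecting how likely each member is to be correct.
weights = [0.5, 0.3, 0.2]


def predict(inputs):
    """Run every member on the same inputs and summarize the results."""
    return ensemble_summary([m(inputs) for m in models], weights)


# "Simulation": change one variable and compare the ensemble's answers.
baseline = {"price": 1.0, "promo": 0.0}
changed = {**baseline, "promo": 1.0}

base_mean, base_std = predict(baseline)
new_mean, new_std = predict(changed)
effect = new_mean - base_mean

print(f"baseline: {base_mean:.2f} +/- {base_std:.2f}")
print(f"with promo: {new_mean:.2f} +/- {new_std:.2f}")
print(f"estimated effect of promo: {effect:.2f}")
```

The standard deviation is the useful part here: where the members disagree about a variable's effect, the spread widens, which tells you how far to trust the estimated impact.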
Via Science’s software platform runs on both Amazon EC2 and IBM Blue Gene/Q supercomputers with 130,000+ CPU cores. The resulting models, however, are executable locally. You can find more information about Via Science here.