
First Look – Skytree Server


Skytree is singularly focused on advanced analytics, specifically machine learning on massive datasets. The company has been in development for several years, is based in Silicon Valley (with engineering teams there and in Atlanta), and officially launched its product in 2012. Skytree believes that machine learning will be at the core of solutions for big data (solutions like Google AdWords or Netflix recommendations are built around machine learning techniques), and its product is designed to make scalable, general-purpose machine learning infrastructure readily available. Skytree is positioning itself at the intersection of large data scale (data warehousing, Hadoop, storage appliances, etc.) and advanced analytics (statistical packages and predictive analytic workbenches). In particular, its focus is on pattern recognition, predictive models and data mining, and increasingly on doing these things against streaming data in a distributed computing environment. Most customers use Skytree Server alongside other modeling tools, but some are new to machine learning and advanced analytics and use only Skytree Server.

Skytree Server has been designed from the ground up to perform machine learning efficiently. The standalone software version is already available and is cloud and virtual machine ready, with standards-based data import and export. Full scale-out distributed support (thousands of cores) and support for real-time/streaming data are currently in private beta and are due for release later this year. The server delivers high performance for classic machine learning techniques such as K-Means clustering, SVM classification and Nearest Neighbor. Compared with open source solutions like WEKA and R, Skytree reports 3-200x improvements over the more efficient of the two in each case (and can show up to 10,000x relative to the slower of the two), all on a single machine. When the distributed version of the product is released, comparisons with distributed open source solutions will be available as well.
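For readers new to these techniques, here is a minimal sketch of K-Means clustering (Lloyd's algorithm) in plain Python. It is purely illustrative: the `kmeans` function, its parameters and the sample points are made up for this example, and it says nothing about how Skytree Server implements or exposes the algorithm.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Cluster 2-D points with Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, (x, y) in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2,
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, labels


# Hypothetical sample data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2)
```

Every pass touches every point once per centroid, which is why naive implementations slow down quickly on large datasets and why vendors compete on the efficiency of exactly this kind of routine.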

The product is purely API-based and takes a data file or stream from any standard data source (RDBMS, NoSQL, HDFS, flat files). Machine learning algorithms are then applied (to both variable identification and modeling), and the engine either outputs a file containing predictions, for batch scoring, or creates a model file that can be used to score individual records interactively in production. For those who want to build and continuously refine models, the streaming engine reaches a high degree of accuracy quickly as data streams in. Multiple streams can feed a single modeling routine, which matters when modeling in the cloud because it allows many endpoints to stream data to one routine. The streaming engine uses a TCP connection, so it can be connected, for instance, to a listener on a drip-fed data warehouse to analyze a data stream as the data warehouse updates.
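The pattern described above (several streams refining one model, which is then used for batch scoring or saved and used to score single records) can be sketched in a few lines of Python. This is a sketch under stated assumptions, not Skytree's API: the `StreamingCentroidModel` class is hypothetical, a simple nearest-centroid classifier stands in for whatever algorithms the engine actually uses, threads stand in for TCP endpoints, and a JSON string stands in for the model file.

```python
import json
import threading


class StreamingCentroidModel:
    """Hypothetical nearest-centroid classifier refined one record at a time."""

    def __init__(self):
        self._lock = threading.Lock()  # lets several streams update one model
        self.sums = {}    # label -> per-feature running sums
        self.counts = {}  # label -> number of records seen

    def update(self, features, label):
        """Fold one labelled record into the running per-label centroid."""
        with self._lock:
            sums = self.sums.setdefault(label, [0.0] * len(features))
            for i, value in enumerate(features):
                sums[i] += value
            self.counts[label] = self.counts.get(label, 0) + 1

    def predict(self, features):
        """Return the label whose centroid is closest to the record."""
        with self._lock:
            def squared_distance(label):
                n = self.counts[label]
                return sum((f - s / n) ** 2 for f, s in zip(features, self.sums[label]))
            return min(self.counts, key=squared_distance)

    def dumps(self):
        """Serialize the model state (stand-in for writing a model file)."""
        return json.dumps({"sums": self.sums, "counts": self.counts})

    @classmethod
    def loads(cls, text):
        model = cls()
        state = json.loads(text)
        model.sums, model.counts = state["sums"], state["counts"]
        return model


# Three made-up streams; each thread stands in for one TCP endpoint.
streams = [
    [([1.0, 1.0], "low"), ([9.0, 9.0], "high")],
    [([2.0, 1.0], "low"), ([8.0, 9.0], "high")],
    [([1.0, 2.0], "low"), ([9.0, 8.0], "high")],
]
model = StreamingCentroidModel()


def feed(stream):
    for features, label in stream:
        model.update(features, label)


workers = [threading.Thread(target=feed, args=(s,)) for s in streams]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()

# Batch scoring: produce predictions for a whole set of records at once.
predictions = [model.predict(record) for record in [[0.0, 0.0], [10.0, 10.0]]]

# Interactive scoring: reload the saved model and score a single record.
scorer = StreamingCentroidModel.loads(model.dumps())
single = scorer.predict([7.0, 7.5])
```

The lock is the design choice that matters here: because every update goes through it, any number of producers can feed the same model without corrupting its running totals, and the model is usable for scoring at any point while data is still arriving.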

Skytree is one of the vendors listed in our Decision Management Systems Platform Technologies report.

