I recently got an update on Oracle Data Mining (ODM), the in-database solution for data mining and predictive analytics offered by Oracle. They have made ODM available on the Amazon compute cloud so you can easily try it, and they have been doing some interesting work on a new GUI and on integration with Oracle Exadata.
The core premise of ODM is that, as data volumes explode, at a certain point it makes more sense to “move the algorithms to the data” rather than “move the data to the algorithms”. Consequently, when Oracle acquired the Thinking Machines “Darwin” data mining team 11 years ago, they started with a clean sheet of paper. Oracle Data Mining, a priced option to the Oracle Database, was built from the ground up around algorithms that complement what a DBMS does well. For instance, rather than attempting to shoehorn neural networks inside a 35-year-old relational database, the original ODM algorithms were selected to leverage the database’s strengths in counting and conditional probabilities, parallel execution, bitmap indexes, aggregation techniques etc.
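To make the “counting and conditional probabilities” point concrete, here is a minimal sketch of the kind of work a Naïve Bayes style algorithm can hand to the database. It is not ODM code: it uses Python's built-in sqlite3 module as a stand-in for Oracle, and the customers table, its plan_type and churned columns, and both function names are made up for illustration.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory table (hypothetical schema, illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers ("
        "id INTEGER PRIMARY KEY, plan_type TEXT, churned INTEGER)"
    )
    conn.executemany(
        "INSERT INTO customers (plan_type, churned) VALUES (?, ?)",
        [
            ("basic", 1), ("basic", 1), ("premium", 1),
            ("basic", 0), ("basic", 0),
            ("premium", 0), ("premium", 0), ("premium", 0),
        ],
    )
    return conn


def conditional_probabilities(conn):
    """Compute P(plan_type | churned) with GROUP BY, entirely in the database.

    Only the small table of probabilities leaves the database; the raw
    rows are never extracted.
    """
    rows = conn.execute(
        """
        SELECT c.churned,
               c.plan_type,
               COUNT(*) * 1.0 / t.n AS p
        FROM customers c
        JOIN (SELECT churned, COUNT(*) AS n
              FROM customers
              GROUP BY churned) t
          ON t.churned = c.churned
        GROUP BY c.churned, c.plan_type, t.n
        ORDER BY c.churned, c.plan_type
        """
    ).fetchall()
    return {(churned, plan_type): p for churned, plan_type, p in rows}


if __name__ == "__main__":
    for key, p in conditional_probabilities(build_demo_db()).items():
        print(key, round(p, 3))
```

The counting and aggregation here is exactly what a relational engine already parallelizes and indexes well, which is the argument for choosing algorithms of this shape.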
Over time, Oracle added support for additional capabilities and new algorithms that go further, for example SVMs, which are more robust, can mine thousands of input attributes, and can mine “text”. The result of this approach is the Oracle Data Mining Option, a collection of 12+ machine learning algorithms that are available directly in SQL, essentially adding keywords like predict, classify, cluster etc. to the SQL language. In addition, models are first-class objects in the database, so they are managed like other database assets. The algorithms focus on being quick and elegant and on ensuring that they work seamlessly with the DBMS in terms of performance, scalability, reliability etc. The current list of algorithms is Logistic Regression (GLM), Decision Trees, Naïve Bayes, Support Vector Machines (for classification and regression), Multiple Regression (GLM), One-Class SVM (for anomaly detection), Minimum Description Length (for attribute importance), Apriori (association rules), Hierarchical K-Means and Hierarchical O-Cluster (clustering), and Non-Negative Matrix Factorization (NMF, feature extraction). Additionally, Oracle now offers 50 or so basic statistical functions, e.g. median, t-test, F-test, Pearson’s test for correlation, ANOVA, etc., all free in every Oracle database.
Oracle Data Mining models can be created using the optional and free Oracle Data Miner UI, available for download from the Oracle Technology Network (www.oracle.com/technology/products/bi/odm/index.html), or using the PL/SQL or Java APIs for model building and application development, with SQL functions for model apply. As in most uses of data mining and predictive analytics, the models are not created when they are used but are created and evaluated as part of a modeling process. Once created, they can be used in SQL statements using defined extensions to SQL. The results are calculated live using the data in the database at that moment. The integration with the database, and with Exadata, where the ODM models are pushed down to storage for execution, means that performance should be the same as if the value had been pre-calculated and stored, without the delay inherent in actually doing so. A SQL query can, for instance, score every customer and then retrieve those that are, say, at the highest risk of churn.
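The “score every customer and retrieve the riskiest” pattern can be sketched as follows. This is an illustration under stated assumptions, not ODM itself: sqlite3 stands in for the Oracle database, a Python function registered with the connection stands in for a model built earlier, and the coefficients, table, columns and function names are all hypothetical. In Oracle the scoring would be done by the SQL scoring functions (PREDICTION_PROBABILITY, for example) rather than by a function you register yourself.

```python
import math
import sqlite3

# Hypothetical coefficients standing in for a model that was built and
# evaluated earlier, as part of a separate modeling process.
COEFFS = {"intercept": -2.0, "support_calls": 0.8, "months_inactive": 0.5}


def churn_probability(support_calls, months_inactive):
    """Logistic score for one row; called by the database for each row."""
    z = (
        COEFFS["intercept"]
        + COEFFS["support_calls"] * support_calls
        + COEFFS["months_inactive"] * months_inactive
    )
    return 1.0 / (1.0 + math.exp(-z))


def build_demo_db():
    """In-memory database with the scoring function registered for SQL use."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("churn_probability", 2, churn_probability)
    conn.execute(
        "CREATE TABLE customers ("
        "id INTEGER PRIMARY KEY, support_calls INTEGER, months_inactive INTEGER)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [(1, 0, 0), (2, 5, 2), (3, 1, 1), (4, 3, 4)],
    )
    return conn


def highest_risk(conn, threshold=0.5):
    """Score every customer at query time and return the riskiest first."""
    return conn.execute(
        """
        SELECT id, churn_probability(support_calls, months_inactive) AS p
        FROM customers
        WHERE churn_probability(support_calls, months_inactive) > ?
        ORDER BY p DESC
        """,
        (threshold,),
    ).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    print(highest_risk(conn))
    # Scores are computed live: change the data and the same query changes.
    conn.execute("UPDATE customers SET support_calls = 10 WHERE id = 1")
    print(highest_risk(conn))
```

Because nothing is pre-calculated, the second call reflects the updated row immediately, which is the “results are calculated live” behavior described above.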
In-database analytics, like ODM, often result in models that are faster to build and iterate as well as to score. Because data does not have to be extracted and moved before being modeled, the time to a working model is reduced, and the use of the database infrastructure can mean that each iteration is also very fast. Integration with the database also means that the usual effort of “flattening” data into an analytic dataset is not necessary: the ODM algorithms can be run against views defined in the database that have structure like customer-has-order.
ODM has a new UI that will go to beta customers “late this spring” (screenshots are available on the Oracle site), and it is free to Oracle database customers. It looks like it will bring the GUI for ODM up to a nice level. It looks similar to tools from SAS or SPSS, but it runs in SQL Developer and is based on the same IDE. In fact it will be made available as an update for SQL Developer, so you will be able to download it easily. When you run it within SQL Developer you can select a role, and it will turn off unused menu items and enable the data mining specific ones. This integration should make it easy for folks to consume the new data mining capabilities.
The UI looks to have a nice look and feel, including graphical model development flows, easy access to the data, nice little micro graphs when browsing data records and more. Model management is still done largely through the database management capabilities, though an XML file that defines the steps in model creation can be exported so that modelers can share models, and one model process can invoke another as one of its steps. For deployment, the Oracle Data Miner UI generates SQL scripts for model apply. Clearly, though, ODM does not yet have the kinds of collaborative model management/creation capabilities of its competitors.
Thanks to its integration with the Oracle database, ODM is starting to appear in other Oracle products. Oracle BI Suite Enterprise Edition, for instance, can simply use the models as though they were data elements. Similarly, Oracle Business Rules and other Oracle SOA platform products can access models like data. The ODM team is also working with Oracle Applications to bring data mining and predictive analytics into the applications, with use cases like Oracle Retail Data Model, Oracle Communications Data Model, Oracle Sales Prospector and Oracle Spend Classification. While I would like to see more integration of ODM with Oracle’s other decisioning products, like Real-Time Decisions and Oracle Business Rules, the reality is that you can always access ODM models through SQL.
It seems likely that Oracle will also be a target for SAS’ in-database analytic strategy (though neither company has released any details), and SAS already offers a native access engine for Oracle as a data source. There is, however, a fundamental difference between the two approaches (SAS in-database and ODM). SAS is working with vendors to make some SAS functionality available as User Defined Functions in the databases/warehouses it supports. This means that a SAS developer would build the same script or model creation flow that they would before, but some or all of the steps would execute in the database. This is not the same as the integration that Oracle currently delivers for ODM. ODM, in contrast, runs everything in the database kernel and also controls the model creation process. Oracle says that using the ODM routines in the Exadata kernel is faster than running a native ODM model in the database by a factor of 2, and that this advantage increases as more joins are used. This could mean that ODM outperforms even third-party in-database analytics. The tradeoff would then become one of pure performance against the fact that ODM does not support every modeling algorithm, nor does it work in other databases.
Anyway, ODM is a nice product for Oracle database customers and well worth looking into. The new UI will only make it more so.