While R has become very popular in recent years, the fact remains that, as an open source product, it has some scalability and performance issues (discussed, for instance, in our paper on Standards in Predictive Analytics). Base open source R is not really designed for the increasingly common large data volumes of Big Data: it is designed to run in-memory and is largely single-threaded.
Teradata has today announced Teradata Aster R. This is designed to let you run R “in-database” on Aster’s MPP architecture – that is, to run open source R at scale. Not only should this mean you don’t need to sample data or manage partitions; you can also mix and match the R language with Aster Discovery Portfolio functions, now exposed as R functions. All of this is accessible through the Teradata SNAP Framework, so the R engine runs alongside the existing SQL, SQL-MapReduce and SQL-GR (graph calculation) engines. All the SNAP engines, including the new R engine, can access the various storage types supported by the Teradata UDA – relational tables, columnar stores and the file store – as well as externally linked databases and Hadoop.
There are three specific components in the release:
- Aster R Parallel library
About 150 of the most popular R functions, rewritten to work on Teradata Aster. These generate SQL or MapReduce functions for execution in parallel on Teradata Aster, replace the equivalent base R functions, and are integrated with the R interface to the Aster Discovery functions. They leverage virtual data frames, which allow data already stored in the database to be presented to R functions as an ordinary data frame.
- Aster R Parallel Constructor
This allows more sophisticated R users to parallelize additional R functions using a split-apply-combine approach. You define how the function runs in each distributed job and how the partial results are combined; the constructor handles scaling out and distributing the data across potentially very many nodes.
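The split-apply-combine pattern itself is a familiar base R idiom. Below is a minimal, purely serial sketch of the pattern (names and data are illustrative, not Aster R API): the Parallel Constructor would run the "apply" step on many nodes in parallel, while you supply the per-partition function and the combine step, exactly as here.

```r
# Split-apply-combine, shown serially in base R. The Aster R Parallel
# Constructor distributes the "apply" step across the cluster's nodes;
# the user defines the per-partition work and how partials are combined.
df <- data.frame(group = c("a", "a", "b", "b"),
                 value = c(1, 3, 5, 7))

parts    <- split(df$value, df$group)                       # split by key
partials <- lapply(parts, function(v) list(sum = sum(v),    # apply per partition
                                           n   = length(v)))
means    <- sapply(partials, function(p) p$sum / p$n)       # combine partials
means  # named vector: a = 2, b = 6
```

Note that the per-partition function returns only small partial aggregates (sum and count), which is what makes the combine step cheap even when the partitions are large.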
- R Engine in the SNAP framework
This allows any R package to be executed in parallel across all the nodes in the cluster. Data can be partitioned and passed to many nodes at once, and each execution has access to the other SNAP engines (SQL, MapReduce, Graph). Running on SNAP also means that the optimizer and other framework elements are all used.
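The idea of running unmodified R code over data partitions on many workers at once has a small-scale analogue in base R's `parallel` package. The sketch below is only that analogue – local worker processes standing in for cluster nodes – and makes no use of Aster-specific APIs:

```r
# Conceptual analogue in base R's parallel package: apply a function
# (which could come from any R package) to data partitions on several
# workers at once. Local processes stand in for cluster nodes.
library(parallel)

partitions <- split(iris, iris$Species)   # partition the data by key

cl <- makeCluster(2)                      # two local worker processes
results <- parLapply(cl, partitions, function(part) {
  # each worker runs independently on its own partition
  mean(part$Sepal.Length)
})
stopCluster(cl)

results  # one per-partition mean per species
```

The key difference, of course, is that on Aster the partitions live in the database and the workers are the cluster's nodes, so the data volume is not bounded by one machine's memory.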