I got an update from Microsoft recently. We covered lots of different products in and around data/analytics – some public, some under NDA. Microsoft is making a serious investment in SQL Server as well as in SharePoint and Excel. BI tools, they say, are not reaching most people and Microsoft sees this as an opportunity. They are targeting business users – bright domain experts who know where to find data and how to massage it in Excel. They estimate that there are some 500M such information workers, of whom about 100M could be considered power BI users even though they are not using a formal BI tool. Microsoft is targeting these folks with extensions to Excel, with PowerPivot (described as a way to mash up data in Excel) and with Report Builder 3.0 (making it easier to build operational reports). Extra use of data will mean increased demand on IT to support them, so Microsoft is making a parallel investment in IT tools to see how taxing the queries are, which data sources are being used and so on.
SharePoint is acting as the glue between these two audiences and the framework for much of what is going on. SharePoint 2010 has a lot of new features out of the box, such as new intranet search and social networking around shared components, alongside its long-standing support for workflow, content management and so on. SharePoint collaboration allows a document or document equivalent (such as a report or dashboard component) to be passed around or passed to IT (to change a data source, say). Microsoft has recently folded PerformancePoint into SharePoint, which makes developing composite applications around dashboards easier and more integrated. This has also been integrated with things like presence, user security and teams.
Microsoft has developed some nice analytic tools for data investigation, such as a decomposition tree for drilling into segmentation. They have also integrated search with BI assets, allowing search results to bring back a thumbnail and some structured data about the asset, and even to recommend filters such as data source and data category. SharePoint supports folksonomies as well as taxonomies, and both can be integrated into this search.
One feature that seemed particularly useful allows a user who has found a report with the right data to use it as the basis for a new query, for example. The environment reads the source system from the report and re-uses it. This is a nice feature, as many people are more familiar with their reports than with their data sources. SharePoint can also display galleries of reports that let users browse, see who built what, see thumbnails and so on.
SharePoint pieces can easily be linked using web part connections so that, for instance, you can see the person responsible for an element of a dashboard. This allows links between any web parts, so you could link from a dashboard element to a business rules editor to change the rules that drive that dashboard (though this would require one of the .NET rules engines out there, like InRule). SharePoint can take any Visio diagram and bind it to data, and will soon support ESRI maps as well. SharePoint, of course, handles things like workflow for updates. Data can be managed in different places – sales data could be in SQL Server and quotas directly in SharePoint, for instance.
SharePoint has lots of APIs and support for development tools, including support from within Visual Studio for managing code that uses SharePoint. You can get at data in PerformancePoint scorecards and drive programmatic elements into them. The environment allows you to build an integrated layer of Microsoft and third-party elements within a single portal environment.
I had not seen PowerPivot before, though I am of course familiar with PivotTables in Excel. PowerPivot allows you to bring in data from standard SQL Server Analysis Services data sources using a list of tables, a query designer or a specific query. You can specify details – removing columns or filtering using Excel filters, for instance. Each table becomes a sheet, and you can edit the SQL for each as well as add more Excel filters once the data is loaded. You can also use other data that you pasted in or that you have in another Excel file. There is some automated building of filters/expressions based on the relationships in the data sources, and tables that started off related (through foreign keys, say) can also be linked manually. If you try to build a pivot table on data that needs a relationship, the tool warns you and helps you find and create the missing relationships. On top of all the usual PivotTable features, Microsoft has added Slicers for visual, multi-select and cascading filtering.
Users can push the results to SharePoint and view the data online using Analysis Services (including live Slicers). Users can see thumbnails in the gallery, work in a thin client using Excel Services or download into Excel. Loading the data compresses it for size and performance – it actually creates an Analysis Services database that can also be accessed directly through Analysis Services. Data can also be accessed as a feed, and SharePoint can be used to schedule updates. IT reporting allows analysis of who accesses which workbooks, what data sources they use, how access changes over time, the impact of shared data sources, when workbooks are added and so on. PowerPivots can link to each other, and the reporting for IT traces back through these links.
PowerPivot uses in-memory features to allow very rapid interactions and an Excel-like formula language for column definitions. Files can be managed and deployed through team sites, where things like data refresh can be managed and collaborative/social activities performed. In the future, PowerPivot data sources will be more integrated with Analysis Services on the server, making sharing and production support easier.
Microsoft takes a quite different perspective when it comes to predictive analytics. Traditionally, products for predictive analytics are targeted at very high-end users – statisticians and mathematicians. These users build sophisticated models that must then be interpreted for their consumers (executives, systems). Microsoft's plan was to bring modeling to a broader audience by targeting analysts without a lot of data mining or statistics expertise/experience. They also wanted to target developers without a huge background in statistics, as well as people who need to do some data mining in their day-to-day job. The result is a data mining stack on top of SQL Server, with Analysis Services sitting above Integration Services and below Reporting. Analysis Services supports OLAP, PowerPivot and data mining. The data mining environment provides an extensive editing environment and 9 popular algorithms. It is designed to be easy to integrate with Office and the rest of SQL Server, and easy to embed through a SQL-based API. Finally, a set of APIs makes it extensible, so companies can add additional algorithms.
The API is a language called DMX, modeled on SQL. Users create a model much as they would create a table. Training a model looks like an INSERT statement, and queries for pulling results from a model look very similar to queries pulling data from a table. Joins can cross models and tables, as you would expect. Microsoft also uses nested tables to avoid the typical explosion of relational data into a flat dataset for modeling. Nested tables are processed like a flat set, but it is not necessary to store or describe the data that way. Supported algorithms include decision trees, clustering, Naïve Bayes, neural networks, association rules, time series, sequence clustering and logistic/linear regression. KXEN, Megaputer and various SIs/customers have developed extensions.
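To make the create-like-a-table, train-like-an-INSERT pattern concrete, here is a minimal DMX sketch. The model, column and data source names (ChurnModel, CustomerDS and so on) are hypothetical illustrations, not from the briefing:

```sql
-- Defining a model looks like defining a table; PREDICT marks the target.
CREATE MINING MODEL ChurnModel
(
    CustomerId  LONG  KEY,
    Age         LONG  CONTINUOUS,
    Region      TEXT  DISCRETE,
    Churned     TEXT  DISCRETE PREDICT
)
USING Microsoft_Decision_Trees;

-- Training looks like an INSERT, pulling rows from a relational source.
INSERT INTO ChurnModel (CustomerId, Age, Region, Churned)
OPENQUERY(CustomerDS,
    'SELECT CustomerId, Age, Region, Churned FROM dbo.Customers');
```

The symmetry with ordinary SQL DDL and DML is the point: someone who can create and populate a table already knows most of the syntax needed to create and train a model.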
Everything runs in the Visual Studio framework, though as a separate tool (so people only doing data mining don't need to see all the usual programming tools). A standard data connection is used, and the user specifies a view for the data they want to use. A wizard then allows the selection of an algorithm, a target variable and the input data. Users can mark attributes for discretization (into buckets), and the wizard detects which variables are discrete and which are continuous. The wizard also lets the user specify the testing dataset, and the engine does cross-validation on automated slices of the data to avoid overfitting and other problems.
It is easy to copy and edit models to create a set of many slightly different models for comparison. The user pushes the models out to the server for execution. While SQL Server does not support these routines for in-database mining, the models can keep a copy of the training data to support drill-through and analysis of model results. As you would expect, you can use standard SQL Server tools to examine results, compare models and so on.
Once built, the model can be used to generate a report, for instance, or simply to add a field (a score, say) to a SQL statement by setting up the server to use the model. As always, you need to map data from an input table to the model's input variables. Because everything uses Analysis Services, the same model can be used in automated services, reports, Excel or any ETL process built in Integration Services – a nice feature.
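That scoring step can be sketched in DMX. This assumes a hypothetical trained model (ChurnModel) and data source (CustomerDS) – neither name comes from the briefing – and the ON clause does the input-to-variable mapping just described:

```sql
-- Score new customers by joining relational rows to the mining model.
SELECT
    t.CustomerId,
    Predict(ChurnModel.Churned)            AS PredictedChurn,
    PredictProbability(ChurnModel.Churned) AS ChurnScore
FROM ChurnModel
PREDICTION JOIN
    OPENQUERY(CustomerDS,
        'SELECT CustomerId, Age, Region FROM dbo.NewCustomers') AS t
    ON ChurnModel.Age    = t.Age
   AND ChurnModel.Region = t.Region;
```

Because this is just a query against Analysis Services, the same statement can be embedded in a report, an Integration Services package or an application.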
The new Parallel Data Warehouse (PDW) is the outgrowth of Microsoft's DATAllegro acquisition, now ported to Windows/SQL Server and cleaned up. PDW is where SQL Server data warehousing is going, especially at the high end. It is an MPP appliance based on off-the-shelf components that comes integrated with Microsoft BI, but one you can buy from multiple vendors – you can essentially pick the hardware vendor you want. Key features:
- Easy set-up and administration (it is an appliance)
- Fast loads
- Holds fine-grained business event data thanks to MPP and sheer capacity (important for organizations serious about data mining or predictive analytics)
- Fast, interactive query responses (and data reorganization for instance)
PDW comes as a single rack or in multiples of a full rack. The critical pieces are the control nodes and a set of database servers connected by dual InfiniBand. Database servers have "shared nothing" storage nodes. There is also a spare database server to replace a failing one, while the SANs are already highly available/redundant. There are also landing zone (for loading), backup and management servers. PDW is a little more loosely coupled than solutions like Teradata's, as SQL text is passed between the components rather than binary data. Big tables are partitioned across servers using a hash and then partitioned again within each server (distributions), while small tables are replicated. Queries are spread among the nodes to produce interim results, which are then re-assembled into final results.
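The hash-distributed versus replicated distinction shows up directly in PDW's table DDL. A minimal sketch, assuming a hypothetical fact table and small dimension table (the names and columns are illustrative):

```sql
-- Big fact table: rows are hashed on CustomerKey and spread across
-- the database servers, then further partitioned into distributions.
CREATE TABLE FactSales
(
    SaleId      BIGINT,
    CustomerKey INT,
    SaleDate    DATE,
    Amount      MONEY
)
WITH (DISTRIBUTION = HASH(CustomerKey));

-- Small dimension table: a full copy is kept on every node, so joins
-- against it never require moving data between nodes.
CREATE TABLE DimRegion
(
    RegionKey   INT,
    RegionName  VARCHAR(50)
)
WITH (DISTRIBUTION = REPLICATE);
```

Choosing a good hash column matters: distributing the fact table on a key that also appears in common join predicates keeps most of each join local to a node.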
PDW is based on SQL Server, which was extended to support relational data warehousing in SQL Server 2008. As you would expect, SQL Server Analysis Services runs on top.