Next up at the Teradata Influencer event was a session on data lakes and related technologies and approaches from the Think Big consulting team. Think Big sees a data lake as an approach to capturing, refining, storing and exploring any form of raw data at scale, enabled by low-cost technologies, and from which downstream facilities may draw. Data variety is the key driver and Hadoop is the key underlying technology. Data governance, they say, is critical for a successful data lake:
- Control processes like security, authentication
- Usage tracking processes such as lineage and users
- Descriptive processes for views and business usages of data
- Organization and management processes like metadata, lifecycle management
They talk about “Goldilocks” governance (enough but not too much), but they do believe that proper metadata management is what turns a data lake into a data reservoir rather than a data swamp. They like to use metadata tooling such as Teradata Loom, which they find very effective for data and metadata management. It automates data lineage and profiling, reducing the time required to prepare data for self-service, for instance.
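To make "lineage and profiling" a little more concrete, here is a minimal sketch of what those two ideas amount to. This is not Teradata Loom's API or how Loom works internally; the names `LineageRecord`, `profile_column`, `profile_rows` and the sample datasets are all hypothetical, chosen purely for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LineageRecord:
    """Hypothetical record of how one dataset was derived from others."""
    output: str
    inputs: list
    transform: str
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def profile_column(values):
    """Return basic profile statistics for one column of raw values."""
    # Treat None and empty strings as missing (an assumption of this sketch).
    non_null = [v for v in values if v is not None and v != ""]
    types = Counter(type(v).__name__ for v in non_null)
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "types": dict(types),
    }


def profile_rows(rows):
    """Profile every column found in a list of dict-shaped rows."""
    columns = sorted({key for row in rows for key in row})
    return {col: profile_column([row.get(col) for row in rows]) for col in columns}


# Hypothetical raw data landed in the lake.
rows = [
    {"id": 1, "country": "US"},
    {"id": 2, "country": ""},
    {"id": 3, "country": "US"},
]
profile = profile_rows(rows)

# Hypothetical lineage entry for a cleaned dataset derived from the raw one.
lineage = LineageRecord(
    output="customers_clean",
    inputs=["customers_raw"],
    transform="drop rows with empty country",
)
```

A real tool would capture this kind of metadata automatically as data is ingested and transformed, which is the time saving Think Big is pointing to.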
To help customers with this they have a number of consulting offerings:
- 4 week Data Lake Assessment and gap analysis
- 8-10 week Data Lake Starter – gets one set up, puts basic governance in place and does some initial ingests
- 6-10 week Data Lake Optimization – clean up a data swamp
- 3-6 week Data Lake Publishing for self-service provisioning
All these programs are designed to transfer knowledge and get customers' own staff up to speed while getting them onto a well-governed data lake. They use sprints to keep the projects moving fast and mostly begin with batch feeds. They make sure everything works on all the major Hadoop distributions as well as on AWS and other cloud deployments.
Ultimately, they see this kind of managed data repository on Hadoop becoming increasingly mainstream.