Big Data is quickly becoming a business’s best resource. Refined and analyzed data can help companies improve operations and uncover insights. But as data begins to flow in from multiple sources, it’s important to prevent it from being siloed by pooling it in a central repository.
Enter the data lake: an architecture that acts as a storage space for data so engineers and IT teams can easily access it for future use.
Moreover, a data lake can form the firm foundation of a modern data platform. This was the case for telecom provider TalkTalk, which tapped Microsoft Azure’s Data Lake, Data Lake Analytics and Data Factory cloud-based hybrid data integration tools to increase data maturity.
Tapping Azure, the company quickly developed real-time and batch data pipelines while introducing security best practices and DevOps processes. This change brought data to the forefront of the company’s architectural decisions.
With this architecture in place, TalkTalk is looking to “scale the modern data platform, introducing real-time and event-driven ingestion and processing, as well as migrating on-premises systems to the cloud,” says Ben Dyer, TalkTalk’s head of data technology and architecture.
So, how do companies go about building and using a data lake? A good place to start is to understand the architecture and what it can deliver.
What Is a Data Lake?
Data lakes store data of any type in its raw form, much as a real lake provides a habitat where all types of creatures can live together.
A data lake is an architecture for storing high-volume, high-velocity, high-variety, as-is data in a centralized repository for Big Data and real-time analytics. And the technology is an attention-getter: The global data lakes market is expected to grow at a rate of 28 percent between 2017 and 2023.
Companies can pull in vast amounts of data — structured, semistructured and unstructured — in real time into a data lake, from anywhere. Data can be ingested from Internet of Things sensors, clickstream activity on a website, log files, social media feeds, videos and online transaction processing (OLTP) systems, for instance. There are no constraints on where the data hails from, but it’s a good idea to use metadata tagging to add some level of organization to what’s ingested, so that relevant data can be surfaced for queries and analysis.
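As a rough illustration of metadata tagging at ingestion, a pipeline might attach a small catalog record to each raw object as it lands, then use those tags to surface relevant data later. This is a minimal sketch: the object keys, tag fields and sources below are invented for illustration, and a production lake would use a dedicated catalog service rather than an in-memory list.

```python
import json
import time
from typing import Any

# Hypothetical in-memory catalog; real lakes use a catalog service.
catalog: list[dict[str, Any]] = []

def ingest(object_key: str, payload: bytes, source: str, fmt: str) -> dict:
    """Store a raw object's metadata as a catalog entry at ingestion time."""
    entry = {
        "key": object_key,
        "source": source,           # e.g. "iot-sensors", "clickstream"
        "format": fmt,              # e.g. "json", "csv"
        "size_bytes": len(payload),
        "ingested_at": time.time(),
    }
    catalog.append(entry)
    return entry

# Ingest raw records from two hypothetical sources, untransformed.
ingest("raw/iot/device42.json", json.dumps({"temp": 21.5}).encode(), "iot-sensors", "json")
ingest("raw/web/clicks.csv", b"user,page\n1,/home\n", "clickstream", "csv")

# Later, the tags let a query surface only the relevant objects.
iot_keys = [e["key"] for e in catalog if e["source"] == "iot-sensors"]
print(iot_keys)  # ['raw/iot/device42.json']
```

The point of the sketch is that the payloads stay raw; only the lightweight tags impose enough organization to keep the lake searchable.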
“To ensure that a lake doesn’t become a swamp, it’s very helpful to provide a catalog that makes data visible and accessible to the business, as well as to IT and data management professionals,” says Doug Henschen, vice president and principal analyst at Constellation Research.
Data Lakes vs. Data Warehouses
Data lakes should not be confused with data warehouses. Where data lakes store raw data, warehouses store current and historical data in an organized fashion.
IT teams and data engineers should think of a data warehouse as a highly structured environment, where racks and containers are clearly labeled and similar items are stacked together for supply chain efficiency.
The difference between a data lake and a data warehouse primarily pertains to analytics.
Data warehouses are best for analyzing structured data quickly and with great accuracy and transparency for managerial or regulatory purposes. Meanwhile, data lakes are primed for experimentation, explains Kelle O'Neal, founder and CEO of management consulting firm First San Francisco Partners.
With a data lake, businesses can quickly load a variety of data types from multiple sources and engage in ad hoc analysis. Or, a data team could leverage machine learning in a data lake to find “a needle in a haystack,” O’Neal says.
“The rapid inclusion of new data sets would never be possible in a traditional data warehouse, with its data model–specific structures and its constraints on adding new sources or targets,” O’Neal says.
Data warehouses follow a “schema on write” approach, which entails defining a schema for data before it can be written to the database. Online analytical processing (OLAP) technology can be used to analyze and evaluate data in a warehouse, enabling fast responses to complex analytical queries.
Data lakes take a “schema on read” approach, where the data is structured and transformed only when it is ready to be used. For this reason, it’s a snap to bring in new data sources, and users don’t have to know in advance the questions they want to answer. With lakes, “different types of analytics on your data — like SQL queries, Big Data analytics, full-text search, real-time analytics and machine learning — can be used to uncover insights,” according to Amazon. Moreover, data lakes are capable of real-time actions based on algorithm-driven analytics.
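The contrast can be sketched in a few lines of Python, using an in-memory SQLite table as a stand-in for a warehouse and a list of raw JSON strings as a stand-in for a lake. The records and field names are invented for illustration.

```python
import json
import sqlite3

# Schema on write (warehouse-style): the table must be defined
# before any row can be loaded, and rows must match it.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (user_id INTEGER, page TEXT)")
db.execute("INSERT INTO events VALUES (1, '/home')")

# Schema on read (lake-style): land raw records exactly as they arrive...
raw_records = [
    '{"user_id": 2, "page": "/pricing", "referrer": "ad"}',
    '{"user_id": 3, "device": "mobile"}',  # new field, no upfront schema change
]

# ...and impose structure only at query time, tolerating missing fields.
pages = [json.loads(r).get("page") for r in raw_records]
print(pages)  # ['/pricing', None]
```

Note how the second raw record carries a field the first doesn’t: in the warehouse path that would require altering the table first, while the lake path absorbs it without any upfront change.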
Businesses may use both data lakes and data warehouses. The decision about which to use turns on “understanding and optimizing what the different solutions do best,” O’Neal says.
Implementing a Data Lake Architecture
Modern Big Data platforms serve as the building blocks of data lake architectures. Common components include:
- Hadoop — Delivers distributed storage and distributed processing of very large data sets on computer clusters
- Microsoft’s Azure Data Lake — A scalable public cloud service that uses Azure Data Lake Store, a Hadoop-compatible repository for analytics on data of any size, type and ingestion speed
- Azure Databricks — An Apache Spark analytics platform that is optimized for Azure cloud services and integrates with Azure databases and stores, including Data Lake Store
“There are various software providers, and all leading cloud providers offer this type of infrastructure as a cloud service,” Henschen says.
Hadoop services, for instance, are available not only on Microsoft Azure but also on Google Cloud and other leading cloud platforms, most of which offer similar combinations of high-scale databases, object stores and analytical tools.
While many providers, such as Oracle, offer both on-premises and cloud data lake solutions, most businesses are turning to the cloud when seeking data lake architectures.
“Cloud-based options — whether provided by cloud providers or software vendors with their own cloud services — are seeing the lion’s share of the growth these days,” says Henschen.
O’Neal backs up this assertion, recommending that businesses consider implementing a data lake in the cloud. “Based on their sophistication and deep pockets, many of the cloud providers have better security capabilities than most companies can afford to build on their own.”
She also points out that many of the data lake component providers and other data platform providers are releasing new types of metadata solutions to ensure data engineers and IT teams understand what data they have, where it goes and how it should be secured and managed.
While there are many advantages to adopting data lake technology, businesses should keep in mind that building a solid data lake architecture that delivers the analytics value a company needs is no small task. In fact, McKinsey & Company estimates that in many instances, the cost of a data lake investment can be millions of dollars. But for companies with growing volumes of data, the return on investment could be worth it.
“When data volumes start creeping into the tens of terabytes, it’s time to consider something in the lake vein,” says Henschen.