To create maximum value from an organization’s data landscape, traditional decision support system architecture is no longer adequate. New architectural patterns need to be developed to harness the power of data. To fully capture the value of big data, organizations need flexible data architectures that let them extract maximum value from their data ecosystem.
The data lake architecture concept has been around for some time now. However, I have seen organizations struggle to understand it, as many of them are still boxed into the older paradigm of Enterprise Data Warehouses.
In this article, I will deep-dive into the conceptual constructs of the Data Lake Architecture pattern and lay out an architecture pattern.
Let's start with the known first.
Traditional Data Warehouse (DWH) Architecture:
The traditional Enterprise DWH architecture pattern has been used for many years. There are data sources; data is extracted, transformed, and loaded (ETL), and along the way we do some structure creation, cleansing, and so on. We predefine the data model in the EDW (a dimensional model or a 3NF model) and then create departmental data marts for reporting, OLAP cubes for slicing and dicing, and self-service BI.
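To make the flow above concrete, here is a minimal sketch of the schema-on-write idea, using Python's built-in sqlite3 module as a stand-in for the warehouse. The table names (fact_orders, mart_sales_by_region), the sample rows, and the cleansing rule are all hypothetical and chosen only for illustration; a real EDW would use a proper ETL tool and a far richer model.

```python
import sqlite3

# Hypothetical source rows, as they might arrive from an operational system.
SOURCE_ROWS = [
    {"order_id": "1", "customer": " Alice ", "amount": "120.50", "region": "EU"},
    {"order_id": "2", "customer": "Bob", "amount": "n/a", "region": "US"},
    {"order_id": "3", "customer": "Carol", "amount": "75", "region": "EU"},
]


def transform(row):
    """Cleanse one row; return None if it cannot fit the predefined model."""
    try:
        amount = float(row["amount"])
    except ValueError:
        return None  # the row does not match the schema, so it is discarded
    return (int(row["order_id"]), row["customer"].strip(), amount, row["region"])


def run_etl(rows):
    conn = sqlite3.connect(":memory:")
    # The data model is fixed before any data is loaded (schema-on-write).
    conn.execute(
        "CREATE TABLE fact_orders ("
        "order_id INTEGER PRIMARY KEY, customer TEXT NOT NULL, "
        "amount REAL NOT NULL, region TEXT NOT NULL)"
    )
    cleaned = [t for t in map(transform, rows) if t is not None]
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?, ?)", cleaned)
    # A departmental data mart, modeled here as a pre-aggregated table.
    conn.execute(
        "CREATE TABLE mart_sales_by_region AS "
        "SELECT region, SUM(amount) AS total_amount "
        "FROM fact_orders GROUP BY region"
    )
    conn.commit()
    return conn


conn = run_etl(SOURCE_ROWS)
mart = dict(conn.execute("SELECT region, total_amount FROM mart_sales_by_region"))
loaded = conn.execute("SELECT COUNT(*) FROM fact_orders").fetchone()[0]
rejected = len(SOURCE_ROWS) - loaded
print(mart, "rejected rows:", rejected)
```

Note that the row with the unparseable amount never reaches the warehouse at all. In this sketch, anything that does not fit the predefined model is lost at load time.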
This pattern is quite ubiquitous and has served us well for a long time now.
However, this pattern has some inherent challenges that prevent it from scaling in the era of Big Data. Let's look at a few of them:
Firstly, the philosophy with which we work is that we need to understand the data first. What is the source system structure, what kind of data does it hold, what is the cardinality, how should it be modeled based on the business requirements, are there any anomalies in the data, and so on. This is tedious and complex work. I used to spend at least 2–3 months in the requirement analysis and data analysis phase. EDW projects span from a few months to a few years. And this is all based on the assumption that the business knows the requirements.
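As a rough illustration of the kind of upfront data analysis described above, the sketch below profiles a handful of source rows for cardinality and missing values. The profile function, the column names, and the sample data are hypothetical; real profiling is usually done with dedicated tooling against the full source system.

```python
def profile(rows):
    """Report per-column cardinality and missing-value counts."""
    columns = sorted({key for row in rows for key in row})
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        # Treat absent keys, None, and empty strings as missing.
        present = [v for v in values if v not in (None, "")]
        report[col] = {
            "cardinality": len(set(present)),
            "missing": len(values) - len(present),
        }
    return report


# Hypothetical extract from a source system.
customers = [
    {"customer_id": "C1", "country": "DE", "email": "a@example.com"},
    {"customer_id": "C2", "country": "DE", "email": ""},
    {"customer_id": "C3", "country": "FR", "email": None},
    {"customer_id": "C3", "country": "FR"},
]

report = profile(customers)
for column, stats in report.items():
    print(column, stats)

# A key column whose cardinality is below the row count hints at duplicates.
has_duplicate_ids = report["customer_id"]["cardinality"] < len(customers)
```

Even this toy check surfaces two anomalies (a duplicated key and a mostly empty column), which gives a sense of why this phase takes so long on real source systems.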
We also have to make choices and compromises about which data to store and which data to discard. A lot of time is spent upfront deciding what to bring in, how to bring it in, how to store it, how to transform it, and so on. Less time is spent on actually performing data discovery, uncovering patterns, or creating new hypotheses that add business value.