What Do You Get When You Cross A Data Warehouse With A Data Lake?
by Kent Teague, Managing Consultant – Altis Melbourne
Did you guess Data LakeHouse? And no, I’m not talking about the latest episode of Grand Designs. Although, like Kevin McCloud following design projects from the laying of the foundation through to the building of dream homes, we’ll be looking at the foundation of the latest data platform architecture paradigm: the Data LakeHouse.
But before we jump right in and talk about the Data LakeHouse, let’s revisit the definition of the Data Lake as coined by James Dixon, the founder and CTO of Pentaho:
“If you think of a DataMart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
So, what is a Data LakeHouse? The Data LakeHouse is essentially a hybrid concept that offers the key features of both a Data Lake and a Data Warehouse. It effectively serves as the middle ground between the two, combining the structured, standardised, and connected data entities found in a traditional Data Warehouse with the low-cost, flexible storage of a Data Lake.
In other words, as described by Databricks: “They are what you would get if you had to redesign data warehouses in the modern world, now that cheap and highly reliable storage (in the form of object stores) are available”.
In addition, the design of the Data LakeHouse serves to address two criticisms: that a Data Warehouse requires a large amount of upfront effort to cleanse, standardise, and build relationships between entities, whereas a Data Lake requires too little effort in these areas.
So, what are some of the real-world benefits a Data LakeHouse implementation offers? Databricks lists the following as key benefits of the Data LakeHouse:
- ACID Transaction support
- Schema enforcement and governance
- Using BI tools directly over source data
- Storage is decoupled from compute
- Open and standardised storage formats
- Support for diverse data types ranging from unstructured to structured data
- Support for diverse workloads
- End-to-end streaming and real-time reporting
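To make the first two benefits a little more concrete, here is a minimal sketch in plain Python of what schema enforcement and all-or-nothing writes mean in practice. It is a toy, in-memory illustration only: `ToyTable`, `SchemaError`, and the `sales` example are hypothetical names invented for this sketch, not part of Databricks or any real lakehouse product, and a real implementation would persist data to object storage and coordinate writers through a transaction log.

```python
from dataclasses import dataclass, field


class SchemaError(ValueError):
    """Raised when a record does not match the table's declared schema."""


@dataclass
class ToyTable:
    # Hypothetical stand-in for a lakehouse table; not a real library API.
    schema: dict[str, type]
    rows: list[dict] = field(default_factory=list)

    def _validate(self, record: dict) -> None:
        # Schema enforcement: column names and value types must match exactly.
        if set(record) != set(self.schema):
            raise SchemaError(
                f"expected columns {sorted(self.schema)}, got {sorted(record)}"
            )
        for column, expected in self.schema.items():
            if not isinstance(record[column], expected):
                raise SchemaError(f"column {column!r} expects {expected.__name__}")

    def append(self, batch: list[dict]) -> None:
        # Validate the whole batch before writing anything, so one bad
        # record rejects the entire batch (an all-or-nothing write).
        for record in batch:
            self._validate(record)
        self.rows.extend(dict(record) for record in batch)


sales = ToyTable(schema={"order_id": int, "amount": float})
sales.append([{"order_id": 1, "amount": 9.5}])

try:
    # The second record has the wrong type, so neither record is stored.
    sales.append([{"order_id": 2, "amount": 3.0}, {"order_id": "x", "amount": 1.0}])
except SchemaError as err:
    print("rejected:", err)

print(len(sales.rows))  # still 1: the failed batch left no partial data behind
```

The design point is that validation happens before any row is stored, which is the behaviour a plain Data Lake of loose files does not give you by default.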
For those interested in reading further about the Data LakeHouse, I recommend starting with the following:
To discuss this topic further, or for an initial review of an existing or planned data platform, get in touch with us.