How to Best Design a Data Lake Storage?
by Lynn Naing, Managing Consultant – Altis Sydney
As Data Lake popularity grows, the most common fundamental question we get from clients is ‘How to best design a Data Lake Storage?’. There is no silver bullet for a ‘best’ design, as different clients have different requirements and use cases. However, there is a good common foundational design that we start from. This design has five core zones and is technology agnostic.
The first zone is a transient folder where data lands. It is separated by Data Source. The purpose of this zone is to move data from the source to the Data Lake as quickly as possible and to reduce contention on source systems. Valid data is moved across to the Raw Zone, whereas invalid data is moved to a Bad/Quarantine folder for manual intervention.
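The routing step described above can be sketched in a few lines. This is a minimal illustration only, assuming a file-system style lake: the folder names `raw` and `quarantine`, the function `route_landed_file` and the `non_empty` check are all hypothetical, and a real validity check would depend on the source and the tools in use.

```python
import shutil
from pathlib import Path
from typing import Callable


def route_landed_file(
    landed: Path,
    lake_root: Path,
    source: str,
    is_valid: Callable[[Path], bool],
) -> Path:
    """Move one landed file to the raw or quarantine folder for its source."""
    # Folder names are assumptions made for this sketch.
    zone = "raw" if is_valid(landed) else "quarantine"
    target_dir = lake_root / zone / source
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / landed.name
    # Moving (not copying) keeps the landing folder transient.
    shutil.move(str(landed), str(target))
    return target


def non_empty(path: Path) -> bool:
    """Placeholder validity rule: the file must contain at least one byte."""
    return path.stat().st_size > 0
```

Because the file is moved rather than copied, the landing folder empties itself as files are processed, which matches its transient role.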
Valid data is moved to the Raw folder in its native format, where it is ready for ingestion by subsequent processes. Data is categorised into Data Source>Year>Month>Day>Hour. Folder granularity can change depending on the frequency of data transfer and on requirements. Categorising data this way also systematically archives it in its native format.
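As an illustration of the Data Source>Year>Month>Day>Hour layout, the sketch below builds a raw folder path from a source name and an arrival timestamp. The function name `raw_path`, the `raw` root folder and the `granularity` parameter are assumptions for this example, not part of any particular product.

```python
from datetime import datetime
from pathlib import PurePosixPath

LEVELS = ["year", "month", "day", "hour"]


def raw_path(
    source: str,
    arrived_at: datetime,
    filename: str,
    granularity: str = "hour",
) -> PurePosixPath:
    """Build raw/<source>/<year>/<month>/<day>/<hour>/<filename>.

    `granularity` names the deepest date folder to create, so a daily
    feed can stop at "day" instead of "hour".
    """
    if granularity not in LEVELS:
        raise ValueError(f"granularity must be one of {LEVELS}")
    parts = {
        "year": f"{arrived_at.year:04d}",
        "month": f"{arrived_at.month:02d}",
        "day": f"{arrived_at.day:02d}",
        "hour": f"{arrived_at.hour:02d}",
    }
    depth = LEVELS.index(granularity) + 1
    date_folders = [parts[level] for level in LEVELS[:depth]]
    return PurePosixPath("raw", source, *date_folders, filename)
```

Zero-padding the month, day and hour keeps folders in chronological order when they are listed alphabetically.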
The Standardized Zone is useful but optional. It is about standardising data to the format best suited to the curated layer, for example converting flat files to *.txt, photos to *.jpeg or videos to *.mkv. There is also an option here to perform some standard data cleansing when the same steps recur across multiple files, such as removing special characters and fixing character encoding. The folder structure is kept the same as in Raw, and files are copied and converted into the chosen formats.
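A common cleansing step of this kind might look like the sketch below, which re-encodes a text file to UTF-8 and strips control characters. It is only an example under stated assumptions: the source encoding (`cp1252` here) must be known or detected separately, "special characters" is interpreted as Unicode control and format characters, and the function name `standardise_text` is hypothetical.

```python
import unicodedata


def standardise_text(raw_bytes: bytes, source_encoding: str = "cp1252") -> bytes:
    """Return the file content as UTF-8 with control characters removed."""
    # Bytes that are undefined in the source encoding become U+FFFD
    # instead of raising, so one bad byte does not block the whole file.
    text = raw_bytes.decode(source_encoding, errors="replace")
    # Use one canonical form for accented characters.
    text = unicodedata.normalize("NFC", text)
    # Keep newlines and tabs; drop every other character in a Unicode
    # "C" (control/format/unassigned) category. This also removes "\r",
    # so Windows line endings become "\n".
    cleaned = "".join(
        ch for ch in text
        if ch in "\n\t" or not unicodedata.category(ch).startswith("C")
    )
    return cleaned.encode("utf-8")
```

The function works on bytes rather than file paths so that the same rule can be applied to every file copied from Raw, whatever storage technology sits underneath.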
Data is transformed, cleansed and ready for consumption in the Curated Zone. Data is categorised into Subject Area>Files. Depending on requirements and the tools selected, data can be partitioned at a specified level, such as Subject Area>Files>Year>Month or Subject Area>Files>Region. This zone is used by most data users, including BI developers.
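To make the partitioning option concrete, the sketch below groups records into Subject Area>Files>partition folders. The `curated` root folder, the function name `partition_records` and the dictionary-based record format are assumptions for illustration; in practice the chosen processing tool would usually handle partitioning itself.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Any, Dict, Iterable, List, Sequence


def partition_records(
    subject_area: str,
    dataset: str,
    records: Iterable[Dict[str, Any]],
    keys: Sequence[str],
) -> Dict[str, List[Dict[str, Any]]]:
    """Group records by the folder they belong to in the curated zone.

    `keys` lists the record fields used as partition levels, in order,
    for example ["year", "month"] or ["region"].
    """
    buckets: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for record in records:
        levels = [str(record[key]) for key in keys]
        folder = PurePosixPath("curated", subject_area, dataset, *levels)
        buckets[str(folder)].append(record)
    return dict(buckets)
```

For example, calling it with `keys=["region"]` on sales records places each record under a folder such as `curated/sales/orders/APAC`, while `keys=["year", "month"]` gives the Year>Month layout.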
The Sandbox zone is mainly for Data Scientists and Data Champions who understand both the data and the organisation’s business well. Data may originate from the Curated Zone as well as the Standardized Zone. Data is organised by Project. Data Scientists can further separate data per model within a project, and results can be written back to the Sandbox zone.
Adjustments to the design are encouraged, depending on an organisation’s data types (structured, semi-structured, unstructured), data usage, data availability and data security requirements. Likewise, the organisation’s use of the Data Lake for Data Warehouse, Applications, Master Data Management and other purposes will prompt further considerations and alterations to the design. To conclude, the five core zones are not a one-size-fits-all solution for every organisation, but they provide a good foundational design for Data Lake Storage.