
The Clearly Podcast
Data Lakes
Summary
This week we talk about data lakes. Essentially, a data lake is a mechanism to store large quantities of (typically) raw data, both structured and unstructured, bringing together data from across an organisation.
In a "traditional" data warehouse solution, we tend to think about an "Extract, Transform and Load" process: extracting the data from source, transforming it for analysis, and loading it into the data warehouse. With a data lake, the approach tends to be "Extract, Load, and Transform": data is extracted from source, loaded into the data lake, then transformed when needed.
This can simplify the process, as there is no need to transform the data for every scenario at build time, so implementation can be faster. The downside, of course, is that more of the work happens at run time. As such, it is probably not an either/or choice between data lakes and more structured systems; the two often complement each other.
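The ETL/ELT contrast above can be sketched in a few lines of pandas. This is a minimal illustration only, with made-up data and a temporary directory standing in for the lake; a real data lake would sit on object storage such as ADLS or S3:

```python
import pandas as pd
from pathlib import Path
from tempfile import TemporaryDirectory

# Toy source data standing in for an operational system (hypothetical values).
source = pd.DataFrame({
    "order_id": [1, 2, None, 4],
    "region": ["north", "south", "north", "south"],
    "amount": [100.0, 250.0, 50.0, 75.0],
})

with TemporaryDirectory() as tmp:
    store = Path(tmp)

    # ETL (warehouse style): transform first, then load only the curated result.
    curated = source.dropna(subset=["order_id"])
    curated.to_csv(store / "warehouse_sales.csv", index=False)

    # ELT (lake style): load the raw data untouched; transform at read time,
    # only when a particular analysis actually needs it.
    source.to_csv(store / "lake_sales_raw.csv", index=False)
    raw = pd.read_csv(store / "lake_sales_raw.csv")
    per_region = raw.dropna(subset=["order_id"]).groupby("region")["amount"].sum()

print(per_region.to_dict())
```

The build-time cost moves to read time: the ELT branch stores everything, including the bad row, and each consumer decides how to clean and shape the data for its own scenario.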
The flexibility of data lakes makes it tempting to dump anything and everything into the data lake. If this starts to happen without any curation, you are likely to end up with a data swamp rather than a data lake. Data lakes are not a way to avoid governance.
The main cloud players all offer some sort of data lake:
Azure Data Lake
AWS Data Lake
Google Data Lake
If you already use Power BI, or are considering it, we strongly recommend joining your local Power BI user group.
Transcript
Andy:
Welcome to today’s podcast on data lakes. Tom, can you explain what a data lake is?
Tom:
A data lake is a large repository for storing various types of data—structured (like relational databases), semi-structured (CSV files, Excel), and unstructured (text files, social media data, audio, video). Unlike a data warehouse, where data is structured when written, in a data lake, the structure is applied when data is read. This allows for more flexibility in handling raw data. Data lakes use an ELT (Extract, Load, Transform) process compared to the ETL (Extract, Transform, Load) process in data warehouses.
Andy:
How do we manage and make sense of unstructured data, like photos or PDFs, efficiently?
Shailan:
Data lakes can handle vast amounts of structured and unstructured data, up to petabytes in size. We can apply scripts and tools like NLP (Natural Language Processing) to analyze and extract patterns from unstructured data, providing valuable insights that may not be apparent initially.
Andy:
Can you give an example of data lake usage?
Shailan:
For instance, in training scenarios where audio is recorded, this data is stored in a data lake. Retrieval tools then analyze and tag key phrases. This approach is scalable, allowing for efficient processing and retrieval of large volumes of data.
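The key-phrase tagging Shailan describes can be illustrated with a naive sketch: count occurrences of phrases of interest in a transcribed recording. The transcript text and phrase list here are hypothetical, and a production system would use a proper NLP library rather than plain string matching:

```python
from collections import Counter
import re

# Hypothetical transcript snippet from a recorded training session.
transcript = (
    "Today we cover health and safety. Remember the fire exits. "
    "Health and safety training is repeated annually. Check the fire exits."
)

# Naive key-phrase tagging: count how often each phrase of interest appears.
key_phrases = ["health and safety", "fire exits"]
tags = Counter()
for phrase in key_phrases:
    tags[phrase] = len(re.findall(re.escape(phrase), transcript.lower()))

print(dict(tags))
```

In a real pipeline the raw audio would stay in the lake, the speech-to-text step would run as a batch job, and the resulting tags would make the recordings searchable.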
Tom:
Data lakes are cost-effective for storing large datasets. For example, IoT data from sensors can be stored in a data lake, where only a subset is regularly analyzed, with the full dataset available for machine learning or other batch processing tasks.
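Tom's IoT example boils down to keeping the full history cheaply while day-to-day analysis touches only a small slice. A rough sketch with synthetic sensor readings (all values hypothetical):

```python
from datetime import datetime, timedelta
import random

random.seed(0)

# Hypothetical IoT data: a year of hourly temperature readings, all retained.
start = datetime(2024, 1, 1)
readings = [
    {"ts": start + timedelta(hours=h), "temp_c": 20 + random.uniform(-5, 5)}
    for h in range(24 * 365)
]

# Regular analysis touches only the last 24 hours; the full dataset stays
# available for batch work such as training a machine-learning model.
cutoff = readings[-1]["ts"] - timedelta(hours=24)
recent = [r for r in readings if r["ts"] > cutoff]
avg_recent = sum(r["temp_c"] for r in recent) / len(recent)

print(len(readings), len(recent), round(avg_recent, 1))
```

The economics follow the same shape: storage for the full year is cheap, and compute is spent only on the subset a given job actually reads.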
Andy:
How do we prevent a data lake from becoming a data swamp, where it becomes disorganized and unmanageable?
Shailan:
Governance is crucial. Establish policies for archiving, retention, and data ownership. Like managing a file share, organize the data and ensure compliance with data management practices to avoid it becoming a dumping ground.
Tom:
Be selective about what data you capture. Only store data that has a clear use case to avoid unnecessary accumulation. This focused approach helps in maintaining the data lake efficiently.
Andy:
Will data lakes become a standard for organizations due to the rise of unstructured data?
Shailan:
Yes, due to the increasing volume and variety of data, data lakes are becoming more common. Tools like Azure Data Lake facilitate easy integration, making them accessible for many organizations.
Tom:
While data lakes will be ubiquitous, they won't replace structured databases entirely. They will coexist, each serving different needs based on the data and its use cases.
Andy:
What should organizations consider when deciding to implement a data lake?
Shailan:
Determine whether you need a data lake or a data warehouse based on the nature of your data (structured vs. unstructured) and your goals (analysis, reporting, accessibility).
Tom:
Start small and focused. Capture data that provides immediate value and expand from there to avoid creating a data swamp.
Andy:
Remember governance, archiving, and compliance to manage your data lake effectively. For more help, visit clearlycloudy.co.uk or clearlysolutions.net. Thank you for joining us.
Shailan and Tom:
Thank you!
Andy:
Have a great day, everyone!