What is a data lake?

Words by Martin Kelman

A data lake is a database system that combines various types of structured and unstructured data, with the capacity and scale to match. But that doesn’t tell the whole story.

We have to ask the question: why have data lakes become so fashionable? The main reason is the way we use the internet now and, in particular, the rise of social media.

These platforms needed new levels of flexibility and scalability that the older SQL database technologies didn’t offer. SQL databases are great at creating fixed table structures and schemas onto which rules and constraints can be applied. This makes SQL perfect for 3-tier (UI, business logic, database) application architectures, but it’s not so great for using that data to generate extra business value, analytics and AI bots.
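To make the point about fixed schemas and constraints concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table, its columns and the sample rows are all hypothetical, chosen only to show a schema accepting data that fits its rules and rejecting data that does not.

```python
import sqlite3

# In-memory database; the table and column names are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customers (
        id    INTEGER PRIMARY KEY,
        email TEXT NOT NULL UNIQUE,
        age   INTEGER CHECK (age >= 0)
    )
    """
)


def try_insert(row):
    """Return True if the row satisfies the schema's constraints, else False."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO customers (email, age) VALUES (:email, :age)", row
            )
        return True
    except sqlite3.IntegrityError:
        return False


accepted = try_insert({"email": "a@example.com", "age": 34})      # fits the rules
duplicate = try_insert({"email": "a@example.com", "age": 40})     # breaks UNIQUE
negative_age = try_insert({"email": "b@example.com", "age": -1})  # breaks CHECK
row_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
```

Only the first row is stored. That strictness is what an application wants from its database, but it also means anything that doesn't match the predefined shape has nowhere to go.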

Before data lakes came about, ETL (extract, transform and load) approaches were used. However, the fixed structure that is SQL’s inherent strength became a weakness. The more data-centric IT architectures become, the less beneficial application-centric architectures are.

So, with the rise of social media comes the need for data, and with the need for data comes a need for new approaches and new database technologies. These technologies fall into two main categories:

  • Databases that can handle time series/event data
  • Databases that can cope with unstructured or even unexpected data
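The two categories above can be sketched in a few lines of plain Python. This is only an illustration of the ideas, not how any particular product works internally: the records, field names, timestamps and helper functions are all invented for the example.

```python
from collections import defaultdict

# Document-style records: each one may carry different fields.
# All field names and values are made up for illustration.
documents = [
    {"user": "ana", "type": "post", "text": "Hello"},
    {"user": "ben", "type": "like", "target_post": 1},
    {"user": "ana", "type": "photo", "tags": ["beach"],
     "geo": {"lat": 51.5, "lon": -0.1}},
]


def find(docs, **criteria):
    """Return the documents whose fields match every given criterion."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]


# Time-series events: (timestamp in milliseconds, measured value).
events = [(1000, 2.0), (1400, 4.0), (2100, 10.0), (2900, 20.0)]


def average_per_bucket(points, bucket_ms):
    """Group points into fixed-width time buckets and average each bucket."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts // bucket_ms * bucket_ms].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}


ana_docs = find(documents, user="ana")         # two records with different shapes
per_second = average_per_bucket(events, 1000)  # {1000: 3.0, 2000: 15.0}
```

Nothing had to be declared up front for the third record to carry tags and a location, and the event data can be summarised at whatever time resolution the question calls for.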

Generally, unstructured databases are known as NoSQL or document databases. Time-series databases are high-performance, real-time analytics databases that can handle petabytes of data with millisecond resolution. These new technologies have the following advantages:

  • Improved customer interactions
  • Improved R&D innovation choices
  • Increased operational efficiencies

In the end, a data lake can be what you want it to be. We have more choice of database technologies than we have ever had. Each database has its strengths and weaknesses, but one thing is unavoidable: if you put rubbish data in, you will get rubbish data out.