Druid Databases – What are they and when are they useful?

Words by Alex Matheson

Databases come in all shapes and sizes. They’re designed for many different purposes, and some have unique, application-specific properties. Essentially, though, they’re a means of storing and retrieving data. Today we’re going to look at one database in particular: Druid.

Who uses Druid?

Druid is used by some of the biggest online organisations of the moment. Each one has specific audiences and goals, and each one makes use of Druid in unique ways.

Netflix uses Druid to aggregate many data streams, ingesting up to 2 terabytes of information each hour. Twitch, probably the best-known live streaming service in the world, uses Druid to let its staff drill into high-level metrics rather than reading generated reports.

The list goes on and includes such giants as Airbnb, British Telecom, PayPal, Salesforce and more. We here at Atlas also use Druid: it gives Atlas Boost a means of digesting data from Kafka, which in turn allows us to generate a suite of APIs for third-party dashboard applications including Grafana and MS Power BI.

What is Druid?

Druid is a column-oriented, open-source, distributed data store written in Java. That’s quite a mouthful so let’s break it down a bit.

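A column-oriented, or columnar, database is one that stores data by column rather than by row. Think of an Excel spreadsheet where you name each column and enter the relevant data below it. Because each column’s values are kept together, a query that needs only a few columns can skip the rest.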
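To make the row versus column distinction concrete, here is a minimal Python sketch. It is only an illustration of the general idea, not how Druid stores data internally; the sample records and the `to_columns` helper are made up for this example.

```python
# Hypothetical sample data: three page-view events stored row by row.
rows = [
    {"timestamp": "2024-01-01T10:00:00", "page": "/home", "views": 3},
    {"timestamp": "2024-01-01T10:05:00", "page": "/pricing", "views": 1},
    {"timestamp": "2024-01-01T11:00:00", "page": "/home", "views": 5},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every record has to be visited to read a single field.
total_from_rows = sum(record["views"] for record in rows)

# Column layout: the same aggregate reads one list and ignores the others.
total_from_columns = sum(columns["views"])
```

Both totals come out the same; the difference is how much data has to be read to get there, which matters once there are billions of rows and dozens of columns.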

Open-source software is distributed under licenses that allow users to study, change and distribute the source code. This approach can encourage collaborative development and innovation.

A distributed data store is one that exists, or can exist, across multiple physical systems. This is increasingly important in the digital age as databases need to serve and be accessed from physical locations almost anywhere on the globe.

Finally, Java is the programming language that Druid is written in.

Put this all together and you have an organised database that’s adaptable, cloud-native and supported by a large and committed community. This makes Druid an attractive option for a lot of organisations: support, both official and unofficial, is readily available, and Druid itself is highly adaptable to an organisation’s needs and goals.

What is Druid used for?

According to the official site, Druid is “…primarily used for business intelligence (OLAP) queries on event data.” To expand on that, Druid is most commonly used in applications where rapid analytics on large amounts of event data (data recording a change that occurs at a point in time) is crucial. Some of the key uses that the Druid website lists are:

  • Clickstream analytics (web and mobile analytics)
  • Network telemetry analytics (network performance monitoring)
  • Server metrics storage
  • Supply chain analytics (manufacturing metrics)
  • Application performance metrics
  • Digital marketing/advertising analytics
  • Business intelligence / OLAP
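To give a feel for what an OLAP-style query on event data looks like, the sketch below groups a handful of click events by hour and page, then sums them. It is plain Python rather than a Druid query, and the events and the `clicks_per_hour` function are hypothetical, chosen only to illustrate the shape of the problem.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical clickstream events: (timestamp, page, clicks).
events = [
    ("2024-01-01T10:02:11", "/home", 1),
    ("2024-01-01T10:47:30", "/pricing", 1),
    ("2024-01-01T10:59:59", "/home", 1),
    ("2024-01-01T11:15:00", "/home", 1),
]


def clicks_per_hour(records):
    """Sum clicks for each (hour, page) pair."""
    totals = defaultdict(int)
    for timestamp, page, clicks in records:
        # Truncate the event time to the start of its hour.
        hour = datetime.fromisoformat(timestamp).replace(
            minute=0, second=0, microsecond=0
        )
        totals[(hour.isoformat(), page)] += clicks
    return dict(totals)


hourly = clicks_per_hour(events)
```

Every use case in the list above boils down to some version of this: bucket events by time, slice them by one or more dimensions, and aggregate.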

How much data can Druid handle?

The amount of data that a Druid database can handle depends on the application and on how that specific deployment is built. However, some existing production Druid clusters have achieved impressive results, including:

  • 3+ trillion events/month
  • 3M+ events/sec through Druid’s real-time ingestion
  • 100+ PB of raw data
  • 50+ trillion events
  • Thousands of queries per second for applications used by thousands of users
  • Tens of thousands of cores

When should I use Druid?

Druid is best used when you have a large amount of incoming data that doesn’t require updates. It’s also a good fit when your data has a time component and your queries need low latency.

For these reasons, Druid is commonly used to power graphical user interfaces (GUIs) for analytics applications. It also fits neatly into the back-end of highly concurrent APIs that need fast aggregations.

When should you not use Druid?

Due to its focus on real-time analytics of incoming data, Druid is less useful when you’re looking to process purely historical data, though it can be, and is, used for processing specific types of historical data. Druid excels at handling streaming inserts (taking in new data as it arrives), but not streaming updates (changing data that has already been stored).
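The difference between inserts and updates can be sketched with a toy append-only store. This is a generic illustration of the pattern, not Druid’s actual ingestion mechanism; the `AppendOnlyEvents` class and its methods are hypothetical.

```python
class AppendOnlyEvents:
    """Hypothetical event store that accepts inserts but no in-place updates."""

    def __init__(self):
        self._events = []

    def insert(self, timestamp, key, value):
        # New events are only ever added to the end of the log.
        self._events.append((timestamp, key, value))

    def total(self, key):
        # Aggregates are computed by scanning the recorded events.
        return sum(value for _, k, value in self._events if k == key)


store = AppendOnlyEvents()
store.insert("2024-01-01T10:00:00", "orders", 5)
store.insert("2024-01-01T10:01:00", "orders", 3)

# A correction arrives as a new event (-1) instead of editing an earlier one.
store.insert("2024-01-01T10:02:00", "orders", -1)

order_total = store.total("orders")
```

If your workload is mostly a stream of new facts like this, an insert-focused store is a natural fit. If it constantly needs to go back and rewrite individual records, it probably isn’t.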

So what database should you use in place of Druid for these sorts of applications? Well, that’s a question and answer for another Monday.