HACKER Q&A
📣 psankar

What database solution would you choose for this type of workload?


We have a telemetry gathering system. We get about 250 rps to our backend systems, which dump this data into a database. We get ~25 million records per day, occupying ~12.5GB of disk space per day (inclusive of the Elastic index). We use Elastic as the backend.
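
A rough back-of-envelope on those numbers, for context (Python; the figures are the ones quoted above):

    # Back-of-envelope check of the workload figures above
    records_per_day = 25_000_000
    bytes_per_day = 12.5 * 1024**3            # ~12.5 GB/day, elastic index included
    seconds_per_day = 24 * 60 * 60

    print(records_per_day / seconds_per_day)  # ~289 records/s, in line with ~250 rps
    print(bytes_per_day / records_per_day)    # ~537 bytes per record, index included
    print(30 * 12.5)                          # ~375 GB of data per month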

We chose Elastic because we did not know what fields we might get. There are ~40 fields per record, and some fields may be missing in some records. Coming to the reads, the saved data is queried in a time-series fashion: we will mostly access data for the last 1 hour, 1 week, or 1 month.

The data will also be queried with a lot of complex aggregations and joins. For example: what is the avg response time (which is a field) of all the requests that went from node (another field) appServer1 to node appServer2, last month?
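
To make the query shape concrete, that example looks roughly like the aggregation below (written as a Python dict; the field names src_node, dst_node, response_time and @timestamp are placeholders, not our real mapping):

    # Sketch of the example aggregation above; field names are placeholders.
    query = {
        "size": 0,
        "query": {
            "bool": {
                "filter": [
                    {"term": {"src_node": "appServer1"}},
                    {"term": {"dst_node": "appServer2"}},
                    {"range": {"@timestamp": {"gte": "now-1M"}}},
                ]
            }
        },
        "aggs": {
            "avg_response_time": {"avg": {"field": "response_time"}}
        },
    }
    # This body goes to the index's _search endpoint; queries of this shape
    # are the ones that slow down once we cross ~10 million records.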

With Elastic we are quite happy until there are ~10 million records, beyond which response times just lag. Each query takes 10+ seconds once we are past 10 million records. We thought it could be due to reader starvation (because of the incoming writes), so we created a read-only replica and ran our queries against that replica. But even there the slowness is felt.

In terms of memory, we have given about 1GB to esmaster, 2GB to esdata, and another 1GB to esclient. All persistent disks are normal HDDs, not SSDs. We are not too worried about disk space and can add more of it (but not SSDs). We will not, however, be able to provision more RAM.

Also, we want to be able to add columns/fields without any stop-the-world migrations. This is what makes me nervous about choosing databases that want a schema up front.

The writes can take time (e.g. updating multiple indexes on each write), but the reads have to be fast. We are also fine with a hybrid approach where one system receives the writes and another system does the analytics. The system has to run on Kubernetes across clouds, so single-cloud solutions (like DynamoDB) are ruled out.

Given the load and the above constraints, what kind of database solution would you choose?


  👤 jlgaddis Accepted Answer ✓
You say you can't add RAM or SSDs but that sounds like exactly what you need. If your data doesn't fit in RAM (and, with 25MM records/day, it doesn't), then you're gonna be hitting those slow HDDs when you run queries.
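
Rough math with the numbers from your post:

    # Working set vs. RAM, using the figures quoted in the question
    gb_per_day = 12.5
    heap_gb = 1 + 2 + 1                       # esmaster + esdata + esclient

    for label, days in [("1 hour", 1 / 24), ("1 week", 7), ("1 month", 30)]:
        data_gb = gb_per_day * days
        print(f"{label}: ~{data_gb:.1f} GB touched vs {heap_gb} GB of heap")
    # ~0.5 GB for an hour can sit in memory; ~87.5 GB (week) and ~375 GB (month)
    # mean most of every longer query is served straight off the spinning disks.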

I'd add RAM first and SSDs second, but you say you can't do either of those? Well, good luck, I guess -- any "database solution" you choose is going to perform like crap given these resource constraints.


👤 speedgoose
Have you tried column-oriented databases such as Cassandra, HPE Vertica, or ClickHouse?
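
For instance, a ClickHouse schema for this workload could look roughly like the sketch below (shown as Python strings purely for illustration; all column names are made up):

    # Hypothetical ClickHouse schema for this kind of telemetry
    ddl = """
    CREATE TABLE telemetry (
        ts          DateTime,
        src_node    LowCardinality(String),
        dst_node    LowCardinality(String),
        response_ms Float64
        -- plus the other ~40 fields; absent values just take column defaults
    ) ENGINE = MergeTree
    PARTITION BY toYYYYMMDD(ts)
    ORDER BY (src_node, dst_node, ts)
    """

    # The example query from the question becomes a plain aggregation:
    monthly_avg = """
    SELECT avg(response_ms)
    FROM telemetry
    WHERE src_node = 'appServer1'
      AND dst_node = 'appServer2'
      AND ts >= now() - INTERVAL 1 MONTH
    """
    # New fields can be added later with ALTER TABLE ... ADD COLUMN, which in
    # ClickHouse is a lightweight metadata change rather than a table rewrite.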

Have you tried a "big data" approach like Apache Spark on Parquet files?
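
A minimal sketch of that, assuming the telemetry were written out as Parquet files (the path and column names are made up):

    # The example aggregation from the question, run with PySpark over Parquet
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("telemetry-aggregations").getOrCreate()

    # mergeSchema lets later files carry new columns without a migration
    df = (spark.read.option("mergeSchema", "true")
                    .parquet("/data/telemetry/"))         # illustrative path

    (df.filter((F.col("src_node") == "appServer1") &
               (F.col("dst_node") == "appServer2"))
       .filter(F.col("ts") >= F.date_sub(F.current_date(), 30))
       .agg(F.avg("response_time").alias("avg_response_time"))
       .show())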

Did you consider saving less data? Do you really need to save that much data? Can't you sample the data and save a lot less while keeping the same information overall?


👤 ianamartin
When you say you have 25 million records per day, taking 25 GB of disk space, can you clarify what you mean by that?

Does each day of 25 million records take 25GB of disk space? Or are you saying that it cumulatively takes 25 GB up until now?

1 million records per GB would indicate a really, really large record.