How would you tackle the following storage and retrieval problem?

Roughly 2.000.000 rows will be added each day (365 days/year) with the following information per row:

  • id (unique row identifier)
  • entity_id (takes on values between 1 and 2.000.000 inclusive)
  • date_id (incremented by one each day - will take on values between 1 and 3.650 (ten years: 365*10))
  • value_1 (takes on values between 1 and 1.000.000 inclusive)
  • value_2 (takes on values between 1 and 1.000.000 inclusive)

entity_id combined with date_id is unique. Hence, at most one row per entity and date can be added to the table. The database must be able to hold 10 years' worth of daily data (7.300.000.000 rows = 3.650 * 2.000.000).
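
For concreteness, here is the shape of one row as a plain Python record (the field names come from the list above; the class itself is only illustrative):

    from dataclasses import dataclass

    @dataclass
    class Row:
        id: int         # unique row identifier
        entity_id: int  # 1 .. 2.000.000 inclusive
        date_id: int    # 1 .. 3.650 (day number)
        value_1: int    # 1 .. 1.000.000 inclusive
        value_2: int    # 1 .. 1.000.000 inclusive

    # (entity_id, date_id) is unique: at most one row per entity per day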

What is described above is the write pattern. The read pattern is simple: all queries will be made on a specific entity_id, i.e. retrieve all rows describing entity_id = 12345.

Transactional support is not needed, but the storage solution must be open source. Ideally I'd like to use MySQL, but I'm open to suggestions.

Now - how would you tackle the described problem?

Update: I was asked to elaborate on the read and write patterns. Writes to the table will be done in one batch per day, in which the new 2M entries are added in one go. Reads will be done continuously, with one read every second.


1 Answer

"Now - how would you tackle the described problem?"

With simple flat files.

Here's why

"all queries will be made on a specific entity_id. I.e. retrieve all rows describing entity_id = 12345."

You have 2.000.000 entities. Partition based on entity number:

level1 = entity / 10000
level2 = (entity / 100) % 100
level3 = entity % 100

Each file of data is then stored as level1/level2/level3/batch_of_data.
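
A minimal sketch of that partitioning in Python (the directory root and the one-directory-per-entity layout are assumptions here; the three-level split is just the formula above, using integer division):

    import os

    def partition_path(entity_id: int, root: str = "data") -> str:
        """Map an entity_id onto its level1/level2/level3 directory."""
        level1 = entity_id // 10000        # 0 .. 200
        level2 = (entity_id // 100) % 100  # 0 .. 99
        level3 = entity_id % 100           # 0 .. 99
        return os.path.join(root, str(level1), str(level2), str(level3))

    # e.g. entity_id 12345 lives under data/1/23/45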

You can then read all of the files in a given part of the directory to return samples for processing.
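
A matching sketch of the read side, reusing partition_path from above and assuming each daily batch is one comma-separated file inside the entity's directory (file naming and field order are assumptions):

    import glob
    import os

    def read_entity(entity_id: int, root: str = "data"):
        """Yield every stored row for one entity by scanning its directory."""
        directory = partition_path(entity_id, root)
        for path in sorted(glob.glob(os.path.join(directory, "*"))):
            with open(path) as f:
                for line in f:
                    # id, entity_id, date_id, value_1, value_2
                    yield tuple(int(field) for field in line.split(","))

    rows = list(read_entity(12345))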

If someone wants a relational database, then load files for a given entity_id into a database for their use.
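
If that is ever needed, a minimal sketch of the load, reusing read_entity from above and using SQLite as a stand-in for whichever open-source database is preferred (table and column names are assumptions; MySQL could be fed the same files with LOAD DATA INFILE):

    import sqlite3

    def load_entity(entity_id: int, db_path: str = ":memory:") -> sqlite3.Connection:
        """Copy one entity's rows from the flat files into a throwaway SQL table."""
        conn = sqlite3.connect(db_path)
        conn.execute(
            "CREATE TABLE IF NOT EXISTS rows "
            "(id INTEGER, entity_id INTEGER, date_id INTEGER, "
            "value_1 INTEGER, value_2 INTEGER)"
        )
        conn.executemany("INSERT INTO rows VALUES (?, ?, ?, ?, ?)", read_entity(entity_id))
        conn.commit()
        return conn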


Edit on day numbers.

  1. The date_id/entity_id uniqueness rule is not something that has to be handled. It's (a) trivially imposed on the file names and (b) irrelevant for querying.

  2. The date_id "rollover" doesn't mean anything -- there's no query, so there's no need to rename anything. The date_id should simply grow without bound from the epoch date. If you want to purge old data, then delete the old files.

Since no query relies on date_id, nothing ever needs to be done with it. It can be the file name for all that it matters.

To include the date_id in the result set, write it in the file with the other four attributes that are in each row of the file.


Edit on open/close

For writing, you have to leave the file(s) open. Do periodic flushes (or close/reopen) to ensure that the data really does reach disk.
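
A sketch of the append-and-flush step, assuming the same comma-separated row layout as above (flush pushes Python's buffer to the OS, fsync asks the OS to push it to disk):

    import os

    def append_rows(path: str, rows) -> None:
        """Append rows to a batch file and force them onto disk."""
        with open(path, "a") as f:
            for row in rows:
                # id, entity_id, date_id, value_1, value_2 on one line
                f.write(",".join(str(field) for field in row) + "\n")
            f.flush()             # push Python's buffer to the OS
            os.fsync(f.fileno())  # ask the OS to push it to the disk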

You have two choices for the architecture of your writer.

  1. Have a single "writer" process that consolidates the data from the various source(s). This is helpful if queries are relatively frequent. You pay for merging the data at write time.

  2. Have several files open concurrently for writing. When querying, merge these files into a single result (sketched below). This is helpful if queries are relatively rare. You pay for merging the data at query time.
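
A sketch of that query-time merge, assuming each writer keeps its own file sorted by date_id so the files can be interleaved lazily:

    import heapq

    def merge_for_query(paths):
        """Merge several per-writer files into one result ordered by date_id."""
        def rows(path):
            with open(path) as f:
                for line in f:
                    id_, entity_id, date_id, v1, v2 = map(int, line.split(","))
                    yield (date_id, id_, entity_id, v1, v2)

        # each input file is assumed to already be sorted by date_id
        yield from heapq.merge(*(rows(p) for p in paths))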

