Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

I have a dataset of 1 minute data of 1000 stocks since 1998, that total around (2012-1998)*(365*24*60)*1000 = 7.3 Billion rows.

Most (99.9%) of the time I will perform only read requests.

What is the best way to store this data in a db?

  • 1 big table with 7.3B rows?
  • 1000 tables (one for each stock symbol) with 7.3M rows each?
  • any recommendation of database engine? (I'm planning to use Amazon RDS' MySQL)

I'm not used to deal with datasets this big, so this is an excellent opportunity for me to learn. I will appreciate a lot your help and advice.

Edit:

This is a sample row:

'XX', 20041208, 938, 43.7444, 43.7541, 43.735, 43.7444, 35116.7, 1, 0, 0

Column 1 is the stock symbol, column 2 is the date, column 3 is the minute, the rest are open-high-low-close prices, volume, and 3 integer columns.

Most of the queries will be like "Give me the prices of AAPL between April 12 2012 12:15 and April 13 2012 12:52"

About the hardware: I plan to use Amazon RDS so I'm flexible on that

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
692 views
Welcome To Ask or Share your Answers For Others

1 Answer

So databases are for situations where you have a large complicated schema that is constantly changing. You only have one "table" with a hand-full of simple numeric fields. I would do it this way:

Prepare a C/C++ struct to hold the record format:

struct StockPrice
{
    char ticker_code[2];
    double stock_price;
    timespec when;
    etc
};

Then calculate sizeof(StockPrice[N]) where N is the number of records. (On a 64-bit system) It should only be a few hundred gig, and fit on a $50 HDD.

Then truncate a file to that size and mmap (on linux, or use CreateFileMapping on windows) it into memory:

//pseduo-code
file = open("my.data", WRITE_ONLY);
truncate(file, sizeof(StockPrice[N]));
void* p = mmap(file, WRITE_ONLY);

Cast the mmaped pointer to StockPrice*, and make a pass of your data filling out the array. Close the mmap, and now you will have your data in one big binary array in a file that can be mmaped again later.

StockPrice* stocks = (StockPrice*) p;
for (size_t i = 0; i < N; i++)
{
    stocks[i] = ParseNextStock(stock_indata_file);
}
close(file);

You can now mmap it again read-only from any program and your data will be readily available:

file = open("my.data", READ_ONLY);
StockPrice* stocks = (StockPrice*) mmap(file, READ_ONLY);

// do stuff with stocks;

So now you can treat it just like an in-memory array of structs. You can create various kinds of index data structures depending on what your "queries" are. The kernel will deal with swapping the data to/from disk transparently so it will be insanely fast.

If you expect to have a certain access pattern (for example contiguous date) it is best to sort the array in that order so it will hit the disk sequentially.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
...