I am trying to figure out exactly what these new fangled data stores such as bigtable, hbase and cassandra really are.
I work with massive amounts of stock market data, billions of rows of price/quote data that can add up to 100s of gigabytes every day (although these text files often compress by at least an order of magnitude). This data is basically a handful of numbers, two or three short strings and a timestamp (usually millisecond level). If I had to pick a unique identifier for each row, I would have to pick the whole row (since an exchange may generate multiple values for the same symbol in the same millisecond).
I suppose the simplest way to map this data to bigtable (I'm including its derivatives) is by symbol name and date (which may return a very large time series, more than million data points isn't unheard of). From reading their descriptions, it looks like multiple keys can be used with these systems. I'm also assuming that decimal numbers are not good candidates for keys.
Some of these systems (Cassandra, for example) claims to be able to do range queries. Would I be able to efficiently query, say, all values for MSFT, for a given day, between 11:00 am and 1:30 pm ?
What if I want to search across ALL symbols for a given day, and request all symbols that have a price between $10 and $10.25 (so I'm searching the values, and want keys returned as a result)?
What if I want to get two times series, subtract one from the other, and return the two times series and their result, will I have to do his logic in my own program?
Reading relevant papers seems to show that these systems are not a very good fit for massive time series systems. However, if systems such as google maps are based on them, I think time series should work as well. For example, think of time as the x-axis, prices as y-axis and symbols as named locations--all of a sudden it looks like bigtable should be the ideal store for time series (if the whole earth can be stored, retrieved, zoomed and annotated, stock market data should be trivial).
Can some expert point me in the right direction or clear up any misunderstandings.
Thanks
question from:https://stackoverflow.com/questions/1623399/storing-massive-ordered-time-series-data-in-bigtable-derivatives