Metamarkets’ back-end technology
This is part of a three-post series:
- Introduction to Metamarkets and Druid
- Druid overview
- Metamarkets’ back-end technology (this post)
The canonical Metamarkets batch ingest pipeline is a bit complicated; a code sketch of the steps follows the list.
- Data lands on Amazon S3 (uploaded or because it was there all along).
- Metamarkets processes it, primarily via Hadoop and Pig, to summarize and denormalize it, and then puts it back into S3.
- Metamarkets then pulls the data into Hadoop a second time, to get it ready to be put into Druid.
- Druid is notified, and pulls the data from Hadoop at its convenience.
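To make the flow concrete, here’s a minimal Python sketch of those four steps. Every name in it is a hypothetical stand-in; Metamarkets’ actual jobs are Pig scripts and Hadoop jobs, and the S3 paths are made up.

```python
# A minimal orchestration sketch of the four batch-ingest steps above.
# Hypothetical stand-ins throughout; not Metamarkets' actual code.

def run_pig_summarization(src: str, dst: str) -> None:
    """Step 2: Hadoop/Pig job that summarizes and denormalizes raw data."""
    print(f"pig: summarize and denormalize {src} -> {dst}")

def build_druid_segments(src: str) -> list[str]:
    """Step 3: a second Hadoop pass that builds Druid segments."""
    print(f"hadoop: build segments from {src}")
    return ["events_2013-04-01", "events_2013-04-02"]

def register_segments(segments: list[str]) -> None:
    """Step 3, continued: record segment metadata (this goes into MySQL)."""
    print(f"mysql: register {segments}")

def notify_druid(segments: list[str]) -> None:
    """Step 4: tell Druid the segments exist; it pulls them at its convenience."""
    print(f"druid: notified about {segments}")

# Step 1: raw event data is assumed to be sitting in S3 already.
run_pig_summarization("s3://example-raw/", "s3://example-processed/")
segments = build_druid_segments("s3://example-processed/")
register_segments(segments)
notify_druid(segments)
```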
By “get data ready to be put into Druid” I mean:
- Build the data segments (recall that Druid manages data in rather large segments).
- Note metadata about the segments.
That metadata is what goes into the MySQL database, which also retains data about shards that have been invalidated. (That part is needed because of the MVCC scheme.)
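For illustration, here’s the kind of per-segment record that might live in that MySQL database. The field names are my guesses rather than the actual schema; the point to notice is the used flag, which lets invalidated shards be kept around (marked unused) instead of deleted, which is what the MVCC scheme needs.

```python
# A guessed-at sketch of one row of segment metadata; field names are
# illustrative, not the actual MySQL schema.

from dataclasses import dataclass

@dataclass
class SegmentMetadata:
    segment_id: str      # unique identifier for the segment
    data_source: str     # which Druid table the segment belongs to
    interval_start: str  # time range the segment covers
    interval_end: str
    version: str         # MVCC version; newer versions supersede older ones
    used: bool           # flipped to False when the segment is invalidated
    load_location: str   # where Druid should pull the segment from

row = SegmentMetadata(
    segment_id="events_2013-04-01_v1",
    data_source="events",
    interval_start="2013-04-01",
    interval_end="2013-04-02",
    version="v1",
    used=True,
    load_location="s3://example-segments/events/2013-04-01/v1",
)
print(row)
```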
By “build the data segments” I mean (there’s a toy sketch after this list):
- Make the sharding decisions.
- Arrange data columnarly within each shard.
- Build a compressed bitmap for each shard.
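Here’s a toy sketch of those three steps, with made-up data. Plain Python sets stand in for the compressed bitmaps a real Druid segment would use, and the hash-based sharding rule is purely illustrative.

```python
# Toy version of segment building: shard, arrange columnarly, build bitmaps.

rows = [
    {"page": "home",  "country": "US", "clicks": 3},
    {"page": "about", "country": "DE", "clicks": 1},
    {"page": "home",  "country": "DE", "clicks": 2},
    {"page": "home",  "country": "US", "clicks": 5},
]

# 1. Sharding decision: a trivial hash partition into two shards.
shards = [[], []]
for row in rows:
    shards[hash(row["page"]) % 2].append(row)

for shard in shards:
    if not shard:
        continue
    # 2. Columnar arrangement: one array per column instead of row objects.
    columns = {key: [r[key] for r in shard] for key in shard[0]}

    # 3. Bitmap index: for each value of a dimension, the set of row
    # offsets where it occurs (a compressed bitmap in the real system).
    country_bitmaps = {}
    for offset, value in enumerate(columns["country"]):
        country_bitmaps.setdefault(value, set()).add(offset)

    print(columns, country_bitmaps)
```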
When things are being done that way, Druid may be regarded as comprising three kinds of servers:
- Actual data storage.
- Query brokers, which also have local cache.
- Coordination/management, including administrative interfaces and so on.
This is in addition to the aforementioned Zookeeper and MySQL.
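To illustrate the division of labor on the query path, here’s a toy sketch in which the broker consults its local cache and, on a miss, fans a query out to the data-storage servers and merges their partial answers. The row-count “query” and all class names are invented for the example; coordination servers, Zookeeper, and MySQL are left out.

```python
# Toy scatter/gather query path: broker cache in front of data servers.

class DataServer:
    def __init__(self, segments: dict[str, list]):
        self.segments = segments  # segment_id -> rows held by this server

    def count_rows(self) -> int:
        # Each data server answers only for the segments it holds.
        return sum(len(rows) for rows in self.segments.values())

class Broker:
    def __init__(self, data_servers: list[DataServer]):
        self.data_servers = data_servers
        self.cache: dict[str, int] = {}  # local cache of completed results

    def query(self, q: str) -> int:
        if q in self.cache:
            return self.cache[q]  # served from the broker's local cache
        result = sum(s.count_rows() for s in self.data_servers)  # scatter/gather
        self.cache[q] = result
        return result

broker = Broker([DataServer({"seg1": [1, 2]}), DataServer({"seg2": [3]})])
print(broker.query("count_rows"))  # computed by the data servers: 3
print(broker.query("count_rows"))  # answered from the broker's cache
```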
It occurs to me that I don’t know whether that local cache is the only RAM tier, which is a pretty major point. Oh well …
The alternative is that data just streams into Druid. In that case:
- The various Hadoop pre-processing steps in the batch ingest process don’t happen.
- Instead, it’s assumed and required that the data already be structured properly for Druid, in terms of summarization and denormalization.
- An additional tier of Druid data storage servers is involved; in particular, that tier is what receives the streaming data.
- Metamarkets calls the data storage servers that receive data “Real-Time”, while the others are “Historical”.
- The Real-Time servers persist data, on an append-only basis, into proper Druid segments. After a while, these are flushed to the Historical servers.
- The Druid query broker sends queries to the Real-Time or Historical servers as appropriate, as sketched below.
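Here’s a sketch of that last routing decision, assuming, purely hypothetically, a fixed handoff age after which data has been flushed from the Real-Time servers to the Historical ones.

```python
# Toy routing: send query intervals to Real-Time and/or Historical tiers.

from datetime import datetime, timedelta

# Assumption for the example: data older than this has already been
# flushed from Real-Time to Historical servers.
HANDOFF_AGE = timedelta(hours=1)

def route(query_start: datetime, query_end: datetime, now: datetime) -> list[str]:
    cutover = now - HANDOFF_AGE
    tiers = []
    if query_start < cutover:
        tiers.append("historical")  # the older part of the interval
    if query_end > cutover:
        tiers.append("real-time")   # the recent, not-yet-flushed part
    return tiers

now = datetime(2013, 4, 1, 12, 0)
print(route(datetime(2013, 4, 1, 0, 0), now, now))             # both tiers
print(route(datetime(2013, 3, 1), datetime(2013, 3, 2), now))  # historical only
```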
Comments
“It occurs to me that I don’t know whether that local cache is the only RAM tier, which is a pretty major point. Oh well …”
This is probably the only RAM tier, but I must admit that I don’t quite understand what is meant by “RAM tier.”
If “RAM tier” means an application tier that cannot do its own processing, but just stupidly stores stuff in RAM (which it can lose on process death), then yes, this is the only RAM tier. The main compute nodes will retain their data between process death and restart.
If “RAM tier” refers to whether there is also a cache on the compute nodes, then this is the only RAM tier; the only data accessed on the compute nodes is the base table data. I.e., there is no “cache” local to a specific compute node; instead, all of the data is loaded up in main memory.
If “RAM tier” means something else, then it might not be the only RAM tier :).
Eric,
The question is:
Since you guys make such a large point of saying you do your processing based on in-memory data — on which tier is that data located in-memory?
Ah, that would be the compute nodes. The local cache in the query broker is just a lazy cache of completed query results; no real processing of data is done there.
Eric,
So those compute nodes are separate from the various data management and ingest tiers I wrote about?