Introduction to CitusDB
One of my lesser-known clients is Citus Data, a largely Turkish company, albeit one headquartered in San Francisco. They make CitusDB, which puts a scale-out layer over a collection of fully-functional PostgreSQL nodes, much like Greenplum and Aster Data before it. However, in contrast to those and other Postgres-based analytic MPP (Massively Parallel Processing) DBMS:
- CitusDB does not permanently fork PostgreSQL; Citus Data has committed to always working with the latest PostgreSQL release, or at least with one that’s less than a year old.
- Citus Data never made the “fat head” mistake — if a join can’t be executed directly on the CitusDB data-storing nodes, it can’t be executed in CitusDB at all.
- CitusDB follows the modern best-practice of having many virtual nodes on each physical node. The default size of a virtual node is one gigabyte. Each virtual node is technically its own PostgreSQL table.*
- Citus Data has already introduced an open source column-store option for PostgreSQL, which CitusDB of course exploits.
*One benefit to this strategy, besides the usual elasticity and recovery stuff, is that while PostgreSQL may be single-core for any given query, a CitusDB query can use multiple cores by virtue of hitting multiple PostgreSQL tables on each node.
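The scatter/gather idea in that footnote can be sketched in a few lines. This is a toy model, not Citus code: each inner list stands in for one shard (its own PostgreSQL table), and each per-shard task stands in for a per-shard SQL query running in its own PostgreSQL backend.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy stand-in for a sharded table: each inner list plays the role of
# one shard (its own PostgreSQL table in CitusDB). Together they hold 0..999.
SHARDS = [list(range(i, 1000, 4)) for i in range(4)]

def shard_sum(shard):
    # Stands in for "SELECT sum(x) FROM shard_nnn"; in the real system
    # each such query runs in its own PostgreSQL backend, hence can use
    # its own core.
    return sum(shard)

def distributed_sum(shards):
    # Scatter: one task per shard (threads here only for brevity; the
    # real system gets its parallelism from separate backend processes).
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(shard_sum, shards))
    # Gather: the master combines the partial aggregates.
    return sum(partials)

print(distributed_sum(SHARDS))  # sum(range(1000)) == 499500
```

The same scatter/gather shape works for any aggregate that can be split into per-shard partials plus a final combine step, which is why SUM and COUNT parallelize easily while exact medians (noted below) do not.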
Citus has thrown a few things against the wall; for example, there are two versions of its product, one of which involves HDFS (Hadoop Distributed File System) and one of which doesn't. But I think Citus' focus will be scale-out PostgreSQL for at least the medium-term future. Citus does have actual customers, and they weren't all PostgreSQL users previously. Still, the main hope — at least until the product is more built-out — is that existing PostgreSQL users will find CitusDB easy to adopt, in technology and price alike.
Notwithstanding what I said about “fat heads”, CitusDB does have a concept of Master nodes. These:
- Also use single-node copies of PostgreSQL.
- Are blessedly able to scale out, although their underlying databases are entirely replicated.
- Store no actual data, but do store metadata about each virtual node, including:
- Structural metadata.
- Location.
- Min/max column values (for data skipping).
- But not (yet) stats to help with query optimization.
- Do some query planning and rewriting.
- Handle administration, some of which is nicely parallelized/centralized. (E.g., an index choice can be made once and automatically propagated across all the relevant virtual nodes.)
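The min/max metadata mentioned above enables shard pruning: the master need not even contact a virtual node whose value range cannot match the query's predicate. Here is a minimal sketch of that idea, with a hypothetical metadata layout (the field names are mine, not Citus's):

```python
# Hypothetical master-node metadata: one entry per virtual node, with the
# min/max values it holds for a filtered column.
SHARD_METADATA = [
    {"name": "events_1", "min": 1,   "max": 100},
    {"name": "events_2", "min": 101, "max": 200},
    {"name": "events_3", "min": 201, "max": 300},
]

def shards_to_scan(lo, hi):
    # Data skipping: a shard can be pruned whenever its [min, max] range
    # cannot overlap the predicate range lo <= col <= hi.
    return [s["name"] for s in SHARD_METADATA
            if s["max"] >= lo and s["min"] <= hi]

print(shards_to_scan(150, 250))  # ['events_2', 'events_3']
print(shards_to_scan(400, 500))  # [] -- nothing needs to be scanned
```

The payoff is that query cost scales with the number of shards that can match, not the number of shards that exist — provided the filtered column correlates with how data landed in shards.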
CitusDB is definitely in its early days. For example:
- If I understand correctly, the recent CitusDB 3.0 release is the first one on which data is redistributed among shards. Before that, you could only join tables that were either sharded on the same key, or else small enough to be broadcast-replicated across the whole cluster.
- SQL coverage isn’t great. (E.g., no Windowing.)
- Some hard-to-parallelize things aren’t implemented yet, e.g. exact median or generally-usable COUNT DISTINCT.
- ACID is still lacking. Writes are batch-only, whether micro-batch or larger, as the case may be.
- CitusDB’s backup story is primitive, with the main options being:
- You can rely on having replicas on multiple nodes, even — if you like — in different data centers.
- You can back up each of the PostgreSQL nodes separately; CitusDB doesn't yet offer automation for that.
- CitusDB’s query optimization sounds pretty primitive.
- I don’t recall Citus telling me of serious workload management.
- CitusDB compression is block-level only. (PostgreSQL’s version of Lempel-Ziv.)
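The point above about joins — pre-3.0, tables had to be sharded on the same key or small enough to broadcast — rests on co-location: if both tables are distributed by the same function of the join key, matching rows always land on the same shard and the join runs locally with no data movement. A toy sketch of that invariant (the hash placement is illustrative, not Citus's actual scheme):

```python
from zlib import crc32

N_SHARDS = 4

def shard_of(key):
    # Deterministic hash placement; purely illustrative.
    return crc32(str(key).encode()) % N_SHARDS

def place(rows):
    # Distribute rows by their first column (the distribution key).
    shards = [[] for _ in range(N_SHARDS)]
    for row in rows:
        shards[shard_of(row[0])].append(row)
    return shards

orders    = [(101, "book"), (202, "lamp"), (101, "pen")]
customers = [(101, "Ada"), (202, "Bob")]

# Both tables are distributed on customer id with the SAME function, so
# rows with equal keys are guaranteed to be on the same shard ...
o_shards, c_shards = place(orders), place(customers)

joined = []
for o_shard, c_shard in zip(o_shards, c_shards):
    # ... which lets each shard join its local pieces independently.
    joined += [(o, c) for o in o_shard for c in c_shard if o[0] == c[0]]

print(len(joined))  # 3 matched order/customer pairs
```

When tables are *not* co-located, either one side must be small enough to copy to every node, or rows must be rehashed and shipped between nodes — the redistribution capability that, per the above, arrived in CitusDB 3.0.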
Still, the Citus Data folks seem to have good ideas, including some — as yet undisclosed — plans going forward. So if it sounds as if CitusDB might fit your needs better than more established scale-out RDBMS do, I’d encourage you to take a look at what Citus offers.
Comments
A few comments / clarifications:
(1) We have fully parallelized COUNT DISTINCT and have customers relying on it in production. For the fastest results, we also compute close approximations (via HyperLogLog cardinality estimation).
(2) We use a “modular block” architecture that improves upon the approach of using multiple virtual nodes / database instances per physical node. The conceptual difference is that instead of having 1000 virtual databases on each node, CitusDB has 1 database with 1000 tables on a node. This is similar to HDFS in spirit, and has important implications on improved resource utilization, fault tolerance, and elastic scalability.
CitusDB query optimization works in two stages: (1) The global optimizer first optimizes for network I/O, using the metadata it has about each modular block. (2) The local optimizer then optimizes for disk I/O, leveraging all of Postgres's statistics collection. All put together, significant information is taken into account during query optimization — we are making further improvements, of course.
On the count distincts, more precisely:
– If the database is already partitioned on the distinct column, we give exact results by pushing down the distinct.
– If not, we use the HyperLogLog approximation to parallelize the computation. (We are using the popular postgres-hll extension for it.)
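The key property behind that second case is that HyperLogLog sketches merge losslessly: each shard builds a small fixed-size sketch over its local rows, and the master combines them by register-wise max to estimate the cluster-wide distinct count. Below is a minimal from-scratch HyperLogLog to illustrate the merge step — it is not the postgres-hll extension, just a sketch of the same idea:

```python
import hashlib
import math

P = 10          # 2**10 = 1024 registers per sketch
M = 1 << P

def h64(value):
    # Deterministic 64-bit hash of the value.
    digest = hashlib.sha1(str(value).encode()).digest()
    return int.from_bytes(digest[:8], "big")

def empty_sketch():
    return [0] * M

def add(sketch, value):
    h = h64(value)
    idx = h & (M - 1)                         # low P bits pick a register
    rest = h >> P                             # remaining 64 - P bits
    rank = (64 - P) - rest.bit_length() + 1   # leading zeros + 1
    sketch[idx] = max(sketch[idx], rank)

def merge(a, b):
    # Register-wise max: this is why per-shard sketches can be combined
    # at the master into one cluster-wide estimate.
    return [max(x, y) for x, y in zip(a, b)]

def estimate(sketch):
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in sketch)
    zeros = sketch.count(0)
    if raw <= 2.5 * M and zeros:              # small-range correction
        raw = M * math.log(M / zeros)
    return int(raw)

# Four "shards" each sketch their local rows (10000 distinct values total) ...
sketches = []
for i in range(4):
    s = empty_sketch()
    for v in range(i * 2500, (i + 1) * 2500):
        add(s, v)
    sketches.append(s)

# ... and the master merges them into a single estimate, off by a few percent.
combined = empty_sketch()
for s in sketches:
    combined = merge(combined, s)
print(estimate(combined))
```

With 1024 registers the expected relative error is roughly 1.04/sqrt(1024), i.e. about 3%, and each sketch stays the same small size no matter how many rows a shard holds — which is what makes shipping sketches to the master so much cheaper than shipping the distinct values themselves.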
“CitusDB follows the modern best-practice of having many virtual nodes on each physical node.”
Why is this a best-practice? Do you mean for PG, or for all software?
Kelly,
I mean for a large fraction of all scale-out software. It makes things go better when you need to redistribute data for some reason — node outage, planned cluster expansion, whatever.
Sounds promising. It would be interesting to know how CitusDB differs from Postgres-XC in architecture and features (it sounds as if both are very similar in the approach they took).