Cassandra and privacy requirements
For starters:
- I’ve suggested in the past that multi-data-center capabilities are important for “data sovereignty”/geo-compliance.
- The need for geo-compliance just got a lot stronger, with the abolition of the European Union’s Safe Harbour rule for the US. If you collect data in multiple countries, you should be at least thinking about geo-compliance.
- Cassandra is an established leader in multi-data-center operation.
But when I made that connection and checked in accordingly with my client Patrick McFadin at DataStax, I discovered that I’d been a little confused about how multi-data-center Cassandra works. The basic idea holds water, but the details are not quite what I was envisioning.
The story starts:
- Cassandra groups nodes into logical “data centers” (i.e. token rings).
- As a best practice, each physical data center can contain one or more logical data center, but not vice-versa.
- There are two levels of replication — within a single logical data center, and between logical data centers.
- Replication within a single data center is planned in the usual way, with the principal data center holding a database likely to have a replication factor of 3.
- However, copies of the database held elsewhere may have different replication factors …
- … and can indeed have different replication factors for different parts of the database.
In particular, a remote replication factor for Cassandra can = 0. When that happens, then you have data sitting in one geographical location that is absent from another geographical location; i.e., you can be in compliance with laws forbidding the export of certain data. To be clear (and this contradicts what I previously believed and hence also implied in this blog):
- General multi-data-center operation is not what gives you geo-compliance, because the default case is that the whole database is replicated to each data center.
- Instead, you get that effect by tweaking your specific replication settings.
The most visible DataStax client using this strategy is apparently ING Bank.
If you have a geo-compliance issue, you’re probably also concerned about security. After all, the whole reason the issue arises is because one country’s government might want to look at another country’s citizens’ or businesses’ data. The DataStax security story is approximately:
- Encryption in flight, for any Cassandra.
- Encryption at rest, specifically with DataStax Enterprise.
- No cell-level or row-level security until Cassandra 3.0 is introduced and established. (I didn’t actually ask whether something similar to HBase coprocessors is coming for Cassandra, but that would be my first guess.)
- Various roles and permissions stuff.
While flexible, Cassandra’s multi-data-center features do add some complexity. Tunable-consistency choices are baked into Cassandra programs at each point data is accessed, and more data centers make for more choices. (Default best practice = write if you get a local quorum, running the slight risk of logical data centers being out of sync with each other.)
One way in which the whole thing does seem nice and simple is that you can have different logical data centers running on different kinds of platforms — cloud, colocation, in-house, whatever — without Cassandra caring.
I’m not going to call the DataStax Enterprise approach to geo-compliance the “gold standard”, because some of it seems pretty clunky or otherwise feature-light. On the other hand, I’m not aware of competitors who exceed it, in features or track record, so “silver standard” seems defensible.
Comments
3 Responses to “Cassandra and privacy requirements”
Leave a Reply
Cassandra’s tunable replication with upcoming cell-level or row-level security ups the ante for Sqrrl / Koverse who add granular security to Accumulo.
Engineering privacy in “more better” than adding security to keep folks out…
Wait a moment. I’m pretty sure that cell-level security is a feature of any version of Accumulo.
And HBase imitated it, in what I think was the first major use of the co-processor capability.
Yes. Cell-level security is a feature of Accumulo, but not the only granular security pattern. https://patents.google.com/?q=sqrrl