Introduction to SequoiaDB and SequoiaCM
For starters, let me say:
- SequoiaDB, the company, is my client.
- SequoiaDB, the product, is the main product of SequoiaDB, the company.
- SequoiaDB, the company, has another product line, SequoiaCM, which subsumes SequoiaDB in content management use cases.
- SequoiaDB, the product, is fundamentally a JSON data store. But it has a relational front end …
- … and is usually sold for RDBMS-like use cases …
- … except when it is sold as part of SequoiaCM, which adds in a large object/block store and a content-management-oriented library.
- SequoiaDB’s products are open source.
- SequoiaDB’s largest installation seems to be 2 PB across 100 nodes; that includes block storage.
- Figures for DBMS-only database sizes aren’t as clear, but the sweet spot of the cluster-size range for such use cases seems to be 6-30 nodes.
Also:
- SequoiaDB, the company, was founded in Toronto, by former IBM DB2 folks.
- Even so, it’s fairly accurate to view SequoiaDB as a Chinese company. Specifically:
- SequoiaDB’s founders were Chinese nationals.
- Most of them went back to China.
- Other employees to date have been entirely Chinese.
- Sales to date have been entirely in China, but SequoiaDB has international aspirations.
- SequoiaDB has >100 employees, a large majority of whom are split fairly evenly between “engineering” and “implementation and technical support”.
- SequoiaDB’s marketing (as opposed to sales) department is astonishingly tiny.
- SequoiaDB cites >100 subscription customers, including 10 in the global Fortune 500, a large fraction of which are in the banking sector. (Other sectors mentioned repeatedly are government and telecom.)
Unfortunately, SequoiaDB has not captured a lot of detailed information about unpaid open source production usage.
While I usually think that the advantages of open source are overstated, in SequoiaDB’s case open source will have* an additional benefit when SequoiaDB does go international — it addresses any concerns somebody might have about using Chinese technology.
*Edit: Actually, this claim is overstated based on SequoiaDB’s current open source practices. Please see the comment thread below.
SequoiaDB’s technology story starts:
- SequoiaDB is a layered DBMS.
- It manages JSON via update-in-place. MVCC (Multi-Version Concurrency Control) is on the roadmap.
- Indexes are B-tree.
- Transparent sharding and elasticity happen in what by now is the industry-standard/best-practices way:
- There are many (typically 4096) logical partitions, many of which are assigned to each physical partition.
- If the number of physical partitions changes, logical partitions are reassigned accordingly.
- Relational OLTP (OnLine Transaction Processing) functionality is achieved by using a kind of PostgreSQL front end.
- Relational batch processing is done via SparkSQL.
- There also is a block/LOB (Large OBject) storage engine meant for content management applications.
- SequoiaCM boils down technically to:
- SequoiaDB, which is used to store JSON metadata about the LOBs …
- … and whose generic-DBMS coordination capabilities are also used over the block/LOB engine.
- A Java library focused on content management.
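The logical-to-physical partition scheme in the list above can be illustrated with a minimal sketch. The 4096 figure comes from the text; everything else here is an assumption made for illustration, including the MD5-based hash, the round-robin initial layout, the greedy rebalancing loop, and the node names. None of it is taken from SequoiaDB's actual implementation.

```python
import hashlib

NUM_LOGICAL = 4096  # typical logical-partition count cited in the text


def logical_partition(key: str) -> int:
    """Map a shard key to a logical partition (hash choice is an assumption)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_LOGICAL


def assign(nodes):
    """Hypothetical initial layout: spread logical partitions round-robin."""
    return {lp: nodes[lp % len(nodes)] for lp in range(NUM_LOGICAL)}


def add_node(assignment, new_node):
    """Hypothetical rebalance: hand the new node an even share of logical
    partitions, taking them only from nodes that hold more than that share."""
    nodes = sorted(set(assignment.values())) + [new_node]
    target = NUM_LOGICAL // len(nodes)
    counts = {n: 0 for n in nodes}
    for owner in assignment.values():
        counts[owner] += 1
    result = dict(assignment)
    moved = 0
    for lp in range(NUM_LOGICAL):
        if moved >= target:
            break
        owner = result[lp]
        if counts[owner] > target:
            result[lp] = new_node
            counts[owner] -= 1
            moved += 1
    return result


before = assign(["node-a", "node-b", "node-c"])
after = add_node(before, "node-d")
changed = sum(1 for lp in before if before[lp] != after[lp])
print(f"{changed} of {NUM_LOGICAL} logical partitions changed owner")
```

A key's logical partition never changes, so growing from three nodes to four in this sketch moves only the 1024 logical partitions handed to the new node; the other 3072 stay where they were. That is the usual argument for having many more logical partitions than physical ones.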
SequoiaDB’s relationship with PostgreSQL is complicated, but as best I understand SequoiaDB’s relational operations:
- SQL parsing, optimization, and so on rely mainly on PostgreSQL code. (Of course, there are some hacks, such as to the optimizer’s cost functions.)
- Actual data storage is done via SequoiaDB’s JSON store, using PostgreSQL Foreign Data Wrappers. Each record goes in a separate JSON document. Locks, commits and so on — i.e. “write prevention” 🙂 — are handled by the JSON store.
- PostgreSQL’s own storage engine is actually part of the stack, but only to manage temp space and the like.
PostgreSQL stored procedures are already in the SequoiaDB product. Triggers and referential integrity are not. Neither, so far as I can tell, are PostgreSQL’s datatype extensibility capabilities.
I neglected to ask how much of that remains true when SparkSQL is invoked.
SequoiaDB’s use cases to date seem to fall mainly into three groups:
- Content management via SequoiaCM.
- “Operational data lakes”.
- Pretty generic replacement of legacy RDBMS.
Internet back-ends, however — and this is somewhat counter-intuitive for an open-source JSON store — are rare, at least among paying subscription customers. But SequoiaDB did tell me of one classic IoT (Internet of Things) application, with lots of devices “phoning home” and the results immediately feeding a JSON-based dashboard.
To understand SequoiaDB’s “operational data lake” story, it helps to understand the typical state of data warehousing at SequoiaDB’s customers and prospects, which isn’t great:
- 2-3 years of data, and not all the data even from that time period.
- Only enough processing power to support structured business intelligence …
- … and hence little opportunity for ad-hoc query.
SequoiaDB operational data lakes offer multiple improvements over that scenario:
- They hold as much relational data as customers choose to dump there.
- That data can be simply copied from operational stores, with no transformation.
- Or if data arrives via JSON — from external organizations or micro-services as the case may be — the JSON can be stored unmodified as well.
- Queries can be run straight against this data soup.
- Of course, views can also be set up in advance to help with querying.
Views are particularly useful with what might be called slowly changing schemas. (I didn’t check whether what SequoiaDB is talking about matches precisely with the more common term “slowly changing dimensions”.) Each time the schema changes, a new table is created in SequoiaDB to receive copies of the data. If one wants to query against the parts of the database structure that didn’t change, well, a view can be established to allow for that.
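As a rough illustration of that pattern, here is a small sketch using Python's built-in sqlite3 module as a stand-in for SequoiaDB's PostgreSQL-based front end. The table names, columns, and sample rows are all hypothetical. The idea is just that each schema version lands in its own table, and a view exposes the columns that every version shares.

```python
import sqlite3

# SQLite stands in for the relational front end; all names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Version 1 of the source schema.
    CREATE TABLE accounts_v1 (id INTEGER, name TEXT, balance REAL);
    -- Version 2 adds a column, so its copies land in a new table.
    CREATE TABLE accounts_v2 (id INTEGER, name TEXT, balance REAL, branch TEXT);

    INSERT INTO accounts_v1 VALUES (1, 'Li', 100.0);
    INSERT INTO accounts_v2 VALUES (2, 'Wang', 250.0, 'Shenzhen');

    -- The view covers only the columns that did not change.
    CREATE VIEW accounts_stable AS
        SELECT id, name, balance FROM accounts_v1
        UNION ALL
        SELECT id, name, balance FROM accounts_v2;
    """
)

rows = conn.execute(
    "SELECT id, name, balance FROM accounts_stable ORDER BY id"
).fetchall()
print(rows)
```

Queries against `accounts_stable` see rows from both schema versions, while anything that needs the newer `branch` column can still go to `accounts_v2` directly.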
Finally, it seems that SequoiaCM uses are concentrated in what might be called “security and checking-up” areas, such as:
- Photographs as part of an authentication process.
- Video of in-person banking transactions, both for fraud prevention and for general service quality assurance.
- Storage of security videos (for example from automated teller machines).
SequoiaCM deals seem to be bigger than other SequoiaDB ones, surely in part because the amounts of data managed are larger.
Comments
> I neglected to ask how much of that remains true when SparkSQL is invoked.
SparkSQL is one of the most commonly used SQL interfaces for SequoiaDB in batch processing jobs. spark-sequoiadb-connector is implemented by overriding Spark RDD, so that all Spark RDD/Streaming/SQL are able to use SequoiaDB as the data source and target storage engine.
https://github.com/SequoiaDB/spark-sequoiadb
I didn’t check out the specifics of SequoiaDB’s open source story before posting. Let me now fill that gap. The mechanics are basically:
So it was a stretch when I suggested or assumed that there was a robust open source community familiar with the latest release of SequoiaDB’s products.
You’re right. In the Chinese market we are behaving very much like a traditional enterprise software vendor, albeit with a subscription-only pricing model. We intend to engage more fully with the open source community as we expand into other geographical regions. We’re sorry for any confusion this might cause!
Some people may ask why SequoiaDB is based in China, rather than in North America like most other database companies.
China is the world’s second largest economy, and its share of the world’s top 500 enterprises is almost equal to North America’s. Each year, massive amounts of data are generated in financial, telecom, internet, government and many other industries.
While many new distributed database companies are still trying to focus on customers in North America at the moment, SequoiaDB was founded in 2011 in China, and has helped a large number of Chinese and Asian customers with its leading technologies.