Aerospike
Discussion of Aerospike, formerly known as Citrusleaf.
Basho and Riak
Basho was on my (very short) blacklist of companies with whom I refuse to speak, because they have lied about the contents of previous conversations. But Tony Falco et al. are long gone from the company. So when Basho’s new management team reached out, I took the meeting.
For starters:
- Basho management turned over significantly 1-2 years ago. The main survivors from the old team are 1 each in engineering, sales, and services.
- Basho moved its headquarters to Bellevue, WA. (You get one guess as to where the new CEO lives.) Engineering operations are very distributed geographically.
- Basho claims that it is much better at timely product shipments than it used to be. Its newest product has a planned (or at least hoped-for) 8-week cadence for point releases.
- Basho’s revenue is ~90% subscription.
- Basho claims >200 enterprise clients, vs. 100-120 when new management came in. Unfortunately, I forgot to ask the usual questions about divisions vs. whole organizations, OEM sell-through vs. direct, etc.
- Basho claims an average contract value of >$100K, typically over 2-3 years. $9 million of that (which would actually be close to half the total) comes from 2 particular deals of >$4 million each.
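As a back-of-envelope check on the "close to half" remark, here is a minimal sketch. It treats the claimed lower bounds (200 clients, $100K average contract value) as if they were exact figures, which is an assumption; the variable names are illustrative only.

```python
# Back-of-envelope check. The inputs are Basho's claimed lower bounds,
# treated here as exact values (an assumption).
clients = 200                  # ">200 enterprise clients"
avg_contract_value = 100_000   # ">$100K" average contract value, in dollars
big_deals_total = 9_000_000    # the $9 million from the 2 large deals

implied_total = clients * avg_contract_value
share = big_deals_total / implied_total

print(f"Implied total contract value: ${implied_total:,}")
print(f"Share from the two big deals: {share:.0%}")
```

Because both inputs are lower bounds, the true total would be larger and the true share somewhat smaller than this estimate.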
Basho’s product line has gotten a bit confusing, but as best I understand things the story is:
- There’s something called Riak Core, which isn’t even a revenue-generating product. However, it’s an open source project with some big users (e.g. Goldman Sachs, Visa), and included in pretty much everything else Basho promotes.
- Riak KV is the key-value store previously known as Riak. It generates the lion’s share of Basho’s revenue.
- Riak S2 is an emulation of Amazon S3. Basho thinks that Riak KV loses efficiency when objects get bigger than 1 MB or so, and that’s when you might want to use Riak S2 in addition or instead.
- Riak TS is for time series, and just coming out now.
- Also in the mix are some (extra charge) connectors for Redis and Spark. Presumably, there are more of these to come.
- There’s an umbrella marketing term of “Basho Data Platform”.
Technical notes on some of that include: Read more
Notes on memory-centric data management
I first wrote about in-memory data management a decade ago. But I long declined to use that term — because there’s almost always a persistence story outside of RAM — and coined “memory-centric” as an alternative. Then I relented 1 1/2 years ago, and defined in-memory DBMS as
DBMS designed under the assumption that substantially all database operations will be performed in RAM (Random Access Memory)
By way of contrast:
Hybrid memory-centric DBMS is our term for a DBMS that has two modes:
- In-memory.
- Querying and updating (or loading into) persistent storage.
These definitions, while a bit rough, seem to fit most cases. One awkward exception is Aerospike, which assumes semiconductor memory, but is happy to persist onto flash (just not spinning disk). Another is Kognitio, which is definitely lying when it claims its product was in-memory all along, but may or may not have redesigned its technology over the decades to have become more purely in-memory. (But if they have, what happened to all the previous disk-based users??)
Two other sources of confusion are:
- The broad variety of memory-centric data management approaches.
- The over-enthusiastic marketing of SAP HANA.
With all that said, here’s a little update on in-memory data management and related subjects.
- I maintain my opinion that traditional databases will eventually wind up in RAM.
- At conventional large enterprises — as opposed to for example pure internet companies — production deployments of HANA are probably comparable in number and investment to production deployments of Hadoop. (I’m sorry, but much of my supporting information for that is confidential.)
- Cloudera is emphatically backing Spark. And a key aspect of Spark is that, unlike most of Hadoop, it’s memory-centric.
- It has become common for disk-based DBMS to persist data through a “log-structured” architecture. That’s a whole lot like what you do for persistence in a fundamentally in-memory system.
- I’m also sensing increasing comfort with the strategy of committing writes as soon as they’ve been acknowledged by two or more nodes in RAM.
And finally,
- I’ve never heard a story about an in-memory DBMS actually losing data. It’s surely happened, but evidently not often.
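The "commit once two or more nodes have the write in RAM" strategy mentioned above can be sketched roughly as follows. This is a toy model, not any vendor's protocol: the `Replica` class, the `commit` function, and the `required_acks` parameter are all hypothetical names, and real systems would also deal with networking, ordering, and later persistence.

```python
class Replica:
    """Hypothetical node that holds acknowledged writes in RAM only."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.ram = {}

    def write(self, key, value):
        # A node that is down cannot acknowledge the write.
        if not self.up:
            return False
        self.ram[key] = value
        return True


def commit(replicas, key, value, required_acks=2):
    """Return True as soon as `required_acks` replicas hold the write in RAM.

    In this toy version, replicas after the last needed ack are not written;
    a real system would typically keep replicating in the background.
    """
    acks = 0
    for replica in replicas:
        if replica.write(key, value):
            acks += 1
        if acks >= required_acks:
            return True
    return False


cluster = [Replica("a"), Replica("b", up=False), Replica("c")]
print(commit(cluster, "user42", "some value"))
```

The point of the design is that durability comes from redundancy across nodes rather than from waiting on a disk write.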
Comments on the 2013 Gartner Magic Quadrant for Operational Database Management Systems
The 2013 Gartner Magic Quadrant for Operational Database Management Systems is out. “Operational” seems to be Gartner’s term for what I call short-request, in each case the point being that OLTP (OnLine Transaction Processing) is a dubious term when systems omit strict consistency, and when even strictly consistent systems may lack full transactional semantics. As is usually the case with Gartner Magic Quadrants:
- I admire the raw research.
- The opinions contained are generally reasonable (especially since Merv Adrian joined the Gartner team).
- Some of the details are questionable.
- There’s generally an excessive focus on Gartner’s perception of vendors’ business skills, and on vendors’ willingness to parrot all the buzzphrases Gartner wants to hear.
- The trends Gartner highlights are similar to those I see, although our emphasis may be different, and they may leave some important ones out. (Big omission — support for lightweight analytics integrated into operational applications, one of the more genuine forms of real-time analytics.)
Anyhow: Read more
Layering of database technology & DBMS with multiple DMLs
Two subjects in one post, because they were too hard to separate from each other
Any sufficiently complex software is developed in modules and subsystems. DBMS are no exception; the core trinity of parser, optimizer/planner, and execution engine merely starts the discussion. But increasingly, database technology is layered in a more fundamental way as well, to the extent that different parts of what would seem to be an integrated DBMS can sometimes be developed by separate vendors.
Major examples of this trend — where by “major” I mean “spanning a lot of different vendors or projects” — include:
- The object/relational, aka universal, extensibility features developed in the 1990s for Oracle, DB2, Informix, Illustra, and Postgres. The most successful extensions probably have been:
  - Geospatial indexing via ESRI.
  - Full-text indexing, notwithstanding questionable features and performance.
- MySQL storage engines.
- MPP (Massively Parallel Processing) analytic RDBMS relying on single-node PostgreSQL, Ingres, and/or Microsoft SQL Server — e.g. Greenplum (especially early on), Aster (ditto), DATAllegro, DATAllegro’s offspring Microsoft PDW (Parallel Data Warehouse), or Hadapt.
- Splits in which a DBMS has serious processing both in a “database” layer and in a predicate-pushdown “storage” layer — most famously Oracle Exadata, but also MarkLogic, InfiniDB, and others.
- SQL-on-HDFS — Hive, Impala, Stinger, Shark and so on (including Hadapt).
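To make the "database" layer vs. predicate-pushdown "storage" layer split concrete, here is a minimal sketch. It is not modeled on Exadata or any other product named above; `StorageLayer`, `total_amount`, and the sample rows are hypothetical. The only thing it shows is that pushing the filter down reduces how many rows cross the layer boundary.

```python
ROWS = [
    {"id": 1, "region": "EU", "amount": 120},
    {"id": 2, "region": "US", "amount": 80},
    {"id": 3, "region": "EU", "amount": 40},
]


class StorageLayer:
    """Hypothetical storage layer that can evaluate a predicate itself."""

    def __init__(self, rows):
        self.rows = rows
        self.rows_shipped = 0  # rows handed up to the database layer

    def scan(self, predicate=None):
        for row in self.rows:
            if predicate is None or predicate(row):
                self.rows_shipped += 1
                yield row


def total_amount(storage, region, pushdown):
    """Database-layer query: sum of amounts for one region."""
    if pushdown:
        # The filter runs inside the storage layer.
        rows = storage.scan(lambda r: r["region"] == region)
    else:
        # The storage layer ships everything; the filter runs up here.
        rows = (r for r in storage.scan() if r["region"] == region)
    return sum(r["amount"] for r in rows)


with_pushdown = StorageLayer(ROWS)
without_pushdown = StorageLayer(ROWS)
print(total_amount(with_pushdown, "EU", True), with_pushdown.rows_shipped)
print(total_amount(without_pushdown, "EU", False), without_pushdown.rows_shipped)
```

Both calls return the same answer; only the number of rows shipped between the layers differs.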
Other examples on my mind include:
- Data manipulation APIs being added to key-value stores such as Couchbase and Aerospike.
- TokuMX, the Tokutek/MongoDB hybrid I just blogged about.
- NuoDB’s willing reliance on third-party key-value stores (or HDFS in the role of one).
- FoundationDB’s strategy, and specifically its acquisition of Akiban.
And there are several others I hope to blog about soon, e.g. current-day PostgreSQL.
In an overlapping trend, DBMS increasingly have multiple data manipulation APIs. Examples include: Read more
Aerospike 3
My clients at Aerospike are coming out with their Version 3, and as several of my clients do, have encouraged me to front-run what otherwise would be the Monday embargo.
I encourage such behavior with arguments including:
- “Nobody else is going to write in such technical detail anyway, so they won’t mind.”
- “I’ve done this before. Other writers haven’t complained.”
- “In fact, some other writers like having me go first, so that they can learn from and/or point to what I say.”
- “Hey, I don’t ask for much in the way of exclusives, but I’d be pleased if you threw me this bone.”
Aerospike 2’s value proposition, let us recall, was:
… performance, consistent performance, and uninterrupted operations …
- Aerospike’s consistent performance claims are along the lines of sub-millisecond latency, with 99.9% of responses being within 5 milliseconds, and even a node outage only borking performance for some 10s of milliseconds.
- Uninterrupted operation is a core Aerospike design goal, and the company says that to date, no Aerospike production cluster has ever gone down.
The major support for such claims is Aerospike’s success in selling to the digital advertising market, which is probably second only to high-frequency trading in its low-latency demands. For example, Aerospike’s CMO Monica Pal sent along a link to what apparently is:
- a video by a customer named Brightroll …
- … who enjoy SLAs (Service Level Agreements) such as those cited above (they actually mentioned five 9s)* …
- … at peak loads of 10-12 million requests/minute.
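For comparison with per-second figures quoted elsewhere, the per-minute peak load converts as follows. This is plain unit conversion; the function name is illustrative.

```python
def requests_per_second(requests_per_minute):
    """Convert a per-minute request rate to a per-second rate."""
    return requests_per_minute / 60


for per_minute in (10_000_000, 12_000_000):
    print(f"{per_minute:,} requests/minute is about "
          f"{requests_per_second(per_minute):,.0f} requests/second")
```

So the quoted peak works out to roughly 167,000-200,000 requests per second across the deployment.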
Analytic application themes
I talk with a lot of companies, and repeatedly hear some of the same application themes. This post is my attempt to collect some of those ideas in one place.
1. So far, the buzzword of the year is “real-time analytics”, generally with “operational” or “big data” included as well. I hear variants of that positioning from NewSQL vendors (e.g. MemSQL), NoSQL vendors (e.g. Aerospike), BI stack vendors (e.g. Platfora), application-stack vendors (e.g. WibiData), log analysis vendors (led by Splunk), data management vendors (e.g. Cloudera), and of course the CEP industry.
Yeah, yeah, I know — not all the named companies are in exactly the right market category. But that’s hard to avoid.
Why this gold rush? On the demand side, there’s a real or imagined need for speed. On the supply side, I’d say:
- There are vast numbers of companies offering data-management-related technology. They need ways to differentiate.
- Doing analytics at short-request speeds is an obvious data-management-related challenge, and not yet comprehensively addressed.
2. More generally, most of the applications I hear about are analytic, or have a strong analytic aspect. The three biggest areas — and these overlap — are:
- Customer interaction
- Network and sensor monitoring
- Game and mobile application back-ends
Also arising fairly frequently are:
- Algorithmic trading
- Anti-fraud
- Risk measurement
- Law enforcement/national security
- Healthcare
- Stakeholder-facing analytics
I’m hearing less about quality, defect tracking, and equipment maintenance than I used to, but those application areas have anyway been ebbing and flowing for decades.
YCSB benchmark notes
Two different vendors recently tried to inflict benchmarks on me. Both were YCSBs, so I decided to look up what the YCSB (Yahoo! Cloud Serving Benchmark) actually is. It turns out that the YCSB:
- Was developed by — you guessed it! — Yahoo.
- Is meant to simulate workloads that fetch web pages, including the writing portions of those workloads.
- Was developed with NoSQL data managers in mind.
- Bakes in one kind of sensitivity analysis — latency vs. throughput.
- Is implemented in extensible open source code.
That actually sounds pretty good, especially the extensibility part;* it’s likely that the YCSB can be useful in a variety of product selection scenarios. Still, as recent examples show, benchmark marketing is an annoying blight upon the database industry.
*With extensibility you can test your own workloads and do your own sensitivity analyses.
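The latency-vs.-throughput idea can be illustrated with a toy harness like the one below. To be clear, this is not YCSB code and does not use its API; `run_workload` and `sweep` are hypothetical names, the "store" is a plain Python dict, and the workload is a simple alternation of reads and writes.

```python
import time


def run_workload(store, target_ops_per_sec, n_ops):
    """Issue n_ops operations at a target rate.

    Returns (achieved_ops_per_sec, list_of_latencies_in_seconds).
    """
    interval = 1.0 / target_ops_per_sec
    latencies = []
    start = time.perf_counter()
    for i in range(n_ops):
        # Throttle: busy-wait until this operation's scheduled start time.
        scheduled = start + i * interval
        while time.perf_counter() < scheduled:
            pass
        key = f"user{i % 100}"
        t0 = time.perf_counter()
        if i % 2 == 0:
            store.get(key)            # read
        else:
            store[key] = "x" * 100    # write
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return n_ops / elapsed, latencies


def sweep(store, targets, n_ops=200):
    """Measure 99th-percentile latency at each target throughput."""
    results = []
    for target in targets:
        achieved, lat = run_workload(store, target, n_ops)
        lat.sort()
        p99 = lat[int(0.99 * (len(lat) - 1))]
        results.append((target, achieved, p99))
    return results


for target, achieved, p99 in sweep({}, [1_000, 5_000, 20_000]):
    print(f"target={target}/s achieved={achieved:,.0f}/s p99={p99 * 1e6:.1f} us")
```

Swapping the dict for a real client and the alternating loop for your own operation mix is the kind of workload customization the footnote has in mind.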
A YCSB overview page features links both to the code and to the original explanatory paper. The clearest explanation of the YCSB I found there was: Read more
Aerospike, the former Citrusleaf
My new clients at Aerospike have a range of minor news to announce:
- A company and product name change (they used to be Citrusleaf).
- Some new people and funding.
- In association with an acqui-hire — of AlchemyDB guy Russ Sullivan — some unspecified future technical plans.
- A community edition (Aerospike, née Citrusleaf, is closed-source).
Mainly, however, they want to call your attention to the fact that they’ve been selling a fast, reliable key-value store, with a number of production references, and want to suggest that other organizations should perhaps buy it as well.
Generally, the Aerospike product story is as I described in two posts last year. At the highest level:
- Aerospike has a key-value data model.
- Secondary indexes and so on are still futures.
- Aerospike is clustered, of course.
- Two hardware/storage choices are encouraged:
  - Spinning disk, but you keep all your data in RAM.
  - Solid-state disk.
Aerospike’s three core marketing claims are performance, consistent performance, and uninterrupted operations.
- Aerospike’s performance claims are supported by a variety of blazing internal benchmarks.
- Aerospike’s consistent performance claims are along the lines of sub-millisecond latency, with 99.9% of responses being within 5 milliseconds, and even a node outage only borking performance for some 10s of milliseconds.
- Uninterrupted operation is a core Aerospike design goal, and the company says that to date, no Aerospike production cluster has ever gone down.
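A claim of the form "99.9% of responses within 5 milliseconds" is straightforward to test against measured latencies. Here is a minimal sketch; the function name, parameter names, and sample data are made up for illustration and are not Aerospike measurements.

```python
def meets_sla(latencies_ms, threshold_ms=5.0, required_fraction=0.999):
    """True if at least `required_fraction` of responses took <= threshold_ms."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    within = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return within / len(latencies_ms) >= required_fraction


# Made-up samples: mostly sub-millisecond, with a few slow outliers.
good = [0.4] * 9_995 + [20.0] * 5    # 99.95% within 5 ms
bad = [0.4] * 9_980 + [20.0] * 20    # 99.80% within 5 ms
print(meets_sla(good), meets_sla(bad))
```

Note that a percentile claim like this says nothing about how slow the remaining 0.1% of responses are.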
Aerospike technical details start with the expected: Read more
Citrusleaf RTA
Citrusleaf has released an add-on product called Citrusleaf RTA (Real-Time Attribution). It’s to be used when:
- You want to update dashboards within a minute.
- You want to update predictive models fairly quickly (within the hour?), although it’s not clear to me how much the models are being updated or changed with that latency.
The metrics envisioned are:
- 100 or so ad impressions per person …
- … for 1 billion or so people …
- … stored for 30-90 days …
- … where each ad impression is a fairly short record …
- … stored on disk …
- … but indexed in a way so that the index can fit into RAM.
- 50,000-100,000 writes per second. (I didn’t ask on what amount of hardware.)
- Several hundred reads per second.
A consistent relational schema is NOT assumed.
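The metrics above imply a total record count and an average write rate, which can be worked out as follows. This sketch assumes the records arrive evenly over the retention window, which is a simplification; the names are illustrative.

```python
SECONDS_PER_DAY = 86_400

people = 1_000_000_000          # "1 billion or so people"
impressions_per_person = 100    # "100 or so ad impressions per person"
total_records = people * impressions_per_person


def avg_writes_per_sec(retention_days):
    """Average write rate if all records arrive evenly over the window."""
    return total_records / (retention_days * SECONDS_PER_DAY)


print(f"Total records: {total_records:,}")
for days in (30, 90):
    print(f"{days}-day retention: about {avg_writes_per_sec(days):,.0f} writes/second")
```

The averages come out below the quoted write rate, which is at least consistent if that rate describes peaks rather than the mean; that reading is an assumption.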
Citrusleaf’s solution is:
- Have one index entry for each of the 1 billion people.
- Bang each new object/record to disk. Include in it a pointer to the previous object/record for the same person.
- Each time a new object/record is added, update the index in place so that it now points to the new one. Hence, the index is sized according to the number of people, not according to the total number of objects/records.
- Eventually let objects/records age off in the obvious way.
The downside is that when you do read 100 objects/records per person, you might need to do 100 seeks.
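The scheme described above (one index entry per person, each record pointing back to the previous one) can be sketched as follows. This is an illustration of the idea as described, not Citrusleaf's code: `AppendOnlyStore` stands in for disk, `PersonIndex` for the in-RAM index, and both names are hypothetical. Aging records off is omitted.

```python
class AppendOnlyStore:
    """Stand-in for disk: records are appended and addressed by offset."""

    def __init__(self):
        self.records = []
        self.seeks = 0  # counts reads, as a proxy for disk seeks

    def append(self, payload, prev_offset):
        # Each record carries a pointer to the same person's previous record.
        self.records.append((payload, prev_offset))
        return len(self.records) - 1

    def read(self, offset):
        self.seeks += 1
        return self.records[offset]


class PersonIndex:
    """In-RAM index with exactly one entry per person."""

    def __init__(self, store):
        self.store = store
        self.latest = {}  # person_id -> offset of that person's newest record

    def add(self, person_id, payload):
        prev = self.latest.get(person_id)
        # Update the index entry in place to point at the new record.
        self.latest[person_id] = self.store.append(payload, prev)

    def history(self, person_id):
        """Walk the chain newest-to-oldest; one read per record."""
        out = []
        offset = self.latest.get(person_id)
        while offset is not None:
            payload, offset = self.store.read(offset)
            out.append(payload)
        return out


store = AppendOnlyStore()
index = PersonIndex(store)
for person, ad in [("alice", "ad1"), ("bob", "ad2"), ("alice", "ad3"), ("alice", "ad4")]:
    index.add(person, ad)
print(index.history("alice"), store.seeks, len(index.latest))
```

The index holds two entries for two people regardless of how many records exist, while reading one person's three records costs three reads, which is the seek trade-off noted above.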
Introduction to Citrusleaf
Citrusleaf is the vendor of yet another short-request/NoSQL database management system, conveniently named Citrusleaf. Highlights for Citrusleaf the company include:
- 8 employees.
- $2 million in recently acquired venture capital.
- 1 1/2 – 2 1/2 years of total company history, depending on how you count.
- An undisclosed but nonzero number of paying customers, concentrated in the real-time advertising market, with a typical application being cookie management.
Citrusleaf the product is a kind of key-value store; however, the values are in the form of rows, so what you really look up is (key, field name, value) triples. Right now only the keys are indexed; futures include indexing on the individual fields, so as to support some basic analytics. SQL support is an eventual goal. Other Citrusleaf buzzword basics include:
- ACID-compliant.
- Log-structured.
- Tunable consistency model.
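The "(key, field name, value) triples" data model described above, with only the keys indexed, might look like this in miniature. This is a sketch of the concept, not the Citrusleaf API; `RowStore` and its methods are hypothetical names.

```python
class RowStore:
    """Toy key-value store whose values are rows of named fields."""

    def __init__(self):
        # Only the key is indexed: key -> {field_name: value}
        self.rows = {}

    def put(self, key, field, value):
        self.rows.setdefault(key, {})[field] = value

    def get(self, key, field):
        # Fast path: direct lookup by key, then by field name.
        return self.rows[key][field]

    def find_keys_by_field(self, field, value):
        # No index on fields, so this has to scan every row.
        return [k for k, row in self.rows.items() if row.get(field) == value]


store = RowStore()
store.put("cookie:123", "segment", "sports")
store.put("cookie:123", "last_seen", "2011-03-01")
store.put("cookie:456", "segment", "sports")
print(store.get("cookie:123", "segment"))
print(store.find_keys_by_field("segment", "sports"))
```

The full scan in `find_keys_by_field` is what indexing on individual fields would eventually avoid.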
To date, Citrusleaf customers have focused on sub-millisecond data retrieval, preferably 0.2-0.3 milliseconds. Accordingly, none has chosen to put the primary Citrusleaf data store on disk. Rather:
- Citrusleaf indexes are always in RAM. (Citrusleaf forces this, actually.)
- You can keep data in RAM and copy it to disk.
- You can keep data on solid-state drives. (Just A Bunch Of Flash or Fusion-io.)
I don’t have a good grasp on what the data structure for those indexes is.
Citrusleaf characterizes its customers as firms that have “a couple of KB” of data on “every” person in North America. Naively, that sounds like a terabyte or less to me, but Citrusleaf says 1-3 terabytes is most common. Or to quote the press release, “The most common deployments for Citrusleaf 2.0 are terabytes of data, billions of objects, and 200K plus transactions per second per node, with sub-millisecond latency.” 4-8 nodes seems to be typical for Citrusleaf databases (all figures pre-replication). I didn’t ask what kind of hardware is at each node.
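The "terabyte or less" estimate can be reproduced with a quick calculation. The population figures below are rough assumptions supplied for illustration (about 350 million for the US plus Canada, about 530 million for the whole continent), and "a couple of KB" is taken to mean 2 KB with 1 KB = 1,000 bytes.

```python
BYTES_PER_PERSON = 2 * 1_000   # "a couple of KB", taken as 2 KB (assumption)
BYTES_PER_TB = 1_000 ** 4

# Rough population assumptions, not figures from Citrusleaf.
populations = {
    "US + Canada": 350_000_000,
    "All of North America": 530_000_000,
}


def terabytes(people):
    """Raw data size in TB, before replication or indexing overhead."""
    return people * BYTES_PER_PERSON / BYTES_PER_TB


for label, people in populations.items():
    print(f"{label}: about {terabytes(people):.2f} TB")
```

Either way the raw figure lands near or below 1 TB, so the 1-3 terabytes that Citrusleaf reports suggests more than 2 KB per person, more people or cookies than residents, or both.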
Citrusleaf data distribution features include: Read more