Thoughts and notes, Thanksgiving weekend 2014
I’m taking a few weeks defocused from work, as a kind of grandpaternity leave. That said, the venue for my Dances of Infant Calming is a small-but-nice apartment in San Francisco, so a certain amount of thinking about tech industries is inevitable. I even found time last Tuesday to meet or speak with my clients at WibiData, MemSQL, Cloudera, Citus Data, and MongoDB. And thus:
1. I’ve been sloppy in my terminology around “geo-distribution”, in that I don’t always make it easy to distinguish between:
- Storing different parts of a database in different geographies, often for reasons of data privacy regulatory compliance.
- Replicating an entire database into different geographies, often for reasons of latency and/or availability/disaster recovery.
The latter case can be subdivided further depending on whether multiple copies of the data can accept first writes (aka active-active, multi-master, or multi-active), or whether there’s a clear single master for each part of the database.
What made me think of this was a phone call with MongoDB in which I learned that the limit on the number of replicas had been raised from 12 to 50, to support the full-replication/latency-reduction use case.
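To make the distinction in point 1 concrete, here is a minimal Python sketch. Everything in it is hypothetical and for illustration only: the region names, the `home_region` field, and the three functions are assumptions, not any vendor's API.

```python
# Hypothetical regions; not taken from any product discussed in this post.
REGIONS = ["us", "eu", "apac"]

def partitioned_placement(record):
    """Case 1: a record is stored only in its home geography
    (e.g. for data privacy regulatory compliance)."""
    return [record["home_region"]]

def replicated_placement(record):
    """Case 2: every geography holds a full copy
    (e.g. for latency or availability/disaster recovery)."""
    return list(REGIONS)

def writable_regions(record, multi_master):
    """Sub-case of full replication: which copies accept first writes.
    With multi-master, all of them; otherwise a single master, which this
    sketch assumes is the record's home region."""
    return list(REGIONS) if multi_master else [record["home_region"]]

customer = {"id": 42, "home_region": "eu"}
print(partitioned_placement(customer))                  # stored in one region
print(replicated_placement(customer))                   # stored in all regions
print(writable_regions(customer, multi_master=False))   # single master
```

The point of separating the three functions is that where data is stored and where it can first be written are independent choices.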
2. Three years ago I posted about agile (predictive) analytics. One of the points was:
… if you change your offers, prices, ad placement, ad text, ad appearance, call center scripts, or anything else, you immediately gain new information that isn’t well-reflected in your previous models.
Subsequently I’ve been hearing more about predictive experimentation such as bandit testing. WibiData, whose views are influenced by a couple of Very Famous Department Store clients (one of which is Macy’s), thinks experimentation is quite important. And it could be argued that experimentation is one of the simplest and most direct ways to increase the value of your data.
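For readers who haven't seen bandit testing, here is a minimal sketch of one common variant, epsilon-greedy, in Python. The conversion rates, round count, and function name are made-up assumptions, and nothing here is meant to describe how WibiData or its clients actually run experiments.

```python
import random

def epsilon_greedy(true_rates, rounds=10_000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy bandit over several offers.

    true_rates are hidden conversion probabilities (hypothetical numbers);
    the algorithm only ever sees the outcomes of the offers it shows.
    """
    rng = random.Random(seed)
    shows = [0] * len(true_rates)   # how often each offer was shown
    wins = [0] * len(true_rates)    # how often each offer converted
    for _ in range(rounds):
        if rng.random() < epsilon or 0 in shows:
            # Explore: pick an offer at random (also while any offer is untried).
            arm = rng.randrange(len(true_rates))
        else:
            # Exploit: pick the offer with the best observed conversion rate.
            arm = max(range(len(true_rates)), key=lambda i: wins[i] / shows[i])
        shows[arm] += 1
        wins[arm] += rng.random() < true_rates[arm]  # True counts as 1
    return shows, wins

shows, wins = epsilon_greedy([0.02, 0.05, 0.03])
print(shows, wins)
```

Unlike a fixed A/B split, the allocation shifts toward whatever currently looks best while a small share of traffic keeps testing the alternatives, which is why it fits the "new information arrives whenever you change something" point quoted above.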
3. I’d further say that a number of developments, trends or possibilities I’m seeing are or could be connected. These include agile and experimental predictive analytics in general, as noted in the previous point, along with:
- Analytics carried out directly by business users.
- Automation of predictive modeling activities.
- Rapid re-training of models.
Also, the flashiest application I know of for only-moderately-successful KXEN came when one or more large retailers decided to run separate models for each of thousands of stores.
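As a rough illustration of the one-model-per-store idea, here is a small Python sketch that groups rows by store and fits a separate least-squares line for each. The data, column meanings (price and units sold), and function names are all hypothetical; I have no details on what KXEN's customers actually modeled or how.

```python
from collections import defaultdict

def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x on (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx if sxx else 0.0  # flat line if x never varies
    return mean_y - slope * mean_x, slope

def fit_per_store(rows):
    """Group (store_id, price, units) rows by store, then fit one model each."""
    by_store = defaultdict(list)
    for store_id, price, units in rows:
        by_store[store_id].append((price, units))
    return {store: fit_line(points) for store, points in by_store.items()}

# Made-up observations for two stores.
rows = [
    ("s1", 1.0, 10.0), ("s1", 2.0, 8.0), ("s1", 3.0, 6.0),
    ("s2", 1.0, 5.0), ("s2", 2.0, 5.0), ("s2", 3.0, 5.0),
]
models = fit_per_store(rows)
print(models)
```

With thousands of stores the loop is the same; what changes is that nobody can hand-tune each model, which is where automation and rapid re-training come in.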
4. MongoDB, the product, has been refactored to support pluggable storage engines. In connection with that, MongoDB does/will ship with two storage engines – the traditional one and a new one from WiredTiger (but not TokuMX). Both will be equally supported by MongoDB, the company, although there surely are some tiers of support that will get bounced back to WiredTiger.
WiredTiger has the same techie principals as Sleepycat – get the wordplay?! – which was Mike Olson’s company before Cloudera. When asked, Mike spoke of those techies in remarkably glowing terms.
I wouldn’t be shocked if WiredTiger wound up playing the role for MongoDB that InnoDB played for MySQL. What I mean is that there were a lot of use cases for which the MySQL/MyISAM combination was insufficiently serious, but InnoDB turned MySQL into a respectable DBMS.
5. Hadoop’s traditional data distribution story goes something like:
- Data lives on every non-special Hadoop node that does processing.
- This gives the advantage of parallel data scans.
- Sometimes data locality works well; sometimes it doesn’t.
- Of course, if the output of every MapReduce step is persisted to disk, as is the case with Hadoop MapReduce 1, you might create some of your own data locality …
- … but Hadoop is getting away from that kind of strict, I/O-intensive processing model.
However, Cloudera has noticed that some large enterprises really, really like to have storage separate from processing. Hence its recent partnership to work with EMC Isilon. Other storage partnerships, as well as a better fit with S3/object storage kinds of environments, are sure to follow, but I have no details to offer at this time.
6. Cloudera’s count of Spark users in its customer base is currently around 60. That includes everything from playing around to full production.
7. Things still seem to be going well at MemSQL, but I didn’t press for any details that I would be free to report.
8. Speaking of MemSQL, one would think that at some point something newer would replace Oracle et al. in the general-purpose RDBMS world, much as Unix and Linux grew to overshadow the powerful, secure, reliable, cumbersome IBM mainframe operating systems. On the other hand:
- IBM blew away its mainframe competitors and had pretty close to a monopoly. But Oracle has some close and somewhat newer competitors in DB2 and Microsoft SQL Server. Therefore …
- … upstarts have three behemoths to outdo, not just one.
- MySQL, PostgreSQL and to some extent Sybase are still around as well.
Also, perhaps no replacement will be needed. If we subdivide the database management world into multiple categories including:
- General-purpose RDBMS.
- Analytic RDBMS.
- NoSQL.
- Non-relational analytic data stores (perhaps Hadoop-based).
it’s not obvious that the general-purpose RDBMS category on its own requires any new entrants to ever supplant the current leaders.
All that said – if any of the current new entrants do pull off the feat, SAP HANA is probably the best (longshot) guess to do so, and MemSQL the second-best.
9. If you’re a PostgreSQL user with performance or scalability concerns, you might want to check what Citus Data is doing.
Comments
12 Responses to “Thoughts and notes, Thanksgiving weekend 2014”
When you say:
“it’s not obvious that the general-purpose RDBMS category on its own requires any new entrants to ever supplant the current leaders.”
What do you mean by “requires”? Are you saying the products can’t be dramatically improved?
Evan,
The older, larger vendors in a sector have much more previous engineering effort invested in their products, and much more in the way of current engineering resources.
The newer, smaller vendors can have products with more modern architectures and simpler code lines. Who wins?
In the case of general-purpose RDBMS, the older vendors have the option of in essence federating their legacy products with newer ones. If they do that the easy(ier) way, they still keep old-fashioned disk-centric, lock-heavy architectures, and eventually they lose. If they do that the hard way, however, with a thinner/more modular form of commonality among the engines, they can win.
They’ve been trying the easy(ier) way first, but there’s probably still time for them to do it right.
In the realm of big data, I see a split of the monolithic (R)DBMS into two interchangeable parts: a storage engine and a query engine. HDFS and S3 are good examples of storage engines. Impala, Hive, and Spark are examples of query engines.
I think this division gives users two serious advantages: they still own their data, since it is stored in a known format, and they can use several engines on the same data.
I agree with Mike. The people at WiredTiger are very good at building database engines. WiredTiger performance is very impressive.
There is a new effort to make PostgreSQL faster for complex query processing — http://vitessedata.com
[…] are getting the message. Hadoop distributor Cloudera, which also includes Spark in its releases, has about 60 enterprise customers using Spark in some form or another, according to Monash. Other Hadoop distributors, notably Hortonworks and MapR, also offer Spark in […]
Is HANA a general-purpose or an analytical RDBMS?
It looks like an analytical RDBMS that can’t cost-effectively handle big data, and an OLTP database that is not really meant to be row-oriented (columnar seems to be the main theme). Product positioning seems to be a bit fuzzy, I think.
[…] big news in MongoDB 3.0* is the WiredTiger storage engine. The top-level claims for that are that one should “typically” expect […]
[…] Predictive experimentation. […]
[…] Various notes (November, 2014) […]
[…] predictive experimentation I wrote about over Thanksgiving calls naturally for some BI/dashboarding to monitor how it’s […]
[…] Thanksgiving round-up post points to a lot of my prior comments on predictive […]