Amazon and its cloud
Analysis of Amazon’s role in database and analytic technology, especially via the S3/EC2 cloud computing initiative. Also covered are SimpleDB and Amazon’s role as a technology user.
Metamarkets’ back-end technology
This is part of a three-post series:
- Introduction to Metamarkets and Druid
- Druid overview
- Metamarkets’ back-end technology (this post)
The canonical Metamarkets batch ingest pipeline is a bit complicated (a rough code sketch follows the list below).
- Data lands on Amazon S3 (uploaded or because it was there all along).
- Metamarkets processes it, primarily via Hadoop and Pig, to summarize and denormalize it, and then puts it back into S3.
- Metamarkets then pulls the data into Hadoop a second time, to get it ready to be put into Druid.
- Druid is notified, and pulls the data from Hadoop at its convenience.
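As a very rough illustration of the front half of that pipeline, here is a toy Python version of the summarize/denormalize step, with local JSON files standing in for S3 and a plain function standing in for the Hadoop/Pig job. The field names and the hour/publisher rollup key are my own inventions, not Metamarkets’ schema.

```python
# Toy, local stand-in for the batch ingest flow described above.
# Files on disk play the role of S3; a plain Python function plays the
# role of the Hadoop/Pig job. None of this is Metamarkets' real code.
import json
from collections import defaultdict
from pathlib import Path

RAW = Path("raw_events.json")         # stage 1: the data that "lands on S3"
SUMMARIZED = Path("summarized.json")  # stage 2/3 output, written "back to S3"

def summarize(raw_path: Path, out_path: Path) -> None:
    """Stand-in for the Hadoop/Pig job: roll raw events up by (hour, publisher)."""
    events = json.loads(raw_path.read_text())
    rollup = defaultdict(lambda: {"impressions": 0, "revenue": 0.0})
    for e in events:
        key = (e["timestamp"][:13], e["publisher"])   # truncate timestamp to the hour
        rollup[key]["impressions"] += 1
        rollup[key]["revenue"] += e["price"]
    out_path.write_text(json.dumps(
        [{"hour": h, "publisher": p, **agg} for (h, p), agg in rollup.items()]
    ))

if __name__ == "__main__":
    RAW.write_text(json.dumps([
        {"timestamp": "2012-05-01T01:23:45", "publisher": "acme", "price": 0.02},
        {"timestamp": "2012-05-01T01:59:59", "publisher": "acme", "price": 0.03},
    ]))
    summarize(RAW, SUMMARIZED)      # summarize/denormalize, then write back
    print(SUMMARIZED.read_text())   # the hand-off to Druid is sketched further down
```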
By “get data ready to be put into Druid” I mean:
- Build the data segments (recall that Druid manages data in rather large segments).
- Note metadata about the segments.
That metadata is what goes into the MySQL database, which also retains data about shards that have been invalidated. (That part is needed because of Druid’s MVCC approach.)
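For concreteness, here is roughly the kind of record that segment metadata boils down to, expressed as a Python dict. The field names are loosely patterned on Druid’s segment table, but they are my guesses, not anything Metamarkets has confirmed.

```python
# A guess at what one segment-metadata record might look like.
# Field names are illustrative, loosely modeled on Druid's segment table.
segment_metadata = {
    "id": "events_2012-05-01T01:00:00_2012-05-01T02:00:00_v3_shard0",
    "dataSource": "events",
    "start": "2012-05-01T01:00:00",
    "end": "2012-05-01T02:00:00",
    "version": "v3",          # MVCC: a newer version supersedes older shards
    "used": True,             # flipped to False once the segment is invalidated
    "payload": {"loadSpec": {"type": "hdfs", "path": "/segments/events/shard0"}},
}
```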
By “build the data segments” I mean (a toy sketch follows this list):
- Make the sharding decisions.
- Arrange data columnarly within each shard.
- Build a compressed bitmap for each shard.
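A toy Python illustration of the columnar arrangement and the bitmaps might look like the following. It sketches the general technique (one bitmap per distinct dimension value) rather than Druid’s actual segment format, and it skips the compression step.

```python
# Toy segment builder: column-orient some rows, then build one bitmap per
# distinct value of a dimension. Real systems compress these bitmaps; here
# a plain Python int serves as an uncompressed bitmap.
rows = [
    {"publisher": "acme", "impressions": 10},
    {"publisher": "zeta", "impressions": 7},
    {"publisher": "acme", "impressions": 3},
]

# 1. Arrange the data columnarly.
columns = {name: [r[name] for r in rows] for name in rows[0]}

# 2. Build a bitmap index on the "publisher" dimension.
bitmaps = {}
for i, value in enumerate(columns["publisher"]):
    bitmaps[value] = bitmaps.get(value, 0) | (1 << i)

# Rows where publisher == "acme": bits 0 and 2 are set.
print(columns)                 # {'publisher': [...], 'impressions': [...]}
print(bin(bitmaps["acme"]))    # 0b101
```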
When things are being done that way, Druid may be regarded as comprising three kinds of servers: Read more
Notes on the Hadoop and HBase markets
I visited my clients at Cloudera and Hortonworks last week, along with scads of other companies. A few of the takeaways were:
- Cloudera now has 220 employees.
- Cloudera now has over 100 subscription customers.
- Over the past year, Cloudera has more than doubled in size by every reasonable metric.
- Over half of Cloudera’s customers use HBase, vs. a figure of 18+ last July.
- Omer Trajman — who by the way has made a long-overdue official move into technical marketing — can no longer keep count of how many petabyte-scale Hadoop clusters Cloudera supports.
- Cloudera gets the majority of its revenue from subscriptions. However, professional services and training continue to be big businesses too.
- Cloudera has trained over 12,000 people.
- Hortonworks is training people too.
- Hortonworks now has 70 employees, and plans to have 100 or so by the end of this quarter.
- A number of those Hortonworks employees are executives who come from seriously profit-oriented backgrounds. Hortonworks clearly has capitalist intentions.
- Hortonworks thinks a typical enterprise Hadoop cluster has 20-50 nodes, with 50-100 already being on the large side.
- There are huge amounts of Elastic MapReduce/Hadoop processing in the Amazon cloud. Some estimates say it’s the majority of all Amazon Web Services processing.
- I met with 4 young-company clients who I regard as building vertical analytic stacks (WibiData, MarketShare, MetaMarkets, and ClearStory). All 4 are heavily dependent on Hadoop. (The same isn’t as true of older companies who built out a lot of technology before Hadoop was invented.)
- There should be more HBase information at HBaseCon on May 22.
- If MapR still has momentum, nobody I talked with has noticed.
Quick thoughts on Oracle-on-Amazon
Amazon has a page up for what it calls Amazon RDS for Oracle Database. You can rent Amazon instances suitable for running Oracle, and bring your own license (BYOL), or you can rent a “License Included” instance that includes Oracle Standard Edition One (a cheap version of Oracle that is limited to two sockets).
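For what it’s worth, provisioning such an instance takes only a few lines of scripting. Below is a minimal sketch using the boto3 Python SDK; the instance identifier, instance class, storage size, and credentials are placeholders I made up, and the license-model parameter is the part that matters.

```python
# Minimal sketch: spin up an RDS instance running Oracle SE One with the
# "License Included" model, via the boto3 SDK. Identifiers, sizes, and
# credentials below are placeholders, not recommendations.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.create_db_instance(
    DBInstanceIdentifier="demo-oracle",
    DBInstanceClass="db.m1.large",
    Engine="oracle-se1",                  # Oracle Standard Edition One
    LicenseModel="license-included",      # or "bring-your-own-license" for BYOL
    AllocatedStorage=100,                 # gigabytes
    MasterUsername="admin",
    MasterUserPassword="change-me-please",
)
```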
My quick thoughts start:
- Mainly, this isn’t for production usage. But exceptions might arise when:
- An application is expected to have only a short lifespan, from creation to abandonment, in support of a specific project.
- There is an extreme internal-politics bias toward operating versus capital expenses, or something like that, forcing a user department into cloud production deployment even when it doesn’t make much rational sense.
- An application is small enough, or the situation is sufficiently desperate, that any inefficiencies are outweighed by convenience.
- There is non-production appeal. In particular:
- Spinning up a quick cloud instance can make a lot of sense for a developer.
- The same goes if you want to sell an Oracle-based application and need to offer demo/test capabilities.
- The same might go for off-site replication/disaster recovery.
Of course, those are all standard observations every time something that’s basically on-premises software is offered in the cloud. They’re only reinforced by the fact that the only Oracle software Amazon can actually license you is a particularly low-end edition.
And Oracle is indeed on-premises software. In particular, Oracle is hard enough to manage when it’s on your premises, with a known hardware configuration; who would want to try to manage a production instance of Oracle in the cloud?
Categories: Amazon and its cloud, Cloud computing, Oracle | 7 Comments |
Cassandra technical overview
Back in March, I talked with Jonathan Ellis of Rackspace, who runs the Apache Cassandra project. I started drafting a blog post then, but never put it up. Then Jonathan cofounded Riptano, a company to commercialize Cassandra, and so I talked with him again in May. Well, I’m finally finding time to clear my Cassandra/Riptano backlog. I’ll cover the more technical parts below, and the more business- or usage-oriented ones in a companion Cassandra/Riptano post.
Jonathan’s core claims for Cassandra include:
- Cassandra is shared-nothing.
- Cassandra has good approaches to replication and partitioning, right out of the box.
- In particular, Cassandra is good for use cases that distribute a database around the world and want to access it at “local” latencies. (Indeed, Jonathan asserts that non-local replication is a significant non-big-data Cassandra use case.)
- Cassandra’s scale-out is application-transparent, unlike sharded MySQL’s.
- Cassandra is fast at both appends and range queries, which would be hard to accomplish in a pure key-value store.
In general, Jonathan positions Cassandra as being best-suited to handle a small number of operations at high volume, throughput, and speed. The rest of what you do, as far as he’s concerned, may well belong in a more traditional SQL DBMS. Read more
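To make the partitioning and application-transparent scale-out claims a bit more concrete, here is a toy consistent-hashing ring in Python. It sketches the general technique Cassandra relies on, not Cassandra’s actual code, and the node names are invented.

```python
# Toy consistent-hashing ring, the general technique behind Cassandra's
# application-transparent partitioning: each node owns a slice of the hash
# space, and adding a node moves only a fraction of the keys.
import bisect
import hashlib

def token(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self._ring = sorted((token(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        tokens = [t for t, _ in self._ring]
        i = bisect.bisect(tokens, token(key)) % len(self._ring)
        return self._ring[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
for k in ("user:1", "user:2", "user:3"):
    print(k, "->", ring.node_for(k))   # the application never names a node
```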
Categories: Amazon and its cloud, Cassandra, DataStax, Facebook, Google, Log analysis, NoSQL, Open source, Parallelization | 4 Comments |
Daniel Abadi on NoSQL design tradeoffs
In a thought-provoking post, Daniel Abadi points out NoSQL-related terminological problems similar to the ones I just railed against, and argues
To me, CAP should really be PACELC — if there is a partition (P) how does the system tradeoff between availability and consistency (A and C); else (E) when the system is running as normal in the absence of partitions, how does the system tradeoff between latency (L) and consistency (C)?
and goes on to say
For example, Amazon’s Dynamo (and related systems like Cassandra and SimpleDB) are PA/EL in PACELC — upon a partition, they give up consistency for availability; and under normal operation they give up consistency for lower latency. Giving up C in both parts of PACELC makes the design simpler — once the application is configured to be able to handle inconsistencies, it makes sense to give up consistency for both availability and lower latency.
However, I think Daniel’s improved formulation is still misleading, in at least two ways:
- Daniel implicitly assumes any given NoSQL system makes a fixed set of tradeoffs, when actually — as he in fact notes in his post — some of them offer tradeoffs that are quite tunable.
- I think Daniel is at best oversimplifying when he appears to assert that best-case network latency is an important design criterion for all that many NoSQL systems. Naively, anything that acknowledges reads or writes requires two hops, while two-phase commit (2PC) requires three; cutting three hops to two is roughly a 33% latency reduction, and reductions of that size are not the kinds of goals that drive dramatic DBMS redesigns, even though tenths of seconds — i.e. 100s of milliseconds — matter in the kinds of environments where NoSQL is sprouting up.
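To illustrate both the tunability point and the latency arithmetic, here is a toy Python calculation. It assumes the usual quorum-style model (N replicas, wait for W acknowledgments), which is a reasonable reading of how systems like Cassandra expose the knob; the per-replica latencies are invented numbers.

```python
# Toy model of the latency/consistency knob: with N replicas, a write is
# acknowledged once W replicas respond, so its latency is the W-th fastest
# replica's response time. Replica latencies (ms) are invented numbers.
replica_latency_ms = [12, 20, 95]   # N = 3 replicas, one of them far away

def write_latency(w: int) -> int:
    """Latency of a write acknowledged by the w fastest replicas."""
    return sorted(replica_latency_ms)[w - 1]

for w in (1, 2, 3):                  # roughly ONE / QUORUM / ALL
    print(f"W={w}: {write_latency(w)} ms")
# Dropping from W=3 to W=2, or from three network hops to two, is the kind
# of fractional latency win discussed above.
```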
Categories: Amazon and its cloud, Cassandra, NoSQL, Theory and architecture | 2 Comments |
Read-your-writes (RYW), aka immediate, consistency
In which we reveal the fundamental inequality of NoSQL, and why NoSQL folks are so negative about joins.
Discussions of NoSQL design philosophies tend to quickly focus in on the matter of consistency. “Consistency”, however, turns out to be a rather overloaded concept, and confusion often ensues.
In this post I plan to address one essential subject, while ducking various related ones as hard as I can. It’s what Werner Vogels of Amazon called read-your-writes consistency (a term to which I was actually introduced by Justin Sheehy of Basho). It’s either identical or very similar to what is sometimes called immediate consistency, and presumably also to what Amazon has recently called the “read my last write” capability of SimpleDB.
This is something every database-savvy person should know about, but most so far still don’t. I didn’t myself until a few weeks ago.
Considering the many different kinds of consistency outlined in the Werner Vogels link above or in the Wikipedia consistency models article — whose names may not always be used in, er, a wholly consistent manner — I don’t think there’s much benefit to renaming read-your-writes consistency yet again. Rather, let’s just call it RYW consistency, come up with a way to pronounce “RYW”, and have done with it. (I suggest “ree-ooh”, which evokes two syllables from the original phrase. Thoughts?)
Definition: RYW (Read-Your-Writes) consistency is achieved when the system guarantees that, once a record has been updated, any attempt to read the record will return the updated value.
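One standard way to obtain RYW consistency in a replicated store is the quorum rule: if each item has N replicas, a write waits for W acknowledgments, and a read consults R replicas (returning the freshest version it sees), then requiring R + W > N forces every read set to overlap the most recent write set. The toy Python check below verifies that overlap by brute force; the replica counts are arbitrary.

```python
# Quorum rule check: with N replicas, W write acks and R read responses,
# R + W > N forces the read set and the most recent write set to overlap,
# so a reader always sees at least one copy of its own latest write.
from itertools import combinations

def ryw_guaranteed(n: int, w: int, r: int) -> bool:
    """True if every possible read set intersects every possible write set."""
    replicas = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(replicas, w)
               for rs in combinations(replicas, r))

for n, w, r in [(3, 2, 2), (3, 1, 1), (3, 3, 1)]:
    print(f"N={n} W={w} R={r}: overlap guaranteed = {ryw_guaranteed(n, w, r)}",
          "| R + W > N =", r + w > n)
```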
Categories: Amazon and its cloud, NoSQL, OLTP, Parallelization, Theory and architecture | 24 Comments |
Some NoSQL links
I plan to post a few things soon about MongoDB, Cassandra, and NoSQL in general. So I’m poking around a bit reading stuff on the subjects. Here are some links I found. Read more
Categories: Amazon and its cloud, Cassandra, Continuent, Google, MySQL, NoSQL, Open source, RDF and graphs, Tokutek and TokuDB | 5 Comments |
Introduction to Gooddata
Around the end of the Cold War, Esther Dyson took it upon herself to go repeatedly to Eastern Europe and do a lot of rah-rah and catalysis, hoping to spark software and other computer entrepreneurs. I don’t know how many people’s lives she significantly affected – I’d guess it’s actually quite a few – but in any case the number is not zero. Roman Stanek, who has built and sold a couple of software businesses, cites her as a key influence setting him on his path.
Roman’s latest venture is business intelligence firm Gooddata. Gooddata was founded in 2007 and has been soliciting and getting attention for a while, so I was surprised to learn that Gooddata officially launched just a few weeks ago. Anyhow, some less technical highlights of the Gooddata story include: Read more
Sneakernet to the cloud
Recently, Amazon CTO Werner Vogels put up a blog post which suggested that, now and in the future, the best way to get large databases into the cloud is via sneakernet. In some circumstances, he is surely right. Possible implications include:
- When sending data to the cloud, you probably want to compress it to the max before sending. Clearpace’s new RainStor structured-data archiving service emphasizes that idea. RainStor marketing says cloud, cloud, cloud — but Clearpace thinks you really should have a bit of its software onsite too, to compress the data before sending it across the wire.
- Getting data from one cloud to another cloud could be problematic. I’m fond of saying that weblog data naturally lives in the cloud at your hosting company’s location, so you should analyze it there too. But this makes the most sense if you analyze it or at least filter/reduce it in place. (That said, the really, really big web companies have lots of different data centers, and presumably do move huge amounts of log data from place to place.)
But for one-time moves of data sets — sure, sneaker net/snail mail should work just fine.
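The back-of-the-envelope arithmetic behind that conclusion is easy to run for yourself. The database size and link speed in the sketch below are placeholder numbers, not anyone’s benchmark.

```python
# Back-of-the-envelope: how long does it take to push a big database over
# a WAN link vs. overnighting disks? All numbers are illustrative.
db_terabytes = 50
link_megabits_per_sec = 100            # a generously provisioned WAN link

bits = db_terabytes * 8 * 10**12
seconds = bits / (link_megabits_per_sec * 10**6)
print(f"Network transfer: {seconds / 86400:.1f} days")   # ~46 days

print("Sneakernet: a couple of days of copying plus an overnight shipment")
```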
Categories: Amazon and its cloud, Cloud computing, Database compression, EAI, EII, ETL, ELT, ETLT, Web analytics | 2 Comments |
Maybe Amazon should be using a real DBMS after all
Amazon managers found that an employee who happened to work in France had filled out a field incorrectly and more than 50,000 items got flipped over to be flagged as “adult,” the source said. (Technically, the flag for adult content was flipped from ‘false’ to ‘true.’)
“It’s no big policy change, just some field that’s been around forever filled out incorrectly,” the source said.
Amazon employees worked on the problem well past midnight, and then handed it over to an international team, he said.
This was the best practice for reversing an error — how? Is SimpleDB somehow implicated? If this story is remotely true, and if there’s a sensible database architecture, I can’t imagine why there wouldn’t be a faster fix.
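If the flag really lives in something with reasonable bulk-update support, the fix looks close to a one-liner. Here is a minimal sketch, with sqlite3 standing in for whatever store Amazon actually uses, and with an invented column recording which load batch last touched the flag:

```python
# Sketch of the "faster fix": one bulk UPDATE that flips the adult flag back
# for the affected rows. sqlite3 is a stand-in for whatever store Amazon
# actually uses; the table and column names are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE catalog_items (
        item_id INTEGER PRIMARY KEY,
        adult_flag INTEGER NOT NULL,
        flag_changed_in_batch TEXT
    )
""")
conn.executemany(
    "INSERT INTO catalog_items VALUES (?, ?, ?)",
    [(1, 1, "batch-2009-04-12"), (2, 1, None), (3, 1, "batch-2009-04-12")],
)

# Revert every row whose flag was set by the bad batch.
conn.execute(
    "UPDATE catalog_items SET adult_flag = 0 WHERE flag_changed_in_batch = ?",
    ("batch-2009-04-12",),
)
print(conn.execute("SELECT item_id, adult_flag FROM catalog_items").fetchall())
# [(1, 0), (2, 1), (3, 0)]
```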
Categories: Amazon and its cloud | 7 Comments |