MapReduce
Analysis of implementations of and issues associated with the parallel programming framework MapReduce. Related subjects include:
Notes on HBase 0.92
This is part of a four-post series, covering:
- Annoying Hadoop marketing themes that should be ignored.
- Hadoop versions and distributions, and their readiness or lack thereof for production.
- In general, how “enterprise-ready” is Hadoop?
- HBase 0.92 (this post)
As part of my recent round of Hadoop research, I talked with Cloudera’s Todd Lipcon. Naturally, one of the subjects was HBase, and specifically HBase 0.92. I gather that the major themes to HBase 0.92 are:
- Performance, scalability, and so on.
- “Coprocessors”, which are like triggers or stored procedures.
- Security, as the first major application of co-processors.
HBase coprocessors are Java code that links straight into HBase. As with other DBMS extensions of the “links straight into the DBMS code” kind,* HBase coprocessors seem best suited for very sophisticated users and third parties.** Evidently, coprocessors have already been used to make HBase security more granular — role-based, per-column-family/per-table, etc. Further, Todd thinks coprocessors could serve as a good basis for future HBase enhancements in areas such as aggregation or secondary indexing. Read more
Categories: Benchmarks and POCs, Cloudera, Hadoop, HBase, MapReduce, NoSQL, Open source, Storage, Theory and architecture | 2 Comments |
“Enterprise-ready Hadoop”
This is part of a four-post series, covering:
- Annoying Hadoop marketing themes that should be ignored.
- Hadoop versions and distributions, and their readiness or lack thereof for production.
- In general, how “enterprise-ready” is Hadoop (this post)?
- HBase 0.92.
The posts depend on each other in various ways.
Cloudera, Hortonworks, and MapR all claim, in effect, “Our version of Hadoop is enterprise-ready, unlike those other guys’.” I’m dubious.
- Hortonworks has considerably fewer features than Cloudera, along with less of a production or support track record. (Edit: HCatalog may be a significant exception.)
- I doubt Cloudera really believes or can support the apparent claim in its CDH 4 press release that Hadoop is now suitable for every enterprise, whereas last month it wasn’t.
- While MapR was early with some nice enterprise features, such as high availability or certain management UI elements — quickly imitated in Cloudera Enterprise — I don’t think it has any special status as “enterprise-ready” either.
That said, “enterprise-ready Hadoop” really is an important topic.
So what does it mean for something to be “enterprise-ready”, in whole or in part? Common themes in distinguishing between “enterprise-class” and other software include:
- Usable by our existing staff.
- Sufficiently feature-rich.
- Integrates well with the rest of our environment.
- Fits well into our purchasing and vendor relations model.
- Well-supported.
- Sufficiently reliable, proven, and secure — which is to say, “safe”.
For Hadoop, as for most things, these concepts overlap in many ways. Read more
Categories: Buying processes, Cloudera, Clustering, Hadoop, HBase, Hortonworks, MapR, MapReduce, Open source | 9 Comments |
Hadoop distributions: CDH 4, HDP 1, Hadoop 2.0, Hadoop 1.0 and all that
This is part of a four-post series, covering:
- Annoying Hadoop marketing themes that should be ignored.
- Hadoop versions and distributions, and their readiness or lack thereof for production (this post).
- In general, how “enterprise-ready” is Hadoop?
- HBase 0.92.
The posts depend on each other in various ways.
My clients at Cloudera and Hortonworks have somewhat different views as to the maturity of various pieces of Hadoop technology. In particular:
- Cloudera introduced CDH 4* and Hortonworks introduced HDP 1*, both timed for the recent Hadoop Summit.
- CDH 4 is based mainly on Hadoop 2.0, which Cloudera says it has tested extensively.
- HDP 1 is based on Hadoop 1.0, on the theory that nobody has properly tested Hadoop 2.0, which is still characterized as “alpha”.
- CDH 4 boasts sub-second NameNode failover.
- Hortonworks is partnering with third parties such as VMware to address the high-availability problems caused by failover potentially taking several minutes.
- Hadoop 2.0 and CDH 4 also incorporate improvements to NameNode scalability, HDFS (Hadoop Distributed File System) performance, HBase performance, and HBase functionality.
- As does CDH 4, HDP 1 includes HCatalog, an extension of Hive technology that serves as a more general metadata store. (Edit: Actually, see the comment thread below.)
- Hortonworks thinks HCatalog is a big deal in improving Hadoop data management and connectivity, and already has a Talend partnership based on HCatalog. Cloudera is less sure, especially in HCatalog’s current form.
- HDP 1 includes Ambari, an Apache open source competitor to Cloudera Manager (the closed-source part of Cloudera Enterprise). Hortonworks concedes a functionality gap between Ambari and Cloudera Manager, but perhaps a smaller one than Cloudera sees.
- Hortonworks thinks Ambari being open source means better integration with other management platforms. Cloudera touts the integration features and integrations of Cloudera Manager 4.
- Nobody seems confident that MapReduce 2 is ready for prime time. While it’s in CDH 4, so is MapReduce 1.
*”CDH” stands, due to some trademarking weirdness, for “Cloudera’s Distribution including Apache Hadoop”. “HDP” stands for “Hortonworks Data Platform”.
Categories: Amazon and its cloud, Cloudera, Hadoop, Hortonworks, MapReduce, Market share and customer counts | 14 Comments |
Introduction to Cloudant
Cloudant is one of the few NoSQL companies with >100 paying subscription customers. For starters:
- Cloudant’s core software is a fork of CouchDB.
- Cloudant only sells you software as a service.
- More precisely, whether Cloudant offers DBaaS (DataBase as a Service) or PaaS (Platform as a Service) or a “data layer” (Cloudant’s preferred terminology) depends on your taste in buzzwords.
- I gather that Cloudant (the company) wants to handle pretty much all your data management needs. But Cloudant (the product) isn’t there yet, especially on the analytic side.
- Before CouchDB and Membase joined together, Cloudant was positioned as the big(ger) data version of CouchDB.
Company demographics include:
- Cloudant is based in Boston.
- Cloudant started out as a Y Combinator company in 2008, and “got serious” in 2009.
- Cloudant now has ~20 employees.
- Management hires include a couple of former Vertica guys.
The Cloudant guys gave me some customer counts in May that weren’t much higher than those they gave me in February, and seem to have forgotten to correct the discrepancy. Oh well. The latter (probably understated) figures included ~160 paying customers, of which:
- ~100 were multitenant.
- ~60 were single tenant.
- 1 was on-premise (but still managed by Cloudant) because of privacy concerns.
The largest Cloudant deployments seem to be in the 10s of terabytes, across a very low double digit number of servers.
Categories: Cloudant, Clustering, Couchbase, CouchDB, MapReduce, Market share and customer counts, NoSQL, Pricing, Specific users, Storage | 2 Comments |
Notes on the analysis of large graphs
This post is part of a series on managing and analyzing graph data. Posts to date include:
- Graph data model basics
- Relationship analytics definition
- Relationship analytics applications
- Analysis of large graphs (this post)
My series on graph data management and analytics got knocked off-stride by our website difficulties. Still, I want to return to one interesting set of issues — analyzing large graphs, specifically ones that don’t fit comfortably into RAM on a single server. By no means do I have the subject figured out. But here are a few notes on the matter.
How big can a graph be? That of course depends on:
- The number of nodes. If the nodes of a graph are people, there’s an obvious upper bound on the node count. Even if you include their houses, cars, and so on, you’re probably capped in the range of 10 billion.
- The number of edges. (Even more important than the number of nodes.) If every phone call, email, or text message in the world is an edge, that’s a lot of edges.
- The typical size of a (node, edge, node) triple. I don’t know why you’d have to go much over 100 bytes post-compression*, but maybe I’m overlooking something.
*Even if your graph has 10 billion nodes, those can be tokenized in 34 bits, so the main concern is edges. Edges can include weights, timestamps, and so on, but how many specifics do you really need? At some point you can surely rely on a pointer to full detail stored elsewhere.
The biggest graph-size estimates I’ve gotten are from my clients at Yarcdata, a division of Cray. (“Yarc” is “Cray” spelled backwards.) To my surprise, they suggested that graphs about people could have 1000s of edges per node, whether in:
- An intelligence scenario, perhaps with billions of nodes and hence trillions of edges.
- A telecom user-analysis case, with perhaps 100 million nodes and hence 100s of billions of edges.
Yarcdata further suggested that bioinformatics use cases could have node counts higher yet, characterizing Bio2RDF as one of the “smaller” ones at 22 billion nodes. In these cases, the nodes/edge average seems lower than in people-analysis graphs, but we’re still talking about 100s of billions of edges.
Recalling that relationship analytics boils down to finding paths and subgraphs, the naive relational approach to such tasks would be: Read more
Categories: Analytic technologies, Aster Data, Data models and architecture, Hadoop, Health care, MapReduce, RDF and graphs, Scientific research, Telecommunications, Yarcdata and Cray | 20 Comments |
Notes on the Hadoop and HBase markets
I visited my clients at Cloudera and Hortonworks last week, along with scads of other companies. A few of the takeaways were:
- Cloudera now has 220 employees.
- Cloudera now has over 100 subscription customers.
- Over the past year, Cloudera has more than doubled in size by every reasonable metric.
- Over half of Cloudera’s customers use HBase, vs. a figure of 18+ last July.
- Omer Trajman — who by the way has made a long-overdue official move into technical marketing — can no longer keep count of how many petabyte-scale Hadoop clusters Cloudera supports.
- Cloudera gets the majority of its revenue from subscriptions. However, professional services and training continue to be big businesses too.
- Cloudera has trained over 12,000 people.
- Hortonworks is training people too.
- Hortonworks now has 70 employees, and plans to have 100 or so by the end of this quarter.
- A number of those Hortonworks employees are executives who come from seriously profit-oriented backgrounds. Hortonworks clearly has capitalist intentions.
- Hortonworks thinks a typical enterprise Hadoop cluster has 20-50 nodes, with 50-100 already being on the large side.
- There are huge amounts of Elastic MapReduce/Hadoop processing in the Amazon cloud. Some estimates say it’s the majority of all Amazon Web Services processing.
- I met with 4 young-company clients who I regard as building vertical analytic stacks (WibiData, MarketShare, MetaMarkets, and ClearStory). All 4 are heavily dependent on Hadoop. (The same isn’t as true of older companies who built out a lot of technology before Hadoop was invented.)
- There should be more HBase information at HBaseCon on May 22.
- If MapR still has momentum, nobody I talked with has noticed.
DataStax Enterprise and Cassandra revisited
My last post about DataStax Enterprise and Cassandra didn’t go so well. As follow-up, I chatted for two hours with Rick Branson and Billy Bosworth of DataStax. Hopefully I can do better this time around.
For starters, let me say there are three kinds of data management nodes in DataStax Enterprise:
- Vanilla Cassandra.
- Cassandra plus Solr. Solr is a superset of the text-indexing system Lucene.
- Solr adds a lot more secondary indexing to Cassandra.
- In addition, these nodes serve as Solr emulation; you can run generic Solr apps on them.
- Cassandra plus Hadoop.
- You can use Hadoop MapReduce to manipulate generic Cassandra data.
- In addition, these nodes serve as Hadoop/HDFS (Hadoop Distributed File System) emulation; you can run generic Hadoop apps on them.
- Hadoop jobs can interweave access to the two kinds of data structure.
Cassandra, Solr, Lucene, and Hadoop are all Apache projects.
If we look at this from the standpoint of DML (Data Manipulation Language) and data access APIs:
- Cassandra is a column-group kind of NoSQL DBMS. You can get at its data programmatically.
- There’s something called CQL (Cassandra Query Language), said to be SQL-like.
- There’s a JDBC driver for CQL.
- With Hadoop MapReduce also come Hive, Pig, and Sqoop.
- With Solr and Lucene come full-text search.
In addition, it is sometimes recommended that you use “in-entity caching”, where an entire data structure (e.g. in JSON) winds up in a single Cassandra column.
The two main ways to get direct SQL* access to data in DataStax Enterprise are:
- JDBC/SQL.
- Hive/Hadoop.
*or very SQL-like, depending on how you view things
Before going further, let’s recall some Cassandra basics: Read more
Categories: Cassandra, DataStax, Hadoop, MapReduce, Market share and customer counts, NoSQL, Open source, Text | 6 Comments |
Kinds of data integration and movement
“Data integration” can mean many different things, to an extent that’s impeding me from writing about the area. So I’ll start by simply laying out some of the myriad ways that data can be brought to where it is needed, and worry about other subjects later. Yes, this is a massive wall of text, and incomplete even so — but that in itself is my central point.
There are two main paradigms for data integration:
- Movement or replication — you take data from one place and copy it to another.
- Federation — you treat data in multiple different places logically as if it were all in one database.
Data movement and replication typically take one of three forms:
- Logical, transactional, or trigger-based — sending data across the wire every time an update happens, or as the result of a large-result-set query/extract, or in response to a specific request.
- Log-based — like logical replication, but driven by the transaction/update log rather than the core data management mechanism itself, so as to avoid directly overstressing the DBMS.
- Block/file-based — sending chunks of data, and expecting the target system to store them first and only make sense of them afterward.
Beyond the core functions of movement, replication, and/or federation, there are other concerns closely connected to data integration. These include:
- Transparency and emulation, e.g. via a layer of software that makes data in one format look like it’s in another. (If memory serves, this is the use case for which Larry DeBoever coined the term “middleware.”)
- Cleaning and quality — with new uses of data can come new requirements for accuracy.
- Master, reference, or canonical data —
- Archiving and information preservation — part of keeping data safe is ensuring that there are copies at various physical locations. Another part can be making it logically tamper-proof, or at least highly auditable.
In particular, the following are largely different from each other. Read more
Categories: Clustering, Data integration and middleware, EAI, EII, ETL, ELT, ETLT, eBay, Hadoop, MapReduce | 10 Comments |
Hadoop-related market categorization
I wasn’t the only one to be dubious about Forrester Research’s Hadoop taxonomy (or lack thereof). GigaOm’s Derrick Harris was as well, and offered a much superior approach of his own. In Derrick’s view, there’s Hadoop, Hadoop distributions, Hadoop management, and Hadoop applications. Taking those out of order, and recalling that no market categorization is ever precise:
- “Hadoop applications” is a catch-all category. Since Derrick offered suitable caveats around the label, I’m fine with what he said.
- Hadoop management software commonly comes in the form of suites. Derrick’s discussion was solid.
- Derrick seems to want to define “Hadoop” as being whatever is in the relevant Apache projects. Cool. He does seem to wind up on both sides of the “MapR and DataStax put Hadoop MapReduce on top of something that isn’t HDFS — so is that Hadoop or isn’t it?” question, but that’s a tough ambiguity to avoid.
- Derrick could have been a little clearer on the subject of Hadoop distributions.
Let’s drill down into that last one. Derrick refers to Hadoop distributions as “products” that:
package a set of Hadoop projects (MapReduce, Hive, Sqoop, Pig, etc.) in a way that in theory makes them integrate more naturally, and to run both smoothly and securely.
While that’s a reasonable recitation of the idea’s benefits, I’d rather say that a “distribution” of open source software comprises: Read more
Categories: Cloudera, Hadoop, MapReduce, Open source | Leave a Comment |
Comments on the 2012 Forrester Wave: Enterprise Hadoop Solutions
Forrester has released its Q1 2012 Forrester Wave: Enterprise Hadoop Solutions. (Googling turns up a direct link, but in case that doesn’t prove stable, here also is a registration-required link from IBM’s Conor O’Mahony.) My comments include:
- The Forrester Wave’s relative vendor rankings are meaningless, in that the document compares apples, peaches, almonds, and peanuts. Apparently, it covers any vendor that includes a distribution of Apache Hadoop MapReduce into something it offers, and that offered at least two (not necessarily full production) references for same.
- The Forrester Wave for “enterprise Hadoop” contradicts itself on the subject of Hortonworks.
- The Forrester Wave for “enterprise Hadoop” is correct when it says “Hortonworks … has Hadoop training and professional services offerings that are still embryonic.”
- Peculiarly, the Forrester Wave for “enterprise Hadoop” also says “Hortonworks offers an impressive Hadoop professional services portfolio”. Hortonworks will likely win one or more nice partnership deals with vendors in adjacent fields, but even so its professional services capabilities are … well, a good word might be “embryonic”.
- Forrester Waves always seem to have weird implicit definitions of “data warehousing”. This one is no exception.
- Forrester gave top marks in “Functionality” to 11 of 13 “enterprise Hadoop” vendors. This seems odd.
- I don’t know why MapR, which doesn’t like HDFS (Hadoop Distributed File System), got top marks in “Subproject integration”.
- Forrester gave top marks in “Storage” to Datameer. It also gave higher marks to MapR than to EMC Greenplum, even though EMC Greenplum’s technology is a superset of MapR’s. Very strange. (Edit: Actually, as per a comment below, there is some uncertainty about the EMC/MapR relationship.)
- Forrester gave higher marks in “Acceleration and optimization” to Hortonworks than to Cloudera and IBM, and higher marks yet to Pentaho. Very odd.
- I’m not sure what Forrester is calling a “Distributed EDW file store connector”, but it sounds like something that Cloudera has provided via partnership to a number of analytic DBMS vendors.
- Forrester’s “Strategy” rankings seem to correlate to a metric of “We’re a large enough vendor to go in N directions at once”, for various values of N.
- Forrester is correct to rank Cloudera’s “Adoption” as being stronger than EMC/Greenplum’s or MapR’s. But Hortonworks’ strong mark for “Adoption” baffles me.
Categories: Cloudera, Data warehousing, EMC, Greenplum, Hadoop, Hortonworks, MapR, MapReduce, Pentaho | 11 Comments |