Parallelization
Analysis of issues in parallel computing, especially parallelized database management. Related subjects include:
Eight kinds of analytic database (Part 1)
Analytic data management technology has blossomed, leading to many questions along the lines of “So which products should I use for which category of problem?” The old EDW/data mart dichotomy is hopelessly outdated for that purpose, and adding a third category for “big data” is little help.
Let’s try eight categories instead. While no categorization is ever perfect, these each have at least some degree of technical homogeneity. Figuring out which types of analytic database you have or need — and in most cases you’ll need several — is a great early step in your analytic technology planning. Read more
Forthcoming Oracle appliances
Edit: I checked with Oracle, and it’s indeed TimesTen that’s supposed to be the basis of this new appliance, as per a comment below. That would be less cool, alas.
Oracle seems to have said on yesterday’s conference call Oracle OpenWorld (first week in October) will feature appliances based on Tangosol and Hadoop. As I post this, the Seeking Alpha transcript of Oracle’s call is riddled with typos. Bolded comments below are by me. Read more
Categories: Data warehouse appliances, Hadoop, In-memory DBMS, MapReduce, Memory-centric data management, Object, Oracle | 8 Comments |
The Vertica story (with soundbites!)
I’ve blogged separately that:
- Vertica has a bunch of customers, including seven with 1 or more petabytes of data each.
- Vertica has progressed down the analytic platform path, with Monday’s release of Vertica 5.0.
And of course you know:
- Vertica (the product) is columnar, MPP, and fast.*
- Vertica (the company) was recently acquired by HP.**
Categories: Benchmarks and POCs, Columnar database management, ParAccel, Parallelization, Vertica Systems | 4 Comments |
Metaphors amok
It all started when I disputed James Kobielus’ blogged claim that Hadoop is the nucleus of the next-generation cloud EDW. Jim posted again to reiterate the claim, only this time he wrote that all EDW vendors [will soon] bring Hadoop into their heart of their architectures. (All emphasis mine.)
That did it. I tweeted, in succession:
- Actually, I vote for Hadoop as the lungs of the EDW — first place of entry for essential nutrients.
- Data integration can be the heart of the EDW, pumping stuff around. RDBMS/analytic platform can be the brain.
- iPad-based dashboards that may engender envy, but which actually are only used occasionally and briefly … well, you get the picture.*
*Woody Allen said in Sleeper that the brain was his second-favorite organ.
Of course, that body of work was quickly challenged. Responses included: Read more
Categories: Analytic technologies, Business intelligence, Data warehousing, EAI, EII, ETL, ELT, ETLT, Fun stuff, Hadoop, Humor, MapReduce | Leave a Comment |
Patent nonsense: Parallel Iron/HDFS edition
Alan Scott commented with concern about Parallel Iron’s patent lawsuit attacking HDFS (Hadoop Distributed File System), filed in — where else? — Eastern Texas. The patent in question — US 7,415,565 — seems to in essence cover any shared-nothing block storage that exploits a “configurable switch fabric”; indeed, it’s more oriented to OLTP (OnLine Transaction Processing) than to analytics. For example, the Background section starts: Read more
Categories: EMC, Hadoop, MapReduce, Parallelization, Storage | 9 Comments |
Hadoop confusion from Forrester Research
Jim Kobielus started a recent post
Most Hadoop-related inquiries from Forrester customers come to me. These have moved well beyond the “what exactly is Hadoop?” phase to the stage where the dominant query is “which vendors offer robust Hadoop solutions?”
What I tell Forrester customers is that, yes, Hadoop is real, but that it’s still quite immature.
So far, so good. But I disagree with almost everything Jim wrote after that.
Jim’s thesis seems to be that Hadoop will only be mature when a significant fraction of analytic DBMS vendors have own-branded versions of Hadoop alongside their DBMS, possibly via acquisition. Based on this, he calls for a formal, presumably vendor-driven Hadoop standardization effort, evidently for the whole Hadoop stack. He also says that
Hadoop is the nucleus of the next-generation cloud EDW, but that promise is still 3-5 years from fruition
where by “cloud” I presume Jim means first and foremost “private cloud.”
I don’t think any of that matches Hadoop’s actual strengths and weaknesses, whether now or in the 3-7 year future. My reasoning starts:
- Hadoop is well on its way to being a surviving data-storage-plus-processing system — like an analytic DBMS or DBMS-imitating data integration tool …
- … but Hadoop is best-suited for somewhat different use cases than those technologies are, and the gap won’t close as long as the others remain a moving target.
- I don’t think MapReduce is going to fail altogether; it’s too well-suited for too many use cases.
- Hadoop (as opposed to general MapReduce) has too much momentum to fizzle, perhaps unless it is supplanted by one or more embrace-and-extend MapReduce-plus systems that do a lot more than it does.
- The way for Hadoop to avoid being a MapReduce afterthought is to evolve sufficiently quickly itself; ponderous standardization efforts are quite beside the point.
As for the rest of Jim’s claim — I see three main candidates for the “nucleus of the next-generation enterprise data warehouse,” each with better claims than Hadoop:
- Relational DBMS, much like today. (E.g., Teradata, DB2, Exadata or their successors.) This is the case in which robustness of the central data store matters most.
- Grand cosmic data integration tools. (The descendants of Informatica PowerCenter, et al.) This is the case in which the logic of data relationships can safely be separated from physical storage.
- Nothing. (The architecture could have several strong members, none of which is truly the “nucleus.”) This is the case in which new ways keep being invented to extract high value from data, outrunning what grandly centralized solutions can adapt to. I think this is the most likely case of all.
Categories: Data integration and middleware, EAI, EII, ETL, ELT, ETLT, Hadoop, MapReduce, Theory and architecture | 9 Comments |
Alternatives for Hadoop/MapReduce data storage and management
There’s been a flurry of announcements recently in the Hadoop world. Much of it has been concentrated on Hadoop data storage and management. This is understandable, since HDFS (Hadoop Distributed File System) is quite a young (i.e. immature) system, with much strengthening and Bottleneck Whack-A-Mole remaining in its future.
Known HDFS and Hadoop data storage and management issues include but are not limited to:
- Hadoop is run by a master node, and specifically a namenode, that’s a single point of failure.
- HDFS compression could be better.
- HDFS likes to store three copies of everything, whereas many DBMS and file systems are satisfied with two.
- Hive (the canonical way to do SQL joins and so on in Hadoop) is slow.
Different entities have different ideas about how such deficiencies should be addressed. Read more
Categories: Aster Data, Cassandra, Cloudera, Data warehouse appliances, DataStax, EMC, Greenplum, Hadapt, Hadoop, IBM and DB2, MapReduce, MongoDB, Netezza, Parallelization | 22 Comments |
Data integration vendors and Hadoop
There have been many recent announcements about how data integration/ETL (Extract/Transform/Load) vendors are going to work with MapReduce. Most of what they say boils down to one or more of a few things:
- Hadoop generally stores data in HDFS (Hadoop Distributed File System). ETL vendors want to be able to extract data from or load it into HDFS.
- ETL vendors have development environments that let you specify/script/whatever ETL jobs. ETL vendors want their development tools to develop ETL processes executed via MapReduce/Hadoop.
- In particular, this allows ETL vendors to exploit the parallel-processing capabilities of MapReduce.
Some additional twists include:
- Pentaho announced business intelligence and ETL for Hadoop last year.
- Syncsort thinks different sort algorithms should be usable with Hadoop. Consequently, it plans to contribute technology to the community to make sort pluggable into Hadoop. (However, Syncsort is keeping its own sort technology proprietary.)
- Syncsort is considering replicating some Hive functionality, starting with joins, hopefully running much faster. (However, Syncsort’s basic Hadoop support is a quarter or three away, so any more advanced functionality would probably come out in 2012 or beyond.)
- SnapLogic fondly thinks that its generation of MapReduce jobs is particularly intelligent.
Finally, my former clients at Pervasive, who haven’t briefed me for a while, seem to have told Doug Henschen that they have pointed DataRush at MapReduce.* However, I couldn’t find evidence of same on the Pervasive DataRush website beyond some help in using all the cores on any one Hadoop node.
*Also see that article because it names a bunch of ETL vendors doing Hadoop-related things.
Categories: Data integration and middleware, EAI, EII, ETL, ELT, ETLT, Hadoop, MapReduce, Parallelization, Pentaho, Pervasive Software, SnapLogic, Syncsort | 1 Comment |
DB2 OLTP scale-out: pureScale
Tim Vincent of IBM talked me through DB2 pureScale Monday. IBM DB2 pureScale is a kind of shared-disk scale-out parallel OTLP DBMS, with some interesting twists. IBM’s scalability claims for pureScale, on a 90% read/10% write workload, include:
- 95% scalability up to 64 machines
- 90% scalability up to 88 machines
- 89% scalability up to 112 machines
- 84% scalability up to 128 machines
More precisely, those are counts of cluster “members,” but the recommended configuration is one member per operating system instance — i.e. one member per machine — for reasons of availability. In an 80% read/20% write workload, scalability is less — perhaps 90% scalability over 16 members.
Several elements are of IBM’s DB2 pureScale architecture are pretty straightforward:
- There are multiple pureScale members (machines), each with its own instance of DB2.
- There’s an RDMA (Remote Direct Memory Access) interconnect, perhaps InfiniBand. (The point of InfiniBand and other RDMA is that moving data doesn’t require interrupts, and hence doesn’t cost many CPU cycles.)
- The DB2 pureScale members share access to the database on a disk array.
- Each DB2 pureScale member has its own log, also on the disk array.
Something called GPFS (Global Parallel File System), which comes bundled with DB2, sits underneath all this. It’s all based on the mainframe technology IBM Parallel Sysplex.
The weirdest part (to me) of DB2 pureScale is something called the Global Cluster Facility, which runs on its own set of boxes. (Edit: Actually, see Tim Vincent’s comment below.) Read more
Categories: Cache, Clustering, IBM and DB2, OLTP, Oracle | 15 Comments |
Oracle and Exadata: Business and technical notes
Last Friday I stopped by Oracle for my first conversation since January, 2010, in this case for a chat with Andy Mendelsohn, Mark Townsend, Tim Shetler, and George Lumpkin, covering Exadata and the Oracle DBMS. Key points included: Read more