Parallelization

Analysis of issues in parallel computing, especially parallelized database management. Related subjects include:

July 5, 2011

Eight kinds of analytic database (Part 1)

Analytic data management technology has blossomed, leading to many questions along the lines of “So which products should I use for which category of problem?” The old EDW/data mart dichotomy is hopelessly outdated for that purpose, and adding a third category for “big data” is little help.

Let’s try eight categories instead. While no categorization is ever perfect, these each have at least some degree of technical homogeneity. Figuring out which types of analytic database you have or need — and in most cases you’ll need several — is a great early step in your analytic technology planning. Read more

Categories: Analytic technologies, Aster Data, Benchmarks and POCs, Business intelligence, Buying processes, Columnar database management, Data warehouse appliances, Data warehousing, Database compression, Database diversity, Exadata, Greenplum, IBM and DB2, Infobright, Investment research and trading, Log analysis, Microsoft and SQL*Server, MOLAP, Netezza, OLTP, Oracle, ParAccel, Parallelization, Petabyte-scale data management, Predictive modeling and advanced analytics, Pricing, QlikTech and QlikView, SAND Technology, Scientific research, Sybase, Teradata, Vertica Systems, Web analytics, Workload management

7 Comments

June 24, 2011

Forthcoming Oracle appliances

Edit: I checked with Oracle, and it’s indeed TimesTen that’s supposed to be the basis of this new appliance, as per a comment below. That would be less cool, alas.

Oracle seems to have said on yesterday’s conference call Oracle OpenWorld (first week in October) will feature appliances based on Tangosol and Hadoop. As I post this, the Seeking Alpha transcript of Oracle’s call is riddled with typos. Bolded comments below are by me. Read more

Categories: Data warehouse appliances, Hadoop, In-memory DBMS, MapReduce, Memory-centric data management, Object, Oracle

8 Comments

June 20, 2011

The Vertica story (with soundbites!)

I’ve blogged separately that:

Vertica has a bunch of customers, including seven with 1 or more petabytes of data each.
Vertica has progressed down the analytic platform path, with Monday’s release of Vertica 5.0.

And of course you know:

Vertica (the product) is columnar, MPP, and fast.*
Vertica (the company) was recently acquired by HP.**

Categories: Benchmarks and POCs, Columnar database management, ParAccel, Parallelization, Vertica Systems

4 Comments

June 15, 2011

Metaphors amok

It all started when I disputed James Kobielus’ blogged claim that Hadoop is the nucleus of the next-generation cloud EDW. Jim posted again to reiterate the claim, only this time he wrote that all EDW vendors [will soon] bring Hadoop into their heart of their architectures. (All emphasis mine.)

That did it. I tweeted, in succession:

Actually, I vote for Hadoop as the lungs of the EDW — first place of entry for essential nutrients.
Data integration can be the heart of the EDW, pumping stuff around. RDBMS/analytic platform can be the brain.
iPad-based dashboards that may engender envy, but which actually are only used occasionally and briefly … well, you get the picture.*

*Woody Allen said in Sleeper that the brain was his second-favorite organ.

Of course, that body of work was quickly challenged. Responses included: Read more

Categories: Analytic technologies, Business intelligence, Data warehousing, EAI, EII, ETL, ELT, ETLT, Fun stuff, Hadoop, Humor, MapReduce

Patent nonsense: Parallel Iron/HDFS edition

Alan Scott commented with concern about Parallel Iron’s patent lawsuit attacking HDFS (Hadoop Distributed File System), filed in — where else? — Eastern Texas. The patent in question — US 7,415,565 — seems to in essence cover any shared-nothing block storage that exploits a “configurable switch fabric”; indeed, it’s more oriented to OLTP (OnLine Transaction Processing) than to analytics. For example, the Background section starts: Read more

Categories: EMC, Hadoop, MapReduce, Parallelization, Storage

9 Comments

June 5, 2011

Hadoop confusion from Forrester Research

Jim Kobielus started a recent post

Most Hadoop-related inquiries from Forrester customers come to me. These have moved well beyond the “what exactly is Hadoop?” phase to the stage where the dominant query is “which vendors offer robust Hadoop solutions?”

What I tell Forrester customers is that, yes, Hadoop is real, but that it’s still quite immature.

So far, so good. But I disagree with almost everything Jim wrote after that.

Jim’s thesis seems to be that Hadoop will only be mature when a significant fraction of analytic DBMS vendors have own-branded versions of Hadoop alongside their DBMS, possibly via acquisition. Based on this, he calls for a formal, presumably vendor-driven Hadoop standardization effort, evidently for the whole Hadoop stack. He also says that

Hadoop is the nucleus of the next-generation cloud EDW, but that promise is still 3-5 years from fruition

where by “cloud” I presume Jim means first and foremost “private cloud.”

I don’t think any of that matches Hadoop’s actual strengths and weaknesses, whether now or in the 3-7 year future. My reasoning starts:

Hadoop is well on its way to being a surviving data-storage-plus-processing system — like an analytic DBMS or DBMS-imitating data integration tool …
… but Hadoop is best-suited for somewhat different use cases than those technologies are, and the gap won’t close as long as the others remain a moving target.
I don’t think MapReduce is going to fail altogether; it’s too well-suited for too many use cases.
Hadoop (as opposed to general MapReduce) has too much momentum to fizzle, perhaps unless it is supplanted by one or more embrace-and-extend MapReduce-plus systems that do a lot more than it does.
The way for Hadoop to avoid being a MapReduce afterthought is to evolve sufficiently quickly itself; ponderous standardization efforts are quite beside the point.

As for the rest of Jim’s claim — I see three main candidates for the “nucleus of the next-generation enterprise data warehouse,” each with better claims than Hadoop:

Relational DBMS, much like today. (E.g., Teradata, DB2, Exadata or their successors.) This is the case in which robustness of the central data store matters most.
Grand cosmic data integration tools. (The descendants of Informatica PowerCenter, et al.) This is the case in which the logic of data relationships can safely be separated from physical storage.
Nothing. (The architecture could have several strong members, none of which is truly the “nucleus.”) This is the case in which new ways keep being invented to extract high value from data, outrunning what grandly centralized solutions can adapt to. I think this is the most likely case of all.

Categories: Data integration and middleware, EAI, EII, ETL, ELT, ETLT, Hadoop, MapReduce, Theory and architecture

9 Comments

May 14, 2011

Alternatives for Hadoop/MapReduce data storage and management

There’s been a flurry of announcements recently in the Hadoop world. Much of it has been concentrated on Hadoop data storage and management. This is understandable, since HDFS (Hadoop Distributed File System) is quite a young (i.e. immature) system, with much strengthening and Bottleneck Whack-A-Mole remaining in its future.

Known HDFS and Hadoop data storage and management issues include but are not limited to:

Hadoop is run by a master node, and specifically a namenode, that’s a single point of failure.
HDFS compression could be better.
HDFS likes to store three copies of everything, whereas many DBMS and file systems are satisfied with two.
Hive (the canonical way to do SQL joins and so on in Hadoop) is slow.

Different entities have different ideas about how such deficiencies should be addressed. Read more

Categories: Aster Data, Cassandra, Cloudera, Data warehouse appliances, DataStax, EMC, Greenplum, Hadapt, Hadoop, IBM and DB2, MapReduce, MongoDB, Netezza, Parallelization

22 Comments

May 12, 2011

Data integration vendors and Hadoop

There have been many recent announcements about how data integration/ETL (Extract/Transform/Load) vendors are going to work with MapReduce. Most of what they say boils down to one or more of a few things:

Hadoop generally stores data in HDFS (Hadoop Distributed File System). ETL vendors want to be able to extract data from or load it into HDFS.
ETL vendors have development environments that let you specify/script/whatever ETL jobs. ETL vendors want their development tools to develop ETL processes executed via MapReduce/Hadoop.
In particular, this allows ETL vendors to exploit the parallel-processing capabilities of MapReduce.

Some additional twists include:

Pentaho announced business intelligence and ETL for Hadoop last year.
Syncsort thinks different sort algorithms should be usable with Hadoop. Consequently, it plans to contribute technology to the community to make sort pluggable into Hadoop. (However, Syncsort is keeping its own sort technology proprietary.)
Syncsort is considering replicating some Hive functionality, starting with joins, hopefully running much faster. (However, Syncsort’s basic Hadoop support is a quarter or three away, so any more advanced functionality would probably come out in 2012 or beyond.)
SnapLogic fondly thinks that its generation of MapReduce jobs is particularly intelligent.

Finally, my former clients at Pervasive, who haven’t briefed me for a while, seem to have told Doug Henschen that they have pointed DataRush at MapReduce.* However, I couldn’t find evidence of same on the Pervasive DataRush website beyond some help in using all the cores on any one Hadoop node.

*Also see that article because it names a bunch of ETL vendors doing Hadoop-related things.

Categories: Data integration and middleware, EAI, EII, ETL, ELT, ETLT, Hadoop, MapReduce, Parallelization, Pentaho, Pervasive Software, SnapLogic, Syncsort

1 Comment

May 6, 2011

DB2 OLTP scale-out: pureScale

Tim Vincent of IBM talked me through DB2 pureScale Monday. IBM DB2 pureScale is a kind of shared-disk scale-out parallel OTLP DBMS, with some interesting twists. IBM’s scalability claims for pureScale, on a 90% read/10% write workload, include:

95% scalability up to 64 machines
90% scalability up to 88 machines
89% scalability up to 112 machines
84% scalability up to 128 machines

More precisely, those are counts of cluster “members,” but the recommended configuration is one member per operating system instance — i.e. one member per machine — for reasons of availability. In an 80% read/20% write workload, scalability is less — perhaps 90% scalability over 16 members.

Several elements are of IBM’s DB2 pureScale architecture are pretty straightforward:

There are multiple pureScale members (machines), each with its own instance of DB2.
There’s an RDMA (Remote Direct Memory Access) interconnect, perhaps InfiniBand. (The point of InfiniBand and other RDMA is that moving data doesn’t require interrupts, and hence doesn’t cost many CPU cycles.)
The DB2 pureScale members share access to the database on a disk array.
Each DB2 pureScale member has its own log, also on the disk array.

Something called GPFS (Global Parallel File System), which comes bundled with DB2, sits underneath all this. It’s all based on the mainframe technology IBM Parallel Sysplex.

The weirdest part (to me) of DB2 pureScale is something called the Global Cluster Facility, which runs on its own set of boxes. (Edit: Actually, see Tim Vincent’s comment below.) Read more

Categories: Cache, Clustering, IBM and DB2, OLTP, Oracle

15 Comments

May 3, 2011

Oracle and Exadata: Business and technical notes

Last Friday I stopped by Oracle for my first conversation since January, 2010, in this case for a chat with Andy Mendelsohn, Mark Townsend, Tim Shetler, and George Lumpkin, covering Exadata and the Oracle DBMS. Key points included: Read more

Categories: Analytic technologies, Cache, Clustering, Data warehouse appliances, Data warehousing, Emulation, transparency, portability, Exadata, MapReduce, Market share and customer counts, OLTP, Oracle, Parallelization, Predictive modeling and advanced analytics, Solid-state memory

9 Comments

← Previous Page — Next Page →

Search our blogs and white papers

Monash Research blogs

DBMS 2 covers database management, analytics, and related technologies.
Text Technologies covers text mining, search, and social software.
Strategic Messaging analyzes marketing and messaging strategy.
The Monash Report examines technology and public policy issues.
Software Memories recounts the history of the software industry.

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.

Links
- Monash Research
- White Papers
Admin
- Log in

Parallelization

Eight kinds of analytic database (Part 1)

Forthcoming Oracle appliances

The Vertica story (with soundbites!)

Metaphors amok

Patent nonsense: Parallel Iron/HDFS edition

Hadoop confusion from Forrester Research

Alternatives for Hadoop/MapReduce data storage and management

Data integration vendors and Hadoop

DB2 OLTP scale-out: pureScale

Oracle and Exadata: Business and technical notes

Search our blogs and white papers

Monash Research blogs

User consulting

Vendor advisory

Monash Research highlights

Recent posts

Categories

Date archives

Admin