Why MapReduce matters to SQL data warehousing
Greenplum and Aster Data have both just announced the integration of MapReduce into their SQL MPP data warehouse products. So why do I think this could be a big deal? The short answer is “Because MapReduce offers dramatic performance gains in analytic application areas that still need great performance speed-up.” The long answer goes something like this.
The core ideas of MapReduce are:
- For large problems, parallel computing is much more cost effective and/or feasible than the alternatives.
- If you shoehorn programs into a certain very simple framework (namely, one in which you’re limited to map and reduce steps), then building a general execution engine that gives parallelism “for free” is straightforward.
- A lot more problems can be solved within that framework than one might at first expect.
In essence, you can do almost anything to a single record*; that’s a map step. But you are sharply limited in how you combine information about multiple (often intermediate) records; that’s a reduce step. Still, reduce steps let you do counts, sums, or other aggregations. That, plus the general power of map steps, makes MapReduce useful for at least three major classes of applications:
- Text tokenization, indexing, and search
- Creation of other kinds of data structures (e.g., graphs)
- Data mining and machine learning
Except for the building of entire search engines, these are all application areas that data warehouse users should and do care about. And they all still could benefit from large performance increases, as is evidenced by the routine compromises analysts make in areas such as data reduction, sampling, over-simplified models and the like.
*Technically, MapReduce doesn’t allow for records. Instead, you process key-value pairs and lists of same. But so far as I can tell, that’s a distinction without a difference. LISP long ago proved that lists are a very general construct indeed.
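To make the key-value model concrete, here is a minimal single-process sketch in Python. The names `map_reduce`, `count_words`, and `sum_counts` are made up for illustration and are not part of any vendor's API; a real engine would spread the map and reduce calls across many nodes rather than run them in one loop.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Toy MapReduce: map each record, group emitted values by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        # Map step: each record is handled on its own and may emit any number of pairs.
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce step: each key's list of values is combined independently of other keys.
    return {key: reducer(key, values) for key, values in groups.items()}

def count_words(line):
    """Hypothetical mapper: emit (word, 1) for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1

def sum_counts(word, counts):
    """Hypothetical reducer: a simple aggregation over the values for one key."""
    return sum(counts)

lines = ["map reduce map", "reduce steps do sums"]
word_counts = map_reduce(lines, count_words, sum_counts)
print(word_counts)
```

Because the mapper sees one record at a time and the reducer sees one key at a time, both loops could be split across machines without changing either function.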
MapReduce can be superior to pure SQL for these application areas, because they involve creation of data structures that are awkward to fit into a SQL rows-and-tables paradigm. Inverted-list text indexes just aren’t tables. Formally, graphs can always be fit into tables; but even so, if you want to follow a graph for numerous hops, relational structures can be problematic. Data mining can involve very high-dimensional problems with super-sparse tables. And while exhaustive text extraction into flat tables works OK, getting from there to common-sense semantic hierarchies can be a bit of a kludge.
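As a rough illustration of the inverted-list point, the sketch below builds a tiny text index with one map step and one reduce step. The function names and the sample documents are assumptions made for the example, not anything taken from Greenplum's or Aster Data's products.

```python
from collections import defaultdict

def map_tokens(doc_id, text):
    """Map step: emit one (token, doc_id) pair per distinct token in a document."""
    for token in set(text.lower().split()):
        yield token, doc_id

def reduce_postings(token, doc_ids):
    """Reduce step: collapse the doc ids seen for a token into a sorted posting list."""
    return sorted(set(doc_ids))

def build_inverted_index(docs):
    """Run the two steps in a single process; a real engine would parallelize them."""
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for token, emitted_id in map_tokens(doc_id, text):
            grouped[token].append(emitted_id)
    return {token: reduce_postings(token, ids) for token, ids in grouped.items()}

# Hypothetical sample documents keyed by document id.
docs = {1: "sql data warehouse", 2: "mapreduce data mining", 3: "sql mapreduce"}
index = build_inverted_index(docs)
print(index["sql"])
```

Each token ends up with a variable-length list of document ids, which is the shape that fits awkwardly into fixed-width rows.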
Some of our recent links about MapReduce
- Three major applications of MapReduce
- Another application of MapReduce
- Sound bites about MapReduce
- Other links about MapReduce
Comments
24 Responses to “Why MapReduce matters to SQL data warehousing”
Curt,
We’ve seen that MapReduce is also immensely useful in transformations (the T step of ELT processing) and in data preparation (before data is exported).
—
Steve Wooledge
Aster Data Systems
http://www.asterdata.com
[…] third approach is my Subject Of The Week: MapReduce. When I posted a list of canonical MapReduce applications, my friends at Aster Data offered one […]
Does anyone know whether Aster and/or Greenplum signed any kind of license with Google in order to get access to MapReduce technology, or to get permission to use the term?
I’ve never asked. I can’t think of any reason why they would have had to.
CAM
You don’t need permission to implement Map/Reduce. Google published several papers on the subject, including a famous white paper several years ago, which was the launching point for the Map/Reduce implementation that my company developed in-house.
[…] The integration of MapReduce with SQL data warehousing […]
[…] Monash, president of Monash Research, editor of DBMS2, and a leading authority on MapReduce, sees this as a major leap forward. He reports that both companies had completed adding MapReduce to their existing products and had […]
There is a coding tutorial available at this link in the middle of the page: http://www.greenplum.com/resources/mapreduce/
Key things to note about Greenplum’s MR implementation:
– It’s very similar in form and expression to Google and Hadoop
– Extensions for Joins and Pipelined task execution
– Native parallel file access
– Parallelism is full and transparent to the programmer
In summary: we have implemented a MapReduce framework within which you can write in SQL, Perl, Python, and many other languages. It is straightforward to take MR programs written for Hadoop or Google’s MapReduce and port them to Greenplum.
On the topic of licensing:
Licensing is not required for MapReduce as it is a work derived from many sources of publicly shared know-how. It dates back to the original Lisp operators Map and Reduce.
The Wikipedia page is pretty complete here:
http://en.wikipedia.org/wiki/MapReduce
Greenplum’s MapReduce support is designed to provide a superset of the semantic content of open source Hadoop and Google’s implementations, making it straightforward to port from those environments to Greenplum’s data analysis and management engine.
Just a couple points on Aster’s implementation of MapReduce:
+ Developers can use Java, Python, C, Perl, and more to create SQL/MR functions which are then easily used by BI tools or business analysts as common SQL statements
+ Aster’s In-Database MapReduce framework is a superset of MapReduce
+ Aster has a process management framework to guarantee transparency and availability
More in our whitepaper here:
http://www.asterdata.com/product/whitepaper_mapreduce.html
It seems to me that Hadoop, and MapReduce in general, need to avoid being bogged down by dealing with a database. It’s about accessing files in parallel without all the garbage that a database puts people through.
I don’t want to have to write an application with a SQL driver and write SQL to use MapReduce. I think that’s kind of the whole point.
I haven’t looked at the other DB vendors’ MapReduce offerings, but when I look at the Aster Data examples, it looks like a database trying to do MapReduce using UDFs, which kind of misses the whole point for me.
Greenplum and Aster have somewhat different approaches to SQL/MapReduce integration. I want to look into them both further before trying to write about the respective syntaxes.
That said, a little SQL wrapper never hurt anybody.
CAM
[…] benefits, features, etc. to various constituencies (business users, programmers, DBAs, etc.) of the Greenplum and Aster Data MapReduce announcements. Questions like that are hard to answer simply. Here’s […]
[…] line: Mike Stonebraker more than disagrees with the claim that MapReduce is a valuable addition to SQL data warehousing, on somewhat different grounds than he emphasized in the Great MapReduce Debate last January. […]
[…] of the sidebars. And I link to other of my posts whenever it seems to make sense, as in my posts on MapReduce and database […]
[…] approach is my “Subject of the Week”: MapReduce. When I posted a list of canonical applications […]
[…] Author: Curt Monash Original publication date: 2008-08-26 Translation: Олег Кузьменко Source: Curt Monash’s blog […]
[…] As the charming picture on the CouchDB site shows, the internals can be divided into the “views” engine, storage, and replication. When writing about views, it should be mentioned that our knowledge of SQL-92 and its derivatives will be of no use. There are no classic queries; instead, the MapReduce method is used. We write the map and reduce functions in JavaScript, but in fact support for functions in any language can be added (more on that in a moment). A great deal can be written about MapReduce itself, so I won’t expand on that topic right now. Those interested are referred to Wikipedia and, for example, this post. […]
[…] and wish Aster wouldn’t tie its marketing identity so closely to the admittedly cool supports-MapReduce feature. That said, I do think Aster’s nPath story is pretty interesting, and I plan to blog about […]
[…] http://en.wikipedia.org/wiki/Map_Reduce • DBMS2, Why MapReduce matters to SQL data warehousing http://www.dbms2.com/2008/08/26/why-mapreduce-matters-to-sql-data-warehousing/ • The Database Column, MapReduce: A major step backwards, By David DeWitt on January 17, 2008 […]
[…] MapReduce is an exception, in that it’s designed for analytics. MapReduce may be useful for enterprises. But where it is, it probably should be integrated into an analytic DBMS. […]
[…] Considering that MapReduce excels in aggregation and computation, data warehousing and business intelligence are the first to adopt MapReduce. A very interesting article on how MapReduce is relevant to Data Warehousing products is available at http://www.dbms2.com/2008/08/26/why-mapreduce-matters-to-sql-data-warehousing/. […]