Yahoo wants to do decapetabyte-scale data warehousing in Hadoop
My old client Mark Tsimelzon moved over to Yahoo after Coral8 was acquired, and I caught up with him last month. He turns out to be running development for a significant portion of Yahoo’s Hadoop effort: everything other than HDFS (Hadoop Distributed File System). Yahoo evidently plans, within a year or so, to get Hadoop to the point that it is managing tens of petabytes of data for Yahoo, with reasonable data warehousing functionality.
Highlights of our visit included:
- Dozens of people at Yahoo are working full-time, or close to it, on Hadoop development that will wind up being open sourced. In particular, everything Mark’s team does goes to open source.
- Yahoo is moving as much of its analytics to Hadoop as possible. Much of this is being moved away from Oracle and from Yahoo’s own Everest.
- A column store is being put on top of HDFS, based on Yahoo technology. Columns will be striped across nodes. Perhaps that’s why the effort is called Project Zebra.
- Mark believes that in a year Hadoop will be much further along in meeting traditional data warehousing requirements, in areas such as:
  - Metadata
  - SLAs/high availability/other workload management
  - Data retention policies
  - Security/privacy*
- Yahoo views the time-to-market benefits of Hadoop as being more important than TCO.
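To make the column-striping idea above a bit more concrete, here is a minimal sketch of one possible reading of "columns striped across nodes": each column of a table is stored separately and assigned to a node round-robin. Nothing here is taken from Project Zebra itself; the function names, the round-robin placement rule, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict


def stripe_columns(rows, column_names, num_nodes):
    """Split row-oriented records into per-column lists and assign each
    column to a node round-robin (a hypothetical placement rule)."""
    placement = defaultdict(dict)  # node id -> {column name: list of values}
    for index, name in enumerate(column_names):
        node = index % num_nodes
        placement[node][name] = [row[index] for row in rows]
    return dict(placement)


def scan_column(placement, name):
    """Read one column without touching the nodes that hold the others."""
    for node, columns in placement.items():
        if name in columns:
            return node, columns[name]
    raise KeyError(name)


# Made-up sample data, purely for illustration.
rows = [
    ("2009-06-01", "search", 120),
    ("2009-06-01", "mail", 75),
    ("2009-06-02", "search", 130),
]
columns = ["day", "property", "clicks"]

placement = stripe_columns(rows, columns, num_nodes=2)
node, clicks = scan_column(placement, "clicks")
print(node, sum(clicks))
```

The point of the layout is that an aggregate over one column (here, total clicks) only has to read that column's data, which is the usual argument for columnar storage in analytic workloads.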
Categories: Analytic technologies, Data warehousing, Hadoop, MapReduce, Open source, Oracle, Petabyte-scale data management, Web analytics, Yahoo
Facts and rumors
- Vertica is putting out a press release today touting its 100th customer, and talking of triple digit growth last year.
- Multiple sources have told me that the DATAllegro system is being thrown out of Dell, so evidently Dell is telling this to one and all. If that goes through, this would presumably leave TEOCO as DATAllegro’s single happy customer. (I haven’t checked with Microsoft for its view.)
- A rumor has it that InfiniBand technology vendor Voltaire, Ltd. privately claims triple-digit sales of switches for Exadata 1. (I think that would be one switch per Exadata installation, not per rack.) Based just on a quick glance, this is far from confirmed by Voltaire’s earnings conference call transcripts or SEC filings. However, the most recent transcript does seem to indicate Voltaire got multiple Exadata deals in the telecommunications sector, and suggests some Exadata penetration in other sectors as well.
- I was told of a classified-agency user that has >1 petabyte of data on Exadata 1 and 600 terabytes or so on Netezza. My not-obviously-biased source says the agency is distinctly happier with Netezza than Exadata.
- Like ParAccel, Oracle just got dinged for TPC-related misbehavior.
- Rumor has it that Sun has no intention of helping ParAccel rerun its withdrawn TPC-H benchmark.
- ParAccel has withdrawn the claim from its home page to be the “CERTIFIED” price-performance leader. This seems to confirm that the claim was a reference to the TPC-H. In my opinion, that was a gross misrepresentation of what the TPC-H shows.
What Nielsen really uses in data warehousing DBMS
In its latest earnings call, Oracle made a reference to The Nielsen Company that was — to put it politely — rather confusing. I just plopped down in a chair next to Greg Goff, who evidently runs data warehousing at Nielsen, and had a quick chat. Here’s the real story.
- The Nielsen Company has over half a petabyte of data on Netezza in the US. This installation is growing.
- The Nielsen Company indeed has 45 terabytes or whatever of data on Oracle in its European (Customer) Information Factory. This is not particularly growing. Nielsen’s Oracle data warehouse has been built up over the past 9 years. It’s not new. It’s certainly not on Exadata, nor planned to move to Exadata.
- These are not single-instance databases. Nielsen’s biggest single Netezza database is 20 terabytes or so of user data, and its biggest single Oracle database is 10 terabytes or so.
- Much (most?) of the rest of the installations are customer data marts and the like, based in each case on the “big” central database. (That’s actually a classic data mart use case.) Greg said that Netezza’s capabilities to spin out those databases seemed pretty good.
- That 10 terabyte Oracle data warehouse instance requires a lot of partitioning effort and so on in the usual way.
- Nielsen has no immediate plans to replace Oracle with Netezza.
- Nielsen actually has 800 terabytes or so of Netezza equipment. Some of that is kept more lightly loaded, for performance.
Categories: Analytic technologies, Data mart outsourcing, Data warehouse appliances, Data warehousing, Netezza, Oracle, Specific users
Oracle gives a few customer database size examples
In its recent quarterly conference call, Oracle said (as per the Seeking Alpha transcript):
AC Neilsen, for instance, we deployed a 45-terabyte data [mart], they called it; Adidas, 13 terabytes; Australian Bureau of Statistics, 250 terabytes; and of course, some of our high-end ones that you have probably heard of in the past, AT&T, 250 terabytes; Yahoo!, 700 terabytes — just gives you an idea of the size of the databases that are out there and how they are growing, and that’s driving the need for greater throughput.
I don’t know what’s being counted there, but I wouldn’t be surprised if those were legit user-data figures.
Some other notes:
- The Yahoo database is of course Yahoo’s first-generation data warehouse, which has been largely superseded by an internal system more than 10X that size. (Edit: Actually, Greg Rahn of Oracle says below that it’s a different database.)
- I’m keynoting the Netezza road show this month, and Nielsen is up there on stage touting Netezza. (Edit: Nielsen indeed does the overwhelming majority of its data warehousing on Netezza.)
- I’d be surprised if AT&T’s largest data warehouse were “only” 250 terabytes in size. (Edit: Actually, I am told the database in question is 310 TB of user data and growing. More later, hopefully.)
- Oracle didn’t exactly say that those were Exadata installations.
Categories: Analytic technologies, Data warehousing, Exadata, Netezza, Oracle, Specific users, Telecommunications, Web analytics, Yahoo
Introduction to the XLDB and SciDB projects
Before I write anything else about the overlapping efforts known as XLDB and SciDB, I probably should explain and disambiguate what they are as best I can. XLDB was organized and still is run by guys who want to solve a scientific problem in eXtremely Large DataBase Management, most especially Jacek Becla of SLAC (the organization previously known as Stanford Linear Accelerator Center). Becla’s original motivation was that he needs a DBMS to manage what will be 55 petabytes of raw image data and 100 petabytes of astronomical data total for LSST (Large Synoptic Survey Telescope).
Categories: Data models and architecture, Database diversity, eBay, Michael Stonebraker, Open source, Petabyte-scale data management, Scientific research, Theory and architecture
Yahoo is up to 10 petabytes now?
According to somebody (I forget who) who attended Yahoo’s SIGMOD presentation last week, the big Yahoo database is now up to 10 petabytes in size, in line with Yahoo’s predictions last year. Apparently, Yahoo also gave more details of how the technology works.
Categories: Columnar database management, Data warehousing, Web analytics, Yahoo
Eric Lai emailed today to ask what I thought about the NoSQL folks, and especially whether I thought their ideas were useful for enterprises in general, as opposed to just Web 2.0 companies. That was the first I heard of NoSQL, which seems to be a community discussing SQL alternatives popular among the cloud/big-web-company set, such as BigTable, Hadoop, Cassandra and so on. My short answers are:
- In most cases, no.
- Most of these technologies are designed for simple, high-volume OLTP (OnLine Transaction Processing). Most large enterprises have an established way of doing OLTP, probably via relational database management systems. Why change?
- MapReduce is an exception, in that it’s designed for analytics. MapReduce may be useful for enterprises. But where it is, it probably should be integrated into an analytic DBMS.
- There’s one big countervailing factor to all these generalities — schema flexibility.
As for the longer form, let me start by noting that there are two main kinds of reason for not liking SQL.
An example of what’s wrong with big vendors’ approaches to BI (SAP in this case)
I just found Chris Kanaracus’ article about SAP’s rollout last month of its “clear enterprises” strategy. The money quote comes from Sara Lee, the user SAP seems to have trotted out:
But Sara Lee has not yet decided to purchase the software, and there are substantial underlying tasks to perform as well, he added.
“This is giving us the horsepower [to analyze data] but we need to have harmonized and structured data underneath it.”
This is from the leading test user of the product?
Business intelligence and the associated data management processes need to be reimagined, and I’m increasingly coming to suspect that the big BI conglomerates aren’t up to the task.
Categories: Analytic technologies, Business intelligence, SAP AG, Specific users, Theory and architecture
MMO games are still screwed up in their database technology
Two years ago I wrote about the database technology of Guild Wars. Not coincidentally, Guild Wars was the MMO RPG (Massively Multiplayer Online Role-Playing Game) I then played. I had the chance to interview Guild Wars’ lead developers. While much else they had to say was impressive, Guild Wars’ database architecture was — er, it was rather mind-boggling.
Since then, Linda and I have taken to playing Lord of the Rings Online, commonly known as LOTRO, developed by Turbine, Inc. I haven’t had the chance to interview any Turbine folks, despite repeated requests. But from afar, it would seem that Turbine’s technology choices leave quite a bit to be desired, in enterprise-like IT areas such as authentication, database management, and storage.
LOTRO and other Turbine games commonly are down, for scheduled maintenance or in some cases otherwise. There is scheduled multi-hour downtime to start many weeks. There are fairly frequent server restarts in addition to that. Lag and congestion are frequent. And so on and so forth. By way of contrast, Guild Wars very rarely goes down, and other technical difficulties are less common as well. Reliability is a key design goal for Guild Wars’ developers, and in my opinion they’ve achieved it.
Some of the reasons for Turbine’s difficulties seem related to the stresses of MMOs; e.g., they’re probably due to the problems caused by having many fictional characters moving through the same fictional space at once, with graphical detail much richer than Guild Wars’. But a couple of head-scratchers make me really wonder about how Turbine manages data.
Categories: Fun stuff, Games and virtual worlds, Specific users
Aster Data sticks by its SQL/MapReduce guns
Aster Data continues to think that MapReduce, integrated with SQL, is an important technology. For example:
- Aster announced today that it’s providing .NET support for SQL/MapReduce. Perhaps not coincidentally, Aster’s biggest customer is MySpace, which is apparently a big Microsoft shop. (And MySpace parent Fox Interactive Media is a SQL/MapReduce fan, albeit running on Greenplum.)
- Aster generally puts more emphasis on MapReduce than SQL/MapReduce rival Greenplum. That’s a non-trivial comparison, because Greenplum is making progress in SQL/MapReduce itself.
- When talking with Aster folks, I can’t get them to shut up ... er, I hear a lot about SQL/MapReduce.
I was a big fan of SQL/MapReduce when it was first announced last August. Notwithstanding persuasive examples favoring pure DBMS or pure MapReduce over DBMS/MapReduce integration, I continue to think the SQL/MapReduce idea has great potential. But I do wish more successful production examples would become visible …
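For readers who haven't seen why MapReduce and SQL overlap at all, here is a minimal sketch of a GROUP BY-style count written as a map step, a shuffle, and a reduce step. This is plain single-process Python, not Aster's or Greenplum's SQL/MapReduce interface; every function name and the sample data are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as the framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def page_view_mapper(record):
    """Emit one (page, 1) pair per page view; the user is ignored here."""
    user, page = record
    yield page, 1


def count_reducer(key, values):
    """Sum the 1s emitted for a page to get its view count."""
    return sum(values)


# Roughly equivalent to: SELECT page, COUNT(*) FROM views GROUP BY page
views = [("ann", "/home"), ("bob", "/home"), ("ann", "/news")]
counts = reduce_phase(shuffle(map_phase(views, page_view_mapper)), count_reducer)
print(counts)
```

A simple aggregate like this is easy either way. The case for integration is presumably strongest when the mapper or reducer does something awkward to express in SQL, while the rest of the query stays relational.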