Petabyte-scale data management
Posts about managing databases that hold petabytes of user data.
Partnering with Cloudera
After I criticized the marketing of the Aster/Cloudera partnership, my clients at Aster Data and Cloudera ganged up on me and tried to persuade me I was wrong. Be that as it may, that conversation and others were helpful to me in understanding the core thesis: Read more
Categories: Analytic technologies, Aster Data, Cloudera, Data warehousing, Database diversity, Hadoop, MapReduce, Parallelization, Petabyte-scale data management | 11 Comments |
eBay followup — Greenplum out, Teradata > 10 petabytes, Hadoop has some value, and more
I chatted with Oliver Ratzesberger of eBay around a Stanford picnic table yesterday (the XLDB 4 conference is being held at Jacek Becla’s home base of SLAC, which used to stand for “Stanford Linear Accelerator Center”). Todd Walter of Teradata also sat in on the latter part of the conversation. Things I learned included: Read more
Categories: Data warehousing, Derived data, eBay, Greenplum, Hadoop, HBase, Log analysis, Petabyte-scale data management, Teradata | 30 Comments |
Cloudera Enterprise and Hadoop evolution
I talked with Cloudera a couple of weeks ago in connection with the impending release of Cloudera Enterprise. I’d say: Read more
More on Sybase IQ, including Version 15.2
Back in March, Sybase was kind enough to give me permission to post a slide deck about Sybase IQ. Well, I’m finally getting around to doing so. Highlights include but are not limited to:
- Slide 2 has some market success figures and so on. (>3100 copies at >1800 users, >200 sales last year)
- Slides 6-11 give more detail on Sybase’s indexing and data access methods than I put into my recent technical basics of Sybase IQ post.
- Slide 16 reminds us that Sybase IQ's in-database data mining is quite competitive with what SAS has actually delivered with its DBMS partners, even if it doesn't have the nice architectural approach of Aster or Netezza. (I.e., Sybase IQ's more-than-SQL advanced analytics story relies on C++ UDFs — User-Defined Functions — running in-process with the DBMS.) In particular, there's a data mining/predictive analytics library — modeling and scoring both — licensed from a small third party.
- A number of the other later slides also have quite a bit of technical crunch. (More on some of those points below too.)
Sybase IQ may have a bit of a funky architecture (e.g., no MPP), but the age of the product and the substantial revenue it generates have allowed Sybase to put in a bunch of product features that newer vendors haven’t gotten around to yet.
More recently, Sybase volunteered permission for me to preannounce Sybase IQ Version 15.2 by a few days (it’s scheduled to come out this week). Read more
Greenplum Chorus and Greenplum 4.0
Greenplum is making two product announcements this morning. Greenplum 4.0 is a revision of the core Greenplum database technology. In addition, Greenplum is announcing Greenplum Chorus, which is the first product release instantiating last year’s EDC (Enterprise Data Cloud) vision statement and marketing campaign.
Greenplum 4.0 highlights and related observations include: Read more
Vertica update
I caught up with Jerry Held (Chairman) and Dave Menninger (VP Marketing) of Vertica for a chat yesterday. The immediate reason for the call was that a competitor had tipped me off to the departure of Vertica CEO Ralph Breslauer, which of course raises a host of questions. Highlights of the call included:
- Vertica had a “killer” Q4 and is doing very well in Q1 again.
- Vertica burned hardly any cash last year; i.e., it was close to cash-flow neutral in 2009.
- Vertica is hiring aggressively, e.g., in sales.
- Vertica is well down the path with several CEO candidates whom Jerry regards as outstanding. He is hopeful there will be a new CEO in April. (But I bet that would be late April, given what Jerry mentioned about his own travel plans.)
- Absent a full-time CEO, Jerry and Andy Palmer are spending a lot more time with Vertica.
- One Vertica customer is approaching a petabyte of user data. The last time Vertica had checked, that customer had been more in the ¼ petabyte range.
- Other multi-hundred terabyte Vertica databases were mentioned, including one where Vertica claims to have beaten Teradata and perhaps other competitors in a head-to-head competition (it sounds like that one’s too recent to be deployed yet).
- Vertica sees Aster and Greenplum competitively more often than it sees ParAccel.
- Vertica sees Sybase IQ competitively a lot in financial services (in new-name accounts for Sybase as well as where some kind of Sybase DBMS is an incumbent), and more occasionally in other sectors.
NDA parts of the conversation also gave me the impression that Vertica is moving forward just as eagerly as its peers. I.e., I didn’t uncover any reason to think that Ralph’s departure is a sign of trouble, of the company being shopped, etc. Read more
Categories: Analytic technologies, Data warehousing, Investment research and trading, Market share and customer counts, ParAccel, Petabyte-scale data management, Sybase, Vertica Systems | 6 Comments |
Yahoo wants to do decapetabyte-scale data warehousing in Hadoop
My old client Mark Tsimelzon moved over to Yahoo after Coral8 was acquired, and I caught up with him last month. He turns out to be running development for a significant portion of Yahoo's Hadoop effort — everything other than HDFS (Hadoop Distributed File System). Yahoo evidently plans, within a year or so, to get Hadoop to the point that it is managing tens of petabytes of data for Yahoo, with reasonable data warehousing functionality.
Highlights of our visit included:
- There are dozens of people at Yahoo (full-time or close to it) doing Hadoop development whose work will wind up being open-sourced. In particular, everything Mark's team does goes to open source.
- Yahoo is moving as much of its analytics to Hadoop as possible. Much of this is being moved away from Oracle and from Yahoo’s own Everest.
- A column store is being put on top of HDFS, based on Yahoo technology. Columns will be striped across nodes. Perhaps that's why the effort is called Project Zebra. (A minimal illustration of column striping appears after this list.)
- Mark believes that in a year Hadoop will be much further along in meeting traditional data warehousing requirements, in areas such as:
- Metadata
- SLAs/high availability/other workload management
- Data retention policies
- Security/privacy*
- Yahoo views the time-to-market benefits of Hadoop as being more important than TCO.
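To make the column-striping idea concrete, here is a minimal sketch in Python. It is purely my own illustration of how a row-oriented table might be broken into per-column stripes that a placement layer could then spread across HDFS nodes; it is not Yahoo's actual Project Zebra code, and the function and sample table are made up for the example.

```python
# Hypothetical illustration of column striping, not Yahoo's Project Zebra code.
# A row-oriented table is split into columns, and each column is divided into
# fixed-size stripes; a placement layer could then assign different stripes of
# the same column to different HDFS nodes.

def stripe_columns(rows, column_names, stripe_size=2):
    """Return a dict mapping (column_name, stripe_index) -> list of values."""
    columns = {name: [row[i] for row in rows]
               for i, name in enumerate(column_names)}
    stripes = {}
    for name, values in columns.items():
        for start in range(0, len(values), stripe_size):
            stripes[(name, start // stripe_size)] = values[start:start + stripe_size]
    return stripes

if __name__ == "__main__":
    rows = [("2010-05-01", "click", 3),
            ("2010-05-01", "view", 7),
            ("2010-05-02", "click", 1),
            ("2010-05-02", "view", 9)]
    for key, stripe in stripe_columns(rows, ["day", "event", "count"]).items():
        print(key, stripe)  # e.g. ('day', 0) ['2010-05-01', '2010-05-01']
```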
Categories: Analytic technologies, Data warehousing, Hadoop, MapReduce, Open source, Oracle, Petabyte-scale data management, Web analytics, Yahoo | 6 Comments |
Facts and rumors
- Vertica is putting out a press release today touting its 100th customer, and talking of triple-digit growth last year.
- Multiple sources have told me that the DATAllegro system is being thrown out of Dell, so evidently Dell is telling this to one and all. If that goes through, this would presumably leave TEOCO as DATAllegro’s single happy customer. (I haven’t checked with Microsoft for its view.)
- A rumor has it that Infiniband technology vendor Voltaire, Ltd. privately claims triple-digit sales of switches for Exadata 1 (I think that one would be one switch per Exadata installation, not per rack). Based just on a quick glance, this is far from confirmed by Voltaire’s earnings conference call transcripts or SEC filings. However, the most recent transcript does seem to indicate Voltaire got multiple Exadata deals in the telecommunications sector, and suggests some Exadata penetration in other sectors as well.
- I was told of a classified-agency user that has >1 petabyte of data on Exadata 1 and 600 terabytes or so on Netezza. My not-obviously-biased source says the agency is distinctly happier with Netezza than Exadata.
- Like ParAccel, Oracle just got dinged for TPC-related misbehavior.
- Rumor has it that Sun has no intention of helping ParAccel rerun its withdrawn TPC-H benchmark.
- ParAccel has withdrawn the claim from its home page to be the “CERTIFIED” price-performance leader. This seems to confirm that the claim was a reference to the TPC-H. In my opinion, that was a gross misrepresentation of what the TPC-H shows.
Introduction to the XLDB and SciDB projects
Before I write anything else about the overlapping efforts known as XLDB and SciDB, I probably should explain and disambiguate what they are as best I can. XLDB was organized and still is run by guys who want to solve a scientific problem in eXtremely Large DataBase Management, most especially Jacek Becla of SLAC (the organization previously known as Stanford Linear Accelerator Center). Becla’s original motivation was that he needs a DBMS to manage what will be 55 petabytes of raw image data and 100 petabytes of astronomical data total for LSST (Large Synoptic Survey Telescope). Read more
Categories: Data models and architecture, Database diversity, eBay, Michael Stonebraker, Open source, Petabyte-scale data management, Scientific research, Theory and architecture | 2 Comments |
Facebook, Hadoop, and Hive
A few weeks ago, I posted about a conversation I had with Jeff Hammerbacher of Cloudera, in which he discussed a Hadoop-based effort at Facebook he previously directed. Subsequently, Ashish Thusoo and Joydeep Sarma of Facebook contacted me to expand upon and in a couple of instances correct what Jeff had said. They also filled me in on Hive, a data-manipulation add-on to Hadoop that they developed and subsequently open-sourced.
Updating the metrics in my Cloudera post,
- Facebook has 400 terabytes of disk managed by Hadoop/Hive, with a slightly better than 6:1 overall compression ratio. So the 2 1/2 petabytes figure for user data is reasonable. (The back-of-the-envelope arithmetic is sketched after this list.)
- Facebook’s Hadoop/Hive system ingests 15 terabytes of new data per day now, not 10.
- Hadoop/Hive cycle times aren’t as fast as I thought I heard from Jeff. Ad targeting queries are the most frequent, and they’re run hourly. Dashboards are repopulated daily.
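For what it's worth, here is the back-of-the-envelope arithmetic behind that compression point, my own calculation rather than a Facebook-supplied one: applying the roughly 6:1 ratio to 400 terabytes of disk gives about 2.4 petabytes of uncompressed user data, consistent with the 2 1/2 petabyte figure.

```python
# My own sanity check, not a Facebook figure: 400 TB of compressed data on
# disk at a bit better than 6:1 compression implies roughly 2.4+ PB of
# uncompressed user data, in line with the ~2.5 PB figure in my Cloudera post.

disk_tb = 400          # terabytes of disk managed by Hadoop/Hive
compression_ratio = 6  # "slightly better than 6:1"

user_data_pb = disk_tb * compression_ratio / 1000.0
print(f"~{user_data_pb:.1f} petabytes of user data")  # prints ~2.4
```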
Nothing else in my Cloudera post was called out as being wrong.
In a new-to-me metric, Facebook has 610 Hadoop nodes, running in a single cluster, due to be increased to 1000 soon. Facebook thinks this is the second-largest* Hadoop installation, or else close to it. What’s more, Facebook believes it is unusual in spreading all its apps across a single huge cluster, rather than doing different kinds of work on different, smaller sub-clusters. Read more