July 1, 2008

The IRS data warehouse

According to a recent Eric Lai Computerworld story and a 2006 Sybase.com success story,

I can’t entirely reconcile those numbers, but in any case the database sounds plenty big.

Computerworld also said:

the research division also uses Microsoft Corp.’s SQL Server to store all of the metadata for the data warehouse and the rest of the agency. Managing and cleaning all of that metadata — 10,000 labels for 150 databases — is a huge task in itself,

May 29, 2008

Yahoo scales its web analytics database to petabyte range

InformationWeek has an article with details on what sounds like Yahoo’s core web analytics database. Highlights include:

May 23, 2008

Data warehouse appliance power user TEOCO

If you had to name super-high-end users of data warehouse technology, your list might start with a few retailers, credit data processors, and telcos, plus the US intelligence establishment. Well, it turns out that TEOCO runs outsourced data warehouses for several of the top US telcos, making it one of the top data warehouse technology users around.

A few weeks ago, I had a fascinating chat with John Devolites of TEOCO. Highlights included:

May 8, 2008

Outsourced data marts

Call me slow on the uptake if you like, but it’s finally dawned on me that outsourced data marts are a nontrivial segment of the analytics business. For example:

To a first approximation, here’s what I think is going on.

March 25, 2008

The eBay analytics guys have a blog now

Oliver Ratzesberger and his crew have started a blog, focusing on XLDB analytics. Naturally, one of the early posts gives a quick overview of their system stats. Highlights include:

Incoming data volumes exceed 40TB per day, with more than 10^11 new items/lines/records being added per day. Our analytical processing infrastructure exceeds 6PB of physical storage with over 2.9PB(1.4+1.5) in our largest cluster.

We leverage compression technologies wherever possible and are achieving compression ratios as high as 99% on our highest volume data feeds.

On any given day our massive parallel systems process more than 27PB of data, not factoring in various levels of caches that serve similar activities or processes and reduce the amount of physical IOs significantly.

We execute millions of requests on a daily basis, spanning from near realtime highly localized access to enormous jobs that span 100s of TB in a single or series of models.
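
For a sense of scale, here is a back-of-envelope sketch of what those quoted figures imply. It assumes decimal units (1TB = 10^12 bytes), treats each "more than" figure as if it were exact, and applies the best-case 99% compression ratio to the whole daily feed, which the post does not claim. The variable names are mine, not eBay's.

```python
# Rough arithmetic on the figures quoted above. Assumptions: decimal
# units, and every "exceeds"/"more than" number taken as an exact value.
TB = 10**12
PB = 10**15
SECONDS_PER_DAY = 86_400

daily_bytes = 40 * TB          # incoming data per day
daily_records = 10**11         # new items/lines/records per day
physical_storage = 6 * PB      # total analytical storage
daily_processed = 27 * PB      # data processed per day

# Average size of one incoming record.
avg_record_bytes = daily_bytes / daily_records

# Sustained ingest rates if the load were spread evenly over the day.
ingest_mb_per_sec = daily_bytes / SECONDS_PER_DAY / 10**6
records_per_sec = daily_records / SECONDS_PER_DAY

# Best case only: the quoted 99% ratio applied to the entire daily feed.
best_case_stored_tb = daily_bytes * (1 - 0.99) / TB

# How many times over the physical storage is processed each day.
processed_vs_storage = daily_processed / physical_storage

print(f"average record size:    {avg_record_bytes:.0f} bytes")
print(f"sustained ingest:       {ingest_mb_per_sec:.0f} MB/s")
print(f"records per second:     {records_per_sec:,.0f}")
print(f"best-case stored/day:   {best_case_stored_tb:.1f} TB")
print(f"processed vs. storage:  {processed_vs_storage:.1f}x")
```

In other words, the average incoming record is on the order of a few hundred bytes, and the systems chew through several times their own physical storage every day.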

March 13, 2008

More Twitter weirdness

Twitter commonly has the problem of duplicate tweets. That is, if you post a message, it shows up twice. After a little while, the dupe disappears, but if you delete the dupe manually, the original is gone too.

I presume what’s going on is that tweets are cached, the tweets are eventually batched to disk, and they don’t always get deleted from cache until some time after they’re persisted. If you happen to check the page of your recent tweets in between, boom, you get two hits. But what I don’t understand is why the two versions have different timestamps.

Presumably, this could be explained at a MySQL User Conference session next month, one of whose topics will be “Intelligent caching strategies using a hybrid MemCache / MySQL approach.” I’m so glad they don’t use stupid strategies to do this …
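
To make the guess about caching concrete, here is a toy sketch of that kind of failure. It is purely my illustration, not Twitter's actual design: the class and method names are hypothetical, and plain dictionaries stand in for the cache and the database. It reproduces the duplicate and the delete-kills-both behavior, but it does not explain the differing timestamps.

```python
from dataclasses import dataclass, field


@dataclass
class Tweet:
    tweet_id: int
    text: str


@dataclass
class WriteBehindStore:
    """Hypothetical write-behind store: posts land in cache, then get batched to disk."""
    cache: dict = field(default_factory=dict)  # tweet_id -> Tweet
    disk: dict = field(default_factory=dict)   # tweet_id -> Tweet

    def post(self, tweet_id, text):
        # New tweets go to the cache only.
        self.cache[tweet_id] = Tweet(tweet_id, text)

    def flush_to_disk(self):
        # Persist everything in the cache, but do not evict it yet.
        for tweet in self.cache.values():
            self.disk[tweet.tweet_id] = tweet

    def evict_persisted(self):
        # Runs some time after the flush; until then both copies exist.
        for tweet_id in list(self.cache):
            if tweet_id in self.disk:
                del self.cache[tweet_id]

    def recent_tweets(self):
        # Naive read path: merges both tiers without de-duplicating by id.
        return list(self.cache.values()) + list(self.disk.values())

    def delete(self, tweet_id):
        # Deleting by id removes the tweet from both tiers at once.
        self.cache.pop(tweet_id, None)
        self.disk.pop(tweet_id, None)


store = WriteBehindStore()
store.post(1, "hello world")
store.flush_to_disk()
print(len(store.recent_tweets()))  # 2: the dupe shows up between flush and eviction

store.evict_persisted()
print(len(store.recent_tweets()))  # 1: the dupe disappears on its own

store.post(2, "second tweet")
store.flush_to_disk()
store.delete(2)                    # delete the "dupe" manually
print(len(store.recent_tweets()))  # 1: both copies of tweet 2 are gone
```

The obvious fix in a design like this would be to de-duplicate by tweet ID on the read path, or to evict from cache in the same step as the flush.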

March 4, 2008

Odd article on Sybase IQ and columnar systems

Intelligent Enterprise has an article on Sybase IQ and columnar systems that leaves me shaking my head. E.g., it ends by saying Netezza has a columnar architecture (uh, no). It also quotes an IBM exec as saying only 10-20% of what matters in a data warehouse DBMS is performance (already an odd claim), and then has him saying columnar only provides a 10% performance gain (let’s be generous and hope that’s a misquote).

The article also includes this passage, which seems more credible:

“Sybase IQ revenues were up 70% last year,” said Richard Pledereder, VP of engineering. … Sybase now claims 1,200 Sybase IQ customers. It runs large data warehouses powered by big, multiprocessor servers. Priced at $45,000 per CPU, those IQ customers now account for a significant share of Sybase’s revenues, although the company won’t break down revenues by market segment.

February 27, 2008

eBay OLTP architecture

I’ve posted a couple of times about eBay’s analytics side. As a companion, Don Burleson pointed me at a fascinating November 2006 slide presentation outlining eBay’s transactional architecture and evolution. Highlights include:

The presentation has a bunch of specific numbers, in case anybody wants to dive in.

February 26, 2008

Introduction to Exasol

I had a non-technical introduction today to Exasol, a data warehouse specialist that has gotten a little buzz recently for publishing TPC-H results even faster than ParAccel’s. Here are some highlights:


February 26, 2008

The biggest eBay database

There’s been some confusion over my post about eBay’s multiple petabytes of data. So to clarify, let me say:
