Discussion of Facebook’s data management technologies. Related subjects include:
- Cassandra, which was originated at Facebook
- Hadoop, one of whose largest supporters is Facebook
- Google’s data management technologies
- Amazon’s data management technologies
Issues in scientific data management
In the opinion of the leaders of the XLDB and SciDB efforts, key requirements for scientific data management include:
- A data model based on multidimensional arrays, not sets of tuples (sketched in code after this list)
- A storage model based on versions and not update in place
- Built-in support for provenance (lineage), workflows, and uncertainty
- Scalability to 100s of petabytes and 1,000s of nodes, with a high degree of fault tolerance
- Support for “external” data objects so that data sets can be queried and manipulated without ever having to be loaded into the database
- Open source in order to foster a community of contributors and to ensure that data is never “locked up” — a critical requirement for scientists
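To make the arrays-versus-tuples point concrete, here is a minimal Python/NumPy sketch; the measurements and shapes are invented for illustration, and this is not SciDB’s actual model:

```python
# Illustrative only: contrasting a tuple (relational) view of scientific data
# with an array view in which dimensions are first-class.
import numpy as np

# Relational view: a set of (x, y, wavelength, intensity) tuples.
tuples = [
    (0, 0, 500, 1.2),
    (0, 1, 500, 3.4),
    (1, 0, 500, 0.7),
]

# Array view: position encodes (x, y), so cells hold just the measured values.
image = np.zeros((2, 2))
for x, y, _, intensity in tuples:
    image[x, y] = intensity

# Neighborhood operations, common in science, fall out naturally on arrays.
smoothed = (image + np.roll(image, 1, axis=0) + np.roll(image, 1, axis=1)) / 3.0
print(smoothed)
```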
However: Read more
NoSQL?
Eric Lai emailed today to ask what I thought about the NoSQL folks, and especially whether I thought their ideas were useful for enterprises in general, as opposed to just Web 2.0 companies. That was the first I heard of NoSQL, which seems to be a community discussing SQL alternatives popular among the cloud/big-web-company set, such as BigTable, Hadoop, Cassandra and so on. My short answers are:
- In most cases, no.
- Most of these technologies are designed for simple, high-volume OLTP (online transaction processing). Most large enterprises have an established way of doing OLTP, probably via relational database management systems. Why change?
- MapReduce is an exception, in that it’s designed for analytics. MapReduce may be useful for enterprises. But where it is, it probably should be integrated into an analytic DBMS. (A toy sketch of the pattern follows this list.)
- There’s one big countervailing factor to all these generalities — schema flexibility.
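For readers who haven’t run into MapReduce, here is a toy, single-process Python sketch of the pattern; Hadoop’s contribution is running the same map/shuffle/reduce phases in parallel across many nodes. The log records are made up, and the extra field on one record illustrates the schema flexibility just mentioned:

```python
# Toy MapReduce: count page views from clickstream-style records.
from itertools import groupby
from operator import itemgetter

records = [
    {"user": "a", "page": "/home"},
    {"user": "b", "page": "/home", "referrer": "search"},  # extra field is fine
    {"user": "a", "page": "/photos"},
]

# Map: emit (key, 1) pairs from each record.
mapped = [(r["page"], 1) for r in records]

# Shuffle: group pairs by key (Hadoop does this across the network).
mapped.sort(key=itemgetter(0))

# Reduce: aggregate the values for each key.
counts = {key: sum(v for _, v in group)
          for key, group in groupby(mapped, key=itemgetter(0))}
print(counts)  # {'/home': 2, '/photos': 1}
```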
As for the longer form, let me start by noting that there are two main kinds of reason for not liking SQL. Read more
Facebook’s experiences with compression
One little topic didn’t make it into my long post on Facebook’s Hadoop/Hive-based data warehouse: Compression. The story seems to be:
- Facebook uses gzip, and gets a little bit more than 6X compression.
- Experiments suggest bzip2 would reduce data by another 20% or so, increasing compression to the 7.5X range.
- The downside of bzip2 is 15-25% processing overhead, depending on the kind of data. (A toy comparison follows this list.)
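The trade-off is easy to reproduce in miniature with Python’s standard library, though the ratios from a contrived repetitive input like the one below will run far higher than Facebook’s real-world figures:

```python
# Rough gzip-vs-bzip2 comparison: compression ratio and time.
import bz2
import gzip
import time

data = b"uid=123\tpage=/home\tts=2009-05-11\n" * 100_000  # made-up log lines

for name, compress in [("gzip", gzip.compress), ("bzip2", bz2.compress)]:
    start = time.perf_counter()
    out = compress(data)
    elapsed = time.perf_counter() - start
    print(f"{name}: {len(data) / len(out):.1f}X in {elapsed:.3f}s")
```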
Facebook, Hadoop, and Hive
A few weeks ago, I posted about a conversation I had with Jeff Hammerbacher of Cloudera, in which he discussed a Hadoop-based effort at Facebook he previously directed. Subsequently, Ashish Thusoo and Joydeep Sarma of Facebook contacted me to expand upon, and in a couple of instances correct, what Jeff had said. They also filled me in on Hive, a data-manipulation add-on to Hadoop that they developed and subsequently open-sourced.
Updating the metrics in my Cloudera post:
- Facebook has 400 terabytes of disk managed by Hadoop/Hive, with a slightly better than 6:1 overall compression ratio. So the 2 1/2 petabytes figure for user data is reasonable (see the back-of-envelope check below).
- Facebook’s Hadoop/Hive system ingests 15 terabytes of new data per day now, not 10.
- Hadoop/Hive cycle times aren’t as fast as I thought I heard from Jeff. Ad targeting queries are the most frequent, and they’re run hourly. Dashboards are repopulated daily.
Nothing else in my Cloudera post was called out as being wrong.
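As a back-of-envelope check of the first bullet (my arithmetic, with an assumed ratio, not Facebook’s figures):

```python
compressed_tb = 400        # disk managed by Hadoop/Hive
compression_ratio = 6.2    # "slightly better than 6:1" (assumed value)
print(f"~{compressed_tb * compression_ratio / 1000:.1f} PB of user data")  # ~2.5 PB
```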
In a new-to-me metric, Facebook has 610 Hadoop nodes, running in a single cluster, due to be increased to 1000 soon. Facebook thinks this is the second-largest* Hadoop installation, or else close to it. What’s more, Facebook believes it is unusual in spreading all its apps across a single huge cluster, rather than doing different kinds of work on different, smaller sub-clusters. Read more
Cloudera presents the MapReduce bull case
Monday was fire-drill day regarding MapReduce vs. MPP relational DBMS. The upshot was that I was quoted in Computerworld and paraphrased in GigaOm as being a little more negative on MapReduce than I really am, in line with my comment:
Frankly, my views on MapReduce are more balanced than [my] weary negativity would seem to imply.
Tuesday afternoon the dial turned a couple notches more positive yet, when I talked with Michael Olson and Jeff Hammerbacher of Cloudera. Cloudera is a new company, built around the open source MapReduce implementation Hadoop. So far Cloudera gives away its Hadoop distribution, without charging for any sort of maintenance or subscription, and just gets revenue from professional services. Presumably, Cloudera plans for this business model to change down the road.
Much of our discussion revolved around Facebook, where Jeff directed a huge and diverse Hadoop effort. Apparently, Hadoop played much of the role of an enterprise data warehouse at Facebook — at least for clickstream/network data — including:
- 2 1/2 petabytes of data managed via Hadoop
- 10 terabytes/day of data ingested via Hadoop (Edit: Some of these metrics have been updated in a subsequent post about Facebook.)
- Ad targeting queries run every 15 minutes in Hadoop
- Dashboard roll-up queries run every hour in Hadoop
- Ad-hoc research/analytic Hadoop queries run whenever
- Anti-fraud analysis done in Hadoop
- Text mining (e.g., of things written on people’s “walls”) done in Hadoop
- 100s or 1000s of simultaneous Hadoop queries
- JSON-based social network analysis in Hadoop (a streaming-style mapper sketch follows this list)
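To give a flavor of how such jobs are often written, here is a hypothetical Hadoop Streaming mapper in Python; the JSON field name is invented, and this sketches the general technique, not Facebook’s actual code:

```python
# Hypothetical Hadoop Streaming mapper over JSON-encoded activity logs.
# Hadoop pipes input records to stdin and groups the tab-separated
# key/value pairs written to stdout before they reach the reducer.
import json
import sys

def main():
    for line in sys.stdin:
        try:
            event = json.loads(line)
        except ValueError:
            continue  # skip malformed records rather than failing the job
        print(f"{event['user_id']}\t1")  # emit (user_id, 1)

if __name__ == "__main__":
    main()
```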
Some Facebook data, however, was put into an Oracle RAC cluster for business intelligence. And Jeff does concede that query execution is slower in Hadoop than in a relational DBMS. Hadoop was also used to build the index for Facebook’s custom text search engine.
Jeff’s reasons for liking Hadoop over relational DBMS at Facebook included: Read more
Project Cassandra — Facebook’s open sourced quasi-DBMS
Edit: I posted much fresher information about Cassandra in July, 2010.
Facebook has open-sourced Project Cassandra, an imitation of Google’s BigTable. Actual public information about Facebook’s Cassandra seems to reside in a few links that may be found on the Cassandra Project’s Google code page. All the discussion I’ve seen seems to be based solely on some slides from a SIGMOD presentation. In particular, Dare Obasanjo offers an excellent overview of Cassandra. To wit: Read more
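For concreteness, here is a toy Python rendering of the sparse, multidimensional map at the heart of the BigTable design that Cassandra imitates; this is a pedagogical sketch, not Cassandra’s API:

```python
# Toy BigTable-style model:
# row key -> column family -> (column, timestamp) -> value.
from collections import defaultdict

table = defaultdict(lambda: defaultdict(dict))

def put(row_key, family, column, value, timestamp):
    table[row_key][family][(column, timestamp)] = value

put("user:42", "profile", "name", "Alice", 1)
put("user:42", "wall", "post:1", "hello", 2)  # rows are sparse:
put("user:99", "profile", "name", "Bob", 1)   # no "wall" data needed here

print(table["user:42"]["wall"])  # {('post:1', 2): 'hello'}
```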