What those nested data structures are about
As I’ve noted before, the very big web companies have an issue with nested data structures. The subject came up in XLDB talks yesterday too, so my big goal for lunch was to finally understand what was being talked about. Sitting at a table full of eBay and LinkedIn folks turned out to be a good tactic.
The explanation was led by Oliver Ratzesberger, late of eBay* and progenitor of eBay’s Singularity project. In simplest terms, one event can spawn a lot of event attribute information, perhaps in the form of name-value pairs, which it then makes sense to store together in some way. The example Oliver dwelled on was that, on any given web page, there can be 100+ pieces of information to record, including:
- All 50 search results you were shown, and their positions in the search rankings.
- Every ad, image, or graphical element.
- An ID for the test you were participating in (every page you see on eBay has some element being tested).
*Edit: Oliver subsequently moved on to Sears and then Teradata.
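To make the idea concrete, here is a minimal Python sketch of one page-view event kept together as a single nested record, then flattened into name-value pairs. The field names (`event_id`, `search_results`, `page_elements`, `test_id`) and the helper functions are hypothetical illustrations, not eBay's actual format.

```python
import json


def build_page_view_event(event_id, search_results, page_elements, test_id):
    """Bundle everything shown on one page into a single nested record.

    All field names here are assumptions made for illustration.
    """
    return {
        "event_id": event_id,
        "test_id": test_id,
        # Keep each result together with its position in the ranking.
        "search_results": [
            {"position": position, "item_id": item_id}
            for position, item_id in enumerate(search_results, start=1)
        ],
        "page_elements": list(page_elements),
    }


def flatten(value, prefix=""):
    """Yield (name, value) pairs for every leaf of a nested record."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten(child, f"{prefix}[{index}]")
    else:
        yield prefix, value


event = build_page_view_event(
    event_id="evt-1",
    search_results=[f"item-{n}" for n in range(50)],
    page_elements=[{"kind": "ad", "id": "ad-7"}, {"kind": "image", "id": "img-3"}],
    test_id="test-42",
)

pairs = list(flatten(event))
serialized = json.dumps(event)  # one self-contained blob per event
```

With 50 search results and two page elements, this single event already flattens to more than 100 name-value pairs, which is the scale Oliver described.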
There are several reasons why one might wish to store this information in ways that grieve relational purists. First, reconstructing all this information via joins would be brutally expensive. What’s more, it could be outright impractical. Some of it comes from third-party ad servers, which might not reproduce the same ads on demand. Some of it takes the form of rankings, which can’t always be reliably reproduced from one query to the next. (That’s just one of several reasons text search and relational DBMS are an awkward fit.)
Also, there’s a strong dynamic schema flavor to these databases. The list of attributes for one web click might be very different in kind from the list for the next click. Forcing that kind of variability into a fixed relational schema, while theoretically possible, doesn’t necessarily make a lot of sense.
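As a rough illustration of the dynamic schema point, the Python sketch below takes two events with different attribute lists and forces them into one fixed set of columns. The event types and attribute names are made up for the example; the point is only how many cells end up empty.

```python
# Two hypothetical events whose attribute lists barely overlap.
search_click = {
    "event": "search_click",
    "query": "camera",
    "position": 3,
    "test_id": "t-42",
}
checkout_view = {
    "event": "checkout_view",
    "cart_total": 129.99,
    "payment_options": ["card", "invoice"],
}


def to_fixed_rows(events):
    """Force variable events into one fixed schema: the union of all keys."""
    columns = sorted({key for event in events for key in event})
    # Attributes an event lacks become None (NULL in a relational table).
    rows = [[event.get(column) for column in columns] for event in events]
    return columns, rows


columns, rows = to_fixed_rows([search_click, checkout_view])
total_cells = len(columns) * len(rows)
null_cells = sum(cell is None for row in rows for cell in row)
```

Even with only two event types, 5 of the 12 cells are empty, and every new kind of event widens the table further. Keeping each event as its own nested record avoids that padding.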
Comments
7 Responses to “What those nested data structures are about”
Well Curt, here’s your answer… It’s Sears Holdings.
http://www.linkedin.com/in/oliverratzesberger
Hi Michael!
That was my first guess. Oliver was only modestly taken aback when I mentioned it. 🙂
Never mind Sears; KMart had an industry-leading, publicly visible CIO in the 90s or so, Dave Carlson. Oliver is stepping into a rich tradition.
This might be a naïve question, but regarding “reconstructing all this information via joins would be brutally expensive,” what about materialized joins? I’ve never actually had the chance to use materialized joins (my current company uses Mongo and previous places I’ve worked never had this problem), so I have no idea how practical they are. But when I read about them, they seemed like a possible solution to this problem.
[…] the nested data structure story? (It seems there is […]
[…] addition to ordinary tables, Parquet can handle nested data structures, ala Dremel. That is, a field can be array-valued, a cell in the array can itself be array-valued, […]
[…] model. Increasingly often, dynamic schemas seem preferable to fixed ones. Internet-tracking nested data structures are just one of the […]
[…] is introducing SQL extensions so that Impala can query nested data structures. More on that […]