Google has thousands of internal data formats, mostly simple ones
In connection with the release of Protocol Buffers, Kenton Varda of Google wrote:
At Google, our mission is organizing all of the world’s information. We use literally thousands of different data formats to represent networked messages between servers, index records in repositories, geospatial datasets, and more. Most of these formats are structured, not flat. This raises an important question: How do we encode it all?
That sounds like a lot. On the other hand, if “data format” is just a synonym for “table structure,” “file structure,” and/or “schema,” it sounds more plausible. Varda goes on to say:
a simple lists-and-records model … solves the majority of problems
Come to think of it, that sounds very consistent with the idea that MapReduce solves a large fraction of Google’s data management issues.
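To make the “lists-and-records” idea concrete, here is a minimal Python sketch of what such a model could look like: records with named, typed fields, nested by way of lists. The record names (Person, PhoneNumber), their fields, and the encode/decode helpers are all hypothetical, and JSON is used only as a stand-in encoding. This is not the Protocol Buffers API or wire format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class PhoneNumber:
    # Hypothetical record: two plain fields.
    number: str
    kind: str = "home"


@dataclass
class Person:
    # Hypothetical record that nests a list of other records.
    name: str
    id: int
    phones: List[PhoneNumber] = field(default_factory=list)


def encode(person: Person) -> str:
    """Flatten the record tree to plain dicts/lists and serialize as JSON."""
    return json.dumps(asdict(person), sort_keys=True)


def decode(text: str) -> Person:
    """Rebuild the record tree from its JSON form."""
    raw = json.loads(text)
    phones = [PhoneNumber(**p) for p in raw.get("phones", [])]
    return Person(name=raw["name"], id=raw["id"], phones=phones)


person = Person(name="Ada", id=1, phones=[PhoneNumber("555-0100", "mobile")])
round_tripped = decode(encode(person))
```

Each new “data format” in this style is just another record definition, which is one way thousands of them could accumulate without any single one being complicated.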
Comments
2 Responses to “Google has thousands of internal data formats, mostly simple ones”
The printed representation looks an awful lot like JSON (http://en.wikipedia.org/wiki/JSON). I wonder why not just use JSON, which is well-known and precisely specified? Anyway, this and JSON are very useful for many applications.
I agree that it’s another IDL. It’s not all THAT simple. But I haven’t used IDLs much in practice, and it’s probably simpler than CORBA’s IDL! So, it looks nice; no major breakthrough or anything like that, just an incremental improvement on what we all know about. That’s fine; incremental improvements are perfectly respectable.
Dan,
They addressed the JSON point directly, albeit briefly, in the comment thread.