How big are the intelligence agencies’ data warehouses?
Edit: The relevant part of the article cited has now been substantially changed, in line with Jeff Jonas’ remarks in the comment thread below.
Joe Harris linked me to an article that made a rather extraordinary claim:
At another federal agency Jonas worked at (he wouldn’t say which), they had a very large data warehouse in the basement. The size of the data warehouse was a secret, but Jonas estimated it at 4 exabytes (EB), and increasing at the rate of 5 TB per day.
Now, if one does the division, the quote implies that it would take 800,000 days for the database to double in size, which is absurd. Perhaps this (Jeff) Jonas guy was just talking about a 4 petabyte system and got confused. (Of course, that would still be pretty big.) But before I got my arithmetic straight, I ran the 4 exabyte figure past a couple of folks, as a target for the size of the US government’s largest classified database. Best guess turns out to be that it’s 1-2 orders of magnitude too high for the government’s largest database, not 3. But that’s only a guess …
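Spelling that division out, here is a quick Python sketch of the arithmetic (my own illustration; decimal units assumed, i.e. 1 EB = 1,000,000 TB):

    # Sanity-check the figures in the quote, using decimal units.
    EB_IN_TB = 1_000_000                  # 1 exabyte = 1,000,000 terabytes

    warehouse_tb = 4 * EB_IN_TB           # claimed size: 4 EB
    growth_tb_per_day = 5                 # claimed growth: 5 TB per day

    days_to_double = warehouse_tb / growth_tb_per_day
    print(days_to_double)                 # 800000.0 days
    print(days_to_double / 365)           # roughly 2,200 years, hence the absurdity

    # If "exabytes" were really petabytes, the numbers become plausible:
    print(4 * 1_000 / growth_tb_per_day / 365)   # about 2.2 years to double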
Comments
I agree that he must have meant “petabytes”, which is, to paraphrase, still pretty freakin’ huge.
However, I did some digging around (i.e., Googling…) and came across figures putting the total digitized size of global voice communication at 12 exabytes per year.
Given the NSA’s penchant for listening in on everyone’s phone calls, I can well imagine a *data store* of this size existing somewhere.
It’s a bit of a stretch to call it a database, but it’s still one hell of a lot of data.
As an aside, I understand that the NSA is the only agency that can decrypt Skype calls, but that it takes them a long time to do it.
If that’s true, they would need to store the calls somewhere first, while they decide which ones to decrypt.
{ What’s that black helicopter doing over my house? 😉 }
Joe
For the record, the writer of the original article got a number of facts twisted. Actually, in this case he simply misquoted me. With respect to the use of the word Exabyte … I suggested this verbiage to correct for the errors:
=======================
Jonas got to thinking what if they had 4 exabytes (EB) of data in the basement, and some have said they get new data in through the pipes at 5 TB a minute! “You sit there and realize you don’t get to Friday night and run a batch job to answer the question what does all this mean?,” he says. “You could use all the computing power and energy on Earth and you wouldn’t be able to do it.”
=======================
Note … I never said it was in a database, nor did I even imply it was all in one system – in my mind, it is probably lots of piles of many different kinds of data, in many different forms. I did use the term Exabytes … but more as an expression of gobs of data. Point being … batch periodic processing ain’t going to cut it. I think the smartest and fastest a system can be involves sensemaking on streams. I blogged a bit more about this here: http://jeffjonas.typepad.com/jeff_jonas/2006/08/accumulating_co.html
Unfortunately for me, there were even more problematic discrepancies between what I said and the story. I hate it when that happens.
Jeff,
I hate it when that happens. Thanks for stopping by with the clarification!
Now, let’s check orders of magnitude. In your correction, you’re hypothesizing 5 TB/minute — not per day! That reduces the 800,000 day figure I was mocking to something under 2 years, which makes a lot more sense. 🙂
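(A quick sketch of that revised arithmetic, for anyone checking along; decimal units assumed again:)

    # 4 EB arriving at 5 TB per minute, instead of 5 TB per day.
    minutes_to_double = 4 * 1_000_000 / 5     # 800,000 minutes
    days_to_double = minutes_to_double / (60 * 24)
    print(days_to_double)                     # about 556 days
    print(days_to_double / 365)               # about 1.5 years, i.e. under 2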
Thanks,
CAM
Curt,
Heh, so you are hinting back at the 4 EB hypothesis. That order of magnitude is obviously wrong, though, as you already said. By the way, the idea that someone would store all the world’s voice communication, to the tune of 12 EB a year (or even any appreciable fraction of it), seems absurd too.
Before going into any speculation on what government agencies are up to, there is a quicker way to impose a realistic upper bound here: how much storage has been produced worldwide in the past year?
I tried to estimate this:
http://bigdatamatters.com/bigdatamatters/2009/05/he-with-the-most-data-wins-.html
I’d be very interested to see a more precise estimate than my 50 EB.
Another way to impose an upper bound here would be to divide the federal budget by the cost of 1 GB of storage… Either way, there’s little chance that in 2009 a federal agency, or anyone else, would own exabytes.
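To put rough numbers on both checks (the 50 EB is my estimate from the post above; the ~$0.10 per GB disk price is just an assumed 2009-ish figure, not a measured one):

    # Rough sanity checks on the claimed 4 EB (decimal units, illustrative prices).
    GB_PER_EB = 1e9                         # 1 EB = 10^9 GB

    claimed_eb = 4

    # Check 1: share of estimated worldwide storage production (~50 EB in a year).
    print(claimed_eb / 50)                  # 0.08, i.e. ~8% of a year's output

    # Check 2: bare-drive cost at an assumed ~$0.10 per GB (2009 street price).
    cost_per_gb_usd = 0.10
    print(claimed_eb * GB_PER_EB * cost_per_gb_usd)
    # ~400 million dollars for raw disk alone, before redundancy, controllers,
    # power, floor space and staff.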
Pawel,
Think you may have missed the intended humour in my comment… 😉
Joe