Examples of machine-generated data
Not long ago I pointed out that much future Big Data growth will be in the area of machine-generated data, examples of which include:
- Computer, network, and other equipment logs
- Satellite and similar telemetry (whether for espionage or science)
- Location data such as RFID chip readings, GPS system output, etc.
- Temperature and other environmental sensor readings
- Sensor readings from factories, pipelines, etc.
- Output from many kinds of medical devices, in hospitals and (increasingly) homes alike
The core idea here is that human-generated data can grow only as fast as human data-generating activities allow it to, but machine-generated data is limited only by capital budgets and Moore’s Law. So machines’ ability to generate data is growing a lot faster than humans’.
Up to this point, I think there’s broad agreement, at least on the part of anybody who’s thought about it this way for very long. But that still leaves open questions as to which kinds of machine-generated data will matter first. The big five that matter right now are:
- Web logs (partially machine-generated, but tied to human actions)
- Call detail records (CDRs — ditto)
- Financial instrument trades (some purely machine-generated, some human-based)
- Network event logs (commonly associated with web logs)
- Telemetry collected by the government (especially for intelligence purposes)
A large fraction of all the 100 TB+ or petabyte+ data warehouse activity I know of falls into those areas.
Following along quickly are:
- Online game data. Since late last year, online game companies have come up over and over again as an important category of data warehousing/analytics users. Like most of the categories above, the gaming area actually features a hybrid between human- and machine-generated data.
- Genetic research data, although I don’t know to what extent the investment in data gathering is concentrated among the few obvious big pharmaceutical companies. Other health care data (research or clinical) will come along too, but doesn’t seem to be there yet.
Until recently I would have added:
- Energy exploration, energy production, energy refining, and/or utility network data
But while those areas seemed poised to get hot last year, I haven’t heard much about them the past few months, with a few exceptions:
- Accenture’s observation that new smart grids will generate up to eight orders of magnitude more data than old dumb grids do
- The recent article about the Terralliance fiasco (new kinds of oil exploration analytics, going beyond seismological data)
- Lots of concern about security flaws in utility smart grids.
Finally, I’ve been assuming that a big area going forward is location data, especially personal movement data. The data volumes involved could be similar to or even greater than those of CDRs. But privacy concerns with that are obviously immense. (Of course, in the case of Foursquare, this sort of overlaps with freely-shared game data.)
If you want to make all this more tangible in your mind, one area to look for ideas is in the huge amount of news about various kinds of innovative sensors. Sources include:
- Somebody named Landon Cox, who maintains a couple of feeds of sensor news.
- A Twitter feed, apparently associated with a Sensor Expo.
- Another Twitter feed, this one from Sun Labs. (I have no idea what Oracle is or isn’t doing with the Sun SPOT project that links to.)
- Yet another Twitter feed.
Comments
27 Responses to “Examples of machine-generated data”
Machine-generated text-and-number data is widely used and collected en masse, but sensor data has additional problems.
I’ve worked with a number of non-military sensor-based data sources, both terrestrial and satellite. Tens of petabytes of working set and tens of terabytes of new ingest per day for single sources are becoming normal. Managing this data is a serious problem, since conventional spatio-temporal indexing of the type that would often be useful doesn’t scale; big file systems and something only slightly better than brute-force search are the norm. If you add in latent network-connected data sources that have secondary value outside of their primary non-aggregated purpose, I would guess that the new sensor data systematically generated by machines each year is in the low exabyte range and growing fast, and that most of it is discarded.
In my own conversations with organizations that have valuable latent sensor data sets, there is a lot of interest in aggregating and analyzing this data but no cost-effective way of doing so. These are the myriad machine-generated data sources that no one ever hears about. Software like Hadoop doesn’t solve this particular problem very well yet. Once the benefit exceeds the cost of dealing with this data and turning it into an analyzable form I expect we will “discover” sensor-based data sets popping up like weeds as organizations attempt to monetize them.
Andrew,
Would SciDB address some of the issues you’re talking about?
SciDB looks like a big improvement; it would appear to provide a pretty rich toolset for basic data handling. At a minimum, it looks like it can do most of the first-pass processing of the input data.
The single most common bottleneck (in my past experience) besides raw computational power was indexing the features pulled out of the raw data. R-tree-like data structures at that scale have really poor ingest and update rates for the dynamic polygonal features that were the usable end product. It doesn’t look like this is addressed. Point clouds scale out very nicely for basic read-write but are compute-intensive for contextual/analytical queries relative to polygons, and there is a modest practical upper bound on the number of records for conventional polygon handling. It is a tradeoff.
Andrew,
I’m not sure that there’s any reasonable solution to polygons other than “Scan a lot of data and do what you have to do.” I.e., while R-trees and the like don’t sufficiently help w/ polygon-style analysis, they also don’t particularly get in the way. The same would likely be true of SciDB.
One thing SciDB is designed for is to store “cooked” post-processed results in line w/ the raw data. That could actually help with the kind of problems you’re raising.
At least in theory …
The data issue we are having is related to this as well. We are looking to expand a current 4 TB Oracle RAC DB hosting various aviation data, with growth rates of perhaps 50 to 100 TB per year of radar, GPS, terminal, weather, and other data feeds. Our analysts want to be able to take any flight and perform end-to-end analysis, model all airspace, and whatever else they can think of. We have a small Hadoop cluster as a test to see if this can meet our needs. Right now we are willing to experiment with almost any software to find a solution, as the current Oracle setup just is not performing as the customer expects.
We all knew the time would come when computing power would generate more data than can be physically handled by humans. Working in the treasury and capital markets world, it’s no surprise to see that the data generated by financial instrument trades makes your top 5 of machine-generated data that ‘matters’. However, it’s worth noting that computer-generated data comes not only from algo trading but also from the production of simulation scenarios, which are produced by large grid computing infrastructures, so traders can compute Value at Risk, Sensitivities and Counterparty Risk. Simulations of this nature will be relied upon more and more as financial institutions desperately try to avoid a repetition of the recent events in the financial markets. Ultimately, the challenge lies not only in managing this explosion of computer-generated data but also in having the ability to effectively analyse and act on it.
George,
To what extent are those simulations ever STORED?
I understand that simulations are huge in various parts of high-performance computing, but I’ve never gotten more than throwaway comments helping me understand whether actually storing the output of the simulations is a big deal.
Thanks,
CAM
Hi Curt,
It is true that most of the data generated during the simulation process for computing indicators such as Value at Risk (VaR), Potential Exposure, etc. used to be thrown away once key statistics were computed (loss quantile, closed-form VaR, MPE, etc.).
However, there is now, after the financial crisis, a requirement (business and regulatory) to go beyond a single number to understand and manage risk. Furthermore, one needs to monitor changes in risk, and be able to look back in the past and drill down to individual scenarios or trades. For all those reasons, the results of those simulations are now stored and need to be analyzed. You might want to look at this benchmark; it describes a realistic volume scenario for an international bank.
http://www.quartetfs.com/benchmark.php
I hope this helps,
Georges Bory
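To make the point about stored simulation output more concrete, here is a minimal Python sketch. It is not taken from the comment above or from the linked benchmark; the function names, the normally distributed P&L, and the portfolio size are all made-up assumptions for illustration. It shows why keeping the full per-trade, per-scenario results matters: the single VaR number can be computed and the scenarios thrown away, but drilling down into the worst scenario is only possible if they were stored.

```python
import random


def simulate_pnl(n_trades, n_scenarios, seed=0):
    """Hypothetical simulation: {trade_id: [P&L per scenario]}.

    The P&L values are random draws, an assumption for illustration only.
    """
    rng = random.Random(seed)
    return {
        f"trade_{i}": [rng.gauss(0.0, 1000.0) for _ in range(n_scenarios)]
        for i in range(n_trades)
    }


def portfolio_pnl(pnl_by_trade):
    """Sum P&L across trades, scenario by scenario."""
    return [sum(column) for column in zip(*pnl_by_trade.values())]


def value_at_risk(pnl, confidence=0.99):
    """Simple empirical VaR: the loss at the given quantile of scenarios."""
    losses = sorted(-x for x in pnl)  # positive number = loss
    index = min(int(confidence * len(losses)), len(losses) - 1)
    return losses[index]


def worst_scenarios(pnl, k=3):
    """Indices of the k scenarios with the largest portfolio loss."""
    return sorted(range(len(pnl)), key=lambda s: pnl[s])[:k]


def trade_contributions(pnl_by_trade, scenario):
    """Drill down: each trade's P&L within one scenario."""
    return {trade: values[scenario] for trade, values in pnl_by_trade.items()}


# The "single number" view needs only the aggregate result.
stored = simulate_pnl(n_trades=5, n_scenarios=1000)
total = portfolio_pnl(stored)
print("99% VaR:", round(value_at_risk(total), 2))

# The drill-down view needs the stored per-trade scenario data.
worst = worst_scenarios(total, k=1)[0]
print("Worst scenario:", worst, trade_contributions(stored, worst))
```

Even in this toy version, the stored structure has one value per trade per scenario, so its size grows as the product of the two, which is presumably where the large volumes come from.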
We’ve been involved with numerous military programs that collect large volumes of sensor data, and the volumes are certainly increasing all the time. The big issue for some of these folks is the big gains in hardware to collect higher-quality data.
Collecting large amounts of data for our customers probably isn’t as big a problem as figuring out what’s important (or how to make decisions on the data you get). What is increasingly important is how to fuse multiple sources of data so you can increase the value of that data. Petabytes of data are no longer an issue for many data stores; it’s the complexity of the data that will be the real challenge over the next few years.
[…] those four categories is the first one. That may change some day, with the growing importance of machine-generated data, and of big-data science in particular. But I think it’s a fair assessment at the present, […]
[…] Not many terms I coin gets marketing traction, but machine-generated data has grown some legs. Clients (Infobright, Cloudera) and non-clients alike have adopted it. I need […]
[…] posts made last December, January, and April, I […]
[…] Other interesting data, such as online game data, online research data (reference) […]
[…] Example: http://www.dbms2.com/2010/04/08/machine-generated-data-example/ […]