Examples and definition of machine-generated data
In posts made last December, January, and April, I argued:
- Much of the growth in analytic data volumes will come in the form of machine-generated data.
- Unlike human-generated data, machine-generated data will grow at Moore’s Law kinds of speeds.
- Thus, unlike human-generated data, which I advocate keeping pretty much in all its detail, machine-generated data will continue to be in large part thrown away.
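To make the growth argument concrete, here is a rough compound-growth sketch. The numbers are assumptions, not measurements: "Moore's Law kinds of speeds" is read as doubling every two years, and the 5% annual rate for business-driven, human-generated data is purely hypothetical.

```python
def projected_volume(initial: float, annual_growth: float, years: int) -> float:
    """Compound growth: initial * (1 + annual_growth) ** years."""
    return initial * (1.0 + annual_growth) ** years


# Assumption: doubling every 2 years, i.e. about 41% growth per year.
MOORE_ANNUAL_GROWTH = 2 ** 0.5 - 1
# Hypothetical rate for data that tracks business activity.
BUSINESS_ANNUAL_GROWTH = 0.05

for years in (0, 5, 10):
    machine = projected_volume(1.0, MOORE_ANNUAL_GROWTH, years)
    human = projected_volume(1.0, BUSINESS_ANNUAL_GROWTH, years)
    print(f"year {years:2d}: machine x{machine:6.2f}  human x{human:5.2f}")
```

Under these assumed rates the two curves separate quickly, which is the intuition behind expecting much machine-generated data to be discarded rather than kept in full detail.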
Recently and rather belatedly, I added a fairly obvious point: if we don’t keep all or even most of our machine-generated data, then what we keep is likely to be in some way massaged, extracted, or derived. The purpose of this post is to address a second oversight by giving a hopefully clear definition of what I actually mean by “machine-generated data.”
In classical human-generated data, what’s recorded is the direct result of human choices. Somebody buys something, makes an inquiry about it, fills an order from inventory, makes a payment in return for the object, makes a bank deposit to have funds for the next purchase, or promotes a manager who’s been particularly successful at selling stuff. Database updates ensue. Computers memorialize these human actions more quickly and cheaply than humans carry them out. Plenty of difficulties can occur with that kind of automation — applications are commonly too inflexible or confusing — but keeping up with data volumes is generally the least of the problems.
To a first approximation, machine-generated data is data that is not human-generated. I.e.,
Provisional definition: Machine-generated data is data that was produced entirely by machines OR data that is more about observing humans than recording their choices.
(That is definitely an inclusive OR.) Suggestions for slicker wording will be gratefully received — but in making them, please try not to run afoul of Monash’s First Law of Commercial Semantics.
Let’s elucidate this definition by means of examples. Some cases of machine-generated data are fairly straightforward. Two of the posts linked above feature the list:
- Computer, network, and other equipment logs
- Satellite and similar telemetry (whether for espionage or science)
- Location data such as RFID chip readings, GPS system output, etc.
- Temperature and other environmental sensor readings
- Sensor readings from factories, pipelines, etc.
- Output from many kinds of medical device, in hospitals and (increasingly) homes alike
Only the first of those items is problematic. Otherwise, these are essentially cases of machine data all the way down.
So let’s consider some of the leading hybrid cases. Web logs mix together a wide variety of data, including:
- Things the user types in.
- Clicks the user makes.
- Other indicators of the user’s attention.
- Records of what was on the page when the user made these choices.
- Large amounts of purely technical web server and network information.
Parsing these into reliable records of human activity (e.g., event extraction or sessionization) is an important computational task, and a precursor to almost any kind of analysis. Thus, raw records of human choices aren’t the essence of the database. Also, the network log part is typically 5x or more the size of the pure web log. Putting that together, I’d say the whole thing feels largely like a machine-generated data challenge, but admittedly it’s in a bit of a gray area.
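For readers unfamiliar with sessionization, here is a minimal sketch of one common approach: grouping each user's hits into sessions that are split whenever the user is inactive for longer than a timeout. The `Hit` record, the `sessionize` function, and the 30-minute gap are all illustrative assumptions, not a description of any particular product's method.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    """Hypothetical parsed web-log record."""
    user_id: str
    timestamp: datetime
    url: str


def sessionize(hits: Iterable[Hit],
               gap: timedelta = timedelta(minutes=30)) -> List[List[Hit]]:
    """Split each user's hits into sessions at inactivity gaps longer than `gap`."""
    sessions: List[List[Hit]] = []
    open_session = {}  # user_id -> the session list currently being built
    for hit in sorted(hits, key=lambda h: (h.user_id, h.timestamp)):
        current = open_session.get(hit.user_id)
        # Start a new session for a new user or after a long pause.
        if current is None or hit.timestamp - current[-1].timestamp > gap:
            current = []
            sessions.append(current)
            open_session[hit.user_id] = current
        current.append(hit)
    return sessions


# Made-up sample data for illustration only.
sample_hits = [
    Hit("u1", datetime(2010, 12, 30, 10, 0), "/home"),
    Hit("u2", datetime(2010, 12, 30, 10, 5), "/home"),
    Hit("u1", datetime(2010, 12, 30, 10, 10), "/product"),
    Hit("u1", datetime(2010, 12, 30, 11, 0), "/home"),
]
sample_sessions = sessionize(sample_hits)
```

With this sample, the two early hits from `u1` form one session, the later `u1` hit starts another, and `u2` gets a session of its own. Real web logs would first need the parsing and event-extraction step that turns raw lines into records like `Hit`.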
Call detail records (CDRs) initially feel machine-generated too, but it may be a bit misleading to view them as such. Half a kilobyte of data (a typical length) for a several-minute human activity is not a whole lot. Obviously, if lots of network routing data gets attached, or if some intelligence agency parses the call’s contents, it could be a different matter. But for now I’m inclined to leave CDRs along with, say, in-store point-of-sale (POS) data as a category of particularly large human-generated data sets.
Social media and gaming records seem more like web logs than CDRs: products of human choices so casual that they might as well be machine-generated. Obviously I’m not referring to WordPress authoring here, but rather to users who click or tap through a dizzying array of choices at ever higher speeds, with ever more log-style data created as byproducts of every user action.
And finally, there’s a different kind of edge case. Many stock trades are human-generated in the usual way. Even so, trade volume these days is dominated either by purely algorithmic trades, or else trades in which an algorithm turns one human decision into a dizzying array of individual trades. So I think stock trades can be fairly counted as machine-generated data. But I may reverse my opinion if rate-limiting regulations serve to limit or reduce their algorithmic aspect.
If you’ve noticed ways in which my definition of “machine-generated data” is less than ideal, please be so kind as to recall one thing — no product category definition can ever be perfect.
Comments
28 Responses to “Examples and definition of machine-generated data”
This is an important topic, and I am glad you are giving it some thought and visibility. I think I have more to say on this subject than can fit in a blog comment, so I wrote a response post on my blog:
http://dbmsmusings.blogspot.com/2010/12/machine-vs-human-generated-data.html
Hi Daniel,
Thanks for the rapid response!
As per our Twitter exchange, I stand by my points that you disagreed with in your post. My reason is that I think data/(human action) or data/(minute of human use) will continue to increase rapidly in line with advancing technology.
Also, you called out a good point when you added, in effect, “If a data set is so big that a lot of it will get thrown away, then a lot of the rest will be kept on cheap storage, e.g., disk.”
[…] The Wikipedia article on same doesn’t get the job done yet. (Edit: Here’s my take on defining machine-generated data. Be sure to read through to Daniel Abadi’s […]
Hi Curt,
Fair enough. The particular categories of individual applications are not the types of disagreements that need to be resolved.
I think we are in agreement on the basic point, which is that as long as there is machine-generated data, there will be “Big Data”, even though the definition of “Big” will change over time.
[…] from Google and Facebook to Walmart and eBay. There is some debate about what big data means, with Curt Monash and Dan Abadi having recent posts on the […]
[…] Monash has been trying to define Machine Generated Data (but Daniel Abadi doesn’t fully agree) because machine generated data is what’ll be […]
Maybe this will help to clarify things a bit, or maybe it will make things worse… With regard to business intelligence, “machine generated” data typically describes “what is or has happened”. Machine generated data typically cannot answer the “why did something happen” question because of its inherently “narrow” (as opposed to wide) context.
Alan,
Huh?? If you want to know why the machine stopped working, its machine-generated log file contains the most important data to help you figure out why.
You also seem to be assuming that some kinds of raw data answer “why” questions just by virtue of having been collected, without further analysis. Except in very special cases — e.g., answers to survey questions that have “Why” in them — I don’t see what your basis for that assumption is.
[…] Machine-generated, such as web log or sensor data. […]
[…] been on a terminology binge recently, defining terms such as machine-generated data, analytic platform, internet request processing, and transparent sharding. So perhaps this is a […]
[…] by every nuance in that post, which may differ slightly from those in my more recent posts about machine-generated data and poly-structured databases. But one general idea is hard to […]
[…] data is a whole other can of worms. Paradigmatic examples of what I mean by machine-generated data […]
[…] Hadapt use cases are centered around keeping machine-generated or other poly-structured data in Hadoop, and extracting, enhancing, or otherwise deriving some of […]
[…] “people data” — customer loyalty, health care, etc . — rather than purely machine-generated data, with the paradigmatic target application being personalized […]
[…] Terms I’ve recently sponsored, such as investigative analytics or machine-generated data. […]
[…] said, Infobright is small and focused on machine-generated data. So I wouldn’t be confident in Infobright’s future technology path for human-generated […]
[…] out that this is all supported by cheap data creation and acquisition, specifically in the area of machine-generated data, which gets the full benefit of Moore’s […]
[…] All human-generated data should be retained. […]
[…] other was “phone home” — i.e., the ingest of machine-generated data from a lot of different devices. This is something that’s obviously been coming for several […]
[…] machine-generated data. Human-generated data grows at the rate business activity does, plus 0-25%. Machine-generated data grows at the rate of Moore’s Law, also plus 0-25%, which is a much higher total. In […]
[…] Machine-generated data and “content” both call for multi-datatype DBMS. And taken together, those are a large fraction of the future of computing. Consequently … […]
[…] Glassbeam has an analytic technology stack focused on poly-structured machine-generated data. […]
[…] is associated with multiple (sometimes conflicting) definitions, two prominent ones come from Curt Monash and Daniel Abadi. The focus of the machine data track is on data which is generated and/or […]
[…] of this discussion could apply to machine-generated data in general. But right now I think more players are doing product management with an explicit […]
[…] In general, candidate application areas for streaming-to-Hadoop match those that involve large volumes of machine-generated data. […]
[…] IT innovation these days is focused on machine-generated data (sometimes just called “machine data”), rather than human-generated. So as I find […]