Terminology: Data mustering
I find myself in need of a word or phrase that means “bringing data together from various sources so that it’s ready to be used,” where the use can be analysis or operations. The first words I thought of were “aggregation” and “collection,” but they both have other meanings in IT. Even “data marshalling” has a specific meaning different from what I want. So instead, I’ll go with data mustering.
I mean for the term “data mustering” to encompass at least three scenarios:
- Integrated (relational) data warehouse.
- Big bit bucket.
- Big bit stream.
Let me explain what I mean by each.
“Integrated data warehouse” is a phrase Teradata has started using for enterprise data warehouses that, like approximately every other EDW in the entire history of data warehousing, aren’t truly enterprise-wide. In other words, it means “not just a data mart”. No category name is perfect, but I think that one works reasonably well.
I previously described the big bit bucket use case as
Users take a whole lot of data, often machine-generated data in logs of different kinds, and dump it into one place, managed by Hadoop, at open-source pricing.
and quickly added
Of course, there are various outfits who’d like to sell you not-so-cheap bit buckets. Contending technologies include Hadoop appliances (which I don’t believe in), Splunk (which in many use cases I do), and MarkLogic (ditto, but often the cases are different from Splunk’s). Cloudera and IBM, among other vendors, would also like to sell you some proprietary software to go with your standard Apache Hadoop code.
I think I’ll stand pat on that explanation. 🙂
By analogy, a big bit stream is various streams of data, assembled in the custody of a streaming engine. Sybase told me Wednesday that this scenario appears in both of the traditional markets for CEP/streaming: national intelligence, where it is a major use of streaming, and capital markets, in some use cases. It’s also consistent with what I’ve heard from other CEP/streaming vendors.
As for where I got the word “mustering” — it’s a military term, for when you assemble your troops and their gear either for inspection or for actual use. The main modern usage I know of the word is as part of the phrase “pass muster”, which originally referred to the concept that the person being paid to put a regiment together should from time to time demonstrate that the regiment physically existed in the form that regimental records seemed to show.
Comments
12 Responses to “Terminology: Data mustering”
Interesting thoughts as usual, Curt.
For what it’s worth, I’ve heard Cloudera refer to a similar concept (by my estimation) as “Data Stewardship” and the person who performs the role as a Data Steward.
Have you come across that term in your talks with them, and is it the same thing you’re describing here?
“Data stewardship” doesn’t ring much of a bell, but sounds too close to “data governance” for my tastes.
As far as I can tell, the closest terms I’ve heard and used to describe it are “data federation” or “federated database”.
Curt, how do “data integration” and the decades-old industry behind it fit in here? “Data integration” is also a term that is widely used by the data management research community, and the Wikipedia entry of the same name provides a good summary.
Alex,
Data federation usually refers to data being physically in different systems, but being viewed as a logical whole. In contrast, what I’m referring to usually involves data all being in one place.
Dawit,
Thank you for pointing me at that hideously incorrect Wikipedia entry, which seems to use “data integration” as a synonym for “data federation”.
Please see my previous comment for the contrast to data federation.
The word mustering is still used (at least in Australia) for bringing together cattle before taking them off to become tasty steaks etc.
Hah! Chalk one up for the Australians! Even if the whole country does mispronounce my last name. 🙂
I like the term data mustering, but data wrangling and data herding would work also.