Dataset management
I coined a new term, dataset management, for my clients at Revelytix, which they indeed adopted to describe what they do. It would also apply to the recently released Cloudera Navigator. To a first approximation, you may think of dataset management as either or both:
- Metadata management in a structured-file context.
- Lineage/provenance, auditing, and similar stuff.
Why not just say “metadata management”? First, the Revelytix guys have long been in variants of that business, and they’re tired of the responses they get when they use the term. 🙂 Second, “metadata” could apply either to data about the file or to data about the data structures in the file or perhaps to data about data in the file, making “metadata” an even more confusing term in this context than in others.
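To make the three senses of "metadata" concrete, here is a minimal sketch in Python. It is not how Revelytix Loom or Cloudera Navigator work; the function name `describe_dataset`, the file name `sales.csv`, and the sample contents are all hypothetical, chosen only to show that one small structured file yields three different kinds of "data about data".

```python
import csv
import io


def describe_dataset(name: str, text: str) -> dict:
    """Return three kinds of 'metadata' for a small CSV held in memory.

    Hypothetical helper for illustration only.
    """
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    columns = list(reader.fieldnames or [])
    return {
        # 1. Data about the file itself.
        "file": {"name": name, "size_bytes": len(text.encode("utf-8"))},
        # 2. Data about the data structures in the file.
        "structure": {"columns": columns},
        # 3. Data about the data in the file.
        "content": {
            "row_count": len(rows),
            "distinct_values": {c: len({r[c] for r in rows}) for c in columns},
        },
    }


# Made-up sample data standing in for a structured file.
sample = "region,sales\nEast,100\nWest,250\nEast,75\n"
meta = describe_dataset("sales.csv", sample)
print(meta)
```

All three sub-dictionaries could fairly be called "metadata", which is exactly the ambiguity described above.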
My idea for the term dataset is to connote more grandeur than would be implied by the term “table”, but less than one might assume for a whole “database”. I.e.:
- A dataset contains all the information about something. This makes it a bigger deal than a mere table, which could be meaningless outside the context of a database.
- But the totality of information in a “dataset” could be less comprehensive than what we’d expect in a whole “database”.
As for the specific products, both of which you might want to check out:
- Cloudera Navigator:
  - Is one product from a leading Hadoop company.
  - Assumes you use Cloudera’s flavor of Hadoop.
  - Is generally available.
  - Starts with auditing (lineage coming soon).
- Revelytix Loom:
  - Is the main product of a small metadata management company.
  - Is distro-agnostic.
  - Is in beta.
  - Already does lineage.
Comments
6 Responses to “Dataset management”
It is interesting to look at the two (oddly segregated) streams:
– imposing structure on Hadoop data (HCatalog, HBase catalog and catalog tracker, and many more)
– looking at the Hadoop ontology such as the above
The convergence would be at the app level if the data doesn’t have a strict, single desired semantic interpretation.
It is also fascinating how useless most of these tools seem where the data is semi-structured or has multiple interpreters attempting to understand it in distinct ways.
It’s a trade-off.
In a traditional relational world, the schemas for different apps are integrated into one hairball, but there are clean interfaces between the database and the app.
In the dynamic schema world, the database schema is tightly integrated into the app, but is much more independent of other apps’ schemas.
Hi Curt,
How’s a ‘dataset’ different from a ‘data mart’ from the old data-warehousing days?
Why are you introducing a new term?
Thanks,
Hari
Hari,
In (overly) simple terms:
A dataset would most typically be a single file in some format or other, managed by Hadoop.
A data mart would most typically be a set of relational tables, managed by an RDBMS.
[…] out a data integration suite to cover a limited universe of data stores. And Revelytix’ dataset management technology is a nice piece toward an integrated data […]
[…] Cloudera Navigator for object storage is a roadmap item. […]