March 18, 2013

Dataset management

I coined a new term, dataset management, for my clients at Revelytix, which they indeed adopted to describe what they do. It would also apply to the recently released Cloudera Navigator. To a first approximation, you may think of dataset management as either or both:

- Metadata management in a structured-file context.
- Lineage/provenance, auditing, and similar stuff.

Why not just say “metadata management”? First, the Revelytix guys have long been in variants of that business, and they’re tired of the responses they get when they use the term. 🙂 Second, “metadata” could apply either to data about the file or to data about the data structures in the file or perhaps to data about data in the file, making “metadata” an even more confusing term in this context than in others.
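To make that ambiguity concrete, here is a minimal Python sketch of how one record about a dataset might keep the three senses of "metadata" apart, with lineage alongside. Everything in it is an assumption for illustration: the `DatasetRecord` class, its field names, the `upstream` helper, and the sample paths and values are hypothetical and are not drawn from Revelytix Loom or Cloudera Navigator.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DatasetRecord:
    """Hypothetical record; not the data model of any real product."""
    # Sense 1: data about the file itself.
    path: str
    size_bytes: int
    file_format: str
    # Sense 2: data about the data structures in the file (name -> type).
    columns: Dict[str, str] = field(default_factory=dict)
    # Sense 3: data about the data in the file.
    row_count: int = 0
    # Lineage: paths of the datasets this one was directly derived from.
    derived_from: List[str] = field(default_factory=list)


def upstream(path: str, catalog: Dict[str, DatasetRecord]) -> List[str]:
    """Return all ancestor paths of `path`, nearest first (breadth-first)."""
    seen: List[str] = []
    queue = list(catalog[path].derived_from)
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue
        seen.append(current)
        # Ancestors that are not in the catalog are listed but not expanded.
        if current in catalog:
            queue.extend(catalog[current].derived_from)
    return seen


# Illustrative values only: a raw log, a cleaned file, and a report.
catalog = {
    "/data/raw.log": DatasetRecord("/data/raw.log", 1000, "text"),
    "/data/cleaned.csv": DatasetRecord(
        "/data/cleaned.csv", 800, "csv",
        columns={"user_id": "int", "event": "string"},
        row_count=40,
        derived_from=["/data/raw.log"],
    ),
    "/data/report.csv": DatasetRecord(
        "/data/report.csv", 100, "csv",
        columns={"event": "string", "count": "int"},
        row_count=5,
        derived_from=["/data/cleaned.csv"],
    ),
}

print(upstream("/data/report.csv", catalog))
```

Keeping the three kinds of description in separate fields means a question like "where did this file come from?" never gets tangled up with "what columns does it have?".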

My idea for the term dataset is to connote more grandeur than would be implied by the term “table”, but less than one might assume for a whole “database”. I.e.:

- A dataset contains all the information about something. This makes it a bigger deal than a mere table, which could be meaningless outside the context of a database.
- But the totality of information in a “dataset” could be less comprehensive than what we’d expect in a whole “database”.

As for the specific products, both of which you might want to check out:

Cloudera Navigator:
- Is one product from a leading Hadoop company.
- Assumes you use Cloudera’s flavor of Hadoop.
- Is generally available.
- Starts with auditing (lineage coming soon).
Revelytix Loom:
- Is the main product of a small metadata management company.
- Is distro-agnostic.
- Is in beta.
- Already does lineage.

Categories: Cloudera, Hadoop

Comments

6 Responses to “Dataset management”

aaron on March 18th, 2013 2:33 pm

It is interesting to look at the two (oddly segregated) streams:
– imposing structure on Hadoop data (HCatalog, HBase catalog and catalog tracker, etc. and many more)
– looking at the Hadoop ontology such as the above
The convergence would be at the app level if the data doesn’t have a single, strict desired semantic interpretation.

It is also fascinating how useless most of these tools seem where the data is semi-structured or has multiple interpreters attempting to understand it in distinct ways.

Curt Monash on March 18th, 2013 8:12 pm

It’s a trade-off.

In a traditional relational world, the schemas for different apps are integrated into one hairball, but there are clean interfaces between the database and the app.

In the dynamic schema world, the database schema is tightly integrated into the app, but is much more independent of other apps’ schemas.

Hari on March 29th, 2013 7:56 am

Hi Curt,

How’s a ‘dataset’ different from a ‘data mart’ from the old data-warehousing days?

Why are you introducing a new term?

Thanks,
Hari

Curt Monash on March 29th, 2013 10:41 am

Hari,

In (overly) simple terms:

A dataset would most typically be a single file in some format or other, managed by Hadoop.

A data mart would most typically be a set of relational tables, managed by an RDBMS.

Teradata bought Hadapt and Revelytix | DBMS 2 : DataBase Management System Services on July 23rd, 2014 4:29 am

[…] out a data integration suite to cover a limited universe of data stores. And Revelytix’ dataset management technology is a nice piece toward an integrated data […]

Cloudera in the cloud(s) | DBMS 2 : DataBase Management System Services on January 22nd, 2016 6:57 am

[…] Cloudera Navigator for object storage is a roadmap item. […]
