Data messes
A lot of what I hear and talk about boils down to “data is a mess”. Below is a very partial list of examples.
To a first approximation, one would expect operational data to be rather clean. After all, it drives and/or records business transactions. So if something goes awry, the result can be lost money, disappointed customers, or worse, and those are outcomes to be strenuously avoided. Up to a point, that’s indeed true, at least at businesses large enough to be properly automated. (Unlike, for example — 🙂 — mine.)
Even so, operational data has some canonical problems. First, it could be inaccurate; somebody can just misspell or otherwise botch an entry. Further, there are multiple ways data can be unreachable, typically because it’s:
- Inconsistent, in which case humans might not know how to look it up and database JOINs might fail.
- Unintegrated, in which case one application might not be able to use data that another happily maintains. (This is the classic data silo problem.)
Inconsistency can take multiple forms, including:
- Variant names.
- Variant spellings.
- Variant data structures (not to mention datatypes, formats, etc.).
Addressing the first two is the province of master data management (MDM), and also of the same data cleaning technologies that might help with outright errors. Addressing the third is the province of other data integration technology, which also may be what’s needed to break down the barriers between data silos.
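To make the join-failure point concrete, here’s a minimal sketch in Python, with made-up customer records; the normalization is deliberately crude, standing in for what MDM and data cleaning tools do far more thoroughly.

```python
# Hypothetical records: the CRM and billing systems spell the same customer
# differently, so an exact-match join finds nothing.
crm_rows = [{"customer": "Acme Corp.", "region": "East"}]
billing_rows = [{"customer": "ACME Corporation", "balance": 1200}]

def normalize(name):
    """Crude canonicalization: lowercase, drop punctuation and legal suffixes."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    words = [w for w in cleaned.split() if w not in {"corp", "corporation", "inc", "llc"}]
    return " ".join(words)

# Exact-match join finds 0 pairs; joining on the normalized key finds 1.
exact = [(c, b) for c in crm_rows for b in billing_rows
         if c["customer"] == b["customer"]]
matched = [(c, b) for c in crm_rows for b in billing_rows
           if normalize(c["customer"]) == normalize(b["customer"])]
print(len(exact), len(matched))  # 0 1
```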
So far I’ve been assuming that data is neatly arranged in fields in some kind of database. But suppose it’s in documents or videos or something? Well, then there’s a needed step of data enhancement; even when that’s done, further data integration issues are likely to be present.
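As a toy illustration of that enhancement step, here’s a Python sketch that pulls fielded data out of free text, assuming a made-up support-ticket format; real enhancement (entity extraction, speech-to-text, video tagging) is of course far harder than a regular expression.

```python
import re

# Hypothetical free-text ticket; nothing in it is join-able until fields are extracted.
ticket = "Customer ACME-0042 reported order #98311 arrived damaged on 2015-08-03."

pattern = re.compile(
    r"Customer (?P<customer_id>[A-Z]+-\d+).*?"
    r"order #(?P<order_id>\d+).*?"
    r"on (?P<date>\d{4}-\d{2}-\d{2})"
)

match = pattern.search(ticket)
if match:
    # Now the facts live in fields and can face the usual integration issues.
    print(match.groupdict())
    # {'customer_id': 'ACME-0042', 'order_id': '98311', 'date': '2015-08-03'}
```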
All of the above issues occur with analytic data too. In some cases it probably makes sense not to fix them until the data is shipped over for analysis. In other cases, it should be fixed earlier, but isn’t. And in hybrid cases, data is explicitly shipped to an operational data warehouse where the problems are presumably fixed.
Further, some problems are much greater in their analytic guise. Harmonization and integration among data silos are likely to be much more intense. (What is one table for analytic purposes might be many different ones operationally, for reasons that might span geography, time period, or application legacy.) Addressing those issues is the province of data integration technologies old and new. Also, data transformation and enhancement are likely to be much bigger deals in the analytic sphere, in part because of poly-structured internet data. Many Hadoop and now Spark use cases address exactly those needs.
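For instance, here’s a minimal sketch, with invented schemas, of the “many operational tables, one analytic table” situation: two regional order feeds that differ in column names, currency, and date format get mapped into a single analytic layout.

```python
from datetime import datetime

# Hypothetical regional feeds -- "one table for analytic purposes" that is two
# tables operationally, with different column names, currencies, and date formats.
us_orders = [{"order_id": 1, "amount_usd": 250.0, "ship_date": "08/03/2015"}]
eu_orders = [{"bestellnr": 7, "betrag_eur": 180.0, "versanddatum": "2015-08-04"}]

EUR_TO_USD = 1.10  # placeholder rate, purely for illustration

def from_us(row):
    return {"order_id": row["order_id"],
            "amount_usd": row["amount_usd"],
            "ship_date": datetime.strptime(row["ship_date"], "%m/%d/%Y").date()}

def from_eu(row):
    return {"order_id": row["bestellnr"],
            "amount_usd": round(row["betrag_eur"] * EUR_TO_USD, 2),
            "ship_date": datetime.strptime(row["versanddatum"], "%Y-%m-%d").date()}

analytic_orders = [from_us(r) for r in us_orders] + [from_eu(r) for r in eu_orders]
print(analytic_orders)
```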
Let’s now consider missing data. In operational cases, there are three main kinds of missing data:
- Missing values, as a special case of inaccuracy.
- Data that was only collected over certain time periods, as a special case of changing data structure.
- Data that hasn’t been derived yet, as the main case of a need for data enhancement.
All of those cases can ripple through to cause analytic headaches. But for certain inherently analytic data sets — e.g. a weblog or similar stream — the problem can be even worse. The data source might stop functioning, or might change the format in which it transmits; but with no immediate operations compromised, it might take a while to even notice. I don’t know of any technology that does a good, simple job of addressing these problems, but I am advising one startup that plans to try.
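To give a flavor of what such a check might look like if you hand-rolled it, here’s a minimal Python sketch with made-up field names and thresholds; it flags a feed that has gone quiet, or whose records have stopped carrying the expected fields.

```python
from datetime import datetime, timedelta

# Hypothetical weblog-ish feed: warn if the source goes silent too long, or if
# its records stop carrying the fields downstream analytics expect.
EXPECTED_FIELDS = {"timestamp", "url", "status", "user_agent"}
MAX_SILENCE = timedelta(hours=1)

def check_feed(batch, last_seen, now=None):
    """Return warnings about the latest batch of records from the feed."""
    now = now or datetime.utcnow()
    warnings = []
    if not batch and now - last_seen > MAX_SILENCE:
        warnings.append("source silent for %s" % (now - last_seen))
    for record in batch:
        missing = EXPECTED_FIELDS - set(record)
        if missing:
            warnings.append("record missing fields: %s" % sorted(missing))
            break  # one example is enough to raise the alarm
    return warnings

# Example: the feed changed format and quietly dropped 'user_agent'.
print(check_feed([{"timestamp": "2015-08-03T12:00:00", "url": "/", "status": 200}],
                 last_seen=datetime.utcnow()))
```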
Further analytics-mainly data messes can be found in three broad areas:
- Problems caused by new or changing data sources hit much faster in analytics than in operations, because analytics draws on a greater variety of data.
- Event recognition, in which most of a super-high-volume stream is discarded while the “good stuff” is kept, is more commonly a problem in analytics than in pure operations. (That said, it may arise on the boundary of operations and analytics, namely in “real-time” monitoring; a small filtering sketch follows this list.)
- Analytics has major problems with data scavenger hunts, in which business analysts and data scientists don’t know what data is available for them to examine.
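Here is the filtering sketch promised above: event recognition boiled down to its essence, where a made-up threshold keeps a tiny fraction of a high-volume stream and throws the rest away.

```python
import random

ALERT_THRESHOLD = 0.999  # hypothetical cutoff for what counts as an "event"

def readings(n=1000000):
    """Stand-in for a very high-volume sensor or log stream."""
    for i in range(n):
        yield {"id": i, "value": random.random()}

# Keep only the "good stuff"; roughly 0.1% of the stream survives.
events = [r for r in readings() if r["value"] >= ALERT_THRESHOLD]
print("kept %d of 1000000 readings" % len(events))
```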
That last area, the data scavenger hunt, is the domain of a lot of analytics innovation. In particular:
- It’s central to the dubious Gartner concept of a Logical Data Warehouse, and to the more modest logical data layers I advocate as an alternative.
- It’s been part of BI since the introduction of Business Objects’ “semantic layer”. (See, for example, my recent post on Zoomdata.)
- It’s a big part of the story of startups such as Alation or Tamr.
- In a failed effort, it was part of Greenplum’s pitch some years back, as an aspect of the “enterprise data cloud”.
- It led to some of the earliest differentiated features at Gooddata.
- It’s implicit in some BI collaboration stories, in some BI/search integration, and in ClearStory’s “Data You May Like”.
Finally, suppose we return to the case of operational data, assumed to be accurately stored in fielded databases, with sufficient data integration technologies in place. There’s still a whole other kind of possible mess beyond those I cited above — applications may not be doing a good job of understanding and using the data. I could write a whole series of posts on that subject alone … but it’s going slowly. 🙂 So I’ll leave that subject area for another time.
Comments
Curt,
As usual an insightful, pithy and well timed piece.
But it is worse than that. We still haven’t got to grips with vocabulary – especially in an industry like travel. One might be forgiven for thinking of a direct flight as going directly from point A to point B. But it ain’t necessarily so. If there is an intermediate point, and the flight number doesn’t change and you don’t have to get off, then the flight can be described as direct. If you want to go somewhere directly, you have to go non-stop. If you don’t mind stopping then you might have a connection.
And that’s a simple case.
There’s no substitute for knowing what your terms mean – and realizing that others may have different terms that overlap, contradict or extend yours.
Corporate data definitions are all well and good, but people abuse the terms or use them in outmoded ways. We treat corporate language like natural language – meaning changes with time. Mostly we hope we get the meaning right, but when dealing with situations where accuracy and precision are required, we need to make sure we have standard definitions – and that we use them.
Chris — just to be clear, that’s a data value problem you’re focusing on, not metadata, correct?
Just the sort of addition I was hoping for. I indeed assumed that fielded data had unambiguous meanings, even if they were vague. (Example of vague but not ambiguous — a number that clearly means what it means, EXCEPT that the intended precision is unclear.)
Much of the data in organizations is messy mainly because the data was not designed following any rigorous practices. This is an example of data illiteracy.
Few data modelers and users apply naming conventions or follow standards for writing descriptions. Few organizations maintain a controlled vocabulary of terms. Data design is a random act each time new data is added to a database or a new database is developed.
We then try to use technology such as MDM and data cleansing to attempt to resolve the variations in this randomly created data. Mapping data silos itself becomes an unstructured exercise since most organizations do not have rigorous harmonization practices. How do you harmonize the semantics and pragmatics (context of use) of the data?
Data messes are not a technical problem but a Data Literacy problem. We keep throwing technical solutions at a human behavioral problem.
Richard,
While I am a longtime lover of RDBMS, I admit that today’s data velocity and variety don’t allow us to follow the relational model… I believe that following the more “high level” practices you advocate has become hardly possible. I think our software must become smarter, so it can make sense of the data it is presented with.
Relevant material. Note that there is a huge gap in the taxonomy here – data semantics.
Consider a sale going from an online store to various targets, including the GL, a sales mart, and an operational BI mart. In the process we create three distinct truths, since the timing of the data, and what is used to enrich it, can differ between them. (So one target may know that the customer is part of a family that shops at the site, while another may know about a refund before the others do.) They may or may not eventually become consistent; often sales recognizes revenue based on different rules than the GL does.
This is a fundamental justification for EDW, where a company can institute data governance and policy and come up with official truths and standards, where data is reconciled and integrated and such.
The justification for ETL is compelling for master data and reference data, where it imposes standards (though that’s awkward for an early startup; you really can’t redefine everything midstream, so the data semantics need longevity). For transactional, detail, and other data with more velocity, ETL is often awkward and problematic: shoehorning in data that changes rapidly often hides semantics or entails other work that doesn’t feel right, and it creates bottlenecks if the dev velocity is high.
Note how similar this issue is to big data scenarios, where you have a choice of standardizing data and sharing vs. having each application do ETL-like activities on the fly.
Richard – I think data illiteracy is just one reason for data issues. Another example is time-to-market decisions, another is acquisition or packaged software with different semantics.
David – I suspect the RDBMS discussion is a shibboleth – almost all big data stuff can happen in an RDBMS without *maintenance and process centric* management. Mainframes became 95% stable and got taken over by operations, which froze out iterative development and innovation, and the same has happened in many companies for RDBMS. Many big data projects are ways to subvert that control and get performance and flexibility, rather than ways to do something technically difficult without clusters.
Aaron, that is exactly my point: RDBMS are not flexible enough for many cases. At the same time, loading, let’s say, 5TB a day, or joining two tables with one billion records each, is hardly possible without a cluster…
[…] my recent post on data messes, I left an IOU for a discussion of application databases. I’ve addressed parts of that […]
Doesn’t an object store (a NoSQL database with schema-agnostic indexing and continuous map-reduce) “heal” this?
https://github.com/nestorpersist/ostore
http://www.meetup.com/Seattle-Scalability-Meetup/photos/26152701/#437982894
There isn’t a silver bullet for solving data integration issues. Data is always subjectively messy. Thus, you need “context lenses” through which relevant data is viewed en route to final processing by data consuming applications.
The following principles provide an effective basis for addressing these challenges:
1. Use HTTP URIs as Entity Names (Identifiers) — these implicitly resolve to entity description documents
2. Use an abstract data representation language (e.g., RDF) to describe entities using subject, predicate, object based sentences — in a notation of your choice (i.e., it doesn’t have to be JSON, JSON-LD, Turtle, RDF/XML, etc.).
3. Ensure the nature of Entity Types and Relationship Types is also described using the same abstract data representation language — this relates to Classes, Subclasses, Transitivity, Symmetry, Equivalence, etc.
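As a rough illustration of those three principles, here is a minimal sketch using the rdflib Python library and hypothetical example.com URIs (both my choices for the sketch, not anything the principles themselves mandate):

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import OWL, RDF, RDFS

EX = Namespace("http://example.com/id/")
g = Graph()

# 1. An HTTP URI as the entity name (identifier).
acme = URIRef("http://example.com/id/acme-corp")

# 2. Describe the entity with subject, predicate, object sentences.
g.add((acme, RDF.type, EX.Customer))
g.add((acme, RDFS.label, Literal("Acme Corp.")))
g.add((acme, OWL.sameAs, URIRef("http://example.com/crm/ACME-0042")))

# 3. Describe the entity and relationship types themselves.
g.add((EX.Customer, RDF.type, RDFS.Class))
g.add((EX.Customer, RDFS.subClassOf, EX.Organization))

print(g.serialize(format="turtle"))
```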
The links that follow include live examples of this approach to data integration across disparate data sources:
Links:
[1] http://kidehen.blogspot.com/2015/07/conceptual-data-virtualization-across.html — deals with integrating data across disparate RDBMS systems
[2] http://kidehen.blogspot.com/2015/07/situation-analysis-never-day-goes-by.html — shows how controlled natural language can be used to harmonize disparate data
[3] http://kidehen.blogspot.com/2014/01/demonstrating-reasoning-via-sparql.html — demonstrates reasoning and inference based on entity relationship type semantics, using SPARQL.
[…] the biggest mess in all of IT is the management of individual consumers’ data. Our electronic data is […]