Derived data
The management of data that is derived, augmented, enhanced, adjusted, or cooked — as opposed to just the raw stuff.
Technology implications of political trends
The tech industry has a broad range of political concerns. While I may complain that things have been a bit predictable in other respects, politics is having real and new(ish) technical consequences. In some cases, existing technology is clearly adequate to meet regulators’ and customers’ demands. Other needs look more like open research challenges.
1. Privacy regulations will be very different in different countries or regions. For starters:
- This is one case in which the European Union’s bureaucracy is working pretty well. It’s making rules for the whole region, and they aren’t totally crazy ones.
- Things are more chaotic in the English-speaking democracies.
- Authoritarian regimes are enacting anti-privacy rules.
All of these rules are subject to change based on:
- Genuine technological change.
- Changes in politicians’ or the public’s perceptions.
And so I believe: For any multinational organization that handles customer data, privacy/security requirements are likely to change constantly. Technology decisions need to reflect that reality.
2. Data sovereignty/geo-compliance is a big deal. In fact, this is one area where the EU and authoritarian countries such as Russia formally agree. Each wants its citizens’ data to be stored locally, so as to ensure adherence to local privacy rules.
For raw, granular data, that's a straightforward, even if annoying, requirement to meet. But things get murkier for data that is aggregated or otherwise derived.
Data messes
A lot of what I hear and talk about boils down to “data is a mess”. Below is a very partial list of examples.
To a first approximation, one would expect operational data to be rather clean. After all, it drives and/or records business transactions. So if something goes awry, the result can be lost money, disappointed customers, or worse, and those are outcomes to be strenuously avoided. Up to a point, that’s indeed true, at least at businesses large enough to be properly automated. (Unlike, for example — 🙂 — mine.)
Even so, operational data has some canonical problems. First, it could be inaccurate; somebody can just misspell or otherwise botch an entry. Further, there are multiple ways data can be unreachable, typically because it’s:
- Inconsistent, in which case humans might not know how to look it up and database JOINs might fail (a small sketch of this appears after the list).
- Unintegrated, in which case one application might not be able to use data that another happily maintains. (This is the classic data silo problem.)
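As a tiny illustration of the inconsistency point, here is a sketch, with entirely made-up data, of a join that silently drops a customer because two systems spell its name differently, and of the kind of key normalization that recovers the match. (Scala is used purely for convenience.)

```scala
object InconsistencySketch {
  // Two "silos": a CRM table and a billing table, keyed by customer name.
  // Hypothetical data; the point is the inconsistent spelling of one key.
  val crm     = Map("Acme Corp"  -> "New York", "Initech" -> "Austin")
  val billing = Map("ACME CORP." -> 125000.0,   "Initech" -> 98000.0)

  // A naive join on the raw keys silently drops Acme.
  def naiveJoin: Map[String, (String, Double)] =
    crm.flatMap { case (name, city) =>
      billing.get(name).map(revenue => name -> (city, revenue))
    }

  // Normalizing keys first (lowercase, strip punctuation) recovers the match.
  def normalize(s: String): String =
    s.toLowerCase.replaceAll("""[^a-z0-9]""", "")

  def cleanJoin: Map[String, (String, Double)] = {
    val billingNorm = billing.map { case (k, v) => normalize(k) -> v }
    crm.flatMap { case (name, city) =>
      billingNorm.get(normalize(name)).map(revenue => name -> (city, revenue))
    }
  }

  def main(args: Array[String]): Unit = {
    println(naiveJoin.keys) // only Initech survives the naive join
    println(cleanJoin.keys) // Acme Corp and Initech both match
  }
}
```

Real-world entity resolution is of course much harder than lowercasing and stripping punctuation, which is a big part of why data cleaning is an industry.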
Inconsistency can take multiple forms, including: Read more
ClearStory, Spark, and Storm
ClearStory Data is:
- One of the two start-ups I’m most closely engaged with.
- Run by a CEO for whom I have great regard, but who does get rather annoying about secrecy. 🙂
- On the verge, finally, of fully destealthing.
I think I can do an interesting post about ClearStory while tap-dancing around the still-secret stuff, so let’s dive in.
ClearStory:
- Has developed a full-stack business intelligence technology — which will however be given a snazzier name than “BI” — that is focused on incorporating a broad variety of third-party information, usually along with some of the customer’s own data. Thus, ClearStory …
- … pushes Variety and Variability to extremes, more so than it stresses Volume and Velocity. But it does want to be used at interactive/memory-centric speeds.
- Has put a lot of effort into user interface, but in ways that fit my theory that UI is more about navigation than actual display.
- Has much of its technical differentiation in the area of data mustering …
- … and much of the rest in DBMS-like engineering.
- Is a flagship user of Spark.
- Also relies on Storm, HDFS (Hadoop Distributed File System) and various lesser open source projects (e.g. the ubiquitous ZooKeeper).
- Is to a large extent written in Scala.
- Is at this time strictly a multi-tenant SaaS (Software as a Service) offering, except insofar as there’s an on-premises agent to help feed customers’ own data into the core ClearStory cloud service.
To a first approximation, ClearStory ingests data in a system built on Storm (code name: Stormy), dumps it into HDFS, and then operates on it in a system built on Spark (code name: Sparky). Along the way there’s a lot of interaction with another big part of the system, a metadata catalog with no code name I know of. Or as I keep it straight:
- ClearStory’s end-user UI talks mainly to Sparky, and also to the metadata store.
- ClearStory’s administrative UI talks mainly to Stormy, and also to the metadata store.
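To make that division of labor a little more concrete, here is a minimal sketch of what the "Sparky" half of such a pipeline might look like: a Spark job, written in Scala, that reads files an ingest tier has already landed in HDFS and aggregates them for interactive exploration. The paths, field layout, and names are my own illustrative assumptions (and presume a reasonably recent Spark distribution), not ClearStory's actual code.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A minimal "Sparky"-style job: read files that an ingest tier has landed
// in HDFS, then aggregate them for interactive exploration.
object SparkySketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("sparky-sketch")
    val sc   = new SparkContext(conf)

    // CSV-ish event lines previously written to HDFS by the ingest tier
    val events = sc.textFile("hdfs:///ingested/events/*.csv")

    // Count events per (assumed) first field, e.g. a customer or source id
    val counts = events
      .map(_.split(","))
      .filter(_.length >= 2)
      .map(fields => (fields(0), 1L))
      .reduceByKey(_ + _)

    counts.take(20).foreach(println)
    sc.stop()
  }
}
```

The "Stormy" half would presumably be a Storm topology whose final stage writes cleaned, parsed events into the HDFS directory the Spark job reads, with the metadata catalog recording what landed where.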
Layering of database technology & DBMS with multiple DMLs
Two subjects in one post, because they were too hard to separate from each other
Any sufficiently complex software is developed in modules and subsystems. DBMS are no exception; the core trinity of parser, optimizer/planner, and execution engine merely starts the discussion. But increasingly, database technology is layered in a more fundamental way as well, to the extent that different parts of what would seem to be an integrated DBMS can sometimes be developed by separate vendors.
Major examples of this trend — where by “major” I mean “spanning a lot of different vendors or projects” — include:
- The object/relational, aka universal, extensibility features developed in the 1990s for Oracle, DB2, Informix, Illustra, and Postgres. The most successful extensions probably have been:
- Geospatial indexing via ESRI.
- Full-text indexing, notwithstanding questionable features and performance.
- MySQL storage engines.
- MPP (Massively Parallel Processing) analytic RDBMS relying on single-node PostgreSQL, Ingres, and/or Microsoft SQL Server — e.g. Greenplum (especially early on), Aster (ditto), DATAllegro, DATAllegro’s offspring Microsoft PDW (Parallel Data Warehouse), or Hadapt.
- Splits in which a DBMS has serious processing both in a “database” layer and in a predicate-pushdown “storage” layer: most famously Oracle Exadata, but also MarkLogic, InfiniDB, and others. (A toy sketch of this kind of split appears below.)
- SQL-on-HDFS — Hive, Impala, Stinger, Shark and so on (including Hadapt).
Other examples on my mind include:
- Data manipulation APIs being added to key-value stores such as Couchbase and Aerospike.
- TokuMX, the Tokutek/MongoDB hybrid I just blogged about.
- NuoDB’s willing reliance on third-party key-value stores (or HDFS in the role of one).
- FoundationDB’s strategy, and specifically its acquisition of Akiban.
And there are several others I hope to blog about soon, e.g. current-day PostgreSQL.
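To illustrate the layering idea in the abstract (this mimics no particular vendor's interfaces; every name below is invented), here is a toy Scala sketch of the split called out above: a query layer that delegates scans, with predicates pushed down, to whatever storage engine happens to be plugged in.

```scala
// Toy illustration of layered database technology: a query layer over
// interchangeable storage engines, with predicate pushdown across the boundary.

final case class Row(fields: Map[String, Any])

trait StorageEngine {
  // The storage layer evaluates the pushed-down predicate itself,
  // so only qualifying rows cross the layer boundary.
  def scan(table: String, pushedPredicate: Row => Boolean): Iterator[Row]
}

final class InMemoryEngine(data: Map[String, Seq[Row]]) extends StorageEngine {
  def scan(table: String, pushedPredicate: Row => Boolean): Iterator[Row] =
    data.getOrElse(table, Seq.empty).iterator.filter(pushedPredicate)
}

// The "database" layer: parsing, planning, joins, and aggregation would live
// here; it neither knows nor cares how the storage layer keeps its bytes.
final class QueryLayer(engine: StorageEngine) {
  def selectWhere(table: String, column: String, value: Any): Seq[Row] =
    engine.scan(table, row => row.fields.get(column).contains(value)).toSeq
}

object LayeringSketch extends App {
  val engine = new InMemoryEngine(Map(
    "orders" -> Seq(
      Row(Map("id" -> 1, "region" -> "EU")),
      Row(Map("id" -> 2, "region" -> "US")))))
  val db = new QueryLayer(engine)
  println(db.selectWhere("orders", "region", "EU")) // only the EU row returns
}
```

The interesting engineering is everything this toy omits: transactions, the statistics an optimizer needs to decide what to push down, and how the two layers actually talk to each other.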
In an overlapping trend, DBMS increasingly have multiple data manipulation APIs. Examples include: Read more
Data model churn
Perhaps we should remind ourselves of the many ways data models can be caused to churn. Here are some examples that are top-of-mind for me. They do overlap a lot, and the whole discussion overlaps with my post about schema complexity last January, and more generally with what I've written about dynamic schemas for the past several years.
Just to confuse things further — some of these examples show the importance of RDBMS, while others highlight the relational model’s limitations.
The old standbys
Product and service changes. Simple changes to your product line may not require any changes to the databases recording their production and sale. More complex product changes, however, probably will.
A big help in MCI's rise in the early 1990s was its new Friends and Family service offering. AT&T couldn't respond quickly, because it couldn't get the programming done, where by "programming" I mainly mean database integration and design. If all that was before your time, this link seems like a fairly contemporaneous case study.
Organizational changes. A common source of hassle, especially around databases that support business intelligence or planning/budgeting, is organizational change. Kalido's whole business was based on accommodating that, last I checked, as were a lot of BI consultants'.
It’s hard to make data easy to analyze
It’s hard to make data easy to analyze. While everybody seems to realize this — a few marketeers perhaps aside — some remarks might be useful even so.
Many different technologies purport to make data easy, or easier, to analyze; so many, in fact, that cataloguing them all is forbiddingly hard. Major claims, and some technologies that make them, include:
- “We get data into a form in which it can be analyzed.” This is the story behind, among others:
- Most of the data integration and ETL (Extract/Transform/Load) industries, software vendors and consulting firms alike.
- Many things that purport to be “analytic applications” or data warehouse “quick starts”.
- “Data reduction” use cases in event processing.*
- Text analytics tools.
- Splunk.
- “Forget all that transformation foofarah — just load (or write) data into our thing and start analyzing it immediately.” This at various times has been much of the story behind:
- Relational DBMS, according to their inventor E. F. Codd.
- MOLAP (Multidimensional OnLine Analytic Processing), also according to RDBMS inventor E. F. Codd.
- Any kind of analytic DBMS, or general purpose DBMS used for data warehousing.
- Newer kinds of analytic DBMS that are faster than older kinds.
- The “data mart spin-out” feature of certain analytic DBMS.
- In-memory analytic data stores.
- Hadoop.
- NoSQL DBMS that have a few analytic features.
- TokuDB, similarly.
- Electronic spreadsheets, from VisiCalc to Datameer.
- Splunk.
- “Our tools help you with specific kinds of analyses or analytic displays.” This is the story underlying, among others:
- The business intelligence industry.
- The predictive analytics industry.
- Algorithmic trading use cases in complex event processing.*
- Some analytic applications.
- Splunk.
*Complex event/stream processing terminology is always problematic.
My thoughts on all this start: Read more
Three quick notes about derived data
I had one of “those” trips last week:
- 20 meetings, a number of them very multi-hour.
- A broken laptop.
- Flights that arrived 10:30ish Sunday night and left 7:00 Saturday morning.
So please pardon me if things are a bit disjointed …
I’ve argued for a while that:
- All human-generated data should be retained.
- The more important kinds of machine-generated data should be retained as well.
- Raw data isn’t enough; it’s really important to store derived data as well.
Here are a few notes on the derived data trend.
WibiData, derived data, and analytic schema flexibility
My clients at Odiago, vendors of WibiData, have changed their company name simply to WibiData. Even better, they blogged with more detail as to how WibiData works, in what is essentially a follow-on to my original WibiData post last October. Among other virtues, WibiData turns out to be a poster child for my views on derived data and the corresponding schema evolution.
Interesting quotes include:
WibiData is designed to store … transactional data side-by-side with profile and other derived data attributes.
… the ability to add new ad-hoc columns to a table enables more flexible analysis: output data that is the result of one analytic pipeline is stored adjacent to its input data, meaning that you can easily use this as input to second- or third-order derived data as well.
schemas can vary over time; you can easily add a field to a record, or delete a field. … But even though you start collecting that new data, your existing analysis pipelines can treat records like they always did; programs that don’t yet know about the new cookie are still compatible with both the old records already collected, and the new records with the additional field. New programs fill in default values for old data recorded before a field was added, applying the new schema at read time.
schemas for every column are stored in a data dictionary that matches column names with their schemas, as well as human-readable descriptions of the data.
Interesting aspects of the post that don’t lend themselves as well to being excerpted include:
- How the Produce-Gather “analysis calculus” — i.e. framework — works.
- How this all ties into Apache projects (and sub-projects) such as Hadoop, HBase, and Avro.
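The read-time defaulting described in the third quote is essentially Avro schema resolution, which is presumably what WibiData builds on, given the Avro tie-in just mentioned. Here is a minimal, self-contained Scala sketch of that mechanism, using the plain Avro library rather than WibiData's own APIs; the record layout and field names are invented for illustration.

```scala
import java.io.ByteArrayOutputStream

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

object AvroEvolutionSketch {
  // "Old" schema: a user record before the new derived field existed
  private val oldSchema = new Schema.Parser().parse(
    """{"type": "record", "name": "User", "fields": [
      |  {"name": "id", "type": "string"}
      |]}""".stripMargin)

  // "New" schema adds a derived attribute with a default, so readers can
  // apply it to records written before the field was added
  private val newSchema = new Schema.Parser().parse(
    """{"type": "record", "name": "User", "fields": [
      |  {"name": "id", "type": "string"},
      |  {"name": "score", "type": "double", "default": 0.0}
      |]}""".stripMargin)

  def main(args: Array[String]): Unit = {
    // Write a record under the old schema
    val rec = new GenericData.Record(oldSchema)
    rec.put("id", "user-42")
    val out = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(out, null)
    new GenericDatumWriter[GenericRecord](oldSchema).write(rec, encoder)
    encoder.flush()

    // Read it back with the new schema; Avro fills in the default for "score"
    val decoder = DecoderFactory.get().binaryDecoder(out.toByteArray, null)
    val reader = new GenericDatumReader[GenericRecord](oldSchema, newSchema)
    val evolved = reader.read(null, decoder)
    println(evolved) // {"id": "user-42", "score": 0.0}
  }
}
```

A record written under the old schema comes back with the new field filled in at read time, which is exactly the old-programs-keep-working behavior the post describes.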
Derived data, progressive enhancement, and schema evolution
The emphasis I’m putting on derived data is leading to a variety of questions, especially about how to tease apart several related concepts:
- Derived data.
- Many-step processes to produce derived data.
- Schema evolution.
- Temporary data constructs.
So let's dive in.
HBase is not broken
It turns out that my impression that HBase is broken was unfounded, for at least two reasons. The smaller reason is that something wrong with the HBase/Hadoop interface or with Hadoop's HBase support cannot necessarily be said to be wrong with HBase itself (especially since HBase is no longer a Hadoop subproject). The bigger reason is that, according to consensus, HBase has worked pretty well since the .90 release in January of this year.
After Michael Stack of StumbleUpon beat me up for a while,* Omer Trajman of Cloudera was kind enough to walk me through HBase usage. He is informed largely by 18 Cloudera customers, plus a handful of other well-known HBase users such as Facebook, StumbleUpon, and Yahoo. Of the 18 Cloudera customers using HBase that Omer was thinking of, 15 are in HBase production, one is in HBase "early production", one is still doing R&D in the area of HBase, and one is a classified government customer not providing such details.