What needs to be updated anyway?
Shayne Nelson is posting some pretty wild ideas on data architecture and redundancy. In the process, he’s reopening an old discussion topic:
Why would data ever need to be erased?
and the natural follow-on
If it doesn’t need to be erased, what exactly do we have to update?
Here are some quick cuts at answering the second question:
- “Primary” data usually doesn’t really need to be updated, exactly. But it does need to be stored in such a way that it can immediately be found again and correctly identified as the most recent information.
- Analytic data usually doesn’t need to be updated with full transactional integrity; slight, temporary errors do little harm.
- “Derived” data such as bank balances (derived from deposits and withdrawals) and inventory levels (derived from purchases and sales) commonly needs to be updated with full industrial-strength protections.
- Certain kinds of primary transactions, such as travel reservations, need the same treatment as “derived” data. When the item sold is unique, the primary/derived distinction largely goes away.
- Notwithstanding the foregoing, it must be possible to update anything for error-correction purposes (something Nelson seems to have glossed over to date). The sketch after this list shows all three behaviors side by side.
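To make those distinctions concrete, here is a minimal Python sketch of a hypothetical in-memory ledger (the class and its methods are my own illustration, not anything Nelson has proposed): primary entries are appended and timestamped rather than rewritten, the derived balance is updated under a lock that stands in for a real transaction, and an error is fixed with a compensating entry rather than an erasure.

```python
import threading
from datetime import datetime, timezone

class Ledger:
    """Hypothetical in-memory sketch: primary entries are appended and never
    erased; the derived balance is updated in place under a lock (standing in
    for a real transaction); corrections are compensating entries."""

    def __init__(self):
        self.entries = []              # primary data: append-only, timestamped
        self.balance = 0.0             # derived data: updated in place
        self._lock = threading.Lock()

    def post(self, amount, note=""):
        # Store the primary record once, stamped so the most recent information
        # can always be identified, and update the derived balance alongside it.
        with self._lock:
            self.entries.append((datetime.now(timezone.utc), amount, note))
            self.balance += amount

    def correct(self, amount, note="correction"):
        # Error correction: post a compensating entry rather than erasing history.
        self.post(-amount, note)

ledger = Ledger()
ledger.post(100.0, "deposit")
ledger.post(-30.0, "withdrawal")
ledger.post(-30.0, "withdrawal keyed twice in error")
ledger.correct(-30.0, "reverse the duplicate withdrawal")
print(ledger.balance)          # 70.0, and every original entry is still on file
print(len(ledger.entries))     # 4
```

The design choice worth noticing is the last one: fixing the mistake changes the derived balance, but the primary history stays intact.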
Respondents to Nelson’s blog generally argue that it’s better to store data once and have redundant subcopies of it in the form of indexes. I haven’t yet seen any holes in those arguments. Still, it’s a discussion worth looking at and noodling over.
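For flavor, here is a toy Python sketch of that store-once-plus-indexes idea (the names and structures are illustrative, not drawn from any particular product): the list is the single primary copy, and the dictionary is a redundant subcopy that can be dropped and rebuilt from it at any time.

```python
# Toy sketch of "store once, index redundantly" (names are illustrative).
records = []        # the single primary copy: appended to, never rewritten
by_customer = {}    # a redundant subcopy (index): customer id -> positions

def store(customer_id, payload):
    records.append((customer_id, payload))
    by_customer.setdefault(customer_id, []).append(len(records) - 1)

def rebuild_index():
    # The index carries no information of its own, so losing or corrupting it
    # costs only the time needed to rebuild it from the primary copy.
    by_customer.clear()
    for pos, (customer_id, _) in enumerate(records):
        by_customer.setdefault(customer_id, []).append(pos)

store("c1", {"order": 17})
store("c2", {"order": 18})
store("c1", {"order": 19})
print([records[i] for i in by_customer["c1"]])   # both of c1's records
```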
Comments
Hi Curt – I found this interesting:

“Analytic data usually doesn’t need to be updated with full transactional integrity; slight, temporary errors do little harm.”
Perhaps true (difficult to say with precision), but the terms “usually,” “slight,” and “little” give me pause. It would seem that constraints on derived analytical data would follow from the needs that drove the data to be ETLed; wouldn’t you want to know whether your ETL process actually produced the structures you thought it would produce?
Eric,
If we leave out what is called “operational BI,” and also leave out planning, it is overwhelmingly the case that analytic data is used to identify ratios, trends, statistical correlations, and the like. Small errors simply aren’t important to those uses, because they don’t change the outcome in any detectable way.
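To put toy numbers on that (purely illustrative, not from any real dataset):

```python
# A small transient error in analytic data barely moves the ratios it feeds.
daily_sales = [1000.0, 1040.0, 1100.0, 1150.0, 1210.0, 1260.0]
with_error = list(daily_sales)
with_error[3] += 5.0                  # one record off by 5, roughly 0.4%

def second_half_ratio(series):
    # Ratio of second-half to first-half sales -- a typical trend metric.
    mid = len(series) // 2
    return sum(series[mid:]) / sum(series[:mid])

print(round(second_half_ratio(daily_sales), 3))  # 1.153
print(round(second_half_ratio(with_error), 3))   # 1.154 -- no decision changes
```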
In most non-planning analytics, operational or otherwise, a lot of historical data is being examined. A small amount of latency is rarely an issue.
Planning applications are almost by definition non-urgent. A small amount of latency is not a big problem.
There are a few application areas we call “analytic” that truly require the same near-instantaneous data integrity that an order processing system does. Fraud prevention, in a variety of industries, comes to mind. (Interestingly, that really is an aspect of order processing.)
But most analytics are done today with a latency of at least a day, and often a week or more. I rarely find analytic applications that truly need sub-day latency. And it’s very rare to find ones where 15-minute latency (the figure most commonly cited to me for “real-time” analytics) would be a problem.
Hell, most individuals who dabble in the stock market do so off of price data with 20-minute delays …