Chris Bird’s blog is brilliant, and update-in-place is increasingly passé
I wouldn’t say every post in Chris Bird’s occasionally-updated blog is brilliant. I wouldn’t even say every post is readable. But I’d still recommend his blog to just about anybody who reads here as, at a minimum, a consciousness-raiser.
One of the two posts inspiring me to mention this is a high-level one on “technical debt”, reminding us why things don’t always get done right the first time, and further reminding us that circling back to fix them sooner rather than later is usually wise. The other connects two observations that individually have great merit (at least if you don’t take them to extremes):
- Update-in-place is passé
- So is elaborate up-front database design
Specific points of interest here include:
- Most data never gets changed after being written. Update-in-place doesn’t save all that much in storage hardware.
- Update-in-place interferes with a lot of modern optimizations in analytic DBMS design.
- Knowing what values data had in the past is interesting in and of itself.
- So, potentially, is knowing what “dirty” data end-users — especially customers and prospects — decided to enter.
- The “right” amount of data validation is application-dependent. For example, if data validation involves torturing your customers, maybe it’s not such a good idea. (Great observation by Chris.)
- If you have the old data as well as the new, the harm of having “bad” updates is lessened. (Central connecting observation by Chris; see the sketch after this list.)
- People enter data inconsistently. MDM (Master Data Management) and data cleansing tools fix much (admittedly not all) of the harm. Computers are cheaper than people. You do the math.
- Data is increasingly being managed in non-relational and/or non-persistent ways. Get used to it.
- As the NoSQL guys point out, some of today’s most demanding applications have extremely simple schemas.
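To make the append-only idea concrete, here is a minimal sketch (all names invented for illustration; nothing here is taken from Chris’s post) of a store in which every write appends a new version, so old values, including the “dirty” ones a user once entered, remain queryable:

```python
import time
from collections import defaultdict

class AppendOnlyStore:
    """Toy versioned store: writes append; nothing is updated in place."""

    def __init__(self):
        self._versions = defaultdict(list)  # key -> [(timestamp, value), ...]

    def put(self, key, value):
        # Append a new version; the old value stays queryable forever.
        self._versions[key].append((time.time(), value))

    def get(self, key, as_of=None):
        """Latest value, or the value as of a past timestamp (time travel)."""
        history = self._versions.get(key, [])
        if as_of is None:
            return history[-1][1] if history else None
        for ts, value in reversed(history):
            if ts <= as_of:
                return value
        return None

    def history(self, key):
        # The full audit trail, dirty entries and all.
        return list(self._versions.get(key, []))
```

A real analytic DBMS would of course implement this with immutable blocks, compression, and background compaction rather than Python lists, but the user-visible property is the same: history is retained rather than overwritten.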
Comments
Part of me says, “oh, this is just MVCC.”
But that usually is treated as being an invisible layering, just like the updates in a Log Structured Filesystem. ()
It’s clear that in the later parts of your comments, you’re describing the notion that the sequence of update history is actually intended to be visible.
It seems worthy of some thought, for sure.
Odd… Bits of my comment seem to have gotten lost, notably URLs for MVCC & LSF.
Perhaps your MDM cleansing is a bit overexuberant? 🙂
Yes, I think time-travel is a useful feature. And I suspect Chris feels more emphatically about that than I do — but then he has an outstanding track record of catching on early to technical trends.
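To illustrate the distinction (reusing the hypothetical AppendOnlyStore sketched earlier in the post): an invisible MVCC-style layering exposes only the first read below, while visible time-travel also exposes the other two:

```python
import time

store = AppendOnlyStore()
store.put("price", 100)
t_before = time.time()
time.sleep(0.01)  # just to guarantee the next write gets a later timestamp
store.put("price", 120)

print(store.get("price"))                  # 120: all an invisible layer shows
print(store.get("price", as_of=t_before))  # 100: explicit time travel
print(store.history("price"))              # the full, visible version sequence
```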
A case for update-in-place: http://blog.mongodb.org/post/248614779/fast-updates-with-mongodb-update-in-place and http://blog.mongodb.org/post/171353301/using-mongodb-for-real-time-analytics
RC,
I’m not sure in what cases I’d endorse the application design being emphasized there. You keep some of the incoming data and throw away the rest. What you keep, you send to disk in the form of counters that are constantly changed. The point of the exercise is that you want access in real time.
Huh? Why not keep the small amount you want in real time in memory, and send a complete record of everything to disk however fast you can get it there?
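A minimal sketch of that split (names invented; this is my illustration, not anything from the MongoDB posts): the small real-time aggregate lives in memory, while a complete record of every event is appended to disk untouched:

```python
import json
from collections import Counter

class RealTimeIngest:
    """Sketch: hot counters in memory; full events appended to disk."""

    def __init__(self, log_path):
        self._counts = Counter()         # the small amount wanted in real time
        self._log = open(log_path, "a")  # complete, append-only record

    def ingest(self, event):
        # Serve real-time queries from memory...
        self._counts[event["page"]] += 1
        # ...and keep everything on disk, with no update-in-place.
        self._log.write(json.dumps(event) + "\n")

    def count(self, page):
        return self._counts[page]
```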
Of course, this isn’t quite as black and white as I made it out to be in the original blog post. There is always a careful balance between what you are throwing away and what performance you can afford (or think you need).
The idea of the original post is to apply healthy skepticism every time you are tempted to use update logic. There are good cases for doing updates in place, but I don’t think it should be the default. I will shortly be writing some responses to observations made against the original post.
Some excellent points. I am more practitioner than theorist and tend to take a very pragmatic approach; as a result, in our ever-changing world I have moved away from these two practices as a matter of necessity. However, can you elaborate on the point that data is increasingly being managed with non-relational and non-persistent methods? As I have come to understand it, data is by its very existence relational: all data has relationships inherent within it, and it is only useful in relation to other data and concepts that it either supports, is neutral to (no relationship or a passive relationship), or disproves. Maybe I misunderstood and you simply meant traditional normalized relational models? Thanks for the great posts (I have subscribed to the RSS feeds for both blogs).