February 1, 2010
Open issues in database and analytic technology
The last part of my New England Database Summit talk was on open issues in database and analytic technology. This was closely intertwined with the previous section, and also relied on a lot that I’ve posted here. So I’ll just put up a few notes on that part, with lots of linkage to prior discussion of the same points.
- The most important issue in database and analytic technology, in my opinion, isn’t technological at all – rather, it’s the legal and political steps needed to preserve liberty in the face of advancing, intrusive technology.
- Another important issue for society – and this one does involve a lot of technology – is scientific number crunching. In particular, database technology for scientific computing needs to be developed much further. I’ll have more to say on all this soon.
- More generally, technology needs to keep advancing for parallel analytics. Fortunately, it is. Watch this space over the next few weeks.
- Oracle has said, in effect, that its most important technological challenge of the decade is getting solid-state memory right. I agree.
- Data volumes will keep going up, up, up. Technology needs to keep evolving accordingly. Much of what I write is on that subject.
- Data needs to be processed and analyzed at very different latencies. And there’s much further to go in integrating disparate latencies.
- Analytic database management in the cloud hasn’t been solved yet, especially for Big Data. Among the reasons are the difficulty of moving data into the cloud (unless it originated there), the slowness of moving it from node to node in shared-nothing architectures (which reduces the elasticity benefit), and above all the long and unpredictable latencies of interprocessor communication while queries are running (a key subject of discussion at the Boston Big Data Summit).
- Better business intelligence user interfaces are increasingly available. I’m thinking particularly of approaches with buzzwords like visualization/interactive exploration or faceted. But they aren’t well-integrated into the overall analytic stack, as big BI vendors are trailing the smaller ones in this regard. (Part of the problem relates to my previous point.)
- Application development over text search isn’t in the same league as application development over relational DBMS. The choices are mainly XML (e.g., MarkLogic), SQL for text integrated into RDBMS (limited by the weakness of those integrations), and something like Attivio’s Java SDK. There’s a major conceptual barrier in building those apps, namely the unpredictability of query results. Still, it should be possible to do better.
- Similarly, text analytics and conventional analytics coexist well side by side. They can even be in the same database and/or dashboard, although in practice that is limited by the strong SaaS focus of text mining vendors and users. But analytic integration of them is really hard. Linguistic imprecision is, in my opinion, only the #2 reason for this difficulty. The #1 reason is that trends detected by text analytics are much less precise than trends on tabular data – e.g., a 50% increase in a certain kind of complaint may be no more significant than a 5% change in a revenue variable.
- I’m increasingly persuaded that graph analytics can be handled without a graph-centric data model. But right now, it isn’t being handled well at all. Lots more needs to be done – although when it is, it will just exacerbate the privacy/liberty dangers that so concern me.
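To make the point above about text-analytics trends concrete, here is a rough sketch of how a 50% jump in a small complaint count can carry less statistical weight than a 5% move in a revenue figure. Every number below is hypothetical and chosen only for illustration. The Poisson assumption for complaint counts and the normal approximation for revenue are also my assumptions, not something any particular product does.

```python
import math


def poisson_change_z(before: int, after: int) -> float:
    """Approximate z-score for the difference between two Poisson counts.

    Assumes each count is an independent Poisson draw, so the variance of
    the difference is roughly before + after.
    """
    return (after - before) / math.sqrt(before + after)


def mean_change_z(before_mean: float, after_mean: float,
                  std_dev: float, n: int) -> float:
    """z-score for the difference of two sample means.

    Assumes n observations per period and a common standard deviation.
    """
    standard_error = std_dev * math.sqrt(2.0 / n)
    return (after_mean - before_mean) / standard_error


# Hypothetical: complaints of one kind rise from 20 to 30 in a month (+50%).
complaint_z = poisson_change_z(before=20, after=30)

# Hypothetical: average daily revenue rises from 100 to 105 (+5%),
# measured over 30 days per period with a daily standard deviation of 10.
revenue_z = mean_change_z(100.0, 105.0, std_dev=10.0, n=30)

print(f"complaints +50%: z = {complaint_z:.2f}")
print(f"revenue     +5%: z = {revenue_z:.2f}")
```

Under these made-up inputs the complaint change scores about 1.41 and the revenue change about 1.94, so the much larger percentage swing is the weaker signal. Different volumes or variances would of course change the comparison.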
Other posts based on my January, 2010 New England Database Summit keynote address
- Data-based snooping — a huge threat to liberty that we’re all helping make worse
- Flash, other solid-state memory, and disk
- Interesting trends in database and analytic technology
Categories: Analytic technologies, Business intelligence, Cloud computing, Data warehousing, Presentations, RDF and graphs, Software as a Service (SaaS), Solid-state memory, Theory and architecture
Comments
4 Responses to “Open issues in database and analytic technology”
Do you think it is possible to make the RDBMS more suitable for relationship analytics by adding a new kind of index type?
I agree that I/O is the big bottleneck for most analytics, whether that’s in the cloud or in a big Oracle database with solid-state disks. Getting at terabytes or petabytes of data isn’t easy, and as core counts go up, processing that data gets harder if you can’t feed the cores fast enough.
RC,
Please see some of the other posts in the category here for graph datatypes.
An index for graphs would probably be a big materialized view that concatenates 2 or 3 or whatever edges at once. Better than nothing, but exponential in the number of hops.
Whether real-life graphs let you store them in clever ways to get significantly better performance than that is an open question. Cogito used to claim it could, but didn’t do too much in the way of delivering.
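As a rough illustration of the materialized-view idea in this reply, the sketch below precomputes every path of a fixed number of hops by repeatedly joining an edge list to itself, then counts the rows. The function name `materialize_paths` and the toy graph are hypothetical. A real RDBMS would express this as a self-join on an edge table, not in Python.

```python
from collections import defaultdict


def materialize_paths(edges, hops):
    """Return every path of exactly `hops` edges as a tuple of nodes.

    Mimics a materialized view that concatenates `hops` edges at once.
    Paths may revisit nodes, as a plain self-join on an edge table would allow.
    """
    adjacency = defaultdict(list)
    for src, dst in edges:
        adjacency[src].append(dst)

    paths = [(src, dst) for src, dst in edges]
    for _ in range(hops - 1):
        # Extend each stored path by every edge leaving its last node.
        paths = [path + (nxt,)
                 for path in paths
                 for nxt in adjacency.get(path[-1], [])]
    return paths


# Toy graph: 4 nodes, each with an edge to every other node (out-degree 3).
edges = [(a, b) for a in range(4) for b in range(4) if a != b]

row_counts = {h: len(materialize_paths(edges, h)) for h in (1, 2, 3)}
print(row_counts)  # {1: 12, 2: 36, 3: 108}
```

With a uniform out-degree of 3, each additional hop triples the number of stored rows, which is the exponential growth mentioned above. Real graphs have uneven degree distributions, so the actual growth rate would vary.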
[…] is a potentially powerful combination, if they can effectively address the point I just made about integrating disparate latencies. That said, I’m not expecting a lot, because the CEP industry always disappoints […]