Where the innovation is
I hoped to write a reasonable overview of current- to medium-term future IT innovation. Yeah, right. 🙂 But if we abandon any hope that this post could be comprehensive, I can at least say:
1. Back in 2011, I ranted against the term Big Data, but expressed more fondness for the V words — Volume, Velocity, Variety and Variability. That said, when it comes to data management and movement, solutions to the V problems have generally been sketched out.
- Volume has been solved. There are Hadoop installations with 100s of petabytes of data, analytic RDBMS with 10s of petabytes, general-purpose Exadata sites with petabytes, and 10s/100s of petabytes of analytic Accumulo at the NSA. Further examples abound.
- Velocity is being solved. My recent post on Hadoop-based streaming suggests how. In other use cases, velocity is addressed via memory-centric RDBMS.
- Variety and Variability have been solved. MongoDB, Cassandra and perhaps others are strong NoSQL choices. Schema-on-need is in earlier days, but may help too.
2. Even so, there’s much room for innovation around data movement and management. I’d start with:
- Product maturity is a huge issue for all the above, and will remain one for years.
- Hadoop and Spark show that application execution engines:
- Have a lot of innovation ahead of them.
- Are tightly entwined with data management, and with data movement as well.
- Hadoop is due for another refactoring, focused on both in-memory and persistent storage.
- There are many issues in storage that can affect data technologies as well, including but not limited to:
- Solid-state (flash or post-flash) vs. spinning disk.
- Networked vs. direct-attached.
- Virtualized vs. identifiable-physical.
- Object/file/block.
- Graph analytics and data management are still confused.
3. As I suggested last year, data transformation is an important area for innovation.
- MapReduce was invented for data transformation, which is still a large part of what goes on in Hadoop. (A minimal sketch of such a transformation follows this list.)
- The smart data preparation crowd is deservedly getting attention.
- The more different data models — NoSQL and so on — that are used, the greater are the demands on data transformation.
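To make this concrete, here is a minimal, hypothetical sketch of the kind of work MapReduce-style engines spend much of their time on: parsing semi-structured log lines into records (the map step) and aggregating them (the reduce step). The log format and field names are invented for illustration; a real Hadoop or Spark job would express the same two steps against distributed storage rather than an in-memory list.

```python
# Minimal sketch of a map/reduce-style data transformation (hypothetical log format).
from collections import defaultdict

raw_lines = [
    "2015-02-01T12:00:03 user=alice action=login",
    "2015-02-01T12:00:09 user=bob   action=search",
    "2015-02-01T12:01:44 user=alice action=search",
]

def map_parse(line):
    """Map step: turn one raw line into a (key, value) pair."""
    _timestamp, *fields = line.split()
    record = dict(f.split("=", 1) for f in fields)
    return record["user"], 1              # count one event per user

def reduce_count(pairs):
    """Reduce step: sum the values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

events_per_user = reduce_count(map_parse(line) for line in raw_lines)
print(events_per_user)                    # {'alice': 2, 'bob': 1}
```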
4. There’s a lot going on in investigative analytics. Besides the “platform” technologies already mentioned, in areas such as fast-query, data preparation, and general execution engines, there’s also great innovation higher in the stack. Most recently I’ve written about multiple examples in predictive modeling, such as:
- Mathematically (more) complex models that are at once more accurate and more easily arrived at than (nearly) linear ones. (See the sketch after this list.)
- Similarly, more complex clustering.
- Predictive experimentation.
- The use of business intelligence and predictive modeling to inform each other.
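As one hedged illustration of the first bullet above, here is a small scikit-learn comparison on synthetic nonlinear data, in which a more mathematically complex model beats an ordinary linear one with no tuning effort at all. The dataset and model choices are mine, purely for illustration; they are not drawn from any particular vendor's product.

```python
# Sketch: a "more complex" model beating a linear one on nonlinear data,
# with essentially no extra modeling effort. Illustrative only.
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_friedman1(n_samples=2000, noise=1.0, random_state=0)

models = [
    ("linear regression", LinearRegression()),
    ("gradient boosting", GradientBoostingRegressor(random_state=0)),  # default settings, no tuning
]
for name, model in models:
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name:>18}: mean cross-validated R^2 = {r2:.2f}")
```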
Beyond that:
- Event-series analytics is another exciting area. (At least on the BI side, I frankly expected it to sweep through the relevant vertical markets more quickly than it has.)
- I’ve long been disappointed in the progress in text analytics. But sentiment analysis is doing fairly well, many more languages are analyzed than before, and I occasionally hear rumblings of text analytic sophistication inching back towards that already available in the previous decade.
- While I don’t write about it much, modern BI navigation is an impressive and wonderful thing.
5. Back in 2013, in what was perhaps my previous most comprehensive post on innovation, I drew a link between innovation and refactoring, where what was being refactored was “everything”. Even so, I’ve been ignoring a biggie. Security is a mess, and I don’t see how it can ever be solved unless systems are much more modular from the ground up. By that I mean:
- “Fencing” processes and resources away from each other improves system quality, in that it defends against both deliberate attacks and inadvertent error. (A toy sketch of what I mean by fencing appears at the end of this point.)
- Fencing is costly, both in terms of context-switching and general non-optimization. Nonetheless, I suspect that …
- … the cost of such process isolation may need to be borne.
- Object-oriented programming and its associated contracts are good things in this context. But it’s obvious they’re not getting the job done on their own.
More specifically:
- It is cheap to give single-purpose intelligent devices more computing power than they know what to do with. There is really no excuse for allowing them to be insecure.
- It is rare for a modern PC to go much above 25% CPU usage, simply because most PC programs are still single-core. This illustrates that — assuming some offsetting improvements in multi-core parallelism — desktop software could take a security performance hit without much pain to users’ wallets.
- On servers, we may in many cases be talking about lightweight virtual machines.
And to be clear:
- What I’m talking about would do little to help the authentication/authorization aspects of security, but …
- … those will never be perfect in any case (because they depend upon fallible humans) …
- … which is exactly why other forms of security will always be needed.
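To make the fencing idea in this point slightly more concrete, here is a minimal, hypothetical sketch of one very lightweight form of it: running a piece of work in a separate OS process with capped CPU time and memory, so that a runaway or hostile task cannot take the parent down with it. Real isolation (containers, lightweight virtual machines, seccomp and the like) goes much further; this only illustrates the cost/benefit trade-off discussed above.

```python
# Minimal sketch of process-level "fencing": run work in a child process with
# CPU and memory caps (Unix-only; illustrative, not production-grade security).
import multiprocessing as mp
import resource

def fenced(task, *args, cpu_seconds=2, max_bytes=1 << 30):
    """Run task(*args) in a fenced-off child process and report the outcome."""
    ctx = mp.get_context("fork")          # fork, so the local runner needs no pickling
    queue = ctx.Queue()

    def runner():
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))  # 1 GiB by default
        try:
            queue.put(("ok", task(*args)))
        except Exception as exc:          # report errors instead of crashing the parent
            queue.put(("error", repr(exc)))

    child = ctx.Process(target=runner)
    child.start()
    child.join()
    return queue.get() if not queue.empty() else ("killed", None)

if __name__ == "__main__":
    print(fenced(sum, range(1_000_000)))  # ('ok', 499999500000)
    print(fenced(eval, "[0] * 10**12"))   # hits the memory cap, not the whole machine
```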
6. You’ve probably noticed the fuss around an open letter about artificial intelligence, with some press coverage suggesting that AI is a Terminator-level threat to humanity. Underlying all that is a fairly interesting paper summarizing some needs for future research and innovation in AI. In particular, reading the paper reminded me of the previous point about security.
7. Three areas of software innovation that, even though they’re pretty much in my wheelhouse, I have little to say about right now are:
- Application development technology, languages, frameworks, etc.
- The integration of analytics into old-style operational apps.
- The never-ending attempts to make large-enterprise-class application functionality available to outfits with small-enterprise sophistication and budgets.
8. There is, of course, tremendous innovation in robots and other kinds of device. But this post is already long enough, so I’ll address those areas some other time.
Related links
- In many cases, I think that innovations will prove more valuable — or at least much easier to monetize — when presented to particular vertical markets.
- Edit: I followed up on the last point with a post about soft robots.
Comments
A problem with the Vs is that they are almost always connected by “or” in real platforms. Unfortunately, many applications require e.g. Volume *and* Velocity simultaneously. This is not a technical limitation per se; it reflects architectural limitations of platforms due to narrower original use cases. This has made, for example, popular open source (and most closed source) platforms unsuitable for emerging sensor and machine-generated data applications, which often require extremely high continuous ingest of live data concurrent with storage and operational queries (not summarizations) on a rolling window that spans days or weeks. This is the canonical Internet of Things workload in a nutshell, and in my experience the typical working set size is 100 TB, give or take an order of magnitude. In-memory is too small, and most on-disk storage behaviors are too archival-like.
High-end SQL environments have sophisticated I/O scheduling that connects real-time execution and ingest processing with storage, but this has been absent in virtually all “big data” platforms. The challenge for the Hadoop ecosystem is that addressing Volume *and* Velocity *and* Variety simultaneously requires a much tighter coupling of components and I/O management than the model they are used to. It is not just Hadoop/Spark either; I’ve heard many similar stories from people using platforms like MongoDB, Cassandra, and MemSQL, which, while monolithic, still rely on primitive and relatively decoupled internal models that make it difficult to seamlessly connect all of the execution paths under load.
Internet of Things workloads are exposing weaknesses in many of the existing Big Data architectures by requiring them to be more general than they have been. This is not something that can be trivially added to an architecture after the fact, so it will drive more evolution in the platform market.
Hi, and thanks for your comment!
I may be a little more optimistic about the consequences of current technology efforts than you are, or perhaps we’re just emphasizing slightly different things. Solving Volume and Velocity in the same system, while far from trivial, is conceptually straightforward: you have a low-latency store to receive the data first (presumably in-memory), a fat and efficient set of (parallel) pipes to persistent storage, and good federation between the two at query time. (A toy sketch in code follows the next paragraph.)
Meanwhile, Variety and Variability are almost orthogonal to those issues, with the big exception being that decades of superb engineering in analytic performance for tabular (relational or MOLAP) data stores may be marginalized when addressing new kinds of data challenges.
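Here, as promised, is a toy sketch of that hot/cold split and query-time federation. The class and method names are invented, and sqlite3 merely stands in for the persistent analytic store; a real system would pair an in-memory DBMS or message log with a columnar store, and would take far more care over flushing, indexing and consistency.

```python
# Toy sketch: low-latency hot store + batched pipe to persistent storage + federation at query time.
# All names are invented for illustration; sqlite3 stands in for the persistent analytic store.
import sqlite3
import time

class HotColdStore:
    def __init__(self, path="cold.db", flush_threshold=10_000):
        self.hot = []                                   # in-memory buffer for the newest events
        self.flush_threshold = flush_threshold
        self.cold = sqlite3.connect(path)
        self.cold.execute("CREATE TABLE IF NOT EXISTS events (ts REAL, sensor TEXT, value REAL)")

    def ingest(self, sensor, value):
        """Velocity: append to the hot store; flush to disk in batches."""
        self.hot.append((time.time(), sensor, value))
        if len(self.hot) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        """Volume: the 'fat pipe' to persistent storage, one batch at a time."""
        with self.cold:
            self.cold.executemany("INSERT INTO events VALUES (?, ?, ?)", self.hot)
        self.hot.clear()

    def query_since(self, since_ts, sensor):
        """Federation: answer from both tiers and merge at query time."""
        cold_rows = self.cold.execute(
            "SELECT ts, sensor, value FROM events WHERE ts >= ? AND sensor = ?",
            (since_ts, sensor)).fetchall()
        hot_rows = [r for r in self.hot if r[0] >= since_ts and r[1] == sensor]
        return cold_rows + hot_rows

store = HotColdStore(path=":memory:", flush_threshold=3)
for v in (1.0, 2.0, 3.0, 4.0):
    store.ingest("s1", v)
print(store.query_since(0, "s1"))   # rows come from both the flushed batch and the hot buffer
```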
Under the covers, the application innovators are combining multiple databases into web-scale consumer internet sites, or into web-scale platforms for vertical enterprise app builders (case in point: J. Andrew Rogers, above).
The engineering skill is in combining a collection of the V’s into a polystore, and the sales skill is in quickly getting traction on a sliver of application functionality. Meanwhile, complete ERP business solutions for running modern manufacturing are still stuck in pre-Y2K database architectures, even in SaaS.
The missing innovations seem to be in horizontal licensing to seamlessly connect systems, and in providing a consistent UI across a collection of sites for rich web clients. http://component.kitchen/components
Or perhaps on-premises makes a comeback and connects to the “V’s” via a SaaS-appliance hybrid model?
Coda: Something seems to be missing to kick off an enterprise app boom like the one consumer sites have had over the last 20 years.
Curt, the recurring challenge customers are having with high-velocity data is that they need the on-disk data to be fully indexed and online for fast queries at this scale, not just stored. Big data platforms tend to be designed with the assumption that if it is on disk, it is for offline processing only. Query performance falls off a cliff the minute you touch disk, but the use cases for this data tend to be operational. It is a showstopper for many IoT analytic applications.
Variety is solved but with one qualification. Data platforms, big or small, have a difficult time scaling data models built around interval data types; you can prove these can’t be scaled out with either hash or range partitioning, particularly for online data models. This notoriously includes geospatial and constraint data models, hence the conspicuous absence of geospatial big data platforms even though these are the largest data sets that exist.
These touch on two things that make SpaceCurve’s architecture unique. It is the first database platform built around a pure interval computational model, based on some computer science research I did back when I was designing back-ends for Google Earth; even traditional SQL elements are represented as hyper-rectangles under the hood. At least as important, the storage and execution engine are a novel design, borrowing some ideas from my supercomputing days, that allow us to take 10 GbE through processing, indexing, and storage per node on cheap Linux clusters concurrent with low-latency parallel queries at very large scales.
When we first went into the market, we thought our obvious differentiation would be based on our effortless scaling of interval data models and analytics. Ironically, customers value the platform at least as much for the smooth wire-speed performance when disk is involved, which I would expect more platforms to handle well. It seems trivial, but the rough transition from memory to disk is turning out to be a killer for applications involving high-velocity data.
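Regarding the point above that interval data models resist hash or range partitioning: here is a small, made-up illustration. If intervals are sharded on their start point, even a narrow overlap query fans out to essentially every partition, because an interval that starts arbitrarily early can still overlap the query window. The numbers are invented; only the fan-out matters.

```python
# Sketch: overlap queries over intervals fan out across partitions keyed on the start point.
import random

random.seed(0)
NUM_PARTITIONS = 8

def partition_of(interval):
    """Naive sharding: hash-partition an interval by its start point."""
    start, _end = interval
    return hash(start) % NUM_PARTITIONS

# Intervals of widely varying length, as in geospatial or time-span data.
intervals = []
for _ in range(10_000):
    start = random.uniform(0, 1000)
    intervals.append((start, start + random.expovariate(1 / 50)))

query_lo, query_hi = 400.0, 410.0         # a narrow overlap query
matches = [iv for iv in intervals if iv[0] <= query_hi and iv[1] >= query_lo]
touched = {partition_of(iv) for iv in matches}
print(f"{len(matches)} matching intervals live on {len(touched)} of {NUM_PARTITIONS} partitions")
# Typically all 8 partitions are touched: the matching intervals' start points are
# scattered across the hash space, so the query cannot be routed to just a few shards.
```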
Curt
On graph analytics and data management being confused…
We have some experiential learning in this space over the last couple of years with Urika (starting with changing the vexing capitalization of the name from uRiKa). Essentially, we learned that ‘graphs for analytics’ use cases are distinct from ‘graphs for data management’ use cases. I think the big reason for the confusion is that graphs are an abstract data structure which could be stored and analyzed in any number of different ways, so lumping them into ‘Graphs’ as a category is somewhat meaningless.
Basically, you don’t need graph databases to do graph analytics. If you’re building graph databases, then the primary use case is implicitly or explicitly related to traversal or to pattern matching/isomorphism. While these are more common than people think (e.g., every third-normal-form relational data model is easily viewable as a composite of multiple graphs), people don’t think that way, because there’s no major value-add beyond the relational model in doing so. Graph DBs are further confused by the RDF/semantic-graph vs. property-graph data model ‘fork’, and by the attendant rise of multiple query languages (declarative SPARQL, imperative Gremlin, Cypher, etc.).
OTOH, if you’re doing analytics (loosely defined as characterizing the graph via computed quantities like clustering coefficient or centrality measures), you may not even need a ‘proper’ graph. Several efficient techniques exploit the duality between graphs and matrices so you can treat these as linear algebra problems. As a simple example, you can download stock price data from yahoo, compute pairwise correlations of returns (a half-matrix) and you’d have an edge-weighted graph. In such cases, the graph is a ‘lazily’ materialized analytic data structure – you only create it when you need it.
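Here is a minimal sketch of the stock-correlation example just described, with synthetic prices standing in for a Yahoo download: the pairwise correlations of returns form a half-matrix that can be read directly as an edge-weighted graph, no graph database required.

```python
# Sketch: a correlation half-matrix treated as a lazily materialized, edge-weighted graph.
# Synthetic prices stand in for the Yahoo download mentioned above.
import numpy as np

rng = np.random.default_rng(0)
tickers = ["AAA", "BBB", "CCC", "DDD"]
prices = np.cumprod(1 + rng.normal(0, 0.01, size=(250, len(tickers))), axis=0)  # ~250 trading days

returns = np.diff(np.log(prices), axis=0)    # daily log returns, one column per ticker
corr = np.corrcoef(returns, rowvar=False)    # full pairwise correlation matrix

# The upper triangle is the "half-matrix": each entry is one weighted edge.
edges = [(tickers[i], tickers[j], corr[i, j])
         for i in range(len(tickers)) for j in range(i + 1, len(tickers))]

for a, b, w in edges:
    print(f"{a} -- {b}: weight {w:+.2f}")
# The graph exists only as this list of weighted edges, computed on demand:
# graph analytics without a graph database.
```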
The graph analytics area has momentum because it provides a very distinct and powerful analytic toolbox from the usual set. GraphLab/Dato is a great example. The graph database side has a tougher road since it requires changing how people think about data, and has to additionally get past category overload (‘NoSQL’) and inertia (relational data model). Finally in either case, the attendant algorithmic problems don’t go away, making for lots of lurking performance monsters.
What we also know (being Cray, after all) is that the underlying system engineering has a huge impact here: latency-tolerant multithreaded hardware or software runtimes, shared memory, partitioning, and graph-parallel processing models are prime examples.
Re Venkat’s comment: GraphLab/Dato is a great example. GraphLab’s realization was that folks don’t think in graphs; they think in the tabular data forms that Dato provides. Graphs are fine for Donald Knuth; mere mortals go into overload. Platforms that ingest tabular data, optimize for velocity, output intelligence, and (I’d add) provide the means to add ERP orders / process transactions will strike new gold, IMHO.
In my opinion, modeling databases in the cloud is a significant innovation of the last few years. Until very recently, the market was completely dominated by desktop database modelers. You had to buy a license, and your modeling application was tied to your desktop computer. There were no alternatives. But a few years ago the first online modelers were released. They were very simple and couldn’t be used in professional app development, but the idea itself was a breakthrough: no desktop licenses, no installation, no upgrades. The only requirement is a suitable web browser and access to the internet. Just open a browser, log in, and you can get to work. All your models are stored in the cloud, so you can access them anywhere and anytime. Moreover, a web-based tool gives you brand-new possibilities to collaborate within your team and to work remotely. And it must be pointed out that the new commercial generation of these tools provides nearly the same features and capabilities as their leading desktop counterparts, plus strong collaboration features that allow you to build and manage your team and to work together on your DB models. I tested both of the most significant online tools for database modeling, GenMyModel (www.genmymodel.com) and Vertabelo (www.vertabelo.com), and I must say that they are the future of database design. As these applications run in a browser, their interfaces need to be light, which makes them far more intuitive and user-friendly than the heavy, overloaded UIs of desktop modelers.
What are your opinions? Do you think that the future of database modeling tools is on the internet?
This is for modeling of what kinds of databases, that run where?