Evolving definitions and technology categories for 2011
It seems my prediction of a limited blogging schedule in December came emphatically true. I shall re-start with a collection of quick thoughts, clearing the decks for more detailed posts to follow. If you’d like to contribute thoughts on these subjects, now might be a really good time.
1. Not many terms I coin get marketing traction, but machine-generated data has grown some legs. Clients (Infobright, Cloudera) and non-clients alike have adopted it. I need to follow up with a more official description/definition of the concept. The Wikipedia article on same doesn’t get the job done yet. (Edit: Here’s my take on defining machine-generated data. Be sure to read through to Daniel Abadi’s response.)
2. Merv Adrian is going to Gartner Group. Expect great improvement in Gartner’s DBMS coverage, in areas beyond the straightforward “This is what users say they are doing” Gartner already excels at. That said, Merv is probably not starting at Gartner soon enough to help make the 2010 analytic DBMS Magic Quadrant any better than the Gartner 2009 data warehouse database management system magic quadrant, the Gartner 2008 data warehouse database management system magic quadrant, and so on.
In particular, Merv has a good understanding of trends and technology in analytic DBMS and related markets. Judging by his Twitter stream, James Kobielus at Forrester if anything overrates the shift to general “analytic platforms.” And I of course am expected to help define the “analytic platform”/“advanced analytics”/whatever category. Taking all those analyst efforts together, it’s reasonable to expect a lot more market awareness, and also market confusion, around these areas.
3. All that plugs into a larger project I was working on before my family issues came crashing in. The enterprise data warehouse is a myth, and that’s just the first reason that the old EDW vs. data mart bifurcation is grossly inadequate for understanding analytic data management choices. So I’m working on some ideas to categorize types of data warehouse/mart/whatever according to what kind of data you have and how you use that data. Multiple industry players (OK, vendors) have offered interesting and useful feedback in this process, although I’m still waiting for Teradata and IBM. (Edit: My bad. Teradata actually had sent a helpful response some time ago.)
In connection with that effort, the last outline I did back in October of analytic data use styles read:
- Traditional BI
  - Reporting, dashboards, & light-weight ad-hoc query
  - (Even if you make this more into data exploration, you’re probably not stressing the underlying DBMS much more than traditional BI does)
  - (If integrated into operational apps, your DBMS choice for this may be constrained by your choice of operational apps)
- Near-real-time BI
  - E.g., dashboards w/ constant or 1-minute refresh
  - (Actually, this isn’t a great fit for most analytic DBMS yet)
  - (Also, it’s not a big market yet, except in specialized niches such as trading or network control)
- Budgeting & consolidation
  - (MOLAP is still strong here)
  - (I took out the word “planning” because it has several meanings)
- Investigative analytics*
  - Can be but doesn’t have to be long-running
  - Example technologies include:
    - Heavy ad-hoc query
    - Data mining/machine learning/predictive analytics modeling
    - Simulation
    - Other advanced analytics
- (Advanced) operational analytics
  - Inputs to operational apps
  - Technologically similar to investigative analytics
    - Data mining/machine learning/predictive analytics scoring
    - Simulation
    - Other advanced analytics
  - Example applications include:
    - Customer classification or scoring
    - Wholesale telecom pricing
    - Basel 3 risk analysis
- Pre-processing, staging, and ETL
- Archive & compliance
- (Test/dev)
The data warehouse/mart categories weren’t in exact one-to-one correlation to those use styles, but the connection was of course pretty close.
*I’ve really struggled with terminology in the area of data exploration (over-used already)/discovery analytics (sounds weird)/research analytics (caused confusion when I tried it). Investigative analytics is my latest try.
4. And finally — like most people, I find the terms unstructured or semi-structured data to be misleading, for at least two reasons:
- When the data is human-generated, what’s really happening is usually that the structure is just in a different place — structured databases generally tend to hold unstructured data, and vice-versa.
- In the case of machine-generated data, you really can start out with unstructured sets of individually unstructured logs. So what do you do then? You derive data, which has some kind of structure, and do most of your operations on that.
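To make the derived-data point concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the log lines, their format, and the names `RAW_LOGS`, `LINE_PATTERN`, and `derive_records` are all invented, and real machine-generated data is rarely this tidy. The idea is only that you pull structured records out of raw text and then run your operations against those records rather than against the logs.

```python
import re
from collections import Counter

# Hypothetical raw log lines; the format is invented for this example.
RAW_LOGS = [
    "2010-12-30 23:59:01 GET /index.html 200 12ms",
    "2010-12-30 23:59:02 GET /missing 404 3ms",
    "garbled line with no recognizable fields",
    "2010-12-30 23:59:05 POST /login 200 48ms",
]

# Assumed pattern for the invented format above.
LINE_PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<method>[A-Z]+) (?P<path>\S+) (?P<status>\d{3}) (?P<ms>\d+)ms"
)


def derive_records(lines):
    """Turn raw lines into structured records, skipping lines that do not parse."""
    records = []
    for line in lines:
        match = LINE_PATTERN.fullmatch(line)
        if match is None:
            continue  # no structure we know how to derive from this line
        record = match.groupdict()
        record["status"] = int(record["status"])
        record["ms"] = int(record["ms"])
        records.append(record)
    return records


# Most subsequent operations run on the derived, structured data.
records = derive_records(RAW_LOGS)
status_counts = Counter(record["status"] for record in records)
```

Note that the structure lives in the pattern, not in the logs themselves; change the pattern and the same raw lines yield differently structured derived data.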
So I’ve been playing for a couple of years with the thought of introducing the term polystructured data. This is not a finished concept, because there are at least three different things I could mean by it:
- “Polystructured data is data that has considerable structure, but whose structure is in some important way unpredictable.” That’s a direct quote from a draft of a never-published paper. The paper, conceived before the days of NoSQL, was meant to be very XML-centric.
- “Polystructured data is data whose structure is apt to be interpreted in different ways at different times” — e.g., data that will variously get referenced by free text and structured searches. The example I gave illustrates part of the problem with that version, as increasingly many software vendors think it’s a dandy idea to do free-text searches across various columns of relational tables.
- “Polystructured data is data that gets restructured over time.” That’s the derived data point.
It may take a while to find, but I think there’s a pony in there somewhere.
Edit: Here’s the definition of poly-structured database I eventually came up with.
Comments
6 Responses to “Evolving definitions and technology categories for 2011”
I would like to suggest a stronger differentiation between the process, workload, and requirements to build a model and the process, workload, and requirements to score data using a model.
A model can be built manually using BI… i.e. my model to select the “top” customers is based on a sliding scale I created that weighs current recurring revenue and tenure.
Or a model can be built using very sophisticated algorithms.
In either case scoring then is the application of a more-or-less complex SQL statement to rate/score the base.
Vendors want to claim advanced analytic capabilities if they score in-database. The differentiation I suggest would help clarify questions around who can actually execute complex algorithms and who can only score.
Hi Rob,
I have several posts up on that point, and definitely plan to have more in the future.
See e.g. http://www.dbms2.com/2010/05/15/further-clarifying-in-database-mpp-sas/
Curt,
The terms structured and unstructured and the silo’d thinking around them are amongst the biggest culprits for constraining the value derived from information today.
The purest definition for structured I have found is one where you know and can easily manipulate the schema. If you don’t meet those conditions then it is unstructured and the question is – how much effort does it take to figure out the schema? A lot of what we call “unstructured” can have an easy to discover schema…and some of it…nearly impossible.
Worth noting is that if someone has 3 SAP instances with 3 different schemas and they want to look at data across those…it’s unstructured because the consolidated schema isn’t known (until a new one is created). Not that different from documents with a relatively easily discoverable schema.
In any case….this is a great area for you to create frameworks and add value to the industry.
Regards,
Steve
[…] Recently and somewhat belatedly, I added a somewhat obvious point — if we don’t keep all or even most of our machine-generated data, then what we keep is likely to be in some way massaged, extracted, or derived. The purpose of this post is to address a second oversight — giving a hopefully clear definition of what I actually mean by “machine-generated data.” […]
[…] Merv Adrian is now at Gartner. […]