Analytic technologies

Discussion of technologies related to information query and analysis. Related subjects include:

Business intelligence
Data warehousing
(in Text Technologies) Text mining
(in The Monash Report) Data mining
(in The Monash Report) General issues in analytic technology

April 25, 2013

Analytic application themes

I talk with a lot of companies, and repeatedly hear some of the same application themes. This post is my attempt to collect some of those ideas in one place.

1. So far, the buzzword of the year is “real-time analytics”, generally with “operational” or “big data” included as well. I hear variants of that positioning from NewSQL vendors (e.g. MemSQL), NoSQL vendors (e.g. Aerospike), BI stack vendors (e.g. Platfora), application-stack vendors (e.g. WibiData), log analysis vendors (led by Splunk), data management vendors (e.g. Cloudera), and of course the CEP industry.

Yeah, yeah, I know — not all the named companies are in exactly the right market category. But that’s hard to avoid.

Why this gold rush? On the demand side, there’s a real or imagined need for speed. On the supply side, I’d say:

There are vast numbers of companies offering data-management-related technology. They need ways to differentiate.
Doing analytics at short-request speeds is an obvious data-management-related challenge, and not yet comprehensively addressed.

2. More generally, most of the applications I hear about are analytic, or have a strong analytic aspect. The three biggest areas — and these overlap — are:

Customer interaction
Network and sensor monitoring
Game and mobile application back-ends

Also arising fairly frequently are:

Algorithmic trading
Anti-fraud
Risk measurement
Law enforcement/national security
Healthcare
Stakeholder-facing analytics

I’m hearing less about quality, defect tracking, and equipment maintenance than I used to, but those application areas have anyway been ebbing and flowing for decades.

Categories: Aerospike, Application areas, Business intelligence, Cloudera, Games and virtual worlds, GIS and geospatial, Health care, Investment research and trading, Log analysis, MemSQL, Platfora, Predictive modeling and advanced analytics, Telecommunications, Web analytics, WibiData

2 Comments

April 15, 2013

Notes on Teradata systems

Teradata is announcing its new high-end systems, the Teradata 6700 series. Notes on that include:

Teradata tends to get 35-55% (roughly speaking) annual performance improvements, as measured by its internal blended measure Tperf. A big part of this is exploiting new-generation Intel processors.
This year the figure is around 40%.
The 6700 is based on Intel’s Sandy Bridge.
Teradata previously told me that Ivy Bridge — the next one after Sandy Bridge — could offer a performance “discontinuity”. So, while this is just a guess, I expect that next year’s Teradata performance improvement will beat this year’s.
Teradata has now largely switched over to InfiniBand.

Teradata is also talking about data integration and best-of-breed systems, with buzzwords such as:

Teradata Unified Data Architecture.
Fabric-based computing, even though this isn’t really about storage.
Teradata SQL-H.

Categories: Data integration and middleware, Data warehouse appliances, Data warehousing, Pricing, SAS Institute, Teradata

3 Comments

April 15, 2013

Teradata SQL-H

As vendors so often do, Teradata has caused itself some naming confusion. SQL-H was introduced as a facility of Teradata Aster, to complement SQL-MR.* But while SQL-MR is in essence a set of SQL extensions, SQL-H is not. Rather, SQL-H is a transparency interface that makes Hadoop data responsive to the same code that would work on Teradata Aster …

*Speaking of confusion — Teradata Aster seems to use the spellings SQL/MR and SQL-MR interchangeably.

… except that now there’s also a SQL-H for regular Teradata systems as well. While it has the same general features and benefits as SQL-H for Teradata Aster, the details are different, since the underlying systems are.

I hope that’s clear. 🙂

Categories: Data integration and middleware, Data warehousing, Emulation, transparency, portability, Hadoop, SQL/Hadoop integration, Teradata

2 Comments

April 1, 2013

Some notes on new-era data management, March 31, 2013

Hmm. I probably should have broken this out as three posts rather than one after all. Sorry about that.

Performance confusion

Discussions of DBMS performance are always odd, for starters because:

Workloads and use cases vary greatly.
In particular, benchmarks such as the YCSB or TPC-H aren’t very helpful.

But in NoSQL/NewSQL short-request processing, performance claims seem particularly confused. Reasons include but are not limited to:

It’s common for databases or at least working sets to be entirely in RAM — but it’s not always required.
Consistency and durability models vary. What’s more, in some systems — e.g. MongoDB — there’s considerable flexibility as to which model you use.
In particular, there’s an increasingly common choice in which data is written synchronously to RAM on 2 or more servers, then asynchronously to disk on each of them. Performance in these cases can be quite different from when all writes need to be committed to disk. Of course, you need sufficient disk I/O to keep up, so SSDs (Solid-State Drives) can come in handy.
Many workloads are inherently single node (replication aside). Others are not.
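
To make the "synchronously to RAM, asynchronously to disk" choice concrete, here is a minimal single-process sketch. It is an illustration only, not how MongoDB or any other named product implements it: the `Replica` and `ReplicatedStore` classes, the JSON log format, and the `demo` helper are all hypothetical, and real systems replicate over a network rather than between objects in one process.

```python
import json
import os
import queue
import tempfile
import threading


class Replica:
    """One 'server' (hypothetical): RAM is updated synchronously, disk in the background."""

    def __init__(self, log_path):
        self.memory = {}               # the in-RAM copy of the data
        self.log_path = log_path       # stand-in for durable storage
        self._pending = queue.Queue()  # writes held in RAM but not yet on disk
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def apply(self, key, value):
        # Synchronous part: update RAM, then queue the disk write and return.
        self.memory[key] = value
        self._pending.put((key, value))

    def _flush_loop(self):
        # Asynchronous part: append queued writes to the on-disk log.
        while True:
            key, value = self._pending.get()
            with open(self.log_path, "a") as f:
                f.write(json.dumps([key, value]) + "\n")
                f.flush()
                os.fsync(f.fileno())
            self._pending.task_done()

    def wait_for_disk(self):
        # Block until every queued write has reached the log.
        self._pending.join()

    def disk_contents(self):
        if not os.path.exists(self.log_path):
            return []
        with open(self.log_path) as f:
            return [json.loads(line) for line in f]


class ReplicatedStore:
    """Acknowledges a write once it is in RAM on every replica (2 or more)."""

    def __init__(self, replicas):
        if len(replicas) < 2:
            raise ValueError("need at least 2 replicas")
        self.replicas = replicas

    def put(self, key, value):
        for replica in self.replicas:
            replica.apply(key, value)
        return True  # acknowledged before any disk write is guaranteed

    def get(self, key):
        return self.replicas[0].memory[key]


def demo():
    tmp = tempfile.mkdtemp()
    replicas = [Replica(os.path.join(tmp, f"replica{i}.log")) for i in range(2)]
    store = ReplicatedStore(replicas)
    store.put("user:1", "alice")
    for replica in replicas:
        replica.wait_for_disk()  # only here to make the demo deterministic
    return [replica.disk_contents() for replica in replicas]
```

The point of the sketch is that `put` returns as soon as both RAM copies are updated, so write latency no longer includes a disk commit. The price is the backlog in `_pending`: if the disk cannot drain it as fast as writes arrive, it grows without bound, which is where the remark about sufficient disk I/O comes in.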

MongoDB and 10gen

I caught up with Ron Avnur at 10gen. Technical highlights included: Read more

Categories: Benchmarks and POCs, Cassandra, Clustering, Couchbase, Data models and architecture, Data warehousing, EAI, EII, ETL, ELT, ETLT, Hadoop, HBase, In-memory DBMS, Investment research and trading, Market share and customer counts, MarkLogic, Memory-centric data management, MongoDB, NewSQL, NoSQL, Tokutek and TokuDB

8 Comments

March 26, 2013

Platfora at the time of first GA

Well-resourced Silicon Valley start-ups typically announce their existence multiple times. Company formation, angel funding, Series A funding, Series B funding, company launch, product beta, and product general availability may not be 7 different “news events”, but they’re apt to be at least 3-4. Platfora, no exception to this rule, is hitting general availability today, and in connection with that I learned a bit more about what they are up to.

In simplest terms, Platfora offers exploratory business intelligence against Hadoop-based data. As per last weekend’s post about exploratory BI, a key requirement is speed; and so far as I can tell, any technological innovation Platfora offers relates to the need for speed. Specifically, I drilled into Platfora’s performance architecture on the query processing side (and associated data movement); Platfora also brags of rendering 100s of 1000s of “marks” quickly in HTML5 visualizations, but I haven’t a clue as to whether that’s much of an accomplishment in itself.

Platfora’s marketing suggests it obviates the need for a data warehouse at all; for most enterprises, of course, that is a great exaggeration. But another dubious aspect of Platfora marketing actually serves to understate the product’s merits — Platfora claims to have an “in-memory” product, when what’s really the case is that Platfora’s memory-centric technology uses both RAM and disk to manage larger data marts than could reasonably be fit into RAM alone. Expanding on what I wrote about Platfora when it de-stealthed: Read more

Categories: Business intelligence, Columnar database management, Data warehousing, EAI, EII, ETL, ELT, ETLT, Hadoop, Market share and customer counts, Memory-centric data management, Platfora, Workload management

13 Comments

March 24, 2013

Essential features of exploration/discovery BI

If I had my way, the business intelligence part of investigative analytics — i.e., the class of business intelligence tools exemplified by QlikView and Tableau — would continue to be called “data exploration”. Exploration is what’s actually going on, and it also carries connotations of the “fun” that users report having with the products. By way of contrast, I don’t know what “data discovery” means; the problem these tools solve is that the data has been insufficiently explored, not that it hasn’t been discovered at all. Still, “data discovery” seems to be the term that’s winning.

Confusingly, the Teradata Aster library of functions is now called “Discovery” as well, although thankfully without the “data” modifier. Further marketing uses of the term “discovery” will surely follow.

Enough terminology. What sets exploration/discovery business intelligence tools apart? I think these products have two essential kinds of feature:

Query modification.
Query result revisualization.*

Categories: Business intelligence, Endeca, Memory-centric data management, QlikTech and QlikView, Tableau Software

8 Comments

February 22, 2013

Should you offer “complete” analytic applications?

WibiData is essentially on the trajectory:

Started with platform-ish technology.
Selling analytic application subsystems, focused for now on personalization.
Hopeful of selling complete analytic applications in the future.

The same, it turns out, is true of Causata.* Talking with them both the same day led me to write this post. Read more

Categories: Hadapt, HBase, Market share and customer counts, PivotLink, Predictive modeling and advanced analytics, WibiData

5 Comments

February 21, 2013

One database to rule them all?

Perhaps the single toughest question in all database technology is: Which different purposes can a single data store serve well? — or to phrase it more technically — Which different usage patterns can a single data store support efficiently? Ted Codd was on multiple sides of that issue, first suggesting that relational DBMS could do everything and then averring they could not. Mike Stonebraker too has been on multiple sides, first introducing universal DBMS attempts with Postgres and Illustra/Informix, then more recently suggesting the world needs 9 or so kinds of database technology. As for me — well, I agreed with Mike both times. 🙂

Since this is MUCH too big a subject for a single blog post, what I’ll do in this one is simply race through some background material. To a first approximation, this whole discussion is mainly about data layouts — but only if we interpret that concept broadly enough to comprise:

Every level of storage (disk, RAM, etc.).
Indexes, aggregates and raw data alike.

To date, nobody has ever discovered a data layout that is efficient for all usage patterns. As a general rule, simpler data layouts are often faster to write, while fancier ones can boost query performance. Specific tradeoffs include, but hardly are limited to: Read more

Categories: Columnar database management, Data models and architecture, Data warehousing, Database compression, Database diversity, GenieDB, GIS and geospatial, Hadoop, IBM and DB2, MarkLogic, Michael Stonebraker, Microsoft and SQL*Server, NewSQL, NoSQL, Oracle, PostgreSQL, SAP AG, Solid-state memory, Storage, Structured documents, Text, Theory and architecture, Tokutek and TokuDB

20 Comments

February 13, 2013

It’s hard to make data easy to analyze

It’s hard to make data easy to analyze. While everybody seems to realize this — a few marketeers perhaps aside — some remarks might be useful even so.

Many different technologies purport to make data easy, or easier, to analyze; so many, in fact, that cataloguing them all is forbiddingly hard. Major claims, and some technologies that make them, include:

“We get data into a form in which it can be analyzed.” This is the story behind, among others:
- Most of the data integration and ETL (Extract/Transform/Load) industries, software vendors and consulting firms alike.
- Many things that purport to be “analytic applications” or data warehouse “quick starts”.
- “Data reduction” use cases in event processing.*
- Text analytics tools.
- Splunk.
“Forget all that transformation foofarah — just load (or write) data into our thing and start analyzing it immediately.” This at various times has been much of the story behind:
- Relational DBMS, according to their inventor E. F. Codd.
- MOLAP (Multidimensional OnLine Analytic Processing), also according to RDBMS inventor E. F. Codd.
- Any kind of analytic DBMS, or general purpose DBMS used for data warehousing.
- Newer kinds of analytic DBMS that are faster than older kinds.
- The “data mart spin-out” feature of certain analytic DBMS.
- In-memory analytic data stores.
- Hadoop.
- NoSQL DBMS that have a few analytic features.
- TokuDB, similarly.
- Electronic spreadsheets, from VisiCalc to Datameer.
- Splunk.
“Our tools help you with specific kinds of analyses or analytic displays.” This is the story underlying, among others:
- The business intelligence industry.
- The predictive analytics industry.
- Algorithmic trading use cases in complex event processing.*
- Some analytic applications.
- Splunk.

*Complex event/stream processing terminology is always problematic.

My thoughts on all this start: Read more

Categories: Business intelligence, Data warehousing, Derived data, EAI, EII, ETL, ELT, ETLT, Hadoop, In-memory DBMS, Investment research and trading, Memory-centric data management, Microsoft and SQL*Server, MOLAP, NoSQL, Predictive modeling and advanced analytics, salesforce.com, Splunk, Streaming and complex event processing (CEP), Text

6 Comments

February 6, 2013

Key questions when selecting an analytic RDBMS

I recently complained that the Gartner Magic Quadrant for Data Warehouse DBMS conflates many use cases into one set of rankings. So perhaps now would be a good time to offer some thoughts on how to tell use cases apart. Assuming you know that you really want to manage your analytic database with a relational DBMS, the first questions you ask yourself could be:

How big is your database? How big is your budget?
How do you feel about appliances?
How do you feel about the cloud?
What are the size and shape of your workload?
How fresh does the data need to be?

Let’s drill down. Read more

Categories: Buying processes, Cloud computing, Clustering, Data integration and middleware, Data warehouse appliances, Data warehousing, Database compression, Exadata, IBM and DB2, Memory-centric data management, Microsoft and SQL*Server, Netezza, Oracle, Pricing, Teradata, Workload management

3 Comments


