Application areas
Posts focusing on the use of database and analytic technologies in specific application domains. Related subjects include:
- (in Text Technologies) Specific application areas for text analytics
Cool analytic stories
There are several reasons it’s hard to confirm great analytic user stories. First, there aren’t as many jaw-dropping use cases as one might think. For as I wrote about performance, new technology tends to make things better, but not radically so. After all, if its applications are all that bloody important, then people have probably already been making do, getting the job done as best they can, even if in an inferior way.
Further, some of the best stories are hard to confirm; even the famed beer/diapers story isn’t really true. Many application areas are hard to nail down due to confidentiality, especially but not only in such “adversarial” domains as anti-terrorism, anti-spam, or anti-fraud.
Even so, I have two questions in my inbox that boil down to “What are the coolest or most significant analytics stories out there?” So let’s round up some of what I know. Read more
Categories: Analytic technologies, Google, Health care, Investment research and trading, Predictive modeling and advanced analytics, Scientific research, Telecommunications, Web analytics | 6 Comments |
Notes on the analysis of large graphs
This post is part of a series on managing and analyzing graph data. Posts to date include:
- Graph data model basics
- Relationship analytics definition
- Relationship analytics applications
- Analysis of large graphs (this post)
My series on graph data management and analytics got knocked off-stride by our website difficulties. Still, I want to return to one interesting set of issues — analyzing large graphs, specifically ones that don’t fit comfortably into RAM on a single server. By no means do I have the subject figured out. But here are a few notes on the matter.
How big can a graph be? That of course depends on:
- The number of nodes. If the nodes of a graph are people, there’s an obvious upper bound on the node count. Even if you include their houses, cars, and so on, you’re probably capped in the range of 10 billion.
- The number of edges. (Even more important than the number of nodes.) If every phone call, email, or text message in the world is an edge, that’s a lot of edges.
- The typical size of a (node, edge, node) triple. I don’t know why you’d have to go much over 100 bytes post-compression*, but maybe I’m overlooking something.
*Even if your graph has 10 billion nodes, those can be tokenized in 34 bits, so the main concern is edges. Edges can include weights, timestamps, and so on, but how many specifics do you really need? At some point you can surely rely on a pointer to full detail stored elsewhere.
The biggest graph-size estimates I’ve gotten are from my clients at Yarcdata, a division of Cray. (“Yarc” is “Cray” spelled backwards.) To my surprise, they suggested that graphs about people could have 1000s of edges per node, whether in:
- An intelligence scenario, perhaps with billions of nodes and hence trillions of edges.
- A telecom user-analysis case, with perhaps 100 million nodes and hence 100s of billions of edges.
Yarcdata further suggested that bioinformatics use cases could have node counts higher yet, characterizing Bio2RDF as one of the “smaller” ones at 22 billion nodes. In these cases, the edges-per-node average seems lower than in people-analysis graphs, but we’re still talking about 100s of billions of edges.
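Putting those estimates together, here is a quick back-of-the-envelope sketch in Python; the bytes-per-triple figure is the assumption from the footnote above, and the edges-per-node figure is the Yarcdata-style estimate, so treat the result as an order of magnitude rather than a measurement.

```python
# Back-of-the-envelope sizing for a large graph, using the figures above.
# The edges-per-node and bytes-per-triple numbers are assumptions for
# illustration, not measurements.
import math

nodes = 10_000_000_000                            # roughly 10 billion people-ish nodes
bits_per_node_id = math.ceil(math.log2(nodes))    # 34 bits suffice to tokenize every node
edges_per_node = 1_000                            # "1000s of edges per node", per the estimates above
edges = nodes * edges_per_node                    # about 10 trillion edges

bytes_per_triple = 100                            # assumed post-compression size of a triple
total_bytes = edges * bytes_per_triple

print(f"node id width: {bits_per_node_id} bits")
print(f"edge count:    {edges:,}")
print(f"approx. size:  {total_bytes / 10**15:,.1f} PB")
```

Even with conservative per-triple sizes, that works out to roughly a petabyte, which is exactly the territory where a graph stops fitting comfortably into RAM on a single server.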
Recalling that relationship analytics boils down to finding paths and subgraphs, the naive relational approach to such tasks would be: Read more
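To give a rough, toy-sized idea of what finding paths over (node, edge, node) triples involves, here is a minimal Python sketch in which each additional hop expands every partial path by one more lookup against the edge set, the same per-hop work that a relational formulation expresses as repeated self-joins:

```python
# Toy sketch of fixed-length path finding over (node, edge, node) triples.
# Each hop expands every partial path by one lookup against the edge index,
# mirroring the per-hop self-joins a relational formulation would need.
from collections import defaultdict

edges = [("a", "calls", "b"), ("b", "calls", "c"), ("c", "calls", "d"), ("a", "calls", "c")]

by_source = defaultdict(list)          # index edges by source node, as a hash join would
for src, _kind, dst in edges:
    by_source[src].append(dst)

def k_hop_paths(start, k):
    """All paths of exactly k hops from start; each iteration is one more 'self-join'."""
    paths = [[start]]
    for _ in range(k):
        paths = [path + [nxt] for path in paths for nxt in by_source[path[-1]]]
    return paths

print(k_hop_paths("a", 2))             # [['a', 'b', 'c'], ['a', 'c', 'd']]
```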
Categories: Analytic technologies, Aster Data, Data models and architecture, Hadoop, Health care, MapReduce, RDF and graphs, Scientific research, Telecommunications, Yarcdata and Cray | 20 Comments |
Relationship analytics application notes
This post is part of a series on managing and analyzing graph data. Posts to date include:
- Graph data model basics
- Relationship analytics definition
- Relationship analytics applications (this post)
- Analysis of large graphs
In my recent post on graph data models, I cited various application categories for relationship analytics. For most applications, it’s hard to get a lot of details. Reasons include:
- In adversarial domains such as national security, anti-fraud, or search engine ranking, it’s natural to keep algorithms secret.
- The big exception — influencer analytics, aka social network analysis — is obscured by a major hype/reality gap (so, come to think of it, is a lot of other predictive modeling).
Even so, it’s fairly safe to say:
- Much of relationship analytics is about subgraph pattern matching.
- Much of relationship analytics is about identifying subgraph patterns that are predictive of certain characteristics or outcomes.
- An important kind of relationship analytics challenge is to identify influential individuals.
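As a toy illustration of that last point (my own example, not any vendor’s algorithm), even crude degree counting over an edge list gives a first cut at spotting influential individuals; real influencer analytics layers much richer measures on the same kind of input.

```python
# Crude first cut at spotting influential individuals: rank nodes by degree.
# Purely illustrative; real influencer analytics uses richer measures
# (centrality, propagation modeling) over the same edge-list input.
from collections import Counter

edges = [("alice", "bob"), ("alice", "carol"), ("bob", "carol"), ("dave", "alice")]

degree = Counter()
for a, b in edges:
    degree[a] += 1
    degree[b] += 1

for person, connections in degree.most_common(3):
    print(person, connections)          # alice 3, bob 2, carol 2
```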
Categories: Predictive modeling and advanced analytics, RDF and graphs, Telecommunications | 6 Comments |
Notes on graph data management
This post is part of a series on managing and analyzing graph data. Posts to date include:
- Graph data model basics (this post)
- Relationship analytics definition
- Relationship analytics applications
- Analysis of large graphs
Interest in graph data models keeps increasing. But it’s tough to discuss them with any generality, because “graph data model” encompasses so many different things. Indeed, just as all data structures can be mapped to relational ones, it is also the case that all data structures can be mapped to graphs.
Formally, a graph is a collection of (node, edge, node) triples. In the simplest case, the edge has no properties other than existence or maybe direction, and the triple can be reduced to a (node, node) pair, unordered or ordered as the case may be. It is common, however, for edges to encapsulate additional properties, the canonical examples of which are:
- Weight. Usually, the intuition here is that the weight is a number indicating the strength of the connection. This is generally derived from more basic data.
- Kind. The edge can encapsulate one or more descriptors indicating the kind of relationship between the nodes.
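A minimal sketch of that model (the names are mine, purely illustrative): an edge records its two endpoints plus optional properties such as weight and kind, and in the simplest case collapses to a bare node pair.

```python
# Minimal sketch of the graph data model above: (node, edge, node) triples,
# where the edge may carry properties such as weight and kind.
from dataclasses import dataclass, field

@dataclass
class Edge:
    source: str
    target: str
    kind: str = "related_to"      # descriptor for the kind of relationship
    weight: float = 1.0           # strength of the connection, usually derived from more basic data
    extra: dict = field(default_factory=dict)   # timestamps, pointers to fuller detail, etc.

graph = [
    Edge("alice", "bob", kind="called", weight=0.8),
    Edge("bob", "carol", kind="texted", weight=0.3),
]

# In the simplest case, an edge reduces to an (unordered or ordered) node pair.
pairs = [(e.source, e.target) for e in graph]
```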
Many of the graph examples I can think of fit into four groups: Read more
Categories: Neo Technology and Neo4j, RDF and graphs, Telecommunications, Workday | 10 Comments |
Thinking about market segments
It is a reasonable (over)simplification to say that my business boils down to:
- Advising vendors what/how to sell.
- Advising users what/how to buy.
One complication that commonly creeps in is that different groups of users have different buying practices and technology needs. Usually, I nod to that point in passing, perhaps by listing different application areas for a company or product. But now let’s address it head on. Whether or not you care about the particulars, I hope the sheer length of this post reminds you that there are many different market segments out there.
Last June I wrote:
In almost any IT decision, there are a number of environmental constraints that need to be acknowledged. Organizations may have standard vendors, favored vendors, or simply vendors who give them particularly deep discounts. Legacy systems are in place, application and system alike, and may or may not be open to replacement. Enterprises may have on-premise or off-premise preferences; SaaS (Software as a Service) vendors probably have multitenancy concerns. Your organization can determine which aspects of your system you’d ideally like to see be tightly integrated with each other, and which you’d prefer to keep only loosely coupled. You may have biases for or against open-source software. You may be pro- or anti-appliance. Some applications have a substantial need for elastic scaling. And some kinds of issues cut across multiple areas, such as budget, timeframe, security, or trained personnel.
I’d further say that it matters whether the buyer:
- Is a large central IT organization.
- Is the well-staffed IT organization of a particular business department.
- Is a small, frazzled IT organization.
- Has strong engineering or technical skills, but less in the way of IT specialists.
- Is trying to skate by without much technical knowledge of any kind.
Now let’s map those considerations (and others) to some specific market segments. Read more
DataStax Enterprise 2.0
Edit: Multiple errors in the post below have been corrected in a follow-on post about DataStax Enterprise and Cassandra.
My client DataStax is announcing DataStax Enterprise 2.0. The big point of the release is that there’s a bunch of stuff integrated together, including at least:
- Cassandra — the NoSQL DBMS, which DataStax sometimes calls “DataStax Server”. Edit: That’s not really a fair criticism of DataStax’s messaging.
- Hadoop MapReduce, which DataStax sometimes calls “Hadoop”. Edit: That is indeed fair. 🙂
- Sqoop — the general way to connect relational DBMS to Hadoop, which DataStax sometimes calls “RDBMS integration”.
- Solr — the search-centric Apache project, or big parts of it, which DataStax generally calls either “Solr” or “Solr compatibility”.
- log4j — an Apache project that has something or other to do with logging, or parts of it, which DataStax sometimes calls “log file integration”.
- DataStax OpsCenter — some management tools and so on around Cassandra and the rest of the product line.
DataStax stresses that all this runs on the same cluster, with the same administrative tools and so on. For example, on a single cluster:
- You can manage the interactive data for a web site.
- You can store the logs for that website.
- You can analyze all of the above in Hadoop.
Translucent modeling, and the future of internet marketing
There’s a growing consensus that consumers require limits on the predictive modeling that is done about them. That’s a theme of the Obama Administration’s recent work on consumer data privacy; it’s central to other countries’ data retention regulations; and it’s specifically borne out by the recent Target-pursues-pregnant-women example. Whatever happens legally, I believe this also calls for a technical response, namely:
Consumers should be shown key factual and psychographic aspects of how they are modeled, and be given the chance to insist that marketers disregard any or all of those aspects.
I further believe that the resulting technology should be extended so that
information holders can collaborate by exchanging estimates for such key factors, rather than exchanging the underlying data itself.
To some extent this happens today, for example with attribution/de-anonymization or with credit scores; but I think it should be taken to another level of granularity.
My name for all this is translucent modeling, rather than “transparent”, the idea being that key points must be visible, but the finer details can be safely obscured.
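One way to picture the mechanics, as a sketch of the idea rather than anybody’s shipping system: the marketer keeps a small set of named factors per consumer, shows those factors on request, and disregards any factor the consumer vetoes before scoring.

```python
# Sketch of "translucent modeling": the key factors used about a consumer are
# visible to that consumer, and any vetoed factor is disregarded when scoring.
# Factor names and weights are invented for illustration.
factors = {"expecting_parent": 0.9, "frequent_traveler": 0.4, "bargain_hunter": 0.7}
weights = {"expecting_parent": 2.0, "frequent_traveler": 0.5, "bargain_hunter": 1.0}

def visible_factors():
    """What the consumer sees: the factors themselves, not the underlying raw data."""
    return sorted(factors)

def score(vetoed=frozenset()):
    """Score the consumer while disregarding any factors they insisted be dropped."""
    return sum(weights[f] * value for f, value in factors.items() if f not in vetoed)

print(visible_factors())
print(score())                              # full score
print(score(vetoed={"expecting_parent"}))   # score with one factor disregarded
```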
Examples of dialog I think marketers should have with consumers include: Read more
Categories: Predictive modeling and advanced analytics, Surveillance and privacy, Web analytics | Leave a Comment |
SAP HANA today
SAP HANA has gotten much attention, mainly for its potential. I finally got briefed on HANA a few weeks ago. While we didn’t have time for all that much detail, it still might be interesting to talk about where SAP HANA stands today.
The HANA section of SAP’s website is a confusing and sometimes inaccurate mess. But an IBM whitepaper on SAP HANA gives some helpful background.
SAP HANA is positioned as an “appliance”. So far as I can tell, that really means it’s a software product for which there are a variety of emphatically-recommended hardware configurations — Intel-only, from what are currently eight usual-suspect hardware partners. Anyhow, the core of SAP HANA is an in-memory DBMS. Particulars include:
- Mainly, HANA is an in-memory columnar DBMS, based on SAP’s confusingly-renamed BI Accelerator/BW Accelerator. Analytics and most OLTP (OnLine Transaction Processing) go against the columnar part of HANA.
- The HANA DBMS also has an in-memory row storage option, used to store metadata, small tables, and so on.
- SAP HANA talks both SQL and MDX.
- The HANA DBMS is shared-nothing across blades or rack servers. I imagine that within an individual blade it’s shared everything. The usual-suspect data distribution or partitioning strategies are available — hash, range, round-robin (sketched generically after this list).
- SAP HANA has what sounds like a natural disk-based persistence strategy — logs, snapshots, and so on. SAP says that this is synchronous enough to give ACID compliance. For some hardware partners, those “disks” are actually Fusion I/O cards.
- HANA is fault-tolerant “across servers”.
- Text support is “coming soon”, which makes sense, given that BI Accelerator was based on the TREX search engine in the first place. Inxight is also in the HANA text mix.
- You can put data into SAP HANA in a variety of obvious ways:
- Writing it directly.
- Trigger-based replication (perhaps from the DBMS that runs your SAP apps).
- Log-based replication (based on Sybase Replication Server).
- SAP Business Objects’ ETL tool.
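For the curious, here are generic sketches of the hash, range, and round-robin distribution strategies mentioned above; they illustrate the general techniques only, not HANA’s implementation.

```python
# Generic sketches of hash, range, and round-robin partitioning across a
# shared-nothing cluster. Illustrative only; not HANA's actual implementation.
import itertools

NUM_NODES = 4

def hash_partition(key):
    """Hash the key to pick a node; spreads arbitrary keys evenly."""
    return hash(key) % NUM_NODES

RANGE_BOUNDARIES = [1_000, 2_000, 3_000]        # assumed split points for 4 ranges

def range_partition(key):
    """Count how many boundaries the key exceeds; keeps adjacent keys together."""
    return sum(key >= b for b in RANGE_BOUNDARIES)

_next_row = itertools.count()

def round_robin_partition(_row):
    """Ignore the row's contents and deal rows out to nodes in turn."""
    return next(_next_row) % NUM_NODES
```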
SAP says that the row-store part is based both on P*Time, an acquisition from Korea some time ago, and also on SAP’s own MaxDB. The IBM white paper mentions only the MaxDB aspect. (Edit: Actually, see the comment thread below.) Based on a variety of clues, I conjecture that this was an aspect of SAP HANA development that did not go entirely smoothly.
Other SAP HANA components include: Read more
Applications of an analytic kind
The most straightforward approach to the applications business is:
- Take general-purpose technology and think through how to apply it to a specific application domain.
- Produce packaged application software accordingly.
However, this strategy is not as successful in analytics as in the transactional world, for two main reasons:
- Analytic applications of that kind are rarely complete.
- Incomplete applications rarely sell well.
I first realized all this about a decade ago, after Henry Morris coined the term analytic applications and business intelligence companies thought it was their future. In particular, when Dave Kellogg ran marketing for Business Objects, he rattled off an argument to the effect that Business Objects had generated more analytic app revenue over the lifetime of the company than Cognos had. I retorted, with only mild hyperbole, that the lifetime numbers he was citing amounted to “a bad week for SAP”. Somewhat hoist by his own petard, Dave quickly conceded that he agreed with my skepticism, and we changed the subject accordingly.
Reasons that analytic applications are commonly less complete than the transactional kind include: Read more
Sumo Logic and UIs for text-oriented data
I talked with the Sumo Logic folks for an hour Thursday. Highlights included:
- Sumo Logic does SaaS (Software as a Service) log management.
- Sumo Logic is text indexing/Lucene-based. Thus, it is reasonable to think of Sumo Logic as “Splunk-like”. (However, Sumo Logic seems to have a stricter security/trouble-shooting orientation than Splunk, which is trying to branch out.)
- Sumo Logic has hacked Lucene for faster indexing, and says 10-30 second latencies are typical.
- Sumo Logic’s main differentiation is automated classification of events. (A toy illustration of the general idea follows this list.)
- There’s some kind of streaming engine in the mix, to update counters and drive alerts.
- Sumo Logic has around 30 “customers,” free (mainly) or paying (around 5) as the case may be.
- A truly typical Sumo Logic customer has single to low double digits of gigabytes of log data per day. However, Sumo Logic seems highly confident in its ability to handle a terabyte per customer per day, give or take a factor of 2.
- When I asked about the implications of shipping that much data to a remote data center, Sumo Logic observed that log data compresses really well.
- Sumo Logic recently raised a bunch of venture capital.
- Sumo Logic’s founders are out of ArcSight, a log management company HP paid a bunch of money for.
- Sumo Logic coined a marketing term “LogReduce”, but it has nothing to do with “MapReduce”. Sumo Logic seems to find this amusing.
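For a sense of what automated grouping of log events can look like in general, here is a toy sketch of log-template clustering; it is emphatically not Sumo Logic’s actual LogReduce method.

```python
# Toy log-template grouping: collapse the variable parts of each line so that
# similar events share a key, then count the resulting templates.
# Generic illustration only; not Sumo Logic's LogReduce algorithm.
import re
from collections import Counter

logs = [
    "2012-03-26 10:00:01 user 1138 login failed from 10.1.2.3",
    "2012-03-26 10:00:05 user 2217 login failed from 10.4.5.6",
    "2012-03-26 10:00:09 disk /dev/sda1 at 91% capacity",
]

def template(line):
    line = re.sub(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}", "<timestamp>", line)
    line = re.sub(r"\d+(?:\.\d+)*", "<num>", line)   # numbers, including IP-like tokens
    return line

counts = Counter(template(line) for line in logs)
for tmpl, n in counts.most_common():
    print(n, tmpl)
# 2 <timestamp> user <num> login failed from <num>
# 1 <timestamp> disk /dev/sda<num> at <num>% capacity
```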
What interests me about Sumo Logic is that automated classification story. I thought I heard Sumo Logic say: Read more