Yarcdata and Cray

Yarcdata, a division of Cray specializing in graph analytics.

July 12, 2012

Approximate query results

In theory:

A database query is a predicate.
A DBMS matches the data it manages against the predicate and sends back those records for which the predicate is true.
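To make the "query is a predicate" view concrete, here is a minimal sketch in Python. The `run_query` helper and the sample `people` records are invented for illustration; a real DBMS of course adds indexing, optimization, and much else.

```python
from typing import Callable, Iterable

Record = dict

def run_query(records: Iterable[Record],
              predicate: Callable[[Record], bool]) -> list:
    """Return exactly those records for which the predicate is true."""
    return [r for r in records if predicate(r)]

# Hypothetical sample data, not from any real system.
people = [
    {"name": "Ann", "city": "Boston", "age": 41},
    {"name": "Bob", "city": "Austin", "age": 29},
    {"name": "Cy", "city": "Boston", "age": 35},
]

# Roughly: SELECT * FROM people WHERE city = 'Boston' AND age > 40
result = run_query(people, lambda r: r["city"] == "Boston" and r["age"] > 40)
```

Under this model every record is either in the result or not, which is why exactness looks like the default.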

And so it would seem that query results always have to be exact. Even so, there are at least four different practical scenarios in which query results can reasonably be regarded as approximate, each associated with query languages that can supersede standard set-theoretic SQL.

Actually, there’s a fifth, and it’s a huge one — some fraction of your data is just plain wrong. But that’s not what this post is about.

First, some queries don’t have binary results, even in principle. Notably, text queries are answered via relevancy rankings, which fit badly into the relational model.

Second — and this can be combined with the first — you might want to generalize the query to look for partial matches. For example, Yarcdata suggested to me a scenario in which:

You do a SPARQL query.
You modify the query to accept results higher up in the taxonomy. (Which is likely to be possible, because where there’s SPARQL, there’s apt to be a taxonomy as well.) For example, if you really want to query on two people living in the same house, you might extend the query to cover two people connected by any kind of address or building.
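The taxonomy step can be sketched in a few lines of Python. Everything here is an assumption for illustration: the relationship names, the `PARENT` map, and the `match` helper are invented, and a real deployment would express this in SPARQL against an actual ontology.

```python
# Hypothetical taxonomy: each relationship type maps to its broader parent.
PARENT = {
    "lives_in_same_house": "shares_address",
    "works_in_same_office": "shares_address",
    "shares_address": "connected_by_place",
}

def ancestors(rel):
    """Yield rel itself, then each broader type above it."""
    while rel is not None:
        yield rel
        rel = PARENT.get(rel)

def match(edges, wanted, levels_up=0):
    """Return edges whose type is `wanted` generalized `levels_up` levels,
    or any narrower type beneath that generalized type."""
    chain = list(ancestors(wanted))
    target = chain[min(levels_up, len(chain) - 1)]
    return [e for e in edges if target in ancestors(e[1])]

edges = [
    ("Ann", "lives_in_same_house", "Bob"),
    ("Ann", "works_in_same_office", "Cy"),
    ("Bob", "connected_by_place", "Dee"),
]

exact = match(edges, "lives_in_same_house")                 # strict query
broader = match(edges, "lives_in_same_house", levels_up=1)  # any shared address
```

Each step up the taxonomy can only add results, so the broader answers are best read as approximations to the strict one.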

Similarly, if you’re looking for geographic proximity, it’s common to extend the allowed radius to fish for more results. Or one can walk up the hierarchy in a dimensional model.

Third, sometimes you just don’t have the data for any kind of precise answer at all. One adaptation I’ve mentioned before is to interpolate time series with synthetic data, and send back “precise” results based on that. In the same post I mentioned the Vertica “range join”, wherein users deliberately throw away part of their data — only storing the range it was in — and then join accordingly.
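As a rough illustration of the time-series case, the sketch below fills a gap by linear interpolation. Linear interpolation is my assumption here, since the source only says "synthetic data", and the `interpolate` function and sample `readings` are invented.

```python
def interpolate(points, t):
    """Linearly interpolate a value at time t.

    `points` is a list of (time, value) pairs sorted by time,
    with distinct times.
    """
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        if t0 <= t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    raise ValueError("t is outside the observed range")

# Hypothetical readings: nothing was recorded at t=5 or t=20.
readings = [(0, 10.0), (10, 20.0), (30, 0.0)]
estimate_at_5 = interpolate(readings, 5)
estimate_at_20 = interpolate(readings, 20)
```

A query over the filled-in series returns a perfectly definite number, but it is only as good as the interpolation assumption behind it.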

As Donald Rumsfeld might have said — and would have done well to reflect upon — you go into decision-making with the data you have, not the data you wish you had.

Finally, sometimes there’s a precise answer in principle, but for performance reasons you accept an approximate one, at least to start with. Numerous companies have told me stories around this, including:

Infobright, whose “Rough Query” gives fast approximate results to a broad range of queries.
Metamarkets, which does fast cardinality estimates via HyperLogLog.
Aster Data, which was the first company to point out to me that median, decile, quintile, and so on calculations are a lot faster in a shared-nothing setting if you’re willing to settle for approximate results.
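To see why approximate medians are cheaper in a shared-nothing setting, consider the sketch below. It is a generic histogram-merge technique, not a description of how Aster Data or any other vendor does it; the function names, bucket count, and sample shards are assumptions. Each shard ships only a small array of bucket counts to the coordinator instead of its raw values.

```python
def local_histogram(values, lo, hi, buckets):
    """Count one shard's values into equal-width buckets over [lo, hi)."""
    counts = [0] * buckets
    width = (hi - lo) / buckets
    for v in values:
        i = min(int((v - lo) / width), buckets - 1)
        counts[i] += 1
    return counts

def approx_median(shards, lo, hi, buckets=10):
    """Merge per-shard histograms; return the midpoint of the bucket
    that contains the median. Error is at most one bucket width."""
    merged = [0] * buckets
    for shard in shards:
        for i, c in enumerate(local_histogram(shard, lo, hi, buckets)):
            merged[i] += c
    total = sum(merged)
    if total == 0:
        raise ValueError("no data")
    width = (hi - lo) / buckets
    seen = 0
    for i, c in enumerate(merged):
        seen += c
        if seen * 2 >= total:
            return lo + (i + 0.5) * width

# Hypothetical data spread over three nodes, all values in [0, 100).
shards = [[1, 5, 9], [12, 18, 25, 60], [31, 47, 50, 88]]
median_estimate = approx_median(shards, 0, 100)
```

An exact median would instead require moving or repeatedly scanning the data across nodes; the histogram version trades that for a bounded error set by the bucket width.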

The latter two categories led me to ask vendors how customers actually make use of their exotic SQL capabilities. Answers boiled down to:

(Always) Well, there’s a lot of custom coding.
(Sometimes) We’re working with partner BI vendors to make direct use of the capabilities, but that’s not done yet, so it’s too early to talk about any details.

Perhaps the answers will never get much better; it’s tough to get packaged software vendors to support vendor-specific SQL, unless the vendor is Oracle. Even so, we’re seeing ever more ways in which conventional SQL DBMS are being superseded by data management and analytic alternatives.

Categories: Aster Data, Business intelligence, Data models and architecture, Data warehousing, Database compression, Infobright, Text, Vertica Systems, Yarcdata and Cray

3 Comments

July 2, 2012

Introduction to Yarcdata

Cray’s strategy these days seems to be:

Move forward with the classic supercomputer business.
Diversify into related areas.

At the moment, the main diversifications are:

Boxes that are like supercomputers, but at a lower price point.
Storage.
“(Big) data”.

The last of the three is what Cray subsidiary Yarcdata is all about. Read more

Categories: Data models and architecture, Health care, In-memory DBMS, Investment research and trading, Market share and customer counts, Parallelization, Petabyte-scale data management, RDF and graphs, Yarcdata and Cray

1 Comment

July 2, 2012

Catching up with Cray

Cray is a legendary name in supercomputing hardware. Cray CTO Bill Blake (Netezza’s early-rise VP Development) seems to be there in part because of Cray’s name and history. I’m now consulting to Cray, specifically to Cray subsidiary Yarcdata, largely because of Bill Blake. Along the way, I’ve picked up enough about Cray in general, largely from Bill and from Cray president Pete Ungaro, to perhaps be worth splitting out as a separate post.

Cray business highlights include:

After a meandering and financially disappointing journey, Cray is again a stand-alone public company.
Cray is a computer systems company.
Cray makes a large fraction of its revenue from selling and supporting a small number of supercomputers, largely to scientific, technical, and government customers.
Even so, Cray sells systems at a broad range of price points. Storage products are in the mix as well.

I haven’t sorted through all the details in Cray’s SEC filings, but huge government contracts play a big role, as do the associated revenue recognition delays.

At the highest level, Cray’s technical story looks like: Read more

Categories: Intel, Market share and customer counts, Parallelization, Yarcdata and Cray

1 Comment

May 13, 2012

Notes on the analysis of large graphs

This post is part of a series on managing and analyzing graph data.

My series on graph data management and analytics got knocked off-stride by our website difficulties. Still, I want to return to one interesting set of issues — analyzing large graphs, specifically ones that don’t fit comfortably into RAM on a single server. By no means do I have the subject figured out. But here are a few notes on the matter.

How big can a graph be? That of course depends on:

The number of nodes. If the nodes of a graph are people, there’s an obvious upper bound on the node count. Even if you include their houses, cars, and so on, you’re probably capped in the range of 10 billion.
The number of edges. (Even more important than the number of nodes.) If every phone call, email, or text message in the world is an edge, that’s a lot of edges.
The typical size of a (node, edge, node) triple. I don’t know why you’d have to go much over 100 bytes post-compression*, but maybe I’m overlooking something.

*Even if your graph has 10 billion nodes, those can be tokenized in 34 bits, so the main concern is edges. Edges can include weights, timestamps, and so on, but how many specifics do you really need? At some point you can surely rely on a pointer to full detail stored elsewhere.
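The 34-bit figure is easy to check. The short Python sketch below (the `bits_needed` helper is just an illustrative name) confirms it, and also shows how little of a 100-byte triple the two node tokens would occupy.

```python
def bits_needed(distinct_ids: int) -> int:
    """Smallest number of bits giving each of distinct_ids items a unique token."""
    return max(1, (distinct_ids - 1).bit_length())

node_bits = bits_needed(10_000_000_000)   # 10 billion nodes

# Two node tokens per (node, edge, node) triple, rounded up to whole bytes.
endpoint_bytes = (2 * node_bits + 7) // 8
```

Since 2^33 is about 8.6 billion and 2^34 is about 17.2 billion, 34 bits is the smallest width that covers 10 billion nodes. The two endpoints then take roughly 9 bytes, leaving most of a 100-byte budget for edge attributes.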

The biggest graph-size estimates I’ve gotten are from my clients at Yarcdata, a division of Cray. (“Yarc” is “Cray” spelled backwards.) To my surprise, they suggested that graphs about people could have 1000s of edges per node, whether in:

An intelligence scenario, perhaps with billions of nodes and hence trillions of edges.
A telecom user-analysis case, with perhaps 100 million nodes and hence 100s of billions of edges.
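For a sense of scale, here is the back-of-envelope arithmetic behind those two scenarios. The inputs are assumptions taken from the low end of the figures above: 1,000 edges per node, 1 billion nodes for the intelligence case, and the 100-bytes-per-triple ceiling mentioned earlier, with one triple per edge.

```python
BYTES_PER_TRIPLE = 100    # assumed upper-end size, post-compression
EDGES_PER_NODE = 1_000    # assumed low end of "1000s of edges per node"

def estimate(nodes: int):
    """Return (edge count, total bytes) for a graph with `nodes` nodes."""
    edges = nodes * EDGES_PER_NODE
    return edges, edges * BYTES_PER_TRIPLE

# Intelligence scenario, taking 1 billion nodes as the low end of "billions".
intel_edges, intel_bytes = estimate(1_000_000_000)

# Telecom scenario with 100 million nodes.
telecom_edges, telecom_bytes = estimate(100_000_000)

TB = 10**12
intel_tb = intel_bytes // TB
telecom_tb = telecom_bytes // TB
```

Under these assumptions the intelligence graph comes to about a trillion edges and 100 terabytes, and the telecom graph to 100 billion edges and 10 terabytes, both well beyond the RAM of a single conventional server.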

Yarcdata further suggested that bioinformatics use cases could have node counts higher yet, characterizing Bio2RDF as one of the “smaller” ones at 22 billion nodes. In these cases, the edges/node average seems lower than in people-analysis graphs, but we’re still talking about 100s of billions of edges.

Recalling that relationship analytics boils down to finding paths and subgraphs, the naive relational approach to such tasks would be: Read more

Categories: Analytic technologies, Aster Data, Data models and architecture, Hadoop, Health care, MapReduce, RDF and graphs, Scientific research, Telecommunications, Yarcdata and Cray

20 Comments

May 7, 2012

Terminology: Relationship analytics

This post is part of a series on managing and analyzing graph data. Posts to date include:

Graph data model basics
Relationship analytics definition (this post)
Relationship analytics applications
Analysis of large graphs

In late 2005, I encountered a company called Cogito that was using a graphical data manager to analyze relationships. They called this “relational analytics”, which I thought was a terrible name for something that they were trying to claim should NOT be done in a relational DBMS. On the spot, I coined relationship analytics as an alternative. A business relationship ensued, which included a short white paper. Cogito didn’t do so well, however, and for a while the term “relationship analytics” faltered too. But recently it’s made a bit of a comeback, having been adopted by Objectivity, QlikTech, Yarcdata and others.

“Relationship analytics” is not a perfect name, both because it’s longish and because it might over-connote a social-network focus. But then, no other term would be perfect either. So we might as well stick with it.

In that case, “relationship analytics” could use an actual definition, preferably one a little heftier than just:

Analytics on graphs.

Categories: Cogito and 7 Degrees, Objectivity and Infinite Graph, QlikTech and QlikView, RDF and graphs, Yarcdata and Cray

7 Comments

March 31, 2012

Our clients, and where they are located

From time to time, I disclose our vendor client lists. Another iteration is below, the first in a little over a year. To be clear:

This is a list of Monash Advantage members.
All our vendor clients are Monash Advantage members, unless …
… we work with them primarily in their capacity as technology users. (A large fraction of our user clients happen to be SaaS vendors.)
We do not usually disclose our user clients.
We do not usually disclose our venture capital clients, nor those who invest in publicly-traded securities.
Excluded from this round of disclosure is one vendor I have never written about.
Included in this round of disclosure is one client paying for services partly in stock. All our other clients are cash-only.

For reasons explained below, I’ll group the clients geographically. Obviously, companies often have multiple locations, but this is approximately how it works from the standpoint of their interactions with me. Read more

Categories: About this blog, Akiban, ClearStory Data, Couchbase, DataStax, dbShards and CodeFutures, Hadapt, Hortonworks, HP and Neoview, IBM and DB2, Infobright, KXEN, MarkLogic, MongoDB, Netezza, PivotLink, SAND Technology, Schooner Information Technology, solidDB, StreamBase, Syncsort, Tableau Software, Teradata, Vertica Systems, WibiData, Yarcdata and Cray

3 Comments
