Notes on the analysis of large graphs
This post is part of a series on managing and analyzing graph data. Posts to date include:
- Graph data model basics
- Relationship analytics definition
- Relationship analytics applications
- Analysis of large graphs (this post)
My series on graph data management and analytics got knocked off-stride by our website difficulties. Still, I want to return to one interesting set of issues — analyzing large graphs, specifically ones that don’t fit comfortably into RAM on a single server. By no means do I have the subject figured out. But here are a few notes on the matter.
How big can a graph be? That of course depends on:
- The number of nodes. If the nodes of a graph are people, there’s an obvious upper bound on the node count. Even if you include their houses, cars, and so on, you’re probably capped in the range of 10 billion.
- The number of edges. (Even more important than the number of nodes.) If every phone call, email, or text message in the world is an edge, that’s a lot of edges.
- The typical size of a (node, edge, node) triple. I don’t know why you’d have to go much over 100 bytes post-compression*, but maybe I’m overlooking something.
*Even if your graph has 10 billion nodes, those can be tokenized in 34 bits, so the main concern is edges. Edges can include weights, timestamps, and so on, but how many specifics do you really need? At some point you can surely rely on a pointer to full detail stored elsewhere.
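To make that arithmetic concrete, here is a minimal Python sketch of the kind of tokenization and packing I have in mind. The field widths (40-bit node tokens, a 2-byte edge-type code, a float weight, a 4-byte timestamp) are illustrative assumptions, not anybody's actual storage format.

```python
import struct

# Hypothetical compact edge record: two 5-byte node tokens (40 bits each is
# plenty for ~10 billion nodes), a 2-byte edge-type code, a 4-byte weight,
# and a 4-byte timestamp: 20 bytes per (node, edge, node) triple in total,
# well under the ~100-byte budget discussed above.
node_tokens = {}  # external node identifier -> small integer token

def tokenize(node_id):
    """Assign the next small integer to any node ID we haven't seen before."""
    return node_tokens.setdefault(node_id, len(node_tokens))

def pack_edge(src_id, dst_id, edge_type, weight, timestamp):
    src, dst = tokenize(src_id), tokenize(dst_id)
    # '<Q' packs 8 little-endian bytes; keeping the first 5 keeps the low 40 bits.
    return (struct.pack("<Q", src)[:5] + struct.pack("<Q", dst)[:5]
            + struct.pack("<HfI", edge_type, weight, timestamp))

record = pack_edge("alice@example.com", "bob@example.com", 1, 0.8, 1338508800)
print(len(record))  # 20
```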
The biggest graph-size estimates I’ve gotten are from my clients at Yarcdata, a division of Cray. (“Yarc” is “Cray” spelled backwards.) To my surprise, they suggested that graphs about people could have 1000s of edges per node, whether in:
- An intelligence scenario, perhaps with billions of nodes and hence trillions of edges.
- A telecom user-analysis case, with perhaps 100 million nodes and hence 100s of billions of edges.
Yarcdata further suggested that bioinformatics use cases could have node counts higher yet, characterizing Bio2RDF as one of the “smaller” ones at 22 billion nodes. In these cases, the edges/node average seems lower than in people-analysis graphs, but we’re still talking about 100s of billions of edges.
Recalling that relationship analytics boils down to finding paths and subgraphs, the naive relational approach to such tasks would be:
- Store a table with one row per edge.
- Self-join that table n-1 times (an n-way join overall), where n is the number of edges in the path or subgraph you’re looking for.
In many cases the cardinality of intermediate result sets would be high, and you’d basically be doing a series of full table scans. Those could take a while.
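For concreteness, here is a toy Python stand-in for that plan; a list of tuples plays the role of the one-row-per-edge table, and each pass of the loop is one more self-join. The data is made up purely for illustration.

```python
# Toy stand-in for the naive relational plan. Note that every pass scans the
# whole "table" and materializes an intermediate result set, which is exactly
# where the pain comes from at scale.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("b", "d")]  # made-up data

def paths_with_n_edges(edge_table, n):
    paths = [[src, dst] for (src, dst) in edge_table]   # paths of one edge
    for _ in range(n - 1):                              # n-1 self-joins
        paths = [path + [dst]
                 for path in paths
                 for (src, dst) in edge_table
                 if src == path[-1]]
    return paths

print(paths_with_n_edges(edges, 2))  # [['a', 'b', 'c'], ['a', 'b', 'd'], ['b', 'c', 'd']]
print(paths_with_n_edges(edges, 3))  # [['a', 'b', 'c', 'd']]
```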
There are various approaches to dealing with this challenge. For example:
- Graph analysis has been around long enough that much of it has surely been done relationally.
- I wrote about some specific relational strategies for graph analysis five years ago.
- A lot of graph analysis these days is being done in Hadoop (or other MapReduce, notably Aster Data’s).
- Objectivity’s InfiniteGraph and Google’s Pregel emphasize pre-fetching (or pre-shipping) edges that might soon be needed. (A toy sketch of Pregel’s vertex-centric model follows this list.)
- Yarcdata, with its Cray genes, tries to optimize hardware (single RAM image across a cluster, with a whole lot of multithreading) for in-memory Apache Jena performance. Unfortunately, I’m not clear as to which data structure(s) Jena uses.
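For readers who haven’t seen the vertex-centric model, here is a toy, single-process Python sketch of the superstep-and-messages style that Pregel popularized. It illustrates only the programming model, not the distribution, partitioning, or edge pre-shipping that the real systems do; the graph is made up.

```python
# Each iteration of the while loop is one superstep; vertices that receive
# messages do local work and send messages along their out-edges for the
# next superstep.
adjacency = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}

def bfs_distances(graph, source):
    distance = {vertex: None for vertex in graph}
    messages = {source: 0}                       # superstep 0: seed the source
    while messages:                              # stop when no vertex is active
        next_messages = {}
        for vertex, proposed in messages.items():
            if distance[vertex] is None:         # first distance we hear wins
                distance[vertex] = proposed
                for neighbor in graph[vertex]:   # message the out-neighbors
                    next_messages.setdefault(neighbor, proposed + 1)
        messages = next_messages                 # barrier between supersteps
    return distance

print(bfs_distances(adjacency, "a"))  # {'a': 0, 'b': 1, 'c': 1, 'd': 2}
```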
When trying to figure out which of these techniques is likely to win in the most demanding cases, I run into the key controversy around analytic graph data management — how successfully can graphs be partitioned? Opinions vary widely, with the correct answers in each case surely depending on:
- The topology of the graph.
- The size of the graph.
- The length of the paths that need to be examined.
But in the interest of getting this posted tonight, I’ll leave further discussion of graph partitioning to another time.
Comments
Jena is using RDF triples.
Thanks, Paul.
But organized/accessed how? B-trees? Something else?
Curt, what about direction as an edge attribute, along with weight, timestamp, etc.? Not all edge-connections are bi-directional. Inter-node relationships may be asymmetric as well as time-varying. I’d think capturing directionality would be important for any but a basic or domain-specific system.
Seth
Seth,
Of course direction is crucial, and I had it in there in a previous post. But it’s not an important factor affecting the size of a graph.
Typically, a graph is either directed or undirected/bidirectional. In the first case, direction is built into how the information for each edge is stored. In the second, it doesn’t have to be encoded explicitly either.
Even if direction DID have to be encoded explicitly, it would take up at most 2 bits.
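A tiny illustration of that point, assuming (purely hypothetically) a one-byte flags field in each edge record:

```python
# Direction fits in the low two bits of a per-edge flags byte (the flag
# values here are made up for the example).
FORWARD, BACKWARD, BOTH = 0b01, 0b10, 0b11

def set_direction(flags, direction):
    return (flags & ~0b11) | direction   # overwrite just the low two bits

def get_direction(flags):
    return flags & 0b11

print(get_direction(set_direction(0, BOTH)) == BOTH)  # True
```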
Size of a node: it can be in the megs or gigs. Typically there are a lot of attribute values stored at this level: name, thumbnail pictures, etc. One way to think about it is that when the webapp is fetching a node, saying “give me UnholyGuy”, you want all the basic, commonly used info about UnholyGuy stored in that particular node.
Another way to think of this is that the key architectural rationale for a graph database and graph architecture is being able to read/write a node and its associated hot information in constant time, not log(N) time.
If you stick to vanilla relational, and shard/index and are perfectly hashed, you are still fetching in log(N/X) time, where X is the number of shards. Hence the interest in non-relational backends. These backends are generally key/value stores.
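A minimal sketch of the access-pattern contrast described above, using ordinary in-process Python structures as stand-ins for a key/value store and a B-tree-style index; the node record is made up.

```python
import bisect

# A hash-style key/value store fetches a node plus its "hot" info in roughly
# constant time, while a sorted / B-tree-style structure pays a log(N)
# search per lookup.
kv_store = {
    "UnholyGuy": {"thumbnail": "unholyguy.jpg", "friends": ["a", "b", "c"]},
}

def fetch_hash(node_id):
    return kv_store[node_id]                 # ~O(1): one hash probe

sorted_keys = sorted(kv_store)               # stands in for a B-tree index
def fetch_btree_like(node_id):
    i = bisect.bisect_left(sorted_keys, node_id)   # O(log N) binary search
    return kv_store[sorted_keys[i]]

print(fetch_hash("UnholyGuy")["friends"])    # ['a', 'b', 'c']
print(fetch_btree_like("UnholyGuy") == fetch_hash("UnholyGuy"))  # True
```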
Unholy guy,
We’re mixing together a couple of things here. You can store anything you want at a node, and that’s great. But if you want to traverse the graph efficiently, which is what I’m talking about, it seems compression/tokenization/pointers would be in order.
As for the constant vs. log(N) time meme — that may be OK for pinpoint lookups, but I don’t see how it’s very relevant to analytics.
Your “the key reason …” is for me just one benefit in one set of use cases, and indeed not the use cases I’m most focused on at the moment.
It’s important to understand that graph algorithms can be written and run perfectly well in an RDBMS using a perfectly normal data model, without any special whizz-bangness. I wrote one myself in my first year of undergraduate CS; it was called the “bill of materials” problem (sketched below). Any parent/child table is a graph. It’s old hat, except when you start talking about large graphs and doing odd things with them. Anything in your earlier posts can be accomplished in PL/SQL with no issues at all, assuming the volume is small enough.
So any question around “graph analysis” really has to become two questions: one about what kinds of questions are being asked, and another about algorithms and their efficient implementation (oddball data structures, high volume, say hundreds of millions of customers and up).
The reason we are suddenly seeing an uptick in “graph analysis” is not some magical analytical ah-ha; it is because we have become technically capable of storing and accessing LARGE graphs of customers and interactions. It is not so much the analytics that was missing before as the platform. The algorithms that we have always been aware of are now becoming runnable.
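For concreteness, here is a toy Python version of that “bill of materials” traversal over a plain parent/child table. In an RDBMS this would be a recursive query or a PL/SQL loop; the data here is made up.

```python
# Start from one part and repeatedly join the current frontier back against
# the parent/child table until nothing new turns up.
parent_child = [("bike", "wheel"), ("bike", "frame"),
                ("wheel", "spoke"), ("wheel", "rim")]

def all_descendants(table, root):
    found, frontier = set(), {root}
    while frontier:                          # each pass = one more join level
        frontier = {child for (parent, child) in table
                    if parent in frontier and child not in found}
        found |= frontier
    return found

print(sorted(all_descendants(parent_child, "bike")))
# ['frame', 'rim', 'spoke', 'wheel']
```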
Understanding the O(C) architecture of a graph database or key/value store (which is different from a graph algorithm) helps you understand which analytical questions actually benefit from being run on something weird, as opposed to a parent/child table, which people have been analyzing pretty much forever.
Most of the algorithms I have seen are ones that, as you rightly point out, rely heavily on analyzing the structure of large graphs and subgraphs holistically. IN THEORY they benefit a lot from Bulk Synchronous Parallel-friendly platforms, which, again in theory, are expensive and hard to implement inside an RDBMS.
However, IN PRACTICE, except maybe for Pregel, most people are better off just using an RDBMS to analyze even large graphs, since the specialized graph frameworks are so immature that the theoretical performance benefit generally isn’t realized.
Also, as soon as you get the “structure of the graph” problem somewhat solved you IMMEDIATELY get driven into questions involving attributes of the entities in question. For the churn example, you will start seeing questions about LTV, temporal relationships, etc etc.
I’m not sure “traversing the graph” has a lot of analytical value without performing analysis on the attributes of nodes.
In general, you are going to see heavy nodes, not light ones. While you can certainly store a pointer to a richer data structure elsewhere, it does not really change anything; it’s all about “what do you have to retrieve in each super step / join to make it to the next super step”, and that is generally a richer, heavier set of data than just the pointers.
It’s an important distinction, because the heavier the data, the more you benefit from locality, which means the more sense specialized graph DBs start to make….
just my two cents
Interesting take. Thanks.
I tend to agree that the “This is too hard to do in SQL; it’s much easier to do in a graph language” story is a bit overblown. More precisely, the layer one would have to put over SQL to negate much of the story is pretty thin.
On the other hand — why use SQL for something that it isn’t well suited for? Especially if it’s something that for performance reasons you want to do in-memory, negating many of the benefits of your favorite SQL DBMS?
Regarding the heaviness of the nodes — again, that depends on whether their size really interferes with performance. If it does, one looks to workarounds. Also, as you know, I’m a big fan of the idea that one often derives data and stores it for further analysis. So if the graph is derived rather than raw, I’m fine with that.
Anyone who thinks it is harder to write an SQL cursor than, say, a Giraph job is welcome to try (and I love Giraph, mind you) (-:
“why use SQL for something that it isn’t well suited for? ”
In theory you should not. In practice it is because the graph platforms are immature, slow and unstable compared to RDBMS.
Also, the log(N) properties of SQL are not inherent in SQL-as-a-language; they come from RDBMS-as-an-implementation, especially B-tree indexes.
State-of-the-art web and social graph compression packs trillion-edge graphs down into a few bits per edge (which is the usual measure). See http://webgraph.dsi.unimi.it/ as a starting point for a lot of interesting information.
Web graph compression research is looking for very high compression factors while maintaining a suitable set of navigational operators.
The stats at http://law.dsi.unimi.it/datasets.php are also interesting — note the compression achieved on the Facebook graphs.
The takeaway is that with modern graph compression and navigation techniques, it’s quite feasible to study very large graphs on a single well-equipped machine.
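For readers curious what such compression looks like, here is a stripped-down Python sketch of two of the basic tricks (sorted adjacency lists, gap encoding, variable-length bytes). WebGraph itself goes much further, with reference compression, interval codes, and so on; the neighbor IDs are made up.

```python
# Sort each adjacency list, store the gaps between successive neighbor IDs,
# and encode each gap as a variable-length run of bytes.
def varint(n):
    out = bytearray()
    while True:
        low, n = n & 0x7F, n >> 7
        out.append(low | (0x80 if n else 0))  # high bit set = more bytes follow
        if not n:
            return bytes(out)

def compress_adjacency(neighbors):
    neighbors = sorted(neighbors)
    gaps = [neighbors[0]] + [b - a for a, b in zip(neighbors, neighbors[1:])]
    return b"".join(varint(gap) for gap in gaps)

# Locality in node numbering keeps the gaps small, so most edges cost one byte.
print(len(compress_adjacency([1000000, 1000003, 1000010, 1000042])))  # 6
```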
There is a special computer science research area for this: external-memory algorithms and data structures. For graph problems, you can search for publications by Ulrich Meyer.
By the way, I doubt that an RDBMS is the right answer.
The funny thing about graph stores in the SPARQL sense is that there is a large variety of backend algorithms and solutions.
Yarcdata, with their Cray hardware, takes an all-in-memory approach that is bandwidth-optimized. They only use Apache Jena for their query parsing, and then hand it over to their graph database code, written in a variant of C++ for their XMT architecture.
Ontotext, with their OWLIM store, takes a completely different approach to storing their graph data, which comes down to 36 bytes per node-edge-node triple (including indexes) for the UniProt data.
Yet using the same queries and dataset you can switch between Yarcdata and Ontotext in hours (lovely for competitive tendering).
There are a lot of other approaches as well!
Graph analysis using an RDBMS is possible, but even there I would take DB2’s or Oracle’s optimized solutions instead of rolling it myself by hand!
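To illustrate the portability point above: the same SPARQL query can simply be pointed at different endpoints. A minimal sketch using the SPARQLWrapper library; the endpoint URLs are placeholders, not real Yarcdata or Ontotext addresses.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# The same SPARQL query, sent to two different store endpoints.
QUERY = "SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }"

def count_triples(endpoint_url):
    client = SPARQLWrapper(endpoint_url)
    client.setQuery(QUERY)
    client.setReturnFormat(JSON)
    bindings = client.query().convert()["results"]["bindings"]
    return bindings[0]["triples"]["value"]

for endpoint in ("http://store-a.example/sparql", "http://store-b.example/sparql"):
    print(endpoint, count_triples(endpoint))
```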