Notes on the analysis of large graphs
This post is part of a series on managing and analyzing graph data. Posts to date include:
- Graph data model basics
- Relationship analytics definition
- Relationship analytics applications
- Analysis of large graphs (this post)
My series on graph data management and analytics got knocked off-stride by our website difficulties. Still, I want to return to one interesting set of issues — analyzing large graphs, specifically ones that don’t fit comfortably into RAM on a single server. By no means do I have the subject figured out. But here are a few notes on the matter.
How big can a graph be? That of course depends on:
- The number of nodes. If the nodes of a graph are people, there’s an obvious upper bound on the node count. Even if you include their houses, cars, and so on, you’re probably capped in the range of 10 billion.
- The number of edges. (Even more important than the number of nodes.) If every phone call, email, or text message in the world is an edge, that’s a lot of edges.
- The typical size of a (node, edge, node) triple. I don’t know why you’d have to go much over 100 bytes post-compression*, but maybe I’m overlooking something.
*Even if your graph has 10 billion nodes, those can be tokenized in 34 bits, so the main concern is edges. Edges can include weights, timestamps, and so on, but how many specifics do you really need? At some point you can surely rely on a pointer to full detail stored elsewhere.
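To make that arithmetic concrete, here is a minimal Python sketch of the kind of tokenization and packing I have in mind. The field widths (40-bit node tokens, a 2-byte edge-type code, a float weight, a 4-byte timestamp) are illustrative assumptions, not anybody's actual storage format.

```python
import struct

# Hypothetical compact edge record: two 5-byte node tokens (40 bits each is
# plenty for ~10 billion nodes), a 2-byte edge-type code, a 4-byte weight,
# and a 4-byte timestamp: 20 bytes per (node, edge, node) triple in total,
# well under the ~100-byte budget discussed above.
node_tokens = {}  # external node identifier -> small integer token

def tokenize(node_id):
    """Assign the next small integer to any node ID we haven't seen before."""
    return node_tokens.setdefault(node_id, len(node_tokens))

def pack_edge(src_id, dst_id, edge_type, weight, timestamp):
    src, dst = tokenize(src_id), tokenize(dst_id)
    # '<Q' packs 8 little-endian bytes; keeping the first 5 keeps the low 40 bits.
    return (struct.pack("<Q", src)[:5] + struct.pack("<Q", dst)[:5]
            + struct.pack("<HfI", edge_type, weight, timestamp))

record = pack_edge("alice@example.com", "bob@example.com", 1, 0.8, 1338508800)
print(len(record))  # 20
```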
The biggest graph-size estimates I’ve gotten are from my clients at Yarcdata, a division of Cray. (“Yarc” is “Cray” spelled backwards.) To my surprise, they suggested that graphs about people could have 1000s of edges per node, whether in:
- An intelligence scenario, perhaps with billions of nodes and hence trillions of edges.
- A telecom user-analysis case, with perhaps 100 million nodes and hence 100s of billions of edges.
Yarcdata further suggested that bioinformatics use cases could have node counts higher yet, characterizing Bio2RDF as one of the “smaller” ones at 22 billion nodes. In these cases, the edges/node average seems lower than in people-analysis graphs, but we’re still talking about 100s of billions of edges.
Recalling that relationship analytics boils down to finding paths and subgraphs, the naive relational approach to such tasks would be:
- Store a table with one row per edge.
- Self-join that table n-1 times (an n-way join overall), where n is the number of edges in the path or subgraph you’re looking for.
In many cases the cardinality of intermediate result sets would be high, and you’d basically be doing a series of full table scans. Those could take a while.
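For concreteness, here is a toy Python stand-in for that plan; a list of tuples plays the role of the one-row-per-edge table, and each pass of the loop is one more self-join. The data is made up purely for illustration.

```python
# Toy stand-in for the naive relational plan. Note that every pass scans the
# whole "table" and materializes an intermediate result set, which is exactly
# where the pain comes from at scale.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("b", "d")]  # made-up data

def paths_with_n_edges(edge_table, n):
    paths = [[src, dst] for (src, dst) in edge_table]   # paths of one edge
    for _ in range(n - 1):                              # n-1 self-joins
        paths = [path + [dst]
                 for path in paths
                 for (src, dst) in edge_table
                 if src == path[-1]]
    return paths

print(paths_with_n_edges(edges, 2))  # [['a', 'b', 'c'], ['a', 'b', 'd'], ['b', 'c', 'd']]
print(paths_with_n_edges(edges, 3))  # [['a', 'b', 'c', 'd']]
```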
There are various approaches to dealing with this challenge. For example:
- Graph analysis has been around long enough that much of it has surely been done relationally.
- I wrote about some specific relational strategies for graph analysis five years ago.
- A lot of graph analysis these days is being done in Hadoop (or other MapReduce, notably Aster Data’s).
- Objectivity’s InfiniteGraph and Google’s Pregel emphasize pre-fetching (or pre-shipping) edges that might soon be needed. (A toy sketch of Pregel’s vertex-centric model follows this list.)
- Yarcdata, with its Cray genes, tries to optimize hardware (single RAM image across a cluster, with a whole lot of multithreading) for in-memory Apache Jena performance. Unfortunately, I’m not clear as to which data structure(s) Jena uses.
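For readers who haven’t seen the vertex-centric model, here is a toy, single-process Python sketch of the superstep-and-messages style that Pregel popularized. It illustrates only the programming model, not the distribution, partitioning, or edge pre-shipping that the real systems do; the graph is made up.

```python
# Each iteration of the while loop is one superstep; vertices that receive
# messages do local work and send messages along their out-edges for the
# next superstep.
adjacency = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}

def bfs_distances(graph, source):
    distance = {vertex: None for vertex in graph}
    messages = {source: 0}                       # superstep 0: seed the source
    while messages:                              # stop when no vertex is active
        next_messages = {}
        for vertex, proposed in messages.items():
            if distance[vertex] is None:         # first distance we hear wins
                distance[vertex] = proposed
                for neighbor in graph[vertex]:   # message the out-neighbors
                    next_messages.setdefault(neighbor, proposed + 1)
        messages = next_messages                 # barrier between supersteps
    return distance

print(bfs_distances(adjacency, "a"))  # {'a': 0, 'b': 1, 'c': 1, 'd': 2}
```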
When trying to figure out which of these techniques is likely to win in the most demanding cases, I run into the key controversy around analytic graph data management — how successfully can graphs be partitioned? Opinions vary widely, with the correct answers in each case surely depending on:
- The topology of the graph.
- The size of the graph.
- The length of the paths that need to be examined.
But in the interest of getting this posted tonight, I’ll leave further discussion of graph partitioning to another time.
Comments
Jena is using RDF triples.
Thanks, Paul.
But organized/accessed how? B-trees? Something else?
Curt, what about direction as an edge attribute, along with weight, timestamp, etc.? Not all edge-connections are bi-directional. Inter-node relationships may be asymmetric as well as time-varying. I’d think capturing directionality would be important for any but a basic or domain-specific system.
Seth
Seth,
Of course direction is crucial, and I had it in there in a previous post. But it’s not an important factor affecting the size of a graph.
Typically, a graph is either directed or undirected/bidirectional. In the first case, direction is built into how the information for each edge is stored. In the second, it doesn’t have to be encoded explicitly either.
Even if direction DID have to be encoded explicitly, it would take up at most 2 bits.
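A tiny illustration of that point, assuming (purely hypothetically) a one-byte flags field in each edge record:

```python
# Direction fits in the low two bits of a per-edge flags byte (the flag
# values here are made up for the example).
FORWARD, BACKWARD, BOTH = 0b01, 0b10, 0b11

def set_direction(flags, direction):
    return (flags & ~0b11) | direction   # overwrite just the low two bits

def get_direction(flags):
    return flags & 0b11

print(get_direction(set_direction(0, BOTH)) == BOTH)  # True
```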
Size of a node: it can be in the megs or gigs. Typically there are a lot of attribute values stored at this level: name, thumbnail pictures, etc. One way to think about it is that when the webapp is fetching a node, saying “give me UnholyGuy”, you want all the basic, commonly used info about UnholyGuy stored in that particular node.
Another way to think of this is that the key architectural rationale for a graph database and graph architecture is being able to read/write a node and its associated hot information in constant time, not log(N) time.
If you stick to vanilla relational, and shard/index and are perfectly hashed, you are still fetching in log(N/X) time, where X is the number of shards. Hence the interest in non-relational backends. These backends are generally key/value stores.
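A minimal sketch of the access-pattern contrast described above, using ordinary in-process Python structures as stand-ins for a key/value store and a B-tree-style index; the node record is made up.

```python
import bisect

# A hash-style key/value store fetches a node plus its "hot" info in roughly
# constant time, while a sorted / B-tree-style structure pays a log(N)
# search per lookup.
kv_store = {
    "UnholyGuy": {"thumbnail": "unholyguy.jpg", "friends": ["a", "b", "c"]},
}

def fetch_hash(node_id):
    return kv_store[node_id]                 # ~O(1): one hash probe

sorted_keys = sorted(kv_store)               # stands in for a B-tree index
def fetch_btree_like(node_id):
    i = bisect.bisect_left(sorted_keys, node_id)   # O(log N) binary search
    return kv_store[sorted_keys[i]]

print(fetch_hash("UnholyGuy")["friends"])    # ['a', 'b', 'c']
print(fetch_btree_like("UnholyGuy") == fetch_hash("UnholyGuy"))  # True
```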
Unholy guy,
We’re mixing together a couple of things here. You can store anything you want at a node, and that’s great. But if you want to traverse the graph efficiently, which is what I’m talking about, it seems compression/tokenization/pointers would be in order.
As for the constant vs. log(N) time meme — that may be OK for pinpoint lookups, but I don’t see how it’s very relevant to analytics.
Your “the key reason …” is for me just one benefit in one set of use cases, and indeed not the use cases I’m most focused on at the moment.
It’s important to understand that graph algorithms can be written and run perfectly well in an RDBMS using a perfectly normal data model, without any special whizz-bangness. I wrote one myself in my first year of undergraduate CS; it was called the “bill of materials” problem (sketched below). Any parent/child table is a graph. It’s old hat, except when you start talking about large graphs and doing odd things with them. Anything in your earlier posts can be accomplished in PL/SQL with no issues at all, assuming the volume is small enough.
So any question around “graph analysis” really has to become two questions: one about what kinds of questions are being asked, and another about algorithms and their efficient implementation (oddball data structures, high volume, say hundreds of millions of customers and up).
The reason we are suddenly seeing an uptick in “graph analysis” is not some magical analytical ah-ha; it is because we have become technically capable of storing and accessing LARGE graphs of customers and interactions. It is not so much the analytics that was missing before as the platform. The algorithms that we have always been aware of are now becoming runnable.
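For concreteness, here is a toy Python version of that “bill of materials” traversal over a plain parent/child table. In an RDBMS this would be a recursive query or a PL/SQL loop; the data here is made up.

```python
# Start from one part and repeatedly join the current frontier back against
# the parent/child table until nothing new turns up.
parent_child = [("bike", "wheel"), ("bike", "frame"),
                ("wheel", "spoke"), ("wheel", "rim")]

def all_descendants(table, root):
    found, frontier = set(), {root}
    while frontier:                          # each pass = one more join level
        frontier = {child for (parent, child) in table
                    if parent in frontier and child not in found}
        found |= frontier
    return found

print(sorted(all_descendants(parent_child, "bike")))
# ['frame', 'rim', 'spoke', 'wheel']
```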
Understanding the O(C) architecture of a graph database or key/value store (which is different from a graph algorithm) helps you understand which analytical questions actually benefit from being run on something weird, as opposed to a parent/child table, which people have been analyzing pretty much forever.
Most of the algorithms I have seen are ones that, as you rightly point out, rely heavily on analyzing the structure of large graphs and subgraphs holistically. IN THEORY they benefit a lot from Bulk Synchronous Parallel-friendly platforms, which, again in theory, are expensive and hard to implement inside an RDBMS.
However, IN PRACTICE, except maybe for Pregel, most people are better off just using an RDBMS to analyze even large graphs, since the specialized graph frameworks are so immature that the theoretical performance benefit generally isn’t realized.
Also, as soon as you get the “structure of the graph” problem somewhat solved you IMMEDIATELY get driven into questions involving attributes of the entities in question. For the churn example, you will start seeing questions about LTV, temporal relationships, etc etc.
I’m not sure “traversing the graph” has a lot of analytical value without performing analysis on the attributes of nodes.
In general, you are going to see heavy nodes, not light ones. While you can certainly store a pointer to a richer data structure elsewhere, it does not really change anything; it’s all about “what do you have to retrieve in each super step / join to make it to the next super step”, and that is generally a richer, heavier set of data than just the pointers.
It’s an important distinction, because the heavier the data, the more you benefit from locality, which means the more sense specialized graph DBs start to make….
just my two cents
Interesting take. Thanks.
I tend to agree that the “This is too hard to do in SQL; it’s much easier to do in a graph language” story is a bit overblown. More precisely, the layer one would have to put over SQL to negate much of the story is pretty thin.
On the other hand — why use SQL for something that it isn’t well suited for? Especially if it’s something that for performance reasons you want to do in-memory, negating many of the benefits of your favorite SQL DBMS?
Regarding the heaviness of the nodes — again, that depends on whether their size really interferes with performance. If it does, one looks to workarounds. Also, as you know, I’m a big fan of the idea that one often derives data and stores it for further analysis. So if the graph is derived rather than raw, I’m fine with that.
Anyone who thinks it is harder to write an SQL cursor than, say, a Giraph job is welcome to try (and I love Giraph, mind you) (-:
“why use SQL for something that it isn’t well suited for? ”
In theory you should not. In practice it is because the graph platforms are immature, slow and unstable compared to RDBMS.
Also, the log(N) properties of SQL are not inherent in SQL-as-a-language; they come from RDBMS-as-an-implementation, especially B-tree indexes.
State-of-the-art web and social graph compression packs trillion-edge graphs down into a few bits per edge (which is the usual measure). See http://webgraph.dsi.unimi.it/ as a starting point for a lot of interesting information.
Web graph compression research is looking for very high compression factors while maintaining a suitable set of navigational operators.
The stats at http://law.dsi.unimi.it/datasets.php are also interesting — note the compression achieved on the Facebook graphs.
The takeaway is that with modern graph compression and navigation techniques, it’s quite feasible to study very large graphs on a single well-equipped machine.
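For readers curious what such compression looks like, here is a stripped-down Python sketch of two of the basic tricks (sorted adjacency lists, gap encoding, variable-length bytes). WebGraph itself goes much further, with reference compression, interval codes, and so on; the neighbor IDs are made up.

```python
# Sort each adjacency list, store the gaps between successive neighbor IDs,
# and encode each gap as a variable-length run of bytes.
def varint(n):
    out = bytearray()
    while True:
        low, n = n & 0x7F, n >> 7
        out.append(low | (0x80 if n else 0))  # high bit set = more bytes follow
        if not n:
            return bytes(out)

def compress_adjacency(neighbors):
    neighbors = sorted(neighbors)
    gaps = [neighbors[0]] + [b - a for a, b in zip(neighbors, neighbors[1:])]
    return b"".join(varint(gap) for gap in gaps)

# Locality in node numbering keeps the gaps small, so most edges cost one byte.
print(len(compress_adjacency([1000000, 1000003, 1000010, 1000042])))  # 6
```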
There is a special computer science research area for this: external-memory algorithms and data structures. For graph problems, you can search for publications by Ulrich Meyer.
By the way, I doubt that an RDBMS is the right answer.
The funny thing about graph stores in the SPARQL sense is that there is a large variety of backend algorithms and solutions.
Yarcdata, with their Cray hardware, takes an all-in-memory approach that is bandwidth-optimized. They only use Apache Jena for their query parsing, and then hand it over to their graph database code, written in a variant of C++ for their XMT architecture.
Ontotext, with their OWLIM store, takes a completely different approach to storing their graph data, which comes down to 36 bytes per node-edge-node triple (including indexes) for the UniProt data.
Yet using the same queries and dataset you can switch between Yarcdata and Ontotext in hours (lovely for competitive tendering).
There are a lot of other approaches as well!
Graph analysis using an RDBMS is possible, but even there I would take DB2’s or Oracle’s optimized solutions instead of rolling it myself by hand!
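To illustrate the portability point above: the same SPARQL query can simply be pointed at different endpoints. A minimal sketch using the SPARQLWrapper library; the endpoint URLs are placeholders, not real Yarcdata or Ontotext addresses.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# The same SPARQL query, sent to two different store endpoints.
QUERY = "SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }"

def count_triples(endpoint_url):
    client = SPARQLWrapper(endpoint_url)
    client.setQuery(QUERY)
    client.setReturnFormat(JSON)
    bindings = client.query().convert()["results"]["bindings"]
    return bindings[0]["triples"]["value"]

for endpoint in ("http://store-a.example/sparql", "http://store-b.example/sparql"):
    print(endpoint, count_triples(endpoint))
```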