Introduction to Neo Technology and Neo4j
I’ve been talking some with the Neo Technology/Neo4j guys, including Emil Eifrem (CEO/cofounder), Johan Svensson (CTO/cofounder), and Philip Rathle (Senior Director of Products). Basics include:
- Neo Technology came up with Neo4j, open sourced it, and is building a company around the open source core product in the usual way.
- Neo4j is a graph DBMS.
- Neo4j is unlike some other graph DBMS in that:
- Neo4j is designed for OLTP (OnLine Transaction Processing), or at least as a general-purpose DBMS, rather than being focused on investigative analytics.
- To every node or edge managed by Neo4j you can associate an arbitrary collection of (name,value) pairs — i.e., what might be called a document.
Numbers and historical facts include:
- > 50 paying Neo4j customers.
- Estimated 1000s of production users of the open source version.*
- Estimated 1/3 of paying customers and free users using Neo4j as a “system of record”.
- >30,000 downloads/month, in some sense of “download”.
- 35 people in 6 countries, vs. 25 last December.
- $13 million in VC, most of it last October.
- Started in 2000 as the underpinnings for a content management system.
- A version of the technology in production in 2003.
- Neo4j first open-sourced in 2007.
- Big-name customers including Cisco, Adobe, and Deutsche Telekom.
- Pricing of either $6,000 or $24,000 per JVM per year for two different commercial versions.
*I forgot to ask why the paying/production ratio was so low, but my guesses start with:
- Relatively high price.
- No true scale-out.
I also forgot to ask if there were any OEM users involved.
If we look at my basic list of graph data model application areas, Neo4j seems to be involved in most of what you would think. Exceptions include:
- There’s some anti-fraud and anti-terrorism, but it’s not as top-of-mind as for other graph database efforts.
- Taxonomies don’t seem to be in the picture at all, probably because people who want to model those would be more likely to turn to an RDF store.
- Hardcore influencer-finding analytics aren’t mentioned much, perhaps because Neo4j isn’t tuned for big batch analytic efforts.
- There are a lot of applications focusing on people’s roles, permissions, business hierarchies, and so on.
- Some of these are clearly in the area of access control.
- Others Neo debatably characterizes as MDM (Master Data Management), in that they serve as some kind of single-source-of-truth.
To scope what kind of databases Neo4j can or can’t handle, it may be helpful to note that:
- The scale-out version of Neo4j is some number of quarters away. At this time, Neo4j doesn’t scale out, except insofar as …
- … Neo4j has read slaves, which can be used for performance, high availability, and/or disaster recovery. (There’s some level of “OK, I’m the master now” HA functionality built in.)
- Neo4j has hard limits of 32 billion nodes, 32 billion edges, and 64 billion property (name,value) pairs.
- You are advised but not required to have everything except the property (name,value) pairs in memory.
- If that winds up leaving most of the overall data on persistent storage, you are advised but not required to use solid-state rather than spinning disk.
Neo4j is built around pointers and linked lists. The record for an edge, aka relationship, consists of:
- ID for the starting node. (Neo4j graphs are directed.)
- ID for the ending node.
- ID for the relationship type.
- Pointer to the first (name,value) pair in the property list.
- Pointer to the next edge leading from the start node.
- Pointer to the previous edge leading from the start node.
- Pointer to the next edge leading to the end node.
- Pointer to the previous edge leading to the end node.
(As you might imagine, any of those pointers could conceivably be null.) The property list is, you guessed it, another doubly linked list, in this case of (name,value) pairs. Similarly, I believe that a node record contains a node ID, a pointer to one edge leading from the node, a pointer to one edge leading to the node, and a pointer to a property list.
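To make that concrete, here is a minimal sketch of those record layouts as I understand them. The field names and types are mine, not Neo's, and -1 stands in for a null pointer:

```java
// Illustrative sketch of the record layouts described above.
// Field names and types are my own; they are not Neo4j's actual code.
class RelationshipRecord {
    long startNodeId;      // ID of the starting node (Neo4j graphs are directed)
    long endNodeId;        // ID of the ending node
    int  relationshipType; // ID of the relationship type
    long firstPropertyId;  // head of the doubly linked (name,value) property list
    long startNodeNextRel; // next edge leading from the start node
    long startNodePrevRel; // previous edge leading from the start node
    long endNodeNextRel;   // next edge leading to the end node
    long endNodePrevRel;   // previous edge leading to the end node
}

class NodeRecord {
    long nodeId;           // the node's own ID
    long firstOutgoingRel; // a pointer to one edge leading from the node
    long firstIncomingRel; // a pointer to one edge leading to the node
    long firstPropertyId;  // a pointer to the node's property list
}
```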
The physical data retrieval story in Neo4j starts:
- It’s mainly a big pointer chase.
- The pointers are meant to (almost) always be in memory.
- The cost of the pointer chase, to a first approximation, is a constant × E^(L-1), where
- E is the typical number of edges leading from each node.
- L is the length of a path being examined.
- The constant is sub-microsecond.
- In particular, the cost does not directly depend on the size of the graph, which leads Neo to somewhat confusingly say that overall query cost is “constant”.
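To make the arithmetic concrete, here's a toy evaluation of that cost model. The value of the constant k below is purely illustrative; the claim is only that it's sub-microsecond:

```java
// Toy evaluation of the cost model described above: cost ~= k * E^(L-1).
// The constant k here is a made-up illustrative value.
public class TraversalCostModel {
    public static void main(String[] args) {
        double k = 0.5e-6; // hypothetical constant, in seconds
        int e = 50;        // typical number of edges leading from each node
        for (int l = 1; l <= 4; l++) {
            double cost = k * Math.pow(e, l - 1);
            // Note that graph size never enters the calculation,
            // which is the sense in which Neo calls the cost "constant".
            System.out.printf("L=%d: ~%.4f seconds%n", l, cost);
        }
    }
}
```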
An exception to the “pointers are supposed to be in memory” rule occurs when property lists are kept on disk — but when you fetch the first property in the chain, all the rest are retrieved too, placing their pointers in memory from that time on.
Neo4j records are (almost) always fixed length, so they can be found just from offset calculations. Notes on that include:
- I get the impression that record lengths are changed from time to time, as Neo tweaks its optimization.
- Those records are short; the greatest length mentioned was 120 bytes, and figures came up as low as 20.
- If a record is constrained to be shorter than you need it to be, it turns into a pointer to some other location. I didn’t get all the details of that part.
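The payoff of fixed lengths is that a record's position is a pure function of its ID. A minimal sketch, with an invented record size:

```java
// Sketch of offset-based record lookup. The 33-byte record size is invented
// for illustration; the point is only that no index is needed to find a record.
public class RecordOffsets {
    static final int RELATIONSHIP_RECORD_SIZE = 33; // hypothetical fixed length, in bytes

    static long offsetOf(long relationshipId) {
        // Multiply the ID by the fixed record length to get
        // the record's byte offset in the store file.
        return relationshipId * RELATIONSHIP_RECORD_SIZE;
    }

    public static void main(String[] args) {
        System.out.println(offsetOf(1_000_000L)); // prints 33000000
    }
}
```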
Indexes do play a limited role, in determining at which node(s) to start the pointer chase. Notes on Neo4j indexing include:
- It seems to be Lucene-based.
- Both B-trees and text search were mentioned.
- To use an index, you need to explicitly refer to it in a query. Neo Technology is working on removing that requirement.
- Overall, indexes don’t seem to have been a big area of focus for Neo. Their heart is in pointers.
You can get at Neo4j via either its “Cypher” declarative query language or an older Java API.
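For flavor, here's roughly what each looks like. This is an untested sketch, and the index, property, and relationship-type names ("people", "name", KNOWS) are made up; note how both routes name the index explicitly, per the indexing notes above:

```java
import org.neo4j.cypher.javacompat.ExecutionEngine;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.index.Index;

// Sketch only; "people", "name", and KNOWS are invented names.
class TwoWaysToQuery {
    void run(GraphDatabaseService graphDb) {
        // Older Java API: find the start node via an explicitly named index,
        // then walk relationships by hand from there.
        Index<Node> people = graphDb.index().forNodes("people");
        Node curt = people.get("name", "Curt").getSingle();
        for (Relationship r : curt.getRelationships()) {
            System.out.println(r.getOtherNode(curt));
        }

        // Cypher, in its circa-2012 dialect: the START clause
        // names the index explicitly.
        ExecutionEngine engine = new ExecutionEngine(graphDb);
        engine.execute(
            "START n=node:people(name = 'Curt') MATCH n-[:KNOWS]->friend RETURN friend");
    }
}
```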
Finally, there’s Neo4j’s durability story — it’s just full durability/ACID, with no tunable durability or anything like that. Neo4j doesn’t seem to feel any great rush about sending a write all the way to the database in persistent storage; but there’s also an update log, and the write doesn’t get acknowledged until that log has been flushed to disk.
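Schematically, that's the classic write-ahead-log discipline. The sketch below is generic, not Neo4j's actual code:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Generic write-ahead-log discipline, as described above; NOT Neo4j's actual code.
class WalSketch {
    void commit(FileChannel log, ByteBuffer logEntry) throws IOException {
        log.write(logEntry); // append the update record to the log...
        log.force(true);     // ...and flush (fsync) it to disk BEFORE acknowledging
        // Only now does the write get acknowledged to the client. The store
        // files themselves can be updated lazily; after a crash, replaying
        // the flushed log recovers any writes that hadn't reached them yet.
    }
}
```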
Comments
> *I forgot to ask why the paying/production ratio was so low, but my guesses start with:
Hi Curt, I’m happy to answer this question directly:
Part of Neo Technology’s mission as an open source company is to promote broad-based adoption of our graph database. It delights us to see the free Community Edition being used across all contexts. This has resulted in an amazingly vibrant community, and in the long-run is also healthy for the product.
Our commercial-to-free ratio, while it may appear low at first glance, is actually much higher than that of most open source projects, including MySQL.
The question of batch analytics raises an interesting aspect of graph databases:
Because they can so efficiently traverse data in real time, graph databases can take on certain analytic activities that—in a relational context—would need to take place in batch mode. The ability to migrate certain (not all) types of analytic activities, such as recommendations and fraud analysis, into the OLTP system with ACID properties, has been an important driver behind graph database adoption.
Some additional clarification on the technical implementation details, for those who are thirsty for more:
– Pointers in Neo4j are explicit relationships between data records, created at insert time. This differs *significantly* from relationships in RDBMSs, which are physically calculated on the fly at query time via join operations and index lookups, and are consequently *much* more expensive. Graph database traversals don't require index lookups, which, for complex connected-data operations, yields order-of-magnitude performance improvements over relational and other NOSQL options.
– The pointer scheme is entirely hidden from the user. Users need only walk the relationships, or tell Neo4j what they need brought back and let the database traverse the graph and return the results.
– Both the data in the graph (including indexes) and the graph structure are always consistent at any point in time.
– While Lucene is our default choice of indexing (because of its predictability/maturity/performance characteristics), it is possible to swap out Lucene for other types of indexes. These must be JTA-compliant in order to reap the consistency benefits that we have built into our Lucene solution.
– Buffers can be caused to flush more or less frequently by tuning the size of the logical logs.
– Regarding scale-out vs. scale-up: few users bump up against the current limits. The few who have needed to scale across multiple instances have done so by scaling out at the application level. This leads to hybrid performance characteristics: extremely fast traversals for queries local to an instance, and slower indexed lookups across instances, equivalent to non-graph databases.
Nice write up!
The MySQL example is something of a special case, because of all the WordPress and other embeds. Indeed, I’m a MySQL “user” numerous times over. 🙂
In response to the statement: “The cost [C] of the pointer chase, to a first approximation, is [C = k*E^(L-1)], where… the constant [k] is sub-microsecond.”
Using data from an experiment in Partner and Vukotic’s “Neo4j in Action” (cited in “Graph Databases”, Robinson and Webber p. 20), where paths of length up to 5 are queried from a social network of about 50 friends per person, this would give
C(L) = k*50^(L-1)
implying that the constant k is about 8 microseconds, via a linear regression on the base-50 log-linear form
log_50 C = (L - 1) + log_50 k.
Of course, it could be machine-dependent, but if the claim is that the constant is sub-microsecond, this experiment does not support that claim.
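For what it's worth, plugging the fitted k back into the model shows what it predicts at each path length. A quick sanity-check sketch, assuming the 8-microsecond figure above:

```java
// Sanity check of the comment's fit: C(L) = k * 50^(L-1), with k ~= 8 microseconds.
public class FitCheck {
    public static void main(String[] args) {
        double k = 8e-6; // seconds, per the regression described above
        for (int l = 1; l <= 5; l++) {
            // At L=5 this predicts roughly 50 seconds for the traversal.
            System.out.printf("L=%d: ~%.3f seconds%n", l, k * Math.pow(50, l - 1));
        }
    }
}
```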