MongoDB users and use cases
I spoke with Eliot Horowitz and Max Schireson of 10gen last month about MongoDB users and use cases. The biggest clusters they came up with weren’t much over 100 nodes, but clusters an order of magnitude bigger were under development. The 100-node one we talked about most had 33 replica sets, each with about 100 gigabytes of data, so that’s in the 3-4 terabyte range total. In general, the largest MongoDB databases are 20-30 TB; I’d guess those really do use the bulk of available disk space.
10gen recommends solid-state storage in many cases. In some cases solid-state lets you get away with fewer total nodes. 10gen also likes Flashcache (Facebook-developed technology to put a flash cache in front of hard disks). But the 100-node example mentioned above uses spinning disk.
Use cases 10gen is proud of include:
- Lots of user profile maintenance, including at online ad companies. This includes full user ad impression data. (I’ve argued for a while that user profile information belongs in something like a NoSQL database.)
- A big-name web company that wants to inspect every packet that enters their network, and replaced Splunk with MongoDB for performance reasons.
- A big-name photo/video site whose metadata is all in MongoDB. (That’s the kind of thing that often makes for good MarkLogic use cases.)
But actually, the reason we had the call was to review cases where MongoDB’s schemaless nature was significant. Examples of those included:
- A couple of top examples were of the kind “A bunch of apps, similar but not the same.” For MTV, it’s a single content management system for a bunch of websites. For Disney Playdom, it’s different schemas for every game.
- For a wireless telco, the issue was a product catalog in which devices and service plans called for very different schemas, and which the telco felt had thus become unmanageable in Oracle.
- For Craigslist, the issue wasn’t programming so much as performance — ALTER TABLE operations took months in MySQL, and that’s not a typo, although I’ll confess to not understanding why this was the case.
The 10gen guys went on to claim that schemalessness is helpful for incremental development in general, the point being that you don’t have a database-modification step. To some extent, changes can even be rolled back more easily than if you actually changed your schemas.
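To illustrate what “no database-modification step” looks like in practice, here is a minimal sketch in Python with pymongo; the collection and field names are my own invention for illustration, not anything 10gen showed me.

```python
# A minimal sketch of schemaless, incremental development with pymongo.
# Collection and field names are hypothetical.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
profiles = client["app"]["user_profiles"]

# Version 1 of the application stores only a couple of fields.
profiles.insert_one({"name": "alice", "signup": "2011-05-01"})

# Version 2 starts writing an extra field. There is no ALTER TABLE and no
# migration step; older documents simply lack the new field.
profiles.insert_one({"name": "bob", "signup": "2011-05-02",
                     "ad_impressions": [{"campaign": "x", "count": 3}]})

# Code that cares about the new field can filter on its presence...
for doc in profiles.find({"ad_impressions": {"$exists": True}}):
    print(doc["name"])

# ...and rolling the change back is largely a matter of ignoring the field,
# or removing it from documents that have it.
profiles.update_many({}, {"$unset": {"ad_impressions": ""}})
```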
Comments
RE: Craigslist & the month-long ALTER TABLE command
After reading how ALTER TABLE works in MySQL I think it’s fairly obvious why adding columns is painful:
http://dev.mysql.com/doc/refman/5.1/en/alter-table-problems.html
Maybe they can move that to a metadata-only (dictionary) change in the future, as most commercial databases do today.
Jeremy wrote that “changes take over a month” and that is reported here as “operations took months”. Is there a rounding error?
From http://dev.mysql.com/doc/refman/5.1/en/alter-table-problems.html . Are all ALTER TABLE operations bound by this pattern, including adding new columns?
“ALTER TABLE works in the following way:
- Create a new table named A-xxx with the requested structural changes.
- Copy all rows from the original table to A-xxx.
- Rename the original table to B-xxx.
- Rename A-xxx to your original table name.
- Delete B-xxx.”
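For comparison, here is a rough sketch of the same copy-and-swap pattern done by hand from Python; the table, the column names, and the connection details are hypothetical, and a production tool would also have to deal with writes arriving during the copy.

```python
# A rough sketch of MySQL's copy-everything ALTER TABLE pattern done manually.
# Table/column names and connection parameters are hypothetical.
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="app",
                               password="secret", database="classifieds")
cur = conn.cursor()

# 1. Create a new table with the requested structural change.
#    (The ALTER here is cheap because the new table is still empty.)
cur.execute("CREATE TABLE postings_new LIKE postings")
cur.execute("ALTER TABLE postings_new ADD COLUMN flag_count INT NOT NULL DEFAULT 0")

# 2. Copy every row from the original table (the slow part on a big table).
cur.execute("INSERT INTO postings_new (id, title, body) "
            "SELECT id, title, body FROM postings")
conn.commit()

# 3. Swap the tables, then drop the old one.
cur.execute("RENAME TABLE postings TO postings_old, postings_new TO postings")
cur.execute("DROP TABLE postings_old")
conn.close()
```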
Mark,
10gen gave me a figure higher than Jeremy’s.
Curt,
Given the choice between data that came from Jeremy versus something that came from Jeremy -> 10gen -> you, I prefer the one with a shorter path. I don’t think the operator game is a good way to understand technology.
Mark,
There was no such choice. The information in Jeremy’s post was older, and hence did not contradict what I heard from 10gen.
We’re using MongoDB for a ‘schemaless’ application, and it’s mostly working out. In short: we got the job done on budget and on schedule, but if we had more time/resources, we probably should have used a graph database like Neo4j.
In Foswiki, everything is a topic (versioned, access-controlled structured document/page with its own URL).
I mean, application settings are stored in topics, applications themselves can be scripted in topics, user profiles are topics, access control settings are stored in topics, the user group memberships are defined in topics…
Users create their own schemas (which are themselves just topics), and apply them to other topics which store data. The query language supports indirect queries (kinda JOIN-ish, but more like XPath… our contractor figured out a way to abuse JS to translate these JOIN-esque queries into mongo).
So, you have a bunch of topics that might be using the same schema but are at different versions. It’s a wiki: chaotic.
We thought MongoDB would be perfect, however it’s far from it: in order to sensibly make use of a dataset >> 100 records (and we have many users with ~20k records, the biggest is ~60k topics, and we’ve been working on an import of ~160k), you want to be able to do paging of query results.
To do paging, you need sorting.
To do sorting, you need indexes.
To set indexes, the user needs to make choices about what they will & won’t query/sort on (isn’t that up to whoever is clicking wildly around a jqgrid?).
So even if you have unlimited RAM to waste, the maximum is 64 indexes per collection.
In other words, MongoDB isn’t as fun for schemaless data as we would have liked.
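To make the paging → sorting → indexes chain concrete, here is a small pymongo sketch; the collection and field names are invented, not Foswiki’s actual code.

```python
# A small sketch of the paging -> sorting -> index chain described above.
# Collection and field names are hypothetical, not Foswiki's real schema.
from pymongo import ASCENDING, MongoClient

topics = MongoClient()["wiki"]["topics"]

# Paging a grid-style UI means sort + skip + limit...
page, page_size = 3, 50
cursor = (topics.find({"web": "Sandbox"})
                .sort("date", ASCENDING)
                .skip(page * page_size)
                .limit(page_size))

# ...and sorting a large result set without an index on the sort key is slow
# or fails outright, so every field users may sort on wants something like:
topics.create_index([("web", ASCENDING), ("date", ASCENDING)])

# That is where the 64-indexes-per-collection ceiling starts to bite when
# users can sort/filter on arbitrary fields.
```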
Paul,
Interesting. You’re basically saying that flexibility is limited by a limit on the number of indexes?
By the way — did you look at MarkLogic? Everything gets indexed there. Or were its economics a mismatch for your project?
I’m basically saying, if you need to think about setting indexes at all, then you need your documents to have consistent key names on which to set an index… in other words you need a schema! Regardless of the limit on how many you can set.
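A tiny illustration of that point, with made-up documents: an index only helps if the documents agree on the key name.

```python
# Hypothetical documents: an index only helps when key names are consistent.
from pymongo import ASCENDING, MongoClient

items = MongoClient()["demo"]["items"]
items.insert_many([
    {"title": "A", "price": 10},   # one author called the field "price"...
    {"title": "B", "cost": 12},    # ...another called it "cost"
])

items.create_index([("price", ASCENDING)])

# The index (and any sort on "price") treats the second document as having no
# value at all, so "schemaless" data still needs agreed key names (i.e. a
# schema) before indexes behave the way users expect.
for doc in items.find().sort("price", ASCENDING):
    print(doc.get("title"), doc.get("price"))
```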
Ours is a research project, and the funding parameters required us to try very hard to use free/open source where possible. So MarkLogic was not an option.
MongoDB had the lowest integration cost of all the options we considered (and this included MySQL/Postgres). This is due to the excellent fit Mongo had with our data, the ability to delegate complex queries entirely out of the application and onto Mongo via JS (avoiding chattiness), not to mention the official Perl driver.
Having done a Mongo back-end, the APIs are now in a better position to support other back-ends – in fact, there are already the beginnings of an SQL back-end.
So even though MongoDB is far from perfect, we’ve found work-arounds. E.g. without indexes, Mongo seems able to sort & query up to ~2,000-5,000 records, depending on document sizes. So we limit the diversity of documents in a given web to fit the 64-index limit if that web has many thousands of topics, which happens to match standard practice anyway.
@Paul, Have you seen http://www.mongodb.org/display/DOCS/Using+Multikeys+to+Simulate+a+Large+Number+of+Indexes ? It is a common pattern for handling a lot of variability in objects which is perfect for arbitrary metadata. It does have some trade-offs in that it trades space for flexibility and queries are a bit awkward, but it is a workable solution in many cases.
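For readers who have not seen that pattern: arbitrary attributes go into one indexed array of key/value subdocuments, so a single compound index stands in for many per-field indexes. A rough pymongo sketch, with invented names:

```python
# A rough sketch of the multikeys pattern from the linked doc: arbitrary
# metadata lives in one indexed array of {k, v} pairs. Names are invented.
from pymongo import ASCENDING, MongoClient

topics = MongoClient()["demo"]["topics"]

topics.insert_one({
    "name": "WebHome",
    "attrs": [                      # whatever fields this document happens to carry
        {"k": "Author", "v": "alice"},
        {"k": "Priority", "v": 2},
    ],
})

# One compound multikey index covers queries on any attribute name/value pair,
# instead of one index per field (and per the 64-index ceiling).
topics.create_index([("attrs.k", ASCENDING), ("attrs.v", ASCENDING)])

# Querying is a bit awkward, as Mathias says: you match inside the array.
found = topics.find({"attrs": {"$elemMatch": {"k": "Author", "v": "alice"}}})
```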
@Mark, I think part of the confusion is related to the fact that it took “over a month” on the master, then the same amount of time on the slave, plus another month or so for the slave to catch up to all the new writes on the master. So all told the whole process of altering the table and returning to a good state took a few months, even if each system only took about one.
For more info, watch the presentation at http://www.10gen.com/presentation/mongosf2011/craigslist
@Mathias, thank you for that link! We are aware of multikeys, but I’m not doing the MongoDBPlugin development (that’s Sven Dowideit), so, I’ll forward this on.
I seem to recall it had some limitations at the time Sven considered multikeys, but this has been in development since 1.6 so perhaps things have improved… except, reading the multikeys doc, it says you can only index one array per document. This isn’t 100% useful, because Foswiki’s standard “topic object model” can be described as a hash of arrays of hashes, but perhaps there’s a way we could use it for the META:FIELD.values (most common/diverse thing users sort/query on).