Data management at Zynga and LinkedIn
Mike Driscoll and his Metamarkets colleagues organized a bit of a bash Thursday night. Among the many folks I chatted with were Ken Rudin of Zynga, Sam Shah of LinkedIn, and D. J. Patil, late of LinkedIn. I now know more about analytic data management at Zynga and LinkedIn, plus some bonus stuff on LinkedIn’s People You May Know application. 🙂
It’s blindingly obvious that Zynga is one of Vertica’s petabyte-scale customers, given that Zynga sends 5 TB/day of data into Vertica, and keeps that data for about a year. (Zynga may retain even more data going forward; in particular, Zynga regrets ever having thrown out the first month of data for any game it’s tried to launch.) This is game actions, for the most part, rather than log files; true logs generally go into Splunk.
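A quick back-of-envelope check on that "petabyte-scale" claim, using only the figures above (5 TB/day, roughly a year of retention):

```python
# Back-of-envelope: why 5 TB/day of ingest with ~1 year of retention
# puts Zynga in petabyte territory.
daily_ingest_tb = 5
retention_days = 365

total_tb = daily_ingest_tb * retention_days   # 1825 TB
total_pb = total_tb / 1024                    # ~1.8 PB

print(f"~{total_tb} TB, i.e. roughly {total_pb:.1f} PB")
```

That's before accounting for compression, replication, or any data kept longer than a year, so it's a floor on raw volume rather than a statement about disk footprint.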
I don’t know whether the missing data is completely thrown away, or just stashed on inaccessible tapes somewhere.
I found two aspects of the Zynga story particularly interesting. First, those 5 TB/day are going straight into Vertica (from, I presume, memcached/Membase/Couchbase), as Zynga decided that sending the data to some kind of log first was more trouble than it was worth. Second, there’s Zynga’s approach to analytic database design. Highlights of that include:
- Data is divided into two parts. One part has a pretty ordinary schema; the other is just stored as a huge list of name-value pairs. (This is much like eBay’s approach with its Teradata-based Singularity, except that eBay puts the name-value pairs into long character strings.) About half the data is in each part, but I don’t think that’s by deliberate choice.
- Zynga adds data into the real schema when it’s clear it will be needed for a while. This isn’t a matter of query volumes, for the most part; rather, it’s when Zynga’s tests (e.g. of new games?) have determined that the data will keep being collected and used for a while.
- Zynga only adds columns to its analytic database; it never goes through the more complex process of deleting them.
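The hybrid layout those bullets describe can be sketched as follows. To be clear, the field names and the promotion rule here are my own illustration of the general idea, not Zynga's actual schema:

```python
# Hypothetical illustration of the two-part layout described above:
# stable, well-understood attributes live in ordinary columns, while
# everything else is kept as generic name-value pairs until it proves
# it will keep being collected and used.
event = {
    "game": "farmville",
    "player_id": 12345,
    "action": "send_gift",
    "ts": "2011-08-04T21:17:00Z",
    "gift_type": "horse",          # newer, experimental attribute
    "promo_variant": "B",          # ditto
}

# Columns the real schema already knows about.
FIXED_COLUMNS = {"game", "player_id", "action", "ts"}

# Route each attribute to the ordinary-schema part or the NVP part.
fixed_part = {k: v for k, v in event.items() if k in FIXED_COLUMNS}
nvp_part = [(k, str(v)) for k, v in event.items() if k not in FIXED_COLUMNS]

print(fixed_part)
print(nvp_part)
```

"Promoting" an attribute then just means adding it to the fixed column set and adding a column to the table, which fits the add-only, never-delete policy: old rows simply have NULLs in the new column.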
Just as Zynga is one of Vertica’s flagship accounts, LinkedIn is one of Aster Data’s. Specifically, before leaving LinkedIn for Aster, Jonathan Goldman built LinkedIn’s People You May Know feature in Aster nCluster. This was long ago, and I’m not sure how sophisticated his use of SQL and MapReduce would be in today’s terms; for example, I was told he didn’t use “nPath or anything like that.” (Edit: See the comments below for clarifications from Jonathan.) Anyhow, LinkedIn has replaced Aster for PYMK with Hadoop, and in my opinion is getting much better results.
That, from an Aster standpoint, is the bad news. The good news is that LinkedIn is happily using Aster nCluster for several other applications; LinkedIn folks don’t seem to regret throwing out* Greenplum for Aster; and they also seem to have a very high opinion of Jonathan and his work while he was there.
*And this time that is indeed the phrase that was used. 😉
One thing that astonished me is that LinkedIn PYMK is based only on data innate to LinkedIn (as opposed to imported email addresses, the results of web crawls, and so on). Given that, I am at a loss to explain how it suggested a couple of old friends, to whom I have no discernible chain of connection. Yes, we were at Harvard at the same time, but if that’s all it was, there would be a huge number of false positives I’m not actually seeing.
Comments
27 Responses to “Data management at Zynga and LinkedIn”
Curt,
Have you ever visited the profiles of these folks while logged in (either by search or through clicking around on the site)? I think, based on some experiences with PYMK, that LinkedIn is using your clickstream to figure out other people you know.
Shrikanth,
That theory makes a lot of sense, especially if one expands it to asking whether I ever searched on their names (including times I failed to find their profiles in the search).
Hi Curt,
Regarding how PYMK works, I really can’t comment on that. Also, given the enormous impact PYMK has had on growth and on improving network health, LinkedIn has now put 4-5 people to work on it, and they have likely added many more methods for identifying likely connections which I’m not even aware of.
Aster was critical to the early development of a number of analytics products, including PYMK among others. Some of these products made use of the SQL-MR capabilities, while others mainly benefited from the MPP capability that Aster provided over their pre-existing data warehouse environment. The Aster platform is a powerful one in that it enables the analyst to escape from the confines of SQL and perform more sophisticated tasks. When it came to deploying PYMK in production, the engineering team responsible for the task decided to utilize the Hadoop stack that they owned (Aster was owned by a different team) and do batch runs of PYMK there. This does not negate the value of the Aster platform to do SQL and non-SQL discovery analytics in a quick, iterative fashion, and this is very important to LinkedIn’s product innovation.
Thanks,
Jonathan Goldman
Jonathan,
Good point. The tool one uses to research algorithms need not be the one one uses to execute them later.
The point often arises in a slightly different context, namely the modeling/scoring dichotomy in straightforward statistics/predictive analytics. But it fits here too.
And why didn’t you use nPath? Was it simply a matter of it not existing yet? 🙂
When I was at LinkedIn I actually used nPath extensively for analysis work on user engagement (e.g., analyzing clickstream data). I didn’t use nPath for PYMK, if that’s what you mean.
Jonathan,
That makes a lot of sense. Thanks!
No, it’s not just people I’ve searched on. I just got a semi-personal connection whose last name I didn’t even know (long story), and a wholly personal connection I much doubt I ever searched on.
Interesting. Any chance they may have looked you up on LinkedIn? I assume this is bidirectional: A sees B even if B searched for A.
That’s another good thought, Shrikanth. It’s definitely possible.
And since I have an unusual name, anybody who is searching for me is searching for ME. (If there’s another person named “Curt Monash” in the world, he hasn’t left a trace on Google that I have found.)
Mike,
Well, it starts at memcached. But Zynga was a development leader in making memcached persistent, via technology that was later rolled into Membase/Couchbase.
Perhaps the Couchbase company guys can shed some light on the matter, and/or some Zynga folks.
I’ll delete your dupe comment. Was the site slow on comment response again?
Curt, I was led to believe that the bulk of the Zynga db infrastructure was MySQL, not couchbase. I could be wrong but here is some insight from Venu from the inside: http://venublog.com/2010/12/02/mysql-at-scale-zynga-games/
– Mike
OK. That’s interesting. Venu’s post contradicts what’s widely believed, and also what I thought I heard from Zynga last week.
While it is most certainly true that Zynga uses MySQL (I think it is probably safe to say that they use one of just about everything), it is also true that they have over 2,000 servers running Couchbase technology. And they use Membase in concert with Vertica, as you correctly highlighted, Curt.
So SOME Zynga games are on Membase, while others are on MySQL? That would make sense, although it wouldn’t entirely excuse Venu from pretty clearly making an erroneous claim.
I get my data from Cadir and others. I don’t think I know Venu, so I can’t really comment on his assertions. I do find it humorous that ScaleDB is commenting about our business though.
@James, I’m commenting about a supposition made by Curt that didn’t jibe with what I heard from someone inside Zynga. I made no comment about your product or business. I even “couched” it by saying I could be wrong. No need to get sensitive. I wish you and your company well.
-Mike
This post on Zynga’s engineering blog should settle the question: http://code.zynga.com/2011/07/building-a-scalable-game-server/
It used to be memcached + MySQL, but it migrated to Membase later.
Thank you, kind Zynga person!
Curt, nice chatting with you recently. To clear up the confusion about whether we use MySQL/Membase or Vertica, the answer is pretty simple: one is used for transactional purposes, and one is used for analytical purposes.
The games write transactional data to MySQL/Membase, and the architecture is described here: http://code.zynga.com/2011/07/building-a-scalable-game-server/
By transactional data, I mean data regarding a player’s state in the game, such as what their game board looks like, how many coins they have left, etc. It’s all the info that the game needs when the player logs in so they can continue playing where they left off previously.
The games also separately write analytical data to our analytics platform, and the architecture is described here:
http://code.zynga.com/2011/06/deciding-how-to-store-billions-of-rows-per-day/
This analytical data is primarily event data related to player behaviors. Did the player just send a horse to their friend in Farmville? Log that to the analytics system. Did they visit a neighbor’s city in Cityville? Log that to the analytics system. Etc.
Hope that clears things up.
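The dual write path Ken describes can be sketched in a few lines. The store stand-ins and function names below are purely illustrative, not Zynga's actual code:

```python
# Hypothetical sketch of the two write paths: game servers update
# transactional player state in one store (MySQL/Membase in Ken's
# description) and separately append behavioral events to the
# analytics pipeline (Vertica-bound).
player_state = {}      # stands in for the transactional store
analytics_log = []     # stands in for the analytics event stream

def send_gift(player_id, friend_id, gift):
    # Transactional side: mutate the player's current game state.
    state = player_state.setdefault(player_id, {"gifts_sent": 0})
    state["gifts_sent"] += 1
    # Analytical side: log the behavioral event for later analysis.
    analytics_log.append({
        "event": "send_gift",
        "player_id": player_id,
        "friend_id": friend_id,
        "gift": gift,
    })

send_gift(1, 2, "horse")
print(player_state[1]["gifts_sent"])   # 1
print(analytics_log[0]["event"])       # send_gift
```

The key design point is that the analytics write is a separate, append-only record of what happened, not a query against the transactional store, so the two systems can scale and fail independently.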
Ken,
It was great talking with you too!
Actually, the controversy was about whether you use Membase OR MySQL for the more “transactional” parts. Somebody who described himself as a consultant of some sort to you claimed it was all MySQL and zero Membase, in a blog post linked in a comment above, and confusion ensued.
Am I correct in guessing that it’s Membase for some games, memcached/MySQL for others?