Web analytics
Discussion of how data warehousing and analytic technologies are applied to clickstream analysis and other web analytics challenges. Related subjects include:
- The use of analytic technologies for logfile analysis
- (in Text Technologies) Online marketing
Kickfire update
I talked recently with my clients at Kickfire, especially newish CEO Bruce Armstrong. I also visited the Kickfire blog, which among other virtues features a fairly clear overview of Kickfire technology. (I did my own Kickfire overview in October.) Highlights of the current Kickfire story include:
- Kickfire is initially focused on three heavily overlapping markets — network event analysis, the general Web 2.0/clickstream/online marketing analytics area, and MySQL/LAMP data warehousing.
- Kickfire has blogged about a few sales to unnamed customers in those markets.
- I think network management is a market that’s potentially friendly to five-figure-cost appliances. After all, networking equipment is generally sold in appliance form. Kickfire doesn’t dispute this analysis.
- Kickfire’s sales so far are to run databases in the sub-terabyte range, although both Kickfire and its customers intend to run bigger databases soon. (Kickfire describes the range as 300 GB – 1 TB.) Not coincidentally, Kickfire believes that MySQL doesn’t scale very well past 100 GB without a lot of partitioning effort (in the case of data warehouses) or sharding (in the case of OLTP).
- When Bruce became CEO, he let go some sales, marketing, and/or business development folks. He likes to call this a restructuring of Kickfire rather than a reduction-in-force, but anyhow — that’s what happened. There are now about 50 employees, and Kickfire still has most of the $20 million it raised last August in the bank. Edit: The company clarifies that it actually wound up with more sales and marketing people than before.
- Kickfire has thankfully deemphasized various marketing themes I found annoying, such as ascribing great weight to TPC-H benchmarks or explaining why John von Neumann originally made bad choices in his principles of computer design.
Categories: Data warehouse appliances, Data warehousing, Kickfire, MySQL, Open source, Web analytics | 1 Comment |
Greenplum claims very fast load speeds, and Fox still throws away most of its MySpace data
Data warehouse load speeds are a contentious issue. Vertica contrived a benchmark with a 5 1/2 terabyte/hour load rate. Oracle has gotten dinged for very low load speeds, which then are hotly debated. I was told recently of a Greenplum partner’s salesman steering a prospect who needed rapid load speeds away from Greenplum, which seemed odd to me.
Now Greenplum has come out swinging, claiming “consistent” load speeds of 4 terabytes/hour at its Fox Interactive Media account, and armed with a customer quote saying just that. Note however that load speeds tend to be proportional to the number of disks, and there are a LOT of disks at that installation.
One way to think about load speeds is — how long would it take to load the entire database? It seems as if the Fox database could be loaded, perhaps not in one week, but certainly in less than two. Flipping that around, the Fox site only has enough capacity to hold less than 2 weeks of detailed data. (This is not uncommon in network event kinds of databases.) And a corollary of that is — worldwide storage sales are still constrained by cost, not by absolute limits on the amounts of data enterprises would like to store.
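To make the "how long to load the whole thing" arithmetic concrete, here is a minimal sketch. Only the 4 terabytes/hour rate comes from the claim above. The database size and the number of hours per day available for loading are hypothetical inputs chosen for illustration, not figures from Fox or Greenplum.

```python
def days_to_load(db_size_tb: float, rate_tb_per_hour: float,
                 load_hours_per_day: float = 24.0) -> float:
    """Elapsed days needed to load db_size_tb at a given hourly rate.

    load_hours_per_day is an assumption: a site may not sustain its peak
    load rate around the clock.
    """
    if rate_tb_per_hour <= 0 or load_hours_per_day <= 0:
        raise ValueError("rate and load window must be positive")
    return db_size_tb / (rate_tb_per_hour * load_hours_per_day)


def retention_days(capacity_tb: float, daily_ingest_tb: float) -> float:
    """Flip side of the same arithmetic: days of detail that fit on disk."""
    if daily_ingest_tb <= 0:
        raise ValueError("daily ingest must be positive")
    return capacity_tb / daily_ingest_tb


if __name__ == "__main__":
    # Hypothetical: 300 TB database, 4 TB/hour, loading 8 hours per day.
    print(days_to_load(300, 4, load_hours_per_day=8))
    # Hypothetical: same 300 TB of capacity, 32 TB of new detail per day.
    print(retention_days(300, 32))
```

The answer is very sensitive to the assumed size and load window, which is why the same claimed rate can translate to anything from a few days to a couple of weeks.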
Categories: Data warehousing, EAI, EII, ETL, ELT, ETLT, Fox and MySpace, Greenplum, Theory and architecture, Web analytics | 3 Comments |
Three Greenplum customers’ applications of MapReduce
Greenplum (and Truviso) advisor Joseph Hellerstein offers a few examples of MapReduce applications (specifically Greenplum MapReduce), namely:
The big aha moment occurred for me during our panel discussion, which included Luke Lonergan from Greenplum, Roger Magoulas from O’Reilly, and Brian Dolan from Fox Interactive Media (which runs MySpace among other web properties).
Roger talked about using MapReduce to extract structured entities from text for doing tech trend analyses from billions of rows of online job postings. Brian (who is a mathematician by training) was talking about implementing conjugate gradient and Support Vector Machines in parallel SQL to support “hypertargeting” for advertisers. I mentioned how Jonathan Goldman at LinkedIn was using SQL and MapReduce to do graph algorithms for social network analysis.
Incidentally: While it’s been some months since I asked, my sense is that the O’Reilly text extraction is home-grown, and primitive compared to what one could do via commercial products. That said, if the specific application is examining job postings, I’m not sure how much value more sophisticated products would add. After all, tech job listings are generally written in a style explicitly designed to ensure that most or all of their meaning is conveyed simply by a bag of keywords. And by the way, this effort has been underway for quite some time.
Related link
- Greenplum has a page on the O’Reilly relationship. However, the part that isn’t behind a registration barrier is trivial — and I wouldn’t know one way or the other about the registration-required part.
Categories: Analytic technologies, Data warehousing, Fox and MySpace, Greenplum, MapReduce, Specific users, Web analytics | 3 Comments |
Fox Interactive Media’s multi-hundred terabyte database running on Greenplum
Greenplum’s largest named account is Fox Interactive Media — the parent organization of MySpace — which has a multi-hundred terabyte database that it uses for hardcore data mining/analytics. Greenplum has been engaging in regrettable business practices, claiming that it is in the process of supplanting Aster Data at Fox/MySpace. In fact, MySpace’s use of Aster is more mission-critical than Fox’s use of Greenplum, and is increasing significantly.
Still, as Greenplum’s gushing customer video with Fox Interactive Media* illustrates, the Fox/Greenplum database is impressive on its own merits.
Categories: Analytic technologies, Aster Data, Data warehousing, Fox and MySpace, Greenplum, Specific users, Theory and architecture, Web analytics | 3 Comments |
MySpace’s multi-hundred terabyte database running on Aster Data
Aster Data has put up a blog post embedding and summarizing a video about its MySpace account. Basic metrics include:
The combined Aster deployment now has 200+ commodity hardware servers working together to manage 200+ TB of data that is growing at 2-3TB per day by collecting 7-10B events that happen on one of the world.
I’m pretty sure that’s counting correctly (i.e., user data).*
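As a rough sanity check on those metrics, one can back out the implied storage per event and how many days of growth the current volume represents. This is only a sketch: it assumes decimal terabytes, treats the quoted "200+" as exactly 200, and assumes the daily growth figure is all attributable to the quoted event counts, none of which the quote confirms.

```python
TB = 10**12  # decimal terabyte, an assumption


def bytes_per_event_range(tb_low: float, tb_high: float,
                          events_low: float, events_high: float):
    """Smallest and largest bytes/event consistent with the quoted ranges."""
    smallest = tb_low * TB / events_high
    largest = tb_high * TB / events_low
    return smallest, largest


def days_of_growth_range(total_tb: float, tb_low: float, tb_high: float):
    """How many days of growth the total represents, fastest to slowest."""
    return total_tb / tb_high, total_tb / tb_low


if __name__ == "__main__":
    # Quoted figures: 2-3 TB/day, 7-10 billion events/day, 200+ TB total.
    print(bytes_per_event_range(2, 3, 7e9, 10e9))
    print(days_of_growth_range(200, 2, 3))
```

Under those assumptions the figures work out to a few hundred bytes per event and very roughly two to three months of growth at the current rate.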
Categories: Analytic technologies, Application areas, Aster Data, Data warehousing, Fox and MySpace, Specific users, Theory and architecture, Web analytics | 11 Comments |
Infobright update
Infobright briefed me, and I thought it would be best to invite them to provide a write-up themselves of what customer and other information they did and didn’t want to disclose, for me to publish.
Categories: Application areas, Data warehousing, Infobright, Open source, Telecommunications, Web analytics | 2 Comments |
An example of Aster Data’s nPath/MapReduce syntax
Perhaps in response to my prior post on Aster Data’s introduction of MapReduce-based nPath, Steve Wooledge of Aster offers a more detailed example. The particular case he works through is:
… the question: for SEO/SEM-driven traffic that stay on our site only for 5 or less pageviews and then leave our site and never return in the same session, what are the top referring search queries and what are the top path of navigated pages on our site?
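For readers who want to see the shape of that question in code, here is a minimal plain-Python sketch. It is not Aster's nPath syntax, which is not reproduced here. The `Session` record, its field names, and the "search" label are hypothetical, and the sketch assumes clicks have already been grouped into sessions, so "never return in the same session" simply means the recorded path is complete.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Optional, Tuple


class Session(NamedTuple):
    referrer_type: str            # hypothetical labels, e.g. "search", "direct"
    search_query: Optional[str]   # query text for search-driven sessions
    pages: Tuple[str, ...]        # pages viewed, in order


def short_search_sessions(sessions: Iterable[Session],
                          max_pageviews: int = 5) -> List[Session]:
    """Keep search-driven sessions with at most max_pageviews pageviews."""
    return [s for s in sessions
            if s.referrer_type == "search" and len(s.pages) <= max_pageviews]


def top_queries_and_paths(sessions: Iterable[Session],
                          max_pageviews: int = 5, n: int = 3):
    """Return the n most common referring queries and navigation paths."""
    short = short_search_sessions(sessions, max_pageviews)
    queries = Counter(s.search_query for s in short if s.search_query)
    paths = Counter(s.pages for s in short)
    return queries.most_common(n), paths.most_common(n)


if __name__ == "__main__":
    sample = [
        Session("search", "mysql appliance", ("/home", "/pricing")),
        Session("search", "mysql appliance", ("/home", "/pricing")),
        Session("search", "column store", ("/home",)),
        Session("direct", None, ("/home",)),
    ]
    print(top_queries_and_paths(sample))
```

The hard part at scale is building the per-session ordered paths in the first place, which is the step a path-analysis feature is meant to take off the analyst's hands.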
Categories: Analytic technologies, Aster Data, Data warehousing, MapReduce, Web analytics | Leave a Comment |
Aster Data nPath
Edit: Unfortunately, this post and its sequel rely on Aster Data posts that Aster’s buyer Teradata no longer makes easily available.
At the same time as it rolled out its cloud story, Aster Data told of nPath, a MapReduce-based feature in nCluster. As best I understand it, the core idea of nPath is that it preprocesses sequential data via MapReduce so that you can then do ordinary SQL on it. (Steve Wooledge’s blog post about nPath outlines why that might be needed. Point 1 in Mayank Bawa’s August, 2008 post is much more concise. 😉 ) Now, that might seem to contradict the syntax, which is all about MapReduce being invoked via SQL — still, it’s what’s really going on.
That leads to two obvious questions: What is nPath used (or useful) for? and How is the preprocessing done anyway?
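One way to picture the "preprocess sequential data, then run ordinary SQL" idea is the toy sketch below. It is an illustration of the general pattern only, not Aster's implementation: the map and reduce functions, the `paths` table, and the event tuples are all hypothetical, and an in-memory SQLite database stands in for the SQL engine.

```python
import sqlite3
from collections import defaultdict


def map_phase(events):
    """Emit (user, (timestamp, page)) so events can be grouped by user."""
    for user, ts, page in events:
        yield user, (ts, page)


def reduce_phase(pairs):
    """Per user, sort events by time and flatten them into one path row."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    for user, visits in grouped.items():
        ordered = [page for _, page in sorted(visits)]
        yield user, " > ".join(ordered), len(ordered)


def load_paths(events):
    """Store the flattened paths in a table that plain SQL can query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE paths (user_id TEXT, path TEXT, steps INTEGER)")
    conn.executemany("INSERT INTO paths VALUES (?, ?, ?)",
                     reduce_phase(map_phase(events)))
    return conn


if __name__ == "__main__":
    events = [("u1", 2, "/cart"), ("u1", 1, "/home"),
              ("u2", 1, "/home"), ("u2", 2, "/cart"),
              ("u3", 1, "/home")]
    conn = load_paths(events)
    query = ("SELECT path, COUNT(*) FROM paths "
             "GROUP BY path ORDER BY COUNT(*) DESC, path")
    print(conn.execute(query).fetchall())
```

Once the ordering work has been done per user, the final question ("which paths are most common?") is an ordinary GROUP BY.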
Categories: Aster Data, Data warehousing, MapReduce, Predictive modeling and advanced analytics, Web analytics | 2 Comments |
Aster Data in the cloud
Aster Data is in the news, bragging about a cloud version of nCluster, and providing both a press release and a blog post on the subject. It seems there are three actual customers, two of which have been publicly named. One of them, ShareThis, is in production. (2 terabytes of data on 9 nodes, planning to scale to 10-18 TB on 24 or so nodes by year-end.) All seem to be doing something in the area of internet marketing, web analytics or otherwise — which makes sense, as the same could be said of almost all Aster customers overall. That said, it seems that these customers are doing their primary analytic processing remotely, which makes Aster’s experience in that regard more akin to Kognitio’s than to Vertica’s.
Categories: Analytic technologies, Application areas, Aster Data, Cloud computing, Data warehousing, MapReduce, Software as a Service (SaaS), Web analytics | 1 Comment |
More Oracle notes
When I went to Oracle in October, the main purpose of the visit was to discuss Exadata. And so my initial post based on the visit was focused accordingly. But there were a number of other interesting points I’ve never gotten around to writing up. Let me now remedy that, at least in part.