Web analytics
Discussion of how data warehousing and analytic technologies are applied to clickstream analysis and other web analytics challenges. Related subjects include:
- The use of analytic technologies for logfile analysis
- (in Text Technologies) Online marketing
Introduction to Kaminario
At its core, the Kaminario story is simple:
- Throw out your disks and replace them with, not Flash, but actual DRAM.
- Your IOPS (Input/Output Operations Per Second) are so high* that you get the performance you need without any further system changes.
- The whole thing is very fast to set up.
In other words, Kaminario pitches a value proposition something like (my words, not theirs) “A shortcut around your performance bottlenecks.”
*1 million or so on the smallest Kaminario K2 appliance.
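To put that figure in perspective, here's a quick back-of-the-envelope comparison in Python. The per-disk figure is a generic rule of thumb (roughly 180 random IOPS for a 15K RPM drive), not a Kaminario number:

```python
# Back-of-the-envelope: how many disk spindles would it take to match
# 1 million random IOPS? The per-disk figure is a generic rule of thumb,
# not a vendor-published number.
DISK_IOPS = 180            # rough random-I/O rate for one 15K RPM disk
K2_IOPS = 1_000_000        # smallest Kaminario K2, per the footnote above

print(f"Equivalent disk spindles: {K2_IOPS / DISK_IOPS:,.0f}")  # ~5,556
```

That ratio is the essence of the "shortcut around your performance bottlenecks" pitch.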
Kaminario asserts that both analytics and OLTP (OnLine Transaction Processing) are represented in its user base. Even so, the use cases Kaminario mentioned seemed to be concentrated on the analytic side. I suspect there are two main reasons:
- As Kaminario points out, OLTP apps are commonly designed to perform acceptably even in the face of regrettable I/O wait.
- Also, analytic performance problems tend to arise more suddenly than OLTP ones do.*
*Somebody can think up a new analytic query overnight that takes 10 times the processing of anything they’ve ever run before. Or they can get the urge to run the same queries 10 times as often as before. Both of those things happen less often in the OLTP world.
Accordingly, Kaminario likes to sell against the alternative of getting a better analytic DBMS, stressing that you can get a Kaminario K2 appliance into production a lot faster than you can move your processing to even the simplest data warehouse appliance. Kaminario is probably technically correct in saying that; even so, I suspect it would often make more sense to view Kaminario K2 appliances as a transition technology, by which I mean:
- You have an annoying performance problem.
- Kaminario K2 could solve it very quickly.
- That buys you time for a more substantive fix.*
- If you want, you can redeploy your Kaminario K2 storage to solve your next-worst performance bottleneck.
On that basis, I could see Kaminario-like devices eventually getting to the point that every sufficiently large enterprise should have some of them, whether or not that enterprise has an application it believes should run permanently against DRAM block storage. Read more
Categories: Investment research and trading, Kaminario, Solid-state memory, Storage, Telecommunications, Web analytics | 7 Comments |
More notes on Membase and memcached
As a companion to my post about Membase last week, the company has graciously allowed me to post a rather detailed Membase slide deck. (It even has pricing.) Also, I left one point out.
Membase announced a Cloudera partnership. I couldn’t detect anything technically exciting about that, but it serves to highlight what I do find to be an interesting usage trend. A couple of big Web players (AOL and ShareThis) are using Hadoop to crunch data and derive customer profile data, then feed that back into Membase. Why Membase? Because it can serve up the profile in a millisecond, as part of a bigger 40-millisecond-latency request.
And why Hadoop, rather than Aster Data nCluster, which ShareThis also uses? Umm, I didn’t ask.
When I mentioned this to Colin Mahony, he said Vertica had similar stories. However, I don’t recall whether they were about Membase or just memcached, and he hasn’t had a chance to get back to me with clarification. (Edit: As per Colin’s comment below, it’s both.)
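Since Membase speaks the memcached protocol, the serving side of that pattern is straightforward. Here's a minimal sketch using the pymemcache Python client; the server address, key scheme, and profile format are all assumptions of mine, not details from AOL or ShareThis:

```python
# Minimal sketch of the "serve a precomputed profile in ~1 ms" pattern.
# Assumes a Membase/memcached-compatible server on localhost:11211;
# the key scheme ("profile:<user_id>") and JSON payload are hypothetical.
import json
from pymemcache.client.base import Client

client = Client(("localhost", 11211))

def store_profile(user_id: str, profile: dict) -> None:
    """Batch side: Hadoop-derived profiles get loaded in as serialized JSON."""
    client.set(f"profile:{user_id}", json.dumps(profile))

def get_profile(user_id: str):
    """Request side: a single key lookup, typically sub-millisecond."""
    raw = client.get(f"profile:{user_id}")
    return json.loads(raw) if raw else None

store_profile("u42", {"segments": ["sports", "finance"], "score": 0.83})
print(get_profile("u42"))
```

The point of the design is that all the heavy lifting happens offline in Hadoop; at request time there's nothing left to do but one key-value fetch inside the larger 40-millisecond latency budget.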
Categories: Aster Data, Cache, Cloudera, Couchbase, Hadoop, memcached, Memory-centric data management, NoSQL, Pricing, Specific users, Vertica Systems, Web analytics | 7 Comments |
Notes and links October 3 2010
Some notes, follow-up, and links before I head out to California: Read more
Categories: GIS and geospatial, Google, HP and Neoview, Humor, Kickfire, Netezza, Solid-state memory, Teradata, Web analytics | 3 Comments |
How to tell whether you need ACID-compliant transaction integrity
In a post about the recent JPMorgan Chase database outage, I suggested that JPMorgan Chase’s user profile database was over-engineered, in that various web surfing data was stored in a fully ACID-compliant manner when it didn’t really need to be. I’ve since gotten private communication expressing vehement agreement, and telling of the opposite choice being made in other major web-facing transactional systems.
What’s going on is this:
- ACID-compliant transaction integrity commonly costs more in terms of DBMS licenses and many other components of TCO (Total Cost of Ownership) than less rigorous approaches.
- Worse, it can actually hurt application uptime, by forcing your system to pull in its horns and stop functioning in the face of failures that a non-transactional system might smoothly work around.
- Other flavors of “complexity can be a bad thing” apply as well.
Thus, transaction integrity can be more trouble than it’s worth.
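To make the trade-off concrete, here's a minimal sketch using Python's built-in sqlite3 module. The schema is made up; the point is that the money transfer genuinely needs all-or-nothing semantics, while the page-view log can be best-effort:

```python
# Minimal sketch of the ACID trade-off, using Python's built-in sqlite3.
# The schema and failure handling are illustrative, not a recipe.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("CREATE TABLE page_views (user_id TEXT, url TEXT)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])
conn.commit()

def transfer(frm: str, to: str, amount: int) -> None:
    """Needs ACID: both updates must commit together or not at all."""
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                     (amount, frm))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                     (amount, to))

def log_page_view(user_id: str, url: str) -> None:
    """Doesn't need ACID: losing one row under failure is acceptable."""
    try:
        conn.execute("INSERT INTO page_views VALUES (?, ?)", (user_id, url))
        conn.commit()
    except sqlite3.Error:
        pass  # best effort; availability matters more than this one row

transfer("alice", "bob", 25)
log_page_view("alice", "/statements")
```

A half-done transfer is corruption, so its failure must surface and roll back; a dropped page view is noise, so insisting on full transactional guarantees for it buys cost and fragility with little benefit.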
In essence, of course, that’s half of the classic NoSQL claim, where the other half of the claim is to assert that the same may be said of joins.
So when should you go for ACID-compliant transaction integrity, and when shouldn’t you bother? Every situation is different, but here’s a set of considerations to start you off. Read more
Categories: NoSQL, Web analytics | 12 Comments |
Big Data is Watching You!
There’s a boom in large-scale analytics. The subjects of this analysis may be categorized as:
- People
- Financial trades
- Electronic networks
- Everything else
The most varied, interesting, and valuable of those four categories is the first one.
Why you should go to XLDB4
Scientific data commonly:
- Comes in large volumes
- Is machine-generated
- Is augmented by synthetic and/or derived data
- Has a spatial and/or temporal structure
In those respects, it is akin to some of the hottest areas for big data analytics, including:
- Investment trade data – big, partly machine generated, augmented (often), temporal
- Web/network log data – big, machine-generated, post-processed into derived form, temporal
- Marketing analytic data – big, post-processed into derived form
- Genomic data
So when Jacek Becla started the XLDB conferences on the premise that scientific and big data analytic challenges have a lot in common, he had a point. There are several tough database problems that the science-focused folks have taken the lead in thinking about, but which are soon going to matter to the commercial world as well. And that’s one of two big reasons why you should consider participating in XLDB4, October 6-7, at the SLAC facility in Menlo Park, CA, as an attendee, sponsor, or both.
The other big reason is that it is important for the world that XLDB succeed. Read more
Categories: Investment research and trading, Log analysis, Scientific research, Web analytics | 2 Comments |
Cloudera Enterprise and Hadoop evolution
I talked with Cloudera a couple of weeks ago in connection with the impending release of Cloudera Enterprise. I’d say: Read more
The most important part of the “social graph” is neither social nor a graph
“Social graph” is a highly misleading term, and so is “social network analysis.” By this I mean:
There’s something akin to “social graphs” and “social network analysis” that is more or less worthy of all the current hype – but graphs and network analysis are only a minor part of the whole story.
In particular, the most important parts of the Facebook “social graph” are neither social nor a graph. Rather, what’s really important is an aggregate Profile of Revealed Preferences, in which person-to-person connections, and other things best modeled by a graph, play only a small part.
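To illustrate the distinction with a toy example (the event data and categories below are entirely made up): aggregating a person's revealed preferences is a counting problem, and the graph-shaped part of the data is a small slice of it:

```python
# Toy illustration: an aggregate "profile of revealed preferences"
# versus the person-to-person edges that actually form a graph.
# All data here is made up.
from collections import Counter

people = {"alice", "bob"}  # hypothetical users
events = [                 # (actor, thing acted on)
    ("alice", "camera_review"), ("alice", "camera_ad"),
    ("alice", "travel_deal"), ("alice", "camera_forum"),
    ("alice", "bob"),      # the only person-to-person interaction
]

profile = Counter(obj for _, obj in events if obj not in people)
graph_edges = [(actor, obj) for actor, obj in events if obj in people]

print(profile.most_common(2))  # the aggregate profile: most of the signal
print(graph_edges)             # the graph part: a small minority of it
```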
Categories: Analytic technologies, Facebook, Games and virtual worlds, RDF and graphs, Surveillance and privacy, Web analytics | 13 Comments |
Notes on SciDB and scientific data management
I firmly believe that, as a community, we should look for ways to support scientific data management and related analytics. That’s why, for example, I went to XLDB3 in Lyon, France at my own expense. Eight months ago, I wrote about issues in scientific data management. Here’s some of what has transpired since then.
The main new activity I know of has been in the open source SciDB project. Read more
Categories: Analytic technologies, Data warehousing, eBay, GIS and geospatial, Microsoft and SQL*Server, SciDB, Scientific research, Web analytics | 5 Comments |
Truviso evidently reinvents itself
When Aleri bought Coral8 last year, I wrote that the independent CEP (Complex Event Processing) vendors were floundering. Aleri quickly threw in the towel and sold out to Sybase, which hardly changed my opinion. StreamBase actually is persevering, but not with any kind of breakout success. Big vendors, such as Microsoft and IBM, have at least some aspirations of eventually filling the gap.
Meanwhile, Truviso — which never got much market traction in the first place — was in hiding; Roman Bukary never did keep his promise to brief me on the company’s new and improved strategy. Then Truviso had yet another management change, amidst rumors that it was repositioning away from CEP. As per a press release Truviso emailed today, that’s now official, with Truviso’s main business being something to do with web analytics.
Edit: It seems Truviso was at some point absorbed into Cisco.