Web analytics
Discussion of how data warehousing and analytic technologies are applied to clickstream analysis and other web analytics challenges. Related subjects include:
- The use of analytic technologies for logfile analysis
- (in Text Technologies) Online marketing
Introduction to MemSQL
I talked with MemSQL shortly before today’s launch. MemSQL technology basics are:
- In-memory relational DBMS.
- Being released single-box only. Transparent sharding is under development for release in the fall. Basic replication is under development too.
- Subset of SQL-92.
- MySQL wire-compatible (SQL coverage issues excepted).
MemSQL’s performance claims include:
- Read performance 10% or so worse than memcached.
- Write performance 20% or so better than memcached.
- 1.2 million inserts/second on a 64-core, 1/2 TB of RAM machine.
- Similarly, 1/2 billion records loaded in under 20 minutes.
MemSQL company basics include: Read more
Categories: Database compression, In-memory DBMS, Investment research and trading, Market share and customer counts, memcached, MemSQL, OLTP, Pricing, Web analytics | 3 Comments |
Introduction to Metamarkets and Druid
I previously dropped a few hints about my clients at Metamarkets, mentioning that they:
- Have built vertical-market analytic platform technology.
- Use a lot of Hadoop.
- Throw good parties. (That’s where the background photo on my Twitter page comes from.)
But while they’re a joy to talk with, writing about Metamarkets has been frustrating, with many hours and pages of wasted of effort. Even so, I’m trying again, in a three-post series:
- Introduction to Metamarkets and Druid (this post)
- Druid overview
- Metamarkets’ back-end technology
Much like Workday, Inc., Metamarkets is a SaaS (Software as a Service) company, with numerous tiers of servers and an affinity for doing things in RAM. That’s where most of the similarities end, however, as Metamarkets is a much smaller company than Workday, doing very different things.
Metamarkets’ business is SaaS (Software as a Service) business intelligence, on large data sets, with low latency in both senses (fresh data can be queried on, and the queries happen at RAM speed). As you might imagine, Metamarkets is used by digital marketers and other kinds of internet companies, whose data typically wants to be in the cloud anyway. Approximate metrics for Metamarkets (and it may well have exceeded these by now) include 10 customers, 100,000 queries/day, 80 billion 100-byte events/month (before summarization), 20 employees, 1 popular CEO, and a metric ton of venture capital.
To understand how Metamarkets’ technology works, it probably helps to start by realizing: Read more
Quick-turnaround predictive modeling
Last November, I wrote two posts on agile predictive analytics. It’s time to return to the subject. I’m used to KXEN talking about the ability to do predictive modeling, very quickly, perhaps without professional statisticians; that the core of what KXEN does. But I was surprised when Revolution Analytics told me a similar story, based on a different approach, because ordinarily that’s not how R is used at all.
Ultimately, there seem to be three reasons why you’d want quick turnaround on your predictive modeling: Read more
Categories: Business intelligence, Investment research and trading, KXEN, Predictive modeling and advanced analytics, Revolution Analytics, Telecommunications, Web analytics | 10 Comments |
Cool analytic stories
There are several reasons it’s hard to confirm great analytic user stories. First, there aren’t as many jaw-dropping use cases as one might think. For as I wrote about performance, new technology tends to make things better, but not radically so. After all, if its applications are …
… all that bloody important, then probably people have already been making do to get it done as best they can, even in an inferior way.
Further, some of the best stories are hard to confirm; even the famed beer/diapers story isn’t really true. Many application areas are hard to nail down due to confidentiality, especially but not only in such “adversarial” domains as anti-terrorism, anti-spam, or anti-fraud.
Even so, I have two questions in my inbox that boil down to “What are the coolest or most significant analytics stories out there?” So let’s round up some of what I know. Read more
Categories: Analytic technologies, Google, Health care, Investment research and trading, Predictive modeling and advanced analytics, Scientific research, Telecommunications, Web analytics | 6 Comments |
Thinking about market segments
It is a reasonable (over)simplification to say that my business boils down to:
- Advising vendors what/how to sell.
- Advising users what/how to buy.
One complication that commonly creeps in is that different groups of users have different buying practices and technology needs. Usually, I nod to that point in passing, perhaps by listing different application areas for a company or product. But now let’s address it head on. Whether or not you care about the particulars, I hope the sheer length of this post reminds you that there are many different market segments out there.
Last June I wrote:
In almost any IT decision, there are a number of environmental constraints that need to be acknowledged. Organizations may have standard vendors, favored vendors, or simply vendors who give them particularly deep discounts. Legacy systems are in place, application and system alike, and may or may not be open to replacement. Enterprises may have on-premise or off-premise preferences; SaaS (Software as a Service) vendors probably have multitenancy concerns. Your organization can determine which aspects of your system you’d ideally like to see be tightly integrated with each other, and which you’d prefer to keep only loosely coupled. You may have biases for or against open-source software. You may be pro- or anti-appliance. Some applications have a substantial need for elastic scaling. And some kinds of issues cut across multiple areas, such as budget, timeframe, security, or trained personnel.
I’d further say that it matters whether the buyer:
- Is a large central IT organization.
- Is the well-staffed IT organization of a particular business department.
- Is a small, frazzled IT organization.
- Has strong engineering or technical skills, but less in the way of IT specialists.
- Is trying to skate by without much technical knowledge of any kind.
Now let’s map those considerations (and others) to some specific market segments. Read more
DataStax Enterprise 2.0
Edit: Multiple errors in the post below have been corrected in a follow-on post about DataStax Enterprise and Cassandra.
My client DataStax is announcing DataStax Enterprise 2.0. The big point of the release is that there’s a bunch of stuff integrated together, including at least:
- Cassandra — the NoSQL DBMS, which DataStax sometimes calls “DataStax Server”. Edit: That’s not really a fair criticism of DataStax’s messaging.
- Hadoop MapReduce, which DataStax sometimes calls “Hadoop”. Edit: That is indeed fair. 🙂
- Sqoop — the general way to connect relational DBMS to Hadoop, which DataStax sometimes calls “RDBMS integration”.
- Solr — the search-centric Apache project, or big parts of it, which DataStax generally calls either “Solr” or “Solr compatibility”.
- log4j — an Apache project that has something or other to do with logging, or parts of it, which DataStax sometimes calls “log file integration”.
- DataStax OpsCenter — some management tools and so on around Cassandra and the rest of the product line.
DataStax stresses that all this runs on the same cluster, with the same administrative tools and so on. For example, on a single cluster:
- You can manage the interactive data for a web site.
- You can store the logs for that website.
- You can analyze all of the above in Hadoop.
Translucent modeling, and the future of internet marketing
There’s a growing consensus that consumers require limits on the predictive modeling that is done about them. That’s a theme of the Obama Administration’s recent work on consumer data privacy; it’s central to other countries’ data retention regulations; and it’s specifically borne out by the recent Target-pursues-pregnant-women example. Whatever happens legally, I believe this also calls for a technical response, namely:
Consumers should be shown key factual and psychographic aspects of how they are modeled, and be given the chance to insist that marketers disregard any or all of those aspects.
I further believe that the resulting technology should be extended so that
information holders can collaborate by exchanging estimates for such key factors, rather than exchanging the underlying data itself.
To some extent this happens today, for example with attribution/de-anonymization or with credit scores; but I think it should be taken to another level of granularity.
My name for all this is translucent modeling, rather than “transparent”, the idea being that key points must be visible, but the finer details can be safely obscured.
Examples of dialog I think marketers should have with consumers include: Read more
Categories: Predictive modeling and advanced analytics, Surveillance and privacy, Web analytics | Leave a Comment |
Applications of an analytic kind
The most straightforward approach to the applications business is:
- Take general-purpose technology and think through how to apply it to a specific application domain.
- Produce packaged application software accordingly.
However, this strategy is not as successful in analytics as in the transactional world, for two main reasons:
- Analytic applications of that kind are rarely complete.
- Incomplete applications rarely sell well.
I first realized all this about a decade ago, after Henry Morris coined the term analytic applications and business intelligence companies thought it was their future. In particular, when Dave Kellogg ran marketing for Business Objects, he rattled off an argument to the effect that Business Objects had generated more analytic app revenue over the lifetime of the company than Cognos had. I retorted, with only mild hyperbole, that the lifetime numbers he was citing amounted to “a bad week for SAP”. Somewhat hoist by his own petard, Dave quickly conceded that he agreed with my skepticism, and we changed the subject accordingly.
Reasons that analytic applications are commonly less complete than the transactional kind include: Read more
Couchbase update
I checked in with James Phillips for a Couchbase update, and I understand better what’s going on. In particular:
- Give or take minor tweaks, what I wrote in my August, 2010 Couchbase updates still applies.
- Couchbase now and for the foreseeable future has one product line, called Couchbase.
- Couchbase 2.0, the first version of Couchbase (the product) to use CouchDB for persistence, has slipped …
- … because more parts of CouchDB had to be rewritten for performance than Couchbase (the company) had hoped.
- Think mid-year or so for the release of Couchbase 2.0, hopefully sooner.
- In connection with the need to rewrite parts of CouchDB, Couchbase has:
- Gotten out of the single-server CouchDB business.
- Donated its proprietary single-sever CouchDB intellectual property to the Apache Foundation.
- The 150ish new customers in 2011 Couchbase brags about are real, subscription customers.
- Couchbase has 60ish people, headed to >100 over the next few months.
Categories: Basho and Riak, Cassandra, Couchbase, CouchDB, DataStax, Market share and customer counts, MongoDB, NoSQL, Open source, Parallelization, Web analytics, Zynga | 7 Comments |
Splunk update
Splunk is announcing the Splunk 4.3 point release. Before discussing it, let’s recall a few things about Splunk, starting with:
- Splunk is first and foremost an analytic DBMS …
- … used to manage logs and similar multistructured data.
- Splunk’s DML (Data Manipulation Language) is based on text search, not on SQL.
- Splunk has extended its DML in natural ways (e.g., you can use it to do calculations and even some statistics).
- Splunk bundles some (very) basic, Splunk-specific business intelligence capabilities.
- The paradigmatic use of Splunk is to monitor IT operations in real time. However:
- There also are plenty of non-real-time uses for Splunk.
- Splunk is proudest of its growth in non-IT quasi-real-time uses, such as the marketing side of web operations.
As in any release, a lot of Splunk 4.3 is about “Oh, you didn’t have that before?” features and Bottleneck Whack-A-Mole performance speed-up. One performance enhancement is Bloom filters, which are a very hot topic these days. More important is a switch from Flash to HTML5, so as to accommodate mobile devices with less server-side rendering. Splunk reports that its users — especially the non-IT ones — really want to get Splunk information on the tablet devices. While this somewhat contradicts what I wrote a few days ago pooh-poohing mobile BI, let me hasten to point out:
- Splunk is used for a lot of (quasi) real-time monitoring.
- Splunk’s desktop user interfaces are, by BI standards, quite primitive.
That’s pretty much the ideal scenario for mobile BI: Timeliness matters and prettiness doesn’t.