Entity-centric event series analytics
Much of modern analytic technology deals with what might be called an entity-centric sequence of events. For example:
- You receive and open various emails.
- You click on and look at various web sites and pages.
- Specific elements are displayed on those pages.
- You study various products, and even buy some.
Analytic questions are asked along the lines of “Which sequences of events are most productive in terms of leading to the events we really desire, such as product sales?” Another major area is sessionization, along with data preparation tasks that boil down to arranging data into meaningful event sequences in the first place.
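To make sessionization concrete, here is a minimal Python sketch. It assumes events arrive as (entity_id, timestamp_in_seconds, event_type) tuples and that a gap of more than 30 minutes starts a new session; both the tuple layout and the threshold are illustrative assumptions, not anything prescribed by the products mentioned in this post.

```python
from collections import defaultdict

# Assumed inactivity threshold; real systems tune this per use case.
SESSION_GAP_SECONDS = 30 * 60


def sessionize(events, gap=SESSION_GAP_SECONDS):
    """Group (entity_id, ts, event_type) tuples into per-entity sessions.

    Returns {entity_id: [session, ...]}, where each session is a
    time-ordered list of (ts, event_type) pairs.
    """
    by_entity = defaultdict(list)
    for entity_id, ts, event_type in events:
        by_entity[entity_id].append((ts, event_type))

    sessions = {}
    for entity_id, entity_events in by_entity.items():
        entity_events.sort(key=lambda e: e[0])
        entity_sessions = []
        for ts, event_type in entity_events:
            # Extend the current session if the gap since its last event is small.
            if entity_sessions and ts - entity_sessions[-1][-1][0] <= gap:
                entity_sessions[-1].append((ts, event_type))
            else:
                entity_sessions.append([(ts, event_type)])
        sessions[entity_id] = entity_sessions
    return sessions


example_events = [
    ("u1", 0, "open_email"),
    ("u1", 60, "view_page"),
    ("u1", 4000, "view_product"),
    ("u2", 10, "view_page"),
]
example_sessions = sessionize(example_events)
```

In this example, the first two events for "u1" fall into one session and the third starts a new one, because more than 30 minutes separate it from its predecessor.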
A number of my clients are focused on such scenarios, including WibiData, Teradata Aster (e.g. via nPath), Platfora (in the imminent Platfora 3), and others. And so I get involved in naming exercises. The term entity-centric came along a while ago, because “user-centric” is too limiting. (E.g., the data may not be about a person, but rather specifically about the actions taken on her mobile device.) Now I’m adding the term event series to cover the whole scenario, rather than the “event sequence(s)” I might appear to have been hinting at above.
I decided on “event series” earlier this week, after noting that:
- “Time series” isn’t quite right, because it generally refers to a collection of time-stamped data of a single datatype.
- “Event stream” isn’t quite right, because it connotes the immediacy of complex event/stream processing.
- “Series” sounds better than “sequence”. While “sequence” would be the more accurate term from a strict mathematical standpoint, that ship sailed when time series weren’t called “time sequences” instead.
And that was even before I recalled hearing the term from Vertica a couple of years ago.
Analyzing event series is tricky even when all the events are of the same kind, and hence naturally fit into the same database table. For example:
- Even the most specific of pattern-matches can, in SQL, require several nestings of time-stamp range sub-queries. (How else do you ensure that Event 2 happened after Event 1 but before Event 3?)
- The most common end-user business intelligence UIs aren’t well suited to such analyses; specific new ones are being invented instead. I think they’re already OK for static views – trees, funnels, etc. – but I haven’t seen anything yet that seems great for navigation, or for human real-time interaction.
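For contrast with the nested SQL sub-queries mentioned above, the same ordering check is short in procedural code. The following Python sketch assumes one entity's events are given as (timestamp, event_type) pairs; the function name and the event names are hypothetical, and this is not how nPath or any other product implements pattern matching.

```python
def matches_in_order(events, pattern):
    """Return True if the event types in `pattern` occur in that order,
    at strictly increasing timestamps, somewhere in `events`.

    Other events may occur in between; `events` is a list of
    (timestamp, event_type) pairs for a single entity.
    """
    last_ts = None
    idx = 0
    for ts, event_type in sorted(events, key=lambda e: e[0]):
        if idx == len(pattern):
            break
        # Greedily take the earliest event that continues the pattern.
        if event_type == pattern[idx] and (last_ts is None or ts > last_ts):
            last_ts = ts
            idx += 1
    return idx == len(pattern)


journey = [
    (1, "open_email"),
    (2, "view_page"),
    (3, "view_product"),
    (4, "purchase"),
]
found = matches_in_order(journey, ["open_email", "view_product", "purchase"])
reversed_found = matches_in_order(journey, ["purchase", "open_email"])
```

Here `found` is True and `reversed_found` is False. The sketch treats events with equal timestamps as unordered, which is one possible reading of "happened after".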
When you’re correlating events from multiple database columns or tables – or their nested data structure equivalents – things get hairier yet.
I also think that predictive modeling on event series, a huge subject for consumer internet companies, still has a long way to go. How exactly do you characterize the independent variables? For that matter, how do you characterize the dependent ones?
Bottom line: Event series are likely to be a major subject of data management and analytics innovation for a number of years to come.
Comments
20 Responses to “Entity-centric event series analytics”
An interesting new engine for real time is Kibana (http://demo.kibana.org). A very beautiful and dynamic tool that really shows the power of NoSQL and text search for analytics in human real time. All JavaScript and rendered in the browser as well.
The combo of Elasticsearch+Kibana is a really nice one-stop shop for building practical analytics on event series data, especially if there is fuzzy matching involved. We have seen a lot of success here, especially with identity resolution.
I think event series analysis is simply CEP and there is no need for another term. Or perhaps CEP could be renamed to event series analysis.
I see CEP as not necessarily about “streaming” or “immediacy”, since it’s about providing domain-specific analysis for correlation and pattern detection on events, whether historical, recorded, or arriving sooner or later.
Esper CEP customers have been using our EPL (event processing language) for performing event series analysis for 6+ years.
Product Info + docs:
http://esper.codehaus.org
Company info:
http://www.espertech.com
Thomas Bernhardt
CTO EsperTech
While MapReduce is not efficient for SQL emulation, I think it is a perfect match for this kind of processing.
It is natural in Hadoop to take multiple inputs, group them by some key, and then write custom logic in a reducer to analyze them.
I do not think the logic of processing a series of events can be expressed in SQL. If so – why should I pay for a DB license to run my own logic?
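The group-by-key-then-reduce pattern described in this comment can be sketched in plain Python, without any Hadoop API. Everything here is an illustrative assumption: the record layout (source, entity_id, timestamp, event_type), the function names, and the choice of reducer logic (seconds from an entity's first event to its first purchase).

```python
from collections import defaultdict


def map_phase(records):
    """Emit (key, value) pairs keyed by entity, regardless of input source."""
    for source, entity_id, ts, event_type in records:
        yield entity_id, (ts, source, event_type)


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(entity_id, values):
    """Custom per-entity logic: time from first event to first purchase."""
    ordered = sorted(values)  # tuples sort by timestamp first
    first_ts = ordered[0][0]
    for ts, _source, event_type in ordered:
        if event_type == "purchase":
            return entity_id, ts - first_ts
    return entity_id, None


def run(records):
    grouped = shuffle(map_phase(records))
    return dict(reduce_phase(key, values) for key, values in grouped.items())


records = [
    ("email_log", "u1", 0, "open_email"),
    ("web_log", "u1", 50, "view_page"),
    ("orders", "u1", 120, "purchase"),
    ("web_log", "u2", 5, "view_page"),
]
time_to_purchase = run(records)
```

The reducer sees all of one entity's events together, in time order, which is what makes arbitrary sequence logic easy to write in this style.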
Hi David!
I agree that this kind of processing is in most cases embarrassingly parallel (certain explorations in predictive modeling might be exceptions), which in most respects would make it a great candidate for MapReduce. However, in the ideal case a lot of it happens at interactive speeds, the most obvious example being the web page that is informed by your actions a moment ago on its predecessor. Those speeds aren’t a great fit for MapReduce — hence, for example, the Produce/Gather alternative from WibiData.
I like your thoughts on this, Curt. Thanks for sharing!
I thought you or your audience may be interested in this further reading:
How to Think About Event Data
https://keen.io/blog/53958349217/analytics-for-hackers-how-to-think-about-event-data
Event series seem to be a new spin on an old idea. The “fact tables” that are central to the traditional data warehouse contain not entities but facts. A fact, so the definition goes, is an occurrence of a business event.
Are event series different from facts? If not, can we please just stick with the old term?
I don’t disagree that data management systems can do more to manage event data. DW practitioners have been struggling with problems like sessionization for years.
Julian,
I guess we could say, at a very high level of abstraction, that everything in a database records a fact and also that everything in a database records an event.
But unless you want to say that therefore all of database management is trivial, because all databases in principle serve the same purposes, I don’t understand your point.
Curt,
“Event-series analysis” as you describe it is a data modeling technique. Of course all data modeling techniques ultimately boil down to storing real-world information as records on disk, but that doesn’t mean they’re all the same.
Entity-relationship modeling (ER) is the best known modeling technique. Entities are generally represented as tables, and relationships by either foreign keys or intersection tables. An entity in the real world, such as a person, has identity, and its state changes over time. The corresponding record in the database has a key, and is updated as its attributes change.
Dimensional modeling (DM) [1] is a different paradigm. In dimensional modeling, a fact represents an occurrence of a business process, not an entity. It doesn’t have a unique identifier, but it does have a timestamp. It is not generally updated, except to make corrections. A case can be made that facts can be aggregated to represent higher-level processes such as “monthly sales”.
My point was that “event-series analysis” has a lot in common with DM, and in particular your “event” is very similar to DM’s “fact” concept. I value your insights into how people are using data in the real world, and this particular kind of analysis could be an interesting trend. But if you want to coin a new term, you have to explain to old dogs like me why the old one will not suffice.
Julian
[1] http://www.kimballgroup.com/1997/08/02/a-dimensional-modeling-manifesto/
Julian,
Inaccurate. For starters, I’m looking at full analytic stacks. Hence my references to BI UIs and to predictive modeling.
Also, efforts may be made to accelerate certain operations that are awkward with standard RDBMS and SQL semantics. Hence my references to Aster nPath and to Vertica.
Nor is the modeling necessarily traditional. For example, in a typical relational model, a fact table will have like facts. If there are different kinds of facts stored, they will be stored in different tables, which of course may share common keys. (And if the database is big enough to be parallelized, it is often wise to distribute all the tables on that common key.) By way of contrast, somebody who’s serious about event series analytics may arrange data differently. For example, WibiData/Kiji stores all facts about a single entity in a single HBase row, likely with lots of nesting.
I think links supporting all of the above were in my post. If not, I’ll gladly post them down here.
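The one-row-per-entity arrangement mentioned in the reply above can be pictured with a small Python sketch. This is only an in-memory illustration using nested dictionaries; it does not use the Kiji or HBase APIs, and the event kinds, payload fields, and newest-first ordering are assumptions made for the example.

```python
from collections import defaultdict


def build_entity_rows(events):
    """Fold heterogeneous events into one nested "row" per entity.

    `events` holds (entity_id, ts, kind, payload) tuples. The result maps
    entity_id -> {kind: [(ts, payload), ...]}, with each list newest first.
    """
    rows = defaultdict(lambda: defaultdict(list))
    for entity_id, ts, kind, payload in events:
        rows[entity_id][kind].append((ts, payload))
    for columns in rows.values():
        for cells in columns.values():
            cells.sort(key=lambda cell: cell[0], reverse=True)
    return {entity_id: dict(columns) for entity_id, columns in rows.items()}


mixed_events = [
    ("u1", 1, "email_open", {"campaign": "spring"}),
    ("u1", 5, "page_view", {"url": "/a"}),
    ("u1", 9, "page_view", {"url": "/b"}),
    ("u2", 2, "purchase", {"sku": "x"}),
]
entity_rows = build_entity_rows(mixed_events)
```

Unlike a set of per-kind fact tables joined on a common key, everything known about "u1" is reachable from a single lookup, at the cost of a nested rather than flat structure.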
[…] them for you — are sort of like time series but also somewhat like event streams. “Event series” was the winning […]
[…] more players are doing product management with an explicit conception either of log management or event-series analytics, so for this post I’ll share that focus […]
[…] custom methodology, algorithms, and software. Monash has a nice coverage of event stream analytics here. Most people confuse event stream analytics with complex event processing, or time series analysis […]
[…] Datameer does have a bit in the way of event series visualization, it seems […]
[…] arrives as a kind of time series. Many (but not all) time series call for a strong commitment to event-series styles of analytics. Event series analytics are a challenge for relational DBMS, but Vertica and […]
[…] Platfora is focusing a bit, starting with clickstream and security — i.e., event series stuff. And by the way, they report that the term “event series” is working well for […]
[…] a lot of innovation relevant to the analytic side, in areas such as streaming, low-latency BI, event series analytics, and BI/predictive modeling […]
[…] of Rocana, to my tastes, is the actual analytics. Truth be told, I find almost any well thought out event-series analytics story cool. It’s an area much less mature than relational business intelligence, and […]
[…] analysis is much more primitive. Event series interfaces may be the closest thing to an […]
[…] or more concisely Datastream Manager. “Event”, “event stream” or “event series” could perhaps be mixed in as well. I don’t really have an opinion yet, and probably […]
[…] does ad-hoc event series analytics, which they call “interactive behavioral analytics […]