Splunk and inverted-list indexing
Some technical background about Splunk
In an October, 2009 technical introduction to Splunk, I wrote (emphasis added):
Splunk software both reads logs and indexes them. The same code runs both on the nodes that do the indexing and on machines that simply emit logs.
It turns out that the bolded part was changed several years ago. However, I don’t have further details, so let’s move on to Splunk’s DBMS-like aspects.
I also wrote:
The fundamental thing that Splunk looks at is an increment to a log – i.e., whatever has been added to the log since Splunk last looked at it.
That remains true. Confusingly, Splunk refers to these log increments as “rows”, even though they’re really structured and queried more like documents.
I further wrote:
Splunk has a simple ILM (Information Lifecycle Management) story based on time. I didn’t probe for details.
Splunk’s ILM story turns out to be simple indeed.
- As data streams in, Splunk adds it to the most recent — “hot” — bucket. Once a bucket is full, it becomes immutable — “warm” — and a new hot bucket is opened to receive data.
- Splunk executes queries against whichever of these time-slice buckets make sense, then unions results together as needed.
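To make the bucket scheme concrete, here is a minimal Python sketch of time-sliced storage with one hot bucket and sealed warm ones. It is an illustration only, not Splunk's implementation: the `Bucket` and `BucketStore` classes, the fixed event-count capacity, and the substring search are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Bucket:
    """One time slice of events; 'hot' while open, 'warm' once sealed."""
    capacity: int
    events: list = field(default_factory=list)  # (timestamp, text) tuples
    warm: bool = False

    def overlaps(self, start, end):
        # A bucket is relevant if its time span intersects [start, end].
        if not self.events:
            return False
        times = [t for t, _ in self.events]
        return min(times) <= end and max(times) >= start


class BucketStore:
    """Hypothetical store: appends go to the hot bucket, queries span buckets."""

    def __init__(self, capacity=3):
        self.capacity = capacity
        self.buckets = [Bucket(capacity)]

    def append(self, timestamp, text):
        hot = self.buckets[-1]
        hot.events.append((timestamp, text))
        if len(hot.events) >= hot.capacity:
            hot.warm = True                             # sealed, treated as immutable
            self.buckets.append(Bucket(self.capacity))  # open a new hot bucket

    def search(self, start, end, term):
        hits = []
        for bucket in self.buckets:
            if bucket.overlaps(start, end):  # skip irrelevant time slices
                hits.extend(
                    (t, text) for t, text in bucket.events
                    if start <= t <= end and term in text
                )
        return hits  # union of the per-bucket results


store = BucketStore(capacity=3)
for ts in range(1, 8):
    store.append(ts, f"event {ts} status={'error' if ts % 2 else 'ok'}")
```

A call such as `store.search(3, 5, "error")` only touches the buckets whose time spans intersect the requested range, which is the point of slicing by time.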
Finally, I wrote:
I get the impression that most Splunk entity extraction is done at search time, not at indexing time. Splunk says that, if a <name, value> pair is clearly marked, its software does a good job of recognizing same. Beyond that, fields seem to be specified by users when they define searches.
and
I have trouble understanding how Splunk could provide flexible and robust reporting unless it tokenized and indexed specific fields more aggressively than I think it now does.
The point of what I in October, 2013 called
a high(er)-performance data store into which you can selectively copy columns of data
and which Splunk enthusiastically calls its “High Performance Analytic Store” is to meet that latter need.
Inverted-list indexing
Inverted list technology is confusing for several reasons, which start:
- It has two names that — rightly or wrongly — are used fairly interchangeably: inverted index and inverted list.
- Inverted indexes have played different roles at different times. In particular:
  - They were the architecture of the best pre-relational general-purpose DBMS, namely ADABAS, Datacom/DB, and Model 204.
  - They are the core architecture of text search.
  - They are the architecture of certain document- or object-oriented DBMS, such as MarkLogic.
  - They are the core architecture of Splunk. 🙂
What’s more, inverted list technology can take several different forms.
- In the simplest case, for each of many keywords, the inverted index lists the documents that contain it. Splunk does a form of this, where the “keyword” is the field — i.e. name — in a (field, value) pair.
- Another option is to store, for each keyword or name, not just document_IDs, but additional information.
  - In the case of (field, value) pairs, the value can be stored. Splunk sometimes does that too.
  - In the case of text documents, the index can store the position(s) in the document at which the word occurs. This is irrelevant to Splunk.
- When you list all the records that have a certain field in them, and the list mentions the values, you’re getting pretty close to having a column-group NoSQL DBMS (e.g. Cassandra or HBase). Indeed, you might even be on your way to a columnar RDBMS; after all, SAP HANA grew out of a text indexing system.
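The three forms above can be sketched in a few lines of Python. The sample events, documents, and variable names are made up for illustration, and plain dictionaries stand in for whatever on-disk structures a real system would use.

```python
from collections import defaultdict

# Hypothetical events, each already parsed into (field, value) pairs.
events = {
    1: {"host": "web1", "status": "200"},
    2: {"host": "web2", "status": "500", "user": "alice"},
    3: {"host": "web1", "user": "bob"},
}

# Form 1: field name -> set of event IDs that contain the field.
field_index = defaultdict(set)
# Form 2: field name -> {event ID: value}, i.e. the list also carries values.
field_value_index = defaultdict(dict)

for event_id, pairs in events.items():
    for name, value in pairs.items():
        field_index[name].add(event_id)
        field_value_index[name][event_id] = value

# Form 3 (text search): word -> {doc ID: [positions of the word]}.
docs = {1: "error reading disk", 2: "disk full error error"}
positional_index = defaultdict(dict)
for doc_id, text in docs.items():
    for pos, word in enumerate(text.split()):
        positional_index[word].setdefault(doc_id, []).append(pos)
```

Note that `field_value_index["host"]` is, in effect, one column of a sparse table keyed by event ID, which is why the second form starts to resemble a column-oriented store.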
Splunk, HPAS, and inverted indexes
With all that background, we can finally summarize Splunk’s “High Performance Analytic Store” story.
- Splunk’s classic data store is an inverted list system that:
  - Tracks (field, value) pairs for a few fields that are always the same, such as Source_System.
  - Otherwise tracks fields only.
- Splunk HPAS is an inverted list system that tracks (field, value) pairs for arbitrary fields. This gives much higher performance for queries that SELECT on or GROUP BY those fields.
- As of Splunk 6, Splunk Classic and Splunk HPAS are tightly and almost transparently integrated.
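A rough Python sketch shows why tracking (field, value) pairs helps SELECT and GROUP BY style queries. Everything here is an assumption made for illustration: the `name=value` event format, the `parse` helper, and both index layouts. It is not a description of Splunk's actual storage.

```python
from collections import defaultdict

# Hypothetical raw log events with clearly marked name=value pairs.
raw_events = {
    1: "host=web1 status=200",
    2: "host=web2 status=500",
    3: "host=web1 status=500",
    4: "host=web1",
}


def parse(raw):
    # Search-time field extraction from name=value pairs.
    return dict(pair.split("=", 1) for pair in raw.split())


fields_only = defaultdict(set)   # field -> event IDs
field_values = defaultdict(set)  # (field, value) -> event IDs
for event_id, raw in raw_events.items():
    for name, value in parse(raw).items():
        fields_only[name].add(event_id)
        field_values[(name, value)].add(event_id)


def count_by_status_fields_only(host):
    # The index only narrows to events that have both fields;
    # values must be re-extracted from each raw event.
    counts = defaultdict(int)
    for event_id in fields_only["host"] & fields_only["status"]:
        event = parse(raw_events[event_id])
        if event["host"] == host:
            counts[event["status"]] += 1
    return dict(counts)


def count_by_status_field_values(host):
    # The index answers the query alone; raw events are never touched.
    host_ids = field_values.get(("host", host), set())
    counts = {}
    for (name, value), ids in field_values.items():
        if name == "status":
            n = len(ids & host_ids)
            if n:
                counts[value] = n
    return counts
```

Both functions return the same counts for a given host, but the second does it with set intersections over the index alone, while the first has to go back and re-parse every candidate event.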
While I haven’t probed for full specifics, I did gather:
- Queries execute against both data stores at once, without any syntax change. At least, they do if you press some button; that’s the “almost” in the transparency.
- HPAS time-slices the data it stores by the same time intervals that Splunk Classic does. Hence for each time range, integrated Splunk can interrogate the HPAS first and, if it can’t answer, go to the slower traditional Splunk store.
- There are two basic ways to populate the HPAS:
  - As the data streams in.
  - Via the result sets of Splunk queries. Splunk talks as if this is the preferred way, which fits with Splunk’s long-time argument that it’s nice not to have to make any schema choices before you start streaming the data in.
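The "try the HPAS first, fall back to the classic store" behavior can be sketched as below. This is a guess at the general shape, not Splunk's code: the function names, the hourly slice labels, and the dictionary layouts for both stores are hypothetical.

```python
def query_slice(time_slice, field, value, hpas, classic):
    """Count events in one time slice where field == value."""
    summary = hpas.get(time_slice, {})
    if field in summary:
        # Fast path: the accelerated store holds this field for this slice,
        # as value -> set of event IDs.
        return len(summary[field].get(value, ())), "hpas"
    # Slow path: scan the classic store and check the field at search time.
    events = classic.get(time_slice, [])
    return sum(1 for e in events if e.get(field) == value), "classic"


def query(time_slices, field, value, hpas, classic):
    total, sources = 0, []
    for ts in time_slices:
        count, source = query_slice(ts, field, value, hpas, classic)
        total += count
        sources.append(source)
    return total, sources


# Hypothetical data: classic holds parsed events per hourly slice;
# hpas has been populated only for the "10:00" slice.
classic = {
    "10:00": [{"status": "500"}, {"status": "200"}, {"status": "500"}],
    "11:00": [{"status": "500"}, {"status": "200"}],
}
hpas = {"10:00": {"status": {"500": {1, 3}, "200": {2}}}}

total, sources = query(["10:00", "11:00"], "status", "500", hpas, classic)
```

Because both stores are sliced by the same time intervals, the choice between fast and slow path can be made independently per slice, and the per-slice answers simply add up.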
Comments
One Response to “Splunk and inverted-list indexing”
Curt, great to see the topic of inverted indexes being discussed. While they have mostly been used in text retrieval systems in recent times, I believe they can serve a great purpose in modern, big-data-scale analytical DBMSs. At JethroData we took it to the extreme and actually built an inverted index for every column in a table.
Inverted indexes are suitable for a SELECT with a narrowing WHERE clause, and a column store is necessary for efficient complex aggregation. The combination of both is a great base for building a high-performance RDBMS that can make accurate optimization choices based on complete column statistics.