Technical introduction to Splunk
As noted in my other introductory post, Splunk sells software called Splunk, which is used for log analysis. These can be logs of various kinds, but for the purpose of understanding Splunk technology, it’s probably OK to assume they’re clickstream/network event logs. In addition, Splunk seems to have some aspirations of having its software used for general schema-free analytics, but that’s in early days at best.
Splunk’s core technology indexes text and XML files or streams, especially log files. Technical highlights of that part include:
- Splunk software both reads logs and indexes them. The same code runs both on the nodes that do the indexing and on machines that simply emit logs. However, in the latter case indexing is turned off. Thus, Splunk does not portray its software as “agentless.” However, it asserts that its agent-like software runs without “material” overhead.
- The fundamental thing that Splunk looks at is an increment to a log – i.e., whatever has been added to the log since Splunk last looked at it.
- Splunk tries to figure out what the individual entries are in a section of log it looks at. In particular:
- Time stamps are a big clue in this “inferencing” process, but they are not the be-all and end-all.
- Nor are line boundaries, if logs are naturally broken up into lines. (Splunk threw that latter comment in as a shot at SenSage.)
- I get the impression that most Splunk entity extraction is done at search time, not at indexing time. Splunk says that, if a <name, value> pair is clearly marked, its software does a good job of recognizing it. Beyond that, fields seem to be specified by users when they define searches. (See the sketch just after this list.)
- Splunk has a simple ILM (Information Lifecycle Management) story based on time. I didn't probe for details.
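To make the search-time extraction point concrete, here is a hypothetical sketch (the log format and field names are invented for illustration). Given a log entry whose name=value pairs are clearly marked, such as:

    2009-10-05 12:03:41 action=login status=failure user=bob src=10.1.2.3

a search should be able to reference the recognized fields directly, along the lines of:

    status=failure | top user

with no schema having been defined for the log in advance.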
Given its text search engine, Splunk does – well, it does text searches. And it stores searches, so they can be used for alerting or reporting. Indeed, Splunk persists and presumably updates the results of stored searches, in a rough analog to materialized views.
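As a hypothetical sketch of a stored search used for alerting (the source type, search terms, threshold, and schedule are all invented), something like:

    sourcetype=syslog "failed password" | stats count by host | where count > 20

could be saved and run every few minutes, firing an alert whenever any host exceeds the threshold; the persisted results of the same stored search could equally feed a report.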
Apparently, Splunk's indexing is typically done via MapReduce jobs. I don't know whether any actual Splunk searches are also done via MapReduce; surely they aren't all, given the discussion of a near-real-time alerting engine and so on. Splunk fondly believes its MapReduce is an order of magnitude faster than SQL (I didn't ask which SQL engines Splunk has in mind when it says this), and 5-10X faster than Hadoop. One efficiency trick is to look ahead and do Reduces in place where possible. This seems to be done automatically in the execution plan, a la Aster's SQL-MapReduce, rather than having to be hand-coded. Splunk says its software can "easily" index 100-200 gigabytes of data per day on a commodity 8-core server, while maintaining an active search load, and that 300-400 gigabytes are doable.
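To illustrate the Reduce-in-place idea with a simple, hypothetical search:

    error | stats count by sourcetype

Conceptually, each node that holds part of the index can compute its own partial counts per source type (a local Reduce over its share of the Map output), and only those small partial tables need to be shipped back and merged into the final result, rather than all of the matching raw events.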
Splunk’s capabilities right now in tabular-style analytics seem to be limited to a command-line report builder, plus a GUI wizard that generates the command line. A few users have asked for support of third-party business intelligence tools, but Splunk hasn’t provided that yet. Nor can I find much evidence of ODBC/JDBC drivers for Splunk. But then, I have trouble understanding how Splunk could provide flexible and robust reporting unless it tokenized and indexed specific fields more aggressively than I think it now does.
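For a sense of what such a generated command line might look like, a hypothetical report over web access logs (assuming the data carries a status field) could be:

    sourcetype=access_combined | timechart span=1d count by status

i.e., a daily event count broken out by HTTP status code, which the GUI would then render as a table or chart.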
Comments
Sorry if I missed it, but do the spelunkers use a commercial backend, or did they roll it all on their own on the storage side?
Thanks.
J.
Hello Curt and Jerome… to clarify the answers to both questions…
Our search execution uses MapReduce for all statistical analysis, whether on demand when users search against the raw unsummarized index data, or on a scheduled basis into "summary indexes", our version of materialized views. The latter may be how you picked up that we use MapReduce for indexing.
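Roughly, that summary-index pattern amounts to a scheduled search that pre-aggregates data and writes its results into a separate index – for example (field and index names invented for illustration, and the exact command details may differ):

    sourcetype=access_combined | stats count by status | collect index=web_summary

with later reports running against web_summary instead of the raw data, which is where the materialized-view analogy comes from.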
Re storage – we’ve built our own indexing technology and datastore – we rely on nothing more than the filesystem.
Hope this clarifies.
Nice post. I thought I'd try to clarify how our search results and tabular data work.
As you point out, most of the time you interact with Splunk by building and saving searches, usually through a simple and interactive process.
A search can be as simple as "failed login", which will search our index using keywords much like the way Google will search the web for "failed login", except that Splunk will return log events, config files, network packets, etc., that contain those terms. Unlike a web search engine, Splunk will turn the results of any search into a table whose columns are either auto-detected or specified by a user in advance. Auto-detection works by looking for patterns in the data like key=value or key:value. User-defined extractions can be set up in advance by specifying a regex, or a user can use the UI to define a field. I'll skip all the nice ways users can do this, but it's usually easy to extract fields if Splunk does not do so automatically.
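For example (a rough sketch only – the log format, regex, and field names are invented), a user-defined search-time extraction over ssh authentication logs might use a regex-based command such as rex:

    sshd "Failed password" | rex "Failed password for (?<user>\S+) from (?<src>\S+)" | top user

pulling user and src out of the raw text on the fly, and then reporting on the extracted user field.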
I use the following example: suppose in Google I could say, "What is the average price of Pad Thai in San Francisco, broken out by zip code, over the past 6 months?" Something like Google would have a hard time doing that, but that is a typical Splunk search – analyzing Pad Thai prices in Splunk is not common, but someone must have tried ;-).
The Splunk search language supports piping from one search command to the next. A table is the output of one command and the input to the next, and the whole pipeline is executed in our MapReduce framework. The above example "failed login" defaults to "| search failed login", since a search without a "|" defaults to the "search" command. The results of "failed login" return both the raw data, so that users can see their log events, config files, etc., and a table. That table can be extremely sparse if the results are heterogeneous, or dense if they all come from the same source.
Splunk has dozens of useful commands to make reporting easy – for example, we could extend the above to "failed login | top username", and the first table of results is piped through the "top" command, which quickly calculates an aggregate statistic listing a table of top usernames. Top is just one of many commands that you can easily string together to build reporting and analysis for dashboards or for alerting purposes. We have filtering commands like search, where, and dedup. We have enriching commands like eval, extract, lookup, delta, fillnull, etc. We have reporting commands like stats, chart, timechart, rare, etc. And we have other transforming commands for extracting transactions, clustering, sorting, etc. All very easy to use, and they work out of the box on any time-series data.
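To string a few of those command categories together in one hypothetical example (the bytes and clientip fields are assumed, as in typical web access logs):

    sourcetype=access_combined | eval mb=bytes/1024/1024 | stats sum(mb) as total_mb by clientip | sort -total_mb | head 10

filter, enrich with eval, aggregate with stats, then rank – with each step's output table becoming the next step's input.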
Lastly, we are looking at providing a SQL interface to Splunk, so that tools that speak ODBC/JDBC can query Splunk.
Not sure whether this comment helps any, but it's important to understand how our search language works out of the box for big data.
Great post. The technology and architecture of Splunk are interesting. It looks like a useful tool for a sysadmin.
I tried the ‘failed login’ report on some of my system logs and it picked up the messages that had both the words ‘failed’ and ‘login’, but it didn’t pick up the applications that had ‘login failure’ or ‘failed logon’. I tried the word ‘fail’ but that didn’t register ‘failed’ or ‘failure’ either.
From what I can tell, it is most useful in a homogeneous environment where you are very knowledgeable about the log format and contents before you run the queries.
Hi Tom,
Try using a wildcard search like fail*. Also, if you want an exact phrase, try adding quotes – "failed login". The usual suspects like NOT and OR work as well.
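So for your case, something along these lines (illustrative only) should catch the variant messages and show where they come from:

    fail* OR "logon failure" | stats count by host, sourcetype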
I agree that it helps to know what's in your logs – but I find the opposite: heterogeneous data is more *interesting*. Splunk doesn't need any parsing rules or a predetermined schema, so you can dump in any data. I index all my logs, all my config files, the output of commands like vmstat, iostat, and top, network traffic, as well as the mail in my inbox, and so on. It's most interesting to splunk across all sorts of datasets, as there are often interesting relationships between them. I know people who throw in pitch-by-pitch baseball stats, global windmill power plant output, protein prediction data, and on and on – it's not just IT data.
One thing we are working on is a Guide to finding stuff in your data. I hope this will help people who pick up Splunk and throw data at it to quickly find interesting information. I'll re-post when the guide is ready.
Feel free to bug me if you have specific questions on usage and thanks for the comments.
e
[…] was interesting to see Curt Monash, veteran database analyst and guru, post about splunk. It was a very short introduction to Splunk, but our appearance on his list signals our entry into […]
Splunk is conceptually a great product, but there are a couple of gotchas:
1) Query performance is dismal on even moderately sized data sets. It’s not a database, doesn’t have indexes, etc. I wanted to love Splunk, but the query performance just wasn’t there for exploring data. It’s primarily a batch-mode reporting tool. I could live with that, except…
2) Pricing. Splunk gets expensive fast, and the price is not well-aligned with the amount of value it delivers. I’d say it’s about twice as expensive as it should be.
Query performance is normally good, even on very large databases. If you experienced something else, then some troubleshooting was needed.
All data is indexed.
Can’t speak to pricing.
[…] an October, 2009 technical introduction to Splunk, I wrote (emphasis added): Splunk software both reads logs and indexes them. The same code runs […]