Log analysis
Discussion of how data warehousing and analytic technologies are applied to logfile analysis. Related subjects include:
- The use of analytic technologies to study web and network event data
Citrusleaf RTA
Citrusleaf has released an add-on product called Citrusleaf RTA (Real-Time Attribution). It’s to be used when:
- You want to update dashboards within a minute.
- You want to update predictive models fairly quickly (within the hour?), although it’s not clear to me how much the models are being updated or changed with that latency.
The metrics envisioned are:
- 100 or so ad impressions per person …
- … for 1 billion or so people …
- … stored for 30-90 days …
- … where each ad impression is a fairly short record …
- … stored on disk …
- … but indexed in a way so that the index can fit into RAM.
- 50,000-100,000 writes per second. (I didn’t ask on how much hardware.)
- Several hundred reads per second.
A consistent relational schema is NOT assumed.
Citrusleaf’s solution is:
- Have one index entry for each of the 1 billion people.
- Bang each new object/record to disk. Include in it a pointer to the previous object/record for the same person.
- Each time a new object/record is added, update the index in place so that it now points to the new one. Hence, the index is sized according to the number of people, not according to the total number of objects/records.
- Eventually let objects/records age off in the obvious way.
The downside is that when you do read 100 objects/records per person, you might need to do 100 seeks.
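To make the scheme concrete, here is a minimal Python sketch of the per-person chained-record layout described above. The class and field names are mine, not Citrusleaf’s, and real durability, paging, and age-off logic are elided.

```python
# Toy model of the per-person chained-record layout (not Citrusleaf code).

class ImpressionLog:
    def __init__(self):
        self.disk = []    # append-only record store, standing in for disk
        self.index = {}   # RAM index: person_id -> offset of newest record

    def write(self, person_id, impression):
        prev_offset = self.index.get(person_id)   # None for a person's first record
        offset = len(self.disk)
        # Each record carries a back-pointer to that person's previous record.
        self.disk.append({"impression": impression, "prev": prev_offset})
        self.index[person_id] = offset            # update the index in place

    def read_all(self, person_id):
        """Walk the chain newest-to-oldest; each hop costs roughly one disk seek."""
        offset, out = self.index.get(person_id), []
        while offset is not None:
            record = self.disk[offset]
            out.append(record["impression"])
            offset = record["prev"]
        return out

log = ImpressionLog()
log.write("person-1", {"ad": "a17", "ts": 1})
log.write("person-1", {"ad": "a42", "ts": 2})
print(log.read_all("person-1"))   # newest first: a42, then a17
```

The index stays proportional to the number of people while the record store grows with total impressions, which is exactly the trade-off the design is making.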
Temporal data, time series, and imprecise predicates
I’ve been confused about temporal data management for a while, because there are several different things going on.
- Date arithmetic. This of course has been around for a very long — er, for a very long time.
- Time-series-aware compression. This has been around for quite a while too.
- “Time travel”/snapshotting — preserving the state of the database at previous points in time. This is a matter of exposing (and not throwing away) the information you capture via MVCC (Multi-Version Concurrency Control) and/or append-only updates (as opposed to update-in-place). Those update strategies are increasingly popular for pretty much anything except update-intensive OLTP (OnLine Transaction Processing) DBMS, so time-travel/snapshotting is an achievable feature for most vendors.
- Bitemporal data access. This occurs when a fact has both a transaction timestamp and a separate validity duration. A Wikipedia article seems to cover the subject pretty well, and I touched on Teradata’s bitemporal plans back in 2009.
- Time series SQL extensions. Vertica explained its version of these to me a few days ago. I imagine Sybase IQ and other serious financial-trading market players have similar features.
In essence, the point of time series/event series SQL functionality is to do SQL against incomplete, imprecise, or derived data.* Read more
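For illustration, here is a minimal Python sketch of the kind of gap-filling such extensions perform, producing rows for time slices in which no event was actually recorded. The function and its carry-last-value-forward behavior are my own illustration, not any vendor’s SQL syntax.

```python
# Toy gap-filling over an irregular event series (illustrative only;
# real time-series SQL extensions do this work inside the DBMS).

def gapfill(events, start, stop, step):
    """events: sorted (timestamp, value) pairs with gaps between them.
    Returns one row per time slice, carrying the last observed value forward."""
    out, last_value, i = [], None, 0
    for t in range(start, stop, step):
        while i < len(events) and events[i][0] <= t:
            last_value = events[i][1]    # most recent actual observation
            i += 1
        out.append((t, last_value))      # a derived row, not an observed one
    return out

trades = [(1, 10.0), (2, 10.5), (7, 11.0)]   # nothing traded at times 3-6
print(gapfill(trades, start=0, stop=9, step=1))
# [(0, None), (1, 10.0), (2, 10.5), (3, 10.5), ..., (7, 11.0), (8, 11.0)]
```

The rows for times 3 through 6 are derived rather than observed, which is the “incomplete, imprecise, or derived data” point in action.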
Categories: Analytic technologies, Data types, Investment research and trading, Log analysis, Sybase, Telecommunications, Theory and architecture, Vertica Systems | 2 Comments |
Columnar DBMS vendor customer metrics
Last April, I asked some columnar DBMS vendors to share customer metrics. They answered, but it took until now to iron out a couple of details. Overall, the answers are pretty impressive. Read more
Infobright 4.0
Infobright is announcing its 4.0 release, with imminent availability. In marketing and product alike, Infobright is betting the farm on machine-generated data. This hasn’t been Infobright’s strategy from the get-go, but it is these days, with pretty good focus and commitment. While some fraction of Infobright’s customer base is in the Sybase-IQ-like data mart market — and indeed Infobright put out a customer-win press release in that market a few days ago — Infobright’s current customer targets seem to be mainly:
- Web companies, many of which are already MySQL users.
- Telecommunication and similar log data, especially in OEM relationships.
- Trading/financial services, especially at mid-tier companies.
Key aspects of Infobright 4.0 include: Read more
Categories: Data warehousing, Database compression, Infobright, Investment research and trading, Log analysis, Open source, Telecommunications, Web analytics | 8 Comments |
Dirty data, stored dirt cheap
A major driver of Hadoop adoption is the “big bit bucket” use case. Users take a whole lot of data, often machine-generated data in logs of different kinds, and dump it into one place, managed by Hadoop, at open-source pricing. Hadoop hardware doesn’t need to be that costly either. And once you get that data into Hadoop, there are a whole lot of things you can do with it.
Of course, there are various outfits who’d like to sell you not-so-cheap bit buckets. Contending technologies include Hadoop appliances (which I don’t believe in), Splunk (which in many use cases I do), and MarkLogic (ditto, but often the cases are different from Splunk’s). Cloudera and IBM, among other vendors, would also like to sell you some proprietary software to go with your standard Apache Hadoop code.
So the question arises — why would you want to spend serious money to look after your low-value data? The answer, of course, is that maybe your log data isn’t so low-value. Read more
Categories: Hadoop, Investment research and trading, Log analysis, Splunk | 9 Comments |
eBay followup — Greenplum out, Teradata > 10 petabytes, Hadoop has some value, and more
I chatted with Oliver Ratzesberger of eBay around a Stanford picnic table yesterday (the XLDB 4 conference is being held at Jacek Becla’s home base of SLAC, which used to stand for “Stanford Linear Accelerator Center”). Todd Walter of Teradata also sat in on the latter part of the conversation. Things I learned included: Read more
Categories: Data warehousing, Derived data, eBay, Greenplum, Hadoop, HBase, Log analysis, Petabyte-scale data management, Teradata | 30 Comments |
Big Data is Watching You!
There’s a boom in large-scale analytics. The subjects of this analysis may be categorized as:
- People
- Financial trades
- Electronic networks
- Everything else
The most varied, interesting, and valuable of those four categories is the first one.
Nested data structures keep coming up, especially for log files
Nested data structures have come up several times now, almost always in the context of log files.
- Google has published about a project called Dremel. Per Tasso Argyros, one of Dremel’s key concepts is nested data structures.
- Those arrays that the XLDB/SciDB folks keep talking about are meant to be nested data structures. Scientific data is of course log-oriented. eBay was very interested in that project too.
- Facebook’s log files have a big nested data structure flavor.
I don’t have a grasp yet on what exactly is happening here, but it’s something.
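For concreteness, here is a hedged sketch of what a nested, repeated log record might look like; the field names are illustrative and are not drawn from Dremel, Facebook, or SciDB.

```python
# Illustrative nested web-log record: repeated groups inside repeated groups,
# which is awkward to flatten into a single relational row.
page_view = {
    "user_id": 12345,
    "session": {"id": "abc", "start_ts": 1286000000},
    "requests": [                      # repeated group: one per HTTP request
        {
            "url": "/search?q=dbms",
            "status": 200,
            "ads_shown": [             # nested repeated group within a request
                {"ad_id": "a17", "clicked": False},
                {"ad_id": "a42", "clicked": True},
            ],
        },
        {"url": "/results/2", "status": 200, "ads_shown": []},
    ],
}

# Flattening this into relational rows forces either NULL-heavy join tables or
# per-request/per-ad fan-out; a nested representation keeps the event whole.
clicked = [ad["ad_id"]
           for req in page_view["requests"]
           for ad in req["ads_shown"] if ad["clicked"]]
print(clicked)   # ['a42']
```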
Categories: eBay, Facebook, Google, Log analysis, Scientific research, Theory and architecture | 7 Comments |
Cassandra technical overview
Back in March, I talked with Jonathan Ellis of Rackspace, who runs the Apache Cassandra project. I started drafting a blog post then, but never put it up. Then Jonathan cofounded Riptano, a company to commercialize Cassandra, and so I talked with him again in May. Well, I’m finally finding time to clear my Cassandra/Riptano backlog. I’ll cover the more technical parts below, and the more business- or usage-oriented ones in a companion Cassandra/Riptano post.
Jonathan’s core claims for Cassandra include:
- Cassandra is shared-nothing.
- Cassandra has good approaches to replication and partitioning, right out of the box.
- In particular, Cassandra is good for use cases that distribute a database around the world and want to access it at “local” latencies. (Indeed, Jonathan asserts that non-local replication is a significant non-big-data Cassandra use case.)
- Cassandra’s scale-out is application-transparent, unlike sharded MySQL’s.
- Cassandra is fast at both appends and range queries, which would be hard to accomplish in a pure key-value store.
In general, Jonathan positions Cassandra as being best-suited to handle a small number of operations at high volume, throughput, and speed. The rest of what you do, as far as he’s concerned, may well belong in a more traditional SQL DBMS. Read more
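As a toy illustration of the append-plus-range-query claim, here is a minimal Python sketch of a Cassandra-style layout in which data is located by row key and kept sorted within each row. This is not Cassandra or DataStax driver code, just a model of the idea.

```python
import bisect
from collections import defaultdict

# Toy model: rows are located by key (and, in a real cluster, partitioned
# across nodes); within a row, entries stay sorted by column key, so appends
# are cheap and range queries are a binary search plus a contiguous scan.

class WideRowStore:
    def __init__(self):
        self.rows = defaultdict(list)   # row_key -> sorted [(col_key, value)]

    def append(self, row_key, col_key, value):
        bisect.insort(self.rows[row_key], (col_key, value))

    def range_query(self, row_key, lo, hi):
        """Inclusive range scan over integer column keys (e.g. timestamps)."""
        row = self.rows[row_key]
        return row[bisect.bisect_left(row, (lo,)):bisect.bisect_left(row, (hi + 1,))]

store = WideRowStore()
for ts, event in [(100, "login"), (105, "click"), (230, "logout")]:
    store.append("user-7", ts, event)
print(store.range_query("user-7", 100, 200))   # [(100, 'login'), (105, 'click')]
```

In Cassandra itself the row key also determines which nodes hold and replicate the row, which is what makes the scale-out application-transparent.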
Categories: Amazon and its cloud, Cassandra, DataStax, Facebook, Google, Log analysis, NoSQL, Open source, Parallelization | 4 Comments |
Why you should go to XLDB4
Scientific data commonly:
- Comes in large volumes
- Is machine-generated
- Is augmented by synthetic and/or derived data
- Has a spatial and/or temporal structure
In those respects, it is akin to some of the hottest areas for big data analytics, including:
- Investment trade data – big, partly machine generated, augmented (often), temporal
- Web/network log data – big, machine-generated, post-processed into derived form, temporal
- Marketing analytic data – big, post-processed into derived form
- Genomic data
So when Jacek Becla started the XLDB conferences on the premise that scientific and big data analytic challenges have a lot in common, he had a point. There are several tough database problems that the science-focused folks have taken the lead in thinking about, but which are soon going to matter to the commercial world as well. And that’s one of two big reasons why you should consider participating in XLDB4, October 6-7, at the SLAC facility in Menlo Park, CA, as an attendee, sponsor, or both.
The other big reason is that it is important for the world that XLDB succeed. Read more