Log analysis
Discussion of how data warehousing and analytic technologies are applied to logfile analysis. Related subjects include:
- The use of analytic technologies to study web and network event data
Citrusleaf RTA
Citrusleaf has released an add-on product called Citrusleaf RTA (Real-Time Attribution). It’s to be used when:
- You want to update dashboards within a minute.
- You want to update predictive models fairly quickly (within the hour?), although it’s not clear to me how much the models are being updated or changed with that latency.
The metrics envisioned are:
- 100 or so ad impressions per person …
- … for 1 billion or so people …
- … stored for 30-90 days …
- … where each ad impression is a fairly short record …
- … stored on disk …
- … but indexed in a way so that the index can fit into RAM.
- 50,000-100,000 writes per second. (I didn’t ask on how much hardware.)
- Several hundred reads per second.
A consistent relational schema is NOT assumed.
Citrusleaf’s solution is:
- Have one index entry for each of the 1 billion people.
- Bang each new object/record to disk. Include in it a pointer to the previous object/record for the same person.
- Each time a new object/record is added, update the index in place so that it now points to the new one. Hence, the index is sized according to the number of people, not according to the total number of objects/records.
- Eventually let objects/records age off in the obvious way.
The downside is that when you do read 100 objects/records per person, you might need to do 100 seeks.
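To make the scheme concrete, here is a minimal Python sketch of the per-person chained-record layout described above. The class and field names are mine, not Citrusleaf’s, and real durability, paging, and age-off logic are elided.

```python
# Toy model of the per-person chained-record layout (not Citrusleaf code).

class ImpressionLog:
    def __init__(self):
        self.disk = []    # append-only record store, standing in for disk
        self.index = {}   # RAM index: person_id -> offset of newest record

    def write(self, person_id, impression):
        prev_offset = self.index.get(person_id)   # None for a person's first record
        offset = len(self.disk)
        # Each record carries a back-pointer to that person's previous record.
        self.disk.append({"impression": impression, "prev": prev_offset})
        self.index[person_id] = offset            # update the index in place

    def read_all(self, person_id):
        """Walk the chain newest-to-oldest; each hop costs roughly one disk seek."""
        offset, out = self.index.get(person_id), []
        while offset is not None:
            record = self.disk[offset]
            out.append(record["impression"])
            offset = record["prev"]
        return out

log = ImpressionLog()
log.write("person-1", {"ad": "a17", "ts": 1})
log.write("person-1", {"ad": "a42", "ts": 2})
print(log.read_all("person-1"))   # newest first: a42, then a17
```

The index stays proportional to the number of people while the record store grows with total impressions, which is exactly the trade-off the design is making.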
Temporal data, time series, and imprecise predicates
I’ve been confused about temporal data management for a while, because there are several different things going on.
- Date arithmetic. This of course has been around for a very long — er, for a very long time.
- Time-series-aware compression. This has been around for quite a while too.
- “Time travel”/snapshotting — preserving the state of the database at previous points in time. This is a matter of exposing (and not throwing away) the information you capture via MVCC (Multi-Version Concurrency Control) and/or append-only updates (as opposed to update-in-place). Those update strategies are increasingly popular for pretty much anything except update-intensive OLTP (OnLine Transaction Processing) DBMS, so time-travel/snapshotting is an achievable feature for most vendors.
- Bitemporal data access. This occurs when a fact has both a transaction timestamp and a separate validity duration. A Wikipedia article seems to cover the subject pretty well, and I touched on Teradata’s bitemporal plans back in 2009.
- Time series SQL extensions. Vertica explained its version of these to me a few days ago. I imagine Sybase IQ and other serious financial-trading market players have similar features.
In essence, the point of time series/event series SQL functionality is to do SQL against incomplete, imprecise, or derived data.* Read more
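For illustration, here is a minimal Python sketch of the kind of gap-filling such extensions perform, producing rows for time slices in which no event was actually recorded. The function and its carry-last-value-forward behavior are my own illustration, not any vendor’s SQL syntax.

```python
# Toy gap-filling over an irregular event series (illustrative only;
# real time-series SQL extensions do this work inside the DBMS).

def gapfill(events, start, stop, step):
    """events: sorted (timestamp, value) pairs with gaps between them.
    Returns one row per time slice, carrying the last observed value forward."""
    out, last_value, i = [], None, 0
    for t in range(start, stop, step):
        while i < len(events) and events[i][0] <= t:
            last_value = events[i][1]    # most recent actual observation
            i += 1
        out.append((t, last_value))      # a derived row, not an observed one
    return out

trades = [(1, 10.0), (2, 10.5), (7, 11.0)]   # nothing traded at times 3-6
print(gapfill(trades, start=0, stop=9, step=1))
# [(0, None), (1, 10.0), (2, 10.5), (3, 10.5), ..., (7, 11.0), (8, 11.0)]
```

The rows for times 3 through 6 are derived rather than observed, which is the “incomplete, imprecise, or derived data” point in action.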
Categories: Analytic technologies, Data types, Investment research and trading, Log analysis, Sybase, Telecommunications, Theory and architecture, Vertica Systems | 2 Comments |
Columnar DBMS vendor customer metrics
Last April, I asked some columnar DBMS vendors to share customer metrics. They answered, but it took until now to iron out a couple of details. Overall, the answers are pretty impressive. Read more
Infobright 4.0
Infobright is announcing its 4.0 release, with imminent availability. In marketing and product alike, Infobright is betting the farm on machine-generated data. This hasn’t been Infobright’s strategy from the get-go, but it is these days, with pretty good focus and commitment. While some fraction of Infobright’s customer base is in the Sybase-IQ-like data mart market — and indeed Infobright put out a customer-win press release in that market a few days ago — Infobright’s current customer targets seem to be mainly:
- Web companies, many of which are already MySQL users.
- Telecommunication and similar log data, especially in OEM relationships.
- Trading/financial services, especially at mid-tier companies.
Key aspects of Infobright 4.0 include: Read more
Categories: Data warehousing, Database compression, Infobright, Investment research and trading, Log analysis, Open source, Telecommunications, Web analytics | 8 Comments |
Dirty data, stored dirt cheap
A major driver of Hadoop adoption is the “big bit bucket” use case. Users take a whole lot of data, often machine-generated data in logs of different kinds, and dump it into one place, managed by Hadoop, at open-source pricing. Hadoop hardware doesn’t need to be that costly either. And once you get that data into Hadoop, there are a whole lot of things you can do with it.
Of course, there are various outfits who’d like to sell you not-so-cheap bit buckets. Contending technologies include Hadoop appliances (which I don’t believe in), Splunk (which in many use cases I do), and MarkLogic (ditto, but often the cases are different from Splunk’s). Cloudera and IBM, among other vendors, would also like to sell you some proprietary software to go with your standard Apache Hadoop code.
So the question arises — why would you want to spend serious money to look after your low-value data? The answer, of course, is that maybe your log data isn’t so low-value. Read more
Categories: Hadoop, Investment research and trading, Log analysis, Splunk | 9 Comments |
eBay followup — Greenplum out, Teradata > 10 petabytes, Hadoop has some value, and more
I chatted with Oliver Ratzesberger of eBay around a Stanford picnic table yesterday (the XLDB 4 conference is being held at Jacek Becla’s home base of SLAC, which used to stand for “Stanford Linear Accelerator Center”). Todd Walter of Teradata also sat in on the latter part of the conversation. Things I learned included: Read more
Categories: Data warehousing, Derived data, eBay, Greenplum, Hadoop, HBase, Log analysis, Petabyte-scale data management, Teradata | 30 Comments |
Big Data is Watching You!
There’s a boom in large-scale analytics. The subjects of this analysis may be categorized as:
- People
- Financial trades
- Electronic networks
- Everything else
The most varied, interesting, and valuable of those four categories is the first one.
Nested data structures keep coming up, especially for log files
Nested data structures have come up several times now, almost always in the context of log files.
- Google has published about a project called Dremel. Per Tasso Argyros, one of Dremel’s key concepts is nested data structures.
- Those arrays that the XLDB/SciDB folks keep talking about are meant to be nested data structures. Scientific data is of course log-oriented. eBay was very interested in that project too.
- Facebook’s log files have a big nested data structure flavor.
I don’t have a grasp yet on what exactly is happening here, but it’s something.
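For concreteness, here is a hedged sketch of what a nested, repeated log record might look like; the field names are illustrative and are not drawn from Dremel, Facebook, or SciDB.

```python
# Illustrative nested web-log record: repeated groups inside repeated groups,
# which is awkward to flatten into a single relational row.
page_view = {
    "user_id": 12345,
    "session": {"id": "abc", "start_ts": 1286000000},
    "requests": [                      # repeated group: one per HTTP request
        {
            "url": "/search?q=dbms",
            "status": 200,
            "ads_shown": [             # nested repeated group within a request
                {"ad_id": "a17", "clicked": False},
                {"ad_id": "a42", "clicked": True},
            ],
        },
        {"url": "/results/2", "status": 200, "ads_shown": []},
    ],
}

# Flattening this into relational rows forces either NULL-heavy join tables or
# per-request/per-ad fan-out; a nested representation keeps the event whole.
clicked = [ad["ad_id"]
           for req in page_view["requests"]
           for ad in req["ads_shown"] if ad["clicked"]]
print(clicked)   # ['a42']
```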
Categories: eBay, Facebook, Google, Log analysis, Scientific research, Theory and architecture | 7 Comments |
Cassandra technical overview
Back in March, I talked with Jonathan Ellis of Rackspace, who runs the Apache Cassandra project. I started drafting a blog post then, but never put it up. Then Jonathan cofounded Riptano, a company to commercialize Cassandra, and so I talked with him again in May. Well, I’m finally finding time to clear my Cassandra/Riptano backlog. I’ll cover the more technical parts below, and the more business- or usage-oriented ones in a companion Cassandra/Riptano post.
Jonathan’s core claims for Cassandra include:
- Cassandra is shared-nothing.
- Cassandra has good approaches to replication and partitioning, right out of the box.
- In particular, Cassandra is good for use cases that distribute a database around the world and want to access it at “local” latencies. (Indeed, Jonathan asserts that non-local replication is a significant non-big-data Cassandra use case.)
- Cassandra’s scale-out is application-transparent, unlike sharded MySQL’s.
- Cassandra is fast at both appends and range queries, which would be hard to accomplish in a pure key-value store.
In general, Jonathan positions Cassandra as being best-suited to handle a small number of operations at high volume, throughput, and speed. The rest of what you do, as far as he’s concerned, may well belong in a more traditional SQL DBMS. Read more
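As a toy illustration of the append-plus-range-query claim, here is a minimal Python sketch of a Cassandra-style layout in which data is located by row key and kept sorted within each row. This is not Cassandra or DataStax driver code, just a model of the idea.

```python
import bisect
from collections import defaultdict

# Toy model: rows are located by key (and, in a real cluster, partitioned
# across nodes); within a row, entries stay sorted by column key, so appends
# are cheap and range queries are a binary search plus a contiguous scan.

class WideRowStore:
    def __init__(self):
        self.rows = defaultdict(list)   # row_key -> sorted [(col_key, value)]

    def append(self, row_key, col_key, value):
        bisect.insort(self.rows[row_key], (col_key, value))

    def range_query(self, row_key, lo, hi):
        """Inclusive range scan over integer column keys (e.g. timestamps)."""
        row = self.rows[row_key]
        return row[bisect.bisect_left(row, (lo,)):bisect.bisect_left(row, (hi + 1,))]

store = WideRowStore()
for ts, event in [(100, "login"), (105, "click"), (230, "logout")]:
    store.append("user-7", ts, event)
print(store.range_query("user-7", 100, 200))   # [(100, 'login'), (105, 'click')]
```

In Cassandra itself the row key also determines which nodes hold and replicate the row, which is what makes the scale-out application-transparent.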
Categories: Amazon and its cloud, Cassandra, DataStax, Facebook, Google, Log analysis, NoSQL, Open source, Parallelization | 4 Comments |
Why you should go to XLDB4
Scientific data commonly:
- Comes in large volumes
- Is machine-generated
- Is augmented by synthetic and/or derived data
- Has a spatial and/or temporal structure
In those respects, it is akin to some of the hottest areas for big data analytics, including:
- Investment trade data – big, partly machine generated, augmented (often), temporal
- Web/network log data – big, machine-generated, post-processed into derived form, temporal
- Marketing analytic data – big, post-processed into derived form
- Genomic data
So when Jacek Becla started the XLDB conferences on the premise that scientific and big data analytic challenges have a lot in common, he had a point. There are several tough database problems that the science-focused folks have taken the lead in thinking about, but which are soon going to matter to the commercial world as well. And that’s one of two big reasons why you should consider participating in XLDB4, October 6-7, at the SLAC facility in Menlo Park, CA, as an attendee, sponsor, or both.
The other big reason is that it is important for the world that XLDB succeed. Read more