Log analysis
Discussion of how data warehousing and analytic technologies are applied to logfile analysis. Related subjects include:
- The use of analytic technologies to study web and network event data
Big data terminology and positioning
Recently, I observed that Big Data terminology is seriously broken. It is reasonable to reduce the subject to two quasi-dimensions:
- Bigness — Volume, Velocity, size
- Structure — Variety, Variability, Complexity
given that
- High-velocity “big data” problems are usually high-volume as well.*
- Variety, variability, and complexity all relate to the simply-structured/poly-structured distinction.
But the conflation should stop there.
*Low-volume/high-velocity problems are commonly referred to as “event processing” and/or “streaming”.
When people claim that bigness and structure are the same issue, they oversimplify into mush. So I think we need four pieces of terminology, reflective of a 2×2 matrix of possibilities. For want of better alternatives, my suggestions are:
- Relational big data is data of high volume that fits well into a relational DBMS.
- Multi-structured big data is data of high volume that doesn’t fit well into a relational DBMS. Alternative: Poly-structured big data.
- Conventional relational data is data of not-so-high volume that fits well into a relational DBMS. Alternatives: Ordinary/normal/smaller relational data.
- Smaller poly-structured data is data for which dynamic schema capabilities are important, but which doesn’t rise to “big data” volume.
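To spell out the 2×2 matrix, here is a minimal illustrative sketch in Python. Only the four labels come from this post; the function and parameter names are mine, purely for exposition.

```python
# A minimal sketch of the 2x2 terminology matrix above. Purely
# illustrative; only the four labels come from the post itself.
def classify(big: bool, fits_relational: bool) -> str:
    matrix = {
        (True, True): "relational big data",
        (True, False): "multi-structured big data",
        (False, True): "conventional relational data",
        (False, False): "smaller poly-structured data",
    }
    return matrix[(big, fits_relational)]

assert classify(big=True, fits_relational=False) == "multi-structured big data"
```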
What those nested data structures are about
As I’ve noted before, the very big web companies have an issue with nested data structures. The subject came up in XLDB talks yesterday too, so my big goal for lunch was to finally understand what was being talked about. Sitting at a table full of eBay and LinkedIn folks turned out to be a good tactic.
The explanation was led by Oliver Ratzesberger, late of eBay* and progenitor of eBay’s Singularity project. In simplest terms, one event can spawn a lot of event attribute information, perhaps in the form of name-value pairs, which it then makes sense to store together in some way. The example Oliver dwelled on was that, on any given web page, there can be 100+ pieces of information to record, including:
- All 50 search results you were shown, and their positions in the search rankings.
- Every ad, image, or graphical element.
- An ID as to which test you were participating in (every page you see on eBay has some element being tested).
*Edit: Oliver subsequently moved on to Sears and then Teradata.
There are several reasons why one might wish to store this information in ways that grieve relational purists. First, reconstructing all this information via joins would be brutally expensive. It could also be impractical: some of the information comes from third-party ad servers, which might not reproduce the same ads upon demand, and some is in the form of rankings, which can’t always be reliably reproduced from one query to the next. (That’s just one of several reasons text search and relational DBMS are an awkward fit.)
Also, there’s a strong dynamic schema flavor to these databases. The attributes recorded for one web click might be very different in kind from those recorded for the next. Forcing that kind of variability into a fixed relational schema, while theoretically possible, doesn’t necessarily make a lot of sense.
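To make that concrete, here is a hypothetical page-view event of the kind Oliver described, sketched as one nested Python record; all field names and values are invented for illustration.

```python
# A hypothetical page-view event, stored as one nested record rather than
# as rows scattered across many joined tables. All field names and values
# are invented for illustration.
page_view_event = {
    "event_id": "pv-000123",
    "test_id": "ranking-test-17",  # which test this page participated in
    "search_results": [
        {"item_id": "item-42", "rank": 1},
        {"item_id": "item-77", "rank": 2},
        # ... up to 50 results, each with its position in the rankings
    ],
    "ads": [
        # captured at serving time, since a third-party ad server might
        # not reproduce the same ads upon demand
        {"ad_server": "ads.example.com", "creative_id": "cr-9"},
    ],
}
```

The next event might carry a quite different set of name-value pairs, which is exactly the dynamic schema point.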
Categories: Data models and architecture, Data warehousing, eBay, Log analysis, Web analytics | 7 Comments |
Text data management, Part 1: Confusion
This is Part 1 of a three post series. The posts cover:
- Confusion about text data management.
- Choices for text data management (general and short-request).
- Choices for text data management (analytic).
There’s much confusion about the management of text data, among technology users, vendors, and investors alike. Reasons seem to include:
- The terminology around text data is inaccurate.
- Data volume estimates for text are misleading.
- Multiple different technologies are in the mix, including:
- Enterprise text search.
- Text analytics — text mining, sentiment analysis, etc.
- Document stores — e.g. document-oriented NoSQL, or MarkLogic.
- Log management and parsing — e.g. Splunk.
- Text archiving — e.g., various specialty email archiving products I couldn’t even name.
- Public web search — Google et al.
- Text search vendors have disappointed, especially technically.
- Text analytics vendors have disappointed, especially financially.
- Other analytic technology vendors ignore what the text analytic vendors actually have accomplished, and reinvent inferior wheels rather than OEM the state of the art.
Above all: The use cases for text data vary greatly, just as the use cases for simply-structured databases do.
There are probably fewer people now than there were six years ago who need to be told that text and relational database management are very different things. Other misconceptions, however, appear to be on the rise. Specific points that are commonly overlooked include: Read more
Categories: Analytic technologies, Archiving and information preservation, Google, Log analysis, MarkLogic, NoSQL, Oracle, Splunk, Text | 2 Comments |
Hadoop notes
I visited California recently, and chatted with numerous companies involved in Hadoop — Cloudera, Hortonworks, MapR, DataStax, Datameer, and more. I’ll defer further Hadoop technical discussions for now — my target to restart them is later this month — but that still leaves some other issues to discuss, namely adoption and partnering.
The total number of enterprises in the world paying subscription and license fees that they would regard as being for “Hadoop or something Hadoop-related” probably is not much over 100 right now, but I’d expect to see pretty rapid growth. Beyond that, let’s divide customers into three groups:
- Internet businesses.
- Traditional enterprises’ internet operations.
- Traditional enterprises’ other operations.
Hadoop vendors, in different mixes, claim to be doing well in all three segments. Even so, almost all use cases involve some kind of machine-generated data, with one exception being a credit card vendor crunching a large database of transaction details. Multiple kinds of machine-generated data come into play — web/network/mobile device logs, financial trade data, scientific/experimental data, and more. In particular, pharmaceutical research got some mentions, which makes sense, in that it’s one area of scientific research that actually enjoys fat for-profit research budgets.
Categories: Cloudera, Hadoop, Health care, Hortonworks, Investment research and trading, Log analysis, MapR, MapReduce, Market share and customer counts, Scientific research, Web analytics | 5 Comments |
MongoDB users and use cases
I spoke with Eliot Horowitz and Max Schireson of 10gen last month about MongoDB users and use cases. The biggest clusters they came up with weren’t much over 100 nodes, but clusters an order of magnitude bigger were under development. The 100-node cluster we talked about most had 33 replica sets, each with about 100 gigabytes of data, so that’s in the 3-4 terabyte range total. In general, the largest MongoDB databases are 20-30 TB; I’d guess those really do use the bulk of available disk space. Read more
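For the record, the arithmetic behind that total is simply:

```python
# Back-of-envelope check on the cluster sizing quoted above.
replica_sets = 33
gb_per_set = 100                     # "about 100 gigabytes" per replica set
total_tb = replica_sets * gb_per_set / 1000
print(total_tb)                      # 3.3, i.e. the 3-4 terabyte range
```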
Categories: Data models and architecture, Games and virtual worlds, Log analysis, MongoDB, NoSQL, Solid-state memory, Specific users, Splunk, Telecommunications, Web analytics | 13 Comments |
Remote machine-generated data
I refer often to machine-generated data, which is commonly generated inexpensively and in log-like formats, and is often best aggregated in a big bit bucket before you try to do much analysis on it. The term has caught on, to the point that perhaps it’s time to distinguish more carefully among different kinds of machine-generated data. In particular, I think it may be useful to distinguish between:
- Log-stream machine-generated data, when what you’re looking at — at least initially — is the entire output of verbose logging systems.
- Remote machine-generated data.
Here’s what I’m thinking of for the second category. I rather frequently hear of cases in which data is generated by large numbers of remote machines, which occasionally send messages home. For example: Read more
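As a generic illustration of the distinction (these records are hypothetical, not drawn from the specific cases behind the link):

```python
# Hypothetical records illustrating the two categories above; both
# formats are invented for illustration.

# Log-stream: one line of a verbose logging system's full output,
# emitted for essentially every event.
log_stream_line = '2011-09-14T12:03:55Z GET /index.html 200 4821 "Mozilla/5.0"'

# Remote: an occasional message home from one of many remote devices,
# summarizing what happened since the last check-in.
phone_home_message = {
    "device_id": "meter-00451",
    "sent_at": "2011-09-14T12:00:00Z",
    "readings_since_last_checkin": 96,
    "error_count": 0,
}
```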
Categories: Analytic technologies, Cloud computing, Log analysis, MySQL, Netezza, Splunk, Truviso | 2 Comments |
HBase is not broken
It turns out that my impression that HBase is broken was unfounded, in at least two ways. The smaller reason is that a problem with the HBase/Hadoop interface, or with Hadoop’s HBase support, is not necessarily a problem with HBase itself (especially since HBase is no longer a Hadoop subproject). The bigger reason is that, according to consensus, HBase has worked pretty well since the 0.90 release in January of this year.
After Michael Stack of StumbleUpon beat me up for a while,* Omer Trajman of Cloudera was kind enough to walk me through HBase usage. He is informed largely by 18 Cloudera customers, plus a handful of other well-known HBase users such as Facebook, StumbleUpon, and Yahoo. Of the 18 Cloudera customers using HBase that Omer was thinking of, 15 are in HBase production, one is in HBase “early production”, one is still doing R&D in the area of HBase, and one is a classified government customer not providing such details. Read more
Categories: Cloudera, Derived data, Facebook, Hadoop, HBase, Log analysis, Market share and customer counts, Open source, Specific users, Web analytics | 6 Comments |
Petabyte-scale Hadoop clusters (dozens of them)
I recently learned that there are 7 Vertica clusters with a petabyte (or more) each of user data. So I asked around about other petabyte-scale clusters. It turns out that there are several dozen such clusters (at least) running Hadoop.
Cloudera can identify 22 CDH (Cloudera Distribution [of] Hadoop) clusters holding one petabyte or more of user data each, at 16 different organizations. This does not count Facebook or Yahoo, who are huge Hadoop users but not, I gather, running CDH. Meanwhile, Eric Baldeschwieler of Hortonworks tells me that Yahoo’s latest stated figures are:
- 42,000 Hadoop nodes …
- … holding 180-200 petabytes of data.
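Taken together, those figures imply roughly 4-5 terabytes of data per node:

```python
# Implied per-node data volume, from Yahoo's stated figures above.
nodes = 42_000
petabytes = 190                      # midpoint of the stated 180-200 PB
tb_per_node = petabytes * 1000 / nodes
print(round(tb_per_node, 1))         # ~4.5 TB of data per node
```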
Eight kinds of analytic database (Part 2)
In Part 1 of this two-part series, I outlined four variants on the traditional enterprise data warehouse/data mart dichotomy, and suggested what kinds of DBMS products you might use for each. In Part 2 I’ll cover four more kinds of analytic database — even newer, for the most part, and with an even less clear match between use cases and product short lists. Read more
Eight kinds of analytic database (Part 1)
Analytic data management technology has blossomed, leading to many questions along the lines of “So which products should I use for which category of problem?” The old EDW/data mart dichotomy is hopelessly outdated for that purpose, and adding a third category for “big data” is little help.
Let’s try eight categories instead. While no categorization is ever perfect, these each have at least some degree of technical homogeneity. Figuring out which types of analytic database you have or need — and in most cases you’ll need several — is a great early step in your analytic technology planning. Read more