Text

Analysis of data management technology optimized for text data. Related subjects include:

October 18, 2009

Technical introduction to Splunk

As noted in my other introductory post, Splunk sells software called Splunk, which is used for log analysis. These can be logs of various kinds, but for the purpose of understanding Splunk technology, it’s probably OK to assume they’re clickstream/network event logs. In addition, Splunk seems to have some aspirations of having its software used for general schema-free analytics, but that’s in early days at best.

Splunk’s core technology indexes text and XML files or streams, especially log files. Technical highlights of that part include: Read more

October 18, 2009

General introduction to Splunk

I dropped by log analysis software vendor Splunk a few weeks ago for a chat with Marketing VP Steve Sommer (who some you may know from Cognos and/or Informix), Product Management VP Christina Noren, and above all co-founder/CTO Erik Swan. Splunk turns out to be a pretty interesting company, from both business and technical standpoints. For one thing, Splunk seems highly regarded by most people I mention it to.

Splunk’s technical stories include:

More on those in a separate post.

Less technical Splunk highlights include: Read more

October 10, 2009

How 30+ enterprises are using Hadoop

MapReduce is definitely gaining traction, especially but by no means only in the form of Hadoop. In the aftermath of Hadoop World, Jeff Hammerbacher of Cloudera walked me quickly through 25 customers he pulled from Cloudera’s files. Facts and metrics ranged widely, of course:

Read more

April 24, 2009

IBM’s Oracle emulation strategy reconsidered

I’ve now had a chance to talk with IBM about its recently-announced Oracle emulation strategy for DB2. (This is for DB2 9.7, which I gather has been quasi-announced in April, will be re-announced in May, and will be re-re-announced as being in general availability in June.)

Key points include:

Because of Oracle’s market share, many ISVs focus on Oracle as the underlying database management system for their applications, whether or not they actually resell it along with their own software. IBM proposed three reasons why such ISVs might want to support DB2: Read more

October 5, 2008

MarkLogic architecture deep dive

While I previously posted in great detail about how MarkLogic Server is an ACID-compliant XML-oriented DBMS with integrated text search that indexes everything in real time and executes range queries fairly quickly, I didn’t have a good feel for how all those apparently contradictory characteristics fit into a single product. But I finally had a call with Mark Logic Director of Engineering Ron Avnur, and think I have a better grasp of the MarkLogic architecture and story.

Ron described MarkLogic Server as a DBMS for trees. Read more

August 26, 2008

Known applications of MapReduce

Most of the actual MapReduce applications I’ve heard of fall into a few areas:

That covers all MapReduce apps I recall hearing about via commercial companies and users, and also includes most of what’s in the two big sources I found online. Read more

January 27, 2008

The 4 main approaches to datatype extensibility

Based on a variety of conversations – including some of the flames about my recent confession that mid-range DBMS aren’t suitable for everything — it seems as if a quick primer may be in order on the subject of datatype support. So here goes.

“Database management” usually deals with numeric or alphabetical data – i.e., the kind of stuff that goes nicely into tables. It commonly has a natural one-dimensional sort order, which is very useful for sort/merge joins, b-tree indexes, and the like. This kind of tabular data is what relational database management systems were invented for.

But ever more, there are important datatypes beyond character strings, numbers and dates. Leaving out generic BLOBs and CLOBs (Binary/Character Large OBjects), the big four surely are:

Numerous other datatypes are important as well, with the top runners-up probably being images, sound, video, time series (even though they’re numeric, they benefit from special handling).

Four major ways have evolved to manage data of non-tabular datatype, either on their own or within an essentially relational data management environment. Read more

December 9, 2005

Relational DBMS versus text data

There seems to be tremendous confusion about “search,” “meaning,” “semantics,” the suitability of relational DBMS to manage text data, and similar subjects. Here are some observations that may help sort some of that out.

1. Relational database theorists like to talk about the “meaning” or “semantics” of data as being in the database (specifically its metadata, and more specifically its constraints). This is at best a very limited use of the words “meaning” or “semantics,” and has little to do with understanding the meaning of plain English (or other language) phrases, sentences, paragraphs, etc. that may be stored in the database. Hugh Darwen is right and his fellow relational theorists are confused.

2. The standard way to manage text is via a full-text index, designed like this: For hundreds of thousands of words, the index maintains a list of which documents the word appears in, and at what positions in the document it appears. This is a columnar, memory-centric approach, that doesn’t work well with the architecture of mainstream relational products. Oracle pulled off a decent single-server integration nonetheless, although performance concerns linger to this day. Others, like Sybase, which attempted a Verity integration, couldn’t make it work reasonably at all. Microsoft, which started from the Sybase architecture, didn’t even try, or if they tried it wasn’t for long; Microsoft’s text search strategy has been multi-server more or less from the getgo.

3. Notwithstanding point #2, Oracle, IBM, Microsoft, and others have SQL DBMS extended to handle text via the SQL3 (or SQL/MM ) standard. (Truth be told, I get the names and sequencing of the SQL standard versions mixed up.) From this standpoint, the full text of a document is in a single column, and one can write WHERE clauses on that column using a rich set of text search operators.

But while such SQL statements formally fit into the relational predicate logic model, the fit is pretty awkward. Text search functions aren’t two-valued binary yes/no types of things; rather, they give scores, e.g. with 101 possible values (the integers from 0 – 100). Compounding them into a two-valued function typically throws away information, especially since that compounding isn’t well understood (which is why it’s so hard to usefully federate text searches across different corpuses).

4. Something even trickier is going on. Text search can be carried out against many different kinds of things. One increasingly useful target is the tables of a relational database. Where a standard SQL query might have trouble finding all the references in a whole database to a particular customer organization or product line or whatever, a text search can do a better job. This kind of use is becoming increasingly frequent. And while it works OK against relational products, it doesn’t fit into the formal relational model at all (at least not without a tremendous amount of contortion).

5. Relational DBMS typically manage the data they index. Text search systems often don’t. But that difference is almost a small one compared with some of the others mentioned above, especially since it’s a checkmark item for leading RDBMS to have some sort of formal federation capability.

← Previous Page

Feed: DBMS (database management system), DW (data warehousing), BI (business intelligence), and analytics technology Subscribe to the Monash Research feed via RSS or email:

Login

Search our blogs and white papers

Monash Research blogs

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.