Structured documents

Analysis of data management technology based on a structured-document model, or optimized for XML data. Related subjects include:

Text data management
Object-oriented database management
Mark Logic
IBM’s DB2 pureXML
(in Text Technologies) More extensive coverage of text search

December 29, 2009

This and that

I have various subjects backed up that I don’t really want to write about at traditional blog-post length. Here are a few of them. Read more

Categories: Analytic technologies, Columnar database management, MarkLogic, Open source, Oracle, Streaming and complex event processing (CEP), Structured documents, Theory and architecture, Vertica Systems

2 Comments

October 18, 2009

Technical introduction to Splunk

As noted in my other introductory post, Splunk sells software called Splunk, which is used for log analysis. These can be logs of various kinds, but for the purpose of understanding Splunk technology, it’s probably OK to assume they’re clickstream/network event logs. In addition, Splunk seems to have some aspirations of having its software used for general schema-free analytics, but that’s in early days at best.

Splunk’s core technology indexes text and XML files or streams, especially log files. Technical highlights of that part include: Read more

Categories: Analytic technologies, Log analysis, MapReduce, Splunk, Structured documents, Text, Web analytics

12 Comments

October 10, 2009

How 30+ enterprises are using Hadoop

MapReduce is definitely gaining traction, especially but by no means only in the form of Hadoop. In the aftermath of Hadoop World, Jeff Hammerbacher of Cloudera walked me quickly through 25 customers he pulled from Cloudera’s files. Facts and metrics ranged widely, of course:

Some are in heavy production with Hadoop, and closely engaged with Cloudera. Others are active Hadoop users but are very secretive. Yet others signed up for initial Hadoop training last week.
Some have Hadoop clusters in the thousands of nodes. Many have Hadoop clusters in the 50-100 node range. Others are just prototyping Hadoop use. And one seems to be “OEMing” a small Hadoop cluster in each piece of equipment sold.
Many export data from Hadoop to a relational DBMS; many others just leave it in HDFS (Hadoop Distributed File System), e.g. with Hive as the query language, or in exactly one case Jaql.
Some are household names, in web businesses or otherwise. Others seem to be pretty obscure.
Industries include financial services, telecom (Asia only, and quite new), bioinformatics (and other research), intelligence, and lots of web and/or advertising/media.
Application areas mentioned — and these overlap in some cases — include:
- Log and/or clickstream analysis of various kinds
- Marketing analytics
- Machine learning and/or sophisticated data mining
- Image processing
- Processing of XML messages
- Web crawling and/or text processing
- General archiving, including of relational/tabular data, e.g. for compliance

Categories: Application areas, Aster Data, Cloudera, Data types, Data warehousing, Database diversity, EAI, EII, ETL, ELT, ETLT, Hadoop, Investment research and trading, Log analysis, MapReduce, Open source, Parallelization, Predictive modeling and advanced analytics, Scientific research, Structured documents, Telecommunications, Text, Vertica Systems, Web analytics

9 Comments

September 13, 2009

HadoopDB

Despite a thoughtful heads-up from Daniel Abadi at the time of his original posting about HadoopDB, I’m just getting around to writing about it now. HadoopDB is a research project carried out by a couple of Abadi’s students. Further research is definitely planned. But it seems too early to say that HadoopDB will ever get past the “research and oh by the way the code is open sourced” stage and become a real code line — whether commercialized, open source, or both.

The basic idea of HadoopDB is to put copies of a DBMS at different nodes of a grid, and use Hadoop to parcel work among them. Major benefits when compared with massively parallel DBMS are said to be:

Open/cheap/free
Query fault-tolerance
The related concept of tolerating node degradation that isn’t an outright node failure.

HadoopDB has actually been built with PostgreSQL. That version achieved performance well below that of a commercial DBMS “DBX”, where X=2. Column-store guru Abadi has repeatedly signaled his intention to try out HadoopDB with VectorWise at the nodes instead. (Recall that VectorWise is shared-everything.) It will be interesting to see how that configuration performs.

The real opportunity for HadoopDB, however, in my opinion may lie elsewhere. Read more

Categories: Analytic technologies, Columnar database management, Data models and architecture, Data types, Data warehousing, Database diversity, Hadoop, MapReduce, Open source, Parallelization, PostgreSQL, Scientific research, Structured documents, Theory and architecture

5 Comments

April 24, 2009

IBM’s Oracle emulation strategy reconsidered

I’ve now had a chance to talk with IBM about its recently-announced Oracle emulation strategy for DB2. (This is for DB2 9.7, which I gather has been quasi-announced in April, will be re-announced in May, and will be re-re-announced as being in general availability in June.)

Key points include:

This really is more like Oracle emulation than it is transparency, a term I carelessly used before.
IBM’s Oracle emulation effort is focused on two technological goals:
- Making it easy for an Oracle application to be ported to DB2.
- Making it easy for an Oracle developer to develop for DB2.
The initial target market for DB2’s Oracle emulation is ISVs (Independent Software Vendors) much more than it is enterprises. IBM suggested there were a couple hundred early adopters, and those are primarily in the ISV area.

Because of Oracle’s market share, many ISVs focus on Oracle as the underlying database management system for their applications, whether or not they actually resell it along with their own software. IBM proposed three reasons why such ISVs might want to support DB2: Read more

Categories: Data types, Emulation, transparency, portability, EnterpriseDB and Postgres Plus, GIS and geospatial, IBM and DB2, Market share and customer counts, Oracle, Pricing, Structured documents, Text

12 Comments

October 5, 2008

Schema flexibility and XML data management

Conor O’Mahony, marketing manager for IBM’s DB2 pureXML, talks a lot about one of my favorite hobbyhorses — schema flexibility — as a reason to use an XML data model. In a number of industries he sees use cases based around ongoing change in the information being managed:

Tax authorities change their rules and forms every year, but don’t want to do total rewrites of their electronic submission and processing software.
The financial services industry keeps inventing new products, which don’t just have different terms and conditions, but may also have different kinds of terms and conditions.
The same, to some extent, goes for the travel industry, which also keeps adding different kinds of offers and destinations.
The energy industry keeps adding new kinds of highly complex equipment it has to manage.

Conor also thinks market evidence shows that XML’s schema flexibility is important for data interchange. Read more

Categories: Data models and architecture, EAI, EII, ETL, ELT, ETLT, IBM and DB2, pureXML, Structured documents

3 Comments

October 5, 2008

Vertical market XML standards

Tracking the alphabet soup of vertical market XML standards is hard. So as a starting point, I’m splitting a list I got from IBM into a standalone post.

Among the most important or successful IBM pureXML–supported standards, in terms of downloads and other evidence of customer interest, are: Read more

Categories: Application areas, EAI, EII, ETL, ELT, ETLT, IBM and DB2, pureXML, Structured documents

2 Comments

October 5, 2008

Overview of IBM DB2 pureXML

On August 29, I had a great call with IBM about DB2 pureXML (most of the IBM side of the talking was done by Conor O’Mahony and Qi Jin). I’m finally getting around to writing it up now. (The world of tabular data warehousing has kept me just a wee bit busy …)

As I write it, I see there are a considerable number of holes, but that’s the way it seems to go when researching XML storage. I’m also writing up a September call from which I finally figured out (I think) the essence of how MarkLogic Server works – but only after five months of trying. It turns out that MarkLogic works rather differently from DB2 pureXML. Not coincidentally, IBM and Mark Logic focus on rather different use cases for native XML storage.

What I understand so far about the basic DB2 pureXML architecture goes like this: Read more

Categories: EAI, EII, ETL, ELT, ETLT, IBM and DB2, pureXML, Structured documents

7 Comments

October 5, 2008

MarkLogic architecture deep dive

While I previously posted in great detail about how MarkLogic Server is an ACID-compliant XML-oriented DBMS with integrated text search that indexes everything in real time and executes range queries fairly quickly, I didn’t have a good feel for how all those apparently contradictory characteristics fit into a single product. But I finally had a call with Mark Logic Director of Engineering Ron Avnur, and think I have a better grasp of the MarkLogic architecture and story.

Ron described MarkLogic Server as a DBMS for trees. Read more

Categories: MarkLogic, Structured documents, Text

5 Comments

June 28, 2008

Who is doing what in XML data management these days?

A comment thread to a post on a different subject has opened up a discussion of XML storage. Frankly, I haven’t kept up with my briefings on the subject, in part because XML support hasn’t proved to be very important yet to the big DBMS vendors, somewhat to my surprise. When last I looked, the situation wasn’t much different from what it was back in November, 2005. Unless I’ve missed something (and please tell me if I have!), here’s what’s going on: Read more

Categories: IBM and DB2, Intersystems and Cache', MarkLogic, Microsoft and SQL*Server, Oracle, Structured documents

7 Comments

← Previous Page — Next Page →

Search our blogs and white papers

Monash Research blogs

DBMS 2 covers database management, analytics, and related technologies.
Text Technologies covers text mining, search, and social software.
Strategic Messaging analyzes marketing and messaging strategy.
The Monash Report examines technology and public policy issues.
Software Memories recounts the history of the software industry.

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.

Links
- Monash Research
- White Papers
Admin
- Log in

Structured documents

This and that

Technical introduction to Splunk

How 30+ enterprises are using Hadoop

HadoopDB

IBM’s Oracle emulation strategy reconsidered

Schema flexibility and XML data management

Vertical market XML standards

Overview of IBM DB2 pureXML

MarkLogic architecture deep dive

Who is doing what in XML data management these days?

Search our blogs and white papers

Monash Research blogs

User consulting

Vendor advisory

Monash Research highlights

Recent posts

Categories

Date archives

Admin