MarkLogic

Analysis of Mark Logic and its MarkLogic Server search-friendly XML DBMS product. Related subjects include:

Native XML database management
Text data management
(in Text Technologies) Mark Logic viewed from a text search perspective

November 1, 2011

MarkLogic 5, and why you might care

MarkLogic is releasing MarkLogic 5. Key elements of the announcement are:

More-of-the-same in line with MarkLogic’s core positioning.
A new bi-directional Hadoop connector.
A free MarkLogic Express edition, limited in license terms more than in actual features, as per Slide 27 of the deck MarkLogic graciously supplied for me to post.

Also, MarkLogic is early with a feature that most serious DBMS vendors will soon have – support for tiered storage, with writes going first to solid-state storage, then being flushed to disk via a caching-style algorithm.* And as befits a sometime search-engine-substitute, MarkLogic has finally licensed a large set of document filters, from an Australian company called Isys. Apparently, the special virtue of the Isys filters is that they’re good at extracting not only text, but metadata as well.

*If there’s a caching algorithm that doesn’t contain a major element of LRU (Least Recently Used), I don’t recall ever hearing about it.
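To make the tiered-storage idea concrete, here is a minimal sketch of writes landing in a small fast tier and being flushed to a slower tier in least-recently-used order. It is purely illustrative and is not MarkLogic's implementation; the `TieredStore` class, its method names, and the in-memory dictionaries standing in for solid-state storage and disk are all assumptions made up for this example.

```python
from collections import OrderedDict


class TieredStore:
    """Toy two-tier store (hypothetical): a small fast tier in front of a slow tier."""

    def __init__(self, fast_capacity):
        self.fast_capacity = fast_capacity
        self.fast = OrderedDict()  # stands in for solid-state storage
        self.slow = {}             # stands in for disk

    def write(self, key, value):
        # Writes always land in the fast tier first.
        self.fast[key] = value
        self.fast.move_to_end(key)  # mark as most recently used
        self._flush_if_full()

    def read(self, key):
        if key in self.fast:
            self.fast.move_to_end(key)  # a read also counts as a use
            return self.fast[key]
        return self.slow[key]  # raises KeyError if the key was never written

    def _flush_if_full(self):
        # Move least recently used entries to the slow tier until we fit.
        while len(self.fast) > self.fast_capacity:
            old_key, old_value = self.fast.popitem(last=False)
            self.slow[old_key] = old_value


store = TieredStore(fast_capacity=2)
store.write("a", 1)
store.write("b", 2)
store.read("a")      # "a" is now more recently used than "b"
store.write("c", 3)  # the fast tier is over capacity, so "b" is flushed
print(list(store.fast), store.slow)
```

A real system would also have to worry about durability, concurrency, and write batching, none of which this sketch attempts.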

MarkLogic seems to have settled on a positioning that, although distressingly buzzword-heavy, is at least partly based upon reality. The real part includes:

MarkLogic is a serious, enterprise-class DBMS (see for example Slide 12 of the MarkLogic deck) …
… which has been optimized from the get-go for poly-structured data.
MarkLogic can and does scale out to handle large amounts of data.
MarkLogic is a general-purpose DBMS, suitable for both short-request and analytic tasks.
MarkLogic is particularly well suited for analyses with long chains of “progressive enhancement” (MarkLogic’s favorite term when talking about derived data).
MarkLogic often plays the role of a content assembler and/or search engine, and the people who use MarkLogic in those ways are commonly doing things that can be described as research and analysis.

Based on that reality, MarkLogic talks a lot about Volume, Velocity, Variety, Big Data, unstructured data, semi-structured data, and big data analytics.

Categories: Hadoop, Market share and customer counts, MarkLogic, Scientific research, Solid-state memory, Structured documents, Text

October 10, 2011

Text data management, Part 2: General and short-request

This is Part 2 of a three-post series. The posts cover:

I’ve recently given widely varied advice about managing text (and similar files — images and so on), ranging from

Sure, just keep going with your old strategy of keeping .PDFs in the file system and pointing to them from the relational database. That’s an easy performance optimization vs. having the RDBMS manage them as BLOBs.

to

I suspect MongoDB isn’t heavyweight enough for your document management needs, let alone just dumping everything into Hadoop. Why don’t you take a look at MarkLogic?

Here are some reasons why.

There are three basic kinds of text management use case:

Text as payload.
Text as search parameter.
Text as analytic input.

Categories: MarkLogic, NoSQL, Text

October 10, 2011

Text data management, Part 1: Confusion

This is Part 1 of a three-post series. The posts cover:

There’s much confusion about the management of text data, among technology users, vendors, and investors alike. Reasons seem to include:

The terminology around text data is inaccurate.
Data volume estimates for text are misleading.
Multiple different technologies are in the mix, including:
- Enterprise text search.
- Text analytics — text mining, sentiment analysis, etc.
- Document stores — e.g. document-oriented NoSQL, or MarkLogic.
- Log management and parsing — e.g. Splunk.
- Text archiving — e.g., various specialty email archiving products I couldn’t even name.
- Public web search — Google et al.
Text search vendors have disappointed, especially technically.
Text analytics vendors have disappointed, especially financially.
Other analytic technology vendors ignore what the text analytic vendors actually have accomplished, and reinvent inferior wheels rather than OEM the state of the art.

Above all: The use cases for text data vary greatly, just as the use cases for simply-structured databases do.

There are probably fewer people now than there were six years ago who need to be told that text and relational database management are very different things. Other misconceptions, however, appear to be on the rise. Specific points that are commonly overlooked include: Read more

Categories: Analytic technologies, Archiving and information preservation, Google, Log analysis, MarkLogic, NoSQL, Oracle, Splunk, Text

October 2, 2011

Defining NoSQL

A reporter tweeted: “Is there a simple plain English definition for NoSQL?” After reminding him of my cynical yet accurate Third Law of Commercial Semantics, I gave it a serious try, and came up with the following. More precisely, I tweeted the bolded parts of what’s below; the rest is commentary added for this post.

NoSQL is most easily defined by what it excludes: SQL, joins, strong analytic alternatives to those, and some forms of database integrity. If you leave all four out, and you have a strong scale-out story, you’re in the NoSQL mainstream. Read more

Categories: Cassandra, dbShards and CodeFutures, MarkLogic, MySQL, Object, Open source, Petabyte-scale data management, Schooner Information Technology

September 6, 2011

Derived data, progressive enhancement, and schema evolution

The emphasis I’m putting on derived data is leading to a variety of questions, especially about how to tease apart several related concepts:

Derived data.
Many-step processes to produce derived data.
Schema evolution.
Temporary data constructs.

So let’s dive in. Read more

Categories: Data models and architecture, Data warehousing, Derived data, MarkLogic, Text

Another category of derived data

Six months ago, I argued the importance of derived analytic data, saying

… there’s no escaping the importance of derived/augmented/enhanced/cooked/adjusted data for analytic data processing. The five areas I have in mind are, loosely speaking:

Aggregates, when they are maintained, generally for reasons of performance or response time.

Calculated scores, commonly based on data mining/predictive analytics.

Text analytics.

The kinds of ETL (Extract/Transform/Load) Hadoop and other forms of MapReduce are commonly used for.

Adjusted data, especially in scientific contexts.

Probably there are yet more examples that I am at the moment overlooking.

Well, I did overlook at least one category. 🙂

A surprisingly important kind of derived data is metadata, especially for large, poly-structured data sets. For example, CERN has vast quantities of experiment sensor data, stored as files; the metadata alone fills over 10 terabytes in an Oracle database. MarkLogic is big on storing derived metadata, both on the publishing/media and intelligence sides of the business.

Categories: Data models and architecture, Derived data, Hadoop, MarkLogic

May 15, 2011

What to do about “unstructured data”

We hear much these days about unstructured or semi-structured (as opposed to structured) data. Those are misnomers, however, for at least two reasons. First, it’s not really the data that people think is un-, semi-, or fully structured; it’s databases.* Relational databases are highly structured, but the data within them is unstructured: just lists of numbers or character strings, whose only significance derives from the structure that the database imposes.

*Here I’m using the term “database” literally, rather than as a concise synonym for “database management system”. But see below.

Second, a more accurate distinction is not whether a database has one structure or none — it’s whether a database has one structure or many. The easiest way to see this is for databases that have clearly-defined schemas. A relational database has one schema (even if it is just the union of various unrelated sub-schemas); an XML database, however, can have as many schemas as it contains documents.
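The one-structure-versus-many distinction can be illustrated with a small sketch. The sample records and the `distinct_structures` helper below are hypothetical, and "structure" is simplified to mean nothing more than the set of top-level field names in each record.

```python
# Hypothetical data: relational rows all share one shape; documents need not.
relational_rows = [
    {"id": 1, "name": "Ann", "city": "Boston"},
    {"id": 2, "name": "Bob", "city": None},
]
documents = [
    {"title": "Memo", "body": "Budget approved."},
    {"title": "Report", "author": "Ann", "sections": [{"heading": "Intro"}]},
    {"subject": "Re: lunch", "from": "bob@example.com"},
]


def distinct_structures(records):
    """Return the set of distinct top-level field sets found in records."""
    return {frozenset(record) for record in records}


print(len(distinct_structures(relational_rows)))  # 1
print(len(distinct_structures(documents)))        # 3
```

In the document collection the number of structures can grow with the number of documents, which is the sense in which such a database has many structures rather than none.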

One small terminological problem is easily handled, namely that people don’t talk about true databases very often, at least when they’re discussing generalities; rather, they talk about data and DBMS.* So let’s talk of DBMS being “structured” singly or multiply or whatever, just as the databases they’re designed to manage are.

*And they refer to the DBMS as “databases,” because they don’t have much other use for the word.

All that said — I think that single vs. multiple database structures isn’t a bright-line binary distinction; rather, it’s a spectrum. For example: Read more

Categories: Cassandra, Couchbase, Data models and architecture, HBase, IBM and DB2, MarkLogic, MongoDB, NoSQL, Splunk, Theory and architecture

April 5, 2011

Whither MarkLogic?

My clients at MarkLogic have a new CEO, Ken Bado, even though former CEO Dave Kellogg was quite successful. If you cut through all the happy talk and side issues, the reason for the change is surely that the board wants to see MarkLogic grow faster, and specifically to move beyond its traditional niches of publishing (especially technical publishing) and national intelligence.

So what other markets could MarkLogic pursue? Before Ken even started work, I sent over some thoughts. They included (but were not limited to): Read more

Categories: MarkLogic, Object, RDF and graphs, Structured documents

February 28, 2011

Updating our vendor client disclosures

Edit: This disclosure has been superseded by a March, 2012 version.

From time to time, I disclose our vendor client lists. Another iteration is below. To be clear:

This is a list of Monash Advantage members.
All our vendor clients are Monash Advantage members, unless …
… we work with them primarily in their capacity as technology users. (A large fraction of our user clients happen to be SaaS vendors.)
We do not usually disclose our user clients.
We do not usually disclose our venture capital clients, nor those who invest in publicly-traded securities.
Included in the list below are two expired Monash Advantage members who haven’t said they will renew, as mentioned in my recent post on analyst bias. (You can probably imagine a couple of reasons for that obfuscation.)

With that said, our vendor client disclosures at this time are:

Aster Data
Cloudera
CodeFutures/dbShards
Couchbase
EMC/Greenplum
Endeca
IBM/Netezza
Infobright
Intel
MarkLogic
ParAccel
QlikTech
salesforce.com/database.com
SAND Technology
SAP/Sybase
Schooner Information Technology
Skytide
Splunk
Teradata
Vertica

Categories: About this blog, Aster Data, Cloudera, Couchbase, dbShards and CodeFutures, EMC, Greenplum, IBM and DB2, Infobright, Intel, MarkLogic, Netezza, ParAccel, QlikTech and QlikView, SAND Technology, SAP AG, Schooner Information Technology, Splunk, Sybase, Tableau Software, Teradata, Vertica Systems

February 7, 2011

Notes on document-oriented NoSQL

When people talk about document-oriented NoSQL or some similar term, they usually mean something like:

Database management that uses a JSON model and gives you reasonably robust access to individual field values inside a JSON (JavaScript Object Notation) object.

Or, if they really mean,

The essence of whatever it is that CouchDB and MongoDB have in common.

well, that’s pretty much the same thing as what I said in the first place. 🙂
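As a rough illustration of what "access to individual field values inside a JSON object" means, here is a minimal sketch using only Python's standard `json` module. The sample document and the `get_field` helper with its dotted-path syntax are assumptions invented for this example; they are not the query API of CouchDB, MongoDB, or any other product.

```python
import json

# Hypothetical document, stored as a JSON string.
raw = (
    '{"title": "Quarterly report", '
    '"author": {"name": "Ann", "dept": "Finance"}, '
    '"tags": ["q3", "draft"]}'
)


def get_field(document, path, default=None):
    """Follow a dotted path such as 'author.name' into nested dicts and lists."""
    current = document
    for part in path.split("."):
        if isinstance(current, dict) and part in current:
            current = current[part]
        elif isinstance(current, list) and part.isdigit() and int(part) < len(current):
            current = current[int(part)]
        else:
            return default
    return current


doc = json.loads(raw)
print(get_field(doc, "author.name"))                      # Ann
print(get_field(doc, "tags.1"))                           # draft
print(get_field(doc, "author.email", default="unknown"))  # unknown
```

A document-oriented DBMS does this kind of lookup inside the database, typically with indexes on the fields, rather than by parsing each object in application code.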

Of the various questions that might arise, three of the more definitional ones are:

Why JSON rather than XML?
What’s with this fluidity between the terms “document” and “object”?
Are you serious about the lack of joins?

Let me take a crack at each. Read more

Categories: CouchDB, MapReduce, MarkLogic, MongoDB, NoSQL, Object, Structured documents
