Text

Analysis of data management technology optimized for text data. Related subjects include:

Native XML database management
(in Text Technologies) More extensive coverage of text search

July 8, 2012

Database diversity revisited

From time to time, I try to step back and build a little taxonomy for the variety in database technology. One effort was 4 1/2 years ago, in a pre-planned exchange with Mike Stonebraker (his side, alas, has since been taken down). A year ago I spelled out eight kinds of analytic database.

The angle I’ll take this time is to say that every sufficiently large enterprise needs to be cognizant of at least 7 kinds of database challenge. General notes on that include:

I’m using the weasel words “database challenge” to evade questions as to what is or isn’t exactly a DBMS.
One “challenge” can call for multiple products and technologies even within a single enterprise, let alone at different ones. For example, in this post the “eight kinds of analytic database” are reduced to just a single category.
Even so, one product or technology may be well-suited to address a couple different kinds of challenges.

The Big Seven database challenges that almost any enterprise faces are: Read more

Categories: Data integration and middleware, Data models and architecture, Database diversity, EAI, EII, ETL, ELT, ETLT, Hadoop, Memory-centric data management, NoSQL, Object, OLTP, RDF and graphs, Structured documents, Talend, Text

3 Comments

March 27, 2012

DataStax Enterprise and Cassandra revisited

My last post about DataStax Enterprise and Cassandra didn’t go so well. As follow-up, I chatted for two hours with Rick Branson and Billy Bosworth of DataStax. Hopefully I can do better this time around.

For starters, let me say there are three kinds of data management nodes in DataStax Enterprise:

Vanilla Cassandra.
Cassandra plus Solr. Solr is a superset of the text-indexing system Lucene.
- Solr adds a lot more secondary indexing to Cassandra.
- In addition, these nodes serve as Solr emulation; you can run generic Solr apps on them.
Cassandra plus Hadoop.
- You can use Hadoop MapReduce to manipulate generic Cassandra data.
- In addition, these nodes serve as Hadoop/HDFS (Hadoop Distributed File System) emulation; you can run generic Hadoop apps on them.
- Hadoop jobs can interweave access to the two kinds of data structure.

Cassandra, Solr, Lucene, and Hadoop are all Apache projects.

If we look at this from the standpoint of DML (Data Manipulation Language) and data access APIs:

Cassandra is a column-group kind of NoSQL DBMS. You can get at its data programmatically.
There’s something called CQL (Cassandra Query Language), said to be SQL-like.
There’s a JDBC driver for CQL.
With Hadoop MapReduce also come Hive, Pig, and Sqoop.
With Solr and Lucene come full-text search.

In addition, it is sometimes recommended that you use “in-entity caching”, where an entire data structure (e.g. in JSON) winds up in a single Cassandra column.

The two main ways to get direct SQL* access to data in DataStax Enterprise are:

JDBC/SQL.
Hive/Hadoop.

*or very SQL-like, depending on how you view things

Before going further, let’s recall some Cassandra basics: Read more

Categories: Cassandra, DataStax, Hadoop, MapReduce, Market share and customer counts, NoSQL, Open source, Text

6 Comments

March 21, 2012

DataStax Enterprise 2.0

Edit: Multiple errors in the post below have been corrected in a follow-on post about DataStax Enterprise and Cassandra.

My client DataStax is announcing DataStax Enterprise 2.0. The big point of the release is that there’s a bunch of stuff integrated together, including at least:

Cassandra — the NoSQL DBMS, which DataStax sometimes calls “DataStax Server”. Edit: That’s not really a fair criticism of DataStax’s messaging.
Hadoop MapReduce, which DataStax sometimes calls “Hadoop”. Edit: That is indeed fair. 🙂
Sqoop — the general way to connect relational DBMS to Hadoop, which DataStax sometimes calls “RDBMS integration”.
Solr — the search-centric Apache project, or big parts of it, which DataStax generally calls either “Solr” or “Solr compatibility”.
log4j — an Apache project that has something or other to do with logging, or parts of it, which DataStax sometimes calls “log file integration”.
DataStax OpsCenter — some management tools and so on around Cassandra and the rest of the product line.

DataStax stresses that all this runs on the same cluster, with the same administrative tools and so on. For example, on a single cluster:

You can manage the interactive data for a web site.
You can store the logs for that website.
You can analyze all of the above in Hadoop.

Categories: Cassandra, Clustering, DataStax, EAI, EII, ETL, ELT, ETLT, Games and virtual worlds, Hadoop, Log analysis, Market share and customer counts, NoSQL, Parallelization, Text, Web analytics

5 Comments

February 26, 2012

SAP HANA today

SAP HANA has gotten much attention, mainly for its potential. I finally got briefed on HANA a few weeks ago. While we didn’t have time for all that much detail, it still might be interesting to talk about where SAP HANA stands today.

The HANA section of SAP’s website is a confusing and sometimes inaccurate mess. But an IBM whitepaper on SAP HANA gives some helpful background.

SAP HANA is positioned as an “appliance”. So far as I can tell, that really means it’s a software product for which there are a variety of emphatically-recommended hardware configurations — Intel-only, from what right now are eight usual-suspect hardware partners. Anyhow, the core of SAP HANA is an in-memory DBMS. Particulars include:

Mainly, HANA is an in-memory columnar DBMS, based on SAP’s confusingly-renamed BI Accelerator/BW Accelerator. Analytics and most OLTP (OnLine Transaction Processing) go against the columnar part of HANA.
The HANA DBMS also has an in-memory row storage option, used to store metadata, small tables, and so on.
SAP HANA talks both SQL and MDX.
The HANA DBMS is shared-nothing across blades or rack servers. I imagine that within an individual blade it’s shared everything. The usual-suspect data distribution or partitioning strategies are available — hash, range, round-robin.
SAP HANA has what sounds like a natural disk-based persistence strategy — logs, snapshots, and so on. SAP says that this is synchronous enough to give ACID compliance. For some hardware partners, those “disks” are actually Fusion I/O cards.
HANA is fault-tolerant “across servers”.
Text support is “coming soon”, which makes sense, given that BI Accelerator was based on the TREX search engine in the first place. Inxight is also in the HANA text mix.
You can put data into SAP HANA in a variety of obvious ways:
- Writing it directly.
- Trigger-based replication (perhaps from the DBMS that runs your SAP apps).
- Log-based replication (based on Sybase Replication Server).
- SAP Business Objects’ ETL tool.

SAP says that the row-store part is based both on P*Time, an acquisition from Korea some time ago, and also on SAP’s own MaxDB. The IBM white paper mentions only the MaxDB aspect. (Edit: Actually, see the comment thread below.) Based on a variety of clues, I conjecture that this was an aspect of SAP HANA development that did not go entirely smoothly.

Other SAP HANA components include: Read more

Categories: Analytic technologies, Business Objects, Data warehouse appliances, Data warehousing, In-memory DBMS, Investment research and trading, OLTP, Predictive modeling and advanced analytics, SAP AG, Software as a Service (SaaS), Text

12 Comments

February 17, 2012

The future of enterprise application software

Sarah Lacy argues that enterprise application software is due for a change. Her reasons seemingly boil down to:

Users are increasingly eager for friendlier, consumer-like technology.
The current generation of apps was installed long enough ago — often before the Year 2000 deadline — that enterprises are willing to contemplate rip-and-replace.

I’m inclined to agree, although I’d add some further, more technological-oriented drivers to the mix.

Changes I envision to enterprise applications include (and these overlap):

Better integration with communication technology.
- Social software.
- Better stakeholder-facing interfaces.
- Voice control.
Better integration with analytic technology.
- Dashboard-first UIs.
- Search-first UIs.
- Alert-first UIs.
- Analytic assessment aids (job performance, supplier desirability, expense approval, etc.).
- Automated decisioning.
- Some true analytic apps, interesting or otherwise.
Better use of different kinds of data.
- Text.
- Machine-generated.
- Analytically-derived data.

Categories: salesforce.com, Software as a Service (SaaS), Text

5 Comments

February 6, 2012

Sumo Logic and UIs for text-oriented data

I talked with the Sumo Logic folks for an hour Thursday. Highlights included:

Sumo Logic does SaaS (Software as a Service) log management.
Sumo Logic is text indexing/Lucene-based. Thus, it is reasonable to think of Sumo Logic as “Splunk-like”. (However, Sumo Logic seems to have a stricter security/trouble-shooting orientation than Splunk, which is trying to branch out.)
Sumo Logic has hacked Lucene for faster indexing, and says 10-30 second latencies are typical.
Sumo Logic’s main differentiation is automated classification of events.
There’s some kind of streaming engine in the mix, to update counters and drive alerts.
Sumo Logic has around 30 “customers,” free (mainly) or paying (around 5) as the case may be.
A truly typical Sumo Logic customer has single to low double digits of gigabytes of log data per day. However, Sumo Logic seems highly confident in its ability to handle a terabyte per customer per day, give or take a factor of 2.
When I asked about the implications of shipping that much data to a remote data center, Sumo Logic observed that log data compresses really well.
Sumo Logic recently raised a bunch of venture capital.
Sumo Logic’s founders are out of ArcSight, a log management company HP paid a bunch of money for.
Sumo Logic coined a marketing term “LogReduce”, but it has nothing to do with “MapReduce”. Sumo Logic seems to find this amusing.

What interests me about Sumo Logic is that automated classification story. I thought I heard Sumo Logic say: Read more

Categories: Log analysis, Market share and customer counts, Predictive modeling and advanced analytics, Software as a Service (SaaS), Text

7 Comments

November 4, 2011

Lessons from T-Mobile’s epic fail

When my electric power came back on but my Verizon FiOS internet connection didn’t, it was time for a mobile hotspot/prepaid wireless internet service. T-Mobile’s 4G Mobile Hotspot/Prepaid Mobile Broadband offering seemed like a good choice. But the experience of setting it up was a nightmare, and a possible instructive nightmare at that.

T-Mobile’s instructions tell you that you need to know the factory defaults for network name and password. That makes sense. They don’t also tell you that you need to know your SIM card number (included), IMEI number (included), or authorization number (not included).

That’s right — you need a number that T-Mobile doesn’t tell you you need. But the story gets a lot worse from there, because it’s almost impossible to get the number from them. I eventually talked with approximately 8 T-Mobile call center associates over the course of the evening before getting successfully connected.

Categories: Specific users, Text

MarkLogic 5, and why you might care

MarkLogic is releasing MarkLogic 5. Key elements of the announcement are:

More-of-the-same in line with MarkLogic’s core positioning.
A new bi-directional Hadoop connector.
A free MarkLogic Express edition, limited in license terms more than in actual features, as per Slide 27 of the deck MarkLogic graciously supplied for me to post.

Also, MarkLogic is early with a feature that most serious DBMS vendors will soon have – support for tiered storage, with writes going first to solid-state storage, then being flushed to disk via a caching-style algorithm.* And as befits a sometime search-engine-substitute, MarkLogic has finally licensed a large set of document filters, from an Australian company called Isys. Apparently, the special virtue of the Isys filters is that they’re good at extracting not only text, but metadata as well.

*If there’s a caching algorithm that doesn’t contain a major element of LRU (Least Recently Used), I don’t recall ever hearing about it.

MarkLogic seems to have settled on a positioning that, although distressingly buzzword-heavy, is at least partly based upon reality. The real part includes:

MarkLogic is a serious, enterprise-class DBMS (see for example Slide 12 of the MarkLogic deck) …
… which has been optimized from the getgo for poly-structured data.
MarkLogic can and does scale out to handle large amounts of data.
MarkLogic is a general-purpose DBMS, suitable for both short-request and analytic tasks.
MarkLogic is particularly well suited for analyses with long chains of “progressive enhancement” (MarkLogic’s favorite term when talking about derived data).
MarkLogic often plays the role of a content assembler and/or search engine, and the people who use MarkLogic in those ways are commonly doing things that can be described as research and analysis.

Based on that reality, MarkLogic talks a lot about Volume, Velocity, Variety, Big Data, unstructured data, semi-structured data, and big data analytics.

Categories: Hadoop, Market share and customer counts, MarkLogic, Scientific research, Solid-state memory, Structured documents, Text

1 Comment

October 10, 2011

Text data management, Part 3: Analytic and progressively enhanced

This is Part 3 of a three post series. The posts cover:

Confusion about text data management.

Choices for text data management (general and short-request).

Choices for text data management (analytic).

I’ve gone on for two long posts about text data management already, but even so I’ve glossed over a major point:

Using text data commonly involves a long series of data enhancement steps.

Even before you do what we’d normally think of as “analysis”, text markup can include steps such as:

Figure out where the words break.
Figure out where the clauses and sentences break.
Figure out where the paragraphs, sections, and chapters break.
(Where necessary) map the words to similar ones — spelling correction, stemming, etc.
Figure out which words are grammatically which parts of speech.
Figure out which pronouns and so on refer to which other words. (Technical term: Anaphora resolution.)
Figure out what was being said, one clause at a time.
Figure out the emotion — or “sentiment” — associated with it.

Those processes can add up to dozens of steps. And maybe, six months down the road, you’ll think of more steps yet.

Categories: Data warehousing, Hadoop, NoSQL, Text

4 Comments

October 10, 2011

Text data management, Part 2: General and short-request

This is Part 2 of a three post series. The posts cover:

I’ve recently given widely varied advice about managing text (and similar files — images and so on), ranging from

Sure, just keep going with your old strategy of keeping .PDFs in the file system and pointing to them from the relational database. That’s an easy performance optimization vs. having the RDBMS manage them as BLOBs.

I suspect MongoDB isn’t heavyweight enough for your document management needs, let alone just dumping everything into Hadoop. Why don’t you take a look at MarkLogic?

Here are some reasons why.

There are three basic kinds of text management use case:

Text as payload.
Text as search parameter.
Text as analytic input.

Categories: MarkLogic, NoSQL, Text

5 Comments

← Previous Page — Next Page →

Search our blogs and white papers

Monash Research blogs

DBMS 2 covers database management, analytics, and related technologies.
Text Technologies covers text mining, search, and social software.
Strategic Messaging analyzes marketing and messaging strategy.
The Monash Report examines technology and public policy issues.
Software Memories recounts the history of the software industry.

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.

Links
- Monash Research
- White Papers
Admin
- Log in

Text

Database diversity revisited

DataStax Enterprise and Cassandra revisited

DataStax Enterprise 2.0

SAP HANA today

The future of enterprise application software

Sumo Logic and UIs for text-oriented data

Lessons from T-Mobile’s epic fail

MarkLogic 5, and why you might care

Text data management, Part 3: Analytic and progressively enhanced

Text data management, Part 2: General and short-request

Search our blogs and white papers

Monash Research blogs

User consulting

Vendor advisory

Monash Research highlights

Recent posts

Categories

Date archives

Admin