Data types
Analysis of data management technology optimized for specific datatypes, such as text, geospatial, object, RDF, or XML. Related subjects include:
- Any subcategory
- Database diversity
Teradata Developer Exchange (DevX) begins to emerge
Every vendor needs developer-facing web resources, and Teradata turns out to have been working on a new umbrella site for them. It’s called Teradata Developer Exchange (DevX for short). Teradata DevX seems to be in a low-volume beta now, with a press release and bigger roll-out coming in the next week or so. Major elements are about what one would expect:
- Articles
- Blogs
- Downloads
- Surprisingly, so far as I can tell, no forums
If you’re a Teradata user, you absolutely should check out Teradata DevX. If you just research Teradata — my situation 🙂 — there are some aspects that might be of interest anyway. In particular, I found Teradata’s downloads instructive, most particularly those in the area of extensibility. Mainly, these are UDFs (User-Defined Functions), in areas such as:
- Compression
- Geospatial data
- Imitating Oracle or DB2 UDFs (as migration aids; see the sketch below)
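To make the migration-aid idea concrete: such UDFs typically re-create built-ins from the source DBMS on the target platform, so ported SQL keeps working. Here’s a toy Python sketch of the behavior of two Oracle built-ins, NVL and DECODE, of the kind such packages imitate; this is my illustration of the concept, not Teradata’s actual code.

```python
# Toy Python analogues of two Oracle built-ins that migration-aid
# UDFs re-create on a target platform. Illustrative only -- real
# Teradata UDFs would be written against its UDF interface.

def nvl(value, default):
    """Oracle NVL: substitute a default for NULL (None here)."""
    return default if value is None else value

def decode(value, *pairs_and_default):
    """Oracle DECODE: compare value against search/result pairs,
    with an optional trailing default."""
    pairs, default = pairs_and_default, None
    if len(pairs) % 2 == 1:  # odd count means a default was supplied
        pairs, default = pairs[:-1], pairs[-1]
    for i in range(0, len(pairs), 2):
        if value == pairs[i]:
            return pairs[i + 1]
    return default

print(nvl(None, 0))                       # -> 0
print(decode("A", "A", 1, "B", 2, -1))    # -> 1
```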
Also of potential interest is a custom-portlet framework for Teradata’s management tool Viewpoint. A straightforward use would be to plunk some Viewpoint data into a more general system management dashboard. A cooler use (and I couldn’t get a clear sense of whether anybody has done this yet) would be to offer end users some insight as to how long their queries are apt to run.
Categories: Database compression, Emulation, transparency, portability, GIS and geospatial, Teradata
IBM’s Oracle emulation strategy reconsidered
I’ve now had a chance to talk with IBM about its recently-announced Oracle emulation strategy for DB2. (This is for DB2 9.7, which I gather was quasi-announced in April, will be re-announced in May, and will be re-re-announced as generally available in June.)
Key points include:
- This really is more like Oracle emulation than it is transparency, a term I carelessly used before.
- IBM’s Oracle emulation effort is focused on two technological goals:
- Making it easy for an Oracle application to be ported to DB2.
- Making it easy for an Oracle developer to develop for DB2.
- The initial target market for DB2’s Oracle emulation is ISVs (Independent Software Vendors) much more than it is enterprises. IBM suggested there are a couple hundred early adopters, primarily ISVs.
Because of Oracle’s market share, many ISVs focus on Oracle as the underlying database management system for their applications, whether or not they actually resell it along with their own software. IBM proposed three reasons why such ISVs might want to support DB2.
Cloudera presents the MapReduce bull case
Monday was fire-drill day regarding MapReduce vs. MPP relational DBMS. The upshot was that I was quoted in Computerworld and paraphrased in GigaOm as being a little more negative on MapReduce than I really am, in line with my comment:
Frankly, my views on MapReduce are more balanced than [my] weary negativity would seem to imply.
Tuesday afternoon the dial turned a couple notches more positive yet, when I talked with Michael Olson and Jeff Hammerbacher of Cloudera. Cloudera is a new company, built around the open source MapReduce implementation Hadoop. So far Cloudera gives away its Hadoop distribution, without charging for any sort of maintenance or subscription, and just gets revenue from professional services. Presumably, Cloudera plans for this business model to change down the road.
Much of our discussion revolved around Facebook, where Jeff directed a huge and diverse Hadoop effort. Apparently, Hadoop played much of the role of an enterprise data warehouse at Facebook — at least for clickstream/network data — including:
- 2 1/2 petabytes of data managed via Hadoop
- 10 terabytes/day of data ingested via Hadoop (Edit: Some of these metrics have been updated in a subsequent post about Facebook.)
- Ad targeting queries run every 15 minutes in Hadoop
- Dashboard roll-up queries run every hour in Hadoop
- Ad-hoc research/analytic Hadoop queries run whenever
- Anti-fraud analysis done in Hadoop
- Text mining (e.g., of things written on people’s “walls”) done in Hadoop
- 100s or 1000s of simultaneous Hadoop queries
- JSON-based social network analysis in Hadoop
Some Facebook data, however, was put into an Oracle RAC cluster for business intelligence. And Jeff does concede that query execution is slower in Hadoop than in a relational DBMS. Hadoop was also used to build the index for Facebook’s custom text search engine.
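To give a feel for how such recurring roll-ups get expressed, here is a minimal sketch of a Hadoop Streaming job in Python that counts ad impressions per ad per 15-minute window. The input layout and job shape are my own illustration, not Facebook’s actual pipeline.

```python
#!/usr/bin/env python
# Sketch of a Hadoop Streaming roll-up. In a real job, mapper() and
# reducer() would live in separate scripts passed to the streaming
# jar via -mapper and -reducer.
import sys

WINDOW = 15 * 60  # roll-up granularity, in seconds

def mapper(stream):
    """Emit one count per impression, keyed by (ad, window).
    Assumed input: epoch_seconds, tab, ad_id, tab, ... (hypothetical)."""
    for line in stream:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip malformed records
        ts, ad_id = int(fields[0]), fields[1]
        window_start = ts - ts % WINDOW
        # Streaming treats text before the first tab as the key, so
        # all counts for one (ad, window) reach the same reducer.
        print("%s:%d\t1" % (ad_id, window_start))

def reducer(stream):
    """Input arrives sorted by key, so a running tally suffices."""
    current, count = None, 0
    for line in stream:
        key, value = line.rstrip("\n").split("\t")
        if key != current and current is not None:
            print("%s\t%d" % (current, count))
            count = 0
        current = key
        count += int(value)
    if current is not None:
        print("%s\t%d" % (current, count))
```

The point is less the code than the division of labor: Hadoop supplies the distribution, sorting, and fault tolerance, and a recurring roll-up reduces to a few dozen lines.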
Jeff’s reasons for liking Hadoop over relational DBMS at Facebook are spelled out in the full post.
More Oracle notes
When I went to Oracle in October, the main purpose of the visit was to discuss Exadata. And so my initial post based on the visit was focused accordingly. But there were a number of other interesting points I’ve never gotten around to writing up. Let me now remedy that, at least in part.
Big scientific databases need to be stored somehow
A year ago, Mike Stonebraker observed that conventional DBMS don’t necessarily do a great job on scientific data, and further pointed out that different kinds of science might call for different data access methods. Even so, some of the largest databases around are scientific ones, and they have to be managed somehow. For example:
- Microsoft just put out an overwrought press release. The substance seems to be that Pan-STARRS (a Jim Gray legacy, also discussed in an August 2008 Computerworld article) is adding 1.4 terabytes of image data per night, and that another, not-so-new database adds 15 terabytes per year of computer simulation output used to analyze protein folding. Both run on SQL Server, of course.
- Kognitio has an astronomical database too, at Cambridge University, adding half a terabyte of data per night.
- Oracle is used for a McGill University proteomics database called CellMapBase. A figure of 50 terabytes of “mass storage” is cited, which doesn’t count tape backup and the like.
- The Large Hadron Collider, once it actually starts functioning, is projected to generate 15 petabytes of data annually, which will be initially stored on tape and then distributed to various computing centers around the world.
- Netezza is proud of its ability to serve images and the like quickly, although off the top of my head I’m not thinking of a major customer it has in that area. (Then again, if you just sell software, your academic discount can approach 100%; if, like Netezza, you have an actual cost of goods sold, that’s a less appealing option.)
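To put those ingest rates on a common footing, a quick back-of-envelope conversion (my arithmetic, based on the figures above):

```python
# Convert the ingest rates above to terabytes per year
# (decimal units; 1 PB = 1,000 TB).
rates_tb_per_year = {
    "Pan-STARRS images (1.4 TB/night)": 1.4 * 365,
    "Cambridge astronomy on Kognitio (0.5 TB/night)": 0.5 * 365,
    "Protein-folding simulation output": 15.0,
    "Large Hadron Collider (15 PB/year)": 15.0 * 1000,
}

for name, tb in sorted(rates_tb_per_year.items(), key=lambda kv: -kv[1]):
    print("%-48s %8.0f TB/year" % (name, tb))
```

Pan-STARRS alone works out to roughly half a petabyte per year, and the LHC to about thirty times that.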
Long-term, I imagine that the most suitable DBMS for these purposes will be MPP systems with strong datatype extensibility — e.g., DB2, PostgreSQL-based Greenplum, PostgreSQL-based Aster nCluster, or maybe Oracle.
Categories: Aster Data, Data types, Greenplum, IBM and DB2, Kognitio, Microsoft and SQL*Server, Netezza, Oracle, Parallelization, PostgreSQL, Scientific research
Teradata Geospatial, and datatype extensibility in general
As part of its 13.0 release this week, Teradata is productizing its geospatial datatype, which previously was just a downloadable library. (Edit: More precisely, Teradata announced 13.0, which will actually ship some time in 2009.) What Teradata Geospatial now amounts to is:
- User-defined functions (UDF) written by Teradata (this is the part that existed before).
- (Possibly new) Enhanced implementations of the Teradata geospatial UDFs, for better performance.
- (Definitely new) Optimizer awareness of the Teradata geospatial UDFs.
Teradata also intends to implement actual geospatial indexing in the future; candidates include R-trees and tessellation.
Hearing this was a good wake-up call for me, because in the past I’ve conflated two issues on datatype extensibility, namely:
- Whether the query executor uses a special access method (i.e., index type) for the datatype
- Whether the optimizer is aware of the datatypes.
But as Teradata just pointed out, those two issues can indeed be separated from each other.
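To illustrate the first of those issues: a tessellation-style access method carves space into cells, so a spatial query touches only candidate cells instead of scanning every row. A toy sketch of the idea (the grid scheme and cell size are my own illustration, not a description of Teradata’s plans):

```python
import math
from collections import defaultdict

CELL = 0.1  # grid cell size in degrees; a tuning choice

def cell_of(lat, lon):
    """Map a point to the tessellation cell containing it."""
    return (int(math.floor(lat / CELL)), int(math.floor(lon / CELL)))

class GridIndex:
    """Toy tessellation index: bucket row IDs by grid cell."""
    def __init__(self):
        self.cells = defaultdict(list)
        self.points = {}

    def insert(self, row_id, lat, lon):
        self.points[row_id] = (lat, lon)
        self.cells[cell_of(lat, lon)].append(row_id)

    def within_box(self, lat_lo, lon_lo, lat_hi, lon_hi):
        """Scan only cells overlapping the query box, then filter
        exactly, instead of scanning every row in the table."""
        lo_i, lo_j = cell_of(lat_lo, lon_lo)
        hi_i, hi_j = cell_of(lat_hi, lon_hi)
        hits = []
        for i in range(lo_i, hi_i + 1):
            for j in range(lo_j, hi_j + 1):
                for row_id in self.cells.get((i, j), []):
                    lat, lon = self.points[row_id]
                    if lat_lo <= lat <= lat_hi and lon_lo <= lon <= lon_hi:
                        hits.append(row_id)
        return hits

idx = GridIndex()
idx.insert("row1", 42.36, -71.06)
idx.insert("row2", 40.71, -74.01)
print(idx.within_box(42.0, -72.0, 43.0, -70.0))  # -> ['row1']
```

Optimizer awareness is then the separate question of whether the planner knows such an index exists and can estimate when using it beats a full scan.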
Categories: Data types, Data warehousing, GIS and geospatial, Teradata
Schema flexibility and XML data management
Conor O’Mahony, marketing manager for IBM’s DB2 pureXML, talks a lot about one of my favorite hobbyhorses — schema flexibility — as a reason to use an XML data model. In a number of industries he sees use cases based around ongoing change in the information being managed:
- Tax authorities change their rules and forms every year, but don’t want to do total rewrites of their electronic submission and processing software.
- The financial services industry keeps inventing new products, which don’t just have different terms and conditions, but may also have different kinds of terms and conditions.
- The same, to some extent, goes for the travel industry, which also keeps adding different kinds of offers and destinations.
- The energy industry keeps adding new kinds of highly complex equipment it has to manage.
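The common thread in those use cases is that records keep acquiring kinds of fields nobody anticipated. A minimal Python sketch of why an XML data model makes that painless (the product documents are invented for illustration; pureXML itself is queried via SQL/XML and XQuery, not Python):

```python
import xml.etree.ElementTree as ET

# Two financial "products" with different kinds of terms and
# conditions; no schema change was needed to introduce the second.
docs = [
    "<product><name>Basic Swap</name><rate>4.2</rate></product>",
    "<product><name>Exotic Note</name><rate>3.1</rate>"
    "<knockin_barrier>85</knockin_barrier></product>",
]

for raw in docs:
    product = ET.fromstring(raw)
    name = product.findtext("name")
    # Iterate over whatever term elements happen to be present.
    terms = dict((child.tag, child.text)
                 for child in product if child.tag != "name")
    print("%s %s" % (name, terms))
```

A fixed relational schema would need an ALTER TABLE, or an awkward name-value side table, for each new kind of term; here the new element simply shows up and the same code handles it.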
Conor also thinks market evidence shows that XML’s schema flexibility is important for data interchange.
Categories: Data models and architecture, EAI, EII, ETL, ELT, ETLT, IBM and DB2, pureXML, Structured documents
Vertical market XML standards
Tracking the alphabet soup of vertical market XML standards is hard. So as a starting point, I’m splitting a list I got from IBM into a standalone post.
The most important or successful IBM pureXML-supported standards, in terms of downloads and other evidence of customer interest, are listed in the full post.
Categories: Application areas, EAI, EII, ETL, ELT, ETLT, IBM and DB2, pureXML, Structured documents
Overview of IBM DB2 pureXML
On August 29, I had a great call with IBM about DB2 pureXML (most of the IBM side of the talking was done by Conor O’Mahony and Qi Jin). I’m finally getting around to writing it up now. (The world of tabular data warehousing has kept me just a wee bit busy …)
As I write it, I see there are a considerable number of holes, but that’s the way it seems to go when researching XML storage. I’m also writing up a September call from which I finally figured out (I think) the essence of how MarkLogic Server works – but only after five months of trying. It turns out that MarkLogic works rather differently from DB2 pureXML. Not coincidentally, IBM and Mark Logic focus on rather different use cases for native XML storage.
What I understand so far about the basic DB2 pureXML architecture is laid out in the full post.
Categories: EAI, EII, ETL, ELT, ETLT, IBM and DB2, pureXML, Structured documents
MarkLogic architecture deep dive
While I previously posted in great detail about how MarkLogic Server is an ACID-compliant XML-oriented DBMS with integrated text search that indexes everything in real time and executes range queries fairly quickly, I didn’t have a good feel for how all those apparently contradictory characteristics fit into a single product. But I finally had a call with Mark Logic Director of Engineering Ron Avnur, and think I have a better grasp of the MarkLogic architecture and story.
Ron described MarkLogic Server as a DBMS for trees.
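One rough mental model for indexing trees: decompose each document, as it loads, into (element path, word) postings, so that structural and text predicates alike become index lookups. A toy Python sketch of that general idea (my illustration, not MarkLogic’s actual data structures):

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

class TreeIndex:
    """Toy index over XML trees: posting lists keyed by
    (element path, word), built as documents are loaded."""
    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, xml_text):
        root = ET.fromstring(xml_text)
        self._walk(doc_id, root, "/" + root.tag)

    def _walk(self, doc_id, elem, path):
        for word in (elem.text or "").split():
            self.postings[(path, word.lower())].add(doc_id)
        for child in elem:
            self._walk(doc_id, child, path + "/" + child.tag)

    def query(self, path, word):
        """Which documents contain this word under this element path?"""
        return self.postings.get((path, word.lower()), set())

idx = TreeIndex()
idx.add(1, "<doc><title>MarkLogic architecture</title></doc>")
idx.add(2, "<doc><title>Teradata geospatial</title></doc>")
print(idx.query("/doc/title", "architecture"))  # -> {1}
```

Real systems layer positions, attribute values, and range-capable structures on top, but a posting-list core of this general sort is how one engine can answer text and structural queries at once.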
Categories: MarkLogic, Structured documents, Text