EAI, EII, ETL, ELT, ETLT

Analysis of data integration products and technologies, especially ones related to data warehousing, such as ETL (Extract/Transform/Load).

November 19, 2012

Incremental MapReduce

My clients at Cloudant, Couchbase, and 10gen/MongoDB (Edit: See Alex Popescu’s comment below) all boast a feature called incremental MapReduce. (And they’re not the only ones.) So I feel like making a quick post about it. For starters, I’ll quote myself about Cloudant:

The essence of Cloudant’s incremental MapReduce seems to be that data is selected only if it’s been updated since the last run. Obviously, this only works for MapReduce algorithms whose eventual output can be run on different subsets of the target data set, then aggregated in a simple way.
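To make the quoted idea concrete, here is a minimal Python sketch of the pattern. It is not Cloudant's, Couchbase's, or MongoDB's implementation; the names `map_doc` and `incremental_run`, the `updated` field, and the integer timestamps are all assumptions made for illustration. It also assumes documents are only ever added, never modified; handling a changed document would require backing out its earlier contribution first.

```python
from collections import Counter


def map_doc(doc):
    """Map step (hypothetical): count the words in one document."""
    return Counter(doc["text"].split())


def incremental_run(docs, state, last_run):
    """Fold only documents updated after `last_run` into `state`.

    `state` is the aggregate from earlier runs. Merging works because
    counting is additive: partial results computed on disjoint subsets
    of the data can simply be summed.
    """
    newest = last_run
    for doc in docs:
        if doc["updated"] > last_run:
            state.update(map_doc(doc))  # Counter.update adds counts
            newest = max(newest, doc["updated"])
    return state, newest


docs = [
    {"text": "a b a", "updated": 1},
    {"text": "b c", "updated": 2},
]
state, checkpoint = incremental_run(docs, Counter(), last_run=0)

# A later run touches only the document added since the checkpoint.
docs.append({"text": "a c c", "updated": 3})
state, checkpoint = incremental_run(docs, state, checkpoint)
```

A median or a distinct count would not fit this shape, because their partial results cannot be combined by simple addition.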

These implementations of incremental MapReduce are hacked together by teams vastly smaller than those working on Hadoop, and surely fall short of Hadoop in many areas such as performance, fault-tolerance, and language support. That’s a given. Still, if the jobs are short and simple, those deficiencies may be tolerable.

A StackOverflow thread about MongoDB’s version of incremental MapReduce highlights some of the implementation challenges.

But all practicality aside, let’s return to the point that incremental MapReduce only works for some kinds of MapReduce-based algorithms, and consider how much of a limitation that really is. Looking at the Map steps sheds a little light: Read more

Categories: Cloudant, Couchbase, EAI, EII, ETL, ELT, ETLT, Hadoop, MapReduce, MongoDB, RDF and graphs

1 Comment

October 18, 2012

Notes on Hadoop adoption and trends

With Strata/Hadoop World being next week, there is much Hadoop discussion. One theme of the season is BI over Hadoop. I have at least 5 clients claiming they’re uniquely positioned to support that (most of whom partner with a 6th client, Tableau); the first 2 whose offerings I’ve actually written about are Teradata Aster and Hadapt. More generally, I’m hearing “Using Hadoop is hard; we’re here to make it easier for you.”

If enterprises aren’t yet happily running business intelligence against Hadoop, what are they doing with it instead? I took the opportunity to ask Cloudera, whose answers didn’t contradict anything I’m hearing elsewhere. As Cloudera tells it (approximately — this part of the conversation* was rushed): Read more

Categories: Business intelligence, Cloudera, EAI, EII, ETL, ELT, ETLT, Hadoop, HBase, Health care, Investment research and trading, MapR, Market share and customer counts, Telecommunications, Web analytics

5 Comments

October 7, 2012

IBM’s ETL

Bearing in mind the difficulties in covering big companies and their products, I had a call with IBM about its core ETL technology (Extract/Transform/Load), and have some notes accordingly. It’s pretty reasonable to say that there are and were a Big Three of high-end ETL vendors:

Informatica
IBM/Ascential
Ab Initio

However, IBM fondly thinks there are a Big Two, on the theory that Informatica PowerCenter can’t scale as well as IBM and Ab Initio can, and hence gets knocked out of deals when particularly strong scalability and throughput are required. Read more

Categories: EAI, EII, ETL, ELT, ETLT, MapReduce, Parallelization

7 Comments

September 24, 2012

Notes on Hadoop adoption

I successfully resisted telephone consulting while on vacation, but I did do some by email. One was on the oft-recurring subject of Hadoop adoption. I think it’s OK to adapt some of that into a post.

Notes on past and current Hadoop adoption include:

Enterprise Hadoop adoption is for experimental uses or departmental production (as opposed to serious enterprise-level production). Indeed, it’s rather tough to disambiguate those two. If an enterprise uses Hadoop to search for new insights and gets a few, is that an experiment that went well, or is it production?
One of the core internet-business use cases for Hadoop is a many-step ETL, ELT, and data refinement pipeline, with Hadoop executing some or many of the steps. But I don’t think that’s in production at many enterprises yet, except in the usual forward-leaning sectors of financial services and (we’re all guessing) national intelligence.
In terms of industry adoption:
- Financial services on the investment/trading side are all over Hadoop, just as they’re all over any technology. Ditto national intelligence, one thinks.
- Consumer financial services, especially credit card, are giving Hadoop a try too, for marketing and/or anti-fraud.
- I’m sure there’s some telecom usage, but I’m hearing of less than I thought I would. Perhaps this is because telcos have spent so long optimizing their data into short, structured records.
- Whatever consumer financial services firms do, retailers do too, albeit with smaller budgets.

Thoughts on how Hadoop adoption will look going forward include: Read more

Categories: Cloud computing, Data warehouse appliances, Data warehousing, EAI, EII, ETL, ELT, ETLT, Hadoop, Investment research and trading, Telecommunications

3 Comments

September 7, 2012

Integrated internet system design

What are the central challenges in internet system design? We probably all have similar lists, comprising issues such as scale, scale-out, throughput, availability, security, programming ease, UI, or general cost-effectiveness. Screw those up, and you don’t have an internet business.

Much new technology addresses those challenges, with considerable success. But the success is usually one silo at a time — a short-request application here, an analytic database there. When it comes to integration, unsolved problems abound.

The top integration and integration-like challenges for me, from a practical standpoint, are:

Integrating silos — a decades-old problem still with us in a big way.
Dynamic schemas with joins.
Low-latency business intelligence.
Human real-time personalization.

Other concerns that get mentioned include:

Geographical distribution due to privacy laws, which for some users is a major requirement for compliance.
Logical data warehouse, a term that doesn’t actually mean anything real.
In-memory data grids, which some day may no longer always be hand-coupled to the application and data stacks they accelerate.

Let’s skip those latter issues for now, focusing instead on the first four.

Categories: About this blog, Business intelligence, Cache, Clustering, Data integration and middleware, Data warehousing, Database diversity, EAI, EII, ETL, ELT, ETLT, Exadata, NoSQL, OLTP, Oracle, Predictive modeling and advanced analytics, SAP AG, Surveillance and privacy

6 Comments

August 24, 2012

Hadoop notes: Informatica, Splunk, and IBM

Informatica, Splunk, and IBM are all public companies, and correspondingly reluctant to talk about product futures. Hence, anything I might suggest about product futures from any of them won’t be terribly detailed, and even the vague generalities are “the Good Lord willin’ an’ the creek don’ rise”.

Never let a rising creek overflow your safe harbor.

Anyhow:

1. Hadoop can be an awesome ETL (Extract/Transform/Load) execution engine; it can handle huge jobs and perform a great variety of transformations. (Indeed, MapReduce was invented to run giant ETL jobs.) Thus, if one offers a development-plus-execution stack for ETL processes, it might seem appealing to make Hadoop an ETL execution option. And so:

I’ve already posted that BI-plus-light-ETL vendors Pentaho and Datameer are using Hadoop in that way.
Informatica will be using Hadoop as an execution option too.

Informatica told me about other interesting Hadoop-related plans as well, but I’m not sure my frieNDA allows me to mention them at all.

IBM, however, is standing aside. Specifically, IBM told me that it doesn’t see the point of doing the same thing, as its ETL engine — presumably derived from the old Ascential product line — is already parallel and performant enough.
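For readers who want to see why ETL maps so naturally onto MapReduce, here is a small in-process Python sketch. It uses no Hadoop API and does not represent any vendor's product; the sample data, the record layout (date, country, amount), and the function names are all hypothetical. The map step does the extract and transform work, the shuffle groups by key, and the reduce step produces load-ready rows.

```python
from collections import defaultdict

# Hypothetical raw input: date,country,amount
RAW_LINES = [
    "2012-08-24,US,19.99",
    "2012-08-24,us,5.00",
    "bad line",
    "2012-08-24,DE,7.50",
]


def map_line(line):
    """Extract + transform: parse one raw line, normalize it, emit (key, value)."""
    parts = line.split(",")
    if len(parts) != 3:
        return []  # drop malformed records
    date, country, amount = parts
    return [((date, country.upper()), float(amount))]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_group(key, values):
    """Aggregate one group into a single load-ready row."""
    return key, round(sum(values), 2)


def run_etl(lines):
    """Run map, shuffle, and reduce in one process and return the result rows."""
    pairs = [pair for line in lines for pair in map_line(line)]
    return dict(reduce_group(k, v) for k, v in shuffle(pairs).items())


warehouse_rows = run_etl(RAW_LINES)
```

In a real cluster the map and reduce calls would run in parallel across nodes; the point here is only that each step is independent per record or per key.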

2. Last year, I suggested that Splunk and Hadoop are competitors in managing machine-generated data. That’s still true, but Splunk is also preparing a Hadoop co-opetition strategy. To a first approximation, it’s just Hadoop import/export. However, suppose you view Splunk as offering a three-layer stack: Read more

Categories: EAI, EII, ETL, ELT, ETLT, Hadoop, IBM and DB2, Informatica, Log analysis, MapReduce, Splunk

9 Comments

August 8, 2012

What kinds of metadata are important anyway?

In today’s post about HCatalog, I noted that the Hadoop/HCatalog community didn’t necessarily understand all the kinds of metadata that enterprises need and want, especially in the context of data integration and ETL and ELT (Extract/Transform/Load and Extract/Load/Transform). That raises a natural question: what kinds of metadata do users need or want? In the hope of spurring discussion, from vendors and users alike, I’m splitting this question out into a separate post.

Please comment with your thoughts about ETL-related metadata needs. The conversation needs to advance.

In the relational world, there are at least three kinds of metadata:

Definitional information about data structures, without which you can’t have a relational database at all. That area seems binary; either you have enough to make sense of your data or you don’t.
Statistics about columns and tables, such as the most frequent values and how often they occur, which are kept for the purpose of optimization. Those seem to be nice-to-haves more than must-haves. The more information of this kind you have, the more chances you have to save resources.
Historical and security information about data. This is where things get really complicated. It’s also where Hadoop is still in the “So what exactly should we build?” stage of design.

As I see it:

Historical information about data answers questions in the realm of “Who did what to which data when?”
Security information about data answers questions around “Who may do what to which data in the future?”
They overlap because:
- They rely on closely related schemes for assessing roles and identity.
- Audit trails, a key aspect of security and compliance, could logically be viewed as falling in the realm of “history”.
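The overlap between the two can be sketched in a few lines of Python. This is an illustration only, not a description of HCatalog or any other product; the class names (`Grant`, `AuditEvent`, `MetadataStore`) and their fields are hypothetical. Both kinds of metadata lean on the same user-to-role mapping, and every permission check leaves an audit record behind.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Grant:
    """Security metadata: who may do what to which data in the future."""
    role: str
    action: str
    dataset: str


@dataclass(frozen=True)
class AuditEvent:
    """Historical metadata: who did what to which data when."""
    user: str
    action: str
    dataset: str
    at: datetime
    allowed: bool


@dataclass
class MetadataStore:
    roles: dict  # user -> set of roles; shared by both kinds of metadata
    grants: set = field(default_factory=set)
    history: list = field(default_factory=list)

    def may(self, user, action, dataset):
        """Check the security metadata via the user's roles."""
        return any(
            Grant(role, action, dataset) in self.grants
            for role in self.roles.get(user, ())
        )

    def attempt(self, user, action, dataset):
        """Check permission and record the attempt in the audit trail."""
        allowed = self.may(user, action, dataset)
        self.history.append(
            AuditEvent(user, action, dataset, datetime.now(timezone.utc), allowed)
        )
        return allowed


store = MetadataStore(
    roles={"ann": {"analyst"}},
    grants={Grant("analyst", "read", "sales")},
)
store.attempt("ann", "read", "sales")
store.attempt("ann", "write", "sales")
```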

Categories: EAI, EII, ETL, ELT, ETLT, Hadoop

9 Comments

July 28, 2012

Some Vertica 6 features

Vertica 6 was recently announced, and so it seemed like a good time to catch up on Vertica features. The main topics I want to address are:

External tables and the associated new Hadoop connector.
Online schema evolution.
Workload management.

Also:

I have some tidbits to add to my June, 2011 coverage of Vertica’s analytic functionality.
I’ll stand for now on my previous coverage of Vertica’s database organization.

In general, the main themes of Vertica 6 appear to be:

Enterprise/SaaS-friendliness, high uptime, and so on.
Improved analytic usefulness.

Let’s do the analytic functionality first. Notes on that include:

Vertica has extended its user-defined function/analytic procedure/whatever functionality to include user-defined load. (Same SDK, different specific classes.)
One of the languages Vertica supports is R. But for now, parallel R is limited to “Of course, you can run the same functions and procedures on many nodes at once.”
Based on community activity around bugs and so on, it seems there are users for Vertica’s JSON-based Twitter sentiment analysis plug-in.

I’ll also take this opportunity to expand on something I wrote about a few vendors — including Vertica — at the end of my post on approximate query results. When I probed how customers of Vertica and other RDBMS-based analytic platform vendors used vendor-proprietary advanced analytic SQL and other analytic capabilities, answers included: Read more

Categories: Columnar database management, Data warehousing, EAI, EII, ETL, ELT, ETLT, Hadoop, Investment research and trading, Predictive modeling and advanced analytics, SQL/Hadoop integration, Vertica Systems, Workload management

2 Comments

July 24, 2012

Notes on Datameer

In a short October, 2011 post about Datameer, I wrote:

Datameer is designed to let you do simple stuff on large amounts of data, where “large amounts of data” typically means data in Hadoop, and “simple stuff” includes basic versions of a spreadsheet, of BI, and of EtL (Extract/Transform/Load, without much in the way of T).

That’s all still mainly true, although with the recent Datameer 2.0:

You can run Datameer and the underlying Hadoop on a desktop or for a workgroup.
There are some infographics pretty-picture-drawing capabilities, which will surely delight those who like vector-based HTML 5 pictures of coffee cups, saucers and macaroons.
No doubt Datameer has been generally enhanced on multiple fronts.

In essence, Datameer has two positionings.

One is “OK, you’ve got Hadoop — now wouldn’t you like to do something useful with it?” That can include both business intelligence and ETL.
Beyond that, Datameer founder/CEO Stefan Groschupf’s core argument is that schema-on-read is really, really useful, even at the cost of absorbing a potentially large performance hit. In other words, he’s making a case for a form of non-relational BI.
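As a rough illustration of schema-on-read, and not of how Datameer actually works, the Python sketch below stores raw JSON lines untouched and applies a schema only when a query reads them. The sample records, field names, and the `read_with_schema` helper are hypothetical.

```python
import json

# Raw records are stored as-is; no schema was enforced at load time.
RAW = [
    '{"user": "u1", "amount": "12.50", "country": "US"}',
    '{"user": "u2", "amount": "3", "referrer": "ad"}',
    '{"user": "u3"}',
]


def read_with_schema(raw_lines, schema):
    """Apply `schema` (field name -> converter) while reading.

    Fields absent from a record come back as None rather than failing.
    """
    for line in raw_lines:
        record = json.loads(line)
        yield {
            name: convert(record[name]) if name in record else None
            for name, convert in schema.items()
        }


# Two different "schemas" over the same stored data, chosen at query time.
spend_view = list(read_with_schema(RAW, {"user": str, "amount": float}))
referrer_view = list(read_with_schema(RAW, {"user": str, "referrer": str}))
```

The performance cost is visible even here: every read re-parses and re-converts the raw text, work that a schema-on-write system would have done once at load.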

Categories: Business intelligence, Data models and architecture, Datameer, EAI, EII, ETL, ELT, ETLT, Hadoop, Log analysis, Market share and customer counts, Web analytics

8 Comments

July 8, 2012

Database diversity revisited

From time to time, I try to step back and build a little taxonomy for the variety in database technology. One effort was 4 1/2 years ago, in a pre-planned exchange with Mike Stonebraker (his side, alas, has since been taken down). A year ago I spelled out eight kinds of analytic database.

The angle I’ll take this time is to say that every sufficiently large enterprise needs to be cognizant of at least 7 kinds of database challenge. General notes on that include:

I’m using the weasel words “database challenge” to evade questions as to what is or isn’t exactly a DBMS.
One “challenge” can call for multiple products and technologies even within a single enterprise, let alone at different ones. For example, in this post the “eight kinds of analytic database” are reduced to just a single category.
Even so, one product or technology may be well-suited to address a couple different kinds of challenges.

The Big Seven database challenges that almost any enterprise faces are: Read more

Categories: Data integration and middleware, Data models and architecture, Database diversity, EAI, EII, ETL, ELT, ETLT, Hadoop, Memory-centric data management, NoSQL, Object, OLTP, RDF and graphs, Structured documents, Talend, Text

3 Comments
