EAI, EII, ETL, ELT, ETLT
Analysis of data integration products and technologies, especially ones related to data warehousing, such as ELT (Extract/Transform/Load). Related subjects include:
Teradata SQL-H, using HCatalog
When I grumbled about the conference-related rush of Hadoop announcements, one example of many was Teradata Aster’s SQL-H. Still, it’s an interesting idea, and a good hook for my first shot at writing about HCatalog. Indeed, other than the Talend integration bundled into Hortonworks’ HDP 1, Teradata SQL-H is the first real use of HCatalog I’m aware of.
The Teradata SQL-H idea is:
- Register your Hadoop data to HCatalog. I’ll confess to being unclear about the details of how that works, for example in the case of data that just doesn’t fit well into flat relational tables. Stay tuned for future posts. For now, I’ll just note that:
- HCatalog is closely based on Hive’s metadata management. If you’ve run Hive against the data, HCatalog should already know about it.
- HCatalog can handle Pig and HBase data as well.
- Write SQL DDL (Data Description Language) so that your Aster cluster knows about the data.
- Write any Teradata Aster SQL/MR against that data. Some of the execution will be done on the Hadoop cluster, but pulling data back into Aster may well be necessary.
At least in theory, Teradata SQL-H lets you use a full set of analytic tools against your Hadoop data, with little limitation except price and/or performance. Teradata thinks the performance of all this can be much better than if you just use Hadoop (35X was mentioned in one particularly favorable example), but perhaps much worse than if you just copy/extract the data to an Aster cluster in the first place.
So what might the use cases be for something like SQL-H? Offhand, I’d say:
- SQL-H use cases are probably focused in areas where copying the data to Aster in advance doesn’t make a lot of sense. So presumably …
- … the Hadoop clusters involved would hold a lot more data than you’d want to pay for storing in Teradata Aster. E.g., think of cases where Hadoop is used as a big bit bucket or archival data store.
- There could be a kind of investigative workflow. First you play around with the Hadoop data via SQL-H. Then when you think you’re onto something, you set up ETL (Extract/Transform/Load) to get the data into Aster and ratchet up the effort.
By way of contrast, the whole thing makes less sense for dashboarding kinds of uses, unless the dashboard users are very patient when they want to drill down.
Why I’m so forward-leaning about Hadoop features
In my recent series of Hadoop posts, there were several cases where I had to choose between recommending that enterprises:
- Go with the most advanced features any vendor was credibly advocating.
- Be more cautious, and only adopt features that have been solidly proven in the field.
I favored the more advanced features each time. Here’s why.
To a first approximation, I divide Hadoop use cases into two major buckets, only one of which I was addressing with my comments:
1. Analytic data management.* Here I favored features over reliability because they are more important, for Hadoop as for analytic RDBMS before it. When somebody complains about an analytic data store not being ready for prime time, never really working, or causing them to tear their hair out, what they usually mean is that:
- It couldn’t do the work that needed doing …
- … with reasonable performance and turnaround time …
- … without undue effort in administration and/or programming.
Those complaints are much, much, more frequent than “It crashed”. So it was for Netezza, DATAllegro, Greenplum, Aster Data, Vertica, Infobright, et al. So it also is for Hadoop. And how does one address those complaints? By performance and feature enhancements, of the kind that the Hadoop community is introducing at high speed. Read more
Categories: Buying processes, Data warehousing, EAI, EII, ETL, ELT, ETLT, Hadoop, HBase, Hortonworks, Open source | 1 Comment |
Metamarkets’ back-end technology
This is part of a three-post series:
- Introduction to Metamarkets and Druid
- Druid overview
- Metamarkets’ back-end technology (this post)
The canonical Metamarkets batch ingest pipeline is a bit complicated.
- Data lands on Amazon S3 (uploaded or because it was there all along).
- Metamarkets processes it, primarily via Hadoop and Pig, to summarize and denormalize it, and then puts it back into S3.
- Metamarkets then pulls the data into Hadoop a second time, to get it ready to be put into Druid.
- Druid is notified, and pulls the data from Hadoop at its convenience.
By “get data read to be put into Druid” I mean:
- Build the data segments (recall that Druid manages data in rather large segments).
- Note metadata about the segments.
That metadata is what goes into the MySQL database, which also retains data about shards that have been invalidated. (That part is needed because of the MVCC.)
By “build the data segments” I mean:
- Make the sharding decisions.
- Arrange data columnarly within shard.
- Build a compressed bitmap for each shard.
When things are being done that way, Druid may be regarded as comprising three kinds of servers: Read more
Workday update
In August 2010, I wrote about Workday’s interesting technical architecture, highlights of which included:
- Lots of small Java objects in memory.
- A very simple MySQL backing store (append-only, <10 tables).
- Some modernistic approaches to application navigation.
- A faceted approach to BI.
I caught up with Workday recently, and things have naturally evolved. Most of what we talked about (by my choice) dealt with data management, business intelligence, and the overlap between the two.
It is now reasonable to say that Workday’s servers fall into at least seven tiers, although we talked mainly about five that work together as a kind of giant app/database server amalgamation. The three that do noteworthy data management can be described as:
- In-memory objects and transactions. This is similar to what Workday had before.
- Persistent MySQL. Part of this is similar to what Workday had before. In addition, Workday is now storing certain data in tables in the ordinary relational way.
- In-memory caching and indexing. This has three aspects:
- Indexes for the ordinary relational tables, organized in interesting ways.
- Indexes for Workday’s search-box navigation (as per my original Workday technical post, you can search across objects, task-names, etc.).
- Compressed copies of the Java objects, used to instantiate other servers as needed. The most obvious uses of this are:
- Recovery for the object/transaction tier.
- Launch for the elastic compute tier. (Described below.)
Two other Workday server tiers may be described as: Read more
QlikTech bought Expressor
QlikTech has bought Expressor. Notes on that include:
- Expressor wanted to offer data integration/ETL (Extract/Transform/Load) that was all things to all people — great parallel performance, great UI, great price, etc.
- In practice, Expressor seemed to focus on cheap/easy ETL in the Microsoft Windows (I mean server) market.
- Expressor never got much traction. This seems confirmed by the “more than 20” figure for headcount mentioned in the acquisition press release.
- Both the press release and some tweets by QlikTech’s Donald Farmer seem to confirm that Expressor is being taken off the market for “boil the ocean” ETL. It will be companion technology to/integrated technology in QlikView.
- Unsurprisingly, Donald indicated that Expressor technology would expand past its Microsoft focus. (Edit: “If needed”)
Categories: Business intelligence, EAI, EII, ETL, ELT, ETLT, Expressor, Pricing, QlikTech and QlikView | 5 Comments |
Human real-time
I first became an analyst in 1981. And so I was around for the early days of the movement from batch to interactive computing, as exemplified by:
- The rise of minicomputers as mainframe alternatives (first VAXen, then the ‘nix systems that did largely supplant mainframes).
- The move from batch to interactive computing even on mainframes, a key theme of 1980s application software industry competition.
Of course, wherever there is interactive computing, there is a desire for interaction so fast that users don’t notice any wait time. Dan Fylstra, when he was pitching me the early windowing system VisiOn, characterized this as response so fast that the user didn’t tap his fingers waiting.* And so, with the move to any kind of interactive computing at all came a desire that the interaction be quick-response/low-latency. Read more
DataStax Enterprise 2.0
Edit: Multiple errors in the post below have been corrected in a follow-on post about DataStax Enterprise and Cassandra.
My client DataStax is announcing DataStax Enterprise 2.0. The big point of the release is that there’s a bunch of stuff integrated together, including at least:
- Cassandra — the NoSQL DBMS, which DataStax sometimes calls “DataStax Server”. Edit: That’s not really a fair criticism of DataStax’s messaging.
- Hadoop MapReduce, which DataStax sometimes calls “Hadoop”. Edit: That is indeed fair. 🙂
- Sqoop — the general way to connect relational DBMS to Hadoop, which DataStax sometimes calls “RDBMS integration”.
- Solr — the search-centric Apache project, or big parts of it, which DataStax generally calls either “Solr” or “Solr compatibility”.
- log4j — an Apache project that has something or other to do with logging, or parts of it, which DataStax sometimes calls “log file integration”.
- DataStax OpsCenter — some management tools and so on around Cassandra and the rest of the product line.
DataStax stresses that all this runs on the same cluster, with the same administrative tools and so on. For example, on a single cluster:
- You can manage the interactive data for a web site.
- You can store the logs for that website.
- You can analyze all of the above in Hadoop.
Juggling analytic databases
I’d like to survey a few related ideas:
- Enterprises should each have a variety of different analytic data stores.
- Vendors — especially but not only IBM and Teradata — are acknowledging and marketing around the point that enterprises should each have a number of different analytic data stores.
- In addition to having multiple analytic data management technology stacks, it is also desirable to have an agile way to spin out multiple virtual or physical relational data marts using a single RDBMS. Vendors are addressing that need.
- Some observers think that the real essence of analytic data management will be in data integration, not the actual data management.
Here goes. Read more
Kinds of data integration and movement
“Data integration” can mean many different things, to an extent that’s impeding me from writing about the area. So I’ll start by simply laying out some of the myriad ways that data can be brought to where it is needed, and worry about other subjects later. Yes, this is a massive wall of text, and incomplete even so — but that in itself is my central point.
There are two main paradigms for data integration:
- Movement or replication — you take data from one place and copy it to another.
- Federation — you treat data in multiple different places logically as if it were all in one database.
Data movement and replication typically take one of three forms:
- Logical, transactional, or trigger-based — sending data across the wire every time an update happens, or as the result of a large-result-set query/extract, or in response to a specific request.
- Log-based — like logical replication, but driven by the transaction/update log rather than the core data management mechanism itself, so as to avoid directly overstressing the DBMS.
- Block/file-based — sending chunks of data, and expecting the target system to store them first and only make sense of them afterward.
Beyond the core functions of movement, replication, and/or federation, there are other concerns closely connected to data integration. These include:
- Transparency and emulation, e.g. via a layer of software that makes data in one format look like it’s in another. (If memory serves, this is the use case for which Larry DeBoever coined the term “middleware.”)
- Cleaning and quality — with new uses of data can come new requirements for accuracy.
- Master, reference, or canonical data —
- Archiving and information preservation — part of keeping data safe is ensuring that there are copies at various physical locations. Another part can be making it logically tamper-proof, or at least highly auditable.
In particular, the following are largely different from each other. Read more
Categories: Clustering, Data integration and middleware, EAI, EII, ETL, ELT, ETLT, eBay, Hadoop, MapReduce | 10 Comments |
Departmental analytics — best practices
I believe IT departments should support and encourage departmental analytics efforts, where “support” and “encourage” are not synonyms for “control”, “dominate”, “overwhelm”, or even “tame”. A big part of that is:
Let, and indeed help, departments have the data they want, when they want it, served with blazing performance.
Three things that absolutely should NOT be obstacles to these ends are:
- Corporate DBMS standards.
- Corporate data governance processes.
- The difficulties of ETL.