Kinds of data integration and movement
“Data integration” can mean many different things, to an extent that has been getting in the way of my writing about the area. So I’ll start by simply laying out some of the myriad ways that data can be brought to where it is needed, and worry about other subjects later. Yes, this is a massive wall of text, and incomplete even so — but that in itself is my central point.
There are two main paradigms for data integration:
- Movement or replication — you take data from one place and copy it to another.
- Federation — you treat data in multiple different places logically as if it were all in one database.
Data movement and replication typically take one of three forms:
- Logical, transactional, or trigger-based — sending data across the wire every time an update happens, or as the result of a large-result-set query/extract, or in response to a specific request. (A small sketch of this flavor follows the list.)
- Log-based — like logical replication, but driven by the transaction/update log rather than the core data management mechanism itself, so as to avoid directly overstressing the DBMS.
- Block/file-based — sending chunks of data, and expecting the target system to store them first and only make sense of them afterward.
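To make the first of these forms a bit more concrete, here is a minimal sketch of trigger-based change capture and apply. It uses SQLite via Python purely because that is easy to run; the accounts table, change_log table, and trigger names are all made up for illustration, and a real replication product would of course also handle ordering, transactions, conflicts, and much more.

```python
# A toy sketch of trigger-based change capture and apply, using SQLite purely
# for illustration; the table, trigger, and column names are all hypothetical.
import sqlite3

source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")

source.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL);
    CREATE TABLE change_log (id INTEGER, balance REAL);
    CREATE TRIGGER capture_insert AFTER INSERT ON accounts
        BEGIN INSERT INTO change_log VALUES (NEW.id, NEW.balance); END;
    CREATE TRIGGER capture_update AFTER UPDATE ON accounts
        BEGIN INSERT INTO change_log VALUES (NEW.id, NEW.balance); END;
""")
target.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")

# Ordinary application activity on the source; the triggers record each change.
source.execute("INSERT INTO accounts VALUES (1, 100.0)")
source.execute("UPDATE accounts SET balance = 150.0 WHERE id = 1")

# The replication step: read the captured changes and apply them to the target.
for row in source.execute("SELECT id, balance FROM change_log"):
    target.execute("INSERT INTO accounts (id, balance) VALUES (?, ?) "
                   "ON CONFLICT(id) DO UPDATE SET balance = excluded.balance", row)
```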
Beyond the core functions of movement, replication, and/or federation, there are other concerns closely connected to data integration. These include:
- Transparency and emulation, e.g. via a layer of software that makes data in one format look like it’s in another. (If memory serves, this is the use case for which Larry DeBoever coined the term “middleware.”)
- Cleaning and quality — with new uses of data can come new requirements for accuracy.
- Master, reference, or canonical data — getting different systems and applications to agree on a single authoritative version of key entities, such as customers or products.
- Archiving and information preservation — part of keeping data safe is ensuring that there are copies at various physical locations. Another part can be making it logically tamper-proof, or at least highly auditable.
In particular, the following are largely different from each other.
Local replication, for high availability and/or read scalability. If you have an important database application — SQL or non-SQL as the case may be — and you want to ensure its availability, you may choose to have a second machine capable of running it, on standby, possessed of a fully and reliably current copy of the data. These needs are usually met by local, synchronous replication, where “synchronous” means that an update is not committed on the first machine until there’s assurance it will also go through on the second one. (Fortunately, there are faster forms of synchronous replication than two-phase commit.)
In some cases, you also need the application to run against several copies of the data, for performance. That can be achieved with similar technology.
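For illustration only, here is a toy Python sketch of the synchronous-versus-asynchronous distinction. The Replica class, the queue, and the simulated delay are stand-ins rather than any real DBMS’s replication machinery; the point is simply that a synchronous commit waits for the copy’s acknowledgement, while an asynchronous one does not.

```python
# Toy sketch contrasting synchronous and asynchronous replication; the Replica
# class, queue, and artificial delay are stand-ins, not any real DBMS API.
import queue
import threading
import time

class Replica:
    def __init__(self):
        self.inbox = queue.Queue()
        self.data = {}
        threading.Thread(target=self._apply_loop, daemon=True).start()

    def _apply_loop(self):
        while True:
            key, value, ack = self.inbox.get()
            time.sleep(0.01)      # pretend this is network plus apply latency
            self.data[key] = value
            if ack is not None:
                ack.set()         # tell the primary the write is durable here

def commit(replica, key, value, synchronous):
    """Ship a write to the replica; optionally wait for its acknowledgement."""
    ack = threading.Event() if synchronous else None
    replica.inbox.put((key, value, ack))
    if synchronous:
        ack.wait()  # do not report commit until the replica has confirmed

replica = Replica()
commit(replica, "k1", "v1", synchronous=True)   # slower, but the copy is current
commit(replica, "k2", "v2", synchronous=False)  # faster; the copy may lag briefly
```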
Remote replication, most commonly (but not only) for disaster recovery. If you really, really want to ensure application availability, you may also choose to have a live system ready to go at a different physical location. In that case you still want best-realistic-speed for data replication, but it needs to be asynchronous — your updates on the main system can’t afford to wait for acknowledgements to be sent back and forth across a wide-area network.
Synchronously replicated or not, you might want to send work to your high-availability or disaster recovery database copies, for performance, to the extent that you have them anyway. In particular, asynchronous replication is fast enough for almost any analytic use case.
Low-latency replication, to populate an analytic database. It’s increasingly desirable to stream/trickle operational updates straight into an analytic database. But the two database management systems involved are likely to be different. That places additional demands on the replication technology beyond what is needed for replicating like-to-like.
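As a hedged illustration of that trickle-feed idea, the sketch below maps a hypothetical operational change event into an equally hypothetical analytic fact table as soon as the event arrives. The event fields, the order_facts layout, and the use of SQLite as the “analytic” store are all assumptions made for the sake of a runnable example.

```python
# Sketch of trickle-feeding an analytic store from operational change events.
# The event fields and the order_facts layout are hypothetical; SQLite stands
# in for the analytic DBMS.
import sqlite3

analytic = sqlite3.connect(":memory:")
analytic.execute("""CREATE TABLE order_facts
                    (order_id INTEGER, customer_id INTEGER, amount REAL, day TEXT)""")

def on_change_event(event):
    """Map an operational update into the analytic schema and load it right away."""
    analytic.execute(
        "INSERT INTO order_facts VALUES (?, ?, ?, ?)",
        (event["id"], event["cust"], event["total"], event["ts"][:10]))

# Pretend these arrive one at a time from the OLTP system's log or triggers.
on_change_event({"id": 1, "cust": 42, "total": 19.99, "ts": "2011-09-15T10:03:00"})
on_change_event({"id": 2, "cust": 7,  "total": 5.00,  "ts": "2011-09-15T10:04:30"})
```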
Rebalancing, in a shared-nothing and/or adaptable DBMS. Databases are not always kept in single, homogeneous systems. That can create the need to move or copy data from one place to another. An obvious example is when you add, delete, or repair nodes in a shared-nothing system. Another is when data is moved among different tiers of storage (e.g. solid-state vs. hard-disk).
Cross-database information lifecycle management (ILM). Sometimes you rebalance among different databases, managed by different hardware or software. Even if we assume that off-line storage isn’t involved — “disk is the new tape” — general ILM is a lot more complicated than the single-DBMS kind.
ETL (Extract/Transform/Load). In the simplest cases, ETL takes data from one database and puts it in another, often on a batch basis, sometimes in a trickle/stream. But unlike what we call “replication”, ETL also allows significant changes to data along the way. Even ETL distinguished primarily by performance puts data through complex processing pipelines.
Conceptually, and irrespective of what really is or isn’t going on, it’s probably easier to think of ETL as something you copy data into and then back out of than as simply a set of pipes.
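Here is a minimal ETL sketch along those lines, with SQLite standing in for both source and target. The raw_customers and customers tables and the trivial cleanup rule are hypothetical; the point is that the transform happens in the pipeline, outside either database.

```python
# Minimal ETL sketch: extract rows, transform them in the pipeline itself,
# then load the result. The tables and the cleanup rule are made up.
import sqlite3

src = sqlite3.connect(":memory:")
dst = sqlite3.connect(":memory:")
src.executescript("""CREATE TABLE raw_customers (name TEXT, country TEXT);
                     INSERT INTO raw_customers VALUES (' Alice ', 'us'), ('Bob', 'DE');""")
dst.execute("CREATE TABLE customers (name TEXT, country TEXT)")

def transform(row):
    name, country = row
    return name.strip(), country.upper()    # the "T" happens outside either database

rows = src.execute("SELECT name, country FROM raw_customers")     # Extract
dst.executemany("INSERT INTO customers VALUES (?, ?)",            # Load
                (transform(r) for r in rows))
```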
ELT (Extract/Load/Transform). On the other hand, sometimes ETL’s main function is indeed piping, with the transformation happening after the data gets to its new location. ELT can be appealing when (for example) the destination is a cheap, efficient analytic DBMS. Also, a particularly rich form of ELT is possible using Hadoop (or other) MapReduce.
Often, it’s reasonable to say that the “E” and “L” parts of ELT are done via ETL technology, while the “T” is done via something else.
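And here is the corresponding ELT sketch, under the same assumptions about hypothetical table names: the extract is loaded untouched into a staging table, and the transformation then runs as SQL inside the destination.

```python
# Minimal ELT sketch: the raw extract is loaded as-is, and the transformation
# runs as SQL inside the destination. In practice the destination would be an
# analytic DBMS rather than SQLite, and the table names are hypothetical.
import sqlite3

dst = sqlite3.connect(":memory:")
dst.execute("CREATE TABLE staging_customers (name TEXT, country TEXT)")

# Extract + Load: dump the source rows into a staging table untouched.
dst.executemany("INSERT INTO staging_customers VALUES (?, ?)",
                [(" Alice ", "us"), ("Bob", "DE")])

# Transform: let the destination's own SQL engine do the work after loading.
dst.execute("""CREATE TABLE customers AS
               SELECT TRIM(name) AS name, UPPER(country) AS country
               FROM staging_customers""")
```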
Data mart spin-out. When he was at eBay, Oliver Ratzesberger made a big deal of the ability to spin out a data mart very quickly. There are two kinds of ways to do this:
- Virtual data mart spin-out, with no physical data movement at all. (This is Oliver’s preferred way.)
- Physical data mart spin-out, based on copying the data. Greenplum, inspired by then-customer eBay, was probably the first to make a big fuss about that.
One way and/or the other, fast data-mart spin-out has become an important — albeit still forward-leaning — feature for analytic DBMS.
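A small sketch may help contrast the two approaches, with a single SQLite database standing in for both warehouse and mart and with purely illustrative names: the virtual spin-out is just a view, while the physical spin-out copies the subset of data.

```python
# Sketch contrasting virtual and physical data mart spin-out. One SQLite
# database stands in for both warehouse and mart; names are illustrative only.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""CREATE TABLE sales (region TEXT, amount REAL);
                    INSERT INTO sales VALUES ('EMEA', 10.0), ('APAC', 7.5);""")

# Virtual spin-out: no data moves; the "mart" is just a scoped view.
db.execute("CREATE VIEW emea_mart AS SELECT * FROM sales WHERE region = 'EMEA'")

# Physical spin-out: copy the subset, so the mart can live (and be tuned) separately.
db.execute("CREATE TABLE emea_mart_copy AS SELECT * FROM sales WHERE region = 'EMEA'")
```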
Business intelligence tools, querying multiple databases. Often, enterprises have looked to BI to achieve what they see as data integration. It’s pretty straightforward for a BI tool to query multiple relational databases. This is not exactly the same thing as doing ETL to support a BI project, or even as selling BI and ETL more or less bundled together.
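As a toy illustration, the sketch below pulls rows from two separate relational databases and joins them client-side, the way a BI tool might. The schemas and the in-memory SQLite databases are assumptions for the sake of a runnable example.

```python
# Sketch of a BI-style cross-database query: pull rows from two separate
# relational databases and join them in the tool itself. Schemas are made up.
import sqlite3

orders_db = sqlite3.connect(":memory:")
crm_db = sqlite3.connect(":memory:")
orders_db.executescript("""CREATE TABLE orders (cust_id INTEGER, amount REAL);
                           INSERT INTO orders VALUES (1, 20.0), (2, 35.0);""")
crm_db.executescript("""CREATE TABLE customers (cust_id INTEGER, name TEXT);
                        INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');""")

names = dict(crm_db.execute("SELECT cust_id, name FROM customers"))
report = [(names[cust_id], amount)              # the join happens client-side
          for cust_id, amount in orders_db.execute("SELECT cust_id, amount FROM orders")]
print(report)   # [('Alice', 20.0), ('Bob', 35.0)]
```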
Indexing, search, and/or query of external databases, from within a particular data store. Sometimes it’s the data store itself that reaches out to other databases. Text search engines are particularly likely to index information stored in other places — and by the way, in the case of text, the index usually holds a complete copy of the information being indexed. But relational DBMS have occasionally-used “external table” functionality as well.
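By way of a loose analogue of that “external table” idea, the sketch below queries one database from inside another. SQLite’s ATTACH is used only as a stand-in for a relational DBMS’s external-table feature; the file path and table names are made up.

```python
# Loose analogue of "external table" functionality: query another database
# from inside one. SQLite's ATTACH stands in for a DBMS's external-table
# feature; the file path and table names are made up.
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "external.db")
ext = sqlite3.connect(path)
ext.executescript("""CREATE TABLE prices (sku TEXT, price REAL);
                     INSERT INTO prices VALUES ('A1', 9.99);""")
ext.commit()
ext.close()

main = sqlite3.connect(":memory:")
main.execute("CREATE TABLE inventory (sku TEXT, on_hand INTEGER)")
main.execute("INSERT INTO inventory VALUES ('A1', 3)")
main.execute("ATTACH DATABASE ? AS ext", (path,))
rows = main.execute("""SELECT i.sku, i.on_hand, e.price
                       FROM inventory i JOIN ext.prices e ON i.sku = e.sku""").fetchall()
print(rows)   # [('A1', 3, 9.99)]
```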
Different storage engines in the same DBMS. Frequently, the makers of a single database management system find it advantageous to have two or more rather different storage engines under the covers. The base case is that some data gets put in one engine, some in another, and that’s that. But in a few cases, data might move from one engine to another. An example of the latter strategy is Vertica, which features both a write-optimized store (in memory) and a read-optimized store (what we really think of as Vertica).
Bidirectional Hadoop (HDFS)/DBMS connectors. Variants on these two approaches are the bidirectional Hadoop connectors that various DBMS vendors have announced. Details vary, but functionality can include the ability to do DBMS queries that incorporate Hadoop data and/or Hadoop jobs that directly access the DBMS.
Service-Oriented Architecture (SOA), Enterprise Application Integration (EAI), Enterprise Service Bus (ESB), Composite applications … There are a whole lot of concepts that boil down to “letting loosely-coupled applications or databases — usually operational ones — exchange data as needed.” I probably should know more about these areas.
Of course, even that taxonomy understates the complexity of the subject. Most notably, various capabilities are combined in single vendors, products, or projects. For example:
- If you’re going NewSQL or NoSQL, you probably expect answers for local replication, remote replication, and rebalancing. You may want to stream to an analytic DBMS or Hadoop cluster as well.
- ELT and ETL are commonly combined into an ETLT strategy. In particular, ETL vendors are trying to subsume the data transformation capabilities of Hadoop.
I’m sure there’s much here to disagree with, and even more to criticize in the way of omissions. Fire away! Just please recall — no market categorization is ever precise.
Comments
One important category of data movement is (application) data migration; either big bang or gradual migration of data from “as is” to “to be” systems. Migrations may happen over a weekend, or take many months or even years during which old and new systems have to co-exist happily. Depending on the size, complexity and APIs available for the applications, the technology used may range from simple ETL through to complex EAI. Data quality tools can (and should) also play a large part in the process.
Regards
Good points, Nigel.
Hi Curt,
Some Business Intelligence tools require a materialized approach with up-front ELT/ETL into cubes or in-memory acceleration structures, which allows for row-level joins of entire data sets from disparate sources. Other tools provide a mediated schema and federate ad-hoc queries against that schema to the constituent data sources. This on-demand approach can be used to fetch disaggregate data which the tool joins at the row level prior to aggregating, or the tool may issue aggregate queries and join only the post-aggregated data along a common axis.
Each of these approaches has different characteristics for quantity of data transfer, startup latency, query processing latency, query expressiveness and supported data transformations such as cleaning and reshaping data. I think this topic merits more discussion, including possible comparisons with federated database products dedicated to this task.
-Robert
Great birds-eye view of the problem space.
One particularly vexing area I’ve encountered repeatedly is doing joins between data from different databases/sources. In the trading world we often had a market data database to be joined with reference data about the products, but with different types of symbols in the two databases. For example the stock exchange would call IBM “ibm” and Bloomberg would call it “IBM US”. And those symbols themselves would change from time to time.
A very informative summary on data integration on the industrial side!
On the academic side, I would just add that there are very rich theories and formalisms for data integration that pertain more to federation than to data movement/replication. Things like schema mappings, mediated schemas and query models can enable users who want to avoid standardization to query across federated databases and/or exchange data across sites. (Perhaps this relates more to your last bolded point.)
Hi Marie,
Heterogeneous federation has been worked on at least since a CCA (Computer Corporation of America) project in the early 1980s. But I’m not aware of it ever amounting to much in practice.
Or, if you’re talking relational-only, there was Mike Stonebraker’s first company post-Informix, Cohera. That didn’t work out too well in the main plan, although they had what seemed like a decent exit after a pivot.
Curt –
You have provided a good service to your readers with this article. Let me attempt to add value regarding one term not explicitly listed above: data virtualization.
In the data integration vendor community, the term data federation (which focuses on high-performance query across systems of similar or diverse data types) continues to be used.
However, the more inclusive term we all have migrated to is data virtualization. Beyond federation, this term encompasses the abstraction and loose-coupling notions often associated with SOA, and the hiding of underlying complexity often associated with other virtualizations.
I wrote a book about this topic (www.datavirtualizationblook.com). I’ll send you a copy.
One refinement…www.datavirtualizationbook.com