MarkLogic’s Hadoop connector
It’s time to circle back to a subject I skipped when I otherwise wrote about MarkLogic 5: MarkLogic’s new Hadoop connector.
Most of what’s confusing about the MarkLogic Hadoop connector lies in two pairs of options it presents you with:
- Hadoop can talk XQuery to MarkLogic. Alternatively, Hadoop can use a long-established simple(r) Java API for streaming documents into or out of a MarkLogic database.
- Hadoop can make requests to MarkLogic in MarkLogic’s normal mode of operation, addressing any node in the MarkLogic cluster, which then serves as the “head” node for the duration of that particular request. Alternatively, Hadoop can use a long-standing MarkLogic option to bypass that routing and talk directly to one specific MarkLogic node. (A configuration sketch showing where these choices surface follows this list.)
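To make the first option pair concrete, here is a minimal sketch of a read job wired up through the connector’s com.marklogic.mapreduce classes, with class and property names as I understand them from the connector’s documentation. The host, credentials, collection name, and toy mapper are illustrative assumptions, not a tested deployment:

```java
// Sketch of a Hadoop job that reads documents out of MarkLogic via the
// MarkLogic Connector for Hadoop. Hosts, credentials, and the collection
// name are hypothetical.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import com.marklogic.mapreduce.DocumentInputFormat;
import com.marklogic.mapreduce.DocumentURI;

public class MarkLogicReadJob {

    // Toy analytic step: emit each document's URI and its size in bytes.
    public static class SizeMapper
            extends Mapper<DocumentURI, Text, Text, IntWritable> {
        @Override
        protected void map(DocumentURI uri, Text doc, Context context)
                throws IOException, InterruptedException {
            context.write(new Text(uri.getUri()), new IntWritable(doc.getLength()));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Connection settings (hypothetical host and credentials).
        conf.set("mapreduce.marklogic.input.host", "ml-node1.example.com");
        conf.set("mapreduce.marklogic.input.port", "8000");
        conf.set("mapreduce.marklogic.input.username", "hadoop-user");
        conf.set("mapreduce.marklogic.input.password", "secret");

        // The first option pair surfaces here: in "basic" mode the connector
        // builds its own retrieval query from a document selector, while
        // "advanced" mode would instead take full custom XQuery via the
        // mapreduce.marklogic.input.query property.
        conf.set("mapreduce.marklogic.input.mode", "basic");
        conf.set("mapreduce.marklogic.input.documentselector",
                 "fn:collection('invoices')"); // illustrative selector

        Job job = Job.getInstance(conf, "marklogic-read");
        job.setJarByClass(MarkLogicReadJob.class);
        job.setInputFormatClass(DocumentInputFormat.class); // (DocumentURI, Text) pairs
        job.setMapperClass(SizeMapper.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0])); // HDFS output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

As I understand it, the second option pair mostly shows up under the hood: the connector can generate input splits that go straight to the host serving each forest, rather than funneling every request through a single “head” node.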
Otherwise, the whole thing is just what you would think:
- Hadoop can read from and write to MarkLogic, in parallel at both ends.
- If Hadoop is just writing to MarkLogic, there’s a good chance the process is properly called “ETL” (sketched in the load job after this list).
- If Hadoop is reading a lot from MarkLogic, there’s a good chance the process is properly called “batch analytics.”
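And here is the “ETL” direction promised above: a hedged sketch of a map-only load job that streams documents from HDFS into MarkLogic through the connector’s ContentOutputFormat, with every map task writing in parallel. Again, the hosts, credentials, and per-line URI scheme are assumptions of mine, not taken from a working deployment:

```java
// Sketch of the write ("ETL") direction: a map-only job that ingests text
// files from HDFS and loads them into MarkLogic in parallel. Hosts,
// credentials, and the URI scheme are illustrative.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

import com.marklogic.mapreduce.ContentOutputFormat;
import com.marklogic.mapreduce.DocumentURI;

public class MarkLogicLoadJob {

    // One MarkLogic document per input line; a real ETL job would parse and
    // transform each record here before writing it.
    public static class LoadMapper
            extends Mapper<LongWritable, Text, DocumentURI, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Hypothetical URI scheme keyed on the input file offset.
            context.write(new DocumentURI("/ingest/" + offset.get() + ".txt"), line);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.marklogic.output.host", "ml-node1.example.com"); // hypothetical
        conf.set("mapreduce.marklogic.output.port", "8000");
        conf.set("mapreduce.marklogic.output.username", "hadoop-user");
        conf.set("mapreduce.marklogic.output.password", "secret");
        conf.set("mapreduce.marklogic.output.content.type", "TEXT");

        Job job = Job.getInstance(conf, "marklogic-load");
        job.setJarByClass(MarkLogicLoadJob.class);
        job.setMapperClass(LoadMapper.class);
        job.setNumReduceTasks(0); // map-only: each map task streams writes in parallel
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(ContentOutputFormat.class);
        job.setOutputKeyClass(DocumentURI.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0])); // HDFS input directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```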
MarkLogic said that it wrote this Hadoop connector itself.
When I realized MarkLogic was claiming the ability to seamlessly integrate short-request and batch analytic processing, I asked about workload management. I gathered that:
- MarkLogic believes that MarkLogic 5 does a great job of granular workload monitoring.
- However, MarkLogic doesn’t have a strong workload management administrative interface. Rather, you may have to do workload management programmatically.
Overall, I think the MarkLogic Hadoop connector could prove pretty useful. The first question I ask somebody who wants to process relational data in Hadoop is “Why not just an analytic RDBMS?” But the natural use cases for MarkLogic are often ones in which you might as well do your analytics in Hadoop, including an insurance-industry example I recently encountered involving 4 billion Word/PDF/image documents, for which I favor MarkLogic over both MongoDB and straight Hadoop.
Comments
[…] I posted separately about the MarkLogic Hadoop connector. As for that Hadoop connector – stay tuned for a short follow-up post, as writing about it now […]
These connectors are usually used for data movement (migration) or data synchronization. The Hadoop connector will be very useful, although there are much better ways to do this through data-integration toolsets available from third parties.