MarkLogic’s Hadoop connector
It’s time to circle back to a subject I skipped when I otherwise wrote about MarkLogic 5: MarkLogic’s new Hadoop connector.
Most of what’s confusing about the MarkLogic Hadoop connector lies in two pairs of options it presents you with:
- Hadoop can talk XQuery to MarkLogic. Alternatively, Hadoop can use a long-established simple(r) Java API for streaming documents into or out of a MarkLogic database.
- Hadoop can make requests to MarkLogic in MarkLogic’s normal mode of operation, addressing any node in the MarkLogic cluster, which then serves as the “head” node for the duration of that particular request. Alternatively, Hadoop can use a long-standing MarkLogic option to bypass that routing and talk directly to one specific MarkLogic node. (A configuration sketch showing where these choices surface follows this list.)
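To make the first option pair concrete, here is a minimal sketch of a read job wired up through the connector’s com.marklogic.mapreduce classes, with class and property names as I understand them from the connector’s documentation. The host, credentials, collection name, and toy mapper are illustrative assumptions, not a tested deployment:

```java
// Sketch of a Hadoop job that reads documents out of MarkLogic via the
// MarkLogic Connector for Hadoop. Hosts, credentials, and the collection
// name are hypothetical.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import com.marklogic.mapreduce.DocumentInputFormat;
import com.marklogic.mapreduce.DocumentURI;

public class MarkLogicReadJob {

    // Toy analytic step: emit each document's URI and its size in bytes.
    public static class SizeMapper
            extends Mapper<DocumentURI, Text, Text, IntWritable> {
        @Override
        protected void map(DocumentURI uri, Text doc, Context context)
                throws IOException, InterruptedException {
            context.write(new Text(uri.getUri()), new IntWritable(doc.getLength()));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Connection settings (hypothetical host and credentials).
        conf.set("mapreduce.marklogic.input.host", "ml-node1.example.com");
        conf.set("mapreduce.marklogic.input.port", "8000");
        conf.set("mapreduce.marklogic.input.username", "hadoop-user");
        conf.set("mapreduce.marklogic.input.password", "secret");

        // The first option pair surfaces here: in "basic" mode the connector
        // builds its own retrieval query from a document selector, while
        // "advanced" mode would instead take full custom XQuery via the
        // mapreduce.marklogic.input.query property.
        conf.set("mapreduce.marklogic.input.mode", "basic");
        conf.set("mapreduce.marklogic.input.documentselector",
                 "fn:collection('invoices')"); // illustrative selector

        Job job = Job.getInstance(conf, "marklogic-read");
        job.setJarByClass(MarkLogicReadJob.class);
        job.setInputFormatClass(DocumentInputFormat.class); // (DocumentURI, Text) pairs
        job.setMapperClass(SizeMapper.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0])); // HDFS output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

As I understand it, the second option pair mostly shows up under the hood: the connector can generate input splits that go straight to the host serving each forest, rather than funneling every request through a single “head” node.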
Otherwise, the whole thing is just what you would think:
- Hadoop can read from and write to MarkLogic, in parallel at both ends.
- If Hadoop is just writing to MarkLogic, there’s a good chance the process is properly called “ETL” (sketched in the load job after this list).
- If Hadoop is reading a lot from MarkLogic, there’s a good chance the process is properly called “batch analytics.”
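And here is the “ETL” direction promised above: a hedged sketch of a map-only load job that streams documents from HDFS into MarkLogic through the connector’s ContentOutputFormat, with every map task writing in parallel. Again, the hosts, credentials, and per-line URI scheme are assumptions of mine, not taken from a working deployment:

```java
// Sketch of the write ("ETL") direction: a map-only job that ingests text
// files from HDFS and loads them into MarkLogic in parallel. Hosts,
// credentials, and the URI scheme are illustrative.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

import com.marklogic.mapreduce.ContentOutputFormat;
import com.marklogic.mapreduce.DocumentURI;

public class MarkLogicLoadJob {

    // One MarkLogic document per input line; a real ETL job would parse and
    // transform each record here before writing it.
    public static class LoadMapper
            extends Mapper<LongWritable, Text, DocumentURI, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Hypothetical URI scheme keyed on the input file offset.
            context.write(new DocumentURI("/ingest/" + offset.get() + ".txt"), line);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.marklogic.output.host", "ml-node1.example.com"); // hypothetical
        conf.set("mapreduce.marklogic.output.port", "8000");
        conf.set("mapreduce.marklogic.output.username", "hadoop-user");
        conf.set("mapreduce.marklogic.output.password", "secret");
        conf.set("mapreduce.marklogic.output.content.type", "TEXT");

        Job job = Job.getInstance(conf, "marklogic-load");
        job.setJarByClass(MarkLogicLoadJob.class);
        job.setMapperClass(LoadMapper.class);
        job.setNumReduceTasks(0); // map-only: each map task streams writes in parallel
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(ContentOutputFormat.class);
        job.setOutputKeyClass(DocumentURI.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0])); // HDFS input directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```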
MarkLogic said that it wrote this Hadoop connector itself.
When I realized MarkLogic was claiming the ability to seamlessly integrate short-request and batch analytic processing, I asked about workload management. I gathered that:
- MarkLogic believes that MarkLogic 5 does a great job of granular workload monitoring.
- However, MarkLogic doesn’t have a strong workload management administrative interface. Rather, you may have to do workload management programmatically.
Overall, I think the MarkLogic Hadoop connector could prove pretty useful. The first question I ask somebody who wants to process relational data in Hadoop is “Why not just an analytic RDBMS?” But the natural use cases for MarkLogic are often ones in which you might as well do your analytics in Hadoop, including an insurance-industry example I recently encountered involving 4 billion Word/PDF/image documents, for which I favor MarkLogic over both MongoDB and straight Hadoop.
Comments
[…] I posted separately about the MarkLogic Hadoop connector. As for that Hadoop connector – stay tuned for a short follow-up post, as writing about it now […]
These connectors are usually used for data movement (migration) or data synchronization. The Hadoop connector will be very useful, although there are much better ways to do this through data-integration toolsets available from third parties.