Spark vs. Tez, revisited
I’m on record as noting and agreeing with an industry near-consensus that Spark, rather than Tez, will be the replacement for Hadoop MapReduce. I presumed that Hortonworks, which is pushing Tez, disagreed. But Shaun Connolly of Hortonworks suggested a more nuanced view. Specifically, Shaun tweeted thoughts including:
Tez vs Spark = Apples vs Oranges.
Spark is general-purpose engine with elegant APIs for app devs creating modern data-driven apps, analytics, and ML algos.
Tez is a framework for expressing purpose-built YARN-based DAGs; its APIs are for ISVs & engine/tool builders who embed it
[For example], Hive embeds Tez to convert its SQL needs into purpose-built DAGs expressed optimally and leveraging YARN
That said, I haven’t yet had a chance to understand what advantages Tez might have over Spark in the use cases that Shaun relegates it to.
Related link
- The Twitter discussion with Shaun was a spin-out from my research around streaming for Hadoop.
Comments
6 Responses to “Spark vs. Tez, revisited”
I think a fairer comparison between Spark and Tez is probably Apples vs. Apple cores. Spark offers a superset of Tez’s functionality. Tez is an engine for distributing data-parallel computation over lots of computers. Spark is also this, but includes an elegant API on top, as well as a distributed memory abstraction that allows caching data across the cluster. They both relate to YARN in exactly the same way: they use it to deploy their bits and schedule work on the cluster.
More details at http://qr.ae/JcBMm
I agree with Sandy that both Spark and Tez implement the same abstraction. I see two different use cases, with different success factors.
The first is interactive use by knowledge workers. Here I think Spark will win, since its usability is fantastic. The in-memory abstraction is also nice here, since data volumes are frequently modest.
The second is as an execution engine for SQL processing. Usability is not a factor here; what matters is the robustness and performance of the shuffle. I cannot yet tell which is better from this perspective.
Regarding the distributed memory abstraction and the ability to cache data in memory between steps: I believe its applicability to big data is limited, because we cannot give the JVM big heaps.
You have a typo in line#4. It should be suggested and not “suggesed”
Fixed — thanks!!
CAM