October 5, 2014

Spark vs. Tez, revisited

I’m on record as noting and agreeing with an industry near-consensus that Spark, rather than Tez, will be the replacement for Hadoop MapReduce. I presumed that Hortonworks, which is pushing Tez, disagreed. But Shaun Connolly of Hortonworks suggested a more nuanced view. Specifically, Shaun tweeted thoughts including:

Tez vs Spark = Apples vs Oranges.

Spark is general-purpose engine with elegant APIs for app devs creating modern data-driven apps, analytics, and ML algos.

Tez is a framework for expressing purpose-built YARN-based DAGs; its APIs are for ISVs & engine/tool builders who embed it

[For example], Hive embeds Tez to convert its SQL needs into purpose-built DAGs expressed optimally and leveraging YARN

That said, I haven’t yet had a chance to understand what advantages Tez might have over Spark in the use cases that Shaun relegates it to.
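
For readers unfamiliar with the term, here is a rough sketch of what "a query expressed as a DAG of stages" means in Shaun's Hive example. It is plain Python, not the Tez or Spark API. The stage names, the sample data, and the `run_dag` helper are all made up for illustration, and a real engine would distribute each stage across a cluster rather than run it in one process.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical stages for a query like:
#   SELECT region, SUM(amount) FROM orders JOIN customers ... GROUP BY region
# Each stage receives a dict of its upstream stages' outputs.
def scan_orders(_inputs):
    return [("alice", 30), ("bob", 20), ("alice", 5)]

def scan_customers(_inputs):
    return {"alice": "US", "bob": "DE"}

def join(inputs):
    regions = inputs["scan_customers"]
    return [(regions[name], amount) for name, amount in inputs["scan_orders"]]

def aggregate(inputs):
    totals = {}
    for region, amount in inputs["join"]:
        totals[region] = totals.get(region, 0) + amount
    return totals

STAGES = {
    "scan_orders": scan_orders,
    "scan_customers": scan_customers,
    "join": join,
    "aggregate": aggregate,
}

# Edges of the DAG: each stage maps to the set of stages it depends on.
DEPENDENCIES = {
    "scan_orders": set(),
    "scan_customers": set(),
    "join": {"scan_orders", "scan_customers"},
    "aggregate": {"join"},
}

def run_dag(stages, dependencies):
    """Run every stage after all of its upstream stages have finished."""
    results = {}
    for name in TopologicalSorter(dependencies).static_order():
        upstream = {dep: results[dep] for dep in dependencies[name]}
        results[name] = stages[name](upstream)
    return results

results = run_dag(STAGES, DEPENDENCIES)
print(results["aggregate"])  # {'US': 35, 'DE': 20}
```

The point of the sketch is only that the whole query is handed to the engine as one graph, instead of being forced into a chain of separate map and reduce jobs.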

Related link

The Twitter discussion with Shaun was a spin-out from my research around streaming for Hadoop.

Categories: Data warehousing, Databricks, Spark and BDAS, Hadoop, Hortonworks, Predictive modeling and advanced analytics


Comments

6 Responses to “Spark vs. Tez, revisited”

Sandy Ryza on October 5th, 2014 9:31 pm

I think a fairer comparison between Spark and Tez is probably Apples vs. Apple cores. Spark offers a superset of Tez’s functionality. Tez is an engine for distributing data-parallel computation over lots of computers. Spark is also this, but includes an elegant API on top, as well as a distributed memory abstraction that allows caching data across the cluster. They both relate to YARN in exactly the same way: they use it to deploy their bits and schedule work on the cluster.
Sandy Ryza on October 5th, 2014 9:33 pm

More details at http://qr.ae/JcBMm
David Gruzman on October 8th, 2014 4:02 am

I agree with Sandy that both Spark and Tez implement the same abstraction. I see two different usage cases, each with different success factors.

First, interactive use by knowledge workers. Here I think Spark will win, since its usability is fantastic. The in-memory abstraction is also nice here, since data volumes are frequently modest.

Second, use as an execution engine for SQL processing. Usability is not a factor here. The factors are robustness and shuffle performance… I cannot tell yet which is better from this perspective…

Regarding the distributed memory abstraction and the ability to cache data in memory between steps: I believe its applicability to big data is limited, because we cannot give big heaps to the JVM.
tariq on October 9th, 2014 1:39 pm

You have a typo in line#4. It should be suggested and not “suggesed”
Curt Monash on October 10th, 2014 4:20 am

Fixed — thanks!!

CAM
Notes on the Hortonworks IPO S-1 filing | DBMS 2 : DataBase Management System Services on December 7th, 2014 8:57 am

[…] Spark vs. Tez (October, 2014) […]
