April 6, 2011

So can logistic regression be parallelized or not?

A core point in SAS’ pitch for its new MPI (Message-Passing Interface) in-memory technology seems to be that logistic regression is really important, and that shared-nothing MPP doesn’t let you parallelize it. The Mahout/Hadoop folks also seem to despair of parallelizing logistic regression.

On the other hand, Aster Data said it had parallelized logistic regression a year ago. (Slides 6-7 from a mid-2010 Aster deck may be clearer.) I’m guessing Fuzzy Logix might make a similar claim, although I’m not really sure.

What gives?

Categories: Aster Data, Hadoop, Parallelization, Predictive modeling and advanced analytics, SAS Institute


Comments

22 Responses to “So can logistic regression be parallelized or not?”

John M. Wildenthal on April 6th, 2011 11:46 am

The standard estimation technique is to transform the model so that the log of the odds is a linear function of the predictors:

https://secure.wikimedia.org/wikipedia/en/wiki/Logistic_regression#Formal_mathematical_specification

If ML Estimates can be determined by Iteratively Reweighted Least Squares, I don’t see why each WLS iteration couldn’t be parallelized.
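To make the point about parallelizing each WLS iteration concrete, here is a minimal sketch in Python with NumPy. It assumes the rows are already split into partitions (a stand-in for the nodes of a shared-nothing cluster). The function names `partition_stats` and `irls` are made up for this illustration and do not come from any product discussed here. Each partition computes its own pieces of the weighted normal equations, and only those small matrices need to be summed centrally.

```python
import numpy as np

def partition_stats(X, y, beta):
    """Per-partition pieces of one IRLS step: X^T W X and X^T W z."""
    eta = X @ beta
    p = 1.0 / (1.0 + np.exp(-eta))
    w = p * (1.0 - p)                      # IRLS weights
    # Working response is z = eta + (y - p) / w, so W z = w * eta + (y - p).
    wz = w * eta + (y - p)
    return X.T @ (X * w[:, None]), X.T @ wz

def irls(partitions, n_features, n_iter=25, tol=1e-8):
    """Fit logistic regression by IRLS over a list of (X, y) partitions."""
    beta = np.zeros(n_features)
    for _ in range(n_iter):
        # "Map": each call touches one partition only and could run on its own node.
        stats = [partition_stats(X, y, beta) for X, y in partitions]
        # "Reduce": sum the small (k x k) and (k,) results.
        xtwx = sum(s[0] for s in stats)
        xtwz = sum(s[1] for s in stats)
        new_beta = np.linalg.solve(xtwx, xtwz)
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta
```

The list comprehension runs sequentially here; the point is only that the per-partition work is independent, so the amount of data exchanged per iteration depends on the number of features rather than the number of rows.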

Mahout/Hadoop’s choice of SGD instead of IRLS appears to be the problem there. SGD does allow for incremental updating, which could be important for some uses.

Mike Beckerle on April 7th, 2011 6:34 am

Many “non-parallelizable” algorithms are parallelized by finding an approximation that for practical purposes is just as good. Mathematically they are not equivalent, and there may be corner cases not handled as well, nevertheless, for many practical examples… nobody cares.

By analogy, I’ve always been a fan of piece-wise linear techniques, vs. more careful curve fits. By many technical measures they are coarse, inelegant, etc. Practically…. they rock.

Jon Bock on April 7th, 2011 2:51 pm

Curt,

Among the different algorithms and approximations for logistic regression, some definitely can be parallelized in a shared-nothing MPP system even while others are not suitable for parallelization there. In the Mahout case the algorithm used is stochastic gradient descent, which as they mention is inherently not suited to parallelization in a shared-nothing MPP system. However, other algorithms such as batch gradient descent and Newton’s method (also referred to as Newton-Raphson) are parallelizable.
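As a rough illustration of why batch gradient descent fits a shared-nothing layout, here is a minimal Python/NumPy sketch. It is not Aster Data's implementation. The names `partial_gradient` and `batch_gradient_descent`, the step size, and the iteration count are assumptions chosen for the example. The gradient of the log-loss is a sum over rows, so each partition can compute its share locally and only a vector of length k is combined per iteration.

```python
import numpy as np

def partial_gradient(X, y, beta):
    """Unnormalized log-loss gradient and row count for one partition."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (p - y), len(y)

def batch_gradient_descent(partitions, n_features, step=0.5, n_iter=500):
    """Full-batch gradient descent over a list of (X, y) partitions."""
    beta = np.zeros(n_features)
    for _ in range(n_iter):
        # Each partition's contribution is independent of the others.
        parts = [partial_gradient(X, y, beta) for X, y in partitions]
        grad = sum(g for g, _ in parts)
        n = sum(c for _, c in parts)
        beta = beta - step * grad / n   # step along the mean gradient
    return beta
```

By contrast, plain SGD updates the coefficients after every row, so each update depends on the previous one; that sequential dependence is what makes it awkward to spread across nodes.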

As you mention, Aster Data’s logistic regression function is parallelized. We initially released a version last year based on the batch gradient descent method and have been busy expanding the algorithms available since then. Of note is that our logistic regression implementation is designed not only for cases that fit in memory but also for cases that are larger than available memory.

–Jon

Randolph on April 7th, 2011 10:02 pm

Curt,
My first response was to ask “is there nothing that MPI can’t do?” but a quick search turned up this paper, where the authors use MapReduce to solve this problem as well:
(http://www.siam.org/proceedings/datamining/2009/dm09_107_singhs.pdf)

MPI has the advantage that shared-memory communicators can be used as well as network communicators, so efficiency problems related to granularity can be better addressed.

The more interesting subject that you touched on is what we call “orthogonal parallelism”. Instead of just splitting records up over the various CPUs in the shared-nothing cluster, large objects and computations that may relate to a single field can be orthogonally parallelised over the cluster as well.

For example: a large database contains many EHRs that include fields containing large images or RNA or DNA sequences. Although a conventional MPP system would split the records over the cluster, each large DNA object (for example) would still exist entirely on only one node.
A better solution would be to split the large objects up by storing these fields in parallel.

Additionally, orthogonally parallel computations such as the parallel hammer algorithm can be used to analyse these large DNA fields in parallel.

Version 1.1 of DeepCloud MPP will permit orthogonally parallel UDFs for this purpose.

Ajay Ohri on April 8th, 2011 1:48 am

A big part of logistic modeling can be done in parallel. In fact, it can be done using separate SAS sessions on a multi-core machine.
I used to run parallel logistic regressions in the SAS System circa 2004 (by changing one or two variables and rerunning proc logistic). Essentially, I was running two or three logistic regressions with one or two variables changed in order to finalize the model, estimating VIF, and so on.
Fitting the parameters, estimating deviation from actual, and scoring the model can all be parallelized, in my opinion.
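The kind of parallelism described in this comment, fitting several candidate models at once rather than splitting one fit, can be sketched in Python as follows. This is an illustration only, not SAS code: `fit_logistic` and `fit_variants` are hypothetical names, and a thread pool stands in for the separate sessions. NumPy can release the GIL inside its linear algebra routines, but separate processes or machines would be closer to what the comment describes.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Plain Newton-Raphson fit; returns (coefficients, log-likelihood)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = p * (1.0 - p)
        beta = beta + np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    loglik = np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    return beta, loglik

def fit_variants(X, y, variants, max_workers=3):
    """Fit one model per candidate column subset, concurrently.

    `variants` maps a model name to the list of column indices it uses.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(fit_logistic, X[:, cols], y)
                   for name, cols in variants.items()}
        return {name: f.result() for name, f in futures.items()}
```

A call such as `fit_variants(X, y, {"full": [0, 1, 2, 3], "no_x3": [0, 1, 2]})` returns the coefficients and log-likelihood for each variant, which can then be compared when finalizing the model.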
Marco Ullasci on April 8th, 2011 3:14 am

Mathematics on a computer, when dealing with all but the most trivial computations, is approximation to begin with.
As long as the results remain inside the safety range required by the specific use I really appreciate a parallel implementation that makes me able to scale out.

Application areas for SAS HPA | DBMS 2 : DataBase Management System Services on April 22nd, 2011 3:54 am

[…] Meanwhile, in another interview I heard about, SAS emphasized retailers. Indeed, that’s what spawned my recent post about logistic regression. […]
