April 6, 2011
So can logistic regression be parallelized or not?
A core point in SAS’ pitch for its new MPI (Message-Passing Interface) in-memory technology seems to be that logistic regression is really important, and that shared-nothing MPP doesn’t let you parallelize it. The Mahout/Hadoop folks also seem to despair of parallelizing logistic regression.
On the other hand, Aster Data said it had parallelized logistic regression a year ago. (Slides 6-7 from a mid-2010 Aster deck may be clearer.) I’m guessing Fuzzy Logix might make a similar claim, although I’m not really sure.
What gives?
Categories: Aster Data, Hadoop, Parallelization, Predictive modeling and advanced analytics, SAS Institute
Comments
22 Responses to “So can logistic regression be parallelized or not?”
The standard estimation technique is to transform the estimators into a linear estimate of the log of the odds:
https://secure.wikimedia.org/wikipedia/en/wiki/Logistic_regression#Formal_mathematical_specification
If ML Estimates can be determined by Iteratively Reweighted Least Squares, I don’t see why each WLS iteration couldn’t be parallelized.
Mahout/Hadoop’s choice of SGD instead of IRLS appears to be the problem there. SGD does allow for incremental updating, which could be important for some uses.
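To make the IRLS point concrete, here is a minimal sketch, assuming NumPy and a small synthetic data set made up for illustration. The names `partition_stats` and `irls_fit` are hypothetical, and the list comprehension merely stands in for work that separate nodes would do. Each partition computes its own small X^T W X matrix and score vector from its own rows, and only those small pieces need to be summed before the solve.

```python
import numpy as np

def partition_stats(X, y, beta):
    """One partition's share of an IRLS step: X^T W X and X^T (y - p)."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))   # fitted probabilities
    w = p * (1.0 - p)                        # IRLS weights
    return X.T @ (X * w[:, None]), X.T @ (y - p)

def irls_fit(partitions, n_features, max_iter=25, tol=1e-10):
    """Fit by IRLS/Newton steps, summing small per-partition matrices."""
    beta = np.zeros(n_features)
    for _ in range(max_iter):
        # Each call touches only its own rows; a cluster would run these on separate nodes.
        stats = [partition_stats(X, y, beta) for X, y in partitions]
        xtwx = sum(s[0] for s in stats)      # k x k matrix, cheap to ship around
        score = sum(s[1] for s in stats)     # length-k vector
        step = np.linalg.solve(xtwx, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Synthetic data, invented purely for this illustration.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(2000), rng.normal(size=(2000, 2))])
true_beta = np.array([0.5, -1.0, 2.0])
y = (rng.random(2000) < 1.0 / (1.0 + np.exp(-(X @ true_beta)))).astype(float)

beta_whole = irls_fit([(X, y)], n_features=3)
beta_split = irls_fit(list(zip(np.array_split(X, 4), np.array_split(y, 4))), n_features=3)
```

Up to floating-point rounding, `beta_split` and `beta_whole` should agree, since the sums over partitions are the same sums over all rows. The per-iteration communication is a k-by-k matrix and a k-vector per partition, regardless of how many rows each partition holds.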
Many “non-parallelizable” algorithms are parallelized by finding an approximation that for practical purposes is just as good. Mathematically they are not equivalent, and there may be corner cases not handled as well. Nevertheless, for many practical examples, nobody cares.
By analogy, I’ve always been a fan of piece-wise linear techniques, vs. more careful curve fits. By many technical measures they are coarse, inelegant, etc. Practically, they rock.
Curt,
Among the different algorithms and approximations for logistic regression, some definitely can be parallelized in a shared-nothing MPP system even while others are not suitable for parallelization there. In the Mahout case the algorithm used is stochastic gradient descent, which as they mention is inherently not suited to parallelization in a shared-nothing MPP system. However, other algorithms such as batch gradient descent and Newton’s method (also referred to as Newton-Raphson) are parallelizable.
As you mention, Aster Data’s logistic regression function is parallelized. We initially released a version last year based on the batch gradient descent method and have been busy expanding the algorithms available since then. Of note, our logistic regression implementation is designed not only for cases that fit in memory but also for cases that are larger than available memory.
–Jon
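As a rough illustration of why batch gradient descent splits cleanly across shared-nothing nodes, here is a minimal sketch, assuming NumPy and made-up synthetic data. It is not Aster Data's implementation; `partial_gradient` and `batch_gradient_descent` are hypothetical names, and the list comprehension stands in for per-node work. The full gradient is just a sum of per-partition gradients at the same coefficient vector, whereas an SGD update depends on the result of the previous update.

```python
import numpy as np

def partial_gradient(X, y, beta):
    """Gradient of the negative log-likelihood over one partition, plus its row count."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (p - y), len(y)

def batch_gradient_descent(partitions, n_features, learning_rate=0.5, iterations=500):
    """Plain batch gradient descent; every step uses all rows, one partition at a time."""
    beta = np.zeros(n_features)
    for _ in range(iterations):
        # All partitions see the same beta, so these calls are independent of each other.
        parts = [partial_gradient(X, y, beta) for X, y in partitions]
        total_grad = sum(g for g, _ in parts)
        total_rows = sum(n for _, n in parts)
        beta = beta - learning_rate * total_grad / total_rows
    return beta

# Synthetic data, invented purely for this illustration.
rng = np.random.default_rng(1)
features = np.column_stack([np.ones(1000), rng.normal(size=(1000, 2))])
coef = np.array([-0.5, 1.5, -1.0])
labels = (rng.random(1000) < 1.0 / (1.0 + np.exp(-(features @ coef)))).astype(float)

one_node = batch_gradient_descent([(features, labels)], n_features=3)
four_nodes = batch_gradient_descent(
    list(zip(np.array_split(features, 4), np.array_split(labels, 4))), n_features=3
)
```

The learning rate and iteration count here are arbitrary choices for the toy data. The point is only that `four_nodes` matches `one_node` up to rounding, because partitioning changes where the sums are computed, not what they add up to.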
Curt,
My first response was to ask “is there nothing that MPI can’t do?” but a quick search turned up this paper, where the authors use MapReduce to solve this problem as well:
( http://www.siam.org/proceedings/datamining/2009/dm09_107_singhs.pdf )
MPI has the advantage that shared-memory communicators can be used as well as network communicators, so problems of granularity-related efficiency can be better addressed.
The more interesting subject that you touched on is what we call “orthogonal parallelism”; this is where, instead of just splitting records up over the various CPUs in the shared-nothing cluster, large objects and computations that may relate to a single field can be orthogonally parallelised over the cluster as well.
For example: a large database contains many EHRs that include fields containing large images or RNA or DNA sequences. Although a conventional MPP system would split the records over the cluster, each large DNA object (for example) would still exist entirely on only 1 node.
A better solution would be to split the large objects up by storing these fields in parallel.
Additionally, orthogonally parallel computations such as the parallel hammer algorithm can be used to analyse these large DNA fields in parallel.
Version 1.1 of DeepCloud MPP will permit orthogonally parallel UDFs for this purpose.
A big part of logistic modeling can be done in parallel. In fact, it can be done using separate SAS sessions on a multi-core machine as well.
In fact, I used to run parallel logistic regressions in SAS System circa 2004 (by changing one or two variables and redoing proc logistic). Essentially, I was running two or three logistic regressions with one or two variables changed in order to finalize the model, estimate VIF, and so on.
Fitting the parameters, estimating deviation from actual, and scoring the model can all be parallelized, in my opinion.
Mathematics on a computer, when dealing with all but the most trivial computations, is approximation to begin with.
As long as the results remain inside the safety range required by the specific use, I really appreciate a parallel implementation that lets me scale out.
[…] Meanwhile, in another interview I heard about, SAS emphasized retailers. Indeed, that’s what spawned my recent post about logistic regression. […]