ParAccel PADB technical notes
I posted last October about PADB (ParAccel Analytic DataBase), but held back on various topics since PADB 3.0 was still under NDA. By the time PADB 3.0 was released, I was on blogging hiatus. Let’s do a bit of ParAccel catch-up now.
One big part of PADB 3.0 was an analytics extensibility framework. If we match PADB against my recent analytic computing system checklist,
- ParAccel is proud of PADB’s coverage in analytics-oriented SQL standard capabilities.
- I’m not aware of any PADB SQL goodies that go beyond the ANSI standards.
- PADB has a pretty flexible framework for user-defined functions (UDFs). In particular, ParAccel asserts this framework is even better than MapReduce, because it lets you do more steps at once, although I have trouble convincing myself that the distinction matters in an important way. (A conceptual sketch of the "more steps at once" idea follows this list.)
- Anyhow — like Aster Data, ParAccel asserts that the same framework on which its DBMS is built has now been exposed to people wanting to write other kinds of analytic processes. (But Aster Data describes its framework as being pretty straight MapReduce.)
- All of PADB’s analytic process execution capabilities are subsumed in the UDF framework.
- PADB does not yet contain much in the way of fully parallelized analytic libraries. Exception: Like many of its competitors, ParAccel has a Fuzzy Logix partnership.
- ParAccel hasn’t focused yet on analytic development ease of use. (And that’s putting it mildly.)
- The only language now supported for PADB analytics is C++. ParAccel promises more language support, with (at least) Java and R coming in the summer.
- In line with its extreme focus on speed, ParAccel for now offers only in-process analytics execution.
- In a near-future release (just heading into QA now), ParAccel promises that PADB UDFs will be very flexible in terms of the kinds of memory structures they manage. However, if you want a structure to persist past the end of a query, you need to map it to a row architecture.
- ParAccel’s workload management is still primitive — just a short-query bias, rather than any kind of explicit prioritization. Hence, the question as to whether workload management extends to analytic process execution is fairly moot.
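
To make the "more steps at once" claim concrete, here is a minimal, self-contained C++ sketch of the multiphase idea. It is emphatically not ParAccel's actual UDF API (which I'm not reproducing here); it just shows how two aggregation passes that would otherwise be two MapReduce jobs can be chained inside a single function, so the caller never sees the intermediate result.

```cpp
// Conceptual illustration only -- not the PADB UDF API. The point: a single
// "table function" can hide several map/shuffle/reduce-style phases behind
// one call, so the consumer never sees the intermediate grouping.
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Phase 1: "map + reduce" -- tokenize input rows and count word frequencies.
std::map<std::string, int> word_counts(const std::vector<std::string>& rows) {
    std::map<std::string, int> counts;
    for (const auto& row : rows) {
        std::istringstream tokens(row);
        std::string word;
        while (tokens >> word) ++counts[word];
    }
    return counts;
}

// Phase 2: a second aggregation over phase 1's output -- how many distinct
// words occur with each frequency. In a two-job MapReduce pipeline this
// would be a separate job reading phase 1's materialized output.
std::map<int, int> frequency_of_frequencies(const std::map<std::string, int>& counts) {
    std::map<int, int> histogram;
    for (const auto& entry : counts) ++histogram[entry.second];
    return histogram;
}

// The "multiphase" function: one entry point that chains both phases.
// The consumer just calls this and gets final rows back; the intermediate
// word -> count table never leaves the function.
std::map<int, int> word_frequency_histogram(const std::vector<std::string>& rows) {
    return frequency_of_frequencies(word_counts(rows));
}

int main() {
    std::vector<std::string> rows = {"the quick brown fox", "the lazy dog", "the fox"};
    for (const auto& [freq, num_words] : word_frequency_histogram(rows)) {
        std::cout << num_words << " word(s) occur " << freq << " time(s)\n";
    }
    return 0;
}
```

In PADB's case the phases would presumably be parallelized across the cluster and run inside the database executor; the sketch only illustrates the packaging argument, not the performance one.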
In other news, ParAccel’s Bala Narasimhan wrote:
Historically, an analyst who wants to spin up a new data mart with all of this data will have to wait for a number of days for the data copy to be made available. Instead, if you deploy PADB with a SAN that has fast and efficient snapshot and cloning capabilities, you can spin up multi-TB data marts in seconds.
That turns out to be not quite as ridiculous as it sounds. The scenario is:
- You’re using storage-area network technology with a copy-on-write option.
- You use the SAN’s copy-on-write option to make a second virtual copy of the database in question (or of certain tables/files/blocks from it). (The copy-on-write idea is sketched after this list.)
- You point a separate instance of PADB at it, either on a separate cluster (“in seconds” — yeah, right) or else via virtualization (e.g. VMware — that sounds more plausible).
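
For what it's worth, here is a toy C++ illustration of the copy-on-write trick that makes "instant" multi-TB copies plausible. It sketches the general mechanism, not any particular SAN vendor's implementation: a clone shares every block with the original, and a block is physically duplicated only when one side writes to it.

```cpp
// Conceptual sketch of copy-on-write snapshots, not any vendor's SAN feature.
// A "clone" shares all data blocks with the original; only a block that is
// subsequently written gets physically copied.
#include <cstddef>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

struct Volume {
    // Each block is shared until somebody writes to it.
    std::vector<std::shared_ptr<std::string>> blocks;

    // Snapshot/clone: copy pointers only -- proportional to the number of
    // blocks, not the amount of data they hold.
    Volume clone() const { return Volume{blocks}; }

    // Copy-on-write: physically duplicate a block only when it is modified
    // while still shared with another volume.
    void write(std::size_t i, const std::string& data) {
        if (blocks[i].use_count() > 1)
            blocks[i] = std::make_shared<std::string>(*blocks[i]);
        *blocks[i] = data;
    }
};

int main() {
    Volume prod{{std::make_shared<std::string>("block0"),
                 std::make_shared<std::string>("block1")}};

    Volume mart = prod.clone();        // the "multi-TB copy": just pointers
    mart.write(0, "mart-local edit");  // only block 0 is physically duplicated

    std::cout << *prod.blocks[0] << " / " << *mart.blocks[0] << "\n";  // block0 / mart-local edit
    std::cout << *prod.blocks[1] << " / " << *mart.blocks[1] << "\n";  // block1 / block1
    return 0;
}
```

Since the cost of the clone is proportional to the number of block pointers rather than the data volume, "seconds" isn't crazy -- at least until the new mart starts rewriting lots of blocks.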
Hmm. I have no actual knowledge of this, but it sounds like a capability that EMC should also offer soon, given the historical Greenplum focus on data mart spin-out.
Comments
Curt,
Thank you for the write up on PADB and for pointing out our extreme focus on analytic speed. It is something we are indeed very proud of. 🙂
I wanted to further elaborate on some of the points you made.
NOTE: The extensible analytics framework in PADB allows one to write User Defined Scalar Functions, User Defined Aggregate Functions and User Defined Table Functions. References to UDF in the discussion below assume all three.
* PADB has very rich SQL support. We believe, and our customers have shown, that significantly rich analytics can be done quite elegantly and effectively in SQL. We enable this via our extreme focus on analytic performance. Combine that with a UDF framework that allows you to embed analytics in languages beyond SQL (though directly accessed through SQL) and we feel it is a true differentiation.
* The UDF framework in PADB enables one to write multiphase Table Functions. This means that if you had an analytic module that did an arbitrary number of MapReduce jobs, for example (by the way, it need not be restricted to the MapReduce framework), you wouldn’t need to write explicit functions for each of them. Instead, you could write a single Table Function in PADB that subsumes the entire set of analytic steps. This has important implications from an ease of use perspective. As a consumer of an analytic module I shouldn’t need to be aware that the analytic function has multiple steps in it or that it partitions data one way versus the other. I simply invoke it and I get my results back blazingly fast! 🙂 All the MapReduce artifacts (or Row and Partition artifacts as some others would call them) are hidden from me. This greatly enhances the user experience for the consumer of the analytic module.
* Our UDF framework is a first class citizen in the database. By implementing in this manner, the UDF inherits all the features in the database for both manageability and performance. Therefore, the Workload Management capabilities, for example, that we have also take into account any UDFs that run in the database. In line with our extreme focus on analytic speed, this also means our UDFs are inheriting all the performance related features we have incorporated in the product.