February 18, 2008
ParAccel technical highlights
I recently caught up with ParAccel’s CTO Barry Zane and Marketing VP Kim Stanick for a long technical discussion, which they have graciously continued by email. It would be impolitic in the extreme to comment on what led up to that. Let’s just note that many things I’ve previously written about ParAccel are now inoperative, and go straight to the highlights.
- ParAccel sells a columnar, disk-centric data warehouse DBMS. Similar but not identical data structures are used in RAM cache and on disk. If there’s enough RAM, ParAccel’s system runs entirely in memory, except to the extent it obviously doesn’t (e.g., transaction persistence). In its TPC-H benchmarks and in some customer situations, ParAccel has run entirely in memory.
- ParAccel initially stores updates (whether transactional or bulk load) in cache. At transaction commit time, or when the cache fills, changed blocks are written to disk. Thus, as in most other DBMSs, a block must first be read into memory before it can be changed. (A sketch of this read-modify-write cycle appears after this list.)
- One ParAccel option is “Amigo” mode, in which the ParAccel database is continually synchronized with a SQL Server database, and queries are dynamically routed to one of the two systems. (There’s no true federation at this time.) Each resynchronization starts with a new SQL Server query, at a scheduled interval. The interval can be as low as 5 seconds or as high as 10-20 minutes; Barry thinks the overhead of the resulting updates is “noise level” if the interval is 30 seconds or higher. (A sketch of the polling pattern appears after this list.)
- Writing a row or reasonably small group of rows in a table with C columns requires C writes to disk, versus the 1 write required in a row-based system. (For a sufficiently large bulk load, of course, that wouldn’t be true. Consider the extreme example in which the whole database is loaded. Then the number of blocks written is the same no matter what architecture you have, except for the differences caused by compression, by any indexes you store on disk, and so on.)
- While single-record inserts are much slower than in row-based systems, Barry thinks the performance sacrifice is minor once rows are loaded a few thousand at a time or more; a back-of-the-envelope calculation after this list makes the arithmetic concrete. (I believe that in this and similar estimates he assumes the number of columns to be no more than a few dozen. While accurate for most applications, that might not be true for users who manipulate 1000+ column credit records.)
- ParAccel claims strong SQL Server compatibility, including running T-SQL stored procedures (but not other stored procedure languages, Postgres PL/pgSQL excepted). However, while the SQL execution itself is parallel, the rest of a stored procedure executes only on a single “leader” node.
- Oracle PL/SQL compatibility is a roadmap item.
- ParAccel supports C/C++ UDFs (User Defined Functions). Scalar UDFs execute in parallel. However, a UDF that invokes SQL runs only on the leader node – except, of course, for the SQL part itself. (A sketch of that distinction appears after this list.)
- In Amigo mode, ParAccel of course runs the same schema as the OLTP SQL Server instance it’s synchronizing with. Thus, they in no way make the Vertica assumption that all data warehouses have star or snowflake schemas. Nor do they replicate fact tables between nodes. Barry claims that ParAccel has done a great job on internode transport speeds, but the details are confidential.
- Even more confidential is the support for another claim of Barry’s. Just as columnar systems are slow when writing whole rows, they are also slow when retrieving them. But ParAccel has a deeply secret way of greatly reducing this penalty.
- Like Vertica, ParAccel supports limited materialized views, called “projections.” A major use of these is to store columns in multiple sort orders.
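To illustrate the write path described above, here is a minimal sketch of a block cache with read-before-write semantics that flushes changed blocks at commit time or when the cache fills. It is my own illustration of the general technique, not ParAccel code; the block layout, eviction policy, and all names are assumptions.

```python
# Minimal sketch of a read-modify-write block cache (illustrative only,
# not ParAccel's implementation; names and eviction policy are assumptions).

class BlockCache:
    def __init__(self, disk, capacity_blocks=4):
        self.disk = disk            # block_id -> list of column values
        self.capacity = capacity_blocks
        self.cache = {}             # blocks currently held in memory
        self.dirty = set()          # blocks changed since the last flush

    def _load(self, block_id):
        # As in most DBMSs, a block must be read into memory before it can be changed.
        if block_id not in self.cache:
            if len(self.cache) >= self.capacity:
                self.flush()            # cache is full: write changed blocks out
                self.cache.clear()      # crude eviction, purely for illustration
            self.cache[block_id] = list(self.disk.get(block_id, []))
        return self.cache[block_id]

    def append(self, block_id, values):
        self._load(block_id).extend(values)
        self.dirty.add(block_id)

    def commit(self):
        # Transaction commit also forces changed blocks to disk, for persistence.
        self.flush()

    def flush(self):
        for block_id in self.dirty:
            self.disk[block_id] = self.cache[block_id]
        self.dirty.clear()


disk = {}
cache = BlockCache(disk)
cache.append("orders.amount", [120, 75])
cache.commit()                      # the changed block is now persistent
```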
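The Amigo-mode resynchronization boils down to polling the OLTP side on a schedule and applying whatever changed. Here is a rough sketch of that pattern; the change-detection call and the apply step are placeholders of mine, not ParAccel’s actual mechanism.

```python
import time

def resync_loop(oltp, warehouse, interval_seconds=30):
    """Poll the OLTP system on a schedule and push changes to the warehouse.

    Illustrative only: changed_rows() and apply() are hypothetical hooks,
    not ParAccel's Amigo-mode API. The interval could be anywhere from
    5 seconds to 10-20 minutes.
    """
    last_sync = 0.0
    while True:
        changes = oltp.changed_rows(since=last_sync)   # new SQL Server query each pass
        if changes:
            warehouse.apply(changes)
        last_sync = time.time()
        time.sleep(interval_seconds)
```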
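To put rough numbers behind the write-amplification and batching points, here is a back-of-the-envelope calculation with made-up row, column, and block sizes. Nothing in it is ParAccel-specific; compression, indexes, and any vendor optimizations for small writes are ignored.

```python
import math

def blocks_written(rows, columns, row_bytes=400, block_bytes=32_768):
    """Crude count of block writes for loading `rows` rows.

    Assumes equal-width columns; ignores compression, indexes, and any
    vendor-specific optimizations for small writes.
    """
    col_bytes = row_bytes / columns
    row_store = math.ceil(rows * row_bytes / block_bytes)                 # 1 write for a small insert
    column_store = columns * math.ceil(rows * col_bytes / block_bytes)    # C writes for a small insert
    return row_store, column_store

for batch in (1, 10_000, 1_000_000):
    print(batch, blocks_written(batch, columns=50))
# 1         -> (1, 50)         : the C-writes-per-row penalty for single-record inserts
# 10,000    -> (123, 150)      : loading thousands of rows at a time mostly amortizes it
# 1,000,000 -> (12208, 12250)  : for a huge load the block counts are essentially the same
```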
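The stored-procedure and UDF points come down to the same distinction: per-row logic can run on every node at once, while procedural control flow runs in one place. Here is a conceptual sketch, with the cluster layout and function names invented for illustration; it is not ParAccel’s UDF API.

```python
from concurrent.futures import ThreadPoolExecutor

# Four "nodes", each holding its own slice of a column (layout invented for illustration).
NODE_SLICES = [list(range(i, 100, 4)) for i in range(4)]

def scalar_udf(x):
    # A pure function of its inputs: every node can apply it to its own slice in parallel.
    return x * x

def run_scalar_udf():
    with ThreadPoolExecutor(max_workers=len(NODE_SLICES)) as pool:
        parts = list(pool.map(lambda s: [scalar_udf(x) for x in s], NODE_SLICES))
    return [y for part in parts for y in part]

def procedural_udf(run_sql):
    # A UDF (or stored procedure body) that issues SQL acts as a little driver
    # program: the SQL it issues is executed in parallel by the cluster, but the
    # surrounding loop and arithmetic run only on the single "leader" node.
    total = 0
    for region in ("east", "west"):
        total += run_sql(f"SELECT ... WHERE region = '{region}'")
    return total

print(sum(run_scalar_udf()))            # the parallel, per-node part
print(procedural_udf(lambda sql: 42))   # leader-only driver, shown here with a stub SQL runner
```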
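Finally, the projection point is essentially the same data kept in more than one sort order, so that whichever predicate a query uses can be answered with a range scan. A toy illustration of the idea, with invented data; this is not ParAccel syntax or internals.

```python
import bisect

# Toy data: (order_date, amount) pairs.
rows = [("2008-02-01", 120), ("2008-01-15", 75), ("2008-02-10", 300), ("2008-01-20", 50)]

# Two "projections" of the same data, each kept in a different sort order.
by_date   = sorted(rows, key=lambda r: r[0])
by_amount = sorted(rows, key=lambda r: r[1])

def orders_on_or_after(date):
    # A date-range predicate uses the date-sorted copy: binary search, then scan.
    keys = [r[0] for r in by_date]
    return by_date[bisect.bisect_left(keys, date):]

def orders_of_at_least(amount):
    # An amount-range predicate uses the amount-sorted copy instead.
    keys = [r[1] for r in by_amount]
    return by_amount[bisect.bisect_left(keys, amount):]

print(orders_on_or_after("2008-02-01"))   # [("2008-02-01", 120), ("2008-02-10", 300)]
print(orders_of_at_least(100))            # [("2008-02-01", 120), ("2008-02-10", 300)]
```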
Categories: Columnar database management, Data warehousing, Emulation, transparency, portability, Microsoft and SQL*Server, ParAccel
Comments
5 Responses to “ParAccel technical highlights”
Curt,
I’m confused by the comment that, in Amigo mode, the schema is the same on the OLTP SQL Server and on ParAccel. That seems to be incredibly limiting. Although it might be true for some very simple reporting applications, no data warehouse I’ve ever seen uses the same schema as the OLTP source system. Also, what if there are many source systems (which is the typical case)?
If true, surely this would be unusable in the vast majority of real-world situations.
Stuart
Stuart,
And thus you’ve neatly explained why not EVERY ParAccel customer buys Amigo mode.
CAM
Curt,
I’ve read comments (yours and others’) about columnar databases being slow to retrieve whole rows, but I don’t hear anyone saying “how slow”. Can you shed any light on this? Are we talking tens or hundreds of milliseconds … or longer?
Doug,
If you’re retrieving N fields of a row, the base case is roughly N times the work a row-based system needs to retrieve the whole row, because you have to look in N different places.
Obviously, a big part of columnar DBMS design is figuring out ways to outperform the base case. But absent something like the TransRelational architecture (see the category for same) — or some other major deviation from a simple-minded columnar approach — it’s hard.
That’s for single rows. Once you’re retrieving lots of blocks of data, then the factor can be diminished or go away entirely, and be outweighed by columnar’s inherent advantages (you’re not retrieving the WHOLE row, and compression may work better).
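To put rough numbers on it, here is a toy block-read model (my own arithmetic, nothing ParAccel-specific):

```python
import math

VALUES_PER_BLOCK = 1000   # toy assumption: how many column values fit in one block

def block_reads(rows, columns_needed, total_columns):
    """Toy block-read model; ignores caching, prefetching, and compression."""
    row_store = math.ceil(rows * total_columns / VALUES_PER_BLOCK)      # whole rows come along
    column_store = columns_needed * math.ceil(rows / VALUES_PER_BLOCK)  # one area per column
    return row_store, column_store

print(block_reads(rows=1, columns_needed=30, total_columns=30))
# (1, 30): fetching one whole row costs ~N separate reads in the column store
print(block_reads(rows=1_000_000, columns_needed=3, total_columns=30))
# (30000, 3000): a big scan of a few columns flips the advantage the other way
```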
CAM
[…] Please do not rely on the parts of the post below that are about ParAccel. See our February 18 post about ParAccel instead. […]