Best practices for analytic DBMS POCs
When you are selecting an analytic DBMS or appliance, most of the evaluation boils down to two questions:
- How quickly and cost-effectively does it execute SQL?
- What analytic functionality, SQL or otherwise, does it do a good job of executing?
And so, in undertaking such a selection, you need to start by addressing three issues:
- What does “speed” mean to you?
- What does “cost” mean to you?
- What analytic functionality do you need anyway?
Key elements of cost* include:
- Software license and maintenance
- Hardware purchase cost, maintenance, electric power, and computer room burden
- Database and system administration
- (For some use cases) Programming
*Assuming a classical in-house IT shop, where products are typically bought rather than leased/rented. With outsourced and/or monthly-fee structures, the details change but the principles remain the same.
Most of that can be evaluated pretty well via a spreadsheet, although things can get a bit tricky when you get to people costs, which are a large fraction of the whole. In particular, different analytic DBMS product suites have great, high-performance support for different (and often rapidly growing) sets of functionality – basic and advanced SQL, statistics, and more. Figuring out which ones will be best for your programmers, and how significant the differences are — well, that’s a lot like any other programming language evaluation, and those are rarely neat or clean-cut.
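To make the spreadsheet point concrete, here's a minimal back-of-the-envelope sketch in Python. Every figure, category, and staffing estimate below is an invented placeholder, not a real quote; the structure is what matters.

```python
# Minimal sketch of a spreadsheet-style cost comparison for an analytic DBMS.
# All figures are hypothetical placeholders -- substitute your own quotes and estimates.

YEARS = 3  # evaluation horizon

def total_cost(license_fee, annual_maintenance, hardware, annual_power_and_space,
               annual_admin_fte, fte_cost, annual_programming_fte=0.0):
    """Sum up-front and recurring costs over the evaluation horizon."""
    up_front = license_fee + hardware
    recurring = YEARS * (annual_maintenance
                         + annual_power_and_space
                         + (annual_admin_fte + annual_programming_fte) * fte_cost)
    return up_front + recurring

# Hypothetical vendor inputs (currency units are arbitrary):
vendor_a = total_cost(license_fee=500_000, annual_maintenance=100_000,
                      hardware=300_000, annual_power_and_space=40_000,
                      annual_admin_fte=1.0, fte_cost=150_000)
vendor_b = total_cost(license_fee=200_000, annual_maintenance=40_000,
                      hardware=450_000, annual_power_and_space=60_000,
                      annual_admin_fte=2.0, fte_cost=150_000,
                      annual_programming_fte=0.5)

print(f"Vendor A {YEARS}-year cost: {vendor_a:,.0f}")
print(f"Vendor B {YEARS}-year cost: {vendor_b:,.0f}")
```

The point isn't the arithmetic; it's that people costs show up as recurring line items that can easily swamp the license fee.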
But when it comes to evaluating speed, there’s no substitute for a well-designed proof of concept (POC). Many analytic DBMS and appliance vendors are happy to let you do a POC, on your own premises (or remotely if you prefer), under your control, at no cost to you. And that’s great. It is crucial that a POC be run either by you, by a consultant* answerable to you, or – if you decide the vendor must run it for you – at least with you watching every step of the way and knowing exactly what is being done. Appliance vendors do find it cheaper to run POCs on their own premises, so a certain reluctance to ship you a box is understandable. But make no compromises about the transparency of a POC, or about your control of exactly what it is that gets tested.
*Since I sell consulting services for users evaluating analytic DBMS, I naturally am biased to think that consultants can be very useful in the process. 🙂 But whether you should use them a little (sanity check), a medium amount (work with you through the process), or heavily (actually drive the process for you and/or execute the POCs) is very dependent upon your specific situation.
So far as I’ve been able to tell:
- Netezza loves to ship boxes to prospects for POCs, and have them set up the boxes and do POCs themselves. That’s a big reason why Netezza wants to call attention to this subject.
- Oracle has generally been pretty reluctant to ship Exadata boxes out for POCs. That’s the other reason Netezza wants to call attention to the issue. 🙂
- Open source vendors make it easy for you to download and test at least their community editions.
- Vertica makes it pretty easy for you to test its software too (download or cloud).
- ParAccel has generally insisted on running POCs itself, although it will do so on your premises if you insist.
- Teradata naturally tries to do POCs on its own premises, but doesn’t insist too hard. (Edit: Randy Lea of Teradata says that Teradata is now doing over half its POCs onsite.)
Most of the criticisms I’ve heard of vendors’ POC practices have been directed at Oracle or ParAccel.
For most POCs, a good conceptual template is to form and then test a hypothesis to the effect of:
- For a given technology product assemblage (brand of DBMS, number of nodes, etc.), and
- For a given level of human effort (e.g., administrative effort), you can
- Run a given workload, with
- Satisfactory and satisfactorily consistent response times
Sometimes absolute throughput and price/performance are important secondary considerations; sometimes they’re less germane. But either way, it’s almost always right to focus primarily on the questions of “What do I want this system to do?” and “What do I think we’re going to have to invest in it?” By way of contrast, it’s often misleading to focus too much on questions like “What’s the one number that best describes the performance of this system?” — even if you customize that calculation for your environment – or, even worse, “How much speed-up can I get on my single worst Query from Hell?”
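To pin down "satisfactory and satisfactorily consistent response times" a bit, here's a small Python sketch that judges a POC run by its latency distribution rather than by one headline number. The median and 95th-percentile targets are invented for illustration; take yours from your real SLAs.

```python
# Sketch: judge a POC run by its response-time distribution, not a single number.
# Targets below are illustrative placeholders -- take yours from your real SLAs.
from math import ceil
from statistics import median

def percentile(samples, pct):
    """Nearest-rank percentile of a list of response times."""
    ordered = sorted(samples)
    return ordered[ceil(pct / 100 * len(ordered)) - 1]

def run_is_satisfactory(response_times_sec, median_target=5.0, p95_target=30.0):
    """Pass only if both typical and tail latencies meet their targets."""
    return (median(response_times_sec) <= median_target
            and percentile(response_times_sec, 95) <= p95_target)

# Hypothetical samples from a concurrent-workload POC run (seconds per query):
samples = [2.1, 3.4, 2.8, 4.9, 3.1, 27.5, 3.3, 2.7, 41.0, 3.0]
print(run_is_satisfactory(samples))  # False: the median is fine, but the tail isn't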
The fundamental rule of POC construction is: Model your entire use case as best you can. That means you need to consider, at a minimum (a sketch follows the list):
- Your whole concurrent query, other analytic, and low-latency update workload (peak).
- Your whole query, analytic, load, backup, and maintenance workload (ongoing).
- Partial-failure scenarios.
- Your core SLAs (Service-Level Agreements).
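For concreteness, here's a minimal Python sketch of how such a use-case model might be written down before the POC starts, so every vendor gets tested against the same thing. All of the values and scenario names are hypothetical placeholders.

```python
# Sketch of a written-down POC workload model, so every vendor runs the same test.
# All values and scenarios are hypothetical placeholders -- fill in your own requirements.
from dataclasses import dataclass, field

@dataclass
class PocWorkloadModel:
    # Peak concurrent activity
    peak_concurrent_queries: int = 50
    peak_updates_per_sec: int = 200        # low-latency trickle updates
    # Ongoing operations that must coexist with queries
    daily_bulk_load_gb: int = 500
    backup_window_hours: float = 4.0
    maintenance_tasks: list = field(default_factory=lambda: [
        "statistics refresh", "index/projection rebuild"])
    # Partial-failure scenarios to rehearse during the POC
    failure_scenarios: list = field(default_factory=lambda: [
        "kill one worker node mid-query", "pull one disk during a bulk load"])
    # Core SLAs the hypothesis is tested against
    p95_query_seconds: float = 30.0
    data_freshness_minutes: int = 15

model = PocWorkloadModel()
print(model)
```

Writing the model down up front also makes it much harder for a vendor to quietly narrow the test to the parts of the workload their product happens to handle well.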
Of course, that’s not as easy as it sounds. Presumably, the main reason you’re getting a new analytic DBMS is that you want to do new kinds of analysis. By the very nature of analytics, you won’t know what analytic operations are most useful until you try them out and see what their results are. On the other hand – if you haven’t done considerable thinking about how you’re going to use your new analytic database, how did you ever get funding for the project in the first place? 😉
Seriously, I could write multiple posts, each as long as this one (but more application-oriented), about how to upgrade your analytic capabilities (and which fool’s gold to avoid). But this has gotten pretty long already, so for now I’ll just stop here.
Note: My clients at Netezza asked me to write something short about POCs they could use as a kind of foreword to some collateral, where by “short” they meant single-paragraph or something like that. They’re great clients, so I said yes, under the condition I could also use it as a blog post. Except … this post didn’t turn out to be nearly as short as they envisioned. Oops. 🙂
Related links
- My February 2009 slide deck on how to select an analytic DBMS is in many parts still pretty current
Comments
Will POCs be a fair evaluation if performance parameters are tweaked for better performance only on selected queries? We do not see wide participation in TPC evaluations nowadays.
You offer some great advice in your POC best practices. Your “Model your entire use case as best you can” advice is spot on, but it’s where many customers fall short in their evaluations. This isn’t easy, as you say, but without dedicating the time and effort to do it, POCs typically don’t produce results meaningful enough to support a good decision. At Teradata we review the pros and cons of customer-site versus vendor-location benchmarks and let the customer decide; currently we’re doing more than half of our POCs on-site at customer locations. Due to the challenges of doing effective POCs, and the tricks some vendors play, we also encourage customers to get real-world customer references. POCs and customer references are two key areas in which customers should challenge their vendors, as hesitation by the vendor in either area typically means the vendor’s marketing hype falls far short of reality.
Randy Lea
Teradata
@Ramakrishna,
I’d say relying on TPC results is pretty much a “worst practice” for doing an evaluation.
As far as I know, software providers often play tricks to get great performance. Making POCs reliable is difficult due to limited investment.
@zedware,
Yes, which is why I’m reminding people to do “real” POCs.