Data warehousing with paper clips and duct tape
An interesting part of my conversation with Dataupia’s CTO John O’Brien came when we talked about data warehousing in general. On the one hand, he endorsed the view that using Oracle probably isn’t a good idea for data warehouses larger than 10 terabytes, with SQL Server’s limit being well below that. On the other hand, he said he’d helped build 50-60 terabyte warehouses in Oracle years ago.
The point is that to build warehouses that big in Oracle or other traditional DBMS, you have to pull out a large bag of tricks. (John gave a figure of 1-2 dozen.) Areas include schemas, partitioning, specialized SQL, “very specialized file layouts”, and so on. Summing up, John said that using Oracle above 10 terabytes was very much like working with columnar database structures.
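To make one of those areas concrete, below is a minimal sketch of the kind of range partitioning that typically goes into such a build. The table and column names are invented for illustration, and the exact clauses vary by Oracle version:

```sql
-- Hedged sketch: range-partition a hypothetical fact table by month, so that a
-- query constrained to a date range only touches the relevant partitions
-- (partition pruning) instead of scanning the whole table. Names are invented.
CREATE TABLE sales_fact (
  sale_date  DATE          NOT NULL,
  store_id   NUMBER        NOT NULL,
  product_id NUMBER        NOT NULL,
  amount     NUMBER(12,2)
)
PARTITION BY RANGE (sale_date) (
  PARTITION p2008_01 VALUES LESS THAN (DATE '2008-02-01'),
  PARTITION p2008_02 VALUES LESS THAN (DATE '2008-03-01'),
  PARTITION p2008_03 VALUES LESS THAN (DATE '2008-04-01')
);
```

Partitioning is only one of the dozen-plus areas John listed; the others (specialized schemas, specialized SQL, file layouts) involve analogous per-system design work.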
Things are a lot simpler in the specialist world. Buy a data warehouse appliance, load it up in any halfway sensible way, and you'll probably have decent performance from the get-go. Tuning is still a really good idea, of course. But your struggles are to make the system run better, or to make a few annoying performance problems go away. You're not performing heroic acts just to get it running in the first place.
Comments
17 Responses to “Data warehousing with paper clips and duct tape”
Schemas and partitioning are special tricks, are they?
They seem like pretty standard features of the Oracle database that any data warehouse would employ, partitioning especially.
I’ve not built a 10Tb+ DW on Oracle…but I’m running a 6-7Tb one perfectly well now, and it’s likely to exceed 10Tb in a year or two…and we’ve no plans, or perceived needs, to switch to any “appliance” solutions to meet user requirements.
Your mileage, of course, may vary.
Jeff,
Is that raw/user data?
What kind of hardware do you use?
And is it more of the single-fact-table variety, or more some of this and some of that?
Thanks!
CAM
Curt,
I do have some trouble consolidating your two posts.
The earlier one appears to indicate that Dataupia is concentrating on very simple schemata and trying to be really cheap in doing so.
This post now compares against non-specialized DBMS, which traditionally run much more complex data.
Take DB2 at Visa for example:
http://www.ibm.com/software/info/television/html/N323242D54269J25.html
400TB of primary data, 2000 tables, …
I can buy a toaster, even a toaster oven, for a few dollars.
Yet an oven costs me substantially more.
They do entirely different things. The most interesting (yet IMHO obvious) part is: when the toaster tries to become more than a toaster it MUST become more complex. It ceases to be the very thing it so proudly claimed to be:
Simple and cheap.
In my opinion the same laws apply here.
You can’t have your cake and eat it too on this one.
Cheers
Serge
DB2 for LUW Development, Toronto
Serge,
Those are nice figures. Thanks for the link.
However, I don’t think the either/or distinction is quite as stark as you’re making it out to be. I used to have a rotisserie that was a lot like a toaster oven, but bigger and stronger — and I never found a need for a true oven. Of course, if I’d been a baker of complex cakes, the matter might have been a little different — but just as it was sufficient for what I cooked, I think it would have been sufficient for the vast majority of my grad student peers as well. Ultimately, I got rid of that rotisserie only because I moved out of my dorm and into a place with a conventional oven already built in.
Either a system can get the job done — with good speed and TCO — for a given workload, or it can’t. And it’s not obvious to me why upstarts such as Netezza or DATAllegro, as they mature their products, won’t be able to handle a large majority of the remaining workloads where Teradata and DB2 are still superior.
But then, it’s also not obvious to me why — as they face tougher competition and also have a chance to keep enhancing their own products — Teradata and DB2 have to continue to be more expensive than the upstarts.
Best,
CAM
Serge,
Indeed. I think this is how the market is designed to work.
Over time standard tasks get commoditized and thus cheaper and more efficient.
I saw you earlier blogged about EnterpriseDB, which appears somewhat related.
In my opinion the profit margins in software are so high for at least two reasons:
* It's not easy to build up sufficient weight (in both software maturity and market traction/trust) to overcome the status quo. A significant upfront cost is required to challenge the big players.
* There is an amazing amount of lock-in happening due to the dialectization of the APIs. As they say, the only problem with standards is that second 's'.
So perhaps EnterpriseDB with their "redwood mode" is a harbinger of the commoditization of the OLTP workload, while database appliances are ringing in the same era for the warehouse – with the obvious potential for overlapping, reinforcing wave fronts?
Meanwhile the traditional vendors will continue to swim upstream to preserve their margins.
Cheers
Serge
*lol* Hmm… note to self… need more coffee 😉
I think you are assuming that you don't need those tricks with vendors like Teradata. I come from a major Teradata shop, and you need some of the same tricks and some new ones that you do not need in an Oracle environment. Even for some small databases (5TB or so) where the queries are very complex, we had to resort to partitioning, join indexes, and summary tables in Teradata. Some of the loads (such as loading the same tables from multiple sessions reading off a staging area using SQL) have to be staggered because of Teradata's locking architecture. The point is that you have to use a bag of tricks regardless of what platform you use. I speak from real-world experience running large warehouses in Oracle/DB2 as well as Teradata. I am just beginning with the appliance vendors, but from what I can tell they are really good at scanning tables very fast – if you throw a large number of users plus a mixed concurrent workload at them, that is where they become less effective. That will definitely change over time and they WILL become more compelling. Teradata (and Oracle and DB2 to a certain extent) are way too expensive. The appliance vendors can do 80% of what these products can do at a much better price-performance ratio.
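To illustrate the sort of Teradata-side trick meant here, below is a minimal sketch of an aggregate join index (essentially a pre-built summary the optimizer can rewrite queries against). The table and column names are invented for illustration, and the exact options depend on your release:

```sql
-- Hedged sketch: a Teradata aggregate join index acting as a summary table the
-- optimizer can substitute for the base table. Names are illustrative only.
CREATE JOIN INDEX sales_daily_ji AS
SELECT store_id,
       sale_date,
       SUM(amount) AS total_amount
FROM   sales_fact
GROUP BY store_id, sale_date;
```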
Curt
Fair do’s…I’m quoting database size from the Oracle Enterprise Manager front screen which currently shows 6.5Tb…but yes, that includes indexes and temp and scratch and other “non data”…raw data…about 4Tb I think…and growing at 1.5Tb (data) a year.
Hardware is a 32 way HP RP8420 box with 128Gb RAM and a HDS USP100 SAN for storage. The box is unstressed, although the same cannot be said of the IO subsystem.
System has five fact tables.
We do have some performance issues…but they are mainly at the IO level…our system is not well balanced and is not providing what the theoretical hardware limits suggest it should…so we’re investigating things like that…but **generally** the thing runs well, performs well and answers lots of new business questions that couldn’t be answered with the previous MI systems…it’s providing a ROI.
I think I took a little umbrage at your suggestion that things like partitioning are "tricks". It's a feature, not a trick – if using a feature like this is a trick then I'm Penn and Teller! Anybody who tries to build any MI environment with time series data (e.g. a warehouse) needs their head read if they choose to do it without partitioning – Tim Gorman wrote an excellent article about this called "Scaling to infinity".
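One of the key techniques in that territory is partition exchange loading: load a standalone staging table, then swap it into the partitioned fact table as a near-instant metadata operation rather than a physical copy. A rough sketch with invented names (whether you carry indexes across or skip validation depends on your setup):

```sql
-- Hedged sketch of partition exchange loading. sales_stage is a hypothetical
-- staging table with the same column layout as one partition of the
-- hypothetical sales_fact table; the exchange swaps segment pointers instead
-- of moving rows.
ALTER TABLE sales_fact
  EXCHANGE PARTITION p2008_03 WITH TABLE sales_stage
  INCLUDING INDEXES
  WITHOUT VALIDATION;
```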
I don't know you, but you sound like a pro-Teradata person and an anti-Oracle one. I know nothing about Teradata and I'm sure it's good at what it does…but I'm also fairly sure that the biggest problems in setting up a warehouse are to do with how you architect it – hardware, OS, filesystem, logical database design etc…get those right and I don't see why Oracle can't succeed with 10Tb databases…I'm sure there is an element of marketing in it, but the Winter Corporation results for 2005 (http://www.oracle.com/corporate/press/2005_sep/091305_wintertopten_finalsite.html) identify a 100Tb Oracle database back then…and things have moved on somewhat since.
Serge,
I think you’re thinking in the right direction.
For many years, software vendors had relatively little in the way of economies of scale in software development, to an extent that would be surprising to anybody who hadn’t either worked in the area or, say, read “The Mythical Man-Month”.
But at a certain point the economies of scale became very real, less as a way to gain advantage than as a way to hold advantage gained in the kinds of ways that Geoffrey Moore made a career out of explaining.
The replacement buzz-theory for Crossing the Chasm is The Innovator's Dilemma, and I think the high-end software vendors are running straight into that kind of disruption. That doesn't mean they won't win. They're flexible enough to make acquisitions, and I think the economies of scale in selling SaaS to mid-range enterprises are still UNDER-appreciated. But I think the odds of "disruptive" TECHNOLOGIES winning are quite strong, even if ultimately that victory takes the form of a high-priced buyout by a market-leading vendor.
CAM
Jeff,
Back in the 1990s I wrote my first vendor sponsored piece ever. Sybase, Ingres, et al. said “Why should we sponsor this? You love Oracle and hate us!” Oracle said “Why should we sponsor this? You love our competitors and hate us!” Thankfully, most of them sponsored anyway …
In the recent past, I’ve been criticized — sometimes as gently as you just did, sometimes more roughly — for, among other things, being:
Anti-Oracle
Anti-Teradata
Pro-Teradata
Anti-Netezza
Pro-Netezza
Anti-relational
Pro-relational
Anti-MySQL
Pro-MySQL
And that’s just off the top of my head. 🙂
By the way, in political discussions I am commonly criticized for being too liberal and too conservative. In my personal life I am criticized for being a rebel and a fuddy-duddy.
Where you stand depends upon where you sit. I’m used to it. 🙂
Anyhow, thanks for sharing those figures. You’re right up in the range where I think Oracle still does a perfectly decent job for lots of folks. But if you had to double the size of your warehouse in a year, would you truly feel comfortable about staying with Oracle? If you had to quadruple it, would you look actively for other alternatives?
Best,
CAM
Hi Curt
I'll openly admit, I'm an Oracle-biased person…not because I've compared and contrasted various products, but purely because I've been working with Oracle technologies since the days of Oracle 5 and Forms 2.0 and that's longer than I'd care to remember! I'm sure the other vendors in the RDBMS space are all good at what they do…but by and large, I've generally been very happy with what Oracle has offered me to deliver projects over the years.
I probably get a bit annoyed and I’m a little too quick to jump off the deep end when I hear, what I perceive to be, anybody “dissing” Oracle…so forgive me if I appeared to be quick to get on the defensive.
With regard to using Oracle as the database engine for our warehouse, I still don’t feel that we will be changing that anytime soon…the problems we have are not inconsequential…but they are definitely not the database engine itself…more like general IO capabilities and system design/procedures really…both of which are big factors in getting a warehouse to work on any database engine.
It’s interesting, the boys who parachuted into this company and said “we need a warehouse” came from a big bank where they had access to a Teradata Warehouse so from day one we’ve had a battle as to the database engine to be used. I’m still of the opinion that until we prove Oracle isn’t capable, then we should leave it as is…they are an Oracle shop, with zero Teradata skills, so converting would be a costly and time consuming process and I’m just not convinced there would be any tangible benefits. They seem to have this fantasy that they can just “convert” it to Teradata by installing the software and then all of a sudden their performance and functionality issues will disappear.
Time will tell…but if it does go Teradata, I’m not likely to be around to see the results as it’s just not my area of expertise…mind you, I should check out the rates for a Teradata contractor before I say that! 😉
Cheers
Jeff
Jeff,
That makes sense — but could you please say more about those “general” I/O problems? They may very well be exactly the kind of thing that MPP shared-nothing architectures are designed to circumvent.
Also — is it really the case that your users are getting all the data they want, as quickly as they want it?
Best,
CAM
Curt
IO issues revolve around a number of things.
1. The system isn't balanced…so, the SAN has ports which can deliver 1.6GBytes/Sec and the server has CPU capability to drive 3.2GBytes/Sec…but the number/type of HBAs on the back of the server can only handle 780MBytes/Sec, so it's the limiting factor and means all the other components are having an easy time of it. The SAN is also shared by other apps, uses RAID 5 (yeah I know…don't tell Mogens!) in 3d+1p format, and uses Veritas with Quick I/O…there are a lot of layers in the IO stack.
2. Somewhere there are some IO problems with even reaching this 780MBytes/Sec limit, because even on a clear machine (no other users to contend with), we still only manage to drive about 680MBytes/Sec via a big parallel full table scan of a large (80Gb) table…so something somewhere is wrong in the IO stack…and my point there would be that the same IO stack is going to sit under whichever database platform you care to use…so if it's not performing for one DBMS vendor, then it won't matter if you switch the vendor…your IO still sucks.
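To put rough numbers on that (a back-of-envelope sketch using only the figures quoted above, written as a query you could run against Oracle's DUAL):

```sql
-- Back-of-envelope check of the figures above: the HBAs (780 MBytes/Sec) are the
-- binding constraint, well below the SAN ports (1600) and the CPUs (3200), and
-- even that limit is not being reached. An 80Gb parallel scan at the observed
-- 680 MBytes/Sec takes roughly two minutes.
SELECT LEAST(3200, 1600, 780)  AS bottleneck_mbytes_per_sec,
       ROUND(80 * 1024 / 780)  AS scan_seconds_at_hba_limit,
       ROUND(80 * 1024 / 680)  AS scan_seconds_at_observed_rate
FROM dual;
```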
Are the users getting all the data they want…No. Are they getting lots more than they were on their previous systems…Yes.
Why don't they get what they want? Well, many reasons – some of which can be levelled at the performance / variability of performance, and some of which are more about their education level (they use SQL and are not as effective with it as they need to be), application/data model design, and contention with the existing batch load…which, if the IO stack was working better, would have finished before the users get online during the day.
It's not about one simple thing for us…it's many variables we're juggling. I've heard they are actually going to buy some new kit (typical management response to performance issues – throw hardware at the problem)…hopefully that kit will be built around a balanced, high-throughput IO stack…otherwise the problems which do exist for our system will remain.
Why is an MPP system going to be better at doing the IO?
I'm no MPP expert – I've never worked on one as it's not really an Oracle thing…you can get Oracle on MPP, I believe. My understanding of MPP is that each node in the cluster will have its own disks, and that means you need to manually partition your data across the nodes and onto the disks that are available to that node…so what's the difference between that and a good partitioning strategy on the SMP environment with the shared disk subsystem? You need to lay out your stuff in an appropriate fashion to ensure that the IO is balanced across all the spindles available…which you'd do whether it was SMP or MPP.
As I said, I’m no MPP expert so I don’t get why MPP would help…if anything it’s harder because the kit is more difficult to administer? and you have to manually partition the data across the nodes?…which may or may not be easy/possible and may require more work over time?
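For comparison with the question above: in a Teradata-style shared-nothing system the spread of rows across nodes is normally declared once, as the table's distribution key, and rows are then hash-distributed on it automatically rather than placed by hand. A minimal, illustrative sketch with invented names:

```sql
-- Hedged, illustrative sketch: in Teradata the PRIMARY INDEX is the distribution
-- key; rows are hash-distributed across the parallel units (AMPs) on it, so
-- there is no hand-placement of data onto particular nodes. Names are invented.
CREATE TABLE sales_fact (
  sale_date DATE          NOT NULL,
  store_id  INTEGER       NOT NULL,
  amount    DECIMAL(12,2)
)
PRIMARY INDEX (store_id);
```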
Cheers
Jeff
Jeff-
You have mentioned one of the top issues with Oracle data warehouses: under-configured I/O bandwidth. The MPP architecture itself is not immune to this problem, it’s just that the vendors that use it dictate the hardware configuration, much in the same way Henry Ford dictated that every Model T was black. If Teradata (or any other MPP vendor) let its customers choose the storage and how it was connected to the hosts (# of HBAs), they would have exactly the same problems.
I also agree there is no "bag of tricks" needed to use Oracle for data warehousing, at least no more than there is with any other vendor. It's all about having good design and applying the appropriate feature(s) to the problem(s). Like you, I'd be quite interested to see a list of "tricks" that is "1-2 dozen" long. My guess is these "tricks" are fundamentals of data warehouse design.
Try this one for a tongue in cheek list:
http://structureddata.org/2008/04/28/top-ways-how-not-to-scale-your-data-warehouse/
…of course, now that I have more than 30 seconds to respond to this, I realise that the comment I was responding to was from Greg Rahn and not Curt. Greg being the author of the post I then referred to – doh!
Looks like Greg got bored of waiting for the list of “tricks” and made his own post on it shortly after…great stuff!
Cheers
Jeff
[…] Author: Curt Monash. Original publication date: 2008-03-14. Translation: Oleg Kuzmenko. Source: Curt Monash's blog […]