Data warehousing
Analysis of issues in data warehousing, with extensive coverage of database management systems and data warehouse appliances that are optimized to query large volumes of data.
Oracle and IBM workload management
When last night’s Oracle/Exadata post got too long — and before I knew Oracle would request a different section be cut — I set aside my comments on Oracle’s workload management story to post separately. Elements of Oracle’s workload management story include:
- Oracle’s workload management product is called Oracle Database Resource Manager.
- Oracle Database Resource Manager has long managed CPU. For Exadata, Oracle added management of I/O as well; management of RAM is coming.
- Another aspect of Oracle workload management is “instance caging.” If you’re running multiple instances of Oracle on the same box – e.g. one with 128 cores and thus 256 threads – instance caging can keep an instance confined to a specific number of threads.
- Policies can let some classes of users get access to more threads in Oracle Parallel Query than others do.*
- Oracle offers a QoS (Quality of Service) layer, at least on Exadata, that tries to use Oracle’s workload management capabilities to enforce SLAs (Service Level Agreements). For example, if you want a certain query to always be answered in no more than 0.3 seconds, it tries to make that happen. However, this technology is new in the current Oracle release, and will be enhanced going forward.
*Recall that “degrees of parallelism” in Oracle Parallel Query can now be set automagically.
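To make the pieces above a little more concrete, here is a rough sketch of driving Database Resource Manager and instance caging from Python via cx_Oracle. It is illustrative only: the connection details, plan name, and consumer-group name are invented, and DBMS_RESOURCE_MANAGER parameter names vary somewhat across Oracle releases.

```python
import cx_Oracle

# Hypothetical DBA credentials; adjust for your environment.
conn = cx_Oracle.connect("system", "password", "dbhost/orcl")
cur = conn.cursor()

# A resource plan that caps the hypothetical REPORTING group at a share of CPU
# and a parallel-degree limit (the Parallel Query policy noted above).
cur.execute("""
BEGIN
  DBMS_RESOURCE_MANAGER.CREATE_PENDING_AREA;
  DBMS_RESOURCE_MANAGER.CREATE_PLAN(plan => 'DAYTIME_PLAN', comment => 'demo plan');
  DBMS_RESOURCE_MANAGER.CREATE_CONSUMER_GROUP(consumer_group => 'REPORTING', comment => 'demo group');
  DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
      plan => 'DAYTIME_PLAN', group_or_subplan => 'REPORTING', comment => 'reporting cap',
      mgmt_p1 => 25, parallel_degree_limit_p1 => 4);
  DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
      plan => 'DAYTIME_PLAN', group_or_subplan => 'OTHER_GROUPS', comment => 'everyone else',
      mgmt_p1 => 75);
  DBMS_RESOURCE_MANAGER.SUBMIT_PENDING_AREA;
END;""")

# Instance caging: confine this instance to 16 CPU threads and activate the plan.
cur.execute("ALTER SYSTEM SET cpu_count = 16")
cur.execute("ALTER SYSTEM SET resource_manager_plan = 'DAYTIME_PLAN'")
conn.close()
```

The two ALTER SYSTEM settings at the end are the essence of instance caging: cap the instance’s CPU threads, then let the resource plan divide up whatever fits under that cap.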
One reason I split out this discussion of workload management is that I also talked with IBM’s Tim Vincent yesterday, who added some insight to what I already wrote last August about DB2/InfoSphere Warehouse workload management. Specifically:
- DB2/InfoSphere Warehouse workload management has multiple ways to manage use of CPU resources.
- DB2/InfoSphere Warehouse workload management doesn’t directly manage consumption of I/O or RAM resources. However, it can influence usage of I/O or RAM by:
- Limiting the number of rows read or returned.
- Adjusting priorities as to which queries get to prefetch the most records.
- DB2/InfoSphere Warehouse workload management doesn’t allow you to directly set an SLA mandating query response time. However, if query response times exceed a target SLA, DB2/InfoSphere Warehouse workload management can cause a statistics dump that might help you tune your way out of the problem.
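For flavor, here is a hedged sketch of the kind of DDL involved: one threshold that caps rows returned (and thereby indirectly bounds I/O and RAM), and one that keeps a slow query running but collects detailed activity data once it exceeds a response-time target. The threshold names are invented, and the exact syntax should be checked against the DB2 documentation for your release; the snippet only assembles the statements, which a driver such as ibm_db would actually execute.

```python
# Illustrative DB2 workload-management DDL, assembled as strings.
# Syntax is approximate; verify against the DB2 docs for your release.
DDL = [
    # Cap how many rows a single activity may return.
    """CREATE THRESHOLD CAP_ROWS_RETURNED
         FOR DATABASE ACTIVITIES ENFORCEMENT DATABASE
         WHEN SQLROWSRETURNED > 1000000
         STOP EXECUTION""",
    # Past the response-time target: keep running, but dump detailed
    # activity data that can be used for tuning afterwards.
    """CREATE THRESHOLD SLOW_QUERY_DIAGNOSTICS
         FOR DATABASE ACTIVITIES ENFORCEMENT DATABASE
         WHEN ACTIVITYTOTALTIME > 5 MINUTES
         COLLECT ACTIVITY DATA WITH DETAILS
         CONTINUE""",
]

for stmt in DDL:
    print(stmt + ";\n")   # or pass each statement to ibm_db.exec_immediate()
```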
Categories: Data warehousing, IBM and DB2, Oracle, Workload management | Leave a Comment |
Oracle and Exadata: Business and technical notes
Last Friday I stopped by Oracle for my first conversation since January, 2010, in this case for a chat with Andy Mendelsohn, Mark Townsend, Tim Shetler, and George Lumpkin, covering Exadata and the Oracle DBMS. Key points included: Read more
In-memory, parallel, not-in-database SAS HPA does make sense after all
I talked with SAS about its new approach to parallel modeling. The two key points are:
- SAS no longer plans to go as far with in-database modeling as it previously intended.
- Rather, SAS plans to run in RAM on MPP DBMS appliances, exploiting MPI (Message Passing Interface).
The whole thing is called SAS HPA (High-Performance Analytics), in an obvious reference to HPC (High-Performance Computing). It will run initially on RAM-heavy appliances from Teradata and EMC Greenplum.
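For a sense of what “in RAM, exploiting MPI” can look like, here is a minimal sketch of data-parallel model fitting with mpi4py: every process keeps its shard of the data in memory, computes a local gradient, and the partial gradients are combined with an MPI allreduce each iteration. This is a conceptual stand-in, not SAS HPA code; the data and the least-squares model are invented for illustration.

```python
# Run with e.g.: mpirun -n 4 python hpa_sketch.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each process holds its own shard of the data entirely in RAM.
rng = np.random.default_rng(rank)
X = rng.normal(size=(100_000, 10))
y = X @ np.arange(10.0) + rng.normal(size=100_000)

w = np.zeros(10)
for _ in range(50):
    grad_local = X.T @ (X @ w - y) / len(y)               # gradient on this shard only
    grad = comm.allreduce(grad_local, op=MPI.SUM) / size  # the message-passing step
    w -= 0.1 * grad                                       # every process takes the same step

if rank == 0:
    print("fitted coefficients:", np.round(w, 2))
```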
A lot of what’s going on here is that SAS found it annoyingly difficult to parallelize modeling within the framework of a massively parallel DBMS such as Teradata. Notes on that aspect include:
- SAS wasn’t exploiting the capabilities of individual DBMS to their fullest; rather, it was looking for an approach that would work across multiple brands of DBMS. Thus, for example, the fact that Aster’s analytic platform architecture is more flexible or powerful than Teradata’s didn’t help much with making SAS run within the Aster nCluster database.
- Notwithstanding everything else, SAS did make a certain set of modeling procedures run in-database.
- SAS’ previous plans to run in-database modeling in Aster and/or Netezza DBMS may never come to fruition.
Netezza TwinFin i-Class overview
I have long complained about difficulties in discussing Netezza’s TwinFin i-Class analytic platform. But I’m ready now, and in the grand sweep of the product’s history I’m not even all that late. The Netezza i-Class timing story goes something like this:
- Netezza i-Class was first foreshadowed in February, 2010.
- Netezza i-Class customer testing started in October, 2010 or so. Netezza i-Class evidently has been shipped to 4-5 partners and a single-digit number of end-user organizations, spread across some usual-suspect industries (financial services, telecom, and so on).
- Netezza i-Class 1.0 general availability is still in the (near) future.
My advice to Netezza as to how it should describe TwinFin i-Class boils down to: Read more
Categories: Cloudera, Data warehouse appliances, Data warehousing, GIS and geospatial, Hadoop, IBM and DB2, MapReduce, Netezza, Parallelization, Predictive modeling and advanced analytics | 5 Comments |
Use cases for low-latency analytics
At various times I’ve noted the varying latency requirements of different analytic use cases, which can be as different as the speed of a turtle is from the speed of light. In particular, back when I wrote more about CEP (Complex Event Processing), I listed some applications for super-low-latency and not-so-low-latency CEP alike. Even better were some longish lists of “active data warehousing” use cases I got from Teradata in August, 2009, generally focused on interactive customer response (e.g. personalization, churn prevention, upsell, antifraud) or in some cases logistics.
In the slide deck for the Teradata 6680/solid-state drive announcement, however, Teradata went in a slightly different direction. In its list of “hot data use case examples”, Teradata suggested: Read more
Categories: Data warehousing, Teradata | 2 Comments |
Comments on EMC Greenplum
I am annoyed with my former friends at Greenplum, who took umbrage at a brief sentence I wrote in October, namely “eBay has thrown out Greenplum”. Their reaction included:
- EMC Greenplum no longer uses my services.
- EMC Greenplum no longer briefs me.
- EMC Greenplum reneged on a commitment to fund an effort in the area of privacy.
The last one really hurt, because in trusting them, I put in quite a bit of effort, and discussed their promise with quite a few other people.
Short-request and analytic processing
A few years ago, I suggested that database workloads could be divided into two kinds — transactional and analytic. The advent of non-transactional NoSQL has suggested that we need a replacement term for “transactional” or “OLTP”, but finding one has been a bit difficult. Numerous tries, including high-volume simple processing, online request processing, internet request processing, network request processing, short request processing, and rapid request processing, have turned out to be imperfect, as per discussion at each of those links. But then, no category name is ever perfect anyway. I’ve finally settled on short-request processing, largely because I think it does a good job of preserving the analytic-vs-bang-bang-not-analytic workload distinction.
The easy part of the distinction goes roughly like this:
- Anything transactional or “OLTP” is short-request.
- Anything “OLAP” is analytic.
- Updates of small amounts of data are probably short-request, be they transactional or not.
- Retrievals of one or a few records in the ordinary course of update-intensive processing are probably short-request.
- Queries that return or aggregate large amounts of data — even in intermediate result sets — are probably analytic.
- Queries that would take a long time to run on a badly-chosen or badly-configured DBMS are probably analytic (even if they run nice and fast on your actual system).
- Analytic processes that go beyond querying or simple arithmetic are — you guessed it! — analytic.
- Anything expressed in MDX is probably analytic.
- Driving a dashboard is usually analytic.
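To make those rules of thumb concrete, here is a toy sketch that encodes them as a Python function. The attribute names and the 100,000-row cutoff are arbitrary choices for illustration, not anything drawn from a particular product.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    is_transactional: bool = False          # OLTP-style update or lookup
    is_olap: bool = False                   # classic OLAP query
    is_mdx: bool = False                    # expressed in MDX
    drives_dashboard: bool = False
    beyond_simple_arithmetic: bool = False  # modeling, scoring, etc.
    rows_touched: int = 0                   # rows returned or aggregated, incl. intermediates

def classify(w: Workload) -> str:
    if w.is_transactional:
        return "short-request"
    if w.is_olap or w.is_mdx or w.drives_dashboard or w.beyond_simple_arithmetic:
        return "analytic"
    if w.rows_touched > 100_000:            # "large amounts of data", drawn arbitrarily
        return "analytic"
    if w.rows_touched <= 100:               # small updates or retrievals
        return "short-request"
    return "hard case"                      # the real-time analytics gray zone below

print(classify(Workload(is_transactional=True, rows_touched=3)))  # short-request
print(classify(Workload(rows_touched=50_000_000)))                # analytic
```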
Where the terminology gets more difficult is in a few areas of what one might call real-time or near-real-time analytics. My first takes are: Read more
Categories: Analytic technologies, Data warehousing, MySQL, NoSQL, OLTP | 34 Comments |
Hadapt (commercialized HadoopDB)
Hadapt, the company commercializing HadoopDB, is finally launching; its product is based on the HadoopDB project, albeit with code rewritten from scratch. As you may recall, the core idea of HadoopDB is to put a DBMS on every node, and use MapReduce to talk to the whole database. The idea is to get the same SQL/MapReduce integration as you get if you use Hive, but with much better performance* and perhaps somewhat better SQL functionality.** Advantages vs. a DBMS-based analytic platform that includes MapReduce — e.g. Aster Data — are less clear. Read more
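Here is a minimal conceptual sketch of that split-execution idea: each “node” runs its own local DBMS (SQLite stands in below), the map step pushes SQL down to every node-local database, and the reduce step combines the partial results. It illustrates the HadoopDB architecture only; none of this is Hadapt’s code or API.

```python
import sqlite3
from collections import Counter

def make_node(rows):
    """Create one node's local DBMS and load its shard of the data."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    db.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return db

nodes = [
    make_node([("east", 10.0), ("west", 5.0)]),
    make_node([("east", 7.0), ("south", 3.0)]),
]

def map_phase(db):
    # The SQL is evaluated inside the node-local DBMS, as in HadoopDB.
    return dict(db.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"))

def reduce_phase(partials):
    # Combine the per-node partial aggregates into a global result.
    total = Counter()
    for p in partials:
        total.update(p)
    return dict(total)

print(reduce_phase(map_phase(db) for db in nodes))
# {'east': 17.0, 'west': 5.0, 'south': 3.0}
```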
So how many columns can a single table have anyway?
I have a client who is hitting the 1,000-column-per-table limit in Oracle Standard Edition. As you might imagine, I’m encouraging them to consider columnar alternatives. Be that as it may, just what ARE the table width limits in various analytic or general-purpose DBMS products?
By the way — the answer SHOULD be “effectively unlimited.” Like it or not,* there are a bunch of multi-thousand-column marketing-prospect-data tables out there.
*Relational purists may dislike the idea for one reason, privacy-concerned folks for quite another.
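One empirical way to answer that question is to binary-search the widest CREATE TABLE a given DBMS will accept. The sketch below uses SQLite as a stand-in (its compile-time default limit is 2,000 columns); to test another product, swap in its DB-API driver and its specific error type.

```python
import sqlite3

def table_fits(n_cols: int) -> bool:
    """Return True if a table with n_cols columns can be created."""
    conn = sqlite3.connect(":memory:")
    cols = ", ".join(f"c{i} INTEGER" for i in range(n_cols))
    try:
        conn.execute(f"CREATE TABLE wide ({cols})")
        return True
    except sqlite3.OperationalError:   # "too many columns" in SQLite's case
        return False
    finally:
        conn.close()

lo, hi = 1, 100_000
while lo < hi:                         # binary search for the widest table that works
    mid = (lo + hi + 1) // 2
    lo, hi = (mid, hi) if table_fits(mid) else (lo, mid - 1)
print("maximum columns per table:", lo)
```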
Categories: Data warehousing, Surveillance and privacy | 37 Comments |
Teradata, Aster Data, and Teradata/Aster
Teradata is acquiring Aster Data. Naturally, the deal is being presented with a Treaty of Tordesillas kind of positioning — Teradata does X, Aster Data does Y, and everybody looks forward to having X and Y in the same product portfolio. That said, my initial positioning and product strategy thoughts on the Teradata/Aster combination go something like this. Read more