A question on MDX performance
An enterprise user wrote in with a question that boils down to:
What are reasonable MDX performance expectations?
MDX doesn’t come up in my life very much, and I don’t have much intuition about it. E.g., I don’t know whether one can slap an MDX-to-SQL converter on top of a fast analytic RDBMS and go to town. What’s more, I’m heading off on vacation and don’t feel like researching the matter myself in the immediate future. 🙂
So here’s the long form of the question. Any thoughts?
I have a general question on assessing the performance of an OLAP technology using a set of MDX queries. I would be interested to know if there are any benchmark MDX performance tests/results comparing different OLAP technologies (which may be based on different underlying DBMS’s if appropriate) on similar hardware setups, or even comparisons of complete appliance solutions. More generally, I want to determine what performance limits I could reasonably expect on what I think are fairly standard servers.
In my own work, I have set up a star schema model centered on a Fact table of 100 million rows (approx 60 columns), with dimensions ranging in cardinality from 5 to 10,000. In ad hoc analytics, is it expected that any query against such a dataset should return a result within a minute or two (i.e. before a user gets impatient), regardless of whether that query returns 100 cells or 50,000 cells (without relying on any aggregate table or caching mechanism)? Or is that level of performance only expected with a high end massively parallel software/hardware solution? The server specs I’m testing with are: 32-bit 4 core, 4GB RAM, 7.2k RPM SATA drive, running Windows Server 2003; 64-bit 8 core, 32GB RAM, 3 Gb/s SAS drive, running Windows Server 2003 (x64).
I realise that caching of query results and pre-aggregation mechanisms can significantly improve performance, but I’m coming from the viewpoint that in purely exploratory analytics, it is not possible to have all combinations of dimensions calculated in advance, in addition to being maintained.
Comments
I don’t think it’s possible to answer this question, really – it falls into the ‘how long is a piece of string?’ category. If we were talking about Microsoft Analysis Services 2008, then with those data volumes and that hardware I would be very confident that for a properly designed and tuned cube you’d get query times of a few seconds or less for most reasonable queries. I don’t know of any benchmarks that compare Microsoft Analysis Services with other competing OLAP platforms that support MDX (such as Mondrian), though.
I realize that the question is about benchmarking solutions supporting MDX.
Nevertheless, there is a possible confusion. MDX is a language. If you have performance or scalability problems, they are probably not due to MDX. MDX has some limitations, and I suppose it is possible to hang yourself with it… but if you use it properly for the type of problems it was designed for, then MDX, in itself, has nothing to do with the performance of your OLAP system.
All in all, I think it is an odd question.
If these guys were my clients, I might suggest translating the MDX queries into SQL and benchmarking those. However, they might push back by pointing out the huge amount of MDX programming they’ve already done, and the resulting lock-in.
As matters stand, however, I don’t know whether or not MDX is truly a requirement.
Mondrian from Pentaho will “translate” your MDX queries to SQL. However, I am an academic, so I do not know whether you can actually use Pentaho Mondrian for production. (It is fine for my students though.)
“but … it is not possible to have all combinations of dimensions calculated in advance, ”
This is not true – or at least not necessary. Pre-aggregation doesn’t have to be all or nothing: even partial pre-aggregation, along with columnar or other compression, can matter enormously performance-wise in an SSAS implementation.
But, if the inquirer really has something against pre-caching results, I would suggest looking at Gemini or Qlikview to get an idea of what’s possible. I understand both can handle 100M rows and use in-memory column/value compression to do fast aggregates, as they’re explicitly targeting exploratory analytics (i.e. the stage before you know what the good aggregates will be).
Both of these have clients which run fine on laptops as well. This is another argument that there are a lot of other choices that will determine your performance which have nothing to do with whether MDX is involved.
I think the really interesting question might be “why are you asking the question”? Because clearly the inquirer is associating MDX with performance and as Daniel points out, it’s a language so in a way he’s right – it has nothing to do with performance of the underlying platform. It’s kinda like asking how to benchmark C++ or C# – on what platform?
And if I understand correctly, MDX is converted to SQL in all cases, right? So then, depending on the interpreter, results would vary?
This is all very interesting – you never hear much about MDX (unless you follow Chris’ blog, obviously). I find its learning curve to be exceedingly high and wonder if many people use it “in the raw”.
Actually, rereading the question I think it actually makes more sense than the comments here (including mine) suggest. The questioner here is interested in “assessing the performance of an OLAP technology using a set of MDX queries”; I’m not sure they are confusing the query language with the underlying database technology at all.
What they seem to want is something like a TPC benchmark for OLAP: a dataset that can be loaded onto multiple OLAP platforms, and a set of MDX queries that can be run against that dataset on each platform so their performance can be compared directly. This seems like a reasonable requirement to me, but as I said in my original comment I don’t know of any such benchmarks.
Jerome, to pick up on some of your questions:
– No, MDX isn’t always converted to SQL. In a pure MOLAP scenario (like Microsoft Analysis Services in MOLAP mode) nothing gets converted to SQL. Even in ROLAP scenarios, while SQL queries are issued to get the raw data needed for an MDX query, it’s still likely that a lot of calculation and processing will happen inside the OLAP engine afterwards (see the sketch at the end of this comment).
– MDX is widely used in the Microsoft BI world, but given the number of other platforms that support it I am a bit mystified why you don’t hear so much about it in the wider world…
– …and this is probably because, as Jerome says, there is a steep learning curve. In fact for anyone that’s spent a lot of time thinking in SQL it can be particularly confusing. Which is a shame because I truly, honestly believe that MDX is an immensely powerful language and much better suited for expressing BI queries and calculations than SQL.
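To go back to the first point: here is a minimal MDX sketch of the kind of calculation that stays inside the OLAP engine even in a ROLAP setup (the cube, dimension and measure names are placeholders, not any real model):

    -- a ROLAP engine would fetch the raw Sales Amount values via SQL, but the
    -- Share of Parent ratio below is evaluated by the MDX engine afterwards
    WITH MEMBER [Measures].[Share of Parent] AS
        [Measures].[Sales Amount] /
        ( [Product].[Product Categories].CurrentMember.Parent, [Measures].[Sales Amount] )
    SELECT
        { [Measures].[Sales Amount], [Measures].[Share of Parent] } ON COLUMNS,
        [Product].[Product Categories].[Category].Members ON ROWS
    FROM [Sales]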
Thanks for the insight Chris. I think you’re obviously also right in your last sentence – hands down.
J.
Many thanks for all of the interesting responses to my question, which Curt kindly posted for me. Here are some comments:
– I have evaluated different OLAP technologies that each support MDX. Using the same set of MDX queries on the same dataset loaded into the different OLAP systems seemed like a reasonable way to compare performance. It also means that MOLAP and ROLAP systems can be compared side by side. So Chris is correct in his statement that I am evaluating the performance of the OLAP technology, rather than the performance of MDX itself — a TPC benchmark for OLAP is a good way to describe it.
– John suggests trying Gemini or QlikView. I have tried QlikView on the 32 GB RAM 64-bit system with the 100 million row dataset mentioned above. It ran out of memory trying to load this dataset.
– Chris states “with those data volumes and that hardware I would be very confident that for a properly designed and tuned cube you’d get query times of a few seconds or less for most reasonable queries.” Does design and tuning of the cube include the pre-calculation of aggregate tables (or some pre-loading of a cache)? Can I expect a response on the order of seconds for queries that the cube has not been prepared for (on the hardware mentioned)? One example of a query I have run is a simple crossjoin between two dimensions (50 members and 1,000 members), selecting one measure, thus returning 50,000 cells; a rough sketch is below. Is this a reasonable query for performance assessment?
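For concreteness, that query is roughly of this shape in MDX (the dimension, level and measure names here are stand-ins, not my actual model):

    -- 50 x 1,000 rows by one measure = 50,000 cells
    SELECT
        { [Measures].[My Measure] } ON COLUMNS,
        CROSSJOIN(
            [Dim A].[Attribute A].[Attribute A].Members,   -- ~50 members
            [Dim B].[Attribute B].[Attribute B].Members    -- ~1,000 members
        ) ON ROWS
    FROM [My Cube]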
Thanks again for all of your insight.
Tom,
Sorry if I misdirected you with Qlikview. I was listening to a podcast about Gemini [1] where one of the team members described snappy responses in Excel using a 100 million row test data set, and how it compressed down to 180 MB or so. I had assumed they used a pretty similar architecture, but maybe not.
The query you mention is pretty trivial in terms of MDX-to-SQL. In other words, you could implement your own star schema and write a simple query to bring those items back (which is what Mondrian automates).
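To make that concrete, the hand-written SQL equivalent against the star schema would be roughly the following (table and column names are invented for the sketch), which is more or less what a ROLAP engine like Mondrian generates for you:

    SELECT d_a.member_name,
           d_b.member_name,
           SUM(f.my_measure) AS my_measure
    FROM   sales_fact f
    JOIN   dim_a d_a ON f.dim_a_key = d_a.dim_a_key   -- the ~50-member dimension
    JOIN   dim_b d_b ON f.dim_b_key = d_b.dim_b_key   -- the ~1,000-member dimension
    GROUP BY d_a.member_name, d_b.member_name;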
This would not really test the underlying store (Oracle/MySQL/etc.), if that is your goal. It would probably serve to throw out horribly implemented MDX stores, but that’s about it.
I appreciate that you’re trying to simplify the problem to make it tractable. I think that this analysis is just not so amenable to a blog post or an email response.
I would take everyone’s collective unease as a sign that you might not get the meaningful results you require without digging deeper into aggregates and modeling, and I can certainly see why you’d want to avoid that.
Given how central caching is to BI performance, and how much variety there is to implementing this, I just don’t think a performance test that ignores platform-specific pre-aggregation strategies is useful.
You may as well do what Curt said and write SQL and benchmark the underlying RDBMS stores, at least for the MDX-fronted ROLAP stores.
[1] http://www.dotnetrocks.com/default.aspx?showNum=490
Tom,
To answer your question, tuning of the Analysis Services cube would include building some aggregations (but not aggregation tables – SSAS creates aggregations internally and is very good at working out how to use them) but not necessarily include warming the cache (which is a widely used tuning technique, but personally I always aim for fast performance on a cold cache).
The volumes you’re talking about are just about average for SQL Server Analysis Services cubes today, although you will need to have a reasonable amount of Analysis Services knowledge to get the best possible performance. Generating a set of test queries is a good technique to use, but I would say that a query that returns 50,000 cells is on the large side. In general I encourage users to run queries whose results can be displayed on one screen rather than do a ‘data dump’ style query and then try to manipulate the results in, say, Excel. Obviously the larger the amount of data returned, the longer the query will take to run, and I don’t think there’s any value in a user asking for large amounts of data in one big chunk, rather than in more digestible, smaller pieces.
“In my own work, I have set up a star schema model centered on a Fact table of 100 million rows (approx 60 columns), with dimensions ranging in cardinality from 5 to 10,000.”
OK, even at a generous 20 bytes/column (100 million rows × 60 columns × 20 bytes) we’re dealing with ~120GB of data here, i.e. not a lot.
“In ad hoc analytics, is it expected that any query against such a dataset should return a result within a minute or two (i.e. before a user gets impatient), regardless of whether that query returns 100 cells or 50,000 cells (without relying on any aggregate table or caching mechanism)?”
Against a DBMS it depends on the complexity of the query and the database concurrency – what else is running at the same time. Assuming you have all CPU and IO resources to yourself, and assuming a simple scan only or fact:dimension join query, the performance is largely (but not wholly) a product of the capability of the disk/IO sub-system, which dictates your database read/scan rate.
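To put rough numbers on that (the scan rate is an assumption, not a measurement): a single 7.2k RPM SATA drive might sustain something like 100 MB/s of sequential reads, so a full scan of the ~120GB above works out to 120 GB / 100 MB/s ≈ 1,200 seconds, i.e. about 20 minutes. That is why the IO sub-system, and anything that reduces the bytes scanned (compression, touching fewer columns, pre-built aggregates), dominates a ‘minute or two’ target.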
“Or is that level of performance only expected with a high end massively parallel software/hardware solution?”
I hope not, as with those data volumes I wouldn’t expect you to go down the MPP route.
“The server specs I’m testing with are: 32-bit 4 core, 4GB RAM, 7.2k RPM SATA drive, running Windows Server 2003; 64-bit 8 core, 32GB RAM, 3Gb/s SAS drive, running Windows Server 2003 (x64).”
What about the IO sub-system? What DBMS are you running?
“…it is not possible to have all combinations of dimensions calculated in advance, in addition to being maintained.”
I beg to differ. That’s exactly what Queryobject from CrossZ systems delivers. Full pre-calculation of all measures across all combinations of dimensions at all levels of the hierarchy, for any input dataset size.
Most (all other?) OLAP vendors consider this akin to ‘boiling the ocean’ and not achievable, so they don’t even try.
Without full pre-calculation, the problem is that there is scanning to do at query time to satisfy the queries for which the answer is not pre-built.
Once the user gets the egg-timer for 10-15 minutes, how can they distinguish between busy and broken?
There are tier-1 telcos in the US that have been using Queryobject in production for many years.
It’s SQL-compliant so you don’t have to learn MDX. Sorry Chris!
See: http://www.queryobject.com
@ Daniel Lemire. I’m using Mondrian in a commercial setting and it is quite adequate as a ROLAP engine. The key to getting high performance is to make a few aggregate tables (a sketch of the idea is below). For example, once you aggregate and reduce row counts from millions to tens of thousands, the responses become ‘snappy’ rather than mildly irritating. This does mean that I had to go out and actually watch users to see where the bottlenecks are. Then I only had to build the aggregates that get used, since building the aggregates can be time consuming. We are running daily ETL, so daily rebuilds of the aggregates are practical.
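For what it’s worth, here is a minimal sketch of the idea, with invented table and column names; the details of how Mondrian is told to use an aggregate table (naming conventions or explicit schema definitions) are configuration I’m glossing over:

    -- collapse the product/day-level fact table to a category/month summary;
    -- queries at that grain then scan tens of thousands of rows instead of millions
    CREATE TABLE agg_sales_month_category AS
    SELECT t.month_key,
           p.category_key,
           SUM(f.sales_amount) AS sales_amount,
           COUNT(*)            AS fact_count
    FROM   sales_fact f
    JOIN   time_dim    t ON f.time_key    = t.time_key
    JOIN   product_dim p ON f.product_key = p.product_key
    GROUP BY t.month_key, p.category_key;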
I’ve worked with MSAS a bit and I’ve had engineers who worked for me tell me things that confirm my brief experiences as I got further from the technology.
Most people who get up the rather steep curve for MDX admire it for its elegance and prefer it to SQL on that basis. Nevertheless the experience is that it is generally not worth it to learn the language if you have appreciable experience in SQL.
If you have become something of an expert in tuning databases in the generations of products before the Greenplum and Vertica days, say with Essbase, Microstrategy, Teradata, Oracle Express or Sybase ASE, then you will have some familiarity with arcane 4G languages and dialects of SQL without many cross-product similarities. MDX was the language destined to solve all that, a sort of OLAP Esperanto. Unfortunately, the number of programmers deciding to hack SQL to perform against this and that sort of schema tended to dominate the few cross-platform specialists. In the end, it is my opinion that the lack of a predominating visualization stack obviated the need for widespread MDX adoption. As fat visual programming OLAP clients matured, enterprise customers began demanding thin clients. As full featured stacks became available, developers wanted LAMP, and so on. Now we are met on a battlefield testing whether PHP and MySQL will long endure, with people like me wondering how they ever got started considering the maturity and performance of products like Sybase and Essbase.
My direct experience is that flatly, for applications of any sophistication, on MSAS, stored procedures with T-SQL always performed better than their functional equivalents in MDX. In their aborted product stack, Performance Point, Microsoft engineers created a middle language PEL(?) that would supposedly choose which language was best suited for a task and then generate that code. All jokes about code-generators aside, my engineers always had to second guess PEL and ended up writing it all in T-SQL. So as a practical matter, there was no sense in learning MDX if you had already mastered T-SQL on the platform for which MDX was specifically designed. It would have been nice if MDX performed with speed commensurate with its elegance, but it simply didn’t. This was two years ago.
I know MDX lives on in Essbase but that Essbase expresses it differently than does MSAS. It does rather boil down to ‘it depends’, because language to language different platforms have different strengths, etc.
I expect that we will not learn the definitive answer to this question until there is a shakeout of DBs that survive the transition to cloud infrastructure. And in that regard I think a Greenplum or a Vertica or a Hadoop-based solution will win out. In other words it won’t come down to the semantic layer. The market never forced it to because there is no real OLAP standard (chicken or egg?).
In the end I say MDX iff you love MDX.
Reply to Paul Johnson – regarding pre-aggregation, you will find that an Essbase cube can be fully calc’d, but this can lead to data explosion and long calculation times. Optimization methods employed to reduce this phenomenon can cause erroneous results when run against sparse, asymmetrical dimensions, but otherwise I found Essbase to be a fantastic data mart solution, and very intuitive since the GUI bypasses the need to use MDX. Use a star or snowflake schema in an RDBMS to feed your Essbase cubes for best results.
Essbase actually does calculations erroneously?? Could you please say more about that? Thanks!