March 13, 2011
So how many columns can a single table have anyway?
I have a client who is hitting a 1000 column-per-table limit in Oracle Standard Edition. As you might imagine, I’m encouraging them to consider columnar alternatives. Be that as it may, just what ARE the table width limits in various analytic or general-purpose DBMS products?
By the way — the answer SHOULD be “effectively unlimited.” Like it or not,* there are a bunch of multi-thousand-column marketing-prospect-data tables out there.
*Relational purists may dislike the idea for one reason, privacy-concerned folks for quite another.
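As an aside, one way to answer that question empirically for any particular DBMS is simply to generate ever-wider CREATE TABLE statements and see where they break. Below is a minimal sketch of such a probe. It uses SQLite only because it ships with Python (a default SQLite build caps tables at 2,000 columns); the probe itself ports to any DBMS driver.

```python
# Minimal probe of a DBMS's per-table column limit. SQLite (stdlib sqlite3)
# is used here only because it ships with Python; a default build's
# SQLITE_MAX_COLUMN is 2000.
import sqlite3

def max_columns(upper_bound=100_000):
    """Binary-search the widest CREATE TABLE the engine will accept."""
    conn = sqlite3.connect(":memory:")
    lo, hi = 1, upper_bound
    while lo < hi:
        mid = (lo + hi + 1) // 2
        cols = ", ".join(f"c{i} INTEGER" for i in range(mid))
        try:
            conn.execute(f"CREATE TABLE probe ({cols})")
            conn.execute("DROP TABLE probe")
            lo = mid                      # mid columns worked; try wider
        except sqlite3.OperationalError:  # e.g. "too many columns on probe"
            hi = mid - 1                  # too wide; back off
    return lo

if __name__ == "__main__":
    print(max_columns())                  # 2000 on a default SQLite build
```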
Comments
Silly question: what makes a multi-thousand-column relation superior if the alternative is an at least somewhat normalized schema?
The 1,000 column limit is not just Oracle Standard Edition, it’s an Enterprise Edition limit also.
http://download.oracle.com/docs/cd/E11882_01/server.112/e17110/limits003.htm
Thanks, David. I’m not surprised, but I was being cautious in my phrasing.
Matt Pfeil wrote in to say that Cassandra’s limit was 2 billion, which for me equates to “effectively unlimited.” Haven’t heard about any other relational DBMS yet.
Vilho — other than that it takes less human effort and less hardware to get the job done, nothing.
Microsoft SQL Server is 30K for tables with sparse columns.
http://msdn.microsoft.com/en-us/library/ms143432.aspx
Hubi,
Thanks!
Is the “sparse” just a practical matter — as in one couldn’t fill 30,000 non-sparse columns without blowing through max table size — or is there some more specific issue involved?
With sparse columns the max row size is still a limiting factor; sparse columns simply save you space if you have a table with lots of columns that are predominantly NULL.
So, in reality, if you have many columns that will not be NULL then you'll hit the same wall.
Tony.
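To make Tony's point concrete, a back-of-envelope calculation. The 8,060-byte figure is SQL Server's usual in-row size limit; treat the numbers as illustrative rather than exact.

```python
# Back-of-envelope: why fully populated columns hit the row-size wall long
# before the 30,000-column sparse ceiling. Figures are illustrative only.
ROW_SIZE_LIMIT_BYTES = 8060   # SQL Server's usual max in-row bytes per row
INT_BYTES = 4                 # a plain 4-byte integer column

print(ROW_SIZE_LIMIT_BYTES // INT_BYTES)   # about 2,015 non-NULL int columns
```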
Hi Curt.
I didn't really understand from your answer how using thousands of columns in one table would make query execution faster or less resource-intensive, assuming the reason is not simply that it was faster to create a schema with one table.
By the way, solidDB supports 1000 columns. I was just curious to know why it would pay off to change that upper bound to something else.
I would go with "effectively unlimited" for ParStream – only limited by the number of files you can have on a distributed server cluster/cloud … or maybe limited by disk space, if at all.
I agree with Curt that there are applications that use and need that many columns – e.g. in employee satisfaction surveying, where people are asked many different questions. Because of the real-life complexity there are far more questions defined than are asked of any single employee. Depending on their answers, whole question tracks are left blank.
Every possible question is assigned a column – believe me, more than a thousand.
By storing it in a single table, extremely fast pattern recognition can be performed, i.e. how many employees are between 20 and 30, work in a production department, are satisfied or very satisfied with their superior … and many more.
By using a bitmap index for every column, advanced analytics can be made very fast. I would argue that performing the analysis on a single table with (highly compressed, row-synchronized) bitmap indices leads to much shorter response times (or more correlations checked per unit of time) than joining 1000 tables together.
Any thoughts on that?
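A toy sketch of the approach Michael describes, not any product's actual bitmap format: one bitmap per predicate, with the pattern query reduced to bitwise ANDs. The rows and predicates are made up.

```python
# Toy per-column bitmaps (not any particular product's format): a Python int
# serves as a bitmap over rows, and a pattern query becomes bitwise ANDs.
rows = [
    {"age": 25, "dept": "production", "satisfaction": "satisfied"},
    {"age": 41, "dept": "sales",      "satisfaction": "very satisfied"},
    {"age": 28, "dept": "production", "satisfaction": "very satisfied"},
]

def bitmap(predicate):
    """Return an int with bit i set when the predicate holds for row i."""
    bits = 0
    for i, row in enumerate(rows):
        if predicate(row):
            bits |= 1 << i
    return bits

age_20_30  = bitmap(lambda r: 20 <= r["age"] <= 30)
production = bitmap(lambda r: r["dept"] == "production")
happy      = bitmap(lambda r: r["satisfaction"] in ("satisfied", "very satisfied"))

match = age_20_30 & production & happy        # combine predicates bitwise
print(bin(match).count("1"))                  # 2 employees match the pattern
```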
Sybase IQ supports 45,000 columns/table.
Oracle 10g has a limit of 1000 columns per table and multicolumn indexes are limited to 32 columns. IBM DB2 (version 8) has a limit of 750 columns per table, whereas MySQL 5.0 has a limit of 4096 columns per table.
Seems with DB2 9.7 the limit has increased to 1012 columns per table (with a 32K page size):
http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/index.jsp?topic=/com.ibm.db2.luw.sql.ref.doc/doc/r0001029.html
Vilho,
You seem to be proposing an alternative of doing lots of joins. Yet you also seem to be saying you don’t see how that’s slower than not doing lots of joins.
Presumably, I’m misunderstanding at least half of that.
Teradata has a limit of 2048 columns per table. At least in Teradata, this limit is a function of the 128KB table header limit in its file system.
Dawit:
Are you saying 64 bytes per column are needed for the headers in Teradata, or something like that?
fyi, postgres is limited to 250 – 1600 columns per table, depending on the column type
http://www.postgresql.org/about/
PostgreSQL has a limit of 250-1600 columns per table (depending on column types):
http://www.postgresql.org/about/
In addition to column descriptions, the table header also contains other info like partition and compression info. The 128KB space has to hold all of this information, hence the limit on the number of columns per table.
Curt, I'm not proposing anything. I don't know enough about analytic processing to guess what kind of queries would be executed against that multi-thousand-column table. Without knowing that, I won't even try to claim how the processing should be done. It is just interesting to hear your arguments.
In many applications, the limit on the number of columns per table is much more constraining as it applies to the intermediate or final results produced during query processing than as it applies to base tables. At least in some DBMSs, the same limit applies to both. It is much more common to have complex queries that produce an intermediate or final result table that exceeds the limit. For example, computing aggregations over multiple categorical variables in SQL in a single pass over the table results in a single row with the values converted to columns.
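Here is a hedged illustration of the kind of one-pass pivoting aggregation Dawit describes; the table and column names are made up. The result width grows multiplicatively with the categorical variables, so it can blow past a per-table column limit even when the base table is narrow.

```python
# Illustration only (made-up names): a one-pass aggregation that pivots
# categorical values into columns, so the single result row gets very wide.
regions  = ["north", "south", "east", "west"]
products = [f"p{i}" for i in range(500)]

select_list = ",\n  ".join(
    f"SUM(CASE WHEN region = '{r}' AND product = '{p}' THEN revenue ELSE 0 END)"
    f" AS rev_{r}_{p}"
    for r in regions for p in products
)
sql = f"SELECT\n  {select_list}\nFROM sales"

print(sql.count(" AS rev_"))   # 2000 result columns -- over many DBMS limits
```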
The current limit for Ingres and VectorWise is 1024 cols per table, but for VectorWise we should consider raising it. I'll second Michael's comment above. It is not at all crazy for a DW to join all of the dimensions into the fact table, even where it results in a very wide table. Column-wise compression (e.g. dictionary compression) means this can be done with little or no storage overhead. Column selection means queries using a small subset of columns don't suffer. And then there is no join left to do for queries, except for the special case of putting columns back into rows, for which column stores have very efficient mechanisms.
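A toy illustration of the dictionary-compression point, not VectorWise's actual encoding: a low-cardinality dimension attribute copied into every fact row shrinks to roughly one small code per row once the column is encoded against its own dictionary.

```python
# Toy dictionary encoding (illustrative only, not any product's format):
# a repeated dimension value denormalized into a wide fact table costs
# little once each column is encoded against its own small dictionary.
country = ["Germany", "France", "Germany", "Germany", "France"] * 200_000

dictionary = sorted(set(country))                  # 2 distinct values
code_of = {v: i for i, v in enumerate(dictionary)}
codes = bytes(code_of[v] for v in country)         # 1 byte per row here

raw_bytes     = sum(len(v) for v in country)       # naive row-wise strings
encoded_bytes = len(codes) + sum(len(v) for v in dictionary)
print(raw_bytes, encoded_bytes)                    # ~6.6 MB vs ~1.0 MB
```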
Hi Curt,
At Infobright, it all depends on the types.
In MySQL, and therefore in us, there's a limit of just under 65,536 bytes per row (65,532?), so that effectively limits the number of columns. Any single column can be up to 64K, but the total width of all columns cannot be more than 64K. Thus, just under 16,384 integer columns.
Thanks,
Jeff
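The arithmetic behind Jeff's integer-column figure, taking MySQL's 4-byte INT as the example type:

```python
# MySQL's ~65,532-byte row limit divided by a 4-byte INT gives the effective
# ceiling for an all-integer table.
print(65532 // 4)   # 16383, i.e. "just under 16384" as Jeff says
```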
Just ran a check on SAS. I was able to successfully construct a dataset with 2,072,576 eight-byte columns on my laptop. On a 64-bit box I was able to successfully create a dataset with 3,133,440 eight-byte columns before I decided to kill it.
I think SAS has a total observation length limit rather than a number-of-columns limit, as the others do. But I think it may have more to do with available memory than with any absolute limit. So the 64-bit implementation might have a truly impressive limit on the number of columns.
Half a million columns in SAS!
I am working on a paper in risk theory at Stockholm University. Since I do not have any C or C++ compiler installed on my home PC, I have used the SAS data step.
It turns out (partly to my surprise) that:
* Processing of 10,000-20,000 columns is very fast. (Please note that I only have one single row, so this is a very unusual situation.)
* The code is very clear, slightly better in readability than Matlab.
* Anyway, to have your results in 10,000 variables means that you have to restructure them so you can see the result.
* I think that SAS internally produces a C-like program, which stores the data in the one single observation, in real memory.
* I have (for fun) tried with 500,000 columns and performed a simple addition. Also fast: 17 seconds.
Thanks John and Anders (SAS users) for getting here first! It's been recently (v8 or v9) that SAS increased our columns-per-table limit up from 32K (signed short) to essentially unlimited (unsigned long, or roughly 2 billion). The "catch" is you need enough 64-bit address space to store all the column descriptors and at least one row of data in memory.
Genomics models were the primary driver. These folks typically have short but wide datasets on the order of 300,000 columns.
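A rough back-of-envelope on what keeping the column descriptors plus one row in memory means at genomics widths. The 100-byte-per-column descriptor size is an assumption for illustration, not SAS's actual metadata footprint.

```python
# Back-of-envelope only; the 100-byte descriptor figure is a guess for
# illustration, not SAS's actual per-column metadata size.
columns = 300_000
row_bytes        = columns * 8      # one row of 8-byte numeric values
descriptor_bytes = columns * 100    # assumed metadata per column

print(round((row_bytes + descriptor_bytes) / 1024**2, 1))   # ~30.9 MB
```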
For IBM DB2, the maximum number of columns in a table is 1012 for all page sizes (8K, 16K, 32K) except 4K pages. If you use 4K pages, then the maximum number of columns in a table is 500.
Curt, Michael, Dawit, and others, thanks for the clarifying comments. I had to look years back to get into the correct context. solidDB is more OLTP, and from that perspective things are looked at from quite a different angle.
Since everybody is voting for a "collapsed" schema, one big fact table, is there any role for a snowflake schema anymore? Are there any opposing arguments that you guys could, if not agree with, at least accept as another way of solving the problem?
Vilho,
The quick answer is that normalization is good (especially performance-wise) for some kinds of updates and a few kinds of queries, while denormalization is good for other kinds of updates and many kinds of queries. So it depends on your workload.
It also depends on what software you use, as different software responds differently to different workloads.
Basically, if updating with data integrity is the most demanding part of the task, normalization may make a lot of sense. If most retrievals are of single short rows, why not normalize? If a whole lot of different applications will use the same database, normalization may well turn out to be best. Absent those three scenarios, however, normalization’s benefits are often overblown.
Hi Curt,
Re: Infobright –
One quick note to follow up on Jeff Kibler’s observation about the limit on the number of columns – it is a function of how MySQL (the database front end for Infobright) handles the column definitions. There isn’t any such limitation inside of the Infobright engine proper.
In terms of the table structures, the maximum number of columns would be 2^32; however the current file abstraction would only allow 100,000 columns. This latter limitation can be easily removed.
.g.
1010data has no hard limit on the number of columns in a single table but as a practical matter system resource considerations probably puts it in the hundreds of thousands. We certainly have handled tens of thousands in a test table but in truth no “real” table has exceeded several thousand. Except possibly in cases of efficiency-motivated denormalization (to avoid joins), which in 1010data is usually unnecessary, or time series, which would be better organized differently, it’s hard to see why tables in almost any business sector would have more than a few thousand columns.
It's also good to note that for Oracle the 1000-column limit also includes "hidden columns." For each column that has a function-based index applied to it, each function on that column gets materialized as a hidden column, so effectively you have 1000 columns minus the number of function+column pairs used in your function-based indexes.
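A hedged sketch of how one might check how much of the 1,000-column budget those hidden columns consume, by counting them in Oracle's USER_TAB_COLS dictionary view. The connection string and table name are placeholders, and the cx_Oracle driver is assumed to be installed.

```python
# Hedged sketch: count the hidden columns that function-based indexes add,
# since they count against Oracle's 1,000-column ceiling. The DSN and table
# name are placeholders; USER_TAB_COLS / HIDDEN_COLUMN are standard names.
import cx_Oracle   # assumes the cx_Oracle driver is installed

conn = cx_Oracle.connect("user/password@host/service")   # placeholder DSN
cur = conn.cursor()
cur.execute(
    """
    SELECT COUNT(*)
    FROM   user_tab_cols
    WHERE  table_name = :t
    AND    hidden_column = 'YES'
    """,
    t="WIDE_TABLE",                                       # hypothetical table
)
hidden = cur.fetchone()[0]
print(f"{hidden} hidden columns already count against the 1,000-column limit")
```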
DeepCloud.co also has a 1024 column maximum. However if Ingres raise the limit for Vectorwise, we will increase the capacity of our MPP engine accordingly.
What would be a reasonable maximum – 4K ?
Today I’m sure 4K would be plenty. However, I don’t have a good feel for the rate of growth.
To re-use one of my favourite quotes, “In theory, there is no difference between theory and practice. But, in practice, there is.” Jan L.A. van de Snepscheut.
The theoretical limit for SAND is 4,294,967,296. Practically it is 8,192. We haven’t had a real world request for anything like that number. As you like to say as a grey beard in Chuck Taylors, we have seen a lot of requests over time.
How do you count the number of columns in a table in DB2?
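That question went unanswered in the thread. On DB2 LUW, the usual approach is to count rows in the SYSCAT.COLUMNS catalog view; a sketch follows, with placeholder connection details and table names, and with the ibm_db driver calls treated as assumptions.

```python
# Hedged sketch: count a table's columns via DB2 LUW's SYSCAT.COLUMNS view.
# Connection details are placeholders; the ibm_db calls are assumptions.
import ibm_db

conn = ibm_db.connect(
    "DATABASE=sample;HOSTNAME=localhost;PORT=50000;UID=user;PWD=secret", "", ""
)
sql = ("SELECT COUNT(*) FROM syscat.columns "
       "WHERE tabschema = 'MYSCHEMA' AND tabname = 'WIDE_TABLE'")
stmt = ibm_db.exec_immediate(conn, sql)
print(ibm_db.fetch_tuple(stmt)[0])   # column count for MYSCHEMA.WIDE_TABLE
```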
[…] opposed to Oracle’s annoying 1000 column limit, Vertica tables can have an effectively unlimited number of columns (whatever the machine’s […]
This is old but it indexes very well in Google, so for people looking into a columnar database with no practical limit on the number of columns I would recommend MonetDB, an open-source project that has worked on columnar data stores since the early days.
https://www.monetdb.org/