Notes on indexes and index-like structures
Indexes are central to database management.
- My first-ever stock analyst report, in 1982, correctly predicted that index-based DBMS would supplant linked-list ones …
- … and to this day, if one wants to retrieve a small fraction of a database, indexes are generally the most efficient way to go.
- Recently, I’ve had numerous conversations in which indexing strategies played a central role.
Perhaps it’s time for a round-up post on indexing. 🙂
1. First, let’s review some basics. Classically:
- An index is a DBMS data structure that you probe to discover where to find the data you really want.
- Indexes make data retrieval much more selective and hence faster.
- While indexes make queries cheaper, they make writes more expensive — because when you write data, you need to update your index as well. (A minimal sketch of this tradeoff follows this list.)
- Indexes also induce costs in database size and administrative efforts. (Manual index management is often the biggest hurdle for “zero-DBA” RDBMS installations.)
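To make the read/write tradeoff concrete, here is a minimal sketch (no particular product's implementation) of a secondary index as an in-memory map from value to row positions:

```python
# Minimal sketch: a secondary index as a dict from value -> list of row positions.
# Illustrative only; real DBMS indexes are disk-aware structures such as B-trees.

class IndexedTable:
    def __init__(self, indexed_column):
        self.rows = []                  # the "heap" of rows (dicts)
        self.indexed_column = indexed_column
        self.index = {}                 # value -> list of row positions

    def insert(self, row):
        pos = len(self.rows)
        self.rows.append(row)
        # The write is more expensive: the index must be maintained too.
        self.index.setdefault(row[self.indexed_column], []).append(pos)

    def lookup(self, value):
        # Probe the index to find *where* the data is, then fetch just
        # those rows, instead of scanning self.rows from end to end.
        return [self.rows[pos] for pos in self.index.get(value, [])]

t = IndexedTable("city")
t.insert({"name": "Alice", "city": "Oslo"})
t.insert({"name": "Bob", "city": "Lima"})
print(t.lookup("Oslo"))   # touches only the matching row(s)
```

The structure differs, but the tradeoff is the same in real systems: probing the index makes reads selective, while every write pays for index maintenance.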
2. Further:
- A DBMS or other system can index data it doesn’t control.
- This is common in the case of text indexing, and not just in public search engines like Google. Performance considerations might argue against recopying text documents; so might security.
- This capability overlaps with but isn’t exactly the same thing as an “external tables” feature in an RDBMS.
- Indexes can be updated in batch mode, rather than real time.
- Most famously, this is why Google invented MapReduce.
- Indeed, in cases where you index external data, it’s almost mandatory.
- Indexes written in real-time are often cleaned up in batch, or at least asynchronously with the writes.
- The most famous example is probably the rebalancing of B-trees.
- Append-only index writes call for later clean-up as well. (A minimal sketch of the append-then-compact pattern follows this list.)
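As an illustration of that pattern, here is a minimal sketch, not modeled on any specific engine: index writes are cheap sequential appends (deletions included, as tombstones), and a batch compaction pass later rewrites the log so each key keeps only its live postings.

```python
entries = []                                # append-only log of (key, row_id, is_delete)

def index_write(key, row_id):
    entries.append((key, row_id, False))    # cheap sequential append

def index_delete(key, row_id):
    entries.append((key, row_id, True))     # a tombstone: also just an append

def compact():
    """Batch clean-up, run asynchronously with the writes."""
    live = {}
    for key, row_id, is_delete in entries:  # replay the log in order
        postings = live.setdefault(key, set())
        if is_delete:
            postings.discard(row_id)
        else:
            postings.add(row_id)
    # Rewrite the log so each key keeps only its live postings.
    entries[:] = [(k, r, False) for k, rs in sorted(live.items()) for r in sorted(rs)]

index_write("oslo", 1); index_write("oslo", 2); index_delete("oslo", 1)
compact()
print(entries)    # [('oslo', 2, False)]
```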
3. There are numerous short-request RDBMS indexing strategies, with various advantages and drawbacks. But better indexing, as a general rule, does not a major DBMS product make.
- The latest example is my former clients at Tokutek, who just got sold to Percona in a presumably small deal — regrettably without having yet paid me all the money I’m owed. (By the way, the press release for that acquisition highlights TokuDB’s advantages in compression much more than it mentions straight performance.)
- In a recent conversation with my clients at MemSQL, I basically heard from Nikita Shamgunov that:
- He felt that lockless indexes were essential to scale-out, and to that end …
- … he picked skip lists, not because they were the optimal lockless index, but because they were good enough and a lot easier to implement than the alternatives. (Edit: Actually, see Nikita’s comment below.) A basic skip-list sketch follows this list.
- Red-black trees are said to be better than B-trees. But they come up so rarely that I don’t really understand how they work.
- solidDB did something cool with Patricia tries years ago. McObject and ScaleDB tried them too. Few people noticed or cared.
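For the curious, here is a minimal single-threaded skip-list sketch (search and insert only). It is purely illustrative; a lockless production version such as MemSQL's would additionally rely on atomic compare-and-swap updates of the forward pointers, none of which is shown here.

```python
import random

MAX_LEVEL = 8

class Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level   # one forward pointer per level

class SkipList:
    def __init__(self):
        self.head = Node(None, MAX_LEVEL)

    def _random_level(self):
        # Each extra level with probability 1/2, giving expected O(log n) search.
        level = 1
        while level < MAX_LEVEL and random.random() < 0.5:
            level += 1
        return level

    def search(self, key):
        node = self.head
        for lvl in range(MAX_LEVEL - 1, -1, -1):       # descend level by level
            while node.forward[lvl] and node.forward[lvl].key < key:
                node = node.forward[lvl]
        node = node.forward[0]
        return node is not None and node.key == key

    def insert(self, key):
        update = [self.head] * MAX_LEVEL
        node = self.head
        for lvl in range(MAX_LEVEL - 1, -1, -1):
            while node.forward[lvl] and node.forward[lvl].key < key:
                node = node.forward[lvl]
            update[lvl] = node                         # rightmost predecessor per level
        new = Node(key, self._random_level())
        for lvl in range(len(new.forward)):            # splice the new node in
            new.forward[lvl] = update[lvl].forward[lvl]
            update[lvl].forward[lvl] = new

s = SkipList()
for k in [30, 10, 20]:
    s.insert(k)
print(s.search(20), s.search(25))   # True False
```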
I’ll try to explain this paradox below.
4. The analytic RDBMS vendors who arose in the previous decade were generally index-averse. Netezza famously does not use indexes at all. Neither does Vertica, although its columns themselves play some of the role of indexes, especially given the flexibility in their sort orders. Others got by with much less indexing than was common in, for example, Oracle data warehouses.
Some of the reason was indexes’ drawbacks in terms of storage space and administrative overhead. Also, sequential scans from spinning disk can be much faster than more selective retrieval, so table scans often outperform index-driven retrieval.
5. It is worth remembering that almost any data access method brings back more data than you really need, at least as an intermediate step. For starters, data is usually retrieved in whole pages, whether you need all their contents or not. But some indexing and index-alternative technologies go well beyond that.
- To avoid doing true full table scans, Netezza relies on “zone maps”. These are a prominent example of what is now often called data skipping. (A minimal sketch follows this list.)
- Bloom filters in essence hash data into a short string of bits. Hash collisions can produce false positives, so excess data is sometimes returned. (Also sketched below.)
- Geospatial queries often want to return data for regions that have no simple representation in the database. So instead they bring back data for a superset of the desired region, which the DBMS does know how to return.
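To make data skipping concrete, a minimal zone-map sketch: keep a (min, max) pair per block, and read a block only if its range could contain the predicate value.

```python
# Minimal data-skipping sketch in the spirit of zone maps.

BLOCK_SIZE = 4
data = [3, 7, 1, 4,   15, 12, 18, 11,   25, 22, 28, 21]   # stored in blocks

blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
zone_map = [(min(b), max(b)) for b in blocks]              # one entry per block

def select_equals(value):
    hits = []
    for (lo, hi), block in zip(zone_map, blocks):
        if lo <= value <= hi:            # block *might* contain the value
            hits.extend(v for v in block if v == value)
        # else: skip the whole block without reading it
    return hits

print(select_equals(12))   # scans only the middle block
```

And a minimal Bloom filter sketch: k hash functions set k bits per key, so membership tests can return false positives but never false negatives.

```python
import hashlib

M, K = 64, 3                  # bit-array size and number of hash functions
bits = 0

def _positions(key):
    # Derive K independent bit positions from one cryptographic hash.
    for i in range(K):
        h = hashlib.sha256(f"{i}:{key}".encode()).digest()
        yield int.from_bytes(h[:4], "big") % M

def add(key):
    global bits
    for p in _positions(key):
        bits |= 1 << p

def might_contain(key):
    # True means "possibly present" (the excess data); False is definitive.
    return all(bits & (1 << p) for p in _positions(key))

add("oslo"); add("lima")
print(might_contain("oslo"))   # True
print(might_contain("rome"))   # almost certainly False
```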
6. Geospatial indexing is actually one of the examples that gave me the urge to write this post. There are two main geospatial indexing strategies I hear about. One is the R-tree, which basically divides things up into rectangles, rectangles within those rectangles, rectangles within those smaller rectangles, and so on. A query initially brings back the data within a set of rectangles whose union contains the desired region; that intermediate result is then checked row by row for whether it belongs in the final result set.
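Here is a minimal sketch of that filter-then-refine pattern, using a single bounding box in place of a real R-tree (a real R-tree stores a hierarchy of nested rectangles; that part is elided):

```python
import math

points = [(1.0, 1.0), (2.0, 2.0), (5.0, 5.0), (0.9, 2.1)]

def within_circle(cx, cy, r):
    # Phase 1: coarse rectangle filter, the kind of superset an R-tree returns.
    candidates = [(x, y) for (x, y) in points
                  if cx - r <= x <= cx + r and cy - r <= y <= cy + r]
    # Phase 2: exact, row-by-row refinement of the intermediate result.
    return [(x, y) for (x, y) in candidates
            if math.hypot(x - cx, y - cy) <= r]

# (0.9, 2.1) passes the rectangle test but fails the exact circle check.
print(within_circle(2.0, 1.0, 1.2))   # [(1.0, 1.0), (2.0, 2.0)]
```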
The other main approach to geospatial indexing is the space-filling curve. The idea behind this form of geospatial indexing is roughly:
- For computational purposes, a geographic region is of course a lattice of points rather than a true 2-dimensional continuum.
- So you take a lattice — perhaps in the overall shape of a square — and arrange its points in a sequence, so that each point is adjacent in some way to its predecessor.
- Then regions on a plane are covered by subsequences (or unions of same).
The idea gets its name because, if you trace a path through the sequence of points, what you get is an approximation to a true space-filling curve.
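A minimal sketch of one concrete choice, the Z-order (Morton) curve, which sequences lattice points by interleaving the bits of their coordinates (Hilbert curves are another common choice):

```python
# Z-order (Morton) sketch: interleave the bits of x and y so that 2-D
# lattice points map to positions on a 1-D curve. Range lookups on the
# curve then approximate rectangular regions on the plane.

def morton(x, y, bits=16):
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)       # even bit positions: x
        code |= ((y >> i) & 1) << (2 * i + 1)   # odd bit positions: y
    return code

# Nearby points usually (not always) get nearby curve positions:
for x, y in [(0, 0), (1, 0), (0, 1), (1, 1), (2, 2)]:
    print((x, y), morton(x, y))   # 0, 1, 2, 3, 12
```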
7. And finally — mature DBMS use multiple indexing strategies. One of the best examples of a DBMS winning largely on the basis of its indexing approach is Sybase IQ, which popularized bitmap indexing. But when last I asked, some years ago, Sybase IQ actually used 9 different kinds of indexing. Oracle surely has yet more. This illustrates that different kinds of indexes are good in different use cases, which in turn suggests obvious reasons why clever indexing rarely gives a great competitive advantage.
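To illustrate the bitmap idea, a minimal sketch: one bit-vector per distinct column value, so that multi-predicate filters reduce to bitwise AND/OR.

```python
# Minimal bitmap-index sketch, using Python ints as bit-vectors.

rows = ["red", "blue", "red", "green", "blue", "red"]

bitmaps = {}
for i, value in enumerate(rows):
    bitmaps[value] = bitmaps.get(value, 0) | (1 << i)   # set bit i for this value

def row_ids(mask):
    return [i for i in range(len(rows)) if mask & (1 << i)]

# WHERE color = 'red' OR color = 'blue' becomes a single bitwise OR:
print(row_ids(bitmaps["red"] | bitmaps["blue"]))   # [0, 1, 2, 4, 5]
```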
Comments
Curt, thanks – great post! One question – as DB storage moves higher in the memory hierarchy, what’s your take on how indexing strategies are changing?
Good question! Frankly, discussions of in-memory data management and of indexing seem to be pretty orthogonal, except insofar as indexes will often be in memory even when data isn’t.
Perhaps part of the reason is that when data is cached, it’s fluid enough that not a lot of investment is made in indexing it.
I’d say about in-memory structures in general:
1. Pointers are back on the table when a random lookup costs only a RAM access.
2. Tries have been tried. 🙂 See for example solidDB.
3. Fixed-length fields, so that addresses can just be calculated — i.e. arrays — are popular. (Sketched after this list.)
4. Compression isn’t as universally highly valued as I’d expect.
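A minimal sketch of point 3, with Python's struct module standing in for a real record layout: with fixed-length records, a row's location is pure arithmetic rather than an index probe.

```python
import struct

RECORD = struct.Struct("<i8s")          # a 4-byte int plus an 8-byte string
buf = bytearray(RECORD.size * 3)

for i, (k, name) in enumerate([(10, b"alice"), (20, b"bob"), (30, b"carol")]):
    RECORD.pack_into(buf, i * RECORD.size, k, name)

def fetch(i):
    # Row i lives at offset i * record_size: direct address arithmetic.
    return RECORD.unpack_from(buf, i * RECORD.size)

print(fetch(1))   # (20, b'bob\x00\x00\x00\x00\x00')
```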
Excellent mini-history lesson on the origins of indexing: the who, the how, and what’s happening now. My take on indexes in analytic databases (vs. transactional ones) is that the range of queries forced a tradeoff between index sprawl and scan performance, one that got further skewed by the emergence of columnar storage, data skipping, and compression. To all that, add in-memory storage. While this is NOT to say that indexes are unnecessary in analytic databases, my take is that they won’t be as vital as in transaction stores, where the range of queries (and data) is typically more limited by purpose and design.
One of the properties of efficiently distributable indexing structures is that they are all describable with space-filling curves. This leads to significant confusion, since being describable in terms of space-filling curves does not imply that the index implementation linearizes data or access on a space-filling curve. Most large-scale geospatial indexes are a case of the former. Linearizing on space-filling curves is not that useful for indexing; this was extensively studied (and patented by IBM, Oracle, et al) in the 1980s but ultimately abandoned. The resurgence in use of these indexes for geospatial is frequently oblivious to prior research since most of the relevant literature is not readily available on the Internet.
Geospatial data has two properties that mean it cannot be trivially grafted onto most existing database engines. First, you can show that no partitioning function exists that is both uniform and general while also preserving spatial relationships. Second, geospatial data is not efficiently representable on ordered data structures generally (B-trees, skip lists, etc.), which is the only representation mapping most databases support. This leaves you with indexes like R-trees, which can work adequately if the scale and data model are tightly constrained.
This blog post captures many of these problems:
http://www.jandrewrogers.com/2015/03/02/geospatial-databases-are-hard/
A number of scalable, general geospatial indexing algorithms exist (despite the lack of literature), but they all use embeddings in very high dimensionality spaces, to get around the first theoretical problem, and are not implemented around order-preserving data structures, to get around the second problem. These indexing structures work well for more conventional data too, but they fit poorly in database architectures not designed for them.
Great post Curt. Few comments re lockless indexes.
1. Lockless indexes are great for concurrent reads and writes, and scale up to a high number of concurrent queries on multi-core machines. This is different from saying that they are essential for scale-out. Clustering, high availability, and query planning are essential for scale-out, while skip lists are great for multi-core workloads on one node. One doesn’t exclude the other; basically you need both.
2. The primary reason to choose skip lists is robustness. They are battle-tested in a variety of systems products that require high throughput, including OSs, high-frequency trading solutions, and database internals. There are other solutions that claim to be better suited for database workloads: Masstree from MIT and the Bw-tree from Microsoft. Those haven’t proven to be dominant in the industry yet.
Nikita
Data partitioning can be considered a form of indexing too – at least for data skipping via partition pruning.
Curt, thanks for igniting this indexes discussion.
The competitive advantage of indexes eventually depends on 2 factors: use case and cost.
Use case:
• Workloads such as ETL or wide-range reporting gain little from indexes. On the other hand, for ad-hoc multiple-filter queries there is no replacement for indexes.
• Most general-purpose RDBMS are built to handle a large range of scenarios: transactional, ETL, complex full-data-set reporting, BI… In many of those scenarios indexes bring little or no value, so their overall impact is limited, while the cost is high. However, when focusing on the specific use case of interactive BI, which means ad-hoc queries with multiple filters, indexes are a game changer.
Cost:
• There has always been a trade-off between the value indexes deliver in query performance and their cost in writes and maintenance. The cost is even more significant when it comes to big data.
• Indexes that rely on locks to maintain integrity, consume large storage, or require a large memory footprint cannot serve big data.
• Making indexes cheap in terms of write performance, storage, and maintenance cost re-positions the value of indexes.
This is JethroData’s formula for gaining competitive value from indexes:
• Reduce the cost of indexes via lockless, append-only, and compressed index structures.
• Integrate low-cost indexes and columnar storage into a single SQL solution.
• Focus on interactive BI over Big Data: the more selective the queries are, the more advantageous the indexes. As a user drills down into their data, index-based queries become faster and faster.
Boaz Raufman
Co-Founder & CTO
JethroData
At Infor we have an in-memory database that stores column-tokenized data in something very like a Patricia trie. Writes (whether updates or inserts) are very fast when the column cardinality doesn’t change, and still good when it does. Data skipping works quite well, particularly on sparse data. Filtering simply retrieves less data.
Vertica projections are indexes according to the index definition above.
Any details as to when/why red-black trees are better than B-trees? It’s definitely not true on HDD (or SSD) and I seriously doubt it’s true in RAM on modern hardware. Any links to articles and/or benchmarks etc. would be appreciated.
To agree with Igor, I can’t think of a case on real computing hardware where red-black trees are a good choice for databases. When performance matters, you treat RAM like a block storage device because that is how it behaves, albeit with lower latency.
Good topic. I just recently had a discussion about indexes in big data, and would like to point to the nice research work “Space efficient indexes for the Big data era” by Sidirourgos:
http://dare.uva.nl/document/522766
Column imprints, for example, are something I cannot wait to see in action, especially if implemented to skip disk-resident data blocks too, while the paper discusses them more as a “cache conscious” index. That would improve Actian Vector over its current min-max block indexes, which are good for naturally ordered data.
Sybase IQ’s 9 index types reduce to 6 if the “date”, “datetime” and “time” index types are seen as datatype-specific “high group” implementations. High Group (HG) is a B-tree index type improved with G arrays (holding lists of row IDs). A unique HG would be just a B-tree (since only one row can have a particular value). At least in version 15, HGs were not local but global if range partitioning was used for a table, which is not good at all. I haven’t checked whether that is still the case in version 16.
IQ’s “fast projection” index type is mandatory, i.e. the columns actually are fast projection indexes. There was an optimized flavor of fast projection (FP1, FP2, and in version 15, FP3), but it seems to have been dropped from version 16. It was a storage optimization: 1/2/3-byte lookup tables were used to map distinct values of a column to surrogate byte values. It would be nice to hear the reason why optimized FPs were dropped from version 16. I could guess that FP3 was not that good, and/or that it caused columns with a big number of distinct values to be turned flat at a much later phase (i.e. a bigger rewrite) than e.g. FP2 did in earlier versions. In general, optimized FPs make data loads unpredictable, since flattening/rewriting of data can happen at any point. Predictability can most certainly be improved a lot by choosing properly when to use and not to use optimized FPs, i.e. by knowing the cardinality of a column well enough beforehand.
Vertica’s projections are like secondary clustered indexes with a subset of (carried/included) columns, whereas traditionally (in other products) clustered indexes are primary, since they order the whole base table by the selected ordering column(s).
A poor “secondary clustered indexes” comment from me about Vertica’s projections, since they are still primary. Overlapping/multi-dimensional/subset/decomposed clustered indexes probably would be better descriptions, yet “projection” is a descriptive name altogether. Vertica’s own description seems to refer to clustered indexes also:
http://184.106.12.19/2011/09/06/the-power-of-projections-part-3/
Correction to this:
“Sybase IQ’s 9 index types reduces to 6, if “date”, “datetime” and “time” index types are seen as datatype specific “high group” implementations.”
It should be “high non group” (HNG). The rest of the “high group” mentions are meant to be “high group” still.
[…] rich it’s worth doubling back to see what other folks added. I think the recent geek-out on indexes is one such case. Good stuff was added by multiple smart […]
The 1/2/3-byte FP indexes in SAP (Sybase) IQ were replaced in version 16.0 with n-bit FPs that serve the same role and provide higher compression for low-cardinality data.
The date/time/datetime indexes are hybrid B-Tree/bitmapped indexes that provide optimized access on datepart boundaries.
In general, the secondary indexes still provide the best performance for highly-selective conditions but are outpaced by IQ’s MPP capabilities in other cases. Secondary indexes remain relevant and supported but the guidelines for using them have evolved substantially.
Thanks, Kurt, for giving an answer about SAP Sybase IQ’s optimized FP index change. It seems that an NBit FP can also flatten implicitly after the number of unique values exceeds the IQ UNIQUE setting for a column. But with proper schema design that can be prevented.
About cost efficient secondary indexes:
It seems that MonetDB already has a Column Imprint index type, automatically created for fixed-size data type columns:
https://www.monetdb.org/Documentation/Manuals/MonetDB/Kernel/Modules/Imprints
[…] if the two popular alternatives for geospatial indexing are R-trees or space-filling curves, MemSQL favors the latter. (One issue MemSQL sees with R-trees is concurrency.) Notes on […]
[…] April, 2015 post on indexing technology reminds us that one DBMS can do multiple […]