5 kinds of data structure and 16 kinds of data access method
My recent post about datatype extensibility zoomed over at least one head, as per the comment thread. Since then I’ve googled, and come to suspect that part of what I was assuming as common knowledge may not be so common after all. So I’m going to back up and explain a bit about data access methods, as well as the sub-topic of data structures. If you take nothing else away from this post, I hope it will at least remind of you of the sheer variety of ways data can be stored on disk or in RAM.
First, let’s define the concept of data access method in three steps:
-
The basic mission of a database management system is to, upon request, return part of the database to the requester. In a relational database, what’s returned will be a record or set of records; in an object-oriented database it will be an object or set of objects; and so on.
-
In a few major brands of data warehouse appliance, the DBMS may call pretty much the whole database into memory, then find what it wants from there. Most other DBMS are more finicky, and try to retrieve from disk only the data requested. More precisely, they usually retrieve data blocks of a fixed size, and try to bring back only the blocks that may contain the information being looked for.
-
How they make these selections is their data access method. More precisely, it’s the part of the data access method that deals with getting data off of disk. Often, there will be other parts that govern what happens to the data once it’s in memory.
At their core, most data access methods consist of two things:
-
A data structure.
-
(In some cases) A design for indexes that point into that data structure.
There are many kinds of data structure, with some of the most important being:
-
Simple files. A single file can be stored as, well, a file. Or maybe as a BLOB or CLOB (Binary/Character Large OBject) managed by a DBMS. This used to be the way all data was managed, but now it’s mainly used in three contexts:
-
For surviving old-style legacy applications.
-
For single documents, images, videos, sound files, and the like.
-
To hold something whose contents will be indexed by a DBMS or similar technology.
-
-
Arrays with fixed-length entries. There’s nothing faster than getting data from a completely regular array. Just calculate the address and you’re there. But lack of compression is a huge issue, unless you’re dealing with purely numeric data that can’t be compressed anyway. The relational SAS Intelligence Storage has a fairly pure form of this data structure, and MOLAP (Multidimensional OnLine Analytic Processing) products may use it in modified forms.
-
Tables with variable-length entries. The bread-and-butter data structure of modern database management is the table, storing numeric, date, or character-string data. By default, entries take up however many bytes are needed to express them, but various kinds of compression are increasingly used on top of that.
-
Linked list (hierarchy or network). Before there were relational or other index-based database management systems, there were linked-list products. Each data value would be accompanied by a pointer to the next one you were likely to want. Specifically, you were linked to parents, children, and next-records. These systems were called hierarchical if there only was one parent (e.g., IBM’s IMS and it’s cousins), or network if there could be more (e.g., Cullinet’s IDMS). The CODASYL standard covered network DBMS, but people soon stopped caring, as index-based DBMS obsoleted linked-list ones.
-
“It’s all in the index.” In some cases, data is wholly copied into the index used to find it, and the underlying data isn’t really managed at all. This is particularly common in text search, data federation, and columnar RDBMS.
There’s a huge number of possible data access methods, and each major kind can have a broad variety of tweaks. Discussing them was meant to be the main subject of this post, but it seems to have gotten plenty long enough already. And in my flu-like-symptoms funk, I’m temporarily lacking in the enthusiasm to keep going.
So I’ll just finish with a list of some data method alternatives. Please feel free to add others in the comment thread. Some time when I’m more dug out from my backlog, I’ll try to circle back and actually describe how some or all of them work.
-
Simple file manager.
-
Classic relational b-tree.
-
Hash index.
-
Bitmap.
-
Other columnar.
-
Star.
-
Geospatial r-tree.
-
Hierarchy-of-arrays MOLAP.
-
Inverted index.
-
Classic full-text index.
-
Classic linked-list.
-
Classic object-oriented.
-
Native XML (all).
-
Trie.
-
File plus calculated metadata (e.g., for still image).
-
Simple key/value lookup.
Comments
3 Responses to “5 kinds of data structure and 16 kinds of data access method”
Leave a Reply
You missed three of my favorites.
1) skip-list: Neat alternative to balanced binary trees.
2) z-order (I did original research on this one): data transformation that turns any data structure supporting random and sequential access into a spatial data structure.
3) linear hashing.
Models and Efficiency, part 2: Databases…
There are endless possibilities for data structures and access methods. The implementers of early database systems, such as System R, chose to use a simple one-to-one mapping between the model and the physical storage. It was a natural choice as a fi…..
Well nice topic. But there are some half-true statements <>
The original CODASYL model did not have index. But Cullinet’s IDMS did give full support for index in 1984 itself. And in 1992 gave full SQL support – against the NONSQL (sic) database as well as native table definitions. Today’s CA-IDMS is a database supporting two models – CODASYL and Relational in one DBMS.