In this article, we will consider the difference between clustered and secondary indexes. We will get acquainted with the B-Tree data structure in which indexes are stored.
Database index
Database index — an object of DB created with the goal of improving search data performance.
Since tables in a database may contain many rows stored in random order, searching for rows according to the specified criteria can be time-consuming when you browse through the table sequentially. An index is created based on the values of one or more table columns along with pointers to the corresponding table rows, which allows you to quickly find rows that meet the search criteria. The use of an index improves speed due to its search-optimized structure, such as a balanced B-Tree.
Clustered index
Clustered index – a type of index in DBMS with a tree structure, where index values together with data are stored as an ordered tree, usually as a balanced search tree – B-Tree (or its B-Tree variations).
In a clustered index, each level of the tree represents index pages, and the end pages (leaves, Leaf) contain the actual table row data.
Consider the clustered index (Fig.1) built on a column post_id
(unique identifier of post an article).
The structure of the clustered index is as follows:
- Root node – tree top, containing sorted index values and pointers to child levels. The root level does not store data.
The root level provides the starting point of the search when performing a query. - Intermediate nodes – store sorted index values and pointers to child levels (in our case Leafs). Intermediate levels do not store column data.
- Leaf nodes – the lowest levels of the tree containing the full table rows ( columns with data ). Leaves are stored in the index-ordered, fragmented form on disk.
Let’s look at an example:
The database search starts with the Root node
, after finding in the list that the value we are looking for is 5 – it is the range between 5-6, then we go to the Intermediate node
in it, we find a pointer to the Leaf node
, where all data about post_id=5
is stored.
This example was simplified because in most cases the range in the lists will be much wider (for example, the first row from 1-100, and the second row from 101-200), and the depth of the tree will have more levels.
The cluster index can be only one for each table, because it is used to sort and fragment the data of the whole table on disk.
B-Tree
The B-Tree structure allows you to quickly and efficiently search, insert, delete, and update rows based on the key values of the indexed columns.
The complexity of the B-Tree algorithm is O(log n).
The table below is an illustrative example of how many comparisons you need to make to find a record in a database table with a different number of rows:
Lines in the table | Number of comparisons |
10 | 3,32 |
100 | 6,64 |
1 000 | 9,96 |
10 000 | 13,28 |
100 000 | 16,61 |
1 000 000 | 19,93 |
100 000 000 | 26,57 |
When you add a new row to the table, it is added not to the end of the file, not to the end of the flat list, but to the right branch and node of the tree structure that corresponds to it in terms of sorting. Sometimes the tree performs rebalancing.
MySQL InnoDB peculiarity
In MySQL for InnoDB, every table must have a clustered index. Normally, a clustered index is synonymous with a primary key.
All data in InnoDB is stored in a clustered index.
- MySQL uses PRIMARY KEY for the clustered index.
- If it was not created, InnoDB will take the first UNIQUE index if it exists.
- If no UNIQUE index was created, then MySQL will create a hidden cluster index named
GEN_CLUST_INDEX
, which will contain the index ID, and when a new string is added, the ID will automatically be increased.
MySQL MyISAM peculiarity
MyISAM, unlike InnoDB, does not support clustered indexes, all indexes there are secondaries.
Secondary index
Secondary index (non-clustered) — is a type of DBMS index similar to cluster index, which creates additional data structure (B-Tree, Hash, etc.) for a table by column/multiple columns for faster search. Such index stores data of indexed column as well as index cluster key.
Unlike the cluster index, the secondary index does not affect the physical order of the records, but is an auxiliary structure to speed up data retrieval.
The secondary index stores on the tree structure leaves to the clustered index entry. The index cluster key is needed if you want to get the data of another column that is not in the secondary index.
In this case, DB will perform an additional search of the row by the cluster index and retrieve the values of these columns from the Leaf
cluster index. In this case, we get 2 B-Tree searches: first by secondary index and then by clustered index.
For InnoDB in MySQL
Keep in mind that the size of the cluster index affects the size of the secondary index, because the non-cluster index contains the key of the cluster index.
Tables can have several different secondaries indexes.