The Key Differences Between Clustered vs. Nonclustered Index

Database Optimization Guides

In relational databases, indexes are data structures associated with columns in a table, primarily created to improve lookup speeds. These structures essentially sort the content of the related columns and contain pointers to the original table. When looking for a particular value or set of values, the database engine first checks if any usable indexes exist on the columns of the search argument. It then uses the index pointers to find the associated rows in the base table. Database indexes are forms of B-tree data structures, composed of root, intermediary, and leaf nodes.

This article will review the key differences between clustered and nonclustered indexes, when to use them, and some practical examples. It uses SQL Server® and PostgreSQL® as a frame of reference. There are three types of sorting indexes:

  • Clustered indexes: These indexes usually reside in a database table. There is only one clustered index in a table.
  • Nonclustered indexes: These indexes exist outside a table and are connected to the original table by pointers. A table can be associated with multiple nonclustered indexes.
  • Columnstore indexes: These indexes store data on a column-by-column basis (non-B-tree), which is ideal for analytical queries in big data applications. Columnstore indexes are beyond the scope of this article. 

Summary of main clustered vs. nonclustered index differences

Clustered index

Nonclustered index

Structure

  • Exists on the base table
  • Data stored physically in the index’s order
  • One per table only
  • Exists outside the base table and stores pointers to data
  • Physical storage order not affected
  • Multiple per table is possible

Read performance

  • Faster for full-table scans/range queries
  • Faster for small, specific retrievals

Write performance

  • Write performance slow if rearranging required
  • Write operations require an additional insert into the index table

Maintenance

  • Defragmentation or reorganization to prevent storage inefficiencies
  • Update statistics to prevent stale contexts
  • Defragmentation or reorganization to prevent storage inefficiencies
  • Update statistics to prevent stale contexts

Structure

Clustered Indexes

Clustered indexes are database indexes that typically exist within a table. They determine the actual order of physical data storage and are responsible for “clustering” the rows of a table in a specific order—hence the name. In SQL Server, primary keys are automatically organized as clustered indexes unless specified otherwise. Therefore, there can only be one clustered index per table since it’s not possible to arrange a table in multiple ways. In the occasional case where a primary key or clustered index is not defined, SQL Server stores the data in an unorganized manner, known as a heap.

The following command creates a clustered index on a SQL Server table (assuming there isn’t already a primary key defined):

CREATE CLUSTERED INDEX idx_OrderDate ON Orders(OrderDate);

While PostgreSQL offers users the ability to cluster indexes, it behaves differently from SQL Server. You can create indexes and run the CLUSTER command to physically reorder the table’s data, but inserts and updates won’t adhere to this clustering rule. The CLUSTER command must be rerun to reorder data. An example of PostgreSQL clustering is as follows:

CREATE INDEX idx_OrderDate ON Orders(OrderDate);
CLUSTER Orders USING Idx_OrderDate;

Note that PostgreSQL offers several kinds of indexes, such as the Generalized Inverted Index (GIN) and Block Range Index (BRIN), that, which have specific use cases but are outside the scope of this article.

Here are some of the key characteristics of SQL Server clustered indexes:

  • Data access path: It uses a direct path since the index refers to the physical order of data, and no row identifier (RID) lookup is necessary.
  • Storage overhead: This overhead is limited to the B-tree nodes and row data in leaf nodes.
  • Column coverage: All columns are readily available since the indexes are part of the table. INCLUDE keywords are not necessary. 
  • Effect on caching/page reads: Data is ordered in 8KB pages. Since rows are ordered, range queries are efficient. Caching is also optimized, as sequential pages can be kept in memory. 
  • Foreign keys impact: Since primary keys are clustered, foreign keys are easily found when needed.
  • Concurrency/thread safety: Hotspots become possible when data is added to the same page often. SQL Server implements latches on pages and locks on rows to maintain thread safety. 

Oracle offers its own version of clustered indexes, known as index-organized tables (IOTs). IOTs offer improved speed for applications that make heavy use of range scans and other scenarios. However, while IOTs are helpful in many situations, they are not as essential to database performance as clustered indexes are to SQL Server.

Nonclustered Indexes

Unlike clustered indexes, nonclustered indexes are data structures that exist outside a database table. They store only a part of the parent table’s data, which comprises the columns being indexed. Nonclustered indexes also contain pointers linking data back to the original table. These pointers can be RIDs if the parent table is stored as a heap, or clustered keys if the table already has a clustered index.

Nonclustered indexes are particularly useful when specific subsets of data are frequently queried (other than the clustered key). They allow quick lookups on columns that are frequently filtered or sorted without scanning the entire table; however, they come with storage overhead.

An SQL Server table can be associated with up to 999 nonclustered indexes and can be created using the command below:

CREATE NONCLUSTERED INDEX idx_CustomerID ON Orders(CustomerID);

Nonclustered indexes can also be created on multiple columns:

CREATE NONCLUSTERED INDEX idx_Customer_OrderDate ON Orders (CustomerID, OrderDate);

Going one step further, if we want to add data to a nonclustered index to cover our queries and be available to speed up response times for specific data requests, but not be part of the nonclustered index, we could use the INCLUDE keyword instead. This is demonstrated below:

CREATE NONCLUSTERED INDEX idx_Customer_OrderDate_Include ON Orders (CustomerID, OrderDate) INCLUDE (Amount);

In PostgreSQL, all indexes are naturally nonclustered, unless the CLUSTER keyword is executed, as stated earlier:

CREATE INDEX idx_customerid ON Orders(CustomerID);

Nonclustered indexes have the following characteristics:

  • Data access path: The data access is indirect; this is so because the database engine must first refer to the external index, find the correct pointer, and use it to retrieve the required data from the database table.
  • Storage overhead: This overhead is higher, since the external data structure contains the indexed columns and pointers.
  • Column coverage: Only the columns that were specified when the index was created are covered.
  • Effect on caching/page reads: Range queries and caching tend to be less efficient, since more page reads might be required.
  • Foreign keys impact: Lookup speeds can be improved if foreign key columns are included in the index, but not otherwise.
  • Concurrency/thread safety: The indexes are usually thread safe as they exist outside of the table’s storage. However, heavy write operations can cause indexes to become fragmented over time.

Read performance

In the case of clustered indexes, reads are immediate since there’s no need to refer to pointers. Searching for a specific row by the clustered key is fast because it’s already structured as a B-tree. Similarly, range-based queries are quick due to sequential reading.

The read performance of nonclustered indexes is more complicated. Typically, reading requires an extra step to use the index pointer and find the corresponding data in the original table (bookmark lookup), which slows down read performance. However, there are some specific cases where nonclustered indexes can provide significant improvements to read performance. For example, fully covering queries (when the index includes all required columns) are exceptionally fast.

Nonclustered indexes are also suitable for querying with commonly used, selective conditions. 

Let’s look at an example that makes this clearer.

Let’s assume we have an SQL Server table with one  million rows, as defined below:

CREATE TABLE Orders (
      OrderID INT IDENTITY(1,1) PRIMARY KEY,
      CustomerID INT,
      OrderDate DATE,
      Amount DECIMAL(10,2)
);

Here, OrderID is the primary key and, by default, the clustered key. The keywords INT IDENTITY(1,1) attribute an auto-incremental OrderID to every new row added, starting at 1, and incrementing by 1.

A query using the clustered index would look like this:

SELECT * FROM Orders WHERE OrderID = 500000;

This results in a clustered index seek (implying the usage of an index to retrieve specific information). As the data is ordered incrementally by OrderID, the database may be able to satisfy the query with minimal reads; however, other factors such as fragmentation, available memory, and query optimizer plan choices may affect the actual performance.

On the other hand, directly querying the table without involving the primary key would not be as efficient: 

SELECT * FROM Orders WHERE CustomerID = 12345;

This clustered index scan, which includes a full table lookup, would be relatively less efficient, involving thousands of page reads and noticeably higher response times.

Let’s add a nonclustered index to the table and run the same query as before:

CREATE NONCLUSTERED INDEX idx_CustomerID ON Orders(CustomerID);

SELECT * FROM Orders WHERE CustomerID = 12345;

This nonclustered index search involves first looking through the index for the required customerID and then retrieving all the required rows from the source table. This includes a few page reads for the index seek, and another one for every matching order in the source table, depending on the cache status and I/O performance. This approach would, in general, be relatively much faster than a full-table lookup.

Let’s assume we frequently request data involving only customerID and orderID . We could benefit from creating a covering index by including the necessary column data in our index:

CREATE NONCLUSTERED INDEX idx_CustomerID_Covering ON Orders(CustomerID) INCLUDE (OrderDate);

Now that all required data exists in the index, the database engine can satisfy the query directly from the index, without having to perform lookups on the source table. This can improve performance, particularly in read-heavy systems, depending on other factors, such as memory constraints and caching.

img_0.png
Reading clustered and nonclustered indexes

Source: https://chsamii.medium.com/clustered-index-vs-non-clustered-index-542f84e4be83

Write performance

A write operation on a clustered index has higher overhead than simply adding to a heap, as the physical ordering still needs to be maintained. When a new row is inserted, the database engine must identify the appropriate page based on the clustered key’s ordering. When a row needs to be added to a page, but there isn’t enough space on the page, a page split occurs.

Page splits

A page split divides the page into two, with approximately half the original rows in each new page. The clustered index’s B-tree is updated to maintain accurate pointers. The new page may be located near the old page or sometimes (inefficiently) at a faraway location on disk. Over time, this inefficiency leads to fragmentation, resulting in performance degradation due to poor page organization.

While page splitting is expected when frequently updating or inserting data, the following  precautions can be taken to avoid excessive page splitting:

  • Use clustered sequential IDs over random IDs to ensure new rows are always added to the end of the table.
    OrderID INT IDENTITY(1,1) PRIMARY KEY,
  • Create indexes with the keyword FILLFACTOR, which determines how much free space is left at the end of a page, allowing you to easily add data and delay page splits in SQL Server. In the example below, with FILLFACTOR set to 80, it leaves pages with 20% of free space.
    CREATE CLUSTERED INDEX idx_OrderID ON Orders(OrderID) WITH (FILLFACTOR = 80);

Write-ahead logging

Database engines maintain a transaction log, which records all changes to the database before they are physically applied. Once the change is securely stored in the log, only then does the engine modify physical memory. This process is called write-ahead logging (WAL). Its main purpose is to prevent corruption of old and new data in the event of a system or server crash during a write or update operation. Most relational database management systems implement WAL before modifying any pages.

Naturally, WAL applies to index-related write operations, too. Before pages are modified, the expected change is first noted in the transactional log. Additional indexes would require more write operations and logging, creating higher I/O pressure. Therefore, while both clustered and nonclustered indexes can provide optimization benefits, performance may degrade in cases of frequent write operations.

Nonclustered indexes introduce additional overhead when dealing with write or update operations. When a row is added to the source table, the engine must update the nonclustered index. If the index is covered (using the INCLUDE keyword) and includes the data of certain columns, then additional operations are required. When clustered keys are modified, all associated nonclustered indexes may need to have their pointers updated, which can be an expensive operation. Therefore, nonclustered indexes must be added with caution, keeping these cost-benefit trade-offs in mind.

Let’s look at an example that deals with writing to clustered and nonclustered indexes. Let’s assume we have the same table as our previous example, containing one million rows, with OrderID  serving as the primary and clustered key:

CREATE TABLE Orders (
      OrderID INT IDENTITY(1,1) PRIMARY KEY,
      CustomerID INT,
      OrderDate DATE,
      Amount DECIMAL(10,2)
);

We insert a new row:

INSERT INTO Orders (CustomerID, OrderDate, Amount) VALUES (12345, '2025-01-01', 500.00);

Since OrderID is incrementally added to the end of the table, we would expect a maximum response time of 1  millisecond (ms) and no more than one page read.

If OrderID was not based on an incremental ordering, the engine would have to look for the correct place to insert the row, leading to a response time of around 30 ms along with a page split.

Let’s now add a nonclustered index:

CREATE NONCLUSTERED INDEX idx_CustomerID ON Orders(CustomerID);

Every insert would now imply one clustered index B-tree insert along with one nonclustered index B-tree insert, increasing response times by approximately 2 – 3 ms.

Let’s assume we used an index with the INCLUDE keyword instead:

CREATE NONCLUSTERED INDEX idx_CustomerID_Covering ON Orders(CustomerID) INCLUDE (OrderDate, Amount);

Since we now need to add column data to the new index, we can expect a further increase of 1 – 2 ms in insert operation times.

As indexes grow in number and complexity, write operations can become expensive. A cost–benefit analysis must be conducted by engineers to gauge whether additional indexes might help improve performance or degrade it.

Maintenance

Indexes can be highly valuable when the goal is to improve performance, but they must be regularly maintained to ensure they continue to benefit the database. Maintenance is an integral part of every database administrator's indexing strategy.

Defragmentation

One of the primary reasons indexes need to be maintained is to handle fragmentation. As stated earlier, fragmentation occurs when frequent write/update operations cause data to be stored nonsequentially. Database engines prefer sequential data as it reduces unpredictability and improves efficiency.

Index fragmentation can be of two types:

  • Internal fragmentation: This happens when pages contain excessive free space. This is usually caused by deletes and page splits.
  • External fragmentation: This occurs when the logical order of index pages does not follow the physical order, so pages are stored nonsequentially on disk. This is typically caused by frequent inserts into a B-tree, leading to numerous page splits.

SQL Server provides two methods for defragmenting indexes:

  1. REORGANIZE : This command reorganizes the leaf level of the index to remove free space and is ideal for internal and mild fragmentation. Intermediate pages remain fragmented. This can be done using the following command.
    ALTER INDEX idx_CustomerID ON Orders REORGANIZE;
  2. REBUILD : This command drops the index and completely recreates it. This method is ideal for cases involving high fragmentation. Engineers should keep in mind that rebuilding an index can be resource intensive if structures are complex. An index can be rebuilt using the following command.
    ALTER INDEX idx_CustomerID ON Orders REBUILD;

While it’s not possible to completely avoid fragmentation, you can delay it by using FILLFACTOR, as stated earlier. A lower fill factor delays further fragmentation but results in an index larger than the default fill factor. This includes pages that are 100% full, resulting in more pages being read, which is a slightly slower process. Indexes with a high insert/update expectancy can benefit from a lower fill factor, while indexes expected to be read-only should be maintained at a 100% fill factor.

The fill factor can be set while creating an index:

CREATE NONCLUSTERED INDEX idx_CustomerID ON Orders(CustomerID) WITH (FILLFACTOR = 80);

It can also be specified when rebuilding an index:

ALTER INDEX idx_CustomerID ON Orders REBUILD WITH (FILLFACTOR = 80);

Update statistics

In SQL Server, statistics are objects that describe how data in a database table is distributed. When the query optimizer creates query execution plans, it uses statistics to optimize the execution performance. If statistics are stale, the engine may generate an inefficient plan to execute a query. Therefore, when defragmenting a nonclustered or clustered index, it is necessary to update statistics to ensure the index improvements are appropriately utilized.

It is also important to note that when you REBUILD an index, statistics are automatically updated, and manual intervention isn’t required. However, when you REORGANIZE an index, it is recommended to manually update statistics using a command similar to this one:

UPDATE STATISTICS Orders idx_CustomerID;

Let’s examine another case where not updating statistics could cause inefficiencies. Suppose we added 500,000 new orders into the Orders table at once, with 400,000 orders attributed to a single customer. If statistics are not updated, the engine may underestimate the number of rows linked to this customer. It would perhaps access the nonclustered index idx_CustomerID with a nested loop join instead of a hash join (a faster option in this case). After the statistics are updated, the query optimizer will make more efficient plans.

Best practices when indexing

Follow these practices for the best results:

  • Clustered index keys should be stable and preferably incrementally ordered because updates can be expensive due to page splitting. Clustered indexes are better suited to range-based queries, e.g., seeking all orders after OrderID 5,000.
  • Nonclustered indexes should target frequent and selective queries that mainly use the WHERE, JOIN, or ORDER BY keywords. Nonclustered indexes can be created on composite columns, if used together frequently, and can also INCLUDE data from these columns, if necessary, to speed up queries.
  • Fragmentation should be avoided as it can cause severe performance problems. Indexes must be rebuilt or reorganized as needed to defragment page storage. After reorganizing specifically, users should update statistics to ensure performance improvement.
  • Over-indexing can be counterintuitive if not carefully planned. While indexing can improve read speeds, it can drastically affect write speeds. The balance between read and write costs should be frequently monitored by considering the following:
    • Focusing only on indexing columns that frequently appear in WHERE, JOIN, or ORDER BY related queries is preferable.
    • Calling sys.dm_db_index_usage_stats in SQL Server can inform users about unused or underutilized indexes.
    • Using write-heavy systems, such as online transaction processing (OLTP) systems, results in benefiting from fewer indexes. 
    • Using read-heavy systems, such as online analytical processing (OLAP) systems, can be made much faster by utilizing multiple, strategically crafted indexes.
  • Modern SQL monitoring tools provide index recommendations to improve query performance. For example, the SolarWinds® Database Performance Analyzer (DPA) provides index advisors with the ability to identify missing indexes, show index contents, and estimate time savings. This diagram shows a list of index advisors that DPA displays.
  • Here’s what you see when you expand the first advisor for the table tpch.dbo.lineitem, with columns l_tax (include l_orderkey):
img_1.png

The system tells you that by creating a nonclustered index, you could save approximately 28 minutes of wait time, which would be significant in any real-world application.

img_2.png

Conclusion

Both clustered and nonclustered indexes, when used effectively, can provide significant performance improvements to database operations. Clustered indexes, which determine the physical order of data, are better suited for range-based queries and full table scans. Nonclustered indexes are external objects that contain pointers to the physical data. These indexes are beneficial for speeding up read queries that frequently access specific data.

When implementing indexes, read-write costs must be carefully considered. While indexes can improve read performance, they can negatively affect write performance. SQL monitoring tools, such as SolarWinds Database Performance Analyzer, can help identify important indexing opportunities users might miss through manual investigation.

Stop guessing. Start optimizing.