This past week my team was dealing with data warehouse ETL performance challenges. The challenge would manifest in a process that would not finish. It would simply stall with no errors or obvious symptoms that could aid troublehooting on the path to a fix.
In the end, through the terrific sleuthing from the team, we were able to achieve success though the optimization of a single query within a single stored procedure. The approach was to manage (drop and create) the related indexes and statistics through code.
One of the remaining questions that lingered at the end of the week was related to our drop index process within the ETL load. The question is:
Is there a benefit to dropping a table's clustered index in order to achieve a faster load?
My hypothesis was that given the "it's the data" nature of a clustered index's leaf level, there would not be significant benefit to dropping and then rebuilding a clustered index.
This article summarizes the two approaches that I took to answer the question. First let's review the literature, i.e. what are the experts, the SQL Server gurus are saying about this topic in their blogs and articles. Secondly, a quick validation test with timings was performed.
Finally I'll post my conclusion and ask the Twitterverse to weight in with opinions.
Literature Review
I could find no one that seemed to say definitively that an insert into a heap was distinctly faster than an insert into a clustered index. Although there was always room for an "it depends" situation depending on the specific context, I came away from this part of the exercise without a justification for dropping the clustered indexes in order to improve performance.
Here is a sample of what I found:
Kimberly Tripp: The Clustered Index Debate Continues...
Inserts are faster in a clustered table (but only in the "right" clustered table) than compared to a heap. The primary problem here is that lookups in the IAM/PFS to determine the insert location in a heap are slower than in a clustered table (where insert location is known, defined by the clustered key). Inserts are faster when inserted into a table where order is defined (CL) and where that order is ever-increasing.
Microsoft Support KB Article ID: 297861
A common cause for these symptoms is that inserts to a heap (a table without a clustered index) are often slower than inserts to a table with a clustered index (a clustered table).
Microsoft Technet: SQL Server Best Practices Article
Based on these results, we concluded that the performance of inserting data into a table with a single clustered index is slightly better (3%) than inserting the same data into a table with a corresponding nonclustered index.
Tibor Karszi: Are inserts quicker to heap or clustered tables?
From the comments:
I thing (sic) that if there is such a "heaps are better for insert" thinking that it probably mostly is an urban legend.
Testing
In order to compare the performance of the two approaches, I performed a test based on the following secquence:
- Create database for testing called TestPerf. Make the database large enough to avoid a autogrow impacts on performanct.
- Create a 3 column table with a Primary Key Clustered index.
- Insert one million rows and note the timing. (I ran this process twice and captured both timings.
- Perform a clean up and reboot getting the server back to a pre-test state.
- Create a 3 column table with no primary key or clustered index.
- Insert one million rows and note the timing. (I ran this process twice and captured both timings.
The SQL script for this test is inlcluded at the bottom of this article.
Here were the results:
|
Test
|
Query Duration Minutes:seconds
|
|
Insert to Clustered Index
|
Result 1: 7:32
Result 2: 7:05
|
|
Insert to Heap
|
Result 1: 7:12
Result 2: 7:09
|
The performance levels of the two approaches were nearly identical.
Conclusion
The "Urban Legend" quotation of Tibor Karszi above, most accurately reflects my sentiment following this effort. With a clear understanding of the structure of clustered indexes, a review of the literature, and after my personal testing, I see no clear benefit for dropping clustered indexes in order to improve load performance.
Your comments or opinions are very much appreciated.
SQL Code used in testing
-- SQL that will create heap and clustered index/PK table for comparison of
-- INSERT performance. Don Frederiksen
-- Clustered Index Test
Use PerfTest
CREATE TABLE testcluster(
[Id] [int] IDENTITY(1,1) NOT NULL,
[Code] [varchar](50) NOT NULL,
[Name] [varchar](50) NOT NULL,
CONSTRAINT [PK_testcluster] PRIMARY KEY CLUSTERED
(
[Id] ASC
))
DECLARE @max int
SET @max = 1000000
WHILE @max > 0
BEGIN
insert into Testcluster values(cast (@max as varchar), 'test' + cast (@max as varchar))
SET @max = @max - 1
END
-- Heap Test
Use PerfTest
CREATE TABLE testheap(
[Id] [int] IDENTITY(1,1) NOT NULL,
[Code] [varchar](50) NOT NULL,
[Name] [varchar](50) NOT NULL,
)
DECLARE @max int
SET @max = 1000000
WHILE @max > 0
BEGIN
insert into testheap values(cast (@max as varchar), 'test' + cast (@max as varchar))
SET @max = @max - 1
END