Efficiently Paging Through Large Result Sets in SQL Server 2000
By Scott Mitchell
| A Major Improvement in Performance |
|---|
| A couple of alert readers have written in to provide some tips on improving the performance of this approach. Greg Hamilton has been kind enough to write up an article exploring various improvements that can speed up my approach by more than an order of magnitude. In short, don't use the stored procedure discussed in this article, but rather check out Greg's advice in A More Efficient Method for Paging Through Large Result Sets. |
Introduction
When displaying large amounts of data to a user, it's important that the information be presented in digestible chunks rather
than slamming it all down their throats at once. For example, searching for "Microsoft" on Google returns approximately 2,190,000,000
results, but thankfully Google only shows me ten of those at a time. When providing paging access to particularly large
result sets - tables with tens of thousands, hundreds of thousands, or millions of records - care must be taken in querying
the data such that only the particular page of data requested is returned.
Both the DataGrid in ASP.NET 1.x and GridView in ASP.NET 2.0 providing two paging flavors:
- Default paging - easy to implement but naively retrieves all records from the database and then trims the results to show only those for the requested page
- Custom paging - requires the developer to correctly retrieve only the precise subset of records to show for the current page; requires more effort to implement
ROW_NUMBER() keyword. This new keyword simplifies
efficiently retrieving a particular subset of data ordered by row number. After authoring these articles, I received many requests
from readers for a look at how to implement this type of efficient custom paging using SQL Server 2000 (which lacks the
ROW_NUMBER() keyword).
A previous article on 4Guys, Paging through Records using a Stored Procedure by Daniel Anderson, provides one approach that far outperforms the default paging implementation. However, it has a few areas that can be updated to improve the performance. This article looks at updating Daniel's stored procedure to provide an even-more efficient approach. The stored procedure presented at the end of this article can be used for classic ASP applications, custom paging with the DataGrid in ASP.NET 1.x, or used by the ObjectDataSource to provide custom paging for the GridView in ASP.NET 2.0 applications. Read on to learn more!
Examining Daniel's Approach
In Paging through Records using a Stored Procedure, Daniel Anderson provided a stored procedure named sp_PagedResults
that accepted two integer input parameters - @Page and @RecsPerPage - and returned those records from
an underlying database table that belonged to the specified page of date. For example, if you wanted to show the second page of
data, where a "page" is 10 records, you'd simply call sp_PagedRecords passing in 2 and 10 as values for
@Page and @RecsPerPage.
Grabbing a page of data from a table that has a sequentially increasing unique ID without any "holes" is a cinch and can be
done using a simple WHERE clause. For example, if we have data in a table that looks like:
Employees | |||||
|---|---|---|---|---|---|
EmployeeID | LastName |
FirstName | DepartmentID |
Salary | HireDate |
| 1 | Mitchell | Scott | ... | ... | ... |
| 2 | Lee | Sam | ... | ... | ... |
| 3 | Smith | Hans | ... | ... | ... |
| 4 | Johnson | Ernie | ... | ... | ... |
| 5 | Gonzalez | Laura | ... | ... | ... |
| 6 | Wang | Hugo | ... | ... | ... |
| 7 | Jackson | Tito | ... | ... | ... |
| 8 | Maher | Todd | ... | ... | ... |
| ... | ... | ... | ... | ... | ... |
Since there are no gaps or holes in the EmployeeID values, we can easily get a particular page of data using
the generic query:
SELECT ...
|
This assumes that page is indexed starting at zero, meaning that page would be 0 for the first page of data, 1 for the second, and so on. Both the DataGrid and GridView start the page index at zero.
For example, if we wanted to show three employees per page, and wanted the first page of data, we'd do:
SELECT ...
|
Which would return the first three employees. Requesting to see the second page, we'd have:
SELECT ...
|
Which would return employees with EmployeeIDs 4, 5, and 6 - the second page of data.
This approach works great if there are no holes in primary key column values of the data being paged through. But holes can
be created when records are deleted. For example, if Sam Lee and Hans Smith were fired, our Employees table would
now look like:
Employees | |||||
|---|---|---|---|---|---|
EmployeeID | LastName |
FirstName | DepartmentID |
Salary | HireDate |
| 1 | Mitchell | Scott | ... | ... | ... |
| 4 | Johnson | Ernie | ... | ... | ... |
| 5 | Gonzalez | Laura | ... | ... | ... |
| 6 | Wang | Hugo | ... | ... | ... |
| 7 | Jackson | Tito | ... | ... | ... |
| 8 | Maher | Todd | ... | ... | ... |
| ... | ... | ... | ... | ... | ... |
We now have a hole from EmployeeID 1 to 4 and the simple algorithm examined a moment ago will fail.
To accommodate for these holes, Daniel Anderson's stored procedure creates a synthetic increasing ID. It does so by with the following algorithm:
- Create a temporary table that has a column for each
column in the data to page through along with a column named
IDthat's marked as anIDENTITYcolumn. AnIDENTITYcolumn is one whose value is auto-incremented with each record inserted. - Dump the entire contents of the table to page through into the temp table. This will set the values in
the
IDcolumn to 1, 2, 3, ... - Apply the earlier, simplistic formula to a
SELECTquery against the temp table and itsIDcolumn
At first glance it may seem like such an approach would be no better than just returning all of the records. After all, we're still
working with all of the records in Employees when dumping them into the temp table. While this is true,
Daniel's approach greatly reduces the traffic between the database server and web server. While both approaches work with
the Employees table's data in its entirety, the stored procedure only returns 10 records (or however many
records per page are configured).
Modernizing Daniel's Stored Procedure
Before we look at improving the performance of Daniel's stored procedure, let's first modernize it a bit. The stored procedure
was written back in 1999 and uses some techniques that aren't "best practices." For one, the stored procedure name is prefixed with
sp_, which causes SQL Server to first search the master database. This only impacts a slight
performance impact, but why not clean it up here while we have the chance? (See Should I
Use the sp_ Prefix for Procedure Names? for a more thorough discussion on this topic.)
Second, rather than using a temporary table, let's use a table variable. Moreover, Microsoft recommends: "use table variables whenever possible except when there is a significant volume of data and there is repeated use of the table." Since this stored procedure just dumps data into the temp table once, reads from it, and then is done with it, using a table variable would adhere to Microsoft's recommendation.
Finally, in ASP.NET 2.0 the custom paging logic used by the ObjectDataSource and GridView defines the records to access not
by page index and records per page, but by the start row index and the maximum number of rows to return. Therefore, let's
augment the stored procedure to take in the integer input parameters @startRowIndex and @maximumRows
rather than @Page and @RecsPerPage. This has the nice side-effect of simplifying the formula used
in the final query's WHERE clause.
Improving the Stored Procedure's Performance
Daniel's approach is costly in part because it dumps all of the columns from the table to be paged through into
the temp table. This requires a complete table scan on the data to be paged through. Instead, we can create the table variable
to hold only the primary key column(s) of the table to be paged through. Since these records likely exist in a clustered index,
populating the table variable likely be done entirely from memory, without needing to hit the disk.
The remaining columns in the table whose data is being paged through can be accessed and returned in the stored procedure's final query, which grabs the particular subset of records from the table variable. After modernizing and optimizing Daniel's original stored procedure we end up with:
|
| A Major Improvement in Performance |
|---|
| A couple of alert readers have written in to provide some tips on improving the performance of this approach. Greg Hamilton has been kind enough to write up an article exploring various improvements that can speed up my approach by more than an order of magnitude. In short, don't use the stored procedure discussed in this article, but rather check out Greg's advice in A More Efficient Method for Paging Through Large Result Sets. |
Comparing the Performance of the Old and New Approach
To ascertain the improvements in performance by restricting the table variable to only including the primary key column(s)
from the table to be paged through I used SQL Profiler and
compared the different queries on the Employees table with different record counts. Keep in mind that these
results are highly unscientific. I ran each of them numerous times to "warm them up" and then recorded nine
test runs and computed the average. But still, I was doing this on my personal machine while running a variety of other programs.
Additionally, I did not simulate any load that multiple, concurrent visitors to a website would incur. The results follow and
are also avaiable at the end of this article as an Excel spreadsheet. In the New stored procedure results, the percentage improvement
over the old approach is included for reference:
| Records in Employees | Avg. Reads | Avg. Duration (in milliseconds) | |
|---|---|---|---|
| Old | 50,000 | 55,360 | 425 |
| 100,000 | 110,492 | 850 | |
| 200,000 | 220,726 | 1,567 | |
| New | 50,000 | 51,637 6.7% improvement | 307 27.7% improvement |
| 100,000 | 103,028 6.7% improvement | 536 36.9% improvement | |
| 200,000 | 205,876 6.7% improvement | 1,033 34.1% improvement |
During my testing I wondered if specifying the table variable's ID column as a primary key would boost performance.
Marking the ID column as a primary key would create a clustered index, meaning that the results would be sorted
by the ID column and the index could be scanned when finding correct subset of records in the final query.
However, the index would incur additional overhead when dumping the contents of the table to be paged through into the table
variable. The spreadsheet at the end of this article shows the numbers, but in my tests marking ID column as a primary key
exhibited worse performance than not, although it was still better than the "old" approach of dumping the entire table contents
into the temp table (the "old" approach).
While this approach is, perhaps, acceptable for paging through large result sets in SQL Server 2000, as illustrated in
Custom Paging in ASP.NET 2.0 with SQL Server 2005,
SQL Server 2005's ROW_NUMBER() technique affords much better performance, taking on average a mere 3 milliseconds
on average to return 10 records from a total table of 50,000. Compare that to the approach examined in this article, which consumed
307 milliseconds - two orders of magnitude worse!
Before you swear off the approach discussed in this article, keep in mind that bringing back all the records takes considerably longer than
dumping them into a table variable. From the Custom Paging in ASP.NET 2.0 with SQL Server 2005 article, paging through
50,000 records using the GridView's default paging, which retrieves all records from the Employees table, took on
average 1,411 milliseconds. Also, since the ROW_NUMBER() option is a feature new to SQL Server 2005, if you're still
using SQL Server 2000 you're stuck with the table variable technique.
Conclusion
In this article we looked at how to retrieve a specific page of data from a table using a stored procedure. This technique is
useful for providing efficient, custom paging through large result sets with the ASP.NET 1.x DataGrid and ASP.NET 2.0 GridView.
This technique can also be used in classic ASP applications or any other data-driven application that needs to work with paged
data. If you are using SQL Server 2005, you should instead use the ROW_NUMBER() keyword as discussed in
Custom Paging in ASP.NET 2.0 with SQL Server 2005.
Happy Programming!
Attachments:
| A Major Improvement in Performance |
|---|
| A couple of alert readers have written in to provide some tips on improving the performance of this approach. Greg Hamilton has been kind enough to write up an article exploring various improvements that can speed up my approach by more than an order of magnitude. In short, don't use the stored procedure discussed in this article, but rather check out Greg's advice in A More Efficient Method for Paging Through Large Result Sets. |