Summarizing Data with CUBE
Queries
By Scott Mitchell
Introduction
One of the motivations behind using databases to hold information is so that decisions can be made
on the data. For example, the benefit of a retailer storing product sales information in a database
is that the retailer can examine reports on the data, showing interesting facts, such as what
products are the top sellers, where a particular item is selling best, the gross revenue generated from
all sales of a particular item or store, and so on.
Typically, interesting data has different dimensions. A dimension can be thought of as an attribute with a set of possible values. In the database world, this maps to a table column. For example, an eCommerce store might have a table tracking all sales, with a record for each item sold. One dimension is the item that was sold; another, the store from which is was sold. Oftentimes, it is useful to view summary data for a particular dimension, such as all items sold at a particular store, or all of the sales for a particular store. In this article we'll examine how to use native SQL commands to provide such summary queries.
First Things First  Examining GROUP BY
Before we can discuss what ROLLUP
and CUBE
are, we must first take a look
at the GROUP BY
clause, which is a powerful clause of the SQL SELECT
statement that gets far less press than it deserves. Simply, the GROUP BY
clause
allows for aggregate functions to be applied to particular partitions of a database table. These
partitions are specified via the column list in the GROUP BY
clause.
Now that I have you thoroughly confused, let me attempt to demystify this confusion by presenting
an example. Consider that we have a table to track stats for various basketball players in the NBA.
This table, let's call it PlayerStats
, might have the following structure:
PlayerStats  

Name  varchar(50) , Primary Key 
Team  varchar(50) 
Position  varchar(10) 
Points  int 
Rebounds  int 
Blocks  int 
For our example, let's assume this table has the following data:
PlayerStats  

Name  Team  Position  Points  Rebounds  Blocks 
Kobe Bryant  Lakers  Shooting Guard  2,461  564  67 
Shaq  Lakers  Center  1,841  742  159 
Robert Horry  Lakers  Power Forward  522  514  61 
Chris Webber  Kings  Power Forward  1,542  704  88 
Mike Bibby  Kings  Point Guard  875  147  8 
Yao Ming  Rockets  Center  1,104  675  147 
Earlier I mentioned that the GROUP BY
clause is useful when using aggregate functions.
Before we can examine the GROUP BY
function, let's take a moment to discuss SQL
aggregate functions. An aggregate function is a function whose result is based on the results in a
number of rows. Common SQL aggregate functions include COUNT
, MIN
,
MAX
, AVG
, SUM
, and so on. For example, we can determine the
total number of points scored by players in the PlayerStats
table by executing
the following SQL query:
SELECT SUM(Points)

This query would return a single row with a single column, the value in that column being 8,345  the
sum of all of the points of all of the players. However, what if we wanted to view the points by
position, or by team? Note that for this particular problem, we don't want to apply to the SUM aggregate
function to all of the records in the PlayerStats
table; rather, we want this function
to be applied to various partitions of the table. So, if we want the total points by team, we want
the function applied to the partition of the players who play for the Lakers; we also want the function
applied to the partition of players who play for the Kings; furthermore, we want the function applied to
the partition of players who play for the Rockets.
To construct such a query we use the GROUP BY
clause. The GROUP BY
clause
specifies what table column(s) should be used to partition the records in the table. For example,
to partition the table by the Team
column, we'd use the following SQL query:
SELECT Team, SUM(Points)

The results of this query would be three rows, one for each unique team. Specifically, the results would be:
Team  

Lakers  4,824 
Kings  2,417 
Rockets  1,104 
For more information on the GROUP BY
clause, refer to the previous article,
Using the GROUP BY
Clause.
Grouping on Multiple Columns
The GROUP BY
clause can be used on multiple columns in a table. For example, imagine
that we wanted to partition our table into subsets by team and position. That is, instead
of wanting to see the sum of all points for each team, we want to see the sum of points for the
Laker point guards, shooting guards, small forwards, power forwards, and centers. Similarly, we want
to see the sum of points for the Kings point guards, shooting guards, and so on. And similarly for
all teams in the table. To accomplish this, we could use the following SQL statement:
SELECT Team, [Position], SUM(Points)

The reason Position
has to be in brackets is because Position is a reserved word in
SQL Server 2000...
The results returned by this query for our given data set would not be terribly interesting (since there are no records in the table where there are more than one player on the same team that play the same position). The results of this query on our current data set would be:
Team  Position  

Lakers  Center  1,841 
Rockets  Center  1,104 
Kings  Point Guard  875 
Kings  Power Forward  1542 
Lakers  Power Forward  522 
Lakers  Shooting Guard  2,461 
(If an additional record were added to the PlayerStats
table, say, Bobby Jackson, Kings,
Point Guard, 150 points, the value for the Kings, Point Guard result would be 1,025 (the sum of all points
by Kings point guards).)
Summing Up the Points for Team and Position
Grouping on multiple columns is nice, but commonly for reports not only are we interested in the points per
team per position, but also in the total number of points per team and the total number of points
per position. That is, we would want the following results:
Point Guard  Shooting Guard  Power Forward  Center  SUM  
Lakers  2,461  522  1,841  4,824  
Kings  875  1,542  2,417  
Rockets  1,104  1,104  
SUM  875  2,461  2,064  2,945  8,345 
Note that the bottommost row and rightmost column are "SUM" columns, summing up their respective rows and columns. The bottom right hand corner is a "sum of sums". To put it another way, each of the bottommost row's column shows the sum of points for each position. The rightmost column shows the sum of points for each team. The bottom right hand corner shows the sum of points across all teams and positions. And, of course, in the table one can quickly find the sum of points for the players of a particular position on a particular team.
This data structure is, in the database world, referred to as a cube. A cube is a structure that provides aggregate functions to multiple dimensions. The output above is an example of a twodimensional cube. However, cubes of any dimension can be created. (In fact, you can have onedimension cubes.) Specifically, a data cube presents summary data for each of the dimensions used to partition a table.
Imagine for a moment that we were asked by
our boss to create a Web report that displayed the NBA player statistics as shown above. How would we accomplish
this? One way would to be to use a SQL statement with a GROUP BY
clause on both the
Team
and Position
columns. Then, in our ASP or ASP.NET Web page, we'd create
two arrays to hold the sum of points for each "dimension" (points by team and points by player). Finally,
to compute the bottom right hand cell, we'd need to just sum up the values in one of the two arrays.
While this approach is plausible, there's an even easier way to accomplish the same thing using one SQL statement. We'll look at this in Part 2 of this article.