Published: Wednesday, April 30, 2003
Summarizing Data with CUBE Queries
By Scott Mitchell
Introduction
One of the motivations behind using databases to hold information is so that decisions can be made
on the data. For example, the benefit of a retailer storing product sales information in a database
is that the retailer can examine reports on the data, showing interesting facts, such as what
products are the top sellers, where a particular item is selling best, the gross revenue generated from
all sales of a particular item or store, and so on.
Typically, interesting data has different dimensions. A dimension can be thought of as an attribute
with a set of possible values. In the database world, this maps to a table column. For example, an
eCommerce store might have a table tracking all sales, with a record for each item sold. One
dimension is the item that was sold; another, the store from which is was sold. Oftentimes, it is
useful to view summary data for a particular dimension, such as all items sold at a particular store,
or all of the sales for a particular store. In this article we'll examine how to use native SQL
commands to provide such summary queries.
First Things First - Examining GROUP BY
Before we can discuss what ROLLUP and CUBE are, we must first take a look
at the GROUP BY clause, which is a powerful clause of the SQL SELECT
statement that gets far less press than it deserves. Simply, the GROUP BY clause
allows for aggregate functions to be applied to particular partitions of a database table. These
partitions are specified via the column list in the GROUP BY clause.
Now that I have you thoroughly confused, let me attempt to demystify this confusion by presenting
an example. Consider that we have a table to track stats for various basketball players in the NBA.
This table, let's call it PlayerStats, might have the following structure:
PlayerStats |
Name | varchar(50), Primary Key |
Team | varchar(50) |
Position | varchar(10) |
Points | int |
Rebounds | int |
Blocks | int |
For our example, let's assume this table has the following data:
PlayerStats |
| Name | Team | Position | Points | Rebounds | Blocks |
| Kobe Bryant | Lakers | Shooting
Guard | 2,461 | 564 | 67 |
| Shaq | Lakers | Center | 1,841 | 742 | 159 |
| Robert Horry | Lakers | Power
Forward | 522 | 514 | 61 |
| Chris Webber | Kings | Power
Forward | 1,542 | 704 | 88 |
| Mike Bibby | Kings | Point Guard | 875 | 147 | 8 |
| Yao Ming | Rockets | Center | 1,104 | 675 | 147 |
Earlier I mentioned that the GROUP BY clause is useful when using aggregate functions.
Before we can examine the GROUP BY function, let's take a moment to discuss SQL
aggregate functions. An aggregate function is a function whose result is based on the results in a
number of rows. Common SQL aggregate functions include COUNT, MIN,
MAX, AVG, SUM, and so on. For example, we can determine the
total number of points scored by players in the PlayerStats table by executing
the following SQL query:
SELECT SUM(Points)
FROM PlayerStats
|
This query would return a single row with a single column, the value in that column being 8,345 - the
sum of all of the points of all of the players. However, what if we wanted to view the points by
position, or by team? Note that for this particular problem, we don't want to apply to the SUM aggregate
function to all of the records in the PlayerStats table; rather, we want this function
to be applied to various partitions of the table. So, if we want the total points by team, we want
the function applied to the partition of the players who play for the Lakers; we also want the function
applied to the partition of players who play for the Kings; furthermore, we want the function applied to
the partition of players who play for the Rockets.
To construct such a query we use the GROUP BY clause. The GROUP BY clause
specifies what table column(s) should be used to partition the records in the table. For example,
to partition the table by the Team column, we'd use the following SQL query:
SELECT Team, SUM(Points)
FROM PlayerStats
GROUP BY Team
|
The results of this query would be three rows, one for each unique team. Specifically, the results
would be:
Team | |
| Lakers | 4,824 |
| Kings | 2,417 |
| Rockets | 1,104 |
For more information on the GROUP BY clause, refer to the previous article,
Using the GROUP BY Clause.
Grouping on Multiple Columns
The GROUP BY clause can be used on multiple columns in a table. For example, imagine
that we wanted to partition our table into subsets by team and position. That is, instead
of wanting to see the sum of all points for each team, we want to see the sum of points for the
Laker point guards, shooting guards, small forwards, power forwards, and centers. Similarly, we want
to see the sum of points for the Kings point guards, shooting guards, and so on. And similarly for
all teams in the table. To accomplish this, we could use the following SQL statement:
SELECT Team, [Position], SUM(Points)
FROM PlayerStats
GROUP BY Team, [Position]
|
The reason Position has to be in brackets is because Position is a reserved word in
SQL Server 2000...
The results returned by this query for our given data set would not be terribly interesting (since
there are no records in the table where there are more than one player on the same team that play the
same position). The results of this query on our current data set would be:
Team | Position | |
| Lakers | Center | 1,841 |
| Rockets | Center | 1,104 |
| Kings | Point Guard | 875 |
| Kings | Power Forward | 1542 |
| Lakers | Power Forward | 522 |
| Lakers | Shooting Guard | 2,461 |
(If an additional record were added to the PlayerStats table, say, Bobby Jackson, Kings,
Point Guard, 150 points, the value for the Kings, Point Guard result would be 1,025 (the sum of all points
by Kings point guards).)
Summing Up the Points for Team and Position
Grouping on multiple columns is nice, but commonly for reports not only are we interested in the points per
team per position, but also in the total number of points per team and the total number of points
per position. That is, we would want the following results:
| | Point Guard | Shooting Guard |
Power Forward | Center | SUM |
| Lakers | | 2,461 | 522 | 1,841 | 4,824 |
| Kings | 875 | | 1,542 | | 2,417 |
| Rockets | | | | 1,104 | 1,104 |
| SUM | 875 | 2,461 |
2,064 | 2,945 | 8,345 |
Note that the bottom-most row and right-most column are "SUM" columns, summing up their respective rows and
columns. The bottom right hand corner is a "sum of sums". To put it another way, each of the
bottom-most row's column shows the sum of points for each position. The right-most column shows the sum of points
for each team. The bottom right hand corner shows the sum of points across all teams and positions.
And, of course, in the table one can quickly find the sum of points for the players of a particular
position on a particular team.
This data structure is, in the database world, referred to as a cube. A cube is a structure that
provides aggregate functions to multiple dimensions. The output above is an example of a two-dimensional
cube. However, cubes of any dimension can be created. (In fact, you can have one-dimension cubes.)
Specifically, a data cube presents summary data for each of the dimensions used to partition a table.
Imagine for a moment that we were asked by
our boss to create a Web report that displayed the NBA player statistics as shown above. How would we accomplish
this? One way would to be to use a SQL statement with a GROUP BY clause on both the
Team and Position columns. Then, in our ASP or ASP.NET Web page, we'd create
two arrays to hold the sum of points for each "dimension" (points by team and points by player). Finally,
to compute the bottom right hand cell, we'd need to just sum up the values in one of the two arrays.
While this approach is plausible, there's an even easier way to accomplish the same thing using
one SQL statement. We'll look at this in
Part 2 of this article.
Read Part 2