Summarizing Data with
By Scott Mitchell
One of the motivations behind using databases to hold information is so that decisions can be made on the data. For example, the benefit of a retailer storing product sales information in a database is that the retailer can examine reports on the data, showing interesting facts, such as what products are the top sellers, where a particular item is selling best, the gross revenue generated from all sales of a particular item or store, and so on.
Typically, interesting data has different dimensions. A dimension can be thought of as an attribute with a set of possible values. In the database world, this maps to a table column. For example, an eCommerce store might have a table tracking all sales, with a record for each item sold. One dimension is the item that was sold; another, the store from which is was sold. Oftentimes, it is useful to view summary data for a particular dimension, such as all items sold at a particular store, or all of the sales for a particular store. In this article we'll examine how to use native SQL commands to provide such summary queries.
First Things First - Examining
Before we can discuss what
CUBE are, we must first take a look
GROUP BY clause, which is a powerful clause of the SQL
statement that gets far less press than it deserves. Simply, the
GROUP BY clause
allows for aggregate functions to be applied to particular partitions of a database table. These
partitions are specified via the column list in the
GROUP BY clause.
Now that I have you thoroughly confused, let me attempt to demystify this confusion by presenting
an example. Consider that we have a table to track stats for various basketball players in the NBA.
This table, let's call it
PlayerStats, might have the following structure:
For our example, let's assume this table has the following data:
|Kobe Bryant||Lakers||Shooting Guard||2,461||564||67|
|Robert Horry||Lakers||Power Forward||522||514||61|
|Chris Webber||Kings||Power Forward||1,542||704||88|
|Mike Bibby||Kings||Point Guard||875||147||8|
Earlier I mentioned that the
GROUP BY clause is useful when using aggregate functions.
Before we can examine the
GROUP BY function, let's take a moment to discuss SQL
aggregate functions. An aggregate function is a function whose result is based on the results in a
number of rows. Common SQL aggregate functions include
SUM, and so on. For example, we can determine the
total number of points scored by players in the
PlayerStats table by executing
the following SQL query:
This query would return a single row with a single column, the value in that column being 8,345 - the
sum of all of the points of all of the players. However, what if we wanted to view the points by
position, or by team? Note that for this particular problem, we don't want to apply to the SUM aggregate
function to all of the records in the
PlayerStats table; rather, we want this function
to be applied to various partitions of the table. So, if we want the total points by team, we want
the function applied to the partition of the players who play for the Lakers; we also want the function
applied to the partition of players who play for the Kings; furthermore, we want the function applied to
the partition of players who play for the Rockets.
To construct such a query we use the
GROUP BY clause. The
GROUP BY clause
specifies what table column(s) should be used to partition the records in the table. For example,
to partition the table by the
Team column, we'd use the following SQL query:
The results of this query would be three rows, one for each unique team. Specifically, the results would be:
For more information on the
GROUP BY clause, refer to the previous article,
GROUP BY Clause.
Grouping on Multiple Columns
GROUP BY clause can be used on multiple columns in a table. For example, imagine
that we wanted to partition our table into subsets by team and position. That is, instead
of wanting to see the sum of all points for each team, we want to see the sum of points for the
Laker point guards, shooting guards, small forwards, power forwards, and centers. Similarly, we want
to see the sum of points for the Kings point guards, shooting guards, and so on. And similarly for
all teams in the table. To accomplish this, we could use the following SQL statement:
Position has to be in brackets is because Position is a reserved word in
SQL Server 2000...
The results returned by this query for our given data set would not be terribly interesting (since there are no records in the table where there are more than one player on the same team that play the same position). The results of this query on our current data set would be:
(If an additional record were added to the
PlayerStats table, say, Bobby Jackson, Kings,
Point Guard, 150 points, the value for the Kings, Point Guard result would be 1,025 (the sum of all points
by Kings point guards).)
Summing Up the Points for Team and Position
Grouping on multiple columns is nice, but commonly for reports not only are we interested in the points per team per position, but also in the total number of points per team and the total number of points per position. That is, we would want the following results:
|Point Guard||Shooting Guard||Power Forward||Center||SUM|
Note that the bottom-most row and right-most column are "SUM" columns, summing up their respective rows and columns. The bottom right hand corner is a "sum of sums". To put it another way, each of the bottom-most row's column shows the sum of points for each position. The right-most column shows the sum of points for each team. The bottom right hand corner shows the sum of points across all teams and positions. And, of course, in the table one can quickly find the sum of points for the players of a particular position on a particular team.
This data structure is, in the database world, referred to as a cube. A cube is a structure that provides aggregate functions to multiple dimensions. The output above is an example of a two-dimensional cube. However, cubes of any dimension can be created. (In fact, you can have one-dimension cubes.) Specifically, a data cube presents summary data for each of the dimensions used to partition a table.
Imagine for a moment that we were asked by
our boss to create a Web report that displayed the NBA player statistics as shown above. How would we accomplish
this? One way would to be to use a SQL statement with a
GROUP BY clause on both the
Position columns. Then, in our ASP or ASP.NET Web page, we'd create
two arrays to hold the sum of points for each "dimension" (points by team and points by player). Finally,
to compute the bottom right hand cell, we'd need to just sum up the values in one of the two arrays.
While this approach is plausible, there's an even easier way to accomplish the same thing using one SQL statement. We'll look at this in Part 2 of this article.