An Extensive Examination of LINQ: Grouping and Joining DataBy Scott Mitchell
As discussed in An Introduction to LINQ, one of the cornerstones of LINQ is the standard query operators, which are a set of extension methods on the
IEnumerableinterface added to the .NET Framework version 3.5. The standard query operators can be applied to any enumeration - any collection of "things." In The Standard Query Operators installment we looked at some of the more common query operators, such as
OrderBy, and others. Each standard query operator can be classified as a certain type of operator. There are aggregation operators like
Max; element operators like
ElementAtlet you pick out a specific element from a sequence; and the ordering operators
OrderByDescendingorder the elements of a sequence based on a specified sorting criteria.
Another class of query operators that we've yet to explore are grouping and joining operators. The grouping and joining operators work with two (or more) sequences
and combine them together, much like how a
JOIN in SQL combines records from two (or more) tables into a single resultset. Through LINQ's standard query operators
(or via its query syntax), it is possible to perform: nested (or grouped) queries; cross joins, or
the Cartesian product of two sequences; inner joins; and left outer joins.
This article explores the grouping and joining operators available in LINQ with a number of examples in both Visual Basic and C#. As with the previous installments in this article series, the complete code is available for download. Read on to learn more!
Examining the Sample Data
The examples in this article look at how to create LINQ queries that group and join the same set of sample data. If you download the demos (available at the end of this article) you'll find a class file named
App_Code/CSCodefolder. This file defines a namespace (
FauxNorthwind) and three classes within that namespace. The first two,
Product, mimic the schema of the
Producttables in the Northwind database. The
CategoryNameproperties, while the
Productclass has the properties
The third class,
Data, is a
static class that defines the "records" in the sample data via two
static arrays. There's an array of type
Categories and an array of type
Categories array is populated with six elements, while the
Products array defines 15 products. (These arrays are populated using the collection initializer and object initializer enhancements added to C# 3.0 and discussed
in the Extension Methods, Implicitly Typed Variables, and Object Initializers installment in this article
A snapshot of the data defined in the
Products arrays follows:
When working with two parent/child-related sequences - categories and products, in this example - you may want to access all of the records in the parent sequence one at a time and, for each parent record, work with its children. Such functionality is possible in LINQ in two ways: by writing a nested query or by using LINQ's
GroupByoperator. Underneath the covers, both approaches are the same. They differ only in the syntax. As we'll see, a nested query is a bit more verbose and not as readable as using the
To understand how a nested query works, let's first look at writing a query that returns the categories. Using the query syntax added to C# 3.0 and Visual Basic 9 we could write:
Now, imagine that for each category we wanted to get back the total number of products. We could accomplish this by using a nested query. A nested query is a query that appears
Select portion of the query. Consequently, the nested query is executed once per item returned from the outer query. If that doesn't quite make sense,
perhaps this syntax will clarify the issue:
This new query uses projection to create an anonymous type with two members:
CategoryName, which contains the name of the category record; and
which indicates the total number of products associated with this particular category. Note that the nested query gets those products whose
Category property matches
Category object from the outer query. In plain English, for each category in the
FauxNorthwind.Data.Categories array a new object
is created. In creating that object the
ProductCount is saying, "Get me all of the products that have the same category as the one that is currently
being processed and then count up how many elements are in this nested query's results." The end result is that for each category in the
array a new object with the category's
CategoryName property and the total number of products associated with that category is returned.
We could add another nested query, one that returns the entire sequence of related products. To do so simply add another member to the anonymous type like so:
The demos available at the end of this article bind the grouped and joined data to a GridView to illustrate how each query operator works. The above query generates a grid with the following data:
Note that resulting data includes all of the categories regardless of whether they have any associated products. This is because the outer query gets all categories
and then gets the related products (and the number of related products) via a nested query. Consequently, each category in the
will be returned; it doesn't matter if there are any related products or not.
While nested queries certainly will get the job done, LINQ includes a more terse and readable syntax for grouping data. The
GroupBystandard query operator allows the items in query to be grouped by some property. Like many other standard query operators, the
GroupByoperator returns a sequence. Specifically, each element returned by
GroupByimplements the interface
IGrouping<TKey, TSource>. This interface has a strongly-typed key (of type
TKey) and a sequence of grouped items of type
TSource. That description can be a bit confusing. Perhaps an example will help.
Imagine that we want to get all of the products and group them by category so that we end up with a sequence of categories and for each category we have a sequence of its
corresponding products. This could be accomplished using the
GroupBy standard query operator like so:
GroupBy operator uses a lambda expression to indicate what property the source data (the
Product objects in
should be grouped by. In this case we group by the
Category property. The net result is that the
GroupBy operator returns a sequence with
five elements - one element for each of the five different Category objects found in the
FauxNorthwind.Data.Products array. Each element has a
(that returns the Category object) along with a sequence of the
Product objects that belong to that group.
The statement in the code snippet above takes these results from the
GroupBy operator and uses the
Select operator to project them into a more
usable format. Namely, the
Select operator creates an anonymous type with three properties:
CategoryName- the name of the category. Note that the category information is accessible via the
ProductCount- the total number of elements in the group's sequence. In other words, the number of products associated with this category.
Products- the sequence associated with this group. Namely, the products that belong to the category group.
Categoriesarray and then ran a nested query to get the related products. In other words, that query started by getting all of the categories and then each category's products (if any existed). However, the
GroupByexample explicitly gets all products and then groups them by category. Those categories that have no products are never retrieved.
You can also use the
GroupBy operator using the query syntax. The code snippet below shows an equivalent statement to the standard query operator usage above, but
uses the query syntax rather than extension methods.
When using the query syntax you use the
Into keyword to name the sequence of groups returned by the
GroupBy operator. In C# you can use any name
categoryProducts), but in Visual Basic you must use the keyword
Group. Furthermore, in Visual Basic you specify the name of the
property directly in the grouping syntax -
Group product By key = product.Category Into Group.
Nested queries and grouping in LINQ is useful, but differs from the manner by which different entities are typically combined in relational databases. In SQL, you can return data from two (or more) tables into a single result set using a join. There are different types of joins in SQL, including inner joins, outer joins, and cross joins. I don't want to belabor how joins work in SQL in this article, as this topic is covered in depth elsewhere. See the SQL Joins tutorial at w3schools.com, or Jeff Atwood's A Visual Explanation of SQL Joins post for more information.
LINQ includes a
Join standard query operator that allows for SQL
JOIN-like behavior and
syntax (when using LINQ's query syntax). This article shows how to perform four different types of joins in LINQ:
- Group Join,
- Inner Join,
- Left Outer Join, and
- Cross Join
A group join is similar to using the
GroupByoperator, although the syntax and semantics differ slightly. In a nutshell, a group join uses the
Joinoperator and the
Intokeyword to group the results of the join. The following query returns all categories and uses a group join to create a group of related products for each category.
Note that to use a group join in Visual Basic you must use the syntax
Group Join instead of
join, as is used in C#. In both cases, the results
are grouped via the
Into keyword. As in the
GroupBy operator example, C# can group the data into a group of any name whereas in Visual Basic you must
use the name
The results of the group join are identical to the nested query's. Namely, the result is a sequence with one element for each category, regardless of whether the category has any associated products. Each element has a "sub-sequence" that contains the products associated with that category (if any). Refer to the first screen shot in this article for a visual representation of these results.
An inner join queries two sequences. For each pair of elements examined, the pair is returned as output if the pair match on some specified criteria. Otherwise the pair is omitted from the results. Let's look at an example. The following query syntax queries two sequences -
FauxNorthwind.Data.Products- joining together the Product and Category objects with matching categories. The
Joinoperator generates a sequence whose elements have two properties:
product. This result is projected via the
Selectoperator into an anonymous type that returns the just the
CategoryNameproperty from the
categoryobject along with the entire
productobject. (An equivalent query using extension methods instead of the query syntax is available in the code download; examine the comments in the
InnerJoin.aspxpage's code-behind class.)
The above query syntax is quite similar to SQL's
JOIN syntax. Here we query "from" the
FauxNorthwind.Data.Categories "table" and join with
FauxNorthwind.Data.Products "table" where the Category objects line up. If it helps, you can envision the above LINQ query as the following SQL statement:
The LINQ query generates a sequence with 15 elements, one for each product in the "database." Each item in the resulting sequence includes the category's name and all fields
Product object. The following screen shot shows a GridView that displays the
CategoryName property in the first column and the product's
UnitPrice properties in the second column.
The result from a LINQ
Join differs than those generated by a nested query, the
GroupBy operator, and by group joins. The nested query,
GroupBy operator, and group joins return a sequence that contains a "sub-sequence." For example, when using the
GroupBy operator to group products
by category, the resulting output was a sequence that contains a
Category) and a sub-sequence of
Product objects that
belonged to that category. On the other hand, the
Join operator returns a sequence that contains the two elements being queries (each
Product object whose categories align).
Left Outer Joins
An inner join only returns items where the two input sets intersect. For example, in the above screen shot the only categories that are returned are those that have a matching product. Similarly, any products without a category would be omitted. An left outer join returns all records from the first sequence (the "left" one) regardless of whether there is a matching element in the second sequence (the "right" one).
A left outer join is implement in LINQ by using a group join on the "left" table and then using another
From operator to get the matching products, if any
exist. The two
From operators are needed to return the data in the same format as the inner join (having the category and product properties side-by-side rather
than in a sub-sequence). The following query uses a left outer join to return a record for all categories, regardless of whether there are matching products.
Note that there is both a group join and an additional
From operator that queries the resulting group join sub-sequence, which are the products that match
the category. The
DefaultIfEmpty() extension method is used here to return
null if there are no items in the sub-sequence (i.e., if there are no matching
products for the category).
The resulting output of the above query is a sequence with 16 elements. One element for each product plus one element for the category that has no products associated with it (Meat/Produce). The screen shot below shows this data when bound to a GridView.
A cross join is the Cartesian product between the two entities performing the join. In other words, the cross join outputs an element for every single possible combination of elements in the two inputs. When doing a cross join on two sequences with sizes A and B, the resulting cross join has A x B results. A cross join in LINQ is accomplished by having two
Fromoperators that specify the two sequences to combine, as the following query syntax illustrates:
The above query generates a resulting sequence with 90 elements and contains all of the category/product combinations, even if the combinations aren't valid. For instance, the resulting sequence includes the pairing of category and product Meat/Produce and Chai, even though Chai product belongs to the Beverages category and the Meat/Produce category has no associated products.
Because so many records are returned by the cross join, a screen shot would be impractical here. To see the output of the cross join download the code available at the end of
this article and run the