An Extensive Examination of LINQ: The Ins and Outs of Query OperatorsBy Scott Mitchell
As discussed in An Introduction to LINQ, LINQ is composed of three main components:
- Language Extensions,
- Standard Query Operators, and
- LINQ Providers.
LINQ's standard query operators are a collection of query operators. Query operators are methods that walk through a sequence of data and perform some task based
on that data, and are implemented as extension methods on the
IEnumerable<T> interface. We've already seen a handful of standard query operators in use in
previous installments, such as
Average. This article explores the inner workings of query operators, which is essential
to understanding how LINQ's standard query operators work. Read on to learn more!
Iterating Over Sequences of Data
Collections are a common construct in all programming languages. The simplest type of a collection is an array, which contain a fixed-size, homogeneous collection of data that is indexed ordinally. The .NET Framework's
System.Collections.Genericnamespaces house a number of more interesting collection types, including: lists, stacks, queues, dictionaries, and so forth. Moreover, it is possible to create strongly-typed collection classes, which are classes that serve as a collection of data of a specific type. For instance, the ADO.NET libraries include a
DataRowCollectionclass that maintains a collection of
Developers working with collections oftentimes need to enumerate the contents of the collection and work with element one at a time. Such functionality is performed by
foreach loop, like so:
The above code defines an array,
fib, with the first seven Fibonacci numbers. The array's elements
are then enumerated by the
and each element's value is added to a running total,
fibSum. What's important to understand is that
foreach loops use an enumerator
behind the scenes. At runtime, the above code actually gets rewritten to code similar to the following:
Note: Technically speaking,
foreach loops get converted into a
for loop for arrays. But for other collection types, a foreach loop uses an enumerator
like the code above suggests.
You can think of the enumerator (
fibEnumerator) as a bookmark. It remembers the place in the collection the caller last accessed an item. It starts
just before the beginning of the collection and is advanced via a call to
MoveNext. The enumerator's
Current property returns the element in the
collection at the current position. The enumerator used to walk through the elements of the
fib array is retrieved via a call to the array's
GetEnumerator method. All collection types defined in the .NET Framework include a
The .NET Framework's
IEnumerable<T> interface is useful for indicating that a
particular class elements of type
T that can be enumerated. Any class that implements
IEnumerable<T> must provide a
GetEnumerator method. As you might expect, all collection classes in the .NET Framework implement
IEnumerable<T> and therefore can be
enumerated via a
Creating An Enumerator For A Collection Class
As we just discussed, the collection classes in the .NET Framework all provide a mechanism by which an enumerator can be retrieved to iterate through the elements of the collection. The enumerator is created as a separate class that implements the
IEnumerator<T>interface. Let's look at how to create a collection class that can be enumerated. Specifically, let's create a class named
Fibonaccithat holds the first N Fibonacci numbers, where N is defined by the developer using the
Fibonacciclass. The following C# code shows the germane portions of this class. (For a look at this class in Visual Basic, download the demo available at the end of this article.)
Fibonacci class starts by creating a
List of type integer (
data); this list will hold the first N elements of the Fibonacci
Fibonacci class's constructor accepts an integer value as input. This integer value defines the upper bounds of the Fibonacci numbers (N)
and is assigned to the
Capacity property. Next, the
data member variable is populated with the desired number of
Fibonacci numbers. The
Fibonacci class also contains a read-only indexer property.
A developer could use the above
Fibonacci class in the following manner:
A developer might need to enumerate the numbers in the
fib object. But currently the
Fibonacci class does not contain any mechanism for enumeration.
We can add such functionality by having the
Fibonacci class implement
IEnumerable<T> and by providing a
Fibonacci class contains a series of integers it implements
GetEnumerator method must
IEnumerator<int> object. We need to create such an enumerator object and have it returned. This is typically accomplished by creating a
private class within the collection class that implements
IEnumerator<T> and then returning an instance of that class from the
The following code shows the germane parts of the enumerator class and the updated
GetEnumerator method code:
The above code defines a new class,
FibonacciEnumerator, which implements
defines a number of methods and properties, the two germane ones being
MoveNext. Recall that the enumerator object serves as a bookmark.
Therefore, it must have some mechanism to keep track of where in the object the calling code is currently positioned. This information is stored by the
member variable, which is initialized to -1. The
MoveNext method increments
index and returns True if index is still within the bounds of the
collection and False otherwise. The
Current property returns the element at the current position. Additionally, the
GetEnumerator method has been
updated to return a new instance of the
With the above code in place we can now write code that uses the
Fibonacci class and enumerates its elements.
Using C# Iterators
The syntax needed to create an enumerable class is a bit overwhelming and involves: having the class implement
IEnumerable<T>; creating the
GetEnumeratormethod; and designing a private class that implements
IEnumerator<T>. In order to allow developers to enumerate the elements of the
Fibonacciclass we had to add roughly 50 lines of code.
Fortunately, this code bloat was addressed by Microsoft through iterators. Iterators provide for a much more terse syntax for defining how a collection's elements
are enumerated. An iterator is a method or property that loops through the elements within the collection class and using the
yield return keyword. With
iterators we don't need to create an explicit enumerator object (such as
FibonacciEnumerator). Instead, we can simply loop through the collection's elements
and return each element using
Let's rewrite the
GetEnumerator method as an iterator.
When a developer uses a
foreach loop on a
Fibonacci instance, the
GetEnumerator method is called.
Rather than returning a new enumerator object explicitly, the
GetEnumerator method instead loops through the list of Fibonacci numbers. But at each iteration
it invokes the
yield return statement, which returns control to the calling code (the
foreach loop), passing back the current element
number). So the
foreach loop is handed back the first element in the
Fibonacci object. When the
foreach loop starts the next iteration,
control returns to the
GetEnumerator method and it picks up where it left off, moving to the second element in the
data list. The
statement again runs and returns control to the
foreach loop, passing back the second Fibonacci number. The
yield return keyword implements
enumerator-like behavior but without all of the code bloat from our earlier example.
While iterators were introduced to C# 2.0 (the version of C# that shipped with Visual Studio 2005 and the .NET Framework 2.0), they are not available in Visual Basic 9. Instead, Visual Basic developers still need to use the longhand enumerator syntax.
Query Operators: Defining Operations on Enumerations
At this point we have a class that models the first N Fibonacci numbers, which can be enumerated. We might also want to add functionality to return the sum of the Fibonacci numbers. This could be done by adding a property named
Fibonacciclass that enumerated the elements in the
datalist, totalling up their values, and returning that sum. While adding a
Summationproperty would certainly do the trick, it's not useful outside of the
Fibonacciclass. Wouldn't it be grand if we had a mechanism by which we could compute the sum of any collection of integers?
Query operators provide a more general mechanism for performing some query on any collection of elements and are implemented as extension methods on the
IEnumerable<T> interface. Rather than create a
Summation property on the
Fibonacci class, let's instead create a query operator
Summation that sums the elements of any collection of integers.
The above code creates an extension method named
Summation that operates on an object of type
IEnumerable<int>. Specifically, the extension
method: creates a variable named
total; enumerates the data, adding the value of each element in the enumeration to
total; and then returns the
total). With this extension method defined we can now call it from any collection that implements
IEnumerable<int>, such as
instances of the
Fibonacci class. For example, the following code computes the summation of the first seven Fibonacci numbers:
Summation query operator we just created is an example of an aggregate query operator - it collapses a collection into a scalar value. Other types of query
operators return an enumeration. For example, you could have a filtering query operator that is applied to a collection and returns a collection, namely those elements
in the source collection that meet the filtering criteria. There are also partitioning query operators that return a consecutive subset of elements in the source collection
based on some criteria.
|LINQ's Standard Query Operators|
The .NET Framework 3.5 ships with a number of standard query operators, which make up the backbone of LINQ. We've seen a couple of standard query operators in previous
installments, including: |
To demonstrate these alternate types of query operators, let's create a query operator that returns the elements in an
IEnumerable<int> collection only
after the running total exceeds a developer-specified value. In other words, a developer could use this extension method to partition the collection into those elements
whose previous elements' summations exceeds some value. This method, which I've called
SkipUntilSummationSurpassed, is defined below. (Note: I did not create a
VB version of this extension method because it would entail creating an enumerator.)
Note that the
SkipUntilSummationSurpassed query operator returns an
IEnumerable<int> collection. Specifically, it loops through the
IEnumerable<int> collection and computes the running total. Once this total exceeds the specified
sum value, the
elements in the collection are returned via the
yield return keyword.
SkipUntilSummationSurpassed query operator can be used to return those elements in a collection of integers whose previous elements running total exceeds some
defined sum. For instance, to work with the Fibonacci numbers whose earlier elements exceed the sum 6, we could do:
The above code starts by creating a new
Fibonacci object that holds the first seven Fibonacci numbers. It then creates a variable named
that returns the Fibonacci numbers whose preceding elements' running total is greater than or equal to 6. If you compute the running total, you'll see that 1 + 1 + 2 = 4 and
1 + 1 + 2 + 3 = 7, so it's not until after the fourth element that the running total reaches or exceeds 6. Therefore, the elements outputted by the
loop would be 5, 8, and 13.
Chaining Query Operators
Query operators that return collections can be chained with other query operators. For example, we could chain the
Summationquery operators we just created to determine the sum of those Fibonacci numbers whose preceding elements' sums reach or exceed 6 by using the following syntax:
After the above code executes, the value of sum will be 26 - the sum of 5, 8, and 13.
We could take the above code a step further and use the
Where method to only compute the sum of the odd numbers whose preceding
elements' sums are greater than or equal to 6. (We first looked at the
Where standard query operator in
An Introduction to LINQ.) The following code says: "I want to skip elements until the sum surpasses
6, and then take those elements and just work with the odd numbers, and then compute that sum."
When looking at the above code your first inclination may be that when this code is executed the
SkipUntilSummationSurpassed(6) call will start by
getting those elements whose preceding elements' sums are greater than or equal to six, namely that it will return a collection of elements that contain 5, 8, and 13.
And that the
Where method is then passed this collection of 5, 8, and 13, which it whittles down to 5 and 13, which is then passed to the
But such an inclination would be incorrect.
Recall that enumerators work with one element at a time. Because the query operators walk through the records one at a time they pass one record at a time through the pipeline.
What's really happening is that the
SkipUntilSummationSurpassed(6) starts by looping through the elements in
Fibonacci until it reaches that first
element whose preceding elements' sums reach or exceed 6, namely the number 5. This element (5) is then passed to the
Where method, which says, "Ah, yes, this is
odd," and passes it along to the
Summation method. The
SkipUntilSummationSurpassed(6) then gets the next element from the
8, and passes it to the
Where method. The
Where method says, "Whoops, this is even," and the iteration stops there. Finally,
returns the final element in the
Fibonacci object, 13. Because this is odd, it passes through the
Where method and to
Summation, where is is summed with 5 to
return a final value of 18 (namely, 5 + 13).
This explanation is, actually, a bit backwards, but it is a helpful mental model to understand that the elements in a chain of query operators are processed one at a time.
What's really happening is that the order of execution starts from the outside, in. So the
Summation method is invoked first, and it loops through the
source, which is the
Where method. The
Where method loops through it's source, which is the call to
SkipUntilSummationSurpassed(6) method call loops through its source, which is the
Fibonacci object. The
call returns 5 to
Where via a
yield return. The
Where method evaluates 5 against the function supplied to it; because 5 is odd it
is returned via a
yield return to
Summation method then says, "Give me the next value," to its source.
This prompts the
Where method to pick back up in the midst of its loop, which causes the
SkipUntilSummationSurpassed(6) method to continue its
loop where it left off.
SkipUntilSummationSurpassed(6) returns the next element to the
Where method (8), but because this is not odd the
method keeps on iterating without a
yield return. This prompts
SkipUntilSummationSurpassed(6) to return the next value, 13, which is odd and therefore
is returned by the
Where method to the
Summation method. At that point the
SkipUntilSummationSurpassed(6) method has iterated entirely
through its source and so the operation is complete, with the return value being the sum of 5 and 13.
Query Operators and Deferred Execution
In the above explanation of how query operators can be chained together, I noted that the outer-most query operator is evaluated first. In the case of our previous example, the outermost method was
Summation, which does not contain a yield return, but rather has a regular
return, which returns the sum of the source collection's elements. The
Summationmethod is an example of a greedy query operator. A greedy query operator exhaustively enumerates its source in order to return a value.
Where operators are lazy query operators. Because they are iterators, they do not return anything until
they are iterated over. In fact, they do not return anything until they are iterated over. When you see a code statement like:
It is only natural to think that
oddFibs "contains" the odd numbers of the first 10 Fibonacci numbers. However,
oddFibs doesn't hold anything. Don't think of
oddFibs as a collection of data, but rather as a query that operates on data. The
oddFibs query is not evaluated until it is used by a greedy
query operator or until it is enumerated in a
To demonstrate this concept, let's augment the
Fibonacci class to include a new method called
Grow that, when called, doubles the size of the
Capacity. In other words, if you have a
Fibonacci object named
fib that contains the first 10 Fibonacci numbers, calling
fib.Grow() will have
fib double in size, containing the first 20 Fibonacci numbers.
What do you expect will happen if we have the following code?
The above code outputs the odd numbers found in the first twenty Fibonacci numbers: 1, 1, 3, 5, 13, 21, 55, 89, 233, 377, 987, 1597, 4181, and 6765.
oddFibs is defined after the
fib has been created with the first 10 Fibonacci numbers, so it's natural to think that it
"contains" the odd numbers in the first ten Fibonacci numbers. But
oddFibs does not actually hold any values; rather, it is a query definition that operates over
fib when the
foreach statement is reached. Before the
foreach statement is reached, the
Grow method is called and
is updated to include the first 20 Fibonacci numbers. Therefore, when
oddFibs is enumerated it queries the first 20 Fibonacci numbers rather than the first 10.
You can force immediate execution of a query operator by using the
ToList standard query operator, which is a greedy query operator that immediately executes
a query operator and returns the value. That is, had we tacked on a call to
ToList() to the end of the
oddFibs declaration, as in:
Where query operator would be evaluated immediately and
oddFibs would contain the odd numbers in the first 10 Fibonacci numbers:
1, 1, 3, 5, 13, 21, and 55.
Query operators are extension methods on the
IEnumerable<T>interface that can be used to perform processing on a collection. The .NET Framework 3.5 ships with a number of built-in query operators, which are referred to as the standard query operators and will be examined in detail in a future installment.
Query operators can be chained together to form a pipeline. During execution of a chain of query operators, lazy query operators return one record at a time through the
pipeline. Greedy query operators act like a bottleneck and require that data be computed in total before returning data to subsequent links in the chain. Lazy query operators
also exhibit delayed execution. When using lazy query operators it is important to keep in mind that the query operator isn't "invoked" until a greedy query operator is applied
to it or until the query operator is enumerated via a
IEnumerable<T>Interface (technical docs)