Techniques for Randomly Reordering an ArrayBy Scott Mitchell
While reading through some of Jeff Atwood's old blog entries, I stumbled across this gem: The Danger of Naïveté. In it, Jeff discussed the pitfalls the can befall a programmer who implements a naïve algorithm and calls it a day. Consider an algorithm to randomly reorder an array. If you have a collection of items to reorder, the naïve approach is to enumerate each element and swap its position with an element in some other randomly-selected position. However, such an algorithm (as we discuss in this article), does not have an even distribution of results - certain permutations are more likely than others when starting from a set point.
Jeff's blog entry caught my attention because of a past article I wrote here on 4Guys, Randomly Reordering an Array. That article, written back in 2000, showed various techniques for reordering an array in classic ASP using VBScript. And, wouldn't you know it, the first technique uses the exact same naïve algorithm Jeff warns about in his blog post! Oops! (I've since updated the old article.)
After learning of the problems with the naïve algorithm I decided to spend some time exploring approaches that produce valid distributions. This article shows why the naïve algorithm produces less than ideal results and includes code for two alternative solutions. Read on to learn more!
Dissecting the Naïve Algorithm
The naïve way of reordering an array (and the way I showed in my earlier article, Randomly Reordering an Array) is to enumerate through each item and swap it with another randomly selected item in the array. The following snippet of code implements the naïve algorithm, reordering an integer array named
If you run this algorithm a handful of times it appears to correctly do a random in-place reordering of the elements in
deck. But the results
aren't truly random - some permutations are over-weighted while others are under-weighted.
Take a three item array with element values 0, 1, and 2. These three elements can be combined into six permutations:
With the naïve algorithm, each item in the array is randomly switched with another item in the array (including, perhaps, itself), meaning that for an array of n elements there are nn possible outcomes. For a three element array, then, there are 33 = 3*3*3 = 27 possible outcomes. But there are only six ways a three element array can be arranged, and six doesn't divide 27. Consequently, some outcomes in the naïve algorithm must be overrepresented and others underrepresented. This can best be illustrated by decomposing the algorithm to show the actual results.
The following (very tall) figure illustrates this breakdown for a three item array that stores the values 0, 1, and 2, and that starts with the values ordered
0 1 2. The node in the far left of the diagram is the start state. Here the first position is evaluated and randomly swapped with the left, center, or right value. These three
possibilities are illustrated in the next level of nodes. This process is then repeated, but with the middle element, and then again with the third.
The final list of potential results are on the far right.
The various outcomes are not evenly represented. The arrangements
0 2 1
1 0 2
1 2 0
0 1 2
2 0 1
2 1 0
This uneven weighting can be illustrated by reordering the array thousands of times and tallying up the resulting re-orderings. The download available at
the end of this article includes a demo that does this tallying for a user-specified array size and number of re-orderings. The demo also includes an option
to view the results in Microsoft Excel, which you can use to quickly chart the distribution. The following graph shows the distribution of a three item
array randomly ordered 60,000 times. Because there are six possible outcomes, each outcome should occur roughly 10,000 times. But as the following graph
shows, the permutations
0 2 1,
1 0 2, and
1 2 0 are overrepresented.
|Does an Imperfect Random Reordering Really Matter?|
If you have implemented the naïve algorithm to display the results of some array in a random order, you may wonder whether you'll need to update your
implementation with a correct one. (We examine two correct algorithms in the remainder of this article.) Whether you need to update an incorrect
implementation depends on how important it is that the reordering is truly random. If you are building an online poker site where the naïve algorithm
is used to shuffle a deck of cards, then it is of paramount importance that the shuffling be truly random, otherwise savvy players would know that certain
deck permutations are more likely than others, and use that knowledge to game the system. Likewise, if you are building an ecommerce site and are showing
random products on the homepage based on the naïve algorithm then it is important that each product is equally likely to be displayed. If you unknowingly
are showing certain products more than others because of a faulty random reordering algorithm then you cannot be sure if a certain product that has fewer sales
is because that product is less popular or because it's appearing less often in the product rotation. Ditto for code that is used to randomly selecting
an advertisement to display.
On the other hand, if you are displaying some random quote of the day at the bottom of your page, or are showing a random headline from an RSS feed, then true randomness isn't that vital.
Fixing the Naïve Algorithm
The subtle problem with the naïve algorithm is that the number of outcomes does not divide the number of actual permutations. Consequently, there are overrepresented and underrepresented outcomes. In general, given an array size of n the naïve algorithm will always have nn outcomes while the number of permutations will always be n!. (n!, pronouced n factorial, is defined as n * (n - 1) * (n - 2) * ... * 2 * 1. For example, 3! = 3 * 2 * 1 = 6, and 5! = 4 * 3 * 2 * 1 = 120.) What we need is in algorithm that has precisely n! outcomes to match up with the n! permutations.
In 1938 statisticians Ronald Fisher and Frank Yates devised the Fisher-Yates Shuffle algorithm, which could be used to randomly order a set of items with a correct distribution of results. It being 1938, their algorithm was actually expressed in terms of steps to be performed by a human. In his seminal book The Art of Computer Programming, Donald Knuth provided the algorithm in pseudo-code. Today the algorithm is commonly called the Knuth shuffle, or the Knuth-Fisher-Yates shuffle.
The way the Knuth-Fisher-Yates works is by enumerating n - 1 positions, and only considering a swap with positions not yet enumerated. This is typically implemented in code by looping from the right-most array position to the second to left position. Each enumerated element is randomly swapped with an element position to its left (including, possibly, its own location). The code is deceptively similar to the naïve algorithm:
There are two subtle but important changes in the code. First, the algorithm loops from
deck.Length - 1 (the rightmost element) to the second to the
leftmost element (note that it loops while
i is strictly greater than 0). Second, the
swapWithPos choice is limited
to the positions from 0 to
i, inclusive. (In the naïve algorithm the
swapWithPos can be from any element in the array regardless
of the current element being enumerated.)
The following diagram illustrates how this algorithm unfolds for a three item array. As you can see, the algorithm has six possible outcomes, which match up to the six possible permutations. Consequently, no permutation is over- or under-weighted.
The following chart compares the tally of arrangements using the Knuth-Fisher-Yates versus the naïve algorithm when performing 60,000 re-orderings. The red bars represent the tallies for Knuth-Fisher-Yates. As you can see, they're tight around 10,000 outcomes, which is expected. The naïve algorithm, over-represents three permutations.
Ordering the Contents of an Array by a Random Value
If you need to bring back rows from a database query in random order perhaps the easiest approach is to simply add an
ORDER BY NEWID()to the
SELECTstatement. The SQL Server function
NEWID()returns a new
uniqueidentifier, which is a globally unique identifier (GUID). Each row is "assigned" a new uniqueidentifier, which then is used to order the results. The net result is that the rows are returned in random order. (For more information on returning rows in a random order, see: Using
NEWID()to Randomly Sort Records.)
This same concept can be applied to randomly ordering arrays. In short, if you can associate some random value with each element in an array then you can sort by that value to get a random ordering. In his blog entry titled Shuffling, Jeff Atwood proposes a simple LINQ-based approach:
As with our previous example,
deck is an array or
List of some sort.
OrderBy is an
extension method that is part of the LINQ
to Objects library. Calling
MyObject.OrderBy(func) returns an ordered enumeration of of the data in MyObject sorted
by the criteria returned by the function func. The above code uses a lambda expression to define func; for each element in the array this
function says to order by a new
System.Guid value. The net result of the
OrderBy method is an object that orders each element
deck by a GUID value, which is similar in concept and functionality to adding
ORDER BY NEWID() to a
Upon first glance there are two main differences between the OrderBy GUID algorithm and Knuth-Fisher-Yates. The most apparent one is that Knuth-Fisher-Yates does in place shuffling whereas the OrderBy GUID returns a new object that represents the random order. In other words, when using Knuth-Fisher-Yates you are affecting the actual order of elements in the array. OrderBy GUID, on the other hand, does not disturb the elements in the array.
The second difference between the two has to do with the asymptotic performance. Knuth-Fisher-Yates examines each element in the array once, giving it
an asymptotic running time of O(n). The OrderBy GUID approach must sort the results; the best sorting algorithms have a running time of
O(n log n). This difference in running time will is not profound until you get to substantially large values of n, so if you are
sorting a deck of 52 cards, for instance, the asymptotic running time difference is insignificant. (The demo available for download at the end of this article
lists the time taken to perform the tallies for each algorithm; you'll see that in most cases the OrderBy GUID takes 50-75% longer than the Knuth-Fisher-Yates
implementation. For small values of n, the bulk of this time difference is likely accounted for by the fact that the OrderBy GUID approach must
create a new
System.Guid object for each element, whereas Knuth-Fisher-Yates is just performing swaps in an array.)
The more subtle yet important question is, Does the OrderBy GUID algorithm produce a valid, even distribution among the permutation? I (incorrectly) assumed that the naïve algorithm worked as desired when writing Randomly Reordering an Array, and didn't learn of my mistake until years later. To convince myself that the OrderBy GUID approach works correctly, I tallied the permutations for five item arrays shuffled 120,000 times and compared the OrderBy GUID distribution against the Knuth-Fisher-Yates algorithm. As you can see, both are tight around 1,000, which is how many times each of the 120 permutations should occur.
Compare this to the distributions for the naïve algorithm when dealing with a five element array shuffled 120,000 times. The following chart shows the distribution discrepancy between the OrderBy GUID and the naïve algorithm. The X-axis represents the various permutations and the Y-axis the number of times each permutation occurred in the 120,000 shuffles. The black triangles show the distribution for the OrderBy GUID algorithm; they are tight around 1,000. The red squares are the distributions for the naïve algorithm. As you can see, they're all over the place. Certain permutations are wildly overrepresented while others are wildly underrepresented.
Granted, this emperical analysis does not "prove" the correctness of the OrderBy GUID approach, but it does give me more confidence in its correctness. To more thoroughly convince yourself of the randomness of the OrderBy GUID algorithm you can use the Chi-square test.
There are many ways to randomly reorder the contents of an array. The most intuitive approach is, unfortunately, incorrect in that it over- and under-weights particular permutations. To generate a true random sampling you need to use an alternative algorithm. In this article we looked at two: Knuth-Fisher-Yates and OrderBy GUID. The download available at the end of this article includes an implementation of both algorithms as well as a test page that you can use to tally the permutations for shuffles of different sized arrays.