The SQL Guru Answers your Questions...
Today's question comes from Stephen V.:
I have a DB which includes a table that includes 2 sets of 4 fields that the
user wants to do an or search on (i.e. (field1=x or field2=x or field3=x or
field4=x) and (field5=y or field6=y or field7=y or field8=y)). I've seen
mention that an OR query can tend to be rather slow. The queries that I
will be performing will be coupled with other conditions that are ANDs (i.e.
there are other criteria that could be evaluated first).
Someone I know recommended trying a temp table and 8 updates to see if this
would be faster than 6 ORs. I'm just greatly confused in general about
exactly which operations are slow, by how much, etc. The DB in question has
several hundred thousand records. The fields being searched on are integers
(either with meaningful data or foreign keys to other tables). It is in SQL
Server 7. So, to summarize, the question is, which of the following is
likely to be faster:
More specifically, the fields contain the names of 4 people, and then their
"rankings". Users may want to search for up to 4 people, and may want to
search for rankings that include any one person with a ranking between x and
y. If they search for all people between x and y it's easy because then I
can just store ranking min and ranking max. Same if it's > x only or
< x only.
So if they search for person joe and anyone with ranking between 3 and 4, it
currently searches for: (field1=joe or field2=joe or field3=joe or
field4=joe) AND ((field5>=3 and field5<=4) OR (field6>=3 and field6<=4) OR
(field7>=3 and field7<=4) OR (field8>=3 and field8<=4)).
Unfortunately the rankings change for each record, otherwise the ranking
thing would be a moot point.
It started out as an Access DB project, back before I knew much about DB
issues. There is a server (via telnet) where people can log in and play
chess. There is a variant of chess played there which involves 4 players (2
per team). At the time, there were no "good" online resources. We decided
to observe, record, and publish via a DB as many of these games as possible.
The original Access DB was very very primitive.. it looked something like
this:
| ColumnName | Datatype |
|---|
|
GameID | [Autonumber] | |
File Location | [String] | |
Team1W | [String] | |
Team2W | [String] | |
Team1B | [String] | |
Team2B | [String] | |
Team1wRating | [Int] | |
Team2wRating | [Int] | |
Team2bRating | [Int] | |
Team1bRating | [Int] |
There are at most 25,000 usernames who could fill up the Team1w...Team2b
slots. Meaning if the handles are more than 4 chars it was wasting space.
The db form page allows the user to search for up to 4 names (they can
specify to look for any of the 4 or all of the 4) as well as look for a
range of ratings for one, average, or all players. The slow stuff comes in
when they search for player a or b or c or d AND any one player with a range
of 1500-1800 for example. Right now the query would look like:
Select * from Games where (Team1W = @Name or Team1B = @Name or Team2W =
@Name or Team2B = @Name or Team1W = @Name2 ... or Team2B = @Name) AND
((Team1wRating > 1500 and Team1wRating < 1800) or (Team2wRating > 1500 ... )
etc.
This obviously turns out to be pretty slow. The new structure is being
built now. I am trying to figure out how to structure it and how to write
the forms/queries so that searches don't take forever. It will be migrated
to a sql server 7 DB. There are about 400,000+ records (maybe more).
What I was thinking is:
| GamesTable |
|---|
|
GameID | [Increment] | [PK] | |
File Location | [VarChar(255)] | [Unique] | |
Team1w | [Int] | [FK] | |
Team2w | [Int] | [FK] | |
Team1b | [Int] | [FK] | |
Team2b | [Int] | [FK] | |
Team1wRat | [Int] | | |
Team1bRat | [Int] | | |
Team2wRat | [Int] | | |
Team2bRat | [Int] | | |
AvgRat | [Int] | | |
MinRat | [Int] | | |
MaxRat | [Int] | |
| Handles Table |
|---|
|
HandleID | [Increment] | [PK] | |
Handle | [VarChar(20)] | |
This structure makes most searches a little easier. However the problem is
that the query might still involve the same number of ORs. I can't seem to
determine whether the query above would be slower than doing something like
creating a temp table of bool flags:
#Temp_Query_Table
GameID
Name1_Matched
Name2_Matched
Name3_Matched
Name4_Matched
Team1wr_Matches
Team1br_Matches
Team2wr_Matches
Team2br_Matches
So that if they search for name1 or name2 or name3 or name4 and rating
between 1500-1800 it will do an insert where (some other criteria that will
narrow it down a little) and then 8 updates. Then I would do a select on
this.
I have a feeling there's a better solution to this problem. Possibly
involving structuring it in a different way. I can't put the rating fields
into the Handles tables because the ratings change after each game event.
I don't think a GamesToHandles table is a viable alternative since this
would have 450,000 * 30,000 entries. I had thought about including standard
deviation and a complex formula but then the searches would not be 100%
accurate. I would not rule out semi-temporary tables like the one above for
pre-defined ranges and some batch process to update them regularly (the DB
itself is not updated in real time, but rather every 12 hours).
Thanks,
-Stephen Vakil, clueless underqualified DB Admin :)
|
Stephen,
I touched on three different topics in my answer: 1) normalizing the data
model, 2) queries, and 3)p erformance
1.) Normalization
Your best bet in this case is to normalize your data model. That will make
the queries you have to perform easier. I can't cover all of the
intricacies of normalization, but I'll walk you through the steps I
menatally used. If you want, I can recommend a few books that cover data
modeling in more detail.
It sounds to me like you have two basic things (gurus call them "entities"
:) that you want to track: a "handle", which is a person who plays chess
online, and a "game", which is a game between four "handles".
So, let's start with that. We get two tables:
| Handle |
|---|
|
HandleID (PK) |
|
Name |
|
e-mail |
|
etc. |
| Game |
|---|
|
GameID (PK) |
|
FileLocation |
|
etc. |
Notice that everything in the Handle table relates directly to a handle,
and everything in the Game table relates directly to a game. Now, we need
some way to relate a Handle to a Game. A Handle can relate to multiple
Games (Sean played 42 games), and a Game relates to multiple Handles (Game
42 had 4 players). This is called a many-to-many relationship, and it
requires another table (called an "associative entity", or sometimes a "join
table").
Let's do that relationship this way:
| GameHandle |
|---|
|
HandleID (PK, FK) |
|
GameID (PK, FK) |
|
Rating |
Note that this allows for any number of players per game. If we wanted to
enforce the 4 players per game business rule in the database, we could also
do it this way:
| GamePlayer |
|---|
|
GameID (PK, FK) |
|
PlayerNumber (PK) |
|
HandleID (FK) |
|
Rating |
The first table is a little more space efficient; I'd probably do it using
the first design and use a set of stored procedures to make updates to that
table that enforce the business rules.
I'm not entirely sure about the rating... if you wanted to historically
store each player's rating after the game, you'd put it where I did (in the
GameHandle table). If you just wanted to store a current rating, it really
belongs in the Handle table.
2.) Queries
Now your query looks something like: (
IN is a shorthand
OR)
SELECT whatever
FROM GameHandle
WHERE (GameHandle IN (A,B,C,D))
AND
(Rating BETWEEN 1500 AND 1800)
|
This will give you GameIDs that you can then join back to
Game to get any relevant details.
You can also compute ratings statistics for each game on the fly: (Of
course, you can cheat a little and store these values in the Game table,
also)
SELECT Avg(Rating), Min(Rating), MAX(Rating)
FROM GameHandle
WHERE GameID = 42
|
3.) Performance
I don't think a GamesToHandles table is a viable alternative since this
would have 450,000 * 30,000 entries.
|
Nope, at least not using the GameHandle table.
Okay, so you have 450,000 games. You also have 30,000 handles. However,
with the GameHandle table, you only have 4 rows per game (only 4 players per
game, right?)
So... you have 450,000 x 4 records in the GameHandle table. That's, oh, 1.8
million records? SQL Server 7.0 will eat that table for lunch, and ask for
dessert. :) Especially if you index it properly. Try three nonclustered
indexes on GameID, HandleID, and Rating.
Hope this helps, and let me know where that web site is... I have to admit,
I'm curious about the 4 player chess!
Sean