Today's question comes from Stephen V.:
I have a DB which includes a table that includes 2 sets of 4 fields that the user wants to do an or search on (i.e. (field1=x or field2=x or field3=x or field4=x) and (field5=y or field6=y or field7=y or field8=y)
). I've seen mention that anOR
query can tend to be rather slow. The queries that I will be performing will be coupled with other conditions that areAND
s (i.e. there are other criteria that could be evaluated first).Someone I know recommended trying a temp table and 8 updates to see if this would be faster than 6
OR
s. I'm just greatly confused in general about exactly which operations are slow, by how much, etc. The DB in question has several hundred thousand records. The fields being searched on are integers (either with meaningful data or foreign keys to other tables). It is in SQL Server 7. So, to summarize, the question is, which of the following is likely to be faster:1) Using
OR
s. If this one, is there an easy way to structure the query so that the other conditions are tested before the ors? In some Platform SDK docs for sql server 7 it mentioned that the deepest level of parentheses are evaluated first.2) Create a temp table of boolean values specifying whether each field qualifies.
3) Some other method.
More specifically, the fields contain the names of 4 people, and then their "rankings". Users may want to search for up to 4 people, and may want to search for rankings that include any one person with a ranking between x and y. If they search for all people between x and y it's easy because then I can just store ranking min and ranking max. Same if it's
> x
only or< x
only.So if they search for person joe and anyone with ranking between 3 and 4, it currently searches for:
(field1=joe or field2=joe or field3=joe or field4=joe) AND ((field5>=3 and field5<=4) OR (field6>=3 and field6<=4) OR (field7>=3 and field7<=4) OR (field8>=3 and field8<=4))
.Unfortunately the rankings change for each record, otherwise the ranking thing would be a moot point.
It started out as an Access DB project, back before I knew much about DB issues. There is a server (via telnet) where people can log in and play chess. There is a variant of chess played there which involves 4 players (2 per team). At the time, there were no "good" online resources. We decided to observe, record, and publish via a DB as many of these games as possible. The original Access DB was very very primitive.. it looked something like this:
ColumnName Datatype GameID [Autonumber] File Location [String] Team1W [String] Team2W [String] Team1B [String] Team2B [String] Team1wRating [Int] Team2wRating [Int] Team2bRating [Int] Team1bRating [Int] There are at most 25,000 usernames who could fill up the
Team1w...Team2b
slots. Meaning if the handles are more than 4 chars it was wasting space.The db form page allows the user to search for up to 4 names (they can specify to look for any of the 4 or all of the 4) as well as look for a range of ratings for one, average, or all players. The slow stuff comes in when they search for player a or b or c or d AND any one player with a range of 1500-1800 for example. Right now the query would look like:
Select * from Games where (Team1W = @Name or Team1B = @Name or Team2W = @Name or Team2B = @Name or Team1W = @Name2 ... or Team2B = @Name) AND ((Team1wRating > 1500 and Team1wRating < 1800) or (Team2wRating > 1500 ... ) etc.
This obviously turns out to be pretty slow. The new structure is being built now. I am trying to figure out how to structure it and how to write the forms/queries so that searches don't take forever. It will be migrated to a sql server 7 DB. There are about 400,000+ records (maybe more).
What I was thinking is:
GamesTable GameID [Increment] [PK] File Location [VarChar(255)] [Unique] Team1w [Int] [FK] Team2w [Int] [FK] Team1b [Int] [FK] Team2b [Int] [FK] Team1wRat [Int] Team1bRat [Int] Team2wRat [Int] Team2bRat [Int] AvgRat [Int] MinRat [Int] MaxRat [Int]
Handles Table HandleID [Increment] [PK] Handle [VarChar(20)] This structure makes most searches a little easier. However the problem is that the query might still involve the same number of
OR
s. I can't seem to determine whether the query above would be slower than doing something like creating a temp table of bool flags:#Temp_Query_Table GameID Name1_Matched Name2_Matched Name3_Matched Name4_Matched Team1wr_Matches Team1br_Matches Team2wr_Matches Team2br_MatchesSo that if they search for
name1 or name2 or name3 or name4 and rating between 1500-1800
it will do aninsert where
(some other criteria that will narrow it down a little) and then 8updates
. Then I would do aselect
on this.I have a feeling there's a better solution to this problem. Possibly involving structuring it in a different way. I can't put the rating fields into the
Handles
tables because the ratings change after each game event.I don't think a
GamesToHandles
table is a viable alternative since this would have 450,000 * 30,000 entries. I had thought about including standard deviation and a complex formula but then the searches would not be 100% accurate. I would not rule out semi-temporary tables like the one above for pre-defined ranges and some batch process to update them regularly (the DB itself is not updated in real time, but rather every 12 hours).Thanks,
-Stephen Vakil, clueless underqualified DB Admin :)
Stephen,
I touched on three different topics in my answer: 1) normalizing the data model, 2) queries, and 3)p erformance
1.) Normalization
Your best bet in this case is to normalize your data model. That will make the queries you have to perform easier. I can't cover all of the intricacies of normalization, but I'll walk you through the steps I menatally used. If you want, I can recommend a few books that cover data modeling in more detail.
It sounds to me like you have two basic things (gurus call them "entities" :) that you want to track: a "handle", which is a person who plays chess online, and a "game", which is a game between four "handles".
So, let's start with that. We get two tables:
Handle |
---|
HandleID (PK) |
Name |
etc. |
Game |
---|
GameID (PK) |
FileLocation |
etc. |
Notice that everything in the Handle
table relates directly to a handle,
and everything in the Game
table relates directly to a game. Now, we need
some way to relate a Handle
to a Game
. A Handle
can relate to multiple
Game
s (Sean played 42 games), and a Game
relates to multiple Handle
s (Game
42 had 4 players). This is called a many-to-many relationship, and it
requires another table (called an "associative entity", or sometimes a "join
table").
Let's do that relationship this way:
GameHandle |
---|
HandleID (PK, FK) |
GameID (PK, FK) |
Rating |
Note that this allows for any number of players per game. If we wanted to enforce the 4 players per game business rule in the database, we could also do it this way:
GamePlayer |
---|
GameID (PK, FK) |
PlayerNumber (PK) |
HandleID (FK) |
Rating |
The first table is a little more space efficient; I'd probably do it using the first design and use a set of stored procedures to make updates to that table that enforce the business rules.
I'm not entirely sure about the rating... if you wanted to historically
store each player's rating after the game, you'd put it where I did (in the
GameHandle
table). If you just wanted to store a current rating, it really
belongs in the Handle
table.
2.) Queries
Now your query looks something like: (
IN
is a shorthand OR
)
|
This will give you GameID
s that you can then join back to
Game
to get any relevant details.
You can also compute ratings statistics for each game on the fly: (Of
course, you can cheat a little and store these values in the Game
table,
also)
|
3.) Performance
I don't think a GamesToHandles
table is a viable alternative since this would have 450,000 * 30,000 entries.
Nope, at least not using the GameHandle
table.
Okay, so you have 450,000 games. You also have 30,000 handles. However,
with the GameHandle
table, you only have 4 rows per game (only 4 players per
game, right?)
So... you have 450,000 x 4 records in the GameHandle
table. That's, oh, 1.8
million records? SQL Server 7.0 will eat that table for lunch, and ask for
dessert. :) Especially if you index it properly. Try three nonclustered
indexes on GameID
, HandleID
, and Rating
.
Hope this helps, and let me know where that web site is... I have to admit, I'm curious about the 4 player chess!
Sean
Read Other SQL Guru Questions |