The SQL Guru Answers your Questions...


Today's question comes from Stephen V.:

I have a DB which includes a table with 2 sets of 4 fields that the user wants to do an OR search on (i.e., (field1=x or field2=x or field3=x or field4=x) AND (field5=y or field6=y or field7=y or field8=y)). I've seen it mentioned that an OR query can tend to be rather slow. The queries I will be performing will be coupled with other conditions that are ANDs (i.e., there are other criteria that could be evaluated first).

Someone I know recommended trying a temp table and 8 updates to see if this would be faster than 6 ORs. I'm just greatly confused in general about exactly which operations are slow, by how much, etc. The DB in question has several hundred thousand records. The fields being searched on are integers (either with meaningful data or foreign keys to other tables). It is in SQL Server 7. So, to summarize, the question is, which of the following is likely to be faster:

    1) Using ORs. If this one, is there an easy way to structure the query so that the other conditions are tested before the ORs? Some Platform SDK docs for SQL Server 7 mention that the deepest level of parentheses is evaluated first.

    2) Create a temp table of boolean values specifying whether each field qualifies.

    3) Some other method.

More specifically, the fields contain the names of 4 people, and then their "rankings". Users may want to search for up to 4 people, and may want to search for rankings that include any one person with a ranking between x and y. If they search for all people between x and y it's easy because then I can just store ranking min and ranking max. Same if it's > x only or < x only.

So if they search for person joe and anyone with ranking between 3 and 4, it currently searches for: (field1=joe or field2=joe or field3=joe or field4=joe) AND ((field5>=3 and field5<=4) OR (field6>=3 and field6<=4) OR (field7>=3 and field7<=4) OR (field8>=3 and field8<=4)).

Unfortunately the rankings change for each record, otherwise the ranking thing would be a moot point.

It started out as an Access DB project, back before I knew much about DB issues. There is a server (via telnet) where people can log in and play chess. There is a variant of chess played there which involves 4 players (2 per team). At the time, there were no good online resources, so we decided to observe, record, and publish via a DB as many of these games as possible. The original Access DB was very primitive; it looked something like this:

Column Name      Datatype
GameID [Autonumber]
File Location [String]
Team1W [String]
Team2W [String]
Team1B [String]
Team2B [String]
Team1wRating [Int]
Team2wRating [Int]
Team2bRating [Int]
Team1bRating [Int]

There are at most 25,000 usernames that could fill the Team1w...Team2b slots, meaning that any handle longer than 4 characters wastes space compared to storing an integer key.

The db form page allows the user to search for up to 4 names (they can specify to look for any of the 4 or all of the 4) as well as look for a range of ratings for one, average, or all players. The slow stuff comes in when they search for player a or b or c or d AND any one player with a range of 1500-1800 for example. Right now the query would look like:

Select * from Games where (Team1W = @Name or Team1B = @Name or Team2W = @Name or Team2B = @Name or Team1W = @Name2 ... or Team2B = @Name) AND ((Team1wRating > 1500 and Team1wRating < 1800) or (Team2wRating > 1500 ... ) etc.

This obviously turns out to be pretty slow. The new structure is being built now; I am trying to figure out how to structure it and how to write the forms/queries so that searches don't take forever. It will be migrated to a SQL Server 7 DB. There are about 400,000+ records (maybe more).

What I was thinking is:

GamesTable
GameID [Increment] [PK]
File Location [VarChar(255)] [Unique]
Team1w [Int][FK]
Team2w [Int][FK]
Team1b [Int][FK]
Team2b [Int][FK]
Team1wRat [Int] 
Team1bRat [Int] 
Team2wRat [Int] 
Team2bRat [Int] 
AvgRat [Int] 
MinRat [Int] 
MaxRat [Int] 

Handles Table
HandleID [Increment][PK]
Handle [VarChar(20)] 

This structure makes most searches a little easier. However, the problem is that the query might still involve the same number of ORs. I can't determine whether the query above would be slower than doing something like creating a temp table of bool flags:

	#Temp_Query_Table
	GameID
	Name1_Matched
	Name2_Matched
	Name3_Matched
	Name4_Matched
	Team1wr_Matches
	Team1br_Matches
	Team2wr_Matches
	Team2br_Matches

So that if they search for name1 or name2 or name3 or name4 and rating between 1500-1800 it will do an insert where (some other criteria that will narrow it down a little) and then 8 updates. Then I would do a select on this.

I have a feeling there's a better solution to this problem. Possibly involving structuring it in a different way. I can't put the rating fields into the Handles tables because the ratings change after each game event.

I don't think a GamesToHandles table is a viable alternative since this would have 450,000 * 30,000 entries. I had thought about including standard deviation and a complex formula but then the searches would not be 100% accurate. I would not rule out semi-temporary tables like the one above for pre-defined ranges and some batch process to update them regularly (the DB itself is not updated in real time, but rather every 12 hours).

Thanks,

-Stephen Vakil, clueless underqualified DB Admin :)

Stephen,

I'll touch on three topics in my answer: 1) normalizing the data model, 2) queries, and 3) performance.

1.) Normalization


Your best bet in this case is to normalize your data model. That will make the queries you have to perform easier. I can't cover all of the intricacies of normalization here, but I'll walk you through the steps I mentally used. If you want, I can recommend a few books that cover data modeling in more detail.

It sounds to me like you have two basic things (gurus call them "entities" :) that you want to track: a "handle", which is a person who plays chess online, and a "game", which is a game between four "handles".

So, let's start with that. We get two tables:

Handle
HandleID (PK)
Name
e-mail
etc.

Game
GameID (PK)
FileLocation
etc.

Notice that everything in the Handle table relates directly to a handle, and everything in the Game table relates directly to a game. Now, we need some way to relate a Handle to a Game. A Handle can relate to multiple Games (Sean played 42 games), and a Game relates to multiple Handles (Game 42 had 4 players). This is called a many-to-many relationship, and it requires another table (called an "associative entity", or sometimes a "join table").

Let's do that relationship this way:

GameHandle
HandleID (PK, FK)
GameID (PK, FK)
Rating

Note that this allows for any number of players per game. If we wanted to enforce the 4 players per game business rule in the database, we could also do it this way:

GamePlayer
GameID (PK, FK)
PlayerNumber (PK)
HandleID (FK)
Rating

The first table is a little more space-efficient; I'd probably go with the first design and use a set of stored procedures to make updates to that table that enforce the business rules.
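A sketch of such a stored procedure (the name AddGamePlayer and the error text are my own; adjust to your schema) might look like this:

```sql
-- Hypothetical procedure that enforces the 4-players-per-game
-- business rule before inserting into GameHandle.
CREATE PROCEDURE AddGamePlayer
    @GameID   int,
    @HandleID int,
    @Rating   int
AS
BEGIN TRANSACTION
    -- Reject the insert if the game already has 4 players
    IF (SELECT COUNT(*) FROM GameHandle WHERE GameID = @GameID) >= 4
    BEGIN
        RAISERROR('Game already has 4 players.', 16, 1)
        ROLLBACK TRANSACTION
        RETURN
    END
    INSERT INTO GameHandle (GameID, HandleID, Rating)
    VALUES (@GameID, @HandleID, @Rating)
COMMIT TRANSACTION
```

Routing all inserts through a procedure like this keeps the rule in one place, instead of trusting every piece of client code to check it.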

I'm not entirely sure about the rating... if you wanted to historically store each player's rating after the game, you'd put it where I did (in the GameHandle table). If you just wanted to store a current rating, it really belongs in the Handle table.
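In DDL, the normalized model might be sketched like this (column sizes and the UNIQUE constraints are guesses based on Stephen's original schema; adjust to taste):

```sql
-- Sketch of the normalized tables; names and sizes are illustrative.
CREATE TABLE Handle (
    HandleID int IDENTITY(1,1) PRIMARY KEY,
    Name     varchar(20) NOT NULL UNIQUE
)

CREATE TABLE Game (
    GameID       int IDENTITY(1,1) PRIMARY KEY,
    FileLocation varchar(255) NOT NULL UNIQUE
)

-- The associative entity: one row per player per game.
CREATE TABLE GameHandle (
    GameID   int NOT NULL REFERENCES Game(GameID),
    HandleID int NOT NULL REFERENCES Handle(HandleID),
    Rating   int NOT NULL,
    PRIMARY KEY (GameID, HandleID)
)
```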

2.) Queries


Now your query looks something like this (IN is shorthand for a series of ORs):

SELECT whatever
FROM GameHandle
WHERE (HandleID IN (A, B, C, D))
         AND
      (Rating BETWEEN 1500 AND 1800)

This will give you GameIDs that you can then join back to Game to get any relevant details.
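For example, the join back might look like this (assuming the GameHandle design above, with @Name1...@Name4 and the rating bounds supplied as parameters):

```sql
-- Find games where any of the four named players played with a
-- rating in the requested range, then pull the matching game rows.
SELECT DISTINCT g.*
FROM Game g
    JOIN GameHandle gh ON gh.GameID   = g.GameID
    JOIN Handle h      ON h.HandleID  = gh.HandleID
WHERE h.Name IN (@Name1, @Name2, @Name3, @Name4)
  AND gh.Rating BETWEEN 1500 AND 1800
```

DISTINCT is there because a game could match on more than one player, and you only want each game once.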

You can also compute ratings statistics for each game on the fly: (Of course, you can cheat a little and store these values in the Game table, also)

SELECT Avg(Rating), Min(Rating), MAX(Rating)
FROM GameHandle
WHERE GameID = 42
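To precompute those statistics for every game in one pass (for example, to feed the AvgRat/MinRat/MaxRat columns Stephen proposed on the Game table), a GROUP BY does it:

```sql
-- One aggregate row per game; could drive a batch update of
-- denormalized AvgRat/MinRat/MaxRat columns every 12 hours.
SELECT GameID,
       AVG(Rating) AS AvgRat,
       MIN(Rating) AS MinRat,
       MAX(Rating) AS MaxRat
FROM GameHandle
GROUP BY GameID
```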

3.) Performance


"I don't think a GamesToHandles table is a viable alternative since this would have 450,000 * 30,000 entries."

Nope, at least not using the GameHandle table.

Okay, so you have 450,000 games. You also have 30,000 handles. However, with the GameHandle table, you only have 4 rows per game (only 4 players per game, right?)

So... you have 450,000 x 4 records in the GameHandle table. That's, oh, 1.8 million records? SQL Server 7.0 will eat that table for lunch, and ask for dessert. :) Especially if you index it properly. Try three nonclustered indexes on GameID, HandleID, and Rating.
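In T-SQL, those three indexes might be created like this (the index names are my own):

```sql
-- Nonclustered indexes to support the name, game, and rating
-- predicates on the 1.8-million-row GameHandle table.
CREATE INDEX IX_GameHandle_GameID   ON GameHandle (GameID)
CREATE INDEX IX_GameHandle_HandleID ON GameHandle (HandleID)
CREATE INDEX IX_GameHandle_Rating   ON GameHandle (Rating)
```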

Hope this helps, and let me know where that web site is... I have to admit, I'm curious about the 4 player chess!

Sean

