By Alex Buchanan
Related Article: Multilingual Databasing, Part 2
The need for this kind of script arose when I was asked to write an online phonebook for the Quebec Department of Justice. This was to be done in ASP accessing a database through ODBC. Most databases containing proper names are usually stripped of accents and other extended foreign characters (é, à, ç...). Naturally, Quebec is predominantly French and French names often contain accents.
The major constraint I was imposed was to leave the accents in the fields. Furthermore, the database should be searchable with or without accents. There are many names in French which are spelt with or without accents (i.e. Eric, Éric...). Evidently, SQL takes care of character case, but not foreign characters.
This article will cover the solution to this kind of situation.
I - Database structure
Before modeling the database, you must figure out if you want to prioritize speed or space.
The space-efficient solution would be to compare the queried word to all the possible combinations of a word, taking into consideration the characters which can be accented.
For example, if a user submits "Réjean" as a keyword, the script would search for "Rejean", "Réjean", "Rejèan"... -- until it finds a suitable match or exhausts all possibilities (which are almost endless). Granted you don't need to create additional tables or fields, it seems like overkill considering storage space is cheap and databases are more scalable and robust nowadays. I strongly suggest you don't use this method.
The speed-efficient solution would be to store an additional field or table in the database with the particular name/word stripped from its accents. There are two feasible ways of implementing this:
Adding an extra field
This method consists of adding an extra field to an existing table which would contain the stripped version of the name. For example the following table
Entry(EID, Fname, Lname, PhoneNum) would become
Entry(EID, Fname, Fname2, Lname, Lname2, PhoneNum) where
Lname2 are the stripped versions of
the fields with similar names (no accents).
These two fields would then become indexed (allowing duplicates) since searching would be done using these fields. How the query works, is that the user submits a particular keyword which is stripped of its accents. A search is then performed comparing it to the stripped field (Fname2 / Lname2). If it is found, the field with accents is returned and displayed to the user. Implementation will be discussed further in this article.
This method works fine but keep in mind that there is a considerate amount of redundancy in the table design (although it is controlled). Read the Data Modeling articles on 4GuysFromRolla for more information.
Adding an extra table
This method consists of adding a table which avoids double-storing names by mapping accented names onto their accentless versions. Consider this new table:
MappedNames(AccentedName (PK), AccentlessName). Only
names with accents would appear in this table; first and last names can be
mapped using the same fields.
A SELECT statement to query a name would go as follows:
PARAMETER queryname: Text; SELECT Entry.Fname, Entry.Lname, Entry.PhoneNum FROM Entry INNER JOIN MappedNames ON Entry.Lname = MappedNames.AccentedName WHERE Entry.Lname = queryname OR MappedNames.AccentlessName = queryname;
queryname will naturally be stripped of accents
before being used in the query. This method avoids the redundancy of
common names being stored multiple times and the event where first names
can also be last names. Both methods vary in complexity but accomplish the
same task with similar efficiency.
II - Implementation
For clarity and ease of understanding, I will illustrate the implementation by using the extra field method. The concept applies to any similar method.
The first step is to create the stripped-down fields (i.e. Fname2, Lname2), setting them to "Indexed" (allowing duplicates). This is probably easiest done directly in your DBMS. Add one for each field which is used for searching.
Next, we insert or update the stripped fields by stripping down the main accented fields. This can be done by using the following functions and SQL statements:
Unapostrophize() doubles any apostrophes found. This bypasses the problem where SQL interprets apostrophes as string delimiters should a "D'Angelo" be entered into the database.Function Unapostrophize(theword) Unapostrophize = Replace(theword, "'", "''") End Function Function StripAccents(theword) Dim letter, tempWord For i = 1 to Len(theword) letter = Mid(theword,i,1) Select case LCase(letter) case "é","è","ê","ë" letter = "e" case "à","â","å","ä" letter = "a" case "ù","û" letter = "u" case "ò","ö","ô" letter = "o" case "ç" letter = "c" case "ï","ì" letter = "i" case "ÿ" letter = "y" case "'" letter = "''" '. '. Add more characters as needed '. End Select tempWord = tempWord & letter Next StripAccents = tempWord End Function '... SQL = "INSERT INTO Entry (Fname, Fname2) VALUES ('" & _ Unapostrophize(theFname) & "', '" & _ StripAccents(theFname) & "')" SQL = "UPDATE Entry SET Fname2 = '" & _ StripAccents(theFname) & _ "' WHERE EID = theEID"
StripAccents() replaces all accented letters by unaccented versions as well as takes care of the apostrophe.
theFname, theEID are variables which assumingly have correct Fname and EID values (This might have been returned from a query).
The fields should now all be filled as needed. We will proceed to the querying part.
Observe the following sample program using the functions and concepts mentioned above:
qLname = Request.QueryString("lname") If qLname <> "" Then SQL = "SELECT Fname, Lname, PhoneNum FROM Entry WHERE Lname2 LIKE '%" & StripAccents(qLname) & "%' ORDER BY Lname" 'Connect to database SET DbObj = Server.CreateObject("ADODB.CONNECTION") DbObj.Open "DSN=phonebk;UID=;PWD=;" SET oRs = DbObj.Execute(SQL) 'Execute above SQL statement While Not oRs.EOF Response.Write oRs.Fields("Lname") & ", " & _ oRs.Fields("Fname") & _ " - " & oRs.Fields("PhoneNum") 'output a linebreak Response.Write "<" & "BR>" oRs.MoveNext Wend DbObj.Close SET DbObj = Nothing SET oRs = Nothing End If
As you can see, we strip the query
qLname and compare it to the
Lname2 in the database. Since both fields are
on the same level, finding results is no problem. You'll notice that in
the results, we output the actual fields with accents, so the user has no
clue what goes on.
As a quick review, we create an extra field or table containing a version of a string without accents. We then accept a query which has been stripped too of accents, which is then compared to the stripped fields. Matches can then be found. Fields displayed or used will contain accents.
III - Credits
- Nicole Calinoiu, for offering the speed-efficient solution of adding an extra table as well as including a good example.
IV - Final Notes
This article covers conceptually how to deal with a database containing foreign characters, making it easily queryable by users. These characters are output directly to HTML, which is fine in a controlled environment where you know all users will be running the same operating system and platform. If this is not the case, you may have problems. This will be the topic of my next article, a follow-up of this one.
I hope this article has been of help to you.
I strongly believe in the sharing of information and talent between professionals and amateurs alike. Should you know a better way of doing the same thing, please share your knowledge. I'm open to comments and criticism and can be emailed at email@example.com