When you think ASP, think...
Recent Articles
All Articles
ASP.NET Articles
ASPFAQs.com
Message Board
Related Web Technologies
User Tips!
Coding Tips

Sections:
Sample Chapters
Commonly Asked Message Board Questions
JavaScript Tutorials
MSDN Communities Hub
Official Docs
Security
Stump the SQL Guru!
XML Info
Information:
Feedback
Author an Article
ASP ASP.NET ASP FAQs Message Board Feedback
Print this page.
Published: Monday, April 05, 1999

Multilingual Databasing Techniques
By Alex Buchanan


         Related Article: Multilingual Databasing, Part 2

- continued -

The need for this kind of script arose when I was asked to write an online phonebook for the Quebec Department of Justice. This was to be done in ASP accessing a database through ODBC. Most databases containing proper names are usually stripped of accents and other extended foreign characters (é, à, ç...). Naturally, Quebec is predominantly French and French names often contain accents.

The major constraint I was imposed was to leave the accents in the fields. Furthermore, the database should be searchable with or without accents. There are many names in French which are spelt with or without accents (i.e. Eric, Éric...). Evidently, SQL takes care of character case, but not foreign characters.

This article will cover the solution to this kind of situation.

I - Database structure
Before modeling the database, you must figure out if you want to prioritize speed or space.

The space-efficient solution would be to compare the queried word to all the possible combinations of a word, taking into consideration the characters which can be accented.

For example, if a user submits "Réjean" as a keyword, the script would search for "Rejean", "Réjean", "Rejèan"... -- until it finds a suitable match or exhausts all possibilities (which are almost endless). Granted you don't need to create additional tables or fields, it seems like overkill considering storage space is cheap and databases are more scalable and robust nowadays. I strongly suggest you don't use this method.

The speed-efficient solution would be to store an additional field or table in the database with the particular name/word stripped from its accents. There are two feasible ways of implementing this:

Adding an extra field
This method consists of adding an extra field to an existing table which would contain the stripped version of the name. For example the following table Entry(EID, Fname, Lname, PhoneNum) would become Entry(EID, Fname, Fname2, Lname, Lname2, PhoneNum) where Fname2 and Lname2 are the stripped versions of the fields with similar names (no accents).

Table Entry
EID Fname Fname2 Lname Lname2 PhoneNum
3 Pierre Pierre Côté Cote 452-4252
4 Réjean Rejean Préçus Precus 863-5218

These two fields would then become indexed (allowing duplicates) since searching would be done using these fields. How the query works, is that the user submits a particular keyword which is stripped of its accents. A search is then performed comparing it to the stripped field (Fname2 / Lname2). If it is found, the field with accents is returned and displayed to the user. Implementation will be discussed further in this article.

This method works fine but keep in mind that there is a considerate amount of redundancy in the table design (although it is controlled). Read the Data Modeling articles on 4GuysFromRolla for more information.

Adding an extra table
This method consists of adding a table which avoids double-storing names by mapping accented names onto their accentless versions. Consider this new table: MappedNames(AccentedName (PK), AccentlessName). Only names with accents would appear in this table; first and last names can be mapped using the same fields.

Table Entry
EID Fname Lname PhoneNum
3 Pierre Côté 452-4252
4 Réjean Préçus 863-5218
Table MappedNames
AccentedName AccentlessName
Côté Cote
Réjean Rejean
Préçus Precus

A SELECT statement to query a name would go as follows:

PARAMETER queryname: Text;
SELECT Entry.Fname, Entry.Lname, Entry.PhoneNum
FROM Entry 
  INNER JOIN MappedNames ON 
    Entry.Lname = MappedNames.AccentedName
WHERE Entry.Lname = queryname 
	OR 
      MappedNames.AccentlessName = queryname;

The parameter queryname will naturally be stripped of accents before being used in the query. This method avoids the redundancy of common names being stored multiple times and the event where first names can also be last names. Both methods vary in complexity but accomplish the same task with similar efficiency.

II - Implementation
For clarity and ease of understanding, I will illustrate the implementation by using the extra field method. The concept applies to any similar method.

Data modeling
The first step is to create the stripped-down fields (i.e. Fname2, Lname2), setting them to "Indexed" (allowing duplicates). This is probably easiest done directly in your DBMS. Add one for each field which is used for searching.

Data definition
Next, we insert or update the stripped fields by stripping down the main accented fields. This can be done by using the following functions and SQL statements:

Function Unapostrophize(theword)
  Unapostrophize = Replace(theword, "'", "''")
End Function

Function StripAccents(theword)
  Dim letter, tempWord

  For i = 1 to Len(theword)
    letter = Mid(theword,i,1)

    Select case LCase(letter)
      case "é","è","ê","ë"
        letter = "e"
      case "à","â","å","ä"
        letter = "a"
      case "ù","û"
        letter = "u"
      case "ò","ö","ô"
        letter = "o"
      case "ç"
        letter = "c"
      case "ï","ì"
        letter = "i"
      case "ÿ"
        letter = "y"
      case "'"
        letter = "''"
     '.
     '. Add more characters as needed
     '.
    End Select

    tempWord = tempWord & letter
  Next

  StripAccents = tempWord
End Function

'...

SQL = "INSERT INTO Entry (Fname, Fname2) VALUES ('" & _
 Unapostrophize(theFname) & "', '" & _
 StripAccents(theFname) & "')"

SQL = "UPDATE Entry SET Fname2 = '" & _
	StripAccents(theFname) & _
	"' WHERE EID = theEID"
Unapostrophize() doubles any apostrophes found. This bypasses the problem where SQL interprets apostrophes as string delimiters should a "D'Angelo" be entered into the database.

StripAccents() replaces all accented letters by unaccented versions as well as takes care of the apostrophe.

theFname, theEID are variables which assumingly have correct Fname and EID values (This might have been returned from a query).

The fields should now all be filled as needed. We will proceed to the querying part.

Querying
Observe the following sample program using the functions and concepts mentioned above:

qLname = Request.QueryString("lname")

If qLname <> "" Then
  SQL = "SELECT Fname, Lname, PhoneNum FROM Entry WHERE Lname2 LIKE
 '%" & StripAccents(qLname) & "%' ORDER BY Lname"

  'Connect to database
  SET DbObj = Server.CreateObject("ADODB.CONNECTION") 

  DbObj.Open "DSN=phonebk;UID=;PWD=;"

  SET oRs = DbObj.Execute(SQL) 'Execute above SQL statement

  While Not oRs.EOF
    Response.Write oRs.Fields("Lname") & ", " & _
	oRs.Fields("Fname") & _
 	" - " & oRs.Fields("PhoneNum")

    'output a linebreak
    Response.Write "<" & "BR>" 
    oRs.MoveNext
  Wend

  DbObj.Close
  SET DbObj = Nothing
  SET oRs   = Nothing
End If

As you can see, we strip the query qLname and compare it to the stripped field Lname2 in the database. Since both fields are on the same level, finding results is no problem. You'll notice that in the results, we output the actual fields with accents, so the user has no clue what goes on.

As a quick review, we create an extra field or table containing a version of a string without accents. We then accept a query which has been stripped too of accents, which is then compared to the stripped fields. Matches can then be found. Fields displayed or used will contain accents.

III - Credits

  • Nicole Calinoiu, for offering the speed-efficient solution of adding an extra table as well as including a good example.

IV - Final Notes
This article covers conceptually how to deal with a database containing foreign characters, making it easily queryable by users. These characters are output directly to HTML, which is fine in a controlled environment where you know all users will be running the same operating system and platform. If this is not the case, you may have problems. This will be the topic of my next article, a follow-up of this one.

I hope this article has been of help to you.

I strongly believe in the sharing of information and talent between professionals and amateurs alike. Should you know a better way of doing the same thing, please share your knowledge. I'm open to comments and criticism and can be emailed at alex@bh.qc.ca

Happy Programming!


Alex is a computer science student at Vanier College in Montreal, Quebec, graduating in May 1999 with intentions to pursue his education at Concordia University. He has been developing web sites and web applications for several years now, only dabbling in ASP for the last few months.


ASP.NET [1.x] [2.0] | ASPMessageboard.com | ASPFAQs.com | Advertise | Feedback | Author an Article