Published: Thursday, January 04, 2001
Efficiently Reading Large Text Files
By Bret Hern
Is there a big, honking text file standing between you and performance
nirvana? Wondering how you can find the needle in that 10 MB haystack?
About to give up on the FileSystemObject? What follows is a way to
read those big files quickly enough to make Evelyn Wood jealous.
At relatively small sizes (about 100 KB or less), the standard methods
of the FileSystemObject and TextStream object for reading in entire files
are reasonably snappy. However, once the file sizes get into megabyte
territory, the standard approaches begin to have, er, issues. Let’s
take a scenario where you need to read a text file to determine if
a keyword is present. For the text file, I set up three test cases - a
file size of 10 KB, 100 KB and 1,000 KB. In each test, the keyword to
be found was placed at the tail end of the file.
Here’s the standard, "brute-force" method of loading up the entire
file into a single buffer variable:
const ForReading = 1
dim strSearchThis
dim objFS
dim objTS
set objFS = Server.CreateObject("Scripting.FileSystemObject")
set objTS = objFS.OpenTextFile(Server.MapPath("myfile.txt"), _
    ForReading)
' Pull the entire file into one string, then search that string
strSearchThis = objTS.ReadAll
objTS.Close
if instr(strSearchThis, "keyword") > 0 then
    Response.Write "Found it!"
end if
While this works fine at smaller file sizes, once we break the
megabyte barrier, we’re looking at script timeouts. Notice the
explosion in time required to complete the above task against the
1 MB file - only the most dedicated of users will hang around that
long.
Test #1: Brute Force (all times in seconds)
| Method | 10 KB File | 100 KB File | 1000 KB File |
| TextStream ReadAll | 0.01 | 0.62 | 73.56 |
Clearly that won’t work for our large file search. You might then be tempted
to simply parse the file line by line to get it into our search
string, thinking that a hard-working loop might outperform the
ReadAll method. You would be wrong. The following method,
with just a simple string concatenation, is even slower:
const ForReading = 1
dim strSearchThis
dim objFS
dim objTS
set objFS = Server.CreateObject("Scripting.FileSystemObject")
set objTS = objFS.OpenTextFile(Server.MapPath("myfile.txt"), _
    ForReading)
' Append each line to one ever-growing string.
' ReadLine drops the line terminator, so the lines end up joined back to back.
do until objTS.AtEndOfStream
    strSearchThis = strSearchThis & objTS.ReadLine
loop
objTS.Close
if instr(strSearchThis, "keyword") > 0 then
    Response.Write "Found it!"
end if
Test #1: Standard Parse (all times in seconds)
| Method | 10 KB File | 100 KB File | 1000 KB File |
| Standard Parse | 0.02 | 1.27 | 162.44 |
It turns out that string concatenation is one of the slower operations
in the engine, and this method’s performance reflects that. Each &
builds a new string and copies everything accumulated so far, so the
total work grows roughly with the square of the file size, which fits
the better than 100-fold jump between the 100 KB and 1000 KB timings. (Of
course, if you could count on the keyword being fully contained on
one line, and you had no other value for the file beyond this one
check, you could simply parse the file and perform the InStr
check on each line of the loop without taking the concatenation hit.
That approach would be extremely fast, but it’s a bit of a cheat for
the topic at hand, so let’s move on.)
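For readers who can live with that restriction, the line-by-line shortcut mentioned in the parenthetical might be sketched roughly as follows. The function name LineContainsKeyword and its parameters are hypothetical, not from the original tests, and the sketch assumes the keyword never spans a line break.

```vbscript
Const ForReading = 1

' Hypothetical helper (name and signature are illustrative only).
' Returns True as soon as any single line contains strKeyword.
' Assumption: the keyword is fully contained on one line.
Function LineContainsKeyword(strPath, strKeyword)
    Dim objFS, objTS
    LineContainsKeyword = False
    Set objFS = CreateObject("Scripting.FileSystemObject")
    Set objTS = objFS.OpenTextFile(strPath, ForReading)
    Do Until objTS.AtEndOfStream
        ' Search each line on its own; nothing is concatenated
        If InStr(objTS.ReadLine, strKeyword) > 0 Then
            LineContainsKeyword = True
            Exit Do   ' stop reading once the keyword turns up
        End If
    Loop
    objTS.Close
    Set objTS = Nothing
    Set objFS = Nothing
End Function
```

In an ASP page this could be called as `If LineContainsKeyword(Server.MapPath("myfile.txt"), "keyword") Then Response.Write "Found it!"`. Because it never holds more than one line at a time, it leaves no full copy of the file behind for any later processing.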
Now, there is a way, using a rather counterintuitive dynamic
array-building approach, to collect the lines in an array and Join them
into a searchable string, and it performs very well. Despite what you’ve
always heard about ReDim Preserve on arrays being a bad idea, it turns out
that the array processing overhead is minuscule compared to the string
concatenation issues noted above. Here’s how this approach lays out:
const ForReading = 1
dim strSearchThis
redim arrSearchThis(-1)   ' start with an empty dynamic array
dim i
dim objFS
dim objTS
set objFS = Server.CreateObject("Scripting.FileSystemObject")
set objTS = objFS.OpenTextFile(Server.MapPath("myfile.txt"), _
    ForReading)
i = 0
' Grow the array by one element per line read
do until objTS.AtEndOfStream
    redim preserve arrSearchThis(i)
    arrSearchThis(i) = objTS.ReadLine
    i = i + 1
loop
objTS.Close
' Build the searchable string in a single operation
strSearchThis = join(arrSearchThis, vbCrLf)
if instr(strSearchThis, "keyword") > 0 then
    Response.Write "Found it!"
end if
Test #1: Redimmed Array (all times in seconds)
| Method | 10 KB File | 100 KB File | 1000 KB File |
| Redimmed Array | 0.02 | 0.15 | 2.05 |
Not bad, eh? I generally stop when I get a 30- or 40-fold performance improvement,
but as every good infomercial commands, wait, there’s more!
Besides the more commonly used ReadAll and ReadLine
methods, the TextStream object also supports a Read(n)
method, where n is the number of characters to read. In a file
opened as ASCII, one character is one byte, so n can simply be the
size of the file in bytes. By instantiating an additional object (a
File object), we can obtain the size of the file to be read, and then
use the Read(n) method to race through our file in a single call.
As it turns out, the "read bytes" method is extremely fast by comparison:
const ForReading = 1
const TristateFalse = 0   ' open the file as ASCII
dim strSearchThis
dim objFS
dim objFile
dim objTS
set objFS = Server.CreateObject("Scripting.FileSystemObject")
' The File object supplies the file size in bytes
set objFile = objFS.GetFile(Server.MapPath("myfile.txt"))
set objTS = objFile.OpenAsTextStream(ForReading, TristateFalse)
' Read the whole file in one call by asking for Size characters
strSearchThis = objTS.Read(objFile.Size)
objTS.Close
if instr(strSearchThis, "keyword") > 0 then
    Response.Write "Found it!"
end if
Test #1: Read Bytes (all times in seconds)
| Method | 10 KB File | 100 KB File | 1000 KB File |
| Read Bytes | 0.01 | 0.03 | 0.28 |
A pretty good day’s work. We started at over a minute to perform this
read/search, and we’re now down well under a second. While there would
be some minor additional overhead associated with the additional object,
the massive speed improvement would in most cases be an appropriate tradeoff.
Wrap a function declaration around this snippet and you've got another
good tool for the toolbox!
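One way that wrapper might look is sketched below. The function name FileContainsKeyword and its parameters are hypothetical, and the sketch assumes an ASCII file, as in the tests above. The check for a zero-length file is an added precaution, not something the original tests covered.

```vbscript
Const ForReading = 1
Const TristateFalse = 0   ' open the file as ASCII

' Hypothetical wrapper (name and signature are illustrative only).
' Returns True if the file at strPath contains strKeyword.
Function FileContainsKeyword(strPath, strKeyword)
    Dim objFS, objFile, objTS, strContents
    FileContainsKeyword = False
    Set objFS = CreateObject("Scripting.FileSystemObject")
    Set objFile = objFS.GetFile(strPath)
    ' Assumption: an empty file has nothing to search, so skip the read
    If objFile.Size > 0 Then
        Set objTS = objFile.OpenAsTextStream(ForReading, TristateFalse)
        ' One Read call for the whole file, as in the "read bytes" test
        strContents = objTS.Read(objFile.Size)
        objTS.Close
        FileContainsKeyword = (InStr(strContents, strKeyword) > 0)
    End If
    Set objTS = Nothing
    Set objFile = Nothing
    Set objFS = Nothing
End Function
```

From an ASP page, a call such as `If FileContainsKeyword(Server.MapPath("myfile.txt"), "keyword") Then Response.Write "Found it!"` would reproduce the earlier snippet. InStr compares case-sensitively by default here, as it does in the tests.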
Test Summary (all times in seconds)
| Method | 10 KB File | 100 KB File | 1000 KB File |
| TextStream ReadAll | 0.01 | 0.62 | 73.56 |
| Standard Parse | 0.02 | 1.27 | 162.44 |
| Redimmed Array | 0.02 | 0.15 | 2.05 |
| Read Bytes | 0.01 | 0.03 | 0.28 |
Test Conditions
All tests were performed on an otherwise idle webserver configured with
128 MB RAM and a single 450 MHz Pentium II processor, running Windows
2000 Advanced Server (IIS 5.0). The test timings were done with the
VBScript Timer function, meaning that at the low-end
extremes (the 10 KB file readings), it would be imprudent to read
too much into differences of hundredths of a second between methods. All
timings included both setup tasks (variable dimensioning) and shutdown
tasks (object destruction). (For information on timing the execution
of ASP scripts, be sure to read "Timing
ASP Execution Using a Profiling Component" and
"Timing the Execution of Your ASP Scripts!")
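A minimal sketch of Timer-based timing, in the spirit of the measurements above, might look like this. It is not the actual test harness used for this article; the variable names are hypothetical, and the code under test is left as a placeholder comment.

```vbscript
Dim sngStart, sngElapsed

' Timer returns the number of seconds elapsed since midnight
sngStart = Timer

' ... place the setup, read/search, and cleanup code being timed here ...

sngElapsed = Timer - sngStart
' Assumption: if the test straddles midnight, Timer wraps back to zero,
' so add one day's worth of seconds to correct the difference
If sngElapsed < 0 Then sngElapsed = sngElapsed + 86400

Response.Write "Elapsed: " & FormatNumber(sngElapsed, 2) & " seconds"
```

Averaging several runs would help smooth out the noise in the very short timings.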
Credits: Billy Monroe asked the question in the
microsoft.public.scripting.vbscript newsgroup that got
this ball rolling, Bill James brought forward the "Redimmed Array" approach,
and Al Dunbar joined me in wondering aloud about the relative speed
of the Read(n) function. This article wouldn't have
happened without them.