QW Scripting: Sortlines
Sortlines 1.03 November 10, 2000
Sortlines is a subroutine that sorts a group of lines in a file into
ascending or descending sequence.
Information on sortlines is organized as follows:
There are four parameters to the sortlines subroutine:
- file: This is a required parameter.
A standard QW file object. This is the file that is to have its lines
sorted.
- startline: This is a required parameter.
The starting line to sort.
- endline: This is a required parameter.
The ending line to sort. All lines between startline and endline
are sorted.
- ascending: This is an optional parameter.
Pass true for ascending sort sequence and false for descending.
If the parameter is omitted, ascending is assumed.
The maximum number of lines that sortlines will sort is half the
maximum number of lines that will fit in the cache. This is a
reasonable limit that stops sortlist from being used to sort
unreasonably large files. On slow machines this limit is probably
too generous.
If the startline and endline are the same, sortlines returns
immediately and does not change the file.
If the selection is rectangular, Sortlines only sorts on the specified columns.
If the selection is not rectangular, the entire line is used to determine the sort sequence.
There is no return result from sortlines.
The following provides an example that sorts the file k.data
on the connection Production MPE
into ascending sequence. This script takes advantage of the
fact that the qslutilsort script is distributed with Qedit
for Windows and automatically loaded:
file = open(connection: "Production MPE", filename: "k.data");
qslutilsort.sortlines(file, 1, file.linecount);
file.close();
The basic design of the sortlines subroutine is straight-forward. But
the implementation is not. In order to minimize data movement (which can
be substantial when dealing with longer lines), we use an indirect method
of sorting.
We create a list of line numbers. We then use this list as "pointers" into
the actual contents of each line. When sorting, we do not shuffle lines
around. We just shuffle the line numbers in the list around. Once we have
a list of line numbers which represent the lines in sorted order, we then
create a new file with each of the lines from the original file.
In the end, we select all of the lines from the newfile we created,
copy them to the clipboard, and then we replace the lines in the original
file using a single paste operation. This means that the changes to the
original file are done in one single operation. Not only is this more
efficient, it also means that an "undo" after you run the sortlines script will
do what most users would expect. It will "undo" the sorted lines subroutine call.
Implementation Subroutines
We have a few helper subroutines that implement the design:
- Sortlist: This sorts the list of lines. This is
similar to our sortlist.html example.
- Sift: Similar to the sift subroutine of
sortlist.qsl, except that the comparisons are
all done using the comparelines subroutine.
- Comparelines: Recall that we are sorting a list of
line numbers. But these line numbers represent real lines in a file. The
comparelines subroutine gets the contents of two different lines and then
returns whether the line contents are less than, equal, or greater than
each other.
The basic sorting algorithm was designed so that all of the lines to
be sorted did not have to be in the cache. But during repeated testing
of this script on real host files, the visual feedback on the document
status bar made it clear that in every case the entire host file was
loaded into the local cache. For this reason we limit sorting to half
the maximum number of lines of the document cache.
Because sortlines.qsl uses the algorithm in
sortlist.qsl, the performance will be constrained
by the performance of the heapsort in sortlist.qsl (see
sortlist.html for performance details).
In order to time sortlines, we did the following test. Using host-based Qedit,
we did:
/text catalog.pub.sys
/keep k.data first/995
This resulted in a file with 250 lines. We then created the same file, but
without line numbers, on Alladin's local hard drive. We then ran the same
test twice:
- Using k.data.green on Production MPE.
- Using the same file contents, but as a local file on the C: drive.
The results show that even this relatively small test took a long time to
execute. We expect that with a faster CPU and a faster network connection,
these run times would be more reasonable.
Date:
| February 25, 1999 at 3:05 PM (Pacific)
|
Person:
| David
|
Machine:
| Vectra XU
|
CPU:
| Dual Pentium 133
|
Memory:
| 128 MB
|
OS:
| WinNT 4.0 SP3
|
Connection:
| ADSL
|
Host Timing:
| 3 minutes 56 seconds
|
Local Timing:
| 1 minute 40 seconds
|