QW Scripting: Sortlines

Sortlines 1.03 November 10, 2000

Sortlines is a subroutine that sorts a group of lines in a file into ascending or descending sequence. Information on sortlines is organized as follows:

Parameters
Result
Sorting An Entire File
Design
Limits
Performance

Sortlines Parameters

There are four parameters to the sortlines subroutine:

file: This is a required parameter. A standard QW file object. This is the file that is to have its lines sorted.
startline: This is a required parameter. The starting line to sort.
endline: This is a required parameter. The ending line to sort. All lines between startline and endline are sorted.
ascending: This is an optional parameter. Pass true for ascending sort sequence and false for descending. If the parameter is omitted, ascending is assumed.

The maximum number of lines that sortlines will sort is half the maximum number of lines that will fit in the cache. This is a reasonable limit that stops sortlist from being used to sort unreasonably large files. On slow machines this limit is probably too generous.

If the startline and endline are the same, sortlines returns immediately and does not change the file.

If the selection is rectangular, Sortlines only sorts on the specified columns. If the selection is not rectangular, the entire line is used to determine the sort sequence.

Sortlines Result

There is no return result from sortlines.

Sorting An Entire File

The following provides an example that sorts the file k.data on the connection Production MPE into ascending sequence. This script takes advantage of the fact that the qslutilsort script is distributed with Qedit for Windows and automatically loaded:

	file = open(connection: "Production MPE", filename: "k.data");

	qslutilsort.sortlines(file, 1, file.linecount);

	file.close();

Sortlines Design

The basic design of the sortlines subroutine is straight-forward. But the implementation is not. In order to minimize data movement (which can be substantial when dealing with longer lines), we use an indirect method of sorting.

We create a list of line numbers. We then use this list as "pointers" into the actual contents of each line. When sorting, we do not shuffle lines around. We just shuffle the line numbers in the list around. Once we have a list of line numbers which represent the lines in sorted order, we then create a new file with each of the lines from the original file. In the end, we select all of the lines from the newfile we created, copy them to the clipboard, and then we replace the lines in the original file using a single paste operation. This means that the changes to the original file are done in one single operation. Not only is this more efficient, it also means that an "undo" after you run the sortlines script will do what most users would expect. It will "undo" the sorted lines subroutine call.

Implementation Subroutines

We have a few helper subroutines that implement the design:

Sortlist: This sorts the list of lines. This is similar to our sortlist.html example.
Sift: Similar to the sift subroutine of sortlist.qsl, except that the comparisons are all done using the comparelines subroutine.
Comparelines: Recall that we are sorting a list of line numbers. But these line numbers represent real lines in a file. The comparelines subroutine gets the contents of two different lines and then returns whether the line contents are less than, equal, or greater than each other.

Limits

The basic sorting algorithm was designed so that all of the lines to be sorted did not have to be in the cache. But during repeated testing of this script on real host files, the visual feedback on the document status bar made it clear that in every case the entire host file was loaded into the local cache. For this reason we limit sorting to half the maximum number of lines of the document cache.

Performance

Because sortlines.qsl uses the algorithm in sortlist.qsl, the performance will be constrained by the performance of the heapsort in sortlist.qsl (see sortlist.html for performance details).

In order to time sortlines, we did the following test. Using host-based Qedit, we did:

     /text catalog.pub.sys
     /keep k.data first/995

This resulted in a file with 250 lines. We then created the same file, but without line numbers, on Alladin's local hard drive. We then ran the same test twice:

Using k.data.green on Production MPE.
Using the same file contents, but as a local file on the C: drive.

The results show that even this relatively small test took a long time to execute. We expect that with a faster CPU and a faster network connection, these run times would be more reasonable.

Date: February 25, 1999 at 3:05 PM (Pacific)
Person: David
Machine: Vectra XU
CPU: Dual Pentium 133
Memory: 128 MB
OS: WinNT 4.0 SP3
Connection: ADSL
Host Timing: 3 minutes 56 seconds
Local Timing: 1 minute 40 seconds

Date:	February 25, 1999 at 3:05 PM (Pacific)
Person:	David
Machine:	Vectra XU
CPU:	Dual Pentium 133
Memory:	128 MB
OS:	WinNT 4.0 SP3
Connection:	ADSL
Host Timing:	3 minutes 56 seconds
Local Timing:	1 minute 40 seconds