Scheme 48 Manual | Contents | In Chapter: Libraries
Previous: Sorting lists and vectors | Next: Sorting lists and vectors

Sorting lists and vectors

(This section, as the libraries it describes, was written mostly by Olin Shivers for the draft of SRFI 32.)

The sort libraries in Scheme 48 include

Design rules

What vs. how

There are two different interfaces: "what" (simple) and "how" (detailed).

Simple
you specify semantics: datatype (list or vector), mutability, and stability.

Detailed
you specify the actual algorithm (quick, heap, insert, merge). Different algorithms have different properties, both semantic and pragmatic, so these exports are necessary.

It is necessarily the case that the specifications of these procedures make statements about execution "pragmatics." For example, the sole distinction between heap sort and quick sort--both of which are provided by this library---is one of execution time, which is not a "semantic" distinction. Similar resource-use statements are made about "iterative" procedures, meaning that they can execute on input of arbitrary size in a constant number of stack frames.

Consistency across procedure signatures

The two interfaces share common procedure signatures wherever possible, to facilitate switching a given call from one procedure to another.

Less-than parameter first, data parameter after

These procedures uniformly observe the following parameter order: the data to be sorted comes after the comparison procedure. That is, we write

  (sort < list)

not

  (sort list <)
Ordering, comparison procedures and stability

These routines take a < comparison procedure, not a <= comparison procedure, and they sort into increasing order. The difference between a < spec and a <= spec comes up in two places:

We say that a data set (a list or vector) is sorted or ordered if it contains no adjacent pair of values ...x, y ... such that y < x.

In other words, scanning across the data never takes a "downwards" step.

If you use a <= procedure where these algorithms expect a < procedure, you may not get the answers you expect. For example, the list-sorted? procedure will return false if you pass it a <= comparison procedure and an ordered list containing adjacent equal elements.

A "stable" sort is one that preserves the pre-existing order of equal elements. Suppose, for example, that we sort a list of numbers by comparing their absolute values, i.e., using comparison procedure

(lambda (x y) (< (abs x) (abs y)))
If we sort a list that contains both 3 and -3:
...3, ..., -3 ...
then a stable sort is an algorithm that will not swap the order of these two elements, that is, the answer is guaranteed to to look like
...3, -3 ...
not
...-3, 3 ...
Choosing < for the comparison procedure instead of <= affects how stability is coded. Given an adjacent pair x, y, (< y x) means "x should be moved in front of x"--otherwise, leave things as they are. So using a <= procedure where a < procedure is expected will invert stability.

This is due to the definition of equality, given a < comparator:

    (and (not (< x y))
         (not (< y x)))
The definition is rather different, given a <= comparator:
    (and (<= x y)
         (<= y x))
A "stable" merge is one that reliably favors one of its data sets when equal items appear in both data sets. All merge operations in this library are stable, breaking ties between data sets in favor of the first data set--elements of the first list come before equal elements in the second list.

So, if we are merging two lists of numbers ordered by absolute value, the stable merge operation list-merge

    (list-merge (lambda (x y) (< (abs x) (abs y)))
                '(0 -2 4 8 -10) '(-1 3 -4 7))
reliably places the 4 of the first list before the equal-comparing -4 of the second list:
    (0 -1 -2 4 -4 7 8 -10)
Some sort algorithms will not work correctly if given a <= when they expect a < comparison (or vice-versa).

In short, if your comparison procedure f answers true to (f x x), then

Note that you can synthesize a < procedure from a <= procedure with
    (lambda (x y) (not (<= y x)))
if need be.

Precise definitions give sharp edges to tools, but require care in use. "Measure twice, cut once."

All vector operations accept optional subrange parameters

The vector operations specified below all take optional start/end arguments indicating a selected subrange of a vector's elements. If a start parameter or start/end parameter pair is given to such a procedure, they must be exact, non-negative integers, such that

0 <= start <= end <= (vector-length vector)
where vector is the related vector parameter. If not specified, they default to 0 and the length of the vector, respectively. They are interpreted to select the range [start,end), that is, all elements from index start (inclusive) up to, but not including, index end.

Required vs. allowed side-effects

List-sort! and List-stable-sort! are allowed, but not required, to alter their arguments' cons cells to construct the result list. This is consistent with the what-not-how character of the group of procedures to which they belong (the sorting structure).

The list-delete-neighbor-dups!, list-merge! and list-merge-sort! procedures, on the other hand, provide specific algorithms, and, as such, explicitly commit to the use of side-effects on their input lists in order to guarantee their key algorithmic properties (e.g., linear-time operation).

Procedure specification

Structure name Functionality
sorting General sorting for lists and vectors
sorted Sorted predicates for lists and vectors
list-merge-sort List merge sort
vector-merge-sort Vector merge sort
vector-heap-sort Vector heap sort
vector-insert-sort Vector insertion sort
delete-neighbor-duplicates List and vector delete neighbor duplicates
binary-searches Vector binary search
Note that there is no "list insert sort" package, as you might as well always use list merge sort. The reference implementation's destructive list merge sort will do fewer set-cdr!s than a destructive insert sort.

Procedure naming and functionality

Almost all of the procedures described below are variants of two basic operations: sorting and merging. These procedures are consistently named by composing a set of basic lexemes to indicate what they do.

Lexeme Meaning
sort The procedure sorts its input data set by some < comparison procedure.
merge The procedure merges two ordered data sets into a single ordered result.
stable This lexeme indicates that the sort is a stable one.
vector The procedure operates upon vectors.
list The procedure operates upon lists.
! Procedures that end in ! are allowed, and sometimes required, to reuse their input storage to construct their answer.

Types of parameters and return values

In the procedures specified below,

Passing values to procedures with these parameters that do not satisfy these types is an error.

If a procedure is said to return "unspecified," this means that nothing at all is said about what the procedure returns, not even the number of return values. Such a procedure is not even required to be consistent from call to call in the nature or number of its return values. It is simply required to return a value (or values) that may be passed to a command continuation, e.g. as the value of an expression appearing as a non-terminal subform of a begin expression. Note that in R5RS, this restricts such a procedure to returning a single value; non-R5RS systems may not even provide this restriction.

sorting--general sorting package

This library provides basic sorting and merging functionality suitable for general programming. The procedures are named by their semantic properties, i.e., what they do to the data (sort, stable sort, merge, and so forth).

Procedure Suggested algorithm
list-sort vector heap or quick
list-sort! list merge sort
list-stable-sort vector merge sort
list-stable-sort! list merge sort
vector-sort heap or quick sort
vector-sort! or quick sort
vector-stable-sort vector merge sort
vector-stable-sort! merge sort
List-Sorted? and vector-sorted? return true if their input list or vector is in sorted order, as determined by their < comparison parameter.

All four merge operations are stable: an element of the initial list list1 or vector vector1 will come before an equal-comparing element in the second list list2 or vector vector2 in the result.

The procedures

do not alter their inputs and are allowed to return a value that shares a common tail with a list argument.

The procedure

are "linear update" operators--they are allowed, but not required, to alter the cons cells of their arguments to produce their results.

On the other hand, the list-merge! procedure make only a single, iterative, linear-time pass over its argument list, using set-cdr!s to rearrange the cells of the list into the final result --it works "in place." Hence, any cons cell appearing in the result must have originally appeared in an input. The intent of this iterative-algorithm commitment is to allow the programmer to be sure that if, for example, list-merge! is asked to merge two ten-million-element lists, the operation will complete without performing some extremely (possibly twenty-million) deep recursion.

The vector procedures

do not alter their inputs, but allocate a fresh vector for their result, of length end - start.

The vector procedures

sort their data in-place. (But note that vector-stable-sort! may allocate temporary storage proportional to the size of the input .)

Vector-merge returns a vector of length (end_1-start_1+(end_2-start_2).

Vector-merge! writes its result into vector v, beginning at index start, for indices less than end = start + (end_1-start_1) + (end_2-start_2). The target subvector v[start,end) may not overlap either source subvector vector_1[start_1,end_1) vector_2[start_2,end_2).

The ...-delete-neighbor-dups-... procedures: These procedures delete adjacent duplicate elements from a list or a vector, using a given element-equality procedure. The first/leftmost element of a run of equal elements is the one that survives. The list or vector is not otherwise disordered.

These procedures are linear time--much faster than the O(n2) general duplicate-element deletors that do not assume any "bunching" of elements (such as the ones provided by SRFI 1). If you want to delete duplicate elements from a large list or vector, you can sort the elements to bring equal items together, then use one of these procedures, for a total time of O(nlog(n)).

The comparison procedure = passed to these procedures is always applied (= x y) where x comes before y in the containing list or vector.

Examples:

(list-delete-neighbor-dups = '(1 1 2 7 7 7 0 -2 -2))
  ==> (1 2 7 0 -2)

(vector-delete-neighbor-dups = '#(1 1 2 7 7 7 0 -2 -2))
  ==> #(1 2 7 0 -2)

(vector-delete-neighbor-dups = '#(1 1 2 7 7 7 0 -2 -2) 3 7)
  ==> #(7 0 -2)

Algorithm-specific sorting packages

These packages provide more specific sorting functionality, that is, specific committment to particular algorithms that have particular pragmatic consequences (such as memory locality, asymptotic running time) beyond their semantic behaviour (sorting, stable sorting, merging, etc.). Programmers that need a particular algorithm can use one of these packages.

sorted--sorted predicates

Return #f iff there is an adjacent pair ...x, y ... in the input list or vector such that y < x. The optional start/end range arguments restrict vector-sorted? to the indicated subvector.

list-merge-sort--list merge sort
The sort procedures sort their data using a list merge sort, which is stable. (The reference implementation is, additionally, a "natural" sort. See below for the properties of this algorithm.)

The ! procedures are destructive--they use set-cdr!s to rearrange the cells of the lists into the proper order. As such, they do not allocate any extra cons cells--they are "in place" sorts. The merge operations are stable: an element of list1 will come before an equal-comparing element in list2 in the result list.

vector-merge-sort--vector merge sort

The sort procedures sort their data using vector merge sort, which is stable. (The reference implementation is, additionally, a "natural" sort. See below for the properties of this algorithm.)

The optional start/end arguments provide for sorting of subranges, and default to 0 and the length of the corresponding vector.

Merge-sorting a vector requires the allocation of a temporary "scratch" work vector for the duration of the sort. This scratch vector can be passed in by the client as the optional temp argument; if so, the supplied vector must be of size <= end, and will not be altered outside the range [start,end). If not supplied, the sort routines allocate one themselves.

The merge operations are stable: an element of vector1 will come before an equal-comparing element in vector2 in the result vector.

vector-heap-sort--vector heap sort

These procedures sort their data using heap sort, which is not a stable sorting algorithm.

Vector-heap-sort returns a vector of length end-start. Vector-heap-sort! is in-place, leaving its result in vector[start,end).

vector-insert-sort--vector insertion sort

These procedures stably sort their data using insertion sort.

delete-neighbor-duplicates--list and vector delete neighbor duplicates

These procedures delete adjacent duplicate elements from a list or a vector, using a given element-equality procedure =. The first/leftmost element of a run of equal elements is the one that survives. The list or vector is not otherwise disordered.

These procedures are linear time--much faster than the O(n2) general duplicate-element deletors that do not assume any "bunching" of elements (such as the ones provided by SRFI 1). If you want to delete duplicate elements from a large list or vector, you can sort the elements to bring equal items together, then use one of these procedures, for a total time of O(nlog(n)).

The comparison procedure = passed to these procedures is always applied

(= x y)

where x comes before y in the containing list or vector.

Examples:

(list-delete-neighbor-dups = '(1 1 2 7 7 7 0 -2 -2))
  ==> (1 2 7 0 -2)

(vector-delete-neighbor-dups = '#(1 1 2 7 7 7 0 -2 -2))
  ==> #(1 2 7 0 -2)

(vector-delete-neighbor-dups = '#(1 1 2 7 7 7 0 -2 -2) 3 7)
  ==> #(7 0 -2)

;; Result left in v[3,9):
(let ((v (vector 0 0 0 1 1 2 2 3 3 4 4 5 5 6 6)))
  (cons (vector-delete-neighbor-dups! = v 3)
        v))
   ==> (9 . #(0 0 0 1 2 3 4 5 6 4 4 5 5 6 6))

binary-searches--vector binary search

vector-binary-search searches vector in range [start,end) (which default to 0 and the length of vector, respectively) for an element whose associated key is equal to key. The procedure elt->key is used to map an element to its associated key. The elements of the vector are assumed to be ordered by the < relation on these keys. That is,

(vector-sorted? (lambda (x y) (< (elt-$>$key x) (elt-$>$key y)))
                vector start end) ==> true

An element e of vector is a match for key if it's neither less nor greater than the key:

(and (not (< (elt-$>$key e) key))
     (not (< key (elt-$>$key e))))

If there is such an element, the procedure returns its index in the vector as an exact integer. If there is no such element in the searched range, the procedure returns false.

(vector-binary-search < car 4 '#((1 . one) (3 . three)
                                 (4 . four) (25 . twenty-five)))
==> 2

(vector-binary-search < car 7 '#((1 . one) (3 . three)
                                 (4 . four) (25 . twenty-five)))
==> #f

Vector-binary-search3 is a variant that uses a three-way comparison procedure compare-proc. Compare-proc compares its parameter to the search key, and returns an exact integer whose sign indicates its relationship to the search key.

(compare-proc x) < 0 => x < search-key
(compare-proc x) = 0 => x = search-key
(compare-proc x) > 0 => x > search-key

(vector-binary-search3 (lambda (elt) (- (car elt) 4))
                       '#((1 . one) (3 . three)
                          (4 . four) (25 . twenty-five)))
==> 2

Algorithmic properties

Different sort and merge algorithms have different properties. Choose the algorithm that matches your needs:

Vector insert sort
Stable, but only suitable for small vectors--O(n2).
Vector heap sort
Not stable. Guaranteed fast--O(nlog(n)) worst case. Poor locality on large vectors. A very reliable workhorse.
Vector merge sort
Stable. Not in-place--requires a temporary buffer of equal size. Fast--O(nlog(n))--and has good memory locality for large vectors.

The implementation of vector merge sort provided by this implementation is, additionally, a "natural" sort, meaning that it exploits existing order in the input data, providing O(n) best case.

Destructive list merge sort
Stable, fast and in-place (i.e., allocates no new cons cells). "Fast" means O(nlog(n)) worse-case, and substantially better if the data is already mostly ordered, all the way down to linear time for a completely-ordered input list (i.e., it is a "natural" sort).

Note that sorting lists involves chasing pointers through memory, which can be a loser on modern machine architectures because of poor cache and page locality. Sorting vectors has inherently better locality.

This implementation's destructive list merge and merge sort implementations are opportunistic--they avoid redundant set-cdr!s, and try to take long already-ordered runs of list structure as-is when doing the merges.

Pure list merge sort
Stable and fast--O(nlog(n)) worst-case, and possibly O(n), depending upon the input list (see discussion above).

Algorithm Stable? Worst case Average case In-place
Vector insert Yes O(n2) O(n2) Yes
Vector quick No O(n2) O(nlog(n)) Yes
Vector heap No O(nlog(n)) O(nlog(n)) Yes
Vector merge Yes O(nlog(n)) O(nlog(n)) No
List merge Yes O(nlog(n)) O(nlog(n)) Either

Previous: Sorting lists and vectors | Next: Sorting lists and vectors