SimpSamp

SimpSamp is a Common Lisp library for simple random sampling without replacement. In English, that means it will return a requested number of random elements from a set, without duplicating any of the elements.

For instance, if you have a list of 1,000,000 elements and want 1,000 unique random elements from the list, SimpSamp will do the trick.

1.1. Obtaining SimpSamp

The latest version of SimpSamp, with accompanying documentation, can be found at: http://www.thoughtcrime.us/software/simpsamp/

The most recent release is 0.2, released 2011-01-28. It depends on: cl-jpl-util 0.3

simpsamp-0.2.tar.gz: ASDF package
simpsamp-0.2.tar.gz.sig: OpenPGP detached signature

I sign all my software with OpenPGP, key ID 0xE979FFBEA002D20F, fingerprint A87B 1C5A 28C4 03BD 54BA CE8E E979 FFBE A002 D20F. (Releases were previously signed with 0x80555CED7394F948, which has been revoked, but not compromised. See my OpenPGP transition statement.)

1.2. Copying

The software and this document are licensed under permissive, BSD-like terms, copied from the ISC license:

Copyright © 2009, Jean-Paul Guy Larocque
Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.
THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

1.3. Contact Information

The author welcomes feedback, questions, help requests, and bug reports via e-mail: J.P. Larocque <jpl-software at thoughtcrime.us>

2. Usage Guide

2.1. Data Structures and Output Methods

SimpSamp supports many different data structures for input and output.

Data structures SimpSamp can take samples of

Lists
Vectors
Hash tables (keys, values, or key/value pairs as cons cells)
Integer ranges
"Bounded" iterator functions: known size value and a "next" function
"Unbounded" iterator functions: a "next" function and an "exhausted?" function ^[1]

Methods SimpSamp can use to communicate its result

As a LIST.
As a VECTOR.
As a HASH-TABLE (only when the input set is a hash table).
As iteration with a macro (e.g. DO-SAMPLE-OF-VECTOR, like DOLIST).
With functional iteration for effect, without accumulation (e.g. MAP-SAMPLE-OF-VECTOR, like MAP with the NIL result-type).

2.2. Examples

Print three random packages loaded on the system:

(simpsamp:map-sample #'print 3 (list-all-packages))

#<PACKAGE "SIMPSAMP"> 
#<PACKAGE "SB-POSIX"> 
#<PACKAGE "SB-VM">
  => (No values.)

Build a list of 1000 integers, then take a random sample of 10:

(let ((ints (loop for i below 1000 collecting i)))
  (simpsamp:sample 10 ints))
  => #(716 773 507 93 4 157 693 204 880 680)

Do the same thing, but don't build up an intermediate list of integers:

(simpsamp:vector-sample-of-range 10 0 1000)
  => #(94 161 215 243 257 613 639 646 702 793)

Loop over 10 random characters of the English alphabet:

(simpsamp:do-sample-of-vector (char 10 "abcdefghijklmnopqrstuvwxyz")
  (format t "~&Got: ~S" char))
Got: #\a
Got: #\d
Got: #\f
Got: #\g
Got: #\j
Got: #\m
Got: #\p
Got: #\s
Got: #\t
Got: #\y
  => (No values.)

Print 10 random characters of the English alphabet:

(simpsamp:map-sample-of-vector #'princ 10 "abcdefghijklmnopqrstuvwxyz")
cdeijmtuwx
  => (No values.)

2.3. Algorithms

SimpSamp implements two algorithms: selection sampling and reservoir sampling, both described in [TAoCP]. Both algorithms take a set and a sample size, and return a simple random sample, without selection, of the requested size from the given set. Both algorithms (as implemented in SimpSamp) take input values from the set as a stream; random access is not required.

Selection sampling requires knowing the size of the input set up-front. It has constant space complexity, as it streams out elements from the selected sample set.

Reservoir sampling does not require knowing the size of the input set ahead of time, but does require space linear to the requested sample size. It returns the selected sample set as a vector.

Both algorithms run in time linear to the size of their input sets.

2.4. Efficiency

The most efficient algorithms and data processing techniques were aimed for in each combination of input data structure and output method, toward the goal of avoiding unnecessary data structure traversal and memory allocation (permanent and temporary).

It is recommended that you choose the output data structure for sample results that is most natural (first priority) and efficient (second priority) for your purposes. Remember that vectors are more efficient than lists: lists need two pointers per element; vectors only need one per element.

For instance, if you need random access to a sample, vectors are the obvious choice. If you just need to iterate, and you need to hold onto the sample for more than one iteration, vectors are still usually the best choice—they can be iterated just as well as lists, while taking less memory—but lists might be a better fit and are still a fine choice in terms of algorithmic complexity. If you don't need to store the sample, and just need to iterate over it once, one of the iteration macros or one of the functional iteration functions would be best; in nearly all iteration cases, the entire sample doesn't even need to be stored in memory all at once (i.e. space complexity O(1)). See: Section 3.3.2, “DO-SAMPLE-OF-*: The Iteration Macros”, Section 3.3.3, “MAP-SAMPLE-OF-*: The Iteration Functions”

Due to overhead in function application, the iteration macros are usually faster than the functional iteration functions (but that may not be true on your platform).

3. Reference

SimpSamp provides three categories of functions and macros.

Categories of Functions and Macros

Section 3.2, “Generic Functions”: Functions that can be used with a few different types of input sets.
Section 3.3, “Specialized Functions and Macros”: Functions and macros for a wider variety of input and output types, optimized for each combination. Faster, but less flexible.
Section 3.4, “Low-Level Macros”: The core algorithms, implemented as macros.

3.1. Conventions and Common Semantics

When a "random sample" is said to be taken, this is shorthand for a simple random sample without selection.

The order of elements within any random sample is not assured to be random, even though it may appear so at first glance with some functions and macros. The only thing that is random is which selection is made of all the possible subsets of the input set that is of the requested size.

Whenever a random sample is requested with a size larger than the input set, the requested size is limited to the size of the input set. In this case, the resulting sample is the entire input set.

Whenever a parameter calls for a function, any function designator (such as a symbol naming a function) may be used.

The result of mutating a set while it is being iterated upon is undefined.

3.2. Generic Functions

SimpSamp offers a few generic functions, which are explained in this section. These generic functions have methods which specialize on LIST, VECTOR, and HASH-TABLE.

3.2.1. `SAMPLE`: Samples as Sequences

SAMPLE

Syntax.

(SAMPLE N SET)
=> (A sample, as a SEQUENCE.)

Returns a random sample of size N from SET.

When SET is a mapping type (e.g. a hash table), the returned elements will be CONS cells of the form (KEY . VALUE).

Function Parameters

N: The size of the sample to take of the input set.
SET: The set to take the sample of.

3.2.2. `{VECTOR,LIST}-SAMPLE`: Samples as Vectors or Lists

VECTOR-SAMPLE

LIST-SAMPLE

Syntax.

(VECTOR-SAMPLE N SET)
=> (A sample, as a VECTOR.)

(LIST-SAMPLE N SET)
=> (A sample, as a LIST.)

Returns a random sample of size N from SET.

Other than the resulting sequence type, the parameters and semantics are identical to those of SAMPLE. See: Section 3.2.1, “SAMPLE: Samples as Sequences”

3.2.3. `MAP-SAMPLE`: Sample Iteration

MAP-SAMPLE

Syntax.

(MAP-SAMPLE FUNCTION N SET)
=> (No values.)

Iterates over a random sample of a set. For each iteration, FUNCTION is called with the current entry of the sample. The result of each call to FUNCTION is discarded—results are not accumulated.

This function is similar to MAP (with the NIL result-type).

Function Parameters

FUNCTION

The function to call with each entry of the selected random sample. The arguments given to FUNCTION vary with the type of SET.

When SET is a mapping type (e.g. a hash table), FUNCTION is applied with two arguments: the key and the value of the current entry.

For other types of SET, FUNCTION is applied with one argument: the value of the current element.

N, SET

Same meaning as in SAMPLE. See: Section 3.2.1, “SAMPLE: Samples as Sequences”

3.3. Specialized Functions and Macros

Most of the functions and macros provided by SimpSamp are specialized for their input and output types. Each one is optimized for that combination of input and output type, and can only operate on the type of input set it was designed for.

This section explains these functions and macros in detail.

3.3.1. Table by Input and Output Types

The following table lists the available functions and macros, listed according to the type of input they operate on (row-wise) and how they emit their output (column-wise).

Input as...	Output as...
Input as...	Macro Iteration	Functional Iteration	Vector	List	Hash Table
List	`DO-SAMPLE-OF-LIST`^[2]	`MAP-SAMPLE-OF-LIST`	`VECTOR-SAMPLE-OF-LIST`	`LIST-SAMPLE-OF-LIST`
Vector	`DO-SAMPLE-OF-VECTOR`	`MAP-SAMPLE-OF-VECTOR`	`VECTOR-SAMPLE-OF-VECTOR`	`LIST-SAMPLE-OF-VECTOR`
Hash table	`DO-SAMPLE-OF-HASH-TABLE`	`MAP-SAMPLE-OF-HASH-TABLE`	`VECTOR-SAMPLE-OF-HASH-TABLE-PAIRS`	`LIST-SAMPLE-OF-HASH-TABLE-PAIRS`	`HASH-TABLE-SAMPLE-OF-HASH-TABLE`
		`MAP-SAMPLE-OF-HASH-TABLE-KEYS`	`VECTOR-SAMPLE-OF-HASH-TABLE-KEYS`	`LIST-SAMPLE-OF-HASH-TABLE-KEYS`
		`MAP-SAMPLE-OF-HASH-TABLE-VALUES`	`VECTOR-SAMPLE-OF-HASH-TABLE-VALUES`	`LIST-SAMPLE-OF-HASH-TABLE-VALUES`
Integer Range	`DO-SAMPLE-OF-RANGE`	`MAP-SAMPLE-OF-RANGE`	`VECTOR-SAMPLE-OF-RANGE`	`LIST-SAMPLE-OF-RANGE`
Iterator (bounded)	`DO-SAMPLE-OF-ITERATOR`	`MAP-SAMPLE-OF-ITERATOR`	`VECTOR-SAMPLE-OF-ITERATOR`	`LIST-SAMPLE-OF-ITERATOR`
Iterator (unbounded)	`DO-SAMPLE-OF-UNBOUNDED-ITERATOR`^[3]	`MAP-SAMPLE-OF-UNBOUNDED-ITERATOR`^[4]	`VECTOR-SAMPLE-OF-UNBOUNDED-ITERATOR`	`LIST-SAMPLE-OF-UNBOUNDED-ITERATOR`^[5]

3.3.2. `DO-SAMPLE-OF-*`: The Iteration Macros

DO-SAMPLE-OF-LIST^[2]

DO-SAMPLE-OF-VECTOR

DO-SAMPLE-OF-HASH-TABLE

DO-SAMPLE-OF-RANGE

DO-SAMPLE-OF-ITERATOR

DO-SAMPLE-OF-UNBOUNDED-ITERATOR^[3]

Syntax.

(DO-SAMPLE-OF-<input type> (<variable-binding specification> N-FORM <set specification> &key EFFECTIVE-N-VAR INITIAL-FORM RESULT-FORM)
  &body BODY)
=> (The result of RESULT-FORM, or no values if not given.)

These macros iterate over a random sample of a set. For each iteration, one or more variables are bound to the current entry of the sample, and a given body of code is evaluated.

All the macros establish an implicit block named NIL. BODY may terminate the loop prematurely and specify the resulting values of the macro with RETURN.

These macros are analogous to DOLIST and DOTIMES.

Macro Parameters

<variable-binding specification>

Expresses which variables to bind to each entry of the selection for the evaluation of BODY. The syntax varies with each macro.

Syntax Possibilities for <variable-binding specification> by Macro

DO-SAMPLE-OF-LIST, DO-SAMPLE-OF-VECTOR, DO-SAMPLE-OF-RANGE, DO-SAMPLE-OF-UNBOUNDED-ITERATOR

<variable-binding specification> ::= VAR

Bound to each selected element.

DO-SAMPLE-OF-HASH-TABLE

<variable-binding specification> ::= KEY-VAR VALUE-VAR

Bound to the key and value of each selected entry.

DO-SAMPLE-OF-ITERATOR

<variable-binding specification> ::= VARS

Either a list of symbols naming variables to be bound to each corresponding value returned by the iterator for each chosen entry, or a symbol naming a variable to be bound to a list of all the values returned by the iterator for each chosen entry.

It is unspecified whether fresh bindings are made for each evaluation of BODY, or whether the same bindings are reused each time.

N-FORM

Evaluated to produce the size of the sample to take of the input set.

<set specification>

Expresses the set to take the sample of. See: Section 3.3.6.1, “Set Specification”

The DO-SAMPLE-OF-* macros evaluate each element of the set specification as expressions to produce the corresponding value to use.

EFFECTIVE-N-VAR

A symbol naming a variable to be bound with the effective number of entries of the selected sample. (The result of N-FORM is limited by the actual size of the input set.)

When EFFECTIVE-N-VAR is given, it is bound for the evaluation of INITIAL-FORM, RESULT-FORM, and BODY.

INITIAL-FORM

A form to be evaluated when iteration is about to begin.

Useful in conjunction with EFFECTIVE-N-VAR to initialize any output data structures (but this technique is mostly useful to other functions in SimpSamp that use the DO-SAMPLE-OF-* macros).

RESULT-FORM

A form to be evaluated when iteration ends. The result of this form is used as the result of the DO-SAMPLE-OF-* macro. If not given, there are no values in the result of the macro.

BODY

Evaluated for each entry of the selected sample of the input set.

3.3.3. `MAP-SAMPLE-OF-*`: The Iteration Functions

MAP-SAMPLE-OF-LIST

MAP-SAMPLE-OF-VECTOR

MAP-SAMPLE-OF-HASH-TABLE

MAP-SAMPLE-OF-HASH-TABLE-KEYS

MAP-SAMPLE-OF-HASH-TABLE-VALUES

MAP-SAMPLE-OF-RANGE

MAP-SAMPLE-OF-ITERATOR

MAP-SAMPLE-OF-UNBOUNDED-ITERATOR^[4]

Syntax.

(MAP-SAMPLE-OF-<input type> FUNCTION N <set specification>)
=> (No values.)

Iterates over a random sample of a set. For each iteration, FUNCTION is called with the current entry of the sample as one or more arguments. The result of each call to FUNCTION is discarded—results are not accumulated.

These functions are like specialized versions of MAP (with the NIL result-type).

Function Parameters

FUNCTION

The function to call with each entry of the selected random sample. The arguments given to FUNCTION vary with each MAP-SAMPLE-OF-* function:

Arguments Given to FUNCTION by Function

MAP-SAMPLE-OF-LIST, MAP-SAMPLE-OF-VECTOR, MAP-SAMPLE-OF-RANGE, MAP-SAMPLE-OF-UNBOUNDED-ITERATOR

<arguments> ::= ELEMENT

The selected element.

MAP-SAMPLE-OF-HASH-TABLE

<arguments> ::= KEY VALUE

The key and value, respectively, of the selected hash table entry.

MAP-SAMPLE-OF-HASH-TABLE-KEYS

<arguments> ::= KEY

The key of the selected hash table entry.

MAP-SAMPLE-OF-HASH-TABLE-VALUES

<arguments> ::= VALUE

The value of the selected hash table entry.

MAP-SAMPLE-OF-ITERATOR

<arguments> ::= &REST VALUES

For each selected entry, all the values returned by the NEXT-FN function of the iterator are given to FUNCTION as corresponding arguments.

N

The size of the sample to take of the input set.

<set specification>

Expresses the set to take the sample of. See: Section 3.3.6.1, “Set Specification”

3.3.4. `{VECTOR,LIST}-SAMPLE-OF-*`: Samples as Vectors and Lists

VECTOR-SAMPLE-OF-LIST

VECTOR-SAMPLE-OF-VECTOR

VECTOR-SAMPLE-OF-HASH-TABLE-PAIRS

VECTOR-SAMPLE-OF-HASH-TABLE-KEYS

VECTOR-SAMPLE-OF-HASH-TABLE-VALUES

VECTOR-SAMPLE-OF-RANGE

VECTOR-SAMPLE-OF-ITERATOR

VECTOR-SAMPLE-OF-UNBOUNDED-ITERATOR

LIST-SAMPLE-OF-LIST

LIST-SAMPLE-OF-VECTOR

LIST-SAMPLE-OF-HASH-TABLE-PAIRS

LIST-SAMPLE-OF-HASH-TABLE-KEYS

LIST-SAMPLE-OF-HASH-TABLE-VALUES

LIST-SAMPLE-OF-RANGE

LIST-SAMPLE-OF-ITERATOR

LIST-SAMPLE-OF-UNBOUNDED-ITERATOR^[5]

Syntax.

(VECTOR-SAMPLE-OF-<input type> N <set specification>)
=> (A VECTOR.)

(LIST-SAMPLE-OF-<input type> N <set specification>)
=> (A LIST.)

Returns a random sample of a set as either a vector or a list.

In no case should^[6] it be more efficient to request a list than it would be to request the same thing as a vector. For best performance, it is recommended that you use vectors unless your situation demands otherwise.

The values of the elements of the resulting sequence vary with each {VECTOR,LIST}-SAMPLE-OF-* function:

Resulting Elements of Sequence by Function

VECTOR-SAMPLE-OF-LIST, VECTOR-SAMPLE-OF-VECTOR, VECTOR-SAMPLE-OF-RANGE, VECTOR-SAMPLE-OF-UNBOUNDED-ITERATOR, LIST-SAMPLE-OF-LIST, LIST-SAMPLE-OF-VECTOR, LIST-SAMPLE-OF-RANGE, LIST-SAMPLE-OF-UNBOUNDED-ITERATOR

<element> ::= ELEMENT

The selected element.

VECTOR-SAMPLE-OF-HASH-TABLE-PAIRS, LIST-SAMPLE-OF-HASH-TABLE-PAIRS

<element> ::= (KEY . VALUE)

The key and value, as a CONS, of the selected hash table entry.

VECTOR-SAMPLE-OF-HASH-TABLE-KEYS, LIST-SAMPLE-OF-HASH-TABLE-KEYS

<element> ::= KEY

The key of the selected hash table entry.

VECTOR-SAMPLE-OF-HASH-TABLE-VALUES, LIST-SAMPLE-OF-HASH-TABLE-VALUES

<element> ::= VALUE

The value of the selected hash table entry.

VECTOR-SAMPLE-OF-ITERATOR, LIST-SAMPLE-OF-ITERATOR

<element> ::= PRIMARY-VALUE

For each selected entry, only the primary value returned by the NEXT-FN function of the iterator is represented in the returned sequence.

Function Parameters

N: The size of the sample to take of the input set.
<set specification>: Expresses the set to take the sample of. See: Section 3.3.6.1, “Set Specification”

3.3.5. `HASH-TABLE-SAMPLE-OF-HASH-TABLE`: Samples as Hash Tables

HASH-TABLE-SAMPLE-OF-HASH-TABLE

Syntax.

(HASH-TABLE-SAMPLE-OF-HASH-TABLE N <set specification>)
=> (A HASH-TABLE.)

This function returns a random sample of a set (as a HASH-TABLE). The sample is returned as a hash table.

Function Parameters

N: The size of the sample to take of the input set.
<set specification>: Expresses the set to take the sample of. Only hash tables are supported as input. See: Section 3.3.6.1, “Set Specification”

3.3.6. Common Syntax

The specialized functions and macros have some common syntactic elements. This section describes them.

3.3.6.1. Set Specification

A Set Specification expresses an input set to take a sample of. The Set Specification syntax depends on what kind of input is being operated upon.

Set Specification Syntax by Input Set Type

List

<set specification> ::= LIST

A list.

Vector

<set specification> ::= VECTOR

A vector.

Hash table

<set specification> ::= HASH-TABLE

A hash table.

Integer range

<set specification> ::= LOWER UPPER

The inclusive lower bound and exclusive upper bound of the sequence of integers in the input set.

Iterator (bounded)

<set specification> ::= SIZE NEXT-FN

SIZE is the number of entries in the input set. NEXT-FN is a function of no arguments that returns the next entry of the set, advancing through the set.

NEXT-FN is never called more times than the size of the set.

NEXT-FN may produce multiple values. Together, we refer to the sequence of values as an entry of the set. Depending on the function or macro operating on the set, the values may be turned into a list to be used as the value of a single variable, multiple variables may be bound to the values, or a single variable may be bound only to the primary return value.

Iterator (unbounded)

<set specification> ::= NEXT-FN EXHAUSTED?-FN

NEXT-FN is a function of no arguments that returns the next element of the set, advancing through the set. EXHAUSTED?-FN is a function of no arguments that returns a generalized boolean indicating whether the end of the set has been reached.

EXHAUSTED?-FN is always called before each call to NEXT-FN. The first time that EXHAUSTED?-FN returns true, NEXT-FN will not be called again.

Unlike bounded iterators, only the primary return value of NEXT-FN is used.

3.4. Low-Level Macros

The generic functions and specialized functions/macros in SimpSamp depend on two core algorithms, both implemented as macros.

The algorithms are implemented as macros because we want to stream the input of the algorithms, element by element, from other components, and (in the case of selection sampling) stream the output of the algorithms to other components. This is unnecessarily slow (on SBCL) when done the functional way. With macros, different parts can still be composed (sample a vector → sample an integer range → sample an abstract iterator), and functional interfaces are provided that present a clean interface to the fast internal macro-based code.

I've found that using macros to layer functionality gives me the same speed as a hand-written function that does exactly what I want. During prototyping, using functional techniques slowed things down up to about 25%.

This solution isn't the most elegant, but the speed matters for where I intended to use the library.

3.4.1. `DO-SAMPLE-OF-ITERATOR-EXPR`: Selection Sampling

DO-SAMPLE-OF-ITERATOR-EXPR

Syntax.

(DO-SAMPLE-OF-ITERATOR-EXPR (VARS N-FORM SIZE-FORM NEXT-FORM &key EFFECTIVE-N-VAR INITIAL-FORM RESULT-FORM)
  &body BODY)
=> (The result of RESULT-FORM, or no values if not given.)

This macro iterates over a random sample of a set given as an expression of its size and an expression which evaluates to the next entry of the set, advancing through the set. For each iteration, one or more variables are bound to the current entry of the sample, and a given body of code is evaluated.

This macro is nearly identical in syntax and semantics to the DO-SAMPLE-OF-ITERATOR macro. The key difference is that this macro takes the next entry of the input set by evaluating NEXT-FORM each time, whereas with DO-SAMPLE-OF-ITERATOR, NEXT-FN is evaluated once to produce a function which is called each time. With this difference highlighted, the specification of this macro defers to: Section 3.3.2, “DO-SAMPLE-OF-*: The Iteration Macros”

3.4.2. `VECTOR-SAMPLE-OF-UNBOUNDED-ITERATOR-EXPR`: Reservoir Sampling

VECTOR-SAMPLE-OF-UNBOUNDED-ITERATOR-EXPR

Syntax.

(VECTOR-SAMPLE-OF-UNBOUNDED-ITERATOR-EXPR N-FORM NEXT-FORM EXHAUSTED?-FORM)
=> (A vector.)

Evaluates to a random sample of a set. The set is given as an expression which evaluates to the next entry of the set, advancing through the set, and an expression which evaluates to a generalized boolean indicating whether the end of the set has been reached.

This macro is nearly identical in syntax and semantics to the VECTOR-SAMPLE-OF-UNBOUNDED-ITERATOR function. The key differences are that this macro is—of course—a macro, and that this macro takes the next entry of the input set by evaluating EXHAUSTED?-FORM and NEXT-FORM each time, whereas with VECTOR-SAMPLE-OF-UNBOUNDED-ITERATOR, NEXT-FN and EXHAUSTED?-FN are evaluated once to produce functions which are called each time. With these differences highlighted, the specification of this macro defers to: Section 3.3.4, “{VECTOR,LIST}-SAMPLE-OF-*: Samples as Vectors and Lists”

4. To Do

4.1. Extensibility Concerns

The way SimpSamp processes data is fairly fixed, and I think it could be hard to extend it to support new modes of input and output, such as returning an iteration function, without some retooling of the internals. The current design is primarily motivated by efficiency (minimizing both CPU time and consing).

If you would like to use SimpSamp but are having trouble fitting it into your application, please let me know.

I'm also seeking ideas of how to improve the internal design to make things less tightly-coupled, while maintaining the current performance.

4.2. Faster Integer Range Sampling

For sampling a range of integers, I think Method D described in [Vitter87] might be applicable and might run in time proportional to the sample size, rather than the input set size. I need to verify that this is true and consider implementing it.

It looks like implementation could be complicated by concerns of floating-point accuracy.

References

[TAoCP] Donald E. Knuth. The Art of Computer Programming. 2. 2nd Edition. 0-201-03822-6. Addison-Wesley. Reading Mass. USA. Jan., 1981. 136-139.

[Vitter87] Jeffrey Scott Vitter. An Efficient Algorithm for Sequential Random Sampling. 58-67. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.4335. ACM Transactions on Mathematical Software (TOMS). 0098-3500. 13. 1. Mar., 1987.

^[1]The iterator can't really be unbounded—it must eventually terminate.

^[2]DO-SAMPLE-OF-LIST traverses its list twice: once to take the length of the list, and once to iterate.

^[3]DO-SAMPLE-OF-UNBOUNDED-ITERATOR allocates the entire random sample in memory before iteration can begin, because it does not know the length of the set it is taking the sample of. This is contrary to all the other DO-SAMPLE-OF-* macros, which require constant space.

^[4]MAP-SAMPLE-OF-UNBOUNDED-ITERATOR allocates the entire random sample in memory before iteration can begin, because it does not know the length of the set it is taking the sample of. This is contrary to all the other MAP-SAMPLE-OF-* functions, which require constant space.

^[5]LIST-SAMPLE-OF-UNBOUNDED-ITERATOR allocates the random sample in memory as a vector before it can convert it to a list, due to the reservoir sampling algorithm, which requires random access to its working sample. This is contrary to all the other LIST-SAMPLE-OF-* functions, which allocate only once, for the resulting list.

(Random access on a list would be unacceptably slow. Using a binary tree based on cons cells or a vector would not save allocation.)

^[6]This performance assumption has not yet been measured.

SimpSamp

J.P. Larocque

1. Introduction

1.1. Obtaining SimpSamp

1.2. Copying

1.3. Contact Information

2. Usage Guide

2.1. Data Structures and Output Methods

2.2. Examples

2.3. Algorithms

2.4. Efficiency

3. Reference

3.1. Conventions and Common Semantics

3.2. Generic Functions

3.2.1. SAMPLE: Samples as Sequences

3.2.2. {VECTOR,LIST}-SAMPLE: Samples as Vectors or Lists

3.2.3. MAP-SAMPLE: Sample Iteration

3.3. Specialized Functions and Macros

3.3.1. Table by Input and Output Types

3.3.2. DO-SAMPLE-OF-*: The Iteration Macros

3.3.3. MAP-SAMPLE-OF-*: The Iteration Functions

3.3.4. {VECTOR,LIST}-SAMPLE-OF-*: Samples as Vectors and Lists

3.3.5. HASH-TABLE-SAMPLE-OF-HASH-TABLE: Samples as Hash Tables

3.3.6. Common Syntax

3.3.6.1. Set Specification

3.4. Low-Level Macros

3.4.1. DO-SAMPLE-OF-ITERATOR-EXPR: Selection Sampling

3.4.2. VECTOR-SAMPLE-OF-UNBOUNDED-ITERATOR-EXPR: Reservoir Sampling

4. To Do

4.1. Extensibility Concerns

4.2. Faster Integer Range Sampling

References

3.2.1. `SAMPLE`: Samples as Sequences

3.2.2. `{VECTOR,LIST}-SAMPLE`: Samples as Vectors or Lists

3.2.3. `MAP-SAMPLE`: Sample Iteration

3.3.2. `DO-SAMPLE-OF-*`: The Iteration Macros

3.3.3. `MAP-SAMPLE-OF-*`: The Iteration Functions

3.3.4. `{VECTOR,LIST}-SAMPLE-OF-*`: Samples as Vectors and Lists

3.3.5. `HASH-TABLE-SAMPLE-OF-HASH-TABLE`: Samples as Hash Tables

3.4.1. `DO-SAMPLE-OF-ITERATOR-EXPR`: Selection Sampling

3.4.2. `VECTOR-SAMPLE-OF-UNBOUNDED-ITERATOR-EXPR`: Reservoir Sampling