Table of Contents
SimpSamp is a Common Lisp library for simple random sampling without replacement. In English, that means it will return a requested number of random elements from a set, without duplicating any of the elements.
For instance, if you have a list of 1,000,000 elements and want 1,000 unique random elements from the list, SimpSamp will do the trick.
The latest version of SimpSamp, with accompanying documentation, can be found at: http://www.thoughtcrime.us/software/simpsamp/
The most recent release is 0.2, released 20110128. It depends on: cljplutil 0.3
I sign all my software with OpenPGP, key ID 0x80555CED7394F948, fingerprint 2771 AF53 5D09 BDFB A8D0 BEF3 8055 5CED 7394 F948.
The software and this document are licensed under permissive, BSDlike terms, copied from the ISC license:
Copyright © 2009, JeanPaul Guy Larocque
Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.
THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
SimpSamp supports many different data structures for input and output.
Data structures SimpSamp can take samples of
Methods SimpSamp can use to communicate its result
LIST
.VECTOR
.HASHTABLE
(only when the input set is a hash table).DOSAMPLEOFVECTOR
,
like DOLIST
).MAPSAMPLEOFVECTOR
,
like MAP
with the NIL
resulttype).Print three random packages loaded on the system:
(simpsamp:mapsample #'print 3 (listallpackages))
#<PACKAGE "SIMPSAMP">
#<PACKAGE "SBPOSIX">
#<PACKAGE "SBVM">
=> (No values.)
Build a list of 1000 integers, then take a random sample of 10:
(let ((ints (loop for i below 1000 collecting i))) (simpsamp:sample 10 ints)) => #(716 773 507 93 4 157 693 204 880 680)
Do the same thing, but don't build up an intermediate list of integers:
(simpsamp:vectorsampleofrange 10 0 1000) => #(94 161 215 243 257 613 639 646 702 793)
Loop over 10 random characters of the English alphabet:
(simpsamp:dosampleofvector (char 10 "abcdefghijklmnopqrstuvwxyz")
(format t "~&Got: ~S" char))
Got: #\a
Got: #\d
Got: #\f
Got: #\g
Got: #\j
Got: #\m
Got: #\p
Got: #\s
Got: #\t
Got: #\y
=> (No values.)
Print 10 random characters of the English alphabet:
(simpsamp:mapsampleofvector #'princ 10 "abcdefghijklmnopqrstuvwxyz")
cdeijmtuwx
=> (No values.)
SimpSamp implements two algorithms: selection sampling and reservoir sampling, both described in [TAoCP]. Both algorithms take a set and a sample size, and return a simple random sample, without selection, of the requested size from the given set. Both algorithms (as implemented in SimpSamp) take input values from the set as a stream; random access is not required.
Selection sampling requires knowing the size of the input set upfront. It has constant space complexity, as it streams out elements from the selected sample set.
Reservoir sampling does not require knowing the size of the input set ahead of time, but does require space linear to the requested sample size. It returns the selected sample set as a vector.
Both algorithms run in time linear to the size of their input sets.
The most efficient algorithms and data processing techniques were aimed for in each combination of input data structure and output method, toward the goal of avoiding unnecessary data structure traversal and memory allocation (permanent and temporary).
It is recommended that you choose the output data structure for sample results that is most natural (first priority) and efficient (second priority) for your purposes. Remember that vectors are more efficient than lists: lists need two pointers per element; vectors only need one per element.
For instance, if you need random access to a sample,
vectors are the obvious choice. If you just need to iterate,
and you need to hold onto the sample for more than one
iteration, vectors are still usually the best
choice—they can be iterated just as well as lists, while
taking less memory—but lists might be a better fit and
are still a fine choice in terms of algorithmic complexity.
If you don't need to store the sample, and just need to
iterate over it once, one of
the iteration macros or
one of the functional
iteration functions would be best; in nearly all
iteration cases, the entire sample doesn't even need to be
stored in memory all at once (i.e. space complexity O(1)).
See:
Section 3.3.2, “DOSAMPLEOF*
: The Iteration Macros”, Section 3.3.3, “MAPSAMPLEOF*
: The Iteration Functions”
Due to overhead in function application, the iteration macros are usually faster than the functional iteration functions (but that may not be true on your platform).
SimpSamp provides three categories of functions and macros.
Categories of Functions and Macros
Functions that can be used with a few different types of input sets.
Functions and macros for a wider variety of input and output types, optimized for each combination. Faster, but less flexible.
The core algorithms, implemented as macros.
When a "random sample" is said to be taken, this is shorthand for a simple random sample without selection.
The order of elements within any random sample is not assured to be random, even though it may appear so at first glance with some functions and macros. The only thing that is random is which selection is made of all the possible subsets of the input set that is of the requested size.
Whenever a random sample is requested with a size larger than the input set, the requested size is limited to the size of the input set. In this case, the resulting sample is the entire input set.
Whenever a parameter calls for a function, any function designator (such as a symbol naming a function) may be used.
The result of mutating a set while it is being iterated upon is undefined.
SimpSamp offers a few generic functions, which are
explained in this section. These generic functions have
methods which specialize
on LIST
, VECTOR
,
and HASHTABLE
.
SAMPLE 
Syntax.
(SAMPLE
N
SET
) => (A sample, as aSEQUENCE
.)
Returns a random sample of size N
from SET
.
When SET
is a mapping type (e.g. a
hash table), the returned elements will
be CONS
cells of the
form (
.KEY
. VALUE
)
Function Parameters
N
The size of the sample to take of the input set.
SET
The set to take the sample of.
VECTORSAMPLE 
LISTSAMPLE 
Syntax.
(VECTORSAMPLE
N
SET
) => (A sample, as aVECTOR
.)
(LISTSAMPLE
N
SET
) => (A sample, as aLIST
.)
Returns a random sample of size N
from SET
.
Other than the resulting sequence type, the parameters
and semantics are identical to those
of SAMPLE
. See:
Section 3.2.1, “SAMPLE
: Samples as Sequences”
MAPSAMPLE 
Syntax.
(MAPSAMPLE
FUNCTION
N
SET
) => (No values.)
Iterates over a random sample of a set. For each
iteration, FUNCTION
is called with the
current entry of the sample. The result of each call
to FUNCTION
is discarded—results
are not accumulated.
This function is similar
to MAP
(with the NIL
resulttype).
Function Parameters
FUNCTION
The function to call with each entry of the
selected random sample. The arguments given
to FUNCTION
vary with the type
of SET
.
When SET
is a mapping type
(e.g. a hash table), FUNCTION
is
applied with two arguments: the key and the value of the
current entry.
For other types
of SET
, FUNCTION
is applied with one argument: the value of the current
element.
N
, SET
Same meaning as in SAMPLE
.
See: Section 3.2.1, “SAMPLE
: Samples as Sequences”
Most of the functions and macros provided by SimpSamp are specialized for their input and output types. Each one is optimized for that combination of input and output type, and can only operate on the type of input set it was designed for.
This section explains these functions and macros in detail.
The following table lists the available functions and macros, listed according to the type of input they operate on (rowwise) and how they emit their output (columnwise).
Input as...  Output as...  

Macro Iteration  Functional Iteration  Vector  List  Hash Table  
List  DOSAMPLEOFLIST ^{[2]} 
MAPSAMPLEOFLIST 
VECTORSAMPLEOFLIST 
LISTSAMPLEOFLIST 

Vector  DOSAMPLEOFVECTOR 
MAPSAMPLEOFVECTOR 
VECTORSAMPLEOFVECTOR 
LISTSAMPLEOFVECTOR 

Hash table  DOSAMPLEOFHASHTABLE 
MAPSAMPLEOFHASHTABLE 
VECTORSAMPLEOFHASHTABLEPAIRS 
LISTSAMPLEOFHASHTABLEPAIRS 
HASHTABLESAMPLEOFHASHTABLE 
MAPSAMPLEOFHASHTABLEKEYS 
VECTORSAMPLEOFHASHTABLEKEYS 
LISTSAMPLEOFHASHTABLEKEYS 

MAPSAMPLEOFHASHTABLEVALUES 
VECTORSAMPLEOFHASHTABLEVALUES 
LISTSAMPLEOFHASHTABLEVALUES 

Integer Range  DOSAMPLEOFRANGE 
MAPSAMPLEOFRANGE 
VECTORSAMPLEOFRANGE 
LISTSAMPLEOFRANGE 

Iterator (bounded)  DOSAMPLEOFITERATOR 
MAPSAMPLEOFITERATOR 
VECTORSAMPLEOFITERATOR 
LISTSAMPLEOFITERATOR 

Iterator (unbounded)  DOSAMPLEOFUNBOUNDEDITERATOR ^{[3]} 
MAPSAMPLEOFUNBOUNDEDITERATOR ^{[4]} 
VECTORSAMPLEOFUNBOUNDEDITERATOR 
LISTSAMPLEOFUNBOUNDEDITERATOR ^{[5]} 
DOSAMPLEOFLIST ^{[2]} 
DOSAMPLEOFVECTOR 
DOSAMPLEOFHASHTABLE 
DOSAMPLEOFRANGE 
DOSAMPLEOFITERATOR 
DOSAMPLEOFUNBOUNDEDITERATOR ^{[3]} 
Syntax.
(DOSAMPLEOF<input type>
(<variablebinding specification>NFORM
<set specification> &keyEFFECTIVENVAR
INITIALFORM
RESULTFORM
) &bodyBODY
) => (The result ofRESULTFORM
, or no values if not given.)
These macros iterate over a random sample of a set. For each iteration, one or more variables are bound to the current entry of the sample, and a given body of code is evaluated.
All the macros establish an implicit block
named NIL
. BODY
may
terminate the loop prematurely and specify the resulting
values of the macro
with RETURN
.
These macros are analogous
to DOLIST
and DOTIMES
.
Macro Parameters
Expresses which variables to bind to each entry of
the selection for the evaluation
of BODY
. The syntax varies with
each macro.
Syntax Possibilities for <variablebinding specification> by Macro
DOSAMPLEOFLIST
, DOSAMPLEOFVECTOR
, DOSAMPLEOFRANGE
, DOSAMPLEOFUNBOUNDEDITERATOR
<variablebinding specification> ::= VAR
Bound to each selected element.
DOSAMPLEOFHASHTABLE
<variablebinding specification> ::= KEYVAR VALUEVAR
Bound to the key and value of each selected entry.
DOSAMPLEOFITERATOR
<variablebinding specification> ::= VARS
Either a list of symbols naming variables to be bound to each corresponding value returned by the iterator for each chosen entry, or a symbol naming a variable to be bound to a list of all the values returned by the iterator for each chosen entry.
It is unspecified whether fresh bindings are made
for each evaluation of BODY
, or
whether the same bindings are reused each time.
NFORM
Evaluated to produce the size of the sample to take of the input set.
Expresses the set to take the sample of. See: Section 3.3.6.1, “Set Specification”
The DOSAMPLEOF*
macros
evaluate each element of the set specification as
expressions to produce the corresponding value to
use.
EFFECTIVENVAR
A symbol naming a variable to be bound with the
effective number of entries of the selected sample.
(The result of NFORM
is limited by
the actual size of the input set.)
When EFFECTIVENVAR
is given,
it is bound for the evaluation
of INITIALFORM
, RESULTFORM
,
and BODY
.
INITIALFORM
A form to be evaluated when iteration is about to begin.
Useful in conjunction
with EFFECTIVENVAR
to initialize
any output data structures (but this technique is
mostly useful to other functions in SimpSamp that use
the DOSAMPLEOF*
macros).
RESULTFORM
A form to be evaluated when iteration ends. The
result of this form is used as the result of
the DOSAMPLEOF*
macro. If not given,
there are no values in the result of the macro.
BODY
Evaluated for each entry of the selected sample of the input set.
MAPSAMPLEOFLIST 
MAPSAMPLEOFVECTOR 
MAPSAMPLEOFHASHTABLE 
MAPSAMPLEOFHASHTABLEKEYS 
MAPSAMPLEOFHASHTABLEVALUES 
MAPSAMPLEOFRANGE 
MAPSAMPLEOFITERATOR 
MAPSAMPLEOFUNBOUNDEDITERATOR ^{[4]} 
Syntax.
(MAPSAMPLEOF<input type>
FUNCTION
N
<set specification>) => (No values.)
Iterates over a random sample of a set. For each
iteration, FUNCTION
is called with the
current entry of the sample as one or more arguments. The
result of each call to FUNCTION
is
discarded—results are not accumulated.
These functions are like specialized versions
of MAP
(with the NIL
resulttype).
Function Parameters
FUNCTION
The function to call with each entry of the
selected random sample. The arguments given
to FUNCTION
vary with
each MAPSAMPLEOF*
function:
Arguments Given to FUNCTION
by Function
MAPSAMPLEOFLIST
, MAPSAMPLEOFVECTOR
, MAPSAMPLEOFRANGE
, MAPSAMPLEOFUNBOUNDEDITERATOR
<arguments> ::= ELEMENT
The selected element.
MAPSAMPLEOFHASHTABLE
<arguments> ::=KEY
VALUE
The key and value, respectively, of the selected hash table entry.
MAPSAMPLEOFHASHTABLEKEYS
<arguments> ::= KEY
The key of the selected hash table entry.
MAPSAMPLEOFHASHTABLEVALUES
<arguments> ::= VALUE
The value of the selected hash table entry.
MAPSAMPLEOFITERATOR
<arguments> ::= &REST VALUES
For each selected entry, all the values
returned by the NEXTFN
function of the iterator are given
to FUNCTION
as corresponding
arguments.
N
The size of the sample to take of the input set.
Expresses the set to take the sample of. See: Section 3.3.6.1, “Set Specification”
VECTORSAMPLEOFLIST 
VECTORSAMPLEOFVECTOR 
VECTORSAMPLEOFHASHTABLEPAIRS 
VECTORSAMPLEOFHASHTABLEKEYS 
VECTORSAMPLEOFHASHTABLEVALUES 
VECTORSAMPLEOFRANGE 
VECTORSAMPLEOFITERATOR 
VECTORSAMPLEOFUNBOUNDEDITERATOR 
LISTSAMPLEOFLIST 
LISTSAMPLEOFVECTOR 
LISTSAMPLEOFHASHTABLEPAIRS 
LISTSAMPLEOFHASHTABLEKEYS 
LISTSAMPLEOFHASHTABLEVALUES 
LISTSAMPLEOFRANGE 
LISTSAMPLEOFITERATOR 
LISTSAMPLEOFUNBOUNDEDITERATOR ^{[5]} 
Syntax.
(VECTORSAMPLEOF<input type>
N
<set specification>) => (AVECTOR
.)
(LISTSAMPLEOF<input type>
N
<set specification>) => (ALIST
.)
Returns a random sample of a set as either a vector or a list.
In no case should^{[6]} it be more efficient to request a list than it would be to request the same thing as a vector. For best performance, it is recommended that you use vectors unless your situation demands otherwise.
The values of the elements of the resulting sequence
vary with each {VECTOR,LIST}SAMPLEOF*
function:
Resulting Elements of Sequence by Function
VECTORSAMPLEOFLIST
, VECTORSAMPLEOFVECTOR
, VECTORSAMPLEOFRANGE
, VECTORSAMPLEOFUNBOUNDEDITERATOR
, LISTSAMPLEOFLIST
, LISTSAMPLEOFVECTOR
, LISTSAMPLEOFRANGE
, LISTSAMPLEOFUNBOUNDEDITERATOR
<element> ::= ELEMENT
The selected element.
VECTORSAMPLEOFHASHTABLEPAIRS
, LISTSAMPLEOFHASHTABLEPAIRS
<element> ::= (KEY
.VALUE
)
The key and value, as
a CONS
,
of the selected hash table entry.
VECTORSAMPLEOFHASHTABLEKEYS
, LISTSAMPLEOFHASHTABLEKEYS
<element> ::= KEY
The key of the selected hash table entry.
VECTORSAMPLEOFHASHTABLEVALUES
, LISTSAMPLEOFHASHTABLEVALUES
<element> ::= VALUE
The value of the selected hash table entry.
VECTORSAMPLEOFITERATOR
, LISTSAMPLEOFITERATOR
<element> ::= PRIMARYVALUE
For each selected entry, only
the primary
value returned by
the NEXTFN
function of the
iterator is represented in the returned
sequence.
Function Parameters
N
The size of the sample to take of the input set.
Expresses the set to take the sample of. See: Section 3.3.6.1, “Set Specification”
HASHTABLESAMPLEOFHASHTABLE 
Syntax.
(HASHTABLESAMPLEOFHASHTABLE
N
<set specification>) => (AHASHTABLE
.)
This function returns a random sample of a set (as
a HASHTABLE
).
The sample is returned as a hash table.
Function Parameters
N
The size of the sample to take of the input set.
Expresses the set to take the sample of. Only hash tables are supported as input. See: Section 3.3.6.1, “Set Specification”
The specialized functions and macros have some common syntactic elements. This section describes them.
A Set Specification expresses an input set to take a sample of. The Set Specification syntax depends on what kind of input is being operated upon.
Set Specification Syntax by Input Set Type
<set specification> ::= LIST
A list.
<set specification> ::= VECTOR
A vector.
<set specification> ::= HASHTABLE
A hash table.
<set specification> ::=LOWER
UPPER
The inclusive lower bound and exclusive upper bound of the sequence of integers in the input set.
<set specification> ::=SIZE
NEXTFN
SIZE
is the number of entries
in the input set. NEXTFN
is a
function of no arguments that returns the next entry
of the set, advancing through the set.
NEXTFN
is never called more
times than the size of the set.
NEXTFN
may produce multiple
values. Together, we refer to the sequence of
values as an entry of the set.
Depending on the function or macro operating on the
set, the values may be turned into a list to be used
as the value of a single variable, multiple
variables may be bound to the values, or a single
variable may be bound only to
the primary
return value.
<set specification> ::=NEXTFN
EXHAUSTED?FN
NEXTFN
is a function of no
arguments that returns the next element of the set,
advancing through the
set. EXHAUSTED?FN
is a function
of no arguments that returns a generalized boolean
indicating whether the end of the set has been
reached.
EXHAUSTED?FN
is always called
before each call to NEXTFN
. The
first time that EXHAUSTED?FN
returns true, NEXTFN
will not be
called again.
Unlike bounded iterators, only
the primary
return value of NEXTFN
is used.
The generic functions and specialized functions/macros in SimpSamp depend on two core algorithms, both implemented as macros.
The algorithms are implemented as macros because we want to stream the input of the algorithms, element by element, from other components, and (in the case of selection sampling) stream the output of the algorithms to other components. This is unnecessarily slow (on SBCL) when done the functional way. With macros, different parts can still be composed (sample a vector → sample an integer range → sample an abstract iterator), and functional interfaces are provided that present a clean interface to the fast internal macrobased code.
I've found that using macros to layer functionality gives me the same speed as a handwritten function that does exactly what I want. During prototyping, using functional techniques slowed things down up to about 25%.
This solution isn't the most elegant, but the speed matters for where I intended to use the library.
DOSAMPLEOFITERATOREXPR 
Syntax.
(DOSAMPLEOFITERATOREXPR
(VARS
NFORM
SIZEFORM
NEXTFORM
&keyEFFECTIVENVAR
INITIALFORM
RESULTFORM
) &bodyBODY
) => (The result ofRESULTFORM
, or no values if not given.)
This macro iterates over a random sample of a set given as an expression of its size and an expression which evaluates to the next entry of the set, advancing through the set. For each iteration, one or more variables are bound to the current entry of the sample, and a given body of code is evaluated.
This macro is nearly identical in syntax and semantics
to
the DOSAMPLEOFITERATOR
macro. The key difference is that this macro takes the next
entry of the input set by
evaluating NEXTFORM
each time, whereas
with DOSAMPLEOFITERATOR
, NEXTFN
is evaluated once to produce a function which is called each
time. With this difference highlighted, the specification
of this macro defers to:
Section 3.3.2, “DOSAMPLEOF*
: The Iteration Macros”
VECTORSAMPLEOFUNBOUNDEDITERATOREXPR 
Syntax.
(VECTORSAMPLEOFUNBOUNDEDITERATOREXPR
NFORM
NEXTFORM
EXHAUSTED?FORM
) => (A vector.)
Evaluates to a random sample of a set. The set is given as an expression which evaluates to the next entry of the set, advancing through the set, and an expression which evaluates to a generalized boolean indicating whether the end of the set has been reached.
This macro is nearly identical in syntax and semantics
to
the VECTORSAMPLEOFUNBOUNDEDITERATOR
function. The key differences are that this macro
is—of course—a macro, and that this macro takes
the next entry of the input set by
evaluating EXHAUSTED?FORM
and NEXTFORM
each time, whereas
with VECTORSAMPLEOFUNBOUNDEDITERATOR
, NEXTFN
and EXHAUSTED?FN
are evaluated once to
produce functions which are called each time. With these
differences highlighted, the specification of this macro
defers to: Section 3.3.4, “{VECTOR,LIST}SAMPLEOF*
: Samples as Vectors and Lists”
The way SimpSamp processes data is fairly fixed, and I think it could be hard to extend it to support new modes of input and output, such as returning an iteration function, without some retooling of the internals. The current design is primarily motivated by efficiency (minimizing both CPU time and consing).
If you would like to use SimpSamp but are having trouble fitting it into your application, please let me know.
I'm also seeking ideas of how to improve the internal design to make things less tightlycoupled, while maintaining the current performance.
For sampling a range of integers, I think Method D described in [Vitter87] might be applicable and might run in time proportional to the sample size, rather than the input set size. I need to verify that this is true and consider implementing it.
It looks like implementation could be complicated by concerns of floatingpoint accuracy.
^{[1] }The iterator can't really be unbounded—it must eventually terminate.
^{[2] }DOSAMPLEOFLIST
traverses its list twice: once to take the length
of the list, and once to iterate.
^{[3] }DOSAMPLEOFUNBOUNDEDITERATOR
allocates the entire random sample in memory
before iteration can begin, because it does not
know the length of the set it is taking the sample
of. This is contrary to all the
other DOSAMPLEOF*
macros,
which require constant space.
^{[4] }MAPSAMPLEOFUNBOUNDEDITERATOR
allocates the entire random sample in memory
before iteration can begin, because it does not
know the length of the set it is taking the sample
of. This is contrary to all the
other MAPSAMPLEOF*
functions, which require constant space.
^{[5] }LISTSAMPLEOFUNBOUNDEDITERATOR
allocates the random sample in memory as a vector
before it can convert it to a list, due to the
reservoir sampling algorithm, which requires
random access to its working sample. This is
contrary to all the
other LISTSAMPLEOF*
functions, which allocate only once, for the
resulting list.
(Random access on a list would be unacceptably slow. Using a binary tree based on cons cells or a vector would not save allocation.)
^{[6] }This performance assumption has not yet been measured.