What does RANDPERM miss?

Question

1 vote

Abstractly, a random permutation generator (RPG) is a function that maps a random bit-string of length b into a permutation of {1,2, ..., n}.

There are 2^b distinct bit-strings of length b and n! distinct permutations of length n. Hence we must have 2^b > n! Using Stirling's approximation for n! and taking logs of both sides of the inequality we get b > n*log2(n).

Now RANDPERM can generate a permutation of length n = 10^6 in 0.2 sec and part of its output looks like this: 383882, 905744, 760491, 138901. However, to generate all of the (10^6)! ~ 10^(5,565,709) possible permutations the inequality above says that b > 10^6*Log2(10^6) ~ 2*10^7 bits are necessary. But RANDPERM uses indirectly the Mersenne Twister generator whose internal state has about 625*32 = 2*10^4 bits and so it cannot possibly generate all (10^6)! permutations. Indeed, it must miss most of these permutations.

My question is: What permutations does RANDPERM (or any other RPG) miss? Are these misses spread uniformly over the permutation space?

4 Comments
Show 2 older comments Hide 2 older comments

Oliver Woodford on 26 Jul 2011

The randperm I have (R2010b) looks like this:

[~,p] = sort(rand(1,n));

So in the case that n = 2 (for example), rand(1,n) will generate a huge number of different streams, but the sort will project these down on to only two different permutations.

Derek O'Connor on 26 Jul 2011

Oliver,

Thanks for your example. I see what you mean now.

Here is a crude generalization:

1. If the size of the RNG space, |R|, is greater than the size of

the permutation space |P|, then FYS(U) will be many to 1.

2. If |R| = |P| then FYS(U) will be 1:1

3. If |R| << |P| then FYS(U) will be ??

It is the third case that bothers me and prompted my question.

Sign in to comment.

Sign in to answer this question.

Follow Question

Answer 1

Jan on 23 Jul 2011

Open in MATLAB Online

1 vote

You are right: The Mersenne-Twister cannot create all permutations from a certain limit. The period of 10^6000 means that the sequence of the created 32 bit numbers is repeated after this number of steps. RANDPERM uses 53 bit doubles, which are built from two 32 integers, such that the half period must be taken.

While the sequence of numbers is repeated after 0.5*10^6000 RNG calls only, the values of the 53 bit floating point number are repeated earlier. RANDPERM uses Matlab's SORT, which is a stable quicksort implementation. Here stability means, that the order of equal inputs is kept in the output. This leads to a lovely bias, which can be found without waiting for years by reducing the RNG range drastically:

rngLimit = 5;
n        = 40;
elem     = 1:n
elemSum  = zeros(1, n);
indexSum = zeros(1, n);
for i = 1:10000
  random    = floor(rand(1, n) * rngLimit);
  randomSum = randomSum + random;
  [igno, p] = sort(random);  % This is RANDPERM
  elemSum   = elemSum + elem(p);
end
% plot(randomSum)   % no bias
plot(elemSum)     % cute bias

However, for 53 bits and periods of 0.5*10^6000 the determination of the bias will lead to other problems: There are only about 10^80 nucleons in the universe. This limits the possibilities to store a list of already chosen permutations. So there are some permutations not reachable by RANDPERM, but the chance to suffer from this is tiny.

As you, Derek, know, I suggest not to use RANDPERM for large data sets, but the Fisher-Yates-Shuffle, e.g. implemented in FEX: RPG Lab or FEX: Shuffle. I've implemented the KISS RNG in Shuffle, which has a period of 10^37 and thought of using Marsaglias' CMWC4096 with a period of 10^33000. But as far as I can see, the limits of the universe are a stricter bound of the number of computable permutations.

I'd looking forward to know, if any advanced mathematician shares this estimation.

5 Comments
Show 3 older comments Hide 3 older comments

Derek O'Connor on 24 Jul 2011

Jan,

Thanks for the program and graph.

A small point: Marsaglia says that the period of his CMWC4096

generator is 10^39460 or about 10^33000 times as large as MT's

10^6000.

In what follows I will assume that the integers {1,2,...,n} are to

be shuffled or permuted, that KISS has a period T = 10^37, MT a

period T = 10^6000, and CMWC4096 a period T = 10^39460.

If an RPG uses an RNG with period T then, to cover (gain access to)

the whole permutation space, n should be restricted so that n! < T.

This means your Shuffle should have n < 26, my GRPfys or RANDPERM

should have n < 1840, and if CMWC is used then n < 9878.

I'm not too worried that there are a mere 10^80 particles in the

known universe because I will never ask an RPG to generate all

permutations of size n = 10^6. What I would like is a very small

sample ns = 10,100,1000 from the gigantic space of (10^6)!

permutations of size 10^6 and that these permutations be selected

or drawn uniformly from this gigantic space.

If an RPG violates the n! < T restriction then I do not know what

happens because I do not know which parts of the permutation space

the RPG cannot access. This is where the mathematicians come in.

In the meantime, I urge you to implement Marsaglia's CMWC4096 which

would extend the range of your FEX:Shuffle and my FEX:RPGlab. Also,

take a look at Richard Brent's site http://maths.anu.edu.au/~brent/

If anyone can answer our questions, he can.

Here is what Brent says in his 2006 paper "Some long-period random

number generators using shifts and xors" contained in

http://wwwmaths.anu.edu.au/~brent/ftp/random/xorgens305.zip

'Any RNG based on a finite state must eventually cycle, but it is

desirable for RNGs to have a very long period T (the cycle length).

Most generators fail certain statistical tests if more than about

T^(1/2) random numbers are used [14], or perhaps even about T^(1/3)

numbers for the “birthday spacings” test.'

This suggests that the restrictions on n should be even smaller

than the ones I calculated above.

Regards,

Derek.

Jan on 26 Jul 2011

@Derek: The difference between the mathematical and physical view is the size of infinity. The physicist tends to accept 10^6! as infinity and then the term _uniformly_ looses its meaning, because there is no physical mechanism to proof that a distribution is not uniformly distributed. (Sorry, this is not 100% exact. A physicist will reject the assumption that a uniform RNG with infinite precision can create the sequence {1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0 ...}.)

GUIDs are accepted to be unique, although this is mathematically wrong. Checksums like SHA-256 are actually RNGs with a wide low-entropy seed. Nevertheless the degree of uniqueness is sufficient for using them as trusted fingerprints.

Without doubt you are correct, that the determination of permutations is limited, especially the RANDPERM method (therefore I've voted the question). My meta arguments conern only the fact, that it will be very hard to create a real-world application, which suffers from this limitations.

Derek O'Connor on 4 Aug 2011

@Jan

Assuming that the Fisher-Yates Shuffle produces permutations

uniformly if it uses a good RNG with period T, and if we use

Brent's & Matsumoto's recommendation that we generate no more than

T^(1/3) RNs then we must restrict n so that n! < T^(1/3). This

gives

1. n = 11 for KISS, T = 10^37

2. n = 702 for MT, T = 10^6000

3. n = 3690 for CMWC4096, T = 10^39460

These values of n are very small.

I have come to the pessimistic conclusion that we may never be able

to generate large permutations, with a guarantee of uniformity.

The same conclusion applies to generating random samples: there are

C(N,n) possible samples of size n from a population of size N. Thus

a function that uses KISS to generate samples of size 500 from an

array of size 10^6 cannot possibly cover the C(10^6,500) ~ 10^1866

possible samples.

J. Gentle, in his book "Random Number Generation and Monte Carlo

Methods", 2nd Ed., 2003, has the following comment at the bottom of

page 220:

`In no other application considered in this book does the

limitation of a finite period stand out so clearly as it does in

the case of generation of random samples and permutations.'

Also, he points out that these limitations have been known for many

years. See:

Greenwood, J. Arthur (1976a), The demands of trivial combinatorial

problems on random number generators, "Proceedings of the Ninth

Interface Symposium on Computer Science and Statistics"

A final problem: empirically testing an RPG for uniformity seems to

be difficult, even for small values of n.

The standard Chi-Squared goodness-of-fit test requires a sample

size ns > 5*n! for permutations of size n. Hence, to test the

output of an RPG that produces permutations of size n = 20, would

require samples of size ns = 5*20! ~ 10^19. This is clearly

impracticable.

There may be other tests for uniformity that require less work.

Does anyone have suggestions?

Sign in to comment.

Answer 2

Peter Perkins on 8 Sep 2011

Open in MATLAB Online

1 vote

A couple of people in this thread have mentioned the Fisher-Yates-Durstenfeld(-...) shuffle algorithm.

New in R2011b, the RANDPERM function lets you specify a second argument, as in, "give me a random selection of size k from a random permutaion of 1:n". This not only simplifies and speeds up the typical

perm = randperm(n);
r = perm(1:k)

that has been recommended in the past, but also uses F-Y-D. So of course you can ask for a random permutation of all of 1:n as

perm = randperm(n,n)

and have it generated using F-Y-D with the Mersenne Twister underneath. For backwards compatibility, randperm(n) continues to use the existing algorithm based on sort(rand(1,n)).

3 Comments
Show 1 older comment Hide 1 older comment

Peter Perkins on 9 Sep 2011

yup:

>> rng default

>> randperm(5)

ans =

3 5 1 2 4

>> rng default

>> randperm(5,5)

ans =

4 2 3 1 5

(Not that demonstrates which is which, but you have it right.) Also note that

>> rng default

>> randperm(5,3)

ans =

5 3 2

so you can see that randperm(n,k) is not simply the first k elements of randperm(n,n) -- the latter is never generated, which leads to fast generation of things like

>> randperm(2^30,3)

ans =

521168135 859294610 152349296

By the way, randperm(n,k) can be thought of as "sampling without replacement", as opposed to randi(n,k,1), which can be thought of as "sampling with replacement. Also of potential interest, new in R2011b in Statistics Toolbox if you have it, is the datasample function <http://www.mathworks.com/help/toolbox/stats/datasample.html>, which facilitates both of those, including weighted versions of both (it is intended as an improved version of the existing randsample function in Statistics Toolbox).

Derek O'Connor on 9 Sep 2011

Peter,

Is randperm(2^30,3) ans = 521168135 859294610 152349296 faster than

this:

>> L=-2^25;U=2^25;ns=3;tic;

>> [S,nw] = RandSampRejBitsP(L,U,ns);disp(toc);disp(nw);disp(S(1:3))

0.00065465 secs

0 random numbers rejected

8440023 -11312240 11620411

>> L=-2^25;U=2^25;ns=2^15;tic;

>> [S,nw] = RandSampRejBitsP(L,U,ns);disp(toc);disp(nw);disp(S(1:3))

0.022715 secs

6 random numbers rejected

-31191737 27713048 16443262

RandSampRejBitsP can't go up to 2^30 and it needs a warm-up call to

set up the large *persistent* N = (U-L+1)= 2^26 "bit vector"

(logical Member(1,N)). Memory is cheap.

Dell Precision 690, Intel 2xQuad-Core E5345 @ 2.33GHz, 16GB RAM

Windows7 64-bit Prof., MATLAB R2008a, 7.6.0.324

Sign in to comment.

Answer 3

Daniel Shub on 23 Jul 2011

0 votes

Peter Perkins blog on Loren's blog a while back about the random number generators.

http://blogs.mathworks.com/loren/2008/11/13/new-ways-with-random-numbers-part-ii/

It doesn't answer your question exactly. Your question seems to be more is the Mersenne Twister a good random number generator. The answer to that is, yes, it is one of the better general purpose random number generators available today. Yes, there are sequences it will not generate, however, its repeat time is so long that it is basically impossible to detect that from the sequence itself.

1 Comment
Show -1 older comments Hide -1 older comments

Derek O'Connor on 23 Jul 2011

Daniel,

You are "putting words in my mouth": I do not question the goodness

of the Mersenne Twister. It is one of the best of the "modern"

generators and has been tested and used by many. Its period is

about 2^{19937} ~ 10^{6000} and is equidistributed in 623 dimensions.

There are better generators, according to George Marsaglia

(http://groups.google.com/group/sci.crypt/browse_thread/thread/305c507efbe85be4)

It seems to me that if we use any RNG whose internal state has b

bits then we should restrict the size of the permutations generated

by any RPG that uses it so that n*log2(n) < b. If an RPG uses the

MT generator then it should not be allowed to generate permutations

of size greater than about n = 1830, with n*log2(n) = 19833. If we

allow n to be greater than 1830 then the RPG must miss some of the

permutations. Hence my question.

Sign in to comment.

Answer 4

Walter Roberson on 25 Jul 2011

Open in MATLAB Online

0 votes

As long as randperm() is implemented as a stable sort of random numbers that can be duplicates, you cannot trust randperm() requesting more than 1 value.

Suppose you rand(1,2) and both results happen to come out exactly the same. If they cannot come out exactly the same then that would be a test that could be applied to easily distinguish pseudo-random numbers from true-random numbers: true random numbers can have duplicates (and the standard birthday-paradox analysis would make short work of detecting the problem.)

If the two values of rand(1,2) can come out exactly the same (no matter what the period is for full repetition), then the stable sort that is applied to the random numbers is always going to choose the same ordering, not a permutation of that.

This suggests an recursive "saferrandperm" algorithm

[vals, idx] = sort(rand(1,N));
diff(vals)
Look for runs of 0 in the diff: these indicate groups if identical values, and identifying them is an order-N operation.
For any one group of identified identical values, extract the indices, take the length of that, and saferrandperm() that length. Use the indices saferrandperm() returns on the subset size to shuffle the extracted indices in the vector to be returned; if there are more groups of identical values, do the same things for those.
When saferrandperm detects ties in the generated rand() values, it does the same recursive shuffle.
The proportion of duplicated values will get smaller and smaller and as log as the RNG does not generate all-identical values, eventually any one subset will be suitably permuted with the ordering *not* stable.

1 Comment
Show -1 older comments Hide -1 older comments

Jan on 26 Jul 2011

@Walter: Thanks for this safer version of RANDPERM.

If RANDPERM is patched, I suggest to replace it by the faster Fisher-Yates-Alorithm directly.

Sign in to comment.

What does RANDPERM miss?

4 Comments
Show 2 older comments Hide 2 older comments

Answers (4)

5 Comments
Show 3 older comments Hide 3 older comments

3 Comments
Show 1 older comment Hide 1 older comment

1 Comment
Show -1 older comments Hide -1 older comments

1 Comment
Show -1 older comments Hide -1 older comments

Categories

Tags

Community Treasure Hunt

What does RANDPERM miss?

4 Comments Show 2 older comments Hide 2 older comments

Answers (4)

5 Comments Show 3 older comments Hide 3 older comments

3 Comments Show 1 older comment Hide 1 older comment

1 Comment Show -1 older comments Hide -1 older comments

1 Comment Show -1 older comments Hide -1 older comments

Categories

Tags

See Also

Community Treasure Hunt

4 Comments
Show 2 older comments Hide 2 older comments

5 Comments
Show 3 older comments Hide 3 older comments

3 Comments
Show 1 older comment Hide 1 older comment

1 Comment
Show -1 older comments Hide -1 older comments

1 Comment
Show -1 older comments Hide -1 older comments