Zipf Based Number generation

I am trying to generate numbers based on Zipf distribution.
There is a function named RANDDRAW that generates number for the zipf exponent greater than 1. However, i need to generate Zipf numbers which can be less than 1.
Can anybody help me to generate random numbers based on zipf distribution with any value of zipf exponent?

Answers (4)

John D'Errico
John D'Errico on 31 Mar 2015
Edited: John D'Errico on 31 Mar 2015
Whenever someone says something like "for ANY exponent", I can bet that they won't be going to throw an easy problem at any offered code.
So, looking at the zipf distribution, it turns out to be an easy one to deal with, sort of. The zipf pdf is simply
p_zipf(x,p) = -x^(p+1)/zeta(p+1)
This is a discrete distribution in x, over the positive integers.
If we look at the pdf, we recognize that zeta(p+1) is the limit of the sum of powers of x, x^-(p+1), where x ranges from 1 to inf. So zeta(p+1) merely normalizes that pdf to have unit area.
Here the zipf exponent that was referred to would be p, I assume, but there are unresolved questions hidden in the question. I'll quote the line you wrote:
"However, i need to generate Zipf numbers which can be less than 1."
So, IF your question is to generate zipf samples that are non-integer, this makes no sense, since the zipf distribution is a discrete one, that samples over the positive integers.
If you are calling the zipf exponent (p+1), and you wish to sample such that p+1 is less than 1, so p is less than zero, then you will have a great deal of trouble.
If your question is to sample for non-integer values of p that are less than one, but greater than zero, then this is possible, but a serious bit of an effort for many values of p less than 1.
By the way, since the zipf pdf involves zeta(p+1), this would employ the symbolic toolbox, because for small p, the zeta function converges pretty slowly.
Sadly, I cannot proceed further without some resolution of the above questions.
Edit: Though I've not yet gotten a response to my questions, I took a quick look at the randraw code. Randraw defines the zipf pdf such that
p_zipf(x,a) = -x^(-a)/zeta(a)
and then requires only that a>1.
It would appear that you want to generate zipf numbers for a<=1, as randraw defines that distribution. If this is so, then you cannot do so. The sum...
sum(x^-a)
where x is the set of positive integers, is inf for a<=1.
This is why randraw requires that a>1, as otherwise, the zeta function has a problem. Even for a near 1, I would predict that randraw becomes slow.
If you want exponent values less than or equal to one, it is necessary to confine yourself to a specific finite number of terms, since the infinite sum of terms in the zipf distribution would be infinity. See
http://en.wikipedia.org/wiki/Zipf's_law
In the following, N is the number of possible integer values of the distribution, s is the exponent, and the desired size of the generated random matrix is m-by-n.
V = cumsum([0,1./(1:N).^s]');
[~,r] = histc(rand(m*n,1),V/V(end));
r = reshape(r,m,n);
The variable 'r' assumes random integer values from 1 to N in accordance with the zipf distribution.

10 Comments

Hello,
For the specific scenario, i am considering Zipf distribution for a finite set of objects such that each object in the set has some rank given by the Zipf distribution.
Meanwhile, in each time instant, i generate a single value based on Zipf distribution with a Zipf exponent of 0.8 and 1.2.
In your given code, i am not understanding the rank of each integer. Can you please explain about the rank of numbers?
"In your given code, i am not understanding the rank of each integer." In the first line the values in 1:N represent all the integer ranks that are to be considered possible in your finite set. In the second and third lines the quantity 'r' gives the particular random rank values from among these that are being generated with the use of 'rand'. Any rank from 1 to N can be generated. The choice of N is up to you. As I mentioned, if s <= 1, the sum of all terms in V would be infinite if N were allowed to be infinity, so you must select a finite value for N as an upper bound to the rank values you are generating. Again I refer you to the article at:
http://en.wikipedia.org/wiki/Zipf's_law
You state that "in each time instant, i generate a single value based on Zipf distribution" . For the above code it would be more efficient to generate ahead of time a 'V' vector for each of your two values of s, which could be used repeatedly, rather than recreating 'V' for each single random generation.
The problem is that rand generates values that are repeated.
I am trying to use non-repeated values but i am unable to use it. Can you tell me how above code can be used for non-repeated values?
That's a very important restriction. You should have stated that in the very beginning, Tamoor, so that we would not waste our time!
I am sorry roger. But if you can provide me with the solution, i will be thankful.
To avoid repetitions, the right way to proceed, in my opinion, is to use a "rejection" method. Generate a set of n positive integers which are mutually independent random variables with Zipf distribution but then reject the entire set if there are any repetitions and repeat the operation. For this to be practical it is important for n to be much smaller than N, the number of possible rank levels in the distribution.
V = cumsum([0,1./(1:N).^s]'); V = V/V(end);
t = true;
while t
[c,r] = histc(rand(n,1),V);
t = any(c>1); % Repeat until there are no repetitions
end
At exit from the while-loop the column vector r will contain n positive integers each of which lies in the range 1:N and with no two alike.
Thanks for the answer roger.
I was doing the same. Now, i have another modification that i would like to add. The matrix m*n should be in such a way that all the columns contain unique values.
I have N = 500 and n=25. Since m rows are generated by performing the above operation. Each column in a row should contains unique values.
Are you saying that your resulting matrix should have no repetitions along any row nor along any column? If so, that makes it more difficult to achieve. You can try rejecting entire matrices of results which fail this condition:
V = cumsum([0,1./(1:N).^s]'); V = V/V(end);
t = true;
while t
[c,r] = histc(rand(m,n),V);
r2 = diff(sort(r,2),1,2);
t = any(c(:)>1) | any(r2(:)==0);
end
In each row, no values along columns should be repeated.
Your statement "all the columns contain unique values" made me think you wanted no repetitions to occur along any column. If this is not the case, then things are a little easier.
V = cumsum([0,1./(1:N).^s]'); V = V/V(end);
t = true;
while t
[c,r] = histc(rand(n,m),V);
t = any(c(:)>1);
end
r = r';

Sign in to comment.

Wasseem Suhabuth
Wasseem Suhabuth on 31 Aug 2016
Can ihave the coding on matlab to generate zipf distribution please

Asked:

on 31 Mar 2015

Answered:

on 31 Aug 2016

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!