Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share


VERY IMPORTANT EDIT: All Ai are unique.

The Question

I have a list A of n unique objects. Each object Ai has a variable percentage Pi.

I want to create an algorithm that generates a new list B of k objects (k < n/2, and in most cases k is significantly less than n/2; e.g. n = 231, k = 21). List B should have no duplicates and will be populated with objects from list A, subject to the following restriction:

The probability that an object Ai appears in B is Pi.

What I Have Tried

(These snippets are in PHP simply for the purposes of testing.) I first made list A:

$list = [
    "A" => 2.5, 
    "B" => 2.5, 
    "C" => 2.5, 
    "D" => 2.5, 
    "E" => 2.5, 
    "F" => 2.5, 
    "G" => 2.5, 
    "H" => 2.5, 
    "I" => 5,   
    "J" => 5,   
    "K" => 2.5, 
    "L" => 2.5, 
    "M" => 2.5, 
    "N" => 2.5, 
    "O" => 2.5, 
    "P" => 2.5, 
    "Q" => 2.5, 
    "R" => 2.5, 
    "S" => 2.5, 
    "T" => 2.5, 
    "U" => 5,   
    "V" => 5,   
    "W" => 5,   
    "X" => 5,   
    "Y" => 5,   
    "Z" => 20   
];

At first I tried the following two algorithms:

$result = [];

while (count($result) < 10) {
    $rnd = rand(0,10000000) / 100000;

    $sum = 0;
    foreach ($list as $key => $value) {
        $sum += $value;
        if ($rnd <= $sum) {
            if (in_array($key,$result)) {
                break;
            } else {
                $result[] = $key;
                break;
            }
        }
    }
}

AND

$result = [];

while (count($result) < 10) {
    $sum = 0;
    foreach ($list as $key => $value) {
        $sum += $value;
    }

    $rnd = rand(0,$sum * 100000) / 100000;

    $sum = 0;
    foreach ($list as $key => $value) {
        $sum += $value;
        if ($rnd <= $sum) {
            $result[] = $key;
            unset($list[$key]);
            break;
        }
    }
}

The only difference between the two algorithms is that one tries again when it encounters a duplicate, and the other removes the object from list A when it is picked. As it turns out, these two algorithms have the same probability outputs.

I ran the second algorithm 100,000 times and kept track of how many times each letter was picked. The following array contains the percentage chance that a letter is picked in any list B, based on the 100,000 tests.

[A] => 30.213
[B] => 29.865
[C] => 30.357
[D] => 30.198
[E] => 30.152
[F] => 30.472
[G] => 30.343
[H] => 30.011
[I] => 51.367
[J] => 51.683
[K] => 30.271
[L] => 30.197
[M] => 30.341
[N] => 30.15
[O] => 30.225
[P] => 30.135
[Q] => 30.406
[R] => 30.083
[S] => 30.251
[T] => 30.369
[U] => 51.671
[V] => 52.098
[W] => 51.772
[X] => 51.739
[Y] => 51.891
[Z] => 93.74

Looking back at the algorithm, this makes sense. The algorithm treats each original percentage as the chance that an object is picked for any given position in B, not the chance that it appears anywhere in B. So, for example, the chance that Z is picked for a single index of B is 20%, but the chance that Z appears somewhere in a list B is about 93%. This is NOT what I want. I want the chance that Z appears in a list B to be 20%.

Is this even possible? How can it be done?

EDIT 1

I tried simply making the sum of all Pi equal k. This worked when all Pi were equal, but after I modified their values, the results became more and more wrong.

Initial Probabilities

$list= [
    "A" => 8.4615,
    "B" => 68.4615,
    "C" => 13.4615,
    "D" => 63.4615,
    "E" => 18.4615,
    "F" => 58.4615,
    "G" => 23.4615,
    "H" => 53.4615,
    "I" => 28.4615,
    "J" => 48.4615,
    "K" => 33.4615,
    "L" => 43.4615,
    "M" => 38.4615,
    "N" => 38.4615,
    "O" => 38.4615,
    "P" => 38.4615,
    "Q" => 38.4615,
    "R" => 38.4615,
    "S" => 38.4615,
    "T" => 38.4615,
    "U" => 38.4615,
    "V" => 38.4615,
    "W" => 38.4615,
    "X" => 38.4615,
    "Y" => 38.4615,
    "Z" => 38.4615
];

Results after 10,000 runs

Array
(
    [A] => 10.324
    [B] => 59.298
    [C] => 15.902
    [D] => 56.299
    [E] => 21.16
    [F] => 53.621
    [G] => 25.907
    [H] => 50.163
    [I] => 30.932
    [J] => 47.114
    [K] => 35.344
    [L] => 43.175
    [M] => 39.141
    [N] => 39.127
    [O] => 39.346
    [P] => 39.364
    [Q] => 39.501
    [R] => 39.05
    [S] => 39.555
    [T] => 39.239
    [U] => 39.283
    [V] => 39.408
    [W] => 39.317
    [X] => 39.339
    [Y] => 39.569
    [Z] => 39.522
)

1 Answer

We must have sum_i P_i = k, or else we cannot succeed.
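To spell out why this condition is needed, here is a short derivation. It assumes the P_i are written as fractions in [0, 1] rather than percentages, and it uses only linearity of expectation together with the fact that B always has exactly k elements:

```latex
\[
k \;=\; \mathbb{E}\bigl[\,|B|\,\bigr]
  \;=\; \mathbb{E}\Bigl[\sum_{i=1}^{n} \mathbf{1}[A_i \in B]\Bigr]
  \;=\; \sum_{i=1}^{n} \Pr[A_i \in B]
  \;=\; \sum_{i=1}^{n} P_i .
\]
```

For instance, the percentages in the question's first list add up to 100%, so sum_i P_i = 1 while k = 10, and no sampling scheme can meet the requirement there. The list in EDIT 1 adds up to roughly 1000%, which does match k = 10.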

As stated, the problem is somewhat easy, but you may not like this answer, on the grounds that it's "not random enough".

Sample a uniform random permutation Perm on the integers [0, n)
Sample X uniformly at random from [0, 1)
For i in Perm
    If X < P_i, then append A_i to B and update X := X + (1 - P_i)
    Else, update X := X - P_i
End
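The pseudocode above can be sketched in Python roughly as follows. This is a minimal illustration, not a tuned implementation: the name `systematic_sample` and the `rng` parameter are made up for this example, `fractions.Fraction` stands in for exact arithmetic, and X is drawn from a grid of 2**53 points, so the inclusion probabilities are only approximately P_i (to within about 2**-53) unless the denominators happen to divide the grid size.

```python
import random
from fractions import Fraction


def systematic_sample(items, probs, rng=random):
    """Return a list in which items[i] appears with probability ~probs[i].

    Assumes every probs[i] is a Fraction in [0, 1] and that the
    probabilities sum to a whole number k; exactly k items come back.
    """
    order = list(range(len(items)))
    rng.shuffle(order)  # uniform random permutation of the indices
    # X uniform on a fine rational grid inside [0, 1) (an approximation).
    x = Fraction(rng.randrange(2**53), 2**53)
    chosen = []
    for i in order:
        p = probs[i]
        if x < p:
            chosen.append(items[i])
            x += 1 - p  # stays inside [0, 1) because x < p
        else:
            x -= p  # stays inside [0, 1) because x >= p
    return chosen


# Illustrative probabilities that sum to k = 2.
demo_probs = [Fraction(2, 3), Fraction(1, 2), Fraction(1, 2), Fraction(1, 3)]
print(systematic_sample(["A", "B", "C", "D"], demo_probs))
```

Because x always stays in [0, 1) and changes by a total of (number chosen) minus k over the loop, the number of chosen items is forced to be exactly k when the arithmetic is exact.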

You'll want to approximate the calculations involving real numbers with fixed-point arithmetic, not floating-point.

The missing condition is that the distribution have a technical property called "maximum entropy". Like amit, I cannot think of a good way to do this. Here's a clumsy way.

My first (and wrong) instinct for solving this problem was to include each A_i in B independently with probability P_i and retry until B is the right length (there won't be too many retries, for reasons that you can ask math.SE about). The problem is that the conditioning messes up the probabilities. If P_1 = 1/3 and P_2 = 2/3 and k = 1, then the outcomes are

{}: probability 2/9
{A_1}: probability 1/9
{A_2}: probability 4/9
{A_1, A_2}: probability 2/9,

and the conditional probabilities are actually 1/5 for A_1 and 4/5 for A_2.
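The numbers above can be checked by brute force. The sketch below enumerates every subset, keeps those of size k, and computes the exact conditional inclusion probabilities; the helper name `conditional_inclusion` is invented for this illustration, and the enumeration is exponential in n, so it is only meant for tiny examples.

```python
from fractions import Fraction
from itertools import product


def conditional_inclusion(probs, k):
    """Exact P(item i included | exactly k items included), when each
    item is included independently with probability probs[i]."""
    n = len(probs)
    total = Fraction(0)  # probability that exactly k items are included
    hits = [Fraction(0)] * n  # probability of size k AND item i included
    for mask in product([0, 1], repeat=n):
        if sum(mask) != k:
            continue
        weight = Fraction(1)
        for bit, p in zip(mask, probs):
            weight *= p if bit else 1 - p
        total += weight
        for i, bit in enumerate(mask):
            if bit:
                hits[i] += weight
    return [h / total for h in hits]


print(conditional_inclusion([Fraction(1, 3), Fraction(2, 3)], 1))
# prints [Fraction(1, 5), Fraction(4, 5)]
```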

Instead, we should substitute new probabilities Q_i that yield the proper conditional distribution. I don't know of a closed form for Q_i, so I propose to find them using a numerical optimization algorithm like gradient descent. Initialize Q_i = P_i (why not?). Using dynamic programming, it's possible to find, for the current setting of Q_i, the probability that A_i is one of the elements, given an outcome with l elements. (We only care about the l = k entry, but we need the others to make the recurrences work.) With a little more work, we can get the whole gradient. Sorry this is so sketchy.
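As a sketch of the condition the Q_i have to satisfy, write d_j^{(-i)} for the probability that exactly j of the items other than A_i are included when each item is included independently with its Q value. The superscript notation is introduced here for illustration only. Conditioning on |B| = k and requiring the result to equal P_i gives:

```latex
\[
\Pr\bigl[A_i \in B \,\bigm|\, |B| = k\bigr]
  = \frac{Q_i\, d^{(-i)}_{k-1}}{Q_i\, d^{(-i)}_{k-1} + (1 - Q_i)\, d^{(-i)}_{k}}
  = P_i
\quad\Longrightarrow\quad
Q_i = \frac{P_i\, d^{(-i)}_{k}}{(1 - P_i)\, d^{(-i)}_{k-1} + P_i\, d^{(-i)}_{k}} .
\]
```

This is not a closed form, since each d^{(-i)} depends on the other Q values. It can be read as an update rule that holds the d terms fixed, which appears to be what the expression assigned to `qs[i]` in `approximate_qs` computes.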

In Python 3, using a nonlinear solution method that seems to converge always (update each q_i simultaneously to its marginally correct value and normalize):

#!/usr/bin/env python3
import collections
import operator
import random


def constrained_sample(qs):
    k = round(sum(qs))
    while True:
        sample = [i for i, q in enumerate(qs) if random.random() < q]
        if len(sample) == k:
            return sample


def size_distribution(qs):
    size_dist = [1]
    for q in qs:
        size_dist.append(0)
        for j in range(len(size_dist) - 1, 0, -1):
            size_dist[j] += size_dist[j - 1] * q
            size_dist[j - 1] *= 1 - q
    assert abs(sum(size_dist) - 1) <= 1e-10
    return size_dist


def size_distribution_without(size_dist, q):
    size_dist = size_dist[:]
    if q >= 0.5:
        for j in range(len(size_dist) - 1, 0, -1):
            size_dist[j] /= q
            size_dist[j - 1] -= size_dist[j] * (1 - q)
        del size_dist[0]
    else:
        for j in range(1, len(size_dist)):
            size_dist[j - 1] /= 1 - q
            size_dist[j] -= size_dist[j - 1] * q
        del size_dist[-1]
    assert abs(sum(size_dist) - 1) <= 1e-10
    return size_dist


def test_size_distribution(qs):
    d = size_distribution(qs)
    for i, q in enumerate(qs):
        d1a = size_distribution_without(d, q)
        d1b = size_distribution(qs[:i] + qs[i + 1 :])
        assert len(d1a) == len(d1b)
        assert max(map(abs, map(operator.sub, d1a, d1b))) <= 1e-10


def normalized(qs, k):
    sum_qs = sum(qs)
    qs = [q * k / sum_qs for q in qs]
    assert abs(sum(qs) / k - 1) <= 1e-10
    return qs


def approximate_qs(ps, reps=100):
    k = round(sum(ps))
    qs = ps[:]
    for j in range(reps):
        size_dist = size_distribution(qs)
        for i, p in enumerate(ps):
            d = size_distribution_without(size_dist, qs[i])
            d.append(0)
            qs[i] = p * d[k] / ((1 - p) * d[k - 1] + p * d[k])
        qs = normalized(qs, k)
    return qs


def test(ps, reps=100000):
    print(ps)
    qs = approximate_qs(ps)
    print(qs)
    counter = collections.Counter()
    for j in range(reps):
        counter.update(constrained_sample(qs))
    test_size_distribution(qs)
    print("p", "Actual", sep="\t")
    for i, p in enumerate(ps):
        print(p, counter[i] / reps, sep="\t")


if __name__ == "__main__":
    test([2 / 3, 1 / 2, 1 / 2, 1 / 3])
