TLDR: the negative binomial counts the number of trials needed before the Nth success.
I had this problem where we were considering running some very expensive tests that had a known success rate, and we wanted to know, given the success rate and the cost, whether we should run them at all. To make things more interesting, we were only interested in a set number of successes, and we could stop all testing after the first successes. My initial thought was to use the binomial distribution, but the binomial doesn’t “cut off” after a set number of successes. It turns out that we needed to use a version of the negative binomial distribution.
I think that the biggest trick to this is to use the combinations function from scipy, scipy.misc.comb()
. You could roll your own combinations function, but that would be one more thing to test, and it probably wouldn’t be as efficient as the scipy method.
import scipy import scipy.misc def ncr( n, r ): """nCr: the number of combinations of r elements chosen from n elements Args: n (int): total population of elements r (int): size of subset Returns: (int): number of subsets of size `r` taken from `n` elements """ return scipy.misc.comb( n, r, exact=True ) def negative_binomial( k, n, p ): """Probability of the n-th success on the k-th trial. Args: k (int): total number of trials n (int): number of successes p (float): probability of success Returns: (float): probability of the n-th success occurring on the k-th trial """ return ncr( k-1, n-1 ) * p**n * (1-p)**(k-n)
The particular version of the negative binomial we are using is the one that takes the probability of success, and the number of successes as a parameter, and returns the probability of the th success occurring on the th trial.
In our case, we had a success rate of , and we were looking for our first two successes, so , then we can plot probabilities for different values of .
n_tries = [ i for i in range(2,10) ] prob_of_n_tries = map( negative_binomial, n_tries, [2]*len(n_tries), [0.6]*len(n_tries) ) plot( n_tries, prob_of_n_tries ) title("Probability of the 2nd Success on the Nth Trial") xlabel("Trials") ylabel("Probability") savefig("negative_binomial_plot.png",dpi=200)
This says that there is a 36% chance of the second success occurring on the second trial, a ~30% chance that the second success occurs exactly on the third trial, etc. What we’re probably more interested in is the probability of the the second success occurring in three trials or less, or four trials or less. This allows us to decide on a limit to the number of tests we’re willing to commit to, and sleep at night with our decision. To make that calculation we’ll need to take a cumulative sum of the probabilities in the figure above, and that can be done with the scipy.cumsum()
function.
plot( n_tries, scipy.cumsum( p_of_n_tries ) ) title("Probability of Two Successes in N or Fewer Trials") xlabel("Cumulative Trials") ylabel("Probability") savefig("cumulative_negative_binomial_plot.png",dpi=200)
Here, we can see that we have a ~65% chance of getting two successes in no more than three trials, or a better than 80% chance of getting two successes in no more than 4 trials. This way, if the tests are very expensive, we can feel comfortable committing to a maximum of four tests which should give us an 80% chance of accumulating two successes. That’s pretty cool.