This material was a teaching aid for a crash course I gave at work about cosine similarity. Cosine similarity is a blunt instrument used to compare two sets of text. If the two texts have a high number of words in common, then the texts are assumed to be similar. The ultimate goal is to plug two texts into a function and get out an easy-to-understand number that describes how similar the texts are, and cosine similarity is one way to skin that cat.

Please note, there are plenty of other very fast implementations for cosine similarity, but this one was written for educational purposes.
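To make the idea concrete, here is a minimal sketch of word-count cosine similarity using only the standard library. It is not the implementation from the course: the tokenizing regex and the function names `word_counts` and `cosine_similarity` are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text and count how often each word appears."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two word-count vectors."""
    a, b = word_counts(text_a), word_counts(text_b)
    # Only words present in both texts contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text shares no words with anything
    return dot / (norm_a * norm_b)


print(cosine_similarity("the cat sat", "the dog sat"))
```

Because word counts are never negative, the result lands between 0 (no shared words) and 1 (identical word proportions), which is what makes it an easy number to read.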

TLDR: the negative binomial counts the number of trials needed to reach the Nth success.

I had this problem where we were considering running some very expensive tests that had a known success rate, and we wanted to know, given the success rate and the cost, whether we should run them at all. To make things more interesting, we were only interested in a set number of successes, and we could stop all testing once we had them. My initial thought was to use the binomial distribution, but the binomial doesn’t “cut off” after a set number of successes. It turns out that we needed to use a version of the negative binomial distribution.
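A rough sketch of that "stop after the Nth success" calculation, assuming independent tests with a fixed success probability. The function names and the example numbers (3 successes, 20% success rate, a cost of 1000 per test) are hypothetical, not taken from the original problem.

```python
from math import comb


def prob_trials_needed(n, r, p):
    """P(the r-th success arrives exactly on trial n)."""
    if n < r:
        return 0.0  # cannot have r successes in fewer than r trials
    # The last trial is a success; the other r-1 successes fall anywhere
    # among the first n-1 trials.
    return comb(n - 1, r - 1) * p**r * (1 - p) ** (n - r)


def expected_trials(r, p):
    """Mean number of trials needed to collect r successes."""
    return r / p


r, p, cost_per_test = 3, 0.2, 1000.0  # hypothetical inputs
print(expected_trials(r, p) * cost_per_test)  # expected total spend

# Chance of being finished within 10 tests.
print(sum(prob_trials_needed(n, r, p) for n in range(r, 11)))
```

Comparing the expected spend, or the probability of finishing within a budget, against the value of the successes is one way to decide whether the tests are worth running.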

Summarizing the average performance of a set of things under different loads is particularly tricky. The correct way to summarize performance is to use the geometric mean instead of the arithmetic mean. The tricky part is that the difference between the arithmetic and geometric means is only significant under a certain condition, so the impact of using the arithmetic mean instead of the geometric mean may not be painfully obvious. Let’s start with an example.
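As an illustration (with made-up numbers, not the post’s own example), the sketch below computes both means for two sets of performance ratios. The two means are equal only when every value is the same, and the gap widens as the values spread out.

```python
import math


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    """n-th root of the product, computed via logs to avoid overflow."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical speedup ratios measured under different loads.
tight = [1.0, 1.1, 1.2]    # values close together
spread = [0.5, 2.0, 50.0]  # values spanning orders of magnitude

for ratios in (tight, spread):
    print(arithmetic_mean(ratios), geometric_mean(ratios))
```

For the tightly grouped ratios the two summaries nearly agree, while for the widely spread ratios the arithmetic mean is pulled up by the single large value.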

I’ve borrowed (stolen) code from this IPython Notebook hosted on GitHub from the PyData NYC 2014 conference. I didn’t like the local call in the original code, so I made it object-oriented. (Full disclosure: I’d never seen the local keyword before, so I stuck with the devil I knew.) I also wanted syntax reminiscent of scipy.stats, so I added a .rvs() method for extracting a sample from the Poisson disk object.

Sub-random numbers sort of look random, but they aren’t, and they usually provide better coverage over an interval, which is sometimes more important than having truly random data. For example, you wouldn’t use sub-random numbers for encryption, but they’d be great for performing Monte Carlo calculations. You can read more about them on the Wikipedia page for low-discrepancy sequences.
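One of the simplest low-discrepancy sequences is the van der Corput sequence, sketched below. It is offered as an example of the general idea and is not necessarily the sequence used in the post; the function name is an assumption.

```python
def van_der_corput(index, base=2):
    """Mirror the base-`base` digits of `index` about the radix point."""
    value, denom = 0.0, 1.0
    while index > 0:
        index, digit = divmod(index, base)
        denom *= base
        value += digit / denom
    return value


# Each new point lands in the largest remaining gap of [0, 1).
points = [van_der_corput(i) for i in range(1, 9)]
print(points)
```

The points are fully deterministic, which is why they are unsuitable for encryption, yet any prefix of the sequence covers the unit interval evenly, which is what a Monte Carlo estimate benefits from.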

In this post I’ll discuss how to use Python and R to calculate the Pearson Chi-Squared Test for goodness of fit. The chi-squared test for goodness of fit determines how well categorical variables fit some distribution. We assume that the categories are mutually exclusive and completely cover the sample space. That means that everything we can think of fits into exactly one category, with no exceptions. For example, suppose we flip a coin to determine if it is fair. The outcomes of this experiment fit into exactly two categories, heads and tails. The same goes for rolling a die to determine its fairness; rolls of the die will result in (exactly) one of (exactly) six outcomes or categories. This test is only meaningful with mutually exclusive categories.
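For the die example, the test statistic can be sketched by hand as below. The roll counts are invented for illustration, and the 5% critical value for 5 degrees of freedom (11.070) is hard-coded from a standard chi-squared table; in practice `scipy.stats.chisquare` returns both the statistic and a p-value.

```python
def chi_squared_statistic(observed, expected):
    """Pearson statistic: sum of (observed - expected)^2 / expected."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts from 120 rolls of a die, one count per face.
observed = [18, 22, 20, 25, 15, 20]
expected = [sum(observed) / 6] * 6  # a fair die expects 20 per face

statistic = chi_squared_statistic(observed, expected)
CRITICAL_5PCT_DF5 = 11.070  # table value: 6 categories - 1 = 5 d.o.f.

print(statistic)
print("reject fairness" if statistic > CRITICAL_5PCT_DF5 else "no evidence of bias")
```

Each roll contributes to exactly one face count, which is the mutually exclusive, exhaustive categorization the test relies on.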

In this post I’ll discuss compound Poisson processes, which I read about in the final chapter of Hassett and Stewart’s Probability for Risk Management last night. These model a stochastic process where at regular intervals (months, quarters, etc.) some number of events occur according to a Poisson process with a given rate, and the intensity of each event is determined independently by another distribution.
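A small simulation sketch of that structure, using only the standard library. The rate of 3 events per period and the exponential severity with mean 100 are arbitrary assumptions, and `poisson_sample` uses Knuth’s multiplication method, which is adequate only for small rates.

```python
import math
import random


def poisson_sample(rate, rng):
    """Draw a Poisson count by multiplying uniforms (Knuth's method)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def compound_poisson_total(rate, draw_severity, rng):
    """Total for one period: a Poisson number of independent severities."""
    n_events = poisson_sample(rate, rng)
    return sum(draw_severity(rng) for _ in range(n_events))


rng = random.Random(0)
rate, mean_severity = 3.0, 100.0  # hypothetical parameters


def exponential_severity(r):
    return r.expovariate(1.0 / mean_severity)


totals = [compound_poisson_total(rate, exponential_severity, rng)
          for _ in range(20000)]
average_total = sum(totals) / len(totals)
print(average_total, rate * mean_severity)
```

The simulated average should sit close to the rate times the mean severity, which is the expected total for a compound Poisson sum.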

In this post, I’ll describe a technique for determining whether the means of two sets are significantly different. In a previous post I demonstrated how to perform the standard statistical tests using R. Randomization tests are convenient when you can’t say anything about the normality or homoscedasticity (constant variance) of the population, and/or you don’t have access to a truly random sample.
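The core of a randomization test can be sketched in a few lines: pool the two samples, shuffle the labels many times, and see how often a shuffled difference in means is at least as large as the observed one. The function name, the number of shuffles, and the sample data below are assumptions for illustration.

```python
import random


def mean(values):
    return sum(values) / len(values)


def randomization_test(sample_a, sample_b, n_shuffles=10000, seed=0):
    """Two-sided p-value for the difference in means under relabeling."""
    rng = random.Random(seed)
    observed = abs(mean(sample_a) - mean(sample_b))
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # randomly reassign values to the two groups
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_shuffles


# Hypothetical measurements from two groups.
group_a = [12, 15, 11, 14, 13, 16, 12, 15]
group_b = [18, 21, 17, 22, 19, 20, 23, 18]
print(randomization_test(group_a, group_b))
```

Nothing here assumes normality or equal variances; the only assumption is that, if the groups did not differ, the labels would be exchangeable.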

In this post I’ll look at different statistical hypothesis tests in R. Statistical tests can be tricky because they all have different assumptions that must be met before you can use them. Some tests require samples to be normally distributed, others require two samples to have the same variance, while others are not as restrictive.

We’ll begin with testing for normality. Then we’ll look at testing for equality of variance, with and without an assumption of normality. Finally we’ll look at testing for equality of mean, under different assumptions regarding normality and equal variance.

In this post I’ll discuss how to perform incremental updates to a simple statistical model using PyMC. The short answer is that you have to create a new model each time. In this example, I’ll use a Bernoulli random variable from scipy.stats to generate coin flips, and I will use PyMC to model a prior and likelihood distribution, and produce a posterior distribution as output.
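The sketch below illustrates the incremental idea without PyMC: it swaps in the closed-form Beta-Bernoulli conjugate update, so each batch’s posterior simply becomes the next batch’s prior. The flat Beta(1, 1) prior, the true heads probability of 0.7, and the batch size are assumptions, and Python’s `random` module stands in for `scipy.stats` when generating flips.

```python
import random


def update_beta(alpha, beta, flips):
    """Posterior Beta parameters after observing a batch of 0/1 flips."""
    heads = sum(flips)
    tails = len(flips) - heads
    return alpha + heads, beta + tails


rng = random.Random(42)
true_p = 0.7            # hypothetical coin bias used to generate data
alpha, beta = 1.0, 1.0  # flat prior over the heads probability

for batch_number in range(5):
    flips = [1 if rng.random() < true_p else 0 for _ in range(50)]
    # The posterior from this batch is the prior for the next one.
    alpha, beta = update_beta(alpha, beta, flips)
    print(batch_number, alpha / (alpha + beta))

posterior_mean = alpha / (alpha + beta)
```

With a sampling-based tool there is no closed form to carry forward, which is consistent with the point above that a new model has to be built for each update.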