Tag Archives: Statistics

Modeling with Beta Distributions

The beta distribution requires two parameters, usually referred to as a and b, or alpha and beta. If you are considering a Bernoulli process, a sequence of binary outcomes (success or failure) with a constant probability of success, then you can model that probability with a beta distribution, setting the parameter a equal to the number of successes and the parameter b equal to the number of failures. The neat thing about the beta distribution is that the greater the total number of trials (the sum of the successes and failures), the more peaked, or narrow, the distribution becomes.
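
The narrowing can be checked with the closed-form mean and variance of a Beta(a, b) distribution. This is only a minimal sketch: the helper name `beta_mean_sd` and the example counts are made up for illustration.

```python
import math

def beta_mean_sd(a, b):
    """Return the mean and standard deviation of a Beta(a, b) distribution."""
    if a <= 0 or b <= 0:
        raise ValueError("a and b must be positive")
    total = a + b
    mean = a / total
    variance = (a * b) / (total ** 2 * (total + 1))
    return mean, math.sqrt(variance)

# Same 30% success rate, increasing number of trials (illustrative counts).
for successes, failures in [(3, 7), (30, 70), (300, 700)]:
    mean, sd = beta_mean_sd(successes, failures)
    print(f"a={successes:4d} b={failures:4d} mean={mean:.3f} sd={sd:.4f}")
```

The mean stays at 0.3 in every row while the standard deviation shrinks as a + b grows.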

Z-score Transform for Geostatistics

In this post I’ll present the z-score forward and backward transforms used in Sequential Gaussian Simulation, to be discussed at a later date. Some geostatistical algorithms assume that the data are normally distributed, but interesting data are rarely normally distributed. The solution: force normality, or quasi-normality. All of this is loosely based on Clayton V. Deutsch’s work on the GSLIB library, and his books.
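
One way to sketch the idea is a rank-based transform: sort the data, assign each value a cumulative probability, and read off the matching standard normal quantile. This is an illustration under my own assumptions, not the GSLIB implementation. The function names, the (rank + 0.5) / n plotting position, the tie handling, and the clamping of the tails are all choices made for this sketch.

```python
from statistics import NormalDist

def forward_transform(values):
    """Map each value to a standard normal score by its rank.

    Returns the scores in the original order, plus a sorted
    (value, score) table needed for the back-transform.
    Ties are broken by original order (an assumption of this sketch).
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    std_normal = NormalDist()
    scores = [0.0] * n
    for rank, i in enumerate(order):
        # Plotting position (rank + 0.5) / n keeps probabilities inside (0, 1).
        scores[i] = std_normal.inv_cdf((rank + 0.5) / n)
    table = [(values[i], scores[i]) for i in order]
    return scores, table

def backward_transform(score, table):
    """Map a normal score back to data units by linear interpolation.

    Scores beyond the table are clamped to the smallest or largest value.
    """
    if score <= table[0][1]:
        return table[0][0]
    if score >= table[-1][1]:
        return table[-1][0]
    for (v0, s0), (v1, s1) in zip(table, table[1:]):
        if s0 <= score <= s1:
            t = (score - s0) / (s1 - s0)
            return v0 + t * (v1 - v0)
```

Calling `forward_transform` on a skewed sample gives scores that are symmetric about zero, and `backward_transform` recovers the original value for any score in the table.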

Classical Hypothesis Testing, Statistical Power, and Type-II Errors

Hypothesis testing is one of the fundamental tasks in science. You do a study, and then you have to determine if there is a statistically meaningful difference between the test and control data. It is important to understand hypothesis testing, because a lot of interesting functions in R are hypothesis tests. I’ll consider the simple z-test for testing whether the mean of the sample is the same as the hypothesized mean of the population. We’ll see how statistical power, which is the probability of detecting a difference in means, changes with sample size and effect size, which is the size of the difference between the observed sample mean and the hypothesized population mean. We’ll also see that the significance level \alpha is the Type-I (false positive) error rate, while power is one minus the Type-II (false negative) error rate.
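
The dependence of power on sample size and effect size can be sketched for a two-sided, one-sample z-test. The function name `z_test_power` is hypothetical, and the sketch assumes a known population standard deviation and takes the effect to be the true mean minus the hypothesized mean.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a two-sided one-sample z-test.

    effect: true mean minus hypothesized mean (assumed known here)
    sigma:  known population standard deviation
    n:      sample size
    alpha:  significance level (Type-I error rate)
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = effect * sqrt(n) / sigma
    # Probability that the test statistic lands in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

# Power grows with sample size for a fixed effect (illustrative values).
for n in (8, 32, 128):
    print(n, round(z_test_power(effect=0.5, sigma=1.0, n=n), 3))
```

With an effect of zero the function returns alpha itself, which is the false positive rate, and one minus the returned power is the Type-II error rate.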

Computing Principal Components in Python

In this post I will walk through the computation of principal components from a data set using Python. A number of languages and modules implement principal components analysis (PCA), but implementations can vary slightly, which may lead to confusion if you are trying to follow someone else’s code or you are using multiple languages. Perhaps more importantly, as a data analyst you should at all costs avoid using a tool if you do not understand how it works. I will use data from The Handbook of Small Data Sets to illustrate this example. The data sets can be found in a zipped directory on the site linked above.
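
As a rough sketch of the computation, the components can be taken from the eigendecomposition of the sample covariance matrix. This assumes NumPy is available. The function name `principal_components` is hypothetical, and working from the covariance matrix (rather than the correlation matrix) is one of the choices on which implementations differ. The sign of each eigenvector is also arbitrary.

```python
import numpy as np

def principal_components(data):
    """Return eigenvalues, eigenvectors (as columns), and scores for `data`.

    data: array of shape (n_samples, n_features)
    """
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    # eigh is meant for symmetric matrices; it returns ascending eigenvalues.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]  # largest variance first
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    scores = centered @ eigenvectors  # data projected onto the components
    return eigenvalues, eigenvectors, scores
```

The variance of each column of `scores` equals the matching eigenvalue, which is a quick check when comparing against another implementation.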

Exploratory Factor Analysis in R

In this post I’ll provide an example of exploratory factor analysis in R. We will use the psych package by William Revelle. Factor analysis seeks to find latent variables, or factors, by looking at the correlation matrix of the observed variables. This technique can be used for dimensionality reduction, or for better insight into the data. As with any technique, this will not work in all scenarios. Firstly, latent variables are not always present, and secondly, it will miss existing latent variables if they are not apparent from the observed data. More information can be found in the manual distributed with the package.

Simple Kriging in Python

In this post I will work through an example of Simple Kriging. Kriging is a set of techniques for interpolation. It differs from other interpolation techniques in that it sacrifices smoothness for the integrity of sampled points. Most interpolation techniques will overshoot or undershoot the value of the function at sampled locations, but kriging honors those measurements and keeps them fixed. In future posts I would like to cover other types of kriging, other semivariogram models, and colocated co-kriging. Until then, I’m keeping relatively up-to-date code at my GitHub project, geostatsmodels.
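
The claim that kriging honors the measurements can be sketched with a small simple kriging estimator. This is not the geostatsmodels code. It assumes NumPy, a known constant mean, and an exponential covariance model with made-up sill and range values, and the function names are hypothetical.

```python
import numpy as np

def exponential_covariance(h, sill=1.0, corr_range=10.0):
    """Exponential covariance model (an assumed model for this sketch)."""
    return sill * np.exp(-3.0 * h / corr_range)

def simple_kriging(coords, values, target, mean, cov=exponential_covariance):
    """Simple kriging estimate at `target`, given a known constant `mean`."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    target = np.asarray(target, dtype=float)
    # Distances between every pair of samples, and from samples to target.
    pair_dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    target_dist = np.linalg.norm(coords - target, axis=-1)
    # Weights solve C w = c0, where C and c0 are the covariances.
    weights = np.linalg.solve(cov(pair_dist), cov(target_dist))
    return mean + weights @ (values - mean)
```

Estimating at one of the sampled coordinates returns the sampled value, and far from all samples the estimate falls back to the mean.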

Fractal Dimension and Box Counting

In this post I will present a technique for generating a one-dimensional (quasi) fractal data set using a modified Matérn point process, perform a simple box-counting procedure, and then calculate the lacunarity and fractal dimension using linear regression. Lacunarity is actually a pretty large topic, and we will only cover one accepted interpretation here. This material was motivated by an interesting paper on the fractal modelling of fractures in tight gas reservoirs. Tight gas reservoirs are reservoirs with very low permeability. To provide a sense of perspective, oil reservoirs typically have a permeability of ten to a hundred millidarcies, whereas shale gas reservoirs are usually less than 0.1 microdarcies, which is about the same permeability as a granite countertop.
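
The box-counting step and the regression for fractal dimension can be sketched in a few lines. To keep the example checkable, it swaps the modified Matérn point process for a Cantor set, whose box-counting dimension is known to be log 2 / log 3. It does not cover lacunarity. All function names are hypothetical.

```python
import math

def box_count(points, box_size):
    """Number of boxes of width `box_size` that contain at least one point."""
    return len({math.floor(p / box_size) for p in points})

def fractal_dimension(points, box_sizes):
    """Slope of log(count) against log(1 / box_size) by least squares."""
    xs = [math.log(1.0 / s) for s in box_sizes]
    ys = [math.log(box_count(points, s)) for s in box_sizes]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    return numerator / denominator

def cantor_points(level):
    """Left endpoints of the level-`level` Cantor set, scaled to integers."""
    points = [0]
    for _ in range(level):
        points = [3 * p for p in points] + [3 * p + 2 for p in points]
    return points

points = cantor_points(8)                 # coordinates in [0, 3**8)
sizes = [3 ** m for m in range(1, 8)]     # box widths matched to the set
print(fractal_dimension(points, sizes), math.log(2) / math.log(3))
```

For real data the box sizes will not line up this neatly, so the log-log points scatter around the line and the slope is only an estimate.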

Spatial Point Processes

Here, I’ll introduce some ideas regarding spatial point processes using Python. First I’ll present the Poisson point process, and then I’ll cover two other processes: the Thomas point process and the Matérn point process. I’ll use these tools in two future posts, one on measuring fractal dimension and one on kriging. An excellent resource for spatial statistics is the R package spatstat. The manual is a really great read. The spatstat package implements the Thomas and Matérn point processes as rThomas and rMatern.
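
A homogeneous Poisson point process on a rectangle can be sketched in two steps: draw the number of points from a Poisson distribution with mean equal to intensity times area, then scatter that many points uniformly. This assumes NumPy. The function name and parameters are hypothetical.

```python
import numpy as np

def poisson_point_process(intensity, width, height, rng=None):
    """Sample a homogeneous Poisson point process on [0, width] x [0, height].

    intensity: expected number of points per unit area
    Returns an array of shape (n_points, 2) holding x and y coordinates.
    """
    rng = np.random.default_rng() if rng is None else rng
    # The point count is itself random, with mean intensity * area.
    n_points = rng.poisson(intensity * width * height)
    x = rng.uniform(0.0, width, size=n_points)
    y = rng.uniform(0.0, height, size=n_points)
    return np.column_stack([x, y])

points = poisson_point_process(intensity=100, width=2.0, height=1.0)
print(points.shape)
```

Cluster processes such as the Thomas and Matérn processes can be built on top of this by treating the sampled points as parents and scattering offspring around each one.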

Linear Regression with Python

In this post I will use Python to explore more measures of fit for linear regression. I will consider the coefficient of determination (R²), hypothesis tests (F, t, Omnibus), AIC, BIC, and other measures. This will be an expansion of a previous post where I discussed how to assess linear models in R, via the IPython notebook, by looking at the residuals and several measures involving the leverage.
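
A few of these measures can be sketched directly from an ordinary least squares fit. This assumes NumPy and Gaussian errors. The function name `fit_summary` is hypothetical, and the parameter count used in AIC and BIC here covers only the regression coefficients, a convention that varies between references.

```python
import numpy as np

def fit_summary(x, y):
    """Fit y = b0 + b1 * x by least squares and report R^2, AIC, and BIC."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    design = np.column_stack([np.ones(n), x])  # intercept plus predictors
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    k = design.shape[1]  # number of fitted coefficients
    # Gaussian log-likelihood evaluated at the fitted coefficients.
    log_lik = -0.5 * n * (np.log(2 * np.pi) + np.log(ss_res / n) + 1)
    return {
        "coef": coef,
        "r_squared": 1 - ss_res / ss_tot,
        "aic": 2 * k - 2 * log_lik,
        "bic": k * np.log(n) - 2 * log_lik,
    }
```

Lower AIC and BIC indicate a better trade-off between fit and complexity, so they are mainly useful for comparing candidate models on the same data.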
