In this post I’ll present the z-score forward and backward transforms used in Sequential Gaussian Simulation, to be discussed at a later date. Some geostatistical algorithms assume that data is distributed normally, but interesting data is generally never normally distributed? Solution: force normality, or quasi-normality. All of this is loosely based on Clayton V. Deutsche’s work on the GSLIB library, and his books.
The first function creates a cumulative distribution function, and its inverse, for a given data set
d. The CDF and its inverse are reported as
<Nx2> arrays, where
N is the number of data points in
d, the input data set.
def cdf( d, bins=12 ): N = len( d ) counts, intervals = np.histogram( d, bins=bins ) h = np.diff( intervals ) / 2.0 f, finv = np.zeros((N,2)), np.zeros((N,2)) idx, k, T = 0, 0, float( np.sum( counts ) ) for count in counts: for i in range( count ): x = intervals[idx]+h y = np.cumsum( counts[:idx+1] )[-1] / T f[k,:] = x, y finv[k,:] = y, x k += 1 idx += 1 return f, finv
This next awkward function returns an interpolation function for the CDFs created by the previous function. It’s a function that returns a function, thanks to functions being first class citizens in Python land. The basis of this function is the
scipy.interpolate.interp1d() function, but since this function throws an error when you give it a value outside the original interpolation range, I’ve taken care of those boundary conditions. I could have alternatively used a try/except structure, and I may yet, but this works for now.
def fit( d ): x, y = d[:,0], d[:,1] def f(t): if t <= x.min(): return y[ np.argmin(x) ] elif t >= x.max(): return y[ np.argmax(x) ] else: intr = scipy.interpolate.interp1d( x, y ) return intr(t) return f
Finally, here are my transformation and back-transformation functions. Depending on the original distribution of the data, the transformation to a standard normal distribution may or may not work out so nicely.
# transform data to normal dist def to_norm( data, bins=12 ): mu = np.mean( data ) sd = np.std( data ) z = ( data - mu ) / sd f, inv = cdf( z, bins=bins ) z = scipy.stats.norm(0,1).ppf( f[:,1] ) z = np.where( z==np.inf, np.nan, z ) z = np.where( np.isnan( z ), np.nanmax( z ), z ) param = ( mu, sd ) return z, inv, param, mu, sd # transform data from normal dist back def from_norm( data, inv, param, mu, sd ): h = fit( inv ) f = scipy.stats.norm(0,1).cdf( data ) z = [ h(i)*sd + mu for i in f ] return z
So that’s it. I’ll revisit this code and make it nicer and put comments in it, but this is a working example. Tomorrow I’ll work on the two dimensional Sequential Gaussian Simulation.