In this post I’ll present the z-score forward and backward transforms used in Sequential Gaussian Simulation, to be discussed at a later date. Some geostatistical algorithms assume that data is distributed normally, but interesting data is generally never normally distributed? Solution: force normality, or quasi-normality. All of this is loosely based on Clayton V. Deutsche’s work on the GSLIB library, and his books.
The first function creates a cumulative distribution function, and its inverse, for a given data set d. The CDF and its inverse are reported as <Nx2> arrays, where N is the number of data points in d, the input data set.
def cdf( d, bins=12 ):
N = len( d )
counts, intervals = np.histogram( d, bins=bins )
h = np.diff( intervals ) / 2.0
f, finv = np.zeros((N,2)), np.zeros((N,2))
idx, k, T = 0, 0, float( np.sum( counts ) )
for count in counts:
for i in range( count ):
x = intervals[idx]+h[0]
y = np.cumsum( counts[:idx+1] )[-1] / T
f[k,:] = x, y
finv[k,:] = y, x
k += 1
idx += 1
return f, finv
This next awkward function returns an interpolation function for the CDFs created by the previous function. It’s a function that returns a function, thanks to functions being first class citizens in Python land. The basis of this function is the scipy.interpolate.interp1d() function, but since this function throws an error when you give it a value outside the original interpolation range, I’ve taken care of those boundary conditions. I could have alternatively used a try/except structure, and I may yet, but this works for now.
def fit( d ):
x, y = d[:,0], d[:,1]
def f(t):
if t <= x.min():
return y[ np.argmin(x) ]
elif t >= x.max():
return y[ np.argmax(x) ]
else:
intr = scipy.interpolate.interp1d( x, y )
return intr(t)
return f
Finally, here are my transformation and back-transformation functions. Depending on the original distribution of the data, the transformation to a standard normal distribution may or may not work out so nicely.
# transform data to normal dist
def to_norm( data, bins=12 ):
mu = np.mean( data )
sd = np.std( data )
z = ( data - mu ) / sd
f, inv = cdf( z, bins=bins )
z = scipy.stats.norm(0,1).ppf( f[:,1] )
z = np.where( z==np.inf, np.nan, z )
z = np.where( np.isnan( z ), np.nanmax( z ), z )
param = ( mu, sd )
return z, inv, param, mu, sd
# transform data from normal dist back
def from_norm( data, inv, param, mu, sd ):
h = fit( inv )
f = scipy.stats.norm(0,1).cdf( data )
z = [ h(i)*sd + mu for i in f ]
return z
So that’s it. I’ll revisit this code and make it nicer and put comments in it, but this is a working example. Tomorrow I’ll work on the two dimensional Sequential Gaussian Simulation.
One thought on “Z-score Transform for Geostatistics”
Comments are closed.