In this post I will discuss an implementation of sequential Gaussian simulation (SGS) from the field of geostatistics. Geostatistics is simply a statistical consideration of spatially distributed data. Sequential Gaussian simulation is a technique used to “fill in” a grid representing the area of interest using a smattering of observations, and a model of the observed trend. The basic workflow incorporates three steps:
- Modeling the measured variation using a semivariogram
- Using the semivariogram to perform interpolation by kriging
- Running simulations to estimate the spatial distribution of the variable(s) of interest
We will use code from previous posts to perform simple kriging and z-score transformations, which will be available at cjohnson318/geostatistics. The four-inch paint brush version of two dimensional sequential Gaussian simulation is as follows:
Sequential Gaussian Simulation
- Define a grid, this may be fine or coarse
- Place the z-transformed initial data into the nearest grid cells
- Visit the rest of the grid cells at random and perform kriging using all of the data available
- Back-transform the data to retrieve the approximate distribution
There are two important things to note here:
- SGS expects normally distributed data, so if the data is not normally distributed, we use a z-score transformation before SGS and a back-transformation afterwards
- In step 3. above, as we move randomly about the grid, we add each newly kriged estimate to the data used for kriging each time
The second point means that our data set for kriging increases by one data point for each step until we visit all of the cells in our grid. The randomization allows us to run multiple simulations and then take their mean.
Data
For this example, I will be using the 140 data points found in the data set cluster.dat
prepared by Clayton V. Deutsch. This data set and others in the GSLIB: Geostatistical Software Library and User’s Guide may be found here.
I have prepared a Python function for reading this data format.
def read_data( fn ): f = open( fn, "r" ) # title of the data set title = f.readline() # number of variables nvar = int( f.readline() ) # variable names columns = [ f.readline().strip() for i in range( nvar ) ] # make a list for the data data = list() # for each line of the data while True: # read a line line = f.readline() # if that line is empty if line == '': # the jump out of the while loop break # otherwise, append that line to the data list else: data.append( line ) # strip off the newlines, and split by whitespace data = [ i.strip().split() for i in data ] # turn a list of list of strings into an array of floats data = np.array( data, dtype=np.float ) # combine the data with the variable names into a DataFrame df = pandas.DataFrame( data, columns=columns, ) return df
We’ll import some packages and load our data into our workspace using the following lines.
import time import numpy as np import scipy.stats import scipy.optimize import scipy.interpolate import krige import utilities import random clstr = read_data( 'cluster.dat' )
Gridding
The next step is to define a grid. The data in cluster.dat
lies in 50 by 50 foot region, so we can form a grid from zero to fifty with thirty steps as follows.
# we want 30 steps between 0 and 50, in both dimensions nx = 50 ; xsteps = 30 ny = 50 ; ysteps = 30 xrng = np.linspace(0,nx,num=xsteps) yrng = np.linspace(0,ny,num=ysteps)
Randomization
This part creates a random path through the spatial domain. First I shuffle a vector of indices, attach some data to those indices, and then sort them in order to shuffle the data. The data is the cell address, an ordered pair of integers, and the physical location, which is an ordered pair of floats, describing a coordinate in feet.
# total number of cells N = xsteps * ysteps # an array of values 0 to N-1 idx = np.arange( N ) # randomize idx in place random.shuffle( idx ) # create a list named path path = list() # initialize an index for idx t = 0 # walk through the 2D grid.. for i in range( xsteps ): for j in range( ysteps ): # the first element is the shuffled index, # then the cell address, then the physical location path.append( [ idx[t], (i,j), (xrng[i],yrng[j]) ] ) # increment the index t += 1 # sort the shuffled index, shuffling the cell # addresses and the physical locations (in feet) path.sort()
Simulation
This part transforms the data before it goes into the sgs()
function, and bundles the location information with the the transformed variable. The sgs()
function then visits all of the cells in the grid at random, calculates a kriging estimate, adds that to the original data, and repeats.
# x and y coordinates in feet locations = np.array( clstr[['Xlocation','Ylocation']] ) # value of the variable of interest, e.g., porosity or permeability variable = np.array( clstr['Primary'] ) # the z-score transformation norm, inv, param, mu, sd = to_norm( variable, 1000 ) # the data to be kriged and appended to # during the sequential Gaussian simulation data = np.vstack((locations.T,norm)).T def sgs( data, bw, path, xsteps, ysteps ): ''' Input: (data) <N,3> NumPy array of data (bw) bandwidth of the semivariogram (path) randomized path through region of interest (xsteps) number of cells in the x dimension (ysteps) number of cells in the y dimension Output: (M) <xsteps,ysteps> NumPy array of data representing the simulated distribution of the variable of interest ''' # create array for the output M = np.zeros((xsteps,ysteps)) # generate the lags for the semivariogram hs = np.arange(0,50,bw) # for each cell in the grid.. for step in path : # grab the index, the cell address, and the physical location idx, cell, loc = step # perform the kriging kv = krige( data, spherical, hs, bw, loc, 4 ) # add the kriging estimate to the output M[ cell[0], cell[1] ] = kv # add the kriging estimate to a spatial location newdata = [ loc[0], loc[1], kv ] # add this new point to the data used for kriging data = np.vstack(( data, newdata )) return M
We can time the execution of our code as follows:
t0 = time.time() M = sgs( data, 5, path, xsteps, ysteps ) t1 = time.time() print (t1-t0)/60.0
Visualization
This is the best part. This is what the simulation looks like before we transform it back from the z-score transformation.
matshow( M[:,::-1].T, cmap=my_cmap ) colorbar(); xs, ys = str(xsteps), str(ysteps) xlabel('X') ; ylabel('Y') ; title('Raw SGS '+xs+'x'+ys+'\n') savefig( 'sgs_untransformed_'+xs+'_by_'+ys+'.png', fmt='png', dpi=200 )
This is what the simulation looks like after we transform it back from the z-score transform.
z = from_norm( M.ravel(), inv, param, mu, sd ) z = np.reshape( z, ( xsteps, ysteps ) ) matshow( z[:,::-1].T, cmap=my_cmap ) colorbar() ; xs, ys = str(xsteps), str(ysteps) xlabel('X') ; ylabel('Y') ; title('SGS '+xs+'x'+ys+'\n') savefig( 'sgs_transformed_'+xs+'_by_'+ys+'.png', fmt='png', dpi=200 )