Two Dimensional Sequential Gaussian Simulation in Python

In this post I will discuss an implementation of sequential Gaussian simulation (SGS) from the field of geostatistics. Geostatistics is simply a statistical consideration of spatially distributed data. Sequential Gaussian simulation is a technique used to “fill in” a grid representing the area of interest using a smattering of observations, and a model of the observed trend. The basic workflow incorporates three steps:

  1. Modeling the measured variation using a semivariogram
  2. Using the semivariogram to perform interpolation by kriging
  3. Running simulations to estimate the spatial distribution of the variable(s) of interest

We will use code from previous posts to perform simple kriging and z-score transformations, which will be available at cjohnson318/geostatistics. The four-inch paint brush version of two dimensional sequential Gaussian simulation is as follows:

Sequential Gaussian Simulation

  1. Define a grid, this may be fine or coarse
  2. Place the z-transformed initial data into the nearest grid cells
  3. Visit the rest of the grid cells at random and perform kriging using all of the data available
  4. Back-transform the data to retrieve the approximate distribution

There are two important things to note here:

  1. SGS expects normally distributed data, so if the data is not normally distributed, we use a z-score transformation before SGS and a back-transformation afterwards
  2. In step 3. above, as we move randomly about the grid, we add each newly kriged estimate to the data used for kriging each time

The second point means that our data set for kriging increases by one data point for each step until we visit all of the cells in our grid. The randomization allows us to run multiple simulations and then take their mean.

Data

For this example, I will be using the 140 data points found in the data set cluster.dat prepared by Clayton V. Deutsch. This data set and others in the GSLIB: Geostatistical Software Library and User’s Guide may be found here.

I have prepared a Python function for reading this data format.

def read_data( fn ):
    f = open( fn, "r" )
    # title of the data set
    title = f.readline()
    # number of variables
    nvar = int( f.readline() )
    # variable names
    columns = [ f.readline().strip() for i in range( nvar ) ]
    # make a list for the data
    data = list()
    # for each line of the data
    while True:
        # read a line
        line = f.readline()
        # if that line is empty
        if line == '':
            # the jump out of the while loop
            break
        # otherwise, append that line to the data list
        else:
            data.append( line )
    # strip off the newlines, and split by whitespace
    data = [ i.strip().split() for i in data ]
    # turn a list of list of strings into an array of floats
    data = np.array( data, dtype=np.float )
    # combine the data with the variable names into a DataFrame
    df = pandas.DataFrame( data, columns=columns,  )
    return df

We’ll import some packages and load our data into our workspace using the following lines.

import time
import numpy as np
import scipy.stats
import scipy.optimize
import scipy.interpolate

import krige
import utilities

import random

clstr = read_data( 'cluster.dat' )

Gridding

The next step is to define a grid. The data in cluster.dat lies in 50 by 50 foot region, so we can form a grid from zero to fifty with thirty steps as follows.

# we want 30 steps between 0 and 50, in both dimensions
nx = 50 ; xsteps = 30
ny = 50 ; ysteps = 30
xrng = np.linspace(0,nx,num=xsteps)
yrng = np.linspace(0,ny,num=ysteps)

Randomization

This part creates a random path through the spatial domain. First I shuffle a vector of indices, attach some data to those indices, and then sort them in order to shuffle the data. The data is the cell address, an ordered pair of integers, and the physical location, which is an ordered pair of floats, describing a coordinate in feet.

# total number of cells
N = xsteps * ysteps

# an array of values 0 to N-1
idx = np.arange( N )

# randomize idx in place
random.shuffle( idx )

# create a list named path
path = list()

# initialize an index for idx
t = 0

# walk through the 2D grid..
for i in range( xsteps ):
    for j in range( ysteps ):
        
        # the first element is the shuffled index,
        # then the cell address, then the physical location
        path.append( [ idx[t], (i,j), (xrng[i],yrng[j]) ] )
        
        # increment the index
        t += 1

# sort the shuffled index, shuffling the cell 
# addresses and the physical locations (in feet)
path.sort()

Simulation

This part transforms the data before it goes into the sgs() function, and bundles the location information with the the transformed variable. The sgs() function then visits all of the cells in the grid at random, calculates a kriging estimate, adds that to the original data, and repeats.

# x and y coordinates in feet
locations = np.array( clstr[['Xlocation','Ylocation']] )

# value of the variable of interest, e.g., porosity or permeability
variable = np.array( clstr['Primary'] )

# the z-score transformation
norm, inv, param, mu, sd = to_norm( variable, 1000 )

# the data to be kriged and appended to
# during the sequential Gaussian simulation
data = np.vstack((locations.T,norm)).T

def sgs( data, bw, path, xsteps, ysteps ):
    '''
    Input:  (data)   <N,3> NumPy array of data
            (bw)     bandwidth of the semivariogram
            (path)   randomized path through region of interest
            (xsteps) number of cells in the x dimension
            (ysteps) number of cells in the y dimension
    Output: (M)      <xsteps,ysteps> NumPy array of data
                     representing the simulated distribution
                     of the variable of interest 
    '''
    # create array for the output
    M = np.zeros((xsteps,ysteps))
    # generate the lags for the semivariogram
    hs = np.arange(0,50,bw)
    # for each cell in the grid..
    for step in path :
        # grab the index, the cell address, and the physical location
        idx, cell, loc = step
        # perform the kriging
        kv = krige( data, spherical, hs, bw, loc, 4 )
        # add the kriging estimate to the output
        M[ cell[0], cell[1] ] = kv
        # add the kriging estimate to a spatial location
        newdata = [ loc[0], loc[1], kv ]
        # add this new point to the data used for kriging
        data = np.vstack(( data, newdata ))
    return M

We can time the execution of our code as follows:

t0 = time.time()
M = sgs( data, 5, path, xsteps, ysteps )
t1 = time.time()
print (t1-t0)/60.0

Visualization

This is the best part. This is what the simulation looks like before we transform it back from the z-score transformation.

matshow( M[:,::-1].T, cmap=my_cmap )
colorbar();
xs, ys = str(xsteps), str(ysteps)
xlabel('X') ; ylabel('Y') ; title('Raw SGS '+xs+'x'+ys+'\n')
savefig( 'sgs_untransformed_'+xs+'_by_'+ys+'.png', fmt='png', dpi=200 )

This is what the simulation looks like after we transform it back from the z-score transform.

z = from_norm( M.ravel(), inv, param, mu, sd )
z = np.reshape( z, ( xsteps, ysteps ) )
matshow( z[:,::-1].T, cmap=my_cmap )
colorbar() ;
xs, ys = str(xsteps), str(ysteps)
xlabel('X') ; ylabel('Y') ; title('SGS '+xs+'x'+ys+'\n')
savefig( 'sgs_transformed_'+xs+'_by_'+ys+'.png', fmt='png', dpi=200 )