Working with Excel Files using Python

I explained in a previous post how to quickly and easily grab data from Excel files using pandas. I use this approach when I know that there’s a ton of data in the Excel file and I want it in a pandas DataFrame. If I am only extracting a handful of values, I like to use a lower level module, xlrd.

Opening a Workbook

First we import xlrd, then we define the file name and the sheet name that we are interested in. Next, in line 11, we open the Excel file as a Book object. Then, in line 15, we use the Book object to create a Sheet object, from which we can access specific cells.

import xlrd

# the filename
fn = "workBook.xls"

# the name of the sheet
sht = "Summary"

# open a workbook by filename
# this returns a Book object
wkbk = xlrd.open_workbook( fn )

# use the Book object
# to produce a Sheet object
sheet = wkbk.sheet_by_name( sht )

Accessing Data

Given a row number and column number, we can access the data at that cell. While Excel cells are indexed by the column letter and the row number, Python indexes these guys by the row number and the column number, sans letters. Therefore the cell K4 in Excel is (3,11) in Python.

# extract a value from the sheet
row = 3
col = 5
value = sheet.cell_value( row, col )
# alternatively we can say
value = sheet.cell( row, cal ).value

We can also grab a subset of a column; this is often referred to as a slice. Note that in Python the end of the slice is exclusive, not inclusive. (This complements the convention of indexing arrays beginning with zero rather than one as in Excel/VBA and MATLAB.) Therefore the range B2:B5 in Excel is the slice 1:5 in Python; recall that the indexing begins at zero in Python, not one. As an example, let’s grab the values from the range B2:B5.

# specify the column
# A -> 0, B-> 1,...
col = 1

# indexing starts at 0, not 1
row_begin = 1    # <- 1 = 2-1

# again, indexing starts at 0, but the
# upper bound is exclusive, not inclusive
row_end = 5      # <- 5 = 5-1+1

values = sheet.col_values( col, row_begin, row_end )

Handling Dates

Excel handles dates awkwardly, to quote the xlrd documentation,

In reality, there are no such things. What you have are floating point numbers and pious hope.

Bearing that in mind, xlrd offers some functionality for handling dates. It will return a tuple in the format ( year, month, day, hour, minute, second ). Assuming that our date and/or time is at C4 in the Excel sheet, we can extract that date using the cell using the cell_value() function described above, and by using the xldate_as_tuple() function provided by the xlrd module. We will also need to pass datamode attribute of the Book object to the function.

xlrd.xldate_as_tuple( sheet.cell_value( 3, 2 ), wkbk.datemode )

Update

Since I’ve written this, a new tool has come out called PyXLL that allows users to write Excel add-ins in Python, and perform other tasks. Learn more here: https://www.pyxll.com/blog/tools-for-working-with-excel-and-python/