I explained in a previous post how to quickly and easily grab data from Excel files using pandas. I use this approach when I know that there’s a ton of data in the Excel file and I want it in a pandas DataFrame. If I am only extracting a handful of values, I like to use a lower level module, xlrd
.
Opening a Workbook
First we import xlrd
, then we define the file name and the sheet name that we are interested in. Next, in line 11, we open the Excel file as a Book
object. Then, in line 15, we use the Book
object to create a Sheet
object, from which we can access specific cells.
import xlrd # the filename fn = "workBook.xls" # the name of the sheet sht = "Summary" # open a workbook by filename # this returns a Book object wkbk = xlrd.open_workbook( fn ) # use the Book object # to produce a Sheet object sheet = wkbk.sheet_by_name( sht )
Accessing Data
Given a row number and column number, we can access the data at that cell. While Excel cells are indexed by the column letter and the row number, Python indexes these guys by the row number and the column number, sans letters. Therefore the cell K4 in Excel is (3,11) in Python.
# extract a value from the sheet row = 3 col = 5 value = sheet.cell_value( row, col ) # alternatively we can say value = sheet.cell( row, cal ).value
We can also grab a subset of a column; this is often referred to as a slice. Note that in Python the end of the slice is exclusive, not inclusive. (This complements the convention of indexing arrays beginning with zero rather than one as in Excel/VBA and MATLAB.) Therefore the range B2:B5 in Excel is the slice 1:5 in Python; recall that the indexing begins at zero in Python, not one. As an example, let’s grab the values from the range B2:B5.
# specify the column # A -> 0, B-> 1,... col = 1 # indexing starts at 0, not 1 row_begin = 1 # <- 1 = 2-1 # again, indexing starts at 0, but the # upper bound is exclusive, not inclusive row_end = 5 # <- 5 = 5-1+1 values = sheet.col_values( col, row_begin, row_end )
Handling Dates
Excel handles dates awkwardly, to quote the xlrd
documentation,
In reality, there are no such things. What you have are floating point numbers and pious hope.
Bearing that in mind, xlrd
offers some functionality for handling dates. It will return a tuple in the format ( year, month, day, hour, minute, second )
. Assuming that our date and/or time is at C4 in the Excel sheet, we can extract that date using the cell using the cell_value()
function described above, and by using the xldate_as_tuple()
function provided by the xlrd
module. We will also need to pass datamode
attribute of the Book
object to the function.
xlrd.xldate_as_tuple( sheet.cell_value( 3, 2 ), wkbk.datemode )
Update
Since I’ve written this, a new tool has come out called PyXLL that allows users to write Excel add-ins in Python, and perform other tasks. Learn more here: https://www.pyxll.com/blog/tools-for-working-with-excel-and-python/