Scraping Yahoo! Finance Data with BeautifulSoup

This weekend I wanted to work on collecting and plotting historical option contract prices. I used the following curl command to pull the AAPL option contract page from Yahoo! Finance and save the raw HTML to a file named aapl.

curl "http://finance.yahoo.com/q/op?s=AAPL&m=2016-01" > aapl
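If you would rather stay in Python for the download step, here is a rough sketch of the same request using only the standard library. The function names build_option_url and save_option_page are my own inventions for illustration, and the URL pattern is simply the one from the curl command above, so there is no guarantee Yahoo! still serves it.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# URL pattern copied from the curl command; assumed, not guaranteed to still work.
BASE_URL = "http://finance.yahoo.com/q/op"


def build_option_url(symbol, month):
    """Assemble the query string (s=<symbol>, m=<YYYY-MM>) used in the curl call."""
    return BASE_URL + "?" + urlencode({"s": symbol, "m": month})


def save_option_page(symbol, month, filename):
    """Hypothetical helper: download the page and write the raw bytes to disk."""
    with urlopen(build_option_url(symbol, month)) as response:
        html = response.read()
    with open(filename, "wb") as out:
        out.write(html)


# Only the URL is built here; call save_option_page("AAPL", "2016-01", "aapl")
# yourself if you want to hit the network.
print(build_option_url("AAPL", "2016-01"))
```

Keeping the URL construction separate from the download makes it easy to loop over several symbols or months later.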

BeautifulSoup

The first bit of the script below imports BeautifulSoup and pandas. The second bit grabs a filename from the command line, opens the file as data, and passes data through BeautifulSoup to produce soup. I knew from looking at the raw HTML that the call and put option contracts were located in a div element whose class attribute was follow-quote-area. So, in line 11 of the script (the soup.find_all() call), I grabbed the two tables in that div element and unpacked them into calls and puts.

Next, I defined a function that iterates through the table rows, using x.find_all("tr"), and then through the data cells in each row, using row.find_all("td"), to collect the items in the table. The final step in extractData() cleans up the output by tossing any row that doesn't contain exactly 10 cells, which is the number of columns in the option table.

#!/usr/bin/env python

from bs4 import BeautifulSoup
import pandas as pd
import sys

filename = sys.argv[1]
with open( filename, "r" ) as data:  # close the file once it has been parsed
    soup = BeautifulSoup( data, "html.parser" )

calls, puts = soup.find_all( attrs={ "class": "follow-quote-area" } )

def extractData( x ):
    arr = []
    for row in x.find_all("tr"):
        arr.append([])
        for cell in row.find_all("td"):
            value = cell.get_text().strip()
            arr[-1].append( value )
    arr = [ r for r in arr if len(r) == 10 ]  # a list, not a lazy filter object on Python 3
    return arr
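To see the find_all() and row-filtering logic in isolation, here is a small sketch that runs against an inline HTML string instead of the saved Yahoo! page. The markup, the contract names, and every number in SAMPLE_HTML are made up for illustration, and extract_rows is just my compact restatement of extractData(), so treat it as a toy, not as the real page layout.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the saved page: two tables carrying the class the
# script searches for. All values are placeholders.
SAMPLE_HTML = """
<div>
  <table class="follow-quote-area">
    <tr><th>Strike</th><th>Contract Name</th></tr>
    <tr><td>100.00</td><td>CALL-A</td><td>1.00</td><td>0.90</td><td>1.10</td>
        <td>0.00</td><td>0.00%</td><td>10</td><td>20</td><td>30.00%</td></tr>
    <tr><td>stray cell</td></tr>
  </table>
  <table class="follow-quote-area">
    <tr><td> 100.00 </td><td>PUT-A</td><td>2.00</td><td>1.90</td><td>2.10</td>
        <td>0.00</td><td>0.00%</td><td>5</td><td>15</td><td>35.00%</td></tr>
  </table>
</div>
"""


def extract_rows(table, width=10):
    """Collect the stripped text of every <td> per row, keeping full-width rows only."""
    rows = []
    for tr in table.find_all("tr"):
        rows.append([td.get_text().strip() for td in tr.find_all("td")])
    # Header rows (only <th> cells) and partial rows fall out here.
    return [cells for cells in rows if len(cells) == width]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
# Unpacking assumes exactly two matches; anything else raises ValueError.
calls_table, puts_table = soup.find_all(attrs={"class": "follow-quote-area"})

calls = extract_rows(calls_table)
puts = extract_rows(puts_table)
print(calls)
print(puts)
```

The tuple unpacking doubles as a cheap sanity check: if the page layout changes and the class matches more or fewer than two elements, the script fails loudly instead of silently mislabeling the data.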

pandas

Next, we can load the extracted rows into pandas DataFrame objects for immediate processing, or we can pickle those DataFrame objects for later use.

calls = extractData( calls )
puts = extractData( puts )

columns = [ "Strike"
          , "ContractName"
          , "Last"
          , "Bid"
          , "Ask"
          , "Change"
          , "PctChange"
          , "Volume"
          , "OpenInterest"
          , "ImpliedVolatility" ]

calls = pd.DataFrame( calls, columns=columns )
puts = pd.DataFrame( puts, columns=columns )
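For the pickling route, something along these lines should work. This is a sketch under a few assumptions of mine: the single row of values is invented, the file name calls.pkl is arbitrary, and the numeric conversion is an optional extra step I added because every scraped cell arrives as a string.

```python
import os
import tempfile

import pandas as pd

columns = ["Strike", "ContractName", "Last", "Bid", "Ask",
           "Change", "PctChange", "Volume", "OpenInterest", "ImpliedVolatility"]

# One made-up row standing in for the output of extractData().
rows = [["100.00", "CALL-A", "1.00", "0.90", "1.10",
         "0.00", "0.00%", "10", "20", "30.00%"]]
calls = pd.DataFrame(rows, columns=columns)

# Optional: turn the plain price columns into floats; anything that
# cannot be parsed becomes NaN instead of raising.
numeric = ["Strike", "Last", "Bid", "Ask"]
calls[numeric] = calls[numeric].apply(pd.to_numeric, errors="coerce")

# Round-trip through a pickle file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "calls.pkl")
calls.to_pickle(path)
restored = pd.read_pickle(path)

print(restored.dtypes)
```

Pickle keeps the dtypes intact, which a CSV round trip would not, but only unpickle files you created yourself since loading a pickle can execute arbitrary code.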