Web Scraping the IRS with Python & Beautiful Soup


Back in February, I started seeing articles claiming that tax refunds would be smaller this year, and I wanted to dive into the data myself to see whether I agreed. I did a short write-up here; the tl;dr was that it was too early to tell.

I found this treasure trove of IRS data a while back while working on an analysis of tax-season retail promotions. I didn’t want to capture the data manually, so I wrote a quick and dirty script for it. So here it is -

I’ve put the script on Google Colab so you can follow along. You’ll want to open it in playground mode or copy it to your Drive to interact with it.

Import everything we will need.

import requests
from bs4 import BeautifulSoup
import re
import numpy as np
import pandas as pd
import datetime as dt
import pytz

First we will need to study the page a little bit.

Use Google Chrome’s developer tools to inspect the webpage (shortcut on Mac: Cmd+Shift+C).

Notice that each of the links is housed inside an HTML “a” tag within the element selected above, so our goal is to get Beautiful Soup to narrow in on this section and parse the “a” tags so we can save each link’s href into a list. There are also miscellaneous links (see “2010 Filing Season Statistics”) that we want to avoid, so we will use regex to save only the links whose text is in a date format.

url = "https://www.irs.gov/newsroom/filing-season-statistics-by-year"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
links = []
linktable = soup.find(class_='panel-group accordion-list-group accordion-list-items')
for a in linktable.find_all('a'):
    if re.search(r'^[0-9]{2}\/[0-9]{2}\/[0-9]{2,4}$', a.text): # Matches ##/##/## or ##/##/#### format
        links.append([a.text, a.get('href')]) # Save [date, href] pairs

print("Found a total of {} links!".format(len(links)))

Great, now we have a list of links. Looking through the data (try: pd.DataFrame(links)), it looks like the IRS linked a couple of pages incorrectly, so let’s fix that really quickly.

for link in links:
    if link[0] == '03/20/15':
        link[1] = '/newsroom/filing-season-statistics-for-week-ending-march-20-2015'
    elif link[0] == '02/17/17':
        link[1] = '/newsroom/filing-season-statistics-for-week-ending-february-17-2017'
links.append(['04/14/17','/newsroom/filing-season-statistics-for-week-ending-april-14-2017']) # Link IRS omitted when updated website

So let’s see what links looks like …
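A quick way to preview it is to wrap the list in a DataFrame. The entries below are a couple of illustrative rows in the same [date, href] shape the scraper collects:

```python
import pandas as pd

# Illustrative sample of scraped entries (the real list has one row per weekly report)
links = [
    ['04/14/17', '/newsroom/filing-season-statistics-for-week-ending-april-14-2017'],
    ['03/20/15', '/newsroom/filing-season-statistics-for-week-ending-march-20-2015'],
]
preview = pd.DataFrame(links, columns=['date', 'href'])
print(preview)
```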


Now let’s build a scraper for the individual pages and process the data a little bit.

First, use Chrome’s dev tools to inspect a few pages (one recent, one dated) and see what they look like.

Structurally, it’s a simple table <tr><td>…</td></tr> structure. So to scrape it, the general idea is to point beautiful soup to look at the <div> with the data, then identify all the rows (tr) then loop through all the columns (td) and dump that data into a dataframe.

Exploring the pages you will also see that there is a bit to clean through:

  • Some of the older statistics pages actually have 6 columns instead of 4, and we will need to normalize them to 4 columns.
  • Symbols, trailing spaces, unicode characters, etc. to delete
  • Ambiguous row names to rename
    • For example, “Number” and “Amount” each occur twice on the page and only make sense in the context of their headings (“Total Refunds:”, “Direct Deposit Refunds”)
      • To solve this, I appended a row number to each label and created a dictionary to map each old label+number to my ideal label. (Omitted in this post, check the Colab notebook!)
  • Shorthand values like $147.577 Billion need to be expanded into actual numbers.
  • Some old (2011, 2012) links are actually PDFs and will fail - not super important because the data is old, but I want the script to note the failures and print a message when it fails.
  • One link is actually broken and returns a 404 error - can’t fix this one, haha.
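The label+number dictionary looks roughly like this (the full mapping is in the Colab notebook; the labels and row indices below are illustrative, not the exact ones from the notebook):

```python
# Hypothetical slice of the mapping: scraped label + row index -> systematic name
category_map = {
    'Total Returns Received_0': 'returns_received',
    'Total Returns Processed_1': 'returns_processed',
    'Number_4': 'total_refund_count',   # "Number" under the "Total Refunds:" heading
    'Amount_5': 'total_refund_value',   # "Amount" under the "Total Refunds:" heading
    'Number_7': 'directdeposit_total_refund_count',
    'Amount_8': 'directdeposit_total_refund_value',
}
```

Mapping on label+row-number pairs means the two “Number” rows no longer collide, since each gets a unique key.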
# Create df to store data
final_df = pd.DataFrame(columns=['category','LY','CY','perc_change','Date'])

for link in links:
    baseurl = 'https://www.irs.gov/'
    try:
        r = requests.get(baseurl+link[1])
        soup = BeautifulSoup(r.text, 'html.parser')

        mytable = soup.find(class_= 'field field--name-body field--type-text-with-summary field--label-hidden field--item')
        myrows = mytable.find_all('tr')
        mylist = []
        for tr in myrows:
            td = tr.find_all('td')
            row = [cell.text for cell in td]
            mylist.append(row)

        # Some legacy data stored in 6 columns instead of 4
        for row in mylist:
            if len(row) > 4:
                row.pop(4) # Remove "\nBillion\n" on Column 5
                row.pop(2) # Remove "\nBillion\n" on Column 3

        # Put scraped data into dataframe, process:
        df = pd.DataFrame(mylist, columns=['category','LY','CY','perc_change'])
        df = df.replace(r'^(\s+)$', np.nan, regex=True).fillna(value=np.nan) # Replace whitespace-only cells with NaN
        df = df.replace(r'(\xa0|\n|\$|,|Billion|billion)', '', regex=True) # Remove symbols, unicode, etc.
        df = df[df['LY'].notnull()].reset_index(drop=True)

        df['Date'] = link[0]

        # Change category names to be more systematic/readable
        for idx, row in enumerate(df.category): # append row number to name because there are duplicate labels in the scraped data
            df.category.iloc[idx] = row+'_'+str(idx)
        df['category'] = df['category'].map(category_map)

        # Make total_refund_value and directdeposit_total_refund_value into a number
        df.loc[:,'CY'] = df.CY.replace(r'\s','',regex=True)
        df.loc[df['category']=='total_refund_value', 'CY'] = float(df.loc[df['category']=='total_refund_value', 'CY']) * 1000000000
        df.loc[df['category']=='directdeposit_total_refund_value', 'CY'] = float(df.loc[df['category']=='directdeposit_total_refund_value', 'CY']) * 1000000000

        # Append to finished df
        final_df = final_df.append(df, ignore_index = True).reset_index(drop=True)

    except Exception:
        print("Failed for {}, url: {}".format(link[0],link[1]))

final_df = final_df[final_df.category != 'delete']
final_df.loc[:,'Date'] = pd.to_datetime(final_df.Date)

print('Finished scraping data into dataframe!')
Failed for 12/27/13, url: https://www.irs.gov/pub/irs-utl/12-27-2013.pdf
Failed for 11/22/13, url: https://www.irs.gov/pub/newsroom/Filing%20Season%20Stats%20--%2011-22-13.pdf
Failed for 05/10/13, url: /newsroom/filing-season-statistics-may-10-2013
Finished scraping data into dataframe!

Graphing the data with matplotlib/seaborn, we see some weird drops/spikes. They’re due to typos/data errors in a few of the records from 2017/05/05 and 2016/12/02, which I will fix manually.

# Data Corrections (IRS published incorrect data, haha)
final_df.loc[(final_df.category == 'total_refund_count') & (final_df.Date == "2017-05-05"), 'CY'] = 101641000 #missing 1 zero
final_df.loc[(final_df.category == 'returns_processed') & (final_df.Date == "2016-12-02"), 'CY'] = 151891000 # typo from web "," changed to "."

Last step, extracting the data I want.

The statistics published are running totals, but I want period-by-period figures, so I need to undo that.

To get the statistics for a given period, I subtract the previous period from the current one (e.g. subtracting the 02/22/2019 total from the 03/01/2019 total gives the figures for the week ending 03/01/2019). I also have to reset every year, so I don’t subtract the previous year’s final total from the new year’s data (e.g. 11/23/2018 from 02/01/2019).
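The reset logic can be sketched on toy data (the numbers below are made up for illustration):

```python
import pandas as pd

# Toy running totals: the drop from 40 to 5 marks a new filing season
cy = pd.Series([10, 25, 40, 5, 12])
prev = cy.shift(1).fillna(0)
prev[prev > cy] = 0           # a drop means a reset, so don't subtract last year's total
diff = cy - prev              # period-by-period values
print(diff.tolist())          # [10.0, 15.0, 15.0, 5.0, 7.0]
```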

def calcdiffbyyear(final_df, category_label, new_col_name):
    """Extract period-by-period data from a running total that resets every calendar year.

    Args:
        final_df: DataFrame with "Date" (datetime), "category" and "CY" (numeric) columns.
        category_label: value in the "category" column of final_df to filter for.
        new_col_name: column name for the calculated value in the output DataFrame.

    Returns:
        Processed DataFrame with columns ["Date", new_col_name].
    """

    output_df = final_df[final_df.category ==category_label][['Date', 'CY']].sort_values('Date').reset_index(drop=True).copy()
    output_df.CY = output_df.CY.astype('f8')

    # Calculate diff from previous date given by date
    output_df['prev'] = output_df.CY.shift(1).replace(np.nan, 0)
    output_df.loc[output_df.prev > output_df.CY,'prev'] = 0 #Account for date reset every year
    output_df[new_col_name] = output_df.CY - output_df.prev
    output_df.drop(labels = ['prev','CY'], axis=1, inplace=True)

    return output_df

returnsreceived = calcdiffbyyear(final_df, 'returns_received', 'returns_received')
returnsprocessed = calcdiffbyyear(final_df, 'returns_processed', 'returns_processed')
totalrefundvalue = calcdiffbyyear(final_df, 'total_refund_value', 'total_refund_value')
totalrefundcount = calcdiffbyyear(final_df, 'total_refund_count', 'total_refund_count')

Last step is just to merge all the dataframes together and export the result as a csv.

exportme = returnsreceived.merge(returnsprocessed, on='Date', how='outer') \
                            .merge(totalrefundcount, on='Date', how='outer') \
                            .merge(totalrefundvalue, on='Date', how='outer')

# Save file with timestamp
filepath = '/home/ubuntu/Notebooks/IRS_Scrape/'
filename = '{}irs_data_final.csv'.format(dt.datetime.now(pytz.timezone('America/Los_Angeles')).strftime('%Y%m%d_%H.%M_'))
exportme.to_csv(filepath+filename, index=False)
print('Saved data into csv: {}'.format(filepath+filename))

Automate it!

Schedule it as a cronjob

Not really sure when the IRS updates their statistics, so I’m going to run it every day at 3pm.

I use jupytext with my Jupyter notebooks, so I have a .py file that’s two-way synced with the notebook.
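If you want the same setup, jupytext can pair the notebook with a script (a sketch; the notebook filename here is assumed, adjust it to yours):

```shell
# Pair the notebook with a percent-format .py file (two-way sync)
jupytext --set-formats ipynb,py:percent irs_scrape_v2.ipynb

# Re-sync after editing either the .ipynb or the .py
jupytext --sync irs_scrape_v2.ipynb
```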

crontab -e

Inside, put the following. (To find your Python installation path, run which python in your terminal.)

# Run everyday at 3pm
0 15 * * * /home/ubuntu/anaconda3/bin/python /home/ubuntu/Notebooks/IRS_Scrape/irs_scrape_v2.py

Confused? Check out crontab guru

Bash Script to download the file from my EC2 instance with scp

That’s only necessary if you’re running your Jupyter notebook in the cloud (EC2), which is what I do.

scp -i yourkey.pem root@my.ec2.id.amazonaws.com:/somefilepath /yourfilepath/yourirsscrapedata.csv

Visualize it!

Open up the csv with the visualization tool of your choice and play with the data. Check out my Tableau Dashboard!

Thoughts, comments?:

Reach out on LinkedIn or email.