Speed up pandas' slow read_excel() in Python

Pandas’ read_excel performance is way too slow.

Reading Excel files with pandas (pandas.read_excel()) is really, really slow; even with small datasets (<50,000 rows), it can take minutes. To speed things up, we are going to convert the Excel files from .xlsx to .csv and use pandas.read_csv() instead.
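To make the end goal concrete, the idea is to swap a direct read_excel() call for a read_csv() call on a converted copy of the same data. The file names here are hypothetical; the conversion step is covered below:

import pandas as pd

# Slow: pandas parses the .xlsx workbook directly
df = pd.read_excel('./data/Extract/report.xlsx')

# Fast: read the same data from a .csv produced by xlsx2csv (conversion shown below)
df = pd.read_csv('./data/csv/report.csv')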

Getting Started

We will need xlsx2csv.py from this GitHub repo: git clone https://github.com/dilshod/xlsx2csv.git, or download xlsx2csv directly.

Using xlsx2csv to batch convert xlsx

Put your Python script in the same folder as xlsx2csv.py, or replace the relative xlsx2csv.py path references below with absolute paths.

import glob
import re
import subprocess

import pandas as pd

We will be using glob to get a list of all the .xlsx files that we want to convert to .csv, a regular expression to extract the filename of each .xlsx, subprocess to call Python to run xlsx2csv.py, and pandas later on to read and concatenate the resulting CSVs.

xlsx_path = './data/Extract/'
csv_path = './data/csv/'
list_of_xlsx = glob.glob(xlsx_path+'*.xlsx')

Set xlsx_path to the folder containing the .xlsx files, and csv_path to the folder where you want the CSVs saved.

We want to use xlsx2csv to unzip/convert each xlsx into a csv. You can read the docs for xlsx2csv.py, or simply run python xlsx2csv.py in your terminal to see the usage:

$ python xlsx2csv.py
usage: xlsx2csv.py [-h] [-v] [-a] [-c OUTPUTENCODING] [-d DELIMITER] [--hyperlinks] [-e]
                   [-E EXCLUDE_SHEET_PATTERN [EXCLUDE_SHEET_PATTERN ...]] [-f DATEFORMAT] [-t TIMEFORMAT]
                   [--floatformat FLOATFORMAT] [--sci-float] [-I INCLUDE_SHEET_PATTERN [INCLUDE_SHEET_PATTERN ...]]
                   [--ignore-formats IGNORE_FORMATS [IGNORE_FORMATS ...]] [-l LINETERMINATOR] [-m] [-n SHEETNAME] [-i]
                   [--skipemptycolumns] [-p SHEETDELIMITER] [-q QUOTING] [-s SHEETID]
                   xlsxfile [outfile]

So converting a single xlsx to a csv looks like: python xlsx2csv.py [xlsx_filename] [csv_filename]. We can use Python's subprocess.call() to invoke xlsx2csv.py, looping through all the xlsx files and supplying the appropriate arguments.

for xlsx in list_of_xlsx:
    # Extract File Name on group 2 "(.+)"
    filename = re.search(r'(.+[\\|\/])(.+)(\.(xlsx))', xlsx).group(2)
    # Setup the call for subprocess.call()
    call = ["python", "./xlsx2csv.py", xlsx, csv_path+filename+'.csv']
    try:
        subprocess.call(call)  # On Windows use shell=True
    except Exception:
        print('Failed with {}'.format(xlsx))

Now if we like, we can concatenate all those csv files into a single big one.

outputcsv = './data/bigcsv.csv' #specify filepath+filename of output csv

listofdataframes = []
for file in glob.glob(csv_path+'*.csv'):
    df = pd.read_csv(file)
    if df.shape[1] == 24: # make sure 24 columns
        listofdataframes.append(df)
    else:
        print('{}  has {} columns - skipping'.format(file,df.shape[1]))

bigdataframe = pd.concat(listofdataframes).reset_index(drop=True)
bigdataframe.to_csv(outputcsv,index=False)
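From here on, later sessions can skip Excel entirely and load the combined CSV directly; a quick sanity check (using the outputcsv path defined above) might look like:

bigdataframe = pd.read_csv(outputcsv)
print(bigdataframe.shape)  # confirm row/column counts after concatenation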

Now the cost is frontloaded into the xlsx-to-csv conversion, and converting a whole folder of xlsx files can still be slow. We can speed things up by using multiple threads.

Improving performance - Using multiple threads.

Multithreading and multiprocessing are two different things. In our case, we are doing a lot of input/output (I/O) work, and multithreading is an easy way for our script to run xlsx2csv.py multiple times concurrently. Note that multithreading is still limited by the GIL, but that matters less here because we are primarily bottlenecked by I/O.

First, we will generate a list of commands to feed into subprocess.call():

commands = []
for filepath in glob.glob(xlsx_path+'*.xlsx'):
    filename = re.search(r'(.+[\\|\/])(.+)(\.(csv|xlsx|xls))', filepath)  # Extract the file name from group 2 "(.+)"

    call = ["python", "./xlsx2csv.py", filepath, csv_path+'{}.csv'.format(filename.group(2))]
    commands.append(call)

Then we will use Pool from multiprocessing.dummy, specify how many threads we want, and spawn them.

from multiprocessing.dummy import Pool

pool = Pool(2) # Specify How many concurrent threads

# Use functools.partial(subprocess.call, shell=True) in place of subprocess.call if you're on windows
for i, return_code in enumerate(pool.imap(subprocess.call, commands)):
    if return_code != 0:
        print("Command # {} failed with return code {}.".format(i, return_code))

While the script is running, if you open your file explorer, you will see multiple files being written out at once instead of one file at a time.

Using %%timeit in a Jupyter notebook on my laptop, we can see that using multiple threads indeed speeds things up.

Using 1 thread: 27.1 s ± 2.76 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Using 2 concurrent threads: 13 s ± 3.91 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
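To reproduce a measurement like this, a notebook cell along these lines should work; this is a sketch rather than the exact cell used above, and it simply re-runs the conversion loop under %%timeit with whichever pool size you want to test:

%%timeit
# Re-run the whole conversion with an n-thread pool; use Pool(1) for the single-thread baseline
pool = Pool(2)
for i, return_code in enumerate(pool.imap(subprocess.call, commands)):
    if return_code != 0:
        print("Command # {} failed with return code {}.".format(i, return_code))
pool.close()
pool.join()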

Depending on your processor(s) and how many threads you can or choose to spawn, the exact speedup will vary, but spreading the conversions across threads should speed things up noticeably.


Thoughts, comments?:

Reach out on LinkedIn or email.