Updated date:

How to Extract PDF Files to Excel using Python

Kevin is a data engineer and advanced analytics developer. He has over 20 years experience in the field.

In Data Engineering, it is often necessary to extract data, especially table data, from PDFs. For various analytical exercises, it is often vital or necessary to store this data in Excel, either for ad hoc analysis, or to build a data set or even to combine with other data to form a simple datalake. In this tutorial, we will extract data from a PDF that contains data stored in a table and save the data to a csv file and an Excel file using PyPDF2 and Pandas.

For this project, I will use the following Python modules (libraries):

  • PyPDF2
  • Pandas
  • Openpyxl

Create Virtual Environment

C:\...\ExtractPDF>python -m venv pyextrac
C:\...\ExtractPDF>pyextrac\scripts\activate.bat
(pyextrac) C:\...\ExtractPDF>

Install Dependencies

Install pyPDF2 module in pyextrac virtual environment as well as Pandas.

Pip install pypdf2

Pip install pandas

(pyextrac) C:\...\ExtractPDF>pip install pypdf2
(pyextrac) C:\...\ExtractPDF>pip install pandas

Create PDF Extraction script

From Visual Studio Code’s (VS Code, VSCode) project explorer (View, Explorer), create a new Python file. Name it anything you like. For this example, I named mine pdfxlsx.py. Here is the code which I will describe following the script listing below.

import PyPDF2
import pandas

reader = PyPDF2.PdfFileReader('Countries.pdf')

print(reader.documentInfo)

numOfPages = reader.numPages

# test to see if file can be opened and read
print('Number of pages: ' + str(numOfPages))

As usual, you will need to import the library that we need. In this example, it is PyPDF2 module that we installed in the previous step.

Then define a PDF file reader using the PdfFileReader function in the PyPDF2 library and provide the name of the PDF file to read. In my example, I am using a countries.pdf file which contains a list of countries and population from Wikipedia, but you are free to use whichever PDF you like.

Next, I will print out the document information as a test to ensure that we can read the file and well as get the number of pages in the file using the numPages property.

Here are the results of these two operations:

(pyextrac) C:\...\ExtractPDF>python pdf2xl.py

{'/Author': 'kevin', '/CreationDate': "D:20210102100909-05'00'", '/Creator': 'Microsoft® Excel® for Office 365', '/ModDate': "D:20210102100929-05'00'", '/Producer': 'Microsoft® Excel® for Office 365'}

Number of pages: 5

Ok now that we can read the file, the next step is to extract the contents and copy them to Excel. First, we will need to loop over the pages, extracting and storing the string data into a list as it has a handy append function. The pdfContent list will contain the extracted text. Using the for loop, we will iterate through the pages using the numOfPages as the index. For each page, we will extract the text using the extractText function and append the data to the pdfContent list variable.

pdfContent = []

for pageNbr in range(numOfPages):
    content = pdfReader.getPage(pageNbr).extractText()
    pdfContent.append(content)

Here is a sample of the output. While the output is one long string of text, you will notice that at the end of each “row”, there is a new line character “\n ” that we will use to convert the text into a real list. The space following newline character is very important since the value in the original table is also separated with a “\n” with no space.

PDF Output

PDF Output

The following line of code will take the first element of the pdfContent list [0] since there is only one element in the list and we will split into actual list items using the split(‘\n ’) function from the string module.

print(pdfContent[0].split('\n '))

Sample output of the list data


[' China[a]\nAsia\nEastern Asia\n1,427,647,786\n1,433,783,686\n0.43%', 'India\nAsia\nSouthern Asia\n1,352,642,280\n1,366,417,754\n1.02%', 'United States\nAmericas\nNorthern America\n327,096,265\n329,064,917\n0.60%', 'Indonesia\nAsia\nSouth-eastern Asia\n267,670,543\n270,625,568\n1.10%', 'Pakistan\nAsia\nSouthern Asia\n212,228,286\n216,565,318\n2.04%', 'Brazil\nAmericas\nSouth America\n209,469,323\n211,049,527\n0.75%', 'Nigeria\nAfrica\nWestern Africa\n195,874,683\n200,963,599\n2.60%', 'Bangladesh\nAsia\nSouthern Asia\n161,376,708\n163,046,161\n1.03%', 'Russia\nEurope\nEastern Europe\n145,734,038\n145,872,256\n0.09%', 'Mexico\nAmericas\nCentral America\n126,190,788\n127,575,529\n1.10%', 'Japan\nAsia\nEastern Asia\n127,202,192\n126,860,301\n', 'Ethiopia\nAfrica\nEastern Africa\n109,224,414\n112,078,730\n2.61%', 'Philippines\nAsia\nSouth-eastern Asia\n106,651,394\n108,116,615\n1.37%', ….

Notice the two versions of the “\n” and “\n “ that are separating the values in the list and each element of the list row. While, this is much better, we cannot copy this data into Excel as it will be still a mess. So, we will use the Pandas module which has a very powerful DataFrame object that will take list as it source and create a nth dimensional array called a DataFrame, which is essentially a table.

Here is the code for the DataFrame:

# create pandas dataframe from list
countries = pandas.DataFrame(pdfContent[0].split('\n '))

print(countries)

Here are is snippet of the results:

0

0 China[a]\nAsia\nEastern Asia\n1,427,647,786\n...

1 India\nAsia\nSouthern Asia\n1,352,642,280\n1,3...

2 United States\nAmericas\nNorthern America\n327...

3 Indonesia\nAsia\nSouth-eastern Asia\n267,670,5...

Pandas automatically includes a numerical index. We still have cleaning up to do before importing or copying the data into Excel, namely, to replace the comma in the population values and replace “\n” by a comma. Also, “0” as a column name is not very helpful and will generate errors when we try to reference the column name in Pandas.

First, rename the column to “Country” and reassign to DatraFrame:

countries = countries.rename(columns={0: "Country"})

Second, replace the comma in the population values:

countries = pandas.DataFrame(countries.Country.str.replace(',','', regex=True))

Notice how I wrap the replaced returned values in a new pandas.DataFrame. This is because the conversion creates a pandas Series instead. We need a DataFrame to perform the second replace operation.

Third, replace the “\n” with a comma and return a new DataFrame:

countries = pandas.DataFrame(countries.Country.str.replace('\n',',', regex=True))

Sample output:

Country

0 China[a],Asia,Eastern Asia,1427647786,1433783...

1 India,Asia,Southern Asia,1352642280,1366417754...

2 United States,Americas,Northern America,327096...

3 Indonesia,Asia,South-eastern Asia,267670543,27...

4 Pakistan,Asia,Southern Asia,212228286,21656531...

5 Brazil,Americas,South America,209469323,211049...

6 Nigeria,Africa,Western Africa,195874683,200963...


Here is the complete code, so far:

import PyPDF2
import pandas

pdfReader = PyPDF2.PdfFileReader('Countries.pdf')

print(pdfReader.documentInfo)

numOfPages = pdfReader.numPages

# test to see if file can be opened and read
print('Number of pages: ' + str(numOfPages))

pdfContent = []
strList = []
for pageNbr in range(numOfPages):
    content = pdfReader.getPage(pageNbr).extractText()
    pdfContent.append(content)

print(pdfContent[0].split('\n '))

# test the list
print(type(pdfContent[0].split('\n ')))
#strList = pdfContent[0].split('\n ')
#print(strList[2])

# create pandas dataframe from list
countries = pandas.DataFrame(pdfContent[0].split('\n '))
countries = countries.rename(columns={0: "Country"})
countries = pandas.DataFrame(countries.Country.str.replace(',','', regex=True))
countries = pandas.DataFrame(countries.Country.str.replace('\n',',', regex=True))

print(countries)

Finally, we have to convert the rows into a list and create columns

countries = pandas.DataFrame(countries['Country'].values.tolist(), index=countries.index)

Result:

Pandas DataFrame Output

Pandas DataFrame Output

Save to CSV and Excel

Ok, now we have our data that we can copy to Excel. Now, we have two options to move the data into Excel: save the Countries DataFrame to a csv file which can be opened by Excel or save the data directly into a xlsx file. Both can be handled by Pandas easily.

I will the csv code first as follows:

countries.to_csv("countries.csv", index=False, sep=",")
Sample csv output

Sample csv output

And for Excel:

#export to Excel
countries.to_excel("countries.xlsx", index=False) 
Sample Excel output

Sample Excel output

Here is the complete code which is also available on GitHub here

import PyPDF2
import pandas

pdfReader = PyPDF2.PdfFileReader('Countries.pdf')

print(pdfReader.documentInfo)

numOfPages = pdfReader.numPages

# test to see if file can be opened and read
print('Number of pages: ' + str(numOfPages))

pdfContent = []
strList = []
for pageNbr in range(numOfPages):
    content = pdfReader.getPage(pageNbr).extractText()
    pdfContent.append(content)

print(pdfContent[0].split('\n '))

# test the list
print(type(pdfContent[0].split('\n ')))
#strList = pdfContent[0].split('\n ')
#print(strList[2])

# create pandas dataframe from list
countries = pandas.DataFrame(pdfContent[0].split('\n '))
countries = countries.rename(columns={0: "Country"})
countries = pandas.DataFrame(countries.Country.str.replace(',','', regex=True))
countries = pandas.DataFrame(countries.Country.str.replace('\n',',', regex=True).str.split(","))


countries = pandas.DataFrame(countries['Country'].values.tolist(), index=countries.index)

print(countries)
# export DataFrame to csv file

countries.to_csv("countries.csv", index=False, sep=",")

#export to Excel
countries.to_excel("countries.xlsx", index=False) 

Conclusion

So, in this tutorial we extracted data from a PDF and stored the contents in Excel using PyPDF2 and Pandas to clean up the data and save it to a csv and an Excel file. You can explore other Python Excel modules through this article: https://owlcation.com/stem/8-Ways-to-Use-Python-with-Excel.

Another option to try would be to use Python as the “glue” instead of VBA, or Java or C# as they have libraries to handle this type of data extraction.

This article is accurate and true to the best of the author’s knowledge. Content is for informational or entertainment purposes only and does not substitute for personal counsel or professional advice in business, financial, legal, or technical matters.

© 2021 Kevin Languedoc