Updated date:

How to Extract PDF Files to Excel using VBA

Kevin is a data engineer and advanced analytics developer. He has over 20 years experience in the field.

Open Visual Basic (VBA) editor from the Developer tab. The process is simple, Adobe PDF has an interface with Microsoft Word. So, you simply need to open the PDF in Word and copy the contents to a Microsoft Excel spreadsheet. The following tutorial will show you how in simple terms.

Visual Basic in Developer tab

Visual Basic in Developer tab

Create Script

Double-click on sheet1 to open a new VBA page

Add the following code. A detailed explanation of the code follows the code listing below.

Sub ExtractPDFData()
 
	 Dim oSheet As Worksheet
	 Dim oWordApp As Object ' this is Microsoft Word Application
	 Dim oWordDoc As Object ' this is Microsoft Word Document
	 Dim oWinShell As Object ' this is the Microsoft wscript engine
	 Dim strPDFFilepath As String
 
	 On Error Resume Next ' if an error is generated, sckip to next instruction
	 Set oWordApp = GetObject(, "Word.Application")
	 If Err Then
	    Set oWordApp = CreateObject("Word.Application")
	 End If
	 
	 On Error GoTo 0 ' first line
	 strPDFFilepath = Cells(1, 1) 'put the full path to pdf file in first cell of Sheet1
	 ""
	    'open pdf in Word document if not protected
	    Set oWordDoc = oWordApp.Documents.Open(Filename:=strPDFFilepath, confirmConversions:=False)
	    oWordDoc.Content.Copy
	    
	    'Paste content in Excel Worksheet
	    Set oSheet = ActiveWorkbook.Worksheets.Add
	    With oSheet
	      .Range("A1").Select
	      .PasteSpecial Format:="Text"
	    End With
	    oWordDoc.Close SaveChanges:=0
	    strPDFFilepath = ""
 
  
	  'Close Word
	  oWordApp.Quit SaveChanges:=0
	  Set oWordApp = Nothing
	  Set oWinShell = Nothing
End Sub

Important! You must save your Excel file as macro-enabled file (.xlsm) otherwise your code won’t be saved with your file

Sub ExtractPDFData()

Sub ExtractPDFData()

Sub is the command to define and create a subroutine in VBA (a variant of Basic and Visual Basic). ExtractPDFData is the user defined name of the subroutine. A subroutine in VBA is often called Macro or Method.

Dim is a command to declare and reserve a space in active memory. The size of the space is defined by the object

Dim oSheet As Worksheet ‘This is a Excel Worksheet

Dim oWordApp As Object 'This is Microsoft Word Application

Dim oWordDoc As Object 'This is Microsoft Word Document

Dim strPDFFilepath As String ‘This is the full path of the PDF file

Once the variables are defined, you will need to in initialize them using the SET command. The following lines of code adds some error handling using the on error resume next which means to move the cursor to the next instruction if the program detects a runtime error. If no error is detected, the script will attempt to define a Microsoft Word application using OLE (Object Linking and Embedding) using the GetObject method and assign the Word instance to the oWordApp variable that we defined in the variable section above.

On Error Resume Next ' if an error is generated, skip to next instruction

Set oWordApp = GetObject(, "Word.Application")

If Err Then

Set oWordApp = CreateObject("Word.Application")

End If

If the script cannot create the Word automation application using the GetObject, it will throw or raise an error. If this happens, the script will try and create the OLE Word application using the CreateObject method instead. If both don’t work, then the script will disable error handlers and exit the script.

On Error GoTo 0 ' disable error handler and exit. The next line, strPDFFilepath = Cells(1, 1) will get the full path of the pdf file from the first cell of Sheet1 and store in the variable strPDFFilepath. With the file path string in hand, we can open the PDF file using Microsoft Word in the background, or in memory. This means you won’t actually see the Word interface. The code to open the PDF file is listed here:

Set oWordDoc = oWordApp.Documents.Open(Filename:=strPDFFilepath, confirmConversions:=False)

oWord.Documents.Open opens the PDF file assigned to the Filename parameter. The confirmConversions property is set to False to disable any popup warnings about the PDF being converted to Word. Finally, the contents are assigned to the oWordDoc variable of the Microsoft Word Document.

The next line, oWordDoc.Content.Copy, simply copies the contents into memory (clipboard). Once the contents are copied, we simply need to create a new Microsoft Excel Worksheet and paste the contents of the clipboard.

First, we need to create and initialize the Worksheet using the following command which assigns to new Worksheet to the oSheet variable

Set oSheet = ActiveWorkbook.Worksheets.Add

Then we use the With, End With commands to select the first cell on the new Worksheet using the Range command, Range*(“A1”) and then perform a PasteSpecial command as Text

With oSheet

.Range("A1").Select

.PasteSpecial Format:="Text"

End With

Finally, we will clean up our memory by closing the Word (PDF) file and releasing the memory that was allocated when we created our objects. The oWord.Close Savechanges:=0 closes the Word file without saving any changes. The strPDFFilepath is set to an empty string. The oWordApp.Quit SaveChanges:=0 command closes the Word application without saving any changes, then we set the oWordApp to nothing to release any memory.

oWordDoc.Close SaveChanges:=0

strPDFFilepath = ""

oWordApp.Quit SaveChanges:=0

Set oWordApp = Nothing

With the coding done, we to wrap things up by running the macro and checking to make sure everything works as designed and the contents of the PDF will be copied to Sheet2. From the Developer tab, click on the Macro menu option, select the Sheet1.ExtractPDFData subroutine and click run.

Run macro

Run macro

Results

Here are is snippet of the results

Output to Excel

Output to Excel

Conclusion

So, in this tutorial we extracted data from a PDF and stored the contents in Excel. We used Microsoft Word as the “glue” between the Adobe Acrobat PDF file and Microsoft Excel. We did the extraction using VBA which is included with Excel. A final note, I have tested the script on various PDF file sizes and performance degrades considerably the larger the PDF file.

Another option to try would be to use Python as the “glue” instead of VBA, or Java or C# as they have libraries to handle this type of data extraction.

This article is accurate and true to the best of the author’s knowledge. Content is for informational or entertainment purposes only and does not substitute for personal counsel or professional advice in business, financial, legal, or technical matters.

© 2021 Kevin Languedoc