Kevin is a data engineer and advanced analytics developer. He has over 20 years experience in the field.
Open Visual Basic (VBA) editor from the Developer tab. The process is simple, Adobe PDF has an interface with Microsoft Word. So, you simply need to open the PDF in Word and copy the contents to a Microsoft Excel spreadsheet. The following tutorial will show you how in simple terms.
Double-click on sheet1 to open a new VBA page
Add the following code. A detailed explanation of the code follows the code listing below.
Important! You must save your Excel file as macro-enabled file (.xlsm) otherwise your code won’t be saved with your file
Sub is the command to define and create a subroutine in VBA (a variant of Basic and Visual Basic). ExtractPDFData is the user defined name of the subroutine. A subroutine in VBA is often called Macro or Method.
Dim is a command to declare and reserve a space in active memory. The size of the space is defined by the object
Dim oSheet As Worksheet ‘This is a Excel Worksheet
Dim oWordApp As Object 'This is Microsoft Word Application
Dim oWordDoc As Object 'This is Microsoft Word Document
Dim strPDFFilepath As String ‘This is the full path of the PDF file
Once the variables are defined, you will need to in initialize them using the SET command. The following lines of code adds some error handling using the on error resume next which means to move the cursor to the next instruction if the program detects a runtime error. If no error is detected, the script will attempt to define a Microsoft Word application using OLE (Object Linking and Embedding) using the GetObject method and assign the Word instance to the oWordApp variable that we defined in the variable section above.
On Error Resume Next ' if an error is generated, skip to next instruction
Set oWordApp = GetObject(, "Word.Application")
If Err Then
Set oWordApp = CreateObject("Word.Application")
If the script cannot create the Word automation application using the GetObject, it will throw or raise an error. If this happens, the script will try and create the OLE Word application using the CreateObject method instead. If both don’t work, then the script will disable error handlers and exit the script.
On Error GoTo 0 ' disable error handler and exit. The next line, strPDFFilepath = Cells(1, 1) will get the full path of the pdf file from the first cell of Sheet1 and store in the variable strPDFFilepath. With the file path string in hand, we can open the PDF file using Microsoft Word in the background, or in memory. This means you won’t actually see the Word interface. The code to open the PDF file is listed here:
Set oWordDoc = oWordApp.Documents.Open(Filename:=strPDFFilepath, confirmConversions:=False)
oWord.Documents.Open opens the PDF file assigned to the Filename parameter. The confirmConversions property is set to False to disable any popup warnings about the PDF being converted to Word. Finally, the contents are assigned to the oWordDoc variable of the Microsoft Word Document.
The next line, oWordDoc.Content.Copy, simply copies the contents into memory (clipboard). Once the contents are copied, we simply need to create a new Microsoft Excel Worksheet and paste the contents of the clipboard.
First, we need to create and initialize the Worksheet using the following command which assigns to new Worksheet to the oSheet variable
Set oSheet = ActiveWorkbook.Worksheets.Add
Then we use the With, End With commands to select the first cell on the new Worksheet using the Range command, Range*(“A1”) and then perform a PasteSpecial command as Text
Finally, we will clean up our memory by closing the Word (PDF) file and releasing the memory that was allocated when we created our objects. The oWord.Close Savechanges:=0 closes the Word file without saving any changes. The strPDFFilepath is set to an empty string. The oWordApp.Quit SaveChanges:=0 command closes the Word application without saving any changes, then we set the oWordApp to nothing to release any memory.
strPDFFilepath = ""
Set oWordApp = Nothing
With the coding done, we to wrap things up by running the macro and checking to make sure everything works as designed and the contents of the PDF will be copied to Sheet2. From the Developer tab, click on the Macro menu option, select the Sheet1.ExtractPDFData subroutine and click run.
Here are is snippet of the results
So, in this tutorial we extracted data from a PDF and stored the contents in Excel. We used Microsoft Word as the “glue” between the Adobe Acrobat PDF file and Microsoft Excel. We did the extraction using VBA which is included with Excel. A final note, I have tested the script on various PDF file sizes and performance degrades considerably the larger the PDF file.
Another option to try would be to use Python as the “glue” instead of VBA, or Java or C# as they have libraries to handle this type of data extraction.
This article is accurate and true to the best of the author’s knowledge. Content is for informational or entertainment purposes only and does not substitute for personal counsel or professional advice in business, financial, legal, or technical matters.
© 2021 Kevin Languedoc