Convert PDF to Excel Using Python

Fahiz
3 min read · Sep 1, 2024


Introduction

Converting PDF files to Excel can be a daunting task, especially when dealing with complex tables and data structures. Fortunately, Python offers several libraries that make the process much more manageable, whether you’re dealing with straightforward tables or unstructured data. In this post, we’ll walk through three methods for converting PDF files to Excel using Python, each suited to a different type of PDF.

Method 1: Using tabula-py

tabula-py is a Python wrapper around Tabula that extracts tables from PDFs as pandas DataFrames, which can then be written to Excel. This method works best for PDFs that contain well-structured tables.

Step 1: Install the Required Libraries

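First, install tabula-py and pandas using pip, along with openpyxl, which pandas uses to write .xlsx files. tabula-py runs the Java tool Tabula behind the scenes, so a Java runtime must also be installed:

pip install tabula-py pandas openpyxl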
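Since tabula-py depends on Java, a missing Java runtime is a common reason for read_pdf to fail. A minimal sketch of a pre-flight check is shown here; the helper name java_available is hypothetical, and it only checks that a java executable is on the PATH, not which version it is.

```python
import shutil


def java_available() -> bool:
    """Return True if a `java` executable can be found on the PATH."""
    return shutil.which("java") is not None


if __name__ == "__main__":
    if java_available():
        print("Java found, tabula-py should be able to start.")
    else:
        print("Java not found: install a Java runtime before using tabula-py.")
```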

Step 2: Convert PDF to Excel

Here’s a basic script to convert a PDF to an Excel file:

import tabula
import pandas as pd

# Path to the PDF file
pdf_path = "your_file.pdf"

# Convert PDF to a list of DataFrames
dfs = tabula.read_pdf(pdf_path, pages='all', multiple_tables=True)

# Combine all tables into a single DataFrame (optional)
combined_df = pd.concat(dfs)

# Save the DataFrame to an Excel file
combined_df.to_excel("output_file.xlsx", index=False)

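This script extracts all the tables from the PDF, stacks them into one DataFrame, and saves the result as a single worksheet. Tables with different column layouts will not line up when stacked, and pd.concat raises an error if no tables were found.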
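When the tables in a PDF have different columns, it can be cleaner to keep each one on its own worksheet instead of combining them. The following is a minimal sketch of that idea using pandas' ExcelWriter; the function name write_tables_to_sheets and the "Table_N" sheet names are assumptions made for illustration, and it expects openpyxl to be installed.

```python
import pandas as pd


def write_tables_to_sheets(dfs, output_path):
    """Write each DataFrame to its own worksheet and return the sheet names."""
    if not dfs:
        raise ValueError("No tables to write")

    sheet_names = []
    with pd.ExcelWriter(output_path) as writer:
        for i, df in enumerate(dfs, start=1):
            name = f"Table_{i}"  # hypothetical naming scheme
            df.to_excel(writer, sheet_name=name, index=False)
            sheet_names.append(name)
    return sheet_names
```

The list returned by tabula.read_pdf(pdf_path, pages='all', multiple_tables=True) can be passed in directly as dfs.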

Method 2: Using camelot-py for More Complex Tables

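If your PDF contains tables with more complex layouts, such as ruled grids or cells that span several rows or columns, camelot-py might be the better option. It exposes more settings for tuning the extraction, but it only works on text-based PDFs, not scanned images.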

Step 1: Install the Required Libraries

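Install camelot-py and pandas, along with openpyxl for writing .xlsx files (the quotes keep shells such as zsh from interpreting the square brackets):

pip install "camelot-py[cv]" pandas openpyxl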

Step 2: Convert PDF to Excel

Here’s how you can use camelot-py to extract and convert tables:

import camelot
import pandas as pd

# Path to the PDF file
pdf_path = "your_file.pdf"

# Extract tables from the PDF
tables = camelot.read_pdf(pdf_path, pages='all')

# Combine tables into a single DataFrame (optional)
combined_df = pd.concat([table.df for table in tables])

# Save the DataFrame to an Excel file
# (camelot columns are just numbered 0, 1, 2, ..., so skip the header row)
combined_df.to_excel("output_file.xlsx", index=False, header=False)

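This method is handy for PDFs with more complex layouts. camelot-py’s default "lattice" mode targets tables with ruled cell borders, while passing flavor='stream' to read_pdf handles tables whose columns are separated only by whitespace.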
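To judge how well an extraction went, each camelot table carries a parsing_report dictionary that includes an accuracy score. A rough sketch of using it to drop poorly parsed tables follows; the helper names keep_accurate and extract_reliable_tables and the threshold of 90 are assumptions chosen for illustration, not recommended values.

```python
def keep_accurate(tables, min_accuracy=90.0):
    """Return the tables whose reported parsing accuracy meets the threshold."""
    return [t for t in tables if t.parsing_report["accuracy"] >= min_accuracy]


def extract_reliable_tables(pdf_path, flavor="lattice", min_accuracy=90.0):
    """Read tables with camelot and keep only the well-parsed ones."""
    # Imported here so keep_accurate can be used without camelot installed.
    import camelot

    tables = camelot.read_pdf(pdf_path, pages="all", flavor=flavor)
    return keep_accurate(tables, min_accuracy)
```

If the default mode returns few or no tables, calling extract_reliable_tables("your_file.pdf", flavor="stream") is worth a try before falling back to manual parsing.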

Method 3: Using PyPDF2 for Manual Table Extraction

For PDFs with unstructured or highly complex data, PyPDF2 can extract the raw text of each page, which you can then parse yourself and convert into an Excel file. It has no notion of tables, so the structure has to come from your own parsing code.

Step 1: Install the Required Libraries

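Install PyPDF2 and pandas, along with openpyxl for writing .xlsx files (PyPDF2 is no longer updated under that name; its development continues as pypdf):

pip install PyPDF2 pandas openpyxl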

Step 2: Manually Extract and Convert Data

Here’s a basic script to extract text and convert it to an Excel file:

import PyPDF2
import pandas as pd

# Path to the PDF file
pdf_path = "your_file.pdf"

# Create a PDF reader object
# (PdfFileReader, numPages and getPage were removed in PyPDF2 3.0)
pdf_reader = PyPDF2.PdfReader(pdf_path)

# Extract text from each page
text_data = []
for page in pdf_reader.pages:
    text_data.append(page.extract_text())

# Here you need to manually parse the extracted text
# This is a very basic example: one row per line, one cell per word
rows = []
for page_text in text_data:
    for line in page_text.splitlines():
        rows.append(line.split())

# Convert the parsed data to a DataFrame
df = pd.DataFrame(rows)

# Save to Excel
df.to_excel("output_file.xlsx", index=False)

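This approach gives you the flexibility to parse and structure the extracted text yourself, which suits PDFs whose data the table-detection libraries cannot pick up. The trade-off is that the parsing logic has to be written for each document layout.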
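Splitting on whitespace breaks as soon as a cell contains more than one word, so in practice the parsing step usually relies on a pattern that matches the document’s row layout. Here is a minimal sketch, assuming a hypothetical layout where every data row ends with an integer quantity and a price with two decimals; the pattern, the column names, and the function parse_rows are all illustrative and would need adapting to a real PDF.

```python
import re

# Hypothetical row layout: "<item name> <quantity> <price>", e.g. "Blue Widget 12 3.50"
ROW_PATTERN = re.compile(r"^(?P<item>.+?)\s+(?P<qty>\d+)\s+(?P<price>\d+\.\d{2})$")


def parse_rows(page_text):
    """Turn extracted page text into a list of row dicts, skipping non-matching lines."""
    rows = []
    for line in page_text.splitlines():
        match = ROW_PATTERN.match(line.strip())
        if match:
            rows.append(
                {
                    "item": match["item"],
                    "qty": int(match["qty"]),
                    "price": float(match["price"]),
                }
            )
    return rows
```

Lines that do not fit the pattern, such as headers or page footers, are silently dropped. The resulting list can be passed to pd.DataFrame(rows) and saved with to_excel as in the script above.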

Conclusion

Choosing the right method to convert a PDF to Excel depends largely on the structure of your PDF. tabula-py is great for simple, well-structured tables, camelot-py is better for more complex tables, and PyPDF2 allows for maximum flexibility when dealing with unstructured data. By leveraging these Python libraries, you can streamline converting PDFs to Excel, saving time and effort.

Try these methods in your next project, and see which works best for your needs!
