Extract Data from PDF with Python | A Case Study in Automation

You’re facing a new, exciting project. You need to build a custom tool for a client based on a solid database. There’s just one problem: all the knowledge, all the “gold” meant to populate this database, is locked away in hundreds of pages of PDF files.

This was my exact starting point. I stood at a crossroads. I could have taken the shortcut: open each PDF and manually copy hundreds of questions and answers. A feasible task, but painfully tedious, slow, and prone to error. I knew it was a trap, especially since the client planned to send more files in the future.

I chose the other path, the one I love. Instead of spending hours clicking, I decided to invest that time in writing a simple but powerful tool—a Python converter that would do all the work for me.

When the Devil is in the Formatting

The first step—extracting raw text from the PDF files using the pdfplumber library—went smoothly. The real challenge began afterward. It turned out that each PDF, though seemingly identical, had its own little “quirks”: an extra space here, a different newline character there. These minor anomalies caused simple methods of splitting questions from answers to fail. The text was a chaotic mix that required an intelligent approach.
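A minimal sketch of what this first step could look like, assuming the quirks are mostly inconsistent line endings and stray whitespace. The function names `extract_raw_text` and `normalize_text` are illustrative and do not come from the original script; only `pdfplumber.open`, `pdf.pages`, and `page.extract_text()` are real library calls.

```python
import re


def extract_raw_text(pdf_path):
    """Return the text of every page in a PDF, joined with newlines."""
    # Third-party dependency, imported here so normalize_text works without it
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() may yield nothing for a page without a text layer
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def normalize_text(raw):
    """Smooth over whitespace quirks so later pattern matching sees uniform input."""
    text = raw.replace("\r\n", "\n").replace("\r", "\n")  # unify line endings
    text = text.replace("\u00a0", " ")                    # non-breaking spaces
    text = re.sub(r"[ \t]+", " ", text)                   # collapse runs of spaces and tabs
    text = re.sub(r" *\n", "\n", text)                    # drop trailing spaces on each line
    return text.strip()
```

Normalizing once, right after extraction, means the patterns applied later do not each have to account for every whitespace variant.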

My Weapon of Choice: Regular Expressions (RegEx)

In the fight against unstructured text, my secret weapon is regular expressions. It’s like giving a computer the superpower of understanding patterns in text. Instead of telling it to “find the letter A,” I tell it: “find me a line that starts with the letter A, which might be followed by an asterisk, then any amount of whitespace, and then capture everything until the end of the line.”
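The pattern described in words above can be written roughly like this. It is a sketch on made-up input, and the names `ANSWER_LINE` and `find_answer` are hypothetical.

```python
import re

# A line that starts with "A", an optional asterisk, optional spaces or tabs,
# and then everything up to the end of the line as the captured answer.
ANSWER_LINE = re.compile(r"^A[ \t]*(\*?)[ \t]*(.+)$", re.MULTILINE)


def find_answer(text):
    """Return (marker, answer) for the first matching line, or None."""
    match = ANSWER_LINE.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2).strip()


sample = "What is 2 + 2?\nA* 4\nB 5\n"  # made-up input
print(find_answer(sample))
```

Note that a pattern this loose also matches any line that merely begins with a capital A, so real data may need a stricter delimiter after the letter.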

Thanks to Python’s built-in re module, I created precise patterns that could “extract” exactly the data I needed from the chaos. The snippet below is the heart of my automation: the logic that analyzes a block of text, identifies the question and the correct answer, and then assembles them into a structured format.


# A snippet of code that extracts a question and answer from a block of text
import re

# `blocks` comes from an earlier splitting step (not shown here)
answer_pattern = re.compile(r'^A\s*(\*?)\s*(.+)', re.MULTILINE)
questions = []

# Stop one short of the end so blocks[i + 1] always exists
for i in range(1, len(blocks) - 1, 2):
    content = blocks[i + 1].strip()

    # Using RegEx to find the correct answer (a line starting with 'A')
    match = answer_pattern.search(content)
    if match:
        # The answer is the captured text; the question is everything before the match
        correct_answer = match.group(2).strip()
        question_text = content[:match.start()].strip()

        # Appending to the final list
        questions.append({
            "questionText": question_text,
            "correctAnswerText": correct_answer
        })

Finally, the script saved all the data into a clean, structured JSON file, ready to populate the main database.
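The saving step could be sketched as follows. The key names match the snippet above, while the helper names and the output path are assumptions made for illustration.

```python
import json
from pathlib import Path


def questions_to_json(questions):
    """Serialize the list of question dicts as readable JSON text."""
    # ensure_ascii=False keeps accented characters readable in the file
    return json.dumps(questions, ensure_ascii=False, indent=2)


def save_questions(questions, out_path):
    """Write the serialized questions to a UTF-8 encoded file."""
    Path(out_path).write_text(questions_to_json(questions), encoding="utf-8")
```

For example, `save_questions(questions, "questions.json")` would write the list built above to a file in the current directory.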

The Result? Hours Turned into Seconds

The result of this work was transformative. Copying everything by hand would have taken me many days; the script does the same job in a few minutes. More importantly, I created a scalable solution. When the client sends a new batch of PDF files next month, updating the database means rerunning the script rather than starting the copying over.

I love projects like this because they show the true power of code. This wasn’t about building a complex system, but about creating a small, simple tool that solved one very annoying problem. Sometimes, a smart script like this brings more business value than the largest applications, freeing us from boring tasks and allowing us to focus on creative work.

Is there a process in your company where someone spends hours manually copying and pasting data from one place to another? Or are you drowning in documents from which data needs to be moved into a system?

I invite you for a free consultation. Let’s talk about how a small, clever automation could save you time and money.
