Enhancing Data Extraction: RAG with PDF and Chart Images Using GPT-4o

Umesh Pawar

Have you ever faced the challenge of extracting information from a PDF full of text and graph images?

While extracting plain text from a PDF is relatively straightforward, understanding and extracting meaningful data from graphs, and then chatting with that data, can be a challenging task.

In this blog post, I will explore a solution to this problem using Python and OpenAI’s GPT-4o model.

As an example, I am using the Bank of England PDF "Outlier or laggard: divergence and convergence in the UK's recent inflation performance — speech by Dave Ramsden". Below is a sample page from the document.

If you add these PDF files to the "chat with your data" feature of Azure OpenAI and ask questions about the content, you will not get the information contained in the charts and graphs.

After following the steps below and creating a new index on the data produced by this process, you can chat with the data contained in the graphs.

For example, suppose I want to know: "What was the inflation rate in the United States (or any other specific country/region) in October 2022?"

In response, you will get the inflation rates presented in the graphs, and the citations will also present the information in tabular format.

Step 1: Converting PDF Pages to Images

The first step is to convert the pages of the PDF into images. We'll use the pdf2image library, which makes it easy to transform each page of a PDF into a separate image file (note that pdf2image depends on the Poppler utilities being installed). This is a crucial step in preparing the document for further processing.

This function takes the path of a PDF file, converts each page to a JPEG image, and saves these images in a directory named after the PDF file.

from pdf2image import convert_from_path
import os

def convert_pdf_to_images(pdf_path):
    """
    Converts each page of a PDF into JPEG images and saves them in a directory named after the PDF file.

    Args:
    - pdf_path (str): Path to the PDF file.

    Returns:
    - list: List of image file paths saved.
    """
    # Create a directory based on the PDF filename
    pdf_name = os.path.splitext(os.path.basename(pdf_path))[0]
    output_dir = f"{pdf_name}_images"
    os.makedirs(output_dir, exist_ok=True)

    # Convert each page of the PDF into images
    images = convert_from_path(pdf_path)
    saved_image_paths = []
    for i, img in enumerate(images):
        image_path = os.path.join(output_dir, f'page{i}.jpg')
        img.save(image_path, 'JPEG')
        saved_image_paths.append(image_path)

    return saved_image_paths

# Example usage:
pdf_path = "path_to_your_pdf.pdf"
saved_paths = convert_pdf_to_images(pdf_path)
print("Images saved to:", saved_paths)
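One small caveat: the loop saves pages as page0.jpg, page1.jpg, and so on, so for documents longer than nine pages a plain lexicographic sort puts page10.jpg before page2.jpg. Zero-padding the page number keeps filenames in page order; a minimal sketch:

```python
# Zero-padded page filenames sort correctly even under a plain
# lexicographic sort; unpadded ones do not once you pass page 9.
plain = [f"page{i}.jpg" for i in range(12)]
padded = [f"page{i:03d}.jpg" for i in range(12)]

print(sorted(plain)[:3])   # ['page0.jpg', 'page1.jpg', 'page10.jpg']
print(sorted(padded)[:3])  # ['page000.jpg', 'page001.jpg', 'page002.jpg']
```

If you adopt padded names here, use the same pattern anywhere you later list or merge the per-page outputs.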

Step 2: Extracting Text and Graphical Information Using GPT-4o

With the PDF pages now converted to images, the next challenge is to extract both the text and the data represented in graphs. Using GPT-4o's vision capabilities, we can extract not just the text but also interpret the graphical data, converting it into a tabular format for easier understanding and reproduction.

This function processes an image with GPT-4o and saves the response to a text file. The model is instructed to extract both the text and the data from any charts present in the image, presenting the latter in tabular format.

import os
import requests
import base64

def process_image_and_save_text(image_path, api_key, output_dir=None):
    """
    Process an image using the GPT-4o model and save the response to a text file.

    Args:
    - image_path (str): Path to the input image file.
    - api_key (str): API key for accessing the GPT-4o model.
    - output_dir (str, optional): Directory where the output text file will be saved. Defaults to the same directory as the input image.

    Returns:
    - str: Path to the saved text file.
    """
    # Read and encode the image file
    with open(image_path, 'rb') as img_file:
        encoded_image = base64.b64encode(img_file.read()).decode('ascii')

    # Request headers
    headers = {
        "Content-Type": "application/json",
        "api-key": api_key,
    }

    # Payload for the request
    payload = {
        "messages": [
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": "From the above image extract the text as is and export the information from chart into tabular format so that one can understand meaning of chart and can reproduce this chart from tabular data and insert the tabular information in the same place where chart is present in the image"
                    }
                ]
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{encoded_image}"
                        }
                    }
                ]
            },
        ],
        "temperature": 0.7,
        "top_p": 0.95,
        "max_tokens": 800
    }

    # GPT-4o endpoint
    GPT4V_ENDPOINT = "https://<CHANGE_ME name of openai model deployment>.openai.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2024-02-15-preview"

    # Send request
    try:
        response = requests.post(GPT4V_ENDPOINT, headers=headers, json=payload)
        response.raise_for_status()  # Raises an HTTPError if the request returned an unsuccessful status code
    except requests.RequestException as e:
        raise SystemExit(f"Failed to make the request. Error: {e}")

    # Handle the response
    response_data = response.json()

    # Determine output directory and filename
    if output_dir is None:
        output_dir = os.path.dirname(image_path)
    else:
        os.makedirs(output_dir, exist_ok=True)

    image_filename = os.path.basename(image_path)
    text_filename = os.path.splitext(image_filename)[0] + "_text.txt"
    text_filepath = os.path.join(output_dir, text_filename)

    # Save the model's reply to a text file
    with open(text_filepath, 'w') as f:
        f.write(response_data['choices'][0]['message']['content'])

    print(f"Text response saved to: {text_filepath}")
    return text_filepath

# Example usage:
api_key = "YOUR_API_KEY"
image_path = "path_to_your_image.jpg"
output_directory = "output_directory"
r = process_image_and_save_text(image_path, api_key, output_directory)
print(r)
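The saved text files typically contain the chart data as a Markdown pipe table, since that is what the prompt asks for. If you later want that data programmatically (for example, to rebuild a chart), a small parser helps. This is a minimal sketch, assuming the model followed the prompt and emitted a standard pipe table with a separator row; real responses may need more robust handling, and the sample figures below are only illustrative:

```python
def parse_markdown_table(text):
    """Parse the first Markdown pipe table in `text` into (header, data rows)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip().startswith("|")]
    # Drop the separator row (e.g. | --- | --- |), which contains only -, : and spaces
    rows = [ln for ln in lines if not set(ln.replace("|", "").strip()) <= set("-: ")]
    parsed = [[cell.strip() for cell in ln.strip("|").split("|")] for ln in rows]
    if not parsed:
        return [], []
    return parsed[0], parsed[1:]

# Example with the kind of table GPT-4o tends to produce:
sample = """
| Month    | US CPI inflation (%) |
| -------- | -------------------- |
| Sep 2022 | 8.2                  |
| Oct 2022 | 7.7                  |
"""
header, rows = parse_markdown_table(sample)
print(header)  # ['Month', 'US CPI inflation (%)']
print(rows)    # [['Sep 2022', '8.2'], ['Oct 2022', '7.7']]
```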

Automating the Process for Multiple Images

To handle multiple images efficiently, we can automate the process by iterating through a directory containing the images. This script will process each image and save the extracted information accordingly.

This script iterates over each image file in a specified directory, processes it using the process_image_and_save_text function, and saves the extracted text and tabular data from graphs.

import os

# Directory path containing images
directory = "path_to_your_image_directory"
output_directory = "output_directory"
api_key = "YOUR_API_KEY"

# Iterate over each image file in the directory
for filename in os.listdir(directory):
    if filename.endswith((".jpg", ".png")):  # Adjust based on your image file extensions
        image_path = os.path.join(directory, filename)
        process_image_and_save_text(image_path, api_key, output_directory)
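Before indexing the output in Azure AI Search, it can help to merge the per-page text files into one document per PDF. A minimal sketch, assuming the page*_text.txt naming produced above; note that os.listdir order is arbitrary, so the pages are sorted by their numeric page index:

```python
import os
import re

def merge_page_texts(text_dir, merged_path):
    """Concatenate page*_text.txt files in numeric page order into one file."""
    def page_number(name):
        m = re.search(r"page(\d+)_text\.txt$", name)
        return int(m.group(1)) if m else -1

    # Sort numerically so page10 comes after page2, not before it
    files = sorted(
        (f for f in os.listdir(text_dir) if f.endswith("_text.txt")),
        key=page_number,
    )
    with open(merged_path, "w") as out:
        for name in files:
            with open(os.path.join(text_dir, name)) as f:
                out.write(f.read().strip() + "\n\n")
    return merged_path
```

The merged file can then be uploaded to the data source backing your index, so each PDF becomes a single searchable document rather than a scatter of per-page fragments.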

Conclusion

By following these steps, you can convert a PDF document filled with text and graphs into images, extract both the textual and graphical information with GPT-4o, and save the data in a structured format. This approach not only simplifies the digitization of printed documents but also makes it easier to understand and reproduce graphical data, which makes it a powerful tool for data extraction and analysis.

Next, I am working on a more scalable option: implementing custom skills in Azure AI Search. Stay tuned!

Note: These are my personal views and I am happy to receive feedback.
