Bits and Bytes of DataScience – invoice

Hands-On: Extracting Key-Value Pairs from Invoice Images using OCR and Gradio

Introduction:

Have you ever been in a situation where you have a scanned copy of an invoice, but you need to extract certain key information from it? It can be a tedious task to manually go through the invoice and extract the information you need. That’s where the Invoice Recognizer comes in handy. In this blog post, we will be discussing the implementation of the Invoice Recognizer using Python.

Objective:

The objective of this project is to extract key-value pairs from an invoice image using the Tesseract OCR engine and regular expressions. We will be using Gradio to build a user interface for this project.

Implementation:

To implement the Invoice Recognizer, we will be using the following libraries:

Gradio: For building the user interface.
pytesseract: For OCR (Optical Character Recognition).
PIL (Python Imaging Library, provided by the Pillow package): For working with images.
pandas: For displaying the output in a tabular format.

Background Information:

Optical Character Recognition (OCR): Optical Character Recognition, or OCR, is the process of converting scanned images of text into machine-encoded text that can be searched, indexed, and manipulated by a computer. It is used in many applications, such as digitizing books and documents, recognizing license plates, and scanning receipts.

Tesseract OCR: Tesseract OCR is a widely used open-source OCR engine. It was originally developed at Hewlett-Packard, and its development was later sponsored by Google. It is known for its accuracy and its support for a wide range of languages. Tesseract itself is a C++ library and command-line tool; wrappers such as pytesseract make it available from Python.

Gradio: Gradio is a Python library that allows developers to quickly and easily create customizable UI components for their machine learning models. With Gradio, developers can create web interfaces for their models without needing to write any front-end code. Gradio serves the interface from a local web server that it starts for you, which makes it easy to add to an existing Python project.

Let’s start by installing the required libraries:

!pip install gradio
!apt-get update
!apt-get install tesseract-ocr
!apt-get install libtesseract-dev
!pip install pytesseract



Once we have installed the required libraries, let’s import them into our Python script and tell pytesseract where the Tesseract binary lives:

import gradio as gr
import re
import pytesseract
from PIL import Image
import pandas as pd

pytesseract.pytesseract.tesseract_cmd = r'/usr/bin/tesseract'

We have imported the necessary libraries and specified the path to the Tesseract OCR engine. The next step is to write the extract_kvp function, which takes the OCR text and a list of compiled regular expressions that mark the start of each key in the invoice.
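Before looking at the extraction function, it may help to see how anchored patterns behave on individual lines. The following is only an illustrative sketch: the sample lines are made up, the matched_key helper is hypothetical, and the pattern list mirrors a few of the default keys used later in this post.

```python
import re

# Patterns anchored with ^ so a key only counts at the start of a line.
key_regexes = [
    re.compile(r"^BILLTO"),
    re.compile(r"^Invoice #"),
    re.compile(r"^Amount Due \(USD\)"),  # literal parentheses must be escaped
]

# Made-up lines standing in for OCR output.
sample_lines = [
    "BILLTO",
    "Invoice # 0042",
    "Please quote Invoice # 0042 when paying",
    "Amount Due (USD): $1,200.00",
]

def matched_key(line):
    """Return the key text found at the start of the line, or None."""
    for regex in key_regexes:
        match = regex.match(line)
        if match:
            return match.group(0)
    return None

for line in sample_lines:
    print(repr(line), "->", matched_key(line))
```

Because every pattern starts with ^, the third sample line is not treated as a key even though it contains "Invoice #" further along.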

def extract_kvp(text, key_regexes):
    # key_regexes is a list of compiled patterns, for example:
    #   re.compile(r"^BILLTO"), re.compile(r"^Invoice #"),
    #   re.compile(r"^Task"), re.compile(r"^Item"), re.compile(r"^Rate"),
    #   re.compile(r"^Amount Due \(USD\)")

    # Split the text into lines
    lines = text.split("\n")

    # Iterate over the lines and extract the key-value pairs
    kvp = {}
    key = None
    for line in lines:
        for regex in key_regexes:
            match = regex.match(line)
            if match:
                # Use the matched text as the key so that regex escapes
                # such as \( and \) do not leak into the output
                key = match.group(0).strip()
                kvp[key] = ""
                # Keep only what follows the key and its optional colon
                line = line[match.end():].lstrip(" :")
                break
        # Lines seen before the first key are ignored
        if key and line.strip():
            kvp[key] += line.strip() + "\n"

    # Clean up the collected values
    for key in list(kvp):
        value = kvp[key].strip()
        if key.startswith("Amount Due"):
            # The amount sits on the key's own line, e.g. "$1,200.00"
            value = value.split("\n")[0].split("$")[-1].strip()
        kvp[key] = value.replace(",", "")
    return kvp

The function first splits the text into lines using the split method. It then iterates over the lines and matches each line against the regular expressions in key_regexes. When a match is found, it sets key to the corresponding key and initializes an empty string for the value in the kvp dictionary. It then continues to append the subsequent lines to the value until it finds the next key.

The function also includes a special case for the “Amount Due (USD)” key, since the value for this key is not on a separate line. It extracts the value using string manipulation and adds it to the kvp dictionary.

Finally, the function removes any commas from the value and strips the key and its trailing colon.
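The string manipulation used for the amount can be tried on its own. The sketch below assumes a line shaped like "Amount Due (USD): $1,200.00"; the parse_amount helper is hypothetical and is not part of the application code.

```python
def parse_amount(line):
    """Pull the numeric part out of a line such as 'Amount Due (USD): $1,200.00'."""
    # Take what follows the last colon, then what follows the last dollar sign.
    raw = line.split(":")[-1].strip().split("$")[-1].strip()
    # Drop thousands separators so the result can be passed to float() or Decimal().
    return raw.replace(",", "")

print(parse_amount("Amount Due (USD): $1,200.00"))  # 1200.00
print(float(parse_amount("Amount Due (USD): $1,200.00")))  # 1200.0
```

OCR output is not always clean, so a real invoice may need extra handling, for instance when the currency symbol is misread as a letter.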

format_table: The format_table function takes in a dictionary of key-value pairs and formats it as an HTML table using the pandas library.

def format_table(output):
    # Strip the keys and join multi-line values with ", "
    kvp = {key.strip(): value.replace("\n", ", ") for key, value in output.items()}
    # One row per key-value pair, rendered as an HTML table
    df = pd.DataFrame(list(kvp.items()), columns=["Key", "Value"])
    return df.to_html(index=False, justify='left')

The function first creates a new dictionary (kvp) with stripped keys and with the newlines in each value replaced by a comma and a space. It then builds a pandas DataFrame with one row per key-value pair and formats it as an HTML table using the to_html method. It returns the formatted table as a string.
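The data preparation step can be seen without pandas. This is a small sketch on a made-up dictionary; it shows only the cleanup and the rows that would be handed to the DataFrame, not the HTML rendering.

```python
# Made-up output of the extraction step: values may span several lines.
sample = {
    "BILLTO ": "Acme Corp\n12 Example Street",
    "Invoice #": "0042",
}

# Same cleanup as format_table: strip keys, join multi-line values with ", ".
flat = {key.strip(): value.replace("\n", ", ") for key, value in sample.items()}

# These (key, value) tuples become the rows of the two-column table.
rows = list(flat.items())
print(rows)
```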

extract_kvp_from_image: The extract_kvp_from_image function is the main function that takes in an image and a string of regular expressions, extracts the text from the image, and extracts the key-value pairs using the extract_kvp function. It then formats the output as an HTML table using the format_table function.


def extract_kvp_from_image(image, key_regexes_str):
    # Split on commas and trim whitespace so "^A,^B" and "^A, ^B" both work
    key_regexes = [re.compile(regex.strip()) for regex in key_regexes_str.split(",") if regex.strip()]
    text = extract_text_from_image(image)
    print('Extracted Text Data',text)
    kvp = extract_kvp(text, key_regexes)
    output = format_table(kvp)
    return output

extract_kvp_from_image relies on a helper, extract_text_from_image, which extracts the text from the invoice image using the Tesseract OCR engine.

def extract_text_from_image(image):
    # Extract the text from the image using Tesseract
    text = pytesseract.image_to_string(image)
    
    return text

With these pieces in place, the pipeline is complete. extract_kvp_from_image passes the image to extract_text_from_image, feeds the resulting text and the compiled regular expressions to extract_kvp, and hands the resulting dictionary to format_table, which returns the HTML table.
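The key patterns reach the application as a single comma-separated string, so it can be useful to validate them before running OCR. The sketch below shows one possible way to do that; parse_key_regexes is a hypothetical helper, not something the application above defines.

```python
import re

def parse_key_regexes(key_regexes_str):
    """Split a comma-separated string into compiled patterns, skipping blanks."""
    compiled = []
    for part in key_regexes_str.split(","):
        pattern = part.strip()
        if not pattern:
            continue  # ignore empty entries such as a trailing comma
        try:
            compiled.append(re.compile(pattern))
        except re.error as exc:
            # Report which pattern is broken instead of failing later
            raise ValueError(f"Invalid key regex {pattern!r}: {exc}") from exc
    return compiled

patterns = parse_key_regexes(r"^BILLTO,^Invoice # ,  ^Amount Due \(USD\), ")
print([p.pattern for p in patterns])
```

One limitation of this format is that a pattern containing a comma cannot be expressed, because the comma is always treated as a separator.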

Let’s put all the functions together and run the application!

import pytesseract
from PIL import Image
import pandas as pd
import re
import gradio as gr

pytesseract.pytesseract.tesseract_cmd = r'/usr/bin/tesseract'

def extract_text_from_image(image):
    # Extract the text from the image using Tesseract
    text = pytesseract.image_to_string(image)
    
    return text

def extract_kvp(text, key_regexes):
    # Split the text into lines
    lines = text.split("\n")

    # Iterate over the lines and extract the key-value pairs
    kvp = {}
    key = None
    for line in lines:
        for regex in key_regexes:
            match = regex.match(line)
            if match:
                # Use the matched text as the key so that regex escapes
                # such as \( and \) do not leak into the output
                key = match.group(0).strip()
                kvp[key] = ""
                # Keep only what follows the key and its optional colon
                line = line[match.end():].lstrip(" :")
                break
        # Lines seen before the first key are ignored
        if key and line.strip():
            kvp[key] += line.strip() + "\n"

    # Clean up the collected values
    for key in list(kvp):
        value = kvp[key].strip()
        if key.startswith("Amount Due"):
            # The amount sits on the key's own line, e.g. "$1,200.00"
            value = value.split("\n")[0].split("$")[-1].strip()
        kvp[key] = value.replace(",", "")
    return kvp

def format_table(output):
    kvp = {key.strip(): value.replace("\n", ", ") for key, value in output.items()}
    df = pd.DataFrame(list(kvp.items()), columns=["Key", "Value"])
    return df.to_html(index=False, justify='left')

def extract_kvp_from_image(image, key_regexes_str):
    # Split on commas and trim whitespace so "^A,^B" and "^A, ^B" both work
    key_regexes = [re.compile(regex.strip()) for regex in key_regexes_str.split(",") if regex.strip()]
    text = extract_text_from_image(image)
    kvp = extract_kvp(text, key_regexes)
    output = format_table(kvp)
    return output

# Components come straight from the gr namespace; the image is passed to the function as a file path
inputs = [gr.Image(type="filepath", label="Input"), gr.Textbox(label="Key Regexes", value="^BILLTO, ^Invoice #, ^Task, ^Item, ^Rate, ^Price, ^Amount Due \\(USD\\)")]

outputs = gr.HTML(label="Output")

gr.Interface(
    fn=extract_kvp_from_image, inputs=inputs, outputs=outputs, title="Extract Key-Value Pairs from Invoice Image"
).launch()

Conclusion:

In this blog post, we have explored how to extract key-value pairs from an invoice image using Python and Gradio. We have used the Tesseract OCR engine to extract text from the image, and regular expressions to extract the key-value pairs from the text. We have also used Gradio to create a user-friendly interface for our application.