Image To Markdown

OCR and Markdown Processing with LLaMA 3.2 Vision – A Streamlit App

In this post, we build a Streamlit application that extracts text from an image using Tesseract OCR and processes it with LLaMA 3.2 Vision (via Ollama) to produce Markdown-formatted text.

Pre-Requisites

Before running the program, ensure you have the following Python modules installed: streamlit, Pillow, and pytesseract. You will also need the Tesseract OCR engine itself and Ollama installed on your system, with the LLaMA 3.2 Vision model pulled.

Note that base64 and subprocess are part of the Python standard library and require no separate installation.
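Assuming you manage packages with pip and already have Ollama installed, setup might look like this (the model tag matches the one used in the code below):

pip install streamlit pillow pytesseract
ollama pull llama3.2-vision:11b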

Code Explanation and Details

1. Importing Dependencies

import streamlit as st
import subprocess
from PIL import Image
import pytesseract
import base64

This section imports the required modules for our application. We use Streamlit for the web interface, Pillow for image processing, pytesseract for OCR, subprocess to call the external Ollama command, and base64 for generating a download link.

2. Setting Up the Application Title and Description

st.title("OCR and Markdown Processing with LLaMA 3.2 Vision")
st.write("Upload an image to extract text via Tesseract OCR and process it with LLaMA Vision 3.2 for Markdown conversion.")

Here we define the title and brief description of the app that will be displayed on the web page.

3. File Uploader for Image Input

uploaded_image = st.file_uploader("Upload Image", type=["jpg", "jpeg", "png"])

This widget allows users to upload an image file (JPG, JPEG, PNG) for processing.

4. Displaying the Uploaded Image and Extracting Text

if uploaded_image is not None:
    image = Image.open(uploaded_image)
    st.image(image, caption="Uploaded Image", use_container_width=True)

    st.subheader("Extracted Text:")
    extracted_text = pytesseract.image_to_string(image, lang="eng")
    st.write(extracted_text)

After uploading, the image is displayed on the page. Then Tesseract OCR is used to extract text, which is shown as plain text below the image.
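If Tesseract is not on your system PATH (a common issue on Windows), pytesseract must be pointed at the executable explicitly. A minimal sketch, assuming a default Windows install location (adjust the path for your machine):

import pytesseract

# Hypothetical install path – change to wherever tesseract.exe actually lives
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"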

5. Processing the Extracted Text with LLaMA Vision 3.2 (via Ollama)

if st.button("Process Text with LLaMA 3.2"):
    with st.spinner("Processing text with LLaMA 3.2..."):
        try:
            # Ask the model explicitly for Markdown so the output matches the app's purpose
            prompt = f"Convert the following text to well-formatted Markdown:\n\n{extracted_text}"
            command = ["ollama", "run", "llama3.2-vision:11b", prompt]
            result = subprocess.run(command, capture_output=True, text=True)
            if result.returncode != 0:
                st.error(f"Ollama Error: {result.stderr}")
            else:
                markdown_text = result.stdout.strip()
                st.subheader("Processed Markdown Text:")
                st.write(markdown_text)
                
                # Provide a download link for the Markdown output
                b64 = base64.b64encode(markdown_text.encode()).decode()
                download_link = f'<a href="data:text/markdown;base64,{b64}" download="output.md">Download Markdown File</a>'
                st.markdown(download_link, unsafe_allow_html=True)
        except Exception as e:
            st.error(f"Error during processing: {e}")

When the user clicks the button, the app wraps the extracted text in a Markdown-conversion prompt and sends it to the LLaMA 3.2 Vision model via Ollama. The resulting Markdown is displayed, and a download link is generated so the user can save the output.
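As a design note, recent Streamlit versions also offer st.download_button, which serves the file directly and avoids building a base64 data URI by hand. A minimal sketch of that alternative (the file name and MIME type here are my choices, not from the original):

# Alternative to the hand-built link: Streamlit's built-in download widget
st.download_button(
    label="Download Markdown File",
    data=markdown_text,
    file_name="output.md",  # hypothetical file name
    mime="text/markdown",
)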

Final Code

import streamlit as st
import subprocess
from PIL import Image
import pytesseract
import base64

# Title and description
st.title("OCR and Markdown Processing with LLaMA 3.2 Vision")
st.write("Upload an image to extract text via Tesseract OCR and process it with LLaMA Vision 3.2 for Markdown conversion.")

# File uploader for image input
uploaded_image = st.file_uploader("Upload Image", type=["jpg", "jpeg", "png"])

if uploaded_image is not None:
    # Open and display the uploaded image
    image = Image.open(uploaded_image)
    st.image(image, caption="Uploaded Image", use_container_width=True)

    # Step 1: Perform OCR on the image using Tesseract
    st.subheader("Extracted Text:")
    extracted_text = pytesseract.image_to_string(image, lang="eng")
    st.write(extracted_text)

    # Step 2: Process the extracted text with Ollama LLaMA 3.2 for Markdown formatting
    if st.button("Process Text with LLaMA 3.2"):
        with st.spinner("Processing text with LLaMA 3.2..."):
            try:
                # Ask the model explicitly for Markdown so the output matches the app's purpose
                prompt = f"Convert the following text to well-formatted Markdown:\n\n{extracted_text}"
                command = ["ollama", "run", "llama3.2-vision:11b", prompt]
                result = subprocess.run(command, capture_output=True, text=True)
                if result.returncode != 0:
                    st.error(f"Ollama Error: {result.stderr}")
                else:
                    markdown_text = result.stdout.strip()
                    st.subheader("Processed Markdown Text:")
                    st.write(markdown_text)
                    
                    # Provide a download link for the Markdown output
                    b64 = base64.b64encode(markdown_text.encode()).decode()
                    download_link = f'<a href="data:text/markdown;base64,{b64}" download="output.md">Download Markdown File</a>'
                    st.markdown(download_link, unsafe_allow_html=True)
            except Exception as e:
                st.error(f"Error during processing: {e}")

This is the complete code for the Streamlit application.

How to Run the Program

streamlit run image2markdown.py

Open your terminal, navigate to the project directory, and run the command above.
Once the app starts, view it in your browser:
Local URL: http://localhost:8501
Network URL: http://192.168.1.11:8501
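
If port 8501 is already taken, Streamlit accepts an alternative port on the command line:

streamlit run image2markdown.py --server.port 8502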

Conclusion

This application demonstrates how to integrate Tesseract OCR with the Ollama LLaMA 3.2 Vision model using Streamlit. By uploading an image, extracting text from it, and processing that text into Markdown, the app showcases a powerful combination of computer vision and natural language processing techniques in an easy-to-use web interface.
