
Building a Hybrid OCR + LLM Document Classification Pipeline

How I learned that OCR alone wasn't enough for financial document processing at Nemra

Tags: Python, LLM, OCR, AI, Document Processing, Fintech

When I started building Nemra's document processing API, the goal seemed straightforward: accept financial documents (bank statements, invoices, payslips, receipts) and return structured data that could automatically feed into accounting systems. I thought optical character recognition (OCR) would solve most of the problem. Just extract the text, pass it through a classifier, pull out the key fields, and we're done. Right?

That assumption lasted about two weeks into the project.

The OCR-Only Approach That Didn't Work

My first prototype was simple: use PyMuPDF to extract text from PDFs, run it through a basic text classifier, and parse the results with regex patterns. For the first few test documents, it worked beautifully. I was feeling pretty good about myself.

Then I started testing with real-world documents from actual businesses.

The problems came fast:

Layout variations broke everything. Every bank has its own statement template. Every company designs invoices differently. My rigid template-based extraction logic would work perfectly for one invoice format, then completely fail on another that was only slightly different. I was playing whack-a-mole, constantly adding new templates and special cases.

OCR wasn't as reliable as I thought. Low-quality scans were a nightmare. Handwritten notes on invoices caused chaos. Complex multi-column layouts confused the text extraction, mixing up columns and reading order. What came out of the OCR was often garbled or out of order, and my downstream classifier was trained on clean text; it didn't know what to do with messy input.

The pipeline was fragile. I had OCR services, custom scripts for cleaning, rate limiting logic, API calls, error handling scattered everywhere. Every connection point was a potential failure. I spent more time debugging infrastructure than actually building features.

Context was missing. When I finally tried feeding OCR text into a language model, it would sometimes hallucinate details or miss important visual cues that were obvious from looking at the document. A logo, a colored header, the position of text: all of that context was lost when I reduced everything to plain text.

I realized I needed to rethink the entire approach.

Exploring What Could Work Better

Before committing to a new architecture, I spent time researching alternatives. I looked at CNN-based image classifiers, but they required huge labeled datasets and significant training infrastructure. That felt like overkill for our use case, and I didn't want to spend months collecting and labeling thousands of documents.

I discovered that newer multimodal LLMs (models that can process both text and images) were getting really good at document understanding tasks. The key insight was that I didn't need to train a custom model. These pre-trained vision-language models could already understand documents if I gave them the right combination of visual and textual information.

I also found research suggesting that hybrid approaches combining OCR with LLM strategies could achieve much better accuracy while staying fast. The idea was to exploit what each modality was good at: OCR for extracting raw text, images for preserving layout and visual context, and LLMs for understanding and classification.

This felt like the right direction.

Building the Hybrid Pipeline

I redesigned the pipeline around a simple principle: treat each document as both text and image, and let a multimodal LLM unify the information.

Document Ingestion and Preprocessing

The flow starts when a customer uploads a PDF or image file to my FastAPI endpoint. For PDFs, I focus on the first two pages, since they almost always contain the critical classification clues: headers, logos, and document type indicators.

I convert the file into high-resolution images and extract the raw text using OCR. Critically, I preserve the bounding-box positions from the OCR, which helps maintain layout information.
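One way to see why the bounding boxes matter is reading-order reconstruction. The sketch below is an assumption about the OCR output shape (a list of `(text, x, y)` tuples with top-left pixel coordinates, which is roughly what PyMuPDF and most OCR engines expose): words are grouped into visual lines by vertical proximity, then sorted left to right within each line.

```python
def order_ocr_words(words: list[tuple[str, float, float]],
                    line_tolerance: float = 10.0) -> str:
    """Rebuild reading order from OCR word positions.

    `words` are (text, x, y) tuples; `line_tolerance` is how close two
    y-coordinates must be (in pixels) to count as the same visual line.
    """
    # Sort top-to-bottom first, then group nearby y values into lines.
    words = sorted(words, key=lambda w: w[2])
    lines: list[list[tuple[str, float, float]]] = []
    for word in words:
        if lines and abs(word[2] - lines[-1][0][2]) <= line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, sort left-to-right, then join into text.
    return "\n".join(
        " ".join(w[0] for w in sorted(line, key=lambda w: w[1]))
        for line in lines
    )
```

Without this positional grouping, a two-column statement comes out interleaved, which is exactly the garbled input that broke my first classifier.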

Creating a Hybrid Representation

Here's the key innovation: I create a single image by vertically concatenating the first pages of the document. I encode this as base64 and combine it with the raw extracted text into a single payload.

This hybrid representation gives the LLM both visual signals (layout, logos, formatting) and textual signals (the actual content). It can see that this looks like an invoice while also reading the text to confirm details.
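A minimal sketch of assembling that payload, assuming `image_bytes` holds the vertically concatenated pages as JPEG bytes. The field names and the 2,000-character truncation are illustrative choices, not a fixed API:

```python
import base64

def build_hybrid_payload(image_bytes: bytes, ocr_text: str,
                         max_chars: int = 2000) -> dict:
    """Combine the page image and OCR text into one classification payload."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        # Data URL form that vision-enabled chat APIs typically accept.
        "image_url": f"data:image/jpeg;base64,{encoded}",
        # Truncate the OCR text so the combined payload stays within budget.
        "ocr_text": ocr_text[:max_chars],
    }
```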

The Multimodal LLM Call

I send the combined image and text to a vision-enabled LLM like GPT-4o. My prompt instructs it to classify the document into one of our categories: bank statement, invoice, receipt, payslip, cheque, transfer note, or other.

The model returns a JSON response with the predicted document type and a confidence score. If it's uncertain, it labels the document as "other" rather than making a bad guess.

```python
# Simplified example of the classification call
async def classify_document(image_base64: str, extracted_text: str) -> dict:
    response = await llm_client.complete(
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f"""Classify this financial document into one of these categories:
- bank_statement
- invoice
- receipt
- payslip
- cheque
- transfer_note
- other

OCR Text:
{extracted_text[:2000]}

Return JSON with 'document_type' and 'confidence' (0-1).""",
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{image_base64}"
                        },
                    },
                ],
            }
        ]
    )

    return response.json()
```

Handling Large Documents

For multi-page PDFs, I split the document into chunks of about five pages each. I process each chunk separately and get a summary, then make a second LLM call to synthesize the combined summaries. This approach keeps me within token limits while ensuring I don't miss important information scattered across pages.
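The chunk-then-synthesize flow can be sketched like this. `summarize_chunk` and `synthesize` stand in for the two LLM calls and are assumptions here; only the chunking logic is concrete:

```python
def chunk_pages(pages: list, chunk_size: int = 5) -> list[list]:
    """Split a page list into consecutive chunks of at most chunk_size."""
    return [pages[i:i + chunk_size] for i in range(0, len(pages), chunk_size)]

async def summarize_document(pages, summarize_chunk, synthesize):
    # First pass: one summary per ~5-page chunk keeps each call within limits.
    summaries = [await summarize_chunk(chunk) for chunk in chunk_pages(pages)]
    # Second pass: a single call merges the partial summaries.
    return await synthesize(summaries)
```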

Downstream Extraction

Once a document is classified, specialized parsers take over. Each parser knows how to extract structured fields for its document type using templated prompts and Pydantic schemas for validation.

For example, the invoice parser extracts supplier name, invoice number, line items, amounts, and tax details. It also generates double-entry journal entries using a chart of accounts, which is exactly what our customers need for their accounting systems.
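A sketch of what such a schema might look like with Pydantic. The field names and types are my assumptions for illustration, not Nemra's actual models:

```python
from pydantic import BaseModel, Field

class LineItem(BaseModel):
    description: str
    quantity: float = 1.0  # default to a single unit if not stated
    amount: float

class Invoice(BaseModel):
    supplier_name: str
    invoice_number: str
    line_items: list[LineItem]
    total: float
    tax: float = Field(ge=0)  # reject negative tax outright
```

If the LLM returns a payload missing `supplier_name` or with a negative `tax`, validation fails immediately instead of letting bad data flow into journal entries.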

What Changed After the Hybrid Approach

The difference was dramatic.

Classification accuracy went way up. By analyzing both visual and textual cues, the LLM could reliably distinguish between similar document types. A payslip and an invoice might contain similar text, but they look completely different, and the model could see that.

Layout changes stopped breaking things. I was no longer relying on brittle template rules. Minor variations in invoice design or statement format didn't matter. The model adapted naturally.

Fewer false positives. The confidence scoring worked well. When the model wasn't sure, it said so by returning "other" rather than making a confident wrong guess. This let me route ambiguous documents to human review.
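The routing rule itself is simple. This is a sketch, and the 0.8 threshold is an illustrative choice rather than a tuned value:

```python
def route_document(document_type: str, confidence: float,
                   threshold: float = 0.8) -> str:
    """Send uncertain or unrecognized documents to a human queue."""
    if document_type == "other" or confidence < threshold:
        return "human_review"
    return "auto_process"
```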

The system could scale. Automatic chunking for large documents, efficient image encoding, and parallel processing meant I could handle multi-page statements and batch uploads without hitting limits.

My internal tests showed a significant reduction in classification errors compared to the OCR-only approach. More importantly, the time I spent manually verifying and fixing misclassified documents dropped dramatically.

Lessons I Learned Building This

OCR is a tool, not the solution. Treating OCR as just one source of signals, not the ultimate truth, made everything more robust. The raw text was useful, but it needed visual context to really work.

Multimodal LLMs are powerful. Modern vision-language models can understand documents without custom training. This saved me months of work collecting datasets and training models.

Design for token limits from the start. Large documents will exceed context windows. Building chunking and synthesis into the architecture from day one prevented headaches later.

Validate everything with schemas. Using Pydantic schemas to validate extracted data prevented hallucinations and ensured consistent output. If the LLM tried to return something that didn't match the schema, I'd know immediately.

Keep infrastructure simple. The fewer moving parts, the fewer things that can break. Using a single multimodal LLM call instead of chaining multiple services reduced complexity and failure points.

The Impact

What started as an OCR classification project evolved into a hybrid system that combines text, images, and language models. By acknowledging that OCR alone wasn't enough and embracing multimodal AI, I built a platform that can classify and parse a wide variety of financial documents reliably.

The architecture is extensible: adding new document types is a matter of updating the prompt and adding a new parser, not retraining models or building new templates. As multimodal models continue to improve, the system gets better without me having to change the core architecture.

If you're building document processing pipelines, I'd encourage you to think beyond just OCR. The combination of visual and textual understanding is more powerful than either alone. And sometimes, the best solution isn't the one you started with; it's the one you discover when your first approach fails.