Tesseract OCR Guide: Accuracy, Setup & Real-World Testing

2025-12-02 01:30 | 9 min read | 571 views | Author: Thai Nguyen (Software Engineer)

🧩 What Is OCR and Why Is It So Popular?

OCR (Optical Character Recognition) is the technology that lets machines read text from images such as scanned documents, receipts, screenshots, manga panels, and photos of books.

Simply put: OCR turns pixels into text.

It is widely used in:

  1. document digitization
  2. automated data entry
  3. AI chat & search
  4. PDF-to-text workflows
  5. invoice processing
  6. reading manga or scanned books
  7. screen capture text extraction

That’s why OCR remains a highly searched topic.


📚 Common OCR Solutions

There are many OCR tools today:

Open-source tools

  1. Tesseract
  2. EasyOCR
  3. PaddleOCR

Cloud/Commercial

  1. Google Cloud Vision
  2. Azure OCR
  3. AWS Textract
  4. OpenAI vision models

Among them, Tesseract is the most popular thanks to being:

  1. free
  2. offline
  3. easy to deploy
  4. usable from JavaScript and Python
  5. Docker-friendly

This article focuses on real-world Tesseract usage, including performance and pitfalls.


🎯 Overview of Tesseract OCR

Tesseract is an open-source OCR engine, originally developed at Hewlett-Packard and later sponsored by Google, that supports 100+ languages on Linux, macOS, and Windows.

Advantages:

  1. free and open
  2. works offline
  3. fast on local CPU
  4. integrates with JS (tesseract.js)
  5. integrates with Python (pytesseract)

However, based on my experience deploying OCR processing for TTSForFree.com:

  1. excellent for English
  2. weak for Vietnamese
  3. server CPU overloads quickly without worker limits
  4. fails on manga, small fonts, or blurry images
  5. Docker images can become large

Let’s break down how Tesseract really works.


🔍 How Tesseract Works Internally (Deep Dive)

H3.1 Pre-processing

  1. grayscale conversion
  2. thresholding (Otsu)
  3. noise reduction
  4. deskewing

This stage heavily affects final accuracy.
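
To make the thresholding step concrete, here is a minimal pure-Python sketch of Otsu's method. It is an illustration of the technique, not Tesseract's own code (Tesseract handles binarization internally), and the function names `otsu_threshold` and `binarize` are made up for this example.

```python
def otsu_threshold(pixels):
    """Return an Otsu threshold (0-255) for an iterable of 8-bit gray values."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # intensity sum of those pixels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance (up to a constant factor); Otsu maximizes it.
        between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(pixels, threshold):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [255 if p > threshold else 0 for p in pixels]
```

On an image with dark text on a light background, the threshold lands between the two brightness clusters, so `binarize` separates ink from paper. In a real pipeline you would normally let an image library do this rather than loop over pixels in Python.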

H3.2 Layout Analysis

Tesseract detects:

  1. text blocks
  2. lines
  3. words
  4. characters

Bad layouts or skewed images → lower accuracy.

H3.3 LSTM OCR Engine

Tesseract 4+ uses LSTM models.

English datasets are large → high accuracy.

Vietnamese datasets are limited → diacritic errors, character confusion.

H3.4 Post-processing

  1. reconstructs words
  2. removes invalid characters
  3. optional dictionary correction



🌍 Tesseract Accuracy by Language (Real Tests)

Based on my benchmarks:

| Language | Clear Images | Low-quality/Small Text | Notes |
|---|---|---|---|
| English | 90–95% | ~80% | Strongest |
| Vietnamese | 85–90% (after threshold) | 60–70% | Diacritics often fail |
| Japanese/Chinese | 70–85% | <60% | Needs fine-tuned models |
| Others | 80–95% | varies | Generally OK |
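
The article does not say how these percentages were computed. One common way to score OCR output is character-level accuracy based on edit distance, and the sketch below shows that approach. Treat it as an assumption about methodology, not a description of the benchmark; `levenshtein` and `char_accuracy` are illustrative names.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete from a
                           cur[j - 1] + 1,       # insert into a
                           prev[j - 1] + cost))  # substitute or match
        prev = cur
    return prev[-1]


def char_accuracy(reference: str, ocr_output: str) -> float:
    """1 minus the character error rate, clamped to the range 0..1."""
    # Normalize so composed and decomposed Vietnamese letters compare equal.
    reference = unicodedata.normalize("NFC", reference)
    ocr_output = unicodedata.normalize("NFC", ocr_output)
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = levenshtein(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))
```

A single wrong diacritic in a five-letter word counts as one substitution, so the word scores roughly 0.8. Averaging this over a set of ground-truth transcriptions gives a figure comparable to the percentages in the table.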


🖼 Why Tesseract Struggles with Manga, Small Fonts & Vietnamese

H3.1 Manga fonts not in LSTM training data

H3.2 JPEG compression destroys text edges

H3.3 Small fonts (<14px) are almost unreadable

H3.4 Vietnamese diacritics are difficult for LSTM

Examples from my tests:

"được" → "duoc"
"những" → "nhụng"
"tường" → "tương"

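To see how much of the Vietnamese error budget comes from diacritics alone, you can compare the OCR output with the ground truth after stripping all marks. The helper below is a rough sketch using only the standard library; `strip_diacritics` and `classify_error` are hypothetical names, and the three-way classification is my own simplification.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks; map đ/Đ by hand because they do not decompose."""
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return base.replace("đ", "d").replace("Đ", "D")


def classify_error(expected: str, actual: str) -> str:
    """Return 'correct', 'diacritic-only', or 'other' for one word pair."""
    expected = unicodedata.normalize("NFC", expected)
    actual = unicodedata.normalize("NFC", actual)
    if expected == actual:
        return "correct"
    if strip_diacritics(expected) == strip_diacritics(actual):
        return "diacritic-only"
    return "other"
```

All three example pairs above fall into the "diacritic-only" bucket: the base letters are right and only the marks are lost or swapped. Counting that bucket across a test set shows whether pre-processing or a spell-correction pass is the better place to spend effort.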

⚙ Integration with JavaScript & Python

H3.1 Tesseract.js (JavaScript, React, Next.js)

Advantages:

  1. runs on client-side → no upload round-trip, so it feels fast
  2. offloads CPU from server
  3. no server-side OCR endpoint for anyone to abuse
  4. easy to embed in UI

Drawbacks:

  1. slower than native
  2. poor Vietnamese accuracy
  3. large download: WASM core plus language data (15–30MB)

H3.2 Pytesseract (Python, Native)

Advantages:

  1. more stable
  2. integrates well with OpenCV
  3. ideal for server pipelines
  4. faster than tesseract.js

Drawbacks:

  1. an unthrottled API endpoint pins the CPU at 100% almost immediately
  2. needs worker limits and a queue in front of it
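
A worker limit can be as small as a semaphore around the OCR call. The sketch below assumes the actual OCR function is passed in as `ocr_fn` (for example a thin wrapper around pytesseract). `OcrLimiter` and `ServerBusy` are names invented for this example, not part of any library.

```python
import threading


class ServerBusy(Exception):
    """Raised when every OCR slot is already taken."""


class OcrLimiter:
    """Allow at most `max_concurrent` OCR calls to run at the same time."""

    def __init__(self, ocr_fn, max_concurrent=4):
        self._ocr_fn = ocr_fn
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, image):
        # Fail fast instead of piling up CPU-bound work behind the scenes.
        if not self._slots.acquire(blocking=False):
            raise ServerBusy("all OCR slots are in use")
        try:
            return self._ocr_fn(image)
        finally:
            self._slots.release()
```

An API handler would catch `ServerBusy` and answer with something like HTTP 429 so clients retry later. Rejecting early keeps the machine responsive. A queue, as described in the architecture section, is the next step when requests must not be dropped.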


🚀 Performance Testing (Real Benchmark)

Machine specs:

  1. 16GB RAM
  2. 16 cores

Results:

  1. single-image OCR → near-realtime
  2. client-side (JS) → responsive for single images, with zero server load
  3. server-side (Python) without a concurrency limit → CPU saturated at 100% within seconds

GPU note:

Tesseract's LSTM engine runs on the CPU; it has no production-ready GPU acceleration.

For GPU OCR → PaddleOCR or custom models.


🐳 Docker Image Size

My production Docker worker installing only English + Vietnamese:

👉 ~700–900MB

Why so large?

  1. traineddata files (20–50MB each)
  2. leptonica libraries
  3. TIFF/JPEG/PNG libs
  4. Ubuntu base image

Installing all languages can exceed 3GB.


🧠 Tips to Improve Vietnamese OCR Accuracy

H3.1 Apply grayscale + threshold

Binarizing the image removes background color and texture, which makes glyph edges much easier to separate from noise.

H3.2 Upscale image 2× or 3×

LSTM performs much better with higher resolution.

H3.3 Use correct PSM mode

Examples:

--psm 6 → block of text
--psm 7 → single line
--psm 11 → sparse text
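
As a sketch of how the PSM flag fits into an actual call, the helper below builds a command line for the `tesseract` CLI. The flags `-l` and `--psm` and the `stdout` output target are real CLI options. The function names and the `PSM_MODES` table are illustrative, and `run_tesseract` assumes the binary is on your PATH.

```python
import subprocess

# The three modes mentioned above, with short descriptions.
PSM_MODES = {
    6: "single uniform block of text",
    7: "single text line",
    11: "sparse text, no particular order",
}


def tesseract_command(image_path, lang="eng", psm=6):
    """Build the argument list for one OCR run that prints text to stdout."""
    if psm not in range(14):  # Tesseract defines page segmentation modes 0-13
        raise ValueError(f"psm must be 0-13, got {psm!r}")
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def run_tesseract(image_path, lang="eng", psm=6):
    """Run the CLI and return the recognized text (needs tesseract installed)."""
    cmd = tesseract_command(image_path, lang, psm)
    done = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return done.stdout
```

For a cropped speech bubble or a one-line label, `psm=7` is usually the first thing to try, while full paragraphs tend to do better with `psm=6`. With pytesseract, the same flag goes into the `config` argument, for example `config="--psm 7"`.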


H3.4 Use character whitelist

Useful when the output should only contain a known character set, such as digits and separators on invoices.
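
A whitelist is passed through the `tessedit_char_whitelist` config variable. The sketch below builds a config string that could be handed to pytesseract's `config` argument, plus a plain-Python fallback filter. Both function names are made up for this example. To my knowledge the LSTM engine in Tesseract 4.0 ignored the whitelist and 4.1 restored it, so check your version before relying on it.

```python
def whitelist_config(allowed, psm=6):
    """Return a Tesseract config string restricting output to `allowed`."""
    if not allowed or any(ch.isspace() for ch in allowed):
        raise ValueError("whitelist must be non-empty and contain no whitespace")
    return f"--psm {psm} -c tessedit_char_whitelist={allowed}"


def keep_allowed(text, allowed):
    """Fallback: drop characters outside the whitelist, keeping whitespace."""
    permitted = set(allowed)
    return "".join(ch for ch in text if ch in permitted or ch.isspace())


# Hypothetical usage, assuming pytesseract and Pillow are installed:
#   text = pytesseract.image_to_string(img, config=whitelist_config("0123456789.,"))
#   text = keep_allowed(text, "0123456789.,")
```

The post-filter is a safety net, not a replacement: a whitelist steers recognition toward the allowed characters, while filtering only deletes whatever slipped through.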

H3.5 Pre-process with OpenCV

Noise reduction boosts accuracy by 5–20%.
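
To show what noise reduction does at the pixel level, here is a small pure-Python 3×3 median filter. It illustrates the idea only. In practice you would call an OpenCV function such as `cv2.medianBlur` instead of looping in Python, and `median_filter_3x3` is a name chosen for this sketch.

```python
def median_filter_3x3(image):
    """Apply a 3x3 median filter to a 2D list of gray values.

    Border pixels are copied unchanged to keep the sketch short.
    """
    if not image or not image[0]:
        return []
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            window = [image[yy][xx]
                      for yy in (y - 1, y, y + 1)
                      for xx in (x - 1, x, x + 1)]
            out[y][x] = sorted(window)[4]  # middle of the 9 sorted values
    return out
```

A median filter removes isolated specks (salt-and-pepper noise) while keeping edges sharper than a plain blur would, which is why it pairs well with thresholding before OCR.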


🏗 Recommended Production Architecture (Event-Driven)

❌ Bad (naive approach)

Client → API → Tesseract → response
Result: CPU overload, server crash

✔ Proper event-driven pipeline

Client (React/Next.js)
→ API Gateway
→ Message Queue (RabbitMQ/Kafka)
→ OCR Workers (Python/Tesseract)
→ Storage (R2/S3)
→ API fetches result

Benefits:

  1. protects CPU
  2. scales horizontally
  3. easy retries
  4. stable even with large traffic

This is the architecture I use for TTSForFree’s OCR pipeline.
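
The flow can be prototyped in a single process before bringing in real infrastructure. The sketch below is not the production setup: it swaps RabbitMQ/Kafka for `queue.Queue`, S3/R2 for a dictionary, and takes the OCR function as a parameter. `OcrPipeline` and its methods are invented for illustration.

```python
import queue
import threading
import uuid


class OcrPipeline:
    """In-process stand-in for: API -> queue -> workers -> storage -> fetch."""

    def __init__(self, ocr_fn, num_workers=2, max_pending=100):
        self._ocr_fn = ocr_fn
        self._jobs = queue.Queue(maxsize=max_pending)  # the "message queue"
        self._results = {}                             # the "storage"
        self._lock = threading.Lock()
        for _ in range(num_workers):
            threading.Thread(target=self._work, daemon=True).start()

    def submit(self, image):
        """Enqueue a job and return its id; raises queue.Full when saturated."""
        job_id = uuid.uuid4().hex
        with self._lock:
            self._results[job_id] = {"status": "queued", "text": None}
        try:
            self._jobs.put_nowait((job_id, image))
        except queue.Full:
            with self._lock:
                del self._results[job_id]
            raise
        return job_id

    def _work(self):
        while True:
            job_id, image = self._jobs.get()
            try:
                outcome = {"status": "done", "text": self._ocr_fn(image)}
            except Exception as exc:  # keep the worker alive on bad input
                outcome = {"status": "failed", "text": str(exc)}
            with self._lock:
                self._results[job_id] = outcome
            self._jobs.task_done()

    def result(self, job_id):
        """Return a copy of the job's current state (what the client polls)."""
        with self._lock:
            return dict(self._results[job_id])

    def wait_idle(self):
        """Block until every queued job has been processed."""
        self._jobs.join()
```

The number of workers caps CPU usage no matter how many requests arrive, and the bounded queue gives the API a clear signal for when to push back. Replacing the in-memory pieces with a real broker and object storage keeps the same shape while allowing workers to run on separate machines.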


⭐ Tesseract OCR Evaluation

Strengths

  1. free & open-source
  2. offline
  3. fast
  4. great for English
  5. easy Python/JS integration

Weaknesses

  1. weak Vietnamese accuracy
  2. no GPU support
  3. heavy Docker images
  4. CPU-intensive under load
  5. bad at manga/small fonts

Best for

  1. simple OCR tasks
  2. clear text images
  3. small/medium projects
  4. offline tools

Not suitable for

  1. mission-critical accuracy
  2. Vietnamese-heavy workloads
  3. manga/scanned books
  4. large-scale production


Frequently Asked Questions

Q: Is Tesseract OCR free?

A: Yes. Tesseract is fully free and open-source under Apache 2.0.

Q: Is Tesseract good for Vietnamese text?

A: Not really. Vietnamese accuracy is significantly lower than English, especially with diacritics and small fonts.

Q: How large is a typical Tesseract Docker image?

A: Around 700–900MB with only English and Vietnamese installed. With many languages it may exceed 3GB.

Q: Should Tesseract be used in high-traffic production?

A: Not without safeguards. Put requests through a queue and cap the number of workers, otherwise the CPU overloads.

Q: Where does Tesseract perform fastest?

A: Client-side (tesseract.js) or on a strong local machine using the native engine.
