2024-04-02 OCR PDFs and images directly in your browser by Simon Willison

https://tools.simonwillison.net/ocr
https://simonwillison.net/2024/Mar/30/ocr-pdfs-images/
https://tesseract.projectnaptha.com/
https://github.com/naptha/tesseract.js (Pure Javascript OCR for more than 100 Languages 📖🎉🖥 )
https://mozilla.github.io/pdf.js/
https://github.com/mozilla/pdf.js (a Portable Document Format (PDF) viewer that is built with HTML5)

Introduction

This tool runs entirely in your browser. No files are uploaded to a server.

It uses Tesseract.js for OCR and PDF.js to convert PDFs into images.
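
The division of labour between the two libraries might be sketched roughly as follows. This is not the tool's actual source: `InputFile`, `OcrDeps`, `pdfToImages`, `recognize` and `ocrFile` are hypothetical names, and the two injected functions stand in for whatever wrappers around PDF.js and Tesseract.js a real page would use. The sketch only shows the control flow: PDFs are first turned into one image per page, plain images go straight to OCR.

```typescript
// Hypothetical shapes for illustration only; these are NOT the real
// PDF.js or Tesseract.js APIs.
interface InputFile {
  name: string;
  type: string;    // MIME type reported by the browser, may be empty
  dataUrl: string; // file contents, e.g. as a data: URL
}

interface PageImage {
  pageNumber: number;
  dataUrl: string;
}

interface OcrDeps {
  // Assumed wrapper around PDF.js: renders every page of a PDF to an image.
  pdfToImages(file: InputFile): Promise<PageImage[]>;
  // Assumed wrapper around Tesseract.js: returns the text found in one image.
  recognize(imageDataUrl: string, lang: string): Promise<string>;
}

// Treat a file as a PDF if either its MIME type or its extension says so.
function isPdf(file: InputFile): boolean {
  return file.type === "application/pdf" || /\.pdf$/i.test(file.name);
}

// Returns one string of recognized text per page (or a single entry for an image).
async function ocrFile(
  file: InputFile,
  deps: OcrDeps,
  lang: string = "eng" // Tesseract language code; "eng" is English
): Promise<string[]> {
  const images: PageImage[] = isPdf(file)
    ? await deps.pdfToImages(file)
    : [{ pageNumber: 1, dataUrl: file.dataUrl }];

  const texts: string[] = [];
  // Pages are processed one at a time so results stay in page order.
  for (const image of images) {
    texts.push(await deps.recognize(image.dataUrl, lang));
  }
  return texts;
}
```

Passing the two library wrappers in as `deps` is a design choice of this sketch: it keeps the routing logic independent of the browser, so it can be exercised with simple fakes.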

Some explanations

https://simonwillison.net/2024/Mar/30/ocr-pdfs-images/

So I built a new tool!

tools.simonwillison.net/ocr provides a single page web app that can run Tesseract OCR against images or PDFs that are opened in (or dragged and dropped onto) the app.

Crucially, everything runs in the browser. There is no server component here, and nothing is uploaded. Your images and documents never leave your computer or phone.

It’s not perfect: multi-column PDFs (thanks, academia) will be treated as a single column, illustrations or photos may result in garbled ASCII-art and there are plenty of other edge cases that will trip it up.

But… having Tesseract OCR available against PDFs in a web browser (including in Mobile Safari) is still a really useful thing.

I’m really pleased with this project. I consider it finished—it does the job I designed it to do and I don’t see any need to keep on iterating on it.

And because it’s all static JavaScript and WebAssembly I expect it to continue working effectively forever.

Update: OK, a few more features: I added language selection, paste support and some basic automated tests using Playwright Python.
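
As an illustration of what paste support involves, here is a minimal sketch of picking OCR-able files out of a paste event. It is an assumption about how such a feature could work, not the tool's code: `PastedItem` and `filesFromPaste` are hypothetical names, and `PastedItem` only mirrors the `kind`, `type` and `getAsFile()` members that a browser `DataTransferItem` exposes, so that the logic can run outside a browser.

```typescript
// Hypothetical minimal view of a clipboard entry. A browser DataTransferItem
// has these three members; F would be File in a real page.
interface PastedItem<F> {
  kind: string; // "file" or "string"
  type: string; // MIME type, e.g. "image/png"
  getAsFile(): F | null;
}

// Keep only pasted files that could be sent to OCR: images and PDFs.
function filesFromPaste<F>(items: ReadonlyArray<PastedItem<F>>): F[] {
  const files: F[] = [];
  for (const item of items) {
    const ocrable = /^image\//.test(item.type) || item.type === "application/pdf";
    if (item.kind !== "file" || !ocrable) {
      continue; // skip pasted text and unsupported file types
    }
    const file = item.getAsFile();
    if (file !== null) {
      files.push(file);
    }
  }
  return files;
}
```

In a page, the items would presumably come from the `clipboardData.items` of a `paste` event, copied into an array first; each returned file could then be handed to the same code path as an opened or dropped file.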