Tony Tascioglu Wiki

TechnoTony Wiki - Tony Tascioglu's personal public knowledge-base!


OCR a PDF file

Super simple solution.

Install tesseract and the corresponding data package for your document's language.
Install ocrmypdf, either from the AUR or with pip.
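
A quick way to confirm both tools ended up on your PATH is sketched below. The helper name `missing_tools` is made up for this page, and the install lines in the comments assume Arch package names (`tesseract`, `tesseract-data-eng` for English), so adjust them for your distro and language.

```shell
# Install (assumed Arch package names; swap "eng" for your language):
#   sudo pacman -S tesseract tesseract-data-eng
#   pip install ocrmypdf      # or build the ocrmypdf package from the AUR

# missing_tools: hypothetical helper, prints each named command
# that cannot be found on PATH (prints nothing if all are present).
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

Running `missing_tools tesseract ocrmypdf` should print nothing once both are installed.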

If your document is scanned/has no text layer, run it as

ocrmypdf input.pdf output_with_text.pdf
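
For more than one file, or a language other than the default, a small wrapper along these lines might help. It is only a sketch: `out_name`, `ocr_all` and the `OCR_LANG` variable are names invented here, and the value handed to ocrmypdf's `-l` option has to match a tesseract data package you actually installed (for example `eng`, `deu` or `eng+deu`).

```shell
# out_name: hypothetical helper, maps "scan.pdf" to "scan_with_text.pdf".
out_name() {
    printf '%s_with_text.pdf\n' "${1%.pdf}"
}

# ocr_all: hypothetical helper, OCRs every PDF passed as an argument.
# OCR_LANG is an assumed variable name; it defaults to English.
ocr_all() {
    for pdf in "$@"; do
        # Skip files produced by an earlier run of this function.
        case "$pdf" in
            *_with_text.pdf) continue ;;
        esac
        ocrmypdf -l "${OCR_LANG:-eng}" "$pdf" "$(out_name "$pdf")"
    done
}
```

Something like `OCR_LANG=deu ocr_all *.pdf` would then process every PDF in the current directory.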

If the document already has a text layer, the new OCR text doesn't always merge cleanly with it. You can use --force-ocr, but that rasterizes every page and makes the output file massive.
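
If you don't know in advance which case a file falls into, one rough approach is to check for extractable text first and only force OCR when some is found. The sketch below assumes `pdftotext` from poppler is installed, which is an extra dependency not otherwise needed on this page; `has_text` and `ocr_smart` are hypothetical helper names.

```shell
# has_text: hypothetical helper, succeeds if pdftotext extracts
# at least one letter or digit from the PDF given as $1.
has_text() {
    pdftotext "$1" - 2>/dev/null | grep -q '[[:alnum:]]'
}

# ocr_smart: hypothetical helper, usage: ocr_smart input.pdf output.pdf
# Plain run for scans; --force-ocr when a text layer already exists.
ocr_smart() {
    if has_text "$1"; then
        echo "text layer found, using --force-ocr (output will be large)" >&2
        ocrmypdf --force-ocr "$1" "$2"
    else
        ocrmypdf "$1" "$2"
    fi
}
```

If the goal is only to fill in the scanned pages of a mixed document, ocrmypdf's --skip-text option may be a better fit than --force-ocr, since it leaves pages that already contain text alone.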

scripts/files/ocr_pdf_file.txt · Last modified: 2023-03-27 02:21 by Tony