I use my shell utility[3] to automate the workflow with ImageMagick and Tesseract, with an intermediate step that uses monochrome TIFFs. Extracting each page into a separate text file lets me ag/grep for a phrase and then easily find it again in the original PDF.
Having greppable libraries of books on various domains, instead of crawling through web search each time, is very useful and saves a lot of time.
[1] https://tesseract-ocr.github.io/
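For anyone who wants to try the per-page idea without the shell utility, here is a minimal Python sketch. It is not the utility mentioned above. It assumes each page has already been rendered to a TIFF, that the `tesseract` binary is on the PATH, and that the text files follow a hypothetical `page-NNNN.txt` naming scheme chosen only for this example.

```python
import re
import subprocess
from pathlib import Path


def ocr_page(tiff: Path, out_base: Path) -> Path:
    """Run `tesseract <image> <outputbase>`; Tesseract writes <outputbase>.txt."""
    subprocess.run(["tesseract", str(tiff), str(out_base)], check=True)
    return Path(str(out_base) + ".txt")


def search_pages(text_dir: Path, phrase: str) -> list[tuple[int, str]]:
    """Return (page number, matching line) pairs, ignoring case.

    Assumes one text file per page named page-NNNN.txt (hypothetical scheme).
    """
    hits = []
    for txt in sorted(text_dir.glob("page-*.txt")):
        # The trailing digits of the file name are the page number.
        page = int(re.search(r"(\d+)$", txt.stem).group(1))
        text = txt.read_text(encoding="utf-8", errors="replace")
        for line in text.splitlines():
            if phrase.lower() in line.lower():
                hits.append((page, line.strip()))
    return hits
```

The page number in each hit is what lets you jump straight to the right spot in the original PDF.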
It OCRs screenshots and stores the text in a search index, so you can query by keyword, date, boolean operators, the whole shebang.
It's all local. It is really useful for me - yesterday it saved me after Firefox wigged out and lost all my tabs. It's in a great place to try out, and I am actively developing it.
[0] https://apse.io
I've built an archival system based around Tesseract and PostgreSQL. It takes images/PDFs, either scanned or generated, and rebuilds them as searchable PDFs; the text is then extracted and inserted into Postgres' full-text search. I keep all of the original media because disk is cheap.
Originally I used Tesseract directly. But I found that ocrmypdf did a better job than my home-grown pipeline, so I switched.
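Roughly, a pipeline like that could look like the sketch below. This is not the actual system described above: the `documents` table, its columns, and the helper names are made up for illustration, the `%s` placeholders assume a psycopg-style DB-API cursor, and the generated `tsvector` column assumes PostgreSQL 12 or newer. The `--sidecar` option asks ocrmypdf to also write the recognized text to a plain text file.

```python
import subprocess

# Hypothetical schema: one row per document, with a generated tsvector column.
SCHEMA = """
CREATE TABLE IF NOT EXISTS documents (
    id    bigserial PRIMARY KEY,
    path  text NOT NULL,
    body  text NOT NULL,
    tsv   tsvector GENERATED ALWAYS AS (to_tsvector('english', body)) STORED
);
CREATE INDEX IF NOT EXISTS documents_tsv_idx ON documents USING gin (tsv);
"""

INSERT = "INSERT INTO documents (path, body) VALUES (%s, %s)"
SEARCH = "SELECT path FROM documents WHERE tsv @@ plainto_tsquery('english', %s)"


def ocrmypdf_cmd(src, dst, sidecar):
    """Build the command that writes a searchable PDF plus a text sidecar."""
    return ["ocrmypdf", "--sidecar", sidecar, src, dst]


def ingest(cur, src, dst, sidecar, run=subprocess.run):
    """OCR one file, then store the extracted text for full-text search."""
    run(ocrmypdf_cmd(src, dst, sidecar), check=True)
    with open(sidecar, encoding="utf-8") as fh:
        cur.execute(INSERT, (dst, fh.read()))


def search(cur, phrase):
    """Return the paths of documents whose text matches the phrase."""
    cur.execute(SEARCH, (phrase,))
    return [row[0] for row in cur.fetchall()]
```

Run `SCHEMA` once against the database, then call `ingest` per file with a cursor from your driver of choice.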
There's a reason why the external services are popular though...lots of training data and tweaks to make them much more accurate. Try the Google demo here, for example: https://cloud.google.com/vision/docs/ocr
Full disclosure: I'm the primary package developer. Shameless plug. :)
On Mac I use a modified version of this Keyboard Maestro script to OCR a user-selected area of the screen.
This script puts the OCR text on the clipboard. I'm sure Keyboard Maestro could automagically append it to a text file or something. I'm kinda a noob with Keyboard Maestro, so I don't know all of its functionality.
I have a couple of variations of this script, including one that uses the Mac's speak command to read the OCR text aloud, as I am a slow reader and an auditory learner.
My father had a bunch of newspaper clippings scanned into the family tree application and wanted the text. I used this method to get the text instead of typing it all out.
https://forum.keyboardmaestro.com/t/ocr-user-selected-area-m...
Is there an accuracy optimization to be found if I can pre-train the OCR engine to look for a limited set of words instead of the entire dictionary and printable-character space?
The use case I have is OCRing shipping labels for packages that arrive at an office. The set of plausible matches is incredibly small, as it is the set of names of the employees who work in said office.
Further optimizations include reducing the problem space by only considering computer-printed glyphs and not bothering with handwritten labels, and the insight that the distribution of packages follows a power law where a disproportionately small group of people receive the largest number of packages.
The end goal is to perform this entirely on device, with low latency and high accuracy.
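One way to exploit the small name set without retraining anything is to post-filter the OCR output. The sketch below is only that: it does not touch the OCR engine, it fuzzy-matches each recognized line against the roster with Python's standard `difflib`, and it uses past package counts as a tie-breaking prior. The function name, the 0.75 cutoff, and the rounding to two decimals are arbitrary choices for illustration.

```python
from difflib import SequenceMatcher


def best_recipient(ocr_lines, package_counts, cutoff=0.75):
    """Pick the roster name that best matches any OCR'd line.

    package_counts maps employee name -> packages received so far.
    Returns None when no name reaches the similarity cutoff.
    """
    total = sum(package_counts.values()) or 1
    best_name = None
    best_key = (0.0, 0.0)
    for name, count in package_counts.items():
        # Best similarity between this name and any single line of the label.
        sim = max(
            (SequenceMatcher(None, name.lower(), line.lower().strip()).ratio()
             for line in ocr_lines),
            default=0.0,
        )
        if sim < cutoff:
            continue
        # Similarity first; frequent recipients win near-ties (power-law prior).
        key = (round(sim, 2), count / total)
        if key > best_key:
            best_name, best_key = name, key
    return best_name
```

For example, `best_recipient(["SHIP TO:", "A1ice Johnsom"], {"Alice Johnson": 40, "Bob Smith": 10})` picks "Alice Johnson" despite the two misread characters. Since the roster is tiny, this is cheap enough to run on device.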
(no affiliation, just a user)
After a fair amount of searching I found ScanTailor: https://github.com/4lex4/scantailor-advanced#scan-tailor-adv... which seems to have the capability of dealing with warped page images. I haven't actually gone through the complete workflow with it yet, but it seems to be a very capable package for cleaning up scanned pages before OCR.
https://www.dizzybits.com/Photoplex
It does on-device text recognition on your photos, stores the text in a local SQLite database, and lets you run full-text searches over it.
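To give an idea of how local full-text search over recognized text can work, here is a minimal sketch using Python's built-in `sqlite3` module. It is not how Photoplex is implemented: it assumes the SQLite build includes the FTS5 extension, and the `photo_text` table and function names are invented for this example.

```python
import sqlite3


def open_index(path=":memory:"):
    """Open (or create) a database with an FTS5 table for recognized text."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS photo_text "
        "USING fts5(filename UNINDEXED, body)"
    )
    return db


def add_photo(db, filename, text):
    """Store the text recognized in one photo."""
    db.execute(
        "INSERT INTO photo_text (filename, body) VALUES (?, ?)",
        (filename, text),
    )


def search(db, query):
    """Return matching file names, best match first.

    The query uses FTS5 syntax, so AND, OR and NOT work as boolean operators.
    """
    rows = db.execute(
        "SELECT filename FROM photo_text WHERE photo_text MATCH ? ORDER BY rank",
        (query,),
    )
    return [row[0] for row in rows]
```

Usage is just `db = open_index("photos.db")`, `add_photo(db, name, text)` for each photo, then `search(db, "receipt NOT coffee")`.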
I wrote a guide on how to do it here:
https://typless.com/2020/05/21/tesseract-on-aws-lambda-ocr-a...
Hope it helps
If you still want to grab the text yourself, you can make a copy to Google Keep and use the "grab text" function.
Works for me. I take full screenshots of interesting stuff so the URL is still visible when I want to go back to the original.
Obviously I have a paid G Suite account at Google. That comes with a very good set of privacy-protecting rules. It doesn't matter how you roll your stack; eventually you are going to be dependent on a third party. Better to use one that offers full encryption and 2FA to lock up your data.
https://gsuite.google.com/learn-more/security/security-white...