OCR framework for extracting formatted text?

Question

I'm a serial information hoarder, and often use screenshots in order to store comments, passages and fragments of conversations I find useful or insightful. This works well if I want to reference something recent, but obviously doesn't scale well. I'd like to integrate these into my personal archive, but don't know any frameworks (preferably for Go, Node, or Python) which could automatically extract the text from the images while retaining its formatting. I'm not against doing some image preprocessing myself, but I don't feel comfortable passing the images to a 3rd party service, since a portion of the images contain private or sensitive information that I can't readily sort out of my collection.

undebuggable · Accepted Answer

To extract text from photos and non-OCRed PDFs, Tesseract[1] with a language-specific model[2] never fails me.
I use my shell utility[3] to automate the workflow with ImageMagick and Tesseract, with an intermediate step that uses monochrome TIFFs. Extracting each page into a separate text file lets me ag/grep for a phrase and then easily find it again in the original PDF.
Having greppable libraries of books on various domains, and not having to crawl through web search each time, is very useful and time-saving.
[1] https://tesseract-ocr.github.io/
[2] https://github.com/tesseract-ocr/tessdata
[3] https://github.com/undebuggable/pdf2txt
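
A rough Python sketch of that kind of workflow, assuming ImageMagick's `convert` and the `tesseract` CLI are installed and on the PATH. This is not the pdf2txt utility itself: the function names, the 300 DPI density, and the `page-NNNN` file naming are illustrative assumptions.

```python
import subprocess
from pathlib import Path


def page_commands(pdf, page, workdir, lang="eng"):
    """Build the two commands for one page: PDF page -> monochrome TIFF -> text.

    ImageMagick page indices are zero-based (book.pdf[0] is the first page).
    """
    tiff = Path(workdir) / f"page-{page:04d}.tif"
    out_base = Path(workdir) / f"page-{page:04d}"  # tesseract appends ".txt"
    to_tiff = ["convert", "-density", "300", f"{pdf}[{page}]",
               "-monochrome", str(tiff)]
    to_text = ["tesseract", str(tiff), str(out_base), "-l", lang]
    return to_tiff, to_text


def ocr_page(pdf, page, workdir, lang="eng"):
    """Run both steps; requires ImageMagick and Tesseract to be installed."""
    for cmd in page_commands(pdf, page, workdir, lang):
        subprocess.run(cmd, check=True)


def find_phrase(workdir, phrase):
    """Return the per-page text files containing the phrase (case-insensitive)."""
    needle = phrase.lower()
    return sorted(
        p.name for p in Path(workdir).glob("page-*.txt")
        if needle in p.read_text(encoding="utf-8", errors="replace").lower()
    )
```

Because each page gets its own text file, a hit in `page-0041.txt` points straight at the corresponding page of the original PDF.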

ryanfox · Answer

I built an application for exactly this. It's called A Personal Search Engine, APSE for short.[0]
It OCRs screenshots and stores the text in a search index, so you can query by keyword, date, boolean operators, the whole shebang.
It's all local. It is really useful for me - yesterday it saved me after Firefox wigged out and lost all my tabs. It's in a great place to try out, and I am actively developing it.
[0] https://apse.io

asguy · Answer

ocrmypdf and friends.
I've built an archival system based around Tesseract and PostgreSQL. It takes images/PDFs, either scanned or generated, and rebuilds them as searchable PDFs before the text is extracted and inserted into Postgres' full-text search. I keep all of the original media because disk is cheap.
Originally I used Tesseract directly. But I found that ocrmypdf did a better job than my home-grown pipeline, so I switched.
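
A minimal sketch of how such a pipeline could be wired together in Python. It is not the system described above: the `documents` table and the function names are hypothetical, and using ocrmypdf's `--sidecar` option to obtain the plain text is one possible choice among several. `cursor` is assumed to be a DB-API cursor from a Postgres driver that uses `%s` placeholders.

```python
import subprocess

# Hypothetical table: documents(path text primary key, body text)
INSERT_SQL = "INSERT INTO documents (path, body) VALUES (%s, %s)"
SEARCH_SQL = (
    "SELECT path FROM documents "
    "WHERE to_tsvector('english', body) @@ plainto_tsquery('english', %s)"
)


def ocrmypdf_command(src_pdf, out_pdf, sidecar_txt, lang="eng"):
    """ocrmypdf writes a searchable PDF; --sidecar also saves the plain text."""
    return ["ocrmypdf", "-l", lang, "--sidecar", sidecar_txt, src_pdf, out_pdf]


def archive(cursor, src_pdf, out_pdf, sidecar_txt):
    """OCR one PDF and store its text for full-text search."""
    subprocess.run(ocrmypdf_command(src_pdf, out_pdf, sidecar_txt), check=True)
    with open(sidecar_txt, encoding="utf-8") as fh:
        cursor.execute(INSERT_SQL, (src_pdf, fh.read()))


def search(cursor, query):
    """Return the paths of documents whose text matches the query."""
    cursor.execute(SEARCH_SQL, (query,))
    return [row[0] for row in cursor.fetchall()]
```

For a large archive, a stored `tsvector` column with a GIN index would likely be faster than computing `to_tsvector` on every query.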

tyingq · Answer

Here's a blog post showing self-hosted PyTesseract finding text in an image and preserving the format: https://stackabuse.com/pytesseract-simple-python-optical-cha...
There's a reason why the external services are popular though: lots of training data and tweaks make them much more accurate. Try the Google demo here, for example: https://cloud.google.com/vision/docs/ocr
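
As a hedged illustration of what "preserving the format" can mean with a local Tesseract, the sketch below rebuilds indented lines and paragraph breaks from the word-level output of `pytesseract.image_to_data`. It is not taken from the linked post. `rebuild_layout` and `ocr_with_layout` are made-up names, and `char_width` (an assumed average glyph width in pixels) would need tuning per image.

```python
from collections import defaultdict


def rebuild_layout(data, char_width=10):
    """Rebuild indented lines from a pytesseract image_to_data dict.

    The left offset of each line's first word is converted to leading
    spaces using char_width, an assumed average glyph width in pixels.
    """
    lines = defaultdict(list)
    for i, word in enumerate(data["text"]):
        if not word.strip():
            continue  # structural rows (page/block/line) carry no text
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        lines[key].append((data["left"][i], word))

    out, prev_par = [], None
    for key in sorted(lines):
        words = sorted(lines[key])
        if prev_par is not None and key[:2] != prev_par:
            out.append("")  # blank line between paragraphs
        prev_par = key[:2]
        indent = " " * (words[0][0] // char_width)
        out.append(indent + " ".join(w for _, w in words))
    return "\n".join(out)


def ocr_with_layout(image_path):
    """Needs pytesseract, Pillow and a local Tesseract install."""
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(image_path), output_type=pytesseract.Output.DICT
    )
    return rebuild_layout(data)
```

Everything runs locally, which matters for the privacy constraint in the question. Richer formatting such as bold text or font sizes is not recovered by this approach.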

flicken · Answer

Although https://www.willus.com/k2pdfopt/ is meant for reformatting PDFs to view on e-readers, it does do a reasonable job of extracting text via OCR and storing as a PDF layer. The underlying engine can be either https://github.com/tesseract-ocr/tesseract or http://jocr.sourceforge.net/

faustomorales · Answer

If you are (a) willing to take the word bounding boxes and convert them to paragraphs yourself, and (b) okay with a deep learning approach, you may want to give keras-ocr [0] a try.
Full disclosure: I'm the primary package developer. Shameless plug. :)
[0] https://github.com/faustomorales/keras-ocr
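
For point (a), a minimal sketch of grouping word boxes into lines, assuming predictions arrive as `(word, box)` pairs where `box` is four `(x, y)` corner points, which is my understanding of what keras-ocr's pipeline returns per image. `boxes_to_lines` and the `tolerance` value are illustrative and not part of keras-ocr.

```python
def boxes_to_lines(predictions, tolerance=0.6):
    """Group (word, box) pairs into text lines, top to bottom.

    A word joins the current line when its vertical centre is within
    tolerance * word height of the centre of that line's first word.
    """
    words = []
    for text, box in predictions:
        xs = [float(p[0]) for p in box]
        ys = [float(p[1]) for p in box]
        words.append((sum(ys) / len(ys), min(xs), max(ys) - min(ys), text))
    words.sort()  # sort by vertical centre

    lines = []  # each entry: [centre_y, [(x, text), ...]]
    for cy, x, height, text in words:
        if lines and abs(cy - lines[-1][0]) <= tolerance * height:
            lines[-1][1].append((x, text))
        else:
            lines.append([cy, [(x, text)]])
    # within a line, order words left to right
    return [" ".join(t for _, t in sorted(row)) for _, row in lines]
```

Paragraphs could then be formed by splitting wherever the vertical gap between consecutive lines is noticeably larger than the typical line spacing. Skewed or multi-column images would need more care than this.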

nacho_man · Answer

This doesn't meet most of your requirements (Go, Node, Python, and it's a manual process...), but maybe it would be helpful?
On Mac I use a modified version of this Keyboard Maestro script to OCR a user-selected area of the screen.
This script puts the OCR text on the clipboard. I'm sure Keyboard Maestro could automagically append it to a text file or something. I'm kind of a noob with Keyboard Maestro, so I don't know all of its functionality.
I have a couple of variations of this script, including one that uses the Mac's "speak this" command to read the OCR text aloud, as I am a slow reader and an auditory learner.
My father had a bunch of newspaper clippings scanned into the family tree application and wanted the text. I used this method to get the text instead of typing it all out.
https://forum.keyboardmaestro.com/t/ocr-user-selected-area-m...

kamalfariz · Answer

OCR techniques are general purpose in trying to map any conceivable text-looking shapes into actual text. Accuracy can vary wildly but the good ones will match against plausible words to eliminate low quality guesses.
Is there an accuracy optimization to be found if I can pre-train the OCR engine to look for a limited set of words instead of the entire dictionary- and printable character space?
The use case I have is OCRing shipping labels for packages that arrive at an office. The set of plausible matches is incredibly small as it is the set of employee names that work in said office.
Further optimizations include reducing the problem space by only considering computer-printed glyphs and not bothering with handwritten labels, and the insight that the distribution of packages follows a power law, where a disproportionately small group of people receives the largest number of packages.
The end goal is to perform this entirely on device, with low latency and high accuracy.
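
One cheap way to exploit such a small set of plausible matches, without retraining the engine, is to snap the raw OCR output to the closest known name after recognition. Note that this is fuzzy matching as a post-processing step (a different technique from pre-training), sketched with Python's standard `difflib`. The function name, the sample names, and the 0.6 cutoff are illustrative assumptions.

```python
import difflib


def match_name(ocr_text, names, cutoff=0.6):
    """Snap a noisy OCR string to the closest known name, or return None."""
    lookup = {name.lower(): name for name in names}
    hits = difflib.get_close_matches(
        ocr_text.lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[hits[0]] if hits else None


employees = ["John Smith", "Jane Doe", "Priya Patel"]
print(match_name("J0hn Srnith", employees))  # typical OCR confusions: 0/o, rn/m
```

The power-law observation could be folded in by preferring frequent recipients whenever two candidates score similarly. That part is not shown here.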

kranner · Answer

Try https://screenotate.com/
(no affiliation, just a user)

inetsee · Answer

One problem that I have with OCR is dealing with images of pages that are warped. I have some books that I would like to turn into electronic books, but not enough to justify setting up a book scanning rig (framework, two cameras, platen, etc). Setting up a document camera is fairly easy, but using it to take pictures of a book lying flat on the base produces images where the pages are warped, and most OCR software seems to have problems with warped pages.
After a fair amount of searching I found ScanTailor: https://github.com/4lex4/scantailor-advanced#scan-tailor-adv... which seems to have the capability of dealing with warped page images. I haven't actually gone through the complete workflow with it yet, but it seems to be a very capable OCR package.

coderguy123 · Answer

I hate to post my own app but it does do part of what you ask and it does it locally. Nothing is sent to any server.
https://www.dizzybits.com/Photoplex
It does on-device text recognition on your photos, stores on local SQLite database and lets you full text search.

Jugurtha · Answer

Site: https://openpaper.work/
Repo: https://gitlab.gnome.org/World/OpenPaperwork/paperwork

Brainsnail · Answer

https://github.com/axa-group/Parsr

FloatArtifact · Answer

I'm interested in drawing bounding boxes around text that can be displayed to the end user. In this case I don't care about OCR accuracy, only the ability to detect text accurately and across different mediums of type. Any thoughts on a framework for this that's low latency, under 150 ms or so?

jangia · Answer

You may set up your OCR service on AWS Lambda.
I wrote a guide on how to do it here:
https://typless.com/2020/05/21/tesseract-on-aws-lambda-ocr-a...
Hope it helps

cl0rkster · Answer

Just search for "tesseract GUI". If you are more technical, you can write code around Tesseract. For what you get for free, it's really impressive what Google has done with this in just a few years, turning it into something the average person can realistically consider using.
Example: https://github.com/tesseract4java/tesseract4java

misiti3780 · Answer

I know you said you didn't want to upload stuff to third parties, but Amazon Textract works great and supports HIPAA data.

crocodiletears · Answer

Plenty of fantastic suggestions in the comments, any one of which looks like it could do the trick. Not having any experience in the problem domain, I'm afraid I don't have much to contribute in response, but I look forward to evaluating each framework/service.

lowdose · Answer

Why not upload it to Google Photos? It will do the OCR and make the text in your photos / screenshots searchable, with a sweet UI in the browser.
If you still want to grab the text yourself, you can make a copy to Google Keep and use the "grab text" function.
Works for me. I take full screenshots of interesting stuff so the URL is still visible when I want to go back to the original.
Obviously I have a paid G Suite account at Google, which comes with a very good set of privacy-protecting rules. No matter how you roll your stack, eventually you are going to be dependent on a third party. Better to use one that offers full encryption and 2FA to lock up your data.
https://gsuite.google.com/learn-more/security/security-white...
