Now that llama.cpp has vision support, I tried PDFs with it locally (via LM Studio), but the results weren't as good as I'd hoped. At one point it insisted it couldn't do "OCR", but offered an example of what the data _could_ look like - which was, in fact, the data.
The other major problem is that some PDFs are actually just made up of page images, and it got super confused by those as well.
Given how new this is, I'm struggling to find any tools that make it easier.
!pip install pytesseract pdf2image pillow
!apt install poppler-utils
# !apt install tesseract-ocr  # uncomment if Tesseract isn't already installed
from pdf2image import convert_from_path
import pytesseract
pages = convert_from_path('k.pdf', dpi=300)
all_text = ""
for page_num, img in enumerate(pages, start=1):
    # OCR each rendered page image with Tesseract
    text = pytesseract.image_to_string(img)
    all_text += f"\n--- Page {page_num} ---\n{text}"
print(all_text)
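One cheap way to cope with the image-only PDF problem is to check each page's text layer first, and only send pages with no usable text through OCR. A rough sketch, assuming the pypdf package is installed (`pip install pypdf`) and using a made-up minimum-character heuristic:

```python
def needs_ocr(page_text: str, min_chars: int = 20) -> bool:
    # Heuristic (assumption): a page whose extracted text layer has
    # almost no characters is probably a scanned image.
    return len(page_text.strip()) < min_chars

def pages_needing_ocr(pdf_path: str) -> list[int]:
    # Assumes pypdf is available; extract_text() returns "" or None
    # for pages with no text layer.
    from pypdf import PdfReader
    reader = PdfReader(pdf_path)
    return [
        num
        for num, page in enumerate(reader.pages, start=1)
        if needs_ocr(page.extract_text() or "")
    ]
```

The page numbers this returns could then be run through the pdf2image + pytesseract path above, while the remaining pages keep their native (and usually more accurate) text layer.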