https://developer.apple.com/documentation/vision/recognizing...
Does it have to be an "AI" model in the modern sense of the term (LLMs, etc.)?
In the past, I found Google's Cloud Vision API to be pretty good for this sort of thing (text in images): https://cloud.google.com/vision?hl=en#demo
AFAIK Tesseract was never state of the art; it was just free and cheap. The commercial offerings were, in my limited experience, usually much more accurate.