[Ohiodig] Transcription advice
Joe Corall
joe at libops.io
Thu Feb 27 16:58:21 EST 2025
Hey Ginnie,
At Lehigh University back in November/December we evaluated something like
three HTR solutions and seven different ollama LLM models, and found OpenAI
ChatGPT is the best model to transcribe handwritten text documents.
We made an Islandora microservice
<https://github.com/lehigh-university-libraries/scyllaridae/tree/main/examples/openai-htr>
and got our first documents in our Islandora repository successfully OCR'd
and added to our search index.
For example for this image:
https://preserve.lehigh.edu/sites/default/files/2024-01/328551.jpg
With this prompt:
https://github.com/lehigh-university-libraries/scyllaridae/blob/main/examples/openai-htr/Dockerfile#L20
ChatGPT returned this OCR:
Dear Sir
I have been expecting to hear from you every day on the subject of the
> Little Poems I am about to publish. No more time must it be delayed. This
> pamphlet will not interfere with any negotiations between us, as it is
> quite a separate thing and is printed for Chandos in purpose—Mr. Wesley
> who has endorsed the letter which is to be printed with it, called on me
> the other day, and he promised to see you as soon as he returned to Town. I
> believe he is there before now. You will therefore be so good as
We have plans to create some tooling around this to support more models,
provide a GUI, and eventually hope to be able to generate hOCR for
handwritten manuscripts.
Joe
On Thu, Feb 27, 2025 at 4:35 PM DRESSLER, Virginia via Ohiodig <
ohiodig at lists.library.ohio.gov> wrote:
> Hi OhioDIG-
>
> I'm working with a faculty member who is looking for advice on turning
> some printed data collection tables with handwritten content into a tabular
> format (without having to manually transfer or retype it).
>
> I tried a few tests using ABBY FineReader to convert an image and PDF
> version of one sample page, but the results were pretty awful.
>
> Anyone work on anything like this and/or have an idea? One of our
> librarians suggested trying ChatGPT, though I'm not sure if this data would
> be a good candidate or not, and another suggested Transkribus-
> https://www.transkribus.org
>
> Thanks in advance!
>
> Ginnie
>
> _______________________________________________
> Ohiodig mailing list
> Ohiodig at lists.library.ohio.gov
> https://lists.library.ohio.gov/mailman/listinfo/ohiodig
> To contact the list owner send an email to
> Ohiodig-owner at lists.library.ohio.gov
> To unsubscribe send an email to Ohiodig-unsubscribe at lists.library.ohio.gov
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.library.ohio.gov/pipermail/ohiodig/attachments/20250227/b1f37f41/attachment.htm>
More information about the Ohiodig
mailing list