[Ohiodig] Fwd: "Jellification" of Text

Fri Jul 15 11:56:28 EDT 2022

Dear All,

I haven't forgotten about the text "jellification" issue I brought up the
other day. I followed Chatham's suggestion and reached out to IDEALS.
IDEALS informed me that they simply host the content and play no role in
digitization. However, they directed me to the Digitization Services Unit
<http://www.library.illinois.edu/preservation/digitization-services>, which
provided me with an answer. I have reproduced the reply below:

> It might be the compression selected in whatever program was used to create
> the PDF/OCR. There is a compression setting in Abbyy FineReader and Adobe
> Acrobat. If the original images were saved at a high DPI and the PDF
> downsized the images significantly you might get this result. We've had
> this happen. This could also happen if you are using JP2000 files to create
> PDFs.
>

> Another idea - Abbyy doesn't always handle graphs or tables well. There are
> also issues with how it handles saving the text and OCR within the image in
> the PDF. The scanned image and OCR end up appearing blurry.
>

> Here are a few suggestions I have:
>
>    - You can customize the recognition mode in the "Analyze and
>    recognize image" steps in Abbyy. Go to PDF recognition mode. It defaults to
>    Auto but you want to choose only OCR from the PDF.
>    - Go to the Recognition option in Abbyy. Unselect detect header and
>    footers or other structural elements of the document.
>    - Go to the Preprocessing settings in Abbyy and deselect Correct Image
>    Resolution and Reduce ISO Noise.
>    - In the Save Results section of Abbyy select Best Quality next to the
>    Keep Pictures check box.
>
>
> Without working with the files used to generate the PDFs it's hard to say
> for certain but those are some thoughts I have. I would guess Abbyy was
> used to generate this PDF and OCR.
>
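
To put rough numbers on the downsizing point in the reply above, here is a
minimal Python sketch. The page size and DPI values are hypothetical, chosen
only for illustration; they are not taken from the files in question.

```python
def pixel_dimensions(width_in, height_in, dpi):
    """Return (width_px, height_px) for a page of the given size at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def downsample_ratio(source_dpi, target_dpi):
    """Return how many source pixels are merged into each output pixel."""
    return (source_dpi / target_dpi) ** 2


if __name__ == "__main__":
    # Hypothetical case: a US Letter page scanned at 600 DPI, saved at 150 DPI.
    print(pixel_dimensions(8.5, 11, 600))  # (5100, 6600)
    print(pixel_dimensions(8.5, 11, 150))  # (1275, 1650)
    print(downsample_ratio(600, 150))      # 16.0
```

In that hypothetical case, every output pixel stands in for sixteen scanned
pixels, before any lossy compression is applied on top.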

What was strange to me about this whole problem was that I couldn't figure
out how (if my assumption that it was related to OCR is correct) the OCR
would be affecting the actual image. I had presumed OCR is saved as an
"overlay" of sorts that doesn't affect the quality of the original image.
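
For what it's worth, one common way for a searchable PDF to carry OCR is as
text drawn in the PDF "invisible" text rendering mode (the `3 Tr` operator)
on the same page as the scanned image, which by itself leaves the image data
untouched. The sketch below builds such a page content stream by hand. The
function name, the resource names `/Im0` and `/F1`, and the sample values are
assumptions for illustration, and this is not necessarily what Abbyy writes.

```python
def searchable_page_stream(width_pt, height_pt, words):
    """Build a PDF page content stream: page image first, then invisible text.

    words is a list of (x_pt, y_pt, font_size_pt, text) tuples.
    /Im0 (the image) and /F1 (the font) are hypothetical resource names.
    """
    def escape(text):
        # Backslashes and parentheses must be escaped in PDF literal strings.
        return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")

    lines = [
        "q",                                    # save graphics state
        f"{width_pt} 0 0 {height_pt} 0 0 cm",   # scale the unit square to the page
        "/Im0 Do",                              # paint the scanned image
        "Q",                                    # restore graphics state
        "BT",                                   # begin text object
        "3 Tr",                                 # rendering mode 3: invisible text
    ]
    for x, y, size, text in words:
        lines.append(f"/F1 {size} Tf")          # select font and size
        lines.append(f"1 0 0 1 {x} {y} Tm")     # position the word on the page
        lines.append(f"({escape(text)}) Tj")    # the OCR text itself
    lines.append("ET")                          # end text object
    return "\n".join(lines)


if __name__ == "__main__":
    print(searchable_page_stream(612, 792, [(72, 700, 12, "Warbird (1944)")]))
```

If the page image still looks degraded in a file built this way, the likelier
cause is that the image was re-encoded when the PDF was saved, which would be
consistent with the compression settings mentioned in the reply above.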

On a related note, does anyone know whether PDF saves images to another,
internal file format? Given that it can handle both vector and raster
graphics as well as text, I presume PDF isn't *technically* an image file
format itself, but just acts as a container to hold all the disparate
formats together, with some editing features on top. The reason I ask is
that when you take a large PDF file and open it in both Adobe Acrobat and
an image editor like GIMP <http://www.gimp.org>, the latter shows a
grayscale image, while the former shows only a black-and-white one. (Never
mind this point; it is apparently a result of the resolution, as opening the
file in the latter at a higher resolution (200 vs. 100 pixels per inch)
produces a result like the former.) However, I could be completely wrong.
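
On the container question: in the PDF format, each raster image is stored as
a stream object whose dictionary names the encoding in a /Filter entry, for
example /DCTDecode for JPEG or /JPXDecode for JPEG 2000. Below is a rough
Python sketch, standard library only, that scans a PDF's raw bytes and tallies
those filters. It is a heuristic rather than a real PDF parser: the function
name is my own, and it may miscount on encrypted or unusual files.

```python
import re
import sys
from collections import Counter

# Filter names defined by the PDF specification and the encodings they denote.
FILTER_MEANINGS = {
    "DCTDecode": "JPEG",
    "JPXDecode": "JPEG 2000",
    "CCITTFaxDecode": "CCITT Group 3/4 fax (bitonal)",
    "JBIG2Decode": "JBIG2 (bitonal)",
    "FlateDecode": "zlib/deflate (lossless)",
    "LZWDecode": "LZW (lossless)",
    "RunLengthDecode": "run-length (lossless)",
}

# A stream's dictionary sits between "obj" and the "stream" keyword. The
# (?!endobj) guard keeps one match from spanning more than one object.
STREAM_DICT = re.compile(rb"obj\s*(<<(?:(?!endobj).)*?>>)\s*stream", re.DOTALL)


def image_filters(pdf_bytes):
    """Count the filter names used by image streams in raw PDF bytes."""
    counts = Counter()
    for match in STREAM_DICT.finditer(pdf_bytes):
        dictionary = match.group(1)
        if not re.search(rb"/Subtype\s*/Image", dictionary):
            continue  # not an image (e.g. a page content stream or a font)
        names = re.findall(rb"/(\w+Decode)\b", dictionary)
        if not names:
            counts["(none)"] += 1  # image stored without any filter
        for name in names:
            counts[name.decode("ascii")] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], "rb") as handle:
        for name, count in image_filters(handle.read()).most_common():
            print(f"{count:5d}  {name}  ({FILTER_MEANINGS.get(name, 'other')})")
```

Saved under a name of your choosing and run with a PDF path as its only
argument, it prints one line per filter found. A scan that reports mostly
bitonal filters would fit the black-and-white appearance described above.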

Sincerely,
Noah Stegman Rechtin
*Tri-State Warbird Museum <http://tri-statewarbirdmuseum.org/>*
*Collections Manager & Museum Attendant*