Talk:422: Difference between revisions

Latest revision as of 01:30, 14 January 2024

The PDFs that were marked "bad OCR" were done with ABBYY FineReader and have major problems. FineReader overzealously "corrected" what it thought was skew. See page 54 of 070-0434-02.pdf for an example. In general, FineReader mangles the page images rather than just adding an invisible text layer. I don't know whether the mangling can be turned off. I stopped using it. Also see page 4 of 070-0895-00.pdf, where it again incorrectly "corrected" the skew, the figures on pages 17 and 18 of the same file, and page 213 for an even more bizarre example. Tools that seem to work correctly are Adobe's Acrobat and the open-source tools based on Tesseract. I've been using OCRmyPDF and it works fine most of the time. And if it encounters an error, it prints error messages rather than silently mangling your document, like ABBYY FineReader does. It isn't really practical to babysit OCR software to make sure it didn't mangle the page images. It needs to be reliable. Kurt (talk) 11:26, 10 January 2024 (PST)

Thanks, Kurt - now I understand :-) I'll re-OCR them and see what I get... Qfissler (talk) 06:53, 11 January 2024 (PST)

My major problem with the last version of Acrobat I used for OCR (admittedly some time ago) was that I couldn't persuade it to use a character encoding that would let me add μ, Ω etc., let alone properly recognize them. For documents where I think it's worth it, I still use FineReader 14, configured to expect these characters, in the OCR Editor mode, where you can change area types, see what it considers bad matches, and fix them. Yes, it's a lot of work, but for some manuals I find it worthwhile. The only thing I haven't figured out is how to stop FR14 from turning every "dc" into "de" while marking it a bad match. Apparently some dictionary gets in the way; "DC" doesn't get hit, only the old-style "dc". --Peter (talk) 23:42, 11 January 2024 (PST)
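
For anyone re-running OCR on the affected files, here is a minimal sketch of the OCRmyPDF workflow Kurt describes above, wrapped in Python so that a failure is reported loudly instead of passing unnoticed. The function names `build_ocr_command` and `run_ocr` are made up for this illustration, and it assumes the `ocrmypdf` command-line tool is installed and on the PATH. It deliberately does not pass `--deskew`, on the assumption that OCRmyPDF then leaves page geometry alone; that is worth confirming on a known-bad page such as page 54 of 070-0434-02.pdf.

```python
import subprocess
import sys
from pathlib import Path


def build_ocr_command(src: Path, dst: Path, language: str = "eng") -> list[str]:
    """Build an ocrmypdf command line that only adds a text layer.

    No --deskew flag is passed, so (as an assumption to verify) the page
    images should not be rotated or straightened.
    """
    return ["ocrmypdf", "-l", language, str(src), str(dst)]


def run_ocr(src: Path, dst: Path, language: str = "eng") -> bool:
    """Run ocrmypdf and report any failure on stderr. Returns True on success."""
    cmd = build_ocr_command(src, dst, language)
    try:
        result = subprocess.run(cmd, capture_output=True, text=True)
    except FileNotFoundError:
        # The ocrmypdf executable itself could not be found.
        print("ocrmypdf is not installed or not on PATH", file=sys.stderr)
        return False
    if result.returncode != 0:
        # Surface the tool's own error messages rather than hiding them.
        print(f"OCR failed for {src} (exit code {result.returncode}):", file=sys.stderr)
        print(result.stderr, file=sys.stderr)
        return False
    return True


if __name__ == "__main__" and len(sys.argv) == 3:
    ok = run_ocr(Path(sys.argv[1]), Path(sys.argv[2]))
    sys.exit(0 if ok else 1)
```

Saved as, say, `reocr.py` (a hypothetical filename), it could be run as `python reocr.py 070-0434-02.pdf 070-0434-02-ocr.pdf`. Writing to a new output file keeps the original scan untouched, so the two can still be compared page by page.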

