adf2pdf 0.8.3
adf2pdf - a tool that turns a batch of paper pages into a PDF
with a text layer. By default, it detects empty pages (as they
may easily occur during duplex scanning) and excludes them from
the OCR and the resulting PDF.
For that, it uses Sane's scanimage for the scanning,
Tesseract for the optical character recognition (OCR), and
the Python packages img2pdf, Pillow (PIL) and
PyPDF2 for some image-processing tasks and PDF mangling.
Example:
$ adf2pdf contract-xyz.pdf
2017, Georg Sauthoff
Features
Automatic document feed (ADF) support
Fast empty page detection
Overlapping of scanning, image processing, OCR and PDF creation
to minimize the total runtime
Fast creation of small PDFs using the fine img2pdf package
Uses only safe compression methods, i.e. no error-prone
symbol-segmentation compression such as JBIG2 or JB2,
which is used in Xerox photocopiers and the DjVu format.
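The idea of running scanning, image processing and OCR at the same time can be sketched with one thread per stage and queues in between. This is only an illustration of the general technique and not adf2pdf's actual code; the names run_pipeline, stage, clean and ocr are made up for this sketch, and the two stage functions are trivial stand-ins for the real work.

```python
import queue
import threading

SENTINEL = None  # marks the end of the page stream (pages must not be None)


def stage(func, inbox, outbox):
    """Apply func to every item from inbox and pass the result to outbox."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)  # propagate shutdown to the next stage
            break
        outbox.put(func(item))


def run_pipeline(pages, stages):
    """Run all stages concurrently, one thread per stage.

    Results come back in input order because each stage is a single
    thread reading from a FIFO queue.
    """
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=stage, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for page in pages:  # stands in for the scanner delivering pages
        queues[0].put(page)
    queues[0].put(SENTINEL)
    results = []
    while True:
        item = queues[-1].get()
        if item is SENTINEL:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


# Hypothetical stand-ins for image clean-up and OCR.
def clean(page):
    return page.strip()


def ocr(page):
    return page.upper()


if __name__ == "__main__":
    print(run_pipeline([" page 1 ", " page 2 "], [clean, ocr]))
```

With such a layout, the second page can already be cleaned while the first one is in the OCR stage. Threads are sufficient when the heavy lifting happens in external programs, since waiting for a child process does not block the other threads.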
Install Instructions
Adf2pdf can be directly installed with pip, e.g.
$ pip3 install --user adf2pdf
or
$ pip3 install adf2pdf
See also the PyPI adf2pdf project page.
Alternatively, the Python file adf2pdf.py can be directly
executed in a cloned repository, e.g.:
$ ./adf2pdf.py report.pdf
In addition to that, one can install the development version from
a cloned work-tree like this:
$ pip3 install --user .
Hardware Requirements
A scanner with automatic document feed (ADF) that is supported by
Sane. For example, the Fujitsu ScanSnap S1500 works
well. That model supports duplex scanning, which is quite
convenient.
Example continued
Running adf2pdf on a 7-sheet example document takes 150 seconds
on an i7-6600U (Intel Skylake, 2 cores, 4 threads) CPU (using the
ADF of the Fujitsu ScanSnap S1500). With the defaults, adf2pdf calls
scanimage for duplex scanning into 600 dpi lineart (black and
white) images, which yields 14 page sides here. In this example,
6 of those sides are empty and thus automatically excluded, i.e.
the resulting PDF contains just 8 pages.
The resulting PDF contains a text layer from the OCR such that
one can search and copy'n'paste text. It is 1.1 MiB in size,
i.e. a page takes up 132 KiB, on average.
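How empty sides might be filtered out can be sketched in a few lines of plain Python. This is not the detection method adf2pdf implements; it is a minimal stand-in that counts dark pixels in a bilevel image, and the function names as well as the threshold value are assumptions made for the illustration. The demo reads the example above as 14 scanned sides, of which 6 carry no content.

```python
def dark_fraction(rows):
    """Return the share of dark pixels in a bilevel page.

    rows is a list of pixel rows, with 1 for a dark pixel and 0 for white.
    """
    total = sum(len(row) for row in rows)
    if total == 0:
        return 0.0
    return sum(sum(row) for row in rows) / total


def is_empty(rows, threshold=0.001):
    """Treat a page as empty if at most `threshold` of its pixels are dark.

    The threshold is an illustrative guess, not a value taken from adf2pdf.
    It tolerates a little dust or scanner noise on a blank side.
    """
    return dark_fraction(rows) <= threshold


def keep_non_empty(pages):
    """Return only the pages that are not classified as empty."""
    return [page for page in pages if not is_empty(page)]


if __name__ == "__main__":
    blank = [[0] * 100 for _ in range(100)]
    # every fourth pixel dark: a crude stand-in for a printed side
    text = [[1 if x % 4 == 0 else 0 for x in range(100)] for _ in range(100)]
    # 14 scanned sides, 6 of them blank
    sides = [text, blank] * 6 + [text, text]
    print(len(sides), "sides scanned,", len(keep_non_empty(sides)), "kept")
```

A real implementation would work on the scanned image files, e.g. via Pillow, and would likely need to be more robust against borders and punch holes.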
Software Requirements
The script assumes Tesseract version 4 by default. Version 3 can
be used as well, but the new neural network system in Tesseract
4 performs far better than the old OCR model.
Tesseract 4.0.0 was released in late 2018; thus, distributions
released in that time frame may still include only version 3 in
their repositories (e.g. Fedora 29, while Fedora 30 features version
4). Since version 4 is so much better at OCR, I can't recommend it
enough over version 3.
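To find out which major version a distribution ships, one can parse the output of tesseract --version. The following is a small sketch and not part of adf2pdf; the function names are made up for the example, and the regular expression assumes that the output starts with a line such as "tesseract 4.1.1".

```python
import re
import subprocess


def parse_major_version(version_output):
    """Extract the major version from text such as 'tesseract 4.1.1'."""
    match = re.search(r"tesseract\s+v?(\d+)\.", version_output)
    return int(match.group(1)) if match else None


def installed_tesseract_major():
    """Return the major version of the tesseract found on PATH, or None."""
    try:
        proc = subprocess.run(
            ["tesseract", "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # some releases print the version to stderr
            universal_newlines=True,
            check=False,
        )
    except FileNotFoundError:
        return None  # tesseract is not installed or not on PATH
    return parse_major_version(proc.stdout)


if __name__ == "__main__":
    major = installed_tesseract_major()
    if major is None:
        print("tesseract not found")
    elif major < 4:
        print("found Tesseract", major, "- consider upgrading to version 4")
    else:
        print("found Tesseract", major)
```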
Tesseract 4 notes (in case you need to build it from the sources):
Build instructions - warning: if you miss the
autoconf-archive dependency you'll get weird autoconf error
messages
Data files - you need the training data for your
languages of choice and the OSD data
Python packages:
img2pdf (Fedora package: python3-img2pdf)
Pillow (PIL) (Fedora package: python3-pillow-devel)
PyPDF2 (Fedora package: python3-PyPDF2)
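Whether the three Python packages are available to the interpreter can be checked before the first scan. This sketch is not part of adf2pdf and the helper name missing_modules is made up; it relies only on the packages' import names (note that Pillow is imported as PIL).

```python
import importlib.util

# Import names of the packages listed above (Pillow is imported as PIL).
REQUIRED_MODULES = ["img2pdf", "PIL", "PyPDF2"]


def missing_modules(names=REQUIRED_MODULES):
    """Return those top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("missing Python modules:", ", ".join(missing))
    else:
        print("all required Python modules are available")
```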