Scan multi-page documents directly to pdf quickly.

PowerFactor · Last edited by PowerFactor on Sun Feb 11, 2007 4:42 pm; edited 1 time in total

I don't know if anyone else has been as frustrated as I have by the lack of easy-to-use software focused on document scanning on Linux. I'm not talking about OCR, just scanning documents into a portable multi-page image format. In fact, the only software I've found that does exactly what I wanted was Adobe Acrobat on Windows. But with that I had to go through the Windows TWAIN driver interface for my scanner, which seems designed to make the process as slow and clumsy as possible. Still, that was the method I used for the last couple of years on occasions when it was useful.
Finally, after I bought a new printer/scanner a couple of months ago (Epson CX3200, it's nice), I decided it was time to figure out how to do the job on Linux. By then I knew all the command-line tools to do what I wanted were available; it was "just" a matter of writing a little script to tie it all together. Being the amateur I am, it took me a couple of days to figure it all out, but I got it working. This was back in January.
The other day I was playing around with controlling it with kdialog, and I thought maybe someone else would find it useful (the non-KDE-dependent version, that is), so I figured why not post it. I did try to make it a little more user friendly. It's still an ugly hack, but it works for me. Not only that, but once it's set up it's better at its specific purpose than anything else I've tried.

The script depends on the following packages.

sane-frontends
imagemagick
netpbm
ghostscript

If your scanner has a decent 1-bit (Lineart) mode (or if you can actually get convert's threshold function to work for you), then you can modify the script slightly and get rid of the netpbm dependency.

You need to know how to use the scanimage program with your scanner, as you will need to modify the SCANDEVICE and SCANCMD variables to fit. The rest of the configuration is pretty self-explanatory, I think.

To use it, just put your first page in the scanner, then run the script with the name of the file to save as the argument. It will immediately scan the first page, then prompt you for more. The rest is gravy.

It's not very robust. If your scanner has a warm-up period, make sure it has finished before you start; otherwise scanimage may time out and the script gets a little confused. It's also not designed to work with scanners that have an ADF.
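
For readers who want the general shape of that workflow, here is a minimal Python sketch. It is not the original script: only the SCANDEVICE name comes from the post, while the device string, the SCAN_OPTIONS values, and the function names are placeholders chosen for illustration. The --mode and --resolution options are backend-specific, so check what scanimage reports for your own scanner.

```python
import subprocess

# Placeholder settings (assumptions, not taken from the original script).
SCANDEVICE = "epson:libusb:001:004"                        # hypothetical device name
SCAN_OPTIONS = ["--mode", "Gray", "--resolution", "300"]   # backend-specific options

def scan_command(device, options):
    """Build the scanimage argument list for one page (image data goes to stdout)."""
    return ["scanimage", "-d", device] + list(options)

def scan_pages(scan_one, ask_more):
    """Scan the first page immediately, then keep scanning while ask_more() is true."""
    pages = []
    page_no = 1
    while True:
        pages.append(scan_one(page_no))
        if not ask_more():
            return pages
        page_no += 1

def scan_to_file(page_no):
    """Run scanimage once and save its output as page-NNNN.pnm."""
    name = "page-%04d.pnm" % page_no
    with open(name, "wb") as out:
        subprocess.run(scan_command(SCANDEVICE, SCAN_OPTIONS), stdout=out, check=True)
    return name

def ask_more():
    """Prompt on the terminal; anything other than 'y' ends the scan loop."""
    return input("Scan another page? [y/N] ").strip().lower() == "y"

# Usage with real hardware would be: scan_pages(scan_to_file, ask_more)
```

The page files returned by scan_pages would then be handed to convert or ghostscript to build the PDF.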

Anyway, hope someone can use it. Even if just for inspiration. :lol:

EDIT: Later versions are posted further down the thread. Chrwei posted one that should work with an ADF (I don't have the hardware to try it), and I've posted the Python version I've been using for a while. This version is left here mainly for reference.

FatherBusa · Apprentice Joined: 21 Mar 2004 Posts: 166

Dude, you're a genius. This is just what I was looking for. Thanks!

chrwei · n00b Joined: 16 Feb 2005 Posts: 2

very nice, here's my enhancements :)

summary:
- Added command line options with defaults
- Added ADF support with a command line toggle to use the flatbed. Can be set to use the flatbed by default, with a command line toggle to use the ADF.
- Changed to use scanimage's batch mode and prompt so that timeouts shouldn't be an issue. The ADF doesn't use the prompt.
- Made the scanner device name optional, as scanimage will normally detect your scanner automatically.
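
As a rough illustration of the batch-mode idea, here is a small Python sketch that only builds the scanimage command line. The --batch and --batch-prompt options are standard scanimage options; the function name, the file pattern, and the "ADF" source value are assumptions (the --source values differ between backends).

```python
def batch_scan_command(pattern="page%04d.pnm", device=None, use_adf=False,
                       adf_source="ADF"):
    """Build a scanimage batch command.

    Flatbed: --batch-prompt makes scanimage wait for a keypress before each page.
    ADF: no prompt; scanimage keeps pulling sheets until the feeder is empty.
    """
    cmd = ["scanimage"]
    if device:
        # Optional: without -d, scanimage uses the first device it finds.
        cmd += ["-d", device]
    cmd.append("--batch=" + pattern)
    if use_adf:
        cmd += ["--source", adf_source]   # assumed value, check your backend
    else:
        cmd.append("--batch-prompt")
    return cmd
```

For example, batch_scan_command() gives a flatbed command with prompting, and batch_scan_command(use_adf=True) gives the unattended ADF variant.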

scanners tried:
- HP Officejet 6110

TODO:
- add more paper size options
- NetPBM says pgmtopbm is deprecated as of 7/2004 and to use pamditherbw instead. I plan on only doing color or full greyscale documents so I'm not touching this.

bugs:
- "mode" seems to be scanner specific: some want "Grey", others want "Greyscale". Needs testing.
- There might be an issue with providing -x and -y when using the ADF; I need to test more.

things-i-wish-worked-better:
- too many temp files!

and the code:

r.abbott · Posted: Sun Feb 20, 2005 2:34 am

This thing is great! Thanks.

gcediel · n00b Joined: 27 Jul 2004 Posts: 21 Location: Madrid, Spain

One (maybe silly) question: How can I make scanimage stop scanning more pages? I have tried several keys, but I can't stop it.
_________________
Best regards.

Guillermo

r.abbott · Posted: Fri Apr 22, 2005 8:17 pm

Use <Ctrl-D>

chrwei · n00b Joined: 16 Feb 2005 Posts: 2

I haven't used it in a while, but I think it tells you that on screen, at least it did on mine. You should run it in a terminal and not just from a "run" dialog.

gcediel · n00b Joined: 27 Jul 2004 Posts: 21 Location: Madrid, Spain

Well, CTRL+D doesn't work for me.

BTW: very nice stuff!

djmaze · Posted: Sun Apr 24, 2005 11:09 am

CTRL+C works for me. (Try it two times, if it doesn't work.)

gcediel · n00b Joined: 27 Jul 2004 Posts: 21 Location: Madrid, Spain

Thanks, it works, although it's not a clean way to do it.

zatalian · Posted: Mon Sep 04, 2006 2:39 pm

this script used to work for me but now convert gives me trouble...

convert -page letter converts the original image to a blank PostScript file. Converting without the -page option works, but then the PDF document is not in the correct format. Is this happening to anybody else? Any solutions?

bludger · Guru Joined: 09 Apr 2003 Posts: 389

My HP 3500c doesn't have the mode function at all. This means that it can only output colour images. How would you convert something like this to black and white?

Also I had a number of tiff files and managed to convert them into a multi page pdf with:
convert <tif1> <tif2> <tif3> file.pdf

Why not just convert like this, leaving out the intermediate ps stage?

bludger · Guru Joined: 09 Apr 2003 Posts: 389

I solved my problem with the following:

scanimage -d <device> --resolution 150|ppmtopgm|pamthreshold -simple >tempscanfile1.pbm
convert -compress fax tempscanfile*.pbm outfile.pdf

This produced a 33kB file with resolution 150 and a 60kB file with resolution 300. The 150 res version was readable, but a bit ugly and the 300 version was excellent.
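
The same two steps can be driven from Python. The sketch below is only an illustration of the pipeline above: the commands and flags are the ones from this post, while the function names are made up and nothing here has been tested against an hp3500.

```python
import subprocess

def scan_pipeline(device, resolution):
    """The three stages of: scanimage | ppmtopgm | pamthreshold -simple"""
    return [
        ["scanimage", "-d", device, "--resolution", str(resolution)],
        ["ppmtopgm"],
        ["pamthreshold", "-simple"],
    ]

def run_pipeline(stages, outfile):
    """Chain the stages with pipes; the last stage writes to outfile."""
    procs = []
    with open(outfile, "wb") as out:
        upstream = None
        for i, cmd in enumerate(stages):
            is_last = i == len(stages) - 1
            proc = subprocess.Popen(
                cmd, stdin=upstream,
                stdout=out if is_last else subprocess.PIPE)
            if upstream is not None:
                upstream.close()   # close the parent's copy; the child keeps its own
            upstream = proc.stdout
            procs.append(proc)
        # Exit codes of every stage, in order.
        return [p.wait() for p in procs]

def fax_pdf_command(pages, outfile):
    """convert -compress fax page1.pbm page2.pbm ... outfile.pdf"""
    return ["convert", "-compress", "fax"] + list(pages) + [outfile]
```

A run would look like run_pipeline(scan_pipeline("<device>", 300), "tempscanfile1.pbm") once per page, followed by subprocess.run(fax_pdf_command(pages, "outfile.pdf")).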

bludger · Guru Joined: 09 Apr 2003 Posts: 389

I have been using the above method successfully and conveniently for the last few months now. One problem that I have found is that when I try to convert multiple pbm files into one multi page pdf, I can quickly run out of memory if I get above 6 pages or so. Does anyone have any suggestions as to how to get around this?
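
One possible workaround, offered as an untested suggestion rather than a known fix: convert each page to its own one-page PDF so that convert only holds a single image at a time, then join the small PDFs with ghostscript (already a dependency of the original script). The Python sketch below only builds the two command lines; the function names are invented for illustration.

```python
def page_to_pdf_command(page):
    """Convert one PBM page into a one-page, fax-compressed PDF next to it."""
    pdf_name = page.rsplit(".", 1)[0] + ".pdf"
    return ["convert", "-compress", "fax", page, pdf_name]

def merge_command(pdfs, outfile):
    """Join several PDFs into one with ghostscript's pdfwrite device."""
    return ["gs", "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=pdfwrite",
            "-sOutputFile=" + outfile] + list(pdfs)
```

Each list can be passed to subprocess.run(); the per-page step is run once for every tempscanfile*.pbm, and the merge step once at the end.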

martoss · n00b Joined: 09 Dec 2003 Posts: 25

Isn't xsane doing the same?

My xsane version has an option to scan "pages" to a pdf. Works pretty well AFAIR. I don't see a big difference.
Xsane also has other nice features like just "copying stuff" and "emailing stuff". Anyway, your script also sounds nice :-)

PowerFactor · Posted: Sun Feb 11, 2007 3:55 pm

It seems I have been lax in keeping up with this. Better late than never I guess.

PowerFactor · Posted: Sun Feb 11, 2007 4:36 pm

I've also made some changes since that original version. I converted it to Python and added some ncurses "eyecandy" using dialog. I also got rid of the netpbm dependency. I had intended to rewrite it as a "proper" modular program with a separate config file and such, but never got very far with it. It's not something I use very often anyway.

Anyhow, here's my latest working version. Plenty of bugs I'm sure but it mostly works when I need it.

Dependencies have changed a little:

dev-lang/python
media-gfx/sane-frontends
media-gfx/imagemagick
virtual/ghostscript
dev-util/dialog

bludger · Guru Joined: 09 Apr 2003 Posts: 389

PowerFactor · Posted: Fri Feb 16, 2007 3:41 am

As I understand it, the Map limit option limits the amount of file space that can be mmapped for the pixel cache.

http://en.wikipedia.org/wiki/Memory-mapped_file

I think there's probably no need to use the Map limit on most systems. I think I just put it in mine because I had no clue how mmapping worked back then. It doesn't seem to make any performance difference when I remove it.
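
For reference, ImageMagick sets these resource limits with the -limit option. The Python sketch below just shows where the option goes on a convert command line; the limit values and the function name are arbitrary examples, not the ones from my script, and how a bare number is interpreted has changed between ImageMagick versions, so check the documentation for yours.

```python
def limited_convert_command(pages, outfile, memory="256MiB", map_limit="512MiB"):
    """Build a convert command with explicit memory and map limits.

    The limit values are placeholders; pick ones that suit your machine.
    """
    return ["convert", "-limit", "memory", memory, "-limit", "map", map_limit] \
        + list(pages) + [outfile]
```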

bludger · Guru Joined: 09 Apr 2003 Posts: 389

To get the scan device, I had been performing the following:
SCANDEVICE=$(scanimage -L|grep hp3500|awk -F '`' '{print $2}'|awk -F \' '{print $1}')
(my device is an hp3500)

This would read the correct usb port. From your script, I see that it might be possible to use just "hp3500:". I'll give that a try.
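
The grep/awk chain above can also be written as a small Python function, which makes it easy to list every device rather than just one. This is a sketch that assumes scanimage -L prints lines of the form device `NAME' is a DESCRIPTION; the sample line in the comment is made up to match that shape.

```python
import re

# Matches the name between the backtick and the apostrophe, e.g.
#   device `hp3500:libusb:001:004' is a Hewlett-Packard ScanJet 3500C flatbed scanner
DEVICE_RE = re.compile(r"device `([^']+)'")

def find_devices(listing, backend=None):
    """Return device names found in `scanimage -L` output.

    If backend is given (e.g. "hp3500"), keep only names starting with "backend:".
    """
    names = DEVICE_RE.findall(listing)
    if backend is not None:
        names = [n for n in names if n.startswith(backend + ":")]
    return names
```

The listing itself could come from subprocess.run(["scanimage", "-L"], capture_output=True, text=True).stdout.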

csim · n00b Joined: 13 Feb 2006 Posts: 23

Hi,

I have a small suggestion:
I think it would be cool to have the basic parameters accessible via some kind of menu, for example:

scanimage -L lists all available devices; it would be cool to select them via a dropdown menu...

redwood · Guru Joined: 27 Jan 2006 Posts: 306

I was googling for Zacchaeus Pearsall's original version of this script, when I found this page.
I too used his script as a starting point when writing a shell script for batch document scanning using scanadf.

My version "bscan" is available at http://www.acjlaw.net:8080/~jeremy/Ricoh/usage_bscan.html

It uses a configuration file, ~/.bscanrc,
where you can list all your scanners in a bash array,
with device names as shown by "scanimage -L"
and the default scanner being SCANDEVICE="${scanners[0]}"

Importantly, specifying the scanner names in ~/.bscanrc saves time
since the script then skips finding the scanners using "scanimage -L"

One can also specify which scanners are true duplex,
so the script will scan in fake duplex mode when true duplex is not available.
One can also specify lp printer instances so one can scan directly to a printer;
e.g. if you scan a document in duplex mode on letter-sized paper,
it will be printed in duplex from the appropriate tray holding letter-sized paper.

By default the script scans from the ADF in grayscale @300dpi and saves to format PDF.
So to scan a letter-sized document from the ADF @300dpi grayscale,
then compress using lzw, binarize using djvu and save to OUTFILE.pdf
one would use:

bscan --mode=8-bit --shades=2 --page=Letter --comp=lzw -BW OUTFILE

or for legal-sized paper
bscan --mode=8-bit --shades=2 --page=Legal --comp=lzw -BW OUTFILE

or letter-sized paper from the FlatBed:
bscan --mode=8-bit --shades=2 --page=Letter --comp=lzw --source=FB -BW OUTFILE

To simplify things, I usually define some aliases for black/white, grayscale and color scanning:

alias b='bscan --mode=1-bit --page=Letter --comp=lzw'
alias bl='bscan --mode=1-bit --page=Legal --comp=lzw'

alias B='bscan --mode=8-bit --shades=2 --page=Letter --comp=lzw'
alias BL='bscan --mode=8-bit --shades=2 --page=Legal --comp=lzw'

alias C='bscan --mode=color --shades=32 --page=Letter --comp=lzw'
alias CL='bscan --mode=color --shades=32 --page=Legal --comp=lzw'

alias truecolor='bscan --mode=color --shades=truecolor --page=Letter --comp=lzw'

Then to scan a letter-sized document in b/w from the ADF @300dpi:
b OUTFILE

for legal-sized:
bl OUTFILE

To scan in grayscale and binarize using djvu wavelet compression:
For letter:
B -BW OUTFILE

For Legal:
BL -BW OUTFILE

For letter using pnmtools' truecolor shades:
truecolor -c44 --djvutopdf=25 OUTFILE

For letter using duplexing and djvu binarization:
B -duplex -BW OUTFILE

Or to rotate the document 180 degrees:
B --rot=r180 OUTFILE

To save to another format, use --format={pnm,tif,pdf,ps,djv} or alternatively,
-pnm <equivalent to --format=pnm>
-tif <equivalent to --format=tif>,
and similarly for the other output options:
-pdf, -ps, -djv

Shortcut options, like the above switches, take a single '-',
and arguments requiring a value have the form '--option=value'.
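
To make the convention concrete, here is a tiny Python sketch that sorts an argument list into switches, valued options, and everything else. It only illustrates the convention described above and is not how bscan (a shell script) actually parses its arguments; the function name is invented.

```python
def split_args(argv):
    """Split arguments into ('-switch' names, {'option': 'value'}, positionals)."""
    switches, options, positional = [], {}, []
    for arg in argv:
        if arg.startswith("--") and "=" in arg:
            # Valued option: --option=value
            key, value = arg[2:].split("=", 1)
            options[key] = value
        elif arg.startswith("-") and len(arg) > 1:
            # Shortcut switch: -BW, -duplex, -noscan, ...
            switches.append(arg.lstrip("-"))
        else:
            # Anything else, such as the output file name
            positional.append(arg)
    return switches, options, positional
```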

One can specify various binarization algorithms,
such as those from Fred Weinhaus http://www.fmwconcepts.com/imagemagick/index.html
using the option --thresh={bw, constant, 2color, fuzzy, isodata, kmeans, sahoo, triangle, }
where the various binarization scripts must be in your $PATH.

If you use xsane or gscan2pdf to scan some images because, e.g. you need to crop the image
or tweak the contrast/brightness/gamma settings,
you can save the images as OUTFILE.%d.pnm
e.g. OUTFILE.0001.pnm, OUTFILE.0002.pnm, ...
Then you can use bscan with the option "-noscan" to skip the scanning,
and instead just process the images:
e.g., to rotate the images 180 degrees and binarize using djvu compression:
B -noscan -BW --rot=180 OUTFILE
which would process the series of images and create one multipage OUTFILE.pdf
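
As an illustration of the OUTFILE.%d.pnm naming, the following Python sketch picks the numbered page files for a given base name out of a directory listing and orders them by page number. It is a guess at the kind of matching involved, not code taken from bscan, and the function name is hypothetical.

```python
import re

def page_files(names, base):
    """Return names matching BASE.<digits>.pnm, sorted by the page number."""
    pattern = re.compile(re.escape(base) + r"\.(\d+)\.pnm")
    found = []
    for name in names:
        m = pattern.fullmatch(name)
        if m:
            found.append((int(m.group(1)), name))
    return [name for _, name in sorted(found)]
```

In practice the names would come from os.listdir(".").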

One can also deskew images using unpaper from http://unpaper.berlios.de/
The options to "unpaper" are hardwired into bscan because they are just too numerous
to specify on the command line,
so it might be best to just make a local copy of bscan
and modify the line which runs unpaper using whatever unpaper options you need.
Alternatively, you could add an option for unpaper settings
so that you could scan, e.g. B --unpaper=setting1 -BW OUTFILE
where setting1 would be specified in ~/.bscanrc or hardwired into bscan.

To photocopy, i.e. scan then print to a printer:
For letter printed to PRINTERLETTER
B -prn --n=<number of copies>

For legal printed to PRINTERLEGAL
BL -prn --n=<#copies>

Or for duplex letter to PRINTERLTRDUP
B -duplex -prn --n=<#copies>
And legal duplex to PRINTERLGLDUP
BL -duplex -prn --n=<#copies>

You just need to define the lp printer instances in /etc/cups/lpoptions or ~/.cups/lpoptions
However, I find that KDE keeps modifying/deleting any printer instances in ~/.cups/lpoptions,
so I've given up and just use /etc/cups/lpoptions, which KDE leaves untouched.

You can define lp printer instances using lpoptions,
but I find it easier to just directly edit /etc/cups/lpoptions
e.g. for my Xerox Phaser8860 print queue

I can define a letter,color,simplex queue:
Dest Phaser8860/letter Duplex=None fitplot=false InputSlot=Tray2 media=letter MediaType=Auto OutputMode=Enhanced PageRegion=letter PageSize=letter

And in my ~/.bscanrc, I add the name of the printer destination:
PRINTERLETTER="Phaser8860/letter"

And similarly for a color duplex-letter queue:
Dest Phaser8860/ltrdup Duplex=DuplexNoTumble fitplot=false InputSlot=Tray2 media=letter MediaType=Auto OutputMode=Enhanced PageRegion=letter PageSize=letter sides=two-sided-long-edge

with the destination
PRINTERLTRDUP="Phaser8860/ltrdup"
in my ~/.bscanrc
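
If you would rather not edit the file by hand, the same instance can be created with the lpoptions command (-p printer/instance plus one -o per setting). The Python sketch below only assembles that command line; the function name is made up, and the option names and values are simply copied from the Dest lines above. When run as root, lpoptions writes to /etc/cups/lpoptions; otherwise it writes to ~/.cups/lpoptions.

```python
def lpoptions_command(instance, settings):
    """Build: lpoptions -p PRINTER/INSTANCE -o key=value -o key=value ..."""
    cmd = ["lpoptions", "-p", instance]
    for key, value in settings.items():
        cmd += ["-o", "%s=%s" % (key, value)]
    return cmd

# Settings copied from the letter/duplex Dest line above.
LTRDUP = {
    "Duplex": "DuplexNoTumble",
    "InputSlot": "Tray2",
    "media": "letter",
    "sides": "two-sided-long-edge",
}
```

For example, lpoptions_command("Phaser8860/ltrdup", LTRDUP) yields a list that can be passed to subprocess.run().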

"bscan" will choose the appropriate letter/legal simplex/duplex printer destinations depending on whether the scan was letter/legal, simplex/duplex.

undrwater · Guru Joined: 28 Jan 2003 Posts: 312 Location: Caucasia