This bash script generates bookmarks automatically, as shown in the screenshot on the right. On the left, PDFSam Basic lets you choose whether to retain existing bookmarks or not. Here the Bible already had bookmarks, so it retains them.
Quite impressive script to quickly get bookmarks. I'll spend a little time trying to figure out how to retain bookmarks but most likely that's over my head.
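On retaining bookmarks: one possible approach is to dump each PDF's existing bookmark records with pdftk dump_data and shift every BookmarkPageNumber by the number of pages that come before that file in the combined PDF. Below is a rough sketch of only the shifting step. It assumes dump_data-style text on stdin, shift_bookmarks is a made-up helper name, and it has not been tried against real files.

```shell
#!/bin/bash
# shift_bookmarks OFFSET (hypothetical helper)
# Reads pdftk dump_data-style text on stdin and adds OFFSET to every
# BookmarkPageNumber; all other lines pass through unchanged.
shift_bookmarks() {
    awk -v off="$1" '
        /^BookmarkPageNumber: / { print "BookmarkPageNumber: " ($2 + off); next }
        { print }
    '
}

# usage (assumes second.pdf starts 120 pages into the combined file):
#   pdftk second.pdf dump_data | shift_bookmarks 120 > shifted.txt
```

The shifted records from each file would still have to be concatenated and applied to the combined PDF with update_info, which this sketch does not attempt.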
combine multiple PDFs into a single PDF
Create one bookmark (filename) for each PDF in directory.
Won't retain existing bookmarks like PDFSam
author Mateen Ulhaq
#!/bin/bash
out_file="combined.pdf"
tmp_dir="/tmp/pdftk_unite"
bookmarks_file="$tmp_dir/bookmarks.txt"
# pdftk update_info record: one level-1 bookmark pointing at page 1
bookmarks_fmt="BookmarkBegin
BookmarkTitle: %s
BookmarkLevel: 1
BookmarkPageNumber: 1
"
# start from an empty temp directory
rm -rf "$tmp_dir"
mkdir -p "$tmp_dir"
for f in *.pdf; do
    # skip the output of a previous run so it isn't merged into itself
    [ "$f" = "$out_file" ] && continue
    echo "Bookmarking $f..."
    # bookmark title is the filename without its extension
    title="${f%.*}"
    printf "$bookmarks_fmt" "$title" > "$bookmarks_file"
    pdftk "$f" update_info "$bookmarks_file" output "$tmp_dir/$f"
done
# concatenate the bookmarked copies in filename order
pdftk "$tmp_dir"/*.pdf cat output "$out_file"
This is pretty cool.... Adobe renamed ClearScan in 2015 to Edit Text in Image. It basically anti-aliases the text and vectorizes it. I spent many hours trying to get this to work and will spare you all the details of what I tried, but this is what actually did work:
Download about 700MB, which installs as 1.8GB, of texlive. This is a LaTeX package as recommended by pdfsak. You may skip the fonts and extra packages and see if it works, but base and potrace are required for sure.
sudo apt install texlive-latex-base texlive-fonts-recommended texlive-fonts-extra texlive-latex-extra potrace
Download the AppImage of magick (ImageMagick), then right click and Permissions | check Allow to Execute.
sudo cp magick /usr/bin/
Now install pdfsak in python
pip3 install --upgrade pdfsak
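After the installs it may help to confirm everything is actually on the PATH before running pdfsak. A small sketch; check_tools is a made-up name, and the list of commands in the usage line is my assumption about what the packages above provide.

```shell
#!/bin/bash
# check_tools NAME... (hypothetical helper)
# Prints "missing: NAME" for every command not found on the PATH and
# returns non-zero if any are missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# usage (command names are assumptions):
#   check_tools pdflatex potrace magick pdfsak
```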
Acrobatusers
Better PDF OCR: ClearScan is smaller, looks better
In this tutorial, learn about the advantages and disadvantages of ClearScan over Searchable Image OCR in Acrobat 9.
pdfsak -if input.pdf -o clearscan.pdf --clearscan
screenshots: original on top (blurry), clearscan on bottom
A few years ago, back when I was on a Mac, I used Adobe Acrobat DC Pro and remember it could replace the image with text by creating a custom font on-the-fly, giving a much smaller file size. You still need to go through manually and check the accuracy of the words, so it's still somewhat time consuming. This doesn't do that, but it still really helps clean up old PDFs.
Screenshot is showing zoomed in after cleanscan. Notice the many tiny dots.
Best to despeckle it (remove the tiny dots) in ScanTailor Advanced (unpaper is over my head). So some PDFs can go through clearscan first and then be fed to ScanTailor, but old ones with crappy backgrounds, double pages, etc. will have to go through ScanTailor first: img2pdf the output to a PDF, clearscan it, send it back to ScanTailor to despeckle, then on to ocrmypdf.
ScanTailorOnly.pdf is 109KiB
ClearScan-ScanTailor.pdf is 272KiB
Need to do these on an entire book to see if it's double or triple in size. If so, I guess it may not be worth it.
edit: ClearScan not CleanScan
edit2: seems like I'll use ClearScan only when the image is rather good to start with, meaning the text in the image isn't blurred all that much.
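For the size comparison above (272KiB against 109KiB is roughly 2.5 times), a quick way to get the ratio for any two files without doing the math by hand. Just a sketch; size_ratio is a made-up name and it assumes the first file is not empty.

```shell
#!/bin/bash
# size_ratio FILE_A FILE_B (hypothetical helper)
# Prints how many times larger FILE_B is than FILE_A, to one decimal place.
size_ratio() {
    a=$(wc -c < "$1")
    b=$(wc -c < "$2")
    awk -v a="$a" -v b="$b" 'BEGIN { printf "%.1f\n", b / a }'
}

# usage:
#   size_ratio ScanTailorOnly.pdf ClearScan-ScanTailor.pdf
```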
NORMCAP and TextSnatcher both OCR images. Looks like NORMCAP does a tad better especially with line breaks. Both are good to have at your disposal though.
PDF with non-uniform images. To solve this problem, resize the larger images down to the size of the smaller ones while maintaining aspect ratio.
Smaller ones are: just a few examples
1501 x 1152
1491 x 1155
1547 x 1159
1514 x 1159
Larger ones are: just a few examples
4649 x 3610
4641 x 3647
4677 x 3597
4849 x 3819
Problem with this is in ScanTailor Advanced the smaller images remain small compared to the larger resolution images after setting margins. This pdf is landscape with two pages scanned per page.
A4 = 210mm x 297mm
A4 + A4 = A3 = 297mm x 420mm
I thought just using this
-scale-to option would do the trick but it didn't.
pdftoppm -r 300 -scale-to 842 -tiff -tiffcompression deflate nonuniformimages.pdf dump/img
So instead I just did this and didn't specify dpi:
pdftoppm -tiff -tiffcompression deflate nonuniformimages.pdf dump/img
sort by size to select the larger resolution images and move them to a new directory.
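Instead of sorting by eye, the larger images could be picked out by pixel height. A sketch, assuming ImageMagick's identify is installed and the filenames have no spaces (pdftoppm's output names don't); list_large and the 2000 pixel cutoff are my own choices, picked because the small pages above are about 1150 high and the large ones about 3600.

```shell
#!/bin/bash
# list_large MIN_HEIGHT (hypothetical helper)
# Reads "name width height" lines on stdin and prints the names whose
# height is greater than MIN_HEIGHT.
list_large() {
    awk -v min="$1" '($3 + 0) > (min + 0) { print $1 }'
}

# usage: move every image taller than 2000 pixels from dump/ to dump3/
#   mkdir -p dump3
#   identify -format '%f %w %h\n' dump/*.tif | list_large 2000 |
#       while read -r f; do mv "dump/$f" dump3/; done
```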
mogrify will modify the existing images in place, so make a quick copy of them just in case. ImageMagick has no compression type named deflate (its Zip type is the closest match), so I used LZW.
-resize 555 resizes width and -resize x555 does height.
mogrify -resize x1172 -compress LZW dump3/*.tif
They maintained their aspect ratio, and the new resolutions of a few of them are:
1510 x 1172
1519 x 1172
1516 x 1172
Copy these back into your dump/ directory and ScanTailor will process them just fine now.
You can try to align chapter numbers by adding blank pages or deleting some unwanted pages so the table of contents page numbers match the actual PDF page numbers.
Or you can use vim and a plugin called vim-NumUtils.
sudo apt install vim
Download vim-NumUtils and after unzipping put the doc/ and plugin/ into the ~/.vim directory. It can do NumUtilsAdd, NumUtilsSub, NumUtilsMul, NumUtilsDiv
Here all the actual PDF page numbers are 3 pages behind the listed page numbers.
:% NumUtilsSub 3, '\,'
Sometimes I type a space after the , so to get those too do
:% NumUtilsSub 3, '\, '
Quick vim commands
:wq - write/save file and quit
:q - quit
u - undo
Ctrl-R - redo
i - insert mode, ESC to quit insert mode
a - append text
% is a range for entire document
Great vim guide
{
foreword by the author,6
{
poem:,
first, my country, 15
this is my own, my dear, and native land,16
god made me free, 21
to the stars and stripes,33
Renumber chapter page numbers in text file with math operations. How to do it with sed and awk
Sometimes I'll type a space after the , so the first substitution removes a space between a comma and a number. The second substitution, separated by a ;, then turns the comma before the number into a %.
sed -i -r 's/(, )([0-9])/,\2/; s/,([0-9])/%\1/' toc.txt
-i edits the file in place
s/ substitutions looking for a , followed by a digit. Some chapter titles have , in them and it'll ignore those
-r extended regex so don't need to escape capturing groups
\2 is the 2nd capture group; put a , then the single digit to remove the space
\1 is the first capture group ([0-9]) which is just a single digit, re-inserted after the % without touching other commas in chapter names
changes to:
{
foreword by the author%6
{
poem:,
first, my country%15
this is my own, my dear, and native land%16
god made me free%21
to the stars and stripes%33
Now subtract 3 from the chapter page numbers
awk -F% '{if (/%/) {print $1 "," $2-3} else {print $0}}' toc.txt > renumbered.txt
-F% field separator. , would work, but chapter titles sometimes contain , which screws up the fields, so it needs to be something else like %
if (/%/) if there's a match for % on the line, then
print $1 prints field 1, which is the text string up till the %
"," prints a comma between $1 and $2 so it's ready for titlecase
$2-3 prints the number after the % minus 3; + addition, * multiplication and / division work too
else print $0 prints the entire line if there's no match
changes to:
{
foreword by the author,3
{
poem:,
first, my country,12
this is my own, my dear, and native land,13
god made me free,18
to the stars and stripes,30
Now it's ready for title case
titlecase -f renumbered.txt -o chapters.txt
changes to:
{
Foreword by the Author,3
{
Poem:,
First, My Country,12
This Is My Own, My Dear, and Native Land,13
God Made Me Free,18
To the Stars and Stripes,30
Combine all three commands on one line
sed -i -r 's/(, )([0-9])/,\2/; s/,([0-9])/%\1/' toc.txt && awk -F% '{if (/%/) {print $1 "," $2-3} else {print $0}}' toc.txt > renumbered.txt && titlecase -f renumbered.txt -o chapters.txt
then booky to add chapters to the pdf
booky.sh SomeBook.pdf chapters.txt
If you want to offset the page numbers by +5 just change $2-3 to $2+5
sed -i -r 's/(, )([0-9])/,\2/; s/,([0-9])/%\1/' toc.txt && awk -F% '{if (/%/) {print $1 "," $2+5} else {print $0}}' toc.txt > renumbered.txt && titlecase -f renumbered.txt -o chapters.txt
changes to:
{
Foreword by the Author,11
{
Poem:,
First, My Country,20
This Is My Own, My Dear, and Native Land,21
God Made Me Free,26
To the Stars and Stripes,38
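The sed and awk steps above can be wrapped in one function so the offset is an argument instead of being edited into the command each time. A rough sketch, assuming a sed that accepts -E; renumber_toc is a made-up name, it prints to stdout rather than editing toc.txt in place, and titlecase is left as a separate step.

```shell
#!/bin/bash
# renumber_toc OFFSET FILE (hypothetical helper, not part of booky or titlecase)
# Prints FILE with the page number after the comma shifted by OFFSET.
renumber_toc() {
    # 1st s: drop a space between the comma and the page number
    # 2nd s: turn the comma before the page number into % so awk can split on it
    sed -E 's/, ([0-9])/,\1/; s/,([0-9])/%\1/' "$2" |
        awk -F% -v off="$1" '{ if (/%/) print $1 "," ($2 + off); else print $0 }'
}

# usage:
#   renumber_toc -3 toc.txt > renumbered.txt
#   renumber_toc 5 toc.txt > renumbered.txt
```

Lines without a page number, such as { or poem:, pass through unchanged, the same as in the one-liner.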