Combine all images into a PDF with img2pdf and pipe it to ocrmypdf (notice the - by itself, which makes ocrmypdf read the PDF from the pipe). --pagesize A4 sets all pages to that size; the downside is some white margins on the cover, but all pages are uniform. A4 is close in size to Letter.

img2pdf --pagesize A4 out/*.tif | ocrmypdf --optimize 3 --jbig2-lossy - ../output_ocr.pdf

booky.sh is a script which lets you edit a text file to generate bookmarks / an outline. Install booky by unzipping the code file, then putting it somewhere safe to keep. Then add that directory to your PATH so it can be run from wherever the PDF is.
Check what is on your PATH now:
echo $PATH
/home/geektips/.local/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/usr/local/sbin:/usr/sbin:/sbin
The booky directory isn't there yet, so export it to your path:
nano ~/.bashrc
and at the end of the file append
export PATH=$PATH:/home/geektips/appimages/apps/booky/
Save it, then refresh bash without logging out with
source ~/.bashrc
echo $PATH
/home/geektips/.local/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/usr/local/sbin:/usr/sbin:/sbin:/home/geektips/appimages/apps/booky/
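If you source ~/.bashrc several times, a plain append keeps adding the same directory again. A minimal sketch of a guard that only appends when the directory is missing; add_to_path is a made-up helper name, not something booky ships:

```shell
#!/bin/bash
# Append a directory to PATH only if it is not already listed.
# add_to_path is a hypothetical helper, not part of booky.
add_to_path() {
    local dir="${1%/}"                    # compare without a trailing slash
    case ":$PATH:" in
        *":$dir:"*|*":$dir/:"*) ;;        # already there, nothing to do
        *) PATH="$PATH:$dir" ;;
    esac
}

add_to_path /home/geektips/appimages/apps/booky/
add_to_path /home/geektips/appimages/apps/booky/   # second call changes nothing
export PATH
echo "$PATH"
```

Put the function and the add_to_path line at the end of ~/.bashrc in place of the export line.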
For booky, make a text file and save it as index.txt (or whatever). Copy the chapter titles from the PDF, or use NORMCAP or TextSnatcher to screenshot-OCR them, then paste and edit them in a text editor.
You can have line breaks between chapters; it doesn't matter. Just put a comma after the chapter name, then the page number. Only one chapter per line.
The easiest way to find chapters, if the PDF lists page numbers, is to search for a page number and look for a matching chapter title in the left pane. It still takes about 5 to 10 minutes depending on how many chapters there are.
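Here is roughly what such a file can look like, plus a quick sanity check before handing it to booky. The titles and page numbers are invented, check_index is a hypothetical helper, and it only knows the flat "name, page" layout described above, so treat it as a sketch:

```shell
#!/bin/bash
# Write a small sample index; the titles and page numbers are made up.
index_file="$(mktemp)"
cat > "$index_file" <<'EOF'
Introduction, 7

Chapter 1, 12
Chapter 2, 31
Maps,
EOF

# Report every non-blank line that is not "title, page" or "title,".
check_index() {
    awk '
        /^[ \t]*$/ { next }                     # blank lines are allowed
        !/^[^,].*,[ \t]*[0-9]*[ \t]*$/ {        # title, comma, optional page number
            printf "line %d looks wrong: %s\n", NR, $0
            bad = 1
        }
        END { exit bad }
    ' "$1"
}

if check_index "$index_file"; then
    echo "index looks fine"
fi
```

A line with a forgotten comma, such as "Chapter 3 48", gets reported with its line number.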
Then run booky:
booky.sh Conquest\ of\ a\ Continent.pdf index.txt
and it creates a new PDF with your index of chapters and appends _new to the filename.
Note: put a comma after Maps, like so:
Maps,

Split a double-page scan into one PDF with ScanTailor Advanced.
Showing the original pages 54 and 55. The original PDF is 46.4MB and the final PDF is 1.5MB. This is the biggest reason to optimize these files, besides making them searchable with OCR.
mkdir dump
pdftoppm -tiff -tiffcompression deflate -r 300 What-About-The-Seedline-Doctrine.pdf dump/img
or extract to png instead of tiff if you have a fast computer
pdftoppm -png -r 300 What-About-The-Seedline-Doctrine.pdf dump/img
-r 300 sets the resolution to 300 dpi; you can do 600 if you wish. Deflate compression gives about 30% smaller files than lzw and is only about 10% slower.
Then in ScanTailor Advanced, open the directory of images.
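To get a feel for what the -r value does to image size, the arithmetic is pixels = millimetres / 25.4 * dpi. A small sketch, assuming an A4 page (210 x 297 mm); page_pixels is a made-up helper name:

```shell
#!/bin/bash
# Pixel size of a page at a given resolution: pixels = mm / 25.4 * dpi.
# page_pixels is a hypothetical helper; A4 is 210 x 297 mm.
page_pixels() {
    local width_mm="$1" height_mm="$2" dpi="$3"
    awk -v w="$width_mm" -v h="$height_mm" -v d="$dpi" \
        'BEGIN { printf "%.0f x %.0f\n", w / 25.4 * d, h / 25.4 * d }'
}

page_pixels 210 297 300   # prints 2480 x 3508
page_pixels 210 297 600   # prints 4961 x 7016
```

Going from 300 to 600 dpi doubles each side, so every page holds four times as many pixels. The same function with 72 gives 595 x 842, which is A4 measured in PDF points.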
edit: I used to use pdfimages, but it extracts each layer of a page separately. The only other tool I found that actually flattens was MasterPDF Editor | Export all images, but it puts a DEMO watermark on them. pdftoppm does the same job by flattening the multilayered images in a PDF.

To make a cover fit exactly, without margins, in an A4 PDF, just resize the image to exactly 595 x 842 and save it as a .tif image.
If you wish to add one after you've already OCR'd the PDF, use PDFArranger or PDFSlicer to remove the existing cover and add a new one. Save the cover as a .jpg, then
img2pdf -S A4 001cover2x.jpg -o 001cover.pdf

Extract URLs from a webpage
https://www.convertcsv.com/url-extractor.htm
I wished to download an entire podcast series to turn it into a few opus audiobooks.
On Soundcloud I couldn't figure out how to get the URLs. Podbean would have meant feeding in 50 different pages. Radiopublic dot com had all episodes (500) listed on one page, which allowed extracting all the URLs at once.
1) Load the URL from radiopublic
2) Filter for URLs that contain mp3
3) Extract, then save the file as linksextracted.txt or whatever you wish
Had to change the filenames though.
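If you already have a raw dump of every link on the page, the mp3 filter from step 2 can also be done in the terminal. A rough sketch; the example.com URLs are placeholders and filter_mp3 is a made-up name:

```shell
#!/bin/bash
# Keep only the links that point at .mp3 files and drop duplicates.
# The URLs below are placeholders, not real episode links.
all_links="$(mktemp)"
cat > "$all_links" <<'EOF'
https://example.com/show/episode-001.mp3
https://example.com/about
https://example.com/show/episode-002.mp3?source=feed
https://example.com/show/episode-001.mp3
EOF

filter_mp3() {
    # Match .mp3 at the end of the URL or right before a query string,
    # then keep the first copy of each link in its original order.
    grep -Ei '\.mp3($|\?)' "$1" | awk '!seen[$0]++'
}

filter_mp3 "$all_links" > linksextracted.txt
cat linksextracted.txt
```

The awk part removes repeats without sorting, so the episodes stay in page order.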
yt-dlp -j -a linksextracted.txt
-j, --dump-json
With this command I can find the name I want, as it's giving me TstlYjWBcmTs.128 [TstlYjWBcmTs.128].mp3, which I don't want. Looking at the screenshot, it appears to be outputting the webpage_url_basename, so I want to set the output template to original_url instead.
• webpage_url (string): A URL to the video webpage which if given to yt-dlp should allow to get the same result again
• webpage_url_basename (string): The basename of the webpage URL
• webpage_url_domain (string): The domain of the webpage URL
• original_url (string): The URL given by the user (or same as webpage_url for playlist entries)
yt-dlp -o "%(original_url)s.%(ext)s" -a linksextracted.txt
In TextEditor (xed) just press Ctrl-U for UPPERCASE, Ctrl-L for lowercase, or Ctrl-T for Title Case, but for Title Case it's better to handle exceptions.
Title Case with exceptions for:
articles: a, an, the
prepositions: at, by, in, on, of, with
conjunctions: and, but, for, nor, so
except if they're the first or last word of the line
First and last word always capitalized even for articles, prepositions, conjunctions
single and double quotes
keep urls lowercase
custom exception list for say acronyms
touch ~/.titlecase.txt
xed ~/.titlecase.txt
then it will always keep those lowercase
python titlecase
tons of examples here
pip3 install titlecase
It includes a command-line tool named titlecase
titlecase -f input.txt -o output.txt
Try not to feed it all UPPERCASE; it's better to feed it all lowercase. It will usually preserve acronyms (FBI, etc.) if they're already capitalized.
Say you have chapters with numbers: it works as long as you put a ': ' after the number (spaces are optional). A colon :, a semi-colon ; or a hyphen - all work.
3: a victory on a massive scale
3: A Victory on a Massive Scale
25 - the conquest of the world
25 - The Conquest of the World
2 - a bad time and a terrible waste of money
2 - A Bad Time and a Terrible Waste of Money
nothing to be afraid of
Nothing to Be Afraid Of
'small word in quotes - "a trick, perhaps?"'
'Small Word in Quotes - "A Trick, Perhaps?"'
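Since all-UPPERCASE input is the problem case, one option is to lowercase only the lines that are entirely uppercase before running titlecase, so mixed-case lines keep their acronyms. A sketch with made-up sample lines; lower_shouting is a hypothetical helper, and the titlecase call at the end only runs if the tool is installed:

```shell
#!/bin/bash
# Lowercase only the lines that are entirely UPPERCASE, leave the rest alone.
lower_shouting() {
    awk '{
        if ($0 == toupper($0) && $0 != tolower($0))
            print tolower($0)    # all caps: lowercase the whole line
        else
            print                # mixed case: keep it, so FBI etc. survive
    }' "$1"
}

input="$(mktemp)"
printf '%s\n' '3: A VICTORY ON A MASSIVE SCALE' 'the FBI files' > "$input"
lower_shouting "$input" > "$input.lower"
cat "$input.lower"

# Hand the cleaned file to titlecase if it is installed.
if command -v titlecase >/dev/null 2>&1; then
    titlecase -f "$input.lower" -o "$input.titled"
fi
```

The first sample line comes out as "3: a victory on a massive scale" and the second is left untouched.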
GitHub - ppannuto/python-titlecase: Python library to capitalize strings as specified by the New York Times Manual of Style
This bash script generates bookmarks automatically, as shown in the screenshot on the right. On the left, PDFSam Basic lets you choose whether to retain existing bookmarks or not. Here the Bible already had bookmarks, so it retains them.
Quite impressive script to quickly get bookmarks. I'll spend a little time trying to figure out how to retain bookmarks but most likely that's over my head.
combine multiple PDFs into a single PDF
Create one bookmark (filename) for each PDF in directory.
Won't retain existing bookmarks like PDFSam
author Mateen Ulhaq
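#!/bin/bash
# Combine every PDF in the current directory into one file,
# with one top-level bookmark per input PDF (named after the file).
out_file="combined.pdf"
tmp_dir="/tmp/pdftk_unite"
bookmarks_file="$tmp_dir/bookmarks.txt"
# pdftk bookmark record; printf fills in %s with the title below.
bookmarks_fmt="BookmarkBegin
BookmarkTitle: %s
BookmarkLevel: 1
BookmarkPageNumber: 1
"
# Start from an empty scratch directory.
rm -rf "$tmp_dir"
mkdir -p "$tmp_dir"
for f in *.pdf; do
    # Skip the result of a previous run so it is not merged into itself.
    [ "$f" = "$out_file" ] && continue
    echo "Bookmarking $f..."
    title="${f%.*}"    # filename without the extension
    printf "$bookmarks_fmt" "$title" > "$bookmarks_file"
    # Write a copy of the PDF that carries that single bookmark.
    pdftk "$f" update_info "$bookmarks_file" output "$tmp_dir/$f"
done
# Concatenate the bookmarked copies in alphabetical order.
pdftk "$tmp_dir"/*.pdf cat output "$out_file"

This is pretty cool. In 2015 Adobe renamed ClearScan to Editable Text and Images. It basically anti-aliases around the text and vectorizes it. I spent many hours trying to get this to work and will spare you all the details of what I tried, but this is what actually did work: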
Download about 700MB (it installs to 1.8GB) of texlive. This is a LaTeX distribution, as recommended by pdfsak. You may skip the fonts and extra packages and see if it works, but base and potrace are for sure required.
sudo apt install texlive-latex-base texlive-fonts-recommended texlive-fonts-extra texlive-latex-extra potrace
Download the appimage of magick (ImageMagick), right click it and under Permissions check the box to allow executing, then
sudo cp magick /usr/bin/
Now install pdfsak with pip
pip3 install --upgrade pdfsak
Acrobatusers
Better PDF OCR: ClearScan is smaller, looks better
In this tutorial, learn about the advantages and disadvantages of ClearScan over Searchable Image OCR in Acrobat 9.
pdfsak -if input.pdf -o clearscan.pdf --clearscan
screenshots: original on top (blurry), clearscan on bottom
A few years ago, back when I was on a Mac, I used Adobe Acrobat DC Pro and remember it could replace the image with text by creating a custom font on the fly, which gave a much, much smaller file. You still need to go through manually and check the accuracy of the words, so it's still somewhat time consuming. pdfsak doesn't do that, but it still really helps clean up old PDFs.
The screenshot shows a zoomed-in page after clearscan. Notice the many tiny dots. It's best to despeckle it (remove the tiny dots) in ScanTailor Advanced (unpaper is over my head). So for some PDFs you can run clearscan first and then feed the result to ScanTailor, but old ones with a crappy background, double pages, etc. have to go through ScanTailor first: img2pdf the output to a PDF, clearscan it, send it back to ScanTailor to despeckle, then run ocrmypdf.
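Putting the simple case together (clearscan first, then ScanTailor, then OCR), here is a dry-run outline of how I read the order above. The file names are placeholders, run and clean_pdf are made-up helpers, and the ScanTailor step stays manual. With DRY_RUN=1, the default here, the commands are only printed:

```shell
#!/bin/bash
# Dry-run outline of the workflow for a fairly clean PDF.
# File names are placeholders; with DRY_RUN=1 nothing is executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '+ %s\n' "$*"      # show the command instead of running it
    else
        "$@"
    fi
}

clean_pdf() {
    local input="$1" output="$2"
    run pdfsak -if "$input" -o clearscan.pdf --clearscan    # vectorize the text
    run mkdir -p dump
    run pdftoppm -png -r 300 clearscan.pdf dump/img         # pages to images
    echo "# despeckle dump/ in ScanTailor Advanced and export to out/"
    [ "$DRY_RUN" = 1 ] || read -r -p "Press Enter when ScanTailor is done " _
    run img2pdf --pagesize A4 out/*.tif -o combined.pdf     # images back to a PDF
    run ocrmypdf --optimize 3 --jbig2-lossy combined.pdf "$output"
}

clean_pdf input.pdf output_ocr.pdf
```

Run it with DRY_RUN=0 only after checking the printed commands against your own file names.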