Then clean with ExifCleaner (a free app). I use the AppImage on Linux to clean out all PDF metadata. In this case it deleted Creator, Producer and ModifyDate.
Wanted the images pretty much uniform in size. Didn't want a PDF with small, medium and super large images. So I kept all of them to a max width and height of 1000 pixels using
img2pdf --imgsize 1000x1000 *.jpg -o output.pdf
But before I did that there were some small, say 400 pixel, photos in the Ben Garrison, Sheeple, Ads collection. So I batch upscaled many of them 2x with
waifu2x-ncnn-vulkan. See next post. Even if an image were already, say, 1500x1500 and you 2x upscale it to 3000x3000, it's doable since you set the max to 1000x1000 with img2pdf.
Copy waifu2x-ncnn-vulkan into the /usr/bin path with elevated privileges.
Don't need an expensive AMD or Nvidia graphics card either. Just takes longer.
sudo nemo
or sudo thunar
which lets us copy the 3 directories and the waifu2x-ncnn-vulkan executable to /usr/bin
Or on the command line (-r = recursive):
sudo cp -r waifu2x-ncnn-vulkan models-* /usr/bin
When you wish to delete (be careful):
cd /usr/bin
sudo rm -r waifu2x-ncnn-vulkan models-*
Upscale only one image:
waifu2x-ncnn-vulkan -i input.jpg -o output.jpg -s 4 -n 2
To batch upscale many images (jpg/png/webp) and specify the input and output directory plus the image format:
waifu2x-ncnn-vulkan -i ~/Pictures/input -o ~/Pictures/output -s 2 -n 0 -f jpg -t 64
-s scale 1/2/4/8/16/32 (default 2)
-n noise-level -1/0/1/2/3 (default 0)
-f format type for batch processing dirs
-t tile-size >=32 (default auto)
I got a GPU error, so I reduced tile size to 64 with -t 64 and it worked fine, albeit even slower.
k2pdfopt shows the white space to be cropped / trimmed / removed automatically.
k2pdfopt input.pdf -ui- -x -mode tm -om 0.01,0.01,0.01,0.01 -c
outputs a file named
input_k2opt.pdf
-ui- disables interactive GUI on Linux
-x exits when finished
-c color output, as the default is black and white
-mode tm (trim margins / auto crop)
-om 0.01,0.01,0.01,0.01 output margins; adds just a little bit of margin to the left, top, right, bottom of pages
OCR works but I'll stick with
ocrmypdf, since when k2pdfopt OCRs text and images it converts the text to images, which is terrible: images have overlaying OCR text. In ocrmypdf you can --force-ocr it and it will keep your text as text and overlay text on images too. Huge difference.
k2pdfopt input.pdf -ui- -p 1-4 -x -mode tm -om 0.01,0.01,0.01,0.01 -ocr t -ocrhmax 1.5 -ocrdpi 400 -ocrvis s -ocrd p -c
-p 1-4 (page range; 1- is page 1 to end, e for even and o for odd pages)
-ocrd p sends Tesseract a page at a time rather than a line at a time. This was necessary.
ScanTailor took a PDF from 19.2MB (on the left side, with yellow background) to 3x smaller at 6.5MB (on the right side), put uniform margins on each page and added a chapter index.
1) extract images with pdfimages
2) change black background with white text to a white background and black text with convert
3) ScanTailor margins, deskew, despeckle
4) combine .tif images and pipe to ocrmypdf for OCR and image optimization
5) create bookmarks / index with booky.sh
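The steps above can be sketched as one dry-run script. Tool names and the sample filename come from this post; `run` only echoes each command, so nothing executes until you swap echo for the real call. The ScanTailor step is interactive, so it's only a comment.

```shell
# Dry-run sketch of the pipeline; replace 'echo' with the real command when ready
run() { echo "+ $*"; }

run pdfimages -jp2 -p Conquest-of-a-Continent.pdf images/img       # 1) extract images
run convert img-001-002.pbm -negate img-001-002.png                # 2) invert to white bg
# 3) open the images in ScanTailor Advanced: margins, deskew, despeckle
run img2pdf --pagesize A4 out/*.tif                                # 4) combine...
run ocrmypdf --optimize 3 --jbig2-lossy - output_ocr.pdf           #    ...and OCR
run booky.sh output_ocr.pdf index.txt                              # 5) bookmarks
```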
pdfimages -list -f 1 -l 5 Conquest-of-a-Continent.pdf
-list only lists images in the PDF without extracting them
-f first page to process
-l last page to process
Showing just pages 1 to 5, you can see each page has 3 multi-layered images: an image with an alpha channel, a yellow background and an smask (soft mask) jbig2.
In the directory with the Conquest of a Continent PDF:
mkdir images
Extract all images from the PDF; -p prepends the page number to each image:
pdfimages -jp2 -p Conquest-of-a-Continent.pdf images/img
img-001-000.jp2
img-001-001.jp2
img-001-002.pbm
img-002-003.jp2
img-002-004.jp2
img-002-005.pbm
img-003-006.jp2
img-003-007.jp2
img-003-008.pbm
The cover looks bad, so screenshot it and save it as
-001cover.jpg to the images/ directory. If you save the image from the PDF it'll have an alpha channel.
We only want the pbm files, so delete the .jp2 files:
cd images/
rm *.jp2
The pbm images are inverted, so we need to negate (invert) them with convert to get a white background with black text. We'll also convert them to png so ScanTailor can read them.
for f in *.pbm; do convert "$f" -negate "${f%.pbm}.png"; done
Check to make sure the png images look good, then delete the pbm images:
rm *.pbm
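When batch-converting, shell parameter expansion can strip the .pbm suffix so each output keeps the original stem, e.g. img-001-002.pbm becomes img-001-002.png. A quick pure-shell check (no ImageMagick needed; filenames are the ones pdfimages produced above):

```shell
# ${f%.pbm} removes the .pbm suffix; append .png for the output name
for f in img-001-002.pbm img-002-005.pbm; do
  echo "${f%.pbm}.png"
done
# prints img-001-002.png then img-002-005.png
```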
edit: Oops, found out ScanTailor can convert that yellow background to white, so no need for pdfimages and inverting, etc. If the PDF has masks, soft-masks, stencils (multi-layered) and you want one image per page instead of 3, then use pdftoppm instead of pdfimages:
pdftoppm -png -r 300 input.pdf dump/img
Now add the directory of pngs in ScanTailor Advanced. On Linux, ScanTailor Advanced has an AppImage or Flatpak.
On import I had to Fix DPI so I choose 300 for all images.
In ScanTailor, after checking Orientation, Split Pages (for one page of two-page scans), Deskew (slightly rotated pages), and Select Content (defines text and image areas), I let it batch process all pages by clicking the green arrow. Set Margins and apply to This page and the following ones.
Before batch processing it's best to select page 1. Once batch processing is done, it saves *.tif files in a subdirectory named out/ and you can close the project and exit ScanTailor.
Combine all images into a PDF with img2pdf and pipe it to ocrmypdf (notice the - by itself). A4 sets all pages to that size, with the downside being some white margins on the cover, but all pages are uniform. A4 is close in size to Letter.
img2pdf --pagesize A4 out/*.tif | ocrmypdf --optimize 3 --jbig2-lossy - ../output_ocr.pdf
booky.sh is a script which lets you edit a text file to generate bookmarks / an outline. Install booky by unzipping the code file, then put it where it's safe to keep. Then add its path so it can be run from anywhere the PDF is.
echo $PATH
/home/geektips/.local/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/usr/local/sbin:/usr/sbin:/sbin/
It's not there yet, so export it to your path:
nano ~/.bashrc and at the end of the file append
export PATH=$PATH:/home/geektips/appimages/apps/booky/
Save it, then refresh bash without logging out with
source ~/.bashrc
echo $PATH
/home/geektips/.local/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/usr/local/sbin:/usr/sbin:/sbin:/home/geektips/appimages/apps/booky/
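Instead of eyeballing that long string, a small check can confirm the append worked. This is a sketch; the directory is the one from this post, and the PATH change only lasts for the current shell:

```shell
# Append the booky directory and confirm it is present in PATH
dir=/home/geektips/appimages/apps/booky
PATH="$PATH:$dir"
case ":$PATH:" in
  *":$dir:"*) echo "booky dir is on PATH" ;;
  *)          echo "booky dir missing" ;;
esac
```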
For booky, make a text file and save it as index.txt (or whatever). Copy from the PDF, or use NormCap or TextSnatcher to screenshot-OCR the chapters, then paste and edit them in a text editor. You can have line breaks between chapters, it doesn't matter. Just put a comma after the chapter name, then the page #. Only one chapter per line.
The easiest way to find chapters, if the PDF lists the page #, is to search for said page #, looking for a matching chapter title in the left pane. Still takes about 5 to 10 mins depending on how many chapters.
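An index.txt then just needs one "Chapter name,page" pair per line; the chapter names and page numbers below are made up for illustration:

```text
Introduction,1
The First Colonists,15
Maps,210
```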
booky.sh Conquest\ of\ a\ Continent.pdf index.txt
and it creates a new PDF with your chapter index, appending _new to the filename. Note: put a comma after the chapter name,
e.g. Maps like so: Maps,
Split a double-page scan into one PDF with ScanTailor Advanced.
Showing the original pages 54 and 55. PDF is 46.4MB and final PDF is 1.5MB. This is the biggest reason to optimize these files besides making them searchable with OCR.
mkdir dump
pdftoppm -tiff -tiffcompression deflate -r 300 What-About-The-Seedline-Doctrine.pdf dump/img
or extract to png instead of tiff if you have a fast computer:
pdftoppm -png -r 300 What-About-The-Seedline-Doctrine.pdf dump/img
-r 300 sets the dpi to 300... can do 600 if you wish. Deflate compression is ~30% better in file size than LZW and only about 10% slower. Open the directory of images in ScanTailor Advanced.
edit: I used to use pdfimages, but it will extract each layer per page. The only tool I found that actually flattens is Master PDF Editor (Export all images), but it puts a DEMO watermark. pdftoppm can do the same by flattening the multilayered images in a PDF.
To make covers fit exactly without margins in an A4 PDF, just resize an image to the exact dimensions,
595 x 842, and save it as a .tif image. If you wish to add one after you've already OCR'd the PDF, then use PDF Arranger or PDF Slicer to remove the existing cover and add a new cover. Save the file to .jpg then
img2pdf -S A4 001cover2x.jpg -o 001cover.pdf
Extract URLs from a webpage
https://www.convertcsv.com/url-extractor.htm
Wish to download an entire podcast series to turn it into a few opus audiobooks.
For Soundcloud I couldn't figure out how to get the URLs. Podbean would have meant inputting 50 different pages. Radiopublic had all episodes (500) listed on one page, enabling extraction of all URLs at once.
1) Load the URL from Radiopublic
2) Filter URLs that only have mp3
3) Extract, then save the file as linksextracted.txt or whatever you wish
Had to change the filenames though.
yt-dlp -j -a linksextracted.txt
-j, --dump-json
With this command I can find the name I want, as it's giving me
TstlYjWBcmTs.128 [TstlYjWBcmTs.128].mp3, which I don't want. Looking at the screenshot, I want to use the output template with original_url, as it appears to be outputting the webpage_url_basename.
• webpage_url (string): A URL to the video webpage which if given to yt-dlp should allow to get the same result again
• webpage_url_basename (string): The basename of the webpage URL
• webpage_url_domain (string): The domain of the webpage URL
• original_url (string): The URL given by the user (or same as webpage_url for playlist entries)
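The difference between the two fields is easy to see in pure shell; the episode URL below is made up for illustration (${url##*/} mimics "basename of the webpage URL"):

```shell
# Hypothetical episode URL; the basename is just the last path segment,
# while original_url is the full URL you passed in
url="https://example.com/my-podcast/episode-001.mp3"
echo "webpage_url_basename: ${url##*/}"   # episode-001.mp3
echo "original_url: $url"
```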
yt-dlp -o "%(original_url)s.%(ext)s" -a linksextracted.txt