GeekTips – Telegram

Linux Mint, video encoding, ffmpeg, geek tips, regex, pdf manipulation, substitcher, mpv config

In Text Editor (xed), just press Ctrl-U for UPPERCASE, Ctrl-L for lowercase, and Ctrl-T for Title Case. For Title Case, though, it's better to allow exceptions.

Title Case with exceptions for:
articles: a, an, the
prepositions: at, by, in, on, of, with
conjunctions: and, but, for, nor, so

The first and last word of a line are always capitalized, even if they are articles, prepositions, or conjunctions.
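
To make the rule concrete, here is a minimal sketch in plain Python. It is not how the titlecase package is implemented; `simple_title_case` and `SMALL_WORDS` are names made up for illustration, and the word list is only the one given above.

```python
# Small words copied from the list above (an assumption for this sketch;
# the real titlecase package ships its own list).
SMALL_WORDS = {
    "a", "an", "the",                      # articles
    "at", "by", "in", "on", "of", "with",  # prepositions
    "and", "but", "for", "nor", "so",      # conjunctions
}


def simple_title_case(line):
    """Capitalize every word except small words in the middle of the line."""
    words = line.split()
    last = len(words) - 1
    result = []
    for i, word in enumerate(words):
        lower = word.lower()
        if 0 < i < last and lower in SMALL_WORDS:
            result.append(lower)               # small word mid-line stays lowercase
        else:
            result.append(lower.capitalize())  # first, last, and ordinary words
    return " ".join(result)


print(simple_title_case("a victory on a massive scale"))
print(simple_title_case("the conquest of the world"))
```

The sketch ignores chapter numbers, quotes, URLs, and acronyms, and it lowercases everything first, so it would mangle FBI.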

Also handled: single and double quotes, and URLs are kept lowercase.

Custom exception list, say for acronyms:

touch ~/.titlecase.txt
xed ~/.titlecase.txt

then it will always keep those lowercase

python titlecase
tons of examples here

pip3 install titlecase

It includes a command-line tool named titlecase:

titlecase -f input.txt -o output.txt

Try not to feed it ALL UPPERCASE text; it's better to feed it all lowercase. It will usually preserve acronyms (FBI, etc.) if they are already capitalized.

Say you have chapters with numbers: the first word after the number still gets capitalized as long as you put a separator after the number (spaces around it are optional). A colon :, a semicolon ; or a hyphen - all work.

3: a victory on a massive scale
3: A Victory on a Massive Scale

25 - the conquest of the world
25 - The Conquest of the World

2 - a bad time and a terrible waste of money
2 - A Bad Time and a Terrible Waste of Money

nothing to be afraid of
Nothing to Be Afraid Of

'small word in quotes - "a trick, perhaps?"'
'Small Word in Quotes - "A Trick, Perhaps?"'

GitHub - ppannuto/python-titlecase: Python library to capitalize strings as specified by the New York Times Manual of Style


The original is on the bottom: the text in the image is very light and hard to read. ScanTailor with Otsu and other binarization algorithms darkens the text, as seen in the top image.
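
For the curious, Otsu's method picks the black/white cutoff automatically from the image's own gray levels. Below is a rough sketch in plain Python, not ScanTailor's actual code; the function names are hypothetical and the pixel values are made up to mimic light gray text on a near-white page.

```python
def otsu_threshold(pixels):
    """Return the gray level (0-255) that best separates dark from light pixels."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_between = 0, -1.0
    n_dark = 0     # pixels at or below the candidate threshold
    sum_dark = 0   # sum of their gray levels
    for t in range(256):
        n_dark += hist[t]
        sum_dark += t * hist[t]
        if n_dark == 0:
            continue            # nothing on the dark side yet
        n_light = total - n_dark
        if n_light == 0:
            break               # nothing left on the light side
        mean_dark = sum_dark / n_dark
        mean_light = (sum_all - sum_dark) / n_light
        # between-class variance (up to a constant factor)
        between = n_dark * n_light * (mean_dark - mean_light) ** 2
        if between > best_between:
            best_between, best_t = between, t
    return best_t


def binarize(pixels, threshold):
    """Dark side becomes pure black (0), light side pure white (255)."""
    return [0 if p <= threshold else 255 for p in pixels]


# made-up scan: 20 light-gray "text" pixels on 80 near-white "paper" pixels
page = [150] * 20 + [230] * 80
t = otsu_threshold(page)
print("threshold:", t)
print("text pixel after binarizing:", binarize(page, t)[0])
```

The faint gray text ends up pure black, which is the darkening visible in the screenshot.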

Use xreader instead of evince as the PDF reader (evince has no zooming thumbnails). Zooming in on the thumbnails lets you quickly locate chapters and assign page numbers to them for booky.sh.

In this case I typed them out and didn't bother capitalizing anything, which saved me some time.

Convert to title case with the article exceptions, and notice the : after the chapter number:
titlecase -f aaa_index.txt -o titlecase.txt

generate index for the PDF with booky.sh
booky.sh output_ocr.pdf titlecase.txt

This bash script generates bookmarks automatically, as shown in the screenshot on the right. On the left, PDFSam Basic lets you choose whether to retain existing bookmarks or not. Here the Bible already had bookmarks, so it retains them.

Quite an impressive script for getting bookmarks quickly. I'll spend a little time trying to figure out how to retain bookmarks, but most likely that's over my head.

It combines multiple PDFs into a single PDF.
It creates one bookmark (the filename) for each PDF in the directory.
Unlike PDFSam, it won't retain existing bookmarks.
Author: Mateen Ulhaq

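#!/bin/bash
# Combine all PDFs in the current directory into one PDF,
# with one top-level bookmark (the filename) per source PDF.

out_file="combined.pdf"
tmp_dir="/tmp/pdftk_unite"
bookmarks_file="$tmp_dir/bookmarks.txt"
# pdftk bookmark record; %s is replaced by the title
bookmarks_fmt="BookmarkBegin
BookmarkTitle: %s
BookmarkLevel: 1
BookmarkPageNumber: 1
"

# start from an empty working directory
rm -rf "$tmp_dir"
mkdir -p "$tmp_dir"

for f in *.pdf; do
    # skip the result of an earlier run
    [ "$f" = "$out_file" ] && continue
    echo "Bookmarking $f..."
    title="${f%.*}"    # filename without the extension
    printf "$bookmarks_fmt" "$title" > "$bookmarks_file"
    pdftk "$f" update_info "$bookmarks_file" output "$tmp_dir/$f"
done

# concatenate the bookmarked copies
pdftk "$tmp_dir"/*.pdf cat output "$out_file"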
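
The same bookmark record can be repeated to describe several bookmarks in one file. Here is a minimal sketch in Python that builds such a file from chapter lines. The input layout (one `Title, page` per line), the page numbers, and the function name `pdftk_bookmarks` are my assumptions for illustration; this is not what booky.sh or the script above does internally.

```python
def pdftk_bookmarks(lines):
    """Turn 'Title, page' lines into pdftk update_info bookmark records."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue                              # skip blank lines
        title, _, page = line.rpartition(",")     # split at the LAST comma
        records.append(
            "BookmarkBegin\n"
            f"BookmarkTitle: {title.strip()}\n"
            "BookmarkLevel: 1\n"
            f"BookmarkPageNumber: {int(page)}\n"  # int() tolerates spaces
        )
    return "".join(records)


# hypothetical chapter list; the page numbers are invented
chapters = [
    "3: A Victory on a Massive Scale, 12",
    "25 - The Conquest of the World, 140",
]
print(pdftk_bookmarks(chapters))
```

Saved to a text file, the output can be applied with the same pdftk update_info call the script uses. A line without a trailing page number raises ValueError instead of producing a broken record.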

This is pretty cool... In 2015 Adobe renamed ClearScan to Editable Text and Images. It basically smooths (anti-aliases) the edges of the text and vectorizes it. I spent many hours trying to get this to work and will spare you the details of everything I tried, but this is what actually worked:

This downloads about 700MB and installs 1.8GB of TeX Live, the LaTeX distribution recommended by pdfsak. You may skip the fonts and extra packages and see if it works, but texlive-latex-base and potrace are required for sure.

sudo apt install texlive-latex-base texlive-fonts-recommended texlive-fonts-extra texlive-latex-extra

sudo apt install potrace

Download the AppImage of magick (ImageMagick).
Right-click it, open Permissions, and check the box that allows executing it.

sudo cp magick /usr/bin/

Now install pdfsak with pip:

pip3 install --upgrade pdfsak

Better PDF OCR: ClearScan is smaller, looks better

In this tutorial, learn about the advantages and disadvantages of ClearScan over Searchable Image OCR in Acrobat 9.

pdfsak -if input.pdf -o clearscan.pdf --clearscan

screenshots:

original on top (blurry)
clearscan on bottom

A few years ago, back when I was on a Mac, I used Adobe Acrobat DC Pro and remember it could replace the image with text by creating a custom font on the fly, giving a much, much smaller file. You still need to go through manually and check the accuracy of the words, so it's still somewhat time-consuming. pdfsak doesn't do that, but it still really helps clean up old PDFs.

The screenshot shows a zoomed-in view after clearscan. Notice the many tiny dots.

It's best to despeckle it (remove the tiny dots) in ScanTailor Advanced (unpaper is over my head). Some PDFs can go through clearscan first and then be fed to ScanTailor, but old ones with a crappy background, double pages, etc. have to go through ScanTailor first. Then img2pdf the result into a PDF, clearscan it, send it back to ScanTailor to despeckle, and finally run ocrmypdf.

A despeckle level of 2.5 in ScanTailor, used with clearscan, looks like it removes ALL dots. 2.0 removed almost all dots (marked in red), say 95%, but not all. The default is 1 and the max is 3.

This one is the true test.

screenshot
top one is ScanTailor only

The bottom one is clearscan (not cleanscan; I get it confused with ocrmypdf, which has a --clean option) and then processed with ScanTailor. The file size is a tad bigger, but I believe the improvement is worth it.

ScanTailorOnly.pdf

ScanTailorOnly.pdf is 109KiB
ClearScan-ScanTailor.pdf is 272KiB

I need to do this on an entire book to see if it's double or triple in size. If so, I guess it may not be worth it.
edit: ClearScan not CleanScan
edit2: seems like the only time I won't use ClearScan is when the image is rather good to start with, meaning the text in the image isn't blurred all that much.

CleanScan-ScanTailor.pdf

This PDF is already OCR'd, but many times a PDF isn't, or perhaps you just wish to OCR an image (here, one about anti-termites). Multi-column text is tricky to copy, both from a PDF and from image OCR.

The cool thing here is that the nightly build of Xournalpp, the PDF annotation and note-taking app, now has Select Text, either Linear or in a Rectangle. I just downloaded the AppImage and it works great.

NORMCAP and TextSnatcher both OCR images. Looks like NORMCAP does a tad better especially with line breaks. Both are good to have at your disposal though.

A PDF with scanned images of varying resolutions won't work in ScanTailor: it creates chaos and havoc with Margins and even with Select Content.

PDF with non-uniform images. To solve this problem, resize the larger images down to the size of the smaller ones, maintaining aspect ratio.

Smaller ones (just a few examples):
1501 x 1152
1491 x 1155
1547 x 1159
1514 x 1159

Larger ones (just a few examples):
4649 x 3610
4641 x 3647
4677 x 3597
4849 x 3819

The problem is that in ScanTailor Advanced the smaller images remain small compared to the larger-resolution images after setting margins. This PDF is landscape, with two pages scanned per page.

A4 = 210mm x 297mm
A4 + A4 = A3 = 297mm x 420mm
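
To see how far apart the two groups of scans really are, here is a quick back-of-the-envelope sketch. It assumes each landscape scan covers two A4 pages side by side (2 x 210mm = 420mm wide) and that the first number in the size lists is the width; `effective_dpi` is a name made up for illustration.

```python
MM_PER_INCH = 25.4


def effective_dpi(pixels, millimetres):
    """Pixels per inch when `pixels` span a physical length of `millimetres`."""
    return pixels / (millimetres / MM_PER_INCH)


# one width from each list above; 420 mm is the assumed sheet width
for width_px in (1501, 4649):
    print(width_px, "px ->", round(effective_dpi(width_px, 420)), "dpi")
```

Under those assumptions the small scans come out around 91 dpi and the large ones around 281 dpi, roughly a factor of three apart.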

I thought just using the -scale-to option would do the trick, but it didn't.

pdftoppm -r 300 -scale-to 842 -tiff -tiffcompression deflate nonuniformimages.pdf dump/img

So instead I just did this and didn't specify the DPI:

pdftoppm -tiff -tiffcompression deflate nonuniformimages.pdf dump/img

Sort by size to select the larger-resolution images and move them to a new directory.
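
I did this by hand in the file manager, but it could be scripted. Here is a minimal sketch that uses file size as a stand-in for resolution, which only holds when all pages use the same compression. The function name, the directory names in the usage comment, and the byte threshold are hypothetical.

```python
import shutil
from pathlib import Path


def move_large_files(src_dir, dst_dir, min_bytes, pattern="*.tif"):
    """Move files matching `pattern` that are at least `min_bytes` big.

    Returns the names of the moved files, largest first.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    moved = []
    # largest first, like sorting by size in a file manager
    by_size = sorted(src.glob(pattern), key=lambda p: p.stat().st_size, reverse=True)
    for path in by_size:
        if path.stat().st_size < min_bytes:
            break  # everything after this one is smaller too
        shutil.move(str(path), str(dst / path.name))
        moved.append(path.name)
    return moved


# Example call (the threshold is a guess; check your own file sizes first):
# move_large_files("dump", "dump3", 5_000_000)
```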

mogrify modifies the existing images in place, so make a quick copy of them just in case. ImageMagick has no -compress value named deflate (Zip is its deflate equivalent), so I used LZW.

-resize 555 sets the width and -resize x555 sets the height; the other dimension follows from the aspect ratio.

mogrify -resize x1172 -compress LZW dump3/*.tif

They maintained their aspect ratio, and the new resolutions of a few of them are:
1510 x 1172
1519 x 1172
1516 x 1172
Copy these back into your dump/ directory and ScanTailor will process them just fine now.
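
The arithmetic behind a height-only resize is simple enough to check by hand. A small sketch, assuming plain rounding to the nearest pixel (ImageMagick's exact rounding may differ by a pixel); `resize_to_height` is a hypothetical helper, and the sizes are the larger examples listed earlier.

```python
def resize_to_height(width, height, new_height):
    """Scale (width, height) to `new_height`, keeping the aspect ratio."""
    new_width = round(width * new_height / height)
    return new_width, new_height


for w, h in [(4649, 3610), (4641, 3647), (4677, 3597), (4849, 3819)]:
    print((w, h), "->", resize_to_height(w, h, 1172))
```

The computed widths land close to those of the smaller scans, which is why ScanTailor treats the pages alike afterwards.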

You can try to align chapter numbers by adding blank pages or deleting some unwanted pages so the page numbers in the table of contents match the actual PDF page numbers.

Or you can use vim and a plugin called vim-NumUtils.

sudo apt install vim

Download vim-NumUtils and, after unzipping, put the doc/ and plugin/ directories into ~/.vim. It provides:

NumUtilsAdd, NumUtilsSub, NumUtilsMul, NumUtilsDiv

Here all the actual PDF page numbers are 3 pages behind the listed page numbers.

:% NumUtilsSub 3, '\,'

Sometimes I type a space after the comma, so to get those too, do:

:% NumUtilsSub 3, '\, '
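
If you'd rather not touch vim, the same shift can be sketched in Python. This is a rough analogue, not the plugin's implementation: it changes every whole number that directly follows the separator (with an optional space), and the function name and sample lines are made up for illustration.

```python
import re


def shift_page_numbers(text, offset, separator=","):
    """Add `offset` to every integer that directly follows `separator`."""
    pattern = re.compile(re.escape(separator) + r"( ?)(\d+)")

    def replace(match):
        space, number = match.group(1), int(match.group(2))
        return f"{separator}{space}{number + offset}"

    return pattern.sub(replace, text)


# made-up table of contents; the listed pages are 3 ahead of the PDF pages
toc = "Introduction,4\nThe Early Years, 15\n"
print(shift_page_numbers(toc, -3))
```

Watch out for titles that themselves contain a comma followed by a number, since those would be shifted too.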

Quick vim commands
:wq - write/save file and quit
:q - quit
u - undo
Ctrl-R - redo
i - insert mode (ESC to leave insert mode)
a - append text
% is a range for entire document
Great vim guide
