GeekTips
Linux Mint, video encoding, ffmpeg, geek tips, regex, pdf manipulation, substitcher, mpv config
This PDF is already OCR'd, but many times it isn't, or perhaps you just wish to OCR an image of anti-Termites. Multi-column text is tricky to copy, both from a PDF and from image OCR.
The cool thing here is that the nightly build of Xournalpp, the PDF annotation and note-taking app, now has Select Text in either Linear or Rectangle mode. I just downloaded the AppImage and it works great.
NORMCAP and TextSnatcher both OCR images. Looks like NORMCAP does a tad better especially with line breaks. Both are good to have at your disposal though.
PDF with scanned images that have varying image resolutions. This won't work and will create chaos and havoc in ScanTailor, even with Margins and Select Content.
PDF with non-uniform images. To solve this, resize the larger images down to the size of the smaller ones while maintaining aspect ratio.

Smaller ones (just a few examples):
1501 x 1152
1491 x 1155
1547 x 1159
1514 x 1159

Larger ones (just a few examples):
4649 x 3610
4641 x 3647
4677 x 3597
4849 x 3819

The problem is that in ScanTailor Advanced the smaller images remain small compared to the larger-resolution images after setting margins. This PDF is landscape, with two book pages scanned per PDF page.

A4 = 210mm x 297mm
A4 + A4 = A3 = 297mm x 420mm

I thought the -scale-to option would do the trick, but it didn't.

pdftoppm -r 300 -scale-to 842 -tiff -tiffcompression deflate nonuniformimages.pdf dump/img

So instead I just did this, without specifying a dpi:
pdftoppm -tiff -tiffcompression deflate nonuniformimages.pdf dump/img
Then sort by size to select the larger-resolution images and move them to a new directory.
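
The sorting step can also be scripted. This is only a minimal sketch: it assumes ImageMagick's identify is installed, that the filenames have no spaces (pdftoppm's don't), and that anything taller than 2000 px counts as "large". The function name pick_large and the 2000 px threshold are my own placeholders.

```shell
# pick_large: read "width height filename" lines on stdin and print the
# filenames whose height exceeds a threshold (default 2000 px, an assumed
# value that sits between the small and large groups listed above).
pick_large() {
  awk -v min="${1:-2000}" '$2 > min { print $3 }'
}

# Hypothetical usage with ImageMagick (left as a comment, not run here):
#   mkdir -p dump3
#   identify -format '%w %h %i\n' dump/*.tif | pick_large 2000 |
#     while IFS= read -r f; do mv -- "$f" dump3/; done
```

Feeding it the identify output keeps the selection reproducible instead of eyeballing a file manager's size column.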

mogrify modifies the existing images, so make a quick copy of them just in case. ImageMagick has no compression type called deflate (its Zip type is the deflate one), so I used LZW.

-resize 555 resizes width and -resize x555 does height.

mogrify -resize x1172 -compress LZW dump3/*.tif

They maintained their aspect ratio; the new resolutions of a few of them are:
1510 x 1172
1519 x 1172
1516 x 1172
Copy these back into your dump/ directory and ScanTailor will process them just fine now.
You can try to align chapter numbers by adding blank pages or deleting unwanted pages so the table-of-contents numbers match the actual PDF page numbers.
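
To see what width a height-only resize should give, the aspect-ratio arithmetic can be checked by hand. A small sketch, assuming plain rounding to the nearest pixel (ImageMagick's own rounding may differ by a pixel); scaled_width is a made-up helper name.

```shell
# scaled_width WIDTH HEIGHT TARGET_HEIGHT
# Prints the width that keeps the aspect ratio when the height is changed
# to TARGET_HEIGHT, rounded to the nearest pixel.
scaled_width() {
  awk -v w="$1" -v h="$2" -v t="$3" 'BEGIN { printf "%d\n", w * t / h + 0.5 }'
}

# One of the larger scans from the list above, resized to a height of 1172:
scaled_width 4649 3610 1172   # prints 1509
```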

Or you can use vim and a plugin called vim-NumUtils.
sudo apt install vim
Download vim-NumUtils and, after unzipping, put the doc/ and plugin/ directories into ~/.vim. It provides NumUtilsAdd, NumUtilsSub, NumUtilsMul and NumUtilsDiv.

Here all the actual PDF page numbers are 3 pages behind the listed page numbers.

:% NumUtilsSub 3, '\,'
sometimes I type a space after the , so to get those too do
:% NumUtilsSub 3, '\, '

Quick vim commands
:wq - write/save file and quit
:q - quit
u - undo
Ctrl-R - redo
i - insert mode (ESC to quit insert mode)
a - append text
% - range covering the entire document
Great vim guide
{
foreword by the author,6
{
poem:,
first, my country, 15
this is my own, my dear, and native land,16
god made me free, 21
to the stars and stripes,33

Renumber chapter page numbers in a text file with math operations, this time with sed and awk.
Sometimes I'll type a space after the , so the first substitution removes a space between a comma and a digit. The second substitution, separated by a ; then turns the comma in front of the page number into a %.
sed -i -r 's/(, )([0-9])/,\2/; s/,([0-9])/%\1/' toc.txt

-i edits toc.txt in place
-r extended regex, so capture groups don't need escaping
s/(, )([0-9])/,\2/ finds a comma and a space followed by a digit, then puts back the comma plus \2, the 2nd capture group (that digit), which drops the space
s/,([0-9])/%\1/ finds a comma directly followed by a digit and replaces the comma with %, re-inserting \1, the first capture group (just that single digit)
Commas inside chapter names are followed by a space and a letter, not a digit, so they are left untouched.
{
foreword by the author%6
{
poem:,
first, my country%15
this is my own, my dear, and native land%16
god made me free%21
to the stars and stripes%33
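
To try the two substitutions without touching a real toc.txt, here is a small sketch that runs the same sed expressions on stdin. mark_page_commas is a name I made up, and -E is used as the portable spelling of -r.

```shell
# mark_page_commas: same two substitutions as above, but reading stdin
# instead of editing a file in place.
mark_page_commas() {
  sed -E 's/(, )([0-9])/,\2/; s/,([0-9])/%\1/'
}

printf '%s\n' 'first, my country, 15' 'poem:,' 'god made me free, 21' | mark_page_commas
# prints:
#   first, my country%15
#   poem:,
#   god made me free%21
```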

Now subtract 3 from the chapter page numbers
awk -F% '{if (/%/) {print $1 "," $2-3} else {print $0}}' toc.txt > renumbered.txt

-F% sets the field separator to %. A , would work, but chapter titles sometimes contain a , which screws up the fields, so it needs to be something else like %
if (/%/) if there's a % on the line then
print $1 prints field 1, which is the text up to the %
"," prints a comma between $1 and $2 so it's ready for titlecase
$2-3 prints the number after the % minus 3. You can also do + addition, * multiplication and / division
else print $0 prints the entire line if there's no match

changes to:
{
foreword by the author,3
{
poem:,
first, my country,12
this is my own, my dear, and native land,13
god made me free,18
to the stars and stripes,30
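
The offset doesn't have to be hard-coded. A sketch of the same awk step with the offset passed in as an argument; shift_pages is a hypothetical name and it reads the %-marked lines from stdin.

```shell
# shift_pages OFFSET
# Lines containing % are split at the %, and the page number after it is
# shifted by OFFSET (negative to subtract). Other lines pass through as-is.
shift_pages() {
  awk -F% -v off="$1" '{ if (/%/) print $1 "," ($2 + off); else print $0 }'
}

printf '%s\n' 'foreword by the author%6' 'poem:,' 'first, my country%15' | shift_pages -3
# prints:
#   foreword by the author,3
#   poem:,
#   first, my country,12
```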

now it's ready for title case
titlecase -f renumbered.txt -o chapters.txt
changes to:
{
Foreword by the Author,3
{
Poem:,
First, My Country,12
This Is My Own, My Dear, and Native Land,13
God Made Me Free,18
To the Stars and Stripes,30

combine all three commands on one line
sed -i -r 's/(, )([0-9])/,\2/; s/,([0-9])/%\1/' toc.txt && awk -F% '{if (/%/) {print $1 "," $2-3} else {print $0}}' toc.txt > renumbered.txt &&  titlecase -f renumbered.txt -o chapters.txt

then booky to add chapters to pdf
booky.sh SomeBook.pdf chapters.txt

If you want to offset the page numbers by +5, just change $2-3 to $2+5
sed -i -r 's/(, )([0-9])/,\2/; s/,([0-9])/%\1/' toc.txt && awk -F% '{if (/%/) {print $1 "," $2+5} else {print $0}}' toc.txt > renumbered.txt &&  titlecase -f renumbered.txt -o chapters.txt

changes to:
{
Foreword by the Author,11
{
Poem:,
First, My Country,20
This Is My Own, My Dear, and Native Land,21
God Made Me Free,26
To the Stars and Stripes,38
Tree Style Tab Firefox/LibreWolf extension. I never have that many tabs open, but today I did as the internet was slow and I didn't want to bookmark them (which I ended up having to do anyway). I tried various tab-coloring extensions and ended up liking only this one, which is for Tree Style Tab.
Color Tabs settings: anything that starts with a, b or c would be green, d, e or f blue, etc.
https?:\/\/(www\.)?[-a-c]
https?:\/\/(www\.)?[-d-f]
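
To check which color a URL would land in before pasting the patterns into the extension, they can be tested with grep. A rough sketch: tab_color is a made-up helper, the \/ escapes are dropped because grep -E doesn't need them, and the green/blue mapping is the one described above.

```shell
# tab_color URL
# Prints the color group for a URL: green when the host (after an optional
# "www.") starts with a, b, c or "-", blue for d, e, f or "-", else none.
tab_color() {
  if printf '%s\n' "$1" | grep -Eq 'https?://(www\.)?[-a-c]'; then
    echo green
  elif printf '%s\n' "$1" | grep -Eq 'https?://(www\.)?[-d-f]'; then
    echo blue
  else
    echo none
  fi
}

tab_color 'https://archive.org'      # prints green
tab_color 'https://www.debian.org'   # prints blue
```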
Stumbled upon Sidebery and I prefer this one over Tree Style Tab. Sidebery has a bookmark side panel too. I'm aware Tree Style Tab has a bookmark extension.

You can create more panels for your tabs. It has a modern interface and feel to it. It's one of those extensions I can live without, but I'll try it out. One huge thing for me is that hovering over Sidebery bookmarks also displays the full URL, unlike with regular tabs.
Sidebery: the only settings I changed to get it to my liking were Settings | Appearance | Color Scheme: dark

Settings | Styles editor | Tabs
Background color on hover (brown) #63452cff
Active tab background color (purple) #613583ff

Settings | Styles editor | Bookmarks
Bookmark background color on hover (brown) #63452cff
Bookmark background color on click (purple) #613583ff
Closed folder color (blue) #62a0eaff
Expanded folder color (orange) #ff7800ff (not working now..bug)
Modifying opus chaptered audiobooks

Linux apps (some are multiplatform) that can:

add or remove cover image
tageditor
puddletag
kid3

Modify existing opus chapter names from opus audiobook
MusicBrainz Picard
Ex Falso
puddletag (Edit | Extended tags)
kid3

add metadata tag from filename (so one can name chapter names from filenames before adding to freac)
puddletag
functions: choose Filename to Tag
Pattern: %title%

kid3
Format: (up arrow) %f
Format: (down arrow) %{title} press Tag 2

Ex Falso
Tags from Path: put <title>. You can also save it as a pattern named whatever you like.

MusicBrainz Picard
Options | Options | User Interface
add Parse File Names
then just select all Ctrl-A and click Parse File Names


Since puddletag and kid3 can do all three, I'll quickly show them.
puddletag: right click and choose Extended Tags, or press Ctrl-E. Each chapter name you edit will open a popup dialog. Deleting or removing a cover image is quite easy.
puddletag naming chapters based on filename. Click function, choose Filename to Tag and set Pattern: %title%. Files can be batch renamed like this.
kid3 edit opus chapter names
kid3 add or delete a cover image
Smart Title Case Converter. This is the best online one and it does a great job. It doesn't matter all that much if you use AP, Bluebook, Chicago... just avoid NYTimes as that has weird exceptions.

For an explanation of which words NOT to capitalize see https://titlecaseconverter.com/words-to-capitalize/

So if I want to rename mp3 files with Smart Title Case before adding them to freac to make an opus chaptered audiobook, it's not so easy. Sure, any file manager can Title Case them, but not Smart Title Case them. Ex Falso has a plugin for Human Title Case, but it only works on one file at a time. So here's the bash script I wrote to do that. The end part, renaming old files to new, I found online.
#!/bin/bash
# list the mp3 files in numeric order
ls *.mp3 | sort -n > filenames
# put a : after the leading 1-3 digit number and strip the .mp3 extension
sed -i -r 's/^([0-9]{1,3})/\1:/; s/\.mp3$//' filenames
# Smart Title Case every line
titlecase -f filenames -o new
# put the .mp3 extension back
sed -i 's/$/.mp3/' new
ls *.mp3 | sort -n > old
# read the old and new names in parallel (fd 3 and fd 4) and rename
while IFS= read -r old <&3 && IFS= read -r new <&4; do
    mv -i -- "$old" "$new"
done 3< old 4< new
rm filenames new old
try this in a test directory
touch '1 this is to test out chapters.mp3' '2 lord of the rings.mp3' '3 the sun rises in the west.mp3' '44 of all the various species.mp3' '45 never again but tomorrow.mp3' '101 should work till 999 king louis xiv here.mp3' '241 but will not work with period between numbers.mp3'

I have these mp3 files that look like so:
1 this is to test out chapters.mp3
2 lord of the rings.mp3
3 the sun rises in the west.mp3
44 of all the various species.mp3
45 never again but tomorrow.mp3
101 should work till 999 king louis xiv here.mp3
241 but will not work with period between numbers.mp3

And the goal is to get them to be like so:
1: This Is to Test Out Chapters.mp3
2: Lord of the Rings.mp3
3: The Sun Rises in the West.mp3
44: Of All the Various Species.mp3
45: Never Again but Tomorrow.mp3
101: Should Work Till 999 King Louis XIV Here.mp3
241: But Will Not Work With Period Between Numbers.mp3

1.2 some chapter
1.3 some chapter
these won't work as it'll put 1:.2 and 1:.3
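
The numbering step can be tried on its own to see both the normal case and the 1.2 failure. A sketch using only the sed part of the script (no titlecase needed); prep_name is a made-up name, -E is the portable spelling of -r, and the .mp3 match is anchored to the end of the name.

```shell
# prep_name: add a : after a leading 1-3 digit number and strip a trailing
# .mp3, reading names from stdin.
prep_name() {
  sed -E 's/^([0-9]{1,3})/\1:/; s/\.mp3$//'
}

printf '%s\n' '44 of all the various species.mp3' '1.2 some chapter.mp3' | prep_name
# prints:
#   44: of all the various species
#   1:.2 some chapter
```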

You must add a : or a - after the number, otherwise titlecase won't capitalize the word after the number.
45 the flowers
becomes
45 the Flowers
whereas
45: the flowers
becomes
45: The Flowers

It will capitalize xiv, but not if it's at the end. I'll share the titlecase.txt file that I keep as ~/.titlecase.txt for the titlecase script.

Just name this script chaptersrename.sh
chmod a+x chaptersrename.sh
sudo cp chaptersrename.sh /usr/bin

Also put echo before mv, like echo mv, if you just wish to preview the changes. For other formats you'd need to replace mp3 everywhere it appears in the script, or better, just rename your file extensions from say m4a to mp3 temporarily.
titlecase.txt
1.8 KB
This is the ~/.titlecase.txt file I use for exceptions for the titlecase script. A huge list of roman numerals: I II I. II. I, II,
Remember though, if the numeral is at the end of the chapter name it won't be converted:
1: the life of louis xiv
1: The Life of Louis Xiv

5: the life of Louis xiv and other stuff
5: The Life of Louis XIV and Other Stuff
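
One possible workaround for the end-of-line case is to uppercase a trailing roman numeral after titlecase has run. This is only a sketch and not part of the titlecase script: fix_trailing_roman is a made-up name, it only recognizes numerals 1-99, and it will also uppercase a real last word that happens to look like one (Vi, Xi, Li).

```shell
# fix_trailing_roman: if the last word of a line looks like a roman numeral
# between 1 and 99 (case-insensitive), print the line with it uppercased.
fix_trailing_roman() {
  awk '{
    n = split($0, w, " ")
    if (n > 1 && tolower(w[n]) ~ /^(xc|xl|l?x?x?x?)(ix|iv|v?i?i?i?)$/) {
      sub(/[^ ]+$/, toupper(w[n]))   # replace the last word in $0
    }
    print
  }'
}

printf '%s\n' '1: The Life of Louis Xiv' '2: Lord of the Rings' | fix_trailing_roman
# prints:
#   1: The Life of Louis XIV
#   2: Lord of the Rings
```

In the rename script this would fit between the titlecase line and the line that puts .mp3 back, while the numeral is still the last thing on the line.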

a small preview of text

i.e.
e.g.
etc.
#Roman Numerals 1-99 and . + , after
I
II
III
IV
V
VI
VII
VIII
IX
X
XI
XII
XIII
XIV
XV