GeekTips
Linux Mint, video encoding, ffmpeg, geek tips, regex, pdf manipulation, substitcher, mpv config
Regex search and replace that removes the digits mixed with periods / dots (like 00.34.77) which are timestamps added to filenames by LosslessCut.
You need to use a . for each character instead of using a quantifier. I used 23 or 24 periods.
-(\d)........................

this also works
-(\d+).(\d+).(\d+).(\d+).(\d+).(\d+).(\d+).(\d+)
(\D)-(\d)........................
$1
There are multiple dashes in the filename, so (\D) matches a non-digit (a letter like a, b, c, d, etc.) followed by a dash, and the replacement $1 puts that letter back. Otherwise the last letter gets chopped off at the end of the filename.

If there is a numeral at the end, like Part 1, change the first (\D) to lowercase (\d) so that it matches a digit, like so
(\d)-(\d)........................
\1
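The same cleanup can be sketched in Python with an explicit pattern instead of counted dots. This assumes the LosslessCut suffix has the form -HH.MM.SS.mmm-HH.MM.SS.mmm (a dash plus 25 characters, which is what the 24 dots after (\d) add up to); the filenames are made up for illustration. Because the dots and digit counts are spelled out, the same pattern handles names ending in a letter or a numeral.

```python
import re

# Assumed LosslessCut suffix: -HH.MM.SS.mmm-HH.MM.SS.mmm
TIMESTAMP = re.compile(r"-\d{2}\.\d{2}\.\d{2}\.\d{3}-\d{2}\.\d{2}\.\d{2}\.\d{3}")


def strip_timestamps(filename: str) -> str:
    """Remove every LosslessCut-style timestamp pair from a filename."""
    return TIMESTAMP.sub("", filename)


# Hypothetical filenames: one ending in a letter, one ending in a numeral.
print(strip_timestamps("My-Video-00.00.34.770-00.01.22.330.mp4"))
print(strip_timestamps("Lecture Part 1-00.00.00.000-00.10.05.250.mp4"))
```

Other dashes in the name are left alone because they are not followed by the full timestamp shape.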
This book I'm making into an audiobook but the original OCR on the document is pretty much impossible to correct.

pdftotext -layout book.pdf output.txt
so I had to force OCR on it
ocrmypdf --force-ocr book.pdf book_ocr.pdf
and now it's a tad better. Formatting isn't all that important for text to speech.
Removing hyphens from words hyphenated at the end of a line. For the text to speech to work correctly, defi- nitely needs to become definitely, but non-partisanship must not change as it's correct as it is. Search for the following and replace with nothing.
-$\n\s+
- is a dash
$ says at the end of a line
\n line break
\s is whitespace (blank spaces)
\s+ any amount of whitespace
It didn't catch im- portance or cir- cumstances, as there wasn't any whitespace after the line break. So search and replace all again using -$\n
Now importance and circumstances are correct and non-partisanship isn't changed. Now I just have to spell check it before feeding it to ttstool (text to speech).
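Both passes can be folded into one by making the whitespace optional (\s* instead of \s+). A minimal Python sketch, assuming the text uses plain \n line breaks; the sample text is made up.

```python
import re


def join_hyphenated_line_breaks(text: str) -> str:
    """Join words split by a dash at the end of a line."""
    # "-" at end of line, the line break, then any leading whitespace
    # (possibly none) on the next line.
    return re.sub(r"-$\n\s*", "", text, flags=re.MULTILINE)


sample = "It is defi-\n   nitely of im-\nportance, and non-partisanship stays."
print(join_hyphenated_line_breaks(sample))
```

A word that is genuinely hyphenated and happens to break right at its dash would be joined too, so a spell check afterwards is still worth doing.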
Batch compressed each PDF by about 75%.

Made a subdirectory named output, then ran
parallel --tag -j 2 ocrmypdf -s -O 2 --skip-big .1 '{}' 'output/{}' ::: *.pdf

Got an out-of-memory (Java heap) error in PDFSam when trying to process a ton of PDFs. So start PDFSam this way, using 2.4GB of memory instead of the 512MB default.

java -Xmx2400m -jar /opt/pdfsam-basic/pdfsam-basic-4.3.0.jar

As to why PDFSam isn't compressing even with PDF 1.5 checked I have no idea. Thus it's necessary to use ocrmypdf to do the compression.
One PDF had the years 1960 to 1985, and if merged it would have a single entry in the Table of Contents named 1960-1985. I wanted each year from 1960 on to have its own link in the TOC (table of contents).

The solution was to Split the PDF by Bookmark with PDFSAM.
To Split by Bookmark, choose level 1, and in File names settings right click, choose [BOOKMARK_NAME] and delete PDF_SAM
Putting spaces between joined capitalized words with regex — renaming files

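TheParableofTheFigTree rename it to The Parable of The Fig Tree (the lowercase "of" has no capital letter to split on, so the space before it still has to be added by hand)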

Search and replace all using
([a-z])([A-Z])
replace all with
$1 $2

[a-z] lowercase letters
([a-z]) groups that single letter
[A-Z] uppercase letters
$1 is the 1st captured group and $2 is the 2nd

Make sure Case Sensitive Search is checked
In the tool I used, the search and replace must be repeated for the 2nd instance, then once again for the 3rd instance, and so on.
Now do the same for numbers and years, etc.

search using
([a-z])([0-9])
replace all with
\1 \2

[0-9] for numbers
$1 $2 or \1 \2 are the same here...use either syntax
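The two passes above look like this in Python, as a rough sketch. Python's re.sub replaces every non-overlapping match in one call, so no repeating is needed there, and it only accepts the \1 \2 replacement syntax. The function name and sample filenames are made up.

```python
import re


def space_out(name: str) -> str:
    """Insert a space at each lowercase->uppercase and lowercase->digit boundary."""
    name = re.sub(r"([a-z])([A-Z])", r"\1 \2", name)  # case sensitive by default
    name = re.sub(r"([a-z])([0-9])", r"\1 \2", name)
    return name


print(space_out("TheParableofTheFigTree"))
print(space_out("Newsletter1985"))
```

The first call prints The Parableof The Fig Tree: a lowercase word such as "of" gives the pattern nothing to split on.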
Beavis and Butthead learn of their White Privilege

original video re-encoded to 720p but has a low volume

ffmpeg -i beavisprivilege_original.mp4 -af volumedetect -f null /dev/null
shows the following
mean_volume: -32.7 dB
max_volume: -7.6 dB
To boost the audio volume by 12 dB (make sure to capitalize the B), add -af "volume=12dB" just before the -vf (video filter) option. -af stands for audio filter.

If you just wish to re-encode the audio without re-encoding the video again, change the following
-c:v hevc -crf 28 -c:a libopus -b:a 16k -af "volume=12dB" -vf scale="-2:720"
to this, which copies the video stream instead of re-encoding it
-c:v copy -c:a libopus -b:a 16k -af "volume=12dB"
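The options above are only the middle of the command. A small Python sketch of how the full invocation might be assembled; the input and output filenames are placeholders, and it assumes the output container accepts Opus audio.

```python
import shlex


def boost_audio_command(src: str, dst: str, gain_db: int = 12, bitrate: str = "16k") -> list[str]:
    """Build an ffmpeg command that copies the video and re-encodes louder audio."""
    return [
        "ffmpeg", "-i", src,
        "-c:v", "copy",             # keep the video stream as it is
        "-c:a", "libopus",          # re-encode the audio with Opus
        "-b:a", bitrate,
        "-af", f"volume={gain_db}dB",
        dst,
    ]


# Placeholder filenames; pass the list to subprocess.run() to actually run it.
print(shlex.join(boost_audio_command("input.mp4", "output.mp4")))
```

Note that -vf is dropped entirely: a video filter can't be combined with -c:v copy.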
This is the video with the boosted audio volume. Running
ffmpeg -i beavisprivilege_12dB+.mp4 -af volumedetect -f null /dev/null

shows the following
mean_volume: -20.8 dB
max_volume: 0.0 dB
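The numbers can be checked with simple arithmetic, since a gain in dB just adds to the measured levels. A quick Python sketch using the volumedetect figures from the original file (the function names are made up):

```python
def after_gain(mean_db: float, max_db: float, gain_db: float) -> tuple[float, float]:
    """Predicted mean and peak level after applying a gain in dB."""
    return round(mean_db + gain_db, 1), round(max_db + gain_db, 1)


def max_gain_without_clipping(max_db: float) -> float:
    """Largest gain that keeps the peak at or below 0 dB."""
    return round(-max_db, 1)


print(after_gain(-32.7, -7.6, 12))       # predicted (mean, peak) for the 12 dB boost
print(max_gain_without_clipping(-7.6))   # headroom in the original file
```

The predicted mean of -20.7 dB is within 0.1 dB of the measured -20.8 dB. The predicted peak is +4.4 dB, above full scale, so the measured max of 0.0 dB suggests the loudest peaks are probably being clipped. A boost of about 7.6 dB would have been the most this file could take without that.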
ocrmypdf -O 3 --deskew input.pdf output.pdf

The --deskew option straightens out crooked pages in the PDF

batch process PDFs using OCR
parallel --tag -j 2 ocrmypdf -O 3 --deskew '{}' 'output/{}' ::: *.pdf
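If GNU parallel isn't available, roughly the same batch can be sketched in Python. This assumes ocrmypdf is on the PATH and uses a thread pool of 2 workers to mirror -j 2; the function names are made up.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def ocr_command(pdf: Path, out_dir: Path) -> list[str]:
    """Same options as the parallel one-liner, for a single PDF."""
    return ["ocrmypdf", "-O", "3", "--deskew", str(pdf), str(out_dir / pdf.name)]


def batch_ocr(src_dir: str = ".", out_dir: str = "output", jobs: int = 2) -> dict[str, int]:
    """Run ocrmypdf on every PDF in src_dir, at most `jobs` at a time."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    pdfs = sorted(Path(src_dir).glob("*.pdf"))
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        codes = pool.map(lambda p: subprocess.run(ocr_command(p, out)).returncode, pdfs)
        # Map each filename to its exit code (0 means success).
        return dict(zip((p.name for p in pdfs), codes))
```

Calling batch_ocr() from the directory holding the PDFs writes the results into output/ and returns the exit code for each file.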
Use PDF Arranger and use SHIFT to select the range of each magazine issue. CTRL-E exports the selection to a single PDF. Rename each PDF to the year and month of the magazine. Then you can use PDFSAM to generate an index / table of contents.

Sidenote: PDF Slicer also works but is more cumbersome for this particular task, as it requires you to select the issue, then CTRL-I (invert selection), then delete all the selected pages. Then Save as... and once done, undo and wait for the thumbnails to be re-generated.
Convert an epub to PDF with an Outline using the free Calibre command line.

ebook-convert input.epub output.pdf --base-font-size=13 --change-justification=justify --embed-font-family=freesans

I prefer justification rather than left alignment and hyphenation.