OCR to produce .srt subtitles

I recently tinkered with Torbjørn Pedersen’s (National Library of Norway) Python script video-ocr2srt to extract burnt-in English subtitles from a digital video. The script performs optical character recognition (OCR) on video files and generates a .srt subtitle file along with a detailed JSON file.
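For anyone unfamiliar with the .srt format, here is a minimal sketch of how a single subtitle cue is laid out: a running number, an in and out point written as HH:MM:SS,mmm, the text, and a blank line. The helper names format_srt_time and format_srt_cue are my own for illustration and are not taken from video-ocr2srt.

```python
def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def format_srt_cue(index: int, start: float, end: float, text: str) -> str:
    """Build one numbered SRT cue: index, time range, text, blank line."""
    return f"{index}\n{format_srt_time(start)} --> {format_srt_time(end)}\n{text}\n\n"


if __name__ == "__main__":
    # Hypothetical cue: a line shown from 1.0 s to 2.5 s.
    print(format_srt_cue(1, 1.0, 2.5, "Hello"), end="")
```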

The script uses the EAST text detector model for text detection and the Pytesseract library for OCR. I achieved decent results with it, and they may improve with a better-quality video file. I suspect the extremely poor transfer of the film is the cause of the numerous duplicate lines and stray special characters in the subtitles. But what it does so well is the heavy lifting of creating the in and out points for the subtitle lines ╰(°▽°)╯! It processed a 110-minute video in under 40 minutes; however, users will need to ‘clean’ the .srt file for spelling, grammar, punctuation, and timing afterwards.

python video-ocr2srt.py -v input -m frozen_east_text_detection.pb -l eng -f 10 -p
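Part of the clean-up can be scripted. The following is a rough sketch, not part of video-ocr2srt, that merges consecutive cues whose text is exactly identical and keeps the first in point and the last out point. The function names (parse_srt, merge_duplicates, to_srt) are my own, and it only catches exact repeats; near-duplicates caused by OCR noise would still need a manual pass.

```python
import re

# Matches the timing line of a cue, e.g. "00:00:01,000 --> 00:00:02,000".
TIME_LINE = re.compile(r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})")


def parse_srt(srt_text):
    """Parse SRT text into a list of (start, end, text) tuples."""
    cues = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.strip().splitlines()
        # Find the timing line; blocks without one are skipped.
        for i, line in enumerate(lines):
            match = TIME_LINE.search(line)
            if match:
                text = "\n".join(lines[i + 1:]).strip()
                cues.append((match.group(1), match.group(2), text))
                break
    return cues


def merge_duplicates(cues):
    """Merge consecutive cues with identical text, extending the out point."""
    merged = []
    for start, end, text in cues:
        if merged and merged[-1][2] == text:
            prev_start = merged[-1][0]
            merged[-1] = (prev_start, end, text)
        else:
            merged.append((start, end, text))
    return merged


def to_srt(cues):
    """Serialise cues back to SRT text, renumbering from 1."""
    blocks = [
        f"{number}\n{start} --> {end}\n{text}\n"
        for number, (start, end, text) in enumerate(cues, 1)
    ]
    return "\n".join(blocks)
```

To use it, read the generated .srt into a string, pass it through parse_srt and merge_duplicates, and write the result of to_srt to a new file so the original stays untouched.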

Installing Tesseract on Windows

Tesseract is a widely used open-source OCR engine that reads and recognizes text in images, and pytesseract is the Python wrapper for it. Tesseract checks whether text lines are fixed pitch and, where they are, slices the words into characters based on the pitch. While it is known for its accuracy and versatility, it can be challenging to install in a Windows environment.

Installation steps

1. Download and install Tesseract

2. Add TESSDATA_PREFIX in the System Environment Variables:

Variable Name - TESSDATA_PREFIX
Variable Value - C:\Program Files (x86)\Tesseract-OCR\tessdata

3. Add another environment variable named tesseract:

Variable Name - tesseract
Variable Value - C:\Program Files (x86)\Tesseract-OCR\tesseract.exe

4. Add the Tesseract install directory to the PATH environment variable:

Variable Value - C:\Program Files (x86)\Tesseract-OCR
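After setting the variables, open a new terminal so the changes are picked up. A quick way to confirm steps 2 and 4 is a small Python check like the sketch below. The function name check_tesseract_setup is hypothetical, and it only uses the standard library, so it does not verify that pytesseract itself is installed.

```python
import os
import shutil


def check_tesseract_setup(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means OK."""
    env = os.environ if env is None else env
    problems = []

    # Step 2: TESSDATA_PREFIX should point at the tessdata folder.
    if not env.get("TESSDATA_PREFIX"):
        problems.append("TESSDATA_PREFIX is not set")

    # Step 4: the tesseract executable should be discoverable via PATH.
    if which("tesseract") is None:
        problems.append("tesseract executable not found on PATH")

    return problems


if __name__ == "__main__":
    for message in check_tesseract_setup() or ["Tesseract setup looks fine"]:
        print(message)
```

If the executable still cannot be found, pytesseract also lets you point at it directly by setting pytesseract.pytesseract.tesseract_cmd to the full path of tesseract.exe.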

Lynx and Links2 - Terminal Web Browsers

Here are two simple and fun tools to browse the web with the terminal!

lynx

sudo apt install lynx
lynx examplewebsite.com

links2 is a similar tool and may be set up with the following commands:

sudo apt install links2
links2 examplewebsite.com

Getting .mp4 Files with yt-dlp

yt-dlp is an excellent tool for pulling video files off YouTube; however, by default it often saves them as .webm.

The following command sorts the available formats by resolution and prefers a native .mp4 (with .m4a audio) among them. If the downloaded file still is not an .mp4, it is transcoded to .mp4 after downloading.

yt-dlp -S res,ext:mp4:m4a --recode mp4 https://www.youtube.com/watch?v=dQw4w9WgXcQ
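If you have several videos to fetch, the same command can be wrapped in a short Python script. This is only a sketch that shells out to the yt-dlp command shown above with exactly the same flags; the function names build_ytdlp_command and download_all are my own, and it assumes yt-dlp is installed and on your PATH.

```python
import subprocess


def build_ytdlp_command(url):
    """Return the yt-dlp argument list used above for a single URL."""
    return [
        "yt-dlp",
        "-S", "res,ext:mp4:m4a",  # sort by resolution, then prefer mp4 video / m4a audio
        "--recode", "mp4",        # transcode to mp4 if the download is another format
        url,
    ]


def download_all(urls, run=subprocess.run):
    """Run yt-dlp once per URL and return the list of exit codes."""
    return [run(build_ytdlp_command(url), check=False).returncode for url in urls]


if __name__ == "__main__":
    # Only print the command here; call download_all([...]) to actually download.
    print(build_ytdlp_command("https://www.youtube.com/watch?v=dQw4w9WgXcQ"))
```

Passing the arguments as a list avoids any shell quoting issues with URLs, and a non-zero exit code in the returned list shows which download failed.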