I recently tinkered with Torbjørn Pedersen’s (National Library of Norway) Python script video-ocr2srt to extract burnt-in English subtitles from a digital video. The script performs optical character recognition (OCR) on video files and generates a .srt subtitle file along with a detailed JSON file.
The script uses the EAST text detector model for text detection and the Pytesseract library for OCR. I achieved decent results with it, and they would probably improve with a better-quality video file. I suspect the extremely poor transfer of the film is behind the numerous duplicate lines and the stray special characters in the subtitles. But what it does really well is the heavy lifting of creating the in and out points for the subtitle lines ╰(°▽°)╯! It processed a 110-minute video in under 40 minutes. Users will still need to ‘clean’ the .srt file for spelling, grammar, punctuation, and timing afterwards.
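Some of that cleanup can be automated. Below is a minimal sketch of one way to do it, not part of video-ocr2srt itself: it parses a standard .srt file, strips characters outside a small whitelist, and merges consecutive cues with identical text into one cue that keeps the first in point and the last out point. The names (`Cue`, `parse_srt`, `merge_duplicates`, `to_srt`) and the whitelist of allowed characters are my own assumptions for illustration, so adjust them to your subtitles.

```python
import re
from dataclasses import dataclass


@dataclass
class Cue:
    start: str  # "HH:MM:SS,mmm"
    end: str    # "HH:MM:SS,mmm"
    text: str


TIMING = re.compile(r"(\d{2}:\d{2}:\d{2},\d{3})\s*-->\s*(\d{2}:\d{2}:\d{2},\d{3})")
# Assumption: anything outside this whitelist is OCR noise. Widen it for
# accented letters or other punctuation your subtitles actually use.
STRAY = re.compile(r"[^A-Za-z0-9\s.,!?'\"():;-]")


def parse_srt(srt_text):
    """Split .srt text into cues; multi-line subtitle text is joined with spaces."""
    cues = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.strip().splitlines()
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                text = " ".join(part.strip() for part in lines[i + 1:])
                cues.append(Cue(match.group(1), match.group(2), text))
                break
    return cues


def clean_text(text):
    """Drop non-whitelisted characters and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", STRAY.sub("", text)).strip()


def merge_duplicates(cues):
    """Merge consecutive cues whose cleaned text matches (case-insensitive)."""
    merged = []
    for cue in cues:
        text = clean_text(cue.text)
        if not text:
            continue  # nothing left after cleaning, so drop the cue
        if merged and merged[-1].text.lower() == text.lower():
            merged[-1].end = cue.end  # extend the previous cue's out point
        else:
            merged.append(Cue(cue.start, cue.end, text))
    return merged


def to_srt(cues):
    """Serialize cues back to .srt text with fresh sequence numbers."""
    blocks = [f"{i}\n{c.start} --> {c.end}\n{c.text}" for i, c in enumerate(cues, 1)]
    return "\n\n".join(blocks) + "\n"
```

Running `to_srt(merge_duplicates(parse_srt(text)))` on the contents of the generated file gives a shorter .srt to proofread. Note that this merges any two neighbouring cues with the same text, regardless of the gap between them, so a line that is genuinely repeated later in the film would need a timing check added.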
Pytesseract is a widely used open-source Python wrapper for the Tesseract OCR engine, which reads and recognizes text in images. Tesseract checks whether text lines are fixed pitch and, where they are, slices the words into characters based on that pitch. While it is known for its accuracy and versatility, it can be challenging to install in a Windows environment.
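Much of the Windows pain comes from the fact that Pytesseract calls a separately installed Tesseract executable, which it has to be able to find. The sketch below shows one way to handle that: look on PATH first, fall back to a couple of install locations, then point Pytesseract at the result through `pytesseract.pytesseract.tesseract_cmd`. The fallback paths and the helper names `find_tesseract` and `ocr_image` are assumptions for illustration, so check where Tesseract actually landed on your machine.

```python
import shutil
from pathlib import Path

# Assumed install locations on Windows; adjust to match your machine.
WINDOWS_CANDIDATES = (
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe",
)


def find_tesseract(candidates=WINDOWS_CANDIDATES):
    """Return the Tesseract executable path, or None if it cannot be found."""
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    for candidate in candidates:
        if Path(candidate).is_file():
            return candidate
    return None


def ocr_image(image_path, lang="eng"):
    """OCR a single image file and return the recognized text."""
    # Imported here so find_tesseract() can be used without these packages.
    import pytesseract
    from PIL import Image

    exe = find_tesseract()
    if exe is None:
        raise RuntimeError("Tesseract executable not found; install it or add it to PATH")
    pytesseract.pytesseract.tesseract_cmd = exe
    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=lang).strip()
```

Calling `find_tesseract()` on its own is a quick check of whether the install worked before running a long OCR job.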