Automatically Rotating Scanned Text Images with Tesseract OCR
Problem
If you've ever had a batch of scanned images with text, you know how tedious it can be to manually rotate each one to the correct orientation. This process can be especially frustrating when dealing with a large number of images. Wouldn't it be great if there were a way to automatically rotate these images so that the text is always upright and readable?
Solution
To solve this problem, I developed a simple script that automatically detects the correct orientation of text in scanned images using Optical Character Recognition (OCR) and dictionary matching. Here's how it works:
OCR Parsing with Tesseract: I used Tesseract, a popular open-source OCR engine, to extract text from the images. Tesseract is powerful and versatile, making it an excellent choice for this task.
Dictionary Matching: I created a list of the most commonly occurring words in the text. This list acts as a reference to determine the correct orientation. My example includes only six words, but you can easily expand the list to improve accuracy.
Rotation and Validation: The script first runs OCR on the image as it is, then on copies rotated by 90°, 270°, and 180°, comparing the recognized text against the dictionary each time. Because OCR is not 100% accurate, words are matched approximately, tolerating one character error per match. The first orientation that yields more than two dictionary matches is treated as upright, and the image is replaced with that rotated copy.
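The dictionary check and the rotation loop can be sketched in plain shell. This is only an illustrative sketch under stated assumptions, not the scripts listed below: `count_hits`, `find_upright`, `ocr_text`, `WORDS`, and `MIN_HITS` are hypothetical names, matching is exact (case-insensitive `grep`) rather than approximate, and `ocr_text` assumes ImageMagick's `convert` and a Tesseract build that accepts `stdout` as its output base.

```shell
# Hypothetical word list; extend it for your own documents.
WORDS="then change when over suddenly another"
# An orientation is accepted when it scores more than MIN_HITS matches.
MIN_HITS=2

# count_hits FILE: print the number of (line, word) matches in FILE.
# Exact, case-insensitive substring matching only (no approximate matching).
count_hits() {
    hits=0
    for w in $WORDS; do
        # grep -c prints the count of matching lines; it exits 1 on zero matches
        n=$(grep -i -c -- "$w" "$1" 2>/dev/null || true)
        hits=$((hits + ${n:-0}))
    done
    echo "$hits"
}

# ocr_text IMAGE ANGLE OUTFILE: write the text recognized in IMAGE, rotated
# clockwise by ANGLE degrees, into OUTFILE. Assumed tools: convert, tesseract.
# No -l option is given, so Tesseract falls back to its default language.
ocr_text() {
    rotated="$3.rot.png"
    convert "$1" -rotate "$2" "$rotated" &&
        tesseract "$rotated" stdout > "$3" 2>/dev/null
    rm -f "$rotated"
}

# find_upright IMAGE: print the first angle, tried in the order 0, 90, 270, 180,
# whose OCR text scores above MIN_HITS. Returns 1 if no angle qualifies.
find_upright() {
    tmp=$(mktemp) || return 2
    for angle in 0 90 270 180; do
        ocr_text "$1" "$angle" "$tmp"
        if [ "$(count_hits "$tmp")" -gt "$MIN_HITS" ]; then
            rm -f "$tmp"
            echo "$angle"
            return 0
        fi
    done
    rm -f "$tmp"
    return 1
}
```

Calling `find_upright scan.jpg` would print the angle to apply, or return a non-zero status when the word list never matches, which is the cue to extend the dictionary.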
How to Use
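To get started, copy the three files listed below into your ~/bin directory, saving the Perl script as recognize_good_rotation.pl because tesseract_rotate calls it by that name. Then change to the folder that holds your scans and run the tesseract_rotate_all script. It passes every file in the current directory to tesseract_rotate, which rotates each image to the correct orientation based on the detected text.
Dependencies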
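Before running the script, you need to install the Perl module re::engine::TRE, a regular expression engine that handles approximate matching, which is essential for dealing with OCR inaccuracies. The shell scripts also call the tesseract command (with the slk language data) and ImageMagick's convert.
recognize_good_rotation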
#use lib "/root/perl5/lib";
# @author Miroslav Bodis 2014
use strict;
use warnings;
my $file = shift;
my $rotation = shift; # label used in debug output, e.g. 'rotation 90'
my $debug_mode = shift;
my $find = 0;
if (!defined $debug_mode){
    $debug_mode = 0;
    # $debug_mode = 1; # TODO use for more details
}
if ($debug_mode == 1){
    print "file:" . $file."\n";
    print $rotation . "\n";
    print "debug mode: " . $debug_mode."\n";
}
# dictionary of words expected in correctly oriented text
my @recognize_words = ('then', 'change', 'when', 'over', 'suddenly', 'another');
open(my $fh, "<", $file) or die "cannot open file '$file': $!";
while(<$fh>) {
    chomp;
    my $line = $_;
    {
        # approximate matching: tolerate a total error cost of 1 per match
        use re::engine::TRE max_cost => 1;
        foreach (@recognize_words) {
            if ($line =~ /$_/i) {
                $find += 1;
                if ($debug_mode == 1){
                    print "match word: " . $_ . "\n";
                }
            }
        }
    }
}
close $fh;
# exit 0 (orientation accepted) when more than two matches were found
if ($find > 2){
    exit 0;
}
exit 1;
tesseract_rotate
#! /bin/sh
# @author Miroslav Bodis 2014
# exactly one argument is required: the image to process
if [ -z "$1" ]
then
    echo "
@author Miroslav Bodis 2014
# script to autorotate image of printed text
# - try all 4 rotations of the input image (0, 90, 270, 180)
# - run tesseract on the current rotation
# - use your dictionary to find words in the OCR text (with some tolerance - used TRE max_cost => 1)
# - see log for results
# - TODO: copy script to your bin folder e.g.: \"~/bin/recognize_good_rotation.pl\"
# \$1 -> \"input_image\""
    exit 1
fi
help_rotated_img="rotation_help.jpg"
help_ocr_out="output_ocr"
help_ocr_out_txt="$help_ocr_out.txt"
find=1
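# 0 - rotation (image as it is)
# -l slk selects Tesseract's Slovak language data; change it to match your documents
echo "image $1 try rotation 0"
tesseract -l slk "$1" "$help_ocr_out"
perl ~/bin/recognize_good_rotation.pl "$help_ocr_out_txt" 'rotation 0'
find=$?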
# try the remaining rotations (degrees clockwise) until the dictionary check succeeds
for angle in 90 270 180
do
    if [ "$find" -ne 0 ]
    then
        echo "image $1 try rotation $angle"
        convert "$1" -rotate "$angle" -quality 100 "$help_rotated_img"
        tesseract -l slk "$help_rotated_img" "$help_ocr_out"
        perl ~/bin/recognize_good_rotation.pl "$help_ocr_out_txt" "rotation $angle"
        find=$?
    fi
done
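if [ "$find" -ne 0 ]
then
    echo ">>>>>>>>>>>>>>>>> image $1 NOT ROTATED, please update dictionary ! <<<<<<<<<<<<<<<<"
elif [ -f "$help_rotated_img" ]
then
    echo "image $1 ROTATED"
    # replace the original image with the correctly rotated copy
    cp "$help_rotated_img" "$1"
else
    # rotation 0 matched, so no rotated copy exists and nothing needs replacing
    echo "image $1 already upright"
fi
rm -f "$help_rotated_img" "$help_ocr_out_txt"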
tesseract_rotate_all
#! /bin/sh
# @author Miroslav Bodis 2014
# move to current folder with pictures and run "tesseract_rotate_all"
# iterate over the glob directly and quote "$f" so names with spaces survive
for f in ./*
do
    echo "--- --- --- --- START FILE $f"
    tesseract_rotate "$f"
done
echo "finished"
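Because `./*` matches every file in the folder, not only images, a variant of the loop might filter by extension first. The sketch below is one hedged way to do that; `is_image`, `list_images`, and the extension list are assumptions for illustration, not part of the original scripts.

```shell
# is_image NAME: succeed if NAME ends in a common raster-image extension.
# The extension list is an assumption; adjust it to your scans.
is_image() {
    case "$1" in
        *.jpg|*.jpeg|*.png|*.tif|*.tiff|*.JPG|*.JPEG|*.PNG|*.TIF|*.TIFF) return 0 ;;
        *) return 1 ;;
    esac
}

# list_images DIR: print one path per line for each regular image file in DIR.
list_images() {
    for f in "$1"/*; do
        [ -f "$f" ] && is_image "$f" && printf '%s\n' "$f"
    done
    return 0
}

# Possible use in place of the plain loop (assumes tesseract_rotate is on PATH):
#   list_images . | while IFS= read -r f; do tesseract_rotate "$f"; done
```

Filtering this way keeps text files and subdirectories out of the OCR step, at the cost of maintaining the extension list by hand.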