Double colloquium, Prof. C. V. Lakshmi, Computer Science, Dayalbagh Educational Institute, Agra, India / 05.10.2016, 15:15 - 16:00

05.10.2016 at 15:15 to 17.10.2016 at 16:00

Institut für Informatik, front building (Vorbau), Ludewig-Meyn-Straße 2, 24118 Kiel, Room Ü2/K

Title: A fast and robust many fonts printed Optical Character Recognition system for Indian Scripts

Abstract: Making machines “read” text printed or handwritten on paper has been a long-cherished goal in Computer Science. Considerable success has been achieved for scripts such as the Roman script, which is quite simple and consists of very few symbols.
This talk describes the design and development of an Optical Character Recognition system for printed many-font Indian scripts. The system takes as input an image of printed text in Devanagari (Hindi, the dominant language in North India) or Telugu (a prominent South Indian language). It performs a full OCR and provides output that can be taken directly into an editor. Indian scripts fall into two classes: those written with a prominent headline called the Shirorekha, and those without a Shirorekha. In our work, Devanagari is taken as a representative of the first class and Telugu as a representative of the second.
Indian scripts are extremely complicated because of the vast number of possible combinations of vowels and consonants, which are written joined together to form what are called compound characters. The OCR problem becomes even more difficult when text appears in complicated layouts or on complicated backgrounds.
The OCR system designed and implemented is shown to provide excellent results for text images of Telugu and Devanagari. A large number of fonts are considered, 24 for Telugu and 25 for Devanagari, and the results are excellent for all of them. This is the first reported attempt that works for such a large number of fonts for Indian scripts. Further, the recognition rates are enhanced by judiciously implementing schemes for reducing the number of exemplars and the number of features to be stored in the databases. New types of features in the transform domain are shown to give better results. The fact that the system works excellently for all these fonts on both a Shirorekha-based script and a non-Shirorekha-based script augurs well for the other Indian scripts.
The talk will highlight the main characteristics of Indian scripts and the challenges in developing OCR for them. It will then describe how these challenges are resolved in our work. Considerable emphasis will be laid on a practical demonstration of the OCR system, in addition to explaining its internal structure and logic.

Prof. Anand Srivastav

