Identification and Classification of Multilingual Document Using Mutual Information and KNN Classifier

Manjula S, Ravindra S. Hegadi
Department of Computer Science
Solapur University, Solapur – 413255, India
manjula.shamarao@gmail.com, rshegadi@gmail.com

Abstract. A document containing more than one language is known as a multilingual document. This paper addresses the problem of detecting documents that contain more than one language using a mutual information technique; the identified languages are then classified with a KNN classification model. Each Indian language has its own characteristics and can be distinguished visually. To let a machine identify these differences, we use an edge direction based feature, the Edge Direction Histogram (EDH), to capture the differences between languages. These techniques were applied to a dataset of about 400 images of documents in Kannada and Hindi and achieved an accuracy of 97.096%.

Keywords: Multilingual document, mutual information, edge direction histogram, KNN classification.

1         Introduction

In India, more than 22 languages are used for official correspondence and communication. Multilingual documents used in many day-to-day transactions, such as bank forms, railway/bus reservation forms and official government documents, generally contain at least three languages (English as the international language, Hindi as the national language and the regional language of the particular state). Automatic identification of the language in a given document image has many useful applications, such as cataloguing multilingual documents. Language identification is one of the main steps in document image analysis, especially in the analysis of multi-language documents.

Language is the method of human interaction, spoken or written, that consists of words used in a structured and conventional arrangement. A language can vary with state, country or community, as the style of writing or speech changes with location. It is a system of communication based on words and the combination of words into sentences. Communication by means of language is referred to as linguistic communication. Communication can also be carried out through expressions such as smiling, facial expression, body language, shrieking and so on; this is referred to as non-linguistic communication. Languages consist of tens of thousands of signs, which are combinations of form and meaning. In spoken languages the form is a sequence of sounds, in written languages it is, for example, a sequence of letters, and in the sign languages of the deaf it is a certain combination of gestures.

Manual identification of language can be tedious and time consuming. Automatic identification and recognition of language also helps in text area identification, video indexing and retrieval, and document sorting in digital libraries when dealing with multilingual documents. The factors that are important for language identification are

  1. Complexity in preprocessing.
  2. Complexity in feature extraction and classification.
  3. Computation speed of entire scheme.
  4. Sensitivity of the scheme to variations in the document text, such as font style, font size, document skew and document size.

In this paper we propose the identification of multilingual documents using mutual information and the classification of the identified languages with a KNN classifier. All languages consist of vowels and consonants, known as the basic characters of the language. Vowels can be written as independent letters, or they can be combined with consonants by placing special vowel marks above, below, before or after the consonant they belong to. When vowels are written in this way they are known as modifiers, and the characters so formed are called conjuncts. One consonant can also combine with one or more other consonants to take a new shape; this class of characters is known as compound characters. A multilingual document analysis system may take handwritten or printed document images as input for the identification and classification process. An automatic language identification scheme is useful for sorting document images, for developing appropriate systems for identifying multiple languages, and for searching online archives of document images containing a particular language.

The challenges involved in automatic language identification are

  1. Length of text data: The documents have to be of sufficient length to identify the language, for the simple reason that a lengthy document includes more vocabulary.
  2. Noisy text: The text data may contain noisy information such as abbreviations, shortened words and tags.
  3. Character encoding: The character encoding may differ from one text to another [22]. A generic system that handles all types of character encoding still needs to be developed.
  4. Segmentation of document: The scanned input document image is divided into a number of paragraphs. These paragraphs may be of arbitrary size and may use different font styles and sizes. They are used as input for identifying the different languages in the multilingual document.
  5. Common words: In the case of similar languages, certain words are common to all of them, which makes the language identification task difficult.

2         Literature Survey

In recent years, the identification and classification of languages has posed many challenges to researchers, and many of them have implemented different techniques to perform these tasks. Gopal Datt Joshi et al. proposed a general framework for language identification in multilingual documents using local and global approaches [1]. P. Ramanathan proposed automatic identification of handwritten languages using histogram processing and neighbourhood processing techniques together with line detection and document classification [3]. Bhupendra Kumar et al. proposed line based robust language identification for Indian languages including Hindi, Gurumukhi and Bengali using a hierarchical classification technique, and achieved 90% accuracy [8]. B.V. Dhandra et al. developed a method for word-wise script identification from bilingual documents based on morphological concepts [9]. Spitz described a technique that uses the upward concavities of connected components to determine the script of Asian and European languages [4]. Tan proposed a method for identifying seven languages (Chinese, English, Korean, Greek, Malayalam, Persian and Russian) based on texture analysis using multi-channel Gabor filters and co-occurrence matrices [5]. Wood et al. used horizontal and vertical projection profiles of document images to identify the script of the documents [6]. Hochberg et al. presented a system that automatically identifies the language using cluster based templates [7]. The literature survey reveals that a major amount of work on language identification at word level is based on the presence or absence of the shirorekha. The techniques for language identification at document, paragraph and line level have been modified for word level identification in printed Indian multilingual documents by including new features such as the headline feature, distribution of vertical strokes, left and right profiles, deviation features, loop feature, tick feature, etc. Rajendra Rani et al. proposed Gurumukhi and English language identification using modified Gabor feature extraction methods [2]. An existing system for language identification of Telugu and Kannada characters achieves 91% accuracy. Another existing model, which identifies Hindi, Bengali, Kannada and Telugu languages and is based on projection profiles and a rule based classifier, reported an accuracy of 97.83% [1].

3         Methodology

Figure 1 is the flow diagram showing the steps in the implementation of the mutual information technique for language identification.

3.1      Input Image Document

The input is a text document image of variable size, which may contain text in different font styles, sizes and shapes. To implement mutual information for language identification we consider images in two different languages, Kannada and Hindi, as input. These images may be color, grayscale or black-and-white images. The text in these images may consist of vowels, consonants, modifiers and compound characters of the respective language.
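Since the input may be a color image while the later stages operate on gray-tone images, a color input has to be reduced to gray levels first. The paper does not state how this is done; the following is a minimal sketch that assumes the common ITU-R BT.601 luminance weights and represents an image as nested Python lists. The function name `to_grayscale` is hypothetical.

```python
def to_grayscale(rgb_image):
    """Convert an RGB image to gray levels.

    rgb_image is a list of rows, each row a list of (r, g, b) tuples
    with components in 0..255. The weights are the ITU-R BT.601 luma
    coefficients (an assumption; the paper does not specify them).
    """
    gray = []
    for row in rgb_image:
        gray.append([round(0.299 * r + 0.587 * g + 0.114 * b)
                     for r, g, b in row])
    return gray
```

An image that is already grayscale or black and white would skip this step.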

3.2      Preprocessing of input image

The hard copy of each text document is converted into a digital image using a scanner. Scanning may introduce noise, depending on the quality of the scanner and of the source document: faulty electronic devices used to capture the image can generate noise, and the source document itself may already contain noise, disconnected line segments, or gaps within character strokes. The scanned images are gray-tone images, and these are used for further processing. Because the language identification method needs noise-free input to produce good results, the noise is eliminated during preprocessing. For this noise reduction we use morphological operations, which remove all small noisy objects present in the binary images.

Once the noise has been eliminated, the cleaned image is passed to edge detection, which aims to identify the points in a digital image at which the pixel intensity changes sharply. These points are typically organized into a set of curved line segments. The main purpose of detecting these intensity changes is to capture the difference between object and background in the input image. For this we use the Canny edge detector, a multi-stage algorithm that detects a wide range of edges in an image. It extracts useful structural information from a grayscale image while dramatically reducing the amount of data to be processed further. The Canny detector is designed for a low error rate, meaning that it should accurately find as many of the edges present in the image as possible. The extracted edge images are then used as input to the feature extraction stage.
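The removal of small noisy objects from a binary image can be illustrated with an area-based filter. The paper does not name the exact morphological operation, so the sketch below is an assumption: it labels 8-connected foreground components and clears those smaller than a chosen area. The function name `remove_small_objects` and the parameter `min_area` are hypothetical.

```python
from collections import deque

def remove_small_objects(binary, min_area):
    """Clear 8-connected foreground components with fewer than min_area pixels.

    binary is a list of rows of 0/1 values; a cleaned copy is returned.
    """
    rows, cols = len(binary), len(binary[0])
    out = [row[:] for row in binary]
    seen = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search collects one connected component.
            component = []
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and binary[ny][nx] == 1
                                and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Components below the area threshold are treated as noise.
            if len(component) < min_area:
                for y, x in component:
                    out[y][x] = 0
    return out
```

The threshold trades off noise removal against the loss of genuinely small marks such as dots and vowel modifiers, so it would need tuning to the scan resolution.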

3.3      Feature extraction from preprocessed image

The process of extracting distinctive information from the preprocessed image is known as feature extraction. Feature extraction reduces the amount of resources required to describe a large set of data. Analysis of a large data set with many variables may require a large amount of memory and efficient computation techniques; feature extraction is a general term for methods that construct combinations of the variables to get around these problems while still describing the data with sufficient accuracy. To identify the different languages we have implemented techniques that extract distinctive information from the preprocessed image. The following tasks are carried out to perform feature extraction.

Edge Direction Histogram (EDH). The preprocessed image is used as input for this stage. To build the edge direction histogram we need to find the edge directions in the edge-detected image using the image gradient. The image gradient is the directional change in intensity or color in an image, and it yields a magnitude and a direction value at each point of the input image. The gradient of a function of two variables is, at each image point, a two-dimensional vector whose components are the derivatives in the horizontal and vertical directions. At each image point, the gradient vector points in the direction of the largest possible intensity increase, and its length corresponds to the rate of change in that direction. The direction values are used as input to the edge direction histogram. The histogram groups the direction matrix into a number of blocks, which reduces the computational complexity. These block values are then used to calculate the mutual information of the image.
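The gradient and histogram steps described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: Sobel kernels are assumed for the horizontal and vertical derivatives, the number of direction bins (`num_bins`) is a free parameter, and the function name `edge_direction_histogram` is hypothetical.

```python
import math

def edge_direction_histogram(gray, num_bins=8, threshold=1e-9):
    """Normalized histogram of gradient directions over interior pixels.

    gray is a list of rows of intensity values. Pixels whose gradient
    magnitude does not exceed threshold are ignored.
    """
    rows, cols = len(gray), len(gray[0])
    hist = [0] * num_bins
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            # Sobel estimate of the horizontal derivative.
            gx = (gray[y - 1][x + 1] + 2 * gray[y][x + 1] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y][x - 1] - gray[y + 1][x - 1])
            # Sobel estimate of the vertical derivative.
            gy = (gray[y + 1][x - 1] + 2 * gray[y + 1][x] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y - 1][x] - gray[y - 1][x + 1])
            if math.hypot(gx, gy) <= threshold:
                continue
            # Direction in [0, 2*pi), mapped to one of num_bins equal sectors.
            angle = math.atan2(gy, gx) % (2 * math.pi)
            index = min(int(angle / (2 * math.pi) * num_bins), num_bins - 1)
            hist[index] += 1
    total = sum(hist)
    if total == 0:
        return [0.0] * num_bins
    return [count / total for count in hist]
```

Normalizing the counts makes the histogram comparable across images of different sizes, which matters here because the input images vary in size.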

Mutual information (MI). The block values extracted using the edge direction histogram are used to calculate mutual information. Mutual information is one of many quantities that measure how much one variable tells us about another; it is a dimensionless quantity, generally expressed in bits. High mutual information means a large reduction in uncertainty, low mutual information means a small reduction, and zero mutual information between two random variables means they are independent. This technique uses the values of a single direction while keeping the other direction constant. Mathematically, mutual information is given by the equation

$$ I(X;Y) = \sum_{i} \int p_{X,Y}(x,i)\,\log\frac{p_{X,Y}(x,i)}{p_X(x)\,p_Y(i)}\,dx \tag{1} $$

Here $p_{X,Y}(x,i)$ is the joint probability density function of X and Y, and $p_X(x)$ and $p_Y(i)$ are the marginal probability distributions of X and Y. The equation applies to a random process X that generates feature vectors x and a random process Y that generates labels i. This mutual information can be decomposed into two components, the marginal mutual information (MMI) and the conjunctive component of mutual information (CCMI), that is

$$ I(X;Y) = M(X;Y) + C(X;Y) \tag{2} $$

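Marginal mutual information is the sum of the mutual information computed individually for each block of edge direction histogram values. Writing $X_k$ for the feature of the k-th block and $n$ for the number of blocks,

$$ M(X;Y) = \sum_{k=1}^{n} I(X_k;Y) \tag{3} $$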
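Each per-block term in the marginal mutual information is an ordinary mutual information between one block feature and the language label. The sketch below estimates it from a discrete joint probability table, replacing the integral over continuous features with a sum over quantized values; this discretization, the table layout (rows for feature values, columns for labels) and both function names are assumptions made for illustration.

```python
import math

def mutual_information(joint):
    """Mutual information in bits from a joint probability table.

    joint[a][b] is P(X = a, Y = b); the entries are assumed to sum to 1.
    """
    px = [sum(row) for row in joint]          # marginal of X
    py = [sum(col) for col in zip(*joint)]    # marginal of Y
    mi = 0.0
    for a, row in enumerate(joint):
        for b, pxy in enumerate(row):
            if pxy > 0:                       # 0 * log 0 is taken as 0
                mi += pxy * math.log2(pxy / (px[a] * py[b]))
    return mi

def marginal_mutual_information(block_joints):
    """Sum of the per-block mutual information values."""
    return sum(mutual_information(joint) for joint in block_joints)
```

A table whose entries factor into the product of its marginals gives zero, and a table in which the feature value determines the label for two equally likely labels gives one bit.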

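The conjunctive component of mutual information accounts for the interactions among the blocks of the edge direction histogram, that is, the information that is not captured when each block is considered individually. With $X_k$ denoting the feature of the k-th block and $n$ the number of blocks,

$$ C(X;Y) = \sum_{k=2}^{n} \left[ I(X_k; X_{k-1},\dots,X_1 \mid Y) - I(X_k; X_{k-1},\dots,X_1) \right] \tag{4} $$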

Using these methods we calculated the mutual information of each individual input image. Each language has its own characteristic mutual information values, which play a vital role in language identification. Hence these values are used as input for the classification of languages.
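The decomposition of mutual information into a marginal and a conjunctive component can be checked with the chain rule for mutual information. The short derivation below is a sketch rather than the authors' own argument; the block index $k$, the number of blocks $n$ and the shorthand $X_{<k} = (X_1, \dots, X_{k-1})$ are notational assumptions introduced here.

```latex
\begin{align*}
I(X;Y) &= \sum_{k=1}^{n} I(X_k; Y \mid X_{<k})
  && \text{(chain rule)} \\
I(X_k; Y \mid X_{<k}) &= I(X_k; Y) + I(X_k; X_{<k} \mid Y) - I(X_k; X_{<k})
  && \text{(two expansions of } I(X_k; Y, X_{<k})\text{)} \\
I(X;Y) &= \underbrace{\sum_{k=1}^{n} I(X_k; Y)}_{M(X;Y)}
  + \underbrace{\sum_{k=2}^{n} \bigl[ I(X_k; X_{<k} \mid Y) - I(X_k; X_{<k}) \bigr]}_{C(X;Y)}
\end{align*}
```

The $k = 1$ term contributes nothing to the conjunctive component because $X_{<1}$ is empty.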

3.4      Classification of languages

The features extracted from the preprocessed images are used for classification. In this work the mutual information of each individual input image is taken as the input data for classification. We use the KNN classifier because of its simplicity and efficiency. The classifier assigns each test sample to a category based on a KNN model built from sample data: the class of a test sample is predicted from the classes of its nearest neighbors in the sample data. Here we classify Kannada and Hindi languages based on the mutual information value of each individual image.
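The nearest-neighbor decision described above can be sketched in a few lines. The paper does not report the value of k or the distance measure, so Euclidean distance and k = 3 are assumptions here, and the function name `knn_predict` is hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_features, train_labels, query, k=3):
    """Predict the label of query by majority vote among its k nearest
    training samples, using Euclidean distance (an assumed metric)."""
    distances = sorted(
        (math.dist(features, query), label)
        for features, label in zip(train_features, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]
```

Called with training vectors of mutual information values and their language labels, the function returns the majority label among the k closest training images. Choosing an odd k avoids tied votes in the two-class Kannada/Hindi setting.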

4         Experimental Results

The proposed methodology has been implemented on a data set of 400 images, comprising 200 Kannada text images and 200 Hindi text images with different font styles, font sizes and image sizes. All the techniques were applied to each individual image using MATLAB R2015a, and the relevant outputs are presented. Figure 2 shows the output of the different stages: Figure 2(a) and 2(b) are input images in Kannada and Hindi, Figure 2(b) is the grayscale image, Figure 2(c) is the result of applying the Canny edge detector to extract the edges of the gray image, and Figure 2(d) is the image after noise removal.

We then calculated the mutual information of each input image, and the languages were identified based on the mutual information values. The KNN classifier was used to classify these languages using a KNN classification model built from sample data sets. The accuracy of the proposed algorithm is 97.1%.

5         Conclusion

In this work we have proposed a methodology for the identification and classification of Indian languages present in multilingual documents, in which identification uses mutual information and classification is based on a KNN classification model. The method has been applied to text blocks from various types of text document images, such as textbook, newspaper and journal images. The mutual information of each image is calculated and used for classification. The KNN classifier classifies the test data set using the nearest neighbors method with an accuracy of 97.1%.

References

  1. G D Joshi, S Garg and J Sivaswamy, “A Generalised Framework for Script Identification”, International Journal of Document Analysis and Recognition (IJDAR), Vol. 10, No. 2, PP. 55-68, November 2007.
  2. R Rani, R Dhir and G S Lehal, “Modified Gabor Feature Extraction Method For Word Level Script Identification - Experimentation With Gurumukhi And English Scripts”, International Journal of Signal Processing, Image Processing and Pattern Recognition, Vol. 6, No. 5, PP. 25-38, 2013.
  3. P Ramanathan, “Automatic Identification of Handwritten Scripts”, Middle-East Journal of Scientific Research, Vol. 19, No. 7, PP. 933-936, 2014.
  4. A L Spitz, “Determination of Script and Language Content of Document Images”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 3, PP. 233-245, March 1997.
  5. T N Tan, “Rotation Invariant Texture Features and Their Use in Automatic Script Identification”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 20, No. 7, PP. 751-756, July 1998.
  6. S L Wood, X Yao, K Krishnamurthi and L Dang, “Language Identification for Printed Text Independent of Segmentation”, Proceedings of International Conference on Image Processing, Vol. 3, PP. 428-431, Oct. 1995.
  7. J Hochberg, P Kelly, T Thomas and L Kerns, “Automatic Script Identification from Document Images Using Cluster Based Templates”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 2, PP. 176-181, Feb. 1997.
  8. Bhupendra Kumar, Aniket Bera and Tushar Patnaik, “Line Based Robust Script Identification for Indian Languages”, International Journal of Information and Electronics Engineering, Vol. 2, No. 2, March 2012.
  9. B V Dhandra, H Mallikarjun, Ravindra Hegadi and V S Malemath, “Word-Wise Script Identification From Bilingual Documents Based On Morphological Reconstruction”, in First IEEE International Conference on Digital Information Management, PP. 389-394, Dec 2006.