TEXT NORMALIZATION AND SPELLING CORRECTION IN KAZAKH LANGUAGE

Mukhanova M.B.

Abstract:
Text normalization is a significant preprocessing step for informal, social media, and short texts in Natural Language Processing (NLP) tasks. Research in the field focuses mostly on English rather than on agglutinative languages such as Kazakh, Korean, and Japanese, which are morphologically rich and more complex than English. In this paper, we present text normalization and automatic correction of words for the Kazakh language: we convert informal text into grammatically correct form. For the auto-correction task, we first account for keyboard errors made while typing words, then choose the best match among the resulting candidates. Additionally, we categorize words into several groups and split the text into modules of words. The exact match score of the overall system on the provided datasets is 85.40 per cent.

Keywords:
natural language processing, corpora, machine learning, agglutinative language

Introduction
Text normalization is the task of transforming informal writing into its standard form in the language. It is an important processing step for a wide range of Natural Language Processing (NLP) tasks such as text-to-speech synthesis, speech recognition, information extraction, parsing, and machine translation (Richard Sproat, Alan W. Black, Stanley F. Chen, Shankar Kumar, Mari Ostendorf, Christopher Richards, 2001). Text normalization involves merging different written forms of a token into a canonical normalized form; for example, a document may contain the equivalent tokens “Mr.”, “Mr”, “mister”, and “Mister”, which would all be normalized to a single form (Nitin Indurkhya, Fred J. Damerau, 2010). Normalization poses multiple challenges. It is the task of mapping all out-of-vocabulary non-standard word tokens to in-vocabulary standard forms; to do this, we convert raw text into grammatically correct sentences by modifying punctuation and capitalization and by adding, removing, or reordering words. We also assign specific values to certain token types such as date, phone, currency, and URL. Informal texts usually contain many mistakes, so it is useful to correct them. For spelling correction, we consider keyboard typing mistakes, character repetition, and other error types. In this paper, we propose spelling correction and text preprocessing based on the techniques mentioned above; this approach gives higher accuracy than other methodologies.
The rest of this paper is organized as follows. In Section 2 we discuss previous approaches to the normalization problem. Section 3 presents our normalization framework, including the actual normalization and learning procedures. In Section 4 we introduce the evaluation metric and present experimental results of our model with respect to several categories. Finally, we conclude in Section 5.
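The token typing and character-repetition handling mentioned in the Introduction can be illustrated with a minimal Python sketch. The regular expressions, the label names, and the helper names `tag_token` and `collapse_repeats` are assumptions made for illustration only; the paper does not specify its actual patterns or categories beyond date, phone, currency, and URL.

```python
import re

# Hypothetical patterns for a few token types; real systems need broader coverage.
PATTERNS = [
    ("URL", re.compile(r"(https?://|www\.)\S+")),
    ("DATE", re.compile(r"\d{1,2}[./-]\d{1,2}[./-]\d{2,4}")),
    ("PHONE", re.compile(r"\+?\d[\d\-]{9,14}")),
    ("CURRENCY", re.compile(r"\d+(?:[.,]\d+)?(?:₸|\$|€)")),
]


def tag_token(token):
    """Return the first matching type label, or WORD if no pattern matches."""
    for label, pattern in PATTERNS:
        if pattern.fullmatch(token):
            return label
    return "WORD"


def collapse_repeats(token, keep=1):
    """Shorten every run of an identical character longer than `keep` to `keep`."""
    return re.sub(r"(.)\1{%d,}" % keep, lambda m: m.group(1) * keep, token)


if __name__ == "__main__":
    for tok in ["https://example.com", "25.04.2024", "+77011234567", "500₸", "сәлем"]:
        print(tok, tag_token(tok))
    print(collapse_repeats("сәәәәлем"))
```

With `keep=1` the function also shortens legitimately doubled letters, so in practice its output would be treated as a candidate to check against a vocabulary rather than as a final correction.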
Related Work
Early studies of text normalization include machine learning approaches for text-to-speech and social media, including approaches based on neural networks. In this paper, we use a method similar to those in works that investigated text normalization in social media: because of the recent rise of heavily informal writing in messaging applications, text normalization is a major problem for every language. Previous works modelled text normalization as a process in which normalized text passes through a noisy channel and comes out as noisy text; this approach is called the noisy channel model. (Eric Brill and Robert C. Moore, 2000) presented a method for modelling spelling correction as a noisy channel based on string-to-string edits; this model gives significant improvements over earlier studies. (Kristina Toutanova and Robert C. Moore, 2002) enhanced the string-to-string edit model by modelling pronunciation similarities between words and achieved a substantial performance improvement over the previous best-performing spelling correction models. (Monojit Choudhury, Rahul Saraf, Vijit Jain, Animesh Mukherjee, Sudeshna Sarkar, and Anupam Basu, 2007) introduced a supervised HMM channel model that adopted the spell-checking metaphor based on character-level edits; it was extended by (Paul Cook and Suzanne Stevenson, 2009), who used an unsupervised noisy channel model with probabilistic models for common abbreviations and various spelling error types. (Catherine Kobus, François Yvon, and Géraldine Damnati, 2008) normalized the orthography of French SMS messages by combining Statistical Machine Translation and automatic speech recognition approaches. (Bo Han and Timothy Baldwin, 2011) presented a model for identifying and normalizing ill-formed words, generating correction candidates based on morphophonemic similarity over an SMS corpus and Twitter.
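To make the noisy channel idea discussed in this section concrete, the following Python sketch picks the vocabulary word w that maximizes log P(w) + log P(s | w) for an observed string s. It is a toy illustration, not the model of any cited work: the channel term is approximated by a fixed penalty per Levenshtein edit (the `penalty` parameter is an assumption), and the word counts are invented for the example.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def noisy_channel_correct(observed, counts, penalty=3.0):
    """Return the word maximizing log prior plus an edit-based channel score."""
    total = sum(counts.values())

    def score(word):
        log_prior = math.log(counts[word] / total)
        # Assumption: each edit multiplies the channel probability by exp(-penalty).
        log_channel = -penalty * edit_distance(observed, word)
        return log_prior + log_channel

    return max(counts, key=score)


# Invented toy counts, for illustration only.
COUNTS = {"кітап": 40, "кітапхана": 10, "қала": 50}

if __name__ == "__main__":
    print(noisy_channel_correct("китап", COUNTS))
```

The string-to-string edit models cited above learn edit probabilities from data instead of using one uniform penalty; the sketch only shows how the prior and the channel score combine.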
(Joseph Kaufmann and Jugal Kalita, 2010) used a machine translation approach with a pre-processor for syntactic rather than lexical normalization. (Fei Liu, Fuliang Weng, and Xiao Jiang, 2012) proposed a cognitively-driven normalization system that integrates different human perspectives in normalizing nonstandard tokens, including enhanced letter transformation, visual priming, and string/phonetic similarity. Fewer studies have been done on agglutinative languages than on English. (Gülşen Eryiğit, Dilara Torunoğlu-Selamet, 2017) introduced social media text normalization for Turkish by analyzing Web 2.0 Turkish texts, categorizing them into seven types, and providing candidate spelling corrections. (Mohammad Saloot, Norisma Idris, Rohana Mahmud, 2014) proposed an approach to normalizing Malay Twitter messages based on corpus-driven analysis. (Panchapagesan Krishnamurthy, P.P. Talukdar, N Sridhar, A.G. Ramakrishnan, 2004) introduced a novel approach to text normalization in which tokenization and initial token classification are combined into one stage, followed by a second level of token sense disambiguation. (O. De Clercq, B. Desmet, S. Schulz, E. Lefever, V. Hoste, 2013) used a multi-module approach that relies on Machine Translation and transliteration-based systems for social media messages in Dutch. Agglutinative languages tend to have longer words than fusional ones (Steffen Eger et al., 2016), and their morphology makes a spelling correction model more complex.
Evaluation
In this section we introduce our normalization framework, which covers both spelling correction and text preprocessing. Morphologically rich languages such as Kazakh, Korean, Finnish, Arabic, and Turkish are considered highly inflectional: one stem in these languages may have hundreds of possible forms.
Spelling Correction
Spelling errors are categorized into two classes: typographic and cognitive. Cognitive errors arise from the phonetic or orthographic similarity of words, when a person does not know how to spell a word.
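Typographic errors, by contrast, often come from hitting a neighboring key, which is the kind of keyboard error the abstract says the system accounts for before choosing the best match. The sketch below illustrates one way this could work; it is not the paper's implementation. It assumes the three letter rows of the ЙЦУКЕН layout, treats only left and right keys in the same row as neighbors, leaves out the Kazakh-specific letters, and uses an invented toy vocabulary with made-up frequencies.

```python
# Letter rows of the ЙЦУКЕН layout. Assumption: Kazakh-specific keys are omitted
# and only same-row left/right keys count as neighbors.
ROWS = ["йцукенгшщзхъ", "фывапролджэ", "ячсмитьбю"]


def build_neighbors(rows):
    """Map each key to the set of keys directly left and right of it."""
    neighbors = {}
    for row in rows:
        for i, ch in enumerate(row):
            adjacent = set()
            if i > 0:
                adjacent.add(row[i - 1])
            if i < len(row) - 1:
                adjacent.add(row[i + 1])
            neighbors[ch] = adjacent
    return neighbors


NEIGHBORS = build_neighbors(ROWS)


def keyboard_candidates(word, vocabulary):
    """Return vocabulary words reachable by swapping one letter for a neighbor key."""
    found = set()
    for i, ch in enumerate(word):
        for alt in NEIGHBORS.get(ch, ()):
            candidate = word[:i] + alt + word[i + 1:]
            if candidate in vocabulary:
                found.add(candidate)
    return found


def correct(word, vocabulary):
    """Keep known words; otherwise pick the most frequent keyboard candidate."""
    if word in vocabulary:
        return word
    candidates = keyboard_candidates(word, vocabulary)
    if not candidates:
        return word  # no suggestion available, leave the token unchanged
    return max(candidates, key=lambda w: (vocabulary[w], w))


# Hypothetical vocabulary with invented frequencies.
VOCAB = {"бала": 50, "кітап": 30, "мектеп": 20}

if __name__ == "__main__":
    print(correct("ьала", VOCAB))
```

Here "ьала" is mapped to "бала" because "ь" and "б" are adjacent on the bottom row. A fuller version would also cover vertical neighbors, insertions, deletions, and transpositions.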

Journal issue: Вестник науки No. 6 (39), Vol. 1

Citation:

Mukhanova M.B. TEXT NORMALIZATION AND SPELLING CORRECTION IN KAZAKH LANGUAGE // Вестник науки No. 6 (39), Vol. 1. pp. 7-17. 2021. ISSN 2712-8849 // Electronic resource: https://www.вестник-науки.рф/article/4530 (accessed: 25.04.2024)

Alternative link in Latin characters: vestnik-nauki.com/article/4530
