Yessenbay K.S.
STUDY OF PARAPHRASING METHODS FOR MACHINE TRANSLATION
Abstract:
This article discusses methods of paraphrasing in machine learning. Paraphrases are alternative ways of expressing the content of a phrase, a sentence, or a single word. Paraphrases are used in language for many reasons: they clarify, explain, describe, define, and reconstruct an expression, so they are very important for the study of natural language semantics. Over the past thirty years, various methods for automatic paraphrasing have emerged.
Keywords:
paraphrases, machine translation, natural language processing, statistical machine translation, data-oriented, corpora
People paraphrase naturally, usually when they cannot remember the exact wording and produce different words to express the same semantic content. The ability to recognize and generate paraphrases is important for many tasks, such as information retrieval, summarization, dialogue, semantic analysis, question answering, and machine translation. Data-driven paraphrasing approaches differ in the kind of data they use. So far, three types of data have been used for paraphrasing: multiple translations, comparable corpora, and monolingual corpora. Each type has its own advantages and disadvantages as a source of paraphrases. Let us take a closer look at each data type.

Barzilay (2003) proposed that multiple translations are a natural source of paraphrases, since the translations are produced by different translators who convey the same meaning. Each translator has their own way of expressing the meaning of the original text, and this is what makes multiple translations useful. Barzilay and McKeown (2001) were the first to use multiple translations for paraphrasing, applying sentence alignment techniques. Sentence alignment is a common method for extracting paraphrases from multiple translations.

Pang et al. (2003) also used multiple translations for paraphrasing. Instead of sentence alignment techniques that equate phrases surrounded by the same words, Pang et al. used a syntax-based alignment algorithm, which Figure 1 illustrates. Parse trees were merged by grouping constituents of the same type (for example, the two noun phrases and the two verb phrases in the figure). The merged parse trees were then mapped onto word lattices by creating alternative paths for every group of merged nodes. Different paths within the word lattices were treated as paraphrases of each other.

Any paraphrasing method that relies on multiple translations as a source of data has an inherent disadvantage: multiple translations are a rare resource.
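The word-lattice step described above for Pang et al. can be sketched in a few lines. This is a simplified illustration and not their implementation: it assumes the parse-tree merging has already happened and that its result is a linear sequence of slots, each holding the alternative phrasings that were grouped together. The names `lattice_paths` and `paraphrase_pairs`, and the toy phrases, are hypothetical.

```python
from itertools import product

# Toy, hand-made data (an assumption for illustration): each slot holds the
# alternatives that would be grouped together after merging constituents
# of the same type from two translations of one sentence.
lattice = [
    ["twelve people", "12 persons"],  # merged noun phrases
    ["were killed", "died"],          # merged verb phrases
]


def lattice_paths(slots):
    """Return every path through a linear lattice of alternative phrases."""
    return [" ".join(choice) for choice in product(*slots)]


def paraphrase_pairs(slots):
    """Pair alternatives that share a slot; each pair is a candidate paraphrase."""
    pairs = []
    for alternatives in slots:
        for i, left in enumerate(alternatives):
            for right in alternatives[i + 1:]:
                pairs.append((left, right))
    return pairs


if __name__ == "__main__":
    for sentence in lattice_paths(lattice):
        print(sentence)
    print(paraphrase_pairs(lattice))
```

Every path through the lattice is a full sentence, and any two phrases that occupy the same slot are treated as paraphrases of each other. A real lattice is a graph and not a flat list of slots, so this sketch only covers the simplest case.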
Methods that use multiple translations as a data source can generate only a small number of paraphrases for a limited set of linguistic usages and genres. Because many natural language processing applications require broad coverage, multiple translations are an impractical data source for "real" applications.

Comparable corpora are much more common than multiple translations. Comparable corpora consist of texts on the same topic, for example news articles on the same topic published in different newspapers, or articles on the same topic in different encyclopedias. Comparable corpora and multiple translations are similar in that both describe the same information as written by different authors. However, comparable corpora are harder to work with: articles on the same topic do not necessarily contain the same information and may focus on different details. Finding equivalent sentences across articles in comparable corpora is therefore challenging. Quirk et al. (2004) used sentences that were paired using the string edit distance method as the data source for extracting paraphrase
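Pairing sentences by string edit distance, as mentioned for Quirk et al. (2004), can be illustrated with a minimal sketch. It is not their pipeline: the word-level tokenization by whitespace, the `max_distance` threshold, and the function names `edit_distance` and `pair_sentences` are assumptions made for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (here: lists of tokens)."""
    previous = list(range(len(b) + 1))
    for i, token_a in enumerate(a, start=1):
        current = [i]
        for j, token_b in enumerate(b, start=1):
            cost = 0 if token_a == token_b else 1
            current.append(min(
                previous[j] + 1,         # delete token_a
                current[j - 1] + 1,      # insert token_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def pair_sentences(article_a, article_b, max_distance=3):
    """Pair sentences whose word-level distance is small but non-zero.

    Identical sentences (distance 0) are skipped because they carry no
    paraphrase information. The threshold is an illustrative assumption.
    """
    pairs = []
    for sent_a in article_a:
        for sent_b in article_b:
            distance = edit_distance(sent_a.lower().split(), sent_b.lower().split())
            if 0 < distance <= max_distance:
                pairs.append((sent_a, sent_b, distance))
    return pairs
```

Given two lists of sentences from articles on the same topic, `pair_sentences` returns the sentence pairs that differ by only a few words. Such pairs are candidates for paraphrase extraction, while sentences that differ in most of their words are left unpaired.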
Journal issue: Вестник науки No. 11 (32), Vol. 2
Citation:
Yessenbay K.S. STUDY OF PARAPHRASING METHODS FOR MACHINE TRANSLATION // Вестник науки No. 11 (32), Vol. 2. pp. 6-10. 2020. ISSN 2712-8849 // Electronic resource: https://www.вестник-науки.рф/article/3703 (accessed: 12.12.2024)