UNEATLANTICO presents a neural network-based lemmatization model for the Urdu language

17 Jan 2024
Researchers from the Universidad Europea del Atlántico (European University of the Atlantic, UNEATLANTICO) have collaborated with the Universidad Internacional Iberoamericana (International Iberoamerican University, UNIB) on a study that presents a lemmatization algorithm for the Urdu language.

In the field of natural language processing (NLP), machine translation (MT) optimizes communication between people by bridging the language gap. In machine translation, normalization and morphological analysis are important modules for information retrieval (IR).

Stemming and lemmatization are both commonly used to find the correct root of words in a language. However, studies on IR systems for the Urdu language show that lemmatization is more effective than stemming, given the infixes that are present in Urdu words. In semantics, the goal of lemmatization is to group the inflected forms of a word so that they can be reduced to a common form and analyzed as a single base term. In other words, it consists of removing the inflectional endings of words to return them to their base form.
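As a rough illustration of why word-internal changes matter, the sketch below contrasts a purely suffix-stripping approach with a lookup-based lemmatizer. It is a toy example, not the model from the study: the function names, the suffix list, and the lemma table are all hypothetical, and English irregular plurals stand in for Urdu forms whose change happens inside the word rather than at its end.

```python
# Toy illustration only. Everything here (names, suffix list, lemma table)
# is invented for this example and is not taken from the study.

SUFFIXES = ("ies", "es", "s", "ing", "ed")

# Hypothetical lemma table: maps an inflected form to its base form.
LEMMA_TABLE = {
    "books": "book",
    "mice": "mouse",
    "geese": "goose",
}


def strip_suffix(word: str) -> str:
    """Remove the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def lookup_lemma(word: str) -> str:
    """Return the base form from the table, or the word itself if unknown."""
    return LEMMA_TABLE.get(word, word)


for w in ("books", "mice", "geese"):
    print(w, "| suffix stripping:", strip_suffix(w), "| lemma:", lookup_lemma(w))
```

Suffix stripping handles "books" but leaves "mice" and "geese" unchanged, because the inflection is inside the word. A lemmatizer that maps whole forms to base forms does not have that limitation, which is the kind of advantage the studies cited above report for Urdu.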

There are few studies on the lemmatization of Urdu, and those that exist tend to focus on rules, leaving aside basic aspects such as noun identification and the handling of stop words and loanwords, among others. Therefore, the aim of this research is to present an improved lemmatization algorithm for the Urdu language based on ordinary neural network models. It focuses mainly on the detection of proper names and the lemmatization of Urdu morphological, inflectional, and derivational words, among others.

Research results

The results showed that the proposed model is able to address gaps in Urdu lemmatization, such as the handling of loanwords and stop words, noun identification, and Urdu words with diacritical marks. Likewise, the model efficiently handles the lemmatization of Urdu morphological, inflectional, and derivational words.

The integration of the AFED model greatly improved the performance of the system, which achieved accuracy, precision, recall, and F-score values of 0.96, 0.95, 0.95, and 0.95, respectively.
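For readers unfamiliar with these metrics, the F-score is commonly defined as the harmonic mean of precision and recall. The short sketch below assumes that standard definition (the article does not state which variant the study used) and checks that the reported precision and recall are consistent with the reported F-score. The function name is hypothetical.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the standard F1 definition).

    Assumption: the study's F-score is this standard F1 measure.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the article.
reported_precision = 0.95
reported_recall = 0.95

print(round(f_score(reported_precision, reported_recall), 2))  # 0.95
```

Under that assumption, a precision and recall of 0.95 each give an F-score of 0.95, matching the figure reported above.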

If you want to know more about this fascinating study, click here.

For further research, check the UNEATLANTICO repository.