Languages: how Inbenta creates the perfect lexicon from scratch

Inbenta currently supports 20+ languages, which together cover nearly half of the world’s native speakers. They include some of the most widely spoken languages in the world, such as Mandarin, Spanish and English, as well as less commonly spoken ones such as Catalan, Basque and Norwegian.

In charge of our team is Chief Linguistic Officer, Caterina Balcells. Caterina has been with Inbenta for more than ten years, leading our quest to support as many languages as possible. We decided to ask her exactly how each lexicon is developed from scratch:

What inspired you to become a computational linguist?  

I studied linguistics because I was good at languages at school, then I realized linguistics was a great thing to study because it has lots of intersections with other fields: neurology, sociology, anthropology… and computer science. After graduating I started teaching languages to people. I got tired of teaching human beings and was hired at Inbenta to teach computers instead – which are nowhere near as noisy as humans!

What does a computational linguist do day to day?

Most of the work for the linguists at Inbenta is project-related: each project (customer) we work with has at least one linguist and one developer assigned to it.

The linguist is responsible for the quality and adequacy of the knowledge base, suggesting improvements to the customer so that the content matches the real concerns of the users, fits the interface where it will be displayed and is easy to understand.

In addition, the linguist has to ensure that once the knowledge base is ready, we can actually answer a user’s question. Therefore, linguists need to add the necessary words, semantic relations or disambiguation rules to both improve our linguistic resources and adapt them to the customer’s needs.

Once the project is live, the linguists monitor what users are asking in order to fine-tune the project’s ability to match the contents with the right answers and to suggest improvements to the customer.

Finally, Inbenta is constantly evolving, and linguists are a cornerstone of our research and development process: we are continually adding new languages and functionalities.

How does Inbenta approach learning a new language?

We first need to analyze what type of language it is: do we have the right tools to develop it or will we need to develop new products to handle this new language? We do that together with our product team.

Then we can start developing the new lexical resources for that language: words, spelling correction rules, rules for resolving ambiguities in the language, etc.

What have been the most difficult languages to develop and why?

Two years ago we started working with Asian languages. We began with Japanese and have now added Mandarin and Korean to the list.

Some of the challenges in building the lexicon for these languages include their unique semantic and syntactic features as well as their writing systems. However, the hardest part is that neither Japanese nor Chinese puts spaces between words. For example, ThiswouldbeasentenceinJapanese.

We developed our own tokenizer for these kinds of languages and we have had to adapt all our tools to cope with the particular linguistic feature of not having spaces between words.
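
To give a feel for the problem, here is a minimal sketch of one classic textbook approach to splitting unspaced text: greedy forward maximum matching against a word list. It is an illustration only and not a description of Inbenta's tokenizer, which is not detailed in this interview. The function name max_match and the tiny LEXICON are hypothetical, and the English example simply mirrors the sentence above.

```python
def max_match(text, lexicon, max_len=None):
    """Greedy forward maximum matching over an unsegmented string.

    Illustrative sketch only: `lexicon` is a hypothetical set of known words.
    """
    if max_len is None:
        # Longest word in the lexicon bounds how far ahead we need to look.
        max_len = max((len(word) for word in lexicon), default=1)

    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a lexicon word matches.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No lexicon entry starts here: emit one character as a fallback.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical toy lexicon, lowercased for simplicity.
LEXICON = {"this", "would", "be", "a", "sentence", "in", "japanese"}

print(max_match("thiswouldbeasentenceinjapanese", LEXICON))
```

Greedy matching is easy to follow but can pick the wrong split when several segmentations are possible, which is one reason real tokenizers for these languages usually combine a large lexicon with statistical or rule-based disambiguation.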

What’s the best word from any language?

“Croquetes” is my own favorite word in Catalan, but I also asked our German linguist, Susana Hariri, who suggested “Bundespräsidentenstichwahlwiederholungsverschiebung”, which was chosen as the Word of the Year in 2016 in Austria and roughly means “postponement of the repeat of the run-off election for federal president”.

Another great German word which is in our lexical resource is “studentenauslandskrankenversicherung” or “health insurance for students living abroad” in English!

Supporting a new language is not as simple as translating our AI technology. Inbenta develops each lexicon from scratch with a computational linguist who is a native speaker.

The result of this approach is a highly developed lexicon that helps companies fully understand the meaning behind their customers’ queries.

Interested in finding out more? Our team of experts is at your service to design a custom proposal for you.

by Inbenta Team