Natural Language Processing & Text Analytics: the glossary

NLP, NLU, text mining, sentiment analysis… The terminology around natural language processing and text analytics technologies may seem difficult to understand. That's why we decided to write this glossary, so that our readers can see more clearly!

NLP (Natural Language Processing) is one of the most prolific areas of AI application. It is the branch of Artificial Intelligence concerned with understanding and processing human language.

Its uses in business are numerous: NLP offers a wealth of opportunities to improve productivity and reliability, and to make better decisions.

It is often estimated that around 80% of the data held by companies is in text format. So, whether in industry, law, health, or customer relationship management, any situation involving large quantities of text – to read, to synthesize, to process, to translate – deserves an NLP project. The underlying technology has made a significant leap in recent years thanks to machine learning, and will continue to progress!

NLP

NLP, or Natural Language Processing, refers to all the tasks that allow a computer to process data expressed in human language. It is therefore a computing discipline in its own right, covering many subjects and methods, which notably underpin search engines. Some authors distinguish between so-called “low-level” tasks, which give the computer a simple representation of the text, and “high-level” tasks, which allow the machine to “understand” the text.

NLU

A Natural Language Understanding (NLU) module extracts a simplified semantic representation from textual statements.

Let’s take the example of a chatbot that receives the following request:

“Hello, I’m looking to book a Chinese restaurant near the 11th arrondissement in Paris for tomorrow at 8:00 p.m., for 4 people. Is that possible?”

NLU will allow the bot to identify the user’s intention (find a restaurant) and qualify it with a list of entities (type of restaurant, number of people, location, etc.).
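
To make this more concrete, here is a minimal sketch of what such a simplified semantic representation could look like for the request above. It is only an illustration: the names ParsedRequest, parse_request and CUISINES, as well as the hand-written patterns, are our own assumptions, and a real NLU module would typically rely on trained models rather than regular expressions.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ParsedRequest:
    """Simplified semantic representation: one intent plus its entities."""
    intent: str
    entities: dict = field(default_factory=dict)


# Hypothetical list of cuisines, for illustration only.
CUISINES = ("chinese", "italian", "japanese", "indian")


def parse_request(text: str) -> ParsedRequest:
    """Extract an intent and a few entities with hand-written rules."""
    lowered = text.lower()
    is_booking = "book" in lowered and "restaurant" in lowered
    intent = "book_restaurant" if is_booking else "unknown"
    entities = {}

    for cuisine in CUISINES:
        if cuisine in lowered:
            entities["cuisine"] = cuisine
            break

    party_match = re.search(r"for (\d+) people", lowered)
    if party_match:
        entities["party_size"] = int(party_match.group(1))

    time_match = re.search(r"at (\d{1,2}:\d{2}\s*[ap]\.?m\.?)", lowered)
    if time_match:
        entities["time"] = time_match.group(1)

    place_match = re.search(r"(\d+(?:st|nd|rd|th) arrondissement)", lowered)
    if place_match:
        entities["location"] = place_match.group(1)

    if "tomorrow" in lowered:
        entities["date"] = "tomorrow"

    return ParsedRequest(intent, entities)


request = ("Hello, I'm looking to book a Chinese restaurant near the 11th "
           "arrondissement in Paris for tomorrow at 8:00 p.m., for 4 people. "
           "Is that possible?")
parsed = parse_request(request)
print(parsed.intent, parsed.entities)
```

The rules are deliberately naive: they only cover the phrasing of this example, which is precisely why production systems learn intents and entities from annotated data.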

Semantic Analysis

Semantic Analysis is the branch of natural language processing that aims to “understand” the meaning of a text. Quotation marks are required here, because a machine’s understanding of a text is much less rich than that of a human reader.

NLG

Natural Language Generation, or NLG, is the counterpart of semantic analysis: its objective is to transform data into text, with a result that should be indistinguishable from human writing. Nothing is more annoying, when you receive a letter or an email, than the moment when you realize that it was written by a machine…
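
The simplest form of data-to-text generation is a template filled in from structured data. The sketch below assumes a hypothetical booking record (the field names and the restaurant name are invented for the example); modern NLG systems generally use statistical or neural models instead, but the input and output have the same shape.

```python
def describe_booking(booking: dict) -> str:
    """Turn a structured booking record into a confirmation sentence."""
    people = booking["party_size"]
    # Agree the noun with the number so the sentence reads naturally.
    noun = "person" if people == 1 else "people"
    return (f"Your table for {people} {noun} at {booking['restaurant']} "
            f"is confirmed for {booking['date']} at {booking['time']}.")


# Hypothetical data, for illustration only.
booking = {"party_size": 4, "restaurant": "Le Dragon",
           "date": "tomorrow", "time": "20:00"}
print(describe_booking(booking))
```

Even this small example shows why generated text can feel mechanical: every confirmation has exactly the same wording, whatever the context.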

Sentiment Analysis

Sentiment Analysis, sometimes called “opinion mining”, is the natural language processing task of identifying whether a statement (a sentence, a tweet, a piece of feedback…) is positive, neutral or negative from a given point of view.

Humans can express their opinions spontaneously (for example by writing on social networks or sending complaint emails) or in response to a solicitation (in particular following a satisfaction survey). What opinions are expressed by the speaker? Is it a positive, negative, neutral or mixed feeling? What exactly is this opinion about? What emotions emerge between joy, anger, fear, surprise, sadness, disgust, trust…? Is a consumer who writes “I am surprised that I still haven’t received a response” surprised, disappointed or angry, or does he or she feel several of these emotions simultaneously?

From a technical point of view, sentiment analysis can be seen as a special case of relation extraction. Indeed, an opinion links the speaker who expresses it to the real-world objects concerned (a product on the market, a service offered, a political action…).

Opinion Mining

Opinion Mining is another, more academic, name for Sentiment Analysis.

Polarity

Polarity, in sentiment analysis, refers to a score or classification of a textual extract according to the tone of the opinion (positive / neutral / negative).
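
As an illustration, polarity can be approximated by counting words from a sentiment lexicon. The sketch below is a minimal example under strong assumptions: the two word lists are tiny and invented for the occasion, and the function name polarity is ours. Real systems use much larger lexicons or trained classifiers.

```python
import re
from typing import Tuple

# Tiny hypothetical lexicon, for illustration only.
POSITIVE = {"good", "great", "excellent", "love", "happy", "fast"}
NEGATIVE = {"bad", "terrible", "slow", "hate", "disappointed", "broken"}


def polarity(text: str) -> Tuple[float, str]:
    """Return a score in [-1, 1] and a positive / neutral / negative label."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(word in POSITIVE for word in words)
    neg = sum(word in NEGATIVE for word in words)
    total = pos + neg
    # No opinion word at all is treated as neutral.
    score = 0.0 if total == 0 else (pos - neg) / total
    if score > 0:
        label = "positive"
    elif score < 0:
        label = "negative"
    else:
        label = "neutral"
    return score, label


print(polarity("The delivery was fast and the product is great"))
print(polarity("Terrible service, I am disappointed"))
```

A word-counting approach like this would label “I am surprised that I still haven’t received a response” as neutral, since none of its words is in the lexicon, which shows the limits of the method on implicit opinions.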

Text Mining

The purpose of Text Mining is to extract knowledge from a given corpus of documents. Text Mining applies Semantic Analysis to the documents of a corpus in order to feed a data mining system. The latter interprets the results of the textual analysis so as to highlight interesting correlations, perform time series analysis… Text mining works better on a corpus that is sufficiently homogeneous for the same analyzer to process all the documents.

Speech to text

This task consists of extracting the text from an audio document, for example transcribing speech when dictating a message via a virtual assistant like Siri.

Once the audio has been transformed into text, it can then be analysed semantically.

Categorization

The automatic categorization of a document (or part of it) means forming a global understanding of its content and putting it in one or more boxes. In what language is it written? What type of document is it? Does it contain an urgent request that needs to be handled? What are its main themes? Is an email spam or not?
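
Here is a minimal sketch of the “one or more boxes” idea, using keyword rules. The categories and keywords are hypothetical and chosen only for the example; in practice, categorization is usually done with a classifier trained on labelled documents.

```python
import re

# Hypothetical keyword rules, for illustration only.
CATEGORY_KEYWORDS = {
    "urgent": {"urgent", "immediately", "asap"},
    "billing": {"invoice", "payment", "refund"},
    "spam": {"lottery", "winner", "prize"},
}


def categorize(text: str) -> list:
    """Return every category whose keywords appear in the text (possibly none)."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(category
                  for category, keywords in CATEGORY_KEYWORDS.items()
                  if words & keywords)


print(categorize("URGENT: please send the refund immediately"))
print(categorize("You are the winner of our lottery"))
```

Note that a document can land in several boxes at once (here, both “billing” and “urgent” for the first message), or in none.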

Terminological Extraction

Terminological Extraction, sometimes loosely called chunking, is the task of identifying groups of words that form useful expressions. “Sweet potato”, for example, does not have the same meaning as the separate words “sweet” and “potato”. We also speak of multi-word expressions.

Linguists also use the term collocation for a frequent multi-word expression, whether it appears in everyday language (“fast food”) or in a specialized vocabulary (“personality disorder”).

Recognizing multi-word expressions is not as simple as it sounds, because language can be ambiguous. Take for instance the sentence “they were milking cows”. Without context, you cannot tell whether it describes people who were milking cows, or cows of a particular type (“milking cows”).
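
One common statistical way to spot candidate multi-word expressions is to look for adjacent words that occur together more often than chance would predict, for instance with pointwise mutual information (PMI). The sketch below is a simplified illustration on a made-up three-sentence corpus; the function name collocations and the min_count threshold are our own choices.

```python
import math
import re
from collections import Counter


def collocations(texts, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI)."""
    unigrams, bigrams = Counter(), Counter()
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))

    n_words = sum(unigrams.values())
    n_pairs = sum(bigrams.values())
    scored = []
    for (first, second), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = count / n_pairs
        p_first = unigrams[first] / n_words
        p_second = unigrams[second] / n_words
        scored.append(((first, second), math.log2(p_pair / (p_first * p_second))))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up corpus, for illustration only.
corpus = [
    "I baked a sweet potato pie",
    "the sweet potato was a hit",
    "a sweet potato is a root vegetable",
]
ranked = collocations(corpus)
print(ranked[0][0])
```

On this toy corpus the pair (“sweet”, “potato”) ranks first, ahead of (“a”, “sweet”). Statistics alone do not resolve ambiguities such as “milking cows”, which need syntactic or contextual analysis.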

Named Entity Recognition

Local understanding of a text consists of highlighting words or groups of words in order to recognize concepts. In linguists’ jargon, this task is called Named Entity Recognition (NER), or Entity Extraction.

The simplest case is to identify in a text a single piece of information such as a date, a financial amount, a percentage, a telephone number, an e-mail address, a URL, a license plate number, a social security number…
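
For these simple, well-formatted entities, regular expressions are often a reasonable starting point. The following sketch is an illustration only: the patterns are deliberately simplified assumptions (one date format, two currency codes) and would need to be extended for real documents.

```python
import re

# Simplified, hypothetical patterns; real-world formats vary much more.
PATTERNS = {
    "email": r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+",
    "url": r"https?://[^\s,;]+",
    "percentage": r"\d+(?:\.\d+)?\s?%",
    "amount": r"\d+(?:[.,]\d+)*\s?(?:EUR|USD)",
    "date": r"\b\d{1,2}/\d{1,2}/\d{4}\b",
}


def extract_entities(text):
    """Return (label, matched_text) pairs in order of appearance."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in re.finditer(pattern, text):
            found.append((match.start(), label, match.group()))
    # Sort by position in the text, then drop the position.
    return [(label, value) for _, label, value in sorted(found)]


sample = ("Invoice of 1,250.50 EUR due on 15/03/2024, a 20% discount applies. "
          "Contact billing@example.com or see https://example.com/pay")
for label, value in extract_entities(sample):
    print(label, value)
```

Entities such as person or company names cannot be captured this way, which is why they are generally handled with dictionaries or machine learning models.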

Lexical Analysis

Lexical Analysis, generally called tokenization, is the task of splitting the sequence of symbols in a text into “words”, thus creating the lexicon of a given corpus. It is therefore a prerequisite for Terminological Extraction.
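
A very simple tokenizer can be sketched with a regular expression that keeps words (including internal apostrophes) and isolates punctuation marks. This is a minimal example for English-like text, with function names of our own choosing; languages without spaces between words, or texts full of abbreviations, need more elaborate tools.

```python
import re

# A word (optionally with internal apostrophes), or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")


def tokenize(text):
    """Split a text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


def build_lexicon(corpus):
    """Collect the distinct lowercased words found in a list of documents."""
    return sorted({token.lower()
                   for document in corpus
                   for token in tokenize(document)
                   if token[0].isalnum()})  # skip punctuation tokens


print(tokenize("They were milking cows."))
print(build_lexicon(["They were milking cows.", "Cows eat grass."]))
```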
