Figure 1: Themes in response to “What are you doing or could you do for the environment?”
Key principle: moving from words to concepts
Language is a complex subject for computers to handle. Humans speak as easily as we breathe, to the point of forgetting that before knowing how to communicate, we first spent a few years learning to speak and then to write, and studied at length before learning to summarize a text (which remains a difficult exercise even for humans). So let's dispel a myth right away: the machine doesn't understand, it simulates. It does not really do sentiment analysis: it sorts, arranges, and classifies information according to symbols such as letters, words, and sentences. To find interesting insights, you will have to guide the machine; on its own, it can only produce "statistics" from the language used in a corpus. But that help is already extremely valuable when we have the right tools.
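To make that concrete, here is a minimal sketch (plain Python, no NLP library) of the kind of raw "statistics" a machine can extract unaided: it only counts and ranks symbols, with no grasp of what they mean. The sample responses are invented for the illustration.

```python
from collections import Counter
import re

# Invented sample responses, standing in for a real corpus of contributions.
corpus = [
    "Je trie mes déchets et je composte.",
    "Je prends le vélo au lieu de la voiture.",
    "Je trie mes déchets et je réduis le plastique.",
]

# Tokenize: lowercase, then split on non-letter characters.
tokens = []
for response in corpus:
    tokens.extend(re.findall(r"\w+", response.lower()))

# The "statistics" the machine produces on its own: frequency counts.
print(Counter(tokens).most_common(5))
# e.g. [('je', 5), ('trie', 2), ('mes', 2), ('déchets', 2), ('et', 2)]
```

Frequencies alone say nothing about meaning; turning them into insight is where human guidance and better tooling come in.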
The Great French National Debate: a useful case study for all of us
The Great French National Debate, launched on January 15, 2019 by the French Government at the initiative of Emmanuel Macron, the French President, is based on a digital platform (granddebat.fr) allowing every citizen to express themselves on four themes: ecological transition; taxation and public spending; democracy and citizenship; and the organization of the state and public services. Around these themes, a number of open-ended questions were asked, such as: "What are you doing today to protect the environment and/or what could you do?". Hundreds of thousands of contributions have been produced: on the theme of ecological transition alone, this represents more than 700,000 responses (at the time of this publication) to the 12 open questions.
Analyzing this material is a superhuman task. The amount of information is more than a human can reasonably read and synthesize: this volume of text represents some 20 million words, or roughly 40 times the size of the book War and Peace. But assisted by an artificial intelligence system, the task becomes quite accessible, without necessarily requiring advanced technical skills.
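A quick back-of-envelope check of that order of magnitude, with assumed figures (the average response length and the word count of War and Peace are not given in the source):

```python
# Rough sanity check of the scale claim (the two assumed figures below are
# illustrative, not taken from the Grand Débat data itself).
responses = 700_000            # responses on the "ecological transition" theme
avg_words_per_response = 30    # assumed average length of a contribution
war_and_peace_words = 500_000  # approximate word count of the novel

total_words = responses * avg_words_per_response
print(total_words)                              # ~21 million words
print(total_words / war_and_peace_words)        # ~42, i.e. roughly 40x
```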
What can the machine do?
In language processing, there are roughly two schools. The symbolic approach seeks to “encode” the rules of language (grammar, syntax, lexicography) and produces expert systems based on linguistic rules. The statistical approach has recently seen spectacular breakthroughs with the results of artificial neural networks, better known as machine learning and deep learning.
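To make the contrast concrete, here is a small hypothetical sketch: the symbolic approach encodes explicit lexical rules, while the statistical approach learns a classifier from annotated examples. The categories, keyword rules, and training data are invented for the illustration, and scikit-learn stands in for whichever statistical toolkit is actually used.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

text = "Je prends le vélo pour aller au travail"

# Symbolic approach: hand-written lexical rules (invented for the example).
rules = {"transport": r"\b(vélo|voiture|train|covoiturage)\b",
         "déchets":   r"\b(tri|recycl\w*|compost\w*)\b"}
symbolic = [theme for theme, pattern in rules.items()
            if re.search(pattern, text, flags=re.IGNORECASE)]

# Statistical approach: a classifier learned from annotated examples.
train_texts  = ["je trie mes déchets", "je composte",
                "je prends le train", "je fais du covoiturage"]
train_labels = ["déchets", "déchets", "transport", "transport"]
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)
statistical = model.predict([text])[0]

print(symbolic, statistical)   # e.g. ['transport'] transport
```

The rules are transparent but costly to write and maintain; the learned model generalizes from data but needs examples and remains opaque, which is precisely why the two are often combined.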
While these two methods are often opposed, in practice a hybrid approach should be favored in order to take advantage of the incredible capabilities that Artificial Intelligence brings when complemented by human intelligence. By combining these two approaches, we offer humans an "augmented intelligence" thanks to what the machine can produce: the machine proposes, the human optimizes (validating, correcting, and directing the machine).
When you load your body of text, Proxem Studio automatically sorts, organizes, and classifies the entire vocabulary and proposes thousands of concepts that emerge naturally from the data. This method has a double advantage over old-school approaches. On the one hand, it avoids starting from a priori assumptions: the user no longer needs to spend days building a dictionary of words for the expected themes; they can let themselves be guided by what is actually present in the data. On the other hand, it is no longer necessary to manually annotate thousands of example documents to teach the machine the relevant concepts: it is now able to identify them on its own.
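A rough sketch of the underlying idea, using scikit-learn rather than Proxem Studio itself: build a term-document matrix and cluster the terms, so that groups of related words emerge from the data without any predefined dictionary or manual annotation. The corpus and the number of clusters are invented for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Tiny invented corpus standing in for the real contributions.
corpus = [
    "je trie mes déchets et je recycle le verre",
    "je composte mes déchets de cuisine",
    "je prends le vélo au lieu de la voiture",
    "je privilégie le train et le covoiturage",
]

# Term-document matrix: one row per document, one column per term.
vectorizer = TfidfVectorizer()
doc_term = vectorizer.fit_transform(corpus)

# Cluster the *terms* (columns) by the documents they appear in,
# so that words used in the same contributions fall into the same group.
terms = vectorizer.get_feature_names_out()
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    doc_term.T.toarray())

for cluster in range(2):
    print(cluster, [t for t, l in zip(terms, labels) if l == cluster])
```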
The example below shows the result obtained on the data of the "ecological transition" theme after analyzing 700,000 contributions. The machine has already automatically grouped terms (words and expressions) that "go together". To do this, it operates a bit like we do when reading the Smurfs comics: with our old human brains, we easily understand each occurrence of "smurf" from the context of the words around it. This is exactly what the machine does when it automatically groups expressions into themes and sub-themes.
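The "Smurf" intuition is essentially the distributional hypothesis: a term is characterized by the words that surround it. A hedged sketch with gensim's Word2Vec (a standard embedding library, not necessarily what Proxem Studio uses internally) shows how terms sharing similar contexts end up with similar representations; the tokenized contributions are invented for the example.

```python
from gensim.models import Word2Vec

# Tokenized contributions (invented); in practice the model would be trained
# on the full corpus of several hundred thousand responses.
sentences = [
    ["je", "prends", "le", "vélo", "pour", "aller", "au", "travail"],
    ["je", "prends", "le", "train", "pour", "aller", "au", "travail"],
    ["je", "trie", "mes", "déchets", "à", "la", "maison"],
    ["je", "composte", "mes", "déchets", "à", "la", "maison"],
] * 50  # repeat the toy data so the model has something to learn from

model = Word2Vec(sentences, vector_size=50, window=3,
                 min_count=1, epochs=20, seed=0)

# Terms used in similar contexts get similar vectors, just as we infer the
# meaning of "smurf" from its surroundings.
print(model.wv.most_similar("vélo", topn=3))
```

Grouping such neighboring terms is what yields the themes and sub-themes shown in Figure 1.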