In our previous data-related blog we took a deep dive into the basics of text mining. In that blog, we established the presence and importance of text data, and argued that everyone should have a basic understanding of how to work with it. We also addressed the steps you need to take when working with text data and how they fit into the overall process, and we had a first look at some of the techniques that you’ll need. Overall we got a better understanding of how to work with text data, but there is a lot more to the field of text mining and natural language processing (NLP). In this blog, we’re going to focus on several more advanced techniques.
Natural Language Processing
The field of natural language processing is concerned with the interactions between humans and computers and focuses on how to help programs understand language data. When a computer is able to (nearly) understand the details and contextual nuances of language, we can put it into use by helping us extract insights and information from texts.
While both text mining and natural language processing deal with textual data, there are some differences between them. Text mining focuses more on how to retrieve information from text data and discover patterns within those texts, while NLP tries to understand and replicate language. Thus, the most important difference lies within their understanding of the texts they analyse. While text mining allows for extracting details and information from the text data, there is no understanding of the information within those texts. The techniques that NLP employs dive into the grammatical and semantic properties of texts and thus build up more of an understanding of those texts.
Those techniques have led to exciting applications, such as search engines, intelligent chatbots, and spellcheckers. There are a lot of techniques and methods; in this blog we will discuss embeddings, topic modelling, and transformer-based language models.
Embeddings; converting words into numerical representations
One of the challenges of working with text data is trying to represent the contextual meaning of specific words within a sentence or text. As we established in our earlier blog, humans have a lot of context we hear, see, or know before we read a text. A computer program often doesn’t have that luxury. In order to try and solve this problem, we can use the technique of word embeddings.
Word embeddings can be seen as numerical representations of words in a large number of dimensions. They are a way of transforming text data, such as words and sentences (in the form of strings), into numerical vectors. Instead of having to compare the words themselves, you can then compare the numerical representations of those words, which is something computers excel at.
To get an embedding for each word in a text, we’re going to try and score each word in several different categories based on the surrounding words. It might be difficult to visualize this, but I’ll try to make it more clear with an example. Let’s use the word ‘cat’ from the sentence ‘there is a cat sitting in a tree’. Now, let’s give it a score between -1 and 1 in several categories. For the category animals it will score 0.9, for the category food it will score 0.2, and for the category Christmas, it will score -0.7. And now repeat this for at least a couple hundred categories (or dimensions). And then do this for each word in each sentence. Eventually, you’ll end up with a numerical representation for each word. And you can compare the numerical representation, or vectors, to each other (see Figure 1).
Figure 1. A visual representation of word embeddings. On the left, there are several words with their scores in multiple dimensions. Their behavior is plotted on the right. (source)
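To make the comparison of vectors a bit more concrete, here is a minimal sketch in Python. It assumes the three hand-made category scores from the ‘cat’ example (animals, food, Christmas); the scores for ‘dog’ and ‘turkey’ are made up for illustration, and real embeddings are learned rather than written by hand. Cosine similarity is one common way to compare two vectors, not the only one.

```python
import math

# Hypothetical hand-made scores in three categories: (animals, food, Christmas).
# Only the 'cat' scores come from the example in the text; the others are
# invented for illustration. Real embeddings have hundreds of learned dimensions.
embeddings = {
    "cat": [0.9, 0.2, -0.7],
    "dog": [0.8, 0.3, -0.6],
    "turkey": [0.5, 0.9, 0.8],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
cat_turkey = cosine_similarity(embeddings["cat"], embeddings["turkey"])

print(f"cat vs dog:    {cat_dog:.2f}")
print(f"cat vs turkey: {cat_turkey:.2f}")
```

With these made-up scores, ‘cat’ ends up much closer to ‘dog’ than to ‘turkey’, which is the kind of comparison the vectors are meant to enable.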
If we do this automatically, and on a larger scale, we can create vectors for a large number of words. The vectors are scored using the surrounding words as context, so essentially the behavior of each word is captured within the vector. Words with a similar context will be grouped or scored together. With the previous example in mind, the word ‘dog’ will probably behave similarly to the word ‘cat’, because of the surrounding words being similar. This will lead to similar words having a similar score. The similarity in behavior is even captured on a deeper level, with synonyms having similar scores. Very cool! This is all done using shallow neural networks. The most popular technique to do this is called word2vec.
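The idea that words with similar surrounding words end up with similar vectors can be sketched without a neural network. The example below is not word2vec: it simply counts which words appear within a small window around each word and compares those counts. The tiny corpus, the window size, and the function names are all assumptions made for illustration; word2vec instead learns dense vectors with a shallow neural network on far more text.

```python
import math
from collections import Counter

# A tiny, made-up corpus; a real model would be trained on millions of sentences.
corpus = [
    "the cat sat in the tree",
    "a dog sat in the garden",
    "the cat chased the mouse",
    "the dog chased the ball",
    "we ate the cake at christmas",
]


def context_vectors(sentences, window=2):
    """Count, for every word, the words occurring within `window` positions of it."""
    vectors = {}
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            start = max(0, i - window)
            end = min(len(words), i + window + 1)
            context = words[start:i] + words[i + 1:end]
            vectors.setdefault(word, Counter()).update(context)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


vectors = context_vectors(corpus)
cat_dog = cosine(vectors["cat"], vectors["dog"])
cat_cake = cosine(vectors["cat"], vectors["cake"])

print(f"cat vs dog:  {cat_dog:.2f}")
print(f"cat vs cake: {cat_cake:.2f}")
```

Because ‘cat’ and ‘dog’ are surrounded by words like ‘sat’ and ‘chased’ in this corpus, their context counts are more alike than those of ‘cat’ and ‘cake’.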
Using the word embedding techniques, we can convert words to a numerical representation, and use that to compare them with each other. There are options to extend this technique to a larger scale, such as paragraphs or entire documents (doc2vec). This is a similar concept to word2vec, except that the corresponding, overarching text is used as a reference during training. Instead of every word in each text being transformed into a vector, the entire text is being transformed into a single vector. And this single vector could summarize the information of the entire text into one numerical representation. But if we want to try and summarize an entire text, other techniques are also available.
Topic modelling; clustering words into topics
The techniques that apply embeddings are all based on the words and how they behave together. Now let’s have a look at a technique that looks at the composition of words within texts: topic modelling. Topic modelling is based on several core concepts. All texts are a collection of words, and based on the composition of the words in a text, we can deduce what the text is about. We can do this by categorizing the words into different topics, in such a way that each topic is represented by a collection of words.
Let’s go into a little more detail about topic modelling. It’s an unsupervised machine learning method that aims to detect common themes within documents based on which words tend to occur together. Topic modelling essentially tries to cluster the words into groups that are often used together. For example, a text about cats will often have words that are related to that subject, such as ‘kitten’, ‘cat’, ‘paw’, ‘fur’, and ‘meow’. By identifying these words, topic modelling can group texts containing these words with other texts that talk about similar topics. Using these concepts as a foundation, we can create a model that can score the topics within a text. An important distinction is that each text is treated as a combination of topics, rather than being assigned to exactly one topic.
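The scoring step can be illustrated with a small Python sketch. Note the big assumption: the two topics below are written by hand, while a real topic model learns its word groups from the documents without any labels. The topic names, the word lists, and the function name are hypothetical; the sketch only shows how one text can be described as a mixture of topics.

```python
from collections import Counter

# Hypothetical, hand-written topics. A real topic model would learn these
# word groups from a large set of documents instead of having them spelled out.
topics = {
    "cats": {"kitten", "cat", "paw", "fur", "meow"},
    "food": {"cake", "dinner", "recipe", "oven", "taste"},
}


def topic_mixture(text, topics):
    """Return the share of topic words in `text` that belongs to each topic."""
    words = [word.strip(".,!?").lower() for word in text.split()]
    counts = Counter()
    for word in words:
        for name, vocabulary in topics.items():
            if word in vocabulary:
                counts[name] += 1
    total = sum(counts.values())
    return {name: (counts[name] / total if total else 0.0) for name in topics}


text = "The kitten licked its paw while the cake was in the oven. Meow!"
mixture = topic_mixture(text, topics)

for name, share in mixture.items():
    print(f"{name}: {share:.0%}")
```

The example text mentions both cats and baking, so it gets a share of each topic instead of a single label.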
We can use topic modelling to quickly categorize or summarize texts. It comes in handy especially when there are a lot of texts to analyse. For example, when receiving a lot of reviews in customer support, one could use topic modelling to group the reviews by the subjects they raise and act accordingly. It can also be used in recommender systems, using topic modelling to infer the topics of texts you have read and liked, in order to recommend new texts.
Figure 2. A visual representation of topic modelling. A text contains words, and some words are found to be part of specific topics. (source)
Transformer-based language models
Developments in the world of data come quickly these days, and every once in a while one of them has a lot of impact. The development of transformer-based models was one of those. They were introduced in 2017 by researchers at Google. That’s one advantage of the tech giants: they develop a lot of exciting technologies, and they often make them open source. This allows us to make use of them and tune them to our needs.
Transformer-based models have a lot of applications within the world of natural language processing, and outside of it as well (for example in computer vision). These are some of the most complicated models there are, so I won’t go into too much detail on how they work. Making use of them can be quite difficult, but the results can be outstanding. One of the biggest developments is that these models can be trained on larger datasets than was previously possible, because their architecture allows more parallelization in data processing and training. Their design led to pre-trained models being developed by (large) tech companies, such as BERT (Google), BART (Facebook), and GPT (OpenAI). These pre-trained models are trained on large datasets, so they are already very useful on their own. Personally, I always use a Python library called transformers when I want to make use of them. Because many of these models are openly available, we can fine-tune them to our own use cases. And we can use them for very exciting things.
On the natural language processing side of things, the possibilities seem endless. We can use transformer-based language models for summarization, translation, text classification, information extraction, question answering, and text generation, and that in over 100 languages. One of the coolest and most terrifying examples of the power of these language models is an experiment done by Liam Porr on the text generation capabilities of GPT-3 (a transformer-based language model developed in 2020). He tuned the model in such a way that it would write a blog on its own, and then posted it online to see whether people would notice the difference. The blog was published and subsequently rose to the number one spot on the forum, and almost no one had any idea that it was written by a machine.
Beyond small experiments, the models have a lot of real-life applications. When you are typing in the Google search engine, it might suggest the next word. When translating from one language to another using Google Translate, you are making use of transformers. They seem to be everywhere, and rightly so. If you manage to accomplish the difficult task of incorporating a transformer-based language model into your systems, the rewards can be quite high.
Advanced natural language processing
The three techniques discussed in this blog are just a small taste of the possibilities in the world of natural language processing. We looked at the basics of word embeddings and topic modelling, and marveled at the power of transformer-based language models. With this blog, you will have gained more insights into some of the exciting technologies used with text data.
Do you want to learn more about this topic in an interactive way? Just join our free webinar via this link!