The Basics of Text Mining
Everyone has some experience with text mining, even if they are not aware of it. Who hasn’t used Google to find the answer to a question? Text mining is everywhere in our daily lives, from the filtering of spam in our mailboxes to chatting with chatbots on websites. The possibilities in the field of text mining and natural language processing are rapidly expanding. Models have been developed that can create summaries from long texts, retrieve the correct answers to questions from texts, and even write their own creative blogs.
The foundation of these developments can be found in the field of text mining. On a professional level, text mining is also becoming more useful. For example, in the world of finance, text mining is finding its way into areas such as stock market prediction, fraud detection, risk management, and customer relationship management.
Text mining is becoming increasingly popular in both our daily and professional lives, and that is due to the importance of text itself.
Text as data
Text is one of those things that we simply cannot ignore. It’s everywhere and it has been around for thousands of years. With the rise of the internet in recent decades and the emergence of social media, text has become even more present, especially in a digital format. In the digital age we live in, working with data is becoming increasingly accessible and important.
And while working with data is challenging enough, working with text data has its own challenges. For instance, how can you describe a collection of documents that have endless variations of words? How can you take the mean or median of a sentence or set of words? How can you work with different words that mean the same thing?
That’s where the field of text mining comes into play. Text mining is a field of study that deals with text data and focuses on how to convert text into useful information and insights. Anyone aspiring to work in the world of data needs to have some level of understanding of text mining.
Challenges of working with text
Before you can start working with text data, it’s important to understand the challenges that come with it, and why it’s so different from other forms of data. The biggest challenge lies in the variability of text.
While other types of data are often limited or clear in their definition, this is not the case with text data. There are almost endless variations. Ask ten different people to describe something as clear as a Christmas tree, and they will all come up with a slightly different description of that same Christmas tree. And this is assuming that it’s all in the same language, which might not be the case, with around 6,500 different languages across the globe, each with its own unique grammar and vocabulary.
Even if the ten descriptions are in the same language, there are enough confounding factors that can lead to differences in the descriptions. Those factors are easily recognized and dealt with by humans, but that’s a luxury that computer programs don’t have. We humans learn the basic rules of languages and texts at an early age, while computers need very clear rules before they can even start to tackle the basics of language.
Some of the more general challenges are discussed below.
- Ambiguity is when the meaning of a word or sentence is unclear, or when it can have multiple meanings. The phrase “Call me a taxi, please.” can be interpreted in different ways (think on that for a moment). For humans, ambiguity is often resolved by context. From the context, we can tell whether someone needs transport in the form of a taxi, or whether someone wants you to call him/her by the name Taxi. This context requires knowledge about the world around the phrase, and this is hard to incorporate in computer programs.
- Synonymy is another aspect of language that’s easy to understand for humans, while computer programs will have a hard time working with it. Most people know that “start, commence, begin, and initiate” mean (nearly) the same thing. A computer program does not have that inherent knowledge. It simply sees four different combinations of letters.
- Morphology revolves around how words are formed, and around their relationship to other words. Any human knows or can see that the words “wait, waiting, waited, waits” are all related to the same word, and roughly have the same meaning. Even if we don’t exactly know a word, we can often infer its meaning due to our knowledge about other similar words. And it’s very difficult to transfer this knowledge to a computer program.
The examples above have one thing in common: they show that it’s very hard, even close to impossible, to teach a computer the rules of language. The endless possible variation in language forces us to be creative in how we deal with it. By using key principles of text mining, it’s possible to greatly reduce this enormous variability.
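To make the synonymy and morphology points concrete, here is a minimal Python sketch. The `naive_stem` function and its three suffix rules are hypothetical, chosen only for this illustration. It shows that a handful of hand-written rules can group the “wait” forms, while the synonyms stay four unrelated strings.

```python
# To a program, a word is just a sequence of characters.
synonyms = ["start", "commence", "begin", "initiate"]
inflections = ["wait", "waiting", "waited", "waits"]

# Exact comparison is all a program gets for free: every form is distinct.
distinct_synonyms = len(set(synonyms))
distinct_inflections = len(set(inflections))


def naive_stem(word):
    """Strip a few common English suffixes (a crude, hypothetical rule set)."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# The suffix rules collapse the inflections into a single form...
inflection_stems = {naive_stem(word) for word in inflections}
# ...but they do nothing for synonyms, which share no letters to strip.
synonym_stems = {naive_stem(word) for word in synonyms}

print(distinct_synonyms, distinct_inflections)
print(inflection_stems)
print(len(synonym_stems))
```

Rules like these break down quickly on irregular forms such as “went” or “better”, which is part of why language is so hard to capture in a program.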
Reducing the variability of text
Text mining revolves around simplifying texts and documents to a format that computers can more easily understand and use, so that insights can be gained. But, as mentioned before, there is simply too much variation in language. Thus, we need to reduce the variation.
But before we go into detail about the different techniques, let’s name our assumptions. We want to get as much insight as possible from the texts, and we assume that the different words in each text will contribute to that insight. Therefore, we want to reduce the variation of the texts while leaving as much potential for insights as possible.
There are numerous techniques you can use, and they are often surprisingly simple while greatly reducing the variation in texts. The most common techniques will be discussed below.
- Tokenizing texts into individual words and/or sentences, essentially cutting them into smaller parts. By default, a computer program sees an entire text as a single object, but you want to be able to work with individual words or sentences. You need to convert the information from one object to multiple objects.
- From uppercase to lowercase. For a computer program, ‘a’ is different from ‘A’. And while there is some information in the difference between lowercase and uppercase letters, it also greatly increases the variability of a piece of text. Therefore, one of the first steps in text mining is to convert your texts from mixed lowercase and uppercase to entirely lowercase. This halves the number of distinct letters a program has to deal with, so that ‘The’, ‘THE’ and ‘the’ all become the same word.
- Removing stop words. There are some words that occur very frequently but don’t actually provide much meaning to the sentence they are in. Removing these words to reduce variation in texts is quite an effective method. Words like ‘the’, ‘a’, ‘and’, ‘for’ or ‘he’ don’t carry too much weight, and can therefore be omitted. There are standard lists of stop words available, and you can always add your own custom words if there are other words that occur too frequently while they don’t provide extra meaning. Removing stop words will help reduce the amount of data and make sure that the remaining words actually contribute to the meaning of the text.
- Another side of the coin is removing extremely rare words. Some words occur only very infrequently, and thus it is difficult to discern if they have meaning. It really depends on the case you’re working on. Sometimes you would want to keep them, while in other situations you would want to remove them. Carefully consider what the added benefit is to removing or keeping those rare words.
- Removing punctuation. Besides letters, there are a lot of extra signs that help in writing and reading a document. But again, these only add to the variability of the texts and combinations of words. They may make a text readable to a human, but a computer program won’t always need them when looking to extract meaning from texts.
- Generalizing types of words. Sometimes, texts contain items that are all different but belong to the same type, such as email addresses and mobile phone numbers. Every phone number is different, but it’s always a combination of some numbers. For your use case, it might not be useful to know the individual telephone numbers, but it could be helpful to know that a telephone number is mentioned. To put this into practice, you could replace all telephone numbers with the same standard phrase. That way, you reduce the variability, but you’ll keep the most relevant information.
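The techniques above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a production pipeline. The stop word list is deliberately tiny, and the phone number pattern, the `phonenumber` placeholder and the function names are assumptions made for this example.

```python
import re
import string
from collections import Counter

# Tiny illustrative stop word list; standard lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "for", "he", "me", "at", "to", "is", "on"}

# Rough pattern for phone-like digit runs (an assumption, not a validator).
PHONE_PATTERN = re.compile(r"\+?\d[\d\s-]{7,}\d")
PHONE_PLACEHOLDER = "phonenumber"


def preprocess(text):
    """Turn one piece of text into a list of cleaned tokens."""
    # Generalize: replace every phone-like number with one standard token.
    # This runs first because '+' and '-' count as punctuation.
    text = PHONE_PATTERN.sub(PHONE_PLACEHOLDER, text)
    # Lowercase so that 'The' and 'the' become the same token.
    text = text.lower()
    # Remove punctuation (string.punctuation covers ASCII characters only).
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Tokenize by splitting on whitespace.
    tokens = text.split()
    # Remove stop words.
    return [token for token in tokens if token not in STOP_WORDS]


def drop_rare_words(tokenized_docs, min_count=2):
    """Remove tokens that occur fewer than min_count times in the collection."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in tokenized_docs]


cleaned = preprocess("Call me at +31 6 1234 5678, and ask for the manager!")
print(cleaned)  # ['call', 'phonenumber', 'ask', 'manager']
```

Whether to call `drop_rare_words` at all, and with which threshold, depends on the case you’re working on, as discussed above.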
Retrieving insights from cleaned text
The techniques above are essential steps in working with text data, and executing them should provide you with a nice, clean dataset based on the original text. The next step in the process is to retrieve insights from the cleaned text.
The two techniques that we’ll be looking at are often what any data scientist will start with when working with cleaned text data.
- Bag-of-words model (BOW). With this model, you can represent your text as a collection of words, without looking at grammar, but still keeping the frequency of words as a factor. It’s one of the most basic methods to represent a text, but still very effective. It’s often used in document classification.
- Term Frequency – Inverse Document Frequency (TF-IDF). TF-IDF is a step up from the bag-of-words model. While the BOW only takes the frequency of a word into account, TF-IDF also takes the relative importance of a word into account. The TF-IDF value of a word increases with how often the word occurs in a document, and decreases with the number of documents in the collection that contain that word. As a result, a word that appears in almost every document gets a low value, while a word that is frequent in only a few documents gets a high one. It’s a simple, yet brilliant technique, and it’s often used in document searching as a method to weigh the terms.
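Both representations can be sketched in plain Python. The version below uses the basic textbook definition of TF-IDF: term frequency is the count divided by the document length, and inverse document frequency is the natural logarithm of the number of documents divided by the number of documents containing the term. Libraries often use smoothed variants, so their numbers may differ. The function names and the three example documents are made up for this illustration.

```python
import math
from collections import Counter


def bag_of_words(tokens):
    """Represent one document as word counts, ignoring order and grammar."""
    return Counter(tokens)


def tf_idf(tokenized_docs):
    """Return one {term: score} dict per document (unsmoothed TF-IDF)."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))
    scores = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        scores.append({
            # TF (relative frequency) times IDF (rarity across documents).
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


docs = [
    ["christmas", "tree", "green", "tree"],
    ["christmas", "tree", "tall"],
    ["christmas", "lights"],
]

bow = bag_of_words(docs[0])
scores = tf_idf(docs)
print(bow["tree"])             # 2
print(scores[0]["christmas"])  # 0.0, because the word occurs in every document
```

In the first document, “tree” is the most frequent word in the bag-of-words view, but “green” gets the higher TF-IDF score because it occurs in only one of the three documents.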
While basic, BOW and TF-IDF are both still used often and can help you to quickly gain some insights into your texts. These two techniques can help you take your first steps in the text mining adventure, and eventually prepare you for the really exciting stuff, such as using their output to train machine learning models.
We have just looked at the basics of text mining, but there is a whole field of Natural Language Processing (NLP) that includes more advanced and interesting techniques. We’ll discuss these exciting techniques in a later blog.