Text mining analysis including full code in R

Recently I was working on a text mining project, but I ran into a few problems which took me some time to sort out. The project itself wasn’t too complicated, but finding the right code and syntax cost me way too much time. Therefore, I thought it was worth sharing the code with you and hope it saves you some hours!

This post focuses on basic text mining analysis. I’ll show you how to get the most frequent terms from a document and how to plot the frequency of terms. I’ll also show you how to get n-grams from the document (bi-grams in particular), how to find associated terms and how to plot a co-occurrence and a correlation matrix of terms.

As a case, I’ll use the famous movie review dataset from Kaggle. The data is split into a train and a test file; for this tutorial I’ll only use the train file. The train file contains more than 150,000 lines, each with a sentiment notation. I’m not going to use the sentiment information in this post, just the text.

 

Preprocessing data

Just like any other project, I’ll start by loading the libraries I need. The main package we need for our text mining analysis is tm. The RWeka package is also used for our text analysis; it allows us to create n-grams. After loading the libraries, we’ll read the .tsv file from Kaggle.
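Something like the snippet below does the trick. The file name train.tsv is an assumption (adjust it to wherever you stored the Kaggle file), and I also load dplyr and ggplot2 here because I use them for the data wrangling and the plots later in this post.

```r
# Packages for the text mining analysis
library(tm)       # corpus, DocumentTermMatrix, findAssocs, ...
library(RWeka)    # NGramTokenizer() for the n-grams later on
library(dplyr)    # data wrangling
library(ggplot2)  # frequency plots

# Read the Kaggle train file (tab separated)
train <- read.delim("train.tsv", sep = "\t", quote = "",
                    stringsAsFactors = FALSE)
```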

The train data contains four columns, starting with a PhraseId. The PhraseId is a unique identifier for each row. A SentenceId is provided so that you can track which phrases belong to a single sentence. The third column is the Phrase with the text, and finally we have the Sentiment column with a number ranging from 0 to 4 referring to the sentiment. Since I’m not going to use the sentiment, I won’t explain it in more detail. In case you want to know more, please check the Kaggle website.

Now that we have loaded the data, it is time to change the format a bit. I’m only interested in the text and, actually, I only want one sentence per phrase, not all the parsed sub-phrases. Therefore, let’s get the longest Phrase per SentenceId to get the right texts.
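A minimal sketch of that step with dplyr; the exact code may look different, but the result is a dataframe with one row per SentenceId:

```r
# Keep the longest Phrase per SentenceId and rename the columns to
# doc_id and text, the format DataframeSource expects later on
text <- train %>%
  group_by(SentenceId) %>%
  slice_max(nchar(Phrase), n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  transmute(doc_id = SentenceId, text = Phrase) %>%
  as.data.frame()
```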

As you can see, the text dataframe now only contains two columns:

  1. doc_id, which originally used to be the SentenceId
  2. text, derived from Phrase

The dataframe contains 8,529 rows with text. Additionally, I remove words with a ‘ in them. Usually this is a bit tricky, because it would affect words such as “actor’s” and “doesn’t”. But in this case, everything that follows the ‘ is written as a new word. So “actor’s” becomes “actor ‘s” and “doesn’t” becomes “doesn ‘t”. We can easily remove “ ‘s” and “ ‘t”; this won’t change the meaning of the words.
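In code, that clean-up can be as simple as the line below (assuming a plain ' character in the raw text; check how the apostrophes are encoded in your file):

```r
# "actor 's" and "doesn 't" lose their " 's" / " 't" part;
# the meaning of the words stays the same
text$text <- gsub(" 's| 't", "", text$text)
```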

 

Create corpus for text mining

Now that the data is in the right format, we can continue with our text mining analyses. In order to analyse the data, we need to apply three transformations:

  1. Make a DataframeSource from the text
  2. Create a corpus from DataframeSource
  3. Transform corpus into DocumentTermMatrix or TermDocumentMatrix

So let’s start with the first step. The tm library contains a function to create a DataframeSource from a dataframe. A DataframeSource is a type of object which contains a little more information than a normal dataframe, such as the number of documents. In order to create a DataframeSource, the original dataframe should contain two columns. The first column should be named doc_id and contain a unique number for each row. The column with the sentences should be named text. Please be aware that the order of these two columns is also important!
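Creating the DataframeSource itself is then a one-liner (ds is just the name I give the object):

```r
# Step 1: a DataframeSource from the dataframe with doc_id and text
ds <- DataframeSource(text)
```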

The next step is to create a corpus from the DataframeSource. A corpus also contains all documents, but this time it is possible to apply transformations to the text. The tm_map() function allows you to:

  • convert all characters to lower case
  • remove specific words (in the example below I only remove English stopwords)
  • remove punctuation
  • eliminate white spaces
  • remove numbers
  • stem words

Keep in mind that the order of these transformations is very important. The English stopwords only contain lower case characters, so it is important to convert the text to lower case first and remove the English stopwords afterwards. Earlier in this post I explained why we removed the ‘ characters: in this dataset words are written with an extra space, such as “doesn ‘t” instead of “doesn’t”, and only the latter is part of the English stopwords object.
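Putting this together, the corpus creation and cleaning can look like the sketch below. Note that stemDocument() needs the SnowballC package installed.

```r
# Step 2: create a corpus and clean it; order matters,
# so convert to lower case before removing the stopwords
corpus <- Corpus(ds)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stemDocument)

# Look at the first cleaned document
as.character(corpus[[1]])
```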

As you can see, filtering and cleaning the text gives us a different sentence. The words “A”, “of”, “the”, “that”, “what”, “is”, “for”, “some”, “which”, “but” and “to” are removed, because these are part of the English stopwords object. Plural words are also reduced to their singular form by the stemmer. Punctuation is removed as well, since we don’t need it for our analysis.

The corpus is the structure we need for all our analyses. The next step is to create a DocumentTermMatrix object, which is very similar to a TermDocumentMatrix: one object is simply the transpose of the other. The following example shows you what both objects look like.
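Creating both objects and inspecting a small part of them:

```r
# Step 3: create both matrix types from the corpus
dtm <- DocumentTermMatrix(corpus)
tdm <- TermDocumentMatrix(corpus)

# Show the first ten documents and the first ten terms of each object
inspect(dtm[1:10, 1:10])
inspect(tdm[1:10, 1:10])
```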

The output starts with some additional information about the corpus, for example the total number of documents, the total number of unique terms and the sparsity of the matrix. As you can see, the DocumentTermMatrix contains document IDs as rows and terms as columns. For the TermDocumentMatrix this is the other way around. Each cell contains the frequency of the term in that document. It only shows ten terms and ten documents, but of course there are many more rows and columns.

Now that we have the data in the right format, it is time to analyse the text.

 

Text mining analysis

For the text analysis and plots, I’ll use the DocumentTermMatrix object. From this object, you can get information such as the number of documents and the number of terms with the following code:
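For example, with the helper functions from tm:

```r
# Number of documents and number of unique terms in the matrix
nDocs(dtm)
nTerms(dtm)
```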

As you can see, we have 8,529 documents with a total of 11,276 unique words, excluding the English stopwords. You can get the unique terms in alphabetical order by using dtm$dimnames$Terms[i:j].

One of the interesting things is to know which terms appear together often.
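A sketch with findAssocs(), using the stemmed term “film” and a minimum correlation of 0.05:

```r
# Terms that correlate with the term "film"
assoc <- findAssocs(dtm, "film", 0.05)

# The first five associated terms and their correlation with "film"
head(assoc$film, 5)
```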

The function findAssocs() returns the terms that tend to appear in the same documents as the word “film”. The number 0.05 is the minimum correlation. 0.05 is very low, but in this text there are not that many words that correlate strongly; this is something we will look into later in this post. The result of this function is a list with all the terms that reach the minimum correlation, sorted from high to low. I’ve only selected the first five terms, and the output also shows the correlation between the term “film” and its associated terms.

For some analyses you might want to use only the terms which occur frequently. There are several ways to filter the data to keep only the terms you need for your analysis. One of the solutions is to keep terms based on sparsity with the function removeSparseTerms(). You tell the function to remove the terms which have at least a certain sparsity.

In this example, I remove all terms with a sparsity above 95%, which means only the terms that appear in more than 5% of the documents are kept. As you can see below, the resulting matrix has a sparsity of 91% and only the stemmed terms “film”, “like”, “make”, “movi” and “one” remain.
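With a sparsity threshold of 0.95 this looks as follows:

```r
# Remove all terms with a sparsity above 95%
dtm_dense <- removeSparseTerms(dtm, sparse = 0.95)
inspect(dtm_dense)
```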

Another way to get the most frequent terms is by using findFreqTerms() or findMostFreqTerms(). The first one gives you the terms that occur at least X times in the documents. The second one gives you the n terms that appear most often, including how often they appear.
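A sketch of both functions; the single INDEX group in findMostFreqTerms() is my way of summing the frequencies over the whole corpus instead of per document:

```r
# Terms that appear at least 300 times in all documents together
findFreqTerms(dtm, lowfreq = 300)

# The ten most frequent terms over the whole corpus, with their counts
top10 <- findMostFreqTerms(dtm, n = 10, INDEX = rep(1, nDocs(dtm)))[[1]]
top10

# Only the term names, without the frequencies
names(top10)
```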

This code shows the output of both functions. As you can see, there are eleven terms that appear 300 times or more in the text. The second function gives the ten most frequent terms, including how often they appear in the documents. The last line of code only returns the terms without the frequencies; this is something we will use later on.

As the code below shows, you can use either a DocumentTermMatrix or a TermDocumentMatrix; the only difference is that one is the transpose of the other. From now on I’ll use the DocumentTermMatrix object in the examples. First we need to create a matrix from the DocumentTermMatrix, but be aware that the size of your corpus shouldn’t be too big. In case it contains too many documents or too many terms, you might run into an error. The size of the corpus we use is fine, so there is no problem creating a matrix. From the matrix we take all the terms and calculate how often they appear.
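For example (term_freq is just the name I use for the sorted frequencies):

```r
# Convert the DocumentTermMatrix to a regular matrix; for a
# TermDocumentMatrix you would transpose it first with t()
m <- as.matrix(dtm)

# Total frequency of every term, sorted from high to low
term_freq <- sort(colSums(m), decreasing = TRUE)
head(term_freq, 10)
```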

As you can see, there are many ways to select the right words you need for your analysis.

 

Create frequency plot from text

Now that we know which terms appear often in the text, I want to know how often they appear together with another word. Of course, we can do this by simply creating a table, or we can create a heatmap as I explained in this post. It isn’t too difficult: with just a few lines you can already get this information. But let’s first create a bar chart with the frequency per word.
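A sketch of that bar chart with ggplot2, using the term frequencies calculated above:

```r
# The ten most frequent terms and their frequencies in a dataframe
freq_table <- data.frame(term = names(term_freq)[1:10],
                         freq = term_freq[1:10])

# Bar chart of the frequency per term
ggplot(freq_table, aes(x = reorder(term, -freq), y = freq)) +
  geom_col() +
  labs(x = "Term", y = "Frequency")
```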

Frequency plot for text mining in R

I’ve used the freq_table, which includes the ten most frequent terms and their frequencies, and simply plotted it with ggplot(). The stemmed words “film” and “movi” are found more than 1,000 times in the text. The words “lrb” and “rrb” both occur 352 times.

In order to create a heatmap of which terms appear together, I want to filter the  DocumentTermMatrix such that it only contains the ten most frequent words.
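One way to do that is to select the matching columns:

```r
# Keep only the columns of the ten most frequent terms
dtm_top10 <- dtm[, Terms(dtm) %in% names(term_freq)[1:10]]
dtm_top10
```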

As you can see, the DocumentTermMatrix still includes all 8,529 documents, but the matrix now only has ten terms. The next step is to create a co-occurrence matrix that counts how often two terms appear in the same text. This can be done with the following code:
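One way to do this is to make the matrix binary and multiply it with its own transpose:

```r
# Binary matrix: does a term occur in a document, yes or no
m_top10 <- as.matrix(dtm_top10)
m_top10[m_top10 > 0] <- 1

# Number of documents in which each pair of terms occurs together
co_occur <- t(m_top10) %*% m_top10
```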

This is almost what we need for the heatmap, but I’m not interested in the diagonal, so I’ll remove these values together with the lower part of the matrix. All that remains is to plot this matrix.
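A sketch of those last steps; I use reshape2 here to get the matrix in long format for ggplot2, but any wide-to-long reshaping works:

```r
# Drop the diagonal and the lower triangle of the matrix
co_occur[lower.tri(co_occur, diag = TRUE)] <- NA

# Long format: one row per term pair with its co-occurrence count
library(reshape2)
co_long <- melt(co_occur, varnames = c("term1", "term2"), na.rm = TRUE)

# Heatmap of the co-occurrences
ggplot(co_long, aes(x = term1, y = term2, fill = value)) +
  geom_tile() +
  labs(x = NULL, y = NULL, fill = "Co-occurrence")
```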

Term co-occurrence for text mining in R

The plot is much easier to interpret than the matrix. The most interesting result is that the term “lrb” always appears in a text together with the term “rrb”. Some of the other terms also co-occur, but not as often.

 

Correlation of terms

Just like the co-occurrence plot, we can also plot the correlation of terms. In this post I explain four different ways to create a correlation plot.
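Below is one of those options as a sketch, using the corrplot package on the correlations between the ten most frequent terms:

```r
# Correlation between the ten most frequent terms,
# based on their frequency per document
cor_terms <- cor(as.matrix(dtm_top10))

# Plot the upper triangle of the correlation matrix
library(corrplot)
corrplot(cor_terms, method = "square", type = "upper", diag = FALSE)
```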

Correlation plot for text mining

Just like the co-occurrence plot, the correlation plot shows us that “lrb” and “rrb” always appear together in the documents.

 

Analyse n-grams from text

I want to end this post by creating n-grams. N-grams are different from what we’ve looked at before: in the previous examples we checked which terms appear in the same text, whereas with n-grams you analyse which terms are used together in sequence. I’ll show you how to create bi-grams, but with a minor change you can also analyse tri-grams or any other length you want.

Creating n-grams in R took me quite some time. I found a lot of code on the internet, but unfortunately some of the packages have been updated since, which causes some trouble…. But! If you use the code below, it will all work out 🙂 The key is to use a VCorpus object instead of Corpus. Please don’t ask me why… It works, so I’m happy 🙂

So let’s start by creating a VCorpus object. I apply the same transformations to the text as I did earlier, but this time I do it line by line. In order to create the bi-grams, we’ll use the RWeka library we loaded at the start of this post. The function NGramTokenizer() allows us to find the bi-grams in the text. You can control the minimum and maximum length of the n-grams; for bi-grams I set both to two, but if you also want tri-grams you set min to two and max to three. We use this function to create a new DocumentTermMatrix.
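Putting that together (the tokenizer function and the object names are mine):

```r
# Create a VCorpus (needed for the RWeka tokenizer) and apply the
# same cleaning steps as before
vcorpus <- VCorpus(DataframeSource(text))
vcorpus <- tm_map(vcorpus, content_transformer(tolower))
vcorpus <- tm_map(vcorpus, removeWords, stopwords("english"))
vcorpus <- tm_map(vcorpus, removePunctuation)
vcorpus <- tm_map(vcorpus, stripWhitespace)
vcorpus <- tm_map(vcorpus, removeNumbers)
vcorpus <- tm_map(vcorpus, stemDocument)

# Tokenizer that returns bi-grams; set max = 3 to also get tri-grams
bigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

# DocumentTermMatrix in which every term is a bi-gram
dtm_bigram <- DocumentTermMatrix(vcorpus,
                                 control = list(tokenize = bigram_tokenizer))
dtm_bigram
```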

This example has 62,607 unique bi-grams. The matrix looks similar to the one with only single terms, but this time the terms are bi-grams. For example, the bi-gram “lrb rrb” appears in document number 2124.

You can use the same functions on the n-grams, for example to find associated terms/n-grams or to filter on sparsity.

The bi-grams have higher correlations than we obtained for single terms. To make the calculations faster, I filter out most of the data using the removeSparseTerms() function. The new DocumentTermMatrix now only contains 105 bi-grams.
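For example (the exact sparsity threshold that leaves 105 bi-grams is an assumption):

```r
# Keep only the most common bi-grams to speed up the calculations
dtm_bigram_dense <- removeSparseTerms(dtm_bigram, sparse = 0.995)
dtm_bigram_dense
```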

Finally I plot the frequency of the bi-grams, just like we did before.
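A sketch of that plot, analogous to the single-term bar chart above:

```r
# Frequency of the remaining bi-grams, sorted from high to low
bigram_freq <- sort(colSums(as.matrix(dtm_bigram_dense)), decreasing = TRUE)

bigram_table <- data.frame(bigram = names(bigram_freq)[1:10],
                           freq = bigram_freq[1:10])

# Bar chart of the ten most frequent bi-grams
ggplot(bigram_table, aes(x = reorder(bigram, -freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(x = "Bi-gram", y = "Frequency")
```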

Frequency plot bi-grams for text mining

Hope this post helps you to speed up your text mining analysis in R. If you have any questions, feel free to ask them below in the comments and I’ll try to answer them.

 

 
