
Text mining analysis including full code in R

Recently I was working on a text mining project, but I ran into a few problems that took me some time to sort out. The project itself wasn’t too complicated, but finding the right code and syntax cost me way too much time. Therefore, I thought it’s worth sharing the code with you and I hope it saves you some hours!

This post will only focus on basic text mining analysis. I’ll show you how to get the most frequent terms from a document and how to plot the frequency of terms. I’ll also show you how to get n-grams from the document (bi-grams in particular), how to find associated terms and how to plot a co-occurrence and correlation matrix of terms.

As a case, I’ll use the famous movie review dataset from Kaggle. The data is split into a train and a test file; for this tutorial I’ll only use the train file. The train file contains more than 150,000 lines, each with a sentiment annotation. I’m not going to use the sentiment information in this post, just the text.

 

Preprocessing data

Just like any other project, I’ll start by loading the libraries I need. The package we need for our text mining analysis is tm. The package RWeka is also used; it allows us to create n-grams. After loading the libraries, we’ll read the .tsv file from Kaggle.

library(dplyr)
library(tm)
library(ggplot2)
library(RWeka)
library(gplots)
library(corrplot)
library(RColorBrewer)

train <- read.table("C:/.../train.tsv", fill = TRUE, quote = "", header = TRUE, sep = "\t")
head(train, n = 5)
#	PhraseId	SentenceId	Phrase					  					Sentiment
#1	1			1			A series of escapades demonstrating ...	  	1
#2	2			1			A series of escapades demonstrating ...	  	2
#3	3			1			A series				  					2
#4	4			1			A					 						2
#5	5			1			series					 					2

The train data contains four columns, starting with PhraseId. The PhraseId is a unique identifier for each row. A SentenceId is provided so that you can track which phrases belong to a single sentence. The third column is the Phrase with the text, and finally we have the Sentiment column with a number ranging from 0 to 4 referring to the sentiment. Since I’m not going to use the sentiment, I won’t explain this in more detail. In case you want to know more, please check the Kaggle website.

Now that we have loaded the data, it is time to change the format a bit. I’m only interested in the text and, actually, I only want one full sentence per SentenceId, not all the parsed sub-phrases. Therefore, let’s take the longest Phrase per SentenceId to get the right texts.

# Keep the longest Phrase per SentenceId, i.e. the full sentence
text <- train %>%
  select(Phrase, SentenceId) %>%
  group_by(SentenceId) %>%
  slice(which.max(nchar(as.character(Phrase))))
# Rename the columns to doc_id and text (the names DataframeSource expects later)
text <- data.frame(doc_id = text$SentenceId,
                   text = text$Phrase)
# Drop the apostrophe fragments such as "'s" and "'t"
text$text <- gsub("\\w*\\'\\w*", "", text$text)

As you can see, the text dataframe now only contains two columns:

  1. doc_id, which was originally the SentenceId
  2. text, derived from Phrase

The dataframe contains 8,529 rows with text. Additionally, I remove word fragments with an apostrophe in them. Usually this is a bit tricky, because it would also strip words such as “actor’s” and “doesn’t”. But in this dataset everything that follows an apostrophe is written as a separate token, so “actor’s” becomes “actor ‘s” and “doesn’t” becomes “doesn ‘t”. We can safely remove ” ‘s” and ” ‘t”; this won’t change the meaning of the words.
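
To illustrate what the gsub() call above does, here is a minimal sketch on a made-up phrase (not a row from the dataset): only the apostrophe fragments are dropped, the rest of the text is untouched.

# The regular expression only matches word fragments containing an apostrophe
gsub("\\w*\\'\\w*", "", "the actor 's performance doesn 't convince")
#[1] "the actor  performance doesn  convince"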

 

Create corpus for text mining

Now that the data is in the right format, we can continue with our text mining analyses. In order to analyse the data, we need to apply three transformations:

  1. Make a DataframeSource from the text
  2. Create a corpus from DataframeSource
  3. Transform corpus into DocumentTermMatrix or TermDocumentMatrix

So let’s start with the first step. The tm library contains a function to create a DataframeSource from a dataframe. A DataframeSource is a type of object which contains a little more information than a normal dataframe, such as the number of documents. In order to create a DataframeSource, the original dataframe should contain two columns. It should start with a column named doc_id with a unique value for each row, and the column with the sentences should be named text. Please be aware that the order of these two columns is also important!

ds <- DataframeSource(text)
View(head(ds, n = 5))
#	doc_id	text
#1	1		A series of escapades demonstrating ...
#2	2		This quiet , introspective and ...
#3	3		Even fans of Ismail Merchant work ...
#4	4		A positively thrilling combination ...
#5	5		Aggressive self-glorification and ...

The next step is to create a corpus from the DataframeSource. A corpus also contains all documents, but this time it is possible to apply logic to the text. The tm_map() function allows you to:

  • convert all characters to lower case
  • skip specific words (in the example below I only remove English stopwords)
  • remove punctuation
  • eliminate white spaces
  • remove numbers
  • stem words

Keep in mind that the order is very important. The English stopwords only contain lower case characters, so it is important to convert the text to lower case first and remove the English stopwords afterwards. Earlier in this post I explained why we removed all ‘ characters: in this dataset words are written with an extra space, such as “doesn ‘t” instead of “doesn’t”, and only the latter is part of the English stopwords object.

corpus <- Corpus(ds)
inspect(corpus[1])
#<<SimpleCorpus>>
#Metadata:  corpus specific: 1, document level (indexed): 0
#Content:  documents: 1
#A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story . 

skipWords <- function(x) removeWords(x, stopwords("english"))
corpus <- tm_map(corpus, FUN = tm_reduce, tmFuns = list(tolower))
corpus <- tm_map(corpus, FUN = tm_reduce, tmFuns = list(skipWords, removePunctuation, stripWhitespace, removeNumbers, stemDocument))
inspect(corpus[1])
#<<SimpleCorpus>>
#Metadata:  corpus specific: 1, document level (indexed): 0
#Content:  documents: 1
#[1] seri  escapad demonstr  adag    good   goos  also good   gander     occasion amus  none   amount  much   stori 

As you can see, filtering and cleaning the text gives us a different sentence. The words “A”, “of”, “the”, “that”, “what”, “is”, “for”, “some”, “which”, “but” and “to” are removed, because these are part of the English stopwords object. The remaining words are stemmed, so for example “escapades” becomes “escapad” and “amuses” becomes “amus”. Punctuation is also removed, since we don’t need it for our analysis.

The corpus is the structure we need for all our analyses. The next step is to create a DocumentTermMatrix object, which is essentially the same as a TermDocumentMatrix: one is simply the transpose of the other. The following example shows you what both objects look like.

dtm <- DocumentTermMatrix(corpus)
tdm <- TermDocumentMatrix(corpus)
inspect(dtm)
#<<DocumentTermMatrix (documents: 8529, terms: 11276)>>
#Non-/sparse entries: 79211/96093793
#Sparsity           : 100%
#Maximal term length: 22
#Weighting          : term frequency (tf)
#Sample             :
#      Terms
#Docs   charact film like lrb make movi one rrb stori time
#  1020       0    0    0   1    0    1   0   1     0    0
#  1199       0    0    0   0    0    1   0   0     0    0
#  2388       1    1    0   1    1    1   0   1     0    0
#  2532       0    0    1   0    0    0   0   0     0    0
#  2733       0    0    0   0    0    0   1   0     0    0
#  3160       0    1    0   1    1    0   0   1     0    0
#  3187       1    0    0   0    0    0   1   0     0    0
#  5550       0    0    0   2    0    0   0   2     0    0
#  625        0    0    0   1    0    0   0   1     0    0
#  8154       0    0    0   0    0    1   0   0     0    0

inspect(tdm)
#<<TermDocumentMatrix (terms: 11276, documents: 8529)>>
#Non-/sparse entries: 79211/96093793
#Sparsity           : 100%
#Maximal term length: 22
#Weighting          : term frequency (tf)
#Sample             :
#         Docs
#Terms     1020 1199 2388 2532 2733 3160 3187 5550 625 8154
#  charact    0    0    1    0    0    0    1    0   0    0
#  film       0    0    1    0    0    1    0    0   0    0
#  like       0    0    0    1    0    0    0    0   0    0
#  lrb        1    0    1    0    0    1    0    2   1    0
#  make       0    0    1    0    0    1    0    0   0    0
#  movi       1    1    1    0    0    0    0    0   0    1
#  one        0    0    0    0    1    0    1    0   0    0
#  rrb        1    0    1    0    0    1    0    2   1    0
#  stori      0    0    0    0    0    0    0    0   0    0
#  time       0    0    0    0    0    0    0    0   0    0

The output starts with some additional information about the matrix, for example the total number of documents, the total number of unique terms and the sparsity. As you can see, the DocumentTermMatrix has the document IDs as rows and the terms as columns; for the TermDocumentMatrix it is the other way around. Each cell contains the frequency of the term in that document. The sample only shows ten terms and ten documents, but of course there are many more rows and columns.
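
If you want to convince yourself that the two objects really are each other’s transpose, a quick check on a small slice looks like this (just a sketch; it should return TRUE):

# The transposed DTM slice should equal the corresponding TDM slice
all(t(as.matrix(dtm[1:5, 1:5])) == as.matrix(tdm[1:5, 1:5]))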

Now that we have the data in the right format, it is time to analyse the text.

 

Text mining analysis

For the text analysis and plots, I’ll use the DocumentTermMatrix object. From this object, you can get information such as the number of documents and the number of terms with the following code:

# Check information from DocumentTermMatrix
dtm$nrow
#[1] 8529

dtm$ncol
#[1] 11276

dtm$dimnames$Terms[1:5]
#[1] "adag"     "also"     "amount"   "amus"     "demonstr"

As you can see, we have 8,529 documents with a total of 11,276 unique terms, after removing the English stopwords. You can inspect the unique terms with dtm$dimnames$Terms[i:j].
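
The tm package also provides accessor functions for the same information, which read a bit nicer than digging into the object slots directly:

# Same information via tm accessor functions
nDocs(dtm)
nTerms(dtm)
Terms(dtm)[1:5]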

One of the interesting things to know is which terms often appear together.

# Find associate words
findAssocs(dtm, c("film"), 0.05)[[1]][1:5]
#    san   chimp   plagu  lathan shepard 
#   0.08    0.06    0.06    0.06    0.06

The function findAssocs() returns the terms whose occurrence correlates with the occurrence of the term “film” across the documents. The number 0.05 is the minimum correlation to report. 0.05 is very low, but in this text there are not many words that correlate strongly; this is something we will look into later in this post. The result of this function is a list of all terms that reach the minimum correlation, sorted from high to low. I’ve only selected the first five terms; the output also shows the correlation between the term “film” and each associated term.
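
As a side note, findAssocs() also accepts several terms at once; the result is then a list with one element per input term. A minimal sketch:

# Associated terms for more than one input term at once
findAssocs(dtm, c("film", "movi"), 0.05)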

For some analyses you might want to use only the terms which occur frequently. There are several ways to filter the data to keep only the terms you need for your analysis. One of the solutions is to filter terms based on sparsity with the function removeSparseTerms(). You tell the function to remove all terms whose sparsity is higher than a given threshold.

# Remove sparse terms
inspect(removeSparseTerms(dtm, 0.95))
#<<DocumentTermMatrix (documents: 8529, terms: 5)>>
#Non-/sparse entries: 3842/38803
#Sparsity           : 91%
#Maximal term length: 4
#Weighting          : term frequency (tf)
#Sample             :
#      Terms
#Docs   film like make movi one
#  1535    1    0    1    1   1
#  2029    1    1    1    1   1
#  2121    1    2    0    0   1
#  259     2    1    0    0   1
#  320     0    0    0    4   0
#  374     0    1    1    1   1
#  3904    2    2    0    1   0
#  6311    1    1    1    2   0
#  809     0    2    0    1   1
#  844     1    0    1    2   0

In this example, I remove all terms with a sparsity above 95%, which means that only terms appearing in more than 5% of the documents are kept. As you can see, the resulting matrix has a sparsity of 91%, meaning 91% of its cells are zero: most documents contain few (or none) of the stemmed terms “film”, “like”, “make”, “movi” and “one”.
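
If the sparsity threshold feels abstract, you can get roughly the same selection by looking at document frequencies directly. This is just a sketch using the triplet slots of the DocumentTermMatrix (dtm$j holds the column index of every non-zero entry); removeSparseTerms() does all of this for you:

# Number of documents in which each term occurs at least once
doc_freq <- tabulate(dtm$j, nbins = dtm$ncol)
names(doc_freq) <- Terms(dtm)
# A sparsity threshold of 0.95 keeps the terms occurring in more than 5% of the documents
sort(doc_freq[doc_freq > (1 - 0.95) * dtm$nrow], decreasing = TRUE)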

Another way to get the most frequent terms is by using findFreqTerms() or findMostFreqTerms(). The first one gives you the terms that occur at least X times in the documents. The second one gives you the n terms that appear most often, including how often they appear.

# Get words based on frequency
findFreqTerms(dtm, 300)
#[1] "stori"   "one"     "time"    "charact" "like"    "doe"     "movi"    "make"    "film"    "lrb"     "rrb"

findMostFreqTerms(dtm, n = 10, INDEX = rep(1, dtm$nrow))[[1]]
#   film    movi     one    like    make   stori     lrb     rrb charact    time 
#   1291    1132     572     554     437     383     352     352     343     317 

freq_words <- colnames(t(findMostFreqTerms(dtm, n = 10, INDEX = rep(1, dtm$nrow))[[1]]))
freq_words
#[1] "film"    "movi"    "one"     "like"    "make"    "stori"   "lrb"     "rrb"     "charact" "time"

This code shows the output for both functions. As you can see, there are eleven terms that appear 300 times or more in the text. The second call gives the ten most frequent terms, including how often they appear in the documents. The last line of code only returns the terms without the frequencies; this is something we will use later on.
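
Since findMostFreqTerms() returns a named vector per document group, you can also grab the term names directly with names(). This is an equivalent, slightly shorter version of the freq_words line above:

# Same result as freq_words, without the transpose/colnames detour
names(findMostFreqTerms(dtm, n = 10, INDEX = rep(1, dtm$nrow))[[1]])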

As the code below shows, you can use either a DocumentTermMatrix or a TermDocumentMatrix; the only difference is that you need to transpose the first one. From now on I’ll use the DocumentTermMatrix object in the examples. First we need to create a matrix from it, but be aware that the size of your corpus shouldn’t be too big. In case it contains too many documents or too many terms, this step might run into memory problems. The size of the corpus we use is fine, so there is no problem creating a matrix. From the matrix we take all the terms and calculate how often they appear.

# Frequency table: both lines give the same terms-by-documents matrix
m <- as.matrix(tdm)
m <- t(as.matrix(dtm))
freq_table <- data.frame(term = rownames(m), 
                         freq = rowSums(m), 
                         row.names = NULL)
freq_table <- freq_table[order(-freq_table$freq),][1:10,]
freq_table
#       term freq
#192    film 1291
#129    movi 1132
#26      one  572
#70     like  554
#175    make  437
#14    stori  383
#342     lrb  352
#345     rrb  352
#48  charact  343
#29     time  317

As you can see, there are many ways to select the right words you need for your analysis.

 

 

Create frequency plot from text

Now that we know which terms appear often in the text, I want to know how often they appear together with another term. Of course, we can do this by simply creating a table, or we can create a heatmap as I explained in this post. It isn’t too difficult; with just a few lines you can already get this information. But let’s first create a bar chart with the frequency per term.

# Frequency plot
freq_plot <- ggplot(freq_table, aes(x = reorder(term, -freq), freq)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  labs(x = "Terms", y = "Frequency", title = "Frequent terms") +
  geom_text(aes(label = freq), vjust = -0.5, size = 3) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
freq_plot

Frequency plot for text mining in R

I’ve used the freq_table, which includes the ten most frequent terms and their frequencies, and simply plotted it with ggplot(). The stemmed words “film” and “movi” are found more than 1,000 times in the text. The words “lrb” and “rrb” both occur 352 times.

In order to create a heatmap of which terms appear together, I want to filter the DocumentTermMatrix such that it only contains the ten most frequent words.

dtm_freq_words <- dtm[, Terms(dtm) %in% freq_words]
dtm_freq_words
#<<DocumentTermMatrix (documents: 8529, terms: 10)>>
#Non-/sparse entries: 5541/79749
#Sparsity           : 94%
#Maximal term length: 7
#Weighting          : term frequency (tf)

As you can see, the DocumentTermMatrix still includes all 8,529 documents, but the matrix now only has ten terms. The next step is to create a co-occurrence matrix of how often two terms appear in the same text. This can be done with the following code:

m_freq_words <- as.matrix(dtm_freq_words)
heatmap_data <- t(m_freq_words) %*% m_freq_words
heatmap_data
#         Terms
#Terms     stori one time charact like movi make film lrb rrb
#  stori     395  42   17      25   20   35   20   48  15  14
#  one        42 630   32      17   46  134   47  122  21  21
#  time       17  32  341      10   22   59   17   43  17  17
#  charact    25  17   10     351   24   47   18   50  22  22
#  like       20  46   22      24  608  112   48   90  29  29
#  movi       35 134   59      47  112 1222   97   71  43  43
#  make       20  47   17      18   48   97  451   92  31  31
#  film       48 122   43      50   90   71   92 1375  51  51
#  lrb        15  21   17      22   29   43   31   51 380 379
#  rrb        14  21   17      22   29   43   31   51 379 380

This is already what we need for the heatmap, but I’m not interested in the diagonal, so I’ll remove these values as well as the lower triangle of the matrix. All that remains is to plot the matrix.

diag(heatmap_data) <- 0
heatmap_data[lower.tri(heatmap_data)] <- NA

# Create heatmap
heatmap.2(heatmap_data, 
          dendrogram = "none", Colv = FALSE, Rowv = FALSE,
          scale = "none", col = brewer.pal(5, "Blues"),
          key = TRUE, density.info = "none", key.title = NA, key.xlab = "Frequency",
          trace = "none",
          main = "Term co-occurrence",
          xlab = "Term",
          ylab = "Term")

Term co-occurrence for text mining in R

The plot is much easier to interpret than the matrix. The most interesting result is that the term “lrb” is always in a text together with the term “rrb”. Some other terms also co-occur together, but not as often.

 

Correlation of terms

Just like the co-occurrence plot, we can also plot the correlation of terms. In this post I explain four different ways to create a correlation plot.

# Create correlation plot
cor_data <- cor(m_freq_words)
corrplot(cor_data, method = "square", type = "upper", tl.col = "black",
         order = "hclust", col = brewer.pal(n = 5, name = "RdYlBu"))

Correlation plot for text mining

Just like the co-occurrence plot, the correlation plot shows us that “lrb” and “rrb” always appear together in the documents.

 

Analyse n-grams from text

I want to end this post by creating n-grams. n-grams are different from what we’ve looked at before. In the previous examples we checked which terms appear in the same text. With n-grams you analyse which terms are used together in sequence, or in other words: which words directly follow each other. I’ll show you how to create bi-grams, but with a minor change you can also analyse tri-grams or any length you want.

Creating n-grams in R took me quite some time. I found a lot of code on the internet, but unfortunately some of the packages have been updated since, which caused some trouble…. But! If you use the code below, it will all work out 🙂 The key is to use a VCorpus object instead of Corpus. Please don’t ask me why… It works, so I’m happy 🙂

vcorpus <- VCorpus(VectorSource(text$text))
vcorpus <- tm_map(vcorpus, content_transformer(tolower), lazy = TRUE)
vcorpus <- tm_map(vcorpus, removeWords, stopwords("en")) 
vcorpus <- tm_map(vcorpus, removePunctuation, lazy = TRUE) 
vcorpus <- tm_map(vcorpus, stripWhitespace, lazy=TRUE) 
vcorpus <- tm_map(vcorpus, removeNumbers, lazy=TRUE)
vcorpus <- tm_map(vcorpus, stemDocument, lazy=TRUE)

bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtmv <- DocumentTermMatrix(vcorpus, control = list(tokenize = bigram))
inspect(dtmv)
#<<DocumentTermMatrix (documents: 8529, terms: 62607)>>
#Non-/sparse entries: 71341/533903762
#Sparsity           : 100%
#Maximal term length: 32
#Weighting          : term frequency (tf)
#Sample             :
#      Terms
#Docs   feel like look like love stori lrb rrb new york play like romant comedi soap opera special effect subject matter
#  1020         0         0          0       0        0         0             0          0              0              0
#  1199         0         0          0       0        0         0             0          0              0              0
#  2124         0         0          0       1        0         0             0          0              0              0
#  2388         0         0          0       0        0         0             0          0              0              0
#  2532         0         0          0       0        0         0             0          0              0              0
#  2733         0         0          0       0        0         0             0          0              0              0
#  3187         0         0          0       0        0         0             0          0              0              0
#  403          0         0          0       0        0         0             0          0              0              0
#  5550         0         0          0       0        0         0             0          0              0              0
#  8154         0         0          0       0        0         0             0          0              0              0

So let’s start by creating a VCorpus object. I apply the same transformations to the text as I did earlier, but this time I do it line by line. In order to create the bi-grams, we’ll use the RWeka library we loaded at the start of this post. The function NGramTokenizer() allows us to find the bi-grams in the text, and you can control the minimum and maximum length of the n-grams. For bi-grams I set both to two, but you can also choose to find bi- and tri-grams, in which case you set the minimum to two and the maximum to three (see the sketch below). We use this tokenizer to create a new DocumentTermMatrix.
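
As a sketch of that minor change, the tokenizer for tri-grams (or for bi- and tri-grams combined) only differs in the Weka_control bounds; trigram, bi_tri and dtmv_tri are just illustrative names:

# Tri-grams only
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
# Bi-grams and tri-grams combined
bi_tri <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 3))
dtmv_tri <- DocumentTermMatrix(vcorpus, control = list(tokenize = trigram))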

This example has 62,607 unique bi-grams. The matrix looks similar to the one with single terms, but this time the terms are bi-grams. For example, the bi-gram “lrb rrb” appears in document number 2124.
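
If you want to check such a single cell yourself, you can subset the DocumentTermMatrix directly. This assumes the document ids simply follow the row order of the text dataframe, which is the default when using VectorSource:

# Count of the bi-gram "lrb rrb" in document 2124
inspect(dtmv[2124, "lrb rrb"])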

You can use the same functions on the bi-grams, such as finding associated terms and filtering on sparsity.

# Find associated terms and keep 105 bi-grams instead of 62,607
findAssocs(dtmv, c("feel like"), 0.15)
#$`feel like`
#  like pilot   like three pilot episod    leav feel    make feel 
#        0.18         0.18         0.18         0.17         0.16

dtmv_small <- removeSparseTerms(dtmv, 0.999)
dtmv_small
#<<DocumentTermMatrix (documents: 8529, terms: 105)>>
#Non-/sparse entries: 1461/894084
#Sparsity           : 100%
#Maximal term length: 14
#Weighting          : term frequency (tf)

The bi-grams have higher correlations than we obtained for the single terms. To make the calculations faster, I filter out most of the data using the removeSparseTerms() function. The new DocumentTermMatrix now only contains 105 bi-grams.

Finally I plot the frequency of the bi-grams, just like we did before.

mv <- t(as.matrix(dtmv_small))
freq_tablev <- data.frame(term = rownames(mv), 
                         freq = rowSums(mv), 
                         row.names = NULL)
freq_tablev <- freq_tablev[order(-freq_tablev$freq),][1:10,]

freqv_plot <- ggplot(freq_tablev, aes(x = reorder(term, -freq), freq)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  labs(x = "Bi-grams", y = "Frequency", title = "Frequent bi-grams") +
  geom_text(aes(label = freq), vjust = -0.5, size = 3) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
freqv_plot

Frequency plot bi-grams for text mining

Hope this post helps you to speed up your text mining analysis in R. If you have any questions, feel free to ask them below in the comments and I’ll try to answer them.

 

 
