Performing Text Analytics on a Text Column in a Data Frame in R


Keywords:r 


Question: 

I have imported a CSV file into a data frame in R, and one of the columns contains text.

I want to perform analysis on the text. How do I go about it?

I tried making a new dataframe containing only the text column.

library(dplyr) # provides %>% and select()

OnlyTXT <- Txtanalytics1 %>%
  select(problem_note_text)
View(OnlyTXT)

1 Answer: 

This could get you started.

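install.packages("gtools", dependencies = TRUE)
library(gtools) # if problems calling library, install.packages("gtools", dependencies = TRUE)
library(qdap) # qualitative data analysis package (it masks %>%); provides freq_terms()
library(tm) # framework for text mining; it loads the NLP package
library(stringr) # provides str_replace_all(), which is used below
library(Rgraphviz) # depict the terms within the tm package framework (installed from Bioconductor, not CRAN)
library(SnowballC); library(RWeka); library(rJava); library(RWekajars)  # wordStem is masked from SnowballC
library(Rstem) # stemming terms as a link from R to the Snowball C stemmer (not needed for the steps below)
# Only tm, qdap and stringr are actually used in the code that follows; install any that are missing in the same way as gtools.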

The following assumes your text is in a column named "text" of a data frame named "df" (in your case, the problem_note_text column of OnlyTXT).

df$text <- as.character(df$text) # to make sure it is text

# Prepare the text: lower-case it, then remove numbers, punctuation, unimportant words and extra white space. The `tm::` prefix is being cautious.
df$text <- tolower(df$text)
df$text <- tm::removeNumbers(df$text)
df$text <- stringr::str_replace_all(df$text, pattern = "[[:punct:]]", " ") # replace punctuation with a space

df$text <- tm::removeWords(x = df$text, tm::stopwords(kind = "SMART"))
df$text <- stringr::str_squish(df$text) # collapse runs of white space into a single space and trim the ends

corpus <- Corpus(VectorSource(df$text)) # turn into corpus

tdm <- TermDocumentMatrix(corpus) # create tdm from the corpus
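
Once you have a term-document matrix you can already pull word counts out of it. Here is a minimal sketch of that, using three made-up documents (the `docs` vector and the `term_totals` name are only for illustration; swap in your own cleaned column). It assumes the tm package is installed.

```r
library(tm)

# Hypothetical example documents standing in for df$text
docs <- c("printer error after update",
          "printer offline again",
          "update failed with error")

corpus <- Corpus(VectorSource(docs))   # turn the character vector into a corpus
tdm <- TermDocumentMatrix(corpus)      # rows are terms, columns are documents

# Total count of each term across all documents, most frequent first
term_totals <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(term_totals, 10)

# Terms that occur at least twice in the whole corpus
findFreqTerms(tdm, lowfreq = 2)
```

Note that `as.matrix()` builds a dense matrix, so for a large corpus it may be worth filtering the tdm first.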

freq_terms(text.var = df$text, top = 25) # find the 25 most frequent words
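
If qdap gives you trouble installing (it depends on rJava), a rough equivalent of the frequency count can be done in base R alone. This is only a sketch under my own assumptions: the example data frame and the tiny `stop_list` are invented for illustration and are not the SMART stop word list.

```r
# Hypothetical example data; replace with your own data frame
df <- data.frame(text = c("The printer is not printing, error 52!",
                          "Printer error after update; printer offline."),
                 stringsAsFactors = FALSE)

clean <- tolower(df$text)
clean <- gsub("[[:digit:]]+", " ", clean)    # drop numbers
clean <- gsub("[[:punct:]]+", " ", clean)    # drop punctuation
clean <- trimws(gsub("\\s+", " ", clean))    # collapse white space

# Split into words and remove a (very small, illustrative) stop word list
words <- unlist(strsplit(clean, " ", fixed = TRUE))
stop_list <- c("the", "is", "not", "after")
words <- words[nzchar(words) & !(words %in% stop_list)]

# Count the words and show the 25 most frequent
word_counts <- sort(table(words), decreasing = TRUE)
head(word_counts, 25)
```

This does no stemming, so "printer" and "printing" are counted as separate words.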

There is much more you can do with the tm package or the qdap package.