Using RStudio to pull phrases from a CSV


Keywords:r 


Question: 

Using R, I would like to take a single CSV and pull out the most common two- and three-word phrases. I've been searching Google and Stack Overflow and could not find a simple way to do this.

I know how to read a CSV into R, but I have not found out how to extract the data into the appropriate data type and perform operations on it to get what I am looking for.

Requirements:

  1. Remove all non-alphanumeric text from the CSV
  2. Replace words using a synonym list
  3. Remove words with no meaning (at, the, etc.)
  4. Get a count of the common phrases for both two word phrases and three word phrases
  5. Make all text lowercase

Also, what data types are best suited for this type of analysis? dataframe? tm? corpus? etc?

My_SRs <- read.csv("C:/example_folder/username/Documents/my_data.csv")

Thanks in advance!


1 Answer: 

The tm package will do what you are looking for.

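Read the CSV, then collect the cleanup options described in the manual in a control list:

library(tm)
my_srs <- read.csv("my_data.csv", stringsAsFactors = FALSE)
ctrl <- list(removePunctuation = TRUE, removeNumbers = TRUE,
    tolower = TRUE, stopwords = TRUE)

Create a corpus (this assumes the text is in the first column of the CSV):

corp <- VCorpus(VectorSource(my_srs[[1]]))

From there, you can pass the options to TermDocumentMatrix(corp, control = ctrl), or go a different route with PlainTextDocument and termFreq to get word frequencies.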
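The steps above give single-word frequencies. For the two- and three-word phrase counts the question asks about, here is a minimal base R sketch that does not use tm at all. The sample sentences, the synonym map, the short stopword list, and the helper names (clean_tokens, make_ngrams, count_phrases) are all made up for illustration, so treat it as one possible approach under those assumptions rather than a definitive implementation.

```r
# Hypothetical sample standing in for the text column of the CSV.
texts <- c("The printer is NOT working!",
           "Printer not working at the office.",
           "The laptop is not working.")

# Hypothetical synonym map: names are the words to replace, values the replacements.
synonyms <- c(laptop = "computer", pc = "computer")

# Tiny illustrative stopword list; a real list would be much longer.
stop_words <- c("the", "at", "is", "a", "an", "of")

# Lowercase, strip non-alphanumerics, apply synonyms, drop stopwords.
clean_tokens <- function(x, synonyms, stop_words) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9 ]+", " ", x)          # ASCII letters and digits only
  tokens <- strsplit(trimws(x), " +")[[1]]
  hit <- tokens %in% names(synonyms)
  tokens[hit] <- unname(synonyms[tokens[hit]])
  tokens[!tokens %in% stop_words]
}

# Build every run of n consecutive tokens as one space-separated phrase.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count phrases across all texts, most frequent first.
count_phrases <- function(texts, n, synonyms, stop_words) {
  grams <- unlist(lapply(texts, function(x)
    make_ngrams(clean_tokens(x, synonyms, stop_words), n)))
  sort(table(grams), decreasing = TRUE)
}

two_word <- count_phrases(texts, 2, synonyms, stop_words)
three_word <- count_phrases(texts, 3, synonyms, stop_words)
print(head(two_word))
print(head(three_word))
```

To run this on the real data, replace texts with the column of the data frame that holds the text, for example My_SRs[[1]] if it happens to be the first column. Phrases are built within each row only, so a phrase never spans two records. Because stopwords are dropped before the phrases are formed, "working at the office" contributes the phrase "working office".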