NLP to classify/label the content of a sentence (Ruby binding necessary)



I am analysing a few million emails. My aim is to classify them into groups. Groups could be e.g.:

  • Delivery problems (slow delivery, slow handling before dispatch, incorrect availability information, etc.)
  • Customer service problems (slow email response time, impolite response, etc.)
  • Return issues (slow handling of return request, lack of helpfulness from the customer service, etc.)
  • Pricing complaints (hidden fees discovered, etc.)

In order to perform this classification, I need an NLP library that can recognize combinations of word groups like:

  • "[they|the company|the firm|the website|the merchant]"
  • "[did not|didn't|no]"
  • "[response|respond|answer|reply]"
  • "[before the next day|fast enough|at all]"
  • etc.

A few of these example groups in combination should then match sentences like:

  • "They didn't respond"
  • "They didn't respond at all"
  • "There was no response at all"
  • "I received no response from the website"

The sentence should then be classified as Customer service problems.

Which NLP library would be able to handle such a task? From what I have read, these are the most relevant:

  • Stanford CoreNLP
  • OpenNLP

See also these suggested NLP tools.

2 Answers: 

Using the OpenNLP doccat API, you can create training data and then train a model from it. The advantage of this over something like a Naive Bayes classifier is that it returns a probability distribution over your set of categories.

So create a file with this format, one sample per line, with the category first and the text after it:

customerserviceproblems They did not respond
customerserviceproblems They didn't respond 
customerserviceproblems They didn't respond at all
customerserviceproblems They did not respond at all
customerserviceproblems I received no response from the website
customerserviceproblems I did not receive response from the website

And so on. Provide as many samples as possible, and make sure each line ends with a newline (\n).

Using this approach you can add anything you want that means "customer service problems", and you can add any other categories as well, so you don't have to be too deterministic about which data falls into which category.
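If your labeled sentences live somewhere else (a database, a spreadsheet), a small helper can produce the file. This is only a sketch: the class `TrainingFileWriter`, the method `toTrainingLines`, the output file name, and the "deliveryproblems" sample are all made up for illustration and are not part of OpenNLP. It assumes category names contain no whitespace, since the first token on each line is read as the category.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of OpenNLP): formats labeled sentences as
// "category<space>text" lines, one sample per line.
class TrainingFileWriter {

    // Builds one training line per (category, sentence) pair.
    static List<String> toTrainingLines(Map<String, List<String>> samplesByCategory) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : samplesByCategory.entrySet()) {
            for (String sentence : entry.getValue()) {
                // Collapse all whitespace, including embedded newlines,
                // so that each sample stays on a single line.
                String cleaned = sentence.trim().replaceAll("\\s+", " ");
                if (!cleaned.isEmpty()) {
                    lines.add(entry.getKey() + " " + cleaned);
                }
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Illustrative samples only
        Map<String, List<String>> samples = new LinkedHashMap<>();
        samples.put("customerserviceproblems",
                Arrays.asList("They did not respond", "I received no response from the website"));
        samples.put("deliveryproblems",
                Arrays.asList("The parcel arrived two weeks late"));

        // Files.write ends every line, including the last, with a line separator.
        Files.write(Paths.get("doccat-training.txt"),
                toTrainingLines(samples), StandardCharsets.UTF_8);
    }
}
```

Normalizing whitespace up front matters because a multi-line email body would otherwise be split into several samples, and only the first would carry the category.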

Here is what the Java looks like to build the model:

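import java.io.*;
import opennlp.tools.doccat.*;
import opennlp.tools.util.*;

// Uses the older OpenNLP training calls; newer releases may have different signatures.
DoccatModel model = null;
try (InputStream dataIn = new FileInputStream(yourFileOfSamplesLikeAbove)) {

  // Each line of the file is one training sample
  ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
  ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);
  model = DocumentCategorizerME.train("en", sampleStream);

  // Write the trained model to disk so it can be loaded later
  try (OutputStream modelOut = new BufferedOutputStream(new FileOutputStream(modelOutFile))) {
    model.serialize(modelOut);
  }
  System.out.println("Model complete!");
} catch (IOException e) {
  // Failed to read or parse training data, training failed
  e.printStackTrace();
}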

Once you have the model, you can then use it something like this:

import java.io.File;
import java.util.HashMap;
import java.util.Map;
import opennlp.tools.doccat.*;

DocumentCategorizerME documentCategorizerME;
DoccatModel doccatModel;

// Load the model once, e.g. in a constructor or init method
doccatModel = new DoccatModel(new File(pathToModelYouJustMade));
documentCategorizerME = new DocumentCategorizerME(doccatModel);

/**
 * Returns a map of each category to its score.
 *
 * @param text the text to categorize
 * @return a map from category name to score
 * @throws Exception if categorization fails
 */
private Map<String, Double> getScore(String text) throws Exception {
  Map<String, Double> scoreMap = new HashMap<>();
  double[] categorize = documentCategorizerME.categorize(text);
  int catSize = documentCategorizerME.getNumberOfCategories();
  for (int i = 0; i < catSize; i++) {
    String category = documentCategorizerME.getCategory(i);
    scoreMap.put(category, categorize[documentCategorizerME.getIndex(category)]);
  }
  return scoreMap;
}

The returned map then holds each category you modeled along with a score. You can use the scores to decide which category the input text belongs to.
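One simple way to make that decision is to take the highest-scoring category and fall back to a catch-all label when nothing scores high enough. The sketch below works on a plain map of category to score, so it runs without OpenNLP; the class `BestCategory`, the method `pick`, the 0.5 threshold, the "unclassified" label, and the scores in `main` are all assumptions chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: chooses a category from a map of category -> score.
class BestCategory {

    // Returns the highest-scoring category, or the fallback label when the
    // map is empty or the best score is below minScore.
    static String pick(Map<String, Double> scoreMap, double minScore, String fallback) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> entry : scoreMap.entrySet()) {
            if (entry.getValue() > bestScore) {
                best = entry.getKey();
                bestScore = entry.getValue();
            }
        }
        return (best != null && bestScore >= minScore) ? best : fallback;
    }

    public static void main(String[] args) {
        // Made-up scores standing in for the map returned by a categorizer
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("customerserviceproblems", 0.71);
        scores.put("deliveryproblems", 0.18);
        scores.put("returnissues", 0.11);

        System.out.println(pick(scores, 0.5, "unclassified"));
    }
}
```

The threshold is worth tuning on real data: with millions of emails, many will not belong to any of your categories, and a low best score is a useful signal for that.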


Not entirely sure, but I can think of two ways of trying to solve your problem:

  1. Standard Machine Learning

    As stated in the comment, extract only keywords from each mail and train a classifier using them. Define your relevant keyword set beforehand and extract only those keywords from the email if they are present.

    This is a simple but powerful technique and not to be underestimated as it yields very good results in many cases. You might want to try this one out first as more complex algorithms might be overkill.

  2. Grammars

    If you really want to delve into NLP, based on your question description, you might try defining some sort of grammar and parsing the email based on that grammar. I don't have much experience with Ruby, but I'm sure some lex/yacc equivalent exists. A quick web search gives this SO question and this. By identifying these phrases, you could judge which category an email falls under by calculating the proportion of phrases found for each category.

    For example, intuitively, some productions within the grammar could be defined as:

    {organization}{negative}{verb} :- delivery problems

    where organization = [they|the company|the firm|the website|the merchant], etc.
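To make the grammar idea a bit more concrete, here is a rough sketch in Java (to match the other answer's code) that uses regular expressions instead of a real lex/yacc-style parser. The word groups come from the question; everything else (the class `PhraseGrammar`, its method names, and which productions belong to which category) is an assumption for illustration. Each production is a sequence of word groups, and an email is scored per category by the proportion of that category's productions found in it.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch: productions are regexes built from word groups.
class PhraseGrammar {
    static final String ORGANIZATION = "(?:they|the company|the firm|the website|the merchant)";
    static final String NEGATIVE = "(?:did not|didn't|no)";
    static final String VERB = "(?:response|respond|answer|reply)";

    // Joins word groups into one case-insensitive pattern; consecutive
    // groups must be separated by whitespace in the text.
    static Pattern production(String... groups) {
        return Pattern.compile("\\b" + String.join("\\s+", groups) + "\\b",
                Pattern.CASE_INSENSITIVE);
    }

    // Fraction of the given productions that occur somewhere in the text.
    static double proportion(List<Pattern> productions, String text) {
        if (productions.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        for (Pattern p : productions) {
            if (p.matcher(text).find()) {
                hits++;
            }
        }
        return (double) hits / productions.size();
    }

    // Category with the highest proportion, or null when nothing matched.
    static String classify(Map<String, List<Pattern>> grammar, String text) {
        String best = null;
        double bestScore = 0.0;
        for (Map.Entry<String, List<Pattern>> entry : grammar.entrySet()) {
            double score = proportion(entry.getValue(), text);
            if (score > bestScore) {
                best = entry.getKey();
                bestScore = score;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Illustrative assignment of productions to a category
        Map<String, List<Pattern>> grammar = new LinkedHashMap<>();
        grammar.put("customer service problems", Arrays.asList(
                production(ORGANIZATION, NEGATIVE, VERB),
                production(NEGATIVE, VERB)));

        System.out.println(classify(grammar, "They didn't respond at all"));
    }
}
```

Plain regexes like this only catch word groups that appear next to each other in a fixed order, so a sentence like "I received no response from the website" is caught by the shorter production but not the longer one. A real grammar or parser would be needed to handle reordering.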

These approaches might be a way to start.
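For the first approach, the keyword extraction step could look roughly like this. It is a minimal sketch under my own assumptions: the class `KeywordFeatures`, the method `extract`, and the keyword set are invented for the example, keywords are single lowercase words, and the resulting counts would still have to be fed into whatever classifier you choose.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper: reduces an email to counts of predefined keywords.
class KeywordFeatures {

    // Counts how often each keyword from the predefined set occurs in the email.
    // Words outside the keyword set are ignored.
    static Map<String, Integer> extract(String email, Set<String> keywords) {
        Map<String, Integer> counts = new TreeMap<>();
        // Split on anything that is not a letter, digit, or apostrophe
        String[] tokens = email.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}']+");
        for (String token : tokens) {
            if (keywords.contains(token)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Illustrative keyword set; define your own per category
        Set<String> keywords = new HashSet<>(
                Arrays.asList("response", "delivery", "slow", "refund"));

        System.out.println(extract("No response. The delivery was slow.", keywords));
    }
}
```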