machine learning - What do the parameters of the csvIterator mean in Mallet?


Keywords: machine learning


Question: 

I am using the MALLET topic modelling sample code, and although it runs fine, I would like to know what the parameters of this statement actually mean.

instances.addThruPipe(new CsvIterator(new FileReader(dataFile),
                                      "(\\w+)\\s+(\\w+)\\s+(.*)",
                                      3, 2, 1)  // (data, target, name) field indices                    
                     );

2 Answers: 

From the documentation:

This iterator, perhaps more properly called a Line Pattern Iterator, reads through a file and returns one instance per line, based on a regular expression.

If you have data of the form

[name] [label] [data]

The call you are interested in is

CsvIterator(java.io.Reader input, java.lang.String lineRegex, 
            int dataGroup, int targetGroup, int uriGroup) 

The first parameter is the source the data is read from, such as a FileReader or a StringReader. The second parameter is the regex used to extract the fields from each line read from that reader. In your example it is (\\w+)\\s+(\\w+)\\s+(.*), where the backslashes are doubled only because it is written as a Java string literal. The regex itself is (\w+)\s+(\w+)\s+(.*), which translates to:

  • 1 or more word characters, i.e. letters, digits or underscore (capture group 1, this is the name of the instance), followed by
  • 1 or more whitespace characters (tab, space, ..), followed by
  • 1 or more word characters (capture group 2, this is the label/target), followed by
  • 1 or more whitespace characters (tab, space, ..), followed by
  • 0 or more characters of any kind (capture group 3, this is the data)
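To see the three capture groups in action without MALLET, the breakdown above can be sketched with plain java.util.regex. This is only an illustration: the class name LineRegexDemo and the method split are made up for this example, and the use of find() is an assumption of the sketch, not a statement about how CsvIterator applies the pattern internally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineRegexDemo {
    // Same pattern as in the question; backslashes are doubled for the Java string literal.
    static final Pattern LINE = Pattern.compile("(\\w+)\\s+(\\w+)\\s+(.*)");

    // Hypothetical helper: returns {group 1, group 2, group 3}, or null if the pattern is not found.
    static String[] split(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] groups = split("test1 spam Wanna buy viagra?");
        System.out.println("group 1 (name)  = " + groups[0]);
        System.out.println("group 2 (label) = " + groups[1]);
        System.out.println("group 3 (data)  = " + groups[2]);
    }
}
```

For the sample line, group 1 is test1, group 2 is spam, and group 3 is the rest of the line.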

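The numbers 3, 2, 1 are capture-group indices passed as dataGroup, targetGroup and uriGroup: the data is group 3 (it comes last), the target is group 2, and the name is group 1 (it comes first). The regex requires each line to have the format stated in the documentation:

test1 spam Wanna buy viagra?
test2 not-spam Hello, are you busy on Sunday?

Note that \w does not match a hyphen, so (\w+) cannot capture the label not-spam in the second line as a whole; a label group such as (\S+) would.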
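A quick way to check which lines a given line regex accepts is to test them with plain java.util.regex before handing the file to MALLET. The sketch below is an assumption-laden illustration: LineFormatCheck, conforms and the (\S+) variant of the pattern are hypothetical names and choices for this example, and matches() is used here as a whole-line check, which is not necessarily how CsvIterator applies the pattern.

```java
import java.util.regex.Pattern;

class LineFormatCheck {
    // Pattern from the question: name and label are both \w+ groups.
    static final Pattern WORD_LABEL = Pattern.compile("(\\w+)\\s+(\\w+)\\s+(.*)");
    // Hypothetical variant: the label is any run of non-whitespace characters.
    static final Pattern NON_SPACE_LABEL = Pattern.compile("(\\w+)\\s+(\\S+)\\s+(.*)");

    // True if the whole line has the shape the pattern describes.
    static boolean conforms(Pattern p, String line) {
        return p.matcher(line).matches();
    }

    public static void main(String[] args) {
        String[] lines = {
            "test1 spam Wanna buy viagra?",
            "test2 not-spam Hello, are you busy on Sunday?"
        };
        for (String line : lines) {
            System.out.println(conforms(WORD_LABEL, line) + " / "
                    + conforms(NON_SPACE_LABEL, line) + " : " + line);
        }
    }
}
```

The first sample line conforms to both patterns. The second conforms only to the (\S+) variant, because \w does not match the hyphen in not-spam.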

CsvIterator is a terrible name, because this class does not actually read comma-separated values. It splits each line with the regex you supply, which in this example means whitespace-separated (space, tab, ...) values.

 

The explanation given in the answer above is very good.

However, one point is missing. The order of the capture groups for the data, label and name fields in the line regex has to correspond to the order in which those fields appear in the input file. For example, if name is the 1st field, data the 2nd field and label the 3rd field in your input file, then the regex must have the group for name first, followed by the group for data, and finally the group for label. The group indices passed to the constructor then say which group is which. An example is shown below:

Input instance: Mail67(tab)TCC problems. Hi there, For some reason no administrators in the Old Master Paintings department have been able to get information from TCC. It appears to be going through on JDE, but nothing appears when searched on TCC. Any help or guidance you can offer to f....(tab)Inc

CsvIterator parameters (backslashes doubled for the Java string literal): CsvIterator(new FileReader(pathToFile), "(\\w+)\\t(.*)\\t(\\w+)", 2, 3, 1)
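The reordering can be tried out with plain java.util.regex. This is a minimal sketch under stated assumptions: ReorderedFieldsDemo and pick are hypothetical names that only mimic the (dataGroup, targetGroup, uriGroup) argument order of the constructor, find() is an assumption of the sketch, and the sample line is a shortened stand-in for the instance above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReorderedFieldsDemo {
    // Hypothetical helper: returns {data, target, name}, picking capture groups by index
    // in the same order as the CsvIterator constructor arguments.
    static String[] pick(String line, String lineRegex,
                         int dataGroup, int targetGroup, int uriGroup) {
        Matcher m = Pattern.compile(lineRegex).matcher(line);
        if (!m.find()) {
            throw new IllegalArgumentException("Line does not match regex: " + line);
        }
        return new String[] { m.group(dataGroup), m.group(targetGroup), m.group(uriGroup) };
    }

    public static void main(String[] args) {
        // File order is name, data, label, separated by tabs (shortened sample text).
        String line = "Mail67\tTCC problems. Hi there\tInc";
        String[] fields = pick(line, "(\\w+)\\t(.*)\\t(\\w+)", 2, 3, 1);
        System.out.println("data   = " + fields[0]);
        System.out.println("target = " + fields[1]);
        System.out.println("name   = " + fields[2]);
    }
}
```

Here group 1 is the name, group 2 the data and group 3 the label, so the indices 2, 3, 1 select data, target and name in that order.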