text classification - Does LibShortText work with other languages too?


Keywords: text classification


Question: 

LibShortText is an open source tool for short-text classification and analysis.

I have tried to find out whether it also works with languages other than English (e.g. German), but I couldn't find any hint.

Who knows the answer? Thank you in advance.


1 Answer: 

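I think so, although it may need some extra preprocessing. LIBSVM and LIBLINEAR are both language-agnostic: they operate on numeric feature vectors, not on raw text. Since LibShortText is built on top of LIBLINEAR, it should work for other languages too, as long as the text can be tokenized into features.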

According to this paper, it has built-in preprocessing methods to extract features:

libshorttext.converter: For given short texts, LibShortText follows 
the bag-of-word model to generate features. Users apply procedures in
this library to pre-process short texts by tokenization, stemming 
(optional), and stop-word removal (optional). The library also allows 
users to choose between unigram and bigram features.

However, it looks like its stemming and stop-word removal support only English. So if you want better features for non-English text, you may want to do your own preprocessing, for example with NLTK.
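To illustrate the "own preprocessing" idea, here is a minimal sketch that lowercases, tokenizes and removes stop words from German text before handing it to LibShortText. Everything in it is an assumption for illustration: the function names `preprocess` and `to_training_line` are made up, the stop-word set is a tiny hand-picked sample rather than a real list, and the `label<TAB>text` line layout is how I understand LibShortText's input format, so check it against the README.

```python
import re

# Tiny illustrative German stop-word sample (assumption, not a complete list).
# A real list would be much longer, e.g. the German list shipped with NLTK.
GERMAN_STOPWORDS = {"der", "die", "das", "und", "ist", "ein", "eine", "nicht"}


def preprocess(text, stem=None):
    """Lowercase, tokenize and drop stop words; optionally stem each token.

    `stem` is any callable str -> str (hypothetical hook, see note below).
    """
    # \w is Unicode-aware for str patterns in Python 3, so umlauts and ß survive.
    tokens = re.findall(r"\w+", text.lower())
    tokens = [t for t in tokens if t not in GERMAN_STOPWORDS]
    if stem is not None:
        tokens = [stem(t) for t in tokens]
    return tokens


def to_training_line(label, text, stem=None):
    """Build one 'label<TAB>text' line (assumed LibShortText input layout)."""
    return f"{label}\t{' '.join(preprocess(text, stem))}"


if __name__ == "__main__":
    print(to_training_line("Wetter", "Das Wetter ist heute nicht schön"))
```

If NLTK is installed, you could pass a German stemmer as the hook, e.g. `stem=SnowballStemmer("german").stem` from `nltk.stem.snowball`. Since the text is then already cleaned, you would leave LibShortText's own (English) stemming and stop-word removal switched off when training.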