text processing - Grouping similar words (bad, worse)

Keywords: text processing


I know there are ways to find synonyms, either by using NLTK/pywordnet or the Pattern package in Python, but that isn't solving my problem.

If there are words like

  • bad, worst, poor
  • bag, baggage
  • lost, lose, misplace

I am not able to capture them. Can anyone suggest a possible way?

2 Answers: 

There has been extensive research in this area over the past 20 years. Yes, computers don't understand language, but we can train them to find the similarity or difference between two words with the help of some manual effort.

Approaches may be:

  1. Based on manually curated datasets that contain how words in a language are related to each other.

  2. Based on statistical or probabilistic measures of words appearing in a corpus.

Method 1:

Try WordNet. It is a human-curated network of words which preserves the relationships between words according to human understanding. In short, it is a graph whose nodes are 'synsets' (sets of synonyms) and whose edges are relations between them, so any two words that are close in the graph are close in meaning. Words that fall within the same synset may mean exactly the same thing.

Bag and baggage are close, which you can find by exploring node to node in a breadth-first style: start with 'bag' and explore its neighbors in an attempt to reach 'baggage'. You'll have to limit this search to a small number of iterations for any practical application. Another style is starting a random walk from one node and trying to reach the other node within a bounded number of moves. If you reach 'baggage' from 'bag', say, 500 times out of 1000 within 10 moves, you can be fairly sure that they are very similar to each other. Random walks are more helpful in much larger and more complex graphs.
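The breadth-first idea can be sketched like this. The graph below is a tiny hand-made stand-in, not real WordNet; in practice the nodes would be synsets and the edges WordNet relations (hypernym, similar-to, etc.), which you can walk with NLTK's WordNet interface.

```python
from collections import deque

# Toy synonym graph standing in for WordNet (illustration only).
GRAPH = {
    "bag":      ["baggage", "sack", "handbag"],
    "baggage":  ["bag", "luggage"],
    "luggage":  ["baggage", "suitcase"],
    "sack":     ["bag"],
    "handbag":  ["bag", "purse"],
    "purse":    ["handbag"],
    "suitcase": ["luggage"],
}

def graph_distance(start, goal, max_depth=3):
    """Breadth-first search: number of hops from start to goal,
    or None if goal is not reachable within max_depth hops."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth >= max_depth:
            continue  # cap the search, as suggested above
        for neighbor in GRAPH.get(node, []):
            if neighbor == goal:
                return depth + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return None

print(graph_distance("bag", "baggage"))     # direct neighbors -> 1
print(graph_distance("bag", "suitcase"))    # bag -> baggage -> luggage -> suitcase -> 3
print(graph_distance("purse", "suitcase"))  # too far within 3 hops -> None
```

A small graph distance then acts as your similarity score: the fewer hops between two words, the more related they are.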

There are many other similar resources online.

Method 2:

Word2Vec. It is hard to explain fully here, but it works by representing each word as a vector with a user-specified number of dimensions, based on the contexts the word appears in. The idea that words appearing in similar contexts mean similar things has been around for two decades. E.g. "I'm gonna check out my bags" and "I'm gonna check out my baggage" might both appear in text. You can read the paper for a full explanation (link at the end).

So you can train a Word2Vec model over a large corpus. In the end, you will be able to get a 'vector' for each word. You do not need to understand the significance of each dimension of this vector. You can use this vector representation to find the similarity or difference between words, or to generate synonyms for any word. The idea is that words which are similar to each other have vectors close to each other.
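"Vectors close to each other" is usually measured with cosine similarity. A minimal sketch, using hand-made 3-dimensional vectors purely for illustration (a real Word2Vec model would learn, say, 100-300 dimensions from a corpus):

```python
import math

# Toy vectors (illustration only; not from any trained model).
VECTORS = {
    "bag":     [0.9, 0.1, 0.0],
    "baggage": [0.8, 0.2, 0.1],
    "banana":  [0.0, 0.2, 0.9],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: near 1.0 means
    the vectors point in nearly the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity(VECTORS["bag"], VECTORS["baggage"]))  # close to 1
print(cosine_similarity(VECTORS["bag"], VECTORS["banana"]))   # close to 0
```

This is exactly what gensim's `model.similarity(...)` computes under the hood for the learned word vectors.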

Word2vec came out two years ago and immediately became the thing to use in most NLP applications. The quality of this approach depends on the amount and quality of your data. A Wikipedia dump is generally considered good training data, as it contains articles about almost everything. You can easily find ready-to-use models trained on Wikipedia online.

A tiny example from Radim's website:

>>> model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
[('queen', 0.50882536)]
>>> model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'
>>> model.similarity('woman', 'man')

The first example gives you the word closest (topn=1) to both 'woman' and 'king' while also farthest from 'man'; the answer is 'queen'. The second picks the odd one out. The third tells you how similar the two words are in your corpus.

Easy-to-use tool for Word2vec:

(Warning: lots of maths ahead)


I think programs can't understand language. The only field nearly powerful enough to do that is machine learning, and look what happened the last time.

Thus, I think the best way around it is to fall back on the old set of constrained rules and look things up in an online dictionary.

When I did a similar thing with C#, I used the following dictionary service. Whenever you want to check the definition of a word, you can go to http://services.aonaware.com/DictService/Default.aspx?action=define&dict=*&query= <your word here> and you'll get a simple-ish HTML page which you can download and parse. For example, for bad and worse:


Bad \Bad\ (b[a^]d), a. [Compar. Worse (w[^u]s); superl. Worst (w[^u]st).] [Probably fr. AS. b[ae]ddel hermaphrodite; cf. b[ae]dling effeminate fellow.] Wanting good qualities, whether physical or moral; injurious, hurtful, inconvenient, offensive, painful, unfavorable, or defective, either physically or morally; evil; vicious; wicked; -- the opposite of good; as, a bad man; bad conduct; bad habits; bad soil; bad air; bad health; a bad crop; bad news.


Worse \Worse\, a., compar. of Bad. [OE. werse, worse, wurse, AS. wiersa, wyrsa, a comparative with no corresponding positive; akin to OS. wirsa, OFries. wirra, OHG. wirsiro, Icel. verri, Sw. v["a]rre, Dan. v["a]rre, Goth. wa['i]rsiza, and probably to OHG. werran to bring into confusion, E. war, and L. verrere to sweep, sweep along. As bad has no comparative and superlative, worse and worst are used in lieu of them, although etymologically they have no relation to bad.] Bad, ill, evil, or corrupt, in a greater degree; more bad or evil; less good; specifically, in poorer health; more sick; -- used both in a physical and moral sense. [1913 Webster]

As you can see, each appears in the other's definition (under the comparative forms), so you can tell their meanings are related. If you play around with the data enough and know what you're looking for, this might even solve it for you. Good luck! (:
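The cross-reference heuristic above can be sketched in Python (the answer used C#, but the idea is language-agnostic). The definitions here are hard-coded abridgements of the entries shown above, standing in for pages you would actually fetch and strip of HTML:

```python
import re

# Abridged definitions standing in for fetched dictionary pages
# (illustration only; a real run would download them from the service).
DEFINITIONS = {
    "bad": "Wanting good qualities [Compar. Worse; superl. Worst.] "
           "injurious, hurtful; the opposite of good.",
    "worse": "compar. of Bad. Bad, ill, evil, or corrupt, in a greater "
             "degree; more bad or evil; less good.",
    "banana": "An elongated curved tropical fruit with soft pulpy flesh.",
}

def mention_each_other(word_a, word_b, definitions=DEFINITIONS):
    """Heuristic: two words are related if each appears, as a whole word
    and case-insensitively, in the other's dictionary definition."""
    def mentions(word, text):
        pattern = r"\b" + re.escape(word) + r"\b"
        return re.search(pattern, text, re.IGNORECASE) is not None
    return (mentions(word_b, definitions[word_a])
            and mentions(word_a, definitions[word_b]))

print(mention_each_other("bad", "worse"))   # True: mutual mentions
print(mention_each_other("bad", "banana"))  # False
```

Requiring the mention in both directions keeps accidental one-way matches (a common word appearing in a long definition) from counting as a relationship.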