algorithm - Prioritizing text based on content



If you have a list of texts and a person interested in certain topics, which algorithms deal with choosing the most relevant text for that person?

I believe this is quite a complex topic, so as an answer I expect a few directions for studying the various methodologies of text analysis, text statistics, artificial intelligence, etc.

thank you

3 Answers: 

There are quite a few algorithms for this task, far too many to mention them all here. First, some starting points:

  • Topic discovery and recommendation are two quite distinct tasks, although they often overlap. If you have a stable user base, you might be able to give very good recommendations without any topic discovery.

  • Discovering topics and assigning names to them are also two different tasks. It is often easier to tell that text A and text B share a similar topic than to state explicitly what that common topic is. Naming the topics is best done by humans, for example by having them tag the items.

Now to some actual examples.

  • TF-IDF is often a good starting point, but it has severe drawbacks. For example, it cannot tell that "car" in one text and "truck" in another probably indicate a shared topic.

  • A Kohonen map (self-organizing map) clusters data automatically. It learns the topics and then organizes the texts by topic.

  • Latent semantic analysis (LSA) can boost TF-IDF by detecting semantic similarity among different words. Note that it has been patented, so you might not be able to use it.

  • Once you have a set of topics assigned by users or experts, you can also try almost any kind of machine learning method (for example SVM) to map the TF-IDF data to topics.
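
As a rough illustration of the TF-IDF starting point, here is a minimal sketch in plain Python that ranks texts against a string describing a person's interests. The function names (`tokenize`, `tfidf_vectors`, `rank_texts`), the smoothed idf formula, and the sample texts are assumptions made for this example, not taken from any particular library.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return [w for w in words if w]


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, plus the idf table."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf (an assumption of this sketch; other variants exist).
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * idf[t] for t, c in tf.items()})
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_texts(texts, interests):
    """Return (score, text_index) pairs, most relevant text first."""
    vectors, idf = tfidf_vectors(texts)
    q_tf = Counter(tokenize(interests))
    total = sum(q_tf.values())
    # Terms that never occur in the texts cannot contribute, so drop them.
    query = {t: (c / total) * idf[t] for t, c in q_tf.items() if t in idf}
    scored = [(cosine(query, v), i) for i, v in enumerate(vectors)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


texts = [
    "The new car engine is quiet and efficient",
    "This truck hauls heavy loads on the highway",
    "A recipe for fresh bread and butter",
]
ranking = rank_texts(texts, "car engine")
for score, index in ranking:
    print(round(score, 3), texts[index])
```

With these sample texts, the truck text scores exactly as low as the bread recipe for the interest "car engine", because it shares no words with the query. That is the drawback described in the TF-IDF bullet.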


As a search engine engineer, I think this problem is best solved using two techniques in conjunction.

Technology 1, Search (TF-IDF or other algorithms)

Use search to create a baseline model for content where you don't have user statistics. There are a number of technologies out there, but I think the Apache Lucene/Solr code base is by far the most mature and stable.

Technology 2, User-based recommenders (k-nearest neighbors or other algorithms)

When you start getting user statistics, use them to enhance the relevance model used by the text analysis system. A fast-growing code base for solving these kinds of problems is the Apache Mahout project.
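
To make the user-based idea concrete, here is a minimal sketch of a k-nearest-neighbors recommender in plain Python. It does not use Mahout. The `recommend` function, the rating scale, and the sample users and texts are all hypothetical, chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two {item: rating} dicts."""
    dot = sum(r * b[item] for item, r in a.items() if item in b)
    norm_a = math.sqrt(sum(r * r for r in a.values()))
    norm_b = math.sqrt(sum(r * r for r in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(ratings, user, k=2):
    """Predict ratings for items `user` has not seen, using the k most similar users.

    ratings maps user name -> {item: rating}.
    Returns (item, predicted_rating) pairs, best first.
    """
    target = ratings[user]
    neighbours = sorted(
        ((cosine(target, other), name)
         for name, other in ratings.items() if name != user),
        reverse=True,
    )[:k]
    weighted_sum = {}
    weight_total = {}
    for similarity, name in neighbours:
        if similarity <= 0:
            continue  # a user with nothing in common tells us nothing
        for item, rating in ratings[name].items():
            if item in target:
                continue  # only recommend texts the user has not rated yet
            weighted_sum[item] = weighted_sum.get(item, 0.0) + similarity * rating
            weight_total[item] = weight_total.get(item, 0.0) + similarity
    # Similarity-weighted average of the neighbours' ratings.
    predicted = {i: weighted_sum[i] / weight_total[i] for i in weighted_sum}
    return sorted(predicted.items(), key=lambda pair: pair[1], reverse=True)


ratings = {
    "ann": {"text_a": 5, "text_b": 4},
    "bob": {"text_a": 5, "text_b": 4, "text_c": 5},
    "cat": {"text_a": 1, "text_b": 5, "text_d": 5},
    "dan": {"text_d": 4},
}
suggestions = recommend(ratings, "ann", k=1)
print(suggestions)
```

Here "bob" is the closest neighbour of "ann", so with `k=1` the only suggestion is the text that bob rated and ann has not seen. In practice the search baseline would cover new users and new texts for which no ratings exist yet.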


Check out Programming Collective Intelligence, a really good overview of various techniques along these lines. Also very readable.