machine learning - Why are the 30 topics identified by the Stanford Topic Modeling Toolkit so similar to each other?

Keywords: machine learning


What could be the possible reasons why the 30 topics identified by the Stanford Topic Modeling Toolkit (the run took ~4 hours) on a corpus of 19,500 articles (shared by Twitter users) are so similar to each other? They have pretty much the same terms and frequencies => essentially, I just have a single topic :)

The identified topics can be found here

I do standard preparation of the text documents before the learning and inference stages: removing stop words, collapsing whitespace, lowercasing everything, etc.
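For concreteness, the prep steps can be sketched roughly like this (a minimal Python illustration; the stop list, tokenizer, and example tweet are placeholders, not my actual pipeline):

```python
import re

# Placeholder stop list -- the real one should be much larger
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "rt"}

def preprocess(doc):
    """Lowercase, collapse whitespace, tokenize, and drop stop words."""
    doc = doc.lower()
    doc = re.sub(r"\s+", " ", doc).strip()   # collapse whitespace
    tokens = re.findall(r"[a-z']+", doc)     # keep alphabetic tokens only
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("RT  @user: The  Topics are   SO similar!"))
# → ['user', 'topics', 'are', 'so', 'similar']
```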

Some of my params:

  • numTopics = 30
  • TermMinimumDocumentCountFilter(10) ~> // filter out terms which occur in < 10 docs
  • TermDynamicStopListFilter(30) ~> // filter out the 30 most common terms
  • DocumentMinimumLengthFilter(10) // keep only docs with >= 10 terms
  • topicSmoothing = SymmetricDirichletParams(0.01)
  • termSmoothing = SymmetricDirichletParams(0.01)
  • maxIterations = 10
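The term/document filters above can be mimicked in plain Python on tokenized docs (a toy sketch for checking what survives the filtering, not TMT's actual implementation; the thresholds are shrunk to fit the toy corpus):

```python
from collections import Counter

def apply_filters(docs, min_doc_count, n_common, min_len):
    """Mimic TMT's term-count, dynamic-stop-list, and doc-length filters."""
    doc_freq = Counter()                      # in how many docs each term occurs
    for doc in docs:
        doc_freq.update(set(doc))
    term_freq = Counter(t for doc in docs for t in doc)
    stop = {t for t, _ in term_freq.most_common(n_common)}   # dynamic stop list
    def keep(t):
        return doc_freq[t] >= min_doc_count and t not in stop
    filtered = [[t for t in doc if keep(t)] for doc in docs]
    return [d for d in filtered if len(d) >= min_len]        # min doc length

# Toy corpus; real values would be min_doc_count=10, n_common=30, min_len=10
docs = [["the", "cats", "dogs"],
        ["the", "the", "cats", "birds"],
        ["the", "cats", "dogs", "fish"]]
print(apply_filters(docs, min_doc_count=2, n_common=1, min_len=2))
# → [['cats', 'dogs'], ['cats', 'dogs']]
```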

1 Answer: 

I'd say it's because your methodology seems to be flawed. Raw word counts have intrinsic biases that are characteristics of the language itself, regardless of which topics are meant to be mined.

For example, some words have no bearing on topic similarity, yet still bias your outcome:


Other words are purposefully vague, since they deal with references to other particulars:


Still other words are simply commonly used verbs or adverbs:


Others are well-known industry brand names:
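One standard way to counter this raw-count bias is tf-idf weighting, which zeroes out terms that appear in every document (a generic illustration, not necessarily something TMT does out of the box; the toy counts below are made up):

```python
import math
from collections import Counter

def tfidf(docs_counts):
    """Reweight raw term counts by inverse document frequency."""
    n_docs = len(docs_counts)
    df = Counter()                            # document frequency per term
    for counts in docs_counts:
        df.update(counts.keys())
    return [{t: c * math.log(n_docs / df[t]) for t, c in counts.items()}
            for counts in docs_counts]

weights = tfidf([{"the": 5, "lda": 2}, {"the": 3, "topic": 1}])
# "the" occurs in every doc, so its weight drops to 0;
# "lda" is distinctive, so it keeps a positive weight.
print(weights[0])
```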


Until you can construct a verifiable model showing that word frequencies map to specific topics, all you have done is some data collection, some hand waving (instead of attempting to disprove the alternative hypotheses), and a jump to the conclusion that your original premise is correct.

Restructure your classification to capture topics instead of words, then build a model describing the distance between topics, and then try to show that, among the 30 offered topics, there are really only 29 (or fewer) with enough "distance" between them to stand on their own.
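One way to quantify that "distance" is the Jensen-Shannon divergence between pairs of topic-word distributions (a generic sketch; the three-term distributions below are made up for illustration):

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        # Kullback-Leibler divergence; zero-probability terms contribute 0
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return (kl(p, m) + kl(q, m)) / 2

# Two nearly identical topic-word distributions give a divergence near 0 --
# exactly the "all my topics look the same" symptom.
t1 = [0.5, 0.3, 0.2]
t2 = [0.5, 0.29, 0.21]
print(js_divergence(t1, t2))
```

Topics whose pairwise divergence stays near 0 do not have enough distance to count as separate topics.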

Because it is all very well and good to collect data from users, but the need for data is secondary to the need for good data that is pertinent to what you actually want to know. (That sentence was constructed purposefully: it has a high Stanford Topic Modeling Toolkit "word count", but is likely not about a similar topic.)