Find the Number of Distinct Topics After LDA in Python/ R



As far as I know, I need to fix the number of topics in advance for LDA modeling in Python/R. However, say I set topics=10 while the results show that, for a given document, nine topics are all about 'health', so the real number of distinct topics for that document is actually 2. How can I spot this without examining the keywords of each topic and manually counting the truly distinct topics?

P.S. I googled and learned that there are Vocabulary Word Lists (Word Banks) organized by theme, so I could pair each topic with a theme according to the word lists. If several topics fall into the same theme, I can then combine them into one distinct topic. I think this approach is worth trying, but I'm looking for smarter ideas, thanks.

1 Answer: 

First, your question somewhat assumes that the topics identified by LDA correspond to real semantic topics. I'd be very careful about that assumption and take a look at the documents and words assigned to the topics you want to interpret that way: LDA often assigns random extra words to a topic, can merge two or more actual topics into one (especially when the total number of topics is small), and some topics may not be meaningful at all ("junk" topics).

To answer your question: the idea of a "distinct number of topics" isn't well defined at all. Most of the work I've seen uses a simple threshold on a document's topic proportions to decide whether a topic is "significant" for that document.
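A minimal sketch of the threshold approach. The distribution and the 0.1 cut-off here are made up for illustration; in practice you'd take the proportions from your model (e.g. gensim's `get_document_topics` or scikit-learn's `LatentDirichletAllocation.transform`):

```python
import numpy as np

# Hypothetical document-topic distribution for one document
# with a 10-topic model (rows of an LDA transform sum to 1).
doc_topics = np.array([0.02, 0.41, 0.03, 0.28, 0.01,
                       0.15, 0.04, 0.02, 0.03, 0.01])

threshold = 0.1  # arbitrary cut-off; tune it by reading example documents
significant = np.where(doc_topics > threshold)[0]
print(len(significant), significant.tolist())  # "distinct" topic count and ids
```

The count of topics above the threshold is then your "distinct number of topics" for that document, with all the caveats about arbitrariness mentioned above.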

A more principled way is to look at the proportion of the document's words assigned to each topic: if it is "significantly" higher than the corpus average, the topic is significant in that document. Again, though, this involves a somewhat arbitrary threshold. I don't think anything beats close reading of some example documents for making meaningful choices here.

I should note that, depending on how you set the document-topic prior (usually called alpha; beta is the topic-word prior), each document may not be focused on just a few topics (as seems to be your case) but may instead be a much more even mix. In that case the "distinct number of topics" becomes even less meaningful.
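You can see this effect directly by sampling synthetic document-topic distributions from a Dirichlet at two alpha values (the alphas, the 10% cut-off, and the sample size here are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
avg_active = {}
for a in (0.05, 1.0):
    # 1000 synthetic document-topic distributions over 10 topics
    samples = rng.dirichlet(np.full(10, a), size=1000)
    # average number of topics per document with more than 10% weight
    avg_active[a] = float((samples > 0.1).sum(axis=1).mean())
print(avg_active)
```

With a small alpha, documents concentrate on one or two topics; with alpha near 1 the mass spreads out and many more topics clear the 10% bar, so any threshold-based topic count changes meaning with the prior.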

P.S. Using word lists that are meaningful in your application is not a bad way to identify candidate topics of interest. It's especially useful if your model has many topics (:
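A toy sketch of that word-list idea: label each topic by its overlap with hand-made theme lists, then count distinct labels. The theme lists and top words below are invented; in practice you'd pull the top words from your model (e.g. gensim's `show_topic`):

```python
# Hand-made theme word lists (entirely made up for this example).
theme_lists = {
    "health": {"doctor", "disease", "hospital", "patient", "medicine"},
    "finance": {"bank", "loan", "interest", "credit", "market"},
}

# Hypothetical top words per topic from a fitted LDA model.
topic_top_words = {
    0: ["doctor", "patient", "care", "hospital", "nurse"],
    1: ["market", "bank", "stock", "loan", "price"],
    2: ["doctor", "medicine", "disease", "treatment", "drug"],
}

labels = {}
for topic, words in topic_top_words.items():
    # count how many top words fall in each theme list
    overlaps = {theme: len(set(words) & vocab)
                for theme, vocab in theme_lists.items()}
    best = max(overlaps, key=overlaps.get)
    labels[topic] = best if overlaps[best] > 0 else "unlabelled"

distinct = set(labels.values())
print(labels, len(distinct))
```

Here topics 0 and 2 both map to "health" and collapse into one distinct theme, which is exactly the merging the question describes.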

P.P.S.: I hope you have a reasonable number of documents (at least a few thousand), as LDA tends to be less meaningful with fewer, capturing chance word co-occurrences rather than meaningful ones.

P.P.P.S.: I'd go for a larger number of topics combined with hyperparameter optimisation (as provided by the Mallet LDA implementation). This effectively chooses a reasonable number of topics for your model, leaving very few words assigned to the "extra" topics.