lda - Are there any efficient Python libraries for Dynamic Topic Models, preferably extending Gensim?



I'm trying to model Twitter stream data with topic models. Gensim, being an easy-to-use solution, is impressive in its simplicity. It has a truly online implementation for LSI, but not for LDA. For a changing content stream like Twitter, Dynamic Topic Models are ideal. Is there any way, whether a hack, an existing implementation, or just a strategy, to use Gensim for this purpose?

Are there any other Python implementations, either derived from Gensim (preferably) or independent? I prefer Python because I want to get started as soon as possible, but if there is a better solution that takes some extra work, please mention it.


3 Answers: 

Gensim has a Python wrapper for the original C++ code.


The DTM wrapper in Gensim works, but none of the documentation is particularly complete at this time. On the Gensim side, the most useful thing to look at is the DTM example buried in docs/notebooks, which shows what all of the input variables need to look like. A couple of things to note:

  • the DTM model has been moved into gensim.models.wrappers.dtmmodel
  • initialize_lda=True must be set because of a bug in the DTM code (this will be the default in the future -- PR #676)

You'll also need a working compiled version of DTM itself (you provide the path to that executable). You can try using the appropriate executable from a GitHub repo, but if that doesn't work you'll probably need to compile the original code by running the included Makefile.
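For what it's worth, here is a rough sketch of how the inputs might be put together. It assumes a Gensim 3.x install (where gensim.models.wrappers still exists) and a compiled DTM binary. The helper names order_by_time and fit_dtm, the sample documents, and the binary path are all made up for illustration. The point to get right is that the corpus must be sorted by time and time_slices must list how many documents fall into each consecutive period.

```python
from collections import Counter


def order_by_time(dated_docs):
    """Sort (period, tokens) pairs by period and count documents per period.

    Hypothetical helper, not part of Gensim. Returns (docs, time_slices),
    where time_slices[i] is the number of documents in the i-th period.
    """
    ordered = sorted(dated_docs, key=lambda pair: pair[0])
    docs = [tokens for _, tokens in ordered]
    counts = Counter(period for period, _ in ordered)
    time_slices = [counts[period] for period in sorted(counts)]
    return docs, time_slices


def fit_dtm(dtm_path, docs, time_slices, num_topics=2):
    """Fit the DTM wrapper; dtm_path points at your compiled DTM executable.

    Gensim is imported here so the helper above works without it.
    Assumes Gensim 3.x, which ships gensim.models.wrappers.dtmmodel.
    """
    from gensim.corpora import Dictionary
    from gensim.models.wrappers.dtmmodel import DtmModel

    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    return DtmModel(
        dtm_path,
        corpus=corpus,
        time_slices=time_slices,
        num_topics=num_topics,
        id2word=dictionary,
        initialize_lda=True,  # set explicitly, per the note above
    )


# Made-up sample data: (period, tokenized tweet), deliberately out of order.
dated_docs = [
    ("2016-02", ["election", "debate"]),
    ("2016-01", ["snow", "storm"]),
    ("2016-01", ["storm", "travel"]),
    ("2016-03", ["election", "results"]),
]
docs, time_slices = order_by_time(dated_docs)
print(time_slices)

# Uncomment once you have a DTM binary; the path below is a placeholder.
# model = fit_dtm("/path/to/dtm-binary", docs, time_slices)
```

The sum of time_slices has to equal the number of documents in the corpus, so it's worth checking that before handing things to the wrapper.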


Having talked with David Blei and John Lafferty about exactly this, the answer right now is no, there aren't.

Sean Gerrish's DTM implementation has a documented memory leak, but it works on manageable collections.