Tracking Training Performance of LDA Models

In a recent open-source contribution I enabled callbacks in the scalable (multi-core) implementation of Latent Dirichlet Allocation in the gensim library [1]. This, in turn, allows users to tune this popular topic-extraction model faster and more accurately.

An obvious use case is monitoring and early stopping of training with popular coherence metrics such as \(U_{mass}\) and \(C_V\) [2]. On the 20 Newsgroups dataset, the training performance looks as follows:

Training performance of Multi-Core LDA on 20 Newsgroups data, monitored by callbacks.
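As a quick reminder of what is being tracked: the \(U_{mass}\) coherence of a topic with top words \(w_1, \dots, w_N\) is built from document co-occurrence counts; in the (averaged) form surveyed in [2] it reads

\[
U_{mass} = \frac{2}{N(N-1)} \sum_{i=2}^{N} \sum_{j=1}^{i-1} \log \frac{D(w_i, w_j) + 1}{D(w_j)},
\]

where \(D(w)\) counts the documents containing \(w\) and \(D(w_i, w_j)\) those containing both; higher (closer to zero) values indicate more coherent topics. \(C_V\) replaces these counts with sliding-window co-occurrences and an indirect cosine-similarity measure; see [2] for details.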

The achieved scores are decent, in fact better than reported in the literature [3], though this may be due to preprocessing rather than early stopping. A full example is shared in this Kaggle notebook.
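For a concrete picture of the new API, below is a minimal sketch of attaching coherence callbacks to LdaMulticore. The toy corpus and hyperparameters (num_topics, passes, topn) are illustrative placeholders rather than the settings behind the plot above; for the faithful setup, see the Kaggle notebook.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaMulticore
from gensim.models.callbacks import CoherenceMetric

# Toy documents standing in for the tokenized 20 Newsgroups corpus.
texts = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system", "response", "time"],
    ["graph", "minors", "trees"],
    ["graph", "trees", "computer", "survey"],
]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# U_mass is estimated from document co-occurrences in the bag-of-words corpus,
# while C_V needs the raw tokenized texts (sliding-window co-occurrences).
# Distinct titles keep the two metrics apart in the collected results;
# logger="shell" prints each score after every pass.
umass_logger = CoherenceMetric(
    corpus=corpus, dictionary=dictionary, coherence="u_mass",
    topn=3, logger="shell", title="u_mass",  # topn=3: the toy vocabulary is tiny
)
cv_logger = CoherenceMetric(
    texts=texts, dictionary=dictionary, coherence="c_v",
    topn=3, logger="shell", title="c_v",
)

# With the contribution merged, the multi-core implementation accepts
# the same callbacks argument as the single-core LdaModel.
model = LdaMulticore(
    corpus=corpus,
    id2word=dictionary,
    num_topics=2,   # illustrative values, not tuned
    passes=5,
    workers=2,
    callbacks=[umass_logger, cv_logger],
)

# Per-pass scores are collected on the model, keyed by the metric titles.
print(model.metrics)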

  1. Řehůřek R, Sojka P. Software Framework for Topic Modelling with Large Corpora. Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. Published online 2010. doi:10.13140/2.1.2393.1847
  2. Röder M, Both A, Hinneburg A. Exploring the Space of Topic Coherence Measures. Proceedings of the Eighth ACM International Conference on Web Search and Data Mining. Published online February 2, 2015. doi:10.1145/2684822.2685324
  3. Zhang Z, Fang M, Chen L, Namazi Rad MR. Is Neural Topic Modelling Better than Clustering? An Empirical Study on Clustering with Contextual Embeddings for Topics. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Published online 2022. doi:10.18653/v1/2022.naacl-main.285

