Article type: Research Article
Authors: Zhou, Xiaotang [a,b] | Ouyang, Jihong [a,b,*] | Li, Ximing [a,b]
Affiliations: [a] College of Computer Science and Technology, Jilin University, Changchun 130012, Jilin, China | [b] Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, Jilin, China
Correspondence: [*] Corresponding author: Jihong Ouyang, College of Computer Science and Technology, Jilin University, Changchun 130012, Jilin, China. Tel.: +86 138 4318 4836; E-mail: [email protected].
Abstract: As an efficient sampling algorithm for latent Dirichlet allocation, SparseLDA uses a caching strategy to improve the time and space efficiency of the standard Gibbs sampling algorithm (StdGibbs) by recycling previous computation. However, SparseLDA cannot further improve the time efficiency of StdGibbs, because the amount of recycled computation is limited: the word types of two adjacent tokens usually differ, so previous computation cannot easily be recycled further. To solve this problem, in this paper we propose a new algorithm, Efficient SparseLDA (ESparseLDA), based on SparseLDA. The main idea of ESparseLDA is to first rearrange the tokens within each text according to their word types, so that tokens of the same word type are aggregated together, and then recycle more computation while making no approximation and ensuring exactness. We provide detailed theoretical explanations and comparative experimental analyses of the correctness, exactness and time efficiency of ESparseLDA. In particular, statistical significance tests on perplexities show that ESparseLDA is correct and exact, and running time results show that ESparseLDA is more time-efficient than SparseLDA, by margins ranging from 5.06% to 31.85% on the different datasets used in the experiments.
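The rearrangement idea described in the abstract (grouping a document's tokens by word type so that equal types become adjacent) can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the authors' implementation; the helper name rearrange_tokens is hypothetical.

    from collections import Counter

    def rearrange_tokens(doc_word_ids):
        """Group one document's tokens by word type (hypothetical helper
        illustrating the rearrangement step described in the abstract).

        After grouping, all occurrences of a word type are adjacent, so a
        SparseLDA-style sampler can reuse per-word cached quantities across
        consecutive tokens instead of rebuilding them for each token."""
        # Sorting by word id is one simple way to aggregate equal types.
        # The token multiset is unchanged, and since the collapsed Gibbs
        # sampler treats tokens within a document exchangeably, sampling
        # the rearranged sequence targets the same posterior (no approximation).
        return sorted(doc_word_ids)

    doc = [5, 2, 5, 9, 2, 2, 7]                   # word ids of one document
    print(rearrange_tokens(doc))                   # [2, 2, 2, 5, 5, 7, 9]
    assert Counter(doc) == Counter(rearrange_tokens(doc))  # same tokens, new order

Because only the visiting order changes, the exactness claim in the abstract is plausible on its face: the sampler still updates every token, just in an order that lets the word-specific part of the sampling distribution be computed once per word type rather than once per token.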
Keywords: Latent Dirichlet allocation, topic model, Gibbs sampling, topic inference
DOI: 10.3233/IDA-173609
Journal: Intelligent Data Analysis, vol. 22, no. 6, pp. 1227-1257, 2018