LatentDirichletAllocationVectorGibbsSampler (Cognitive Foundry)

java.lang.Object
- gov.sandia.cognition.util.AbstractCloneableSerializable
  - gov.sandia.cognition.algorithm.AbstractIterativeAlgorithm
    - gov.sandia.cognition.algorithm.AbstractAnytimeAlgorithm<ResultType>
      - gov.sandia.cognition.learning.algorithm.AbstractAnytimeBatchLearner<java.util.Collection<? extends Vectorizable>,LatentDirichletAllocationVectorGibbsSampler.Result>
        - gov.sandia.cognition.text.topic.LatentDirichletAllocationVectorGibbsSampler

All Implemented Interfaces:

AnytimeAlgorithm<LatentDirichletAllocationVectorGibbsSampler.Result>, IterativeAlgorithm, StoppableAlgorithm, AnytimeBatchLearner<java.util.Collection<? extends Vectorizable>,LatentDirichletAllocationVectorGibbsSampler.Result>, BatchLearner<java.util.Collection<? extends Vectorizable>,LatentDirichletAllocationVectorGibbsSampler.Result>, CloneableSerializable, Randomized, java.io.Serializable, java.lang.Cloneable

Direct Known Subclasses:

ParallelLatentDirichletAllocationVectorGibbsSampler
```
@PublicationReference(author={"David M. Blei","Andrew Y. Ng","Michael I. Jordan"},title="Latent Dirichlet Allocation",year=2003,type=Journal,publication="Journal of Machine Learning Research",pages={993,1022},url="http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf")
@PublicationReference(author="Gregor Heinrich",title="Parameter estimation for text analysis",year=2009,type=TechnicalReport,url="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.149.1327&rep=rep1&type=pdf")
public class LatentDirichletAllocationVectorGibbsSampler
extends AbstractAnytimeBatchLearner<java.util.Collection<? extends Vectorizable>,LatentDirichletAllocationVectorGibbsSampler.Result>
implements Randomized
```
A Gibbs sampler for performing Latent Dirichlet Allocation (LDA). It operates on input vectors that are expected to have positive integer counts. The LDA model uses a fixed set of latent topics as a generative model for term occurrences in documents. Each document is thus modeled as a mixture of topics. This implementation uses Gibbs sampling, a Markov Chain Monte Carlo algorithm, to estimate the parameters of the model.
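
As a reminder of the model being fit, the generative process of LDA can be written as below, assuming the scalar `alpha` and `beta` of this class act as symmetric Dirichlet priors. The symbols D (documents), K (topics), N_d (term occurrences in document d), theta, phi, z, and w follow common LDA notation and are not identifiers from this class.

```latex
\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
\phi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1, \dots, K \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d), & n &= 1, \dots, N_d \\
w_{d,n} \mid z_{d,n}, \phi &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
\end{align*}
```

Here theta_d corresponds to the document-topic probabilities and phi_k to the topic-term probabilities that the sampler estimates.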

Since:

3.1

Author:

Justin Basilico, Sean Crosby

See Also:

Serialized Form

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`static class`	`LatentDirichletAllocationVectorGibbsSampler.Result` Represents the result of performing Latent Dirichlet Allocation.

Field Summary

Fields
Modifier and Type	Field and Description
`protected double`	`alpha` The alpha parameter controlling the Dirichlet distribution for the document-topic probabilities.
`protected double`	`beta` The beta parameter controlling the Dirichlet distribution for the topic-term probabilities.
`protected int`	`burnInIterations` The number of burn-in iterations for the Markov Chain Monte Carlo algorithm to run before sampling begins.
`static double`	`DEFAULT_ALPHA` The default value of alpha is 5.0.
`static double`	`DEFAULT_BETA` The default value of beta is 0.5.
`static int`	`DEFAULT_BURN_IN_ITERATIONS` The default number of burn-in iterations is 2000.
`static int`	`DEFAULT_ITERATIONS_PER_SAMPLE` The default number of iterations per sample is 100.
`static int`	`DEFAULT_MAX_ITERATIONS` The default maximum number of iterations is 10000.
`static int`	`DEFAULT_TOPIC_COUNT` The default topic count is 10.
`protected int`	`documentCount` The number of documents in the dataset.
`protected int[]`	`documentTermCounts` For each unique term (unique per document), the number of times that term occurs in the document.
`protected int[]`	`documentTermPairsCounts` The number of unique terms in each document.
`protected int[]`	`documentTerms` For each unique term (unique per document), the term ID it maps to.
`protected int[][]`	`documentTopicCount` For each document, the number of terms assigned to each topic.
`protected int[]`	`documentTopicSum` The number of term occurrences in each document.
`protected int`	`iterationsPerSample` The number of iterations of the Markov Chain Monte Carlo algorithm between samples (after the burn-in iterations).
`protected int[]`	`occurrenceTopicAssignments` The assignments of term occurrences to topics.
`protected java.util.Random`	`random` The random number generator to use.
`protected LatentDirichletAllocationVectorGibbsSampler.Result`	`result` The result probabilities.
`protected int`	`sampleCount` The number of model parameter samples that have been made.
`protected int`	`termCount` The number of terms in the dataset.
`protected int`	`topicCount` The number of topics for the algorithm to create.
`protected double[]`	`topicCumulativeProportions` A workspace array for the sampling function, kept as a field so that it does not have to be recreated on every call.
`protected int[][]`	`topicTermCount` For each topic, the number of occurrences assigned to each term.
`protected int[]`	`topicTermSum` The number of term occurrences assigned to each topic.

Fields inherited from class gov.sandia.cognition.learning.algorithm.AbstractAnytimeBatchLearner
data, keepGoing

Fields inherited from class gov.sandia.cognition.algorithm.AbstractAnytimeAlgorithm
maxIterations

Fields inherited from class gov.sandia.cognition.algorithm.AbstractIterativeAlgorithm
DEFAULT_ITERATION, iteration

Constructor Summary

Constructors
Constructor and Description
`LatentDirichletAllocationVectorGibbsSampler()` Creates a new `LatentDirichletAllocationVectorGibbsSampler` with default parameters.
`LatentDirichletAllocationVectorGibbsSampler(int topicCount, double alpha, double beta, int maxIterations, int burnInIterations, int iterationsPerSample, java.util.Random random)` Creates a new `LatentDirichletAllocationVectorGibbsSampler` with the given parameters.

Method Summary

Modifier and Type	Method and Description
`protected void`	`cleanupAlgorithm()` Called to clean up the learning algorithm's state after learning has finished.
`double`	`getAlpha()` Gets the alpha parameter controlling the Dirichlet distribution for the document-topic probabilities.
`double`	`getBeta()` Gets the beta parameter controlling the Dirichlet distribution for the topic-term probabilities.
`int`	`getBurnInIterations()` Gets the number of burn-in iterations for the Markov Chain Monte Carlo algorithm to run before sampling begins.
`int`	`getDocumentCount()` Gets the number of documents in the dataset.
`int`	`getIterationsPerSample()` Gets the number of iterations of the Markov Chain Monte Carlo algorithm between samples (after the burn-in iterations).
`java.util.Random`	`getRandom()` Gets the random number generator used by this object.
`LatentDirichletAllocationVectorGibbsSampler.Result`	`getResult()` Gets the current result of the algorithm.
`int`	`getTermCount()` Gets the number of terms in the dataset.
`int`	`getTopicCount()` Gets the number of topics (k) created by the topic model.
`protected boolean`	`initializeAlgorithm()` Called to initialize the learning algorithm's state based on the data that is stored in the data field.
`protected void`	`readParameters()` Reads the current set of parameters.
`protected int`	`sampleTopic(int document, int term, double[] topicCumulativeProportions)` Samples a topic for a given document and term.
`void`	`setAlpha(double alpha)` Sets the alpha parameter controlling the Dirichlet distribution for the document-topic probabilities.
`void`	`setBeta(double beta)` Sets the beta parameter controlling the Dirichlet distribution for the topic-term probabilities.
`void`	`setBurnInIterations(int burnInIterations)` Sets the number of burn-in iterations for the Markov Chain Monte Carlo algorithm to run before sampling begins.
`void`	`setIterationsPerSample(int iterationsPerSample)` Sets the number of iterations of the Markov Chain Monte Carlo algorithm between samples (after the burn-in iterations).
`void`	`setRandom(java.util.Random random)` Sets the random number generator used by this object.
`void`	`setTopicCount(int topicCount)` Sets the number of topics (k) created by the topic model.
`protected boolean`	`step()` Called to take a single step of the learning algorithm.

Methods inherited from class gov.sandia.cognition.learning.algorithm.AbstractAnytimeBatchLearner
clone, getData, getKeepGoing, learn, setData, setKeepGoing, stop

Methods inherited from class gov.sandia.cognition.algorithm.AbstractAnytimeAlgorithm
getMaxIterations, isResultValid, setMaxIterations

Methods inherited from class gov.sandia.cognition.algorithm.AbstractIterativeAlgorithm
addIterativeAlgorithmListener, fireAlgorithmEnded, fireAlgorithmStarted, fireStepEnded, fireStepStarted, getIteration, getListeners, removeIterativeAlgorithmListener, setIteration, setListeners

Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface gov.sandia.cognition.algorithm.AnytimeAlgorithm
getMaxIterations, setMaxIterations

Methods inherited from interface gov.sandia.cognition.algorithm.IterativeAlgorithm
addIterativeAlgorithmListener, getIteration, removeIterativeAlgorithmListener

Methods inherited from interface gov.sandia.cognition.algorithm.StoppableAlgorithm
isResultValid

- Field Detail
  - DEFAULT_TOPIC_COUNT
```
public static final int DEFAULT_TOPIC_COUNT
```
    The default topic count is 10.
    
    See Also:
    
    Constant Field Values
  - DEFAULT_ALPHA
```
public static final double DEFAULT_ALPHA
```
    The default value of alpha is 5.0.
    
    See Also:
    
    Constant Field Values
  - DEFAULT_BETA
```
public static final double DEFAULT_BETA
```
    The default value of beta is 0.5.
    
    See Also:
    
    Constant Field Values
  - DEFAULT_MAX_ITERATIONS
```
public static final int DEFAULT_MAX_ITERATIONS
```
    The default maximum number of iterations is 10000.
    
    See Also:
    
    Constant Field Values
  - DEFAULT_BURN_IN_ITERATIONS
```
public static final int DEFAULT_BURN_IN_ITERATIONS
```
    The default number of burn-in iterations is 2000.
    
    See Also:
    
    Constant Field Values
  - DEFAULT_ITERATIONS_PER_SAMPLE
```
public static final int DEFAULT_ITERATIONS_PER_SAMPLE
```
    The default number of iterations per sample is 100.
    
    See Also:
    
    Constant Field Values
  - topicCount
```
protected int topicCount
```
    The number of topics for the algorithm to create.
  - alpha
```
protected double alpha
```
    The alpha parameter controlling the Dirichlet distribution for the document-topic probabilities. It acts as a prior weight assigned to the document-topic counts.
  - beta
```
protected double beta
```
    The beta parameter controlling the Dirichlet distribution for the topic-term probabilities. It acts as a prior weight assigned to the topic-term counts.
  - burnInIterations
```
protected int burnInIterations
```
    The number of burn-in iterations for the Markov Chain Monte Carlo algorithm to run before sampling begins.
  - iterationsPerSample
```
protected int iterationsPerSample
```
    The number of iterations of the Markov Chain Monte Carlo algorithm between samples (after the burn-in iterations).
  - random
```
protected java.util.Random random
```
    The random number generator to use.
  - documentCount
```
protected transient int documentCount
```
    The number of documents in the dataset.
  - termCount
```
protected transient int termCount
```
    The number of terms in the dataset.
  - documentTopicCount
```
protected transient int[][] documentTopicCount
```
    For each document, the number of terms assigned to each topic. Thus, the first index is a document index and the second is a topic index.
  - documentTopicSum
```
protected transient int[] documentTopicSum
```
    The number of term occurrences in each document.
  - topicTermCount
```
protected transient int[][] topicTermCount
```
    For each topic, the number of occurrences assigned to each term. Thus, the first index is a topic index and the second is a term index.
  - topicTermSum
```
protected transient int[] topicTermSum
```
    The number of term occurrences assigned to each topic.
  - occurrenceTopicAssignments
```
protected transient int[] occurrenceTopicAssignments
```
    The assignments of term occurrences to topics.
  - documentTermPairsCounts
```
protected transient int[] documentTermPairsCounts
```
    The number of unique terms in each document.
  - documentTerms
```
protected transient int[] documentTerms
```
    For each unique term (unique per document), the term ID it maps to.
  - documentTermCounts
```
protected transient int[] documentTermCounts
```
    For each unique term (unique per document), the number of times that term occurs in the document.
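
    The three fields `documentTermPairsCounts`, `documentTerms`, and `documentTermCounts` read like a flattened sparse encoding of the input count vectors. The sketch below shows one way such a layout could be built from dense counts. It is an illustration inferred from the field descriptions, not the library's code; the class `SparseDocumentLayout` and its constructor are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of Cognitive Foundry. */
final class SparseDocumentLayout {
    /** Number of unique terms in each document. */
    final int[] documentTermPairsCounts;
    /** Term ID of each (document, unique term) pair, documents concatenated in order. */
    final int[] documentTerms;
    /** Occurrence count of each (document, unique term) pair, aligned with documentTerms. */
    final int[] documentTermCounts;

    /** Assumption: denseCounts[d][t] is the count of term t in document d. */
    SparseDocumentLayout(int[][] denseCounts) {
        List<Integer> terms = new ArrayList<>();
        List<Integer> counts = new ArrayList<>();
        documentTermPairsCounts = new int[denseCounts.length];
        for (int d = 0; d < denseCounts.length; d++) {
            for (int t = 0; t < denseCounts[d].length; t++) {
                if (denseCounts[d][t] > 0) {
                    // Only terms that actually occur get a pair entry.
                    terms.add(t);
                    counts.add(denseCounts[d][t]);
                    documentTermPairsCounts[d]++;
                }
            }
        }
        documentTerms = terms.stream().mapToInt(Integer::intValue).toArray();
        documentTermCounts = counts.stream().mapToInt(Integer::intValue).toArray();
    }
}
```

    Under this assumed layout, a sampler can walk all documents with a single running offset into `documentTerms` and `documentTermCounts`, advancing by `documentTermPairsCounts[d]` entries per document.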
  - topicCumulativeProportions
```
protected transient double[] topicCumulativeProportions
```
    A workspace array for the sampling function, kept as a field so that it does not have to be recreated on every call.
  - sampleCount
```
protected transient int sampleCount
```
    The number of model parameter samples that have been made.
  - result
```
protected transient LatentDirichletAllocationVectorGibbsSampler.Result result
```
    The result probabilities. Note that if multiple samples are taken, this will be a sum of the probabilities for the different samples until the algorithm is done and they are turned into an average.
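
    The sum-then-average behavior described for `result` can be pictured with a small accumulator. This is a minimal sketch of the idea only; `SampleAverager` and its methods are hypothetical names and do not appear in Cognitive Foundry.

```java
/** Hypothetical accumulator for illustration; not part of Cognitive Foundry. */
final class SampleAverager {
    private final double[][] sums;
    private int sampleCount = 0;

    SampleAverager(int rows, int columns) {
        sums = new double[rows][columns];
    }

    /** Adds one sample of probabilities to the running sum. */
    void addSample(double[][] probabilities) {
        for (int i = 0; i < sums.length; i++) {
            for (int j = 0; j < sums[i].length; j++) {
                sums[i][j] += probabilities[i][j];
            }
        }
        sampleCount++;
    }

    /** Returns the element-wise average over all samples added so far. */
    double[][] average() {
        if (sampleCount == 0) {
            throw new IllegalStateException("No samples have been added.");
        }
        double[][] averaged = new double[sums.length][];
        for (int i = 0; i < sums.length; i++) {
            averaged[i] = new double[sums[i].length];
            for (int j = 0; j < sums[i].length; j++) {
                averaged[i][j] = sums[i][j] / sampleCount;
            }
        }
        return averaged;
    }
}
```

    Keeping only a running sum avoids storing every sample, at the cost that intermediate values are not valid probabilities until they are divided by the sample count.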
- Constructor Detail
  - LatentDirichletAllocationVectorGibbsSampler
```
public LatentDirichletAllocationVectorGibbsSampler()
```
    Creates a new LatentDirichletAllocationVectorGibbsSampler with default parameters.
  - LatentDirichletAllocationVectorGibbsSampler
```
public LatentDirichletAllocationVectorGibbsSampler(int topicCount,
                                                   double alpha,
                                                   double beta,
                                                   int maxIterations,
                                                   int burnInIterations,
                                                   int iterationsPerSample,
                                                   java.util.Random random)
```
    Creates a new LatentDirichletAllocationVectorGibbsSampler with the given parameters.
    
    Parameters:
    
    topicCount - The number of topics for the algorithm to create. Must be positive.
    
    alpha - The alpha parameter controlling the Dirichlet distribution for the document-topic probabilities. It acts as a prior weight assigned to the document-topic counts. Must be positive.
    
    beta - The beta parameter controlling the Dirichlet distribution for the topic-term probabilities. It acts as a prior weight assigned to the topic-term counts. Must be positive.
    
    maxIterations - The maximum number of iterations to run for. Must be positive.
    
    burnInIterations - The number of burn-in iterations for the Markov Chain Monte Carlo algorithm to run before sampling begins.
    
    iterationsPerSample - The number of iterations of the Markov Chain Monte Carlo algorithm between samples (after the burn-in iterations).
    
    random - The random number generator to use.
- Method Detail
  - initializeAlgorithm
```
protected boolean initializeAlgorithm()
```
    Description copied from class: AbstractAnytimeBatchLearner
    
    Called to initialize the learning algorithm's state based on the data that is stored in the data field. The return value indicates if the algorithm can be run or not based on the initialization.
    
    Specified by:
    
    initializeAlgorithm in class AbstractAnytimeBatchLearner<java.util.Collection<? extends Vectorizable>,LatentDirichletAllocationVectorGibbsSampler.Result>
    
    Returns:
    
    True if the learning algorithm can be run and false if it cannot.
  - step
```
protected boolean step()
```
    Description copied from class: AbstractAnytimeBatchLearner
    
    Called to take a single step of the learning algorithm.
    
    Specified by:
    
    step in class AbstractAnytimeBatchLearner<java.util.Collection<? extends Vectorizable>,LatentDirichletAllocationVectorGibbsSampler.Result>
    
    Returns:
    
    True if another step can be taken and false if the algorithm should halt.
  - sampleTopic
```
protected int sampleTopic(int document,
                          int term,
                          double[] topicCumulativeProportions)
```
    Samples a topic for a given document and term.
    
    Parameters:
    
    document - The document index.
    
    term - The term index.
    
    topicCumulativeProportions - The array to use to store the proportions in.
    
    Returns:
    
    A topic index sampled from the topic probabilities of the given document and term.
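
    This page does not show how the proportions are computed. A common choice in collapsed Gibbs sampling for LDA (as in the Heinrich report cited above) weights topic k by (n_dk + alpha) * (n_kt + beta) / (n_k + V * beta), where the counts exclude the occurrence being resampled. The sketch below assumes that formula and draws a topic by inverting the cumulative weights. `TopicSamplingSketch` is a hypothetical stand-alone class; its extra parameters stand in for the fields the real method would read.

```java
import java.util.Random;

/** Hypothetical stand-alone sketch; not the Cognitive Foundry implementation. */
final class TopicSamplingSketch {
    /**
     * Assumption: all counts already exclude the occurrence being resampled.
     *
     * @param documentTopicCount counts per topic for the current document
     * @param topicTermCount counts indexed by [topic][term]
     * @param topicTermSum total occurrences assigned to each topic
     * @param term the term index
     * @param termCount the vocabulary size V
     * @param topicCumulativeProportions workspace with one slot per topic
     */
    static int sampleTopic(int[] documentTopicCount, int[][] topicTermCount,
            int[] topicTermSum, int term, int termCount, double alpha,
            double beta, double[] topicCumulativeProportions, Random random) {
        int topicCount = topicCumulativeProportions.length;
        double total = 0.0;
        for (int k = 0; k < topicCount; k++) {
            // Unnormalized conditional weight of topic k.
            double weight = (documentTopicCount[k] + alpha)
                * (topicTermCount[k][term] + beta)
                / (topicTermSum[k] + termCount * beta);
            total += weight;
            topicCumulativeProportions[k] = total;
        }
        // Pick the first topic whose cumulative weight exceeds the threshold.
        double threshold = random.nextDouble() * total;
        for (int k = 0; k < topicCount - 1; k++) {
            if (threshold < topicCumulativeProportions[k]) {
                return k;
            }
        }
        return topicCount - 1;
    }
}
```

    Storing cumulative rather than per-topic weights means the draw needs no separate normalization pass, which is one reason a reusable workspace array is convenient here.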
  - cleanupAlgorithm
```
protected void cleanupAlgorithm()
```
    Description copied from class: AbstractAnytimeBatchLearner
    
    Called to clean up the learning algorithm's state after learning has finished.
    
    Specified by:
    
    cleanupAlgorithm in class AbstractAnytimeBatchLearner<java.util.Collection<? extends Vectorizable>,LatentDirichletAllocationVectorGibbsSampler.Result>
  - readParameters
```
protected void readParameters()
```
    Reads the current set of parameters.
  - getResult
```
public LatentDirichletAllocationVectorGibbsSampler.Result getResult()
```
    Description copied from interface: AnytimeAlgorithm
    
    Gets the current result of the algorithm.
    
    Specified by:
    
    getResult in interface AnytimeAlgorithm<LatentDirichletAllocationVectorGibbsSampler.Result>
    
    Returns:
    
    Current result of the algorithm.
  - getTopicCount
```
public int getTopicCount()
```
    Gets the number of topics (k) created by the topic model.
    
    Returns:
    
    The number of topics created by the topic model. Must be greater than zero.
  - setTopicCount
```
public void setTopicCount(int topicCount)
```
    Sets the number of topics (k) created by the topic model.
    
    Parameters:
    
    topicCount - The number of topics created by the topic model. Must be greater than zero.
  - getAlpha
```
public double getAlpha()
```
    Gets the alpha parameter controlling the Dirichlet distribution for the document-topic probabilities. It acts as a prior weight assigned to the document-topic counts.
    
    Returns:
    
    The alpha parameter.
  - setAlpha
```
public void setAlpha(double alpha)
```
    Sets the alpha parameter controlling the Dirichlet distribution for the document-topic probabilities. It acts as a prior weight assigned to the document-topic counts.
    
    Parameters:
    
    alpha - The alpha parameter. Must be positive.
  - getBeta
```
public double getBeta()
```
    Gets the beta parameter controlling the Dirichlet distribution for the topic-term probabilities. It acts as a prior weight assigned to the topic-term counts.
    
    Returns:
    
    The beta parameter.
  - setBeta
```
public void setBeta(double beta)
```
    Sets the beta parameter controlling the Dirichlet distribution for the topic-term probabilities. It acts as a prior weight assigned to the topic-term counts.
    
    Parameters:
    
    beta - The beta parameter. Must be positive.
  - getBurnInIterations
```
public int getBurnInIterations()
```
    Gets the number of burn-in iterations for the Markov Chain Monte Carlo algorithm to run before sampling begins. Note that if this number is greater than the maximum number of iterations, the algorithm will only run up to the maximum number of iterations and will generate only one parameter sample.
    
    Returns:
    
    The number of burn-in iterations. Must be non-negative.
  - setBurnInIterations
```
public void setBurnInIterations(int burnInIterations)
```
    Sets the number of burn-in iterations for the Markov Chain Monte Carlo algorithm to run before sampling begins. Note that if this number is greater than the maximum number of iterations, the algorithm will only run up to the maximum number of iterations and will generate only one parameter sample.
    
    Parameters:
    
    burnInIterations - The number of burn-in iterations. Must be non-negative.
  - getIterationsPerSample
```
public int getIterationsPerSample()
```
    Gets the number of iterations of the Markov Chain Monte Carlo algorithm between samples (after the burn-in iterations).
    
    Returns:
    
    The number of iterations between samples.
  - setIterationsPerSample
```
public void setIterationsPerSample(int iterationsPerSample)
```
    Sets the number of iterations of the Markov Chain Monte Carlo algorithm between samples (after the burn-in iterations).
    
    Parameters:
    
    iterationsPerSample - The number of iterations between samples. Must be positive.
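
    To get a feel for how the maximum iterations, burn-in iterations, and iterations per sample interact, the sketch below counts samples under one plausible schedule: a sample is taken every `iterationsPerSample` iterations once burn-in is over, and at least one sample is always produced. The exact schedule used by this class is not stated on this page, so treat the rule and the hypothetical class `SampleSchedule` as assumptions.

```java
/** Hypothetical illustration of one possible sampling schedule. */
final class SampleSchedule {
    /** Counts samples under the assumed schedule; not the library's logic. */
    static int countSamples(int maxIterations, int burnInIterations,
            int iterationsPerSample) {
        int samples = 0;
        for (int iteration = 1; iteration <= maxIterations; iteration++) {
            boolean pastBurnIn = iteration > burnInIterations;
            // Assumed rule: sample at fixed intervals after burn-in ends.
            if (pastBurnIn
                && (iteration - burnInIterations) % iterationsPerSample == 0) {
                samples++;
            }
        }
        // Mirrors the documented note that one sample is still generated
        // when burn-in exceeds the maximum number of iterations.
        return Math.max(samples, 1);
    }
}
```

    Under this assumed schedule, the default settings (10000 maximum iterations, 2000 burn-in iterations, 100 iterations per sample) would yield `countSamples(10000, 2000, 100) == 80`.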
  - getRandom
```
public java.util.Random getRandom()
```
    Description copied from interface: Randomized
    
    Gets the random number generator used by this object.
    
    Specified by:
    
    getRandom in interface Randomized
    
    Returns:
    
    The random number generator used by this object.
  - setRandom
```
public void setRandom(java.util.Random random)
```
    Description copied from interface: Randomized
    
    Sets the random number generator used by this object.
    
    Specified by:
    
    setRandom in interface Randomized
    
    Parameters:
    
    random - The random number generator for this object to use.
  - getDocumentCount
```
public int getDocumentCount()
```
    Gets the number of documents in the dataset.
    
    Returns:
    
    The number of documents.
  - getTermCount
```
public int getTermCount()
```
    Gets the number of terms in the dataset.
    
    Returns:
    
    The number of terms.
