@PublicationReference(author={"David M. Blei","Andrew Y. Ng","Michael I. Jordan"},title="Latent Dirichlet Allocation",year=2003,type=Journal,publication="Journal of Machine Learning Research",pages={993,1022},url="http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf")
@PublicationReference(author="Gregor Heinrich",title="Parameter estimation for text analysis",year=2009,type=TechnicalReport,url="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.149.1327&rep=rep1&type=pdf")
public class LatentDirichletAllocationVectorGibbsSampler
extends AbstractAnytimeBatchLearner<java.util.Collection<? extends Vectorizable>,LatentDirichletAllocationVectorGibbsSampler.Result>
implements Randomized
Modifier and Type | Class and Description |
---|---|
static class |
LatentDirichletAllocationVectorGibbsSampler.Result
Represents the result of performing Latent Dirichlet Allocation.
|
Modifier and Type | Field and Description |
---|---|
protected double |
alpha
The alpha parameter controlling the Dirichlet distribution for the
document-topic probabilities.
|
protected double |
beta
The beta parameter controlling the Dirichlet distribution for the
topic-term probabilities.
|
protected int |
burnInIterations
The number of burn-in iterations for the Markov Chain Monte Carlo
algorithm to run before sampling begins.
|
static double |
DEFAULT_ALPHA
The default value of alpha is 5.0.
|
static double |
DEFAULT_BETA
The default value of beta is 0.5.
|
static int |
DEFAULT_BURN_IN_ITERATIONS
The default number of burn-in iterations is 2000.
|
static int |
DEFAULT_ITERATIONS_PER_SAMPLE
The default number of iterations per sample is 100.
|
static int |
DEFAULT_MAX_ITERATIONS
The default maximum number of iterations is 10000.
|
static int |
DEFAULT_TOPIC_COUNT
The default topic count is 10.
|
protected int |
documentCount
The number of documents in the dataset.
|
protected int[] |
documentTermCounts
For each unique term (unique per document), the number of times that term
occurs in the document.
|
protected int[] |
documentTermPairsCounts
The number of unique terms in each document.
|
protected int[] |
documentTerms
For each unique term (unique per document), the term id it maps to.
|
protected int[][] |
documentTopicCount
For each document, the number of terms assigned to each topic.
|
protected int[] |
documentTopicSum
The number of term occurrences in each document.
|
protected int |
iterationsPerSample
The number of iterations of the Markov Chain Monte Carlo algorithm
between samples (after the burn-in iterations).
|
protected int[] |
occurrenceTopicAssignments
The assignments of term occurrences to topics.
|
protected java.util.Random |
random
The random number generator to use.
|
protected LatentDirichletAllocationVectorGibbsSampler.Result |
result
The result probabilities.
|
protected int |
sampleCount
The number of model parameter samples that have been made.
|
protected int |
termCount
The number of terms in the dataset.
|
protected int |
topicCount
The number of topics for the algorithm to create.
|
protected double[] |
topicCumulativeProportions
A workspace array for cumulative topic proportions, allocated once to
avoid recreating it inside the sampling function.
|
protected int[][] |
topicTermCount
For each topic, the number of occurrences assigned to each term.
|
protected int[] |
topicTermSum
The number of term occurrences assigned to each topic.
|
Inherited fields: data, keepGoing, maxIterations, DEFAULT_ITERATION, iteration
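The three parallel arrays documentTermPairsCounts, documentTerms, and documentTermCounts read like a flattened sparse layout of (document, term) pairs. The following is a minimal sketch of such a layout, assuming each document arrives as a dense array of term counts. The class SparseLayoutSketch and its field names are hypothetical and are not taken from the library; the actual implementation may organize its data differently.

```java
import java.util.Arrays;

/** Hypothetical sketch of a flattened (document, term) pair layout. */
final class SparseLayoutSketch {
    final int[] pairsPerDocument; // analogous to documentTermPairsCounts
    final int[] pairTerms;        // analogous to documentTerms
    final int[] pairCounts;       // analogous to documentTermCounts

    SparseLayoutSketch(int[][] denseCounts) {
        // First pass: count the nonzero terms in each document.
        int totalPairs = 0;
        pairsPerDocument = new int[denseCounts.length];
        for (int d = 0; d < denseCounts.length; d++) {
            for (int count : denseCounts[d]) {
                if (count > 0) {
                    pairsPerDocument[d]++;
                    totalPairs++;
                }
            }
        }

        // Second pass: store the term id and count of every nonzero entry,
        // document by document, in one flat array each.
        pairTerms = new int[totalPairs];
        pairCounts = new int[totalPairs];
        int next = 0;
        for (int[] document : denseCounts) {
            for (int term = 0; term < document.length; term++) {
                if (document[term] > 0) {
                    pairTerms[next] = term;
                    pairCounts[next] = document[term];
                    next++;
                }
            }
        }
    }

    public static void main(String[] args) {
        int[][] dense = { {2, 0, 1}, {0, 3, 0} };
        SparseLayoutSketch layout = new SparseLayoutSketch(dense);
        System.out.println(Arrays.toString(layout.pairsPerDocument)); // [2, 1]
        System.out.println(Arrays.toString(layout.pairTerms));        // [0, 2, 1]
        System.out.println(Arrays.toString(layout.pairCounts));       // [2, 1, 3]
    }
}
```

A flat layout like this lets a sampler walk every document's nonzero terms with a single running index instead of scanning dense vectors.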
Constructor and Description |
---|
LatentDirichletAllocationVectorGibbsSampler()
Creates a new
LatentDirichletAllocationVectorGibbsSampler with
default parameters. |
LatentDirichletAllocationVectorGibbsSampler(int topicCount,
double alpha,
double beta,
int maxIterations,
int burnInIterations,
int iterationsPerSample,
java.util.Random random)
Creates a new
LatentDirichletAllocationVectorGibbsSampler with
the given parameters. |
Modifier and Type | Method and Description |
---|---|
protected void |
cleanupAlgorithm()
Called to clean up the learning algorithm's state after learning has
finished.
|
double |
getAlpha()
Gets the alpha parameter controlling the Dirichlet distribution for the
document-topic probabilities.
|
double |
getBeta()
Gets the beta parameter controlling the Dirichlet distribution for the
topic-term probabilities.
|
int |
getBurnInIterations()
Gets the number of burn-in iterations for the Markov Chain Monte Carlo
algorithm to run before sampling begins.
|
int |
getDocumentCount()
Gets the number of documents in the dataset.
|
int |
getIterationsPerSample()
Gets the number of iterations of the Markov Chain Monte Carlo algorithm
between samples (after the burn-in iterations).
|
java.util.Random |
getRandom()
Gets the random number generator used by this object.
|
LatentDirichletAllocationVectorGibbsSampler.Result |
getResult()
Gets the current result of the algorithm.
|
int |
getTermCount()
Gets the number of terms in the dataset.
|
int |
getTopicCount()
Gets the number of topics (k) created by the topic model.
|
protected boolean |
initializeAlgorithm()
Called to initialize the learning algorithm's state based on the
data that is stored in the data field.
|
protected void |
readParameters()
Reads the current set of parameters.
|
protected int |
sampleTopic(int document,
int term,
double[] topicCumulativeProportions)
Samples a topic for a given document and term.
|
void |
setAlpha(double alpha)
Sets the alpha parameter controlling the Dirichlet distribution for the
document-topic probabilities.
|
void |
setBeta(double beta)
Sets the beta parameter controlling the Dirichlet distribution for the
topic-term probabilities.
|
void |
setBurnInIterations(int burnInIterations)
Sets the number of burn-in iterations for the Markov Chain Monte Carlo
algorithm to run before sampling begins.
|
void |
setIterationsPerSample(int iterationsPerSample)
Sets the number of iterations of the Markov Chain Monte Carlo algorithm
between samples (after the burn-in iterations).
|
void |
setRandom(java.util.Random random)
Sets the random number generator used by this object.
|
void |
setTopicCount(int topicCount)
Sets the number of topics (k) created by the topic model.
|
protected boolean |
step()
Called to take a single step of the learning algorithm.
|
Inherited methods:
clone, getData, getKeepGoing, learn, setData, setKeepGoing, stop
getMaxIterations, isResultValid, setMaxIterations
addIterativeAlgorithmListener, fireAlgorithmEnded, fireAlgorithmStarted, fireStepEnded, fireStepStarted, getIteration, getListeners, removeIterativeAlgorithmListener, setIteration, setListeners
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
public static final int DEFAULT_TOPIC_COUNT
public static final double DEFAULT_ALPHA
public static final double DEFAULT_BETA
public static final int DEFAULT_MAX_ITERATIONS
public static final int DEFAULT_BURN_IN_ITERATIONS
public static final int DEFAULT_ITERATIONS_PER_SAMPLE
protected int topicCount
protected double alpha
protected double beta
protected int burnInIterations
protected int iterationsPerSample
protected java.util.Random random
protected transient int documentCount
protected transient int termCount
protected transient int[][] documentTopicCount
protected transient int[] documentTopicSum
protected transient int[][] topicTermCount
protected transient int[] topicTermSum
protected transient int[] occurrenceTopicAssignments
protected transient int[] documentTermPairsCounts
protected transient int[] documentTerms
protected transient int[] documentTermCounts
protected transient double[] topicCumulativeProportions
protected transient int sampleCount
protected transient LatentDirichletAllocationVectorGibbsSampler.Result result
public LatentDirichletAllocationVectorGibbsSampler()
Creates a new LatentDirichletAllocationVectorGibbsSampler with default parameters.
public LatentDirichletAllocationVectorGibbsSampler(int topicCount, double alpha, double beta, int maxIterations, int burnInIterations, int iterationsPerSample, java.util.Random random)
Creates a new LatentDirichletAllocationVectorGibbsSampler with the given parameters.
topicCount - The number of topics for the algorithm to create. Must be positive.
alpha - The alpha parameter controlling the Dirichlet distribution for the document-topic probabilities. It acts as a prior weight assigned to the document-topic counts. Must be positive.
beta - The beta parameter controlling the Dirichlet distribution for the topic-term probabilities. It acts as a prior weight assigned to the topic-term counts.
maxIterations - The maximum number of iterations to run for. Must be positive.
burnInIterations - The number of burn-in iterations for the Markov Chain Monte Carlo algorithm to run before sampling begins.
iterationsPerSample - The number of iterations of the Markov Chain Monte Carlo algorithm between samples (after the burn-in iterations).
random - The random number generator to use.
protected boolean initializeAlgorithm()
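Description copied from class AbstractAnytimeBatchLearner: Called to initialize the learning algorithm's state based on the data that is stored in the data field.
initializeAlgorithm in class AbstractAnytimeBatchLearner<java.util.Collection<? extends Vectorizable>,LatentDirichletAllocationVectorGibbsSampler.Result>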
protected boolean step()
Description copied from class AbstractAnytimeBatchLearner: Called to take a single step of the learning algorithm.
step in class AbstractAnytimeBatchLearner<java.util.Collection<? extends Vectorizable>,LatentDirichletAllocationVectorGibbsSampler.Result>
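The interaction of maxIterations, burnInIterations, and iterationsPerSample can be illustrated with a small schedule calculation. This is a sketch under a stated assumption: iterations are numbered from 1, and parameters are read every iterationsPerSample iterations once the burn-in iterations have completed. The exact rule used inside step() is not given on this page, so the real sampler may differ by an iteration at either end. The class SamplingScheduleSketch and its methods are hypothetical.

```java
/** Hypothetical sketch of a burn-in and sampling schedule. */
final class SamplingScheduleSketch {

    /**
     * Assumed rule: a sample is taken on every iterationsPerSample-th
     * iteration after the burn-in period (iterations are 1-based).
     */
    static boolean isSampleIteration(
        int iteration, int burnInIterations, int iterationsPerSample) {
        return iteration > burnInIterations
            && (iteration - burnInIterations) % iterationsPerSample == 0;
    }

    /** Counts how many samples the assumed rule yields over a full run. */
    static int countSamples(
        int maxIterations, int burnInIterations, int iterationsPerSample) {
        int samples = 0;
        for (int i = 1; i <= maxIterations; i++) {
            if (isSampleIteration(i, burnInIterations, iterationsPerSample)) {
                samples++;
            }
        }
        return samples;
    }

    public static void main(String[] args) {
        // The documented defaults: 10000 iterations, 2000 burn-in, 100 per sample.
        System.out.println(countSamples(10000, 2000, 100)); // 80
    }
}
```

Under this assumed rule, the documented defaults give 80 parameter samples, taken at iterations 2100, 2200, and so on up to 10000.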
protected int sampleTopic(int document, int term, double[] topicCumulativeProportions)
Samples a topic for a given document and term.
document - The document index.
term - The term index.
topicCumulativeProportions - The array to use to store the proportions in.
protected void cleanupAlgorithm()
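Description copied from class AbstractAnytimeBatchLearner: Called to clean up the learning algorithm's state after learning has finished.
cleanupAlgorithm in class AbstractAnytimeBatchLearner<java.util.Collection<? extends Vectorizable>,LatentDirichletAllocationVectorGibbsSampler.Result>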
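The sampleTopic method documented above takes a workspace array for cumulative proportions. The sketch below shows one common way such a draw can work in a collapsed Gibbs sampler, assuming the standard conditional weight (document-topic count + alpha) * (topic-term count + beta) / (topic sum + termCount * beta), with the current occurrence already removed from the counts. The class TopicSamplingSketch, its method signature, and its parameter names are hypothetical; this is not the library's implementation.

```java
import java.util.Random;

/** Hypothetical sketch of drawing one topic from cumulative weights. */
final class TopicSamplingSketch {

    /**
     * Draws a topic index for one term occurrence.
     * The count arrays are indexed by topic and are assumed to exclude
     * the occurrence that is being resampled.
     */
    static int sampleTopic(
        int[] documentTopicCounts, int[] termTopicCounts, int[] topicSums,
        int termCount, double alpha, double beta,
        double[] cumulative, Random random) {
        // Fill the workspace with running totals of the unnormalized weights.
        double total = 0.0;
        for (int k = 0; k < cumulative.length; k++) {
            double weight = (documentTopicCounts[k] + alpha)
                * (termTopicCounts[k] + beta)
                / (topicSums[k] + termCount * beta);
            total += weight;
            cumulative[k] = total;
        }

        // Pick the first topic whose cumulative weight exceeds the threshold.
        double threshold = random.nextDouble() * total;
        for (int k = 0; k < cumulative.length; k++) {
            if (threshold < cumulative[k]) {
                return k;
            }
        }
        return cumulative.length - 1; // guards against rounding at the upper edge
    }

    public static void main(String[] args) {
        double[] workspace = new double[2];
        int topic = sampleTopic(
            new int[] {50, 0}, new int[] {20, 0}, new int[] {100, 100},
            10, 0.1, 0.1, workspace, new Random(42));
        System.out.println("sampled topic: " + topic);
    }
}
```

Reusing one workspace array across calls avoids an allocation per term occurrence, which matters because the draw runs once for every occurrence in every iteration.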
protected void readParameters()
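The alpha and beta parameters are described as prior weights added to the document-topic and topic-term counts. A minimal sketch of how counts can be turned into smoothed probabilities, and how successive samples can be averaged, follows. It assumes the usual estimate (count + prior) / (sum + size * prior) and a running mean over samples; whether readParameters() averages in exactly this way is an assumption, suggested only by the presence of the sampleCount field. The class ParameterEstimateSketch and its methods are hypothetical.

```java
/** Hypothetical sketch of reading smoothed parameters from counts. */
final class ParameterEstimateSketch {

    /** Converts counts to probabilities, adding priorWeight to every count. */
    static double[] smoothedProportions(int[] counts, double priorWeight) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        double denominator = sum + counts.length * priorWeight;
        double[] result = new double[counts.length];
        for (int i = 0; i < counts.length; i++) {
            result[i] = (counts[i] + priorWeight) / denominator;
        }
        return result;
    }

    /** Folds a new sample into a running mean; sampleCount includes the new sample. */
    static void accumulate(double[] runningMean, double[] sample, int sampleCount) {
        for (int i = 0; i < runningMean.length; i++) {
            runningMean[i] += (sample[i] - runningMean[i]) / sampleCount;
        }
    }

    public static void main(String[] args) {
        // One document with topic counts {3, 1, 0} and alpha = 0.5.
        double[] theta = smoothedProportions(new int[] {3, 1, 0}, 0.5);
        for (double p : theta) {
            System.out.println(p);
        }
    }
}
```

Because the prior weight is added to every count, a topic with zero assigned occurrences still receives a small nonzero probability.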
public LatentDirichletAllocationVectorGibbsSampler.Result getResult()
Description copied from interface AnytimeAlgorithm: Gets the current result of the algorithm.
getResult in interface AnytimeAlgorithm<LatentDirichletAllocationVectorGibbsSampler.Result>
public int getTopicCount()
public void setTopicCount(int topicCount)
topicCount - The number of topics created by the topic model. Must be greater than zero.
public double getAlpha()
public void setAlpha(double alpha)
alpha - The alpha parameter. Must be positive.
public double getBeta()
public void setBeta(double beta)
beta - The beta parameter. Must be positive.
public int getBurnInIterations()
public void setBurnInIterations(int burnInIterations)
burnInIterations - The number of burn-in iterations. Must be non-negative.
public int getIterationsPerSample()
public void setIterationsPerSample(int iterationsPerSample)
iterationsPerSample - The number of iterations between samples. Must be positive.
public java.util.Random getRandom()
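Description copied from interface Randomized: Gets the random number generator used by this object.
getRandom in interface Randomized
public void setRandom(java.util.Random random)
Description copied from interface Randomized: Sets the random number generator used by this object.
setRandom in interface Randomized
random - The random number generator for this object to use.
public int getDocumentCount()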
public int getTermCount()