DataType
- The type of the data to cluster. This is typically defined
by the divergence function used.ClusterType
- The type of Cluster
created by the algorithm.
This is typically defined by the cluster creator function used.@CodeReview(reviewer="Justin Basilico", date="2011-03-09", comments={"Should make do a greedy splitting prioritization.","Should make an interface for incremental cluster creation for this to use."}, changesNeeded=true) @PublicationReference(author={"Ying Zhao","George Karypis"}, title="Hierarchical Clustering Algorithms for Document Datasets", type=Journal, year=2005, publication="Data Mining and Knowledge Discovery", pages={141,168}, url="http://www.springerlink.com/index/jx3825j42x4333m5.pdf") public class PartitionalClusterer<DataType,ClusterType extends Cluster<DataType>> extends AbstractAnytimeBatchLearner<java.util.Collection<? extends DataType>,java.util.Collection<ClusterType>> implements BatchClusterer<DataType,ClusterType>, BatchHierarchicalClusterer<DataType,ClusterType>, Randomized, DivergenceFunctionContainer<ClusterType,DataType>
PartitionalClusterer
implements a partitional clustering
algorithm, which is a type of hierarchical clustering algorithm. Consider
using create(int)
or createSpherical(int)
to instantiate
this clusterer.
Partitional clustering works by creating one cluster for all elements in the collection and then repeatedly partitioning each cluster into two clusters until one of the stopping criteria is satisfied:
This implementation supports multiple methods for determining the distance
between two clusters. The methods can be supplied by providing a
WithinClusterDivergence
object or by providing a
ClusterDivergenceFunction
which is automatically wrapped by
WithinClusterDivergenceWrapper
into a
WithinClusterDivergence
.
Modifier and Type | Field and Description |
---|---|
protected WithinClusterDivergence<? super ClusterType,? super DataType> |
clusterDivergenceFunction
The divergence function used to find the distance between two clusters.
|
protected java.util.ArrayList<ClusterType> |
clusters
The current set of clusters all clusters created.
|
protected java.util.ArrayList<BinaryClusterHierarchyNode<DataType,ClusterType>> |
clustersHierarchy
The current set of hierarchical clusters created.
|
protected IncrementalClusterCreator<ClusterType,DataType> |
creator
The merger used to merge two clusters into one cluster.
|
static double |
DEFAULT_MAX_CRITERION_DECREASE
The default maximum decrease in the training criterion is 1.0.
|
static int |
DEFAULT_MAX_ITERATIONS
The default maximum number of iterations is 2147483647.
|
static int |
DEFAULT_MIN_CLUSTER_SIZE
The default minimum number of elements per cluster is 1.
|
static int |
DEFAULT_NUM_REQUESTED_CLUSTERS
The default number of requested clusters is 2147483647.
|
protected DivergenceFunction<? super ClusterType,? super DataType> |
divergenceFunction
An optional DivergenceFunction that is used to create a
WithinClusterDivergence function via a
WithinClusterDivergenceWrapper . |
protected double |
maxCriterionDecrease
The maximum decrease in training criterion allowed.
|
protected int |
minClusterSize
The minimum number of elements per cluster allowed.
|
protected int |
numRequestedClusters
The number of clusters requested.
|
protected java.util.Random |
random
The source of randomness used during initial partitioning.
|
protected double |
tolerance
Tolerance for determining when improvement has been made.
|
protected boolean |
useCachedClusters
Whether or not the current learning is using cached cluster results.
|
data, keepGoing
maxIterations
DEFAULT_ITERATION, iteration
Constructor and Description |
---|
PartitionalClusterer(int numRequestedClusters,
ClusterDivergenceFunction<ClusterType,DataType> divergenceFunction,
IncrementalClusterCreator<ClusterType,DataType> creator)
Creates a new partitional clusterer.
|
PartitionalClusterer(int numRequestedClusters,
WithinClusterDivergence<ClusterType,DataType> divergenceFunction,
IncrementalClusterCreator<ClusterType,DataType> creator)
Creates a new partitional clusterer.
|
Modifier and Type | Method and Description |
---|---|
protected void |
cleanupAlgorithm()
Called to clean up the learning algorithm's state after learning has
finished.
|
PartitionalClusterer<DataType,ClusterType> |
clone()
This makes public the clone method on the
Object class and
removes the exception that it throws. |
ClusterHierarchyNode<DataType,ClusterType> |
clusterHierarchically(java.util.Collection<? extends DataType> data)
Performs hierarchical clustering on the given elements.
|
static PartitionalClusterer<Vector,CentroidCluster<Vector>> |
create(int numRequestedClusters)
Create a partitional clusterer, using Euclidean distance and a vector
mean centroid cluster creator.
|
static PartitionalClusterer<Vectorizable,NormalizedCentroidCluster<Vectorizable>> |
createSpherical(int numRequestedClusters)
Create a spherical partitional clusterer, using Cosine distance and a
vector mean centroid cluster creator.
|
int |
getClusterCount()
Gets the total number of clusters created.
|
java.util.ArrayList<ClusterType> |
getClusters()
Gets all clusters created.
|
java.util.ArrayList<BinaryClusterHierarchyNode<DataType,ClusterType>> |
getClustersHierarchy()
Gets the hierarchy of clusters.
|
IncrementalClusterCreator<ClusterType,DataType> |
getCreator()
Gets the cluster creator.
|
DivergenceFunction<? super ClusterType,? super DataType> |
getDivergenceFunction()
Gets the stored metric between a cluster and a point.
|
double |
getMaxCriterionDecrease()
Returns the maximum decrease in the training criterion allowed following
a bisection.
|
int |
getMinClusterSize()
Returns the minimum number of elements per cluster to allow.
|
int |
getNumRequestedClusters()
Gets the current number of clusters requested.
|
java.util.Random |
getRandom()
Gets the random number generator used by this object.
|
java.util.ArrayList<ClusterType> |
getResult()
Gets the current result of the algorithm.
|
double |
getTolerance()
Gets the current tolerance value.
|
WithinClusterDivergence<? super ClusterType,? super DataType> |
getWithinClusterDivergenceFunction()
Gets the metric on clusters used for partitioning.
|
protected boolean |
initializeAlgorithm()
Called to initialize the learning algorithm's state based on the
data that is stored in the data field.
|
java.util.Collection<ClusterType> |
learnUsingCachedClusters(java.util.Collection<? extends DataType> data)
Perform clustering by extending the existing cluster hierarchy, if one
exists.
|
protected void |
setClusters(java.util.ArrayList<ClusterType> clusters)
Sets the clusters cache to the provided value.
|
protected void |
setClustersHierarchy(java.util.ArrayList<BinaryClusterHierarchyNode<DataType,ClusterType>> clustersHierarchy)
Sets the hierarchy of clusters.
|
void |
setCreator(IncrementalClusterCreator<ClusterType,DataType> creator)
Sets the cluster creator.
|
void |
setDivergenceFunction(DivergenceFunction<? super ClusterType,? super DataType> divergenceFunction)
Use a metric between a cluster and a point to update the metric on
clusters.
|
void |
setMaxCriterionDecrease(double maxCriterionDecrease)
Sets the maximum decrease in the training criterion allowed following a
bisection.
|
void |
setMinClusterSize(int minClusterSize)
Sets the minimum number of elements per cluster to allow.
|
void |
setNumRequestedClusters(int numRequestedClusters)
Sets the number of clusters requested.
|
void |
setRandom(java.util.Random random)
Sets the random number generator used by this object.
|
void |
setTolerance(double tolerance)
Sets the tolerance value.
|
void |
setWithinClusterDivergenceFunction(WithinClusterDivergence<? super ClusterType,? super DataType> clusterDivergenceFunction)
Sets the metric on clusters used for partitioning.
|
protected boolean |
step()
Called to take a single step of the learning algorithm.
|
getData, getKeepGoing, learn, setData, setKeepGoing, stop
getMaxIterations, isResultValid, setMaxIterations
addIterativeAlgorithmListener, fireAlgorithmEnded, fireAlgorithmStarted, fireStepEnded, fireStepStarted, getIteration, getListeners, removeIterativeAlgorithmListener, setIteration, setListeners
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
learn
getMaxIterations, setMaxIterations
addIterativeAlgorithmListener, getIteration, removeIterativeAlgorithmListener
isResultValid
public static final int DEFAULT_MIN_CLUSTER_SIZE
public static final double DEFAULT_MAX_CRITERION_DECREASE
public static final int DEFAULT_MAX_ITERATIONS
public static final int DEFAULT_NUM_REQUESTED_CLUSTERS
protected WithinClusterDivergence<? super ClusterType extends Cluster<DataType>,? super DataType> clusterDivergenceFunction
protected DivergenceFunction<? super ClusterType extends Cluster<DataType>,? super DataType> divergenceFunction
WithinClusterDivergence
function via a
WithinClusterDivergenceWrapper
.protected double tolerance
protected IncrementalClusterCreator<ClusterType extends Cluster<DataType>,DataType> creator
protected int minClusterSize
protected double maxCriterionDecrease
protected java.util.Random random
protected transient java.util.ArrayList<ClusterType extends Cluster<DataType>> clusters
protected transient java.util.ArrayList<BinaryClusterHierarchyNode<DataType,ClusterType extends Cluster<DataType>>> clustersHierarchy
protected int numRequestedClusters
protected boolean useCachedClusters
public PartitionalClusterer(int numRequestedClusters, ClusterDivergenceFunction<ClusterType,DataType> divergenceFunction, IncrementalClusterCreator<ClusterType,DataType> creator)
numRequestedClusters
- The number of clusters to create. Note that
fewer than the requested number of cluster can be returned due to the
algorithm stopping due to one of the other stopping criteria.divergenceFunction
- The distance metric between a cluster and a
point. The metric is wrapped by a WithinClusterDivergenceWrapper
.creator
- The method for creating clusterspublic PartitionalClusterer(int numRequestedClusters, WithinClusterDivergence<ClusterType,DataType> divergenceFunction, IncrementalClusterCreator<ClusterType,DataType> creator)
numRequestedClusters
- The number of clusters to create. Note that
fewer than the requested number of cluster can be returned due to the
algorithm stopping due to one of the other stopping criteria.divergenceFunction
- The distance metric between a cluster and a
point. The metric is wrapped by a WithinClusterDivergenceWrapper
.creator
- The method for creating clusterspublic static PartitionalClusterer<Vector,CentroidCluster<Vector>> create(int numRequestedClusters)
numRequestedClusters
- public static PartitionalClusterer<Vectorizable,NormalizedCentroidCluster<Vectorizable>> createSpherical(int numRequestedClusters)
numRequestedClusters
- public PartitionalClusterer<DataType,ClusterType> clone()
AbstractCloneableSerializable
Object
class and
removes the exception that it throws. Its default behavior is to
automatically create a clone of the exact type of object that the
clone is called on and to copy all primitives but to keep all references,
which means it is a shallow copy.
Extensions of this class may want to override this method (but call
super.clone()
to implement a "smart copy". That is, to target
the most common use case for creating a copy of the object. Because of
the default behavior being a shallow copy, extending classes only need
to handle fields that need to have a deeper copy (or those that need to
be reset). Some of the methods in ObjectUtil
may be helpful in
implementing a custom clone method.
Note: The contract of this method is that you must use
super.clone()
as the basis for your implementation.clone
in interface CloneableSerializable
clone
in class AbstractAnytimeBatchLearner<java.util.Collection<? extends DataType>,java.util.Collection<ClusterType extends Cluster<DataType>>>
public ClusterHierarchyNode<DataType,ClusterType> clusterHierarchically(java.util.Collection<? extends DataType> data)
BatchHierarchicalClusterer
clusterHierarchically
in interface BatchHierarchicalClusterer<DataType,ClusterType extends Cluster<DataType>>
data
- The elements to cluster.protected boolean initializeAlgorithm()
AbstractAnytimeBatchLearner
initializeAlgorithm
in class AbstractAnytimeBatchLearner<java.util.Collection<? extends DataType>,java.util.Collection<ClusterType extends Cluster<DataType>>>
protected boolean step()
AbstractAnytimeBatchLearner
step
in class AbstractAnytimeBatchLearner<java.util.Collection<? extends DataType>,java.util.Collection<ClusterType extends Cluster<DataType>>>
protected void cleanupAlgorithm()
AbstractAnytimeBatchLearner
cleanupAlgorithm
in class AbstractAnytimeBatchLearner<java.util.Collection<? extends DataType>,java.util.Collection<ClusterType extends Cluster<DataType>>>
public java.util.ArrayList<ClusterType> getResult()
AnytimeAlgorithm
getResult
in interface AnytimeAlgorithm<java.util.Collection<ClusterType extends Cluster<DataType>>>
public java.util.Collection<ClusterType> learnUsingCachedClusters(java.util.Collection<? extends DataType> data)
learn()
.data
- The data to cluster. If the existing clustering was created
using a data set other than data
, these results will likely be
useless.public int getClusterCount()
public WithinClusterDivergence<? super ClusterType,? super DataType> getWithinClusterDivergenceFunction()
public DivergenceFunction<? super ClusterType,? super DataType> getDivergenceFunction()
WithinClusterDivergenceWrapper
.getDivergenceFunction
in interface DivergenceFunctionContainer<ClusterType extends Cluster<DataType>,DataType>
DivergenceFunction
providing a distance between a
cluster and a point.public void setDivergenceFunction(DivergenceFunction<? super ClusterType,? super DataType> divergenceFunction)
divergenceFunction
- The metric between a point and a point used to
update the metric on clusters.public void setWithinClusterDivergenceFunction(WithinClusterDivergence<? super ClusterType,? super DataType> clusterDivergenceFunction)
clusterDivergenceFunction
- The metric on clusterspublic double getTolerance()
public void setTolerance(double tolerance)
tolerance
- The new tolerance to use.public IncrementalClusterCreator<ClusterType,DataType> getCreator()
public void setCreator(IncrementalClusterCreator<ClusterType,DataType> creator)
creator
- The creator for clusters.public java.util.Random getRandom()
Randomized
getRandom
in interface Randomized
public void setRandom(java.util.Random random)
Randomized
setRandom
in interface Randomized
random
- The random number generator for this object to use.public int getMinClusterSize()
public void setMinClusterSize(int minClusterSize)
minClusterSize
- The new minimum number of elements per cluster
allowed. Must be greater than zero.public double getMaxCriterionDecrease()
public void setMaxCriterionDecrease(double maxCriterionDecrease)
maxCriterionDecrease
- The new maximum minimum distance.public java.util.ArrayList<ClusterType> getClusters()
protected void setClusters(java.util.ArrayList<ClusterType> clusters)
clusters
- The clusters.public java.util.ArrayList<BinaryClusterHierarchyNode<DataType,ClusterType>> getClustersHierarchy()
protected void setClustersHierarchy(java.util.ArrayList<BinaryClusterHierarchyNode<DataType,ClusterType>> clustersHierarchy)
clustersHierarchy
- The hierarchy of clusters.public int getNumRequestedClusters()
public void setNumRequestedClusters(int numRequestedClusters)
numRequestedClusters
- The number of clusters requested.