DataType
- The type of the data to cluster. This is typically
defined by the divergence function used.ClusterType
- The type of Cluster
created by the algorithm.
This is typically defined by the cluster creator function used.@CodeReview(reviewer="Kevin R. Dixon", date="2008-07-22", changesNeeded=true, comments={"I *really* don\'t like the use of \'continue\', but I will defer.","Please implement the sections previously marked as \'to do\'"}, response=@CodeReviewResponse(respondent="Justin Basilico",date="2008-10-07",moreChangesNeeded=false,comments="The clusterer now supports hierarchical clustering.")) public class AgglomerativeClusterer<DataType,ClusterType extends Cluster<DataType>> extends AbstractAnytimeBatchLearner<java.util.Collection<? extends DataType>,java.util.Collection<ClusterType>> implements BatchClusterer<DataType,ClusterType>, BatchHierarchicalClusterer<DataType,ClusterType>, DivergenceFunctionContainer<ClusterType,ClusterType>
AgglomerativeClusterer
implements an agglomerative clustering
algorithm, which is a type of hierarchical clustering algorithm.
Such a clustering algorithm works by initially creating one
cluster for each element in the collection to cluster and then
repeatedly merging the two closest clusters until the stopping
condition is met or there is only one cluster remaining. This
implementation supports multiple methods for determining the
distance between two clusters by supplying an
ClusterToClusterDivergenceFunction
object. There are two stopping
conditions for the algorithm, which are parameters that can be set. The first
is that the clustering will stop when some minimum number of
clusters is reached, which defaults to 1. The second criteria is
that the clustering will stop when the distance between the two
closest clusters is larger than a given value. This threshold can
be used to create clusters when the number of clusters is not
known ahead of time.Modifier and Type | Class and Description |
---|---|
static class |
AgglomerativeClusterer.HierarchyNode<DataType,ClusterType extends Cluster<DataType>>
Holds the hierarchy information for the agglomerative clusterer.
|
Modifier and Type | Field and Description |
---|---|
protected java.util.ArrayList<ClusterType> |
clusters
The current set of clusters.
|
protected java.util.ArrayList<AgglomerativeClusterer.HierarchyNode<DataType,ClusterType>> |
clustersHierarchy
The current set of hierarchical clusters.
|
protected ClusterCreator<ClusterType,DataType> |
creator
The merger used to merge two clusters into one element.
|
static double |
DEFAULT_MAX_DISTANCE
The default maximum distance is 1.7976931348623157E308.
|
static int |
DEFAULT_MAX_ITERATIONS
The default maximum number of iterations 2147483647
|
static double |
DEFAULT_MAX_MIN_DISTANCE
Deprecated.
|
static int |
DEFAULT_MIN_NUM_CLUSTERS
The default minimum number of clusters is 1.
|
protected ClusterToClusterDivergenceFunction<? super ClusterType,? super DataType> |
divergenceFunction
The divergence function used to find the distance between two clusters.
|
protected double |
maxDistance
The maximum distance between clusters allowed.
|
protected java.util.ArrayList<java.lang.Integer> |
minClusters
The array of indexes that maps the cluster index to the closest cluster.
|
protected java.util.ArrayList<java.lang.Double> |
minDistances
An array list mapping the cached minimum distance from the cluster with
the given index to any other clusters.
|
protected int |
minNumClusters
The minimum number of clusters allowed.
|
data, keepGoing
maxIterations
DEFAULT_ITERATION, iteration
Constructor and Description |
---|
AgglomerativeClusterer()
Creates a new instance of AgglomerativeClusterer.
|
AgglomerativeClusterer(ClusterToClusterDivergenceFunction<? super ClusterType,? super DataType> divergenceFunction,
ClusterCreator<ClusterType,DataType> creator)
Initializes the clustering to use the given metric between
clusters, and the given cluster creator.
|
AgglomerativeClusterer(ClusterToClusterDivergenceFunction<? super ClusterType,? super DataType> divergenceFunction,
ClusterCreator<ClusterType,DataType> creator,
double maxDistance)
Initializes the clustering to use the given metric between
clusters, the given cluster merger, and the maximum
distance between clusters to allow when merging.
|
AgglomerativeClusterer(ClusterToClusterDivergenceFunction<? super ClusterType,? super DataType> divergenceFunction,
ClusterCreator<ClusterType,DataType> creator,
int minNumClusters)
Initializes the clustering to use the given metric between
clusters, the given cluster creator, and the minimum number of
clusters to allow.
|
AgglomerativeClusterer(ClusterToClusterDivergenceFunction<? super ClusterType,? super DataType> divergenceFunction,
ClusterCreator<ClusterType,DataType> creator,
int minNumClusters,
double maxDistance)
Initializes the clustering to use the given metric between
clusters, the given cluster merger, the minimum number of
clusters to allow, and the maximum minimum distance between
clusters to allow.
|
Modifier and Type | Method and Description |
---|---|
protected void |
cleanupAlgorithm()
Called to clean up the learning algorithm's state after learning has
finished.
|
AgglomerativeClusterer<DataType,ClusterType> |
clone()
This makes public the clone method on the
Object class and
removes the exception that it throws. |
ClusterHierarchyNode<DataType,ClusterType> |
clusterHierarchically(java.util.Collection<? extends DataType> data)
Performs hierarchical clustering on the given elements.
|
java.util.ArrayList<AgglomerativeClusterer.HierarchyNode<DataType,ClusterType>> |
getClustersHierarchy()
Gets the hierarchy of clusters.
|
ClusterCreator<ClusterType,DataType> |
getCreator()
Gets the cluster creator.
|
ClusterToClusterDivergenceFunction<? super ClusterType,? super DataType> |
getDivergenceFunction()
Gets the divergence function used in clustering.
|
double |
getMaxDistance()
The maximum distance between clusters that is allowed
for the two clusters to be merged.
|
double |
getMaxMinDistance()
Deprecated.
Use getMaxDistance
|
int |
getMinNumClusters()
The minimum number of clusters to allow.
|
int |
getNumClusters()
Gets the number of clusters.
|
java.util.ArrayList<ClusterType> |
getResult()
Gets the current result of the algorithm.
|
protected boolean |
initializeAlgorithm()
Called to initialize the learning algorithm's state based on the
data that is stored in the data field.
|
protected int |
mergeClusters(int firstIndex,
int secondIndex,
double distance)
Merges two clusters together, creating a new BinaryTreeCluster
and updating the internal state.
|
protected void |
setClusters(java.util.ArrayList<ClusterType> clusters)
Sets the clusters.
|
protected void |
setClustersHierarchy(java.util.ArrayList<AgglomerativeClusterer.HierarchyNode<DataType,ClusterType>> clustersHierarchy)
Sets the hierarchy of clusters.
|
void |
setCreator(ClusterCreator<ClusterType,DataType> creator)
Sets the cluster creator.
|
void |
setDivergenceFunction(ClusterToClusterDivergenceFunction<? super ClusterType,? super DataType> divergenceFunction)
Sets the divergence function.
|
void |
setMaxDistance(double maxDistance)
The maximum distance between clusters that is allowed
for the two clusters to be merged.
|
void |
setMaxMinDistance(double maxMinDistance)
Deprecated.
Use setMaxDistance
|
protected void |
setMinClusters(java.util.ArrayList<java.lang.Integer> minClusters)
Sets the index of the closest cluster.
|
protected void |
setMinDistances(java.util.ArrayList<java.lang.Double> minDistances)
Sets the minimum distances for each clusters.
|
void |
setMinNumClusters(int minNumClusters)
The minimum number of clusters to allow.
|
protected boolean |
step()
Called to take a single step of the learning algorithm.
|
protected void |
updateMinDistance(int index)
Updates the cached minimum distance for this cluster by
comparing it to all the other clusters.
|
getData, getKeepGoing, learn, setData, setKeepGoing, stop
getMaxIterations, isResultValid, setMaxIterations
addIterativeAlgorithmListener, fireAlgorithmEnded, fireAlgorithmStarted, fireStepEnded, fireStepStarted, getIteration, getListeners, removeIterativeAlgorithmListener, setIteration, setListeners
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
learn
getMaxIterations, setMaxIterations
addIterativeAlgorithmListener, getIteration, removeIterativeAlgorithmListener
isResultValid
public static final int DEFAULT_MIN_NUM_CLUSTERS
public static final double DEFAULT_MAX_DISTANCE
@Deprecated public static final double DEFAULT_MAX_MIN_DISTANCE
public static final int DEFAULT_MAX_ITERATIONS
protected ClusterToClusterDivergenceFunction<? super ClusterType extends Cluster<DataType>,? super DataType> divergenceFunction
protected ClusterCreator<ClusterType extends Cluster<DataType>,DataType> creator
protected int minNumClusters
protected double maxDistance
protected java.util.ArrayList<ClusterType extends Cluster<DataType>> clusters
protected java.util.ArrayList<AgglomerativeClusterer.HierarchyNode<DataType,ClusterType extends Cluster<DataType>>> clustersHierarchy
protected transient java.util.ArrayList<java.lang.Double> minDistances
protected transient java.util.ArrayList<java.lang.Integer> minClusters
public AgglomerativeClusterer()
public AgglomerativeClusterer(ClusterToClusterDivergenceFunction<? super ClusterType,? super DataType> divergenceFunction, ClusterCreator<ClusterType,DataType> creator)
divergenceFunction
- The distance metric between clusters.creator
- The method for creating clusters.public AgglomerativeClusterer(ClusterToClusterDivergenceFunction<? super ClusterType,? super DataType> divergenceFunction, ClusterCreator<ClusterType,DataType> creator, int minNumClusters)
divergenceFunction
- The distance metric between clusters.creator
- The method for creating clusters.minNumClusters
- The minimum number of clusters to allow. Must
be greater than zero.public AgglomerativeClusterer(ClusterToClusterDivergenceFunction<? super ClusterType,? super DataType> divergenceFunction, ClusterCreator<ClusterType,DataType> creator, double maxDistance)
divergenceFunction
- The distance metric between clusters.creator
- The method for creating clusters.maxDistance
- The maximum distance between clusters to allow when
merging them.public AgglomerativeClusterer(ClusterToClusterDivergenceFunction<? super ClusterType,? super DataType> divergenceFunction, ClusterCreator<ClusterType,DataType> creator, int minNumClusters, double maxDistance)
divergenceFunction
- The distance metric between clusters.creator
- The method for creating clusters.minNumClusters
- The minimum number of clusters to allow. Must
be greater than zero.maxDistance
- The maximum distance between clusters to allow when
merging them.public AgglomerativeClusterer<DataType,ClusterType> clone()
AbstractCloneableSerializable
Object
class and
removes the exception that it throws. Its default behavior is to
automatically create a clone of the exact type of object that the
clone is called on and to copy all primitives but to keep all references,
which means it is a shallow copy.
Extensions of this class may want to override this method (but call
super.clone()
to implement a "smart copy". That is, to target
the most common use case for creating a copy of the object. Because of
the default behavior being a shallow copy, extending classes only need
to handle fields that need to have a deeper copy (or those that need to
be reset). Some of the methods in ObjectUtil
may be helpful in
implementing a custom clone method.
Note: The contract of this method is that you must use
super.clone()
as the basis for your implementation.clone
in interface CloneableSerializable
clone
in class AbstractAnytimeBatchLearner<java.util.Collection<? extends DataType>,java.util.Collection<ClusterType extends Cluster<DataType>>>
public ClusterHierarchyNode<DataType,ClusterType> clusterHierarchically(java.util.Collection<? extends DataType> data)
BatchHierarchicalClusterer
clusterHierarchically
in interface BatchHierarchicalClusterer<DataType,ClusterType extends Cluster<DataType>>
data
- The elements to cluster.protected boolean initializeAlgorithm()
AbstractAnytimeBatchLearner
initializeAlgorithm
in class AbstractAnytimeBatchLearner<java.util.Collection<? extends DataType>,java.util.Collection<ClusterType extends Cluster<DataType>>>
protected boolean step()
AbstractAnytimeBatchLearner
step
in class AbstractAnytimeBatchLearner<java.util.Collection<? extends DataType>,java.util.Collection<ClusterType extends Cluster<DataType>>>
protected void cleanupAlgorithm()
AbstractAnytimeBatchLearner
cleanupAlgorithm
in class AbstractAnytimeBatchLearner<java.util.Collection<? extends DataType>,java.util.Collection<ClusterType extends Cluster<DataType>>>
protected void updateMinDistance(int index)
index
- The cluster to update.protected int mergeClusters(int firstIndex, int secondIndex, double distance)
firstIndex
- The first cluster.secondIndex
- The second cluster.distance
- The distance between the clusters.public java.util.ArrayList<ClusterType> getResult()
AnytimeAlgorithm
getResult
in interface AnytimeAlgorithm<java.util.Collection<ClusterType extends Cluster<DataType>>>
public int getNumClusters()
public ClusterToClusterDivergenceFunction<? super ClusterType,? super DataType> getDivergenceFunction()
getDivergenceFunction
in interface DivergenceFunctionContainer<ClusterType extends Cluster<DataType>,ClusterType extends Cluster<DataType>>
public void setDivergenceFunction(ClusterToClusterDivergenceFunction<? super ClusterType,? super DataType> divergenceFunction)
divergenceFunction
- The divergence function.public ClusterCreator<ClusterType,DataType> getCreator()
public void setCreator(ClusterCreator<ClusterType,DataType> creator)
creator
- The creator for clusters.public int getMinNumClusters()
public void setMinNumClusters(int minNumClusters)
minNumClusters
- The new minimum number of clusters.@Deprecated public double getMaxMinDistance()
@Deprecated public void setMaxMinDistance(double maxMinDistance)
maxMinDistance
- The maximum distance.public double getMaxDistance()
public void setMaxDistance(double maxDistance)
maxDistance
- The new maximum distance between clusters to merge.protected void setClusters(java.util.ArrayList<ClusterType> clusters)
clusters
- The clusters.public java.util.ArrayList<AgglomerativeClusterer.HierarchyNode<DataType,ClusterType>> getClustersHierarchy()
protected void setClustersHierarchy(java.util.ArrayList<AgglomerativeClusterer.HierarchyNode<DataType,ClusterType>> clustersHierarchy)
clustersHierarchy
- The hierarchy of clusters.protected void setMinDistances(java.util.ArrayList<java.lang.Double> minDistances)
minDistances
- The array of minimum distances.protected void setMinClusters(java.util.ArrayList<java.lang.Integer> minClusters)
minClusters
- The array of closest cluster indices.