gov.sandia.cognition.learning.algorithm.tree

## Class VectorThresholdInformationGainLearner<OutputType>

• Type Parameters:
`OutputType` - The output type of the data.
All Implemented Interfaces:
BatchLearner<java.util.Collection<? extends InputOutputPair<? extends Vectorizable,OutputType>>,VectorElementThresholdCategorizer>, DimensionFilterableLearner, DeciderLearner<Vectorizable,OutputType,java.lang.Boolean,VectorElementThresholdCategorizer>, PriorWeightedNodeLearner<OutputType>, VectorThresholdLearner<OutputType>, CloneableSerializable, java.io.Serializable, java.lang.Cloneable

```
public class VectorThresholdInformationGainLearner<OutputType>
extends AbstractVectorThresholdMaximumGainLearner<OutputType>
implements PriorWeightedNodeLearner<OutputType>
```
The `VectorThresholdInformationGainLearner` computes the best threshold over a dataset of vectors using information gain to determine the optimal index and threshold. This is an implementation of what is used in the C4.5 decision tree algorithm.

Information gain for a given split (sets X and Y) for two categories (a and b):

$$\mathrm{ig}(X, Y) = \mathrm{entropy}(X + Y) - \frac{|X|}{|X| + |Y|}\,\mathrm{entropy}(X) - \frac{|Y|}{|X| + |Y|}\,\mathrm{entropy}(Y)$$

with

$$\mathrm{entropy}(Z) = -\frac{Z_a}{|Z|}\log_2\frac{Z_a}{|Z|} - \frac{Z_b}{|Z|}\log_2\frac{Z_b}{|Z|}$$

where
$Z_a$ = number of a's in Z, and
$Z_b$ = number of b's in Z.
In the multi-class case, the entropy is defined as the sum over all of the categories (c) of $-\frac{Z_c}{|Z|}\log_2\frac{Z_c}{|Z|}$.
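The two-category formula can be checked by hand with a small standalone program. The following is a minimal sketch that does not use the library; the class and method names (`InformationGainExample`, `entropy`, `informationGain`) are hypothetical, and the convention that an empty category contributes zero entropy is an assumption of this sketch.

```java
class InformationGainExample {
    /** Base-2 logarithm. */
    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /** Entropy of a set holding a items of category a and b items of category b. */
    static double entropy(int a, int b) {
        int total = a + b;
        if (total == 0) {
            return 0.0; // assumption: an empty set has zero entropy
        }
        double h = 0.0;
        for (int count : new int[] {a, b}) {
            if (count > 0) { // assumption: 0 * log2(0) is treated as 0
                double p = (double) count / total;
                h -= p * log2(p);
            }
        }
        return h;
    }

    /** ig(X, Y) where X holds (xa, xb) and Y holds (ya, yb) items of categories a and b. */
    static double informationGain(int xa, int xb, int ya, int yb) {
        int sizeX = xa + xb;
        int sizeY = ya + yb;
        int n = sizeX + sizeY;
        if (n == 0) {
            return 0.0;
        }
        return entropy(xa + ya, xb + yb)
            - ((double) sizeX / n) * entropy(xa, xb)
            - ((double) sizeY / n) * entropy(ya, yb);
    }

    public static void main(String[] args) {
        // X = {4 a's}, Y = {4 b's}: the split separates the categories completely.
        System.out.println(informationGain(4, 0, 0, 4));
        // X and Y both hold 2 a's and 2 b's: the split tells us nothing.
        System.out.println(informationGain(2, 2, 2, 2));
    }
}
```

The first call yields a gain of about one bit, because the unsplit set has maximal entropy and both sides are pure. The second yields no gain, because each side has the same category mix as the whole.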
Since:
2.0
Author:
Justin Basilico
• ### Field Summary

Fields
Modifier and Type Field and Description
`protected java.util.ArrayList<OutputType>` `categories`
The categories for the prior.
`protected int[]` `categoryCounts`
The counts for each category.
`protected double[]` `categoryPriors`
The priors for each category.
`protected double[]` `categoryProbabilities`
Scratch space used when computing weighted entropy.
• ### Fields inherited from class gov.sandia.cognition.learning.algorithm.tree.AbstractVectorThresholdMaximumGainLearner

`DEFAULT_MIN_SPLIT_SIZE, dimensionsToConsider, minSplitSize`
• ### Constructor Summary

Constructors
Constructor and Description
`VectorThresholdInformationGainLearner()`
Creates a new `VectorThresholdInformationGainLearner`.
`VectorThresholdInformationGainLearner(int minSplitSize)`
Creates a new `VectorThresholdInformationGainLearner`.
• ### Method Summary

All Methods
Modifier and Type Method and Description
`VectorThresholdInformationGainLearner<OutputType>` `clone()`
This makes public the clone method on the `Object` class and removes the exception that it throws.
`double` ```computeSplitGain(DefaultDataDistribution<OutputType> baseCounts, DefaultDataDistribution<OutputType> positiveCounts, DefaultDataDistribution<OutputType> negativeCounts)```
Computes the gain of a given split.
`void` ```configure(java.util.Map<OutputType,java.lang.Double> priors, java.util.Map<OutputType,java.lang.Integer> trainCounts)```
Configure the node learner with prior weights and training counts.
• ### Methods inherited from class gov.sandia.cognition.learning.algorithm.tree.AbstractVectorThresholdMaximumGainLearner

`computeBestGainAndThreshold, computeBestGainAndThreshold, getDimensionsToConsider, getMinSplitSize, learn, setDimensionsToConsider, setMinSplitSize`
• ### Methods inherited from class java.lang.Object

`equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`
• ### Field Detail

• #### categories

`protected java.util.ArrayList<OutputType> categories`
The categories for the prior.
• #### categoryPriors

`protected double[] categoryPriors`
The priors for each category.
• #### categoryCounts

`protected int[] categoryCounts`
The counts for each category.
• #### categoryProbabilities

`protected double[] categoryProbabilities`
Scratch space used when computing weighted entropy. It is declared as a field so that it can be allocated once instead of during every entropy evaluation.
• ### Constructor Detail

• #### VectorThresholdInformationGainLearner

`public VectorThresholdInformationGainLearner()`
Creates a new `VectorThresholdInformationGainLearner`.
• #### VectorThresholdInformationGainLearner

`public VectorThresholdInformationGainLearner(int minSplitSize)`
Creates a new `VectorThresholdInformationGainLearner`.
Parameters:
`minSplitSize` - The minimum split size. Must be positive.
• ### Method Detail

• #### clone

`public VectorThresholdInformationGainLearner<OutputType> clone()`
Description copied from class: `AbstractCloneableSerializable`
This makes public the clone method on the `Object` class and removes the exception that it throws. Its default behavior is to create a clone of the exact type of the object that clone is called on, copying all primitives but keeping all references, which means it is a shallow copy. Extensions of this class may want to override this method (but call `super.clone()`) to implement a "smart copy", that is, one that targets the most common use case for creating a copy of the object. Because the default behavior is a shallow copy, extending classes only need to handle fields that need a deeper copy (or those that need to be reset). Some of the methods in `ObjectUtil` may be helpful in implementing a custom clone method. Note: The contract of this method is that you must use `super.clone()` as the basis for your implementation.
Specified by:
`clone` in interface `CloneableSerializable`
Overrides:
`clone` in class `AbstractVectorThresholdMaximumGainLearner<OutputType>`
Returns:
A clone of this object.
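The "smart copy" pattern described above can be sketched as follows. This is an illustration only: `CloneSketch` is a hypothetical class that uses the plain `java.lang.Cloneable` contract as a stand-in for `AbstractCloneableSerializable`, and its two fields merely mirror the kinds of fields this learner holds. It is not the library's actual implementation.

```java
import java.util.ArrayList;

/** Hypothetical stand-in class; not part of the library. */
class CloneSketch implements Cloneable {
    protected ArrayList<String> categories = new ArrayList<>();
    protected double[] categoryPriors = new double[0];

    @Override
    public CloneSketch clone() {
        try {
            // super.clone() produces a shallow copy of the exact runtime type.
            CloneSketch result = (CloneSketch) super.clone();
            // Deepen only the fields that must not be shared with the original.
            result.categories = new ArrayList<>(this.categories);
            result.categoryPriors = this.categoryPriors.clone();
            return result;
        } catch (CloneNotSupportedException e) {
            // Cannot happen because this class implements Cloneable.
            throw new AssertionError("Cloneable is implemented", e);
        }
    }

    public static void main(String[] args) {
        CloneSketch original = new CloneSketch();
        original.categories.add("a");
        original.categoryPriors = new double[] {0.25, 0.75};

        CloneSketch copy = original.clone();
        copy.categoryPriors[0] = 0.5; // does not affect the original
        System.out.println(original.categoryPriors[0]);
    }
}
```

Because the copy starts from `super.clone()`, subclasses of such a class get a correctly typed clone without repeating this work and only need to handle their own mutable fields.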
• #### computeSplitGain

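```
public double computeSplitGain(DefaultDataDistribution<OutputType> baseCounts,
                               DefaultDataDistribution<OutputType> positiveCounts,
                               DefaultDataDistribution<OutputType> negativeCounts)
```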
Description copied from class: `AbstractVectorThresholdMaximumGainLearner`
Computes the gain of a given split. The base counts contain the category information before the split.
Specified by:
`computeSplitGain` in class `AbstractVectorThresholdMaximumGainLearner<OutputType>`
Parameters:
`baseCounts` - The base category information before splitting. Contains the sum of the positive and negative counts.
`positiveCounts` - The category information on the positive side of the split.
`negativeCounts` - The category information on the negative side of the split.
Returns:
The gain of the given split computed by comparing the positive and negative counts to the base counts.
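To illustrate how the three count arguments relate, the sketch below partitions labeled values by a threshold and computes a multi-class gain from the resulting counts. It is a standalone approximation, not the library's code: plain `Map<T, Integer>` objects stand in for `DefaultDataDistribution`, the names `SplitGainSketch`, `countSide`, and `splitGain` are hypothetical, and sending values at or above the threshold to the positive side is an assumption of this sketch.

```java
import java.util.HashMap;
import java.util.Map;

class SplitGainSketch {
    /** Total number of examples represented by a count map. */
    static <T> int size(Map<T, Integer> counts) {
        int total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        return total;
    }

    /** Multi-class entropy in bits; zero counts contribute nothing. */
    static <T> double entropy(Map<T, Integer> counts) {
        int total = size(counts);
        double h = 0.0;
        for (int c : counts.values()) {
            if (c > 0) {
                double p = (double) c / total;
                h -= p * Math.log(p) / Math.log(2.0);
            }
        }
        return h;
    }

    /** Gain of a split; base is assumed to be the sum of positive and negative. */
    static <T> double splitGain(Map<T, Integer> base,
            Map<T, Integer> positive, Map<T, Integer> negative) {
        int n = size(base);
        if (n == 0) {
            return 0.0;
        }
        return entropy(base)
            - ((double) size(positive) / n) * entropy(positive)
            - ((double) size(negative) / n) * entropy(negative);
    }

    /** Counts labels on one side of the threshold (assumption: positive means value >= threshold). */
    static Map<String, Integer> countSide(double[] values, String[] labels,
            double threshold, boolean positiveSide) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < values.length; i++) {
            if ((values[i] >= threshold) == positiveSide) {
                counts.merge(labels[i], 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Gain obtained by thresholding a single dimension's values. */
    static double gainAt(double[] values, String[] labels, double threshold) {
        Map<String, Integer> positive = countSide(values, labels, threshold, true);
        Map<String, Integer> negative = countSide(values, labels, threshold, false);
        Map<String, Integer> base = new HashMap<>(positive);
        negative.forEach((label, c) -> base.merge(label, c, Integer::sum));
        return splitGain(base, positive, negative);
    }

    public static void main(String[] args) {
        double[] values = {1.0, 2.0, 3.0, 4.0};
        String[] labels = {"a", "a", "b", "b"};
        System.out.println(gainAt(values, labels, 2.5)); // separates a from b
        System.out.println(gainAt(values, labels, 1.5)); // leaves one side mixed
    }
}
```

In this toy data the threshold 2.5 separates the two categories completely, so it scores higher than 1.5, which leaves the positive side mixed.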
• #### configure

```
public void configure(java.util.Map<OutputType,java.lang.Double> priors,
                      java.util.Map<OutputType,java.lang.Integer> trainCounts)
```
Description copied from interface: `PriorWeightedNodeLearner`
Configure the node learner with prior weights and training counts.

If the prior weights are not specified, this method will configure default priors that match the relative frequencies of the different categories in the training data. The frequencies are based on the given category counts from the training data.
Specified by:
`configure` in interface `PriorWeightedNodeLearner<OutputType>`
Parameters:
`priors` - Prior weights for each of the possible output values (i.e., the categories for the prediction task). If null, the method will estimate default priors from the training counts.
`trainCounts` - Frequency counts of the possible output values (i.e., categories) in the training data. This parameter should always be specified.
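The default-prior behavior described above, where null priors fall back to the relative frequencies in the training counts, can be sketched as follows. `DefaultPriorsSketch` and `resolvePriors` are hypothetical names used only for illustration. How the library stores the priors internally, or whether it normalizes explicitly supplied priors, is not shown here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class DefaultPriorsSketch {
    /**
     * Returns the given priors, or relative frequencies derived from
     * trainCounts when priors is null (hypothetical helper).
     */
    static <T> Map<T, Double> resolvePriors(
            Map<T, Double> priors, Map<T, Integer> trainCounts) {
        if (priors != null) {
            return priors;
        }
        long total = 0;
        for (int count : trainCounts.values()) {
            total += count;
        }
        Map<T, Double> result = new LinkedHashMap<>();
        for (Map.Entry<T, Integer> entry : trainCounts.entrySet()) {
            // Assumption: with no training data, every prior is 0.
            double prior = total == 0 ? 0.0 : (double) entry.getValue() / total;
            result.put(entry.getKey(), prior);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> trainCounts = new LinkedHashMap<>();
        trainCounts.put("a", 3);
        trainCounts.put("b", 1);
        // With null priors, category "a" gets 3/4 and category "b" gets 1/4.
        System.out.println(resolvePriors(null, trainCounts));
    }
}
```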