Histograms

Introduction

A histogram is technically a bar graph, so I am taking liberties using the word to name a data-acquisition class. This page describes two concrete classes: DiscreteHistogram and ContinuousHistogram. These two classes are comparable to DiscreteDistribution and ContinuousDistribution. Both the Histogram and Distribution classes describe distributions. The difference is that where DiscreteDistribution and ContinuousDistribution represent their distributions using probability densities which are known at the outset, DiscreteHistogram and ContinuousHistogram represent their distributions using raw tallies gleaned from analysis of a source sequence.

If a source sequence has a discrete range (values are integers) then a DiscreteHistogram tallies up for each value the number of sequence elements which equal that value. If a source sequence has a continuous range (values are real numbers), then a ContinuousHistogram divides the range into equal-sized intervals and tallies up for each interval the number of sequence elements which fall within that interval. Tallies are implemented as long integers, since in my own work for this site I've needed to analyze some very long sequences. Any absolute tally can be expressed as a relative density; simply divide the tally by the total number of samples analyzed.

The total number of samples is one example of a summary statistic which is easy enough to gather when analyzing every member of a population. Another easily gathered statistic is the sum of sample values, which when divided by total samples gives the average sample value. Still another is the sum of squared sample values, which can be used to calculate the standard deviation around the average. My HistogramBase class gathers these additional statistics because while histogram graphs do not typically display his information, there are situations when it would definitely enrich the presention.

An additional set of summary statistics proved useful on the Basics I: Randomness page for quantifying how closely a population of random numbers conforms to a uniform distribution. If the distribution is uniform, then minimum and maximum tallies must be close together relative to the tally average. The standard deviation around the tally average must also be small.

Although the DiscreteHistogram and ContinuousHistogram classes themselves produce no graphics, they do acquire all the information necessary to generate a graph. Examples within the present topic area include the graphs used to illustrate the Poisson Point Process on the "basics" page for Distributions: the continuous histogram in Figure 2 and the discrete histogram in Figure 3. The statistics necessary to generate these graphics were gathered using instances of the ContinuousHistogram and DiscreteHistogram classes described on this page.

/**
 * The abstract {@link HistogramBase} class defines base functionality for a histogram.
 * @param <T> The range type of values:  discrete histograms range over {@link Integer} values, while
 * continuous histograms range over {@link Double} values.
 * @author Charles Ames
 */
public abstract class HistogramBase<T extends Number> extends WriteableEntity {
   /**
    * Lower limit to the distribution range.
    */
   private T minRange;
   /**
    * Upper limit to the distribution range.
    */
   private T maxRange;
   /**
    * The total number of instances analyzed.
    */
   private long sampleCount;
   /**
    * The tally of instances analyzed in the ith region.
    */
   private long instances[];
   /**
    * The number of instances analyzed in regions up to but not including the ith.
    * Calculated only after the histogram has been normalized.
    */
   private double sum[];
   /**
    * The sum of all samples analyzed, which when divided by the
    * sample count, gives the average.
    */
   private double sumOfSamples;
   /**
    * The sum of squares of all samples analyzed, useful for
    * calculating the standard deviation.
    */
   private double sumOfSquaredSamples;
   /**
    * The smallest instance tally.
    * Calculated by {@link #calculateTallyStatistics()}.
    */
   private long minTally;
   /**
    * The largest instance tally.
    * Calculated by {@link #calculateTallyStatistics()}.
    */
   private long maxTally;
   /**
    * The average instance tally.
    * Calculated by {@link #calculateTallyStatistics()}.
    */
   private double avgTally;
   /**
    * The RMS deviation of tallies around {@link #avgTally}.
    * Calculated by {@link #calculateTallyStatistics()}.
    */
   private double devTally;
   /**
    * Constructor for {@link HistogramBase} instances.
    * @param container An entity which contains this histogram.
    */
   public HistogramBase(WriteableEntity container) {
      super(container);
      this.setIDQuality(AttributeQuality.MODIFIABLE);
      this.setNameQuality(AttributeQuality.MODIFIABLE);
      this.sampleCount = Long.MIN_VALUE;
      this.instances = null;
      this.sum = null;
      this.sumOfSamples = 0.;
      this.sumOfSquaredSamples = 0.;
   }
   /**
    * Set the range of the histogram.
    * @param minRange Lower range boundary.
    * @param maxRange Upper range boundary.
    * @return True if range has changed; false otherwise.
    * @throws ObjectAlreadyExistsException when the range has
    * previously been set.
    */
   public boolean setRange(T minRange, T maxRange) {
      if (null != instances)
         throw new UnsupportedOperationException("Range previously set");
      boolean success = false;
      success = setMinRange(minRange) || success;
      success = setMaxRange(maxRange) || success;
      return success;
   }
   protected boolean setMinRange(T minRange) {
      if (null == this.minRange || !this.minRange.equals(minRange)) {
         this.minRange = minRange;
         makeDirty();
         return true;
      }
      return false;
   }
   protected boolean setMaxRange(T maxRange) {
      if (null == this.maxRange || !this.maxRange.equals(maxRange)) {
         this.maxRange = maxRange;
         makeDirty();
         return true;
      }
      return false;
   }
   /**
    * Get the lower range boundary of the histogram.
    * @return The lower range boundary of the histogram.
    */
   public T minRange() {
      return minRange;
   }
   /**
    * Get the upper range boundary of the histogram.
    * @return The upper range boundary of the histogram.
    */
   public T maxRange() {
      return maxRange;
   }
   /**
    * Getter for {@link #sampleCount}.
    * @return The count of calls to {@link #analyze(Number)}.
    */
   public long getSampleCount() {
      return sampleCount;
   }
   /**
    * Check if this histogram is normalized.
    * @return True if the histogram is normalized; false otherwise.
    */
   public final boolean isNormalized() {
      return null != sum;
   }
   /**
    * Calculate the number of discrete values or histogram regions.
    * @return The number of discrete values or histogram regions.
    */
   protected abstract int calculateItemTallies();
   /**
    * Calculate the number of discrete values or histogram regions.
    * @return The number of discrete values or histogram regions.
    */
   public int itemCount() {
      if (null == instances)
         throw new UninitializedException("Item tallies not initialized");
      return instances.length;
   }
   /**
    * Clear the histogram.  Do this before analyzing any values.
    */
   public void initialize() {
      this.sampleCount = 0;
      int instanceTallies = calculateItemTallies();
      this.instances = new long[instanceTallies];
      for (int index = 0; index < instanceTallies; index++) {
         this.instances[index] = 0;
      }
      denormalize();
   }
   /**
    * Blank out summary statistics which are invalidated by
    * acquisition of new analysis data.
    */
   protected final void denormalize() {
      if (!isNormalized()) return;
      this.sum = null;
      this.minTally = Long.MIN_VALUE;
      this.maxTally = Long.MIN_VALUE;
      this.avgTally = Long.MIN_VALUE;
      this.devTally = Long.MIN_VALUE;
   }
   /**
    * Getter for {@link #minTally}.
    * @return The minimum tally statistic.
    * @throws UninitializedException when this tally statistic
    * has not been populated by {@link #normalize()}.
    */
   public long minTally() {
      if (!isNormalized())
         throw new UninitializedException(
            "The histogram must be normalized before tally statistics may be accessed");
      return minTally;
   }
   /**
    * Getter for {@link #maxTally}.
    * @return The maximum tally statistic.
    * @throws UninitializedException when this tally statistic
    * has not been populated by {@link #normalize()}.
    */
   public long maxTally() {
      if (!isNormalized())
         throw new UninitializedException(
            "The histogram must be normalized before tally statistics may be accessed");
      return maxTally;
   }
   /**
    * Getter for {@link #avgTally}.
    * @return The average tally statistic.
    * @throws UninitializedException when this tally statistic
    * has not been populated by {@link #normalize()}.
    */
   public double avgTally() {
      if (!isNormalized())
         throw new UninitializedException(
            "The histogram must be normalized before tally statistics may be accessed");
      return avgTally;
   }
   /**
    * Getter for {@link #devTally}.
    * @return The RMS deviation of instance tallies around {@link #avgTally}.
    * @throws UninitializedException when this tally statistic
    * has not been populated by {@link #normalize()}.
    */
   public double devTally() {
      if (!isNormalized())
         throw new UninitializedException(
            "The histogram must be normalized before tally statistics may be accessed");
      return devTally;
   }
   /**
    * Update this histogram to reflect the indicated value.
    * @param value An array of numbers in range set by
    * {@link #setRange(Number, Number)}.
    * @throws IllegalArgumentException when the value falls outside
    * the range from the assigned range.
    */
   public abstract void analyze(T value);
   /**
    * Iterate through an array of values, updating this histogram
    * to reflect each value.
    * @param values An array of values all in range set by
    * {@link #setRange(Number, Number)}.
    * @throws IllegalArgumentException when the length of the
    * array is zero.
    * @throws IllegalArgumentException when any value falls outside
    * the range established by {@link #setRange(Number, Number)}.
    */
   public final void analyze(T[] values) {
      int length = values.length;
      if (0 == length) throw new IllegalArgumentException("Empty value array");
      for (int index = 0; index < length; index++) {
         analyze(values[index]);
      }
   }
   /**
    * Iterate through a {@link List} of values, updating this histogram
    * to reflect each value.
    * @param values A list of values all in range set by
    * {@link #setRange(Number, Number)}.
    * @throws IllegalArgumentException when the size of the
    * list is zero.
    * @throws IllegalArgumentException when any value falls outside
    * the range established by {@link #setRange(Number, Number)}.
    */
   public final void analyze(List<T> values) {
      if (0 == values.size()) throw new IllegalArgumentException("Empty value list");
      for (T value : values) analyze(value);
   }
   /**
    * Iterate through the collection of values stored within a
    * {@link Sequence}, updating this histogram
    * to reflect each value.
    * @param sequence A unit which iterates through a stored collection
    * of values.
    * @throws IllegalArgumentException when the size of the
    * stored collection is zero.
    * @throws IllegalArgumentException when any value falls outside
    * the range established by {@link #setRange(Number, Number)}.
    */
   public final void analyze(Sequence<T> sequence) {
      int length = sequence.getItemCount();
      for (int index = 0; index < length; index++) {
         analyze(sequence.valueAt(index));
      }
   }
   /**
    * Accumulate a value into {@link #sumOfSamples}.
    * {@link #analyze(Number)} should call this for each sample value.
    * @param value A sample value.
    */
   protected void incrementSumOfSamples(double value) {
      this.sumOfSamples += value;
      this.sumOfSquaredSamples += value*value;
   }
   /**
    * Increment the instance tally referenced by the indicated index.
    * @param index The tally index.
    */
   protected void incrementTally(int index) {
      if (null == instances)
         throw new UninitializedException("Instance tallies not initialized");
      sampleCount++;
      instances[index]++;
   }
   /**
    * Indicate whether this histogram needs initializing or not.
    * @return True if {@link #instances} has been allocated; false otherwise.
    */
   boolean hasInstances() {
      return null != instances;
   }
   /**
    * Set the instance tally referenced by the indicated index.
    * @param index The tally index.
    * @param tally The intended value.
    */
   protected void setTally(int index, long tally) {
      if (null == instances)
         throw new UninitializedException("Instance tallies not initialized");
      sampleCount -= instances[index];
      sampleCount += tally;
      instances[index] = tally;
   }
   /**
    * Calculates tally statistics {@link #minTally}, {@link #maxTally},
    * {@link #avgTally}, {@link #devTally}.
    * Also populates {@link #sum} array whose kth element holds
    * sum of tallies up through k-1.
    * Call this method after values have been
    * analyzed, but before using either {@link #densityUpTo(Number)}
    * or {@link #quantile(double)} method.
    */
   public final void normalize() {
      int tallies = itemCount();
      minTally = sampleCount;
      maxTally = 0;
      long sumTally = 0;
      for (int i = 0; i < tallies; i++) {
         long num = itemTally(i);
         if (num < minTally) minTally = num;
         if (num > maxTally) maxTally = num;
         sumTally += num;
      }
      if (sumTally != sampleCount) throw new RuntimeException("Sum of tallies inconsistent with count of instances analyzed");
      avgTally = sumTally / (double) tallies;
      double var = 0.;
      for (int i = 0; i < tallies; i++) {
         double num = itemTally(i) - avgTally;
         var += num*num;
      }
      devTally = Math.sqrt(var/tallies);
      int length = instances.length;
      sum = new double[length];
      sum[0] = 0.;
      for (int index = 1; index < length; index++) {
         sum[index] = sum[index-1] + (double) instances[index-1];
      }
      int limit = length - 1;
      if (sampleCount != sum[limit] + instances[limit]) {
         throw new RuntimeException("Partial sums inconsistent with sample count");
      }

   }
   /**
    * Get the number of instances for the region specified by
    * the indicated index.
    * @param index The indicated index.  Must be in the range
    * from 0 to {@link #itemCount()}.
    * @return The number of instances for the region specified
    * by the indicated index.
    */
   public long itemTally(int index) {
      if (0 > index || instances.length <= index) {
         throw new IllegalArgumentException("Index " + index + " out of range from 0 to " + instances.length);
      }
      return instances[index];
   }
   /**
    * Get the sum of instance tallies for indices prior to the
    * indicated index.
    * @param index The indicated index.  Must be in the range
    * from 0 to {@link #itemCount()}.
    * @return The number of instances for the region specified
    * by the indicated index.
    * @throws UninitializedException when the histogram has not been normalized.
    * @throws IllegalArgumentException when the index is invalid.
    */
   public double itemSum(int index) {
      if (!isNormalized())
         throw new UninitializedException("The histogram must be normalized before partial sums may be accessed");
      if (!isNormalized())
      if (0 > index || sum.length <= index) {
         throw new IllegalArgumentException("Index " + index + " out of range from 0 to " + (instances.length-1));
      }
      return sum[index];
   }
   /**
    * Get the density for the region specified by the indicated index.
    * @param index The indicated index.  Must be in the range from
    * 0 to {@link #itemCount()}.
    * @return The density for the region specified by the
    * indicated index.
    * @throws IllegalArgumentException when the index is invalid.
    */
   public final double itemDensity(int index) {
      if (0 > index || instances.length <= index) {
         throw new IllegalArgumentException("Index " + index + " out of range from 0 to " + (instances.length-1));
      }
      return instances[index] / (double) sampleCount;
   }
   /**
    * Get the sum of all histogram weights.
    * @return The sum of all histogram weights.
    * @throws UnsupportedOperationException when the histogram is
    * not normalized.
    * @throws RuntimeException when the histogram is
    * normalized, but there are no weighted items.
    */
   public final double sumOfTallies() {
      if (0 == sampleCount) throw new UnsupportedOperationException("No samples analyzed");
      return (double) sampleCount;
   }
   /**
    * Determine the largest region density.
    * @return The largest region density.
    */
   public final long maxInstances() {
      long result = 0;
      for (int index = 0; index < instances.length; index++) {
         result = Math.max(result, instances[index]);
      }
      return result;
   }
   /**
    * Calculate a practical lower bound for graphing output
    * from this transform.
    * @param tail The combined weight of out-of-range values.
    * Ranges from zero to unity, but smaller values are suggested.
    * @return A practical lower bound for graphing output from
    * this transform.
    * @throws IllegalArgumentException when the tail falls
    * outside the range from zero to unity.
    */
   public final T minGraphValue(double tail) {
      return minRange();
   }
   /**
    * Calculate a practical upper bound for graphing output
    * from this transform.
    * @param tail The combined weight of out-of-range values.
    * Ranges from zero to unity, but smaller values are suggested.
    * @return A practical upper bound for graphing output from
    * this transform.
    * @throws IllegalArgumentException when the tail falls
    * outside the range from zero to unity.
    */
   public final T maxGraphValue(double tail) {
      return maxRange();
   }
   /**
    * Use the indicated driver value to select a range value.
    * @param driver The indicated driver value.
    * Driver values must fall within the range from zero to unity.
    * The distribution will be faithfully reproduced only if the
    * driver values are uniformly distributed between zero and unity.
    * This means that if you divide the range from zero to unity
    * into N equal-sized regions, then each region will
    * contain roughly the same number of driver values.
    * @return The selected range value.
    * @throws IllegalArgumentException when the driver value falls outside the range from zero to unity.
    */
   public abstract T quantile(double driver);
   /**
    * Get the probability density at the indicated value.
    * @param value A value in the range of the distribution.
    * @return The probability density at the indicated value.
    */
   public abstract double densityAt(T value);
   /**
    * Sum together all weights for values less than or equal to
    * the argument.
    * @param value A value in the range of the distribution.
    * @return The sum of weights for values less than or equal
    * to the argument.  Ranges from zero to unity.
    */
   public abstract double densityUpTo(T value);
   /**
    * Get the statistical mean of sample values.
    * @return The sum of sample values divided by the count of samples.
    * @throws UnsupportedOperationException when no samples have been analyzed.
    */
   public final double sampleAverage() {
      if (sampleCount <= 0) throw new UnsupportedOperationException("No samples analyzed");
      return sumOfSamples / sampleCount;
   }
   /**
    * Get the variance (squared deviation) of sample values around their statistical mean.
    * @return The standard deviation.
    */
   public final double sampleVariance() {
      double average = sampleAverage();
      double averageOfSquares = sumOfSquaredSamples / sampleCount;
      double squareOfAverages = average*average;
      return averageOfSquares - squareOfAverages;
   }
   /**
    * Get the standard deviation of sample values around their statistical mean.
    * @return The standard deviation.
    */
   public final double sampleDeviation() {
      return Math.sqrt(sampleVariance());
   }
}
Listing 1: The HistogramBase base class.

Shared Functionality

The abstract HistogramBase class presented as Listing 1 implements functionality shared between DiscreteHistogram and ContinuousHistogram.

The first thing to notice in Listing 1 is the generic parameter T which extends java.lang.Number. Number is an abstract base class whose implementing sub-classes include Integer and Double. In practice, T can be either of these. Of the two concrete subclasses. DiscreteHistogram extends HistogramBase<Integer>, while ContinuousHistogram extends HistogramBase<Double>.

Central to everything are the array instances, which manages the individual tallies and the sampleCount field, which holds the total number of samples (also the sum of all tallies).

The life cycle of a DiscreteHistogram or ContinuousHistogram instance has a very brief construction phase, an analysis phase which processes source samples to infer a distribution, and a consumption phase which communicates results gathered during the analysis phase.

Construction begins with instance creation using the new operator. A call to setRange() establishes the lower and upper range boundaries. For DiscreteHistogram the number of tallies is inferred from the range boundaries. For ContinousHistogram the number of intervals needs to be specified using a call to setItemCount(). The construction phase completes with a call to initialize(). The initialize() method allocates instances (the array of tallies), zeros out instances The facts once established by setRange() and setItemCount() may never change. However once you are done analyzing a sequence and have consumed the results, you can analyze another sequence so long as the range and granularity carry over. Simply initiate each subsequent analysis with a call to initialize().

Analysis begins after the call to initialize() returns. It encompasses successive calls to analyze(T value), and pauses (reserving the option to resume) with a call to normalize().

The single-value signature analyze(T value) is abstract; the implementing class is responsible for determining, based upon the argument, which element of the instances array should be incremented.

Three helper methods are provided to analyze samples in bulk:

Consumption begins after the call to normalize() returns. The normalize() method does two things: First, it iterates through the tallies in the instances array to determine the minimum tally minTally, the maximum tally maxTally, the average tally avgTally, and the standard deviation around this average devTally. Second, it allocates an array of partial sums sum sized just like instances, then populates this array so that sum[0]=0 and sum[k]=sum[k-1]+instances[k-1] (for k>0).

For DiscreteHistogram, the instances array provides the same granularity as DiscreteDistribution: each range value has an associated population count. For ContinuousHistogram, the instances array associates population counts with intervals. HistogramBase exposes individual instances elements via method itemTally(int index). The argument is an array index from 0 to instances.length (the item count); the caller being responsible for determining an array index from a range value. The densityAt(T value) calculates a relative density as the quotient of the absolute tally from itemTally() and the count of samples; it also does the necessary work of converting range values into array indices — which is why implementation is left to the range-specific subclasses.

The sum array is to the cumulative distribution function (CDF) as the incidents array is to the PDF. HistogramBase exposes individual sum elements via method itemSum(int index). The caller once again being responsible for determining an array index from a range value. Method densityUpTo(T value) does the necessary work of converting range values into array indices — which is why implementation is left to the range-specific subclasses.

The return value from quantile(double driver) has type T element, which can be either Integer or Double. The quantile function inverts the CDF, meaning that it takes a driver value between zero and unity and maps that value back to the distribution's range. This can be accomplished with no more infrastructure than has already been described. A long chooser variable is calculated as the driver value times the count of samples. Next step is a binary search to identify the largest sum array element which is still ≤ the chooser. For DiscreteHistogram, an arithmetic offset converts the identified array index to a discrete range value. For ContinuousHistogram, the identified array index specifies which equal-sized interval will contain the result, but the result still needs to be placed within the interval. The pro-ratio is calculated by subtracting the identified sum element from the chooser, then dividing the remainder by the corresponding instances element.

The boolean isNormalized() method can be used to verify that histogram results are ready for consumption. Summary statistics will then be available from the following methods:

Details concerning specific range values will then be available from the following methods:

Four other statistics, gleaned through secondary analysis of tallies, are useful measuring the uniformity of the analyzed distribution. The getters for these four statistics, available only when isNormalized() returns true, are:

/**
 * The {@link DiscreteHistogram} class analyzes
 * how many samples in a population fall within each point of a discrete range.
 * @author Charles Ames
 */
public class DiscreteHistogram extends HistogramBase<Integer> {
   /**
    * Item index for deserializing histogram tallies.
    */
   private int itemIndex = Integer.MIN_VALUE;
   /**
    * Constructor for {@link DiscreteHistogram} instances.
    * @param container An entity which contains this histogram.
    */
   public DiscreteHistogram(WriteableEntity container) {
      super(container);
      super.setRange(Integer.MAX_VALUE, Integer.MIN_VALUE);
   }
   /**
    * Constructor for {@link DiscreteHistogram} instances.
    */
   public DiscreteHistogram() {
      this(null);
   }
   /**
    * Constructor for {@link DiscreteHistogram} instances.
    * @param container An entity which contains this histogram.
    * @param distribution The {@link DiscreteHistogram} whose set of values this histogram
    * is to follow.
    */
   public DiscreteHistogram(WriteableEntity container, DiscreteDistribution distribution) {
      this(container);
      int low = Integer.MAX_VALUE;
      int high = Integer.MIN_VALUE;
      Iterator<DiscreteDistributionItem> iterator = distribution.iterateItems();
      while (iterator.hasNext()) {
         DiscreteDistributionItem item = iterator.next();
         int id = item.getID();
         if (id < low) low = id;
         if (id > high) high = id;
      }
      setRange(low, high);
   }
   private int regularize(int value) {
      int minRange = minRange();
      int maxRange = maxRange();
      if (value < minRange || value > maxRange)
         throw new IllegalArgumentException("Value [" + value + "] outside range from " + minRange + " to " + maxRange);
      return value - minRange;
   }
   @Override
   public boolean setRange(Integer minRange, Integer maxRange) {
      if (maxRange < minRange)
         throw new IllegalArgumentException("Invalid range from " + minRange + " to " + maxRange);
      return super.setRange(minRange, maxRange);
   }
   @Override
   protected int calculateItemTallies() {
      int minRange = minRange();
      int maxRange = maxRange();
      return maxRange - minRange + 1;
   }
   @Override
   public Integer minRange() {
      int minRange = super.minRange();
      if (minRange == Integer.MAX_VALUE) throw new UninitializedException("Min range not initialized");
      return minRange;
   }
   @Override
   public Integer maxRange() {
      int maxRange = super.maxRange();
      if (maxRange == Integer.MIN_VALUE) throw new UninitializedException("Min range not initialized");
      return maxRange;
   }
   @Override
   public void analyze(Integer value) {
      super.denormalize();
      int index = regularize(value);
      super.incrementTally(index);
      super.incrementSumOfSamples(value);
   }
   /**
    * Get the instance count  associated with the indicated value.
    * @param value The indicated value.
    * @return The instance count specified by the indicated index.
    * @throws IllegalArgumentException when the value is out of range from {@link DiscreteHistogram#minRange} to {@link DiscreteHistogram#maxRange}.
    */
   public long itemTally(int value) {
      int index = regularize(value);
      return super.itemTally(index);
   }
   /**
    * Determine the largest point density.
    * @return The largest point density.
    * @throws UnsupportedOperationException when the histogram is not normalized.
    */
   public double getMaxDensity() {
      return super.maxInstances() / (double) getSampleCount();
   }
   @Override
   public double densityAt(Integer value) {
      int index = regularize(value);
      return itemTally(index) / (double) getSampleCount();
   }
   @Override
   public double densityUpTo(Integer value) {
      int index = regularize(value);
      return (itemTally(index) + itemSum(index)) / getSampleCount();
   }
   @Override
   public Integer quantile(double driver) {
      if (driver < 0. || driver >= 1.) {
         throw new IllegalArgumentException("Driver outside range from zero (inclusive) to unity (exclusive)");
      }
      double chooser = getSampleCount() * driver;
      for (int index = 0; index < itemCount(); index++) {
         double instanceTally = itemTally(index);
         if (chooser < instanceTally) {
            int min = minRange();
            int max = maxRange();
            return ((max-min)*index) + min;
         }
         chooser -= instanceTally;
      }
      throw new RuntimeException("Something is wrong");
   }
}
Listing 2: The DiscreteHistogram implementation class.

Discrete Range

The type hierarchy for DiscreteHistogram is:

Construction — The range of a DiscreteHistogram instance can run from any integer value to another larger integer. The method calculateItemTallies() infers from these two facts the length of the instances array, while the regularize() converts any range value into a valid lookup index for instances.

Analysis — The single-argument method analyze(T value) declared abstractly by HistogramBase<T> assumes the concrete signature analyze(Integer value) here in the implementation class. A call to HistogramBase.denormalize() reverts from consumption phase back to analysis phase, should that be necessary. The regularize() call converts the range value argument into a lookup index for instances, which then is acted upon by incrementTally(). The sample value itself is passed to incrementSumOfSamples(), which incidentally also increments the sum of squared samples.

Consumption — The instances array provides the same granularity as DiscreteDistribution: each range value has an associated population count. Method densityAt(Integer value) calculates a relative density as the quotient of the absolute tally from itemTally() and the count of samples; it also does the necessary work of converting range values into array indices — which is why implementation is left to the range-specific subclasses.

Method densityUpTo(Integer value) does the necessary work of converting range values into array indices — which is why implementation is left to the range-specific subclasses. It adds the indexed sum element from itemSum() to the indexed instances element from itemTally(), then divides the result by the count of samples.

Method quantile(double driver) returns an Integer range value. A long chooser variable is calculated as the driver value times the count of samples. Next step is a binary search to identify the largest sum array element which is still ≤ the chooser. Finally, an arithmetic offset converts the identified array index to a discrete range value.

/**
 * The {@link ContinuousHistogram} class divides a continuous range into equal-sized regions and analyzes
 * how many samples in a population fall within each region.
 * @author Charles Ames
 */
public class ContinuousHistogram extends HistogramBase<Double> {
   /**
    * Number of regions into which the declared range is to be divided.
    * Must be positive; 8 or more recommended.
    */
   int itemCount;
   /**
    * Item index for deserializing histogram tallies.
    */
   private int itemIndex = Integer.MIN_VALUE;
   /**
    * Constructor for {@link ContinuousHistogram} instances.
    * @param container An entity which contains this histogram.
    */
   public ContinuousHistogram(WriteableEntity container) {
      super(container);
      super.setRange(Double.NaN, Double.NaN);
      this.itemCount = Integer.MIN_VALUE;
   }
   /**
    * Constructor for {@link ContinuousHistogram} instances.
    */
   public ContinuousHistogram() {
      this(null);
   }
   @Override
   public boolean setRange(Double minRange, Double maxRange) {
      double oldMin = super.minRange();
      double oldMax = super.maxRange();
      if (!Double.isNaN(oldMin) || !Double.isNaN(oldMax)) {
         throw new UnsupportedOperationException("Range previously initialized");
      }
      if (maxRange - minRange < MathMethods.TINY)
         throw new IllegalArgumentException("Invalid range from " + MathMethods.formatDouble(minRange,3) + " to " + MathMethods.formatDouble(maxRange,3));
      return super.setRange(minRange, maxRange);
   }
   /**
    * Rescale the argument so that instead of ranging from {@link #minRange} to {@link #maxRange}
    * it return value ranges from zero to unity.
    * @param value A double value which ranges from {@link #minRange} to {@link #maxRange}.
    * @return A double value which ranges from zero to unity.
    * @throws IllegalArgumentException when the value falls outside the range from {@link #minRange} to {@link #maxRange}.
    */
   private double regularize(double value) {
      double minRange = minRange();
      double maxRange = maxRange();
      if (value < minRange || value > maxRange)
         throw new IllegalArgumentException("Value [" + value + "] outside range from " + minRange + " to " + maxRange);
      return (value - minRange) / (maxRange - minRange);
   }
   @Override
   public Double minRange() {
      double minRange = super.minRange();
      if (Double.isNaN(minRange)) throw new UninitializedException("Min range not initialized");
      return minRange;
   }
   @Override
   public Double maxRange() {
      double maxRange = super.maxRange();
      if (Double.isNaN(maxRange)) throw new UninitializedException("Min range not initialized");
      return maxRange;
   }
   @Override
   protected int calculateItemTallies() {
      if (0 >= itemCount) throw new UninitializedException("Item count not initialized");
      return itemCount;
   }
   /**
    * Setter for {@link #itemCount}.
    * @param itemCount The intended {@link #itemCount} value.
    * @return True if the item count has changed; false otherwise.
    * @throws IllegalArgumentException when the argument is not positive.
    */
   public boolean setItemCount(int itemCount) {
      if (itemCount < 1) throw new IllegalArgumentException("Argument not positive");
      if (this.itemCount != itemCount) {
         this.itemCount = itemCount;
         makeDirty();
         return true;
      }
      return false;
   }
   /**
    * Getter for {@link #itemCount}.
    * @return The assigned {@link #itemCount} value.
    */
   public int itemCount() {
      if (Integer.MAX_VALUE == this.itemCount) throw new UninitializedException("Item count not initialized");
      return itemCount;
   }
   @Override
   public void analyze(Double value) {
      super.denormalize();
      int count = itemCount();
      int index = (int) (count * regularize(value));
      if (count == index) --index;
      super.incrementTally(index);
      super.incrementSumOfSamples(value);
   }
   @Override
   public Double quantile(double driver) {
      if (driver < 0. || driver >= 1.) {
         throw new IllegalArgumentException("Driver outside range from zero (inclusive) to unity (exclusive)");
      }
      double chooser = getSampleCount() * driver;
      int index = 0;
      double residue = Double.NaN;
      while (index < itemCount()) {
         double itemTally = itemTally(index);
         if (chooser < itemTally) {
            residue = (itemTally - chooser) / getSampleCount();
         }
         chooser -= itemTally;
         index++;
      }
      double min = minRange();
      double max = maxRange();
      return ((max-min)*(index + residue)) + min;
   }
   @Override
   public double densityAt(Double value) {
      double min = minRange();
      double max = maxRange();
      double u = (value-min)/(max-min);
      int index = (int) (itemCount() * u);
      return itemTally(index)/(double) getSampleCount();
   }
   @Override
   public double densityUpTo(Double value) {
      double min = minRange();
      double max = maxRange();
      double u = itemCount() * (value-min)/(max-min);
      int index = (int) u;
      double sum = itemSum(index);
      double tally = itemTally(index)*(u-index);
      return (sum+tally)/getSampleCount();
   }
}
Listing 3: The ContinuousHistogram implementation class.

Continuous Range

The type hierarchy for ContinuousHistogram is:

Construction — The range of a ContinuousHistogram instance can run from any real-number value to another larger real-number. Since the the length of the instances array cannot be inferred from these two facts, the end user must provide this information using an explicity call to setItemCount(). The method calculateItemTallies() simply passes that number back to HistogramBase. Method regularize() normalizes range values into values from zero to unity.

Analysis — The single-argument method analyze(T value) declared abstractly by HistogramBase<T> assumes the concrete signature analyze(Double value) here in the implementation class. A call to HistogramBase.denormalize() reverts from consumption phase back to analysis phase, should that be necessary. The regularize() call normalizes the range-value argument into a value from zero to unity. Method analyze(Double value) converts this normalized value into a lookup index for instances, which then is acted upon by incrementTally(). The continuous range value is converted to a lookup index by scaling it to itemCount, then truncating the integer part, which should always produce a smaller number, right? Well, in my experience, no; hence the extra check for count = index. The sample value itself is passed to incrementSumOfSamples(), which incidentally also increments the sum of squared samples.

Consumption — For ContinuousHistogram, the instances array associates population counts with intervals. This provides a coarser fit to the probability density function (PDF) than does ContinuousDistribution, where each interval is represented by a ContinuousDistributionItem which interpolates between an origin and a goal. The fit can be improved by increasing the number of intervals, but this only makes sense for longer source sequence lengths, when the average number of samples per interval exceeds, say, 10. Method densityAt(Double value) calculates a relative density as the quotient of the absolute tally from itemTally() and the count of samples; it also does the necessary work of converting range values into array indices — which is why implementation is left to the range-specific subclasses.

Method densityUpTo(DarkBlue value) does the necessary work of converting range values into array indices — which is why implementation is left to the range-specific subclasses. It starts with the indexed sum element; however it is necessary to pro-rate the indexed instances element, based upon the sample value's location within the interval. The method adds the sum element to the pro-rated instances element, then divides the result by the count of samples.

The return value from quantile(double driver) has type T element, which can be either Integer or Double. A long chooser variable is calculated as the driver value times the count of samples. Next step is a binary search to identify the largest sum array element which is still ≤ the chooser. The identified array index specifies which equal-sized interval will contain the result, but the result still needs to be placed within the interval. The pro-ratio is calculated by subtracting the identified sum element from the chooser, then dividing the remainder by the corresponding instances element.

This topic area of charlesames.net focuses on sequence generation. The ContinuousHistogram class can play a direct role in sequence generation, where it can analyze a non-uniform driver sequence, then reverse the inferred distribution to flatten things out while retaining the up-and-down shape.

© Charles Ames Page created: 2022-08-29 Last updated: 2022-08-29