A distribution is a way of summarizing the content of a population. Statisticians generally start with a
population, then seek to determine which distribution fits best. However, the
Transform
component of the
Driver/Transform pattern for sequence generation goes somewhat the other way:
It starts with a distribution and uses it to shape the intermediate output from the
Driver
component, whose
values presumably are uniform.
Rarely will the shaping distribution will be representable using an explicit
mathematical formula such as the familiar bell curve.
More common alternatives include the composition of the orchestra in Xenakis's Stochastic
Music Program, where the probability density for each timbre selection is a number calculated in response to the current
section density. Whether formulaic or numeric, the discipline of
probability theory provides the mathematical constructs
necessary both to describe the distribution (the probability
density function or PDF) and to transform uniform
Driver
values so that the results conform to said
distribution (by inverting the cumulative
density function or CDF). This page and its dependents explain
how to make these mathematical constructs functional as DistributionBase
subclasses. The distribution classes can be used outright,
but they can also be leveraged by Transform
classes when no closed-form inverse-CDF formula exists (the common situation).
But first. There's some terminology associated with the distribution concept, and the best way to explain the terms is with reference to an example taken from the statistician's perspective. Hence Figure 1.
Figure 1 shows the kind of situation which distributions can be helpful in summarizing. It in fact illustrates a Poisson Point Process, which is a scenario developed by probability theorists to model the experience of waiting for something to happen.^{1} Xenakis was familiar with this scenario; his Stochastic Music Program models both section durations and note-to-note waiting durations as Poisson Point Processes.
Taking a mass of data and boiling it down is the heavy lift of most statistical analysis. The Poisson Point Process treats two complementary domains (perspectives) in modeling situations like Figure 1, the time domain and the frequency domain. The time domain models the durations spent waiting between consecutive events. The frequency domain models how many events happen over each unit of time.
Figure 1 has 74 events but 75 waiting durations dispersed along 25 time units. At the highest level of summary, a statistician could calculate two averages: the average waiting duration (25/75=0.333) for the time domain, and the average number of events per time unit (74/25=2.960) for the frequency domain. These numbers may be very useful, but only to a point. They give no insight into the spread of larger and smaller values. Further summary statistics (e.g. standard deviation) can account for the spread, but for most of us its more insightful to determine the range of values, to identify where values are most concentrated within the range, and to identify where values are more sparse. In other words, we want to know the distribution.
Of the two complementary perspectives, the time-domain perspective involves the sequence of
75 samples detailed in the leftward graph of Figure 2. Remember that sequence is a
population presented in one specific order.
Since these clock-time durations have digits right of the decimal point,
we characterize the range of sample values as
continuous
and represent them using the Java datatype
double
.
The frequency-domain perspective involves the sequence of 25 segments detailed in
the leftward graph of Figure 3. Since these event counts never have digits right of the decimal
point, we characterize the range of sample values as
discrete
and represent them using the Java datatype
int
.
An important thing to understand about statistical distributions is that while many distributions are motivated by probability theory, probability does not figure in the fundamental nature of the concept. A distribution describes characteristics of a population whose size is abstracted as unity (100%). The describing focuses on the range of sample values, whether that range is discrete or continuous, and how sample values are weighted (high weights where values concentrate, low weights where values are sparse). No assumptions are made about the genesis of the population, which may be random, chaotic, or entirely deterministic.
Various constructs underpin the mathematical theories of probability and statistics. The notion of the probability domain provides an all-some-none ideal, while various mathematical functions (PDF, CDF, and quantile) enable us to translate back and forth between this ideal and the distribution's real-world range.
Normally in the general world of mathematics, the source set from which a function draws its input is called the domain, while the target set which receives the function's output is called the range. However, the more specialized world of probability and statistics speaks contrariwise, of functions drawing input from the range of a distribution (also called the sample space) and limiting output to the probability domain.
Probabilities are continuous. They are numbers with digits right of the decimal point (mostly zero to the left), and they
represented using a floating-point datatype
such as double
. Probabilities run from zero to unity. Here a further
bifurcation of worlds applies: In the chance world of probability theory, zero means impossibility (won't happen) while unity means
certainty (always happens). However in the chance-optional world of statistics, "probability" descends from the concept of a
percentage, which word has been around since Roman times.
It represents a proportion of a total population: zero (0%) means nothing; unity (100%) means everything.
For example, Figure 4 graphs population density around the state of California. The large low-population swaths on the lower right of the graph cover features like the Mojave desert and Death Valley, The red coastal blotch in the bottom-left county is San Diego, and the next blotch up the coast is the urban sprawl covering greater Los Angeles, Orange County, and stretching eastward to San Bernardino. The coastal cluster up north comprises the Bay Area cities of San Francisco, San Jose, and Oakland. The mid-state chain includes the Central Valley cities of (proceeding southward) Sacramento, Stockton, Fresno, and Bakersfield. The orange smear in the far north, mid way between the coast and the Nevada border and surrounded by National Forests, is Redding.
It would be possible to mark off the state of California into a mile-square grid, then count the number of individuals in each square, but that is not what happens. Instead the Census Bureau divides the United States into 8,057 tracts; in urban areas these tracts divide into block groups (23,212 of those), and the block groups further divide into blocks (710,145). The numbers given here source from the Census Bureau's 2010 information page for California. The Bureau is somehow able to tie every individual's residence address (which the most important fact it collects about people) to a tract, and where they are defined, to a block group and to a block. Census tracts, block groups, and blocks are accurately mapped; the Bureau precisely knows both how each geographical region is bounded and how many square miles it encloses. Wikipedia reports that localities cooperate with the Bureau in defining tracts, groups, and blocks (originally city blocks but no longer), so that municipal boundaries are respected and populations are "homogeneous". Wikipedia also passes along the comment that each tract averages "about 4,000 inhabitants" but Wikipedia does not clarify why this average would be known before the actual census is taken. However the uniform average does imply that tracts cover smaller areas in urban locations where population is dense and larger areas in rural locations where population is sparse.
The Census Bureau thus collects data on how many individuals live in each geographical region. However comparing raw population numbers is not directly meaningful between regions of differing size. Calculating the population density is a simple matter of dividing individuals by square miles — and densities are comparable.
Census regions being singular entities, the relationship between regions and population densities is a discrete distribution, that is, a distribution over a discrete range. However the larger the number of range values covered by a discrete distribution (in this case 710,145), the closer a discrete distribution comes to resemble a continuous distribution. A glance at Figure 4 confirms this. The visual impression is not of points, but rather of a surface with high red peaks descending to low green valleys.
Figure 5 drills down from California statewide (Figure 4) to the city of Fresno. The image details census blocks (outlined in light gray) — which are the smallest geographical regions for which census data is collected — also their enclosing census tracts (outlined in black). Each census block defines a prism whose base is the physical area, whose height is the population density, and whose volume (base × height) is the population enclosed within the region. Since the height of the prism measures the distance from the x-y plane up the z axis to the surface forming the red peaks and green valleys, we can properly speak of calculating the population as calculating the "volume" under the population-density "surface". Calculating the volume of the whole comes down to summing the volumes of the parts; thus calculating the population of each census tract in Figure 5 comes down to summing the volumes of the census blocks that make up the tract, while calculating the population of Fresno comes down to summing the volumes of its census tracts.
The probability density function or PDF associates every sample value in the distribution's range to a probability density ranging from zero to unity. Probability densities (or just "probabilities") are population densities which have been normalized to abstract away the specific population size. For example, the 2010 population of Fresno, CA was 531,576 spread out over 112 square miles. The population density calculates as 531,576 / 112 = 4746, the probability density is that amount divided by the 2010 population of California: 4746 / 39,512,223 = 0.00012 and the probability that an individual chosen at random from the 2010 California tax rolls will reside in Fresno is 0.00012 × 112 = 0.013 or 1.3%.
A distribution is formally specified by its PDF. Summary statistics such as the mean (average) and standard deviation are contingent on the PDF, as are the CDF and quantile function described below. A PDF will be either discrete or continuous, depending on the distribution's range:
The cumulative distribution function or CDF returns the probability (zero to unity) that any other sample value will be less than or equal to sample value given as input. In particular, CDF(b) - CDF(a) gives the probability that a sample value v will fall between a and b, where a < b.
I hinted previously that the CDF enriches the population-based interpretation of values in the probability domain. This insight pertains to continuous distributions along a single axis. Suppose there's a sample value v such that CDF(v) = 0.5. Then half of the population upon which the distribution is based will have values less than v, and the other half of the population will have values greater than v. But that means that v must be the average sample value. Now suppose there are two sample values: a such that CDF(a) = 0.5-0.341 and b such that CDF(b) = 0.5+0.341. Then the range from a to b embraces 0.341+0.341 = 0.682 = 68.2% of the population. When the distribution is a bell curve, this corresponds to the interval within ±1 standard deviation of the average. (The probability difference for the 2nd standard deviation reduces from 34.1 to 13.6).
Knowing the CDF is a necessary prerequisite for transforming driver values into range values (see the quantile function below). However, the concept becomes problematic when a distribution ranges continuously over an area — for example, over the state of California. It is not meaningful to say that one location is "greater" than another when both north-south and east-west axes must be taken into account. One workaround would be to calculate a separate CDF can for each axis. A second workaround would be to granularize the area, much as the Census Bureau granularizes the United States into tracts. This would transform the range of the distribution from a continuous area into a discrete collection of regions, and each region can be assigned an identification number. A CDF can work with that.
Implementing a CDF programmatically requires one simple enhancement to the collections already
used to implement the PDF. That enhancement is to add a sum to each distribution item, which
sum will accumulate the weight of the current item (for ContinuousDistributionItem
instances, that means the trapezoidal area) with weights of all the item's predecessors. Calculating the
CDF(v) for a specific sample value v first requires
finding which collection item pertains to v. If the distribution is discrete, that's it. If
the distribution is continuous, there's an interpolation formula which I explain on the page devoted specifically to
continuous distributions.
The inverse cumulative distribution function, indicated CDF^{-1}, is also called the quantile function. This function maps every value q in the range from zero to unity to the sample value v for which q is the proportion of sample values ≤ v. For example, if the population embraces possible scores on the PSAT, then CDF^{-1}(0.95) gives the score which places a student in the top 5% (95th percentile) of test takers. The population-based interpretation of the probability domain coincides with what the Driver/Transform pattern refers to as the driver domain. The range from zero to unity could alternatively be designated the quantile domain: I have not seen this term used, but it is entirely appropriate.
The quantile function is also used by statisticians to calculate confidence intervals. This is how a statistician can claim to be 90% certain of an assertion.
Inverting the CDF programatically requires no more data support than what was described for the CDF itelf. To start out, my code calculates a chooser by multiplying the driver by the total weight of all distribution items combined. The code then identifies the distribution item whose cumulative-sum-of-weights is minimally larger than or equal to the chooser. If the distribution is discrete, that's it. If the distribution is continuous, then the chooser is localized to the current item (by subtracting off the next lower sum-of-weights). It then comes down to applying the CDF^{-1} for a trapezoidal distribution, which happily exists in closed form. For details see the page devoted specifically to continuous distributions.
If the distribution's range is two dimensional, then the challenge becomes mapping a driver value to two separate range coordinates. I previously mentioned two workarounds. One workaround constructed one CDF for the x-axis and a separate CDF for the y-axis. This workaround requires separate driver inputs for each coordinate. The other workaround granularized the range. In this circumstance the path from driver value to grain is the same as any other a discrete CDF. This still leaves selecting two coordinates within the interior of the already-selected grain. The residual assumption that interior probilities are uniform is doubtful, but without more information than the total population of the grain, there's no better assumption.
I don't have experience using distributions to populate two-dimensional areas, so the advice I have to offer is sometimes-educated speculation. Here are some options:
DistributionBase
base class.
The abstract DistributionBase
class presented as Listing 1
acts much like a Java interface; for the most part it enforces
commonality of method calls. However implementing this class as an interface would have required making it public, and
would therefore have tempted package consumers to reference it.
Details of implementation are so different that it is not worth consolidating
any functionality into a shared generic base class. The abstraction falls short to the extent that when using this
package in an application, one cannot manage a distribution's contents through
DistributionBase
. Instead one needs to work directly with
the implementation classes: DiscreteDistribution
and
ContinuousDistribution
.
The first thing to notice in Listing 1 is the generic parameter T
which
extends java.lang.Number
.
Number
is an abstract base class whose implementing sub-classes include
Integer
and
Double
.
In practice, T
can be either of these. DistributionBase
has two concrete subclasses.
DiscreteDistribution
extends
DistributionBase<Integer>
,
while ContinuousDistribution
extends
DistributionBase<Double>
.
These concrete distribution classes each implement a single-tier composite design pattern
where a single parent object extending DistributionBase
stands for the
whole, while multiple child items detail the parts. Package consumers typically interact with the parent only: they
define a contour's content by calling some variant of addItem()
,
they translate values into probability densities by calling densityAt()
(implementing
the PDF) or densityUpTo()
(implementing
the CDF), and they translate probability densities back into range values
by calling quantile()
(implementing
the quantile function).
An exception to only parent-only access would be for a graphic editor, which would need item details in order to render graphs efficiently. However,
bypassing addItem()
can seriously break things, so item access needs to be accomodated in a read-only way.
The designs assume that distributions is refreshable, so that the life cycle alternates between a construction phase, when the items are appended in sequential order and a Consumption phase when facts about the distribution may be retrieved.
Construction begins either when the distribution instance is allocated using the new operator, or after call to
initialize()
signifies that the distribution is to be refreshed.
This phase iterates through calls to some flavor of addItem()
(see below), and completes with a call to
normalize()
.
The addItem()
method is not mentioned by DistributionBase
because DiscreteDistributionItem
differs essentially from
ContinuousDistributionItem
:
DiscreteDistributionItem
maps a single integer value to a single weight.
ContinuousDistributionItem
describes a trapezoid with its base (an interval of real numbers)
aligned horizontally to the range axis and its sides aligned vertically to the weight axis.
Notice that the measure of relative concentration in an item is "weight" rather than "probability density". Rather than taking
all the trouble to ensure that probabilities always sum to unity, my implementation maintains a
sumOfWeights
field at the distribution level.
Calls to densityAt()
convert weights to probability densities by
dividing the 'raw' weight associated with an individual item by sumOfWeights
.
The DistributionBase
superclass prescribes a
getSumOfWeights()
method for read-only access to the
sumOfWeights
field.
Consumption begins after normalize()
returns; it encompasses calls to any of
densityAt(T value)
querys the PDF, returning a double.
densityUpTo(T value)
querys the CDF), returning a double.
quantile(double driver)
querying the quantile function), returning a T.
Because the sum of weights is maintained on the fly, it is allowable — though not advisable — to call
densityAt()
while a distribution is still under construction.
The same is not true of densityUpTo()
, since implementing
the CDF and quantile function
require each item to include a cumulative sum of weights. The cumulative sums are calculated
in batch by the normalize()
method, which should only be invoked
once the complete range of the distribution has been accounted for with addItem()
calls.
The DistributionBase
superclass provides for
normalize()
, and it also prescribes an
isNormalized()
method to verify that this step has
taken place. (If not, then densityUpTo()
and quantile()
) will throw an
UnsupportedOperationException
.
The initialize()
method is intended to be used in the following
way. You have a mechanism for calculating probability densities, and from time to time the probabilities
need to be recalculated. Perhaps a parameter changes, or one gets swapped out for another.
One option is to discard the existing distribution instance
and create a new one. A less disruptive option is to call initialize()
on the
existing instance and rebuild the distribution items in place.
Random
generator (25,214,903,917).
© Charles Ames | Page created: 2022-08-29 | Last updated: 2022-08-29 |