Bayesian inference is widely thought to have originated in the work of
Thomas Bayes, an English statistician and philosopher, on probability theory.
His paper, “An Essay Towards Solving a Problem in the Doctrine of Chances”, was
published posthumously in 1763 by Richard Price (Fienberg 2006). It contained
groundbreaking work on conditional probability and a number of interesting
propositions, but its most influential part was his examination of the problem,
“Given the number of times in which an unknown event has happened and failed: Required
the chance that the probability of its happening in a single trial lies somewhere
between any two degrees of probability that can be named” (Bayes 1763).
Bayes’ paper introduced a special continuous case of what is
today known as Bayes’ Theorem, but it never stated the general formula for the
theorem. His special case used a uniformly distributed prior: he took
what he thought the probability of something was, now known as the
prior, and combined it with new data to produce an improved result, the posterior.
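
As an illustration of the problem Bayes posed, the following is a minimal Python sketch, not a reconstruction of his own method. It assumes a uniform prior on the unknown probability theta, so that after a number of successes and failures the posterior is proportional to theta^successes * (1 - theta)^failures. The function name, its parameters and the midpoint-rule integration are choices made for this example only.

```python
def posterior_interval_probability(successes, failures, lower, upper, steps=10_000):
    """Chance that theta lies in (lower, upper) given the data, under a uniform prior."""

    def unnormalised(theta):
        # Uniform prior (constant) times the binomial likelihood kernel.
        return theta ** successes * (1.0 - theta) ** failures

    def integrate(a, b):
        # Midpoint rule: simple, and accurate enough for this smooth integrand.
        width = (b - a) / steps
        return width * sum(unnormalised(a + (i + 0.5) * width) for i in range(steps))

    # Normalise by the integral over the whole range of theta.
    return integrate(lower, upper) / integrate(0.0, 1.0)


# Example: an event has happened 7 times and failed 3 times.
chance = posterior_interval_probability(7, 3, 0.5, 1.0)
print(f"P(0.5 < theta < 1 | data) = {chance:.4f}")
```

With no data at all (zero successes and zero failures) the function simply returns the width of the interval, which is the uniform prior itself.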
Many notable statisticians had serious issues with Bayes’
choice of prior distribution, which they saw as very close to guesswork,
and as a result his paper gained little traction in the community at
the time (Lee 1988). There has also been a lot of controversy over the
naming of Bayes’ Theorem, as there is evidence to suggest that Bayes’ work alone was
not enough for him to be credited as its founder and that Price was the one
who understood the significance of the notes he received from
the late Bayes, which he then used to produce a publishable paper.
Headway was made again when the French scientist
Pierre-Simon Laplace, now widely considered to be the world’s first Bayesian, independently
released “Mémoire sur la probabilité des causes par les événements”
about a decade after Bayes’ paper was published, shedding new light on
probability theory and Bayesian statistics (Laplace 1774). Like Bayes, Laplace
used uniform priors as his uninformative priors, though he justified
the assumption through his own central limit theorem and the principle of
insufficient reason, now known as the principle of indifference, rather than
through the mathematical simplicity that had motivated Bayes.
He went on to introduce the first conjugate priors and a
general version of Bayes’ theorem which worked for continuous as well as
discrete data and even cases with multiple parameters. His methods became known
later as Inverse Probability, due to the direction he worked, from effect to
cause (Fienberg 2006).
In the early 20th century, many statisticians were probing into areas
rather different from Bayesian methods. Sir Ronald Fisher, an English
statistician and geneticist, argued against inverse probability because of its
dependence on the choice of ignorance prior. He began working on
new methods, which resulted in the 1922 publication of his groundbreaking paper,
“On the Mathematical Foundations of Theoretical Statistics”.
This paper profoundly changed
statistical thinking and practice by introducing the notion of likelihood,
which led directly to maximum likelihood estimators, alongside basic tests of
significance, new forms of variance analysis and randomization methods not
previously seen. Throughout the paper, terms such as sufficiency, efficiency and parameter
were used for the first time (Fisher 1922).
Fisher’s paper cast a shadow of sorts over Bayesian statistics
for several years and resulted in a decline in its study and usage. Numerous
statisticians, such as Jerzy Neyman and Egon Pearson, developed various
frequentist methods, such as hypothesis testing and confidence intervals, primarily
stemming from Fisher’s work (Fienberg 2006, Fisher 1921).
Bayesian statistics found new life when the early
ideas of Laplace resurfaced and split into two separate paths, objective
and subjective probability.
Through the 1920s, Bruno de Finetti developed ideas on
subjective probability and exchangeability in Italy, whilst Frank
Ramsey independently did the same in England; they released books in 1930 and 1931
respectively (Fienberg 2006).
Objective Bayesian inference was making headway with Sir Harold
Jeffreys at its front. He published a book, “Theory of Probability”,
in 1939, which was crucial to what is now known as the Bayesian revival. Jeffreys
used an invariance approach to derive objective ignorance
priors, allaying some of the major concerns anti-Bayesians had about
using only a constant prior distribution while also making major headway in inverse
probability (Fienberg 2006 2nd).
Views were still largely focused on frequentist methods, but
throughout World War Two Alan Turing, Irving Jack Good and others continued
Jeffreys’ work and pioneered applied Bayesian statistics at Bletchley Park.
Turing, Good and other members of Hut 8 used Bayesian
inference for the crucial deciphering of German messages, yielding intelligence
that helped to shorten and ultimately win the war. Their work and progress as
Bayesians was not declassified until the middle of the 1970s.
It was around this time that frequentist methods truly became
overshadowed, as statisticians such as Dennis Lindley published works devoted
wholly to Bayesian inference which gained significant recognition globally. “Introduction
to Probability and Statistics from a Bayesian Viewpoint” and “The
Future of Statistics – a Bayesian 21st Century”, published in
1965 and 1975 respectively, were two such works which had a
large impact on the overall opinion of Bayesian statistics (Lindley 1965, 1975).
By the early 1990s computers were becoming less scarce and
more widely accessible, and were far more capable than their
predecessors of only a few years earlier. This meant that when computational techniques
called Markov Chain Monte Carlo methods emerged they were readily adopted and proved
pivotal in solving one of the few major problems Bayesian inference still
had, which was sampling from unusual distributions. Monte Carlo methods approximate
a posterior distribution by drawing an extremely large number of random
samples from it (Casella 2011).
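
To make the idea of sampling from a posterior concrete, here is a minimal sketch of a random-walk Metropolis sampler, one simple member of the Markov Chain Monte Carlo family. It is an illustration under stated assumptions rather than the method of any work cited here: the target is assumed to be the posterior from a uniform prior with 7 successes and 3 failures, and the function names, step size, burn-in length and seed are all hypothetical choices made for the example.

```python
import math
import random


def log_unnormalised_posterior(theta, successes=7, failures=3):
    """Log of theta^s * (1 - theta)^f: a uniform prior times a binomial likelihood."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # zero density outside the unit interval
    return successes * math.log(theta) + failures * math.log(1.0 - theta)


def metropolis(log_target, start, n_samples, step=0.2, burn_in=1_000, seed=0):
    """Random-walk Metropolis: returns n_samples draws after discarding burn_in."""
    rng = random.Random(seed)
    current = start
    current_log = log_target(current)
    draws = []
    for i in range(n_samples + burn_in):
        # Propose a small Gaussian move away from the current position.
        proposal = current + rng.gauss(0.0, step)
        proposal_log = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(current)).
        accept_probability = math.exp(min(0.0, proposal_log - current_log))
        if rng.random() < accept_probability:
            current, current_log = proposal, proposal_log
        if i >= burn_in:
            draws.append(current)
    return draws


samples = metropolis(log_unnormalised_posterior, start=0.5, n_samples=20_000)
posterior_mean = sum(samples) / len(samples)
print(f"Estimated posterior mean: {posterior_mean:.3f}")
```

The sampler only ever evaluates the posterior up to a constant, which is why the normalising integral never has to be computed. For this particular target the exact posterior mean is 8/12, so the estimate can be checked against a known value.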