Probability is used to quantify an attitude of mind towards some uncertain proposition. The proposition of interest is usually of the form “Will a specific event occur?” The attitude of mind is of the form “How certain are we that the event will occur?” This certainty can be described by a numerical measure, a number between 0 and 1 (where 0 indicates impossibility and 1 indicates certainty), which we call probability. Thus the higher the probability of an event, the more certain we are that the event will occur. A simple example is the toss of a fair coin. Since the two outcomes are deemed equiprobable, the probability of “heads” equals the probability of “tails”, and each probability is 1/2, or equivalently a 50% chance of either “heads” or “tails”.
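The fair-coin reasoning can be checked by direct counting. The sketch below is only an illustration, not part of any standard library: the helper name `classical_probability` is made up here, and it assumes every listed outcome is equally likely. It uses exact fractions so that 1/2 is not blurred by floating-point rounding, and it extends the same counting to two tosses.

```python
from fractions import Fraction
from itertools import product

def classical_probability(outcomes, is_favorable):
    """Favorable outcomes divided by all outcomes, assuming equal likelihood."""
    outcomes = list(outcomes)
    favorable = sum(1 for outcome in outcomes if is_favorable(outcome))
    return Fraction(favorable, len(outcomes))

# One toss of a fair coin: two equiprobable outcomes.
p_heads = classical_probability(["H", "T"], lambda o: o == "H")

# Two tosses: four equiprobable ordered outcomes (HH, HT, TH, TT).
two_tosses = list(product("HT", repeat=2))
p_two_heads = classical_probability(two_tosses, lambda o: o == ("H", "H"))

print(p_heads, p_two_heads)  # prints: 1/2 1/4
```

The counting rule applies only when the outcomes really are equiprobable; for a biased coin it would give the wrong answer.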
These concepts have been given an axiomatic mathematical formalization in probability theory (see probability axioms), which is used widely in such areas of study as mathematics, statistics, finance, gambling, science (in particular physics), artificial intelligence/machine learning, computer science, and philosophy to, for example, draw inferences about the expected frequency of events. Probability theory is also used to describe the underlying mechanics and regularities of complex systems.
When dealing with experiments that are random and well-defined in a purely theoretical setting (like tossing a fair coin), probabilities can be described numerically as the number of outcomes considered favorable divided by the total number of outcomes, provided all outcomes are equally likely. For example, tossing a fair coin twice will yield head-head with probability 1/4, because the four outcomes head-head, head-tails, tails-head and tails-tails are equally likely to occur. When it comes to practical application, however, there are two major competing categories of probability interpretations, whose adherents hold different views about the fundamental nature of probability:
- Objectivists assign numbers to describe some objective or physical state of affairs. The most popular version of objective probability is frequentist probability, which claims that the probability of a random event denotes the relative frequency of occurrence of an experiment’s outcome, when repeating the experiment. This interpretation considers probability to be the relative frequency “in the long run” of outcomes. A modification of this is propensity probability, which interprets probability as the tendency of some experiment to yield a certain outcome, even if it is performed only once.
- Subjectivists assign numbers per subjective probability, i.e., as a degree of belief. The degree of belief has been interpreted as, “the price at which you would buy or sell a bet that pays 1 unit of utility if E, 0 if not E.” The most popular version of subjective probability is Bayesian probability, which includes expert knowledge as well as experimental data to produce probabilities. The expert knowledge is represented by some (subjective) prior probability distribution. The data is incorporated in a likelihood function. The product of the prior and the likelihood, normalized, results in a posterior probability distribution that incorporates all the information known to date. Starting from arbitrary, subjective probabilities for a group of agents, some Bayesians claim that all agents will eventually have sufficiently similar assessments of probabilities, given enough evidence (see Cromwell’s rule).
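The Bayesian recipe described above (multiply the prior by the likelihood, then normalize) can be sketched for a finite set of hypotheses. Everything specific in this example is an assumption made for illustration: the two hypotheses, the bias of 3/4, the evenly split prior, and the function name `posterior`.

```python
from fractions import Fraction

def posterior(prior, likelihood):
    """Multiply prior by likelihood for each hypothesis, then normalize."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(unnormalized.values())  # normalizing constant
    return {h: weight / evidence for h, weight in unnormalized.items()}

# Hypothetical setup: a coin is either fair or biased towards heads.
prior = {"fair": Fraction(1, 2), "biased": Fraction(1, 2)}
p_heads = {"fair": Fraction(1, 2), "biased": Fraction(3, 4)}

# Hypothetical data: three heads in a row. The likelihood is the
# probability of that data under each hypothesis.
likelihood = {h: p_heads[h] ** 3 for h in prior}

post = posterior(prior, likelihood)
print(post["fair"], post["biased"])  # prints: 8/35 27/35
```

After three heads the belief shifts towards the biased coin, but the fair coin keeps a non-zero posterior, so later evidence can still move the assessment back.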
The word probability derives from the Latin probabilitas, which can also mean “probity”, a measure of the authority of a witness in a legal case in Europe, and often correlated with the witness’s nobility. In a sense, this differs much from the modern meaning of probability, which, in contrast, is a measure of the weight of empirical evidence, and is arrived at from inductive reasoning and statistical inference.
The scientific study of probability is a modern development. Gambling shows that there has been an interest in quantifying the ideas of probability for millennia, but exact mathematical descriptions arose much later. There are reasons, of course, for the slow development of the mathematics of probability. Whereas games of chance provided the impetus for the mathematical study of probability, fundamental issues are still obscured by the superstitions of gamblers.
According to Richard Jeffrey, “Before the middle of the seventeenth century, the term ‘probable’ (Latin probabilis) meant approvable, and was applied in that sense, univocally, to opinion and to action. A probable action or opinion was one such as sensible people would undertake or hold, in the circumstances.” However, in legal contexts especially, ‘probable’ could also apply to propositions for which there was good evidence.
Like other theories, the theory of probability is a representation of probabilistic concepts in formal terms—that is, in terms that can be considered separately from their meaning. These formal terms are manipulated by the rules of mathematics and logic, and any results are interpreted or translated back into the problem domain.
There have been at least two successful attempts to formalize probability, namely the Kolmogorov formulation and the Cox formulation. In Kolmogorov’s formulation (see probability space), sets are interpreted as events and probability itself as a measure on a class of sets. In Cox’s theorem, probability is taken as a primitive (that is, not further analyzed) and the emphasis is on constructing a consistent assignment of probability values to propositions. In both cases, the laws of probability are the same, except for technical details.
There are other methods for quantifying uncertainty, such as the Dempster–Shafer theory or possibility theory, but those are essentially different and not compatible with the laws of probability as usually understood.
Relation to randomness
In a deterministic universe, based on Newtonian concepts, there would be no probability if all conditions were known (Laplace’s demon), although there are situations in which sensitivity to initial conditions exceeds our ability to measure, and hence to know, them. In the case of a roulette wheel, if the force of the hand and the period of that force are known, the number on which the ball will stop would be a certainty (though as a practical matter, this would likely be true only of a roulette wheel that had not been exactly levelled, as Thomas A. Bass’ Newtonian Casino revealed). Of course, this also assumes knowledge of the inertia and friction of the wheel, the weight, smoothness and roundness of the ball, variations in hand speed during the turning, and so forth. A probabilistic description can thus be more useful than Newtonian mechanics for analyzing the pattern of outcomes of repeated spins of a roulette wheel. Physicists face the same situation in the kinetic theory of gases, where the system, while deterministic in principle, is so complex (with the number of molecules typically of the order of magnitude of the Avogadro constant, 6.02×10^23) that only a statistical description of its properties is feasible.
Probability theory is required to describe quantum phenomena. A revolutionary discovery of early 20th-century physics was the random character of all physical processes that occur at sub-atomic scales and are governed by the laws of quantum mechanics. The objective wave function evolves deterministically but, according to the Copenhagen interpretation, it yields only probabilities for the outcomes of an observation, the observed outcome being explained by a wave function collapse when the observation is made. However, the loss of determinism for the sake of instrumentalism did not meet with universal approval. Albert Einstein famously remarked in a letter to Max Born: “I am convinced that God does not play dice”. Like Einstein, Erwin Schrödinger, who discovered the wave function, believed quantum mechanics is a statistical approximation of an underlying deterministic reality. In modern interpretations, quantum decoherence accounts for subjectively probabilistic behavior.
In Kolmogorov’s probability theory, the probability P of some event E, denoted P(E), is usually defined such that P satisfies the Kolmogorov axioms, named after the famous Russian mathematician Andrey Kolmogorov, which are described below.
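For a finite sample space, the three Kolmogorov axioms (non-negativity, total probability 1, and additivity over disjoint events) can be checked exhaustively. The sketch below is a check under stated assumptions rather than a general implementation: the sample space of two fair coin tosses and the names `omega`, `weight`, `P` and `all_events` are illustrative choices, and the third axiom, countable additivity in general, reduces to finite additivity here.

```python
from fractions import Fraction
from itertools import chain, combinations, product

# Sample space: ordered outcomes of two fair coin tosses (illustrative choice).
omega = frozenset(product("HT", repeat=2))
weight = {outcome: Fraction(1, 4) for outcome in omega}

def P(event):
    """Probability of an event, i.e. of a subset of the sample space."""
    return sum((weight[o] for o in event), Fraction(0))

def all_events(space):
    """Every subset of a finite sample space."""
    items = list(space)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return [frozenset(s) for s in subsets]

events = all_events(omega)

# Axiom 1: probabilities are non-negative.
non_negative = all(P(e) >= 0 for e in events)
# Axiom 2: the whole sample space has probability 1.
normalized = P(omega) == 1
# Axiom 3 (finite case): probabilities of disjoint events add.
additive = all(
    P(a | b) == P(a) + P(b)
    for a in events for b in events if not (a & b)
)

print(non_negative, normalized, additive)  # prints: True True True
```

Here sets play the role of events and `P` is a measure on the class of all subsets, which matches the set-based reading of Kolmogorov's formulation.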
Cox’s theorem, named after the physicist Richard Threlkeld Cox, is a derivation of the laws of probability theory from a certain set of postulates. This derivation justifies the so-called “logical” interpretation of probability. As the laws of probability derived by Cox’s theorem are applicable to any proposition, logical probability is a type of Bayesian probability. Other forms of Bayesianism, such as the subjective interpretation, are given other justifications.
Cox wanted his system to satisfy the following conditions:
- Divisibility and comparability – The plausibility of a statement is a real number and is dependent on information we have related to the statement.
- Common sense – Plausibilities should vary sensibly with the assessment of plausibilities in the model.
- Consistency – If the plausibility of a statement can be derived in many ways, all the results must be equal.
The postulates as originally stated by Cox were not mathematically rigorous (although more precise than the informal description above), as noted, for example, by Halpern. However, it appears to be possible to augment them with various mathematical assumptions made either implicitly or explicitly by Cox to produce a valid proof.
Interpretation and further discussion
Cox’s theorem has come to be used as one of the justifications for the use of Bayesian probability theory. For example, in Jaynes it is discussed in detail in chapters 1 and 2 and is a cornerstone for the rest of the book. Probability is interpreted as a formal system of logic, the natural extension of Aristotelian logic (in which every statement is either true or false) into the realm of reasoning in the presence of uncertainty.
It has been debated to what degree the theorem excludes alternative models for reasoning about uncertainty. For example, if certain “unintuitive” mathematical assumptions were dropped, then alternatives could be devised, such as the example provided by Halpern. However, Arnborg and Sjödin suggest additional “common sense” postulates, which would allow the assumptions to be relaxed in some cases while still ruling out the Halpern example. Other approaches were devised by Hardy and by Dupré and Tipler.
The original formulation of Cox’s theorem is in Cox (1946), which is extended with additional results and more discussion in Cox (1961). Jaynes cites Abel for the first known use of the associativity functional equation. Aczél provides a long proof of the “associativity equation” (pp. 256–267). Jaynes (p. 27) reproduces the shorter proof by Cox in which differentiability is assumed. A guide to Cox’s theorem by Van Horn aims at comprehensively introducing the reader to all these references.
In information theory, entropy is the average amount of information contained in each message received. Here, message stands for an event, sample or character drawn from a distribution or data stream. Entropy thus characterizes our uncertainty about our source of information. (Entropy is best understood as a measure of uncertainty rather than certainty, as entropy is larger for more random sources.) The source is also characterized by the probability distribution of the samples drawn from it. The idea here is that the less likely an event is, the more information it provides when it occurs. For some other reasons (explained below), it makes sense to define the information of an event as the negative of the logarithm of its probability. The probability distribution of the events, coupled with the information amount of every event, forms a random variable whose average (a.k.a. expected) value is the average amount of information, a.k.a. entropy, generated by this distribution. Because entropy is average information, it is also measured in shannons, nats, or hartleys, depending on the base of the logarithm used to define it.
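The definition above, entropy as the expected value of the negative logarithm of the probability, translates directly into a few lines of code. This is a minimal sketch for a finite distribution given as a list of probabilities. The function name `entropy` and the example distributions are illustrative, and zero-probability outcomes are skipped on the usual convention that they contribute nothing.

```python
import math

def entropy(probabilities, base=2):
    """Expected information -log(p) of a finite distribution.

    base=2 gives shannons (bits), base=math.e gives nats,
    base=10 gives hartleys. Outcomes with p == 0 contribute nothing.
    """
    return -sum(p * math.log(p, base) for p in probabilities if p > 0)

fair_coin = entropy([0.5, 0.5])          # most uncertain two-outcome source
biased_coin = entropy([0.9, 0.1])        # more predictable, so lower entropy
certain = entropy([1.0, 0.0])            # no uncertainty at all
fair_die_nats = entropy([1 / 6] * 6, base=math.e)

print(fair_coin, biased_coin, certain, fair_die_nats)
```

The fair coin comes out at one shannon, the biased coin at less than half a shannon, and the certain outcome at zero, which matches the reading of entropy as uncertainty about the source.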
Entropy is a measure of unpredictability of information content. To get an informal, intuitive understanding of the connection between the three terms entropy, unpredictability and information, consider the example of a poll on some political issue. Usually, such polls happen because the outcome of the poll isn’t already known. In other words, the outcome of the poll is relatively unpredictable, and actually performing the poll and learning the results gives some new information; these are just different ways of saying that the entropy of the poll results is large. Now, consider the case that the same poll is performed a second time shortly after the first poll. Since the result of the first poll is already known, the outcome of the second poll can be predicted well and the results should not contain much new information; in this case the entropy of the second poll result relative to the first is small.
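The poll example can be made quantitative with conditional entropy, using the chain rule H(second | first) = H(first, second) − H(first). The numbers below are invented purely for illustration: each poll is reduced to a yes/no majority result, the first poll is assumed to be a 50/50 toss-up, and the second is assumed to agree with the first 95% of the time.

```python
import math

def entropy_bits(probabilities):
    """Entropy in shannons (bits) of a finite distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Hypothetical joint distribution over (first poll, second poll) results.
joint = {
    ("yes", "yes"): 0.475,
    ("yes", "no"): 0.025,
    ("no", "yes"): 0.025,
    ("no", "no"): 0.475,
}

# Marginal distributions of each poll taken on its own.
first_marginal, second_marginal = {}, {}
for (first, second), p in joint.items():
    first_marginal[first] = first_marginal.get(first, 0.0) + p
    second_marginal[second] = second_marginal.get(second, 0.0) + p

h_first = entropy_bits(first_marginal.values())
h_second = entropy_bits(second_marginal.values())
h_joint = entropy_bits(joint.values())

# Chain rule: uncertainty left in the second poll once the first is known.
h_second_given_first = h_joint - h_first

print(h_first, h_second, h_second_given_first)
```

Taken alone, the second poll is as uncertain as the first (about one bit), but once the first result is known it carries well under a third of a bit of new information.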
If a compression scheme is lossless—that is, you can always recover the entire original message by decompressing—then a compressed message has the same quantity of information as the original, but communicated in fewer characters. That is, it has more information, or a higher entropy, per character. This means a compressed message has less redundancy. Roughly speaking, Shannon’s source coding theorem says that a lossless compression scheme cannot compress messages, on average, to have more than one bit of information per bit of message, but that any value less than one bit of information per bit of message can be attained by employing a suitable coding scheme. The entropy of a message per bit multiplied by the length of that message is a measure of how much total information the message contains.
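The claim that a losslessly compressed message carries more information per character can be observed with a rough experiment. The sketch below uses Python's standard `zlib` module and a crude estimate, the entropy of the byte-frequency distribution, which ignores dependencies between neighbouring bytes and is therefore only an approximation of the true entropy rate. The made-up vocabulary, the seed and the helper name `bits_per_byte` are assumptions of this example.

```python
import math
import random
import zlib
from collections import Counter

def bits_per_byte(data):
    """Empirical entropy of the byte-frequency distribution, in bits per byte."""
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Illustrative redundant message: words drawn from a small made-up vocabulary.
rng = random.Random(0)
vocabulary = ["entropy", "message", "source", "signal",
              "coding", "channel", "bit", "noise"]
message = " ".join(rng.choice(vocabulary) for _ in range(3000)).encode("ascii")

compressed = zlib.compress(message, 9)
assert zlib.decompress(compressed) == message  # lossless: nothing is lost

print("original:  ", len(message), "bytes,", round(bits_per_byte(message), 2), "bits/byte")
print("compressed:", len(compressed), "bytes,", round(bits_per_byte(compressed), 2), "bits/byte")
```

The compressed output is shorter, and its bytes are spread much more evenly over the 256 possible values, so the estimate moves towards the ceiling of 8 bits per byte.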
Shannon’s theorem also implies that no lossless compression scheme can shorten all messages. If some messages come out shorter, at least one must come out longer due to the pigeonhole principle. In practical use, this is generally not a problem, because we are usually only interested in compressing certain types of messages, for example English documents as opposed to gibberish text, or digital photographs rather than noise, and it is unimportant if a compression algorithm makes some unlikely or uninteresting sequences larger. However, the problem can still arise even in everyday use when applying a compression algorithm to already compressed data: for example, making a ZIP file of music that is already in the FLAC audio format is unlikely to achieve much extra saving in space.
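Both points can be illustrated briefly: the counting behind the pigeonhole argument, and the practical effect of compressing data that is already compressed or that looks like noise. This is a sketch using Python's standard `zlib` module as a stand-in for the ZIP and FLAC tools mentioned above. The vocabulary, seed and sizes are arbitrary choices made for this example, and the exact byte counts depend on the zlib version.

```python
import random
import zlib

# Pigeonhole count: there are 2**n bit strings of length n, but only
# 2**n - 1 strictly shorter ones (lengths 0 to n - 1), so no lossless
# scheme can map every length-n message to a distinct shorter string.
n = 8
messages_of_length_n = 2 ** n
strictly_shorter_strings = sum(2 ** k for k in range(n))

# Illustrative inputs: redundant text and pseudo-random "noise" of equal length.
rng = random.Random(0)
vocabulary = ["entropy", "message", "source", "signal",
              "coding", "channel", "bit", "noise"]
text = " ".join(rng.choice(vocabulary) for _ in range(3000)).encode("ascii")
noise = bytes(rng.getrandbits(8) for _ in range(len(text)))

once = zlib.compress(text, 9)           # redundant text shrinks a lot
twice = zlib.compress(once, 9)          # already-compressed data barely changes
packed_noise = zlib.compress(noise, 9)  # noise does not shrink meaningfully

print(messages_of_length_n, strictly_shorter_strings)
for label, before, after in [
    ("text", text, once),
    ("compressed text", once, twice),
    ("noise", noise, packed_noise),
]:
    print(f"{label}: {len(before)} -> {len(after)} bytes")
```

The second pass has little redundancy left to exploit, which is the same reason that zipping FLAC audio saves little space.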