## Sunday, May 20, 2018

### Statistical inference

Statistical inference is the process of using data analysis to deduce properties of an underlying probability distribution.[1] Inferential statistical analysis infers properties of a population, for example by testing hypotheses and deriving estimates. It is assumed that the observed data set is sampled from a larger population.

Inferential statistics can be contrasted with descriptive statistics. Descriptive statistics is solely concerned with properties of the observed data, and it does not rest on the assumption that the data come from a larger population.

## Introduction

Statistical inference makes propositions about a population, using data drawn from the population with some form of sampling. Given a hypothesis about a population, for which we wish to draw inferences, statistical inference consists of (first) selecting a statistical model of the process that generates the data and (second) deducing propositions from the model.[citation needed]

Konishi & Kitagawa state, "The majority of the problems in statistical inference can be considered to be problems related to statistical modeling".[2] Relatedly, Sir David Cox has said, "How [the] translation from subject-matter problem to statistical model is done is often the most critical part of an analysis".[3]

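The conclusion of a statistical inference is a statistical proposition.[citation needed] Common forms of statistical proposition include a point estimate, an interval estimate such as a confidence interval, a credible interval, the rejection of a hypothesis, and the clustering or classification of data points into groups.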

## Models and assumptions

Any statistical inference requires some assumptions. A statistical model is a set of assumptions concerning the generation of the observed data and similar data. Descriptions of statistical models usually emphasize the role of population quantities of interest, about which we wish to draw inference.[4] Descriptive statistics are typically used as a preliminary step before more formal inferences are drawn.[5]

### Degree of models/assumptions

Statisticians distinguish between three levels of modeling assumptions:
• Fully parametric: The probability distributions describing the data-generation process are assumed to be fully described by a family of probability distributions involving only a finite number of unknown parameters.[4] For example, one may assume that the distribution of population values is truly Normal, with unknown mean and variance, and that datasets are generated by 'simple' random sampling. The family of generalized linear models is a widely used and flexible class of parametric models.
• Non-parametric: The assumptions made about the process generating the data are much less than in parametric statistics and may be minimal.[6] For example, every continuous probability distribution has a median, which may be estimated using the sample median or the Hodges–Lehmann–Sen estimator, which has good properties when the data arise from simple random sampling.
• Semi-parametric: This term typically implies assumptions 'in between' fully and non-parametric approaches. For example, one may assume that a population distribution has a finite mean. Furthermore, one may assume that the mean response level in the population depends in a truly linear manner on some covariate (a parametric assumption) but not make any parametric assumption describing the variance around that mean (i.e. about the presence or possible form of any heteroscedasticity). More generally, semi-parametric models can often be separated into 'structural' and 'random variation' components. One component is treated parametrically and the other non-parametrically. The well-known Cox model is a set of semi-parametric assumptions.
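The non-parametric estimators named above can be sketched in a few lines. This is a minimal illustration, assuming the one-sample form of the Hodges–Lehmann estimator (the median of all pairwise Walsh averages); the function name `hodges_lehmann` and the data are made up for the example. Note that this estimator targets the pseudo-median, which coincides with the median when the distribution is symmetric.

```python
import numpy as np

def hodges_lehmann(x):
    """One-sample Hodges-Lehmann estimate: the median of all Walsh averages
    (x[i] + x[j]) / 2 taken over index pairs with i <= j."""
    x = np.asarray(x, dtype=float)
    i, j = np.triu_indices(len(x))          # all index pairs with i <= j
    return float(np.median((x[i] + x[j]) / 2.0))

# Made-up sample with one outlier (an assumption for illustration only).
sample = np.array([2.1, 3.4, 1.9, 2.8, 3.0, 9.5])

print("sample mean    :", sample.mean())
print("sample median  :", np.median(sample))
print("Hodges-Lehmann :", hodges_lehmann(sample))
```

Neither estimator assumes a parametric family for the data; the sample mean is printed only to show how much more strongly it reacts to the outlier.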

### Importance of valid models/assumptions

Whatever level of assumption is made, correctly calibrated inference in general requires these assumptions to be correct; i.e. that the data-generating mechanisms really have been correctly specified.

Incorrect assumptions of 'simple' random sampling can invalidate statistical inference.[7] More complex semi- and fully parametric assumptions are also cause for concern. For example, incorrectly assuming the Cox model can in some cases lead to faulty conclusions.[8] Incorrect assumptions of Normality in the population also invalidate some forms of regression-based inference.[9] The use of any parametric model is viewed skeptically by most experts in sampling human populations: "most sampling statisticians, when they deal with confidence intervals at all, limit themselves to statements about [estimators] based on very large samples, where the central limit theorem ensures that these [estimators] will have distributions that are nearly normal."[10] In particular, a normal distribution "would be a totally unrealistic and catastrophically unwise assumption to make if we were dealing with any kind of economic population."[10] Here, the central limit theorem states that the sample mean "for very large samples" is approximately normally distributed, provided the population distribution is not heavy-tailed.

#### Approximate distributions

Given the difficulty in specifying exact distributions of sample statistics, many methods have been developed for approximating these.

With finite samples, approximation results measure how closely a limiting distribution approaches the statistic's sampling distribution. For example, with 10,000 independent samples the normal distribution approximates (to two digits of accuracy) the distribution of the sample mean for many population distributions, by the Berry–Esseen theorem.[11] Yet for many practical purposes, the normal approximation provides a good approximation to the sample mean's distribution when there are 10 (or more) independent samples, according to simulation studies and statisticians' experience.[11] Following Kolmogorov's work in the 1950s, advanced statistics uses approximation theory and functional analysis to quantify the error of approximation. In this approach, the metric geometry of probability distributions is studied; this approach quantifies approximation error with, for example, the Kullback–Leibler divergence, Bregman divergence, and the Hellinger distance.[12][13][14]

With indefinitely large samples, limiting results like the central limit theorem describe the sample statistic's limiting distribution, if one exists. Limiting results are not statements about finite samples, and indeed are irrelevant to finite samples.[15][16][17] However, the asymptotic theory of limiting distributions is often invoked for work with finite samples. For example, limiting results are often invoked to justify the generalized method of moments and the use of generalized estimating equations, which are popular in econometrics and biostatistics. The magnitude of the difference between the limiting distribution and the true distribution (formally, the 'error' of the approximation) can be assessed using simulation.[18] The heuristic application of limiting results to finite samples is common practice in many applications, especially with low-dimensional models with log-concave likelihoods (such as with one-parameter exponential families).
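Assessing the approximation error by simulation can be sketched as follows. This is an illustrative setup rather than a prescribed method: it assumes an Exponential(1) population, measures the error as the largest gap between the simulated distribution of the standardized sample mean and the standard normal CDF, and uses made-up names (`ks_distance_to_normal`, `distances`) and sample sizes.

```python
import math
import numpy as np

def normal_cdf(z):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ks_distance_to_normal(n, reps=20000, seed=0):
    """Monte Carlo estimate of sup_x |F_n(x) - Phi(x)|, where F_n is the
    distribution of the standardized mean of n Exponential(1) draws."""
    rng = np.random.default_rng(seed)
    means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
    # Exponential(1) has mean 1 and standard deviation 1.
    z = np.sort((means - 1.0) * math.sqrt(n))
    phi = np.array([normal_cdf(v) for v in z])
    upper = np.arange(1, reps + 1) / reps   # empirical CDF just after each point
    lower = np.arange(0, reps) / reps       # empirical CDF just before each point
    return float(max(np.max(upper - phi), np.max(phi - lower)))

# Sample sizes chosen arbitrarily for the illustration.
distances = {n: ks_distance_to_normal(n) for n in (5, 200)}
for n, d in distances.items():
    print(f"n = {n:4d}: estimated max CDF gap = {d:.4f}")
```

The estimate itself carries Monte Carlo noise of roughly order 1/sqrt(reps), so very small gaps cannot be resolved without more replications.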

### Randomization-based models

For a given dataset that was produced by a randomization design, the randomization distribution of a statistic (under the null hypothesis) is defined by evaluating the test statistic for all of the plans that could have been generated by the randomization design. In frequentist inference, randomization allows inferences to be based on the randomization distribution rather than a subjective model, and this is important especially in survey sampling and design of experiments.[19][20] Statistical inference from randomized studies is also more straightforward than in many other situations.[21][22][23] In Bayesian inference, randomization is also of importance: in survey sampling, use of sampling without replacement ensures the exchangeability of the sample with the population; in randomized experiments, randomization warrants a missing at random assumption for covariate information.[24]
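The definition of the randomization distribution can be made concrete with a small sketch. It assumes a completely randomized design with a fixed number of treated units, uses the difference in means as the test statistic, and works on invented outcome values; the function name `randomization_distribution` is likewise hypothetical.

```python
from itertools import combinations

def randomization_distribution(values, n_treated):
    """Difference in means (treated minus control) for every assignment of
    n_treated units to treatment, holding outcomes fixed as the null of
    no treatment effect implies."""
    units = range(len(values))
    diffs = []
    for treated in combinations(units, n_treated):
        t = [values[i] for i in treated]
        c = [values[i] for i in units if i not in treated]
        diffs.append(sum(t) / len(t) - sum(c) / len(c))
    return diffs

# Hypothetical outcomes; the first three units are the ones actually treated.
outcomes = [12, 15, 14, 10, 9, 11]
observed = sum(outcomes[:3]) / 3 - sum(outcomes[3:]) / 3

dist = randomization_distribution(outcomes, 3)
# One-sided p-value: share of assignments at least as extreme as the observed one.
p_value = sum(d >= observed - 1e-12 for d in dist) / len(dist)

print("number of possible assignments:", len(dist))
print("observed difference in means  :", round(observed, 3))
print("one-sided p-value             :", p_value)
```

With these made-up outcomes the observed assignment is the most extreme of the 20 possible ones, so the one-sided p-value is 1/20. Full enumeration is only feasible for small designs; larger ones are usually handled by sampling assignments at random.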

Objective randomization allows properly inductive procedures.[25][26][27][28] Many statisticians prefer randomization-based analysis of data that was generated by well-defined randomization procedures.[29] (However, it is true that in fields of science with developed theoretical knowledge and experimental control, randomized experiments may increase the costs of experimentation without improving the quality of inferences.[30][31]) Similarly, results from randomized experiments are recommended by leading statistical authorities as allowing inferences with greater reliability than do observational studies of the same phenomena.[32] However, a good observational study may be better than a bad randomized experiment.

The statistical analysis of a randomized experiment may be based on the randomization scheme stated in the experimental protocol and does not need a subjective model.[33][34]

However, at any time, some hypotheses cannot be tested using objective statistical models, which accurately describe randomized experiments or random samples. In some cases, such randomized studies are uneconomical or unethical.

#### Model-based analysis of randomized experiments

It is standard practice to refer to a statistical model, often a linear model, when analyzing data from randomized experiments. However, the randomization scheme guides the choice of a statistical model. It is not possible to choose an appropriate model without knowing the randomization scheme.[20] Seriously misleading results can be obtained analyzing data from randomized experiments while ignoring the experimental protocol; common mistakes include forgetting the blocking used in an experiment and confusing repeated measurements on the same experimental unit with independent replicates of the treatment applied to different experimental units.[35]

## Paradigms for inference

Different schools of statistical inference have become established. These schools, or "paradigms", are not mutually exclusive, and methods that work well under one paradigm often have attractive interpretations under other paradigms.

Bandyopadhyay & Forster[36] describe four paradigms: "(i) classical statistics or error statistics, (ii) Bayesian statistics, (iii) likelihood-based statistics, and (iv) the Akaikean-Information Criterion-based statistics". The classical (or frequentist) paradigm, the Bayesian paradigm, and the AIC-based paradigm are summarized below. The likelihood-based paradigm is essentially a sub-paradigm of the AIC-based paradigm.

### Frequentist inference

This paradigm calibrates the plausibility of propositions by considering (notional) repeated sampling of a population distribution to produce datasets similar to the one at hand. By considering the dataset's characteristics under repeated sampling, the frequentist properties of a statistical proposition can be quantified—although in practice this quantification may be challenging.

#### Frequentist inference, objectivity, and decision theory

One interpretation of frequentist inference (or classical inference) is that it is applicable only in terms of frequency probability; that is, in terms of repeated sampling from a population. However, the approach of Neyman[37] develops these procedures in terms of pre-experiment probabilities. That is, before undertaking an experiment, one decides on a rule for coming to a conclusion such that the probability of being correct is controlled in a suitable way: such a probability need not have a frequentist or repeated sampling interpretation. In contrast, Bayesian inference works in terms of conditional probabilities (i.e. probabilities conditional on the observed data), compared to the marginal (but conditioned on unknown parameters) probabilities used in the frequentist approach.

The frequentist procedures of significance testing and confidence intervals can be constructed without regard to utility functions. However, some elements of frequentist statistics, such as statistical decision theory, do incorporate utility functions.[citation needed] In particular, frequentist developments of optimal inference (such as minimum-variance unbiased estimators, or uniformly most powerful testing) make use of loss functions, which play the role of (negative) utility functions. Loss functions need not be explicitly stated for statistical theorists to prove that a statistical procedure has an optimality property.[38] However, loss functions are often useful for stating optimality properties: for example, median-unbiased estimators are optimal under absolute value loss functions, and least squares estimators are optimal under squared error loss functions, in each case in the sense that they minimize expected loss.
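The link between a loss function and the summary it favours can be checked numerically. The sketch below is an empirical analogue of the statement above, not a proof: it takes a made-up sample, searches a grid of candidate values, and finds which candidate minimizes the average absolute loss and which minimizes the average squared loss.

```python
import numpy as np

# Made-up sample with one outlier (an assumption for illustration only).
data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
candidates = np.linspace(0.0, 100.0, 10001)   # grid of candidate estimates, step 0.01

# Average loss of each candidate against every data point.
errors = data[None, :] - candidates[:, None]
abs_loss = np.abs(errors).mean(axis=1)
sq_loss = (errors ** 2).mean(axis=1)

best_abs = float(candidates[np.argmin(abs_loss)])
best_sq = float(candidates[np.argmin(sq_loss)])

print("minimizer of absolute loss:", best_abs, "| sample median:", np.median(data))
print("minimizer of squared loss :", best_sq, "| sample mean  :", data.mean())
```

On this sample the absolute-loss minimizer lands on the sample median and the squared-loss minimizer on the sample mean, up to the grid resolution.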

While statisticians using frequentist inference must choose for themselves the parameters of interest, and the estimators and test statistics to be used, the absence of obviously explicit utilities and prior distributions has helped frequentist procedures to become widely viewed as 'objective'.[citation needed]

### Bayesian inference

The Bayesian calculus describes degrees of belief using the 'language' of probability; beliefs are positive, integrate to one, and obey probability axioms. Bayesian inference uses the available posterior beliefs as the basis for making statistical propositions. There are several different justifications for using the Bayesian approach.

#### Bayesian inference, subjectivity and decision theory

Many informal Bayesian inferences are based on "intuitively reasonable" summaries of the posterior. For example, the posterior mean, median and mode, highest posterior density intervals, and Bayes Factors can all be motivated in this way. While a user's utility function need not be stated for this sort of inference, these summaries do all depend (to some extent) on stated prior beliefs, and are generally viewed as subjective conclusions. (Methods of prior construction which do not require external input have been proposed but not yet fully developed.)
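How posterior summaries depend on the stated prior can be seen in a small conjugate example. This sketch assumes a Beta prior on a success probability with binomial data, a standard textbook setup that the text above does not itself specify; the priors, the counts, and the helper names are all invented for illustration.

```python
def beta_binomial_posterior(prior_a, prior_b, successes, failures):
    """Conjugate update: a Beta(a, b) prior with binomial data gives a
    Beta(a + successes, b + failures) posterior."""
    return prior_a + successes, prior_b + failures

def beta_mean(a, b):
    return a / (a + b)

def beta_mode(a, b):
    # Valid when a > 1 and b > 1.
    return (a - 1) / (a + b - 2)

successes, failures = 7, 3                    # hypothetical data
priors = {"weak prior Beta(2, 2)": (2, 2),
          "strong prior Beta(20, 20)": (20, 20)}

for label, (a0, b0) in priors.items():
    a, b = beta_binomial_posterior(a0, b0, successes, failures)
    print(f"{label}: posterior Beta({a}, {b}), "
          f"mean = {beta_mean(a, b):.3f}, mode = {beta_mode(a, b):.3f}")
```

The same data produce different posterior means and modes under the two priors, which is the sense in which these summaries are viewed as subjective.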

Formally, Bayesian inference is calibrated with reference to an explicitly stated utility, or loss function; the 'Bayes rule' is the one which maximizes expected utility, averaged over the posterior uncertainty. Formal Bayesian inference therefore automatically provides optimal decisions in a decision theoretic sense. Given assumptions, data and utility, Bayesian inference can be made for essentially any problem, although not every statistical inference need have a Bayesian interpretation. Analyses which are not formally Bayesian can be (logically) incoherent; a feature of Bayesian procedures which use proper priors (i.e. those integrable to one) is that they are guaranteed to be coherent. Some advocates of Bayesian inference assert that inference must take place in this decision-theoretic framework, and that Bayesian inference should not conclude with the evaluation and summarization of posterior beliefs.

### AIC-based inference

The Akaike information criterion (AIC) is an estimator of the relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model, relative to each of the other models. Thus, AIC provides a means for model selection.

AIC is founded on information theory: it offers an estimate of the relative information lost when a given model is used to represent the process that generated the data. (In doing so, it deals with the trade-off between the goodness of fit of the model and the simplicity of the model.)
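A minimal sketch of AIC-based model selection follows. It assumes polynomial regression models with i.i.d. Gaussian errors fitted by least squares, for which AIC = 2k − 2 ln(L̂) with k counting the regression coefficients plus the error variance; the simulated data and the function name `gaussian_aic` are illustrative assumptions.

```python
import numpy as np

def gaussian_aic(y, fitted, n_coef):
    """AIC = 2k - 2 ln(L_hat) for a least-squares fit with i.i.d. Gaussian
    errors; k counts the regression coefficients plus the error variance."""
    n = len(y)
    rss = float(np.sum((y - fitted) ** 2))
    sigma2_hat = rss / n                      # maximum-likelihood variance estimate
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2_hat) + 1.0)
    k = n_coef + 1
    return 2 * k - 2 * log_lik

# Simulated data from a quadratic curve (hypothetical coefficients and noise level).
rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 50)
y = 1.0 + 2.0 * x + 3.0 * x**2 + rng.normal(0.0, 1.0, size=x.size)

aic = {}
for degree in (1, 2, 3):
    coefs = np.polyfit(x, y, degree)
    aic[degree] = gaussian_aic(y, np.polyval(coefs, x), n_coef=degree + 1)
    print(f"degree {degree}: AIC = {aic[degree]:.1f}")
```

Only differences in AIC between models fitted to the same data are meaningful; the lowest value marks the model estimated to lose the least information, with the 2k term penalizing extra parameters.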

#### Minimum description length

The minimum description length (MDL) principle has been developed from ideas in information theory[39] and the theory of Kolmogorov complexity.[40] The MDL principle selects statistical models that maximally compress the data; inference proceeds without assuming counterfactual or non-falsifiable "data-generating mechanisms" or probability models for the data, as might be done in frequentist or Bayesian approaches.

However, if a "data generating mechanism" does exist in reality, then according to Shannon's source coding theorem it provides the MDL description of the data, on average and asymptotically.[41] In minimizing description length (or descriptive complexity), MDL estimation is similar to maximum likelihood estimation and maximum a posteriori estimation (using maximum-entropy Bayesian priors). However, MDL avoids assuming that the underlying probability model is known; the MDL principle can also be applied without assumptions that e.g. the data arose from independent sampling.[41][42]

The MDL principle has been applied in communication-coding theory in information theory, in linear regression,[42] and in data mining.[40]

The evaluation of MDL-based inferential procedures often uses techniques or criteria from computational complexity theory.[43]

#### Fiducial inference

Fiducial inference was an approach to statistical inference based on fiducial probability, also known as a "fiducial distribution". In subsequent work, this approach has been called ill-defined, extremely limited in applicability, and even fallacious.[44][45] However, this argument is the same as that which shows[46] that a so-called confidence distribution is not a valid probability distribution and, since this has not invalidated the application of confidence intervals, it does not necessarily invalidate conclusions drawn from fiducial arguments. An attempt was made to reinterpret the early work of Fisher's fiducial argument as a special case of an inference theory using upper and lower probabilities.[47]

#### Structural inference

Developing ideas of Fisher and of Pitman from 1938 to 1939,[48] George A. Barnard developed "structural inference" or "pivotal inference",[49] an approach using invariant probabilities on group families. Barnard reformulated the arguments behind fiducial inference on a restricted class of models on which "fiducial" procedures would be well-defined and useful.

### Statistical model

A statistical model is a mathematical model that embodies a set of statistical assumptions concerning the generation of some sample data and similar data from a larger population. A statistical model represents, often in considerably idealized form, the data-generating process.

The assumptions embodied by a statistical model describe a set of probability distributions, some of which are assumed to adequately approximate the distribution from which a particular data set is sampled. The probability distributions inherent in statistical models are what distinguishes statistical models from other, non-statistical, mathematical models.

A statistical model is usually specified by mathematical equations that relate one or more random variables and possibly other non-random variables. As such, a statistical model is "a formal representation of a theory" (Herman Adèr quoting Kenneth Bollen).[1]

All statistical hypothesis tests and all statistical estimators are derived from statistical models. More generally, statistical models are part of the foundation of statistical inference.

## Formal definition

In mathematical terms, a statistical model is usually thought of as a pair (${\displaystyle S,{\mathcal {P}}}$), where ${\displaystyle S}$ is the set of possible observations, i.e. the sample space, and ${\displaystyle {\mathcal {P}}}$ is a set of probability distributions on ${\displaystyle S}$.[2]

The intuition behind this definition is as follows. It is assumed that there is a "true" probability distribution induced by the process that generates the observed data. We choose ${\displaystyle {\mathcal {P}}}$ to represent a set (of distributions) which contains a distribution that adequately approximates the true distribution. Note that we do not require that ${\displaystyle {\mathcal {P}}}$ contains the true distribution, and in practice that is rarely the case. Indeed, as Burnham & Anderson state, "A model is a simplification or approximation of reality and hence will not reflect all of reality"[3]—whence the saying "all models are wrong".

The set ${\displaystyle {\mathcal {P}}}$ is almost always parameterized: ${\displaystyle {\mathcal {P}}=\{P_{\theta }:\theta \in \Theta \}}$. The set ${\displaystyle \Theta }$ defines the parameters of the model. A parameterization is generally required to have distinct parameter values give rise to distinct distributions, i.e. ${\displaystyle P_{\theta _{1}}=P_{\theta _{2}}\Rightarrow \theta _{1}=\theta _{2}}$ must hold (in other words, it must be injective). A parameterization that meets the requirement is said to be identifiable.[2]

## An example

Suppose that we have a population of school children, with the ages of the children distributed uniformly in the population. The height of a child will be stochastically related to the age: e.g. when we know that a child is of age 7, this influences the chance of the child being 5 feet tall. We could formalize that relationship in a linear regression model, like this: ${\displaystyle {\text{height}}_{i}=b_{0}+b_{1}{\text{age}}_{i}+\varepsilon _{i}}$, where ${\displaystyle b_{0}}$ is the intercept, ${\displaystyle b_{1}}$ is a parameter that age is multiplied by in obtaining a prediction of height, ${\displaystyle \varepsilon _{i}}$ is the error term, and ${\displaystyle i}$ identifies the child. This implies that height is predicted by age, with some error.

An admissible model must be consistent with all the data points. Thus, a straight line (${\displaystyle {\text{height}}_{i}=b_{0}+b_{1}{\text{age}}_{i}}$) cannot be the equation for a model of the data unless it exactly fits all the data points, i.e. all the data points lie perfectly on the line. The error term, ${\displaystyle \varepsilon _{i}}$, must be included in the equation, so that the model is consistent with all the data points.

To do statistical inference, we would first need to assume some probability distributions for the ${\displaystyle \varepsilon _{i}}$. For instance, we might assume that the ${\displaystyle \varepsilon _{i}}$ distributions are i.i.d. Gaussian, with zero mean. In this instance, the model would have 3 parameters: ${\displaystyle b_{0}}$, ${\displaystyle b_{1}}$, and the variance of the Gaussian distribution.

We can formally specify the model in the form (${\displaystyle S,{\mathcal {P}}}$) as follows. The sample space, ${\displaystyle S}$, of our model comprises the set of all possible pairs (age, height). Each possible value of ${\displaystyle \theta =(b_{0},b_{1},\sigma ^{2})}$ determines a distribution on ${\displaystyle S}$; denote that distribution by ${\displaystyle P_{\theta }}$. If ${\displaystyle \Theta }$ is the set of all possible values of ${\displaystyle \theta }$, then ${\displaystyle {\mathcal {P}}=\{P_{\theta }:\theta \in \Theta \}}$. (The parameterization is identifiable, and this is easy to check.)

In this example, the model is determined by (1) specifying ${\displaystyle S}$ and (2) making some assumptions relevant to ${\displaystyle {\mathcal {P}}}$. There are two assumptions: that height can be approximated by a linear function of age; that errors in the approximation are distributed as i.i.d. Gaussian. The assumptions are sufficient to specify ${\displaystyle {\mathcal {P}}}$—as they are required to do.
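The height-and-age model can be simulated and fitted in a few lines. The sketch below is illustrative only: the parameter values (heights in centimetres, an age range of 5 to 15 years) and the sample size are invented, and the three parameters are estimated by ordinary least squares with the maximum-likelihood variance estimate.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Hypothetical "true" parameters theta = (b0, b1, sigma^2); heights in cm.
true_b0, true_b1, true_sigma = 75.0, 6.0, 5.0

# Ages uniform in the population, errors i.i.d. Gaussian with zero mean.
age = rng.uniform(5.0, 15.0, size=n)
height = true_b0 + true_b1 * age + rng.normal(0.0, true_sigma, size=n)

# Least-squares fit of height_i = b0 + b1 * age_i + eps_i.
X = np.column_stack([np.ones(n), age])
coef, *_ = np.linalg.lstsq(X, height, rcond=None)
b0_hat, b1_hat = coef
residuals = height - X @ coef
sigma2_hat = float(residuals @ residuals) / n   # maximum-likelihood variance estimate

print(f"b0_hat = {b0_hat:.2f}, b1_hat = {b1_hat:.2f}, sigma2_hat = {sigma2_hat:.2f}")
```

Each triple of parameter values picks out one distribution from the family, and fitting amounts to choosing the member that best matches the observed (age, height) pairs.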

## General remarks

A statistical model is a special class of mathematical model. What distinguishes a statistical model from other mathematical models is that a statistical model is non-deterministic. Thus, in a statistical model specified via mathematical equations, some of the variables do not have specific values, but instead have probability distributions; i.e. some of the variables are stochastic. In the example above, ε is a stochastic variable; without that variable, the model would be deterministic.

Statistical models are often used even when the physical process being modeled is deterministic. For instance, coin tossing is, in principle, a deterministic process; yet it is commonly modeled as stochastic (via a Bernoulli process).

There are three purposes for a statistical model, according to Konishi & Kitagawa.[4]
• Predictions
• Extraction of information
• Description of stochastic structures

## Dimension of a model

Suppose that we have a statistical model (${\displaystyle S,{\mathcal {P}}}$) with ${\displaystyle {\mathcal {P}}=\{P_{\theta }:\theta \in \Theta \}}$. The model is said to be parametric if ${\displaystyle \Theta }$ has a finite dimension. In notation, we write that ${\displaystyle \Theta \subseteq \mathbb {R} ^{k}}$ where k is a positive integer (${\displaystyle \mathbb {R} }$ denotes the real numbers; other sets can be used, in principle). Here, k is called the dimension of the model.

As an example, if we assume that data arise from a univariate Gaussian distribution, then we are assuming that
${\displaystyle {\mathcal {P}}=\left\{P_{\mu ,\sigma }(x)\equiv {\frac {1}{{\sqrt {2\pi }}\sigma }}\exp \left(-{\frac {(x-\mu )^{2}}{2\sigma ^{2}}}\right):\mu \in \mathbb {R} ,\sigma >0\right\}}$.
In this example, the dimension, k, equals 2.

As another example, suppose that the data consists of points (x, y) that we assume are distributed according to a straight line with i.i.d. Gaussian residuals (with zero mean). Then the dimension of the statistical model is 3: the intercept of the line, the slope of the line, and the variance of the distribution of the residuals. (Note that in geometry, a straight line has dimension 1.)

Although formally ${\displaystyle \theta \in \Theta }$ is a single parameter that has dimension k, it is sometimes regarded as comprising k separate parameters. For example, with the univariate Gaussian distribution, ${\displaystyle \theta }$ is a single parameter with dimension 2, but it is sometimes regarded as comprising 2 separate parameters—the mean and the standard deviation.

A statistical model is nonparametric if the parameter set ${\displaystyle \Theta }$ is infinite dimensional. A statistical model is semiparametric if it has both finite-dimensional and infinite-dimensional parameters. Formally, if k is the dimension of ${\displaystyle \Theta }$ and n is the number of samples, both semiparametric and nonparametric models have ${\displaystyle k\rightarrow \infty }$ as ${\displaystyle n\rightarrow \infty }$. If ${\displaystyle k/n\rightarrow 0}$ as ${\displaystyle n\rightarrow \infty }$, then the model is semiparametric; otherwise, the model is nonparametric.

Parametric models are by far the most commonly used statistical models. Regarding semiparametric and nonparametric models, Sir David Cox has said, "These typically involve fewer assumptions of structure and distributional form but usually contain strong assumptions about independencies".[5]

## Nested models

Two statistical models are nested if the first model can be transformed into the second model by imposing constraints on the parameters of the first model. As an example, the set of all Gaussian distributions has, nested within it, the set of zero-mean Gaussian distributions: we constrain the mean in the set of all Gaussian distributions to get the zero-mean distributions. As a second example, the quadratic model
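${\displaystyle y=b_{0}+b_{1}x+b_{2}x^{2}+\varepsilon ,\quad \varepsilon \sim {\mathcal {N}}(0,\sigma ^{2})}$
has, nested within it, the linear model
${\displaystyle y=b_{0}+b_{1}x+\varepsilon ,\quad \varepsilon \sim {\mathcal {N}}(0,\sigma ^{2})}$
where we constrain the parameter ${\displaystyle b_{2}}$ to equal 0.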

In both those examples, the first model has a higher dimension than the second model (for the first example, the zero-mean model has dimension 1). Such is often, but not always, the case. As a different example, the set of positive-mean Gaussian distributions, which has dimension 2, is nested within the set of all Gaussian distributions.

## Comparing models

It is assumed that there is a "true" probability distribution underlying the observed data, induced by the process that generated the data. The main goal of model selection is to make statements about which elements of ${\displaystyle {\mathcal {P}}}$ are most likely to adequately approximate the true distribution.

Models can be compared to each other by exploratory data analysis or confirmatory data analysis. In exploratory analysis, a variety of models are formulated and an assessment is performed of how well each one describes the data. In confirmatory analysis, a previously formulated model or models are compared to the data. Common criteria for comparing models include ${\displaystyle R^{2}}$, Bayes factor, and the likelihood-ratio test together with its generalization relative likelihood.
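As one concrete instance of these criteria, the likelihood-ratio test can compare a linear model against a quadratic model that contains it as a nested special case. The sketch below assumes i.i.d. Gaussian errors, for which the statistic reduces to n·ln(RSS_linear / RSS_quadratic), and relies on the usual large-sample chi-square approximation with one degree of freedom; the data and function names are hypothetical.

```python
import math
import numpy as np

def rss(x, y, degree):
    """Residual sum of squares of a least-squares polynomial fit."""
    coefs = np.polyfit(x, y, degree)
    return float(np.sum((y - np.polyval(coefs, x)) ** 2))

def lr_test_linear_vs_quadratic(x, y):
    """Likelihood-ratio statistic 2 * (loglik_quadratic - loglik_linear) under
    Gaussian errors, with an approximate chi-square(1) p-value."""
    n = len(y)
    stat = n * math.log(rss(x, y, 1) / rss(x, y, 2))
    stat = max(stat, 0.0)                        # guard against tiny negative rounding
    # Chi-square(1) survival function: P(Z^2 > s) = erfc(sqrt(s / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Simulated data from a straight line, so the quadratic term is not needed.
rng = np.random.default_rng(7)
x = np.linspace(0.0, 10.0, 60)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=x.size)

stat, p_value = lr_test_linear_vs_quadratic(x, y)
print(f"LR statistic = {stat:.3f}, approximate p-value = {p_value:.3f}")
```

A small p-value would count as evidence against the constraint that the quadratic coefficient equals 0; the approximation is asymptotic and can be rough for small samples.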

Konishi & Kitagawa state: "The majority of the problems in statistical inference can be considered to be problems related to statistical modeling. They are typically formulated as comparisons of several statistical models."[6] Relatedly, Sir David Cox has said, "How [the] translation from subject-matter problem to statistical model is done is often the most critical part of an analysis".[7]

### Stochastic process

Figure: A computer-simulated realization of a Wiener or Brownian motion process on the surface of a sphere. The Wiener process is widely considered the most studied and central stochastic process in probability theory.[1][2][3]

In probability theory and related fields, a stochastic or random process is a mathematical object usually defined as a collection of random variables. Historically, the random variables were associated with or indexed by a set of numbers, usually viewed as points in time, giving the interpretation of a stochastic process representing numerical values of some system randomly changing over time, such as the growth of a bacterial population, an electrical current fluctuating due to thermal noise, or the movement of a gas molecule.[1][4][5] Stochastic processes are widely used as mathematical models of systems and phenomena that appear to vary in a random manner. They have applications in many disciplines including sciences such as biology,[6] chemistry,[7] ecology,[8] neuroscience,[9] and physics[10] as well as technology and engineering fields such as image processing, signal processing,[11] information theory,[12] computer science,[13] cryptography[14] and telecommunications.[15] Furthermore, seemingly random changes in financial markets have motivated the extensive use of stochastic processes in finance.[16][17][18]

Applications and the study of phenomena have in turn inspired the proposal of new stochastic processes. Examples of such stochastic processes include the Wiener process or Brownian motion process,[a] used by Louis Bachelier to study price changes on the Paris Bourse,[21] and the Poisson process, used by A. K. Erlang to study the number of phone calls occurring in a certain period of time.[22] These two stochastic processes are considered the most important and central in the theory of stochastic processes,[1][4][23] and were discovered repeatedly and independently, both before and after Bachelier and Erlang, in different settings and countries.[21][24]

The term random function is also used to refer to a stochastic or random process,[25][26] because a stochastic process can also be interpreted as a random element in a function space.[27][28] The terms stochastic process and random process are used interchangeably, often with no specific mathematical space for the set that indexes the random variables.[27][29] But often these two terms are used when the random variables are indexed by the integers or an interval of the real line.[5][29] If the random variables are indexed by the Cartesian plane or some higher-dimensional Euclidean space, then the collection of random variables is usually called a random field instead.[5][30] The values of a stochastic process are not always numbers and can be vectors or other mathematical objects.[5][28]

Based on their properties, stochastic processes can be divided into various categories, which include random walks,[31] martingales,[32] Markov processes,[33] Lévy processes,[34] Gaussian processes,[35] random fields,[36] renewal processes, and branching processes.[37] The study of stochastic processes uses mathematical knowledge and techniques from probability, calculus, linear algebra, set theory, and topology[38][39][40] as well as branches of mathematical analysis such as real analysis, measure theory, Fourier analysis, and functional analysis.[41][42][43] The theory of stochastic processes is considered to be an important contribution to mathematics[44] and it continues to be an active topic of research for both theoretical reasons and applications.[45][46][47]

## Introduction

A stochastic or random process can be defined as a collection of random variables that is indexed by some mathematical set, meaning that each random variable of the stochastic process is uniquely associated with an element in the set.[4][5] The set used to index the random variables is called the index set. Historically, the index set was some subset of the real line, such as the natural numbers, giving the index set the interpretation of time.[1] Each random variable in the collection takes values from the same mathematical space known as the state space. This state space can be, for example, the integers, the real line or ${\displaystyle n}$-dimensional Euclidean space.[1][5] An increment is the amount that a stochastic process changes between two index values, often interpreted as two points in time.[48][49] A stochastic process can have many outcomes, due to its randomness, and a single outcome of a stochastic process is called, among other names, a sample function or realization.[28][50]

A single computer-simulated sample function or realization, among other terms, of a three-dimensional Wiener or Brownian motion process for time 0 ≤ t ≤ 2. The index set of this stochastic process is the non-negative numbers, while its state space is three-dimensional Euclidean space.

### Classifications

A stochastic process can be classified in different ways, for example, by its state space, its index set, or the dependence among the random variables. One common way of classification is by the cardinality of the index set and the state space.[51][52][53]

When interpreted as time, if the index set of a stochastic process has a finite or countable number of elements, such as a finite set of numbers, the set of integers, or the natural numbers, then the stochastic process is said to be in discrete time.[54][55] If the index set is some interval of the real line, then time is said to be continuous. The two types of stochastic processes are respectively referred to as discrete-time and continuous-time stochastic processes.[48][56][57] Discrete-time stochastic processes are considered easier to study because continuous-time processes require more advanced mathematical techniques and knowledge, particularly due to the index set being uncountable.[58][59] If the index set is the integers, or some subset of them, then the stochastic process can also be called a random sequence.[55]

If the state space is the integers or natural numbers, then the stochastic process is called a discrete or integer-valued stochastic process. If the state space is the real line, then the stochastic process is referred to as a real-valued stochastic process or a process with continuous state space. If the state space is ${\displaystyle n}$-dimensional Euclidean space, then the stochastic process is called an ${\displaystyle n}$-dimensional vector process or ${\displaystyle n}$-vector process.[51][52]

## Examples of stochastic processes

### Bernoulli process

One of the simplest stochastic processes is the Bernoulli process,[60] which is a sequence of independent and identically distributed (iid) random variables, where each random variable takes either the value one or zero, say one with probability ${\displaystyle p}$ and zero with probability ${\displaystyle 1-p}$. This process can be likened to repeatedly flipping a coin, where the probability of obtaining a head is ${\displaystyle p}$ and its value is one, while the value of a tail is zero.[61] In other words, a Bernoulli process is a sequence of iid Bernoulli random variables,[62] where each coin flip is an example of a Bernoulli trial.[63]
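The coin-flipping description above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `bernoulli_process`, the `seed` parameter, and the choice of ${\displaystyle p=0.3}$ are not from the text, and a finite list stands in for the infinite sequence.

```python
import random

def bernoulli_process(p, n, seed=None):
    """Return the first n terms of a Bernoulli process with success probability p.

    The name and signature are illustrative assumptions, not a standard API.
    """
    rng = random.Random(seed)
    # Each term is drawn independently: 1 with probability p, otherwise 0.
    return [1 if rng.random() < p else 0 for _ in range(n)]

flips = bernoulli_process(p=0.3, n=10_000, seed=0)
# The empirical frequency of ones should be close to p for large n.
print(sum(flips) / len(flips))
```

Fixing the seed makes a realization reproducible; changing it yields a different realization of the same process.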

### Random walk

Random walks are stochastic processes that are usually defined as sums of iid random variables or random vectors in Euclidean space, so they are processes that change in discrete time.[64][65][66][67][68] But some also use the term to refer to processes that change in continuous time,[69] particularly the Wiener process used in finance, a usage that has caused some confusion and has been criticized.[70] There are various other types of random walks, defined so their state spaces can be other mathematical objects, such as lattices and groups, and in general they are highly studied and have many applications in different disciplines.[69][71]

A classic example of a random walk is known as the simple random walk, which is a stochastic process in discrete time with the integers as the state space, and is based on a Bernoulli process, where each iid Bernoulli variable takes either the value positive one or negative one. In other words, the simple random walk takes place on the integers, and its value increases by one with probability, say, ${\displaystyle p}$, or decreases by one with probability ${\displaystyle 1-p}$, so the index set of this random walk is the natural numbers, while its state space is the integers. If ${\displaystyle p=0.5}$, this random walk is called a symmetric random walk.[72][73]
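A minimal sketch of the simple random walk follows, assuming the walk starts at zero (the text does not fix a starting point). The function name `simple_random_walk` and its parameters are illustrative.

```python
import random
from itertools import accumulate

def simple_random_walk(p, n, seed=None):
    """Return positions at times 0..n of a simple random walk started at 0 (an assumption)."""
    rng = random.Random(seed)
    # Each step is +1 with probability p and -1 with probability 1 - p.
    steps = [1 if rng.random() < p else -1 for _ in range(n)]
    # The position at time k is the sum of the first k steps.
    return [0] + list(accumulate(steps))

path = simple_random_walk(p=0.5, n=20, seed=1)  # p = 0.5 gives the symmetric walk
print(path)
```

The index of the returned list plays the role of the index set, and the integer entries are points of the state space.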

### Wiener process

The Wiener process is a stochastic process with stationary and independent increments that are normally distributed based on the size of the increments.[2][74] The Wiener process is named after Norbert Wiener, who proved its mathematical existence, but the process is also called the Brownian motion process or just Brownian motion due to its historical connection as a model for Brownian movement in liquids.[75][76][77]

Realizations of Wiener processes (or Brownian motion processes) with drift (blue) and without drift (red).

Playing a central role in the theory of probability, the Wiener process is often considered the most important and studied stochastic process, with connections to other stochastic processes.[1][2][3][78][79][80][81] Its index set and state space are the non-negative numbers and real numbers, respectively, so it has both a continuous index set and a continuous state space.[82] But the process can be defined more generally so its state space can be ${\displaystyle n}$-dimensional Euclidean space.[71][79][83] If the mean of any increment is zero, then the resulting Wiener or Brownian motion process is said to have zero drift. If the mean of the increment for any two points in time is equal to the time difference multiplied by some constant ${\displaystyle \mu }$, which is a real number, then the resulting stochastic process is said to have drift ${\displaystyle \mu }$.[84][85][86]
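The role of the drift can be illustrated by simulating the process on a finite time grid. The sketch below assumes the standard normalization in which an increment over a time span ${\displaystyle \Delta t}$ has variance ${\displaystyle \Delta t}$ (the text only says the increments are normally distributed according to their size), and it starts the path at zero. The name `wiener_path` and its parameters are illustrative.

```python
import math
import random

def wiener_path(mu, t_max, n_steps, seed=None):
    """Simulate a Wiener process with drift mu on an evenly spaced grid over [0, t_max].

    Assumes unit variance per unit time and a start at 0; only grid values are produced.
    """
    rng = random.Random(seed)
    dt = t_max / n_steps
    times = [i * dt for i in range(n_steps + 1)]
    values = [0.0]
    for _ in range(n_steps):
        # Independent normal increment: mean mu * dt, standard deviation sqrt(dt).
        values.append(values[-1] + rng.gauss(mu * dt, math.sqrt(dt)))
    return times, values

times, values = wiener_path(mu=1.0, t_max=2.0, n_steps=50, seed=0)
print(values[-1])  # one realization of the value at time 2
```

Setting `mu=0.0` gives a path without drift. Averaged over many realizations, the endpoint should be close to `mu * t_max`.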

Almost surely, a sample path of a Wiener process is continuous everywhere but nowhere differentiable. It can be considered as a continuous version of the simple random walk.[49][85] The process arises as the mathematical limit of other stochastic processes such as certain random walks rescaled,[87][88] which is the subject of Donsker's theorem or invariance principle, also known as the functional central limit theorem.[89][90][91]
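The rescaling behind this limit can be sketched numerically. The code below is an illustration and not a proof: it takes a symmetric simple random walk with ${\displaystyle n}$ steps, compresses time by ${\displaystyle n}$ and space by ${\displaystyle {\sqrt {n}}}$, and estimates the mean and variance of the rescaled value at time 1, which for a standard Wiener process are 0 and 1. The function names are hypothetical.

```python
import math
import random

def rescaled_walk(n, seed=None):
    """Return the values S_k / sqrt(n), k = 0..n, of a symmetric simple random walk.

    The k-th value is read as the rescaled process at time k / n.
    """
    rng = random.Random(seed)
    position, path = 0, [0.0]
    for _ in range(n):
        position += 1 if rng.random() < 0.5 else -1
        path.append(position / math.sqrt(n))
    return path

def endpoint_moments(n, trials, seed=0):
    """Sample mean and variance of the rescaled walk at time 1 over independent runs."""
    ends = [rescaled_walk(n, seed=seed + i)[-1] for i in range(trials)]
    mean = sum(ends) / trials
    var = sum((e - mean) ** 2 for e in ends) / trials
    return mean, var

print(endpoint_moments(n=200, trials=500))
```

Matching two moments at a single time point is much weaker than the functional statement of Donsker's theorem, which concerns the distribution of the whole path.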

The Wiener process is a member of some important families of stochastic processes, including Markov processes, Lévy processes and Gaussian processes.[2][49] The process also has many applications and is the main stochastic process used in stochastic calculus.[92][93] It plays a central role in quantitative finance,[94][95] where it is used, for example, in the Black–Scholes–Merton model.[96] The process is also used in different fields, including the majority of natural sciences as well as some branches of social sciences, as a mathematical model for various random phenomena.[3][97][98]

### Poisson process

The Poisson process is a stochastic process that has different forms and definitions.[99][100] It can be defined as a counting process, which is a stochastic process that represents the random number of points or events up to some time. The number of points of the process that are located in the interval from zero to some given time is a Poisson random variable that depends on that time and some parameter. This process has the natural numbers as its state space and the non-negative numbers as its index set. This process is also called the Poisson counting process, since it can be interpreted as an example of a counting process.[99]

If a Poisson process is defined with a single positive constant, then the process is called a homogeneous Poisson process.[99][101] The homogeneous Poisson process is a member of important classes of stochastic processes such as Markov processes and Lévy processes.[49]
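One common way to simulate a homogeneous Poisson process on the non-negative numbers uses the fact that the gaps between consecutive points are independent exponential random variables with the process's rate. The sketch below relies on that property, which the text above does not state; the names `poisson_arrival_times` and `count_up_to` are illustrative.

```python
import random

def poisson_arrival_times(rate, t_max, seed=None):
    """Return the points of a homogeneous Poisson process with the given rate in (0, t_max]."""
    rng = random.Random(seed)
    times = []
    t = rng.expovariate(rate)  # first gap ~ Exponential(rate)
    while t <= t_max:
        times.append(t)
        t += rng.expovariate(rate)  # later gaps are independent with the same law
    return times

def count_up_to(times, t):
    """Counting process N(t): number of points at or before time t."""
    return sum(1 for s in times if s <= t)

arrivals = poisson_arrival_times(rate=2.0, t_max=5.0, seed=0)
print(count_up_to(arrivals, 5.0))  # one realization of N(5); its mean is rate * 5
```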

The homogeneous Poisson process can be defined and generalized in different ways. It can be defined such that its index set is the real line, and this stochastic process is also called the stationary Poisson process.[102][103] If the parameter constant of the Poisson process is replaced with some non-negative integrable function of ${\displaystyle t}$, the resulting process is called an inhomogeneous or nonhomogeneous Poisson process, where the average density of points of the process is no longer constant.[104] Serving as a fundamental process in queueing theory, the Poisson process is an important process for mathematical models, where it finds applications for models of events randomly occurring in certain time windows.[105][106]
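An inhomogeneous Poisson process with a bounded intensity function can be simulated by thinning: points are generated from a homogeneous process at a rate that dominates the intensity, and each point at time ${\displaystyle t}$ is kept with probability equal to the intensity at ${\displaystyle t}$ divided by that dominating rate. This technique is not described in the text above and is offered only as a sketch; the function name, the bound `intensity_max`, and the example intensity ${\displaystyle 2t}$ are assumptions.

```python
import random

def inhomogeneous_poisson(intensity, intensity_max, t_max, seed=None):
    """Simulate points in (0, t_max] by thinning a homogeneous process.

    Requires 0 <= intensity(t) <= intensity_max on (0, t_max].
    """
    rng = random.Random(seed)
    points, t = [], 0.0
    while True:
        t += rng.expovariate(intensity_max)  # candidate from the dominating process
        if t > t_max:
            return points
        # Keep the candidate with probability intensity(t) / intensity_max.
        if rng.random() < intensity(t) / intensity_max:
            points.append(t)

points = inhomogeneous_poisson(lambda t: 2.0 * t, intensity_max=4.0, t_max=2.0, seed=0)
print(len(points))  # the expected count is the integral of 2t over [0, 2], which is 4
```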

Defined on the real line, the Poisson process can be interpreted as a stochastic process,[49][107] among other random objects.[108][109] But it can be defined on the ${\displaystyle n}$-dimensional Euclidean space or other mathematical spaces,[110] where it is often interpreted as a random set or a random counting measure, instead of a stochastic process.[108][109] In this setting, the Poisson process, also called the Poisson point process, is one of the most important objects in probability theory, both for applications and theoretical reasons.[22][111] But it has been remarked that the Poisson process does not receive as much attention as it should, partly due to it often being considered just on the real line, and not on other mathematical spaces.[111][112]

## Definitions

### Stochastic process

A stochastic process is defined as a collection of random variables defined on a common probability space ${\displaystyle (\Omega ,{\mathcal {F}},P)}$, where ${\displaystyle \Omega }$ is a sample space, ${\displaystyle {\mathcal {F}}}$ is a ${\displaystyle \sigma }$-algebra, and ${\displaystyle P}$ is a probability measure, and the random variables, indexed by some set ${\displaystyle T}$, all take values in the same mathematical space ${\displaystyle S}$, which must be measurable with respect to some ${\displaystyle \sigma }$-algebra ${\displaystyle \Sigma }$.[28]

In other words, for a given probability space ${\displaystyle (\Omega ,{\mathcal {F}},P)}$ and a measurable space ${\displaystyle (S,\Sigma )}$, a stochastic process is a collection of ${\displaystyle S}$-valued random variables, which can be written as:[60]

${\displaystyle \{X(t):t\in T\}.}$

Historically, in many problems from the natural sciences a point ${\displaystyle t\in T}$ had the meaning of time, so ${\displaystyle X(t)}$ is a random variable representing a value observed at time ${\displaystyle t}$.[113] A stochastic process can also be written as ${\displaystyle \{X(t,\omega ):t\in T\}}$ to reflect that it is actually a function of two variables, ${\displaystyle t\in T}$ and ${\displaystyle \omega \in \Omega }$.[28][114]

There are other ways to consider a stochastic process, with the above definition being considered the traditional one.[115][116] For example, a stochastic process can be interpreted or defined as an ${\displaystyle S^{T}}$-valued random variable, where ${\displaystyle S^{T}}$ is the space of all the possible ${\displaystyle S}$-valued functions of ${\displaystyle t\in T}$ that map from the set ${\displaystyle T}$ into the space ${\displaystyle S}$.[27][115]

### Index set

The set ${\displaystyle T}$ is called the index set[4][51] or parameter set[28][117] of the stochastic process. Often this set is some subset of the real line, such as the natural numbers or an interval, giving the set ${\displaystyle T}$ the interpretation of time.[1] In addition to these sets, the index set ${\displaystyle T}$ can be other linearly ordered sets or more general mathematical sets,[1][54] such as the Cartesian plane ${\displaystyle R^{2}}$ or ${\displaystyle n}$-dimensional Euclidean space, where an element ${\displaystyle t\in T}$ can represent a point in space.[48][118] But in general more results and theorems are possible for stochastic processes when the index set is ordered.[119]

### State space

The mathematical space ${\displaystyle S}$ of a stochastic process is called its state space. This mathematical space can be, for example, the integers, the real line, ${\displaystyle n}$-dimensional Euclidean space, the complex plane, or a more abstract mathematical space. The state space is defined using elements that reflect the different values that the stochastic process can take.[1][5][28][51][56]

### Sample function

A sample function is a single outcome of a stochastic process, so it is formed by taking a single possible value of each random variable of the stochastic process.[28][120] More precisely, if ${\displaystyle \{X(t,\omega ):t\in T\}}$ is a stochastic process, then for any point ${\displaystyle \omega \in \Omega }$, the mapping

${\displaystyle X(\cdot ,\omega ):T\rightarrow S,}$

is called a sample function, a realization, or, particularly when ${\displaystyle T}$ is interpreted as time, a sample path of the stochastic process ${\displaystyle \{X(t,\omega ):t\in T\}}$.[50] This means that for a fixed ${\displaystyle \omega \in \Omega }$, there exists a sample function that maps the index set ${\displaystyle T}$ to the state space ${\displaystyle S}$.[28] Other names for a sample function of a stochastic process include trajectory, path function[121] or path.[122]
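The two-variable view ${\displaystyle X(t,\omega )}$ can be made concrete with a tiny finite example. In the sketch below, which is a toy construction and not taken from the text, the sample space consists of all sequences of three steps of size ±1, the index set is ${\displaystyle \{0,1,2,3\}}$, and ${\displaystyle X(t,\omega )}$ is the sum of the first ${\displaystyle t}$ steps of ${\displaystyle \omega }$.

```python
from itertools import product

N = 3
OMEGA = list(product((-1, 1), repeat=N))  # toy sample space: all sequences of N steps
T = range(N + 1)                          # toy index set {0, 1, ..., N}

def X(t, omega):
    """Value of the process at index t for outcome omega: sum of the first t steps."""
    return sum(omega[:t])

def sample_function(omega):
    """Fix omega and vary t: the map X(., omega) from T to the state space."""
    return [X(t, omega) for t in T]

def random_variable(t):
    """Fix t and vary omega: the random variable X(t, .) as a table over OMEGA."""
    return {omega: X(t, omega) for omega in OMEGA}

print(sample_function((1, 1, -1)))
print(sorted(set(random_variable(2).values())))
```

Fixing the outcome gives one path through time, while fixing the index gives one random variable of the collection.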

### Increment

An increment of a stochastic process is the difference between two random variables of the same stochastic process. For a stochastic process with an index set that can be interpreted as time, an increment is how much the stochastic process changes over a certain time period. For example, if ${\displaystyle \{X(t):t\in T\}}$ is a stochastic process with state space ${\displaystyle S}$ and index set ${\displaystyle T=[0,\infty )}$, then for any two non-negative numbers ${\displaystyle t_{1}\in [0,\infty )}$ and ${\displaystyle t_{2}\in [0,\infty )}$ such that ${\displaystyle t_{1}\leq t_{2}}$, the difference ${\displaystyle X_{t_{2}}-X_{t_{1}}}$ is a ${\displaystyle S}$-valued random variable known as an increment.[48][49] When interested in the increments, often the state space ${\displaystyle S}$ is the real line or the natural numbers, but it can be ${\displaystyle n}$-dimensional Euclidean space or more abstract spaces such as Banach spaces.[49]
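For a discrete-time path stored as a list, an increment is simply a difference of two entries. The helper below is a hypothetical illustration; for a walk built from steps, the increment between two times equals the sum of the steps taken in between.

```python
from itertools import accumulate

def increment(path, t1, t2):
    """Return path[t2] - path[t1] for indices 0 <= t1 <= t2 within the path."""
    if not 0 <= t1 <= t2 < len(path):
        raise ValueError("need 0 <= t1 <= t2 < len(path)")
    return path[t2] - path[t1]

steps = [1, -1, -1, 1, 1]             # an example sequence of walk steps
path = [0] + list(accumulate(steps))  # positions at times 0..5

# The increment over (1, 4] equals the sum of the second, third, and fourth steps.
print(increment(path, 1, 4), sum(steps[1:4]))
```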

## Notation

A stochastic process can be denoted, among other ways, by ${\displaystyle \{X(t)\}_{t\in T}}$,[56] ${\displaystyle \{X_{t}\}_{t\in T}}$,[116] ${\displaystyle \{X_{t}\}}$,[123] ${\displaystyle \{X(t)\}}$ or simply as ${\displaystyle X}$ or ${\displaystyle X(t)}$, although ${\displaystyle X(t)}$ is regarded as an abuse of notation.[124] For example, ${\displaystyle X(t)}$ or ${\displaystyle X_{t}}$ are used to refer to the random variable with the index ${\displaystyle t}$, and not the entire stochastic process.[123] If the index set is ${\displaystyle T=[0,\infty )}$, then one can write, for example, ${\displaystyle (X_{t},t\geq 0)}$ to denote the stochastic process.[29]

## Further examples of stochastic processes

### Markov processes and chains

Markov processes are stochastic processes, traditionally in discrete or continuous time, that have the Markov property, which means the next value of the Markov process depends on the current value, but it is conditionally independent of the previous values of the stochastic process. In other words, the behavior of the process in the future is stochastically independent of its behavior in the past, given the current state of the process.[125][126]

The Brownian motion process and the Poisson process (in one dimension) are both examples of Markov processes[127] in continuous time, while random walks on the integers and the gambler's ruin problem are examples of Markov processes in discrete time.[128][129]

A Markov chain is a type of Markov process that has either discrete state space or discrete index set (often representing time), but the precise definition of a Markov chain varies.[130] For example, it is common to define a Markov chain as a Markov process in either discrete or continuous time with a countable state space (thus regardless of the nature of time),[131][132][133][134] but it is also common to define a Markov chain as having discrete time in either countable or continuous state space (thus regardless of the state space).[130]
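Under the definition with discrete time and a finite state space, a Markov chain is specified by a transition matrix, and the next state is drawn using only the current state. The sketch below uses a made-up two-state matrix; the function name and the numbers are illustrative assumptions.

```python
import random

def simulate_chain(P, start, n, seed=None):
    """Simulate n steps of a discrete-time Markov chain.

    P[i][j] is the probability of moving from state i to state j; each row sums to 1.
    """
    rng = random.Random(seed)
    states = list(range(len(P)))
    path = [start]
    for _ in range(n):
        current = path[-1]
        # The next state depends on the past only through the current state.
        path.append(rng.choices(states, weights=P[current])[0])
    return path

P = [[0.9, 0.1],
     [0.5, 0.5]]  # hypothetical transition probabilities
path = simulate_chain(P, start=0, n=20, seed=0)
print(path)
```

For this particular matrix the long-run fraction of time spent in state 0 is 5/6, which a long simulated path should approximate.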

Markov processes form an important class of stochastic processes and have applications in many areas.[39][135] For example, they are the basis for a general stochastic simulation method known as Markov chain Monte Carlo, which is used for simulating random objects with specific probability distributions, and has found application in Bayesian statistics.[136][137]
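As a rough illustration of Markov chain Monte Carlo, the sketch below implements a random-walk Metropolis sampler, one of several such methods, targeting the standard normal distribution. The target, the uniform proposal, the step size, and the function name are all assumptions chosen for the example.

```python
import math
import random

def metropolis_standard_normal(n, step=2.5, seed=None):
    """Return n successive states of a random-walk Metropolis chain targeting N(0, 1)."""
    rng = random.Random(seed)
    x, samples = 0.0, []
    for _ in range(n):
        y = x + rng.uniform(-step, step)  # symmetric proposal around the current state
        # Log of the density ratio target(y) / target(x) for the standard normal.
        log_ratio = (x * x - y * y) / 2.0
        # Accept with probability min(1, target(y) / target(x)); otherwise stay at x.
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = y
        samples.append(x)
    return samples

samples = metropolis_standard_normal(20_000, seed=0)
mean = sum(samples) / len(samples)
print(mean)  # should be near 0 for a long chain
```

The states form a Markov chain whose long-run distribution is the target, so averages along the chain approximate expectations under the target distribution.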

The concept of the Markov property was originally for stochastic processes in continuous and discrete time, but the property has been adapted for other index sets such as ${\displaystyle n}$-dimensional Euclidean space, which results in collections of random variables known as Markov random fields.[138][139][140]

### Martingale

A martingale is a discrete-time or continuous-time stochastic process with the property that the expectation of the next value of a martingale is equal to the current value given all the previous values of the process. The exact mathematical definition of a martingale requires two other conditions coupled with the mathematical concept of a filtration, which is related to the intuition of increasing available information as time passes. Martingales are usually defined to be real-valued,[141][142][143] but they can also be complex-valued[144] or even more general.[145]

A symmetric random walk and a Wiener process (with zero drift) are both examples of martingales, respectively, in discrete and continuous time.[141][142] For a sequence of independent and identically distributed random variables ${\displaystyle X_{1},X_{2},X_{3},\dots }$ with zero mean, the stochastic process formed from the successive partial sums ${\displaystyle X_{1},X_{1}+X_{2},X_{1}+X_{2}+X_{3},\dots }$ is a discrete-time martingale.[146] In this aspect, discrete-time martingales generalize the idea of partial sums of independent random variables.[147]
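For the simple random walk the martingale property can be checked exactly. Because the steps are independent, the conditional expectation of the next value given the whole past depends only on the current value, and the sketch below computes it with exact rational arithmetic. The function name is hypothetical.

```python
from fractions import Fraction

def expected_next_value(current, p):
    """Conditional expectation of the next position of a simple random walk.

    The walk moves to current + 1 with probability p and to current - 1 otherwise.
    """
    p = Fraction(p)
    return p * (current + 1) + (1 - p) * (current - 1)

# Symmetric case: the expected next value equals the current value.
print(expected_next_value(3, Fraction(1, 2)))
# Biased case: the expectation shifts by 2p - 1, so the walk is not a martingale.
print(expected_next_value(3, Fraction(3, 5)))
```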

Martingales can also be created from stochastic processes by applying some suitable transformations, which is the case for the homogeneous Poisson process (on the real line) resulting in a martingale called the compensated Poisson process.[142] Martingales can also be built from other martingales.[146] For example, there are martingales based on the Wiener process, which is itself a martingale, forming continuous-time martingales.[141][148]

Martingales mathematically formalize the idea of a fair game,[149] and they were originally developed to show that it is not possible to win a fair game.[150] But now they are used in many areas of probability, which is one of the main reasons for studying them.[143][150][151] Many problems in probability have been solved by finding a martingale in the problem and studying it.[152] Martingales will converge, given some conditions on their moments, so they are often used to derive convergence results, due largely to martingale convergence theorems.[147][153][154]

Martingales have many applications in statistics, but it has been remarked that their use and application are not as widespread as they could be in the field of statistics, particularly statistical inference.[155] They have found applications in areas in probability theory such as queueing theory and Palm calculus[156] and other fields such as economics[157] and finance.[17]

### Lévy process

Lévy processes are types of stochastic processes that can be considered as generalizations of random walks in continuous time.[49][158] These processes have many applications in fields such as finance, fluid mechanics, physics and biology.[159][160] The main defining characteristics of these processes are their stationarity and independence properties, so they were known as processes with stationary and independent increments. In other words, a stochastic process ${\displaystyle X}$ is a Lévy process if for ${\displaystyle n}$ non-negative numbers, ${\displaystyle 0\leq t_{1}\leq \dots \leq t_{n}}$, the corresponding ${\displaystyle n-1}$ increments

${\displaystyle X_{t_{2}}-X_{t_{1}},\dots ,X_{t_{n}}-X_{t_{n-1}},}$

are all independent of each other, and the distribution of each increment only depends on the difference in time.[49]

A Lévy process can be defined such that its state space is some abstract mathematical space, such as a Banach space, but the processes are often defined so that they take values in Euclidean space. The index set is the non-negative numbers, so ${\displaystyle T=[0,\infty )}$, which gives the interpretation of time. Important stochastic processes such as the Wiener process, the homogeneous Poisson process (in one dimension), and subordinators are all Lévy processes.[49][158]
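Stationary and independent increments can be checked empirically for the homogeneous Poisson process. The sketch below, which assumes exponential gaps between points and uses hypothetical function names, counts points in the adjacent unit intervals ${\displaystyle [0,1)}$ and ${\displaystyle [1,2)}$ over many runs. Stationarity suggests equal means, and independence suggests a sample covariance near zero.

```python
import random

def counts_in_unit_intervals(rate, rng):
    """Return the numbers of points of one Poisson realization in [0, 1) and [1, 2)."""
    counts = [0, 0]
    t = rng.expovariate(rate)
    while t < 2.0:
        counts[int(t)] += 1  # int(t) is 0 on [0, 1) and 1 on [1, 2)
        t += rng.expovariate(rate)
    return counts

def increment_statistics(rate, trials, seed=0):
    """Sample means of both increments and their sample covariance."""
    rng = random.Random(seed)
    pairs = [counts_in_unit_intervals(rate, rng) for _ in range(trials)]
    mean_a = sum(a for a, _ in pairs) / trials
    mean_b = sum(b for _, b in pairs) / trials
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs) / trials
    return mean_a, mean_b, cov

print(increment_statistics(rate=3.0, trials=2000))
```

A near-zero covariance is only a necessary symptom of independence, so this is a sanity check and not a verification of the Lévy property.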

### Random field

A random field is a collection of random variables indexed by an ${\displaystyle n}$-dimensional Euclidean space or some manifold. In general, a random field can be considered an example of a stochastic or random process, where the index set is not necessarily a subset of the real line.[30] But there is a convention that an indexed collection of random variables is called a random field when the index has two or more dimensions.[5][28][161] If the specific definition of a stochastic process requires the index set to be a subset of the real line, then the random field can be considered as a generalization of stochastic process.[162]

### Point process

A point process is a collection of points randomly located on some mathematical space such as the real line, ${\displaystyle n}$-dimensional Euclidean space, or more abstract spaces. Sometimes the term point process is not preferred, as historically the word process denoted an evolution of some system in time, so a point process is also called a random point field.[163] There are different interpretations of a point process, such as a random counting measure or a random set.[164][165] Some authors regard a point process and stochastic process as two different objects such that a point process is a random object that arises from or is associated with a stochastic process,[166][167] though it has been remarked that the difference between point processes and stochastic processes is not clear.[167]

Other authors consider a point process as a stochastic process, where the process is indexed by sets of the underlying space[b] on which it is defined, such as the real line or ${\displaystyle n}$-dimensional Euclidean space.[170][171] Other stochastic processes such as renewal and counting processes are studied in the theory of point processes.[172][173]

## History

### Early probability theory

Probability theory has its origins in games of chance, which have a long history, with some games being played thousands of years ago,[174] but very little analysis on them was done in terms of probability.[175] The year 1654 is often considered the birth of probability theory when French mathematicians Pierre Fermat and Blaise Pascal had a written correspondence on probability, motivated by a gambling problem.[176][177] But there was earlier mathematical work done on the probability of gambling games such as Liber de Ludo Aleae by Gerolamo Cardano, written in the 16th century but posthumously published later in 1663.[178]

After Cardano, Jakob Bernoulli[c] wrote Ars Conjectandi, which is considered a significant event in the history of probability theory. Bernoulli's book was published, also posthumously, in 1713 and inspired many mathematicians to study probability.[180][181] But despite some renowned mathematicians contributing to probability theory, such as Pierre-Simon Laplace, Abraham de Moivre, Carl Gauss, Siméon Poisson and Pafnuty Chebyshev,[182][183] most of the mathematical community[d] did not consider probability theory to be part of mathematics until the 20th century.[182][184][185][186]

### Statistical mechanics

In the physical sciences, scientists developed in the 19th century the discipline of statistical mechanics, where physical systems, such as containers filled with gases, can be regarded or treated mathematically as collections of many moving particles. Although there were attempts to incorporate randomness into statistical physics by some scientists, such as Rudolf Clausius, most of the work had little or no randomness.[187][188] This changed in 1859 when James Clerk Maxwell contributed significantly to the field, more specifically, to the kinetic theory of gases, by presenting work where he assumed the gas particles move in random directions at random velocities.[189][190] The kinetic theory of gases and statistical physics continued to be developed in the second half of the 19th century, with work done chiefly by Clausius, Ludwig Boltzmann and Josiah Gibbs, which would later have an influence on Albert Einstein's mathematical model for Brownian movement.[191]

### Measure theory and probability theory

In 1900 at the International Congress of Mathematicians in Paris David Hilbert presented a list of mathematical problems, where his sixth problem asked for a mathematical treatment of physics and probability involving axioms.[183] Around the start of the 20th century, mathematicians developed measure theory, a branch of mathematics for studying integrals of mathematical functions, where two of the founders were French mathematicians, Henri Lebesgue and Émile Borel. In 1925 another French mathematician Paul Lévy published the first probability book that used ideas from measure theory.[183]

In the 1920s fundamental contributions to probability theory were made in the Soviet Union by mathematicians such as Sergei Bernstein, Aleksandr Khinchin,[e] and Andrei Kolmogorov.[186] Kolmogorov published in 1929 his first attempt at presenting a mathematical foundation, based on measure theory, for probability theory.[193] In the early 1930s Khinchin and Kolmogorov set up probability seminars, which were attended by researchers such as Eugene Slutsky and Nikolai Smirnov,[194] and Khinchin gave the first mathematical definition of a stochastic process as a set of random variables indexed by the real line.[192][195][f]

### Birth of modern probability theory

In 1933 Andrei Kolmogorov published in German his book on the foundations of probability theory titled Grundbegriffe der Wahrscheinlichkeitsrechnung,[g] where Kolmogorov used measure theory to develop an axiomatic framework for probability theory. The publication of this book is now widely considered to be the birth of modern probability theory, when the theories of probability and stochastic processes became parts of mathematics.[183][186]

After the publication of Kolmogorov's book, further fundamental work on probability theory and stochastic processes was done by Khinchin and Kolmogorov as well as other mathematicians such as Joseph Doob, William Feller, Maurice Fréchet, Paul Lévy, Wolfgang Doeblin, and Harald Cramér.[183][186] Decades later Cramér referred to the 1930s as the "heroic period of mathematical probability theory".[186] World War II greatly interrupted the development of probability theory, causing, for example, the migration of Feller from Sweden to the United States of America[186] and the death of Doeblin, considered now a pioneer in stochastic processes.[197]

Mathematician Joseph Doob did early work on the theory of stochastic processes, making fundamental contributions, particularly in the theory of martingales.[198][196] His book Stochastic Processes is considered highly influential in the field of probability theory.[199]

### Stochastic processes after World War II

After World War II the study of probability theory and stochastic processes gained more attention from mathematicians, with significant contributions made in many areas of probability and mathematics as well as the creation of new areas.[186][200] Starting in the 1940s, Kiyosi Itô published papers developing the field of stochastic calculus, which involves stochastic integrals and stochastic differential equations based on the Wiener or Brownian motion process.[201]

Also starting in the 1940s, connections were made between stochastic processes, particularly martingales, and the mathematical field of potential theory, with early ideas by Shizuo Kakutani and then later work by Joseph Doob.[200] Further work, considered pioneering, was done by Gilbert Hunt in the 1950s, connecting Markov processes and potential theory, which had a significant effect on the theory of Lévy processes and led to more interest in studying Markov processes with methods developed by Itô.[21][202][203]

In 1953 Doob published his book Stochastic Processes, which had a strong influence on the theory of stochastic processes and stressed the importance of measure theory in probability.[200][199] Doob also chiefly developed the theory of martingales, with later substantial contributions by Paul-André Meyer. Earlier work had been carried out by Sergei Bernstein, Paul Lévy and Jean Ville, the latter adopting the term martingale for the stochastic process.[204][205] Methods from the theory of martingales became popular for solving various probability problems. Techniques and theory were developed to study Markov processes and then applied to martingales. Conversely, methods from the theory of martingales were established to treat Markov processes.[200]

Other fields of probability were developed and used to study stochastic processes, with one main approach being the theory of large deviations.[200] The theory has many applications in statistical physics, among other fields, and has core ideas going back to at least the 1930s. Later in the 1960s and 1970s fundamental work was done by Alexander Wentzell in the Soviet Union and Monroe D. Donsker and Srinivasa Varadhan in the United States of America,[206] which would later result in Varadhan winning the 2007 Abel Prize.[207] In the 1990s and 2000s the theories of Schramm–Loewner evolution[208] and rough paths[209] were introduced and developed to study stochastic processes and other mathematical objects in probability theory, which respectively resulted in Fields Medals being awarded to Wendelin Werner[210] in 2008 and to Martin Hairer in 2014.[211]

The theory of stochastic processes still continues to be a focus of research, with yearly international conferences on the topic of stochastic processes.[45][159]

### Discoveries of specific stochastic processes

Although Khinchin gave mathematical definitions of stochastic processes in the 1930s,[192][195] specific stochastic processes had already been discovered in different settings, such as the Brownian motion process and the Poisson process.[21][24] Some families of stochastic processes such as point processes or renewal processes have long and complex histories, stretching back centuries.[212]

#### Bernoulli process

The Bernoulli process, which can serve as a mathematical model for flipping a biased coin, is possibly the first stochastic process to have been studied.[61] The process is a sequence of independent Bernoulli trials,[62] which are named after Jakob Bernoulli, who used them to study games of chance, including probability problems proposed and studied earlier by Christiaan Huygens.[213] Bernoulli's work, including the Bernoulli process, was published in his book Ars Conjectandi in 1713.[214]

#### Random walks

In 1905 Karl Pearson coined the term random walk while posing a problem describing a random walk on the plane, which was motivated by an application in biology, but such problems involving random walks had already been studied in other fields. Certain gambling problems that were studied centuries earlier can be considered as problems involving random walks.[69][214] For example, the problem known as the Gambler's ruin is based on a simple random walk,[129][215] and is an example of a random walk with absorbing barriers.[176][216] Pascal, Fermat and Huygens all gave numerical solutions to this problem without detailing their methods,[217] and then more detailed solutions were presented by Jakob Bernoulli and Abraham de Moivre.[218]

For random walks in ${\displaystyle n}$-dimensional integer lattices, George Pólya published work in 1919 and 1921 in which he studied the probability of a symmetric random walk returning to a previous position in the lattice. Pólya showed that a symmetric random walk, which has an equal probability to advance in any direction in the lattice, will return to a previous position in the lattice an infinite number of times with probability one in one and two dimensions, but with probability zero in three or higher dimensions.[219][220]
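Pólya's result can be illustrated, though not proved, by simulation. The sketch below is a minimal Monte Carlo experiment that estimates how often a simple symmetric walk revisits the origin within a finite number of steps in one, two and three dimensions. The function names, the step budget, the number of trials and the seed are all assumptions chosen for illustration; a finite horizon can only hint at the infinite-time behaviour described above.

```python
import random

def returns_to_origin(dim, max_steps, rng):
    """Return True if a simple symmetric walk on Z^dim revisits the origin within max_steps."""
    pos = [0] * dim
    for _ in range(max_steps):
        # Pick one of the 2*dim neighbouring lattice points uniformly at random.
        axis = rng.randrange(dim)
        pos[axis] += rng.choice((-1, 1))
        if not any(pos):  # all coordinates are zero: back at the origin
            return True
    return False

def return_frequency(dim, trials=300, max_steps=1000, seed=0):
    """Fraction of simulated walks that return to the origin within max_steps."""
    rng = random.Random(seed)
    hits = sum(returns_to_origin(dim, max_steps, rng) for _ in range(trials))
    return hits / trials

# Estimated return frequencies for dimensions 1, 2 and 3.
freqs = {d: return_frequency(d) for d in (1, 2, 3)}
```

With these settings the estimated frequency should be close to one in one dimension, smaller in two dimensions (where returns are certain but can take a very long time), and well below one in three dimensions.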

#### Wiener process

The Wiener process or Brownian motion process has its origins in different fields including statistics, finance and physics.[21] In 1880, Thorvald Thiele wrote a paper on the method of least squares, where he used the process to study the errors of a model in time-series analysis.[221][222] The work is now considered as an early discovery of the statistical method known as Kalman filtering, but the work was largely overlooked. It is thought that the ideas in Thiele's paper were too advanced to have been understood by the broader mathematical and statistical community at the time.[222]

Norbert Wiener gave the first mathematical proof of the existence of the Wiener process. This mathematical object had appeared previously in the work of Thorvald Thiele, Louis Bachelier, and Albert Einstein.[21]

The French mathematician Louis Bachelier used a Wiener process in his 1900 thesis in order to model price changes on the Paris Bourse, a stock exchange,[223] without knowing the work of Thiele.[21] It has been speculated that Bachelier drew ideas from the random walk model of Jules Regnault, but Bachelier did not cite him,[224] and Bachelier's thesis is now considered pioneering in the field of financial mathematics.[223][224]

It is commonly thought that Bachelier's work gained little attention and was forgotten for decades until it was rediscovered in the 1950s by Leonard Savage, and then became more popular after Bachelier's thesis was translated into English in 1964. But the work was never forgotten in the mathematical community, as Bachelier published a book in 1912 detailing his ideas,[224] which was cited by mathematicians including Doob, Feller[224] and Kolmogorov.[21] The book continued to be cited, but then starting in the 1960s the original thesis by Bachelier began to be cited more than his book when economists started citing Bachelier's work.[224]

In 1905 Albert Einstein published a paper where he studied the physical observation of Brownian motion or movement to explain the seemingly random movements of particles in liquids by using ideas from the kinetic theory of gases. Einstein derived a differential equation, known as a diffusion equation, for describing the probability of finding a particle in a certain region of space. Shortly after Einstein's first paper on Brownian movement, Marian Smoluchowski published work where he cited Einstein, but wrote that he had independently derived the equivalent results by using a different method.[225]

Einstein's work, as well as experimental results obtained by Jean Perrin, later inspired Norbert Wiener in the 1920s[226] to use a type of measure theory, developed by Percy Daniell, and Fourier analysis to prove the existence of the Wiener process as a mathematical object.[21]

#### Poisson process

The Poisson process is named after Siméon Poisson, due to its definition involving the Poisson distribution, but Poisson never studied the process.[22][227] There are a number of claims for early uses or discoveries of the Poisson process.[22][24] At the beginning of the 20th century the Poisson process would arise independently in different situations.[22][24] In Sweden in 1903, Filip Lundberg published a thesis containing work, now considered fundamental and pioneering, where he proposed to model insurance claims with a homogeneous Poisson process.[228][229]

Another discovery occurred in Denmark in 1909 when A.K. Erlang derived the Poisson distribution while developing a mathematical model for the number of incoming phone calls in a finite time interval. Erlang was not at the time aware of Poisson's earlier work and assumed that the numbers of phone calls arriving in each interval of time were independent of each other. He then found the limiting case, which is effectively recasting the Poisson distribution as a limit of the binomial distribution.[22]
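The limiting case mentioned above can be checked numerically. The following sketch compares the binomial probabilities with ${\displaystyle n}$ trials and success probability ${\displaystyle \lambda /n}$ against the Poisson probabilities with mean ${\displaystyle \lambda }$ as ${\displaystyle n}$ grows. The value of `lam`, the range of counts compared and the helper names are illustrative assumptions, not part of Erlang's derivation.

```python
import math

def binomial_pmf(k, n, p):
    """P(K = k) for a binomial distribution with n trials and success probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(K = k) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 3.0  # assumed mean number of calls per interval

def max_gap(n):
    """Largest pointwise difference between the two pmfs over k = 0, ..., 10."""
    return max(abs(binomial_pmf(k, n, lam / n) - poisson_pmf(k, lam)) for k in range(11))

# The gap shrinks as the number of trials n grows while n * p stays fixed at lam.
gaps = [max_gap(n) for n in (10, 100, 1000, 10000)]
```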

In 1910 Ernest Rutherford and Hans Geiger published experimental results on counting alpha particles. Motivated by their work, Harry Bateman studied the counting problem and derived Poisson probabilities as a solution to a family of differential equations, resulting in the independent discovery of the Poisson process.[22] After this time there were many studies and applications of the Poisson process, but its early history is complicated, which has been explained by the various applications of the process in numerous fields by biologists, ecologists, engineers and various physical scientists.[22]

#### Markov processes

Markov processes and Markov chains are named after Andrey Markov who studied Markov chains in the early 20th century. Markov was interested in studying an extension of independent random sequences. In his first paper on Markov chains, published in 1906, Markov showed that under certain conditions the average outcomes of the Markov chain would converge to a fixed vector of values, so proving a weak law of large numbers without the independence assumption,[230][231][232] which had been commonly regarded as a requirement for such mathematical laws to hold.[232] Markov later used Markov chains to study the distribution of vowels in Eugene Onegin, written by Alexander Pushkin, and proved a central limit theorem for such chains.[230]
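The convergence of average outcomes described above can be illustrated with a small simulation. The sketch below uses a hypothetical two-state chain (the transition probabilities `P01` and `P10`, the number of steps and the seed are assumptions) and compares the fraction of time spent in state 1 with the stationary probability of that state, even though successive states are not independent.

```python
import random

def simulate_two_state_chain(p01, p10, steps, seed=0):
    """Simulate a Markov chain on {0, 1}; return the fraction of steps spent in state 1."""
    rng = random.Random(seed)
    state, visits_to_one = 0, 0
    for _ in range(steps):
        if state == 0:
            state = 1 if rng.random() < p01 else 0  # move 0 -> 1 with probability p01
        else:
            state = 0 if rng.random() < p10 else 1  # move 1 -> 0 with probability p10
        visits_to_one += state
    return visits_to_one / steps

P01, P10 = 0.3, 0.1
# Stationary probability of state 1, from the balance equation pi0 * P01 = pi1 * P10.
stationary_one = P01 / (P01 + P10)
time_average = simulate_two_state_chain(P01, P10, steps=100_000)
```

For a long run, `time_average` should be close to `stationary_one`, which is the behaviour a law of large numbers for dependent sequences predicts.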

In 1912 Poincaré studied Markov chains on finite groups with an aim to study card shuffling. Other early uses of Markov chains include a diffusion model, introduced by Paul and Tatyana Ehrenfest in 1907, and a branching process, introduced by Francis Galton and Henry William Watson in 1873, preceding the work of Markov.[230][231] After the work of Galton and Watson, it was later revealed that their branching process had been independently discovered and studied around three decades earlier by Irénée-Jules Bienaymé.[233] Starting in 1928, Maurice Fréchet became interested in Markov chains, eventually resulting in him publishing in 1938 a detailed study on Markov chains.[230][234]

Andrei Kolmogorov developed in a 1931 paper a large part of the early theory of continuous-time Markov processes.[186][193] Kolmogorov was partly inspired by Louis Bachelier's 1900 work on fluctuations in the stock market as well as Norbert Wiener's work on Einstein's model of Brownian movement.[193][235] He introduced and studied a particular set of Markov processes known as diffusion processes, where he derived a set of differential equations describing the processes.[193][236] Independent of Kolmogorov's work, Sydney Chapman derived in a 1928 paper an equation, now called the Chapman–Kolmogorov equation, in a less mathematically rigorous way than Kolmogorov, while studying Brownian movement.[237] The differential equations are now called the Kolmogorov equations[238] or the Kolmogorov–Chapman equations.[239] Other mathematicians who contributed significantly to the foundations of Markov processes include William Feller, starting in the 1930s, and then later Eugene Dynkin, starting in the 1950s.[186]

#### Lévy processes

Lévy processes such as the Wiener process and the Poisson process (on the real line) are named after Paul Lévy who started studying them in the 1930s,[159] but they have connections to infinitely divisible distributions going back to the 1920s.[158] In a 1932 paper Kolmogorov derived a characteristic function for random variables associated with Lévy processes. This result was later derived under more general conditions by Lévy in 1934, and then Khinchin independently gave an alternative form for this characteristic function in 1937.[186][240] In addition to Lévy, Khinchin and Kolmogorov, early fundamental contributions to the theory of Lévy processes were made by Bruno de Finetti and Kiyosi Itô.[158]

## Etymology

The word stochastic in English was originally used as an adjective with the definition "pertaining to conjecturing", and stemming from a Greek word meaning "to aim at a mark, guess", and the Oxford English Dictionary gives the year 1662 as its earliest occurrence.[241] In his work on probability Ars Conjectandi, originally published in Latin in 1713, Jakob Bernoulli used the phrase "Ars Conjectandi sive Stochastice", which has been translated to "the art of conjecturing or stochastics".[242] This phrase was used, with reference to Bernoulli, by Ladislaus Bortkiewicz[243] who in 1917 wrote in German the word stochastik with a sense meaning random. The term stochastic process first appeared in English in a 1934 paper by Joseph Doob.[241] For the term and a specific mathematical definition, Doob cited another 1934 paper, where the term stochastischer Prozeß was used in German by Aleksandr Khinchin,[192][244] though the German term had been used earlier, for example, by Andrei Kolmogorov in 1931.[245]

Early occurrences of the word random in English with its current meaning, relating to chance or luck, date back to the 16th century, while earlier recorded usages started in the 14th century as a noun meaning "impetuosity, great speed, force, or violence (in riding, running, striking, etc.)". The word itself comes from a Middle French word meaning "speed, haste", and it is probably derived from a French verb meaning "to run" or "to gallop". The first written appearance of the term random process pre-dates stochastic process, which the Oxford English Dictionary also gives as a synonym, and was used in an article by Francis Edgeworth published in 1888.[246]

## Terminology

The definition of a stochastic process varies,[247] but a stochastic process is traditionally defined as a collection of random variables indexed by some set.[115][116] The terms random process and stochastic process are considered synonyms and are used interchangeably, without the index set being precisely specified.[27][29][30][248][249][250] Both "collection"[28][248] and "family"[4][251] are used, while instead of "index set", sometimes the terms "parameter set"[28] or "parameter space"[30] are used.

The term random function is also used to refer to a stochastic or random process,[5][252][253] though sometimes it is only used when the stochastic process takes real values.[28][251] This term is also used when the index sets are mathematical spaces other than the real line,[5][254] while the terms stochastic process and random process are usually used when the index set is interpreted as time,[5][254][255] and other terms are used such as random field when the index set is ${\displaystyle n}$-dimensional Euclidean space ${\displaystyle R^{n}}$ or a manifold.[5][28][30]

## Further definitions

### Law

For a stochastic process ${\displaystyle X\colon \Omega \rightarrow S^{T}}$ defined on the probability space ${\displaystyle (\Omega ,{\mathcal {F}},P)}$, the law of stochastic process ${\displaystyle X}$ is defined as the image measure:

${\displaystyle \mu =P\circ X^{-1},}$

where ${\displaystyle P}$ is a probability measure, the symbol ${\displaystyle \circ }$ denotes function composition and ${\displaystyle X^{-1}}$ is the pre-image of the measurable function or, equivalently, the ${\displaystyle S^{T}}$-valued random variable ${\displaystyle X}$, where ${\displaystyle S^{T}}$ is the space of all the possible ${\displaystyle S}$-valued functions of ${\displaystyle t\in T}$, so the law of a stochastic process is a probability measure.[27][115][209][256]

For a measurable subset ${\displaystyle B}$ of ${\displaystyle S^{T}}$, the pre-image of ${\displaystyle X}$ gives

${\displaystyle X^{-1}(B)=\{\omega \in \Omega :X(\omega )\in B\},}$

so the law of ${\displaystyle X}$ can be written as:[28]

${\displaystyle \mu (B)=P(\{\omega \in \Omega :X(\omega )\in B\}).}$

The law of a stochastic process or a random variable is also called the probability law, probability distribution, or the distribution.[113][209][257][258][259]
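On a finite probability space the image measure can be computed by direct enumeration. The sketch below is a toy example under assumed choices: the sample space is the outcomes of two fair coin flips, the index set is ${\displaystyle T=\{0,1\}}$, and the process records the running number of heads. None of these names or choices come from the text above; they only make the formula ${\displaystyle \mu (B)=P(X^{-1}(B))}$ concrete.

```python
from fractions import Fraction
from itertools import product

# Hypothetical sample space: outcomes of two fair coin flips (1 = heads, 0 = tails).
omega_space = list(product((0, 1), repeat=2))
P = {omega: Fraction(1, 4) for omega in omega_space}

def X(omega):
    """Sample path (X_0, X_1): running count of heads after each flip."""
    return (omega[0], omega[0] + omega[1])

def law(B):
    """mu(B) = P({omega in Omega : X(omega) in B}) for a set B of sample paths."""
    return sum(P[omega] for omega in omega_space if X(omega) in B)

all_paths = {X(omega) for omega in omega_space}
# Probability that the path ends with exactly one head.
mu_ends_at_one = law({(0, 1), (1, 1)})
```

Here `law(all_paths)` equals 1, as expected of a probability measure on the space of sample paths, and `mu_ends_at_one` equals 1/2.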

### Finite-dimensional probability distributions

For a stochastic process ${\displaystyle X}$ with law ${\displaystyle \mu }$, its finite-dimensional distributions are defined as:

${\displaystyle \mu _{t_{1},\dots ,t_{n}}=P\circ (X({t_{1}}),\dots ,X({t_{n}}))^{-1},}$

where ${\displaystyle n\geq 1}$ is a counting number and each ${\displaystyle t_{i}}$ is an element of the index set ${\displaystyle T}$, so each ${\displaystyle t_{i}\in T}$, which means that ${\displaystyle t_{1},\dots ,t_{n}}$ is any finite collection of points of the index set ${\displaystyle T}$.[27][260]

For any measurable subset ${\displaystyle C}$ of the ${\displaystyle n}$-fold Cartesian power ${\displaystyle S^{n}=S\times \dots \times S}$, the finite-dimensional distributions of a stochastic process ${\displaystyle X}$ can be written as:[28]

${\displaystyle \mu _{t_{1},\dots ,t_{n}}(C)=P{\Big (}{\big \{}\omega \in \Omega :{\big (}X_{t_{1}}(\omega ),\dots ,X_{t_{n}}(\omega ){\big )}\in C{\big \}}{\Big )}.}$

The finite-dimensional distributions of a stochastic process satisfy two mathematical conditions known as consistency conditions.[57]
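As a sketch, the two consistency conditions are commonly stated as follows, where the measurable sets ${\displaystyle C_{1},\dots ,C_{n}\subset S}$ and the permutation ${\displaystyle \pi }$ of ${\displaystyle \{1,\dots ,n\}}$ are notation introduced here for illustration. The first condition says that reordering the indices together with the sets does not change the probability:

${\displaystyle \mu _{t_{\pi (1)},\dots ,t_{\pi (n)}}(C_{\pi (1)}\times \dots \times C_{\pi (n)})=\mu _{t_{1},\dots ,t_{n}}(C_{1}\times \dots \times C_{n}).}$

The second says that leaving one coordinate unconstrained recovers the lower-dimensional distribution:

${\displaystyle \mu _{t_{1},\dots ,t_{n},t_{n+1}}(C_{1}\times \dots \times C_{n}\times S)=\mu _{t_{1},\dots ,t_{n}}(C_{1}\times \dots \times C_{n}).}$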

### Stationarity

Stationarity is a mathematical property that a stochastic process has when all the random variables of that stochastic process are identically distributed. In other words, if ${\displaystyle X}$ is a stationary stochastic process, then for any ${\displaystyle t\in T}$ the random variable ${\displaystyle X_{t}}$ has the same distribution, which means that for any set of ${\displaystyle n}$ index set values ${\displaystyle t_{1},\dots ,t_{n}}$, the corresponding ${\displaystyle n}$ random variables

${\displaystyle X_{t_{1}},\dots ,X_{t_{n}},}$

all have the same probability distribution. The index set of a stationary stochastic process is usually interpreted as time, so it can be the integers or the real line.[261][262] But the concept of stationarity also exists for point processes and random fields, where the index set is not interpreted as time.[261][263][264]

When the index set ${\displaystyle T}$ can be interpreted as time, a stochastic process is said to be stationary if its finite-dimensional distributions are invariant under translations of time. This type of stochastic process can be used to describe a physical system that is in steady state, but still experiences random fluctuations.[261] The intuition behind stationarity is that as time passes the distribution of the stationary stochastic process remains the same.[265] A sequence of random variables forms a stationary stochastic process only if the random variables are identically distributed.[261]

A stochastic process with the above definition of stationarity is sometimes said to be strictly stationary, but there are other forms of stationarity. For example, a discrete-time or continuous-time stochastic process ${\displaystyle X}$ is said to be stationary in the wide sense if the process ${\displaystyle X}$ has a finite second moment for all ${\displaystyle t\in T}$ and the covariance of the two random variables ${\displaystyle X_{t}}$ and ${\displaystyle X_{t+h}}$ depends only on the number ${\displaystyle h}$ for all ${\displaystyle t\in T}$.[265][266] Khinchin introduced the related concept of stationarity in the wide sense, which has other names including covariance stationarity or stationarity in the broad sense.[266][267]
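The covariance condition for wide-sense stationarity can be verified exactly on a small example. The sketch below assumes a hypothetical process ${\displaystyle X_{t}=A\cos(\omega t)+B\sin(\omega t)}$, where ${\displaystyle A}$ and ${\displaystyle B}$ are independent and each equals ${\displaystyle \pm 1}$ with probability one half; the frequency `OMEGA` and the function names are illustrative choices. Enumerating the four equally likely outcomes gives the covariance without any sampling error.

```python
import math
from itertools import product

OMEGA = 0.7  # assumed angular frequency, chosen only for illustration

# The four equally likely outcomes (A, B), each coordinate being -1 or +1.
outcomes = list(product((-1, 1), repeat=2))

def x(t, a, b):
    """Value of the process at time t for the outcome (a, b)."""
    return a * math.cos(OMEGA * t) + b * math.sin(OMEGA * t)

def mean(t):
    """Exact mean of X_t, averaging over the four outcomes."""
    return sum(x(t, a, b) for a, b in outcomes) / len(outcomes)

def covariance(t, h):
    """Exact covariance of X_t and X_{t+h}."""
    m_t, m_th = mean(t), mean(t + h)
    return sum((x(t, a, b) - m_t) * (x(t + h, a, b) - m_th) for a, b in outcomes) / len(outcomes)

lag = 2
# Covariances at the same lag but different starting times t = 0, ..., 5.
covs = [covariance(t, lag) for t in range(6)]
```

Every entry of `covs` equals ${\displaystyle \cos(\omega h)}$ up to floating-point rounding, so the covariance depends only on the lag. This example process is not strictly stationary, since ${\displaystyle X_{0}=A}$ takes only the values ${\displaystyle \pm 1}$ while ${\displaystyle X_{1}}$ takes other values, which illustrates that the wide-sense notion is a different requirement.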

### Filtration

A filtration is an increasing sequence of sigma-algebras defined in relation to some probability space and an index set that has some total order relation, such as in the case of the index set being some subset of the real numbers. More formally, if a stochastic process has an index set with a total order, then a filtration ${\displaystyle \{{\mathcal {F}}_{t}\}_{t\in T}}$, on a probability space ${\displaystyle (\Omega ,{\mathcal {F}},P)}$ is a family of sigma-algebras such that ${\displaystyle {\mathcal {F}}_{s}\subseteq {\mathcal {F}}_{t}\subseteq {\mathcal {F}}}$ for all ${\displaystyle s\leq t}$, where ${\displaystyle t,s\in T}$ and ${\displaystyle \leq }$ denotes the total order of the index set ${\displaystyle T}$.[51] With the concept of a filtration, it is possible to study the amount of information contained in a stochastic process ${\displaystyle X_{t}}$ at ${\displaystyle t\in T}$, which can be interpreted as the moment or time ${\displaystyle t}$.[51][143] The intuition behind a filtration ${\displaystyle {\mathcal {F}}_{t}}$ is that as time ${\displaystyle t}$ passes, more and more information on ${\displaystyle X_{t}}$ is known or available, which is captured in ${\displaystyle {\mathcal {F}}_{t}}$, resulting in finer and finer partitions of ${\displaystyle \Omega }$.[268][269]
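The "finer and finer partitions" intuition can be made concrete on a finite sample space. The sketch below assumes three coin flips as the sample space and takes the information at time ${\displaystyle t}$ to be the first ${\displaystyle t}$ flips; the blocks returned by `partition(t)` are the atoms that generate the corresponding sigma-algebra. The setting and the function names are illustrative assumptions.

```python
from itertools import product

N = 3  # assumed number of coin flips
omega_space = list(product("HT", repeat=N))

def partition(t):
    """Group outcomes that agree on the first t flips (the atoms generating F_t)."""
    blocks = {}
    for omega in omega_space:
        blocks.setdefault(omega[:t], set()).add(omega)
    return list(blocks.values())

def refines(fine, coarse):
    """True if every block of `fine` lies inside some block of `coarse`."""
    return all(any(block <= big for big in coarse) for block in fine)

partitions = [partition(t) for t in range(N + 1)]
# Number of blocks at times 0, 1, 2, 3.
sizes = [len(p) for p in partitions]
```

The number of blocks doubles with each flip, and each partition refines the previous one, which mirrors the inclusion ${\displaystyle {\mathcal {F}}_{s}\subseteq {\mathcal {F}}_{t}}$ for ${\displaystyle s\leq t}$.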

### Modification

A modification of a stochastic process is another stochastic process, which is closely related to the original stochastic process. More precisely, a stochastic process ${\displaystyle X}$ that has the same index set ${\displaystyle T}$, state space ${\displaystyle S}$, and probability space ${\displaystyle (\Omega ,{\cal {F}},P)}$ as another stochastic process ${\displaystyle Y}$ is said to be a modification of ${\displaystyle Y}$ if for all ${\displaystyle t\in T}$ the following

${\displaystyle P(X_{t}=Y_{t})=1,}$

holds. Two stochastic processes that are modifications of each other have the same law[270] and they are said to be stochastically equivalent or equivalent.[271]

Instead of modification, the term version is also used.[263][272][273][274] However, some authors use the term version when two stochastic processes have the same finite-dimensional distributions but may be defined on different probability spaces. Two processes that are modifications of each other are therefore also versions of each other in the latter sense, but the converse does not hold.[275][209]

If a continuous-time real-valued stochastic process meets certain moment conditions on its increments, then the Kolmogorov continuity theorem says that there exists a modification of this process that has continuous sample paths with probability one, so the stochastic process has a continuous modification or version.[273][274][276] The theorem can also be generalized to random fields so the index set is ${\displaystyle n}$-dimensional Euclidean space[277] as well as to stochastic processes with metric spaces as their state spaces.[278]
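As a sketch of what such a moment condition looks like, one common statement of the theorem requires positive constants ${\displaystyle \alpha }$, ${\displaystyle \beta }$ and ${\displaystyle K}$ (notation introduced here, not taken from the text above) such that

${\displaystyle E{\big [}|X_{t}-X_{s}|^{\alpha }{\big ]}\leq K|t-s|^{1+\beta }}$

for all ${\displaystyle s,t}$ in the index set. For the Wiener process, the increment ${\displaystyle X_{t}-X_{s}}$ is normally distributed with mean zero and variance ${\displaystyle |t-s|}$, so

${\displaystyle E{\big [}|X_{t}-X_{s}|^{4}{\big ]}=3|t-s|^{2},}$

and the condition holds with ${\displaystyle \alpha =4}$, ${\displaystyle \beta =1}$ and ${\displaystyle K=3}$.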

### Indistinguishable

Two stochastic processes ${\displaystyle X}$ and ${\displaystyle Y}$ defined on the same probability space ${\displaystyle (\Omega ,{\mathcal {F}},P)}$ with the same index set ${\displaystyle T}$ and state space ${\displaystyle S}$ are said to be indistinguishable if the following

${\displaystyle P(X_{t}=Y_{t}{\text{ for all }}t\in T)=1,}$

holds.[209][270] If two processes ${\displaystyle X}$ and ${\displaystyle Y}$ are modifications of each other and are almost surely continuous, then ${\displaystyle X}$ and ${\displaystyle Y}$ are indistinguishable.[279]
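A standard example, sketched here under assumptions that are not part of the text above, shows that two processes can be modifications of each other without being indistinguishable when the paths are not continuous. Let ${\displaystyle U}$ be a random variable uniformly distributed on ${\displaystyle [0,1]}$, take the index set ${\displaystyle T=[0,1]}$, and define ${\displaystyle X_{t}=0}$ for all ${\displaystyle t}$, while ${\displaystyle Y_{t}=1}$ if ${\displaystyle t=U}$ and ${\displaystyle Y_{t}=0}$ otherwise. For each fixed ${\displaystyle t\in T}$,

${\displaystyle P(X_{t}=Y_{t})=P(U\neq t)=1,}$

so ${\displaystyle Y}$ is a modification of ${\displaystyle X}$. However,

${\displaystyle P(X_{t}=Y_{t}{\text{ for all }}t\in T)=P(U\notin [0,1])=0,}$

so the two processes are not indistinguishable. The sample paths of ${\displaystyle Y}$ have a single jump at the random time ${\displaystyle U}$, so the continuity assumption in the statement above is not met.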

### Separability

Separability is a property of a stochastic process based on its index set in relation to the probability measure. The property is assumed so that functionals of stochastic processes or random fields with uncountable index sets can form random variables. For a stochastic process to be separable, in addition to other conditions, its index set must be a separable space,[h] which means that the index set has a dense countable subset.[263][280]

More precisely, a real-valued continuous-time stochastic process ${\displaystyle X}$ with a probability space ${\displaystyle (\Omega ,{\cal {F}},P)}$ is separable if its index set ${\displaystyle T}$ has a dense countable subset ${\displaystyle U\subset T}$ and there is a set ${\displaystyle \Omega _{0}\subset \Omega }$ of probability zero, so ${\displaystyle P(\Omega _{0})=0}$, such that for every open set ${\displaystyle G\subset T}$ and every closed set ${\displaystyle F\subset \textstyle R=(-\infty ,\infty )}$, the two events ${\displaystyle \{X_{t}\in F{\text{ for all }}t\in G\cap U\}}$ and ${\displaystyle \{X_{t}\in F{\text{ for all }}t\in G\}}$ differ from each other at most on a subset of ${\displaystyle \Omega _{0}}$.[281][282][283] The definition of separability[i] can also be stated for other index sets and state spaces,[286] such as in the case of random fields, where the index set as well as the state space can be ${\displaystyle n}$-dimensional Euclidean space.[30][263]

The concept of separability of a stochastic process was introduced by Joseph Doob,[280] where the underlying idea is to make a countable set of points of the index set determine the properties of the stochastic process.[284] Any stochastic process with a countable index set already meets the separability conditions, so discrete-time stochastic processes are always separable.[287] A theorem by Doob, sometimes known as Doob's separability theorem, says that any real-valued continuous-time stochastic process has a separable modification.[280][282][288] Versions of this theorem also exist for more general stochastic processes with index sets and state spaces other than the real line.[117]

### Skorokhod space

A Skorokhod space, also written as Skorohod space, is a mathematical space of all the functions that are right-continuous with left limits, defined on some interval of the real line such as ${\displaystyle [0,1]}$ or ${\displaystyle [0,\infty )}$, and take values on the real line or on some metric space.[289][290][291] Such functions are known as càdlàg or cadlag functions, based on the acronym of the French expression continue à droite, limite à gauche, due to the functions being right-continuous with left limits.[289][292] A Skorokhod function space, introduced by Anatoliy Skorokhod,[291] is often denoted with the letter ${\displaystyle D}$,[289][290][291][292] so the function space is also referred to as space ${\displaystyle D}$.[289][293][294] The notation of this function space can also include the interval on which all the càdlàg functions are defined, so, for example, ${\displaystyle D[0,1]}$ denotes the space of càdlàg functions defined on the unit interval ${\displaystyle [0,1]}$.[292][294][295]

Skorokhod function spaces are frequently used in the theory of stochastic processes because it is often assumed that the sample functions of continuous-time stochastic processes belong to a Skorokhod space.[291][293] Such spaces contain continuous functions, which correspond to sample functions of the Wiener process. But the space also has functions with discontinuities, which means that the sample functions of stochastic processes with jumps, such as the Poisson process (on the real line), are also members of this space.[294][296]
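The càdlàg property of a Poisson sample function can be seen in a short sketch. The code below simulates jump times with exponential gaps (the rate, time horizon, seed and function names are assumptions for illustration) and evaluates the counting path both at a time ${\displaystyle t}$ and just before it. At a jump time the value already includes the jump, which is right-continuity, while the left limit does not.

```python
import bisect
import random

def poisson_jump_times(rate, horizon, rng):
    """Jump times of a homogeneous Poisson process on [0, horizon], in increasing order."""
    times, t = [], rng.expovariate(rate)
    while t <= horizon:
        times.append(t)
        t += rng.expovariate(rate)  # independent exponential waiting times
    return times

def cadlag_value(jump_times, t):
    """N(t): number of jumps in [0, t]; this choice makes the path right-continuous."""
    return bisect.bisect_right(jump_times, t)

def left_limit(jump_times, t):
    """N(t-): number of jumps in [0, t), the limit of the path from the left at t."""
    return bisect.bisect_left(jump_times, t)

rng = random.Random(1)
jumps = poisson_jump_times(rate=2.0, horizon=5.0, rng=rng)
```

For each jump time `s` in `jumps`, `cadlag_value(jumps, s)` exceeds `left_limit(jumps, s)` by exactly one, and the two functions agree at every other time.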

### Regularity

In the context of mathematical construction of stochastic processes, the term regularity is used when discussing and assuming certain conditions for a stochastic process to resolve possible construction issues.[297][298] For example, to study stochastic processes with uncountable index sets, it is assumed that the stochastic process adheres to some type of regularity condition such as the sample functions being continuous.[299][300]

## Mathematical construction

As with other mathematical objects, a stochastic process must be constructed in order to prove that it exists mathematically.[57] There are two main approaches for constructing a stochastic process. One approach involves considering a measurable space of functions, defining a suitable measurable mapping from a probability space to this measurable space of functions, and then deriving the corresponding finite-dimensional distributions.[301]

Another approach involves defining a collection of random variables to have specific finite-dimensional distributions, and then using Kolmogorov's existence theorem[j] to prove a corresponding stochastic process exists.[57][301] This theorem, which is an existence theorem for measures on infinite product spaces,[305] says that if any finite-dimensional distributions satisfy two conditions, known as consistency conditions, then there exists a stochastic process with those finite-dimensional distributions.[57]

### Construction issues

When constructing continuous-time stochastic processes certain mathematical difficulties arise, due to the uncountable index sets, which do not occur with discrete-time processes.[58][59] One problem is that it is possible to have more than one stochastic process with the same finite-dimensional distributions. For example, both the left-continuous modification and the right-continuous modification of a Poisson process have the same finite-dimensional distributions.[306] This means that the distribution of the stochastic process does not, necessarily, specify uniquely the properties of the sample functions of the stochastic process.[301][307]

Another problem is that functionals of a continuous-time process that rely upon an uncountable number of points of the index set may not be measurable, so the probabilities of certain events may not be well-defined.[280] For example, the supremum of a stochastic process or random field is not necessarily a well-defined random variable.[30][59] For a continuous-time stochastic process ${\displaystyle X}$, other characteristics that depend on an uncountable number of points of the index set ${\displaystyle T}$ include:[280]
• a sample function of a stochastic process ${\displaystyle X}$ is a continuous function of ${\displaystyle t\in T}$;
• a sample function of a stochastic process ${\displaystyle X}$ is a bounded function of ${\displaystyle t\in T}$; and
• a sample function of a stochastic process ${\displaystyle X}$ is an increasing function of ${\displaystyle t\in T}$.
To overcome these two difficulties, different assumptions and approaches are possible.[116]

### Resolving construction issues

One approach for avoiding mathematical construction issues of stochastic processes, proposed by Joseph Doob, is to assume that the stochastic process is separable.[308] Separability ensures that infinite-dimensional distributions determine the properties of sample functions by requiring that sample functions are essentially determined by their values on a dense countable set of points in the index set.[309] Furthermore, if a stochastic process is separable, then functionals of an uncountable number of points of the index set are measurable and their probabilities can be studied.[280][309]

Another approach is possible, originally developed by Anatoliy Skorokhod and Andrei Kolmogorov,[310] for a continuous-time stochastic process with any metric space as its state space. For the construction of such a stochastic process, it is assumed that the sample functions of the stochastic process belong to some suitable function space, which is usually the Skorokhod space consisting of all right-continuous functions with left limits. This approach is now more used than the separability assumption,[116][198] but such a stochastic process based on this approach will be automatically separable.[311]

Although less used, the separability assumption is considered more general because every stochastic process has a separable version.[198] It is also used when it is not possible to construct a stochastic process in a Skorokhod space.[285] For example, separability is assumed when constructing and studying random fields, where the collection of random variables is now indexed by sets other than the real line such as ${\displaystyle n}$-dimensional Euclidean space.[30][312]