Distributional data analysis is a branch of nonparametric statistics that is related to functional data analysis. It is concerned with random objects that are probability distributions, i.e., the statistical analysis of samples of random distributions where each atom of a sample is a distribution. One of the main challenges in distributional data analysis is that although the space of probability distributions is a convex space, it is not a vector space.
Let $\mathcal{F}$ be a space of distributions and let $d$ be a metric on $\mathcal{F}$ so that $(\mathcal{F}, d)$ forms a metric space. There are various metrics available for $d$.[1]
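For example, suppose $\nu_1, \nu_2 \in \mathcal{F}$, and let $f_1$ and $f_2$ be the density functions of $\nu_1$ and $\nu_2$, respectively, on a common domain $D$. The Fisher-Rao metric is defined as
$$d_{FR}(f_1, f_2) = \arccos\left(\int_D \sqrt{f_1(x) f_2(x)}\,dx\right).$$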
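For univariate distributions, let $Q_1$ and $Q_2$ be the quantile functions of $\nu_1$ and $\nu_2$. Denote the $L^p$-Wasserstein space as $\mathcal{W}_p$, which is the space of distributions with finite $p$-th moments. Then, for $\nu_1, \nu_2 \in \mathcal{W}_p$, the $L^p$-Wasserstein metric is defined as
$$d_{W_p}(\nu_1, \nu_2) = \left(\int_0^1 |Q_1(s) - Q_2(s)|^p\,ds\right)^{1/p}.$$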
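Both metrics are easy to approximate on a grid: the Fisher-Rao distance is the arccosine of $\int \sqrt{f_1 f_2}$, and the univariate Wasserstein distance is an $L^p$ distance between quantile functions. The following is a minimal sketch, assuming densities and quantile functions are already evaluated on grids; the function names `fisher_rao` and `wasserstein_p` are illustrative and not taken from any library.

```python
import numpy as np

def trapezoid(y, x):
    """Trapezoidal rule, written out so it does not depend on the NumPy version."""
    y, x = np.asarray(y, float), np.asarray(x, float)
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def fisher_rao(f1, f2, x):
    """Fisher-Rao distance between two densities evaluated on the grid x."""
    bc = trapezoid(np.sqrt(np.asarray(f1) * np.asarray(f2)), x)
    # Clip to guard against rounding slightly outside [-1, 1].
    return float(np.arccos(np.clip(bc, -1.0, 1.0)))

def wasserstein_p(q1, q2, s, p=2):
    """L^p-Wasserstein distance from quantile functions on a grid s in [0, 1]."""
    diff = np.abs(np.asarray(q1, float) - np.asarray(q2, float))
    return trapezoid(diff ** p, s) ** (1.0 / p)

# Uniform(0, 1) and Uniform(1, 2) have quantile functions s and 1 + s.
s = np.linspace(0.0, 1.0, 201)
print(wasserstein_p(s, 1.0 + s, s, p=2))
```

Working with quantile functions avoids solving an optimal transport problem, which is why the univariate case is so convenient in practice.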
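For a probability measure $\nu \in \mathcal{F}$, consider a random process $\mathfrak{F}$ such that $\nu \sim \mathfrak{F}$. One way to define the mean and variance of $\nu$ is to introduce the Fréchet mean and the Fréchet variance. With respect to the metric $d$ on $\mathcal{F}$, the Fréchet mean $\mu_\oplus$, also known as the barycenter, and the Fréchet variance $V_\oplus$ are defined as[2]
$$\mu_\oplus = \operatorname{argmin}_{\mu \in \mathcal{F}} E[d^2(\nu, \mu)], \qquad V_\oplus = E[d^2(\nu, \mu_\oplus)].$$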
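A widely used example is the Wasserstein-Fréchet mean, or simply the Wasserstein mean, which is the Fréchet mean with the $L^2$-Wasserstein metric $d_{W_2}$.[3] For $\nu, \mu \in \mathcal{W}_2$, let $Q_\nu$ and $Q_\mu$ be the quantile functions of $\nu$ and $\mu$, respectively. The Wasserstein mean and Wasserstein variance are defined as
$$\mu^*_\oplus = \operatorname{argmin}_{\mu \in \mathcal{W}_2} E\left[\int_0^1 (Q_\nu(s) - Q_\mu(s))^2\,ds\right], \qquad V^*_\oplus = E\left[\int_0^1 (Q_\nu(s) - Q_{\mu^*_\oplus}(s))^2\,ds\right].$$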
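For univariate distributions the Wasserstein mean has a simple form: the pointwise average of quantile functions is again a quantile function, and it minimizes the expected squared $L^2$ distance, so it is the quantile function of the Wasserstein mean. The sketch below computes sample versions under that fact, assuming each observed distribution is given by its quantile function on a shared grid; `wasserstein_mean_var` is an illustrative name.

```python
import numpy as np

def wasserstein_mean_var(Q, s):
    """Sample Wasserstein mean and variance for univariate distributions.

    Q : (n, m) array, row i is the quantile function of the i-th
        distribution evaluated on the grid s (length m, within [0, 1]).
    Returns the quantile function of the sample Wasserstein mean and the
    sample Wasserstein variance.
    """
    Q = np.asarray(Q, float)
    s = np.asarray(s, float)
    q_mean = Q.mean(axis=0)                      # pointwise average of quantiles
    sq = (Q - q_mean) ** 2
    # Trapezoidal rule along the grid gives each squared L2 distance.
    d2 = np.sum((sq[:, 1:] + sq[:, :-1]) * np.diff(s), axis=1) / 2.0
    return q_mean, float(d2.mean())

# Two location-shifted uniform distributions: quantiles s and 2 + s.
s = np.linspace(0.0, 1.0, 101)
q_mean, var = wasserstein_mean_var(np.vstack([s, 2.0 + s]), s)
```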
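Functional principal component analysis (FPCA) can be directly applied to the probability density functions.[4] Consider a distribution process $\nu \sim \mathfrak{F}$ and let $f$ be the density function of $\nu$. Denote the mean density function by $\mu(t) = E[f(t)]$ and the covariance function by $G(s, t) = \operatorname{Cov}(f(s), f(t))$, with orthonormal eigenfunctions $\{\phi_j\}_{j=1}^\infty$ and eigenvalues $\{\lambda_j\}_{j=1}^\infty$.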
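By the Karhunen-Loève theorem, $f(t) = \mu(t) + \sum_{j=1}^\infty \xi_j \phi_j(t)$, where the principal components are $\xi_j = \int_D [f(t) - \mu(t)] \phi_j(t)\,dt$. The $j$th mode of variation is defined as
$$g_j(t, \alpha) = \mu(t) + \alpha \sqrt{\lambda_j}\,\phi_j(t), \quad t \in D,\ \alpha \in [-A, A],$$
with some constant $A$, such as 2 or 3.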
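A discretized version of this construction can be sketched as follows: estimate the mean and covariance from densities on a grid, eigendecompose the covariance as an integral operator, and form the mode $\mu + \alpha\sqrt{\lambda_j}\phi_j$. This is an illustration under the assumption of an equally spaced grid and a Riemann-sum inner product; `fpca_modes` and `mode_of_variation` are hypothetical names.

```python
import numpy as np

def fpca_modes(F, t, n_comp=2):
    """FPCA of densities sampled on an equally spaced grid.

    F : (n, m) array, row i is the i-th density on the grid t.
    Returns the mean density, the leading eigenvalues, and eigenfunctions
    normalized so that sum(phi**2) * dt = 1.
    """
    F = np.asarray(F, float)
    dt = t[1] - t[0]
    mu = F.mean(axis=0)
    C = F - mu
    G = C.T @ C / F.shape[0]                  # covariance surface G(s, t)
    evals, evecs = np.linalg.eigh(G * dt)     # discretized integral operator
    order = np.argsort(evals)[::-1][:n_comp]  # largest eigenvalues first
    lam = np.clip(evals[order], 0.0, None)
    phi = evecs[:, order].T / np.sqrt(dt)
    return mu, lam, phi

def mode_of_variation(mu, lam_j, phi_j, alpha):
    """j-th mode of variation at level alpha (for example alpha in [-2, 2])."""
    return mu + alpha * np.sqrt(lam_j) * phi_j
```

Modes obtained this way need not be nonnegative, so they are not guaranteed to be densities, which is one motivation for the transformation and geometric approaches.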
Assume the probability density functions $f$ exist, and let $\mathcal{F}$ be the space of density functions.
Transformation approaches introduce a continuous and invertible transformation $\Psi: \mathcal{F} \to \mathbb{H}$, where $\mathbb{H}$ is a Hilbert space of functions. For instance, the log quantile density transformation and the centered log-ratio transformation are popular choices.[5][6]
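For $f \in \mathcal{F}$, let $Y = \Psi(f)$ be the transformed functional variable. The mean function $\mu_Y(t) = E[Y(t)]$ and the covariance function $G_Y(s, t) = \operatorname{Cov}(Y(s), Y(t))$ are defined accordingly, and let $\{\lambda_j, \phi_j\}_{j=1}^\infty$ be the eigenpairs of $G_Y$. The Karhunen-Loève decomposition gives $Y(t) = \mu_Y(t) + \sum_{j=1}^\infty \xi_j \phi_j(t)$, where $\xi_j = \int_D [Y(t) - \mu_Y(t)] \phi_j(t)\,dt$. Then, the $j$th transformation mode of variation is defined as[7]
$$g_j^{TF}(t, \alpha) = \Psi^{-1}\left(\mu_Y + \alpha \sqrt{\lambda_j}\,\phi_j\right)(t), \quad t \in D,\ \alpha \in [-A, A],$$
for some constant $A$.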
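As one concrete choice of transformation, the centered log-ratio maps a strictly positive density $f$ to $\log f$ minus the average of $\log f$ over the domain, and its inverse exponentiates and renormalizes. The sketch below assumes strictly positive densities on a bounded grid; the names `clr` and `clr_inverse` are illustrative.

```python
import numpy as np

def integrate(y, t):
    """Trapezoidal rule on the grid t."""
    y, t = np.asarray(y, float), np.asarray(t, float)
    return float(np.sum((y[1:] + y[:-1]) * np.diff(t)) / 2.0)

def clr(f, t):
    """Centered log-ratio transform of a strictly positive density f on t."""
    logf = np.log(np.asarray(f, float))
    return logf - integrate(logf, t) / (t[-1] - t[0])

def clr_inverse(y, t):
    """Map a function back to a density: exponentiate, then renormalize."""
    g = np.exp(np.asarray(y, float))
    return g / integrate(g, t)

# Round trip on a positive density over [0, 1].
t = np.linspace(0.0, 1.0, 201)
g = 1.0 + 0.5 * np.sin(2 * np.pi * t)
f = g / integrate(g, t)
f_back = clr_inverse(clr(f, t), t)
```

Any function in the transformed space maps back to a valid density, so modes of variation computed there stay inside the density space after applying the inverse.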
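Endowed with metrics such as the Wasserstein metric $d_{W_2}$ or the Fisher-Rao metric $d_{FR}$, we can employ the (pseudo) Riemannian structure of $\mathcal{F}$. Denote the tangent space at the Fréchet mean $\mu_\oplus$ as $T_{\mu_\oplus}$, and define the logarithm and exponential maps $\log_{\mu_\oplus}: \mathcal{F} \to T_{\mu_\oplus}$ and $\exp_{\mu_\oplus}: T_{\mu_\oplus} \to \mathcal{F}$.
Let $Y$ be the projected density onto the tangent space, $Y = \log_{\mu_\oplus}(f)$.
In Log FPCA, FPCA is performed on $Y$ and the result is then projected back to $\mathcal{F}$ using the exponential map.[8] Therefore, with $Y(t) = \mu_Y(t) + \sum_{j=1}^\infty \xi_j \phi_j(t)$, the $j$th Log FPCA mode of variation is defined as
$$g_j^{Log}(t, \alpha) = \exp_{\mu_\oplus}\left(\mu_Y + \alpha \sqrt{\lambda_j}\,\phi_j\right)(t), \quad t \in D,\ \alpha \in [-A, A],$$
for some constant $A$.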
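As a special case, consider the $L^2$-Wasserstein space $\mathcal{W}_2$, a random distribution $\nu \in \mathcal{W}_2$, and a subset $G \subset \mathcal{W}_2$. Let $d_{W_2}(\nu, G) = \inf_{\mu \in G} d_{W_2}(\nu, \mu)$ and $K_{W_2}(G) = E[d_{W_2}^2(\nu, G)]$. Let $\mathrm{CL}(\mathcal{W}_2)$ be the metric space of nonempty, closed subsets of $\mathcal{W}_2$, endowed with the Hausdorff distance, and define
$$\mathrm{CG}_{\nu_0, k}(\mathcal{W}_2) = \{G \in \mathrm{CL}(\mathcal{W}_2) : \nu_0 \in G,\ G \text{ is a geodesic set with } \dim(G) \le k\}, \quad k \ge 1.$$
Let the reference measure $\nu_0$ be the Wasserstein mean $\mu^*_\oplus$.
Then, a principal geodesic subspace (PGS) of dimension $k$ with respect to $\mu^*_\oplus$ is a set $G_k = \operatorname{argmin}_{G \in \mathrm{CG}_{\mu^*_\oplus, k}(\mathcal{W}_2)} K_{W_2}(G)$.[9][10]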
Note that the tangent space $T_{\mu^*_\oplus}$ is a subspace of $L^2_{\mu^*_\oplus}$, the Hilbert space of $\mu^*_\oplus$-square-integrable functions. Obtaining the PGS is equivalent to performing PCA in $L^2_{\mu^*_\oplus}$ under the constraint that the components lie in a convex and closed subset.[10] Therefore, a simple approximation of Wasserstein geodesic PCA is Log FPCA, which relaxes the geodesicity constraint; alternative techniques have also been suggested.[9][10]
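For univariate distributions, Log FPCA in the Wasserstein space reduces to ordinary PCA on quantile functions: in quantile coordinates the logarithm map at the Wasserstein mean is $Q_\nu - Q_{\mu^*_\oplus}$, and the exponential map adds a tangent vector back to $Q_{\mu^*_\oplus}$. The sketch below illustrates this, assuming quantile functions on an equally spaced grid; the function names are hypothetical. The returned flag shows where the relaxed geodesicity constraint matters: a mode is a valid quantile function only if it is nondecreasing.

```python
import numpy as np

def wasserstein_log_pca(Q, s, n_comp=1):
    """Log FPCA for univariate distributions via quantile functions.

    Q : (n, m) array of quantile functions on an equally spaced grid s.
    """
    Q = np.asarray(Q, float)
    ds = s[1] - s[0]
    q_mean = Q.mean(axis=0)                   # Wasserstein mean (quantile form)
    V = Q - q_mean                            # log maps in quantile coordinates
    G = V.T @ V / Q.shape[0]
    evals, evecs = np.linalg.eigh(G * ds)
    order = np.argsort(evals)[::-1][:n_comp]
    lam = np.clip(evals[order], 0.0, None)
    phi = evecs[:, order].T / np.sqrt(ds)     # sum(phi**2) * ds = 1
    return q_mean, lam, phi

def log_pca_mode(q_mean, lam_j, phi_j, alpha):
    """Exponential map of alpha * sqrt(lam_j) * phi_j, plus a validity flag."""
    q = q_mean + alpha * np.sqrt(lam_j) * phi_j
    is_quantile = bool(np.all(np.diff(q) >= 0))   # False: left the geodesic set
    return q, is_quantile
```

When the flag is false for the values of $\alpha$ of interest, the constrained geodesic PCA formulation or a smaller range of $\alpha$ is needed.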
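Fréchet regression is a generalization of regression with responses taking values in a metric space and Euclidean predictors.[11][12] Using the Wasserstein metric $d_{W_2}$, Fréchet regression models can be applied to distributional objects. For a predictor $X \in \mathbb{R}^p$ and a distributional response $\nu \in \mathcal{W}_2$, the global Wasserstein-Fréchet regression model is defined as
$$m_\oplus(x) = \operatorname{argmin}_{\omega \in \mathcal{W}_2} E\left[s_G(X, x)\, d_{W_2}^2(\nu, \omega)\right], \quad s_G(X, x) = 1 + (X - E(X))^T \operatorname{Var}(X)^{-1} (x - E(X)), \qquad (1)$$
which generalizes the standard linear regression.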
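For the local Wasserstein-Fréchet regression, consider a scalar predictor $X \in \mathbb{R}$ and introduce a smoothing kernel $K_h(\cdot) = h^{-1} K(\cdot / h)$. The local Fréchet regression model, which generalizes the local linear regression model, is defined as
$$l_\oplus(x) = \operatorname{argmin}_{\omega \in \mathcal{W}_2} E\left[s_L(X, x, h)\, d_{W_2}^2(\nu, \omega)\right],$$
where $s_L(X, x, h) = \sigma_0^{-2} K_h(X - x)[\mu_2 - \mu_1 (X - x)]$, $\mu_j = E[K_h(X - x)(X - x)^j]$ for $j = 0, 1, 2$, and $\sigma_0^2 = \mu_0 \mu_2 - \mu_1^2$.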
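For univariate responses, the global Wasserstein-Fréchet fit can be computed in closed form from quantile functions. With sample weights $s_i(x) = 1 + (X_i - \bar X)^T \hat\Sigma^{-1} (x - \bar X)$, which average to one, the weighted squared-distance criterion is minimized by projecting the weighted average of the $Q_i$ onto nondecreasing functions. The sketch below is one way to implement this under the assumption of an equally spaced quantile grid; the pool-adjacent-violators projection and both function names are illustrative choices, not a prescribed implementation.

```python
import numpy as np

def project_monotone(y):
    """L2 projection onto nondecreasing sequences (pool adjacent violators)."""
    vals, wts = [], []
    for v in y:
        vals.append(float(v))
        wts.append(1)
        # Merge blocks while the last two violate monotonicity.
        while len(vals) > 1 and vals[-2] > vals[-1]:
            w = wts[-2] + wts[-1]
            merged = (vals[-2] * wts[-2] + vals[-1] * wts[-1]) / w
            vals[-2:] = [merged]
            wts[-2:] = [w]
    return np.repeat(vals, wts)

def global_wasserstein_fit(X, Q, x):
    """Fitted quantile function of the global Wasserstein-Frechet regression at x.

    X : (n,) or (n, p) predictors; Q : (n, m) quantile functions; x : scalar or (p,).
    """
    X = np.asarray(X, float)
    if X.ndim == 1:
        X = X[:, None]
    Q = np.asarray(Q, float)
    x = np.atleast_1d(np.asarray(x, float))
    xbar = X.mean(axis=0)
    Xc = X - xbar
    Sigma = Xc.T @ Xc / X.shape[0]
    w = 1.0 + Xc @ np.linalg.solve(Sigma, x - xbar)   # empirical weights s_i(x)
    q = (w[:, None] * Q).mean(axis=0)                 # weighted average of quantiles
    return project_monotone(q)                        # enforce a valid quantile function
```

The weights can be negative when $x$ is far from $\bar X$, which is why the monotone projection step is needed; inside the range of the data it often leaves the weighted average unchanged.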
Consider response variables that are probability distributions. With the space of density functions $\mathcal{F}$ and a Hilbert space of functions $\mathbb{H}$, consider continuous and invertible transformations $\Psi: \mathcal{F} \to \mathbb{H}$. Examples of transformations include the log hazard transformation, the log quantile density transformation, and the centered log-ratio transformation. Linear methods such as functional linear models are applied to the transformed variables. The fitted models are then mapped back to the original density space using the inverse transformation.[12]
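In Wasserstein regression, both predictors and responses can be distributional objects. Let $\nu_1$ denote the predictor and $\nu_2$ the response, and let $\mu^*_{1\oplus}$ and $\mu^*_{2\oplus}$ be the Wasserstein means of $\nu_1$ and $\nu_2$, respectively. The Wasserstein regression model is defined as
$$E\left(\log_{\mu^*_{2\oplus}} \nu_2 \mid \log_{\mu^*_{1\oplus}} \nu_1\right) = \Gamma\left(\log_{\mu^*_{1\oplus}} \nu_1\right),$$
with a linear regression operator
$$\Gamma g(t) = \langle \beta(\cdot, t), g \rangle_{\mu^*_{1\oplus}}, \quad t \in D_2,\ g \in T_{\mu^*_{1\oplus}},\ \beta: D_1 \times D_2 \to \mathbb{R}.$$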
Estimation of the regression operator is based on empirical estimators obtained from samples.[13]
Also, the Fisher-Rao metric can be used in a similar fashion.[12][14]
A Wasserstein $F$-test has been proposed to test for the effects of the predictors in the Fréchet regression framework with the Wasserstein metric.[15] Consider Euclidean predictors $X \in \mathbb{R}^p$ and distributional responses $\nu \in \mathcal{W}_2$. Denote the Wasserstein mean of $\nu$ as $\mu^*_\oplus$, and the sample Wasserstein mean as $\hat\mu^*_\oplus$. Consider the global Wasserstein-Fréchet regression model $m_\oplus(x)$ defined in (1), which is the conditional Wasserstein mean given $X = x$. The estimator of $m_\oplus(x)$, $\hat m_\oplus(x)$, is obtained by minimizing the empirical version of the criterion.
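Let $F$, $Q$, $f$, $F^*_\oplus$, $Q^*_\oplus$, $f^*_\oplus$, $F_\oplus(x)$, $Q_\oplus(x)$, and $f_\oplus(x)$ denote the cumulative distribution, quantile, and density functions of $\nu$, $\mu^*_\oplus$, and $m_\oplus(x)$, respectively. For a pair $(X, \nu)$, let $T = Q \circ F_\oplus(X)$ be the optimal transport map from $m_\oplus(X)$ to $\nu$.
Also, define $S = Q_\oplus(X) \circ F^*_\oplus$, the optimal transport map from $\mu^*_\oplus$ to $m_\oplus(X)$. Finally, define the covariance kernel $K(u, v) = E[\operatorname{Cov}((T \circ S)(u), (T \circ S)(v))]$ and, by the Mercer decomposition, $K(u, v) = \sum_{j=1}^\infty \lambda_j \phi_j(u) \phi_j(v)$.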
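If there are no regression effects, the conditional Wasserstein mean would equal the Wasserstein mean. That is, the hypotheses for the test of no effects are
$$H_0: m_\oplus(x) = \mu^*_\oplus \text{ for all } x \quad \text{vs.} \quad H_1: \text{not } H_0.$$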
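To test these hypotheses with a sample $(X_i, \nu_i)$, $i = 1, \dots, n$, the proposed global Wasserstein $F$-statistic and its asymptotic distribution are
$$F_G = \sum_{i=1}^n d_{W_2}^2\left(\hat m_\oplus(X_i), \hat\mu^*_\oplus\right), \qquad F_G \mid X_1, \dots, X_n \sim \sum_{j=1}^\infty \lambda_j V_j,$$
where $V_j \overset{iid}{\sim} \chi^2_p$, $p$ is the dimension of the predictor, and $\lambda_j$ are the eigenvalues of the covariance kernel.[15] An extension to hypothesis testing for partial regression effects, and alternative testing approximations using Satterthwaite's approximation or a bootstrap approach, have also been proposed.[15]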
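The Hilbert sphere $\mathcal{S}^\infty$ is defined as $\mathcal{S}^\infty = \{f \in \mathbb{H} : \|f\|_{\mathbb{H}} = 1\}$, where $\mathbb{H}$ is a separable infinite-dimensional Hilbert space with inner product $\langle \cdot, \cdot \rangle_{\mathbb{H}}$ and norm $\|\cdot\|_{\mathbb{H}}$. Consider the space of square root densities $\mathcal{X} = \{x: D \to \mathbb{R} : x = \sqrt{f},\ \int_D f(t)\,dt = 1\}$. Then with the Fisher-Rao metric $d_{FR}$ on $\mathcal{F}$, $\mathcal{X}$ is the positive orthant of the Hilbert sphere $\mathcal{S}^\infty$ with $\mathbb{H} = L^2(D)$.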
Let a chart $\tau: U \subset \mathcal{S}^\infty \to \mathbb{G}$ be a smooth homeomorphism that maps $U$ onto an open subset $\tau(U)$ of a separable Hilbert space $\mathbb{G}$ of coordinates. For example, $\tau$ can be the logarithm map.[14]
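On a unit sphere the logarithm and exponential maps have standard closed forms: with $\theta = \arccos\langle \mu, x \rangle$, $\log_\mu(x) = \frac{\theta}{\sin\theta}(x - \cos\theta\,\mu)$ and $\exp_\mu(v) = \cos(\|v\|)\mu + \sin(\|v\|)\, v / \|v\|$. The sketch below applies them to square root densities on a grid, assuming a trapezoidal $L^2$ inner product; the function names are illustrative.

```python
import numpy as np

def inner(u, v, t):
    """Trapezoidal approximation of the L2 inner product on the grid t."""
    y = np.asarray(u, float) * np.asarray(v, float)
    return float(np.sum((y[1:] + y[:-1]) * np.diff(t)) / 2.0)

def sphere_log(mu, x, t):
    """Logarithm map at mu: tangent vector pointing from mu toward x."""
    c = np.clip(inner(mu, x, t), -1.0, 1.0)
    theta = np.arccos(c)                    # geodesic (Fisher-Rao) distance
    if theta < 1e-12:
        return np.zeros_like(x)
    return theta / np.sin(theta) * (x - c * mu)

def sphere_exp(mu, v, t):
    """Exponential map at mu: follow the geodesic with initial velocity v."""
    r = np.sqrt(inner(v, v, t))
    if r < 1e-12:
        return mu.copy()
    return np.cos(r) * mu + np.sin(r) * v / r

# Square root densities of the uniform density and a perturbed density on [0, 1].
t = np.linspace(0.0, 1.0, 201)
mu = np.ones_like(t)
f = 1.0 + 0.5 * np.cos(2 * np.pi * t)
x = np.sqrt(f / inner(np.sqrt(f), np.sqrt(f), t))
v = sphere_log(mu, x, t)
```

The norm of the tangent vector equals the geodesic distance between the two square root densities, so linear statistics computed on tangent vectors retain a direct geometric interpretation.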
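Consider a random element $x = \sqrt{f} \in \mathcal{X}$ equipped with the Fisher-Rao metric, and write its Fréchet mean as $\mu$. Denote the empirical estimator of $\mu$ based on $n$ samples by $\hat\mu$. Then a central limit theorem holds for $\mu_\tau = \tau(\mu)$ and $\hat\mu_\tau = \tau(\hat\mu)$: $\sqrt{n}(\hat\mu_\tau - \mu_\tau) \overset{L}{\to} Z$,
where $Z$ is a Gaussian random element in $\mathbb{G}$ with mean 0 and covariance operator $\mathcal{T}$. Denote the eigenvalue-eigenfunction pairs of $\mathcal{T}$ and of the estimated covariance operator $\hat{\mathcal{T}}$ by $(\lambda_k, \phi_k)_{k=1}^\infty$ and $(\hat\lambda_k, \hat\phi_k)_{k=1}^\infty$, respectively.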
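Consider one-sample hypothesis testing
$$H_0: \mu = \mu_0 \quad \text{vs.} \quad H_1: \mu \neq \mu_0,$$
with $\mu_0 \in \mathcal{S}^\infty$. Denote $\|\cdot\|_{\mathbb{G}}$ and $\langle \cdot, \cdot \rangle_{\mathbb{G}}$ as the norm and inner product in $\mathbb{G}$. The test statistics and their limiting distributions are
$$T_1 = n \|\tau(\hat\mu) - \tau(\mu_0)\|_{\mathbb{G}}^2 \overset{L}{\to} \sum_{k=1}^\infty \lambda_k W_k, \qquad S_1 = n \sum_{k=1}^K \frac{\langle \tau(\hat\mu) - \tau(\mu_0), \hat\phi_k \rangle_{\mathbb{G}}^2}{\hat\lambda_k} \overset{L}{\to} \chi^2_K,$$
where $W_k \overset{iid}{\sim} \chi^2_1$. In practice, the test can be carried out by approximating the limiting distributions with Monte Carlo simulation or by a bootstrap test. Extensions to the two-sample test and the paired test have also been proposed.[14]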
Autoregressive (AR) models for distributional time series are constructed by defining stationarity and utilizing the notion of difference between distributions using $d_{W_2}$ and $d_{FR}$.
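In the Wasserstein autoregressive (WAR) model, consider a stationary density time series $f_t$ with Wasserstein mean $f_\oplus$.[16] Denote the difference between $f_t$ and $f_\oplus$ using the logarithm map, $f_t \ominus f_\oplus = \log_{f_\oplus} f_t = T_t - \mathrm{id}$, where $T_t = Q_t \circ F_\oplus$ is the optimal transport map from $f_\oplus$ to $f_t$, in which $F_t$ and $F_\oplus$ are the cdfs of $f_t$ and $f_\oplus$ and $Q_t = F_t^{-1}$. An $AR(1)$ model on the tangent space $T_{f_\oplus}$ is defined as $V_t = \beta V_{t-1} + \epsilon_t$, $t \in \mathbb{Z}$, for $V_t \in T_{f_\oplus}$ with the autoregressive parameter $\beta \in \mathbb{R}$ and mean zero random i.i.d. innovations $\epsilon_t$. Under proper conditions, $\mu_t = \exp_{f_\oplus}(V_t)$ with densities $f_t$ and $V_t = \log_{f_\oplus}(\mu_t)$. Accordingly, $WAR(1)$, with a natural extension to order $p$, is defined as
$$T_t - \mathrm{id} = \beta (T_{t-1} - \mathrm{id}) + \epsilon_t.$$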
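On the other hand, the spherical autoregressive model (SAR) considers the Fisher-Rao metric.[17] Following the setting of the tests for the intrinsic mean described above, let $x_t \in \mathcal{X}$ with Fréchet mean $\mu_x$. Let $\theta = \arccos(\langle x_t, \mu_x \rangle)$, which is the geodesic distance between $x_t$ and $\mu_x$. Define a rotation operator $Q_{x_t, \mu_x}$ that rotates $x_t$ to $\mu_x$. The spherical difference between $x_t$ and $\mu_x$ is represented as $R_t = x_t \ominus \mu_x = \theta Q_{x_t, \mu_x}$. Assume that $R_t$ is a stationary sequence with the Fréchet mean $\mu_R$; then $SAR(1)$ is defined as
$$R_t - \mu_R = \beta (R_{t-1} - \mu_R) + \epsilon_t,$$
where $\mu_R = E R_t$ and $\epsilon_t$ are mean zero random i.i.d. innovations. An alternative model, the difference-based spherical autoregressive (DSAR) model, is defined with $R_t = x_{t+1} \ominus x_t$, with natural extensions to order $p$. A similar extension to the Wasserstein space was introduced.[18]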
Kneip, A.; Utikal, K.J. (2001). "Inference for density families using functional principal component analysis". Journal of the American Statistical Association. 96 (454): 519–532. doi:10.1198/016214501753168235. S2CID 123524014.
van den Boogaart, K.G.; Egozcue, J.J.; Pawlowsky-Glahn, V. (2014). "Bayes Hilbert spaces". Australian and New Zealand Journal of Statistics. 56 (2): 171–194. doi:10.1111/anzs.12074. S2CID 120612578.
Fletcher, T.F.; Lu, C.; Pizer, S.M.; Joshi, S. (2004). "Principal geodesic analysis for the study of nonlinear statistics of shape". IEEE Transactions on Medical Imaging. 23 (8): 995–1005. doi:10.1109/TMI.2004.831793. PMID 15338733. S2CID 620015.
Cazelles, E.; Seguy, V.; Bigot, J.; Cuturi, M.; Papadakis, N. (2018). "Geodesic PCA versus Log-PCA of histograms in the Wasserstein space". SIAM Journal on Scientific Computing. 40 (2): B429–B456. Bibcode:2018SJSC...40B.429C. doi:10.1137/17M1143459.
Petersen, A.; Zhang, C.; Kokoszka, P. (2022). "Modeling probability density functions as data objects". Econometrics and Statistics. 21: 159–178. doi:10.1016/j.ecosta.2021.04.004. S2CID 236589040.
Petersen, A.; Liu, X.; Divani, A.A. (2021). "Wasserstein F-tests and confidence bands for the Fréchet regression of density response curves". Annals of Statistics. 49 (1): 590–611. arXiv:1910.13418. doi:10.1214/20-AOS1971. S2CID 204950494.
Zhang, C.; Kokoszka, P.; Petersen, A. (2022). "Wasserstein autoregressive models for density time series". Journal of Time Series Analysis. 43 (1): 30–52. arXiv:2006.12640. doi:10.1111/jtsa.12590. S2CID 219980621.