Wasserstein F-tests and confidence bands for the Frechet regression of density response curves
Published:
Abstract
This is the report from my point of view on the paper Wasserstein \(F\)-Tests and Confidence Bands for the Frechet Regression of Density Response Curves (Petersen, Liu, and Divani 2021).
In many applications, density curves appear as functional response objects in a regression model with vector predictors. It is important to understand the so-called “density-predictor” relationship. In this case, the conditional mean is defined as the conditional Fréchet means under a suitable metric.
Petersen, Liu, and Divani (2021) considered the Fréchet regression of density curve responses and developed tests for global and partial effects, as well as the simultaneous confidence bands for estimated conditional mean densities.
Introduction
In this paper, Petersen, Liu, and Divani (2021) study a regression model with density functions as response variables under the Wasserstein geometry, with predictors being Euclidean vectors. They assume a global regression model that does not require any smoothing or other tuning parameter to fit.
Preliminaries
We begin with a brief definition of the Wasserstein metric in the language of optimal transport.
Let \(\mathcal{D}\) be the class of univariate probability density functions \(f\) that satisfy \(\int_{\mathbb{R}} u^2 f(u) \mathrm{d} u < \infty\), i.e. absolutely continuous distributions on with a finite second moment.
For \(f\), \(g \in \mathcal{D}\), consider the collection of maps \(\mathcal{M}_{f,g}\) such that, if \(U \sim f\) and \(M \in \mathcal{M}_{f,g}\), then \(M(U) \sim g\). The squared Wasserstein distance between these two distributions is \[ \begin{equation*} d_{\mathrm{W}}^{2}(f, g) \mathrel{\vcenter{:}}= \inf_{M \in \mathcal{M}_{f,g}} \int_{\mathbb{R}} \left(M(u) - u\right)^2 f(u) \mathrm{d} u. \end{equation*} \]
It is well known that the infimum above is attained by the optimal transport map \(M_{f,g}^{\textrm{opt}} = G^{-1} \circ F\), where \(F\) and \(G\) are the cumulative distribution functions of \(f\) and \(g\), respectively, leading to the closed forms \[ \begin{equation}\label{eq: wass2} d_{\mathrm{W}}^{2}(f, g) = \int_\mathbb{R} \left(M_{f,g}^{\textrm{opt}}(u) - u\right)^2 f(u) \mathrm{d} u = \int_0^1 \left(F^{-1}(t) - G^{-1}(t)\right)^2\mathrm{d} t, \end{equation} \] where the last equality follows by the change of variables \(t = F(u).\)
✋ Note
In this circumstance, the Wasserstein distance can be calculated using the quantile functions of the two distributions.
Random densities
A random density \(\mathfrak{F}\) is a random variable taking its values almost surely in \(\mathcal{D}\).
It will also be useful to refer to the corresponding random cdf \(F\), quantile function \(Q = F^{-1}\) and quantile density \(q = Q^{\prime} = 1/(\mathfrak{F} \circ Q)\) (Parzen 1979).
For clarity:
- \(u,v \in \mathbb{R}\) will consistently be used as density and cdf arguments throughout, whereas
- \(s, t \in[0,1]\) will be used as arguments for the quantile and quantile density functions.
✋ Note
The Wasserstein mean is the minimizer of the expected squared Wasserstein distance between the random density and a fixed density in \(\mathcal{D}\). Apparently, \(f_{\oplus}^{*}\) is still a density function while the \(\mathrm{Var}_\oplus(\mathfrak{F})\) is a scalar.
In the regression setting, we model the distribution of \(\mathfrak{F}\) conditional on a vector \(X \in \mathbb{R}^p\) of predictors, where the pair \((X, \mathfrak{F})\) is distributed according to a probability measure \(\mathcal{G}\) on the product space \(\mathbb{R}^p \times \mathcal{D}.\)
💡 Tip
The density of random \(\mathfrak{F}\) is kind of “density of densities”, the domain is \(\mathcal{D}\).
In this sense, the objects in \eqref{eq: wMeanVar} are the marginal Fr'echet mean and variance of \(\mathfrak{F}.\) Let \(\mathfrak{S}_X\) denote the support of the marginal distribution of \(X.\) Our interest is in the Fr'echet regression function, or function of conditional Fréchet means, \[ \begin{equation}\label{eq:condWMean} f_{\oplus}(x) := \mathop{\mathrm{arg\,min\,}}_{f \in \mathcal{D}} E\left[d_{\mathrm{W}}^{2}(\mathfrak{F}, f) | X = x\right], \quad x \in \mathfrak{S}_X. \end{equation} \] Next, we define the conditional Fréchet variance:
Let
- \(F_{\oplus}(x)\) be the cdf of \(f_{\oplus}(x).\)
- \(Q_{\oplus}(x)\) be quantile of \(f_{\oplus}(x).\)
- \(q_{\oplus}(x)\) be quantile density functions of \(f_{\oplus}(x).\)
We will use the notation \(f_{\oplus}(x,u)\) to denote the value of the conditional mean density \(f_{\oplus}(x)\) at argument \(u \in \mathbb{R},\) and similary for \(F_{\oplus}(x,u),\) \(Q_{\oplus}(x,t)\), and \(q_{\oplus}(x,t),\) \(t \in [0,1].\)
💡 Tip
To help to understand, let \(z^{*}\sim f_{\oplus}(x)\), then \(f_{\oplus}(x,u) = f_{\oplus}(x)|_{z^{*}=u}\).
For a pair \((X,\mathfrak{F})\), define \(T = Q \circ F_{\oplus}(X)\) to be the optimal transport map from the conditional mean \(f_{\oplus}(X)\) to the random density \(\mathfrak{F}.\) By \eqref{eq: wass2}, it must be that \(E(Q(t)|X=x) = Q_{\oplus}(x,t),\) so that \(E(T(u) |X = x) = u\) for \(u\) such that \(f_{\oplus}(x,u) > 0.\) Then the conditional Fréchet variance is \[ \begin{equation}\label{eq: condWVar} \begin{split} \mathrm{Var}_{\oplus}(\mathfrak{F}|X= x) & = E\left[d_{\mathrm{W}}^{2}(\mathfrak{F}, f_{\oplus}(x))|X = x\right] = \int_\mathbb{R} E\left[(T(u) - u)^2|X=x\right]f_{\oplus}(x,u)\mathrm{d} u \\ & = \int_\mathbb{R} \mathrm{Var}(T(u)|X=x) f_{\oplus}(x,u) \mathrm{d} u. \end{split} \end{equation} \]
In these developments, we have assumed that the marginal and conditional Wasserstein mean densities exist and are unique. However, this is not automatic and some conditions are needed. The assumptions (A1)–(A3) in Petersen, Liu, and Divani (2021) are sufficient for this purpose, we will skip this part and discuss the regression directly.
Global Wasserstein–Fréchet regression
In order to tests for no or partial effects of the covariates \(X\) and confidence bands for the conditional Wasserstein means, we consider a particular global regression model for the conditional Wassersteins \(f_{\oplus}(x)\) defined in \eqref{eq:condWMean}.
This model, proposed by Petersen and Müller (2019), is termed Fréchet regression, and takes the form of a weighted Fréchet mean \[ \begin{equation}\label{eq:Wmodel} f_{\oplus}(x) = \mathop{\mathrm{arg\,min\,}}_{f \in \mathcal{D}} E\left[s(X, x)d_{\mathrm{W}}^{2}(\mathfrak{F}, f)\right], \end{equation} \] where the weight function is \[ \begin{equation*} s(X, x) = 1 + (X - \mu)^\top \Sigma^{-1}(x - \mu), \quad \mu = E(X), \, \Sigma = \mathrm{Var}(X), \end{equation*} \] and \(\Sigma\) is assumed to be positive definite. In fact, this model generalises linear regression to the case of density response by substituting \(\mathfrak{F}\) for \(Y\) and \((\mathcal{D},d_W)\) in place of the usual metric space \((\mathbb{R}, |\cdot|)\) implicitly used in multiple linear regression.
Thus far, the regression model \eqref{eq:Wmodel} provides us with a formula for the conditional Wasserstein mean of \(\mathfrak{F},\) whereas one also needs information on the conditional variance in order to conduct inference. In this section, we skip residual transport and the assumption on the covariance.
Estimation
In order to estimate the regression function \(f_{\oplus}(x),\) we utilize an empirical version of the least-squares Wasserstein criterion in \eqref{eq:Wmodel}.
- First, set \(\bar{X} = n^{-1}\sum_{i = 1}^{n} X_i,\) \(\hat{\Sigma} = n^{-1}\sum_{i = 1}^{n} (X_i - \bar{X})(X_i - \bar{X})^\top,\) and compute empirical weights \(s_{in}(x) = 1 + (X_i - \bar{X})^\top \hat{\Sigma}^{-1}(x - \bar{X})\),
- Let \(\mathfrak{Q}\) be the set of quantile functions in \(L^2[0,1].\) With \({\left\lVert \cdot \right\rVert}_{L^{2}}\) denoting the standard Hilbert norm on \(L^{2}[0,1]\), an estimator of \(Q_{\oplus}(x)\) is \[ \begin{equation}\label{eq:Qfit} \hat{Q}_{\oplus}(x) = \mathop{\mathrm{arg\,min\,}}_{Q \in \mathfrak{Q}} \sum_{i = 1}^{n} s_{in}(x) {\left\lVert Q - Q_i \right\rVert}_{L^{2}}^{2}. \end{equation} \] Implementation of this estimator is given in Algorithm 1 in the Appendix. In finite samples, \(\hat{Q}_{\oplus}(x)\) will not necessarily admit a density.
- However, Lemma 2 in the Appendix guarantees that \(\hat{Q}_{\oplus}(x)\) admits a density \(\hat{f}_{\oplus}(x)\in \mathcal{D}\) for large samples with high probability. When this holds, the estimate \[ \begin{equation}\label{eq:fullEst} \hat{f}_{\oplus}(x) = \mathop{\mathrm{arg\,min\,}}_{f \in \mathcal{D}} \frac{1}{n}\sum_{i = 1}^{n} s_{in}(x)d_{\mathrm{W}}^{2}(\mathfrak{F}_i, f). \end{equation} \] is well-defined, and \(\hat{f}_{\oplus}(x)\) is the density corresponding to the quantile estimate \(\hat{Q}_{\oplus}(x)\) above. It can be computed in practice using Algorithm 2 in the Appendix.
To consider hypothesis tests of partial effects, just do the similar things above in terms of part of covariates \(X\).
Hypothese testing
Given that we are considering the more specific case of density-valued response variables under the Wasserstein geometry, we are able to expand on these results in order to develop asymptotically justified tests for both global and partial null hypotheses, where predictors can be of any type.
Test of no effects
We begin with the global null hypothesis of no effects, \(\mathcal{H}_0^G: f_{\oplus}(x) \equiv f_{\oplus}^{*}\). Given the Wasserstein variance decomposition in (2.7) of Petersen, Liu, and Divani (2021), under \(\mathcal{H}_0^G\) we have \(\mathrm{Var}_{\oplus}(f_{\oplus}(X)) = E\left(d_{\mathrm{W}}^{2}(f_{\oplus}(X), f_{\oplus}^{*})\right) = 0.\) This motivates \[ \begin{equation}\label{eq:globalF} F^{*}_G = \sum_{i = 1}^n d_{\mathrm{W}}^{2}(\hat{\mathfrak{F}}_{i}, \hat{f}_{\oplus}^{*}) \end{equation} \] as a test statistic, where \(\hat{\mathfrak{F}}_{i} = \hat{f}_{\oplus}(X_i)\) are the fitted densities and \(\hat{f}_{\oplus}^{*} = \hat{f}_{\oplus}(\bar{X})\) is the sample Wasserstein mean. This can be viewed as a generalization of the numerator of the global \(F\)-test in multiple linear regression, and we thus refer to \(F^{*}_G\) in \eqref{eq:globalF} as the (global) Wasserstein \(F\)-statistic.
In order to establish the asymptotic null distribution of \(F^{*}_G,\) we require the assumptions (T1)–(T4) (skipped here).
The limiting distribution obtained in Theorem 1 depends on unknown parameters, namely the eigenvalues \(\lambda_j\), that must be approximated to formulate a rejection region.
A simple calculation reveals that \(C_Q(s,t) = K(Q_{\oplus}^{*}(s),Q_{\oplus}^{*}(t)),\) so that \(\lambda_j\) are also the eigenvalues of \(C_Q,\) which can be estimated by \[ \begin{equation}\label{eq: CqEst} \hat{C}*Q(s, t) = \frac{1}{n} \sum*{i = 1}^{n} (Q_i(s) - \hat{Q}_{i}(s))(Q_i(t) - \hat{Q}_{i}(t)), \end{equation} \] where \(\hat{Q}_{i}\) are the fitted quantile functions corresponding to densities \(\hat{\mathfrak{F}}_{i}\). Let \(\hat{\lambda}_j\) be the corresponding eigenvalues of \(\hat{C}_Q.\)
One can consistently approximate the conditional null distribution of \(F^{*}_G\) as follows.
- For \(\alpha \in (0,1)\) and eigenvalue estimates \(\hat{\lambda}_j\), let \(\hat{b}_\alpha^G\) be the \(1-\alpha\) quantile of \(\sum_{j = 1}^{\infty} \hat{\lambda}_j\omega_j\), where \(\omega_j\) are as in Theorem 1.
- Computation of this critical value is outlined in Algorithm 3 of the Appendix.
- The following result on the conditional power \(\beta_n^G = P_{\mathbb{X}}(F_{G}^{*} > \hat{b}_\alpha^G)\) follows from Theorem 1.
- Let \(\mathfrak{G}\) denote the collection of distributions on \(\mathbb{R}^p \times \mathcal{D}\) such that model \eqref{eq:Wmodel} holds.
The paper also introduces the Test of partial effects and Alternative testing approximation, we will skip these parts and move to the confidence bands.
Confidence bands
We now develop methodology for producing a confidence set for \(f_{\oplus}(x),\) where \(x\) is considered to be fixed.
Let \(g\) be a generic functional parameter of interest, where we assume that \(g\) is bounded. Given an estimator \(\hat{g},\) a typical approach to formulating a simultaneous confidence band is to show that \(b_n(\hat{g}(u) - g(u))/a(u)\) converges weakly to a limiting process (usually Gaussian) in the space of bounded functions under the uniform metric, where \(a>0\) is a scaling function and \(b_n^{-1}\) is the rate of convergence. By an application of the continuous mapping theorem, one can then obtain a confidence band for \(g\) of the form \[ \begin{equation*} \{g^{*}: \hat{g}(u) - ca(u)b_n^{-1} \leq g^{*}(u) \leq \hat{g}(u) + ca(u)b_n^{-1}\,\, \textrm{for all }u\}. \end{equation*} \] This band corresponds to all functions that are almost everywhere between the lower and upper bounds, and is the closest one can get to a confidence interval in function space.
Petersen, Liu, and Divani (2021) explore two different approaches to formulating simultaneous confidence bands. The first arises naturally from the Wasserstein geometry, and provides a distributional band for either \(Q_{\oplus}(x)\) or \(F_{\oplus}(x)\), but not the density parameter. In the second approach, they strengthen the convergence results and utilize the delta method to construct a simultaneous confidence band for \(f_{\oplus}(x)\). For more details, please refer to Section 4 of the paper (Petersen, Liu, and Divani 2021). The Algorithm of the confidence band is given in the Appendix, see Algorithm 7.
Furthermore, the authors also provide the R
package WRI
. The package is designed to provide a comprehensive set of tools for the analysis of density-valued response data under the Wasserstein geometry. The package is available on CRAN and can be installed using the command install.packages("WRI")
.