Functional Data Analysis for Density Functions by Transformation to a Hilbert Space
Published:
Goal
This is the report from my point of view on the paper Functional Data Analysis for Density Functions by Transformation to a Hilbert Space (Petersen and Müller 2016).
Functional data that are nonnegative and have a constrained integral can be considered as samples of one-dimensional density function.
The density function is nonnegative and has an integral of 1, hence commonly used Hilbert space based methods of functional data analysis are not applicable.
To address this issue, the paper proposes a new method which maps the densities to a Hilbert space, where the map is a continuous and invertible function.
Basic operations are implemented in the Hilbert space, then apply the inverse transformation to the density space.
Introduction
The functional modeling of density functions is difficult due to the two constrains \(\int f(x)\,dx=1\) and \(f \geq 0\) which means that the densities space is not linear.
The key idea is to map probability densities into a linear function space by using a suitably chosen continuous and invertible map \(\psi\).
Preliminaries
Density modelling
Assume that data consist of a sample of \(n\) (random) density functions \(f_1, \ldots, f_n\), where the densities are supported on a common interval \([0,T]\) for some \(T>0\).
Denote the space of continuous and strictly positive densities on \([0,1]\) by \(\mathcal{G}\).
The sample consists of i.i.d. realizations of an underlying stochastic process, that is, each density is independently distributed as \(f\sim\mathfrak{F}\), where \(\mathfrak{F}\) is an \(L^2\) process on \([0,1]\) taking values in some space \(\mathcal{F}\subset\mathcal{G}\)
A basic assumption we make on the space \(\mathcal {F}\) is:
Densities \(f\) can equivalently be represented as:
- cumulative distribution functions (c.d.f.) \(F\) with domain \([0,1]\), hazard functions \(h=f/(1-F)\) (possibly on a subdomain of \([0,1]\) where \(F(x)<1\)).
- quantile functions \(Q=F^{-1}\), with support \([0,1]\).
- Occasionally of interest is the equivalent notion of the quantile-density function \(q(t)=Q'(t)=\frac{d}{dt}F^{-1}(t)= [f(Q(t)) ]^{-1}\), from which we obtain \(f(x)= [q(F(x)) ]^{-1}\).
All of these functions provide equivalent characterizations of distributions.
In many situations, the densities themselves will not be directly observed. Instead, for each \(i\), we may observe an i.i.d. sample of data \(W_{il}\), \(l = 1, \ldots, N_i\), that are generated by the random density \(f_i\).
Thus, there are two random mechanisms at work that are assumed to be independent: the first generates the sample of densities and the second generates the samples of real-valued random data; one sample for each random density in the sample of densities.
Hence, the probability space can be thought of as a product space \((\Omega_1\times\Omega_2, \mathcal{A}, P)\), where \(P = P_1 \otimes P_2\).
Metrics in the space of density functions
In previous applied and methodological work, it was found that a metric \(d_{Q}\) based on quantile functions \({d_{Q}(f,g)^2=\int_0^1 (F^{-1}(t)-G^{-1}(t))^2 \,\mathrm{d}t}\) is particularly promising from a practical point of view.
This quantile metric has connections to the optimal transport problem , and corresponds to the Wasserstein metric between two probability measures,
\[ \begin{equation}\label{eq:wass_dist} d_W(f,g)^2 = \inf_{X \sim f, Y \sim g}E(X - Y)^2, \end{equation} \]
where the expectation is with respect to the joint distribution of \((X, Y)\). The equivalence \(d_{Q}=d_{W}\) can be most easily seen by applying a covariance identity, details can be found in the supplementary material.
We will develop our methodology for a general metric, which will be denoted by \(d\) in the following, and may stand for any of the above metrics in the space of densities.
Density estimation
The densities themselves must first be estimated.
Consider the estimation of a density \(f \in\mathcal{F}\) from an i.i.d. sample (generated by \(f\)) of size \(N\) by an estimator \(\check{f}\). Here, \(N = N(n)\) will implicitly represent a sequence that depends on \(n\), the size of the sample of random densities.
For the theoretical results, a density estimator \(\check{f}\) must satisfy the following consistency properties in terms of the \(L^2\) and uniform metrics (denoted as \(d_2\) and \(d_{\infty}\), resp.):
- For a sequence \(b_N = o(1)\), the density estimator \(\check{f}\), based on an i.i.d. sample of size \(N\), satisfies \(\check{f}\geq0\), \(\int_0^1\check{f}(x) \,dx = 1\) and
\[ \begin{equation}\label{eq:d1} \sup_{f\in\mathcal {F}} E\bigl(d_2(f, \check{f})^2 \bigr) = O\bigl(b_N^2\bigr). \end{equation} \]
- For a sequence \(a_N = o(1)\) and some \(R > 0\), the density estimator \(\check{f}\), based on an i.i.d. sample of size \(N\), satisfies
\[ \begin{equation}\label{eq:d2} \sup_{f \in\mathcal {F}}P\bigl(d_{\infty}(f, \check{f}) > Ra_N\bigr) \rightarrow 0. \end{equation} \]
✋ Note
The standard kernel density estimator does not satisfy these assumptions, due to boundary effects. Much work has been devoted to rectify the boundary effects, but the resulting estimators leave the density space and have not been shown to satisfy \eqref{eq:d1} and \eqref{eq:d2}.
Therefore, we introduce here a modified density estimator of kernel type that is guaranteed to satisfy \eqref{eq:d1} and \eqref{eq:d2}.
Let $$ be a kernel that corresponds to a continuous probability density function and \(h < 1/2\) be the bandwidth. We define a new kernel density estimator to estimate the density \(f \in\mathcal {F}\) on \([0,1]\) from a sample \(W_1,\ldots,W_N \stackrel {\mathrm{i.i.d.}}{\sim} f\) by
\[ \begin{equation}\label{eq:dens_est} \check{f}(x) = \sum_{l = 1}^N \kappa \biggl(\frac{x - W_l}{h} \biggr)w(x, h) \bigg/ \sum _{l = 1}^N \int_0^1 \kappa \biggl(\frac{y - W_l}{h} \biggr)w(y, h) \,dy, \end{equation} \]
for \(x \in[0,1]\) and \(0\) elsewhere.
Here, the kernel \(\kappa\) is assumed to satisfy the following additional conditions:
(K1) The kernel \(\kappa\) is of bounded variation and is symmetric about \(0\).
(K2) The kernel \(\kappa\) satisfies \(\int_0^1 \kappa (u) \,du> 0\), and \(\int_\mathbb{R} |u|\kappa (u) \,du\), \(\int_\mathbb{R} \kappa^2(u) \,du\) and \(\int_\mathbb{R} |u|\kappa ^2(u) \,du\) are finite.
The weight function
\[ \begin{equation}\label{eq:jackknife_kernel_weight} w(x, h)= \begin{cases} \biggl(\int_{-x/h}^1 \kappa (u) \,du \biggr)^{-1}& \quad \text{for } x\in[0,h), \\ \biggl(\int_{-1}^{(1-x)/h} \kappa (u) \,du \biggr)^{-1},&\quad \text{for } x\in (1-h,1] \\ 1 & \quad \text{otherwise} \\ \end{cases} \end{equation} \]
is designed to remove boundary bias.
The following result demonstrates that this modified kernel estimator indeed satisfies conditions \eqref{eq:d1} and \eqref{eq:d2}. Furthermore, this result provides the rate in \eqref{eq:d1} for this estimator as \(b_N = N^{-1/3}\), which is known to be the optimal rate under our assumptions, where the class of densities \(\mathcal {F}\) is assumed to be continuously differentiable, and it also shows that rates \(a_N = N^{-c}\), for any \(c \in(0, 1/6)\) are possible in \eqref{eq:d1}.
Functional data analysis for density process
For a generic density function process \(f \sim\mathfrak{F}\), denote:
the mean function by \(\mu(x) = E(f(x))\),
the covariance function by \(G(x,y) = \operatorname{Cov}(f(x), f(y))\),
the orthonormal eigenfunctions and eigenvalues of the linear covariance operator \[ \begin{equation} (Af)(t)=\int G(s,t)f(s) \,\mathrm{d}s \end{equation} \]
by \(\{\phi_k\}_{k = 1}^ \infty\) and \(\{\lambda_k\}_{k = 1}^\infty\), where the latter are positive and in decreasing order.
If \(f_1, \ldots, f_n\) are i.i.d. distributed as \(f\), then by the Karhunen–Loève expansion, for each \(i\),
\[ \begin{equation}\label{eq:eq KL} f_i(x) = \mu(x) + \sum_{k = 1}^ \infty \xi_{ik} \phi_k(x), \end{equation} \]
where \(\xi_{ik} = \int_0^1 (f_i(x) - \mu(x))\phi_k(x) \,\mathrm{d}x\) are the uncorrelated principal components with zero mean and variance \(\lambda_k\). The Karhunen–Loève expansion constitutes the foundation for the commonly used FPCA technique.
The next part of this section is on the estimation of the mode of variation, i.e., the eigenfunctions. This part is the routine part of the FPCA, and I will skip it.
Transformation approach
The proposed transformation approach is to map the densities into a new space \(L^2(\mathcal {T})\) via a functional transformation \(\psi\), where \(\mathcal{T} \subset\mathbb {R}\) is a compact interval.
Then we work with the resulting \(L^2\) process \(X: = \psi(f)\)
By performing FPCA in the linear space \(L^2(\mathcal {T})\) and then mapping back to density space, this transformation approach can be viewed as an intrinsic analysis, as opposed to ordinary FPCA.
In the new space \(L^2(\mathcal{T})\):
- \(\nu\) and \(H\) denoting the mean and covariance functions, respectively, of the process \(X\).
- \(\{\rho_k\}_{k= 1}^\infty\) denoting the orthonormal eigenfunctions of the covariance
- \(\{\eta_k\}_{k= 1}^\infty\) is the corresponding eigenvalues,
The Karhunen–Loeve expansion for each of the transformed processes \(X_i = \psi(f_i)\) is
\[ \begin{equation*} X_i(t) = \nu(t) + \sum_{k = 1}^\infty \eta_{ik}\rho_k(t), \qquad t \in \mathcal {T}, \end{equation*} \]
with principal components \(\eta_{ik} = \int_\mathcal {T} (X_i(t) - \nu(t))\rho_k(t) \,dt\).
Our goal is to find suitable transformations \(\psi: \mathcal{G}\rightarrow L^2(\mathcal{T})\) from density space to a linear functional space.
We begin with two specific examples of relevant transformations. For clarity, for functions in the native density space \(\mathcal {G}\) we denote the argument by \(x\), while for functions in the transformed space \(L^2(\mathcal{T})\) the argument is \(t\).
The log hazard transformation
Since hazard functions diverge at the right endpoint of the distribution, which is 1, we consider quotient spaces induced by identifying densities which are equal on a subdomain \(\mathcal {T} = [0, {1_\delta}]\), where \({1_\delta}= 1 -\delta\) for some \(0 <\delta< 1\). With a slight abuse of notation, we denote this quotient space as \(\mathcal {G}\) as well.
The log hazard transformation \(\psi_H: \mathcal{G} \rightarrow L^2(\mathcal {T})\) is \[ \begin{equation*} \psi_H(f) (t)= \log\bigl(h(t)\bigr) = \log \biggl\{ \frac{f(t)}{1-F(t)} \biggr\}, \qquad t \in\mathcal {T}. \end{equation*} \]
Since the hazard function is positive but otherwise not constrained on \(\mathcal {T}\), it is easy to see that \(\psi\) indeed maps density functions to \(L^2(\mathcal {T})\). The inverse map can be defined for any continuous function \(X\) as
\[ \begin{equation*} \psi_H^{-1}(X) (x) = \exp \biggl\{X(x) - \int_0^x e^{X(s)} \,ds \biggr\} ,\qquad x \in[0, {1_\delta}]. \end{equation*} \]
Solving the differential equation \[ \begin{equation*} h(t)=\frac{f(t)}{1-F(t)}=\frac{F'(t)}{1-F(t)} \end{equation*} \] for \(F(t)\), it can be shown that \[ \begin{equation*} F(t) = 1 - \exp{\left(-\int_0^t h(t) dt \right)} \end{equation*} \]
Note that for this case one has a strict inverse only modulo the quotient space. However, in order to use metrics such as \(d_W\), we must choose a representative. A~straightforward way to do this is to assign the remaining mass uniformly, that is,
\[ \begin{equation*} \psi_H^{-1}(X) (x) = \delta^{-1}\exp{\biggl\{ - \int_{0}^{1_{\delta}} e^{X(s)} \,ds \biggr\}},\qquad x \in(1_{\delta}, 1]. \end{equation*} \]
The log quantile density transformation
For \(\mathcal {T} = [0,1]\), the log quantile density (LQD) transformation \(\psi_Q: \mathcal {G}\rightarrow L^2(\mathcal{T})\) is given by
\[ \begin{equation*} \psi_Q(f) (t) = \log\bigl(q(t)\bigr) = -\log\bigl\{f\bigl(Q(t) \bigr)\bigr\}, \qquad t \in\mathcal {T}. \end{equation*} \]
It is then natural to define the inverse of a continuous function \(X\) on \(\mathcal {T}\) as the density given by \(\exp\{-X(F(x))\}\), where \(Q(t)=F^{-1}(t) = \int_0^t e^{X(s)} \,ds\). Since the value \(F^{-1}(1)\) is not fixed, the support of the densities is not fixed within the transformed space, and as the inverse transformation should map back into the space of densities with support on \([0, 1]\), we make a slight adjustment when defining the inverse by
\[ \begin{equation*} \psi_Q^{-1}(X) (x) = \theta_X\exp\bigl\{-X \bigl(F(x)\bigr)\bigr\},\qquad F^{-1}(t) = \theta _X^{-1} \int_0^t e^{X(s)} \,ds, \end{equation*} \]
where \(\theta_X = \int_0^1 e^{X(s)} \,ds\). Since \(F^{-1}(1) = 1\) whenever \(X\in\psi_Q (\mathcal {G} )\), this definition coincides with the natural definition mentioned above on \(\psi_Q (\mathcal{G} )\).
In the transformation approach we construct modes of variation in the transformed space for processes \(X = \psi(f)\) and then map these back into the density space, defining transformation modes of variation
\[ \begin{equation}\label{eq:tmodes} g_k(x, \alpha, \psi) = \psi^{-1} (\nu+\alpha\sqrt{\tau _k}\rho _k ) (x). \end{equation} \]
Estimation of these modes is done by first estimating the mean function \(\nu\) and covariance function \(H\) of the process \(X\). Letting \(\widehat {X}_i =\psi(\check{f}_i)\), the empirical estimators are
\[ \begin{eqnarray} \label{eq:nu_est} \tilde{\nu}(t) &= &\frac{1}{n}\sum_{i=1}^nX_i(t)\quad \mbox{respectively}\quad \hat{\nu}(t) = \frac{1}{n}\sum _{i=1}^n\widehat {X}_i(t); \\\label{eq:H_est} \widetilde {H}(s,t) &=& \frac{1}{n}\sum_{i=1}^nX_i(s)X_i(t) - \tilde {\nu}(s)\tilde{\nu}(t)\quad \mbox {respectively} \nonumber \\[-8pt] \\[-8pt] \nonumber \widehat {H}(s,t) &=& \frac{1}{n}\sum_{i=1}^n \widehat {X}_i(s)\widehat {X}_i(t) - \hat{\nu}(s)\hat{ \nu} (t). \end{eqnarray} \]
Estimated eigenvalues and eigenfunctions (\(\tilde{\tau}_k\) and \(\tilde{\rho}_k\), resp., \(\hat{\tau}_k\) and \(\hat{\rho}_k\)) are then obtained from the mean and covariance estimates as before, yielding the transformation mode of variation estimators \[ \begin{eqnarray}\label{eq:tmodes_est} \tilde{g}_k(x, \alpha, \psi) &=& \psi^{-1}(\tilde{\nu}+ \alpha \sqrt {\tilde{\tau}_k}\tilde{\rho} _k) (x)\quad \mbox{respectively} \nonumber \\[-8pt] \\[-8pt] \nonumber \hat{g}_k(x, \alpha, \psi) &= &\psi^{-1}(\hat{\nu}+ \alpha \sqrt {\hat{\tau} _k}\hat{\rho} _k) (x). \nonumber \end{eqnarray} \]
In contrast to the modes of variation resulting from ordinary FPCA in Section 3, the transformation modes are bona fide density functions for any value of \(\alpha\). Thus, for reasonably chosen transformations, the transformation modes can be expected to provide a more interpretable description of the variability contained in the sample of densities. Indeed, the data application in Section 6.2 shows that this is the case, using the log quantile density transformation as an example.
The truncated representations of the original densities in the sample are then given by
\[ \begin{equation}\label{eq:fnt_trnc1} f_i(x, K, \psi) = \psi^{-1} \Biggl(\nu+ \sum_{k = 1}^K \eta _{ik}\rho _k \Biggr) (x). \end{equation} \]
Utilizing (\ref{eq:nu_est}) and the ensuing estimates of the eigenfunctions, the (transformation) principal components, for the case of fully observed densities, are obtained in a straightforward manner,
\[ \begin{equation} \label{eq:pc_est} \tilde{\eta}_{ik}= \int_\mathcal {T} \bigl(X_i(t) - \tilde{\nu }(t)\bigr)\tilde{\rho}_k(t) \,dt, \end{equation} \]
whence \[ \begin{equation*} \tilde{f}_i(x, K, \psi) = \psi^{-1} \Biggl(\tilde{\nu}+ \sum_{k =1}^K \tilde{\eta}_{ik}\tilde{\rho} _k \Biggr) (x). \end{equation*} \]
In practice, the truncation point \(K\) can be selected by choosing a cutoff for the fraction of variance explained. This raises the question of how to quantify \emph{total variance}. For the chosen metric \(d\), we propose to use the Frechet variance
\[ \begin{equation}\label{eq:tot_var} V_\infty:= E\bigl(d(f, f_\oplus )^2\bigr), \end{equation} \] where $f_\oplus $ is the Frechet mean of the densities in the sample. which is estimated by its empirical version
\[ \begin{equation}\label{eq:tot_var_est} \tilde{V}_\infty = \frac{1}{n}\sum_{i=1}^nd(f_i, \tilde{f}_\oplus)^2, \end{equation} \]
using an estimator $\tilde{f}_\oplus $ of the Frechet mean. Truncating at \(K\) included components as in (\ref{eq:fnt_trnc1}) and denoting the truncated versions as \(f_{i,K}\), the variance explained by the first \(K\) components is
\[ \begin{equation}\label{eq:var_ex} V_K:= V_\infty - E\bigl(d(f_1, f_{1, K})^2\bigr), \end{equation} \]
which is estimated by
\[ \begin{equation}\label{eq:var_ex_est} \tilde{V}_K= \tilde{V}_\infty - \frac{1}{n}\sum_{i=1}^nd(f_i, \tilde{f}_{i, K})^2. \end{equation} \]
The ratio $V_K/V_\infty $ is called the fraction of variance explained (FVE), and is estimated by \(\tilde{V}_K/\tilde{V}_\infty\). If the truncation level is chosen so that a fraction \(p\), \(0 < p < 1\), of total variation is to be explained, the optimal choice of \(K\) is
\[ \begin{equation} \label{eq:choice_K} K^\ast= \min \biggl\{K : \frac{V_K}{V_\infty} > p \biggr \}, \end{equation} \]
which is estimated by
\[ \begin{equation}\label{eq:choice_K_est} \tilde{K}^\ast= \min \biggl\{K : \frac{\tilde{V}_K}{\tilde{V}_\infty} > p\biggr\}. \end{equation} \]
As will be demonstrated in the data illustrations, this more general notion of variance explained is a useful concept when dealing with densities or other functions that are not in a Hilbert space. Specifically, we will show that density representations in \eqref{eq:fnt_trnc1}, obtained via transformation, yield higher FVE values than the ordinary representations, thus giving more efficient representations of the sample of densities.
❗ Important
The R code for the transformation approach (only log density transformation) is available in the package fdadensity. There is no R code for the log hazard transformation.