\documentclass[10pt]{article}
\usepackage{fullpage}
\usepackage{setspace}
\usepackage{parskip}
\usepackage{titlesec}
\usepackage[section]{placeins}
\usepackage{xcolor}
\usepackage{breakcites}
\usepackage{lineno}
\usepackage{hyphenat}
\PassOptionsToPackage{hyphens}{url}
\usepackage[colorlinks = true,
linkcolor = blue,
urlcolor = blue,
citecolor = blue,
anchorcolor = blue]{hyperref}
\usepackage{etoolbox}
\makeatletter
\patchcmd\@combinedblfloats{\box\@outputbox}{\unvbox\@outputbox}{}{%
\errmessage{\noexpand\@combinedblfloats could not be patched}%
}%
\makeatother
\usepackage[round]{natbib}
\let\cite\citep
\renewenvironment{abstract}
{{\bfseries\noindent{\abstractname}\par\nobreak}\footnotesize}
{\bigskip}
\titlespacing{\section}{0pt}{*3}{*1}
\titlespacing{\subsection}{0pt}{*2}{*0.5}
\titlespacing{\subsubsection}{0pt}{*1.5}{0pt}
\usepackage{authblk}
\usepackage{graphicx}
\usepackage[space]{grffile}
\usepackage{latexsym}
\usepackage{textcomp}
\usepackage{longtable}
\usepackage{tabulary}
\usepackage{booktabs,array,multirow}
\usepackage{amsfonts,amsmath,amssymb}
\providecommand\citet{\cite}
\providecommand\citep{\cite}
\providecommand\citealt{\cite}
% You can conditionalize code for latexml or normal latex using this.
\newif\iflatexml\latexmlfalse
\providecommand{\tightlist}{\setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}%
\AtBeginDocument{\DeclareGraphicsExtensions{.pdf,.PDF,.eps,.EPS,.png,.PNG,.tif,.TIF,.jpg,.JPG,.jpeg,.JPEG}}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\begin{document}
\title{DP mixture models in Survival Analysis}
\author[1]{William Thomson}%
\affil[1]{Affiliation not available}%
\vspace{-1em}
\date{\today}
\begingroup
\let\center\flushleft
\let\endcenter\endflushleft
\maketitle
\endgroup
\sloppy
\textbf{Dirichlet Process Mixture models}
Dirichlet Process Mixture models (DPMs) arise as nonparametric
estimators of probability density functions. In general, we choose some
parametric (preferably easy to work with) distributions and take an
infinite mixture of them. To wit, given some arbitrary probability
density function~\(f(x)\), we approximate it as a mixture of
parametric pdfs, \(\{\varphi_k(x)\}\)~indexed by~\(k\) as
\(f(t) \approx \sum_{k = 1}^{\infty} w_k \varphi_k(t)\).
The weights~\(w_k\) must sum to 1 to ensure that the
approximation is a valid pdf.
Now, the infinite-dimensional weight vector is given a Dirichlet Process
prior distribution, which is a distribution over the
infinite-dimensional unit simplex, samples from it being discrete
probability mass functions.
The stick-breaking formulation of the DP can be stated as follows:
If~\(G \sim DP(\alpha H)\), then we can write~\(G = \sum_{i = 1}^{\infty} \pi_i \delta_{\theta_i}\),
where~\(\theta_i\) is drawn from the base
measure~\(H\) and the
weights~\(\pi_i\)~represent~\(\beta (1,\alpha)\)-distributed
portions cleaved from a stick of unit length, where at each step only
what's left of the stick is left to be apportioned. Dirichlet process
mixtures centre some parametric density at the
locations~\(\theta_i\) and attach to them the
weights~\(\pi_i\).
The incorporation of covariates can be accomplished through introducing
some dependency, dependent on~\(\mathbf x\), between the
distributions for each patient (McEachern, 2000; de Iorio, 2004, 2009).
de Iorio imposes an ANOVA model on the locations of the mixture
components. Alternatively, we might look to find a representation of the
hazard function and incorporate covariate effects in standard ways. The
hazard is given by~\(\tilde h (t) = \frac{\sum_{k = 1}^{\infty} w_k \phi_k(t)}{1 - \sum_{k = 1}^{\infty} w_k \Phi_k(t)}\).
If~\(\phi_k\) is the Erlang pdf with~\(k\) phases
and shared rate~\(\lambda\), we get the following hazard
function
\(\tilde h (t) = \frac{\sum_{n = 0}^{\infty} w_{n+1} \lambda^{n+1} (n!)^{-1}t^n}{1 + \sum_{n = 1}^{\infty} \lambda^n (n!)^{-1} (1 - \sum_{k = 1}^{n} w_k) t^n} \)
~~~~~~~~~~~\( = \frac{\lambda \sum_{n = 0}^{\infty} w_{n+1} \lambda^{n} (n!)^{-1}t^n}{1 + \sum_{n = 1}^{\infty} \lambda^n (n!)^{-1} (1 - \sum_{k = 1}^{n} w_k) t^n}\)~~~~~~~~~~~~~~~~~~~~~~~~
Covariate effects can be incorporated through the~\(w_n\),
the~\(\lambda\), or both. The weights will typically be decided
by some nonparametric stick-breaking method, for example a logistically
transformed GP for each weight. The~\(\lambda\) dependence can
also be decided by a log-GP (since \(\lambda\) must be
positive).
This gives us the general hazard model
\(\tilde{h}(t; \mathbf x)= \frac{\lambda(\mathbf x) \sum_{n = 0}^{\infty} w_{n+1}(\mathbf x) \lambda^{n}(\mathbf x) (n!)^{-1}t^n}{1 + \sum_{n = 1}^{\infty} \lambda^n(\mathbf x) (n!)^{-1} (1 - \sum_{k = 1}^{n} w_k(\mathbf x)) t^n}\).
We want to know under what circumstances (conditions on
the~\(\mathbf x\) dependency) we can write this as a product of a
function of~\(\mathbf x\)~alone and a function
of~\(t\) alone; this would lead to a proportional hazards
model.
We can recover a very simple proportional hazards model by defining the
weights recursively. We choose~\(w_1(\mathbf x)\) by some means, and
then define the rest of the weights such that~\((1 - \sum_{k = 1}^{n}w_k) = (1 - w_1)^n\). This
leads to a hazard rate of
\(\tilde{h}(t) = \lambda w_1(\mathbf x)\), i.e. there is some maximum hazard across the
population, and the weight determines how much of this hazard is present
in a patient with covariates~\(\mathbf x\). The recursion sets the
weights to be~\(w_k = w_1(1 - w_1)^{k-1}\). This means the weights are defined by
a stick-breaking process where each break is a
proportion~\(w_1\) along the remainder of the stick. This
corresponds to the unconditional distribution of an Erlang r.v. with
geometrically distributed number of phases, suggesting a DP with a
geometric base measure for the weights.
Note that the geometric distribution is a special case of the negative
binomial distribution, which is equivalent to a Poisson distribution
with Gamma distributed rate parameter (with shape \(r\)
and scale~\(p/(1-p)\)).
So the general set-up would be:
\(T|k,\theta \sim \textrm{Erlang}(k,\lambda)\)
\(k|\theta \sim \textrm{Poisson}(\theta)\),
\(\theta \sim \textrm{Gamma}(r,p/(1-p))\).
The question is whether ~choosing~\(p = p(\mathbf x)\) leads to a
proportional hazards model.
Let~\(p = w_1(\mathbf x)\), then the negative binomial distribution with
parameters~\(r,p\) is given by
\(w_k := \pi(k) = {{k + r - 1} \choose {k}} (1 - w_1)^r w_1 ^k \).
The survival function then becomes
\(S(t;\mathbf x) = \exp(-\lambda x) \sum_{n = 0}^{\infty} (1 - \sum_{k = 0}^{n}w_k) \frac{\lambda^n}{n!}t^n \)
Simulations suggest that we end up with a proportional hazards model
(survival curves don't cross).
It is tempting to hypothesise that the survival function we obtain from
the negative binomial mixture is gamma.
The problem with Erlang mixtures is that the common scale parameter is
required to approach zero to obtain the result that these mixtures are
dense in the field of continuous pdfs on~\([0,\infty)\). For large
values of the shared scale parameter, there appears to be no guarantee
that the mixture approximates anything particularly well.
\textbf{Beta-Stacy processes}
Provide priors on CDFs which are conjugate to right-censored data, can
be anchored to a baseline model and allow covariates to be incorporated.
Equivalent to a beta process prior on the cumulative hazard. Leads to
inflexible covariate modelling and non-smooth survival estimates.
\textbf{de Iorio Model}
The basic idea is that dependence among the distributions, dependent
on~\(\mathbf x\), can be proxied by dependence of the locations of
the mixture components on~\(\mathbf x\). ~In the original
dependent DP (McEachern), the weights are also allowed to depend
on~\(\mathbf x\). ~Essentially, each location parameter is
associated with a linear model,~\(\theta_i = X \beta\). They refer to this
model as the `linear DDP'.
The question is: what restrictions on this procedure produce linear
proportional hazards models, and can we impose hierarchical complexity
penalties on these models?
It's probably easier to approach this from the GP formulation of
Fernandez (2016).
\textbf{Gaussian Processes for Survival}
There, the hazard rate is modelled as a GP-modified parametric hazard.
The hazard for the GP formulation is given
by~\(h_0(t) \sigma(l(t)) = h_0(t) \frac{\exp(l(t))}{1 + \exp(l(t))}\).~\(l(t)\) is given a GP prior with
(there) a kernel dependent on the covariates, so
that~\(l \sim GP(0,k)\) with~\(k(v,v') = k_0(t,t') + \sum_{i = 1}^{p} x_i x_i' k_i(t,t')\)
with~\(v = (t,\mathbf x)\). This essentially says that the modification to
the baseline hazard is a logistically transformed GP with additive
time-dependent kernels. The~\(k_i\) are taken to be SE
kernels with distinct hyperparameters. Essentially, the
covariates~\(x_i\) influence the vertical range of the GP
(its~\(\sigma_i\) hyperparameter), so that larger values of the
covariates allow the GP to attain larger values in both the positive and
negative directions, essentially encouraging the modifications to be
more polarised to 0 or 1. This seems like a weird choice\ldots{}
Also, why not do away with the `baseline' hazard and just
use~\(k_0(t,t')\)? I suppose it makes incorporating prior
information about the shape of the hazard easier, but is this really the
focus of prior information? Isn't it more likely to be covariate effects
that clinicians have more prior knowledge about?
Doing away with the `base measure' for now, let's assume we just model
the hazard as a log-GP over time and covariate space. The way to impose
hierarchical complexity structure is through the kernel hyperparameters.
It might be possible to use the Fernandez trick to avoid to the
numerical integration via data augmentation.
\textbf{Density estimation with GPs}
The go-to approach to density estimation with Gaussian Processes is the
`Logistic Gaussian Process' of Leonard (1978,?). Lenk (2003) provides a
means to incorporate prior information/anchor the estimate to a
parametric distribution, in a similar approach to the one used by
Fernandez.
The density estimate is a GP adjustment to some exponential family
parametric density;
\par\null\par\null\par\null\par\null
\selectlanguage{english}
\FloatBarrier
\end{document}