Dirichlet Process Mixture models
Dirichlet Process Mixture models (DPMs) arise as nonparametric estimators of probability density functions. In general, we choose some parametric (preferably easy to work with) distributions and take an infinite mixture of them. To wit, given some arbitrary probability density function \(f(t)\), we approximate it as a mixture of parametric pdfs \(\{\varphi_k(t)\}\), indexed by \(k\), as
\(f(t) \approx \sum_{k = 1}^{\infty} w_k \varphi_k(t)\).
The weights \(w_k\) must be non-negative and sum to 1 to ensure that the approximation is a valid pdf.
Now, the infinite-dimensional weight vector is given a Dirichlet Process prior distribution: the DP is a distribution over probability measures whose samples are almost surely discrete, so the weight vector lives on the infinite-dimensional unit simplex.
The stick-breaking formulation of the DP can be stated as follows:
If \(G \sim DP(\alpha H)\), then we can write \(G = \sum_{i = 1}^{\infty} \pi_i \delta_{\theta_i}\), where the \(\theta_i\) are drawn from the base measure \(H\) and the weights \(\pi_i\) are portions cleaved from a stick of unit length: at each step a \(\mathrm{Beta}(1,\alpha)\)-distributed proportion \(V_i\) of whatever remains of the stick is broken off, so that \(\pi_i = V_i \prod_{j < i}(1 - V_j)\). Dirichlet process mixtures centre some parametric density at the locations \(\theta_i\) and attach to them the weights \(\pi_i\).
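As a concrete illustration, here is a minimal sketch of a truncated stick-breaking draw; the standard-normal base measure, the concentration \(\alpha\) and the truncation level are placeholder choices.

```python
# Minimal sketch: truncated stick-breaking draw from DP(alpha * H),
# with H = N(0, 1) purely for illustration.
import numpy as np

def stick_breaking_draw(alpha, n_atoms=100, rng=None):
    """Return atom locations theta_i and weights pi_i of a truncated DP draw."""
    rng = np.random.default_rng(rng)
    v = rng.beta(1.0, alpha, size=n_atoms)                        # Beta(1, alpha) stick proportions
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    pi = v * remaining                                            # pi_i = v_i * prod_{j<i} (1 - v_j)
    theta = rng.normal(0.0, 1.0, size=n_atoms)                    # atoms drawn from the base measure H
    return theta, pi

theta, pi = stick_breaking_draw(alpha=2.0, n_atoms=50, rng=0)
```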
Covariates can be incorporated by introducing dependence on \(\mathbf x\) among the patient-specific distributions (MacEachern, 2000; de Iorio, 2004, 2009). de Iorio imposes an ANOVA model on the locations of the mixture components. Alternatively, we might look for a representation of the hazard function and incorporate covariate effects in standard ways. The hazard is given by \(\tilde h (t) = \frac{\sum_{k = 1}^{\infty} w_k \phi_k(t)}{1 - \sum_{k = 1}^{\infty} w_k \Phi_k(t)}\), where \(\Phi_k\) is the CDF corresponding to \(\phi_k\).
If \(\phi_k\) is the Erlang pdf with \(k\) phases and shared rate \(\lambda\), we get the following hazard function
\(\tilde h (t) = \frac{\sum_{n = 0}^{\infty} w_{n+1} \lambda^{n+1} (n!)^{-1}t^n}{1 + \sum_{n = 1}^{\infty} \lambda^n (n!)^{-1} (1 - \sum_{k = 1}^{n} w_k) t^n} \)
           \( = \frac{\lambda \sum_{n = 0}^{\infty} w_{n+1} \lambda^{n} (n!)^{-1}t^n}{1 + \sum_{n = 1}^{\infty} \lambda^n (n!)^{-1} (1 - \sum_{k = 1}^{n} w_k) t^n}\)                        
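For reference, the step behind this expression: the Erlang pdf and survival function are
\(\phi_k(t) = \frac{\lambda^k t^{k-1} e^{-\lambda t}}{(k-1)!}\), \(\quad 1 - \Phi_k(t) = e^{-\lambda t} \sum_{j = 0}^{k-1} \frac{(\lambda t)^j}{j!}\),
so that
\(\sum_{k = 1}^{\infty} w_k \phi_k(t) = e^{-\lambda t} \sum_{n = 0}^{\infty} w_{n+1} \frac{\lambda^{n+1} t^n}{n!}\) and \(1 - \sum_{k = 1}^{\infty} w_k \Phi_k(t) = e^{-\lambda t} \sum_{n = 0}^{\infty} \left(1 - \sum_{k = 1}^{n} w_k\right) \frac{(\lambda t)^n}{n!}\),
and the common factor \(e^{-\lambda t}\) cancels in the ratio.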
Covariate effects can be incorporated through the \(w_n\), the \(\lambda\), or both. The weights will typically be generated by some nonparametric stick-breaking method, for example a logistically transformed GP for each weight. The dependence of \(\lambda\) on \(\mathbf x\) can be modelled by a log-GP (since \(\lambda\) must be positive).
This gives us the general hazard model
\(\tilde{h}(t; \mathbf x)= \frac{\lambda(\mathbf x) \sum_{n = 0}^{\infty} w_{n+1}(\mathbf x) \lambda^{n}(\mathbf x) (n!)^{-1}t^n}{1 + \sum_{n = 1}^{\infty} \lambda^n(\mathbf x) (n!)^{-1} (1 - \sum_{k = 1}^{n} w_k(\mathbf x)) t^n}\).
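As a numerical sanity check, here is a minimal sketch of this hazard under truncation; the logistic and exponential transforms of random linear scores merely stand in for the logit-GP weights and log-GP rate discussed above, and all values are illustrative.

```python
# Sketch: covariate-dependent Erlang-mixture hazard, truncated at N components.
# Logistic / exponential transforms of random linear scores stand in for the
# logit-GP stick proportions and log-GP rate discussed above (placeholders only).
import numpy as np
from scipy.special import factorial, expit

def weights_and_rate(x, n_comp=50, rng=None):
    rng = np.random.default_rng(rng)
    scores = rng.normal(size=(n_comp, len(x))) @ x               # one linear score per stick
    v = expit(scores)                                            # stick proportions in (0, 1)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    lam = np.exp(0.1 * np.sum(x))                                # positive rate lambda(x)
    return w, lam

def erlang_mixture_hazard(t, w, lam):
    n = np.arange(len(w))
    num = lam * np.sum(w * (lam * t) ** n / factorial(n))        # lambda * sum_n w_{n+1} (lam t)^n / n!
    tail = np.concatenate(([1.0], 1.0 - np.cumsum(w)[:-1]))      # n-th coefficient: 1 - sum_{k=1}^{n} w_k
    return num / np.sum(tail * (lam * t) ** n / factorial(n))

x = np.array([0.5, -1.0])
w, lam = weights_and_rate(x, rng=0)
hazard = [erlang_mixture_hazard(t, w, lam) for t in np.linspace(0.1, 5.0, 10)]
```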
We want to know under what circumstances (i.e. what conditions on the \(\mathbf x\) dependence) the hazard can be written as a product of a function of \(\mathbf x\) alone and a function of \(t\) alone; this would give a proportional hazards model.
We can recover a very simple proportional hazards model by defining the weights recursively. We choose \(w_1(\mathbf x)\) by some means, and then define the rest of the weights such that \((1 - \sum_{k = 1}^{n}w_k) = (1 - w_1)^n\). This leads to a hazard rate of
\(\tilde{h}(t; \mathbf x) = \lambda w_1(\mathbf x)\), i.e. there is some maximum hazard \(\lambda\) across the population, and the weight determines how much of this hazard is present in a patient with covariates \(\mathbf x\). The recursion sets the weights to be \(w_k = w_1(1 - w_1)^{k-1}\), so the weights are defined by a stick-breaking process where each break takes a proportion \(w_1\) of what remains of the stick. This corresponds to the unconditional distribution of an Erlang r.v. with a geometrically distributed number of phases, suggesting a DP with a geometric base measure for the weights.
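To make the cancellation explicit, substitute \(w_{n+1} = w_1(1 - w_1)^n\) and \(1 - \sum_{k = 1}^{n} w_k = (1 - w_1)^n\) into the general hazard:
\(\tilde{h}(t;\mathbf x) = \frac{\lambda \sum_{n = 0}^{\infty} w_1 (1 - w_1)^n \frac{(\lambda t)^n}{n!}}{\sum_{n = 0}^{\infty} (1 - w_1)^n \frac{(\lambda t)^n}{n!}} = \frac{\lambda w_1 e^{(1 - w_1)\lambda t}}{e^{(1 - w_1)\lambda t}} = \lambda w_1(\mathbf x)\).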
Note that the geometric distribution is a special case of the negative binomial distribution, which is equivalent to a Poisson distribution with Gamma distributed rate parameter (with shape \(r\) and scale \(p/(1-p)\)).
So the general set-up would be:
\(T|k,\theta \sim \textrm{Erlang}(k,\lambda)\)
\(k|\theta \sim \textrm{Poisson}(\theta)\),
\(\theta \sim \textrm{Gamma}(r,p/(1-p))\).
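A minimal forward sampler for this hierarchy is sketched below; note that \(\textrm{Poisson}(\theta)\) puts mass on \(k = 0\), so the sketch uses \(k + 1\) phases to keep the Erlang well defined, an assumption not fixed in the notes above.

```python
# Minimal forward sampler for the Gamma-Poisson-Erlang hierarchy above.
# Poisson(theta) puts mass on k = 0, so k + 1 phases are used below (an assumption).
import numpy as np

def sample_T(r, p, lam, size=10_000, rng=None):
    rng = np.random.default_rng(rng)
    theta = rng.gamma(shape=r, scale=p / (1.0 - p), size=size)   # theta ~ Gamma(r, p/(1-p))
    k = rng.poisson(theta)                                       # k | theta ~ Poisson(theta)
    return rng.gamma(shape=k + 1, scale=1.0 / lam)               # Erlang(k+1, lam) = Gamma(k+1, 1/lam)

T = sample_T(r=1.0, p=0.3, lam=2.0, size=5_000, rng=0)
```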
The question is whether choosing \(p = p(\mathbf x)\) leads to a proportional hazards model.
Let \(p = w_1(\mathbf x)\); the negative binomial pmf with parameters \(r,p\) is then given by
\(w_k := \pi(k) = {{k + r - 1} \choose {k}} (1 - w_1)^r w_1 ^k, \qquad k = 0, 1, 2, \dots\)
The survival function then becomes
\(S(t;\mathbf x) = \exp(-\lambda t) \sum_{n = 0}^{\infty} (1 - \sum_{k = 0}^{n}w_k) \frac{\lambda^n}{n!}t^n \).
Simulations suggest that we end up with a proportional hazards model (survival curves don't cross).
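A sketch of the kind of check meant here: evaluate the truncated hazard under negative-binomial weights at two values of \(w_1\) and inspect the ratio over \(t\); the values of \(r\), \(\lambda\) and the truncation level are arbitrary.

```python
# Sketch of the proportional-hazards check: truncated Erlang-mixture hazard with
# negative-binomial weights, evaluated at two values of p = w_1(x). Under PH the
# ratio h(t; p1) / h(t; p2) should be (roughly) constant in t.
import numpy as np
from scipy.stats import nbinom
from scipy.special import gammaln

def nb_erlang_hazard(t, r, p, lam, n_trunc=200):
    n = np.arange(n_trunc)
    w = nbinom.pmf(n, r, 1.0 - p)                        # C(k+r-1, k) (1-p)^r p^k, as in the notes
    log_pois = n * np.log(lam * t) - gammaln(n + 1)      # log[(lam t)^n / n!], overflow-safe
    num = lam * np.sum(w * np.exp(log_pois))
    tail = np.concatenate(([1.0], 1.0 - np.cumsum(w)[:-1]))   # n-th coefficient: 1 - sum of first n weights
    den = np.sum(tail * np.exp(log_pois))
    return num / den

ts = np.linspace(0.1, 4.0, 20)
h1 = np.array([nb_erlang_hazard(t, r=2.0, p=0.3, lam=1.5) for t in ts])
h2 = np.array([nb_erlang_hazard(t, r=2.0, p=0.6, lam=1.5) for t in ts])
print(h1 / h2)   # an approximately constant ratio is consistent with proportional hazards
```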
It is tempting to hypothesise that the survival function we obtain from the negative binomial mixture is gamma.
The problem with Erlang mixtures is that the common scale parameter is required to approach zero to obtain the result that these mixtures are dense in the space of continuous pdfs on \([0,\infty)\). For large values of the shared scale parameter, there appears to be no guarantee that the mixture approximates anything particularly well.
Beta-Stacy processes
Beta-Stacy processes provide priors on CDFs that are conjugate to right-censored data, can be anchored to a baseline model, and allow covariates to be incorporated. They are equivalent to a beta process prior on the cumulative hazard. However, they lead to inflexible covariate modelling and non-smooth survival estimates.
de Iorio Model
The basic idea is that dependence on \(\mathbf x\) among the distributions can be proxied by letting the locations of the mixture components depend on \(\mathbf x\). In the original dependent DP (MacEachern), the weights are also allowed to depend on \(\mathbf x\). Essentially, each location parameter is associated with a linear model, \(\theta_i(\mathbf x) = \mathbf x^\top \beta_i\), with component-specific coefficients \(\beta_i\). They refer to this model as the 'linear DDP'.
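A toy version of this construction is sketched below, with stick-breaking weights shared across covariate values and Gaussian kernels; the kernel choice and all hyperparameter values are illustrative only.

```python
# Toy linear-DDP density: common stick-breaking weights across x, and atoms given
# by component-specific linear predictors theta_i(x) = x' beta_i. Gaussian kernels
# and all hyperparameter values are illustrative only.
import numpy as np

def linear_ddp_density(y, x, alpha=1.0, n_comp=50, sigma=1.0, rng=None):
    rng = np.random.default_rng(rng)
    v = rng.beta(1.0, alpha, size=n_comp)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))   # weights shared for all x
    beta = rng.normal(size=(n_comp, len(x)))                    # component-specific coefficients
    theta = beta @ x                                            # theta_i(x) = x' beta_i
    kernels = np.exp(-0.5 * ((y - theta) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return np.sum(w * kernels)

density = linear_ddp_density(y=0.3, x=np.array([1.0, 0.5]), rng=0)
```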
The question is: what restrictions on this procedure produce linear proportional hazards models, and can we impose hierarchical complexity penalties on these models?
It's probably easier to approach this from the GP formulation of Fernandez (2016).
Gaussian Processes for Survival
There, the hazard rate is modelled as a GP-modified parametric hazard.
The hazard for the GP formulation is given by \(h_0(t) \sigma(l(t)) = h_0(t) \frac{\exp(l(t))}{1 + \exp(l(t))}\), where \(l(t)\) is given a GP prior with (there) a kernel dependent on the covariates, so that \(l \sim GP(0,k)\) with \(k(v,v') = k_0(t,t') + \sum_{i = 1}^{p} x_i x_i' k_i(t,t')\) for \(v = (t,\mathbf x)\). This says that the modification to the baseline hazard is a logistically transformed GP with additive time-dependent kernels. The \(k_i\) are taken to be SE kernels with distinct hyperparameters. In effect, the covariates \(x_i\) scale the vertical range of the GP (its \(\sigma_i\) hyperparameter), so that larger covariate values allow the GP to attain larger values in both the positive and negative directions, encouraging the modification to be polarised towards 0 or 1. This seems like a weird choice...
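A sketch of that additive kernel on \(v = (t, \mathbf x)\), with squared-exponential components; the hyperparameter values are illustrative, not taken from the paper.

```python
# Sketch of the additive covariate-scaled kernel
# k(v, v') = k0(t, t') + sum_i x_i x_i' k_i(t, t'), with SE components.
# Hyperparameter values are illustrative only.
import numpy as np

def se_kernel(t, t_prime, variance, lengthscale):
    return variance * np.exp(-0.5 * (t - t_prime) ** 2 / lengthscale ** 2)

def survival_gp_kernel(v, v_prime, base_hp, cov_hps):
    """v = (t, x); base_hp and cov_hps are (variance, lengthscale) pairs, one per kernel."""
    t, x = v
    t_p, x_p = v_prime
    k = se_kernel(t, t_p, *base_hp)                              # k0(t, t')
    for x_i, x_i_p, hp in zip(x, x_p, cov_hps):
        k += x_i * x_i_p * se_kernel(t, t_p, *hp)                # x_i x_i' k_i(t, t')
    return k

v = (1.0, np.array([0.5, 2.0]))
v_prime = (2.5, np.array([0.7, 1.0]))
k_val = survival_gp_kernel(v, v_prime, base_hp=(1.0, 1.0), cov_hps=[(0.5, 2.0), (0.3, 0.5)])
```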
Also, why not do away with the 'baseline' hazard and just use \(k_0(t,t')\)? I suppose it makes incorporating prior information about the shape of the hazard easier, but is this really the focus of prior information? Isn't it more likely to be covariate effects that clinicians have more prior knowledge about?
Doing away with the baseline hazard for now, let's assume we just model the hazard as a log-GP over time and covariate space. The way to impose hierarchical complexity structure is then through the kernel hyperparameters.
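One reading of this, as a sketch: the log-hazard is a GP on \((t, \mathbf x)\) with an ARD squared-exponential kernel, and the per-dimension lengthscales and variance are the hyperparameters that a hierarchical prior would penalise; the kernel choice and the values below are assumptions.

```python
# Sketch: log-hazard as a GP over (t, x) with an ARD squared-exponential kernel.
# The per-dimension lengthscales / variance are the hyperparameters that would
# carry a hierarchical complexity penalty; kernel choice and values are assumptions.
import numpy as np

def ard_se_kernel(V1, V2, variance, lengthscales):
    """Rows of V1, V2 are points (t, x_1, ..., x_p)."""
    d = (V1[:, None, :] - V2[None, :, :]) / lengthscales
    return variance * np.exp(-0.5 * np.sum(d ** 2, axis=-1))

rng = np.random.default_rng(0)
t = np.linspace(0.0, 5.0, 40)
x = np.full_like(t, 1.2)                    # evaluate the hazard surface at a fixed covariate value
V = np.column_stack([t, x])
K = ard_se_kernel(V, V, variance=1.0, lengthscales=np.array([1.0, 3.0]))
log_h = rng.multivariate_normal(np.zeros(len(V)), K + 1e-8 * np.eye(len(V)))
hazard = np.exp(log_h)                      # a positive hazard draw over the time grid
```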
It might be possible to use the Fernandez trick to avoid the numerical integration via data augmentation.
Density estimation with GPs
The go-to approach to density estimation with Gaussian Processes is the 'Logistic Gaussian Process' of Leonard (1978,?). Lenk (2003) provides a means to incorporate prior information and anchor the estimate to a parametric distribution, in an approach similar to the one used by Fernandez.
The density estimate is a GP adjustment to some exponential family parametric density;
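A rough sketch of this kind of construction: a parametric base density multiplied by an exponentiated GP draw and renormalised numerically on a grid. The standard-normal base, kernel and grid are placeholders; this is a caricature for illustration, not Lenk's exact formulation.

```python
# Rough sketch of a logistic-GP-style density estimate: a parametric base density
# multiplied by an exponentiated GP draw and renormalised on a grid. The base
# density, kernel and grid are illustrative; this is not Lenk's exact formulation.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
grid = np.linspace(-4.0, 4.0, 200)

# GP draw Z on the grid with a squared-exponential kernel
K = np.exp(-0.5 * (grid[:, None] - grid[None, :]) ** 2 / 0.5 ** 2)
Z = rng.multivariate_normal(np.zeros(len(grid)), K + 1e-8 * np.eye(len(grid)))

base = norm.pdf(grid)                                 # parametric anchor g(y)
unnorm = base * np.exp(Z)                             # GP adjustment to the base density
f = unnorm / (unnorm.sum() * (grid[1] - grid[0]))     # renormalise to integrate to one
```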