A very brief introduction
Climate
Finance
Industry
Hazard \(\approx\) probability: the chance of an event at least as severe as some value happening within a given space-time window.
\[\Pr(X > x) = 1 - F_X(x).\]
Risk \(\approx\) cost: the potential economic, social and environmental consequences of perilous events that may occur in a specified period of time or space.
\[ \text{VaR}_\alpha (X) = F_X^{-1}(\alpha) \quad \text{or} \quad \text{ES}_\alpha(X) = \mathbb{E}[X \mid X > \text{VaR}_\alpha(X)].\]
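Both risk measures are easy to estimate empirically. A minimal NumPy sketch, assuming the "large positive values are bad" loss convention used throughout (sample, level and distribution are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
losses = rng.pareto(3, size=100_000)  # heavy-tailed "loss" sample; large = bad

alpha = 0.99
var_99 = np.quantile(losses, alpha)      # VaR: the alpha-quantile of losses
es_99 = losses[losses > var_99].mean()   # ES: mean loss beyond VaR

print(var_99 < es_99)  # ES is always at least VaR at the same level
```

Expected shortfall averages over the whole tail beyond VaR, which is why it is often preferred as a risk summary.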
Depending on the peril we are considering, the definition of a “bad” outcome differs:
Without loss of generality, we can focus on modelling large positive values, by transforming our data and results as appropriate.
\[g(X_i) \quad \text{e.g.} \quad -X_i \quad \text{or} \quad X_i^{-1}.\]
An issue with risk / hazard modelling is that we are by definition interested in the rare events, which make up only a very small proportion of our data.
Robust regression, e.g. least absolute deviations, targets the conditional median:
This generalises to quantile regression, which targets any conditional quantile:
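The idea behind quantile regression is that minimising the check (pinball) loss recovers a quantile rather than a mean. A NumPy sketch for the simplest case, a constant predictor (the deck's R workflow would use {quantreg}; the grid search here is purely illustrative):

```python
import numpy as np

def pinball(theta, x, tau):
    """Check (pinball) loss of constant predictor theta at quantile level tau."""
    u = x - theta
    return np.mean(np.where(u >= 0, tau * u, (tau - 1) * u))

rng = np.random.default_rng(42)
x = rng.normal(size=5000)

tau = 0.9
grid = np.linspace(-3, 3, 1201)
theta_hat = grid[np.argmin([pinball(t, x, tau) for t in grid])]

# The minimiser coincides with the sample tau-quantile
print(abs(theta_hat - np.quantile(x, tau)) < 0.05)
```

Replacing the constant \(\theta\) with a linear predictor gives quantile regression proper.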
What if we need to go beyond the historical record?
e.g. estimate a 1 in 1000-year flood from 50 years of data.
Extreme Value Theory allows principled extrapolation beyond the range of the observed data.
Focuses on the most extreme observations.
If we care about
\[M_n = \max \{X_1, \ldots, X_n\}\]
How can we model
\[\begin{align*} F_{M_n}(x) &= \Pr(X_1 \leq x,\ \ldots, \ X_n \leq x) \\ &= \Pr(X \leq x) ^n \\ &= F_X(x)^n? \end{align*}\]
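This product formula is easy to sanity-check by simulation. A small Python sketch for Uniform(0,1) variables, where \(F_X(x) = x\) and so \(F_{M_n}(x) = x^n\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 20, 200_000

# Maxima of n iid Uniform(0,1) draws
maxima = rng.uniform(size=(reps, n)).max(axis=1)

x = 0.9
empirical = np.mean(maxima <= x)  # Monte Carlo estimate of F_{M_n}(x)
theoretical = x ** n              # F_X(x)^n

print(abs(empirical - theoretical) < 0.01)
```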
Analogue of CLT for Sample Maxima. Let’s revisit the CLT:
Suppose \(X_1, X_2, X_3, \ldots,\) is a sequence of i.i.d. random variables with \(\mathbb{E}[X_i] = \mu\) and \(\text{Var}[X_i] = \sigma^2 < \infty\).
As \(n \rightarrow \infty\), the random variables \(\frac{\sqrt{n} (\bar{X}_n - \mu)}{\sigma}\) converge in distribution to a standard normal distribution.
\[ \frac{\sqrt{n} (\bar{X}_n - \mu)}{\sigma} \overset{d}{\longrightarrow} \mathcal{N}(0,1).\]
Rephrasing this as the partial sums rather than partial means:
\[\frac{S_n}{\sigma\sqrt{n}} - \frac{\mu}{\sigma / \sqrt{n}} \overset{d}{\longrightarrow} \mathcal{N}(0,1).\]
Under weak conditions on \(F_X\) and where appropriate sequences of constants \(\{a_n\}\) and \(\{b_n\}\) exist:
\[a_n S_n - b_n \overset{d}{\longrightarrow} \mathcal{N}(0,1).\]
If suitable sequences of normalising constants exist, then as \(n \rightarrow \infty\):
\[\begin{equation} \label{eqn:lit_extremes_normalising} \Pr\left(\frac{M_n - b_n}{a_n} \leq x \right) \rightarrow G(x), \end{equation}\]
where \(G\) is the distribution function of a Fréchet, Gumbel or negative Weibull random variable.
This links to the concept of Maximal Domain of Attraction: if we know \(F_X(x)\) then we can identify \(G(x)\).
But we don’t know \(F\)!
These distributional forms are united in a single parameterisation by the Unified Extremal Types Theorem.
The resulting generalised extreme value (GEV) family of distribution functions has the form
\[\begin{equation} \label{eqn:lit_extremes_GEV} G(x) = \exp\left\{ -\left[ 1 + \xi \frac{x - \mu}{\sigma} \right]_{+}^{-1/\xi}\right\}, \end{equation}\]
where \(x_+ = \max(x,0)\), \(\sigma \in \mathbb{R}^+\) and \(\mu , \xi \in \mathbb{R}\). The parameters \(\mu, \sigma\) and \(\xi\) have respective interpretations as location, scale and shape parameters.
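A quick numerical check of this distribution function against a library implementation (Python/SciPy as a stand-in for the deck's R tools; note that SciPy's `genextreme` parameterises the shape with the opposite sign, \(c = -\xi\)):

```python
import numpy as np
from scipy.stats import genextreme

mu, sigma, xi = 0.0, 1.0, 0.2
x = 1.5

# GEV distribution function, as on the slide
z = 1 + xi * (x - mu) / sigma
G = np.exp(-max(z, 0.0) ** (-1 / xi))

# SciPy uses shape c = -xi
G_scipy = genextreme.cdf(x, c=-xi, loc=mu, scale=sigma)

print(abs(G - G_scipy) < 1e-12)
```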
The CLT and UETT are asymptotic results; we use them as approximations for finite \(n\).
Split the data into \(m\) blocks of length \(k\) and model the \(M_k\).
How to pick the block size? Trade-off between bias and variance.
Annual blocks are often used as a pragmatic choice to handle seasonal trends.
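In code, the block-maxima recipe takes only a few lines. A hedged Python/SciPy sketch with simulated exponential "daily" data (the deck itself uses the {ismev} R package; block sizes and the true distribution here are illustrative assumptions):

```python
import numpy as np
from scipy.stats import genextreme

rng = np.random.default_rng(7)
m, k = 200, 365                      # m blocks ("years") of length k ("days")
data = rng.exponential(size=(m, k))  # hypothetical daily observations
block_maxima = data.max(axis=1)

# Fit the GEV to the block maxima by maximum likelihood
# (SciPy's shape c = -xi; exponential data lie in the Gumbel domain, xi = 0)
c_hat, loc_hat, scale_hat = genextreme.fit(block_maxima)
print(round(loc_hat, 2), round(scale_hat, 2), round(c_hat, 2))
```

For Exp(1) data the fitted location should sit near \(\log k \approx 5.9\), with scale near 1 and shape near 0.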
\[X_i - u | X_i > u, z_i \sim \text{GPD}(\sigma(z_i),\ \xi(z_i)).\]
\[ \lambda(z_i) = \Pr(X_i > u | z_i) = \frac{\exp\{\beta_0 + \beta_1 z_i\}}{1 + \exp\{\beta_0 + \beta_1 z_i\}}.\]
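A sketch of the GPD half of this model in the simplest case of constant parameters (no covariates), fitting threshold excesses of simulated exponential data with SciPy, whose `genpareto` shape \(c\) plays the role of \(\xi\) (data and threshold level are illustrative assumptions):

```python
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(3)
x = rng.exponential(size=50_000)

u = np.quantile(x, 0.95)   # a high threshold
excess = x[x > u] - u      # threshold excesses  X - u | X > u

# Fit the GPD to the excesses; floc=0 since excesses start at zero.
# For Exp(1) data the excesses are exactly Exp(1): xi = 0, sigma = 1.
xi_hat, _, sigma_hat = genpareto.fit(excess, floc=0)
print(round(xi_hat, 2), round(sigma_hat, 2))
```

Letting \(\sigma\) and \(\xi\) (and the exceedance probability \(\lambda\)) depend on covariates \(z_i\) via regressions recovers the model above.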
EVT gives us a model for the size of rare events; to produce risk maps or forecasts we also care about how those events occur over time and space.
A point process is a stochastic process \(\{X_1, \ldots ,X_N\}\) whose random variables represent the locations of events in time or space, with the number of events \(N\) itself random.
The Poisson process is the simplest version: a Poisson(\(\lambda\)) number of events, located independently and uniformly at random.
Too simple to be realistic on its own, but it is the central case from which clustered or regular occurrences are defined.
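Simulating this homogeneous case is direct: draw a Poisson count, then scatter that many points uniformly over the window. A Python sketch (the rate \(\lambda\) and window \([0, T]\) are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(11)
lam, T = 2.0, 100.0  # rate per unit time, observation window [0, T]

n_events = rng.poisson(lam * T)                    # Poisson(lambda * T) count
times = np.sort(rng.uniform(0, T, size=n_events))  # iid uniform locations

print(n_events, times[:3].round(2))
```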
Some simulated point patterns - which is random, clustered or repulsive?
Humans are rubbish at this. How can we formally test instead?
Inhomogeneous Poisson Process: the event count is still Poisson and event locations are still independent, but the rate of events is allowed to vary over time/space:
\[ \lambda(t) = \lim_{\delta \rightarrow 0 } \frac{\mathbb{E}[N(t, t + \delta)]}{\delta}.\]
The expected number of events in a region \(A\) is given by the integral of this intensity function.
\[ \Lambda(A) = \int_A \lambda(t) \mathrm{d}t \quad \text{e.g.} \quad \int_{t_{0}}^{t_{1}} \exp(a + bt) \mathrm{d}t.\]
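For the exponential-intensity example, this integral has a closed form we can verify numerically (Python/SciPy; the constants \(a\), \(b\) and the interval are arbitrary):

```python
import numpy as np
from scipy.integrate import quad

a, b = 0.5, 0.1
t0, t1 = 0.0, 10.0

rate = lambda t: np.exp(a + b * t)  # lambda(t) = exp(a + bt)

# Expected count over [t0, t1]: numerically and in closed form
Lambda_numeric, _ = quad(rate, t0, t1)
Lambda_exact = (np.exp(a + b * t1) - np.exp(a + b * t0)) / b

print(abs(Lambda_numeric - Lambda_exact) < 1e-8)
```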
Interested in describing the number, location and any additional information about events – potentially using covariates to do so.
Further relaxations: renewal processes, Poisson cluster processes, self-exciting processes.
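A standard way to simulate the inhomogeneous case (and a building block for these richer processes) is Lewis-Shedler thinning: simulate at a dominating constant rate, then keep each point \(t\) with probability \(\lambda(t)/\lambda_{\max}\). A sketch with a hypothetical sinusoidal intensity:

```python
import numpy as np

def simulate_ipp(rate, rate_max, T, rng):
    """Lewis-Shedler thinning: homogeneous PP at rate_max, then keep each
    candidate point t independently with probability rate(t) / rate_max."""
    n = rng.poisson(rate_max * T)
    candidates = rng.uniform(0, T, size=n)
    keep = rng.uniform(size=n) < rate(candidates) / rate_max
    return np.sort(candidates[keep])

rng = np.random.default_rng(5)
T = 50.0
rate = lambda t: 1.0 + 0.5 * np.sin(t)  # hypothetical intensity, max 1.5
events = simulate_ipp(rate, rate_max=1.5, T=T, rng=rng)
print(len(events))
```

Thinning only requires an upper bound on \(\lambda(t)\), which makes it a convenient default.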
Peaks Over Threshold
Point Processes
Combining these we can undo the conditioning to assess hazard and risk.
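For example, with a constant exceedance rate \(\lambda\) and constant GPD parameters, undoing the conditioning gives the unconditional tail and, from it, the usual \(N\)-observation return level (standard peaks-over-threshold algebra; \(x_N\) is the level exceeded on average once every \(N\) observations):

\[
\Pr(X > x) = \lambda \left[1 + \xi \frac{x - u}{\sigma}\right]_+^{-1/\xi}
\quad \Rightarrow \quad
x_N = u + \frac{\sigma}{\xi}\left[(N\lambda)^{\xi} - 1\right].
\]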
Hazard: number, location and magnitude of the peril. Risk: convolution of hazard with context and opinion.
Challenge: what if we do not have a good guess at the GLM form? Mix with flexible regression techniques.
{ismev} R package.
{sp} and {spatstat} R packages.

R version 4.3.3 (2024-02-29)
Platform: x86_64-apple-darwin20 (64-bit)
locale: en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages: stats, graphics, grDevices, utils, datasets, methods and base
loaded via a namespace (and not attached): Matrix(v.1.6-1.1), bit(v.4.0.5), jsonlite(v.1.8.8), crayon(v.1.5.2), compiler(v.4.3.3), Rcpp(v.1.0.12), tidyselect(v.1.2.1), MatrixModels(v.0.5-3), parallel(v.4.3.3), showtext(v.0.9-7), zvplot(v.0.0.0.9000), splines(v.4.3.3), png(v.0.1-8), yaml(v.2.3.8), fastmap(v.1.1.1), lattice(v.0.22-5), readr(v.2.1.5), R6(v.2.5.1), showtextdb(v.3.0), knitr(v.1.45), MASS(v.7.3-60.0.1), tibble(v.3.2.1), pander(v.0.6.5), pillar(v.1.9.0), tzdb(v.0.4.0), rlang(v.1.1.3), utf8(v.1.2.4), xfun(v.0.43), bit64(v.4.0.5), cli(v.3.6.2), magrittr(v.2.0.3), digest(v.0.6.35), grid(v.4.3.3), vroom(v.1.6.5), rstudioapi(v.0.16.0), quantreg(v.5.97), hms(v.1.1.3), lifecycle(v.1.0.4), sysfonts(v.0.8.9), vctrs(v.0.6.5), SparseM(v.1.81), evaluate(v.0.23), glue(v.1.7.0), survival(v.3.5-8), fansi(v.1.0.6), rmarkdown(v.2.26), tools(v.4.3.3), pkgconfig(v.2.0.3) and htmltools(v.0.5.8.1)
Statistical Hazard Modelling - A Very Brief Introduction - Zak Varty