Workshop, University of Venice
2024-11-01
Statistician by training, got to where I am through medical, environmental and industrial applications.
Teaching Fellow at Imperial College London
Privacy, fairness and explainability in ML.
Really it all comes down to doing good statistics well.
Protecting Sensitive Data: ML models often train on personal or confidential data (health records, financial info), and safeguarding this data is essential to prevent misuse or unauthorized access.
Compliance with Regulations: Laws like GDPR require organisations to protect user privacy, impacting how data is collected, stored, and used in ML.
Preventing Data Leakage: Models can unintentionally expose sensitive information from their training data, risking user privacy if someone exploits the model’s outputs.
Building Trust: Privacy-conscious ML practices foster trust among users, making them more willing to share data and participate in systems that use ML.
Avoiding Discrimination: Privacy techniques can reduce bias and discrimination risks, ensuring the ML model treats users fairly without targeting sensitive attributes.
Reality is much messier. See Gelman and Loken (2013) for a discussion of the implications.
Standard ML assumes that data are cheap and easy to collect.
Out-of-the-box model fitting assumes we are working with big, representative and independent samples.
Applications of ML to social science to study hard-to-reach populations: persecuted groups, stigmatised behaviours.
Standard study designs and analysis techniques will fail.
By using subject-driven sampling designs, we can better explore hard-to-reach target populations while preserving the privacy of data subjects who do not want to be included in the study.
Even if data subjects are easy to access and sample from, they may not wish to answer honestly.
Can you give me some examples?
Dishonest answers will bias any subsequent analysis, leading us to underestimate the prevalence of an “undesirable outcome”.
(Interesting intersection with survey design and psychology. The order and way that you ask questions can influence responses but we will focus on a single question here.)
\[\Pr(Y_i = 1) = \theta \quad \text{and} \quad \Pr(Y_i = 0) = 1 - \theta.\]
Method of Moments Estimator: (General dataset)
\[ \hat \Theta = \hat\Pr(Y_i = 1)\]
\[ \hat \Theta = \frac{1}{n}\sum_{i=1}^n \mathbb{I}(Y_i = 1) = \bar Y = \frac{\#\{yes\}}{\#\{subjects\}}.\]
Method of Moments Estimate: (Specific dataset)
\[ \hat \theta = \frac{1}{n}\sum_{i=1}^n \mathbb{I}(y_i = 1) = \bar y.\]
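As a minimal sketch in R (using a made-up vector of binary responses), the method of moments estimate is just the sample proportion of “Yes” answers:

```r
# Hypothetical vector of binary survey responses (1 = "Yes", 0 = "No").
y <- c(1, 0, 0, 1, 0, 1, 0, 0, 0, 1)

# Method of moments estimate: the sample proportion of "Yes" answers.
theta_hat <- mean(y)
theta_hat
```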
Suppose I ask 100 people whether they have ever been unfaithful in a romantic relationship and 24 people respond “Yes”.
What is your best guess of the proportion of all people who have been unfaithful?
\(\hat\theta = \bar y = \frac{24}{100}\)
How confident are you about that guess?
Would that change if I had 1 person responding “Yes”?
Would that change if I had 99 people responding “Yes”?
Over lots of samples we get it right on average:
\(\mathbb{E}_Y[\hat\Theta] = \mathbb{E}_Y\left[\frac{1}{n}\sum_{i=1}^n \mathbb{I}(Y_i = 1)\right] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}_Y\left[ \mathbb{I}(Y_i = 1)\right] = \frac{n \theta}{n} = \theta\)
As the number of samples gets large, we become more confident and therefore recover the truth:
\[\begin{align*} \text{Var}_Y[\hat\Theta] &= \text{Var}_Y\left[\frac{1}{n}\sum_{i=1}^n \mathbb{I}(Y_i = 1)\right] \\ &= \frac{1}{n^2}\sum_{i=1}^n\text{Var}_Y\left[\mathbb{I}(Y_i = 1)\right] \\ &= \frac{1}{n^2}\sum_{i=1}^n \theta(1-\theta) \\ &= \frac{n \theta (1-\theta)}{n^2} \\ &= \frac{\theta (1-\theta)}{n} \rightarrow 0 \end{align*}\]
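A quick simulation sketch (assuming a true value \(\theta = 0.24\), chosen only for illustration) confirms both properties: the estimator is unbiased and its variance shrinks like \(\theta(1-\theta)/n\).

```r
set.seed(1234)

theta  <- 0.24   # assumed true proportion for the simulation
n_reps <- 10000  # number of repeated surveys

# Sampling distribution of the estimator for two sample sizes.
theta_hat_100  <- replicate(n_reps, mean(rbinom(n = 100,  size = 1, prob = theta)))
theta_hat_1000 <- replicate(n_reps, mean(rbinom(n = 1000, size = 1, prob = theta)))

# Unbiased: both averages should be close to theta.
mean(theta_hat_100); mean(theta_hat_1000)

# Variance shrinks like theta * (1 - theta) / n.
var(theta_hat_100);  theta * (1 - theta) / 100
var(theta_hat_1000); theta * (1 - theta) / 1000
```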
Mathematically nice but in reality people lie.
Our estimator works “best” for central values of \(\theta\), which are unlikely for stigmatised events.
Add random element to survey to provide plausible deniability.
MoM estimation: Equate probabilities and proportions.
Consider using a weighted coin: with probability \(p\) the respondent answers truthfully, and with probability \(1 - p\) they give the opposite answer. Derive an expression for the probability of answering “Yes”.
\[\begin{align*} \Pr(\text{Yes}) &= \theta p + (1 - \theta)(1 - p) \\ & \approx \frac{\#\{yes\}}{\#\{subjects\}} \\ &= \bar y \end{align*}\]
Rearrange this expression to get a formula for \(\hat \theta\).
\[\hat \theta = \frac{\bar y - 1 + p}{2p -1}.\]
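A simulation sketch of the randomised response design (assuming \(\theta = 0.24\) and a coin with \(p = 0.8\), both made up for illustration) shows the adjusted estimator recovering the truth where the raw proportion of “Yes” answers does not:

```r
set.seed(2024)

theta <- 0.24  # assumed true prevalence
p     <- 0.80  # probability the respondent answers truthfully
n     <- 1e5   # large sample, to see the estimators settle down

true_status <- rbinom(n, size = 1, prob = theta)  # sensitive trait
tell_truth  <- rbinom(n, size = 1, prob = p)      # outcome of each coin flip

# Respondents answer truthfully on "heads", give the opposite answer on "tails".
answer <- ifelse(tell_truth == 1, true_status, 1 - true_status)

y_bar     <- mean(answer)                   # raw proportion of "Yes" (biased for theta)
theta_hat <- (y_bar - 1 + p) / (2 * p - 1)  # randomised response estimator

c(raw = y_bar, adjusted = theta_hat)
```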
Approach to privacy for single, binary response.
Issues with applying to multiple questions, e.g. surveys with follow-on questions.
Extensions to categorical and continuous responses and predictors
General principle of adding noise
Collect what you need and use that information only for its intended purpose.
Targeting hard-to-reach populations can be challenging but possible by combining survey design and specific learning approaches. Keeps statisticians in a job!
Asking difficult questions can lead to biased responses. Plausible deniability through randomised response designs can help.
Once we have gone to the effort of collecting data we don’t want to just leave it lying around for anyone to access.
\[ \text{Plain text} \overset{f}{\rightarrow} \text{Cipher Text}\]
\[ \text{Cipher text} \overset{f^{-1}}{\rightarrow} \text{Plain Text}\]
\[ f(\text{data}, \text{key})\] Many encryption schemes depending on the data to be encrypted and how the key is to be distributed.
Have a go at decrypting the encrypted message f(?, 2) = JGNNQ YQTNF.
What are some benefits and drawbacks of this encryption scheme?
What happens if someone gets access to the data?
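A sketch of this kind of shift cipher in R, where the key is simply the size of the shift (illustrative only; a scheme like this should never be used for real security):

```r
# Shift each letter of an upper-case message forward by `key` places in the
# alphabet, wrapping around from Z back to A. Non-letters are left unchanged.
shift_cipher <- function(message, key) {
  chars <- strsplit(message, split = "")[[1]]
  idx <- match(chars, LETTERS)                  # NA for spaces and punctuation
  shifted <- LETTERS[((idx - 1 + key) %% 26) + 1]
  chars[!is.na(idx)] <- shifted[!is.na(idx)]
  paste(chars, collapse = "")
}

# Decryption is just the inverse shift.
shift_decipher <- function(cipher_text, key) shift_cipher(cipher_text, -key)

cipher_text <- shift_cipher("ATTACK AT DAWN", key = 2)
cipher_text
shift_decipher(cipher_text, key = 2)
```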
\(k\)-anonymity is a measure of privacy within a dataset.
Given a set of predictor-outcome responses, each unique combination forms an equivalence class.
The smallest equivalence class of a \(k\)-anonymous dataset is of size \(k\).
Equivalently, each individual is indistinguishable from at least \(k-1\) others.
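One way to establish \(k\) for a dataset, sketched with dplyr and an entirely hypothetical dataset (the column names and values are made up): count each unique combination of recorded variables and take the size of the smallest equivalence class.

```r
library(dplyr)

# Hypothetical dataset of survey responses.
survey_data <- data.frame(
  age_band = c("18-30", "18-30", "18-30", "31-50", "31-50", "51+"),
  postcode = c("SW7",   "SW7",   "SW7",   "W12",   "W12",   "W12"),
  outcome  = c(1, 0, 1, 0, 1, 0)
)

# Each unique combination forms an equivalence class;
# k is the size of the smallest class.
survey_data %>%
  count(age_band, postcode, outcome) %>%
  summarise(k = min(n))
```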
I asked ChatGPT to generate 4-anonymous datasets but it hasn’t done a good job.
Establish the true value of k for your dataset.
Use pseudonymisation, aggregation, redaction and partial-redaction to make your dataset 4-anonymous.
What do you think some of the limitations of \(k\)-anonymity might be?
Don’t leave important data lying around unprotected.
Choose a level of security appropriate to the sensitivity of the data.
Consider the consequences of someone gaining access to the data.
Remember that your data does not live in isolation.
\(k\)-anonymity is not a good measure of privacy, but it is an accessible starting point.
Data can be vulnerable while in transit
Might be too risky to send individual data items
Summary statistics are often sufficient
These can be transmitted from individual data centres (clients), e.g. hospitals, to a central aggregator (server), e.g. a local authority.
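A minimal sketch of that idea, with made-up numbers: each hospital (client) reports only a sum and a count, and the server combines these into a pooled mean without ever seeing individual records.

```r
# Hypothetical patient-level data held locally at three hospitals (clients).
hospital_a <- c(5.1, 6.3, 4.8, 5.9)
hospital_b <- c(7.2, 6.8, 7.5)
hospital_c <- c(4.4, 5.0, 4.9, 5.3, 4.7)

# Each client transmits only summary statistics, not individual values.
client_summaries <- list(
  a = c(sum = sum(hospital_a), n = length(hospital_a)),
  b = c(sum = sum(hospital_b), n = length(hospital_b)),
  c = c(sum = sum(hospital_c), n = length(hospital_c))
)

# The server combines the summaries into a global mean.
total_sum <- sum(sapply(client_summaries, function(s) s["sum"]))
total_n   <- sum(sapply(client_summaries, function(s) s["n"]))
total_sum / total_n
```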
Federated data is decentralised - it is not all stored in one place.
Federated Computing
Federated Analytics
Federated Learning
Federated Validation
Two common examples where the data remains with the client:
We can formalise how information sources are connected as a graph structure known as an empirical graph.
Nodes represent users or data sources, while edges represent data sharing capabilities.
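One way to represent such an empirical graph in R, using the igraph package and a hypothetical set of clients and sharing links:

```r
library(igraph)

# Hypothetical data-sharing links between five clients (hospitals).
sharing_links <- data.frame(
  from = c("hospital_1", "hospital_1", "hospital_2", "hospital_3"),
  to   = c("hospital_2", "hospital_3", "hospital_4", "hospital_5")
)

# Nodes are data sources; edges are data-sharing capabilities.
empirical_graph <- graph_from_data_frame(sharing_links, directed = FALSE)

# Inspect which clients each node can exchange information with.
degree(empirical_graph)
neighbors(empirical_graph, "hospital_1")
```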
✅ Easy to explain
✅ No loss of information
✅ Lower computational costs
❌ Client or individual information still vulnerable
❌ Combining local information is hard
❌ Exacerbates existing modelling issues
Pass data \(x\) through some function \(E\) with inverse \(E^{-1}=D\):
\[f(E[x]) = E[f(x)]\]
This need not literally be the same function: \[g(E[x]) = E[f(x)].\]
(Fully) Homomorphic encryption is an emerging technology in private ML.
Homomorphic encryption schemes allow \(f\) to be addition:
\[E[x] \oplus E[y]= E[x + y].\]
Fully homomorphic encryption schemes additionally allow \(f\) to be multiplication:
\[E[x] \otimes E[y]= E[x \times y].\]
This allows us to construct polynomial approximations to ML models.
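Textbook RSA (shown here with tiny, completely insecure parameters) is a simple example of a scheme with the multiplicative property \(E[x] \otimes E[y] = E[x \times y]\); this sketch only illustrates that one property and is nothing like a production fully homomorphic scheme.

```r
# Modular exponentiation by repeated squaring, keeping intermediate values small.
mod_pow <- function(base, exp, mod) {
  result <- 1
  base <- base %% mod
  while (exp > 0) {
    if (exp %% 2 == 1) result <- (result * base) %% mod
    base <- (base * base) %% mod
    exp <- exp %/% 2
  }
  result
}

# Tiny textbook RSA key pair (for illustration only, never for real use).
n <- 61 * 53   # modulus
e <- 17        # public exponent
d <- 2753      # private exponent

encrypt <- function(x) mod_pow(x, e, n)
decrypt <- function(cipher) mod_pow(cipher, d, n)

x <- 4
y <- 7

# Multiply the ciphertexts, then decrypt: we recover x * y without ever
# decrypting x or y individually.
cipher_product <- (encrypt(x) * encrypt(y)) %% n
decrypt(cipher_product)  # 28 = x * y
```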
The core idea of differential privacy is that we want to ensure that no individual data point influences the model “too much”.
Strategies:
Add noise to individual data entries or summary statistics. Similar to robust regression (median modelling, heavy-tailed noise). See the sketch after this list.
Combine goodness of fit and sensitivity penalty in loss function. Similar to penalised regression.
\[ L(\theta, x; \lambda) = \underbrace{\ell(\theta, x)}_{\substack{\text{likelihood /} \\ \text{model fit}}} - \lambda \underbrace{h(\theta, x)}_{\substack{\text{sensitivity} \\ \text{penalty}}}\]
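A sketch of the first strategy: adding calibrated Laplace noise to a summary statistic. Here the values are assumed to lie in \([0, 1]\), so the sensitivity of the mean is \(1/n\); the data, bounds and privacy budget \(\varepsilon\) are all illustrative assumptions.

```r
set.seed(42)

# Draw Laplace(0, scale) noise as the difference of two exponentials.
rlaplace <- function(n, scale) rexp(n, rate = 1 / scale) - rexp(n, rate = 1 / scale)

# Hypothetical sensitive values, assumed to be bounded between 0 and 1.
x <- runif(200)
n <- length(x)

epsilon     <- 1      # privacy budget (smaller = more private, noisier)
sensitivity <- 1 / n  # changing one bounded record moves the mean by at most 1/n

# Laplace mechanism: release the mean plus noise scaled to sensitivity / epsilon.
private_mean <- mean(x) + rlaplace(1, scale = sensitivity / epsilon)

c(true_mean = mean(x), private_mean = private_mean)
```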
Similar idea to federated analytics, but communicating e.g. the gradient of the loss function evaluated on local data.
This means that the data stays with the user; only partial model updates are transmitted (see the sketch below).
Strong links to distributed computing and this is how Apple (and others) collect performance data from phones.
Have to take care with non-responses so as not to bias the model.
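A minimal sketch of this idea for a simple linear model: each client computes the gradient of a squared-error loss on its own data, and only those gradients (never the data) are sent to the server and averaged. The datasets, model and learning rate are all made up for illustration, and the gradients are averaged without weighting by client size for simplicity.

```r
set.seed(99)

# Hypothetical local datasets held by three clients: y = 2 * x + noise.
make_client_data <- function(n) {
  x <- rnorm(n)
  list(x = x, y = 2 * x + rnorm(n, sd = 0.3))
}
clients <- list(make_client_data(50), make_client_data(80), make_client_data(30))

# Gradient of the mean squared error for the model y ~ beta * x,
# computed locally on a single client's data.
local_gradient <- function(client, beta) {
  residual <- client$y - beta * client$x
  -2 * mean(residual * client$x)
}

beta <- 0           # current global parameter held by the server
learning_rate <- 0.1

# Each federated round: clients send gradients, the server averages and updates.
for (round in 1:50) {
  gradients <- sapply(clients, local_gradient, beta = beta)
  beta <- beta - learning_rate * mean(gradients)
}
beta  # should be close to the true slope of 2
```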
Putting a model into production exposes a whole range of adversarial circumstances. Two of the most common would be:
Rigaki, M. & Garcia, S. (2020). “A Survey of Privacy Attacks in Machine Learning”. ArXiv preprint.
Three things to remember from this workshop
R version 4.3.3 (2024-02-29)
Platform: x86_64-apple-darwin20 (64-bit)
locale: en_US.UTF-8||en_US.UTF-8||en_US.UTF-8||C||en_US.UTF-8||en_US.UTF-8
attached base packages: stats, graphics, grDevices, utils, datasets, methods and base
loaded via a namespace (and not attached): gtable(v.0.3.5), jsonlite(v.1.8.8), dplyr(v.1.1.4), compiler(v.4.3.3), crayon(v.1.5.3), Rcpp(v.1.0.12), tidyselect(v.1.2.1), stringr(v.1.5.1), showtext(v.0.9-7), zvplot(v.0.0.0.9000), assertthat(v.0.2.1), scales(v.1.3.0), png(v.0.1-8), yaml(v.2.3.8), fastmap(v.1.1.1), ggplot2(v.3.5.1), R6(v.2.5.1), generics(v.0.1.3), showtextdb(v.3.0), knitr(v.1.45), tibble(v.3.2.1), pander(v.0.6.5), munsell(v.0.5.1), lubridate(v.1.9.3), pillar(v.1.9.0), rlang(v.1.1.4), utf8(v.1.2.4), stringi(v.1.8.4), xfun(v.0.43), timechange(v.0.3.0), cli(v.3.6.3), magrittr(v.2.0.3), digest(v.0.6.35), grid(v.4.3.3), rstudioapi(v.0.16.0), lifecycle(v.1.0.4), sysfonts(v.0.8.9), vctrs(v.0.6.5), evaluate(v.0.23), glue(v.1.8.0), emo(v.0.0.0.9000), fansi(v.1.0.6), colorspace(v.2.1-1), rmarkdown(v.2.26), purrr(v.1.0.2), jpeg(v.0.1-10), tools(v.4.3.3), pkgconfig(v.2.0.3) and htmltools(v.0.5.8.1)
Privacy by Design - November 2024 - Zak Varty