Privacy by Design
in the machine learning pipeline


Workshop, University of Venice

Zak Varty

2024-11-01

Hello! Who am I?

  • Statistician by training, got to where I am through medical, environmental and industrial applications.

  • Teaching Fellow at Imperial College London

    • Data Science, Data Ethics
  • Privacy, fairness and explainability in ML.

    • Capable enthusiast, realist / pessimist

Really it all comes down to doing good statistics well.


Why do we care about privacy?

  1. Protecting Sensitive Data: ML models often train on personal or confidential data (health records, financial info), and safeguarding this data is essential to prevent misuse or unauthorized access.

  2. Compliance with Regulations: Laws like GDPR require organisations to protect user privacy, impacting how data is collected, stored, and used in ML.

  3. Preventing Data Leakage: Models can unintentionally expose sensitive information from their training data, risking user privacy if someone exploits the model’s outputs.

  4. Building Trust: Privacy-conscious ML practices foster trust among users, making them more willing to share data and participate in systems that use ML.

  5. Avoiding Discrimination: Privacy techniques can reduce bias and discrimination risks, ensuring the ML model treats users fairly without targeting sensitive attributes.

“If you have nothing to hide, you have nothing to fear”

“The benefits outweigh the costs”

Machine Learning Pipeline

Machine Learning Pipeline Life Cycle

Reality is much messier. See Gelman and Loken (2013) for a discussion of the implications.

This workshop

  • Work through the ML life cycle
  • How things could go wrong, and how they have gone wrong in practice


  • What tools do we have?
  • What are their limitations?


  • Interactive bits, every now and then

1. Data Collection

Collecting Appropriate Data - GDPR

  • Collect only necessary data, purpose clear at time of collection
  • Data subject has right to
    • access
    • rectify
    • erase
    • object
  • Consequences both financial and reputational

Collecting Data - Hard-to-Reach Groups

Standard ML assumes that data are cheap and easy to collect: out-of-the-box model fitting assumes we are working with big, representative and independent samples.

  • ML is increasingly applied in the social sciences to study hard-to-reach populations: persecuted groups, stigmatised behaviours.

  • Standard study designs and analysis techniques will fail.

Snowball Sampling - Animation (Hidden Network; Initial Recruitment; Referral Rounds 1-3)

Snowball Sampling

By using a subject-driven sampling design, we can better explore the hard-to-reach target population while preserving the privacy of data subjects who do not want to be included in the study; a toy simulation follows the list below.

  • Bonus: Also allows us to study community (network) structure if we are interested in that.
  • Drawbacks:
    • “isolated” nodes,
    • partitioned graphs,
    • adapting model fitting to non-uniform sampling.
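
As a toy illustration (not part of the original workshop materials), the R sketch below simulates a few referral rounds over a hypothetical random network; the network size, number of seeds and referral probability are all made-up values.

    # Toy snowball sample over a hypothetical hidden network (illustrative only).
    set.seed(42)

    n <- 30
    # Random symmetric adjacency matrix standing in for the hidden network.
    adj <- matrix(rbinom(n * n, size = 1, prob = 0.1), nrow = n)
    adj <- (adj | t(adj)) * 1
    diag(adj) <- 0

    sampled <- sample(n, size = 3)            # initial recruits (seeds)
    wave    <- sampled

    for (round in 1:3) {
      # Contacts of the current wave who are not yet in the sample ...
      contacts <- which(colSums(adj[wave, , drop = FALSE]) > 0)
      contacts <- setdiff(contacts, sampled)
      # ... each referred contact decides whether to participate.
      recruits <- contacts[runif(length(contacts)) < 0.6]
      sampled  <- c(sampled, recruits)
      wave     <- recruits
      cat("Round", round, ": sample size =", length(sampled), "\n")
    }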

Collecting Data - Asking Difficult Questions

Even if data subjects are easy to access and sample from, they may not wish to answer honestly.

Can you give me some examples?

  • cheated on an exam?
  • cheated on a romantic / sexual partner?
  • experienced suicidal ideation?
  • killed another person?

Dishonest answers will bias any subsequent analysis, leading us to underestimate the prevalence of an “undesirable outcome”.

(Interesting intersection with survey design and psychology. The order and way that you ask questions can influence responses but we will focus on a single question here.)

Direct Response Survey

Let \(Y_i = 1\) if subject \(i\) answers “Yes” and \(Y_i = 0\) otherwise, with

\[\Pr(Y_i = 1) = \theta \quad \text{and} \quad \Pr(Y_i = 0) = 1 - \theta.\]

Method of Moments Estimator (general dataset):

\[ \hat \Theta = \hat\Pr(Y_i = 1)\]

\[ \hat \Theta = \frac{1}{n}\sum_{i=1}^n \mathbb{I}(Y_i = 1) = \bar Y = \frac{\#\{yes\}}{\#\{subjects\}}.\]

Method of Moments Estimate (specific dataset):

\[ \hat \theta = \frac{1}{n}\sum_{i=1}^n \mathbb{I}(y_i = 1) = \bar y.\]
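
A minimal R sketch of this estimator on simulated direct responses; the sample size and true value of \(\theta\) are illustrative assumptions.

    # Method of moments estimate from simulated direct responses.
    set.seed(1)
    theta_true <- 0.3                                   # hypothetical true prevalence
    y <- rbinom(n = 100, size = 1, prob = theta_true)   # 1 = "Yes", 0 = "No"
    theta_hat <- mean(y)                                # proportion answering "Yes"
    theta_hat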

MoM Example

Suppose I ask 100 people whether they have ever been unfaithful in a romantic relationship and 24 people respond “Yes”.


What is your best guess of the proportion of all people who have been unfaithful?

\(\hat\theta = \bar y = \frac{24}{100}\)


How confident are you about that guess?

Would that change if I had 1 person responding “Yes”?

Would that change if I had 99 people responding “Yes”?

MoM - Nice Properties

Over lots of samples we get it right on average:

\(\mathbb{E}_Y[\hat\Theta] = \mathbb{E}_Y\left[\frac{1}{n}\sum_{i=1}^n \mathbb{I}(Y_i = 1)\right] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}_Y\left[ \mathbb{I}(Y_i = 1)\right] = \frac{n \theta}{n} = \theta\)

As the number of samples grows, the variance shrinks to zero, so we become more confident and recover the truth:

\[\begin{align*} \text{Var}_Y[\hat\Theta] &= \text{Var}_Y\left[\frac{1}{n}\sum_{i=1}^n \mathbb{I}(Y_i = 1)\right] \\ &= \frac{1}{n^2}\sum_{i=1}^n\text{Var}_Y\left[\mathbb{I}(Y_i = 1)\right] \\ &= \frac{1}{n^2}\sum_{i=1}^n \theta(1-\theta) \\ &= \frac{n \theta (1-\theta)}{n^2} \\ &= \frac{\theta (1-\theta)}{n} \rightarrow 0 \end{align*}\]
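
Both properties can be checked by simulation; in this minimal R sketch the values of \(\theta\) and \(n\) are illustrative assumptions.

    # Repeat the survey many times and inspect the estimator's behaviour.
    set.seed(2)
    theta <- 0.3
    n     <- 100

    theta_hats <- replicate(10000, mean(rbinom(n, size = 1, prob = theta)))

    mean(theta_hats)           # close to theta (unbiased)
    var(theta_hats)            # close to theta * (1 - theta) / n
    theta * (1 - theta) / n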

Adding Privacy - Randomised Response

  • Mathematically nice but in reality people lie.

  • Our estimator worked “best” for central values of \(\theta\), unlikely for stigmatised events.

  • Add random element to survey to provide plausible deniability.

    • Flip a fair coin. If heads, switch your answer.
  • MoM estimation: Equate probabilities and proportions.

Estimation from Randomised Response Data

Consider using a weighted coin that gives probability \(p\) of telling the truth. Derive an expression for the probability of answering “Yes”.

\[\begin{align*} \Pr(\text{Yes}) &= \theta p + (1 - \theta)(1 - p) \\ & \approx \frac{\#\{yes\}}{\#\{subjects\}} \\ &= \bar y \end{align*}\]

Rearrange this expression to get a formula for \(\hat \theta\).

\[\hat \theta = \frac{\bar y - 1 + p}{2p -1}.\]
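
A sketch in R that simulates the randomised response mechanism and applies this estimator; the prevalence \(\theta\), truth probability \(p\) and sample size are illustrative assumptions.

    # Simulate a randomised response survey and recover theta.
    set.seed(3)
    theta <- 0.3     # hypothetical true prevalence
    p     <- 0.8     # probability the respondent answers truthfully
    n     <- 1000

    truth    <- rbinom(n, size = 1, prob = theta)        # true status
    truthful <- rbinom(n, size = 1, prob = p)            # coin flip per subject
    response <- ifelse(truthful == 1, truth, 1 - truth)  # reported answer

    y_bar     <- mean(response)
    theta_hat <- (y_bar - 1 + p) / (2 * p - 1)
    theta_hat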

Randomised Response Activities

\[ \hat \theta = \frac{\bar y - 1 + p}{2p -1}.\]

  • Direct response is a special case of randomised response. How can we use that to check our working?
  • If our previous survey results came from this randomised response survey design with \(p = 0.25\), what is your best guess of the proportion of people who have been unfaithful?
  • Are you more or less confident in this estimate than previously, when we had the same data but from a direct response design?
  • What does your confidence depend on? And which of those factors do you have knowledge of / control over?
  • When would this estimation procedure break and why? How could we fix that?

Randomised Response - Privacy Schematic

Randomised Response

  • Approach to privacy for single, binary response.

  • Issues with applying it to multiple questions, e.g. surveys with follow-on questions.

  • Extensions to categorical and continuous responses and predictors

  • General principle of adding noise

    • Need to make observations indistinguishable
    • Need to preserve important aspects of “signal”

Data Collection Summary

  • Collect what you need and use that information only for its intended purpose.

  • Targeting hard-to-reach populations can be challenging but possible by combining survey design and specific learning approaches. Keeps statisticians in a job!

  • Asking difficult questions can lead to biased responses. Plausible deniability through randomised response designs can help.

2. Data Storage

Encryption

Once we have gone to the effort of collecting data we don’t want to just leave it lying around for anyone to access.


\[ \text{Plain text} \overset{f}{\rightarrow} \text{Cipher Text}\]
\[ \text{Cipher text} \overset{f^{-1}}{\rightarrow} \text{Plain Text}\]
\[ f(\text{data}, \text{key})\]

There are many encryption schemes, depending on the data to be encrypted and how the key is to be distributed.

Caesar Cipher

Have a go at decrypting the encrypted message f(?, 2 ) = JGNNQ YQTNF.

What are some benefits and drawbacks of this encryption scheme?
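
As a hint of how this could be automated, here is a small R sketch of a Caesar cipher for upper-case letters; the function name and interface are my own choices rather than anything from the workshop.

    # Caesar cipher on upper-case letters; other characters pass through unchanged.
    caesar <- function(text, shift) {
      chars <- strsplit(text, "")[[1]]
      idx   <- match(chars, LETTERS)                 # NA for spaces, punctuation, ...
      out   <- ifelse(is.na(idx), chars,
                      LETTERS[(idx - 1 + shift) %% 26 + 1])
      paste(out, collapse = "")
    }

    caesar("JGNNQ YQTNF", shift = -2)   # decrypt by shifting back by the key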

K-anonymity

What happens if someone gets access to the data?


  • \(k\)-anonymity is a measure of privacy within a dataset.

  • Given a set of predictor-outcome responses, each unique combination forms an equivalence class.

  • The smallest equivalence class of a \(k\)-anonymous dataset is of size \(k\).

  • Equivalently, each individual is indistinguishable from at least \(k-1\) others.
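
A sketch of how the anonymity level could be checked in R using dplyr; the toy dataset and the choice of which columns to treat as quasi-identifiers are assumptions for illustration.

    # k is the size of the smallest equivalence class over the quasi-identifiers.
    library(dplyr)

    released <- data.frame(
      age_band = c("20-29", "20-29", "30-39", "30-39", "30-39", "40-49"),
      postcode = c("SW7",   "SW7",   "SW7",   "SW7",   "SW7",   "SW7"),
      outcome  = c("A",     "B",     "A",     "A",     "B",     "A")
    )

    released |>
      count(age_band, postcode) |>   # equivalence classes over the quasi-identifiers
      summarise(k = min(n))          # here k = 1: one record is uniquely identifiable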

K-anonymity Example

K-anonymity Worked Example

K-anonymity Your Turn

I asked ChatGPT to generate 4-anonymous datasets but it hasn’t done a good job.

  • Establish the true value of k for your dataset.

  • Use pseudonymisation, aggregation, redaction and partial-redaction to make your dataset 4-anonymous.

K-anonymity Feedback

  • How did ChatGPT do?
  • How did you alter the anonymity level of your dataset?
  • What did you have to consider as you were doing this?

K-anonymity Drawbacks

What do you think some of the limitations of \(k\)-anonymity might be?

  • Knowing what is important before analysis.
  • Publishing multiple versions of the dataset.
  • Easy to check, but hard to achieve algorithmically.
  • External data attacks; Jane Doe, Latanya Sweeney

Data Storage - Summary

  • Don’t leave important data lying around unprotected.

  • Choose a level of security appropriate to the sensitivity of the data.

  • Consider the consequences of someone gaining access to the data.

  • Remember that your data does not live in isolation.

  • K-anonymity is not a strong measure of privacy, but it is an accessible starting point.

3. Data Analytics

Federated Computation and Analytics

  • Data can be vulnerable while in transport

  • Might be too risky to send individual data items

  • Summary Statistics are often sufficient

  • These can be transmitted from individual data centres (clients) to a central coordinator (server), e.g. hospitals reporting to a local authority; a toy sketch follows below.
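
A toy sketch in R of a federated mean, with hypothetical client data: each client transmits only a local sum and count, and the server combines them.

    # Federated mean: clients share summary statistics, never raw data.
    client_data <- list(
      hospital_A = c(5.2, 6.1, 4.8),   # raw data never leaves the client
      hospital_B = c(7.0, 6.5),
      hospital_C = c(5.9, 6.2, 6.8, 7.1)
    )

    # Each client transmits only (sum, n).
    summaries <- lapply(client_data, function(x) c(sum = sum(x), n = length(x)))

    # The server aggregates the summaries into the overall mean.
    totals <- Reduce(`+`, summaries)
    totals["sum"] / totals["n"]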

What is Federation?


Federated data is decentralised - it is not all stored in one place. The same principle extends across the pipeline:

  • Federated Computing
  • Federated Analytics
  • Federated Learning
  • Federated Validation

Two common examples where data remains with the client:




  • Mobile data (Text suggestions)
  • Medical data (Healthcare Monitoring)

Federation Networks


We can formalise how information sources are connected as a graph structure known as an empirical graph.


Nodes represent users or data sources, while edges represent data sharing capabilities.

Centralised Federation

Clustered Federated Learning

Personalised Federated Learning

Directions of Federation


Horizontal Federation: clients hold the same variables for different individuals.

Vertical Federation: clients hold different variables for the same individuals.

Why bother with Federated Learning?



✅ Easy to explain


✅ No loss of information


✅ Lower computational costs



❌ Client or individual information still vulnerable


❌ Combining local information is hard


❌ Exacerbates existing modelling issues

Encryption

Pass data \(x\) through some function \(E\) with inverse \(E^{-1}=D\):

  • Inversion easy with some extra information available to trusted individuals
  • Inversion very difficult otherwise.

Pros

  • Data security when in storage or transit.
  • Reduce the costs associated with a potential data breach.

Cons

  • Can’t compute on \(x\) without first decrypting.

Malleability

\[f(E[x]) = E[f(x)]\]

This need not literally be the same function: \[g(E[x]) = E[f(x)].\]

  • Predictable modification without decryption (+/-)
  • Outsource computation without compromising security / privacy
    • company can keep model private from hosting service
    • customer can keep data private from host and provider
    • still vulnerable to repeated query attacks.
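
To make the malleability property concrete, here is a deliberately toy additive scheme in R (not a real cipher): a ciphertext is modified in a predictable way without knowledge of the key.

    # Toy malleable "encryption": E(x) = (x + key) mod m, D(c) = (c - key) mod m.
    m <- 1000
    E <- function(x, key) (x + key) %% m
    D <- function(c, key) (c - key) %% m

    key <- 347                # known only to the data owner
    x   <- 42

    c1 <- E(x, key)           # ciphertext handed to an untrusted party
    c2 <- (c1 + 10) %% m      # the untrusted party adds 10 without the key ...
    D(c2, key)                # ... and the owner decrypts x + 10 = 52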

Fully Homomorphic Encryption

(Fully) Homomorphic encryption is an emerging technology in private ML.


Homomorphic encryption schemes allow \(f\) to be addition:

\[E[x] \oplus E[y]= E[x + y].\]

Fully homomorphic encryption schemes additionally allow \(f\) to be multiplication:

\[E[x] \otimes E[y]= E[x \times y].\]

This allows us to construct polynomial approximations to ML models.

  • How closely can our model be approximated by a polynomial?
  • Practical issues with imperfect data storage and large number of compositions.
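
To illustrate the first question, the sketch below fits a low-degree polynomial to the logistic (sigmoid) function over a bounded range, as one might do before evaluating a logistic regression under homomorphic encryption; the degree and range are illustrative choices.

    # How well does a cubic polynomial mimic the sigmoid on [-4, 4]?
    sigmoid <- function(x) 1 / (1 + exp(-x))

    x_grid <- seq(-4, 4, length.out = 200)
    fit    <- lm(sigmoid(x_grid) ~ poly(x_grid, degree = 3, raw = TRUE))

    max(abs(predict(fit) - sigmoid(x_grid)))   # worst-case approximation error on the grid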

4. Modelling

Privacy through modelling constraints

The core idea of differential privacy is that we want to ensure that no individual data point influences the model “too much”.

Strategies:

  • Add noise to individual data entries or summary statistics, similar to robust regression (median modelling, heavy-tailed noise); a sketch of this strategy follows below.

  • Combine goodness of fit and sensitivity penalty in loss function. Similar to penalised regression.

\[ L(\theta, x; \lambda) = \underset{\substack{\text{likelihood /} \\ \text{model fit}}}{\underbrace{\ell(\theta, x)}} - \lambda \underset{\substack{\text{sensitivity} \\ \text{penalty}}}{\underbrace{h(\theta, x)}}\]
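
As a sketch of the first strategy, the code below releases a mean with Laplace noise; the privacy budget \(\varepsilon\), the data range and the sample itself are illustrative assumptions.

    # Laplace mechanism: noisy release of a sample mean.
    set.seed(4)
    x <- runif(200, min = 0, max = 1)     # data assumed to lie in [0, 1]

    epsilon     <- 1                      # hypothetical privacy budget
    sensitivity <- 1 / length(x)          # changing one record moves the mean by at most 1/n

    # Laplace(0, b) noise generated as the difference of two exponentials.
    b     <- sensitivity / epsilon
    noise <- rexp(1, rate = 1 / b) - rexp(1, rate = 1 / b)

    mean(x) + noise                       # privatised summary statistic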

Federated Learning

Similar idea to federated analytics, but communicating, for example, the gradient of the loss function evaluated on local data.


  • This means that the data stay with the user; only partial model updates are transmitted.

  • Strong links to distributed computing and this is how Apple (and others) collect performance data from phones.

Have to take care with non-responses so as not to bias the model.
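
A toy sketch of one round of federated gradient averaging for a linear model, with made-up client data: only local gradients are transmitted, never the data themselves.

    # One round of federated gradient averaging (toy linear regression).
    set.seed(5)
    beta <- c(0, 0)                              # shared model: intercept and slope

    make_client <- function(n) {
      x <- rnorm(n)
      y <- 1 + 2 * x + rnorm(n, sd = 0.5)        # hypothetical local data
      list(x = x, y = y)
    }
    clients <- list(make_client(50), make_client(80), make_client(30))

    local_gradient <- function(client, beta) {
      X <- cbind(1, client$x)
      # Gradient of the (scaled) squared-error loss on this client's data.
      as.vector(t(X) %*% (X %*% beta - client$y)) / length(client$y)
    }

    # Clients send only gradients; the server averages them and updates the model.
    grads <- lapply(clients, local_gradient, beta = beta)
    beta  <- beta - 0.1 * Reduce(`+`, grads) / length(grads)
    beta

In practice the averaging would typically be weighted by client sample size and iterated over many rounds; this only shows what is (and is not) transmitted.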

5. Going into Production

Privacy of the model

Putting a model into production exposes it to a whole range of adversarial circumstances. Two of the most common are:

  • Attacks aiming to recover model structure
    • to reproduce the model
    • to exploit weaknesses
  • Attacks aiming to recover individual training data
  • Examples would be financial decision making and LLMs.

Rigaki, M. & Garcia, S. (2020). “A Survey of Privacy Attacks in Machine Learning”. ArXiv preprint.

Wrapping up

Summary

Three things to remember from this workshop


  1. Privacy can become compromised at all stages of the ML pipeline.


  2. Core methods rely on some combination of localisation, encryption and noise.


  3. As in life, you can’t do anything useful without risk, but you can minimise those risks.

Learning More

Build Information

R version 4.3.3 (2024-02-29)

Platform: x86_64-apple-darwin20 (64-bit)

locale: en_US.UTF-8||en_US.UTF-8||en_US.UTF-8||C||en_US.UTF-8||en_US.UTF-8

attached base packages: stats, graphics, grDevices, utils, datasets, methods and base

loaded via a namespace (and not attached): gtable(v.0.3.5), jsonlite(v.1.8.8), dplyr(v.1.1.4), compiler(v.4.3.3), crayon(v.1.5.3), Rcpp(v.1.0.12), tidyselect(v.1.2.1), stringr(v.1.5.1), showtext(v.0.9-7), zvplot(v.0.0.0.9000), assertthat(v.0.2.1), scales(v.1.3.0), png(v.0.1-8), yaml(v.2.3.8), fastmap(v.1.1.1), ggplot2(v.3.5.1), R6(v.2.5.1), generics(v.0.1.3), showtextdb(v.3.0), knitr(v.1.45), tibble(v.3.2.1), pander(v.0.6.5), munsell(v.0.5.1), lubridate(v.1.9.3), pillar(v.1.9.0), rlang(v.1.1.4), utf8(v.1.2.4), stringi(v.1.8.4), xfun(v.0.43), timechange(v.0.3.0), cli(v.3.6.3), magrittr(v.2.0.3), digest(v.0.6.35), grid(v.4.3.3), rstudioapi(v.0.16.0), lifecycle(v.1.0.4), sysfonts(v.0.8.9), vctrs(v.0.6.5), evaluate(v.0.23), glue(v.1.8.0), emo(v.0.0.0.9000), fansi(v.1.0.6), colorspace(v.2.1-1), rmarkdown(v.2.26), purrr(v.1.0.2), jpeg(v.0.1-10), tools(v.4.3.3), pkgconfig(v.2.0.3) and htmltools(v.0.5.8.1)

References

Gelman, Andrew, and Eric Loken. 2013. “The Garden of Forking Paths: Why Multiple Comparisons Can Be a Problem, Even When There Is No ‘Fishing Expedition’ or ‘p-Hacking’ and the Research Hypothesis Was Posited Ahead of Time.” Department of Statistics, Columbia University 348 (1-17): 3.