Paul Irolla

Demystifying the Membership Inference Attack

Updated: Nov 13, 2019

Machine Learning (ML) is being integrated into the processes of many technological domains, in particular those where confidential data has to be handled. The wide availability of personal information opens up a broad range of possibilities for artificial intelligence. Here we focus on the medical and financial fields, where applications using confidential data are the most obvious.

Machine learning and deep learning algorithms were originally designed without defense mechanisms against malicious threats. When these algorithms left research labs and companies’ internal processes with the rise of cloud services such as those provided by Google and Amazon, researchers and hackers started to look for ways to subvert their primary use. Thanks to advances in IT security applied to machine learning, these threats are now mapped. Multiple security threats exist, for example adversarial samples [1], adversarial reprogramming [2] and data poisoning [3]. In this article, however, we focus on a specific privacy threat, the Membership Inference Attack (MIA) [4]. Privacy threats also include data extraction [5] and model extraction [6].

The Membership Inference Attack is the process of determining whether or not a sample comes from the training dataset of a trained ML model. We study the case where the attacker has limited access to black-box queries that return the model output, which is often the confidence level for each class in a classification task.

Why it matters: when samples are linked to a person, as with medical or financial data, inferring whether they come from the training dataset of an ML model constitutes a privacy threat. Meta-information could be leaked to the attacker. For instance, suppose a patient participates in a study that aims to set the right difficulty level for a serious game intended for people suffering from Alzheimer’s disease [7]. If this is done with an ML model, an attacker who succeeds in inferring that the patient belongs to the training set knows de facto that this patient suffers from Alzheimer’s disease. This is a leak of confidential information that could be used for targeted action against the person. More generally, this kind of attack on sensitive machine learning services could feed a discrimination process, for instance by automating the inference of characteristics that are supposed to be excluded from decisions such as hiring or granting privileges and subsidies.

The current understanding of the MIA is rather vague, and we seek to demystify it. We investigate the underlying factors that influence the success of the attack. We found that the current MIA technique with shadow models [4] on supervised learning is not effective in the way it is currently understood. We propose a new hypothesis for the success of the MIA that accounts for the results of previous articles and explains several unexpected properties of the MIA that its original assumptions could not explain.

According to the original work of Shokri et al. [4], the MIA relies on the differences between the confidence levels for samples that are part of the training set and those that are not (which we call in and out samples from now on). Previous studies with reproducible experiments showed the effectiveness of the MIA with shadow models. We demonstrate in this article that they actually exploited the lack of generalization of deliberately and heavily overfitted models. As these models have excellent accuracy on their training set and poor accuracy on their test set, in samples are mostly predicted correctly while out samples are often predicted incorrectly.

We show that we can replace the whole MIA process with a simple decision rule: “If the target model predicts the label correctly for a sample, then it comes from the training set, otherwise it does not”. This simple rule achieves an accuracy similar to that of the complex attack method with shadow models. Supported by a detailed statistical study of the confidence levels, we show that the accepted explanation of the MIA’s success is wrong, and we provide a more accurate one. Moreover, we show that the MIA is currently not a serious threat in a production setup.



How does the MIA with shadow models work?

The purpose of the MIA is to determine whether a sample was part of a statistical study or of the training of an ML model. The MIA has been shown to be possible on public releases of statistical summaries [8], by comparing individual sample expressions to the mean of sample expressions with a simple distance or with a likelihood-ratio test. Machine learning is now replacing statistics for classification, prediction and evaluation tasks. The research community has therefore turned to the problem of membership inference on trained ML models.

The way the MIA operates depends on the type of machine learning used. Currently, the MIA can be applied to supervised learning and to Generative Adversarial Networks (GAN) [9] [13]. We focus here exclusively on MIA techniques designed for supervised learning. In supervised learning, an ML algorithm learns the relationship between a set of inputs and an associated set of outputs. The resulting model must be able to predict a coherent output for new input samples. The challenge of supervised learning is to obtain a model that generalizes well.

In the MIA scenario, there are three entities: the target model, the shadow models and the attack models. The target model is trained on an initial dataset.

For each class, the model outputs a confidence level, and the class with the highest confidence is the one the ML model chooses. In the given example, there are 10 categories of objects to classify. This classification problem is known in machine learning as CIFAR-10 [12], a classic benchmark used in research to compare results on a common basis. The hypothesis behind the MIA is that machine learning algorithms retain too much information about their training dataset. This information could leak through the confidence levels associated with each class during predictions. The whole MIA approach relies on the assumption that samples from the training dataset have, on average, a higher confidence value for their actual class than samples not seen in training [4].

Of course, an attacker in black-box conditions cannot perform a statistical study on these confidence levels, because the attacker does not have access to the training dataset. To cope with this limitation, the attacker trains several other models that are comparable to the target model. These are called shadow models. Shadow models can be exact copies of the target model, i.e. have the same architecture and hyperparameters. There is no formal requirement here: shadow models may differ from the target model if the attacker does not have complete information about its characteristics.

Once these models have been trained, we can generate training samples for the attack models (the models that will predict whether a sample is from the training set or not). Each training instance for the attack models consists of the confidence levels as input, along with the label in or out.

We use the confidence levels given by the shadow models to learn to discriminate in samples from out samples. As we control which samples are used in the training of a shadow model and which are not, we can generate instances of confidence levels belonging to the in group and others belonging to the out group. These instances let us learn the differences in confidence between in and out samples, so that we can then infer the membership of a sample from the target model’s output. For each class of the target problem, we train an attack model that learns to discriminate in from out samples. In other words, on CIFAR-10 the attacker uses 10 attack models, one for each class (airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks). The samples produced by the shadow models are dispatched into separate attack datasets according to their real class (label).
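The dispatching step described above can be sketched as follows. This is a minimal illustration, not the implementation used in our experiments: the function `build_attack_datasets`, the stand-in `toy_shadow` model and the toy data are all hypothetical, and a real shadow model would be a trained network returning one confidence level per class.

```python
from collections import defaultdict


def build_attack_datasets(shadow_models, shadow_splits):
    """Group (confidence_vector, membership) records by true class label.

    shadow_models: list of callables mapping features to a list of confidences.
    shadow_splits: one (in_samples, out_samples) pair per shadow model, where
                   each sample is a (features, label) tuple.
    Returns a dict: label -> list of (confidences, membership), with
    membership 1 for in samples and 0 for out samples.
    """
    datasets = defaultdict(list)
    for model, (in_samples, out_samples) in zip(shadow_models, shadow_splits):
        for membership, samples in ((1, in_samples), (0, out_samples)):
            for features, label in samples:
                datasets[label].append((model(features), membership))
    return dict(datasets)


# Hypothetical stand-in for a trained shadow model on a 2-class problem.
def toy_shadow(features):
    return [0.9, 0.1] if features > 0 else [0.2, 0.8]


# One shadow model: two in samples (seen in training) and two out samples.
splits = [([(1.0, 0), (-1.0, 1)], [(2.0, 0), (-3.0, 1)])]
attack_data = build_attack_datasets([toy_shadow], splits)
```

Each key of `attack_data` then feeds the attack model of the corresponding class.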

When this process is finished, the attack datasets are ready and all that remains is to train the attack models. These are simple discriminators (often multilayer perceptrons [14]).

Unlike in the usual machine learning process, we cannot hold out part of the attack training data to evaluate performance. Why? Doing so would measure whether the attack works on the shadow models rather than on the target model, so the performance measures would likely be highly overestimated. This is why, to get proper metrics, we generate evaluation samples by running the target model on samples from its training dataset and on others from its test dataset. Obviously, in real conditions, the attacker cannot obtain these measures, having no access to the initial training dataset.
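To make this evaluation concrete, here is a small sketch of how the mean accuracy of per-class attack models could be computed on records produced by the target model. The function name `mean_attack_accuracy`, the threshold-based attack models and the records are hypothetical placeholders; in practice each attack model would be a trained classifier.

```python
def mean_attack_accuracy(attack_models, target_records):
    """Mean accuracy of per-class attack models on target model outputs.

    attack_models: dict label -> callable(confidences) -> bool (True = "in").
    target_records: list of (confidences, label, is_member) tuples, where the
                    confidences come from the target model.
    """
    per_class = {}
    for label, attack in attack_models.items():
        rows = [(c, m) for c, y, m in target_records if y == label]
        if rows:
            hits = sum(attack(c) == m for c, m in rows)
            per_class[label] = hits / len(rows)
    mean = sum(per_class.values()) / len(per_class)
    return mean, per_class


# Hypothetical attack models: a fixed threshold on the class confidence.
attack_models = {
    0: lambda c: c[0] > 0.8,
    1: lambda c: c[1] > 0.8,
}

# Toy target model outputs: (confidences, true label, is it an in sample?).
records = [
    ([0.90, 0.10], 0, True),
    ([0.60, 0.40], 0, False),
    ([0.10, 0.90], 1, True),
    ([0.05, 0.95], 1, False),
]
mean_acc, per_class_acc = mean_attack_accuracy(attack_models, records)
```

The key point is that `records` must come from the target model, not from the shadow models, otherwise the measure is biased upward.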

Summary of the MIA process


Understanding the factors that influence the MIA

Let’s start by studying the MIA on the standard MNIST dataset. This is the “Hello World!” of machine learning, a very simple problem to solve. We train the target model for 10 epochs on a simple convolutional neural network:

This is PyTorch code that defines a convolutional neural network. For instance, nn.Conv2d(1, 10, 3, 1) defines a 2D convolutional layer that accepts single-channel (grayscale) images as input, has 10 trainable filters of size 3x3, and moves its filters over the image with a stride of 1.
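As a quick sanity check on what such a layer does, the sketch below computes its output size and number of trainable parameters in plain Python. It assumes the PyTorch defaults for the arguments not given in nn.Conv2d(1, 10, 3, 1), namely no padding, no dilation and a bias term, and a 28x28 MNIST input; the helper names are ours.

```python
def conv2d_output_size(size, kernel_size, stride=1, padding=0):
    """Output size of a 2D convolution along one spatial dimension (no dilation)."""
    return (size + 2 * padding - kernel_size) // stride + 1


def conv2d_param_count(in_channels, out_channels, kernel_size, bias=True):
    """Trainable parameters: one square kernel per (input, output) channel pair,
    plus one bias per output channel when bias is enabled."""
    weights = in_channels * out_channels * kernel_size * kernel_size
    return weights + (out_channels if bias else 0)


# Layer equivalent to nn.Conv2d(1, 10, 3, 1) applied to a 28x28 grayscale image.
out_size = conv2d_output_size(28, kernel_size=3, stride=1)  # 26
n_params = conv2d_param_count(1, 10, 3)                     # 100
```

So this layer turns a 1x28x28 image into 10 feature maps of size 26x26, using 100 trainable parameters.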

The target model reaches 99.89% accuracy on its training set and 98.82% accuracy on its test set. The shadow models are copies of the target model, trained for 10 epochs on their shadow datasets. We then use the trained shadow models to generate training samples for our 10 attack models. The attack models are simple multilayer perceptrons:

To measure the success of our attack, we compute the mean accuracy of the attack models on their test set (training and test samples run through the target model). We get 51.97% accuracy, so this attack is not significantly better than a coin toss. The original paper [4] states that the MIA is linked to overfitting. The authors measure overfitting by the gap between the training set accuracy and the test set accuracy. In this case the gap is close to 0. We therefore move on to a harder problem, one on which it is harder to generalize to unseen samples.

CIFAR-10 is known to be a harder problem to solve. It is composed of miniature images of 10 object classes. We use a more complex convolutional model for the target and shadow models:

In this experiment, we do not modify the architecture of the attack models, and we train the target and shadow models for 15 epochs. The target model has 98% accuracy on its training set and 66% on its test set, and the attack models have a mean accuracy of 67.44%. We start to see some success for the MIA. Indeed, the train-test accuracy gap is 32 percentage points, and the attack models partially succeed in telling in confidence levels from out ones. Let’s look at the distributions of the attack model samples to understand how this works:

This figure is a violin plot [15] of the confidence distributions of the in and out test samples of the attack models (samples output by the target model). Here we plot only the distribution of the class 0 confidence level, for the class 0 attack model. The difference between in and out confidence levels is clear, but the out confidence distribution looks like the sum of two distinct distributions. To investigate this phenomenon, we separate the samples that the target model predicts correctly as class 0 (true positives) from those it predicts incorrectly (false negatives). There are no false positives or true negatives here, because we only consider samples with label 0 for the test “did the model predict label 0?”.
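The separation into true positives and false negatives can be sketched in a few lines. This is an illustrative helper under the assumption that every confidence vector passed in belongs to a sample whose real label is `true_class`; the function name and the toy vectors are hypothetical.

```python
def split_by_correctness(confidence_vectors, true_class):
    """Split confidence vectors of samples labeled `true_class` into
    true positives (the model predicts true_class) and false negatives."""
    true_positives, false_negatives = [], []
    for confidences in confidence_vectors:
        # The predicted class is the index of the highest confidence level.
        predicted = max(range(len(confidences)), key=confidences.__getitem__)
        if predicted == true_class:
            true_positives.append(confidences)
        else:
            false_negatives.append(confidences)
    return true_positives, false_negatives


# Toy confidence vectors for three samples whose real label is 0.
vectors = [[0.7, 0.2, 0.1], [0.3, 0.6, 0.1], [0.5, 0.4, 0.1]]
tp, fn = split_by_correctness(vectors, true_class=0)
```

Plotting the class 0 confidence of `tp` and `fn` separately, for in and out samples, is what lets the two underlying distributions be compared.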

With this separation, we can single out the two distributions that make up the total distribution. The clear difference we glimpsed earlier no longer holds here. In fact, the difference in shape between the total distributions is due to the different proportions of true positive and false negative samples among the in and out samples. The common hypothesis for the MIA’s success is that in samples exhibit a higher confidence level for their class than out samples [4]. This hypothesis does not hold here. A better hypothesis is that the attack models learn to tell samples that get a correct prediction from samples that get an incorrect one. To summarize, we could replace the whole membership inference attack process with a simple rule: “If the target model predicts the label correctly for a sample, then it comes from the training set, otherwise it does not”. We refer to this rule later as the correct classification rule (CCR).
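Expressed as code, the rule is only a few lines long. The sketch below is a minimal illustration with hypothetical function names and toy records; it assumes the attacker knows the true label of the sample being tested.

```python
def ccr_is_member(confidences, true_label):
    """Correct classification rule: guess "in" if and only if the class with
    the highest confidence equals the true label."""
    predicted = max(range(len(confidences)), key=lambda i: confidences[i])
    return predicted == true_label


def ccr_membership_accuracy(records):
    """records: list of (confidences, true_label, is_member) tuples."""
    hits = sum(ccr_is_member(c, y) == m for c, y, m in records)
    return hits / len(records)


# Toy records: (target model confidences, true label, is it an in sample?).
records = [
    ([0.8, 0.2], 0, True),   # in, correctly classified: the rule is right
    ([0.4, 0.6], 0, False),  # out, misclassified: the rule is right
    ([0.9, 0.1], 0, False),  # out, correctly classified: the rule is wrong
    ([0.3, 0.7], 1, True),   # in, correctly classified: the rule is right
]
accuracy = ccr_membership_accuracy(records)  # 0.75
```

Note that the rule only uses the predicted label, so it needs no shadow models and no confidence values beyond the arg max.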

Let’s measure the accuracy of the CCR. We assume that the evaluation uses as many in samples as out samples, which is indeed the case in our implementation of the MIA. The CCR is right on the in samples that the target model classifies correctly and on the out samples that it misclassifies, so its accuracy is the mean of the training accuracy and the test error rate. On the MNIST experiment, the CCR accuracy is 50.53% ((99.89 + (100 - 98.82)) / 2), and on the CIFAR-10 experiment it is 66% ((98 + (100 - 66)) / 2). This is very close to the mean accuracy of the attack models, 51.97% and 67.44% respectively. In fact, in these experiments, the rule is as effective at inferring sample membership as the MIA with shadow models. Does this mean that, in the end, the hypothesis that MIA success is based on confidence levels is purely an illusion? It seems that the success of the attack is mainly based on the train-test accuracy gap, and the CCR captures this phenomenon.
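The arithmetic behind these figures can be reproduced with a short helper. The function name `ccr_accuracy` is ours; it assumes, as stated above, a balanced evaluation with as many in samples as out samples, and takes the accuracies reported in this article as inputs.

```python
def ccr_accuracy(train_accuracy, test_accuracy):
    """Expected CCR accuracy in percent, for a balanced set of in and out samples.

    The rule is right on correctly classified in samples (train_accuracy) and
    on misclassified out samples (100 - test_accuracy).
    """
    return (train_accuracy + (100.0 - test_accuracy)) / 2.0


mnist_ccr = ccr_accuracy(99.89, 98.82)  # roughly 50.5
cifar_ccr = ccr_accuracy(98.0, 66.0)    # 66.0
```

With equal train and test accuracy the result is exactly 50%, which is why a model that generalizes well leaves the rule no better than a coin toss.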


We studied all the public implementations we could find and reproduced as many experiments from the literature as we could. Some experiments presented in research articles report a low train-test accuracy gap but a high MIA accuracy. Unfortunately, we have not been able to reproduce those experiments: either the dataset was not public, or the processing steps were described too vaguely to allow reproduction. Moreover, a majority of public implementations are currently wrong and show unrealistically good results because of induced biases. These biases often come from a bad dataset splitting strategy, or from testing the attack models on samples produced by the shadow models instead of the target model. Ultimately, we could not find counterexamples to our hypothesis, which is alarming because it calls into question the reality of the MIA with shadow models. Lastly, our hypothesis explains some curious assertions such as: “our attacks are robust even if the attacker’s assumptions about the distribution of the target model’s training data are not very accurate” [4], and “even restricting the prediction vector to a single label (most likely class), which is the absolute minimum a model must output to remain useful, is not enough to fully prevent membership inference” [4]. Indeed, the CCR does not need confidence levels.


Experimental results

Several facts made us doubt that this new hypothesis was the whole truth about the MIA. Subtle differences remain between the in and out sample distributions when we separate true positives from false negatives. As it stands, these differences are too small for a classifier to exploit: the distributions are too entangled. Is it possible to force these differences to grow in order to better understand the MIA? After experimenting extensively with hyperparameters, we realized that the best test is to train and test the attack models with only the true positive samples, which removes the bias of misclassification. To vary the train-test accuracy gap, we modified the number of training epochs of the target and shadow models. This revealed new insights about the MIA:

The results presented in Table I point toward several conclusions. There is indeed a difference in the confidence distributions of in and out samples that can be exploited when we strengthen the influencing factors. Even with true positive samples only, we are able to increase the mean accuracy of the attack models from 56.54% (15 epochs) to 61.03% (100 epochs) while leaving the train-test accuracy gap unchanged on CIFAR-10. When there is no train-test accuracy gap, there seems to be no difference in confidence levels between in and out samples. And even when there is a train-test accuracy gap (cf. Fashion-MNIST), it does not imply differences in the distributions of confidence levels between in and out samples (about 50% accuracy for the TP-only attack models, whatever the overtraining).

We can conclude that MIA success has one major influencing factor, the train-test misclassification difference (mainly given by the train-test accuracy gap), and a secondary factor, the confidence level difference between in and out samples (a factor that can be increased with overtraining). However, this secondary factor does not always influence MIA success. In most of our experiments, it plays little to no role, contrary to what the current literature explains. For it to affect the result, the ML model must be both overfitted (a noticeable train-test accuracy gap) and overtrained (an ML model with more parameters than the problem requires, along with too many training iterations). In real conditions, this kind of ML model is unlikely to be put in production anyway because of its poor performance. When the secondary factor plays no role, the CCR (correct classification rule) performs similarly to the MIA. Finally, barring a setup that is specifically unfavorable to the defender, the MIA is an illusory threat with current attack techniques.



Our experiments were conducted with a black-box attacker who knows the target model’s hyperparameters and has access to samples comparable to those of the training dataset. Within the black-box setup, we started from the worst possible assumptions from the defender’s point of view. Even with this setup, and while maximizing the major factors that increase private information leakage, we reached success rates that remain quite low, especially for a targeted attack. A membership test that reaches around 70% accuracy is not good enough to trust its results. To exploit it, the attacker needs a massive attack rather than a targeted one. This greatly reduces the number of scenarios in which the MIA becomes a profitable attack lever.

Moreover, consider a more realistic black-box attacker. Even if the attacker manages to steal the model [6] and thus obtains a realistic approximation of the hyperparameters, it remains very difficult to know, even approximately, the major factors that influence the MIA, i.e. the train-test accuracy gap and overtraining. The attacker is not able to measure the real success of the attack. The only thing the attacker can measure is how well the attack generalizes on the samples generated by the shadow models. So even if the attack reaches around 70% accuracy on its test set with overfitted and overtrained shadow models, this provides no performance guarantee on the target model. In general, an ML model in production is built to have good generalization properties, so there is no reason to think that it is very overfitted or very overtrained. In reality, the attacker has little to no clue about the success of the method and therefore cannot have confidence in it. This makes the MIA a really fragile cog in a targeted or massive attack architecture.

The current techniques for MIA on supervised learning are an illusory threat

In this article, we have shown that the current understanding of MIA success is at best incomplete and, in most cases, simply false. We presented a more accurate explanation for its success: the attack models mostly capture the misclassification difference between in and out samples. This means that in most cases the MIA is no better than the simple attack rule “If the target model predicts the label correctly for a sample, then it comes from the training set, otherwise it does not”. Hence the current techniques for MIA on supervised learning are an illusory threat. This does not mean that better techniques will not be invented in the future, making membership inference an actual threat.



[1] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

[2] Gamaleldin F Elsayed, Ian Goodfellow, and Jascha Sohl-Dickstein. Adversarial reprogramming of neural networks. arXiv preprint arXiv:1806.11146, 2018.

[3] Ali Shafahi, W Ronny Huang, Mahyar Najibi, Octavian Suciu, Christoph Studer, Tudor Dumitras, and Tom Goldstein. Poison frogs! targeted clean-label poisoning attacks on neural networks. arXiv preprint arXiv:1804.00792, 2018.

[4] Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pages 3–18. IEEE, 2017.

[5] Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pages 1322–1333. ACM, 2015.

[6] Florian Tramèr, Fan Zhang, Ari Juels, Michael K Reiter, and Thomas Ristenpart. Stealing machine learning models via prediction APIs. In USENIX Security Symposium, pages 601–618, 2016.

[7] Bruno Bouchard, Frédérick Imbeault, Abdenour Bouzouane, and Bob-Antoine J Menelas. Developing serious games specifically adapted to people suffering from Alzheimer. In International Conference on Serious Games Development and Applications, pages 243–254. Springer, 2012.

[8] Michael Backes, Pascal Berrang, Mathias Humbert, and Praveen Manoharan. Membership privacy in microRNA-based studies. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 319–330. ACM, 2016.

[9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.

[10] Giuseppe Ateniese, Giovanni Felici, Luigi V Mancini, Angelo Spognardi, Antonio Villani, and Domenico Vitali. Hacking smart machines with smarter ones: How to extract meaningful data from machine learning classifiers. arXiv preprint arXiv:1306.4447, 2013.

[11] Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2:18, 2010.

[12] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-10 and CIFAR-100 datasets. URL: https://www.cs.toronto.edu/~kriz/cifar.html, 6, 2009.

[13] Jamie Hayes, Luca Melis, George Danezis, and Emiliano De Cristofaro. Logan: Membership inference attacks against generative models. arXiv preprint arXiv:1705.07663, 2017.

[14] Christopher M Bishop et al. Neural networks for pattern recognition. Oxford university press, 1995.

[15] Hintze, J. L., & Nelson, R. D. (1998). Violin plots: a box plot-density trace synergism. The American Statistician, 52(2), 181–184.

Image sources

The following images have been used in this article:

Other images are self-made.


This article was first published on Medium.
