Imagine you are being treated for disease X, and nobody knows about it except you and your doctor. In due course, a research agency (owned by the hospital management) collects your private information about the disease and its symptoms, assuring you that it will be kept safe, confidential and secure. A few days later, however, a stranger confronts you, tells you that he knows about your illness, and tries to blackmail you.
Frightening, right?
Health data is highly sensitive. In fact, it is arguably more sensitive than your bank account number, and the kind of breach described above can occur despite the highest level of cyber security measures.
How? Let’s bring in Sherlock Holmes to solve this case, because the criminal may not have stolen any data at all. He may instead have used an approach similar to Holmes’ science of deduction and worked out your details through statistical disclosure.
Let’s look at this peculiar case of data theft in more detail. Here is how he may have made the deduction:
The criminal was another patient in the hospital who had simply seen you and knew that your details had also been taken. He probably saw you filling in the researcher’s questionnaire.
In the monthly research magazine of the agency, a table was published containing details of diseases X and Y. A simple part of the table gave the count of patients as below:
X: count of patients – 50
Y: count of patients – 1
Here, the patient who deduced and misused the information was none other than the person with disease Y. Looking at the table, he knew immediately that, since he was the only one with disease Y, everyone else who filled in the researcher’s questionnaire must have disease X.
This deduction technique is called “Statistical disclosure by Differencing” and to deal with it we need “Anonymisation” methods.
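The deduction described above can be sketched in a few lines of Python. This is only an illustration: the function names `difference_out` and `possible_diseases` are hypothetical, and the sketch assumes the attacker knows exactly one record (his own) and that every respondent appears in the published table.

```python
def difference_out(published_counts, known_records):
    """Subtract the records the attacker already knows from the published counts."""
    remaining = dict(published_counts)
    for disease in known_records:
        remaining[disease] -= 1
    return remaining


def possible_diseases(remaining_counts):
    """Diseases that any other respondent could still have."""
    return [disease for disease, count in remaining_counts.items() if count > 0]


# Published table from the example above.
published = {"X": 50, "Y": 1}

# The attacker knows his own record: he is the single patient with disease Y.
remaining = difference_out(published, ["Y"])
candidates = possible_diseases(remaining)

print(remaining)   # counts left once the attacker removes himself
print(candidates)  # what every other respondent must have
```

Once the attacker removes himself, no Y patients remain, so X is the only possibility for anyone else he saw filling in the questionnaire.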
Anonymisation is the process by which personal data is rendered non-personal. It is different from cyber security, although equally important, and it is a critical area of research that demands more attention. It secures data by working on the data itself rather than on its storage environment. It includes encryption techniques but is not restricted to them. It lets the researcher look beyond traditional theft practices and presumes that the thieves come with Sherlock Holmes levels of smartness!
Let us apply an anonymisation technique to the scenario above. Here, we need a type of anonymisation that does not dilute the important information a reader takes from the table. If a small change in the proportion of patients does not affect the overall takeaway, a small integer can be added to any unacceptably low count so that the table looks like this:
X: count of patients – 50
Y: count of patients – 5
However, this may not work if it compromises data integrity or if the actual count matters. In such cases, we will need some other anonymisation technique.
Similarly, there are various other forms of statistical disclosure, each with its own ways of anonymisation. This is why “statistical disclosure and anonymisation” is considered an open area of research.
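The count adjustment above might look like the following sketch. The function name `perturb_small_counts`, the threshold of 5 and the offset of 4 are assumptions chosen only to reproduce the example table; they are not recommended values.

```python
def perturb_small_counts(counts, threshold=5, offset=4):
    """Add a small integer to every count below the threshold.

    threshold and offset are illustrative assumptions, not standard values.
    """
    return {
        disease: count + offset if count < threshold else count
        for disease, count in counts.items()
    }


original = {"X": 50, "Y": 1}
released = perturb_small_counts(original)

print(released)
```

With these assumed settings the large count is left untouched and only the single-patient cell changes, so the Y patient can no longer be sure he is the only one in that row.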
Consider another quick example: a respondent whose date of birth is 2nd Jan 1988, where the researcher needs to calculate the respondent’s age at the end of diagnosis. Does the researcher need the exact date of birth to do that? If age is needed in years rather than days, then even 5th Feb 1988 would give an age of 33 years as of Aug 2021. Since date of birth is sensitive information that we want to avoid sharing, why not share a strategically altered date that does not affect the results?
The sensitivity of data in the health industry places a higher responsibility on data security professionals. This is why health research experts are mostly third-party professionals, separate from the data collection agencies; the separation is mainly meant to reduce the risk of statistical disclosure. Lately, researchers have recognised the importance of anonymisation, and central health management authorities are creating safe havens for storing data, from which data for research can be accessed. Here, in addition to top-class cyber security, the data undergoes stringent anonymisation to prevent misuse.
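The date-of-birth example can be checked with a short sketch. The helper names `age_in_years` and `preserves_age` are hypothetical, and the reference date of 1st Aug 2021 is an assumption, since the text only says "Aug 2021".

```python
from datetime import date


def age_in_years(dob, as_of):
    """Completed years between a date of birth and a reference date."""
    years = as_of.year - dob.year
    # Subtract one year if the birthday has not yet occurred in the reference year.
    if (as_of.month, as_of.day) < (dob.month, dob.day):
        years -= 1
    return years


def preserves_age(true_dob, shared_dob, as_of):
    """True if the altered date gives the same age in years as the real one."""
    return age_in_years(true_dob, as_of) == age_in_years(shared_dob, as_of)


true_dob = date(1988, 1, 2)    # real date of birth, kept private
shared_dob = date(1988, 2, 5)  # altered date that could be shared instead
reference = date(2021, 8, 1)   # assumed day within Aug 2021

print(age_in_years(true_dob, reference))
print(age_in_years(shared_dob, reference))
print(preserves_age(true_dob, shared_dob, reference))
```

A check like `preserves_age` matters because not every altered date is safe: moving the birthday past the reference date, for instance to 10th Sep 1988, would change the computed age.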
Author: Kunal Hriday
Reference:
Elliot, M. (2021). Anonymisation: theory and practice. National Centre for Research Methods online learning resource.