Anonymous gene databases are vulnerable to privacy breaches

A new study shows that anonymous genetic databases are susceptible to identity theft and data protection violations. Researchers warn of the consequences.

A study has triggered concerns that a kind of genetic database that is increasingly popular with researchers could be exploited to disclose the identity of the participants or to link private health information with their public genetic profiles.

Single-cell data sets can contain information about gene expression in millions of cells collected from thousands of people. These data are often freely accessible and offer a valuable resource for researchers who examine the effects of diseases at the cellular level. The data are supposed to be anonymized, but a study published on October 2 in the journal Cell [1] shows how data from one study can be used "to uncover private information about individuals in another study," the authors write.

The results underline how difficult it is to reconcile the interests of researchers with the privacy of donors. "Our genomes are very identifying. They can tell you a lot about us, our characteristics and our susceptibility to disease," says the co-author of the study, Gamze Gürsoy, a bioinformatics researcher at Columbia University in New York City. "You can change your credit card number if it becomes public, but you can't change your genome."

Sensitive data

Concerns about privacy in genetic data sets have been raised before, but they mainly focused on "bulk" genetic profiles. These contain information on gene activity that is averaged over a large cell population rather than measured in individual cells.

In the past, single-cell data sets were thought to be less susceptible to privacy breaches because of the level of "noise", or variation in gene expression, between different cells. But Gürsoy and her team were able to show that this is not the case.

The team examined three publicly available single-cell data sets, which contained blood cells from people with lupus, a chronic autoimmune disease. The researchers found that they could use the gene-expression data to predict the structure of a person's genome by combining these values with information about expression quantitative trait loci (eQTLs). Details of eQTLs, which are variations in chromosomes that correlate with gene expression, are also publicly accessible in single-cell data sets.

To test the reliability of their approach, the researchers checked their genome predictions against a genome database that corresponded to the cells used. They were able to link most data sets to the corresponding genome, with an accuracy of more than 80%.
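The linking step described above can be illustrated with a toy sketch. This is not the method used in the study: it assumes synthetic data in which each eQTL shifts a gene's expression by one unit per allele copy, and every name in it (`simulate`, `predict_genotype`, `link_profiles`, the cut-off values) is hypothetical. Real eQTL effects are much weaker and noisier, so an actual attack would need a proper statistical model.

```python
import random


def simulate(n_donors, n_eqtls, noise_sd, seed=0):
    """Create synthetic genotypes (0, 1 or 2 allele copies) and noisy expression."""
    rng = random.Random(seed)
    genome_db = {
        f"genome_{i}": [rng.choice((0, 1, 2)) for _ in range(n_eqtls)]
        for i in range(n_donors)
    }
    # Assumption: expression equals the allele count plus Gaussian noise.
    expression_profiles = {
        f"sample_{i}": [g + rng.gauss(0, noise_sd) for g in genome_db[f"genome_{i}"]]
        for i in range(n_donors)
    }
    return expression_profiles, genome_db


def predict_genotype(expression, low_cut, high_cut):
    """Guess the allele count at one eQTL from the linked gene's expression."""
    if expression < low_cut:
        return 0
    if expression < high_cut:
        return 1
    return 2


def link_profiles(expression_profiles, genome_db, cuts):
    """Link each expression profile to the genome that best matches its predicted genotypes."""
    links = {}
    for sample_id, expr in expression_profiles.items():
        predicted = [predict_genotype(e, lo, hi) for e, (lo, hi) in zip(expr, cuts)]
        links[sample_id] = max(
            genome_db,
            key=lambda gid: sum(p == g for p, g in zip(predicted, genome_db[gid])),
        )
    return links


profiles, db = simulate(n_donors=20, n_eqtls=200, noise_sd=0.4)
links = link_profiles(profiles, db, cuts=[(0.5, 1.5)] * 200)
correct = sum(links[f"sample_{i}"] == f"genome_{i}" for i in range(20))
print(f"{correct}/20 expression profiles linked to the matching genome")
```

Even when individual genotype guesses are often wrong, agreement across many eQTLs makes the correct genome stand out. That is the intuition behind why per-cell noise alone does not protect donors.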

Unlike the data on gene expression and eQTLs, complete genome databases can usually be viewed only by scientists, to protect the donors' identifying information. However, the researchers point out that a participant's genome data could be publicly available somewhere else. For example, the participant could have uploaded it to a genealogy website, where users submit DNA samples to learn more about their ancestry. In that case, an attacker could identify a person whose cells are in a single-cell data set by analyzing their genome. This could reveal personal information related to a sensitive trait, such as a psychiatric disorder, because research participants are often selected to study the biology of these complex conditions.

Privacy breaches like this could have real consequences, such as discrimination in the workplace, says Gürsoy. She adds that leaks could even affect future generations, because genetic traits can be passed on to descendants. "Everything that becomes known about us is carried on through the generations," she says.

Bradley Malin, who researches large-scale genomic data sharing at Vanderbilt University in Nashville, Tennessee, describes the study as a "novel extension and contribution to the literature". He adds that future research could investigate whether genome data could also be linked in larger data sets that contain samples from thousands or millions of people.

Competing interests

Scientists are unsure how best to tackle the privacy concerns. "There is a desire to protect the privacy of the individual, but also a desire to advance medical research collectively, and these are unfortunately at odds with each other," says Mark Gerstein, who researches biomedical data at Yale University in New Haven, Connecticut. The simplest solution would be to make access to genetic data more difficult, but that would harm research, he says. "We have to share and aggregate large amounts of information," he explains. "If we lock everything down and make it more private, it really hinders the entire process."

In their study, Gürsoy and her colleagues call for greater transparency about the risks for participants who share their genome data, and suggest that researchers should make sure donors consent to the sharing of their data. Another option could be to encrypt personal data that are part of a public database. The authors acknowledge that this would complicate the process of creating and maintaining data sets, but they believe it could help to protect participants' privacy.

  1. Walker, C. R. et al. Cell https://doi.org/10.1016/J.CELL.2024.09.012 (2024).
