Thirteen genetic sequences — isolated from people with COVID-19 infections in the early days of the pandemic in China — were mysteriously deleted from an online database last year but have now been recovered.
Jesse Bloom, a computational biologist and specialist in viral evolution at the Fred Hutchinson Cancer Research Center, found that the sequences had been removed from an online database at the request of scientists in Wuhan, China. But with some internet sleuthing, he was able to recover copies of the data stored on Google Cloud.
The sequences don’t fundamentally change scientists’ understanding of the origins of COVID-19 — including the fraught question of whether the coronavirus spread naturally from animals to people or escaped in a laboratory accident. But their deletion adds to concerns that secrecy from the Chinese government has obstructed international efforts to understand how COVID-19 emerged.
Bloom’s results were published in a preprint paper, not yet peer-reviewed by other scientists, released on Tuesday. “I think it’s certainly consistent with an attempt to hide the sequences,” he told BuzzFeed News.
Bloom learned about the deleted data after reading a paper from a team led by Carlos Farkas at the University of Manitoba in Canada about some of the earliest genetic sequences of SARS-CoV-2. Farkas’s paper described sequences sampled from hospital outpatients in a project by researchers in Wuhan who were developing diagnostic tests for the virus. But when Bloom tried to download the sequences from the Sequence Read Archive, an online database run by the US National Institutes of Health, he was given error messages showing they had been removed.
Bloom realized that copies of SRA data are also maintained on servers run by Google, and was able to puzzle out the URLs where the missing sequences could be found in the cloud. In this way, he recovered 13 genetic sequences that may help answer questions about how the coronavirus evolved and where it came from.
Bloom found that the deleted sequences, like others collected at later dates outside the city, were more similar to bat coronaviruses, presumed to be the ultimate ancestors of the virus that causes COVID-19, than were sequences linked to the Huanan Seafood Market in Wuhan. This adds to earlier suggestions that the seafood market may have been an early victim of COVID-19, rather than the place where the coronavirus first jumped from animals into people.
“This is a very interesting study performed by Dr. Bloom, and in my opinion the analysis is totally correct,” Farkas told BuzzFeed News by email. Scott Gottlieb, formerly head of the Food and Drug Administration, also praised the findings on Twitter.
But some scientists were less impressed. “It really adds nothing to the origins debate,” Robert Garry of Tulane University in New Orleans told BuzzFeed News by email. Garry argued that the Huanan market or other markets in Wuhan could still be the source of COVID-19.
Bloom is one of 18 scientists who in May published a letter criticizing the joint WHO–China study into the origins of SARS-CoV-2. The scientists argued the WHO–China report failed to give “balanced consideration” to the competing ideas that the coronavirus spread naturally from animals to people or escaped from a lab, a theory the report judged to be “extremely unlikely.” After the WHO–China report was published, the US and 13 other governments complained that it “lacked access to complete, original data and samples.”
The deleted virus sequences were first uploaded to the SRA in early March 2020, around the time that researchers led by Yan Li and Tiangang Liu of Wuhan University published a preprint describing their work using genetic sequencing to diagnose COVID-19. Just days before, China’s State Council had ordered that all papers related to COVID-19 be centrally approved.
The sequences were then withdrawn from the SRA in June, around the time that the final version of the paper appeared in a scientific journal. According to the NIH, the authors asked for the sequences to be removed. “The requestor indicated the sequence information had been updated, was being submitted to another database, and wanted the data removed from SRA to avoid version control issues,” NIH spokesperson Amanda Fine told BuzzFeed News by email.
However, it’s unclear whether the sequences have since been posted online in another database.
“There is no plausible scientific reason for the deletion,” Bloom wrote in his preprint, arguing the sequences were likely “deleted to obscure their existence.” That suggested, he wrote, “a less than wholehearted effort to trace early spread of the epidemic.”
Although the sequences were deleted, Garry pointed out that key genetic mutations they contained were still published in a table in the final paper from the Wuhan team. “Jesse Bloom found exactly nothing new that is not already part of the scientific literature,” Garry told BuzzFeed News, accusing Bloom of writing his preprint in an “inflammatory way that is unscientific and unnecessary.”
Bloom wrote to the Wuhan researchers asking them why the sequences had been deleted but received no reply. Li and Liu similarly did not immediately respond to a query from BuzzFeed News.
This is not the first time scientists have raised concerns about the removal of data that may help answer questions about the origins of COVID-19. The main database containing information on coronavirus sequences maintained by the Wuhan Institute of Virology, which is the focus of speculation about a possible “lab leak” of the virus, was taken offline in September 2019. When members of the WHO–China team that studied the origins of the pandemic visited the institute in February, they were told the database, which reportedly included data on 22,000 coronavirus samples and sequence records, had been removed after repeated hacking attempts.