Chinese scientists mysteriously deleted data about the coronavirus's genetic sequences collected early in the pandemic, a virologist has discovered, and he suggests the aim was to "obscure their existence".
Jesse Bloom - a virologist at the Fred Hutchinson Cancer Research Center in Seattle - says he managed to recover the data from hidden files on Google Cloud, and they reveal the earliest infections picked up in Wuhan probably weren't among the first.
Dr Bloom was looking at a March 2020 study published by scientists from Wuhan, which referenced 241 genetic sequences belonging to the SARS-CoV-2 virus uploaded to a United States-based online database called the Sequence Read Archive. But when he tried finding them, they were gone.
He followed a trail of clues left behind in published papers and eventually found 13 of the missing sequences. They showed that the viruses examined in the studies, collected in January 2020, had three fewer mutations than those collected in December at the Huanan Seafood Wholesale Market, where the first known outbreak occurred.
The source of the SARS-CoV-2 virus, which causes COVID-19, is hotly debated and has been the subject of international investigations. There have been reports of early cases with no link to the market at all, possibly as far back as September and as far away as Italy.
"Analysis of these sequences... suggests that the Huanan Seafood Market sequences that are the focus of the joint WHO-China report are not fully representative of the viruses in Wuhan early in the epidemic," he wrote in a new paper, published online this week ahead of peer review.
"The earliest known SARS-CoV-2 sequences, which are mostly derived from the Huanan Seafood Market, are notably more different from these bat coronaviruses than other sequences collected at later dates outside Wuhan."
In other words, the virus was likely circulating in humans well before the December outbreak which Chinese officials reported to the World Health Organization at the end of the month. By then, the strain in Wuhan had already picked up three mutations missing from other strains - those in the deleted files.
The genetic sequences that were deleted from the database showed a virus closer to its bat origins than the ones collected at the market.
Dr Bloom said it's common to use "genomic epidemiology to infer the timing and dynamics of spread from analysis of viral sequences".
"But in the case of Wuhan, genomic epidemiology has also proven frustratingly inconclusive. Some of the problem is simply limited data: despite the fact that Wuhan has advanced virology labs, there is only patchy sampling of SARS-CoV-2 sequences from the first months of the city's explosive outbreak... just a handful of Wuhan sequences are available from before late January of 2020. This paucity of sequences could be due in part to an order that unauthorised Chinese labs destroy all coronavirus samples from early in the outbreak, reportedly for 'laboratory biological safety' reasons."
The New York Times tried to find out why the sequences were deleted, but the scientists who uploaded them didn't respond. A spokesperson for the US National Library of Medicine, which runs the Sequence Read Archive, said they were removed in June 2020 at the request of the scientists, who said the sequences were going to be uploaded elsewhere. Dr Bloom looked, and couldn't find them anywhere.
"It therefore seems likely the sequences were deleted to obscure their existence," he wrote in his paper, linking it to a pattern of apparent obfuscation undertaken by Chinese officials in the pandemic's early days.
China has repeatedly denied accusations it withheld information early in the outbreak, and has dismissed claims the virus could have escaped from one of its labs in Wuhan.