In March 2020, when the WHO declared a pandemic, the public sequence database GISAID held 524 covid sequences. Over the next month, scientists uploaded another 6,000. By the end of May, the total was over 35,000. (In contrast, scientists worldwide added 40,000 flu sequences to GISAID in all of 2019.)
“Without a name, forget it – we can’t understand what other people are saying,” says Anderson Brito, a postdoc in genomic epidemiology at the Yale School of Public Health, who contributes to the Pango effort.
As the number of covid sequences spiraled, researchers trying to study them were forced to create entirely new infrastructure and standards on the fly. A universal naming system has been one of the most important elements of this effort: without it, scientists would struggle to talk to each other about how the virus’s descendants travel and change – either to raise a question or, more importantly, to sound the alarm.
Where Pango came from
In April 2020, several prominent virologists in the UK and Australia proposed a system of letters and numbers for naming lineages, or new branches, of the covid family. It had logic and hierarchy, although the names it generated – like B.1.1.7 – were a bit of a mouthful.
One of the authors of the paper was Áine O’Toole, a doctoral candidate at the University of Edinburgh. She soon became the primary person actually doing that sorting and classification, eventually combing through hundreds of thousands of sequences by hand.
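To make the hierarchy concrete, here is a minimal Python sketch of how a dotted name such as B.1.1.7 spells out its own ancestry. The function name `ancestors` is hypothetical and is not part of any Pango software; the sketch reads only what the name itself encodes.

```python
def ancestors(lineage: str) -> list[str]:
    """Return the parent lineages implied by a dotted name, nearest first."""
    parts = lineage.split(".")
    # Dropping one trailing component at a time walks up the family tree.
    return [".".join(parts[:i]) for i in range(len(parts) - 1, 0, -1)]


print(ancestors("B.1.1.7"))  # ['B.1.1', 'B.1', 'B']
```

Each extra number marks one more step down a branch, which is why the names grow long as the family grows.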
She says: “Very early on, it was just a matter of who was available to curate the sequences. That ended up being my job. I guess I never understood the scale we were going to reach.”
She quickly began building software to assign new genomes to the right lineages. Shortly afterwards, another researcher, postdoc Emily Scher, built a machine learning algorithm to speed things up even more.
They named the software Pangolin, a tongue-in-cheek reference to the debate over the animal origin of covid. (The whole system is now simply known as Pango.)
The naming system, along with the software that implements it, quickly became globally important. Although the WHO has recently started using Greek letters for variants that seem particularly worrying, such as delta, these nicknames are intended for the public and the media. Delta actually refers to a growing family of variants, which scientists refer to by their more precise Pango names: B.1.617.2, AY.1, AY.2 and AY.3.
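The AY names belong to the same family as B.1.617.2 because Pango uses short aliases to keep names from growing too long: AY stands in for B.1.617.2, so AY.1 is shorthand for B.1.617.2.1. The Python sketch below illustrates the idea under stated assumptions. The one-entry `ALIASES` table and the helpers `expand` and `in_family` are hypothetical, written for this example, and are not the Pango team's actual code.

```python
# Illustrative alias table with only the single entry needed here;
# the real Pango alias list is much longer.
ALIASES = {"AY": "B.1.617.2"}


def expand(lineage: str) -> str:
    """Replace a leading alias (if any) with the full name it stands for."""
    prefix, _, rest = lineage.partition(".")
    full = ALIASES.get(prefix, prefix)
    return f"{full}.{rest}" if rest else full


def in_family(lineage: str, root: str) -> bool:
    """True if lineage is root itself or one of its descendants."""
    full, root_full = expand(lineage), expand(root)
    # The trailing dot stops B.1.617.20 from matching B.1.617.2.
    return full == root_full or full.startswith(root_full + ".")


for name in ["B.1.617.2", "AY.1", "AY.2", "AY.3", "B.1.1.7"]:
    print(name, expand(name), in_family(name, "B.1.617.2"))
```

Run as written, the four delta names come back as members of the B.1.617.2 family, while B.1.1.7 (alpha) does not.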
“When alpha appeared in the UK, Pango made it easier for us to look for these mutations in our genomes to see if we had that lineage in our country as well,” Jolly says. “Since then, Pango has been used as a starting point for reporting and monitoring variants in India.”
Because Pango offers a rational, orderly approach to what would otherwise be chaos, it could forever change the way scientists name viral strains – allowing experts from around the world to work together with a common vocabulary. Brito says: “Most likely, this will be the format we will use to monitor any other new virus.”
Many basic tools for tracking the covid genome have been developed and maintained over the last year and a half by early-career scientists like O’Toole and Scher. As the need for collaboration around the world exploded, scientists rushed to support it with ad hoc infrastructure like Pango. Much of that work fell to tech-savvy young researchers in their twenties and thirties. They used informal networks and open-source tools – meaning the tools were free to use, and anyone could volunteer to add improvements and enhancements.
“People who are at the forefront of new technologies are usually students and postdoctoral fellows,” says Angie Hinrichs, a bioinformatics scientist at UC Santa Cruz who joined the Pangolin project earlier this year. For example, O’Toole and Scher work in the lab of Andrew Rambaut, a genomic epidemiologist who published the first public sequences online after receiving them from Chinese scientists. “Coincidentally, it was the perfect place for them to provide these tools that have become absolutely critical,” Hinrichs says.
It was not easy. For most of 2020, O’Toole herself took on the main responsibility for identifying and naming new lineages. The university was closed, but she and another Rambaut doctoral student, Verity Hill, were given permission to enter the office. Her commute to work, a 40-minute walk to school from the apartment where she lived alone, gave her a sense of normalcy.
Every few weeks, O’Toole would download the entire data set from the GISAID database, which had grown exponentially each time. She would then hunt for groups of genomes with mutations that looked similar, or for things that looked unusual and might have been mislabeled.
When she got particularly stuck, Hill, Rambaut, and other members of the lab would jump in to discuss the labels. But the grunt work fell to her.
Deciding when the descendants of the virus deserve a new family name can be as much an art as it is a science. It was a painstaking process, sorting through an unprecedented number of genomes and asking over and over again: Is this a new variant of covid or not?
“It was pretty tiring,” she says. “But it was always really humbling. Imagine going through 20,000 sequences from 100 different places in the world. I’ve seen sequences from places I’ve never even heard of.”
As time went on, O’Toole struggled to keep up with the pace of new genomes for sorting and naming.
In June 2020, there were over 57,000 sequences stored in the GISAID database, and O’Toole had sorted them into 39 variants. In November 2020, a month after she was due to submit her thesis, O’Toole went through the data on her own for the last time. It took her 10 days to get through all the sequences, which by then numbered 200,000. (Although covid overshadowed her research on other viruses, she is putting a chapter on Pango in her thesis.)
Fortunately, the Pango software was built for collaboration, and others have stepped up. An online community – the one Jolly turned to when she noticed a variant spreading through India – sprang up and grew. This year, O’Toole’s work has been much easier. New lineages are now designated mostly when epidemiologists around the world contact O’Toole and the rest of the team via Twitter, email or GitHub – her preferred method.
“It’s more reactionary now,” says O’Toole. “If a group of researchers somewhere in the world is working on some data and they believe they have identified a new lineage, they can apply.”
The flood of data continued. Last spring, the team held a “pangothon,” a sort of hackathon in which it sorted 800,000 sequences into about 1,200 lineages.
“We gave ourselves three solid days,” O’Toole says. “It took two weeks.”
Since then, the Pango team has taken on several more volunteers, such as UCSC researcher Hinrichs and Yale researcher Brito, who both got involved initially by adding their two cents on Twitter and the GitHub page. University of Cambridge postdoc Chris Ruis has turned his attention to helping O’Toole clear the GitHub backlog.
O’Toole recently asked them to formally join the organization as part of the newly created Pango Network Lineage Designation Committee, which discusses and makes decisions about variant names. A second committee, which includes lab head Rambaut, makes decisions at a higher level.
“We have a website and email that isn’t just my email,” O’Toole says. “It’s gotten a lot more formalized and I think it’s really going to help it scale.”
A few cracks have begun to show at the edges as the data has grown. As of today, there are nearly 2.5 million covid sequences in GISAID, which the Pango team has divided into 1,300 branches. Each branch corresponds to a variant. According to the WHO, eight of them need to be watched.
With so much to process, the software is starting to buckle. Things get mislabeled. Many strains look similar, because the virus evolves the most advantageous mutations over and over again.
As a stopgap measure, the team has built new software that uses a different sorting method and can catch things that Pango might miss.