A Mechanist’s Guide to the Coronavirus Genome


Hello and welcome to my Coronavirus Genome Walkthrough.

(Hoping somebody comes out with that Vaccine Speedrun soon. This boss fight is really shaping up to be an intense one and we’ll need all the artifacts we can get.)

Here, I aim to provide a mechanistic explanation of the SARS-CoV-2 genome’s syntax and semantics. Let’s investigate what the SARS-CoV-2 viral genome actually does, as if reading through code like a compiler, from nucleotides to amino acids all the way to proteins. From the four bases all the way up to the finished protein-coated virus, what is a plague like this actually made of at the concrete, physical level?

Understanding a Complete System

The underlying purpose of this essay is less about the coronavirus per se and more about how having a tiny (but functionally complete) piece of viral RNA to analyze gives me a unique opportunity to try to understand a whole self-replicating machine from scratch. This is not a feat that I would have the fortitude to manually replicate with the full human genome, for example, but the coronavirus genome, like the nematode genome, is sufficiently small that we stand a chance at building a complete understanding. Perhaps something akin to interpretability, but for biological systems rather than artificial neural networks.

As a result, this essay is not intended to draw epidemiological conclusions; there are plenty of other sources for that. This essay is about fully understanding a biological system at the chemical and physical level.

Play, Curiosity, and Mechanical Understanding

Throughout this essay, I follow my curiosity in the form of serious play: if I notice I’m confused about something, I look into it and explore it until I’m satisfied that I now understand, and that my understanding is a mechanical understanding. Things are made of stuff! It turns out that we can understand that stuff!

I may skip over some details that weren’t confusing to me during my own research, but your journey need not be the same as mine. If you’re confused about something while reading this essay, I encourage you to go and look it up! Be curious when curiosity arises. That’s the meditation.

This all, at least, has been my aim as I set out to write this piece! As Ken Liu said of his philosophy while translating The Three-Body Problem, “I may not have succeeded, but these were the standards I had in mind as I set about my task.”

Part 1, here, covers just the genome and its translation to proteins. I hope to also write a Part 2 which will cover the structure and function of those proteins, their protein-protein interactions, and the full viral life cycle.

Let’s begin.

As a reminder, SARS-CoV-2 is a positive-sense single-stranded RNA virus.

What does this mean we can expect?

  1. Single-stranded: Its genome is a single strand of RNA (ssRNA).
  2. Positive-sense: That single strand of RNA can be immediately translated into protein by the ribosomes of the cell it infects.

From this we can also infer that one of the proteins the virus encodes must be RNA-dependent RNA polymerase (RdRP), a protein which synthesizes new RNA given an RNA template. That’s right: RNA → RNA. However, according to the central dogma of molecular biology, RNA → RNA is an unconscionable heresy. Correspondingly, RdRP is not naturally present in cells! All known positive-sense ssRNA viruses therefore must encode RdRP in order to successfully commit this heresy.
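To make RNA → RNA a little more concrete, here is a minimal Python sketch of template-directed copying. The function name is my own invention, and the ten-base input is simply the start of Orf1ab as quoted further down in this essay; real replication involves far more machinery than a lookup table.

```python
# Watson-Crick complements for the four RNA bases.
RNA_COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def copy_from_template(template: str) -> str:
    """Return the strand a polymerase would build against `template`.

    Both strands are written 5'->3', so the product is the reverse complement.
    """
    return "".join(RNA_COMPLEMENT[base] for base in reversed(template))


positive = "AUGGAGAGCC"                    # first ten bases of Orf1ab
negative = copy_from_template(positive)    # complementary, negative-sense strand
print(negative)
print(copy_from_template(negative) == positive)   # copying twice restores the original
```

Copying a positive-sense strand yields its negative-sense complement, and copying that yields the original again, which is why a single RNA → RNA enzyme is enough to make more genomes.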

…Wait a minute, the phrase “positive-sense ssRNA virus” implies the existence of negative-sense viruses. If those don’t encode their proteins directly, how can they possibly work?

Positive sense and negative sense

Negative-sense ssRNA viruses also exist! Influenza, Ebola, and measles are examples.

The internal contents of a negative-sense ssRNA virus consist not of a bare RNA genome but of a ribonucleoprotein, which contains both an RNA genome and a cohort of viral proteins capable of replicating RNA. Unlike positive-sense ssRNA viruses, negative-sense ssRNA viruses must travel with a working copy of their RNA-replicating proteins. This ribonucleoprotein has enzymatic activity!

RdRP as drug target

Since RdRP has (as far as I know) no legitimate purpose in human cells and is not naturally coded by them, could it provide a potential target for novel antiviral drugs?

Velkov et al. 2014 explores RdRP as a drug target for antivirals against the Hendra virus, a negative-sense ssRNA virus, though I’m unable to find the full text.

This review examines the current knowledge based on the multi-domain architecture of the Hendra RdRP and highlights which essential domain functions represent tangible targets for drug development against this deadly disease.

There must be some reason that developing antivirals against this protein is technically (or socially) difficult, or I’d have expected us to do it by now – there are many RNA viruses that this drug target could theoretically hit. Flagging this discrepancy for further research.

Back to SARS-CoV-2! First, let’s get us a genome. Clearly this virus has seen some mutations as it’s spread around, as you can see at NextStrain, so we’ve technically got choices as to which one to analyze. For this thread I’ll just stick with analyzing one version of the genome: Wuhan-Hu-1.

As a reminder, each A, G, C, and T in a genome is one of the four nucleotides: adenine, guanine, cytosine, and thymine. There are actually plenty of ways to engineer different unnatural base pair systems by adding artificial nucleotides, and these can even be incorporated into transcription and translation, but for whatever reason, these four and not others are what life ultimately ended up with.

The four nucleotides in DNA.

The genome of Wuhan-Hu-1 is available from NCBI GenBank. Since SARS-CoV-2 is an RNA virus, each T in this string technically represents a U, for uracil, RNA’s information-equivalent of thymine. The genome sequence is therefore:




Follow along with the genome »

That’s 29,903 nucleotides. Since there are only four possible nucleotides, we can estimate the information content of each nucleotide at approximately 2 bits; the virus’s genome therefore requires only about 7.5 kilobytes to store. That’s about as much data, byte for byte, as there are characters in this essay up to this point!
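As a quick check of that estimate, here is a small Python sketch. It assumes every base is equally likely and independent of its neighbours, so each one costs exactly log2(4) = 2 bits; that makes this an upper bound on storage, not a measured compression ratio. The variable names are mine.

```python
import math

GENOME_LENGTH = 29_903      # nucleotides, as stated above
ALPHABET_SIZE = 4           # A, C, G, U

bits_per_nucleotide = math.log2(ALPHABET_SIZE)
total_bits = GENOME_LENGTH * bits_per_nucleotide
total_bytes = total_bits / 8

print(f"{bits_per_nucleotide:.0f} bits per nucleotide")
print(f"{total_bytes / 1000:.1f} kilobytes for the whole genome")
```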

Lay out those 29,903 nucleobases along a ribose-phosphate backbone, reading them left to right from the 5’ end to the 3’ end, and bam – if that single molecule* were teleported into a cell, that’s 100% chemically sufficient† to infect a person with the plague du jour.

*plus the 5’ cap, discussed below

†modulo viral load effects??

How to interpret the Wuhan-Hu-1 genome as a whole molecule.

Poly-A tail

First question, and perhaps the most obvious one to the naked eye – what’s with all the AAAAA at the end of the viral genome?

29821 ...                                                ... AAAAAAAAAA


It’s… yelling at us? Is it… suffering? Should we help?

Simple: It’s a 3’ poly-A tail! This long tail of adenosine monomers is extremely common in both our own cells and in RNA viruses.

Our own messenger RNA (mRNA) is given a poly-A tail when it’s freshly produced in the nucleus in order to slow its degradation by the cell, allowing it to last long enough to be translated into protein. Naturally, if you’re a positive-strand RNA virus, you’re also going to want to last long enough to be translated into protein – so you’ve got to have the same feature yourself.

Genome 0.11% explained. So far so good!
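Here is a rough Python sketch of how that 0.11% could be tallied. The helper name is mine, the sequence is a made-up stand-in for the real 3’ end, and the tail length of 33 is an assumption inferred from the 0.11% figure; it is not quoted from GenBank.

```python
GENOME_LENGTH = 29_903


def poly_a_tail_length(sequence: str) -> int:
    """Count the unbroken run of 'A' bases at the 3' end of `sequence`."""
    return len(sequence) - len(sequence.rstrip("A"))


# Hypothetical 3' end: an arbitrary prefix followed by an assumed 33-base tail.
toy_three_prime_end = "GAUUACAGC" + "A" * 33

tail = poly_a_tail_length(toy_three_prime_end)
percent_explained = round(100 * tail / GENOME_LENGTH, 2)
print(tail, "bases of tail,", percent_explained, "% of the genome")
```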

5’ cap

While we’re discussing chemical features of mRNA, note that the viral genome presumably must also have a 5’ cap – an extra 7-methylguanosine on the 5’ end of its RNA strand – just like mRNAs do.

A 5′ cap, consisting of a 7-methylguanosine as well as methylation of the first two ribose sugars.

The cap is not directly shown in the viral genome sequence or mentioned in NCBI GenBank, but it is referenced in several papers discussing coronaviral genomes:

Since 2003, the outbreak of severe acute respiratory syndrome coronavirus has drawn increased attention and stimulated numerous studies on the molecular virology of coronaviruses. Here, we review the current understanding of the mechanisms adopted by coronaviruses to produce the 5′-cap structure and methylation modification of viral genomic RNAs.

Coronaviruses possess a cap structure at the 5′ ends of viral genomic RNA and subgenomic RNAs, which is generated through consecutive methylations by virally encoded guanine-N7-methyltransferase (N7-MTase) and 2′-O-methyltransferase (2′-O-MTase). The coronaviral N7-MTase is unique for its physical linkage with an exoribonuclease (ExoN) harbored in nonstructural protein 14 (nsp14) of coronaviruses.

Here, we have reconstituted complete SARS-CoV mRNA cap methylation in vitro.

Like the poly-A tail, the 5’ cap helps the genome to be recognized and translated by ribosomes rather than destroyed by the cell’s immune response.

How does the virus even make sure that it receives a 5’ cap and a poly-A tail, not to mention its outer coat? Hopefully these questions will be resolved by our review of its genes and proteins… let’s move on to look at those!

According to the “Features” section of the genome record, again from NCBI GenBank, here are the identifiable genes in this genome, in order:

  1. Orf1ab (for orf1ab polyprotein)
  2. S (for surface glycoprotein)
  3. Orf3a (for orf3a protein)
  4. E (for envelope protein)
  5. M (for membrane glycoprotein)
  6. Orf6 (for orf6 protein)
  7. Orf7a (for orf7a protein)
  8. Orf8 (for orf8 protein)
  9. N (for nucleocapsid phosphoprotein)
  10. Orf10 (for orf10 protein)

Let’s see how these genes get translated into proteins.

Translation of Orf1ab

This is the first gene in the genome and it is also by far the longest, weighing in at 7,096 amino acids:




These letters are single-letter amino acid abbreviations.

Unsurprisingly for the virus’s longest gene, this one seems to encode our friend the RNA replicase, RdRP! Its full protein product consumes monomers of RNA and catalyzes their polymerization.

And it is quite long: this virus has 10 genes, and this single gene represents 71.2% of the viral genome. That weird RNA → RNA replicase functionality really is the bulk of the information-theoretic content here. Quines; go figure!

More on this protein’s structure and function later, but first: how do the underlying nucleotides of the Orf1ab gene produce these specific amino acids?

A thermodynamic surprise: Ribosomal frameshift

The Orf1ab gene spans the range from nucleotide 266 to nucleotide 21,555, inclusive. Nucleotides in this GenBank data are unfortunately 1-indexed, not 0-indexed.

We can see at nucleotide 266 the signature AUG of a start codon, and at nucleotide 21,553 the UAA of an ochre stop codon. So far so good!

241   ...                     ...AUGGA GAGCCUUGUC CCUGGUUUCA ACGAGAAAAC


21541 CUUGUUAACA ACUAA...                                           ...
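Because GenBank coordinates are 1-indexed and inclusive while Python slices are 0-indexed and half-open, a small conversion helper avoids off-by-one mistakes. This is a sketch with a helper name of my own choosing; the two strings are just the fragments quoted above with the spaces removed, each tagged with the genome position of its first base.

```python
def slice_1_indexed(sequence: str, start: int, end: int, offset: int = 1) -> str:
    """Return bases start..end (inclusive), where the first character of
    `sequence` sits at genome position `offset`."""
    return sequence[start - offset : end - offset + 1]


fragment_from_266 = "AUGGAGAGCCUUGUCCCUGGUUUCAACGAGAAAAC"   # positions 266..300
fragment_from_21541 = "CUUGUUAACAACUAA"                      # positions 21541..21555

start_codon = slice_1_indexed(fragment_from_266, 266, 268, offset=266)
stop_codon = slice_1_indexed(fragment_from_21541, 21553, 21555, offset=21541)
print(start_codon, stop_codon)
```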


However, confusingly, the length of this coding region is 21,555 – 265 = 21,290, which is not divisible by 3. Normally, 3 nucleotides = 1 amino acid, so a gene’s length is always divisible by 3. What’s going on?
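The same arithmetic as a short Python sketch; the only inputs are the two GenBank coordinates quoted above, and the variable names are mine.

```python
start, end = 266, 21_555             # 1-indexed, inclusive
length = end - start + 1             # same as 21,555 - 265
codons, leftover = divmod(length, 3)

print(length, "nucleotides")
print(codons, "whole codons with", leftover, "nucleotides left over")
```

A non-zero remainder is the red flag: a plain, uninterrupted reading frame should leave nothing over.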

Note that in the GenBank data the gene is tagged ribosomal_slippage. Also note that in GenBank the gene’s location is notated as join(266..13468,13468..21555) rather than just 266..21555.

After some research, the answer here is that nucleotide 13,468 is actually used twice, due to a -1 ribosomal frameshift, an interesting thermodynamic-biochemical quirk of certain viral genomes!

According to this article on ribosomal frameshifting in viruses:

Programmed ribosomal frameshifting is an alternate mechanism of translation to merge proteins encoded by two overlapping open reading frames. The frameshift occurs at low frequency and consists of ribosomes slipping by one base in either the 5’(-1) or 3’(+1) directions during translation. Some viruses contain both a +1 and a -1 ribosomal frameshift. […]

All cis-acting frameshift signals encoded in mRNAs are minimally composed of two functional elements: a heptanucleotide “slippery sequence” conforming to the general form XXXYYYZ, followed by an RNA structural element, usually an H-type RNA pseudoknot, positioned an optimal number of nucleotides (5 to 9) downstream.
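The XXXYYYZ pattern is simple enough to search for mechanically. Below is a minimal Python sketch: the function name and the test string are hypothetical, and it only checks the heptamer itself, not the downstream pseudoknot that a real frameshift signal also needs.

```python
import re

# XXXYYYZ: three identical bases, three identical bases, then any base.
SLIPPERY = re.compile(r"([ACGU])\1\1([ACGU])\2\2[ACGU]")


def find_slippery_sites(rna: str) -> list:
    """Return (0-based index, heptamer) for every XXXYYYZ window, overlaps included."""
    sites = []
    for i in range(len(rna) - 6):
        window = rna[i : i + 7]
        if SLIPPERY.fullmatch(window):
            sites.append((i, window))
    return sites


# Arbitrary flanking bases around the heptamer discussed in this essay.
toy_rna = "GCG" + "UUUAAAC" + "GCAU"
print(find_slippery_sites(toy_rna))
```

A sliding window is used instead of `re.finditer` so that overlapping candidates are not skipped.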

If we look around nucleotide 13,468, we do indeed find the heptanucleotide “slippery sequence” responsible: it’s UUUAAAC. That C is nucleotide 13,468 and it ends up getting translated twice.



This frameshift gives us a total length of 21,291 nucleotides. Subtract 3 for the stop codon and then divide by 3, and we get a number which matches the reported protein sequence’s length: 7,096 amino acids. Hooray!
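To double-check, here is a sketch that reads the spans straight out of the GenBank location string. The parser is a deliberately naive regular expression of my own; it handles this one simple join(...) form and nothing else in the GenBank feature-location grammar.

```python
import re


def parse_join(location: str) -> list:
    """Parse 'join(a..b,c..d)' into a list of inclusive, 1-indexed (start, end) spans."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]


spans = parse_join("join(266..13468,13468..21555)")
translated_nt = sum(end - start + 1 for start, end in spans)   # 13468 is counted twice
amino_acids = (translated_nt - 3) // 3                        # drop the stop codon

print(spans)
print(translated_nt, "nucleotides ->", amino_acids, "amino acids")
```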

So, the arithmetic checks out. We now know what ribosomal_slippage and join(266..13468,13468..21555) mean, and we understand how these 21,290 nucleotides become 7,096 amino acids. However, I still have two questions:

  1. What??
  2. How does ribosomal frameshifting even work??

Ribosomal frameshifting at the molecular level

Thermodynamic control of -1 programmed ribosomal frameshifting (Bock et al. 2019) explores how ribosomal frameshifting happens by performing free-energy molecular dynamics simulations. This paper also explains the structure and function of that heptanucleotide slippery sequence, stating:

Spontaneous ribosome slippage is a rare event that occurs, on average, once in 10^4–10^5 codons. This low spontaneous frameshifting increases dramatically on particular mRNAs that contain sequences for programmed ribosomal frameshifting (PRF). PRF requires a slippery sequence, which usually comprises a X XXY YYZ heptamer, where XXX and YYY are triplets of identical bases and Z is any nucleotide, which allows for cognate pairing of the P-site and A-site tRNAs in the 0-frame and −1-frame. The nature of the tRNAs bound to the slippery site codons is critical, including the modifications of nucleotides in the anticodon loop (i.e., at positions 34 and 37 of the tRNA).

This paper goes on to analyze several heptanucleotide slippery sequences, drawing examples from the E. coli dnaX gene and explaining their thermodynamic characteristics. According to their breakdown, each example sequence relies on several of the following wobble pairings:

  • The U·G wobble pair. According to Varani and McClain 2000, the U·G wobble pair “has comparable thermodynamic stability to Watson–Crick base pairs and is nearly isomorphic to them.”
  • A·A and U·U mismatches.
  • G·S and A·S pairs. According to Bock et al., E. coli “has a single tRNA^Lys isoacceptor (anticodon 3’UUS5’) for decoding the two Lys codons, AAG and AAA,” where “S denotes the modified nucleotide mnm5s2U.”

For example, one of the heptanucleotide slippery sequences explained in Bock et al. is UUUUAAG. When the ribosome reads this sequence, it initially parses it as ..U UUU(Phe) AAG(Lys), but then jolts -1 backwards into the ... UUU(Phe) UAA(Lys) reading frame. Despite normally being read as a stop codon, that second codon, UAA, retains its attached tRNA^Lys through the combination of a U·U mismatch and an A·S pair.

The UUUUAAG slippery sequence.

Unfortunately, UUUUAAG isn’t the sequence we’re interested in if we want to understand SARS-CoV-2! We want our specific heptanucleotide slippery sequence of interest, UUUAAAC, and despite this paper’s thoroughness and usefulness, none of its examples involve it. How can we be sure that UUUAAAC has the thermodynamic properties that it needs in order for SARS-CoV-2 to be able to produce a protein here?

After some investigation, I eventually stumbled upon Mutational Analysis of the “Slippery-sequence” Component of a Coronavirus Ribosomal Frameshifting Signal (Brierley, Jenner, and Inglis 1992), a research paper which covers exactly this same heptanucleotide sequence (and in the context of coronaviruses, too), and even provides a helpful diagram!

The UUUAAAC slippery sequence.

The paper performs some experiments and confirms how the UUUAAAC slippery sequence works:

  1. First, a tRNA^Leu and a tRNA^Asn bind to the UUA and AAC.
  2. After a -1 frameshift, those two tRNAs are now wobble-paired to UUU (with a U·U mismatch) and AAA (with an A·G mismatch).
  3. Translation then proceeds as normal from there, with the next codon recruiting a tRNA^Arg.
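The three steps above can be mimicked with a toy codon reader. This is only a bookkeeping sketch: the function is hypothetical, the codon table holds just the four standard-code entries needed here, and the three G bases placed after the heptamer are an assumption for illustration (the text above only says that the next codon recruits a tRNA^Arg).

```python
from typing import List, Optional

# Subset of the standard genetic code, just enough for this example.
CODON_TO_AMINO_ACID = {"UUA": "Leu", "AAC": "Asn", "GGG": "Gly", "CGG": "Arg"}


def read_codons(rna: str, start: int, slip_after: Optional[int] = None) -> List[str]:
    """Read codons from 0-based index `start`.

    If `slip_after` is given, step back one base (a -1 frameshift)
    once that many codons have been read.
    """
    codons, i = [], start
    while i + 3 <= len(rna):
        codons.append(rna[i : i + 3])
        i += 3
        if slip_after is not None and len(codons) == slip_after:
            i -= 1      # the ribosome slips one base toward the 5' end
    return codons


site = "UUUAAAC" + "GGG"     # heptamer plus assumed downstream bases

in_frame = [CODON_TO_AMINO_ACID[c] for c in read_codons(site, start=1)]
shifted = [CODON_TO_AMINO_ACID[c] for c in read_codons(site, start=1, slip_after=2)]
print("no frameshift:", in_frame)
print("-1 frameshift:", shifted)
```

Stepping back one base after the second codon means the final C of the heptamer is read twice, once as the end of AAC and once as the start of the next codon.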

I’m still quite weirded out by ribosomal frameshift, but satisfied.

Partial translation of Orf1ab

One last detail while we’re discussing the gene Orf1ab. A number of papers I’ve read so far seem to allude to the fact that Orf1ab actually produces two protein products: one, its full protein product (named pp1ab), and another, a partial translation (named pp1a) that results when the ribosome does not undergo a frameshift event and instead terminates translation shortly after the frameshift site. That first half of the sequence can itself be called the gene Orf1a.

For example, from Graham et al. 2008 on the SARS coronavirus:

Translation of ORF1a results in a theoretical polyprotein of ∼500 kDa, while translation of ORF1ab results in an ∼800 kDa polyprotein.

The ORF 1a and 1ab polyproteins are not detected during infection, since they are likely processed co- and post-translationally into intermediate and mature proteins by proteinase activities in the nascent polyproteins.

Both of these genes produce polyproteins that actually get chopped up into smaller proteins before going on to perform their function, so the possibility of premature termination at pp1a ends up being of little consequence, except inasmuch as it partially reduces translation of pp1ab.

Translation of all the other genes

This concludes our analysis of the translation of the Orf1ab gene. Genome 71.31% explained so far!

In comparison, the remaining 9 genes are quite uneventful. They all start with a start codon (AUG), end with a stop codon (UAA, UGA, or UAG), and don’t try to do anything tricky in between.

Gene     Start Nucleotide   End Nucleotide   Gene Size (nt)   Codons (incl. stop)
S        21563              25384            3822             1274
Orf3a    25393              26220            828              276
E        26245              26472            228              76
M        26523              27191            669              223
Orf6     27202              27387            186              62
Orf7a    27394              27759            366              122
Orf8     27894              28259            366              122
N        28274              29533            1260             420
Orf10    29558              29674            117              39
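The table can be sanity-checked mechanically. In this sketch the rows are copied from the table above and the variable names are mine. Note that the last column equals the gene size divided by 3, so it counts the stop codon; each finished polypeptide is one residue shorter, which is the same subtraction applied to Orf1ab earlier.

```python
# (gene, start, end, listed size in nt, listed last-column value) from the table above.
TABLE = [
    ("S",     21563, 25384, 3822, 1274),
    ("Orf3a", 25393, 26220,  828,  276),
    ("E",     26245, 26472,  228,   76),
    ("M",     26523, 27191,  669,  223),
    ("Orf6",  27202, 27387,  186,   62),
    ("Orf7a", 27394, 27759,  366,  122),
    ("Orf8",  27894, 28259,  366,  122),
    ("N",     28274, 29533, 1260,  420),
    ("Orf10", 29558, 29674,  117,   39),
]

amino_acids = {}
for gene, start, end, size, listed in TABLE:
    length = end - start + 1              # 1-indexed, inclusive coordinates
    assert length == size, gene           # the size column matches the coordinates
    assert length % 3 == 0, gene          # a whole number of codons, no frameshift
    assert length // 3 == listed, gene    # the last column includes the stop codon
    amino_acids[gene] = length // 3 - 1   # the stop codon adds no residue
    print(f"{gene:6} {length:5} nt  {amino_acids[gene]:5} amino acids")
```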

By their powers combined, that explains 97.53% of the genome. If you take a look at the rest, you’ll see that there isn’t all that much left that’s not accounted for. There are the two untranslated regions: the 5’ UTR, weighing in at 265 nucleotides, and the 3’ UTR (which includes the poly-A tail), weighing in at 229 nucleotides. That covers 99.07% of the genome! The remaining 277 nucleotides are scattered in the space between the ten genes.
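The running “genome explained” percentages can be reproduced with a few lines of Python. One input is an assumption on my part: the poly-A tail length of 33 is inferred from the earlier 0.11% figure and is not stated anywhere in this essay. Everything else comes from the numbers above.

```python
GENOME_LENGTH = 29_903

ORF1AB = 21_555 - 266 + 1        # 21,290 nt
POLY_A = 33                      # assumption: inferred from the 0.11% figure
OTHER_GENES = [3822, 828, 228, 669, 186, 366, 366, 1260, 117]   # sizes from the table
UTR_5, UTR_3 = 265, 229          # the 3' UTR figure already includes the poly-A tail


def percent(nucleotides: int) -> float:
    """Share of the genome covered, as a percentage rounded to two places."""
    return round(100 * nucleotides / GENOME_LENGTH, 2)


explained = ORF1AB
print("Orf1ab:       ", percent(explained))
explained += POLY_A
print("+ poly-A tail:", percent(explained))
explained += sum(OTHER_GENES)
print("+ other genes:", percent(explained))
explained += UTR_5 + (UTR_3 - POLY_A)    # do not count the tail twice
print("+ UTRs:       ", percent(explained))
print("left over:    ", GENOME_LENGTH - explained, "nt")
```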

I now feel fairly confident that I know what the nucleotides get translated to!

Secondary structure

No discussion of the structure and function of a long, single-stranded piece of RNA is complete without a discussion of secondary structure.

Yes, RNA has secondary structure too – it’s not just for proteins! Just as double-helical DNA binds one strand to another, a single strand of RNA can bind to itself when regions have sufficiently complementary nucleotides, forming stems, hairpin loops, and yet more complex 3D structures.
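As a toy illustration of “sufficiently complementary,” the sketch below checks whether two stretches of one strand could zip up into a stem. It allows Watson-Crick pairs plus the U·G wobble pair mentioned earlier, ignores thermodynamics entirely, and uses a made-up hairpin rather than anything from the viral genome.

```python
# Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def can_form_stem(five_prime: str, three_prime: str) -> bool:
    """True if `five_prime` can pair antiparallel with `three_prime`, base by base."""
    if len(five_prime) != len(three_prime):
        return False
    return all((a, b) in PAIRS for a, b in zip(five_prime, reversed(three_prime)))


# Hypothetical hairpin: 5-base stem, 4-base loop, 5-base stem.
hairpin = "GGCAU" + "GAAA" + "GUGCC"
print(can_form_stem(hairpin[:5], hairpin[-5:]))    # the last pairing is a U-G wobble
print(can_form_stem("GGCAU", "AAAAA"))
```

Real structure prediction scores many competing foldings by free energy; this only answers whether one proposed stem is chemically plausible.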

The pseudoknot may sound familiar – it was mentioned back in our discussion of ribosomal frameshifting:

All cis-acting frameshift signals encoded in mRNAs are minimally composed of two functional elements: a heptanucleotide “slippery sequence” conforming to the general form XXXYYYZ, followed by an RNA structural element, usually an H-type RNA pseudoknot, positioned an optimal number of nucleotides (5 to 9) downstream.

One of the more complex examples of RNA secondary structure, the pseudoknot was first found in the turnip yellow mosaic virus, which is itself another single-stranded positive-sense RNA virus just like the coronavirus.

What’s an H-type RNA pseudoknot, though? According to Cao and Chen 2009,

An H-type pseudoknot is formed by base-pairing between a hairpin loop and the single-stranded region of the hairpin. The structure contains two helix stems and two loops as well as a possible third loop/junction that connects the two helix stems.

Perhaps we can determine the secondary structure of the SARS-CoV-2 genome, or if not, at least determine the secondary structure of the pseudoknot, as that part serves a very important regulatory function.

For RNAs shorter than 4000 nucleotides, there exist some online tools such as RNAfold that can make predictions about the secondary structure of arbitrary RNAs; however, I’m unable to find any that can handle our 29,903-nucleotide minor behemoth of an RNA. So, that leaves me with searching through the existing literature to see what results exist for this particular big long RNA.

RNA Genome Conservation and Secondary Structure in SARS-CoV-2 and SARS-Related Viruses (Rangan et al. 2020) provides an overview of some of the virus’s secondary structures, including the pseudoknot, as well as a thoughtful analysis of which structural features have remained conserved as viruses in this family have evolved.

As we know from our study of ribosomal frameshifting, the pseudoknot should occur around 5 to 9 nucleotides downstream of the slippery sequence. Here’s a snippet of the genome sequence around nucleotide 13,468 again:



Ultimately, as depicted in that paper’s figure 4, this is how that sequence winds itself into an H-type pseudoknot:

Starting 6 nucleotides after the heptanucleotide slippery sequence, nucleotides 13474 through 13542 together form a pseudoknot. (Here drawn schematically in 2D – in real life they would not be so stretched out!)

Note the presence here of some U·G wobble pairs, as mentioned in our earlier discussion of wobble pairings!

The pseudoknot, rendered in 3D by passing the above secondary structure first into RNAComposer and then into Web 3DNA.

For more information on secondary structure and how it can affect the life cycle and transmissibility of RNA viruses, as well as details about how secondary structure can be elucidated, I also recommend:

  • RNA Structure—A Neglected Puppet Master for the Evolution of Virus and Host Immunity (Smyth et al. 2018)
  • Viral RNAs Are Unusually Compact (Gopal et al. 2014)
  • Visualizing the global secondary structure of a viral RNA genome with cryo-electron microscopy (Garmann et al. 2015)
  • The influence of viral RNA secondary structure on interactions with innate host cell defences (Witteveldt et al. 2014)


Thanks to Laura Vaughan for a thoughtful review of this piece and for help with RNA 3D structure visualization!

Shoutout also to Nabeel Qureshi, who has worked on a similar endeavor, writing up an executable iPython notebook for understanding the full coronavirus genome.

As fate would have it, I know Ramya Rangan from our time as students of Jean Yang! It was really exciting to organically stumble upon a former colleague’s recent work, and especially to have so many of my questions about coronaviral secondary structure answered immediately by that work.

Next time: Protein is big!

That’s all for now! Stay tuned for part 2, A Mechanist’s Guide to the Coronavirus Proteome.
