yana-notes

DNA Sequencing

2022-02-13 links: reference:

Genome Sequencing #

Polymerase chain reaction (PCR) is typically used to make/amplify (millions of) copies of a DNA sample, usually via thermal cycling. In certain organisms, A, C, G, and T are not the only bases in the DNA: in mammals, methyl groups may be attached, such as in 5-methylcytosine.
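To get a feel for the "millions of copies" figure, here is a minimal sketch of idealized PCR arithmetic. It assumes every cycle multiplies the copy count by (1 + efficiency), with efficiency = 1.0 meaning perfect doubling; the function names and the efficiency parameter are made up for illustration, and real reactions plateau well before the ideal curve does.

```python
import math


def copies_after(cycles: int, start: int = 1, efficiency: float = 1.0) -> float:
    """Copies after `cycles` rounds, assuming each round multiplies by (1 + efficiency)."""
    return start * (1.0 + efficiency) ** cycles


def cycles_needed(target: float, start: int = 1, efficiency: float = 1.0) -> int:
    """Smallest whole number of cycles that reaches at least `target` copies."""
    return math.ceil(math.log(target / start) / math.log(1.0 + efficiency))


print(copies_after(20))          # 1048576.0 (perfect doubling, 2**20)
print(cycles_needed(1_000_000))  # 20
```

So under perfect doubling, a single template passes a million copies at around 20 cycles.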

Analysis #

  • By convention, a single-nucleotide variant needs to be carried by at least 1% of the population to be classified as an SNP.

  • Since each base on one strand determines its partner on the complementary strand, you only need to write down one of them. (The next question is why the body even needs two?)
  • Not only that, we have two copies of the DNA (we are diploid). Yes, chromosomes are drawn as X-shaped paired things, but each chromosome also comes as a pair, since we inherit one from each parent. These are referred to as homologs/homologous chromosomes. These are the ‘pairs’ we analyze and what gives rise to polymorphisms.
    • Allele simply refers to the nucleotide(s; can be up to hundreds) at a certain locus of interest, including of course those associated with an SNP. (Technically an SNP can have two nucleotides/alleles, or four if you count the complementary bases on the other strand.)
    • A genotype is the set of all the alleles of a genome.
      • Genotype + Environment = Phenotype
  • Okay, ‘SNP’ might also refer to a position where the allele can differ from the reference, but does not necessarily differ in a given person.
    • A haplotype is basically just a set of specific nucleotides of interest that are inherited together and are usually proximal on the chromosome. Sometimes haplotypes refer to one chromosome, sometimes both.
    • An SNP that leads to a different amino acid being encoded is a nonsynonymous substitution. Missense is a change of amino acid, and nonsense is a premature stop codon. Synonymous substitutions don’t change the AA but can still alter the protein’s expression or function.
      • Similarly, a silent mutation is one that changes the codon but not the amino acid it encodes, unlike a frameshift mutation, where an indel (of a length not divisible by three) shifts the reading frame of every codon after it.
    • The rs number/rsID (dbSNP Reference SNP) is arbitrarily assigned, I think. Each one corresponds to a specific locus and a specific variant. For example, acid sphingomyelinase’s rs120074124 represents 911T>C (p.Leu304Pro; the p. prefix marks a protein-level change in HGVS notation, so the AA change is the coding effect), associated with Niemann-Pick A.
      • Sites like SNPedia/23andme have unique ‘i’ IDs.
  • At least in the context of sequencing, the C/C, G/C etc. format is unphased, which is when you don’t know which chromosome each nucleotide is on. The C|C format is phased.
  • An indel is an insertion or deletion: some nucleotides are added or just missing relative to the reference.
  • Tandem repeat = repeated nucleotide/set of nucleotides adjacent to each other. Sometimes these are transcriptionally silent. There are also tandem repeat proteins, e.g. leucine-rich repeat domains, armadillo repeat domain.
  • In humans, >2/3 of chromosomal DNA consists of repetitive elements; ‘repeats’. Sometimes repeats are inverted, palindromic, whatever.
  • Filetypes:
    • FASTQ is what’s derived from the sequencer. It’s everything: every raw read plus a quality score for each base.
    • Alignment of a FASTQ file with a reference genome generates a BAM (binary alignment map) file and an associated BAI (binary alignment index) file. From there, a VCF (variant call format) file can be produced, which is just a list of the variants (SNPs, indels) that differ from the implied reference genome.
      • There are a few standards for reference genomes, namely GRCh37 and GRCh38, which was released 4 years after the former.
      • BAM can also be compressed into a CRAM file.
      • I think BAM is about half the size of FASTQ. That is probably down to BAM being a compressed binary format rather than it holding just one strand; the reads from the FASTQ are all still in there.
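The substitution types from the list above (synonymous, missense, nonsense) can be sketched as a lookup against the standard genetic code. This is a minimal illustration, not how any particular annotation tool works: the function name is made up, it only handles single-codon comparisons on the coding strand, and the CTG/CCG codons below are a hypothetical Leu to Pro example, since these notes do not give the actual codon at rs120074124.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), codons enumerated in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(ref_codon: str, alt_codon: str) -> str:
    """Classify a codon change by its effect on the encoded amino acid."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"   # same amino acid (or stop) either way
    if alt_aa == "*":
        return "nonsense"     # premature stop codon
    if ref_aa == "*":
        return "stop-lost"    # original stop codon now encodes an amino acid
    return "missense"         # different amino acid


print(classify_substitution("CTG", "CCG"))  # missense (Leu -> Pro, hypothetical codon)
print(classify_substitution("CTG", "CTA"))  # synonymous (Leu -> Leu)
print(classify_substitution("TGG", "TGA"))  # nonsense (Trp -> stop)
```

Both missense and nonsense count as nonsynonymous; a frameshift would need the whole downstream sequence, so it is out of scope here.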

VCF #

  • The syntax of each line is as follows:

    • (#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT PRO0000283)
    • chr1 11796321 rs1801133 G A 1347.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=1.673;ClippingRankSum=0.000;DB;DP=45;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQRankSum=0.000;QD=30.63;ReadPosRankSum=-0.686;SOR=0.242 GT:AD:DP:GQ:PL 1/1:1,43:44:99:1376,121,0
  • At the end is an array representing GT (genotype), AD (allelic depths for the ref and alt alleles in the order listed), DP (approximate read depth; reads with MQ=255 or with bad mates are filtered), GQ (genotype quality), and PL (normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification).

    • What’s obviously most important is GT. Notice it is 1/1, indicating +/+ for the ‘alternate allele’ (homozygous for ALT), while 0 or - indicates the reference allele.
  • AC: Allele count in genotypes, for each ALT allele, in the same order as listed

  • AF: Allele Frequency, for each ALT allele, in the same order as listed

  • AN: Total number of alleles in called genotypes

  • BaseQRankSum: Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities

  • ClippingRankSum: Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases

  • DB: dbSNP Membership

  • DP: Approximate read depth; some reads may have been filtered

  • DS: Were any of the samples downsampled?

  • ExcessHet: Phred-scaled p-value for exact test of excess heterozygosity

  • FS: Phred-scaled p-value using Fisher’s exact test to detect strand bias

  • HaplotypeScore: Consistency of the site with at most two segregating haplotypes

  • InbreedingCoeff: Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation

  • MLEAC: Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed

  • MLEAF: Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed

  • MQ: RMS Mapping Quality

  • MQRankSum: Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities

  • QD: Variant Confidence/Quality by Depth

  • ReadPosRankSum: Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias

  • SOR: Symmetric Odds Ratio of 2x2 contingency table to detect strand bias
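To make the line syntax above concrete, here is a minimal sketch that pulls apart the example record using only the standard library. The function name and the returned dictionary layout are made up for illustration. It splits on any whitespace because the record is pasted with spaces in these notes; real VCF files are tab-delimited, and for real work a proper VCF library is the safer choice.

```python
RECORD = (
    "chr1 11796321 rs1801133 G A 1347.77 . "
    "AC=2;AF=1.00;AN=2;BaseQRankSum=1.673;ClippingRankSum=0.000;DB;DP=45;"
    "ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQRankSum=0.000;"
    "QD=30.63;ReadPosRankSum=-0.686;SOR=0.242 "
    "GT:AD:DP:GQ:PL 1/1:1,43:44:99:1376,121,0"
)


def parse_vcf_record(line: str) -> dict:
    """Split one single-sample VCF data line into its named parts."""
    chrom, pos, vid, ref, alt, qual, filt, info, fmt, sample = line.split()[:10]

    # INFO is KEY=VALUE pairs separated by ';'. Bare keys such as DB are flags.
    info_dict = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        info_dict[key] = value if sep else True

    # FORMAT names the colon-separated fields of the sample column.
    sample_dict = dict(zip(fmt.split(":"), sample.split(":")))

    # GT indexes into [REF, ALT1, ALT2, ...]; '|' means phased, '/' unphased.
    gt = sample_dict["GT"]
    alleles = [ref] + alt.split(",")
    called = [alleles[int(i)] for i in gt.replace("|", "/").split("/") if i != "."]

    return {
        "chrom": chrom, "pos": int(pos), "id": vid, "ref": ref, "alt": alt,
        "qual": qual, "filter": filt, "info": info_dict, "sample": sample_dict,
        "phased": "|" in gt, "called_alleles": called,
    }


record = parse_vcf_record(RECORD)
print(record["sample"]["GT"], record["called_alleles"], record["phased"])
print(record["info"]["AF"], record["info"]["DB"])
```

For this record, GT 1/1 resolves to A/A (both copies carry the ALT allele), and the "/" separator marks the call as unphased.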

Tools #

Sequencers #

The first machine used by the Human Genome Project used the Sanger sequencing method. After that came next generation sequencers (NGS): 454, SOLiD, Illumina, MGI, DNBSEQ. Now we’re onto third-generation I think, with SMRT, Nanopore, etc.

  • Each base pair is 2 bits, so the ~3 billion base pairs come to about 750MB total. In reality, variation between people is limited to <1%, so the differences from a reference can be losslessly compressed to ~4MB! Human genomes as email attachments

  • Some DNA sequences are noncoding DNA, rather serving modulatory roles.

  • 1 2 might need random BioEng stuff?

  • https://www.labx.com/dna-sequencers It’s $8,000+ after the Ion Torrents. Nothing unique though.

    • Sage Sciences?
  • The Bento Lab (nanopore) is a PCR machine, microcentrifuge, gel electrophoresis unit, and transilluminator in one. $1600-$2000.
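The "2 bits per base pair, 750MB total" arithmetic from the list above can be checked with a small sketch. The packing scheme here (A=00, C=01, G=10, T=11, four bases per byte) is an arbitrary choice for illustration, and it assumes a clean ACGT-only sequence with no N or other ambiguity codes.

```python
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
DECODE = "ACGT"  # index = 2-bit code


def pack(seq: str) -> bytes:
    """Pack an ACGT string into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq.upper()):
        out[i // 4] |= CODE[base] << (2 * (i % 4))
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed bytes."""
    return "".join(
        DECODE[(data[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(length)
    )


def packed_megabytes(n_bases: int) -> float:
    """Size in MB (10**6 bytes) at 2 bits per base."""
    return n_bases * 2 / 8 / 1_000_000


print(len(pack("GATTACA")))             # 2 bytes for 7 bases
print(unpack(pack("GATTACA"), 7))       # GATTACA
print(packed_megabytes(3_000_000_000))  # 750.0
```

The length has to be stored separately, since the last byte may be only partly used.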

Thermofisher #

They acquired Life Technologies (which also worked under the Applied Biosystems brand, which ALSO acquired SOLiD in 2006) in 2014.

  • Ion Torrent is their premier NGS sequencer line. https://www.thermofisher.com/us/en/home/life-science/sequencing/next-generation-sequencing/ion-torrent-next-generation-sequencing-technology.html It uses a proprietary semiconductor technology.

  • There’s like a dozen Ion Torrent PGM 508s on labx for $1000. $1500 for the 7467 (Ubuntu 10.04). idek which is newer.

    • I think these fuckas are indeed from the 2000s. Search online for their names and it’s site upon site of people selling used models.
    • The PGMs all take chips of either 314 (20Mb; 400-550k reads), 316 (100Mb; 2-3M reads), or 318 (1Gb; 4-5.5M reads) for sequence output. Read length is 200-400 bases. >99% accurate. Single- and paired-end sequencing.
      • The workflow might require or benefit from the OneTouch. But there are plenty if you search online.
  • Their modern entry-level device is the GeneStudio S5 which takes chips 510 (2-3M) up to 550 (100-130M; 25GB)

    • They still don’t market this for whole genome sequencing.
    • No idea how much it costs - probably $50k - it ain’t used.
    • According to this flyer, the 540 and 550 chips are capable of whole genome sequencing, while for ‘low-pass genome sequencing (PGS)’ the 510 is capable but the 520/530 is recommended.

Illumina #

GC content bias describes the dependence between fragment count (read coverage) and GC content found in Illumina sequencing data.
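The GC content half of that relationship is easy to compute. Below is a minimal sketch with made-up function names; it only measures GC fraction per fragment or per fixed window, and checking for bias would mean comparing these values against read depth over the same windows, which is not shown.

```python
def gc_content(seq: str) -> float:
    """Fraction of G/C among the unambiguous (A, C, G, T) bases of `seq`."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0  # nothing but N/ambiguity codes, or an empty string
    return (seq.count("G") + seq.count("C")) / counted


def gc_by_window(seq: str, window: int) -> list[float]:
    """GC fraction of each consecutive, non-overlapping full window."""
    return [
        gc_content(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, window)
    ]


print(gc_content("ATGC"))           # 0.5
print(gc_by_window("GGCCATAT", 4))  # [1.0, 0.0]
```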

HiSeq, MiSeq, HiScanSQ, Genome Analyzer IIx, etc. They have entry-level machines but according to them, the only one capable of human whole-genome sequencing is the big-ass NovaSeq 6000.

Oxford Nanopore #

MinION #

  • Despite the ION name, look at Nanopore - because that’s the technology they use, which utilizes flowcells: often some combination of microfluidics, high precision optics and electronics; sequencing reagents are at a minimum highly pure natural biochemicals like A/G/C/T-TP, but more typically such molecules modified in all sorts of interesting ways and used with highly engineered versions of natural enzymes or other proteins.

  • The flowcell measures electrical changes caused by DNA being driven through protein pores in membranes.

  • As of 2016, the MinION Mk1B provided 2.4x coverage and 8.7Gb with one flow cell over 48 hours. Afaik, the flow cell is dead for good after 48 hours. So, maybe you get 2 good uses out of it?

    • As of 2022 it can generate up to 50Gb. Flow cells are $900 and can be stored at 2-8°C for up to 12 weeks.
  • Also required is the actual library kit, which is looking like $600 minimum maybe?

  • MinION Mk1B: $1000

    • 50Gb per flow cell when run for 72 hours at 420 bases/second.
    • Connect to a computer. Recommended specs are a 4-core i7 and a 2060 with 8GB VRAM. The software used is MinKNOW, Guppy (the basecaller), and EPI2ME for data analysis.
  • MinION Mk1C:

    • AIO device with a touchscreen and pre-installed basecalling and analysis software.
  • New method helps pocket-sized DNA sequencer achieve near-perfect accuracy

    • Using a barcoding system (I assume it’s one of those several hundred $ kits) the MinION can be incredibly accurate.
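The throughput figures in the list above (50Gb per flow cell, 72 hours, 420 bases/second) can be turned into rough coverage numbers. This is back-of-envelope arithmetic under loud assumptions: Gb is read as gigabases, the genome is taken as a round 3 billion bases, every base is treated as usable, and pores are assumed to sequence nonstop for the whole run. The function names are made up.

```python
HUMAN_GENOME_BASES = 3_000_000_000  # round figure, assumed for illustration


def mean_coverage(yield_bases: float, genome_bases: int = HUMAN_GENOME_BASES) -> float:
    """Average depth if the yield were spread evenly over the genome."""
    return yield_bases / genome_bases


def bases_per_pore(bases_per_second: float, hours: float) -> float:
    """Bases one pore reads if it runs without pause for `hours`."""
    return bases_per_second * hours * 3600


def implied_active_pores(yield_bases: float, bases_per_second: float, hours: float) -> float:
    """How many nonstop pores it would take to reach `yield_bases`."""
    return yield_bases / bases_per_pore(bases_per_second, hours)


print(round(mean_coverage(50e9), 1))               # 16.7
print(bases_per_pore(420, 72))                     # 108864000
print(round(implied_active_pores(50e9, 420, 72)))  # 459
```

So the quoted maximum works out to roughly 17x of a human genome at best, with several hundred pores needing to stay busy for the full 72 hours.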

Sequence Assembly #

  • The devices produce reads of 20-30,000 bases, depending on the reading capability of the device, and these must be assembled into contigs. It’s kinda speculative how many genes we have: older estimates were something like 30,000-50,000, though current counts of protein-coding genes are closer to 20,000, across about 3 billion base pairs. Bottom-up sequencing shears DNA into small fragments and reassembles them based on overlaps, up to the entire genome. The assembly is done via software: AMOS is an open source option, though it might be a dead project, but it may be all that is needed.
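The idea of reassembling fragments from their overlaps can be sketched as a toy greedy assembler. This is not how AMOS works (these notes don't cover its internals); it is a simplified illustration that assumes error-free reads all from the same strand, and the function names and the min_len threshold are made up. Greedy merging is easily fooled by repeats, which is part of why real assemblers use overlap graphs or de Bruijn graphs instead.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b` (0 if under min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are wholly contained in another read.
    contigs = [r for r in contigs if not any(r != o and r in o for o in contigs)]
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


print(greedy_assemble(["ATGGCGT", "GCGTACG", "TACGTTA"]))  # ['ATGGCGTACGTTA']
print(greedy_assemble(["AAAA", "CCCC"]))                   # ['AAAA', 'CCCC']
```

Reads with no sufficient overlap stay as separate contigs, which mirrors what happens with gaps in real coverage.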