Genome assembly with SPAdes (presentation)

Contents

Slide 2

Introduction

Slide 3

Why assemble?

Slide 4

Why assemble?

Sequencing data
Billions of short reads
Sequencing errors
Contaminants

Slide 5

Why assemble?

Sequencing data
Billions of short reads
Sequencing errors
Contaminants
Hard to analyze directly
Assembly
Corrects sequencing errors
Much longer sequences
Each genomic region is represented only once
May introduce errors

Slide 6

Assembly basics

Slide 7

Assembly in a perfect world

Slide 8

Assembly in the real world

Slide 9

De novo whole genome assembly

Slide 10

De novo whole genome assembly

Slide 11

Genomic repeats
TATTCTTCCACGTAGGGCCTTCCACGCTTCG

Slide 12

Genomic repeats

TATTCTTC
CTTCCACG
CACGTAGG
GGCCTTCC
CTTCCACG
CACGCTTCG
TATTCTTCCACGTAGGGCCTTCCACGCTTCG

Slide 13

Genomic repeats

TATTCTTC
CTTCCACG
CACGTAGG
GGCCTTCC
CTTCCACG
CACGCTTCG

Slide 14

Genomic repeats

TATTCTTCCACGTAGG
GGCCTTCCACGCTTCG
TATTCTTCCACGCTTCG
GGCCTTCCACGTAGG
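
The ambiguity shown on slides 12 to 14 can be reproduced in a few lines of Python. This is only an illustrative sketch: the helper names `repeated_kmers` and `explains_all_reads` are made up for this example and are not part of SPAdes. It finds the repeated 8-mer in the example genome and checks that both pairs of sequences from slide 14 contain every read, so the reads alone cannot tell the two arrangements apart.

```python
from collections import Counter

# Example genome and reads copied from the slides.
genome = "TATTCTTCCACGTAGGGCCTTCCACGCTTCG"
reads = ["TATTCTTC", "CTTCCACG", "CACGTAGG", "GGCCTTCC", "CTTCCACG", "CACGCTTCG"]


def repeated_kmers(seq, k):
    """Return every k-mer that occurs more than once in seq, with its count."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {kmer: n for kmer, n in counts.items() if n > 1}


def explains_all_reads(sequences, reads):
    """True if each read is a substring of at least one of the sequences."""
    return all(any(read in s for s in sequences) for read in reads)


# The two alternative pairs of sequences from slide 14.
option_a = ["TATTCTTCCACGTAGG", "GGCCTTCCACGCTTCG"]
option_b = ["TATTCTTCCACGCTTCG", "GGCCTTCCACGTAGG"]

print(repeated_kmers(genome, 8))             # the repeat present in two copies
print(explains_all_reads(option_a, reads))   # True
print(explains_all_reads(option_b, reads))   # True
```

Both options are consistent with the reads; only a read that spans the whole repeat (as on slides 15 and 16) resolves the ambiguity.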

Slide 15

Genomic repeats
TATTCTTCCACGTAGG
ACGTAGGGCCTT
GCCTTCCACGCTTCG
TATTCTTCCACGTAGGGCCTTCCACGCTTCG

Slide 16

Genomic repeats
TATTCTTCCACGTAGG
ACGTAGGGCCTT
GCCTTCCACGCTTCG

Slide 17

SPAdes assembler

Slide 18

SPAdes first steps

spades.py

Slide 19

SPAdes first steps

spades.py
spades.py --help
spades.py --test

Slide 20

SPAdes first steps

spades.py
spades.py --help
spades.py --test
-o <output_dir>

Slide 21

Input data formats

FASTA: .fasta / .fa
FASTQ: .fastq / .fq
Gzipped: .gz

Slide 22

Input data options

Unpaired reads
Illumina unpaired
-s single.fastq
-s single1.fastq -s single2.fastq ...

Slide 23

Input data options

Paired-end reads
Interlaced pairs in one file
>left_read_id
ACGTGCAGG…
>right_read_id
GCTTCGAGG…
Separate files
file1.fastq:
>left_read_id
ACGTGCAGG…
file2.fastq:
>right_read_id
GCTTCGAGG…

Slide 24

Input data options

Paired-end reads
Interlaced pairs in one file
--pe1-12 file.fastq
Separate files
--pe1-1 file1.fastq --pe1-2 file2.fastq

Slide 25

Input data options

Paired-end reads
Interlaced pairs in one file
--pe1-12 file.fastq
Separate files
--pe1-1 file1.fastq --pe1-2 file2.fastq
Unpaired reads from the same library
--pe1-s unpaired.fastq
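
To see how these options combine into one command line, here is a minimal Python sketch. The function `spades_command` and the output directory name `spades_out` are hypothetical, introduced only for illustration; the flags themselves (`-o`, `--pe1-12`, `--pe1-1`, `--pe1-2`, `--pe1-s`) are the ones listed on the slides.

```python
import shlex


def spades_command(output_dir, pe1_1=None, pe1_2=None, pe1_12=None, pe1_s=None):
    """Build a spades.py argument list for a single paired-end library."""
    cmd = ["spades.py", "-o", output_dir]
    if pe1_12:                      # interlaced pairs in one file
        cmd += ["--pe1-12", pe1_12]
    if pe1_1 and pe1_2:             # left and right reads in separate files
        cmd += ["--pe1-1", pe1_1, "--pe1-2", pe1_2]
    if pe1_s:                       # unpaired reads of the same library
        cmd += ["--pe1-s", pe1_s]
    return cmd


cmd = spades_command("spades_out", pe1_1="file1.fastq", pe1_2="file2.fastq",
                     pe1_s="unpaired.fastq")
print(shlex.join(cmd))
# spades.py -o spades_out --pe1-1 file1.fastq --pe1-2 file2.fastq --pe1-s unpaired.fastq
```

Passing the list to `subprocess.run(cmd, check=True)` would start the assembly, provided SPAdes is installed and on the PATH.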

Slide 26

SPAdes performance options

Number of threads
-t N
Maximum available RAM (GB)
SPAdes will terminate if this limit is exceeded
-m M

Slide 27

Pipeline options

Run only assembler (input reads are already corrected or quality-trimmed)
--only-assembler

Slide 28

Input data options

Mate-pair reads
Cannot be used separately
Interlaced pairs in one file
--mp1-12 mp.fastq
Separate files
--mp1-1 mp1.fastq --mp1-2 mp2.fastq

Slide 29

Hybrid assembly options

PacBio CLR
--pacbio pb.fastq
Oxford Nanopore reads
--nanopore nanopore_reads.fastq

Slide 30

Restarting SPAdes

SPAdes / system crashed
--continue -o your_output_dir

Slide 31

Genome assembly evaluation with QUAST

Center for Algorithmic Biotechnology
SPbU

Slide 32

In reality

SPAdes
ABySS
IDBA
Ray
Velvet
…

Slide 33

Which assembler to use?

ABySS
ALLPATHS-LG
CLC
IDBA-UD
MaSuRCA
MIRA
Ray
SOAPdenovo
SPAdes
Velvet
and many more...

Slide 34

Which assembler to use?

Different technologies (Illumina, 454, IonTorrent, ...)
Genome type and size (bacteria, insects, mammals, plants, ...)
Type of prepared libraries (single reads, paired-end, mate-pairs, combinations)
Type of data (multicell, metagenomic, single-cell)

Slide 35

There is no best assembler

Slide 36

Which assembler to use?

Assemblathon 1 & 2
Simulated and real datasets
More than 30 teams competing
Independent studies
Papers (GAGE, GAGE-B, GABenchToB)
Websites (nucleotid.es, …)
Surveys
Genome assembly evaluation tools
QUAST
GAGE

Slide 37

Assembly evaluation

Basic evaluation
No extra input
Very quick
Reference-based evaluation
A lot of metrics
Very accurate
De novo evaluation
Advanced analysis of de novo assemblies

Slide 38

Basic statistics

Only assemblies are needed (no additional input)
Very fast to compute

Slide 39

Contig sizes

Number of contigs

Slide 40

Contig sizes

Number of contigs
Number of large contigs (e.g. > 1000 bp)

Slide 41

Contig sizes

Number of contigs
Number of large contigs (e.g. > 1000 bp)
Largest contig length

Slide 42

Contig sizes

Number of contigs
Number of large contigs (e.g. > 1000 bp)
Largest contig length
Total assembly length

Slide 43

N50

The maximum length X for which the collection of all contigs of length >= X covers at least 50% of the assembly

Slide 50

N50

The maximum length X for which the collection of all contigs of length >= X covers at least 50% of the assembly

N50 = 60
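
The definition translates directly into a short Python function. This is a sketch under an assumption: the figure with the actual contigs did not survive in this text, so the contig lengths below are hypothetical values chosen so that the result matches the N50 = 60 stated on the slide.

```python
def n50(lengths):
    """Largest length X such that contigs of length >= X cover at least 50% of the total."""
    total = sum(lengths)
    covered = 0
    # Walk from the longest contig down, accumulating length until half is covered.
    for length in sorted(lengths, reverse=True):
        covered += length
        if 2 * covered >= total:
            return length
    raise ValueError("no contigs given")


contigs = [100, 80, 60, 40, 40, 30, 30]  # hypothetical lengths, total 380
print(n50(contigs))  # 60
```

The two longest contigs cover 180 of 380 (less than half); adding the third, of length 60, brings the sum to 240, which crosses the 50% mark.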

Slide 51

L50

The minimum number X such that the X longest contigs cover at least 50% of the assembly

L50 = 3

Slide 53

N50-variations

N25, N75
L25, L75

N25 = 100, N75 = 40
L25 = 1, L75 = 5

Slide 55

N50-variations

N25, N75
L25, L50, L75

Slide 56

N50-variations

N25, N75
L25, L50, L75
Nx, Lx
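
Nx and Lx generalize N50 and L50 from 50% to an arbitrary percentage x. The sketch below is one possible way to compute both at once; `nx_lx` is a name invented for this example, and the contig lengths are hypothetical, chosen so that the results agree with the values quoted earlier in the slides (N25 = 100, L25 = 1, L50 = 3, N75 = 40, L75 = 5).

```python
def nx_lx(lengths, x):
    """Return (Nx, Lx): the length of the contig at which the longest contigs
    first cover x% of the assembly, and the number of contigs needed."""
    total = sum(lengths)
    covered = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        covered += length
        if 100 * covered >= x * total:
            return length, count
    raise ValueError("no contigs given, or x is greater than 100")


contigs = [100, 80, 60, 40, 40, 30, 30]  # hypothetical lengths, total 380
for x in (25, 50, 75):
    print(x, nx_lx(contigs, x))
# 25 (100, 1)
# 50 (60, 3)
# 75 (40, 5)
```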

Slide 57

Other

Number of N’s per 100 kbp

Slide 58

Other

Number of N’s per 100 kbp
GC %

Slide 59

Other

Number of N’s per 100 kbp
GC %
Distribution of GC % in small windows:
GC=37, GC=44, GC=41, GC=...
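
These three statistics are simple to compute from the sequence itself. The following Python sketch is illustrative only: the function names are invented here, the toy sequence is made up, and the choice to ignore N's when computing GC % is an assumption of this example rather than a statement about how QUAST does it.

```python
def gc_percent(seq):
    """GC content in percent, counting only A, C, G and T (N's are ignored)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def windowed_gc(seq, window):
    """GC % of consecutive non-overlapping windows (the last one may be shorter)."""
    return [gc_percent(seq[i:i + window]) for i in range(0, len(seq), window)]


def ns_per_100kbp(seq):
    """Number of N characters, scaled to a sequence length of 100 kbp."""
    return 100_000 * seq.upper().count("N") / len(seq)


toy = "GGCC" + "AATT" + "ACGT" + "NNNN"   # made-up 16 bp sequence
print(gc_percent(toy))        # 50.0
print(windowed_gc(toy, 4))    # [100.0, 0.0, 50.0, 0.0]
print(ns_per_100kbp(toy))     # 25000.0
```

Collecting the per-window values into a histogram gives the kind of GC distribution plot referred to on the slide.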

Slide 61

Reference-based metrics

A lot of metrics
Accurate assessment

Slide 62

Basic reference statistics

Reference length
Reference GC %
Number of chromosomes

Slide 63

Basic reference statistics

NGx, LGx

NG50 = 40
LG50 = 4
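
NGx and LGx are computed like Nx and Lx, except that the target is a percentage of the reference genome length rather than of the assembly length. A minimal sketch, assuming hypothetical contig lengths and a hypothetical reference length of 500, both chosen so that the result matches NG50 = 40 and LG50 = 4 from the slide:

```python
def ngx_lgx(lengths, reference_length, x=50):
    """Return (NGx, LGx), or None if the assembly covers less than x% of the reference length."""
    covered = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        covered += length
        if 100 * covered >= x * reference_length:
            return length, count
    return None  # assembly too short to reach x% of the reference


contigs = [100, 80, 60, 40, 40, 30, 30]  # hypothetical lengths, total 380
print(ngx_lgx(contigs, reference_length=500))  # (40, 4)
```

Because the assembly (380) is shorter than the reference (500), more contigs are needed to reach the 50% mark, so NG50 is smaller than the N50 of the same contigs.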

Slide 66

Alignment statistics

Assembly

Reference genome

Slide 67

Alignment statistics

Slide 68

Alignment statistics

Genome fraction %

Slide 69

Alignment statistics

Genome fraction %
Duplication ratio

Slide 70

Alignment statistics

Genome fraction %
Duplication ratio
Number of gaps

Slide 71

Alignment statistics

Genome fraction %
Duplication ratio
Number of gaps
Largest alignment length

Slide 72

Alignment statistics

Genome fraction %
Duplication ratio
Number of gaps
Largest alignment length
Number of unaligned contigs (full & partial)

Slide 73

Alignment statistics

Genome fraction %
Duplication ratio
Number of gaps
Largest alignment length
Number of unaligned contigs (full & partial)
Number of mismatches/indels per 100 kbp

Slide 74

Alignment statistics

Genome fraction %
Duplication ratio
Number of gaps
Largest alignment length
Number of unaligned contigs (full & partial)
Number of mismatches/indels per 100 kbp
Number of genes/operons (full & partial)
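
Two of these metrics, genome fraction and duplication ratio, can be illustrated with plain interval arithmetic. The sketch below is a simplification under stated assumptions: alignments are given as hypothetical half-open `(start, end)` intervals on a single-chromosome reference, indels are ignored, and the function names are invented for this example. Genome fraction is taken as the share of reference bases covered by at least one alignment, and duplication ratio as the total aligned length in the assembly divided by the number of covered reference bases.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def alignment_stats(alignments, reference_length):
    """Return (genome fraction in %, duplication ratio) for the given alignments."""
    aligned_in_assembly = sum(end - start for start, end in alignments)
    covered_in_reference = sum(end - start for start, end in merge_intervals(alignments))
    genome_fraction = 100.0 * covered_in_reference / reference_length
    duplication_ratio = aligned_in_assembly / covered_in_reference
    return genome_fraction, duplication_ratio


# Hypothetical alignments on a reference of length 1000.
alignments = [(0, 400), (300, 600), (700, 900)]
print(alignment_stats(alignments, 1000))  # (80.0, 1.125)
```

Here 800 of the 1000 reference bases are covered, and the 100 bp overlap between the first two alignments pushes the duplication ratio above 1.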

Slide 75

Misassemblies

[Figure: a contig aligned to a reference genome with two chromosomes (Chromosome 1, Chromosome 2)]

Slide 76

Misassemblies

[Figure: a contig aligned to a reference genome with two chromosomes (Chromosome 1, Chromosome 2)]
Relocation (> 1 kbp)
Inversion
Translocation
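
The three misassembly types can be told apart by comparing where two neighbouring pieces of the same contig align. The Python sketch below is a deliberately simplified classifier, not QUAST's actual implementation: the `Alignment` class, the `classify` function and the example coordinates are all hypothetical, and real tools handle strands, overlaps and circular genomes more carefully. Only the 1 kbp threshold comes from the slide.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    chrom: str    # reference chromosome the piece aligns to
    start: int    # start position on the reference
    end: int      # end position on the reference
    strand: str   # "+" or "-"


def classify(left, right, threshold=1000):
    """Classify the junction between two consecutive alignments of one contig."""
    if left.chrom != right.chrom:
        return "translocation"          # pieces come from different chromosomes
    if left.strand != right.strand:
        return "inversion"              # pieces lie on opposite strands
    gap = right.start - left.end        # negative values mean an overlap
    if abs(gap) > threshold:
        return "relocation"             # pieces are more than 1 kbp apart (or overlap by more)
    return None                         # consistent with the reference


a = Alignment("chr1", 0, 5000, "+")
print(classify(a, Alignment("chr1", 9000, 12000, "+")))   # relocation
print(classify(a, Alignment("chr1", 5000, 8000, "-")))    # inversion
print(classify(a, Alignment("chr2", 100, 3000, "+")))     # translocation
print(classify(a, Alignment("chr1", 5010, 8000, "+")))    # None
```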

Slide 77

NB! There is no best metric

Slide 78

NA50

[Figure: contigs of Assembly A and Assembly B, with length labels 200 and 100]

Slide 79

NA50

[Figure: contigs of Assembly A and Assembly B aligned to the reference genome, with length labels 200 and 100]

Slide 80

NA50

[Figure: contigs of Assembly A and Assembly B aligned to the reference genome, with length labels 200 and 100]
Assembly A: N50 = 200, # misassemblies = 2
Assembly B: N50 = 100, # misassemblies = 0

Slide 81

NA50

[Figure: contigs of Assembly A and Assembly B aligned to the reference genome, with length labels 200 and 100]
Assembly A: N50 = 200, # misassemblies = 2, NA50 = 100
Assembly B: N50 = 100, # misassemblies = 0, NA50 = 100
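
NA50 is N50 computed on aligned blocks instead of whole contigs: each contig is first broken at its misassembly breakpoints. The sketch below shows the idea on hypothetical data. The block lengths are invented so that the numbers match the slide (N50 = 200 with 2 misassemblies versus N50 = 100 with none, and NA50 = 100 for both); the actual contigs from the figure are not recoverable from this text, and unaligned bases are ignored here.

```python
def n50(lengths):
    """Largest length X such that pieces of length >= X cover at least 50% of the total."""
    total = sum(lengths)
    covered = 0
    for length in sorted(lengths, reverse=True):
        covered += length
        if 2 * covered >= total:
            return length
    raise ValueError("no lengths given")


def summarize(contigs):
    """contigs: one list per contig, holding the lengths of its aligned blocks
    (a contig without misassemblies has exactly one block)."""
    contig_lengths = [sum(blocks) for blocks in contigs]
    misassemblies = sum(len(blocks) - 1 for blocks in contigs)
    all_blocks = [block for blocks in contigs for block in blocks]
    return {"N50": n50(contig_lengths),
            "misassemblies": misassemblies,
            "NA50": n50(all_blocks)}


assembly_a = [[100, 100], [100, 100]]    # two contigs of 200, each misassembled once
assembly_b = [[100], [100], [100], [100]]  # four correct contigs of 100

print(summarize(assembly_a))  # {'N50': 200, 'misassemblies': 2, 'NA50': 100}
print(summarize(assembly_b))  # {'N50': 100, 'misassemblies': 0, 'NA50': 100}
```

Assembly A looks twice as good by N50, but once its contigs are broken at the misassemblies both assemblies score the same.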

Slide 82

QUality ASsessment Tool for Genome Assemblies

Slide 83

QUAST

Assembly statistics
Basic statistics
Reference-based evaluation
Simple de novo evaluation
Available as a web-based and a command-line tool
quast.sf.net

Slide 84

QUAST: console tool

quast.py
quast.py --help

Slide 85

QUAST basics

quast.py
quast.py --help
quast.py contigs.fasta
quast.py [options] contigs.fasta
quast.py -o out_dir contigs.fasta

Slide 86

Reference options

Reference genome
-R reference.fasta
Gene annotation
-G genes.gff
Operon annotation
-O operons.gff
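
Putting the basic and reference options together, a full QUAST invocation can be assembled as in the Python sketch below. The helper `quast_command` is hypothetical and exists only for this example. The flags are spelled exactly as on the slides (`-o`, `-R`, `-G`, `-O`); option spellings may differ between QUAST versions, so it is worth checking them against `quast.py --help` for the installed release.

```python
import shlex


def quast_command(contigs, out_dir=None, reference=None, genes=None, operons=None):
    """Build a quast.py argument list from the options shown on the slides."""
    cmd = ["quast.py"]
    if out_dir:
        cmd += ["-o", out_dir]       # output directory
    if reference:
        cmd += ["-R", reference]     # reference genome
    if genes:
        cmd += ["-G", genes]         # gene annotation
    if operons:
        cmd += ["-O", operons]       # operon annotation
    cmd.append(contigs)              # assembly to evaluate goes last
    return cmd


cmd = quast_command("contigs.fasta", out_dir="out_dir", reference="reference.fasta",
                    genes="genes.gff", operons="operons.gff")
print(shlex.join(cmd))
# quast.py -o out_dir -R reference.fasta -G genes.gff -O operons.gff contigs.fasta
```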

Slide 87

QUAST output

Reports in different formats
Plain text table
Tab-separated values (Excel, Google Spreadsheets)
Interactive HTML
Plots (PDF/PNG/SVG)
Nx, NGx, NAx
Genes
Cumulative length
Interactive contig viewers (Icarus)
Contig alignment viewer
Contig size viewer

Slide 88

Contig alignment viewer

All alignments for each contig
Misassembly details
Contig ordering along the genome
Overlaps / gaps

Slide 89

Contig alignment viewer

Slide 90

Contig size viewer

Contigs ordered from longest to shortest
N50, N75 (NG50, NG75)
Filtering by contig size
Gene prediction results
Available without a reference

Slide 91

Contig size viewer

Slide 92

De novo evaluation

Slide 93

Read-based statistics

Number of aligned/unaligned reads
% of assembly covered by reads

Slide 94

Read-based statistics

Number of aligned/unaligned reads
% of assembly covered by reads
Points with low coverage
Points with multiple read clipping
Points with incorrect insert sizes

Slide 95

Annotation-based statistics

Number of ORFs

Slide 96

Annotation-based statistics

Number of ORFs
Number of gene/operon-like regions
GeneMarkS (Borodovsky et al.)
GlimmerHMM (Majoros et al.)

Slide 97

Annotation-based statistics

Number of ORFs
Number of gene/operon-like regions
GeneMarkS (Borodovsky et al.)
GlimmerHMM (Majoros et al.)
Number of conserved genes
BUSCO (Simão et al.)
CEGMA (Korf et al., no longer supported)
File name: Genome-assembly-with-SPAdes.pptx