Genome assembly with SPAdes (presentation)

August 6, 2021


Slide 2

Introduction

Slide 3

Why assemble?

Slide 4

Why assemble?
Sequencing data
Billions of short reads
Sequencing errors
Contaminants

Slide 5

Why assemble?
Sequencing data
Billions of short reads
Sequencing errors
Contaminants
Assembly
Corrects sequencing errors
Much longer sequences
Each genomic region is represented only once
May introduce errors

Hard to perform analysis

Slide 6

Assembly basics

Slide 7

Assembly in a perfect world

Slide 8

Assembly in the real world

Slide 9

De novo whole genome assembly

Slide 10

De novo whole genome assembly

Slide 11

Genomic repeats
TATTCTTCCACGTAGGGCCTTCCACGCTTCG

Slide 12

Genomic repeats
TATTCTTC
CTTCCACG
CACGTAGG
GGCCTTCC
CTTCCACG
CACGCTTCG
TATTCTTCCACGTAGGGCCTTCCACGCTTCG

Slide 13

Genomic repeats
TATTCTTC
CTTCCACG
CACGTAGG
GGCCTTCC
CTTCCACG
CACGCTTCG

Slide 14

Genomic repeats
TATTCTTCCACGTAGG
GGCCTTCCACGCTTCG
TATTCTTCCACGCTTCG
GGCCTTCCACGTAGG
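
The two contig sets on this slide can be checked against the short reads from the earlier slides. The following is a minimal Python sketch, not part of the original deck: the helper name `explains_reads` is hypothetical, and plain substring matching stands in for real read alignment. It suggests that both reconstructions account for every read, so the short reads alone cannot decide between them.

```python
# Short reads from the "Genomic repeats" slides.
READS = ["TATTCTTC", "CTTCCACG", "CACGTAGG", "GGCCTTCC", "CTTCCACG", "CACGCTTCG"]

# Two alternative reconstructions built from the same reads.
ASSEMBLY_A = ["TATTCTTCCACGTAGG", "GGCCTTCCACGCTTCG"]
ASSEMBLY_B = ["TATTCTTCCACGCTTCG", "GGCCTTCCACGTAGG"]


def explains_reads(contigs, reads):
    """Return True if every read occurs as a substring of some contig."""
    return all(any(read in contig for contig in contigs) for read in reads)


# Both calls print True: the repeat CTTCCACG makes the two layouts indistinguishable.
print(explains_reads(ASSEMBLY_A, READS))
print(explains_reads(ASSEMBLY_B, READS))
```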

Slide 15

Genomic repeats
TATTCTTCCACGTAGG
ACGTAGGGCCTT
GCCTTCCACGCTTCG
TATTCTTCCACGTAGGGCCTTCCACGCTTCG

Slide 16

Genomic repeats
TATTCTTCCACGTAGG
ACGTAGGGCCTT
GCCTTCCACGCTTCG
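
With a read that links the two contigs, the ambiguity disappears. The sketch below is an illustration only: it assumes the three reads are already given in left-to-right order and merges them by their longest suffix-prefix overlaps. The function names `overlap` and `merge_in_order` are hypothetical, and this naive merge is not how SPAdes itself works.

```python
def overlap(left, right):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def merge_in_order(reads):
    """Merge reads left to right, collapsing each suffix-prefix overlap."""
    merged = reads[0]
    for read in reads[1:]:
        merged += read[overlap(merged, read):]
    return merged


reads = ["TATTCTTCCACGTAGG", "ACGTAGGGCCTT", "GCCTTCCACGCTTCG"]
genome = "TATTCTTCCACGTAGGGCCTTCCACGCTTCG"

# The merged reads reproduce the original sequence from the first repeat slide.
print(merge_in_order(reads) == genome)
```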

Slide 17

SPAdes assembler

Slide 18

SPAdes first steps
spades.py

Slide 19

SPAdes first steps
spades.py
spades.py --help
spades.py --test

Slide 20

SPAdes first steps
spades.py
spades.py --help
spades.py --test
-o
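
For scripted use, the first steps above could be wrapped in Python. This is a hedged sketch that assumes `spades.py` is installed and on `PATH`; the helper names are hypothetical. It only uses the `--test` invocation shown on the slide.

```python
import shutil
import subprocess


def spades_available():
    """Return True if a `spades.py` executable is found on PATH."""
    return shutil.which("spades.py") is not None


def run_self_test():
    """Run `spades.py --test` if SPAdes is installed.

    Returns the process exit code, or None when spades.py is not found.
    """
    if not spades_available():
        return None
    return subprocess.run(["spades.py", "--test"]).returncode
```

Calling `run_self_test()` on a machine without SPAdes simply returns `None` instead of raising an error.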

Slide 21

Input data formats
FASTA: .fasta / .fa
FASTQ: .fastq / .fq
Gzipped: .gz
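
The extension table above can be expressed as a small lookup. The sketch below is only a mapping of the slide's extensions, with a hypothetical function name; the way SPAdes itself detects formats may differ.

```python
def detect_read_format(filename):
    """Return (format, gzipped) guessed from the file extension.

    `format` is "FASTA", "FASTQ", or None when the extension is not listed.
    """
    name = filename.lower()
    gzipped = name.endswith(".gz")
    if gzipped:
        name = name[: -len(".gz")]
    if name.endswith((".fasta", ".fa")):
        return "FASTA", gzipped
    if name.endswith((".fastq", ".fq")):
        return "FASTQ", gzipped
    return None, gzipped


print(detect_read_format("reads.fq.gz"))
```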

Slide 22

Input data options
Unpaired reads
Illumina unpaired
-s single.fastq
-s single1.fastq -s single2.fastq ...

Slide 23

Input data options
Paired-end reads
Interlaced pairs in one file
>left_read_id
ACGTGCAGG…
>right_read_id
GCTTCGAGG…
Separate files
file1.fastq
>left_read_id
ACGTGCAGG…
file2.fastq
>right_read_id
GCTTCGAGG…
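
The relationship between the two layouts can be sketched in Python. The example assumes, purely for illustration, that every record is exactly two lines (a header and a sequence) and that mates strictly alternate left, right; the function name and the read IDs and sequences are made up.

```python
def split_interlaced(lines):
    """Split interlaced two-line records into (left_mates, right_mates)."""
    records = [(lines[i], lines[i + 1]) for i in range(0, len(lines) - 1, 2)]
    # Even-numbered records are left mates, odd-numbered ones are right mates.
    return records[0::2], records[1::2]


interlaced = [
    ">pair1_left", "ACGTGCAGG",
    ">pair1_right", "GCTTCGAGG",
    ">pair2_left", "TTGACCATG",
    ">pair2_right", "CAGGTTACA",
]
left, right = split_interlaced(interlaced)
print(left)
print(right)
```

Writing `left` and `right` to two files would give the "separate files" layout.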

Slide 24

Input data options
Paired-end reads
Interlaced pairs in one file
--pe1-12 file.fastq
Separate files
--pe1-1 file1.fastq --pe1-2 file2.fastq

Slide 25

Input data options
Paired-end reads
Interlaced pairs in one file
--pe1-12 file.fastq
Separate files
--pe1-1 file1.fastq --pe1-2 file2.fastq

--pe1-s unpaired.fastq
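
The paired-end flags above can be assembled into a full command line programmatically. In this sketch the helper `paired_end_args` and the output directory name `out_dir` are assumptions for illustration; the flags themselves are the ones shown on the slide, and the command is only printed, not executed.

```python
import shlex


def paired_end_args(left=None, right=None, interlaced=None, unpaired=None, lib=1):
    """Build SPAdes arguments for one paired-end library (number `lib`)."""
    args = []
    if interlaced:
        args += [f"--pe{lib}-12", interlaced]
    if left and right:
        args += [f"--pe{lib}-1", left, f"--pe{lib}-2", right]
    if unpaired:
        args += [f"--pe{lib}-s", unpaired]
    return args


command = [
    "spades.py",
    *paired_end_args(left="file1.fastq", right="file2.fastq", unpaired="unpaired.fastq"),
    "-o", "out_dir",  # hypothetical output directory
]
print(shlex.join(command))
```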

Slide 26

SPAdes performance options
Number of threads
-t N
Maximal available RAM (GB)
SPAdes will terminate if exceeded
-m
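
One possible way to pick values for these two options from a script is sketched below. The helper name is hypothetical, and falling back to the number of CPUs reported by the operating system is an assumption of this example, not SPAdes behaviour.

```python
import os


def performance_args(threads=None, memory_gb=None):
    """Build `-t` / `-m` arguments; the memory limit is given in GB."""
    if threads is None:
        # Assumption: use all CPUs the OS reports when no count is given.
        threads = os.cpu_count() or 1
    args = ["-t", str(threads)]
    if memory_gb is not None:
        args += ["-m", str(memory_gb)]
    return args


print(performance_args(threads=8, memory_gb=64))
```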

Slide 27

Pipeline options
Run only assembler (input reads are already corrected or quality-trimmed)
--only-assembler

Slide 28

Input data options
Mate-pair reads
Cannot be used separately
Interlaced pairs in one file
--mp1-12 mp.fastq
Separate files
--mp1-1 mp1.fastq --mp1-2 mp2.fastq
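
The "cannot be used separately" rule can be checked before launching a run. The sketch below assumes the rule means that a mate-pair library must be accompanied by at least one paired-end or unpaired library; that reading, and the function name, are assumptions of this example.

```python
def check_libraries(args):
    """Raise ValueError if mate-pair libraries are the only input given."""
    has_mate_pair = any(arg.startswith("--mp") for arg in args)
    has_other = any(arg == "-s" or arg.startswith("--pe") for arg in args)
    if has_mate_pair and not has_other:
        raise ValueError("mate-pair reads need a paired-end or unpaired library")
    return True


# Passes: the mate-pair library is combined with a paired-end library.
check_libraries(["--pe1-12", "file.fastq", "--mp1-12", "mp.fastq"])
```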

Slide 29

Hybrid assembly options
PacBio CLR
--pacbio pb.fastq
Oxford Nanopore reads
--nanopore nanopore_reads.fastq

Slide 30

Restarting SPAdes
SPAdes / system crashed
--continue -o your_output_dir
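
A restart could be scripted along these lines. This is a sketch with a hypothetical helper name; it only builds the command shown on the slide after checking that the output directory from the interrupted run still exists.

```python
from pathlib import Path


def continue_command(output_dir):
    """Build the restart command for a previous run's output directory."""
    if not Path(output_dir).is_dir():
        raise FileNotFoundError(f"no previous SPAdes output in {output_dir!r}")
    return ["spades.py", "--continue", "-o", str(output_dir)]
```

The returned list can be passed to `subprocess.run` once SPAdes is available.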

Slide 31

Genome assembly evaluation with QUAST
Center for Algorithmic Biotechnology
SPbU

Slide 32

In reality
SPAdes
ABySS
IDBA
Ray
Velvet
….

Slide 33

Which assembler to use?
ABySS
ALLPATHS-LG
CLC
IDBA-UD
MaSuRCA
MIRA
Ray
SOAPdenovo
SPAdes
Velvet
and many more...

Slide 34

Which assembler to use?
Different technologies (Illumina, 454, IonTorrent, ...)
Genome type and size (bacteria, insects, mammals, plants, ...)
Type of prepared libraries (single reads, paired-end, mate-pairs, combinations)
Type of data (multicell, metagenomic, single-cell)

Slide 35

There is no best assembler

Slide 36

Which assembler to use?
Assemblathon 1 & 2
Simulated and real datasets
More than 30 teams competing
Independent studies
Papers (GAGE, GAGE-B, GABenchToB)
Websites (nucleotid.es, …)
Surveys
Genome assembly evaluation tools
QUAST
GAGE

Slide 37

Assembly evaluation
Basic evaluation
No extra input
Very quick
Reference-based evaluation
A lot of metrics
Very accurate
De novo evaluation
Advanced analysis of de novo assemblies

Slide 38

Basic statistics
Only assemblies are needed (no additional input)
Very fast to compute

Slide 39

Contig sizes
Number of contigs

Slide 40

Contig sizes
Number of contigs
Number of large contigs (i.e. > 1000 bp)

Slide 41

Contig sizes
Number of contigs
Number of large contigs (i.e. > 1000 bp)
Largest contig length
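
These three numbers need only the assembly itself. The sketch below computes them from FASTA text using the slide's "> 1000 bp" threshold; the function names and the toy contigs are made up for illustration, and an evaluation tool such as QUAST may define its thresholds differently.

```python
def contig_lengths(fasta_text):
    """Return the length of each sequence in a FASTA-formatted string."""
    lengths, current = [], None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def contig_size_stats(lengths, large_threshold=1000):
    """Number of contigs, number longer than the threshold, and the largest."""
    return {
        "contigs": len(lengths),
        "large_contigs": sum(1 for n in lengths if n > large_threshold),
        "largest_contig": max(lengths, default=0),
    }


# Toy assembly: contigs of 1500, 1000 and 300 bp.
fasta = ">c1\n" + "A" * 1500 + "\n>c2\n" + "C" * 1000 + "\n>c3\n" + "G" * 300 + "\n"
stats = contig_size_stats(contig_lengths(fasta))
print(stats)
```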

Slide 42