Factorial estimating assembly base errors using k-mer abundance difference (KAD) between short reads and genome assembled sequences-布布扣-bubuko.com

Abstract

Genome sequences provide genomic maps with a single-base resolution for exploring genetic contents. Sequencing technologies, particularly long reads, have revolutionized genome assemblies for producing highly continuous genome sequences. However, current long-read sequencing technologies generate inaccurate reads that contain many errors. Some errors are retained in assembled sequences, which are typically not completely corrected by using either long reads or more accurate short reads. The issue commonly exists, but few tools are dedicated for computing error rates or determining error locations. In this study, we developed a novel approach, referred to as k-mer abundance difference (KAD), to compare the inferred copy number of each k-mer indicated by short reads and the observed copy number in the assembly. Simple KAD metrics enable to classify k-mers into categories that reflect the quality of the assembly. Specifically, the KAD method can be used to identify base errors and estimate the overall error rate. In addition, sequence insertion and deletion as well as sequence redundancy can also be detected. Collectively, KAD is valuable for quality evaluation of genome assemblies and, potentially, provides a diagnostic tool to aid in precise error correction. KAD software has been developed to facilitate public uses.

Issue Section:

Standard Article

INTRODUCTION

DNA sequencing technologies have revolutionized genetic and genomic analyses, facilitating de novo assemblies of genomes from various species with small to large complex genomes of species such as wheat

(1). Genome assemblies using Illumina technologies, here referred to as short-read sequences, are typically highly fragmented but sequence bases are accurate. The contiguity of genome assemblies can be dramatically improved by using long single-molecule sequencing technologies principally led by Pacific Biosciences SMRT and Oxford Nanopore (ONT) platforms

(2). Sequencing reads yielded from both technologies have relatively high rates of errors, and are dominated by small insertions or deletions. Although consensus sequences from high coverage of sequencing reads reduce errors in genome assemblies, nonrandom errors, or biased errors, in reads can result in inaccurate consensus sequences. Biased errors could be caused by epigenomic modifications that affect sequencing signals for base calling (3,4).

To mitigate per-base errors in a draft assembly, sequencing polishing algorithms have been developed using signal-level raw data (5,6). However, for genome sequences of many species, a great number of errors still exist after multiple rounds of sequence polishing, or error correction (7). Practically, errors can be further reduced via correction with additional Illumina short reads. Variants revealed by alignments of Illumina reads with assembled sequences are typically used for error correction (8). This strategy works well for small low-repetitive genomes because most assembly regions can be uniquely covered by Illumina reads. For large repetitive genomes, the strategy works less well due to a lower proportion of genomes uniquely aligned by Illumina reads, a higher rate of misalignments and even no alignments at some poorly assembled regions (9).

Assembly quality of genome sequences is related to assembly contiguity and completeness, correctness of sequence ordering and consensus base accuracy. Community-based projects such as GAGE (10) and Assemblathon (11) looked at a suite of criteria for a comprehensive assessment when benchmarking different assembly methods when the ground truth assembly is known. The value of N50, the length of the smallest contig of a set of the top long contigs that cover half of assembly space, is widely used as an indicator of assembly contiguity. Alignment rates or number of variants based on alignment to assembled sequences with genome sequencing reads or RNA sequencing reads, and comparison against the reference genomes of related varieties or species can indicate assembly quality. Tools were developed for comparing some of these parameters for genome assemblies, such as QUAST (12). Conserved benchmarking universal single-copy orthologs (BUSCO) (13) and core eukaryotic genes mapping approach (CEGMA) (14) have been used to assess genome completeness simply based on evaluating the coding or gene space. LTR (long terminal repeat) assembly index that indicates the assembly quality of LTR retrotransposons was designed to evaluate assembly continuity, extending assembly quality assessment to repetitive regions (15).

In addition, approaches for genome assembly characterization were also developed based on profiles of k-mers, substrings of length k from longer DNA sequences. K-mer-based approaches have been used to quantify genome size, repetitive levels and heterozygosity in assembled sequences (16–19), and to perform reference-free genome comparisons based on sequencing data (20,21). A method KAT (k-mer analysis toolkit) was developed to profile k-mer spectra of both sequencing reads and assemblies and to visualize the difference of k-mer abundance in the assembly and read data (22). Here, we quantified abundance of k-mers from sequencing reads and k-mer occurrences in the assembly genome, and developed a single value, k-mer abundance difference (KAD), per k-mer. Given a set of input reads, KAD analysis can evaluate the accuracy of nucleotide base quality at both genome-wide and single-locus levels, which, indeed, is appropriate, efficient and powerful for assessing genome sequences assembled with inaccurate long reads.

MATERIALS AND METHODS

Simulation of genome sequences with single-nucleotide substitution errors

Genome sequences were simulated with various single-nucleotide substitution errors using the software simuG (23). The parameters of simuG were set as ‘-snp_count < total nucleotide number of genome × variation rate > -titv_ratio 0.5’.

Simulation of genome sequences with different error types

The software simuG was also used to simulate genome sequences with different error types (23). For single-nucleotide substitution errors, the parameters of simuG were set as ‘-snp_count < total nucleotide number of genome × variation rate > -titv_ratio 0.5’. For short insertions and deletions (INDELs), the parameter was set as ‘-indel_count < total nucleotide number of genome × variation rate > -ins_del_ratio 1’. For long sequence redundancy, the parameter was set as ‘-cnv_count < number of long sequence redundancy > -cnv_gain_loss_ratio Inf’. For long sequence deletion, the parameter was set as ‘-cnv_count < number of long sequence redundancy > -cnv_gain_loss_ratio 0’.

Simulation of reads with and without errors

The software DWGSIM (https://github.com/nh13/DWGSIM) was used to generate reads with or without errors (error-free reads) using reference genomes of multiple species with various genome sizes. To simulate reads without errors with different read depths, the parameter was set as ‘-e 0 -E 0 -C < read depth > -1 150 -2 150 -r 0 -R 0 -X 0 -y 0 -c 0 -S 0’. To simulate reads with single-nucleotide substitution errors with 50× read depth, the parameter was set as ‘-e < sequencing error rate > -E < sequencing error rate > -C 50 -1 150 -2 150 -r 0 -R 0 -X 0 -y 0 -c 0 -S 0’. The parameter -C represents read depth, and the parameters -1 and -2 represent the lengths of the first and second reads of paired-end reads. To simulate reads without errors, all the parameters controlling error rates in reads (-r, -R, -X, -y, -c, -S) were set to 0.

Calculation of true capture rate and false capture rate

Error k-mers were identified with the KAD script ‘KADprofile.pl’, which determined the KAD value per k-mer and identified error k-mers. To calculate the true capture rate (TCR) value that stands for the percentage of simulated base errors detected by KAD, all the error k-mers detected by KAD were aligned to their simulated genomes with bowtie (24) and the overlapping error k-mers were merged into error regions, which was implemented by the KAD script ‘KADdist.pl’. The ratio of simulated errors located in error regions to total simulated errors was calculated as the TCR value. The false capture rate (FCR) value stands for the percentage of error k-mers that do not overlap with simulated base errors. Therefore, the FCR value was determined by calculating the ratio of the number of error k-mers that do not overlap with simulated errors to the total number of error k-mers detected by KAD.

Xvv1601 whole genome sequencing via PacBio and genome assembly

The strain Xanthomonas vasicola pv. vasculorum (Xvv1601) was collected in a Kansas corn field in 2016. Bacterial growth and DNA extraction referred to a previous procedure (25). The 10–20 kb whole genome shotgun (WGS) libraries were constructed using Xvv1601 genomic DNAs. The library was sequenced with P6-C4 chemistry on SMRT cells of PacBio RS II at the Yale Center for Genomic Analysis. PacBio genome assembly: Canu (v1.3) was previously used for genome assembly (26). PacBio reads with the minimum length of 5 kb were used.

To generate Illumina data for the bacterium Xvv1601, the sequencing library was prepared using the Illumina TruSeq DNA LT Sample Prep Kit. Paired-end 2 × 300 bp reads were generated on an Illumina MiSeq at the Integrated Genomic Facility at Kansas State University. To examine the impact of read depths on error detection, the module ‘sample’ of the software seqtk (https://github.com/lh3/seqtk) was used to downsample Illumina reads to ～90×, ～80×, ～70×, ～60×, ～50×, ～40×, ～30× and ～20×.

Identification of polymorphisms between two Xvv1601 assembly versions

The software MUMmer 4 (27) was used to identify DNA polymorphisms between the two assemblies (canu and final) of the bacterial strain Xvv1601. The two assembly sequences were aligned with the nucmer command. Alignments were filtered with the command delta-filter with (-1 -l 10000 -i 90), which resulted in unique alignments with at least 10 kb matches and at least 90% identity between the two assembled genomes. The alignments passing the filtering were used for the variant discovery with ‘show-snps’.

B71 whole genome sequencing using Nanopore MinION

Wheat blast, a devastating emerging wheat disease, is caused by the fungus Magnaporthe oryzae Triticum (MoT) (28). Nanopore long reads from a virulent MoT strain B71 were produced for the de novo genome assembly. B71 nuclear genomic DNA was prepared as described previously (28). Genomic DNA was subjected to 20 kb size selection using Bluepippin cassette kit BLF7510 with High-Pass Protocol (Sage Science, USA), followed by library preparation with the SQK-LSK109 kit (Oxford Nanopore, UK). Library was loaded to the flow cell FLO-MIN106D (Oxford Nanopore, UK) and sequenced on MinION (Oxford Nanopore, UK). Guppy version 2.2.2 was used to convert Nanopore raw data (fast5) to FASTQ data with default parameters.

B71 Nanopore genome assembly and sequence polishing

Nanopore reads were input to Canu 1.8 for genome assembly with the following parameters: genomeSize = 45m; minReadLength = 5000; minOverlapLength = 1000; corOutCoverage = 80 (26). Nanopore reads were aligned back to the Canu assembly with minimap2 (2.14-r892) with the parameter of (-ax map-ont). Alignments in BAM format converted by Samtools (1.9) were input for assembly polishing with Nanopolish (version 0.11.0) with default parameters (https://github.com/jts/nanopolish). The Nanopolish procedure was repeated twice. Nanopolish-polished sequences were further polished with trimmed reads of Illumina sequencing data (SRA accession SRR6232156) using Pilon (version 1.23) (8). In the Pilon polishing, Illumina reads were aligned to Nanopolish-polished sequences with the aligner bwa (0.7.12-r1039) with default parameters of the module ‘mem’ (29). Pilon used bwa alignments and polish assembled sequences with the parameters of (–minmq 40 –minqual 15). Assembled contigs were renamed based on their similarity to the assembly of B71Ref1 (28). We also manually fixed a misassembly that joined a previously identified mini-chromosome’s sequence with chromosome 6. The assembly and polishing procedure resulted in ONTv0.14. The same Pilon procedure was applied to further polish the PacBio assembly B71Ref1 (28) with Illumina reads, resulting in a new assembly PBRef1.3.

Whole genome sequence alignment via NUCmer

The nucmer command from the software MUMmer 4 (27) was used for whole genome alignment between B71Ref1.3 and ONTv0.14. The parameter of ‘-L 1000’ was used in the command nucmer and the parameter of ‘-L 5000 -I 98’ in the command of show-coords, which resulted in alignments with at least 5 kb matches and at least 98% identity between the two assembled genomes.

KAD analysis to analyze B71 genome assemblies

Using trimmed B71 Illumina reads, the KAD analysis was performed for both assemblies ONTv0.14 and B71Ref1.3 with the script ‘KADprofile.pl’, which determined the KAD value per k-mer and grouped k-mers. The script ‘KADdist.pl’ was used to map k-mers to the assembly genomes and profile distributions of k-mers from each k-mer group, particularly the group of error k-mers.

KAD analysis to assess the maize reference genome

The version 4 of maize reference genome of the bred line B73 was downloaded from ensemblgenomes.org (30). B73 WGS sequencing data were from SRA accessions of SRR4039069 and SRR4039070. KAD was run using trimmed WGS data with 47-mer as the k-mer length.

KAT analysis

The KAT software (version 2.4.2) was downloaded from the KAT GitHub (https://github.com/TGAC/KAT). Different tools in KAT were run with the Xvv1601 and B71 datasets, respectively, to test KAT analysis. The KAT hist was used to identify k-mer spectra from Illumina sequencing data of Xvv1601 and B71, and the parameters of KAT hist were -t 4 -l 1 -h 1000000 -i 1 -m 25 -H 100000000. The KAT gcp was used to calculate GC contents of k-mers in Xvv1601 and B71 sequencing data, and the parameters of KAT gcp were -t 4 -x 1 -y 1000 -m 25 -H 100000000. The KAT comp was used to compare the sequencing data and assemblies. For Xvv1601, two assemblies, canu and final, were compared to Xvv1601 sequencing data with the KAT comp parameters -t 4 -m 25. For B71, two assemblies, ONTv0.14 and B71Ref1.3, were compared to B71 sequencing data with the same parameters.

KAD analysis on heterozygous genomes

The simulated heterozygous Escherichia coli genomes were used to perform the KAD analysis. Simulations were performed on the E. coli reference genome with the single-nucleotide substitution frequencies from 1% to 10%. The E. coli reference genome was combined with those simulated genomes to form heterozygous genomes. From each heterozygous genome, 25× reads with no errors were simulated from each haplotype. Then, the KAD analyses were performed using the E. coli reference genome with each simulated heterozygous read dataset.

RESULTS

Rationale of KAD profiling and software development

With the availability of long-noisy-read and short-accurate-read sequencing technologies, genome sequences nowadays are often constructed by using both long and short WGS reads. The Illumina sequencing platform is the dominant short-read technology and produces accurate reads with ～0.1% error rate, predominated by single-nucleotide substitutions (31). High-depth sequencing data (e.g. 30× or above) and relatively uniformly distributed reads across the genome enable to quantify genome content through k-mer analysis. Specifically, the abundance of a k-mer from short reads is correlated with occurrence of the k-mer in the genome (21). For most genomes, single-copy k-mers each of which is present once in the genome are dominant among all k-mer sequences (non-redundant k-mers with one or multiple copies) derived from the genome. The mode of sequencing depths of single-copy k-mers (?

Base errors in assemblies can be detected through KAD analysis

The use of KAD in error detection was tested by the simulation using an E. coli reference genome (4.7 Mb). The KAD calculation requires a genome assembly and short reads generated from the genome. We simulated 50× reads without errors (error-free reads) from the E. coli reference genome and 10 sets of genome sequences with 0.1–1% single-nucleotide substitution errors (see the ‘Materials and Methods’ section). The KAD value using 25-mer as the k-mer size was determined for each k-mer derived from simulated genomes that contain varying numbers of errors. A k-mer with the KAD value equaling ?1 was referred to as an error k-mer. As expected, the numbers of error k-mers increased with error rates of the simulated genomes (Figure 1A). We extended simulations using larger genomes from four additional species, namely yeast (Saccharomyces cerevisiae, 12.4 Mb), Arabidopsis (Arabidopsis thaliana, 121.6 Mb), rice (Oryza sativa japonica, 381.3 Mb) and maize (Zea mays, 2.2 Gb). For each genome, 50× error-free reads were simulated. Similar to the simulation using the E. coli genome, the number of error k-mers detected by KAD analysis in four species showed a linear correlation with error rates (Figure 1A, Supplementary Figure S1 and Supplementary Table S1), which supported the conjecture that the number of error k-mers determined by KAD analysis accurately reflects error rates regardless of genome size and complexity.

Figure 1.

Open in new tab Download slide

Error detection through KAD analysis. (A) Number of error k-mers detected in simulated E. coli and maize genomes with different simulated error rates. The left and right y-axes represent the number of error k-mers detected in the simulated E. coli and maize genome sequences, respectively. (B) Under the 0.5% rate of errors in simulated genomes, ratios of the number of error k-mers to the number of simulated errors in five simulated genomes were plotted versus corresponding genome sizes.

Error k-mers were mapped to simulated genomes and the regions covered by error k-mers were referred to as error regions. The simulated errors located in error regions were considered as errors captured by KAD and the percentage of captured errors was defined as TCR. In relatively small genomes (E. coli, yeast and Arabidopsis) with error rates ranging from 0.1% to 1%, the TCR values were all >99% (Table 1), which indicated that KAD analysis can inform almost all errors in these simulated genomes. For a genome with a moderate size (rice), the TCR values reduced to ～95%. The TCR values were further reduced to ～67% for maize that has a large and complicated genome, suggesting that the size and complexity of a genome have impacts on error detection.

DISCUSSION

Remarkable progress has been made to advance genome assemblies in recent decades, including cost-efficient and accurate high-throughput short-read sequencing, long-read single-molecule sequencing, and improved assembly and error correction algorithms (36,37). The evaluation of multiple versions of assemblies using different assembly algorithms and various assembly procedures is critical for the optimization of genome assemblies. It is also important to assess the final assembly products for information on errors and incompleteness. Here, we developed a simple but effective method for genome evaluation based on the quantitative comparison of k-mer abundances between accurate short-read sequencing data and the assembly sequences. The method, referred to as KAD, depicts how sequence contents from high-depth short reads are represented in each examined genome assembly, identifies the regions in the assembly where base errors occur, estimates the overall error rate of an assembly and finds the regions containing potential sequence redundancy or incompleteness. Additionally, problems with short-read data, if they exist, could be detected. The KAD method has been implemented and the scripts are freely accessible (https://github.com/liu3zhenlab/kad), which would be useful for genome assembly projects of a wide range of species from small bacterial genomes to large eukaryotic genomes.

KAD first analyzes high-depth (40× or higher recommended) Illumina data to determine the sequencing depth, or the coverage of the genome. This analysis assumes that k-mers having a single copy in the genome are most abundant among all k-mers generated from the genome, which is true for genomes of most species if the k-mer length is sufficiently long (e.g. 25 nt). The method can be used to assess genome sequences of heterozygous diploids, but caution is needed when analyzing polyploidy genomes such as wheat. Based on that assumption, the k-mer abundance that occurs most frequently (the mode of k-mer abundances) should represent the sequencing depth of single-copy k-mers if reads contain no sequencing errors. Reads with sequencing errors, even at a low frequency of sequencing errors, generate a very high number of k-mers with small abundance (e.g. 1–3), and can slightly reduce k-mer abundance of each correct k-mer. Such reduction is negligible as it is very small. We thus use the mode of k-mer abundances (the most frequent k-mer abundance that is not small) as the estimate of the sequencing depth, which is an important parameter in the formula to calculate KADs.

For a complete assembly with no errors and unbiased sequencing data with no contamination, the KAD profile would be ideal in that KADs of all k-mers would be around 0. In reality, problematic k-mers with KADs distant from 0 can be found. We categorized problematic k-mers into four groups: OverRep, Error, LowUnderRep and HighUnderRep. The group of ‘Error’ is well defined. KAD values of all k-mers of this group are –1, which indicates that no such k-mers are produced from sequencing reads, but they are present in the assembly. Importantly, the number of k-mers in this group (error k-mers) reflects the sequence error rate of the assembly. Our simulation data indicate that the assembly error rate can be estimated using the number of error k-mers. This estimation is particularly useful nowadays because many genome assemblies retain a number of errors from noisy long reads that are used for assemblies (7). Importantly, most of these errors create new sequences that are not present in the actual genome sequences. Therefore, most error k-mers can be unambiguously mapped to error regions in the assembled sequences even though error k-mers are located in highly repetitive regions. This provides a unique approach to identify errors in repetitive sequences that are error-prone. The other three groups are arbitrarily categorized by defining the range of KADs. KAD thresholds to define these categories are adjustable but the defaults worked well for all the genomes that we tested.

The over-represented group includes k-mers with low KAD values (e.g. <?1), representing k-mers occurring multiple times in the assembly, but are absent or represented less frequently in reads than expected. Most likely, these k-mers are derived from sequence redundancy in the assembly, or regions containing systematic errors across multiple locations in the assemblies, or sequences in the genome that are biasedly under-represented in reads. Biased under-representation in reads could occur at extremely highly GC- or AT-rich regions (25). The under-represented groups include k-mers with high positive KAD values (e.g. >0.75), representing k-mers showing fewer copies in the assembly than indicated by reads. For example, k-mers from a region incorrectly collapsed due to tandem repeats would be categorized in these groups. The higher the KAD values, the higher the level of potential assembly incompleteness. However, when k-mers are derived from high-copy organelles or plasmids, the high KAD values reflect the fact that high copies of the sequence are present but only one is probably assembled. In addition, when k-mers are from contamination in short reads, their KAD values could be high. When mapping over- and under-represented k-mers to the assembly, the issue of multiple mapping is frequently encountered. KAD scripts allow users to tune the parameter of the maximum number of locations, which enables the examination of k-mers that are located at multiple genomic regions. However, users need to bear this in mind that the over- or under-represented signal from a region could be repeatedly shown at multiple locations.

In this study, for most analyses, we chose 25-mer as the k-mer size because it is an optimal size in the genome assembler ALLPATHS-LG for analyzing k-mer abundance spectrum (38) and we have successfully applied it to examine repetitive sequences of maize that has a large complex genome (21). A shorter k-mer size can have a higher resolution to pinpoint error regions but compromises the uniqueness of each k-mer in the assembled genome sequences. In our simulation result, we showed that 25-mer has high power and accuracy for detecting base error in assembled genomes with small or moderate sizes. We also showed that large complex genomes like maize need longer k-mers for improving the uniqueness of k-mers in the genomes. However, error regions identified using longer k-mers are wider and more likely to cover multiple errors, and the analysis requires more computation resources. In addition, sequencing reads used in KAD analysis are not error-free. The longer the k-mer size, the higher the likelihood that a k-mer from reads would carry errors (39). Therefore, slightly higher depth than 40× would help ensure reliable error detection when a longer k-mer length (e.g. 49-mer) is used.

The development of the KAD analysis was inspired by a motivation to generate a high-quality genome assembly and to develop a method to compare different assemblies for nomination of the final release. With the KAD bioinformatics pipeline, we can quantify the overall sequence error rate and locate errors. Existing error correction algorithms can use the information provided by KAD for targeted error correction, which should reduce false correction. KAD can also help to determine whether one or multiple rounds of polishing were sufficient to meet the convergence criteria, for example, on the basis of the diagnostic plots including KAD error profiles and KAD landscape plots. In the future, new error correction approaches, particularly approaches that are not based on read alignments, could be developed along with KAD for precise error correction in both non-repetitive and repetitive sequences.

DATA AVAILABILITY

Bacterial sequencing raw data and genome assembly are available in the NCBI BioProject under the accession number of PRJNA421835. Fungal Nanopore sequencing raw data and genome assembly are available in the NCBI BioProject under the accession number of PRJNA355407. KAD codes are available at GitHub (https://github.com/liu3zhenlab/KAD)

Factorial estimating assembly base errors using k-mer abundance difference (KAD) between short reads and genome assembled sequences

Factorial estimating assembly base errors using k-mer abundance difference (KAD) between short reads and genome assembled sequences 利用短读段和基因组组装序列之间的k-mer丰度差异(KAD)进行 因子估计组装碱基误差