Monthly Archives: January 2016

TGAC Ash tree assemblies (Tree 18 & 35)

The Contributors

Bernardo Clavijo and Team,
The Genome Analysis Centre (TGAC), Norwich

Assembly summary:

Tree     Number of scaffolds   Total sequence (Mbp)   % of Ns   N50 (Kbp)
Tree18   37,452                865.1                  5.1       180.4
Tree35   29,847                845.6                  1.4       137.6
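
For readers who want to recompute these summary statistics from a downloaded assembly, the following is a minimal sketch using the conventional definitions of N50 and N content. The function names are illustrative, and the assumption that the figures above follow exactly these definitions is ours, not something stated by the authors.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (in bp).

    N50 is taken here as the length L such that sequences of length >= L
    together contain at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input


def percent_n(sequences):
    """Return the percentage of bases that are N across all sequences."""
    total = sum(len(seq) for seq in sequences)
    if total == 0:
        return 0.0
    n_count = sum(seq.upper().count("N") for seq in sequences)
    return 100.0 * n_count / total
```

Passing the scaffold lengths to `n50` and the scaffold sequences to `percent_n` gives values in bp and percent respectively; divide the N50 by 1,000 to compare with the Kbp column.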

Contig assembly

Contigs were assembled with the DISCOVAR de novo software [1] from 250 bp paired-end reads generated using a PCR-free protocol. We used KAT [2] spectra-cn plots to QC motif representation, and tailored our data generation towards maximum-complexity, precisely sized, low-bias sampling.
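
The idea behind a spectra-cn comparison can be sketched in a few lines: for every distinct k-mer in the reads, record how often it occurs in the reads and how many times it appears in the assembly. The code below is a toy illustration of that idea only, not KAT's implementation. The function names are hypothetical, the use of canonical k-mers is an assumption, and the sketch tabulates only k-mers that occur in the reads.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(seq, k):
    """Yield the canonical form of every k-mer in seq, skipping non-ACGT k-mers."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
            yield min(kmer, reverse_complement)


def count_kmers(sequences, k):
    """Count canonical k-mers over an iterable of sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(canonical_kmers(seq, k))
    return counts


def spectra_cn(reads, assembly, k):
    """Map (read frequency, assembly copy number) to the number of distinct k-mers."""
    read_counts = count_kmers(reads, k)
    assembly_counts = count_kmers(assembly, k)
    matrix = Counter()
    for kmer, frequency in read_counts.items():
        matrix[(frequency, assembly_counts.get(kmer, 0))] += 1
    return matrix
```

In the resulting table, k-mers with a high read frequency but an assembly copy number of 0 indicate content that is present in the reads yet missing from the assembly. Real read sets need a dedicated k-mer counter rather than in-memory Python dictionaries.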

Haplotype filter

Expectation-maximisation heuristics based on k-mer spectra of the raw reads were applied to the contigs to create a mosaic genome representation, collapsing the haplotypes into one choice per locus. The filtered set of contigs represents all homozygous content and roughly half of the heterozygous content, which simplifies the scaffolding stage.

Scaffolding

Nextera long mate pair (LMP) libraries were constructed, QC’d, and chosen for sequencing as described in TGAC’s published method [3], and pre-processed with a pipeline based on NextClip [4]. Haplotype-filtered contigs were scaffolded using SOAPdenovo2 [5]. SOAPdenovo2 replaces N-stretches (gaps) in contigs with Cs and Gs during scaffolding; to correct for this, contigs were mapped back to the scaffolds and the gaps converted back to Ns.
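
The gap-restoration step can be illustrated with a simplified sketch. It assumes that the mapping has already produced an ungapped, forward-strand placement of each contig on its scaffold; the function name and the `offset` parameter are hypothetical, and the actual pipeline used for these assemblies is not described in more detail than the paragraph above.

```python
def restore_gaps(scaffold, contig, offset):
    """Copy the N positions of a contig back onto the scaffold.

    The contig is assumed to align without indels, on the forward strand,
    starting at scaffold[offset]. Wherever the contig has an N, the scaffold
    base (a C or G written by the scaffolder) is reset to N.
    """
    if offset < 0 or offset + len(contig) > len(scaffold):
        raise ValueError("contig placement falls outside the scaffold")
    bases = list(scaffold)
    for i, base in enumerate(contig):
        if base in "Nn":
            bases[offset + i] = "N"
    return "".join(bases)
```

For example, `restore_gaps("AACGCGTTNNNNAC", "CGNNTT", 2)` resets the two bases that replaced the contig's gap while leaving the scaffold's own inter-contig gap untouched.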

Contamination screening and filtering

Scaffolds shorter than 1 kbp were removed. The remaining scaffolds were checked for contamination against NCBI’s nucleotide database using BLAST+ and the results joined to NCBI’s taxonomy database. Results were filtered to show hits of >98% identity over >90% of their length. From this list, scaffolds identified as contamination were removed.
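
The two numeric filters described above can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline actually used: it assumes BLAST+ tabular output requested with the columns `qseqid sseqid pident qstart qend qlen`, it reads "their length" as the length of the query scaffold, and the function and constant names are hypothetical. The join to the taxonomy database and the final decision on which hits count as contamination are not modelled.

```python
import csv

MIN_SCAFFOLD_LEN = 1000  # bp; scaffolds shorter than this are dropped
MIN_IDENTITY = 98.0      # percent identity threshold
MIN_COVERAGE = 90.0      # percent of the scaffold covered by the hit


def long_enough(scaffolds):
    """Keep scaffolds of at least MIN_SCAFFOLD_LEN bp (dict of name -> sequence)."""
    return {name: seq for name, seq in scaffolds.items()
            if len(seq) >= MIN_SCAFFOLD_LEN}


def strong_hits(blast_tsv):
    """Yield (scaffold, subject) pairs for hits passing both thresholds.

    blast_tsv is an open text file (or any iterable of lines) with the
    tab-separated columns: qseqid sseqid pident qstart qend qlen.
    """
    for row in csv.reader(blast_tsv, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        qseqid, sseqid, pident, qstart, qend, qlen = row[:6]
        covered = abs(int(qend) - int(qstart)) + 1
        coverage = 100.0 * covered / int(qlen)
        if float(pident) > MIN_IDENTITY and coverage > MIN_COVERAGE:
            yield qseqid, sseqid
```

The pairs returned by `strong_hits` form a candidate list only; each candidate would still need to be checked against its taxonomic assignment before a scaffold is discarded.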

Assemblies are available to download from the oadb FTP site:
Tree 18 assembly

Tree 35 assembly

Both genomes can be queried with BLAST at the
TGAC ash genome BLAST site

1) http://www.broadinstitute.org/software/discovar/blog/
2) http://www.tgac.ac.uk/KAT/
3) D. Heavens, G. G. Accinelli, B. Clavijo, and M. D. Clark, “A method to simultaneously construct up to 12 differently sized Illumina Nextera long mate pair libraries with reduced DNA input, time, and cost,” BioTechniques, vol. 59, no. 1, pp. 42–45, 2015.
4) R. M. Leggett, B. J. Clavijo, L. Clissold, M. D. Clark, and M. Caccamo, “NextClip: an analysis and read preparation tool for Nextera long mate pair libraries,” Bioinformatics, p. btt702, 2013.
5) R. Luo, B. Liu, Y. Xie, Z. Li, W. Huang, J. Yuan, G. He, Y. Chen, Q. Pan, Y. Liu, J. Tang, G. Wu, H. Zhang, Y. Shi, Y. Liu, C. Yu, B. Wang, Y. Lu, C. Han, D. W. Cheung, S.-M. Yiu, S. Peng, Z. Xiaoqian, G. Liu, X. Liao, Y. Li, H. Yang, J. Wang, T.-W. Lam, and J. Wang, “SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler,” GigaScience, vol. 1, no. 1, p. 18, 2012.

Contact: Bernardo Clavijo, Algorithms Team Leader, TGAC.
Bernardo.Clavijo@tgac.ac.uk