FIR analysis: genes encoding predicted secreted proteins occur in both gene-sparse and gene-dense regions of the H. pseudoalbidus genome

The contributors

Daniel Bunting (Nuffield student), Kentaro Yoshida, Dan MacLean and Diane Saunders at TSL.

The material

We used the candidate effectors of Hymenoscyphus pseudoalbidus KW1 identified previously (http://oadb.tsl.ac.uk/?m=20130910).

Background information

In filamentous plant pathogens such as the late blight oomycete Phytophthora infestans, repeat-driven expansion has created gene-sparse regions rich in repeats and transposable elements (TEs) that are distinct from the gene-dense, conserved regions. This organisation is known as a two-speed genome architecture. The distance from a gene to its closest coding gene neighbour on each side (its 5′ and 3′ flanking intergenic regions, FIRs) indicates whether the gene resides in a gene-dense or gene-sparse environment. Because genes associated with pathogenicity tend to have long FIRs in pathogen genomes, genome architecture could be used to identify new candidate pathogenicity genes.
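As an illustration of how FIR lengths can be derived from gene coordinates, here is a minimal Python sketch. It is not the script used for this analysis. The function name compute_firs, the tuple layout and the 1-based inclusive coordinates are assumptions made for the example. Genes at the end of a contig have no neighbour on one side, so that FIR is reported as None, and overlapping neighbours are given a distance of 0.

```python
from collections import defaultdict


def compute_firs(genes):
    """Return {gene_id: (fir5, fir3)} for an iterable of gene records.

    Each record is (gene_id, contig, start, end, strand), with 1-based
    inclusive coordinates (an assumption of this sketch). A FIR is None
    when the gene has no neighbour on that side of its contig.
    """
    by_contig = defaultdict(list)
    for gene in genes:
        by_contig[gene[1]].append(gene)

    firs = {}
    for contig_genes in by_contig.values():
        contig_genes.sort(key=lambda g: g[2])  # order by start coordinate
        prev_max_end = None  # furthest end among genes starting earlier
        for i, (gene_id, _, start, end, strand) in enumerate(contig_genes):
            left = None
            if prev_max_end is not None:
                left = max(0, start - prev_max_end - 1)
            right = None
            if i + 1 < len(contig_genes):
                right = max(0, contig_genes[i + 1][2] - end - 1)
            # On the minus strand the 5' side is the right-hand neighbour.
            firs[gene_id] = (left, right) if strand == "+" else (right, left)
            prev_max_end = end if prev_max_end is None else max(prev_max_end, end)
    return firs


example = [
    ("g1", "contig1", 100, 200, "+"),
    ("g2", "contig1", 301, 400, "-"),
    ("g3", "contig1", 1000, 1200, "+"),
]
firs = compute_firs(example)
```

In this toy example g2 lies on the minus strand, so its 5′ FIR is the 599 nt gap to g3 and its 3′ FIR is the 100 nt gap to g1.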

The analysis

To investigate whether a similar organisation occurs in the genome of H. pseudoalbidus, we first identified candidate effector genes in the gene annotations (http://oadb.tsl.ac.uk/?m=20130910). To determine whether genes encoding secreted proteins lie in gene-sparse or gene-dense regions of the genome, we used RNA-seq data to extend the de novo gene calls where they overlapped transcripts, creating the file extended_genes.gff. We aligned the RNA-seq reads from KW1 against the KW1 assembly using BWA. For each gene model in the TGAC gene predictions that was within 100 nt of another gene, we extracted reads on the same strand that fell within 1000 nt upstream of the start or 1000 nt downstream of the end. Starting from the start and end of the gene, we followed read overlaps outwards as far as possible, until reads no longer overlapped. The most distal read then defined the new gene start or end.
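The read-overlap extension described above can be sketched roughly as follows. This is a simplified Python illustration under stated assumptions, not the code that produced extended_genes.gff. The function name extend_gene is hypothetical, reads are assumed to be already filtered to the gene's strand and given as (start, end) pairs with inclusive coordinates, and "within 1000 nt" is interpreted as the read lying entirely inside the window.

```python
def extend_gene(start, end, reads, window=1000):
    """Extend a gene's boundaries by following chains of overlapping reads.

    reads: iterable of (read_start, read_end) on the gene's strand.
    Only reads lying entirely within `window` nt of the gene are used
    (an assumption of this sketch). Returns (new_start, new_end).
    """
    candidates = [
        (s, e) for s, e in reads
        if s >= start - window and e <= end + window
    ]
    new_start, new_end = start, end
    changed = True
    while changed:  # repeat until no read extends the current span
        changed = False
        for s, e in candidates:
            overlaps_span = s <= new_end and e >= new_start
            if not overlaps_span:
                continue
            if s < new_start:
                new_start = s
                changed = True
            if e > new_end:
                new_end = e
                changed = True
    return new_start, new_end


reads = [(950, 1010), (900, 960), (700, 800),
         (1990, 2050), (2040, 2100), (2200, 2300)]
extended = extend_gene(1000, 2000, reads)
```

Here the reads at 700-800 and 2200-2300 are not reached by any chain of overlaps, so the gene originally spanning 1000-2000 is extended to 900-2100 and no further.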

The FIR distribution for genes in the H. pseudoalbidus genome is shown below and is indicative of a single-speed genome, with genes encoding secreted proteins dispersed across both gene-sparse and gene-dense regions.


Figure. The single-speed H. pseudoalbidus genome. Distribution of H. pseudoalbidus genes according to the length of their 5′ and 3′ flanking intergenic regions (FIRs). Red circles, core genes; blue circles, genes encoding predicted secreted proteins.