MToolBox

A bioinformatics pipeline for human mtDNA analysis from high throughput and automated sequencing data

                                                                                                          

 

 

What is MToolBox?

 

MToolBox software implements an effective computational strategy for human mitochondrial genomes assembling and analysis from mitochondria-targeted and off-targeted sequencing data. A filtering based on quality score (≥25), depth of coverage (≥5) and exclusion of reads mapping on both nuclear and mitochondrial sequences (NumtS) is carried out for variant calling and genomes reconstruction from Whole Exome and Genome Sequencing studies.

Functional annotation and prioritization analysis of detected variants are also performed through variants recognition by the two mitochondrial reference sequences commonly used, the revised Cambridge Reference Sequence (rCRS) and the Reconstructed Sapiens Reference Sequence (RSRS), and by a related macro-haplogroup consensus sequence (MHCS) after haplogroup assignment for each analyzed sample. Traditional Sanger sequencing data may be also submitted for this analysis.

A summary file and a compressed folder (.tar.gz) containing all the output files generated by the pipeline can be downloaded once the analysis is completed, including a VCF file and FASTA files for each sample analyzed.

The summary of the analysis include

-        statistics about coverage of reconstructed mitochondrial genomes (for NGS data)

-        haplogroup(s) prediction

-        functional annotation of prioritized variants.

 

For details, see MToolBox output files.

 

How to use MToolBox

-        Input format

-        Reference Sequences

-        Filtering and extra options

Output files

Command line tool

 

MToolBox workflow

 

How to use MToolBox

The user is required to select the input format and the reference sequence for read mapping. Filtering and extra options are disabled for fasta.

 

Input format

BAM

SAM

FASTQ

FASTQ.GZ

FASTA

The tool accepts different NGS data format for a complete analysis, or FASTA files for haplogroup prediction, variant annotation and prioritization steps only. The pipeline works on both Whole Genome and Whole Exome Sequencing data.

Warnings!

The user can submit more samples in a single MToolBox run, but they must be in the same format.

The total file size must not exceed 50 MB. A faster analysis could be performed by submitting only mitochondrial reads. The usage of compressed (FASTQ.GZ) or binary (BAM) files is highly recommended.

BAM, SAM and FASTA input files MUST be renamed as <sample_name>.ext. FASTQ files MUST be renamed as <sample_name>.R1.fastq, <sample_name>.R2.fastq for PAIRED-END data and <sample_name>.fastq for SINGLE END data.

 

Reference sequences

hg19+rCRS

hg19+RSRS

The tool uses hg19 as nuclear reference genome for read mapping and rCRS or RSRS as mitochondrial reference sequences, at user's choice.

Haplogroup prediction is based on RSRS Phylotree build 15, while a functional annotation is made for each variant recognized by both the two reference sequences.

 

Filtering and extra options

Enable duplicate read removal

PCR duplicates removal through PicardTools MarkDuplicates is recommended for a more accurate allele quantification.

Enable local realignment around known insertions and deletions

The IndelRealigner utility of the GATK suite is used for local realignment of insertions and deletions around a set of 127 events annotated in Mitomap and HmtDB.

Minimum distance of ins/dels from read end

The minimum distance from the read end required to retain an insertion/deletion during variant calling. Default value: 0.5

Warning! Only values ≥ 5 are allowed.

Heteroplasmy threshold for FASTA consensus sequence

The heteroplasmy threshold is set for variants to be reported in the FASTA consensus sequence output. Contig sequences included in this file report mismatch and ins/dels variants equal to or above the fixed threshold, otherwise IUPAC ambiguities are reported. Default value: 0.8

Warning! Ins/dels below the threshold will not be reported in the FASTA sequence.

 

MToolBox output files

MToolBox default output files are:

- VCF_file.vcf: contains all the mitochondrial variant positions against the mitochondrial reference sequence identified in the whole set of submitted samples and other meta-information

- mt_classification_best_results.csv: reports for each sample the best haplogroup prediction. If the sorting results in more than one best haplogroup prediction with equal probability, the output will enclose all of them

- prioritized_variants.txt: brief annotation of prioritized variants (variants recognized by the 3 reference sequences, sorted in ascending order of nucleotide variability scores)

- OUT_<sample_name> folders, produced for each sample, containing:

- outmt.sam: reads mapped onto the human mitochondrial DNA

- logmt.txt: GSNAP log file for mitochondrial DNA mapping

- outmt.fastq: fastq file of single reads extracted from outmt.sam file

- outmt1.fastq: fastq file of paired reads extracted from outmt.sam file

- outmt2.fastq: fastq file of paired reads extracted from outmt.sam file

- outhumanS.sam: single reads mapped onto the entire human genome

- loghumanS.txt: GSNAP log file for human genome mapping (single reads)

- outhumanP.sam: paired reads mapped onto the entire human genome

- loghumanP.txt: GSNAP log file for human genome mapping (paired reads)

- OUT.sam: alignments of reads uniquely mapped on mitochondrial genome

- OUT2.sam: alignments of reads uniquely mapped on mitochondrial genome, after processing with IndelRealigner and/or MarkDuplicates. This file will be generated anyway, even if these two processes have been disabled

- mtDNAassembly-table.txt: the main table describing the assembly position by position

- mtDNAassembly-Contigs.fasta: a fasta file including all reconstructed contigs

- mtDNAassembly-coverage.txt: a text file including the coverage per contig and per mitochondrial known annotation

- logassemble.txt: the log file of the assembleMTgenome.py script

- sorted.csv: a table with each haplogroup whose prediction is > 90%

- merged_diff.csv file: reports the SNPs between the query genome and each of the three sequences RSRS, rCRS and hg_MHCS (haplogroup specific Macro-Haplogroup Consensus Sequence)

- <sample_name>.csv: a table where, for all the haplogroups present in the Phylotree Build 15, are reported the same data as in the file sequence_name>.sorted.csv, except for the Missing Sites field

- annotation.csv: a further elaboration of the file merged_diff.csv, providing, for each mt variant allele between the query genome and each of the three sequences RSRS, rCRS and hg_MHCS, several annotations based on different resources: HmtDB, Mitomap, pathogenicity prediction software MutPred, Polyphen-2 and SNPs&GO, 1000 Genomes variants frequency, links to OMIM, dbSNP and Mamit-tRNA. An overall disease score ranging between 0 and 1 is also provided. It was calculated as a weighted average of all the pathogenicity prediction scores considering both the number of times each method gives the right prediction and the number of times each method gives the best probability compared with the other ones on a training dataset of 53 non synonymous variants selected among mitochondrial diseases or cancer associated mutations (Mitomap). The higher the score, the higher the possibility the variant site is deleterious.

Further explanations of files content can be found here.

 

MToolBox command line software

For large datasets, the user could use the command line version of MToolBox. It works on Unix systems and requires Python 2.7 and the preliminary installation of SAMtools, GSNAP and MUSCLE.

 

How to cite

If you use MToolBox, please cite:

Claudia Calabrese*, Domenico Simone*, Maria Angela Diroma, Mariangela Santorsola, Cristiano Gutta', Giuseppe Gasparre, Ernesto Picardi, Graziano Pesole, Marcella Attimonelli.

MToolBox: a highly automated pipeline for heteroplasmy annotation and prioritization analysis of human mitochondrial variants in high-throughput sequencing. Bioinformatics 2014; doi:10.1093/bioinformatics/btu483.