MToolBox
A bioinformatics pipeline for human mtDNA analysis from high throughput and automated sequencing data
|
Input format |
||
BAM SAM FASTQ FASTQ.GZ FASTA |
The tool accepts different NGS data format for a complete analysis, or FASTA files for haplogroup prediction, variant annotation and prioritization steps only. The pipeline works on both Whole Genome and Whole Exome Sequencing data. Warnings! The user can submit more samples in a single MToolBox run, but
they must be in the same format. The total file size must not exceed 50 MB. A faster analysis could be performed by submitting only mitochondrial reads. The usage of compressed (FASTQ.GZ) or binary (BAM) files is highly recommended. BAM, SAM and FASTA input files MUST be renamed as <sample_name>.ext. FASTQ files MUST be renamed as <sample_name>.R1.fastq, <sample_name>.R2.fastq for PAIRED-END data and <sample_name>.fastq for SINGLE END data. |
|
Reference sequences |
||
hg19+rCRS hg19+RSRS |
The tool uses hg19 as nuclear reference genome for read mapping and rCRS or RSRS as mitochondrial reference sequences, at user's choice. Haplogroup prediction is based on RSRS Phylotree build 15, while a functional annotation is made for each variant recognized by both the two reference sequences. |
|
Filtering and extra options |
||
|
PCR duplicates removal through PicardTools MarkDuplicates is recommended for a more accurate allele quantification. |
|
|
The IndelRealigner utility of the GATK suite is used for local realignment of insertions and deletions around a set of 127 events annotated in Mitomap and HmtDB. |
|
|
The minimum distance from the read end required to retain an insertion/deletion during variant calling. Default value: 0.5 Warning! Only values ≥ 5 are allowed. |
|
|
The heteroplasmy threshold is set for variants to be reported in the FASTA consensus sequence output. Contig sequences included in this file report mismatch and ins/dels variants equal to or above the fixed threshold, otherwise IUPAC ambiguities are reported. Default value: 0.8 Warning! Ins/dels below the threshold will not be reported in the FASTA sequence. |
MToolBox default output
files are:
- VCF_file.vcf:
contains all the mitochondrial variant positions against the mitochondrial
reference sequence identified in the whole set of submitted samples and other
meta-information
-
mt_classification_best_results.csv: reports for each sample the best haplogroup
prediction. If the sorting results in more than one best haplogroup prediction
with equal probability, the output will enclose all of them
-
prioritized_variants.txt: brief annotation of prioritized variants (variants
recognized by the 3 reference sequences, sorted in ascending order of
nucleotide variability scores)
-
OUT_<sample_name> folders, produced for each sample, containing:
-
outmt.sam: reads mapped onto the human mitochondrial DNA
-
logmt.txt: GSNAP log file for mitochondrial DNA mapping
-
outmt.fastq: fastq file of single reads extracted from outmt.sam file
-
outmt1.fastq: fastq file of paired reads extracted from outmt.sam file
-
outmt2.fastq: fastq file of paired reads extracted from outmt.sam file
-
outhumanS.sam: single reads mapped onto the entire human genome
-
loghumanS.txt: GSNAP log file for human genome mapping (single reads)
-
outhumanP.sam: paired reads mapped onto the entire human genome
-
loghumanP.txt: GSNAP log file for human genome mapping (paired reads)
-
OUT.sam: alignments of reads uniquely mapped on mitochondrial genome
-
OUT2.sam: alignments of reads uniquely mapped on mitochondrial genome, after
processing with IndelRealigner and/or MarkDuplicates. This file will be
generated anyway, even if these two processes have been disabled
-
mtDNAassembly-table.txt: the main table describing the assembly position by
position
-
mtDNAassembly-Contigs.fasta: a fasta file including all reconstructed contigs
-
mtDNAassembly-coverage.txt: a text file including the coverage per contig and
per mitochondrial known annotation
-
logassemble.txt: the log file of the assembleMTgenome.py script
-
sorted.csv: a table with each haplogroup whose prediction is > 90%
-
merged_diff.csv file: reports the SNPs between the query genome and each of the
three sequences RSRS, rCRS and hg_MHCS (haplogroup specific Macro-Haplogroup
Consensus Sequence)
-
<sample_name>.csv: a table where, for all the haplogroups present in the
Phylotree Build 15, are reported the same data as in the file
sequence_name>.sorted.csv, except for the Missing Sites field
-
annotation.csv: a further elaboration of the file merged_diff.csv, providing,
for each mt variant allele between the query genome and each of the three
sequences RSRS, rCRS and hg_MHCS, several annotations based on different
resources: HmtDB, Mitomap, pathogenicity prediction
software MutPred, Polyphen-2 and SNPs&GO, 1000 Genomes variants frequency, links
to OMIM, dbSNP and Mamit-tRNA. An overall disease score
ranging between 0 and 1 is also provided. It was calculated as a weighted
average of all the pathogenicity prediction scores considering both the number
of times each method gives the right prediction and the number of times each
method gives the best probability compared with the other ones on a training
dataset of 53 non synonymous variants selected among mitochondrial diseases or
cancer associated mutations (Mitomap).
The higher the score, the higher the possibility the variant site is
deleterious.
Further explanations of
files content can be found here.
MToolBox command line software
For large datasets, the user could use the command line version of MToolBox. It works on Unix systems and requires Python 2.7 and the preliminary installation of SAMtools, GSNAP and MUSCLE.
How
to cite
If you use MToolBox, please cite:
Claudia Calabrese*, Domenico Simone*, Maria
Angela Diroma, Mariangela Santorsola, Cristiano Gutta', Giuseppe Gasparre,
Ernesto Picardi, Graziano Pesole, Marcella Attimonelli.
MToolBox: a highly automated pipeline for
heteroplasmy annotation and prioritization analysis of human mitochondrial
variants in high-throughput sequencing. Bioinformatics
2014; doi:10.1093/bioinformatics/btu483.