Methods

This section provides an overview of the methods and tools that were developed for quality control of mtDNA datasets with respect to empop. More detailed information can be found in the respective publications.

Need for Quality Control (QC)

A collaborative exercise showed that uniformity of mtDNA sequencing among different laboratories was not the case at this time and the following four main groups of errors were determined:

clerical errors
sample mix-ups
contaminations and
discrepancies with respect to the mtDNA nomenclature

A conservative error-rate was calculated at 10.7% for a virtual database.

Theses results and earlier published concerns strongly emphasized the need for appropriate safety regulations when mtDNA profiles are compiled for database purposes in order to accomplish the high standard required for mtDNA databases that are used in the forensic context

For details see: Parson et al 2004

Quality Control of population datasets using Quasi-median Networks

Next, a concept for mtDNA data generation, analysis, transfer and quality control that meets forensic standards was presented. Due to the complexity of mtDNA population data tables it is often difficult or even impossible to detect errors. Therefore, software based on quasi-median network analysis that visualizes mtDNA data tables highlighting sequencing, interpretation and transcription errors was introduced. This was made publicly available in the form of the EDNAP mtDNA Population Database, short EMPOP.

For details see: Parson and Dür 2007

Optimization of Quality Control using Quasi-median Networks

The complexity of the networks required a filtering of highly recurrent mutations prior to any analysis in order to simplify the data set. Haplogroup-specific filters were suggested and presented. Examples are shown on a West-Eurasian etalon data set containing more than 3500 control region sequences.

For details see: Zimmermann et al 2011

Alignment-free database searches for unbiased frequency estimates (SAM)

The nature of reporting mitotypes (difference coded) with the drawback of ambiguity further led to the development of a string-based search algorithm (SAM) that converts query and database sequences to position-free nucleotide strings and thus eliminates the possibility that identical sequences will be missed in a database query. Moreover, SAM provided additional flexibility to incorporate phylogenetic data, site-specific mutation rates, and other biologically relevant information that would refine the interpretation of mitochondrial DNA data.

For details see: Röck et al 2011

Haplogrouping using a Maximum Likelihood approach (EMMA)

Another substantial factor for the QC of mitotypes is the assignment of haplogroups. By introducing EMMA, an approach that also considered private mutations was provided to the community. Especially the quality control of partial sequences could be improved hereby.

Moreover, haplogrouping is an important method for other disciplines as well.

For details see: Röck et al 2013

Updated query engine for optimized database searches (SAM2)

With the knowledge gained through several years of quality control of submitted datasets, the basic search algorithm was updated to continue to comply the forensic needs.

Besides general code-optimization, the higlights can be summarized as follows:

more conservative database search results
updated alignment/nomenclature conventions for phylogenitcally instable regions
new mode of finding sequence neighbours
introduction of block.indels
evaluation of the effect of additional length variants
dissemination through empop.online

Moreover, SAM2 is also capable of haplogroup estimation and therefore replaces EMMA.

For details see: Huber et al 2018

Fine-Tuning Phylogenetic Alignment and Haplogrouping of mtDNA Sequences

Further improvement of alignment and haplogroup estimation of mitochondrial DNA (mtDNA) sequences.

For details see: Dür et al. 2021