Alignment with heterozygous genotypes coded by ambiguity characters

Heterozygotes as ambiguity characters. Mistakes you don’t want to make

Ambiguity characters are often used to code heterozygous genotypes. However, coding heterozygotes this way may bias many estimates, because most software interprets an ambiguity character as uncertainty about which base is present rather than as a genotype carrying both alleles. The problem may seem obvious, but in my experience it frequently goes unnoticed.
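To make the difference concrete, here is a toy sketch of how the same aligned site can be scored under the two readings. The function names and the two scoring rules are illustrative assumptions, not taken from any particular program: under the "uncertainty" reading a site counts as a match whenever the sets of possible bases overlap, while under the "genotype" reading the mismatch is averaged over all pairs of alleles.

```python
from itertools import product

# Two-base IUPAC codes only; any other character is treated as a plain base.
TWO_BASE_CODES = {"R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC"}

def alleles(ch):
    """Return the two alleles at a site: 'R' -> 'AG', 'A' -> 'AA'."""
    return TWO_BASE_CODES.get(ch, ch * 2)

def diff_as_uncertainty(x, y):
    """Ambiguity reading (assumed rule): match if the possible bases overlap."""
    return 0.0 if set(alleles(x)) & set(alleles(y)) else 1.0

def diff_as_genotype(x, y):
    """Genotype reading (assumed rule): mean mismatch over all allele pairs."""
    pairs = list(product(alleles(x), alleles(y)))
    return sum(a != b for a, b in pairs) / len(pairs)

def p_distance(seq1, seq2, site_diff):
    """Mean per-site difference between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(site_diff(x, y) for x, y in zip(seq1, seq2)) / len(seq1)

homozygous = "ACGTA"
heterozygous = "RCGTA"  # first site is an A/G heterozygote coded as R

print(p_distance(homozygous, heterozygous, diff_as_uncertainty))  # 0.0
print(p_distance(homozygous, heterozygous, diff_as_genotype))     # 0.1
```

In this toy case the "uncertainty" reading makes the heterozygous site invisible, so the two sequences look identical even though one of them carries a G allele.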

IUPAC nucleotide code

The current nucleic acid notation was introduced long before next-generation sequencing and whole-genome data analysis. The characters A, C, G, and T represent the four nucleotides of a DNA molecule. The ambiguity characters W, S, M, K, R, and Y were proposed to code positions where there is uncertainty between two nucleotides, and B, D, H, and V are used when the only thing known is that a position is not one particular nucleotide (not A, not C, not G, and not T, respectively). This coding system is known as the IUPAC nucleotide code.
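The code described above can be written down as a small lookup table. This is a minimal sketch: the dictionary name `IUPAC` and the helper `encode_genotype` are hypothetical, and N (any base) is included for completeness even though it is not discussed in the text.

```python
# IUPAC nucleotide codes mapped to the set of bases each one can stand for.
IUPAC = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    # Two-base codes
    "W": {"A", "T"}, "S": {"C", "G"}, "M": {"A", "C"},
    "K": {"G", "T"}, "R": {"A", "G"}, "Y": {"C", "T"},
    # Three-base codes: "anything but one nucleotide"
    "B": {"C", "G", "T"},  # not A
    "D": {"A", "G", "T"},  # not C
    "H": {"A", "C", "T"},  # not G
    "V": {"A", "C", "G"},  # not T
    # Any base (not mentioned in the text above, added for completeness)
    "N": {"A", "C", "G", "T"},
}

def encode_genotype(allele1, allele2):
    """Return the IUPAC character for an unordered pair of alleles.

    Hypothetical helper: a homozygote maps to the base itself and a
    heterozygote to the matching two-base code.
    """
    target = {allele1, allele2}
    for code, bases in IUPAC.items():
        if bases == target:
            return code
    raise ValueError(f"no IUPAC code for alleles {allele1!r}, {allele2!r}")

print(encode_genotype("A", "G"))  # R
print(encode_genotype("C", "C"))  # C
```

Note that the table itself carries no information about whether R means "A or G, we are not sure" or "both A and G are present"; that interpretation is left to whatever reads the sequence.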

It worked well in the early DNA sequencing era, when scientists studied short haploid DNA sequences, and there is still no widely adopted alternative today. Virtually all software uses this coding system.
