
Processing Genomic Data with Apache Spark (Big Data tutorial)


Genomic data are now produced at a scale that demands processing tools able to scale with them. Hail, an open-source framework built on top of Apache Spark, provides such tools. It can process gigabyte-scale data on a laptop or terabyte-scale data on a cluster. In this tutorial, I show a simple Hail pipeline that filters a VCF file and builds a PCA plot to explore the structure of the data.


I prepared this tutorial for the course Scalable Data Science, which I attended as a student.


Alignment with heterozygous genotypes coded by ambiguity characters

Heterozygotes as ambiguity characters. Mistakes you don’t want to make


Ambiguity characters are often used to code heterozygous genotypes. However, coding heterozygotes this way may bias many estimates, because most software interprets an ambiguity character as uncertainty about which single nucleotide is present, not as the presence of two alleles. The problem is obvious once stated, but in my experience it frequently goes unnoticed.
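To make the problem concrete, here is a minimal Python sketch. It assumes a toy distance function that treats an ambiguity character as "could be either base" and therefore counts a site as a mismatch only when the two sets of possible bases do not overlap. The names `HET_TO_IUPAC`, `IUPAC_SETS`, `encode_genotype`, and `mismatches_as_uncertainty` are hypothetical and do not come from any particular program.

```python
# Two-base IUPAC codes, keyed by the unordered pair of alleles.
HET_TO_IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}

# Reverse lookup: each character -> set of bases it may stand for.
IUPAC_SETS = {code: set(pair) for pair, code in HET_TO_IUPAC.items()}
IUPAC_SETS.update({base: {base} for base in "ACGT"})


def encode_genotype(allele1, allele2):
    """Code a diploid genotype as one character (heterozygote -> ambiguity code)."""
    if allele1 == allele2:
        return allele1
    return HET_TO_IUPAC[frozenset((allele1, allele2))]


def mismatches_as_uncertainty(seq1, seq2):
    """Toy distance (an assumption): a site differs only if the base sets are disjoint."""
    return sum(1 for a, b in zip(seq1, seq2)
               if not (IUPAC_SETS[a] & IUPAC_SETS[b]))


# One individual is homozygous at every site, the other is heterozygous at two sites.
homozygous = "".join(encode_genotype(a, b)
                     for a, b in [("A", "A"), ("C", "C"), ("G", "G")])
heterozygous = "".join(encode_genotype(a, b)
                       for a, b in [("A", "G"), ("C", "T"), ("G", "G")])

print(homozygous, heterozygous)
print(mismatches_as_uncertainty(homozygous, heterozygous))
```

Under this toy rule the two individuals look identical, even though they differ in genotype at two of the three sites. Real programs may handle ambiguity characters differently, so it is worth checking the documentation of each tool.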


IUPAC nucleotide code

The current nucleic acid notation appeared long before next-generation sequencing and whole-genome data analyses. Characters A, C, G, and T were introduced to represent the four nucleotides of a DNA molecule. Ambiguity characters W, S, M, K, R, and Y were proposed to code positions where there is uncertainty between two nucleotides, and B, D, H, and V were used when the only certainty is that a position is not one particular nucleotide (for example, B means "not A", that is, C, G, or T). This coding system is known as the IUPAC nucleotide code.

It worked well in the early DNA sequencing era, when scientists studied short haploid DNA sequences, and there is still no widely adopted alternative today. Virtually all sequence analysis software uses this coding system.
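The code described above can be written down as a small lookup table. This is only an illustrative Python sketch: the names `IUPAC` and `possible_bases` are hypothetical, and the table also includes N (any nucleotide), which is part of the standard code although it is not discussed in the text.

```python
# IUPAC nucleotide code: each character maps to the bases it may represent.
IUPAC = {
    # Unambiguous nucleotides
    "A": "A", "C": "C", "G": "G", "T": "T",
    # Uncertainty between two nucleotides
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    # "Not one particular nucleotide": B = not A, D = not C, H = not G, V = not T
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    # Any nucleotide (added here for completeness)
    "N": "ACGT",
}


def possible_bases(code):
    """Return the set of nucleotides that an IUPAC character may stand for."""
    return set(IUPAC[code.upper()])


print(sorted(possible_bases("R")))
print(sorted(possible_bases("B")))
```

Note that nothing in the table says whether R means "A or G, we are not sure which" or "both A and G are present"; the code itself was designed for the first meaning.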
