Skip to main content
Artwork for Genomics analysis with Spark and Hail

Processing Genomic Data with Apache Spark (Big Data tutorial)

The current scale of genomic data production requires scaling the processing tools to analyze all that data. Hail, an open-source framework built on top of Apache Spark, provides such tools. It is capable of processing gigabyte-scale data on a laptop or terabyte-scale data on a cluster. In this tutorial, I show a simple Hail pipeline to filter a VCF file and build a PCA plot to explore the structure of the data.

I prepared this tutorial for the course Scalable Data Science, which I attended as a student.

Read More