Genomic data is now produced at a scale that demands processing tools able to scale with it. Hail, an open-source framework built on top of Apache Spark, provides such tools. It can process gigabyte-scale data on a laptop or terabyte-scale data on a cluster. In this tutorial, I show a simple Hail pipeline that filters a VCF file and builds a PCA plot to explore the structure of the data.
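To give a rough idea of the shape of such a pipeline, here is a minimal sketch. It assumes the Hail 0.2 Python API, which may differ from the version used in the notebook. The function name `filter_and_pca`, the VCF path you pass in, and the call-rate and allele-frequency thresholds are all hypothetical choices for illustration, not the values used in the tutorial.

```python
def filter_and_pca(vcf_path, min_call_rate=0.95, min_alt_af=0.01, k=2):
    """Filter the variants of a VCF and return per-sample PCA scores.

    Sketch only: thresholds are illustrative assumptions, not tuned values.
    """
    # Imported lazily so this module can be loaded without Hail installed.
    import hail as hl

    # Read the VCF into a MatrixTable (rows = variants, columns = samples).
    # Hail initializes itself with default settings on first use.
    mt = hl.import_vcf(vcf_path)

    # Keep biallelic variants only, since the PCA below assumes them.
    mt = mt.filter_rows(hl.len(mt.alleles) == 2)

    # Annotate each variant with QC statistics (call rate, allele frequencies).
    mt = hl.variant_qc(mt)

    # Drop poorly called variants and variants with a rare alternate allele.
    mt = mt.filter_rows(
        (mt.variant_qc.call_rate >= min_call_rate)
        & (mt.variant_qc.AF[1] >= min_alt_af)
    )

    # PCA on genotypes normalized under Hardy-Weinberg equilibrium.
    eigenvalues, scores, loadings = hl.hwe_normalized_pca(mt.GT, k=k)

    # `scores` is a Table keyed by sample, with an array field `scores`.
    return scores
```

Calling `filter_and_pca` with the path to a VCF returns a table with one row per sample. Its first two components can then be drawn against each other, for example with `hl.plot.scatter(scores.scores[0], scores.scores[1])`, to look for clusters of samples.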
I prepared this tutorial for the course Scalable Data Science, which I attended as a student.
Below you will find the Databricks notebook and a video in which I walk through it.
Watch the lecture where I presented this notebook!