Processing Genomic Data with Apache Spark (Big Data tutorial)

I show a simple Hail pipeline that filters a VCF file and builds a PCA plot to explore the structure of the data on the Databricks platform.

Genomic data is now produced at a scale that demands equally scalable processing tools. Hail, an open-source framework built on top of Apache Spark, provides such tools. It can process gigabyte-scale data on a laptop or terabyte-scale data on a cluster. In this tutorial, I show a simple Hail pipeline that filters a VCF file and builds a PCA plot to explore the structure of the data.
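To make the two steps (filtering variants, then PCA) concrete, here is a minimal sketch in plain Python. It is not Hail code and not the pipeline from the notebook. It only mirrors the same idea on a tiny in-memory example. The toy VCF content, the function names, and the 5% minor allele frequency threshold are all assumptions made for illustration. The sketch also assumes biallelic variants with no missing genotypes, and it skips the per-variant variance normalization that a real genomics PCA would normally apply.

```python
import math

# Hypothetical toy VCF: 4 samples, 4 biallelic variants, no missing calls.
TOY_VCF = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ts1\ts2\ts3\ts4\n"
    "1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/0\t0/1\t1/1\t1/1\n"
    "1\t200\t.\tC\tT\t50\tPASS\t.\tGT\t0/0\t0/0\t0/0\t0/0\n"
    "1\t300\t.\tG\tA\t50\tPASS\t.\tGT\t0/0\t0/0\t1/1\t0/1\n"
    "1\t400\t.\tT\tC\t50\tPASS\t.\tGT\t0/1\t0/0\t1/1\t1/1\n"
)


def parse_vcf(text):
    """Return (sample names, [(position, ALT allele count per sample), ...])."""
    samples, variants = [], []
    for line in text.splitlines():
        if line.startswith("##") or not line.strip():
            continue  # skip meta-information and blank lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns start after FORMAT
            continue
        # Genotype "0/1" -> 1 ALT allele, "1/1" -> 2, "0/0" -> 0.
        dosages = [
            sum(int(allele) for allele in gt.replace("|", "/").split("/"))
            for gt in fields[9:]
        ]
        variants.append((int(fields[1]), dosages))
    return samples, variants


def filter_by_maf(variants, min_maf):
    """Keep variants whose minor allele frequency is at least min_maf."""
    kept = []
    for pos, dosages in variants:
        alt_freq = sum(dosages) / (2 * len(dosages))  # diploid samples
        if min(alt_freq, 1 - alt_freq) >= min_maf:
            kept.append((pos, dosages))
    return kept


def first_pc(variants, iterations=100):
    """First principal component over samples (unit length, arbitrary scale).

    Uses power iteration on the sample-by-sample Gram matrix of the
    mean-centered genotype matrix. Requires at least one non-constant variant.
    """
    n = len(variants[0][1])
    centered = []
    for _, dosages in variants:
        mean = sum(dosages) / n
        centered.append([d - mean for d in dosages])
    gram = [
        [sum(col[i] * col[j] for col in centered) for j in range(n)]
        for i in range(n)
    ]
    vec = [float(i + 1) for i in range(n)]  # deterministic starting vector
    for _ in range(iterations):
        vec = [sum(gram[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return vec


samples, variants = parse_vcf(TOY_VCF)
kept = filter_by_maf(variants, min_maf=0.05)  # 0.05 is an assumed threshold
pc1 = first_pc(kept)
for name, score in zip(samples, pc1):
    print(name, round(score, 3))
```

In this toy data the monomorphic variant at position 200 is filtered out, and the first component separates samples that mostly carry the reference allele from those that mostly carry the alternate allele. Hail performs the same kind of filtering and decomposition, but distributed over a Spark cluster.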

I prepared this tutorial for the course Scalable Data Science, which I attended as a student.

Below are the Databricks notebook and a video in which I walk through it.

Open this notebook in a full-width view in a new tab.

Watch the lecture where I presented this notebook!

If you have any questions or suggestions, feel free to email me.

Written on November 8, 2017