Processing Genomic Data with Apache Spark (Big Data tutorial)

Genomic data is now produced at a scale that demands processing tools able to scale with it. Hail, an open-source framework built on top of Apache Spark, provides such tools. It can process gigabyte-scale data on a laptop or terabyte-scale data on a cluster. In this tutorial, I show a simple Hail pipeline that filters a VCF file and builds a PCA plot to explore the structure of the data.
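To give a feel for what "filter, then PCA" means before opening the notebook, here is a minimal sketch in plain Python. It is not Hail code and not the notebook's pipeline: the toy genotype matrix, the minor-allele-frequency threshold of 0.1, and the helper names (`filter_variants`, `normalize`, `first_principal_component`) are all assumptions made for illustration. It filters variants by minor allele frequency, normalizes genotypes per variant, and finds the first principal component by power iteration.

```python
import math

# Hypothetical toy data: rows are variants, columns are samples,
# entries are alternate-allele counts (0, 1, or 2).
GENOTYPES = [
    [0, 0, 0, 2, 2, 2],
    [0, 1, 0, 2, 1, 2],
    [0, 0, 0, 0, 0, 0],  # monomorphic: carries no information
    [2, 2, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],  # rare variant
]


def alt_allele_frequency(row):
    """Frequency of the alternate allele across diploid samples."""
    return sum(row) / (2 * len(row))


def filter_variants(genotypes, min_maf):
    """Keep variants whose minor allele frequency is at least min_maf."""
    kept = []
    for row in genotypes:
        af = alt_allele_frequency(row)
        if min(af, 1 - af) >= min_maf:
            kept.append(row)
    return kept


def normalize(genotypes):
    """Center and scale each variant: (g - 2p) / sqrt(2p(1 - p)).

    Expects filtered input, so p is never exactly 0 or 1.
    """
    out = []
    for row in genotypes:
        p = alt_allele_frequency(row)
        scale = math.sqrt(2 * p * (1 - p))
        out.append([(g - 2 * p) / scale for g in row])
    return out


def first_principal_component(matrix, iterations=200):
    """Per-sample scores on PC1 (up to sign), found by power iteration
    on the sample-by-sample matrix K = M^T M."""
    n = len(matrix[0])
    k = [[sum(row[i] * row[j] for row in matrix) for j in range(n)]
         for i in range(n)]
    v = [float(i + 1) for i in range(n)]  # deterministic starting vector
    for _ in range(iterations):
        w = [sum(k[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


kept = filter_variants(GENOTYPES, min_maf=0.1)
pc1 = first_principal_component(normalize(kept))
print(len(kept), "variants kept")
print("PC1 scores per sample:", [round(x, 3) for x in pc1])
```

In this toy data the first three samples and the last three samples end up with opposite signs on the first component, which is the kind of structure a PCA plot makes visible. Hail performs the same conceptual steps with distributed data structures on Spark, so none of these loops appear in a real pipeline.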

I prepared this tutorial for the course Scalable Data Science, which I attended as a student.

Below I provide a Databricks notebook and a video where I explain this notebook.

Open this notebook in a full-width view.

Watch the lecture where I presented this notebook!
