
Processing Genomic Data with Apache Spark (Big Data tutorial)


Genomic data is now produced at a scale that demands analysis tools able to scale with it. Hail, an open-source framework built on top of Apache Spark, provides such tools. It can process gigabyte-scale data on a laptop or terabyte-scale data on a cluster. In this tutorial, I show a simple Hail pipeline that filters a VCF file and builds a PCA plot to explore the structure of the data.
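To make the filtering step concrete, here is a minimal sketch in plain Python. It is not Hail code and not the notebook's pipeline; it only illustrates the kind of per-variant filter such a pipeline applies before PCA. The toy VCF records, the sample names, the function names `parse_dosage` and `filter_variants`, and the thresholds `min_call_rate` and `min_maf` are all assumptions made up for illustration.

```python
# Toy VCF content; records and sample names are invented for illustration.
VCF_TEXT = """\
##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\tS4
1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/0\t0/1\t1/1\t0/1
1\t200\t.\tC\tT\t10\tq10\t.\tGT\t0/0\t0/0\t0/1\t0/0
1\t300\t.\tG\tA\t60\tPASS\t.\tGT\t0/0\t0/0\t0/0\t0/0
1\t400\t.\tT\tC\t70\tPASS\t.\tGT\t1/1\t0/1\t0/0\t./.
"""


def parse_dosage(gt):
    """Return the alternate-allele count for a GT string, or None if missing."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return None
    return sum(1 for allele in alleles if allele != "0")


def filter_variants(vcf_text, min_call_rate=0.9, min_maf=0.05):
    """Keep variants that pass FILTER, call-rate, and minor-allele-frequency checks.

    The thresholds are hypothetical defaults, not values from the tutorial.
    Returns (sample_names, [((chrom, pos), dosages), ...]).
    """
    samples, kept = [], []
    for line in vcf_text.splitlines():
        if line.startswith("##") or not line.strip():
            continue  # skip meta-information and blank lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns start after FORMAT
            continue
        if fields[6] not in ("PASS", "."):
            continue  # drop variants that failed an upstream filter
        # GT is assumed to be the first colon-separated FORMAT key.
        dosages = [parse_dosage(g.split(":")[0]) for g in fields[9:]]
        called = [d for d in dosages if d is not None]
        if not called or len(called) / len(dosages) < min_call_rate:
            continue  # too many missing genotypes
        alt_freq = sum(called) / (2 * len(called))
        if min(alt_freq, 1 - alt_freq) < min_maf:
            continue  # monomorphic or very rare variant
        kept.append(((fields[0], int(fields[1])), dosages))
    return samples, kept


if __name__ == "__main__":
    samples, kept = filter_variants(VCF_TEXT)
    print(samples)
    print(kept)
```

On the toy input only the variant at position 100 survives: position 200 fails the FILTER column, position 300 is monomorphic, and position 400 has a call rate of 0.75. The surviving rows form the sample-by-variant dosage matrix that a PCA would then take as input; Hail performs the equivalent steps in a distributed fashion on Spark.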


I prepared this tutorial for the course Scalable Data Science, which I attended as a student.

Below I provide a Databricks notebook and a video in which I walk through it.

Open this notebook in a full-width view.


 

Watch the lecture where I presented this notebook!

