Skip to main content
Artwork for Genomics analysis with Spark and Hail

Processing Genomic Data with Apache Spark (Big Data tutorial)

The current scale of genomic data production requires scaling the processing tools to analyze all that data. Hail, an open-source framework built on top of Apache Spark, provides such tools. It is capable of processing gigabyte-scale data on a laptop or terabyte-scale data on a cluster. In this tutorial, I show a simple Hail pipeline to filter a VCF file and build a PCA plot to explore the structure of the data.

I prepared this tutorial for the course Scalable Data Science, which I attended as a student.

Read More

GATK best practice for a non-model organism

GATK: the best practice for genotype calling in a non-model organism

For genotype calling in non-model organisms, modifications of the GATK Best Practices, which are developed specifically for human data, are often essential. This post shows my approach to this issue.

The GATK (Genome Analysis Toolkit) is the most used software for genotype calling in high-throughput sequencing data in various organisms. Its Best Practices are great guides for various analyses of sequencing data in SAM/BAM/CRAM and VCF formats. However, the GATK was designed and primarily serves to analyze human genetic data and all its pipelines are optimized for this purpose. Using the same pipelines without any modifications on non-human data can lead to some inaccuracy. This is especially an issue when a reference genome is not the same species as analyzed samples.

Here, I describe my GATK pipeline of genotype calling on whole genome sequencing data of Capsella bursa-pastoris, a non-model organism with the reference genome available only for a sister species. Although it is a particular study case, I believe that the explanation of my modifications can help other researchers to adopt this pipeline to their non-model organisms.

Read More