Genomic variant calling pipeline

I would like to share my automated genomic variant calling pipeline. Such a pipeline becomes essential when a project scales to dozens or hundreds of genomes.


Like probably every beginner, I used to process my genomic data with manual intervention at every step. I would submit mapping jobs for all samples on a computing cluster, wait until they were all done, then submit the mark-duplicates jobs, and so on. Moreover, I also wrote my sbatch scripts by hand (my cluster, UPPMAX, uses the Slurm Workload Manager). It was not efficient.

Admittedly, I used replacements (with sed) and loops (with for i in x; do ...) to reduce the amount of work, but many manual steps remained. I managed to process 24-31 small Capsella genomes (~200 Mb) this way during my PhD projects. Now I work with the dog genome, which is much bigger (~2.5 Gb), and I also need to analyze many more samples (82 genomes at the moment). So, I had to write this genomic variant calling pipeline to make my workflow as automatic as possible.
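For context, the kind of semi-manual templating I mean looked roughly like this. It is only a sketch: template.sh, the sample names, and the jobs/ directory are made up for illustration, and the actual sbatch submission is left as a comment.

```shell
# Generate one Slurm script per sample from a template (hypothetical
# template.sh containing the placeholder SAMPLE_ID).
mkdir -p jobs

cat > template.sh <<'EOF'
#!/bin/bash
#SBATCH -J map_SAMPLE_ID
bwa mem reference.fa SAMPLE_ID_R1.fastq.gz SAMPLE_ID_R2.fastq.gz > SAMPLE_ID.sam
EOF

for i in sampleA sampleB sampleC; do
    # Replace the placeholder with the real sample name.
    sed "s/SAMPLE_ID/$i/g" template.sh > "jobs/map_${i}.sh"
    # On the cluster I would then run: sbatch "jobs/map_${i}.sh"
done
```

This gets tedious fast: every pipeline stage needs its own template, and you still have to watch the queue and submit the next stage by hand.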

Read More


Processing Genomic Data with Apache Spark (Big Data tutorial)


The current scale of genomic data production requires scaling the processing tools to analyze all that data. Hail, an open-source framework built on top of Apache Spark, provides such tools. It is capable of processing gigabyte-scale data on a laptop or terabyte-scale data on a cluster. In this tutorial, I show a simple Hail pipeline to filter a VCF file and build a PCA plot to explore the structure of the data.


I prepared this tutorial for the course Scalable Data Science, which I attended as a student.
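To make the PCA step concrete outside of Hail, here is a miniature version of what it computes, sketched with plain NumPy on a made-up genotype matrix (samples × variants, genotypes coded as 0/1/2 alternate-allele counts). Hail runs the equivalent computation at scale on a distributed matrix table; the matrix and group structure below are invented purely for illustration.

```python
import numpy as np

# Toy genotype matrix: 4 samples x 6 variants, genotypes coded as
# alternate-allele counts (0, 1, or 2). Made up for illustration;
# samples 0-1 and 2-3 are constructed to form two groups.
G = np.array([
    [0, 1, 2, 0, 1, 0],
    [0, 1, 2, 0, 0, 0],
    [2, 0, 0, 2, 1, 2],
    [2, 1, 0, 2, 2, 2],
], dtype=float)

# Center each variant (column) on its mean, as in standard genotype PCA.
G_centered = G - G.mean(axis=0)

# SVD of the centered matrix gives principal components directly:
# rows of U scaled by the singular values are the per-sample PC scores.
U, s, Vt = np.linalg.svd(G_centered, full_matrices=False)
scores = U * s  # shape: (samples, components)

# PC1 should separate the two sample groups.
print(scores[:, 0])
```

Plotting the first two columns of `scores` against each other is exactly the kind of PCA plot the Hail tutorial builds, just on real VCF data instead of a toy matrix.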

Read More


GATK: the best practice for genotype calling in a non-model organism


For genotype calling in non-model organisms, modifications of the GATK Best Practices, which were developed specifically for human data, are often essential. This post describes my approach to this issue.


The GATK (Genome Analysis Toolkit) is the most widely used software for genotype calling from high-throughput sequencing data in various organisms. Its Best Practices are great guides for many analyses of sequencing data in SAM/BAM/CRAM and VCF formats. However, the GATK was designed for, and primarily serves, the analysis of human genetic data, and all its pipelines are optimized for this purpose. Using the same pipelines without any modification on non-human data can lead to inaccuracies. This is especially an issue when the reference genome is not from the same species as the analyzed samples.

Here, I describe my GATK genotype-calling pipeline for whole-genome sequencing data of Capsella bursa-pastoris, a non-model organism for which a reference genome is available only for a sister species. Although this is a particular case study, I believe that the explanation of my modifications can help other researchers adapt this pipeline to their own non-model organisms.

Read More