Processing Genomic Data with Apache Spark (Big Data tutorial)

Genomic data is now produced at a scale that requires processing tools able to scale with it. Hail, an open-source framework built on top of Apache Spark, provides such tools. It can process gigabyte-scale data on a laptop or terabyte-scale data on a cluster. In this tutorial, I show a simple Hail pipeline that filters a VCF file and builds a PCA plot to explore the structure of the data.

I prepared this tutorial for the course Scalable Data Science, which I attended as a student.

GATK best practice for a non-model organism

For genotype calling in non-model organisms, modifications of the GATK Best Practices, which are developed specifically for human data, are often essential. This post shows my approach to this issue.

The GATK (Genome Analysis Toolkit) is the most widely used software for genotype calling from high-throughput sequencing data in various organisms. Its Best Practices are great guides for various analyses of sequencing data in SAM/BAM/CRAM and VCF formats. However, the GATK was designed for, and primarily serves, the analysis of human genetic data, and all its pipelines are optimized for this purpose. Using the same pipelines without any modifications on non-human data can lead to inaccurate results. This is especially an issue when the reference genome does not come from the same species as the analyzed samples.

Here, I describe my GATK pipeline for genotype calling on whole genome sequencing data of Capsella bursa-pastoris, a non-model organism with a reference genome available only for a sister species. Although it is a particular case study, I believe that the explanation of my modifications can help other researchers adapt this pipeline to their non-model organisms.

Alignment with heterozygous genotypes coded by ambiguity characters

Ambiguity characters are often used to code heterozygous genotypes. However, coding heterozygotes as ambiguity characters may bias many estimates because most software treats such characters as uncertainty rather than as genotypes. This problem is very obvious, but in my experience it frequently goes unnoticed.
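To make the problem concrete, here is a minimal sketch of how a heterozygous site can silently drop out of an analysis. The function `pairwise_difference` and both sequences are hypothetical, written only for illustration; they are not taken from any particular program. The sketch assumes a tool that skips every site carrying a non-ACGT character, which is one common way of handling uncertainty.

```python
# Hypothetical illustration: a toy distance function that treats any
# non-ACGT character as uncertain and skips that site entirely.
UNAMBIGUOUS = set("ACGT")


def pairwise_difference(seq_a, seq_b):
    """Return (differences, compared_sites), skipping sites with ambiguity codes."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = 0
    compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in UNAMBIGUOUS or b not in UNAMBIGUOUS:
            continue  # ambiguity code: treated as missing data
        compared += 1
        if a != b:
            differences += 1
    return differences, compared


# Made-up data: sample_1 is heterozygous A/G at the third site (coded R)
# and heterozygous C/T at the sixth site (coded Y).
sample_1 = "ACRTGYAA"
sample_2 = "ACGTGCAT"

differences, compared = pairwise_difference(sample_1, sample_2)
print(differences, compared)
```

Both heterozygous sites are excluded from the comparison, so the information that sample_1 carries a second allele at those positions never reaches the estimate.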

IUPAC nucleotide code

The current nucleic acid notation appeared long before next-generation sequencing and whole genome data analyses. The characters A, C, G, and T were introduced to represent the four nucleotides of a DNA molecule. The ambiguity characters W, S, M, K, R, and Y were proposed to code positions where there is uncertainty between two nucleotides, and B, D, H, and V were used when the only thing known is that a position is not one particular nucleotide of the four (for example, B means "not A"). This coding system is known as the IUPAC nucleotide code.

It worked well in the early DNA sequencing era, when scientists studied short haploid DNA sequences, and there is still no alternative today. Virtually all software uses this coding system.
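The code described above can be written down as a small lookup table. This is only a sketch: the names `IUPAC_CODES`, `expand`, and `encode_genotype` are my own and do not come from any library. The entry N (any nucleotide) is part of the standard code, although it is not mentioned in the text above.

```python
# IUPAC nucleotide code as a mapping from character to the set of
# nucleotides it can stand for. Names here are illustrative only.
IUPAC_CODES = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    # two-nucleotide ambiguity characters
    "R": {"A", "G"}, "Y": {"C", "T"}, "S": {"C", "G"},
    "W": {"A", "T"}, "K": {"G", "T"}, "M": {"A", "C"},
    # three-nucleotide characters ("not A", "not C", "not G", "not T")
    "B": {"C", "G", "T"}, "D": {"A", "G", "T"},
    "H": {"A", "C", "T"}, "V": {"A", "C", "G"},
    # any nucleotide (standard code, not discussed in the text)
    "N": {"A", "C", "G", "T"},
}


def expand(code):
    """Return the sorted list of nucleotides an IUPAC character can represent."""
    return sorted(IUPAC_CODES[code.upper()])


def encode_genotype(allele_1, allele_2):
    """Return the IUPAC character that codes a diploid genotype."""
    alleles = {allele_1.upper(), allele_2.upper()}
    for code, nucleotides in IUPAC_CODES.items():
        if nucleotides == alleles:
            return code
    raise ValueError(f"no IUPAC character for alleles {sorted(alleles)}")


print(expand("R"))                 # nucleotides behind R
print(encode_genotype("A", "G"))   # heterozygous A/G
print(encode_genotype("C", "C"))   # homozygous C/C
```

Note that the same character R serves both for "A or G, unknown which" and for a heterozygous A/G genotype, so the character alone cannot tell the two cases apart.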

Taking notes on a computer instead of paper

The best note-taking software on Linux

In my first blog post, I would like to share my thoughts about one of the most essential pieces of software for a researcher: note-taking software. I use Linux exclusively, and I tested all the programs I describe on Linux, but most of them are also available on other platforms.

I absolutely agree with the saying that the worst piece of paper is better than the best memory. Given that we live in the digital age, I would also add that an electronic note synchronized with the cloud is even better than the best paper because paper is so easy to lose 🙂 .

The worst piece of paper is better than the best memory.

Nowadays we face a constant flow of an enormous amount of information, and remembering things is harder than ever. Often we have no time to consume all the information we want, so how can we remember everything?

Taking notes is an essential part of a researcher’s routine, and it has to be done efficiently.

