
Plot a map with rivers in R

There are many great libraries to create geographic maps in R. However, making a map with rivers in R is not trivial. I was not able to find an R library with a rivers map, but I found a simple way to add rivers to a map.

I recently needed to create a map for a publication in which we study fish species, so showing sampling locations on a map with rivers was a requirement. The obvious solution was of course to use R. And although making a map with rivers in R turned out to be easy, I spent half a day searching for a solution.

I use R to create maps for all my publications. It is a free, simple, and precise way to plot points on a nice-looking map. If I need to add more things to the map, I save it in SVG format and edit it in Inkscape, an open-source editor for vector graphics.


Python, R, Bash in one Jupyter Notebook

Combining Python, R, and Bash in one Jupyter Notebook makes the workflow easier to track, simplifies sharing, and makes you more efficient and professional.

Why Jupyter Notebooks?

If you read my Big Data tutorial, you are already familiar with Databricks notebooks. These notebooks allow combining code from many different programming languages (Scala, Python, etc.) in one notebook. I thought it would be great to set up a similar notebook environment locally on my computer to manage my workflows.


Genomic variant calling pipeline

I would like to share with you my automatic genomic variant calling pipeline. Such a pipeline becomes essential when a project scales to dozens or hundreds of genomes.

Like probably any beginner, I used to process my genomic data with manual intervention at every step: I would submit mapping jobs for all samples on a computing cluster, and when they were all done, I would submit the mark-duplicates jobs, and so on. Moreover, I would also write the sbatch scripts manually (my cluster, UPPMAX, uses the Slurm Workload Manager). It was not efficient.

Well, I used replacements (with sed) and loops (with for i in x; do ...) to reduce the amount of work, but many manual steps remained. I managed to process 24-31 small Capsella genomes (~200 Mb) this way during my PhD projects. Now I work with the dog genome, which is much bigger (~2.5 Gb), and I also need to analyze many more samples (82 genomes at the moment). So, I had to write this genomic variant calling pipeline to make my workflow as automatic as possible.
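The sed-and-loops approach mentioned above can be sketched as follows. This is a minimal illustration, not the actual pipeline: the sample names, template fields, and file names are invented.

```shell
# Template sbatch script with a SAMPLE placeholder (fields are made up
# for illustration; a real script would have more #SBATCH directives).
cat > map_template.sh <<'EOF'
#!/bin/bash
#SBATCH -J map_SAMPLE
#SBATCH -t 24:00:00
bwa mem reference.fa SAMPLE_R1.fq SAMPLE_R2.fq > SAMPLE.sam
EOF

# Generate one sbatch script per sample by substituting the placeholder.
for sample in dog01 dog02 dog03; do
    sed "s/SAMPLE/${sample}/g" map_template.sh > "map_${sample}.sbatch"
    # sbatch "map_${sample}.sbatch"   # submission disabled in this sketch
done
```

In the real workflow, the commented-out sbatch line would submit each generated script to Slurm; the pain point is that every subsequent step (mark duplicates, genotyping, etc.) needs the same treatment by hand, which is what the pipeline automates.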


Processing Genomic Data with Apache Spark (Big Data tutorial)

The current scale of genomic data production requires scaling the processing tools to analyze all that data. Hail, an open-source framework built on top of Apache Spark, provides such tools. It is capable of processing gigabyte-scale data on a laptop or terabyte-scale data on a cluster. In this tutorial, I show a simple Hail pipeline to filter a VCF file and build a PCA plot to explore the structure of the data.

I prepared this tutorial for the course Scalable Data Science, which I attended as a student.


GATK: the best practice for genotype calling in a non-model organism

For genotype calling in non-model organisms, modifications to the GATK Best Practices, which were developed specifically for human data, are often essential. This post shows my approach to this issue.

The GATK (Genome Analysis Toolkit) is the most widely used software for genotype calling from high-throughput sequencing data in various organisms. Its Best Practices are great guides for various analyses of sequencing data in SAM/BAM/CRAM and VCF formats. However, the GATK was designed primarily to analyze human genetic data, and all its pipelines are optimized for this purpose. Using the same pipelines without any modifications on non-human data can lead to inaccuracies. This is especially an issue when the reference genome is not from the same species as the analyzed samples.

Here, I describe my GATK pipeline for genotype calling on whole-genome sequencing data of Capsella bursa-pastoris, a non-model organism with a reference genome available only for a sister species. Although this is a particular case study, I believe that the explanation of my modifications can help other researchers adapt this pipeline to their own non-model organisms.


Heterozygotes as ambiguity characters. Mistakes you don’t want to make

Ambiguity characters are often used to code heterozygous genotypes. However, coding heterozygotes as ambiguity characters may bias many estimates, because most software treats such characters as uncertainty rather than as genotype information. This problem is very obvious, but in my experience it frequently goes unnoticed.

IUPAC nucleotide code

The current nucleic acid notation appeared long before next-generation sequencing and whole-genome data analyses. The characters A, C, G, and T were introduced to represent the four nucleotides of a DNA molecule. The ambiguity characters W, S, M, K, R, and Y were proposed to code positions with uncertainty between two nucleotides, and B, D, H, and V to code positions where there is confidence only that one particular nucleotide is absent. This coding system is known as the IUPAC nucleotide code.

It worked well in the early DNA sequencing era, when scientists studied short haploid DNA sequences, and there is still no alternative today. All software uses this coding system.
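The two-fold ambiguity codes listed above can be sketched as a small lookup that maps a heterozygous genotype to its IUPAC character. The function name is made up for illustration:

```shell
# Map a heterozygous genotype (two alleles) to its IUPAC ambiguity character.
iupac_het() {
    # Sort the two alleles so that "A G" and "G A" produce the same key.
    pair=$(printf '%s\n%s\n' "$1" "$2" | sort | tr -d '\n')
    case "$pair" in
        AT) echo W ;;  # weak
        CG) echo S ;;  # strong
        AC) echo M ;;  # amino
        GT) echo K ;;  # keto
        AG) echo R ;;  # purine
        CT) echo Y ;;  # pyrimidine
        *)  echo "no two-fold code for $1/$2" >&2; return 1 ;;
    esac
}

iupac_het A G   # prints R
```

The pitfall discussed in the post arises in the opposite direction: once a heterozygote is written as, say, R, most downstream tools cannot tell it apart from a genuinely uncertain base call.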


The best note-taking software on Linux

In my first blog post, I would like to share my thoughts about one of the most essential tools for a researcher: note-taking software. I use Linux exclusively, and I tested all the programs I describe on Linux, but most of them are also available on other platforms.

I absolutely agree with the saying that the worst piece of paper is better than the best memory. Given that we live in the digital age, I would also add that an electronic note synchronized to the cloud is even better than the best paper, because paper is so easy to lose 🙂 .

The worst piece of paper is better than the best memory.

Nowadays we face a constant flow of an enormous amount of information, and remembering things is harder than ever. Often we have no time even to consume all the information we want, so how can we remember everything?

Taking notes is an essential part of a researcher’s routine, and it has to be done efficiently.
