Dmytro Kryvokhyzha

R loops are slow: How to deal with that

2019-10-01T00:00:00+00:00

When you start learning R, no one tells you that R loops are slow. At least, I was not taught this, and I have not seen an explicit statement about it in R textbooks. Later on, you begin using R beyond trivial tasks and discover that R loops often become the bottleneck of your scripts. You wonder why that is and how to deal with it. This is exactly what happened to me.

Indeed, R for loops are inefficient, especially if you use them wrong. A search for why R loops are slow reveals that many users ask the same question. Below, I summarize my experience and the online discussions of this issue with some trivial code examples.

R is an interpreted language

This is what you need to keep in mind when you write in R or any other interpreted language. Making a language easy for the end user comes at the cost of processing speed. Many extra computing steps are needed to interpret the user-friendly code into machine code and execute it. That is why a compiled language is much faster: it doesn’t carry the extra baggage of interpretation.

Does this mean you need to learn C or similar languages? Of course not, though it won’t hurt to know some C :-). You just need to be aware of this fact and write your R code in a way that makes it efficient.

Below, I provide some examples that will help you understand what I am talking about. I also think these examples can serve as best practices for programming R loops.

Keep R loops code minimal

Let’s look at an example where even extra characters that do nothing affect the processing speed.

Create a matrix with random numbers:

set.seed(123)
m <- matrix(runif(1000000, max = 100), ncol = 100)

Calculate row means:

loopmean <- function(x){
  v <- c()
  for (i in c(1:dim(x)[1])){
    v[i] <- mean(x[i,])
  }
  return(v)
}

system.time(loopmeanD <- loopmean(m))
#  user  system elapsed 
# 0.039   0.003   0.044

The function here doesn’t matter. I just picked mean as the most trivial example. We are interested in the amount of time it takes to process this loop.

If we reuse the same code but add some extra brackets in mean(), it will take noticeably longer to process:

loopmeanBrackets <- function(x){
  v <- c()
  for (i in c(1:dim(x)[1])){
    v[i] <- mean(((((((((((((((((x[i,])))))))))))))))))
  }
  return(v)
}

system.time(loopmeanBracketsD <- loopmeanBrackets(m))
#  user  system elapsed
# 0.051   0.000   0.050

identical(loopmeanD, loopmeanBracketsD)
# [1] TRUE

We have changed nothing in terms of math. It is still the same calculation as before. However, in R each ( is itself a function call, so R has to evaluate every extra pair of brackets in every loop cycle, and this slows down the code (by about 20% in the benchmark below).

So, next time you write a loop, keep the code inside it as minimal as possible.

Process by columns

R naturally processes data by columns faster than by rows. If you need to loop through the rows, transpose your data and loop through the columns:

loopmeanColumn <- function(x){
  v <- c()
  for (i in (1:dim(x)[2])){
    v[i] <- mean(x[,i])
  }
  return(v)
}

tm <- t(m)
system.time(loopmeanColumnD <- loopmeanColumn(tm))
#  user  system elapsed 
# 0.037   0.000   0.036

identical(loopmeanD, loopmeanColumnD)
# [1] TRUE

Allocate memory

R also processes loops faster when you allocate the memory for the output object. In this case, R just needs to fill in the cells in a vector instead of extending the vector every loop cycle.

vertorloopmean <- function(x){
  v <- vector(length = dim(x)[1])
  for (i in c(1:dim(x)[1])){
    v[i] <- mean(x[i,])
  }
  return(v)
}

system.time(vertorloopmeanD <- vertorloopmean(m))
#  user  system elapsed 
# 0.031   0.000   0.032 

identical(loopmeanD, vertorloopmeanD)
# [1] TRUE

Use apply

When you search online for why R loops are slow, you are likely to find the advice to use apply because it is faster. I also thought that apply was faster than for loops until I did a little research for this blog post. In fact, apply also loops through the data, and it often seems to be a little faster than a for loop because its code tends to be shorter:

applymean <- function(x){
  v <- apply(x, 1, mean)
  return(v)
}

system.time(applymeanD <- apply(m, 1, mean))
#  user  system elapsed 
# 0.035   0.004   0.038 
identical(loopmeanD, applymeanD)
# [1] TRUE

Processing by columns looked slightly faster for apply in this single run, although the benchmark below shows no real difference:

applymeanColumn <- function(x){
  v <- apply(x, 2, mean)
  return(v)
}

system.time(applymeanColumnD <- applymeanColumn(tm))
#  user  system elapsed 
# 0.036   0.000   0.036
identical(loopmeanD, applymeanColumnD)
# [1] TRUE

See the benchmark of all loops below for more details on how apply compares to for loops. In this case, it actually was not faster than the for loop.

Compile your functions

You can improve the performance of your function by compiling it to byte code. This is especially beneficial when your function code is long.

library(compiler)

loopmeanCompiled <- cmpfun(loopmean)

system.time(loopmeanCompiledD <- loopmeanCompiled(m))
#  user  system elapsed 
# 0.035   0.000   0.035 
identical(loopmeanD, loopmeanCompiledD)
# [1] TRUE

Parallelize

R has several packages that allow you to parallelize the processing across the cores of your processor.

I usually use the doParallel package for that. It is not beneficial in this mean example, because it takes longer to split the processes between cores and collect the results than to run everything on one core. However, when each loop cycle is long enough, parallelizing helps a lot.

library(doParallel)

registerDoParallel(cores=12)

system.time(loopParallelD <- foreach(i=1:dim(m)[1], .combine=c) %dopar% mean(m[i,]))
#  user  system elapsed 
# 1.173   0.157   1.042 
identical(loopmeanD, loopParallelD)
# [1] TRUE

Use Built-in functions

Everything described above helps only marginally. You can get some performance improvements with these tricks, but you will never beat the built-in R functions that call C code directly without the interpretation step. Just look at how much faster the built-in function for row means is:

system.time(rowMeanD <- rowMeans(m))
#  user  system elapsed 
# 0.002   0.000   0.002 
identical(loopmeanD, rowMeanD)
# [1] TRUE

So, before you write your own function, check whether an R package already provides it.

Write in C++

There is also the option to write your code in C++ and call it from R with Rcpp. This also results in considerably faster code. But of course, you need to know some C++ for that.

#include <Rcpp.h>
using namespace Rcpp;

//[[Rcpp::export]]
NumericVector cRowMeans(NumericMatrix x) {
 int nrows = x.nrow();   
 NumericVector v(nrows);
 for (int i = 0; i < nrows; i++){
   v[i] = mean(x.row(i));
 }
 return v;
}

system.time(cRowMeansD <- cRowMeans(m))
#  user  system elapsed 
# 0.004   0.000   0.004
identical(loopmeanD, cRowMeansD)
# [1] TRUE

Benchmarking R loops

Running system.time() several times on the same function produces slightly different results. Although the timings presented above are comparable, they do not fully reflect reality. As I mentioned above, only after benchmarking all these functions did I discover that apply was not as fast as I expected. Moreover, in this particular code, it was slower than a simple for loop.

library(microbenchmark)

mbm <- microbenchmark("LoopRowExtraBrackets" = loopmeanBrackets(m),
                      "ApplyRow" = applymean(m),
                      "ApplyColumn" = applymeanColumn(tm),
                      "LoopRowCompiled" = loopmeanCompiled(m),
                      "LoopRow" = loopmean(m),
                      "LoopColumn" = loopmeanColumn(tm),
                      "LoopRowToVector" = vertorloopmean(m),
                      "RowLoopC" = cRowMeans(m),
                      "Built-in_rowMeans" = rowMeans(m),
                      check = 'equal', times = 100)

mbm
# Unit: milliseconds
#                 expr       min        lq      mean    median        uq       max neval  cld
# LoopRowExtraBrackets 44.693969 49.214291 53.174126 51.161645 53.941782 93.298477   100   d
#             ApplyRow 38.594932 44.356220 50.847681 46.949665 51.282932 88.668781   100   d
#          ApplyColumn 38.211502 44.035419 51.626075 47.071399 52.140827 94.046727   100   d
#              LoopRow 35.798877 40.832460 43.676707 42.363606 44.313524 80.957104   100   c 
#      LoopRowCompiled 33.665563 40.451894 42.942701 42.355379 44.258304 73.966924   100   c 
#           LoopColumn 34.566808 39.875668 42.668796 41.743636 44.099745 76.010563   100   c 
#      LoopRowToVector 32.187435 37.927207 40.912034 39.813388 42.197814 74.008110   100   c 
#             RowLoopC  2.794117  3.721194  5.664805  4.260946  6.525059 49.985055   100   b  
#    Built-in_rowMeans  1.571780  1.668413  1.815267  1.687677  1.791447  3.601554   100   a

And the visualization of these results:

library(ggplot2)
autoplot(mbm)

Code

You can download the R code and test everything yourself.

Conclusion

We use R not because of its speed but because of its ease of use. The most efficient R code will never be faster than the equivalent C code. But knowing the behavior of R described above will help you make your R loops as fast as they can be within the limits of R as an interpreted language.

AWK is awesome

2019-09-17T00:00:00+00:00

AWK has been the most beneficial programming language I have ever learned. It took me only a day to learn most of it and it saved me several weeks if not months already. I use AWK almost every day.

It is better to see AWK in action once than to hear about it a thousand times. So, let’s start with the examples.

Table summary

I usually use AWK to calculate some simple summary statistics for a table. For example, let’s assume you have a file table.txt with some numeric values:

CHR	START	END	LOD	SCORE
chr1	211829	211850	lod=31	333
chr1	211867	211871	lod=13	247
chr1	211877	211903	lod=66	408
chr1	211913	211927	lod=61	400
chr1	211971	211994	lod=60	399
chr1	211996	212024	lod=72	417
chr6	310311	310324	lod=16	268
chr6	312061	312066	lod=13	247
chr6	312100	312206	lod=376	580
chr6	312653	312728	lod=19	285
chr6	312908	313028	lod=348	573
chr6	313549	313788	lod=900	667
chr6	313589	313784	lod=747	648

Mean

You can quickly get the mean SCORE value:

awk '{s+=$5} END {print s/(NR-1)}' table.txt

where s+=$5 sums up all values of the 5th column, and NR is a built-in variable that holds the number of the current row, so in the END block it equals the total number of rows. I use NR-1 because I skip the header. The command after END is executed when the end of the file is reached.

To see what AWK does line by line, run this command:

awk 'BEGIN{print "SCORE", "SUM", "LINE_NUMBER"} {s+=$5; print $5, s, NR} END {print "mean:", s/(NR-1)}' table.txt

But how do you calculate the mean of the LOD column, which has lod= in front of each number?

You can use AWK to clean the data and do the calculation:

awk 'gsub( "lod=", "" , $4){s+=$4}END{print s/(NR-1)}' table.txt

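gsub( "lod=", "" , $4) replaces lod= with an empty string before any calculation is done. It also returns the number of replacements made, so the header line, which contains no lod=, is skipped.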

You can also limit the calculation to one chromosome:

awk '$1=="chr1" {n++; s+=$5} END {print s/n}' table.txt

Here $1=="chr1" is the condition, and the block {n++; s+=$5} is executed only for the lines that meet it. Also, NR is replaced with the counter n, which counts only the lines where $1=="chr1".

Min and max

Using the same principles, you can get the minimum and maximum values of the SCORE column:

awk 'NR==2 || $5 < min {min=$5} END{ print min}' table.txt
awk 'NR==2 || $5 > max {max=$5} END{ print max}' table.txt

|| is the logical OR operator in AWK.

You can combine these two commands in one:

awk 'NR==2 {min=$5; max=$5} $5 > max {max=$5} $5 < min {min=$5} END {print "min: ", min, "\nmax: ", max}' table.txt

NR==2 {min=$5; max=$5} assigns the initial values of min and max using the second row. $5 > max {max=$5} and $5 < min {min=$5} are conditional statements that are checked one after another.

Mean, max, and min in one line

You can also combine all three calculations in one line and get all statistics in one run:

awk ' NR==2 {min=$5; max=$5} $5 > max {max=$5} $5 < min {min=$5} {s+=$5} END {print "min: ", min, "\nmax: ", max, "\nmean: ", s/(NR-1)}' table.txt

Genotypes summary

There are more complicated cases where you can use AWK.

You may want to do some calculations of the genotype table generated by VariantsToTable from the GATK:

#CHROM  POS     REF     12.4.GT 13.16.GT        16.9.GT
scaffold_1      191     A       ./.     ./.     A/A
scaffold_1      563     T       T/T     ./.     T/A
scaffold_1      647     A       C/C     C/C     A/C
scaffold_1      669     T       T/T     T/T     T/T
scaffold_1      679     C       C/A     C/A     C/A
scaffold_1      704     T       T/C     T/C     T/C
scaffold_1      721     T       C/C     C/C     C/C
scaffold_1      722     C       C/T     C/T     C/T
scaffold_1      733     G       G/T     G/T     G/*

For example, I often calculate the number of heterozygous sites, homozygous sites, and missing genotypes. To that end, I use this AWK script saved in the summarizeTAB.awk file:

{
    # Initialize the counters (for a regular table this runs once, on the header line)
    if (NF > maxNF) {
        for (i = 4; i <= NF; i++) {
            countN[i] = 0; countHomo[i] = 0; countHetero[i] = 0; countNA[i] = 0;
        }
        maxNF = NF;
    }
    if (NR == 1) {
        # Header line: store the sample names
        for (i = 4; i <= NF; i++) samples[i] = $i;
    }
    else {
        # Genotype columns start at the 4th field
        for (i = 4; i <= NF; i++) {
            if ($i == "N" || $i == "./.") countN[i]++;
            else if ($i == "A/A" || $i == "T/T" || $i == "G/G" || $i == "C/C") countHomo[i]++;
            else if ($i == "G/A" || $i == "T/C" || $i == "A/C" || $i == "G/T" || $i == "C/G" || $i == "A/T" || \
                     $i == "A/G" || $i == "C/T" || $i == "C/A" || $i == "T/G" || $i == "G/C" || $i == "T/A") \
                     countHetero[i]++;
            else countNA[i]++;
        }
    }
}
END {
    print "Sample", "Genotypes", "Heterozygots", "Homozygots", "Missing", "Unknown";
    for (i = 4; i <= maxNF; i++)
        print samples[i], countHomo[i]+countHetero[i]+0, countHetero[i]+0, countHomo[i]+0,  countN[i]+0, countNA[i]+0;
}

It loops through the columns starting from the 4th one and counts the heterozygous, homozygous, missing, and unknown genotypes. These numbers are stored in the corresponding arrays, one counter per sample.

When the script is too long to fit it in one line as in this case, you can write it into a file and tell AWK to execute it:

awk -f summarizeTAB.awk geno.tab
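To make the classification logic of the script easier to follow, here is a rough Python sketch of the same counting. It is an illustration only: the function names classify and summarize are hypothetical, the genotype table from above is embedded as a string so that the snippet runs on its own, and any genotype that is not a pair of A, C, G, or T alleles is treated as unknown, as in the AWK script.

```python
from collections import Counter

# The genotype table from above, embedded so the sketch is self-contained.
GENO_TABLE = """\
#CHROM POS REF 12.4.GT 13.16.GT 16.9.GT
scaffold_1 191 A ./. ./. A/A
scaffold_1 563 T T/T ./. T/A
scaffold_1 647 A C/C C/C A/C
scaffold_1 669 T T/T T/T T/T
scaffold_1 679 C C/A C/A C/A
scaffold_1 704 T T/C T/C T/C
scaffold_1 721 T C/C C/C C/C
scaffold_1 722 C C/T C/T C/T
scaffold_1 733 G G/T G/T G/*
"""

BASES = {"A", "C", "G", "T"}


def classify(genotype):
    """Return 'missing', 'homozygous', 'heterozygous', or 'unknown'."""
    if genotype in ("N", "./."):
        return "missing"
    alleles = genotype.split("/")
    if len(alleles) == 2 and all(a in BASES for a in alleles):
        return "homozygous" if alleles[0] == alleles[1] else "heterozygous"
    return "unknown"


def summarize(text):
    """Count genotype classes per sample; samples start at the 4th column."""
    lines = text.strip().splitlines()
    samples = lines[0].split()[3:]
    counts = {sample: Counter() for sample in samples}
    for line in lines[1:]:
        for sample, genotype in zip(samples, line.split()[3:]):
            counts[sample][classify(genotype)] += 1
    return counts


summary = summarize(GENO_TABLE)
for sample, c in summary.items():
    print(sample, c["heterozygous"], c["homozygous"], c["missing"], c["unknown"])
```

The Python version is longer than the AWK one, mostly because the file parsing that AWK does implicitly has to be written out.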

AWK vs Python

AWK code is usually shorter and often runs faster than the equivalent Python code. I do not have a dramatic example where my AWK code is substantially shorter than the equivalent Python code. But there are great examples from other AWK users.
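For a rough sense of the difference, here is one way the mean of the SCORE column could be written in plain Python. This is only a sketch: a few rows of the table are embedded as a string so that the snippet runs on its own, and the function name mean_score is illustrative.

```python
import io

# A few rows in the format of table.txt, embedded so the sketch is self-contained.
TABLE = """\
CHR START END LOD SCORE
chr1 211829 211850 lod=31 333
chr1 211867 211871 lod=13 247
chr1 211877 211903 lod=66 408
chr6 312100 312206 lod=376 580
"""


def mean_score(handle):
    """Mean of the 5th column, skipping the header line."""
    total = 0.0
    n = 0
    for line_number, line in enumerate(handle, start=1):
        if line_number == 1:       # header, the role NR>1 plays in AWK
            continue
        fields = line.split()      # split on whitespace, like AWK's default
        if not fields:             # ignore empty lines
            continue
        total += float(fields[4])  # $5 in AWK is index 4 in Python
        n += 1
    return total / n


print(mean_score(io.StringIO(TABLE)))
```

The AWK one-liner does the same job in a single short command, because reading lines and splitting fields are built into the language.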

Be careful with AWK

There is one key point you need to keep in mind when you work with AWK. It doesn’t throw an error when it encounters something unusual. Instead, AWK tries to guess how to handle it and proceeds silently. This can put you in danger.

Using the mean SCORE column example from above, you can see that AWK treated the characters in the header as 0.

This would throw you an error in Python. A character string and numeric value cannot be summed. But AWK doesn’t give such an error.
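A minimal Python illustration of that difference (the variable names are only for this example):

```python
score_header = "SCORE"  # the header cell that AWK silently treats as 0
total = 0
message = None

try:
    total += score_header  # adding a string to a number
except TypeError as err:
    message = str(err)
    print("Python refuses:", message)

print("total is still", total)
```

Python stops at the first non-numeric value, so you cannot overlook the header by accident, whereas AWK carries on with a 0.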

You would have made a mistake in the mean calculation if you had counted the lines with n++. It would have counted the header too. That’s why I subtracted 1 from the number of rows given by NR.

Similarly, if you have missing data points in the form of NA, you need to tell AWK to skip them:

awk 'NR>1 && $5!="NA" {s+=$5; n++; print $5, s, n} END {print "mean:", s/n}' table.txt

I also used NR>1 to skip the header.

So, you need to be aware of this behavior of AWK when you have a mixture of data types.

Where to learn AWK

If you want to learn AWK, I recommend the course “To awk or not to…”. It was fantastic when I took it in 2017, and it has improved since then.

I also often visit this AWK page, for quick reference on the functions.

If you have never used AWK, give it a try. It may change your life forever.

Interpopulation comparison of Copy Number Variants

2019-09-10T00:00:00+00:00

I showed how to efficiently genotype Copy Number Variants with GATK and Snakemake. As a continuation of the Copy Number Variation topic, I will share how I compared the Copy Number Variation along the genome between three different populations. If you also analyze the population genomic data, I hope you will find this post useful.

Although the GATK Copy Number Variants (CNVs) calling pipeline utilizes the population variation during the CNVs calling in the cohort mode, it produces separate VCF files for each sample. The CNVs in such VCF files look similar to this:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  sample1
chrN    45894001        CNV_chrN_45894001_46949000      N       <DEL>,<DUP>     .       .       END=46949000    GT:CN:NP:QA:QS:QSE:QSS  0:2:1055:47:3077:78:119
chrN    46949001        CNV_chrN_46949001_46956000      N       <DEL>,<DUP>     .       .       END=46956000    GT:CN:NP:QA:QS:QSE:QSS  2:4:7:6:9:15:8
chrN    46956001        CNV_chrN_46956001_55222000      N       <DEL>,<DUP>     .       .       END=55222000    GT:CN:NP:QA:QS:QSE:QSS  0:2:8263:17:3077:108:19
chrN    55222001        CNV_chrN_55222001_55223000      N       <DEL>,<DUP>     .       .       END=55223000    GT:CN:NP:QA:QS:QSE:QSS  1:0:1:493:493:493:493

If you compare the CNVs from different samples, you will most likely find that the breakpoints are not the same across your samples. This makes it hard to match the CNVs from different samples when you estimate interpopulation differences along the genome.

Variation in breaking points of CNVs across samples in IGV

To overcome this problem, we decided to bin each CNV segment according to the breakpoints it overlaps. This allowed us to merge the CNVs of all samples into one large table where the genomic coordinates are the same:

CHROM	 POS	     END       s1   s2  s3  s4  s5  s6  s7  s8
chrN	46939001	46949000	2	2	2	2	2	2	2	2	
chrN	46949001	46951000	4	5	1	4	5	1	4	3	
chrN	46951001	46955000	4	5	1	4	5	1	4	3	
chrN	46955001	46956000	4	5	1	4	5	1	4	3	
chrN	46956001	46957000	2	2	2	2	2	2	2	2	

Such a table can be used to estimate various statistics along the genome for different populations. For example, I estimated Vst between populations (it’s like Fst but for CNVs.)

Let me show you everything step-by-step.

Visualize the CNV variation

To make sure my CNV calls are good, I explored the CNV variation between my samples in IGV.

First, I extracted the CNV genotypes from the VCF files using GATK and converted the resulting tables into seg and tab formats using Snakemake:

# Use Snakemake 4, GATK 4

CHROMOSOMES, SAMPLES, = glob_wildcards('chr{i}_{sample}_segments_cohort.vcf')
REF = '/path/to/reference.fa'

rule all:
    input:
        expand('chr{i}_{sample}_segments_cohort.seg', sample=SAMPLES, i=CHROMOSOMES)

rule toTable:
    input:
        ref = REF,
        vcf = 'chr{i}_{sample}_segments_cohort.vcf'
    output:
       'chr{i}_{sample}_segments_cohort.table'
    shell:
        '''
        gatk --java-options "-Xmx8G" VariantsToTable \
        -R  {input.ref} \
        -V  {input.vcf} \
        -F ID -GF CN \
        -O {output}
        '''

rule toSeg:
    input:
        'chr{i}_{sample}_segments_cohort.table'
    output:
       seg = 'chr{i}_{sample}_segments_cohort.seg',
       tab = 'chr{i}_{sample}_segments_cohort.tab'
    params:
        '{sample}'
    shell:
        '''
        sed 's/CNV_//g;s/_/\t/g' {input} | \
        awk -v s={params} 'BEGIN{ {print "CHROM\\tPOS\\tEND\\t"s".CN"} } NR>1 { {print $0} }' \
        >  {output.tab} && \
        sed 's/CNV_//g;s/_/\t/g' {input} | \
        awk -v s={params} 'BEGIN{ { print s"\\tCHROM\\tPOS\\tEND\\t"s".CN" } } NR>1 { {print s"\\t"$0} }' \
        >  {output.seg}
        '''

This produced three files per sample. Here is an example of these files:

chrN_sample1_segments_cohort.table:

ID                              sample1.CN
CNV_chrN_45894001_46949000        2
CNV_chrN_46949001_46956000        4
CNV_chrN_46956001_55222000        2
CNV_chrN_55222001_55223000        0

chrN_sample1_segments_cohort.tab:

CHROM   POS     END     sample1.CN
chrN    45894001        46949000        2
chrN    46949001        46956000        4
chrN    46956001        55222000        2
chrN    55222001        55223000        0

chrN_sample1_segments_cohort.seg:

sample1        CHROM     POS            END     sample1.CN
sample1         chrN    45894001        46949000        2
sample1         chrN    46949001        46956000        4
sample1         chrN    46956001        55222000        2
sample1         chrN    55222001        55223000        0

Then, load the *.seg files into IGV and you will get a picture similar to this one:

Visualizing CNVs in IGV

In IGV, red indicates duplications, blue marks deletions, and white depicts the diploid state. The intensity of the color corresponds to the number of gained or lost copies.

Bin the segments

To bin all the segments into the same set of segments across samples, I merged and sorted the coordinates from all samples:

cat chrN_*_segments_cohort.tab | cut -f 1,2,3 | grep -v POS | sort -V -u -k 2,2 -k 3,3 | awk 'BEGIN{print"CHROM\tPOS\tEND"}{print $0}' > chrN_CNV_intervals.bed

And created the reference interval file in R:

d <- read.table('chrN_CNV_intervals.bed', header = T)

breaks <- sort(unique(c(d$POS-1, d$END)))
bins <- data.frame(CHROM=rep('chrN', length(breaks[-1])), POS = head(breaks, -1)+1, END=breaks[-1])
options(scipen = 999) # disables scientific notation
write.table(bins, 'chrN_CNV_intervals_bins.bed', sep='\t', quote = F, row.names = F)
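The binning logic of the R code above is compact, so a small worked example may help. The sketch below repeats the same steps in Python on three made-up intervals; the function name make_bins and the coordinates are hypothetical and chosen only for illustration.

```python
def make_bins(intervals):
    """Split overlapping intervals into bins at every breakpoint.

    intervals: list of (pos, end) pairs with 1-based inclusive coordinates.
    Mirrors the R code: breaks are all (pos - 1) and end values, sorted and
    de-duplicated, and each pair of neighbouring breaks becomes one bin.
    """
    breaks = sorted({pos - 1 for pos, _ in intervals} | {end for _, end in intervals})
    return [(left + 1, right) for left, right in zip(breaks, breaks[1:])]


# Three segments from different samples whose breakpoints do not agree.
example = [(1, 100), (101, 200), (51, 150)]
bins = make_bins(example)
print(bins)  # [(1, 50), (51, 100), (101, 150), (151, 200)]
```

Note that, as in the R code, a gap between two intervals that no sample covers would also come out as a bin.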

You can visualize the original interval list and bins with this R code:

plot_intervals <- function(df) {
  lines <- c(1:length(df$CHROM))
  nLines <- max(lines)
  plot(0, xlim = c(min(df$POS), max(df$END)), ylim = c(1, nLines), type="n",
       main = "Merged intervals", xlab="", ylab="Interval number")
  for (i in lines){
    segments(df$END[i], nLines+1-i, df$POS[i], nLines+1-i, lwd=2, col = "gray")
  }
}

jpeg('chrN_CNV_intervals_bins.jpeg', width = 740, height = 600)
par(mar=c(4,4,2,1), mfrow=c(2,1), cex=1.1)
plot_intervals(d)
plot_intervals(bins)
dev.off()

Merge all CNV files

Then, I used merge_CNVs_tabs.py to merge all CNV files with chrN_CNV_intervals_bins.bed:

for i in *_segments_cohort.tab
    do
        python ~/git/genotype-files-manipulations/merge_CNVs_tabs.py -i $i -r chrN_CNV_intervals_bins.bed -o $i.bin && cut -f 4 $i.bin > $i.bin.col4
    done
paste sample1_segments_cohort.tab.bin tab/*.bin.col4 > segments_cohort_bins.tab

The resulting file has the following format:

CHROM	 POS	     END       s1   s2  s3  s4  s5  s6  s7  s8
chrN	46939001	46949000	2	2	2	2	2	2	2	2	
chrN	46949001	46951000	4	5	1	4	5	1	4	3	
chrN	46951001	46955000	4	5	1	4	5	1	4	3	
chrN	46955001	46956000	4	5	1	4	5	1	4	3	
chrN	46956001	46957000	2	2	2	2	2	2	2	2	
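The merge_CNVs_tabs.py script is not reproduced here. As a rough illustration of the idea only, and not of the script's actual implementation, the sketch below assigns a copy number to each bin by looking up the segment of one sample that contains it. The function name copy_number_per_bin is hypothetical, and the sketch assumes that every bin lies entirely within at most one segment, which holds when the bins are built from the breakpoints of the same segments.

```python
def copy_number_per_bin(bins, segments):
    """Return one copy number per bin for a single sample.

    bins: list of (pos, end) pairs.
    segments: list of (pos, end, copy_number) for one sample.
    A bin that no segment covers gets None.
    """
    result = []
    for bin_pos, bin_end in bins:
        copy_number = None
        for seg_pos, seg_end, seg_cn in segments:
            if seg_pos <= bin_pos and bin_end <= seg_end:
                copy_number = seg_cn
                break
        result.append(copy_number)
    return result


# Bins from the table above and the sample1 segments from the .tab example.
bins = [(46939001, 46949000), (46949001, 46951000), (46951001, 46955000),
        (46955001, 46956000), (46956001, 46957000)]
sample1 = [(45894001, 46949000, 2), (46949001, 46956000, 4),
           (46956001, 55222000, 2)]

print(copy_number_per_bin(bins, sample1))  # [2, 4, 4, 4, 2]
```

The printed values correspond to the s1 column of the merged table.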

Calculate Vst

The obtained segments_cohort_bins.tab can be used to calculate various statistics. For example, you can calculate Vst in R:

d <- read.table('chrN_segments_cohort_bins.tab', header = T)

dd <- d[,-c(1:3)]
group <- factor(c("red", "black", "blue", "red", "black", "blue", "red", "blue"))

getVst <- function(dat, groups, comparison) {
  groupLevels <- levels(groups)
  dat1 <- na.omit(dat[groups==groupLevels[groupLevels==comparison[1]]])
  dat2 <- na.omit(dat[groups==groupLevels[groupLevels==comparison[2]]])
  Vtotal <- var(c(dat1, dat2))
  Vgroup <- ((var(dat1)*length(dat1)) + (var(dat2)*length(dat2))) /
             (length(dat1)+length(dat2))
  Vst <- c((Vtotal-Vgroup) / Vtotal)
  if (is.nan(Vst)){
    Vst <- 0
  }
  return(Vst)
}

d$Vst_red_black <- apply(dd, 1, function(x) getVst(x, group, c("red", "black")))
d$Vst_red_blue <- apply(dd, 1, function(x) getVst(x, group, c("red", "blue")))
d$Vst_blue_black <- apply(dd, 1, function(x) getVst(x, group, c("blue", "black")))

write.table(d, 'chrN_segments_cohort_bins_Vst.csv', sep='\t', quote = F, row.names = F)

The table chrN_segments_cohort_bins_Vst.csv will look like this:

CHROM   POS    END     s1 s2 s3 s4 s5 s6 s7 s8 Vst_red_black Vst_red_blue Vst_blue_black
chrN 46939001 46949000  2  2  2  2  2  2  2  2             0    0.0000000            0.0
chrN 46949001 46951000  4  5  1  4  5  1  4  3             1    0.6923077            0.8
chrN 46951001 46955000  4  5  1  4  5  1  4  3             1    0.6923077            0.8
chrN 46955001 46956000  4  5  1  4  5  1  4  3             1    0.6923077            0.8
chrN 46956001 46957000  2  2  2  2  2  2  2  2             0    0.0000000            0.0
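As a cross-check of the values in this table, the same formula can be evaluated for a single bin. The sketch below re-implements getVst in Python under the assumption that there are no missing values and that each group has at least two samples; the function name vst is only for this example.

```python
from statistics import variance  # sample variance, like var() in R


def vst(values, groups, comparison):
    """Vst between two groups for one bin: (Vtotal - Vgroup) / Vtotal."""
    dat1 = [float(v) for v, g in zip(values, groups) if g == comparison[0]]
    dat2 = [float(v) for v, g in zip(values, groups) if g == comparison[1]]
    v_total = variance(dat1 + dat2)
    if v_total == 0:
        # No variation at all: R would give NaN here, which getVst turns into 0.
        return 0.0
    v_group = (variance(dat1) * len(dat1) + variance(dat2) * len(dat2)) / (
        len(dat1) + len(dat2)
    )
    return (v_total - v_group) / v_total


groups = ["red", "black", "blue", "red", "black", "blue", "red", "blue"]
row = [4, 5, 1, 4, 5, 1, 4, 3]  # copy numbers of s1..s8 in the second bin

print(vst(row, groups, ("red", "black")))   # 1 in the table
print(vst(row, groups, ("red", "blue")))    # 0.6923077 in the table
print(vst(row, groups, ("blue", "black")))  # 0.8 in the table
```

For example, for red versus blue the total variance of 4, 4, 4, 1, 1, 3 is about 2.167 and the size-weighted within-group variance is about 0.667, which gives 1.5 / 2.167, or about 0.692.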

Exploring the distribution of Vst can identify genomic regions of high divergence.

Final thought

You can use the CNV table chrN_segments_cohort_bins.tab to calculate many other things.

We found binning the CNV segments to be the most parsimonious way to merge all samples into one table. If there is a better way to solve the problem of variation in CNV breakpoints for the interpopulation comparison, please let me know.

If you have any questions or suggestions, feel free to email me.

The best free Research Data Repository

2019-08-27T00:00:00+00:00

You need to deposit your research data to a repository and you are lost in options. I have been in the same situation recently.

If your data is of a specific type, the choice is obvious: you deposit it in a data-type-specific repository. For example, nucleic acid sequence data need to be uploaded to the Sequence Read Archive (SRA). Scripts and programs should be deposited on GitHub or a similar resource with a version control system. Usually, you should do your best to use these repositories because this increases the chance that other researchers will find your data. Here is an extensive list of data-type specific repositories.

But if you also have some non-standard data formats, you need to use a generalist repository. The most popular ones are Dryad, FigShare, and Zenodo. These were the repositories I found first. Later, I also discovered the Open Science Framework (OSF) and it became my number one research data repository.

My key criteria when I was looking for the best repository for my scientific data were:

Free
DOI
Ability to update files
Directory structure

Publishing in open-access journals already costs a fortune, so I wanted to use a free repository to avoid additional spending. A digital object identifier (DOI) is probably a must for any publication. It is especially useful if you publish a dataset without a link to any paper. A DOI makes it easier to cite the dataset. I also would like to have the option to edit or update the data after the initial deposit. Mistakes are always possible and it is better to be able to correct them. The amount of data grows enormously, and my projects usually have many files structured in directories. I would like to keep this directory structure in my repositories too. The OSF repository meets these requirements best.

Let me briefly summarize my opinion on each of the repositories I tried.

Dryad

Dryad is the most popular research data repository. It is recommended by many journals. I used it to publish the supplementary data for my Molecular Ecology paper. By publishing in Molecular Ecology, you get a link to deposit your data to Dryad for free.

However, it is not a free repository. You need to pay $120 for a submission of up to 20GB, and +$50 for each additional 10GB. On the other hand, such a business model guarantees the long-term existence of this repository.

I like it for its simple and easy-to-use interface. Uploading the data is very simple and fast. You get a DOI for your data and some simple metrics such as the number of page views and downloads. But you cannot edit anything after the submission. There is no directory structure support, so you can upload a directory only as an archive file.

Pros:

popular
simple
DOI
metrics

Cons:

non-free
no edit/update after the submission
no directory structure support
not optimized for downloading many files at once

FigShare

FigShare is a great repository for visual content. It shows a preview of every file. If I recall correctly, this was the initial purpose of FigShare. Now, you can also use FigShare to upload any file types.

There is no limit on file size if you make the files public. You can modify your files after the publication with a version control system.

I think FigShare should be used only to share posters, slides, and figures. It is not convenient for sharing dozens of files. You can use collections and projects to unite many files. But there is no easy way to download many files. The interface of the repository is also not simple. You often need to navigate several windows to access a file.

Pros:

popular
free
DOI
unlimited space
image preview

Cons:

optimized only for single visual file sharing
complicated to use
no directory structure support
not optimized for downloading many files at once

Zenodo

Zenodo is good in many regards. It is free. There is a version control system. A DOI is provided. You can track page views and downloads.

The file size limit is 50GB per dataset but you can have an unlimited number of datasets.

However, you cannot create folders with files. You can upload each folder as a separate dataset or compress each folder into an archive and upload it. But this is not an ideal solution.

Pros:

popular
free
DOI
simple interface
version control system

Cons:

no directory structure support
not optimized for downloading many files at once
50GB limit per dataset

Open Science Framework

OSF is my favorite repository to store my research data. It is surprisingly not very popular. It took a while until I found it. I believe its popularity will grow as it is an amazing repository for scientific data.

OSF is free. You get a DOI for your repository. There is a version control system. It supports directory structure in repositories. You can update your files after the publication and the history of the repository is tracked.

The default file size limit is 5 GB. But you can extend this limit with add-ons.

The OSF interface is more advanced than in other repositories. I consider it an advantage. But it is a little too advanced and some users may find it difficult to use. So, I will still list it in the cons.

Pros:

free
DOI
version control system
supports directory structure
optimized for downloading many files at once

Cons:

not popular
advanced interface
5GB limit per file (no number of files limit)

I have not explored the funding of other repositories but OSF is secured by funding for 50+ years. The chance it will disappear is very small.

Mendeley

Mendeley is known as a digital library app with great reference tools. Recently, it also launched the Mendeley Data service. I found out about this Mendeley Data repository while writing this blog post.

It is a simple repository. If you already use Mendeley and you do not want to bother with other options, go ahead and use Mendeley Data.

You can see its pros and cons below. I would only like to emphasize that there is a moderation step before your data is published. So, be ready to wait some time before your data becomes public.

Pros:

popular
simple
DOI
supports directory structure
optimized for downloading all files at once

Cons:

no version control system
moderation
10 GB per dataset

Summary

This is not a comprehensive review. I only evaluated these repositories against my own requirements. For example, you may need to check the funding of free repositories to make sure they won’t disappear soon. I also did not pay attention to the license types these repositories support because I usually release my data into the public domain anyway.

If you think there is something crucial I missed, please let me know and I will add it.

Snakemake checkpoint tutorial

2019-08-16T00:00:00+00:00

If you want to use Snakemake to run programs that output an unknown number of files, you need to tell Snakemake about that. If you use Snakemake 4, you can do that by marking the output with dynamic(). If you have upgraded to Snakemake 5, you should use checkpoint instead. Using dynamic() will work in Snakemake 5, but you will see a message saying that dynamic output is deprecated and will be fully replaced by checkpoints in Snakemake 6.

This post shows how to use both dynamic() and checkpoint.

You should probably focus on checkpoint because it is the more up-to-date solution. But checkpoint may not always work correctly. For example, I tested it with the GATK IntervalListTools and it did not work correctly, while dynamic() worked fine with exactly the same command. Thus, knowing both approaches can be helpful.

Checkpoint

The checkpoint function was introduced in Snakemake 5 and it will completely replace dynamic() in Snakemake 6. So, if you have not tried it, it is time to learn it.

Here is some dummy code that shows how checkpoint works:

rule final_output:
    input:
        'scatter_copy_head_collect/all.txt'

# generate random number of files
checkpoint scatter:
    output:
        directory('scatter')
    shell:
        '''
        mkdir {output}
        N=$(( $RANDOM % 10))
        for j in $(seq 1 $N); do echo -n $j > {output}/$j.txt; done
        '''

# process these unknown number of files
rule scatter_copy:
    output:
        txt = 'scatter_copy/{i}_copy.txt',
    input:
        txt = 'scatter/{i}.txt',
    shell:
        '''
        cp -f {input.txt} {output.txt}
        echo -n "_copy" >> {output.txt}
        '''
# process scatter_copy output
rule scatter_copy_head:
    output:
        txt = 'scatter_copy_head/{i}_head.txt',
    input:
        txt = 'scatter_copy/{i}_copy.txt',
    shell:
        '''
        cp -f {input.txt} {output.txt}
        echo "_head" >> {output.txt}
        '''

# collect the results of processing unknown number of files
# and merge them together into one file:

def aggregate_input(wildcards):
    '''
    aggregate the file names of the random number of files
    generated at the scatter step
    '''
    checkpoint_output = checkpoints.scatter.get(**wildcards).output[0]
    return expand('scatter_copy_head/{i}_head.txt',
           i=glob_wildcards(os.path.join(checkpoint_output, '{i}.txt')).i)

rule scatter_copy_head_collect:
    output:
        combined = 'scatter_copy_head_collect/all.txt',
    input:
        aggregate_input
    shell:
        '''
        cat {input} > {output.combined}
        '''

Run the pipeline and explore the output directories to understand how it works.
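The aggregation step is the part that usually needs some explanation, so here is a minimal sketch in plain Python of what aggregate_input does once the checkpoint has finished: it lists the files that actually exist in the scatter directory and turns each one into the name of the file the collect rule should wait for. The helper names list_scatter_ids and aggregate_targets are hypothetical and only imitate glob_wildcards() and expand(); this is not Snakemake code.

```python
import os
import re
import tempfile


def list_scatter_ids(scatter_dir):
    """Return the {i} part of every '<i>.txt' file in scatter_dir, sorted."""
    ids = []
    for name in os.listdir(scatter_dir):
        match = re.fullmatch(r"(.+)\.txt", name)
        if match:
            ids.append(match.group(1))
    return sorted(ids)


def aggregate_targets(scatter_dir):
    """Build the file names the collect rule has to wait for."""
    return ["scatter_copy_head/{}_head.txt".format(i)
            for i in list_scatter_ids(scatter_dir)]


# Simulate a finished checkpoint that happened to produce three files.
with tempfile.TemporaryDirectory() as tmp:
    for i in (1, 2, 3):
        with open(os.path.join(tmp, "{}.txt".format(i)), "w") as handle:
            handle.write(str(i))
    demo_targets = aggregate_targets(tmp)

print(demo_targets)
```

The important point is that the list can only be built after the scatter step has run, which is why Snakemake re-evaluates the input function once the checkpoint is complete.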

Dynamic

Dynamic output is an outdated approach, but it seems to be more stable and reliable at the moment. So, if you experience problems with checkpoint, in most cases you can write the same pipeline with dynamic().

This is the same pipeline as above but it utilizes dynamic() instead of checkpoint:

rule final_output:
    input:
        'scatter_copy_head_collect/all.txt'

# this was a checkpoint step above:
rule scatter:
    output:
        dynamic('scatter/{i}.txt')
    shell:
        '''
        N=$(( $RANDOM % 10))
        for j in $(seq 1 $N); do echo -n $j > scatter/$j.txt; done
        '''

# this rule is not different from checkpoint
rule scatter_copy:
    output:
        txt = 'scatter_copy/{i}_copy.txt',
    input:
        txt = 'scatter/{i}.txt',
    shell:
        '''
        cp -f {input.txt} {output.txt}
        echo -n "_copy" >> {output.txt}
        '''

# this rule is not different from checkpoint either:
rule scatter_copy_head:
    output:
        txt = 'scatter_copy_head/{i}_head.txt',
    input:
        txt = 'scatter_copy/{i}_copy.txt',
    shell:
        '''
        cp -f {input.txt} {output.txt}
        echo "_head" >> {output.txt}
        '''

# to collect all files, you need to tell Snakemake that input is dynamic:
rule scatter_copy_head_collect:
    output:
        combined = 'scatter_copy_head_collect/all.txt',
    input:
        indivfiles = dynamic('scatter_copy_head/{i}_head.txt')
    params:
        gathered = lambda wildcards, input: ' '.join(input.indivfiles)
    shell:
        '''
        cat {params.gathered} > {output.combined}
        '''

Final thoughts

Checkpoints are claimed to be more powerful than dynamic() by the Snakemake developers. I believe they are right, but my impression is that dynamic() is easier to use. Maybe I have not fully comprehended checkpoint yet.

Besides, as I mentioned above, I was not able to make it work with GATK. So, I will try to use checkpoint, but I may also fall back to dynamic().

Finally, I would like to acknowledge this Stack Overflow answer that inspired me to write this tutorial.

Call germline Copy Number Variants with GATK in Snakemake

2019-08-15T00:00:00+00:00

I needed to call copy number variants (CNVs) in my dog dataset. I had different tools on my radar, including Manta, LUMPY, CNVnator, and GenomeSTRiP. Among these tools, I liked Manta for its incredible speed. But it lacked a cohort calling mode, which I thought was preferable for my population-level data. Only GenomeSTRiP had a cohort calling mode. I have not run GenomeSTRiP myself, but I talked to a person who tried it and he told me it was not the easiest tool to set up and run. I also recalled that GATK had a beta version that could call CNVs. Checking the GATK website revealed that this functionality had already been released. So, I decided to proceed with the trusted GATK for calling germline copy number variants in my dataset.

The GATK documentation for this pipeline is in BETA at the time of writing this post, but it is enough to run the pipeline. I tested it and had no obvious problems. I am not going to describe each step of this pipeline in detail, as you can read about them in the GATK documentation. I will briefly list the steps and provide the Snakemake code to execute this pipeline.

Requirements

You will need GATK 4 in the GATK Conda environment and Snakemake 4.

GATK Python environment

I ran this pipeline with GATK 4.1.2.0. To call CNVs with GATK 4, you need to load a Python environment with the gcnvkernel module. I use a Conda installation for that:

conda env create -f /path/to/gatk/gatkcondaenv.yml
conda init bash # restart shell to take effect
conda activate gatk

Snakemake

I started writing this pipeline in Snakemake 5. I used the recently introduced checkpoints to handle the unknown number of output files (see the scattering step below). But I encountered a problem which I was not able to fix. So, I downgraded to Snakemake 4.3.1 and used the older dynamic() function for scattering. Everything worked fine.

Steps to call copy number variants with GATK

These steps are described here only for a quick reference. For a detailed description of each step and options, read the GATK guide.

Bin intervals

PreprocessIntervals takes a reference FASTA file as input and creates a binned interval list. If you want to process only a subset of the genome, specify it with the option -L:

gatk --java-options "-Xmx8G" PreprocessIntervals \
-R canFam3.fa \
--padding 0 \
--bin-length 1000 \
-L chr35:100000-2000000 \
-imr OVERLAPPING_ONLY \
-O interval_chr35.interval_list

Bin size should negatively correlate with coverage, e.g. higher coverage data can have smaller bins. The default bin length of 1000 is recommended for 30x data.

Count reads per bin

This step counts reads overlapping each interval. It takes the interval list from the previous step and a BAM file as input and outputs a read counts table. The output can be in a human-readable TSV format (option --format TSV) or HDF5 (default) which is faster to process by GATK.

gatk --java-options "-Xmx8G" CollectReadCounts \
-R canFam3.fa \
-imr OVERLAPPING_ONLY \
-L interval_chr35.interval_list \
-I sample1.bam \
-O sample1_chr35.hdf5

OVERLAPPING_ONLY prevents the merging of abutting intervals as recommended by the GATK team.

Annotate and Filter intervals (Optional)

This step helps to remove problematic regions in the cohort calling mode. However, the pipeline should work fine without any interval filtering.

You can annotate intervals with GC content, mappability, and segmental duplication information:

gatk --java-options "-Xmx8G" AnnotateIntervals \
-R canFam3.fa  \
-L interval_chr35.interval_list \
--mappability-track canFam3_mappability.bed.gz \
--segmental-duplication-track canFam3_segmental_duplication.bed.gz \
--interval-merging-rule OVERLAPPING_ONLY \
-O annotated_intervals_chr35.tsv

The information on mappability and segmental duplication needs to be provided.

The GATK team recommends generating mappability with Umap and Bismap. I also used GEM to generate mappability.

To obtain segmental duplication information, I tried to run SEDEF and ASGART on the canFam3 genome. Unfortunately, my attempts were unsuccessful: both programs crashed without a clear error message.

So, I annotated my data only with GC content and mappability.

Annotated intervals are then filtered based on tunable thresholds:

gatk --java-options "-Xmx8G" FilterIntervals \
-L interval_chr35.interval_list \
--annotated-intervals annotated_intervals_chr35.tsv \
-I sample1_chr35.hdf5 \
-I sample2_chr35.hdf5 \
--minimum-gc-content 0.1 \
--maximum-gc-content 0.9 \
--minimum-mappability 0.9 \
--maximum-mappability 1.0 \
--minimum-segmental-duplication-content 0.0 \
--maximum-segmental-duplication-content 0.5 \
--low-count-filter-count-threshold 5 \
--low-count-filter-percentage-of-samples 90.0 \
--extreme-count-filter-minimum-percentile 1.0 \
--extreme-count-filter-maximum-percentile 99.0 \
--extreme-count-filter-percentage-of-samples 90.0 \
--interval-merging-rule OVERLAPPING_ONLY \
-O gcfiltered_chr35.interval_list

Call contig ploidy

This step is needed to generate global baseline coverage and noise data for the subsequent steps:

gatk --java-options "-Xmx8G" DetermineGermlineContigPloidy \
-L interval_chr35.interval_list \
-I sample1_chr35.hdf5 \
-I sample2_chr35.hdf5 \
--contig-ploidy-priors ploidy_priors.tsv \
--output-prefix dogs \
--interval-merging-rule OVERLAPPING_ONLY \
-O ploidy-calls_chr35

You need to provide ploidy prior probabilities. Here is an example of priors I used:

CONTIG_NAME	PLOIDY_PRIOR_0	PLOIDY_PRIOR_1	PLOIDY_PRIOR_2	PLOIDY_PRIOR_3
chr35	0.01	0.01	0.97	0.01
chrX	0.01	0.49	0.49	0.01

If you have the information on the sex of your sample, it is advised to compare it with the ploidy call results.
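Since the priors file is written by hand, a quick sanity check can save a failed run. The sketch below parses the table shown above and verifies that the priors of each contig sum to 1. It assumes that each row is meant to be a probability distribution over ploidy states; the function name check_priors and the tolerance are my own choices, not part of GATK.

```python
import csv
import io

# The same table as above, tab-separated.
PRIORS_TSV = (
    "CONTIG_NAME\tPLOIDY_PRIOR_0\tPLOIDY_PRIOR_1\tPLOIDY_PRIOR_2\tPLOIDY_PRIOR_3\n"
    "chr35\t0.01\t0.01\t0.97\t0.01\n"
    "chrX\t0.01\t0.49\t0.49\t0.01\n"
)


def check_priors(text, tolerance=1e-6):
    """Return {contig: sum of priors}; raise ValueError if a row does not sum to 1."""
    sums = {}
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        contig = row.pop("CONTIG_NAME")
        total = sum(float(value) for value in row.values())
        if abs(total - 1.0) > tolerance:
            raise ValueError("priors for {} sum to {}".format(contig, total))
        sums[contig] = total
    return sums


prior_sums = check_priors(PRIORS_TSV)
print(sorted(prior_sums))
```

To check a real file, read it with open('ploidy_priors.tsv').read() and pass the text to the same function.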

Scatter intervals

GATK 4 utilizes a new approach for parallelization of processes that requires scattering your data. This step does exactly that. It splits the interval list into shards which can be processed in parallel. The results of these scattered processes are collected at the later step.

To scatter the interval list into shards of ~15K intervals each, run:

mkdir -p scatter_chr35
gatk --java-options "-Xmx8G" IntervalListTools \
--INPUT interval_chr35.interval_list \
--SUBDIVISION_MODE INTERVAL_COUNT \
--SCATTER_CONTENT 15000 \
--OUTPUT scatter_chr35

It is recommended to have at least ~10–50Mbp genomic coverage per scatter. So, scatters of ~15K with ~1K bins would have ~15Mb coverage.
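The arithmetic behind this choice is simple enough to write down. This small sketch multiplies the number of intervals per shard by the bin length and checks the result against the recommended range; the function name shard_coverage_bp is hypothetical.

```python
def shard_coverage_bp(intervals_per_shard, bin_length_bp):
    """Genomic span covered by one shard, in base pairs."""
    return intervals_per_shard * bin_length_bp


# Values used in this post: --SCATTER_CONTENT 15000 and --bin-length 1000.
coverage = shard_coverage_bp(15000, 1000)
in_recommended_range = 10e6 <= coverage <= 50e6

print("{:.0f} Mb per shard, within 10-50 Mb: {}".format(coverage / 1e6, in_recommended_range))
```

If you change the bin length in PreprocessIntervals, adjust SCATTER_CONTENT so that the product stays within the recommended range.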

Call copy number variants

This step detects both rare and common CNVs on a scattered shard:

gatk --java-options "-Xmx8G" GermlineCNVCaller  \
--run-mode COHORT \
-L scatter_chr35/fragment/scattered.interval_list \
-I sample1_chr35.hdf5 \
-I sample2_chr35.hdf5 \
--contig-ploidy-calls ploidy-calls_chr35/dogs-calls \
--annotated-intervals annotated_intervals_chr35.tsv \
--output-prefix fragment \
--interval-merging-rule OVERLAPPING_ONLY \
-O cohort-calls_chr35

You need to run this command on each fragment produced by IntervalListTools in the scattering step. This can be easily achieved with Snakemake, as you will see below.

To increase the sensitivity of calls, you need to fine-tune different parameters. For details, visit this GATK page.

Call copy number segments

This step collects the results from scattered shards and calls copy number state per sample for intervals and segments in the VCF format:

gatk --java-options "-Xmx8G" PostprocessGermlineCNVCalls \
--model-shard-path cohort-calls_chr35/frag_temp_0001_of_3-model \
--model-shard-path cohort-calls_chr35/frag_temp_0002_of_3-model \
--model-shard-path cohort-calls_chr35/frag_temp_0003_of_3-model \
--calls-shard-path cohort-calls_chr35/frag_temp_0001_of_3-calls \
--calls-shard-path cohort-calls_chr35/frag_temp_0002_of_3-calls \
--calls-shard-path cohort-calls_chr35/frag_temp_0003_of_3-calls \
--sequence-dictionary '/path/to/reference/canFam3.dict' \
--allosomal-contig chrX \
--contig-ploidy-calls ploidy-calls_chr35/dogs-calls \
--sample-index 0 \
--output-genotyped-intervals  chr35_sample1_intervals_cohort.vcf.gz \
--output-genotyped-segments  chr35_sample1_segments_cohort.vcf.gz

You need to provide a sample index with --sample-index. The first sample in your input list has index 0, the second one is 1, etc.
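Typing the shard paths and the sample index by hand gets tedious quickly. The sketch below assembles the same arguments in plain Python, assuming the shard naming shown in the command above (frag_temp_0001_of_3 and so on) and assuming that the sample list is in the same order as the -I arguments given to GermlineCNVCaller. The function names are hypothetical.

```python
def postprocess_shard_args(calls_dir, prefix, n_shards):
    """Return the repeated --model-shard-path / --calls-shard-path arguments."""
    args = []
    for suffix, flag in (("model", "--model-shard-path"),
                         ("calls", "--calls-shard-path")):
        for k in range(1, n_shards + 1):
            shard = "{}/{}_{:04d}_of_{}-{}".format(calls_dir, prefix, k, n_shards, suffix)
            args.extend([flag, shard])
    return args


def sample_index(samples, sample):
    """Zero-based position of a sample in the input list."""
    return samples.index(sample)


samples = ["sample1", "sample2"]
args = postprocess_shard_args("cohort-calls_chr35", "frag_temp", 3)
args += ["--sample-index", str(sample_index(samples, "sample2"))]

print(" ".join(args))
```

The Snakemake pipeline below does the same thing with str.join() inside params and a small sampleindex() helper.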

Here is an example of genotyped-segments in VCF:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  sample1
chr35   100000  CNV_chr35_100000_309999 N       <DEL>,<DUP>     .       .       END=309999      GT:CN:NP:QA:QS:QSE:QSS  0:2:208:94:3077:98:136
chr35   310000  CNV_chr35_310000_311999 N       <DEL>,<DUP>     .       .       END=311999      GT:CN:NP:QA:QS:QSE:QSS  1:1:2:159:284:50:98
chr35   312000  CNV_chr35_312000_1999999        N       <DEL>,<DUP>     .       .       END=1999999     GT:CN:NP:QA:QS:QSE:QSS  0:2:1603:50:3077:131:50
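To work with these records downstream, you mostly need the coordinates and the copy number. Here is a minimal sketch that pulls them out of one record using only the standard library. The function name parse_segment is hypothetical, the ALT value in the example line is my assumption of what the column contains, and for real files a proper VCF parser is a safer choice.

```python
def parse_segment(line):
    """Extract coordinates, genotype, and copy number from one segments VCF record."""
    fields = line.split()
    info = dict(item.split("=", 1) for item in fields[7].split(";") if "=" in item)
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
    return {
        "chrom": fields[0],
        "start": int(fields[1]),
        "end": int(info["END"]),
        "genotype": sample["GT"],
        "copy_number": int(sample["CN"]),
    }


# Second record from the example above (ALT column assumed).
record = (
    "chr35 310000 CNV_chr35_310000_311999 N <DEL>,<DUP> . . END=311999 "
    "GT:CN:NP:QA:QS:QSE:QSS 1:1:2:159:284:50:98"
)
segment = parse_segment(record)

print(segment)
```

In this record the copy number is 1 on a contig with baseline ploidy 2, which means a heterozygous deletion of about 2 kb.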

GATK CNV pipeline in Snakemake

All the commands above can be executed as a distributed pipeline with Snakemake. For example, processing two chromosomes and two samples would look like this:

You can adapt the code below to your needs. Just change the list of input file names and chromosome numbers.

SAMPLES, = glob_wildcards('/path/to/BAMs/{sample}_merged_markDupl_BQSR.bam')
CHRN = list(range(1, 39))
CHRN.append('X')
CHR = CHRN
REF = '/path/to/reference/canFam3.fa'
DICT = '/path/to/reference/canFam3.dict'
MAP = 'canFam3_mappability_150.merged.bed.gz'
SEGDUP = 'segmental_duplication.bed.gz'

rule all:
    input:
        expand('chr{j}_{sample}_intervals_cohort.vcf.gz', j=CHR, sample=SAMPLES),
        expand('chr{j}_{sample}_segments_cohort.vcf.gz', j=CHR, sample=SAMPLES)

rule make_intervals:
    input:
        REF
    params:
        'chr{j}'
    output:
        'interval_chr{j}.interval_list'
    shell:
        '''
        gatk --java-options "-Xmx8G" PreprocessIntervals \
        -R {input} \
        --padding 0 \
        -L {params} \
        -imr OVERLAPPING_ONLY \
        -O {output}
        '''

rule count_reads:
    input:
        ref = REF,
        bam = '{sample}_merged_markDupl_BQSR.bam',
        interval = 'interval_chr{j}.interval_list'
    output:
        '{sample}_chr{j}.hdf5'
    shell:
        '''
        gatk --java-options "-Xmx8G" CollectReadCounts \
        -R {input.ref} \
        -imr OVERLAPPING_ONLY \
        -L {input.interval} \
        -I {input.bam} \
        -O {output}
        '''

rule annotate:
    input:
        ref = REF,
        interval = 'interval_chr{j}.interval_list',
        mappability = MAP,
        segduplication = SEGDUP
    output:
        'annotated_intervals_chr{j}.tsv'
    shell:
        '''
        gatk --java-options "-Xmx8G" AnnotateIntervals \
        -R {input.ref} \
        -L {input.interval} \
        --mappability-track {input.mappability} \
        --segmental-duplication-track {input.segduplication} \
        --interval-merging-rule OVERLAPPING_ONLY \
        -O {output}
        '''

rule filter_intervals:
    input:
        interval = 'interval_chr{j}.interval_list',
        annotated = 'annotated_intervals_chr{j}.tsv',
        samples = expand('{sample}_{chromosome}.hdf5', sample=SAMPLES, chromosome='chr{j}'),
    output:
        'gcfiltered_chr{j}.interval_list'
    params:
        files = lambda wildcards, input: ' -I '.join(input.samples)
    shell:
        '''
        gatk --java-options "-Xmx8G" FilterIntervals \
        -L {input.interval} \
        --annotated-intervals {input.annotated} \
        -I {params.files} \
        --interval-merging-rule OVERLAPPING_ONLY \
        -O {output}
        '''

rule determine_ploidy:
    input:
        interval = 'gcfiltered_chr{j}.interval_list',
        samples = expand('{sample}_{chromosome}.hdf5', sample=SAMPLES, chromosome='chr{j}'),
        prior = 'ploidy_priors.tsv',
    params:
        prefix = 'dogs',
        files = lambda wildcards, input: ' -I '.join(input.samples)
    output:
        'ploidy-calls_chr{j}'
    shell:
        '''
        gatk --java-options "-Xmx8G" DetermineGermlineContigPloidy \
        -L {input.interval} \
        -I {params.files} \
        --contig-ploidy-priors {input.prior} \
        --output-prefix  {params.prefix} \
        --interval-merging-rule OVERLAPPING_ONLY \
        -O {output}
        '''

rule scattering:
    input:
        interval = 'gcfiltered_chr{j}.interval_list'
    output:
        dynamic('scatter_chr{j}/{fragment}/scattered.interval_list')
    params:
        'scatter_chr{j}'
    shell:
        '''
        mkdir -p {params} # needed because Snakemake fails creating this directory automatically
        gatk --java-options "-Xmx8G" IntervalListTools \
        --INPUT {input.interval} \
        --SUBDIVISION_MODE INTERVAL_COUNT \
        --SCATTER_CONTENT 15000 \
        --OUTPUT {params}
        '''

rule cnvcall:
    input:
        interval = 'scatter_chr{j}/{fragment}/scattered.interval_list',
        sample = expand("{sample}_{chromosome}.hdf5", sample=SAMPLES, chromosome='chr{j}'),
        annotated = 'annotated_intervals_chr{j}.tsv',
        ploidy = 'ploidy-calls_chr{j}'
    output:
        modelf = "cohort-calls_chr{j}/frag_{fragment}-model",
        callsf = "cohort-calls_chr{j}/frag_{fragment}-calls"
    params:
        outdir = 'cohort-calls_chr{j}',
        outpref = 'frag_{fragment}',
        files = lambda wildcards, input: " -I ".join(input.sample)
    shell:
        '''
        gatk --java-options "-Xmx8G" GermlineCNVCaller  \
        --run-mode COHORT \
        -L {input.interval} \
        -I {params.files} \
        --contig-ploidy-calls {input.ploidy}/dogs-calls \
        --annotated-intervals {input.annotated} \
        --output-prefix {params.outpref} \
        --interval-merging-rule OVERLAPPING_ONLY \
        -O {params.outdir}
        '''

def sampleindex(sample):
    index = SAMPLES.index(sample)
    return index

rule process_cnvcalls:
    input:
        model = dynamic("cohort-calls_chr{j}/frag_{fragment}-model"),
        calls = dynamic("cohort-calls_chr{j}/frag_{fragment}-calls"),
        dict  = DICT,
        ploidy = 'ploidy-calls_chr{j}'
    output:
        intervals = 'chr{j}_{sample}_intervals_cohort.vcf.gz',
        segments = 'chr{j}_{sample}_segments_cohort.vcf.gz'
    params:
        index = lambda wildcards: sampleindex(wildcards.sample),
        modelfiles = lambda wildcards, input: " --model-shard-path ".join(input.model),
        callsfiles = lambda wildcards, input: " --calls-shard-path ".join(input.calls)
    shell:
        '''
        gatk --java-options "-Xmx8G" PostprocessGermlineCNVCalls \
        --model-shard-path {params.modelfiles} \
        --calls-shard-path {params.callsfiles} \
        --sequence-dictionary {input.dict} \
        --allosomal-contig chrX \
        --contig-ploidy-calls {input.ploidy}/dogs-calls \
        --sample-index {params.index} \
        --output-genotyped-intervals  {output.intervals} \
        --output-genotyped-segments  {output.segments}
        '''

If you need to run Snakemake on a cluster, I explained how to do that previously.

Final thoughts

Although the documentation for copy number variant calling with GATK is in beta, it is sufficient to perform the CNV analysis. GATK is easy to install and it is reasonably fast. GATK now scatters the data during some steps to improve efficiency. This approach is especially worthwhile if you run GATK on a Spark cluster. This is where large-scale genomics is moving. However, if you do not have access to a full-scale Spark cluster, you can use GATK with this Snakemake pipeline on a cluster that has a job scheduler such as SLURM.

If you have any questions or suggestions, feel free to email me.

Estimate genome mappability with GEM library

2019-08-13T00:00:00+00:00

GEM mappability was the most popular program to estimate genome mappability a few years ago. However, a lot of things have changed since then. Not only do published tutorials no longer work, but even finding a GEM version with the mappability option is not that easy.

The link in the original paper doesn’t work anymore. Moreover, if you google GEM mappability, you will find out that mappability was removed from GEM. I faced these and some other issues when I tried to get a mappability track for my data with GEM. Therefore, I would like to share scripts and commands I used to get GEM mappability in 2019.

Download GEM library

As I mentioned before, the mappability option has been removed from GEM. This removal was intended to be temporary in 2018. But mappability is still not there in mid-2019. So, downloading GEM 3 from its GitHub page won’t help you. Luckily, previous versions are still available at Sourceforge.net. I downloaded GEM-binaries-Linux-x86_64-core_i3-20130406-045632.tbz2.

Extract the downloaded archive and make all files in the bin folder executable. Your GEM library is ready!

Estimate GEM mappability

To get mappability in GEM format run these commands:

gem-indexer -T 10 -i canFam3.fa -o canFam3_gem_index
gem-mappability -T 10 -I canFam3_gem_index.gem -l 150 -o canFam3_mappability_150

I used a 150bp k-mer size because my data was generated with 150bp read length. Also, I ran it on 10 cores (-T 10). You can change these options to fit your needs.

Convert GEM mappability to BED

GEM mappability file may not be suitable input for many programs. For example, GATK takes mappability data in a BED file. BED files are also easy to convert to many other formats.

I found this Github repository that shows how to convert GEM mappability to BED format:

gem-2-wig -I canFam3_gem_index.gem -i canFam3_mappability_150.mappability -o canFam3_mappability_150
wigToBigWig canFam3_mappability_150.wig canFam3_mappability_150.sizes canFam3_mappability_150.bw
bigWigToBedGraph canFam3_mappability_150.bw  canFam3_mappability_150.bedGraph
bedGraphTobed canFam3_mappability_150.bedGraph canFam3_mappability_150.bed 0.3

In these commands: gem-2-wig is part of the GEM library. wigToBigWig and bigWigToBedGraph can be downloaded from here. bedGraphTobed is available in the same Github repository.

Merge overlapping intervals in BED

Some programs, including GATK, require overlapping mappability intervals to be merged. You can achieve that with my Python script:

python ~/git/genotype-files-manipulations/combine_overlapping_BEDintervals.py -i canFam3_mappability_150.bed -o canFam3_mappability_150.merged.bed -v 0

where -v defines the overhang size between intervals.
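If you want to see the idea without running the script, here is a minimal sketch of merging overlapping intervals in Python. It is not the script above, only an illustration of the technique: intervals are sorted, and each one is merged into the previous interval if it starts no more than overhang bases after the previous end. Note that with overhang=0 this sketch also merges abutting intervals, which is my assumption and may differ from what -v 0 does in the script.

```python
def merge_intervals(intervals, overhang=0):
    """Merge (chrom, start, end) tuples that overlap or lie within `overhang` bases."""
    merged = []
    for chrom, start, end in sorted(intervals):
        previous = merged[-1] if merged else None
        if previous and previous[0] == chrom and start <= previous[2] + overhang:
            # Extend the previous interval instead of starting a new one.
            previous[2] = max(previous[2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(interval) for interval in merged]


bed = [("chr1", 0, 10), ("chr1", 5, 20), ("chr1", 30, 40), ("chr2", 0, 5)]
merged_bed = merge_intervals(bed)

print(merged_bed)
```

Sorting first keeps the merge to a single pass over the intervals, which matters for genome-wide mappability tracks with millions of lines.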

GATK Index

Since I have mentioned GATK many times in this post, I will also add the two commands to compress and index mappability data for GATK:

bgzip canFam3_mappability_150.merged.bed
gatk IndexFeatureFile -F canFam3_mappability_150.merged.bed.gz

Conclusion

I believe GEM estimation of genome mappability is still valid in 2019. Finding the correct version of GEM and a few other scripts was not straightforward, but otherwise this approach is fast and simple. Luckily, you do not need to do all the work I have done :-)

If you want to use some of the latest approaches for mappability estimation, try Umap and Bismap. Also, keep checking the latest version of GEM, maybe it already has the mappability option at the time you are reading this post.

If you have any questions or suggestions, feel free to email me.

Creating a duty schedule in R

2019-08-06T00:00:00+00:00

As a person who possesses some programming skills, I try to automate everything I can. Recently, I became responsible for creating a kitchen duty schedule at work. So, I wrote an R script that takes a list of people as input and outputs a PDF with the schedule and I would like to share it with you.

Schedule requirements

The duty assumes one person cleans the kitchen for a week and another person makes fika on that week. It is also essential to take into account that the same person should not be responsible for both kitchen and fika during the same year. The frequency of being in the schedule list should also be fairly distributed among people.

Generate a schedule table

First, you need to load the list of people, and extract the names:

d <- read.table("people-list.csv", sep = "\t", header = T, stringsAsFactors = F)
names <- d$Name

Then, randomly pick a few people (in my case it was 9) who will be assigned to the kitchen duty:

Kitchen <- sample(names, 9)

Do the same to assign the fika duty but make sure people from the kitchen duty list are excluded:

Fika <- sample(names[!(names %in% Kitchen)], 9)

After the people lists are created, generate the start and end dates as well as week numbers for these lists:

start <- seq(as.Date("19/08/19", format = "%d/%m/%y"), by = "week", length.out = 9)
end <- seq(as.Date("23/08/19", format = "%d/%m/%y"), by = "week", length.out = 9)
week <- strftime(start, format = "%V")

In the end, merge these lists into a table:

dd <- data.frame(Kitchen, Fika, start, end, week)
write.table(dd, "kitchen-schedule_week34-42.csv", sep = "\t", row.names = F)

Everything seems to be done. One could just load this table into a spreadsheet editor, format it nicely, and print it. But why waste time on this manual work if you can automate this step too?

Plot a table in R

Instead of manually formatting the obtained table in a spreadsheet editor, you can add a few more lines of R code and get a print-ready table:

library(gridExtra)
library(grid)

pdf("kitchen-schedule_week34-42.pdf", width=11.69, height=8.27)
g <- tableGrob(dd, rows = NULL, theme = ttheme_default(base_size = 16,
               padding = unit(c(20, 12), "mm")))
grid.newpage()
grid.draw(g)
dev.off()

In the end, you will obtain a PDF page of A4 size with this kind of table:

Generating new schedule tables

Next time you generate a schedule table, you just need to exclude the people who were assigned some duties before:

d <- read.table("people-list.csv", sep = "\t", header = T, stringsAsFactors = F)
toexclude <- read.table('kitchen-schedule_week34-42.csv',
                        header = T, sep = "\t", stringsAsFactors = F)

names <- d$Name[!(d$Name %in% c(toexclude$Kitchen, toexclude$Fika))]

The rest of the code is the same as above. If you have several duty lists with the names you need to exclude, just merge them before applying the exclusion.

Full Code

All the code put together:

library(gridExtra)
library(grid)

d <- read.table("people-list.csv", sep = "\t", header = T, stringsAsFactors = F)

if(file.exists('previous_kitchen-schedule.csv')){
  toexclude <- read.table('previous_kitchen-schedule.csv',
                          header = T, sep = "\t", stringsAsFactors = F)
  names <- d$Name[!(d$Name %in% c(toexclude$Kitchen, toexclude$Fika))]
}else{
  names <- d$Name
}

Kitchen <- sample(names, 9)
Fika <- sample(names[!(names %in% Kitchen)], 9)

start <- seq(as.Date("21/10/19", format = "%d/%m/%y"), by = "week", length.out = 9)
end <- seq(as.Date("25/10/19", format = "%d/%m/%y"), by = "week", length.out = 9)
week <- strftime(start, format = "%V")

dd <- data.frame(Kitchen, Fika, start, end, week)
write.table(dd, "kitchen-schedule_week43-51.csv", sep = "\t", row.names = F)
sample(d$Name, 1)

pdf("kitchen-schedule_week43-51.pdf", width=11.69, height=8.27)
g <- tableGrob(dd, rows = NULL, theme = ttheme_default(base_size = 16,
               padding = unit(c(20, 12), "mm")))
grid.newpage()
grid.draw(g)
dev.off()

Final thoughts

If you are ever asked to volunteer for creating duty schedules, do not hesitate to agree. It will cost you so little. Just modify this script for your needs and generate a duty schedule in R with one click.

If you have any questions or suggestions, feel free to email me.

How to change Docker storage location

2019-06-23T11:43:00+00:00

It has happened to me several times that I didn’t have enough space in my root partition to store Docker containers and I had to move the Docker default storage location to another partition. In this post, I wrote down how to do that for my readers and my future self :)

Docker containers are relatively large (> 1G) and by default Docker stores all containers in /var/lib/docker, which is located in the root partition of your Linux system. I usually have separate root and home partitions, and given that Linux doesn’t take much space, I allocate 15-30G for my root partition. This turned out not to be enough to work with Docker and I had to move the Docker storage location to another, larger partition. However, it turned out not to be easy.

Do NOT do this to move Docker storage location

These two solutions could have worked in the past as you may often find them online, but neither of them worked for me with Ubuntu-based Linux distros in 2018-2019 (Docker version > 17).

1. Symlink

The first obvious idea was to symlink the default storage location to another partition:

sudo ln -s /mnt/newlocation /var/lib/docker

2. DOCKER_OPTS

Another often posted solution is to stop Docker:

sudo systemctl stop docker

Edit the /etc/default/docker file by adding the new location with the -g in the DOCKER_OPTS line:

DOCKER_OPTS="--dns 8.8.8.8 --dns 8.8.4.4 -g /mnt/newlocation"

Then start Docker again:

sudo systemctl start docker

After that Docker should use /mnt/newlocation as a new storage location.

UPDATE: It seems DOCKER_OPTS solution may work if you add the $DOCKER_OPTS variable to the systemd script /lib/systemd/system/docker.service:

ExecStart=/usr/bin/dockerd -H fd:// $DOCKER_OPTS

However, I have not tested it because the solution I describe below is simpler and probably more correct.

Change Docker storage location: THE RIGHT WAY!

Luckily, the right way to change Docker storage location was not more complicated than the two non-working options I have described above.

You need to create a JSON file /etc/docker/daemon.json with the content pointing to the new storage location:

{
"data-root": "/mnt/newlocation"
}

You can read more about daemon.json in Docker docs.
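Before restarting Docker, it can help to confirm that the file is valid JSON. The sketch below does that with Python's standard json module and also shows that compact and indented versions of the same content parse identically. The function name check_daemon_json is hypothetical, and the strings stand in for the content of /etc/docker/daemon.json.

```python
import json


def check_daemon_json(text):
    """Return (data_root, error); exactly one of the two is None for valid input."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as error:
        return None, "line {}: {}".format(error.lineno, error.msg)
    if not isinstance(config, dict):
        return None, "top-level value must be a JSON object"
    return config.get("data-root"), None


indented = '{\n    "data-root": "/mnt/newlocation"\n}'
compact = '{"data-root":"/mnt/newlocation"}'
broken = '{"data-root": /mnt/newlocation}'  # value is missing its quotes

print(check_daemon_json(indented))
print(check_daemon_json(compact))
print(check_daemon_json(broken))
```

To check the real file, pass open('/etc/docker/daemon.json').read() to the same function.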

Then, restart Docker or reboot the system:

sudo systemctl restart docker

If you get any error during the restart, check the syntax of daemon.json. JSON is not sensitive to indentation, but a missing quote, comma, or brace will cause an error. If Docker restarts fine, this new setting will make Docker place all new containers in the new location. However, old containers will stay in /var/lib/docker. I recommend removing all old containers:

docker system prune -a

And download them again:

docker pull <image-name>

Final thoughts

It is unfortunate that this simple solution with daemon.json was not the first one I found when I tried to fix the “not enough space” issue caused by Docker images taking too much space. So, I hope this blog post will save some time for users who need to change their Docker storage location.

If you have any questions or suggestions, feel free to email me.

RNA-Seq STAR mapping with Snakemake

2019-04-18T16:47:00+00:00

I have described my pipelines for genotype calling in both non-model and model organisms. I also showed how one can automate a genotype calling pipeline with automatically generated sbatch scripts that handle dependencies between jobs for the Slurm Workload Manager. I used a python script for that but I mentioned that probably it was not the most efficient way and using Nextflow or Snakemake would probably be a better option. I finally got my hands on Snakemake when I was working on my RNA-Seq mapping pipeline. You can read the description of this pipeline below and you can also get my Snakemake file at the end of this post to run this pipeline with your data.

RNA-Seq STAR mapping pipeline

There are many different mapping software for RNA-Seq data. The choice is always difficult. For example, I used stampy for RNA-seq mapping in my Capsella project. The reason behind this choice was that we performed an allele-specific expression analysis with the DNA count data as a null distribution. Therefore, to keep the consistency between the two datasets, I used the same aligner. In addition, stampy is not a bad aligner for RNA-Seq data and my favorite aligner for divergent reads in the genotyping pipeline.

However, for my current dog projects, I chose the STAR aligner. It is a splice-aware aligner and, what is particularly important for large projects, it is one of the fastest aligners. I also use STAR in the multi-sample 2-pass mapping mode, which maps spliced reads better (see the STAR documentation).

The whole pipeline consists of STAR 2-pass alignment and read counting with HTSeq:

1. Index the reference genome.
2. Map reads to the reference genome (2-pass mode):

2.1. Standard STAR mapping.

2.2. Collect the junctions information from all samples.

2.3. Use new junctions from all samples for the 2nd pass mapping.

3. Count the number of reads mapped to each gene.

All these STAR mapping steps can be automated with Snakemake as you will see below.

1. Index the reference genome

STAR needs its own index files for mapping. These index files are quite large. For example, for the dog reference genome, all STAR index files weigh 23 GB, while the actual FASTA file is only 2.3 GB. But I believe that it is these large index files that allow STAR to perform alignment so fast.

So, to index the reference, you need to execute this code:

mkdir canFam3STAR

STAR --runThreadN 20 \
--runMode genomeGenerate \
--genomeDir canFam3STAR \
--genomeFastaFiles canFam3.fa \
--sjdbGTFfile canFam3.gtf \
--sjdbOverhang 100

I think these options are self-explanatory. --runThreadN indicates the number of cores to be used. --sjdbOverhang can be specified as ReadLength-1. You can also use 100, which is recommended as a generally good value in the STAR documentation. canFam3 is the reference name for both the FASTA and GTF files. You need to change this name to your reference name in all commands below.
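
If you are not sure about your read length, you can estimate it from the FASTQ file. Below is a minimal Python sketch of the ReadLength-1 rule. The function names and the idea of checking only the first 10000 reads are my assumptions for illustration, they are not part of STAR:

```python
import gzip
from itertools import islice


def max_read_length(fastq_lines, n_reads=10000):
    """Longest sequence among the first n_reads FASTQ records (4 lines each)."""
    longest = 0
    for i, line in enumerate(islice(fastq_lines, 4 * n_reads)):
        if i % 4 == 1:  # the sequence is the 2nd line of every record
            longest = max(longest, len(line.strip()))
    return longest


def sjdb_overhang_from_fastq(fastq_gz, n_reads=10000):
    """Suggest a value for --sjdbOverhang as ReadLength-1 from a gzipped FASTQ file."""
    with gzip.open(fastq_gz, "rt") as handle:
        return max_read_length(handle, n_reads) - 1
```

For example, sjdb_overhang_from_fastq('/path/to/Sample1_001_R1.fastq.gz') would suggest 100 for 101 bp reads.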

If you have only GFF annotation, you can convert GFF to GTF with Cufflinks:

gffread canFam3.1.92.gff3 -T -o canFam3.gtf

2. Run the mapping

You can run the standard 1-pass STAR mapping and the results should be good overall. However, given that STAR is very fast, running the 2-pass mode does not take too long and it can improve the mapping to novel junctions. Basically, you run the first pass of STAR mapping to discover junction information, then you collect and filter that information from all samples and run the second pass using that information.

2.1. Pass1 STAR mapping

The first pass of STAR mapping is a standard run that outputs an alignment and splice junction information.

mkdir Sample1_pass1
cd Sample1_pass1

STAR --runThreadN 20 \
--genomeDir /path/to/canFam3STAR \
--readFilesIn /path/to/Sample1_001_R1.fastq.gz,/path/to/Sample1_002_R1.fastq.gz /path/to/Sample1_001_R2.fastq.gz,/path/to/Sample1_002_R2.fastq.gz \
--readFilesCommand zcat  \
--outSAMtype BAM Unsorted

Again, most of the options are self-explanatory. --readFilesCommand zcat is needed to decompress gz compressed reads. --outSAMtype BAM Unsorted will output an unsorted BAM instead of the default SAM. This saves disk space. If you have your sample sequences in several lanes, you can list these files with comma separation in --readFilesIn as I did above.
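
The format of --readFilesIn is easy to get wrong when there are many lanes: files of the same mate are joined with commas without spaces, the two mates are separated by a space, and the lanes must be listed in the same order for both mates. Here is a small Python sketch that builds this string. The helper name read_files_in is hypothetical, it is only for illustration:

```python
def read_files_in(r1_files, r2_files):
    """Build the value of --readFilesIn for paired-end reads split across lanes."""
    if len(r1_files) != len(r2_files):
        raise ValueError("R1 and R2 lists must have the same number of files")
    # Lanes are comma-separated (no spaces); the two mates are separated by one space.
    return ",".join(r1_files) + " " + ",".join(r2_files)
```

Calling it with the two R1 files and the two R2 files of Sample1 gives the same string I typed manually in the command above.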

This command will produce several output files, among which we are mostly interested in the splice junction information file SJ.out.tab that will be used in the next step. So, I go back to the parent directory and discard the alignment BAM file because it takes too much disk space.

cd ..
rm Sample1_pass1/Aligned.out.bam

2.2. Filter and collect the splicing information

To filter poorly supported junctions, I keep only the junctions that are supported by at least 3 uniquely mapped reads:

mkdir pass1SJ
for i in Sample*pass1/SJ.out.tab
    do
        # name each output after its sample directory so that the files do not overwrite each other
        awk '{ if ($7 >= 3) print $0}' "$i" > pass1SJ/"$(dirname "$i")".SJ.filtered.tab
    done

I think it is really difficult to verify splicing information. So, this filtering is rather subjective and can be skipped. I use it simply because of my gut feeling 🙂.
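
If you prefer Python to awk, the same filter can be sketched as below. Column 7 of SJ.out.tab is the number of uniquely mapped reads crossing the junction. The function name and the min_unique_reads parameter are my own, so treat this as an illustration rather than a part of the pipeline:

```python
def filter_junctions(lines, min_unique_reads=3):
    """Yield SJ.out.tab lines whose 7th column is at least min_unique_reads."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # fields[6] is column 7: the number of uniquely mapped reads crossing the junction
        if len(fields) >= 7 and int(fields[6]) >= min_unique_reads:
            yield line
```

You would pass it an open SJ.out.tab file and write the yielded lines to the filtered file. Changing min_unique_reads makes it easy to test how sensitive the results are to this rather subjective threshold.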

2.3. Pass2 STAR mapping

Now, we just execute almost the same mapping command as at step 2.1 but add the information on the discovered splicing (--sjdbFileChrStartEnd). I also prefer to add read group information (--outSAMattrRGline) at this step. It is not necessary for read counting but it may be useful in the future if I decide to use these STAR generated BAM files for other analyses.

mkdir Sample1_pass2
cd Sample1_pass2

STAR --runThreadN 20 \
--genomeDir /path/to/canFam3STAR \
--readFilesIn /path/to/Sample1_001_R1.fastq.gz,/path/to/Sample1_002_R1.fastq.gz /path/to/Sample1_001_R2.fastq.gz,/path/to/Sample1_002_R2.fastq.gz \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--outSAMattrRGline ID:Dog_MT2 \
--sjdbFileChrStartEnd /path/to/pass1SJ/Sample1_pass1.SJ.filtered.tab ... /path/to/pass1SJ/SampleN_pass1.SJ.filtered.tab

3. Count the number of reads per gene

You can count the number of reads per gene on the fly during the STAR mapping if you provide the option --quantMode GeneCounts. However, I prefer to count reads with htseq-count and use the option -m union to deal with overlapping features. You can see what the option -m union means in the image below.

Different ways to count non-uniquely mapped reads with htseq-count (source).
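
The logic of the union mode can also be sketched in a few lines of Python. This is only my illustration of the rule described in the HTSeq documentation, not the HTSeq code. The input is a hypothetical list with one set of gene IDs per aligned base of the read:

```python
def union_mode(features_per_position):
    """Assign a read in the spirit of htseq-count -m union.

    features_per_position: one set of gene IDs per aligned base of the read.
    """
    # Take the union of all genes that any base of the read overlaps.
    genes = set().union(*features_per_position)
    if not genes:
        return "__no_feature"
    if len(genes) > 1:
        return "__ambiguous"
    return next(iter(genes))
```

So, a read that only partly overlaps one gene is still counted for that gene, while a read that touches two genes is counted as ambiguous.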

And here is the command I use:

# -r pos is needed because the BAM file from pass 2 is sorted by coordinate
htseq-count -m union -s no -t gene -i ID -r pos -f bam input.bam canFam3.gff &> output.log
grep gene output.log | sed 's/gene://g' > counts.csv

The second line extracts only the lines with counts per gene and cleans them by removing the string gene:.

The resulting file looks like this:

ENSCAFG00000000001      209
ENSCAFG00000000002      1
ENSCAFG00000000003      93
ENSCAFG00000000004      531
ENSCAFG00000000005      432

Finally, you can merge all files into one table:

for i in *csv; do sed -i "1igene\t$i" $i ; done # add column names
N=$(($(ls -l *.csv | wc -l)*2)) # count number of files
paste *csv | cut -f 1,$(seq -s, 2 2 $N) > all_HTSeq.csv # merge and keep only one column with gene names

This table will have the following format:

gene                sample1  sample2  sample3
ENSCAFG00000000001    209      235      167
ENSCAFG00000000002      0        4        7
ENSCAFG00000000003     57       10       38
ENSCAFG00000000004   1243     1298      156
ENSCAFG00000000005     23       67       49
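
The paste command above relies on all files having the genes in the same order. As an alternative, here is a minimal Python sketch that merges the tables by gene ID. The function names are hypothetical, and writing 0 for a gene that is missing in one sample is my assumption:

```python
def parse_counts(lines):
    """Parse 'gene<TAB>count' lines such as those in counts.csv."""
    counts = {}
    for line in lines:
        gene, count = line.split()
        counts[gene] = int(count)
    return counts


def merge_counts(tables):
    """Merge per-sample counts into rows of [gene, count_sample1, count_sample2, ...].

    tables: dict mapping sample name -> dict mapping gene ID -> count.
    """
    samples = sorted(tables)
    genes = sorted(set().union(*(t.keys() for t in tables.values())))
    rows = [["gene"] + samples]
    for gene in genes:
        # A gene absent from one sample's file is written as 0 here.
        rows.append([gene] + [str(tables[s].get(gene, 0)) for s in samples])
    return rows
```

Each row can then be joined with tabs and written to all_HTSeq.csv.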

Snakemake STAR pipeline

All the commands above (except the last one that can be run locally) can be put together into a Snakemake file:

SAMPLES, = glob_wildcards('/path/to/fastq/{sample}_L001_R1.fastq.gz') 

rule allout:
        input:
            'canFam3STAR',
            expand('{sample}_pass1/SJ.out.tab', sample=SAMPLES),
            'SJ',
            expand('SJ/{sample}_pass1SJ.filtered.tab', sample=SAMPLES),
            expand('{sample}_pass2/Aligned.sortedByCoord.out.bam', sample=SAMPLES),
            expand('{sample}_HTSeq_union_gff3_no_gene_ID.log', sample=SAMPLES),
            expand('{sample}_HTSeq.csv', sample=SAMPLES)
            
rule index:
        input:
            fa = 'canFam3.fa', # provide your reference FASTA file
            gtf = 'canFam3.gtf' # provide your GTF file
        output:
            directory('canFam3STAR') # you can rename the index folder
        threads: 20 # set the maximum number of available cores
        shell:
            'mkdir {output} && '
            'STAR --runThreadN {threads} '
            '--runMode genomeGenerate '
            '--genomeDir {output} '
            '--genomeFastaFiles {input.fa} '
            '--sjdbGTFfile {input.gtf} '
            '--sjdbOverhang 100'

rule pass1:
        input:
            R1L1 = 'fastq/{sample}/{sample}_L001_R1.fastq.gz', # may need adjustment if your fastq file name format is different
            R1L2 = 'fastq/{sample}/{sample}_L002_R1.fastq.gz', # note each sample has 4 fastq files: 2 lanes x 2 mates
            R2L1 = 'fastq/{sample}/{sample}_L001_R2.fastq.gz',
            R2L2 = 'fastq/{sample}/{sample}_L002_R2.fastq.gz',
            refdir = 'canFam3STAR' # directory() is only needed in the output of the index rule
        params:
            outdir = '{sample}_pass1',
            rmbam = '{sample}_pass1/Aligned.out.bam'
        output:
            '{sample}_pass1/SJ.out.tab'
        threads: 20 # set the maximum number of available cores
        shell:
            'rm -rf {params.outdir} &&' # be careful with this. I don't know why, but Snakemake had problems without this cleaning.
            'mkdir {params.outdir} && ' # snakemake had problems finding output files with --outFileNamePrefix, so I used this approach instead
            'cd {params.outdir} && '
            'STAR --runThreadN {threads} '
            '--genomeDir {input.refdir} '
            '--readFilesIn {input.R1L1},{input.R1L2} {input.R2L1},{input.R2L2} '
            '--readFilesCommand zcat '
            '--outSAMtype BAM Unsorted && cd .. && rm {params.rmbam}' # leave the output folder first, rmbam is relative to the working directory
            
rule SJdir:
        output:
            directory('SJ')
        threads: 1
        shell:
            'mkdir {output}'

rule filter:
        input:
            '{sample}_pass1/SJ.out.tab',
            'SJ'
        output:
            'SJ/{sample}_pass1SJ.filtered.tab'
        threads: 1
        shell:
            '''awk "{ { if (\$7 >= 3) print \$0 } }" {input[0]} > {input[0]}.filtered && '''
            'mv {input[0]}.filtered {output}'

rule pass2:
        input:
            R1L1 = 'fastq/{sample}/{sample}_L001_R1.fastq.gz',
            R1L2 = 'fastq/{sample}/{sample}_L002_R1.fastq.gz',
            R2L1 = 'fastq/{sample}/{sample}_L001_R2.fastq.gz',
            R2L2 = 'fastq/{sample}/{sample}_L002_R2.fastq.gz',
            SJfiles = expand('SJ/{sample}_pass1SJ.filtered.tab', sample=SAMPLES), # junctions from all samples
            refdir = 'canFam3STAR'
        params:
            outdir = '{sample}_pass2',
            id = '{sample}'
        output:
            '{sample}_pass2/Aligned.sortedByCoord.out.bam'
        threads: 20 # set the maximum number of available cores
        shell:
            'rm -rf {params.outdir} &&' # be careful with this. I don't know why, but Snakemake had problems without this cleaning.
            'mkdir {params.outdir} && '
            'cd {params.outdir} && '
            'STAR --runThreadN {threads} '
            '--genomeDir {input.refdir} '
            '--readFilesIn {input.R1L1},{input.R1L2} {input.R2L1},{input.R2L2} '
            '--readFilesCommand zcat '
            '--outSAMtype BAM SortedByCoordinate '
            '--sjdbFileChrStartEnd {input.SJfiles} '
            '--outSAMattrRGline ID:{params.id} '
            '--quantMode GeneCounts '

rule htseq:
        input:
            bam = '{sample}_pass2/Aligned.sortedByCoord.out.bam',
            gff = 'canFam3.gff3'
        output:
            '{sample}_HTSeq_union_gff3_no_gene_ID.log',
            '{sample}_HTSeq.csv'
        threads: 1
        shell:
            'htseq-count -m union -s no -t gene -i ID -r pos -f bam {input.bam} {input.gff} &> {output[0]} && '
            'grep ENS {output[0]} | sed "s/gene://g" > {output[1]}'

Read the comments within the code to find the lines you need to change to adjust this Snakemake pipeline for your data.

Also, depending on your file location and Snakemake version, Snakemake may have problems finding files without the absolute path in file names. In particular, the pass1 and pass2 rules change into the output directory before running STAR, so relative input paths no longer resolve there. For example, instead of the relative path fastq/{sample}_L001_R1.fastq.gz you may need to use the absolute path /home/username/RNA-Seq/fastq/{sample}_L001_R1.fastq.gz
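
If the first line of the Snakefile looks cryptic, here is roughly what it does: it matches file names against the pattern and collects whatever {sample} stands for. Below is a stand-alone Python sketch of this idea with a regular expression. It is not how Snakemake implements glob_wildcards, and the sample names Dog1 and Dog2 in the usage note are hypothetical:

```python
import re


def sample_names(paths, suffix="_L001_R1.fastq.gz"):
    """Return sorted unique sample names from FASTQ paths ending with suffix."""
    # The sample name is the part of the file name (no slashes) before the suffix.
    pattern = re.compile(r"(?P<sample>[^/]+)" + re.escape(suffix) + r"$")
    found = set()
    for path in paths:
        match = pattern.search(path)
        if match:
            found.add(match.group("sample"))
    return sorted(found)
```

For the files fastq/Dog1/Dog1_L001_R1.fastq.gz and fastq/Dog2/Dog2_L001_R1.fastq.gz this returns Dog1 and Dog2. Files from the second lane or the second mate are ignored, so each sample is listed only once.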

Run Snakemake on a Slurm cluster (Uppmax)

I executed this Snakemake file on our Slurm cluster (Uppmax). To do that I created a Snakemake cluster config file cluster.yaml:

__default__:
  account: snic2019-x-xxx
  time: "00:01:00"
  n: 1
  partition: "core"
 
index:
  time: "05:00:00"
  n: 20
 
pass1:
  time: "01:00:00"
  n: 20
 
pass2:
  time: "02:00:00"
  n: 20

htseq:
  time: "05:00:00"

This config file is used during Snakemake job submission with --cluster-config cluster.yaml.

I first run this pipeline in dry-run mode with the --dryrun option:

snakemake -s Snakefile -j 100 --dryrun --cluster-config cluster.yaml --cluster "sbatch -A {cluster.account} -t {cluster.time} -p {cluster.partition} -n {cluster.n}"
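
To see which sbatch options each rule gets, remember that the settings of a rule are laid over the __default__ section of cluster.yaml. The Python sketch below mimics this with plain dictionaries. It is my illustration of the behaviour, not Snakemake code, and the function name submit_command is hypothetical:

```python
from types import SimpleNamespace

# The content of cluster.yaml written as a Python dictionary.
CLUSTER_CONFIG = {
    "__default__": {"account": "snic2019-x-xxx", "time": "00:01:00", "n": 1, "partition": "core"},
    "index": {"time": "05:00:00", "n": 20},
    "pass1": {"time": "01:00:00", "n": 20},
    "pass2": {"time": "02:00:00", "n": 20},
    "htseq": {"time": "05:00:00"},
}

TEMPLATE = "sbatch -A {cluster.account} -t {cluster.time} -p {cluster.partition} -n {cluster.n}"


def submit_command(rule, config=CLUSTER_CONFIG, template=TEMPLATE):
    """Overlay the settings of a rule on __default__ and fill in the sbatch template."""
    settings = dict(config["__default__"])
    settings.update(config.get(rule, {}))
    return template.format(cluster=SimpleNamespace(**settings))
```

For example, submit_command('pass1') gives sbatch -A snic2019-x-xxx -t 01:00:00 -p core -n 20, while rules without their own section, such as filter, fall back to one core and one minute.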

If everything works fine in dry-run mode, you can run this command in regular mode from a login node of the server. However, I prefer to create an sbatch file (see below) and submit this command as a job, which in turn will submit all other jobs as defined in the Snakemake file.

#!/bin/bash -l
#SBATCH -A snic2019-x-xxx
#SBATCH -p core
#SBATCH -n 1
#SBATCH -t 1-00:00:00
#SBATCH -J sbatchSnakefile
#SBATCH -e sbatchSnakefile.err
#SBATCH -o sbatchSnakefile.out

snakemake -s Snakefile -j 100 --cluster-config cluster.yaml --cluster "sbatch -A {cluster.account} -t {cluster.time} -p {cluster.partition} -n {cluster.n}"

Conclusion

Snakemake is a great tool and I am very happy that I have finally started using it. A combination of STAR speed and Snakemake workflow efficiency makes the RNA-Seq mapping pipeline truly fast, robust, and error-safe. This pipeline has already saved me some time with my pilot RNA-Seq experiment and it will save even more time when my new RNA-Seq data arrive.

I hope I will also update my genotype calling pipeline with Snakemake workflow soon.

If you have any questions or suggestions, feel free to email me.