<p>Dmytro Kryvokhyzha: Bioinformatics and Genomics Scientist at Uppsala University. Works with large-scale genomic and transcriptomic data, and has a passion for Data Science.</p>
<h1>R loops are slow: How to deal with that</h1>
<p>Published 2019-10-01, https://evodify.com/r-loops-are-slow</p>
<p>When you start learning R, no one tells you that R loops are slow. At least, I was not taught this, and I have not seen it stated explicitly in R textbooks. Later on, you begin using R beyond trivial tasks and discover that R loops often become the bottleneck of your scripts. You wonder why that is and how to deal with it. This is exactly what happened to me.</p>
<p>Indeed, R <code class="language-plaintext highlighter-rouge">for</code> loops are inefficient, especially if you use them incorrectly. Searching for <code class="language-plaintext highlighter-rouge">why R loops are slow</code> shows that many users ask the same question. Below, I summarize my experience and the online discussions of this issue, with some trivial code examples.</p>
<h2 id="r-is-an-interpreted-language">R is an interpreted language</h2>
<p>This is what you need to keep in mind when you write in R or any other interpreted language. Making a language easy for the end user comes at the cost of processing the code. Many extra computing steps are needed to interpret the user-friendly code and execute it. That is why compiled languages are usually much faster: they do not carry the extra baggage of interpretation.</p>
<p>Does this mean you need to learn <em>C</em> or similar languages? Of course not, though it won’t hurt to know some <em>C</em> :-). You just need to be aware of this fact and try to write your R code in a way that makes it efficient.</p>
<p>Below, I provide some examples that will help you understand what I am talking about. I also think these examples can serve as best practices for programming R loops.</p>
<h2 id="keep-r-loops-code-minimal">Keep R loops code minimal</h2>
<p>Let’s look at an example in which even a few extra characters that do nothing affect the processing speed.</p>
<p>Create a matrix with random numbers:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">set.seed</span><span class="p">(</span><span class="m">123</span><span class="p">)</span><span class="w">
</span><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="n">runif</span><span class="p">(</span><span class="m">1000000</span><span class="p">,</span><span class="w"> </span><span class="n">max</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">),</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>Calculate row means:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">loopmean</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">){</span><span class="w">
</span><span class="n">v</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">()</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="nf">dim</span><span class="p">(</span><span class="n">x</span><span class="p">)[</span><span class="m">1</span><span class="p">])){</span><span class="w">
</span><span class="n">v</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">,])</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="n">v</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">system.time</span><span class="p">(</span><span class="n">loopmeanD</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">loopmean</span><span class="p">(</span><span class="n">m</span><span class="p">))</span><span class="w">
</span><span class="c1"># user system elapsed </span><span class="w">
</span><span class="c1"># 0.039 0.003 0.044</span><span class="w">
</span></code></pre></div></div>
<p>The function here doesn’t matter. I just picked <code class="language-plaintext highlighter-rouge">mean</code> as the most trivial example. We are interested in the amount of time it takes to process this loop.</p>
<p>If we reuse the same code but add some extra brackets in <code class="language-plaintext highlighter-rouge">mean()</code>, it will take noticeably longer to process:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">loopmeanBrackets</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">){</span><span class="w">
</span><span class="n">v</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">()</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="nf">dim</span><span class="p">(</span><span class="n">x</span><span class="p">)[</span><span class="m">1</span><span class="p">])){</span><span class="w">
</span><span class="n">v</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mean</span><span class="p">(((((((((((((((((</span><span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">,])))))))))))))))))</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="n">v</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">system.time</span><span class="p">(</span><span class="n">loopmeanBracketsD</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">loopmeanBrackets</span><span class="p">(</span><span class="n">m</span><span class="p">))</span><span class="w">
</span><span class="c1"># user system elapsed</span><span class="w">
</span><span class="c1"># 0.051 0.000 0.050</span><span class="w">
</span><span class="n">identical</span><span class="p">(</span><span class="n">loopmeanD</span><span class="p">,</span><span class="w"> </span><span class="n">loopmeanBracketsD</span><span class="p">)</span><span class="w">
</span><span class="c1"># [1] TRUE</span><span class="w">
</span></code></pre></div></div>
<p>We have changed nothing in terms of math. It is still the same calculation as before. However, R has to evaluate every <code class="language-plaintext highlighter-rouge">(</code> and <code class="language-plaintext highlighter-rouge">)</code> on each loop cycle, and this slows down the code measurably.</p>
<p>So, next time you write a loop, keep the code inside it as minimal as possible.</p>
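<p>One way to see why extra brackets are not free: in <em>R</em>, <code class="language-plaintext highlighter-rouge">(</code> is itself a function, so every pair of parentheses adds a call. The following is only a minimal sketch, and the variable names <code class="language-plaintext highlighter-rouge">a</code> and <code class="language-plaintext highlighter-rouge">b</code> are chosen for illustration:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Parentheses are an ordinary function in R
is.function(`(`)

# Writing (2 + 3) is the same as calling the `(` function explicitly
a &lt;- (2 + 3)
b &lt;- `(`(2 + 3)
identical(a, b)
</code></pre></div></div>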
<h2 id="process-by-columns">Process by columns</h2>
<p><em>R</em> processes data by columns faster than by rows. If you need to loop through the rows, transpose your data and loop through the columns instead:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">loopmeanColumn</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">){</span><span class="w">
</span><span class="n">v</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">()</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="nf">dim</span><span class="p">(</span><span class="n">x</span><span class="p">)[</span><span class="m">2</span><span class="p">])){</span><span class="w">
</span><span class="n">v</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="p">[,</span><span class="n">i</span><span class="p">])</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="n">v</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">tm</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">m</span><span class="p">)</span><span class="w">
</span><span class="n">system.time</span><span class="p">(</span><span class="n">loopmeanColumnD</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">loopmeanColumn</span><span class="p">(</span><span class="n">tm</span><span class="p">))</span><span class="w">
</span><span class="c1"># user system elapsed </span><span class="w">
</span><span class="c1"># 0.037 0.000 0.036</span><span class="w">
</span><span class="n">identical</span><span class="p">(</span><span class="n">loopmeanD</span><span class="p">,</span><span class="w"> </span><span class="n">loopmeanColumnD</span><span class="p">)</span><span class="w">
</span><span class="c1"># [1] TRUE</span><span class="w">
</span></code></pre></div></div>
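<p>The likely reason is the storage layout: <em>R</em> stores a matrix in column-major order, so the elements of one column sit next to each other in memory, while the elements of a row are spread out. A small sketch on a toy matrix (the object names here are hypothetical):</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A small matrix, filled column by column (R's default)
small &lt;- matrix(1:6, nrow = 2)

# Flattening returns the underlying storage order: column 1, then column 2, ...
storage_order &lt;- as.vector(small)
storage_order

# A column is a contiguous slice of that storage; a row takes every second element
first_column &lt;- small[, 1]
first_row &lt;- small[1, ]
</code></pre></div></div>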
<h2 id="allocate-memory">Allocate memory</h2>
<p><em>R</em> also processes loops faster when you allocate the memory for the output object. In this case, R just needs to fill in the cells in a vector instead of extending the vector every loop cycle.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">vertorloopmean</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">){</span><span class="w">
</span><span class="n">v</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">vector</span><span class="p">(</span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">dim</span><span class="p">(</span><span class="n">x</span><span class="p">)[</span><span class="m">1</span><span class="p">])</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="nf">dim</span><span class="p">(</span><span class="n">x</span><span class="p">)[</span><span class="m">1</span><span class="p">])){</span><span class="w">
</span><span class="n">v</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">,])</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="n">v</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">system.time</span><span class="p">(</span><span class="n">vertorloopmeanD</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">vertorloopmean</span><span class="p">(</span><span class="n">m</span><span class="p">))</span><span class="w">
</span><span class="c1"># user system elapsed </span><span class="w">
</span><span class="c1"># 0.031 0.000 0.032 </span><span class="w">
</span><span class="n">identical</span><span class="p">(</span><span class="n">loopmeanD</span><span class="p">,</span><span class="w"> </span><span class="n">vertorloopmeanD</span><span class="p">)</span><span class="w">
</span><span class="c1"># [1] TRUE</span><span class="w">
</span></code></pre></div></div>
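<p>The same idea can be sketched with <code class="language-plaintext highlighter-rouge">numeric()</code> and <code class="language-plaintext highlighter-rouge">seq_len()</code>. The function names <code class="language-plaintext highlighter-rouge">grow_means</code> and <code class="language-plaintext highlighter-rouge">prealloc_means</code> are made up for this illustration, and the small matrix keeps the example quick to run:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Grow the result one element at a time (the slow pattern)
grow_means &lt;- function(x) {
  v &lt;- c()
  for (i in seq_len(nrow(x))) {
    v[i] &lt;- mean(x[i, ])
  }
  v
}

# Allocate a numeric vector of the final length up front
prealloc_means &lt;- function(x) {
  v &lt;- numeric(nrow(x))
  for (i in seq_len(nrow(x))) {
    v[i] &lt;- mean(x[i, ])
  }
  v
}

set.seed(1)
x &lt;- matrix(runif(2000), ncol = 20)
identical(grow_means(x), prealloc_means(x))
</code></pre></div></div>
<p>Using <code class="language-plaintext highlighter-rouge">numeric()</code> rather than <code class="language-plaintext highlighter-rouge">vector(length = ...)</code> creates the output with the right type from the start; the latter creates a logical vector that has to be converted on the first assignment.</p>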
<h2 id="use-apply">Use <em>apply</em></h2>
<p>When you search online for why <em>R</em> loops are slow, you are likely to find the advice to use <code class="language-plaintext highlighter-rouge">apply</code> because it is faster. I also thought that <code class="language-plaintext highlighter-rouge">apply</code> was faster than <code class="language-plaintext highlighter-rouge">for</code> loops until I did some research for this blog post. In fact, <code class="language-plaintext highlighter-rouge">apply</code> also loops through the data, and it often seems a little faster than <code class="language-plaintext highlighter-rouge">for</code> loops only because its code tends to be shorter:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">applymean</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">){</span><span class="w">
</span><span class="n">v</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">apply</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">mean</span><span class="p">)</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="n">v</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">system.time</span><span class="p">(</span><span class="n">applymeanD</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">applymean</span><span class="p">(</span><span class="n">m</span><span class="p">))</span><span class="w">
</span><span class="c1"># user system elapsed </span><span class="w">
</span><span class="c1"># 0.035 0.004 0.038 </span><span class="w">
</span><span class="n">identical</span><span class="p">(</span><span class="n">loopmeanD</span><span class="p">,</span><span class="w"> </span><span class="n">applymeanD</span><span class="p">)</span><span class="w">
</span><span class="c1"># [1] TRUE</span><span class="w">
</span></code></pre></div></div>
<p>In this single run, processing by columns is also marginally faster for <code class="language-plaintext highlighter-rouge">apply</code>:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">applymeanColumn</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">){</span><span class="w">
</span><span class="n">v</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">apply</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">mean</span><span class="p">)</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="n">v</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">system.time</span><span class="p">(</span><span class="n">applymeanColumnD</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">applymeanColumn</span><span class="p">(</span><span class="n">tm</span><span class="p">))</span><span class="w">
</span><span class="c1"># user system elapsed </span><span class="w">
</span><span class="c1"># 0.036 0.000 0.036</span><span class="w">
</span><span class="n">identical</span><span class="p">(</span><span class="n">loopmeanD</span><span class="p">,</span><span class="w"> </span><span class="n">applymeanColumnD</span><span class="p">)</span><span class="w">
</span><span class="c1"># [1] TRUE</span><span class="w">
</span></code></pre></div></div>
<p>Please see <a href="#benchmarking-r-loops">the benchmarking of all loops below</a> for more details on how <code class="language-plaintext highlighter-rouge">apply</code> compares to <code class="language-plaintext highlighter-rouge">for</code> loops. In this case, it was actually not faster than the <code class="language-plaintext highlighter-rouge">for</code> loop.</p>
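<p>If you like the compact style of the <code class="language-plaintext highlighter-rouge">apply</code> family, <code class="language-plaintext highlighter-rouge">vapply</code> is a variant that also checks the type and length of each result. It is still a loop over the rows, so this sketch is about convenience rather than speed. The function name <code class="language-plaintext highlighter-rouge">vapply_means</code> is hypothetical:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Loop over the row indices; each call must return exactly one numeric value
vapply_means &lt;- function(x) {
  vapply(seq_len(nrow(x)), function(i) mean(x[i, ]), numeric(1))
}

set.seed(1)
x &lt;- matrix(runif(2000), ncol = 20)
identical(vapply_means(x), apply(x, 1, mean))
</code></pre></div></div>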
<h2 id="compile-your-functions">Compile your functions</h2>
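<p>You can improve the performance of your function by compiling it to byte code. This is especially beneficial when your function code is long. Note that since <em>R</em> 3.4.0, functions are byte-compiled just-in-time by default, so explicit compilation may bring little extra gain.</p>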
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">compiler</span><span class="p">)</span><span class="w">
</span><span class="n">loopmeanCompiled</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cmpfun</span><span class="p">(</span><span class="n">loopmean</span><span class="p">)</span><span class="w">
</span><span class="n">system.time</span><span class="p">(</span><span class="n">loopmeanCompiledD</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">loopmeanCompiled</span><span class="p">(</span><span class="n">m</span><span class="p">))</span><span class="w">
</span><span class="c1"># user system elapsed </span><span class="w">
</span><span class="c1"># 0.035 0.000 0.035 </span><span class="w">
</span><span class="n">identical</span><span class="p">(</span><span class="n">loopmeanD</span><span class="p">,</span><span class="w"> </span><span class="n">loopmeanCompiledD</span><span class="p">)</span><span class="w">
</span><span class="c1"># [1] TRUE</span><span class="w">
</span></code></pre></div></div>
<h2 id="parallelize">Parallelize</h2>
<p><em>R</em> has several libraries that allow you to parallelize the processing across the cores of your processor.</p>
<p>I usually use the <em>doParallel</em> library for that. It is not beneficial in this <code class="language-plaintext highlighter-rouge">mean</code> example, because it takes longer to split the work between cores and collect the results than to run everything on one core. However, when each loop cycle is long enough, parallelizing helps a lot.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">doParallel</span><span class="p">)</span><span class="w">
</span><span class="n">registerDoParallel</span><span class="p">(</span><span class="n">cores</span><span class="o">=</span><span class="m">12</span><span class="p">)</span><span class="w">
</span><span class="n">system.time</span><span class="p">(</span><span class="n">loopParallelD</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">foreach</span><span class="p">(</span><span class="n">i</span><span class="o">=</span><span class="m">1</span><span class="o">:</span><span class="nf">dim</span><span class="p">(</span><span class="n">m</span><span class="p">)[</span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">.combine</span><span class="o">=</span><span class="n">c</span><span class="p">)</span><span class="w"> </span><span class="o">%dopar%</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">m</span><span class="p">[</span><span class="n">i</span><span class="p">,]))</span><span class="w">
</span><span class="c1"># user system elapsed </span><span class="w">
</span><span class="c1"># 1.173 0.157 1.042 </span><span class="w">
</span><span class="n">identical</span><span class="p">(</span><span class="n">loopmeanD</span><span class="p">,</span><span class="w"> </span><span class="n">loopParallelD</span><span class="p">)</span><span class="w">
</span><span class="c1"># [1] TRUE</span><span class="w">
</span></code></pre></div></div>
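<p>One way to reduce the overhead is to give each worker one large chunk of rows instead of one row per task. The sketch below uses the base <em>parallel</em> package rather than <em>doParallel</em>, and assumes that two local worker processes can be started; the object names are made up for illustration:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(parallel)

set.seed(1)
x &lt;- matrix(runif(20000), ncol = 20)

# Assumption: two local worker processes are available
cl &lt;- makeCluster(2)

# Split the row indices into one contiguous chunk per worker
chunks &lt;- splitIndices(nrow(x), length(cl))

# Each worker computes the means of its whole chunk in a single task
chunk_means &lt;- parLapply(cl, chunks, function(idx, mat) {
  rowMeans(mat[idx, , drop = FALSE])
}, mat = x)

stopCluster(cl)

# Chunks are in row order, so concatenating restores the original order
parallel_means &lt;- unlist(chunk_means)
all.equal(parallel_means, rowMeans(x))
</code></pre></div></div>
<p>For a task as cheap as a row mean, this is still unlikely to beat a single core, because the matrix has to be copied to every worker. The chunking pattern pays off when the work per chunk is heavy.</p>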
<h2 id="use-built-in-functions">Use Built-in functions</h2>
<p>Everything described above helps only marginally. You can get some performance improvements with these tricks, but you will never beat the built-in <em>R</em> functions that call <em>C</em> code directly, without the interpretation step. Just look at how much faster the built-in function for row means is:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">system.time</span><span class="p">(</span><span class="n">rowMeanD</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rowMeans</span><span class="p">(</span><span class="n">m</span><span class="p">))</span><span class="w">
</span><span class="c1"># user system elapsed </span><span class="w">
</span><span class="c1"># 0.002 0.000 0.002 </span><span class="w">
</span><span class="n">identical</span><span class="p">(</span><span class="n">loopmeanD</span><span class="p">,</span><span class="w"> </span><span class="n">rowMeanD</span><span class="p">)</span><span class="w">
</span><span class="c1"># [1] TRUE</span><span class="w">
</span></code></pre></div></div>
<p>So, before you write your own function, check whether an <em>R</em> library already provides it.</p>
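<p>When no ready-made function exists, you can often combine vectorized built-ins instead of looping. As a sketch, row variances can be computed from <code class="language-plaintext highlighter-rouge">rowMeans</code> and <code class="language-plaintext highlighter-rouge">rowSums</code>; the function name <code class="language-plaintext highlighter-rouge">row_vars</code> is hypothetical:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Row-wise sample variance without an explicit loop
row_vars &lt;- function(x) {
  # The vector of row means is recycled down each column,
  # so every element has its own row mean subtracted
  centered &lt;- x - rowMeans(x)
  rowSums(centered^2) / (ncol(x) - 1)
}

set.seed(1)
x &lt;- matrix(runif(2000), ncol = 20)
all.equal(row_vars(x), apply(x, 1, var))
</code></pre></div></div>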
<h2 id="write-in-c">Write in <em>C++</em></h2>
<p>There is also the option to write your code in <em>C++</em> and call it from <em>R</em> using <a href="https://cran.r-project.org/web/packages/Rcpp/index.html" target="_blank"><em>Rcpp</em></a>, which compiles it and exposes it as an <em>R</em> function. This also results in considerably faster code. But of course, you need to know some <em>C++</em> for that.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#include &lt;Rcpp.h&gt;</span><span class="w">
</span><span class="n">using</span><span class="w"> </span><span class="n">namespace</span><span class="w"> </span><span class="n">Rcpp</span><span class="p">;</span><span class="w">
</span><span class="o">//</span><span class="p">[[</span><span class="n">Rcpp</span><span class="o">::</span><span class="n">export</span><span class="p">]]</span><span class="w">
</span><span class="n">NumericVector</span><span class="w"> </span><span class="n">cRowMeans</span><span class="p">(</span><span class="n">NumericMatrix</span><span class="w"> </span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">int</span><span class="w"> </span><span class="n">nrows</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x.nrow</span><span class="p">();</span><span class="w">
</span><span class="n">NumericVector</span><span class="w"> </span><span class="n">v</span><span class="p">(</span><span class="n">nrows</span><span class="p">);</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">int</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="n">nrows</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">++</span><span class="p">){</span><span class="w">
</span><span class="n">v</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">x.row</span><span class="p">(</span><span class="n">i</span><span class="p">));</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">return</span><span class="w"> </span><span class="n">v</span><span class="p">;</span><span class="w">
</span><span class="p">}</span><span class="w">
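</span><span class="c1"># R side: assumes the C++ code above was compiled first, e.g. with Rcpp::sourceCpp() on a file containing it</span><span class="w">
</span><span class="n">system.time</span><span class="p">(</span><span class="n">cRowMeansD</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cRowMeans</span><span class="p">(</span><span class="n">m</span><span class="p">))</span><span class="w">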
</span><span class="c1"># user system elapsed </span><span class="w">
</span><span class="c1"># 0.004 0.000 0.004</span><span class="w">
</span><span class="n">identical</span><span class="p">(</span><span class="n">loopmeanD</span><span class="p">,</span><span class="w"> </span><span class="n">cRowMeansD</span><span class="p">)</span><span class="w">
</span><span class="c1"># [1] TRUE</span><span class="w">
</span></code></pre></div></div>
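<p>If you prefer to stay inside an <em>R</em> session, <code class="language-plaintext highlighter-rouge">Rcpp::cppFunction()</code> compiles a <em>C++</em> function from a string. The sketch below assumes that <em>Rcpp</em> and a working <em>C++</em> toolchain are installed; the name <code class="language-plaintext highlighter-rouge">cRowMeansInline</code> is hypothetical, and the loop sums each row by hand instead of using the <code class="language-plaintext highlighter-rouge">mean</code> helper:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(Rcpp)

# Compile a C++ function from a string and expose it as an R function
cppFunction('
NumericVector cRowMeansInline(NumericMatrix x) {
  int nrows = x.nrow();
  int ncols = x.ncol();
  NumericVector v(nrows);
  for (int i = 0; i &lt; nrows; i++) {
    double total = 0;
    for (int j = 0; j &lt; ncols; j++) {
      total += x(i, j);
    }
    v[i] = total / ncols;
  }
  return v;
}')

set.seed(1)
x &lt;- matrix(runif(2000), ncol = 20)
all.equal(cRowMeansInline(x), rowMeans(x))
</code></pre></div></div>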
<h2 id="benchmarking-r-loops">Benchmarking <em>R</em> loops</h2>
<p>Running <code class="language-plaintext highlighter-rouge">system.time()</code> several times on the same function produces slightly different results. Although the <code class="language-plaintext highlighter-rouge">system.time</code> results presented above are comparable, they do not fully reflect reality. As I mentioned above, only after benchmarking all these functions did I discover that <code class="language-plaintext highlighter-rouge">apply</code> was not as fast as I expected. Moreover, in this particular code, it was slower than a simple <code class="language-plaintext highlighter-rouge">for</code> loop.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">microbenchmark</span><span class="p">)</span><span class="w">
</span><span class="n">mbm</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">microbenchmark</span><span class="p">(</span><span class="s2">"LoopRowExtraBrackets"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">loopmeanBrackets</span><span class="p">(</span><span class="n">m</span><span class="p">),</span><span class="w">
</span><span class="s2">"ApplyRow"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">applymean</span><span class="p">(</span><span class="n">m</span><span class="p">),</span><span class="w">
</span><span class="s2">"ApplyColumn"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">applymeanColumn</span><span class="p">(</span><span class="n">tm</span><span class="p">),</span><span class="w">
</span><span class="s2">"LoopRowCompiled"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">loopmeanCompiled</span><span class="p">(</span><span class="n">m</span><span class="p">),</span><span class="w">
</span><span class="s2">"LoopRow"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">loopmean</span><span class="p">(</span><span class="n">m</span><span class="p">),</span><span class="w">
</span><span class="s2">"LoopColumn"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">loopmeanColumn</span><span class="p">(</span><span class="n">tm</span><span class="p">),</span><span class="w">
</span><span class="s2">"LoopRowToVector"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">vertorloopmean</span><span class="p">(</span><span class="n">m</span><span class="p">),</span><span class="w">
</span><span class="s2">"RowLoopC"</span><span class="o">=</span><span class="w"> </span><span class="n">cRowMeans</span><span class="p">(</span><span class="n">m</span><span class="p">),</span><span class="w">
</span><span class="s2">"Built-in_rowMeans"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rowMeans</span><span class="p">(</span><span class="n">m</span><span class="p">),</span><span class="w">
</span><span class="n">check</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'equal'</span><span class="p">,</span><span class="w"> </span><span class="n">times</span><span class="o">=</span><span class="m">100</span><span class="p">)</span><span class="w">
</span><span class="n">mbm</span><span class="w">
</span><span class="c1"># Unit: milliseconds</span><span class="w">
</span><span class="c1"># expr min lq mean median uq max neval cld</span><span class="w">
</span><span class="c1"># LoopRowExtraBrackets 44.693969 49.214291 53.174126 51.161645 53.941782 93.298477 100 d</span><span class="w">
</span><span class="c1"># ApplyRow 38.594932 44.356220 50.847681 46.949665 51.282932 88.668781 100 d</span><span class="w">
</span><span class="c1"># ApplyColumn 38.211502 44.035419 51.626075 47.071399 52.140827 94.046727 100 d</span><span class="w">
</span><span class="c1"># LoopRow 35.798877 40.832460 43.676707 42.363606 44.313524 80.957104 100 c </span><span class="w">
</span><span class="c1"># LoopRowCompiled 33.665563 40.451894 42.942701 42.355379 44.258304 73.966924 100 c </span><span class="w">
</span><span class="c1"># LoopColumn 34.566808 39.875668 42.668796 41.743636 44.099745 76.010563 100 c </span><span class="w">
</span><span class="c1"># LoopRowToVector 32.187435 37.927207 40.912034 39.813388 42.197814 74.008110 100 c </span><span class="w">
</span><span class="c1"># RowLoopC 2.794117 3.721194 5.664805 4.260946 6.525059 49.985055 100 b </span><span class="w">
</span><span class="c1"># Built-in_rowMeans 1.571780 1.668413 1.815267 1.687677 1.791447 3.601554 100 a</span><span class="w">
</span></code></pre></div></div>
<p>And the visualization of these results:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">
</span><span class="n">autoplot</span><span class="p">(</span><span class="n">mbm</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p><img src="/assets/posts/2019-10-01-r-loops-are-slow/benchmarks_R_loops.jpg" alt="benchmarks of R loops" /></p>
<h2 id="code">Code</h2>
<p>You can <a href="/assets/posts/2019-10-01-r-loops-are-slow/R_loops_are_slow.Rmd">download the R code</a> and test everything yourself.</p>
<h2 id="conclusion">Conclusion</h2>
<p>We use <em>R</em> not because of its speed but rather because of its ease of use. The most efficient <em>R</em> code will never be faster than the alternative <em>C</em> code. But knowing the behavior of <em>R</em> I described above will help you to make your <em>R</em> loops the fastest within the limitation of <em>R</em> as an interpreted language.</p>You may have noticed that R loops are slow. You will find out why it is so and how to deal with that.AWK is awesome2019-09-17T00:00:00+00:002019-09-17T00:00:00+00:00https://evodify.com/awk-is-awesome<p><a href="https://en.wikipedia.org/wiki/AWK" target="_blank"><em>AWK</em></a> has been the most beneficial programming language I have ever learned. It took me only a day to learn most of it and it saved me several weeks if not months already. I use <em>AWK</em> almost every day.</p>
<p>It is better to see <em>AWK</em> in action once than to hear about it a thousand times. So, let’s start with the examples.</p>
<h2 id="table-summary">Table summary</h2>
<p>I usually use <em>AWK</em> to calculate some simple summary statistics for a table. For example, let’s assume you have a file <code class="language-plaintext highlighter-rouge">table.txt</code> with some numeric values:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CHR START END LOD SCORE
chr1 211829 211850 lod=31 333
chr1 211867 211871 lod=13 247
chr1 211877 211903 lod=66 408
chr1 211913 211927 lod=61 400
chr1 211971 211994 lod=60 399
chr1 211996 212024 lod=72 417
chr6 310311 310324 lod=16 268
chr6 312061 312066 lod=13 247
chr6 312100 312206 lod=376 580
chr6 312653 312728 lod=19 285
chr6 312908 313028 lod=348 573
chr6 313549 313788 lod=900 667
chr6 313589 313784 lod=747 648
</code></pre></div></div>
<h3 id="mean">Mean</h3>
<p>You can quickly get the mean <code class="language-plaintext highlighter-rouge">SCORE</code> value:</p>
<div class="language-s highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">awk</span><span class="w"> </span><span class="s1">'{s+=$5} END {print s/(NR-1)}'</span><span class="w"> </span><span class="n">table.txt</span><span class="w">
</span></code></pre></div></div>
<p><img src="/assets/posts/2019-09-17-awk-is-awesome/1-Mean-SCORE-column.jpg" alt="Mean of the SCORE column" /></p>
<p>where <code class="language-plaintext highlighter-rouge">s+=$5</code> sums up all values of the 5th column;
<code class="language-plaintext highlighter-rouge">NR</code> is a built-in variable that holds the number of lines read so far, so in the <code class="language-plaintext highlighter-rouge">END</code> block it equals the total number of lines. I use <code class="language-plaintext highlighter-rouge">NR-1</code> because the header line must not be counted.
The commands after <code class="language-plaintext highlighter-rouge">END</code> are executed once the end of the file is reached.</p>
<p>To see what <em>AWK</em> does line by line, run this command:</p>
<div class="language-s highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">awk</span><span class="w"> </span><span class="s1">'BEGIN{print "SCORE", "SUM", "LINE_NUMBER"} {s+=$5; print $5, s, NR} END {print "mean:", s/(NR-1)}'</span><span class="w"> </span><span class="n">table.txt</span><span class="w">
</span></code></pre></div></div>
<p><img src="/assets/posts/2019-09-17-awk-is-awesome/2-Mean-SCORE-column-lines.jpg" alt="How AWK calculates mean of the SCORE column line by line " /></p>
<p>But how do you calculate the mean of the <code class="language-plaintext highlighter-rouge">LOD</code> column, which has <code class="language-plaintext highlighter-rouge">lod=</code> in front of each number?</p>
<p>You can use <em>AWK</em> to clean the data and do the calculation in one step:</p>
<div class="language-s highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">awk</span><span class="w"> </span><span class="s1">'gsub( "lod=", "" , $4){s+=$4}END{print s/(NR-1)}'</span><span class="w"> </span><span class="n">table.txt</span><span class="w">
</span></code></pre></div></div>
<p><img src="/assets/posts/2019-09-17-awk-is-awesome/3-Mean-LOD-column.jpg" alt="Mean of the LOD column" /></p>
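<p><code class="language-plaintext highlighter-rouge">gsub( "lod=", "" , $4)</code> replaces <code class="language-plaintext highlighter-rouge">lod=</code> with an empty string in the 4th column before any calculation is done. Because it is used as the pattern, its return value (the number of replacements made) also decides whether the action runs: the header contains no <code class="language-plaintext highlighter-rouge">lod=</code>, so it is skipped in the sum, but it is still counted by <code class="language-plaintext highlighter-rouge">NR</code>, hence <code class="language-plaintext highlighter-rouge">NR-1</code>.</p>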
<p>You can also limit the calculation to one chromosome:</p>
<div class="language-s highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">awk</span><span class="w"> </span><span class="s1">'$1=="chr1" {n++; s+=$5} END {print s/n}'</span><span class="w"> </span><span class="n">table.txt</span><span class="w">
</span></code></pre></div></div>
<p><img src="/assets/posts/2019-09-17-awk-is-awesome/4-Mean-SCORE-column_chr1.jpg" alt="Mean of the SCORE column for chromosome 1" /></p>
<p>Here the condition <code class="language-plaintext highlighter-rouge">$1=="chr1"</code> selects the lines, and the action <code class="language-plaintext highlighter-rouge">{n++; s+=$5}</code> runs only for them. <code class="language-plaintext highlighter-rouge">NR</code> is replaced with the counter <code class="language-plaintext highlighter-rouge">n</code>, which <code class="language-plaintext highlighter-rouge">n++</code> increments only for the lines that meet the condition <code class="language-plaintext highlighter-rouge">$1=="chr1"</code>.</p>
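<p>The same idea extends to all chromosomes at once if the sum and the count are kept in AWK arrays indexed by the first column. The following is only a minimal sketch: the temporary file behind <code class="language-plaintext highlighter-rouge">demo</code> and the variable <code class="language-plaintext highlighter-rouge">per_chr_mean</code> are hypothetical names used for illustration, and the four data rows are copied from the table above.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical demo file with the same layout as table.txt (header + data rows)
demo=$(mktemp)
printf '%s\n' \
  'CHR START END LOD SCORE' \
  'chr1 211829 211850 lod=31 333' \
  'chr1 211867 211871 lod=13 247' \
  'chr6 310311 310324 lod=16 268' \
  'chr6 312061 312066 lod=13 247' > "$demo"

# NR>1 skips the header; s[] and n[] hold the running sum and count per chromosome
per_chr_mean=$(awk 'NR>1 {s[$1]+=$5; n[$1]++} END {for (c in s) print c, s[c]/n[c]}' "$demo" | sort)
echo "$per_chr_mean"
rm -f "$demo"
</code></pre></div></div>
<p>The order in which <code class="language-plaintext highlighter-rouge">for (c in s)</code> visits the keys is not defined in AWK, which is why the output is piped through <code class="language-plaintext highlighter-rouge">sort</code>.</p>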
<h3 id="min-and-max">Min and max</h3>
<p>Using the same principles, you can get the minimum and maximum values of the <code class="language-plaintext highlighter-rouge">SCORE</code> column:</p>
<div class="language-s highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">awk</span><span class="w"> </span><span class="s1">'NR==2 || $5 < min {min=$5} END{ print min}'</span><span class="w"> </span><span class="n">table.txt</span><span class="w">
</span><span class="n">awk</span><span class="w"> </span><span class="s1">'NR==2 || $5 > max {max=$5} END{ print max}'</span><span class="w"> </span><span class="n">table.txt</span><span class="w">
</span></code></pre></div></div>
<p><img src="/assets/posts/2019-09-17-awk-is-awesome/5-Min-max-SCORE-column.jpg" alt="Min and max of the SCORE column" /></p>
<p><code class="language-plaintext highlighter-rouge">||</code> is the logical <em>OR</em> operator in AWK.</p>
<p>You can combine these two commands in one:</p>
<div class="language-s highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">awk</span><span class="w"> </span><span class="s1">'NR==2 {min=$5; max=$5} $5 > max {max=$5} $5 < min {min=$5} END {print "min: ", min, "\nmax: ", max}'</span><span class="w"> </span><span class="n">table.txt</span><span class="w">
</span></code></pre></div></div>
<p><img src="/assets/posts/2019-09-17-awk-is-awesome/6-Min-max-SCORE-column-one-line.jpg" alt="Min and max of the SCORE column in one line" /></p>
<p><code class="language-plaintext highlighter-rouge">NR==2 {min=$5; max=$5}</code> assigns the initial values of <code class="language-plaintext highlighter-rouge">min</code> and <code class="language-plaintext highlighter-rouge">max</code> using the second row.
<code class="language-plaintext highlighter-rouge">$5 > max {max=$5}</code> and <code class="language-plaintext highlighter-rouge">$5 < min {min=$5}</code> are conditional statements that are checked one after another.</p>
<h3 id="mean-max-and-min-in-one-line">Mean, max, and min in one line</h3>
<p>You can also combine all three calculations in one line and get all statistics in one run:</p>
<div class="language-s highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">awk</span><span class="w"> </span><span class="s1">' NR==2 {min=$5; max=$5} $5 > max {max=$5} $5 < min {min=$5} {s+=$5} END {print "min: ", min, "\nmax: ", max, "\nmean: ", s/(NR-1)}'</span><span class="w"> </span><span class="n">table.txt</span><span class="w">
</span></code></pre></div></div>
<p><img src="/assets/posts/2019-09-17-awk-is-awesome/7-Mean-min-max-SCORE-column-one-line.jpg" alt="Mean, min and max of the SCORE column in one line" /></p>
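<p>Once a one-liner grows this long, it may be easier to read when spread over several lines, with the header skipped explicitly instead of relying on <code class="language-plaintext highlighter-rouge">NR==2</code> and <code class="language-plaintext highlighter-rouge">NR-1</code>. This is a sketch of one possible layout, not the only way to write it; the variable <code class="language-plaintext highlighter-rouge">stats</code> and the three inline rows (taken from the table above) are used only for illustration.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>stats=$(printf '%s\n' \
  'CHR START END LOD SCORE' \
  'chr1 211829 211850 lod=31 333' \
  'chr1 211867 211871 lod=13 247' \
  'chr1 211877 211903 lod=66 408' |
awk '
  NR == 1 { next }                      # skip the header row explicitly
  n == 0 || $5 < min { min = $5 }       # the first data row initializes min
  n == 0 || $5 > max { max = $5 }       # the first data row initializes max
  { s += $5; n++ }                      # running sum and count of data rows
  END { printf "min: %d\nmax: %d\nmean: %.2f\n", min, max, s / n }
')
echo "$stats"
</code></pre></div></div>
<p>Counting the data rows in <code class="language-plaintext highlighter-rouge">n</code> keeps the mean correct even if the file has no header or has several header lines that are skipped the same way.</p>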
<h2 id="genotypes-summary">Genotypes summary</h2>
<p>There are more complicated cases where you can use AWK.</p>
<p>You may want to do some calculations of the genotype table generated by <a href="https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_variantutils_VariantsToTable.php" target="_blank">VariantsToTable</a> from the GATK:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#CHROM POS REF 12.4.GT 13.16.GT 16.9.GT
scaffold_1 191 A ./. ./. A/A
scaffold_1 563 T T/T ./. T/A
scaffold_1 647 A C/C C/C A/C
scaffold_1 669 T T/T T/T T/T
scaffold_1 679 C C/A C/A C/A
scaffold_1 704 T T/C T/C T/C
scaffold_1 721 T C/C C/C C/C
scaffold_1 722 C C/T C/T C/T
scaffold_1 733 G G/T G/T G/*
</code></pre></div></div>
<p>For example, I often calculate the number of heterozygous sites, homozygous sites, and missing genotypes. To that end, I use this <em>AWK</em> script, stored in the <a href="https://github.com/evodify/genotype-files-manipulations/blob/master/summarizeTAB.awk">summarizeTAB.awk</a> file:</p>
<div class="language-s highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">NF</span><span class="w"> </span><span class="o">></span><span class="w"> </span><span class="n">maxNF</span><span class="w"> </span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">NF</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">++</span><span class="p">)</span><span class="w">
</span><span class="n">countN</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">;</span><span class="w"> </span><span class="n">countHomo</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">;</span><span class="w"> </span><span class="n">countHetero</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">;</span><span class="w"> </span><span class="n">countNA</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">;</span><span class="w"> </span><span class="n">maxNF</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">NF</span><span class="p">;</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">NR</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">NF</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">++</span><span class="p">)</span><span class="w"> </span><span class="n">samples</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">$</span><span class="n">i</span><span class="p">;}</span><span class="w">
</span><span class="k">else</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">NF</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">++</span><span class="p">)</span><span class="w">
</span><span class="p">{</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">$</span><span class="n">i</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"N"</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="o">$</span><span class="n">i</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"./."</span><span class="p">)</span><span class="w"> </span><span class="n">countN</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">++</span><span class="p">;</span><span class="w">
</span><span class="k">else</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">$</span><span class="n">i</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"A/A"</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="o">$</span><span class="n">i</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"T/T"</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="o">$</span><span class="n">i</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"G/G"</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="o">$</span><span class="n">i</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"C/C"</span><span class="p">)</span><span class="w"> </span><span class="n">countHomo</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">++</span><span class="p">;</span><span class="w">
</span><span class="k">else</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">$</span><span class="n">i</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"G/A"</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="o">$</span><span class="n">i</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"T/C"</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="o">$</span><span class="n">i</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"A/C"</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="o">$</span><span class="n">i</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"G/T"</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="o">$</span><span class="n">i</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"C/G"</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="o">$</span><span class="n">i</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"A/T"</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="err">\</span><span class="w">
</span><span class="o">$</span><span class="n">i</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"A/G"</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="o">$</span><span class="n">i</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"C/T"</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="o">$</span><span class="n">i</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"C/A"</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="o">$</span><span class="n">i</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"T/G"</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="o">$</span><span class="n">i</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"G/C"</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="o">$</span><span class="n">i</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"T/A"</span><span class="p">)</span><span class="w"> </span><span class="err">\</span><span class="w">
</span><span class="n">countHetero</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">++</span><span class="p">;</span><span class="w">
</span><span class="k">else</span><span class="w"> </span><span class="n">countNA</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">++</span><span class="p">;</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">END</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">print</span><span class="w"> </span><span class="s2">"Sample"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Genotypes"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Heterozygots"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Homozygots"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Missing"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Unknown"</span><span class="p">;</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">maxNF</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">++</span><span class="p">)</span><span class="w">
</span><span class="n">print</span><span class="w"> </span><span class="n">samples</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="n">countHomo</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">+</span><span class="n">countHetero</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="m">+0</span><span class="p">,</span><span class="w"> </span><span class="n">countHetero</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="m">+0</span><span class="p">,</span><span class="w"> </span><span class="n">countHomo</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="m">+0</span><span class="p">,</span><span class="w"> </span><span class="n">countN</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="m">+0</span><span class="p">,</span><span class="w"> </span><span class="n">countNA</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="m">+0</span><span class="p">;</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>It loops through the columns starting from the 4th one and counts the heterozygous, homozygous, missing, and unknown genotypes for each sample. These counts are stored in the corresponding arrays.</p>
<p>When a script is too long to fit on one line, as in this case, you can write it to a file and tell <em>AWK</em> to execute it:</p>
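<p>The classification step can also be written without listing every allele pair, by splitting a genotype on <code class="language-plaintext highlighter-rouge">/</code> and comparing the two alleles. This is a different technique from the script above, shown only as a sketch: the function name <code class="language-plaintext highlighter-rouge">classify_genotypes</code> is hypothetical, and the rule that only <code class="language-plaintext highlighter-rouge">A</code>, <code class="language-plaintext highlighter-rouge">C</code>, <code class="language-plaintext highlighter-rouge">G</code>, and <code class="language-plaintext highlighter-rouge">T</code> count as known alleles is an assumption chosen to mirror that script.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Reads one genotype per line and prints its class
classify_genotypes() {
  awk '{
    if ($1 == "./." || $1 == "N") {
      print "missing"
    } else if (split($1, a, "/") == 2 && a[1] ~ /^[ACGT]$/ && a[2] ~ /^[ACGT]$/) {
      # two known alleles: equal means homozygous, different means heterozygous
      print (a[1] == a[2] ? "homozygous" : "heterozygous")
    } else {
      print "unknown"
    }
  }'
}

genotype_classes=$(printf '%s\n' 'A/A' 'T/C' './.' 'G/*' | classify_genotypes)
echo "$genotype_classes"
</code></pre></div></div>
<p>As in the script above, a genotype such as <code class="language-plaintext highlighter-rouge">G/*</code> ends up in the unknown class.</p>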
<div class="language-s highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">awk</span><span class="w"> </span><span class="o">-</span><span class="n">f</span><span class="w"> </span><span class="n">summarizeTAB.awk</span><span class="w"> </span><span class="n">geno.tab</span><span class="w">
</span></code></pre></div></div>
<p><img src="/assets/posts/2019-09-17-awk-is-awesome/8-summarizeTAB.awk.jpg" alt="output of the summarizeTAB.awk script" /></p>
<h2 id="awk-vs-python"><em>AWK</em> vs <em>Python</em></h2>
<p><em>AWK</em> code is usually shorter and runs faster than the equivalent <em>Python</em> code. I do not have a dramatic example where my <em>AWK</em> code is substantially shorter than the equivalent <em>Python</em> code. But there are <a href="https://sites.google.com/site/toawkornot/python-vs-awk" target="_blank">great examples from other <em>AWK</em> users</a>.</p>
<h2 id="be-careful-with-awk">Be careful with <em>AWK</em></h2>
<p>There is one key point you need to keep in mind when you work with AWK. It doesn’t throw an error when it encounters something unusual. Instead, <em>AWK</em> tries to guess how to handle it and proceeds silently. This can put you in danger.</p>
<p>Using the mean <code class="language-plaintext highlighter-rouge">SCORE</code> column example from above, you can see that <em>AWK</em> treated the characters in the header as <code class="language-plaintext highlighter-rouge">0</code>.</p>
<p>This would raise an error in <em>Python</em>, where a character string and a numeric value cannot be summed. But <em>AWK</em> gives no such error.</p>
<p>You would have made a mistake in the mean calculation if you had counted the lines with <code class="language-plaintext highlighter-rouge">n++</code>, because it would have counted the header too. That is why I subtracted 1 from the number of rows given by <code class="language-plaintext highlighter-rouge">NR</code>.</p>
<p>Similarly, if you have missing data points in the form of <code class="language-plaintext highlighter-rouge">NA</code>, you need to tell <em>AWK</em> to skip them:</p>
<div class="language-s highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">awk</span><span class="w"> </span><span class="s1">'NR>1 && $5!="NA" {s+=$5; n++; print $5, s, n} END {print "mean:", s/n}'</span><span class="w"> </span><span class="n">table.txt</span><span class="w">
</span></code></pre></div></div>
<p>I also used <code class="language-plaintext highlighter-rouge">NR>1</code> to skip the header.</p>
<p>So, you need to be aware of this behavior of <em>AWK</em> when you have a mixture of data types.</p>
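<p>The silent conversion is easy to see on a tiny input. In the sketch below, the variable names and the three-value input are made up for the demonstration: a non-numeric string is treated as <code class="language-plaintext highlighter-rouge">0</code> in arithmetic, so an <code class="language-plaintext highlighter-rouge">NA</code> lowers the mean unless it is skipped.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A non-numeric string is silently treated as 0 in arithmetic
header_as_number=$(echo 'SCORE' | awk '{print $1 + 0}')

# NA adds 0 to the sum but still increments n, which biases the mean
mean_with_na=$(printf '%s\n' 10 NA 20 | awk '{s+=$1; n++} END {print s/n}')

# Skipping NA explicitly gives the mean of the real values only
mean_without_na=$(printf '%s\n' 10 NA 20 | awk '$1!="NA" {s+=$1; n++} END {print s/n}')

echo "$header_as_number $mean_with_na $mean_without_na"
</code></pre></div></div>
<p>The three values printed are <code class="language-plaintext highlighter-rouge">0</code>, <code class="language-plaintext highlighter-rouge">10</code>, and <code class="language-plaintext highlighter-rouge">15</code>: only the last mean reflects the two real data points.</p>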
<h2 id="where-to-learn-awk">Where to learn AWK</h2>
<p>If you want to learn AWK, I recommend the course <a href="https://sites.google.com/site/toawkornot/" target="_blank">“To awk or not to…”</a>. It was fantastic when I took it in 2017, and it has improved since then.</p>
<p>I also often visit <a href="http://www.grymoire.com/Unix/Awk.html" target="_blank">this <em>AWK</em> page</a> for a quick reference on the functions.</p>
<p><strong>If you have never used AWK, give it a try. It may change your life forever.</strong></p>AWK is awesome because it is the most rewarding programming language you can learn for bioinformatics and Data Science. There is little to learn and it is useful every day.Interpopulation comparison of Copy Number Variants2019-09-10T00:00:00+00:002019-09-10T00:00:00+00:00https://evodify.com/populations-cnv-comparison<p>I previously showed how to efficiently <a href="https://evodify.com/gatk-cnv-snakemake/">genotype Copy Number Variants with GATK and Snakemake</a>. As a continuation of the Copy Number Variation topic, I will share how I compared Copy Number Variation along the genome between three different populations. If you also analyze population genomic data, I hope you will find this post useful.</p>
<p>Although the <a href="https://evodify.com/gatk-cnv-snakemake/">GATK Copy Number Variants (CNVs) calling pipeline</a> uses the population variation when calling CNVs in cohort mode, it produces a separate VCF file for each sample. The CNVs in such VCF files look similar to this:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample1</span>
chrN 45894001 CNV_45894001_46949000 N <DEL>,<DUP> <span class="nb">.</span> <span class="nb">.</span> <span class="nv">END</span><span class="o">=</span>46949000 GT:CN:NP:QA:QS:QSE:QSS 0:2:1055:47:3077:78:119
chrN 46949001 CNV_46949001_46956000 N <DEL>,<DUP> <span class="nb">.</span> <span class="nb">.</span> <span class="nv">END</span><span class="o">=</span>46956000 GT:CN:NP:QA:QS:QSE:QSS 2:4:7:6:9:15:8
chrN 46956001 CNV_46956001_55222000 N <DEL>,<DUP> <span class="nb">.</span> <span class="nb">.</span> <span class="nv">END</span><span class="o">=</span>55222000 GT:CN:NP:QA:QS:QSE:QSS 0:2:8263:17:3077:108:19
chrN 55222001 CNV_55222001_55223000 N <DEL>,<DUP> <span class="nb">.</span> <span class="nb">.</span> <span class="nv">END</span><span class="o">=</span>55223000 GT:CN:NP:QA:QS:QSE:QSS 1:0:1:493:493:493:493
</code></pre></div></div>
<p>If you compare the CNVs from different samples, you will most likely find that the breakpoints are not the same across your samples. This makes it hard to match CNVs from different samples when estimating interpopulation differences along the genome.</p>
<div class="image">
<figure class="aligncenter">
<img src="/assets/posts/2019-09-10-populations-cnv-comparison/1-CNVs-across-samples.jpg" alt="CNVs across samples in IGV" />
<figcaption class="caption">Variation in CNV breakpoints across samples in IGV</figcaption>
</figure></div>
<p>To overcome this problem, we decided to bin each CNV segment according to the breakpoints it overlaps. This allowed us to merge the CNVs of all samples into one large table where the genomic coordinates are the same:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CHROM POS END s1 s2 s3 s4 s5 s6 s7 s8
chrN 46939001 46949000 2 2 2 2 2 2 2 2
chrN 46949001 46951000 4 5 1 4 5 1 4 3
chrN 46951001 46955000 4 5 1 4 5 1 4 3
chrN 46955001 46956000 4 5 1 4 5 1 4 3
chrN 46956001 46957000 2 2 2 2 2 2 2 2
</code></pre></div></div>
<p>Such a table can be used to estimate various statistics along the genome for different populations. For example, I estimated <em>Vst</em> between populations (it is like Fst, but for CNVs).</p>
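<p>To give a rough idea of what such a statistic looks like for one row of the table, here is a sketch in AWK. It is not the code used for this analysis. It assumes one commonly used definition, <em>Vst</em> = (<em>Vt</em> - <em>Vs</em>) / <em>Vt</em>, where <em>Vt</em> is the variance of copy number over all samples and <em>Vs</em> is the average within-population variance weighted by population size. It also assumes two populations, population (not sample) variance, and an input line that holds only copy numbers, with the first population's samples listed first. The names <code class="language-plaintext highlighter-rouge">vst_row</code> and <code class="language-plaintext highlighter-rouge">pvar</code> are hypothetical.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># usage: echo "copy numbers ..." | vst_row N_A
# N_A is the number of leading values that belong to the first population
vst_row() {
  awk -v na="$1" '
    # population variance of fields lo..hi (i, m, ss, n are local variables)
    function pvar(lo, hi,    i, m, ss, n) {
      n = hi - lo + 1
      for (i = lo; i <= hi; i++) m += $i
      m /= n
      for (i = lo; i <= hi; i++) ss += ($i - m) ^ 2
      return ss / n
    }
    {
      vt = pvar(1, NF)                                             # total variance
      vs = (na * pvar(1, na) + (NF - na) * pvar(na + 1, NF)) / NF  # weighted within-population variance
      print (vt > 0 ? (vt - vs) / vt : "NA")                       # undefined when nothing varies
    }'
}

echo '2 2 2 2 4 4 4 4' | vst_row 4   # populations fully separated
echo '2 4 2 4' | vst_row 2           # identical distributions
</code></pre></div></div>
<p>Under these assumptions the first call prints <code class="language-plaintext highlighter-rouge">1</code> and the second prints <code class="language-plaintext highlighter-rouge">0</code>; a bin where all samples share the same copy number gives <code class="language-plaintext highlighter-rouge">NA</code>.</p>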
<p>Let me show you everything step-by-step.</p>
<h2 id="visualize-the-cnv-variation">Visualize the CNV variation</h2>
<p>To make sure my CNV calls are good, I explored the CNV variation between my samples in <a href="https://software.broadinstitute.org/software/igv/" target="_blank">IGV</a>.</p>
<p>First, I extracted the CNV genotypes from the VCF files using GATK and converted the resulting tables into <code class="language-plaintext highlighter-rouge">seg</code> and <code class="language-plaintext highlighter-rouge">tab</code> formats using Snakemake:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Use Snakemake 4, GATK 4
</span>
<span class="n">CHROMOSOMES</span><span class="p">,</span> <span class="n">SAMPLES</span><span class="p">,</span> <span class="o">=</span> <span class="n">glob_wildcards</span><span class="p">(</span><span class="s">'chr{i}_{sample}_segments_cohort.vcf'</span><span class="p">)</span>
<span class="n">REF</span> <span class="o">=</span> <span class="s">'/path/to/reference.fa'</span>
<span class="n">rule</span> <span class="nb">all</span><span class="p">:</span>
    <span class="nb">input</span><span class="p">:</span>
        <span class="n">expand</span><span class="p">(</span><span class="s">'chr{i}_{sample}_segments_cohort.seg'</span><span class="p">,</span> <span class="n">sample</span><span class="o">=</span><span class="n">SAMPLES</span><span class="p">,</span> <span class="n">i</span><span class="o">=</span><span class="n">CHROMOSOMES</span><span class="p">)</span>
<span class="n">rule</span> <span class="n">toTable</span><span class="p">:</span>
    <span class="nb">input</span><span class="p">:</span>
        <span class="n">ref</span> <span class="o">=</span> <span class="n">REF</span><span class="p">,</span>
        <span class="n">vcf</span> <span class="o">=</span> <span class="s">'chr{i}_{sample}_segments_cohort.vcf'</span>
    <span class="n">output</span><span class="p">:</span>
        <span class="s">'chr{i}_{sample}_segments_cohort.table'</span>
    <span class="n">shell</span><span class="p">:</span>
        <span class="s">'''
        gatk --java-options "-Xmx8G" VariantsToTable </span><span class="se">\
</span><span class="s">            -R {input.ref} </span><span class="se">\
</span><span class="s">            -V {input.vcf} </span><span class="se">\
</span><span class="s">            -F ID -GF CN </span><span class="se">\
</span><span class="s">            -O {output}
        '''</span>
<span class="n">rule</span> <span class="n">toSeg</span><span class="p">:</span>
    <span class="nb">input</span><span class="p">:</span>
        <span class="s">'chr{i}_{sample}_segments_cohort.table'</span>
    <span class="n">output</span><span class="p">:</span>
        <span class="n">seg</span> <span class="o">=</span> <span class="s">'chr{i}_{sample}_segments_cohort.seg'</span><span class="p">,</span>
        <span class="n">tab</span> <span class="o">=</span> <span class="s">'chr{i}_{sample}_segments_cohort.tab'</span>
    <span class="n">params</span><span class="p">:</span>
        <span class="s">'{sample}'</span>
    <span class="n">shell</span><span class="p">:</span>
        <span class="s">'''
        sed 's/CNV_//g;s/_/</span><span class="se">\t</span><span class="s">/g' {input} | </span><span class="se">\
</span><span class="s">            awk -v s={params} 'BEGIN{{print "CHROM</span><span class="se">\\</span><span class="s">tPOS</span><span class="se">\\</span><span class="s">tEND</span><span class="se">\\</span><span class="s">t"s".CN"}} NR>1 {{print $0}}' </span><span class="se">\
</span><span class="s">            > {output.tab} && </span><span class="se">\
</span><span class="s">        sed 's/CNV_//g;s/_/</span><span class="se">\t</span><span class="s">/g' {input} | </span><span class="se">\
</span><span class="s">            awk -v s={params} 'BEGIN{{print s"</span><span class="se">\\</span><span class="s">tCHROM</span><span class="se">\\</span><span class="s">tPOS</span><span class="se">\\</span><span class="s">tEND</span><span class="se">\\</span><span class="s">t"s".CN"}} NR>1 {{print s"</span><span class="se">\\</span><span class="s">t"$0}}' </span><span class="se">\
</span><span class="s">            > {output.seg}
        '''</span>
</code></pre></div></div>
<p>This produced three files per sample. Here are examples of these files:</p>
<p><em>chrN_sample1_segments_cohort.table</em>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ID sample1.CN
CNV_chrN_45894001_46949000 2
CNV_chrN_46949001_46956000 4
CNV_chrN_46956001_55222000 2
CNV_chrN_55222001_55223000 0
</code></pre></div></div>
<p><em>chrN_sample1_segments_cohort.tab</em>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CHROM POS END sample1.CN
chrN 45894001 46949000 2
chrN 46949001 46956000 4
chrN 46956001 55222000 2
chrN 55222001 55223000 0
</code></pre></div></div>
<p><em>chrN_sample1_segments_cohort.seg</em>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sample1 CHROM POS END sample1.CN
sample1 chrN 45894001 46949000 2
sample1 chrN 46949001 46956000 4
sample1 chrN 46956001 55222000 2
sample1 chrN 55222001 55223000 0
</code></pre></div></div>
<p>Then, you load the <code class="language-plaintext highlighter-rouge">*.seg</code> files into IGV and you will obtain a picture similar to this one:</p>
<div class="image">
<figure class="aligncenter">
<img src="/assets/posts/2019-09-10-populations-cnv-comparison/2-CNV_in_IGV.jpg" alt="CNVs across samples in IGV" />
<figcaption class="caption">Visualizing CNVs in IGV</figcaption>
</figure></div>
<p>In IGV, red indicates duplications, blue marks deletions, and white depicts the diploid state. The intensity of the color corresponds to the number of gained or lost copies.</p>
<h2 id="bin-the-segments">Bin the segments</h2>
<p>To bin all the segments into the same set of segments across samples, I merged and sorted the coordinates from all samples:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cat </span>chrN_<span class="k">*</span>_segments_cohort.tab | <span class="nb">cut</span> <span class="nt">-f</span> 1,2,3 | <span class="nb">grep</span> <span class="nt">-v</span> POS | <span class="nb">sort</span> <span class="nt">-V</span> <span class="nt">-u</span> <span class="nt">-k</span> 2,2 <span class="nt">-k</span> 3,3 | <span class="nb">awk</span> <span class="s1">'BEGIN{print"CHROM\tPOS\tEND"}{print $0}'</span> <span class="o">></span> chrN_CNV_intervals.bed
</code></pre></div></div>
<p>And created the reference interval file in <em>R</em>:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">d</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read.table</span><span class="p">(</span><span class="s1">'chrN_CNV_intervals.bed'</span><span class="p">,</span><span class="w"> </span><span class="n">header</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">T</span><span class="p">)</span><span class="w">
</span><span class="n">breaks</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sort</span><span class="p">(</span><span class="n">unique</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="n">d</span><span class="o">$</span><span class="n">POS</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="n">d</span><span class="o">$</span><span class="n">END</span><span class="p">)))</span><span class="w">
</span><span class="n">bins</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">CHROM</span><span class="o">=</span><span class="nf">rep</span><span class="p">(</span><span class="s1">'chrN'</span><span class="p">,</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">breaks</span><span class="p">[</span><span class="m">-1</span><span class="p">])),</span><span class="w"> </span><span class="n">POS</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">head</span><span class="p">(</span><span class="n">breaks</span><span class="p">,</span><span class="w"> </span><span class="m">-1</span><span class="p">)</span><span class="m">+1</span><span class="p">,</span><span class="w"> </span><span class="n">END</span><span class="o">=</span><span class="n">breaks</span><span class="p">[</span><span class="m">-1</span><span class="p">])</span><span class="w">
</span><span class="n">options</span><span class="p">(</span><span class="n">scipen</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">999</span><span class="p">)</span><span class="w"> </span><span class="c1"># disables scientific notation</span><span class="w">
</span><span class="n">write.table</span><span class="p">(</span><span class="n">bins</span><span class="p">,</span><span class="w"> </span><span class="s1">'chrN_CNV_intervals_bins.bed'</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="o">=</span><span class="s1">'\t'</span><span class="p">,</span><span class="w"> </span><span class="n">quote</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">,</span><span class="w"> </span><span class="n">row.names</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
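<p>The same binning logic can be sketched in Python. This is only an illustration of the idea, not part of the original pipeline: the function name <code class="language-plaintext highlighter-rouge">make_bins</code> and the toy coordinates are assumptions. Every segment start (minus one) and every segment end becomes a breakpoint, and each pair of neighbouring breakpoints forms one bin.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def make_bins(intervals):
    """Split 1-based closed intervals into non-overlapping bins at every breakpoint."""
    # Collect start-1 and end of every interval as sorted, unique breakpoints.
    starts = {start - 1 for start, _ in intervals}
    ends = {end for _, end in intervals}
    breaks = sorted(starts | ends)
    # Each pair of neighbouring breakpoints (left, right) gives the bin left+1..right.
    return [(left + 1, right) for left, right in zip(breaks, breaks[1:])]


# Toy segments from two hypothetical samples with different breakpoints.
intervals = [(45894001, 46949000), (46949001, 46956000), (46939001, 46951000)]
for pos, end in make_bins(intervals):
    print("chrN", pos, end, sep="\t")
</code></pre></div></div>
<p>As in the R code, the bins tile the whole range between the smallest start and the largest end, so a gap between segments would also become a bin.</p>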
<p>You can visualize the original interval list and bins with this <em>R</em> code:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plot_intervals</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">df</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">lines</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="nf">length</span><span class="p">(</span><span class="n">df</span><span class="o">$</span><span class="n">CHROM</span><span class="p">))</span><span class="w">
</span><span class="n">nLines</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">lines</span><span class="p">)</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">xlim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="nf">min</span><span class="p">(</span><span class="n">df</span><span class="o">$</span><span class="n">POS</span><span class="p">),</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">df</span><span class="o">$</span><span class="n">END</span><span class="p">)),</span><span class="w"> </span><span class="n">ylim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">nLines</span><span class="p">),</span><span class="w"> </span><span class="n">type</span><span class="o">=</span><span class="s2">"n"</span><span class="p">,</span><span class="w">
</span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Merged intervals"</span><span class="p">,</span><span class="w"> </span><span class="n">xlab</span><span class="o">=</span><span class="s2">""</span><span class="p">,</span><span class="w"> </span><span class="n">ylab</span><span class="o">=</span><span class="s2">"Interval number"</span><span class="p">)</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">lines</span><span class="p">){</span><span class="w">
</span><span class="n">segments</span><span class="p">(</span><span class="n">df</span><span class="o">$</span><span class="n">END</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="n">nLines</span><span class="m">+1</span><span class="o">-</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="n">df</span><span class="o">$</span><span class="n">POS</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="n">nLines</span><span class="m">+1</span><span class="o">-</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="o">=</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"gray"</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">jpeg</span><span class="p">(</span><span class="s1">'chrN_CNV_intervals_bins.jpeg'</span><span class="p">,</span><span class="w"> </span><span class="n">width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">740</span><span class="p">,</span><span class="w"> </span><span class="n">height</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">600</span><span class="p">)</span><span class="w">
</span><span class="n">par</span><span class="p">(</span><span class="n">mar</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="m">4</span><span class="p">,</span><span class="m">4</span><span class="p">,</span><span class="m">2</span><span class="p">,</span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">mfrow</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">cex</span><span class="o">=</span><span class="m">1.1</span><span class="p">)</span><span class="w">
</span><span class="n">plot_intervals</span><span class="p">(</span><span class="n">d</span><span class="p">)</span><span class="w">
</span><span class="n">plot_intervals</span><span class="p">(</span><span class="n">bins</span><span class="p">)</span><span class="w">
</span><span class="n">dev.off</span><span class="p">()</span><span class="w">
</span></code></pre></div></div>
<p><img src="/assets/posts/2019-09-10-populations-cnv-comparison/3-CNV_intervals_bins.jpeg" alt="Plots of original intervals and bins" /></p>
<h2 id="merge-all-cnv-files">Merge all CNV files</h2>
<p>Then, I used <a href="https://github.com/evodify/genotype-files-manipulations/blob/master/merge_CNVs_tabs.py" target="_blank">merge_CNVs_tabs.py</a> to merge all CNV files with <em>chrN_CNV_intervals_bins.bed</em>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for </span>i <span class="k">in</span> <span class="k">*</span>_segments_cohort.tab
<span class="k">do
</span>python ~/git/genotype-files-manipulations/merge_CNVs_tabs.py <span class="nt">-i</span> <span class="nv">$i</span> <span class="nt">-r</span> chrN_CNV_intervals_bins.bed <span class="nt">-o</span> <span class="nv">$i</span>.bin <span class="o">&&</span> <span class="nb">cut</span> <span class="nt">-f</span> 4 <span class="nv">$i</span>.bin <span class="o">></span> <span class="nv">$i</span>.bin.col4
<span class="k">done
</span><span class="nb">paste </span>sample1_segments_cohort.tab.bin tab/<span class="k">*</span>.bin.col4 <span class="o">></span> chrN_segments_cohort_bins.tab
</code></pre></div></div>
<p>The resulting file has the following format:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CHROM POS END s1 s2 s3 s4 s5 s6 s7 s8
chrN 46939001 46949000 2 2 2 2 2 2 2 2
chrN 46949001 46951000 4 5 1 4 5 1 4 3
chrN 46951001 46955000 4 5 1 4 5 1 4 3
chrN 46955001 46956000 4 5 1 4 5 1 4 3
chrN 46956001 46957000 2 2 2 2 2 2 2 2
</code></pre></div></div>
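<p>To make the merging step more concrete, here is a minimal Python sketch of how one sample's segments could be projected onto the common bins. It is not the code of <em>merge_CNVs_tabs.py</em>: the function name <code class="language-plaintext highlighter-rouge">copy_number_per_bin</code> is hypothetical, and writing <code class="language-plaintext highlighter-rouge">NA</code> for bins that no segment covers is my assumption.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def copy_number_per_bin(bins, segments):
    """Return one copy number per bin, taken from the segment that contains the bin."""
    values = []
    for pos, end in bins:
        # A bin never crosses a segment boundary, so it lies inside at most one
        # segment. Membership in a range of integers is a constant-time check.
        covering = [
            cn
            for seg_pos, seg_end, cn in segments
            if pos in range(seg_pos, seg_end + 1) and end in range(seg_pos, seg_end + 1)
        ]
        # Assumption: bins without a covering segment are reported as "NA".
        values.append(covering[0] if covering else "NA")
    return values


bins = [(46939001, 46949000), (46949001, 46951000), (46951001, 46956000)]
sample1 = [(45894001, 46949000, 2), (46949001, 46956000, 4)]
print(copy_number_per_bin(bins, sample1))
</code></pre></div></div>
<p>Running this for every sample and pasting the resulting columns side by side gives a table of the kind shown above.</p>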
<h2 id="calculate-vst">Calculate Vst</h2>
<p>The resulting <em>chrN_segments_cohort_bins.tab</em> can be used to calculate various statistics. For example, you can calculate <em>Vst</em> in <em>R</em>:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">d</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read.table</span><span class="p">(</span><span class="s1">'chrN_segments_cohort_bins.tab'</span><span class="p">,</span><span class="w"> </span><span class="n">header</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">T</span><span class="p">)</span><span class="w">
</span><span class="n">dd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">d</span><span class="p">[,</span><span class="o">-</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">3</span><span class="p">)]</span><span class="w">
</span><span class="n">group</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="s2">"red"</span><span class="p">,</span><span class="w"> </span><span class="s2">"black"</span><span class="p">,</span><span class="w"> </span><span class="s2">"blue"</span><span class="p">,</span><span class="w"> </span><span class="s2">"red"</span><span class="p">,</span><span class="w"> </span><span class="s2">"black"</span><span class="p">,</span><span class="w"> </span><span class="s2">"blue"</span><span class="p">,</span><span class="w"> </span><span class="s2">"red"</span><span class="p">,</span><span class="w"> </span><span class="s2">"blue"</span><span class="p">))</span><span class="w">
</span><span class="n">getVst</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">dat</span><span class="p">,</span><span class="w"> </span><span class="n">groups</span><span class="p">,</span><span class="w"> </span><span class="n">comparison</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">groupLevels</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">levels</span><span class="p">(</span><span class="n">groups</span><span class="p">)</span><span class="w">
</span><span class="n">dat1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">na.omit</span><span class="p">(</span><span class="n">dat</span><span class="p">[</span><span class="n">groups</span><span class="o">==</span><span class="n">groupLevels</span><span class="p">[</span><span class="n">groupLevels</span><span class="o">==</span><span class="n">comparison</span><span class="p">[</span><span class="m">1</span><span class="p">]]])</span><span class="w">
</span><span class="n">dat2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">na.omit</span><span class="p">(</span><span class="n">dat</span><span class="p">[</span><span class="n">groups</span><span class="o">==</span><span class="n">groupLevels</span><span class="p">[</span><span class="n">groupLevels</span><span class="o">==</span><span class="n">comparison</span><span class="p">[</span><span class="m">2</span><span class="p">]]])</span><span class="w">
</span><span class="n">Vtotal</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">var</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="n">dat1</span><span class="p">,</span><span class="w"> </span><span class="n">dat2</span><span class="p">))</span><span class="w">
</span><span class="n">Vgroup</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="p">((</span><span class="n">var</span><span class="p">(</span><span class="n">dat1</span><span class="p">)</span><span class="o">*</span><span class="nf">length</span><span class="p">(</span><span class="n">dat1</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="n">var</span><span class="p">(</span><span class="n">dat2</span><span class="p">)</span><span class="o">*</span><span class="nf">length</span><span class="p">(</span><span class="n">dat2</span><span class="p">)))</span><span class="w"> </span><span class="o">/</span><span class="w">
</span><span class="p">(</span><span class="nf">length</span><span class="p">(</span><span class="n">dat1</span><span class="p">)</span><span class="o">+</span><span class="nf">length</span><span class="p">(</span><span class="n">dat2</span><span class="p">))</span><span class="w">
</span><span class="n">Vst</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">((</span><span class="n">Vtotal</span><span class="o">-</span><span class="n">Vgroup</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">Vtotal</span><span class="p">)</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="nf">is.nan</span><span class="p">(</span><span class="n">Vst</span><span class="p">)){</span><span class="w">
</span><span class="n">Vst</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">0</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="n">Vst</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">d</span><span class="o">$</span><span class="n">Vst_red_black</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">apply</span><span class="p">(</span><span class="n">dd</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="n">getVst</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"red"</span><span class="p">,</span><span class="w"> </span><span class="s2">"black"</span><span class="p">)))</span><span class="w">
</span><span class="n">d</span><span class="o">$</span><span class="n">Vst_red_blue</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">apply</span><span class="p">(</span><span class="n">dd</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="n">getVst</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"red"</span><span class="p">,</span><span class="w"> </span><span class="s2">"blue"</span><span class="p">)))</span><span class="w">
</span><span class="n">d</span><span class="o">$</span><span class="n">Vst_blue_black</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">apply</span><span class="p">(</span><span class="n">dd</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="n">getVst</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"blue"</span><span class="p">,</span><span class="w"> </span><span class="s2">"black"</span><span class="p">)))</span><span class="w">
</span><span class="n">write.table</span><span class="p">(</span><span class="n">d</span><span class="p">,</span><span class="w"> </span><span class="s1">'chrN_segments_cohort_bins_Vst.csv'</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="o">=</span><span class="s1">'\t'</span><span class="p">,</span><span class="w"> </span><span class="n">quote</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">,</span><span class="w"> </span><span class="n">row.names</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>The table <em>chrN_segments_cohort_bins_Vst.csv</em> will look like this:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CHROM POS END s1 s2 s3 s4 s5 s6 s7 s8 Vst_red_black Vst_red_blue Vst_blue_black
chrN 46939001 46949000 2 2 2 2 2 2 2 2 0 0.0000000 0.0
chrN 46949001 46951000 4 5 1 4 5 1 4 3 1 0.6923077 0.8
chrN 46951001 46955000 4 5 1 4 5 1 4 3 1 0.6923077 0.8
chrN 46955001 46956000 4 5 1 4 5 1 4 3 1 0.6923077 0.8
chrN 46956001 46957000 2 2 2 2 2 2 2 2 0 0.0000000 0.0
</code></pre></div></div>
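<p>The values in this table can be checked by hand. Below is a small Python sketch of the same calculation for a single bin, assuming the formula used in <code class="language-plaintext highlighter-rouge">getVst</code>: <em>Vst</em> = (<em>Vtotal</em> - <em>Vgroup</em>) / <em>Vtotal</em>, where <em>Vgroup</em> is the size-weighted mean of the within-group variances. The function name <code class="language-plaintext highlighter-rouge">vst</code> is hypothetical. <code class="language-plaintext highlighter-rouge">statistics.variance</code> is the sample variance, like <code class="language-plaintext highlighter-rouge">var()</code> in R, so each group needs at least two samples.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from statistics import variance


def vst(copy_numbers, groups, comparison):
    """Vst for one bin between the two groups named in `comparison`."""
    first = [cn for cn, g in zip(copy_numbers, groups) if g == comparison[0]]
    second = [cn for cn, g in zip(copy_numbers, groups) if g == comparison[1]]
    v_total = variance(first + second)
    if v_total == 0:
        # No variation at all in this bin: report 0 instead of dividing by zero.
        return 0.0
    # Within-group variance, weighted by the number of samples in each group.
    v_group = (variance(first) * len(first) + variance(second) * len(second)) / (
        len(first) + len(second)
    )
    return (v_total - v_group) / v_total


groups = ["red", "black", "blue", "red", "black", "blue", "red", "blue"]
row = [4, 5, 1, 4, 5, 1, 4, 3]  # copy numbers of s1..s8 in the second bin
print(round(vst(row, groups, ("red", "blue")), 7))  # 0.6923077, as in the table
</code></pre></div></div>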
<p>Exploring the distribution of <em>Vst</em> can identify genomic regions of high divergence:</p>
<p><img src="/assets/posts/2019-09-10-populations-cnv-comparison/4-segments_cohort_bins_Vst.jpeg" alt="Example of Vst distribution along the genome" /></p>
<h2 id="final-thought">Final thought</h2>
<p>You can use the CNV table <em>chrN_segments_cohort_bins.tab</em> to calculate many other things.</p>
<p>We found binning the CNV segments to be the most parsimonious way to merge all samples into one table. If there is a better way to handle the variation in CNV breakpoints for interpopulation comparison, please <a href="mailto:dmytro.kryvokhyzha@evobio.eu">let me know</a>.</p>
<p><em>If you have any questions or suggestions, feel free to <a href="mailto:dmytro.kryvokhyzha@evobio.eu">email me</a></em>.</p>Guide on how to process Copy Number Variants generated with GATK 4 to compare samples from different populationsThe best free Research Data Repository2019-08-27T00:00:00+00:002019-08-27T00:00:00+00:00https://evodify.com/free-research-repository<p>You need to deposit your research data to a repository and you are lost in options. I have been in the same situation recently.</p>
<p>If your data is of a specific type, the choice is obvious: you deposit it to a data-type specific repository. For example, nucleic acid sequence data need to be uploaded to the <a href="https://www.ncbi.nlm.nih.gov/sra">Sequence Read Archive (SRA)</a>. Scripts and programs should be deposited to <a href="https://github.com/evodify">GitHub</a> or a similar resource with a <a href="https://git-scm.com/book/en/v1/Getting-Started-About-Version-Control">version control system</a>. Usually, you should do your best to use these repositories because this will increase the chance that other researchers find your data. Here is an extensive <a href="https://www.nature.com/sdata/policies/repositories">list of data-type specific repositories</a>.</p>
<p>But if you also have some non-standard data formats, you need to use a generalist repository. The most popular ones are Dryad, FigShare, and Zenodo. These were the repositories I found first. Later, I also discovered the Open Science Framework (OSF) and it became my number one research data repository.</p>
<p>My key criteria when I was looking for the best repository for my scientific data were:</p>
<ul>
<li>Free</li>
<li>DOI</li>
<li>Ability to update files</li>
<li>Directory structure</li>
</ul>
<p>Publishing in open-access journals already costs a fortune, so I wanted to use a free repository to avoid additional spending. A digital object identifier (DOI) is probably a must for any publication. It is especially useful if you publish a dataset without a link to any paper, because a DOI makes it easier to cite the dataset. I also wanted an option to edit or update the data after the initial deposit. Mistakes are always possible and it is better to be able to correct them. The amount of data grows enormously, and my projects usually have many files structured in directories. I would like to keep this directory structure in my repositories too. The OSF repository meets these requirements best.</p>
<p>Let me briefly summarize my opinion on each of the repositories I tried.</p>
<h2 id="dryad">Dryad</h2>
<p><img src="/assets/posts/2019-08-28-free-research-repository/1-Dryad.jpg" alt="Dryad research data repository" /></p>
<p>Dryad is the most popular research data repository. It is recommended by many journals. I used it to publish <a href="https://doi.org/10.5061/dryad.q83pt">the supplementary data for my Molecular Ecology paper</a>. By publishing in Molecular Ecology, you get a link to deposit your data to Dryad for free.</p>
<p>However, it is not a free repository. You need to pay <strong>$120</strong> for a submission of up to <strong>20GB</strong>, and <strong>+$50</strong> for each additional <strong>10GB</strong>. On the other hand, such a business model guarantees the long-term existence of this repository.</p>
<p>I like it for its simple and easy-to-use interface. Uploading the data is very simple and fast. You get a DOI for your data and some simple metrics such as the number of page views and downloads. But you cannot edit anything after the submission. There is no directory structure support, so you can upload a directory only as an archive file.</p>
<p>Pros:</p>
<ul>
<li>popular</li>
<li>simple</li>
<li>DOI</li>
<li>metrics</li>
</ul>
<p>Cons:</p>
<ul>
<li>non-free</li>
<li>no edit/update after the submission</li>
<li>no directory structure support</li>
<li>not optimized for downloading many files at once</li>
</ul>
<h2 id="figshare">FigShare</h2>
<p><img src="/assets/posts/2019-08-28-free-research-repository/2-FigShare.jpg" alt="FigShare research results repository" /></p>
<p>FigShare is a great repository for visual content. It shows a <strong>preview</strong> of every file. If I recall correctly, this was the initial purpose of FigShare. Now, you can also use FigShare to upload files of any type.</p>
<p>There is <strong>no limit on file size</strong> if you make the files public. You can modify your files after the publication with a version control system.</p>
<p>I think FigShare should be used <strong>only to share posters, slides, and figures</strong>. It is not convenient for sharing dozens of files. You can use collections and projects to group many files, but there is no easy way to download many files at once. The interface of the repository is also not simple: you often need to navigate several windows to access a file.</p>
<p>Pros:</p>
<ul>
<li>popular</li>
<li>free</li>
<li>DOI</li>
<li>unlimited space</li>
<li>image preview</li>
</ul>
<p>Cons:</p>
<ul>
<li>optimized only for single visual file sharing</li>
<li>complicated to use</li>
<li>no directory structure support</li>
<li>not optimized for downloading many files at once</li>
</ul>
<h2 id="zenodo">Zenodo</h2>
<p><img src="/assets/posts/2019-08-28-free-research-repository/3-Zenodo.jpg" alt="Zenodo research data repository" /></p>
<p>Zenodo is <strong>good in many regards</strong>. It is free. There is a version control system. A DOI is provided. You can track page views and downloads.</p>
<p>The file size limit is 50GB per dataset but you can have an unlimited number of datasets.</p>
<p>However, you <strong>cannot create folders</strong> with files. You can upload each folder as a separate dataset or compress each folder into an archive and upload it. But this is not an ideal solution.</p>
<p>Pros:</p>
<ul>
<li>popular</li>
<li>free</li>
<li>DOI</li>
<li>simple interface</li>
<li>version control system</li>
</ul>
<p>Cons:</p>
<ul>
<li>no directory structure support</li>
<li>not optimized for downloading many files at once</li>
<li>50GB limit per dataset</li>
</ul>
<h2 id="open-science-framework">Open Science Framework</h2>
<p><img src="/assets/posts/2019-08-28-free-research-repository/4-OSF.jpg" alt="Open Science Framework repository" /></p>
<p>OSF is <strong>my favorite repository</strong> to store my research data. It is surprisingly <strong>not very popular</strong>. It took a while until I found it. I believe its popularity will grow as it is an amazing repository for scientific data.</p>
<p>OSF is free. You get a DOI for your repository. There is a version control system. It <strong>supports directory structure</strong> in repositories. You can update your files after the publication and the history of the repository is tracked.</p>
<p>The default file size limit is 5 GB. But you can extend this limit with <a href="https://help.osf.io/hc/en-us/articles/360019737894-FAQs#what-is-the-cap-on-data-per-user-or-per-project" target="_blank">add-ons</a>.</p>
<p>The OSF <strong>interface is more advanced</strong> than in other repositories. I consider it an advantage. But it is a little too advanced and some users may find it difficult to use. So, I will still list it in the cons.</p>
<p>Pros:</p>
<ul>
<li>free</li>
<li>DOI</li>
<li>version control system</li>
<li>supports directory structure</li>
<li>optimized for downloading many files at once</li>
</ul>
<p>Cons:</p>
<ul>
<li>not popular</li>
<li>advanced interface</li>
<li>5GB limit per file (no number of files limit)</li>
</ul>
<p>I have not explored the funding of other repositories but OSF is secured by <strong>funding for 50+</strong> years. The chance it will disappear is very small.</p>
<h2 id="mendeley">Mendeley</h2>
<p><img src="/assets/posts/2019-08-28-free-research-repository/5-Mendeley-data.jpg" alt="Mendeley repository for scientific data" /></p>
<p>Mendeley is known as a digital library app with great reference tools. Recently, it also launched the Mendeley Data service.
I found out about this Mendeley Data repository while writing this blog post.</p>
<p>It is a simple repository. <strong>If you already use Mendeley</strong> and you do not want to bother with other options, go ahead and use Mendeley Data.</p>
<p>You can see its pros and cons below. I would only like to emphasize that there is a moderation step before your data is published. So, be ready to wait some time before your data becomes public.</p>
<p>Pros:</p>
<ul>
<li>popular</li>
<li>simple</li>
<li>DOI</li>
<li>supports directory structure</li>
<li>optimized for downloading all files at once</li>
</ul>
<p>Cons:</p>
<ul>
<li>no version control system</li>
<li>moderation</li>
<li>10 GB per dataset</li>
</ul>
<h2 id="summary">Summary</h2>
<p>This is not a comprehensive review. I evaluated these repositories only against my own requirements. For example, you may need to check the funding of free repositories to make sure they won’t disappear soon. I also did not pay attention to the license types these repositories support because I usually release my data into the public domain anyway.</p>
<p>If you think there is something crucial I missed, please <a href="mailto:dmytro.kryvokhyzha@evobio.eu">let me know</a> and I will add it.</p>I compare the most popular repositories for research data: Dryad, Zenodo, FigShare, Open Science Framework, and Mendeley.Snakemake checkpoint tutorial2019-08-16T00:00:00+00:002019-08-16T00:00:00+00:00https://evodify.com/snakemake-checkpoint-tutorial<p>If you want to use Snakemake to run some programs that output an unknown number of files, you need to tell Snakemake about that. If you use Snakemake 4, you can do that by marking the output with <code class="language-plaintext highlighter-rouge">dynamic()</code>. If you upgraded to Snakemake 5, you should use <code class="language-plaintext highlighter-rouge">checkpoint</code> instead. Using <code class="language-plaintext highlighter-rouge">dynamic()</code> will work in Snakemake 5, but you will see a message saying that dynamic output is deprecated and will be fully replaced by checkpoints in Snakemake 6.</p>
<p>This post shows how to use both <code class="language-plaintext highlighter-rouge">dynamic()</code> and <code class="language-plaintext highlighter-rouge">checkpoint</code>.</p>
<p>You should probably focus on <code class="language-plaintext highlighter-rouge">checkpoint</code> because it is the more up-to-date solution. But <code class="language-plaintext highlighter-rouge">checkpoint</code> may not always work correctly. For example, I tested it with the GATK <code class="language-plaintext highlighter-rouge">IntervalListTools</code> and <a href="https://stackoverflow.com/questions/57432036/snakemake-checkpoint-exited-with-non-zero-exit-code" target="_blank">it did not work correctly</a>, while <code class="language-plaintext highlighter-rouge">dynamic()</code> worked fine with exactly the same command. Thus, knowing both approaches can be helpful.</p>
<h2 id="checkpoint">Checkpoint</h2>
<p>The <a href="https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#data-dependent-conditional-execution" target="_blank">checkpoint</a> feature was introduced in Snakemake 5 and it will completely replace <code class="language-plaintext highlighter-rouge">dynamic()</code> in Snakemake 6. So, if you have not tried it, it is time to learn it.</p>
<p>Here is a toy pipeline that shows how <code class="language-plaintext highlighter-rouge">checkpoint</code> works:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">rule</span> <span class="n">final_output</span><span class="p">:</span>
    <span class="nb">input</span><span class="p">:</span>
        <span class="s">'scatter_copy_head_collect/all.txt'</span>

<span class="c1"># generate a random number of files (1 to 10; at least one, so the final cat always has input)
</span><span class="n">checkpoint</span> <span class="n">scatter</span><span class="p">:</span>
    <span class="n">output</span><span class="p">:</span>
        <span class="n">directory</span><span class="p">(</span><span class="s">'scatter'</span><span class="p">)</span>
    <span class="n">shell</span><span class="p">:</span>
        <span class="s">'''
        mkdir -p {output}
        N=$(( RANDOM % 10 + 1 ))
        for j in $(seq 1 $N); do echo -n $j > {output}/$j.txt; done
        '''</span>

<span class="c1"># process this unknown number of files
</span><span class="n">rule</span> <span class="n">scatter_copy</span><span class="p">:</span>
    <span class="n">output</span><span class="p">:</span>
        <span class="n">txt</span> <span class="o">=</span> <span class="s">'scatter_copy/{i}_copy.txt'</span><span class="p">,</span>
    <span class="nb">input</span><span class="p">:</span>
        <span class="n">txt</span> <span class="o">=</span> <span class="s">'scatter/{i}.txt'</span><span class="p">,</span>
    <span class="n">shell</span><span class="p">:</span>
        <span class="s">'''
        cp -f {input.txt} {output.txt}
        echo -n "_copy" >> {output.txt}
        '''</span>

<span class="c1"># process scatter_copy output
</span><span class="n">rule</span> <span class="n">scatter_copy_head</span><span class="p">:</span>
    <span class="n">output</span><span class="p">:</span>
        <span class="n">txt</span> <span class="o">=</span> <span class="s">'scatter_copy_head/{i}_head.txt'</span><span class="p">,</span>
    <span class="nb">input</span><span class="p">:</span>
        <span class="n">txt</span> <span class="o">=</span> <span class="s">'scatter_copy/{i}_copy.txt'</span><span class="p">,</span>
    <span class="n">shell</span><span class="p">:</span>
        <span class="s">'''
        cp -f {input.txt} {output.txt}
        echo "_head" >> {output.txt}
        '''</span>

<span class="c1"># collect the results of processing the unknown number of files
# and merge them together into one file:
</span>
<span class="k">def</span> <span class="nf">aggregate_input</span><span class="p">(</span><span class="n">wildcards</span><span class="p">):</span>
    <span class="s">'''
    aggregate the file names of the random number of files
    generated at the scatter step
    '''</span>
    <span class="n">checkpoint_output</span> <span class="o">=</span> <span class="n">checkpoints</span><span class="p">.</span><span class="n">scatter</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="o">**</span><span class="n">wildcards</span><span class="p">).</span><span class="n">output</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
    <span class="k">return</span> <span class="n">expand</span><span class="p">(</span><span class="s">'scatter_copy_head/{i}_head.txt'</span><span class="p">,</span>
                  <span class="n">i</span><span class="o">=</span><span class="n">glob_wildcards</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">checkpoint_output</span><span class="p">,</span> <span class="s">'{i}.txt'</span><span class="p">)).</span><span class="n">i</span><span class="p">)</span>

<span class="n">rule</span> <span class="n">scatter_copy_head_collect</span><span class="p">:</span>
    <span class="n">output</span><span class="p">:</span>
        <span class="n">combined</span> <span class="o">=</span> <span class="s">'scatter_copy_head_collect/all.txt'</span><span class="p">,</span>
    <span class="nb">input</span><span class="p">:</span>
        <span class="n">aggregate_input</span>
    <span class="n">shell</span><span class="p">:</span>
        <span class="s">'''
        cat {input} > {output.combined}
        '''</span>
</code></pre></div></div>
<p>Explore the outputs to understand how this pipeline works:</p>
<p><img src="/assets/posts/2019-08-16-snakemake-checkpoint-tutorial/snakemake-checkpoint-output.jpg" alt="unknown output files of snakemake with checkpoint rule" /></p>
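<p>The part of this pipeline that confused me most at first was <code class="language-plaintext highlighter-rouge">aggregate_input</code>. The sketch below imitates its logic in plain Python, without Snakemake: it looks into a directory, extracts the <code class="language-plaintext highlighter-rouge">{i}</code> part of every <code class="language-plaintext highlighter-rouge">{i}.txt</code> file, and builds the list of downstream targets. The function name <code class="language-plaintext highlighter-rouge">aggregate_targets</code> is hypothetical, and the real <code class="language-plaintext highlighter-rouge">glob_wildcards</code> and <code class="language-plaintext highlighter-rouge">expand</code> are more general than this.</p>

```python
import os
import re
import tempfile


def aggregate_targets(scatter_dir, target_pattern="scatter_copy_head/{i}_head.txt"):
    """Return one downstream target per '<i>.txt' file found in scatter_dir.

    This only imitates what glob_wildcards() + expand() do in the Snakefile;
    it is an illustration, not Snakemake's implementation.
    """
    ids = []
    for name in sorted(os.listdir(scatter_dir)):
        match = re.fullmatch(r"(.+)\.txt", name)
        if match:
            ids.append(match.group(1))  # the value of the {i} wildcard
    return [target_pattern.format(i=i) for i in ids]


if __name__ == "__main__":
    # Fake the output of the scatter checkpoint with three files.
    with tempfile.TemporaryDirectory() as tmp:
        for i in (1, 2, 3):
            open(os.path.join(tmp, f"{i}.txt"), "w").close()
        for target in aggregate_targets(tmp):
            print(target)
```

<p>The key point is that the directory can only be listed after the checkpoint has finished. This is why Snakemake has to re-evaluate the workflow once the <code class="language-plaintext highlighter-rouge">scatter</code> step is done.</p>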
<h2 id="dynamic">Dynamic</h2>
<p><a href="https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#dynamic-files" target="_blank">Dynamic output</a> is an outdated approach but it seems to be more stable and reliable at the moment. So, if you experience some problems with <code class="language-plaintext highlighter-rouge">checkpoint</code>, in most cases, you can write the same pipeline with <code class="language-plaintext highlighter-rouge">dynamic()</code>.</p>
<p>This is the same pipeline as above but it utilizes <code class="language-plaintext highlighter-rouge">dynamic()</code> instead of <code class="language-plaintext highlighter-rouge">checkpoint</code>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">rule</span> <span class="n">final_output</span><span class="p">:</span>
    <span class="nb">input</span><span class="p">:</span>
        <span class="s">'scatter_copy_head_collect/all.txt'</span>

<span class="c1"># this was a checkpoint step above (1 to 10 files, so there is always at least one):
</span><span class="n">rule</span> <span class="n">scatter</span><span class="p">:</span>
    <span class="n">output</span><span class="p">:</span>
        <span class="n">dynamic</span><span class="p">(</span><span class="s">'scatter/{i}.txt'</span><span class="p">)</span>
    <span class="n">shell</span><span class="p">:</span>
        <span class="s">'''
        N=$(( RANDOM % 10 + 1 ))
        for j in $(seq 1 $N); do echo -n $j > scatter/$j.txt; done
        '''</span>

<span class="c1"># this rule is not different from checkpoint
</span><span class="n">rule</span> <span class="n">scatter_copy</span><span class="p">:</span>
    <span class="n">output</span><span class="p">:</span>
        <span class="n">txt</span> <span class="o">=</span> <span class="s">'scatter_copy/{i}_copy.txt'</span><span class="p">,</span>
    <span class="nb">input</span><span class="p">:</span>
        <span class="n">txt</span> <span class="o">=</span> <span class="s">'scatter/{i}.txt'</span><span class="p">,</span>
    <span class="n">shell</span><span class="p">:</span>
        <span class="s">'''
        cp -f {input.txt} {output.txt}
        echo -n "_copy" >> {output.txt}
        '''</span>

<span class="c1"># this rule is not different from checkpoint either:
</span><span class="n">rule</span> <span class="n">scatter_copy_head</span><span class="p">:</span>
    <span class="n">output</span><span class="p">:</span>
        <span class="n">txt</span> <span class="o">=</span> <span class="s">'scatter_copy_head/{i}_head.txt'</span><span class="p">,</span>
    <span class="nb">input</span><span class="p">:</span>
        <span class="n">txt</span> <span class="o">=</span> <span class="s">'scatter_copy/{i}_copy.txt'</span><span class="p">,</span>
    <span class="n">shell</span><span class="p">:</span>
        <span class="s">'''
        cp -f {input.txt} {output.txt}
        echo "_head" >> {output.txt}
        '''</span>

<span class="c1"># to collect all files, you need to tell Snakemake that input is dynamic:
</span><span class="n">rule</span> <span class="n">scatter_copy_head_collect</span><span class="p">:</span>
    <span class="n">output</span><span class="p">:</span>
        <span class="n">combined</span> <span class="o">=</span> <span class="s">'scatter_copy_head_collect/all.txt'</span><span class="p">,</span>
    <span class="nb">input</span><span class="p">:</span>
        <span class="n">indivfiles</span> <span class="o">=</span> <span class="n">dynamic</span><span class="p">(</span><span class="s">'scatter_copy_head/{i}_head.txt'</span><span class="p">)</span>
    <span class="n">params</span><span class="p">:</span>
        <span class="n">gathered</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">wildcards</span><span class="p">,</span> <span class="nb">input</span><span class="p">:</span> <span class="s">' '</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="nb">input</span><span class="p">.</span><span class="n">indivfiles</span><span class="p">)</span>
    <span class="n">shell</span><span class="p">:</span>
        <span class="s">'''
        cat {params.gathered} > {output.combined}
        '''</span>
</code></pre></div></div>
<h2 id="final-thoughts">Final thoughts</h2>
<p>Checkpoints are claimed to be more powerful than <code class="language-plaintext highlighter-rouge">dynamic()</code> by the Snakemake developers. I believe they are right but my impression is that <code class="language-plaintext highlighter-rouge">dynamic()</code> is easier to use. Maybe I have not fully comprehended <code class="language-plaintext highlighter-rouge">checkpoint</code> yet.</p>
<p>Besides, as I mentioned above, I was not able to make it work with GATK. So, I will try to use <code class="language-plaintext highlighter-rouge">checkpoint</code> but I may also fall back to <code class="language-plaintext highlighter-rouge">dynamic()</code>.</p>
<p>Finally, I would like to acknowledge this <a href="https://stackoverflow.com/a/56451259/2317701" target="_blank">Stack Overflow answer</a> that inspired me to write this tutorial.</p>This Snakemake checkpoint tutorial shows how to run Snakemake when the number of outputs is dynamic, e.g. file names are unknown until the rule is executed.Call germline Copy Number Variants with GATK in Snakemake2019-08-15T00:00:00+00:002019-08-15T00:00:00+00:00https://evodify.com/gatk-cnv-snakemake<p>I needed to call copy number variants (CNVs) in my dog dataset. I had different tools on my radar including <a href="https://github.com/Illumina/manta" target="_blank">Manta</a>, <a href="https://github.com/arq5x/lumpy-sv" target="_blank">LUMPY</a>, <a href="https://github.com/abyzovlab/CNVnator" target="_blank">CNVnator</a>, and <a href="http://software.broadinstitute.org/software/genomestrip/node_CNVPipelineOverview.html" target="_blank">GenomeSTRiP</a>. Among these tools, I liked Manta for its incredible speed. But it lacked a cohort calling mode, which I thought was preferable for my population-level data. Only GenomeSTRiP had a cohort calling mode. I have not run GenomeSTRiP myself, but I talked to a person who tried it and he told me it was not the easiest tool to set up and run. I also recalled that GATK had a beta version that could call CNVs. Checking the GATK website revealed that this functionality had been released already. So, I decided to proceed with the trusted GATK for calling germline copy number variants in my dataset.</p>
<p>The <a href="https://gatkforums.broadinstitute.org/gatk/discussion/11684" target="_blank">GATK documentation for this pipeline</a> is in BETA at the time of writing this post, but it is enough to run the pipeline. I tested it and had no obvious problems. I am not going to describe each step of this pipeline in detail as you can read about them in the documentation. I will briefly list the steps and provide the Snakemake code to execute this pipeline.</p>
<h2 id="requirements">Requirements</h2>
<p>You will need <strong>GATK 4</strong> in the GATK <strong>Conda environment</strong> and <strong>Snakemake 4</strong>.</p>
<h3 id="gatk-python-environment">GATK Python environment</h3>
<p>I ran this pipeline with <strong>GATK 4.1.2.0</strong>. To call CNVs with GATK 4, you need to load a Python environment with the <em>gcnvkernel</em> module. I use the <a href="https://software.broadinstitute.org/gatk/documentation/article?id=12836" target="_blank">Conda installation</a> for that:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>conda <span class="nb">env </span>create <span class="nt">-f</span> /path/to/gatk/gatkcondaenv.yml
conda init bash <span class="c"># restart shell to take effect</span>
conda activate gatk
</code></pre></div></div>
<h3 id="snakemake">Snakemake</h3>
<p>I started writing this pipeline in Snakemake 5. I used the recently introduced <a href="https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#data-dependent-conditional-execution" target="_blank">checkpoints</a> to handle unknown output (see the scattering step below). But I encountered a problem which <a href="https://stackoverflow.com/questions/57432036/snakemake-checkpoint-exited-with-non-zero-exit-code" target="_blank">I was not able to fix</a>. So, I downgraded to <strong>Snakemake 4.3.1</strong> and used the older <code class="language-plaintext highlighter-rouge">dynamic()</code> function for scattering. Everything worked fine.</p>
<h2 id="steps-to-call-copy-number-variants-with-gatk">Steps to call copy number variants with GATK</h2>
<p>These steps are described here only for a quick reference. For a detailed description of each step and options, read the <a href="https://gatkforums.broadinstitute.org/gatk/discussion/11684" target="_blank">GATK guide</a>.</p>
<h3 id="bin-intervals">Bin intervals</h3>
<p><a href="https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_copynumber_PreprocessIntervals.php" target="_blank">PreprocessIntervals</a> takes a reference fasta file as input and creates a binned interval list. If you want to process only a subset of the genome, specify it with the option <code class="language-plaintext highlighter-rouge">-L</code>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gatk <span class="nt">--java-options</span> <span class="s2">"-Xmx8G"</span> PreprocessIntervals <span class="se">\</span>
<span class="nt">-R</span> canFam3.fa <span class="se">\</span>
<span class="nt">--padding</span> 0 <span class="se">\</span>
<span class="nt">--bin-length</span> 1000 <span class="se">\</span>
<span class="nt">-L</span> chr35:100000-2000000 <span class="se">\</span>
<span class="nt">-imr</span> OVERLAPPING_ONLY <span class="se">\</span>
<span class="nt">-O</span> interval_chr35.interval_list
</code></pre></div></div>
<p>Bin size should negatively correlate with coverage, e.g. higher coverage data can have smaller bins. The default bin length of 1000 is recommended for 30x data.</p>
<h3 id="count-reads-per-bin">Count reads per bin</h3>
<p>This step counts reads overlapping each interval. It takes the interval list from the previous step and a BAM file as input and outputs a read counts table. The output can be in a human-readable TSV format (option <code class="language-plaintext highlighter-rouge">--format TSV</code>) or HDF5 (default) which is faster to process by GATK.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gatk <span class="nt">--java-options</span> <span class="s2">"-Xmx8G"</span> CollectReadCounts <span class="se">\</span>
<span class="nt">-R</span> canFam3.fa <span class="se">\</span>
<span class="nt">-imr</span> OVERLAPPING_ONLY <span class="se">\</span>
<span class="nt">-L</span> interval_chr35.interval_list <span class="se">\</span>
<span class="nt">-I</span> sample1.bam <span class="se">\</span>
<span class="nt">-O</span> sample1_chr35.hdf5
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">OVERLAPPING_ONLY</code> prevents the merging of abutting intervals as recommended by the GATK team.</p>
<h3 id="annotate-and-filter-intervals-optional">Annotate and Filter intervals (Optional)</h3>
<p>This step helps to remove problematic regions in the cohort calling mode. However, the pipeline should work fine without any interval filtering.</p>
<p>You can annotate intervals with GC content, mappability, and segmental duplication information:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gatk <span class="nt">--java-options</span> <span class="s2">"-Xmx8G"</span> AnnotateIntervals <span class="se">\</span>
<span class="nt">-R</span> canFam3.fa <span class="se">\</span>
<span class="nt">-L</span> interval_chr35.interval_list <span class="se">\</span>
<span class="nt">--mappability-track</span> canFam3_mappability.bed.gz <span class="se">\</span>
<span class="nt">--segmental-duplication-track</span> canFam3_segmental_duplication.bed.gz <span class="se">\</span>
<span class="nt">--interval-merging-rule</span> OVERLAPPING_ONLY <span class="se">\</span>
<span class="nt">-O</span> annotated_intervals_chr35.tsv
</code></pre></div></div>
<p>You need to provide the mappability and segmental duplication tracks yourself.</p>
<p>The GATK team recommends <strong>generating mappability</strong> with <a href="https://bitbucket.org/hoffmanlab/umap/src/default/" target="_blank">Umap and Bismap</a>. I also used <a href="https://evodify.com/">GEM to generate mappability</a>.</p>
<p>To <strong>obtain segmental duplication</strong> information, I tried to run <a href="https://github.com/vpc-ccg/sedef" target="_blank">SEDEF</a> and <a href="https://github.com/delehef/asgart" target="_blank">ASGART</a> on the canFam3 genome. Unfortunately, my attempts were unsuccessful: both programs crashed without a clear error message.</p>
<p>So, I annotated my data only with GC content and mappability.</p>
<p>Annotated intervals are then filtered based on tunable thresholds:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gatk <span class="nt">--java-options</span> <span class="s2">"-Xmx8G"</span> FilterIntervals <span class="se">\</span>
<span class="nt">-L</span> interval_chr35.interval_list <span class="se">\</span>
<span class="nt">--annotated-intervals</span> annotated_intervals_chr35.tsv <span class="se">\</span>
<span class="nt">-I</span> sample1_chr35.hdf5 <span class="se">\</span>
<span class="nt">-I</span> sample2_chr35.hdf5 <span class="se">\</span>
<span class="nt">--minimum-gc-content</span> 0.1 <span class="se">\</span>
<span class="nt">--maximum-gc-content</span> 0.9 <span class="se">\</span>
<span class="nt">--minimum-mappability</span> 0.9 <span class="se">\</span>
<span class="nt">--maximum-mappability</span> 1.0 <span class="se">\</span>
<span class="nt">--minimum-segmental-duplication-content</span> 0.0 <span class="se">\</span>
<span class="nt">--maximum-segmental-duplication-content</span> 0.5 <span class="se">\</span>
<span class="nt">--low-count-filter-count-threshold</span> 5 <span class="se">\</span>
<span class="nt">--low-count-filter-percentage-of-samples</span> 90.0 <span class="se">\</span>
<span class="nt">--extreme-count-filter-minimum-percentile</span> 1.0 <span class="se">\</span>
<span class="nt">--extreme-count-filter-maximum-percentile</span> 99.0 <span class="se">\</span>
<span class="nt">--extreme-count-filter-percentage-of-samples</span> 90.0 <span class="se">\</span>
<span class="nt">--interval-merging-rule</span> OVERLAPPING_ONLY <span class="se">\</span>
<span class="nt">-O</span> gcfiltered_chr35.interval_list
</code></pre></div></div>
<h3 id="call-contig-ploidy">Call contig ploidy</h3>
<p>This step is needed to generate global baseline coverage and noise data for the subsequent steps:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gatk <span class="nt">--java-options</span> <span class="s2">"-Xmx8G"</span> DetermineGermlineContigPloidy <span class="se">\</span>
<span class="nt">-L</span> interval_chr35.interval_list <span class="se">\</span>
<span class="nt">-I</span> sample1_chr35.hdf5 <span class="se">\</span>
<span class="nt">-I</span> sample2_chr35.hdf5 <span class="se">\</span>
<span class="nt">--contig-ploidy-priors</span> ploidy_priors.tsv <span class="se">\</span>
<span class="nt">--output-prefix</span> dog <span class="se">\</span>
<span class="nt">--interval-merging-rule</span> OVERLAPPING_ONLY <span class="se">\</span>
<span class="nt">-O</span> ploidy-calls_chr35
</code></pre></div></div>
<p>You need to provide ploidy prior probabilities. Here is an example of priors I used:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CONTIG_NAME PLOIDY_PRIOR_0 PLOIDY_PRIOR_1 PLOIDY_PRIOR_2 PLOIDY_PRIOR_3
chr35 0.01 0.01 0.97 0.01
chrX 0.01 0.49 0.49 0.01
</code></pre></div></div>
<p>If you have the information on the sex of your sample, it is advised to compare it with the ploidy call results.</p>
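<p>If you have many contigs, typing the priors table by hand is error-prone. Below is a minimal Python sketch that writes the same table as above and checks that every row sums to 1. The function name <code class="language-plaintext highlighter-rouge">write_ploidy_priors</code> is hypothetical, and I assume here that the file must be tab-separated and that each row is a probability distribution over ploidy states.</p>

```python
import csv
import math

# Prior probabilities of ploidy 0, 1, 2, 3 per contig (values from the table above).
PRIORS = {
    "chr35": [0.01, 0.01, 0.97, 0.01],
    "chrX": [0.01, 0.49, 0.49, 0.01],
}


def write_ploidy_priors(priors, path):
    """Write a tab-separated ploidy priors table after basic sanity checks."""
    n_states = len(next(iter(priors.values())))
    header = ["CONTIG_NAME"] + [f"PLOIDY_PRIOR_{k}" for k in range(n_states)]
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        for contig, probs in priors.items():
            if len(probs) != n_states:
                raise ValueError(f"{contig}: expected {n_states} priors")
            # Assumption: the priors of one contig should sum to 1.
            if not math.isclose(sum(probs), 1.0, abs_tol=1e-6):
                raise ValueError(f"{contig}: priors sum to {sum(probs)}")
            writer.writerow([contig] + list(probs))


if __name__ == "__main__":
    write_ploidy_priors(PRIORS, "ploidy_priors.tsv")
```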
<h3 id="scatter-intervals">Scatter intervals</h3>
<p>GATK 4 utilizes a <a href="https://software.broadinstitute.org/gatk/documentation/article?id=11059" target="_blank">new approach for parallelization</a> of processes that requires scattering your data. This step does exactly that. It splits the interval list into shards which can be processed in parallel. The results of these scattered processes are collected at a later step.</p>
<p>To scatter the interval list into shards of ~15K intervals each, run:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">mkdir</span> <span class="nt">-p</span> scatter_chr35
gatk <span class="nt">--java-options</span> <span class="s2">"-Xmx8G"</span> IntervalListTools <span class="se">\</span>
<span class="nt">--INPUT</span> interval_chr35.interval_list <span class="se">\</span>
<span class="nt">--SUBDIVISION_MODE</span> INTERVAL_COUNT <span class="se">\</span>
<span class="nt">--SCATTER_CONTENT</span> 15000 <span class="se">\</span>
<span class="nt">--OUTPUT</span> scatter_chr35
</code></pre></div></div>
<p>It is recommended to have at least ~10–50 Mbp of genomic coverage per scatter. So, scatters of ~15K intervals with ~1 kbp bins cover ~15 Mbp each.</p>
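<p>The arithmetic behind the scatter size is simple: the genomic coverage of a shard is the number of intervals multiplied by the bin length. Here is a small Python sketch of it. Both function names are hypothetical; the numbers come from the commands in this post.</p>

```python
import math


def shard_coverage_bp(intervals_per_shard, bin_length):
    """Genomic coverage of one shard in base pairs."""
    return intervals_per_shard * bin_length


def intervals_for_target(target_bp, bin_length):
    """Number of intervals per shard needed to cover at least target_bp."""
    return math.ceil(target_bp / bin_length)


if __name__ == "__main__":
    # 15000 intervals of 1000 bp each, as in the IntervalListTools command.
    print(shard_coverage_bp(15000, 1000))
    # Intervals per shard needed for the lower (10 Mbp) and upper (50 Mbp) bounds.
    print(intervals_for_target(10_000_000, 1000))
    print(intervals_for_target(50_000_000, 1000))
```

<p>If you change the bin length in the binning step, recompute the scatter size to stay within the recommended range.</p>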
<h3 id="call-copy-number-variants">Call copy number variants</h3>
<p>This step detects both rare and common CNVs on a scattered shard:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gatk <span class="nt">--java-options</span> <span class="s2">"-Xmx8G"</span> GermlineCNVCaller <span class="se">\</span>
<span class="nt">--run-mode</span> COHORT <span class="se">\</span>
<span class="nt">-L</span> scatter_chr35/fragment/scattered.interval_list <span class="se">\</span>
<span class="nt">-I</span> sample1_chr35.hdf5 <span class="se">\</span>
<span class="nt">-I</span> sample2_chr35.hdf5 <span class="se">\</span>
<span class="nt">--contig-ploidy-calls</span> ploidy-calls_chr35/dog-calls <span class="se">\</span>
<span class="nt">--annotated-intervals</span> annotated_intervals_chr35.tsv <span class="se">\</span>
<span class="nt">--output-prefix</span> fragment <span class="se">\</span>
<span class="nt">--interval-merging-rule</span> OVERLAPPING_ONLY <span class="se">\</span>
<span class="nt">-O</span> cohort-calls_chr35
</code></pre></div></div>
<p>You need to run this command on each fragment produced by <code class="language-plaintext highlighter-rouge">IntervalListTools</code> from the <a href="#scatter-intervals">Scattering step</a>. This can be easily achieved with Snakemake as you will see below.</p>
<p>To increase the sensitivity of calls, you need to fine-tune different parameters. For details, visit <a href="https://gatkforums.broadinstitute.org/gatk/discussion/11684#4.1" target="_blank">this GATK page</a>.</p>
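<p>To make "run this command on each fragment" more concrete, here is a rough Python sketch that builds one <code class="language-plaintext highlighter-rouge">GermlineCNVCaller</code> command per scattered interval list. It only assembles the argument lists and does not run GATK. The function name <code class="language-plaintext highlighter-rouge">cnv_caller_commands</code> is hypothetical, and naming the output prefix <code class="language-plaintext highlighter-rouge">frag_</code> plus the shard directory name is my assumption, chosen to match the shard paths used in the post-processing step of this post.</p>

```python
from pathlib import PurePosixPath


def cnv_caller_commands(shard_lists, count_files, ploidy_calls, annotated, out_dir):
    """Return one GermlineCNVCaller argument list per scattered interval list."""
    commands = []
    for shard in shard_lists:
        # Assumption: name each shard after its directory, e.g. frag_temp_0001_of_3.
        prefix = "frag_" + PurePosixPath(shard).parent.name
        cmd = ["gatk", "--java-options", "-Xmx8G", "GermlineCNVCaller",
               "--run-mode", "COHORT", "-L", shard]
        for counts in count_files:
            cmd += ["-I", counts]  # keep the sample order identical in every shard
        cmd += ["--contig-ploidy-calls", ploidy_calls,
                "--annotated-intervals", annotated,
                "--output-prefix", prefix,
                "--interval-merging-rule", "OVERLAPPING_ONLY",
                "-O", out_dir]
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    shards = [f"scatter_chr35/temp_000{n}_of_3/scattered.interval_list" for n in (1, 2, 3)]
    for cmd in cnv_caller_commands(
            shards,
            ["sample1_chr35.hdf5", "sample2_chr35.hdf5"],
            "ploidy-calls_chr35/dog-calls",  # directory written by the ploidy step
            "annotated_intervals_chr35.tsv",
            "cohort-calls_chr35"):
        print(" ".join(cmd))
```

<p>You could pass each list to <code class="language-plaintext highlighter-rouge">subprocess.run</code>, but in practice Snakemake does this job better because it also tracks which shards are already done.</p>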
<h3 id="call-copy-number-segments">Call copy number segments</h3>
<p>This step collects the results from scattered shards and calls copy number state per sample for intervals and segments in the VCF format:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gatk <span class="nt">--java-options</span> <span class="s2">"-Xmx8G"</span> PostprocessGermlineCNVCalls <span class="se">\</span>
<span class="nt">--model-shard-path</span> cohort-calls_chr35/frag_temp_0001_of_3-model <span class="se">\</span>
<span class="nt">--model-shard-path</span> cohort-calls_chr35/frag_temp_0002_of_3-model <span class="se">\</span>
<span class="nt">--model-shard-path</span> cohort-calls_chr35/frag_temp_0003_of_3-model <span class="se">\</span>
<span class="nt">--calls-shard-path</span> cohort-calls_chr35/frag_temp_0001_of_3-calls <span class="se">\</span>
<span class="nt">--calls-shard-path</span> cohort-calls_chr35/frag_temp_0002_of_3-calls <span class="se">\</span>
<span class="nt">--calls-shard-path</span> cohort-calls_chr35/frag_temp_0003_of_3-calls <span class="se">\</span>
<span class="nt">--sequence-dictionary</span> <span class="s1">'/path/to/reference/canFam3.dict'</span> <span class="se">\</span>
<span class="nt">--allosomal-contig</span> chrX <span class="se">\</span>
<span class="nt">--contig-ploidy-calls</span> ploidy-calls_chr35/dog-calls <span class="se">\</span>
<span class="nt">--sample-index</span> 0 <span class="se">\</span>
<span class="nt">--output-genotyped-intervals</span> chr35_sample1_intervals_cohort.vcf.gz <span class="se">\</span>
<span class="nt">--output-genotyped-segments</span> chr35_sample1_segments_cohort.vcf.gz
</code></pre></div></div>
<p>You need to provide a sample index with <code class="language-plaintext highlighter-rouge">--sample-index</code>. The first sample in your input list has index 0, the second one is 1, etc.</p>
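<p>Because the sample index is positional, it is easy to mix up samples when you write these commands by hand. The Python sketch below derives the index from the order of the sample list and builds one <code class="language-plaintext highlighter-rouge">PostprocessGermlineCNVCalls</code> command per sample. The function name <code class="language-plaintext highlighter-rouge">postprocess_commands</code> is hypothetical; the flags are the ones used in the command above, and the code only prints the commands.</p>

```python
def postprocess_commands(samples, shard_prefixes, chrom, calls_dir, ploidy_calls, seq_dict):
    """Build one PostprocessGermlineCNVCalls argument list per sample.

    `samples` must be in the same order as the -I inputs given to
    GermlineCNVCaller, because --sample-index is the position in that list.
    """
    commands = []
    for index, sample in enumerate(samples):
        cmd = ["gatk", "--java-options", "-Xmx8G", "PostprocessGermlineCNVCalls"]
        for prefix in shard_prefixes:
            cmd += ["--model-shard-path", f"{calls_dir}/{prefix}-model"]
        for prefix in shard_prefixes:
            cmd += ["--calls-shard-path", f"{calls_dir}/{prefix}-calls"]
        cmd += [
            "--sequence-dictionary", seq_dict,
            "--allosomal-contig", "chrX",
            "--contig-ploidy-calls", ploidy_calls,
            "--sample-index", str(index),
            "--output-genotyped-intervals", f"{chrom}_{sample}_intervals_cohort.vcf.gz",
            "--output-genotyped-segments", f"{chrom}_{sample}_segments_cohort.vcf.gz",
        ]
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    prefixes = [f"frag_temp_000{n}_of_3" for n in (1, 2, 3)]
    for cmd in postprocess_commands(
            ["sample1", "sample2"], prefixes, "chr35",
            "cohort-calls_chr35", "ploidy-calls_chr35/dog-calls",
            "/path/to/reference/canFam3.dict"):
        print(" ".join(cmd))
```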
<p>Here is an example of genotyped-segments in VCF:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample1
chr35 100000 CNV_chr35_100000_309999 N <DEL>,<DUP> . . END=309999 GT:CN:NP:QA:QS:QSE:QSS 0:2:208:94:3077:98:136
chr35 310000 CNV_chr35_310000_311999 N <DEL>,<DUP> . . END=311999 GT:CN:NP:QA:QS:QSE:QSS 1:1:2:159:284:50:98
chr35 312000 CNV_chr35_312000_1999999 N <DEL>,<DUP> . . END=1999999 GT:CN:NP:QA:QS:QSE:QSS 0:2:1603:50:3077:131:50
</code></pre></div></div>
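<p>To pick out the actual CNVs from such a file, you can parse the <code class="language-plaintext highlighter-rouge">CN</code> field of each segment. The Python sketch below does this for the three records shown above. The function name <code class="language-plaintext highlighter-rouge">parse_segment</code> is hypothetical, and treating copy number 2 as the baseline is my assumption for an autosome; for real files you would probably prefer a proper VCF library.</p>

```python
RECORDS = [
    "chr35 100000 CNV_chr35_100000_309999 N <DEL>,<DUP> . . END=309999 GT:CN:NP:QA:QS:QSE:QSS 0:2:208:94:3077:98:136",
    "chr35 310000 CNV_chr35_310000_311999 N <DEL>,<DUP> . . END=311999 GT:CN:NP:QA:QS:QSE:QSS 1:1:2:159:284:50:98",
    "chr35 312000 CNV_chr35_312000_1999999 N <DEL>,<DUP> . . END=1999999 GT:CN:NP:QA:QS:QSE:QSS 0:2:1603:50:3077:131:50",
]


def parse_segment(line):
    """Parse one single-sample segment record into a small dictionary."""
    fields = line.split()
    # INFO is a ;-separated list of KEY=VALUE pairs; only END is used here.
    info = dict(item.split("=", 1) for item in fields[7].split(";") if "=" in item)
    # Pair the FORMAT keys with the values of the (only) sample column.
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
    return {
        "chrom": fields[0],
        "start": int(fields[1]),
        "end": int(info["END"]),
        "genotype": sample["GT"],
        "copy_number": int(sample["CN"]),
    }


def altered_segments(lines, baseline=2):
    """Return segments whose copy number differs from the assumed baseline."""
    segments = [parse_segment(line) for line in lines if not line.startswith("#")]
    return [seg for seg in segments if seg["copy_number"] != baseline]


if __name__ == "__main__":
    for seg in altered_segments(RECORDS):
        print(seg["chrom"], seg["start"], seg["end"], "CN =", seg["copy_number"])
```

<p>In this example, only the segment chr35:310000-311999 differs from the baseline: it has copy number 1, i.e. a deletion.</p>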
<h2 id="gatk-cnv-pipeline-in-snakemake">GATK CNV pipeline in Snakemake</h2>
<p>All the commands above can be executed as a distributed pipeline with Snakemake. For example, processing two chromosomes and two samples would look like this:</p>
<p><img src="/assets/posts/2019-08-15-gatk-cnv-snakemake/gatk_CNV_in_Snakemake.jpg" alt="Calling copy number variants with GATK in Snakemake" /></p>
<p>You can adapt the code below for your needs. Just change the list of input file names and the chromosome numbers.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">SAMPLES</span><span class="p">,</span> <span class="o">=</span> <span class="n">glob_wildcards</span><span class="p">(</span><span class="s">'/path/to/BAMs/{sample}_merged_markDupl_BQSR.bam'</span><span class="p">)</span>
<span class="n">CHRN</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">39</span><span class="p">))</span>
<span class="n">CHRN</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="s">'X'</span><span class="p">)</span>
<span class="n">CHR</span> <span class="o">=</span> <span class="n">CHRN</span>
<span class="n">REF</span> <span class="o">=</span> <span class="s">'/path/to/reference/canFam3.fa'</span>
<span class="n">DICT</span> <span class="o">=</span> <span class="s">'/path/to/reference/canFam3.dict'</span>
<span class="n">MAP</span> <span class="o">=</span> <span class="s">'canFam3_mappability_150.merged.bed.gz'</span>
<span class="n">SEGDUP</span> <span class="o">=</span> <span class="s">'segmental_duplication.bed.gz'</span>

<span class="n">rule</span> <span class="nb">all</span><span class="p">:</span>
    <span class="nb">input</span><span class="p">:</span>
        <span class="n">expand</span><span class="p">(</span><span class="s">'chr{j}_{sample}_intervals_cohort.vcf.gz'</span><span class="p">,</span> <span class="n">j</span><span class="o">=</span><span class="n">CHR</span><span class="p">,</span> <span class="n">sample</span><span class="o">=</span><span class="n">SAMPLES</span><span class="p">),</span>
        <span class="n">expand</span><span class="p">(</span><span class="s">'chr{j}_{sample}_segments_cohort.vcf.gz'</span><span class="p">,</span> <span class="n">j</span><span class="o">=</span><span class="n">CHR</span><span class="p">,</span> <span class="n">sample</span><span class="o">=</span><span class="n">SAMPLES</span><span class="p">)</span>

<span class="n">rule</span> <span class="n">make_intervals</span><span class="p">:</span>
    <span class="nb">input</span><span class="p">:</span>
        <span class="n">REF</span>
    <span class="n">params</span><span class="p">:</span>
        <span class="s">'chr{j}'</span>
    <span class="n">output</span><span class="p">:</span>
        <span class="s">'interval_chr{j}.interval_list'</span>
    <span class="n">shell</span><span class="p">:</span>
        <span class="s">'''
        gatk --java-options "-Xmx8G" PreprocessIntervals </span><span class="se">\
</span><span class="s">            -R {input} </span><span class="se">\
</span><span class="s">            --padding 0 </span><span class="se">\
</span><span class="s">            -L {params} </span><span class="se">\
</span><span class="s">            -imr OVERLAPPING_ONLY </span><span class="se">\
</span><span class="s">            -O {output}
        '''</span>

<span class="n">rule</span> <span class="n">count_reads</span><span class="p">:</span>
    <span class="nb">input</span><span class="p">:</span>
        <span class="n">ref</span> <span class="o">=</span> <span class="n">REF</span><span class="p">,</span>
        <span class="n">bam</span> <span class="o">=</span> <span class="s">'{sample}_merged_markDupl_BQSR.bam'</span><span class="p">,</span>
        <span class="n">interval</span> <span class="o">=</span> <span class="s">'interval_chr{j}.interval_list'</span>
    <span class="n">output</span><span class="p">:</span>
        <span class="s">'{sample}_chr{j}.hdf5'</span>
    <span class="n">shell</span><span class="p">:</span>
        <span class="s">'''
        gatk --java-options "-Xmx8G" CollectReadCounts </span><span class="se">\
</span><span class="s">            -R {input.ref} </span><span class="se">\
</span><span class="s">            -imr OVERLAPPING_ONLY </span><span class="se">\
</span><span class="s">            -L {input.interval} </span><span class="se">\
</span><span class="s">            -I {input.bam} </span><span class="se">\
</span><span class="s">            -O {output}
        '''</span>

<span class="n">rule</span> <span class="n">annotate</span><span class="p">:</span>
    <span class="nb">input</span><span class="p">:</span>
        <span class="n">ref</span> <span class="o">=</span> <span class="n">REF</span><span class="p">,</span>
        <span class="n">interval</span> <span class="o">=</span> <span class="s">'interval_chr{j}.interval_list'</span><span class="p">,</span>
        <span class="n">mappability</span> <span class="o">=</span> <span class="n">MAP</span><span class="p">,</span>
        <span class="n">segduplication</span> <span class="o">=</span> <span class="n">SEGDUP</span>
    <span class="n">output</span><span class="p">:</span>
        <span class="s">'annotated_intervals_chr{j}.tsv'</span>
    <span class="n">shell</span><span class="p">:</span>
        <span class="s">'''
        gatk --java-options "-Xmx8G" AnnotateIntervals </span><span class="se">\
</span><span class="s">            -R {input.ref} </span><span class="se">\
</span><span class="s">            -L {input.interval} </span><span class="se">\
</span><span class="s">            --mappability-track {input.mappability} </span><span class="se">\
</span><span class="s"> --segmental-duplication-track {input.segduplication} </span><span class="se">\
</span><span class="s"> --interval-merging-rule OVERLAPPING_ONLY </span><span class="se">\
</span><span class="s"> -O {output}
'''</span>
<span class="n">rule</span> <span class="n">filter_intervals</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">interval</span> <span class="o">=</span> <span class="s">'interval_chr{j}.interval_list'</span><span class="p">,</span>
<span class="n">annotated</span> <span class="o">=</span> <span class="s">'annotated_intervals_chr{j}.tsv'</span><span class="p">,</span>
<span class="n">samples</span> <span class="o">=</span> <span class="n">expand</span><span class="p">(</span><span class="s">'{sample}_{chromosome}.hdf5'</span><span class="p">,</span> <span class="n">sample</span><span class="o">=</span><span class="n">SAMPLES</span><span class="p">,</span> <span class="n">chromosome</span><span class="o">=</span><span class="s">'chr{j}'</span><span class="p">),</span>
<span class="n">output</span><span class="p">:</span>
<span class="s">'gcfiltered_chr{j}.interval_list'</span>
<span class="n">params</span><span class="p">:</span>
<span class="n">files</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">wildcards</span><span class="p">,</span> <span class="nb">input</span><span class="p">:</span> <span class="s">' -I '</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="nb">input</span><span class="p">.</span><span class="n">samples</span><span class="p">)</span>
<span class="n">shell</span><span class="p">:</span>
<span class="s">'''
gatk --java-options "-Xmx8G" FilterIntervals </span><span class="se">\
</span><span class="s"> -L {input.interval} </span><span class="se">\
</span><span class="s"> --annotated-intervals {input.annotated} </span><span class="se">\
</span><span class="s"> -I {params.files} </span><span class="se">\
</span><span class="s"> --interval-merging-rule OVERLAPPING_ONLY </span><span class="se">\
</span><span class="s"> -O {output}
'''</span>
<span class="n">rule</span> <span class="n">determine_ploidy</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">interval</span> <span class="o">=</span> <span class="s">'gcfiltered_chr{j}.interval_list'</span><span class="p">,</span>
<span class="n">samples</span> <span class="o">=</span> <span class="n">expand</span><span class="p">(</span><span class="s">'{sample}_{chromosome}.hdf5'</span><span class="p">,</span> <span class="n">sample</span><span class="o">=</span><span class="n">SAMPLES</span><span class="p">,</span> <span class="n">chromosome</span><span class="o">=</span><span class="s">'chr{j}'</span><span class="p">),</span>
<span class="n">prior</span> <span class="o">=</span> <span class="s">'ploidy_priors.tsv'</span><span class="p">,</span>
<span class="n">params</span><span class="p">:</span>
<span class="n">prefix</span> <span class="o">=</span> <span class="s">'dogs'</span><span class="p">,</span>
<span class="n">files</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">wildcards</span><span class="p">,</span> <span class="nb">input</span><span class="p">:</span> <span class="s">' -I '</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="nb">input</span><span class="p">.</span><span class="n">samples</span><span class="p">)</span>
<span class="n">output</span><span class="p">:</span>
<span class="s">'ploidy-calls_chr{j}'</span>
<span class="n">shell</span><span class="p">:</span>
<span class="s">'''
gatk --java-options "-Xmx8G" DetermineGermlineContigPloidy </span><span class="se">\
</span><span class="s"> -L {input.interval} </span><span class="se">\
</span><span class="s"> -I {params.files} </span><span class="se">\
</span><span class="s"> --contig-ploidy-priors {input.prior} </span><span class="se">\
</span><span class="s"> --output-prefix {params.prefix} </span><span class="se">\
</span><span class="s"> --interval-merging-rule OVERLAPPING_ONLY </span><span class="se">\
</span><span class="s"> -O {output}
'''</span>
<span class="n">rule</span> <span class="n">scattering</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">interval</span> <span class="o">=</span> <span class="s">'gcfiltered_chr{j}.interval_list'</span>
<span class="n">output</span><span class="p">:</span>
<span class="n">dynamic</span><span class="p">(</span><span class="s">'scatter_chr{j}/{fragment}/scattered.interval_list'</span><span class="p">)</span>
<span class="n">params</span><span class="p">:</span>
<span class="s">'scatter_chr{j}'</span>
<span class="n">shell</span><span class="p">:</span>
<span class="s">'''
mkdir -p {params} # needed because Snakemake fails to create this directory automatically
gatk --java-options "-Xmx8G" IntervalListTools </span><span class="se">\
</span><span class="s"> --INPUT {input.interval} </span><span class="se">\
</span><span class="s"> --SUBDIVISION_MODE INTERVAL_COUNT </span><span class="se">\
</span><span class="s"> --SCATTER_CONTENT 15000 </span><span class="se">\
</span><span class="s"> --OUTPUT {params}
'''</span>
<span class="n">rule</span> <span class="n">cnvcall</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">interval</span> <span class="o">=</span> <span class="s">'scatter_chr{j}/{fragment}/scattered.interval_list'</span><span class="p">,</span>
<span class="n">sample</span> <span class="o">=</span> <span class="n">expand</span><span class="p">(</span><span class="s">"{sample}_{chromosome}.hdf5"</span><span class="p">,</span> <span class="n">sample</span><span class="o">=</span><span class="n">SAMPLES</span><span class="p">,</span> <span class="n">chromosome</span><span class="o">=</span><span class="s">'chr{j}'</span><span class="p">),</span>
<span class="n">annotated</span> <span class="o">=</span> <span class="s">'annotated_intervals_chr{j}.tsv'</span><span class="p">,</span>
<span class="n">ploidy</span> <span class="o">=</span> <span class="s">'ploidy-calls_chr{j}'</span>
<span class="n">output</span><span class="p">:</span>
<span class="n">modelf</span> <span class="o">=</span> <span class="s">"cohort-calls_chr{j}/frag_{fragment}-model"</span><span class="p">,</span>
<span class="n">callsf</span> <span class="o">=</span> <span class="s">"cohort-calls_chr{j}/frag_{fragment}-calls"</span>
<span class="n">params</span><span class="p">:</span>
<span class="n">outdir</span> <span class="o">=</span> <span class="s">'cohort-calls_chr{j}'</span><span class="p">,</span>
<span class="n">outpref</span> <span class="o">=</span> <span class="s">'frag_{fragment}'</span><span class="p">,</span>
<span class="n">files</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">wildcards</span><span class="p">,</span> <span class="nb">input</span><span class="p">:</span> <span class="s">" -I "</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="nb">input</span><span class="p">.</span><span class="n">sample</span><span class="p">)</span>
<span class="n">shell</span><span class="p">:</span>
<span class="s">'''
gatk --java-options "-Xmx8G" GermlineCNVCaller </span><span class="se">\
</span><span class="s"> --run-mode COHORT </span><span class="se">\
</span><span class="s"> -L {input.interval} </span><span class="se">\
</span><span class="s"> -I {params.files} </span><span class="se">\
</span><span class="s"> --contig-ploidy-calls {input.ploidy}/dogs-calls </span><span class="se">\
</span><span class="s"> --annotated-intervals {input.annotated} </span><span class="se">\
</span><span class="s"> --output-prefix {params.outpref} </span><span class="se">\
</span><span class="s"> --interval-merging-rule OVERLAPPING_ONLY </span><span class="se">\
</span><span class="s"> -O {params.outdir}
'''</span>
<span class="k">def</span> <span class="nf">sampleindex</span><span class="p">(</span><span class="n">sample</span><span class="p">):</span>
<span class="n">index</span> <span class="o">=</span> <span class="n">SAMPLES</span><span class="p">.</span><span class="n">index</span><span class="p">(</span><span class="n">sample</span><span class="p">)</span>
<span class="k">return</span> <span class="n">index</span>
<span class="n">rule</span> <span class="n">process_cnvcalls</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">dynamic</span><span class="p">(</span><span class="s">"cohort-calls_chr{j}/frag_{fragment}-model"</span><span class="p">),</span>
<span class="n">calls</span> <span class="o">=</span> <span class="n">dynamic</span><span class="p">(</span><span class="s">"cohort-calls_chr{j}/frag_{fragment}-calls"</span><span class="p">),</span>
<span class="nb">dict</span> <span class="o">=</span> <span class="n">DICT</span><span class="p">,</span>
<span class="n">ploidy</span> <span class="o">=</span> <span class="s">'ploidy-calls_chr{j}'</span>
<span class="n">output</span><span class="p">:</span>
<span class="n">intervals</span> <span class="o">=</span> <span class="s">'chr{j}_{sample}_intervals_cohort.vcf.gz'</span><span class="p">,</span>
<span class="n">segments</span> <span class="o">=</span> <span class="s">'chr{j}_{sample}_segments_cohort.vcf.gz'</span>
<span class="n">params</span><span class="p">:</span>
<span class="n">index</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">wildcards</span><span class="p">:</span> <span class="n">sampleindex</span><span class="p">(</span><span class="n">wildcards</span><span class="p">.</span><span class="n">sample</span><span class="p">),</span>
<span class="n">modelfiles</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">wildcards</span><span class="p">,</span> <span class="nb">input</span><span class="p">:</span> <span class="s">" --model-shard-path "</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="nb">input</span><span class="p">.</span><span class="n">model</span><span class="p">),</span>
<span class="n">callsfiles</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">wildcards</span><span class="p">,</span> <span class="nb">input</span><span class="p">:</span> <span class="s">" --calls-shard-path "</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="nb">input</span><span class="p">.</span><span class="n">calls</span><span class="p">)</span>
<span class="n">shell</span><span class="p">:</span>
<span class="s">'''
gatk --java-options "-Xmx8G" PostprocessGermlineCNVCalls </span><span class="se">\
</span><span class="s"> --model-shard-path {params.modelfiles} </span><span class="se">\
</span><span class="s"> --calls-shard-path {params.callsfiles} </span><span class="se">\
</span><span class="s"> --sequence-dictionary {input.dict} </span><span class="se">\
</span><span class="s"> --allosomal-contig chrX </span><span class="se">\
</span><span class="s"> --contig-ploidy-calls {input.ploidy}/dogs-calls </span><span class="se">\
</span><span class="s"> --sample-index {params.index} </span><span class="se">\
</span><span class="s"> --output-genotyped-intervals {output.intervals} </span><span class="se">\
</span><span class="s"> --output-genotyped-segments {output.segments}
'''</span>
</code></pre></div></div>
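<p>Several rules above pass many count files to a single GATK command by joining them with <code class="language-plaintext highlighter-rouge">' -I '</code> and look up the sample position with <code class="language-plaintext highlighter-rouge">SAMPLES.index()</code>. The plain Python sketch below illustrates both tricks outside Snakemake. The sample names and the helper functions <code class="language-plaintext highlighter-rouge">count_files</code> and <code class="language-plaintext highlighter-rouge">repeated_flag</code> are hypothetical, and <code class="language-plaintext highlighter-rouge">str.format</code> only imitates what <code class="language-plaintext highlighter-rouge">expand</code> does for this simple case:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical sample names; the real pipeline defines SAMPLES earlier.
SAMPLES = ["dog1", "dog2", "dog3"]


def count_files(samples, chromosome):
    """Imitate expand('{sample}_{chromosome}.hdf5', ...) with plain str.format."""
    return ["{sample}_{chromosome}.hdf5".format(sample=s, chromosome=chromosome)
            for s in samples]


def repeated_flag(flag, values):
    """Join values so that the flag precedes every value except the first one."""
    return " {} ".format(flag).join(values)


# Passing 'chr{j}' as a value leaves {j} in place as a wildcard to be filled in later.
templates = count_files(SAMPLES, "chr{j}")
print(templates)

# Once j is known (here 1), the shell line '-I {params.files}' expands to this string.
files = repeated_flag("-I", count_files(SAMPLES, "chr1"))
print("-I " + files)

# The sample index is the position of the sample in SAMPLES.
print(SAMPLES.index("dog2"))
</code></pre></div></div>
<p>The first <code class="language-plaintext highlighter-rouge">-I</code> is written in the shell command itself, which is why the joined string starts with a file name rather than with the flag.</p>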
<p>If you need to run Snakemake on a cluster, <a href="https://evodify.com/rna-seq-star-snakemake/#run-snakemake-on-a-slurm-cluster-uppmax">I explained how to do that</a> previously.</p>
<h2 id="final-thoughts">Final thoughts</h2>
<p>Although the documentation for copy number variants calling with GATK is in beta, it is sufficient to perform the CNV analysis. GATK is easy to install and it is reasonably fast. GATK now scatters the data during some steps to improve efficiency. This approach is especially worthwhile if you run GATK on a Spark cluster. <a href="https://evodify.com/genomic-spark-tutorial/">This is where large scale genomics is moving</a>. However, if you do not have access to a full-scale Spark cluster, you can use GATK with this Snakemake pipeline on a cluster that has a job scheduler such as SLURM.</p>
<p><em>If you have any questions or suggestions, feel free to <a href="mailto:dmytro.kryvokhyzha@evobio.eu">email me</a></em>.</p>This pipeline calls germline copy number variants (CNV) with GATK 4 and Snakemake. It uses the cohort mode, so the CNV are inferred from all samples together.Estimate genome mappability with GEM library2019-08-13T00:00:00+00:002019-08-13T00:00:00+00:00https://evodify.com/gem-mappability<p><a href="http://dx.plos.org/10.1371/journal.pone.0030377" target="_blank">GEM mappability</a> was the most popular program to estimate genome mappability a few years ago. However, a lot of things have changed since that time. Not only do published tutorials no longer work, but even finding GEM with the mappability option is not that easy.</p>
<p>The link in the original paper doesn’t work anymore. Moreover, if you google <code class="language-plaintext highlighter-rouge">GEM mappability</code>, you will find out that <a href="https://github.com/smarco/gem3-mapper/issues/7" target="_blank">mappability was removed from GEM</a>. I faced these and some other issues when I tried to get a mappability track for my data with GEM. Therefore, I would like to share scripts and commands I used to get GEM mappability in 2019.</p>
<h2 id="download-gem-library">Download GEM library</h2>
<p>As I mentioned before, the mappability option has been removed from GEM. In 2018, this removal was intended to be temporary, but mappability was still missing in mid-2019. So, downloading GEM 3 from <a href="https://github.com/smarco/gem3-mapper" target="_blank">its GitHub page</a> won’t help you. Luckily, previous versions are still available at <a href="https://sourceforge.net/projects/gemlibrary/files/gem-library/Binary%20pre-release%203/" target="_blank">SourceForge.net</a>. I downloaded <em>GEM-binaries-Linux-x86_64-core_i3-20130406-045632.tbz2</em>.</p>
<p>Extract the downloaded archive and make all files in the <em>bin</em> folder executable. Your GEM library is ready!</p>
<h2 id="estimate-gem-mappability">Estimate GEM mappability</h2>
<p>To get mappability in GEM format run these commands:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gem-indexer <span class="nt">-T</span> 10 <span class="nt">-i</span> canFam3.fa <span class="nt">-o</span> canFam3_gem_index
gem-mappability <span class="nt">-T</span> 10 <span class="nt">-I</span> canFam3_gem_index.gem <span class="nt">-l</span> 150 <span class="nt">-o</span> canFam3_mappability_150
</code></pre></div></div>
<p>I used a k-mer size of 150 bp because my data was generated with a 150 bp read length. I also ran it on 10 cores (<code class="language-plaintext highlighter-rouge">-T 10</code>). You can change these options to fit your needs.</p>
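<p>To make the idea of mappability more concrete, here is a toy Python sketch that scores each position as 1 divided by the number of times the k-mer starting there occurs in the sequence. It is a deliberate simplification and not how GEM is implemented: it counts exact matches only, on one strand, and the function name <code class="language-plaintext highlighter-rouge">kmer_mappability</code> is made up for this illustration:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from collections import Counter


def kmer_mappability(sequence, k):
    """Return 1 / (number of exact occurrences) for the k-mer starting at each position."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(kmers)
    return [1.0 / counts[kmer] for kmer in kmers]


# Toy sequence: 'ACG' occurs twice, every other 3-mer occurs once.
scores = kmer_mappability("ACGTACGA", 3)
print(scores)
</code></pre></div></div>
<p>A score of 1 marks a unique k-mer, and lower scores mark k-mers that repeat elsewhere in the sequence. Longer k-mers are more likely to be unique, which is why the k-mer size should match the read length.</p>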
<h2 id="convert-gem-mappability-to-bed">Convert GEM mappability to BED</h2>
<p>The GEM mappability file may not be a suitable input for many programs. For example, <a href="https://software.broadinstitute.org/gatk/" target="_blank">GATK</a> takes mappability data in a BED file. BED files are also easy to convert to many other formats.</p>
<p>I found this <a href="https://github.com/xuefzhao/Reference.Mappability" target="_blank">GitHub repository</a> that shows how to convert GEM mappability to BED format:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gem-2-wig <span class="nt">-I</span> canFam3_gem_index.gem <span class="nt">-i</span> canFam3_mappability_150.mappability <span class="nt">-o</span> canFam3_mappability_150
wigToBigWig canFam3_mappability_150.wig canFam3_mappability_150.sizes canFam3_mappability_150.bw
bigWigToBedGraph canFam3_mappability_150.bw canFam3_mappability_150.bedGraph
bedGraphTobed canFam3_mappability_150.bedGraph canFam3_mappability_150.bed 0.3
</code></pre></div></div>
<p>In these commands:
<code class="language-plaintext highlighter-rouge">gem-2-wig</code> is part of the GEM library.
<code class="language-plaintext highlighter-rouge">wigToBigWig</code> and <code class="language-plaintext highlighter-rouge">bigWigToBedGraph</code> can be downloaded from <a href="http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/" target="_blank">here</a>.
<code class="language-plaintext highlighter-rouge">bedGraphTobed</code> is available in the <a href="https://github.com/xuefzhao/Reference.Mappability/tree/master/Scripts" target="_blank">same GitHub repository</a>.</p>
<h2 id="merge-overlapping-intervals-in-bed">Merge overlapping intervals in BED</h2>
<p>Some programs including GATK require overlapping mappability intervals to be merged. You can achieve that with my <a href="https://github.com/evodify/genotype-files-manipulations/blob/master/combine_overlapping_BEDintervals.py" target="_blank">python script</a>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python ~/git/genotype-files-manipulations/combine_overlapping_BEDintervals.py <span class="nt">-i</span> canFam3_mappability_150.bed <span class="nt">-o</span> canFam3_mappability_150.merged.bed <span class="nt">-v</span> 0
</code></pre></div></div>
<p>where <code class="language-plaintext highlighter-rouge">-v</code> defines the overhang size between intervals.</p>
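<p>For readers who want to see the logic of such a merge, here is a minimal Python sketch. It is not the code of the script linked above. The function name <code class="language-plaintext highlighter-rouge">merge_intervals</code> and the <code class="language-plaintext highlighter-rouge">max_gap</code> argument are hypothetical, and <code class="language-plaintext highlighter-rouge">max_gap</code> is only assumed to play a role similar to the overhang option:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def merge_intervals(intervals, max_gap=0):
    """Merge (chrom, start, end) intervals that overlap or lie within max_gap bases."""
    merged = []
    # Sorting by chromosome and start puts candidates for merging next to each other.
    for chrom, start, end in sorted(intervals):
        last = merged[-1] if merged else None
        if last is not None and last[0] == chrom and last[2] + max_gap >= start:
            # Extend the previous interval instead of starting a new one.
            last[2] = max(last[2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(item) for item in merged]


bed = [("chr1", 100, 200), ("chr1", 150, 300), ("chr1", 400, 500), ("chr2", 10, 20)]
result = merge_intervals(bed)
print(result)
</code></pre></div></div>
<p>With <code class="language-plaintext highlighter-rouge">max_gap=0</code>, this sketch also joins intervals that touch end to start, and intervals on different chromosomes are never merged.</p>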
<h2 id="gatk-index">GATK Index</h2>
<p>Since I have mentioned GATK several times in this post, here are two commands to compress and index the mappability data for GATK:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bgzip canFam3_mappability_150.merged.bed
gatk IndexFeatureFile <span class="nt">-F</span> canFam3_mappability_150.merged.bed.gz
</code></pre></div></div>
<h2 id="conclusion">Conclusion</h2>
<p>I believe GEM estimation of genome mappability is still valid in 2019. Finding the correct version of GEM and a few other scripts was not straightforward, but otherwise this approach is fast and simple. Luckily, you do not need to do all the work I have done :-)</p>
<p>If you want to use some of the latest approaches for mappability estimation, try <a href="https://bitbucket.org/hoffmanlab/umap/src/default/">Umap and Bismap</a>. Also, keep checking the latest version of <a href="https://github.com/smarco/gem3-mapper/">GEM</a>; it may already have the mappability option by the time you read this post.</p>
<p><em>If you have any questions or suggestions, feel free to <a href="mailto:dmytro.kryvokhyzha@evobio.eu">email me</a></em>.</p>This post will guide you on how to get GEM mappability and convert it to BED file. Links to GEM library and all necessary conversion scripts are provided.Creating a duty schedule in R2019-08-06T00:00:00+00:002019-08-06T00:00:00+00:00https://evodify.com/duty-schedule-in-r<p>As a person who possesses some programming skills, I try to automate everything I can. Recently, I became responsible for creating a kitchen duty schedule at work. So, I wrote an <strong>R script</strong> that takes a list of people as input and outputs a PDF with the schedule and I would like to share it with you.</p>
<h2 id="schedule-requirements">Schedule requirements</h2>
<p>Each week, one person cleans the <strong>kitchen</strong> and another person makes <strong><a href="https://en.wikipedia.org/wiki/Coffee_culture#Sweden" target="_blank">fika</a></strong>. It is also essential that <strong>the same person should not be responsible for both</strong> kitchen and fika during the same year. Duties should also be <strong>fairly distributed among people</strong>.</p>
<h2 id="generate-a-schedule-table">Generate a schedule table</h2>
<p>First, you need to <strong>load the list</strong> of people, and extract the names:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">d</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read.table</span><span class="p">(</span><span class="s2">"people-list.csv"</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"\t"</span><span class="p">,</span><span class="w"> </span><span class="n">header</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">T</span><span class="p">,</span><span class="w"> </span><span class="n">stringsAsFactors</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">)</span><span class="w">
</span><span class="n">names</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">d</span><span class="o">$</span><span class="n">Name</span><span class="w">
</span></code></pre></div></div>
<p>Then, <strong>randomly pick</strong> a few people (in my case it was 9) who will be assigned to the kitchen duty:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Kitchen</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="n">names</span><span class="p">,</span><span class="w"> </span><span class="m">9</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>Do the same to assign the fika duty but make sure people from the kitchen duty list are <strong>excluded</strong>:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Fika</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="n">names</span><span class="p">[</span><span class="o">!</span><span class="p">(</span><span class="n">names</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="n">Kitchen</span><span class="p">)],</span><span class="w"> </span><span class="m">9</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>After the people lists are created, generate the <strong>start and end dates</strong> as well as <strong>week numbers</strong> for these lists:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">start</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">as.Date</span><span class="p">(</span><span class="s2">"19/08/19"</span><span class="p">,</span><span class="w"> </span><span class="n">format</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"%d/%m/%y"</span><span class="p">),</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"week"</span><span class="p">,</span><span class="w"> </span><span class="n">length.out</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">9</span><span class="p">)</span><span class="w">
</span><span class="n">end</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">as.Date</span><span class="p">(</span><span class="s2">"23/08/19"</span><span class="p">,</span><span class="w"> </span><span class="n">format</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"%d/%m/%y"</span><span class="p">),</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"week"</span><span class="p">,</span><span class="w"> </span><span class="n">length.out</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">9</span><span class="p">)</span><span class="w">
</span><span class="n">week</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">strftime</span><span class="p">(</span><span class="n">start</span><span class="p">,</span><span class="w"> </span><span class="n">format</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"%V"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>In the end, merge these lists into a <strong>table</strong>:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">Kitchen</span><span class="p">,</span><span class="w"> </span><span class="n">Fika</span><span class="p">,</span><span class="w"> </span><span class="n">start</span><span class="p">,</span><span class="w"> </span><span class="n">end</span><span class="p">,</span><span class="w"> </span><span class="n">week</span><span class="p">)</span><span class="w">
</span><span class="n">write.table</span><span class="p">(</span><span class="n">dd</span><span class="p">,</span><span class="w"> </span><span class="s2">"kitchen-schedule_week34-42.csv"</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"\t"</span><span class="p">,</span><span class="w"> </span><span class="n">row.names</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>Everything seems to be done. One could just load this table into a spreadsheet editor, format it nicely, and print it. But why waste time on this manual work if you can automate this step too?</p>
<h2 id="plot-a-table-in-r">Plot a table in R</h2>
<p>Instead of manually formatting the obtained table in a spreadsheet editor, you can add a few more lines of R code and get a <strong>print-ready table</strong>:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">gridExtra</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">grid</span><span class="p">)</span><span class="w">
</span><span class="n">pdf</span><span class="p">(</span><span class="s2">"kitchen-schedule_week34-42.pdf"</span><span class="p">,</span><span class="w"> </span><span class="n">width</span><span class="o">=</span><span class="m">11.69</span><span class="p">,</span><span class="w"> </span><span class="n">height</span><span class="o">=</span><span class="m">8.27</span><span class="p">)</span><span class="w">
</span><span class="n">g</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tableGrob</span><span class="p">(</span><span class="n">dd</span><span class="p">,</span><span class="w"> </span><span class="n">rows</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">,</span><span class="w"> </span><span class="n">theme</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ttheme_default</span><span class="p">(</span><span class="n">base_size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">16</span><span class="p">,</span><span class="w">
</span><span class="n">padding</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">unit</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">20</span><span class="p">,</span><span class="w"> </span><span class="m">12</span><span class="p">),</span><span class="w"> </span><span class="s2">"mm"</span><span class="p">)))</span><span class="w">
</span><span class="n">grid.newpage</span><span class="p">()</span><span class="w">
</span><span class="n">grid.draw</span><span class="p">(</span><span class="n">g</span><span class="p">)</span><span class="w">
</span><span class="n">dev.off</span><span class="p">()</span><span class="w">
</span></code></pre></div></div>
<p>In the end, you will obtain a PDF page of <strong>A4 size</strong> with this kind of table:</p>
<p><img src="/assets/posts/2019-08-06-duty-schedule-in-r/schedule-table-in-r.jpeg" alt="A schedule table generated in R" /></p>
<h2 id="generating-new-schedule-tables">Generating new schedule tables</h2>
<p>Next time you generate a schedule table, you just need to <strong>exclude</strong> the people who were assigned some <strong>duties before</strong>:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">d</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read.table</span><span class="p">(</span><span class="s2">"people-list.csv"</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"\t"</span><span class="p">,</span><span class="w"> </span><span class="n">header</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">T</span><span class="p">,</span><span class="w"> </span><span class="n">stringsAsFactors</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">)</span><span class="w">
</span><span class="n">toexlcude</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read.table</span><span class="p">(</span><span class="s1">'kitchen-schedule_week34-42.csv'</span><span class="p">,</span><span class="w">
</span><span class="n">header</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">T</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"\t"</span><span class="p">,</span><span class="w"> </span><span class="n">stringsAsFactors</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">)</span><span class="w">
</span><span class="n">names</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">d</span><span class="o">$</span><span class="n">Name</span><span class="p">[</span><span class="o">!</span><span class="p">(</span><span class="n">d</span><span class="o">$</span><span class="n">Name</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">toexlcude</span><span class="o">$</span><span class="n">Kitchen</span><span class="p">,</span><span class="w"> </span><span class="n">toexlcude</span><span class="o">$</span><span class="n">Fika</span><span class="p">))]</span><span class="w">
</span></code></pre></div></div>
<p>The rest of the code is the same as above. If you have several duty lists with the names you need to exclude, just merge them before applying the exclusion.</p>
<h2 id="full-code">Full Code</h2>
<p>All the code put together:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">gridExtra</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">grid</span><span class="p">)</span><span class="w">
</span><span class="n">d</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read.table</span><span class="p">(</span><span class="s2">"people-list.csv"</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"\t"</span><span class="p">,</span><span class="w"> </span><span class="n">header</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">T</span><span class="p">,</span><span class="w"> </span><span class="n">stringsAsFactors</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">)</span><span class="w">
</span><span class="k">if</span><span class="p">(</span><span class="n">file.exists</span><span class="p">(</span><span class="s1">'previous_kitchen-schedule.csv'</span><span class="p">)){</span><span class="w">
</span><span class="n">toexlcude</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read.table</span><span class="p">(</span><span class="s1">'previous_kitchen-schedule.csv'</span><span class="p">,</span><span class="w">
</span><span class="n">header</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">T</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"\t"</span><span class="p">,</span><span class="w"> </span><span class="n">stringsAsFactors</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">)</span><span class="w">
</span><span class="n">names</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">d</span><span class="o">$</span><span class="n">Name</span><span class="p">[</span><span class="o">!</span><span class="p">(</span><span class="n">d</span><span class="o">$</span><span class="n">Name</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">toexlcude</span><span class="o">$</span><span class="n">Kitchen</span><span class="p">,</span><span class="w"> </span><span class="n">toexlcude</span><span class="o">$</span><span class="n">Fika</span><span class="p">))]</span><span class="w">
</span><span class="p">}</span><span class="k">else</span><span class="p">{</span><span class="w">
</span><span class="n">names</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">d</span><span class="o">$</span><span class="n">Name</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">Kitchen</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="n">names</span><span class="p">,</span><span class="w"> </span><span class="m">9</span><span class="p">)</span><span class="w">
</span><span class="n">Fika</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="n">names</span><span class="p">[</span><span class="o">!</span><span class="p">(</span><span class="n">names</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="n">Kitchen</span><span class="p">)],</span><span class="w"> </span><span class="m">9</span><span class="p">)</span><span class="w">
</span><span class="n">start</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">as.Date</span><span class="p">(</span><span class="s2">"21/10/19"</span><span class="p">,</span><span class="w"> </span><span class="n">format</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"%d/%m/%y"</span><span class="p">),</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"week"</span><span class="p">,</span><span class="w"> </span><span class="n">length.out</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">9</span><span class="p">)</span><span class="w">
</span><span class="n">end</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">as.Date</span><span class="p">(</span><span class="s2">"25/10/19"</span><span class="p">,</span><span class="w"> </span><span class="n">format</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"%d/%m/%y"</span><span class="p">),</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"week"</span><span class="p">,</span><span class="w"> </span><span class="n">length.out</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">9</span><span class="p">)</span><span class="w">
</span><span class="n">week</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">strftime</span><span class="p">(</span><span class="n">start</span><span class="p">,</span><span class="w"> </span><span class="n">format</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"%V"</span><span class="p">)</span><span class="w">
</span><span class="n">dd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">Kitchen</span><span class="p">,</span><span class="w"> </span><span class="n">Fika</span><span class="p">,</span><span class="w"> </span><span class="n">start</span><span class="p">,</span><span class="w"> </span><span class="n">end</span><span class="p">,</span><span class="w"> </span><span class="n">week</span><span class="p">)</span><span class="w">
</span><span class="n">write.table</span><span class="p">(</span><span class="n">dd</span><span class="p">,</span><span class="w"> </span><span class="s2">"kitchen-schedule_week43-51.csv"</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"\t"</span><span class="p">,</span><span class="w"> </span><span class="n">row.names</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">)</span><span class="w">
</span><span class="n">sample</span><span class="p">(</span><span class="n">d</span><span class="o">$</span><span class="n">Name</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">pdf</span><span class="p">(</span><span class="s2">"kitchen-schedule_week43-51.pdf"</span><span class="p">,</span><span class="w"> </span><span class="n">width</span><span class="o">=</span><span class="m">11.69</span><span class="p">,</span><span class="w"> </span><span class="n">height</span><span class="o">=</span><span class="m">8.27</span><span class="p">)</span><span class="w">
</span><span class="n">g</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tableGrob</span><span class="p">(</span><span class="n">dd</span><span class="p">,</span><span class="w"> </span><span class="n">rows</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">,</span><span class="w"> </span><span class="n">theme</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ttheme_default</span><span class="p">(</span><span class="n">base_size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">16</span><span class="p">,</span><span class="w">
</span><span class="n">padding</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">unit</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">20</span><span class="p">,</span><span class="w"> </span><span class="m">12</span><span class="p">),</span><span class="w"> </span><span class="s2">"mm"</span><span class="p">)))</span><span class="w">
</span><span class="n">grid.newpage</span><span class="p">()</span><span class="w">
</span><span class="n">grid.draw</span><span class="p">(</span><span class="n">g</span><span class="p">)</span><span class="w">
</span><span class="n">dev.off</span><span class="p">()</span><span class="w">
</span></code></pre></div></div>
<h2 id="final-thoughts">Final thoughts</h2>
<p>If you are ever asked to <strong>volunteer for creating duty schedules</strong>, do not hesitate to agree. It will cost you very little. Just modify this script for your needs and generate a duty schedule in R with one click.</p>
<p><em>If you have any questions or suggestions, feel free to <a href="mailto:dmytro.kryvokhyzha@evobio.eu">email me</a></em>.</p>By creating a duty schedule in R you avoid bias and save time. You just provide a list of people and get a PDF of the duty schedule.How to change Docker storage location2019-06-23T11:43:00+00:002019-06-23T11:43:00+00:00https://evodify.com/change-docker-storage-location<p>It happened to me several times that I didn’t have enough space in my root partition to store Docker containers and I had to move the Docker default storage location to another partition. In this post, I wrote down how to do that for my readers and my future self :)</p>
<p>Docker containers are relatively large (> 1G) and by default Docker stores all containers in <code class="language-plaintext highlighter-rouge">/var/lib/docker</code>, which is located in the root partition of your Linux system. I usually have separate root and home partitions, and given that Linux doesn’t take much space, I allocate 15-30G for my root partition. This turned out to be too little to work with Docker, so I had to move the Docker storage location to another, larger partition. However, that turned out not to be easy.</p>
<h2 id="do-not-do-this-to-move-docker-storage-location">Do NOT do this to move Docker storage location</h2>
<p>These two solutions could have worked in the past as you may often find them online, but neither of them worked for me with Ubuntu-based Linux distros in 2018-2019 (Docker version > 17).</p>
<h3 id="1-symlink">1. Symlink</h3>
<p>The first obvious idea was to symlink the default storage location to another partition:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo ln</span> <span class="nt">-s</span> /mnt/newlocation /var/lib/docker
</code></pre></div></div>
<h3 id="2-docker_opts">2. DOCKER_OPTS</h3>
<p>Another often posted solution is to stop Docker:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>systemctl stop docker
</code></pre></div></div>
<p>Edit the <code class="language-plaintext highlighter-rouge">/etc/default/docker</code> file by adding the new location with the <code class="language-plaintext highlighter-rouge">-g</code> option in the <code class="language-plaintext highlighter-rouge">DOCKER_OPTS</code> line:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">DOCKER_OPTS</span><span class="o">=</span><span class="s2">"--dns 8.8.8.8 --dns 8.8.4.4 -g /mnt/newlocation"</span>
</code></pre></div></div>
<p>Then start Docker again:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>systemctl start docker
</code></pre></div></div>
<p>After that Docker should use <code class="language-plaintext highlighter-rouge">/mnt/newlocation</code> as a new storage location.</p>
<p><strong>UPDATE</strong>: It seems the <strong>DOCKER_OPTS</strong> solution may work if you add the <code class="language-plaintext highlighter-rouge">$DOCKER_OPTS</code> variable to the <em>systemd</em> script <code class="language-plaintext highlighter-rouge">/lib/systemd/system/docker.service</code>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">ExecStart</span><span class="o">=</span>/usr/bin/dockerd <span class="nt">-H</span> fd:// <span class="nv">$DOCKER_OPTS</span>
</code></pre></div></div>
<p>However, I have not tested it because the solution I describe below is simpler and probably more correct.</p>
<h2 id="change-docker-storage-location-the-right-way">Change Docker storage location: THE RIGHT WAY!</h2>
<p>Luckily, the right way to change Docker storage location was not more complicated than the two non-working options I have described above.</p>
<p>You need to create a JSON file <code class="language-plaintext highlighter-rouge">/etc/docker/daemon.json</code> with the content pointing to the new storage location:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">{</span>
<span class="s2">"data-root"</span>: <span class="s2">"/mnt/newlocation"</span>
<span class="o">}</span>
</code></pre></div></div>
<p>You can read more about <code class="language-plaintext highlighter-rouge">daemon.json</code> in <a href="https://docs.docker.com/config/daemon/#docker-daemon-directory" target="_blank">Docker docs</a>.</p>
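<p>A minimal sketch of how the file could be prepared, assuming you want to review it before it reaches <code class="language-plaintext highlighter-rouge">/etc/docker</code>. The staging directory and the variable names are hypothetical; only the <code class="language-plaintext highlighter-rouge">data-root</code> key and the paths come from this post:</p>

```shell
# Sketch: stage daemon.json in a scratch directory before installing it.
new_root="/mnt/newlocation"      # storage location used in this post; adjust to your partition
scratch="$(mktemp -d)"           # hypothetical staging directory

# Write the same content as shown above.
cat > "$scratch/daemon.json" <<EOF
{
  "data-root": "$new_root"
}
EOF

# Show the result for review.
cat "$scratch/daemon.json"

# After reviewing the file, install it (requires root):
#   sudo mkdir -p "$new_root" /etc/docker
#   sudo cp "$scratch/daemon.json" /etc/docker/daemon.json
```

<p>If <code class="language-plaintext highlighter-rouge">/etc/docker/daemon.json</code> already exists on your system, copying over it would discard its other settings, so in that case it is probably safer to add the <code class="language-plaintext highlighter-rouge">data-root</code> key to the existing file by hand.</p>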
<p>Then, restart Docker or reboot the system:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>systemctl restart docker
</code></pre></div></div>
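<p>To confirm that the restart picked up the new setting, you can ask Docker which root directory it reports. This is only a sketch: the helper <code class="language-plaintext highlighter-rouge">root_matches</code> and the variable names are hypothetical, and the check is skipped when the <code class="language-plaintext highlighter-rouge">docker</code> command is not available:</p>

```shell
# Sketch: compare the directory Docker reports with the one set in daemon.json.
expected_root="/mnt/newlocation"   # the location used in this post

# Hypothetical helper: succeeds when the two paths are identical.
root_matches() {
    [ "$1" = "$2" ]
}

if command -v docker >/dev/null 2>&1; then
    # DockerRootDir is the field that "docker info" prints as "Docker Root Dir".
    actual_root="$(docker info --format '{{ .DockerRootDir }}' 2>/dev/null)"
    if root_matches "$actual_root" "$expected_root"; then
        echo "Docker stores its data in $actual_root"
    else
        echo "Docker still reports: $actual_root"
    fi
else
    echo "docker command not found; skipping the check"
fi
```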
<p>If you get an error during the restart, check the syntax of <code class="language-plaintext highlighter-rouge">daemon.json</code>. JSON ignores indentation, but a missing quote, colon, or brace, or a stray trailing comma, will prevent Docker from starting. If Docker restarts fine, this new setting will make Docker place all new containers in the new location. However, old containers will stay in <code class="language-plaintext highlighter-rouge">/var/lib/docker</code>. I recommend removing all old containers:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker system prune <span class="nt">-a</span>
</code></pre></div></div>
<p>And download them again:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker pull <container-name>
</code></pre></div></div>
<h2 id="final-thoughts">Final thoughts</h2>
<p>It is unfortunate that this simple solution with <code class="language-plaintext highlighter-rouge">daemon.json</code> was not the first I found when I tried to fix the “not enough space” issue due to Docker images taking too much space. So, I hope this blog post will save some time for users who need to change their Docker storage location.</p>
<p><em>If you have any questions or suggestions, feel free to <a href="mailto:dmytro.kryvokhyzha@evobio.eu">email me</a></em>.</p>You can easily change the Docker default storage location by creating the daemon.json file and pointing to another location in that file.RNA-Seq STAR mapping with Snakemake2019-04-18T16:47:00+00:002019-04-18T16:47:00+00:00https://evodify.com/rna-seq-star-snakemake<p>I have described my pipelines for genotype calling in both <a href="https://evodify.com/gatk-in-non-model-organism/">non-model</a> and <a href="https://evodify.com/genomic-variant-calling-pipeline/">model organisms</a>. I also showed how one can automate a genotype calling pipeline with <a href="https://evodify.com/genomic-variant-calling-pipeline/">automatically generated sbatch scripts that handle dependencies between jobs for the Slurm Workload Manager</a>. I used a python script for that but I mentioned that probably it was not the most efficient way and using Nextflow or <strong>Snakemake would probably be a better option</strong>. I finally got my hands on Snakemake when I was working on my RNA-Seq mapping pipeline. You can read the description of this pipeline below and <strong>you can also get my Snakemake file</strong> at the end of this post to run this pipeline with your data.</p>
<h2 id="rna-seq-star-mapping-pipeline">RNA-Seq STAR mapping pipeline</h2>
<p>There are many different mapping tools for RNA-Seq data. The choice is always difficult.
For example, I used <a href="https://dx.doi.org/10.1101%2Fgr.111120.110" target="_blank">stampy</a> for RNA-seq mapping in my <a href="/research/">Capsella project</a>. The reason behind this choice was that we performed an <a href="http://dx.plos.org/10.1371/journal.pgen.1007949" target="_blank">allele-specific expression analysis</a> with the DNA count data as a null distribution. Therefore, to keep the consistency between the two datasets, I used the same aligner. In addition, stampy is not a bad aligner for RNA-Seq data and my favorite aligner for divergent reads in the <a href="https://evodify.com/gatk-in-non-model-organism/">genotyping pipeline</a>.</p>
<p>However, for my current dog projects, I chose to use the <a href="https://github.com/alexdobin/STAR" target="_blank">STAR aligner</a>. It is a splice-aware aligner and, what is particularly important for large projects, one of the fastest aligners. I also use STAR in the multi-sample 2-pass mapping mode, which maps spliced reads better (see the STAR documentation).</p>
<p>The whole pipeline consists of STAR 2-pass alignment and read counting with HTSeq:</p>
<ol>
<li>
<p><a href="#1-index-the-reference-genome">Index the reference genome</a></p>
</li>
<li>
<p><a href="#2-run-the-mapping">Map reads to the reference genome (2-pass mode)</a></p>
<p>2.1. <a href="#21-pass1-star-mapping">Standard STAR mapping.</a></p>
<p>2.2. <a href="#22-filter-and-collect-the-splicing-information">Collect the junctions information from all samples.</a></p>
<p>2.3. <a href="#23-pass2-star-mapping">Use new junctions from all samples for the 2nd pass mapping.</a></p>
</li>
<li>
<p><a href="#3-counting-the-number-of-reads-per-gene">Count the number of reads mapped to each gene</a>.</p>
</li>
</ol>
<p>All these STAR mapping steps can be automated with Snakemake as <a href="#snakemake-star-pipeline">you will see below</a>.</p>
<h3 id="1-index-the-reference-genome">1. Index the reference genome</h3>
<p>STAR needs to use its own index files during mapping. These index files are quite large. For example, for the dog reference genome, all STAR index files take up 23 GB, while the actual FASTA file is only 2.3 GB. But I believe that it is these large index files that allow STAR to perform alignment so fast.</p>
<p>So, to index the reference, you need to execute this code:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">mkdir </span>canFam3STAR
STAR <span class="nt">--runThreadN</span> 20 <span class="se">\</span>
<span class="nt">--runMode</span> genomeGenerate <span class="se">\</span>
<span class="nt">--genomeDir</span> canFam3STAR <span class="se">\</span>
<span class="nt">--genomeFastaFiles</span> canFam3.fa <span class="se">\</span>
<span class="nt">--sjdbGTFfile</span> canFam3.gtf <span class="se">\</span>
<span class="nt">--sjdbOverhang</span> 100
</code></pre></div></div>
<p>I think these options are self-explanatory. <code class="language-plaintext highlighter-rouge">--runThreadN</code> indicates the number of cores to be used. <code class="language-plaintext highlighter-rouge">--sjdbOverhang</code> can be specified as ReadLength-1. You can also use 100, which is recommended as a generally good value in the STAR documentation. <code class="language-plaintext highlighter-rouge">canFam3</code> is the reference name for both the FASTA and the GTF file. You need to change this name to that of your reference in all commands below.</p>
<p>If you have only a GFF annotation, you can convert GFF to GTF with <a href="http://cole-trapnell-lab.github.io/cufflinks/file_formats/" target="_blank">Cufflinks</a>:</p>
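<p>If you prefer ReadLength-1 over the default of 100, the read length can be taken from the FASTQ file itself. The sketch below uses a toy, uncompressed FASTQ record as a stand-in for real data; the file and variable names are hypothetical:</p>

```shell
# Sketch: derive --sjdbOverhang as ReadLength-1 from the first read of a FASTQ file.
fastq="$(mktemp)"                 # toy stand-in for a real, uncompressed FASTQ file
printf '@read1\nACGTACGTAC\n+\nIIIIIIIIII\n' > "$fastq"

# The sequence is the second line of each four-line FASTQ record.
read_length=$(awk 'NR == 2 { print length($0); exit }' "$fastq")
overhang=$((read_length - 1))
echo "--sjdbOverhang $overhang"

# For gzip-compressed reads, feed awk from zcat instead:
#   zcat Sample1_001_R1.fastq.gz | awk 'NR == 2 { print length($0); exit }'
```

<p>This only looks at the first read, so it assumes that all reads have the same length. If your reads were trimmed to different lengths, the first read may not be representative.</p>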
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gffread canFam3.1.92.gff3 <span class="nt">-T</span> <span class="nt">-o</span> canFam3.gtf
</code></pre></div></div>
<h3 id="2-run-the-mapping">2. Run the mapping</h3>
<p>You can run the <a href="#21-pass1-star-mapping">standard 1-pass STAR mapping</a> and the results should be good overall. However, given that STAR is very fast, running the 2-pass mode does not take too long and it can improve the mapping to novel junctions. Basically, you run the first STAR pass to discover junction information, then you collect and filter that information from all samples and run the second pass using it.</p>
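<p>The order of the steps matters: every sample has to finish the first pass before any sample starts the second one. The sketch below only prints that order for two hypothetical samples and does not run STAR:</p>

```shell
# Sketch: print (not run) the order of the multi-sample 2-pass steps.
samples="Sample1 Sample2"        # hypothetical sample names

plan=""
for s in $samples; do
    plan="${plan}pass1 ${s}: STAR mapping, keep ${s}_pass1/SJ.out.tab
"
done
plan="${plan}collect: filter SJ.out.tab from all samples
"
for s in $samples; do
    plan="${plan}pass2 ${s}: STAR mapping with --sjdbFileChrStartEnd listing all filtered files
"
done

printf '%s' "$plan"
```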
<h4 id="21-pass1-star-mapping">2.1. Pass1 STAR mapping</h4>
<p>The first pass of STAR mapping is a standard run that outputs an alignment and splice junction information.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">mkdir </span>Sample1_pass1
<span class="nb">cd </span>Sample1_pass1
STAR <span class="nt">--runThreadN</span> 20 <span class="se">\</span>
<span class="nt">--genomeDir</span> /path/to/canFam3STAR <span class="se">\</span>
<span class="nt">--readFilesIn</span> /path/to/Sample1_001_R1.fastq.gz,/path/to/Sample1_002_R1.fastq.gz /path/to/Sample1_001_R2.fastq.gz,/path/to/Sample1_002_R2.fastq.gz <span class="se">\</span>
<span class="nt">--readFilesCommand</span> zcat <span class="se">\</span>
<span class="nt">--outSAMtype</span> BAM Unsorted
</code></pre></div></div>
<p>Again, most of the options are self-explanatory. <code class="language-plaintext highlighter-rouge">--readFilesCommand zcat</code> is needed to decompress <em>gz</em>-compressed reads. <code class="language-plaintext highlighter-rouge">--outSAMtype BAM Unsorted</code> outputs an unsorted BAM instead of the default SAM, which saves disk space. If your sample was sequenced on several lanes, you can list these files with comma separation in <code class="language-plaintext highlighter-rouge">--readFilesIn</code> as I did above.</p>
<p>This command will produce several output files, among which we are mostly interested in the splice junction information file <code class="language-plaintext highlighter-rouge">SJ.out.tab</code> that will be used in the next step. So, I discard the alignment BAM file because it takes too much disk space.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">rm </span>Sample1_pass1/Aligned.out.bam
</code></pre></div></div>
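<p>With many lanes per sample, typing the comma-separated lists for <code class="language-plaintext highlighter-rouge">--readFilesIn</code> by hand gets tedious. Here is a sketch of how they could be built, using empty toy files in a temporary directory as a stand-in for <code class="language-plaintext highlighter-rouge">/path/to</code>; the variable names are hypothetical:</p>

```shell
# Sketch: build the comma-separated lane lists for --readFilesIn.
fastq_dir="$(mktemp -d)"         # toy stand-in for /path/to
touch "$fastq_dir/Sample1_001_R1.fastq.gz" "$fastq_dir/Sample1_002_R1.fastq.gz" \
      "$fastq_dir/Sample1_001_R2.fastq.gz" "$fastq_dir/Sample1_002_R2.fastq.gz"

# ls sorts the names, so lanes appear in the same order in both lists.
r1_list=$(ls "$fastq_dir"/Sample1_*_R1.fastq.gz | paste -sd, -)
r2_list=$(ls "$fastq_dir"/Sample1_*_R2.fastq.gz | paste -sd, -)

echo "--readFilesIn $r1_list $r2_list"
```

<p>The two lists must name the lanes in the same order, otherwise the mates will not be paired correctly. The sketch also assumes that the file names contain no spaces.</p>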
<h4 id="22-filter-and-collect-the-splicing-information">2.2 Filter and collect the splicing information</h4>
<p>To filter poorly supported junctions, I keep only the junctions that are supported by at least 3 uniquely mapped reads:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">mkdir </span>pass1SJ
<span class="k">for </span>i <span class="k">in </span>Sample<span class="k">*</span>pass1/SJ.out.tab
<span class="k">do
</span><span class="c"># name each output after its sample directory so that samples do not overwrite each other</span>
<span class="nb">awk</span> <span class="s1">'{ if ($7 >= 3) print $0}'</span> <span class="nv">$i</span> <span class="o">></span> pass1SJ/<span class="si">$(</span><span class="nb">dirname</span> <span class="nv">$i</span><span class="si">)</span>.SJ.filtered.tab
<span class="k">done</span>
</code></pre></div></div>
<p>I think it is really difficult to verify splicing information. So, this filtering is rather subjective and can be skipped. I use it simply because of my gut feeling 🙂.</p>
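<p>To get a feeling for how strict this filter is, you can compare the number of junctions before and after filtering. The sketch below does that on a toy three-line table that stands in for a real <code class="language-plaintext highlighter-rouge">SJ.out.tab</code>; the values are made up for illustration:</p>

```shell
# Sketch: count how many junctions pass the ">= 3 uniquely mapped reads" filter.
sj="$(mktemp)"                   # toy stand-in for SJ.out.tab
# Columns: chr, start, end, strand, motif, annotated,
#          uniquely mapped reads, multi-mapped reads, max overhang
printf 'chr1\t100\t200\t1\t1\t1\t5\t0\t30\n'  >  "$sj"
printf 'chr1\t300\t400\t1\t1\t0\t2\t4\t25\n'  >> "$sj"
printf 'chr2\t500\t900\t2\t2\t0\t3\t1\t40\n'  >> "$sj"

total=$(wc -l < "$sj")
kept=$(awk '$7 >= 3' "$sj" | wc -l)
echo "kept $((kept)) of $((total)) junctions"
```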
<h4 id="23-pass2-star-mapping">2.3 Pass2 STAR mapping</h4>
<p>Now, we just execute almost the same mapping command as at <a href="#21-pass1-star-mapping">step 2.1</a> but add the information on the discovered splicing (<code class="language-plaintext highlighter-rouge">--sjdbFileChrStartEnd</code>). I also prefer to add read group information (<code class="language-plaintext highlighter-rouge">--outSAMattrRGline</code>) at this step. It is not necessary for read counting but it may be useful in the future if I decide to use these STAR-generated BAM files for other analyses.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">mkdir </span>Sample1pass2
<span class="nb">cd </span>Sample1pass2
STAR <span class="nt">--runThreadN</span> 20 <span class="se">\</span>
<span class="nt">--genomeDir</span> /path/to/canFam3STAR <span class="se">\</span>
<span class="nt">--readFilesIn</span> /path/to/Sample1_001_R1.fastq.gz,/path/to/Sample1_002_R1.fastq.gz /path/to/Sample1_001_R2.fastq.gz,/path/to/Sample1_002_R2.fastq.gz <span class="se">\</span>
<span class="nt">--readFilesCommand</span> zcat <span class="se">\</span>
<span class="nt">--outSAMtype</span> BAM SortedByCoordinate <span class="se">\</span>
<span class="nt">--outSAMattrRGline</span> ID:Dog_MT2 <span class="se">\</span>
<span class="nt">--sjdbFileChrStartEnd</span> /path/to/pass1SJ/Sample1_pass1.SJ.filtered.tab ... /path/to/pass1SJ/SampleN_pass1.SJ.filtered.tab
</code></pre></div></div>
<h3 id="3-counting-the-number-of-reads-per-gene">3. Counting the number of reads per gene.</h3>
<p>You can count the number of reads per gene on the fly during the STAR mapping if you provide the option <code class="language-plaintext highlighter-rouge">--quantMode GeneCounts</code>. However, I prefer to count reads with <code class="language-plaintext highlighter-rouge">htseq-count</code> and use the option <code class="language-plaintext highlighter-rouge">-m union</code> to deal with overlapping features. You can see what the option <code class="language-plaintext highlighter-rouge">-m union</code> means in the image below.</p>
<figure class="caption"><img src="/assets/posts/2019-04-18-rna-seq-star-snakemake/htseq-count_union_option.jpeg" alt="Count reads with htseq-count and the option union" />
<figcaption class="caption">Different ways to count reads overlapping features with htseq-count (<a href="https://htseq.readthedocs.io/en/release_0.11.1/count.html" target="_blank">source</a>).</figcaption>
</figure>
<p>And here is the command I use:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>htseq-count <span class="nt">-m</span> union <span class="nt">-s</span> no <span class="nt">-t</span> gene <span class="nt">-i</span> ID <span class="nt">-f</span> bam input.bam canFam3.gff &> output.log
<span class="nb">grep </span>gene output.log | <span class="nb">sed</span> <span class="s1">'s/gene://g'</span> <span class="o">></span> counts.csv
</code></pre></div></div>
<p>The second line extracts only the lines with counts per gene and cleans it by removing the string <code class="language-plaintext highlighter-rouge">gene:</code>.</p>
<p>The resulting file looks like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ENSCAFG00000000001 209
ENSCAFG00000000002 1
ENSCAFG00000000003 93
ENSCAFG00000000004 531
ENSCAFG00000000005 432
</code></pre></div></div>
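<p>The cleaning step can be tried out on a toy log before running it on real output. The log content below is an assumption made for illustration, not real <code class="language-plaintext highlighter-rouge">htseq-count</code> output. Unlike the command above, this sketch anchors both patterns to the start of the line, so a message that merely contains the word "gene" would not be kept:</p>

```shell
# Sketch: extract per-gene counts from a toy htseq-count log.
log="$(mktemp)"                  # toy stand-in for output.log
printf '100000 GFF lines processed.\n'    >  "$log"
printf 'gene:ENSCAFG00000000001\t209\n'   >> "$log"
printf 'gene:ENSCAFG00000000002\t1\n'     >> "$log"
printf '__no_feature\t57\n'               >> "$log"

counts="$(mktemp)"               # toy stand-in for counts.csv
# Keep only lines that start with "gene:" and strip that prefix.
grep '^gene:' "$log" | sed 's/^gene://' > "$counts"
cat "$counts"
```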
<p>Finally, you can merge all files into one table:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for </span>i <span class="k">in</span> <span class="k">*</span>csv<span class="p">;</span> <span class="k">do </span><span class="nb">sed</span> <span class="nt">-i</span> <span class="s2">"1igene</span><span class="se">\t</span><span class="nv">$i</span><span class="s2">"</span> <span class="nv">$i</span> <span class="p">;</span> <span class="k">done</span> <span class="c"># add column names</span>
<span class="nv">N</span><span class="o">=</span><span class="k">$((</span><span class="si">$(</span><span class="nb">ls</span> <span class="nt">-l</span> <span class="k">*</span>.csv | <span class="nb">wc</span> <span class="nt">-l</span><span class="si">)</span><span class="o">*</span><span class="m">2</span><span class="k">))</span> <span class="c"># count number of files</span>
<span class="nb">paste</span> <span class="k">*</span>csv | <span class="nb">cut</span> <span class="nt">-f</span> 1,<span class="si">$(</span><span class="nb">seq</span> <span class="nt">-s</span>, 2 2 <span class="nv">$N</span><span class="si">)</span> <span class="o">></span> all_HTSeq.csv <span class="c"># merge and keep only one column with gene names</span>
</code></pre></div></div>
<p>This table will have the following format:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gene sample1 sample2 sample3
ENSCAFG00000000001 209 235 167
ENSCAFG00000000002 0 4 7
ENSCAFG00000000003 57 10 38
ENSCAFG00000000004 1243 1298 156
ENSCAFG00000000005 23 67 49
</code></pre></div></div>
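<p>The column arithmetic in the merge command may not be obvious: <code class="language-plaintext highlighter-rouge">paste</code> puts the two columns of every file side by side, so the counts end up in columns 2, 4, 6 and so on, and the gene names are taken from column 1. Here is a sketch with three toy files whose names and values are made up; it assumes GNU <code class="language-plaintext highlighter-rouge">seq</code> and that all files list the same genes in the same order:</p>

```shell
# Sketch: merge toy count files and keep a single gene column.
workdir="$(mktemp -d)"
printf 'gene\ts1\ngeneA\t1\ngeneB\t2\n' > "$workdir/s1.csv"
printf 'gene\ts2\ngeneA\t3\ngeneB\t4\n' > "$workdir/s2.csv"
printf 'gene\ts3\ngeneA\t5\ngeneB\t6\n' > "$workdir/s3.csv"

n_files=$(ls "$workdir"/*.csv | wc -l)
last_col=$((n_files * 2))                 # each file contributes two columns
cols="1,$(seq -s, 2 2 "$last_col")"       # gene names plus every even column
echo "$cols"

# The output gets a different extension so that it does not match *.csv.
paste "$workdir"/*.csv | cut -f "$cols" > "$workdir/merged.tsv"
cat "$workdir/merged.tsv"
```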
<h2 id="snakemake-star-pipeline">Snakemake STAR pipeline</h2>
<p>All the commands above (except the last one that can be run locally) can be put together into a <a href="/assets/posts/2019-04-18-rna-seq-star-snakemake/Snakefile">Snakemake file</a>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SAMPLES, <span class="o">=</span> glob_wildcards<span class="o">(</span><span class="s1">'/path/to/fastq/{sample}_L001_R1.fastq.gz'</span><span class="o">)</span>
rule allout:
input:
<span class="s1">'canFam3STAR'</span>,
<span class="nb">expand</span><span class="o">(</span><span class="s1">'{sample}_pass1/SJ.out.tab'</span>, <span class="nv">sample</span><span class="o">=</span>SAMPLES<span class="o">)</span>,
<span class="s1">'SJ'</span>,
<span class="nb">expand</span><span class="o">(</span><span class="s1">'SJ/{sample}_pass1SJ.filtered.tab'</span>, <span class="nv">sample</span><span class="o">=</span>SAMPLES<span class="o">)</span>,
<span class="nb">expand</span><span class="o">(</span><span class="s1">'{sample}_pass2/Aligned.sortedByCoord.out.bam'</span>, <span class="nv">sample</span><span class="o">=</span>SAMPLES<span class="o">)</span>,
<span class="nb">expand</span><span class="o">(</span><span class="s1">'{sample}_HTSeq_union_gff3_no_gene_ID.log'</span>, <span class="nv">sample</span><span class="o">=</span>SAMPLES<span class="o">)</span>,
<span class="nb">expand</span><span class="o">(</span><span class="s1">'{sample}_HTSeq.csv'</span>, <span class="nv">sample</span><span class="o">=</span>SAMPLES<span class="o">)</span>
rule index:
input:
fa <span class="o">=</span> <span class="s1">'canFam3.fa'</span>, <span class="c"># provide your reference FASTA file</span>
gtf <span class="o">=</span> <span class="s1">'canFam3.gtf'</span> <span class="c"># provide your GTF file</span>
output:
directory<span class="o">(</span><span class="s1">'canFam3STAR'</span><span class="o">)</span> <span class="c"># you can rename the index folder</span>
threads: 20 <span class="c"># set the maximum number of available cores</span>
shell:
<span class="s1">'mkdir {output} && '</span>
<span class="s1">'STAR --runThreadN {threads} '</span>
<span class="s1">'--runMode genomeGenerate '</span>
<span class="s1">'--genomeDir {output} '</span>
<span class="s1">'--genomeFastaFiles {input.fa} '</span>
<span class="s1">'--sjdbGTFfile {input.gtf} '</span>
<span class="s1">'--sjdbOverhang 100'</span>
rule pass1:
input:
R1L1 <span class="o">=</span> <span class="s1">'fastq/{sample}/{sample}_L001_R1.fastq.gz'</span>, <span class="c"># may need adjustment if your fastq file name format is different</span>
R1L2 <span class="o">=</span> <span class="s1">'fastq/{sample}/{sample}_L002_R1.fastq.gz'</span>, <span class="c"># note each sample has 4 fastq files: 2 lanes x 2 paired-end reads</span>
R2L1 <span class="o">=</span> <span class="s1">'fastq/{sample}/{sample}_L001_R2.fastq.gz'</span>,
R2L2 <span class="o">=</span> <span class="s1">'fastq/{sample}/{sample}_L002_R2.fastq.gz'</span>,
refdir <span class="o">=</span> directory<span class="o">(</span><span class="s1">'canFam3STAR'</span><span class="o">)</span>
params:
outdir <span class="o">=</span> <span class="s1">'{sample}_pass1'</span>,
rmbam <span class="o">=</span> <span class="s1">'{sample}_pass1/Aligned.out.bam'</span>
output:
<span class="s1">'{sample}_pass1/SJ.out.tab'</span>
threads: 20 <span class="c"># set the maximum number of available cores</span>
shell:
<span class="s1">'rm -rf {params.outdir} &&'</span> <span class="c"># be careful with this. I don't know why, but Snakemake had problems without this cleaning.</span>
<span class="s1">'mkdir {params.outdir} && '</span> <span class="c"># snakemake had problems finding output files with --outFileNamePrefix, so I used this approach instead</span>
<span class="s1">'cd {params.outdir} && '</span>
<span class="s1">'STAR --runThreadN {threads} '</span>
<span class="s1">'--genomeDir {input.refdir} '</span>
<span class="s1">'--readFilesIn {input.R1L1},{input.R1L2} {input.R2L1},{input.R2L2} '</span>
<span class="s1">'--readFilesCommand zcat '</span>
<span class="s1">'--outSAMtype BAM Unsorted && rm {params.rmbam} && cd ..'</span>
rule SJdir:
output:
directory<span class="o">(</span><span class="s1">'SJ'</span><span class="o">)</span>
threads: 1
shell:
<span class="s1">'mkdir {output}'</span>
rule filter:
input:
<span class="s1">'{sample}_pass1/SJ.out.tab'</span>,
directory<span class="o">(</span><span class="s1">'SJ'</span><span class="o">)</span>
output:
<span class="s1">'SJ/{sample}_pass1SJ.filtered.tab'</span>
threads: 1
shell:
<span class="s1">'''awk "{ { if (\$7 >= 3) print \$0 } }" {input[0]} > {input[0]}.filtered && '''</span>
<span class="s1">'mv {input[0]}.filtered {output}'</span>
rule pass2:
input:
R1L1 <span class="o">=</span> <span class="s1">'fastq/{sample}/{sample}_L001_R1.fastq.gz'</span>,
R1L2 <span class="o">=</span> <span class="s1">'fastq/{sample}/{sample}_L002_R1.fastq.gz'</span>,
R2L1 <span class="o">=</span> <span class="s1">'fastq/{sample}/{sample}_L001_R2.fastq.gz'</span>,
R2L2 <span class="o">=</span> <span class="s1">'fastq/{sample}/{sample}_L002_R2.fastq.gz'</span>,
SJfiles <span class="o">=</span> <span class="s1">'SJ/{sample}_pass1SJ.filtered.tab'</span>,
refdir <span class="o">=</span> directory<span class="o">(</span><span class="s1">'canFam3STAR'</span><span class="o">)</span>
params:
outdir <span class="o">=</span> <span class="s1">'{sample}_pass2'</span>,
<span class="nb">id</span> <span class="o">=</span> <span class="s1">'{sample}'</span>
output:
<span class="s1">'{sample}_pass2/Aligned.sortedByCoord.out.bam'</span>
threads: 20 <span class="c"># set the maximum number of available cores</span>
shell:
<span class="s1">'rm -rf {params.outdir} &&'</span> <span class="c"># be careful with this. I don't know why, but Snakemake had problems without this cleaning.</span>
<span class="s1">'mkdir {params.outdir} && '</span>
<span class="s1">'cd {params.outdir} && '</span>
<span class="s1">'STAR --runThreadN {threads} '</span>
<span class="s1">'--genomeDir {input.refdir} '</span>
<span class="s1">'--readFilesIn {input.R1L1},{input.R1L2} {input.R2L1},{input.R2L2} '</span>
<span class="s1">'--readFilesCommand zcat '</span>
<span class="s1">'--outSAMtype BAM SortedByCoordinate '</span>
<span class="s1">'--sjdbFileChrStartEnd {input.SJfiles} '</span>
<span class="s1">'--outSAMattrRGline ID:{params.id} '</span>
<span class="s1">'--quantMode GeneCounts '</span>
rule htseq:
input:
bam <span class="o">=</span> <span class="s1">'{sample}_pass2/Aligned.sortedByCoord.out.bam'</span>,
gff <span class="o">=</span> <span class="s1">'canFam3.gff3'</span>
output:
<span class="s1">'{sample}_HTSeq_union_gff3_no_gene_ID.log'</span>,
<span class="s1">'{sample}_HTSeq.csv'</span>
threads: 1
shell:
<span class="s1">'htseq-count -m union -s no -t gene -i ID -r pos -f bam {input.bam} {input.gff} &> {output[0]} && '</span>
<span class="s1">'grep ENS {output[0]} | sed "s/gene://g" > {output[1]}'</span>
</code></pre></div></div>
<p>Read the comments within the code to find the lines you need to change to adapt this Snakemake pipeline to your data.</p>
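<p>The first line of the Snakefile and the <code class="language-plaintext highlighter-rouge">expand()</code> calls in <code class="language-plaintext highlighter-rouge">rule allout</code> decide which files get built, so it helps to see what they amount to. Below is a rough illustration of that logic in plain Python. It is not Snakemake's implementation; the helper names <code class="language-plaintext highlighter-rouge">find_samples</code> and <code class="language-plaintext highlighter-rouge">targets</code> and the example file names are hypothetical.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import re

# The Snakefile pattern '{sample}_L001_R1.fastq.gz' written as a regular
# expression, with the {sample} wildcard as a capture group.
PATTERN = re.compile(r"(.+)_L001_R1\.fastq\.gz")


def find_samples(file_names):
    """Return the sample names of all file names that match PATTERN."""
    matches = (PATTERN.fullmatch(name) for name in file_names)
    return sorted({m.group(1) for m in matches if m})


def targets(template, samples):
    """Fill a '{sample}' template once per sample, as expand() does here."""
    return [template.format(sample=s) for s in samples]


files = [
    "dogA_L001_R1.fastq.gz", "dogA_L001_R2.fastq.gz",
    "dogA_L002_R1.fastq.gz", "dogB_L001_R1.fastq.gz",
]
samples = find_samples(files)
print(samples)
print(targets("{sample}_HTSeq.csv", samples))
</code></pre></div></div>
<p>With these four example files, only the two <code class="language-plaintext highlighter-rouge">_L001_R1</code> files match, so the sample list is <code class="language-plaintext highlighter-rouge">['dogA', 'dogB']</code> and one <code class="language-plaintext highlighter-rouge">_HTSeq.csv</code> target is requested per sample. If your file names follow a different scheme, the pattern in the first line of the Snakefile is the place to adjust.</p>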
<p>Also, depending on your file location and Snakemake version, Snakemake may have problems finding files unless the file names include the absolute path. For example, instead of the relative path <code class="language-plaintext highlighter-rouge">fastq/{sample}_L001_R1.fastq.gz</code>, you may need to use the absolute path <code class="language-plaintext highlighter-rouge">/home/username/RNA-Seq/fastq/{sample}_L001_R1.fastq.gz</code>.</p>
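<p>One way to avoid typing the absolute path by hand is to build it from a single base directory. The sketch below is plain Python, which a Snakefile also accepts. The variable names <code class="language-plaintext highlighter-rouge">FASTQ_DIR</code> and <code class="language-plaintext highlighter-rouge">R1L1</code> are illustrative, and I assume the <code class="language-plaintext highlighter-rouge">fastq/{sample}/</code> layout used in the rules above.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import os

# Resolve the fastq folder once, relative to the current working directory.
FASTQ_DIR = os.path.abspath("fastq")

# Keep {sample} as a literal placeholder so that it can be filled in later.
R1L1 = os.path.join(FASTQ_DIR, "{sample}", "{sample}_L001_R1.fastq.gz")
print(R1L1)
</code></pre></div></div>
<p>The resulting string can then be used wherever the relative pattern was written, so the base directory only needs to be changed in one place.</p>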
<h3 id="run-snakemake-on-a-slurm-cluster-uppmax">Run Snakemake on a Slurm cluster (Uppmax)</h3>
<p>I executed this <code class="language-plaintext highlighter-rouge">Snakemake</code> file on our Slurm cluster (<a href="http://www.uppmax.uu.se/" target="_blank">Uppmax</a>). To do that, I created a Snakemake cluster config file, <a href="/assets/posts/2019-04-18-rna-seq-star-snakemake/cluster.yaml">cluster.yaml</a>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>__default__:
account: snic2019-x-xxx
<span class="nb">time</span>: <span class="s2">"00:01:00"</span>
n: 1
partition: <span class="s2">"core"</span>
index:
<span class="nb">time</span>: <span class="s2">"05:00:00"</span>
n: 20
pass1:
<span class="nb">time</span>: <span class="s2">"01:00:00"</span>
n: 20
pass2:
<span class="nb">time</span>: <span class="s2">"02:00:00"</span>
n: 20
htseq:
<span class="nb">time</span>: <span class="s2">"05:00:00"</span>
</code></pre></div></div>
<p>This config file is used during Snakemake job submission with <code class="language-plaintext highlighter-rouge">--cluster-config cluster.yaml</code>.</p>
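<p>To make the role of <code class="language-plaintext highlighter-rouge">__default__</code> concrete: rule-specific entries override the defaults, and the merged values fill the <code class="language-plaintext highlighter-rouge">{cluster.*}</code> placeholders of the submission command. The following Python sketch mimics that behaviour with plain dictionaries. It is an illustration, not Snakemake's own code, and the function names <code class="language-plaintext highlighter-rouge">resolve</code> and <code class="language-plaintext highlighter-rouge">sbatch_command</code> are hypothetical.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># The cluster.yaml from above, written as a Python dictionary.
CLUSTER = {
    "__default__": {"account": "snic2019-x-xxx", "time": "00:01:00",
                    "n": 1, "partition": "core"},
    "index": {"time": "05:00:00", "n": 20},
    "pass1": {"time": "01:00:00", "n": 20},
    "pass2": {"time": "02:00:00", "n": 20},
    "htseq": {"time": "05:00:00"},
}


def resolve(rule):
    """Start from __default__ and let the rule's own entries override it."""
    settings = dict(CLUSTER["__default__"])
    settings.update(CLUSTER.get(rule, {}))
    return settings


def sbatch_command(rule):
    """Fill the sbatch template with the resolved settings for one rule."""
    template = "sbatch -A {account} -t {time} -p {partition} -n {n}"
    return template.format(**resolve(rule))


print(sbatch_command("pass1"))   # rule with its own time and core count
print(sbatch_command("filter"))  # no own entry, so defaults only
</code></pre></div></div>
<p>Rules without their own entry, such as <code class="language-plaintext highlighter-rouge">filter</code> and <code class="language-plaintext highlighter-rouge">SJdir</code>, therefore run with the one-minute, single-core default, so raise the default time if those steps take longer on your data.</p>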
<p>I first ran this pipeline in dry-run mode with the <code class="language-plaintext highlighter-rouge">--dryrun</code> option:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>snakemake <span class="nt">-s</span> Snakefile <span class="nt">-j</span> 100 <span class="nt">--dryrun</span> <span class="nt">--cluster-config</span> cluster.yaml <span class="nt">--cluster</span> <span class="s2">"sbatch -A {cluster.account} -t {cluster.time} -p {cluster.partition} -n {cluster.n}"</span>
</code></pre></div></div>
<p>If everything works fine in the dry run, you can run this command in regular mode from a login node of the server. However, I prefer to create an sbatch file (see below) and submit this command as a job, which in turn submits all other jobs as defined in the <code class="language-plaintext highlighter-rouge">Snakemake</code> file.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash -l</span>
<span class="c">#SBATCH -A snic2019-x-xxx</span>
<span class="c">#SBATCH -p core</span>
<span class="c">#SBATCH -n 1</span>
<span class="c">#SBATCH -t 1-00:00:00</span>
<span class="c">#SBATCH -J sbatchSnakefile</span>
<span class="c">#SBATCH -e sbatchSnakefile.err</span>
<span class="c">#SBATCH -o sbatchSnakefile.out</span>
snakemake <span class="nt">-s</span> Snakefile <span class="nt">-j</span> 100 <span class="nt">--cluster-config</span> cluster.yaml <span class="nt">--cluster</span> <span class="s2">"sbatch -A {cluster.account} -t {cluster.time} -p {cluster.partition} -n {cluster.n}"</span>
</code></pre></div></div>
<h2 id="conclusion">Conclusion</h2>
<p>Snakemake is a great tool and I am very happy that I have finally started using it. The combination of STAR's speed and Snakemake's workflow management makes the RNA-Seq mapping pipeline fast, robust, and less error-prone. This pipeline has already saved me time with my pilot RNA-Seq experiment, and it will save even more when my new RNA-Seq data arrive.</p>
<p>I hope to update <a href="https://evodify.com/genomic-variant-calling-pipeline/">my genotype calling pipeline</a> to a Snakemake workflow soon.</p>
<p><em>If you have any questions or suggestions, feel free to <a href="mailto:dmytro.kryvokhyzha@evobio.eu">email me</a></em>.</p>