I would like to share with you my automatic genomic variant calling pipeline. Using such genomic variant calling pipeline becomes essential when a project scales to dozens and hundreds of genomes.
As probably any beginner, I used to process my genomic data with manual interference at every step. So, I would submit mapping jobs for all samples on a computing cluster, when they all done I would submit mark duplicates jobs etc. Moreover, I would also manually write sbatch scripts (my cluster UPPMAX uses the Slurm Workload Manager). It was not efficient.
Well, I used replacements (with
sed) and loops (with
for i in x; do ...) to reduce the amount of work, but there were many manual steps. I managed to process 24-31 small Capsella genomes (~200Mb) this way during my PhD projects. Now, I work with the dog genome which is much bigger (~2.5Gb) and I also need to analyze many more samples (82 genomes at the moment). So, I had to write this genomic variant calling pipeline to make my workflow as automatic as possible.