logique car on ne supprime pas de donné...
Arguments ok
#+begin_src
gatk --java-options "-Xmx3g" MarkDuplicates \
--INPUT sorted.bam \
--OUTPUT marked_dups.bam \
--METRICS_FILE marked_dups.bam.metrics \
--TMP_DIR . \
--REFERENCE_SEQUENCE genomeRef.fna \
#+end_src
#+begin_src
$gatk MarkDuplicates \
-I $tmpDir/$post_cleanSam \
-O $tmpDir/$post_markDuplicate \
-M $tmpDir/"$sample"_marked_dup.metrix \
--CREATE_INDEX true \
--VERBOSITY WARNING
#+end_src
***** TODO baserecalibrator
options ok
#+begin_src
gatk --java-options "-Xmx3g" BaseRecalibrator \
--input marked_dups.bam \
--output 63003856_S135.table \
--reference genomeRef.fna \
\
--known-sites dbSNP_common.vcf.gz \
--tmp-dir . \
#+end_src
#+begin_src
$gatk BaseRecalibrator \
-I $tmpDir/$post_markDuplicate \
-R $genomeRef \
--known-sites $dbsnpDir/dbSNP_common.vcf.gz \
-O $tmpDir/$post_BaseRecalibrator
#+end_src
$ cd /Work/Users/apraga/bisonex/out/63003856_S135/preprocessing/baserecalibrator
$ sed 's/sample/63003856_S135/' 63003856_S135.table > 63003856_S135.table2
Les fichiers n'ont pas le même nombre d'erreurs mais assez proches. Sur le premier table, 3 score
#:GATKTable:3:94:%d:%d:%d:;
#:GATKTable:Quantized:Quality quantization map
QualityScore Count QuantizedScore
11 298878631 11
25 542282996 25
34 12846268833 34
vs (référence0)
11 298877785 11
25 542282089 25
34 12846264839 34
***** TODO applybqsr
options ok
#+begin_src
gatk --java-options "-Xmx3g" ApplyBQSR \
--input marked_dups.bam \
--output 63003856_S135.bam \
--reference genomeRef.fna \
--bqsr-recal-file 63003856_S135.table \
\
--tmp-dir . \
#+end_src
#+begin_src
$gatk ApplyBQSR -R $genomeRef \
-I $tmpDir/$post_markDuplicate \
--bqsr-recal-file $tmpDir/$post_BaseRecalibrator \
-O $bamDir/$post_ApplyBQSR \
--verbosity WARNING
#+end_src
missing file
***** TODO haplotypecaller
options ok
#+begin_src
gatk --java-options "-Xmx3g" HaplotypeCaller \
--input 63003856_S135.bam \
--output 63003856_S135.vcf.gz \
--reference genomeRef.fna \
--dbsnp dbSNP.gz \
\
\
--tmp-dir . \
--max-mnp-distance 2
#+end_src
#+begin_src
$gatk --java-options "-Xmx32g" HaplotypeCaller \
-R $genomeRef \
-I $bamDir/$post_ApplyBQSR \
-O $vcfDir/$post_haplotypecaller \
-D "$dbsnpDir"/GCF_000001405.39.gz \
--max-mnp-distance 2 \
--verbosity WARNING
#+end_src
$ grep '^NC' out/63003856_S135/variantCalling/haplotypecaller/63003856_S135.vcf | wc -l
1631935
$ grep '^NC' /Work/Groups/bisonex/ref-vcf/63003856_S135 .vcf | wc -l
1506894
**** variant calling
***** TODO filterDepth : 21 en trop
Nouvelle version (correcton bug markdupicates)
#+begin_src sh
grep '^NC' out/63003856_S135/variantCalling/filter-depth.vcf |wc -l
82054
#+end_src
Alexis
#+begin_src sh
zgrep '^NC' /Work/Groups/bisonex/ref_63003856_S135/63003856_S135_DP_over_30.vcf | wc -l
82033
#+end_src
Ne vient pas du filtre sur la profondeur:
bcftools filter -i 'FORMAT/AD[0:1]<=10' 63003856_S135_DP_over_30.vcf
bcftools filter -i 'FORMAT/DP<=30' 63003856_S135_DP_over_30.vcf
Idem pour notre version. Rien ne sort.
On compare le nombre de lignes
#+begin_src sh
bgzip out/63003856_S135/variantCalling/filter-depth.vcf
tabix /Work/Groups/bisonex/ref_63003856_S135/63003856_S135_DP_over_30.vcf.gz
tabix out/63003856_S135/variantCalling/filter-depth.vcf.gz
bcftools isec out/63003856_S135/variantCalling/filter-depth.vcf.gz /Work/Groups/bisonex/ref_63003856_S135/63003856_S135_DP_over_30.vcf.gz -p compare-filter-depth
find compare-filter-depth/ -type f -exec wc -l {} \;
84763 compare-filter-depth/sites.txt
710 compare-filter-depth/0001.vcf
8 compare-filter-depth/README.txt
85339 compare-filter-depth/0002.vcf
85340 compare-filter-depth/0003.vcf
725 compare-filter-depth/0000.vcf
#+end_src
***** TODO exclude SNP + consensual : 34 en trop !!
$ grep '^NC' out/63003856_S135/variantCalling/filter-polymorphisms.vcf | wc -l
8898
vs
$ grep '^NC' /Work/Groups/bisonex/ref_63003856_S135/63003856_S135_DP_over_30_not_SNP_consensual_sequence.vcf | wc -l
8864
***** TODO Filter technical variants