Goopd reproducibility for "Prod" and test version on helios But there's still a delta with the "reference" version
LD54JIMB46YE2IEWVIGFMN25ACP6YZMR4OIU6XJUND5HBJJOPU4QC
*** TODO 63003856_S135
**** Notes
Actuellement
| Étapes | Production | Test |
|------------------+------------+-----------|
| bwa mem | 128077207 | 128077211 |
| haplotype caller | 1506894 | 1631935 |
| filterdepth | 82033 | 82054 |
#
**** TODO Comparer les versions (log)
**** TODO Comprendre la différence nouvelle version et prod
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
#sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt | uniq > notpatho.new
#sort /Work/Groups/bisonex/data-alexis/dbSNP/ID_of_common_snp_not_clinvar_patho.txt | uniq > notpatho.old
comm -13 notpatho.new notpatho.old > missing-from-old
comm -23 notpatho.new notpatho.old > missing-from-new
wc -l missing-from-old
wc -l missing-from-new
#+end_src
#+RESULTS:
| 75 | missing-from-old |
| 6 | missing-from-new |
Il manque 75 variants et on a 6 en trop
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
bcftools query -i 'ID=@missing-from-old' /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz -f '%CHROM\n' | sort | uniq -c
#+end_src
#+RESULTS:
| 16 | NC_000001.11 |
| 2 | NC_000002.12 |
| 18 | NC_000003.12 |
| 7 | NC_000004.12 |
| 1 | NC_000005.10 |
| 5 | NC_000006.12 |
| 3 | NC_000007.14 |
| 2 | NC_000009.12 |
| 1 | NC_000010.11 |
| 5 | NC_000011.10 |
| 3 | NC_000015.10 |
| 1 | NC_000016.10 |
| 4 | NC_000017.11 |
| 2 | NC_000019.10 |
| 1 | NC_000020.11 |
| 3 | NC_000022.11 |
| 1 | NC_000023.11 |
| 2 | NT_113891.3 |
| 2 | NT_167245.2 |
| 2 | NT_167247.2 |
| 1 | NT_167248.2 |
| 2 | NT_167249.2 |
| 18 | NT_187532.1 |
| 1 | NT_187562.1 |
| 1 | NT_187633.1 |
| 18 | NT_187649.1 |
| 18 | NT_187678.1 |
| 18 | NT_187688.1 |
| 12 | NT_187689.1 |
| 18 | NT_187690.1 |
| 18 | NT_187691.1 |
| 1 | NW_013171807.1 |
Donc la nouvelle version fonctionne mieux !
ON vérifie bien qu'ils sont dans l'ancienne version et la nouvelle:
$ grep -w -f notpatho.diff /Work/Groups/bisonex/data-alexis/dbSNP/ID_of_common_snp_not_clinvar_patho.txt | wc -l
528
$ grep -w -f notpatho.diff /Work/Groups/bisonex/data/d
bSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt
#+end_src
*** TODO Comparer les versions
*** TODO 63003856_S135
**** Notes : bonne reproductibilité sur le cluster mais diff avec la version "de prod"
tester sequential puis version spécifique bwa mem
Note: clinvar est plus ancien dans la version d'Alexis, cela explique les 8884 de la version pseudo-prod (via ID not clinvar patho)
Attention : le nombre de lignes = celles sans commentaires (et pas just NC... car il devrait y avoir les mitochondries !)
$ find . -name filter-depth.vcf -exec sh -c 'echo {}; grep -c -v '^#' {}' \;
Attention, on a un vcf.gz pour la version de test !!! ne pas utiliser le vcf
| | prod | pseudo-prod | pseudo-prod | test | test | test |
|---------------------+------------+-------------+-------------+------------+------------+------------|
| | | parallel | sequential | parallel | parallel | sequential |
| gatk | | 4.2.4.1 | 4.2.4.1 | 4.2.4.1 | 4.3.0 | 4.2.3.1 |
| clinvar | 2022-09-03 | 2022-09-03 | 2022-09-03 | 2022-12-03 | 2022-12-03 | 2022-12-03 |
|---------------------+------------+-------------+-------------+------------+------------+------------|
| bwa mem | 128077207 | 128077211 | 128077210 | 128077211 | 128077211 | 128077210 |
| cleanSam | 128077207 | | 128077210 | 128077211 | 128077211 | 128077210 |
| applybqsr | | 128077211 | | | | |
| haplotypecaller | 1528059 | 1528093 | 1528089 | 1528093 | 1528093 | 1528093 |
| DP_over_30 | 84708 | 84724 | 84725 | 84724 | 84724 | 84725 |
| not_SNP | 11362 | 11377 | 11384 | NA | NA | NA |
| consensual_sequence | 8864 | 8884 | 8886 | 8898 | 8898 | 8900 |
***** Convention
post_cleanSam = _cleaned.bam
post_markDuplicate = _marked_dup.bam
post_BaseRecalibrator = _recal.table
post_ApplyBQSR = _recalibrated_hg38.bam
post_haplotypecaller = .vcf
post_depth_filter = _DP_over_30.vcf
post_exclude_SNP = _DP_over_30_not_SNP
post_consensual = _DP_over_30_not_SNP_consensual_sequence.vcf
post_technical = _DP_over_30_not_SNP_consensual_sequence_not_technical.vcf.gz
On a vérifié que grep -c et grep | wc -l donnent le même résultat
#
**** Gatk 4.2.4 (même version qu'alexis)
***** TODO Variant calling
****** TODO haplotypecaller: mieux mais non identique !
**** DONE Gatk 4.2.4 (même version qu'alexis)
CLOSED: [2023-01-07 Sat 00:05]
***** DONE Variant calling
CLOSED: [2023-01-07 Sat 00:05]
****** DONE haplotypecaller: mieux mais non identique !
CLOSED: [2023-01-07 Sat 00:05]
**** STats (save)
post_cleanSam = _cleaned.bam
post_markDuplicate = _marked_dup.bam
post_BaseRecalibrator = _recal.table
post_ApplyBQSR = _recalibrated_hg38.bam
post_haplotypecaller = .vcf
post_depth_filter = _DP_over_30.vcf
post_exclude_SNP = _DP_over_30_not_SNP
post_consensual = _DP_over_30_not_SNP_consensual_sequence.vcf
post_technical = _DP_over_30_not_SNP_consensual_sequence_not_technical.vcf.gz
2 méthodes complémentaires
$ grep '^NC' /Work/Groups/bisonex/ref-vcf/63003856_S135.vcf | wc -l
1506894
$ grep '^NC' /Work/Groups/bisonex/ref-vcf/63003856_S135.vcf -c
1506894
Attention, on a un vcf.gz pour la version de test !!! ne pas utiliser le vcf
$ find . -name 63003856_S135.vcf.gz -exec sh -c "echo {}; zgrep '^NC' {} | wc -l " \;
./63003856_S135-gatk4.2.4.1/variantCalling/haplotypecaller/63003856_S135.vcf.gz
1506931
./63003856_S135-gatk-4.3.0.0/variantCalling/haplotypecaller/63003856_S135.vcf.gz
1506931
./63003856_S135-sequential-gatk-4.2.4.1/variantCalling/haplotypecaller/63003856_S135.vcf.gz
1506919
$ find . -name filter-depth.vcf -exec grep -H -c '^NC' {} \;
./63003856_S135-gatk4.2.4.1/variantCalling/filter-depth.vcf:82054
./63003856_S135-gatk-4.3.0.0/variantCalling/filter-depth.vcf:82054
./63003856_S135-sequential-gatk-4.2.4.1/variantCalling/filter-depth.vcf:82050
| | prod | pseudo-prod | pseudo-prod | test | test | test |
|------------------------------------------------------+-----------+-------------+-------------+----------+----------+------------|
| | | parallel | sequential | parallel | parallel | sequential |
| gatk | | 4.2.4.1 | | 4.2.4.1 | 4.3.0 | 4.2.3.1 |
|------------------------------------------------------+-----------+-------------+-------------+----------+----------+------------|
| bwa mem | 128077207 | 128077211 | | | | |
| cleanSam | 128077207 | | | | | |
| applybqsr | | 128077211 | | | | |
| haplotypecaller | 1506894 | 1506931 | | 1506931 | 1506931 | 1506919 |
| DP_over_30 | 82033 | 82054 | | 82054 | 82054 | 82050 |
| DP_over_30_not_SNP | 8864 | 8884 | | | | |
| DP_over_30_not_SNP_consensual_sequence | 8864 | 8884 | | 8898 | 8898 | 8900 |
| DP_over_30_not_SNP_consensual_sequence_not_technical | 6478 | | | | | |
tester sequential puis version spécifique bwa mem
On a une différence sur les ID clinvar not patho !!
au final, explique la différence avec le pseudo prod et test (mais pas le prod)
$ grep -c '^NC' ours.recode.vcf
8898
$ grep -c '^NC' lol.recode.recode.vcf
8884