apraga/org - Change DEVV6CAGRPU667LK6ZCLQRWD2KVUCISNZRXPYFKZ6F76BYZLIDXQC

Which sequencer ?

Created by Alexis Praga on June 26, 2023

DEVV6CAGRPU667LK6ZCLQRWD2KVUCISNZRXPYFKZ6F76BYZLIDXQC

Dependencies

In channels

main

Change contents

Replacement in projects/bisonex.org at line 4 [3.35]

B:BD[2.8221] → [2.8221:9022]

∅:D[2.9022] → [4.8221:8456]

B:BD[4.8221] → [4.8221:8456]

∅:D[4.8456] → [5.16413:16868]

B:BD[5.16413] → [5.16413:16868]

∅:D[5.16868] → [6.8265:8931]

B:BD[6.8265] → [6.8265:8931]

∅:D[6.8931] → [7.16138:17667]

∅:D[8.16787] → [7.16138:17667]

B:BD[7.16138] → [7.16138:17667]

∅:D[7.17667] → [9.8259:8295]

B:BD[9.8259] → [9.8259:8295]

∅:D[9.8295] → [10.17784:17837]

B:BD[10.17784] → [10.17784:17837]

∅:D[10.17837] → [11.23419:24829]

B:BD[11.23419] → [11.23419:24829]

∅:D[11.24829] → [12.24592:27599]

B:BD[12.24592] → [12.24592:27599]

ED: [2023-06-14 Wed 22:35]
***** DONE compatibilité hg38
CLOSED: [2023-06-14 Wed 22:35]
**** TODO cache vep
SCHEDULED: <2023-06-17 Sat>
*** HOLD Processing bases de données
**** DONE dbSNP common
**** DONE Seulement les ID dans dbSNP common !
CLOSED: [2022-11-19 Sat 21:42]
172G au lieu de 253M...
**** HOLD common dbSNP not clinvar patho
***** DONE Conclusion partielle
CLOSED: [2022-12-12 Mon 22:25]
- vcfeval : prometteur mais n'arrive pas à traiter toutes les régions
- isec : trop de problèmes avec
- classif clinvar directement dans dbSNP: le plus simple
  Et ça permet de rattraper quelques erreurs dans le script d'Alexis
***** KILL Utiliser directement le numéro dbSNP dans clinvar ? Non
CLOSED: [2022-11-20 Sun 19:51]
Ex: chr20
#+begin_src sh :dir ~/code/bisonex/t
est_isec
bcftools query -f 'rs%INFO/RS \n' -i 'INFO/RS != "." & INFO/CLNSIG="Pathogenic"' clinvar_chr20.vcf.gz | sort > ID_clinvar_patho.txt
bcftools query -f '%ID\n' dbSNP_common_chr20.vcf.gz | sort > ID_of_common_snp.txt
comm -23 
ID_of_common_snp.txt ID_clinvar_patho.txt > ID_of_common_snp_not_clinvar_patho.txt
wc -l ID_of_common_snp_not_clinvar_patho.txt
# sort ID
#+end_src
#+RESULTS:
: 518846 ID_of_common_snp_not_clinvar_patho.txt
Version d'alexis
#+begin_src sh :dir ~/code/bisonex/test_isec
snp=dbSNP_common_chr20.vcf.gz
clinvar=clinvar_chr20_notremapped.vcf.gz
python ../script/pythonScript/clinvar_sbSNP.py \
    --clinvar $clinvar \
    --chrm_name_table ../
database/RefSeq/refseq_to_number_only_consensual.txt \
    --dbSNP $snp --output prod.txt
wc -l prod.txt
zgrep '^NC' dbSNP_common_chr20.vcf.gz | wc -l
#+end_src
#+RESULTS:
| 518832 | prod.txt |
| 518846 |          |
***** KILL classification clinvar codée dbSNP ?
CLOSED: [2022-12-04 Sun 14:38]
Sur le chromosome 20
*Attention* CLNSIG a plusieurs champs (séparé par une virgule)
On y accède avec INFO/CLNSIG[*]
Ensuite, chaque item peut avoir plusieurs haploïdie (séparé par un |). IL faut donc utiliser une regexp
NB: *ne pas mettre la condition* dans une variable !!
Pour avoir les clinvar patho, on veut 5 mais pas 255 (= autre) pour la 
classification !`
Il faut également les likely patho et conflicting
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f '%INFO/CLNSIG\n' dbSNP_common_chr20.vcf.gz -i \
'INFO/CLNSIG[*]~"^5|" | INFO/CLNSIG[*]=="5" | INFO/CLNSIG[*]~"|5" | INFO/CLNSIG[*]~"^4|" | INFO/CLNSIG[*]=="4" | INFO/CLNSIG[*]~"|4" | INFO/CLNSIG[*]~"^12|" | INFO/CLNSIG[*]=="12" | INFO/CLNSIG[*]~"|12"' | sort
#+end_src
#+RESULTS:
| . |  . | 12 |    |   |   |   |   |   |   |   |
| . | 12 |  0 |  2 |   |   |   |   |   |   |   |
| 2 |  3 |  2 |  2 | 2 | 5 | . |   |   |   |   |
| . |  2 |  3 |  2 | 2 | 4 |   |   |   |   |   |
| . |  . |  3 | 12 | 3 |   |   |   |   |   |   |
| . |  5 |  2 |  . |   |   |   |   |   |   |   |
| . |  . |  . |  5 | 2 | 2 |   |   |   |   |   |
| . |  9 |  9 |  9 | 5 | 5 | 2 | 3 | 2 | 3 | 2 |
Si on les exclut :
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f '%ID\n' dbSNP_common_chr20.vcf.gz -e \
'INFO/CLNSIG[*]~"^5|" | INFO/CLNSIG[*]=="5" | INFO/CLNSIG[*]~"|5" | INFO/CLNSIG[*]~"4" | INFO/CLNSIG[*]~"12"' | sort | uniq > common-notpatho.txt
#+end_src
#+RESULTS:
 #+begin_src sh :dir ~/code/bisonex/test_isec
snp=dbSNP_common_chr20.vcf.gz
clinvar=clinvar_chr20_notremapped.vcf.gz
python ../script/pythonScript/clinvar_sbSNP.py \
    --clinvar $clinvar \
    --chrm_name_table ../database/RefSeq/refseq_to_number_only_consensual.txt \
    --dbSNP $snp --output tmp.txt
sort tmp.txt | uniq > common-notpatho-alexis.txt
wc -l common-notpatho-alexis.txt
 #+end_s
rc
 #+RESULTS:
 : 518832 common-
notpatho-alexis.txt
On en a 6 de plus que la versi
on d'Alexis mais quelques différences
Ceux d'Alexis qui manquent:
#+begin_src sh :dir ~/code/bisonex/test_isec
comm -23 common-notpatho-alexis.txt common-notpatho.txt > alexis-only.txt
cat alexis-only.txt
#+end_src
#+RESULTS:
| rs1064039  |
| rs3833341  |
| rs73598374 |
On les teste dans clinvar et dbSNP
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'ID=@alexis-only.txt' dbSNP_common_chr20.vcf.gz
bcftools query -f '%POS\n' -i 'ID=@alexis-only.txt' dbSNP_common_chr20.vcf.gz > alexis-only-pos.txt
while read  -r line; do
bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'POS='$line clinvar_chr20.vcf.gz
done < alexis-only-pos.txt
# bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'POS=23637790' clinvar_chr20.vcf.gz
#+end_src
#+RESULTS:
|   764018 | A | ACAGGTCAAT,ACAGGT | .,5     | 2,. |   |
| 23637790 | C | G,T               | .,.,12  |     |   |
| 44651586 | C | A,G,T             | .,.,.,5 |   2 | 2 |
|   764018 | A | ACAGGTCAAT        | Benign  |     |   |
| 23637790 | C | T                 | Benign  |     |   |
| 44651586 | C | T                 | Benign  |     |   |
On a donc une discordance entre clinvar et dbSNP.
On dirait qu'ils ont mal fait l'intersection avec clinvar.
Par exemple https://www.ncbi.nlm.nih.gov/snp/rs3833341#clinical_significance
Tu as l'impression qu'il y a un 1 cli
nvar bénin et 1 patho.
En cherchant par NM, tu vois qu'il est bénin sur clinvar car il y a d'autres soumissions ! https://www.ncbi.nlm.nih.gov/clinvar/variation/262235/
Confirmation sur nos bases de données :
$ bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'POS=764018' dbSNP_common_chr20.vcf.gz
764018 A ACAGGTCAAT,ACAGGT .,5|2,.
$ bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'POS=764018' clinvar_chr20.vcf.gz
764018 A ACAGGTCAAT Benign
***** KILL Corriger script alexi
CLOSED: [2022-12-04 Sun 13:03]
Gère clinvar patho, probablement patho ou conflicting !
***** HOLD Rtg tools
****** Test
1. Générer SDf file
   #+begin_src sh
rtg format genomeRef.fna  -o genomeRef.sdf
   #+end_src
2. Pour les bases de donnés, il faut l'option --sample ALT sinon on a
 #+begin_src
$ rtg vcfeval -b dbSNP_common.vcf.gz -c clinvar.vcf.gz -o test -t genomeRef.sdf/^C
VCF header does not contain a FORMAT field named GQ
Error: Record did not contain enough samples: NC_000001.11	10001	rs1570391677	A,C	.	PASS	RS=1570391677;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;R5;GNO;FREQ=KOREAN:0.9891,0.0109,.|SGDP_PRJ:0,1,.|dbGaP_PopFreq:1,.,0;COMMON
 #+end_src
 Essai intersection clinvar (patho ou non) dbSNP
   - faux négatif = dbSNP common qui ne sont pas dans clinvar
   - faux positif = clinvar qui ne sont pas dbSNP common
   - vrai positif = clinvar qui sont dans dbSNP common
   - vrai positif baseline = dbSNP common qui sont dans clinvar
 On calcule le nombre de lignes
 #+begin_src ssh
zgrep '^[^#]' /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz | wc -l
for i in *.vcf.gz; do echo $i; zgrep '^[^#]' $i | wc -l; done
 #+end_src
 | clinvar            |  1493470 |
 | fn.vcf.gz          | 22330220 |
 | fp.vcf.gz          |  1222529 |
 | tp-baseline.vcf.gz |   131040 |
 | tp.vcf.gz          |   136638 |
À noter qu'on ne retrouve pas tout clinvar...
1222529 + 131040 = 1353569 < 1493470
certains régions ne sont pas traitées :
#+begin_quote
Evaluation too complex (50002 unresolved paths, 34891 iterations) at reference region NC_000001.11:790930-790970. Variants in this region will not be included in results
#+end_quote
#+begin_src sh
grep 'not be included' vcfeval.log | wc -l
56192
#+end_src
Le total est quand même inférieur
On veut les clinvar non patho dans dbSNP soit les faux négatif (dbSNP common not contenu dans clinvar patho)
#+begin_src sh
bcftools filter -i 'INFO/CLNSIG="Pathogenic"' /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz -o /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar-patho.vcf.gz
tabix /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar-patho.vcf.gz
#+end_src
On lance le script (dbSNP common et clinvar = 9h)
#+begin_src sh
#!/bin/bash
#SBATCH --nodes=1
#SBATCH -p smp
#SBATCH --time=12:00:00
#SBATCH --mem=12G
dir=/Work/Groups/bisonex/data
dbSNP=$dir/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz
clinvar=$dir/clinvar/GRCh38/clinvar-patho.vcf.gz
ge

[2.8221]

[12.27599]

ED: [2023-06-14 Wed 22:35]
***** DONE compatibilité hg38
CLOSED: [2023-06-14 Wed 22:35]
**** TODO cache vep
SCHEDULED: <2023-07-03 Mon>
*** HOLD Processing bases de données
**** DONE dbSNP common
**** DONE Seulement les ID dans dbSNP common !
CLOSED: [2022-11-19 Sat 21:42]
172G au lieu de 253M...
**** HOLD common dbSNP not clinvar patho
***** DONE Conclusion partielle
CLOSED: [2022-12-12 Mon 22:25]
- vcfeval : prometteur mais n'arrive pas à traiter toutes les régions
- isec : trop de problèmes avec
- classif clinvar directement dans dbSNP: le plus simple
  Et ça permet de rattraper quelques erreurs dans le script d'Alexis
***** KILL Utiliser directement le numéro dbSNP dans clinvar ? Non
CLOSED: [2022-11-20 Sun 19:51]
Ex: chr20
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f 'rs%INFO/RS \n' -i 'INFO/RS != "." & INFO/CLNSIG="Pathogenic"' clinvar_chr20.vcf.gz | sort > ID_clinvar_patho.txt
bcftools query -f '%ID\n' dbSNP_common_chr20.vcf.gz | sort > ID_of_common_snp.txt
comm -23 ID_of_common_snp.txt ID_clinvar_patho.txt > ID_of_common_snp_not_clinvar_patho.txt
wc -l ID_of_common_snp_not_clinvar_patho.txt
# sort ID
#+end_src
#+RESULTS:
: 518846 ID_of_common_snp_not_clinvar_patho.txt
Version d'alexis
#+begin_src sh :dir ~/code/bisonex/test_isec
snp=dbSNP_common_chr20.vcf.gz
clinvar=clinvar_chr20_notremapped.vcf.gz
python ../script/pythonScript/clinvar_sbSNP.py \
    --clinvar $clinvar \
    --chrm_name_table ../database/RefSeq/refseq_to_number_only_consensual.txt \
    --dbSNP $snp --output prod.txt
wc -l prod.txt
zgrep '^NC' dbSNP_common_chr20.vcf.gz | wc -l
#+end_src
#+RESULTS:
| 518832 | prod.txt |
| 518846 |          |
***** KILL classification clinvar codée dbSNP ?
CLOSED: [2022-12-04 Sun 14:38]
Sur le chromosome 20
*Attention* CLNSIG a plusieurs champs (séparé par une virgule)
On y accède avec INFO/CLNSIG[*]
Ensuite, chaque item peut avoir plusieurs haploïdie (séparé par un |). IL faut donc utiliser une regexp
NB: *ne pas mettre la condition* dans une variable !!
Pour avoir les clinvar patho, on veut 5 mais pas 255 (= autre) pour la classification !`
Il faut également les likely patho et conflicting
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f '%INFO/CLNSIG\n' dbSNP_common_chr20.vcf.gz -i \
'INFO/CLNSIG[*]~"^5|" | INFO/CLNSIG[*]=="5" | INFO/CLNSIG[*]~"|5" | INFO/CLNSIG[*]~"^4|" | INFO/CLNSIG[*]=="4" | INFO/CLNSIG[*]~"|4" | INFO/CLNSIG[*]~"^12|" | INFO/CLNSIG[*]=="12" | INFO/CLNSIG[*]~"|12"' | sort
#+end_src
#+RESULTS:
| . |  . | 12 |    |   |   |   |   |   |   |   |
| . | 12 |  0 |  2 |   |   |   |   |   |   |   |
| 2 |  3 |  2 |  2 | 2 | 5 | . |   |   |   |   |
| . |  2 |  3 |  2 | 2 | 4 |   |   |   |   |   |
| . |  . |  3 | 12 | 3 |   |   |   |   |   |   |
| . |  5 |  2 |  . |   |   |   |   |   |   |   |
| . |  . |  . |  5 | 2 | 2 |   |   |   |   |   |
| . |  9 |  9 |  9 | 5 | 5 | 2 | 3 | 2 | 3 | 2 |
Si on les exclut :
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f '%ID\n' dbSNP_common_chr20.vcf.gz -e \
'INFO/CLNSIG[*]~"^5|" | INFO/CLNSIG[*]=="5" | INFO/CLNSIG[*]~"|5" | INFO/CLNSIG[*]~"4" | INFO/CLNSIG[*]~"12"' | sort | uniq > common-notpatho.txt
#+end_src
#+RESULTS:
 #+begin_src sh :dir ~/code/bisonex/test_isec
snp=dbSNP_common_chr20.vcf.gz
clinvar=clinvar_chr20_notremapped.vcf.gz
python ../script/pythonScript/clinvar_sbSNP.py \
    --clinvar $clinvar \
    --chrm_name_table ../database/RefSeq/refseq_to_number_only_consensual.txt \
    --dbSNP $snp --output tmp.txt
sort tmp.txt | uniq > common-notpatho-alexis.txt
wc -l common-notpatho-alexis.txt
 #+end_src
 #+RESULTS:
 : 518832 common-notpatho-alexis.txt
On en a 6 de plus que la version d'Alexis mais quelques différences
Ceux d'Alexis qui manquent:
#+begin_src sh :dir ~/code/bisonex/test_isec
comm -23 common-notpatho-alexis.txt common-notpatho.txt > alexis-only.txt
cat alexis-only.txt
#+end_src
#+RESULTS:
| rs1064039  |
| rs3833341  |
| rs73598374 |
On les teste dans clinvar et dbSNP
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'ID=@alexis-only.txt' dbSNP_common_chr20.vcf.gz
bcftools query -f '%POS\n' -i 'ID=@alexis-only.txt' dbSNP_common_chr20.vcf.gz > alexis-only-pos.txt
while read  -r line; do
bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'POS='$line clinvar_chr20.vcf.gz
done < alexis-only-pos.txt
# bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'POS=23637790' clinvar_chr20.vcf.gz
#+end_src
#+RESULTS:
|   764018 | A | ACAGGTCAAT,ACAGGT | .,5     | 2,. |   |
| 23637790 | C | G,T               | .,.,12  |     |   |
| 44651586 | C | A,G,T             | .,.,.,5 |   2 | 2 |
|   764018 | A | ACAGGTCAAT        | Benign  |     |   |
| 23637790 | C | T                 | Benign  |     |   |
| 44651586 | C | T                 | Benign  |     |   |
On a donc une discordance entre clinvar et dbSNP.
On dirait qu'ils ont mal fait l'intersection avec clinvar.
Par exemple https://www.ncbi.nlm.nih.gov/snp/rs3833341#clinical_significance
Tu as l'impression qu'il y a un 1 clinvar bénin et 1 patho.
En cherchant par NM, tu vois qu'il est bénin sur clinvar car il y a d'autres soumissions ! https://www.ncbi.nlm.nih.gov/clinvar/variation/262235/
Confirmation sur nos bases de données :
$ bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'POS=764018' dbSNP_common_chr20.vcf.gz
764018 A ACAGGTCAAT,ACAGGT .,5|2,.
$ bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'POS=764018' clinvar_chr20.vcf.gz
764018 A ACAGGTCAAT Benign
***** KILL Corriger script alexi
CLOSED: [2022-12-04 Sun 13:03]
Gère clinvar patho, probablement patho ou conflicting !
***** HOLD Rtg tools
****** Test
1. Générer SDf file
   #+begin_src sh
rtg format genomeRef.fna  -o genomeRef.sdf
   #+end_src
2. Pour les bases de donnés, il faut l'option --sample ALT sinon on a
 #+begin_src
$ rtg vcfeval -b dbSNP_common.vcf.gz -c clinvar.vcf.gz -o test -t genomeRef.sdf/^C
VCF header does not contain a FORMAT field named GQ
Error: Record did not contain enough samples: NC_000001.11	10001	rs1570391677	A,C	.	PASS	RS=1570391677;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;R5;GNO;FREQ=KOREAN:0.9891,0.0109,.|SGDP_PRJ:0,1,.|dbGaP_PopFreq:1,.,0;COMMON
 #+end_src
 Essai intersection clinvar (patho ou non) dbSNP
   - faux négatif = dbSNP common qui ne sont pas dans clinvar
   - faux positif = clinvar qui ne sont pas dbSNP common
   - vrai positif = clinvar qui sont dans dbSNP common
   - vrai positif baseline = dbSNP common qui sont dans clinvar
 On calcule le nombre de lignes
 #+begin_src ssh
zgrep '^[^#]' /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz | wc -l
for i in *.vcf.gz; do echo $i; zgrep '^[^#]' $i | wc -l; done
 #+end_src
 | clinvar            |  1493470 |
 | fn.vcf.gz          | 22330220 |
 | fp.vcf.gz          |  1222529 |
 | tp-baseline.vcf.gz |   131040 |
 | tp.vcf.gz          |   136638 |
À noter qu'on ne retrouve pas tout clinvar...
1222529 + 131040 = 1353569 < 1493470
certains régions ne sont pas traitées :
#+begin_quote
Evaluation too complex (50002 unresolved paths, 34891 iterations) at reference region NC_000001.11:790930-790970. Variants in this region will not be included in results
#+end_quote
#+begin_src sh
grep 'not be included' vcfeval.log | wc -l
56192
#+end_src
Le total est quand même inférieur
On veut les clinvar non patho dans dbSNP soit les faux négatif (dbSNP common not contenu dans clinvar patho)
#+begin_src sh
bcftools filter -i 'INFO/CLNSIG="Pathogenic"' /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz -o /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar-patho.vcf.gz
tabix /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar-patho.vcf.gz
#+end_src
On lance le script (dbSNP common et clinvar = 9h)
#+begin_src sh
#!/bin/bash
#SBATCH --nodes=1
#SBATCH -p smp
#SBATCH --time=12:00:00
#SBATCH --mem=12G
dir=/Work/Groups/bisonex/data
dbSNP=$dir/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz
clinvar=$dir/clinvar/GRCh38/clinvar-patho.vcf.gz
ge

Replacement in projects/bisonex.org at line 8 [3.35]

B:BD[2.16414] → [2.16414:24606]

 py2
conda install -c bioconda hap.py
#+end_src
******** Faire tourner les tests.
Il faut remplace bin/test_haplotypes par test_haplotypes dans src/sh/run_tests.sh
#+begin_src sh
 HGREF=../genome/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta HCDIR=~/anaconda3/envs/py2/bin bash src/sh/run_tests.sh
#+end_src
Echec:
test_haplotypes: /opt/conda/conda-bld/work/hap.py-0.3.7/src/c++/lib/tools/Fasta.cpp:81: MMappedFastaFile::MMappedFastaFile(const string&): Assertion `fd != -1' failed.
unknown location(0): fatal error in "testVariantPrimitiveSplitter": signal: SIGABRT (application abort requested)
/opt/conda/conda-bld/work/hap.py-0.3.7/src/c++/test/test_align.cpp(298): last checkpoint
******** Chr21
HGREF=../genome/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta hap.py        example/happy/PG_NA12878_chr21.vcf.gz       example/happy/NA12878_chr21.vcf.gz       -f example/happy/PG_Conf_chr21.bed.gz       -o test
******* Helios
échec
** TODO T2T :T2T:
Toutes les ressourcs sont décrites ici
https://github.com/marbl/CHM13
Détails sur le pipeline
https://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hub_3267197_GCA_009914755.4&c=CP068277.2&g=hub_3267197_hgLiftOver
*** TODO Alignement
NXF_OPTS=-D"user.name=${USER}" nextflow run main.nf -profile standard,helios  --input="/Work/Groups/bisonex/data/giab/*_R{1,2}_001.fastq.gz" --id=NA12878-T2T -bg
SCHEDULED: <2023-06-14 Wed>
*** TODO Haplotypecaller
SCHEDULED: <2023-06-15 Thu>
*** Liftover pipelines
:PROPERTIES:
:ID:       d2280207-3f65-4a31-a291-41fa9a9658c2
:END:
Contient les chain files
** TODO Indicateurs qualité
SCHEDULED: <2023-07-02 Sun>
*** Idée
Raredisease:
- FastQC : nombreuses statistiques. Non disponible Nix
- Mosdepth : calcule la profondeur (2x plus rapide que samtools depth). Nix
- MultiQC : fusionne juste les résultats des analyses. Non disponible nix
- Picard's CollectMutipleMetrics, CollectHsMetrics, and CollectWgsMetrics
- Qualimap : alternative fastqc ? Non disponible nix
- Sentieon's WgsMetricsAlgo : propriétaire
- TIDDIT's cov : TIDIT = remaninement chromosomique
Sarek:
- alignment statistics : samtools stats, mosdepth
- QC : MultiQC
MultiQC : non disponible Nix
** TODO vérifier si normalisation
SCHEDULED: <2023-06-25 Sun>
** TODO Rajouter vérification hgvs
SCHEDULED: <2023-06-25 Sun>
** DONE Exécution
CLOSED: [2022-09-13 Tue 21:37]
*** KILL test Bionix
*** KILL Implémenter execution avec Nix ?
Voir https://academic.oup.com/gigascience/article/9/11/giaa121/5987272?login=false
pour un exemple.
Probablement plus simple d’utiliser Nix pour gestion de l’environnement et snakemake pour l’exécution
Pas d’accès internet depuis le cluster
*** DONE nextflow
CLOSED: [2022-09-13 Tue 21:37]
**** TODO Bug scheduler SGE
Le job se fait tuer car l'utilisateur n'est pas passé correctement à nextflow
***** DONE Forcer l'utilisateur à l'exécution
CLOSED: [2023-04-01 Sat 17:57]
NXF_OPTS=-D"user.name=alex"
***** DONE Vérifier si le problème persiste avec 22.10.6
CLOSED: [2023-04-01 Sat 18:38] SCHEDULED: <2023-04-01 Sat>
oui
***** KILL Packager l'utilisateur dans le programme ?
Mauvaise idée..
** TODO Preprocessing avec nextflow
*** TODO Map to reference
**** TODO Sample ID dans header
/Work/Users/apraga/bisonex/out/63003856_S135/preprocessing/baserecalibrator
*** DONE Mark duplicate
CLOSED: [2022-10-09 Sun 22:30]
*** DONE Recalibrate base quality score
CLOSED: [2022-10-09 Sun 22:30]
** DONE Variant calling avec Nextflow
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Haplotype caller
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter variants
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter common snp not clinvar path
CLOSED: [2022-11-07 Mon 23:00]
Voir [[*common dbSNP not clinvar patho][common dbSNP not clinvar patho]]
*** DONE Filter variant only in consensual sequence
CLOSED: [2022-11-08 Tue 22:23]
*** DONE Filter technical variants
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Utilise AVX pour accélerer l'exécution
CLOSED: [2023-04-29 Sat 15:46]
Sans cela, on a l'avertissement
#+begin_quote
17:28:00.720 INFO  PairHMM - OpenMP multi-threaded AVX-accelerated native PairHMM implementation is not supported
17:28:00.721 INFO  NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/nix/store/cy9ckxqwrkifx7wf02hm4ww1p6lnbxg9-gatk-4.2.4.1/bin/gatk-package-4.2.4.1-local.jar!/com/intel/gkl/native/libgkl_utils.so
17:28:00.733 WARN  NativeLibraryLoader - Unable to load libgkl_utils.so from native/libgkl_utils.so (/Work/Users/apraga/bisonex/out/NA12878_NIST7035/preprocessing/applybqsr/libgkl_utils821485189051585397.so: libgomp.so.1: cannot open shared object file: No such file or directory)
17:28:00.733 WARN  IntelPairHmm - Intel GKL Utils not loaded
17:28:00.733 WARN  PairHMM - ***WARNING: Machine does not have the AVX instruction set support needed for the accelerated AVX PairHmm. Falling back to the MUCH slower LOGLESS_CACHING implementation!
17:28:00.763 INFO  ProgressMeter - Starting traversal
#+end_quote
libgomp.so est fourni par gcc donc il faut charger le module
 module load gcc@11.3.0/gcc-12.1.0
** KILL Utiliser subworkflow
CLOSED: [2023-04-02 Sun 18:08]
Notre version permet d'être plus souple
*** KILL Alignement
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
*** KILL Vep
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
vcf_annotate_ensemblvep
** TODO Annotation avec nextflow :annotation:
*** KILL VEP : --gene-phenotype ?
CLOSED: [2023-04-18 mar. 18:32]
Vu avec alexis : bases de données non à jour
https://www.ensembl.org/info/genome/variation/phenotype/sources_phenotype_documentation.html
*** DONE plugin VEP
CLOSED: [2023-04-18 mar. 18:32]
Cloner dépôt git avec plugin
Puis utiliser --dir_plugins
*** HOLD Utiliser code d’Alexis
*** TODO Nouvelle version avec VEP
Example avec --custom
https://www.ensembl.org/info/docs/tools/vep/script/vep_custom.html
**** DONE Ajout spliceAI
CLOSED: [2023-05-18 Thu 11:02] SCHEDULED: <2023-04-30 Sun>
plugin VEP
***** DONE Télécharger les données
CLOSED: [2023-05-11 Thu 19:01]
Difficile d'automatiser, le lien est temporaire...
***** DONE PLugin
CLOSED: [2023-05-11 Thu 20:16]
***** DONE Séparer score en plusieurs colonnes
CLOSED: [2023-05-11 Thu 20:16]
Test avec ce fichier pour avoir une ligne avec annotation et une ligne sans
#CHROM	POS	ID	REF	ALT
1	9091	.	A	C
1	69091	.	A	C
et
#+begin_src sh
rm -f postvep.tsv* && vep -i testspliceai.vcf.gz -o postvep.tsv --tab  --dir 109 --merged --pick --use_given_ref   --offline  --plugin SpliceAI,snv=spliceai_scores.raw.snv.hg38.vcf.gz,indel=spliceai_scores.raw.indel.hg38.vcf.gz
#+end_src
#+begin_src
$ bgzip postvep.tsv
$ python spliceai.py
$ cat postvep2.tsv
,variation,Location,Allele,Gene,Feature,Feature_type,Consequence,cDNA_position,CDS_position,Protein_position,Amino_acids,Codons,Existing_variation,IMPACT,DISTANCE,STRAND,FLAGS,REFSEQ_MATCH,SOURCE,REFSEQ_OFFSET,SpliceAI_AG,SpliceAI_AL,SpliceAI_DG,SpliceAI_DL
0,1_9091_A/C,1:9091,C,ENSG00000290825,ENST00000456328,Transcript,upstream_gene_variant,-,-,-,-,-,-,MODIFIER,2778,1,-,-,Ensembl,-,,,,
1,1_69091_A/C,1:69091,C,ENSG00000186092,ENST00000641515,Transcript,missense_variant,124,64,22,M/L,Atg/Ctg,-,MODERATE,-,1,-,-,Ensembl,-,0.01,0.00,0.00,0.01
#+end_src
Test
cp work/bf/437ae511958509e43072f032f4d495/small.tab.gz tests/vep-spip.tab.gz
cp work/d5/3b1244b5ae83d54409ee0d456e8c55/small_cadd.tab.gz tests/vep-cadd-splice.tab.gz
**** TODO Ajout LOEUF et pli
plugin VEP
**** TODO NMD
**** KILL Ajout LOEUF
CLOSED: [2023-04-19 mer. 16:32]
plugin VEP
**** DONE Spip
CLOSED: [2023-05-01 Mon 23:07] SCHEDULED: <2023-04-30 Sun>
BED ne semble pas bien marcher (il faut définir une zone)
VCF : trop d’information
Attention, plusieurs transcripts mais résultats identiques. On supprimer les doublons
***** DONE interpretation + score + intervalle de confiance séparé
CLOSED: [2023-05-01 Mon 23:07] SCHEDULED: <2023-04-30 Sun>
Tests :
dans tests/
vep -i 63004925-small.vcf -o postvep.vcf --vcf --fasta genomeRef.fna --dir 109 --merged --pick  --offline --custom ../scrip

[2.16414]

[2.24606]

 py2
conda install -c bioconda hap.py
#+end_src
******** Faire tourner les tests.
Il faut remplace bin/test_haplotypes par test_haplotypes dans src/sh/run_tests.sh
#+begin_src sh
 HGREF=../genome/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta HCDIR=~/anaconda3/envs/py2/bin bash src/sh/run_tests.sh
#+end_src
Echec:
test_haplotypes: /opt/conda/conda-bld/work/hap.py-0.3.7/src/c++/lib/tools/Fasta.cpp:81: MMappedFastaFile::MMappedFastaFile(const string&): Assertion `fd != -1' failed.
unknown location(0): fatal error in "testVariantPrimitiveSplitter": signal: SIGABRT (application abort requested)
/opt/conda/conda-bld/work/hap.py-0.3.7/src/c++/test/test_align.cpp(298): last checkpoint
******** Chr21
HGREF=../genome/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta hap.py        example/happy/PG_NA12878_chr21.vcf.gz       example/happy/NA12878_chr21.vcf.gz       -f example/happy/PG_Conf_chr21.bed.gz       -o test
******* Helios
échec
** TODO T2T :T2T:
Toutes les ressourcs sont décrites ici
https://github.com/marbl/CHM13
Détails sur le pipeline
https://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hub_3267197_GCA_009914755.4&c=CP068277.2&g=hub_3267197_hgLiftOver
*** DONE Alignement
CLOSED: [2023-06-26 Mon 19:42]
NXF_OPTS=-D"user.name=${USER}" nextflow run main.nf -profile standard,helios  --input="/Work/Groups/bisonex/data/giab/*_R{1,2}_001.fastq.gz" --id=NA12878-T2T -bg
SCHEDULED: <2023-06-14 Wed>
*** DONE Haplotypecaller
CLOSED: [2023-06-26 Mon 19:42] SCHEDULED: <2023-06-15 Thu>
*** TODO Filtres
SCHEDULED: <2023-07-03 Mon>
*** Liftover pipelines
:PROPERTIES:
:ID:       d2280207-3f65-4a31-a291-41fa9a9658c2
:END:
Contient les chain files
** TODO Indicateurs qualité
SCHEDULED: <2023-07-02 Sun>
*** Idée
Raredisease:
- FastQC : nombreuses statistiques. Non disponible Nix
- Mosdepth : calcule la profondeur (2x plus rapide que samtools depth). Nix
- MultiQC : fusionne juste les résultats des analyses. Non disponible nix
- Picard's CollectMutipleMetrics, CollectHsMetrics, and CollectWgsMetrics
- Qualimap : alternative fastqc ? Non disponible nix
- Sentieon's WgsMetricsAlgo : propriétaire
- TIDDIT's cov : TIDIT = remaninement chromosomique
Sarek:
- alignment statistics : samtools stats, mosdepth
- QC : MultiQC
MultiQC : non disponible Nix
** TODO vérifier si normalisation
SCHEDULED: <2023-07-03 Mon>
** TODO Rajouter vérification hgvs
SCHEDULED: <2023-07-03 Mon>
** DONE Exécution
CLOSED: [2022-09-13 Tue 21:37]
*** KILL test Bionix
*** KILL Implémenter execution avec Nix ?
Voir https://academic.oup.com/gigascience/article/9/11/giaa121/5987272?login=false
pour un exemple.
Probablement plus simple d’utiliser Nix pour gestion de l’environnement et snakemake pour l’exécution
Pas d’accès internet depuis le cluster
*** DONE nextflow
CLOSED: [2022-09-13 Tue 21:37]
**** TODO Bug scheduler SGE
Le job se fait tuer car l'utilisateur n'est pas passé correctement à nextflow
***** DONE Forcer l'utilisateur à l'exécution
CLOSED: [2023-04-01 Sat 17:57]
NXF_OPTS=-D"user.name=alex"
***** DONE Vérifier si le problème persiste avec 22.10.6
CLOSED: [2023-04-01 Sat 18:38] SCHEDULED: <2023-04-01 Sat>
oui
***** KILL Packager l'utilisateur dans le programme ?
Mauvaise idée..
** TODO Preprocessing avec nextflow
*** TODO Map to reference
**** TODO Sample ID dans header
/Work/Users/apraga/bisonex/out/63003856_S135/preprocessing/baserecalibrator
*** DONE Mark duplicate
CLOSED: [2022-10-09 Sun 22:30]
*** DONE Recalibrate base quality score
CLOSED: [2022-10-09 Sun 22:30]
** DONE Variant calling avec Nextflow
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Haplotype caller
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter variants
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter common snp not clinvar path
CLOSED: [2022-11-07 Mon 23:00]
Voir [[*common dbSNP not clinvar patho][common dbSNP not clinvar patho]]
*** DONE Filter variant only in consensual sequence
CLOSED: [2022-11-08 Tue 22:23]
*** DONE Filter technical variants
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Utilise AVX pour accélerer l'exécution
CLOSED: [2023-04-29 Sat 15:46]
Sans cela, on a l'avertissement
#+begin_quote
17:28:00.720 INFO  PairHMM - OpenMP multi-threaded AVX-accelerated native PairHMM implementation is not supported
17:28:00.721 INFO  NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/nix/store/cy9ckxqwrkifx7wf02hm4ww1p6lnbxg9-gatk-4.2.4.1/bin/gatk-package-4.2.4.1-local.jar!/com/intel/gkl/native/libgkl_utils.so
17:28:00.733 WARN  NativeLibraryLoader - Unable to load libgkl_utils.so from native/libgkl_utils.so (/Work/Users/apraga/bisonex/out/NA12878_NIST7035/preprocessing/applybqsr/libgkl_utils821485189051585397.so: libgomp.so.1: cannot open shared object file: No such file or directory)
17:28:00.733 WARN  IntelPairHmm - Intel GKL Utils not loaded
17:28:00.733 WARN  PairHMM - ***WARNING: Machine does not have the AVX instruction set support needed for the accelerated AVX PairHmm. Falling back to the MUCH slower LOGLESS_CACHING implementation!
17:28:00.763 INFO  ProgressMeter - Starting traversal
#+end_quote
libgomp.so est fourni par gcc donc il faut charger le module
 module load gcc@11.3.0/gcc-12.1.0
** KILL Utiliser subworkflow
CLOSED: [2023-04-02 Sun 18:08]
Notre version permet d'être plus souple
*** KILL Alignement
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
*** KILL Vep
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
vcf_annotate_ensemblvep
** TODO Annotation avec nextflow :annotation:
*** KILL VEP : --gene-phenotype ?
CLOSED: [2023-04-18 mar. 18:32]
Vu avec alexis : bases de données non à jour
https://www.ensembl.org/info/genome/variation/phenotype/sources_phenotype_documentation.html
*** DONE plugin VEP
CLOSED: [2023-04-18 mar. 18:32]
Cloner dépôt git avec plugin
Puis utiliser --dir_plugins
*** HOLD Utiliser code d’Alexis
*** TODO Nouvelle version avec VEP
Example avec --custom
https://www.ensembl.org/info/docs/tools/vep/script/vep_custom.html
**** DONE Ajout spliceAI
CLOSED: [2023-05-18 Thu 11:02] SCHEDULED: <2023-04-30 Sun>
plugin VEP
***** DONE Télécharger les données
CLOSED: [2023-05-11 Thu 19:01]
Difficile d'automatiser, le lien est temporaire...
***** DONE PLugin
CLOSED: [2023-05-11 Thu 20:16]
***** DONE Séparer score en plusieurs colonnes
CLOSED: [2023-05-11 Thu 20:16]
Test avec ce fichier pour avoir une ligne avec annotation et une ligne sans
#CHROM	POS	ID	REF	ALT
1	9091	.	A	C
1	69091	.	A	C
et
#+begin_src sh
rm -f postvep.tsv* && vep -i testspliceai.vcf.gz -o postvep.tsv --tab  --dir 109 --merged --pick --use_given_ref   --offline  --plugin SpliceAI,snv=spliceai_scores.raw.snv.hg38.vcf.gz,indel=spliceai_scores.raw.indel.hg38.vcf.gz
#+end_src
#+begin_src
$ bgzip postvep.tsv
$ python spliceai.py
$ cat postvep2.tsv
,variation,Location,Allele,Gene,Feature,Feature_type,Consequence,cDNA_position,CDS_position,Protein_position,Amino_acids,Codons,Existing_variation,IMPACT,DISTANCE,STRAND,FLAGS,REFSEQ_MATCH,SOURCE,REFSEQ_OFFSET,SpliceAI_AG,SpliceAI_AL,SpliceAI_DG,SpliceAI_DL
0,1_9091_A/C,1:9091,C,ENSG00000290825,ENST00000456328,Transcript,upstream_gene_variant,-,-,-,-,-,-,MODIFIER,2778,1,-,-,Ensembl,-,,,,
1,1_69091_A/C,1:69091,C,ENSG00000186092,ENST00000641515,Transcript,missense_variant,124,64,22,M/L,Atg/Ctg,-,MODERATE,-,1,-,-,Ensembl,-,0.01,0.00,0.00,0.01
#+end_src
Test
cp work/bf/437ae511958509e43072f032f4d495/small.tab.gz tests/vep-spip.tab.gz
cp work/d5/3b1244b5ae83d54409ee0d456e8c55/small_cadd.tab.gz tests/vep-cadd-splice.tab.gz
**** TODO Ajout LOEUF et pli
plugin VEP
**** TODO NMD
**** KILL Ajout LOEUF
CLOSED: [2023-04-19 mer. 16:32]
plugin VEP
**** DONE Spip
CLOSED: [2023-05-01 Mon 23:07] SCHEDULED: <2023-04-30 Sun>
BED ne semble pas bien marcher (il faut définir une zone)
VCF : trop d’information
Attention, plusieurs transcripts mais résultats identiques. On supprimer les doublons
***** DONE interpretation + score + intervalle de confiance séparé
CLOSED: [2023-05-01 Mon 23:07] SCHEDULED: <2023-04-30 Sun>
Tests :
dans tests/
vep -i 63004925-small.vcf -o postvep.vcf --vcf --fasta genomeRef.fna --dir 109 --merged --pick  --offline --custom ../scrip

Replacement in projects/bisonex.org at line 31 [3.35]

B:BD[12.234980] → [12.234980:235048]

B:BD[12.235048] → [13.25156:26911]

B:BD[13.26911] → [4.33059:40229]

B:BD[4.40229] → [2.25408:32799]

                                              | --create-output-bam-
index true                                                          |
| --create-output-bam-md5 false                                                                                 | --create-output-bam-md5 false                                                           |
| --create-output-variant-index true                                                                            | --create-output-variant-index true                                                      |
| --create-output-variant-md5 false                                                                             | --create-output-variant-md5 false                                                       |
| --max-variants-per-shard 0                                                                                    | --max-variants-per-shard 0                                                              |
| --lenient false                                                                                               | --lenient false                                                                         |
| --add-output-sam-program-record true                                                                          | --add-output-sam-program-record true                                                    |
| --add-output-vcf-command-line true                                                                            | --add-output-vcf-command-line true                                                      |
| --cloud-prefetch-buffer 40                                                                                    | --cloud-prefetch-buffer 40                                                              |
| --cloud-index-prefetch-buffer -1          
                                                                    | --cloud-index-prefetch-buffer -1                                                        |
| --disable-bam-index-caching false                                                                             | --disable-bam-index-caching false                                                       |
| --sites-only-vcf-output false                                                                                 | --sites-only-vcf-output false                                                           |
| --help false                                                                                                  | --help false                                                                            |
| --version false                                                                                               | --version false                                                                         |
| --showHidden false                                                                                            | --showHidden false                                                                      |
| --QUIET false                                                                                                 | --QUIET false                                                                           |
| --use-jdk-deflater false                                                                                      | --use-jdk-deflater false                                                                |
| --use-jdk-inflater false                                                                                      | --use-jdk-inflater false                                                                |
| --gcs-max-retries 20                                                                                          | --gcs-max-retries 20                                                                    |
| --gcs-project-for-requester-pays                                                                              | --gcs-project-for-requester-pays                                                        |
| --disable-tool-default-read-filters false	PN:GATK ApplyBQSR                                                 | --disable-tool-default-read-filters false     PN:GATK ApplyBQSR                         |
****** KILL Vérifier sha256sum
CLOSED: [2023-01-24 Tue 23:00]
alignment: différent
****** KILL Comparer bam
CLOSED: [2023-01-25 Wed 21:58]
/Work/Users/apraga/bisonex/script/files〉picard CompareSAMs LENIENT_LOW_MQ_ALIGNMENT=true LENIENT_DUP=true tmp_63003856_S135/63003856_S135.bam /Work/Groups/bisonex/ref/tmp_63003856_S135/63003856_S135.bam O=compare-bam.tsv
picard CompareSAMs -LENIENT_LOW_MQ_ALIGNMENT true -LENIENT_DUP true tmp_63003856_S135/63003856_S135.bam /Work/Groups/bisonex/ref/tmp_63003856_S135/63003856_S135.bam -O compare-bam.tsv
VN Program Record attribute differs.
File 1: 1.13
File 2: 1.10
SAM files differ.
[Tue Jan 24 23:12:50 CET 2023] picard.sam.CompareSAMs done. Elapsed time: 7.32 minutes.
***** DONE Relancer avec la même version de samtools
CLOSED: [2023-01-25 Wed 21:58]
Pas d'impact
***** KILL Comparer tsv de sortie
CLOSED: [2023-05-23 Tue 08:45]
***** KILL Regarder où sont les variants différents
CLOSED: [2023-05-23 Tue 08:45]
** TODO GIAB Validation :giab:
https://github.com/ga4gh/benchmarking-tools
Prérequis :
- [[*hap.py][hap.py]]
- [[*NA12878][NA12878]]
*** DONE GIAB : exome :giab:
CLOSED: [2023-04-16 Sun 16:33]
**** Notes
https://github.com/genome-in-a-bottle/giab_FAQ
**** Résultats résumés :resultats:
***** DONE HG001 :
CLOSED: [2023-04-06 Thu 21:41] SCHEDULED: <2023-04-02 Sun>
| Données | Algorithm | Type    | Recall | Precision |
|---------+-----------+---------+--------+-----------|
| Bisonex | Happy     | SNP     | 0.8552 |    0.9708 |
| Bisonex | vcfeval   | SNP     | 0.8547 |    0.9727 |
| Bisonex | Happy     | INDEL   | 0.7105 |    0.6929 |
| Bisonex | vcfeval   | Non-SNP | 0.7139 |    0.7136 |
|---------+-----------+---------+--------+-----------|
| GIAB    | happy     | INDEL   | 0.7551 |    0.7415 |
| GIAB    | vcfeval   | INDEL   | 0.7598 |    0.7445 |
| GIAB    | happy     | SNP     | 0.8937 |    0.9621 |
| giab    | vcfeval   | SNP     | 0.8937 |    0.9621 |
***** DONE HG002, HG003, HG004
CLOSED: [2023-04-14 Fri 11:36] SCHEDULED: <2023-04-14 Fri>
Capture Agilent
| Patient | Algorithm | Type  |   Recall | Precision |
| HG002   | happy     | INDEL | 0.851495 |  0.923616 |
| HG002   | happy     | SNP   | 0.905926 |  0.992158 |
| HG002   | vcfeval   | indel |   0.8523 |    0.9212 |
| HG002   | vcfeval   | snp   |   0.9054 |    0.9934 |
| HG003   | vcfeval   | indel |   0.8363 |    0.9115 |
| HG003   | vcfeval   | snp   |   0.9069 |    0.9928 |
| HG003   | happy     | INDEL | 0.838521 |  0.917296 |
| HG003   | happy     | SNP   | 0.907466 |  0.991204 |
| HG004   | happy     | INDEL | 0.856835 |  0.925086 |
| HG004   | happy     | SNP   | 0.905067 |  0.992704 |
| HG004   | vcfeval   | indel |   0.8568 |    0.9240 |
| HG004   | vcfeval   | snp   |   0.9048 |    0.9938 |
**** DONE télécharger données avec Nextflow :hg38:
CLOSED: [2023-04-16 Sun 16:32]
***** DONE Renommer les chromosomes
CLOSED: [2023-02-17 Fri 19:30]
****** DONE Genome de reference NCBI
CLOSED: [2023-02-25 Sat 19:46]
****** DONE Bed avec les exons
CLOSED: [2023-03-29 Wed 23:04]
****** DONE hg19
CLOSED: [2023-02-26 Sun 22:37]
****** DONE hg38
CLOSED: [2023-03-29 Wed 23:04]
- [X] Télécharger hg19 : ok
- [X] convertir bed en interval list
picard BedToIntervalList -I exons_illumina.bed  -O exons_illumina.list -SD  ../../genome/GRCh19/genomeRef.dict
- [X] puis en hg38
picard LiftOverIntervalList -I exons_illumina.list  -O exons_illumina_hg38.list --CHAIN hg19ToHg38.over.chain -SD  ../../genome/GRCh38.p13/genomeRef.dict
- [X] puis en bed
***** KILL VCF de référence
CLOSED: [2023-04-16 Sun 16:32]
****** TODO NA12878 (HG001)
******* DONE Fastq HiSeq
CLOSED: [2023-02-25 Sat 19:46]
On prend le Hiseq, qui est probablement ce qu'utilise Centogène :
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/
On utilisé les données "trimmés" (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1069-7), i.e qui ont enlevé les fragments plus petits que la taille d'un read.
Informations:
- https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/Garvan_NA12878_HG001_HiSeq_Exome.README
- Sequencer: HiSeq2500
- kit: Nextera Rapid Capture Exome and Expanded Exome
Il y a 2 samples (NIST7035 et NIST7086), chacun sur 2 lanes -> à concaténer
NB : liste techno illumina https://www.illumina.com/systems/sequencing-platforms.html
Hiseq postérieur nextseq 550
******* TODO Fastq hiseq sans trimming
******* DONE Capture : Exons (bed)
CLOSED: [2023-02-25 Sat 19:46]
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/nexterarap
idcapture_expandedexome_targetedregions.bed.gz
******* DONE Bed, vcf
CLOSED: [2023-02-24 Fri 23:45]
****** DONE Ashkenazy trio HG002, HG003, HGQ004
CLOSED: [2023-04-06 Thu 21:43] SCHEDULED: <2023-04-01 Sat>
****** KILL Chinese trio HG005, 6, 7
CLOSED: [2023-04-16 Sun 16:32]
***** KILL Fastq :fastq:
CLOSED: [2023-04-16 Sun 16:32]
****** DONE NA12878 (HG001)
CLOSED: [2023-02-25 Sat 19:46]
******* DONE Fastq HiSeq
CLOSED: [2023-02-25 Sat 19:46]
On prend le Hiseq, qui est probablement ce qu'utilise Centogène :
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/
On utilisé les données "trimmés" (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1069-7), i.e qui ont enlevé les fragments plus petits que la taille d'un read.
Informations:
- https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/Garvan_NA12878_HG001_HiSeq_Exome.README
- Sequencer: HiSeq2500
- kit: Nextera Rapid Capture Exome and Expanded Exome
Il y a 2 samples (NIST7035 et NIST7086), chacun sur 2 lanes -> à concaténer
NB : liste techno illumina https://www.illumina.com/systems/sequencing-platforms.html
Hiseq postérieur nextseq 550
******* DONE Capture : Exons (bed)
CLOSED: [2023-02-25 Sat 19:46]
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/nexterarapidcapture_expandedexome_targetedregions.bed.gz
****** DONE Ashkenazy trio HG002, HG003, HG004
CLOSED: [2023-04-15 Sat 23:24] SCHEDULED: <2023-04-05 Wed>
******* DONE Capture
CLOSED: [2023-04-15 Sat 23:24]
https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/OsloUniversityHospital_Exome_GATK_jointVC_11242015/wex_Agilent_SureSelect_v05_b37.baits.slop50.merged.list
******* DONE Capture Agilent
CLOSED: [2023-04-15 Sat 23:24]
******* DONE Bam à partir des fastq
CLOSED: [2023-04-15 Sat 23:24]
Bam + index + checksum
https://raw.githubusercontent.com/genome-in-a-bottle/giab_data_indexes/master/AshkenazimTrio/alignment.index.AJtrio_OsloUniversityHospital_IlluminaExome_bwamem_GRCh37_11252015
****** KILL Chinese trio
CLOSED: [2023-04-16 Sun 16:32]
Whole exome pour HG005 seulement
******* KILL HG005
CLOSED: [2023-04-16 Sun 16:32]
https://raw.githubusercontent.com/genome-in-a-bottle/giab_data_indexes/master/ChineseTrio/alignment.index.Chinesetrio_HG005_OsloUniversityHospital_IlluminaExome_bwamem_GRCh37_11252015
**** TODO NA12878 :na12878:hg38:
***** DONE Discussion alexis : Mail
CLOSED: [2023-03-29 Wed 22:40]
Avec le patient NA12878 et comparaison avec hap.py du VCF de Genome In A Bottle ("gold" standard), on avait pour rappel
- sensibilité (=recall) 71% pour indel, 85% SNP
- précision  (= VPP) 69 et 97% respectivement
| Type  | TRUTH |    TP |   FN | QUERY |   FP |  UNK | FP.gt | FP.al |   Recall | Precision |
| INDEL |  4871 |  3461 | 1410 |  7048 | 1554 | 1987 |   193 |   346 | 0.710532 |  0.692946 |
| SNP   | 46032 | 39369 | 6663 | 44600 | 1186 | 4041 |   304 |    30 | 0.855253 |  0.970759 |
Les statistiques sur les génomes sont bien meilleurs (cf precisionFDA challenge).
Pour les exome, un article [1] a fait a des meilleures stats sur ce patient avec BWA et GATK mais ils ont moins de variant (on a presque un facteur 2 !).
Je soupçonne qu'on ne travaille pas sur les mêmes zones de capture (pas réussi à récupérer leur .bed)
| Exome | Type  |    TP |   FP |  FN | Sensitivity | Precision | F-Score |   FDR |
|     1 | SNV   | 23689 | 1397 | 613 |       0.975 |     0.944 |   0.959 | 0.057 |
|     2 | SNV   | 23946 |  865 | 356 |       0.985 |     0.965 |   0.975 | 0.036 |
|     1 | indel |  1254 |   72 |  75 |       0.944 |     0.946 |   0.945 | 0.054 |
|     2 | indel |  1309 |   10 |  20 |       0.985 |     0.992 |   0.989 | 0.008 |
Pour essayer d'améliorer les statistiques :
- La version du génome GRC38 vs GRCh38.p13 ne change quasiment rien
- Désactiver dbSNP ne change strictement rien pour le variant calling
J'ai exploré les faux négatifs :
- la grande majorité n'est juste pas vue (ce n'est pas un problème d'haploïde/génotype)
- la répartition par chromosome est relativement homogène, sauf sur le 6 ()
- la majorité est en 5' et 3'UTR (selon Best refseq)
Conclusion: je pense m'arrêter là pour la validation du variant calling par manque de temps. Il faudrait creuser pour savoir pourquoi certains variants ne sont pas vus par GATK mais ce n'est pas la majorité. En tout cas, je peux justifier d'une première analyse pour la thèse.
Ça te va ?
[1]
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2928-9
Résultats ici https://static-content.springer.com/esm/art%3A10.1186%2Fs12859-019-2928-9/MediaObjects/12859_2019_2928_MOESM8_ESM.pdf
***** DONE Comparaison
CLOSED: [2023-03-04 Sat 11:14]
HGREF=/Work/Groups/bisonex/data-alexis-reference/genome/GRCh38_latest_genomic.fna ./result/bin/hap.py /Work/Groups/bisonex/NA12878/HG001_GRCh38_1_22_v4.2.1
_benchmark_renamed.vcf.gz script/files/vcf/NA12878_NIST7035_vep_annot.vcf -f /Work/Groups/bison
ex/NA12878/HG001_GRCh38_1_22_v4.2.1_benchmark.bed -o test
na1878.slurm
#+begin_src slurm
#!/bin/bash
#SBATCH -c 4
#SBATCH -p smp
#SBATCH --time=01:00:00
#SBATCH --mem=32G
module load nix/2.11.0
export HGREF=/Work/Groups/bisonex/data-alexis-reference/genome/GRCh38_latest_genomic.fna
dir=/Work/Groups/bisonex/data/NA12878/GRCh38
hap.py ${dir}/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz script/files/vcf/NA12878_NIST7035.vcf -f ${dir}/HG001_GRCh38_1_22_v4.2.1_benchmark.bed -o test
#+end_src
****** KILL beaucoup trop de faux négatifs
CLOSED: [2023-02-17 Fri 19:37]
******* DONE Test 1 : vep annot : beaucoup trop de faux négatif
CLOSED: [2023-02-06 lun. 13:40]
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL       276768       274    276494         1500       257        968     26     15       0.000990          0.516917        0.645333         0.001976                     NaN                     NaN                   1.483361                   6.129187
INDEL   PASS       276768       274    276494         1500       257        968     26     15       0.000990          0.516917        0.645333         0.001976                     NaN                     NaN                   1.483361                   6.129187
  SNP    ALL      1937706      1193   1936513         3338       106       2037     11      2       0.000616          0.918524        0.610246         0.001231                  2.0785                1.861183                   1.539064                   2.703663
  SNP   PASS      1937706      1193   1936513         3338       106       2037     11      2       0.000616          0.918524        0.610246         0.001231                  2.0785                1.861183                   1.539064                   2.703663
******* KILL Test 3 : indexer vcf de reference
CLOSED: [2023-02-06 lun. 17:19]
Même résultat avec vcfeval, qui a besoin de la version indexée
******* DONE Test 3 sans filtre vep : idem
CLOSED: [2023-02-06 lun. 17:19]
Benchma

[12.234980]

[2.32799]

                                              | --create-output-bam-index true                                                          |
| --create-output-bam-md5 false                                                                                 | --create-output-bam-md5 false                                                           |
| --create-output-variant-index true                                                                            | --create-output-variant-index true                                                      |
| --create-output-variant-md5 false                                                                             | --create-output-variant-md5 false                                                       |
| --max-variants-per-shard 0                                                                                    | --max-variants-per-shard 0                                                              |
| --lenient false                                                                                               | --lenient false                                                                         |
| --add-output-sam-program-record true                                                                          | --add-output-sam-program-record true                                                    |
| --add-output-vcf-command-line true                                                                            | --add-output-vcf-command-line true                                                      |
| --cloud-prefetch-buffer 40                                                                                    | --cloud-prefetch-buffer 40                                                              |
| --cloud-index-prefetch-buffer -1                                                                              | --cloud-index-prefetch-buffer -1                                                        |
| --disable-bam-index-caching false                                                                             | --disable-bam-index-caching false                                                       |
| --sites-only-vcf-output false                                                                                 | --sites-only-vcf-output false                                                           |
| --help false                                                                                                  | --help false                                                                            |
| --version false                                                                                               | --version false                                                                         |
| --showHidden false                                                                                            | --showHidden false                                                                      |
| --QUIET false                                                                                                 | --QUIET false                                                                           |
| --use-jdk-deflater false                                                                                      | --use-jdk-deflater false                                                                |
| --use-jdk-inflater false                                                                                      | --use-jdk-inflater false                                                                |
| --gcs-max-retries 20                                                                                          | --gcs-max-retries 20                                                                    |
| --gcs-project-for-requester-pays                                                                              | --gcs-project-for-requester-pays                                                        |
| --disable-tool-default-read-filters false	PN:GATK ApplyBQSR                                                 | --disable-tool-default-read-filters false     PN:GATK ApplyBQSR                         |
****** KILL Vérifier sha256sum
CLOSED: [2023-01-24 Tue 23:00]
alignment: différent
****** KILL Comparer bam
CLOSED: [2023-01-25 Wed 21:58]
/Work/Users/apraga/bisonex/script/files〉picard CompareSAMs LENIENT_LOW_MQ_ALIGNMENT=true LENIENT_DUP=true tmp_63003856_S135/63003856_S135.bam /Work/Groups/bisonex/ref/tmp_63003856_S135/63003856_S135.bam O=compare-bam.tsv
picard CompareSAMs -LENIENT_LOW_MQ_ALIGNMENT true -LENIENT_DUP true tmp_63003856_S135/63003856_S135.bam /Work/Groups/bisonex/ref/tmp_63003856_S135/63003856_S135.bam -O compare-bam.tsv
VN Program Record attribute differs.
File 1: 1.13
File 2: 1.10
SAM files differ.
[Tue Jan 24 23:12:50 CET 2023] picard.sam.CompareSAMs done. Elapsed time: 7.32 minutes.
***** DONE Relancer avec la même version de samtools
CLOSED: [2023-01-25 Wed 21:58]
Pas d'impact
***** KILL Comparer tsv de sortie
CLOSED: [2023-05-23 Tue 08:45]
***** KILL Regarder où sont les variants différents
CLOSED: [2023-05-23 Tue 08:45]
** TODO GIAB Validation :giab:
https://github.com/ga4gh/benchmarking-tools
Prérequis :
- [[*hap.py][hap.py]]
- [[*NA12878][NA12878]]
*** TODO GIAB : exome :giab:
**** Notes
https://github.com/genome-in-a-bottle/giab_FAQ
**** Résultats résumés :resultats:
***** DONE HG001 :
CLOSED: [2023-04-06 Thu 21:41] SCHEDULED: <2023-04-02 Sun>
| Données | Algorithm | Type    | Recall | Precision |
|---------+-----------+---------+--------+-----------|
| Bisonex | Happy     | SNP     | 0.8552 |    0.9708 |
| Bisonex | vcfeval   | SNP     | 0.8547 |    0.9727 |
| Bisonex | Happy     | INDEL   | 0.7105 |    0.6929 |
| Bisonex | vcfeval   | Non-SNP | 0.7139 |    0.7136 |
|---------+-----------+---------+--------+-----------|
| GIAB    | happy     | INDEL   | 0.7551 |    0.7415 |
| GIAB    | vcfeval   | INDEL   | 0.7598 |    0.7445 |
| GIAB    | happy     | SNP     | 0.8937 |    0.9621 |
| giab    | vcfeval   | SNP     | 0.8937 |    0.9621 |
***** DONE HG002, HG003, HG004
CLOSED: [2023-04-14 Fri 11:36] SCHEDULED: <2023-04-14 Fri>
Capture Agilent
| Patient | Algorithm | Type  |   Recall | Precision |
| HG002   | happy     | INDEL | 0.851495 |  0.923616 |
| HG002   | happy     | SNP   | 0.905926 |  0.992158 |
| HG002   | vcfeval   | indel |   0.8523 |    0.9212 |
| HG002   | vcfeval   | snp   |   0.9054 |    0.9934 |
| HG003   | vcfeval   | indel |   0.8363 |    0.9115 |
| HG003   | vcfeval   | snp   |   0.9069 |    0.9928 |
| HG003   | happy     | INDEL | 0.838521 |  0.917296 |
| HG003   | happy     | SNP   | 0.907466 |  0.991204 |
| HG004   | happy     | INDEL | 0.856835 |  0.925086 |
| HG004   | happy     | SNP   | 0.905067 |  0.992704 |
| HG004   | vcfeval   | indel |   0.8568 |    0.9240 |
| HG004   | vcfeval   | snp   |   0.9048 |    0.9938 |
**** TODO télécharger données avec Nextflow :hg38:
***** DONE Renommer les chromosomes
CLOSED: [2023-02-17 Fri 19:30]
****** DONE Genome de reference NCBI
CLOSED: [2023-02-25 Sat 19:46]
****** DONE Bed avec les exons
CLOSED: [2023-03-29 Wed 23:04]
****** DONE hg19
CLOSED: [2023-02-26 Sun 22:37]
****** DONE hg38
CLOSED: [2023-03-29 Wed 23:04]
- [X] Télécharger hg19 : ok
- [X] convertir bed en interval list
picard BedToIntervalList -I exons_illumina.bed  -O exons_illumina.list -SD  ../../genome/GRCh19/genomeRef.dict
- [X] puis en hg38
picard LiftOverIntervalList -I exons_illumina.list  -O exons_illumina_hg38.list --CHAIN hg19ToHg38.over.chain -SD  ../../genome/GRCh38.p13/genomeRef.dict
- [X] puis en bed
***** KILL VCF de référence
CLOSED: [2023-04-16 Sun 16:32]
****** TODO NA12878 (HG001)
******* DONE Fastq HiSeq
CLOSED: [2023-02-25 Sat 19:46]
On prend le Hiseq, qui est probablement ce qu'utilise Centogène :
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/
On utilisé les données "trimmés" (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1069-7), i.e qui ont enlevé les fragments plus petits que la taille d'un read.
Informations:
- https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/Garvan_NA12878_HG001_HiSeq_Exome.README
- Sequencer: HiSeq2500
- kit: Nextera Rapid Capture Exome and Expanded Exome
Il y a 2 samples (NIST7035 et NIST7086), chacun sur 2 lanes -> à concaténer
NB : liste techno illumina https://www.illumina.com/systems/sequencing-platforms.html
Hiseq postérieur nextseq 550
******* TODO Fastq hiseq sans trimming
******* DONE Capture : Exons (bed)
CLOSED: [2023-02-25 Sat 19:46]
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/nexterarapidcapture_expandedexome_targetedregions.bed.gz
******* DONE Bed, vcf
CLOSED: [2023-02-24 Fri 23:45]
****** DONE Ashkenazy trio HG002, HG003, HGQ004
CLOSED: [2023-04-06 Thu 21:43] SCHEDULED: <2023-04-01 Sat>
****** KILL Chinese trio HG005, 6, 7
CLOSED: [2023-04-16 Sun 16:32]
***** KILL Fastq :fastq:
CLOSED: [2023-04-16 Sun 16:32]
****** DONE NA12878 (HG001)
CLOSED: [2023-02-25 Sat 19:46]
******* DONE Fastq HiSeq
CLOSED: [2023-02-25 Sat 19:46]
On prend le Hiseq, qui est probablement ce qu'utilise Centogène :
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/
On utilisé les données "trimmés" (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1069-7), i.e qui ont enlevé les fragments plus petits que la taille d'un read.
Informations:
- https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/Garvan_NA12878_HG001_HiSeq_Exome.README
- Sequencer: HiSeq2500
- kit: Nextera Rapid Capture Exome and Expanded Exome
Il y a 2 samples (NIST7035 et NIST7086), chacun sur 2 lanes -> à concaténer
NB : liste techno illumina https://www.illumina.com/systems/sequencing-platforms.html
Hiseq postérieur nextseq 550
******* DONE Capture : Exons (bed)
CLOSED: [2023-02-25 Sat 19:46]
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/nexterarapidcapture_expandedexome_targetedregions.bed.gz
****** DONE Ashkenazy trio HG002, HG003, HG004
CLOSED: [2023-04-15 Sat 23:24] SCHEDULED: <2023-04-05 Wed>
******* DONE Capture
CLOSED: [2023-04-15 Sat 23:24]
https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/OsloUniversityHospital_Exome_GATK_jointVC_11242015/wex_Agilent_SureSelect_v05_b37.baits.slop50.merged.list
******* DONE Capture Agilent
CLOSED: [2023-04-15 Sat 23:24]
******* DONE Bam à partir des fastq
CLOSED: [2023-04-15 Sat 23:24]
Bam + index + checksum
https://raw.githubusercontent.com/genome-in-a-bottle/giab_data_indexes/master/AshkenazimTrio/alignment.index.AJtrio_OsloUniversityHospital_IlluminaExome_bwamem_GRCh37_11252015
****** KILL Chinese trio
CLOSED: [2023-04-16 Sun 16:32]
Whole exome pour HG005 seulement
******* KILL HG005
CLOSED: [2023-04-16 Sun 16:32]
https://raw.githubusercontent.com/genome-in-a-bottle/giab_data_indexes/master/ChineseTrio/alignment.index.Chinesetrio_HG005_OsloUniversityHospital_IlluminaExome_bwamem_GRCh37_11252015
**** TODO Télécharger FASTQ directement avec aws (via SRA)
SCHEDULED: <2023-06-27 Tue>
***** Remarques
Numéro d'accession : https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-022-08365-3/tables/1
Fastq disponible via SRA. Avec AWS, on peut accéder au fastq directement.
(Sinon il faut convertir SRA -> Fastq avec le toolkit : compliqué à configurer)
Exemple: https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR2962669&display=data-access
Avantage:
- pas de conversion BAM -> fASTQ
- détail des capture
- capture en hg38 sur site du constructeur !!
- capture semblable pour ashkenazi
Inconvénient :
- NA12878 : discordance pour le nombre de paires de bases : NA12878 = 49G (donc 24G de fastq)
- capture non disponible en ligne (site agilent)
***** Liste des runs :
https://www.ncbi.nlm.nih.gov/sra
Cherche avec numéro patient. On a le choix entre plusieurs séquenceurs Illumina
- NovaSeq 6000 TruSeq capture SRX11061536
- NovaSeq 6000 IDT capture SRX11061526
- NovaSeq 6000 Agilent SureSelect v7 capture SRX11061516
- HiSeq 4000 TruSeq capture SRX11061506
- HiSeq 4000 IDT capture SRX11061496
- HiSeq 4000 Agilent SureSelect v7 capture SRX11061486
**** TODO NA12878 :na12878:hg38:
***** DONE Discussion alexis : Mail
CLOSED: [2023-03-29 Wed 22:40]
Avec le patient NA12878 et comparaison avec hap.py du VCF de Genome In A Bottle ("gold" standard), on avait pour rappel
- sensibilité (=recall) 71% pour indel, 85% SNP
- précision  (= VPP) 69 et 97% respectivement
| Type  | TRUTH |    TP |   FN | QUERY |   FP |  UNK | FP.gt | FP.al |   Recall | Precision |
| INDEL |  4871 |  3461 | 1410 |  7048 | 1554 | 1987 |   193 |   346 | 0.710532 |  0.692946 |
| SNP   | 46032 | 39369 | 6663 | 44600 | 1186 | 4041 |   304 |    30 | 0.855253 |  0.970759 |
Les statistiques sur les génomes sont bien meilleurs (cf precisionFDA challenge).
Pour les exome, un article [1] a fait a des meilleures stats sur ce patient avec BWA et GATK mais ils ont moins de variant (on a presque un facteur 2 !).
Je soupçonne qu'on ne travaille pas sur les mêmes zones de capture (pas réussi à récupérer leur .bed)
| Exome | Type  |    TP |   FP |  FN | Sensitivity | Precision | F-Score |   FDR |
|     1 | SNV   | 23689 | 1397 | 613 |       0.975 |     0.944 |   0.959 | 0.057 |
|     2 | SNV   | 23946 |  865 | 356 |       0.985 |     0.965 |   0.975 | 0.036 |
|     1 | indel |  1254 |   72 |  75 |       0.944 |     0.946 |   0.945 | 0.054 |
|     2 | indel |  1309 |   10 |  20 |       0.985 |     0.992 |   0.989 | 0.008 |
Pour essayer d'améliorer les statistiques :
- La version du génome GRC38 vs GRCh38.p13 ne change quasiment rien
- Désactiver dbSNP ne change strictement rien pour le variant calling
J'ai exploré les faux négatifs :
- la grande majorité n'est juste pas vue (ce n'est pas un problème d'haploïde/génotype)
- la répartition par chromosome est relativement homogène, sauf sur le 6 ()
- la majorité est en 5' et 3'UTR (selon Best refseq)
Conclusion: je pense m'arrêter là pour la validation du variant calling par manque de temps. Il faudrait creuser pour savoir pourquoi certains variants ne sont pas vus par GATK mais ce n'est pas la majorité. En tout cas, je peux justifier d'une première analyse pour la thèse.
Ça te va ?
[1]
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2928-9
Résultats ici https://static-content.springer.com/esm/art%3A10.1186%2Fs12859-019-2928-9/MediaObjects/12859_2019_2928_MOESM8_ESM.pdf
***** DONE Comparaison
CLOSED: [2023-03-04 Sat 11:14]
HGREF=/Work/Groups/bisonex/data-alexis-reference/genome/GRCh38_latest_genomic.fna ./result/bin/hap.py /Work/Groups/bisonex/NA12878/HG001_GRCh38_1_22_v4.2.1
_benchmark_renamed.vcf.gz script/files/vcf/NA12878_NIST7035_vep_annot.vcf -f /Work/Groups/bison
ex/NA12878/HG001_GRCh38_1_22_v4.2.1_benchmark.bed -o test
na1878.slurm
#+begin_src slurm
#!/bin/bash
#SBATCH -c 4
#SBATCH -p smp
#SBATCH --time=01:00:00
#SBATCH --mem=32G
module load nix/2.11.0
export HGREF=/Work/Groups/bisonex/data-alexis-reference/genome/GRCh38_latest_genomic.fna
dir=/Work/Groups/bisonex/data/NA12878/GRCh38
hap.py ${dir}/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz script/files/vcf/NA12878_NIST7035.vcf -f ${dir}/HG001_GRCh38_1_22_v4.2.1_benchmark.bed -o test
#+end_src
****** KILL beaucoup trop de faux négatifs
CLOSED: [2023-02-17 Fri 19:37]
******* DONE Test 1 : vep annot : beaucoup trop de faux négatif
CLOSED: [2023-02-06 lun. 13:40]
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL       276768       274    276494         1500       257        968     26     15       0.000990          0.516917        0.645333         0.001976                     NaN                     NaN                   1.483361                   6.129187
INDEL   PASS       276768       274    276494         1500       257        968     26     15       0.000990          0.516917        0.645333         0.001976                     NaN                     NaN                   1.483361                   6.129187
  SNP    ALL      1937706      1193   1936513         3338       106       2037     11      2       0.000616          0.918524        0.610246         0.001231                  2.0785                1.861183                   1.539064                   2.703663
  SNP   PASS      1937706      1193   1936513         3338       106       2037     11      2       0.000616          0.918524        0.610246         0.001231                  2.0785                1.861183                   1.539064                   2.703663
******* KILL Test 3 : indexer vcf de reference
CLOSED: [2023-02-06 lun. 17:19]
Même résultat avec vcfeval, qui a besoin de la version indexée
******* DONE Test 3 sans filtre vep : idem
CLOSED: [2023-02-06 lun. 17:19]
Benchma

Replacement in projects/bisonex.org at line 37 [3.35]

B:BD[14.7460] → [14.7460:8221]

B:BD[14.8221] → [2.33561:50473]

∅:D[2.50473] → [14.16414:18778]

B:BD[14.16414] → [14.16414:18778]

∅:D[14.18778] → [15.16414:20953]

B:BD[15.16414] → [15.16414:20953]

 | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
|-------+--------+-------------+----------+----------+-------------+----------+-----------+-------+-------+---------------+------------------+----------------+-----------------+------------------------+------------------------+---------------------------+---------------------------|
| INDEL | ALL    |        4871 |     3678 |     1193 |        7036 |     1299 |      2011 |   208 |   217 |      0.755081 |         0.741493 |       0.285816 |        0.748225 |                        |                        |        1.6174985978687606 |        2.5240506329113925 |
| INDEL | PASS   |        4871 |     3678 |   
  1193 |        7036 |     1299 |      2011 |   208 |   217 |      0.755081 |         0.741493 |       0.285816 |        0.748225 |                        |                        |        1.6174985978687606 |        2.5240506329113925 |
| SNP   | ALL    |       46032 |    41138 |     4894 |       47694 |     1622 |      4930 |   362 |    31 |      0.893683 |         0.962071 |       0.103367 |        0.926617 |      2.529551552318896 |     2.4124463519313304 |        1.6206857273037931 |        1.6888675840288743 |
| SNP   | PASS   |       46032 |    41138 |     4894 |       47694 |     1622 |      4930 |   362 |    31 |      0.893683 |         0.962071 |       0.103367 |        0.926617 |      2.529551552318896 |     2.4124463519313304 |        1.6206857273037931 |         1.688867584028874 |
***** KILL Résultats sans trimming
CLOSED: [2023-06-25 Sun 15:53] SCHEDULED: <2023-06-26 Mon>
***** TODO Refaire avec génome "prêt à l'emploi"
SCHEDULED: <2023-06-25 Sun>
***** KILL Utiliser d'autres données brutes ?
CLOSED: [2023-06-25 Sun 15:58]
https://zenodo.org/record/3597727
Capture en hg37 également. Serait intéressant mais pas le temps..
***** TODO Comparer avec UCSCS liftover
SCHEDULED: <2023-06-25 Sun>
**** TODO HG002 :hg002:hg38:
SCHEDULED: <2023-04-10 Mon>
#+begin_src
    NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/giabFastq.nf -profile standard,helios
    NXF_OPTS=-D"user.name=${USER}" nextflow run main.nf -profile standard,helios -resume --input="/Work/Groups/bisonex/data/giab/GRCh38/HG002_{1,2}.fq.gz --test.id=HG002
Only the capture file differs. Results are better using the capture file given by Agilent, stored in data/
    NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/compareVCF.nf -profile standard,helios -resume --outdir=compareHG002 --test.id=HG002 --test.query=out/HG002_1/variantCalling/haplotypecaller/HG002_1.vcf.gz  --test.compare=vcfeval,happy --test.capture=data/AgilentSureSelectv05_hg38.bed
#
#+end_src
***** DONE Mauvais résultats
CLOSED: [2023-04-14 Fri 09:42]
avec vcfeval
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    0.000              24585          24390      10060      39415     0.7080       0.3841     0.4980
     None              24585          24390      10060      39415     0.7080       0.3841     0.4980
La sortie du variantCalling est celle d'happy ???
On relance...
***** DONE Vérifier vcf en hg38
CLOSED: [2023-04-12 Wed 10:33] SCHEDULED: <2023-04-12 Wed>
***** KILL Capture en hg19 ?
CLOSED: [2023-04-13 Thu 09:46] SCHEDULED: <2023-04-12 Wed>
***** KILL Vraiment fichier de capture ou zone d'intérêt ?
CLOSED: [2023-04-13 Thu 09:45] SCHEDULED: <2023-04-12 Wed>
"target region" +/- 50bp
[[https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/OsloUniversityHospital_Exome_GATK_jointVC_11242015/README.txt][README]]
 list file describing the variant calling regions (target regions extended with 50 bp on each end)
***** DONE .bed fourni par AGilent: sensbilité très mauvaise
CLOSED: [2023-04-13 Thu 09:46] SCHEDULED: <2023-04-13 Thu>
Agilent SureSelect Human All Exon V5 kit
Disponible en hg38
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    0.000              19653          19501       6410      21657     0.7526       0.4757     0.5830
     None              19653          19501       6410      21657     0.7526       0.4757     0.5830
***** DONE Trier par nom avec samtools sort : bons résultats
CLOSED: [2023-04-14 Fri 09:25] SCHEDULED: <2023-04-13 Thu>
Avec capture fourni par GIAB
vcf eval
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    5.000              57443          57032        984       6557     0.9830       0.8975     0.9383
     None              57457          57046       1009       6543     0.9826       0.8978     0.9383
Happy
| Type  | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
|-------+--------+-------------+----------+----------+-------------+----------+-----------+-------+-------+---------------+------------------+----------------+-----------------+------------------------+------------------------+---------------------------+---------------------------|
| INDEL | ALL    |        6150 |     5007 |     1143 |        6978 |      556 |      1346 |   151 |   168 |      0.814146 |         0.901278 |       0.192892 |          0.8555 |                        |                        |        1.5434221840068787 |        1.9467178175618074 |
| INDEL | PASS   |        6150 |     5007 |     1143 |        6978 |      556 |      1346 |   151 |   168 |      0.814146 |         0.901278 |       0.192892 |          0.8555 |                        |                        |        1.5434221840068787 |        1.9467178175618074 |
| SNP   | ALL    |       57818 |    52464 |     5354 |       56016 |      500 |      3046 |    90 |    30 |      0.907399 |         0.990561 |       0.054377 |        0.947158 |     2.4892012548262548 |      2.426824047458871 |        1.5904527117884357 |        1.6107795598657217 |
| SNP   | PASS   |       57818 |    52464 |     5354 |       56016 |      500 |      3046 |    90 |    30 |      0.907399 |         0.990561 |       0.054377 |        0.947158 |     2.4892012548262548 |      2.426824047458871 |        1.5904527117884357 |        1.6107795598657217 |
***** DONE Capture agilent légment meilleur que celui fourni par GIAB (padding ?)
CLOSED: [2023-04-14 Fri 09:48]
GIAB:
vcf eval
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    5.000              57443          57032        984       6557     0.9830       0.8975     0.9383
     None              57457          57046       1009       6543     0.9826       0.8978     0.9383
Happy
| Type  | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
|-------+--------+-------------+----------+----------+-------------+----------+-----------+-------+-------+---------------+------------------+----------------+-----------------+------------------------+------------------------+---------------------------+---------------------------|
| INDEL | ALL    |        6150 |     5007 |     1143 |        6978 |      556 |      1346 |   151 |   168 |      0.814146 |         0.901278 |       0.192892 |          0.8555 |                        |                        |        1.5434221840068787 |        1.9467178175618074 |
| INDEL | PASS   |        6150 |     5007 |     1143 |        6978 |      556 |      1346 |   151 |   168 |      0.814146 |         0.901278 |       0.192892 |          0.8555 |                        |                        |        1.5434221840068787 |        1.9467178175618074 |
| SNP   | ALL    |       57818 |    52464 |     5354 |       56016 |      500 |      3046 |    90 |    30 |      0.907399 |         0.990561 |       0.054377 |        0.947158 |     2.4892012548262548 |      2.426824047458871 |        1.5904527117884357 |        1.6107795598657217 |
| SNP   | PASS   |       57818 |    52464 |     5354 |       56016 |      500 |      3046 |    90 |    30 |      0.907399 |         0.990561 |       0.054377 |        0.947158 |     2.4892012548262548 |      2.426824047458871 |        1.5904527117884357 |        1.6107795598657217 |
Agilent
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    6.000              37241          36965        449       4069     0.9880       0.9015     0.9428
     None              37248          36972        461       4062     0.9877       0.9017     0.9427
| Type  | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL    |        2909 |     2477 |      432 |        3229 |      207 |       519 |    52 |    50 |      0.851495 |         0.923616 |       0.160731 |        0.886091 |                        |                        |        1.4964850615114236 |        1.8339222614840989 |
| INDEL | PASS   |        2909 |     2477 |      432 |        3229 |      207 |       519 |    52 |    50 |      0.851495 |         0.923616 |       0.160731 |        0.886091 |                        |                        |        1.4964850615114236 |        1.8339222614840989 |
| SNP   | ALL    |       38406 |    34793 |     3613 |       36935 |      275 |      1868 |    37 |    15 |      0.905926 |         0.992158 |       0.050575 |        0.947083 |     2.6247759222568168 |     2.5752854654538417 |         1.588953331534934 |        1.6192536889897844 |
| SNP   | PASS   |       38406 |    34793 |     3613 |       36935 |      275 |      1868 |    37 |    15 |      0.905926 |         0.992158 |       0.050575 |        0.947083 |     2.6247759222568168 |     2.5752854654538417 |         1.588953331534934 |        1.6192536889897844 |
***** TODO Refaire avec génome "prêt à l'emploi"
SCHEDULED: <2023-06-25 Sun>
**** TODO HG003 :hg003:hg38:
***** Notes
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run main.nf -profile standard,helios  --input /Work/Groups/bisonex/data/giab/GRCh38/HG003_{1,2}.fq.gz -bg
#+end_src
#+begin_src  sh
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/compareVCF.nf -profile standard,helios -resume --outdir=compareHG003  --test.id=HG003 --test.query=out/HG003_1/variantCalling/haplotypecaller/HG003_1.vcf.gz  --test.compare=vcfeval,happy --test.capture=data/AgilentSureSelectv05_hg38.bed
#+end_src
vcfeval
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    5.000              36745          36473        486       3988     0.9869       0.9021     0.9426
     None              36748          36476        495       3985     0.9866       0.9022     0.9425
$ zcat NA12878.snp_roc.tsv.gz  | tail -n 1 | awk '{print $7 $6}'
happy
Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL         2731      2290       441         3092       208        577     62     53       0.838521          0.917296        0.186611         0.876141                     NaN                     NaN                   1.505145                   1.888993
INDEL   PASS         2731      2290       441         3092       208        577     62     53       0.838521          0.917296        0.186611         0.876141                     NaN                     NaN                   1.505145                   1.888993
  SNP    ALL        37997     34481      3516        36861       306       2074     33     13       0.907466          0.991204        0.056265         0.947488                2.611269                2.565915                   1.555780                   1.621727
  SNP   PASS        37997     34481      3516        36861       306       2074     33     13       0.907466          0.991204        0.056265         0.947488                2.611269                2.5659
***** TODO Refaire avec génome "prêt à l'emploi"
SCHEDULED: <2023-06-25 Sun>
**** TODO HG004 :hg38:
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run main.nf -profile standard,helios  --input /Work/Groups/bisonex/data/giab/GRCh38/HG004_{1,2}.fq.gz -bg
#+end_src
vcfeval
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    6.000              36938          36678        421       4040     0.9887       0.9014     0.9430
     None              36942          36682        432       4036     0.9884       0.9015     0.9429
happy
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL         2787      2388       399         3183       195        580     53     38       0.856835          0.925086        0.182218         0.889654                     NaN                     NaN                   1.507834                   1.848649
INDEL   PASS         2787      2388       399         3183       195        580     53     38       0.856835          0.925086        0.182218         0.889654                     NaN                     NaN                   1.507834                   1.848649
  SNP    ALL        38185     34560      3625        36921       254       2107     46      7       0.905067          0.992704        0.057068         0.946862                2.589175                2.553546                   1.632595                   1.653534
  SNP   PASS        38185     34560      3625        36921       254       2107     46      7       0.905067          0.992704        0.057068         0.946862                2.589175                2.553546                   1.632595                   1.653534
**** DONE Résumer résultats pour Paul + article :resultats:hg38:
CLOSED: [2023-04-06 Thu 21:41] SCHEDULED: <2023-04-02 Sun>
**** DONE Plot : ashkenazim trio :hg38:
CLOSED: [2023-04-18 Tue 21:27] SCHEDULED: <2023-04-16 Sun>
/Entered on/ [2023-04-16 Sun 17:29]
**** TODO Refaire avec génome "prêt  l'emploi"
SCHEDULED: <2023-06-25 Sun>
**** TODO NA12878 :na12878:T2T:
SCHEDULED: <2023-06-15 Thu>
nextflow run workflows/compareVCF.nf -profile standard,helios --output=validation/na12878-grch38 --test.query=out/NA12878-GRCh38/callVariant/haplotypecaller/NA12878-GRCh38.vcf.gz --test.compare=happy,vcfeval -lib lib/ --test.id=HG001 --genome=GRCh38
***** GRCh38
****** vcfeval
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    3.000              45121          44771       2961       5402     0.9380       0.8931
 0.9150
     None              45130          44780       2981       5393     0.9376       0.8933
 0.9149
***** happy
| Type  | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL    |        4871 |     3636 |     1235 |        7498 |     1629 |      2185 |   195 |   363 |      0.746459 |         0.693394 |       0.291411 |        0.718948 |                        |                        |        1.6174985978687606 |         3.083517699115044 |
| INDEL | PASS   |        4871 |     3636 |     1235 |        7498 |     1629 |      2185 |   195 |   363 |      0.746459 |         0.693394 |       0.291411 |        0.718948 |                        |                        |        1.6174985978687606 |         3.083517699115044 |
| SNP   | ALL    |       46032 |    41994 |     4038 |       48715 |     1438 |      5282 |   299 |    48 |      0.912278 |         0.966892 |       0.108427 |        0.938791 |      2.529551552318896 |      2.375987252320909 |        1.6206857273037931 |        1.6987584524997228 |
| SNP   | PASS   |       46032 |    41994 |     4038 |       48715 |     1438 |      5282 |   299 |    48 |      0.912278 
|         0.966892 |       0.108427 |        0.938791 |      2.529551552318896 |      2.375987252320909 |        1.6206857273037931 |        1.6987584524997228 |
A comparer avec la version patchée génome mais mal alignée
[[id:cd79a77c-a0b6-4bb1-9e08-fe08dc89e3aa][Résultats finaux]]
Un peu plus de vrais positifs, logique
*** KILL Platinum genome
CLOSED: [2023-06-14 Wed 22:37]
https://emea.illumina.com/platinumgenomes.html
*** TODO Séquencer NA12878
Discussion avec Paul : sous-traitant ne nous donnera pas les données, il faut commander l'ADN
** TODO Insilico :centogene:
*** TODO tous les variants centogène
**** DONE Extraire liste des SNVs
CLOSED: [2023-04-22 Sat 17:32] SCHEDULED: <2023-04-17 Mon>
***** DONE Corriger manquant à la main
CLOSED: [2023-04-22 Sat 17:31]
La sortie est sauvegardé dans git-annex : variants_success.csv
***** DONE Automatique
CLOSED: [2023-04-22 Sat 17:31]
**** DONE Convert SNVs : transcript -> génomique
CLOSED: [2023-06-03 Sat 17:16]
***** DONE Variant_recoder
CLOSED: [2023-04-26 Wed 21:21] SCHEDULED: <2023-04-22 Sat>
****** KILL Haskell: 160 manquant : recoded-success.csv
CLOSED: [2023-04-25 Tue 18:32]
La liste des variants a été générée en Haskel   l et nettoyée à la main.
On générer une liste de variant pour variant_rec            oder et on soumet tout d'un coup.
[[file:~/recherche/bisonex/parsevariants/app/Main.hs][parsevariant]]
#+begin_src haskell
recodeVariant = do
  prepareVariantRecod   er "variant_success.csv" "renamed.csv"
  runVariantRecoder "renamed.csv" "recoded.json"
#+end_src
#+RESULTS:
: <interactive>:4:3-19: error:
:     Variable not in scope: runVariantRecoder :: String -> String -> t
: gh
Problème : 160 n'ont pas pu être lu sur 820, probablement à cause du numéro mineur de transcrit
La sortie est sauvegardé dans git-annex : variants-recoded-raw.json.
****** KILL Julia
CLOSED: [2023-04-25 Tue 18:32]
On regénère la liste de variant et on passe à Julia pour préparer l'appel en parallèle à variant recoder
[[file:~/recherche/bisonex/parsevariants/variantRecoder.jl][variantRecoder.jl]]
#+begin_src julia
setupVariantRecoder(unique(init), n)
#+end_src
Puis
#+begin_src sh
parallel -a parallel-recoder.sh --jobs 10
#+end_src
On récupère les résultats
#+begin_src julia
(fails, success) = mergeVaria
ntRecoder(n)
CSV.write(fSuccess, success)
CSV.write(fFailures, fails)
#+end_src
Certains variants ne sont pas trouvé, donc on prépare un nouveau job en enlevant les versionrs mineures des transcrits
#+begin_src julia
# Cleanup json and txt
if isfile(fSuccess) && isfile(fFailures)
    foreach(rm, variantRecoderInput())
    foreach(rm, variantRecoderOutput())
end
redoFails(fFailures)
#+end_src
Puis
#+begin_src sh
parallel -a parallel-recoder.sh --jobs 3
#+end_src
Il manque encore 70 transcrits
***** DONE Julia avec mobidetails: recode-failures-mobidetails.csv
CLOSED: [2023-04-25 Tue 18:58]
Nouvelle stratégie : on essaie une fois variant recoder.
Pour tous les échecs, on utilise mobidetails (~170).
Si l'ID n'est pas trouvé, on incrémente le numéro de version 2 fois
***** DONE Reste une dizaine à corriger à la main
CLOSED: [2023-04-26 Wed 21:21]
- [X] certains transcrits ont juste été supprimé
- [X] Erreur de parsing, manque souvent un -
#+begin_src julia
lastTryMobidetails("recoded-failures-mobidetails.csv")
#+end_src
***** DONE Fusionner données
CLOSED: [2023-04-26 Wed 22:35]
#+begin_src julia
function mergeAllGenomic()
    dNew = mergeAll("recoded-success.csv",
                    "recoded-failures-mobidetails.csv",
                    "recoded-failures-mobidetails-redo.csv")
    dInit = @chain DataFrame(CSV.File("variant_success.csv")) begin
        @transform :transcript = :transcript .* ":" .* :coding .* :codingPos .* :codingChange
        @select :file :transcript :classification :zygosity
        @rename :classificationCentogene = :classification
    end
    dTmp = outerjoin(dInit, dNew, on = :transcript)
    CSV.write("variant_genomic.csv", dTmp)
end
fSuccess = "recoded-success.csv"
fFailures = "recoded-failures.csv"
# variantRecoder(fSuccess, fFailures)
# mobidetailsOnFailures(fFailures)
# lastTryMobidetails("recoded-failures-mobidetails.csv")
mergeAllGenomic()
#+end_src
***** DONE Formatter donner pour simuscop
CLOSED: [2023-04-28 Fri 11:55] SCHEDULED: <2023-04-26 Wed>
**** TODO Extraire liste des CNVs
SCHEDULED: <2023-04-17 Mon>
**** TODO Simuscop :simuscop:
***** DONE Entrainer le modèle sur 63003856/
CLOSED: [2023-04-29 Sat 19:56]
Relancer le modèle pour être sûr
***** DONE Générer fastq avec simuscop (del et ins seulement) 20x
CLOSED: [2023-04-28 Fri 23:35] SCHEDULED: <2023-04-22 Sat>
****** DONE Génerer un profile avec bed de centogène
CLOSED: [2023-04-28 Fri 11:54] SCHEDULED: <2023-04-22 Sat>
NA12878 mais à refaire avec un vrai séquencage
Voir [[*Centogène][Bed Centogène]] pour choix
****** DONE Générer les données en 20x
CLOSED: [2023-04-28 Fri 11:54] SCHEDULED: <2023-04-22 Sat>
capture de centogene
****** DONE Regénérer en supprimant les doublons
CLOSED: [2023-04-28 Fri 17:28]
***** DONE Quelle couverture ?
CLOSED: [2023-04-29 Sat 18:26]
ex sur chr11:16,014,966 où on a 11 reads dans la simulation contre 200 !
****** 200 est la plus proche
#+attr_html: :width 500px
[[./simuscop-200-chr1-1.png]]
#+attr_html: :width 500px
[[./simuscop-200-chr1-2.png]]
****** DONE 20x
CLOSED: [2023-04-29 Sat 15:38]
****** DONE 50x
CLOSED: [2023-04-29 Sat 15:38]
****** DONE 100x
CLOSED: [2023-04-29 Sat 15:39]
****** DONE 200x
CLOSED: [2023-04-29 Sat 15:39]
***** DONE Reads mal centrés sur des petits exons seuls
CLOSED: [2023-04-29 Sat 19:56] SCHEDULED: <2023-04-29 Sat>
Capture ok : [[https://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr1%3A153817168%2D153817824&hgsid=296556270_F4fkENLPXHXidi2oALXls2jxNH9l][UCSC]] (track noire)
Mais mauvaise répartitiopn
#+attr_html: :width 800px
[[./simuscop-error.png]]
À tester
- Problème de profile ?
  - mauvais patient ?
  - mauvaise génération ? -> comparer avec ceux donnés sur github
- nom des chromosomes ?
****** DONE [#A] Tester sur exon 6 GATAD2B pour NC_000001.11:g.153817496A>T
CLOSED: [2023-04-29 Sat 19:56] SCHEDULED: <2023-04-29 Sat>
******* DONE Configuration + Profile 63003856.profile: idem, mal centré
CLOSED: [2023-04-29 Sat 19:18]
Téléchargement des données
#+begin_src sh :dir ~/code/bisonex/test-simuscop
scp meso:/Work/Projects/bisonex/data/genome/GRCh38.p14/genomeRef.fna .
scp meso:Work/Projects/bisonex/data/simuscop/*.profile .
scp -r meso:/Work/Projects/bisonex/data/genome/GRCh38.p13/bwa .
#+end_src
On récupère l'exon (NB: org

[14.7460]

[15.20953]

 | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
|-------+--------+-------------+----------+----------+-------------+----------+-----------+-------+-------+---------------+------------------+----------------+-----------------+------------------------+------------------------+---------------------------+---------------------------|
| INDEL | ALL    |        4871 |     3678 |     1193 |        7036 |     1299 |      2011 |   208 |   217 |      0.755081 |         0.741493 |       0.285816 |        0.748225 |                        |                        |        1.6174985978687606 |        2.5240506329113925 |
| INDEL | PASS   |        4871 |     3678 |     1193 |        7036 |     1299 |      2011 |   208 |   217 |      0.755081 |         0.741493 |       0.285816 |        0.748225 |                        |                        |        1.6174985978687606 |        2.5240506329113925 |
| SNP   | ALL    |       46032 |    41138 |     4894 |       47694 |     1622 |      4930 |   362 |    31 |      0.893683 |         0.962071 |       0.103367 |        0.926617 |      2.529551552318896 |     2.4124463519313304 |        1.6206857273037931 |        1.6888675840288743 |
| SNP   | PASS   |       46032 |    41138 |     4894 |       47694 |     1622 |      4930 |   362 |    31 |      0.893683 |         0.962071 |       0.103367 |        0.926617 |      2.529551552318896 |     2.4124463519313304 |        1.6206857273037931 |         1.688867584028874 |
***** KILL Résultats sans trimming
CLOSED: [2023-06-25 Sun 15:53] SCHEDULED: <2023-06-26 Mon>
***** TODO Refaire avec génome "prêt à l'emploi"
SCHEDULED: <2023-06-25 Sun>
***** KILL Utiliser d'autres données brutes ?
CLOSED: [2023-06-25 Sun 15:58]
https://zenodo.org/record/3597727
Capture en hg37 également. Serait intéressant mais pas le temps..
***** KILL Comparer avec UCSCS liftover
CLOSED: [2023-06-26 Mon 19:02] SCHEDULED: <2023-06-25 Sun>
Picard liftoverinterval est basé sur UCSCS
Mais on n'aurait pas la différence pour NA12878 qu'on voit...
**** TODO HG002 :hg002:hg38:
SCHEDULED: <2023-04-10 Mon>
#+begin_src
    NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/giabFastq.nf -profile standard,helios
    NXF_OPTS=-D"user.name=${USER}" nextflow run main.nf -profile standard,helios -resume --input="/Work/Groups/bisonex/data/giab/GRCh38/HG002_{1,2}.fq.gz --test.id=HG002
Only the capture file differs. Results are better using the capture file given by Agilent, stored in data/
    NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/compareVCF.nf -profile standard,helios -resume --outdir=compareHG002 --test.id=HG002 --test.query=out/HG002_1/variantCalling/haplotypecaller/HG002_1.vcf.gz  --test.compare=vcfeval,happy --test.capture=data/AgilentSureSelectv05_hg38.bed
#
#+end_src
***** DONE Mauvais résultats
CLOSED: [2023-04-14 Fri 09:42]
avec vcfeval
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    0.000              24585          24390      10060      39415     0.7080       0.3841     0.4980
     None              24585          24390      10060      39415     0.7080       0.3841     0.4980
La sortie du variantCalling est celle d'happy ???
On relance...
***** DONE Vérifier vcf en hg38
CLOSED: [2023-04-12 Wed 10:33] SCHEDULED: <2023-04-12 Wed>
***** KILL Capture en hg19 ?
CLOSED: [2023-04-13 Thu 09:46] SCHEDULED: <2023-04-12 Wed>
***** KILL Vraiment fichier de capture ou zone d'intérêt ?
CLOSED: [2023-04-13 Thu 09:45] SCHEDULED: <2023-04-12 Wed>
"target region" +/- 50bp
[[https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/OsloUniversityHospital_Exome_GATK_jointVC_11242015/README.txt][README]]
 list file describing the variant calling regions (target regions extended with 50 bp on each end)
***** DONE .bed fourni par AGilent: sensbilité très mauvaise
CLOSED: [2023-04-13 Thu 09:46] SCHEDULED: <2023-04-13 Thu>
Agilent SureSelect Human All Exon V5 kit
Disponible en hg38
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    0.000              19653          19501       6410      21657     0.7526       0.4757     0.5830
     None              19653          19501       6410      21657     0.7526       0.4757     0.5830
***** DONE Trier par nom avec samtools sort : bons résultats
CLOSED: [2023-04-14 Fri 09:25] SCHEDULED: <2023-04-13 Thu>
Avec capture fourni par GIAB
vcf eval
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    5.000              57443          57032        984       6557     0.9830       0.8975     0.9383
     None              57457          57046       1009       6543     0.9826       0.8978     0.9383
Happy
| Type  | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
|-------+--------+-------------+----------+----------+-------------+----------+-----------+-------+-------+---------------+------------------+----------------+-----------------+------------------------+------------------------+---------------------------+---------------------------|
| INDEL | ALL    |        6150 |     5007 |     1143 |        6978 |      556 |      1346 |   151 |   168 |      0.814146 |         0.901278 |       0.192892 |          0.8555 |                        |                        |        1.5434221840068787 |        1.9467178175618074 |
| INDEL | PASS   |        6150 |     5007 |     1143 |        6978 |      556 |      1346 |   151 |   168 |      0.814146 |         0.901278 |       0.192892 |          0.8555 |                        |                        |        1.5434221840068787 |        1.9467178175618074 |
| SNP   | ALL    |       57818 |    52464 |     5354 |       56016 |      500 |      3046 |    90 |    30 |      0.907399 |         0.990561 |       0.054377 |        0.947158 |     2.4892012548262548 |      2.426824047458871 |        1.5904527117884357 |        1.6107795598657217 |
| SNP   | PASS   |       57818 |    52464 |     5354 |       56016 |      500 |      3046 |    90 |    30 |      0.907399 |         0.990561 |       0.054377 |        0.947158 |     2.4892012548262548 |      2.426824047458871 |        1.5904527117884357 |        1.6107795598657217 |
***** DONE Capture agilent légment meilleur que celui fourni par GIAB (padding ?)
CLOSED: [2023-04-14 Fri 09:48]
GIAB:
vcf eval
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    5.000              57443          57032        984       6557     0.9830       0.8975     0.9383
     None              57457          57046       1009       6543     0.9826       0.8978     0.9383
Happy
| Type  | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
|-------+--------+-------------+----------+----------+-------------+----------+-----------+-------+-------+---------------+------------------+----------------+-----------------+------------------------+------------------------+---------------------------+---------------------------|
| INDEL | ALL    |        6150 |     5007 |     1143 |        6978 |      556 |      1346 |   151 |   168 |      0.814146 |         0.901278 |       0.192892 |          0.8555 |                        |                        |        1.5434221840068787 |        1.9467178175618074 |
| INDEL | PASS   |        6150 |     5007 |     1143 |        6978 |      556 |      1346 |   151 |   168 |      0.814146 |         0.901278 |       0.192892 |          0.8555 |                        |                        |        1.5434221840068787 |        1.9467178175618074 |
| SNP   | ALL    |       57818 |    52464 |     5354 |       56016 |      500 |      3046 |    90 |    30 |      0.907399 |         0.990561 |       0.054377 |        0.947158 |     2.4892012548262548 |      2.426824047458871 |        1.5904527117884357 |        1.6107795598657217 |
| SNP   | PASS   |       57818 |    52464 |     5354 |       56016 |      500 |      3046 |    90 |    30 |      0.907399 |         0.990561 |       0.054377 |        0.947158 |     2.4892012548262548 |      2.426824047458871 |        1.5904527117884357 |        1.6107795598657217 |
Agilent
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    6.000              37241          36965        449       4069     0.9880       0.9015     0.9428
     None              37248          36972        461       4062     0.9877       0.9017     0.9427
| Type  | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL    |        2909 |     2477 |      432 |        3229 |      207 |       519 |    52 |    50 |      0.851495 |         0.923616 |       0.160731 |        0.886091 |                        |                        |        1.4964850615114236 |        1.8339222614840989 |
| INDEL | PASS   |        2909 |     2477 |      432 |        3229 |      207 |       519 |    52 |    50 |      0.851495 |         0.923616 |       0.160731 |        0.886091 |                        |                        |        1.4964850615114236 |        1.8339222614840989 |
| SNP   | ALL    |       38406 |    34793 |     3613 |       36935 |      275 |      1868 |    37 |    15 |      0.905926 |         0.992158 |       0.050575 |        0.947083 |     2.6247759222568168 |     2.5752854654538417 |         1.588953331534934 |        1.6192536889897844 |
| SNP   | PASS   |       38406 |    34793 |     3613 |       36935 |      275 |      1868 |    37 |    15 |      0.905926 |         0.992158 |       0.050575 |        0.947083 |     2.6247759222568168 |     2.5752854654538417 |         1.588953331534934 |        1.6192536889897844 |
***** TODO Refaire avec génome "prêt à l'emploi"
SCHEDULED: <2023-06-25 Sun>
**** TODO HG003 :hg003:hg38:
***** Notes
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run main.nf -profile standard,helios  --input /Work/Groups/bisonex/data/giab/GRCh38/HG003_{1,2}.fq.gz -bg
#+end_src
#+begin_src  sh
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/compareVCF.nf -profile standard,helios -resume --outdir=compareHG003  --test.id=HG003 --test.query=out/HG003_1/variantCalling/haplotypecaller/HG003_1.vcf.gz  --test.compare=vcfeval,happy --test.capture=data/AgilentSureSelectv05_hg38.bed
#+end_src
vcfeval
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    5.000              36745          36473        486       3988     0.9869       0.9021     0.9426
     None              36748          36476        495       3985     0.9866       0.9022     0.9425
$ zcat NA12878.snp_roc.tsv.gz  | tail -n 1 | awk '{print $7 $6}'
happy
Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL         2731      2290       441         3092       208        577     62     53       0.838521          0.917296        0.186611         0.876141                     NaN                     NaN                   1.505145                   1.888993
INDEL   PASS         2731      2290       441         3092       208        577     62     53       0.838521          0.917296        0.186611         0.876141                     NaN                     NaN                   1.505145                   1.888993
  SNP    ALL        37997     34481      3516        36861       306       2074     33     13       0.907466          0.991204        0.056265         0.947488                2.611269                2.565915                   1.555780                   1.621727
  SNP   PASS        37997     34481      3516        36861       306       2074     33     13       0.907466          0.991204        0.056265         0.947488                2.611269                2.5659
***** TODO Refaire avec génome "prêt à l'emploi"
SCHEDULED: <2023-06-25 Sun>
**** TODO HG004 :hg38:
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run main.nf -profile standard,helios  --input /Work/Groups/bisonex/data/giab/GRCh38/HG004_{1,2}.fq.gz -bg
#+end_src
vcfeval
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    6.000              36938          36678        421       4040     0.9887       0.9014     0.9430
     None              36942          36682        432       4036     0.9884       0.9015     0.9429
happy
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL         2787      2388       399         3183       195        580     53     38       0.856835          0.925086        0.182218         0.889654                     NaN                     NaN                   1.507834                   1.848649
INDEL   PASS         2787      2388       399         3183       195        580     53     38       0.856835          0.925086        0.182218         0.889654                     NaN                     NaN                   1.507834                   1.848649
  SNP    ALL        38185     34560      3625        36921       254       2107     46      7       0.905067          0.992704        0.057068         0.946862                2.589175                2.553546                   1.632595                   1.653534
  SNP   PASS        38185     34560      3625        36921       254       2107     46      7       0.905067          0.992704        0.057068         0.946862                2.589175                2.553546                   1.632595                   1.653534
**** DONE Résumer résultats pour Paul + article :resultats:hg38:
CLOSED: [2023-04-06 Thu 21:41] SCHEDULED: <2023-04-02 Sun>
**** DONE Plot : ashkenazim trio :hg38:
CLOSED: [2023-04-18 Tue 21:27] SCHEDULED: <2023-04-16 Sun>
/Entered on/ [2023-04-16 Sun 17:29]
**** TODO Refaire avec génome "prêt  l'emploi"
SCHEDULED: <2023-06-25 Sun>
**** TODO NA12878 :na12878:T2T:
SCHEDULED: <2023-06-15 Thu>
*** KILL Platinum genome
CLOSED: [2023-06-14 Wed 22:37]
https://emea.illumina.com/platinumgenomes.html
*** TODO Séquencer NA12878
Discussion avec Paul : sous-traitant ne nous donnera pas les données, il faut commander l'ADN
** TODO Insilico :centogene:
*** TODO tous les variants centogène
**** DONE Extraire liste des SNVs
CLOSED: [2023-04-22 Sat 17:32] SCHEDULED: <2023-04-17 Mon>
***** DONE Corriger manquant à la main
CLOSED: [2023-04-22 Sat 17:31]
La sortie est sauvegardé dans git-annex : variants_success.csv
***** DONE Automatique
CLOSED: [2023-04-22 Sat 17:31]
**** DONE Convert SNVs : transcript -> génomique
CLOSED: [2023-06-03 Sat 17:16]
***** DONE Variant_recoder
CLOSED: [2023-04-26 Wed 21:21] SCHEDULED: <2023-04-22 Sat>
****** KILL Haskell: 160 manquant : recoded-success.csv
CLOSED: [2023-04-25 Tue 18:32]
La liste des variants a été générée en Haskel   l et nettoyée à la main.
On générer une liste de variant pour variant_rec            oder et on soumet tout d'un coup.
[[file:~/recherche/bisonex/parsevariants/app/Main.hs][parsevariant]]
#+begin_src haskell
recodeVariant = do
  prepareVariantRecod   er "variant_success.csv" "renamed.csv"
  runVariantRecoder "renamed.csv" "recoded.json"
#+end_src
#+RESULTS:
: <interactive>:4:3-19: error:
:     Variable not in scope: runVariantRecoder :: String -> String -> t
: gh
Problème : 160 n'ont pas pu être lu sur 820, probablement à cause du numéro mineur de transcrit
La sortie est sauvegardé dans git-annex : variants-recoded-raw.json.
****** KILL Julia
CLOSED: [2023-04-25 Tue 18:32]
On regénère la liste de variant et on passe à Julia pour préparer l'appel en parallèle à variant recoder
[[file:~/recherche/bisonex/parsevariants/variantRecoder.jl][variantRecoder.jl]]
#+begin_src julia
setupVariantRecoder(unique(init), n)
#+end_src
Puis
#+begin_src sh
parallel -a parallel-recoder.sh --jobs 10
#+end_src
On récupère les résultats
#+begin_src julia
(fails, success) = mergeVariantRecoder(n)
CSV.write(fSuccess, success)
CSV.write(fFailures, fails)
#+end_src
Certains variants ne sont pas trouvé, donc on prépare un nouveau job en enlevant les versionrs mineures des transcrits
#+begin_src julia
# Cleanup json and txt
if isfile(fSuccess) && isfile(fFailures)
    foreach(rm, variantRecoderInput())
    foreach(rm, variantRecoderOutput())
end
redoFails(fFailures)
#+end_src
Puis
#+begin_src sh
parallel -a parallel-recoder.sh --jobs 3
#+end_src
Il manque encore 70 transcrits
***** DONE Julia avec mobidetails: recode-failures-mobidetails.csv
CLOSED: [2023-04-25 Tue 18:58]
Nouvelle stratégie : on essaie une fois variant recoder.
Pour tous les échecs, on utilise mobidetails (~170).
Si l'ID n'est pas trouvé, on incrémente le numéro de version 2 fois
***** DONE Reste une dizaine à corriger à la main
CLOSED: [2023-04-26 Wed 21:21]
- [X] certains transcrits ont juste été supprimé
- [X] Erreur de parsing, manque souvent un -
#+begin_src julia
lastTryMobidetails("recoded-failures-mobidetails.csv")
#+end_src
***** DONE Fusionner données
CLOSED: [2023-04-26 Wed 22:35]
#+begin_src julia
function mergeAllGenomic()
    dNew = mergeAll("recoded-success.csv",
                    "recoded-failures-mobidetails.csv",
                    "recoded-failures-mobidetails-redo.csv")
    dInit = @chain DataFrame(CSV.File("variant_success.csv")) begin
        @transform :transcript = :transcript .* ":" .* :coding .* :codingPos .* :codingChange
        @select :file :transcript :classification :zygosity
        @rename :classificationCentogene = :classification
    end
    dTmp = outerjoin(dInit, dNew, on = :transcript)
    CSV.write("variant_genomic.csv", dTmp)
end
fSuccess = "recoded-success.csv"
fFailures = "recoded-failures.csv"
# variantRecoder(fSuccess, fFailures)
# mobidetailsOnFailures(fFailures)
# lastTryMobidetails("recoded-failures-mobidetails.csv")
mergeAllGenomic()
#+end_src
***** DONE Formatter donner pour simuscop
CLOSED: [2023-04-28 Fri 11:55] SCHEDULED: <2023-04-26 Wed>
**** TODO Extraire liste des CNVs
SCHEDULED: <2023-04-17 Mon>
**** TODO Simuscop :simuscop:
***** DONE Entrainer le modèle sur 63003856/
CLOSED: [2023-04-29 Sat 19:56]
Relancer le modèle pour être sûr
***** DONE Générer fastq avec simuscop (del et ins seulement) 20x
CLOSED: [2023-04-28 Fri 23:35] SCHEDULED: <2023-04-22 Sat>
****** DONE Génerer un profile avec bed de centogène
CLOSED: [2023-04-28 Fri 11:54] SCHEDULED: <2023-04-22 Sat>
NA12878 mais à refaire avec un vrai séquencage
Voir [[*Centogène][Bed Centogène]] pour choix
****** DONE Générer les données en 20x
CLOSED: [2023-04-28 Fri 11:54] SCHEDULED: <2023-04-22 Sat>
capture de centogene
****** DONE Regénérer en supprimant les doublons
CLOSED: [2023-04-28 Fri 17:28]
***** DONE Quelle couverture ?
CLOSED: [2023-04-29 Sat 18:26]
ex sur chr11:16,014,966 où on a 11 reads dans la simulation contre 200 !
****** 200 est la plus proche
#+attr_html: :width 500px
[[./simuscop-200-chr1-1.png]]
#+attr_html: :width 500px
[[./simuscop-200-chr1-2.png]]
****** DONE 20x
CLOSED: [2023-04-29 Sat 15:38]
****** DONE 50x
CLOSED: [2023-04-29 Sat 15:38]
****** DONE 100x
CLOSED: [2023-04-29 Sat 15:39]
****** DONE 200x
CLOSED: [2023-04-29 Sat 15:39]
***** DONE Reads mal centrés sur des petits exons seuls
CLOSED: [2023-04-29 Sat 19:56] SCHEDULED: <2023-04-29 Sat>
Capture ok : [[https://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr1%3A153817168%2D153817824&hgsid=296556270_F4fkENLPXHXidi2oALXls2jxNH9l][UCSC]] (track noire)
Mais mauvaise répartitiopn
#+attr_html: :width 800px
[[./simuscop-error.png]]
À tester
- Problème de profile ?
  - mauvais patient ?
  - mauvaise génération ? -> comparer avec ceux donnés sur github
- nom des chromosomes ?
****** DONE [#A] Tester sur exon 6 GATAD2B pour NC_000001.11:g.153817496A>T
CLOSED: [2023-04-29 Sat 19:56] SCHEDULED: <2023-04-29 Sat>
******* DONE Configuration + Profile 63003856.profile: idem, mal centré
CLOSED: [2023-04-29 Sat 19:18]
Téléchargement des données
#+begin_src sh :dir ~/code/bisonex/test-simuscop
scp meso:/Work/Projects/bisonex/data/genome/GRCh38.p14/genomeRef.fna .
scp meso:Work/Projects/bisonex/data/simuscop/*.profile .
scp -r meso:/Work/Projects/bisonex/data/genome/GRCh38.p13/bwa .
#+end_src
On récupère l'exon (NB: org

Deletion in notes/medecine/20230528235213-maladies_infectieuses.org at line 6 [16.3503]
B:BD[16.3641] → [16.3641:3665]
```
* Maladies infectieuses
```