apraga/org - Change OEYACZNB7JKYKXLFHOU6DKLEWARDT53GSEFLBA6TGDLZTNCB4ZXQC

Hemostasis notes

Created by Alexis Praga on November 26, 2023

Dependencies

In channels

main

Change contents

Replacement in projects/bisonex.org at line 2 [4.35]

B:BD[3.5112] → [2.414:8606]

ub.com/NixOS/nixpkgs/issues/192396][Bug report Version 22.10.6]]
**** Notes
Error:
ERROR: Cannot download nextflow required file -- make sure you can connect to the internet
Alternatively you can try to download this file:
    https://www.nextflow.io/releases/v22.10.6/nextflow-22.10.6-all.jar
and save it as:
    .//nix/store/md2b1ah4d7ivj82k8xxap30dmdci00pa-nextflow-22.10.6/bin/.nextflow-wrapped
The update creates a virtual environment that breaks nextflow execution (it needs to download a file).
Fix: disable it
**** KILL Patch NXF_OFFLINE=true
CLOSED: [2023-07-02 Sun 11:02] SCHEDULED: <2023-06-11 Sun>
** WAIT [[https://github.com/NixOS/nixpkgs/pull/249329][Multiqc]]
HG002,sanger-chr20,data/HG002-sanger-inserted-chr20_1.fq.gz,data/HG002-sanger-inserted-chr20_2.fq.gz
** KILL Mutalyzer
CLOSED: [2023-08-16 Wed 19:07] SCHEDULED: <2023-08-13 Sun>
Packaging is feasible but requires many Python packages
** TODO Variant validator -> hgvs
It is just an interface around hgvs, but it requires
- postgresql
- access to the databases, or a local download of them
  Dependencies
  s: wcwidth, pyee, pure-eval, ptyprocess, pickleshare, parsley, parse, fake-useragent, executing, backcall, appdirs, zipp, websockets, w3lib, urllib3, traitlets, tqdm, tabulate, sqlparse, soupsieve, six, pygments, psycopg2, prompt-toolkit, pexpect, parso, lxml, idna, humanfriendly, decorator, cython, cssselect, configparser, charset-normalizer, certifi, attrs, requests, pysam, pyquery, matplotlib-inline, jedi, importlib-metadata, coloredlogs, beautifulsoup4, asttokens, yoyo-migrations, stack-data, pyppeteer, bs4, bioutils, requests-html, ipython, biocommons.seqrepo, hgvs
** TODO SPIP :spip:
*** DONE PR upstream
CLOSED: [2023-08-12 Sat 18:23] SCHEDULED: <2023-08-12 Sat 18:00>
*** DONE Email R. Lemann :T2T:
CLOSED: [2023-08-12 Sat 18:23] SCHEDULED: <2023-08-12 Sat 18:00>
*** KILL T2T update :T2T:
*** WAIT Fix PR
SCHEDULED: <2023-11-26 Sun>
** TODO VEP :vep:
*** DONE [[https://github.com/NixOS/nixpkgs/pull/185691][BioPerl]]
SCHEDULED: <2022-08-10 Wed>
/Entered on/ [2022-08-09 Tue 10:57]
PR submitted
*** TODO BioDBBigFile
:PROPERTIES:
:ORDERED:  t
:END:
/Entered on/ [2022-08-10 Wed 14:28]
We use the latest version of kent, so the problem is gone.
Ready to be merged. Rebase done <2023-07-02 Sun>
**** DONE Already-packaged kent version: force version 335
CLOSED: [2023-07-02 Sun 11:20]
***** KILL [[https://github.com/NixOS/nixpkgs/pull/206991][Restore building kent 404]]
CLOSED: [2023-05-06 Sat 17:40]
Review done <2023-03-26 Sun>, awaiting merge
Followed up <2023-05-06 Sat>
Kent 446 does not have this problem, so the PR is unnecessary
***** DONE [[https://github.com/NixOS/nixpkgs/pull/223411][Add the headers to the package]] (inc folder)
CLOSED: [2023-05-08 Mon 10:18] SCHEDULED: <2023-05-07 Sun>
Review to do
https://github.com/NixOS/nixpkgs/pull/223411
Fixed; the previous PR is no longer needed
***** KILL [[https://github.com/NixOS/nixpkgs/pull/186462][BioDBBigFile]] with these 2 changes
CLOSED: [2023-07-02 Sun 11:20]
**** KILL Already-packaged kent version: 404
CLOSED: [2023-03-27 Mon 16:43]
Compiles but the tests do not pass
**** DONE Modify according to PR https://github.com/NixOS/nixpkgs/pull/186462
CLOSED: [2023-07-30 Sun 22:01] SCHEDULED: <2023-07-30 Sun 20:00>
:LOGBOOK:
CLOCK: [2023-07-30 Sun 19:13]--[2023-07-30 Sun 20:50] =>  1:37
:END:
Changes needed for kent:
- no more patch
- remove a loop in postPatch
NIX_BUILD_TOP is also removed
**** WAIT Fix biobigfile PR
SCHEDULED: <2023-11-28 Tue>
/Entered on/ [2023-10-15 Sun 17:21]
*** DONE [[https://github.com/NixOS/nixpkgs/pull/186459][BioDBHTS]]
CLOSED: [2023-05-06 Sat 08:49] SCHEDULED: <2023-04-15 Sat>
/Entered on/ [2022-08-10 Wed 14:28]
Review fixes done <2022-10-10 Mon>
*** DONE [[https://github.com/NixOS/nixpkgs/pull/186464][BioExtAlign]]
CLOSED: [2022-10-22 Sat 12:43] SCHEDULED: <2022-08-10 Wed>
/Entered on/ [2022-08-10 Wed 14:28]
Review <2022-10-10 Mon>, fixed the same day.
Second round of fixes, waiting
The tests cannot be made to work because they cannot find the Bio::Tools::Align module, which lives in another folder of the repository. Even compiling the whole repository does not help... The tests are skipped.
*** TODO VEP
** WAIT [[https://github.com/NixOS/nixpkgs/pull/230394][rtg-tools]] :vcfeval:
Submitted
** WAIT Package Spip https://github.com/NixOS/nixpkgs/pull/247476
** TODO Happy :happy:
*** TODO PR python 3 upstream
SCHEDULED: <2023-12-02 Sat>
*** TODO nixpkgs as is
SCHEDULED: <2023-12-02 Sat>
** PROJ SpliceAI
** TODO Bamsurgeon
/Entered on/ [2023-05-13 Sat 19:11]
*** TODO Velvet
** TODO Picard PR with an option to manage memory
Similar to
https://github.com/bioconda/bioconda-recipes/blob/master/recipes/picard/picard.sh
* Julia :julia:
** KILL XAM.jl: PR for record modification :julia:
CLOSED: [2023-05-29 Mon 15:40] SCHEDULED: <2023-05-28 Sun>
/Entered on/ [2023-05-27 Sat 22:39]
** TODO XAMscissors.jl :xamscissors:
Modifies the sequence in the BAM.
*The CIGAR is not updated*
We convert to fastq and run the pipeline to "fix" it
#+begin_src sh
cd /home/alex/code/bisonex/out/63003856/preprocessing/mapped
samtools view 63003856_S135.bam NC_000022.11 -o 63003856_S135_chr22.bam
cd /home/alex/recherche/bisonex/code/BamScissors.jl
cp ~/code/bisonex/out/63003856/preprocessing/mapped/63003856_S135_chr22.bam .
# index the file that was just copied
samtools index 63003856_S135_chr22.bam
#+end_src
The script modifies the BAM, sorts it, and generates the fastq.
!!! Warning: do not forget the -n option !!!
#+begin_src sh
time julia --project=.. insertVariant.jl
scp 63003856_S135_chr22_{1,2}.fq.gz meso:/Work/Users/apraga/bisonex/tests/bamscissors/
#+end_src
*** WAIT Implement SNVs with VAF :snv:
Strategy:
1. compute the depth at the positions
2. build a dictionary { read name : dataframe position }
3. iterate over all reads and modify the marked ones
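The three steps above can be sketched as follows. This is an illustration in Python, not the Julia code of XAMscissors.jl: the read and variant dictionaries, their field names (name, start, seq, pos, alt, vaf) and the function insert_snvs are all hypothetical, positions are 0-based, and a read carries at most one variant.

```python
import random

def insert_snvs(reads, variants, rng):
    """Toy version of the strategy: depth, {read name: variant row}, then edit."""
    # 1. depth at each variant position = names of the reads overlapping it
    covering = [
        [r["name"] for r in reads
         if r["start"] <= v["pos"] < r["start"] + len(r["seq"])]
        for v in variants
    ]
    # 2. dictionary {read name: index of the variant row}
    marked = {}
    for i, (v, names) in enumerate(zip(variants, covering)):
        n_alt = round(v["vaf"] * len(names))  # number of reads to mutate
        for name in rng.sample(names, n_alt):
            marked[name] = i
    # 3. iterate over all reads and modify the marked ones
    out = []
    for r in reads:
        seq = r["seq"]
        if r["name"] in marked:
            v = variants[marked[r["name"]]]
            offset = v["pos"] - r["start"]
            seq = seq[:offset] + v["alt"] + seq[offset + 1:]
        out.append({**r, "seq": seq})
    return out

reads = [{"name": f"r{i}", "start": 0, "seq": "A" * 10} for i in range(4)]
variants = [{"pos": 5, "alt": "T", "vaf": 0.5}]
edited = insert_snvs(reads, variants, random.Random(0))
print(sum(r["seq"][5] == "T" for r in edited), "of", len(edited), "reads mutated")
```

Keeping the marking (step 2) separate from the editing (step 3) means the BAM only has to be streamed once for the edit, whatever the number of variants.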
**** DONE VAF = 1
CLOSED: [2023-05-29 Mon 15:34]
**** DONE VAF from a normal distribution
CLOSED: [2023-05-29 Mon 15:35]
Truncated if > 1
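A minimal sketch of such a draw, in Python rather than Julia. The function name sample_vaf, the mean and the standard deviation are assumptions; "truncated" is implemented here as clipping (a draw above 1 becomes exactly 1), which may differ from what the package does, and the lower bound at 0 is an addition that the note does not mention.

```python
import random

def sample_vaf(mean, sd, rng):
    """Draw one VAF from N(mean, sd) and clamp it to [0, 1]."""
    vaf = rng.gauss(mean, sd)
    # The upper bound comes from the note ("truncated if > 1");
    # the lower bound at 0 is an assumption added here.
    return max(0.0, min(1.0, vaf))

rng = random.Random(42)  # fixed seed, for reproducibility only
print([sample_vaf(0.5, 0.2, rng) for _ in range(5)])
```

Clipping piles probability mass on exactly 1; redrawing until the value falls inside the interval would avoid that, at the cost of a loop.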
**** WAIT Unit tests
***** DONE NA12878: 1 gene on chromosome 22
CLOSED: [2023-05-30 Tue 23:55]
root = "https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/"
#+begin_src sh
samtools view project.NIST_NIST7035_H7AP8ADXX_NA12878.bwa.markDuplicates.bam  chr22 -o project.NIST_NIST7035_H7AP8ADXX_NA12878_chr22.bam
samtools view project.NIST_NIST7035_H7AP8ADXX_NA12878_chr22.bam chr22:19419700-19424000 -o NIST7035_H7AP8ADXX_NA12878_chr22_MRPL40_hg19.bam
#+end_src
***** WAIT FormatSpecimens pull request
https://github.com/BioJulia/FormatSpecimens.jl/pull/8
***** DONE FormatSpecimens
CLOSED: [2023-05-29 Mon 23:03]
****** DONE 1 read
CLOSED: [2023-05-29 Mon 23:02]
****** DONE VAF on 1 exon
CLOSED: [2023-05-29 Mon 23:03]
**** DONE [#A] Bug: many reads lost with NA12878
CLOSED: [2023-08-19 Sat 20:45] SCHEDULED: <2023-08-18 Fri>
:PROPERTIES:
:ID:       5c1c36f3-f68e-4e6d-a7b6-61dca89abc37
:END:
Ex: chrX:g.124056226: we go from 65 reads to 1
xamscissors test: no issue...
We test on this position +/- 200 bp
#+begin_src sh :dir /home/alex/roam/research/bisonex/code/sanger
samtools view   /home/alex/code/bisonex/out/2300346867_NA12878-63118093_S260-GRCh38/preprocessing/mapped/2300346867_NA12878-63118093_S260-GRCh38.bam chrX:124056026-124056426 -o chrXsmall.bam
#+end_src
#+RESULTS:
***** DONE Check depth with the latest version:
CLOSED: [2023-08-19 Sat 20:34] SCHEDULED: <2023-08-19 Sat>
****** DONE chr20: depth ok
SCHEDULED: <2023-08-19 Sat>
****** DONE all data
CLOSED: [2023-08-19 Sat 20:34] SCHEDULED: <2023-08-19 Sat>
OK for 7 variants (IGV), notably on chromosome X
*** TODO Implement indels with VAF :indel:
*** TODO Package submission
* Data
:PROPERTIES:
:CATEGORY: data
:END:
** DONE Replace BAM with fastq on the mesocentre
CLOSED: [2023-04-16 Sun 16:33]
Command

[3.5112]

[5.9436]

ub.com/NixOS/nixpkgs/issues/192396][Bug report Version 22.10.6]]
**** Notes
Error:
ERROR: Cannot download nextflow required file -- make sure you can connect to the internet
Alternatively you can try to download this file:
    https://www.nextflow.io/releases/v22.10.6/nextflow-22.10.6-all.jar
and save it as:
    .//nix/store/md2b1ah4d7ivj82k8xxap30dmdci00pa-nextflow-22.10.6/bin/.nextflow-wrapped
The update creates a virtual environment that breaks nextflow execution (it needs to download a file).
Fix: disable it
**** KILL Patch NXF_OFFLINE=true
CLOSED: [2023-07-02 Sun 11:02] SCHEDULED: <2023-06-11 Sun>
** WAIT [[https://github.com/NixOS/nixpkgs/pull/249329][Multiqc]]
HG002,sanger-chr20,data/HG002-sanger-inserted-chr20_1.fq.gz,data/HG002-sanger-inserted-chr20_2.fq.gz
** KILL Mutalyzer
CLOSED: [2023-08-16 Wed 19:07] SCHEDULED: <2023-08-13 Sun>
Packaging is feasible but requires many Python packages
** TODO Variant validator -> hgvs
It is just an interface around hgvs, but it requires
- postgresql
- access to the databases, or a local download of them
  Dependencies
  s: wcwidth, pyee, pure-eval, ptyprocess, pickleshare, parsley, parse, fake-useragent, executing, backcall, appdirs, zipp, websockets, w3lib, urllib3, traitlets, tqdm, tabulate, sqlparse, soupsieve, six, pygments, psycopg2, prompt-toolkit, pexpect, parso, lxml, idna, humanfriendly, decorator, cython, cssselect, configparser, charset-normalizer, certifi, attrs, requests, pysam, pyquery, matplotlib-inline, jedi, importlib-metadata, coloredlogs, beautifulsoup4, asttokens, yoyo-migrations, stack-data, pyppeteer, bs4, bioutils, requests-html, ipython, biocommons.seqrepo, hgvs
** TODO SPIP :spip:
*** DONE PR upstream
CLOSED: [2023-08-12 Sat 18:23] SCHEDULED: <2023-08-12 Sat 18:00>
*** DONE Email R. Lemann :T2T:
CLOSED: [2023-08-12 Sat 18:23] SCHEDULED: <2023-08-12 Sat 18:00>
*** KILL T2T update :T2T:
*** WAIT Fix PR
SCHEDULED: <2023-12-03 Sun>
** TODO VEP :vep:
*** DONE [[https://github.com/NixOS/nixpkgs/pull/185691][BioPerl]]
SCHEDULED: <2022-08-10 Wed>
/Entered on/ [2022-08-09 Tue 10:57]
PR submitted
*** TODO BioDBBigFile
:PROPERTIES:
:ORDERED:  t
:END:
/Entered on/ [2022-08-10 Wed 14:28]
We use the latest version of kent, so the problem is gone.
Ready to be merged. Rebase done <2023-07-02 Sun>
**** DONE Already-packaged kent version: force version 335
CLOSED: [2023-07-02 Sun 11:20]
***** KILL [[https://github.com/NixOS/nixpkgs/pull/206991][Restore building kent 404]]
CLOSED: [2023-05-06 Sat 17:40]
Review done <2023-03-26 Sun>, awaiting merge
Followed up <2023-05-06 Sat>
Kent 446 does not have this problem, so the PR is unnecessary
***** DONE [[https://github.com/NixOS/nixpkgs/pull/223411][Add the headers to the package]] (inc folder)
CLOSED: [2023-05-08 Mon 10:18] SCHEDULED: <2023-05-07 Sun>
Review to do
https://github.com/NixOS/nixpkgs/pull/223411
Fixed; the previous PR is no longer needed
***** KILL [[https://github.com/NixOS/nixpkgs/pull/186462][BioDBBigFile]] with these 2 changes
CLOSED: [2023-07-02 Sun 11:20]
**** KILL Already-packaged kent version: 404
CLOSED: [2023-03-27 Mon 16:43]
Compiles but the tests do not pass
**** DONE Modify according to PR https://github.com/NixOS/nixpkgs/pull/186462
CLOSED: [2023-07-30 Sun 22:01] SCHEDULED: <2023-07-30 Sun 20:00>
:LOGBOOK:
CLOCK: [2023-07-30 Sun 19:13]--[2023-07-30 Sun 20:50] =>  1:37
:END:
Changes needed for kent:
- no more patch
- remove a loop in postPatch
NIX_BUILD_TOP is also removed
**** WAIT Fix biobigfile PR
SCHEDULED: <2023-11-28 Tue>
/Entered on/ [2023-10-15 Sun 17:21]
*** DONE [[https://github.com/NixOS/nixpkgs/pull/186459][BioDBHTS]]
CLOSED: [2023-05-06 Sat 08:49] SCHEDULED: <2023-04-15 Sat>
/Entered on/ [2022-08-10 Wed 14:28]
Review fixes done <2022-10-10 Mon>
*** DONE [[https://github.com/NixOS/nixpkgs/pull/186464][BioExtAlign]]
CLOSED: [2022-10-22 Sat 12:43] SCHEDULED: <2022-08-10 Wed>
/Entered on/ [2022-08-10 Wed 14:28]
Review <2022-10-10 Mon>, fixed the same day.
Second round of fixes, waiting
The tests cannot be made to work because they cannot find the Bio::Tools::Align module, which lives in another folder of the repository. Even compiling the whole repository does not help... The tests are skipped.
*** TODO VEP
** WAIT [[https://github.com/NixOS/nixpkgs/pull/230394][rtg-tools]] :vcfeval:
Submitted
** WAIT Package Spip https://github.com/NixOS/nixpkgs/pull/247476
** TODO Happy :happy:
*** TODO PR python 3 upstream
SCHEDULED: <2023-12-02 Sat>
*** TODO nixpkgs as is
SCHEDULED: <2023-12-02 Sat>
** PROJ SpliceAI
** TODO Bamsurgeon
/Entered on/ [2023-05-13 Sat 19:11]
*** TODO Velvet
** TODO Picard PR with an option to manage memory
Similar to
https://github.com/bioconda/bioconda-recipes/blob/master/recipes/picard/picard.sh
* Julia :julia:
** KILL XAM.jl: PR for record modification :julia:
CLOSED: [2023-05-29 Mon 15:40] SCHEDULED: <2023-05-28 Sun>
/Entered on/ [2023-05-27 Sat 22:39]
** TODO XAMscissors.jl :xamscissors:
Modifies the sequence in the BAM.
*The CIGAR is not updated*
We convert to fastq and run the pipeline to "fix" it
#+begin_src sh
cd /home/alex/code/bisonex/out/63003856/preprocessing/mapped
samtools view 63003856_S135.bam NC_000022.11 -o 63003856_S135_chr22.bam
cd /home/alex/recherche/bisonex/code/BamScissors.jl
cp ~/code/bisonex/out/63003856/preprocessing/mapped/63003856_S135_chr22.bam .
# index the file that was just copied
samtools index 63003856_S135_chr22.bam
#+end_src
The script modifies the BAM, sorts it, and generates the fastq.
!!! Warning: do not forget the -n option !!!
#+begin_src sh
time julia --project=.. insertVariant.jl
scp 63003856_S135_chr22_{1,2}.fq.gz meso:/Work/Users/apraga/bisonex/tests/bamscissors/
#+end_src
*** WAIT Implement SNVs with VAF :snv:
Strategy:
1. compute the depth at the positions
2. build a dictionary { read name : dataframe position }
3. iterate over all reads and modify the marked ones
**** DONE VAF = 1
CLOSED: [2023-05-29 Mon 15:34]
**** DONE VAF from a normal distribution
CLOSED: [2023-05-29 Mon 15:35]
Truncated if > 1
**** WAIT Unit tests
***** DONE NA12878: 1 gene on chromosome 22
CLOSED: [2023-05-30 Tue 23:55]
root = "https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/"
#+begin_src sh
samtools view project.NIST_NIST7035_H7AP8ADXX_NA12878.bwa.markDuplicates.bam  chr22 -o project.NIST_NIST7035_H7AP8ADXX_NA12878_chr22.bam
samtools view project.NIST_NIST7035_H7AP8ADXX_NA12878_chr22.bam chr22:19419700-19424000 -o NIST7035_H7AP8ADXX_NA12878_chr22_MRPL40_hg19.bam
#+end_src
***** WAIT FormatSpecimens pull request
https://github.com/BioJulia/FormatSpecimens.jl/pull/8
***** DONE FormatSpecimens
CLOSED: [2023-05-29 Mon 23:03]
****** DONE 1 read
CLOSED: [2023-05-29 Mon 23:02]
****** DONE VAF on 1 exon
CLOSED: [2023-05-29 Mon 23:03]
**** DONE [#A] Bug: many reads lost with NA12878
CLOSED: [2023-08-19 Sat 20:45] SCHEDULED: <2023-08-18 Fri>
:PROPERTIES:
:ID:       5c1c36f3-f68e-4e6d-a7b6-61dca89abc37
:END:
Ex: chrX:g.124056226: we go from 65 reads to 1
xamscissors test: no issue...
We test on this position +/- 200 bp
#+begin_src sh :dir /home/alex/roam/research/bisonex/code/sanger
samtools view   /home/alex/code/bisonex/out/2300346867_NA12878-63118093_S260-GRCh38/preprocessing/mapped/2300346867_NA12878-63118093_S260-GRCh38.bam chrX:124056026-124056426 -o chrXsmall.bam
#+end_src
#+RESULTS:
***** DONE Check depth with the latest version:
CLOSED: [2023-08-19 Sat 20:34] SCHEDULED: <2023-08-19 Sat>
****** DONE chr20: depth ok
SCHEDULED: <2023-08-19 Sat>
****** DONE all data
CLOSED: [2023-08-19 Sat 20:34] SCHEDULED: <2023-08-19 Sat>
OK for 7 variants (IGV), notably on chromosome X
*** TODO Implement indels with VAF :indel:
*** TODO Package submission
* Data
:PROPERTIES:
:CATEGORY: data
:END:
** DONE Replace BAM with fastq on the mesocentre
CLOSED: [2023-04-16 Sun 16:33]
Command

Replacement in projects/bisonex.org at line 30 [4.35]

B:BD[5.17629] → [2.8607:16799]


*** DONE Literature review: aligner performance <(biblio aligneur)> <(aligneur)>
CLOSED: [2023-10-13 Fri 17:40] SCHEDULED: <2023-10-01 Sun>
*** DONE Figure: number of articles citing the main aligners, per year
CLOSED: [2023-10-11 Wed 23:54] SCHEDULED: <2023-10-03 Tue>
PubMed would have to be queried locally; otherwise it takes 10,000 requests per aligner!
*** DONE Figure: number of articles citing the main aligners
CLOSED: [2023-10-12 Thu 23:58] SCHEDULED: <2023-10-12 Thu>
PubMed would have to be queried locally; otherwise it takes 10,000 requests per aligner!
Based on
** Variant calling
*** DONE Literature review <(biblio appel variant)> <(appel variant)>
CLOSED: [2023-11-25 Sat 23:29] SCHEDULED: <2023-11-25 Sat 11:00>
*** TODO Finish literature review with comparisons [0/2]
SCHEDULED: <2023-11-26 Sun>
- [ ] [[file:~/research/bisonex/thesis/biblio.org::#Kumaran_2019][Performance assessment of variant calling pipelines using human whole exome sequencing and simulated data]]
- [ ] [[file:~/research/bisonex/thesis/biblio.org::*Comparaison de pipeline][Pipeline comparison]]
*** KILL Figure: number of publications per variant caller
CLOSED: [2023-11-25 Sat 19:00] SCHEDULED: <2023-11-07 Tue>
/Entered on/ [2023-09-19 Tue 08:43]
PubMed cannot be used because some are on arXiv
** TODO Figure: number of exomes per year
SCHEDULED: <2023-12-02 Sat>
/Entered on/ [2023-09-19 Tue 08:43]
* Tests :tests:
** KILL Non-regression: production version
CLOSED: [2023-05-23 Tue 08:46]
*** DONE ID common snp
CLOSED: [2022-11-19 Sat 21:36]
#+begin_src
$ wc -l ID_of_common_snp.txt
23194290 ID_of_common_snp.txt
$ wc -l /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp.txt
23194290 /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp.txt
#+end_src
*** DONE ID common snp not clinvar patho
CLOSED: [2022-12-11 Sun 20:11]
**** DONE Verify the problem
CLOSED: [2022-12-11 Sun 16:30]
On J:
21155134 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt.ref
"Non-regression" version
21155076 database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt
New version
23193391 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt
After removing duplicates
$ sort database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt | uniq > old.txt
$ wc -l old.txt
21107097 old.txt
$ sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt | uniq > new.txt
$ wc -l new.txt
21174578 new.txt
$ sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt.ref | uniq > ref.txt
$ wc -l ref.txt
21107155 ref.txt
Looking at the difference
$ comm -23 ref.txt old.txt
rs1052692
rs1057518973
rs1057518973
rs11074121
rs112848754
rs12573787
rs145033890
rs147889095
rs1553904159
rs1560294695
rs1560296615
rs1560310926
rs1560325547
rs1560342418
rs1560356225
rs1578287542
...
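The deduplication and comparison above can be mirrored in Python as a rough sketch. The helper names unique_sorted and only_in_first and the toy IDs are hypothetical; the two functions play the roles of `sort | uniq` and `comm -23` respectively.

```python
def unique_sorted(lines):
    """Rough equivalent of `sort | uniq`: drop duplicates, then sort."""
    return sorted(set(lines))

def only_in_first(first, second):
    """Rough equivalent of `comm -23 first second`: lines only in first."""
    second_set = set(second)
    return [line for line in first if line not in second_set]

# Toy data standing in for ref.txt and old.txt
ref = unique_sorted(["rs2", "rs1", "rs1", "rs3"])
old = unique_sorted(["rs2"])
print(only_in_first(ref, old))
```

Python sorts strings by code point, which corresponds to `sort` under the C locale; under another locale the order of the shell output may differ, but the set of differing IDs is the same.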
We look up the first one
$ bcftools query -i 'ID="rs1052692"' database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 1619351 C A,T
It is indeed pathogenic...
$ bcftools query -i 'POS=1619351' database/clinvar/clinvar.vcf.gz -f '%CHROM %POS %REF %ALT %INFO/CLNSIG\n'
19 1619351 C T Conflicting_interpretations_of_pathogenicity
We check all the others
$ comm -23 ref.txt old.txt > tocheck.txt
We generate the regions to check (chromosome number:position)
$ bcftools query -i 'ID=@tocheck.txt' database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM\t%POS\n' > tocheck.pos
We generate the inverse mapping (chromosome number -> NC)
$ awk ' { t = $1; $1 = $2; $2 = t; print; } ' database/RefSeq/refseq_to_number_only_consensual.txt  > mapping.txt
We remap clinvar
$ bcftools annotate --rename-chrs mapping.txt database/clinvar/clinvar.vcf.gz -o clinvar_remapped.vcf.gz
$ tabix clinvar_remapped.vcf.gz
Finally, we look up the classification in clinvar
$ bcftools query -R tocheck.pos clinvar_remapped.vcf.gz -f '%CHROM %POS %INFO/CLNSIG\n'
$ bcftools query -R tocheck.pos database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM %POS %ID \n' | grep '^NC'
#+RESULTS:
**** DONE Understand why the new version gives a different result
CLOSED: [2022-12-11 Sun 20:11]
***** DONE Same dbSNP and clinvar versions?
CLOSED: [2022-12-10 Sat 23:02]
Clinvar differs!
  $ bcftools stats clinvar.gz
  clinvar (Alexis)
SN	0	number of samples:	0
SN	0	number of records:	1492828
SN	0	number of no-ALTs:	965
SN	0	number of SNPs:	1338007
SN	0	number of MNPs:	5562
SN	0	number of indels:	144580
SN	0	number of others:	3714
SN	0	number of multiallelic sites:	0
SN	0	number of multiallelic SNP sites:	0
clinvar (new)
SN	0	number of samples:	0
SN	0	number of records:	1493470
SN	0	number of no-ALTs:	965
SN	0	number of SNPs:	1338561
SN	0	number of MNPs:	5565
SN	0	number of indels:	144663
SN	0	number of others:	3716
SN	0	number of multiallelic sites:	0
SN	0	number of multiallelic SNP sites:	0
***** DONE Update clinvar and dbSNP to work from the same databases
CLOSED: [2022-12-11 Sun 12:10]
The problem persists
***** DONE Remove the int conversion of the chromosome
CLOSED: [2022-12-10 Sat 19:29]
***** KILL Same NC?
CLOSED: [2022-12-10 Sat 19:29]
$  zgrep "contig=<ID=NC_\(.*\)" clinvar/GRCh38/clinvar.vcf.gz > contig.clinvar
$ diff contig.txt contig.clinvar
< ##contig=<ID=NC_012920.1>
***** DONE Test on chromosome 19: ok
CLOSED: [2022-12-11 Sun 13:53]
We prepare the data
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz -o dbSNP_common_19.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz -o clinvar_19.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data-alexis/dbSNP/dbSNP_common.vcf.gz -o dbSNP_common_19_old.vcf.gz
 bcftools filter -i 'CHROM="19"' /Work/Groups/bisonex/data-alexis/clinvar/clinvar.vcf.gz -o clinvar_19_old.vcf.gz
#+end_src
We retrieve both versions of the script
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
git checkout regression ../../script/pythonScript/clinvar_sbSNP.py
cp ../../script/pythonScript/clinvar_sbSNP.py clinvar_sbSNP_old.py
git checkout HEAD ../../script/pythonScript/clinvar_sbSNP.py
#+end_src
#+RESULTS:
We compare
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
python ../../script/pythonScript/clinvar_sbSNP.py --clinvar clinvar_19.vcf.gz --dbSNP dbSNP_common_19.vcf.gz --output tmp.txt
sort tmp.txt | uniq > new.txt
table=/Work/Groups/bisonex/data-alexis/RefSeq/refseq_to_number_only_consensual.txt
python clinvar_sbSNP_old.py --clinvar clinvar_19_old.vcf.gz --dbSNP dbSNP_common_19_old.vcf.gz --output tmp_old.txt --chrm_name_table $table
sort tmp_old.txt | uniq > old.txt
wc -l old.txt new.txt
#+end_src
#+RESULTS:
|  535155 | old.txt |
|  535194 | new.txt |
| 1070349 | total   |
Taking the first mismatch in new: ClinVar lists it as conflicting pathogenicity, so it should not be there...
$ bcftools query -i 'ID="rs10418277"' dbSNP_common_19.vcf.gz  -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 54939682 C G,T
$ bcftools query -i 'ID="rs10418277"' dbSNP_common_19_old.vcf.gz  -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 54939682 C G,T
$ bcftools query -i 'POS=54939682' clinvar_19.vcf.gz  -f '%POS %REF %ALT %INFO/CLNSIG\n'
54939682 C G Conflicting_interpretations_of_pathogenicity
54939682 C T Benign
$ bcftools query -i 'POS=54939682' clinvar_19_old.vcf.gz  -f '%POS %REF %ALT %INFO/CLNSIG\n'
54939682 C G Conflicting_interpretations_of_pathogenicity
54939682 C T Benign
$ grep rs10418277 *.txt
new.txt:rs10418277
tmp.txt:rs10418277
The problem came from POS no longer being converted to int (line deleted by mistake??)
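A minimal sketch of how a missing int conversion can produce this symptom. How clinvar_sbSNP.py actually stores and compares positions is not shown in these notes, so the data layout and the function common_not_patho below are assumptions made for illustration only.

```python
def common_not_patho(dbsnp_rows, clinvar_patho, convert_pos=True):
    """Keep the dbSNP IDs whose (chrom, pos) is not flagged in clinvar.

    dbsnp_rows: iterable of (chrom, pos_as_text, rsid), as read from a VCF line.
    clinvar_patho: set of (chrom, pos) tuples with pos stored as int.
    """
    kept = []
    for chrom, pos, rsid in dbsnp_rows:
        key = (chrom, int(pos)) if convert_pos else (chrom, pos)
        if key not in clinvar_patho:
            kept.append(rsid)
    return kept

clinvar_patho = {("NC_000019.10", 54939682)}
dbsnp_rows = [("NC_000019.10", "54939682", "rs10418277")]

# Without the conversion, "54939682" != 54939682, so the ID slips through
print(common_not_patho(dbsnp_rows, clinvar_patho, convert_pos=False))
# With the conversion, the ID is filtered out as expected
print(common_not_patho(dbsnp_rows, clinvar_patho, convert_pos=True))
```

Python raises no error when a string key is looked up in a set of integers; the lookup simply misses, which is why this kind of regression only shows up as a change in line counts.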
We verify
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-pr

[5.17629]

[2.16799]


*** DONE Literature review: aligner performance <(biblio aligneur)> <(aligneur)>
CLOSED: [2023-10-13 Fri 17:40] SCHEDULED: <2023-10-01 Sun>
*** DONE Figure: number of articles citing the main aligners, per year
CLOSED: [2023-10-11 Wed 23:54] SCHEDULED: <2023-10-03 Tue>
PubMed would have to be queried locally; otherwise it takes 10,000 requests per aligner!
*** DONE Figure: number of articles citing the main aligners
CLOSED: [2023-10-12 Thu 23:58] SCHEDULED: <2023-10-12 Thu>
PubMed would have to be queried locally; otherwise it takes 10,000 requests per aligner!
Based on
** Variant calling
*** DONE Literature review <(biblio appel variant)> <(appel variant)>
CLOSED: [2023-11-25 Sat 23:29] SCHEDULED: <2023-11-25 Sat 11:00>
*** TODO Finish literature review with comparisons [0/2]
SCHEDULED: <2023-11-26 Sun 13:00>
- [ ] [[file:~/research/bisonex/thesis/biblio.org::#Kumaran_2019][Performance assessment of variant calling pipelines using human whole exome sequencing and simulated data]]
- [ ] [[file:~/research/bisonex/thesis/biblio.org::*Comparaison de pipeline][Pipeline comparison]]
*** KILL Figure: number of publications per variant caller
CLOSED: [2023-11-25 Sat 19:00] SCHEDULED: <2023-11-07 Tue>
/Entered on/ [2023-09-19 Tue 08:43]
PubMed cannot be used because some are on arXiv
** TODO Figure: number of exomes per year
SCHEDULED: <2023-12-02 Sat>
/Entered on/ [2023-09-19 Tue 08:43]
* Tests :tests:
** KILL Non-regression: production version
CLOSED: [2023-05-23 Tue 08:46]
*** DONE ID common snp
CLOSED: [2022-11-19 Sat 21:36]
#+begin_src
$ wc -l ID_of_common_snp.txt
23194290 ID_of_common_snp.txt
$ wc -l /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp.txt
23194290 /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp.txt
#+end_src
*** DONE ID common snp not clinvar patho
CLOSED: [2022-12-11 Sun 20:11]
**** DONE Verify the problem
CLOSED: [2022-12-11 Sun 16:30]
On J:
21155134 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt.ref
"Non-regression" version
21155076 database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt
New version
23193391 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt
After removing duplicates
$ sort database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt | uniq > old.txt
$ wc -l old.txt
21107097 old.txt
$ sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt | uniq > new.txt
$ wc -l new.txt
21174578 new.txt
$ sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt.ref | uniq > ref.txt
$ wc -l ref.txt
21107155 ref.txt
Looking at the difference
$ comm -23 ref.txt old.txt
rs1052692
rs1057518973
rs1057518973
rs11074121
rs112848754
rs12573787
rs145033890
rs147889095
rs1553904159
rs1560294695
rs1560296615
rs1560310926
rs1560325547
rs1560342418
rs1560356225
rs1578287542
...
We look up the first one
$ bcftools query -i 'ID="rs1052692"' database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 1619351 C A,T
It is indeed pathogenic...
$ bcftools query -i 'POS=1619351' database/clinvar/clinvar.vcf.gz -f '%CHROM %POS %REF %ALT %INFO/CLNSIG\n'
19 1619351 C T Conflicting_interpretations_of_pathogenicity
We check all the others
$ comm -23 ref.txt old.txt > tocheck.txt
We generate the regions to check (chromosome number:position)
$ bcftools query -i 'ID=@tocheck.txt' database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM\t%POS\n' > tocheck.pos
We generate the inverse mapping (chromosome number -> NC)
$ awk ' { t = $1; $1 = $2; $2 = t; print; } ' database/RefSeq/refseq_to_number_only_consensual.txt  > mapping.txt
We remap clinvar
$ bcftools annotate --rename-chrs mapping.txt database/clinvar/clinvar.vcf.gz -o clinvar_remapped.vcf.gz
$ tabix clinvar_remapped.vcf.gz
Finally, we look up the classification in clinvar
$ bcftools query -R tocheck.pos clinvar_remapped.vcf.gz -f '%CHROM %POS %INFO/CLNSIG\n'
$ bcftools query -R tocheck.pos database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM %POS %ID \n' | grep '^NC'
#+RESULTS:
**** DONE Understand why the new version gives a different result
CLOSED: [2022-12-11 Sun 20:11]
***** DONE Same dbSNP and clinvar versions?
CLOSED: [2022-12-10 Sat 23:02]
Clinvar differs!
  $ bcftools stats clinvar.gz
  clinvar (Alexis)
SN	0	number of samples:	0
SN	0	number of records:	1492828
SN	0	number of no-ALTs:	965
SN	0	number of SNPs:	1338007
SN	0	number of MNPs:	5562
SN	0	number of indels:	144580
SN	0	number of others:	3714
SN	0	number of multiallelic sites:	0
SN	0	number of multiallelic SNP sites:	0
clinvar (new)
SN	0	number of samples:	0
SN	0	number of records:	1493470
SN	0	number of no-ALTs:	965
SN	0	number of SNPs:	1338561
SN	0	number of MNPs:	5565
SN	0	number of indels:	144663
SN	0	number of others:	3716
SN	0	number of multiallelic sites:	0
SN	0	number of multiallelic SNP sites:	0
***** DONE Update clinvar and dbSNP to work from the same databases
CLOSED: [2022-12-11 Sun 12:10]
The problem persists
***** DONE Remove the int conversion of the chromosome
CLOSED: [2022-12-10 Sat 19:29]
***** KILL Same NC?
CLOSED: [2022-12-10 Sat 19:29]
$  zgrep "contig=<ID=NC_\(.*\)" clinvar/GRCh38/clinvar.vcf.gz > contig.clinvar
$ diff contig.txt contig.clinvar
< ##contig=<ID=NC_012920.1>
***** DONE Test on chromosome 19: ok
CLOSED: [2022-12-11 Sun 13:53]
We prepare the data
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz -o dbSNP_common_19.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz -o clinvar_19.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data-alexis/dbSNP/dbSNP_common.vcf.gz -o dbSNP_common_19_old.vcf.gz
 bcftools filter -i 'CHROM="19"' /Work/Groups/bisonex/data-alexis/clinvar/clinvar.vcf.gz -o clinvar_19_old.vcf.gz
#+end_src
We retrieve both versions of the script
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
git checkout regression ../../script/pythonScript/clinvar_sbSNP.py
cp ../../script/pythonScript/clinvar_sbSNP.py clinvar_sbSNP_old.py
git checkout HEAD ../../script/pythonScript/clinvar_sbSNP.py
#+end_src
#+RESULTS:
We compare
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
python ../../script/pythonScript/clinvar_sbSNP.py --clinvar clinvar_19.vcf.gz --dbSNP dbSNP_common_19.vcf.gz --output tmp.txt
sort tmp.txt | uniq > new.txt
table=/Work/Groups/bisonex/data-alexis/RefSeq/refseq_to_number_only_consensual.txt
python clinvar_sbSNP_old.py --clinvar clinvar_19_old.vcf.gz --dbSNP dbSNP_common_19_old.vcf.gz --output tmp_old.txt --chrm_name_table $table
sort tmp_old.txt | uniq > old.txt
wc -l old.txt new.txt
#+end_src
#+RESULTS:
|  535155 | old.txt |
|  535194 | new.txt |
| 1070349 | total   |
Taking the first mismatch in new: ClinVar lists it as conflicting pathogenicity, so it should not be there...
$ bcftools query -i 'ID="rs10418277"' dbSNP_common_19.vcf.gz  -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 54939682 C G,T
$ bcftools query -i 'ID="rs10418277"' dbSNP_common_19_old.vcf.gz  -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 54939682 C G,T
$ bcftools query -i 'POS=54939682' clinvar_19.vcf.gz  -f '%POS %REF %ALT %INFO/CLNSIG\n'
54939682 C G Conflicting_interpretations_of_pathogenicity
54939682 C T Benign
$ bcftools query -i 'POS=54939682' clinvar_19_old.vcf.gz  -f '%POS %REF %ALT %INFO/CLNSIG\n'
54939682 C G Conflicting_interpretations_of_pathogenicity
54939682 C T Benign
$ grep rs10418277 *.txt
new.txt:rs10418277
tmp.txt:rs10418277
Le problème venait de la POS qui n'était plus convertie en int (suppression de la ligne par erreur ??)
We check
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-pr

Replacement in projects/bisonex.org at line 52 [4.35]

B:BD[6.19428] → [6.19428:23102]

B:BD[6.23102] → [7.14910:16415]

B:BD[7.16415] → [8.49940:57842]

B:BD[8.57842] → [9.60304:63607]

             | --showHidden false                                                                      |
| --QUIET false                                                                                                 | --QUIET false                                                                           |
| --use-jdk-deflater false                                                                                      | --use-jdk-deflater false                                                                |
| --use-jdk-inflater false                                                                                      | --use-jdk-inflater false                                                                |
| --gcs-max-retries 20                                                                                          | --gcs-max-retries 20                                                                    |
| --gcs-project-for-requester-pays                                                                              | --gcs-project-for-requester-pays                                                        |
| --disable-tool-default-read-filters false	PN:GATK ApplyBQSR                                                 | --disable-tool-default-read-filters false     PN:GATK ApplyBQSR                         |
****** KILL Check sha256sum
CLOSED: [2023-01-24 Tue 23:00]
alignment: different
****** KILL Compare bam
CLOSED: [2023-01-25 Wed 21:58]
/Work/Users/apraga/bisonex/script/files〉picard CompareSAMs LENIENT_LOW_MQ_ALIGNMENT=true LENIENT_DUP=true tmp_63003856_S135/63003856_S135.bam /Work/Groups/bisonex/ref/tmp_63003856_S135/63003856_S135.bam O=compare-bam.tsv
picard CompareSAMs -LENIENT_LOW_MQ_ALIGNMENT true -LENIENT_DUP true tmp_63003856_S135/63003856_S135.bam /Work/Groups/bisonex/ref/tmp_63003856_S135/63003856_S135.bam -O compare-bam.tsv
VN Program Record attribute differs.
File 1: 1.13
File 2: 1.10
SAM files differ.
[Tue Jan 24 23:12:50 CET 2023] picard.sam.CompareSAMs done. Elapsed time: 7.32 minutes.
***** DONE Rerun with the same samtools version
CLOSED: [2023-01-25 Wed 21:58]
No impact
***** KILL Compare output tsv
CLOSED: [2023-05-23 Tue 08:45]
***** KILL Look at where the differing variants are
CLOSED: [2023-05-23 Tue 08:45]
** TODO GIAB Validation :giab:
https://github.com/ga4gh/benchmarking-tools
Prerequisites:
- [[*hap.py][hap.py]]
- [[*NA12878][NA12878]]
*** TODO GIAB : exome :giab:
**** Notes
https://github.com/genome-in-a-bottle/giab_FAQ
**** Summary of results :resultats:
***** DONE HG001 :
CLOSED: [2023-04-06 Thu 21:41] SCHEDULED: <2023-04-02 Sun>
| Data    | Algorithm | Type    | Recall | Precision |
|---------+-----------+---------+--------+-----------|
| Bisonex | happy     | SNP     | 0.8552 |    0.9708 |
| Bisonex | vcfeval   | SNP     | 0.8547 |    0.9727 |
| Bisonex | happy     | INDEL   | 0.7105 |    0.6929 |
| Bisonex | vcfeval   | Non-SNP | 0.7139 |    0.7136 |
|---------+-----------+---------+--------+-----------|
| GIAB    | happy     | INDEL   | 0.7551 |    0.7415 |
| GIAB    | vcfeval   | INDEL   | 0.7598 |    0.7445 |
| GIAB    | happy     | SNP     | 0.8937 |    0.9621 |
| GIAB    | vcfeval   | SNP     | 0.8937 |    0.9621 |
***** DONE HG002, HG003, HG004
CLOSED: [2023-04-14 Fri 11:36] SCHEDULED: <2023-04-14 Fri>
Agilent capture
| Patient | Algorithm | Type  |   Recall | Precision |
| HG002   | happy     | INDEL | 0.851495 |  0.923616 |
| HG002   | happy     | SNP   | 0.905926 |  0.992158 |
| HG002   | vcfeval   | indel |   0.8523 |    0.9212 |
| HG002   | vcfeval   | snp   |   0.9054 |    0.9934 |
| HG003   | vcfeval   | indel |   0.8363 |    0.9115 |
| HG003   | vcfeval   | snp   |   0.9069 |    0.9928 |
| HG003   | happy     | INDEL | 0.838521 |  0.917296 |
| HG003   | happy     | SNP   | 0.907466 |  0.991204 |
| HG004   | happy     | INDEL | 0.856835 |  0.925086 |
| HG004   | happy     | SNP   | 0.905067 |  0.992704 |
| HG004   | vcfeval   | indel |   0.8568 |    0.9240 |
| HG004   | vcfeval   | snp   |   0.9048 |    0.9938 |
**** TODO Download data with Nextflow :hg38:
***** DONE Rename the chromosomes
CLOSED: [2023-02-17 Fri 19:30]
****** DONE NCBI reference genome
CLOSED: [2023-02-25 Sat 19:46]
****** DONE Bed with the exons
CLOSED: [2023-03-29 Wed 23:04]
****** DONE hg19
CLOSED: [2023-02-26 Sun 22:37]
****** DONE hg38
CLOSED: [2023-03-29 Wed 23:04]
- [X] Download hg19: ok
- [X] convert bed to interval list
picard BedToIntervalList -I exons_illumina.bed  -O exons_illumina.list -SD  ../../genome/GRCh19/genomeRef.dict
- [X] then to hg38
picard LiftOverIntervalList -I exons_illumina.list  -O exons_illumina_hg38.list --CHAIN hg19ToHg38.over.chain -SD  ../../genome/GRCh38.p13/genomeRef.dict
- [X] then back to bed
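The final step of the list above (interval list back to BED) has no command recorded. A minimal Python sketch of one way to do it, assuming the standard Picard interval_list layout: header lines starting with `@`, then tab-separated chrom, start, end, strand, name, with 1-based inclusive coordinates. The function name and the example record are hypothetical:

```python
def interval_list_to_bed(lines):
    """Convert Picard interval_list lines to BED lines (0-based, half-open)."""
    bed = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("@"):
            continue  # skip the SAM-style header and blank lines
        chrom, start, end, strand, name = line.split("\t")[:5]
        # interval_list start is 1-based inclusive; BED start is 0-based.
        bed.append("\t".join([chrom, str(int(start) - 1), end, name, "0", strand]))
    return bed


example = [
    "@HD\tVN:1.6\n",
    "@SQ\tSN:chr1\tLN:248956422\n",
    "chr1\t65510\t65625\t+\texon_1\n",
]
for bed_line in interval_list_to_bed(example):
    print(bed_line)
```

Only the start coordinate shifts by one; the end is unchanged because a 1-based inclusive end equals a 0-based exclusive end.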
***** KILL Reference VCF
CLOSED: [2023-04-16 Sun 16:32]
****** TODO NA12878 (HG001)
******* DONE Fastq HiSeq
CLOSED: [2023-02-25 Sat 19:46]
We take the HiSeq data, which is probably what Centogene uses:
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/
We use the "trimmed" data (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1069-7), i.e. with the fragments shorter than the read length removed.
Information:
- https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/Garvan_NA12878_HG001_HiSeq_Exome.README
- Sequencer: HiSeq2500
- kit: Nextera Rapid Capture Exome and Expanded Exome
There are 2 samples (NIST7035 and NIST7086), each on 2 lanes -> to be concatenated
NB: list of Illumina technologies https://www.illumina.com/systems/sequencing-platforms.html
HiSeq is more recent than NextSeq 550
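Gzip streams can be appended byte for byte into a valid multi-member gzip file, so merging the lanes of a sample does not require decompression. A minimal Python sketch; the function name is hypothetical and the self-check uses small temporary files standing in for two lanes:

```python
import gzip
import os
import shutil
import tempfile


def concat_gzip(inputs, output):
    """Append gzip files byte for byte; the result is a valid multi-member gzip file."""
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


# Small self-check on temporary files standing in for two lanes.
tmpdir = tempfile.mkdtemp()
lanes = []
for i, read in enumerate(["@r1\nACGT\n+\nIIII\n", "@r2\nTTGA\n+\nIIII\n"]):
    path = os.path.join(tmpdir, f"lane{i + 1}.fq.gz")
    with gzip.open(path, "wt") as handle:
        handle.write(read)
    lanes.append(path)

merged = os.path.join(tmpdir, "merged.fq.gz")
concat_gzip(lanes, merged)
with gzip.open(merged, "rt") as handle:
    merged_text = handle.read()
print(merged_text)
```

For paired-end data, the R1 files and the R2 files must be concatenated in the same lane order so that read pairs stay in sync.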
******* TODO Fastq HiSeq without trimming
******* DONE Capture: Exons (bed)
CLOSED: [2023-02-25 Sat 19:46]
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/nexterarapidcapture_expandedexome_targetedregions.bed.gz
******* DONE Bed, vcf
CLOSED: [2023-02-24 Fri 23:45]
****** DONE Ashkenazi trio HG002, HG003, HG004
CLOSED: [2023-04-06 Thu 21:43] SCHEDULED: <2023-04-01 Sat>
****** KILL Chinese trio HG005, 6, 7
CLOSED: [2023-04-16 Sun 16:32]
***** KILL Fastq :fastq:
CLOSED: [2023-04-16 Sun 16:32]
****** DONE NA12878 (HG001)
CLOSED: [2023-02-25 Sat 19:46]
******* DONE Fastq HiSeq
CLOSED: [2023-02-25 Sat 19:46]
We take the HiSeq data, which is probably what Centogene uses:
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/
We use the "trimmed" data (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1069-7), i.e. with the fragments shorter than the read length removed.
Information:
- https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/Garvan_NA12878_HG001_HiSeq_Exome.README
- Sequencer: HiSeq2500
- kit: Nextera Rapid Capture Exome and Expanded Exome
There are 2 samples (NIST7035 and NIST7086), each on 2 lanes -> to be concatenated
NB: list of Illumina technologies https://www.illumina.com/systems/sequencing-platforms.html
HiSeq is more recent than NextSeq 550
******* DONE Capture: Exons (bed)
CLOSED: [2023-02-25 Sat 19:46]
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/nexterarapidcapture_expandedexome_targetedregions.bed.gz
****** DONE Ashkenazi trio HG002, HG003, HG004
CLOSED: [2023-04-15 Sat 23:24] SCHEDULED: <2023-04-05 Wed>
******* DONE Capture
CLOSED: [2023-04-15 Sat 23:24]
https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/OsloUniversityHospital_Exome_GATK_jointVC_11242015/wex_Agilent_SureSelect_v05_b37.baits.slop50.merged.list
******* DONE Capture Agilent
CLOSED: [2023-04-15 Sat 23:24]
******* DONE Bam from the fastq
CLOSED: [2023-04-15 Sat 23:24]
Bam + index + checksum
https://raw.githubusercontent.com/genome-in-a-bottle/giab_data_indexes/master/AshkenazimTrio/alignment.index.AJtrio_OsloUniversityHospital_IlluminaExome_bwamem_GRCh37_11252015
****** KILL Chinese trio
CLOSED: [2023-04-16 Sun 16:32]
Whole exome for HG005 only
******* KILL HG005
CLOSED: [2023-04-16 Sun 16:32]
https://raw.githubusercontent.com/genome-in-a-bottle/giab_data_indexes/master/ChineseTrio/alignment.index.Chinesetrio_HG005_OsloUniversityHospital_IlluminaExome_bwamem_GRCh37_11252015
**** DONE Download FASTQ directly with aws (via SRA)
CLOSED: [2023-06-30 Fri 22:30] SCHEDULED: <2023-06-27 Tue>
***** Remarks
Accession number: https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-022-08365-3/tables/1
Fastq available via SRA. With AWS, the fastq can be accessed directly.
(Otherwise SRA -> Fastq has to be converted with the toolkit: complicated to configure)
Example: https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR2962669&display=data-access
Advantages:
- no BAM -> FASTQ conversion
- capture details
- capture in hg38 on the manufacturer's site!!
- similar capture for the Ashkenazi trio
Drawbacks:
- NA12878: discrepancy in the number of base pairs: NA12878 = 49G (so 24G of fastq)
- capture not available online (Agilent site)
- SRA format (the link to the fastq is not free): use HTTP or their toolkit (downloads in SRA format, then converts to fastq). Example: to get 2 fastq
  fastq-dump --split-files --gzip SRR2962669
***** List of runs:
https://www.ncbi.nlm.nih.gov/sra
Search by patient number. Several Illumina sequencers are available
- NovaSeq 6000 TruSeq capture SRX11061536
- NovaSeq 6000 IDT capture SRX11061526
- NovaSeq 6000 Agilent SureSelect v7 capture SRX11061516
- HiSeq 4000 TruSeq capture SRX11061506
- HiSeq 4000 IDT capture SRX11061496
- HiSeq 4000 Agilent SureSelect v7 capture SRX11061486
Note: SRX = experiment, SRR = run
Important:
- do not compress the output with fastq-dump directly (very slow)
- fasterq-dump is faster

Note: TruSeq not available?
hg19: https://www.biostars.org/p/144554/
IDT: which one?
https://www.idtdna.com/pages/products/next-generation-sequencing/workflow/xgen-ngs-hybridization-capture/pre-designed-hyb-cap-panels/exome-hyb-panel-v2
***** DONE HiSeq 4000 + agilent sureselect :sra:
CLOSED: [2023-06-28 Wed 22:06] SCHEDULED: <2023-06-28 Wed>
- [ ] HG001 with Illumina HiSeq 4000 Agilent SureSelect v7 capture SRX11061486 SRR14724513
- [ ] HG002 with Illumina HiSeq 4000 Agilent SureSelect v7 capture SRX11061487 SRR14724512
- [ ] HG003 with Illumina HiSeq 4000 Agilent SureSelect v7 capture SRX11061488 SRR14724511
- [ ] HG004 with Illumina HiSeq 4000 Agilent SureSelect v7 capture SRX11061489 SRR14724510

Other
- HG005 with Illumina HiSeq 4000 Agilent SureSelect v7 capture SRX11061491 SRR14724508
- HG006 with Illumina HiSeq 4000 Agilent SureSelect v7 capture SRX11061492 SRR14724507
- HG007 with Illumina HiSeq 4000 Agilent SureSelect v7 capture SRX11061493 SRR14724506
******* DONE Capture agilent sureselect
CLOSED: [2023-06-30 Fri 22:30] SCHEDULED: <2023-06-28 Wed>
**** TODO Lift T2T :T2T:
#+begin_quote
We performed liftover using the GATK release 4.1.9 LiftoverVcf (Picard Version 2.23.3) tool with the default parameters. This successfully lifts over variants that map exactly from GRCh38 to T2T-CHM13v2.0 but does not recover variants with swapped reference and alternative alleles. To recover variants with swapped reference/alternative alleles, we ran LiftoverVCF again, with the RECOVER_SWAPPED_REF_ALT flag. Notably, this feature does not recover multiallelic variants, so to recover these variants, we first separated them into multiple biallelic variants, performed liftover using the RECOVER_SWAPPED_REF_ALT tag, and converted them back to their multiallelic representations.
#+end_quote
***** KILL LiftoverVcf with default values
CLOSED: [2023-07-02 Sun 23:09] SCHEDULED: <2023-06-30 Fri>
HG002: half of the values are missing
hg001
[apraga@mesointeractive b946d0e6bc8d0f220eb1ad1649c20d]$ less HG004_GRCh38_1_22_v4.2.1_benchmark.vcf.lifted.vcf.gz
[apraga@mesointeractive b946d0e6bc8d0f220eb1ad1649c20d]$ zgrep -c '^chr' HG004_GRCh38_1_22_v4.2.1_benchmark.vcf.lifted.vcf.gz
2168972
[apraga@mesointeractive b946d0e6bc8d0f220eb1ad1649c20d]$ zgrep -c '^chr' HG004_GRCh38_1_22_v4.2.1_benchmark.vcf.unlifted.vcf.gz
1862374
[apraga@mesointeractive b946d0e6bc8d0f220eb1ad1649c20d]$ zgrep -c '^chr' HG004_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
4031346
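The three counts above are consistent with each other: lifted plus unlifted records add up to the input. A quick Python check of the proportions, using only the numbers reported by `zgrep -c` for the HG004 files:

```python
lifted = 2168972
unlifted = 1862374
total = 4031346

# Every input record ends up in exactly one of the two output files.
assert lifted + unlifted == total

lifted_pct = 100 * lifted / total
unlifted_pct = 100 * unlifted / total
print(f"lifted: {lifted_pct:.1f}%, rejected: {unlifted_pct:.1f}%")
```

Roughly 46% of the benchmark records are rejected with the default parameters, which matches the "half of the values" observation.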
***** DONE liftover bed
CLOSED: [2023-07-02 Sun 23:09] SCHEDULED: <2023-06-30 Fri>
792 of 217488 intervals failed (0.364158%) to liftover, encompassing 219109 of 35718732 bases (0.613429%).
 wc -l capture/Agilent_SureSelect_All_Exons_v7_hg38_Regions.bed
217488 capture/Agilent_SureSelect_All_Exons_v7_hg38_Regions.bed
wc -l work/e4/9981dc539a2373c2beeaa0affc3497/Agilent_SureSelect_All_Exons_v7_hg38_Regions_hg38.interval_list
So we lost roughly 1000 regions (792 intervals according to the log above)
***** DONE LiftoverVcf with swapped reference/alternative variants?
CLOSED: [2023-07-02 Sun 23:09]
**** DONE NA12878 :na12878:hg38:
CLOSED: [2023-06-30 Fri 22:30]
***** DONE Discussion alexis : Mail
CLOSED: [2023-03-29 Wed 22:40]
With patient NA12878 and a hap.py comparison against the Genome In A Bottle VCF ("gold" standard), as a reminder we had
- sensitivity (= recall) 71% for indels, 85% for SNPs
- precision (= PPV) 69% and 97% respectively
| Type  | TRUTH |    TP |   FN | QUERY |   FP |  UNK | FP.gt | FP.al |   Recall | Precision |
| INDEL |  4871 |  3461 | 1410 |  7048 | 1554 | 1987 |   193 |   346 | 0.710532 |  0.692946 |
| SNP   | 46032 | 39369 | 6663 | 44600 | 1186 | 4041 |   304 |    30 | 0.855253 |  0.970759 |
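The Recall and Precision columns can be recomputed from the counts. A minimal Python sketch, assuming hap.py's definitions: recall = TP / (TP + FN) on the truth side, and precision on the query side, where the query true positives are taken as QUERY minus UNK minus FP. The function names are hypothetical:

```python
def recall(tp, fn):
    """Fraction of truth variants that were found."""
    return tp / (tp + fn)


def precision(query_total, unk, fp):
    """Fraction of assessed query calls that are correct (UNK calls are excluded)."""
    query_tp = query_total - unk - fp
    return query_tp / (query_tp + fp)


rows = {
    "INDEL": {"tp": 3461, "fn": 1410, "query": 7048, "fp": 1554, "unk": 1987},
    "SNP": {"tp": 39369, "fn": 6663, "query": 44600, "fp": 1186, "unk": 4041},
}
for name, r in rows.items():
    rec = recall(r["tp"], r["fn"])
    prec = precision(r["query"], r["unk"], r["fp"])
    print(f"{name}: recall={rec:.6f} precision={prec:.6f}")
```

Calls outside the benchmark confident regions (UNK) are not counted against precision, which is why precision is not simply TP / QUERY.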
Statistics on whole genomes are much better (cf. precisionFDA challenge).
For exomes, a paper [1] reports better statistics on this patient with BWA and GATK, but they have fewer variants (we have almost a factor of 2!).
I suspect we are not working on the same capture regions (I could not get hold of their .bed)
| Exome | Type  |    TP |   FP |  FN | Sensitivity | Precision | F-Score |   FDR |
|     1 | SNV   | 23689 | 1397 | 613 |       0.975 |     0.944 |   0.959 | 0.057 |
|     2 | SNV   | 23946 |  865 | 356 |       0.985 |     0.965 |   0.975 | 0.036 |
|     1 | indel |  1254 |   72 |  75 |       0.944 |     0.946 |   0.945 | 0.054 |
|     2 | indel |  1309 |   10 |  20 |       0.985 |     0.992 |   0.989 | 0.008 |
To try to improve the statistics:
- The genome version GRCh38 vs GRCh38.p13 changes almost nothing
- Disabling dbSNP changes strictly nothing for variant calling
I explored the false negatives:
- the vast majority are simply not seen (it is not a haploid/genotype issue)
- the distribution across chromosomes is fairly even, except on chromosome 6 ()
- most are in 5' and 3' UTRs (according to Best RefSeq)
Conclusion: I plan to stop here for the variant calling validation, for lack of time. It would be worth digging into why some variants are not seen by GATK, but they are not the majority. In any case, I can justify a first analysis for the thesis.
Does that work for you?
[1]
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2928-9
Results here https://static-content.springer.com/esm/art%3A10.1186%2Fs12859-019-2928-9/MediaObjects/12859_2019_2928_MOESM8_ESM.pdf
***** DONE Comparison
CLOSED: [2023-03-04 Sat 11:14]
HGREF=/Work/Groups/bisonex/data-alexis-reference/genome/GRCh38_latest_genomic.fna ./result/bin/hap.py /Work/Groups/bisonex/NA12878/HG001_GRCh38_1_22_v4.2.1_benchmark_renamed.vcf.gz script/files/vcf/NA12878_NIST7035_vep_annot.vcf -f /Work/Groups/bisonex/NA12878/HG001_GRCh38_1_22_v4.2.1_benchmark.bed -o test
na1878.slurm
#+begin_src slurm
#!/bin/bash
#SBATCH -c 4
#SBATCH -p smp
#SBATCH --time=01:00:00
#SBATCH --mem=32G
module load nix/2.11.0
export HGREF=/Work/Groups/bisonex/data-alexis-reference/genome/GRCh38_latest_genomic.fna
dir=/Work/Groups/bisonex/data/NA12878/GRCh38
hap.py ${di

[6.19428]

[9.63607]

             | --showHidden false                                                                      |
| --QUIET false                                                                                                 | --QUIET false                                                                           |
| --use-jdk-deflater false                                                                                      | --use-jdk-deflater false                                                                |
| --use-jdk-inflater false                                                                                      | --use-jdk-inflater false                                                                |
| --gcs-max-retries 20                                                                                          | --gcs-max-retries 20                                                                    |
| --gcs-project-for-requester-pays                                                                              | --gcs-project-for-requester-pays                                                        |
| --disable-tool-default-read-filters false	PN:GATK ApplyBQSR                                                 | --disable-tool-default-read-filters false     PN:GATK ApplyBQSR                         |
****** KILL Check sha256sum
CLOSED: [2023-01-24 Tue 23:00]
alignment: different
****** KILL Compare bam
CLOSED: [2023-01-25 Wed 21:58]
/Work/Users/apraga/bisonex/script/files〉picard CompareSAMs LENIENT_LOW_MQ_ALIGNMENT=true LENIENT_DUP=true tmp_63003856_S135/63003856_S135.bam /Work/Groups/bisonex/ref/tmp_63003856_S135/63003856_S135.bam O=compare-bam.tsv
picard CompareSAMs -LENIENT_LOW_MQ_ALIGNMENT true -LENIENT_DUP true tmp_63003856_S135/63003856_S135.bam /Work/Groups/bisonex/ref/tmp_63003856_S135/63003856_S135.bam -O compare-bam.tsv
VN Program Record attribute differs.
File 1: 1.13
File 2: 1.10
SAM files differ.
[Tue Jan 24 23:12:50 CET 2023] picard.sam.CompareSAMs done. Elapsed time: 7.32 minutes.
***** DONE Rerun with the same samtools version
CLOSED: [2023-01-25 Wed 21:58]
No impact
***** KILL Compare output tsv
CLOSED: [2023-05-23 Tue 08:45]
***** KILL Look at where the differing variants are
CLOSED: [2023-05-23 Tue 08:45]
** TODO GIAB Validation :giab:
https://github.com/ga4gh/benchmarking-tools
Prerequisites:
- [[*hap.py][hap.py]]
- [[*NA12878][NA12878]]
*** TODO GIAB : exome :giab:
**** Notes
https://github.com/genome-in-a-bottle/giab_FAQ
**** Summary of results :resultats:
***** DONE HG001 :
CLOSED: [2023-04-06 Thu 21:41] SCHEDULED: <2023-04-02 Sun>
| Data    | Algorithm | Type    | Recall | Precision |
|---------+-----------+---------+--------+-----------|
| Bisonex | happy     | SNP     | 0.8552 |    0.9708 |
| Bisonex | vcfeval   | SNP     | 0.8547 |    0.9727 |
| Bisonex | happy     | INDEL   | 0.7105 |    0.6929 |
| Bisonex | vcfeval   | Non-SNP | 0.7139 |    0.7136 |
|---------+-----------+---------+--------+-----------|
| GIAB    | happy     | INDEL   | 0.7551 |    0.7415 |
| GIAB    | vcfeval   | INDEL   | 0.7598 |    0.7445 |
| GIAB    | happy     | SNP     | 0.8937 |    0.9621 |
| GIAB    | vcfeval   | SNP     | 0.8937 |    0.9621 |
***** DONE HG002, HG003, HG004
CLOSED: [2023-04-14 Fri 11:36] SCHEDULED: <2023-04-14 Fri>
Agilent capture
| Patient | Algorithm | Type  |   Recall | Precision |
| HG002   | happy     | INDEL | 0.851495 |  0.923616 |
| HG002   | happy     | SNP   | 0.905926 |  0.992158 |
| HG002   | vcfeval   | indel |   0.8523 |    0.9212 |
| HG002   | vcfeval   | snp   |   0.9054 |    0.9934 |
| HG003   | vcfeval   | indel |   0.8363 |    0.9115 |
| HG003   | vcfeval   | snp   |   0.9069 |    0.9928 |
| HG003   | happy     | INDEL | 0.838521 |  0.917296 |
| HG003   | happy     | SNP   | 0.907466 |  0.991204 |
| HG004   | happy     | INDEL | 0.856835 |  0.925086 |
| HG004   | happy     | SNP   | 0.905067 |  0.992704 |
| HG004   | vcfeval   | indel |   0.8568 |    0.9240 |
| HG004   | vcfeval   | snp   |   0.9048 |    0.9938 |
**** DONE Download data with Nextflow :hg38:
CLOSED: [2023-11-26 Sun 12:30]
***** DONE Rename the chromosomes
CLOSED: [2023-02-17 Fri 19:30]
****** DONE NCBI reference genome
CLOSED: [2023-02-25 Sat 19:46]
****** DONE Bed with the exons
CLOSED: [2023-03-29 Wed 23:04]
****** DONE hg19
CLOSED: [2023-02-26 Sun 22:37]
****** DONE hg38
CLOSED: [2023-03-29 Wed 23:04]
- [X] Download hg19: ok
- [X] convert bed to interval list
picard BedToIntervalList -I exons_illumina.bed  -O exons_illumina.list -SD  ../../genome/GRCh19/genomeRef.dict
- [X] then to hg38
picard LiftOverIntervalList -I exons_illumina.list  -O exons_illumina_hg38.list --CHAIN hg19ToHg38.over.chain -SD  ../../genome/GRCh38.p13/genomeRef.dict
- [X] then back to bed
***** KILL Reference VCF
CLOSED: [2023-04-16 Sun 16:32]
****** DONE NA12878 (HG001)
CLOSED: [2023-11-26 Sun 12:29]
******* DONE Fastq HiSeq
CLOSED: [2023-02-25 Sat 19:46]
We take the HiSeq data, which is probably what Centogene uses:
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/
We use the "trimmed" data (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1069-7), i.e. with the fragments shorter than the read length removed.
Information:
- https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/Garvan_NA12878_HG001_HiSeq_Exome.README
- Sequencer: HiSeq2500
- kit: Nextera Rapid Capture Exome and Expanded Exome
There are 2 samples (NIST7035 and NIST7086), each on 2 lanes -> to be concatenated
NB: list of Illumina technologies https://www.illumina.com/systems/sequencing-platforms.html
HiSeq is more recent than NextSeq 550
******* KILL Fastq HiSeq without trimming
CLOSED: [2023-11-26 Sun 12:29]
******* DONE Capture: Exons (bed)
CLOSED: [2023-02-25 Sat 19:46]
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/nexterarapidcapture_expandedexome_targetedregions.bed.gz
******* DONE Bed, vcf
CLOSED: [2023-02-24 Fri 23:45]
****** DONE Ashkenazi trio HG002, HG003, HG004
CLOSED: [2023-04-06 Thu 21:43] SCHEDULED: <2023-04-01 Sat>
****** KILL Chinese trio HG005, 6, 7
CLOSED: [2023-04-16 Sun 16:32]
***** KILL Fastq :fastq:
CLOSED: [2023-04-16 Sun 16:32]
****** DONE NA12878 (HG001)
CLOSED: [2023-02-25 Sat 19:46]
******* DONE Fastq HiSeq
CLOSED: [2023-02-25 Sat 19:46]
We take the HiSeq data, which is probably what Centogene uses:
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/
We use the "trimmed" data (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1069-7), i.e. with the fragments shorter than the read length removed.
Information:
- https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/Garvan_NA12878_HG001_HiSeq_Exome.README
- Sequencer: HiSeq2500
- kit: Nextera Rapid Capture Exome and Expanded Exome
There are 2 samples (NIST7035 and NIST7086), each on 2 lanes -> to be concatenated
NB: list of Illumina technologies https://www.illumina.com/systems/sequencing-platforms.html
HiSeq is more recent than NextSeq 550
******* DONE Capture: Exons (bed)
CLOSED: [2023-02-25 Sat 19:46]
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/nexterarapidcapture_expandedexome_targetedregions.bed.gz
****** DONE Ashkenazi trio HG002, HG003, HG004
CLOSED: [2023-04-15 Sat 23:24] SCHEDULED: <2023-04-05 Wed>
******* DONE Capture
CLOSED: [2023-04-15 Sat 23:24]
https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/OsloUniversityHospital_Exome_GATK_jointVC_11242015/wex_Agilent_SureSelect_v05_b37.baits.slop50.merged.list
******* DONE Capture Agilent
CLOSED: [2023-04-15 Sat 23:24]
******* DONE Bam from the fastq
CLOSED: [2023-04-15 Sat 23:24]
Bam + index + checksum
https://raw.githubusercontent.com/genome-in-a-bottle/giab_data_indexes/master/AshkenazimTrio/alignment.index.AJtrio_OsloUniversityHospital_IlluminaExome_bwamem_GRCh37_11252015
****** KILL Chinese trio
CLOSED: [2023-04-16 Sun 16:32]
Whole exome for HG005 only
******* KILL HG005
CLOSED: [2023-04-16 Sun 16:32]
https://raw.githubusercontent.com/genome-in-a-bottle/giab_data_indexes/master/ChineseTrio/alignment.index.Chinesetrio_HG005_OsloUniversityHospital_IlluminaExome_bwamem_GRCh37_11252015
**** DONE Download FASTQ directly with aws (via SRA)
CLOSED: [2023-06-30 Fri 22:30] SCHEDULED: <2023-06-27 Tue>
***** Remarks
Accession number: https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-022-08365-3/tables/1
Fastq available via SRA. With AWS, the fastq can be accessed directly.
(Otherwise SRA -> Fastq has to be converted with the toolkit: complicated to configure)
Example: https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR2962669&display=data-access
Advantages:
- no BAM -> FASTQ conversion
- capture details
- capture in hg38 on the manufacturer's site!!
- similar capture for the Ashkenazi trio
Drawbacks:
- NA12878: discrepancy in the number of base pairs: NA12878 = 49G (so 24G of fastq)
- capture not available online (Agilent site)
- SRA format (the link to the fastq is not free): use HTTP or their toolkit (downloads in SRA format, then converts to fastq). Example: to get 2 fastq
  fastq-dump --split-files --gzip SRR2962669
***** List of runs:
https://www.ncbi.nlm.nih.gov/sra
Search by patient number. Several Illumina sequencers are available
- NovaSeq 6000 TruSeq capture SRX11061536
- NovaSeq 6000 IDT capture SRX11061526
- NovaSeq 6000 Agilent SureSelect v7 capture SRX11061516
- HiSeq 4000 TruSeq capture SRX11061506
- HiSeq 4000 IDT capture SRX11061496
- HiSeq 4000 Agilent SureSelect v7 capture SRX11061486
Note: SRX = experiment, SRR = run
Important:
- do not compress the output with fastq-dump directly (very slow)
- fasterq-dump is faster

Note: TruSeq not available?
hg19: https://www.biostars.org/p/144554/
IDT: which one?
https://www.idtdna.com/pages/products/next-generation-sequencing/workflow/xgen-ngs-hybridization-capture/pre-designed-hyb-cap-panels/exome-hyb-panel-v2
***** DONE HiSeq 4000 + agilent sureselect :sra:
CLOSED: [2023-06-28 Wed 22:06] SCHEDULED: <2023-06-28 Wed>
- [ ] HG001 with Illumina HiSeq 4000 Agilent SureSelect v7 capture SRX11061486 SRR14724513
- [ ] HG002 with Illumina HiSeq 4000 Agilent SureSelect v7 capture SRX11061487 SRR14724512
- [ ] HG003 with Illumina HiSeq 4000 Agilent SureSelect v7 capture SRX11061488 SRR14724511
- [ ] HG004 with Illumina HiSeq 4000 Agilent SureSelect v7 capture SRX11061489 SRR14724510

Other
- HG005 with Illumina HiSeq 4000 Agilent SureSelect v7 capture SRX11061491 SRR14724508
- HG006 with Illumina HiSeq 4000 Agilent SureSelect v7 capture SRX11061492 SRR14724507
- HG007 with Illumina HiSeq 4000 Agilent SureSelect v7 capture SRX11061493 SRR14724506
******* DONE Capture agilent sureselect
CLOSED: [2023-06-30 Fri 22:30] SCHEDULED: <2023-06-28 Wed>
**** KILL Lift T2T :T2T:
CLOSED: [2023-11-26 Sun 12:30]
#+begin_quote
We performed liftover using the GATK release 4.1.9 LiftoverVcf (Picard Version 2.23.3) tool with the default parameters. This successfully lifts over variants that map exactly from GRCh38 to T2T-CHM13v2.0 but does not recover variants with swapped reference and alternative alleles. To recover variants with swapped reference/alternative alleles, we ran LiftoverVCF again, with the RECOVER_SWAPPED_REF_ALT flag. Notably, this feature does not recover multiallelic variants, so to recover these variants, we first separated them into multiple biallelic variants, performed liftover using the RECOVER_SWAPPED_REF_ALT tag, and converted them back to their multiallelic representations.
#+end_quote
***** KILL LiftoverVcf with default values
CLOSED: [2023-07-02 Sun 23:09] SCHEDULED: <2023-06-30 Fri>
HG002: half of the values are missing
hg001
[apraga@mesointeractive b946d0e6bc8d0f220eb1ad1649c20d]$ less HG004_GRCh38_1_22_v4.2.1_benchmark.vcf.lifted.vcf.gz
[apraga@mesointeractive b946d0e6bc8d0f220eb1ad1649c20d]$ zgrep -c '^chr' HG004_GRCh38_1_22_v4.2.1_benchmark.vcf.lifted.vcf.gz
2168972
[apraga@mesointeractive b946d0e6bc8d0f220eb1ad1649c20d]$ zgrep -c '^chr' HG004_GRCh38_1_22_v4.2.1_benchmark.vcf.unlifted.vcf.gz
1862374
[apraga@mesointeractive b946d0e6bc8d0f220eb1ad1649c20d]$ zgrep -c '^chr' HG004_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
4031346
***** DONE liftover bed
CLOSED: [2023-07-02 Sun 23:09] SCHEDULED: <2023-06-30 Fri>
792 of 217488 intervals failed (0.364158%) to liftover, encompassing 219109 of 35718732 bases (0.613429%).
 wc -l capture/Agilent_SureSelect_All_Exons_v7_hg38_Regions.bed
217488 capture/Agilent_SureSelect_All_Exons_v7_hg38_Regions.bed
wc -l work/e4/9981dc539a2373c2beeaa0affc3497/Agilent_SureSelect_All_Exons_v7_hg38_Regions_hg38.interval_list
So we lost roughly 1000 regions (792 intervals according to the log above)
***** DONE LiftoverVcf with swapped reference/alternative variants?
CLOSED: [2023-07-02 Sun 23:09]
**** DONE NA12878 :na12878:hg38:
CLOSED: [2023-06-30 Fri 22:30]
***** DONE Discussion alexis : Mail
CLOSED: [2023-03-29 Wed 22:40]
With patient NA12878 and a hap.py comparison against the Genome In A Bottle VCF ("gold" standard), as a reminder we had
- sensitivity (= recall) 71% for indels, 85% for SNPs
- precision (= PPV) 69% and 97% respectively
| Type  | TRUTH |    TP |   FN | QUERY |   FP |  UNK | FP.gt | FP.al |   Recall | Precision |
| INDEL |  4871 |  3461 | 1410 |  7048 | 1554 | 1987 |   193 |   346 | 0.710532 |  0.692946 |
| SNP   | 46032 | 39369 | 6663 | 44600 | 1186 | 4041 |   304 |    30 | 0.855253 |  0.970759 |
Statistics on whole genomes are much better (cf. precisionFDA challenge).
For exomes, a paper [1] reports better statistics on this patient with BWA and GATK, but they have fewer variants (we have almost a factor of 2!).
I suspect we are not working on the same capture regions (I could not get hold of their .bed)
| Exome | Type  |    TP |   FP |  FN | Sensitivity | Precision | F-Score |   FDR |
|     1 | SNV   | 23689 | 1397 | 613 |       0.975 |     0.944 |   0.959 | 0.057 |
|     2 | SNV   | 23946 |  865 | 356 |       0.985 |     0.965 |   0.975 | 0.036 |
|     1 | indel |  1254 |   72 |  75 |       0.944 |     0.946 |   0.945 | 0.054 |
|     2 | indel |  1309 |   10 |  20 |       0.985 |     0.992 |   0.989 | 0.008 |
To try to improve the statistics:
- The genome version GRCh38 vs GRCh38.p13 changes almost nothing
- Disabling dbSNP changes strictly nothing for variant calling
I explored the false negatives:
- the vast majority are simply not seen (it is not a haploid/genotype issue)
- the distribution across chromosomes is fairly even, except on chromosome 6 ()
- most are in 5' and 3' UTRs (according to Best RefSeq)
Conclusion: I plan to stop here for the variant calling validation, for lack of time. It would be worth digging into why some variants are not seen by GATK, but they are not the majority. In any case, I can justify a first analysis for the thesis.
Does that work for you?
[1]
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2928-9
Results here https://static-content.springer.com/esm/art%3A10.1186%2Fs12859-019-2928-9/MediaObjects/12859_2019_2928_MOESM8_ESM.pdf
***** DONE Comparison
CLOSED: [2023-03-04 Sat 11:14]
HGREF=/Work/Groups/bisonex/data-alexis-reference/genome/GRCh38_latest_genomic.fna ./result/bin/hap.py /Work/Groups/bisonex/NA12878/HG001_GRCh38_1_22_v4.2.1_benchmark_renamed.vcf.gz script/files/vcf/NA12878_NIST7035_vep_annot.vcf -f /Work/Groups/bisonex/NA12878/HG001_GRCh38_1_22_v4.2.1_benchmark.bed -o test
na1878.slurm
#+begin_src slurm
#!/bin/bash
#SBATCH -c 4
#SBATCH -p smp
#SBATCH --time=01:00:00
#SBATCH --mem=32G
module load nix/2.11.0
export HGREF=/Work/Groups/bisonex/data-alexis-reference/genome/GRCh38_latest_genomic.fna
dir=/Work/Groups/bisonex/data/NA12878/GRCh38
hap.py ${di

Replacement in projects/bisonex.org at line 60 [4.35]

B:BD[10.20967] → [10.20967:24606]

B:BD[10.24606] → [11.33282:37835]

      26289 |      103 |      3845 |    77 |     4 |      0.965854 |         0.995411 |       0.146259 |         0.98041 |     2.9282199219412863 |     2.7752583237657866 |        1.6348301800775018 |        1.8423330808354075 |
| SNP   | PASS   |       23136 |    22346 |      790 |       26289 |      103 |      3845 |    77 |     4 |      0.965854 |         0.995411 |       0.146259 |         0.98041 |     2.9282199219412863 |     2.7752583237657866 |        1.6348301800775018 |        1.8423330808354075 |
***** DONE Redo: HiSeq4000 + Agilent SureSelect + "ready-to-use" genome gatk-4.4
CLOSED: [2023-08-03 Thu 23:24] SCHEDULED: <2023-08-03 Thu>
#+begin_src sh
ID="HG002-SRX11061487_SRR14724512-GRCh38" ; nextflow run workflows/compareVCF.nf -profile standard,helios --outdir=out/${ID} --query=out/${ID}/callVariant/haplotypecaller/${ID}.vcf.gz --compare=vcfeval,happy -lib lib --capture=capture/Agilent_SureSelect_All_Exons_v7_hg38_Regions.bed  --id=HG002 --genome=GRCh38
#+end_src
**** DONE HG003 :hg003:hg38:
CLOSED: [2023-07-30 Sun 14:26]
***** Notes
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run main.nf -profile standard,helios  --input /Work/Groups/bisonex/data/giab/GRCh38/HG003_{1,2}.fq.gz -bg
#+end_src
#+begin_src  sh
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/compareVCF.nf -profile standard,helios -resume --outdir=compareHG003  --test.id=HG003 --test.query=out/HG003_1/variantCalling/haplotypecaller/HG003_1.vcf.gz  --test.compare=vcfeval,happy --test.capture=data/AgilentSureSelectv05_hg38.bed
#+end_src
vcfeval
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    5.000              36745          36473        486       3988     0.9869       0.9021     0.9426
     None              36748          36476        495       3985     0.9866       0.9022     0.9425
$ zcat NA12878.snp_roc.tsv.gz  | tail -n 1 | awk '{print $7 $6}'
happy
Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL         2731      2290       441         3092       208        577     62     53       0.838521          0.917296        0.186611         0.876141                     NaN                     NaN                   1.505145                   1.888993
INDEL   PASS         2731      2290       441         3092       208        577     62     53       0.838521          0.917296        0.186611         0.876141                     NaN                     NaN                   1.505145                   1.888993
  SNP    ALL        37997     34481      3516        36861       306       2074     33     13       0.907466          0.991204        0.056265         0.947488                2.611269                2.565915                   1.555780                   1.621727
  SNP   PASS        37997     34481      3516        36861       306       2074     33     13       0.907466          0.991204        0.056265         0.947488                2.611269                2.5659
***** DONE Redo: HiSeq4000 + Agilent SureSelect + "ready-to-use" genome
CLOSED: [2023-07-30 Sun 14:25] SCHEDULED: <2023-07-23 Sun>
| Type  | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL    |         644 |      538 |      106 |         914 |       32 |       337 |     7 |    19 |      0.835404 |         0.944541 |       0.368709 |        0.886626 |                        |                        |        1.7444933920704846 |        2.3138686131386863 |
| INDEL | PASS   |         644 |      538 |      106 |         914 |       32 |       337 |     7 |    19 |      0.835404 |         0.944541 |       0.368709 |        0.886626 |                        |                        |        1.7444933920704846 |        2.3138686131386863 |
| SNP   | ALL    |       23126 |    22271 |      855 |       26405 |      135 |      4002 |    90 |    20 |      0.963029 |         0.993974 |       0.151562 |        0.978257 |      2.949462182004439 |     2.7766657134686876 |        1.6080333972695475 |        1.8465106245280984 |
| SNP   | PASS   |       23126 |    22271 |      855 |       26405 |      135 |      4002 |    90 |    20 |      0.963029 |         0.993974 |       0.151562 |        0.978257 |      2.949462182004439 |     2.7766657134686876 |        1.6080333972695475 |        1.8465106245280984 |
***** DONE Redo: HiSeq4000 + Agilent SureSelect + "ready-to-use" genome gatk-4.4
CLOSED: [2023-08-03 Thu 23:24] SCHEDULED: <2023-08-03 Thu>
#+begin_src sh
ID="HG003-SRX11061488_SRR14724511-GRCh38" ; nextflow run workflows/compareVCF.nf -profile standard,helios --outdir=out/${ID} --query=out/${ID}/callVariant/haplotypecaller/${ID}.vcf.gz --compare=vcfeval,happy -lib lib --capture=capture/Agilent_SureSelect_All_Exons_v7_hg38_Regions.bed  --id=HG003 --genome=GRCh38
#+end_src
**** TODO HG004 :hg38:hg004:
***** Notes
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run main.nf -profile standard,helios  --input /Work/Groups/bisonex/data/giab/GRCh38/HG004_{1,2}.fq.gz -bg
#+end_src
vcfeval
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    6.000              36938          36678        421       4040     0.9887       0.9014     0.9430
     None              36942          36682        432       4036     0.9884       0.9015     0.9429
happy
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL         2787      2388       399         3183       195        580     53     38       0.856835          0.925086        0.182218         0.889654                     NaN                     NaN                   1.507834                   1.848649
INDEL   PASS         2787      2388       399         3183       195        580     53     38       0.856835          0.925086        0.182218         0.889654                     NaN                     NaN                   1.507834                   1.848649
  SNP    ALL        38185     34560      3625        36921       254       2107     46      7       0.905067          0.992704        0.057068         0.946862                2.589175                2.553546                   1.632595                   1.653534
  SNP   PASS        38185     34560      3625        36921       254       2107     46      7       0.905067          0.992704        0.057068         0.946862                2.589175                2.553546                   1.632595                   1.653534
***** DONE Redo: HiSeq4000 + Agilent SureSelect + "ready-to-use" genome
CLOSED: [2023-07-30 Sun 14:39] SCHEDULED: <2023-07-23 Sun>
| Type  | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL    |         588 |      511 |       77 |         873 |       15 |       332 |     7 |     8 |      0.869048 |         0.972274 |       0.380298 |        0.917767 |                        |                        |        1.6111111111111112 |                 2.3984375 |
| INDEL | PASS   |         588 |      511 |       77 |         873 |       15 |       332 |     7 |     8 |      0.869048 |

[10.20967]

[11.37835]

      26289 |      103 |      3845 |    77 |     4 |      0.965854 |         0.995411 |       0.146259 |         0.98041 |     2.9282199219412863 |     2.7752583237657866 |        1.6348301800775018 |        1.8423330808354075 |
| SNP   | PASS   |       23136 |    22346 |      790 |       26289 |      103 |      3845 |    77 |     4 |      0.965854 |         0.995411 |       0.146259 |         0.98041 |     2.9282199219412863 |     2.7752583237657866 |        1.6348301800775018 |        1.8423330808354075 |
***** DONE Redo: HiSeq4000 + Agilent SureSelect + "ready-to-use" genome gatk-4.4
CLOSED: [2023-08-03 Thu 23:24] SCHEDULED: <2023-08-03 Thu>
#+begin_src sh
ID="HG002-SRX11061487_SRR14724512-GRCh38" ; nextflow run workflows/compareVCF.nf -profile standard,helios --outdir=out/${ID} --query=out/${ID}/callVariant/haplotypecaller/${ID}.vcf.gz --compare=vcfeval,happy -lib lib --capture=capture/Agilent_SureSelect_All_Exons_v7_hg38_Regions.bed  --id=HG002 --genome=GRCh38
#+end_src
**** DONE HG003 :hg003:hg38:
CLOSED: [2023-07-30 Sun 14:26]
***** Notes
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run main.nf -profile standard,helios  --input /Work/Groups/bisonex/data/giab/GRCh38/HG003_{1,2}.fq.gz -bg
#+end_src
#+begin_src  sh
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/compareVCF.nf -profile standard,helios -resume --outdir=compareHG003  --test.id=HG003 --test.query=out/HG003_1/variantCalling/haplotypecaller/HG003_1.vcf.gz  --test.compare=vcfeval,happy --test.capture=data/AgilentSureSelectv05_hg38.bed
#+end_src
vcfeval
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    5.000              36745          36473        486       3988     0.9869       0.9021     0.9426
     None              36748          36476        495       3985     0.9866       0.9022     0.9425
$ zcat NA12878.snp_roc.tsv.gz  | tail -n 1 | awk '{print $7 $6}'
happy
Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL         2731      2290       441         3092       208        577     62     53       0.838521          0.917296        0.186611         0.876141                     NaN                     NaN                   1.505145                   1.888993
INDEL   PASS         2731      2290       441         3092       208        577     62     53       0.838521          0.917296        0.186611         0.876141                     NaN                     NaN                   1.505145                   1.888993
  SNP    ALL        37997     34481      3516        36861       306       2074     33     13       0.907466          0.991204        0.056265         0.947488                2.611269                2.565915                   1.555780                   1.621727
  SNP   PASS        37997     34481      3516        36861       306       2074     33     13       0.907466          0.991204        0.056265         0.947488                2.611269                2.5659
***** DONE Redo: HiSeq4000 + Agilent SureSelect + "ready-to-use" genome
CLOSED: [2023-07-30 Sun 14:25] SCHEDULED: <2023-07-23 Sun>
| Type  | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL    |         644 |      538 |      106 |         914 |       32 |       337 |     7 |    19 |      0.835404 |         0.944541 |       0.368709 |        0.886626 |                        |                        |        1.7444933920704846 |        2.3138686131386863 |
| INDEL | PASS   |         644 |      538 |      106 |         914 |       32 |       337 |     7 |    19 |      0.835404 |         0.944541 |       0.368709 |        0.886626 |                        |                        |        1.7444933920704846 |        2.3138686131386863 |
| SNP   | ALL    |       23126 |    22271 |      855 |       26405 |      135 |      4002 |    90 |    20 |      0.963029 |         0.993974 |       0.151562 |        0.978257 |      2.949462182004439 |     2.7766657134686876 |        1.6080333972695475 |        1.8465106245280984 |
| SNP   | PASS   |       23126 |    22271 |      855 |       26405 |      135 |      4002 |    90 |    20 |      0.963029 |         0.993974 |       0.151562 |        0.978257 |      2.949462182004439 |     2.7766657134686876 |        1.6080333972695475 |        1.8465106245280984 |
***** DONE Redo: HiSeq4000 + Agilent SureSelect + "ready-to-use" genome gatk-4.4
CLOSED: [2023-08-03 Thu 23:24] SCHEDULED: <2023-08-03 Thu>
#+begin_src sh
ID="HG003-SRX11061488_SRR14724511-GRCh38" ; nextflow run workflows/compareVCF.nf -profile standard,helios --outdir=out/${ID} --query=out/${ID}/callVariant/haplotypecaller/${ID}.vcf.gz --compare=vcfeval,happy -lib lib --capture=capture/Agilent_SureSelect_All_Exons_v7_hg38_Regions.bed  --id=HG003 --genome=GRCh38
#+end_src
**** DONE HG004 :hg38:hg004:
CLOSED: [2023-11-26 Sun 12:30]
***** Notes
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run main.nf -profile standard,helios  --input /Work/Groups/bisonex/data/giab/GRCh38/HG004_{1,2}.fq.gz -bg
#+end_src
vcfeval
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    6.000              36938          36678        421       4040     0.9887       0.9014     0.9430
     None              36942          36682        432       4036     0.9884       0.9015     0.9429
happy
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL         2787      2388       399         3183       195        580     53     38       0.856835          0.925086        0.182218         0.889654                     NaN                     NaN                   1.507834                   1.848649
INDEL   PASS         2787      2388       399         3183       195        580     53     38       0.856835          0.925086        0.182218         0.889654                     NaN                     NaN                   1.507834                   1.848649
  SNP    ALL        38185     34560      3625        36921       254       2107     46      7       0.905067          0.992704        0.057068         0.946862                2.589175                2.553546                   1.632595                   1.653534
  SNP   PASS        38185     34560      3625        36921       254       2107     46      7       0.905067          0.992704        0.057068         0.946862                2.589175                2.553546                   1.632595                   1.653534
***** DONE Redo: HiSeq4000 + Agilent SureSelect + "ready-to-use" genome
CLOSED: [2023-07-30 Sun 14:39] SCHEDULED: <2023-07-23 Sun>
| Type  | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL    |         588 |      511 |       77 |         873 |       15 |       332 |     7 |     8 |      0.869048 |         0.972274 |       0.380298 |        0.917767 |                        |                        |        1.6111111111111112 |                 2.3984375 |
| INDEL | PASS   |         588 |      511 |       77 |         873 |       15 |       332 |     7 |     8 |      0.869048 |

Replacement in projects/bisonex.org at line 62 [4.35]

B:BD[12.17018] → [12.17018:17453]

B:BD[12.17453] → [13.49674:49719]

B:BD[13.49719] → [14.9278:33374]

84686 |                  1.591810 |                  1.816145 |
******** Summary
T2T
| Type  | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision |
| INDEL |         413 |      246 |      167 |         751 |      289 |       215 |     2 |    93 |      0.595642 |         0.460821 |
| SNP   |       11236 |    10985 |      251 |       23597 |     9771 |      2841 |    26 |    58 |      0.977661 |         0.529245 |
Hg38
| Type  | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision |
| INDEL |         549 |      489 |       60 |         899 |       64 |       340 |     8 |    17 |      0.890710 |         0.885510 |
| SNP   |       21973 |    21462 |      511 |       26285 |      563 |      4263 |    68 |    16 |      0.976744 |         0.974435 |
****** DONE Intersection of the BED files: similar
CLOSED: [2023-07-04 Tue 23:11]
HG38
 #+begin_src sh
 bedtools intersect -a capture/Agilent_SureSelect_All_Exons_v7_hg38_Regions.bed -b /Work/Groups/bisonex/data/giab/GRCh38/HG001_GRCh38_1_22_v4.2.1_benchmark.bed  | wc -l
 #+end_src
 204280
 T2T
 #+begin_src sh
 bedtools intersect -a /Work/Groups/bisonex/data/giab/T2T/Agilent_SureSelect_All_Exons_v7_hg38_Regions_hg38_T2T.bed -b /Work/Groups/bisonex/data/giab/T2T/HG001_GRCh38_1_22_v4.2.1_benchmark_hg38_T2T.bed  | wc -l
 #+end_src
 204021
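The two line counts only show that the number of intersected intervals is similar. A possible complementary check, sketched here, is to sum the bases covered by each intersection. The helper name =bed_bases= is hypothetical, and the sketch assumes a tab-separated BED file without overlapping intervals (overlaps would be counted twice).

#+begin_src sh
# bed_bases FILE: total number of bases covered by a BED file.
# BED intervals are 0-based and half-open, so each spans (end - start) bases.
# Header lines (#, track, browser) are skipped.
bed_bases() {
    awk -F'\t' '!/^(#|track|browser)/ { total += $3 - $2 } END { print total + 0 }' "$1"
}
#+end_src

For example, redirect each =bedtools intersect= output to a file and call =bed_bases= on it to compare the hg38 and T2T intersections in base pairs.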
****** DONE Check the command line
CLOSED: [2023-07-04 Tue 23:38]
#+begin_src sh
hap.py \
    HG001_GRCh38_1_22_v4_lifted_merged.vcf.gz \
    HG001-SRX11061486_SRR14724513-T2T.vcf.gz \
     \
    --reference chm13v2.0.fa \
    --threads 6 \
     \
    -T Agilent_SureSelect_All_Exons_v7_hg38_Regions_hg38_T2T.bed \
    --false-positives HG001_GRCh38_1_22_v4.2.1_benchmark_hg38_T2T.bed \
     \
    -o HG001
#+end_src
****** DONE Fix FILTER: better but still too many negatives. 3/4 of SNPs recovered
CLOSED: [2023-07-08 Sat 15:19] SCHEDULED: <2023-07-08 Sat>
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL          413       246       167          751       289        215      2     98       0.595642          0.460821        0.286285         0.519629                     NaN                     NaN                   2.428571                   2.465116
INDEL   PASS          413       246       167          751       289        215      2     98       0.595642          0.460821        0.286285         0.519629                     NaN                     NaN                   2.428571                   2.465116
  SNP    ALL        15883     15479       404        23597      5277       2841     46     44       0.974564          0.745760        0.120397         0.844947                3.017198                 2.85705                   5.560099                   2.114633
  SNP   PASS        15883     15479       404        23597      5277       2841     46     44       0.974564          0.745760        0.120397         0.844947                3.017198                 2.85705                   5.560099                   2.114633
******* DONE Check that no filter other than PASS remains
CLOSED: [2023-07-08 Sat 15:19]
#+begin_src
$ zgrep -c 'PASS' HG001_GRCh38_1_22_v4_lifted_merged.vcf.gz
3730505
$ zgrep -c '^chr' HG001_GRCh38_1_22_v4_lifted_merged.vcf.gz
3730506
#+end_src
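Note that =zgrep -c 'PASS'= counts any line containing PASS, wherever it appears. A stricter check, sketched here, tabulates the FILTER column (7th field of a VCF record) directly. The helper name =filter_counts= is hypothetical.

#+begin_src sh
# filter_counts FILE.vcf.gz: count records per FILTER value (VCF column 7).
# Header lines starting with '#' are skipped; output is "VALUE COUNT", sorted.
filter_counts() {
    gzip -dc "$1" \
        | awk -F'\t' '!/^#/ { n[$7]++ } END { for (f in n) print f, n[f] }' \
        | LC_ALL=C sort
}
#+end_src

If only PASS remains, calling =filter_counts= on the lifted VCF should print a single line.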
****** TODO 1/4 of SNPs missing?
******* DONE Check with Julia whether these are really FPs: 61/5277 are not
CLOSED: [2023-07-09 Sun 12:09]
******* DONE Examine the FPs
CLOSED: [2023-07-30 Sun 22:05]
******* DONE Test one FP
CLOSED: [2023-07-30 Sun 22:05]
  2 │ chr1        608765  A           G           ./.:.:.:.:NOCALL:nocall:.  1/1:FP:.:ti:SNP:homalt:188
  liftDown UCSC: nothing in GIAB: true FP
 3 │ chr1        762943  A           G           ./.:.:.:.:NOCALL:nocall:.  1/1:FP:.:ti:SNP:homalt:287
 4 │ chr1        762945  A           T           ./.:.:.:.:NOCALL:nocall:.  1/1:FP:.:tv:SNP:homalt:287
 Complex rearrangements? Not in the gene in HG38
******* DONE Most FPs (4705/5566) are homozygous: reference error?
CLOSED: [2023-07-12 Wed 21:10] SCHEDULED: <2023-07-09 Sun>
The first 2 variants actually show the difference between T2T and GRCh38
Alignment error?
******** KILL Rerun the alignment
CLOSED: [2023-07-09 Sun 17:36]
******** DONE Check that the reads are identical in hg38 and T2T: yes
CLOSED: [2023-07-09 Sun 16:36]
T2T CHR1608765
38   	chr1:1180168-1180168 (
SRR14724513.24448214
SRR14724513.24448214
******* DONE Check a few variants in IGV
CLOSED: [2023-07-09 Sun 17:36]
******* KILL Distribution of the FPs: clusters?
CLOSED: [2023-07-09 Sun 17:36]
****** DONE Examine the FPs remaining after correction based on the reference sequence
CLOSED: [2023-08-12 Sat 15:57]
****** HOLD Examine the removed variants
****** TODO Remove the FPs that correspond to a change in the genome
******* Condition:
- no variation at the position in GRCh38
- homozygous variation
- the variation in T2T corresponds to the base-pair change GRCh38 -> T2T
  for SNPs:
  alt_T2T[i] = DNA_GRCh38[j]
  with i the position in T2T and j the position in GRCh38
  Note: defining an ID is not correct because variants can be modified by happy!
******* Idea
 - For each FP, it is a "false" FP if
     - REF in hg38 == ALT in T2T
     - and REF in hg38 != REF in T2T
     - and the variant is homozygous
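The three conditions listed here can be written as a small predicate. This is only a sketch for SNPs: the function name =is_reference_fp= and its argument order are hypothetical, and it assumes the hg38 and T2T reference bases have already been obtained by some other means.

#+begin_src sh
# is_reference_fp REF_HG38 REF_T2T ALT_T2T GT
# Succeeds (exit status 0) when a T2T false positive only reflects a
# reference change: the T2T ALT allele equals the hg38 reference base,
# the two reference bases differ, and the call is homozygous alternate.
is_reference_fp() {
    ref_hg38=$1 ref_t2t=$2 alt_t2t=$3 gt=$4
    [ "$ref_hg38" = "$alt_t2t" ] || return 1
    [ "$ref_hg38" != "$ref_t2t" ] || return 1
    case "$gt" in
        1/1|1\|1) return 0 ;;   # unphased or phased homozygous alternate
        *) return 1 ;;
    esac
}
#+end_src

For instance, =is_reference_fp G A G 1/1= succeeds, while the same alleles with a =0/1= genotype fail the test.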
How to obtain the reference sequences?
1. liftover
2. blat on the sequence around the variant
3. identify a few reads containing the variant and look at their alignment in hg38
After discussion with Alexis: solution 3
******* Algorithm
1. Extract the T2T coordinates of the *homozygous* false positives
2. For each false positive
   1. list 10 reads containing the variant
   2. for each of these reads, retrieve the sequence in T2T and GRCh38 via the read name in the BAM
   3. if the T2T sequence modified by the variant is "identical" to the GRCh38 one, then ignore this false positive
Note: reads that changed chromosome between the versions are ignored
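Step 2.3 can be sketched as follows for a single SNV. This is not the Julia script used for the results: the helper names =apply_snv= and =matches_grch38= are hypothetical, the comparison is an exact string match on two windows of equal length, and extracting the sequences from the BAM files is left out.

#+begin_src sh
# apply_snv SEQ POS ALT: print SEQ with the base at 1-based offset POS
# replaced by ALT (single-base substitution only).
apply_snv() {
    awk -v s="$1" -v p="$2" -v a="$3" \
        'BEGIN { print substr(s, 1, p - 1) a substr(s, p + 1) }'
}

# matches_grch38 T2T_SEQ POS ALT GRCH38_SEQ: succeed when the T2T sequence,
# once modified by the variant, is identical to the GRCh38 sequence.
matches_grch38() {
    [ "$(apply_snv "$1" "$2" "$3")" = "$4" ]
}
#+end_src

With toy sequences, =matches_grch38 ACGTA 3 T ACTTA= succeeds, so such a false positive would be ignored.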
******* DONE Preliminary result
CLOSED: [2023-07-23 Sun 14:30]
cf [[file:~/roam/research/bisonex/code/giab/giab-corrected.csv][Julia script]]
3498 fewer false positives, i.e. a precision of 0.897
julia> tp=15479
julia> fp=5277
julia> tp/(tp+fp)
0.7457602620928888
julia> tp/(tp+(fp-3498))
0.8969173716537258
We are still below 97%
******* HOLD Properly correct the VCF or the happy results
******* TODO Adapt to handle several variants per read
****** DONE Pangenome methodology
CLOSED: [2023-10-03 Tue 21:28]
See the bibliography [cite:@liao2023], but they aligned to GRCh38
******* DONE Email Alexis
CLOSED: [2023-10-03 Tue 21:28]
****** DONE T2T methodology
CLOSED: [2023-10-16 Mon 19:42]
Email Alexis
SCHEDULED: <2023-10-04 Wed>
***** TODO Simply report the number of true positives
SCHEDULED: <2023-12-02 Sat>
***** KILL Email Yannis
CLOSED: [2023-07-08 Sat 10:44]
***** DONE Email GIAB about a T2T version
CLOSED: [2023-07-07 Fri 18:37]
**** TODO HG002 :hg002:T2T:
**** TODO HG003 :hg003:T2T:
**** TODO HG004 :hg004:T2T:
**** DONE Plot: Ashkenazim trio :hg38:
CLOSED: [2023-07-30 Sun 16:49] SCHEDULED: <2023-07-30 Sun 15:00>
:LOGBOOK:
CLOCK: [2023-07-30 Sun 16:06]--[2023-07-30 Sun 16:35] =>  0:29
CLOCK: [2023-07-30 Sun 15:39]--[2023-07-30 Sun 15:40] =>  0:01
:END:
/Entered on/ [2023-04-16 Sun 17:29]
Redo the results
**** DONE Email Paul about the Ashkenazim results +/- Centogene
CLOSED: [2023-08-06 Sun 20:24] SCHEDULED: <2023-08-06 Sun>
**** DONE Rerun the GIAB comparison with GATK 4.4.0
CLOSED: [2023-08-12 Sat 15:55]
/Entered on/ [2023-08-03 Thu 12:42]
*** TODO Platinum genome :platinum:
https://emea.illumina.com/platinumgenomes.html
**** TODO Test on the region covered by the Centogene exome
SCHEDULED: <2023-12-02 Sat>
*** DONE Sequence NA12878 :cento:hg001:
CLOSED: [2023-10-07 Sat 17:59]
Discussion with Paul: the subcontractor will not give us the data, so we have to order the DNA
**** DONE DNA ordered
CLOSED: [2023-06-30 Fri 22:29]
**** DONE Back up the raw data
CLOSED: [2023-07-30 Sun 14:22] SCHEDULED: <2023-07-19 Wed>
K, scality, S
**** KILL Retrieve the capture file
CLOSED: [2023-07-30 Sun 14:25] SCHEDULED: <2023-07-23 Sun>
Candidates given in the publication https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8354858/
#+begin_quote
In short, the Nextera Rapid Capture Exome Kit (Illumina, San Diego, CA), the SureSelect Human All Exon kit (Agilent, Santa Clara, CA) or the Twist Human Core Exome was used for enrichment, and a Nextseq500, HiSeq4000, or Novoseq 6000 (Illumina) instrument was used for the actual sequencing, with the average coverage targeted to at least 100× or at least 98% of the target DNA covered 20×.
#+end_quote
By default, we will use https://www.twistbioscience.com/products/ngs/alliance-panels#tab-3
Recent announcement of a new Twist panel: https://www.centogene.com/news-events/news/newsdetails/twist-bioscience-and-centogene-launch-three-panels-to-advance-rare-disease-and-hereditary-cancer-research-and-support-diagnostics
But no BED file
***** DONE Email Centogene
CLOSED: [2023-07-30 Sun 14:22] DEADLINE: <2023-07-23 Sun>
**** DONE Test Nextera Rapid Capture Exome v1.2 (hg19) :giab:
CLOSED: [2023-08-06 Sun 19:05] SCHEDULED: <2023-08-03 Thu 19:00>
https://support.illumina.com/downloads/nextera-rapid-capture-exome-v1-2-product-files.html
***** DONE Liftover capture
CLOSED: [2023-08-06 Sun 18:30] SCHEDULED: <2023-08-06 Sun>
#+begin_src sh
 nextflow run -profile standard,helios workflows/lift-nextera-capture.nf  -lib lib
#+end_src
Quick check: OK
***** DONE Run
CLOSED: [2023-08-06 Sun 19:05] SCHEDULED: <2023-08-06 Sun>
#+begin_src sh
 nextflow run workflows/compareVCF.nf -profile standard,helios --query=out/2300346867_NA12878-63118093_S260-GRCh38/callVariant/haplotypecaller/2300346867_NA12878-63118093_S260-GRCh38.vcf.gz --outdir=out/2300346867_NA12878-63118093_S260-GRCh38/happy-nextera-lifted/ --compare=happy -lib lib --capture=capture/nexterarapidcapture_exome_targetedregions_v1.2-nochrM_lifted.bed  --id=HG001 --genome=GRCh38
#+end_src
**** DONE Test Agilent SureSelect All Exon V8 (hg38) :giab:
CLOSED: [2023-07-31 Mon 23:09] SCHEDULED: <2023-07-31 Mon>
https://earray.chem.agilent.com/suredesign/index.htm
"Find design"
"Agilent catalog"
Files:
- Regions.bed: Targeted exon intervals, curated and targeted by Agilent Technologies
- MergedProbes.bed: Merged probes for targeted enrichment of exons described in Regions.bed
- Covered.bed: Merged probes and sequences with 95% homology or above
- Padded.bed: Merged probes and sequences with 95% homology or above extended 50 bp at each side
- AllTracks.bed: Targeted regions and covered tracks
 #+begin_src sh
nextflow run workflows/compareVCF.nf -profile standard,helios --query=out/2300346867_63118093_NA12878-GRCh38/callVariant/haplotypecaller/2300346867_63118093_NA12878-GRCh38.vcf.gz --outdir=out/2300346867_63118093_NA12878-GRCh38/happy/ --compare=happy -lib lib --capture=capture/Agilent_SureSelect_All_Exons_v8_hg38_Regions.bed  --id=HG001 --genome=GRCh38
 #+end_src
| Type  | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL    |         423 |      395 |       28 |         915 |      108 |       405 |     4 |    13 |      0.933806 |         0.788235 |       0.442623 |        0.854868 |                        |                        |        1.7012987012987013 |        2.7916666666666665 |
| INDEL | PASS   |         423 |      395 |       28 |         915 |      108 |       405 |     4 |    13 |      0.933806 |         0.788235 |       0.442623 |        0.854868 |                        |                        |        1.7012987012987013 |        2.7916666666666665 |
| SNP   | ALL    |       20984 |    20600 |      384 |       26080 |      780 |      4703 |    62 |    10 |        0.9817 |         0.963512 |        0.18033 |        0.972521 |     3.0499710592321048 |     2.7596541786743516 |          1.58256372367935 |        1.8978207694018234 |
| SNP   | PASS   |       20984 |    20600 |      384 |       26080 |      780 |      4703 |    62 |    10 |        0.9817 |         0.963512 |        0.18033 |        0.972521 |     3.0499710592321048 |     2.7596541786743516 |          1.58256372367935 |        1.8978207694018234 |
**** DONE Test Twist Human core Exome (hg38) :giab:
CLOSED: [2023-08-01 Tue 23:16] SCHEDULED: <2023-08-02 Wed>
https://www.twistbioscience.com/resources/data-files/ngs-human-core-exome-panel-bed-file
#+begin_src sh
nextflow run workflows/compareVCF.nf -profile standard,helios --query=out/2300346867_63118093_NA12878-GRCh38/callVariant/haplotypecaller/2300346867_63118093_NA12878-GRCh38.vcf.gz --outdir=out/2300346867_63118093_NA12878-GRCh38/happy-twist-exome-core/ --compare=happy -lib lib --capture=capture/Twist_Exome_Core_Covered_Targets_hg38.bed  --id=HG001 --genome=GRCh38 -bg
#+end_src
| Type  | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL    |         328 |      313 |       15 |         722 |       95 |       309 |     4 |    13 |      0.954268 |         0.769976 |       0.427978 |        0.852273 |                        |                        |        1.8584070796460177 |        2.8967391304347827 |
| INDEL | PASS   |         328 |      313 |       15 |         722 |       95 |       309 |     4 |    13 |      0.954268 |         0.769976 |       0.427978 |        0.852273 |                        |                        |        1.8584070796460177 |        2.8967391304347827 |
| SNP   | ALL    |       19198 |    18962 |      236 |       23381 |      684 |      3738 |    48 |    10 |      0.987707 |         0.965178 |       0.159873 |        0.976313 |     3.1034188034188035 |      2.859264147830391 |        1.5669565217391304 |        1.8578767123287672 |
| SNP   | PASS   |       19198 |    18962 |      236 |       23381 |      684 |      3738 |    48 |    10 |      0.987707 |         0.965178 |       0.159873 |        0.976313 |     3.1034188034188035 |      2.859264147830391 |        1.5669565217391304 |        1.8578767123287672 |
**** DONE Test Twist Human core Exome (hg38) :giab:
CLOSED: [2023-08-05 Sat 09:25] SCHEDULED: <2023-08-03 Thu 20:00>
#+begin_src sh
ID="2300346867_NA12878-63118093_S260-GRCh38"; nextflow run workflows/compareVCF.nf -profile standard,helios --query=out/${ID}/callVariant/haplotypecaller/${ID}.vcf.gz --outdir=out/${ID}/happy-twist-exome-core/ --compare=happy -lib lib --capture=capture/Twist_Exome_Core_Covered_Targets_hg38.bed  --id=HG001 --genome=GRCh38 -bg
#+end_src
**** DONE Test Agilent SureSelect All Exon V8 (hg38) GATK-4.4 :giab:
CLOSED: [2023-08-05 Sat 09:25] SCHEDULED: <2023-08-03 Thu 20:00>
**** DONE Check the impact of GATK 4.3 vs 4.4: none
CLOSED: [2023-08-05 Sat 09:25]
**** DONE Figure comparing the 3 capture kits :hg001:
CLOSED: [2023-08-06 Sun 20:24] SCHEDULED: <2023-08-06 Sun>
**** DONE Email Paul about the 3 capture kits :hg001:
CLOSED: [2023-08-06 Sun 20:24] SCHEDULED: <2023-08-06 Sun>
**** KILL Test whether the Twist Alliance VCGS Exome panel is sufficient
CLOSED: [2023-07-31 Mon 22:31] SCHEDULED: <2023-07-30 Sun>
**** DONE Email Centogene to ask for the capture type
CLOSED: [2023-10-07 Sat 17:59]
/Entered on/ [2023-08-07 Mon 20:40]
Twist exome
*** PROJ Compare happy and happy-vcfeval :giab:
** TODO Syndip data (CHM-eval): no, because it is whole-genome! :syndip:
https://github.com/lh3/CHM-eval
*** KILL Official data: no, because it is whole-genome!
CLOSED: [2023-11-19 Sun 23:43]
**** KILL Run ERR1341793
CLOSED: [2023-11-19 Sun 23:43] SCHEDULED: <2023-11-18 Sat>
(raw reads ERR1341793_1.fastq.gz and ERR1341793_2.fastq.gz downloaded from https://www.ebi.ac.uk/ena/browser/view/ERR1341793)
**** KILL Run ERR1341796
CLOSED: [2023-11-19 Sun 23:43] SCHEDULED: <2023-11-18 Sat>
*** TODO Broad Institute exome data (nextflow)
SCHEDULED: <2023-11-25 Sat 21:00>
https://console.cloud.google.com/storage/browser/broad-public-datasets/CHM1_CHM13_WES;tab=objects?pli=1&prefix=&forceOnObjectsSortingFiltering=false
*** TODO Download VCF
SCHEDULED: <2023-11-26 Sun>
https://github.com/lh3/CHM-eval/releases
** TODO Insilico :cento:
*** TODO All Centogene variants
**** DONE Extract the list of SNVs
CLOSED: [2023-04-22 Sat 17:32] SCHEDULED: <2023-04-17 Mon>
***** DONE Fix missing entries by hand
CLOSED: [2023-04-22 Sat 17:31]
The output is saved in git-annex: variants_success.csv
***** DONE Automatic
CLOSED: [2023-04-22 Sat 17:31]
**** DONE Convert SNVs: transcript -> genomic
CLOSED: [2023-06-03 Sat 17:16]
***** DONE Variant_recoder
CLOSED: [2023-04-26 Wed 21:21] SCHEDULED: <2023-04-22 Sat>
****** KILL Haskell: 160 missing: recoded-success.csv
CLOSED: [2023-04-25 Tue 18:32]
The variant list was generated in Haskell and cleaned up by hand.
We generate a list of variants for variant_recoder and submit everything at once.
[[file:~/recherche/bisonex/parsevariants/app/Main.hs][parsevariant]]
#+begin_src haskell
recodeVariant = do
  prepareVariantRecoder "variant_success.csv" "renamed.csv"
  runVariantRecoder "renamed.csv" "recoded.json"
#+end_src
#+RESULTS:
: <interactive>:4:3-19: error:
:     Variable not in scope: runVariantRecoder :: String -> String -> t
: gh
Problem: 160 out of 820 could not be read, probably because of the transcript minor version number
The output is saved in git-annex: variants-recoded-raw.json.
****** KILL Julia
CLOSED: [2023-04-25 Tue 18:32]
We regenerate the variant list and switch to Julia to prepare the parallel calls to variant recoder
[[file:~/recherche/bisonex/parsevariants/variantRecoder.jl][variantRecoder.jl]]
#+begin_src julia
setupVariantRecoder(unique(init), n)
#+end_src
Then
#+begin_src sh
parallel -a parallel-recoder.sh --jobs 10
#+end_src
We collect the results
#+begin_src julia
(fails, success) = mergeVariantRecoder(n)
CSV.write(fSuccess, success)
CSV.write(fFailures, fails)
#+end_src
Some variants are not found, so we prepare a new job after stripping the minor versions from the transcripts
#+begin_src julia
# Cleanup json and txt
if isfile(fSuccess) && isfile(fFailures)
    foreach(rm, variantRecoderInput())
    foreach(rm, variantRecoderOutput())
end
redoFails(fFailures)
#+end_src
Then
#+begin_src sh
parallel -a parallel-recoder.sh --jobs 3
#+end_src
70 transcripts are still missing
***** DONE Julia with mobidetails: recoded-failures-mobidetails.csv
CLOSED: [2023-04-25 Tue 18:58]
New strategy: we try variant recoder once.
For every failure, we use mobidetails (~170).
If the ID is not found, we increment the version number twice
***** DONE About ten left to fix by hand
CLOSED: [2023-04-26 Wed 21:21]
- [X] some transcripts were simply removed
- [X] Parsing error, a "-" is often missing
#+begin_src julia
lastTryMobidetails("recoded-failures-mobidetails.csv")
#+end_src
***** DONE Merge data
CLOSED: [2023-04-26 Wed 22:35]
#+begin_src julia
function mergeAllGenomic()
    dNew = mergeAll("recoded-success.csv",
                    "recoded-failures-mobidetails.csv",
                    "recoded-failures-mobidetails-redo.csv")
    dInit = @chain DataFrame(CSV.File("variant_success.csv")) begin
        @transform :transcript = :transcript .* ":" .* :coding .* :codingPos .* :codingChange
        @select :file :transcript :classification :zygosity
        @rename :classificationCento = :classification
    end
    dTmp = outerjoin(dInit, dNew, on = :transcript)
    CSV.write("variant_genomic.csv", dTmp)
end
fSuccess = "recoded-success.csv"
fFailures = "recoded-failures.csv"
# variantRecoder(fSuccess, fFailures)
# mobidetailsOnFailures(fFailures)
# lastTryMobidetails("recoded-failures-mobidetails.csv")
mergeAllGenomic()
#+end_src
***** DONE Format the data for simuscop
CLOSED: [2023-04-28 Fri 11:55] SCHEDULED: <2023-04-26 Wed>
**** KILL Extract the list of CNVs
CLOSED: [2023-08-12 Sat 15:54]
SCHEDULED: <2023-04-17 Mon>
**** KILL Simuscop :simuscop:
CLOSED: [2023-08-12 Sat 15:54]
***** DONE Train the model on 63003856/
CLOSED: [2023-04-29 Sat 19:56]
Rerun the model to be sure
***** DONE Generate fastq with simuscop (del and ins only) 20x
CLOSED: [2023-04-28 Fri 23:35] SCHEDULED: <2023-04-22 Sat>
****** DONE Generate a profile with the Centogene bed
CLOSED: [2023-04-28 Fri 11:54] SCHEDULED: <2023-04-22 Sat>
NA12878, but to be redone with real sequencing data
See [[*Centogène][Centogene bed]] for the choice
****** DONE Generate the data at 20x
CLOSED: [2023-04-28 Fri 11:54] SCHEDULED: <2023-04-22 Sat>
Centogene capture
****** DONE Regenerate after removing duplicates
CLOSED: [2023-04-28 Fri 17:28]
***** DONE Which coverage?
CLOSED: [2023-04-29 Sat 18:26]
e.g. at chr11:16,014,966, where the simulation has 11 reads versus 200!
****** 200x is the closest
#+attr_html: :width 500px
[[./simuscop-200-chr1-1.png]]
#+attr_html: :width 500px
[[./simuscop-200-chr1-2.png]]
****** DONE 20x
CLOSED: [2023-04-29 Sat 15:38]
****** DONE 50x
CLOSED: [2023-04-29 Sat 15:38]
****** DONE 100x
CLOSED: [2023-04-29 Sat 15:39]
****** DONE 200x
CLOSED: [2023-04-29 Sat 15:39]
***** DONE Reads poorly centered on small isolated exons
CLOSED: [2023-04-29 Sat 19:56] SCHEDULED: <2023-04-29 Sat>
Capture ok: [[https://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr1%3A153817168%2D153817824&hgsid=296556270_F4fkENLPXHXidi2oALXls2jxNH9l][UCSC]] (black track)
But the distribution is wrong
#+attr_html: :width 800px
[[./simuscop-error.png]]
To test
- Profile problem?
  - wrong patient?
  - bad generation? -> compare with the ones provided on GitHub
- chromosome names?
****** DONE [#A] Test on GATAD2B exon 6 for NC_000001.11:g.153817496A>T
CLOSED: [2023-04-29 Sat 19:56] SCHEDULED: <2023-04-29 Sat>
******* DONE Configuration + Profile 63003856.profile: same, poorly centered
CLOSED: [2023-04-29 Sat 19:18]
Download the data
#+begin_src sh :dir ~/code/bisonex/test-simuscop
scp meso:/Work/Projects/bisonex/data/genome/GRCh38.p14/genomeRef.fna .
scp meso:/Work/Projects/bisonex/data/simuscop/*.profile .
scp -r meso:/Work/Projects/bisonex/data/genome/GRCh38.p13/bwa .
#+end_src
We retrieve the exon (NB: org-mode does not run the code...)
#+begin_src julia
using CSV,DataFramesMeta
d = CSV.read("VCGS_Exome_Covered_Targets_hg38_40.1MB_renamed.bed", header=false, delim="\t", DataFrame)
@subset d :Column1 .== "NC_000001.11" :Column2 .<= 153817496 :Column3 .>= 153817496
#+end_src
NC_000001.11  153817371  153817542
Generate the bed file
#+begin_src sh :dir ~/code/bisonex/test-simuscop
echo -e "NC_000001.11\t153817371\t153817542" > gatad2b-exon6.bed
#+end_src
#+RESULTS:
Generate a variant
#+begin_src sh :dir ~/code/bisonex/test-simuscop
echo -e "s\tsingle\tNC_000001.11\t153817496\tA\tT\thet"> variant.txt
#+end_src
#+RESULTS:
Generate the config file
#+begin_src sh :dir ~/code/bisonex/test-simuscop
cat > config_wes.txt << EOL
ref = genomeRef.fna
profile = ./63003856.profile
variation = ./variant.txt
target = ./gatad2b-exon6.bed
layout = PE
threads = 1
name = single
output = test-gatad2b
coverage = 20
EOL
#+end_src
#+RESULTS:
Start the simulation
#+begin_src sh :dir ~/code/bisonex/test-si

[12.17018]

[14.33374]

84686 |                  1.591810 |                  1.816145 |
******** Summary
T2T
| Type  | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision |
| INDEL |         413 |      246 |      167 |         751 |      289 |       215 |     2 |    93 |      0.595642 |         0.460821 |
| SNP   |       11236 |    10985 |      251 |       23597 |     9771 |      2841 |    26 |    58 |      0.977661 |         0.529245 |
Hg38
| Type  | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision |
| INDEL |         549 |      489 |       60 |         899 |       64 |       340 |     8 |    17 |      0.890710 |         0.885510 |
| SNP   |       21973 |    21462 |      511 |       26285 |      563 |      4263 |    68 |    16 |      0.976744 |         0.974435 |
****** DONE Intersection of the bed files: similar
CLOSED: [2023-07-04 Tue 23:11]
HG38
 #+begin_src sh
 bedtools intersect -a capture/Agilent_SureSelect_All_Exons_v7_hg38_Regions.bed -b /Work/Groups/bisonex/data/giab/GRCh38/HG001_GRCh38_1_22_v4.2.1_benchmark.bed  | wc -l
 #+end_src
 204280
 T2T
 #+begin_src sh
 bedtools intersect -a /Work/Groups/bisonex/data/giab/T2T/Agilent_SureSelect_All_Exons_v7_hg38_Regions_hg38_T2T.bed -b /Work/Groups/bisonex/data/giab/T2T/HG001_GRCh38_1_22_v4.2.1_benchmark_hg38_T2T.bed  | wc -l
 #+end_src
 204021
****** DONE Check the command line
CLOSED: [2023-07-04 Tue 23:38]
#+begin_src sh
hap.py \
    HG001_GRCh38_1_22_v4_lifted_merged.vcf.gz \
    HG001-SRX11061486_SRR14724513-T2T.vcf.gz \
     \
    --reference chm13v2.0.fa \
    --threads 6 \
     \
    -T Agilent_SureSelect_All_Exons_v7_hg38_Regions_hg38_T2T.bed \
    --false-positives HG001_GRCh38_1_22_v4.2.1_benchmark_hg38_T2T.bed \
     \
    -o HG001
#+end_src
****** DONE Fix FILTER: better but still too many negatives. 3/4 of SNPs recovered
CLOSED: [2023-07-08 Sat 15:19] SCHEDULED: <2023-07-08 Sat>
| Type  | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL    |         413 |      246 |      167 |         751 |      289 |       215 |     2 |    98 |      0.595642 |         0.460821 |       0.286285 |        0.519629 |                    NaN |                    NaN |                  2.428571 |                  2.465116 |
| INDEL | PASS   |         413 |      246 |      167 |         751 |      289 |       215 |     2 |    98 |      0.595642 |         0.460821 |       0.286285 |        0.519629 |                    NaN |                    NaN |                  2.428571 |                  2.465116 |
| SNP   | ALL    |       15883 |    15479 |      404 |       23597 |     5277 |      2841 |    46 |    44 |      0.974564 |         0.745760 |       0.120397 |        0.844947 |               3.017198 |                2.85705 |                  5.560099 |                  2.114633 |
| SNP   | PASS   |       15883 |    15479 |      404 |       23597 |     5277 |      2841 |    46 |    44 |      0.974564 |         0.745760 |       0.120397 |        0.844947 |               3.017198 |                2.85705 |                  5.560099 |                  2.114633 |
******* DONE Check that no filter other than PASS remains
CLOSED: [2023-07-08 Sat 15:19]
#+begin_src
$ zgrep -c 'PASS' HG001_GRCh38_1_22_v4_lifted_merged.vcf.gz
3730505
$ zgrep -c '^chr' HG001_GRCh38_1_22_v4_lifted_merged.vcf.gz
3730506
#+end_src
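The two =zgrep= counts above match PASS anywhere on a line rather than in the FILTER column. A stricter check is sketched here, assuming a tab-separated VCF with FILTER in the 7th column; the helper name =count_non_pass= is hypothetical.

#+begin_src sh
# count_non_pass: read VCF text on stdin and print the number of records
# (non-header lines) whose FILTER column (7th field) is not PASS.
count_non_pass() {
    awk -F'\t' '!/^#/ && $7 != "PASS" { n++ } END { print n + 0 }'
}

# Usage on the lifted truth set (file name taken from the notes above):
# zcat HG001_GRCh38_1_22_v4_lifted_merged.vcf.gz | count_non_pass
#+end_src

A result of 0 would confirm that only PASS records remain.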
****** TODO 1/4 of SNPs missing?
******* DONE Check with Julia whether these are really FP: 61/5277 are not
CLOSED: [2023-07-09 Sun 12:09]
******* DONE Examine the FP
CLOSED: [2023-07-30 Sun 22:05]
******* DONE Test one FP
CLOSED: [2023-07-30 Sun 22:05]
  2 │ chr1        608765  A           G           ./.:.:.:.:NOCALL:nocall:.  1/1:FP:.:ti:SNP:homalt:188
  liftDown UCSC: nothing in GIAB: true FP
 3 │ chr1        762943  A           G           ./.:.:.:.:NOCALL:nocall:.  1/1:FP:.:ti:SNP:homalt:287
 4 │ chr1        762945  A           T           ./.:.:.:.:NOCALL:nocall:.  1/1:FP:.:tv:SNP:homalt:287
 Complex rearrangements? Not in the gene in hg38
******* DONE Most FP (4705/5566) are homozygous: reference error?
CLOSED: [2023-07-12 Wed 21:10] SCHEDULED: <2023-07-09 Sun>
The first 2 variants actually reflect the difference between T2T and GRCh38
Alignment error?
******** KILL Rerun the alignment
CLOSED: [2023-07-09 Sun 17:36]
******** DONE Check that the reads are identical in hg38 and T2T: yes
CLOSED: [2023-07-09 Sun 16:36]
T2T: chr1:608765
hg38: chr1:1180168-1180168
SRR14724513.24448214
SRR14724513.24448214
******* DONE Check a few variants in IGV
CLOSED: [2023-07-09 Sun 17:36]
******* KILL Distribution of FP: clusters?
CLOSED: [2023-07-09 Sun 17:36]
****** DONE Examine the FP remaining after correcting for the reference sequence
CLOSED: [2023-08-12 Sat 15:57]
****** HOLD Examine the removed variants
****** TODO Remove the FP that correspond to a change in the reference genome
******* Condition:
- no variant at the position in GRCh38
- homozygous variant
- the variant in T2T matches the base change GRCh38 -> T2T
  for SNPs:
  alt_T2T[i] = DNA_GRCh38[j]
  with i the position in T2T and j the position in GRCh38
  Note: defining an ID is not reliable because variants can be modified by happy!
******* Idea
 - For each FP, it is a "false" FP if
     - REF in hg38 == ALT in T2T
     - and REF in hg38 != REF in T2T
     - and the variant is homozygous
How to obtain the reference sequences?
1. liftover
2. blat on the sequence around the variant
3. identify a few reads containing the variant and look at their alignment in hg38
After discussion with Alexis: solution 3
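For SNPs, the three conditions listed above can be written as a single predicate. This is only a sketch: the function name =is_reference_artifact= and its argument order are hypothetical, and it assumes the hg38 and T2T reference bases at the matching positions are already known, which is what options 1 to 3 are meant to provide.

#+begin_src sh
# is_reference_artifact REF_HG38 REF_T2T ALT_T2T GT
# Exit status 0 when a T2T false positive only reflects a base that differs
# between the two references (a "false" FP), non-zero otherwise.
is_reference_artifact() {
    ref_hg38=$1 ref_t2t=$2 alt_t2t=$3 gt=$4
    [ "$ref_hg38" = "$alt_t2t" ] &&
    [ "$ref_hg38" != "$ref_t2t" ] &&
    { [ "$gt" = "1/1" ] || [ "$gt" = "1|1" ]; }
}

# Illustrative values only (not taken from the data):
if is_reference_artifact G A G 1/1; then
    echo "reference difference, not a real FP"
fi
#+end_src

The predicate only covers single-base substitutions; indels would need the read-based comparison.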
******* Algorithm
1. Extract the T2T coordinates of the *homozygous* false positives
2. For each false positive
   1. list 10 reads containing the variant
   2. for each of these reads, retrieve the sequence in T2T and GRCh38 via the read name in the bam
   3. if the T2T sequence modified by the variant is "identical" to the GRCh38 one, then we ignore this false positive
Note: we ignore reads that changed chromosome between versions
******* DONE Preliminary result
CLOSED: [2023-07-23 Sun 14:30]
cf [[file:~/roam/research/bisonex/code/giab/giab-corrected.csv][Julia script]]
3498 fewer false positives, i.e. a precision of 0.897 (TP / (TP + FP))
julia> tp=15479
julia> fp=5277
julia> tp/(tp+fp)
0.7457602620928888
julia> tp/(tp+(fp-3498))
0.8969173716537258
We are still below 97%
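The same arithmetic can be run from the shell. A minimal sketch: the helper name =precision= is hypothetical, and the counts are the SNP TP and FP of the T2T hap.py run together with the 3498 FP removed by the Julia script.

#+begin_src sh
# precision TP FP: print TP / (TP + FP) with 6 decimals
precision() {
    awk -v tp="$1" -v fp="$2" 'BEGIN { printf "%.6f\n", tp / (tp + fp) }'
}

precision 15479 5277               # before correction
precision 15479 $((5277 - 3498))   # after dropping the reference-only FP
#+end_src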
******* HOLD Properly correct the VCF or the happy results
******* TODO Adapt to handle several variants per read
****** DONE Pangenome methodology
CLOSED: [2023-10-03 Tue 21:28]
See bibliography [cite:@liao2023], but they aligned to GRCh38
******* DONE Email Alexis
CLOSED: [2023-10-03 Tue 21:28]
****** DONE T2T methodology
CLOSED: [2023-10-16 Mon 19:42]
Email Alexis
SCHEDULED: <2023-10-04 Wed>
***** TODO Simply report the number of true positives
SCHEDULED: <2023-12-02 Sat>
***** KILL Email Yannis
CLOSED: [2023-07-08 Sat 10:44]
***** DONE Email GIAB about a T2T version
CLOSED: [2023-07-07 Fri 18:37]
**** KILL HG002 :hg002:T2T:
CLOSED: [2023-11-26 Sun 12:30]
**** KILL HG003 :hg003:T2T:
CLOSED: [2023-11-26 Sun 12:30]
**** KILL HG004 :hg004:T2T:
CLOSED: [2023-11-26 Sun 12:30]
**** DONE Plot: Ashkenazim trio :hg38:
CLOSED: [2023-07-30 Sun 16:49] SCHEDULED: <2023-07-30 Sun 15:00>
:LOGBOOK:
CLOCK: [2023-07-30 Sun 16:06]--[2023-07-30 Sun 16:35] =>  0:29
CLOCK: [2023-07-30 Sun 15:39]--[2023-07-30 Sun 15:40] =>  0:01
:END:
/Entered on/ [2023-04-16 Sun 17:29]
Redo the results
**** DONE Email Paul about the Ashkenazim results +/- Centogene
CLOSED: [2023-08-06 Sun 20:24] SCHEDULED: <2023-08-06 Sun>
**** DONE Rerun the GIAB comparison with GATK 4.4.0
CLOSED: [2023-08-12 Sat 15:55]
/Entered on/ [2023-08-03 Thu 12:42]
**** TODO Re-download cleanly in dedicated pipelines
Source:
https://trace.ncbi.nlm.nih.gov/Traces/index.html?view=study&acc=SRP047086
https://zenodo.org/records/3597727
***** TODO HG001 :hg001:
SCHEDULED: <2023-11-26 Sun>
***** TODO HG002 :hg002:
SCHEDULED: <2023-11-30 Thu>
***** TODO HG003 :hg003:
SCHEDULED: <2023-11-30 Thu>
***** TODO HG004 :hg004:
SCHEDULED: <2023-11-30 Thu>
**** TODO Redo the analyses to get better results
SCHEDULED: <2023-12-03 Sun>
We want to reproduce the results of https://medium.com/dnanexus/benchmarking-state-of-the-art-secondary-variant-calling-pipelines-5472ca6bace7
***** TODO hap.py with conda
SCHEDULED: <2023-12-03 Sun>
***** TODO rtg vcfeval
SCHEDULED: <2023-12-03 Sun>
***** TODO Rerun
SCHEDULED: <2023-12-03 Sun>
*** TODO Platinum genome :platinum:
https://emea.illumina.com/platinumgenomes.html
**** TODO Test on the region covered by the Centogene exome
SCHEDULED: <2023-12-02 Sat>
*** DONE Sequence NA12878 :cento:hg001:
CLOSED: [2023-10-07 Sat 17:59]
Discussion with Paul: the subcontractor will not give us the data, so we have to order the DNA
**** DONE DNA ordered
CLOSED: [2023-06-30 Fri 22:29]
**** DONE Back up the raw data
CLOSED: [2023-07-30 Sun 14:22] SCHEDULED: <2023-07-19 Wed>
K, scality, S
**** KILL Get the capture file
CLOSED: [2023-07-30 Sun 14:25] SCHEDULED: <2023-07-23 Sun>
Candidates given in the publication https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8354858/
#+begin_quote
In short, the Nextera Rapid Capture Exome Kit (Illumina, San Diego, CA), the SureSelect Human All Exon kit (Agilent, Santa Clara, CA) or the Twist Human Core Exome was used for enrichment, and a Nextseq500, HiSeq4000, or Novoseq 6000 (Illumina) instrument was used for the actual sequencing, with the average coverage targeted to at least 100× or at least 98% of the target DNA covered 20×.
#+end_quote
By default, we will use https://www.twistbioscience.com/products/ngs/alliance-panels#tab-3
Recent announcement of a new Twist panel: https://www.centogene.com/news-events/news/newsdetails/twist-bioscience-and-centogene-launch-three-panels-to-advance-rare-disease-and-hereditary-cancer-research-and-support-diagnostics
But no BED file
***** DONE Email Centogene
CLOSED: [2023-07-30 Sun 14:22] DEADLINE: <2023-07-23 Sun>
**** DONE Test Nextera Rapid Capture Exome v1.2 (hg19) :giab:
CLOSED: [2023-08-06 Sun 19:05] SCHEDULED: <2023-08-03 Thu 19:00>
https://support.illumina.com/downloads/nextera-rapid-capture-exome-v1-2-product-files.html
***** DONE Liftover capture
CLOSED: [2023-08-06 Sun 18:30] SCHEDULED: <2023-08-06 Sun>
#+begin_src sh
 nextflow run -profile standard,helios workflows/lift-nextera-capture.nf  -lib lib
#+end_src
Quick check: ok
***** DONE Run
CLOSED: [2023-08-06 Sun 19:05] SCHEDULED: <2023-08-06 Sun>
#+begin_src sh
 nextflow run workflows/compareVCF.nf -profile standard,helios --query=out/2300346867_NA12878-63118093_S260-GRCh38/callVariant/haplotypecaller/2300346867_NA12878-63118093_S260-GRCh38.vcf.gz --outdir=out/2300346867_NA12878-63118093_S260-GRCh38/happy-nextera-lifted/ --compare=happy -lib lib --capture=capture/nexterarapidcapture_exome_targetedregions_v1.2-nochrM_lifted.bed  --id=HG001 --genome=GRCh38
#+end_src
**** DONE Test Agilent SureSelect All Exon V8 (hg38) :giab:
CLOSED: [2023-07-31 Mon 23:09] SCHEDULED: <2023-07-31 Mon>
https://earray.chem.agilent.com/suredesign/index.htm
"Find design"
"Agilent catalog"
Files:
- Regions.bed: Targeted exon intervals, curated and targeted by Agilent Technologies
- MergedProbes.bed: Merged probes for targeted enrichment of exons described in Regions.bed
- Covered.bed: Merged probes and sequences with 95% homology or above
- Padded.bed: Merged probes and sequences with 95% homology or above extended 50 bp at each side
- AllTracks.bed: Targeted regions and covered tracks
 #+begin_src sh
nextflow run workflows/compareVCF.nf -profile standard,helios --query=out/2300346867_63118093_NA12878-GRCh38/callVariant/haplotypecaller/2300346867_63118093_NA12878-GRCh38.vcf.gz --outdir=out/2300346867_63118093_NA12878-GRCh38/happy/ --compare=happy -lib lib --capture=capture/Agilent_SureSelect_All_Exons_v8_hg38_Regions.bed  --id=HG001 --genome=GRCh38
 #+end_src
| Type  | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL    |         423 |      395 |       28 |         915 |      108 |       405 |     4 |    13 |      0.933806 |         0.788235 |       0.442623 |        0.854868 |                        |                        |        1.7012987012987013 |        2.7916666666666665 |
| INDEL | PASS   |         423 |      395 |       28 |         915 |      108 |       405 |     4 |    13 |      0.933806 |         0.788235 |       0.442623 |        0.854868 |                        |                        |        1.7012987012987013 |        2.7916666666666665 |
| SNP   | ALL    |       20984 |    20600 |      384 |       26080 |      780 |      4703 |    62 |    10 |        0.9817 |         0.963512 |        0.18033 |        0.972521 |     3.0499710592321048 |     2.7596541786743516 |          1.58256372367935 |        1.8978207694018234 |
| SNP   | PASS   |       20984 |    20600 |      384 |       26080 |      780 |      4703 |    62 |    10 |        0.9817 |         0.963512 |        0.18033 |        0.972521 |     3.0499710592321048 |     2.7596541786743516 |          1.58256372367935 |        1.8978207694018234 |
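The metric columns can be recomputed from the count columns, which gives a quick sanity check when comparing capture kits. A minimal sketch, assuming precision is computed over the assessed query calls, i.e. QUERY.TP = QUERY.TOTAL - QUERY.UNK - QUERY.FP (an assumption that reproduces the SNP row above); the helper name =happy_metrics= is hypothetical.

#+begin_src sh
# happy_metrics TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK
# Print recall, precision and F1 recomputed from the raw counts.
happy_metrics() {
    awk -v tp="$1" -v fn="$2" -v qtotal="$3" -v qfp="$4" -v qunk="$5" 'BEGIN {
        recall = tp / (tp + fn)
        qtp = qtotal - qunk - qfp   # assumed: assessed query calls that are not FP
        precision = qtp / (qtp + qfp)
        f1 = 2 * precision * recall / (precision + recall)
        printf "recall=%.6f precision=%.6f f1=%.6f\n", recall, precision, f1
    }'
}

# SNP row of the Agilent V8 table
happy_metrics 20600 384 26080 780 4703
#+end_src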
**** DONE Test Twist Human Core Exome (hg38) :giab:
CLOSED: [2023-08-01 Tue 23:16] SCHEDULED: <2023-08-02 Wed>
https://www.twistbioscience.com/resources/data-files/ngs-human-core-exome-panel-bed-file
#+begin_src
nextflow run workflows/compareVCF.nf -profile standard,helios --query=out/2300346867_63118093_NA12878-GRCh38/callVariant/haplotypecaller/2300346867_63118093_NA12878-GRCh38.vcf.gz --outdir=out/2300346867_63118093_NA12878-GRCh38/happy-twist-exome-core/ --compare=happy -lib lib --capture=capture/Twist_Exome_Core_Covered_Targets_hg38.bed  --id=HG001 --genome=GRCh38 -bg
#+end_src
| Type  | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL    |         328 |      313 |       15 |         722 |       95 |       309 |     4 |    13 |      0.954268 |         0.769976 |       0.427978 |        0.852273 |                        |                        |        1.8584070796460177 |        2.8967391304347827 |
| INDEL | PASS   |         328 |      313 |       15 |         722 |       95 |       309 |     4 |    13 |      0.954268 |         0.769976 |       0.427978 |        0.852273 |                        |                        |        1.8584070796460177 |        2.8967391304347827 |
| SNP   | ALL    |       19198 |    18962 |      236 |       23381 |      684 |      3738 |    48 |    10 |      0.987707 |         0.965178 |       0.159873 |        0.976313 |     3.1034188034188035 |      2.859264147830391 |        1.5669565217391304 |        1.8578767123287672 |
| SNP   | PASS   |       19198 |    18962 |      236 |       23381 |      684 |      3738 |    48 |    10 |      0.987707 |         0.965178 |       0.159873 |        0.976313 |     3.1034188034188035 |      2.859264147830391 |        1.5669565217391304 |        1.8578767123287672 |
**** DONE Test Twist Human Core Exome (hg38) :giab:
CLOSED: [2023-08-05 Sat 09:25] SCHEDULED: <2023-08-03 Thu 20:00>
#+begin_src sh
ID="2300346867_NA12878-63118093_S260-GRCh38"; nextflow run workflows/compareVCF.nf -profile standard,helios --query=out/${ID}/callVariant/haplotypecaller/${ID}.vcf.gz --outdir=out/${ID}/happy-twist-exome-core/ --compare=happy -lib lib --capture=capture/Twist_Exome_Core_Covered_Targets_hg38.bed  --id=HG001 --genome=GRCh38 -bg
#+end_src
**** DONE Test Agilent SureSelect All Exon V8 (hg38) GATK-4.4 :giab:
CLOSED: [2023-08-05 Sat 09:25] SCHEDULED: <2023-08-03 Thu 20:00>
**** DONE Check the impact of GATK 4.3 vs 4.4: none
CLOSED: [2023-08-05 Sat 09:25]
**** DONE Figure comparing the 3 capture kits :hg001:
CLOSED: [2023-08-06 Sun 20:24] SCHEDULED: <2023-08-06 Sun>
**** DONE Email Paul about the 3 capture kits :hg001:
CLOSED: [2023-08-06 Sun 20:24] SCHEDULED: <2023-08-06 Sun>
**** KILL Test whether the Twist Alliance VCGS Exome panel is sufficient
CLOSED: [2023-07-31 Mon 22:31] SCHEDULED: <2023-07-30 Sun>
**** DONE Email Centogene to ask for the capture type
CLOSED: [2023-10-07 Sat 17:59]
/Entered on/ [2023-08-07 Mon 20:40]
Twist exome
*** PROJ Compare happy and happy-vcfeval :giab:
** TODO Syndip data (CHM-eval)! :syndip:
https://github.com/lh3/CHM-eval
*** KILL Official data: no, because it is whole-genome!
CLOSED: [2023-11-19 Sun 23:43]
**** KILL Run ERR1341793
CLOSED: [2023-11-19 Sun 23:43] SCHEDULED: <2023-11-18 Sat>
(raw reads ERR1341793_1.fastq.gz and ERR1341793_2.fastq.gz downloaded from https://www.ebi.ac.uk/ena/browser/view/ERR1341793)
**** KILL Run ERR1341796
CLOSED: [2023-11-19 Sun 23:43] SCHEDULED: <2023-11-18 Sat>
*** TODO Broad Institute exome data (nextflow)
SCHEDULED: <2023-11-25 Sat 21:00>
https://console.cloud.google.com/storage/browser/broad-public-datasets/CHM1_CHM13_WES;tab=objects?pli=1&prefix=&forceOnObjectsSortingFiltering=false
*** TODO Download VCF
SCHEDULED: <2023-12-01 Fri>
https://github.com/lh3/CHM-eval/releases
** TODO Insilico :cento:
*** TODO All Centogene variants
**** DONE Extract the list of SNVs
CLOSED: [2023-04-22 Sat 17:32] SCHEDULED: <2023-04-17 Mon>
***** DONE Fix missing entries by hand
CLOSED: [2023-04-22 Sat 17:31]
The output is saved in git-annex: variants_success.csv
***** DONE Automatic
CLOSED: [2023-04-22 Sat 17:31]
**** DONE Convert SNVs: transcript -> genomic
CLOSED: [2023-06-03 Sat 17:16]
***** DONE Variant_recoder
CLOSED: [2023-04-26 Wed 21:21] SCHEDULED: <2023-04-22 Sat>
****** KILL Haskell: 160 missing: recoded-success.csv
CLOSED: [2023-04-25 Tue 18:32]
The variant list was generated in Haskell and cleaned up by hand.
We generate a list of variants for variant_recoder and submit everything at once.
[[file:~/recherche/bisonex/parsevariants/app/Main.hs][parsevariant]]
#+begin_src haskell
recodeVariant = do
  prepareVariantRecoder "variant_success.csv" "renamed.csv"
  runVariantRecoder "renamed.csv" "recoded.json"
#+end_src
#+RESULTS:
: <interactive>:4:3-19: error:
:     Variable not in scope: runVariantRecoder :: String -> String -> t
: gh
Problem: 160 out of 820 could not be read, probably because of the transcript minor version number
The output is saved in git-annex: variants-recoded-raw.json.
****** KILL Julia
CLOSED: [2023-04-25 Tue 18:32]
We regenerate the variant list and switch to Julia to prepare the parallel calls to variant recoder
[[file:~/recherche/bisonex/parsevariants/variantRecoder.jl][variantRecoder.jl]]
#+begin_src julia
setupVariantRecoder(unique(init), n)
#+end_src
Then
#+begin_src sh
parallel -a parallel-recoder.sh --jobs 10
#+end_src
We collect the results
#+begin_src julia
(fails, success) = mergeVariantRecoder(n)
CSV.write(fSuccess, success)
CSV.write(fFailures, fails)
#+end_src
Some variants are not found, so we prepare a new job after stripping the minor versions from the transcripts
#+begin_src julia
# Cleanup json and txt
if isfile(fSuccess) && isfile(fFailures)
    foreach(rm, variantRecoderInput())
    foreach(rm, variantRecoderOutput())
end
redoFails(fFailures)
#+end_src
Then
#+begin_src sh
parallel -a parallel-recoder.sh --jobs 3
#+end_src
70 transcripts are still missing
***** DONE Julia with mobidetails: recoded-failures-mobidetails.csv
CLOSED: [2023-04-25 Tue 18:58]
New strategy: we try variant recoder once.
For every failure, we use mobidetails (~170).
If the ID is not found, we increment the version number twice
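One possible reading of "increment the version number twice" is to try the next two versions of the accession in turn. A minimal sketch of generating those candidates; the helper name =next_versions= and the accession =NM_000123.3= are hypothetical and used for illustration only.

#+begin_src sh
# next_versions ACCESSION.VERSION: print the accession with the version
# number incremented once and twice, one candidate per line.
next_versions() {
    base=${1%.*}       # accession without the version suffix
    version=${1##*.}   # numeric version after the last dot
    for step in 1 2; do
        printf '%s.%d\n' "$base" $((version + step))
    done
}

next_versions NM_000123.3
#+end_src

Each candidate would then be submitted to mobidetails until one is found.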
***** DONE About ten left to fix by hand
CLOSED: [2023-04-26 Wed 21:21]
- [X] some transcripts were simply removed
- [X] Parsing error, a "-" is often missing
#+begin_src julia
lastTryMobidetails("recoded-failures-mobidetails.csv")
#+end_src
***** DONE Merge data
CLOSED: [2023-04-26 Wed 22:35]
#+begin_src julia
function mergeAllGenomic()
    dNew = mergeAll("recoded-success.csv",
                    "recoded-failures-mobidetails.csv",
                    "recoded-failures-mobidetails-redo.csv")
    dInit = @chain DataFrame(CSV.File("variant_success.csv")) begin
        @transform :transcript = :transcript .* ":" .* :coding .* :codingPos .* :codingChange
        @select :file :transcript :classification :zygosity
        @rename :classificationCento = :classification
    end
    dTmp = outerjoin(dInit, dNew, on = :transcript)
    CSV.write("variant_genomic.csv", dTmp)
end
fSuccess = "recoded-success.csv"
fFailures = "recoded-failures.csv"
# variantRecoder(fSuccess, fFailures)
# mobidetailsOnFailures(fFailures)
# lastTryMobidetails("recoded-failures-mobidetails.csv")
mergeAllGenomic()
#+end_src
***** DONE Format the data for simuscop
CLOSED: [2023-04-28 Fri 11:55] SCHEDULED: <2023-04-26 Wed>
**** KILL Extract the list of CNVs
CLOSED: [2023-08-12 Sat 15:54]
SCHEDULED: <2023-04-17 Mon>
**** KILL Simuscop :simuscop:
CLOSED: [2023-08-12 Sat 15:54]
***** DONE Train the model on 63003856/
CLOSED: [2023-04-29 Sat 19:56]
Rerun the model to be sure
***** DONE Generate fastq with simuscop (del and ins only) 20x
CLOSED: [2023-04-28 Fri 23:35] SCHEDULED: <2023-04-22 Sat>
****** DONE Generate a profile with the Centogene bed
CLOSED: [2023-04-28 Fri 11:54] SCHEDULED: <2023-04-22 Sat>
NA12878, but to be redone with real sequencing data
See [[*Centogène][Centogene bed]] for the choice
****** DONE Generate the data at 20x
CLOSED: [2023-04-28 Fri 11:54] SCHEDULED: <2023-04-22 Sat>
Centogene capture
****** DONE Regenerate after removing duplicates
CLOSED: [2023-04-28 Fri 17:28]
***** DONE Which coverage?
CLOSED: [2023-04-29 Sat 18:26]
e.g. at chr11:16,014,966, where the simulation has 11 reads versus 200!
****** 200x is the closest
#+attr_html: :width 500px
[[./simuscop-200-chr1-1.png]]
#+attr_html: :width 500px
[[./simuscop-200-chr1-2.png]]
****** DONE 20x
CLOSED: [2023-04-29 Sat 15:38]
****** DONE 50x
CLOSED: [2023-04-29 Sat 15:38]
****** DONE 100x
CLOSED: [2023-04-29 Sat 15:39]
****** DONE 200x
CLOSED: [2023-04-29 Sat 15:39]
***** DONE Reads poorly centered on small isolated exons
CLOSED: [2023-04-29 Sat 19:56] SCHEDULED: <2023-04-29 Sat>
Capture ok: [[https://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr1%3A153817168%2D153817824&hgsid=296556270_F4fkENLPXHXidi2oALXls2jxNH9l][UCSC]] (black track)
But the distribution is wrong
#+attr_html: :width 800px
[[./simuscop-error.png]]
To test
- Profile problem?
  - wrong patient?
  - bad generation? -> compare with the ones provided on GitHub
- chromosome names?
****** DONE [#A] Test on GATAD2B exon 6 for NC_000001.11:g.153817496A>T
CLOSED: [2023-04-29 Sat 19:56] SCHEDULED: <2023-04-29 Sat>
******* DONE Configuration + Profile 63003856.profile: same, poorly centered
CLOSED: [2023-04-29 Sat 19:18]
Download the data
#+begin_src sh :dir ~/code/bisonex/test-simuscop
scp meso:/Work/Projects/bisonex/data/genome/GRCh38.p14/genomeRef.fna .
scp meso:/Work/Projects/bisonex/data/simuscop/*.profile .
scp -r meso:/Work/Projects/bisonex/data/genome/GRCh38.p13/bwa .
#+end_src
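Retrieve the exon (NB: org-mode does not run the code...)
#+begin_src julia
using CSV, DataFramesMeta
# Target bed: no header, tab-separated (CSV.jl names the columns Column1, Column2, ...)
d = CSV.read("VCGS_Exome_Covered_Targets_hg38_40.1MB_renamed.bed", DataFrame; header=false, delim='\t')
# Keep the chr1 target interval that contains position 153817496
@subset(d, :Column1 .== "NC_000001.11", :Column2 .<= 153817496, :Column3 .>= 153817496)
#+end_src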
NC_000001.11  153817371  153817542
Generate the bed
#+begin_src sh :dir ~/code/bisonex/test-simuscop
echo -e "NC_000001.11\t153817371\t153817542" > gatad2b-exon6.bed
#+end_src
#+RESULTS:
Generate a variant
#+begin_src sh :dir ~/code/bisonex/test-simuscop
echo -e "s\tsingle\tNC_000001.11\t153817496\tA\tT\thet"> variant.txt
#+end_src
#+RESULTS:
Generate the config file
#+begin_src sh :dir ~/code/bisonex/test-simuscop
cat > config_wes.txt << EOL
ref = genomeRef.fna
profile = ./63003856.profile
variation = ./variant.txt
target = ./gatad2b-exon6.bed
layout = PE
threads = 1
name = single
output = test-gatad2b
coverage = 20
EOL
#+end_src
#+RESULTS:
Start the simulation
#+begin_src sh :dir ~/code/bisonex/test-si

Replacement in projects/bisonex.org at line 76 [4.35]

B:BD[15.42604] → [15.42604:43078]

B:BD[15.43078] → [2.17274:24992]

l'option --pick
Yet it is not in the 5' region in the RefSeq transcripts...
https://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chrX%3A47575242%2D47575268&hgsid=301211823_xpelPqPJije7wSIhg070JeGH5ZwV
https://mobidetails.iurc.montp.inserm.fr/MD/api/variant/238296/browser/
Same for the other one
   chr17:g.7852503T>C
https://mobidetails.iurc.montp.inserm.fr/MD/api/variant/182993/browser/
Note:
VEP chooses one block of annotation per variant, using an ordered set of criteria. This order may be customised using --pick_order.
    MANE Select transcript status
    MANE Plus Clinical transcript status
    canonical status of transcript
    APPRIS isoform annotation
    transcript support level
    biotype of transcript ("protein_coding" preferred)
    CCDS status of transcript
    consequence rank according to this table
    translated, transcript or feature length (longer preferred)
"Wherever possible we would discourage you from summarising data in this way. "
**** DONE Mail alexis
CLOSED: [2023-08-20 Sun 13:45] SCHEDULED: <2023-08-20 Sun>
**** TODO Simuscop data at 200x
SCHEDULED: <2023-12-02 Sat>
**** DONE In T2T with liftover (filter = spip): ok but slow and too many variants :tests:
CLOSED: [2023-09-17 Sun 17:13] SCHEDULED: <2023-09-17 Sun>
1. Conversion to bed
#+begin_src sh :dir ~/code/sanger
open snvs-cento-sanger.csv | select chrom pos | insert pos2 {$in.pos } | to csv --separator="\t" | save snvs-cento-sanger.bed -f
#+end_src
2. Liftover with UCSC (online)
NB: checked on the first result by looking for the read containing the variant (samtools view -r, then samtools view | grep in T2T); with the help of IGV, there is a matching variant at
chr1:10757746
3. Assuming the order of the variants has not changed, simply add REF and ALT with annotateLifted.jl
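As a rough illustration of step 3, the sketch below pairs each lifted position with the REF and ALT found on the same line of the original variant list. It is a shell stand-in, not annotateLifted.jl itself: the helper name annotate_lifted, the file names and the column layout (chrom, pos, ref, alt) are assumptions made for the example.
#+begin_src sh
# Hypothetical helper: $1 = lifted bed (chrom, start, end),
#                      $2 = original variants (chrom, pos, ref, alt).
# Both files are assumed tab-separated and in the same line order.
annotate_lifted() {
    # Pairing by line number is only valid if liftover dropped no variant
    if [ $(wc -l < "$1") -ne $(wc -l < "$2") ]; then
        echo "line counts differ, cannot pair by order" >&2
        return 1
    fi
    # Append REF and ALT (columns 3-4 of the original) to each lifted line,
    # then keep chrom, lifted position, REF, ALT
    cut -f3,4 "$2" | paste "$1" - | cut -f1,2,4,5
}

# Toy data, made up for the example
tmp=$(mktemp -d)
printf 'chr1\t100\tA\tT\nchr2\t200\tG\tC\n' > "$tmp/original.tsv"
printf 'chr1\t150\t150\nchr2\t260\t260\n' > "$tmp/lifted.bed"
annotate_lifted "$tmp/lifted.bed" "$tmp/original.tsv"
#+end_src
The line-count check is a cheap guard: if liftover drops a variant, pairing by order would silently shift REF and ALT onto the wrong positions.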
spip annotation is *very slow*: 1h13!
Result:
2×3 DataFrame
 Row │ variant              meanQual  depth
     │ String               Float64   Int64
─────┼──────────────────────────────────────
   1 │ chr12:g.13594572      60.0      1
   2 │ chr17:g.10204026      60.0      1
144 found over 146
filter depth : another 0 missed variants
filter poly : another 0 missed variants
filter vep   : another 0 missed variants
And there are too many variants in the output (7330!)
**** DONE Mail Paul with the T2T filter results + new diagram
CLOSED: [2023-09-17 Sun 23:15] SCHEDULED: <2023-09-17 Sun>
** TODO Medically relevant genes
SCHEDULED: <2023-11-30 Thu>
/Entered on/ [2023-10-18 Wed 22:37]
** TODO HG002 in T2T
/Entered on/ [2023-11-25 Sat 17:58]
https://github.com/marbl/HG002
*** Test the preliminary benchmarks
https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_HG002_DraftBenchmark_defrabbV0.011-20230725/
* Reinterpretation :reanalysis:
** DONE Run tests on raw data [225/250] <(samples.csv)>  <(runs.waiting)>
CLOSED: [2023-10-14 Sat 11:58] SCHEDULED: <2023-10-08 Sun>
- [X] 100222_63015289
- [X] 1600304839_63051311
- [X] 1900007827_62913191
- [X] 1900398899_62999500
- [X] 1900486799_62913197
- [X] 2100422923_62952677
- [X] 2100458888_62933047
- [X] 2100601558_62903840
- [X] 2100609288_62905768
- [X] 2100609501_62905776
- [X] 2100614493_62951074
- [X] 2100622566_62908067
- [X] 2100622601_62908060
- [X] 2100622705_62908063
- [X] 2100640027_62911936
- [X] 2100645285_62913212
- [X] 2100661411_62914081
- [X] 2100661462_62914086
- [X] 2100708257_62921596
- [X] 2100738732_62926501
- [X] 2100738850_62926509
- [X] 2100746751_62926505
- [X] 2100746797_62926506
- [X] 2100782349_62931722
- [X] 2100782416_62931561
- [X] 2100782559_62931718
- [X] 2100799204_62934768
- [X] 2200010202_62940284
- [X] 2200023600_62940631
- [X] 2200024348_62999591
- [X] 2200027505_62942457
- [X] 2200038776_62943412
- [X] 2200041919_62943405
- [X] 2200088014_62951326
- [X] 2200146652_62959388
- [X] 2200151850_62960953
- [X] 2200160014_62959475
- [X] 2200160070_62959478
- [X] 2200201368_62967471
- [X] 2200201400_62967470
- [X] 2200265558_62976332
- [X] 2200265605_62976401
- [X] 2200267046_62975192
- [X] 2200273878_62999530
- [X] 2200279708_62977002
- [X] 2200284408_62979102
- [X] 2200293987_62979116
- [X] 2200294359_62979118
- [X] 2200306299_62982217
- [X] 2200306539_62982193
- [X] 220030671_62982211
- [X] 2200307058_62982231
- [X] 2200307108_62982196
- [X] 2200307136_62982221
- [X] 2200307199_62982239
- [X] 2200307230_62982234
- [X] 2200307262_62982219
- [X] 2200307297_62982227
- [X] 2200324510_62985453
- [X] 2200324549_62985478
- [X] 2200324573_62985445
- [X] 2200324594_62985467
- [X] 2200324606_62985463
- [X] 2200324614_62985459
- [X] 2200338306_62985430
- [X] 2200343880_62989407
- [X] 2200343910_62989460
- [X] 2200343938_62989451
- [X] 2200343966_62989456
- [X] 2200343993_62989440
- [X] 2200344013_62989464
- [X] 2200349749_62989465
- [X] 2200363462_62988848
- [X] 2200377880_62991993
- [X] 2200378032_62991991
- [X] 2200383996_62993828
- [X] 2200384015_62993796
- [X] 2200384046_62993822
- [X] 2200384117_62993808
- [X] 2200384187_62993825
- [X] 2200384231_62992898
- [X] 2200385658_63060260
- [X] 2200394260_62994732
- [X] 2200395817_62994742
- [X] 2200396731_62994737
- [X] 2200424073_62999579
- [X] 2200424207_62999632
- [X] 2200426178_62999630
- [X] 2200426243_62999635
- [X] 2200426466_62999605
- [X] 2200426642_62999627
- [X] 2200427406_62999649
- [X] 2200427512_62999639
- [X] 2200428953_62999572
- [X] 2200428981_62999600
- [X] 2200428999_62999592
- [X] 2200441970_63000868
- [X] 2200441989_63000882
- [X] 2200442135_63000864
- [X] 2200442216_63000886
- [X] 2200442257_63000951
- [X] 2200451801_63003573
- [X] 2200451862_63004218
- [X] 2200451894_63004210
- [X] 2200456165_63051294
- [X] 2200459865_63004933
- [X] 2200459968_63004937
- [X] 2200460073_63004943
- [X] 2200460121_63004684
- [X] 2200467051_63003856
- [X] 2200467225_63004940
- [X] 2200467261_63004930
- [X] 2200467338_63004925
- [X] 2200470099_63004485
- [X] 2200470142_63004480
- [X] 2200471780_63004362
- [X] 2200480910_63006466
- [X] 2200495073_63010427
- [X] 2200495510_63009152
- [X] 2200508677_63060252
- [X] 2200510531_63012582
- [X] 2200510628_63012549
- [X] 2200510657_63012554
- [X] 2200511249_63012533
- [X] 2200511274_63012586
- [X] 2200517952_63060399
- [X] 2200519525_63060439
- [X] 2200524009_63014044
- [X] 2200524609_63014046
- [X] 2200524616_63014048
- [X] 2200533429_63060425
- [X] 2200539735_63060406
- [X] 2200549908_63019339
- [X] 2200549965_63019349
- [X] 2200550414_63019357
- [X] 2200550471_63020031
- [X] 2200550490_63019351
- [X] 2200550505_63019340
- [X] 2200555565_63018614
- [X] 2200559438_63020029
- [X] 2200559682_63020030
- [X] 2200559713_63019623
- [X] 2200559739_63019626
- [X] 2200569969_63019991
- [X] 2200570001_63021580
- [X] 2200570025_63021490
- [X] 2200570035_63021491
- [X] 2200570042_63021493
- [X] 2200570050_63021494
- [X] 2200579897_63024910
- [X] 2200583995_63024866
- [X] 2200584035_63024905
- [X] 2200584069_63024888
- [X] 2200584126_63024810
- [X] 2200589507_63026712
- [X] 2200597365_63027994
- [X] 2200597480_63027988
- [X] 2200597752_63026853
- [X] 2200597778_63027992
- [X] 22005977_63026903
- [X] 2200609031_63026527
- [X] 2200614198_63113928
- [X] 2200620372_63030821
- [X] 2200620442_63030810
- [X] 2200620498_63030816
- [X] 2200620628_63031031
- [X] 2200622310_63030984
- [X] 2200622355_63030956
- [X] 2200625369_63028699
- [X] 2200625410_63028697
- [X] 2200625536_63028694
- [X] 2200630189_63030665
- [X] 2200635149_63033182
- [X] 2200644544_63037731
- [X] 2200644594_63037725
- [X] 2200650089_63038093
- [X] 2200666292_63076568
- [X] 2200669188_63036688
- [X] 2200669320_63040259
- [X] 2200669383_63040254
- [X] 2200669414_63040257
- [X] 2200669446_63040251
- [X] 2200680342_63105271
- [X] 2200694535_63042853
- [X] 2200694789_63042862

[15.42604]

[2.24992]

l'option --pick
Yet it is not in the 5' region in the RefSeq transcripts...
https://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chrX%3A47575242%2D47575268&hgsid=301211823_xpelPqPJije7wSIhg070JeGH5ZwV
https://mobidetails.iurc.montp.inserm.fr/MD/api/variant/238296/browser/
Same for the other one
   chr17:g.7852503T>C
https://mobidetails.iurc.montp.inserm.fr/MD/api/variant/182993/browser/
Note:
VEP chooses one block of annotation per variant, using an ordered set of criteria. This order may be customised using --pick_order.
    MANE Select transcript status
    MANE Plus Clinical transcript status
    canonical status of transcript
    APPRIS isoform annotation
    transcript support level
    biotype of transcript ("protein_coding" preferred)
    CCDS status of transcript
    consequence rank according to this table
    translated, transcript or feature length (longer preferred)
"Wherever possible we would discourage you from summarising data in this way. "
**** DONE Mail alexis
CLOSED: [2023-08-20 Sun 13:45] SCHEDULED: <2023-08-20 Sun>
**** TODO Simuscop data at 200x
SCHEDULED: <2023-12-02 Sat>
**** DONE In T2T with liftover (filter = spip): ok but slow and too many variants :tests:
CLOSED: [2023-09-17 Sun 17:13] SCHEDULED: <2023-09-17 Sun>
1. Conversion to bed
#+begin_src sh :dir ~/code/sanger
open snvs-cento-sanger.csv | select chrom pos | insert pos2 {$in.pos } | to csv --separator="\t" | save snvs-cento-sanger.bed -f
#+end_src
2. Liftover with UCSC (online)
NB: checked on the first result by looking for the read containing the variant (samtools view -r, then samtools view | grep in T2T); with the help of IGV, there is a matching variant at
chr1:10757746
3. Assuming the order of the variants has not changed, simply add REF and ALT with annotateLifted.jl
spip annotation is *very slow*: 1h13!
Result:
2×3 DataFrame
 Row │ variant              meanQual  depth
     │ String               Float64   Int64
─────┼──────────────────────────────────────
   1 │ chr12:g.13594572      60.0      1
   2 │ chr17:g.10204026      60.0      1
144 found over 146
filter depth : another 0 missed variants
filter poly : another 0 missed variants
filter vep   : another 0 missed variants
And there are too many variants in the output (7330!)
**** DONE Mail Paul with the T2T filter results + new diagram
CLOSED: [2023-09-17 Sun 23:15] SCHEDULED: <2023-09-17 Sun>
** TODO Medically relevant genes
SCHEDULED: <2023-11-30 Thu>
/Entered on/ [2023-10-18 Wed 22:37]
** TODO HG002 in T2T
/Entered on/ [2023-11-25 Sat 17:58]
https://github.com/marbl/HG002
*** TODO Test the preliminary benchmarks
SCHEDULED: <2023-12-26 Tue>
https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_HG002_DraftBenchmark_defrabbV0.011-20230725/
* Reinterpretation :reanalysis:
** DONE Run tests on raw data [225/250] <(samples.csv)>  <(runs.waiting)>
CLOSED: [2023-10-14 Sat 11:58] SCHEDULED: <2023-10-08 Sun>
- [X] 100222_63015289
- [X] 1600304839_63051311
- [X] 1900007827_62913191
- [X] 1900398899_62999500
- [X] 1900486799_62913197
- [X] 2100422923_62952677
- [X] 2100458888_62933047
- [X] 2100601558_62903840
- [X] 2100609288_62905768
- [X] 2100609501_62905776
- [X] 2100614493_62951074
- [X] 2100622566_62908067
- [X] 2100622601_62908060
- [X] 2100622705_62908063
- [X] 2100640027_62911936
- [X] 2100645285_62913212
- [X] 2100661411_62914081
- [X] 2100661462_62914086
- [X] 2100708257_62921596
- [X] 2100738732_62926501
- [X] 2100738850_62926509
- [X] 2100746751_62926505
- [X] 2100746797_62926506
- [X] 2100782349_62931722
- [X] 2100782416_62931561
- [X] 2100782559_62931718
- [X] 2100799204_62934768
- [X] 2200010202_62940284
- [X] 2200023600_62940631
- [X] 2200024348_62999591
- [X] 2200027505_62942457
- [X] 2200038776_62943412
- [X] 2200041919_62943405
- [X] 2200088014_62951326
- [X] 2200146652_62959388
- [X] 2200151850_62960953
- [X] 2200160014_62959475
- [X] 2200160070_62959478
- [X] 2200201368_62967471
- [X] 2200201400_62967470
- [X] 2200265558_62976332
- [X] 2200265605_62976401
- [X] 2200267046_62975192
- [X] 2200273878_62999530
- [X] 2200279708_62977002
- [X] 2200284408_62979102
- [X] 2200293987_62979116
- [X] 2200294359_62979118
- [X] 2200306299_62982217
- [X] 2200306539_62982193
- [X] 220030671_62982211
- [X] 2200307058_62982231
- [X] 2200307108_62982196
- [X] 2200307136_62982221
- [X] 2200307199_62982239
- [X] 2200307230_62982234
- [X] 2200307262_62982219
- [X] 2200307297_62982227
- [X] 2200324510_62985453
- [X] 2200324549_62985478
- [X] 2200324573_62985445
- [X] 2200324594_62985467
- [X] 2200324606_62985463
- [X] 2200324614_62985459
- [X] 2200338306_62985430
- [X] 2200343880_62989407
- [X] 2200343910_62989460
- [X] 2200343938_62989451
- [X] 2200343966_62989456
- [X] 2200343993_62989440
- [X] 2200344013_62989464
- [X] 2200349749_62989465
- [X] 2200363462_62988848
- [X] 2200377880_62991993
- [X] 2200378032_62991991
- [X] 2200383996_62993828
- [X] 2200384015_62993796
- [X] 2200384046_62993822
- [X] 2200384117_62993808
- [X] 2200384187_62993825
- [X] 2200384231_62992898
- [X] 2200385658_63060260
- [X] 2200394260_62994732
- [X] 2200395817_62994742
- [X] 2200396731_62994737
- [X] 2200424073_62999579
- [X] 2200424207_62999632
- [X] 2200426178_62999630
- [X] 2200426243_62999635
- [X] 2200426466_62999605
- [X] 2200426642_62999627
- [X] 2200427406_62999649
- [X] 2200427512_62999639
- [X] 2200428953_62999572
- [X] 2200428981_62999600
- [X] 2200428999_62999592
- [X] 2200441970_63000868
- [X] 2200441989_63000882
- [X] 2200442135_63000864
- [X] 2200442216_63000886
- [X] 2200442257_63000951
- [X] 2200451801_63003573
- [X] 2200451862_63004218
- [X] 2200451894_63004210
- [X] 2200456165_63051294
- [X] 2200459865_63004933
- [X] 2200459968_63004937
- [X] 2200460073_63004943
- [X] 2200460121_63004684
- [X] 2200467051_63003856
- [X] 2200467225_63004940
- [X] 2200467261_63004930
- [X] 2200467338_63004925
- [X] 2200470099_63004485
- [X] 2200470142_63004480
- [X] 2200471780_63004362
- [X] 2200480910_63006466
- [X] 2200495073_63010427
- [X] 2200495510_63009152
- [X] 2200508677_63060252
- [X] 2200510531_63012582
- [X] 2200510628_63012549
- [X] 2200510657_63012554
- [X] 2200511249_63012533
- [X] 2200511274_63012586
- [X] 2200517952_63060399
- [X] 2200519525_63060439
- [X] 2200524009_63014044
- [X] 2200524609_63014046
- [X] 2200524616_63014048
- [X] 2200533429_63060425
- [X] 2200539735_63060406
- [X] 2200549908_63019339
- [X] 2200549965_63019349
- [X] 2200550414_63019357
- [X] 2200550471_63020031
- [X] 2200550490_63019351
- [X] 2200550505_63019340
- [X] 2200555565_63018614
- [X] 2200559438_63020029
- [X] 2200559682_63020030
- [X] 2200559713_63019623
- [X] 2200559739_63019626
- [X] 2200569969_63019991
- [X] 2200570001_63021580
- [X] 2200570025_63021490
- [X] 2200570035_63021491
- [X] 2200570042_63021493
- [X] 2200570050_63021494
- [X] 2200579897_63024910
- [X] 2200583995_63024866
- [X] 2200584035_63024905
- [X] 2200584069_63024888
- [X] 2200584126_63024810
- [X] 2200589507_63026712
- [X] 2200597365_63027994
- [X] 2200597480_63027988
- [X] 2200597752_63026853
- [X] 2200597778_63027992
- [X] 22005977_63026903
- [X] 2200609031_63026527
- [X] 2200614198_63113928
- [X] 2200620372_63030821
- [X] 2200620442_63030810
- [X] 2200620498_63030816
- [X] 2200620628_63031031
- [X] 2200622310_63030984
- [X] 2200622355_63030956
- [X] 2200625369_63028699
- [X] 2200625410_63028697
- [X] 2200625536_63028694
- [X] 2200630189_63030665
- [X] 2200635149_63033182
- [X] 2200644544_63037731
- [X] 2200644594_63037725
- [X] 2200650089_63038093
- [X] 2200666292_63076568
- [X] 2200669188_63036688
- [X] 2200669320_63040259
- [X] 2200669383_63040254
- [X] 2200669414_63040257
- [X] 2200669446_63040251
- [X] 2200680342_63105271
- [X] 2200694535_63042853
- [X] 2200694789_63042862

Replacement in projects/bisonex.org at line 78 [4.35]

B:BD[16.40467] → [16.40467:41210]

B:BD[16.41210] → [2.25736:30728]

3-10-26 Thu>
**** DONE Compare each variant with the pipeline output
CLOSED: [2023-10-31 Tue 00:18] SCHEDULED: <2023-10-21 Sat>
With the "test" function in Search.hs
1126 extracted
654 annotated
253 raw data
102 raw and annotated
236 raw and extracted
17 raw NOT extracted
890 extract WITHOUT raw
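The counts above can be cross-checked with two subtractions. This is only a sanity check on the figures listed in the notes; the variable names are made up for the example.
#+begin_src sh
# Figures copied from the notes; variable names are illustrative
raw=253
raw_and_extracted=236
extracted=1126
echo "raw NOT extracted: $((raw - raw_and_extracted))"
echo "extracted WITHOUT raw: $((extracted - raw_and_extracted))"
#+end_src
Both differences match the reported 17 and 890.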
#+begin_src sh
❯ open diff.txt | from csv | get id | into string | each {|e| "~/annex/data/centogene/reports/" ++ $e ++ "*.pdf"} | each {|e| firefox $e }
#+end_src
The 17 missing ones are
- 62913191 : CNV
- 62959388 : MT-ATP6
- 62999572 : MT-ATP6
- 62999627 : CNV
- 62999630 : CNV
- 63004218: CNV
- 63006466 : CNV
- 63009152 : missed during extraction -> actually present
- 63015289: CNV
- 63024910 : MT-ATP6
- 63040251 : CNV
- 63043050 : CNV
- 63118093 : NA12878
- NA12878 x4
*** DONE Compare cento variants with bisonex output: 50/121 confirmed by Sanger, 71/121 not tested, 0 confirmed variants missed by the pipeline, 5 missed but not confirmed
CLOSED: [2023-11-08 Wed 00:19] SCHEDULED: <2023-11-04 Sat>
*** Compare with Sanger: variant alone
Reconstructing the family tree is complicated. The information is there but takes work.
We assume the variant is only present in the family...
Results
❯ open sangerized.csv | where "Found by bisonex" == "found" | where "Confirmed in sanger" == "true" | length
50
❯ open sangerized.csv | where "Found by bisonex" == "found" | where "Confirmed in sanger" == "" | length
71
❯ open sangerized.csv | where "Found by bisonex" == "missed" | where "Confirmed in sanger" == "" | length
5
❯ open sangerized.csv | where "Found by bisonex" == "missed" | where "Confirmed in sanger" == "true" | length
0
[[id:cd79a77c-a0b6-4bb1-9e08-fe08dc89e3aa][Final results]]
*** DONE Look at the 5 missed variants: 3 explainable, 2 not
CLOSED: [2023-11-09 Thu 00:22] SCHEDULED: <2023-11-05 Sun>
open searched.csv |  where "Found by bisonex" == "missed"
62982193  7884996 : haplotypecaller ok... -> filtered because AD=5 <= 10
63012582  102230760 : not present in haplotypecaller, but there is a deletion at 755 (at 754, CG -> C). Checked in mobidetails
63019340  50721335 : not present in haplotypecaller (checked in IGV). Checked in mobidetails
63060439  26869324 : filtered because 15 reads
63109239  14358800 : present in haplotypecaller: filtered because DP=29 <= 30
Not found in haplotypecaller with bcftools, but zgrep works
zgrep 7884996 call_variant/haplotypecaller/*62982193*/*
zgrep 102230760 call_variant/haplotypecaller/*63012582*/*
zgrep 50721335 call_variant/haplotypecaller/*63019340*/*
zgrep 26869324 call_variant/haplotypecaller/*63060439*/*
zgrep 14358800 call_variant/haplotypecaller/*63109239*/*
*** DONE Flowchart
CLOSED: [2023-11-09 Thu 00:22]
*** DONE Redo extraction
CLOSED: [2023-11-04 Sat 19:02] SCHEDULED: <2023-11-04 Sat>
*** DONE Redo annotation with mobidetails
CLOSED: [2023-11-04 Sat 19:02] SCHEDULED: <2023-11-04 Sat>
*** DONE Redo annotation with unrecognized transcripts
CLOSED: [2023-11-04 Sat 20:42] SCHEDULED: <2023-11-04 Sat>
5 transcripts, also given by
#+begin_src nu
open annotated.csv | where coding != "negatif" | where chrom == ""
#+end_src
| 62676048 | NM_001080420.1 | SHANK3 | invalid reference |
| 62690893 | NM_001080420.1 | KDM6B  | same as above     |
| 62690893 | NM_001080420.1 | KDM6B  | same variant      |
| 62795429 | NM_016381.3    | TREX1  | NM_033629.5       |
| 63019340 | NM_001080420.1 | SHANK3 | NM_001372044.2    |
SCHEDULED: <2023-11-01 Wed>
*** DONE Add variant for 63009152
CLOSED: [2023-11-04 Sat 20:47] SCHEDULED: <2023-11-01 Wed>
*** DONE Regenerate annotation with NC_
CLOSED: [2023-11-04 Sat 18:59] SCHEDULED: <2023-10-31 Tue>
*** DONE Compare missed variants with Sanger: 0 confirmed
CLOSED: [2023-11-06 Mon 23:48] SCHEDULED: <2023-11-04 Sat>
*** DONE Annotate variants with Sanger
CLOSED: [2023-11-08 Wed 23:17] SCHEDULED: <2023-11-07 Tue>
*** DONE Mail Paul with results
CLOSED: [2023-11-09 Thu 00:22] SCHEDULED: <2023-11-05 Sun>
*** DONE Check coordinates of the 2 missing variants
CLOSED: [2023-11-12 Sun 16:53] SCHEDULED: <2023-11-11 Sat>
Both are homopolymers
- 1st = same variant but represented differently
- SHANK3?
**** PITX3: filtered because AD=8
NB: synonymous representation
Same sequence
  >hg38_dna range=chr10:102230742-102230777 5'pad=2 3'pad=2 strand=+ repeatMasking=none
GGAGCCAGCCCGGGGGGGCCCCCGCCCAGGCCCTG
>hg19_dna range=chr10:103990500-103990534 5'pad=0 3'pad=0 strand=+ repeatMasking=none
GGAGCCAGCCCGGGGGGGCCCCCGCCCAGGCCCTG
According to IGV:
GGAGCCAGCCC(G)GGGGGGCCCCCGCCCAGGCCCTG
According to cento:
GGAGCCAGCCCGGGGGG(G)CCCCCGCCCAGGCCCTG
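A minimal sketch of why the two representations are synonymous, assuming the parenthesised G marks the deleted base: removing the first G of the run (position 12 of the 35-base window, as drawn by IGV) or the last one (position 18, as reported by cento) leaves the same sequence. The helper del_base is hypothetical and only serves the illustration.
#+begin_src sh
ref_seq="GGAGCCAGCCCGGGGGGGCCCCCGCCCAGGCCCTG"
# Delete the base at 1-based position $2 of sequence $1
del_base() {
    printf '%s\n' "$1" | cut -c"1-$(( $2 - 1 )),$(( $2 + 1 ))-"
}
igv=$(del_base "$ref_seq" 12)    # first G of the homopolymer
cento=$(del_base "$ref_seq" 18)  # last G of the homopolymer
[ "$igv" = "$cento" ] && echo "same sequence after deletion"
#+end_src
Variant callers usually left-align such deletions, which would explain why the reported coordinate differs from the one in the report.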
#+begin_src sh :dir ~/annex/data/bisonex/
bcftools filter -i 'POS=102230760' call_variant/haplotypecaller/*63012582*/*.vcf.gz
#+end_src
DP ok but AD too low
 GT:AD:DP:GQ:PL  0/1:26,8:34:99:146,0,671
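The depth check can be sketched on this exact sample string. The thresholds (AD <= 10 and DP <= 30 are filtered out) are taken from the notes on the missed variants; the helper check_depth, the choice of the second AD value as the alternate-allele depth and the order of the tests are assumptions, not the pipeline's actual filter.
#+begin_src sh
# Hypothetical helper: $1 = FORMAT string, $2 = sample string from the VCF
check_depth() {
    awk -v fmt="$1" -v smp="$2" 'BEGIN {
        n = split(fmt, keys, ":"); split(smp, vals, ":")
        for (i = 1; i <= n; i++) field[keys[i]] = vals[i]
        split(field["AD"], ad, ",")   # ad[1] = ref depth, ad[2] = alt depth
        if (ad[2] + 0 <= 10)            print "filtered: AD=" ad[2]
        else if (field["DP"] + 0 <= 30) print "filtered: DP=" field["DP"]
        else                            print "kept"
    }'
}
check_depth "GT:AD:DP:GQ:PL" "0/1:26,8:34:99:146,0,671"
#+end_src
For the PITX3 call this prints "filtered: AD=8": DP=34 clears the 30-read threshold but the 8 alternate reads do not clear 10.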
**** SHANK3: transcript removed since then: ok
Found by Eric: 50721504dup
Check
#+begin_src sh :dir ~/annex/data/bisonex/
bcftools filter -i 'POS=50721504' call_variant/haplotypecaller/*63019340*/*.vcf.gz
#+end_src
#+begin_src sh :dir ~/annex/data/bisonex/
zgrep '50721504' annotate/full/*63019340*.tsv
#+end_src
*** TODO Sanger for 4 missed VOUS
SCHEDULED: <2023-12-13 Wed>
/Entered on/ [2023-11-13 Mon 22:40]
* Results
** TODO Speed-up BWA-mem
SCHEDULED: <2023-12-02 Sat>
** TODO Speed-up HaplotypeCaller
SCHEDULED: <2023-12-02 Sat>
** TODO Redo statistics with happy + vcfeval
SCHEDULED: <2023-11-30 Thu>
/Entered on/ [2023-11-18 Sat 20:13]
* Communication
** DONE Mail NGS-diag
CLOSED: [2023-10-06 Fri 08:04] SCHEDULED: <2023-10-06 Fri>
/Entered on/ [2023-10-04 Wed 19:33]

[16.40467]

3-10-26 Thu>
**** DONE Compare each variant with the pipeline output
CLOSED: [2023-10-31 Tue 00:18] SCHEDULED: <2023-10-21 Sat>
With the "test" function in Search.hs
1126 extracted
654 annotated
253 raw data
102 raw and annotated
236 raw and extracted
17 raw NOT extracted
890 extract WITHOUT raw
#+begin_src sh
❯ open diff.txt | from csv | get id | into string | each {|e| "~/annex/data/centogene/reports/" ++ $e ++ "*.pdf"} | each {|e| firefox $e }
#+end_src
The 17 missing ones are
- 62913191 : CNV
- 62959388 : MT-ATP6
- 62999572 : MT-ATP6
- 62999627 : CNV
- 62999630 : CNV
- 63004218: CNV
- 63006466 : CNV
- 63009152 : missed during extraction -> actually present
- 63015289: CNV
- 63024910 : MT-ATP6
- 63040251 : CNV
- 63043050 : CNV
- 63118093 : NA12878
- NA12878 x4
*** DONE Compare cento variants with bisonex output: 50/121 confirmed by Sanger, 71/121 not tested, 0 confirmed variants missed by the pipeline, 5 missed but not confirmed
CLOSED: [2023-11-08 Wed 00:19] SCHEDULED: <2023-11-04 Sat>
*** Compare with Sanger: variant alone
Reconstructing the family tree is complicated. The information is there but takes work.
We assume the variant is only present in the family...
Results
❯ open sangerized.csv | where "Found by bisonex" == "found" | where "Confirmed in sanger" == "true" | length
50
❯ open sangerized.csv | where "Found by bisonex" == "found" | where "Confirmed in sanger" == "" | length
71
❯ open sangerized.csv | where "Found by bisonex" == "missed" | where "Confirmed in sanger" == "" | length
5
❯ open sangerized.csv | where "Found by bisonex" == "missed" | where "Confirmed in sanger" == "true" | length
0
[[id:cd79a77c-a0b6-4bb1-9e08-fe08dc89e3aa][Final results]]
*** DONE Look at the 5 missed variants: 3 explainable, 2 not
CLOSED: [2023-11-09 Thu 00:22] SCHEDULED: <2023-11-05 Sun>
open searched.csv |  where "Found by bisonex" == "missed"
62982193  7884996 : haplotypecaller ok... -> filtered because AD=5 <= 10
63012582  102230760 : not present in haplotypecaller, but there is a deletion at 755 (at 754, CG -> C). Checked in mobidetails
63019340  50721335 : not present in haplotypecaller (checked in IGV). Checked in mobidetails
63060439  26869324 : filtered because 15 reads
63109239  14358800 : present in haplotypecaller: filtered because DP=29 <= 30
Not found in haplotypecaller with bcftools, but zgrep works
zgrep 7884996 call_variant/haplotypecaller/*62982193*/*
zgrep 102230760 call_variant/haplotypecaller/*63012582*/*
zgrep 50721335 call_variant/haplotypecaller/*63019340*/*
zgrep 26869324 call_variant/haplotypecaller/*63060439*/*
zgrep 14358800 call_variant/haplotypecaller/*63109239*/*
*** DONE Flowchart
CLOSED: [2023-11-09 Thu 00:22]
*** DONE Redo extraction
CLOSED: [2023-11-04 Sat 19:02] SCHEDULED: <2023-11-04 Sat>
*** DONE Redo annotation with mobidetails
CLOSED: [2023-11-04 Sat 19:02] SCHEDULED: <2023-11-04 Sat>
*** DONE Redo annotation with unrecognized transcripts
CLOSED: [2023-11-04 Sat 20:42] SCHEDULED: <2023-11-04 Sat>
5 transcripts, also given by
#+begin_src nu
open annotated.csv | where coding != "negatif" | where chrom == ""
#+end_src
| 62676048 | NM_001080420.1 | SHANK3 | invalid reference |
| 62690893 | NM_001080420.1 | KDM6B  | same as above     |
| 62690893 | NM_001080420.1 | KDM6B  | same variant      |
| 62795429 | NM_016381.3    | TREX1  | NM_033629.5       |
| 63019340 | NM_001080420.1 | SHANK3 | NM_001372044.2    |
SCHEDULED: <2023-11-01 Wed>
*** DONE Add variant for 63009152
CLOSED: [2023-11-04 Sat 20:47] SCHEDULED: <2023-11-01 Wed>
*** DONE Regenerate annotation with NC_
CLOSED: [2023-11-04 Sat 18:59] SCHEDULED: <2023-10-31 Tue>
*** DONE Compare missed variants with Sanger: 0 confirmed
CLOSED: [2023-11-06 Mon 23:48] SCHEDULED: <2023-11-04 Sat>
*** DONE Annotate variants with Sanger
CLOSED: [2023-11-08 Wed 23:17] SCHEDULED: <2023-11-07 Tue>
*** DONE Mail Paul with results
CLOSED: [2023-11-09 Thu 00:22] SCHEDULED: <2023-11-05 Sun>
*** DONE Check coordinates of the 2 missing variants
CLOSED: [2023-11-12 Sun 16:53] SCHEDULED: <2023-11-11 Sat>
Both are homopolymers
- 1st = same variant but represented differently
- SHANK3?
**** PITX3: filtered because AD=8
NB: synonymous representation
Same sequence
  >hg38_dna range=chr10:102230742-102230777 5'pad=2 3'pad=2 strand=+ repeatMasking=none
GGAGCCAGCCCGGGGGGGCCCCCGCCCAGGCCCTG
>hg19_dna range=chr10:103990500-103990534 5'pad=0 3'pad=0 strand=+ repeatMasking=none
GGAGCCAGCCCGGGGGGGCCCCCGCCCAGGCCCTG
According to IGV:
GGAGCCAGCCC(G)GGGGGGCCCCCGCCCAGGCCCTG
According to cento:
GGAGCCAGCCCGGGGGG(G)CCCCCGCCCAGGCCCTG
#+begin_src sh :dir ~/annex/data/bisonex/
bcftools filter -i 'POS=102230760' call_variant/haplotypecaller/*63012582*/*.vcf.gz
#+end_src
DP ok but AD too low
 GT:AD:DP:GQ:PL  0/1:26,8:34:99:146,0,671
**** SHANK3: transcript removed since then: ok
Found by Eric: 50721504dup
Check
#+begin_src sh :dir ~/annex/data/bisonex/
bcftools filter -i 'POS=50721504' call_variant/haplotypecaller/*63019340*/*.vcf.gz
#+end_src
#+begin_src sh :dir ~/annex/data/bisonex/
zgrep '50721504' annotate/full/*63019340*.tsv
#+end_src
*** TODO Sanger for 4 missed VOUS
SCHEDULED: <2023-12-13 Wed>
/Entered on/ [2023-11-13 Mon 22:40]
** TODO Look for new genes
SCHEDULED: <2023-12-06 Wed>
* Results
** TODO Speed-up BWA-mem
SCHEDULED: <2023-12-02 Sat>
** TODO Speed-up HaplotypeCaller
SCHEDULED: <2023-12-02 Sat>
** TODO Redo statistics with happy + vcfeval
SCHEDULED: <2023-11-30 Thu>
/Entered on/ [2023-11-18 Sat 20:13]
* Communication
** DONE Mail NGS-diag
CLOSED: [2023-10-06 Fri 08:04] SCHEDULED: <2023-10-06 Fri>
/Entered on/ [2023-10-04 Wed 19:33]

File deletion: 20230511180745-bacteriologie.org, 20230511180745-microbiologie.org, microbiologie.org
BFD:BFD[17.29] → [18.4578:4634]
BF:BFD[18.4634] → [18.3565:3565]
BFD:BFD[17.29] → [19.41524:41580]
BF:BFD[19.41580] → [18.3565:3565]
BF:BFD[17.29] → [14.51530:51571]
BF:BFD[14.51571] → [18.3565:3565]
B:BD[20.325] → [21.24262:24285]
B:BD[22.3711] → [22.3711:3712]
B:BD[22.3712] → [21.24286:24303]
∅:D[19.41941] → [21.24567:24667]
B:BD[21.24567] → [21.24567:24667]
B:BD[21.24303] → [19.41581:41941]
∅:D[21.24285] → [20.350:380]
B:BD[20.350] → [20.350:380]
```
#+title: Microbiology
* Bacteriology
** Miscellaneous
*** Technique
**** 16S PCR
* Virology
#+INCLUDE: ./medecine/20230528233600-norovirus.org
- [[file:medecine/20230528225221-classification_bacteries.org][Classification of bacteria]]
- [[file:medecine/20230528225531-bacteries.org][Bacteria]]
- [[file:medecine/20230528235124-culture.org][Culture]]
- [[file:medecine/20230528235213-maladies_infectieuses.org][Infectious diseases]]
- [[file:medecine/20230528235406-antibiotiques.org][Antibiotics]]
#+filetags: medecine bacterio
```
Replacement in notes/medecine/hemato.org at line 3 [23.244]
B:BD[23.387] → [23.387:388]
[23.387]
[23.388]
```
* Physiology
```

Insertion in notes/medecine/hemato.org at line 32 [23.244]

[23.1537]

* Hemostasis
- Vitamin K-dependent factors: II, VII, IX, X
** Decreased TP (prothrombin ratio)
#+BEGIN_SRC dot :file ../../images/tp-diminue.png :exports results
digraph {
node[shape=rectangle];
subgraph { "II and X decreased" -> "vitK deficiency" }
subgraph { "all decreased" -> "DIC\nhepatocellular insufficiency\nfibrinogenolysis" }
subgraph { "II or V or X\ndecreased" -> "congenital\nacquired" }
subgraph {
    "Fibrinogen\nII, V and X" -> {
         "II or V or X\ndecreased"
         "II and X decreased"
         "all decreased"
                            }
}
subgraph {"II, VII or X\ndecreased" -> "vitK deficiency" }
subgraph { "VII decreased" -> "Start of VKA therapy\nEarly vitK deficiency\nIsolated deficiency" }
subgraph { diminues2 [label = "all decreased"];
           diminues2 -> "hepatocellular insufficiency"}
subgraph { "Fibrinogen\nII, V, VII or X" -> {
           "II, VII or X\ndecreased"
           "VII decreased"
           diminues2
           }
            }
subgraph { "normal aPTT" -> "Fibrinogen\nII, V, VII or X" }
subgraph { "prolonged aPTT\nwithout treatment" -> "Fibrinogen\nII, V and X" }
"decreased TP" -> { "normal aPTT"
             "prolonged aPTT\nwithout treatment" }
}
#+END_SRC
#+RESULTS:
[[file:../../images/tp-diminue.png]]
** Isolated aPTT (TCA) prolongation without treatment
#+BEGIN_SRC dot :file ../../images/tca-diminue.png :exports results
digraph {
node[shape=rectangle];
fact [label="VIII, IX, XI, XII"];
subgraph {"VIII alone" -> "von Willebrand disease\nHemophilia A\nAcquired hemophilia A!"}
subgraph {"IX alone" -> "Hemophilia B\nIX inhibitor"}
subgraph {"XI alone" -> "Infection\nPregnancy\nConstitutional deficiency\nInhibitor"}
subgraph {"XII alone" -> "no risk"}
subgraph {" >= 2" -> "Infection\nCirculating anticoagulant (ACC) interference?" }
subgraph {"normal" -> "Circulating anticoagulant (ACC)?" }
subgraph { fact -> {
" >= 2"
           "VIII alone"
           "IX alone"
           "XI alone"
           "XII alone"
           "normal"
                 }}
"isolated prolonged aPTT\nwithout treatment" -> fact
}
#+END_SRC
#+RESULTS:
[[file:../../images/tca-diminue.png]]