apraga/org - Change UMC5I25ULXUZL4IX6OJ6RHWCJMUFVRFPFHOZDF6AMALI2CJTBFFAC

Review done

Created by Alexis Praga on December 4, 2023


Dependencies

In channels

main

Change contents

Insertion in projects.org at line 406 [5.123895]

[3.613]

[4.44]

*** TODO Repair fridge
SCHEDULED: <2023-12-07 Thu>
Email sent 2023-12-04
/Entered on/ [2023-12-04 Mon 20:40]

Replacement in projects.org at line 434 [5.123895]
B:BD[3.705] → [6.12:45]
```
SCHEDULED: <2023-12-02 Sat .+1d>
```
[3.705]
[3.738]
```
SCHEDULED: <2023-12-04 Mon .+1d>
```

Insertion in projects.org at line 439 [5.123895]

[3.811]

[6.46]

- State "DONE"       from "TODO"       [2023-12-03 Sun 23:16]
- State "DONE"       from "TODO"       [2023-12-02 Sat 23:16]

Replacement in projects.org at line 466 [5.123895]
B:BD[7.7092] → [7.7092:7120]
```
SCHEDULED: <2023-12-04 Mon>
```
[7.7092]
[7.7120]
```
SCHEDULED: <2023-12-11 Mon>
```

Replacement in projects.org at line 999 [5.123895]

∅:D[8.4034] → [9.630:745]

B:BD[10.593] → [9.630:745]

B:BD[9.745] → [11.905:966]

** TODO Review "Parallelization with Load Balancing of the Weather Model WSM7 for Heterogeneous CPU-GPU Platforms"
SCHEDULED: <2023-11-25 Sat 15:00> DEADLINE: <2023-12-04 Mon>

[8.4034]

[9.800]

** DONE Review "Parallelization with Load Balancing of the Weather Model WSM7 for Heterogeneous CPU-GPU Platforms"
CLOSED: [2023-12-04 Mon 20:39] SCHEDULED: <2023-11-25 Sat 15:00> DEADLINE: <2023-12-04 Mon>

Replacement in projects/bisonex.org at line 2 [14.35]

B:BD[12.8453] → [13.410:8602]

WES 129.94×
HiSeq2500 SRR1611184 SeqCap EZ Human Exome Lib v3.0 WES 111.90×
Kit accessible?
**** Résumé
Kit available for hg38
| HiSeq 4000   | Agilent SureSelect v7 | SRX11061486 | https://github.com/kevinblighe/agilent |
| NovaSeq 6000 | Agilent SureSelect v7 | SRX11061516 | same                                   |
Kit available for hg19
| HiSeq2000 | SeqCap EZ Human Exome Lib v3.0 | SRR1611178 | http://hgdownload.soe.ucsc.edu/gbdb/hg19/exomeProbesets/ |
| HiSeq2000 | SeqCap EZ Human Exome Lib v3.0 | SRR1611179 | same |
| HiSeq2500 | SeqCap EZ Human Exome Lib v3.0 | SRR1611183 | same |
| HiSeq2500 | SeqCap EZ Human Exome Lib v3.0 | SRR1611184 | same |
https://emea.support.illumina.com/downloads/truseq-exome-product-files.html
*** Capture kit list
Agilent SureSelect v7, hg19 and hg38: https://github.com/kevinblighe/agilent
**** UCSC
- [[http://hgdownload.soe.ucsc.edu/gbdb/hg19/exomeProbesets/][hg19]]
- [[http://hgdownload.soe.ucsc.edu/gbdb/hg38/exomeProbesets/][hg38]]
**** AstraZeneca GitHub
https://github.com/AstraZeneca-NGS/reference_data
- IDT xGen Exome Research Panel v1.0
- Agilent SureSelect Human All Exon V6
- Agilent SureSelect Clinical Research Exome
- Nimblegen SeqCap EZ MedExome
- Nimblegen SeqCap EZ Exome v3
**** TruSeq
https://emea.support.illumina.com/downloads/truseq-exome-product-files.html
*** Validation example with bcbio:
Downloads the data and the BED file, then lifts over with CrossMap
https://github.com/bcbio/bcbio_validation_workflows/blob/master/giab-exome/input/get_data.sh
*** TODO How to download
**** DONE Test the command line
CLOSED: [2023-11-29 Wed 23:37] SCHEDULED: <2023-11-28 Tue>
***** KILL Test aws
CLOSED: [2023-11-28 Tue 23:47] SCHEDULED: <2023-11-28 Tue>
Seems to download the .sra file, judging by its size (the extension is missing)
#+begin_src sh
aws s3 cp s3://sra-pub-run-odp/sra/SRR1611178/SRR1611178 --no-sign-request .
#+end_src
***** KILL Test SRA fasterq-dump
CLOSED: [2023-11-29 Wed 22:20] SCHEDULED: <2023-11-28 Tue>
According to the docs (https://github.com/ncbi/sra-tools/wiki/08.-prefetch-and-fasterq-dump), a prefetch step is required first
#+begin_src sh
prefetch SRR1611178
fasterq-dump SRR1611178
#+end_src
Note: fasterq-dump creates a temporary directory about the size of the prefetch download, then deletes it. The FASTQ files are not compressed.
***** DONE Go through ENA, which gives a direct FTP link
CLOSED: [2023-11-29 Wed 23:37]
**** TODO Nextflow
***** KILL fromSRA
CLOSED: [2023-11-29 Wed 23:15]
Does not return the FTP link for SRR1611178/SRR1611178, even with an API key
**** TODO DataToolkit.jl
SCHEDULED: <2023-11-28 Tue>
- several datasets per patient, named NA12878 for example, but with different attributes (sequencer, kit, pair1, pair2)
- FTP links from ENA
*** Capture region
GIAB provides the .bed file for the exome. Info: https://support.illumina.com/sequencing/sequencing_kits/nextera-rapid-capture-exome-kit/downloads.html
*** Validate the method
- 1000 Genomes + SureSelect Human All Exon v2 target capture kit: not available on the Agilent website (V6 or later only)
  https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2928-9
- GIAB + liftover of the capture file to hg38
This is also what is done in
https://bcbio-nextgen.readthedocs.io/en/stable/contents/germline_variants.html
but with UCSC liftOver
** Centogene
https://www.twistbioscience.com/node/23906
BED file not provided for this exact capture
We use https://www.twistbioscience.com/resources/data-files/twist-alliance-vcgs-exome-401mb-bed-files
which covers most of it
* Meeting
** <2023-08-10 Thu> Alexis
OK to freeze development by next Tuesday
Dev:
- pipeline up to VEP on T2T + GRCh38
- OK to validate SPIP on T2T with a few variants => to be integrated into the pipeline
- annotation:
  - OK for MobiDetails hg38
  - +OMIM T2T+ no
  - +Franklin hg38+ not for now
- metrics (FASTQ at minimum) + MultiQC report
- optional
  - reformat the output
- dropped
  - XAMScissors with indels
  - HaplotypeCaller parallelization
  - SpliceAI on the fly
  - Pangolin
Tests
- GIAB:
  - hg38: OK to redo the NA12878 tests with the Cento data; otherwise OK to report "it is difficult" on the 3 capture files
  - T2T: OK for quick tests, but probably not enough time!
- synthetic patient: only Cento variants confirmed by Sanger
Results
- OK to scale up bwa mem and HaplotypeCaller
Manuscript
- method validation: drop the current version and do it like Strasbourg (cf. NGS diag) in the presentation
- sent the PowerPoint with the references of the various articles
- OK for ROBO4 if there are results
- target architecture = VM: 78 cores, 54 GB RAM and 1 TB disk space
Going to production: OK for a quick presentation of the code
* Nixpkgs :nix:
** DONE GATK
CLOSED: [2023-05-06 Sat 08:51]
*** DONE [[https://github.com/NixOS/nixpkgs/pull/185819][Binaire]]
CLOSED: [2022-09-10 Sat 23:53] SCHEDULED: <2022-08-10 Wed>
/Entered on/ [2022-08-09 Tue 10:57]
PR submitted
*** KILL Fix the code to build from source
CLOSED: [2022-09-11 Sun 22:05]
*** DONE Fix PATH to include java and python
CLOSED: [2022-10-11 Tue 11:46]
https://github.com/NixOS/nixpkgs/pull/191548
Review <2022-10-10 Mon>, fixed the same day
*** DONE Update 4.3.0.0
CLOSED: [2023-04-13 Thu 09:01]
** HOLD Nextflow
*** KILL Script-only version
CLOSED: [2023-04-01 Sat 18:29]
Fix for SGE and nextflow
https://github.com/NixOS/nixpkgs/issues/192396
*** KILL Version with gradle
CLOSED: [2022-10-09 Sun 22:51]
*** HOLD [[https://github.com/NixOS/nixpkgs/issues/192396][Bug report Version 22.10.6]]
**** Notes
Error:
ERROR: Cannot download nextflow required file -- make sure you can connect to the internet
Alternatively you can try to download this file:
    https://www.nextflow.io/releases/v22.10.6/nextflow-22.10.6-all.jar
and save it as:
    .//nix/store/md2b1ah4d7ivj82k8xxap30dmdci00pa-nextflow-22.10.6/bin/.nextflow-wrapped
The update creates a virtual environment, which breaks nextflow execution (it needs to download files)
Fix = disable it
**** KILL Patch NXF_OFFLINE=true
CLOSED: [2023-07-02 Sun 11:02] SCHEDULED: <2023-06-11 Sun>
** WAIT [[https://github.com/NixOS/nixpkgs/pull/249329][Multiqc]]
HG002,sanger-chr20,data/HG002-sanger-inserted-chr20_1.fq.gz,data/HG002-sanger-inserted-chr20_2.fq.gz
** KILL Mutalyzer
CLOSED: [2023-08-16 Wed 19:07] SCHEDULED: <2023-08-13 Sun>
Packaging is feasible but requires many Python packages
** TODO Variant validator -> hgvs
It is just an interface around hgvs, but it needs
- postgresql
- access to, or a download of, the databases
  Dependencies: wcwidth, pyee, pure-eval, ptyprocess, pickleshare, parsley, parse, fake-useragent, executing, backcall, appdirs, zipp, websockets, w3lib, urllib3, traitlets, tqdm, tabulate, sqlparse, soupsieve, six, pygments, psycopg2, prompt-toolkit, pexpect, parso, lxml, idna, humanfriendly, decorator, cython, cssselect, configparser, charset-normalizer, certifi, attrs, requests, pysam, pyquery, matplotlib-inline, jedi, importlib-metadata, coloredlogs, beautifulsoup4, asttokens, yoyo-migrations, stack-data, pyppeteer, bs4, bioutils, requests-html, ipython, biocommons.seqrepo, hgvs
** TODO SPIP :spip:
*** DONE PR upstream
CLOSED: [2023-08-12 Sat 18:23] SCHEDULED: <2023-08-12 Sat 18:00>
*** DONE Mail R. Lemann :T2T:
CLOSED: [2023-08-12 Sat 18:23] SCHEDULED: <2023-08-12 Sat 18:00>
*** KILL T2T update :T2T:
*** WAIT Fix PR
SCHEDULED: <2023-12-03 Sun>
** TODO VEP :vep:
*** DONE [[https://github.com/NixOS/nixpkgs/pull/185691][BioPerl]]
SCHEDULED: <2022-08-10 Wed>
/Entered on/ [2022-08-09 Tue 10:57]
PR submitted
*** DONE BioDBBBigFile
CLOSED: [2023-11-30 Thu 21:52]
:PROPERTIES:
:ORDERED:  t
:END:
/Entered on/ [2022-08-10 Wed 14:28]
We use the latest version of kent, so the problem is gone.
Ready to be merged. Rebased <2023-07-02 Sun>
**** DONE kent version already packaged: force version 335
CLOSED: [2023-07-02 Sun 11:20]
***** KILL [[h

[12.8453]

[2.306]

WES 129.94×
HiSeq2500 SRR1611184 SeqCap EZ Human Exome Lib v3.0 WES 111.90×
Kit accessible?
**** Résumé
Kit available for hg38
| HiSeq 4000   | Agilent SureSelect v7 | SRX11061486 | https://github.com/kevinblighe/agilent |
| NovaSeq 6000 | Agilent SureSelect v7 | SRX11061516 | same                                   |
Kit available for hg19
| HiSeq2000 | SeqCap EZ Human Exome Lib v3.0 | SRR1611178 | http://hgdownload.soe.ucsc.edu/gbdb/hg19/exomeProbesets/ |
| HiSeq2000 | SeqCap EZ Human Exome Lib v3.0 | SRR1611179 | same |
| HiSeq2500 | SeqCap EZ Human Exome Lib v3.0 | SRR1611183 | same |
| HiSeq2500 | SeqCap EZ Human Exome Lib v3.0 | SRR1611184 | same |
https://emea.support.illumina.com/downloads/truseq-exome-product-files.html
*** Capture kit list
Agilent SureSelect v7, hg19 and hg38: https://github.com/kevinblighe/agilent
**** UCSC
- [[http://hgdownload.soe.ucsc.edu/gbdb/hg19/exomeProbesets/][hg19]]
- [[http://hgdownload.soe.ucsc.edu/gbdb/hg38/exomeProbesets/][hg38]]
**** AstraZeneca GitHub
https://github.com/AstraZeneca-NGS/reference_data
- IDT xGen Exome Research Panel v1.0
- Agilent SureSelect Human All Exon V6
- Agilent SureSelect Clinical Research Exome
- Nimblegen SeqCap EZ MedExome
- Nimblegen SeqCap EZ Exome v3
**** TruSeq
https://emea.support.illumina.com/downloads/truseq-exome-product-files.html
*** Validation example with bcbio:
Downloads the data and the BED file, then lifts over with CrossMap
https://github.com/bcbio/bcbio_validation_workflows/blob/master/giab-exome/input/get_data.sh
*** TODO How to download
**** DONE Test the command line
CLOSED: [2023-11-29 Wed 23:37] SCHEDULED: <2023-11-28 Tue>
***** KILL Test aws
CLOSED: [2023-11-28 Tue 23:47] SCHEDULED: <2023-11-28 Tue>
Seems to download the .sra file, judging by its size (the extension is missing)
#+begin_src sh
aws s3 cp s3://sra-pub-run-odp/sra/SRR1611178/SRR1611178 --no-sign-request .
#+end_src
***** KILL Test SRA fasterq-dump
CLOSED: [2023-11-29 Wed 22:20] SCHEDULED: <2023-11-28 Tue>
According to the docs (https://github.com/ncbi/sra-tools/wiki/08.-prefetch-and-fasterq-dump), a prefetch step is required first
#+begin_src sh
prefetch SRR1611178
fasterq-dump SRR1611178
#+end_src
Note: fasterq-dump creates a temporary directory about the size of the prefetch download, then deletes it. The FASTQ files are not compressed.
***** DONE Go through ENA, which gives a direct FTP link
CLOSED: [2023-11-29 Wed 23:37]
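A minimal sketch of the ENA route, assuming the ENA Portal API `filereport` endpoint and its `fastq_ftp` field (both recalled from the ENA documentation and worth checking before use). The helper names `ena_filereport_url` and `parse_fastq_links` are hypothetical, and no network call is made here: the parser works on the TSV text once it has been fetched.

```python
from urllib.parse import urlencode

# Assumed endpoint of the ENA Portal API (verify against the ENA docs).
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def ena_filereport_url(accession):
    """Build the file-report URL asking ENA for the FASTQ FTP links of a run."""
    query = urlencode({
        "accession": accession,
        "result": "read_run",
        "fields": "run_accession,fastq_ftp",
        "format": "tsv",
    })
    return f"{ENA_FILEREPORT}?{query}"


def parse_fastq_links(tsv_text):
    """Map each run accession to its list of FASTQ links.

    Assumes a header row and a semicolon-separated fastq_ftp column.
    """
    lines = [line for line in tsv_text.splitlines() if line.strip()]
    header = lines[0].split("\t")
    links = {}
    for row in lines[1:]:
        record = dict(zip(header, row.split("\t")))
        urls = [u for u in record.get("fastq_ftp", "").split(";") if u]
        links[record["run_accession"]] = urls
    return links
```

For paired-end runs such as SRR1611178 the report is expected to list two files, which would map onto the pair1/pair2 attributes mentioned for DataToolkit.jl.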
**** TODO Nextflow
***** KILL fromSRA
CLOSED: [2023-11-29 Wed 23:15]
Does not return the FTP link for SRR1611178/SRR1611178, even with an API key
**** TODO DataToolkit.jl
SCHEDULED: <2023-11-28 Tue>
- several datasets per patient, named NA12878 for example, but with different attributes (sequencer, kit, pair1, pair2)
- FTP links from ENA
*** Capture region
GIAB provides the .bed file for the exome. Info: https://support.illumina.com/sequencing/sequencing_kits/nextera-rapid-capture-exome-kit/downloads.html
*** Validate the method
- 1000 Genomes + SureSelect Human All Exon v2 target capture kit: not available on the Agilent website (V6 or later only)
  https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2928-9
- GIAB + liftover of the capture file to hg38
This is also what is done in
https://bcbio-nextgen.readthedocs.io/en/stable/contents/germline_variants.html
but with UCSC liftOver
** Centogene
https://www.twistbioscience.com/node/23906
BED file not provided for this exact capture
We use https://www.twistbioscience.com/resources/data-files/twist-alliance-vcgs-exome-401mb-bed-files
which covers most of it
* Meeting
** <2023-08-10 Thu> Alexis
OK to freeze development by next Tuesday
Dev:
- pipeline up to VEP on T2T + GRCh38
- OK to validate SPIP on T2T with a few variants => to be integrated into the pipeline
- annotation:
  - OK for MobiDetails hg38
  - +OMIM T2T+ no
  - +Franklin hg38+ not for now
- metrics (FASTQ at minimum) + MultiQC report
- optional
  - reformat the output
- dropped
  - XAMScissors with indels
  - HaplotypeCaller parallelization
  - SpliceAI on the fly
  - Pangolin
Tests
- GIAB:
  - hg38: OK to redo the NA12878 tests with the Cento data; otherwise OK to report "it is difficult" on the 3 capture files
  - T2T: OK for quick tests, but probably not enough time!
- synthetic patient: only Cento variants confirmed by Sanger
Results
- OK to scale up bwa mem and HaplotypeCaller
Manuscript
- method validation: drop the current version and do it like Strasbourg (cf. NGS diag) in the presentation
- sent the PowerPoint with the references of the various articles
- OK for ROBO4 if there are results
- target architecture = VM: 78 cores, 54 GB RAM and 1 TB disk space
Going to production: OK for a quick presentation of the code
* Nixpkgs :nix:
** DONE GATK
CLOSED: [2023-05-06 Sat 08:51]
*** DONE [[https://github.com/NixOS/nixpkgs/pull/185819][Binaire]]
CLOSED: [2022-09-10 Sat 23:53] SCHEDULED: <2022-08-10 Wed>
/Entered on/ [2022-08-09 Tue 10:57]
PR submitted
*** KILL Fix the code to build from source
CLOSED: [2022-09-11 Sun 22:05]
*** DONE Fix PATH to include java and python
CLOSED: [2022-10-11 Tue 11:46]
https://github.com/NixOS/nixpkgs/pull/191548
Review <2022-10-10 Mon>, fixed the same day
*** DONE Update 4.3.0.0
CLOSED: [2023-04-13 Thu 09:01]
** HOLD Nextflow
*** KILL Script-only version
CLOSED: [2023-04-01 Sat 18:29]
Fix for SGE and nextflow
https://github.com/NixOS/nixpkgs/issues/192396
*** KILL Version with gradle
CLOSED: [2022-10-09 Sun 22:51]
*** HOLD [[https://github.com/NixOS/nixpkgs/issues/192396][Bug report Version 22.10.6]]
**** Notes
Error:
ERROR: Cannot download nextflow required file -- make sure you can connect to the internet
Alternatively you can try to download this file:
    https://www.nextflow.io/releases/v22.10.6/nextflow-22.10.6-all.jar
and save it as:
    .//nix/store/md2b1ah4d7ivj82k8xxap30dmdci00pa-nextflow-22.10.6/bin/.nextflow-wrapped
The update creates a virtual environment, which breaks nextflow execution (it needs to download files)
Fix = disable it
**** KILL Patch NXF_OFFLINE=true
CLOSED: [2023-07-02 Sun 11:02] SCHEDULED: <2023-06-11 Sun>
** WAIT [[https://github.com/NixOS/nixpkgs/pull/249329][Multiqc]]
HG002,sanger-chr20,data/HG002-sanger-inserted-chr20_1.fq.gz,data/HG002-sanger-inserted-chr20_2.fq.gz
** KILL Mutalyzer
CLOSED: [2023-08-16 Wed 19:07] SCHEDULED: <2023-08-13 Sun>
Packaging is feasible but requires many Python packages
** TODO Variant validator -> hgvs
It is just an interface around hgvs, but it needs
- postgresql
- access to, or a download of, the databases
  Dependencies: wcwidth, pyee, pure-eval, ptyprocess, pickleshare, parsley, parse, fake-useragent, executing, backcall, appdirs, zipp, websockets, w3lib, urllib3, traitlets, tqdm, tabulate, sqlparse, soupsieve, six, pygments, psycopg2, prompt-toolkit, pexpect, parso, lxml, idna, humanfriendly, decorator, cython, cssselect, configparser, charset-normalizer, certifi, attrs, requests, pysam, pyquery, matplotlib-inline, jedi, importlib-metadata, coloredlogs, beautifulsoup4, asttokens, yoyo-migrations, stack-data, pyppeteer, bs4, bioutils, requests-html, ipython, biocommons.seqrepo, hgvs
** TODO SPIP :spip:
*** DONE PR upstream
CLOSED: [2023-08-12 Sat 18:23] SCHEDULED: <2023-08-12 Sat 18:00>
*** DONE Mail R. Lemann :T2T:
CLOSED: [2023-08-12 Sat 18:23] SCHEDULED: <2023-08-12 Sat 18:00>
*** KILL T2T update :T2T:
*** WAIT Fix PR
SCHEDULED: <2023-12-18 Mon>
** TODO VEP :vep:
*** DONE [[https://github.com/NixOS/nixpkgs/pull/185691][BioPerl]]
SCHEDULED: <2022-08-10 Wed>
/Entered on/ [2022-08-09 Tue 10:57]
PR submitted
*** DONE BioDBBBigFile
CLOSED: [2023-11-30 Thu 21:52]
:PROPERTIES:
:ORDERED:  t
:END:
/Entered on/ [2022-08-10 Wed 14:28]
We use the latest version of kent, so the problem is gone.
Ready to be merged. Rebased <2023-07-02 Sun>
**** DONE kent version already packaged: force version 335
CLOSED: [2023-07-02 Sun 11:20]
***** KILL [[h

Replacement in projects/bisonex.org at line 64 [14.35]

B:BD[15.43998] → [2.16692:24884]

P    ALL        15883     15479       404        23597      5277       2841     46     44       0.974564          0.745760        0.120397         0.844947                3.017198                 2.85705                   5.560099                   2.114633
  SNP   PASS        15883     15479       404        23597      5277       2841     46     44       0.974564          0.745760        0.120397         0.844947                3.017198                 2.85705                   5.560099                   2.114633
******* DONE Check that no filter other than PASS remains
CLOSED: [2023-07-08 Sat 15:19]
#+begin_src
$ zgrep -c 'PASS' HG001_GRCh38_1_22_v4_lifted_merged.vcf.gz
3730505
$ zgrep -c '^chr' HG001_GRCh38_1_22_v4_lifted_merged.vcf.gz
3730506
#+end_src
****** TODO 1/4 of SNPs missing?
******* DONE Check with Julia whether these are really FPs: 61/5277 are not
CLOSED: [2023-07-09 Sun 12:09]
******* DONE Examine the FPs
CLOSED: [2023-07-30 Sun 22:05]
******* DONE Test one FP
CLOSED: [2023-07-30 Sun 22:05]
  2 │ chr1        608765  A           G           ./.:.:.:.:NOCALL:nocall:.  1/1:FP:.:ti:SNP:homalt:188
  liftDown UCSC: nothing in GIAB, so a true FP
 3 │ chr1        762943  A           G           ./.:.:.:.:NOCALL:nocall:.  1/1:FP:.:ti:SNP:homalt:287
 4 │ chr1        762945  A           T           ./.:.:.:.:NOCALL:nocall:.  1/1:FP:.:tv:SNP:homalt:287
 Complex rearrangements? Not in the gene in hg38
******* DONE Most FPs (4705/5566) are homozygous: reference error?
CLOSED: [2023-07-12 Wed 21:10] SCHEDULED: <2023-07-09 Sun>
The first 2 variants actually show the difference between T2T and GRCh38
Alignment error?
******** KILL Rerun the alignment
CLOSED: [2023-07-09 Sun 17:36]
******** DONE Check that the reads are identical in hg38 and T2T: yes
CLOSED: [2023-07-09 Sun 16:36]
T2T CHR1608765
38   	chr1:1180168-1180168 (
SRR14724513.24448214
SRR14724513.24448214
******* DONE Check a few variants in IGV
CLOSED: [2023-07-09 Sun 17:36]
******* KILL Distribution of the FPs: clusters?
CLOSED: [2023-07-09 Sun 17:36]
****** DONE Examine the FPs remaining after correcting for the reference sequence
CLOSED: [2023-08-12 Sat 15:57]
****** HOLD Examine the removed variants
****** TODO Remove the FPs that correspond to a change in the genome
******* Condition:
- no variation at the position in GRCh38
- homozygous variation
- the variation in T2T matches the base-pair change GRCh38 -> T2T
  for SNPs:
  alt_T2T[i] = DNA_GRC38[j]
  with i the position in T2T and j the position in GRCh38
  Note: defining an ID is not correct because the variants can be modified by hap.py!
******* Idea
 - Each FP is a "false" FP if
     - REF in hg38 == ALT in T2T
     - and REF in hg38 != REF in T2T
     - and the variant is homozygous
How to obtain the reference sequences?
1. liftover
2. BLAT on the sequence around the variant
3. identify a few reads containing the variant and look at their alignment in hg38
After discussion with Alexis: solution 3
******* Algorithm
1. Extract the T2T coordinates of the *homozygous* false positives
2. For each false positive
   1. list 10 reads containing the variant
   2. for each of these reads, retrieve the sequence in T2T and GRCh38 via the read name in the BAM
   3. if the T2T sequence modified by the variant is "identical" to the GRCh38 one, ignore this false positive
Note: reads that changed chromosome between the versions are ignored
******* DONE Preliminary result
CLOSED: [2023-07-23 Sun 14:30]
cf. [[file:~/roam/research/bisonex/code/giab/giab-corrected.csv][Julia script]]
3498 fewer false positives, i.e. a precision of 0.89
julia> tp=15479
julia> fp=5277
julia> tp/(tp+fp)
0.7457602620928888
julia> tp/(tp+(fp-3498))
0.8969173716537258
Still below 97%
******* HOLD Properly correct the VCF or the hap.py results
******* TODO Adapt to handle several variants per read
****** DONE Pangenome methodology
CLOSED: [2023-10-03 Tue 21:28]
See bibliography [cite:@liao2023], but they aligned to GRCh38
******* DONE Email Alexis
CLOSED: [2023-10-03 Tue 21:28]
****** DONE T2T methodology
CLOSED: [2023-10-16 Mon 19:42]
Email Alexis
SCHEDULED: <2023-10-04 Wed>
***** TODO Simply report the number of true positives
SCHEDULED: <2023-12-08 Fri>
***** KILL Email Yannis
CLOSED: [2023-07-08 Sat 10:44]
***** DONE Email GIAB about a T2T version
CLOSED: [2023-07-07 Fri 18:37]
**** KILL HG002 :hg002:T2T:
CLOSED: [2023-11-26 Sun 12:30]
**** KILL HG003 :hg003:T2T:
CLOSED: [2023-11-26 Sun 12:30]
**** KILL HG004 :hg004:T2T:
CLOSED: [2023-11-26 Sun 12:30]
**** DONE Plot : ashkenazim trio :hg38:
CLOSED: [2023-07-30 Sun 16:49] SCHEDULED: <2023-07-30 Sun 15:00>
:LOGBOOK:
CLOCK: [2023-07-30 Sun 16:06]--[2023-07-30 Sun 16:35] =>  0:29
CLOCK: [2023-07-30 Sun 15:39]--[2023-07-30 Sun 15:40] =>  0:01
:END:
/Entered on/ [2023-04-16 Sun 17:29]
Redo the results
**** DONE Email Paul about the Ashkenazim results +/- Centogene
CLOSED: [2023-08-06 Sun 20:24] SCHEDULED: <2023-08-06 Sun>
**** DONE Rerun the GIAB comparison with GATK 4.4.0
CLOSED: [2023-08-12 Sat 15:55]
/Entered on/ [2023-08-03 Thu 12:42]
**** TODO Re-download cleanly in dedicated pipelines
[[*Résumé][Résumé]]
Cf [[*Validation : Quelles données de référence ?][Validation : Quelles données de référence ?]]
https://medium.com/dnanexus/benchmarking-state-of-the-art-secondary-variant-calling-pipelines-5472ca6bace7
Source:
https://trace.ncbi.nlm.nih.gov/Traces/index.html?view=study&acc=SRP047086
https://zenodo.org/records/3597727
According to https://github.com/genome-in-a-bottle/giab_data_indexes
***** TODO HG001 :hg001:
SCHEDULED: <2023-11-29 Wed>
****** TODO With hg38 data
SCHEDULED: <2023-11-29 Wed>
[[*Résumé][Résumé]]
****** TODO With hg19 data
SCHEDULED: <2023-12-10 Sun>
Use CrossMap! https://crossmap.readthedocs.io/en/latest/ (inspired by [[https://github.com/bcbio/bcbio_validation_workflows/blob/master/giab-exome/input/get_data.sh][bcbio]])
to check
***** TODO HG002 :hg002:
SCHEDULED: <2023-12-04 Mon>
***** TODO HG003 :hg003:
SCHEDULED: <2023-12-04 Mon>
***** TODO HG004 :hg004:
SCHEDULED: <2023-12-04 Mon>
**** TODO Redo the analyses to get better results
SCHEDULED: <2023-12-03 Sun>
We want the results from https://medium.com/dnanexus/benchmarking-state-of-the-art-secondary-variant-calling-pipelines-5472ca6bace7
***** TODO hap.py with conda
SCHEDULED: <2023-12-10 Sun>
***** TODO rtg vcfeval
SCHEDULED: <2023-12-10 Sun>
***** TODO Rerun
SCHEDULED: <2023-12-10 Sun>
*** TODO Platinum genome :platinum:
https://emea.illumina.com/platinumgenomes.html
**** TODO Test on the region covered by the Centogene exome
SCHEDULED: <2023-12-08 Fri>
*** DONE Sequence NA12878 :cento:hg001:
CLOSED: [2023-10-07 Sat 17:59]
Discussion with Paul: the subcontractor will not give us the data, so the DNA has to be ordered
**** DONE DNA ordered
CLOSED: [2023-06-30 Fri 22:29]
**** DONE Back up the raw data
CLOSED: [2023-07-30 Sun 14:22] SCHEDULED: <2023-07-19 Wed>
K, scality, S
**** KILL Retrieve the capture file
CLOSED: [2023-07-30 Sun 14:25] SCHEDULED: <2023-07-23 Sun>
Candidates given in the publication https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8354858/
#+begin_quote
In short, the Nextera Rapid Capture Exome Kit (Illumina, San Diego, CA), the SureSelect Human All Exon kit (Agilent, Santa Clara, CA) or the Twist Human Core Exome was used for enrichment, and a Nextseq500, HiSeq4000, or Novoseq 6000 (Illumina) instrument was used for the actual sequencing, with the average coverage targeted to at least 100× or at least 98% of the target DNA covered 20×.
#+end_quote
By default, we will use https://www.twistbioscience.com/products/ngs/alliance-panels#tab-3
Recent announcement of a new Twist panel: https://www.centogene.com/news-events/news/newsdetails/twist-bioscience-and-centogene-launch-three-panels-to-advance-rare-

[15.43998]

[2.24884]

P    ALL        15883     15479       404        23597      5277       2841     46     44       0.974564          0.745760        0.120397         0.844947                3.017198                 2.85705                   5.560099                   2.114633
  SNP   PASS        15883     15479       404        23597      5277       2841     46     44       0.974564          0.745760        0.120397         0.844947                3.017198                 2.85705                   5.560099                   2.114633
******* DONE Check that no filter other than PASS remains
CLOSED: [2023-07-08 Sat 15:19]
#+begin_src
$ zgrep -c 'PASS' HG001_GRCh38_1_22_v4_lifted_merged.vcf.gz
3730505
$ zgrep -c '^chr' HG001_GRCh38_1_22_v4_lifted_merged.vcf.gz
3730506
#+end_src
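The two `zgrep -c` counts above are line counts: the first counts every line containing the string PASS (possibly header lines too), the second every line starting with `chr`. As a stricter check, the FILTER column itself can be tallied. A minimal sketch, assuming a standard tab-separated VCF where FILTER is the 7th column; `count_filters` is a hypothetical helper, not part of the pipeline.

```python
import gzip
from collections import Counter


def count_filters(vcf_path):
    """Tally the FILTER column (7th field) over all records of a VCF file."""
    counts = Counter()
    # Plain and gzip/bgzip-compressed VCFs are both accepted.
    opener = gzip.open if str(vcf_path).endswith(".gz") else open
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information and column header lines
            fields = line.rstrip("\n").split("\t")
            counts[fields[6]] += 1
    return counts
```

If the only key in the returned counter is `PASS`, no other filter remains; any other key names the leftover filter and how many records carry it.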
****** TODO 1/4 of SNPs missing?
******* DONE Check with Julia whether these are really FPs: 61/5277 are not
CLOSED: [2023-07-09 Sun 12:09]
******* DONE Examine the FPs
CLOSED: [2023-07-30 Sun 22:05]
******* DONE Test one FP
CLOSED: [2023-07-30 Sun 22:05]
  2 │ chr1        608765  A           G           ./.:.:.:.:NOCALL:nocall:.  1/1:FP:.:ti:SNP:homalt:188
  liftDown UCSC: nothing in GIAB, so a true FP
 3 │ chr1        762943  A           G           ./.:.:.:.:NOCALL:nocall:.  1/1:FP:.:ti:SNP:homalt:287
 4 │ chr1        762945  A           T           ./.:.:.:.:NOCALL:nocall:.  1/1:FP:.:tv:SNP:homalt:287
 Complex rearrangements? Not in the gene in hg38
******* DONE Most FPs (4705/5566) are homozygous: reference error?
CLOSED: [2023-07-12 Wed 21:10] SCHEDULED: <2023-07-09 Sun>
The first 2 variants actually show the difference between T2T and GRCh38
Alignment error?
******** KILL Rerun the alignment
CLOSED: [2023-07-09 Sun 17:36]
******** DONE Check that the reads are identical in hg38 and T2T: yes
CLOSED: [2023-07-09 Sun 16:36]
T2T CHR1608765
38   	chr1:1180168-1180168 (
SRR14724513.24448214
SRR14724513.24448214
******* DONE Check a few variants in IGV
CLOSED: [2023-07-09 Sun 17:36]
******* KILL Distribution of the FPs: clusters?
CLOSED: [2023-07-09 Sun 17:36]
****** DONE Examine the FPs remaining after correcting for the reference sequence
CLOSED: [2023-08-12 Sat 15:57]
****** HOLD Examine the removed variants
****** TODO Remove the FPs that correspond to a change in the genome
******* Condition:
- no variation at the position in GRCh38
- homozygous variation
- the variation in T2T matches the base-pair change GRCh38 -> T2T
  for SNPs:
  alt_T2T[i] = DNA_GRC38[j]
  with i the position in T2T and j the position in GRCh38
  Note: defining an ID is not correct because the variants can be modified by hap.py!
******* Idea
 - Each FP is a "false" FP if
     - REF in hg38 == ALT in T2T
     - and REF in hg38 != REF in T2T
     - and the variant is homozygous
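The three conditions listed just above can be written as a single predicate. A minimal sketch for SNPs only, assuming the reference bases at the matching hg38 and T2T positions are already known and the genotype is a VCF GT string such as `1/1`; the function name `is_reference_difference` is hypothetical.

```python
def is_reference_difference(ref_hg38, ref_t2t, alt_t2t, genotype):
    """Return True when a T2T false positive only reflects a reference change.

    ref_hg38: reference base at the matching GRCh38 position
    ref_t2t:  reference base at the T2T position
    alt_t2t:  alternate base called against T2T
    genotype: VCF GT string, e.g. "1/1" or "0|1"
    """
    alleles = genotype.replace("|", "/").split("/")
    # Homozygous for a non-reference allele: both alleles equal, not 0 or missing.
    homozygous_alt = (
        len(alleles) == 2
        and alleles[0] == alleles[1]
        and alleles[0] not in ("0", ".")
    )
    return ref_hg38 == alt_t2t and ref_hg38 != ref_t2t and homozygous_alt
```

For example, a `1/1` call of A>G in T2T where hg38 already has G at that position would be flagged, while the same call as `0/1` would not.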
How to obtain the reference sequences?
1. liftover
2. BLAT on the sequence around the variant
3. identify a few reads containing the variant and look at their alignment in hg38
After discussion with Alexis: solution 3
******* Algorithm
1. Extract the T2T coordinates of the *homozygous* false positives
2. For each false positive
   1. list 10 reads containing the variant
   2. for each of these reads, retrieve the sequence in T2T and GRCh38 via the read name in the BAM
   3. if the T2T sequence modified by the variant is "identical" to the GRCh38 one, ignore this false positive
Note: reads that changed chromosome between the versions are ignored
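Step 2.3 of the algorithm above can be sketched as follows, for a single SNV. This is an illustration under assumptions, not the Julia script itself: the sequences for each read are taken as already extracted from the two BAM files, "identical" is read as exact string equality, and every read is required to agree. The names `apply_snv` and `is_ignorable_fp` are hypothetical.

```python
def apply_snv(sequence, offset, ref, alt):
    """Return the sequence with the base at `offset` changed from ref to alt."""
    if sequence[offset:offset + 1] != ref:
        raise ValueError(f"expected {ref!r} at offset {offset}")
    return sequence[:offset] + alt + sequence[offset + 1:]


def is_ignorable_fp(windows, ref, alt):
    """Decide whether a false positive is explained by the reference change.

    windows: one (t2t_sequence, variant_offset, hg38_sequence) tuple per read,
             covering the same read span in both assemblies.
    Returns True only if, for every read, the T2T sequence carrying the
    variant equals the GRCh38 sequence. With no usable read, returns False.
    """
    if not windows:
        return False
    return all(
        apply_snv(t2t_seq, offset, ref, alt) == hg38_seq
        for t2t_seq, offset, hg38_seq in windows
    )
```

Requiring all reads to agree is the strictest choice; a majority threshold would be a reasonable alternative if a few reads align differently between the two assemblies.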
******* DONE Preliminary result
CLOSED: [2023-07-23 Sun 14:30]
cf. [[file:~/roam/research/bisonex/code/giab/giab-corrected.csv][Julia script]]
3498 fewer false positives, i.e. a precision of 0.89
julia> tp=15479
julia> fp=5277
julia> tp/(tp+fp)
0.7457602620928888
julia> tp/(tp+(fp-3498))
0.8969173716537258
Still below 97%
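The ratio computed in the Julia session above, TP / (TP + FP), is precision; recall is TP / (TP + FN). A short sketch using only the counts from the hap.py SNP row earlier in this section (TP = 15479, FN = 404, FP = 5277) and the 3498 removed false positives:

```python
def precision(tp, fp):
    """Fraction of called variants that are true: TP / (TP + FP)."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of truth variants that are found: TP / (TP + FN)."""
    return tp / (tp + fn)


tp, fn, fp = 15479, 404, 5277   # counts from the hap.py SNP row
removed_fp = 3498               # false positives explained by reference changes

precision_before = precision(tp, fp)
precision_after = precision(tp, fp - removed_fp)
recall_value = recall(tp, fn)   # unchanged by removing false positives

print(f"precision before: {precision_before:.6f}")
print(f"precision after:  {precision_after:.6f}")
print(f"recall:           {recall_value:.6f}")
```

Removing false positives changes precision only; recall stays at the value reported by hap.py.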
******* HOLD Properly correct the VCF or the hap.py results
******* TODO Adapt to handle several variants per read
****** DONE Pangenome methodology
CLOSED: [2023-10-03 Tue 21:28]
See bibliography [cite:@liao2023], but they aligned to GRCh38
******* DONE Email Alexis
CLOSED: [2023-10-03 Tue 21:28]
****** DONE T2T methodology
CLOSED: [2023-10-16 Mon 19:42]
Email Alexis
SCHEDULED: <2023-10-04 Wed>
***** TODO Simply report the number of true positives
SCHEDULED: <2023-12-08 Fri>
***** KILL Email Yannis
CLOSED: [2023-07-08 Sat 10:44]
***** DONE Email GIAB about a T2T version
CLOSED: [2023-07-07 Fri 18:37]
**** KILL HG002 :hg002:T2T:
CLOSED: [2023-11-26 Sun 12:30]
**** KILL HG003 :hg003:T2T:
CLOSED: [2023-11-26 Sun 12:30]
**** KILL HG004 :hg004:T2T:
CLOSED: [2023-11-26 Sun 12:30]
**** DONE Plot : ashkenazim trio :hg38:
CLOSED: [2023-07-30 Sun 16:49] SCHEDULED: <2023-07-30 Sun 15:00>
:LOGBOOK:
CLOCK: [2023-07-30 Sun 16:06]--[2023-07-30 Sun 16:35] =>  0:29
CLOCK: [2023-07-30 Sun 15:39]--[2023-07-30 Sun 15:40] =>  0:01
:END:
/Entered on/ [2023-04-16 Sun 17:29]
Redo the results
**** DONE Email Paul about the Ashkenazim results +/- Centogene
CLOSED: [2023-08-06 Sun 20:24] SCHEDULED: <2023-08-06 Sun>
**** DONE Rerun the GIAB comparison with GATK 4.4.0
CLOSED: [2023-08-12 Sat 15:55]
/Entered on/ [2023-08-03 Thu 12:42]
**** TODO Re-download cleanly in dedicated pipelines
[[*Résumé][Résumé]]
Cf [[*Validation : Quelles données de référence ?][Validation : Quelles données de référence ?]]
https://medium.com/dnanexus/benchmarking-state-of-the-art-secondary-variant-calling-pipelines-5472ca6bace7
Source:
https://trace.ncbi.nlm.nih.gov/Traces/index.html?view=study&acc=SRP047086
https://zenodo.org/records/3597727
According to https://github.com/genome-in-a-bottle/giab_data_indexes
***** TODO HG001 :hg001:
SCHEDULED: <2023-11-29 Wed>
****** TODO With hg38 data
SCHEDULED: <2023-11-29 Wed>
[[*Résumé][Résumé]]
****** TODO With hg19 data
SCHEDULED: <2023-12-10 Sun>
Use CrossMap! https://crossmap.readthedocs.io/en/latest/ (inspired by [[https://github.com/bcbio/bcbio_validation_workflows/blob/master/giab-exome/input/get_data.sh][bcbio]])
to check
***** TODO HG002 :hg002:
SCHEDULED: <2023-12-09 Sat>
***** TODO HG003 :hg003:
SCHEDULED: <2023-12-09 Sat>
***** TODO HG004 :hg004:
SCHEDULED: <2023-12-09 Sat>
**** TODO Redo the analyses to get better results
SCHEDULED: <2023-12-09 Sat>
We want the results from https://medium.com/dnanexus/benchmarking-state-of-the-art-secondary-variant-calling-pipelines-5472ca6bace7
***** TODO hap.py with conda
SCHEDULED: <2023-12-10 Sun>
***** TODO rtg vcfeval
SCHEDULED: <2023-12-10 Sun>
***** TODO Rerun
SCHEDULED: <2023-12-10 Sun>
*** TODO Platinum genome :platinum:
https://emea.illumina.com/platinumgenomes.html
**** TODO Test on the region covered by the Centogene exome
SCHEDULED: <2023-12-08 Fri>
*** DONE Sequence NA12878 :cento:hg001:
CLOSED: [2023-10-07 Sat 17:59]
Discussion with Paul: the subcontractor will not give us the data, so the DNA has to be ordered
**** DONE DNA ordered
CLOSED: [2023-06-30 Fri 22:29]
**** DONE Back up the raw data
CLOSED: [2023-07-30 Sun 14:22] SCHEDULED: <2023-07-19 Wed>
K, scality, S
**** KILL Retrieve the capture file
CLOSED: [2023-07-30 Sun 14:25] SCHEDULED: <2023-07-23 Sun>
Candidates given in the publication https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8354858/
#+begin_quote
In short, the Nextera Rapid Capture Exome Kit (Illumina, San Diego, CA), the SureSelect Human All Exon kit (Agilent, Santa Clara, CA) or the Twist Human Core Exome was used for enrichment, and a Nextseq500, HiSeq4000, or Novoseq 6000 (Illumina) instrument was used for the actual sequencing, with the average coverage targeted to at least 100× or at least 98% of the target DNA covered 20×.
#+end_quote
By default, we will use https://www.twistbioscience.com/products/ngs/alliance-panels#tab-3
Recent announcement of a new Twist panel: https://www.centogene.com/news-events/news/newsdetails/twist-bioscience-and-centogene-launch-three-panels-to-advance-rare-