G2GQZU6T4NH2G2TXBY4G7IPKQPM23BSEMPK4I7ZAU3ANSMVXCXXQC
ISSOVQIPGPFBVEFLS6IBIRX63IE4KU5ZNOJBBIVFUSGS4NSH44PAC
X4TKJADHF355D73Q4WYMCWHFQO3RGAD3BKAB2F3WR3Y23QLIVIGQC
I4OQ35BVH6AC4IEQKZP2YXLKGWBCRM4TUX7S66OFZ5D7RZP23R2QC
HK72ZI3QZZKNR4BNVO2XELQHRXTOCQMTLS7ZRDR2PON6SXGFYCOQC
RHWQQAAHNHFO3FLCGVB3SIDKNOUFJGZTDNN57IQVBMXXCWX74MKAC
YA4R25RPAF23T46FTNB2YLYF2U2I7BZZ2673YNUG6SO73JRCWAIQC
E4KW7YDDHZXLERHI7J7QP5EDLT3LPCGDKK4RVK55BNVYW6PVOUOAC
FYUID2XZFKETTS6EJLOGMV5DJJMLY4NCQSATT3ZKWDLHEKJRO4TAC
WU76EHIWT7E3CEKGV6LQJUSOC63JYDPPDRV3KKOEGLP2FB5UEUEQC
3JPO7GIIDNQGW4SFJEQCNT4S7Z6XITFQ4DH6TMNDC3BW7ZHU53OAC
MWK2ADVAEU5LFANBGH3GPEHYXUHRN5BHVQKI7JIVY6DHE5I6JJ7AC
7F3ARXB7ZM4WPC3FOH5RHXFHWB3MGZ6XJYJX5IH6RFQNUKOWHAOQC
G7KNVIJW7ORXWK4IX3VHPW5DIILVOT6W7U6DEDLNUZWHNDGYWKEQC
HODZPQLZBLXXWP6NVKDWVSLFEN6HKIKOFG6QVX2PEJRYAM4I6ZQQC
ZT7ANUTIQZ4P6FB7ND64I5RDV742TUDQSYCKIKCHVJQILXI4GJJQC
ULP633GABRVOBHMS7H2MFFWPPSVQJWMNZVNEUCGZE32HEWMH4KIAC
4EYK4BECM65Q7PWBSARPRHL5MIPC4JGIWYIPFDCSAV6YWSZH67BQC
36DOXPCS6H5JNDEGCINCKM34KVMO3GO3PQ7ZLQTJUG2KVIEGY63QC
5QSNY2QCI5S5TMB7344ZCQWVHBEGWKAA6ECRRGI4ZHYXNVS3OHUQC
FFRVLC5ETQ5C5VBU6RM26MGEHS76DMISRZMAOOTYDIIGAQVTWVYAC
NOERIAFJQT4BVHDRI2OGECWRS2O5N5MGUIQJQT6AMPP46CSFUXCQC
XBXXQ7NGCA2AM7F6DODF75VNIA52J3MNYSLNAYRRC2KKYJUVCN2QC
FXA3ZBV64FML7W47IPHTAJFJHN3J3XHVHFVNYED47XFSBIGMBKRQC
SA3DIEDDBCZJXIMXOMIPR33TWH7ZX427Q7BKTYABR24ZXRELGJXQC
F4OH5H3ONZKUVBUI5NKYJ25B66FS22QQ7LRAO53OQX3ZBL2BL6JAC
DQXWDHCPZVQ5GXDGORO33WKVR5HYOODQ5CHGZXVMM6ESHCFM5FSQC
** TODO Attestation box ?
SCHEDULED: <2023-10-30 Mon>
** KILL Attestation box ?
CLOSED: [2023-11-02 Thu 22:29] SCHEDULED: <2023-10-30 Mon>
** TODO Arrêter résilier et souscrire nouveau contrat :internet:
SCHEDULED: <2023-11-13 Mon>
Appel 2023-11-03: ouverture nouvelle ligne dans 9 jours
RV demain
** TODO Relancer DAM pour conservation compte
SCHEDULED: <2023-10-27 Fri>
** TODO Mail DSI pour conservation compte (attente)
SCHEDULED: <2023-10-27 Fri>
** DONE Relancer DAM pour conservation compte
CLOSED: [2023-11-02 Thu 22:29] SCHEDULED: <2023-10-27 Fri>
** DONE Mail DSI pour conservation compte (attente)
CLOSED: [2023-11-02 Thu 22:29] SCHEDULED: <2023-10-27 Fri>
ub.com/NixOS/nixpkgs/issues/192396][Bug report Version 22.10.6]]
**** Notes
Erreur :
ERROR: Cannot download nextflow required file -- make sure you can connect to the internet
Alternatively you can try to download this file:
https://www.nextflow.io/releases/v22.10.6/nextflow-22.10.6-all.jar
and save it as:
.//nix/store/md2b1ah4d7ivj82k8xxap30dmdci00pa-nextflow-22.10.6/bin/.nextflow-wrapped
Dans la mise à jour, il y a la création d'un environnement virtuel qui casse l'exécution de nextflow (besoin de télécharger)
Fix = désactiver
**** KILL Patch NXF_OFFLINE=true
CLOSED: [2023-07-02 Sun 11:02] SCHEDULED: <2023-06-11 Sun>
** WAIT [[https://github.com/NixOS/nixpkgs/pull/249329][Multiqc]]
HG002,sanger-chr20,data/HG002-sanger-inserted-chr20_1.fq.gz,data/HG002-sanger-inserted-chr20_2.fq.gz
** KILL Mutalyzer
CLOSED: [2023-08-16 Wed 19:07] SCHEDULED: <2023-08-13 Sun>
Packaging faisable mais nombreux paquet python
** TODO Variant validator -> hgvs
C'est juste une interface autour d'hgvs mais il faut
- postgresql
- un accès ou télécharger des bases de données
Dépendences
s: wcwidth, pyee, pure-eval, ptyprocess, pickleshare, parsley, parse, fake-useragent, executing, backcall, appdirs, zipp, websockets, w3lib, urllib3, traitlets, tqdm, tabulate, sqlparse, soupsieve, six, pygments, psycopg2, prompt-toolkit, pexpect, parso, lxml, idna, humanfriendly, decorator, cython, cssselect, configparser, charset-normalizer, certifi, attrs, requests, pysam, pyquery, matplotlib-inline, jedi, importlib-metadata, coloredlogs, beautifulsoup4, asttokens, yoyo-migrations, stack-data, pyppeteer, bs4, bioutils, requests-html, ipython, biocommons.seqrepo, hgvs
** TODO SPIP T2T
*** DONE PR upstream
CLOSED: [2023-08-12 Sat 18:23] SCHEDULED: <2023-08-12 Sat 18:00>
*** DONE Mail R. Lemann
CLOSED: [2023-08-12 Sat 18:23] SCHEDULED: <2023-08-12 Sat 18:00>
*** TODO Mise à jour packages nix
** TODO VEP :vep:
*** DONE [[https://github.com/NixOS/nixpkgs/pull/185691][BioPerl]]
SCHEDULED: <2022-08-10 Wed>
/Entered on/ [2022-08-09 Tue 10:57]
PR submitted
*** TODO BioDBBBigFile
:PROPERTIES:
:ORDERED: t
:END:
/Entered on/ [2022-08-10 Wed 14:28]
On utilise la dernière version de kent, donc plus de problème.
PRête à être mergé. Rebase faite<2023-07-02 Sun>
**** DONE Version de kent déjà packagée : forcer version 335
CLOSED: [2023-07-02 Sun 11:20]
***** KILL [[https://github.com/NixOS/nixpkgs/pull/206991][Restore building kent 404]]
CLOSED: [2023-05-06 Sat 17:40]
Review faite <2023-03-26 Sun> , atteinte merge]
Relancé <2023-05-06 Sat>
Kent 446 n'a pas ce problème donc PR inutile
***** DONE [[https://github.com/NixOS/nixpkgs/pull/223411][Ajouter les header to package]] (inc folder)
CLOSED: [2023-05-08 Mon 10:18] SCHEDULED: <2023-05-07 Sun>
Review à faire
https://github.com/NixOS/nixpkgs/pull/223411
Corrigé et plus besoin de la PR précédente
***** KILL [[https://github.com/NixOS/nixpkgs/pull/186462][BioDBBBigFile]] avec ces 2 changements
CLOSED: [2023-07-02 Sun 11:20]
**** KILL Version de kent déjà packagée : 404
CLOSED: [2023-03-27 Mon 16:43]
Compile mais les tests de passent pas
**** DONE Modifier selon PR https://github.com/NixOS/nixpkgs/pull/186462
CLOSED: [2023-07-30 Sun 22:01] SCHEDULED: <2023-07-30 Sun 20:00>
:LOGBOOK:
CLOCK: [2023-07-30 Sun 19:13]--[2023-07-30 Sun 20:50] => 1:37
:END:
Modification nécessaire pour kent :
- plus de patch
- suppression d'une boucle dans postPatch
On supprime aussi NIX_BUILD_TOP
**** TODO Corriger PR biobigfile
SCHEDULED: <2023-11-03 Fri>
/Entered on/ [2023-10-15 Sun 17:21]
*** DONE [[https://github.com/NixOS/nixpkgs/pull/186459][BioDBHTS]]
CLOSED: [2023-05-06 Sat 08:49] SCHEDULED: <2023-04-15 Sat>
/Entered on/ [2022-08-10 Wed 14:28]
Correction pour review faites <2022-10-10 Mon>
*** DONE [[https://github.com/NixOS/nixpkgs/pull/186464][BioExtAlign]]
CLOSED: [2022-10-22 Sat 12:43] SCHEDULED: <2022-08-10 Wed>
/Entered on/ [2022-08-10 Wed 14:28]
Review <2022-10-10 Mon>, correction dans la journée.
Correction 2e passe, attente
Impossible de faire marcher les tests Car il ne trouve pas le module Bio::Tools::Align, qui est dans un dossier ailleurs dans le dépôt. Même en compilant tout le dépôt, cela ne fonctionne pas... On skip les tests.
*** TODO VEP
** WAIT [[https://github.com/NixOS/nixpkgs/pull/230394][rtg-tools]] :vcfeval:
Soumis
** WAIT Package Spip https://github.com/NixOS/nixpkgs/pull/247476
** TODO Happy :happy:
*** TODO PR python 3 upstream
SCHEDULED: <2023-10-29 Sun>
*** TODO nixpkgs en l'état
SCHEDULED: <2023-10-29 Sun>
** PROJ SpliceAI
** TODO Bamsurgeon
/Entered on/ [2023-05-13 Sat 19:11]
*** TODO Velvet
** TODO PR Picard avec option pour gérer la mémoire
Similaire à
https://github.com/bioconda/bioconda-recipes/blob/master/recipes/picard/picard.sh
* Julia :julia:
** KILL XAM.jl: PR pour modification record :julia:
CLOSED: [2023-05-29 Mon 15:40] SCHEDULED: <2023-05-28 Sun>
/Entered on/ [2023-05-27 Sat 22:39]
** TODO XAMscissors.jl :xamscissors:
Modification de la séquence dans BAM.
*Pas de mise à jour de CIGAR*
On convertit en fastq et on lance le pipeline pour "corriger"
#+begin_src sh
cd /home/alex/code/bisonex/out/63003856/preprocessing/mapped
samtools view 63003856_S135.bam NC_000022.11 -o 63003856_S135_chr22.bam
cd /home/alex/recherche/bisonex/code/BamScissors.jl
cp ~/code/bisonex/out/63003856/preprocessing/mapped/63003856_S135_chr22.bam .
samtools index 63003856_chr22.bam
#+end_src
Le script va modifier le bam, le trier et générer le fastq. !!!
Attention: ne pas oublier l'option -n !!!
#+begin_src sh
time julia --project=.. insertVariant.jl
scp 63003856_S135_chr22_{1,2}.fq.gz meso:/Work/Users/apraga/bisonex/tests/bamscissors/
#+end_src
*** WAIT Implémenter les SNV avec VAF :snv:
Stratégie :
1. calculer la profondeur sur les positions
2. créer un dictionnaire { nom du reads : position dataframe }
3. itérer sur tous les reads et changer ceux marqués
**** DONE VAF = 1
CLOSED: [2023-05-29 Mon 15:34]
**** DONE VAF selon loi normale
CLOSED: [2023-05-29 Mon 15:35]
Tronquée si > 1
**** WAIT Tests unitaires
***** DONE NA12878: 1 gène sur chromosome 22
CLOSED: [2023-05-30 Tue 23:55]
root = "https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/"
#+begin_src sh
samtools view project.NIST_NIST7035_H7AP8ADXX_NA12878.bwa.markDuplicates.bam chr22 -o project.NIST_NIST7035_H7AP8ADXX_NA12878_chr22.bam
samtools view project.NIST_NIST7035_H7AP8ADXX_NA12878_chr22.bam chr22:19419700-19424000 -o NIST7035_H7AP8ADXX_NA12878_chr22_MRPL40_hg19.bam
#+end_src
***** WAIT Pull request formatspeciment
https://github.com/BioJulia/FormatSpecimens.jl/pull/8
***** DONE Formatspecimens
CLOSED: [2023-05-29 Mon 23:03]
****** DONE 1 read
CLOSED: [2023-05-29 Mon 23:02]
****** DONE VAF sur 1 exon
CLOSED: [2023-05-29 Mon 23:03]
**** DONE [#A] Bug: perte de nombreux reads avec NA12878
CLOSED: [2023-08-19 Sat 20:45] SCHEDULED: <2023-08-18 Fri>
:PROPERTIES:
:ID: 5c1c36f3-f68e-4e6d-a7b6-61dca89abc37
:END:
Ex: chrX:g.124056226 : on passe de 65 reads à 1
Test xamscissors: pas de soucis...
On teste sur cette position +/- 200bp
#+begin_src sh :dir /home/alex/roam/research/bisonex/code/sanger
samtools view /home/alex/code/bisonex/out/2300346867_NA12878-63118093_S260-GRCh38/preprocessing/mapped/2300346867_NA12878-63118093_S260-GRCh38.bam chrX:124056026-124056426 -o chrXsmall.bam
#+end_src
#+RESULTS:
***** DONE Vérifier profondeur avec dernière version :
CLOSED: [2023-08-19 Sat 20:34] SCHEDULED: <2023-08-19 Sat>
****** DONE chr20: profondeur ok
SCHEDULED: <2023-08-19 Sat>
****** DONE toutes les données
CLOSED: [2023-08-19 Sat 20:34] SCHEDULED: <2023-08-19 Sat>
Ok pour 7 variants (IGV) notament chromosome X
*** TODO Implémenter les indel avec VAF :indel:
*** TODO Soumission paquet
* Données
:PROPERTIES:
:CATEGORY: data
:END:
** DONE Remplacer bam par fastq sur mesocentre
CLOSED: [2023-04-16 Sun 16:33]
Commande
*** DONE Supprimer les fastq non "paired"
CLOSED: [2023
ub.com/NixOS/nixpkgs/issues/192396][Bug report Version 22.10.6]]
**** Notes
Erreur :
ERROR: Cannot download nextflow required file -- make sure you can connect to the internet
Alternatively you can try to download this file:
https://www.nextflow.io/releases/v22.10.6/nextflow-22.10.6-all.jar
and save it as:
.//nix/store/md2b1ah4d7ivj82k8xxap30dmdci00pa-nextflow-22.10.6/bin/.nextflow-wrapped
Dans la mise à jour, il y a la création d'un environnement virtuel qui casse l'exécution de nextflow (besoin de télécharger)
Fix = désactiver
**** KILL Patch NXF_OFFLINE=true
CLOSED: [2023-07-02 Sun 11:02] SCHEDULED: <2023-06-11 Sun>
** WAIT [[https://github.com/NixOS/nixpkgs/pull/249329][Multiqc]]
HG002,sanger-chr20,data/HG002-sanger-inserted-chr20_1.fq.gz,data/HG002-sanger-inserted-chr20_2.fq.gz
** KILL Mutalyzer
CLOSED: [2023-08-16 Wed 19:07] SCHEDULED: <2023-08-13 Sun>
Packaging faisable mais nombreux paquet python
** TODO Variant validator -> hgvs
C'est juste une interface autour d'hgvs mais il faut
- postgresql
- un accès ou télécharger des bases de données
Dépendences
s: wcwidth, pyee, pure-eval, ptyprocess, pickleshare, parsley, parse, fake-useragent, executing, backcall, appdirs, zipp, websockets, w3lib, urllib3, traitlets, tqdm, tabulate, sqlparse, soupsieve, six, pygments, psycopg2, prompt-toolkit, pexpect, parso, lxml, idna, humanfriendly, decorator, cython, cssselect, configparser, charset-normalizer, certifi, attrs, requests, pysam, pyquery, matplotlib-inline, jedi, importlib-metadata, coloredlogs, beautifulsoup4, asttokens, yoyo-migrations, stack-data, pyppeteer, bs4, bioutils, requests-html, ipython, biocommons.seqrepo, hgvs
** TODO SPIP T2T
*** DONE PR upstream
CLOSED: [2023-08-12 Sat 18:23] SCHEDULED: <2023-08-12 Sat 18:00>
*** DONE Mail R. Lemann
CLOSED: [2023-08-12 Sat 18:23] SCHEDULED: <2023-08-12 Sat 18:00>
*** TODO Mise à jour packages nix
*** TODO Corriger PR
SCHEDULED: <2023-11-05 Sun>
** TODO VEP :vep:
*** DONE [[https://github.com/NixOS/nixpkgs/pull/185691][BioPerl]]
SCHEDULED: <2022-08-10 Wed>
/Entered on/ [2022-08-09 Tue 10:57]
PR submitted
*** TODO BioDBBBigFile
:PROPERTIES:
:ORDERED: t
:END:
/Entered on/ [2022-08-10 Wed 14:28]
On utilise la dernière version de kent, donc plus de problème.
PRête à être mergé. Rebase faite<2023-07-02 Sun>
**** DONE Version de kent déjà packagée : forcer version 335
CLOSED: [2023-07-02 Sun 11:20]
***** KILL [[https://github.com/NixOS/nixpkgs/pull/206991][Restore building kent 404]]
CLOSED: [2023-05-06 Sat 17:40]
Review faite <2023-03-26 Sun> , atteinte merge]
Relancé <2023-05-06 Sat>
Kent 446 n'a pas ce problème donc PR inutile
***** DONE [[https://github.com/NixOS/nixpkgs/pull/223411][Ajouter les header to package]] (inc folder)
CLOSED: [2023-05-08 Mon 10:18] SCHEDULED: <2023-05-07 Sun>
Review à faire
https://github.com/NixOS/nixpkgs/pull/223411
Corrigé et plus besoin de la PR précédente
***** KILL [[https://github.com/NixOS/nixpkgs/pull/186462][BioDBBBigFile]] avec ces 2 changements
CLOSED: [2023-07-02 Sun 11:20]
**** KILL Version de kent déjà packagée : 404
CLOSED: [2023-03-27 Mon 16:43]
Compile mais les tests de passent pas
**** DONE Modifier selon PR https://github.com/NixOS/nixpkgs/pull/186462
CLOSED: [2023-07-30 Sun 22:01] SCHEDULED: <2023-07-30 Sun 20:00>
:LOGBOOK:
CLOCK: [2023-07-30 Sun 19:13]--[2023-07-30 Sun 20:50] => 1:37
:END:
Modification nécessaire pour kent :
- plus de patch
- suppression d'une boucle dans postPatch
On supprime aussi NIX_BUILD_TOP
**** TODO Corriger PR biobigfile
SCHEDULED: <2023-11-03 Fri>
/Entered on/ [2023-10-15 Sun 17:21]
*** DONE [[https://github.com/NixOS/nixpkgs/pull/186459][BioDBHTS]]
CLOSED: [2023-05-06 Sat 08:49] SCHEDULED: <2023-04-15 Sat>
/Entered on/ [2022-08-10 Wed 14:28]
Correction pour review faites <2022-10-10 Mon>
*** DONE [[https://github.com/NixOS/nixpkgs/pull/186464][BioExtAlign]]
CLOSED: [2022-10-22 Sat 12:43] SCHEDULED: <2022-08-10 Wed>
/Entered on/ [2022-08-10 Wed 14:28]
Review <2022-10-10 Mon>, correction dans la journée.
Correction 2e passe, attente
Impossible de faire marcher les tests Car il ne trouve pas le module Bio::Tools::Align, qui est dans un dossier ailleurs dans le dépôt. Même en compilant tout le dépôt, cela ne fonctionne pas... On skip les tests.
*** TODO VEP
** WAIT [[https://github.com/NixOS/nixpkgs/pull/230394][rtg-tools]] :vcfeval:
Soumis
** WAIT Package Spip https://github.com/NixOS/nixpkgs/pull/247476
** TODO Happy :happy:
*** TODO PR python 3 upstream
SCHEDULED: <2023-11-06 Mon>
*** TODO nixpkgs en l'état
SCHEDULED: <2023-11-06 Mon>
** PROJ SpliceAI
** TODO Bamsurgeon
/Entered on/ [2023-05-13 Sat 19:11]
*** TODO Velvet
** TODO PR Picard avec option pour gérer la mémoire
Similaire à
https://github.com/bioconda/bioconda-recipes/blob/master/recipes/picard/picard.sh
* Julia :julia:
** KILL XAM.jl: PR pour modification record :julia:
CLOSED: [2023-05-29 Mon 15:40] SCHEDULED: <2023-05-28 Sun>
/Entered on/ [2023-05-27 Sat 22:39]
** TODO XAMscissors.jl :xamscissors:
Modification de la séquence dans BAM.
*Pas de mise à jour de CIGAR*
On convertit en fastq et on lance le pipeline pour "corriger"
#+begin_src sh
cd /home/alex/code/bisonex/out/63003856/preprocessing/mapped
samtools view 63003856_S135.bam NC_000022.11 -o 63003856_S135_chr22.bam
cd /home/alex/recherche/bisonex/code/BamScissors.jl
cp ~/code/bisonex/out/63003856/preprocessing/mapped/63003856_S135_chr22.bam .
samtools index 63003856_chr22.bam
#+end_src
Le script va modifier le bam, le trier et générer le fastq. !!!
Attention: ne pas oublier l'option -n !!!
#+begin_src sh
time julia --project=.. insertVariant.jl
scp 63003856_S135_chr22_{1,2}.fq.gz meso:/Work/Users/apraga/bisonex/tests/bamscissors/
#+end_src
*** WAIT Implémenter les SNV avec VAF :snv:
Stratégie :
1. calculer la profondeur sur les positions
2. créer un dictionnaire { nom du reads : position dataframe }
3. itérer sur tous les reads et changer ceux marqués
**** DONE VAF = 1
CLOSED: [2023-05-29 Mon 15:34]
**** DONE VAF selon loi normale
CLOSED: [2023-05-29 Mon 15:35]
Tronquée si > 1
**** WAIT Tests unitaires
***** DONE NA12878: 1 gène sur chromosome 22
CLOSED: [2023-05-30 Tue 23:55]
root = "https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/"
#+begin_src sh
samtools view project.NIST_NIST7035_H7AP8ADXX_NA12878.bwa.markDuplicates.bam chr22 -o project.NIST_NIST7035_H7AP8ADXX_NA12878_chr22.bam
samtools view project.NIST_NIST7035_H7AP8ADXX_NA12878_chr22.bam chr22:19419700-19424000 -o NIST7035_H7AP8ADXX_NA12878_chr22_MRPL40_hg19.bam
#+end_src
***** WAIT Pull request formatspeciment
https://github.com/BioJulia/FormatSpecimens.jl/pull/8
***** DONE Formatspecimens
CLOSED: [2023-05-29 Mon 23:03]
****** DONE 1 read
CLOSED: [2023-05-29 Mon 23:02]
****** DONE VAF sur 1 exon
CLOSED: [2023-05-29 Mon 23:03]
**** DONE [#A] Bug: perte de nombreux reads avec NA12878
CLOSED: [2023-08-19 Sat 20:45] SCHEDULED: <2023-08-18 Fri>
:PROPERTIES:
:ID: 5c1c36f3-f68e-4e6d-a7b6-61dca89abc37
:END:
Ex: chrX:g.124056226 : on passe de 65 reads à 1
Test xamscissors: pas de soucis...
On teste sur cette position +/- 200bp
#+begin_src sh :dir /home/alex/roam/research/bisonex/code/sanger
samtools view /home/alex/code/bisonex/out/2300346867_NA12878-63118093_S260-GRCh38/preprocessing/mapped/2300346867_NA12878-63118093_S260-GRCh38.bam chrX:124056026-124056426 -o chrXsmall.bam
#+end_src
#+RESULTS:
***** DONE Vérifier profondeur avec dernière version :
CLOSED: [2023-08-19 Sat 20:34] SCHEDULED: <2023-08-19 Sat>
****** DONE chr20: profondeur ok
SCHEDULED: <2023-08-19 Sat>
****** DONE toutes les données
CLOSED: [2023-08-19 Sat 20:34] SCHEDULED: <2023-08-19 Sat>
Ok pour 7 variants (IGV) notament chromosome X
*** TODO Implémenter les indel avec VAF :indel:
*** TODO Soumission paquet
* Données
:PROPERTIES:
:CATEGORY: data
:END:
** DONE Remplacer bam par fastq sur mesocentre
CLOSED: [2023-04-16 Sun 16:33]
Commande
*** DONE Supprimer les fastq non "paired"
CLOSED: [2023
<(aligneur)>
CLOSED: [2023-10-13 Fri 17:40] SCHEDULED: <2023-10-01 Sun>
*** DONE Figure: nombre d'articles citant les principaux aligneur par année
CLOSED: [2023-10-11 Wed 23:54] SCHEDULED: <2023-10-03 Tue>
Il faudrait utiliser pubmed en local, sinon c'est 10 000 requete par aligner !
*** DONE Figure: nombre d'articles citant les principaux aligneur
CLOSED: [2023-10-12 Thu 23:58] SCHEDULED: <2023-10-12 Thu>
Il faudrait utiliser pubmed en local, sinon c'est 10 000 requete par aligner !
On se base sur
** Appel de variant
*** TODO Biblio <(biblio appel variant)> <(appel variant)>
SCHEDULED: <2023-10-18 Wed>
*** TODO Figure: nombre de publication par appel de variant
SCHEDULED: <2023-10-21 Sat>
/Entered on/ [2023-09-19 Tue 08:43]
** TODO Figure: nombre d'exomes par années
SCHEDULED: <2023-11-02 Thu>
/Entered on/ [2023-09-19 Tue 08:43]
* Tests :tests:
** KILL Non régression : version prod
CLOSED: [2023-05-23 Tue 08:46]
*** DONE ID common snp
CLOSED: [2022-11-19 Sat 21:36]
#+begin_src
$ wc -l ID_of_common_snp.txt
23194290 ID_of_common_snp.txt
$ wc -l /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp.txt
23194290 /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp.txt
#+end_src
*** DONE ID common snp not clinvar patho
CLOSED: [2022-12-11 Sun 20:11]
**** DONE Vérification du problème
CLOSED: [2022-12-11 Sun 16:30]
Sur le J:
21155134 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt.ref
Version de "non-régression"
21155076 database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt
Nouvelle version
23193391 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt
Si on enlève les doublons
$ sort database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt | uniq > old.txt
$ wc -l old.txt
21107097 old.txt
$ sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt | uniq > new.txt
$ wc -l new.txt
21174578 new.txt
$ sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt.ref | uniq > ref.txt
$ wc -l ref.txt
21107155 ref.txt
Si on regarde la différence
comm -23 ref.txt old.txt
rs1052692
rs1057518973
rs1057518973
rs11074121
rs112848754
rs12573787
rs145033890
rs147889095
rs1553904159
rs1560294695
rs1560296615
rs1560310926
rs1560325547
rs1560342418
rs1560356225
rs1578287542
...
On cherche le premier
bcftools query -i 'ID="rs1052692"' database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 1619351 C A,T
Il est bien patho...
$ bcftools query -i 'POS=1619351' database/clinvar/clinvar.vcf.gz -f '%CHROM %POS %REF %ALT %INFO/CLNSIG\n'
19 1619351 C T Conflicting_interpretations_of_pathogenicity
On vérifie pour tous les autres
$ comm -23 ref.txt old.txt > tocheck.txt
On génère les régions à vérifier (chromosome number:position)
$ bcftools query -i 'ID=@tocheck.txt' database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM\t%POS\n' > tocheck.pos
On génère le mapping inverse (chromosome number -> NC)
$ awk ' { t = $1; $1 = $2; $2 = t; print; } ' database/RefSeq/refseq_to_number_only_consensual.txt > mapping.txt
On remap clinvar
$ bcftools annotate --rename-chrs mapping.txt database/clinvar/clinvar.vcf.gz -o clinvar_remapped.vcf.gz
$ tabix clinvar_remapped.vcf.gz
Enfin, on cherche dans clinvar la classification
$ bcftools query -R tocheck.pos clinvar_remapped.vcf.gz -f '%CHROM %POS %INFO/CLNSIG\n'
$ bcftools query -R tocheck.pos database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM %POS %ID \n' | grep '^NC'
#+RESULTS:
**** DONE Comprendre pourquoi la nouvelle version donne un résultat différent
CLOSED: [2022-12-11 Sun 20:11]
***** DONE Même version dbsnp et clinvar ?
CLOSED: [2022-12-10 Sat 23:02]
Clinvar différent !
$ bcftools stats clinvar.gz
clinvar (Alexis)
SN 0 number of samples: 0
SN 0 number of records: 1492828
SN 0 number of no-ALTs: 965
SN 0 number of SNPs: 1338007
SN 0 number of MNPs: 5562
SN 0 number of indels: 144580
SN 0 number of others: 3714
SN 0 number of multiallelic sites: 0
SN 0 number of multiallelic SNP sites: 0
clinvar (new)
SN 0 number of samples: 0
SN 0 number of records: 1493470
SN 0 number of no-ALTs: 965
SN 0 number of SNPs: 1338561
SN 0 number of MNPs: 5565
SN 0 number of indels: 144663
SN 0 number of others: 3716
SN 0 number of multiallelic sites: 0
SN 0 number of multiallelic SNP sites: 0
***** DONE Mettre à jour clinvar et dbnSNP pour travailler sur les mêm bases
CLOSED: [2022-12-11 Sun 12:10]
Problème persiste
***** DONE Supprimer la conversion en int du chromosome
CLOSED: [2022-12-10 Sat 19:29]
***** KILL Même NC ?
CLOSED: [2022-12-10 Sat 19:29]
$ zgrep "contig=<ID=NC_\(.*\)" clinvar/GRCh38/clinvar.vcf.gz > contig.clinvar
$ diff contig.txt contig.clinvar
< ##contig=<ID=NC_012920.1>
***** DONE Tester sur chromosome 19: ok
CLOSED: [2022-12-11 Sun 13:53]
On prépare les données
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz -o dbSNP_common_19.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz -o clinvar_19.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data-alexis/dbSNP/dbSNP_common.vcf.gz -o dbSNP_common_19_old.vcf.gz
bcftools filter -i 'CHROM="19"' /Work/Groups/bisonex/data-alexis/clinvar/clinvar.vcf.gz -o clinvar_19_old.vcf.gz
#+end_src
On récupère les 2 versions du script
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
git checkout regression ../../script/pythonScript/clinvar_sbSNP.py
cp ../../script/pythonScript/clinvar_sbSNP.py clinvar_sbSNP_old.py
git checkout HEAD ../../script/pythonScript/clinvar_sbSNP.py
#+end_src
#+RESULTS:
On compare
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
python ../../script/pythonScript/clinvar_sbSNP.py clinvar_sbSNP.py --clinvar clinvar_19.vcf.gz --dbSNP dbSNP_common_19.vcf.gz --output tmp.txt
sort tmp.txt | uniq > new.txt
table=/Work/Groups/bisonex/data-alexis/RefSeq/refseq_to_number_only_consensual.txt
python clinvar_sbSNP_old.py --clinvar clinvar_19_old.vcf.gz --dbSNP dbSNP_common_19_old.vcf.gz --output tmp_old.txt --chrm_name_table $table
sort tmp_old.txt | uniq > old.txt
wc -l old.txt new.txt
#+end_src
#+RESULTS:
| 535155 | old.txt |
| 535194 | new.txt |
| 1070349 | total |
Si on prend le premier manquant dans new, il est conflicting patho donc il ne devrait pas y être...
$ bcftools query -i 'ID="rs10418277"' dbSNP
_common_19.vcf.gz -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 54939682 C G,T
$ bcftools query -i 'ID="rs10418277"' dbSNP_common_19_old.vcf.gz -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 54939682 C G,T
$ bcftools query -i 'POS=54939682' clinvar_19.vcf.gz -f '%POS %REF %ALT %INFO/CLNSIG\n'
54939682 C G Conflicting_interpretations_of_pathogenicity
54939682 C T Benign
$ bcftools query -i 'POS=54939682' clinvar_19_old.vcf.gz -f '%POS %REF %ALT %INFO/CLNSIG\n'
54939682 C G Conflicting_interpretations_of_pathogenicity
54939682 C T Benign
$ grep rs10418277 *.txt
new.txt:rs10418277
tmp.txt:rs10418277
Le problème venait de la POS qui n'était plus convertie en int (suppression de la ligne par erreur ??)
On vérifie
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
python ../../script/pythonScript/clinvar_sbSNP.py --clinvar clinvar_19.vcf.gz --dbSNP dbSNP_common_19.vcf.gz --output tmp.txt
sort tmp.txt | uniq > new.txt
table=/Work/Groups/bisonex/data-alexis/RefSeq/refseq_to_number_only_consensual.txt
python clinvar_sbSNP_old.py --clinvar clinvar_19_old.vcf.gz --dbSNP dbSNP_common_19_old.vcf.gz --output tmp_old.txt --chrm_name_table $table
sort tmp_old.txt | uniq > old.txt
wc -l old.txt new.txt
diff old.txt new.txt
#+end_src
#+RESULTS:
| 535155 | old.txt |
| 535155 | new
<(aligneur)>
CLOSED: [2023-10-13 Fri 17:40] SCHEDULED: <2023-10-01 Sun>
*** DONE Figure: nombre d'articles citant les principaux aligneur par année
CLOSED: [2023-10-11 Wed 23:54] SCHEDULED: <2023-10-03 Tue>
Il faudrait utiliser pubmed en local, sinon c'est 10 000 requete par aligner !
*** DONE Figure: nombre d'articles citant les principaux aligneur
CLOSED: [2023-10-12 Thu 23:58] SCHEDULED: <2023-10-12 Thu>
Il faudrait utiliser pubmed en local, sinon c'est 10 000 requete par aligner !
On se base sur
** Appel de variant
*** TODO Biblio <(biblio appel variant)> <(appel variant)>
SCHEDULED: <2023-11-04 Sat>
*** TODO Figure: nombre de publication par appel de variant
SCHEDULED: <2023-11-05 Sun>
/Entered on/ [2023-09-19 Tue 08:43]
** TODO Figure: nombre d'exomes par années
SCHEDULED: <2023-11-16 Thu>
/Entered on/ [2023-09-19 Tue 08:43]
* Tests :tests:
** KILL Non régression : version prod
CLOSED: [2023-05-23 Tue 08:46]
*** DONE ID common snp
CLOSED: [2022-11-19 Sat 21:36]
#+begin_src
$ wc -l ID_of_common_snp.txt
23194290 ID_of_common_snp.txt
$ wc -l /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp.txt
23194290 /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp.txt
#+end_src
*** DONE ID common snp not clinvar patho
CLOSED: [2022-12-11 Sun 20:11]
**** DONE Vérification du problème
CLOSED: [2022-12-11 Sun 16:30]
Sur le J:
21155134 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt.ref
Version de "non-régression"
21155076 database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt
Nouvelle version
23193391 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt
Si on enlève les doublons
$ sort database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt | uniq > old.txt
$ wc -l old.txt
21107097 old.txt
$ sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt | uniq > new.txt
$ wc -l new.txt
21174578 new.txt
$ sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt.ref | uniq > ref.txt
$ wc -l ref.txt
21107155 ref.txt
Si on regarde la différence
comm -23 ref.txt old.txt
rs1052692
rs1057518973
rs1057518973
rs11074121
rs112848754
rs12573787
rs145033890
rs147889095
rs1553904159
rs1560294695
rs1560296615
rs1560310926
rs1560325547
rs1560342418
rs1560356225
rs1578287542
...
On cherche le premier
bcftools query -i 'ID="rs1052692"' database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 1619351 C A,T
Il est bien patho...
$ bcftools query -i 'POS=1619351' database/clinvar/clinvar.vcf.gz -f '%CHROM %POS %REF %ALT %INFO/CLNSIG\n'
19 1619351 C T Conflicting_interpretations_of_pathogenicity
On vérifie pour tous les autres
$ comm -23 ref.txt old.txt > tocheck.txt
On génère les régions à vérifier (chromosome number:position)
$ bcftools query -i 'ID=@tocheck.txt' database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM\t%POS\n' > tocheck.pos
On génère le mapping inverse (chromosome number -> NC)
$ awk ' { t = $1; $1 = $2; $2 = t; print; } ' database/RefSeq/refseq_to_number_only_consensual.txt > mapping.txt
On remap clinvar
$ bcftools annotate --rename-chrs mapping.txt database/clinvar/clinvar.vcf.gz -o clinvar_remapped.vcf.gz
$ tabix clinvar_remapped.vcf.gz
Enfin, on cherche dans clinvar la classification
$ bcftools query -R tocheck.pos clinvar_remapped.vcf.gz -f '%CHROM %POS %INFO/CLNSIG\n'
$ bcftools query -R tocheck.pos database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM %POS %ID \n' | grep '^NC'
#+RESULTS:
**** DONE Comprendre pourquoi la nouvelle version donne un résultat différent
CLOSED: [2022-12-11 Sun 20:11]
***** DONE Même version dbsnp et clinvar ?
CLOSED: [2022-12-10 Sat 23:02]
Clinvar différent !
$ bcftools stats clinvar.gz
clinvar (Alexis)
SN 0 number of samples: 0
SN 0 number of records: 1492828
SN 0 number of no-ALTs: 965
SN 0 number of SNPs: 1338007
SN 0 number of MNPs: 5562
SN 0 number of indels: 144580
SN 0 number of others: 3714
SN 0 number of multiallelic sites: 0
SN 0 number of multiallelic SNP sites: 0
clinvar (new)
SN 0 number of samples: 0
SN 0 number of records: 1493470
SN 0 number of no-ALTs: 965
SN 0 number of SNPs: 1338561
SN 0 number of MNPs: 5565
SN 0 number of indels: 144663
SN 0 number of others: 3716
SN 0 number of multiallelic sites: 0
SN 0 number of multiallelic SNP sites: 0
***** DONE Mettre à jour clinvar et dbnSNP pour travailler sur les mêm bases
CLOSED: [2022-12-11 Sun 12:10]
Problème persiste
***** DONE Supprimer la conversion en int du chromosome
CLOSED: [2022-12-10 Sat 19:29]
***** KILL Même NC ?
CLOSED: [2022-12-10 Sat 19:29]
$ zgrep "contig=<ID=NC_\(.*\)" clinvar/GRCh38/clinvar.vcf.gz > contig.clinvar
$ diff contig.txt contig.clinvar
< ##contig=<ID=NC_012920.1>
***** DONE Tester sur chromosome 19: ok
CLOSED: [2022-12-11 Sun 13:53]
On prépare les données
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz -o dbSNP_common_19.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz -o clinvar_19.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data-alexis/dbSNP/dbSNP_common.vcf.gz -o dbSNP_common_19_old.vcf.gz
bcftools filter -i 'CHROM="19"' /Work/Groups/bisonex/data-alexis/clinvar/clinvar.vcf.gz -o clinvar_19_old.vcf.gz
#+end_src
On récupère les 2 versions du script
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
git checkout regression ../../script/pythonScript/clinvar_sbSNP.py
cp ../../script/pythonScript/clinvar_sbSNP.py clinvar_sbSNP_old.py
git checkout HEAD ../../script/pythonScript/clinvar_sbSNP.py
#+end_src
#+RESULTS:
On compare
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
python ../../script/pythonScript/clinvar_sbSNP.py clinvar_sbSNP.py --clinvar clinvar_19.vcf.gz --dbSNP dbSNP_common_19.vcf.gz --output tmp.txt
sort tmp.txt | uniq > new.txt
table=/Work/Groups/bisonex/data-alexis/RefSeq/refseq_to_number_only_consensual.txt
python clinvar_sbSNP_old.py --clinvar clinvar_19_old.vcf.gz --dbSNP dbSNP_common_19_old.vcf.gz --output tmp_old.txt --chrm_name_table $table
sort tmp_old.txt | uniq > old.txt
wc -l old.txt new.txt
#+end_src
#+RESULTS:
| 535155 | old.txt |
| 535194 | new.txt |
| 1070349 | total |
Si on prend le premier manquant dans new, il est conflicting patho donc il ne devrait pas y être...
$ bcftools query -i 'ID="rs10418277"' dbSNP
_common_19.vcf.gz -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 54939682 C G,T
$ bcftools query -i 'ID="rs10418277"' dbSNP_common_19_old.vcf.gz -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 54939682 C G,T
$ bcftools query -i 'POS=54939682' clinvar_19.vcf.gz -f '%POS %REF %ALT %INFO/CLNSIG\n'
54939682 C G Conflicting_interpretations_of_pathogenicity
54939682 C T Benign
$ bcftools query -i 'POS=54939682' clinvar_19_old.vcf.gz -f '%POS %REF %ALT %INFO/CLNSIG\n'
54939682 C G Conflicting_interpretations_of_pathogenicity
54939682 C T Benign
$ grep rs10418277 *.txt
new.txt:rs10418277
tmp.txt:rs10418277
Le problème venait de la POS qui n'était plus convertie en int (suppression de la ligne par erreur ??)
On vérifie
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
python ../../script/pythonScript/clinvar_sbSNP.py --clinvar clinvar_19.vcf.gz --dbSNP dbSNP_common_19.vcf.gz --output tmp.txt
sort tmp.txt | uniq > new.txt
table=/Work/Groups/bisonex/data-alexis/RefSeq/refseq_to_number_only_consensual.txt
python clinvar_sbSNP_old.py --clinvar clinvar_19_old.vcf.gz --dbSNP dbSNP_common_19_old.vcf.gz --output tmp_old.txt --chrm_name_table $table
sort tmp_old.txt | uniq > old.txt
wc -l old.txt new.txt
diff old.txt new.txt
#+end_src
#+RESULTS:
| 535155 | old.txt |
| 535155 | new
TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision |
| INDEL | 549 | 489 | 60 | 899 | 64 | 340 | 8 | 17 | 0.890710 | 0.885510 |
| SNP | 21973 | 21462 | 511 | 26285 | 563 | 4263 | 68 | 16 | 0.976744 | 0.974435 |
****** DONE Interesection des bed: similaire
CLOSED: [2023-07-04 Tue 23:11]
HG38
#+begin_src sh
bedtools intersect -a capture/Agilent_SureSelect_All_Exons_v7_hg38_Regions.bed -b /Work/Groups/bisonex/data/giab/GRCh38/HG001_GRCh38_1_22_v4.2.1_benchmark.bed | wc -l
#+end_src
204280
T2T
#+begin_src sh
bedtools intersect -a /Work/Groups/bisonex/data/giab/T2T/Agilent_SureSelect_All_Exons_v7_hg38_Regions_hg38_T2T.bed -b /Work/Groups/bisonex/data/giab/T2T/HG001_GRCh38_1_22_v4.2.1_benchmark_hg38_T2T.bed | wc -l
#+end_src
204021
****** DONE Vérifier la ligne de commande
CLOSED: [2023-07-04 Tue 23:38]
#+begin_src sh
hap.py \
HG001_GRCh38_1_22_v4_lifted_merged.vcf.gz \
HG001-SRX11061486_SRR14724513-T2T.vcf.gz \
\
--reference chm13v2.0.fa \
--threads 6 \
\
-T Agilent_SureSelect_All_Exons_v7_hg38_Regions_hg38_T2T.bed \
--false-positives HG001_GRCh38_1_22_v4.2.1_benchmark_hg38_T2T.bed \
\
-o HG001
#+end_src
****** DONE Corriger FILTER : mieux mais toujours trop de négatifs. 3/4 SNP retrouvés
CLOSED: [2023-07-08 Sat 15:19] SCHEDULED: <2023-07-08 Sat>
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 413 246 167 751 289 215 2 98 0.595642 0.460821 0.286285 0.519629 NaN NaN 2.428571 2.465116
INDEL PASS 413 246 167 751 289 215 2 98 0.595642 0.460821 0.286285 0.519629 NaN NaN 2.428571 2.465116
SNP ALL 15883 15479 404 23597 5277 2841 46 44 0.974564 0.745760 0.120397 0.844947 3.017198 2.85705 5.560099 2.114633
SNP PASS 15883 15479 404 23597 5277 2841 46 44 0.974564 0.745760 0.120397 0.844947 3.017198 2.85705 5.560099 2.114633
******* DONE Vérifier qu'il ne reste plus de filtre autre que PASS
CLOSED: [2023-07-08 Sat 15:19]
#+begin_src
$ zgrep -c 'PASS' HG001_GRCh38_1_22_v4_lifted_merged.vcf.gz
3730505
$ zgrep -c '^chr' HG001_GRCh38_1_22_v4_lifted_merged.vcf.gz
3730506
#+end_src
****** TODO 1/4 SNP manquant ?
******* DONE Regarder avec Julia si ce sont vraiment des FP: 61/5277 qui ne le sont pas
CLOSED: [2023-07-09 Sun 12:09]
******* DONE Examiner les FP
CLOSED: [2023-07-30 Sun 22:05]
******* DONE Tester un FP
CLOSED: [2023-07-30 Sun 22:05]
2 │ chr1 608765 A G ./.:.:.:.:NOCALL:nocall:. 1/1:FP:.:ti:SNP:homalt:188
liftDown UCSC: rien en GIAB : vrai FP
3 │ chr1 762943 A G ./.:.:.:.:NOCALL:nocall:. 1/1:FP:.:ti:SNP:homalt:287
4 │ chr1 762945 A T ./.:.:.:.:NOCALL:nocall:. 1/1:FP:.:tv:SNP:homalt:287
Remaniements complexes ? Pas dans le gène en HG38
******* DONE La plupart des FP (4705/5566) sont homozygotes: erreur de référence ?
CLOSED: [2023-07-12 Wed 21:10] SCHEDULED: <2023-07-09 Sun>
Sur les 2 premiers variants, ils montrent en fait la différence entre T2T et GRCh38
Erreur à l'alignement ?
******** KILL relancer l'alignement
CLOSED: [2023-07-09 Sun 17:36]
******** DONE vérifier reads identiques hg38 et T2T: oui
CLOSED: [2023-07-09 Sun 16:36]
T2T CHR1608765
38 chr1:1180168-1180168 (
SRR14724513.24448214
SRR14724513.24448214
******* DONE Vérifier quelques variants sur IGV
CLOSED: [2023-07-09 Sun 17:36]
******* KILL Répartition des FP : cluster ?
CLOSED: [2023-07-09 Sun 17:36]
****** DONE Examiner les FP restant après correction selon séquence de référence
CLOSED: [2023-08-12 Sat 15:57]
****** HOLD Examiner les variants supprimé
****** TODO Enlever les FP qui correspondent à un changement dans le génome
******* Condition:
- pas de variation à la position en GRCh38
- variantion homozygote
- la varation en T2T correspond au changement de pair de base GRC38 -> T2T
pour les SNP:
alt_T2T[i] = DNA_GRC38[j]
avec i la position en T2T et j la position en GRCh38
Note: définir un ID n'est pas correct car les variants peuvent être modifié par happy !
******* Idée
- Pour chaque FP, c'est un "faux" FP si
- REF en hg38 == ALT en T2T
- et REF en hg38 != REF en T2T
- et variant homozygote
Comment obtenir les séquences de réferences ?
1. liftover
2. blat sur la séquence autour du variant
3. identifier quelques reads contenant le variant et regarder leur aligneement en hg38
Après discussion avec Alexis: solution 3
******* Algorithme
1. Extraire les coordonnées en T2T des faux positifs *homozygote*
2. Pour chaque faux positif
1. lister 10 reads contenant le variant
2. pour chacun de ces reads, récupérer la séquence en T2T et GRCh38 via le nom du read dans le bam
3. si la séquence en T2T modifiée par le variant est "identique" à celle en GRCh38, alors on ignore ce faux positif
Note: on ignore les reads qui ont changé de chromosome entre les version
******* DONE Résultat préliminaire
CLOSED: [2023-07-23 Sun 14:30]
cf [[file:~/roam/research/bisonex/code/giab/giab-corrected.csv][script julia]]
3498 faux positifs en moins, soit 0.89 sensibilité
julia> tp=15479
julia> fp=5277
julia> tp/(tp+fp)
0.7457602620928888
julia> tp/(tp+(fp-3498))
0.8969173716537258
On est toujours en dessous des 97%
******* HOLD Corriger proprement VCF ou résultats Happy
******* TODO Adapter pour gérer plusieurs variants par read
****** DONE Méthodologie du pangenome
CLOSED: [2023-10-03 Tue 21:28]
Voir biblio[cite:@liao2023] mais ont aligné sur GRCH38
******* DONE Mail alexis
CLOSED: [2023-10-03 Tue 21:28]
****** DONE Méthodologie T2T
CLOSED: [2023-10-16 Mon 19:42]
Mail alexis
SCHEDULED: <2023-10-04 Wed>
***** TODO Rendre simplement le nombre de vrais positifs
SCHEDULED: <2023-11-02 Thu>
***** KILL Mail Yannis
CLOSED: [2023-07-08 Sat 10:44]
***** DONE Mail GIAB pour version T2T
CLOSED: [2023-07-07 Fri 18:37]
**** TODO HG002 :hg002:T2T:
**** TODO HG003 :hg003:T2T:
**** TODO HG004 :hg004:T2T:
**** DONE Plot : ashkenazim trio :hg38:
CLOSED: [2023-07-30 Sun 16:49] SCHEDULED: <2023-07-30 Sun 15:00>
:LOGBOOK:
CLOCK: [2023-07-30 Sun 16:06]--[2023-07-30 Sun 16:35] => 0:29
CLOCK: [2023-07-30 Sun 15:39]--[2023-07-30 Sun 15:40] => 0:01
:END:
/Entered on/ [2023-04-16 Sun 17:29]
Refaire résultats
**** DONE Mail Paul sur les résultat ashkenazim +/- centogene
CLOSED: [2023-08-06 Sun 20:24] SCHEDULED: <2023-08-06 Sun>
**** DONE Relancer comparaison GIAB avec GATK 4.4.0
CLOSED: [2023-08-12 Sat 15:55]
/Entered on/ [2023-08-03 Thu 12:42]
*** TODO Platinum genome :platinum:
https://emea.illumina.com/platinumgenomes.html
**** TODO Tester sur la zone couverte par l'exome centogène
SCHEDULED: <2023-11-02 Thu>
*** DONE Séquencer NA12878 :cento:hg001:
CLOSED: [2023-10-07 Sat 17:59]
Discussion avec Paul : sous-traitant ne nous donnera pas les données, il faut commander l'ADN
**** DONE ADN commandé
CLOSED: [2023-06-30 Fri 22:29]
**** DONE Sauvegarder les données brutes
CLOSED: [2023-07-30 Sun 14:22] SCHEDULED: <2023-07-19 Wed>
K, scality, S
**** KILL Récupérer le fichier de capture
CLOSED: [2023-07-30 Sun 14:25] SCHEDULED: <2023-07-23 Sun>
Candidats
donnés dans publication https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8354858/
#+begin_quote
In short, the Nextera Rapid Capture Exome Kit (Illumina, San Diego, CA), the SureSelect Human All Exon kit (Agilent, Santa Clara, CA) or the Twist Human Core Exome was used for enrichment, and a Nextseq500, HiSeq4000, or Novoseq 6000 (Illumina) instrument was used for the actual sequencing, with the average coverage targeted to at least 100× or at least 98% of the target DNA covered 20×.
#+end_quote
Par défaut, on utilisera https://www.twistbioscience.com/products/ngs/alliance-panels#tab-3
ANnonce récente pour nouveau panel Twist : https://www.centogene.com/news-events/news/newsdetails/twist-bioscience-and-centogene-launch-three-panels-to-advance-rare-disease-and-hereditary-cancer-research-and-support-diagnostics
Masi pas de fichier BED
***** DONE Mail centogène
CLOSED: [2023-07-30 Sun 14:22] DEADLINE: <2023-07-23 Sun>
**** DONE Tester Nextera Rapid Capture Exome v1.2 (hg19) :giab:
CLOSED: [2023-08-06 Sun 19:05] SCHEDULED: <2023-08-03 Thu 19:00>
https://support.illumina.com/downloads/nextera-rapid-capture-exome-v1-2-product-files.html
***** DONE Liftover capture
CLOSED: [2023-08-06 Sun 18:30] SCHEDULED: <2023-08-06 Sun>
#+begin_src sh
nextflow run -profile standard,helios workflows/lift-nextera-capture.nf -lib lib
#+end_src
Vérification rapide : ok
***** DONE Run
CLOSED: [2023-08-06 Sun 19:05] SCHEDULED: <2023-08-06 Sun>
#+begin_src sh
nextflow run workflows/compareVCF.nf -profile standard,helios --query=out/2300346867_NA12878-63118093_S260-GRCh38/callVariant/haplotypecaller/2300346867_NA12878-63118093_S260-GRCh38.vcf.gz --outdir=out/2300346867_NA12878-63118093_S260-GRCh38/happy-nextera-lifted/ --compare=happy -lib lib --capture=capture/nexterarapidcapture_exome_targetedregions_v1.2-nochrM_lifted.bed --id=HG001 --genome=GRCh38
#+end_src
**** DONE Tester Agilent SureSelect All Exon V8 (hg38) :giab:
CLOSED: [2023-07-31 Mon 23:09] SCHEDULED: <2023-07-31 Mon>
https://earray.chem.agilent.com/suredesign/index.htm
"Find design"
"Agilent catalog"
Fichiers:
- Regions.bed: Targeted exon intervals, curated and targeted by Agilent Technologies
- MergedProbes.bed: Merged probes for targeted enrichment of exons described in Regions.bed
- Covered.bed: Merged probes and sequences with 95% homology or above
- Padded.bed: Merged probes and sequences with 95% homology or above extended 50 bp at each side
- AllTracks.bed: Targeted regions and covered tracks
#+begin_src sh
nextflow run workflows/compareVCF.nf -profile standard,helios --query=out/2300346867_63118093_NA12878-GRCh38/callVariant/haplotypecaller/2300346867_63118093_NA12878-GRCh38.vcf.gz --outdir=out/2300346867_63118093_NA12878-GRCh38/happy/ --compare=happy -lib lib --capture=capture/Agilent_SureSelect_All_Exons_v8_hg38_Regions.bed --id=HG001 --genome=GRCh38
#+end_src
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL | 423 | 395 | 28 | 915 | 108 | 405 | 4 | 13 | 0.933806 | 0.788235 | 0.442623 | 0.854868 | | | 1.7012987012987013 | 2.7916666666666665 |
| INDEL | PASS | 423 | 395 | 28 | 915 | 108 | 405 | 4 | 13 | 0.933806 | 0.788235 | 0.442623 | 0.854868 | | | 1.7012987012987013 | 2.7916666666666665 |
| SNP | ALL | 20984 | 20600 | 384 | 26080 | 780 | 4703 | 62 | 10 | 0.9817 | 0.963512 | 0.18033 | 0.972521 | 3.0499710592321048 | 2.7596541786743516 | 1.58256372367935 | 1.8978207694018234 |
| SNP | PASS | 20984 | 20600 | 384 | 26080 | 780 | 4703 | 62 | 10 | 0.9817 | 0.963512 | 0.18033 | 0.972521 | 3.0499710592321048 | 2.7596541786743516 | 1.58256372367935 | 1.8978207694018234 |
**** DONE Test Twist Human core Exome (hg38):giab:
CLOSED: [2023-08-01 Tue 23:16] SCHEDULED: <202 3-08-02 Wed>
https://www.twistbioscience.com/resources/data-files/ngs-human-core-exome-panel-bed-file
#+begin_src
nextflow run workflows/compareVCF.nf -profile standard,helios --query=out/2300346867_63118093_NA12878-GRCh38/callVariant/haplotypecaller/2300346867_63118093_NA12878-GRCh38.vcf.gz --outdir=out/2300346867_63118093_NA12878-GRCh38/happy-twist-exome-core/ --compare=happy -lib lib --capture=capture/Twist_Exome_Core_Covered_Targets_hg38.bed --id=HG001 --genome=GRCh38 -bg
#+end_src
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL | 328 | 313 | 15 | 722 | 95 | 309 | 4 | 13 | 0.954268 | 0.769976 | 0.427978 | 0.852273 | | | 1.8584070796460177 | 2.8967391304347827 |
| INDEL | PASS | 328 | 313 | 15 | 722 | 95 | 309 | 4 | 13 | 0.954268 | 0.769976 | 0.427978 | 0.852273 | | | 1.8584070796460177 | 2.8967391304347827 |
| SNP | ALL | 19198 | 18962 | 236 | 23381 | 684 | 3738 | 48 | 10 | 0.987707 | 0.965178 | 0.159873 | 0.976313 | 3.1034188034188035 | 2.859264147830391 | 1.5669565217391304 | 1.8578767123287672 |
| SNP | PASS | 19198 | 18962 | 236 | 23381 | 684 | 3738 | 48 | 10 | 0.987707 | 0.965178 | 0.159873 | 0.976313 | 3.1034188034188035 | 2.859264147830391 | 1.5669565217391304 | 1.8578767123287672 |
**** DONE Test Twist Human core Exome (hg38):giab:
CLOSED: [2023-08-05 Sat 09:25] SCHEDULED: <2023-08-03 Thu 20:00>
#+begin_src sh
ID="2300346867_NA12878-63118093_S260-GRCh38"; nextflow run workflows/compareVCF.nf -profile standard,helios --query=out/${ID}/callVariant/haplotypecaller/${ID}.vcf.gz --outdir=out/${ID}/happy-twist-exome-core/ --compare=happy -lib lib --capture=capture/Twist_Exome_Core_Covered_Targets_hg38.bed --id=HG001 --genome=GRCh38 -bg
#+end_src
**** DONE Tester Agilen SureSelect All Exon V8 (hg38) GATK-4.4:giab:
CLOSED: [2023-08-05 Sat 09:25] SCHEDULED: <2023-08-03 Thu 20:00>
**** DONE Vérifier l'impact gatk 4.3 - 4.4 : aucun
CLOSED: [2023-08-05 Sat 09:25]
**** DONE Figure comparant les 3 capture :hg001:
CLOSED: [2023-08-06 Sun 20:24] SCHEDULED: <2023-08-06 Sun>
**** DONE Mail Paul sur les 3 capture :hg001:
CLOSED: [2023-08-06 Sun 20:24] SCHEDULED: <2023-08-06 Sun>
**** KILL Tester si le panel Twist Alliance VCGS Exome suffit
CLOSED: [2023-07-31 Mon 22:31] SCHEDULED: <2023-07-30 Sun>
**** DONE Mail cento pour demande le type de capture
CLOSED: [2023-10-07 Sat 17:59]
/Entered on/ [2023-08-07 Mon 20:40]
Twist exome
*** PROJ Comparer happy et happy-vcfeval :giab:
** TODO Données CHM13 :chm:
https://github.com/lh3/CHM-eval
*** TODO Run ERR1341793
SCHEDULED: <2023-11-04 Sat>
(raw reads ERR1341793_1.fastq.gz and ERR1341793_2.fastq.gz downloaded from https://www.ebi.ac.uk/ena/browser/view/ERR1341793)
*** TODO Run ERR1341796
SCHEDULED: <2023-11-04 Sat>
** TODO Insilico :cento:
*** TODO tous les variants centogène
**** DONE Extraire liste des SNVs
CLOSED: [2023-04-22 Sat 17:32] SCHEDULED: <2023-04-17 Mon>
***** DONE Corriger manquant à la main
CLOSED: [2023-04-22 Sat 17:31]
La sortie est sauvegardé dans git-annex : variants_success.csv
***** DONE Automa
TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision |
| INDEL | 549 | 489 | 60 | 899 | 64 | 340 | 8 | 17 | 0.890710 | 0.885510 |
| SNP | 21973 | 21462 | 511 | 26285 | 563 | 4263 | 68 | 16 | 0.976744 | 0.974435 |
****** DONE Interesection des bed: similaire
CLOSED: [2023-07-04 Tue 23:11]
HG38
#+begin_src sh
bedtools intersect -a capture/Agilent_SureSelect_All_Exons_v7_hg38_Regions.bed -b /Work/Groups/bisonex/data/giab/GRCh38/HG001_GRCh38_1_22_v4.2.1_benchmark.bed | wc -l
#+end_src
204280
T2T
#+begin_src sh
bedtools intersect -a /Work/Groups/bisonex/data/giab/T2T/Agilent_SureSelect_All_Exons_v7_hg38_Regions_hg38_T2T.bed -b /Work/Groups/bisonex/data/giab/T2T/HG001_GRCh38_1_22_v4.2.1_benchmark_hg38_T2T.bed | wc -l
#+end_src
204021
****** DONE Vérifier la ligne de commande
CLOSED: [2023-07-04 Tue 23:38]
#+begin_src sh
hap.py \
HG001_GRCh38_1_22_v4_lifted_merged.vcf.gz \
HG001-SRX11061486_SRR14724513-T2T.vcf.gz \
\
--reference chm13v2.0.fa \
--threads 6 \
\
-T Agilent_SureSelect_All_Exons_v7_hg38_Regions_hg38_T2T.bed \
--false-positives HG001_GRCh38_1_22_v4.2.1_benchmark_hg38_T2T.bed \
\
-o HG001
#+end_src
****** DONE Corriger FILTER : mieux mais toujours trop de négatifs. 3/4 SNP retrouvés
CLOSED: [2023-07-08 Sat 15:19] SCHEDULED: <2023-07-08 Sat>
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 413 246 167 751 289 215 2 98 0.595642 0.460821 0.286285 0.519629 NaN NaN 2.428571 2.465116
INDEL PASS 413 246 167 751 289 215 2 98 0.595642 0.460821 0.286285 0.519629 NaN NaN 2.428571 2.465116
SNP ALL 15883 15479 404 23597 5277 2841 46 44 0.974564 0.745760 0.120397 0.844947 3.017198 2.85705 5.560099 2.114633
SNP PASS 15883 15479 404 23597 5277 2841 46 44 0.974564 0.745760 0.120397 0.844947 3.017198 2.85705 5.560099 2.114633
******* DONE Vérifier qu'il ne reste plus de filtre autre que PASS
CLOSED: [2023-07-08 Sat 15:19]
#+begin_src
$ zgrep -c 'PASS' HG001_GRCh38_1_22_v4_lifted_merged.vcf.gz
3730505
$ zgrep -c '^chr' HG001_GRCh38_1_22_v4_lifted_merged.vcf.gz
3730506
#+end_src
****** TODO 1/4 SNP manquant ?
******* DONE Regarder avec Julia si ce sont vraiment des FP: 61/5277 qui ne le sont pas
CLOSED: [2023-07-09 Sun 12:09]
******* DONE Examiner les FP
CLOSED: [2023-07-30 Sun 22:05]
******* DONE Tester un FP
CLOSED: [2023-07-30 Sun 22:05]
2 │ chr1 608765 A G ./.:.:.:.:NOCALL:nocall:. 1/1:FP:.:ti:SNP:homalt:188
liftDown UCSC: rien en GIAB : vrai FP
3 │ chr1 762943 A G ./.:.:.:.:NOCALL:nocall:. 1/1:FP:.:ti:SNP:homalt:287
4 │ chr1 762945 A T ./.:.:.:.:NOCALL:nocall:. 1/1:FP:.:tv:SNP:homalt:287
Remaniements complexes ? Pas dans le gène en HG38
******* DONE La plupart des FP (4705/5566) sont homozygotes: erreur de référence ?
CLOSED: [2023-07-12 Wed 21:10] SCHEDULED: <2023-07-09 Sun>
Sur les 2 premiers variants, ils montrent en fait la différence entre T2T et GRCh38
Erreur à l'alignement ?
******** KILL relancer l'alignement
CLOSED: [2023-07-09 Sun 17:36]
******** DONE vérifier reads identiques hg38 et T2T: oui
CLOSED: [2023-07-09 Sun 16:36]
T2T CHR1608765
38 chr1:1180168-1180168 (
SRR14724513.24448214
SRR14724513.24448214
******* DONE Vérifier quelques variants sur IGV
CLOSED: [2023-07-09 Sun 17:36]
******* KILL Répartition des FP : cluster ?
CLOSED: [2023-07-09 Sun 17:36]
****** DONE Examiner les FP restant après correction selon séquence de référence
CLOSED: [2023-08-12 Sat 15:57]
****** HOLD Examiner les variants supprimé
****** TODO Enlever les FP qui correspondent à un changement dans le génome
******* Condition:
- pas de variation à la position en GRCh38
- variantion homozygote
- la varation en T2T correspond au changement de pair de base GRC38 -> T2T
pour les SNP:
alt_T2T[i] = DNA_GRC38[j]
avec i la position en T2T et j la position en GRCh38
Note: définir un ID n'est pas correct car les variants peuvent être modifié par happy !
******* Idée
- Pour chaque FP, c'est un "faux" FP si
- REF en hg38 == ALT en T2T
- et REF en hg38 != REF en T2T
- et variant homozygote
Comment obtenir les séquences de réferences ?
1. liftover
2. blat sur la séquence autour du variant
3. identifier quelques reads contenant le variant et regarder leur aligneement en hg38
Après discussion avec Alexis: solution 3
******* Algorithme
1. Extraire les coordonnées en T2T des faux positifs *homozygote*
2. Pour chaque faux positif
1. lister 10 reads contenant le variant
2. pour chacun de ces reads, récupérer la séquence en T2T et GRCh38 via le nom du read dans le bam
3. si la séquence en T2T modifiée par le variant est "identique" à celle en GRCh38, alors on ignore ce faux positif
Note: on ignore les reads qui ont changé de chromosome entre les version
******* DONE Résultat préliminaire
CLOSED: [2023-07-23 Sun 14:30]
cf [[file:~/roam/research/bisonex/code/giab/giab-corrected.csv][script julia]]
3498 faux positifs en moins, soit 0.89 sensibilité
julia> tp=15479
julia> fp=5277
julia> tp/(tp+fp)
0.7457602620928888
julia> tp/(tp+(fp-3498))
0.8969173716537258
On est toujours en dessous des 97%
******* HOLD Corriger proprement VCF ou résultats Happy
******* TODO Adapter pour gérer plusieurs variants par read
****** DONE Méthodologie du pangenome
CLOSED: [2023-10-03 Tue 21:28]
Voir biblio[cite:@liao2023] mais ont aligné sur GRCH38
******* DONE Mail alexis
CLOSED: [2023-10-03 Tue 21:28]
****** DONE Méthodologie T2T
CLOSED: [2023-10-16 Mon 19:42]
Mail alexis
SCHEDULED: <2023-10-04 Wed>
***** TODO Rendre simplement le nombre de vrais positifs
SCHEDULED: <2023-11-09 Thu>
***** KILL Mail Yannis
CLOSED: [2023-07-08 Sat 10:44]
***** DONE Mail GIAB pour version T2T
CLOSED: [2023-07-07 Fri 18:37]
**** TODO HG002 :hg002:T2T:
**** TODO HG003 :hg003:T2T:
**** TODO HG004 :hg004:T2T:
**** DONE Plot : ashkenazim trio :hg38:
CLOSED: [2023-07-30 Sun 16:49] SCHEDULED: <2023-07-30 Sun 15:00>
:LOGBOOK:
CLOCK: [2023-07-30 Sun 16:06]--[2023-07-30 Sun 16:35] => 0:29
CLOCK: [2023-07-30 Sun 15:39]--[2023-07-30 Sun 15:40] => 0:01
:END:
/Entered on/ [2023-04-16 Sun 17:29]
Refaire résultats
**** DONE Mail Paul sur les résultat ashkenazim +/- centogene
CLOSED: [2023-08-06 Sun 20:24] SCHEDULED: <2023-08-06 Sun>
**** DONE Relancer comparaison GIAB avec GATK 4.4.0
CLOSED: [2023-08-12 Sat 15:55]
/Entered on/ [2023-08-03 Thu 12:42]
*** TODO Platinum genome :platinum:
https://emea.illumina.com/platinumgenomes.html
**** TODO Tester sur la zone couverte par l'exome centogène
SCHEDULED: <2023-11-09 Thu>
*** DONE Séquencer NA12878 :cento:hg001:
CLOSED: [2023-10-07 Sat 17:59]
Discussion avec Paul : sous-traitant ne nous donnera pas les données, il faut commander l'ADN
**** DONE ADN commandé
CLOSED: [2023-06-30 Fri 22:29]
**** DONE Sauvegarder les données brutes
CLOSED: [2023-07-30 Sun 14:22] SCHEDULED: <2023-07-19 Wed>
K, scality, S
**** KILL Récupérer le fichier de capture
CLOSED: [2023-07-30 Sun 14:25] SCHEDULED: <2023-07-23 Sun>
Candidats donnés dans publication https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8354858/
#+begin_quote
In short, the Nextera Rapid Capture Exome Kit (Illumina, San Diego, CA), the SureSelect Human All Exon kit (Agilent, Santa Clara, CA) or the Twist Human Core Exome was used for enrichment, and a Nextseq500, HiSeq4000, or Novoseq 6000 (Illumina) instrument was used for the actual sequencing, with the average coverage targeted to at least 100× or at least 98% of the target DNA covered 20×.
#+end_quote
Par défaut, on utilisera https://www.twistbioscience.com/products/ngs/alliance-panels#tab-3
ANnonce récente pour nouveau panel Twist : https://www.centogene.com/news-events/news/newsdetails/twist-bioscience-and-centogene-launch-three-panels-to-advance-rare-disease-and-hereditary-cancer-research-and-support-diagnostics
Masi pas de fichier BED
***** DONE Mail centogène
CLOSED: [2023-07-30 Sun 14:22] DEADLINE: <2023-07-23 Sun>
**** DONE Tester Nextera Rapid Capture Exome v1.2 (hg19) :giab:
CLOSED: [2023-08-06 Sun 19:05] SCHEDULED: <2023-08-03 Thu 19:00>
https://support.illumina.com/downloads/nextera-rapid-capture-exome-v1-2-product-files.html
***** DONE Liftover capture
CLOSED: [2023-08-06 Sun 18:30] SCHEDULED: <2023-08-06 Sun>
#+begin_src sh
nextflow run -profile standard,helios workflows/lift-nextera-capture.nf -lib lib
#+end_src
Vérification rapide : ok
***** DONE Run
CLOSED: [2023-08-06 Sun 19:05] SCHEDULED: <2023-08-06 Sun>
#+begin_src sh
nextflow run workflows/compareVCF.nf -profile standard,helios --query=out/2300346867_NA12878-63118093_S260-GRCh38/callVariant/haplotypecaller/2300346867_NA12878-63118093_S260-GRCh38.vcf.gz --outdir=out/2300346867_NA12878-63118093_S260-GRCh38/happy-nextera-lifted/ --compare=happy -lib lib --capture=capture/nexterarapidcapture_exome_targetedregions_v1.2-nochrM_lifted.bed --id=HG001 --genome=GRCh38
#+end_src
**** DONE Tester Agilent SureSelect All Exon V8 (hg38) :giab:
CLOSED: [2023-07-31 Mon 23:09] SCHEDULED: <2023-07-31 Mon>
https://earray.chem.agilent.com/suredesign/index.htm
"Find design"
"Agilent catalog"
Fichiers:
- Regions.bed: Targeted exon intervals, curated and targeted by Agilent Technologies
- MergedProbes.bed: Merged probes for targeted enrichment of exons described in Regions.bed
- Covered.bed: Merged probes and sequences with 95% homology or above
- Padded.bed: Merged probes and sequences with 95% homology or above extended 50 bp at each side
- AllTracks.bed: Targeted regions and covered tracks
#+begin_src sh
nextflow run workflows/compareVCF.nf -profile standard,helios --query=out/2300346867_63118093_NA12878-GRCh38/callVariant/haplotypecaller/2300346867_63118093_NA12878-GRCh38.vcf.gz --outdir=out/2300346867_63118093_NA12878-GRCh38/happy/ --compare=happy -lib lib --capture=capture/Agilent_SureSelect_All_Exons_v8_hg38_Regions.bed --id=HG001 --genome=GRCh38
#+end_src
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL | 423 | 395 | 28 | 915 | 108 | 405 | 4 | 13 | 0.933806 | 0.788235 | 0.442623 | 0.854868 | | | 1.7012987012987013 | 2.7916666666666665 |
| INDEL | PASS | 423 | 395 | 28 | 915 | 108 | 405 | 4 | 13 | 0.933806 | 0.788235 | 0.442623 | 0.854868 | | | 1.7012987012987013 | 2.7916666666666665 |
| SNP | ALL | 20984 | 20600 | 384 | 26080 | 780 | 4703 | 62 | 10 | 0.9817 | 0.963512 | 0.18033 | 0.972521 | 3.0499710592321048 | 2.7596541786743516 | 1.58256372367935 | 1.8978207694018234 |
| SNP | PASS | 20984 | 20600 | 384 | 26080 | 780 | 4703 | 62 | 10 | 0.9817 | 0.963512 | 0.18033 | 0.972521 | 3.0499710592321048 | 2.7596541786743516 | 1.58256372367935 | 1.8978207694018234 |
**** DONE Test Twist Human core Exome (hg38):giab:
CLOSED: [2023-08-01 Tue 23:16] SCHEDULED: <202 3-08-02 Wed>
https://www.twistbioscience.com/resources/data-files/ngs-human-core-exome-panel-bed-file
#+begin_src
nextflow run workflows/compareVCF.nf -profile standard,helios --query=out/2300346867_63118093_NA12878-GRCh38/callVariant/haplotypecaller/2300346867_63118093_NA12878-GRCh38.vcf.gz --outdir=out/2300346867_63118093_NA12878-GRCh38/happy-twist-exome-core/ --compare=happy -lib lib --capture=capture/Twist_Exome_Core_Covered_Targets_hg38.bed --id=HG001 --genome=GRCh38 -bg
#+end_src
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL | 328 | 313 | 15 | 722 | 95 | 309 | 4 | 13 | 0.954268 | 0.769976 | 0.427978 | 0.852273 | | | 1.8584070796460177 | 2.8967391304347827 |
| INDEL | PASS | 328 | 313 | 15 | 722 | 95 | 309 | 4 | 13 | 0.954268 | 0.769976 | 0.427978 | 0.852273 | | | 1.8584070796460177 | 2.8967391304347827 |
| SNP | ALL | 19198 | 18962 | 236 | 23381 | 684 | 3738 | 48 | 10 | 0.987707 | 0.965178 | 0.159873 | 0.976313 | 3.1034188034188035 | 2.859264147830391 | 1.5669565217391304 | 1.8578767123287672 |
| SNP | PASS | 19198 | 18962 | 236 | 23381 | 684 | 3738 | 48 | 10 | 0.987707 | 0.965178 | 0.159873 | 0.976313 | 3.1034188034188035 | 2.859264147830391 | 1.5669565217391304 | 1.8578767123287672 |
**** DONE Test Twist Human core Exome (hg38):giab:
CLOSED: [2023-08-05 Sat 09:25] SCHEDULED: <2023-08-03 Thu 20:00>
#+begin_src sh
ID="2300346867_NA12878-63118093_S260-GRCh38"; nextflow run workflows/compareVCF.nf -profile standard,helios --query=out/${ID}/callVariant/haplotypecaller/${ID}.vcf.gz --outdir=out/${ID}/happy-twist-exome-core/ --compare=happy -lib lib --capture=capture/Twist_Exome_Core_Covered_Targets_hg38.bed --id=HG001 --genome=GRCh38 -bg
#+end_src
**** DONE Tester Agilen SureSelect All Exon V8 (hg38) GATK-4.4:giab:
CLOSED: [2023-08-05 Sat 09:25] SCHEDULED: <2023-08-03 Thu 20:00>
**** DONE Vérifier l'impact gatk 4.3 - 4.4 : aucun
CLOSED: [2023-08-05 Sat 09:25]
**** DONE Figure comparant les 3 capture :hg001:
CLOSED: [2023-08-06 Sun 20:24] SCHEDULED: <2023-08-06 Sun>
**** DONE Mail Paul sur les 3 capture :hg001:
CLOSED: [2023-08-06 Sun 20:24] SCHEDULED: <2023-08-06 Sun>
**** KILL Tester si le panel Twist Alliance VCGS Exome suffit
CLOSED: [2023-07-31 Mon 22:31] SCHEDULED: <2023-07-30 Sun>
**** DONE Mail cento pour demande le type de capture
CLOSED: [2023-10-07 Sat 17:59]
/Entered on/ [2023-08-07 Mon 20:40]
Twist exome
*** PROJ Comparer happy et happy-vcfeval :giab:
** TODO Données CHM13 :chm:
https://github.com/lh3/CHM-eval
*** TODO Run ERR1341793
SCHEDULED: <2023-11-11 Sat>
(raw reads ERR1341793_1.fastq.gz and ERR1341793_2.fastq.gz downloaded from https://www.ebi.ac.uk/ena/browser/view/ERR1341793)
*** TODO Run ERR1341796
SCHEDULED: <2023-11-11 Sat>
** TODO Insilico :cento:
*** TODO tous les variants centogène
**** DONE Extraire liste des SNVs
CLOSED: [2023-04-22 Sat 17:32] SCHEDULED: <2023-04-17 Mon>
***** DONE Corriger manquant à la main
CLOSED: [2023-04-22 Sat 17:31]
La sortie est sauvegardé dans git-annex : variants_success.csv
***** DONE Automa
from summarising data in this way. "
**** DONE Mail alexis
CLOSED: [2023-08-20 Sun 13:45] SCHEDULED: <2023-08-20 Sun>
**** TODO Données simuscop 200x
SCHEDULED: <2023-11-04 Sat>
**** DONE En T2T avec liftover (filtre = spip) : ok mais lent et trop de variants :tests:
CLOSED: [2023-09-17 Sun 17:13] SCHEDULED: <2023-09-17 Sun>
1. Conversion en bed
#+begin_src sh :dir:~/code/sanger
open snvs-cento-sanger.csv | select chrom pos | insert pos2 {$in.pos } | to csv --separator="\t" | save snvs-cento-sanger.bed -f
#+end_src
2. Liftover avec UCSC (en ligne)
NB: vérifié sur le premier résultat en cherche le read contenant le variant (samtools view -r puis samtools view | grep en T2T) et avec l'aide d'IGV, on a un variant qui correspond en
chr1:10757746
3. En supposant que l'ordre des variants n'a pas changé, on ajoute simplement REF et ALT avec annotateLifted.jl
Annotation spip *très lente* : 1h13 !
Résultat:
2×3 DataFrame
Row │ variant meanQual depth
│ String Float64 Int64
─────┼──────────────────────────────────────
1 │ chr12:g.13594572 60.0 1
2 │ chr17:g.10204026 60.0 1
144 found over 146
filter depth : another 0 missed variants
filter poly : another 0 missed variants
filter vep : another 0 missed variants
Et on a trop de variants en sortie (7330 !)
**** DONE Mail Paul avec résultats filtre en T2T + nouveau schéma
CLOSED: [2023-09-17 Sun 23:15] SCHEDULED: <2023-09-17 Sun>
** TODO Medically relevant genes
SCHEDULED: <2023-11-02 Thu>
/Entered on/ [2023-10-18 Wed 22:37]
* Ré-interprétation :reanalysis:
** DONE Lancer tests sur données brutes [225/250] <(samples.csv)> <(runs.waiting)>
CLOSED: [2023-10-14 Sat 11:58] SCHEDULED: <2023-10-08 Sun>
- [X] 100222_63015289
- [X] 1600304839_63051311
- [X] 1900007827_62913191
- [X] 1900398899_62999500
- [X] 1900486799_62913197
- [X] 2100422923_62952677
- [X] 2100458888_62933047
- [X] 2100601558_62903840
- [X] 2100609288_62905768
- [X] 2100609501_62905776
- [X] 2100614493_62951074
- [X] 2100622566_62908067
- [X] 2100622601_62908060
- [X] 2100622705_62908063
- [X] 2100640027_62911936
- [X] 2100645285_62913212
- [X] 2100661411_62914081
- [X] 2100661462_62914086
- [X] 2100708257_62921596
- [X] 2100738732_62926501
- [X] 2100738850_62926509
- [X] 2100746751_62926505
- [X] 2100746797_62926506
- [X] 2100782349_62931722
- [X] 2100782416_62931561
- [X] 2100782559_62931718
- [X] 2100799204_62934768
- [X] 2200010202_62940284
- [X] 2200023600_62940631
- [X] 2200024348_62999591
- [X] 2200027505_62942457
- [X] 2200038776_62943412
- [X] 2200041919_62943405
- [X] 2200088014_62951326
- [X] 2200146652_62959388
- [X] 2200151850_62960953
- [X] 2200160014_62959475
- [X] 2200160070_62959478
- [X] 2200201368_62967471
- [X] 2200201400_62967470
- [X] 2200265558_62976332
- [X] 2200265605_62976401
- [X] 2200267046_62975192
- [X] 2200273878_62999530
- [X] 2200279708_62977002
- [X] 2200284408_62979102
- [X] 2200293987_62979116
- [X] 2200294359_62979118
- [X] 2200306299_62982217
- [X] 2200306539_62982193
- [X] 220030671_62982211
- [X] 2200307058_62982231
- [X] 2200307108_62982196
- [X] 2200307136_62982221
- [X] 2200307199_62982239
- [X] 2200307230_62982234
- [X] 2200307262_62982219
- [X] 2200307297_62982227
- [X] 2200324510_62985453
- [X] 2200324549_62985478
- [X] 2200324573_62985445
- [X] 2200324594_62985467
- [X] 2200324606_62985463
- [X] 2200324614_62985459
- [X] 2200338306_62985430
- [X] 2200343880_62989407
- [X] 2200343910_62989460
- [X] 2200343938_62989451
- [X] 2200343966_62989456
- [X] 2200343993_62989440
- [X] 2200344013_62989464
- [X] 2200349749_62989465
- [X] 2200363462_62988848
- [X] 2200377880_62991993
- [X] 2200378032_62991991
- [X] 2200383996_62993828
- [X] 2200384015_62993796
- [X] 2200384046_62993822
- [X] 2200384117_62993808
- [X] 2200384187_62993825
- [X] 2200384231_62992898
- [X] 2200385658_63060260
- [X] 2200394260_62994732
- [X] 2200395817_62994742
- [X] 2200396731_62994737
- [X] 2200424073_62999579
- [X] 2200424207_62999632
- [X] 2200426178_62999630
- [X] 2200426243_62999635
- [X] 2200426466_62999605
- [X] 2200426642_62999627
- [X] 2200427406_62999649
- [X] 2200427512_62999639
- [X] 2200428953_62999572
- [X] 2200428981_62999600
- [X] 2200428999_62999592
- [X] 2200441970_63000868
- [X] 2200441989_63000882
- [X] 2200442135_63000864
- [X] 2200442216_63000886
- [X] 2200442257_63000951
- [X] 2200451801_63003573
- [X] 2200451862_63004218
- [X] 2200451894_63004210
- [X] 2200456165_63051294
- [X] 2200459865_63004933
- [X] 2200459968_63004937
- [X] 2200460073_63004943
- [X] 2200460121_63004684
- [X] 2200467051_63003856
- [X] 2200467225_63004940
- [X] 2200467261_63004930
- [X] 2200467338_63004925
- [X] 2200470099_63004485
- [X] 2200470142_63004480
- [X] 2200471780_63004362
- [X] 2200480910_63006466
- [X] 2200495073_63010427
- [X] 2200495510_63009152
- [X] 2200508677_63060252
- [X] 2200510531_63012582
- [X] 2200510628_63012549
- [X] 2200510657_63012554
- [X] 2200511249_63012533
- [X] 2200511274_63012586
- [X] 2200517952_63060399
- [X] 2200519525_63060439
- [X] 2200524009_63014044
- [X] 2200524609_63014046
- [X] 2200524616_63014048
- [X] 2200533429_63060425
- [X] 2200539735_63060406
- [X] 2200549908_63019339
- [X] 2200549965_63019349
- [X] 2200550414_63019357
- [X] 2200550471_63020031
- [X] 2200550490_63019351
- [X] 2200550505_63019340
- [X] 2200555565_63018614
- [X] 2200559438_63020029
- [X] 2200559682_63020030
- [X] 2200559713_63019623
- [X] 2200559739_63019626
- [X] 2200569969_63019991
- [X] 2200570001_63021580
- [X] 2200570025_63021490
- [X] 2200570035_63021491
- [X] 2200570042_63021493
- [X] 2200570050_63021494
- [X] 2200579897_63024910
- [X] 2200583995_63024866
- [X] 2200584035_63024905
- [X] 2200584069_63024888
- [X] 2200584126_63024810
- [X] 2200589507_63026712
- [X] 2200597365_63027994
- [X] 2200597480_63027988
- [X] 2200597752_63026853
- [X] 2200597778_63027992
- [X] 22005977_63026903
- [X] 2200609031_63026527
- [X] 2200614198_63113928
- [X] 2200620372_63030821
- [X] 2200620442_63030810
- [X] 2200620498_63030816
- [X] 2200620628_63031031
- [X] 2200622310_63030984
- [X] 2200622355_63030956
- [X] 2200625369_63028699
- [X] 2200625410_63028697
- [X] 2200625536_63028694
- [X] 2200630189_63030665
- [X] 2200635149_63033182
- [X] 2200644544_63037731
- [X] 2200644594_63037725
- [X] 2200650089_63038093
- [X] 2200666292_63076568
- [X] 2200669188_63036688
- [X] 2200669320_63040259
- [X] 2200669383_63040254
- [X] 2200669414_63040257
- [X] 2200669446_63040251
- [X] 2200680342_63105271
- [X] 2200694535_63042853
- [X] 2200694789_63042862
- [X] 2200694858_63042702
- [X] 2200694917_63042696
- [X] 2200699290_63043047
- [X] 2200699345_63040238
- [X] 2200699383_63043050
- [X] 2200699412_63040731
- [X] 220071551_63048935
- [X] 2200731515_63048963
- [X] 2200748145_63051198
- [X] 2200748171_63051213
- [X] 2200751046_63051249
- [X] 2200751101_63051234
- [X] 2200766471_63054590
- [X] 2200767731_63054595
- [X] 2200767822_63054464
- [X] 2200775505_63060410
- [X] 2200850441_63019345
- [X] 220597589_63026879
- [X] 2300003253_63060430
- [X] 2300005679_63060370
- [X] 2300009914_63060390
- [X] 2300028784_63060001
- [X] 2300036815_63063357
- [X] 2300055382_63061874
- [X] 2300055421_63061871
- [X] 2300055440_63061880
- [X] 230006894_63064950
- [X] 2300071111_63070356
- [X] 2300083434_63071675
- [X] 2300103609_63076239
- [X] 2300104572_63076232
- [X] 2300109602_63076765
- [X] 2300109665_63076770
- [X] 2300119721_63078732
- [X] 2300137773_63078133
- [X] 2300137834_63078123
- [X] 2300167821_63086183
- [X] 2300172698_63113453
- [X] 2300188216_63090609
- [X] 2300188281_63090632
- [ ] 2300188800_63090616
- [ ] 2300193193645_63090623
- [ ] 2300193668_63090611
- [ ] 2300195426_63090608
- [ ] 2300201017_63089636
- [ ] 2300227479_63098330
- [ ] 2300232688_63130821
- [ ] 2300292749_63109239
- [ ] 230029277_63109247
- [
] 2300294712_63109236
- [ ] 2300308032_63111581
- [ ] 2300323537_63114209
- [ ] 2300334609_63115535
- [ ] 2300346867_63118093
- [ ] 2300346867_63118093_NA12878
- [ ] 2300348940_63118099
- [ ] 2300359806_63119915
- [ ] 2300380476_63123963
- [ ] 2300382582_63123749
- [ ] 2300384269_63126867
- [ ] 2300407581_63130826
- [ ] 2300407626_63130842
- [ ] 2300409593_63130874
- [ ] 2300409612_63130980
- [ ] 2300417623_63131524
** TODO Variants manqués :missed:
SCHEDULED: <2023-10-21 Sat>
*** DONE 63012582: chr10:g.102230760 filtré par AD :63012582:
CLOSED: [2023-10-08 Sun 23:24] SCHEDULED: <2023-10-08 Sun>
Il est en sortie d'haplotypecaller !
Attention à la position : POS=102230753 noté CG->C
GT:AD:DP:GQ:PL 0/1:26,8:34:99:146,0,671
Filtré par la condition AD <= 10 (porté par 8 reads seulement)
Non confirméen sanger, rendu vous
**** KILL image BAM cento
CLOSED: [2023-10-08 Sun 23:13]
**** DONE image BAM bisonex
CLOSED: [2023-10-08 Sun 23:23] SCHEDULED: <2023-10-08 Sun>
**** DONE Mail Paul
CLOSED: [2023-10-08 Sun 23:24] SCHEDULED: <2023-10-08 Sun>
*** DONE 63060439: chr15:g.26869324 = Problème de profondeur DP=15 :63060439:
CLOSED: [2023-10-08 Sun 23:24] SCHEDULED: <2023-10-08 Sun>
GABRA5
Rendu VOUS avec un variant patho MDB5 pour même patient (VOUS- même)
Non confirmé en Sanger
GT:AD:DP:GQ:PL 0/1:9,6:15:99:103,0,213
**** DONE image BAM bisonex
CLOSED: [2023-10-08 Sun 22:56]
**** DONE Mail Paul
CLOSED: [2023-10-08 Sun 23:24] SCHEDULED: <2023-10-08 Sun>
** TODO Comparer variants cento à sortie bisonex :checkpipeline:
SCHEDULED: <2023-10-21 Sat>
*** TODO Un seul exécutable pour toutes les étapes
SCHEDULED: <2023-10-21 Sat>
Un utilitaire en ligne de commande qui appel les différentes étapes.
On utilise une structure unique pour toutes les étapes mais qui sera remplie au fur et à mesure. En stockant dans un csv à chaque étape
**** DONE parse variants
CLOSED: [2023-10-21 Sat 23:29] SCHEDULED: <2023-10-21 Sat>
**** DONE Ajouter négatifs dans la liste des variants
CLOSED: [2023-10-22 Sun 23:01] SCHEDULED: <2023-10-21 Sat>
**** DONE Mettre à jour liste des variants
CLOSED: [2023-10-22 Sun 23:01] SCHEDULED: <2023-10-21 Sat>
- [X] Régéner la liste des variants
- [ ] Retrouver les variants modifié à la main avec diff
On ne garde que les ajouts
#+begin_src sh
awk -F ',' '{print $1","$2":"$3$4$5}' extracted.csv | ^sort | save -f extracted_concat.csv
xsv select 1-2 ~/annex/data/centogene/variants/variant_genomic.csv | ^sort | save variant_genomic_corr.csv -f
diff extracted_concat.csv variant_genomic_corr.csv | grep '^>' | save -f update.diff
#+end_src
- [X] Ajouter négatifs
345 variants non trouvés avant modification
141 après modification
- [X] Ajouter différence
- [X] Corriger erreurs de parsing
**** KILL Lifter coordonées variants cento génomique en GRCh38
CLOSED: [2023-10-21 Sat 22:47]
**** DONE Un seul type de données
CLOSED: [2023-10-25 Wed 09:13] SCHEDULED: <2023-10-21 Sat>
**** KILL variant_recoder pour avoir les coordonnées VCF
CLOSED: [2023-10-25 Wed 09:13] SCHEDULED: <2023-10-21 Sat>
mobidetails n e trouve pas les ieux transcrits
**** DONE Annotation mobidetails (gene + données gonémique)
CLOSED: [2023-10-25 Wed 09:14]
**** DONE Envoyer liste à Paul
SCHEDULED: <2023-10-26 Thu>
**** DONE compare chaque variant avec la sortie du pipeline
CLOSED: [2023-10-31 Tue 00:18] SCHEDULED: <2023-10-21 Sat>
Avec la fonction "test" dans Search.hs
1126 extracted
654 annotated
253 raw data
102 raw and annotated
236 raw and extracted
17 raw NOT extracted
890 extract WITHOUT raw
#+begin_src sh
❯ open diff.txt | from csv | get id | into string | each {|e| "~/annex/data/centogene/reports/" ++ $e ++ "*.pdf"} | each {|e| firefox $e }
#+end_src
Les 17 manquants sont
- 62913191 : CNV
- 62959388 : MT-ATP6
- 62999572 : MT-ATP6
- 62999627 : CNV
- 62999630 : CNV
- 63004218: CNV
- 63006466 : CNV
- 63009152 : manqué à extraire -> bien présent
- 63015289: CNV
- 63024910 : MT-ATP6
- 63040251 : CNV
- 63043050 : CNV
- 63118093 : NA12878
- NA12878 x4
**** TODO Annoter avec la sortie du pipeline
SCHEDULED: <2023-11-01 Wed>
**** TODO Annoter avec Sanger
SCHEDULED: <2023-11-01 Wed>
**** TODO Rajouter variant pour 63009152
SCHEDULED: <2023-11-01 Wed>
**** TODO Regénérer annotation avec NC_
SCHEDULED: <2023-10-31 Tue>
* Résultats
** TODO Speed-up BWA-mem
SCHEDULED: <2023-11-04 Sat>
** TODO Speed-up Hapotypecaller
SCHEDULED: <2023-11-04 Sat>
* Communication
** DONE Mail NGS-diag
CLOSED: [2023-10-06 Fri 08:04] SCHEDULED: <2023-10-06 Fri>
/Entered on/ [2023-10-04 Wed 19:33]
from summarising data in this way. "
**** DONE Mail alexis
CLOSED: [2023-08-20 Sun 13:45] SCHEDULED: <2023-08-20 Sun>
**** TODO Données simuscop 200x
SCHEDULED: <2023-11-11 Sat>
**** DONE En T2T avec liftover (filtre = spip) : ok mais lent et trop de variants :tests:
CLOSED: [2023-09-17 Sun 17:13] SCHEDULED: <2023-09-17 Sun>
1. Conversion en bed
#+begin_src sh :dir:~/code/sanger
open snvs-cento-sanger.csv | select chrom pos | insert pos2 {$in.pos } | to csv --separator="\t" | save snvs-cento-sanger.bed -f
#+end_src
2. Liftover avec UCSC (en ligne)
NB: vérifié sur le premier résultat en cherche le read contenant le variant (samtools view -r puis samtools view | grep en T2T) et avec l'aide d'IGV, on a un variant qui correspond en
chr1:10757746
3. En supposant que l'ordre des variants n'a pas changé, on ajoute simplement REF et ALT avec annotateLifted.jl
Annotation spip *très lente* : 1h13 !
Résultat:
2×3 DataFrame
Row │ variant meanQual depth
│ String Float64 Int64
─────┼──────────────────────────────────────
1 │ chr12:g.13594572 60.0 1
2 │ chr17:g.10204026 60.0 1
144 found over 146
filter depth : another 0 missed variants
filter poly : another 0 missed variants
filter vep : another 0 missed variants
Et on a trop de variants en sortie (7330 !)
**** DONE Mail Paul avec résultats filtre en T2T + nouveau schéma
CLOSED: [2023-09-17 Sun 23:15] SCHEDULED: <2023-09-17 Sun>
** TODO Medically relevant genes
SCHEDULED: <2023-11-09 Thu>
/Entered on/ [2023-10-18 Wed 22:37]
* Ré-interprétation :reanalysis:
** DONE Lancer tests sur données brutes [225/250] <(samples.csv)> <(runs.waiting)>
CLOSED: [2023-10-14 Sat 11:58] SCHEDULED: <2023-10-08 Sun>
- [X] 100222_63015289
- [X] 1600304839_63051311
- [X] 1900007827_62913191
- [X] 1900398899_62999500
- [X] 1900486799_62913197
- [X] 2100422923_62952677
- [X] 2100458888_62933047
- [X] 2100601558_62903840
- [X] 2100609288_62905768
- [X] 2100609501_62905776
- [X] 2100614493_62951074
- [X] 2100622566_62908067
- [X] 2100622601_62908060
- [X] 2100622705_62908063
- [X] 2100640027_62911936
- [X] 2100645285_62913212
- [X] 2100661411_62914081
- [X] 2100661462_62914086
- [X] 2100708257_62921596
- [X] 2100738732_62926501
- [X] 2100738850_62926509
- [X] 2100746751_62926505
- [X] 2100746797_62926506
- [X] 2100782349_62931722
- [X] 2100782416_62931561
- [X] 2100782559_62931718
- [X] 2100799204_62934768
- [X] 2200010202_62940284
- [X] 2200023600_62940631
- [X] 2200024348_62999591
- [X] 2200027505_62942457
- [X] 2200038776_62943412
- [X] 2200041919_62943405
- [X] 2200088014_62951326
- [X] 2200146652_62959388
- [X] 2200151850_62960953
- [X] 2200160014_62959475
- [X] 2200160070_62959478
- [X] 2200201368_62967471
- [X] 2200201400_62967470
- [X] 2200265558_62976332
- [X] 2200265605_62976401
- [X] 2200267046_62975192
- [X] 2200273878_62999530
- [X] 2200279708_62977002
- [X] 2200284408_62979102
- [X] 2200293987_62979116
- [X] 2200294359_62979118
- [X] 2200306299_62982217
- [X] 2200306539_62982193
- [X] 220030671_62982211
- [X] 2200307058_62982231
- [X] 2200307108_62982196
- [X] 2200307136_62982221
- [X] 2200307199_62982239
- [X] 2200307230_62982234
- [X] 2200307262_62982219
- [X] 2200307297_62982227
- [X] 2200324510_62985453
- [X] 2200324549_62985478
- [X] 2200324573_62985445
- [X] 2200324594_62985467
- [X] 2200324606_62985463
- [X] 2200324614_62985459
- [X] 2200338306_62985430
- [X] 2200343880_62989407
- [X] 2200343910_62989460
- [X] 2200343938_62989451
- [X] 2200343966_62989456
- [X] 2200343993_62989440
- [X] 2200344013_62989464
- [X] 2200349749_62989465
- [X] 2200363462_62988848
- [X] 2200377880_62991993
- [X] 2200378032_62991991
- [X] 2200383996_62993828
- [X] 2200384015_62993796
- [X] 2200384046_62993822
- [X] 2200384117_62993808
- [X] 2200384187_62993825
- [X] 2200384231_62992898
- [X] 2200385658_63060260
- [X] 2200394260_62994732
- [X] 2200395817_62994742
- [X] 2200396731_62994737
- [X] 2200424073_62999579
- [X] 2200424207_62999632
- [X] 2200426178_62999630
- [X] 2200426243_62999635
- [X] 2200426466_62999605
- [X] 2200426642_62999627
- [X] 2200427406_62999649
- [X] 2200427512_62999639
- [X] 2200428953_62999572
- [X] 2200428981_62999600
- [X] 2200428999_62999592
- [X] 2200441970_63000868
- [X] 2200441989_63000882
- [X] 2200442135_63000864
- [X] 2200442216_63000886
- [X] 2200442257_63000951
- [X] 2200451801_63003573
- [X] 2200451862_63004218
- [X] 2200451894_63004210
- [X] 2200456165_63051294
- [X] 2200459865_63004933
- [X] 2200459968_63004937
- [X] 2200460073_63004943
- [X] 2200460121_63004684
- [X] 2200467051_63003856
- [X] 2200467225_63004940
- [X] 2200467261_63004930
- [X] 2200467338_63004925
- [X] 2200470099_63004485
- [X] 2200470142_63004480
- [X] 2200471780_63004362
- [X] 2200480910_63006466
- [X] 2200495073_63010427
- [X] 2200495510_63009152
- [X] 2200508677_63060252
- [X] 2200510531_63012582
- [X] 2200510628_63012549
- [X] 2200510657_63012554
- [X] 2200511249_63012533
- [X] 2200511274_63012586
- [X] 2200517952_63060399
- [X] 2200519525_63060439
- [X] 2200524009_63014044
- [X] 2200524609_63014046
- [X] 2200524616_63014048
- [X] 2200533429_63060425
- [X] 2200539735_63060406
- [X] 2200549908_63019339
- [X] 2200549965_63019349
- [X] 2200550414_63019357
- [X] 2200550471_63020031
- [X] 2200550490_63019351
- [X] 2200550505_63019340
- [X] 2200555565_63018614
- [X] 2200559438_63020029
- [X] 2200559682_63020030
- [X] 2200559713_63019623
- [X] 2200559739_63019626
- [X] 2200569969_63019991
- [X] 2200570001_63021580
- [X] 2200570025_63021490
- [X] 2200570035_63021491
- [X] 2200570042_63021493
- [X] 2200570050_63021494
- [X] 2200579897_63024910
- [X] 2200583995_63024866
- [X] 2200584035_63024905
- [X] 2200584069_63024888
- [X] 2200584126_63024810
- [X] 2200589507_63026712
- [X] 2200597365_63027994
- [X] 2200597480_63027988
- [X] 2200597752_63026853
- [X] 2200597778_63027992
- [X] 22005977_63026903
- [X] 2200609031_63026527
- [X] 2200614198_63113928
- [X] 2200620372_63030821
- [X] 2200620442_63030810
- [X] 2200620498_63030816
- [X] 2200620628_63031031
- [X] 2200622310_63030984
- [X] 2200622355_63030956
- [X] 2200625369_63028699
- [X] 2200625410_63028697
- [X] 2200625536_63028694
- [X] 2200630189_63030665
- [X] 2200635149_63033182
- [X] 2200644544_63037731
- [X] 2200644594_63037725
- [X] 2200650089_63038093
- [X] 2200666292_63076568
- [X] 2200669188_63036688
- [X] 2200669320_63040259
- [X] 2200669383_63040254
- [X] 2200669414_63040257
- [X] 2200669446_63040251
- [X] 2200680342_63105271
- [X] 2200694535_63042853
- [X] 2200694789_63042862
- [X] 2200694858_63042702
- [X] 2200694917_63042696
- [X] 2200699290_63043047
- [X] 2200699345_63040238
- [X] 2200699383_63043050
- [X] 2200699412_63040731
- [X] 220071551_63048935
- [X] 2200731515_63048963
- [X] 2200748145_63051198
- [X] 2200748171_63051213
- [X] 2200751046_63051249
- [X] 2200751101_63051234
- [X] 2200766471_63054590
- [X] 2200767731_63054595
- [X] 2200767822_63054464
- [X] 2200775505_63060410
- [X] 2200850441_63019345
- [X] 220597589_63026879
- [X] 2300003253_63060430
- [X] 2300005679_63060370
- [X] 2300009914_63060390
- [X] 2300028784_63060001
- [X] 2300036815_63063357
- [X] 2300055382_63061874
- [X] 2300055421_63061871
- [X] 2300055440_63061880
- [X] 230006894_63064950
- [X] 2300071111_63070356
- [X] 2300083434_63071675
- [X] 2300103609_63076239
- [X] 2300104572_63076232
- [X] 2300109602_63076765
- [X] 2300109665_63076770
- [X] 2300119721_63078732
- [X] 2300137773_63078133
- [X] 2300137834_63078123
- [X] 2300167821_63086183
- [X] 2300172698_63113453
- [X] 2300188216_63090609
- [X] 2300188281_63090632
- [ ] 2300188800_63090616
- [ ] 2300193193645_63090623
- [ ] 2300193668_63090611
- [ ] 2300195426_63090608
- [ ] 2300201017_63089636
- [ ] 2300227479_63098330
- [ ] 2300232688_63130821
- [ ] 2300292749_63109239
- [ ] 230029277_63109247
- [ ] 2300294712_63109236
- [ ] 2300308032_63111581
- [ ] 2300323537_63114209
- [ ] 2300334609_63115535
- [ ] 2300346867_63118093
- [ ] 2300346867_63118093_NA12878
- [ ] 2300348940_63118099
- [ ] 2300359806_63119915
- [ ] 2300380476_63123963
- [ ] 2300382582_63123749
- [ ] 2300384269_63126867
- [ ] 2300407581_63130826
- [ ] 2300407626_63130842
- [ ] 2300409593_63130874
- [ ] 2300409612_63130980
- [ ] 2300417623_63131524
** TODO Variants manqués :checkpipeline:
SCHEDULED: <2023-10-21 Sat>
*** DONE 63012582: chr10:g.102230760 filtré par AD :63012582:
CLOSED: [2023-10-08 Sun 23:24] SCHEDULED: <2023-10-08 Sun>
Il est en sortie d'haplotypecaller !
Attention à la position : POS=102230753 noté CG->C
GT:AD:DP:GQ:PL 0/1:26,8:34:99:146,0,671
Filtré par la condition AD <= 10 (porté par 8 reads seulement)
Non confirméen sanger, rendu vous
**** KILL image BAM cento
CLOSED: [2023-10-08 Sun 23:13]
**** DONE image BAM bisonex
CLOSED: [2023-10-08 Sun 23:23] SCHEDULED: <2023-10-08 Sun>
**** DONE Mail Paul
CLOSED: [2023-10-08 Sun 23:24] SCHEDULED: <2023-10-08 Sun>
*** DONE 63060439: chr15:g.26869324 = Problème de profondeur DP=15 :63060439:
CLOSED: [2023-10-08 Sun 23:24] SCHEDULED: <2023-10-08 Sun>
GABRA5
Rendu VOUS avec un variant patho MDB5 pour même patient (VOUS- même)
Non confirmé en Sanger
GT:AD:DP:GQ:PL 0/1:9,6:15:99:103,0,213
**** DONE image BAM bisonex
CLOSED: [2023-10-08 Sun 22:56]
**** DONE Mail Paul
CLOSED: [2023-10-08 Sun 23:24] SCHEDULED: <2023-10-08 Sun>
*** DONE Un seul exécutable pour toutes les étapes
CLOSED: [2023-11-04 Sat 19:00] SCHEDULED: <2023-10-21 Sat>
Un utilitaire en ligne de commande qui appel les différentes étapes.
On utilise une structure unique pour toutes les étapes mais qui sera remplie au fur et à mesure. En stockant dans un csv à chaque étape
**** DONE parse variants
CLOSED: [2023-10-21 Sat 23:29] SCHEDULED: <2023-10-21 Sat>
**** DONE Ajouter négatifs dans la liste des variants
CLOSED: [2023-10-22 Sun 23:01] SCHEDULED: <2023-10-21 Sat>
**** DONE Mettre à jour liste des variants
CLOSED: [2023-10-22 Sun 23:01] SCHEDULED: <2023-10-21 Sat>
- [X] Régéner la liste des variants
- [ ] Retrouver les variants modifié à la main avec diff
On ne garde que les ajouts
#+begin_src sh
awk -F ',' '{print $1","$2":"$3$4$5}' extracted.csv | ^sort | save -f extracted_concat.csv
xsv select 1-2 ~/annex/data/centogene/variants/variant_genomic.csv | ^sort | save variant_genomic_corr.csv -f
diff extracted_concat.csv variant_genomic_corr.csv | grep '^>' | save -f update.diff
#+end_src
- [X] Ajouter négatifs
345 variants non trouvés avant modification
141 après modification
- [X] Ajouter différence
- [X] Corriger erreurs de parsing
**** KILL Lifter coordonées variants cento génomique en GRCh38
CLOSED: [2023-10-21 Sat 22:47]
**** DONE Parser coordonnée patient
CLOSED: [2023-11-04 Sat 18:59] SCHEDULED: <2023-10-31 Tue>
**** DONE Un seul type de données
CLOSED: [2023-11-01 Wed 00:55] SCHEDULED: <2023-10-31 Tue>
***** DONE Vérifier avec derniere version sauvegardé
CLOSED: [2023-11-01 Wed 00:55] SCHEDULED: <2023-10-31 Tue>
file,transcript,coding,codingPos,codingChange,proteinChange,classification,zygosity
->
patient,transcript (cento),transcript (canonical),coding,genomic (hg38),classification (cento),zygosity,gene,Confirmed in sanger,Found by bisonex,chrom,pos,ref,alt
- [ ] Fusionner coding,codingPos,codingChange
- [ ] Ne pas écrire proteinchange
- [ ] Fichier de référence : insérer champs vides avec awk : transcript (canonical), genomic puis gene, etc
- [ ] Fichier de référence : renommer header
- [ ] vérifier que fichier toujours identique
#+begin_src julia
using DataFramesMeta, CSV
# - [ ] Fichier de référence : insérer champs vides avec awk : transcript (canonical), genomic puis gene, etc
# - [ ] Fichier de référence : renommer header
# - [ ] vérifier que fichier toujours identique
#file,transcript,coding,codingPos,codingChange,proteinChange,classification,zygosity
#->
#patient,transcript (cento),transcript (canonical),coding,genomic (hg38),classification (cento),zygosity,gene,Confirmed in sanger,Found by bisonex,chrom,pos,ref,alt
function mergeCoding(c, p, ch)
"negatif" in [c, p, ch] ? "negatif" : c * p * ch
end
function negative(t, pos)
t == "negatif" ? -1 : pos
end
cols = [:patient,:"transcript (cento)",:"transcript (canonical)",:coding,:"genomic (hg38)",
:"classification (cento)",:zygosity,:gene,:"Confirmed in sanger",:"Found by bisonex",
:chrom,:pos,:ref,:alt]
d = @chain CSV.read("variant_extracted.csv", DataFrame) begin
# Fusionner coding,codingPos,codingChange
@transform :coding = mergeCoding.(:coding, :codingPos, :codingChange)
# Ne pas écrire proteinchange
@select $(Not([:codingPos, :codingChange, :proteinChange]))
# Add missing mcolumns
@rename :"transcript (cento)" = :transcript :patient = :file :"classification (cento)" = :classification
@transform :"transcript (canonical)" = missing :"genomic (hg38)" = missing :gene = missing
@transform :"Confirmed in sanger" = missing :"Found by bisonex" = missing
@transform :chrom = missing :pos = missing :ref = missing :alt = missing
# Rorder
@select :patient :"transcript (cento)" :"transcript (canonical)" :coding :"genomic (hg38)" :"classification (cento)" :zygosity :gene :"Confirmed in sanger" :"Found by bisonex" :chrom :pos :ref :alt
# Set -1 for negative variant
@rtransform :pos = :"transcript (cento)" == "negatif" ? -1 : :pos
@rtransform :"transcript (canonical)"= :"transcript (cento)" == "negatif" ? "negatif" : :"transcript (canonical)"
@rtransform :"genomic (hg38)" = :"transcript (cento)" == "negatif" ? "negatif" : :"genomic (hg38)"
@rtransform :coding = :"transcript (cento)" == "negatif" ? "negatif" : :coding
@rtransform :gene = :"transcript (cento)" == "negatif" ? "negatif" : :gene
@rtransform :pos = ismissing(:pos) ? -1 : :pos
end
CSV.write("variant_extracted_remap.csv", d)
d2 = @chain CSV.read("extracted.csv", DataFrame) begin
@orderby :patient
end
CSV.write("extracted_sorted.csv", d2)
#+end_src
ON trie les fichiers pour bien avoir le bon order (sinon diff ne fonctionne pas ??)
diff extracted_sorted.csv variant_extracted_remap.csv -u | save extracted.diff
patch -p1 extracted_sorted.csv extracted.diff
patching file extracted_sorted.csv
diff extracted_sorted.csv variant_extracted_remap.csv -u
**** KILL variant_recoder pour avoir les coordonnées VCF
CLOSED: [2023-10-25 Wed 09:13] SCHEDULED: <2023-10-21 Sat>
mobidetails n e trouve pas les ieux transcrits
**** DONE Annotation mobidetails (gene + données gonémique)
CLOSED: [2023-10-25 Wed 09:14]
**** DONE Envoyer liste à Paul
SCHEDULED: <2023-10-26 Thu>
**** DONE compare chaque variant avec la sortie du pipeline
CLOSED: [2023-10-31 Tue 00:18] SCHEDULED: <2023-10-21 Sat>
Avec la fonction "test" dans Search.hs
1126 extracted
654 annotated
253 raw data
102 raw and annotated
236 raw and extracted
17 raw NOT extracted
890 extract WITHOUT raw
#+begin_src sh
❯ open diff.txt | from csv | get id | into string | each {|e| "~/annex/data/centogene/reports/" ++ $e ++ "*.pdf"} | each {|e| firefox $e }
#+end_src
Les 17 manquants sont
- 62913191 : CNV
- 62959388 : MT-ATP6
- 62999572 : MT-ATP6
- 62999627 : CNV
- 62999630 : CNV
- 63004218: CNV
- 63006466 : CNV
- 63009152 : manqué à extraire -> bien présent
- 63015289: CNV
- 63024910 : MT-ATP6
- 63040251 : CNV
- 63043050 : CNV
- 63118093 : NA12878
- NA12878 x4
*** TODO Comparer variants cento à sortie bisonex
SCHEDULED: <2023-11-04 Sat>
*** DONE Refaire extraction
CLOSED: [2023-11-04 Sat 19:02] SCHEDULED: <2023-11-04 Sat>
*** DONE Refaire annotation avec mobidetails
CLOSED: [2023-11-04 Sat 19:02] SCHEDULED: <2023-11-04 Sat>
*** DONE Refaire annotation avec transcrit non reconnus
CLOSED: [2023-11-04 Sat 20:42] SCHEDULED: <2023-11-04 Sat>
5 transcrits, donnés égalemen tpar
#+begin_src nu
open annotated.csv | where coding != "negatif" | where chrom == ""
#+end_src
| 62676048 | NM_001080420.1 | SHANK3 | référénce non valide |
| 62690893 | NM_001080420.1 | KDM6B | idem |
| 62690893 | NM_001080420.1 | KDM6B | même variant |
| 62795429 | NM_016381.3 | TREX1 | NM_033629.5 |
| 63019340 | NM_001080420.1 | SHANK3 | NM_001372044.2 |
SCHEDULED: <2023-11-01 Wed>
*** DONE Rajouter variant pour 63009152
CLOSED: [2023-11-04 Sat 20:47] SCHEDULED: <2023-11-01 Wed>
*** DONE Regénérer annotation avec NC_
CLOSED: [2023-11-04 Sat 18:59] SCHEDULED: <2023-10-31 Tue>
*** TODO Comparer variants cento avec sanger
SCHEDULED: <2023-11-04 Sat>
*** TODO Mail paul avec résultats
SCHEDULED: <2023-11-05 Sun>
* Résultats
** TODO Speed-up BWA-mem
SCHEDULED: <2023-11-11 Sat>
** TODO Speed-up Hapotypecaller
SCHEDULED: <2023-11-11 Sat>
* Communication
** DONE Mail NGS-diag
CLOSED: [2023-10-06 Fri 08:04] SCHEDULED: <2023-10-06 Fri>
/Entered on/ [2023-10-04 Wed 19:33]