apraga/org - Change T6Z42Y23DCC245L4APLHDYCFCHYAMWSR7AOXVICDCUGYVZX5IHNQC

Synthetic patient result

Created by Alexis Praga on August 20, 2023

T6Z42Y23DCC245L4APLHDYCFCHYAMWSR7AOXVICDCUGYVZX5IHNQC

Dependencies

In channels

main

Change contents

Replacement in projects.org at line 32 [5.123895]

B:BD[3.1219] → [4.100:159]

B:BD[4.159] → [6.43:71]

** TODO Copy of ID card + health insurance card for hospital reception desk
SCHEDULED: <2023-08-17 Thu>

[3.1219]

[4.187]

** DONE Copy of ID card + health insurance card for hospital reception desk
CLOSED: [2023-08-19 Sat 20:09] SCHEDULED: <2023-08-17 Thu>

Replacement in projects/bisonex.org at line 21 [8.35]

B:BD[7.18249] → [7.18249:19033]

B:BD[7.19033] → [9.9006:16414]

B:BD[9.16414] → [2.29:16413]

utside SPiCE Interpretation |  0 |           0 | No            |    10.00000 |       89894644 | Don           |    0.0001360257829 | No               | Don                 |             89894485 |        159 |            0 |   0.0000000000000 | No          |      89894485 |         0.07177992 | Yes              |            0.07177992 | Yes |
| chr10 | 89894645 | lol | A   | G   | .    | .      | .    | NR_028034:g.89894645:A>G    | NTR            | 00 % [00 % - 00.92 %]      |     0.000 | +      | 89894645 | substitution | A>G      | Intron 3 |     1398 | NR_028034    | FAS  | donor     |           160 | DeepIntron |               0 | Outside SPiCE Interpretation |  0 |           0 | No            |    10.00000 |       89894644 | Don           |    0.0001360257829 | No               | Don                 |             89894485 |        159 |            0 |   0.0000000000000 | No          |      89894485 |         0.07177992 | Yes              |            0.07177992 | Yes |
| chr10 | 89894645 | lol | A   | G   | .    | .      | .    | NR_028035:g.89894645:A>G    | Alter ESR      | 35.81 % [28.11 % - 44.1 %] |     0.288 | +      | 89894645 | substitution | A>G      | Exon 4   |       63 | NR_028035    | FAS  | acceptor  |             8 | ExonESR    |               0 | Outside SPiCE Interpretation |  0 |           0 | No            |    -1.67753 |       89894644 | Acc           |    0.0000003317384 | No               | Acc                 |             89894637 |          7 |     89894644 |   0.0000002205815 | No          |      89894637 |         0.02545572 | No               |            0.02545572 | No  |
| chr10 | 89894645 | lol | A   | G   | .    | .      | .    | NR_028036:g.89894645:A>G    | Alter ESR      | 35.81 % [28.11 % - 44.1 %] |     0.288 | +      | 89894645 | substitution | A>G      | Exon 5   |       63 | NR_028036    | FAS  | acceptor  |             8 | ExonESR    |               0 | Outside SPiCE Interpretation |  0 |           0 | No            |    -1.67753 |       89894644 | Acc           |    0.0000003317384 | No               | Acc                 |             89894637 |          7 |     89894644 |   0.0000002205815 | No          |      89894637 |         0.02545572 | No               |            0.02545572 | No  |
| chr10 | 89894645 | lol | A   | G   | .    | .      | .    | NR_135313:g.89894645:A>G    | Alter ESR      | 35.81 % [28.11 % - 44.1 %] |     0.288 | +      | 89894645 | substitution | A>G      | Exon 5   |       63 | NR_135313    | FAS  | acceptor  |             8 | ExonESR    |               0 | Outside SPiCE Interpretation |  0 |           0 | No            |    -1.67753 |       89894644 | Acc           |    0.0000003317384 | No               | Acc                 |             89894637 |          7 |     89894644 |   0.0000002205815 | No          |      89894637 |         0.02545572 | No               |            0.02545572 | No  |
| chr10 | 89894645 | lol | A   | G   | .    | .      | .    | NM_001410956:g.89894645:A>G | Alter ESR      | 35.81 % [28.11 % - 44.1 %] |     0.288 | +      | 89894645 | substitution | A>G      | Exon 6   |       63 | NM_001410956 | FAS  | acceptor  |             8 | ExonESR    |               0 | Outside SPiCE Interpretation |  0 |           0 | No            |    -1.67753 |       89894644 | Acc           |    0.0000003317384 | No               | Acc                 |             89894637 |          7 |     89894644 |   0.0000002205815 | No          |      89894637 |         0.02545572 | No               |            0.02545572 | No  |
| chr10 | 89894645 | lol | A   | G   | .    | .      | .    | NR_135314:g.89894645:A>G    | Alter ESR      | 35.81 % [28.11 % - 44.1 %] |     0.288 | +      | 89894645 | substitution | A>G      | Exon 6   |       63 | NR_135314    | FAS  | acceptor  |             8 | ExonESR    |               0 | Outside SPiCE Interpretation |  0 |           0 | No            |    -1.67753 |       89894644 | Acc           |    0.0000003317384 | No               | Acc                 |             89894637 |          7 |     89894644 |   0.0000002205815 | No          |      89894637 |         0.02545572 | No               |            0.02545572 | No  |
| chr10 | 89894645 | lol | A   | G   | .    | .      | .    | NR_135315:g.89894645:A>G    | Alter ESR      | 35.81 % [28.11 % - 44.1 %] |     0.288 | +      | 89894645 | substitution | A>G      | Exon 4   |       63 | NR_135315    | FAS  | acceptor  |             8 | ExonESR    |               0 | Outside SPiCE Interpretation |  0 |           0 | No            |    -1.67753 |       89894644 | Acc           |    0.0000003317384 | No               | Acc                 |             89894637 |          7 |     89894644 |   0.0000002205815 | No          |      89894637 |         0.02545572 | No               |            0.02545572 | No  |
|       |          |     |     |     |      |        |      |                             |                |                            |           |        |          |              |          |          |          |              |      |           |               |            |                 |                              |    |             |               |             |                |               |                    |                  |                     |                      |            |              |                   |             |               |                    |                  |                       |     |
**** DONE Check multiple transcripts in hg38 with genomic coordinates: ok
CLOSED: [2023-08-10 Thu 23:00]
Many more transcripts in T2T
E.g. 1 RefSeq curated transcript
http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr11%3A108257446%2D108257496&hgsid=1672963428_J5aWAqack2FpJ7mvhFTNVw7bKzxo
vs 2 transcripts in T2T
http://genome.ucsc.edu/cgi-bin/hgTracks?db=hub_3671779_hs1&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr11%3A108264969%2D108265019&hgsid=1672963612_Eso9frdQ7z6RkKkcKsIf2Waq3pec
This is indeed what we find with spip
*** DONE [#A] Filter vep with spip
CLOSED: [2023-08-13 Sun 00:39] SCHEDULED: <2023-08-12 Sat 19:00>
*** TODO Reorder the columns :annotation:
SCHEDULED: <2023-08-20 Sun> DEADLINE: <2023-08-19 Sat>
No OMIM, no CADD, no spliceAI
*** TODO CADD + spliceAI annotation on GRCh38 with the new version :annotation:
SCHEDULED: <2023-08-20 Sun>
*** DONE OMIM: only possible by gene name :annotation:
CLOSED: [2023-08-13 Sun 11:57] SCHEDULED: <2023-08-13 Sun 16:00>
Database not available, and updating it ourselves is complicated.
If we take the genes from GRCh38, they are not necessarily in T2T
E.g. DDX11L17 does not exist in T2T at these coordinates
zgrep DDX11L17 GCF_009914755.1_T2T-CHM13v2.0_genomic.gff.gz
Note: it is a pseudogene
https://www.genecards.org/cgi-bin/carddisp.pl?gene=DDX11L17
If we take the genes from T2T, some of them are new.
E.g. the first one is LOC101928626.
Nothing at this position in GRCh38
Trying with ENSEMBL: no, because the identifiers are not the same
E.g. ACHE
Ideally we would need the NCBI identifier (available in OMIM), but it is not in the VEP output
And that requires the "merged" version, so it is impossible in T2T
Is matching on the gene name feasible?
All T2T genes:
#+begin_src sh :dir ~/Downloads
 zgrep -o "ID=gene[^;]*;"  GCF_009914755.1_T2T-CHM13v2.0_genomic.gff.gz | sed 's/ID=gene-//;s/;//' | sort | uniq > t2t-genes.txt
 wc -l t2t-genes.txt
#+end_src
#+RESULTS:
: 57660 t2t-genes.txt
#+begin_src sh :dir ~/Downloads
zgrep -o "ID=gene[^;]*;"  GCF_000001405.40_GRCh38.p14_genomic.gff.gz | sed 's/ID=gene-//;s/;//' | sort | uniq > hg38-genes.txt
wc -l hg38-genes.txt
#+end_src
#+RESULTS:
: 67127 hg38-genes.txt
Genes common to both
#+begin_src sh :dir ~/Downloads
comm -12 t2t-genes.txt hg38-genes.txt | wc -l
#+end_src
#+RESULTS:
: 54506
Genes only in T2T
#+begin_src sh :dir ~/Downloads
comm -23 t2t-genes.txt hg38-genes.txt | wc -l
#+end_src
#+RESULTS:
: 3154
Genes only in GRCh38
#+begin_src sh :dir ~/Downloads
comm -13 t2t-genes.txt hg38-genes.txt | wc -l
#+end_src
#+RESULTS:
: 12621
*** TODO OMIM by gene name :annotation:
SCHEDULED: <2023-08-20 Sun>
*** TODO Mobidetails API
SCHEDULED: <2023-08-20 Sun>
*** PROJ Franklin API
https://www.postman.com/genoox-ps/workspace/franklin-api-documentation-s-public-workspace/documentation/6621518-4335389d-12e3-445f-8182-339df95b2a09
*** TODO Check whether clinical data is available with vep :annotation:
** TODO [#B] Quality indicators :qualité:
*** Idea
Raredisease:
- FastQC: many statistics. Not available in Nix
- Mosdepth: computes depth (2x faster than samtools depth). Available in Nix
- MultiQC: only merges the analysis results. Not available in Nix
- Picard's CollectMultipleMetrics, CollectHsMetrics, and CollectWgsMetrics
- Qualimap: alternative to FastQC? Not available in Nix
- Sentieon's WgsMetricsAlgo: proprietary
- TIDDIT's cov: TIDDIT = chromosomal rearrangement
Sarek:
- alignment statistics : samtools stats, mosdepth
- QC : MultiQC
MultiQC: not available in Nix
*** DONE FastQC
CLOSED: [2023-08-15 Tue 21:43] SCHEDULED: <2023-08-13 Sun>
*** DONE Mosdepth
CLOSED: [2023-08-15 Tue 21:43] SCHEDULED: <2023-08-13 Sun>
For exomes, the capture file is required
subworkflows/local/bam_markduplicates/
*** DONE Samtools stats
CLOSED: [2023-08-15 Tue 21:43] SCHEDULED: <2023-08-13 Sun>
*** DONE [#B] Execution report with MultiQC
CLOSED: [2023-08-15 Tue 21:43] SCHEDULED: <2023-08-13 Sun>
*** DONE Results on NA12878: 98% at 20x
CLOSED: [2023-08-19 Sat 20:45] SCHEDULED: <2023-08-17 Thu>
**** DONE Understand why only 91% at 20x: inserted SNVs
CLOSED: [2023-08-18 Fri 22:25]
***** DONE Test another kit: Twist exome comprehensive
CLOSED: [2023-08-18 Fri 22:24]
Worse
***** DONE Test genome without alt
CLOSED: [2023-08-18 Fri 22:25]
Same
***** DONE Test NA12878 without inserted SNVs: this is the cause!
CLOSED: [2023-08-18 Fri 22:25]
***** DONE Test hg19 on NA12878 without inserted SNVs
CLOSED: [2023-08-18 Fri 22:25]
**** DONE Understand why the SNVs lower the score: missing reads
CLOSED: [2023-08-19 Sat 20:34] SCHEDULED: <2023-08-18 Fri>
See [[id:5c1c36f3-f68e-4e6d-a7b6-61dca89abc37][Bug: many reads lost with NA12878]]
*** TODO Rerun results with NA1287 and NA12878 + sanger
SCHEDULED: <2023-08-19 Sat>
** HOLD check for normalization
** KILL [#B] HGVS nomenclature check :hgvs:
CLOSED: [2023-08-16 Wed 19:07] SCHEDULED: <2023-08-15 Tue>
*** KILL mutalyzer
CLOSED: [2023-08-16 Wed 19:07] SCHEDULED: <2023-08-13 Sun>
*** KILL API variantvalidator
CLOSED: [2023-08-16 Wed 19:07] SCHEDULED: <2023-08-13 Sun>
** DONE Execution
CLOSED: [2022-09-13 Tue 21:37]
*** KILL test Bionix
*** KILL Implement execution with Nix?
See https://academic.oup.com/gigascience/article/9/11/giaa121/5987272?login=false
for an example.
Probably simpler to use Nix for environment management and snakemake for execution
No internet access from the cluster
*** DONE nextflow
CLOSED: [2022-09-13 Tue 21:37]
**** TODO Bug scheduler SGE
The job gets killed because the user is not passed correctly to nextflow
***** DONE Force the user at run time
CLOSED: [2023-04-01 Sat 17:57]
NXF_OPTS=-D"user.name=alex"
***** DONE Check whether the problem persists with 22.10.6
CLOSED: [2023-04-01 Sat 18:38] SCHEDULED: <2023-04-01 Sat>
yes
***** KILL Package the user into the program?
Bad idea.
** TODO Preprocessing with nextflow
*** TODO Map to reference
**** TODO Sample ID dans header
/Work/Users/apraga/bisonex/out/63003856_S135/preprocessing/baserecalibrator
*** DONE Mark duplicate
CLOSED: [2022-10-09 Sun 22:30]
*** DONE Recalibrate base quality score
CLOSED: [2022-10-09 Sun 22:30]
** DONE Variant calling with Nextflow
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Haplotype caller
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter variants
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter common snp not clinvar path
CLOSED: [2022-11-07 Mon 23:00]
See [[*common dbSNP not clinvar patho][common dbSNP not clinvar patho]]
*** DONE Filter variant only in consensual sequence
CLOSED: [2022-11-08 Tue 22:23]
*** DONE Filter technical variants
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Use AVX to speed up execution
CLOSED: [2023-04-29 Sat 15:46]
Without it, we get the following warning
#+begin_quote
17:28:00.720 INFO  PairHMM - OpenMP multi-threaded AVX-accelerated native PairHMM implementation is not supported
17:28:00.721 INFO  NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/nix/store/cy9ckxqwrkifx7wf02hm4ww1p6lnbxg9-gatk-4.2.4.1/bin/gatk-package-4.2.4.1-local.jar!/com/intel/gkl/native/libgkl_utils.so
17:28:00.733 WARN  NativeLibraryLoader - Unable to load libgkl_utils.so from native/libgkl_utils.so (/Work/Users/apraga/bisonex/out/NA12878_NIST7035/preprocessing/applybqsr/libgkl_utils821485189051585397.so: libgomp.so.1: cannot open shared object file: No such file or directory)
17:28:00.733 WARN  IntelPairHmm - Intel GKL Utils not loaded
17:28:00.733 WARN  PairHMM - ***WARNING: Machine does not have the AVX instruction set support needed for the accelerated AVX PairHmm. Falling back to the MUCH slower LOGLESS_CACHING implementation!
17:28:00.763 INFO  ProgressMeter - Starting traversal
#+end_quote
libgomp.so is provided by gcc, so the module must be loaded
 module load gcc@11.3.0/gcc-12.1.0
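Before loading the module, it can help to tell the two possible causes apart: the log above suggests a missing `libgomp.so.1` rather than a CPU without AVX. A minimal sketch, assuming an x86 Linux host where CPU flags are listed in `/proc/cpuinfo`; the helper name `has_cpu_flag` is hypothetical and not part of the pipeline:

```shell
#!/bin/sh
# has_cpu_flag FLAG [FILE]: succeed if FLAG appears as a whole word on a
# "flags" line of FILE (default: /proc/cpuinfo, Linux only).
has_cpu_flag() {
    grep -E '^flags' "${2:-/proc/cpuinfo}" | grep -qw -- "$1"
}

if [ -r /proc/cpuinfo ]; then
    if has_cpu_flag avx; then
        echo "CPU reports AVX"
    else
        echo "CPU does not report AVX"
    fi
fi

# libgomp is resolved by the dynamic loader. Listing the loader cache shows
# whether it is visible without the gcc module (glibc systems only; a library
# provided through a module may not appear here).
if command -v ldconfig >/dev/null 2>&1; then
    ldconfig -p | grep libgomp || echo "libgomp not in loader cache"
fi
```

If the CPU reports AVX but libgomp is not found, the GATK warning is most likely a side effect of the failed library load, which is consistent with loading the gcc module as the fix.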
** KILL Use subworkflow
CLOSED: [2023-04-02 Sun 18:08]
Our version is more flexible
*** KILL Alignment
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
*** KILL Vep
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
vcf_annotate_ensemblvep
** TODO Annotation with nextflow :annotation:
*** KILL VEP: --gene-phenotype?
CLOSED: [2023-04-18 mar. 18:32]
Discussed with Alexis: databases are not up to date
https://www.ensembl.org/info/genome/variation/phenotype/sources_phenotype_documentation.html
*** DONE VEP plugin
CLOSED: [2023-04-18 mar. 18:32]
Clone the git repository containing the plugin
Then use --dir_plugins
*** HOLD Use Alexis's code
*** TODO New version with VEP
Example with --custom
https://www.ensembl.org/info/docs/tools/vep/script/vep_custom.html
**** DONE Add spliceAI
CLOSED: [2023-05-18 Thu 11:02] SCHEDULED: <2023-04-30 Sun>
VEP plugin
***** DONE Download the data
CLOSED: [2023-05-11 Thu 19:01]
Hard to automate: the download link is temporary.
***** DONE Plugin
CLOSED: [2023-05-11 Thu 20:16]
***** DONE Split the score into several columns
CLOSED: [2023-05-11 Thu 20:16]
Test with this file, to get one line with an annotation and one line without
#CHROM	POS	ID	REF	ALT
1	9091	.	A	C
1	69091	.	A	C
and
#+begin_src sh
rm -f postvep.tsv* && vep -i testspliceai.vcf.gz -o postvep.tsv --tab  --dir 109 --merged --pick --use_given_ref   --offline  --plugin SpliceAI,snv=spliceai_scores.raw.snv.hg38.vcf.gz,indel=spliceai_scores.raw.indel.hg38.vcf.gz
#+end_src
#+begin_src
$ bgzip postvep.tsv
$ python spliceai.py
$ cat postvep2.tsv
,variation,Location,Allele,Gene,Feature,Feature_type,Consequence,cDNA_position,CDS_position,Protein_position,Amino_acids,Codons,Existing_variation,IMPACT,DISTANCE,STRAND,FLAGS,REFSEQ_MATCH,SOURCE,REFSEQ_OFFSET,SpliceAI_AG,SpliceAI_AL,SpliceAI_DG,SpliceAI_DL
0,1_9091_A/C,1:9091,C,ENSG00000290825,ENST00000456328,Transcript,upstream_gene_variant,-,-,-,-,-,-,MODIFIER,2778,1,-,-,Ensembl,-,,,,
1,1_69091_A/C,1:69091,C,ENSG00000186092,ENST00000641515,Transcript,missense_variant,124,64,22,M/L,Atg/Ctg,-,MODERATE,-,1,-,-,Ensembl,-,0.01,0.00,0.00,0.01
#+end_src
Test
cp work/bf/437ae511958509e43072f032f4d495/small.tab.gz tests/vep-spip.tab.gz
cp work/d5/3b1244b5ae83d54409ee0d456e8c55/small_cadd.tab.gz tests/vep-cadd-splice.tab.gz
**** HOLD Nix package for spliceAI?
nix profile install nixpkgs#python3Packages.tensorflow
+ add dependencies ("grep import" or conda)
**** TODO Add LOEUF and pLI
VEP plugin
**** TODO NMD
VEP plugin
**** KILL Add LOEUF
CLOSED: [2023-04-19 mer. 16:32]
VEP plugin
**** DONE Spip
CLOSED: [2023-05-01 Mon 23:07] SCHEDULED: <2023-04-30 Sun>
BED does not seem to work well (a region has to be defined)
VCF: too much information
Note: several transcripts but identical results. We remove the duplicates
***** DONE interpretation + score + confidence interval as separate fields
CLOSED: [2023-05-01 Mon 23:07] SCHEDULED: <2023-04-30 Sun>
Tests:
in tests/
vep -i 63004925-small.vcf -o postvep.vcf --vcf --fasta genomeRef.fna --dir 109 --merged --pick  --offline --custom ../script/spip_annotation.vcf.gz,SPIP,vcf,exact,0,spipInterp,spipScore,spipConfidence
***** DONE Score
CLOSED: [2023-04-22 Sat 15:30]
**** DONE CADD: replace with VEP plugin
CLOSED: [2023-05-07 Sun 14:45] SCHEDULED: <2023-05-07 Sun>
***** Test
#+begin_src
vep  -i test.vcf  -o lol.vcf --offline --dir  /Work/Projects/bisonex/data/vep/GRCh38/ --merged --vcf --fasta /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna --plugin CADD,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.snv.tsv.gz,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.indel.tsv.gz  --dir_plugins ../VEP_plugins/ -v
#+end_src
Test
#+begin_src sh
vep --id "1  230710048 230710048 A/G 1"   --offline --dir  /Work/Projects/bisonex/data/vep/GRCh38/ --merged --vcf --fasta /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna --plugin CADD,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.snv.tsv.gz,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.indel.tsv.gz  --hgvsg --plugin pLI --plugin LOEUF -o lol
#+end_src
CSQ=G|missense_variant|MODERATE|AGT|ENSG00000135744|Transcript|ENST00000366667|protein_coding|2/5||||843|776|259|M/T|aTg/aCg|||-1||HGNC|HGNC:333||Ensembl||A|A||1:g.230710048A>G|0.347|-0.277922|
Matches https://www.ensembl.org/Homo_sapiens/Tools/VEP/Results?tl=I7ZsIbrj14P6lD43-9115494
***** DONE Use whole genome
CLOSED: [2023-04-29 Sat 15:46]
***** KILL Rename the chromosomes beforehand ...
CLOSED: [2023-05-01 Mon 09:14] SCHEDULED: <2023-04-30 Sun>
Too slow!
- Downloading CADD: 4h20
- renaming the chromosomes for the SNVs: 6h20
- tabix on the SNVs: job killed after 21h
***** DONE annotate separately and merge the tables
CLOSED: [2023-05-07 Sun 14:45] SCHEDULED: <2023-05-01 Mon>
NB: we could filter CADD with tabix to restrict it to our variants
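The tabix idea in the note above could look like the following: derive the positions of our variants from the VCF, then query only those positions in the bgzipped, tabix-indexed CADD table. This is a sketch under assumptions, not the pipeline's code: the helper `vcf_positions` and all file names are hypothetical, and the VCF and the CADD file are assumed to use the same chromosome names (otherwise the renaming problem noted above applies).

```shell
#!/bin/sh
# vcf_positions FILE: print a tab-separated "chrom<TAB>pos" line for every
# record of an uncompressed VCF, skipping header lines, without duplicates.
vcf_positions() {
    awk -F '\t' '!/^#/ { print $1 "\t" $2 }' "$1" | sort -u
}

# Toy VCF (same two positions as the spliceAI test file) to show the output.
tmp=$(mktemp -d)
printf '#CHROM\tPOS\tID\tREF\tALT\n1\t9091\t.\tA\tC\n1\t69091\t.\tA\tC\n' > "$tmp/test.vcf"
vcf_positions "$tmp/test.vcf" > "$tmp/regions.tsv"
cat "$tmp/regions.tsv"

# With tabix installed and an indexed CADD file at hand (path is an assumption):
#   tabix -h -R regions.tsv whole_genome_SNVs.tsv.gz > cadd_subset.tsv
```

`tabix -R` accepts a tab-delimited file with CHROM and POS columns, so the regions file can be passed as is; `-h` keeps the header lines of the CADD table in the subset.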
**** DONE clinvar
CLOSED: [2023-04-22 Sat 15:31]
**** KILL Check HGVS results with mutalyzer
CLOSED: [2023-05-01 Mon 09:26]
**** HOLD Parallelization
***** HOLD per chromosome with the VEP workflow
https://github.com/Ensembl/ensembl-vep/blob/release/109/nextflow/workflows/run_vep.nf
***** HOLD With the --fork option
**** DONE Use the nf-core version of VEP
CLOSED: [2023-05-13 Sat 18:27] SCHEDULED: <2023-05-07 Sun>
**** DONE OMIM
CLOSED: [2023-05-08 Mon 15:02] SCHEDULED: <2023-05-01 Mon>
**** PROJ Grantham
SCHEDULED: <2023-08-20 Sun>
**** PROJ ACMG incidental
**** TODO Gnomad ?
**** TODO VCF output (to get the allele fraction AF)
SCHEDULED: <2023-08-20 Sun>
**** TODO VCF -> tsv
SCHEDULED: <2023-08-20 Sun>
**** DONE Filter after VEP with filter_vep
CLOSED: [2023-04-29 Sat 15:47]
Not tested
*** TODO Compare the annotations on 63003856
**** Rerun the new pipeline
*** HOLD Old version
**** TODO HGVS
**** TODO Filter after VEP
**** TODO OMIM
**** TODO clinvar
**** TODO ACMG incidental
**** TODO Grantham
**** KILL LRG
CLOSED: [2023-04-18 mar. 17:22] SCHEDULED: <2023-04-18 Tue>
Discussed with Alexis: no longer up to date
**** TODO Gnomad
** DONE Port Alexis's exact version to Helios
CLOSED: [2023-01-14 Sat 17:56]
"prod" branch
** KILL Test Alexis's version with Nix
CLOSED: [2023-06-14 Wed 22:37]
*** DONE Add clinvar
CLOSED: [2022-11-13 Sun 19:37]
*** DONE Alignment
CLOSED: [2022-11-13 Sun 12:52]
*** DONE Haplotype caller
CLOSED: [2022-11-13 Sun 13:00]
*** KILL Filter
CLOSED: [2023-06-14 Wed 22:37]
- [X] depth
- [ ] common snp not path
Problem with the ID list
**** KILL variant annotation
CLOSED: [2023-06-14 Wed 22:37]
Needs vep
*** KILL Variant calling
CLOSED: [2023-06-14 Wed 22:37]
** KILL Test sarek
CLOSED: [2023-08-12 Sat 15:53]
#+begin_src sh
 module load apptainer/1.1.8
 nextflow run nf-core/sarek -profile test,singularity --outdir test-sarek
#+end_src
The dependencies do not download correctly, so we extract them by hand
#+begin_src sh
 rg -IN galaxyproject modules  | sed 's/ //g;s/:$//' | sort | uniq > deps.txt
#+end_src
 Manual cleanup
 Then
 #+begin_src sh
 cat deps.txt | xargs -L1 singularity pull
 #+end_src
** DONE Samplesheet support
CLOSED: [2023-08-03 Thu 14:24] SCHEDULED: <2023-08-03 Thu 13:00>
/Entered on/ [2023-08-03 Thu 13:12]
** DONE Small dataset: chr22 on HG001
CLOSED: [2023-08-05 Sat 14:21] SCHEDULED: <2023-08-05 Sat>
* Improvement :amelioration:
* Documentation
:PROPERTIES:
:CATEGORY: doc
:END:
** DONE Installation procedure for nix + dependencies on the CHU VM
CLOSED: [2023-04-22 Sat 15:27] SCHEDULED: <2023-04-13 Thu>
* Manuscript
:PROPERTIES:
:CATEGORY: manuscript
:END:
* Tests :tests:
** KILL Non-regression: prod version
CLOSED: [2023-05-23 Tue 08:46]
*** DONE ID common snp
CLOSED: [2022-11-19 Sat 21:36]
#+begin_src
$ wc -l ID_of_common_snp.txt
23194290 ID_of_common_snp.txt
$ wc -l /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp.txt
23194290 /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp.txt
#+end_src
*** DONE ID common snp not clinvar patho
CLOSED: [2022-12-11 Sun 20:11]
**** DONE Checking the problem
CLOSED: [2022-12-11 Sun 16:30]
On J:
21155134 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt.ref
"Non-regression" version
21155076 database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt
New version
23193391 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt
After removing duplicates
$ sort database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt | uniq > old.txt
$ wc -l old.txt
21107097 old.txt
$ sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt | uniq > new.txt
$ wc -l new.txt
21174578 new.txt
$ sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt.ref | uniq > ref.txt
$ wc -l ref.txt
21107155 ref.txt
Looking at the difference
 comm -23 ref.txt old.txt
rs1052692
rs1057518973
rs1057518973
rs11074121
rs112848754
rs12573787
rs145033890
rs147889095
rs1553904159
rs1560294695
rs1560296615
rs1560310926
rs1560325547
rs1560342418
rs1560356225
rs1578287542
...
We look up the first one
bcftools query -i 'ID="rs1052692"' database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 1619351 C A,T
It is indeed pathogenic...
$ bcftools query -i 'POS=1619351' database/clinvar/clinvar.vcf.gz -f '%CHROM %POS %REF %ALT %INFO/CLNSIG\n'
19 1619351 C T Conflicting_interpretations_of_pathogenicity
We check all the others
$ comm -23 ref.txt old.txt > tocheck.txt
We generate the regions to check (chromosome number:position)
$ bcftools query -i 'ID=@tocheck.txt' database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM\t%POS\n' > tocheck.pos
We generate the reverse mapping (chromosome number -> NC)
$ awk ' { t = $1; $1 = $2; $2 = t; print; } ' database/RefSeq/refseq_to_number_only_consensual.txt  > mapping.txt
We remap clinvar
$ bcftools annotate --rename-chrs mapping.txt database/clinvar/clinvar.vcf.gz -o clinvar_remapped.vcf.gz
$ tabix clinvar_remapped.vcf.gz
Finally, we look up the classification in clinvar
$ bcftools query -R tocheck.pos clinvar_remapped.vcf.gz -f '%CHROM %POS %INFO/CLNSIG\n'
$ bcftools query -R tocheck.pos database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM %POS %ID \n' | grep '^NC'
#+RESULTS:
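The column swap used for the reverse mapping can be checked on a toy file before running it on the real table. A minimal sketch, assuming the mapping file has two whitespace-separated columns (RefSeq accession, then chromosome number); the two sample rows are illustrative and not copied from `refseq_to_number_only_consensual.txt`:

```shell
#!/bin/sh
tmp=$(mktemp -d)
# Toy mapping: RefSeq accession -> chromosome number (illustrative rows).
printf 'NC_000001.11 1\nNC_000019.10 19\n' > "$tmp/refseq_to_number.txt"

# Swap the two columns to get "number accession", which is the
# "old_name new_name" order that bcftools annotate --rename-chrs reads.
awk '{ print $2, $1 }' "$tmp/refseq_to_number.txt" > "$tmp/mapping.txt"
cat "$tmp/mapping.txt"
```

Printing `$2, $1` directly is equivalent to the temporary-variable swap for a two-column file, and makes the intended output order explicit.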
**** DONE Comprendre pourquoi la n

[7.18249]

[2.16413]

utside SPiCE Interpretation |  0 |           0 | No            |    10.00000 |       89894644 | Don           |    0.0001360257829 | No               | Don                 |             89894485 |        159 |            0 |   0.0000000000000 | No          |      89894485 |         0.07177992 | Yes              |            0.07177992 | Yes |
| chr10 | 89894645 | lol | A   | G   | .    | .      | .    | NR_028034:g.89894645:A>G    | NTR            | 00 % [00 % - 00.92 %]      |     0.000 | +      | 89894645 | substitution | A>G      | Intron 3 |     1398 | NR_028034    | FAS  | donor     |           160 | DeepIntron |               0 | Outside SPiCE Interpretation |  0 |           0 | No            |    10.00000 |       89894644 | Don           |    0.0001360257829 | No               | Don                 |             89894485 |        159 |            0 |   0.0000000000000 | No          |      89894485 |         0.07177992 | Yes              |            0.07177992 | Yes |
| chr10 | 89894645 | lol | A   | G   | .    | .      | .    | NR_028035:g.89894645:A>G    | Alter ESR      | 35.81 % [28.11 % - 44.1 %] |     0.288 | +      | 89894645 | substitution | A>G      | Exon 4   |       63 | NR_028035    | FAS  | acceptor  |             8 | ExonESR    |               0 | Outside SPiCE Interpretation |  0 |           0 | No            |    -1.67753 |       89894644 | Acc           |    0.0000003317384 | No               | Acc                 |             89894637 |          7 |     89894644 |   0.0000002205815 | No          |      89894637 |         0.02545572 | No               |            0.02545572 | No  |
| chr10 | 89894645 | lol | A   | G   | .    | .      | .    | NR_028036:g.89894645:A>G    | Alter ESR      | 35.81 % [28.11 % - 44.1 %] |     0.288 | +      | 89894645 | substitution | A>G      | Exon 5   |       63 | NR_028036    | FAS  | acceptor  |             8 | ExonESR    |               0 | Outside SPiCE Interpretation |  0 |           0 | No            |    -1.67753 |       89894644 | Acc           |    0.0000003317384 | No               | Acc                 |             89894637 |          7 |     89894644 |   0.0000002205815 | No          |      89894637 |         0.02545572 | No               |            0.02545572 | No  |
| chr10 | 89894645 | lol | A   | G   | .    | .      | .    | NR_135313:g.89894645:A>G    | Alter ESR      | 35.81 % [28.11 % - 44.1 %] |     0.288 | +      | 89894645 | substitution | A>G      | Exon 5   |       63 | NR_135313    | FAS  | acceptor  |             8 | ExonESR    |               0 | Outside SPiCE Interpretation |  0 |           0 | No            |    -1.67753 |       89894644 | Acc           |    0.0000003317384 | No               | Acc                 |             89894637 |          7 |     89894644 |   0.0000002205815 | No          |      89894637 |         0.02545572 | No               |            0.02545572 | No  |
| chr10 | 89894645 | lol | A   | G   | .    | .      | .    | NM_001410956:g.89894645:A>G | Alter ESR      | 35.81 % [28.11 % - 44.1 %] |     0.288 | +      | 89894645 | substitution | A>G      | Exon 6   |       63 | NM_001410956 | FAS  | acceptor  |             8 | ExonESR    |               0 | Outside SPiCE Interpretation |  0 |           0 | No            |    -1.67753 |       89894644 | Acc           |    0.0000003317384 | No               | Acc                 |             89894637 |          7 |     89894644 |   0.0000002205815 | No          |      89894637 |         0.02545572 | No               |            0.02545572 | No  |
| chr10 | 89894645 | lol | A   | G   | .    | .      | .    | NR_135314:g.89894645:A>G    | Alter ESR      | 35.81 % [28.11 % - 44.1 %] |     0.288 | +      | 89894645 | substitution | A>G      | Exon 6   |       63 | NR_135314    | FAS  | acceptor  |             8 | ExonESR    |               0 | Outside SPiCE Interpretation |  0 |           0 | No            |    -1.67753 |       89894644 | Acc           |    0.0000003317384 | No               | Acc                 |             89894637 |          7 |     89894644 |   0.0000002205815 | No          |      89894637 |         0.02545572 | No               |            0.02545572 | No  |
| chr10 | 89894645 | lol | A   | G   | .    | .      | .    | NR_135315:g.89894645:A>G    | Alter ESR      | 35.81 % [28.11 % - 44.1 %] |     0.288 | +      | 89894645 | substitution | A>G      | Exon 4   |       63 | NR_135315    | FAS  | acceptor  |             8 | ExonESR    |               0 | Outside SPiCE Interpretation |  0 |           0 | No            |    -1.67753 |       89894644 | Acc           |    0.0000003317384 | No               | Acc                 |             89894637 |          7 |     89894644 |   0.0000002205815 | No          |      89894637 |         0.02545572 | No               |            0.02545572 | No  |
|       |          |     |     |     |      |        |      |                             |                |                            |           |        |          |              |          |          |          |              |      |           |               |            |                 |                              |    |             |               |             |                |               |                    |                  |                     |                      |            |              |                   |             |               |                    |                  |                       |     |
**** DONE Check multiple transcripts in hg38 with genomic coordinates: ok
CLOSED: [2023-08-10 Thu 23:00]
Many more transcripts in T2T
E.g. 1 RefSeq curated transcript
http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr11%3A108257446%2D108257496&hgsid=1672963428_J5aWAqack2FpJ7mvhFTNVw7bKzxo
vs 2 transcripts in T2T
http://genome.ucsc.edu/cgi-bin/hgTracks?db=hub_3671779_hs1&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr11%3A108264969%2D108265019&hgsid=1672963612_Eso9frdQ7z6RkKkcKsIf2Waq3pec
This is indeed what we find with spip
*** DONE [#A] Filter vep with spip
CLOSED: [2023-08-13 Sun 00:39] SCHEDULED: <2023-08-12 Sat 19:00>
*** PROJ Reorder the columns :annotation:
No OMIM, no CADD, no spliceAI
*** TODO CADD + spliceAI annotation on GRCh38 with the new version :annotation:
SCHEDULED: <2023-08-20 Sun>
*** DONE OMIM: only possible by gene name :annotation:
CLOSED: [2023-08-13 Sun 11:57] SCHEDULED: <2023-08-13 Sun 16:00>
Database not available, and updating it ourselves is complicated.
If we take the genes from GRCh38, they are not necessarily in T2T
E.g. DDX11L17 does not exist in T2T at these coordinates
zgrep DDX11L17 GCF_009914755.1_T2T-CHM13v2.0_genomic.gff.gz
Note: it is a pseudogene
https://www.genecards.org/cgi-bin/carddisp.pl?gene=DDX11L17
If we take the genes from T2T, some of them are new.
E.g. the first one is LOC101928626.
Nothing at this position in GRCh38
Trying with ENSEMBL: no, because the identifiers are not the same
E.g. ACHE
Ideally we would need the NCBI identifier (available in OMIM), but it is not in the VEP output
And that requires the "merged" version, so it is impossible in T2T
Is matching on the gene name feasible?
All T2T genes:
#+begin_src sh :dir ~/Downloads
 zgrep -o "ID=gene[^;]*;"  GCF_009914755.1_T2T-CHM13v2.0_genomic.gff.gz | sed 's/ID=gene-//;s/;//' | sort | uniq > t2t-genes.txt
 wc -l t2t-genes.txt
#+end_src
#+RESULTS:
: 57660 t2t-genes.txt
#+begin_src sh :dir ~/Downloads
zgrep -o "ID=gene[^;]*;"  GCF_000001405.40_GRCh38.p14_genomic.gff.gz | sed 's/ID=gene-//;s/;//' | sort | uniq > hg38-genes.txt
wc -l hg38-genes.txt
#+end_src
#+RESULTS:
: 67127 hg38-genes.txt
Genes common to both
#+begin_src sh :dir ~/Downloads
comm -12 t2t-genes.txt hg38-genes.txt | wc -l
#+end_src
#+RESULTS:
: 54506
Genes only in T2T
#+begin_src sh :dir ~/Downloads
comm -23 t2t-genes.txt hg38-genes.txt | wc -l
#+end_src
#+RESULTS:
: 3154
Genes only in GRCh38
#+begin_src sh :dir ~/Downloads
comm -13 t2t-genes.txt hg38-genes.txt | wc -l
#+end_src
#+RESULTS:
: 12621
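The three `comm` calls above can also be sketched in Python, which keeps the gene names themselves at hand for a later match against OMIM. This is only a minimal sketch: it assumes the same `ID=gene-…;` attribute convention as the NCBI GFF files above, and `gene_names` as well as the sample lines are hypothetical, not taken from the real annotation.

```python
import re

# Same pattern as the zgrep/sed pipeline: keep what follows "ID=gene-" up to ";".
GENE_ID = re.compile(r"ID=gene-([^;]*);")


def gene_names(gff_lines):
    """Return the set of gene names found in the attribute column of GFF lines."""
    names = set()
    for line in gff_lines:
        match = GENE_ID.search(line)
        if match:
            names.add(match.group(1))
    return names


# Hypothetical excerpts (made-up coordinates), one list per assembly.
t2t = gene_names([
    "chr7\tBestRefSeq\tgene\t1\t10\t.\t-\t.\tID=gene-ACHE;Name=ACHE",
    "chr1\tGnomon\tgene\t20\t30\t.\t+\t.\tID=gene-LOC101928626;Name=LOC101928626",
])
hg38 = gene_names([
    "chr7\tBestRefSeq\tgene\t5\t15\t.\t-\t.\tID=gene-ACHE;Name=ACHE",
    "chr2\tBestRefSeq\tgene\t40\t50\t.\t+\t.\tID=gene-DDX11L17;Name=DDX11L17",
])

common = t2t & hg38      # equivalent of comm -12
only_t2t = t2t - hg38    # equivalent of comm -23
only_hg38 = hg38 - t2t   # equivalent of comm -13
print(len(common), len(only_t2t), len(only_hg38))
```

On the real files the lines would come from `gzip.open(path, "rt")` rather than from in-memory lists.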
*** TODO OMIM on the gene name :annotation:
*** PROJ Mobidetails API
*** PROJ Franklin API
https://www.postman.com/genoox-ps/workspace/franklin-api-documentation-s-public-workspace/documentation/6621518-4335389d-12e3-445f-8182-339df95b2a09
*** TODO Check whether clinical data is available with VEP :annotation:
** TODO [#B] Quality indicators :qualité:
*** Idea
Raredisease:
- FastQC: many statistics. Not available in Nix
- Mosdepth: computes depth (2x faster than samtools depth). Available in Nix
- MultiQC: only merges the results of the analyses. Not available in Nix
- Picard's CollectMultipleMetrics, CollectHsMetrics, and CollectWgsMetrics
- Qualimap: alternative to FastQC? Not available in Nix
- Sentieon's WgsMetricsAlgo: proprietary
- TIDDIT's cov: TIDDIT = chromosomal rearrangements
Sarek:
- alignment statistics: samtools stats, mosdepth
- QC: MultiQC
MultiQC: not available in Nix
*** DONE FastQC
CLOSED: [2023-08-15 Tue 21:43] SCHEDULED: <2023-08-13 Sun>
*** DONE Mosdepth
CLOSED: [2023-08-15 Tue 21:43] SCHEDULED: <2023-08-13 Sun>
For exomes, the capture file is required
subworkflows/local/bam_markduplicates/
*** DONE Samtools stats
CLOSED: [2023-08-15 Tue 21:43] SCHEDULED: <2023-08-13 Sun>
*** DONE [#B] Execution report with MultiQC
CLOSED: [2023-08-15 Tue 21:43] SCHEDULED: <2023-08-13 Sun>
*** DONE Results on NA12878: 98% at 20x
CLOSED: [2023-08-19 Sat 20:45] SCHEDULED: <2023-08-17 Thu>
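For reference, a figure such as "98% at 20x" can be read from a mosdepth cumulative distribution file. The sketch below assumes three tab-separated columns (chromosome or `total`, depth, proportion of bases covered at least at that depth); this layout should be checked against the mosdepth version in use. `fraction_at_depth` and the sample lines are hypothetical, not pipeline output.

```python
def fraction_at_depth(dist_lines, depth=20, chrom="total"):
    """Return the proportion of bases covered at >= depth, or None if absent."""
    for line in dist_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip malformed or empty lines
        name, cov, proportion = fields
        if name == chrom and int(cov) == depth:
            return float(proportion)
    return None


# Hypothetical lines in the assumed layout.
sample = [
    "chr1\t20\t0.97",
    "total\t30\t0.90",
    "total\t20\t0.98",
    "total\t0\t1.00",
]
at_20x = fraction_at_depth(sample)
print(f"{at_20x:.0%} of bases at 20x or more")
```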
**** DONE Understand why only 91% at 20x: inserted SNVs
CLOSED: [2023-08-18 Fri 22:25]
***** DONE Test another kit: Twist exome comprehensive
CLOSED: [2023-08-18 Fri 22:24]
Worse
***** DONE Test genome without alt
CLOSED: [2023-08-18 Fri 22:25]
Same result
***** DONE Test NA12878 without inserted SNVs: this is the cause!
CLOSED: [2023-08-18 Fri 22:25]
***** DONE Test hg19 on NA12878 without insertion
CLOSED: [2023-08-18 Fri 22:25]
**** DONE Understand why SNVs lower the score: missing reads
CLOSED: [2023-08-19 Sat 20:34] SCHEDULED: <2023-08-18 Fri>
See [[id:5c1c36f3-f68e-4e6d-a7b6-61dca89abc37][Bug: loss of many reads with NA12878]]
*** TODO Rerun results with NA1287 and NA12878 + sanger
SCHEDULED: <2023-08-20 Sun>
*** TODO Compare with hg19
SCHEDULED: <2023-08-20 Sun>
*** TODO Compare with other capture kits
SCHEDULED: <2023-08-20 Sun>
*** TODO Compare with no-alt
SCHEDULED: <2023-08-20 Sun>
SCHEDULED: <2023-08-20 Sun>
** HOLD Check for normalisation
** KILL [#B] HGVS nomenclature check :hgvs:
CLOSED: [2023-08-16 Wed 19:07] SCHEDULED: <2023-08-15 Tue>
*** KILL mutalyzer
CLOSED: [2023-08-16 Wed 19:07] SCHEDULED: <2023-08-13 Sun>
*** KILL API variantvalidator
CLOSED: [2023-08-16 Wed 19:07] SCHEDULED: <2023-08-13 Sun>
** DONE Execution
CLOSED: [2022-09-13 Tue 21:37]
*** KILL test Bionix
*** KILL Implement execution with Nix?
See https://academic.oup.com/gigascience/article/9/11/giaa121/5987272?login=false
for an example.
Probably simpler to use Nix for environment management and snakemake for execution
No internet access from the cluster
*** DONE nextflow
CLOSED: [2022-09-13 Tue 21:37]
**** TODO SGE scheduler bug
The job gets killed because the user is not passed correctly to nextflow
***** DONE Force the user at run time
CLOSED: [2023-04-01 Sat 17:57]
NXF_OPTS=-D"user.name=alex"
***** DONE Check whether the problem persists with 22.10.6
CLOSED: [2023-04-01 Sat 18:38] SCHEDULED: <2023-04-01 Sat>
yes
***** KILL Package the user into the program?
Bad idea.
** TODO Preprocessing with nextflow
*** TODO Map to reference
**** TODO Sample ID in header
/Work/Users/apraga/bisonex/out/63003856_S135/preprocessing/baserecalibrator
*** DONE Mark duplicate
CLOSED: [2022-10-09 Sun 22:30]
*** DONE Recalibrate base quality score
CLOSED: [2022-10-09 Sun 22:30]
** DONE Variant calling with Nextflow
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Haplotype caller
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter variants
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter common snp not clinvar path
CLOSED: [2022-11-07 Mon 23:00]
See [[*common dbSNP not clinvar patho][common dbSNP not clinvar patho]]
*** DONE Filter variant only in consensual sequence
CLOSED: [2022-11-08 Tue 22:23]
*** DONE Filter technical variants
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Use AVX to speed up execution
CLOSED: [2023-04-29 Sat 15:46]
Without it, we get this warning
#+begin_quote
17:28:00.720 INFO  PairHMM - OpenMP multi-threaded AVX-accelerated native PairHMM implementation is not supported
17:28:00.721 INFO  NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/nix/store/cy9ckxqwrkifx7wf02hm4ww1p6lnbxg9-gatk-4.2.4.1/bin/gatk-package-4.2.4.1-local.jar!/com/intel/gkl/native/libgkl_utils.so
17:28:00.733 WARN  NativeLibraryLoader - Unable to load libgkl_utils.so from native/libgkl_utils.so (/Work/Users/apraga/bisonex/out/NA12878_NIST7035/preprocessing/applybqsr/libgkl_utils821485189051585397.so: libgomp.so.1: cannot open shared object file: No such file or directory)
17:28:00.733 WARN  IntelPairHmm - Intel GKL Utils not loaded
17:28:00.733 WARN  PairHMM - ***WARNING: Machine does not have the AVX instruction set support needed for the accelerated AVX PairHmm. Falling back to the MUCH slower LOGLESS_CACHING implementation!
17:28:00.763 INFO  ProgressMeter - Starting traversal
#+end_quote
libgomp.so is provided by gcc, so the module has to be loaded
 module load gcc@11.3.0/gcc-12.1.0
** KILL Use subworkflow
CLOSED: [2023-04-02 Sun 18:08]
Our version is more flexible
*** KILL Alignment
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
*** KILL Vep
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
vcf_annotate_ensemblvep
** TODO Annotation with nextflow :annotation:
*** KILL VEP : --gene-phenotype ?
CLOSED: [2023-04-18 mar. 18:32]
Discussed with Alexis: the databases are out of date
https://www.ensembl.org/info/genome/variation/phenotype/sources_phenotype_documentation.html
*** DONE plugin VEP
CLOSED: [2023-04-18 mar. 18:32]
Clone the git repository containing the plugins
Then use --dir_plugins
*** HOLD Use Alexis's code
*** TODO New version with VEP
Example with --custom
https://www.ensembl.org/info/docs/tools/vep/script/vep_custom.html
**** DONE Add spliceAI
CLOSED: [2023-05-18 Thu 11:02] SCHEDULED: <2023-04-30 Sun>
plugin VEP
***** DONE Download the data
CLOSED: [2023-05-11 Thu 19:01]
Hard to automate: the link is temporary.
***** DONE Plugin
CLOSED: [2023-05-11 Thu 20:16]
***** DONE Split the score into several columns
CLOSED: [2023-05-11 Thu 20:16]
Test with this file to get one line with an annotation and one line without
#CHROM	POS	ID	REF	ALT
1	9091	.	A	C
1	69091	.	A	C
and
#+begin_src sh
rm -f postvep.tsv* && vep -i testspliceai.vcf.gz -o postvep.tsv --tab  --dir 109 --merged --pick --use_given_ref   --offline  --plugin SpliceAI,snv=spliceai_scores.raw.snv.hg38.vcf.gz,indel=spliceai_scores.raw.indel.hg38.vcf.gz
#+end_src
#+begin_src
$ bgzip postvep.tsv
$ python spliceai.py
$ cat postvep2.tsv
,variation,Location,Allele,Gene,Feature,Feature_type,Consequence,cDNA_position,CDS_position,Protein_position,Amino_acids,Codons,Existing_variation,IMPACT,DISTANCE,STRAND,FLAGS,REFSEQ_MATCH,SOURCE,REFSEQ_OFFSET,SpliceAI_AG,SpliceAI_AL,SpliceAI_DG,SpliceAI_DL
0,1_9091_A/C,1:9091,C,ENSG00000290825,ENST00000456328,Transcript,upstream_gene_variant,-,-,-,-,-,-,MODIFIER,2778,1,-,-,Ensembl,-,,,,
1,1_69091_A/C,1:69091,C,ENSG00000186092,ENST00000641515,Transcript,missense_variant,124,64,22,M/L,Atg/Ctg,-,MODERATE,-,1,-,-,Ensembl,-,0.01,0.00,0.00,0.01
#+end_src
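The `spliceai.py` script itself is not shown in these notes. A minimal sketch of the splitting step could look like the following, assuming the plugin writes a single `SpliceAI_pred` value of the form `SYMBOL|DS_AG|DS_AL|DS_DG|DS_DL|DP_AG|DP_AL|DP_DG|DP_DL` (an assumption to verify against the plugin version in use). `split_spliceai` and the example string are hypothetical; only the four output column names come from the table above.

```python
# Output column names as they appear in postvep2.tsv above.
DELTA_SCORES = ("SpliceAI_AG", "SpliceAI_AL", "SpliceAI_DG", "SpliceAI_DL")


def split_spliceai(pred):
    """Split one SpliceAI prediction string into the four delta-score columns.

    Lines without annotation (empty string or "-") yield empty values,
    so annotated and non-annotated rows keep the same columns.
    """
    if pred in ("", "-"):
        return dict.fromkeys(DELTA_SCORES, "")
    fields = pred.split("|")
    # Assumed layout: fields[0] is the gene symbol, fields[1:5] the delta scores.
    return dict(zip(DELTA_SCORES, fields[1:5]))


# Hypothetical value in the assumed layout (positions are made up).
annotated = split_spliceai("OR4F5|0.01|0.00|0.00|0.01|-32|49|-40|-31")
missing = split_spliceai("-")
print(annotated)
print(missing)
```

The scores are kept as strings on purpose, so that formatting such as `0.00` survives a round trip through the TSV.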
Test
cp work/bf/437ae511958509e43072f032f4d495/small.tab.gz tests/vep-spip.tab.gz
cp work/d5/3b1244b5ae83d54409ee0d456e8c55/small_cadd.tab.gz tests/vep-cadd-splice.tab.gz
**** HOLD Nix package for spliceAI?
nix profile install nixpkgs#python3Packages.tensorflow
+ add dependencies ("grep import" or conda)
**** TODO Add LOEUF and pLI
plugin VEP
**** TODO NMD
plugin VEP
**** KILL Add LOEUF
CLOSED: [2023-04-19 mer. 16:32]
plugin VEP
**** DONE Spip
CLOSED: [2023-05-01 Mon 23:07] SCHEDULED: <2023-04-30 Sun>
BED does not seem to work well (a region has to be defined)
VCF: too much information
Note: several transcripts but identical results. We remove the duplicates
***** DONE Interpretation + score + confidence interval as separate fields
CLOSED: [2023-05-01 Mon 23:07] SCHEDULED: <2023-04-30 Sun>
Tests:
in tests/
vep -i 63004925-small.vcf -o postvep.vcf --vcf --fasta genomeRef.fna --dir 109 --merged --pick  --offline --custom ../script/spip_annotation.vcf.gz,SPIP,vcf,exact,0,spipInterp,spipScore,spipConfidence
***** DONE Score
CLOSED: [2023-04-22 Sat 15:30]
**** DONE CADD: replace with the VEP plugin
CLOSED: [2023-05-07 Sun 14:45] SCHEDULED: <2023-05-07 Sun>
***** Test
#+begin_src
vep  -i test.vcf  -o lol.vcf --offline --dir  /Work/Projects/bisonex/data/vep/GRCh38/ --merged --vcf --fasta /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna --plugin CADD,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.snv.tsv.gz,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.indel.tsv.gz  --dir_plugins ../VEP_plugins/ -v
#+end_src
Test
#+begin_src sh
vep --id "1  230710048 230710048 A/G 1"   --offline --dir  /Work/Projects/bisonex/data/vep/GRCh38/ --merged --vcf --fasta /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna --plugin CADD,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.snv.tsv.gz,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.indel.tsv.gz  --hgvsg --plugin pLI --plugin LOEUF -o lol
#+end_src
CSQ=G|missense_variant|MODERATE|AGT|ENSG00000135744|Transcript|ENST00000366667|protein_coding|2/5||||843|776|259|M/T|aTg/aCg|||-1||HGNC|HGNC:333||Ensembl||A|A||1:g.230710048A>G|0.347|-0.277922|
Correspond bien à https://www.ensembl.org/Homo_sapiens/Tools/VEP/Results?tl=I7ZsIbrj14P6lD43-9115494
***** DONE Use whole genome
CLOSED: [2023-04-29 Sat 15:46]
***** KILL Rename the chromosomes beforehand...
CLOSED: [2023-05-01 Mon 09:14] SCHEDULED: <2023-04-30 Sun>
Too slow!
- CADD download: 4h20
- renaming the chromosomes for the SNVs: 6h20
- tabix on the SNVs: job killed after 21h
***** DONE Annotate separately and merge the tables
CLOSED: [2023-05-07 Sun 14:45] SCHEDULED: <2023-05-01 Mon>
NB: we could filter CADD with tabix to restrict it to our variants
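One way to prepare such a restriction is to derive a regions list from our own VCF. The sketch below is untested against the pipeline: it assumes that the CADD files name chromosomes without the `chr` prefix and that the result would be written as tab-separated `CHROM POS` lines for `tabix -R`. `cadd_regions` and the sample records are hypothetical.

```python
def cadd_regions(vcf_lines):
    """Return sorted, de-duplicated (chrom, pos) pairs from VCF records."""
    regions = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and empty lines
        chrom, pos = line.split("\t")[:2]
        # Assumption: CADD uses "1" where our VCF may use "chr1".
        if chrom.startswith("chr"):
            chrom = chrom[3:]
        regions.add((chrom, int(pos)))
    return sorted(regions)


# Hypothetical records, modelled on the small test VCF used for spliceAI.
sample_vcf = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t69091\t.\tA\tC",
    "chr1\t9091\t.\tA\tC",
    "chr1\t9091\t.\tA\tG",
]
regions = cadd_regions(sample_vcf)
for chrom, pos in regions:
    print(f"{chrom}\t{pos}")
```

Whether this is faster than annotating the whole file depends on the number of variants, so it stays a suggestion rather than a measured improvement.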
**** DONE clinvar
CLOSED: [2023-04-22 Sat 15:31]
**** KILL Check HGVS results with mutalyzer
CLOSED: [2023-05-01 Mon 09:26]
**** HOLD Parallelisation
***** HOLD Per chromosome with the VEP workflow
https://github.com/Ensembl/ensembl-vep/blob/release/109/nextflow/workflows/run_vep.nf
***** HOLD With the --fork option
**** DONE Use the nf-core version of VEP
CLOSED: [2023-05-13 Sat 18:27] SCHEDULED: <2023-05-07 Sun>
**** DONE OMIM
CLOSED: [2023-05-08 Mon 15:02] SCHEDULED: <2023-05-01 Mon>
**** PROJ Grantham
**** PROJ ACMG incidental
**** TODO Gnomad ?
**** TODO VCF output (to get the allele fraction AF)
**** PROJ VCF -> tsv
**** PROJ Filter after VEP with filter_vep
Not tested
*** TODO Compare the annotations on 63003856
**** Rerun the new pipeline
*** HOLD Old version
**** TODO HGVS
**** TODO Filter after VEP
**** TODO OMIM
**** TODO clinvar
**** TODO ACMG incidental
**** TODO Grantham
**** KILL LRG
CLOSED: [2023-04-18 mar. 17:22] SCHEDULED: <2023-04-18 Tue>
Discussed with Alexis: no longer up to date
**** TODO Gnomad
** DONE Port Alexis's version exactly to Helios
CLOSED: [2023-01-14 Sat 17:56]
Branch "prod"
** KILL Test Alexis's version with Nix
CLOSED: [2023-06-14 Wed 22:37]
*** DONE Add clinvar
CLOSED: [2022-11-13 Sun 19:37]
*** DONE Alignment
CLOSED: [2022-11-13 Sun 12:52]
*** DONE Haplotype caller
CLOSED: [2022-11-13 Sun 13:00]
*** KILL Filter
CLOSED: [2023-06-14 Wed 22:37]
- [X] depth
- [ ] common snp not path
Problem with the list of IDs
**** KILL variant annotation
CLOSED: [2023-06-14 Wed 22:37]
Needs VEP
*** KILL Variant calling
CLOSED: [2023-06-14 Wed 22:37]
** KILL Test sarek
CLOSED: [2023-08-12 Sat 15:53]
#+begin_src sh
 module load apptainer/1.1.8
 nextflow run nf-core/sarek -profile test,singularity --outdir test-sarek
#+end_src
The dependencies do not download correctly, so we extract them by hand
#+begin_src sh
 rg -IN galaxyproject modules  | sed 's/ //g;s/:$//' | sort | uniq > deps.txt
#+end_src
 Manual cleanup
 Then
 #+begin_src sh
 cat deps.txt | xargs -L1 singularity pull
 #+end_src
** DONE Samplesheet support
CLOSED: [2023-08-03 Thu 14:24] SCHEDULED: <2023-08-03 Thu 13:00>
/Entered on/ [2023-08-03 Thu 13:12]
** DONE Small dataset: chr22 on HG001
CLOSED: [2023-08-05 Sat 14:21] SCHEDULED: <2023-08-05 Sat>
* Improvement :amelioration:
* Documentation
:PROPERTIES:
:CATEGORY: doc
:END:
** DONE Installation procedure for nix + dependencies on the CHU VM
CLOSED: [2023-04-22 Sat 15:27] SCHEDULED: <2023-04-13 Thu>
* Manuscript
:PROPERTIES:
:CATEGORY: manuscript
:END:
* Tests :tests:
** KILL Non-regression: prod version
CLOSED: [2023-05-23 Tue 08:46]
*** DONE ID common snp
CLOSED: [2022-11-19 Sat 21:36]
#+begin_src
$ wc -l ID_of_common_snp.txt
23194290 ID_of_common_snp.txt
$ wc -l /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp.txt
23194290 /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp.txt
#+end_src
*** DONE ID common snp not clinvar patho
CLOSED: [2022-12-11 Sun 20:11]
**** DONE Checking the problem
CLOSED: [2022-12-11 Sun 16:30]
On J:
21155134 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt.ref
"Non-regression" version
21155076 database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt
New version
23193391 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt
After removing the duplicates
$ sort database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt | uniq > old.txt
$ wc -l old.txt
21107097 old.txt
$ sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt | uniq > new.txt
$ wc -l new.txt
21174578 new.txt
$ sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt.ref | uniq > ref.txt
$ wc -l ref.txt
21107155 ref.txt
Looking at the difference
 comm -23 ref.txt old.txt
rs1052692
rs1057518973
rs1057518973
rs11074121
rs112848754
rs12573787
rs145033890
rs147889095
rs1553904159
rs1560294695
rs1560296615
rs1560310926
rs1560325547
rs1560342418
rs1560356225
rs1578287542
...
We look up the first one
bcftools query -i 'ID="rs1052692"' database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 1619351 C A,T
It is indeed pathogenic...
$ bcftools query -i 'POS=1619351' database/clinvar/clinvar.vcf.gz -f '%CHROM %POS %REF %ALT %INFO/CLNSIG\n'
19 1619351 C T Conflicting_interpretations_of_pathogenicity
We check all the others
$ comm -23 ref.txt old.txt > tocheck.txt
We generate the regions to check (chromosome number:position)
$ bcftools query -i 'ID=@tocheck.txt' database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM\t%POS\n' > tocheck.pos
We generate the inverse mapping (chromosome number -> NC)
$ awk ' { t = $1; $1 = $2; $2 = t; print; } ' database/RefSeq/refseq_to_number_only_consensual.txt  > mapping.txt
We remap clinvar
$ bcftools annotate --rename-chrs mapping.txt database/clinvar/clinvar.vcf.gz -o clinvar_remapped.vcf.gz
$ tabix clinvar_remapped.vcf.gz
Finally, we look up the classification in clinvar
$ bcftools query -R tocheck.pos clinvar_remapped.vcf.gz -f '%CHROM %POS %INFO/CLNSIG\n'
$ bcftools query -R tocheck.pos database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM %POS %ID \n' | grep '^NC'
#+RESULTS:
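The awk step above only swaps the two columns of the mapping file. For illustration, the same inversion can be sketched in Python, assuming the file holds two whitespace-separated columns (RefSeq accession, then chromosome number); `invert_mapping` and the sample lines are hypothetical.

```python
def invert_mapping(lines):
    """Turn 'NC_accession number' lines into a {number: NC_accession} dict."""
    inverse = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # ignore empty or incomplete lines
        refseq, number = fields[0], fields[1]
        inverse[number] = refseq
    return inverse


# Hypothetical excerpt of the RefSeq-to-number mapping.
sample_mapping = [
    "NC_000001.11 1",
    "NC_000019.10 19",
]
inverse = invert_mapping(sample_mapping)

# Lines in "old_name new_name" order, as expected by a chromosome-renaming file.
for number, refseq in inverse.items():
    print(f"{number} {refseq}")
```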
**** DONE Comprendre pourquoi la n

Replacement in projects/bisonex.org at line 70 [8.35]

B:BD[9.49049] → [9.49049:49184]

B:BD[9.49184] → [2.16549:17756]


   3 │ chr2:g.240719197G>C    60.0        77
   4 │ chr3:g.41227353G>C     60.0       105
   5 │ chr4:g.15536991T>G     60.0        41
   6 │ chr5:g.14474096G>A     60.0       191
   7 │ chr8:g.43122149C>T     60.0       237
   8 │ chr9:g.128603589A>C    60.0       304
   9 │ chr9:g.137452819G>C    60.0       107
  10 │ chr10:g.129957338T>C   60.0       116
  11 │ chr10:g.247389T>G      60.0        56
  12 │ chr11:g.61313668G>A    60.0        83
  13 │ chr12:g.45850467C>T    60.0       291
  14 │ chr14:g.64216315C>G    60.0       263
  15 │ chr15:g.60514655G>A    60.0       259
  16 │ chr17:g.61966475G>T    60.0       144
  17 │ chr17:g.7852503T>C     60.0       190
  18 │ chr19:g.13230158G>A    60.0       172
  19 │ chr19:g.38523211C>G    60.0        93
  20 │ chr19:g.4110557G>C     59.9929    425
  21 │ chr20:g.62334188G>A    60.0        62
  22 │ chrX:g.47575255G>A     60.0       244
  23 │ chrX:g.53409112G>A     60.0       136
**** TODO [#A] Insert everything into NA12878 with XAMscissors (XAMScissors up to date)
SCHEDULED: <2023-08-19 Sat>
***** PROJ Insertion
**** TODO simuscop data at 200x
SCHEDULED: <2023-08-22 Tue>
* Results
** TODO Speed-up BWA-mem
SCHEDULED: <2023-08-26 Sat>
** TODO Speed-up HaplotypeCaller
SCHEDULED: <2023-08-26 Sat>

[9.49049]


   3 │ chr2:g.240719197G>C    60.0        77
   4 │ chr3:g.41227353G>C     60.0       105
   5 │ chr4:g.15536991T>G     60.0        41
   6 │ chr5:g.14474096G>A     60.0       191
   7 │ chr8:g.43122149C>T     60.0       237
   8 │ chr9:g.128603589A>C    60.0       304
   9 │ chr9:g.137452819G>C    60.0       107
  10 │ chr10:g.129957338T>C   60.0       116
  11 │ chr10:g.247389T>G      60.0        56
  12 │ chr11:g.61313668G>A    60.0        83
  13 │ chr12:g.45850467C>T    60.0       291
  14 │ chr14:g.64216315C>G    60.0       263
  15 │ chr15:g.60514655G>A    60.0       259
  16 │ chr17:g.61966475G>T    60.0       144
  17 │ chr17:g.7852503T>C     60.0       190
  18 │ chr19:g.13230158G>A    60.0       172
  19 │ chr19:g.38523211C>G    60.0        93
  20 │ chr19:g.4110557G>C     59.9929    425
  21 │ chr20:g.62334188G>A    60.0        62
  22 │ chrX:g.47575255G>A     60.0       244
  23 │ chrX:g.53409112G>A     60.0       136
**** TODO [#A] Insert everything into NA12878 with XAMscissors (XAMScissors up to date)
SCHEDULED: <2023-08-19 Sat>
***** DONE Insertion
CLOSED: [2023-08-20 Sun 09:15]
***** DONE Check after haplotypecaller: 3 variants missing but ok
CLOSED: [2023-08-20 Sun 09:18] SCHEDULED: <2023-08-20 Sun>
3×3 DataFrame
 Row │ variant              meanQual  depth
     │ String               Float64   Int64
─────┼──────────────────────────────────────
   1 │ chr12:g.13720138C>T      60.0      1
   2 │ chr17:g.10296150T>A      60.0      1
   3 │ chr21:g.43426167C>T       0.0     59
Insufficient depth for two of them (rows 1 and 2) and poor quality for the third (row 3)
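The reasoning on this table can be written down as a small rule. This is only a sketch: the thresholds `min_depth` and `min_qual` are hypothetical values chosen for illustration, not the ones used by the pipeline, and `loss_reason` is a made-up helper.

```python
def loss_reason(mean_qual, depth, min_depth=10, min_qual=20.0):
    """Give a likely reason why an inserted variant was not called.

    min_depth and min_qual are illustrative thresholds (assumptions).
    """
    if depth < min_depth:
        return "low depth"
    if mean_qual < min_qual:
        return "low quality"
    return "unexplained"


# The three missing variants from the DataFrame above: (variant, meanQual, depth).
missing = [
    ("chr12:g.13720138C>T", 60.0, 1),
    ("chr17:g.10296150T>A", 60.0, 1),
    ("chr21:g.43426167C>T", 0.0, 59),
]
reasons = {variant: loss_reason(qual, depth) for variant, qual, depth in missing}
for variant, reason in reasons.items():
    print(variant, reason)
```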
***** DONE Check after filterdepth: 0 additional losses
CLOSED: [2023-08-20 Sun 09:18] SCHEDULED: <2023-08-20 Sun>
***** DONE Check after filterpolymorphis: 0 additional losses
CLOSED: [2023-08-20 Sun 09:18] SCHEDULED: <2023-08-20 Sun>
***** TODO Check after filter vep: 26 lost
SCHEDULED: <2023-08-20 Sun>
 Row │ variant               meanQual  depth
     │ String                Float64   Int64
─────┼───────────────────────────────────────
   1 │ chr1:g.183222115C>T    60.0       124
   2 │ chr1:g.39388062C>T     60.0       136
   3 │ chr2:g.240719197G>C    60.0        98
   4 │ chr3:g.41227353G>C     60.0        93
   5 │ chr4:g.15536991T>G     59.6584    161
   6 │ chr5:g.14474096G>A     60.0        96
   7 │ chr8:g.43122149C>T     60.0       134
   8 │ chr9:g.128603589A>C    60.0       104
   9 │ chr9:g.137452819G>C    60.0       156
  10 │ chr10:g.129957338T>C   60.0        67
  11 │ chr10:g.247389T>G      60.0        79
  12 │ chr11:g.61313668G>A    60.0        69
  13 │ chr12:g.45850467C>T    60.0        90
  14 │ chr14:g.58458545G>A    60.0        51
  15 │ chr14:g.64216315C>G    60.0       167
  16 │ chr15:g.60514655G>A    60.0       113
  17 │ chr15:g.66703292C>T    60.0        95
  18 │ chr17:g.61966475G>T    60.0        67
  19 │ chr17:g.7852503T>C     60.0        96
  20 │ chr19:g.13230158G>A    60.0       135
  21 │ chr19:g.38523211C>G    60.0       180
  22 │ chr19:g.4110557G>C     60.0       219
  23 │ chr20:g.62334188G>A    60.0        94
  24 │ chrX:g.24737739G>T     60.0        76
  25 │ chrX:g.47575255G>A     60.0       145
  26 │ chrX:g.53409112G>A     60.0       186
**** TODO simuscop data at 200x
SCHEDULED: <2023-08-22 Tue>
* Results
** TODO Speed-up BWA-mem
SCHEDULED: <2023-08-26 Sat>
** TODO Speed-up HaplotypeCaller
SCHEDULED: <2023-08-26 Sat>