apraga/org - Change T7GS3FVFOFBEFQXCLESYLY7JJ6D2ZVA2CAH5U4FRTKHTMPLINNXAC

Cleaing up

Created by Alexis Praga on September 11, 2023

T7GS3FVFOFBEFQXCLESYLY7JJ6D2ZVA2CAH5U4FRTKHTMPLINNXAC

Dependencies

In channels

main

Change contents

Replacement in projects/bisonex.org at line 5 [4.35]

B:BD[3.8221] → [3.8221:9005]

∅:D[3.9005] → [5.8221:8361]

B:BD[5.8221] → [5.8221:8361]

∅:D[5.8361] → [6.13895:16418]

B:BD[6.13895] → [6.13895:16418]

∅:D[6.16418] → [7.16442:16540]

B:BD[7.16442] → [7.16442:16540]

∅:D[7.16540] → [8.29384:35841]

B:BD[8.29384] → [8.29384:35841]

∅:D[8.35841] → [9.32797:32835]

B:BD[9.32797] → [9.32797:32835]

∅:D[10.8221] → [11.9704:9726]

∅:D[9.32835] → [11.9704:9726]

B:BD[11.9704] → [11.9704:9726]

∅:D[12.8221] → [13.27599:33921]

∅:D[11.9726] → [13.27599:33921]

B:BD[13.27599] → [13.27599:33921]

 Mon 23:29]
Ajout vérification checksum -> à vérifier
**** DONE VEP version 1.10
CLOSED: [2023-08-06 Sun 09:45] SCHEDULED: <2023-08-06 Sun>
**** DONE transcriptome (spip)
CLOSED: [2023-06-12 Mon 23:29]
Rajouter checksum manuel
**** KILL Refseq
**** KILL OMIM
CLOSED: [2023-06-12 Mon 23:29]
codé, à vérifier
**** KILL ACMG incidental
CLOSED: [2023-06-12 Mon 23:29]
*** TODO Données :T2T:
:PROPERTIES:
:ID:       5d915178-ca96-44ef-87f1-6702af114f2b
:END:
**** DONE fasta
CLOSED: [2023-06-12 Mon 23:30]
***** DONE compatibilité hg38
CLOSED: [2023-06-12 Mon 23:30]
**** DONE fasta index
CLOSED: [2023-06-13 Tue 00:07]
***** DONE compatibilité hg38
CLOSED: [2023-06-13 Tue 00:07]
**** DONE fasta dictionnaire
CLOSED: [2023-06-13 Tue 00:07]
**** DONE db
SNP
CLOSED: [2023-06-12 Mon 23:30]
***** DONE compatibilité hg38
CLOSED: [2023-06-12 Mon 23:30]
**** DONE commonSNP
CLOSED: [2023-06-1
4 Wed 22:32]
***** DONE compatibilité hg38
CLOSED: [2023-06-14 Wed 22:32]
cd /Work/Groups/bisonex/data/dbsnp/GRCh38.p14
❯ ga@mesointeractive GRCh38.p14]$ zgrep -c '^NC' dbSNP_common.vcf.gz
21340485
[apraga@mesointeractive GRCh38.p14]$ pwd
[apraga@mesointeractive GRCh38.p14]$ zgrep -c '^NC'
dbSNP_common.vcf.gz                     ID_of_common_snp_not_clinvar_patho.txt
dbSNP_common.vcf.gz.tbi                 ID_of_common_snp.txt
[apraga@mesointeractive dbsnp]$ cd chm13v2.0/
[apraga@mesointeractive chm13v2.0]$ ls
chm13v2.0_dbSNPv155.vcf.gz      dbSNP_common.vcf.gz.tbi                 versions.yml
chm13v2.0_dbSNPv155.vcf.gz.tbi  ID_of_common_snp_not_clinvar_patho.txt
dbSNP_common.vcf.gz             ID_of_common_snp.txt
[apraga@mesointeractive chm13v2.0]$ zgrep -c '^chr' dbSNP_common.vcf.gz
19433713
[apraga@mesointeractive chm13v2.0] $
❯ man tmux
**** DONE commonSNP non patho
CLOSED: [2023-06-14 Wed 22:35]
***** DONE compatibilité hg38
CLOSED: [2023-06-14 Wed 22:35]
**** DONE cache vep
CLOSED: [2023-06-30 Sun 14:20] SCHEDULED: <2023-07-25 Tue>
*** HOLD Processing bases de données
**** DONE dbSNP common
**** DONE Seulement les ID dans dbSNP common !
CLOSED: [2022-11-19 Sat 21:42]
172G au lieu de 253M...
**** HOLD common dbSNP not clinvar patho
***** DONE Conclusion partielle
CLOSED: [2022-12-12 Mon 22:25]
- vcfeval : prometteur mais n'arrive pas à traiter toutes les régions
- isec : trop de problèmes avec
- classif clinvar directement dans dbSNP: le plus simple
  Et ça permet de rattraper quelques erreurs dans le script d'Alexis
***** KILL Utiliser directement le numéro dbSNP dans clinvar ? Non
CLOSED: [2022-11-20 Sun 19:51]
Ex: chr20
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f 'rs%INFO/RS \n' -i 'INFO/RS != "." & INFO/CLNSIG="Pathogenic"' clinvar_chr20.vcf.gz | sort > ID_clinvar_patho.txt
bcftools query -f '%ID\n' dbSNP_common_chr20.vcf.gz | sort > ID_of_common_snp.txt
comm -23 ID_of_common_snp.txt ID_clinvar_patho.txt > ID_of_common_snp_not_clinvar_patho.txt
wc -l ID_of_common_snp_not_clinvar_patho.txt
# sort ID
#+end_src
#+RESULTS:
: 518846 ID_of_common_snp_not_clinvar_patho.txt
Version d'alexis
#+begin_src sh :dir ~/code/bisonex/test_isec
snp=dbSNP_common_chr20.vcf.gz
clinvar=clinvar_chr20_notremapped.vcf.gz
python ../script/pythonScript/clinvar_sbSNP.py \
    --clinvar $clinvar \
    --chrm_name_table ../database/RefSeq/refseq_to_number_only_consensual.txt \
    --dbSNP $snp --output prod
.txt
wc -l prod.txt
zgrep '^NC' dbSNP_common_chr20.vcf.gz | wc -l
#+end_src
#+RESULTS:
| 51
8832 | prod.txt |
| 518846 |          |
***** KILL classification clinvar codée dbSNP ?
CLOSED: [2022-12-04 Sun 14:38]
Sur le chromosome 20
*Attention* CLNSIG a plusieurs champs (séparé par une virgule)
On y accède avec INFO/CLNSIG[*]
Ensuite, chaque item peut avoir plusieurs haploïdie (séparé par un |). IL faut donc utiliser une regexp
NB: *ne pas mettre la condition* dans une variable !!
Pour avoir les clinvar patho, on veut 5 mais pas 255 (= autre) pour la classification !`
Il faut également les likely patho et conflicting
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f '%INFO/CLNSIG\n' dbSNP_common_chr20.vcf.gz -i \
'INFO/CLNSIG[*]~"^5|" | INFO/CLNSIG[*]=="5" | INFO/CLNSIG[*]~"|5" | INFO/CLNSIG[*]~"^4|" | INFO/CLNSIG[*]=="4" | INFO/CLNSIG[*]~"|4" | INFO/CLNSIG[*]~"^12|" | INFO/CLNSIG[*]=="12" | INFO/CLNSIG[*]~"|12"' | sort
#+end_src
#+RESULTS:
| . |  . | 12 |    |   |   |   |   |   |   |   |
| . | 12 |  0 |  2 |   |   |   |   |   |   |   |
| 2 |  3 |  2 |  2 | 2 | 5 | . |   |   |   |   |
| . |  2 |  3 |  2 | 2 | 4 |   |   |   |   |   |
| . |  . |  3 | 12 | 3 |   |   |   |   |   |   |
| . |  5 |  2 |  . |   |   |   |   |   |   |   |
| . |  . |  . |  5 | 2 | 2 |   |   |   |   |   |
| . |  9 |  9 |  9 | 5 | 5 | 2 | 3 | 2 | 3 | 2 |
Si on les exclut :
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f '%ID\n' dbSNP_common_chr20.vcf.gz -e \
'INFO/CLNSIG[*]~"^5|" | INFO/CLNSIG[*]=="5" | INFO/CLNSIG[*]~"|5" | INFO/CLNSIG[*]~"4" | INFO/CLNSIG[*]~"12"' | sort | uniq > common-notpatho.txt
#+end_src
#+RESULTS:
 #+begin_src sh :dir ~/code/bisonex/test_isec
snp=dbSNP_common_chr20.vcf.gz
clinvar=clinvar_chr20_notremapped.vcf.gz
python ../script/pythonScript/clinvar_sbSNP.py \
    --clinvar $clinvar \
    --chrm_name_table ../database/RefSeq/refseq_to_number_only_consensual.txt \
    --dbSNP $snp --output tmp.txt
sort tmp.txt | uniq > common-notpatho-alexis.txt
wc -l common-notpatho-alexis.txt
 #+end_src
 #+RESULTS:
 : 518832 common-notpatho-alexis.txt
On en a 6 de plus que la version d'Alexis mais quelques différences
Ceux d'Alexis qui manquent:
#+begin_src sh :dir ~/code/bisonex/test_isec
comm -23 common-notpatho-alexis.txt common-notpatho.txt > alexis-only.txt
cat alexis-only.txt
#+end_src
#+RESULTS:
| rs1064039  |
| rs3833341  |
| rs73598374 |
On les teste dans clinvar et dbSNP
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'ID=@alexis-only.txt' dbSNP_common_chr20.vcf.gz
bcftools query -f '%POS\n' -i 'ID=@alexis-only.txt' dbSNP_common_chr20.vcf.gz > alexis-only-pos.txt
while read  -r line; do
bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'POS='$line clinvar_chr20.vcf.gz
done < alexis-only-pos.txt
# bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'POS=23637790' clinvar_chr20.vcf.gz
#+end_src
#+RESULTS:
|   764018 | A | ACAGGTCAAT,ACAGGT | .,5     | 2,. |   |
| 23637790 | C | G,T               | .,.,12  |     |   |
| 44651586 | C | A,G,T             | .,.,.,5 |   2 | 2 |
|   764018 | A | ACAGGTCAAT        | Benign  |     |   |
| 23637790 | C | T                 | Benign  |     |   |
| 44651586 | C | T                 | Benign  |     |   |
On a donc une discordance entre clinvar et dbSNP.
On dirait qu'ils ont mal fait l'intersection avec clinvar.
Par exemple https://www.ncbi.nlm.nih.gov/snp/rs3833341#clinical_significance
Tu as l'impression qu'il y a un 1 clinvar bénin et 1 patho.
En cherchant par NM, tu vois qu'il est bénin sur clinvar car il y a d'autres soumissions ! https://www.ncbi.nlm.nih.gov/clinvar/variation/262235/
Confirmation sur nos bases de données :
$ bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'POS=764018' dbSNP_common_chr20.vcf.gz
764018 A ACAGGTCAAT,ACAGGT .,5|2,.
$ bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'POS=764018' clinvar_chr20.vcf.gz
764018 A ACAGGTCAAT Benign
***** KILL Corriger script alexi
CLOSED: [2022-12-04 Sun 13:03]
Gère clinvar patho, probablement patho ou conflicting !
***** HOLD Rtg tools
****** Test
1. Générer SDf file
   #+begin_src sh
rtg format genomeRef.fna  -o genomeRef.sdf
   #+end_src
2. Pour les bases de donnés, il faut l'option --sample ALT sinon on a
 #+begin_src
$ rtg vcfeval -b dbSNP_common.vcf.gz -c clinvar.vcf.gz -o test -t genomeRef.sdf/^C
VCF header does not contain a FORMAT field named GQ
Error: Record did not contain enough samples: NC_000001.11	10001	rs1570391677	A,C	.	PASS	RS=1570391677;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;R5;GNO;FREQ=KOREAN:0.9891,0.0109,.|SGDP_PRJ:0,1,.|dbGaP_PopFreq:1,.,0;COMMON
 #+end_src
 Essai intersection clinvar (patho ou non) dbSNP
   - faux négatif = dbSNP common qui ne sont pas dans clinvar
   - faux positif = clinvar qui ne sont pas dbSNP common
   - vrai positif = clinvar qui sont dans dbSNP common
   - vrai positif baseline = dbSNP common qui sont dans clinvar
 On calcule le nombre de lignes
 #+begin_src ssh
zgrep '^[^#]' /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz | wc -l
for i in *.vcf.gz; do echo $i; zgrep '^[^#]' $i | wc -l; done
 #+end_src
 | clinvar            |  1493470 |
 | fn.vcf.gz          | 22330220 |
 | fp.vcf.gz          |  1222529 |
 | tp-baseline.vcf.gz |   131040 |
 | tp.vcf.gz          |   136638 |
À noter qu'on ne retrouve pas tout clinvar...
1222529 + 131040 = 1353569 < 1493470
certains régions ne sont pas traitées :
#+begin_quote
Evaluation too complex (50002 unresolved paths, 34891 iterations) at reference region NC_000001.11:790930-790970. Variants in this region will not be included in results
#+end_quote
#+begin_src sh
grep 'not be included' vcfeval.log | wc -l
56192
#+end_src
Le total est quand même inférieur
On veut les clinvar non patho dans dbSNP soit les faux négatif (dbSNP common not contenu dans clinvar patho)
#+begin_src sh
bcftools filter -i 'INFO/CLNSIG="Pathogenic"' /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz -o /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar-patho.vcf.gz
tabix /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar-patho.vcf.gz
#+end_src
On lance le script (dbSNP common et clinvar = 9h)
#+begin_src sh
#!/bin/bash
#SBATCH --nodes=1
#SBATCH -p smp
#SBATCH --time=12:00:00
#SBATCH --mem=12G
dir=/Work/Groups/bisonex/data
dbSNP=$dir/dbSNP/GRCh38.p13/dbSNP_common.
vcf.gz
clinvar=$dir/clinvar/GRCh38/cl
invar-patho.vcf.gz
ge
nome=$dir/genome/GRCh38.p13/genomeRef.sdf
srun rtg vcfeval -b $dbSNP -c $clinvar -o common-not-patho -t $genome --sample ALT
#+end_src
****** HOLD Voir pour régions complexes non traitées
***** DONE bcftools isec : non
CLOSED: [2022-11-27 Sun 00:38]
#+begin_src sh
bcftools isec dbSNP_common.vcf.gz clinvar.vcf.gz -p common
#+end_src
On vérifie bien que les 2 fichiers commons on le même nombre de lignes
#+begin_src sh
$ grep -e '^NC'  0002.vcf | wc -l
74302
alex@gentoo ~/code/bisonex/data/common $ grep -e '^NC'  0003.vcf | wc -l
74302
#+end_src
****** DONE Impact option -n
CLOSED: [2022-10-23 Sun 13:56]
Mais en spécifiant -n =2:
#+begin_src sh
$ bedtools intersect -a  dbSNP_common.vcf.gz -b clinvar.vcf.gz
74978
#+end_src
Si on ne regarde que les variants, on retrouve bien 74302
#+begin_src sh
rg "^NC" none_sorted.vcf  | wc -l
#+end_src
NB : test fait avec
#+begin_src
bcftools isec dbSNP_common.vcf.gz clinvar.vcf.gz -c none -n =2 -w 1 | sort > none.vcf
sort common/0003.vcf > common/0003_sorted.vcf
comm -13 common/0003_sorted.vcf none_sorted.vcf
#+end_src
****** DONE Géstion des duplicates: -c none
CLOSED: [2022-10-23 Sun 13:56]
Si on ne garde que ceux avec REF et ALT identiques
#+begin_src sh
bcftools isec dbSNP_common.vcf.gz clinvar.vcf.gz -c none -n =2 -w 1 | wc -l
74978
#+end_src
Si on garde tout
#+begin_src sh
bcftools isec dbSNP_common.vcf.gz clinvar.vcf.gz -c all -n =2 -w 1 | wc -l
137777
#+end_src
Pour regarder la différence :
#+begin_src sh
bcftools isec dbSNP_common.vcf.gz clinvar.vcf.gz -c none -n =2 -w 1 | sort > none_sorted.vcf
bcftools isec dbSNP_common.vcf.gz clinvar.vcf.gz -c all -n =2 -w 1 | sort > all_sorted.vcf
comm -13 none_sorted.vcf all_sorted.vcf | head
#+end_src
Sur un exemple,on a bien des variants différents
****** DONE Suppression des clinvar patho
CLOSED: [2022-10-23 Sun 18:55]
Semble faire le travail vu que dbSNP_commo a 23194960 lignes (donc ~80 000 de moins)
 #+begin_src sh
$ bcftools isec -e 'INFO/CLNSIG="Pathogenic" & INFO/CLNSIG="Pathogenic/Likely_pathogenic"' -c none -n~10  dbSNP_common.vcf.gz clinvar.vcf.gz | wc -l
Note: -w option not given, printing list of sites...
23119984
 #+end_src
 Par contre, l'o'ption -w ou -p fait des ficher "data"...
Après un nouvel essai, plus de problème
#+begin_src
$ bcftools isec -e 'INFO/CLNSIG="Pathogenic" & INFO/CLNSIG="Pathogenic/Likely_pathogenic"' -c none -n=1 dbSNP_common.vcf.gz clinvar.vcf.gz -w 1 -o lol.vcf.gz
$ zcat lol.vcf.gz | wc -l
23120660
#+end_src
À noter le choix de l'option -n qui change entre "=1" et "~10"...
En effet "=1" = au moins 1 fichier et "~10" fait exactement dans le premier et non dans le second
#+begin_src
$ bcftools isec -e 'INFO/CLNSIG="Pathogenic" & INFO/CLNSIG="Pathogenic/Likely_pathogenic"' -c none -n~10 dbSNP_common.vcf.gz clinvar.vcf.gz -w 1 -o lol.vcf.gz
$ zcat lol.vcf.gz | wc -l
23120660
#+end_src
****** DONE Valider avec Alexis : bcftool isec
CLOSED: [2022-11-07 Mon 21:42   ]
****** DONE Pourquoi nombre de lignes différentes avec la version d'Alexis -> isec ne gère pas plusieurs ALT
CLOSED: [2022-11-26 Sat 23:36]
Grosse différence !
#+begin_src
$ wc -l ID_of_common_snp_not_clinvar_patho.txt
23119915 ID_of_common_snp_not_clinvar_patho.txt
$ wc -l /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt
85820 /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt
#+end_src
À noter que tout dbSNP = 23194960
******* Clinvar classe 4 ? Moins mais toujours trop
#+begin_src
$ zgrep '^NC' tmp.vcf.gz  | wc -l
21081654
#+end_src
******* Comparer les ID et regarder ceux en plus
#+begin_src sh
bcftools isec -e 'INFO/CLNSIG="Pathogenic"' -c none -n~10 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz -w 1 -o tmp.vcf.gz
zgrep -o -e 'rs[[:digit:]]\' tmp.vcf.gz | sort | id_sorted.txt
sort ../database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt  > reference_sorted.txt
comm -23 id_sorted.txt reference_sorted.txt > unique1.txt
#+end_src
Par exemple
#+begin_src sh
zgrep rs1000000561 ../database/dbSNP/dbSNP_common.vcf.gz
#+end_src
NC_000002.12	136732859	rs1000000561	ACG	A,ACGCG	.	PASS	RS=1000000561;dbSNPBuildID=151;SSR=0;VC=INDEL;GNO;FREQ=ALSPAC:0.2506,0.7494,.|TOMMO:0.9971,0.002865,.|TWINSUK:0.2473,0.7527,.|dbGaP_PopFreq:0.993,0.006943,8.902e-05;COMMON
Attention, clinvar est en numéro de chromosomoe et dbSNP en NC...
Normalement, géré lors du calcul d'intersection !
Ce SNP n'est pas dans clinvar (vérifié dans UCSC)
******* Tester sur chromosome 20
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools view --regions NC_000020.11 ../database/dbSNP/dbSNP_common.vcf.gz -o dbSNP_common_chr20.vcf.gz
bcftools view --regions 20 ../database/clinvar/clinvar.vcf.gz -o clinvar_chr20.vcf.gz
tabix dbSNP_common_chr20.vcf.gz
tabix clinvar_chr20.vcf.gz
#+end_src
#+RESULTS:
Attention à bien renommer clinvar !
#+begin_src sh :dir ~/code/bisonex/test_isec
mv clinvar_chr20.vcf.gz clinvar_chr20_notremapped.vcf.gz
bcftools annotate --rename-chrs chromosome_mapping.txt clinvar_chr20_notremapped.vcf.gz -o clinvar_chr20.vcf.gz
#+end_src
#+RESULTS:
*ATTENTION*: sans indexer les vcf, les fichiers seront *VIDES*
*ATTENTION*: par défaut les filtres s'appliquent sur les 2. Cela est un problème si on joue sur l'inclusion et non l'exclusion
Attention: vérifier la conventdion de nommage des chromosomes
******** Test pathogene: ne prend pas en compte les multi-allèles ????
On teste l'intersection dbsnp et clinvar patho ainsi que le complémentaire
#+begin_src sh :dir ~/code/bisonex/test_isec
clinvar=clinvar_chr20_patho.vcf.gz
snp=dbSNP_common_chr20.vcf.gz
bcftools index $clinvar
bcftools index $snp
bcftools filter -i 'INFO/CLNSIG="Pathogenic"' clinvar_chr20.vcf.gz -o $clinvar
bcftools isec  $snp $clinvar -p tmp
for i in tmp/*.vcf ; do echo $i; grep '^[^#]'  $i | wc -l; done
#+end_src
#+RESULTS:
| tmp/0000.vcf |
|       518846 |
| tmp/0001.vcf |
|            0 |
| tmp/0002.vcf |
|            0 |
| tmp/0003.vcf |
|            0 |
Aucun clinvar patho... Clairement faux !
Autre méthode : on inclut tous les SNP et clinvar patho et on regarde ceux uniq

[3.8221]

[13.33921]

 Mon 23:29]
Ajout vérification checksum -> à vérifier
**** DONE VEP version 1.10
CLOSED: [2023-08-06 Sun 09:45] SCHEDULED: <2023-08-06 Sun>
**** DONE transcriptome (spip)
CLOSED: [2023-06-12 Mon 23:29]
Rajouter checksum manuel
**** KILL Refseq
**** KILL OMIM
CLOSED: [2023-06-12 Mon 23:29]
codé, à vérifier
**** KILL ACMG incidental
CLOSED: [2023-06-12 Mon 23:29]
*** DONE Données :T2T:
CLOSED: [2023-09-10 Sun 16:45]
:PROPERTIES:
:ID:       5d915178-ca96-44ef-87f1-6702af114f2b
:END:
**** DONE fasta
CLOSED: [2023-06-12 Mon 23:30]
***** DONE compatibilité hg38
CLOSED: [2023-06-12 Mon 23:30]
**** DONE fasta index
CLOSED: [2023-06-13 Tue 00:07]
***** DONE compatibilité hg38
CLOSED: [2023-06-13 Tue 00:07]
**** DONE fasta dictionnaire
CLOSED: [2023-06-13 Tue 00:07]
**** DONE dbSNP
CLOSED: [2023-06-12 Mon 23:30]
***** DONE compatibilité hg38
CLOSED: [2023-06-12 Mon 23:30]
**** DONE commonSNP
CLOSED: [2023-06-14 Wed 22:32]
***** DONE compatibilité hg38
CLOSED: [2023-06-14 Wed 22:32]
cd /Work/Groups/bisonex/data/dbsnp/GRCh38.p14
❯ ga@mesointeractive GRCh38.p14]$ zgrep -c '^NC' dbSNP_common.vcf.gz
21340485
[apraga@mesointeractive GRCh38.p14]$ pwd
[apraga@mesointeractive GRCh38.p14]$ zgrep -c '^NC'
dbSNP_common.vcf.gz                     ID_of_common_snp_not_clinvar_patho.txt
dbSNP_common.vcf.gz.tbi                 ID_of_common_snp.txt
[apraga@mesointeractive dbsnp]$ cd chm13v2.0/
[apraga@mesointeractive chm13v2.0]$ ls
chm13v2.0_dbSNPv155.vcf.gz      dbSNP_common.vcf.gz.tbi                 versions.yml
chm13v2.0_dbSNPv155.vcf.gz.tbi  ID_of_common_snp_not_clinvar_patho.txt
dbSNP_common.vcf.gz             ID_of_common_snp.txt
[apraga@mesointeractive chm13v2.0]$ zgrep -c '^chr' dbSNP_common.vcf.gz
19433713
[apraga@mesointeractive chm13v2.0] $
❯ man tmux
**** DONE commonSNP non patho
CLOSED: [2023-06-14 Wed 22:35]
***** DONE compatibilité hg38
CLOSED: [2023-06-14 Wed 22:35]
**** DONE cache vep
CLOSED: [2023-06-30 Sun 14:20] SCHEDULED: <2023-07-25 Tue>
*** HOLD Processing bases de données
**** DONE dbSNP common
**** DONE Seulement les ID dans dbSNP common !
CLOSED: [2022-11-19 Sat 21:42]
172G au lieu de 253M...
**** HOLD common dbSNP not clinvar patho
***** DONE Conclusion partielle
CLOSED: [2022-12-12 Mon 22:25]
- vcfeval : prometteur mais n'arrive pas à traiter toutes les régions
- isec : trop de problèmes avec
- classif clinvar directement dans dbSNP: le plus simple
  Et ça permet de rattraper quelques erreurs dans le script d'Alexis
***** KILL Utiliser directement le numéro dbSNP dans clinvar ? Non
CLOSED: [2022-11-20 Sun 19:51]
Ex: chr20
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f 'rs%INFO/RS \n' -i 'INFO/RS != "." & INFO/CLNSIG="Pathogenic"' clinvar_chr20.vcf.gz | sort > ID_clinvar_patho.txt
bcftools query -f '%ID\n' dbSNP_common_chr20.vcf.gz | sort > ID_of_common_snp.txt
comm -23 ID_of_common_snp.txt ID_clinvar_patho.txt > ID_of_common_snp_not_clinvar_patho.txt
wc -l ID_of_common_snp_not_clinvar_patho.txt
# sort ID
#+end_src
#+RESULTS:
: 518846 ID_of_common_snp_not_clinvar_patho.txt
Version d'alexis
#+begin_src sh :dir ~/code/bisonex/test_isec
snp=dbSNP_common_chr20.vcf.gz
clinvar=clinvar_chr20_notremapped.vcf.gz
python ../script/pythonScript/clinvar_sbSNP.py \
    --clinvar $clinvar \
    --chrm_name_table ../database/RefSeq/refseq_to_number_only_consensual.txt \
    --dbSNP $snp --output prod.txt
wc -l prod.txt
zgrep '^NC' dbSNP_common_chr20.vcf.gz | wc -l
#+end_src
#+RESULTS:
| 518832 | prod.txt |
| 518846 |          |
***** KILL classification clinvar codée dbSNP ?
CLOSED: [2022-12-04 Sun 14:38]
Sur le chromosome 20
*Attention* CLNSIG a plusieurs champs (séparé par une virgule)
On y accède avec INFO/CLNSIG[*]
Ensuite, chaque item peut avoir plusieurs haploïdie (séparé par un |). IL faut donc utiliser une regexp
NB: *ne pas mettre la condition* dans une variable !!
Pour avoir les clinvar patho, on veut 5 mais pas 255 (= autre) pour la classification !`
Il faut également les likely patho et conflicting
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f '%INFO/CLNSIG\n' dbSNP_common_chr20.vcf.gz -i \
'INFO/CLNSIG[*]~"^5|" | INFO/CLNSIG[*]=="5" | INFO/CLNSIG[*]~"|5" | INFO/CLNSIG[*]~"^4|" | INFO/CLNSIG[*]=="4" | INFO/CLNSIG[*]~"|4" | INFO/CLNSIG[*]~"^12|" | INFO/CLNSIG[*]=="12" | INFO/CLNSIG[*]~"|12"' | sort
#+end_src
#+RESULTS:
| . |  . | 12 |    |   |   |   |   |   |   |   |
| . | 12 |  0 |  2 |   |   |   |   |   |   |   |
| 2 |  3 |  2 |  2 | 2 | 5 | . |   |   |   |   |
| . |  2 |  3 |  2 | 2 | 4 |   |   |   |   |   |
| . |  . |  3 | 12 | 3 |   |   |   |   |   |   |
| . |  5 |  2 |  . |   |   |   |   |   |   |   |
| . |  . |  . |  5 | 2 | 2 |   |   |   |   |   |
| . |  9 |  9 |  9 | 5 | 5 | 2 | 3 | 2 | 3 | 2 |
Si on les exclut :
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f '%ID\n' dbSNP_common_chr20.vcf.gz -e \
'INFO/CLNSIG[*]~"^5|" | INFO/CLNSIG[*]=="5" | INFO/CLNSIG[*]~"|5" | INFO/CLNSIG[*]~"4" | INFO/CLNSIG[*]~"12"' | sort | uniq > common-notpatho.txt
#+end_src
#+RESULTS:
 #+begin_src sh :dir ~/code/bisonex/test_isec
snp=dbSNP_common_chr20.vcf.gz
clinvar=clinvar_chr20_notremapped.vcf.gz
python ../script/pythonScript/clinvar_sbSNP.py \
    --clinvar $clinvar \
    --chrm_name_table ../database/RefSeq/refseq_to_number_only_consensual.txt \
    --dbSNP $snp --output tmp.txt
sort tmp.txt | uniq > common-notpatho-alexis.txt
wc -l common-notpatho-alexis.txt
 #+end_src
 #+RESULTS:
 : 518832 common-notpatho-alexis.txt
On en a 6 de plus que la version d'Alexis mais quelques différences
Ceux d'Alexis qui manquent:
#+begin_src sh :dir ~/code/bisonex/test_isec
comm -23 common-notpatho-alexis.txt common-notpatho.txt > alexis-only.txt
cat alexis-only.txt
#+end_src
#+RESULTS:
| rs1064039  |
| rs3833341  |
| rs73598374 |
On les teste dans clinvar et dbSNP
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'ID=@alexis-only.txt' dbSNP_common_chr20.vcf.gz
bcftools query -f '%POS\n' -i 'ID=@alexis-only.txt' dbSNP_common_chr20.vcf.gz > alexis-only-pos.txt
while read  -r line; do
bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'POS='$line clinvar_chr20.vcf.gz
done < alexis-only-pos.txt
# bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'POS=23637790' clinvar_chr20.vcf.gz
#+end_src
#+RESULTS:
|   764018 | A | ACAGGTCAAT,ACAGGT | .,5     | 2,. |   |
| 23637790 | C | G,T               | .,.,12  |     |   |
| 44651586 | C | A,G,T             | .,.,.,5 |   2 | 2 |
|   764018 | A | ACAGGTCAAT        | Benign  |     |   |
| 23637790 | C | T                 | Benign  |     |   |
| 44651586 | C | T                 | Benign  |     |   |
On a donc une discordance entre clinvar et dbSNP.
On dirait qu'ils ont mal fait l'intersection avec clinvar.
Par exemple https://www.ncbi.nlm.nih.gov/snp/rs3833341#clinical_significance
Tu as l'impression qu'il y a un 1 clinvar bénin et 1 patho.
En cherchant par NM, tu vois qu'il est bénin sur clinvar car il y a d'autres soumissions ! https://www.ncbi.nlm.nih.gov/clinvar/variation/262235/
Confirmation sur nos bases de données :
$ bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'POS=764018' dbSNP_common_chr20.vcf.gz
764018 A ACAGGTCAAT,ACAGGT .,5|2,.
$ bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'POS=764018' clinvar_chr20.vcf.gz
764018 A ACAGGTCAAT Benign
***** KILL Corriger script alexi
CLOSED: [2022-12-04 Sun 13:03]
Gère clinvar patho, probablement patho ou conflicting !
***** HOLD Rtg tools
****** Test
1. Générer SDf file
   #+begin_src sh
rtg format genomeRef.fna  -o genomeRef.sdf
   #+end_src
2. Pour les bases de donnés, il faut l'option --sample ALT sinon on a
 #+begin_src
$ rtg vcfeval -b dbSNP_common.vcf.gz -c clinvar.vcf.gz -o test -t genomeRef.sdf/^C
VCF header does not contain a FORMAT field named GQ
Error: Record did not contain enough samples: NC_000001.11	10001	rs1570391677	A,C	.	PASS	RS=1570391677;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;R5;GNO;FREQ=KOREAN:0.9891,0.0109,.|SGDP_PRJ:0,1,.|dbGaP_PopFreq:1,.,0;COMMON
 #+end_src
 Essai intersection clinvar (patho ou non) dbSNP
   - faux négatif = dbSNP common qui ne sont pas dans clinvar
   - faux positif = clinvar qui ne sont pas dbSNP common
   - vrai positif = clinvar qui sont dans dbSNP common
   - vrai positif baseline = dbSNP common qui sont dans clinvar
 On calcule le nombre de lignes
 #+begin_src ssh
zgrep '^[^#]' /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz | wc -l
for i in *.vcf.gz; do echo $i; zgrep '^[^#]' $i | wc -l; done
 #+end_src
 | clinvar            |  1493470 |
 | fn.vcf.gz          | 22330220 |
 | fp.vcf.gz          |  1222529 |
 | tp-baseline.vcf.gz |   131040 |
 | tp.vcf.gz          |   136638 |
À noter qu'on ne retrouve pas tout clinvar...
1222529 + 131040 = 1353569 < 1493470
certains régions ne sont pas traitées :
#+begin_quote
Evaluation too complex (50002 unresolved paths, 34891 iterations) at reference region NC_000001.11:790930-790970. Variants in this region will not be included in results
#+end_quote
#+begin_src sh
grep 'not be included' vcfeval.log | wc -l
56192
#+end_src
Le total est quand même inférieur
On veut les clinvar non patho dans dbSNP soit les faux négatif (dbSNP common not contenu dans clinvar patho)
#+begin_src sh
bcftools filter -i 'INFO/CLNSIG="Pathogenic"' /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz -o /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar-patho.vcf.gz
tabix /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar-patho.vcf.gz
#+end_src
On lance le script (dbSNP common et clinvar = 9h)
#+begin_src sh
#!/bin/bash
#SBATCH --nodes=1
#SBATCH -p smp
#SBATCH --time=12:00:00
#SBATCH --mem=12G
dir=/Work/Groups/bisonex/data
dbSNP=$dir/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz
clinvar=$dir/clinvar/GRCh38/clinvar-patho.vcf.gz
genome=$dir/genome/GRCh38.p13/genomeRef.sdf
srun rtg vcfeval -b $dbSNP -c $clinvar -o common-not-patho -t $genome --sample ALT
#+end_src
****** HOLD Voir pour régions complexes non traitées
***** DONE bcftools isec : non
CLOSED: [2022-11-27 Sun 00:38]
#+begin_src sh
bcftools isec dbSNP_common.vcf.gz clinvar.vcf.gz -p common
#+end_src
On vérifie bien que les 2 fichiers commons on le même nombre de lignes
#+begin_src sh
$ grep -e '^NC'  0002.vcf | wc -l
74302
alex@gentoo ~/code/bisonex/data/common $ grep -e '^NC'  0003.vcf | wc -l
74302
#+end_src
****** DONE Impact option -n
CLOSED: [2022-10-23 Sun 13:56]
Mais en spécifiant -n =2:
#+begin_src sh
$ bedtools intersect -a  dbSNP_common.vcf.gz -b clinvar.vcf.gz
74978
#+end_src
Si on ne regarde que les variants, on retrouve bien 74302
#+begin_src sh
rg "^NC" none_sorted.vcf  | wc -l
#+end_src
NB : test fait avec
#+begin_src
bcftools isec dbSNP_common.vcf.gz clinvar.vcf.gz -c none -n =2 -w 1 | sort > none.vcf
sort common/0003.vcf > common/0003_sorted.vcf
comm -13 common/0003_sorted.vcf none_sorted.vcf
#+end_src
****** DONE Géstion des duplicates: -c none
CLOSED: [2022-10-23 Sun 13:56]
Si on ne garde que ceux avec REF et ALT identiques
#+begin_src sh
bcftools isec dbSNP_common.vcf.gz clinvar.vcf.gz -c none -n =2 -w 1 | wc -l
74978
#+end_src
Si on garde tout
#+begin_src sh
bcftools isec dbSNP_common.vcf.gz clinvar.vcf.gz -c all -n =2 -w 1 | wc -l
137777
#+end_src
Pour regarder la différence :
#+begin_src sh
bcftools isec dbSNP_common.vcf.gz clinvar.vcf.gz -c none -n =2 -w 1 | sort > none_sorted.vcf
bcftools isec dbSNP_common.vcf.gz clinvar.vcf.gz -c all -n =2 -w 1 | sort > all_sorted.vcf
comm -13 none_sorted.vcf all_sorted.vcf | head
#+end_src
Sur un exemple,on a bien des variants différents
****** DONE Suppression des clinvar patho
CLOSED: [2022-10-23 Sun 18:55]
Semble faire le travail vu que dbSNP_commo a 23194960 lignes (donc ~80 000 de moins)
 #+begin_src sh
$ bcftools isec -e 'INFO/CLNSIG="Pathogenic" & INFO/CLNSIG="Pathogenic/Likely_pathogenic"' -c none -n~10  dbSNP_common.vcf.gz clinvar.vcf.gz | wc -l
Note: -w option not given, printing list of sites...
23119984
 #+end_src
 Par contre, l'o'ption -w ou -p fait des ficher "data"...
Après un nouvel essai, plus de problème
#+begin_src
$ bcftools isec -e 'INFO/CLNSIG="Pathogenic" & INFO/CLNSIG="Pathogenic/Likely_pathogenic"' -c none -n=1 dbSNP_common.vcf.gz clinvar.vcf.gz -w 1 -o lol.vcf.gz
$ zcat lol.vcf.gz | wc -l
23120660
#+end_src
À noter le choix de l'option -n qui change entre "=1" et "~10"...
En effet "=1" = au moins 1 fichier et "~10" fait exactement dans le premier et non dans le second
#+begin_src
$ bcftools isec -e 'INFO/CLNSIG="Pathogenic" & INFO/CLNSIG="Pathogenic/Likely_pathogenic"' -c none -n~10 dbSNP_common.vcf.gz clinvar.vcf.gz -w 1 -o lol.vcf.gz
$ zcat lol.vcf.gz | wc -l
23120660
#+end_src
****** DONE Valider avec Alexis : bcftool isec
CLOSED: [2022-11-07 Mon 21:42   ]
****** DONE Pourquoi nombre de lignes différentes avec la version d'Alexis -> isec ne gère pas plusieurs ALT
CLOSED: [2022-11-26 Sat 23:36]
Grosse différence !
#+begin_src
$ wc -l ID_of_common_snp_not_clinvar_patho.txt
23119915 ID_of_common_snp_not_clinvar_patho.txt
$ wc -l /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt
85820 /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt
#+end_src
À noter que tout dbSNP = 23194960
******* Clinvar classe 4 ? Moins mais toujours trop
#+begin_src
$ zgrep '^NC' tmp.vcf.gz  | wc -l
21081654
#+end_src
******* Comparer les ID et regarder ceux en plus
#+begin_src sh
bcftools isec -e 'INFO/CLNSIG="Pathogenic"' -c none -n~10 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz -w 1 -o tmp.vcf.gz
zgrep -o -e 'rs[[:digit:]]\' tmp.vcf.gz | sort | id_sorted.txt
sort ../database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt  > reference_sorted.txt
comm -23 id_sorted.txt reference_sorted.txt > unique1.txt
#+end_src
Par exemple
#+begin_src sh
zgrep rs1000000561 ../database/dbSNP/dbSNP_common.vcf.gz
#+end_src
NC_000002.12	136732859	rs1000000561	ACG	A,ACGCG	.	PASS	RS=1000000561;dbSNPBuildID=151;SSR=0;VC=INDEL;GNO;FREQ=ALSPAC:0.2506,0.7494,.|TOMMO:0.9971,0.002865,.|TWINSUK:0.2473,0.7527,.|dbGaP_PopFreq:0.993,0.006943,8.902e-05;COMMON
Attention, clinvar est en numéro de chromosomoe et dbSNP en NC...
Normalement, géré lors du calcul d'intersection !
Ce SNP n'est pas dans clinvar (vérifié dans UCSC)
******* Tester sur chromosome 20
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools view --regions NC_000020.11 ../database/dbSNP/dbSNP_common.vcf.gz -o dbSNP_common_chr20.vcf.gz
bcftools view --regions 20 ../database/clinvar/clinvar.vcf.gz -o clinvar_chr20.vcf.gz
tabix dbSNP_common_chr20.vcf.gz
tabix clinvar_chr20.vcf.gz
#+end_src
#+RESULTS:
Attention à bien renommer clinvar !
#+begin_src sh :dir ~/code/bisonex/test_isec
mv clinvar_chr20.vcf.gz clinvar_chr20_notremapped.vcf.gz
bcftools annotate --rename-chrs chromosome_mapping.txt clinvar_chr20_notremapped.vcf.gz -o clinvar_chr20.vcf.gz
#+end_src
#+RESULTS:
*ATTENTION*: sans indexer les vcf, les fichiers seront *VIDES*
*ATTENTION*: par défaut les filtres s'appliquent sur les 2. Cela est un problème si on joue sur l'inclusion et non l'exclusion
Attention: vérifier la conventdion de nommage des chromosomes
******** Test pathogene: ne prend pas en compte les multi-allèles ????
On teste l'intersection dbsnp et clinvar patho ainsi que le complémentaire
#+begin_src sh :dir ~/code/bisonex/test_isec
clinvar=clinvar_chr20_patho.vcf.gz
snp=dbSNP_common_chr20.vcf.gz
bcftools index $clinvar
bcftools index $snp
bcftools filter -i 'INFO/CLNSIG="Pathogenic"' clinvar_chr20.vcf.gz -o $clinvar
bcftools isec  $snp $clinvar -p tmp
for i in tmp/*.vcf ; do echo $i; grep '^[^#]'  $i | wc -l; done
#+end_src
#+RESULTS:
| tmp/0000.vcf |
|       518846 |
| tmp/0001.vcf |
|            0 |
| tmp/0002.vcf |
|            0 |
| tmp/0003.vcf |
|            0 |
Aucun clinvar patho... Clairement faux !
Autre méthode : on inclut tous les SNP et clinvar patho et on regarde ceux uniq

Replacement in projects/bisonex.org at line 21 [4.35]

B:BD[14.148] → [2.29:16630]

∅:D[2.16630] → [15.8303:8389]

B:BD[15.8303] → [15.8303:8389]

∅:D[15.8389] → [16.8221:8415]

B:BD[16.8221] → [16.8221:8415]

∅:D[16.8415] → [17.8221:9203]

B:BD[17.8221] → [17.8221:9203]

∅:D[17.9203] → [18.16588:17013]

B:BD[18.16588] → [18.16588:17013]

∅:D[18.17013] → [19.8308:8437]

B:BD[19.8308] → [19.8308:8437]

∅:D[19.8437] → [14.16413:17458]

B:BD[14.16413] → [14.16413:17458]

∅:D[14.17458] → [20.24605:26348]

B:BD[20.24605] → [20.24605:26348]

∅:D[20.26348] → [21.16453:16548]

B:BD[21.16453] → [21.16453:16548]

∅:D[21.16548] → [22.39246:41826]

B:BD[22.39246] → [22.39246:41826]

∅:D[22.41826] → [23.32838:33058]

B:BD[23.32838] → [23.32838:33058]

∅:D[23.33058] → [24.15292:15768]

B:BD[24.15292] → [24.15292:15768]

 uniquements dans GRCh38
#+begin_src sh :dir ~/Downloads
comm -13 t2t-genes.txt hg38-genes.txt | wc -l
#+end_src
#+RESULTS:
: 12621
*** TODO OMIM sur nom du gène :annotation:
*** TODO Mobidetails API
*** TODO Franklin API
https://www.postman.com/genoox-ps/workspace/franklin-api-documentation-s-public-workspace/documentation/6621518-4335389d-12e3-445f-8182-339df95b2a09
*** TODO Regarder si clinique disponible avec vep :annotation:
** TODO [#B] Indicateurs qualité :qualité:
*** Idée
Raredisease:
- FastQC : nombreuses statistiques. Non disponible Nix
- Mosdepth : calcule la profondeur (2x plus rapide que samtools depth). Nix
- MultiQC : fusionne juste les résultats des analyses. Non disponible nix
- Picard's CollectMutipleMetrics, CollectHsMetrics, and CollectWgsMetrics
- Qualimap : alternative fastqc ? Non disponible nix
- Sentieon's WgsMetricsAlgo : propriétaire
- TIDDIT's cov : TIDIT = remaninement chromosomique
Sarek:
- alignment statistics : samtools stats, mosdepth
- QC : MultiQC
MultiQC : non disponible Nix
*** DONE FastqQC
CLOSED: [2023-08-15 Tue 21:43] SCHEDULED: <2023-08-13 Sun>
*** DONE Mosdepth
CLOSED: [2023-08-15 Tue 21:43] SCHEDULED: <2023-08-13 Sun>
Pour exomple, il faut le fichier de capture
subworkflows/local/bam_markduplicates/
*** DONE Samtools stats
CLOSED: [2023-08-15 Tue 21:43] SCHEDULED: <2023-08-13 Sun>
*** DONE [#B] Compte-redu exécution avec MultiQC
CLOSED: [2023-08-15 Tue 21:43] SCHEDULED: <2023-08-13 Sun>
*** DONE Résultats sur NA12878 : 98% à 20x
CLOSED: [2023-08-19 Sat 20:45] SCHEDULED: <2023-08-17 Thu>
**** DONE Comprendre 91% à 20x seulement: SNVs inséré
CLOSED: [2023-08-18 Fri 22:25]
***** DONE Tester autre kit : Twist exome comprehensive
CLOSED: [2023-08-18 Fri 22:24]
Moins bon
***** DONE Tester génome sans alt
CLOSED: [2023-08-18 Fri 22:25]
Idem
***** DONE Tester NA12878 sans SNVs inséré: cause !!
CLOSED: [2023-08-18 Fri 22:25]
***** DONE Tester hg19 sur NA12878 non inséré
CLOSED: [2023-08-18 Fri 22:25]
**** DONE Comprendre pourquoi SNVs diminuent le score: reads manquants
CLOSED: [2023-08-19 Sat 20:34] SCHEDULED: <2023-08-18 Fri>
Voir [[id:5c1c36f3-f68e-4e6d-a7b6-61dca89abc37][Bug: perte de nombreux reads avec NA12878]]
*** DONE Relancer résultats avec NA1287 et NA12878 + sanger
CLOSED: [2023-08-29 Tue 10:30] SCHEDULED: <2023-08-29 Tue>
*** DONE Comparer avec hg19
CLOSED: [2023-08-28 Mon 17:22] SCHEDULED: <2023-08-20 Sun>
*** DONE Comparer avec autres kit de capture
CLOSED: [2023-08-28 Mon 17:22] SCHEDULED: <2023-08-20 Sun>
*** DONE Comparer avec no-alt
CLOSED: [2023-08-28 Mon 17:22] SCHEDULED: <2023-08-20 Sun>
** HOLD vérifier si normalisation
** KILL [#B] Vérification nomenclature hgvs :hgvs:
CLOSED: [2023-08-16 Wed 19:07] SCHEDULED: <2023-08-15 Tue>
*** KILL mutalyzer
CLOSED: [2023-08-16 Wed 19:07] SCHEDULED: <2023-08-13 Sun>
*** KILL API variantvalidator
CLOSED: [2023-08-16 Wed 19:07] SCHEDULED: <2023-08-13 Sun>
** DONE Exécution
CLOSED: [2022-09-13 Tue 21:37]
*** KILL test Bionix
*** KILL Implémenter execution avec Nix ?
Voir https://academic.oup.com/gigascience/article/9/11/giaa121/5987272?login=false
pour un exemple.
Probablement plus simple d’utiliser Nix pour gestion de l’environnement et snakemake pour l’exécution
Pas d’accès internet depuis le cluster
*** DONE nextflow
CLOSED: [2022-09-13 Tue 21:37]
**** TODO Bug scheduler SGE
Le job se fait tuer car l'utilisateur n'est pas passé correctement à nextflow
***** DONE Forcer l'utilisateur à l'exécution
CLOSED: [2023-04-01 Sat 17:57]
NXF_OPTS=-D"user.name=alex"
***** DONE Vérifier si le problème persiste avec 22.10.6
CLOSED: [2023-04-01 Sat 18:38] SCHEDULED: <2023-04-01 Sat>
oui
***** KILL Packager l'utilisateur dans le programme ?
Mauvaise idée..
** TODO Preprocessing avec nextflow
*** TODO Map to reference
**** TODO Sample ID dans header
/Work/Users/apraga/bisonex/out/63003856_S135/preprocessing/baserecalibrator
*** DONE Mark duplicate
CLOSED: [2022-10-09 Sun 22:30]
*** DONE Recalibrate base quality score
CLOSED: [2022-10-09 Sun 22:30]
** DONE Variant calling avec Nextflow
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Haplotype caller
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter variants
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter common snp not clinvar path
CLOSED: [2022-11-07 Mon 23:00]
Voir [[*common dbSNP not clinvar patho][common dbSNP not clinvar patho]]
*** DONE Filter variant only in consensual sequence
CLOSED: [2022-11-08 Tue 22:23]
*** DONE Filter technical variants
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Utilise AVX pour accélerer l'exécution
CLOSED: [2023-04-29 Sat 15:46]
Sans cela, on a l'avertissement
#+begin_quote
17:28:00.720 INFO  PairHMM - OpenMP multi-threaded AVX-accelerated native PairHMM implementation is not supported
17:28:00.721 INFO  NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/nix/store/cy9ckxqwrkifx7wf02hm4ww1p6lnbxg9-gatk-4.2.4.1/bin/gatk-package-4.2.4.1-local.jar!/com/intel/gkl/native/libgkl_utils.so
17:28:00.733 WARN  NativeLibraryLoader - Unable to load libgkl_utils.so from native/libgkl_utils.so (/Work/Users/apraga/bisonex/out/NA12878_NIST7035/preprocessing/applybqsr/libgkl_utils821485189051585397.so: libgomp.so.1: cannot open shared object file: No such file or directory)
17:28:00.733 WARN  IntelPairHmm - Intel GKL Utils not loaded
17:28:00.733 WARN  PairHMM - ***WARNING: Machine does not have the AVX instruction set support needed for the accelerated AVX PairHmm. Falling back to the MUCH slower LOGLESS_CACHING implementation!
17:28:00.763 INFO  ProgressMeter - Starting traversal
#+end_quote
libgomp.so est fourni par gcc donc il faut charger le module
 module load gcc@11.3.0/gcc-12.1.0
** KILL Utiliser subworkflow
CLOSED: [2023-04-02 Sun 18:08]
Notre version permet d'être plus souple
*** KILL Alignement
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
*** KILL Vep
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
vcf_annotate_ensemblvep
** TODO Annotation avec nextflow :annotation:
*** KILL VEP : --gene-phenotype ?
CLOSED: [2023-04-18 mar. 18:32]
Vu avec alexis : bases de données non à jour
https://www.ensembl.org/info/genome/variation/phenotype/sources_phenotype_documentation.html
*** DONE plugin VEP
CLOSED: [2023-04-18 mar. 18:32]
Cloner dépôt git avec plugin
Puis utiliser --dir_plugins
*** HOLD Utiliser code d’Alexis
*** TODO Nouvelle version avec VEP
Example avec --custom
https://www.ensembl.org/info/docs/tools/vep/script/vep_custom.html
**** DONE Ajout spliceAI
CLOSED: [2023-05-18 Thu 11:02] SCHEDULED: <2023-04-30 Sun>
plugin VEP
***** DONE Télécharger les données
CLOSED: [2023-05-11 Thu 19:01]
Difficile d'automatiser, le lien est temporaire...
***** DONE PLugin
CLOSED: [2023-05-11 Thu 20:16]
***** DONE Séparer score en plusieurs colonnes
CLOSED: [2023-05-11 Thu 20:16]
Test avec ce fichier pour avoir une ligne avec annotation et une ligne sans
#CHROM	POS	ID	REF	ALT
1	9091	.	A	C
1	69091	.	A	C
et
#+begin_src sh
rm -f postvep.tsv* && vep -i testspliceai.vcf.gz -o postvep.tsv --tab  --dir 109 --merged --pick --use_given_ref   --offline  --plugin SpliceAI,snv=spliceai_scores.raw.snv.hg38.vcf.gz,indel=spliceai_scores.raw.indel.hg38.vcf.gz
#+end_src
#+begin_src
$ bgzip postvep.tsv
$ python spliceai.py
$ cat postvep2.tsv
,variation,Location,Allele,Gene,Feature,Feature_type,Consequence,cDNA_position,CDS_position,Protein_position,Amino_acids,Codons,Existing_variation,IMPACT,DISTANCE,STRAND,FLAGS,REFSEQ_MATCH,SOURCE,REFSEQ_OFFSET,SpliceAI_AG,SpliceAI_AL,SpliceAI_DG,SpliceAI_DL
0,1_9091_A/C,1:9091,C,ENSG00000290825,ENST00000456328,Transcript,upstream_gene_variant,-,-,-,-,-,-,MODIFIER,2778,1,-,-,Ensembl,-,,,,
1,1_69091_A/C,1:69091,C,ENSG00000186092,ENST00000641515,Transcript,missense_variant,124,64,22,M/L,Atg/Ctg,-,MODERATE,-,1,-,-,Ensembl,-,0.01,0.00,0.00,0.01
#+end_src
Test
cp work/bf/437ae511958509e43072f032f4d495/small.tab.gz tests/vep-spip.tab.gz
cp work/d5/3b1244b5ae83d54409ee0d456e8c55/small_cadd.tab.gz tests/vep-cadd-splice.tab.gz
**** HOLD Package Nix spliceAI ?
nix profile install nixpkgs#python3Packages.tensorflow
+ ajouter dépendencs ("grep import" ou cnad)
**** TODO Ajout LOEUF et pli
plugin VEP
**** TODO NMD
plugin VEP
**** KILL Ajout LOEUF
CLOSED: [2023-04-19 mer. 16:32]
plugin VEP
**** DONE Spip
CLOSED: [2023-05-01 Mon 23:07] SCHEDULED: <2023-04-30 Sun>
BED ne semble pas bien marcher (il faut définir une zone)
VCF : trop d’information
Attention, plusieurs transcripts mais résultats identiques. On supprimer les doublons
***** DONE interpretation + score + intervalle de confiance séparé
CLOSED: [2023-05-01 Mon 23:07] SCHEDULED: <2023-04-30 Sun>
Tests :
dans tests/
vep -i 63004925-small.vcf -o postvep.vcf --vcf --fasta genomeRef.fna --dir 109 --merged --pick  --offline --custom ../script/spip_annotation.vcf.gz,SPIP,vcf,exact,0,spipInterp,spipScore,spipConfidence
***** DONE Score
CLOSED: [2023-04-22 Sat 15:30]
**** DONE CADD: remplacer par plugin VEP
CLOSED: [2023-05-07 Sun 14:45] SCHEDULED: <2023-05-07 Sun>
***** Test
#+begin_src
vep  -i test.vcf  -o lol.vcf --offline --dir  /Work/Projects/bisonex/data/vep/GRCh38/ --merged --vcf --fasta /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna --plugin CADD,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.snv.tsv.gz,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.indel.tsv.gz  --dir_plugins ../VEP_plugins/ -v
#+end_src
Test
#+begin_src sh
vep --id "1  230710048 230710048 A/G 1"   --offline --dir  /Work/Projects/bisonex/data/vep/GRCh38/ --merged --vcf --fasta /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna --plugin CADD,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.snv.tsv.gz,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.indel.tsv.gz  --hgvsg --plugin pLI --plugin LOEUF -o lol
#+end_src
CSQ=G|missense_variant|MODERATE|AGT|ENSG00000135744|Transcript|ENST00000366667|protein_coding|2/5||||843|776|259|M/T|aTg/aCg|||-1||HGNC|HGNC:333||Ensembl||A|A||1:g.230710048A>G|0.347|-0.277922|
Correspond bien à https://www.ensembl.org/Homo_sapiens/Tools/VEP/Results?tl=I7ZsIbrj14P6lD43-9115494
***** DONE Utiliser whole genome
CLOSED: [2023-04-29 Sat 15:46]
***** KILL Renommer les chromosome avant ...
CLOSED: [2023-05-01 Mon 09:14] SCHEDULED: <2023-04-30 Sun>
Trop long !
- Téléchargement de CADD: 4h20
- renommer les chromosome pour SNV : 6h20
- tabix sur les SNV : job tué au bout de 21h....
***** DONE annoter séparément et fusionner les tableaux
CLOSED: [2023-05-07 Sun 14:45] SCHEDULED: <2023-05-01 Mon>
NB: on pourrait filtrer CADD avec tabix pour se restreindre à nos variants
**** DONE clinvar
CLOSED: [2023-04-22 Sat 15:31]
**** KILL Vérifier résultats HGVS avec mutalyzer
CLOSED: [2023-05-01 Mon 09:26]
**** HOLD Parallélisation
***** HOLD par chromosome avec workflow VEP
https://github.com/Ensembl/ensembl-vep/blob/release/109/nextflow/workflows/run_vep.nf
***** HOLD Avec option --fork
**** DONE Utiliser la version de nf-core de VEP
CLOSED: [2023-05-13 Sat 18:27] SCHEDULED: <2023-05-07 Sun>
**** DONE OMIM
CLOSED: [2023-08-31 Thu 10:38] SCHEDULED: <2023-08-29 Tue>
**** DONE plI et LOEUF depuis gnomad
CLOSED: [2023-08-31 Thu 10:38] SCHEDULED: <2023-08-29 Tue>
**** DONE Grantham
CLOSED: [2023-08-31 Thu 22:08] SCHEDULED: <2023-08-30 Wed>
**** DONE Corriger spliceAI
CLOSED: [2023-08-31 Thu 13:51] SCHEDULED: <2023-08-31 Thu>
Pas d'annotation
- chromosome ? essai 1 au lieu de chr1 : idem. Et fonctionne pour CADD
- index ?
  - retélécharger
  - indexer nous-meme
**** DONE Supprimer score spip en double
CLOSED: [2023-08-31 Thu 14:17] SCHEDULED: <2023-08-31 Thu>
**** DONE Vérifier variant 63126867
CLOSED: [2023-08-31 Thu 10:52] SCHEDULED: <2023-08-31 Thu>
**** DONE Ajouter tronquant ou non
CLOSED: [2023-08-31 Thu 22:08] SCHEDULED: <2023-08-31 Thu>
**** DONE Ajouter récessif
CLOSED: [2023-08-31 Thu 22:08] SCHEDULED: <2023-08-31 Thu>
**** KILL Corriger allelic depth
CLOSED: [2023-08-31 Thu 11:18] SCHEDULED: <2023-08-31 Thu>
Problème lié à libre office
**** DONE Regénérer annotation pour na12878, inserted et patient PEX1
CLOSED: [2023-08-31 Thu 22:08] SCHEDULED: <2023-08-31 Thu>
**** TODO ACMG incidental
**** DONE Sortie VCF (pour avoir la fraction allélique AF)
CLOSED: [2023-08-28 Mon 17:22]
**** DONE VCF -> tsv avec bcftools
CLOSED: [2023-08-29 Tue 11:03] SCHEDULED: <2023-08-28 Mon>
**** DONE Un seul transcrit après VEP avec filter_vep :filter:
CLOSED: [2023-08-29 Tue 11:03] SCHEDULED: <2023-08-28 Mon>
Avec mise à jour VEP 110, pick_flag semble fonctionner.
***** DONE Test chr20: Pas de variant "perdus"
CLOSED: [2023-08-28 Mon 17:31] SCHEDULED: <2023-08-28 Mon>
contrairement au résultat communiqué à alexis par mail
#+begin_src sh :dir out/annotate
bcftools +counts vep/NA12878-sanger-chr20-GRCh38/NA12878-sanger-chr20-GRCh38.vep.vcf.gz
#+end_src
Number of samples: 1
Number of SNPs:    123
Number of INDELs:  32
Number of MNPs:    53
Number of others:  0
Number of sites:   208
#+begin_src  sh
filter_vep -i vep/NA12878-sanger-chr20-GRCh38/NA12878-sanger-chr20-GRCh38.vep.vcf.gz  --filter 'PICK' | bcftools +counts
#+end_src
Number of samples: 1
Number of SNPs:    123
Number of INDELs:  32
Number of MNPs:    53
Number of others:  0
Number of sites:   208
2nd vérification
#+begin_src sh :dir out/annotate
 filter_vep -i vep/NA12878-sanger-chr20-GRCh38/NA12878-sanger-chr20-GRCh38.vep.vcf.gz  --filter 'PICK' --soft_filter | grep fail
#+end_src
***** DONE Test NA12878 + variants sanger : variants perdus avec --pick ?
CLOSED: [2023-08-29 Tue 10:36] SCHEDULED: <2023-08-28 Mon>
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/out/annotate
~/.nix-profile/bin/bcftools +counts vep/NA12878-sanger-all-GRCh38/NA12878-sanger-all-GRCh38.vep.vcf.gz
#+end_src
#+RESULTS:
| Number | of | samples: |    1 |
| Number | of | SNPs:    | 6293 |
| Number | of | INDELs:  | 1515 |
| Number | of | MNPs:    | 1588 |
| Number | of | others:  |    0 |
| Number | of | sites:   | 9322 |
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/out/annotate
~/.nix-profile/bin/filter_vep -i vep/NA12878-sanger-all-GRCh38/NA12878-sanger-all-GRCh38.vep.vcf.gz  --filter 'PICK' | bcftools +counts
#+end_src
| Number | of | samples: |    1 |
| Number | of | SNPs:    | 6293 |
| Number | of | INDELs:  | 1515 |
| Number | of | MNPs:    | 1588 |
| Number | of | others:  |    0 |
| Number | of | sites:   | 9322 |
***** DONE Test NA12878 + variants sanger: vérifier sortie avec julia : ok
CLOSED: [2023-08-29 Tue 10:21] SCHEDULED: <2023-08-28 Mon>
143 variants/146 comme avant
***** DONE Relancer en T2T pour vérifier compatibilité :T2T:
CLOSED: [2023-08-29 Tue 11:03] SCHEDULED: <2023-08-29 Tue>
**** DONE Repasser les tests sanger sur NA12878
CLOSED: [2023-09-01 Fri 10:32] SCHEDULED: <2023-08-31 Thu>
2 variants manquants après filter vep
**** DONE Choisir le meilleur transcript nous-meme
CLOSED: [2023-09-01 Fri 10:32] SCHEDULED: <2023-09-01 Fri>
**** DONE Vérifier T2T passe
CLOSED: [2023-08-31 Thu 22:10] SCHEDULED: <2023-08-31 Thu>
**** DONE Revoir choix du transcrit + filtre  avec paul
CLOSED: [2023-09-08 Fri 22:46] SCHEDULED: <2023-09-06 Wed>
**** TODO Filtrer les transcripts selon les filtres d'Alexis et garder tous les résultat
SCHEDULED: <2023-09-09 Sat>
**** TODO Ajout colonne MANE SELECT
SCHEDULED: <2023-09-09 Sat>
**** TODO v1.0
SCHEDULED: <2023-09-09 Sat>
***** TODO Branche prod
SCHEDULED: <2023-09-09 Sat>
Merge depuis debug
***** DONE Mail alexis
CLOSED: [2023-09-01 Fri 10:32] SCHEDULED: <2023-08-31 Thu>
*** KILL Comparer les annotations sur 63003856
CLOSED: [2023-08-28 Mon 17:28]
**** Relancer le nouveau pipeline
*** KILL Ancienne version
CLOSED: [2023-08-28 Mon 17:24]
**** KILL HGVS
CLOSED: [2023-08-28 Mon 17:24]
**** KILL Filtrer après VEP
CLOSED: [2023-08-28 Mon 17:24]
**** KILL OMIM
CLOSED: [2023-08-28 Mon 17:24]
**** KILL clinvar
CLOSED: [2023-08-28 Mon 17:24]
**** KILL ACMG incidental
CLOSED: [2023-08-28 Mon 17:24]
**** KILL Grantham
CLOSED: [2023-08-28 Mon 17:24]
**** KILL LRG
CLOSED: [2023-04-18 mar. 17:22] SCHEDULED: <2023-04-18 Tue>
Vu avec alexis, n’est plus à jour
**** KIL
L Gnomad
CLOSED: [2023-08-28 Mon 17:24]
*** DONE Réordonner les colonnes :annotat
ion:
CLOSED: [2023-08-31 Thu 10:38] SCHEDULED: <2023-08-28 Mon>
Pas d'OMIM, pas de CADD, pas de spliceAI
** DONE Porter exactement la version d'Alexis sur Helios
CLOSED: [2023-01-14 Sat 17:5
6]
Branche "prod"
** KILL Tester version d'alexis avec Nix
CLOSED: [2023-06-14 Wed 22:37]
*** DONE Ajouter clinvar
CLOSED: [2022-11-13 Sun 19:37]
*** DONE Alignement
CLOSED: [2022-11-13 Sun 12:52]
*** DONE Haplotype caller
CLOSED: [2022-11-13 Sun 13:00]
*** KILL Filter
CLOSED: [2023-06-14 Wed 22:37]
- [X] depth
- [ ] comon snp not path
Problème avec liste des ID
**** KILL variant annotation
CLOSED: [2023-06-14 Wed 22:37]
Besoin de vep
*** KILL Variant calling
CLOSED: [2023-06-14 Wed 22:37]
** KILL Tester sarek
CLOSED: [2023-08-12 Sat 15:53]
#+begin_src sh
 module load apptainer/1.1.8
 nextflow run nf-core/sarek -profile test,singularity --outdir test-sarek
#+end_src
Les dépendences ne se téléchargent pas correctement, on les extrait à la main
#+begin_src sh
 rg -IN galaxyproject modules  | sed 's/ //g;s/:$//' | sort | uniq > deps.txt
#+end_src
 Nettoyage à la main
 Puis
 #+begin_src sh
 cat deps.txt | xargs -L1 singularity pull
 #+end_src
** DONE Support pour samplesheet
CLOSED: [2023-08-03 Thu 14:24] SCHEDULED: <2023-08-03 Thu 13:00>
/Entered on/ [2023-08-03 Thu 13:12]
** DONE Petit jeu de données : chr22 sur HG001
CLOSED: [2023-08-05 Sat 14:21] SCHEDULED: <2023-08-05 Sat>
* Amélioration :amelioration:
* Documentation
:PROPERTIES:
:CATEGORY: doc
:END:
** DONE Procédure d'installation nix + dependences pour VM CHU
CLOSED: [2023-
04-22 Sat 15:27] SCHEDULED: <2023-04-13 Thu>
* Manuscript
:PROPERTIES:
:CATEGORY: manuscript
:END:
* Tests :tests:
** KILL 
Non régression : version prod
CLOSED: [2023-05-23 Tue 08:46]
*** DONE ID common snp
CLOSED: [2022-11-19 Sat 21:36]
#+begin_src
$ wc -l ID_of_common_snp.txt
23194290 ID_of_common_snp.txt
$ wc -l /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp.txt
23194290 /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp.txt
#+end_src
*** DONE ID common snp not clinvar patho
CLOSED: [2022-12-11 Sun 20:11]
**** DONE Vérification du problème
CLOSED: [2022-12-11 Sun 16:30]
Sur le J:
21155134 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt.ref
Version de "non-régression"
21155076 database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt
Nouvelle version
23193391 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt
Si on enlève les doublons
$ sort database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt | uniq > old.txt
$ wc -l old.txt
21107097 old.txt
$ sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt |
22-12-10 Sat 23:02]
Clinvar différent !
  $ bcftools stats clinvar.gz
  clinvar (Alexis)
SN	0	number of samples:	0
SN	0	number of records:	1492828
SN	0	number of no-ALTs:	965
SN	0	number of SNPs:	1338007
SN	0	number of MNPs:	5562
SN	0	number of indels:	144580
SN	0	number of others:	3714
SN	0	number of multiallelic sites:	0
SN	0	number of multiallelic SNP sites:	0
clinvar (new)
SN	0	number of samples:	0
SN	0	number of records:	1493470
SN	0	number of no-ALTs:	965
SN	0	number of SNPs:	1338561
SN	0	number of MNPs:	5565
SN	0	number of indels:	144663
SN	0	number of others:	3716
SN	0	number of multiallelic sites:	0
SN	0	number of multiallelic SNP sites:	0
***** DONE Mettre à jour clinvar et dbnSNP pour travailler sur les mêm bases
CLOSED: [2022-12-11 Sun 12:10]
Problème persiste
***** DONE Supprimer la conversion en int du chromosome
CLOSED: [2022-12-10 Sat 19:29]
***** KILL Même NC ?
CLOSED: [2022-12-10 Sat 19:29]
$  zgrep "contig=<ID=NC_\(.*\)" clinvar/GRCh38/clinvar.vcf.gz > contig.clinvar
$ diff contig.txt contig.clinvar
< ##contig=<ID=NC_012920.1>
***** DONE Tester sur chromosome 19: ok
CLOSED: [2022-12-11 Sun 13:53]
On prépare les données
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz -o dbSNP_common_19.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz -o clinvar_19.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data-alexis/dbSNP/dbSNP_common.vcf.gz -o dbSNP_common_19_old.vcf.gz
 bcftools filter -i 'CHROM="19"' /Work/Groups/bisonex/data-alexis/clinvar/clinvar.vcf.gz -o clinvar_19_old.vcf.gz
#+end_src
On récupère les 2 versions du script
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
git checkout regression ../../script/pythonScript/clinvar_sbSNP.py
cp ../../script/pythonScript/clinvar_sbSNP.py clinvar_sbSNP_old.py
git checkout HEAD ../../script/pythonScript/clinvar_sbSNP.py
#+end_src
#+RESULTS:
On compare
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
python ../../script/pythonScript/clinvar_sbSNP.py clinvar_sbSNP.py --clinvar clinvar_19.vcf.gz --dbSNP dbSNP_common_19.vcf.gz --output tmp.txt
sort tmp.txt | uniq > new.txt
table=/Work/Groups/bisonex/data-alexis/RefSeq/refseq_to_number_only_cons
ensual.txt
python clinvar_sbSNP_old.py --clinvar clinvar_19_old.vcf.gz --dbSNP dbSNP_common_19_old.vcf.gz --output tmp_old.txt --chrm_name_table $table
sort tmp_old.txt | uniq > old.txt
wc -l old.txt new.txt
#+end_sr
c
#+RESULTS:
|  535155 | old.txt |
|  535194 | new.txt |
| 1070349 | total   |
Si on prend le premier manquant dans new, il est conflicting patho donc il ne devrait pas y être...
$ bcftools query -i 'ID="rs10418277"' dbSNP
_common_19.vcf.gz  -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 54939682 C G,T
$ bcftools query -i 'ID="rs10418277"' dbSNP_common_19_old.vcf.gz  -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 54939682 C G,T
$ bcftools query -i 'POS=54939682'

[14.148]

[24.15768]

 uniquements dans GRCh38
#+begin_src sh :dir ~/Downloads
comm -13 t2t-genes.txt hg38-genes.txt | wc -l
#+end_src
#+RESULTS:
: 12621
*** TODO OMIM sur nom du gène :annotation:
*** DONE Mobidetails API
CLOSED: [2023-09-10 Sun 16:44]
Trop long ... 1h à 1h30 d'exécution
Disponible dans module
*** TODO Franklin API
https://www.postman.com/genoox-ps/workspace/franklin-api-documentation-s-public-workspace/documentation/6621518-4335389d-12e3-445f-8182-339df95b2a09
*** KILL Regarder si clinique disponible avec vep :annotation:
CLOSED: [2023-09-10 Sun 16:44]
** DONE [#B] Indicateurs qualité :qualité:
CLOSED: [2023-09-10 Sun 16:46]
*** Idée
Raredisease:
- FastQC : nombreuses statistiques. Non disponible Nix
- Mosdepth : calcule la profondeur (2x plus rapide que samtools depth). Nix
- MultiQC : fusionne juste les résultats des analyses. Non disponible nix
- Picard's CollectMutipleMetrics, CollectHsMetrics, and CollectWgsMetrics
- Qualimap : alternative fastqc ? Non disponible nix
- Sentieon's WgsMetricsAlgo : propriétaire
- TIDDIT's cov : TIDIT = remaninement chromosomique
Sarek:
- alignment statistics : samtools stats, mosdepth
- QC : MultiQC
MultiQC : non disponible Nix
*** DONE FastqQC
CLOSED: [2023-08-15 Tue 21:43] SCHEDULED: <2023-08-13 Sun>
*** DONE Mosdepth
CLOSED: [2023-08-15 Tue 21:43] SCHEDULED: <2023-08-13 Sun>
Pour exomple, il faut le fichier de capture
subworkflows/local/bam_markduplicates/
*** DONE Samtools stats
CLOSED: [2023-08-15 Tue 21:43] SCHEDULED: <2023-08-13 Sun>
*** DONE [#B] Compte-redu exécution avec MultiQC
CLOSED: [2023-08-15 Tue 21:43] SCHEDULED: <2023-08-13 Sun>
*** DONE Résultats sur NA12878 : 98% à 20x
CLOSED: [2023-08-19 Sat 20:45] SCHEDULED: <2023-08-17 Thu>
**** DONE Comprendre 91% à 20x seulement: SNVs inséré
CLOSED: [2023-08-18 Fri 22:25]
***** DONE Tester autre kit : Twist exome comprehensive
CLOSED: [2023-08-18 Fri 22:24]
Moins bon
***** DONE Tester génome sans alt
CLOSED: [2023-08-18 Fri 22:25]
Idem
***** DONE Tester NA12878 sans SNVs inséré: cause !!
CLOSED: [2023-08-18 Fri 22:25]
***** DONE Tester hg19 sur NA12878 non inséré
CLOSED: [2023-08-18 Fri 22:25]
**** DONE Comprendre pourquoi SNVs diminuent le score: reads manquants
CLOSED: [2023-08-19 Sat 20:34] SCHEDULED: <2023-08-18 Fri>
Voir [[id:5c1c36f3-f68e-4e6d-a7b6-61dca89abc37][Bug: perte de nombreux reads avec NA12878]]
*** DONE Relancer résultats avec NA1287 et NA12878 + sanger
CLOSED: [2023-08-29 Tue 10:30] SCHEDULED: <2023-08-29 Tue>
*** DONE Comparer avec hg19
CLOSED: [2023-08-28 Mon 17:22] SCHEDULED: <2023-08-20 Sun>
*** DONE Comparer avec autres kit de capture
CLOSED: [2023-08-28 Mon 17:22] SCHEDULED: <2023-08-20 Sun>
*** DONE Comparer avec no-alt
CLOSED: [2023-08-28 Mon 17:22] SCHEDULED: <2023-08-20 Sun>
** HOLD vérifier si normalisation
** KILL [#B] Vérification nomenclature hgvs :hgvs:
CLOSED: [2023-08-16 Wed 19:07] SCHEDULED: <2023-08-15 Tue>
*** KILL mutalyzer
CLOSED: [2023-08-16 Wed 19:07] SCHEDULED: <2023-08-13 Sun>
*** KILL API variantvalidator
CLOSED: [2023-08-16 Wed 19:07] SCHEDULED: <2023-08-13 Sun>
** DONE Exécution
CLOSED: [2022-09-13 Tue 21:37]
*** KILL test Bionix
*** KILL Implémenter execution avec Nix ?
Voir https://academic.oup.com/gigascience/article/9/11/giaa121/5987272?login=false
pour un exemple.
Probablement plus simple d’utiliser Nix pour gestion de l’environnement et snakemake pour l’exécution
Pas d’accès internet depuis le cluster
*** DONE nextflow
CLOSED: [2022-09-13 Tue 21:37]
**** TODO Bug scheduler SGE
Le job se fait tuer car l'utilisateur n'est pas passé correctement à nextflow
***** DONE Forcer l'utilisateur à l'exécution
CLOSED: [2023-04-01 Sat 17:57]
NXF_OPTS=-D"user.name=alex"
***** DONE Vérifier si le problème persiste avec 22.10.6
CLOSED: [2023-04-01 Sat 18:38] SCHEDULED: <2023-04-01 Sat>
oui
***** KILL Packager l'utilisateur dans le programme ?
Mauvaise idée..
** TODO Preprocessing avec nextflow
*** TODO Map to reference
**** TODO Sample ID dans header
/Work/Users/apraga/bisonex/out/63003856_S135/preprocessing/baserecalibrator
*** DONE Mark duplicate
CLOSED: [2022-10-09 Sun 22:30]
*** DONE Recalibrate base quality score
CLOSED: [2022-10-09 Sun 22:30]
** DONE Variant calling avec Nextflow
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Haplotype caller
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter variants
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter common snp not clinvar path
CLOSED: [2022-11-07 Mon 23:00]
Voir [[*common dbSNP not clinvar patho][common dbSNP not clinvar patho]]
*** DONE Filter variant only in consensual sequence
CLOSED: [2022-11-08 Tue 22:23]
*** DONE Filter technical variants
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Utilise AVX pour accélerer l'exécution
CLOSED: [2023-04-29 Sat 15:46]
Sans cela, on a l'avertissement
#+begin_quote
17:28:00.720 INFO  PairHMM - OpenMP multi-threaded AVX-accelerated native PairHMM implementation is not supported
17:28:00.721 INFO  NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/nix/store/cy9ckxqwrkifx7wf02hm4ww1p6lnbxg9-gatk-4.2.4.1/bin/gatk-package-4.2.4.1-local.jar!/com/intel/gkl/native/libgkl_utils.so
17:28:00.733 WARN  NativeLibraryLoader - Unable to load libgkl_utils.so from native/libgkl_utils.so (/Work/Users/apraga/bisonex/out/NA12878_NIST7035/preprocessing/applybqsr/libgkl_utils821485189051585397.so: libgomp.so.1: cannot open shared object file: No such file or directory)
17:28:00.733 WARN  IntelPairHmm - Intel GKL Utils not loaded
17:28:00.733 WARN  PairHMM - ***WARNING: Machine does not have the AVX instruction set support needed for the accelerated AVX PairHmm. Falling back to the MUCH slower LOGLESS_CACHING implementation!
17:28:00.763 INFO  ProgressMeter - Starting traversal
#+end_quote
libgomp.so est fourni par gcc donc il faut charger le module
 module load gcc@11.3.0/gcc-12.1.0
** KILL Utiliser subworkflow
CLOSED: [2023-04-02 Sun 18:08]
Notre version permet d'être plus souple
*** KILL Alignement
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
*** KILL Vep
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
vcf_annotate_ensemblvep
** TODO Annotation avec nextflow :annotation:
*** KILL VEP : --gene-phenotype ?
CLOSED: [2023-04-18 mar. 18:32]
Vu avec alexis : bases de données non à jour
https://www.ensembl.org/info/genome/variation/phenotype/sources_phenotype_documentation.html
*** DONE plugin VEP
CLOSED: [2023-04-18 mar. 18:32]
Cloner dépôt git avec plugin
Puis utiliser --dir_plugins
*** HOLD Utiliser code d’Alexis
*** TODO Nouvelle version avec VEP
Example avec --custom
https://www.ensembl.org/info/docs/tools/vep/script/vep_custom.html
**** DONE Ajout spliceAI
CLOSED: [2023-05-18 Thu 11:02] SCHEDULED: <2023-04-30 Sun>
plugin VEP
***** DONE Télécharger les données
CLOSED: [2023-05-11 Thu 19:01]
Difficile d'automatiser, le lien est temporaire...
***** DONE PLugin
CLOSED: [2023-05-11 Thu 20:16]
***** DONE Séparer score en plusieurs colonnes
CLOSED: [2023-05-11 Thu 20:16]
Test avec ce fichier pour avoir une ligne avec annotation et une ligne sans
#CHROM	POS	ID	REF	ALT
1	9091	.	A	C
1	69091	.	A	C
et
#+begin_src sh
rm -f postvep.tsv* && vep -i testspliceai.vcf.gz -o postvep.tsv --tab  --dir 109 --merged --pick --use_given_ref   --offline  --plugin SpliceAI,snv=spliceai_scores.raw.snv.hg38.vcf.gz,indel=spliceai_scores.raw.indel.hg38.vcf.gz
#+end_src
#+begin_src
$ bgzip postvep.tsv
$ python spliceai.py
$ cat postvep2.tsv
,variation,Location,Allele,Gene,Feature,Feature_type,Consequence,cDNA_position,CDS_position,Protein_position,Amino_acids,Codons,Existing_variation,IMPACT,DISTANCE,STRAND,FLAGS,REFSEQ_MATCH,SOURCE,REFSEQ_OFFSET,SpliceAI_AG,SpliceAI_AL,SpliceAI_DG,SpliceAI_DL
0,1_9091_A/C,1:9091,C,ENSG00000290825,ENST00000456328,Transcript,upstream_gene_variant,-,-,-,-,-,-,MODIFIER,2778,1,-,-,Ensembl,-,,,,
1,1_69091_A/C,1:69091,C,ENSG00000186092,ENST00000641515,Transcript,missense_variant,124,64,22,M/L,Atg/Ctg,-,MODERATE,-,1,-,-,Ensembl,-,0.01,0.00,0.00,0.01
#+end_src
Test
cp work/bf/437ae511958509e43072f032f4d495/small.tab.gz tests/vep-spip.tab.gz
cp work/d5/3b1244b5ae83d54409ee0d456e8c55/small_cadd.tab.gz tests/vep-cadd-splice.tab.gz
**** HOLD Package Nix spliceAI ?
nix profile install nixpkgs#python3Packages.tensorflow
+ ajouter dépendencs ("grep import" ou cnad)
**** TODO Ajout LOEUF et pli
plugin VEP
**** TODO NMD
plugin VEP
**** KILL Ajout LOEUF
CLOSED: [2023-04-19 mer. 16:32]
plugin VEP
**** DONE Spip
CLOSED: [2023-05-01 Mon 23:07] SCHEDULED: <2023-04-30 Sun>
BED ne semble pas bien marcher (il faut définir une zone)
VCF : trop d’information
Attention, plusieurs transcripts mais résultats identiques. On supprimer les doublons
***** DONE interpretation + score + intervalle de confiance séparé
CLOSED: [2023-05-01 Mon 23:07] SCHEDULED: <2023-04-30 Sun>
Tests :
dans tests/
vep -i 63004925-small.vcf -o postvep.vcf --vcf --fasta genomeRef.fna --dir 109 --merged --pick  --offline --custom ../script/spip_annotation.vcf.gz,SPIP,vcf,exact,0,spipInterp,spipScore,spipConfidence
***** DONE Score
CLOSED: [2023-04-22 Sat 15:30]
**** DONE CADD: remplacer par plugin VEP
CLOSED: [2023-05-07 Sun 14:45] SCHEDULED: <2023-05-07 Sun>
***** Test
#+begin_src
vep  -i test.vcf  -o lol.vcf --offline --dir  /Work/Projects/bisonex/data/vep/GRCh38/ --merged --vcf --fasta /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna --plugin CADD,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.snv.tsv.gz,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.indel.tsv.gz  --dir_plugins ../VEP_plugins/ -v
#+end_src
Test
#+begin_src sh
vep --id "1  230710048 230710048 A/G 1"   --offline --dir  /Work/Projects/bisonex/data/vep/GRCh38/ --merged --vcf --fasta /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna --plugin CADD,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.snv.tsv.gz,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.indel.tsv.gz  --hgvsg --plugin pLI --plugin LOEUF -o lol
#+end_src
CSQ=G|missense_variant|MODERATE|AGT|ENSG00000135744|Transcript|ENST00000366667|protein_coding|2/5||||843|776|259|M/T|aTg/aCg|||-1||HGNC|HGNC:333||Ensembl||A|A||1:g.230710048A>G|0.347|-0.277922|
Correspond bien à https://www.ensembl.org/Homo_sapiens/Tools/VEP/Results?tl=I7ZsIbrj14P6lD43-9115494
***** DONE Utiliser whole genome
CLOSED: [2023-04-29 Sat 15:46]
***** KILL Renommer les chromosome avant ...
CLOSED: [2023-05-01 Mon 09:14] SCHEDULED: <2023-04-30 Sun>
Trop long !
- Téléchargement de CADD: 4h20
- renommer les chromosome pour SNV : 6h20
- tabix sur les SNV : job tué au bout de 21h....
***** DONE annoter séparément et fusionner les tableaux
CLOSED: [2023-05-07 Sun 14:45] SCHEDULED: <2023-05-01 Mon>
NB: on pourrait filtrer CADD avec tabix pour se restreindre à nos variants
**** DONE clinvar
CLOSED: [2023-04-22 Sat 15:31]
**** KILL Vérifier résultats HGVS avec mutalyzer
CLOSED: [2023-05-01 Mon 09:26]
**** HOLD Parallélisation
***** HOLD par chromosome avec workflow VEP
https://github.com/Ensembl/ensembl-vep/blob/release/109/nextflow/workflows/run_vep.nf
***** HOLD Avec option --fork
**** DONE Utiliser la version de nf-core de VEP
CLOSED: [2023-05-13 Sat 18:27] SCHEDULED: <2023-05-07 Sun>
**** DONE OMIM
CLOSED: [2023-08-31 Thu 10:38] SCHEDULED: <2023-08-29 Tue>
**** DONE plI et LOEUF depuis gnomad
CLOSED: [2023-08-31 Thu 10:38] SCHEDULED: <2023-08-29 Tue>
**** DONE Grantham
CLOSED: [2023-08-31 Thu 22:08] SCHEDULED: <2023-08-30 Wed>
**** DONE Corriger spliceAI
CLOSED: [2023-08-31 Thu 13:51] SCHEDULED: <2023-08-31 Thu>
Pas d'annotation
- chromosome ? essai 1 au lieu de chr1 : idem. Et fonctionne pour CADD
- index ?
  - retélécharger
  - indexer nous-meme
**** DONE Supprimer score spip en double
CLOSED: [2023-08-31 Thu 14:17] SCHEDULED: <2023-08-31 Thu>
**** DONE Vérifier variant 63126867
CLOSED: [2023-08-31 Thu 10:52] SCHEDULED: <2023-08-31 Thu>
**** DONE Ajouter tronquant ou non
CLOSED: [2023-08-31 Thu 22:08] SCHEDULED: <2023-08-31 Thu>
**** DONE Ajouter récessif
CLOSED: [2023-08-31 Thu 22:08] SCHEDULED: <2023-08-31 Thu>
**** KILL Corriger allelic depth
CLOSED: [2023-08-31 Thu 11:18] SCHEDULED: <2023-08-31 Thu>
Problème lié à libre office
**** DONE Regénérer annotation pour na12878, inserted et patient PEX1
CLOSED: [2023-08-31 Thu 22:08] SCHEDULED: <2023-08-31 Thu>
**** TODO ACMG incidental
**** DONE Sortie VCF (pour avoir la fraction allélique AF)
CLOSED: [2023-08-28 Mon 17:22]
**** DONE VCF -> tsv avec bcftools
CLOSED: [2023-08-29 Tue 11:03] SCHEDULED: <2023-08-28 Mon>
**** DONE Un seul transcrit après VEP avec filter_vep :filter:
CLOSED: [2023-08-29 Tue 11:03] SCHEDULED: <2023-08-28 Mon>
Avec mise à jour VEP 110, pick_flag semble fonctionner.
***** DONE Test chr20: Pas de variant "perdus"
CLOSED: [2023-08-28 Mon 17:31] SCHEDULED: <2023-08-28 Mon>
contrairement au résultat communiqué à alexis par mail
#+begin_src sh :dir out/annotate
bcftools +counts vep/NA12878-sanger-chr20-GRCh38/NA12878-sanger-chr20-GRCh38.vep.vcf.gz
#+end_src
Number of samples: 1
Number of SNPs:    123
Number of INDELs:  32
Number of MNPs:    53
Number of others:  0
Number of sites:   208
#+begin_src  sh
filter_vep -i vep/NA12878-sanger-chr20-GRCh38/NA12878-sanger-chr20-GRCh38.vep.vcf.gz  --filter 'PICK' | bcftools +counts
#+end_src
Number of samples: 1
Number of SNPs:    123
Number of INDELs:  32
Number of MNPs:    53
Number of others:  0
Number of sites:   208
2nd vérification
#+begin_src sh :dir out/annotate
 filter_vep -i vep/NA12878-sanger-chr20-GRCh38/NA12878-sanger-chr20-GRCh38.vep.vcf.gz  --filter 'PICK' --soft_filter | grep fail
#+end_src
***** DONE Test NA12878 + variants sanger : variants perdus avec --pick ?
CLOSED: [2023-08-29 Tue 10:36] SCHEDULED: <2023-08-28 Mon>
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/out/annotate
~/.nix-profile/bin/bcftools +counts vep/NA12878-sanger-all-GRCh38/NA12878-sanger-all-GRCh38.vep.vcf.gz
#+end_src
#+RESULTS:
| Number | of | samples: |    1 |
| Number | of | SNPs:    | 6293 |
| Number | of | INDELs:  | 1515 |
| Number | of | MNPs:    | 1588 |
| Number | of | others:  |    0 |
| Number | of | sites:   | 9322 |
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/out/annotate
~/.nix-profile/bin/filter_vep -i vep/NA12878-sanger-all-GRCh38/NA12878-sanger-all-GRCh38.vep.vcf.gz  --filter 'PICK' | bcftools +counts
#+end_src
| Number | of | samples: |    1 |
| Number | of | SNPs:    | 6293 |
| Number | of | INDELs:  | 1515 |
| Number | of | MNPs:    | 1588 |
| Number | of | others:  |    0 |
| Number | of | sites:   | 9322 |
***** DONE Test NA12878 + variants sanger: vérifier sortie avec julia : ok
CLOSED: [2023-08-29 Tue 10:21] SCHEDULED: <2023-08-28 Mon>
143 variants/146 comme avant
***** DONE Relancer en T2T pour vérifier compatibilité :T2T:
CLOSED: [2023-08-29 Tue 11:03] SCHEDULED: <2023-08-29 Tue>
**** DONE Repasser les tests sanger sur NA12878
CLOSED: [2023-09-01 Fri 10:32] SCHEDULED: <2023-08-31 Thu>
2 variants manquants après filter vep
**** DONE Choisir le meilleur transcript nous-meme
CLOSED: [2023-09-01 Fri 10:32] SCHEDULED: <2023-09-01 Fri>
**** DONE Vérifier T2T passe
CLOSED: [2023-08-31 Thu 22:10] SCHEDULED: <2023-08-31 Thu>
**** DONE Revoir choix du transcrit + filtre  avec paul
CLOSED: [2023-09-08 Fri 22:46] SCHEDULED: <2023-09-06 Wed>
**** DONE Filtrer les variants selon les filtres d'Alexis et garder tous les résultat
CLOSED: [2023-09-10 Sun 15:39] SCHEDULED: <2023-09-09 Sat>
**** DONE Ajout colonne MANE SELECT et garder les autres
CLOSED: [2023-09-10 Sun 15:39] SCHEDULED: <2023-09-09 Sat>
**** TODO v1.0
SCHEDULED: <2023-09-09 Sat>
***** DONE Branche prod
CLOSED: [2023-09-10 Sun 15:44] SCHEDULED: <2023-09-09 Sat>
Merge depuis debug
***** DONE Mail alexis
CLOSED: [2023-09-01 Fri 10:32] SCHEDULED: <2023-08-31 Thu>
***** STRT Relancer test sanger
SCHEDULED: <2023-09-10 Sun>
***** TODO Mail Paul pour validation
SCHEDULED: <2023-09-10 Sun>
*** KILL Comparer les annotations sur 63003856
CLOSED: [2023-08-28 Mon 17:28]
**** Relancer le nouveau pipeline
*** KILL Ancienne version
CLOSED: [2023-08-28 Mon 17:24]
**** KILL HGVS
CLOSED: [2023-08-28 Mon 17:24]
**** KILL Filtrer après VEP
CLOSED: [2023-08-28 Mon 17:24]
**** KILL OMIM
CLOSED: [2023-08-28 Mon 17:24]
**** KILL clinvar
CLOSED: [2023-08-28 Mon 17:24]
**** KILL ACMG incidental
CLOSED: [2023-08-28 Mon 17:24]
**** KILL Grantham
CLOSED: [2023-08-28 Mon 17:24]
**** KILL LRG
CLOSED: [2023-04-18 mar. 17:22] SCHEDULED: <2023-04-18 Tue>
Vu avec alexis, n’est plus à jour
**** KILL Gnomad
CLOSED: [2023-08-28 Mon 17:24]
*** DONE Réordonner les colonnes :annotation:
CLOSED: [2023-08-31 Thu 10:38] SCHEDULED: <2023-08-28 Mon>
Pas d'OMIM, pas de CADD, pas de spliceAI
** DONE Porter exactement la version d'Alexis sur Helios
CLOSED: [2023-01-14 Sat 17:56]
Branche "prod"
** KILL Tester version d'alexis avec Nix
CLOSED: [2023-06-14 Wed 22:37]
*** DONE Ajouter clinvar
CLOSED: [2022-11-13 Sun 19:37]
*** DONE Alignement
CLOSED: [2022-11-13 Sun 12:52]
*** DONE Haplotype caller
CLOSED: [2022-11-13 Sun 13:00]
*** KILL Filter
CLOSED: [2023-06-14 Wed 22:37]
- [X] depth
- [ ] comon snp not path
Problème avec liste des ID
**** KILL variant annotation
CLOSED: [2023-06-14 Wed 22:37]
Besoin de vep
*** KILL Variant calling
CLOSED: [2023-06-14 Wed 22:37]
** KILL Tester sarek
CLOSED: [2023-08-12 Sat 15:53]
#+begin_src sh
 module load apptainer/1.1.8
 nextflow run nf-core/sarek -profile test,singularity --outdir test-sarek
#+end_src
Les dépendences ne se téléchargent pas correctement, on les extrait à la main
#+begin_src sh
 rg -IN galaxyproject modules  | sed 's/ //g;s/:$//' | sort | uniq > deps.txt
#+end_src
 Nettoyage à la main
 Puis
 #+begin_src sh
 cat deps.txt | xargs -L1 singularity pull
 #+end_src
** DONE Support pour samplesheet
CLOSED: [2023-08-03 Thu 14:24] SCHEDULED: <2023-08-03 Thu 13:00>
/Entered on/ [2023-08-03 Thu 13:12]
** DONE Petit jeu de données : chr22 sur HG001
CLOSED: [2023-08-05 Sat 14:21] SCHEDULED: <2023-08-05 Sat>
* Documentation
:PROPERTIES:
:CATEGORY: doc
:END:
** DONE Procédure d'installation nix + dependences pour VM CHU
CLOSED: [2023-04-22 Sat 15:27] SCHEDULED: <2023-04-13 Thu>
* Manuscript
:PROPERTIES:
:CATEGORY: manuscript
:END:
* Tests :tests:
** KILL Non régression : version prod
CLOSED: [2023-05-23 Tue 08:46]
*** DONE ID common snp
CLOSED: [2022-11-19 Sat 21:36]
#+begin_src
$ wc -l ID_of_common_snp.txt
23194290 ID_of_common_snp.txt
$ wc -l /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp.txt
23194290 /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp.txt
#+end_src
*** DONE ID common snp not clinvar patho
CLOSED: [2022-12-11 Sun 20:11]
**** DONE Vérification du problème
CLOSED: [2022-12-11 Sun 16:30]
Sur le J:
21155134 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt.ref
Version de "non-régression"
21155076 database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt
Nouvelle version
23193391 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt
Si on enlève les doublons
$ sort database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt | uniq > old.txt
$ wc -l old.txt
21107097 old.txt
$ sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt | uniq > new.txt
$ wc -l new.txt
21174578 new.txt
$ sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt.ref | uniq > ref.txt
$ wc -l ref.txt
21107155 ref.txt
Si on regarde la différence
 comm -23 ref.txt old.txt
rs1052692
rs1057518973
rs1057518973
rs11074121
rs112848754
rs12573787
rs145033890
rs147889095
rs1553904159
rs1560294695
rs1560296615
rs1560310926
rs1560325547
rs1560342418
rs1560356225
rs1578287542
...
On cherche le premier
bcftools query -i 'ID="rs1052692"' database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 1619351 C A,T
Il est bien patho...
$ bcftools query -i 'POS=1619351' database/clinvar/clinvar.vcf.gz -f '%CHROM %POS %REF %ALT %INFO/CLNSIG\n'
19 1619351 C T Conflicting_interpretations_of_pathogenicity
On vérifie pour tous les autres
$ comm -23 ref.txt old.txt > tocheck.txt
On génère les régions à vérifier (chromosome number:position)
$ bcftools query -i 'ID=@tocheck.txt' database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM\t%POS\n' > tocheck.pos
On génère le mapping inverse (chromosome number -> NC)
$ awk ' { t = $1; $1 = $2; $2 = t; print; } ' database/RefSeq/refseq_to_number_only_consensual.txt  > mapping.txt
On remap clinvar
$ bcftools annotate --rename-chrs mapping.txt database/clinvar/clinvar.vcf.gz -o clinvar_remapped.vcf.gz
$ tabix clinvar_remapped.vcf.gz
Enfin, on cherche dans clinvar la classification
$ bcftools query -R tocheck.pos clinvar_remapped.vcf.gz -f '%CHROM %POS %INFO/CLNSIG\n'
$ bcftools query -R tocheck.pos database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM %POS %ID \n' | grep '^NC'
#+RESULTS:
**** DONE Comprendre pourquoi la nouvelle version donne un résultat différent
CLOSED: [2022-12-11 Sun 20:11]
***** DONE Même version dbsnp et clinvar ?
CLOSED: [2022-12-10 Sat 23:02]
Clinvar différent !
  $ bcftools stats clinvar.gz
  clinvar (Alexis)
SN	0	number of samples:	0
SN	0	number of records:	1492828
SN	0	number of no-ALTs:	965
SN	0	number of SNPs:	1338007
SN	0	number of MNPs:	5562
SN	0	number of indels:	144580
SN	0	number of others:	3714
SN	0	number of multiallelic sites:	0
SN	0	number of multiallelic SNP sites:	0
clinvar (new)
SN	0	number of samples:	0
SN	0	number of records:	1493470
SN	0	number of no-ALTs:	965
SN	0	number of SNPs:	1338561
SN	0	number of MNPs:	5565
SN	0	number of indels:	144663
SN	0	number of others:	3716
SN	0	number of multiallelic sites:	0
SN	0	number of multiallelic SNP sites:	0
***** DONE Mettre à jour clinvar et dbnSNP pour travailler sur les mêm bases
CLOSED: [2022-12-11 Sun 12:10]
Problème persiste
***** DONE Supprimer la conversion en int du chromosome
CLOSED: [2022-12-10 Sat 19:29]
***** KILL Même NC ?
CLOSED: [2022-12-10 Sat 19:29]
$  zgrep "contig=<ID=NC_\(.*\)" clinvar/GRCh38/clinvar.vcf.gz > contig.clinvar
$ diff contig.txt contig.clinvar
< ##contig=<ID=NC_012920.1>
***** DONE Tester sur chromosome 19: ok
CLOSED: [2022-12-11 Sun 13:53]
On prépare les données
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz -o dbSNP_common_19.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz -o clinvar_19.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data-alexis/dbSNP/dbSNP_common.vcf.gz -o dbSNP_common_19_old.vcf.gz
 bcftools filter -i 'CHROM="19"' /Work/Groups/bisonex/data-alexis/clinvar/clinvar.vcf.gz -o clinvar_19_old.vcf.gz
#+end_src
On récupère les 2 versions du script
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
git checkout regression ../../script/pythonScript/clinvar_sbSNP.py
cp ../../script/pythonScript/clinvar_sbSNP.py clinvar_sbSNP_old.py
git checkout HEAD ../../script/pythonScript/clinvar_sbSNP.py
#+end_src
#+RESULTS:
On compare
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
python ../../script/pythonScript/clinvar_sbSNP.py clinvar_sbSNP.py --clinvar clinvar_19.vcf.gz --dbSNP dbSNP_common_19.vcf.gz --output tmp.txt
sort tmp.txt | uniq > new.txt
table=/Work/Groups/bisonex/data-alexis/RefSeq/refseq_to_number_only_consensual.txt
python clinvar_sbSNP_old.py --clinvar clinvar_19_old.vcf.gz --dbSNP dbSNP_common_19_old.vcf.gz --output tmp_old.txt --chrm_name_table $table
sort tmp_old.txt | uniq > old.txt
wc -l old.txt new.txt
#+end_src
#+RESULTS:
|  535155 | old.txt |
|  535194 | new.txt |
| 1070349 | total   |
Si on prend le premier manquant dans new, il est conflicting patho donc il ne devrait pas y être...
$ bcftools query -i 'ID="rs10418277"' dbSNP
_common_19.vcf.gz  -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 54939682 C G,T
$ bcftools query -i 'ID="rs10418277"' dbSNP_common_19_old.vcf.gz  -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 54939682 C G,T
$ bcftools query -i 'POS=54939682'

Replacement in projects/bisonex.org at line 56 [4.35]

B:BD[25.8004] → [25.8004:8519]

∅:D[25.8519] → [22.59238:66761]

B:BD[22.59238] → [22.59238:66761]

∅:D[22.66761] → [26.50929:51083]

B:BD[26.50929] → [26.50929:51083]

to pour demande le type de capture
/Entered on/ [2023-08-07 Mon 20:40]
** TODO Données CHM13 :chm:
https://github.com/lh3/CHM-eval
*** TODO Run ERR1341793
SCHEDULED: <2023-09-10 Sun>
(raw reads ERR1341793_1.fastq.gz and ERR1341793_2.fastq.gz downloaded from https://www.ebi.ac.uk/ena/browser/view/ERR1341793)
*** TODO Run ERR1341796
SCHEDULED: <2023-09-10 Sun>
** TODO Insilico :cento:
*** TODO tous les variants centogène
**** DONE Extraire liste des SNVs
CLOSED: [2023-04-22 Sat 17:32] SCHEDULED: <2
023-04-17 Mon>
***** DONE Corriger manquant à la main
CLOSED: [2023-04-22 Sat 17:31]
La sortie est sauvegardé dans git-annex : variants_success.csv
***** DONE Automatique
CLOSED: [2023-04-22 Sat 17:31]
**** DONE Convert SNVs : transcript -> génomique
CLOSED: [2023-06-03 Sat 17:16]
***** DONE Variant_recoder
CLOSED: [2023-04-26 Wed 21:21] SCHEDULED: <2023-04-22 Sat>
****** KILL Haskell: 160 manquant : recoded-success.csv
CLOSED: [2023-04-25 Tue 18:32]
La liste des variants a été générée en Haskel   l et nettoyée à la main.
On générer une liste de variant pour variant_rec            oder et on soumet tout d'un coup.
[[file:~/recherche/bisonex/parsevariants/app/Main.hs][parsevariant]]
#+begin_src haskell
recodeVariant = do
  prepareVariantRecod   er "variant_success.csv" "renamed.csv"
  runVariantRecoder "renamed.csv" "recoded.json"
#+end_src
#+RESULTS:
: <interactive>:4:3-19: error:
:     Variable not in scope: runVariantRecoder :: String -> String -> t
: gh
Problème : 160 n'ont pas pu être lu sur 820, probablement à cause du numéro mineur de transcrit
La sortie est sauvegardé dans git-annex : variants-recoded-raw.json.
****** KILL Julia
CLOSED: [2023-04-25 Tue 18:32]
On regénère la liste de variant et on passe à Julia pour préparer l'appel en parallèle à variant recoder
[[file:~/recherche/bisonex/parsevariants/variantRecoder.jl][variantRecoder.jl]]
#+begin_src julia
setupVariantRecoder(unique(init), n)
#+end_src
Puis
#+begin_src sh
parallel -a parallel-recoder.sh --jobs 10
#+end_src
On récupère les résultats
#+begin_src julia
(fails, success) = mergeVariantRecoder(n)
CSV.write(fSuccess, success)
CSV.write(fFailures, fails)
#+end_src
Certains variants ne sont pas trouvé, donc on prépare un nouveau job en enlevant les versionrs mineures des transcrits
#+begin_src julia
# Cleanup json and txt
if isfile(fSuccess) && isfile(fFailures)
    foreach(rm, variantRecoderInput())
    foreach(rm, variantRecoderOutput())
end
redoFails(fFailures)
#+end_src
Puis
#+begin_src sh
parallel -a parallel-recoder.sh --jobs 3
#+end_src
Il manque encore 70 transcrits
***** DONE Julia avec mobidetails: recode-failures-mobidetails.csv
CLOSED: [2023-04-25 Tue 18:58]
Nouvelle stratégie : on essaie une fois variant recoder.
Pour tous les échecs, on utilise mobidetails (~170).
Si l'ID n'est pas trouvé, on incrémente le numéro de version 2 fois
***** DONE Reste une dizaine à corriger à la main
CLOSED: [2023-04-26 Wed 21:21]
- [X] certains transcrits ont juste été supprimé
- [X] Erreur de parsing, manque souvent un -
#+begin_src julia
lastTryMobidetails("recoded-failures-mobidetails.csv")
#+end_src
***** DONE Fusionner données
CLOSED: [2023-04-26 Wed 22:35]
#+begin_src julia
function mergeAllGenomic()
    dNew = mergeAll("recoded-success.csv",
                    "recoded-failures-mobidetails.csv",
                    "recoded-failures-mobidetails-redo.csv")
    dInit = @chain DataFrame(CSV.File("variant_success.csv")) begin
        @transform :transcript = :transcript .* ":" .* :coding .* :codingPos .* :codingChange
        @select :file :transcript :classification :zygosity
        @rename :classificationCento = :classification
    end
    dTmp = outerjoin(dInit, dNew, on = :transcript)
    CSV.write("variant_genomic.csv", dTmp)
end
fSuccess = "recoded-success.csv"
fFailures = "recoded-failures.csv"
# variantRecoder(fSuccess, fFailures)
# mobidetailsOnFailures(fFailures)
# lastTryMobidetails("recoded-failures-mobidetails.csv")
mergeAllGenomic()
#+end_src
***** DONE Formatter donner pour simuscop
CLOSED: [2023-04-28 Fri 11:55] SCHEDULED: <2023-04-26 Wed>
**** KILL Extraire liste des CNVs
CLOSED: [2023-08-12 Sat 15:54]
SCHEDULED: <2023-04-17 Mon>
**** KILL Simuscop :simuscop:
CLOSED: [2023-08-12 Sat 15:54]
***** DONE Entrainer le modèle sur 63003856/
CLOSED: [2023-04-29 Sat 19:56]
Relancer le modèle pour être sûr
***** DONE Générer fastq avec simuscop (del et ins seulement) 20x
CLOSED: [2023-04-28 Fri 23:35] SCHEDULED: <2023-04-22 Sat>
****** DONE Génerer un profile avec bed de centogène
CLOSED: [2023-04-28 Fri 11:54] SCHEDULED: <2023-04-22 Sat>
NA12878 mais à refaire avec un vrai séquencage
Voir [[*Centogène][Bed Centogène]] pour choix
****** DONE Générer les données en 20x
CLOSED: [2023-04-28 Fri 11:54] SCHEDULED: <2023-04-22 Sat>
capture de cento
****** DONE Regénérer en supprimant les doublons
CLOSED: [2023-04-28 Fri 17:28]
***** DONE Quelle couverture ?
CLOSED: [2023-04-29 Sat 18:26]
ex sur chr11:16,014,966 où on a 11 reads dans la simulation contre 200 !
****** 200 est la plus proche
#+attr_html: :width 500px
[[./simuscop-200-chr1-1.png]]
#+attr_html: :width 500px
[[./simuscop-200-chr1-2.png]]
****** DONE 20x
CLOSED: [2023-04-29 Sat 15:38]
****** DONE 50x
CLOSED: [2023-04-29 Sat 15:38]
****** DONE 100x
CLOSED: [2023-04-29 Sat 15:39]
****** DONE 200x
CLOSED: [2023-04-29 Sat 15:39]
***** DONE Reads mal centrés sur des petits exons seuls
CLOSED: [2023-04-29 Sat 19:56] SCHEDULED: <2023-04-29 Sat>
Capture ok : [[https://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr1%3A153817168%2D153817824&hgsid=296556270_F4fkENLPXHXidi2oALXls2jxNH9l][UCSC]] (track noire)
Mais mauvaise répartitiopn
#+attr_html: :width 800px
[[./simuscop-error.png]]
À tester
- Problème de profile ?
  - mauvais patient ?
  - mauvaise génération ? -> comparer avec ceux donnés sur github
- nom des chromosomes ?
****** DONE [#A] Tester sur exon 6 GATAD2B pour NC_000001.11:g.153817496A>T
CLOSED: [2023-04-29 Sat 19:56] SCHEDULED: <2023-04-29 Sat>
******* DONE Configuration + Profile 63003856.profile: idem, mal centré
CLOSED: [2023-04-29 Sat 19:18]
Téléchargement des données
#+begin_src sh :dir ~/code/bisonex/test-simuscop
scp meso:/Work/Projects/bisonex/data/genome/GRCh38.p14/genomeRef.fna .
scp meso:Work/Projects/bisonex/data/simuscop/*.profile .
scp -r meso:/Work/Projects/bisonex/data/genome/GRCh38.p13/bwa .
#+end_src
On récupère l'exon (NB: org-mode ne lance pas le code...)
#+begin_src julia
using CSV,DataFramesMeta
d = CSV.read("VCGS_Exome_Covered_Targets_hg38_40.1MB_renamed.bed", header=false, delim="\t", DataFrame)
@subset d :Column1 .== "NC_000001.11" :Column2 .<= 153817496 :Column3 .>= 153817496
#+end_src
NC_000001.11  153817371  153817542
Génération du bed
#+begin_src sh :dir ~/code/bisonex/test-simuscop
echo -e "NC_000001.11\t153817371\t153817542" > gatad2b-exon6.bed
#+end_src
#+RESULTS:
Génération d'un variant
#+begin_src sh :dir ~/code/bisonex/test-simuscop
echo -e "s\tsingle\tNC_000001.11\t153817496\tA\tT\thet"> variant.txt
#+end_src
#+RESULTS:
Génération du fichier de config
#+begin_src sh :dir ~/code/bisonex/test-simuscop
cat > config_wes.txt << EOL
ref = genomeRef.fna
profile = ./63003856.profile
variation = ./variant.txt
target = ./gatad2b-exon6.bed
layout = PE
threads = 1
name = single
output = test-gatad2b
coverage = 20
EOL
#+end_src
#+RESULTS:
On démarre la simulation
#+begin_src sh :dir ~/code/bisonex/test-simuscop
simuReads config_wes.txt
#+end_src
#+RESULTS:
Alignement
#+begin_src sh :dir ~/code/bisonex/test-simuscop
bwa mem -R '@RG\tID:sample\tSM:sample\tPL:ILLUMINA\tPM:Miseq\tCN:lol\tLB:definition_to_add' bw
a/genomeRef test-gatad2b/single_1.fq  test-gatad2b/single_2.fq | samtools sort  -o single.bam
#+end_src
#+RESULTS:
******* DONE Profile github  HiSeq

[25.8004]

[26.51083]

to pour demande le type de capture
/Entered on/ [2023-08-07 Mon 20:40]
** TODO Données CHM13 :chm:
https://github.com/lh3/CHM-eval
*** TODO Run ERR1341793
SCHEDULED: <2023-09-17 Sun>
(raw reads ERR1341793_1.fastq.gz and ERR1341793_2.fastq.gz downloaded from https://www.ebi.ac.uk/ena/browser/view/ERR1341793)
*** TODO Run ERR1341796
SCHEDULED: <2023-09-17 Sun>
** TODO Insilico :cento:
*** TODO tous les variants centogène
**** DONE Extraire liste des SNVs
CLOSED: [2023-04-22 Sat 17:32] SCHEDULED: <2023-04-17 Mon>
***** DONE Corriger manquant à la main
CLOSED: [2023-04-22 Sat 17:31]
La sortie est sauvegardé dans git-annex : variants_success.csv
***** DONE Automatique
CLOSED: [2023-04-22 Sat 17:31]
**** DONE Convert SNVs : transcript -> génomique
CLOSED: [2023-06-03 Sat 17:16]
***** DONE Variant_recoder
CLOSED: [2023-04-26 Wed 21:21] SCHEDULED: <2023-04-22 Sat>
****** KILL Haskell: 160 manquant : recoded-success.csv
CLOSED: [2023-04-25 Tue 18:32]
La liste des variants a été générée en Haskel   l et nettoyée à la main.
On générer une liste de variant pour variant_rec            oder et on soumet tout d'un coup.
[[file:~/recherche/bisonex/parsevariants/app/Main.hs][parsevariant]]
#+begin_src haskell
recodeVariant = do
  prepareVariantRecod   er "variant_success.csv" "renamed.csv"
  runVariantRecoder "renamed.csv" "recoded.json"
#+end_src
#+RESULTS:
: <interactive>:4:3-19: error:
:     Variable not in scope: runVariantRecoder :: String -> String -> t
: gh
Problème : 160 n'ont pas pu être lu sur 820, probablement à cause du numéro mineur de transcrit
La sortie est sauvegardé dans git-annex : variants-recoded-raw.json.
****** KILL Julia
CLOSED: [2023-04-25 Tue 18:32]
On regénère la liste de variant et on passe à Julia pour préparer l'appel en parallèle à variant recoder
[[file:~/recherche/bisonex/parsevariants/variantRecoder.jl][variantRecoder.jl]]
#+begin_src julia
setupVariantRecoder(unique(init), n)
#+end_src
Puis
#+begin_src sh
parallel -a parallel-recoder.sh --jobs 10
#+end_src
On récupère les résultats
#+begin_src julia
(fails, success) = mergeVariantRecoder(n)
CSV.write(fSuccess, success)
CSV.write(fFailures, fails)
#+end_src
Certains variants ne sont pas trouvé, donc on prépare un nouveau job en enlevant les versionrs mineures des transcrits
#+begin_src julia
# Cleanup json and txt
if isfile(fSuccess) && isfile(fFailures)
    foreach(rm, variantRecoderInput())
    foreach(rm, variantRecoderOutput())
end
redoFails(fFailures)
#+end_src
Puis
#+begin_src sh
parallel -a parallel-recoder.sh --jobs 3
#+end_src
Il manque encore 70 transcrits
***** DONE Julia avec mobidetails: recode-failures-mobidetails.csv
CLOSED: [2023-04-25 Tue 18:58]
Nouvelle stratégie : on essaie une fois variant recoder.
Pour tous les échecs, on utilise mobidetails (~170).
Si l'ID n'est pas trouvé, on incrémente le numéro de version 2 fois
***** DONE Reste une dizaine à corriger à la main
CLOSED: [2023-04-26 Wed 21:21]
- [X] certains transcrits ont juste été supprimé
- [X] Erreur de parsing, manque souvent un -
#+begin_src julia
lastTryMobidetails("recoded-failures-mobidetails.csv")
#+end_src
***** DONE Fusionner données
CLOSED: [2023-04-26 Wed 22:35]
#+begin_src julia
function mergeAllGenomic()
    dNew = mergeAll("recoded-success.csv",
                    "recoded-failures-mobidetails.csv",
                    "recoded-failures-mobidetails-redo.csv")
    dInit = @chain DataFrame(CSV.File("variant_success.csv")) begin
        @transform :transcript = :transcript .* ":" .* :coding .* :codingPos .* :codingChange
        @select :file :transcript :classification :zygosity
        @rename :classificationCento = :classification
    end
    dTmp = outerjoin(dInit, dNew, on = :transcript)
    CSV.write("variant_genomic.csv", dTmp)
end
fSuccess = "recoded-success.csv"
fFailures = "recoded-failures.csv"
# variantRecoder(fSuccess, fFailures)
# mobidetailsOnFailures(fFailures)
# lastTryMobidetails("recoded-failures-mobidetails.csv")
mergeAllGenomic()
#+end_src
***** DONE Formatter donner pour simuscop
CLOSED: [2023-04-28 Fri 11:55] SCHEDULED: <2023-04-26 Wed>
**** KILL Extraire liste des CNVs
CLOSED: [2023-08-12 Sat 15:54]
SCHEDULED: <2023-04-17 Mon>
**** KILL Simuscop :simuscop:
CLOSED: [2023-08-12 Sat 15:54]
***** DONE Entrainer le modèle sur 63003856/
CLOSED: [2023-04-29 Sat 19:56]
Relancer le modèle pour être sûr
***** DONE Générer fastq avec simuscop (del et ins seulement) 20x
CLOSED: [2023-04-28 Fri 23:35] SCHEDULED: <2023-04-22 Sat>
****** DONE Génerer un profile avec bed de centogène
CLOSED: [2023-04-28 Fri 11:54] SCHEDULED: <2023-04-22 Sat>
NA12878 mais à refaire avec un vrai séquencage
Voir [[*Centogène][Bed Centogène]] pour choix
****** DONE Générer les données en 20x
CLOSED: [2023-04-28 Fri 11:54] SCHEDULED: <2023-04-22 Sat>
capture de cento
****** DONE Regénérer en supprimant les doublons
CLOSED: [2023-04-28 Fri 17:28]
***** DONE Quelle couverture ?
CLOSED: [2023-04-29 Sat 18:26]
ex sur chr11:16,014,966 où on a 11 reads dans la simulation contre 200 !
****** 200 est la plus proche
#+attr_html: :width 500px
[[./simuscop-200-chr1-1.png]]
#+attr_html: :width 500px
[[./simuscop-200-chr1-2.png]]
****** DONE 20x
CLOSED: [2023-04-29 Sat 15:38]
****** DONE 50x
CLOSED: [2023-04-29 Sat 15:38]
****** DONE 100x
CLOSED: [2023-04-29 Sat 15:39]
****** DONE 200x
CLOSED: [2023-04-29 Sat 15:39]
***** DONE Reads mal centrés sur des petits exons seuls
CLOSED: [2023-04-29 Sat 19:56] SCHEDULED: <2023-04-29 Sat>
Capture ok : [[https://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr1%3A153817168%2D153817824&hgsid=296556270_F4fkENLPXHXidi2oALXls2jxNH9l][UCSC]] (track noire)
Mais mauvaise répartitiopn
#+attr_html: :width 800px
[[./simuscop-error.png]]
À tester
- Problème de profile ?
  - mauvais patient ?
  - mauvaise génération ? -> comparer avec ceux donnés sur github
- nom des chromosomes ?
****** DONE [#A] Tester sur exon 6 GATAD2B pour NC_000001.11:g.153817496A>T
CLOSED: [2023-04-29 Sat 19:56] SCHEDULED: <2023-04-29 Sat>
******* DONE Configuration + Profile 63003856.profile: idem, mal centré
CLOSED: [2023-04-29 Sat 19:18]
Téléchargement des données
#+begin_src sh :dir ~/code/bisonex/test-simuscop
scp meso:/Work/Projects/bisonex/data/genome/GRCh38.p14/genomeRef.fna .
scp meso:Work/Projects/bisonex/data/simuscop/*.profile .
scp -r meso:/Work/Projects/bisonex/data/genome/GRCh38.p13/bwa .
#+end_src
On récupère l'exon (NB: org-mode ne lance pas le code...)
#+begin_src julia
using CSV,DataFramesMeta
d = CSV.read("VCGS_Exome_Covered_Targets_hg38_40.1MB_renamed.bed", header=false, delim="\t", DataFrame)
@subset d :Column1 .== "NC_000001.11" :Column2 .<= 153817496 :Column3 .>= 153817496
#+end_src
NC_000001.11  153817371  153817542
Génération du bed
#+begin_src sh :dir ~/code/bisonex/test-simuscop
echo -e "NC_000001.11\t153817371\t153817542" > gatad2b-exon6.bed
#+end_src
#+RESULTS:
Génération d'un variant
#+begin_src sh :dir ~/code/bisonex/test-simuscop
echo -e "s\tsingle\tNC_000001.11\t153817496\tA\tT\thet"> variant.txt
#+end_src
#+RESULTS:
Génération du fichier de config
#+begin_src sh :dir ~/code/bisonex/test-simuscop
cat > config_wes.txt << EOL
ref = genomeRef.fna
profile = ./63003856.profile
variation = ./variant.txt
target = ./gatad2b-exon6.bed
layout = PE
threads = 1
name = single
output = test-gatad2b
coverage = 20
EOL
#+end_src
#+RESULTS:
On démarre la simulation
#+begin_src sh :dir ~/code/bisonex/test-simuscop
simuReads config_wes.txt
#+end_src
#+RESULTS:
Alignement
#+begin_src sh :dir ~/code/bisonex/test-simuscop
bwa mem -R '@RG\tID:sample\tSM:sample\tPL:ILLUMINA\tPM:Miseq\tCN:lol\tLB:definition_to_add' bwa/genomeRef test-gatad2b/single_1.fq  test-gatad2b/single_2.fq | samtools sort  -o single.bam
#+end_src
#+RESULTS:
******* DONE Profile github  HiSeq

Replacement in projects/bisonex.org at line 70 [4.35]

B:BD[20.29422] → [20.29422:30023]

B:BD[20.30023] → [15.8390:8795]

wser/
Idem pour l'autre
   chr17:g.7852503T>C
https://mobidetails.iurc.montp.inserm.fr/MD/api/variant/182993/browser/
Note:
VEP chooses one block of annotation per variant, using an ordered set of criteria. This order may be customised using --pick_order.
    MANE Select transcript status
    MANE Plus Clinical transcript status
    canonical status of transcript
    APPRIS isoform annotation
    transcript support level
    biotype of transcript ("protein_coding" preferred)
    CCDS status of transcript
    consequence rank according to this table
    translated, transcri
pt or feature length (longer preferred)
"Wherever possible we would discourage you from summarising data in this way. "
**** DONE Mail alexis
CLOSED: [2023-08-20 Sun 13:45] SCHEDULED: <2023-08-20 Sun>
**** TODO Données simuscop 200x
SCHEDULED: <2023-09-09 Sat>
* Résultats
** TODO Speed-up BWA-mem
SCHEDULED: <2023-09-13 Wed>
** TODO Speed-up Hapotypecaller
SCHEDULED: <2023-09-13 Wed>

[20.29422]

wser/
Idem pour l'autre
   chr17:g.7852503T>C
https://mobidetails.iurc.montp.inserm.fr/MD/api/variant/182993/browser/
Note:
VEP chooses one block of annotation per variant, using an ordered set of criteria. This order may be customised using --pick_order.
    MANE Select transcript status
    MANE Plus Clinical transcript status
    canonical status of transcript
    APPRIS isoform annotation
    transcript support level
    biotype of transcript ("protein_coding" preferred)
    CCDS status of transcript
    consequence rank according to this table
    translated, transcript or feature length (longer preferred)
"Wherever possible we would discourage you from summarising data in this way. "
**** DONE Mail alexis
CLOSED: [2023-08-20 Sun 13:45] SCHEDULED: <2023-08-20 Sun>
**** TODO Données simuscop 200x
SCHEDULED: <2023-09-17 Sun>
* Résultats
** TODO Speed-up BWA-mem
SCHEDULED: <2023-09-13 Wed>
** TODO Speed-up Hapotypecaller
SCHEDULED: <2023-09-13 Wed>

Insertion in archive/bisonex.org_archive at line 190 [28.46128]

[27.55086]


* Amélioration :amelioration:
:PROPERTIES:
:ARCHIVE_TIME: 2023-09-10 Sun 16:45
:ARCHIVE_FILE: ~/roam/personal/projects/bisonex.org
:ARCHIVE_CATEGORY: bisonex
:END: