apraga/org - Change WY5OHMYD4ZRUVYNHUQ63E34H6KFPR5RSYUN4EJR4OTREACIHROQAC

Notes bisonex

Created by Alexis Praga on December 23, 2022

WY5OHMYD4ZRUVYNHUQ63E34H6KFPR5RSYUN4EJR4OTREACIHROQAC

Dependencies

In channels

main

Change contents

Replacement in projects/bisonex.org at line 1 [4.35]

B:BD[4.35] → [4.36:1043]

B:BD[4.1043] → [5.66:407]

∅:D[5.407] → [4.1043:1180]

B:BD[4.1043] → [4.1043:1180]

B:BD[4.1180] → [3.14:290]

∅:D[3.290] → [4.1180:1770]

B:BD[4.1180] → [4.1180:1770]

B:BD[4.1770] → [2.14:52]

∅:D[2.52] → [4.1808:1933]

B:BD[4.1808] → [4.1808:1933]

B:BD[4.1933] → [2.53:157]

∅:D[2.157] → [4.2001:9722]

B:BD[4.2001] → [4.2001:9722]

B:BD[4.9722] → [2.158:213]

∅:D[2.213] → [4.9777:30392]

B:BD[4.9777] → [4.9777:30392]

B:BD[4.30392] → [5.408:433]

∅:D[5.433] → [4.30417:30867]

B:BD[4.30417] → [4.30417:30867]

B:BD[4.30867] → [5.434:574]

∅:D[5.574] → [4.30940:33095]

B:BD[4.30940] → [4.30940:33095]

B:BD[4.33095] → [5.575:686]

∅:D[5.686] → [4.33265:33978]

B:BD[4.33265] → [4.33265:33978]

B:BD[4.33978] → [5.687:816]

∅:D[5.816] → [4.34052:34332]

B:BD[4.34052] → [4.34052:34332]

B:BD[4.34332] → [5.817:10053]

B:BD[5.10053] → [2.214:684]

∅:D[2.684] → [4.34332:34363]

∅:D[5.10053] → [4.34332:34363]

B:BD[4.34332] → [4.34332:34363]

B:BD[4.34363] → [3.291:351]

∅:D[3.351] → [4.34437:34438]

B:BD[4.34437] → [4.34437:34438]

B:BD[4.34438] → [3.352:685]

∅:D[3.685] → [4.34438:37661]

B:BD[4.34438] → [4.34438:37661]

#+title: Bisonex
* Bibliography
Comparison of WDL, Cromwell and Nextflow:
https://www.nature.com/articles/s41598-021-99288-8
Nextflow looks like a good compromise?
* Changes in the new version
- Latest version of the genome (the "ready-to-use" version is only GRCh38, without the patched releases)
* Notes
** Which genome version?
There are two naming conventions for chromosomes: RefSeq accessions (NC_000001.11, ...) or chr1, chr2...
dbSNP uses RefSeq.
For the FASTA, there are two options:
- refseq : "https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/${genome}_latest/refseq_identifiers/${fna}.gz"
  -> requires indexing the file (slow!)
- chromosome https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/
  -> requires annotating the chromosomes to fix the names (with the gff file)
  We use the chromosome version, so dbSNP has to be annotated (to do)
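The renaming itself can be sketched with standard tools. This is only a toy illustration: the mapping file name =refseq_to_ucsc.txt=, the function name and the records are assumptions made up for the example. On real VCFs, =bcftools annotate --rename-chrs= (used further down in these notes) is the better fit.

#+begin_src sh
# Toy illustration: rewrite RefSeq accessions (NC_...) into chrN names
# in the first column of a tab-separated, VCF-like file.
tmp=$(mktemp -d)

# Two-column mapping: old name, new name (hypothetical file)
printf '%s\t%s\n' NC_000001.11 chr1 NC_000020.11 chr20 > "$tmp/refseq_to_ucsc.txt"

# Minimal VCF-like body: CHROM POS ID REF ALT (made-up records)
printf '%s\t%s\t%s\t%s\t%s\n' \
    NC_000001.11 10001 rs1 A C \
    NC_000020.11 764018 rs2 A G > "$tmp/dbsnp.tsv"

rename_chroms() {
    # $1 = mapping file, $2 = input; header lines and unmapped names are kept as-is
    awk 'BEGIN { FS = OFS = "\t" }
         NR == FNR { map[$1] = $2; next }
         /^#/ { print; next }
         ($1 in map) { $1 = map[$1] }
         { print }' "$1" "$2"
}

rename_chroms "$tmp/refseq_to_ucsc.txt" "$tmp/dbsnp.tsv"
#+end_src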
** Performance
Carine's computer (WSL2): 4 h in total, including 1 h 15 for alignment (parallelized) and 1 h 15 for HaplotypeCaller (sequential)
** NC, NT, NW chromosomes
Mapping:
https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&chromInfoPage=
Meaning:
https://genome.ucsc.edu/FAQ/FAQdownloads.html#downloadAlt
- alt = alternate sequences (usable)
- fix = patch (correction or improvement)
- random = unlocalized sequence: known to be on a given chromosome, but not yet placed on it
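As a quick way to sort sequence names into these categories, here is a minimal sketch based on the UCSC name suffixes. The function name =classify_contig= and the "unplaced" and "primary" labels are mine, not from UCSC tooling, and the names in the loop are only examples.

#+begin_src sh
# Classify a UCSC-style hg38 sequence name by its suffix or prefix.
classify_contig() {
    case "$1" in
        *_alt)    echo "alt" ;;
        *_fix)    echo "fix" ;;
        *_random) echo "random" ;;
        chrUn_*)  echo "unplaced" ;;
        *)        echo "primary" ;;
    esac
}

for name in chr1 chr1_KI270706v1_random chr1_KI270762v1_alt chr1_KQ031383v1_fix chrUn_KI270302v1; do
    printf '%s\t%s\n' "$name" "$(classify_contig "$name")"
done
#+end_src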
** Ready-to-use Nextflow pipelines
Problem: they require singularity or docker (or conda)
Potentially usable with nix...
* Data
** TODO Check data quality on the mesocentre
*** STRT BAM
picard ValidateSamFile
We only look at the exit code (0 = no error)
*** STRT Fastq
fastqc
The zip archives must then be extracted to look for errors inside
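Both checks can be wrapped in small helpers. A minimal sketch, assuming the FastQC =summary.txt= has already been extracted from the zip (for instance with =unzip -p=); =run_check= and =fastqc_failures= are hypothetical names and the data is a stand-in.

#+begin_src sh
# Run a validator and report based on its exit status only (0 = no error).
run_check() {
    label=$1; shift
    if "$@" > /dev/null 2>&1; then
        echo "OK   $label"
    else
        echo "FAIL $label"
        return 1
    fi
}

# List the FastQC modules flagged FAIL in an extracted summary.txt
# (tab-separated: status, module name, input file).
fastqc_failures() {
    awk -F '\t' '$1 == "FAIL" { print $2 }' "$1"
}

# Toy data standing in for real outputs
tmp=$(mktemp -d)
printf '%s\t%s\t%s\n' \
    PASS 'Basic Statistics' s1.fastq.gz \
    FAIL 'Per base sequence quality' s1.fastq.gz > "$tmp/summary.txt"

run_check "dummy validator" true
fastqc_failures "$tmp/summary.txt"
#+end_src

On real data, the actual command (for example the picard ValidateSamFile call) would be passed to =run_check= after the label.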
** TODO List data on the mesocentre
* New workflow
** TODO Databases
*** KILL Nix to download raw data
**** Conclusion
Not viable on the cluster because the data lives outside /nix/store
Symlinks could be used, but that is too complicated
**** KILL Axel instead of curl to handle timeouts?
CLOSED: [2022-08-19 Fri 15:18]
*** DONE Test @pennae's patch for large files
SCHEDULED: <2022-08-19 Fri>
*** STRT Download
- [X] Reference genome
- [X] dbSNP
- [X] OMIM
- [X] VEP 20G
- [X] transcriptome (spip)
- [ ] Refseq
*** DONE Download the data with nextflow
CLOSED: [2022-09-13 Tue 21:37]
*** HOLD Database processing
**** DONE dbSNP common
**** DONE Only the IDs in dbSNP common!
CLOSED: [2022-11-19 Sat 21:42]
172G instead of 253M...
**** HOLD common dbSNP not clinvar patho
***** DONE Partial conclusion
CLOSED: [2022-12-12 Mon 22:25]
- vcfeval: promising, but fails to process all regions
- isec: too many problems with it
- ClinVar classification read directly from dbSNP: the simplest
  It also catches a few errors in Alexis's script
***** KILL Use the dbSNP number from clinvar directly? No
CLOSED: [2022-11-20 Sun 19:51]
Example: chr20
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f 'rs%INFO/RS \n' -i 'INFO/RS != "." & INFO/CLNSIG="Pathogenic"' clinvar_chr20.vcf.gz | sort > ID_clinvar_patho.txt
bcftools query -f '%ID\n' dbSNP_common_chr20.vcf.gz | sort > ID_of_common_snp.txt
comm -23 ID_of_common_snp.txt ID_clinvar_patho.txt > ID_of_common_snp_not_clinvar_patho.txt
wc -l ID_of_common_snp_not_clinvar_patho.txt
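# NB: the 'rs%INFO/RS \n' format leaves a trailing space on each clinvar ID,
# so comm may not match any line (the count below equals all of dbSNP chr20)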
#+end_src
#+RESULTS:
: 518846 ID_of_common_snp_not_clinvar_patho.txt
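Since =comm= compares whole lines, both lists need the same collation and no stray whitespace. A minimal sketch of the set difference on made-up IDs (file names are assumptions for the example):

#+begin_src sh
tmp=$(mktemp -d)
printf 'rs10\nrs2\nrs33\nrs4\n' > "$tmp/common.txt"
printf 'rs33 \nrs2\n' > "$tmp/patho.txt"   # note the trailing space after rs33

# comm needs both inputs sorted under the same collation; LC_ALL=C pins it
LC_ALL=C sort -u "$tmp/common.txt" > "$tmp/common.sorted"
# strip trailing whitespace before sorting, otherwise "rs33 " never equals "rs33"
sed 's/[[:space:]]*$//' "$tmp/patho.txt" | LC_ALL=C sort -u > "$tmp/patho.sorted"

# -2 drops lines only in the second file, -3 drops lines present in both
LC_ALL=C comm -23 "$tmp/common.sorted" "$tmp/patho.sorted" > "$tmp/common_not_patho.txt"
cat "$tmp/common_not_patho.txt"
#+end_src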
Alexis's version
#+begin_src sh :dir ~/code/bisonex/test_isec
snp=dbSNP_common_chr20.vcf.gz
clinvar=clinvar_chr20_notremapped.vcf.gz
python ../script/pythonScript/clinvar_sbSNP.py \
    --clinvar $clinvar \
    --chrm_name_table ../database/RefSeq/refseq_to_number_only_consensual.txt \
    --dbSNP $snp --output prod.txt
wc -l prod.txt
zgrep '^NC' dbSNP_common_chr20.vcf.gz | wc -l
#+end_src
#+RESULTS:
| 518832 | prod.txt |
| 518846 |          |
***** KILL ClinVar classification encoded in dbSNP?
CLOSED: [2022-12-04 Sun 14:38]
On chromosome 20
*Warning*: CLNSIG has several fields (comma-separated)
They are accessed with INFO/CLNSIG[*]
Each item can in turn hold several entries (separated by a |), so a regexp is needed
NB: *do not put the condition* in a variable!!
To get the pathogenic clinvar entries, we want code 5 but not 255 (= other) for the classification!
Likely pathogenic (4) and conflicting (12) are needed as well
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f '%INFO/CLNSIG\n' dbSNP_common_chr20.vcf.gz -i \
'INFO/CLNSIG[*]~"^5|" | INFO/CLNSIG[*]=="5" | INFO/CLNSIG[*]~"|5" | INFO/CLNSIG[*]~"^4|" | INFO/CLNSIG[*]=="4" | INFO/CLNSIG[*]~"|4" | INFO/CLNSIG[*]~"^12|" | INFO/CLNSIG[*]=="12" | INFO/CLNSIG[*]~"|12"' | sort
#+end_src
#+RESULTS:
| . |  . | 12 |    |   |   |   |   |   |   |   |
| . | 12 |  0 |  2 |   |   |   |   |   |   |   |
| 2 |  3 |  2 |  2 | 2 | 5 | . |   |   |   |   |
| . |  2 |  3 |  2 | 2 | 4 |   |   |   |   |   |
| . |  . |  3 | 12 | 3 |   |   |   |   |   |   |
| . |  5 |  2 |  . |   |   |   |   |   |   |   |
| . |  . |  . |  5 | 2 | 2 |   |   |   |   |   |
| . |  9 |  9 |  9 | 5 | 5 | 2 | 3 | 2 | 3 | 2 |
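An alternative to the regexps, sketched here outside bcftools: split the CLNSIG string on both separators and compare whole codes, so that 255 can never be mistaken for 5. The input format (ID, tab, CLNSIG), the function name and the records are assumptions for the example.

#+begin_src sh
has_patho_code() {
    # stdin: "ID<TAB>CLNSIG"; prints the IDs with at least one code equal to 4, 5 or 12
    awk -F '\t' '{
        n = split($2, codes, /[,|]/)
        for (i = 1; i <= n; i++)
            if (codes[i] == "4" || codes[i] == "5" || codes[i] == "12") { print $1; break }
    }'
}

# Made-up records: rsB only carries 255 (other), rsD has no classification
printf '%s\t%s\n' \
    rsA '.,5|2,.' \
    rsB '2|3|255' \
    rsC '.,.,12' \
    rsD '.' \
    rsE '2|3|2|2|4' | has_patho_code
#+end_src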
If we exclude them:
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f '%ID\n' dbSNP_common_chr20.vcf.gz -e \
'INFO/CLNSIG[*]~"^5|" | INFO/CLNSIG[*]=="5" | INFO/CLNSIG[*]~"|5" | INFO/CLNSIG[*]~"4" | INFO/CLNSIG[*]~"12"' | sort | uniq > common-notpatho.txt
#+end_src
#+RESULTS:
 #+begin_src sh :dir ~/code/bisonex/test_isec
snp=dbSNP_common_chr20.vcf.gz
clinvar=clinvar_chr20_notremapped.vcf.gz
python ../script/pythonScript/clinvar_sbSNP.py \
    --clinvar $clinvar \
    --chrm_name_table ../database/RefSeq/refseq_to_number_only_consensual.txt \
    --dbSNP $snp --output tmp.txt
sort tmp.txt | uniq > common-notpatho-alexis.txt
wc -l common-notpatho-alexis.txt
 #+end_src
 #+RESULTS:
 : 518832 common-notpatho-alexis.txt
We get 6 more than Alexis's version, with a few differences.
Those from Alexis's list that are missing:
#+begin_src sh :dir ~/code/bisonex/test_isec
comm -23 common-notpatho-alexis.txt common-notpatho.txt > alexis-only.txt
cat alexis-only.txt
#+end_src
#+RESULTS:
| rs1064039  |
| rs3833341  |
| rs73598374 |
We look them up in clinvar and dbSNP
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'ID=@alexis-only.txt' dbSNP_common_chr20.vcf.gz
bcftools query -f '%POS\n' -i 'ID=@alexis-only.txt' dbSNP_common_chr20.vcf.gz > alexis-only-pos.txt
while read  -r line; do
bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'POS='$line clinvar_chr20.vcf.gz
done < alexis-only-pos.txt
# bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'POS=23637790' clinvar_chr20.vcf.gz
#+end_src
#+RESULTS:
|   764018 | A | ACAGGTCAAT,ACAGGT | .,5     | 2,. |   |
| 23637790 | C | G,T               | .,.,12  |     |   |
| 44651586 | C | A,G,T             | .,.,.,5 |   2 | 2 |
|   764018 | A | ACAGGTCAAT        | Benign  |     |   |
| 23637790 | C | T                 | Benign  |     |   |
| 44651586 | C | T                 | Benign  |     |   |
So there is a discrepancy between clinvar and dbSNP.
It looks like dbSNP got the intersection with clinvar wrong.
For example https://www.ncbi.nlm.nih.gov/snp/rs3833341#clinical_significance
gives the impression that there is one benign and one pathogenic clinvar entry.
Searching by NM shows that it is benign in clinvar, because there are other submissions! https://www.ncbi.nlm.nih.gov/clinvar/variation/262235/
Confirmation on our databases:
$ bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'POS=764018' dbSNP_common_chr20.vcf.gz
764018 A ACAGGTCAAT,ACAGGT .,5|2,.
$ bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'POS=764018' clinvar_chr20.vcf.gz
764018 A ACAGGTCAAT Benign
***** KILL Fix Alexis's script
CLOSED: [2022-12-04 Sun 13:03]
It handles pathogenic, likely pathogenic and conflicting clinvar entries!
***** HOLD Rtg tools
****** Test
1. Generate the SDF file
   #+begin_src sh
rtg format genomeRef.fna  -o genomeRef.sdf
   #+end_src
2. For the databases, the --sample ALT option is required, otherwise we get
 #+begin_src
$ rtg vcfeval -b dbSNP_common.vcf.gz -c clinvar.vcf.gz -o test -t genomeRef.sdf/^C
VCF header does not contain a FORMAT field named GQ
Error: Record did not contain enough samples: NC_000001.11	10001	rs1570391677	A,C	.	PASS	RS=1570391677;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;R5;GNO;FREQ=KOREAN:0.9891,0.0109,.|SGDP_PRJ:0,1,.|dbGaP_PopFreq:1,.,0;COMMON
 #+end_src
 Trial intersection of clinvar (pathogenic or not) with dbSNP
   - false negative = dbSNP common variants that are not in clinvar
   - false positive = clinvar variants that are not in dbSNP common
   - true positive = clinvar variants that are in dbSNP common
   - true positive baseline = dbSNP common variants that are in clinvar
 We count the number of lines
 #+begin_src sh
zgrep '^[^#]' /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz | wc -l
for i in *.vcf.gz; do echo $i; zgrep '^[^#]' $i | wc -l; done
 #+end_src
 | clinvar            |  1493470 |
 | fn.vcf.gz          | 22330220 |
 | fp.vcf.gz          |  1222529 |
 | tp-baseline.vcf.gz |   131040 |
 | tp.vcf.gz          |   136638 |
Note that not all of clinvar is recovered:
1222529 + 131040 = 1353569 < 1493470
(using tp.vcf.gz, the call-side count, gives 1222529 + 136638 = 1359167, still below 1493470)
Some regions are not processed:
#+begin_quote
Evaluation too complex (50002 unresolved paths, 34891 iterations) at reference region NC_000001.11:790930-790970. Variants in this region will not be included in results
#+end_quote
#+begin_src sh
grep 'not be included' vcfeval.log | wc -l
56192
#+end_src
Even so, the total is still lower.
We want the dbSNP variants that are not pathogenic in clinvar, i.e. the false negatives (dbSNP common not contained in clinvar patho)
#+begin_src sh
bcftools filter -i 'INFO/CLNSIG="Pathogenic"' /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz -o /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar-patho.vcf.gz
tabix /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar-patho.vcf.gz
#+end_src
We run the script (dbSNP common and clinvar = 9 h)
#+begin_src sh
#!/bin/bash
#SBATCH --nodes=1
#SBATCH -p smp
#SBATCH --time=12:00:00
#SBATCH --mem=12G
dir=/Work/Groups/bisonex/data
dbSNP=$dir/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz
clinvar=$dir/clinvar/GRCh38/clinvar-patho.vcf.gz
genome=$dir/genome/GRCh38.p13/genomeRef.sdf
srun rtg vcfeval -b $dbSNP -c $clinvar -o common-not-patho -t $genome --sample ALT
#+end_src
****** HOLD Look into the complex regions left unprocessed
***** DONE bcftools isec: no
CLOSED: [2022-11-27 Sun 00:38]
#+begin_src sh
bcftools isec dbSNP_common.vcf.gz clinvar.vcf.gz -p common
#+end_src
We check that the 2 common files have the same number of lines
#+begin_src sh
$ grep -e '^NC'  0002.vcf | wc -l
74302
alex@gentoo ~/code/bisonex/data/common $ grep -e '^NC'  0003.vcf | wc -l
74302
#+end_src
****** DONE Impact of the -n option
CLOSED: [2022-10-23 Sun 13:56]
But when specifying -n =2:
#+begin_src sh
$ bedtools intersect -a  dbSNP_common.vcf.gz -b clinvar.vcf.gz
74978
#+end_src
Looking only at the variants, we do find 74302
#+begin_src sh
rg "^NC" none_sorted.vcf  | wc -l
#+end_src
NB: test done with
#+begin_src
bcftools isec dbSNP_common.vcf.gz clinvar.vcf.gz -c none -n =2 -w 1 | sort > none_sorted.vcf
sort common/0003.vcf > common/0003_sorted.vcf
comm -13 common/0003_sorted.vcf none_sorted.vcf
#+end_src
****** DONE Handling duplicates: -c none
CLOSED: [2022-10-23 Sun 13:56]
If we keep only those with identical REF and ALT
#+begin_src sh
bcftools isec dbSNP_common.vcf.gz clinvar.vcf.gz -c none -n =2 -w 1 | wc -l
74978
#+end_src
If we keep everything
#+begin_src sh
bcftools isec dbSNP_common.vcf.gz clinvar.vcf.gz -c all -n =2 -w 1 | wc -l
137777
#+end_src
To look at the difference:
#+begin_src sh
bcftools isec dbSNP_common.vcf.gz clinvar.vcf.gz -c none -n =2 -w 1 | sort > none_sorted.vcf
bcftools isec dbSNP_common.vcf.gz clinvar.vcf.gz -c all -n =2 -w 1 | sort > all_sorted.vcf
comm -13 none_sorted.vcf all_sorted.vcf | head
#+end_src
On one example, the variants are indeed different
****** DONE Removing the pathogenic clinvar entries
CLOSED: [2022-10-23 Sun 18:55]
Seems to do the job, given that dbSNP_common has 23194960 lines (so ~80 000 fewer)
 #+begin_src sh
$ bcftools isec -e 'INFO/CLNSIG="Pathogenic" & INFO/CLNSIG="Pathogenic/Likely_pathogenic"' -c none -n~10  dbSNP_common.vcf.gz clinvar.vcf.gz | wc -l
Note: -w option not given, printing list of sites...
23119984
 #+end_src
 However, the -w or -p option produces "data" files...
After another try, the problem is gone
#+begin_src
$ bcftools isec -e 'INFO/CLNSIG="Pathogenic" & INFO/CLNSIG="Pathogenic/Likely_pathogenic"' -c none -n=1 dbSNP_common.vcf.gz clinvar.vcf.gz -w 1 -o lol.vcf.gz
$ zcat lol.vcf.gz | wc -l
23120660
#+end_src
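Note that the -n option changes between "=1" and "~10"...
"=1" selects sites present in exactly one file (either one), whereas "~10" selects sites present in the first file and absent from the second. With -w 1 only records from the first file are written, so both give the same count here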
#+begin_src
$ bcftools isec -e 'INFO/CLNSIG="Pathogenic" & INFO/CLNSIG="Pathogenic/Likely_pathogenic"' -c none -n~10 dbSNP_common.vcf.gz clinvar.vcf.gz -w 1 -o lol.vcf.gz
$ zcat lol.vcf.gz | wc -l
23120660
#+end_src
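The difference between the two -n forms can be mimicked on plain sorted lists with =comm=. This is only an analogy on made-up site names, not what bcftools does internally:

#+begin_src sh
tmp=$(mktemp -d)
printf 'site1\nsite2\nsite3\n' > "$tmp/first.txt"    # stands for dbSNP_common
printf 'site2\nsite3\nsite4\n' > "$tmp/second.txt"   # stands for clinvar

# like "-n =1": sites present in exactly one file (either one)
exactly_one=$(LC_ALL=C comm -3 "$tmp/first.txt" "$tmp/second.txt" | tr -d '\t' | tr '\n' ' ')

# like "-n ~10": sites present in the first file and absent from the second
first_only=$(LC_ALL=C comm -23 "$tmp/first.txt" "$tmp/second.txt" | tr '\n' ' ')

echo "exactly one file : $exactly_one"
echo "first file only  : $first_only"
#+end_src

Restricting the "exactly one" set to records of the first file, as -w 1 does, leaves only site1, which is the "first file only" set.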
****** DONE Validate with Alexis: bcftools isec
CLOSED: [2022-11-07 Mon 21:42]
****** DONE Why the line count differs from Alexis's version -> isec does not handle multiple ALT
CLOSED: [2022-11-26 Sat 23:36]
Big difference!
#+begin_src
$ wc -l ID_of_common_snp_not_clinvar_patho.txt
23119915 ID_of_common_snp_not_clinvar_patho.txt
$ wc -l /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt
85820 /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt
#+end_src
Note that all of dbSNP = 23194960
******* Clinvar class 4? Fewer, but still too many
#+begin_src
$ zgrep '^NC' tmp.vcf.gz  | wc -l
21081654
#+end_src
******* Compare the IDs and look at the extra ones
#+begin_src sh
bcftools isec -e 'INFO/CLNSIG="Pathogenic"' -c none -n~10 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz -w 1 -o tmp.vcf.gz
zgrep -o -E 'rs[[:digit:]]+' tmp.vcf.gz | sort > id_sorted.txt
sort ../database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt  > reference_sorted.txt
comm -23 id_sorted.txt reference_sorted.txt > unique1.txt
#+end_src
For example
#+begin_src sh
zgrep rs1000000561 ../database/dbSNP/dbSNP_common.vcf.gz
#+end_src
NC_000002.12	136732859	rs1000000561	ACG	A,ACGCG	.	PASS	RS=1000000561;dbSNPBuildID=151;SSR=0;VC=INDEL;GNO;FREQ=ALSPAC:0.2506,0.7494,.|TOMMO:0.9971,0.002865,.|TWINSUK:0.2473,0.7527,.|dbGaP_PopFreq:0.993,0.006943,8.902e-05;COMMON
Beware: clinvar uses chromosome numbers and dbSNP uses NC accessions...
Normally this is handled when computing the intersection!
This SNP is not in clinvar (checked in UCSC)
******* Test on chromosome 20
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools view --regions NC_000020.11 ../database/dbSNP/dbSNP_common.vcf.gz -o dbSNP_common_chr20.vcf.gz
bcftools view --regions 20 ../database/clinvar/clinvar.vcf.gz -o clinvar_chr20.vcf.gz
tabix dbSNP_common_chr20.vcf.gz
tabix clinvar_chr20.vcf.gz
#+end_src
#+RESULTS:
Make sure to rename the chromosomes in clinvar!
#+begin_src sh :dir ~/code/bisonex/test_isec
mv clinvar_chr20.vcf.gz clinvar_chr20_notremapped.vcf.gz
bcftools annotate --rename-chrs chromosome_mapping.txt clinvar_chr20_notremapped.vcf.gz -o clinvar_chr20.vcf.gz
#+end_src
#+RESULTS:
*WARNING*: without indexing the VCFs, the output files will be *EMPTY*
*WARNING*: by default the filters apply to both files. This is a problem when relying on inclusion rather than exclusion
Warning: check the chromosome naming convention
******** Pathogenic test: does not take multi-allelic sites into account????
We test the intersection of dbSNP and pathogenic clinvar, as well as the complement
#+begin_src sh :dir ~/code/bisonex/test_isec
clinvar=clinvar_chr20_patho.vcf.gz
snp=dbSNP_common_chr20.vcf.gz
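# NB: $clinvar is indexed here, before the filter below (re)creates it, so this index can be missing or stale
bcftools index $clinvar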
bcftools index $snp
bcftools filter -i 'INFO/CLNSIG="Pathogenic"' clinvar_chr20.vcf.gz -o $clinvar
bcftools isec  $snp $clinvar -p tmp
for i in tmp/*.vcf ; do echo $i; grep '^[^#]'  $i | wc -l; done
#+end_src
#+RESULTS:
| tmp/0000.vcf |
|       518846 |
| tmp/0001.vcf |
|            0 |
| tmp/0002.vcf |
|            0 |
| tmp/0003.vcf |
|            0 |
No pathogenic clinvar entry... Clearly wrong!
Another approach: include all SNPs and the pathogenic clinvar entries, then look at those found only in dbSNP
#+begin_src sh :dir ~/code/bisonex/test_isec
snp=dbSNP_common_chr20.vcf.gz
clinvar=clinvar_chr20.vcf.gz
bcftools isec -n=2 -i - -i 'INFO/CLNSIG="Pathogenic"' $snp $clinvar -p tmp
 # grep '^[^#]' tmp/0000.vcf | wc -l
#+end_src
#+RESULTS:
That gives all of dbSNP, i.e. nothing is removed
Note: the pathogenic clinvar entries cannot be excluded directly
#+begin_src sh :dir ~/code/bisonex/test_isec
snp=dbSNP_common_chr20.vcf.gz
clinvar=clinvar_chr20.vcf.gz
bcftools isec -i - -e 'INFO/CLNSIG="Pathogenic"' $snp $clinvar -p tmp
for i in tmp/*.vcf ; do echo $i; grep '^[^#]'  $i | wc -l; done
#+end_src
Because we can then no longer tell the difference!
Using Alexis's version
#+begin_src sh :dir ~/code/bisonex/test_isec
snp=dbSNP_common_chr20.vcf.gz
clinvar=clinvar_chr20_notremapped.vcf.gz
python ../script/pythonScript/clinvar_sbSNP.py \
    --clinvar $clinvar \
    --chrm_name_table ../database/RefSeq/refseq_to_number_only_consensual.txt \
    --dbSNP $snp --output tmp.txt
sort tmp.txt > common-notpatho-alexis.txt
wc -l common-notpatho-alexis.txt
#+end_src
#+RESULTS:
: 518832 common-notpatho-alexis.txt
Looking for the pathogenic clinvar entries (i.e. those absent from the output)
#+begin_src sh :dir ~/code/bisonex/test_isec
  bcftools query -f '%ID\n' dbSNP_common_chr20.vcf.gz | sort > all.txt
  sort common-notpatho-alexis.txt > alexis.txt
  comm -23 all.txt alexis.txt > patho.txt
#+end_src
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f '%POS\n' -i 'ID=@patho.txt' dbSNP_common_chr20.vcf.gz -o pos.txt
for pos in $(cat pos.txt); do
  bcftools query -f '%CHROM %POS %ID %REF %ALT\n' -i 'POS='$pos dbSNP_common_chr20.vcf.gz
  bcftools query -f '%CHROM %POS %ID %REF %ALT %INFO/CLNSIG\n' -i 'POS='$pos  clinvar_chr20.vcf.gz
  echo "------"
done
#+end_src
#+RESULTS:
| NC_000020.11 |  3234173 |   rs3827075 | T         | A,C,G     |                                              |
| NC_000020.11 |  3234173 |      262001 | T         | G         | Conflicting_interpretations_of_pathogenicity |
| NC_000020.11 |  3234173 |     1072511 | T         | TGGCGAAGC | Pathogenic                                   |
| NC_000020.11 |  3234173 |      208613 | TGGCGAAGC | G         | Pathogenic                                   |
| NC_000020.11 |  3234173 |        1312 | TGGCGAAGC | T         | Pathogenic                                   |
| ------       |          |             |           |           |                                              |
| NC_000020.11 |  4699605 |   rs1799990 | A         | G         |                                              |
| NC_000020.11 |  4699605 |       13397 | A         | G         | Benign/Likely_benign                         |
| ------       |          |             |           |           |                                              |
| NC_000020.11 | 10652589 |   rs1131695 | G         | A,C,T     |                                              |
| NC_000020.11 | 10652589 |      163705 | G         | .         | Benign                                       |
| NC_000020.11 | 10652589 |      143063 | G         | A         | Benign                                       |
| NC_000020.11 | 10652589 |      234555 | G         | C         | Pathogenic                                   |
| ------       |          |             |           |           |                                              |
| NC_000020.11 | 10658574 |   rs1801138 | G         | A,T       |                                              |
| NC_000020.11 | 10658574 |       42481 | G         | A         | Benign                                       |
| NC_000020.11 | 10658574 |      992651 | G         | T         | Likely_pathogenic                            |
| NC_000020.11 | 10658574 |      213550 | GC        | A         | Pathogenic                                   |
| ------       |          |             |           |           |                                              |
| NC_000020.11 | 10672794 |  rs79338570 | G         | A,C       |                                              |
| NC_000020.11 | 10672794 |      255557 | G         | A         | Benign/Likely_benign                         |
| NC_000020.11 | 10672794 |      594067 | G         | C         | Conflicting_interpretations_of_pathogenicity |
| NC_000020.11 | 10672794 |     1324603 | G         | GGA       | Likely_pathogenic                            |
| ------       |          |             |           |           |                                              |
| NC_000020.11 | 18525868 | rs146917730 | C         | T         |                                              |
| NC_000020.11 | 18525868 |      811603 | C         | T         | Conflicting_interpretations_of_pathogenicity |
| ------       |          |             |           |           |                                              |
| NC_000020.11 | 25390747 | rs373200654 | G         | C         |                                              |
| NC_000020.11 | 25390747 |      338000 | G         | C         | Conflicting_interpretations_of_pathogenicity |
| ------       |          |             |           |           |                                              |
| NC_000020.11 | 32800145 |   rs2424926 | C         | G,T       |                                              |
| NC_000020.11 | 32800145 |      338173 | C         | G         | Benign                                       |
| NC_000020.11 | 32800145 |      338174 | C         | T         | Conflicting_interpretations_of_pathogenicity |
| ------       |          |             |           |           |                                              |
| NC_000020.11 | 33412656 |  rs35938843 | C         | G,T       |                                              |
| NC_000020.11 | 33412656 |      220958 | C         | T         | Conflicting_interpretations_of_pathogenicity |
| ------       |          |             |           |           |                                              |
| NC_000020.11 | 45891622 | rs181943893 | G         | A,C,T     |                                              |
| NC_000020.11 | 45891622 |      459632 | G         | C         | Conflicting_interpretations_of_pathogenicity |
| NC_000020.11 | 45891622 |      797035 | G         | T         | Likely_benign                                |
| NC_000020.11 | 45891622 |     1572689 | GCTA      | G         | Likely_benign                                |
| ------       |          |             |           |           |                                              |
| NC_000020.11 | 54171651 |  rs35873579 | G         | A,T       |                                              |
| NC_000020.11 | 54171651 |      285894 | G         | A         | Conflicting_interpretations_of_pathogenicity |
| NC_000020.11 | 54171651 |     1373583 | G         | C         | Uncertain_significance                       |
| NC_000020.11 | 54171651 |      895614 | G         | T         | Benign/Likely_benign                         |
| ------       |          |             |           |           |                                              |
| NC_000020.11 | 62172726 |  rs36106901 | G         | A         |                                              |
| NC_000020.11 | 62172726 |      981031 | G         | A         | Conflicting_interpretations_of_pathogenicity |
| ------       |          |             |           |           |                                              |
| NC_000020.11 | 63349782 |   rs1044396 | G         | A,C       |                                              |
| NC_000020.11 | 63349782 |       93427 | G         | A         | Benign                                       |
| NC_000020.11 | 63349782 |      857384 | G         | C         | Conflicting_interpretations_of_pathogenicity |
| ------       |          |             |           |           |                                              |
| NC_000020.11 | 63414925 |   rs1801545 | G         | A,C,T     |                                              |
| NC_000020.11 | 63414925 |      194284 | G         | A         | Conflicting_interpretations_of_pathogenicity |
| NC_000020.11 | 63414925 |      129337 | G         | C         | Benign                                       |
| NC_000020.11 | 63414925 |      851545 | GG        | CA        | Uncertain_significance                       |
| ------       |          |             |           |           |                                              |
So there are several problems:
1. isec should work at least on
| NC_000020.11 | 25390747 | rs373200654 | G         | C         |                                              |
| NC_000020.11 | 25390747 |      338000 | G         | C         | Conflicting_interpretations_of_pathogenicity |
We test on this line only
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools filter -i 'POS=25390747' clinvar_chr20.vcf.gz -o clinvar_test.vcf.gz
bcftools filter -i 'POS=25390747' dbSNP_common_chr20.vcf.gz -o dbSNP_test.vcf.gz
#+end_src
The line is indeed found in the intersection...
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools filter -i 'POS=25390747' clinvar_chr20.vcf.gz -o clinvar_test.vcf.gz
bcftools index dbSNP_test.vcf.gz
bcftools index clinvar_test.vcf.gz
bcftools isec dbSNP_test.vcf.gz clinvar_test.vcf.gz -p test
#+end_src
#+RESULTS:
2. isec does not seem to work with multiple ALT alleles
| NC_000020.11 | 32800145 | rs2424926 | C | G,T |                                              |
| NC_000020.11 | 32800145 |    338173 | C | G   | Benign                                       |
| NC_000020.11 | 32800145 |    338174 | C | T   | Conflicting_interpretations_of_pathogenicity |
|              |          |           |   |     |                                              |
3. if there are several variants at one position, we must check that none of them is pathogenic.
   Alexis's version handles this correctly
| NC_000020.11 | 3234173 | rs3827075 | T         | A,C,G     |                                              |
| NC_000020.11 | 3234173 |    262001 | T         | G         | Conflicting_interpretations_of_pathogenicity |
| NC_000020.11 | 3234173 |   1072511 | T         | TGGCGAAGC | Pathogenic                                   |
| NC_000020.11 | 3234173 |    208613 | TGGCGAAGC | G         | Pathogenic                                   |
| NC_000020.11 | 3234173 |      1312 | TGGCGAAGC | T         | Pathogenic                                   |
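The rule in point 3 can be sketched as a position-level check: a dbSNP ID is kept only if no clinvar record at the same position is pathogenic, likely pathogenic or conflicting. This is a simplified illustration, not Alexis's script: it matches on position only (not on alleles), the two-column tables are an assumed intermediate format, and rs0000001 at position 1000 is a made-up control.

#+begin_src sh
tmp=$(mktemp -d)
# dbSNP side: POS, ID
printf '%s\t%s\n' \
    3234173 rs3827075 \
    32800145 rs2424926 \
    1000 rs0000001 > "$tmp/dbsnp.tsv"
# clinvar side: POS, CLNSIG, one line per record
printf '%s\t%s\n' \
    3234173 Conflicting_interpretations_of_pathogenicity \
    3234173 Pathogenic \
    32800145 Benign \
    32800145 Conflicting_interpretations_of_pathogenicity \
    1000 Benign > "$tmp/clinvar.tsv"

keep_if_no_patho() {
    # $1 = clinvar table, $2 = dbSNP table
    awk -F '\t' '
        NR == FNR {
            # flag the position as soon as one record is patho-like
            if ($2 ~ /Pathogenic|Likely_pathogenic|Conflicting/) flagged[$1] = 1
            next
        }
        !($1 in flagged) { print $2 }' "$1" "$2"
}

keep_if_no_patho "$tmp/clinvar.tsv" "$tmp/dbsnp.tsv"
#+end_src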
****** DONE Check whether isec handles multiallelic sites (chr20): no, could not get it to work
CLOSED: [2022-11-27 Sun 00:37]
******* DONE chr20, picking a pathogenic clinvar variant that is also in dbSNP
CLOSED: [2022-11-27 Sun 00:37]
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools filter dbSNP_common_chr20.vcf.gz -i 'POS=10652589' -o test_dbsnp.vcf.gz
bcftools filter clinvar_chr20.vcf.gz -i 'POS=10652589' -o test_clinvar.vcf.gz
bcftools index test_dbsnp.vcf.gz
bcftools index test_clinvar.vcf.gz
#+end_src
#+RESULTS:
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools isec test_dbsnp.vcf.gz test_clinvar.vcf.gz -p tmp
grep '^[^#]' tmp/0002.vcf
grep '^[^#]' tmp/0003.vcf
#+end_src
#+RESULTS:
Even when biallelic, it does not work.
Tested by modifying test_dbsnp!
Works with one variant per line
****** DONE isec after splitting multiallelic sites: no
CLOSED: [2022-11-27 Sun 00:37]
******* DONE Simple example OK
CLOSED: [2022-11-27 Sun 00:34]
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools filter -i 'POS=10652589' dbSNP_common_chr20.vcf.gz -o dbsnp_mwi.vcf.gz
bcftools filter -i 'POS=10652589' clinvar_chr20.vcf.gz -o clinvar_mwi.vcf.gz
bcftools index -f dbsnp_mwi.vcf.gz
bcftools index -f clinvar_mwi.vcf.gz
bcftools isec dbsnp_mwi.vcf.gz clinvar_mwi.vcf.gz -n=2
#+end_src
#+RESULTS:
Even when biallelic, it does not work.
Chr 20
With the files from the previous test
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools norm -m -any dbsnp_mwi.vcf.gz -o dbsnp_mwi_norm.vcf.gz
bcftools index dbsnp_mwi_norm.vcf.gz
bcftools isec dbsnp_mwi_norm.vcf.gz clinvar_mwi.vcf.gz -n=2
#+end_src
#+RESULTS:
| NC_000020.11 | 10652589 | G | A | 11 |
| NC_000020.11 | 10652589 | G | C | 11 |
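What "one variant per line" means can be shown with a small awk sketch on a five-column table (CHROM POS ID REF ALT). It only illustrates the splitting idea; unlike bcftools norm it does not touch INFO or FORMAT fields, and the function name is made up.

#+begin_src sh
split_multiallelic() {
    # stdin: tab-separated CHROM POS ID REF ALT; prints one line per ALT allele
    awk 'BEGIN { FS = OFS = "\t" }
         {
             n = split($5, alts, ",")
             for (i = 1; i <= n; i++) print $1, $2, $3, $4, alts[i]
         }'
}

# dbSNP record at the position tested above
printf '%s\t%s\t%s\t%s\t%s\n' NC_000020.11 10652589 rs1131695 G A,C,T | split_multiallelic
#+end_src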
******* TODO On dbSNP chr20: no
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools norm -m -any dbSNP_common_chr20.vcf.gz -o dbSNP_common_chr20_norm.vcf.gz
#+end_src
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools isec -i 'INFO/CLNSIG="Pathogenic"' dbSNP_common_chr20_norm.vcf.gz clinvar_chr20.vcf.gz -p tmp
#+end_src
#+RESULTS:
***** DONE bedtools intersect trial
#+begin_src sh
bedtools intersect -a  dbSNP_common.vcf.gz -b clinvar.vcf.gz
#+end_src
$ wc -l intersect.vcf
220206 intersect.vcf
** TODO Dependencies with Nix
*** DONE GATK
CLOSED: [2022-10-21 Fri 21:59]
*** WAIT BioDBHTS
Contribute a pull request
*** DONE BioExtAlign
CLOSED: [2022-10-22 Sat 00:38]
*** WAIT BioBigFile
Check whether the latest kent version can be used
Contribute a pull request
*** HOLD rtg-tools
Convert clinvar to NC accessions
*** DONE Spip
CLOSED: [2022-12-04 Sun 12:49]
No pull request
*** DONE R + packages
CLOSED: [2022-11-19 Sat 21:05]
** DONE Execution
CLOSED: [2022-09-13 Tue 21:37]
*** KILL Bionix test
*** KILL Implement execution with Nix?
See https://academic.oup.com/gigascience/article/9/11/giaa121/5987272?login=false
for an example.
Probably simpler to use Nix for environment management and snakemake for execution
No internet access from the cluster
*** DONE nextflow
CLOSED: [2022-09-13 Tue 21:37]
** DONE Preprocessing with nextflow
CLOSED: [2022-10-09 Sun 22:30]
*** DONE Map to reference
CLOSED: [2022-10-09 Sun 22:30]
*** DONE Mark duplicate
CLOSED: [2022-10-09 Sun 22:30]
*** DONE Recalibrate base quality score
CLOSED: [2022-10-09 Sun 22:30]
** DONE Variant calling with Nextflow
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Haplotype caller
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter variants
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter common snp not clinvar path
CLOSED: [2022-11-07 Mon 23:00]
See [[*common dbSNP not clinvar patho][common dbSNP not clinvar patho]]
*** DONE Filter variant only in consensual sequence
CLOSED: [2022-11-08 Tue 22:23]
*** DONE Filter technical variants
CLOSED: [2022-11-19 Sat 21:34]
** TODO Annotation with nextflow
*** TODO VEP
*** TODO Spip
*** TODO Filter after VEP
We should be able to do without an R script by using bcftools
** STRT Test Alexis's version with Nix
*** DONE Add clinvar
CLOSED: [2022-11-13 Sun 19:37]
*** DONE Alignment
CLOSED: [2022-11-13 Sun 12:52]
*** DONE Haplotype caller
CLOSED: [2022-11-13 Sun 13:00]
*** TODO Filter
- [X] depth
- [ ] common snp not patho
Problem with the list of IDs
**** TODO variant annotation
Requires VEP
*** TODO Variant calling
* TODO Tests
** TODO Regression test against Alexis's version with Nix
*** DONE ID common snp
CLOSED: [2022-11-19 Sat 21:36]
#+begin_src
$ wc -l ID_of_common_snp.txt
23194290 ID_of_common_snp.txt
$ wc -l /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp.txt
23194290 /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp.txt
#+end_src
*** DONE ID common snp not clinvar patho
CLOSED: [2022-12-11 Sun 20:11]
**** DONE Checking the problem
CLOSED: [2022-12-11 Sun 16:30]
On J:
21155134 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt.ref
Regression-test version
21155076 database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt
New version
23193391 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt
After removing duplicates
$ sort database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt | uniq > old.txt
$ wc -l old.txt
21107097 old.txt
$ sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt | uniq > new.txt
$ wc -l new.txt
21174578 new.txt
$ sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt.ref | uniq > ref.txt
$ wc -l ref.txt
21107155 ref.txt
Looking at the difference (IDs present in ref.txt but not in old.txt)
$ comm -23 ref.txt old.txt
rs1052692
rs1057518973
rs1057518973
rs11074121
rs112848754
rs12573787
rs145033890
rs147889095
rs1553904159
rs1560294695
rs1560296615
rs1560310926
rs1560325547
rs1560342418
rs1560356225
rs1578287542
...
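As a reminder of what the comm flags select, here is a minimal sketch on toy data. The rs IDs and file names below are made up for illustration and are not taken from the databases. comm expects both inputs sorted in the same collation order, hence the sort -u step, which is equivalent to sort | uniq.

```shell
# Toy ID lists (hypothetical values), deliberately unsorted and with a duplicate.
tmp=$(mktemp -d)
printf 'rs3\nrs1\nrs2\nrs1\n' > "$tmp/ref_raw.txt"
printf 'rs2\nrs4\nrs3\n'      > "$tmp/old_raw.txt"

# comm needs sorted input; sort -u sorts and drops duplicates in one step.
LC_ALL=C sort -u "$tmp/ref_raw.txt" > "$tmp/ref.txt"
LC_ALL=C sort -u "$tmp/old_raw.txt" > "$tmp/old.txt"

# -23 hides columns 2 and 3: keeps lines found only in the first file.
only_ref=$(LC_ALL=C comm -23 "$tmp/ref.txt" "$tmp/old.txt")
# -13 hides columns 1 and 3: keeps lines found only in the second file.
only_old=$(LC_ALL=C comm -13 "$tmp/ref.txt" "$tmp/old.txt")

echo "only in ref: $only_ref"
echo "only in old: $only_old"
```

Using LC_ALL=C for both sort and comm avoids locale-dependent ordering mismatches between the two tools.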
We look up the first one
bcftools query -i 'ID="rs1052692"' database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 1619351 C A,T
It is indeed classified in clinvar (conflicting interpretations of pathogenicity)...
$ bcftools query -i 'POS=1619351' database/clinvar/clinvar.vcf.gz -f '%CHROM %POS %REF %ALT %INFO/CLNSIG\n'
19 1619351 C T Conflicting_interpretations_of_pathogenicity
We check all the others
$ comm -23 ref.txt old.txt > tocheck.txt
Generate the regions to check (chromosome number:position)
$ bcftools query -i 'ID=@tocheck.txt' database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM\t%POS\n' > tocheck.pos
Generate the reverse mapping (chromosome number -> NC)
$ awk ' { t = $1; $1 = $2; $2 = t; print; } ' database/RefSeq/refseq_to_number_only_consensual.txt  > mapping.txt
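The awk one-liner swaps the two columns of the mapping table. A minimal sketch on a toy table follows. The two rows only illustrate the layout assumed here for refseq_to_number_only_consensual.txt (NC accession first, chromosome number second); the real file is not reproduced.

```shell
tmp=$(mktemp -d)
# Assumed layout: <RefSeq accession> <chromosome number>
printf 'NC_000019.10 19\nNC_000020.11 20\n' > "$tmp/refseq_to_number.txt"

# Swap columns 1 and 2; assigning to a field makes awk rebuild the line
# with the output separator (a single space by default).
awk '{ t = $1; $1 = $2; $2 = t; print }' "$tmp/refseq_to_number.txt" > "$tmp/mapping.txt"

cat "$tmp/mapping.txt"
```

bcftools annotate --rename-chrs reads such a file as whitespace-separated "old_name new_name" pairs, one per line, so the space-separated output is usable as is.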
Remap clinvar
$ bcftools annotate --rename-chrs mapping.txt database/clinvar/clinvar.vcf.gz -o clinvar_remapped.vcf.gz
$ tabix clinvar_remapped.vcf.gz
Finally, look up the classification in clinvar
$ bcftools query -R tocheck.pos clinvar_remapped.vcf.gz -f '%CHROM %POS %INFO/CLNSIG\n'
$ bcftools query -R tocheck.pos database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM %POS %ID \n' | grep '^NC'
#+RESULTS:
**** DONE Understand why the new version gives a different result
CLOSED: [2022-12-11 Sun 20:11]
***** DONE Same dbSNP and clinvar versions?
CLOSED: [2022-12-10 Sat 23:02]
clinvar differs!
  $ bcftools stats clinvar.gz
  clinvar (Alexis)
SN	0	number of samples:	0
SN	0	number of records:	1492828
SN	0	number of no-ALTs:	965
SN	0	number of SNPs:	1338007
SN	0	number of MNPs:	5562
SN	0	number of indels:	144580
SN	0	number of others:	3714
SN	0	number of multiallelic sites:	0
SN	0	number of multiallelic SNP sites:	0
clinvar (new)
SN	0	number of samples:	0
SN	0	number of records:	1493470
SN	0	number of no-ALTs:	965
SN	0	number of SNPs:	1338561
SN	0	number of MNPs:	5565
SN	0	number of indels:	144663
SN	0	number of others:	3716
SN	0	number of multiallelic sites:	0
SN	0	number of multiallelic SNP sites:	0
***** DONE Update clinvar and dbSNP to work from the same databases
CLOSED: [2022-12-11 Sun 12:10]
The problem persists
***** DONE Remove the int conversion of the chromosome
CLOSED: [2022-12-10 Sat 19:29]
***** KILL Same NC?
CLOSED: [2022-12-10 Sat 19:29]
$  zgrep "contig=<ID=NC_\(.*\)" clinvar/GRCh38/clinvar.vcf.gz > contig.clinvar
$ diff contig.txt contig.clinvar
< ##contig=<ID=NC_012920.1>
***** DONE Test on chromosome 19: ok
CLOSED: [2022-12-11 Sun 13:53]
Prepare the data
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz -o dbSNP_common_19.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz -o clinvar_19.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data-alexis/dbSNP/dbSNP_common.vcf.gz -o dbSNP_common_19_old.vcf.gz
 bcftools filter -i 'CHROM="19"' /Work/Groups/bisonex/data-alexis/clinvar/clinvar.vcf.gz -o clinvar_19_old.vcf.gz
#+end_src
Retrieve both versions of the script
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
git checkout regression ../../script/pythonScript/clinvar_sbSNP.py
cp ../../script/pythonScript/clinvar_sbSNP.py clinvar_sbSNP_old.py
git checkout HEAD ../../script/pythonScript/clinvar_sbSNP.py
#+end_src
#+RESULTS:
Compare
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
python ../../script/pythonScript/clinvar_sbSNP.py --clinvar clinvar_19.vcf.gz --dbSNP dbSNP_common_19.vcf.gz --output tmp.txt
sort tmp.txt | uniq > new.txt
table=/Work/Groups/bisonex/data-alexis/RefSeq/refseq_to_number_only_consensual.txt
python clinvar_sbSNP_old.py --clinvar clinvar_19_old.vcf.gz --dbSNP dbSNP_common_19_old.vcf.gz --output tmp_old.txt --chrm_name_table $table
sort tmp_old.txt | uniq > old.txt
wc -l old.txt new.txt
#+end_src
#+RESULTS:
|  535155 | old.txt |
|  535194 | new.txt |
| 1070349 | total   |
Taking the first discrepancy in new: it is conflicting patho, so it should not be there...
$ bcftools query -i 'ID="rs10418277"' dbSNP_common_19.vcf.gz  -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 54939682 C G,T
$ bcftools query -i 'ID="rs10418277"' dbSNP_common_19_old.vcf.gz  -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 54939682 C G,T
$ bcftools query -i 'POS=54939682' clinvar_19.vcf.gz  -f '%POS %REF %ALT %INFO/CLNSIG\n'
54939682 C G Conflicting_interpretations_of_pathogenicity
54939682 C T Benign
$ bcftools query -i 'POS=54939682' clinvar_19_old.vcf.gz  -f '%POS %REF %ALT %INFO/CLNSIG\n'
54939682 C G Conflicting_interpretations_of_pathogenicity
54939682 C T Benign
$ grep rs10418277 *.txt
new.txt:rs10418277
tmp.txt:rs10418277
The problem was that POS was no longer converted to int (line deleted by mistake??)
Check
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
python ../../script/pythonScript/clinvar_sbSNP.py --clinvar clinvar_19.vcf.gz --dbSNP dbSNP_common_19.vcf.gz --output tmp.txt
sort tmp.txt | uniq > new.txt
table=/Work/Groups/bisonex/data-alexis/RefSeq/refseq_to_number_only_consensual.txt
python clinvar_sbSNP_old.py --clinvar clinvar_19_old.vcf.gz --dbSNP dbSNP_common_19_old.vcf.gz --output tmp_old.txt --chrm_name_table $table
sort tmp_old.txt | uniq > old.txt
wc -l old.txt new.txt
diff old.txt new.txt
#+end_src
#+RESULTS:
|  535155 | old.txt |
|  535155 | new.txt |
| 1070310 | total   |
***** DONE Test on chromosomes 19 and 20: ok
CLOSED: [2022-12-11 Sun 15:56]
Prepare the data
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
bcftools filter -i 'CHROM="NC_000019.10" | CHROM="NC_000020.11"' /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz -o dbSNP_common_19_20.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10" | CHROM="NC_000020.11"' /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz -o clinvar_19_20.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10" | CHROM="NC_000020.11"' /Work/Groups/bisonex/data-alexis/dbSNP/dbSNP_common.vcf.gz -o dbSNP_common_19_20_old.vcf.gz
bcftools filter -i 'CHROM="19" | CHROM="20"' /Work/Groups/bisonex/data-alexis/clinvar/clinvar.vcf.gz -o clinvar_19_20_old.vcf.gz
#+end_src
#+RESULTS:
Retrieve both versions of the script
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
git checkout regression ../../script/pythonScript/clinvar_sbSNP.py
cp ../../script/pythonScript/clinvar_sbSNP.py clinvar_sbSNP_old.py
git checkout HEAD ../../script/pythonScript/clinvar_sbSNP.py
#+end_src
#+RESULTS:
Compare
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
python ../../script/pythonScript/clinvar_sbSNP.py --clinvar clinvar_19_20.vcf.gz --dbSNP dbSNP_common_19_20.vcf.gz --output tmp.txt
sort tmp.txt | uniq > new.txt
table=/Work/Groups/bisonex/data-alexis/RefSeq/refseq_to_number_only_consensual.txt
python clinvar_sbSNP_old.py --clinvar clinvar_19_20_old.vcf.gz --dbSNP dbSNP_common_19_20_old.vcf.gz --output tmp_old.txt --chrm_name_table $table
sort tmp_old.txt | uniq > old.txt
wc -l old.txt new.txt
#+end_src
***** DONE Look at the distribution of the differences
CLOSED: [2022-12-11 Sun 16:29]
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt  | uniq > notpatho.new
sort /Work/Groups/bisonex/data-alexis/dbSNP/ID_of_common_snp_not_clinvar_patho.txt  | uniq > notpatho.old
comm -23 notpatho.new notpatho.old > nopatho.diff
#+end_src
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
 bcftools query -i 'ID=@nopatho.diff' /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz -f '%CHROM\n' | sort | uniq -c
 #+end_src
 Most are non-consensus coordinates (not "NC_", see notes)
 #+RESULTS:
  :     2 NC_000002.12
  :    18 NC_000003.12
  :     2 NC_000004.12
  :     2 NC_000005.10
  :    14 NC_000006.12
  :     6 NC_000007.14
  :     2 NC_000009.12
  :     1 NC_000010.11
  :     6 NC_000014.9
  :     1 NC_000015.10
  :     3 NC_000016.10
  :     3 NC_000017.11
  :     1 NC_000019.10
  :     1 NC_000020.11
  :     1 NC_000021.9
  :     2 NC_000022.11
  : 16018 NT_113793.3
  : 17010 NT_113796.3
  :    14 NT_113891.3
  :     1 NT_167244.2
  :    13 NT_167245.2
  :     2 NT_167246.2
  :    13 NT_167247.2
  :     7 NT_167248.2
  :    14 NT_167249.2
  : 14857 NT_187361.1
  :    92 NT_187367.1
  :     1 NT_187369.1
  :    13 NT_187381.1
  :    54 NT_187383.1
  :     6 NT_187499.1
  :    46 NT_187502.1
  : 13754 NT_187513.1
  :   611 NT_187517.1
  :     1 NT_187520.1
  :     1 NT_187524.1
  :   249 NT_187526.1
  :    18 NT_187532.1
  :     1 NT_187546.1
  :   886 NT_187562.1
  :     1 NT_187564.1
  :   346 NT_187576.1
  :    13 NT_187600.1
  :     5 NT_187601.1
  :   494 NT_187606.1
  :     1 NT_187607.1
  :    12 NT_187613.1
  :   307 NT_187614.1
  :     1 NT_187625.1
  :   445 NT_187633.1
  :    43 NT_187648.1
  :    18 NT_187649.1
  :     1 NT_187652.1
  :   512 NT_187661.1
  :    18 NT_187678.1
  :    49 NT_187681.1
  :     1 NT_187682.1
  :    18 NT_187688.1
  :    12 NT_187689.1
  :    18 NT_187690.1
  :    18 NT_187691.1
  :   404 NT_187693.1
  :     2 NW_003315952.3
  :     1 NW_003315970.2
  :   203 NW_003571054.1
  :   322 NW_003571055.2
  :    16 NW_003571056.2
  :    16 NW_003571057.2
  :    16 NW_003571058.2
  :    16 NW_003571059.2
  :    16 NW_003571060.1
  :   213 NW_003571061.2
  :     2 NW_009646201.1
  :   322 NW_009646205.1
  :   321 NW_009646206.1
  :   371 NW_012132914.1
  :     1 NW_012132915.1
  :    13 NW_012132918.1
  :     2 NW_013171801.1
  :     1 NW_013171807.1
  :    49 NW_015148966.1
  :    14 NW_015495298.1
  :     2 NW_015495299.1
  :     1 NW_016107298.1
  :     4 NW_017363813.1
  :     2 NW_017852933.1
  :     1 NW_018654722.1
  :    38 NW_021160001.1
  :     1 NW_021160003.1
  :     1 NW_021160007.1
  :     7 NW_021160017.1
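To back the statement that most differences fall outside the NC_ contigs, the per-contig counts can be summed by accession prefix. A minimal sketch, using only a handful of rows copied from the table above, so the totals it prints are for this excerpt and not for the full table:

```shell
# Excerpt of the "uniq -c" output above: <count> <contig>
counts='18 NC_000003.12
14 NC_000006.12
16018 NT_113793.3
17010 NT_113796.3
203 NW_003571054.1
322 NW_003571055.2'

# Sum the counts per accession prefix (text before the underscore).
summary=$(printf '%s\n' "$counts" |
  awk '{ split($2, a, "_"); sum[a[1]] += $1 }
       END { for (p in sum) print p, sum[p] }' | LC_ALL=C sort)

printf '%s\n' "$summary"
```

Piping the real sort | uniq -c output into the same awk program would give the totals for the whole table.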
***** DONE Look at the difference with the version without non-consensus sites: ok!
CLOSED: [2022-12-11 Sun 20:11]
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt  | uniq > notpatho.new
sort /Work/Groups/bisonex/data-alexis/dbSNP/ID_of_common_snp_not_clinvar_patho.txt  | uniq > notpatho.old
comm -13 notpatho.new notpatho.old > notpatho.diff
wc -l notpatho.diff
#+end_src
#+RESULTS:
: 528 notpatho.diff
528 variants are missing
rs1057520103
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
 bcftools query -i 'ID=@notpatho.diff' /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz -f '%CHROM\n' | sort | uniq -c
 #+end_src
 #+RESULTS:
 : 528 NC_012920.1
 So the new version works better!
 We check their presence in the old version and in the new one:
#+begin_src sh
$ grep -w -f notpatho.diff /Work/Groups/bisonex/data-alexis/dbSNP/ID_of_common_snp_not_clinvar_patho.txt  | wc -l
528
$ grep -w -f notpatho.diff  /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt
#+end_src
**** DONE Remove non-consensus sites
CLOSED: [2022-12-11 Sun 19:51]
**** DONE Add the mitochondria back (discussed with Paul)
CLOSED: [2022-12-13 Tue 17:26]
Ok with our generated version. On J: 21155134
$ wc -l dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt
21155065 dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt
$ wc -l ../data-alexis/dbSNP/ID_of_common_snp_not_clinvar_patho.txt
21155065 ../data-alexis/dbSNP/ID_of_common_snp_not_clinvar_patho.txt
The difference probably comes from an old version of clinvar
*** TODO alignment + variant:
We get extra ones
$ grep '^NC' filter-depth.vcf | wc -l
86580
Alexis
$  zgrep '^NC' 63003856_S135_DP_over_30.vcf  | wc -l
82033
Not caused by the depth filter
We tested
bcftools filter -i 'FORMAT/AD[0:1]<=10' 63003856_S135_DP_over_30.vcf
bcftools filter -i 'FORMAT/DP<=30' 63003856_S135_DP_over_30.vcf
Same for our version. Nothing comes out.
Does HaplotypeCaller have the same arguments??
*** TODO 63003856_S135
** Miscellaneous
*** DONE Check number of reads fastq - bam
CLOSED: [2022-10-09 Sun 22:31]
** TODO Genome in a bottle?
We don't have the DNA... sequence at Centogène?
* Improvements
** TODO Quality score recalibration with a set of files
See GATK best practices
** TODO Use T-to-T as reference
Seems complicated with the new databases
** TODO Excel macro
** TODO Use the clinvar XML
Extraction to VCF possible with
https://github.com/SeqOne/clinvcf
** Annotation
Full list
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9252745/
*** TODO Use a lightweight version of GnomAD (a single column)
*** TODO Digenic inheritance (see OMIM nomenclature)
It is in the disease name
* HOLD Implement other pipelines
See https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04407-x
** KILL GATK
CLOSED: [2022-11-11 Fri 20:01]
https://broadinstitute.github.io/warp/docs/Pipelines/Exome_Germline_Single_Sample_Pipeline/README
In principle, follows best practices
** KILL Try snakemake with best practices
https://github.com/snakemake-workflows/dna-seq-gatk-variant-calling/blob/main/.github/workflows/main.yml
Install Mamba (micromamba does not work under nix)
Does not work under WSL2... MultiQC is not recent enough
Version problems...
** KILL Sarek
CLOSED: [2022-12-11 Sun 11:09]
*** Dependencies
**** Nix
#+begin_src sh
 nix profile install nixpkgs#mosdepth nixpkgs#python3
  nix-shell -p python310Packages.pyyaml --run "nextflow run nf-core/sarek -profile test --executor slurm --queue smp --outdir test -resume"
#+end_src
***** KILL nix derivation for full profile
CLOSED: [2022-12-11 Sun 11:09]
**** KILL Without nix
CLOSED: [2022-09-24 Sat 10:20]
Using conda
#+begin_src sh
module unload nix
module load anaconda3@2021.05/gcc-12.1.0
module load nextflow@22.04.0/gcc-12.1.0
module load openjdk@11.0.14.1_1/gcc-12.1.0
nextflow run nf-core/sarek -profile conda,test --executor slurm --queue smp --outdir test -resume
#+end_src
Attempt 1: permission errors, fixed by rerunning the program
#+begin_quote
  Failed to create Conda environment
  command: conda create --mkdir --yes --quiet --prefix /Work/Users/apraga/test-sarek/work/conda/env-2d53b1db50de676670cf1a91ef0cf6db bioconda::tabix=1.11
  status : 1
  message:
    NotWritableError: The current user does not have write permissions to a required path.
      path: /Home/Users/apraga/.conda/pkgs/urls.txt
      uid: 1696
      gid: 513
    If you feel that permissions on this path are set incorrectly, you can manually
    change them by executing
      $ sudo chown 1696:513 /Home/Users/apraga/.conda/pkgs/urls.txt
#+end_quote
Fixed with
#+begin_src sh
      chown 1696:513 /Home/Users/apraga/.conda/pkgs/urls.txt
#+end_src
But proxy problem
*** KILL Nix derivation for python modules
CLOSED: [2022-12-11 Sun 11:09]
*** KILL Run sarek in test mode
CLOSED: [2022-12-11 Sun 11:09]
#+begin_src sh
  nix-shell -p python310Packages.pyyaml --run "nextflow run nf-core/sarek -profile test --executor slurm --queue smp --outdir test -resume"
#+end_src
*** KILL Run sarek on reduced data
CLOSED: [2022-12-11 Sun 11:09]

[4.35]

#+title: Bisonex
* Biblio
Comparison of WDL, Cromwell, nextflow
https://www.nature.com/articles/s41598-021-99288-8
Nextflow = good compromise?
* Changes in the new version
- Latest genome version (the "ready-to-use" version is only GRCh38 without the patched versions)
* Notes
** Which genome version?
There are 2 notations for chromosomes: RefSeq (NC_0001) or chr1, chr2...
dbSNP uses RefSeq
For the fasta, 2 options
- refseq : "https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/${genome}_latest/refseq_identifiers/${fna}.gz"
  -> requires indexing the file (slow!)
- chromosome https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/
  -> requires annotating the chromosomes to fix the names (with the gff file)
  We use the chromosome version, so we annotate dbSNP (to do)
** Performance
Carine's computer (WSL2): 4h, including 1h15 alignment (parallelized) and 1h15 haplotypecaller (sequential)
** Chromosomes NC, NT, NW
Mapping:
https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&chromInfoPage=
Meaning
https://genome.ucsc.edu/FAQ/FAQdownloads.html#downloadAlt
- alt = alternate sequences (usable)
- fix = patch (correction or improvement)
- random = sequence known to be on a chromosome but not yet used
** Ready-to-use nextflow pipelines
Problem: requires singularity or docker (or conda)
Potentially usable with nix...
* Data
** DONE Check data quality on the mesocentre
*** DONE BAM
picard ValidateSamFile
We only look at the exit code (0 = no error)
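A minimal sketch of such an exit-code check. The validator is passed as an arbitrary command; here true and false stand in for a successful and a failing picard ValidateSamFile run, and the helper name check_status and the sample labels are illustrative only.

```shell
# Run a command quietly and report based on its exit status only.
check_status() {
  label=$1; shift
  if "$@" > /dev/null 2>&1; then
    echo "$label: ok"
  else
    # In the else branch, $? still holds the status of the tested command.
    echo "$label: error (exit code $?)"
  fi
}

ok_report=$(check_status sampleA true)
bad_report=$(check_status sampleB false)
echo "$ok_report"
echo "$bad_report"
```

On real data, the stand-in command would be replaced by the actual validation call for each BAM file.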
*** DONE Fastq
fastqc
Then extract the zip files and look for errors inside
** DONE List data on the mesocentre
<2022-12-23 Fri>
176 patients ok
Missing data:
- 5 with only 1 of the 2 fastq files and no bam
- 19 patients with neither fastq nor bam
* New workflow
** TODO Databases
*** KILL Nix to download the raw data
**** Conclusion
Not viable on the cluster because outside /nix/store
Symlinks could be used but too complicated
**** KILL Axel instead of curl to handle timeouts?
CLOSED: [2022-08-19 Fri 15:18]
*** DONE Test @pennae's patch for large files
SCHEDULED: <2022-08-19 Fri>
*** STRT Download
- [X] Reference genome
- [X] dbSNP
- [X] OMIM
- [X] VEP 20G
- [X] transcriptome (spip)
- [ ] Refseq
*** DONE Download the data with nextflow
CLOSED: [2022-09-13 Tue 21:37]
*** HOLD Database processing
**** DONE dbSNP common
**** DONE Only the IDs in dbSNP common!
CLOSED: [2022-11-19 Sat 21:42]
172G instead of 253M...
**** HOLD common dbSNP not clinvar patho
***** DONE Partial conclusion
CLOSED: [2022-12-12 Mon 22:25]
- vcfeval: promising but cannot process all regions
- isec: too many problems with it
- clinvar classification directly in dbSNP: the simplest
  And it catches a few errors in Alexis's script
***** KILL Use the dbSNP number in clinvar directly? No
CLOSED: [2022-11-20 Sun 19:51]
E.g. chr20
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f 'rs%INFO/RS \n' -i 'INFO/RS != "." & INFO/CLNSIG="Pathogenic"' clinvar_chr20.vcf.gz | sort > ID_clinvar_patho.txt
bcftools query -f '%ID\n' dbSNP_common_chr20.vcf.gz | sort > ID_of_common_snp.txt
comm -23 ID_of_common_snp.txt ID_clinvar_patho.txt > ID_of_common_snp_not_clinvar_patho.txt
wc -l ID_of_common_snp_not_clinvar_patho.txt
# sort ID
#+end_src
#+RESULTS:
: 518846 ID_of_common_snp_not_clinvar_patho.txt
Alexis's version
#+begin_src sh :dir ~/code/bisonex/test_isec
snp=dbSNP_common_chr20.vcf.gz
clinvar=clinvar_chr20_notremapped.vcf.gz
python ../script/pythonScript/clinvar_sbSNP.py \
    --clinvar $clinvar \
    --chrm_name_table ../database/RefSeq/refseq_to_number_only_consensual.txt \
    --dbSNP $snp --output prod.txt
wc -l prod.txt
zgrep '^NC' dbSNP_common_chr20.vcf.gz | wc -l
#+end_src
#+RESULTS:
| 518832 | prod.txt |
| 518846 |          |
***** KILL clinvar classification encoded in dbSNP?
CLOSED: [2022-12-04 Sun 14:38]
On chromosome 20
*Warning* CLNSIG has several fields (comma-separated)
They are accessed with INFO/CLNSIG[*]
Then each item can hold several haploidy values (separated by |). A regexp is therefore needed
NB: *do not put the condition* in a variable!!
To get the patho clinvar entries, we want 5 but not 255 (= other) for the classification!
Likely patho and conflicting are needed as well
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f '%INFO/CLNSIG\n' dbSNP_common_chr20.vcf.gz -i \
'INFO/CLNSIG[*]~"^5|" | INFO/CLNSIG[*]=="5" | INFO/CLNSIG[*]~"|5" | INFO/CLNSIG[*]~"^4|" | INFO/CLNSIG[*]=="4" | INFO/CLNSIG[*]~"|4" | INFO/CLNSIG[*]~"^12|" | INFO/CLNSIG[*]=="12" | INFO/CLNSIG[*]~"|12"' | sort
#+end_src
#+RESULTS:
| . |  . | 12 |    |   |   |   |   |   |   |   |
| . | 12 |  0 |  2 |   |   |   |   |   |   |   |
| 2 |  3 |  2 |  2 | 2 | 5 | . |   |   |   |   |
| . |  2 |  3 |  2 | 2 | 4 |   |   |   |   |   |
| . |  . |  3 | 12 | 3 |   |   |   |   |   |   |
| . |  5 |  2 |  . |   |   |   |   |   |   |   |
| . |  . |  . |  5 | 2 | 2 |   |   |   |   |   |
| . |  9 |  9 |  9 | 5 | 5 | 2 | 3 | 2 | 3 | 2 |
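The nested separators are easy to get wrong, so here is a small sketch of the intended test outside bcftools: a value counts as patho if, after splitting on both , and |, one token is exactly 5, 4 or 12. This is plain awk on toy strings in the style of the outputs in these notes (the helper name has_patho_code is made up), not a description of how bcftools evaluates its expressions.

```shell
# Return success if any token of a CLNSIG-like string is exactly 4, 5 or 12.
has_patho_code() {
  printf '%s\n' "$1" | awk '{
    n = split($0, tok, /[,|]/)          # split on both separators
    for (i = 1; i <= n; i++)
      if (tok[i] == "4" || tok[i] == "5" || tok[i] == "12") found = 1
    exit(found ? 0 : 1)
  }'
}

for v in '.,5|2,.' '.,.,12' '.,2|3' '255'; do
  if has_patho_code "$v"; then echo "$v -> patho"; else echo "$v -> not patho"; fi
done
```

Comparing whole tokens is what keeps 255 from being mistaken for 5, which a plain substring match would not guarantee.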
If we exclude them:
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f '%ID\n' dbSNP_common_chr20.vcf.gz -e \
'INFO/CLNSIG[*]~"^5|" | INFO/CLNSIG[*]=="5" | INFO/CLNSIG[*]~"|5" | INFO/CLNSIG[*]~"4" | INFO/CLNSIG[*]~"12"' | sort | uniq > common-notpatho.txt
#+end_src
#+RESULTS:
 #+begin_src sh :dir ~/code/bisonex/test_isec
snp=dbSNP_common_chr20.vcf.gz
clinvar=clinvar_chr20_notremapped.vcf.gz
python ../script/pythonScript/clinvar_sbSNP.py \
    --clinvar $clinvar \
    --chrm_name_table ../database/RefSeq/refseq_to_number_only_consensual.txt \
    --dbSNP $snp --output tmp.txt
sort tmp.txt | uniq > common-notpatho-alexis.txt
wc -l common-notpatho-alexis.txt
 #+end_src
 #+RESULTS:
 : 518832 common-notpatho-alexis.txt
We get 6 more than Alexis's version, with a few differences
Those from Alexis's version that are missing:
#+begin_src sh :dir ~/code/bisonex/test_isec
comm -23 common-notpatho-alexis.txt common-notpatho.txt > alexis-only.txt
cat alexis-only.txt
#+end_src
#+RESULTS:
| rs1064039  |
| rs3833341  |
| rs73598374 |
Test them in clinvar and dbSNP
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'ID=@alexis-only.txt' dbSNP_common_chr20.vcf.gz
bcftools query -f '%POS\n' -i 'ID=@alexis-only.txt' dbSNP_common_chr20.vcf.gz > alexis-only-pos.txt
while read  -r line; do
bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'POS='$line clinvar_chr20.vcf.gz
done < alexis-only-pos.txt
# bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'POS=23637790' clinvar_chr20.vcf.gz
#+end_src
#+RESULTS:
|   764018 | A | ACAGGTCAAT,ACAGGT | .,5     | 2,. |   |
| 23637790 | C | G,T               | .,.,12  |     |   |
| 44651586 | C | A,G,T             | .,.,.,5 |   2 | 2 |
|   764018 | A | ACAGGTCAAT        | Benign  |     |   |
| 23637790 | C | T                 | Benign  |     |   |
| 44651586 | C | T                 | Benign  |     |   |
So there is a discrepancy between clinvar and dbSNP.
It looks like the intersection with clinvar was done incorrectly.
For example https://www.ncbi.nlm.nih.gov/snp/rs3833341#clinical_significance
It looks as if there is 1 benign and 1 patho clinvar entry.
Searching by NM shows that it is benign in clinvar because there are other submissions! https://www.ncbi.nlm.nih.gov/clinvar/variation/262235/
Confirmation in our databases:
$ bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'POS=764018' dbSNP_common_chr20.vcf.gz
764018 A ACAGGTCAAT,ACAGGT .,5|2,.
$ bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'POS=764018' clinvar_chr20.vcf.gz
764018 A ACAGGTCAAT Benign
***** KILL Fix Alexis's script
CLOSED: [2022-12-04 Sun 13:03]
Handles clinvar patho, likely patho or conflicting!
***** HOLD Rtg tools
****** Test
1. Generate the SDF file
   #+begin_src sh
rtg format genomeRef.fna  -o genomeRef.sdf
   #+end_src
2. For the databases, the --sample ALT option is required, otherwise we get
 #+begin_src
$ rtg vcfeval -b dbSNP_common.vcf.gz -c clinvar.vcf.gz -o test -t genomeRef.sdf/^C
VCF header does not contain a FORMAT field named GQ
Error: Record did not contain enough samples: NC_000001.11	10001	rs1570391677	A,C	.	PASS	RS=1570391677;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;R5;GNO;FREQ=KOREAN:0.9891,0.0109,.|SGDP_PRJ:0,1,.|dbGaP_PopFreq:1,.,0;COMMON
 #+end_src
 Trying the clinvar (patho or not) / dbSNP intersection
   - false negative = dbSNP common entries that are not in clinvar
   - false positive = clinvar entries that are not in dbSNP common
   - true positive = clinvar entries that are in dbSNP common
   - true positive baseline = dbSNP common entries that are in clinvar
 Count the number of lines
 #+begin_src sh
zgrep '^[^#]' /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz | wc -l
for i in *.vcf.gz; do echo $i; zgrep '^[^#]' $i | wc -l; done
 #+end_src
 | clinvar            |  1493470 |
 | fn.vcf.gz          | 22330220 |
 | fp.vcf.gz          |  1222529 |
 | tp-baseline.vcf.gz |   131040 |
 | tp.vcf.gz          |   136638 |
Note that we do not recover all of clinvar...
1222529 + 131040 = 1353569 < 1493470
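The size of the gap can be computed directly from the counts in the table above (shell arithmetic, nothing assumed beyond those numbers):

```shell
clinvar_total=1493470
fp=1222529            # clinvar records not in dbSNP common
tp_baseline=131040    # value used in the sum above
accounted=$((fp + tp_baseline))
missing=$((clinvar_total - accounted))
echo "accounted: $accounted, missing: $missing"
```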
Some regions are not processed:
#+begin_quote
Evaluation too complex (50002 unresolved paths, 34891 iterations) at reference region NC_000001.11:790930-790970. Variants in this region will not be included in results
#+end_quote
#+begin_src sh
grep 'not be included' vcfeval.log | wc -l
56192
#+end_src
The total is still lower
We want the non-patho clinvar entries in dbSNP, i.e. the false negatives (dbSNP common not contained in clinvar patho)
#+begin_src sh
bcftools filter -i 'INFO/CLNSIG="Pathogenic"' /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz -o /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar-patho.vcf.gz
tabix /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar-patho.vcf.gz
#+end_src
Run the script (dbSNP common and clinvar = 9h)
#+begin_src sh
#!/bin/bash
#SBATCH --nodes=1
#SBATCH -p smp
#SBATCH --time=12:00:00
#SBATCH --mem=12G
dir=/Work/Groups/bisonex/data
dbSNP=$dir/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz
clinvar=$dir/clinvar/GRCh38/clinvar-patho.vcf.gz
genome=$dir/genome/GRCh38.p13/genomeRef.sdf
srun rtg vcfeval -b $dbSNP -c $clinvar -o common-not-patho -t $genome --sample ALT
#+end_src
****** HOLD Look into unprocessed complex regions
***** DONE bcftools isec: no
CLOSED: [2022-11-27 Sun 00:38]
#+begin_src sh
bcftools isec dbSNP_common.vcf.gz clinvar.vcf.gz -p common
#+end_src
Check that the 2 common files have the same number of lines
#+begin_src sh
$ grep -e '^NC'  0002.vcf | wc -l
74302
alex@gentoo ~/code/bisonex/data/common $ grep -e '^NC'  0003.vcf | wc -l
74302
#+end_src
****** DONE Impact of the -n option
CLOSED: [2022-10-23 Sun 13:56]
But when specifying -n =2:
#+begin_src sh
$ bedtools intersect -a  dbSNP_common.vcf.gz -b clinvar.vcf.gz
74978
#+end_src
Looking only at the variants, we do get 74302
#+begin_src sh
rg "^NC" none_sorted.vcf  | wc -l
#+end_src
NB: test done with
#+begin_src
bcftools isec dbSNP_common.vcf.gz clinvar.vcf.gz -c none -n =2 -w 1 | sort > none.vcf
sort common/0003.vcf > common/0003_sorted.vcf
comm -13 common/0003_sorted.vcf none_sorted.vcf
#+end_src
****** DONE Handling duplicates: -c none
CLOSED: [2022-10-23 Sun 13:56]
Keeping only those with identical REF and ALT
#+begin_src sh
bcftools isec dbSNP_common.vcf.gz clinvar.vcf.gz -c none -n =2 -w 1 | wc -l
74978
#+end_src
Keeping everything
#+begin_src sh
bcftools isec dbSNP_common.vcf.gz clinvar.vcf.gz -c all -n =2 -w 1 | wc -l
137777
#+end_src
To look at the difference:
#+begin_src sh
bcftools isec dbSNP_common.vcf.gz clinvar.vcf.gz -c none -n =2 -w 1 | sort > none_sorted.vcf
bcftools isec dbSNP_common.vcf.gz clinvar.vcf.gz -c all -n =2 -w 1 | sort > all_sorted.vcf
comm -13 none_sorted.vcf all_sorted.vcf | head
#+end_src
On an example, the variants are indeed different
****** DONE Removing clinvar patho
CLOSED: [2022-10-23 Sun 18:55]
Seems to do the job, since dbSNP_common has 23194960 lines (so ~80 000 fewer)
 #+begin_src sh
$ bcftools isec -e 'INFO/CLNSIG="Pathogenic" & INFO/CLNSIG="Pathogenic/Likely_pathogenic"' -c none -n~10  dbSNP_common.vcf.gz clinvar.vcf.gz | wc -l
Note: -w option not given, printing list of sites...
23119984
 #+end_src
 However, the -w or -p option produces "data" files...
After another try, no more problem
#+begin_src
$ bcftools isec -e 'INFO/CLNSIG="Pathogenic" & INFO/CLNSIG="Pathogenic/Likely_pathogenic"' -c none -n=1 dbSNP_common.vcf.gz clinvar.vcf.gz -w 1 -o lol.vcf.gz
$ zcat lol.vcf.gz | wc -l
23120660
#+end_src
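Note the -n option, which changes between "=1" and "~10"...
"=1" selects sites present in exactly one file (either one), while "~10" selects sites present in the first file and absent from the second. Since -w 1 only writes records from the first file, both give the same count here.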
#+begin_src
$ bcftools isec -e 'INFO/CLNSIG="Pathogenic" & INFO/CLNSIG="Pathogenic/Likely_pathogenic"' -c none -n~10 dbSNP_common.vcf.gz clinvar.vcf.gz -w 1 -o lol.vcf.gz
$ zcat lol.vcf.gz | wc -l
23120660
#+end_src
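A toy illustration with comm of the two set operations involved (made-up site names, not bcftools itself): sites found in exactly one of two lists versus sites found only in the first list. Once the output is restricted to entries of the first list, both give the same result.

```shell
tmp=$(mktemp -d)
printf 'siteA\nsiteB\nsiteC\n' > "$tmp/first.txt"    # stands in for dbSNP
printf 'siteB\nsiteD\n'        > "$tmp/second.txt"   # stands in for clinvar

# Present in exactly one list: private lines of either file
# (comm -3 indents the second column with a tab, removed here).
exactly_one=$(LC_ALL=C comm -3 "$tmp/first.txt" "$tmp/second.txt" | tr -d '\t')
# Present in the first list and absent from the second.
first_only=$(LC_ALL=C comm -23 "$tmp/first.txt" "$tmp/second.txt")
# "Exactly one", restricted to entries that belong to the first list.
exactly_one_in_first=$(printf '%s\n' "$exactly_one" | grep -F -x -f "$tmp/first.txt")

echo "exactly one: $(echo $exactly_one)"
echo "first only:  $(echo $first_only)"
```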
****** DONE Validate with Alexis: bcftools isec
CLOSED: [2022-11-07 Mon 21:42]
****** DONE Why a different number of lines from Alexis's version -> isec does not handle multiple ALTs
CLOSED: [2022-11-26 Sat 23:36]
Big difference!
#+begin_src
$ wc -l ID_of_common_snp_not_clinvar_patho.txt
23119915 ID_of_common_snp_not_clinvar_patho.txt
$ wc -l /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt
85820 /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt
#+end_src
Note that all of dbSNP = 23194960
******* Clinvar class 4? Fewer but still too many
#+begin_src
$ zgrep '^NC' tmp.vcf.gz  | wc -l
21081654
#+end_src
******* Compare the IDs and look at the extra ones
#+begin_src sh
bcftools isec -e 'INFO/CLNSIG="Pathogenic"' -c none -n~10 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz -w 1 -o tmp.vcf.gz
zgrep -o -e 'rs[[:digit:]]\+' tmp.vcf.gz | sort > id_sorted.txt
sort ../database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt  > reference_sorted.txt
comm -23 id_sorted.txt reference_sorted.txt > unique1.txt
#+end_src
For example
#+begin_src sh
zgrep rs1000000561 ../database/dbSNP/dbSNP_common.vcf.gz
#+end_src
NC_000002.12	136732859	rs1000000561	ACG	A,ACGCG	.	PASS	RS=1000000561;dbSNPBuildID=151;SSR=0;VC=INDEL;GNO;FREQ=ALSPAC:0.2506,0.7494,.|TOMMO:0.9971,0.002865,.|TWINSUK:0.2473,0.7527,.|dbGaP_PopFreq:0.993,0.006943,8.902e-05;COMMON
Careful: ClinVar uses chromosome numbers whereas dbSNP uses NC accessions...
This should normally be handled when computing the intersection!
This SNP is not in ClinVar (checked in UCSC)
******* Test on chromosome 20
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools view --regions NC_000020.11 ../database/dbSNP/dbSNP_common.vcf.gz -o dbSNP_common_chr20.vcf.gz
bcftools view --regions 20 ../database/clinvar/clinvar.vcf.gz -o clinvar_chr20.vcf.gz
tabix dbSNP_common_chr20.vcf.gz
tabix clinvar_chr20.vcf.gz
#+end_src
#+RESULTS:
Make sure to rename the ClinVar chromosomes!
#+begin_src sh :dir ~/code/bisonex/test_isec
mv clinvar_chr20.vcf.gz clinvar_chr20_notremapped.vcf.gz
bcftools annotate --rename-chrs chromosome_mapping.txt clinvar_chr20_notremapped.vcf.gz -o clinvar_chr20.vcf.gz
#+end_src
#+RESULTS:
*WARNING*: if the VCFs are not indexed, the output files will be *EMPTY*
*WARNING*: by default the filters apply to both files. This is a problem when working with inclusion rather than exclusion
Warning: check the chromosome naming convention
******** Pathogenic test: does not take multi-allelic sites into account????
We test the intersection of dbSNP and pathogenic ClinVar, as well as the complement
#+begin_src sh :dir ~/code/bisonex/test_isec
clinvar=clinvar_chr20_patho.vcf.gz
snp=dbSNP_common_chr20.vcf.gz
bcftools index $clinvar
bcftools index $snp
bcftools filter -i 'INFO/CLNSIG="Pathogenic"' clinvar_chr20.vcf.gz -o $clinvar
bcftools isec  $snp $clinvar -p tmp
for i in tmp/*.vcf ; do echo $i; grep '^[^#]'  $i | wc -l; done
#+end_src
#+RESULTS:
| tmp/0000.vcf |
|       518846 |
| tmp/0001.vcf |
|            0 |
| tmp/0002.vcf |
|            0 |
| tmp/0003.vcf |
|            0 |
No pathogenic ClinVar variant at all... Clearly wrong!
Another approach: include all SNPs and the pathogenic ClinVar variants, then look at those found only in dbSNP
#+begin_src sh :dir ~/code/bisonex/test_isec
snp=dbSNP_common_chr20.vcf.gz
clinvar=clinvar_chr20.vcf.gz
bcftools isec -n=2 -i - -i 'INFO/CLNSIG="Pathogenic"' $snp $clinvar -p tmp
 # grep '^[^#]' tmp/0000.vcf | wc -l
#+end_src
#+RESULTS:
That is all of dbSNP, so nothing is filtered out
Note: the pathogenic ClinVar variants cannot be excluded directly
#+begin_src sh :dir ~/code/bisonex/test_isec
snp=dbSNP_common_chr20.vcf.gz
clinvar=clinvar_chr20.vcf.gz
bcftools isec -i - -e 'INFO/CLNSIG="Pathogenic"' $snp $clinvar -p tmp
for i in tmp/*.vcf ; do echo $i; grep '^[^#]'  $i | wc -l; done
#+end_src
Because the difference can then no longer be computed!
Using Alexis's version:
#+begin_src sh :dir ~/code/bisonex/test_isec
snp=dbSNP_common_chr20.vcf.gz
clinvar=clinvar_chr20_notremapped.vcf.gz
python ../script/pythonScript/clinvar_sbSNP.py \
    --clinvar $clinvar \
    --chrm_name_table ../database/RefSeq/refseq_to_number_only_consensual.txt \
    --dbSNP $snp --output tmp.txt
sort tmp.txt > common-notpatho-alexis.txt
wc -l common-notpatho-alexis.txt
#+end_src
#+RESULTS:
: 518832 common-notpatho-alexis.txt
Looking for the pathogenic ClinVar variants (i.e. those absent from the output):
#+begin_src sh :dir ~/code/bisonex/test_isec
  bcftools query -f '%ID\n' dbSNP_common_chr20.vcf.gz | sort > all.txt
  sort common-notpatho-alexis.txt > alexis.txt
  comm -23 all.txt alexis.txt > patho.txt
#+end_src
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f '%POS\n' -i 'ID=@patho.txt' dbSNP_common_chr20.vcf.gz -o pos.txt
for pos in $(cat pos.txt); do
  bcftools query -f '%CHROM %POS %ID %REF %ALT\n' -i 'POS='$pos dbSNP_common_chr20.vcf.gz
  bcftools query -f '%CHROM %POS %ID %REF %ALT %INFO/CLNSIG\n' -i 'POS='$pos  clinvar_chr20.vcf.gz
  echo "------"
done
#+end_src
#+RESULTS:
| NC_000020.11 |  3234173 |   rs3827075 | T         | A,C,G     |                                              |
| NC_000020.11 |  3234173 |      262001 | T         | G         | Conflicting_interpretations_of_pathogenicity |
| NC_000020.11 |  3234173 |     1072511 | T         | TGGCGAAGC | Pathogenic                                   |
| NC_000020.11 |  3234173 |      208613 | TGGCGAAGC | G         | Pathogenic                                   |
| NC_000020.11 |  3234173 |        1312 | TGGCGAAGC | T         | Pathogenic                                   |
| ------       |          |             |           |           |                                              |
| NC_000020.11 |  4699605 |   rs1799990 | A         | G         |                                              |
| NC_000020.11 |  4699605 |       13397 | A         | G         | Benign/Likely_benign                         |
| ------       |          |             |           |           |                                              |
| NC_000020.11 | 10652589 |   rs1131695 | G         | A,C,T     |                                              |
| NC_000020.11 | 10652589 |      163705 | G         | .         | Benign                                       |
| NC_000020.11 | 10652589 |      143063 | G         | A         | Benign                                       |
| NC_000020.11 | 10652589 |      234555 | G         | C         | Pathogenic                                   |
| ------       |          |             |           |           |                                              |
| NC_000020.11 | 10658574 |   rs1801138 | G         | A,T       |                                              |
| NC_000020.11 | 10658574 |       42481 | G         | A         | Benign                                       |
| NC_000020.11 | 10658574 |      992651 | G         | T         | Likely_pathogenic                            |
| NC_000020.11 | 10658574 |      213550 | GC        | A         | Pathogenic                                   |
| ------       |          |             |           |           |                                              |
| NC_000020.11 | 10672794 |  rs79338570 | G         | A,C       |                                              |
| NC_000020.11 | 10672794 |      255557 | G         | A         | Benign/Likely_benign                         |
| NC_000020.11 | 10672794 |      594067 | G         | C         | Conflicting_interpretations_of_pathogenicity |
| NC_000020.11 | 10672794 |     1324603 | G         | GGA       | Likely_pathogenic                            |
| ------       |          |             |           |           |                                              |
| NC_000020.11 | 18525868 | rs146917730 | C         | T         |                                              |
| NC_000020.11 | 18525868 |      811603 | C         | T         | Conflicting_interpretations_of_pathogenicity |
| ------       |          |             |           |           |                                              |
| NC_000020.11 | 25390747 | rs373200654 | G         | C         |                                              |
| NC_000020.11 | 25390747 |      338000 | G         | C         | Conflicting_interpretations_of_pathogenicity |
| ------       |          |             |           |           |                                              |
| NC_000020.11 | 32800145 |   rs2424926 | C         | G,T       |                                              |
| NC_000020.11 | 32800145 |      338173 | C         | G         | Benign                                       |
| NC_000020.11 | 32800145 |      338174 | C         | T         | Conflicting_interpretations_of_pathogenicity |
| ------       |          |             |           |           |                                              |
| NC_000020.11 | 33412656 |  rs35938843 | C         | G,T       |                                              |
| NC_000020.11 | 33412656 |      220958 | C         | T         | Conflicting_interpretations_of_pathogenicity |
| ------       |          |             |           |           |                                              |
| NC_000020.11 | 45891622 | rs181943893 | G         | A,C,T     |                                              |
| NC_000020.11 | 45891622 |      459632 | G         | C         | Conflicting_interpretations_of_pathogenicity |
| NC_000020.11 | 45891622 |      797035 | G         | T         | Likely_benign                                |
| NC_000020.11 | 45891622 |     1572689 | GCTA      | G         | Likely_benign                                |
| ------       |          |             |           |           |                                              |
| NC_000020.11 | 54171651 |  rs35873579 | G         | A,T       |                                              |
| NC_000020.11 | 54171651 |      285894 | G         | A         | Conflicting_interpretations_of_pathogenicity |
| NC_000020.11 | 54171651 |     1373583 | G         | C         | Uncertain_significance                       |
| NC_000020.11 | 54171651 |      895614 | G         | T         | Benign/Likely_benign                         |
| ------       |          |             |           |           |                                              |
| NC_000020.11 | 62172726 |  rs36106901 | G         | A         |                                              |
| NC_000020.11 | 62172726 |      981031 | G         | A         | Conflicting_interpretations_of_pathogenicity |
| ------       |          |             |           |           |                                              |
| NC_000020.11 | 63349782 |   rs1044396 | G         | A,C       |                                              |
| NC_000020.11 | 63349782 |       93427 | G         | A         | Benign                                       |
| NC_000020.11 | 63349782 |      857384 | G         | C         | Conflicting_interpretations_of_pathogenicity |
| ------       |          |             |           |           |                                              |
| NC_000020.11 | 63414925 |   rs1801545 | G         | A,C,T     |                                              |
| NC_000020.11 | 63414925 |      194284 | G         | A         | Conflicting_interpretations_of_pathogenicity |
| NC_000020.11 | 63414925 |      129337 | G         | C         | Benign                                       |
| NC_000020.11 | 63414925 |      851545 | GG        | CA        | Uncertain_significance                       |
| ------       |          |             |           |           |                                              |
So there are several problems:
1. isec should work at least on
| NC_000020.11 | 25390747 | rs373200654 | G         | C         |                                              |
| NC_000020.11 | 25390747 |      338000 | G         | C         | Conflicting_interpretations_of_pathogenicity |
We test on this line only
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools filter -i 'POS=25390747' clinvar_chr20.vcf.gz -o clinvar_test.vcf.gz
bcftools filter -i 'POS=25390747' dbSNP_common_chr20.vcf.gz -o dbSNP_test.vcf.gz
#+end_src
The line is indeed found in the intersection...
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools filter -i 'POS=25390747' clinvar_chr20.vcf.gz -o clinvar_test.vcf.gz
bcftools index dbSNP_test.vcf.gz
bcftools index clinvar_test.vcf.gz
bcftools isec dbSNP_test.vcf.gz clinvar_test.vcf.gz -p test
#+end_src
#+RESULTS:
2. isec does not seem to work when there are multiple ALT alleles
| NC_000020.11 | 32800145 | rs2424926 | C | G,T |                                              |
| NC_000020.11 | 32800145 |    338173 | C | G   | Benign                                       |
| NC_000020.11 | 32800145 |    338174 | C | T   | Conflicting_interpretations_of_pathogenicity |
|              |          |           |   |     |                                              |
3. when there are several variants at one position, we must check that none of them is pathogenic.
   Alexis's version does this correctly
| NC_000020.11 | 3234173 | rs3827075 | T         | A,C,G     |                                              |
| NC_000020.11 | 3234173 |    262001 | T         | G         | Conflicting_interpretations_of_pathogenicity |
| NC_000020.11 | 3234173 |   1072511 | T         | TGGCGAAGC | Pathogenic                                   |
| NC_000020.11 | 3234173 |    208613 | TGGCGAAGC | G         | Pathogenic                                   |
| NC_000020.11 | 3234173 |      1312 | TGGCGAAGC | T         | Pathogenic                                   |
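Point 3 can be sketched in Python as follows. This only illustrates the position-level rule and is not Alexis's actual script: the function names, the tuple layout and the choice of which CLNSIG values count as pathogenic are assumptions, and the second dbSNP record is made up for the example.

#+begin_src python
def is_patho(clnsig: str) -> bool:
    # Assumption: any CLNSIG mentioning "pathogenic" counts, which also
    # covers Likely_pathogenic and Conflicting_interpretations_of_pathogenicity.
    return "pathogenic" in clnsig.lower()


def common_not_patho(dbsnp, clinvar):
    """Return dbSNP IDs whose position carries no pathogenic ClinVar record.

    dbsnp:   iterable of (chrom, pos, rsid)
    clinvar: iterable of (chrom, pos, clnsig)
    Both inputs must use the same chromosome naming and integer positions.
    """
    patho_positions = {
        (chrom, pos) for chrom, pos, clnsig in clinvar if is_patho(clnsig)
    }
    return [
        rsid for chrom, pos, rsid in dbsnp
        if (chrom, pos) not in patho_positions
    ]


dbsnp = [
    ("NC_000020.11", 3234173, "rs3827075"),
    ("NC_000020.11", 1000, "rs_hypothetical"),  # made-up record
]
clinvar = [
    ("NC_000020.11", 3234173, "Conflicting_interpretations_of_pathogenicity"),
    ("NC_000020.11", 3234173, "Pathogenic"),
    ("NC_000020.11", 1000, "Benign"),
]
print(common_not_patho(dbsnp, clinvar))
#+end_src

Working per position rather than per (REF, ALT) pair avoids the multi-allelic matching problem entirely, at the cost of discarding a common SNP whenever any variant at the same position is flagged.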
****** DONE Check whether isec handles multi-allelic sites (chr20): no, could not get it to work
CLOSED: [2022-11-27 Sun 00:37]
******* DONE chr20 using a pathogenic ClinVar variant that is also in dbSNP
CLOSED: [2022-11-27 Sun 00:37]
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools filter dbSNP_common_chr20.vcf.gz -i 'POS=10652589' -o test_dbsnp.vcf.gz
bcftools filter clinvar_chr20.vcf.gz -i 'POS=10652589' -o test_clinvar.vcf.gz
bcftools index test_dbsnp.vcf.gz
bcftools index test_clinvar.vcf.gz
#+end_src
#+RESULTS:
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools isec test_dbsnp.vcf.gz test_clinvar.vcf.gz -p tmp
grep '^[^#]' tmp/0002.vcf
grep '^[^#]' tmp/0003.vcf
#+end_src
#+RESULTS:
Does not work even in the biallelic case.
Tested by editing test_dbsnp!
Works with one variant per line
****** DONE isec after splitting the multi-allelic sites: no
CLOSED: [2022-11-27 Sun 00:37]
******* DONE Simple example: ok
CLOSED: [2022-11-27 Sun 00:34]
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools filter -i 'POS=10652589' dbSNP_common_chr20.vcf.gz -o dbsnp_mwi.vcf.gz
bcftools filter -i 'POS=10652589' clinvar_chr20.vcf.gz -o clinvar_mwi.vcf.gz
bcftools index -f dbsnp_mwi.vcf.gz
bcftools index -f clinvar_mwi.vcf.gz
bcftools isec dbsnp_mwi.vcf.gz clinvar_mwi.vcf.gz -n=2
#+end_src
#+RESULTS:
Does not work even in the biallelic case.
Chr 20
With the files from the previous test
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools norm -m -any dbsnp_mwi.vcf.gz -o dbsnp_mwi_norm.vcf.gz
bcftools index dbsnp_mwi_norm.vcf.gz
bcftools isec dbsnp_mwi_norm.vcf.gz clinvar_mwi.vcf.gz -n=2
#+end_src
#+RESULTS:
| NC_000020.11 | 10652589 | G | A | 11 |
| NC_000020.11 | 10652589 | G | C | 11 |
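The effect of splitting can be reproduced with a small Python sketch. It is a simplification under stated assumptions: records are reduced to (chrom, pos, ref, alt) tuples and the helper name =split_multiallelic= is made up here, whereas bcftools norm also rewrites the per-allele INFO and FORMAT fields.

#+begin_src python
def split_multiallelic(record):
    """Split one (chrom, pos, ref, alts) record into one record per ALT."""
    chrom, pos, ref, alts = record
    return [(chrom, pos, ref, alt) for alt in alts.split(",")]


# dbSNP and ClinVar records at POS=10652589, taken from the table above
dbsnp = [("NC_000020.11", 10652589, "G", "A,C,T")]
clinvar = [
    ("NC_000020.11", 10652589, "G", "."),
    ("NC_000020.11", 10652589, "G", "A"),
    ("NC_000020.11", 10652589, "G", "C"),
]

dbsnp_split = [r for rec in dbsnp for r in split_multiallelic(rec)]
shared = sorted(set(dbsnp_split) & set(clinvar))
print(shared)
#+end_src

Once each line carries a single ALT allele, exact matching on (chrom, pos, ref, alt) finds G>A and G>C, the same two records reported by isec -n=2 on the normalised file.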
******* TODO On dbSNP chr20: no
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools norm -m -any dbSNP_common_chr20.vcf.gz -o dbSNP_common_chr20_norm.vcf.gz
#+end_src
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools isec -i 'INFO/CLNSIG="Pathogenic"' dbSNP_common_chr20_norm.vcf.gz clinvar_chr20.vcf.gz -p tmp
#+end_src
#+RESULTS:
***** DONE Try bedtools intersect
#+begin_src sh
bedtools intersect -a  dbSNP_common.vcf.gz -b clinvar.vcf.gz
#+end_src
$ wc -l intersect.vcf
220206 intersect.vcf
** TODO Dependencies with Nix
*** DONE GATK
CLOSED: [2022-10-21 Fri 21:59]
*** WAIT BioDBHTS
Contribute a pull request
*** DONE BioExtAlign
CLOSED: [2022-10-22 Sat 00:38]
*** WAIT BioBigFile
Check again whether the latest kent version can be used
Contribute a pull request
*** HOLD rtg-tools
Convert ClinVar to NC accessions
*** DONE Spip
CLOSED: [2022-12-04 Sun 12:49]
No pull request
*** DONE R + packages
CLOSED: [2022-11-19 Sat 21:05]
** DONE Execution
CLOSED: [2022-09-13 Tue 21:37]
*** KILL test Bionix
*** KILL Implement execution with Nix?
See https://academic.oup.com/gigascience/article/9/11/giaa121/5987272?login=false
for an example.
Probably simpler to use Nix for environment management and Snakemake for execution
No internet access from the cluster
*** DONE nextflow
CLOSED: [2022-09-13 Tue 21:37]
** DONE Preprocessing with Nextflow
CLOSED: [2022-10-09 Sun 22:30]
*** DONE Map to reference
CLOSED: [2022-10-09 Sun 22:30]
*** DONE Mark duplicate
CLOSED: [2022-10-09 Sun 22:30]
*** DONE Recalibrate base quality score
CLOSED: [2022-10-09 Sun 22:30]
** DONE Variant calling with Nextflow
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Haplotype caller
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter variants
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter common snp not clinvar path
CLOSED: [2022-11-07 Mon 23:00]
See [[*common dbSNP not clinvar patho][common dbSNP not clinvar patho]]
*** DONE Filter variant only in consensual sequence
CLOSED: [2022-11-08 Tue 22:23]
*** DONE Filter technical variants
CLOSED: [2022-11-19 Sat 21:34]
** TODO Annotation with Nextflow
*** TODO VEP
*** TODO Spip
*** TODO Filter after VEP
With bcftools we should be able to do without an R script
** STRT Test Alexis's version with Nix
*** DONE Add ClinVar
CLOSED: [2022-11-13 Sun 19:37]
*** DONE Alignment
CLOSED: [2022-11-13 Sun 12:52]
*** DONE Haplotype caller
CLOSED: [2022-11-13 Sun 13:00]
*** TODO Filter
- [X] depth
- [ ] common snp not patho
Problem with the ID list
**** TODO variant annotation
Requires VEP
*** TODO Variant calling
* TODO Tests
** TODO Regression test against Alexis's version with Nix
*** DONE ID common snp
CLOSED: [2022-11-19 Sat 21:36]
#+begin_src
$ wc -l ID_of_common_snp.txt
23194290 ID_of_common_snp.txt
$ wc -l /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp.txt
23194290 /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp.txt
#+end_src
*** DONE ID common snp not clinvar patho
CLOSED: [2022-12-11 Sun 20:11]
**** DONE Checking the problem
CLOSED: [2022-12-11 Sun 16:30]
On the J: drive:
21155134 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt.ref
Regression-test version:
21155076 database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt
New version:
23193391 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt
After removing duplicates:
$ sort database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt | uniq > old.txt
$ wc -l old.txt
21107097 old.txt
$ sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt | uniq > new.txt
$ wc -l new.txt
21174578 new.txt
$ sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt.ref | uniq > ref.txt
$ wc -l ref.txt
21107155 ref.txt
Looking at the difference:
 comm -23 ref.txt old.txt
rs1052692
rs1057518973
rs1057518973
rs11074121
rs112848754
rs12573787
rs145033890
rs147889095
rs1553904159
rs1560294695
rs1560296615
rs1560310926
rs1560325547
rs1560342418
rs1560356225
rs1578287542
...
We look up the first one
bcftools query -i 'ID="rs1052692"' database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 1619351 C A,T
It is indeed in the pathogenic category (conflicting interpretations)...
$ bcftools query -i 'POS=1619351' database/clinvar/clinvar.vcf.gz -f '%CHROM %POS %REF %ALT %INFO/CLNSIG\n'
19 1619351 C T Conflicting_interpretations_of_pathogenicity
We check all the others
$ comm -23 ref.txt old.txt > tocheck.txt
Generate the regions to check (chromosome number:position)
$ bcftools query -i 'ID=@tocheck.txt' database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM\t%POS\n' > tocheck.pos
Generate the reverse mapping (chromosome number -> NC)
$ awk ' { t = $1; $1 = $2; $2 = t; print; } ' database/RefSeq/refseq_to_number_only_consensual.txt  > mapping.txt
Remap ClinVar
$ bcftools annotate --rename-chrs mapping.txt database/clinvar/clinvar.vcf.gz -o clinvar_remapped.vcf.gz
$ tabix clinvar_remapped.vcf.gz
Finally, look up the classification in ClinVar
$ bcftools query -R tocheck.pos clinvar_remapped.vcf.gz -f '%CHROM %POS %INFO/CLNSIG\n'
$ bcftools query -R tocheck.pos database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM %POS %ID \n' | grep '^NC'
#+RESULTS:
**** DONE Understand why the new version gives a different result
CLOSED: [2022-12-11 Sun 20:11]
***** DONE Same dbSNP and ClinVar versions?
CLOSED: [2022-12-10 Sat 23:02]
Different ClinVar!
  $ bcftools stats clinvar.gz
  clinvar (Alexis)
SN	0	number of samples:	0
SN	0	number of records:	1492828
SN	0	number of no-ALTs:	965
SN	0	number of SNPs:	1338007
SN	0	number of MNPs:	5562
SN	0	number of indels:	144580
SN	0	number of others:	3714
SN	0	number of multiallelic sites:	0
SN	0	number of multiallelic SNP sites:	0
clinvar (new)
SN	0	number of samples:	0
SN	0	number of records:	1493470
SN	0	number of no-ALTs:	965
SN	0	number of SNPs:	1338561
SN	0	number of MNPs:	5565
SN	0	number of indels:	144663
SN	0	number of others:	3716
SN	0	number of multiallelic sites:	0
SN	0	number of multiallelic SNP sites:	0
***** DONE Update ClinVar and dbSNP to work from the same databases
CLOSED: [2022-12-11 Sun 12:10]
The problem persists
***** DONE Remove the int conversion of the chromosome
CLOSED: [2022-12-10 Sat 19:29]
***** KILL Same NC accessions?
CLOSED: [2022-12-10 Sat 19:29]
$  zgrep "contig=<ID=NC_\(.*\)" clinvar/GRCh38/clinvar.vcf.gz > contig.clinvar
$ diff contig.txt contig.clinvar
< ##contig=<ID=NC_012920.1>
***** DONE Test on chromosome 19: ok
CLOSED: [2022-12-11 Sun 13:53]
We prepare the data
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz -o dbSNP_common_19.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz -o clinvar_19.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data-alexis/dbSNP/dbSNP_common.vcf.gz -o dbSNP_common_19_old.vcf.gz
 bcftools filter -i 'CHROM="19"' /Work/Groups/bisonex/data-alexis/clinvar/clinvar.vcf.gz -o clinvar_19_old.vcf.gz
#+end_src
We fetch both versions of the script
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
git checkout regression ../../script/pythonScript/clinvar_sbSNP.py
cp ../../script/pythonScript/clinvar_sbSNP.py clinvar_sbSNP_old.py
git checkout HEAD ../../script/pythonScript/clinvar_sbSNP.py
#+end_src
#+RESULTS:
We compare
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
python ../../script/pythonScript/clinvar_sbSNP.py clinvar_sbSNP.py --clinvar clinvar_19.vcf.gz --dbSNP dbSNP_common_19.vcf.gz --output tmp.txt
sort tmp.txt | uniq > new.txt
table=/Work/Groups/bisonex/data-alexis/RefSeq/refseq_to_number_only_consensual.txt
python clinvar_sbSNP_old.py --clinvar clinvar_19_old.vcf.gz --dbSNP dbSNP_common_19_old.vcf.gz --output tmp_old.txt --chrm_name_table $table
sort tmp_old.txt | uniq > old.txt
wc -l old.txt new.txt
#+end_src
#+RESULTS:
|  535155 | old.txt |
|  535194 | new.txt |
| 1070349 | total   |
Taking the first extra ID in new: ClinVar lists it as conflicting interpretations of pathogenicity, so it should not be there...
$ bcftools query -i 'ID="rs10418277"' dbSNP_common_19.vcf.gz  -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 54939682 C G,T
$ bcftools query -i 'ID="rs10418277"' dbSNP_common_19_old.vcf.gz  -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 54939682 C G,T
$ bcftools query -i 'POS=54939682' clinvar_19.vcf.gz  -f '%POS %REF %ALT %INFO/CLNSIG\n'
54939682 C G Conflicting_interpretations_of_pathogenicity
54939682 C T Benign
$ bcftools query -i 'POS=54939682' clinvar_19_old.vcf.gz  -f '%POS %REF %ALT %INFO/CLNSIG\n'
54939682 C G Conflicting_interpretations_of_pathogenicity
54939682 C T Benign
$ grep rs10418277 *.txt
new.txt:rs10418277
tmp.txt:rs10418277
The problem was that POS was no longer converted to int (line deleted by mistake??)
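A minimal Python sketch of why a missing int conversion silently breaks the lookup. The variable names and the use of a set of (chrom, pos) keys are assumptions for illustration; the actual script is not shown in these notes.

#+begin_src python
# Pathogenic ClinVar positions, with POS parsed through int()
clinvar_patho = {("NC_000019.10", 54939682)}

# A dbSNP line split on tabs: every field is still a string
fields = "NC_000019.10\t54939682\trs10418277\tC\tG,T".split("\t")
chrom, pos = fields[0], fields[1]

found_as_str = (chrom, pos) in clinvar_patho       # "54939682" != 54939682
found_as_int = (chrom, int(pos)) in clinvar_patho  # same type on both sides

print(found_as_str, found_as_int)
#+end_src

No error is raised in the string case: the lookup simply never matches, so the SNP is kept as "not patho", which is consistent with rs10418277 appearing in new.txt.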
We check:
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
python ../../script/pythonScript/clinvar_sbSNP.py --clinvar clinvar_19.vcf.gz --dbSNP dbSNP_common_19.vcf.gz --output tmp.txt
sort tmp.txt | uniq > new.txt
table=/Work/Groups/bisonex/data-alexis/RefSeq/refseq_to_number_only_consensual.txt
python clinvar_sbSNP_old.py --clinvar clinvar_19_old.vcf.gz --dbSNP dbSNP_common_19_old.vcf.gz --output tmp_old.txt --chrm_name_table $table
sort tmp_old.txt | uniq > old.txt
wc -l old.txt new.txt
diff old.txt new.txt
#+end_src
#+RESULTS:
|  535155 | old.txt |
|  535155 | new.txt |
| 1070310 | total   |
***** DONE Test on chromosomes 19 and 20: ok
CLOSED: [2022-12-11 Sun 15:56]
We prepare the data
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
bcftools filter -i 'CHROM="NC_000019.10" | CHROM="NC_000020.11"' /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz -o dbSNP_common_19_20.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10" | CHROM="NC_000020.11"' /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz -o clinvar_19_20.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10" | CHROM="NC_000020.11"' /Work/Groups/bisonex/data-alexis/dbSNP/dbSNP_common.vcf.gz -o dbSNP_common_19_20_old.vcf.gz
bcftools filter -i 'CHROM="19" | CHROM="20"' /Work/Groups/bisonex/data-alexis/clinvar/clinvar.vcf.gz -o clinvar_19_20_old.vcf.gz
#+end_src
#+RESULTS:
We fetch both versions of the script
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
git checkout regression ../../script/pythonScript/clinvar_sbSNP.py
cp ../../script/pythonScript/clinvar_sbSNP.py clinvar_sbSNP_old.py
git checkout HEAD ../../script/pythonScript/clinvar_sbSNP.py
#+end_src
#+RESULTS:
We compare
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
python ../../script/pythonScript/clinvar_sbSNP.py clinvar_sbSNP.py --clinvar clinvar_19_20.vcf.gz --dbSNP dbSNP_common_19_20.vcf.gz --output tmp.txt
sort tmp.txt | uniq > new.txt
table=/Work/Groups/bisonex/data-alexis/RefSeq/refseq_to_number_only_consensual.txt
python clinvar_sbSNP_old.py --clinvar clinvar_19_20_old.vcf.gz --dbSNP dbSNP_common_19_20_old.vcf.gz --output tmp_old.txt --chrm_name_table $table
sort tmp_old.txt | uniq > old.txt
wc -l old.txt new.txt
#+end_src
***** DONE Look at the distribution of the differences
CLOSED: [2022-12-11 Sun 16:29]
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt  | uniq > notpatho.new
sort /Work/Groups/bisonex/data-alexis/dbSNP/ID_of_common_snp_not_clinvar_patho.txt  | uniq > notpatho.old
comm -23 notpatho.new notpatho.old > nopatho.diff
#+end_src
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
 bcftools query -i 'ID=@nopatho.diff' /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz -f '%CHROM\n' | sort | uniq -c
 #+end_src
 Most of them are on non-consensus contigs (not "NC_", see notes)
 #+RESULTS:
  :     2 NC_000002.12
  :    18 NC_000003.12
  :     2 NC_000004.12
  :     2 NC_000005.10
  :    14 NC_000006.12
  :     6 NC_000007.14
  :     2 NC_000009.12
  :     1 NC_000010.11
  :     6 NC_000014.9
  :     1 NC_000015.10
  :     3 NC_000016.10
  :     3 NC_000017.11
  :     1 NC_000019.10
  :     1 NC_000020.11
  :     1 NC_000021.9
  :     2 NC_000022.11
  : 16018 NT_113793.3
  : 17010 NT_113796.3
  :    14 NT_113891.3
  :     1 NT_167244.2
  :    13 NT_167245.2
  :     2 NT_167246.2
  :    13 NT_167247.2
  :     7 NT_167248.2
  :    14 NT_167249.2
  : 14857 NT_187361.1
  :    92 NT_187367.1
  :     1 NT_187369.1
  :    13 NT_187381.1
  :    54 NT_187383.1
  :     6 NT_187499.1
  :    46 NT_187502.1
  : 13754 NT_187513.1
  :   611 NT_187517.1
  :     1 NT_187520.1
  :     1 NT_187524.1
  :   249 NT_187526.1
  :    18 NT_187532.1
  :     1 NT_187546.1
  :   886 NT_187562.1
  :     1 NT_187564.1
  :   346 NT_187576.1
  :    13 NT_187600.1
  :     5 NT_187601.1
  :   494 NT_187606.1
  :     1 NT_187607.1
  :    12 NT_187613.1
  :   307 NT_187614.1
  :     1 NT_187625.1
  :   445 NT_187633.1
  :    43 NT_187648.1
  :    18 NT_187649.1
  :     1 NT_187652.1
  :   512 NT_187661.1
  :    18 NT_187678.1
  :    49 NT_187681.1
  :     1 NT_187682.1
  :    18 NT_187688.1
  :    12 NT_187689.1
  :    18 NT_187690.1
  :    18 NT_187691.1
  :   404 NT_187693.1
  :     2 NW_003315952.3
  :     1 NW_003315970.2
  :   203 NW_003571054.1
  :   322 NW_003571055.2
  :    16 NW_003571056.2
  :    16 NW_003571057.2
  :    16 NW_003571058.2
  :    16 NW_003571059.2
  :    16 NW_003571060.1
  :   213 NW_003571061.2
  :     2 NW_009646201.1
  :   322 NW_009646205.1
  :   321 NW_009646206.1
  :   371 NW_012132914.1
  :     1 NW_012132915.1
  :    13 NW_012132918.1
  :     2 NW_013171801.1
  :     1 NW_013171807.1
  :    49 NW_015148966.1
  :    14 NW_015495298.1
  :     2 NW_015495299.1
  :     1 NW_016107298.1
  :     4 NW_017363813.1
  :     2 NW_017852933.1
  :     1 NW_018654722.1
  :    38 NW_021160001.1
  :     1 NW_021160003.1
  :     1 NW_021160007.1
  :     7 NW_021160017.1
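To summarise such a listing per RefSeq accession prefix (NC_ for chromosomes, NT_ and NW_ for contigs and scaffolds), a small Python helper can sum the =uniq -c= counts. This is a sketch: the function name is made up, and the three sample lines are copied from the results above rather than the full table.

#+begin_src python
from collections import Counter


def count_by_prefix(uniq_c_lines):
    """Sum `uniq -c` counts per accession prefix (NC, NT, NW...)."""
    totals = Counter()
    for line in uniq_c_lines:
        # Org result lines start with ": "; strip it before splitting
        count, contig = line.lstrip(" :").split()
        totals[contig.split("_")[0]] += int(count)
    return totals


sample = [
    "  :     2 NC_000002.12",
    "  : 16018 NT_113793.3",
    "  :   203 NW_003571054.1",
]
print(count_by_prefix(sample))
#+end_src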
***** DONE Look at the difference with the version without the non-consensus sites: ok!
CLOSED: [2022-12-11 Sun 20:11]
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt  | uniq > notpatho.new
sort /Work/Groups/bisonex/data-alexis/dbSNP/ID_of_common_snp_not_clinvar_patho.txt  | uniq > notpatho.old
comm -13 notpatho.new notpatho.old > notpatho.diff
wc -l notpatho.diff
#+end_src
#+RESULTS:
: 528 notpatho.diff
528 variants are missing
rs1057520103
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
 bcftools query -i 'ID=@notpatho.diff' /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz -f '%CHROM\n' | sort | uniq -c
 #+end_src
 #+RESULTS:
 : 528 NC_012920.1
 So the new version works better!
 We check their presence in the old version and in the new one:
$ grep -w -f notpatho.diff /Work/Groups/bisonex/data-alexis/dbSNP/ID_of_common_snp_not_clinvar_patho.txt  | wc -l
528
$ grep -w -f notpatho.diff  /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt
**** DONE Remove the non-consensus sites
CLOSED: [2022-12-11 Sun 19:51]
**** DONE Add the mitochondrial variants back (discussed with Paul)
CLOSED: [2022-12-13 Tue 17:26]
OK with our generated version. On the J: drive: 21155134
$ wc -l dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt
21155065 dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt
$ wc -l ../data-alexis/dbSNP/ID_of_common_snp_not_clinvar_patho.txt
21155065 ../data-alexis/dbSNP/ID_of_common_snp_not_clinvar_patho.txt
The difference probably comes from an old ClinVar version
*** TODO alignment + variant:
**** DP filter: different
We have extra variants
$ grep '^NC' filter-depth.vcf | wc -l
86580
Alexis
$  zgrep '^NC' 63003856_S135_DP_over_30.vcf  | wc -l
82033
This does not come from the depth filter
We tested
bcftools filter -i 'FORMAT/AD[0:1]<=10' 63003856_S135_DP_over_30.vcf
bcftools filter -i 'FORMAT/DP<=30' 63003856_S135_DP_over_30.vcf
Same for our version. Nothing comes out.
Does HaplotypeCaller use the same arguments??
**** After HaplotypeCaller: different
$ zgrep '^NC' /Work/Groups/bisonex/ref-vcf/63003856_S135.vcf | wc -l
1506894
$ zgrep '^NC' /Work/Users/apraga/bisonex/out/63003856_S135/variantCalling/haplotypecaller/63003856_S135.vcf | wc -l
1631935
**** Aligner
3 lines of difference
$  samtools view -c /Work/Users/apraga/bisonex/out/63003856_S135/preprocessing/mapped/63003856_S135.bam
128077211
$  samtools view -c /Work/Groups/bisonex/ref_63003856_S135/63003856_S135.bam
128077207
**** Clean sam
Bug fixed: the file was cleaned after MarkDuplicates... Rerunning. In the meantime:
$ samtools view -c work/3f/532cde42fabd8973384a817b7539de/sorted.bam
85710928
$ samtools view -c work/3f/532cde42fabd8973384a817b7539de/62903840_S29.bam
85710928
but:
$ samtools view -c /Work/Groups/bisonex/ref_63003856_S135/63003856_S135_cleaned.bam
128077207
$ samtools view -c /Work/Groups/bisonex/ref_63003856_S135/63003856_S135.bam
128077207
*** TODO 63003856_S135
** Divers
*** DONE Vérifier nombre de reads fastq - bam
CLOSED: [2022-10-09 Sun 22:31]
** TODO Genome in a bottle ?
We do not have the DNA... have it sequenced at Centogene?
* Improvements
** TODO Quality score recalibration with a set of files
See GATK best practices
** TODO Use T-to-T as reference
Seems complicated with the new databases
** TODO Excel macro
** TODO Use the ClinVar XML
Extraction to VCF is possible with
https://github.com/SeqOne/clinvcf
** Annotation
Full list
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9252745/
*** TODO Use a slimmed-down version of gnomAD (a single column)
*** TODO Digenic inheritance (cf. OMIM nomenclature)
It is in the disease name
* HOLD Implement other pipelines
See https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04407-x
** KILL GATK
CLOSED: [2022-11-11 Fri 20:01]
https://broadinstitute.github.io/warp/docs/Pipelines/Exome_Germline_Single_Sample_Pipeline/README
In principle, follows the best practices
** KILL Try Snakemake with best practices
https://github.com/snakemake-workflows/dna-seq-gatk-variant-calling/blob/main/.github/workflows/main.yml
Install Mamba (micromamba does not work under Nix)
Does not work under WSL2... MultiQC is not recent enough
Version problems...
** KILL Sarek
CLOSED: [2022-12-11 Sun 11:09]
*** Dependencies
**** Nix
#+begin_src sh
  nix profile install nixpkgs#mosdepth nixpkgs#python3
  nix-shell -p python310Packages.pyyaml --run "nextflow run nf-core/sarek -profile test --executor slurm --queue smp --outdir test -resume"
#+end_src
***** KILL Nix derivation for the full profile
CLOSED: [2022-12-11 Sun 11:09]
**** KILL Without Nix
CLOSED: [2022-09-24 Sat 10:20]
We use conda
#+begin_src sh
module unload nix
module load anaconda3@2021.05/gcc-12.1.0
module load nextflow@22.04.0/gcc-12.1.0
module load openjdk@11.0.14.1_1/gcc-12.1.0
nextflow run nf-core/sarek -profile conda,test --executor slurm --queue smp --outdir test -resume
#+end_src
Attempt 1: permission errors, fixed by rerunning the program
#+begin_quote
  Failed to create Conda environment
  command: conda create --mkdir --yes --quiet --prefix /Work/Users/apraga/test-sarek/work/conda/env-2d53b1db50de676670cf1a91ef0cf6db bioconda::tabix=1.11
  status : 1
  message:
    NotWritableError: The current user does not have write permissions to a required path.
      path: /Home/Users/apraga/.conda/pkgs/urls.txt
      uid: 1696
      gid: 513
    If you feel that permissions on this path are set incorrectly, you can manually
    change them by executing
      $ sudo chown 1696:513 /Home/Users/apraga/.conda/pkgs/urls.txt
#+end_quote
Fixed with
#+begin_src sh
      chown 1696:513 /Home/Users/apraga/.conda/pkgs/urls.txt
#+end_src
But then a proxy problem
*** KILL Nix derivation for Python modules
CLOSED: [2022-12-11 Sun 11:09]
*** KILL Run Sarek in test mode
CLOSED: [2022-12-11 Sun 11:09]
#+begin_src sh
  nix-shell -p python310Packages.pyyaml --run "nextflow run nf-core/sarek -profile test --executor slurm --queue smp --outdir test -resume"
#+end_src
*** KILL Run Sarek on reduced data
CLOSED: [2022-12-11 Sun 11:09]