Z6B2FRJWT6EF4MC4GTIDBMULUKEI62ZGPYZMCWRJDJNUM7YUJJMAC
PL526DIB3OIAMC35BV5DRTUHRZDIJDEUCPPK2AUY5CHLWJNEWI4AC
7F3ARXB7ZM4WPC3FOH5RHXFHWB3MGZ6XJYJX5IH6RFQNUKOWHAOQC
RHWQQAAHNHFO3FLCGVB3SIDKNOUFJGZTDNN57IQVBMXXCWX74MKAC
WU76EHIWT7E3CEKGV6LQJUSOC63JYDPPDRV3KKOEGLP2FB5UEUEQC
XBXXQ7NGCA2AM7F6DODF75VNIA52J3MNYSLNAYRRC2KKYJUVCN2QC
5K375NP7U7JITYTRY2SOW3PQ6C5FTJIPCVGLQZ2SH3SQ5HVGCWHQC
FXA3ZBV64FML7W47IPHTAJFJHN3J3XHVHFVNYED47XFSBIGMBKRQC
T7GS3FVFOFBEFQXCLESYLY7JJ6D2ZVA2CAH5U4FRTKHTMPLINNXAC
JH5INW2G6JKUQ3QD2VMSAFEIGUUBCBD5RJGSQOVEVF3I7BKL7SKQC
TV5QPGKWYZO7GU2ULJN4KC5WY3GNAG3UKRVJGKFC4JLEJ42NTHMAC
LCKDIFG2ZAOKZXWFEQBTV7GSTYU4KB6DPHF3EDRCWCCHX6QYSUTQC
5XXDOQUZ5ORUD6TSVRSOHABWOB4OFPDPQCN4DFU6IF5YDZE3LNRQC
JMVLHRX7RFJCQFXESQT75DUKOLBQEO7LMKOZCI4QWQBQKAWLC2PAC
:00>
:LOGBOOK:
CLOCK: [2023-07-30 Sun 19:13]--[2023-07-30 Sun 20:50] => 1:37
:END:
Modification nécessaire pour kent :
- plus de patch
- suppression d'une boucle dans postPatch
On supprime aussi NIX_BUILD_TOP
*** DONE [[https://github.com/NixOS/nixpkgs/pull/186459][BioDBHTS]]
CLOSED: [2023-05-06 Sat 08:49] SCHEDULED: <2023-04-15 Sat>
/Entered on/ [2022-08-10 Wed 14:28]
Correction pour review faites <2022-10-10 Mon>
*** DONE [[https://github.com/NixOS/nixpkgs/pull/186464][BioExtAlign]]
CLOSED: [2022-10-22 Sat 12:43] SCHEDULED: <2022-08-10 Wed>
/Entered on/ [2022-08-10 Wed 14:28]
Review <2022-10-10 Mon>, correction dans la journée.
Correction 2e passe, attente
Impossible de faire marcher les tests Car il ne trouve pas le module Bio::Tools::Align, qui est dans un dossier ailleurs dans le dépôt. Même en compilant tout le dépôt, cela ne fonctionne pas... On skip les tests.
*** TODO VEP
** WAIT [[https://github.com/NixOS/nixpkgs/pull/230394][rtg-tools]] :vcfeval:
Soumis
** WAIT Package Spip https://github.com/NixOS/nixpkgs/pull/247476
** TODO Happy :happy:
*** PROJ PR python 3 upstream
*** PROJ nixpkgs en l'état
** TODO Bamsurgeon
/Entered on/ [2023-05-13 Sat 19:11]
*** TODO Velvet
** TODO PR Picard avec option pour gérer la mémoire
Similaire à
https://github.com/bioconda/bioconda-recipes/blob/master/recipes/picard/picard.sh
* Julia :julia:
** KILL XAM.jl: PR pour modification record :julia:
CLOSED: [2023-05-29 Mon 15:40] SCHEDULED: <2023-05-28 Sun>
/Entered on/ [2023-05-27 Sat 22:39]
** TODO XAMscissors.jl :xamscissors:
Modification de la séquence dans BAM.
*Pas de mise à jour de CIGAR*
On convertit en fastq et on lance le pipeline pour "corriger"
#+begin_src sh
cd /home/alex/code/bisonex/out/63003856/preprocessing/mapped
samtools view 63003856_S135.bam NC_000022.11 -o 63003856_S135_chr22.bam
cd /home/alex/recherche/bisonex/code/BamScissors.jl
cp ~/code/bisonex/out/63003856/preprocessing/mapped/63003856_S135_chr22.bam .
samtools index 63003856_chr22.bam
#+end_src
Le script va modifier le bam, le trier et générer le fastq. !!!
Attention: ne pas oublier l'option -n !!!
#+begin_src sh
time julia --project=.. insertVariant.jl
scp 63003856_S135_chr22_{1,2}.fq.gz meso:/Work/Users/apraga/bisonex/tests/bamscissors/
#+end_src
*** WAIT Implémenter les SNV avec VAF :snv:
Stratégie :
1. calculer la profondeur sur les positions
2. créer un dictionnaire { nom du reads : position dataframe }
3. itérer sur tous les reads et changer ceux marqués
**** DONE VAF = 1
CLOSED: [2023-05-29 Mon 15:34]
**** DONE VAF selon loi normale
CLOSED: [2023-05-29 Mon 15:35]
Tronquée si > 1
**** WAIT Tests unitaires
***** DONE NA12878: 1 gène sur chromosome 22
CLOSED: [2023-05-30 Tue 23:55]
root = "https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/"
#+begin_src sh
samtools view project.NIST_NIST7035_H7AP8ADXX_NA12878.bwa.markDuplicates.bam chr22 -o project.NIST_NIST7035_H7AP8ADXX_NA12878_chr22.bam
samtools view project.NIST_NIST7035_H7AP8ADXX_NA12878_chr22.bam chr22:19419700-19424000 -o NIST7035_H7AP8ADXX_NA12878_chr22_MRPL40_hg19.bam
#+end_src
***** WAIT Pull request formatspeciment
https://github.com/BioJulia/FormatSpecimens.jl/pull/8
***** DONE Formatspecimens
CLOSED: [2023-05-29 Mon 23:03]
****** DONE 1 read
CLOSED: [2023-05-29 Mon 23:02]
****** DONE VAF sur 1 exon
CLOSED: [2023-05-29 Mon 23:03]
**** DONE [#A] Bug: perte de nombreux reads avec NA12878
CLOSED: [2023-08-19 Sat 20:45] SCHEDULED: <2023-08-18 Fri>
:PROPERTIES:
:ID: 5c1c36f3-f68e-4e6d-a7b6-61dca89abc37
:END:
Ex: chrX:g.124056226 : on passe de 65 reads à 1
Test xamscissors: pas de soucis...
On teste sur cette position +/- 200bp
#+begin_src sh :dir /home/alex/roam/research/bisonex/code/sanger
samtools view /home/alex/code/bisonex/out/2300346867_NA12878-63118093_S260-GRCh38/preprocessing/mapped/2300346867_NA12878-63118093_S260-GRCh38.bam chrX:124056026-124056426 -o chrXsmall.bam
#+end_src
#+RESULTS:
***** DONE Vérifier profondeur avec dernière version :
CLOSED: [2023-08-19 Sat 20:34] SCHEDULED: <2023-08-19 Sat>
****** DONE chr20: profondeur ok
SCHEDULED: <2023-08-19 Sat>
****** DONE toutes les données
CLOSED: [2023-08-19 Sat 20:34] SCHEDULED: <2023-08-19 Sat>
Ok pour 7 variants (IGV) notament chromosome X
*** TODO Implémenter les indel avec VAF :indel:
*** TODO Soumission paquet
* Données
:PROPERTIES:
:CATEGORY: data
:END:
** DONE Remplacer bam par fastq sur mesocentre
CLOSED: [2023-04-16 Sun 16:33]
Commande
*** DONE Supprimer les fastq non "paired"
CLOSED: [2023-04-16 Sun 16:33]
nushell
Liste des fastq avec "paired-end" manquant
#+begin_src nu
ls **/*.fastq.gz | get name | path basename | split column "_" | get column1 | uniq -u | save single.txt
#+end_src
#+RESULTS:
: 62907927
: 62907970
: 62899606
: 62911287
: 62913201
: 62914084
: 62915905
: 62921595
: 62923065
: 62925220
: 62926503
: 62926502
: 62926500
: 62926499
: 62926498
: 62931719
: 62943423
: 62943400
: 62948290
: 62949205
: 62949206
: 62949118
: 62951284
: 62960792
: 62960785
: 62960787
: 62960617
: 62962561
: 62962692
: 62967473
: 62972194
: 62979102
On vérifie
#+begin_src nu
open single.txt | lines | each {|e| ls $"fastq/*_($in)/*" | get 0 }
open single.txt | lines | each {|e| ls $"fastq/*_($in)/*" | get 0.name } | path basename | split column "_" | get column1 | uniq -c
#+end_src
On met tous dans un dossier (pas de suppression )
#+begin_src
open single.txt | lines | each {|e| ls $"fastq/*_($in)/*" | get 0 } | each {|e| ^mv $e.name bad-fastq/}
#+end_src
On vérifie que les dossiier sont videsj
open single.txt | lines | each {|e| ls $"fastq/*_($in)" | get 0.name } | ^ls -l $in
Puis on supprime
open single.txt | lines | each {|e| ls $"fastq/*_($in)" | get 0.name } | ^rm -r $in
*** DONE Supprimer bam qui ont des fastq
CLOSED: [2023-04-16 Sun 16:33]
On liste les identifiants des fastq et bam dans un tableau avec leur type :
#+begin_src
let fastq = (ls fastq/*/*.fastq.gz | get name | parse "{dir}/{full_id}/{id}_{R}_001.fastq.gz" | select dir id | uniq )
let bam = (ls bam/*/*.bam | get name | parse "{dir}/{full_id}/{id}_{S}.bqrt.bam" | select dir id)
#+end_src
On groupe les résultat par identifiant (résultats = liste de records qui doit être convertie en table)
et on trie ceux qui n'ont qu'un fastq ou un bam
#+begin_src
let single = ( $bam | append $fastq | group-by id | transpose id files | get files | where {|x| ($x | length) == 1})
#+end_src
On convertit en table et on récupère seulement les bam
#+begin_src
$single | reduce {|it, acc| $acc | append $it} | where dir == bam | get id | each {|e| ^ls $"bam/*_($e)/*.bam"}
#+end_src
#+RESULTS:
: bam/2100656174_62913201/62913201_S52.bqrt.bam
: bam/2100733271_62925220/62925220_S33.bqrt.bam
: bam/2100738763_62926502/62926502_S108.bqrt.bam
: bam/2100746726_62926498/62926498_S105.bqrt.bam
: bam/2100787936_62931955/62931955_S4.bqrt.bam
: bam/2200066374_62948290/62948290_S130.bqrt.bam
: bam/2200074722_62948298/62948298_S131.bqrt.bam
: bam/2200074990_62948306/62948306_S218.bqrt.bam
: bam/2200214581_62967331/62967331_S267.bqrt.bam
: bam/2200225399_62972187/62972187_S85.bqrt.bam
: bam/2200293962_62979117/62979117_S63.bqrt.bam
: bam/2200423985_62999352/62999352_S1.bqrt.bam
: bam/2200495073_63010427/63010427_S20.bqrt.bam
: bam/2200511274_63012586/63012586_S114.bqrt.bam
: bam/2200669188_63036688/63036688_S150.bqrt.bam
* Nouveau workflow :workflow:
** TODO Bases de données
*** KILL Nix pour télécharger les données brutes
**** Conclusion
Non viable sur cluster car en dehors de /nix/store
On peut utiliser des symlink mais trop compliqué
**** KILL Axel au lieu de curl pour gérer les timeout?
CLOSED: [2022-08-19 Fri 15:18]
*** DONE Tester patch de @pennae pour gros fichiers
SCHEDULED: <2022-08-19 Fri>
*** KILL Télécharger les données avec nextflow: hg38
CLOSED: [2023-06-12 Mon 23:29]
**** DONE Genome de référence
**** DONE dbSNP
**** DONE VEP 20G
CLOSED: [2023-06-12
:00>
:LOGBOOK:
CLOCK: [2023-07-30 Sun 19:13]--[2023-07-30 Sun 20:50] => 1:37
:END:
Modification nécessaire pour kent :
- plus de patch
- suppression d'une boucle dans postPatch
On supprime aussi NIX_BUILD_TOP
*** DONE [[https://github.com/NixOS/nixpkgs/pull/186459][BioDBHTS]]
CLOSED: [2023-05-06 Sat 08:49] SCHEDULED: <2023-04-15 Sat>
/Entered on/ [2022-08-10 Wed 14:28]
Correction pour review faites <2022-10-10 Mon>
*** DONE [[https://github.com/NixOS/nixpkgs/pull/186464][BioExtAlign]]
CLOSED: [2022-10-22 Sat 12:43] SCHEDULED: <2022-08-10 Wed>
/Entered on/ [2022-08-10 Wed 14:28]
Review <2022-10-10 Mon>, correction dans la journée.
Correction 2e passe, attente
Impossible de faire marcher les tests Car il ne trouve pas le module Bio::Tools::Align, qui est dans un dossier ailleurs dans le dépôt. Même en compilant tout le dépôt, cela ne fonctionne pas... On skip les tests.
*** TODO VEP
** WAIT [[https://github.com/NixOS/nixpkgs/pull/230394][rtg-tools]] :vcfeval:
Soumis
** WAIT Package Spip https://github.com/NixOS/nixpkgs/pull/247476
** TODO Happy :happy:
*** PROJ PR python 3 upstream
*** PROJ nixpkgs en l'état
** PROJ SpliceAI
** TODO Bamsurgeon
/Entered on/ [2023-05-13 Sat 19:11]
*** TODO Velvet
** TODO PR Picard avec option pour gérer la mémoire
Similaire à
https://github.com/bioconda/bioconda-recipes/blob/master/recipes/picard/picard.sh
* Julia :julia:
** KILL XAM.jl: PR pour modification record :julia:
CLOSED: [2023-05-29 Mon 15:40] SCHEDULED: <2023-05-28 Sun>
/Entered on/ [2023-05-27 Sat 22:39]
** TODO XAMscissors.jl :xamscissors:
Modification de la séquence dans BAM.
*Pas de mise à jour de CIGAR*
On convertit en fastq et on lance le pipeline pour "corriger"
#+begin_src sh
cd /home/alex/code/bisonex/out/63003856/preprocessing/mapped
samtools view 63003856_S135.bam NC_000022.11 -o 63003856_S135_chr22.bam
cd /home/alex/recherche/bisonex/code/BamScissors.jl
cp ~/code/bisonex/out/63003856/preprocessing/mapped/63003856_S135_chr22.bam .
samtools index 63003856_chr22.bam
#+end_src
Le script va modifier le bam, le trier et générer le fastq. !!!
Attention: ne pas oublier l'option -n !!!
#+begin_src sh
time julia --project=.. insertVariant.jl
scp 63003856_S135_chr22_{1,2}.fq.gz meso:/Work/Users/apraga/bisonex/tests/bamscissors/
#+end_src
*** WAIT Implémenter les SNV avec VAF :snv:
Stratégie :
1. calculer la profondeur sur les positions
2. créer un dictionnaire { nom du reads : position dataframe }
3. itérer sur tous les reads et changer ceux marqués
**** DONE VAF = 1
CLOSED: [2023-05-29 Mon 15:34]
**** DONE VAF selon loi normale
CLOSED: [2023-05-29 Mon 15:35]
Tronquée si > 1
**** WAIT Tests unitaires
***** DONE NA12878: 1 gène sur chromosome 22
CLOSED: [2023-05-30 Tue 23:55]
root = "https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/"
#+begin_src sh
samtools view project.NIST_NIST7035_H7AP8ADXX_NA12878.bwa.markDuplicates.bam chr22 -o project.NIST_NIST7035_H7AP8ADXX_NA12878_chr22.bam
samtools view project.NIST_NIST7035_H7AP8ADXX_NA12878_chr22.bam chr22:19419700-19424000 -o NIST7035_H7AP8ADXX_NA12878_chr22_MRPL40_hg19.bam
#+end_src
***** WAIT Pull request formatspeciment
https://github.com/BioJulia/FormatSpecimens.jl/pull/8
***** DONE Formatspecimens
CLOSED: [2023-05-29 Mon 23:03]
****** DONE 1 read
CLOSED: [2023-05-29 Mon 23:02]
****** DONE VAF sur 1 exon
CLOSED: [2023-05-29 Mon 23:03]
**** DONE [#A] Bug: perte de nombreux reads avec NA12878
CLOSED: [2023-08-19 Sat 20:45] SCHEDULED: <2023-08-18 Fri>
:PROPERTIES:
:ID: 5c1c36f3-f68e-4e6d-a7b6-61dca89abc37
:END:
Ex: chrX:g.124056226 : on passe de 65 reads à 1
Test xamscissors: pas de soucis...
On teste sur cette position +/- 200bp
#+begin_src sh :dir /home/alex/roam/research/bisonex/code/sanger
samtools view /home/alex/code/bisonex/out/2300346867_NA12878-63118093_S260-GRCh38/preprocessing/mapped/2300346867_NA12878-63118093_S260-GRCh38.bam chrX:124056026-124056426 -o chrXsmall.bam
#+end_src
#+RESULTS:
***** DONE Vérifier profondeur avec dernière version :
CLOSED: [2023-08-19 Sat 20:34] SCHEDULED: <2023-08-19 Sat>
****** DONE chr20: profondeur ok
SCHEDULED: <2023-08-19 Sat>
****** DONE toutes les données
CLOSED: [2023-08-19 Sat 20:34] SCHEDULED: <2023-08-19 Sat>
Ok pour 7 variants (IGV) notament chromosome X
*** TODO Implémenter les indel avec VAF :indel:
*** TODO Soumission paquet
* Données
:PROPERTIES:
:CATEGORY: data
:END:
** DONE Remplacer bam par fastq sur mesocentre
CLOSED: [2023-04-16 Sun 16:33]
Commande
*** DONE Supprimer les fastq non "paired"
CLOSED: [2023-04-16 Sun 16:33]
nushell
Liste des fastq avec "paired-end" manquant
#+begin_src nu
ls **/*.fastq.gz | get name | path basename | split column "_" | get column1 | uniq -u | save single.txt
#+end_src
#+RESULTS:
: 62907927
: 62907970
: 62899606
: 62911287
: 62913201
: 62914084
: 62915905
: 62921595
: 62923065
: 62925220
: 62926503
: 62926502
: 62926500
: 62926499
: 62926498
: 62931719
: 62943423
: 62943400
: 62948290
: 62949205
: 62949206
: 62949118
: 62951284
: 62960792
: 62960785
: 62960787
: 62960617
: 62962561
: 62962692
: 62967473
: 62972194
: 62979102
On vérifie
#+begin_src nu
open single.txt | lines | each {|e| ls $"fastq/*_($in)/*" | get 0 }
open single.txt | lines | each {|e| ls $"fastq/*_($in)/*" | get 0.name } | path basename | split column "_" | get column1 | uniq -c
#+end_src
On met tous dans un dossier (pas de suppression )
#+begin_src
open single.txt | lines | each {|e| ls $"fastq/*_($in)/*" | get 0 } | each {|e| ^mv $e.name bad-fastq/}
#+end_src
On vérifie que les dossiier sont videsj
open single.txt | lines | each {|e| ls $"fastq/*_($in)" | get 0.name } | ^ls -l $in
Puis on supprime
open single.txt | lines | each {|e| ls $"fastq/*_($in)" | get 0.name } | ^rm -r $in
*** DONE Supprimer bam qui ont des fastq
CLOSED: [2023-04-16 Sun 16:33]
On liste les identifiants des fastq et bam dans un tableau avec leur type :
#+begin_src
let fastq = (ls fastq/*/*.fastq.gz | get name | parse "{dir}/{full_id}/{id}_{R}_001.fastq.gz" | select dir id | uniq )
let bam = (ls bam/*/*.bam | get name | parse "{dir}/{full_id}/{id}_{S}.bqrt.bam" | select dir id)
#+end_src
On groupe les résultat par identifiant (résultats = liste de records qui doit être convertie en table)
et on trie ceux qui n'ont qu'un fastq ou un bam
#+begin_src
let single = ( $bam | append $fastq | group-by id | transpose id files | get files | where {|x| ($x | length) == 1})
#+end_src
On convertit en table et on récupère seulement les bam
#+begin_src
$single | reduce {|it, acc| $acc | append $it} | where dir == bam | get id | each {|e| ^ls $"bam/*_($e)/*.bam"}
#+end_src
#+RESULTS:
: bam/2100656174_62913201/62913201_S52.bqrt.bam
: bam/2100733271_62925220/62925220_S33.bqrt.bam
: bam/2100738763_62926502/62926502_S108.bqrt.bam
: bam/2100746726_62926498/62926498_S105.bqrt.bam
: bam/2100787936_62931955/62931955_S4.bqrt.bam
: bam/2200066374_62948290/62948290_S130.bqrt.bam
: bam/2200074722_62948298/62948298_S131.bqrt.bam
: bam/2200074990_62948306/62948306_S218.bqrt.bam
: bam/2200214581_62967331/62967331_S267.bqrt.bam
: bam/2200225399_62972187/62972187_S85.bqrt.bam
: bam/2200293962_62979117/62979117_S63.bqrt.bam
: bam/2200423985_62999352/62999352_S1.bqrt.bam
: bam/2200495073_63010427/63010427_S20.bqrt.bam
: bam/2200511274_63012586/63012586_S114.bqrt.bam
: bam/2200669188_63036688/63036688_S150.bqrt.bam
* Nouveau workflow :workflow:
** TODO Bases de données
*** KILL Nix pour télécharger les données brutes
**** Conclusion
Non viable sur cluster car en dehors de /nix/store
On peut utiliser des symlink mais trop compliqué
**** KILL Axel au lieu de curl pour gérer les timeout?
CLOSED: [2022-08-19 Fri 15:18]
*** DONE Tester patch de @pennae pour gros fichiers
SCHEDULED: <2022-08-19 Fri>
*** KILL Télécharger les données avec nextflow: hg38
CLOSED: [2023-06-12 Mon 23:29]
**** DONE Genome de référence
**** DONE dbSNP
**** DONE VEP 20G
CLOSED: [2023-06-12
epIntron | 0 | O
utside SPiCE Interpretation | 0 | 0 | No | 10.00000 | 89894644 | Don | 0.0001360257829 | No | Don | 89894485 | 159 | 0 | 0.0000000000000 | No | 89894485 | 0.07177992 | Yes | 0.07177992 | Yes |
| chr10 | 89894645 | lol | A | G | . | . | . | NR_028034:g.89894645:A>G | NTR | 00 % [00 % - 00.92 %] | 0.000 | + | 89894645 | substitution | A>G | Intron 3 | 1398 | NR_028034 | FAS | donor | 160 | DeepIntron | 0 | Outside SPiCE Interpretation | 0 | 0 | No | 10.00000 | 89894644 | Don | 0.0001360257829 | No | Don | 89894485 | 159 | 0 | 0.0000000000000 | No | 89894485 | 0.07177992 | Yes | 0.07177992 | Yes |
| chr10 | 89894645 | lol | A | G | . | . | . | NR_028035:g.89894645:A>G | Alter ESR | 35.81 % [28.11 % - 44.1 %] | 0.288 | + | 89894645 | substitution | A>G | Exon 4 | 63 | NR_028035 | FAS | acceptor | 8 | ExonESR | 0 | Outside SPiCE Interpretation | 0 | 0 | No | -1.67753 | 89894644 | Acc | 0.0000003317384 | No | Acc | 89894637 | 7 | 89894644 | 0.0000002205815 | No | 89894637 | 0.02545572 | No | 0.02545572 | No |
| chr10 | 89894645 | lol | A | G | . | . | . | NR_028036:g.89894645:A>G | Alter ESR | 35.81 % [28.11 % - 44.1 %] | 0.288 | + | 89894645 | substitution | A>G | Exon 5 | 63 | NR_028036 | FAS | acceptor | 8 | ExonESR | 0 | Outside SPiCE Interpretation | 0 | 0 | No | -1.67753 | 89894644 | Acc | 0.0000003317384 | No | Acc | 89894637 | 7 | 89894644 | 0.0000002205815 | No | 89894637 | 0.02545572 | No | 0.02545572 | No |
| chr10 | 89894645 | lol | A | G | . | . | . | NR_135313:g.89894645:A>G | Alter ESR | 35.81 % [28.11 % - 44.1 %] | 0.288 | + | 89894645 | substitution | A>G | Exon 5 | 63 | NR_135313 | FAS | acceptor | 8 | ExonESR | 0 | Outside SPiCE Interpretation | 0 | 0 | No | -1.67753 | 89894644 | Acc | 0.0000003317384 | No | Acc | 89894637 | 7 | 89894644 | 0.0000002205815 | No | 89894637 | 0.02545572 | No | 0.02545572 | No |
| chr10 | 89894645 | lol | A | G | . | . | . | NM_001410956:g.89894645:A>G | Alter ESR | 35.81 % [28.11 % - 44.1 %] | 0.288 | + | 89894645 | substitution | A>G | Exon 6 | 63 | NM_001410956 | FAS | acceptor | 8 | ExonESR | 0 | Outside SPiCE Interpretation | 0 | 0 | No | -1.67753 | 89894644 | Acc | 0.0000003317384 | No | Acc | 89894637 | 7 | 89894644 | 0.0000002205815 | No | 89894637 | 0.02545572 | No | 0.02545572 | No |
| chr10 | 89894645 | lol | A | G | . | . | . | NR_135314:g.89894645:A>G | Alter ESR | 35.81 % [28.11 % - 44.1 %] | 0.288 | + | 89894645 | substitution | A>G | Exon 6 | 63 | NR_135314 | FAS | acceptor | 8 | ExonESR | 0 | Outside SPiCE Interpretation | 0 | 0 | No | -1.67753 | 89894644 | Acc | 0.0000003317384 | No | Acc | 89894637 | 7 | 89894644 | 0.0000002205815 | No | 89894637 | 0.02545572 | No | 0.02545572 | No |
| chr10 | 89894645 | lol | A | G | . | . | . | NR_135315:g.89894645:A>G | Alter ESR | 35.81 % [28.11 % - 44.1 %] | 0.288 | + | 89894645 | substitution | A>G | Exon 4 | 63 | NR_135315 | FAS | acceptor | 8 | ExonESR | 0 | Outside SPiCE Interpretation | 0 | 0 | No | -1.67753 | 89894644 | Acc | 0.0000003317384 | No | Acc | 89894637 | 7 | 89894644 | 0.0000002205815 | No | 89894637 | 0.02545572 | No | 0.02545572 | No |
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
**** DONE Vérifier multiples transcripts en hg38 avec coordonées génomiquues: ok
CLOSED: [2023-08-10 Thu 23:00]
Beaucoup plus de transcrits en T2T
Ex: 1 transcrit refseq curated
http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr11%3A108257446%2D108257496&hgsid=1672963428_J5aWAqack2FpJ7mvhFTNVw7bKzxo
vs 2 transcrits en T2T
http://genome.ucsc.edu/cgi-bin/hgTracks?db=hub_3671779_hs1&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr11%3A108264969%2D108265019&hgsid=1672963612_Eso9frdQ7z6RkKkcKsIf2Waq3pec
C'est bien ce qu'on retrouve avec spip
*** DONE [#A] Filtre vep avec spip
CLOSED: [2023-08-13 Sun 00:39] SCHEDULED: <2023-08-12 Sat 19:00>
*** DONE Annotation CADD + spliceAI GRCh38 avec nouvelle version :annotation:
CLOSED: [2023-08-28 Mon 17:21] SCHEDULED: <2023-08-20 Sun>
*** DONE OMIM: possible seulement sur nom du gènes:annotation:
CLOSED: [2023-08-13 Sun 11:57] SCHEDULED: <2023-08-13 Sun 16:00>
Base de données non disponible et compliqué de faire la mise à jour nous.
Si on essaie de prendre les gènes de GRCH38, ils ne sont pas forcément en T2T
Ex: DDX11L17 n'existe pas dans T2T à ces coordonées
zgrep DDX11L17 GCF_009914755.1_T2T-CHM13v2.0_genomic.gff.gz
Note: c'est un pseudogene
https://www.genecards.org/cgi-bin/carddisp.pl?gene=DDX11L17
Si on prend les gènes de T2T, il y en a des nouveaux.
Ex: le premier est LOC101928626.
À cette position, rien en GRCh38
Si on essaye avec ENSEMBL: non car n'ont pas le même identifiant
Ex: ACHE
Idéalement, il faudrait l'identifiant NCBI (disponible dans OMIM) mais n'est pas en sortie de VEP
Et cela demande la version "merged" donc impossible en T2T
Est-ce faisable de faire une correspondance sur le nom du gène ?
Tous les gènes de T2T:
#+begin_src sh :dir ~/Downloads
zgrep -o "ID=gene[^;]*;" GCF_009914755.1_T2T-CHM13v2.0_genomic.gff.gz | sed 's/ID=gene-//;s/;//' | sort | uniq > t2t-genes.txt
wc -l t2t-genes.txt
#+end_src
#+RESULTS:
: 57660 t2t-genes.txt
#+begin_src sh :dir ~/Downloads
zgrep -o "ID=gene[^;]*;" GCF_000001405.40_GRCh38.p14_genomic.gff.gz | sed 's/ID=gene-//;s/;//' | sort | uniq > hg38-genes.txt
wc -l hg38-genes.txt
#+end_src
#+RESULTS:
: 67127 hg38-genes.txt
Gènes communs aux 2
#+begin_src sh :dir ~/Downloads
comm -12 t2t-genes.txt hg38-genes.txt | wc -l
#+end_src
#+RESULTS:
: 54506
Gènes uniquements dans t2t
#+
begin_src sh :dir ~/Downloads
comm -23 t2t-genes.txt hg38-genes.txt | wc -l
#+end_src
#+RESULTS:
: 3154
Gènes uniquements dans GRCh38
#+begin_src sh :dir ~/Downloads
comm -13 t2t-genes.txt hg38-genes.txt | wc -l
#+end_src
#+RESULTS:
: 12621
*** HOLD OMIM sur nom du gène :annotation:
*** DONE Mobidetails API
CLOSED: [2023-09-10 Sun 16:44]
Trop long ... 1h à 1h30 d'exécution
Disponible dans module
*** DONE Filtre vep avec spip for T2T et spliceAI pour GRCh38
CLOSED: [2023-09-16 Sat 22:47]
*** DONE Repasser tests en GRCh38 avec nouveau filtre (spip ou splice ai) :sanger:
CLOSED: [2023-09-17 Sun 09:07] SCHEDULED: <2023-09-16 Sat>
*** HOLD Franklin API
https://www.postman.com/genoox-ps/workspace/franklin-api-documentation-s-public-workspace/documentation/6621518-4335389d-12e3-445f-8182-339df95b2a09
*** KILL Regarder si clinique disponible avec vep :annotation:
CLOSED: [2023-09-10 Sun 16:44]
** DONE [#B] Indicateurs qualité :qualité:
CLOSED: [2023-09-10 Sun 16:46]
*** Idée
Raredisease:
- FastQC : nombreuses statistiques. Non disponible Nix
- Mosdepth : calcule la profondeur (2x plus rapide que samtools depth). Nix
- MultiQC : fusionne juste les résultats des analyses. Non disponible nix
- Picard's CollectMutipleMetrics, CollectHsMetrics, and CollectWgsMetrics
- Qualimap : alternative fastqc ? Non disponible nix
- Sentieon's WgsMetricsAlgo : propriétaire
- TIDDIT's cov : TIDIT = remaninement chromosomique
Sarek:
- alignment statistics : samtools stats, mosdepth
- QC : MultiQC
MultiQC : non disponible Nix
*** DONE FastqQC
CLOSED: [2023-08-15 Tue 21:43] SCHEDULED: <2023-08-13 Sun>
*** DONE Mosdepth
CLOSED: [2023-08-15 Tue 21:43] SCHEDULED: <2023-08-13 Sun>
Pour exomple, il faut le fichier de capture
subworkflows/local/bam_markduplicates/
*** DONE Samtools stats
CLOSED: [2023-08-15 Tue 21:43] SCHEDULED: <2023-08-13 Sun>
*** DONE [#B] Compte-redu exécution avec MultiQC
CLOSED: [2023-08-15 Tue 21:43] SCHEDULED: <2023-08-13 Sun>
*** DONE Résultats sur NA12878 : 98% à 20x
CLOSED: [2023-08-19 Sat 20:45] SCHEDULED: <2023-08-17 Thu>
**** DONE Comprendre 91% à 20x seulement: SNVs inséré
CLOSED: [2023-08-18 Fri 22:25]
***** DONE Tester autre kit : Twist exome comprehensive
CLOSED: [2023-08-18 Fri 22:24]
Moins bon
***** DONE Tester génome sans alt
CLOSED: [2023-08-18 Fri 22:25]
Idem
***** DONE Tester NA12878 sans SNVs inséré: cause !!
CLOSED: [2023-08-18 Fri 22:25]
***** DONE Tester hg19 sur NA12878 non inséré
CLOSED: [2023-08-18 Fri 22:25]
**** DONE Comprendre pourquoi SNVs diminuent le score: reads manquants
CLOSED: [2023-08-19 Sat 20:34] SCHEDULED: <2023-08-18 Fri>
Voir [[id:5c1c36f3-f68e-4e6d-a7b6-61dca89abc37][Bug: perte de nombreux reads avec NA12878]]
*** DONE Relancer résultats avec NA1287 et NA12878 + sanger
CLOSED: [2023-08-29 Tue 10:30] SCHEDULED: <2023-08-29 Tue>
*** DONE Comparer avec hg19
CLOSED: [2023-08-28 Mon 17:22] SCHEDULED: <2023-08-20 Sun>
*** DONE Comparer avec autres kit de capture
CLOSED: [2023-08-28 Mon 17:22] SCHEDULED: <2023-08-20 Sun>
*** DONE Comparer avec no-alt
CLOSED: [2023-08-28 Mon 17:22] SCHEDULED: <2023-08-20 Sun>
** HOLD vérifier si normalisation
** KILL [#B] Vérification nomenclature hgvs :hgvs:
CLOSED: [2023-08-16 Wed 19:07] SCHEDULED: <2023-08-15 Tue>
*** KILL mutalyzer
CLOSED: [2023-08-16 Wed 19:07] SCHEDULED: <2023-08-13 Sun>
*** KILL API variantvalidator
CLOSED: [2023-08-16 Wed 19:07] SCHEDULED: <2023-08-13 Sun>
** DONE Exécution
CLOSED: [2022-09-13 Tue 21:37]
*** KILL test Bionix
*** KILL Implémenter execution avec Nix ?
Voir https://academic.oup.com/gigascience/article/9/11/giaa121/5987272?login=false
pour un exemple.
Probablement plus simple d’utiliser Nix pour gestion de l’environnement et snakemake pour l’exécution
Pas d’accès internet depuis le cluster
*** DONE nextflow
CLOSED: [2022-09-13 Tue 21:37]
**** TODO Bug scheduler SGE
Le job se fait tuer car l'utilisateur n'est pas passé correctement à nextflow
***** DONE Forcer l'utilisateur à l'exécution
CLOSED: [2023-04-01 Sat 17:57]
NXF_OPTS=-D"user.name=alex"
***** DONE Vérifier si le problème persiste avec 22.10.6
CLOSED: [2023-04-01 Sat 18:38] SCHEDULED: <2023-04-01 Sat>
oui
***** KILL Packager l'utilisateur dans le programme ?
Mauvaise idée..
*** DONE Diminuer mémoire pour haplotypecaller
CLOSED: [2023-09-20 Wed 21:44] SCHEDULED: <2023-09-19 Tue>
/Entered on/ [2023-09-19 Tue 15:30]
Medium = 32Go pour 6 coeurs => 4 jobs (donc tout le noeud) prend plus que les 96GB...
On essaie 16Gb
Puis commit
*** DONE Report multiqc avec 10 runs
CLOSED: [2023-09-19 Tue 15:31] SCHEDULED: <2023-09-19 Tue>
/Entered on/ [2023-09-19 Tue 15:31]
Cf mail 2023-09-19
*** WAIT Bug: variant sur 7788314 pour patient 62982193 filtré : DP < 30
SCHEDULED: <2023-09-25 Mon>
/Entered on/ [2023-09-22 Fri 22:59]
35 selon IGV mais 27 en pratique dans le VCF.
VCF cento: 26 reads également...
Non confirmé sanger
Mail envoyé Alexis
** DONE Mettre à jour spip pour corriger bug 62982239 : variant trop long (?)
CLOSED: [2023-09-22 Fri 22:22] SCHEDULED: <2023-09-21 Thu>
/Entered on/ [2023-09-21 Thu 23:11]
Rapporté par https://github.com/raphaelleman/SPiP/issues/9
*** DONE Relance run
CLOSED: [2023-09-22 Fri 22:22] SCHEDULED: <2023-09-21 Thu>
*** DONE Mise à jour spip
CLOSED: [2023-09-21 Thu 23:41] SCHEDULED: <2023-09-21 Thu>
** DONE nixpkgs unstable -> 23.05
CLOSED: [2023-09-22 Fri 22:22] SCHEDULED: <2023-09-21 Thu>
*** DONE repasser tests sanger
CLOSED: [2023-09-22 Fri 22:22] SCHEDULED: <2023-09-21 Thu>
** TODO Preprocessing avec nextflow
*** TODO Map to reference
**** TODO Sample ID dans header
/Work/Users/apraga/bisonex/out/63003856_S135/preprocessing/baserecalibrator
*** DONE Mark duplicate
CLOSED: [2022-10-09 Sun 22:30]
*** DONE Recalibrate base quality score
CLOSED: [2022-10-09 Sun 22:30]
** DONE Variant calling avec Nextflow
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Haplotype caller
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter variants
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter common snp not clinvar path
CLOSED: [2022-11-07 Mon 23:00]
Voir [[*common dbSNP not clinvar patho][common dbSNP not clinvar patho]]
*** DONE Filter variant only in consensual sequence
CLOSED: [2022-11-08 Tue 22:23]
*** DONE Filter technical variants
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Utilise AVX pour accélerer l'exécution
CLOSED: [2023-04-29 Sat 15:46]
Sans cela, on a l'avertissement
#+begin_quote
17:28:00.720 INFO PairHMM - OpenMP multi-threaded AVX-accelerated native PairHMM implementation is not supported
17:28:00.721 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/nix/store/cy9ckxqwrkifx7wf02hm4ww1p6lnbxg9-gatk-4.2.4.1/bin/gatk-package-4.2.4.1-local.jar!/com/intel/gkl/native/libgkl_utils.so
17:28:00.733 WARN NativeLibraryLoader - Unable to load libgkl_utils.so from native/libgkl_utils.so (/Work/Users/apraga/bisonex/out/NA12878_NIST7035/preprocessing/applybqsr/libgkl_utils821485189051585397.so: libgomp.so.1: cannot open shared object file: No such file or directory)
17:28:00.733 WARN IntelPairHmm - Intel GKL Utils not loaded
17:28:00.733 WARN PairHMM - ***WARNING: Machine does not have the AVX instruction set support needed for the accelerated AVX PairHmm. Falling back to the MUCH slower LOGLESS_CACHING implementation!
17:28:00.763 INFO ProgressMeter - Starting traversal
#+end_quote
libgomp.so est fourni par gcc donc il faut charger le module
module load gcc@11.3.0/gcc-12.1.0
** KILL Utiliser subworkflow
CLOSED: [2023-04-02 Sun 18:08]
Notre version permet d'être plus souple
*** KILL Alignement
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
*** KILL Vep
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
vcf_annotate_ensemblvep
** TODO Annotation avec nextflow :annotation:
*** KILL VEP : --gene-phenotype ?
CLOSED: [2023-04-18 mar. 18:32]
Vu avec alexis : bases de données non à jour
https://www.ensembl.org/info/genome/variation/phenotype/sources_phenotype_documentation.html
*** DONE plugin VEP
CLOSED: [2023-04-18 mar. 18:32]
Cloner dépôt git avec plugin
Puis utiliser --dir_plugins
*** HOLD Utiliser code d’Ale
epIntron | 0 | Outside SPiCE Interpretation | 0 | 0 | No | 10.00000 | 89894644 | Don | 0.0001360257829 | No | Don | 89894485 | 159 | 0 | 0.0000000000000 | No | 89894485 | 0.07177992 | Yes | 0.07177992 | Yes |
|chr10129957338-T-C chr10 | 89894645 | lol | A | G | . | . | . | NR_028034:g.89894645:A>G | NTR | 00 % [00 % - 00.92 %] | 0.000 | + | 89894645 | substitution | A>G | Intron 3 | 1398 | NR_028034 | FAS | donor | 160 | DeepIntron | 0 | Outside SPiCE Interpretation | 0 | 0 | No | 10.00000 | 89894644 | Don | 0.0001360257829 | No | Don | 89894485 | 159 | 0 | 0.0000000000000 | No | 89894485 | 0.07177992 | Yes | 0.07177992 | Yes |
| chr10 | 89894645 | lol | A | G | . | . | . | NR_028035:g.89894645:A>G | Alter ESR | 35.81 % [28.11 % - 44.1 %] | 0.288 | + | 89894645 | substitution | A>G | Exon 4 | 63 | NR_028035 | FAS | acceptor | 8 | ExonESR | 0 | Outside SPiCE Interpretation | 0 | 0 | No | -1.67753 | 89894644 | Acc | 0.0000003317384 | No | Acc | 89894637 | 7 | 89894644 | 0.0000002205815 | No | 89894637 | 0.02545572 | No | 0.02545572 | No |
| chr10 | 89894645 | lol | A | G | . | . | . | NR_028036:g.89894645:A>G | Alter ESR | 35.81 % [28.11 % - 44.1 %] | 0.288 | + | 89894645 | substitution | A>G | Exon 5 | 63 | NR_028036 | FAS | acceptor | 8 | ExonESR | 0 | Outside SPiCE Interpretation | 0 | 0 | No | -1.67753 | 89894644 | Acc | 0.0000003317384 | No | Acc | 89894637 | 7 | 89894644 | 0.0000002205815 | No | 89894637 | 0.02545572 | No | 0.02545572 | No |
| chr10 | 89894645 | lol | A | G | . | . | . | NR_135313:g.89894645:A>G | Alter ESR | 35.81 % [28.11 % - 44.1 %] | 0.288 | + | 89894645 | substitution | A>G | Exon 5 | 63 | NR_135313 | FAS | acceptor | 8 | ExonESR | 0 | Outside SPiCE Interpretation | 0 | 0 | No | -1.67753 | 89894644 | Acc | 0.0000003317384 | No | Acc | 89894637 | 7 | 89894644 | 0.0000002205815 | No | 89894637 | 0.02545572 | No | 0.02545572 | No |
| chr10 | 89894645 | lol | A | G | . | . | . | NM_001410956:g.89894645:A>G | Alter ESR | 35.81 % [28.11 % - 44.1 %] | 0.288 | + | 89894645 | substitution | A>G | Exon 6 | 63 | NM_001410956 | FAS | acceptor | 8 | ExonESR | 0 | Outside SPiCE Interpretation | 0 | 0 | No | -1.67753 | 89894644 | Acc | 0.0000003317384 | No | Acc | 89894637 | 7 | 89894644 | 0.0000002205815 | No | 89894637 | 0.02545572 | No | 0.02545572 | No |
| chr10 | 89894645 | lol | A | G | . | . | . | NR_135314:g.89894645:A>G | Alter ESR | 35.81 % [28.11 % - 44.1 %] | 0.288 | + | 89894645 | substitution | A>G | Exon 6 | 63 | NR_135314 | FAS | acceptor | 8 | ExonESR | 0 | Outside SPiCE Interpretation | 0 | 0 | No | -1.67753 | 89894644 | Acc | 0.0000003317384 | No | Acc | 89894637 | 7 | 89894644 | 0.0000002205815 | No | 89894637 | 0.02545572 | No | 0.02545572 | No |
| chr10 | 89894645 | lol | A | G | . | . | . | NR_135315:g.89894645:A>G | Alter ESR | 35.81 % [28.11 % - 44.1 %] | 0.288 | + | 89894645 | substitution | A>G | Exon 4 | 63 | NR_135315 | FAS | acceptor | 8 | ExonESR | 0 | Outside SPiCE Interpretation | 0 | 0 | No | -1.67753 | 89894644 | Acc | 0.0000003317384 | No | Acc | 89894637 | 7 | 89894644 | 0.0000002205815 | No | 89894637 | 0.02545572 | No | 0.02545572 | No |
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
**** DONE Vérifier multiples transcripts en hg38 avec coordonées génomiquues: ok
CLOSED: [2023-08-10 Thu 23:00]
Beaucoup plus de transcrits en T2T
Ex: 1 transcrit refseq curated
http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr11%3A108257446%2D108257496&hgsid=1672963428_J5aWAqack2FpJ7mvhFTNVw7bKzxo
vs 2 transcrits en T2T
http://genome.ucsc.edu/cgi-bin/hgTracks?db=hub_3671779_hs1&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr11%3A108264969%2D108265019&hgsid=1672963612_Eso9frdQ7z6RkKkcKsIf2Waq3pec
C'est bien ce qu'on retrouve avec spip
*** DONE [#A] Filtre vep avec spip
CLOSED: [2023-08-13 Sun 00:39] SCHEDULED: <2023-08-12 Sat 19:00>
*** DONE Annotation CADD + spliceAI GRCh38 avec nouvelle version :annotation:
CLOSED: [2023-08-28 Mon 17:21] SCHEDULED: <2023-08-20 Sun>
*** DONE OMIM: possible seulement sur nom du gènes:annotation:
CLOSED: [2023-08-13 Sun 11:57] SCHEDULED: <2023-08-13 Sun 16:00>
Base de données non disponible et compliqué de faire la mise à jour nous.
Si on essaie de prendre les gènes de GRCH38, ils ne sont pas forcément en T2T
Ex: DDX11L17 n'existe pas dans T2T à ces coordonées
zgrep DDX11L17 GCF_009914755.1_T2T-CHM13v2.0_genomic.gff.gz
Note: c'est un pseudogene
https://www.genecards.org/cgi-bin/carddisp.pl?gene=DDX11L17
Si on prend les gènes de T2T, il y en a des nouveaux.
Ex: le premier est LOC101928626.
À cette position, rien en GRCh38
Si on essaye avec ENSEMBL: non car n'ont pas le même identifiant
Ex: ACHE
Idéalement, il faudrait l'identifiant NCBI (disponible dans OMIM) mais n'est pas en sortie de VEP
Et cela demande la version "merged" donc impossible en T2T
Est-ce faisable de faire une chr10129957338-T-Ccorrespondance sur le nom du gène ?
Tous les gènes de T2T:
#+begin_src sh :dir ~/Downloads
zgrep -o "ID=gene[^;]*;" GCF_009914755.1_T2T-CHM13v2.0_genomic.gff.gz | sed 's/ID=gene-//;s/;//' | sort | uniq > t2t-genes.txt
wc -l t2t-genes.txt
#+end_src
#+RESULTS:
: 57660 t2t-genes.txt
#+begin_src sh :dir ~/Downloads
zgrep -o "ID=gene[^;]*;" GCF_000001405.40_GRCh38.p14_genomic.gff.gz | sed 's/ID=gene-//;s/;//' | sort | uniq > hg38-genes.txt
wc -l hg38-genes.txt
#+end_src
#+RESULTS:
: 67127 hg38-genes.txt
Gènes communs aux 2
#+begin_src sh :dir ~/Downloads
comm -12 t2t-genes.txt hg38-genes.txt | wc -l
#+end_src
#+RESULTS:
: 54506
Gènes uniquements dans t2t
#+begin_src sh :dir ~/Downloads
comm -23 t2t-genes.txt hg38-genes.txt | wc -l
#+end_src
#+RESULTS:
: 3154
Gènes uniquements dans GRCh38
#+begin_src sh :dir ~/Downloads
comm -13 t2t-genes.txt hg38-genes.txt | wc -l
#+end_src
#+RESULTS:
: 12621
*** HOLD OMIM sur nom du gène :annotation:
*** DONE Mobidetails API
CLOSED: [2023-09-10 Sun 16:44]
Trop long ... 1h à 1h30 d'exécution
Disponible dans module
*** DONE Filtre vep avec spip for T2T et spliceAI pour GRCh38
CLOSED: [2023-09-16 Sat 22:47]
*** DONE Repasser tests en GRCh38 avec nouveau filtre (spip ou splice ai) :sanger:
CLOSED: [2023-09-17 Sun 09:07] SCHEDULED: <2023-09-16 Sat>
*** HOLD Franklin API
https://www.postman.com/genoox-ps/workspace/franklin-api-documentation-s-public-workspace/documentation/6621518-4335389d-12e3-445f-8182-339df95b2a09
*** KILL Regarder si clinique disponible avec vep :annotation:
CLOSED: [2023-09-10 Sun 16:44]
*** TODO Tester filtre sans splice
SCHEDULED: <2023-09-27 Wed>
Mail Paul: Exome donc hors splice, peu intéressant
**** DONE Enlever complètement condition splice: 6130 variants restants... comme spliceAI
CLOSED: [2023-09-27 Wed 19:37] SCHEDULED: <2023-09-26 Tue>
Cf [[id:c9b2009a-503b-4561-94c6-29ae21a3188d][Filtre vep avec spliceAI: 37365 -> 6130]]
Dans tests/splicai
#+begin_src sh
filter_vep -i output-all-gpu.vcf --format vcf --filter " not(Consequence matches non_coding_transcript or Consequence matches stream or Consequence matches intergenic_variant or Consequence matches UTR or Consequence matches intron_variant or Consequence matches synonymous or BIOTYPE matches pseudogene or BIOTYPE matches misc_RNA)" --only_matched -o test.vcf
grep -c -v '^#' test.vcf
6130
#+end_src
**** DONE Remplacer par impact fonctionnel: peu d'impact : majorité = MODERATE
CLOSED: [2023-09-27 Wed 19:45] SCHEDULED: <2023-09-26 Tue>
filter_vep -i output-all-gpu-filtered.vcf --format vcf --filter "IMPACT is HIGH" --only_matched | grep -c -v '^#'
258
filter_vep -i output-all-gpu-filtered.vcf --format vcf --filter "IMPACT is LOW" --only_matched | grep -c -v '^#'
11
filter_vep -i output-all-gpu-filtered.vcf --format vcf --filter "IMPACT is MODERATE" --only_matched | grep -c -v '^#'
5824
**** DONE Regarder les conséquences pour tes les transcripts
CLOSED: [2023-09-27 Wed 21:04]
/Work/Users/apraga/bisonex/out/annotate/vep/NA12878-sanger-all-T2T
filter_vep -i NA12878-sanger-all-T2T.vep.vcf.gz --format vcf --filter " not(Consequence matches non_coding_transcript or Consequence matches stream or Consequence matches intergenic_variant or Consequence matches UTR or Consequence matches intron_variant or Consequence matches synonymous or BIOTYPE matches pseudogene or BIOTYPE matches misc_RNA)" --only_matched -o filtered.vcf
bcftools +split-vep filtered.vcf -f '%Consequence\n' -d | sort | uniq -c
94 coding_sequence_variant
13 coding_sequence_variant&NMD_transcript_variant
257 frameshift_variant
21 frameshift_variant&NMD_transcript_variant
2 frameshift_variant&splice_donor_region_variant
20 frameshift_variant&splice_region_variant
1 frameshift_variant&splice_region_variant&NMD_transcript_variant
1 incomplete_terminal_codon_variant&coding_sequence_variant
211 inframe_deletion
18 inframe_deletion&NMD_transcript_variant
6 inframe_deletion&splice_region_variant
242 inframe_insertion
22 inframe_insertion&NMD_transcript_variant
4 inframe_insertion&splice_region_variant
14689 missense_variant
1416 missense_variant&NMD_transcript_variant
6 missense_variant&splice_donor_5th_base_variant
374 missense_variant&splice_region_variant
34 missense_variant&splice_region_variant&NMD_transcript_variant
53 splice_acceptor_variant
11 splice_acceptor_variant&NMD_transcript_variant
79 splice_donor_variant
6 splice_donor_variant&NMD_transcript_variant
30 start_lost
5 start_lost&NMD_transcript_variant
135 stop_gained
13 stop_gained&frameshift_variant
3 stop_gained&frameshift_variant&NMD_transcript_variant
2 stop_gained&frameshift_variant&splice_region_variant
14 stop_gained&NMD_transcript_variant
5 stop_gained&splice_region_variant
2 stop_gained&splice_region_variant&NMD_transcript_variant
4 stop_lost
1 stop_lost&NMD_transcript_variant
9 stop_retained_variant
6 stop_retained_variant&NMD_transcript_variant
1 transcript_ablation
Idem tests/spliceai
bcftools +split-vep output-all-gpu-filtered.vcf -f '%Consequence\n' -d | sort | uniq -c
94 coding_sequence_variant
13 coding_sequence_variant&NMD_transcript_variant
257 frameshift_variant
21 frameshift_variant&NMD_transcript_variant
2 frameshift_variant&splice_donor_region_variant
20 frameshift_variant&splice_region_variant
1 frameshift_variant&splice_region_variant&NMD_transcript_variant
1 incomplete_terminal_codon_variant&coding_sequence_variant
211 inframe_deletion
18 inframe_deletion&NMD_transcript_variant
6 inframe_deletion&splice_region_variant
242 inframe_insertion
22 inframe_insertion&NMD_transcript_variant
4 inframe_insertion&splice_region_variant
14689 missense_variant
1416 missense_variant&NMD_transcript_variant
6 missense_variant&splice_donor_5th_base_variant
374 missense_variant&splice_region_variant
34 missense_variant&splice_region_variant&NMD_transcript_variant
53 splice_acceptor_variant
11 splice_acceptor_variant&NMD_transcript_variant
79 splice_donor_variant
6 splice_donor_variant&NMD_transcript_variant
30 start_lost
5 start_lost&NMD_transcript_variant
135 stop_gained
13 stop_gained&frameshift_variant
3 stop_gained&frameshift_variant&NMD_transcript_variant
2 stop_gained&frameshift_variant&splice_region_variant
14 stop_gained&NMD_transcript_variant
5 stop_gained&splice_region_variant
2 stop_gained&splice_region_variant&NMD_transcript_variant
4 stop_lost
1 stop_lost&NMD_transcript_variant
9 stop_retained_variant
6 stop_retained_variant&NMD_transcript_variant
1 transcript_ablation
**** DONE Regarder les conséquences pour -s worst
CLOSED: [2023-09-27 Wed 21:04]
/Work/Users/apraga/bisonex/out/annotate/vep/NA12878-sanger-all-T2T
Après filtre_vep sans splice
]$ bcftools +split-vep filtered.vcf -f '%Consequence\n' -d -s worst | sort | uniq -c
48 coding_sequence_variant
6 coding_sequence_variant&nmd_transcript_variant
121 frameshift_variant
9 frameshift_variant&nmd_transcript_variant
1 frameshift_variant&splice_donor_region_variant
9 frameshift_variant&splice_region_variant
79 inframe_deletion
3 inframe_deletion&nmd_transcript_variant
2 inframe_deletion&splice_region_variant
85 inframe_insertion
2 inframe_insertion&nmd_transcript_variant
1 inframe_insertion&splice_region_variant
5309 missense_variant
207 missense_variant&nmd_transcript_variant
3 missense_variant&splice_donor_5th_base_variant
110 missense_variant&splice_region_variant
9 missense_variant&splice_region_variant&nmd_transcript_variant
19 splice_acceptor_variant
1 splice_acceptor_variant&nmd_transcript_variant
21 splice_donor_variant
1 splice_donor_variant&nmd_transcript_variant
14 start_lost
44 stop_gained
4 stop_gained&frameshift_variant
2 stop_gained&frameshift_variant&splice_region_variant
3 stop_gained&nmd_transcript_variant
3 stop_gained&splice_region_variant
2 stop_gained&splice_region_variant&nmd_transcript_variant
2 stop_lost
1 stop_lost&nmd_transcript_variant
6 stop_retained_variant
2 stop_retained_variant&nmd_transcript_variant
1 transcript_ablation
Dans tests/spliceai
$ bcftools +split-vep output-all-gpu-filtered.vcf -f '%Consequence\n' -s worst -d | sort | uniq -c
48 coding_sequence_variant
6 coding_sequence_variant&nmd_transcript_variant
121 frameshift_variant
9 frameshift_variant&nmd_transcript_variant
1 frameshift_variant&splice_donor_region_variant
9 frameshift_variant&splice_region_variant
79 inframe_deletion
3 inframe_deletion&nmd_transcript_variant
2 inframe_deletion&splice_region_variant
85 inframe_insertion
2 inframe_insertion&nmd_transcript_variant
1 inframe_insertion&splice_region_variant
5309 missense_variant
207 missense_variant&nmd_transcript_variant
3 missense_variant&splice_donor_5th_base_variant
110 missense_variant&splice_region_variant
9 missense_variant&splice_region_variant&nmd_transcript_variant
19 splice_acceptor_variant
1 splice_acceptor_variant&nmd_transcript_variant
21 splice_donor_variant
1 splice_donor_variant&nmd_transcript_variant
14 start_lost
44 stop_gained
4 stop_gained&frameshift_variant
2 stop_gained&frameshift_variant&splice_region_variant
3 stop_gained&nmd_transcript_variant
3 stop_gained&splice_region_variant
2 stop_gained&splice_region_variant&nmd_transcript_variant
2 stop_lost
1 stop_lost&nmd_transcript_variant
6 stop_retained_variant
2 stop_retained_variant&nmd_transcript_variant
1 transcript_ablation
**** TODO Vérifier si tests sanger passent: non
SCHEDULED: <2023-09-27 Wed>
│ String Float64 Int64
─────┼───────────────────────────────────────
1 │ chr10:g.130884530 60.0 67
2 │ chr10:g.240362 60.0 79
3 │ chr14:g.52665581 60.0 51
4 │ chr19:g.41325390 60.0 180
** DONE [#B] Indicateurs qualité :qualité:
CLOSED: [2023-09-10 Sun 16:46]
*** Idée
Raredisease:
- FastQC : nombreuses statistiques. Non disponible Nix
- Mosdepth : calcule la profondeur (2x plus rapide que samtools depth). Nix
- MultiQC : fusionne juste les résultats des analyses. Non disponible nix
- Picard's CollectMutipleMetrics, CollectHsMetrics, and CollectWgsMetrics
- Qualimap : alternative fastqc ? Non disponible nix
- Sentieon's WgsMetricsAlgo : propriétaire
- TIDDIT's cov : TIDIT = remaninement chromosomique
Sarek:
- alignment statistics : samtools stats, mosdepth
- QC : MultiQC
MultiQC : non disponible Nix
*** DONE FastqQC
CLOSED: [2023-08-15 Tue 21:43] SCHEDULED: <2023-08-13 Sun>
*** DONE Mosdepth
CLOSED: [2023-08-15 Tue 21:43] SCHEDULED: <2023-08-13 Sun>
Pour exomple, il faut le fichier de capture
subworkflows/local/bam_markduplicates/
*** DONE Samtools stats
CLOSED: [2023-08-15 Tue 21:43] SCHEDULED: <2023-08-13 Sun>
*** DONE [#B] Compte-redu exécution avec MultiQC
CLOSED: [2023-08-15 Tue 21:43] SCHEDULED: <2023-08-13 Sun>
*** DONE Résultats sur NA12878 : 98% à 20x
CLOSED: [2023-08-19 Sat 20:45] SCHEDULED: <2023-08-17 Thu>
**** DONE Comprendre 91% à 20x seulement: SNVs inséré
CLOSED: [2023-08-18 Fri 22:25]
***** DONE Tester autre kit : Twist exome comprehensive
CLOSED: [2023-08-18 Fri 22:24]
Moins bon
***** DONE Tester génome sans alt
CLOSED: [2023-08-18 Fri 22:25]
Idem
***** DONE Tester NA12878 sans SNVs inséré: cause !!
CLOSED: [2023-08-18 Fri 22:25]
***** DONE Tester hg19 sur NA12878 non inséré
CLOSED: [2023-08-18 Fri 22:25]
**** DONE Comprendre pourquoi SNVs diminuent le score: reads manquants
CLOSED: [2023-08-19 Sat 20:34] SCHEDULED: <2023-08-18 Fri>
Voir [[id:5c1c36f3-f68e-4e6d-a7b6-61dca89abc37][Bug: perte de nombreux reads avec NA12878]]
*** DONE Relancer résultats avec NA1287 et NA12878 + sanger
CLOSED: [2023-08-29 Tue 10:30] SCHEDULED: <2023-08-29 Tue>
*** DONE Comparer avec hg19
CLOSED: [2023-08-28 Mon 17:22] SCHEDULED: <2023-08-20 Sun>
*** DONE Comparer avec autres kit de capture
CLOSED: [2023-08-28 Mon 17:22] SCHEDULED: <2023-08-20 Sun>
*** DONE Comparer avec no-alt
CLOSED: [2023-08-28 Mon 17:22] SCHEDULED: <2023-08-20 Sun>
** HOLD vérifier si normalisation
** KILL [#B] Vérification nomenclature hgvs :hgvs:
CLOSED: [2023-08-16 Wed 19:07] SCHEDULED: <2023-08-15 Tue>
*** KILL mutalyzer
CLOSED: [2023-08-16 Wed 19:07] SCHEDULED: <2023-08-13 Sun>
*** KILL API variantvalidator
CLOSED: [2023-08-16 Wed 19:07] SCHEDULED: <2023-08-13 Sun>
** DONE Exécution
CLOSED: [2022-09-13 Tue 21:37]
*** KILL test Bionix
*** KILL Implémenter execution avec Nix ?
Voir https://academic.oup.com/gigascience/article/9/11/giaa121/5987272?login=false
pour un exemple.
Probablement plus simple d’utiliser Nix pour gestion de l’environnement et snakemake pour l’exécution
Pas d’accès internet depuis le cluster
*** DONE nextflow
CLOSED: [2022-09-13 Tue 21:37]
**** TODO Bug scheduler SGE
Le job se fait tuer car l'utilisateur n'est pas passé correctement à nextflow
***** DONE Forcer l'utilisateur à l'exécution
CLOSED: [2023-04-01 Sat 17:57]
NXF_OPTS=-D"user.name=alex"
***** DONE Vérifier si le problème persiste avec 22.10.6
CLOSED: [2023-04-01 Sat 18:38] SCHEDULED: <2023-04-01 Sat>
oui
***** KILL Packager l'utilisateur dans le programme ?
Mauvaise idée..
*** DONE Diminuer mémoire pour haplotypecaller
CLOSED: [2023-09-20 Wed 21:44] SCHEDULED: <2023-09-19 Tue>
/Entered on/ [2023-09-19 Tue 15:30]
Medium = 32Go pour 6 coeurs => 4 jobs (donc tout le noeud) prend plus que les 96GB...
On essaie 16Gb
Puis commit
*** DONE Report multiqc avec 10 runs
CLOSED: [2023-09-19 Tue 15:31] SCHEDULED: <2023-09-19 Tue>
/Entered on/ [2023-09-19 Tue 15:31]
Cf mail 2023-09-19
*** WAIT Bug: variant sur 7788314 pour patient 62982193 filtré : DP < 30
SCHEDULED: <2023-09-25 Mon>
/Entered on/ [2023-09-22 Fri 22:59]
35 selon IGV mais 27 en pratique dans le VCF.
VCF cento: 26 reads également...
Non confirmé sanger
Mail envoyé Alexis
** DONE Mettre à jour spip pour corriger bug 62982239 : variant trop long (?)
CLOSED: [2023-09-22 Fri 22:22] SCHEDULED: <2023-09-21 Thu>
/Entered on/ [2023-09-21 Thu 23:11]
Rapporté par https://github.com/raphaelleman/SPiP/issues/9
*** DONE Relance run
CLOSED: [2023-09-22 Fri 22:22] SCHEDULED: <2023-09-21 Thu>
*** DONE Mise à jour spip
CLOSED: [2023-09-21 Thu 23:41] SCHEDULED: <2023-09-21 Thu>
** DONE nixpkgs unstable -> 23.05
CLOSED: [2023-09-22 Fri 22:22] SCHEDULED: <2023-09-21 Thu>
*** DONE repasser tests sanger
CLOSED: [2023-09-22 Fri 22:22] SCHEDULED: <2023-09-21 Thu>
** TODO Preprocessing avec nextflow
*** TODO Map to reference
**** TODO Sample ID dans header
/Work/Users/apraga/bisonex/out/63003856_S135/preprocessing/baserecalibrator
*** DONE Mark duplicate
CLOSED: [2022-10-09 Sun 22:30]
*** DONE Recalibrate base quality score
CLOSED: [2022-10-09 Sun 22:30]
** DONE Variant calling avec Nextflow
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Haplotype caller
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter variants
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter common snp not clinvar path
CLOSED: [2022-11-07 Mon 23:00]
Voir [[*common dbSNP not clinvar patho][common dbSNP not clinvar patho]]
*** DONE Filter variant only in consensual sequence
CLOSED: [2022-11-08 Tue 22:23]
*** DONE Filter technical variants
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Utilise AVX pour accélerer l'exécution
CLOSED: [2023-04-29 Sat 15:46]
Sans cela, on a l'avertissement
#+begin_quote
17:28:00.720 INFO PairHMM - OpenMP multi-threaded AVX-accelerated native PairHMM implementation is not supported
17:28:00.721 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/nix/store/cy9ckxqwrkifx7wf02hm4ww1p6lnbxg9-gatk-4.2.4.1/bin/gatk-package-4.2.4.1-local.jar!/com/intel/gkl/native/libgkl_utils.so
17:28:00.733 WARN NativeLibraryLoader - Unable to load libgkl_utils.so from native/libgkl_utils.so (/Work/Users/apraga/bisonex/out/NA12878_NIST7035/preprocessing/applybqsr/libgkl_utils821485189051585397.so: libgomp.so.1: cannot open shared object file: No such file or directory)
17:28:00.733 WARN IntelPairHmm - Intel GKL Utils not loaded
17:28:00.733 WARN PairHMM - ***WARNING: Machine does not have the AVX instruction set support needed for the accelerated AVX PairHmm. Falling back to the MUCH slower LOGLESS_CACHING implementation!
17:28:00.763 INFO ProgressMeter - Starting traversal
#+end_quote
libgomp.so est fourni par gcc donc il faut charger le module
module load gcc@11.3.0/gcc-12.1.0
** KILL Utiliser subworkflow
CLOSED: [2023-04-02 Sun 18:08]
Notre version permet d'être plus souple
*** KILL Alignement
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
*** KILL Vep
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
vcf_annotate_ensemblvep
** TODO Annotation avec nextflow :annotation:
*** KILL VEP : --gene-phenotype ?
CLOSED: [2023-04-18 mar. 18:32]
Vu avec alexis : bases de données non à jour
https://www.ensembl.org/info/genome/variation/phenotype/sources_phenotype_documentation.html
*** DONE plugin VEP
CLOSED: [2023-04-18 mar. 18:32]
Cloner dépôt git avec plugin
Puis utiliser --dir_plugins
*** HOLD Utiliser code d’Ale
data)
File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/engine/training.py", line 2111, in predict_step
return self(x, training=False)
File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
return fn(*args, **kwargs)
File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/engine/training.py", line 558, in __call__
return super().__call__(*args, **kwargs)
File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
return fn(*args, **kwargs)
File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/engine/base_layer.py", line 1145, in __call__
outputs = call_fn(inputs, *args, **kwargs)
File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 96, in error_handler
return fn(*args, **kwargs)
File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/engine/functional.py", line 512, in call
return self._run_internal_graph(inputs, training=training, mask=mask)
File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/engine/functional.py", line 669, in _run_internal_graph
outputs = node.layer(*args, **kwargs)
File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
return fn(*args, **kwargs)
File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/engine/base_layer.py", line 1145, in __call__
outputs = call_fn(inputs, *args, **kwargs)
File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 96, in error_handler
return fn(*args, **kwargs)
File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/layers/convolutional/base_conv.py", line 290, in call
outputs = self.convolution_op(inputs, self.kernel)
File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/layers/convolutional/base_conv.py", line 262, in convolution_op
return tf.nn.convolution(
Node: 'model_1/conv1d_3/Conv1D'
DNN library is not found.
[[{{node model_1/conv1d_3/Conv1D}}]] [Op:__inference_predict_function_22195]
#+end_quote
***** DONE GPU: chr20 ok
CLOSED: [2023-09-26 Tue 11:50]
LD_PRELOAD=/lib64/libcuda.so spliceai -I NA12878-sanger-20-2-T2T.vep.vcf.gz -O output-20-2-gpu.vcf -R /Work/Groups/bisonex/data/fasta/chm13v2.0/chm13v2.0.fa -A ~/t2t.txt
temps d'exécution : 5min
***** STRT GPU: toutes les données
SCHEDULED: <2023-09-26 Tue>
#+begin_src slurm
#!/bin/bash -l
# Fichier submission.SBATCH
#SBATCH --job-name="spliceai-gpu"
#SBATCH --output=%x.%J.out ## %x=nom_du_job, %J=id du job
#SBATCH --error=%x.%J.out
# walltime (hh:mm::ss) max is 8 days
#SBATCH -t 24:00:00
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
## To request more memory, use --mem option.
## Please don't use more than 128g.
#SBATCH --mem=32G
## votre dresse mail pour les notifs
#SBATCH --mail-user=apraga@chu-besancon.fr
#SBATCH --mail-type=END,FAIL
nvidia-smi
module purge
module load nix/2.11.0
LD_PRELOAD=/lib64/libcuda.so spliceai -I NA12878-sanger-all-T2T.vep.vcf.gz -O output-all-gpu.vcf -R /Work/Groups/bisonex/data/fasta/chm13v2.0/chm13v2.0.fa -A ~/t2t.txt
#+end_src
***** TODO Avec pip: echec
2023-09-24 08:28:46.361434: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1956] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU.
***** KILL Tester conda: echec
CLOSED: [2023-09-23 Sat 21:43] SCHEDULED: <2023-09-23 Sat>
N'arrive pas à installer
#+begin_quote
- feature:/linux-64::__glibc==2.28=0
- python=3.11 -> libgcc-ng[version='>=11.2.0'] -> __glibc[version='>=2.17']
- spliceai -> tensorflow[version='>=1.13.0'] -> __cuda
- spliceai -> tensorflow[version='>=1.13.0'] -> __glibc[version='>=2.17']
Your installed version is: 2.28
#+end_quote
**** TODO Ajout LOEUF et pli
plugin VEP
**** TODO NMD
plugin VEP
**** KILL Ajout LOEUF
CLOSED: [2023-04-19 mer. 16:32]
plugin VEP
**** DONE Spip
CLOSED: [2023-05-01 Mon 23:07] SCHEDULED: <2023-04-30 Sun>
BED ne semble pas bien marcher (il faut définir une zone)
VCF : trop d’information
Attention, plusieurs transcripts mais résultats identiques. On supprimer les doublons
***** DONE interpretation + score + intervalle de confiance séparé
CLOSED: [2023-05-01 Mon 23:07] SCHEDULED: <2023-04-30 Sun>
Tests :
dans tests/
vep -i 63004925-small.vcf -o postvep.vcf --vcf --fasta genomeRef.fna --dir 109 --merged --pick --offline --custom ../script/spip_annotation.vcf.gz,SPIP,vcf,exact,0,spipInterp,spipScore,spipConfidence
***** DONE Score
CLOSED: [2023-04-22 Sat 15:30]
**** DONE CADD: remplacer par plugin VEP
CLOSED: [2023-05-07 Sun 14:45] SCHEDULED: <2023-05-07 Sun>
***** Test
#+begin_src
vep -i test.vcf -o lol.vcf --offline --dir /Work/Projects/bisonex/data/vep/GRCh38/ --merged --vcf --fasta /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna --plugin CADD,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.snv.tsv.gz,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.indel.tsv.gz --dir_plugins ../VEP_plugins/ -v
#+end_src
Test
#+begin_src sh
vep --id "1 230710048 230710048 A/G 1" --offline --dir /Work/Projects/bisonex/data/vep/GRCh38/ --merged --vcf --fasta /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna --plugin CADD,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.snv.tsv.gz,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.indel.tsv.gz --hgvsg --plugin pLI --plugin LOEUF -o lol
#+end_src
CSQ=G|missense_variant|MODERATE|AGT|ENSG00000135744|Transcript|ENST00000366667|protein_coding|2/5||||843|776|259|M/T|aTg/aCg|||-1||HGNC|HGNC:333||Ensembl||A|A||1:g.230710048A>G|0.347|-0.277922|
Correspond bien à https://www.ensembl.org/Homo_sapiens/Tools/VEP/Results?tl=I7ZsIbrj14P6lD43-9115494
***** DONE Utiliser whole genome
CLOSED: [2023-04-29 Sat 15:46]
***** KILL Renommer les chromosome avant ...
CLOSED: [2023-05-01 Mon 09:14] SCHEDULED: <2023-04-30 Sun>
Trop long !
- Téléchargement de CADD: 4h20
- renommer les chromosome pour SNV : 6h20
- tabix sur les SNV : job tué au bout de 21h....
***** DONE annoter séparément et fusionner les tableaux
CLOSED: [2023-05-07 Sun 14:45] SCHEDULED: <2023-05-01 Mon>
NB: on pourrait filtrer CADD avec tabix pour se restreindre à nos variants
**** DONE clinvar
CLOSED: [2023-04-22 Sat 15:31]
**** KILL Vérifier résultats HGVS avec mutalyzer
CLOSED: [2023-05-01 Mon 09:26]
**** HOLD Parallélisation
***** HOLD par chromosome avec workflow VEP
https://github.com/Ensembl/ensembl-vep/blob/release/109/nextflow/workflows/run_vep.nf
***** HOLD Avec option --fork
**** DONE Utiliser la version de nf-core de VEP
CLOSED: [2023-05-13 Sat 18:27] SCHEDULED: <2023-05-07 Sun>
**** DONE OMIM
CLOSED: [2023-08-31 Thu 10:38] SCHEDULED: <2023-08-29 Tue>
**** DONE plI et LOEUF depuis gnomad
CLOSED: [2023-08-31 Thu 10:38] SCHEDULED: <2023-08-29 Tue>
**** DONE Grantham
CLOSED: [2023-08-31 Thu 22:08] SCHEDULED: <2023-08-30 Wed>
**** DONE Corriger spliceAI
CLOSED: [2023-08-31 Thu 13:51] SCHEDULED: <2023-08-31 Thu>
Pas d'annotation
- chromosome ? essai 1 au lieu de chr1 : idem. Et fonctionne pour CADD
- index ?
-
data)
File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/engine/training.py", line 2111, in predict_step
return self(x, training=False)
File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
return fn(*args, **kwargs)
File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/engine/training.py", line 558, in __call__
return super().__call__(*args, **kwargs)
File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
return fn(*args, **kwargs)
File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/engine/base_layer.py", line 1145, in __call__
outputs = call_fn(inputs, *args, **kwargs)
File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 96, in error_handler
return fn(*args, **kwargs)
File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/engine/functional.py", line 512, in call
return self._run_internal_graph(inputs, training=training, mask=mask)
File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/engine/functional.py", line 669, in _run_internal_graph
outputs = node.layer(*args, **kwargs)
File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
return fn(*args, **kwargs)
File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/engine/base_layer.py", line 1145, in __call__
outputs = call_fn(inputs, *args, **kwargs)
File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 96, in error_handler
return fn(*args, **kwargs)
File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/layers/convolutional/base_conv.py", line 290, in call
outputs = self.convolution_op(inputs, self.kernel)
File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/layers/convolutional/base_conv.py", line 262, in convolution_op
return tf.nn.convolution(
Node: 'model_1/conv1d_3/Conv1D'
DNN library is not found.
[[{{node model_1/conv1d_3/Conv1D}}]] [Op:__inference_predict_function_22195]
#+end_quote
***** DONE GPU: chr20 ok
CLOSED: [2023-09-26 Tue 11:50]
LD_PRELOAD=/lib64/libcuda.so spliceai -I NA12878-sanger-20-2-T2T.vep.vcf.gz -O output-20-2-gpu.vcf -R /Work/Groups/bisonex/data/fasta/chm13v2.0/chm13v2.0.fa -A ~/t2t.txt
temps d'exécution : 5min
***** STRT GPU: toutes les données :GPU:spliceai:
****** DONE Run : 70GB, 3h30
CLOSED: [2023-09-27 Wed 10:37] SCHEDULED: <2023-09-26 Tue>
32G insufissant ! Il faut 70GB :
Job ID: 17340
Cluster: mesoubfc
User/Group: apraga/mesousers
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 03:11:53
CPU Efficiency: 93.55% of 03:25:07 core-walltime
Job Wall-clock time: 03:25:07
Memory Utilized: 67.75 GB
Memory Efficiency: 52.93% of 128.00 GB
#+begin_src slurm
#!/bin/bash -l
# Fichier submission.SBATCH
#SBATCH --job-name="spliceai-gpu"
#SBATCH --output=%x.%J.out ## %x=nom_du_job, %J=id du job
#SBATCH --error=%x.%J.out
# walltime (hh:mm::ss) max is 8 days
#SBATCH -t 24:00:00
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
## To request more memory, use --mem option.
## Please don't use more than 128g.
#SBATCH --mem=64G
## votre dresse mail pour les notifs
#SBATCH --mail-user=apraga@chu-besancon.fr
#SBATCH --mail-type=END,FAIL
nvidia-smi
module purge
module load nix/2.11.0
LD_PRELOAD=/lib64/libcuda.so spliceai -I NA12878-sanger-all-T2T.vep.vcf.gz -O output-all-gpu.vcf -R /Work/Groups/bisonex/data/fasta/chm13v2.0/chm13v2.0.fa -A ~/t2t.txt
#+end_src
****** TODO Annoter la sortie de VEP avec ce VCF
Générer un fichier d'annotation
#+begin_src
bcftools annotate -x INFO/CSQ output-all-gpu.vcf -o spliceai.vcf.gz
bcftools index spliceai.vcf.gz
#+end_src
Annoter avec vep
#+begin_src sh
ln -s /Work/Projects/bisonex/data/fasta/chm13v2.0/chm13v2.0.fa .
ln -s /Work/Projects/bisonex/data/fasta/chm13v2.0/chm13v2.0.fa.fai .
ln -s /Work/Projects/bisonex/data/vep/chm13v2.0/106 .
ln -s /Work/Projects/bisonex/data/clinvar/chm13v2.0/clinvar.vcf.gz .
ln -s /Work/Projects/bisonex/data/clinvar/chm13v2.0/clinvar.vcf.gz.tbi .
#+end_src
#+begin_src sh
#vep -i output-all-gpu.vcf -o output-all-gpu-filtered.vcf --appris --biotype --canonical --ccds --compress_output bgzip --domains --exclude_predicted --flag_pick --hgvs --hgvsg --gene_phenotype --numbers --mane --protein --offline --uniprot --symbol --tsl --use_given_ref --variant_class --vcf --plugin NMD --custom clinvar.vcf.gz,ClinVar,vcf,exact,0,CLNSIG,CLNREVSTAT,CLNDN --plugin SpliceAI,snv=spliceai.vcf.gz,indel=spliceai.vcf.gz --fasta chm13v2.0.fa --assembly T2T-CHM13v2.0 --species homo_sapiens_gca009914755v4/ --cache --cache_version 106 --dir_cache 106
vep -i lol.vcf --force -o test.vcf.gz --appris --biotype --canonical --ccds --compress_output bgzip --domains --exclude_predicted --flag_pick --hgvs --hgvsg --gene_phenotype --numbers --mane --protein --offline --uniprot --symbol --tsl --use_given_ref --variant_class --vcf --plugin NMD --custom clinvar.vcf.gz,ClinVar,vcf,exact,0,CLNSIG,CLNREVSTAT,CLNDN --custom spliceai.vcf.gz,SpliceAI,vcf,exact,0,DS_AG%DS_AL --fasta chm13v2.0.fa --assembly T2T-CHM13v2.0 --species homo_sapiens_gca009914755v4/ --cache --cache_version 106 --dir_cache ${PWD}/106
#+end_src
****** KILL Save
CLOSED: [2023-09-27 Wed 21:40]
# ****** DONE Filtre vep avec spliceAI: 37365 -> 6130. SpliceAI n'apporte rien
# CLOSED: [2023-09-27 Wed 19:37] SCHEDULED: <2023-09-27 Wed>
# :PROPERTIES:
# :ID: c9b2009a-503b-4561-94c6-29ae21a3188d
# :END:
# #+begin_src sh
# filter_vep -i output-all-gpu.vcf --format vcf --filter " not(Consequence matches non_coding_transcript or Consequence matches stream or Consequence matches intergenic_variant or Consequence matches UTR or Consequence matches intron_variant or Consequence matches synonymous or BIOTYPE matches pseudogene or BIOTYPE matches misc_RNA) or (SpliceAI_pred_DS_AG and SpliceAI_pred_DS_AG >= 0.2) or (SpliceAI_pred_DS_AL and SpliceAI_pred_DS_AL >= 0.2) or (SpliceAI_pred_DS_DG and SpliceAI_pred_DS_DG >= 0.2) or (SpliceAI_pred_DS_DL and SpliceAI_pred_DS_DL >= 0.2) " --only_matched -o output-all-gpu-filtered.vcf
# #+end_src
# filter_vep -i output-all-gpu.vcf --format vcf --filter " not(Consequence matches non_coding_transcript or Consequence matches stream or Consequence matches intergenic_variant or Consequence matches UTR or Consequence matches intron_variant or Consequence matches synonymous or BIOTYPE matches pseudogene or BIOTYPE matches misc_RNA)" --only_matched | grep -c -v '^#'
# 6130
# $ grep -c -v '^#' output-all-gpu-filtered.vcf
# 6130
# ****** DONE Re-vérifier filtre avec spip: 7730 -> probable problème avec spip
# CLOSED: [2023-09-27 Wed 20:54] SCHEDULED: <2023-09-27 Wed>
# filter_vep -i NA12878-sanger-all-T2T.vep.vcf.gz --format vcf --filter " not(Consequence matches non_coding_transcript or Consequence matches stream or Consequence matches intergenic_variant or Consequence matches UTR or Consequence matches intron_variant or Consequence matches synonymous or BIOTYPE matches pseudogene or BIOTYPE matches misc_RNA) or (SPIP_spipScore and SPIP_spipScore >= 20)" --only_matched | grep -c -v '^#'
# perl: warning: Setting locale failed.
# perl: warning: Please check that your locale settings:
# LANGUAGE = (unset),
# LC_ALL = (unset),
# LANG = "en_US.utf8"
# are supported and installed on your system.
# perl: warning: Falling back to the standard locale ("C").
# 7730
****** TODO vérifier si tests sanger passent
SCHEDULED: <2023-09-27 Wed>
***** TODO Avec pip: echec
2023-09-24 08:28:46.361434: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1956] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU.
***** DONE Tester conda: echec
CLOSED: [2023-09-23 Sat 21:43] SCHEDULED: <2023-09-23 Sat>
Ananconda: N'arrive pas à installer
#+begin_quote
- feature:/linux-64::__glibc==2.28=0
- python=3.11 -> libgcc-ng[version='>=11.2.0'] -> __glibc[version='>=2.17']
- spliceai -> tensorflow[version='>=1.13.0'] -> __cuda
- spliceai -> tensorflow[version='>=1.13.0'] -> __glibc[version='>=2.17']
Your installed version is: 2.28
#+end_quote
Il faut utiliser mamba
**** TODO Ajout LOEUF et pli
plugin VEP
**** TODO NMD
plugin VEP
**** KILL Ajout LOEUF
CLOSED: [2023-04-19 mer. 16:32]
plugin VEP
**** DONE Spip
CLOSED: [2023-05-01 Mon 23:07] SCHEDULED: <2023-04-30 Sun>
BED ne semble pas bien marcher (il faut définir une zone)
VCF : trop d’information
Attention, plusieurs transcripts mais résultats identiques. On supprimer les doublons
***** DONE interpretation + score + intervalle de confiance séparé
CLOSED: [2023-05-01 Mon 23:07] SCHEDULED: <2023-04-30 Sun>
Tests :
dans tests/
vep -i 63004925-small.vcf -o postvep.vcf --vcf --fasta genomeRef.fna --dir 109 --merged --pick --offline --custom ../script/spip_annotation.vcf.gz,SPIP,vcf,exact,0,spipInterp,spipScore,spipConfidence
***** DONE Score
CLOSED: [2023-04-22 Sat 15:30]
**** DONE CADD: remplacer par plugin VEP
CLOSED: [2023-05-07 Sun 14:45] SCHEDULED: <2023-05-07 Sun>
***** Test
#+begin_src
vep -i test.vcf -o lol.vcf --offline --dir /Work/Projects/bisonex/data/vep/GRCh38/ --merged --vcf --fasta /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna --plugin CADD,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.snv.tsv.gz,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.indel.tsv.gz --dir_plugins ../VEP_plugins/ -v
#+end_src
Test
#+begin_src sh
vep --id "1 230710048 230710048 A/G 1" --offline --dir /Work/Projects/bisonex/data/vep/GRCh38/ --merged --vcf --fasta /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna --plugin CADD,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.snv.tsv.gz,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.indel.tsv.gz --hgvsg --plugin pLI --plugin LOEUF -o lol
#+end_src
CSQ=G|missense_variant|MODERATE|AGT|ENSG00000135744|Transcript|ENST00000366667|protein_coding|2/5||||843|776|259|M/T|aTg/aCg|||-1||HGNC|HGNC:333||Ensembl||A|A||1:g.230710048A>G|0.347|-0.277922|
Correspond bien à https://www.ensembl.org/Homo_sapiens/Tools/VEP/Results?tl=I7ZsIbrj14P6lD43-9115494
***** DONE Utiliser whole genome
CLOSED: [2023-04-29 Sat 15:46]
***** KILL Renommer les chromosome avant ...
CLOSED: [2023-05-01 Mon 09:14] SCHEDULED: <2023-04-30 Sun>
Trop long !
- Téléchargement de CADD: 4h20
- renommer les chromosome pour SNV : 6h20
- tabix sur les SNV : job tué au bout de 21h....
***** DONE annoter séparément et fusionner les tableaux
CLOSED: [2023-05-07 Sun 14:45] SCHEDULED: <2023-05-01 Mon>
NB: on pourrait filtrer CADD avec tabix pour se restreindre à nos variants
**** DONE clinvar
CLOSED: [2023-04-22 Sat 15:31]
**** KILL Vérifier résultats HGVS avec mutalyzer
CLOSED: [2023-05-01 Mon 09:26]
**** HOLD Parallélisation
***** HOLD par chromosome avec workflow VEP
https://github.com/Ensembl/ensembl-vep/blob/release/109/nextflow/workflows/run_vep.nf
***** HOLD Avec option --fork
**** DONE Utiliser la version de nf-core de VEP
CLOSED: [2023-05-13 Sat 18:27] SCHEDULED: <2023-05-07 Sun>
**** DONE OMIM
CLOSED: [2023-08-31 Thu 10:38] SCHEDULED: <2023-08-29 Tue>
**** DONE plI et LOEUF depuis gnomad
CLOSED: [2023-08-31 Thu 10:38] SCHEDULED: <2023-08-29 Tue>
**** DONE Grantham
CLOSED: [2023-08-31 Thu 22:08] SCHEDULED: <2023-08-30 Wed>
**** DONE Corriger spliceAI
CLOSED: [2023-08-31 Thu 13:51] SCHEDULED: <2023-08-31 Thu>
Pas d'annotation
- chromosome ? essai 1 au lieu de chr1 : idem. Et fonctionne pour CADD
- index ?
-