apraga/org - Change Z6B2FRJWT6EF4MC4GTIDBMULUKEI62ZGPYZMCWRJDJNUM7YUJJMAC

Trying to annotate using spliceai

Created by Alexis Praga on September 27, 2023

Dependencies

In channels

main

Change contents

Replacement in projects.org at line 373 [4.123895]
B:BD[3.532] → [3.532:560]
```
SCHEDULED: <2023-09-26 Tue>
```
[3.532]
[5.676]
```
SCHEDULED: <2023-09-30 Sat>
```

Replacement in projects/bisonex.org at line 4 [8.35]

B:BD[6.8438] → [7.29:8221]

:00>
:LOGBOOK:
CLOCK: [2023-07-30 Sun 19:13]--[2023-07-30 Sun 20:50] =>  1:37
:END:
Changes required for kent:
- no more patch
- removal of a loop in postPatch
NIX_BUILD_TOP is also removed
*** DONE [[https://github.com/NixOS/nixpkgs/pull/186459][BioDBHTS]]
CLOSED: [2023-05-06 Sat 08:49] SCHEDULED: <2023-04-15 Sat>
/Entered on/ [2022-08-10 Wed 14:28]
Review fixes done <2022-10-10 Mon>
*** DONE [[https://github.com/NixOS/nixpkgs/pull/186464][BioExtAlign]]
CLOSED: [2022-10-22 Sat 12:43] SCHEDULED: <2022-08-10 Wed>
/Entered on/ [2022-08-10 Wed 14:28]
Review <2022-10-10 Mon>, fixed the same day.
Second round of fixes done, waiting
The tests cannot be made to work because they do not find the Bio::Tools::Align module, which lives in another directory of the repository. Even building the whole repository does not help, so the tests are skipped.
*** TODO VEP
** WAIT [[https://github.com/NixOS/nixpkgs/pull/230394][rtg-tools]] :vcfeval:
Submitted
** WAIT Package Spip https://github.com/NixOS/nixpkgs/pull/247476
** TODO Happy :happy:
*** PROJ PR python 3 upstream
*** PROJ nixpkgs as is
** TODO Bamsurgeon
/Entered on/ [2023-05-13 Sat 19:11]
*** TODO Velvet
** TODO Picard PR with an option to manage memory
Similar to
https://github.com/bioconda/bioconda-recipes/blob/master/recipes/picard/picard.sh
* Julia :julia:
** KILL XAM.jl: PR to modify records :julia:
CLOSED: [2023-05-29 Mon 15:40] SCHEDULED: <2023-05-28 Sun>
/Entered on/ [2023-05-27 Sat 22:39]
** TODO XAMscissors.jl :xamscissors:
Modifies the sequence in the BAM.
*CIGAR is not updated*
We convert to fastq and run the pipeline to "fix" it
#+begin_src sh
cd /home/alex/code/bisonex/out/63003856/preprocessing/mapped
samtools view 63003856_S135.bam NC_000022.11 -o 63003856_S135_chr22.bam
cd /home/alex/recherche/bisonex/code/BamScissors.jl
cp ~/code/bisonex/out/63003856/preprocessing/mapped/63003856_S135_chr22.bam .
samtools index 63003856_S135_chr22.bam
#+end_src
The script modifies the BAM, sorts it and generates the fastq.
!!! Warning: do not forget the -n option !!!
#+begin_src sh
time julia --project=.. insertVariant.jl
scp 63003856_S135_chr22_{1,2}.fq.gz meso:/Work/Users/apraga/bisonex/tests/bamscissors/
#+end_src
*** WAIT Implement SNVs with VAF :snv:
Strategy:
1. compute the depth at the positions
2. build a dictionary { read name : position in the dataframe }
3. iterate over all reads and change the marked ones
**** DONE VAF = 1
CLOSED: [2023-05-29 Mon 15:34]
**** DONE VAF drawn from a normal distribution
CLOSED: [2023-05-29 Mon 15:35]
Truncated if > 1
**** WAIT Unit tests
***** DONE NA12878: 1 gene on chromosome 22
CLOSED: [2023-05-30 Tue 23:55]
root = "https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/"
#+begin_src sh
samtools view project.NIST_NIST7035_H7AP8ADXX_NA12878.bwa.markDuplicates.bam  chr22 -o project.NIST_NIST7035_H7AP8ADXX_NA12878_chr22.bam
samtools view project.NIST_NIST7035_H7AP8ADXX_NA12878_chr22.bam chr22:19419700-19424000 -o NIST7035_H7AP8ADXX_NA12878_chr22_MRPL40_hg19.bam
#+end_src
***** WAIT Pull request FormatSpecimens
https://github.com/BioJulia/FormatSpecimens.jl/pull/8
***** DONE Formatspecimens
CLOSED: [2023-05-29 Mon 23:03]
****** DONE 1 read
CLOSED: [2023-05-29 Mon 23:02]
****** DONE VAF on 1 exon
CLOSED: [2023-05-29 Mon 23:03]
**** DONE [#A] Bug: many reads lost with NA12878
CLOSED: [2023-08-19 Sat 20:45] SCHEDULED: <2023-08-18 Fri>
:PROPERTIES:
:ID:       5c1c36f3-f68e-4e6d-a7b6-61dca89abc37
:END:
E.g. chrX:g.124056226: we go from 65 reads to 1
xamscissors test: no problem...
We test on this position +/- 200bp
#+begin_src sh :dir /home/alex/roam/research/bisonex/code/sanger
samtools view   /home/alex/code/bisonex/out/2300346867_NA12878-63118093_S260-GRCh38/preprocessing/mapped/2300346867_NA12878-63118093_S260-GRCh38.bam chrX:124056026-124056426 -o chrXsmall.bam
#+end_src
#+RESULTS:
***** DONE Check depth with the latest version:
CLOSED: [2023-08-19 Sat 20:34] SCHEDULED: <2023-08-19 Sat>
****** DONE chr20: depth ok
SCHEDULED: <2023-08-19 Sat>
****** DONE all the data
CLOSED: [2023-08-19 Sat 20:34] SCHEDULED: <2023-08-19 Sat>
Ok for 7 variants (IGV), notably on chromosome X
*** TODO Implement indels with VAF :indel:
*** TODO Package submission
* Data
:PROPERTIES:
:CATEGORY: data
:END:
** DONE Replace bam with fastq on the mesocentre
CLOSED: [2023-04-16 Sun 16:33]
Command
*** DONE Remove the non-"paired" fastq
CLOSED: [2023-04-16 Sun 16:33]
nushell
List of fastq whose "paired-end" mate is missing
#+begin_src nu
ls **/*.fastq.gz | get name | path basename | split column "_" | get column1 | uniq -u | save single.txt
#+end_src
#+RESULTS:
: 62907927
: 62907970
: 62899606
: 62911287
: 62913201
: 62914084
: 62915905
: 62921595
: 62923065
: 62925220
: 62926503
: 62926502
: 62926500
: 62926499
: 62926498
: 62931719
: 62943423
: 62943400
: 62948290
: 62949205
: 62949206
: 62949118
: 62951284
: 62960792
: 62960785
: 62960787
: 62960617
: 62962561
: 62962692
: 62967473
: 62972194
: 62979102
Check
#+begin_src nu
open single.txt  | lines | each {|e| ls $"fastq/*_($in)/*" | get 0  }
open single.txt  | lines | each {|e| ls $"fastq/*_($in)/*" | get 0.name }  | path basename | split column "_" | get column1 | uniq -c
#+end_src
Move them all into one directory (no deletion)
#+begin_src nu
open single.txt  | lines | each {|e| ls $"fastq/*_($in)/*" | get 0  }  | each {|e| ^mv $e.name bad-fastq/}
#+end_src
Check that the directories are empty
 open single.txt  | lines | each {|e| ls $"fastq/*_($in)" | get 0.name } | ^ls -l $in
 Then delete them
 open single.txt  | lines | each {|e| ls $"fastq/*_($in)" | get 0.name } | ^rm -r $in
*** DONE Remove bam files that have fastq
CLOSED: [2023-04-16 Sun 16:33]
List the fastq and bam identifiers in a table with their type:
#+begin_src nu
let fastq = (ls fastq/*/*.fastq.gz | get name | parse "{dir}/{full_id}/{id}_{R}_001.fastq.gz"  | select dir id | uniq )
let bam = (ls bam/*/*.bam | get name | parse "{dir}/{full_id}/{id}_{S}.bqrt.bam"  | select dir id)
#+end_src
Group the results by identifier (the result is a list of records that must be converted to a table)
and keep those that have only a fastq or only a bam
#+begin_src nu
let single = ( $bam | append $fastq | group-by id | transpose id files | get files | where {|x| ($x | length) == 1})
#+end_src
Convert to a table and keep only the bam files
#+begin_src nu
$single | reduce {|it, acc| $acc | append $it} | where dir == bam | get id | each {|e| ^ls $"bam/*_($e)/*.bam"}
#+end_src
#+RESULTS:
: bam/2100656174_62913201/62913201_S52.bqrt.bam
: bam/2100733271_62925220/62925220_S33.bqrt.bam
: bam/2100738763_62926502/62926502_S108.bqrt.bam
: bam/2100746726_62926498/62926498_S105.bqrt.bam
: bam/2100787936_62931955/62931955_S4.bqrt.bam
: bam/2200066374_62948290/62948290_S130.bqrt.bam
: bam/2200074722_62948298/62948298_S131.bqrt.bam
: bam/2200074990_62948306/62948306_S218.bqrt.bam
: bam/2200214581_62967331/62967331_S267.bqrt.bam
: bam/2200225399_62972187/62972187_S85.bqrt.bam
: bam/2200293962_62979117/62979117_S63.bqrt.bam
: bam/2200423985_62999352/62999352_S1.bqrt.bam
: bam/2200495073_63010427/63010427_S20.bqrt.bam
: bam/2200511274_63012586/63012586_S114.bqrt.bam
: bam/2200669188_63036688/63036688_S150.bqrt.bam
* New workflow :workflow:
** TODO Databases
*** KILL Nix to download the raw data
**** Conclusion
Not viable on the cluster because outside /nix/store
Symlinks could be used but that is too complicated
**** KILL Axel instead of curl to handle timeouts?
CLOSED: [2022-08-19 Fri 15:18]
*** DONE Test @pennae's patch for large files
SCHEDULED: <2022-08-19 Fri>
*** KILL Download the data with nextflow: hg38
CLOSED: [2023-06-12 Mon 23:29]
**** DONE Reference genome
**** DONE dbSNP
**** DONE VEP 20G
CLOSED: [2023-06-12

[6.8438]

[9.29]

:00>
:LOGBOOK:
CLOCK: [2023-07-30 Sun 19:13]--[2023-07-30 Sun 20:50] =>  1:37
:END:
Changes required for kent:
- no more patch
- removal of a loop in postPatch
NIX_BUILD_TOP is also removed
*** DONE [[https://github.com/NixOS/nixpkgs/pull/186459][BioDBHTS]]
CLOSED: [2023-05-06 Sat 08:49] SCHEDULED: <2023-04-15 Sat>
/Entered on/ [2022-08-10 Wed 14:28]
Review fixes done <2022-10-10 Mon>
*** DONE [[https://github.com/NixOS/nixpkgs/pull/186464][BioExtAlign]]
CLOSED: [2022-10-22 Sat 12:43] SCHEDULED: <2022-08-10 Wed>
/Entered on/ [2022-08-10 Wed 14:28]
Review <2022-10-10 Mon>, fixed the same day.
Second round of fixes done, waiting
The tests cannot be made to work because they do not find the Bio::Tools::Align module, which lives in another directory of the repository. Even building the whole repository does not help, so the tests are skipped.
*** TODO VEP
** WAIT [[https://github.com/NixOS/nixpkgs/pull/230394][rtg-tools]] :vcfeval:
Submitted
** WAIT Package Spip https://github.com/NixOS/nixpkgs/pull/247476
** TODO Happy :happy:
*** PROJ PR python 3 upstream
*** PROJ nixpkgs as is
** PROJ SpliceAI
** TODO Bamsurgeon
/Entered on/ [2023-05-13 Sat 19:11]
*** TODO Velvet
** TODO Picard PR with an option to manage memory
Similar to
https://github.com/bioconda/bioconda-recipes/blob/master/recipes/picard/picard.sh
* Julia :julia:
** KILL XAM.jl: PR to modify records :julia:
CLOSED: [2023-05-29 Mon 15:40] SCHEDULED: <2023-05-28 Sun>
/Entered on/ [2023-05-27 Sat 22:39]
** TODO XAMscissors.jl :xamscissors:
Modifies the sequence in the BAM.
*CIGAR is not updated*
We convert to fastq and run the pipeline to "fix" it
#+begin_src sh
cd /home/alex/code/bisonex/out/63003856/preprocessing/mapped
samtools view 63003856_S135.bam NC_000022.11 -o 63003856_S135_chr22.bam
cd /home/alex/recherche/bisonex/code/BamScissors.jl
cp ~/code/bisonex/out/63003856/preprocessing/mapped/63003856_S135_chr22.bam .
samtools index 63003856_S135_chr22.bam
#+end_src
The script modifies the BAM, sorts it and generates the fastq.
!!! Warning: do not forget the -n option !!!
#+begin_src sh
time julia --project=.. insertVariant.jl
scp 63003856_S135_chr22_{1,2}.fq.gz meso:/Work/Users/apraga/bisonex/tests/bamscissors/
#+end_src
*** WAIT Implement SNVs with VAF :snv:
Strategy:
1. compute the depth at the positions
2. build a dictionary { read name : position in the dataframe }
3. iterate over all reads and change the marked ones
**** DONE VAF = 1
CLOSED: [2023-05-29 Mon 15:34]
**** DONE VAF drawn from a normal distribution
CLOSED: [2023-05-29 Mon 15:35]
Truncated if > 1
**** WAIT Unit tests
***** DONE NA12878: 1 gene on chromosome 22
CLOSED: [2023-05-30 Tue 23:55]
root = "https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/"
#+begin_src sh
samtools view project.NIST_NIST7035_H7AP8ADXX_NA12878.bwa.markDuplicates.bam  chr22 -o project.NIST_NIST7035_H7AP8ADXX_NA12878_chr22.bam
samtools view project.NIST_NIST7035_H7AP8ADXX_NA12878_chr22.bam chr22:19419700-19424000 -o NIST7035_H7AP8ADXX_NA12878_chr22_MRPL40_hg19.bam
#+end_src
***** WAIT Pull request FormatSpecimens
https://github.com/BioJulia/FormatSpecimens.jl/pull/8
***** DONE Formatspecimens
CLOSED: [2023-05-29 Mon 23:03]
****** DONE 1 read
CLOSED: [2023-05-29 Mon 23:02]
****** DONE VAF on 1 exon
CLOSED: [2023-05-29 Mon 23:03]
**** DONE [#A] Bug: many reads lost with NA12878
CLOSED: [2023-08-19 Sat 20:45] SCHEDULED: <2023-08-18 Fri>
:PROPERTIES:
:ID:       5c1c36f3-f68e-4e6d-a7b6-61dca89abc37
:END:
E.g. chrX:g.124056226: we go from 65 reads to 1
xamscissors test: no problem...
We test on this position +/- 200bp
#+begin_src sh :dir /home/alex/roam/research/bisonex/code/sanger
samtools view   /home/alex/code/bisonex/out/2300346867_NA12878-63118093_S260-GRCh38/preprocessing/mapped/2300346867_NA12878-63118093_S260-GRCh38.bam chrX:124056026-124056426 -o chrXsmall.bam
#+end_src
#+RESULTS:
***** DONE Check depth with the latest version:
CLOSED: [2023-08-19 Sat 20:34] SCHEDULED: <2023-08-19 Sat>
****** DONE chr20: depth ok
SCHEDULED: <2023-08-19 Sat>
****** DONE all the data
CLOSED: [2023-08-19 Sat 20:34] SCHEDULED: <2023-08-19 Sat>
Ok for 7 variants (IGV), notably on chromosome X
*** TODO Implement indels with VAF :indel:
*** TODO Package submission
* Data
:PROPERTIES:
:CATEGORY: data
:END:
** DONE Replace bam with fastq on the mesocentre
CLOSED: [2023-04-16 Sun 16:33]
Command
*** DONE Remove the non-"paired" fastq
CLOSED: [2023-04-16 Sun 16:33]
nushell
List of fastq whose "paired-end" mate is missing
#+begin_src nu
ls **/*.fastq.gz | get name | path basename | split column "_" | get column1 | uniq -u | save single.txt
#+end_src
#+RESULTS:
: 62907927
: 62907970
: 62899606
: 62911287
: 62913201
: 62914084
: 62915905
: 62921595
: 62923065
: 62925220
: 62926503
: 62926502
: 62926500
: 62926499
: 62926498
: 62931719
: 62943423
: 62943400
: 62948290
: 62949205
: 62949206
: 62949118
: 62951284
: 62960792
: 62960785
: 62960787
: 62960617
: 62962561
: 62962692
: 62967473
: 62972194
: 62979102
Check
#+begin_src nu
open single.txt  | lines | each {|e| ls $"fastq/*_($in)/*" | get 0  }
open single.txt  | lines | each {|e| ls $"fastq/*_($in)/*" | get 0.name }  | path basename | split column "_" | get column1 | uniq -c
#+end_src
Move them all into one directory (no deletion)
#+begin_src nu
open single.txt  | lines | each {|e| ls $"fastq/*_($in)/*" | get 0  }  | each {|e| ^mv $e.name bad-fastq/}
#+end_src
Check that the directories are empty
 open single.txt  | lines | each {|e| ls $"fastq/*_($in)" | get 0.name } | ^ls -l $in
 Then delete them
 open single.txt  | lines | each {|e| ls $"fastq/*_($in)" | get 0.name } | ^rm -r $in
*** DONE Remove bam files that have fastq
CLOSED: [2023-04-16 Sun 16:33]
List the fastq and bam identifiers in a table with their type:
#+begin_src nu
let fastq = (ls fastq/*/*.fastq.gz | get name | parse "{dir}/{full_id}/{id}_{R}_001.fastq.gz"  | select dir id | uniq )
let bam = (ls bam/*/*.bam | get name | parse "{dir}/{full_id}/{id}_{S}.bqrt.bam"  | select dir id)
#+end_src
Group the results by identifier (the result is a list of records that must be converted to a table)
and keep those that have only a fastq or only a bam
#+begin_src nu
let single = ( $bam | append $fastq | group-by id | transpose id files | get files | where {|x| ($x | length) == 1})
#+end_src
Convert to a table and keep only the bam files
#+begin_src nu
$single | reduce {|it, acc| $acc | append $it} | where dir == bam | get id | each {|e| ^ls $"bam/*_($e)/*.bam"}
#+end_src
#+RESULTS:
: bam/2100656174_62913201/62913201_S52.bqrt.bam
: bam/2100733271_62925220/62925220_S33.bqrt.bam
: bam/2100738763_62926502/62926502_S108.bqrt.bam
: bam/2100746726_62926498/62926498_S105.bqrt.bam
: bam/2100787936_62931955/62931955_S4.bqrt.bam
: bam/2200066374_62948290/62948290_S130.bqrt.bam
: bam/2200074722_62948298/62948298_S131.bqrt.bam
: bam/2200074990_62948306/62948306_S218.bqrt.bam
: bam/2200214581_62967331/62967331_S267.bqrt.bam
: bam/2200225399_62972187/62972187_S85.bqrt.bam
: bam/2200293962_62979117/62979117_S63.bqrt.bam
: bam/2200423985_62999352/62999352_S1.bqrt.bam
: bam/2200495073_63010427/63010427_S20.bqrt.bam
: bam/2200511274_63012586/63012586_S114.bqrt.bam
: bam/2200669188_63036688/63036688_S150.bqrt.bam
* New workflow :workflow:
** TODO Databases
*** KILL Nix to download the raw data
**** Conclusion
Not viable on the cluster because outside /nix/store
Symlinks could be used but that is too complicated
**** KILL Axel instead of curl to handle timeouts?
CLOSED: [2022-08-19 Fri 15:18]
*** DONE Test @pennae's patch for large files
SCHEDULED: <2022-08-19 Fri>
*** KILL Download the data with nextflow: hg38
CLOSED: [2023-06-12 Mon 23:29]
**** DONE Reference genome
**** DONE dbSNP
**** DONE VEP 20G
CLOSED: [2023-06-12

Replacement in projects/bisonex.org at line 21 [8.35]

B:BD[10.18219] → [10.18219:18249]

B:BD[10.18249] → [11.203:8276]

∅:D[11.8276] → [12.29:118]

B:BD[13.8221] → [12.29:118]

B:BD[12.118] → [14.201:8393]

epIntron |               0 | Outside SPiCE Interpretation |  0 |           0 | No            |    10.00000 |       89894644 | Don           |    0.0001360257829 | No               | Don                 |             89894485 |        159 |            0 |   0.0000000000000 | No          |      89894485 |         0.07177992 | Yes              |            0.07177992 | Yes |
| chr10 | 89894645 | lol | A   | G   | .    | .      | .    | NR_028034:g.89894645:A>G    | NTR            | 00 % [00 % - 00.92 %]      |     0.000 | +      | 89894645 | substitution | A>G      | Intron 3 |     1398 | NR_028034    | FAS  | donor     |           160 | DeepIntron |               0 | Outside SPiCE Interpretation |  0 |           0 | No            |    10.00000 |       89894644 | Don           |    0.0001360257829 | No               | Don                 |             89894485 |        159 |            0 |   0.0000000000000 | No          |      89894485 |         0.07177992 | Yes              |            0.07177992 | Yes |
| chr10 | 89894645 | lol | A   | G   | .    | .      | .    | NR_028035:g.89894645:A>G    | Alter ESR      | 35.81 % [28.11 % - 44.1 %] |     0.288 | +      | 89894645 | substitution | A>G      | Exon 4   |       63 | NR_028035    | FAS  | acceptor  |             8 | ExonESR    |               0 | Outside SPiCE Interpretation |  0 |           0 | No            |    -1.67753 |       89894644 | Acc           |    0.0000003317384 | No               | Acc                 |             89894637 |          7 |     89894644 |   0.0000002205815 | No          |      89894637 |         0.02545572 | No               |            0.02545572 | No  |
| chr10 | 89894645 | lol | A   | G   | .    | .      | .    | NR_028036:g.89894645:A>G    | Alter ESR      | 35.81 % [28.11 % - 44.1 %] |     0.288 | +      | 89894645 | substitution | A>G      | Exon 5   |       63 | NR_028036    | FAS  | acceptor  |             8 | ExonESR    |               0 | Outside SPiCE Interpretation |  0 |           0 | No            |    -1.67753 |       89894644 | Acc           |    0.0000003317384 | No               | Acc                 |             89894637 |          7 |     89894644 |   0.0000002205815 | No          |      89894637 |         0.02545572 | No               |            0.02545572 | No  |
| chr10 | 89894645 | lol | A   | G   | .    | .      | .    | NR_135313:g.89894645:A>G    | Alter ESR      | 35.81 % [28.11 % - 44.1 %] |     0.288 | +      | 89894645 | substitution | A>G      | Exon 5   |       63 | NR_135313    | FAS  | acceptor  |             8 | ExonESR    |               0 | Outside SPiCE Interpretation |  0 |           0 | No            |    -1.67753 |       89894644 | Acc           |    0.0000003317384 | No               | Acc                 |             89894637 |          7 |     89894644 |   0.0000002205815 | No          |      89894637 |         0.02545572 | No               |            0.02545572 | No  |
| chr10 | 89894645 | lol | A   | G   | .    | .      | .    | NM_001410956:g.89894645:A>G | Alter ESR      | 35.81 % [28.11 % - 44.1 %] |     0.288 | +      | 89894645 | substitution | A>G      | Exon 6   |       63 | NM_001410956 | FAS  | acceptor  |             8 | ExonESR    |               0 | Outside SPiCE Interpretation |  0 |           0 | No            |    -1.67753 |       89894644 | Acc           |    0.0000003317384 | No               | Acc                 |             89894637 |          7 |     89894644 |   0.0000002205815 | No          |      89894637 |         0.02545572 | No               |            0.02545572 | No  |
| chr10 | 89894645 | lol | A   | G   | .    | .      | .    | NR_135314:g.89894645:A>G    | Alter ESR      | 35.81 % [28.11 % - 44.1 %] |     0.288 | +      | 89894645 | substitution | A>G      | Exon 6   |       63 | NR_135314    | FAS  | acceptor  |             8 | ExonESR    |               0 | Outside SPiCE Interpretation |  0 |           0 | No            |    -1.67753 |       89894644 | Acc           |    0.0000003317384 | No               | Acc                 |             89894637 |          7 |     89894644 |   0.0000002205815 | No          |      89894637 |         0.02545572 | No               |            0.02545572 | No  |
| chr10 | 89894645 | lol | A   | G   | .    | .      | .    | NR_135315:g.89894645:A>G    | Alter ESR      | 35.81 % [28.11 % - 44.1 %] |     0.288 | +      | 89894645 | substitution | A>G      | Exon 4   |       63 | NR_135315    | FAS  | acceptor  |             8 | ExonESR    |               0 | Outside SPiCE Interpretation |  0 |           0 | No            |    -1.67753 |       89894644 | Acc           |    0.0000003317384 | No               | Acc                 |             89894637 |          7 |     89894644 |   0.0000002205815 | No          |      89894637 |         0.02545572 | No               |            0.02545572 | No  |
|       |          |     |     |     |      |        |      |                             |                |                            |           |        |          |              |          |          |          |              |      |           |               |            |                 |                              |    |             |               |             |                |               |                    |                  |                     |                      |            |              |                   |             |               |                    |                  |                       |     |
**** DONE Check multiple transcripts in hg38 with genomic coordinates: ok
CLOSED: [2023-08-10 Thu 23:00]
Many more transcripts in T2T
E.g. 1 RefSeq curated transcript
http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr11%3A108257446%2D108257496&hgsid=1672963428_J5aWAqack2FpJ7mvhFTNVw7bKzxo
vs 2 transcripts in T2T
http://genome.ucsc.edu/cgi-bin/hgTracks?db=hub_3671779_hs1&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr11%3A108264969%2D108265019&hgsid=1672963612_Eso9frdQ7z6RkKkcKsIf2Waq3pec
This is indeed what we get with spip
*** DONE [#A] VEP filter with spip
CLOSED: [2023-08-13 Sun 00:39] SCHEDULED: <2023-08-12 Sat 19:00>
*** DONE CADD + spliceAI annotation for GRCh38 with the new version :annotation:
CLOSED: [2023-08-28 Mon 17:21] SCHEDULED: <2023-08-20 Sun>
*** DONE OMIM: only possible on the gene name :annotation:
CLOSED: [2023-08-13 Sun 11:57] SCHEDULED: <2023-08-13 Sun 16:00>
Database not available, and updating it ourselves is complicated.
If we take the GRCh38 genes, they are not necessarily in T2T
E.g. DDX11L17 does not exist in T2T at these coordinates
zgrep DDX11L17 GCF_009914755.1_T2T-CHM13v2.0_genomic.gff.gz
Note: it is a pseudogene
https://www.genecards.org/cgi-bin/carddisp.pl?gene=DDX11L17
If we take the T2T genes, some are new.
E.g. the first one is LOC101928626.
At this position, nothing in GRCh38
Trying with ENSEMBL does not work because the identifiers differ
E.g. ACHE
Ideally we would need the NCBI identifier (available in OMIM), but it is not in the VEP output
And that requires the "merged" version, hence impossible in T2T
Is it feasible to match on the gene name?
All T2T genes:
#+begin_src sh :dir ~/Downloads
 zgrep -o "ID=gene[^;]*;"  GCF_009914755.1_T2T-CHM13v2.0_genomic.gff.gz | sed 's/ID=gene-//;s/;//' | sort | uniq > t2t-genes.txt
 wc -l t2t-genes.txt
#+end_src
#+RESULTS:
: 57660 t2t-genes.txt
#+begin_src sh :dir ~/Downloads
zgrep -o "ID=gene[^;]*;"  GCF_000001405.40_GRCh38.p14_genomic.gff.gz | sed 's/ID=gene-//;s/;//' | sort | uniq > hg38-genes.txt
wc -l hg38-genes.txt
#+end_src
#+RESULTS:
: 67127 hg38-genes.txt
Genes common to both
#+begin_src sh :dir ~/Downloads
comm -12 t2t-genes.txt hg38-genes.txt | wc -l
#+end_src
#+RESULTS:
: 54506
Genes only in T2T
#+begin_src sh :dir ~/Downloads
comm -23 t2t-genes.txt hg38-genes.txt | wc -l
#+end_src
#+RESULTS:
: 3154
Genes only in GRCh38
#+begin_src sh :dir ~/Downloads
comm -13 t2t-genes.txt hg38-genes.txt | wc -l
#+end_src
#+RESULTS:
: 12621
*** HOLD OMIM on gene name :annotation:
*** DONE Mobidetails API
CLOSED: [2023-09-10 Sun 16:44]
Too slow... 1h to 1h30 of run time
Available in a module
*** DONE VEP filter with spip for T2T and spliceAI for GRCh38
CLOSED: [2023-09-16 Sat 22:47]
*** DONE Rerun tests in GRCh38 with the new filter (spip or spliceAI) :sanger:
CLOSED: [2023-09-17 Sun 09:07] SCHEDULED: <2023-09-16 Sat>
*** HOLD Franklin API
https://www.postman.com/genoox-ps/workspace/franklin-api-documentation-s-public-workspace/documentation/6621518-4335389d-12e3-445f-8182-339df95b2a09
*** KILL Check whether clinical information is available with vep :annotation:
CLOSED: [2023-09-10 Sun 16:44]
** DONE [#B] Quality indicators :qualité:
CLOSED: [2023-09-10 Sun 16:46]
*** Idea
Raredisease:
- FastQC: many statistics. Not available in Nix
- Mosdepth: computes the depth (2x faster than samtools depth). Available in Nix
- MultiQC: only merges the results of the analyses. Not available in Nix
- Picard's CollectMultipleMetrics, CollectHsMetrics, and CollectWgsMetrics
- Qualimap: alternative to FastQC? Not available in Nix
- Sentieon's WgsMetricsAlgo: proprietary
- TIDDIT's cov: TIDDIT = chromosomal rearrangements
Sarek:
- alignment statistics: samtools stats, mosdepth
- QC: MultiQC
MultiQC: not available in Nix
*** DONE FastQC
CLOSED: [2023-08-15 Tue 21:43] SCHEDULED: <2023-08-13 Sun>
*** DONE Mosdepth
CLOSED: [2023-08-15 Tue 21:43] SCHEDULED: <2023-08-13 Sun>
For exomes, the capture file is required
subworkflows/local/bam_markduplicates/
*** DONE Samtools stats
CLOSED: [2023-08-15 Tue 21:43] SCHEDULED: <2023-08-13 Sun>
*** DONE [#B] Execution report with MultiQC
CLOSED: [2023-08-15 Tue 21:43] SCHEDULED: <2023-08-13 Sun>
*** DONE Results on NA12878: 98% at 20x
CLOSED: [2023-08-19 Sat 20:45] SCHEDULED: <2023-08-17 Thu>
**** DONE Understand why only 91% at 20x: inserted SNVs
CLOSED: [2023-08-18 Fri 22:25]
***** DONE Test another kit: Twist exome comprehensive
CLOSED: [2023-08-18 Fri 22:24]
Worse
***** DONE Test genome without alt
CLOSED: [2023-08-18 Fri 22:25]
Same
***** DONE Test NA12878 without inserted SNVs: this is the cause!
CLOSED: [2023-08-18 Fri 22:25]
***** DONE Test hg19 on NA12878 without insertion
CLOSED: [2023-08-18 Fri 22:25]
**** DONE Understand why SNVs lower the score: missing reads
CLOSED: [2023-08-19 Sat 20:34] SCHEDULED: <2023-08-18 Fri>
See [[id:5c1c36f3-f68e-4e6d-a7b6-61dca89abc37][Bug: many reads lost with NA12878]]
*** DONE Rerun results with NA1287 and NA12878 + sanger
CLOSED: [2023-08-29 Tue 10:30] SCHEDULED: <2023-08-29 Tue>
*** DONE Compare with hg19
CLOSED: [2023-08-28 Mon 17:22] SCHEDULED: <2023-08-20 Sun>
*** DONE Compare with other capture kits
CLOSED: [2023-08-28 Mon 17:22] SCHEDULED: <2023-08-20 Sun>
*** DONE Compare with no-alt
CLOSED: [2023-08-28 Mon 17:22] SCHEDULED: <2023-08-20 Sun>
** HOLD Check normalization
** KILL [#B] Check HGVS nomenclature :hgvs:
CLOSED: [2023-08-16 Wed 19:07] SCHEDULED: <2023-08-15 Tue>
*** KILL mutalyzer
CLOSED: [2023-08-16 Wed 19:07] SCHEDULED: <2023-08-13 Sun>
*** KILL API variantvalidator
CLOSED: [2023-08-16 Wed 19:07] SCHEDULED: <2023-08-13 Sun>
** DONE Execution
CLOSED: [2022-09-13 Tue 21:37]
*** KILL test Bionix
*** KILL Implement execution with Nix?
See https://academic.oup.com/gigascience/article/9/11/giaa121/5987272?login=false
for an example.
Probably simpler to use Nix for environment management and snakemake for execution
No internet access from the cluster
*** DONE nextflow
CLOSED: [2022-09-13 Tue 21:37]
**** TODO Bug scheduler SGE
The job gets killed because the user is not passed correctly to nextflow
***** DONE Force the user at run time
CLOSED: [2023-04-01 Sat 17:57]
NXF_OPTS=-D"user.name=alex"
***** DONE Check whether the problem persists with 22.10.6
CLOSED: [2023-04-01 Sat 18:38] SCHEDULED: <2023-04-01 Sat>
yes
***** KILL Package the user into the program?
Bad idea..
*** DONE Reduce memory for haplotypecaller
CLOSED: [2023-09-20 Wed 21:44] SCHEDULED: <2023-09-19 Tue>
/Entered on/ [2023-09-19 Tue 15:30]
Medium = 32 GB for 6 cores => 4 jobs (i.e. the whole node) take more than the 96 GB...
Trying 16 GB
Then commit
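A quick way to check that memory budget, as a sketch only: the 96 GB node size and the 4 concurrent jobs are the figures quoted in this note, while the helper name `total_mem_gb` is purely illustrative.

```sh
# Total memory requested when several jobs share one node (values in GB).
total_mem_gb() {
  per_job_gb="$1"
  jobs="$2"
  echo $((per_job_gb * jobs))
}

node_mem_gb=96   # node memory quoted in the note
jobs_per_node=4  # concurrent jobs quoted in the note

# Compare the current setting (32) with the one being tried (16).
for per_job_gb in 32 16; do
  total="$(total_mem_gb "$per_job_gb" "$jobs_per_node")"
  if [ "$total" -gt "$node_mem_gb" ]; then
    echo "${per_job_gb} GB x ${jobs_per_node} jobs = ${total} GB: exceeds ${node_mem_gb} GB"
  else
    echo "${per_job_gb} GB x ${jobs_per_node} jobs = ${total} GB: fits in ${node_mem_gb} GB"
  fi
done
```

With 32 GB per job the four jobs request 128 GB, above the node's 96 GB; with 16 GB they request 64 GB.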
*** DONE MultiQC report with 10 runs
CLOSED: [2023-09-19 Tue 15:31] SCHEDULED: <2023-09-19 Tue>
/Entered on/ [2023-09-19 Tue 15:31]
See email of 2023-09-19
*** WAIT Bug: variant at 7788314 for patient 62982193 filtered out: DP < 30
SCHEDULED: <2023-09-25 Mon>
/Entered on/ [2023-09-22 Fri 22:59]
35 according to IGV but actually 27 in the VCF.
cento VCF: 26 reads as well...
Not confirmed by Sanger
Email sent to Alexis
** DONE Update spip to fix bug 62982239: variant too long (?)
CLOSED: [2023-09-22 Fri 22:22] SCHEDULED: <2023-09-21 Thu>
/Entered on/ [2023-09-21 Thu 23:11]
Reported in https://github.com/raphaelleman/SPiP/issues/9
*** DONE Rerun the run
CLOSED: [2023-09-22 Fri 22:22] SCHEDULED: <2023-09-21 Thu>
*** DONE Update spip
CLOSED: [2023-09-21 Thu 23:41] SCHEDULED: <2023-09-21 Thu>
** DONE nixpkgs unstable -> 23.05
CLOSED: [2023-09-22 Fri 22:22] SCHEDULED: <2023-09-21 Thu>
*** DONE rerun sanger tests
CLOSED: [2023-09-22 Fri 22:22] SCHEDULED: <2023-09-21 Thu>
** TODO Preprocessing with nextflow
*** TODO Map to reference
**** TODO Sample ID dans header
/Work/Users/apraga/bisonex/out/63003856_S135/preprocessing/baserecalibrator
*** DONE Mark duplicate
CLOSED: [2022-10-09 Sun 22:30]
*** DONE Recalibrate base quality score
CLOSED: [2022-10-09 Sun 22:30]
** DONE Variant calling with Nextflow
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Haplotype caller
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter variants
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter common snp not clinvar path
CLOSED: [2022-11-07 Mon 23:00]
See [[*common dbSNP not clinvar patho][common dbSNP not clinvar patho]]
*** DONE Filter variant only in consensual sequence
CLOSED: [2022-11-08 Tue 22:23]
*** DONE Filter technical variants
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Use AVX to speed up execution
CLOSED: [2023-04-29 Sat 15:46]
Without it, we get the warning
#+begin_quote
17:28:00.720 INFO  PairHMM - OpenMP multi-threaded AVX-accelerated native PairHMM implementation is not supported
17:28:00.721 INFO  NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/nix/store/cy9ckxqwrkifx7wf02hm4ww1p6lnbxg9-gatk-4.2.4.1/bin/gatk-package-4.2.4.1-local.jar!/com/intel/gkl/native/libgkl_utils.so
17:28:00.733 WARN  NativeLibraryLoader - Unable to load libgkl_utils.so from native/libgkl_utils.so (/Work/Users/apraga/bisonex/out/NA12878_NIST7035/preprocessing/applybqsr/libgkl_utils821485189051585397.so: libgomp.so.1: cannot open shared object file: No such file or directory)
17:28:00.733 WARN  IntelPairHmm - Intel GKL Utils not loaded
17:28:00.733 WARN  PairHMM - ***WARNING: Machine does not have the AVX instruction set support needed for the accelerated AVX PairHmm. Falling back to the MUCH slower LOGLESS_CACHING implementation!
17:28:00.763 INFO  ProgressMeter - Starting traversal
#+end_quote
libgomp.so is provided by gcc, so the module must be loaded
 module load gcc@11.3.0/gcc-12.1.0
** KILL Use subworkflow
CLOSED: [2023-04-02 Sun 18:08]
Our version is more flexible
*** KILL Alignment
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
*** KILL Vep
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
vcf_annotate_ensemblvep
** TODO Annotation with nextflow :annotation:
*** KILL VEP : --gene-phenotype ?
CLOSED: [2023-04-18 mar. 18:32]
Discussed with alexis: databases not up to date
https://www.ensembl.org/info/genome/variation/phenotype/sources_phenotype_documentation.html
*** DONE VEP plugin
CLOSED: [2023-04-18 mar. 18:32]
Clone the git repository containing the plugin
Then use --dir_plugins
*** HOLD Use code from Ale

[10.18219]

[2.29]

epIntron |               0 | Outside SPiCE Interpretation |  0 |           0 | No            |    10.00000 |       89894644 | Don           |    0.0001360257829 | No               | Don                 |             89894485 |        159 |            0 |   0.0000000000000 | No          |      89894485 |         0.07177992 | Yes              |            0.07177992 | Yes |
| chr10 | 89894645 | lol | A   | G   | .    | .      | .    | NR_028034:g.89894645:A>G    | NTR            | 00 % [00 % - 00.92 %]      |     0.000 | +      | 89894645 | substitution | A>G      | Intron 3 |     1398 | NR_028034    | FAS  | donor     |           160 | DeepIntron |               0 | Outside SPiCE Interpretation |  0 |           0 | No            |    10.00000 |       89894644 | Don           |    0.0001360257829 | No               | Don                 |             89894485 |        159 |            0 |   0.0000000000000 | No          |      89894485 |         0.07177992 | Yes              |            0.07177992 | Yes |
| chr10 | 89894645 | lol | A   | G   | .    | .      | .    | NR_028035:g.89894645:A>G    | Alter ESR      | 35.81 % [28.11 % - 44.1 %] |     0.288 | +      | 89894645 | substitution | A>G      | Exon 4   |       63 | NR_028035    | FAS  | acceptor  |             8 | ExonESR    |               0 | Outside SPiCE Interpretation |  0 |           0 | No            |    -1.67753 |       89894644 | Acc           |    0.0000003317384 | No               | Acc                 |             89894637 |          7 |     89894644 |   0.0000002205815 | No          |      89894637 |         0.02545572 | No               |            0.02545572 | No  |
| chr10 | 89894645 | lol | A   | G   | .    | .      | .    | NR_028036:g.89894645:A>G    | Alter ESR      | 35.81 % [28.11 % - 44.1 %] |     0.288 | +      | 89894645 | substitution | A>G      | Exon 5   |       63 | NR_028036    | FAS  | acceptor  |             8 | ExonESR    |               0 | Outside SPiCE Interpretation |  0 |           0 | No            |    -1.67753 |       89894644 | Acc           |    0.0000003317384 | No               | Acc                 |             89894637 |          7 |     89894644 |   0.0000002205815 | No          |      89894637 |         0.02545572 | No               |            0.02545572 | No  |
| chr10 | 89894645 | lol | A   | G   | .    | .      | .    | NR_135313:g.89894645:A>G    | Alter ESR      | 35.81 % [28.11 % - 44.1 %] |     0.288 | +      | 89894645 | substitution | A>G      | Exon 5   |       63 | NR_135313    | FAS  | acceptor  |             8 | ExonESR    |               0 | Outside SPiCE Interpretation |  0 |           0 | No            |    -1.67753 |       89894644 | Acc           |    0.0000003317384 | No               | Acc                 |             89894637 |          7 |     89894644 |   0.0000002205815 | No          |      89894637 |         0.02545572 | No               |            0.02545572 | No  |
| chr10 | 89894645 | lol | A   | G   | .    | .      | .    | NM_001410956:g.89894645:A>G | Alter ESR      | 35.81 % [28.11 % - 44.1 %] |     0.288 | +      | 89894645 | substitution | A>G      | Exon 6   |       63 | NM_001410956 | FAS  | acceptor  |             8 | ExonESR    |               0 | Outside SPiCE Interpretation |  0 |           0 | No            |    -1.67753 |       89894644 | Acc           |    0.0000003317384 | No               | Acc                 |             89894637 |          7 |     89894644 |   0.0000002205815 | No          |      89894637 |         0.02545572 | No               |            0.02545572 | No  |
| chr10 | 89894645 | lol | A   | G   | .    | .      | .    | NR_135314:g.89894645:A>G    | Alter ESR      | 35.81 % [28.11 % - 44.1 %] |     0.288 | +      | 89894645 | substitution | A>G      | Exon 6   |       63 | NR_135314    | FAS  | acceptor  |             8 | ExonESR    |               0 | Outside SPiCE Interpretation |  0 |           0 | No            |    -1.67753 |       89894644 | Acc           |    0.0000003317384 | No               | Acc                 |             89894637 |          7 |     89894644 |   0.0000002205815 | No          |      89894637 |         0.02545572 | No               |            0.02545572 | No  |
| chr10 | 89894645 | lol | A   | G   | .    | .      | .    | NR_135315:g.89894645:A>G    | Alter ESR      | 35.81 % [28.11 % - 44.1 %] |     0.288 | +      | 89894645 | substitution | A>G      | Exon 4   |       63 | NR_135315    | FAS  | acceptor  |             8 | ExonESR    |               0 | Outside SPiCE Interpretation |  0 |           0 | No            |    -1.67753 |       89894644 | Acc           |    0.0000003317384 | No               | Acc                 |             89894637 |          7 |     89894644 |   0.0000002205815 | No          |      89894637 |         0.02545572 | No               |            0.02545572 | No  |
|       |          |     |     |     |      |        |      |                             |                |                            |           |        |          |              |          |          |          |              |      |           |               |            |                 |                              |    |             |               |             |                |               |                    |                  |                     |                      |            |              |                   |             |               |                    |                  |                       |     |
**** DONE Check multiple transcripts in hg38 with genomic coordinates: ok
CLOSED: [2023-08-10 Thu 23:00]
Many more transcripts in T2T
E.g. 1 RefSeq curated transcript
http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr11%3A108257446%2D108257496&hgsid=1672963428_J5aWAqack2FpJ7mvhFTNVw7bKzxo
vs 2 transcripts in T2T
http://genome.ucsc.edu/cgi-bin/hgTracks?db=hub_3671779_hs1&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr11%3A108264969%2D108265019&hgsid=1672963612_Eso9frdQ7z6RkKkcKsIf2Waq3pec
This matches what we get with spip
*** DONE [#A] Filter vep with spip
CLOSED: [2023-08-13 Sun 00:39] SCHEDULED: <2023-08-12 Sat 19:00>
*** DONE CADD + spliceAI annotation on GRCh38 with the new version :annotation:
CLOSED: [2023-08-28 Mon 17:21] SCHEDULED: <2023-08-20 Sun>
*** DONE OMIM: only possible by gene name :annotation:
CLOSED: [2023-08-13 Sun 11:57] SCHEDULED: <2023-08-13 Sun 16:00>
Database not available, and updating it ourselves is complicated.
If we take the GRCh38 genes, they are not necessarily in T2T
E.g. DDX11L17 does not exist in T2T at these coordinates
zgrep DDX11L17 GCF_009914755.1_T2T-CHM13v2.0_genomic.gff.gz
Note: it is a pseudogene
https://www.genecards.org/cgi-bin/carddisp.pl?gene=DDX11L17
If we take the T2T genes, some of them are new.
E.g. the first one is LOC101928626.
At this position, there is nothing in GRCh38
If we try with ENSEMBL: no, because the identifiers differ
E.g. ACHE
Ideally we would use the NCBI identifier (available in OMIM), but it is not in the VEP output
And that requires the "merged" version, so it is impossible in T2T
Is it feasible to match on the gene name?
All T2T genes:
#+begin_src sh :dir ~/Downloads
 zgrep -o "ID=gene[^;]*;"  GCF_009914755.1_T2T-CHM13v2.0_genomic.gff.gz | sed 's/ID=gene-//;s/;//' | sort | uniq > t2t-genes.txt
 wc -l t2t-genes.txt
#+end_src
#+RESULTS:
: 57660 t2t-genes.txt
#+begin_src sh :dir ~/Downloads
zgrep -o "ID=gene[^;]*;"  GCF_000001405.40_GRCh38.p14_genomic.gff.gz | sed 's/ID=gene-//;s/;//' | sort | uniq > hg38-genes.txt
wc -l hg38-genes.txt
#+end_src
#+RESULTS:
: 67127 hg38-genes.txt
Genes common to both
#+begin_src sh :dir ~/Downloads
comm -12 t2t-genes.txt hg38-genes.txt | wc -l
#+end_src
#+RESULTS:
: 54506
Genes only in T2T
#+begin_src sh :dir ~/Downloads
comm -23 t2t-genes.txt hg38-genes.txt | wc -l
#+end_src
#+RESULTS:
: 3154
Genes only in GRCh38
#+begin_src sh :dir ~/Downloads
comm -13 t2t-genes.txt hg38-genes.txt | wc -l
#+end_src
#+RESULTS:
: 12621
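The three counts are consistent with the list sizes: 54506 + 3154 = 57660 (T2T) and 54506 + 12621 = 67127 (GRCh38). As a sketch, the three =comm= calls can be wrapped into one helper; =compare_gene_lists= is a hypothetical name, and both inputs are assumed to be already sorted and de-duplicated in the C locale:

```shell
# Hypothetical helper: compare two sorted, de-duplicated gene lists and print
# "common only_in_first only_in_second" on a single line.
# comm needs both inputs sorted in the same collation order, hence LC_ALL=C.
compare_gene_lists() {
    printf '%s %s %s\n' \
        "$(LC_ALL=C comm -12 "$1" "$2" | wc -l | tr -d ' ')" \
        "$(LC_ALL=C comm -23 "$1" "$2" | wc -l | tr -d ' ')" \
        "$(LC_ALL=C comm -13 "$1" "$2" | wc -l | tr -d ' ')"
}
```

Possible use: =compare_gene_lists t2t-genes.txt hg38-genes.txt=, provided the lists were produced with =LC_ALL=C sort -u=.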
*** HOLD OMIM by gene name :annotation:
*** DONE Mobidetails API
CLOSED: [2023-09-10 Sun 16:44]
Too slow: 1h to 1h30 of run time
Available in module
*** DONE Filter vep with spip for T2T and spliceAI for GRCh38
CLOSED: [2023-09-16 Sat 22:47]
*** DONE Rerun tests on GRCh38 with the new filter (spip or spliceAI) :sanger:
CLOSED: [2023-09-17 Sun 09:07] SCHEDULED: <2023-09-16 Sat>
*** HOLD Franklin API
https://www.postman.com/genoox-ps/workspace/franklin-api-documentation-s-public-workspace/documentation/6621518-4335389d-12e3-445f-8182-339df95b2a09
*** KILL Check whether clinical information is available with vep :annotation:
CLOSED: [2023-09-10 Sun 16:44]
*** TODO Test filter without splice
SCHEDULED: <2023-09-27 Wed>
Email Paul: exome, so outside splice, of little interest
**** DONE Remove the splice condition entirely: 6130 variants remaining... same as spliceAI
CLOSED: [2023-09-27 Wed 19:37] SCHEDULED: <2023-09-26 Tue>
See [[id:c9b2009a-503b-4561-94c6-29ae21a3188d][Filter vep with spliceAI: 37365 -> 6130]]
In tests/spliceai
#+begin_src sh
 filter_vep -i output-all-gpu.vcf --format vcf --filter "        not(Consequence matches non_coding_transcript or Consequence matches stream                 or Consequence matches intergenic_variant                 or Consequence matches UTR                 or Consequence matches intron_variant                 or Consequence matches synonymous                 or BIOTYPE  matches pseudogene                 or BIOTYPE  matches misc_RNA)"                --only_matched         -o test.vcf
 grep -c -v '^#' test.vcf
6130
#+end_src
**** DONE Replace with functional impact: little effect, majority = MODERATE
CLOSED: [2023-09-27 Wed 19:45] SCHEDULED: <2023-09-26 Tue>
 filter_vep -i output-all-gpu-filtered.vcf --format vcf --filter "IMPACT is HIGH"  --only_matched | grep -c -v '^#'
258
filter_vep -i output-all-gpu-filtered.vcf --format vcf --filter "IMPACT is LOW"  --only_matched | grep -c -v '^#'
11
filter_vep -i output-all-gpu-filtered.vcf --format vcf --filter "IMPACT is MODERATE"  --only_matched | grep -c -v '^#'
5824
**** DONE Look at the consequences for all transcripts
CLOSED: [2023-09-27 Wed 21:04]
/Work/Users/apraga/bisonex/out/annotate/vep/NA12878-sanger-all-T2T
 filter_vep -i NA12878-sanger-all-T2T.vep.vcf.gz --format vcf --filter "        not(Consequence matches non_coding_transcript or Consequence matches stream                 or Consequence matches intergenic_variant                 or Consequence matches UTR                 or Consequence matches intron_variant                 or Consequence matches synonymous                 or BIOTYPE  matches pseudogene                 or BIOTYPE  matches misc_RNA)"                --only_matched         -o filtered.vcf
 bcftools +split-vep filtered.vcf -f '%Consequence\n' -d | sort | uniq -c
     94 coding_sequence_variant
     13 coding_sequence_variant&NMD_transcript_variant
    257 frameshift_variant
     21 frameshift_variant&NMD_transcript_variant
      2 frameshift_variant&splice_donor_region_variant
     20 frameshift_variant&splice_region_variant
      1 frameshift_variant&splice_region_variant&NMD_transcript_variant
      1 incomplete_terminal_codon_variant&coding_sequence_variant
    211 inframe_deletion
     18 inframe_deletion&NMD_transcript_variant
      6 inframe_deletion&splice_region_variant
    242 inframe_insertion
     22 inframe_insertion&NMD_transcript_variant
      4 inframe_insertion&splice_region_variant
  14689 missense_variant
   1416 missense_variant&NMD_transcript_variant
      6 missense_variant&splice_donor_5th_base_variant
    374 missense_variant&splice_region_variant
     34 missense_variant&splice_region_variant&NMD_transcript_variant
     53 splice_acceptor_variant
     11 splice_acceptor_variant&NMD_transcript_variant
     79 splice_donor_variant
      6 splice_donor_variant&NMD_transcript_variant
     30 start_lost
      5 start_lost&NMD_transcript_variant
    135 stop_gained
     13 stop_gained&frameshift_variant
      3 stop_gained&frameshift_variant&NMD_transcript_variant
      2 stop_gained&frameshift_variant&splice_region_variant
     14 stop_gained&NMD_transcript_variant
      5 stop_gained&splice_region_variant
      2 stop_gained&splice_region_variant&NMD_transcript_variant
      4 stop_lost
      1 stop_lost&NMD_transcript_variant
      9 stop_retained_variant
      6 stop_retained_variant&NMD_transcript_variant
      1 transcript_ablation
Same in tests/spliceai
bcftools +split-vep output-all-gpu-filtered.vcf  -f '%Consequence\n' -d | sort | uniq -c
     94 coding_sequence_variant
     13 coding_sequence_variant&NMD_transcript_variant
    257 frameshift_variant
     21 frameshift_variant&NMD_transcript_variant
      2 frameshift_variant&splice_donor_region_variant
     20 frameshift_variant&splice_region_variant
      1 frameshift_variant&splice_region_variant&NMD_transcript_variant
      1 incomplete_terminal_codon_variant&coding_sequence_variant
    211 inframe_deletion
     18 inframe_deletion&NMD_transcript_variant
      6 inframe_deletion&splice_region_variant
    242 inframe_insertion
     22 inframe_insertion&NMD_transcript_variant
      4 inframe_insertion&splice_region_variant
  14689 missense_variant
   1416 missense_variant&NMD_transcript_variant
      6 missense_variant&splice_donor_5th_base_variant
    374 missense_variant&splice_region_variant
     34 missense_variant&splice_region_variant&NMD_transcript_variant
     53 splice_acceptor_variant
     11 splice_acceptor_variant&NMD_transcript_variant
     79 splice_donor_variant
      6 splice_donor_variant&NMD_transcript_variant
     30 start_lost
      5 start_lost&NMD_transcript_variant
    135 stop_gained
     13 stop_gained&frameshift_variant
      3 stop_gained&frameshift_variant&NMD_transcript_variant
      2 stop_gained&frameshift_variant&splice_region_variant
     14 stop_gained&NMD_transcript_variant
      5 stop_gained&splice_region_variant
      2 stop_gained&splice_region_variant&NMD_transcript_variant
      4 stop_lost
      1 stop_lost&NMD_transcript_variant
      9 stop_retained_variant
      6 stop_retained_variant&NMD_transcript_variant
      1 transcript_ablation
**** DONE Look at the consequences with -s worst
CLOSED: [2023-09-27 Wed 21:04]
/Work/Users/apraga/bisonex/out/annotate/vep/NA12878-sanger-all-T2T
After filter_vep without splice
 bcftools +split-vep filtered.vcf -f '%Consequence\n' -d -s worst | sort | uniq -c
     48 coding_sequence_variant
      6 coding_sequence_variant&nmd_transcript_variant
    121 frameshift_variant
      9 frameshift_variant&nmd_transcript_variant
      1 frameshift_variant&splice_donor_region_variant
      9 frameshift_variant&splice_region_variant
     79 inframe_deletion
      3 inframe_deletion&nmd_transcript_variant
      2 inframe_deletion&splice_region_variant
     85 inframe_insertion
      2 inframe_insertion&nmd_transcript_variant
      1 inframe_insertion&splice_region_variant
   5309 missense_variant
    207 missense_variant&nmd_transcript_variant
      3 missense_variant&splice_donor_5th_base_variant
    110 missense_variant&splice_region_variant
      9 missense_variant&splice_region_variant&nmd_transcript_variant
     19 splice_acceptor_variant
      1 splice_acceptor_variant&nmd_transcript_variant
     21 splice_donor_variant
      1 splice_donor_variant&nmd_transcript_variant
     14 start_lost
     44 stop_gained
      4 stop_gained&frameshift_variant
      2 stop_gained&frameshift_variant&splice_region_variant
      3 stop_gained&nmd_transcript_variant
      3 stop_gained&splice_region_variant
      2 stop_gained&splice_region_variant&nmd_transcript_variant
      2 stop_lost
      1 stop_lost&nmd_transcript_variant
      6 stop_retained_variant
      2 stop_retained_variant&nmd_transcript_variant
      1 transcript_ablation
In tests/spliceai
 bcftools +split-vep output-all-gpu-filtered.vcf  -f '%Consequence\n' -s worst -d | sort | uniq -c
     48 coding_sequence_variant
      6 coding_sequence_variant&nmd_transcript_variant
    121 frameshift_variant
      9 frameshift_variant&nmd_transcript_variant
      1 frameshift_variant&splice_donor_region_variant
      9 frameshift_variant&splice_region_variant
     79 inframe_deletion
      3 inframe_deletion&nmd_transcript_variant
      2 inframe_deletion&splice_region_variant
     85 inframe_insertion
      2 inframe_insertion&nmd_transcript_variant
      1 inframe_insertion&splice_region_variant
   5309 missense_variant
    207 missense_variant&nmd_transcript_variant
      3 missense_variant&splice_donor_5th_base_variant
    110 missense_variant&splice_region_variant
      9 missense_variant&splice_region_variant&nmd_transcript_variant
     19 splice_acceptor_variant
      1 splice_acceptor_variant&nmd_transcript_variant
     21 splice_donor_variant
      1 splice_donor_variant&nmd_transcript_variant
     14 start_lost
     44 stop_gained
      4 stop_gained&frameshift_variant
      2 stop_gained&frameshift_variant&splice_region_variant
      3 stop_gained&nmd_transcript_variant
      3 stop_gained&splice_region_variant
      2 stop_gained&splice_region_variant&nmd_transcript_variant
      2 stop_lost
      1 stop_lost&nmd_transcript_variant
      6 stop_retained_variant
      2 stop_retained_variant&nmd_transcript_variant
      1 transcript_ablation
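Summing every row of the =-s worst= tally gives 6130, the number of variants left after filtering. To aggregate such a tally by category (for example everything mentioning splice), a small awk sketch can be used; =sum_tally= is a hypothetical helper, and its input is assumed to be the output of =sort | uniq -c= saved to a file:

```shell
# Hypothetical helper: sum the counts of a `sort | uniq -c` tally for the rows
# whose label (second field) matches an extended regular expression.
# Usage: sum_tally PATTERN TALLY_FILE
sum_tally() {
    awk -v pat="$1" '$2 ~ pat { total += $1 } END { print total + 0 }' "$2"
}
```

For instance =sum_tally splice tally.txt= would add up all consequences containing "splice", and =sum_tally . tally.txt= the whole table (=tally.txt= is a placeholder name).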
**** TODO Check whether the sanger tests pass: no
SCHEDULED: <2023-09-27 Wed>
     │ String                Float64   Int64
─────┼───────────────────────────────────────
   1 │ chr10:g.130884530      60.0     67
   2 │ chr10:g.240362         60.0     79
   3 │ chr14:g.52665581       60.0     51
   4 │ chr19:g.41325390       60.0    180
** DONE [#B] Quality indicators :qualité:
CLOSED: [2023-09-10 Sun 16:46]
*** Idea
Raredisease:
- FastQC: many statistics. Not available in Nix
- Mosdepth: computes depth (2x faster than samtools depth). In Nix
- MultiQC: only merges the results of the analyses. Not available in Nix
- Picard's CollectMultipleMetrics, CollectHsMetrics, and CollectWgsMetrics
- Qualimap: alternative to FastQC? Not available in Nix
- Sentieon's WgsMetricsAlgo: proprietary
- TIDDIT's cov: TIDDIT = chromosomal rearrangement
Sarek:
- alignment statistics : samtools stats, mosdepth
- QC : MultiQC
MultiQC: not available in Nix
*** DONE FastQC
CLOSED: [2023-08-15 Tue 21:43] SCHEDULED: <2023-08-13 Sun>
*** DONE Mosdepth
CLOSED: [2023-08-15 Tue 21:43] SCHEDULED: <2023-08-13 Sun>
For exome, the capture file is required
subworkflows/local/bam_markduplicates/
*** DONE Samtools stats
CLOSED: [2023-08-15 Tue 21:43] SCHEDULED: <2023-08-13 Sun>
*** DONE [#B] Run report with MultiQC
CLOSED: [2023-08-15 Tue 21:43] SCHEDULED: <2023-08-13 Sun>
*** DONE Results on NA12878: 98% at 20x
CLOSED: [2023-08-19 Sat 20:45] SCHEDULED: <2023-08-17 Thu>
**** DONE Understand why only 91% at 20x: inserted SNVs
CLOSED: [2023-08-18 Fri 22:25]
***** DONE Test another kit: Twist exome comprehensive
CLOSED: [2023-08-18 Fri 22:24]
Worse
***** DONE Test genome without alt
CLOSED: [2023-08-18 Fri 22:25]
Same
***** DONE Test NA12878 without inserted SNVs: this is the cause!
CLOSED: [2023-08-18 Fri 22:25]
***** DONE Test hg19 on NA12878 without inserted SNVs
CLOSED: [2023-08-18 Fri 22:25]
**** DONE Understand why SNVs lower the score: missing reads
CLOSED: [2023-08-19 Sat 20:34] SCHEDULED: <2023-08-18 Fri>
See [[id:5c1c36f3-f68e-4e6d-a7b6-61dca89abc37][Bug: many reads lost with NA12878]]
*** DONE Rerun results with NA1287 and NA12878 + sanger
CLOSED: [2023-08-29 Tue 10:30] SCHEDULED: <2023-08-29 Tue>
*** DONE Compare with hg19
CLOSED: [2023-08-28 Mon 17:22] SCHEDULED: <2023-08-20 Sun>
*** DONE Compare with other capture kits
CLOSED: [2023-08-28 Mon 17:22] SCHEDULED: <2023-08-20 Sun>
*** DONE Compare with no-alt
CLOSED: [2023-08-28 Mon 17:22] SCHEDULED: <2023-08-20 Sun>
** HOLD check for normalisation
** KILL [#B] Check hgvs nomenclature :hgvs:
CLOSED: [2023-08-16 Wed 19:07] SCHEDULED: <2023-08-15 Tue>
*** KILL mutalyzer
CLOSED: [2023-08-16 Wed 19:07] SCHEDULED: <2023-08-13 Sun>
*** KILL API variantvalidator
CLOSED: [2023-08-16 Wed 19:07] SCHEDULED: <2023-08-13 Sun>
** DONE Execution
CLOSED: [2022-09-13 Tue 21:37]
*** KILL test Bionix
*** KILL Implement execution with Nix?
See https://academic.oup.com/gigascience/article/9/11/giaa121/5987272?login=false
for an example.
Probably simpler to use Nix for environment management and snakemake for execution
No internet access from the cluster
*** DONE nextflow
CLOSED: [2022-09-13 Tue 21:37]
**** TODO Bug scheduler SGE
The job gets killed because the user is not passed correctly to nextflow
***** DONE Force the user at run time
CLOSED: [2023-04-01 Sat 17:57]
NXF_OPTS=-D"user.name=alex"
***** DONE Check whether the problem persists with 22.10.6
CLOSED: [2023-04-01 Sat 18:38] SCHEDULED: <2023-04-01 Sat>
yes
***** KILL Package the user into the program?
Bad idea.
*** DONE Reduce memory for haplotypecaller
CLOSED: [2023-09-20 Wed 21:44] SCHEDULED: <2023-09-19 Tue>
/Entered on/ [2023-09-19 Tue 15:30]
Medium = 32 GB for 6 cores => 4 jobs (i.e. the whole node) need more than the 96 GB available
Trying 16 GB
Then commit
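The arithmetic behind this choice: 4 jobs x 32 GB = 128 GB, which exceeds the 96 GB of the node, whereas 4 x 16 GB = 64 GB fits. A minimal sketch of that check, with =fits_on_node= as a hypothetical helper name and all sizes in whole GB:

```shell
# Hypothetical helper: do N concurrent jobs of M GB each fit on a node with
# T GB of memory? Prints "fits" or "exceeds".
# Usage: fits_on_node N_JOBS GB_PER_JOB NODE_GB
fits_on_node() {
    n_jobs="$1"; per_job_gb="$2"; node_gb="$3"
    if [ $((n_jobs * per_job_gb)) -le "$node_gb" ]; then
        echo fits
    else
        echo exceeds
    fi
}
```

With the figures above, =fits_on_node 4 32 96= reports =exceeds= and =fits_on_node 4 16 96= reports =fits=.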
*** DONE multiqc report with 10 runs
CLOSED: [2023-09-19 Tue 15:31] SCHEDULED: <2023-09-19 Tue>
/Entered on/ [2023-09-19 Tue 15:31]
See email of 2023-09-19
*** WAIT Bug: variant at 7788314 for patient 62982193 filtered out: DP < 30
SCHEDULED: <2023-09-25 Mon>
/Entered on/ [2023-09-22 Fri 22:59]
35 according to IGV but actually 27 in the VCF.
cento VCF: 26 reads as well...
Not confirmed by sanger
Email sent to Alexis
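To list which records fall under the depth threshold, the DP value can be read directly from the VCF. A rough sketch, assuming an uncompressed single-sample VCF that carries DP in the FORMAT column (the pipeline may filter on a different field); =low_depth= is a hypothetical helper:

```shell
# Hypothetical helper: print CHROM, POS and sample DP for every record whose
# depth is below a threshold.
# Assumes a single-sample, uncompressed VCF with DP among the FORMAT keys.
# Usage: low_depth THRESHOLD FILE.vcf
low_depth() {
    awk -F'\t' -v threshold="$1" '
        /^#/ { next }                       # skip header lines
        {
            n = split($9, keys, ":")        # FORMAT keys, e.g. GT:AD:DP
            split($10, vals, ":")           # matching sample values
            for (i = 1; i <= n; i++)
                if (keys[i] == "DP" && vals[i] + 0 < threshold)
                    print $1, $2, vals[i]
        }' "$2"
}
```

A missing depth (=.=) is treated as 0 here and is therefore reported as well.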
** DONE Update spip to fix bug 62982239: variant too long (?)
CLOSED: [2023-09-22 Fri 22:22] SCHEDULED: <2023-09-21 Thu>
/Entered on/ [2023-09-21 Thu 23:11]
Reported at https://github.com/raphaelleman/SPiP/issues/9
*** DONE Rerun
CLOSED: [2023-09-22 Fri 22:22] SCHEDULED: <2023-09-21 Thu>
*** DONE Update spip
CLOSED: [2023-09-21 Thu 23:41] SCHEDULED: <2023-09-21 Thu>
** DONE nixpkgs unstable -> 23.05
CLOSED: [2023-09-22 Fri 22:22] SCHEDULED: <2023-09-21 Thu>
*** DONE rerun sanger tests
CLOSED: [2023-09-22 Fri 22:22] SCHEDULED: <2023-09-21 Thu>
** TODO Preprocessing with nextflow
*** TODO Map to reference
**** TODO Sample ID in header
/Work/Users/apraga/bisonex/out/63003856_S135/preprocessing/baserecalibrator
*** DONE Mark duplicate
CLOSED: [2022-10-09 Sun 22:30]
*** DONE Recalibrate base quality score
CLOSED: [2022-10-09 Sun 22:30]
** DONE Variant calling with Nextflow
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Haplotype caller
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter variants
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter common snp not clinvar path
CLOSED: [2022-11-07 Mon 23:00]
See [[*common dbSNP not clinvar patho][common dbSNP not clinvar patho]]
*** DONE Filter variant only in consensual sequence
CLOSED: [2022-11-08 Tue 22:23]
*** DONE Filter technical variants
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Use AVX to speed up execution
CLOSED: [2023-04-29 Sat 15:46]
Without it, we get the following warning
#+begin_quote
17:28:00.720 INFO  PairHMM - OpenMP multi-threaded AVX-accelerated native PairHMM implementation is not supported
17:28:00.721 INFO  NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/nix/store/cy9ckxqwrkifx7wf02hm4ww1p6lnbxg9-gatk-4.2.4.1/bin/gatk-package-4.2.4.1-local.jar!/com/intel/gkl/native/libgkl_utils.so
17:28:00.733 WARN  NativeLibraryLoader - Unable to load libgkl_utils.so from native/libgkl_utils.so (/Work/Users/apraga/bisonex/out/NA12878_NIST7035/preprocessing/applybqsr/libgkl_utils821485189051585397.so: libgomp.so.1: cannot open shared object file: No such file or directory)
17:28:00.733 WARN  IntelPairHmm - Intel GKL Utils not loaded
17:28:00.733 WARN  PairHMM - ***WARNING: Machine does not have the AVX instruction set support needed for the accelerated AVX PairHmm. Falling back to the MUCH slower LOGLESS_CACHING implementation!
17:28:00.763 INFO  ProgressMeter - Starting traversal
#+end_quote
libgomp.so is provided by gcc, so the module must be loaded
 module load gcc@11.3.0/gcc-12.1.0
** KILL Use subworkflow
CLOSED: [2023-04-02 Sun 18:08]
Our own version is more flexible
*** KILL Alignment
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
*** KILL Vep
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
vcf_annotate_ensemblvep
** TODO Annotation with nextflow :annotation:
*** KILL VEP : --gene-phenotype ?
CLOSED: [2023-04-18 mar. 18:32]
Discussed with alexis: databases not up to date
https://www.ensembl.org/info/genome/variation/phenotype/sources_phenotype_documentation.html
*** DONE VEP plugin
CLOSED: [2023-04-18 mar. 18:32]
Clone the git repository containing the plugin
Then use --dir_plugins
*** HOLD Use code from Ale

Replacement in projects/bisonex.org at line 23 [8.35]

B:BD[2.8221] → [2.8221:16413]

data)
    File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/engine/training.py", line 2111, in predict_step
      return self(x, training=False)
    File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/engine/training.py", line 558, in __call__
      return super().__call__(*args, **kwargs)
    File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/engine/base_layer.py", line 1145, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/engine/functional.py", line 512, in call
      return self._run_internal_graph(inputs, training=training, mask=mask)
    File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/engine/functional.py", line 669, in _run_internal_graph
      outputs = node.layer(*args, **kwargs)
    File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/engine/base_layer.py", line 1145, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/layers/convolutional/base_conv.py", line 290, in call
      outputs = self.convolution_op(inputs, self.kernel)
    File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/layers/convolutional/base_conv.py", line 262, in convolution_op
      return tf.nn.convolution(
Node: 'model_1/conv1d_3/Conv1D'
DNN library is not found.
         [[{{node model_1/conv1d_3/Conv1D}}]] [Op:__inference_predict_function_22195]
#+end_quote
***** DONE GPU: chr20 ok
CLOSED: [2023-09-26 Tue 11:50]
LD_PRELOAD=/lib64/libcuda.so spliceai -I NA12878-sanger-20-2-T2T.vep.vcf.gz -O output-20-2-gpu.vcf -R /Work/Groups/bisonex/data/fasta/chm13v2.0/chm13v2.0.fa -A ~/t2t.txt
run time: 5 min
***** STRT GPU: all the data
SCHEDULED: <2023-09-26 Tue>
#+begin_src slurm
#!/bin/bash -l
# File submission.SBATCH
#SBATCH --job-name="spliceai-gpu"
#SBATCH --output=%x.%J.out   ## %x=job name, %J=job id
#SBATCH --error=%x.%J.out
# walltime (hh:mm:ss) max is 8 days
#SBATCH -t 24:00:00
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
## To request more memory, use --mem option.
## Please don't use more than 128g.
#SBATCH --mem=32G
## your email address for notifications
#SBATCH --mail-user=apraga@chu-besancon.fr
#SBATCH --mail-type=END,FAIL
nvidia-smi
module purge
module load nix/2.11.0
LD_PRELOAD=/lib64/libcuda.so spliceai -I NA12878-sanger-all-T2T.vep.vcf.gz -O output-all-gpu.vcf -R /Work/Groups/bisonex/data/fasta/chm13v2.0/chm13v2.0.fa -A ~/t2t.txt
#+end_src
***** TODO With pip: failure
2023-09-24 08:28:46.361434: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1956] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU.
***** KILL Try conda: failure
CLOSED: [2023-09-23 Sat 21:43] SCHEDULED: <2023-09-23 Sat>
Installation fails
#+begin_quote
  - feature:/linux-64::__glibc==2.28=0
  - python=3.11 -> libgcc-ng[version='>=11.2.0'] -> __glibc[version='>=2.17']
  - spliceai -> tensorflow[version='>=1.13.0'] -> __cuda
  - spliceai -> tensorflow[version='>=1.13.0'] -> __glibc[version='>=2.17']
Your installed version is: 2.28
#+end_quote
**** TODO Add LOEUF and pLI
VEP plugin
**** TODO NMD
VEP plugin
**** KILL Add LOEUF
CLOSED: [2023-04-19 mer. 16:32]
VEP plugin
**** DONE Spip
CLOSED: [2023-05-01 Mon 23:07] SCHEDULED: <2023-04-30 Sun>
BED does not seem to work well (a region has to be defined)
VCF: too much information
Note: several transcripts but identical results. We remove the duplicates
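One way to drop rows that differ only by transcript is to de-duplicate on the columns that carry the result. This is only a sketch: the column numbers depend on the actual spip output, the input is assumed to be tab-separated, and =dedup_by_columns= is a hypothetical helper rather than the script used in the pipeline:

```shell
# Hypothetical helper: keep the first row for each distinct combination of the
# selected tab-separated columns (comma-separated list, 1-based).
# Usage: dedup_by_columns 1,2,3,4,6 FILE.tsv
dedup_by_columns() {
    awk -F'\t' -v cols="$1" '
        BEGIN { n = split(cols, idx, ",") }
        {
            key = ""
            for (i = 1; i <= n; i++) key = key SUBSEP $(idx[i])
            if (!(key in seen)) { seen[key] = 1; print }
        }' "$2"
}
```

Leaving the transcript column out of the key keeps a single row per variant when the scores are identical across transcripts.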
***** DONE interpretation + score + confidence interval, separated
CLOSED: [2023-05-01 Mon 23:07] SCHEDULED: <2023-04-30 Sun>
Tests:
in tests/
vep -i 63004925-small.vcf -o postvep.vcf --vcf --fasta genomeRef.fna --dir 109 --merged --pick  --offline --custom ../script/spip_annotation.vcf.gz,SPIP,vcf,exact,0,spipInterp,spipScore,spipConfidence
***** DONE Score
CLOSED: [2023-04-22 Sat 15:30]
**** DONE CADD: replace with the VEP plugin
CLOSED: [2023-05-07 Sun 14:45] SCHEDULED: <2023-05-07 Sun>
***** Test
#+begin_src
vep  -i test.vcf  -o lol.vcf --offline --dir  /Work/Projects/bisonex/data/vep/GRCh38/ --merged --vcf --fasta /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna --plugin CADD,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.snv.tsv.gz,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.indel.tsv.gz  --dir_plugins ../VEP_plugins/ -v
#+end_src
Test
#+begin_src sh
vep --id "1  230710048 230710048 A/G 1"   --offline --dir  /Work/Projects/bisonex/data/vep/GRCh38/ --merged --vcf --fasta /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna --plugin CADD,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.snv.tsv.gz,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.indel.tsv.gz  --hgvsg --plugin pLI --plugin LOEUF -o lol
#+end_src
CSQ=G|missense_variant|MODERATE|AGT|ENSG00000135744|Transcript|ENST00000366667|protein_coding|2/5||||843|776|259|M/T|aTg/aCg|||-1||HGNC|HGNC:333||Ensembl||A|A||1:g.230710048A>G|0.347|-0.277922|
This matches https://www.ensembl.org/Homo_sapiens/Tools/VEP/Results?tl=I7ZsIbrj14P6lD43-9115494
***** DONE Use whole genome
CLOSED: [2023-04-29 Sat 15:46]
***** KILL Rename the chromosomes beforehand ...
CLOSED: [2023-05-01 Mon 09:14] SCHEDULED: <2023-04-30 Sun>
Too slow!
- Downloading CADD: 4h20
- renaming the chromosomes for the SNVs: 6h20
- tabix on the SNVs: job killed after 21h
***** DONE annotate separately and merge the tables
CLOSED: [2023-05-07 Sun 14:45] SCHEDULED: <2023-05-01 Mon>
NB: CADD could be filtered with tabix to restrict it to our variants
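A possible sketch of that filtering: build a CHROM/POS region list from our VCF and hand it to =tabix -R=. Stripping the "chr" prefix in the region list would avoid renaming the chromosomes in the CADD table itself. Assumptions: the CADD table names contigs 1, 2, ... without "chr", and the helper name and file names are placeholders.
#+begin_src sh
# Hypothetical helper: print "CHROM<TAB>POS" once for every variant position
# of an uncompressed VCF, stripping a leading "chr" from the contig name.
vcf_to_regions() {
    awk -F'\t' -v OFS='\t' '!/^#/ { sub(/^chr/, "", $1); print $1, $2 }' "$1" | sort -u
}

# Usage (not run here, file names are placeholders):
#   vcf_to_regions test.vcf > regions.tsv
#   tabix -h -R regions.tsv whole_genome_SNVs.tsv.gz | bgzip > cadd.subset.tsv.gz
#   tabix -s 1 -b 2 -e 2 cadd.subset.tsv.gz
#+end_src
The subset is far smaller than the full table, so indexing it should not hit the 21h wall seen above.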
**** DONE clinvar
CLOSED: [2023-04-22 Sat 15:31]
**** KILL Check HGVS results with mutalyzer
CLOSED: [2023-05-01 Mon 09:26]
**** HOLD Parallelization
***** HOLD per chromosome with the VEP workflow
https://github.com/Ensembl/ensembl-vep/blob/release/109/nextflow/workflows/run_vep.nf
***** HOLD With the --fork option
**** DONE Use the nf-core version of VEP
CLOSED: [2023-05-13 Sat 18:27] SCHEDULED: <2023-05-07 Sun>
**** DONE OMIM
CLOSED: [2023-08-31 Thu 10:38] SCHEDULED: <2023-08-29 Tue>
**** DONE pLI and LOEUF from gnomAD
CLOSED: [2023-08-31 Thu 10:38] SCHEDULED: <2023-08-29 Tue>
**** DONE Grantham
CLOSED: [2023-08-31 Thu 22:08] SCHEDULED: <2023-08-30 Wed>
**** DONE Fix spliceAI
CLOSED: [2023-08-31 Thu 13:51] SCHEDULED: <2023-08-31 Thu>
No annotation
- chromosome naming? Tried 1 instead of chr1: same result. And it works for CADD
- index?
  -

[2.8221]

[2.16413]

data)
    File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/engine/training.py", line 2111, in predict_step
      return self(x, training=False)
    File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/engine/training.py", line 558, in __call__
      return super().__call__(*args, **kwargs)
    File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/engine/base_layer.py", line 1145, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/engine/functional.py", line 512, in call
      return self._run_internal_graph(inputs, training=training, mask=mask)
    File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/engine/functional.py", line 669, in _run_internal_graph
      outputs = node.layer(*args, **kwargs)
    File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/engine/base_layer.py", line 1145, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/layers/convolutional/base_conv.py", line 290, in call
      outputs = self.convolution_op(inputs, self.kernel)
    File "/Softs/helios/gpu/anaconda3/2023.03-1/envs/tensorflow-gpu-2.12.0+py3.9/lib/python3.9/site-packages/keras/layers/convolutional/base_conv.py", line 262, in convolution_op
      return tf.nn.convolution(
Node: 'model_1/conv1d_3/Conv1D'
DNN library is not found.
         [[{{node model_1/conv1d_3/Conv1D}}]] [Op:__inference_predict_function_22195]
#+end_quote
***** DONE GPU: chr20 ok
CLOSED: [2023-09-26 Tue 11:50]
LD_PRELOAD=/lib64/libcuda.so spliceai -I NA12878-sanger-20-2-T2T.vep.vcf.gz -O output-20-2-gpu.vcf -R /Work/Groups/bisonex/data/fasta/chm13v2.0/chm13v2.0.fa -A ~/t2t.txt
run time: 5 min
***** STRT GPU: all the data :GPU:spliceai:
****** DONE Run: 70GB, 3h30
CLOSED: [2023-09-27 Wed 10:37] SCHEDULED: <2023-09-26 Tue>
32G is not enough! About 70GB is needed:
Job ID: 17340
Cluster: mesoubfc
User/Group: apraga/mesousers
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 03:11:53
CPU Efficiency: 93.55% of 03:25:07 core-walltime
Job Wall-clock time: 03:25:07
Memory Utilized: 67.75 GB
Memory Efficiency: 52.93% of 128.00 GB
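The memory request can be derived from the seff report instead of guessed. A rough sketch, assuming a 20% headroom over "Memory Utilized" and a value reported in GB. The helper name and the headroom factor are arbitrary choices for illustration.
#+begin_src sh
# Hypothetical helper: read a seff report on stdin and print a --mem value
# in whole GB (rounded up) with 20% headroom over "Memory Utilized".
# Only handles reports where the value is given in GB.
suggest_mem() {
    awk '/^Memory Utilized:/ && $4 == "GB" {
        need = $3 * 1.2
        gb = int(need); if (gb < need) gb++
        print gb "G"
    }'
}
#+end_src
Usage: =seff 17340 | suggest_mem=. For the report above (67.75 GB) this prints 82G, which stays below the 128g limit of the cluster.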
#+begin_src slurm
#!/bin/bash -l
# File submission.SBATCH
#SBATCH --job-name="spliceai-gpu"
#SBATCH --output=%x.%J.out   ## %x=job name, %J=job id
#SBATCH --error=%x.%J.out
# walltime (hh:mm:ss) max is 8 days
#SBATCH -t 24:00:00
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
## To request more memory, use --mem option.
## Please don't use more than 128g.
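## The completed run above used 67.75 GB out of 128 GB allocated, so 64G would not have been enough.
#SBATCH --mem=128G
## your email address for notifications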
#SBATCH --mail-user=apraga@chu-besancon.fr
#SBATCH --mail-type=END,FAIL
nvidia-smi
module purge
module load nix/2.11.0
LD_PRELOAD=/lib64/libcuda.so spliceai -I NA12878-sanger-all-T2T.vep.vcf.gz -O output-all-gpu.vcf -R /Work/Groups/bisonex/data/fasta/chm13v2.0/chm13v2.0.fa -A ~/t2t.txt
#+end_src
****** TODO Annotate the VEP output with this VCF
Generate an annotation file
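#+begin_src sh
# -O z writes bgzip-compressed output, so the .vcf.gz file can be indexed
bcftools annotate -x INFO/CSQ -O z -o spliceai.vcf.gz output-all-gpu.vcf
# -t builds a tabix (.tbi) index, as the VEP docs ask for custom files
bcftools index -t spliceai.vcf.gz
#+end_src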
Annotate with vep
#+begin_src sh
ln -s  /Work/Projects/bisonex/data/fasta/chm13v2.0/chm13v2.0.fa .
ln -s  /Work/Projects/bisonex/data/fasta/chm13v2.0/chm13v2.0.fa.fai .
ln -s  /Work/Projects/bisonex/data/vep/chm13v2.0/106 .
ln -s  /Work/Projects/bisonex/data/clinvar/chm13v2.0/clinvar.vcf.gz .
ln -s  /Work/Projects/bisonex/data/clinvar/chm13v2.0/clinvar.vcf.gz.tbi .
#+end_src
#+begin_src sh
#vep -i  output-all-gpu.vcf -o output-all-gpu-filtered.vcf --appris --biotype --canonical --ccds --compress_output bgzip --domains --exclude_predicted --flag_pick --hgvs --hgvsg --gene_phenotype --numbers --mane --protein --offline --uniprot --symbol --tsl --use_given_ref --variant_class --vcf --plugin NMD --custom clinvar.vcf.gz,ClinVar,vcf,exact,0,CLNSIG,CLNREVSTAT,CLNDN --plugin SpliceAI,snv=spliceai.vcf.gz,indel=spliceai.vcf.gz --fasta chm13v2.0.fa --assembly T2T-CHM13v2.0 --species homo_sapiens_gca009914755v4/ --cache  --cache_version 106 --dir_cache 106
vep -i  lol.vcf --force -o test.vcf.gz --appris --biotype --canonical --ccds --compress_output bgzip --domains --exclude_predicted --flag_pick --hgvs --hgvsg --gene_phenotype --numbers --mane --protein --offline --uniprot --symbol --tsl --use_given_ref --variant_class --vcf --plugin NMD --custom clinvar.vcf.gz,ClinVar,vcf,exact,0,CLNSIG,CLNREVSTAT,CLNDN --custom spliceai.vcf.gz,SpliceAI,vcf,exact,0,DS_AG%DS_AL --fasta chm13v2.0.fa --assembly T2T-CHM13v2.0 --species homo_sapiens_gca009914755v4/ --cache  --cache_version 106 --dir_cache ${PWD}/106
#+end_src
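=--custom= can only extract fields that exist as INFO keys in the annotation VCF, so it helps to list them before choosing the field names. A small sketch (the helper name is an assumption):
#+begin_src sh
# Hypothetical helper: list the INFO IDs declared in a VCF header.
# Reads the files given as arguments, or stdin when called without arguments.
vcf_info_ids() {
    sed -n 's/^##INFO=<ID=\([^,>]*\).*/\1/p' "$@"
}
#+end_src
Usage: =zcat spliceai.vcf.gz | vcf_info_ids=. If DS_AG or DS_AL are not in the list, they cannot be requested as separate fields.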
****** KILL Save
CLOSED: [2023-09-27 Wed 21:40]
# ****** DONE Filter vep with spliceAI: 37365 -> 6130. SpliceAI adds nothing
# CLOSED: [2023-09-27 Wed 19:37] SCHEDULED: <2023-09-27 Wed>
# :PROPERTIES:
# :ID:       c9b2009a-503b-4561-94c6-29ae21a3188d
# :END:
# #+begin_src sh
# filter_vep -i output-all-gpu.vcf --format vcf --filter "        not(Consequence matches non_coding_transcript or Consequence matches stream                 or Consequence matches intergenic_variant                 or Consequence matches UTR                 or Consequence matches intron_variant                 or Consequence matches synonymous                 or BIOTYPE  matches pseudogene                 or BIOTYPE  matches misc_RNA) or (SpliceAI_pred_DS_AG and SpliceAI_pred_DS_AG >= 0.2)             or (SpliceAI_pred_DS_AL and SpliceAI_pred_DS_AL >= 0.2)             or (SpliceAI_pred_DS_DG and SpliceAI_pred_DS_DG >= 0.2)             or (SpliceAI_pred_DS_DL and SpliceAI_pred_DS_DL >= 0.2) "                --only_matched         -o output-all-gpu-filtered.vcf
# #+end_src
# filter_vep -i output-all-gpu.vcf --format vcf --filter "        not(Consequence matches non_coding_transcript or Consequence matches stream                 or Consequence matches intergenic_variant                 or Consequence matches UTR                 or Consequence matches intron_variant                 or Consequence matches synonymous                 or BIOTYPE  matches pseudogene                 or BIOTYPE  matches misc_RNA)"                --only_matched        | grep -c -v '^#'
# 6130
# $ grep -c -v '^#' output-all-gpu-filtered.vcf
# 6130
# ****** DONE Re-check the filter with spip: 7730 -> probably a problem with spip
# CLOSED: [2023-09-27 Wed 20:54] SCHEDULED: <2023-09-27 Wed>
#  filter_vep -i NA12878-sanger-all-T2T.vep.vcf.gz --format vcf --filter "        not(Consequence matches non_coding_transcript or Consequence matches stream                 or Consequence matches intergenic_variant                 or Consequence matches UTR                 or Consequence matches intron_variant                 or Consequence matches synonymous                 or BIOTYPE  matches pseudogene                 or BIOTYPE  matches misc_RNA) or (SPIP_spipScore and SPIP_spipScore >= 20)"                --only_matched        | grep -c -v '^#'
# perl: warning: Setting locale failed.
# perl: warning: Please check that your locale settings:
#         LANGUAGE = (unset),
#         LC_ALL = (unset),
#         LANG = "en_US.utf8"
#     are supported and installed on your system.
# perl: warning: Falling back to the standard locale ("C").
# 7730
****** TODO check whether the sanger tests pass
SCHEDULED: <2023-09-27 Wed>
***** TODO With pip: failure
2023-09-24 08:28:46.361434: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1956] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU.
***** DONE Test conda: failure
CLOSED: [2023-09-23 Sat 21:43] SCHEDULED: <2023-09-23 Sat>
Anaconda: installation fails
#+begin_quote
  - feature:/linux-64::__glibc==2.28=0
  - python=3.11 -> libgcc-ng[version='>=11.2.0'] -> __glibc[version='>=2.17']
  - spliceai -> tensorflow[version='>=1.13.0'] -> __cuda
  - spliceai -> tensorflow[version='>=1.13.0'] -> __glibc[version='>=2.17']
Your installed version is: 2.28
#+end_quote
mamba should be used instead
**** TODO Add LOEUF and pLI
VEP plugin
**** TODO NMD
VEP plugin
**** KILL Add LOEUF
CLOSED: [2023-04-19 Wed 16:32]
VEP plugin
**** DONE Spip
CLOSED: [2023-05-01 Mon 23:07] SCHEDULED: <2023-04-30 Sun>
BED input does not seem to work well (a region has to be defined)
VCF: too much information
Note: several transcripts but identical results. We remove the duplicates.
***** DONE interpretation + score + confidence interval as separate fields
CLOSED: [2023-05-01 Mon 23:07] SCHEDULED: <2023-04-30 Sun>
Tests:
in tests/
vep -i 63004925-small.vcf -o postvep.vcf --vcf --fasta genomeRef.fna --dir 109 --merged --pick  --offline --custom ../script/spip_annotation.vcf.gz,SPIP,vcf,exact,0,spipInterp,spipScore,spipConfidence
***** DONE Score
CLOSED: [2023-04-22 Sat 15:30]
**** DONE CADD: replace with the VEP plugin
CLOSED: [2023-05-07 Sun 14:45] SCHEDULED: <2023-05-07 Sun>
***** Test
#+begin_src sh
vep  -i test.vcf  -o lol.vcf --offline --dir  /Work/Projects/bisonex/data/vep/GRCh38/ --merged --vcf --fasta /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna --plugin CADD,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.snv.tsv.gz,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.indel.tsv.gz  --dir_plugins ../VEP_plugins/ -v
#+end_src
Test
#+begin_src sh
vep --id "1  230710048 230710048 A/G 1"   --offline --dir  /Work/Projects/bisonex/data/vep/GRCh38/ --merged --vcf --fasta /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna --plugin CADD,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.snv.tsv.gz,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.indel.tsv.gz  --hgvsg --plugin pLI --plugin LOEUF -o lol
#+end_src
CSQ=G|missense_variant|MODERATE|AGT|ENSG00000135744|Transcript|ENST00000366667|protein_coding|2/5||||843|776|259|M/T|aTg/aCg|||-1||HGNC|HGNC:333||Ensembl||A|A||1:g.230710048A>G|0.347|-0.277922|
This matches https://www.ensembl.org/Homo_sapiens/Tools/VEP/Results?tl=I7ZsIbrj14P6lD43-9115494
***** DONE Use whole genome
CLOSED: [2023-04-29 Sat 15:46]
***** KILL Rename the chromosomes beforehand ...
CLOSED: [2023-05-01 Mon 09:14] SCHEDULED: <2023-04-30 Sun>
Too slow!
- Downloading CADD: 4h20
- renaming the chromosomes for the SNVs: 6h20
- tabix on the SNVs: job killed after 21h
***** DONE annotate separately and merge the tables
CLOSED: [2023-05-07 Sun 14:45] SCHEDULED: <2023-05-01 Mon>
NB: CADD could be filtered with tabix to restrict it to our variants
**** DONE clinvar
CLOSED: [2023-04-22 Sat 15:31]
**** KILL Check HGVS results with mutalyzer
CLOSED: [2023-05-01 Mon 09:26]
**** HOLD Parallelization
***** HOLD per chromosome with the VEP workflow
https://github.com/Ensembl/ensembl-vep/blob/release/109/nextflow/workflows/run_vep.nf
***** HOLD With the --fork option
**** DONE Use the nf-core version of VEP
CLOSED: [2023-05-13 Sat 18:27] SCHEDULED: <2023-05-07 Sun>
**** DONE OMIM
CLOSED: [2023-08-31 Thu 10:38] SCHEDULED: <2023-08-29 Tue>
**** DONE pLI and LOEUF from gnomAD
CLOSED: [2023-08-31 Thu 10:38] SCHEDULED: <2023-08-29 Tue>
**** DONE Grantham
CLOSED: [2023-08-31 Thu 22:08] SCHEDULED: <2023-08-30 Wed>
**** DONE Fix spliceAI
CLOSED: [2023-08-31 Thu 13:51] SCHEDULED: <2023-08-31 Thu>
No annotation
- chromosome naming? Tried 1 instead of chr1: same result. And it works for CADD
- index?
  -