apraga/org - Change 32N63XKSVRLHNPTCGWSTW4LNFZ2MQ7F5F3JQOMWZYU5FN4XP2GHQC

Bisonex + workout

Created by Alexis Praga on August 6, 2023

32N63XKSVRLHNPTCGWSTW4LNFZ2MQ7F5F3JQOMWZYU5FN4XP2GHQC

Dependencies

In channels

main

Change contents

Insertion in workout.org at line 6522 [4.1]

[3.363]

* <2023-08-06 Sun> Workout
** RTO
- 30-16-16
** Muscle-up
-  2+2 - 4neg
-  2+2 - 4neg
-  2+2 - 4neg
** Extension:
-  22
-  22
-  22
** FL tucked row :
-  3+2
-  3+2
-  3+2
** Pistols :
-  4
-  4
-  4
** Planche tucked push-up:
-  5+5
-  5+5
-  5+5
** Compression:
-  9
-  10
-  10
** Norwegian roll
-  4

Replacement in projects/bisonex.org at line 9 [6.35]

B:BD[5.16311] → [5.16311:16316]

B:BD[5.16316] → [2.16419:24606]


*** DONE Haplotypecaller
CLOSED: [2023-06-26 Mon 19:42] SCHEDULED: <2023-06-15 Thu>
*** DONE Get the technical variant filter working
CLOSED: [2023-08-03 Thu 14:24] SCHEDULED: <2023-08-03 Thu 10:30>
*** DONE VEP annotation alone
CLOSED: [2023-08-05 Sat 08:59] SCHEDULED: <2023-08-05 Sat>
T2T lacks:
- a merged version
- polyphen
- gnomAD
Spip annotation is disabled for now
*** PROJ [#A] Port Spip
*** TODO Generate the spip database
SCHEDULED: <2023-08-03 Thu 11:30>
**** PROJ Check transcriptome generation in hg38: checksum differs
- [X] Clean, then compare the RData files on hg38 with ediff: different
- [X] Otherwise, generate without cleaning: same result
**** TODO Retrieve NCBI RefSeq curated
.txt available on UCSC but not for T2T: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/
Format: https://genome.ucsc.edu/cgi-bin/hgTables?hgsid=1173061381_UepaHnvaOKFZKMOV4o7DtcNUHGVa&hgta_doSchemaDb=chlSab2&hgta_doSchemaTable=ncbiRefSeqCurated
Fields marked with "*" are the ones kept (presumably)
|  1 | bin          | Indexing field to speed chromosome range queries.                       |
|  2 | *name        | Name of gene (usually transcript_id from GTF)                           |
|  3 | *chrom       | Reference sequence chromosome or scaffold                               |
|  4 | *strand      | + or - for strand                                                       |
|  5 | *txStart     | Transcription start position (or end position for minus strand item)    |
|  6 | *txEnd       | Transcription end position (or start position for minus strand item)    |
|  7 | *cdsStart    | Coding region start (or end position for minus strand item)             |
|  8 | *cdsEnd      | Coding region end (or start position for minus strand item)             |
|  9 | *exonCount   | Number of exons                                                         |
| 10 | *exonStarts  | Exon start positions (or end positions for minus strand item)           |
| 11 | *exonEnds    | Exon end positions (or start positions for minus strand item)           |
| 12 | *score       | score                                                                   |
| 13 | *name2       | Alternate name (e.g. gene_id from GTF)                                  |
| 14 | cdsStartStat | Status of CDS start annotation (none, unknown, incomplete, or complete) |
| 15 | cdsEndStat   | Status of CDS end annotation (none, unknown, incomplete, or complete)   |
| 16 | exonFrames   | Exon frame {0,1,2}, or -1 if no frame for exon                          |
For T2T, only the bigBed format is available: https://hgdownload.soe.ucsc.edu/gbdb/hs1/ncbiRefSeq/
An executable to convert it to BED is available: http://hgdownload.soe.ucsc.edu/admin/exe/
On Gentoo, mit-krb5 must be installed (for libkrb5)
#+begin_src sh
./bigBedToBed ncbiRefSeqCurated.bb ncbiRefSeqCurated.bed
#+end_src
Do not forget the headers, since they are in a different order:
#chrom  chromStart      chromEnd        name    score   strand  thickStart      thickEnd        reserved        blockCount      blockSizes      chromStarts     name2   cdsStartStat    cdsEndStat      exonFrames      type    geneName               geneName2       geneType
*** PROJ [#A] VEP filter (with spip?)
** PROJ [#B] Quality indicators
*** Idea
Raredisease:
- FastQC: many statistics. Not available in Nix
- Mosdepth: computes depth (2x faster than samtools depth). Available in Nix
- MultiQC: only merges the analysis results. Not available in Nix
- Picard's CollectMultipleMetrics, CollectHsMetrics, and CollectWgsMetrics
- Qualimap: alternative to FastQC? Not available in Nix
- Sentieon's WgsMetricsAlgo: proprietary
- TIDDIT's cov: TIDDIT = chromosomal rearrangements
Sarek:
- alignment statistics: samtools stats, mosdepth
- QC: MultiQC
MultiQC: not available in Nix
** PROJ [#B] Execution report with MultiQC
** PROJ Check for normalization
** PROJ [#B] Check HGVS nomenclature with mutalyzer
** DONE Execution
CLOSED: [2022-09-13 Tue 21:37]
*** KILL test Bionix
*** KILL Implement execution with Nix?
See https://academic.oup.com/gigascience/article/9/11/giaa121/5987272?login=false
for an example.
Probably simpler to use Nix for environment management and snakemake for execution
No internet access from the cluster
*** DONE nextflow
CLOSED: [2022-09-13 Tue 21:37]
**** TODO SGE scheduler bug
The job gets killed because the user is not passed correctly to nextflow
***** DONE Force the user at run time
CLOSED: [2023-04-01 Sat 17:57]
NXF_OPTS=-D"user.name=alex"
***** DONE Check whether the problem persists with 22.10.6
CLOSED: [2023-04-01 Sat 18:38] SCHEDULED: <2023-04-01 Sat>
yes
***** KILL Package the user into the program?
Bad idea.
** TODO Preprocessing with nextflow
*** TODO Map to reference
**** TODO Sample ID in header
/Work/Users/apraga/bisonex/out/63003856_S135/preprocessing/baserecalibrator
*** DONE Mark duplicate
CLOSED: [2022-10-09 Sun 22:30]
*** DONE Recalibrate base quality score
CLOSED: [2022-10-09 Sun 22:30]
** DONE Variant calling with Nextflow
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Haplotype caller
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter variants
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter common snp not clinvar path
CLOSED: [2022-11-07 Mon 23:00]
See [[*common dbSNP not clinvar patho][common dbSNP not clinvar patho]]
*** DONE Filter variant only in consensual sequence
CLOSED: [2022-11-08 Tue 22:23]
*** DONE Filter technical variants
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Use AVX to speed up execution
CLOSED: [2023-04-29 Sat 15:46]
Without it, the following warning appears
#+begin_quote
17:28:00.720 INFO  PairHMM - OpenMP multi-threaded AVX-accelerated native PairHMM implementation is not supported
17:28:00.721 INFO  NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/nix/store/cy9ckxqwrkifx7wf02hm4ww1p6lnbxg9-gatk-4.2.4.1/bin/gatk-package-4.2.4.1-local.jar!/com/intel/gkl/native/libgkl_utils.so
17:28:00.733 WARN  NativeLibraryLoader - Unable to load libgkl_utils.so from native/libgkl_utils.so (/Work/Users/apraga/bisonex/out/NA12878_NIST7035/preprocessing/applybqsr/libgkl_utils821485189051585397.so: libgomp.so.1: cannot open shared object file: No such file or directory)
17:28:00.733 WARN  IntelPairHmm - Intel GKL Utils not loaded
17:28:00.733 WARN  PairHMM - ***WARNING: Machine does not have the AVX instruction set support needed for the accelerated AVX PairHmm. Falling back to the MUCH slower LOGLESS_CACHING implementation!
17:28:00.763 INFO  ProgressMeter - Starting traversal
#+end_quote
libgomp.so is provided by gcc, so the module must be loaded
 module load gcc@11.3.0/gcc-12.1.0
** KILL Use subworkflow
CLOSED: [2023-04-02 Sun 18:08]
Our version allows more flexibility
*** KILL Alignment
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
*** KILL Vep
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
vcf_annotate_ensemblvep
** TODO Annotation with nextflow :annotation:
*** KILL VEP: --gene-phenotype?
CLOSED: [2023-04-18 Tue 18:32]
Discussed with Alexis: the databases are out of date
https://www.ensembl.org/info/genome/variation/phenotype/sources_phenotype_documentation.html
*** DONE VEP plugin
CLOSED: [2023-04-18 Tue 18:32]
Clone the git repository containing the plugin
Then use --dir_plugins
*** HOLD Use Alexis's code
*** TODO New version with VEP
Example with --custom
https://www.ensembl.org/info/docs/tools/vep/script/vep_custom.html
**** DONE Add spliceAI
CLOSED: [2023-05-18 Thu 11:02] SCHEDULED: <2023-04-30 Sun>
VEP plugin
***** DONE Download the data
CLOSED: [2023-05-11 Thu 19:01]
Hard to automate: the link is temporary...
***** DONE Plugin
CLOSED: [2023-05-11 Thu 20:16]
***** DONE Split the score into several columns
CLOSED: [2023-05-11 Thu 20:16]
Test with this file to get one line with an annotation and one without
#CHROM	POS	ID	REF	ALT
1	9091	.	A	C
1	69091	.	A	C
and
#+begin_src sh
rm -f pos

[5.16311]

[2.24606]


*** DONE Haplotypecaller
CLOSED: [2023-06-26 Mon 19:42] SCHEDULED: <2023-06-15 Thu>
*** DONE Get the technical variant filter working
CLOSED: [2023-08-03 Thu 14:24] SCHEDULED: <2023-08-03 Thu 10:30>
*** DONE VEP annotation alone
CLOSED: [2023-08-05 Sat 08:59] SCHEDULED: <2023-08-05 Sat>
T2T lacks:
- a merged version
- polyphen
- gnomAD
Spip annotation is disabled for now
*** PROJ [#A] Port Spip
*** TODO Generate the spip database
SCHEDULED: <2023-08-03 Thu 11:30>
**** PROJ Check transcriptome generation in hg38: checksum differs
- [X] Clean, then compare the RData files on hg38 with ediff: different
- [X] Otherwise, generate without cleaning: same result
**** TODO Retrieve NCBI RefSeq curated
SCHEDULED: <2023-08-06 Sun>
.txt available on UCSC but not for T2T: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/
Format: https://genome.ucsc.edu/cgi-bin/hgTables?hgsid=1173061381_UepaHnvaOKFZKMOV4o7DtcNUHGVa&hgta_doSchemaDb=chlSab2&hgta_doSchemaTable=ncbiRefSeqCurated
Fields marked with "*" are the ones kept (presumably)
|  1 | bin          | Indexing field to speed chromosome range queries.                       |
|  2 | *name        | Name of gene (usually transcript_id from GTF)                           |
|  3 | *chrom       | Reference sequence chromosome or scaffold                               |
|  4 | *strand      | + or - for strand                                                       |
|  5 | *txStart     | Transcription start position (or end position for minus strand item)    |
|  6 | *txEnd       | Transcription end position (or start position for minus strand item)    |
|  7 | *cdsStart    | Coding region start (or end position for minus strand item)             |
|  8 | *cdsEnd      | Coding region end (or start position for minus strand item)             |
|  9 | *exonCount   | Number of exons                                                         |
| 10 | *exonStarts  | Exon start positions (or end positions for minus strand item)           |
| 11 | *exonEnds    | Exon end positions (or start positions for minus strand item)           |
| 12 | *score       | score                                                                   |
| 13 | *name2       | Alternate name (e.g. gene_id from GTF)                                  |
| 14 | cdsStartStat | Status of CDS start annotation (none, unknown, incomplete, or complete) |
| 15 | cdsEndStat   | Status of CDS end annotation (none, unknown, incomplete, or complete)   |
| 16 | exonFrames   | Exon frame {0,1,2}, or -1 if no frame for exon                          |
Example:
585     NR_046018.2     chr1    +       11873   14409   14409   14409   3       11873,12612,13220,      12227,12721,14409,      0       DDX11L1 none    none    -1,-1,-1,
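As a minimal sketch of how such a row maps onto the schema table above, the Python below names the 16 columns and keeps only the starred ones. The function names are illustrative (they do not come from spip), and splitting on whitespace is an assumption that holds for the example row; the real UCSC dump is tab-separated.

```python
# Column names copied from the UCSC schema table above (16 fields).
GENEPRED_COLUMNS = [
    "bin", "name", "chrom", "strand", "txStart", "txEnd", "cdsStart",
    "cdsEnd", "exonCount", "exonStarts", "exonEnds", "score", "name2",
    "cdsStartStat", "cdsEndStat", "exonFrames",
]
# Fields marked with "*" in the table: columns 2 to 13.
KEPT = GENEPRED_COLUMNS[1:13]


def parse_int_list(text):
    """Turn a UCSC comma-terminated list such as '1,2,3,' into [1, 2, 3]."""
    return [int(x) for x in text.split(",") if x]


def parse_genepred_row(line):
    """Split one row (hypothetical helper) and keep only the starred fields."""
    values = line.split()
    if len(values) != len(GENEPRED_COLUMNS):
        raise ValueError(f"expected 16 fields, got {len(values)}")
    row = dict(zip(GENEPRED_COLUMNS, values))
    for key in ("txStart", "txEnd", "cdsStart", "cdsEnd", "exonCount", "score"):
        row[key] = int(row[key])
    for key in ("exonStarts", "exonEnds"):
        row[key] = parse_int_list(row[key])
    return {key: row[key] for key in KEPT}


# The example row quoted in the notes.
EXAMPLE = ("585 NR_046018.2 chr1 + 11873 14409 14409 14409 3 "
           "11873,12612,13220, 12227,12721,14409, 0 DDX11L1 none none -1,-1,-1,")
record = parse_genepred_row(EXAMPLE)
print(record["name"], record["exonCount"], record["exonStarts"])
```

Parsing the exon lists into integers makes it easy to check that exonCount agrees with the number of exon starts.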
For T2T, only the bigBed format is available: https://hgdownload.soe.ucsc.edu/gbdb/hs1/ncbiRefSeq/
An executable to convert it to BED is available: http://hgdownload.soe.ucsc.edu/admin/exe/
On Gentoo, mit-krb5 must be installed (for libkrb5)
#+begin_src sh
./bigBedToBed ncbiRefSeqCurated.bb ncbiRefSeqCurated.bed
#+end_src
Example:
chr1    7505    13582   NR_182076.1     0       -       13582   13582   0       2       5477,138,       0,5939, LOC127239154    none    none    -1,-1,          NR_182076.1     LOC127239154
In R:
   V1        V2   V3 V4    V5    V6    V7    V8 V9                V10
1 585 NR_046018 chr1  + 11873 14409 14409 14409  3 11873,12612,13220,
                 V11 V12     V13  V14  V15       V16           V17         V18
1 12227,12721,14409,   0 DDX11L1 none none -1,-1,-1, 354,109,1189, 0,739,1347,
Do not forget the headers, since they are in a different order:
 1 #chrom
 2 chromStart
 3 chromEnd
 4 name
 5 score
 6 strand
 7 thickStart
 8 thickEnd
 9 reserved
10 blockCount
11 blockSizes
12 chromStarts
13 name2
14 cdsStartStat
15 cdsEndStat
16 exonFrames
17 type
18 geneName
19 geneName2
20 geneType
Columns in GRCh38 =
3, 5, 6, 2, 12, 4, 7,  8, 12, 9, 17, 18, 13
Corresponding columns in T2T
1, 7, 8, 4, 5,  6, 14, 15, 5, ?,  ?,  ?, 13
In fact, it is enough to have
- the gene
- the transcript start
- the transcript end
- the strand
  to generate
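For reference, a hedged sketch of how the bigBed-derived row could be turned back into absolute exon coordinates. It relies only on the standard BED12 convention that blockSizes and chromStarts (columns 11 and 12 of the header list above) are relative to chromStart. It is not the column mapping noted above, and the function name and output keys are illustrative. The sample row reuses the first 16 fields of the T2T example, joined with tabs.

```python
def parse_int_list(text):
    """Turn a comma-terminated list such as '5477,138,' into [5477, 138]."""
    return [int(x) for x in text.split(",") if x]


def bed12_to_transcript(line):
    """Rebuild gene, strand, transcript bounds and exons from one BED12+ row."""
    fields = line.rstrip("\n").split("\t")
    chrom, chrom_start, chrom_end, name, _score, strand = fields[:6]
    tx_start, tx_end = int(chrom_start), int(chrom_end)
    block_count = int(fields[9])
    block_sizes = parse_int_list(fields[10])
    block_offsets = parse_int_list(fields[11])
    if not (block_count == len(block_sizes) == len(block_offsets)):
        raise ValueError("blockCount does not match the block lists")
    # Block offsets are relative to chromStart in BED12.
    exon_starts = [tx_start + offset for offset in block_offsets]
    exon_ends = [start + size for start, size in zip(exon_starts, block_sizes)]
    return {
        "name": name,
        "gene": fields[12],  # name2 column
        "chrom": chrom,
        "strand": strand,
        "txStart": tx_start,
        "txEnd": tx_end,
        "exonCount": block_count,
        "exonStarts": exon_starts,
        "exonEnds": exon_ends,
    }


# First 16 fields of the T2T example row from the notes.
ROW = "\t".join([
    "chr1", "7505", "13582", "NR_182076.1", "0", "-", "13582", "13582", "0",
    "2", "5477,138,", "0,5939,", "LOC127239154", "none", "none", "-1,-1,",
])
tx = bed12_to_transcript(ROW)
print(tx["gene"], tx["strand"], tx["exonStarts"], tx["exonEnds"])
```

On this row the last exon end equals chromEnd, which is a quick sanity check that the offsets were interpreted as relative positions.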
*** TODO Test a partial mapping?
no CDS and no columns 17 and 18
only columns 10, 11, 12 (in the new dataframe) cause problems (9, 17, 18 in the old one)
NB: the exon count (column 9) can be recovered from the lons
*** PROJ [#A] VEP filter (with spip?)
** PROJ [#B] Quality indicators
*** Idea
Raredisease:
- FastQC: many statistics. Not available in Nix
- Mosdepth: computes depth (2x faster than samtools depth). Available in Nix
- MultiQC: only merges the analysis results. Not available in Nix
- Picard's CollectMultipleMetrics, CollectHsMetrics, and CollectWgsMetrics
- Qualimap: alternative to FastQC? Not available in Nix
- Sentieon's WgsMetricsAlgo: proprietary
- TIDDIT's cov: TIDDIT = chromosomal rearrangements
Sarek:
- alignment statistics: samtools stats, mosdepth
- QC: MultiQC
MultiQC: not available in Nix
** PROJ [#B] Execution report with MultiQC
** PROJ Check for normalization
** PROJ [#B] Check HGVS nomenclature with mutalyzer
** DONE Execution
CLOSED: [2022-09-13 Tue 21:37]
*** KILL test Bionix
*** KILL Implement execution with Nix?
See https://academic.oup.com/gigascience/article/9/11/giaa121/5987272?login=false
for an example.
Probably simpler to use Nix for environment management and snakemake for execution
No internet access from the cluster
*** DONE nextflow
CLOSED: [2022-09-13 Tue 21:37]
**** TODO SGE scheduler bug
The job gets killed because the user is not passed correctly to nextflow
***** DONE Force the user at run time
CLOSED: [2023-04-01 Sat 17:57]
NXF_OPTS=-D"user.name=alex"
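A small sketch of how the same setting could be applied from a launcher script rather than typed by hand. It only builds the environment; the helper name is hypothetical, and the `main.nf` script mentioned in the comment is a placeholder, not a file from this project.

```python
import getpass
import os


def nextflow_env(user=None, base_env=None):
    """Return a copy of the environment with -Duser.name added to NXF_OPTS."""
    env = dict(os.environ if base_env is None else base_env)
    user = user or getpass.getuser()
    option = f"-Duser.name={user}"
    existing = env.get("NXF_OPTS", "")
    # Keep any JVM options already present and avoid adding the flag twice.
    if option not in existing.split():
        env["NXF_OPTS"] = f"{existing} {option}".strip()
    return env


env = nextflow_env(user="alex", base_env={"PATH": "/usr/bin"})
print(env["NXF_OPTS"])
# The dictionary could then be handed to a launcher, for example:
# subprocess.run(["nextflow", "run", "main.nf"], env=env)  # placeholder script
```

Appending to an existing NXF_OPTS rather than overwriting it keeps any JVM options already configured on the cluster.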
***** DONE Check whether the problem persists with 22.10.6
CLOSED: [2023-04-01 Sat 18:38] SCHEDULED: <2023-04-01 Sat>
yes
***** KILL Package the user into the program?
Bad idea.
** TODO Preprocessing with nextflow
*** TODO Map to reference
**** TODO Sample ID in header
/Work/Users/apraga/bisonex/out/63003856_S135/preprocessing/baserecalibrator
*** DONE Mark duplicate
CLOSED: [2022-10-09 Sun 22:30]
*** DONE Recalibrate base quality score
CLOSED: [2022-10-09 Sun 22:30]
** DONE Variant calling with Nextflow
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Haplotype caller
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter variants
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter common snp not clinvar path
CLOSED: [2022-11-07 Mon 23:00]
See [[*common dbSNP not clinvar patho][common dbSNP not clinvar patho]]
*** DONE Filter variant only in consensual sequence
CLOSED: [2022-11-08 Tue 22:23]
*** DONE Filter technical variants
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Use AVX to speed up execution
CLOSED: [2023-04-29 Sat 15:46]
Without it, the following warning appears
#+begin_quote
17:28:00.720 INFO  PairHMM - OpenMP multi-threaded AVX-accelerated native PairHMM implementation is not supported
17:28:00.721 INFO  NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/nix/store/cy9ckxqwrkifx7wf02hm4ww1p6lnbxg9-gatk-4.2.4.1/bin/gatk-package-4.2.4.1-local.jar!/com/intel/gkl/native/libgkl_utils.so
17:28:00.733 WARN  NativeLibraryLoader - Unable to load libgkl_utils.so from native/libgkl_utils.so (/Work/Users/apraga/bisonex/out/NA12878_NIST7035/preprocessing/applybqsr/libgkl_utils821485189051585397.so: libgomp.so.1: cannot open shared object file: No such file or directory)
17:28:00.733 WARN  IntelPairHmm - Intel GKL Utils not loaded
17:28:00.733 WARN  PairHMM - ***WARNING: Machine does not have the AVX instruction set support needed for the accelerated AVX PairHmm. Falling back to the MUCH slower LOGLESS_CACHING implementation!
17:28:00.763 INFO  ProgressMeter - Starting traversal
#+end_quote
libgomp.so is provided by gcc, so the module must be loaded
 module load gcc@11.3.0/gcc-12.1.0
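A quick way to confirm that the module load had the intended effect is to try loading the library that the GATK log complains about. This is only a diagnostic sketch with a hypothetical helper name; it checks whether the dynamic loader can resolve `libgomp.so.1` in the current environment and says nothing about AVX support itself.

```python
import ctypes


def shared_library_available(soname):
    """Return True if the dynamic loader can open the given shared library."""
    try:
        ctypes.CDLL(soname)
    except OSError:
        # Raised when the library cannot be found or loaded.
        return False
    return True


# libgomp.so.1 is the file named in the NativeLibraryLoader warning above.
print("libgomp.so.1 loadable:", shared_library_available("libgomp.so.1"))
```

Running it before and after loading the gcc module shows whether the library path changed as expected.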
** KILL Use subworkflow
CLOSED: [2023-04-02 Sun 18:08]
Our version allows more flexibility
*** KILL Alignment
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
*** KILL Vep
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
vcf_annotate_ensemblvep
** TODO Annotation with nextflow :annotation:
*** KILL VEP: --gene-phenotype?
CLOSED: [2023-04-18 Tue 18:32]
Discussed with Alexis: the databases are out of date
https://www.ensembl.org/info/genome/variation/phenotype/sources_phenotype_documentation.html
*** DONE VEP plugin
CLOSED: [2023-04-18 Tue 18:32]
Clone the git repository containing the plugin
Then use --dir_plugins
*** HOLD Use Alexis's code
*** TODO New version with VEP
Example with --custom
https://www.ensembl.org/info/docs/tools/vep/script/vep_custom.html
**** DONE Add spliceAI
CLOSED: [2023-05-18 Thu 11:02] SCHEDULED: <2023-04-30 Sun>
VEP plugin
***** DONE Download the data
CLOSED: [2023-05-11 Thu 19:01]
Hard to automate: the link is temporary...
***** DONE Plugin
CLOSED: [2023-05-11 Thu 20:16]
***** DONE Split the score into several columns
CLOSED: [2023-05-11 Thu 20:16]
Test with this file to get one line with an annotation and one without
#CHROM	POS	ID	REF	ALT
1	9091	.	A	C
1	69091	.	A	C
and
#+begin_src sh
rm -f pos