* <2023-01-09 Mon> Parkour
Unlocked tic-tac + wall landing, kong
* <2023-01-14 Sat> Tricking
Worked on combo: front, tornado, b-kick
** Validation
*** GIAB: NA12878
Article comparing variant callers: https://www.biorxiv.org/content/10.1101/2020.12.11.422022v1.full.pdf
Article for vcfeval: https://www.nature.com/articles/s41587-019-0054-x
1. Run the pipeline on the GIAB reference data:
   - method: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/analysis/Illumina_PlatinumGenomes_NA12877_NA12878_09162015/IlluminaPlatinumGenomes-user-guide.pdf
   - VCF: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/analysis/Illumina_PlatinumGenomes_NA12877_NA12878_09162015/hg38/2.0.1/NA12878/
   Then compare our VCF against the "high confidence" calls (article: https://www.nature.com/articles/s41587-019-0054-x).
2. Sequence NA12878 directly -> not useful for validating the pipeline alone.
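The comparison against the high-confidence calls could be sketched with rtg vcfeval (a sketch only: file names and the SDF reference are placeholders, and rtg-tools is assumed to be installed):
#+begin_src sh
# One-off: build the reference in RTG's SDF format
rtg format -o GRCh38.sdf GRCh38.fa

# Compare our calls with the GIAB high-confidence VCF
rtg vcfeval \
  --baseline NA12878_high_confidence.vcf.gz \
  --calls our_calls.vcf.gz \
  --template GRCh38.sdf \
  --output vcfeval_out
# vcfeval_out/summary.txt then reports precision, recall and F-measure
#+end_src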
**** TODO Re-run with the prod version + nix
Variant counts differ slightly between the two runs:
#+begin_example
/Work/Projects/bisonex/ref-63003856_S135〉ls *.vcf* | insert nblines {|e| (^zgrep -v '^#' $e.name | wc -l)} | select name nblines
╭───┬───────────────────────────────────────────────────────────────────────────┬─────────╮
│ # │ name │ nblines │
├───┼───────────────────────────────────────────────────────────────────────────┼─────────┤
│ 0 │ 63003856_S135_DP_over_30.vcf.gz │ 84708 │
│ 1 │ 63003856_S135_DP_over_30.vcf.gz.tbi │ 1 │
│ 2 │ 63003856_S135_DP_over_30_not_SNP.recode.vcf │ 11362 │
│ 3 │ 63003856_S135_DP_over_30_not_SNP_consensual_sequence.vcf │ 8864 │
│ 4 │ 63003856_S135_DP_over_30_not_SNP_consensual_sequence_not_technical.vcf.gz │ 6478 │
╰───┴───────────────────────────────────────────────────────────────────────────┴─────────╯
/Work/Users/apraga/bisonex/script/files/tmp_63003856_S135〉ls *.vcf* | insert nblines {|e| (^zgrep -v '^#' $e.name | wc -l)} | select name nblines
╭───┬───────────────────────────────────────────────────────────────────────────┬─────────╮
│ # │ name │ nblines │
├───┼───────────────────────────────────────────────────────────────────────────┼─────────┤
│ 0 │ 63003856_S135_DP_over_30.vcf │ 84724 │
│ 1 │ 63003856_S135_DP_over_30_not_SNP.recode.vcf │ 11377 │
│ 2 │ 63003856_S135_DP_over_30_not_SNP_consensual_sequence.vcf │ 8884 │
│ 3 │ 63003856_S135_DP_over_30_not_SNP_consensual_sequence_not_technical.vcf.gz │ 6759 │
╰───┴───────────────────────────────────────────────────────────────────────────┴─────────╯
#+end_example
**** TODO GATK: non-regression tests??
**** TODO Compile GATK
***** DONE Compilation
CLOSED: [2023-01-14 Sat 22:45]
Requirements:
- Java 8 (JDK)
- git lfs
#+begin_src sh
# Enter a shell with the build dependencies
nix-shell -p jdk8 git-lfs
git clone https://github.com/broadinstitute/gatk
cd gatk
git checkout tags/4.2.4.0 -b 4.2.4.0
# We need the datasets for testing (otherwise it fails)
git lfs install
git lfs pull --include src/main/resources/large
./gradlew bundle
#+end_src
***** WAIT Check the tests
#+begin_src sh
./gradlew test
#+end_src
#+begin_example
267064 tests completed, 444 failed, 1966 skipped
> There were failing tests. See the report at: file:///Home/Users/apraga/gatk/build/reports/tests/test/index.html
BUILD FAILED in 37m 1s
6 actionable tasks: 1 executed, 5 up-to-date
#+end_example
The failures are related to Spark and to Kerberos authentication errors.
***** TODO Run the computation: at home
#+begin_src sh
JAVA_HOME=/nix/store/r1r5jr7gv6hcchpiggjmfqjkzbi8y5ja-openjdk-8u322-ga/lib/openjdk PATH=$PATH:$JAVA_HOME/bin ./gatk --version
#+end_src
**** DONE Prod version at home
CLOSED: [2023-01-15 Sun 23:22]
Preprocessing + variant calling, without alignment.
***** DONE GATK nix
CLOSED: [2023-01-15 Sun 23:22]
#+begin_example
❯ samtools view -c 63003856_S135_marked_dup.bam
128077211
#+end_example
We get the same read count.
***** DONE Compiled GATK: same result
CLOSED: [2023-01-15 Sun 23:22]
#+begin_example
script/files-src/tmp_63003856_S135 on prod [$!?]
❯ samtools view -c 63003856_S135_marked_dup.bam
128077211
#+end_example
* Improvements
** TODO Update with Picard
Normally GATK bundles Picard, but the latest documentation uses Picard directly for some tools:
https://gatk.broadinstitute.org/hc/en-us/articles/9570266920219--Tool-Documentation-Index
To be completed after validation.
*** TODO markduplicates
The latest documentation uses Picard for this tool!
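A minimal sketch of the standalone Picard invocation (file names are placeholders; assumes picard.jar is available):
#+begin_src sh
# Mark duplicate reads with standalone Picard instead of the GATK wrapper
java -jar picard.jar MarkDuplicates \
  -I 63003856_S135_sorted.bam \
  -O 63003856_S135_marked_dup.bam \
  -M 63003856_S135_dup_metrics.txt
#+end_src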
** TODO Bwa-mem2 instead of bwa mem
https://github.com/bwa-mem2/bwa-mem2
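bwa-mem2 is a drop-in replacement for bwa mem; the switch could look like this (a sketch: reference, read files and thread count are placeholders):
#+begin_src sh
# Index once (bwa-mem2 indexes are not compatible with bwa's)
bwa-mem2 index GRCh38.fa
# Align exactly as with bwa mem, then sort
bwa-mem2 mem -t 8 GRCh38.fa reads_R1.fastq.gz reads_R2.fastq.gz \
  | samtools sort -o aligned_sorted.bam -
#+end_src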
** TODO Parallelize HaplotypeCaller
Spark is in beta, do not use it.
Poor man's parallelization: restrict to one chromosome with -L and parallelize over the number of chromosomes.
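The per-chromosome approach could be sketched as follows (a sketch: reference, input BAM and chromosome list are placeholders):
#+begin_src sh
# One HaplotypeCaller job per chromosome, run in the background
for chr in chr{1..22} chrX chrY; do
  gatk HaplotypeCaller \
    -R GRCh38.fa \
    -I 63003856_S135_marked_dup.bam \
    -L "$chr" \
    -O "calls_${chr}.vcf.gz" &
done
wait
# Merge the per-chromosome VCFs (GatherVcfs needs inputs in reference order)
gatk GatherVcfs \
  $(for chr in chr{1..22} chrX chrY; do echo "-I calls_${chr}.vcf.gz"; done) \
  -O calls.vcf.gz
#+end_src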
** KILL CRAM instead of BAM?
CLOSED: [2022-12-30 Fri 20:38]
Compressed version of BAM, but:
#+begin_quote
All GATK tools that take in mapped read data expect a BAM file as primary format. Some support the CRAM format, but we have observed performance issues when working directly from CRAM files, so in our own work we convert CRAM to BAM first, and we only use CRAM for archival purposes
#+end_quote
Source: https://gatk.broadinstitute.org/hc/en-us/articles/360035890791-SAM-or-BAM-or-CRAM-Mapped-sequence-data-formats
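Following that recommendation, the archival conversion both ways could be sketched as (file names are placeholders; CRAM requires the exact reference FASTA used for alignment):
#+begin_src sh
# BAM -> CRAM for archival (reference required)
samtools view -T GRCh38.fa -C -o archive.cram 63003856_S135_marked_dup.bam
# CRAM -> BAM before running GATK tools
samtools view -T GRCh38.fa -b -o restored.bam archive.cram
#+end_src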
* HOLD Implement other pipelines
See https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04407-x
** KILL GATK
CLOSED: [2022-11-11 Fri 20:01]
https://broadinstitute.github.io/warp/docs/Pipelines/Exome_Germline_Single_Sample_Pipeline/README
A priori, it follows the best practices.
** KILL Try snakemake with best practices
https://github.com/snakemake-workflows/dna-seq-gatk-variant-calling/blob/main/.github/workflows/main.yml
Install Mamba (micromamba does not work under nix).
Does not work under WSL2... MultiQC is not up to date enough.
Version problems...
** KILL Sarek
CLOSED: [2022-12-11 Sun 11:09]
*** Dependencies
**** Nix
#+begin_src sh
nix profile install nixpkgs#mosdepth nixpkgs#python3
nix-shell -p python310Packages.pyyaml --run "nextflow run nf-core/sarek -profile test --executor slurm --queue smp --outdir test -resume"
#+end_src
***** KILL Nix derivation for the full profile
CLOSED: [2022-12-11 Sun 11:09]
**** KILL Without nix
CLOSED: [2022-09-24 Sat 10:20]
We use conda:
#+begin_src sh
module unload nix
module load anaconda3@2021.05/gcc-12.1.0
module load nextflow@22.04.0/gcc-12.1.0
module load openjdk@11.0.14.1_1/gcc-12.1.0
nextflow run nf-core/sarek -profile conda,test --executor slurm --queue smp --outdir test -resume
#+end_src
Attempt 1: permission errors, fixed by rerunning the program.
#+begin_quote
Failed to create Conda environment
command: conda create --mkdir --yes --quiet --prefix /Work/Users/apraga/test-sarek/work/conda/env-2d53b1db50de676670cf1a91ef0cf6db bioconda::tabix=1.11
status : 1
message:
NotWritableError: The current user does not have write permissions to a required path.
path: /Home/Users/apraga/.conda/pkgs/urls.txt
uid: 1696
gid: 513
If you feel that permissions on this path are set incorrectly, you can manually
change them by executing
$ sudo chown 1696:513 /Home/Users/apraga/.conda/pkgs/urls.txt
#+end_quote
Fixed with:
#+begin_src sh
chown 1696:513 /Home/Users/apraga/.conda/pkgs/urls.txt
#+end_src
But then a proxy problem.
*** KILL Nix derivation for Python modules
CLOSED: [2022-12-11 Sun 11:09]
*** KILL Run sarek in test mode
CLOSED: [2022-12-11 Sun 11:09]
#+begin_src sh
nix-shell -p python310Packages.pyyaml --run "nextflow run nf-core/sarek -profile test --executor slurm --queue smp --outdir test -resume"
#+end_src
*** KILL Run sarek on downsampled data
CLOSED: [2022-12-11 Sun 11:09]