apraga/org - Change RUR2HQCDDUOIP7GD2GZV2UPCUF5QS6COVIVDKVIGZF22TVRFBKPAC

Bisonex update

Created by Alexis Praga on July 18, 2023

RUR2HQCDDUOIP7GD2GZV2UPCUF5QS6COVIVDKVIGZF22TVRFBKPAC

Dependencies

In channels

main

Change contents

Replacement in projects/bisonex.org at line 1 [3.35]

B:BD[3.35] → [4.33532:41746]

∅:D[4.41746] → [5.29:8221]

B:BD[6.8203] → [5.29:8221]

B:BD[5.8221] → [7.29:8199]

B:BD[7.8199] → [8.29:8221]

#+title: Bisonex
#+FILETAGS: @bisonex
* Biblio :biblio:
** Workflow
Comparison of WDL, Cromwell, and Nextflow
https://www.nature.com/articles/s41598-021-99288-8
Nextflow = a good compromise?
Comparison of aligners and variant callers (2021)
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04144-1
** Pipeline steps
*** Variant calling: Haplotype caller
https://gatk.broadinstitute.org/hc/en-us/articles/360035531412
Defines the algorithm, with a figure
*** Phred score
https://gatk.broadinstitute.org/hc/en-us/articles/360035531872-Phred-scaled-quality-scores
** VCF
*** GT genotype
encoded as alleles values separated by either of ”/” or “|”, e.g. The allele values are 0 for the reference allele (what is in the reference sequence), 1 for the first allele listed in ALT, 2 for the second allele list in ALT and so on. For diploid calls examples could be 0/1 or 1|0 etc. For haploid calls, e.g. on Y, male X, mitochondrion, only one allele value should be given. All samples must have GT call information; if a call cannot be made for a sample at a given locus, ”.” must be specified for each missing allele in the GT field (for example ./. for a diploid). The meanings of the separators are:
    / : genotype unphased
    | : genotype phased
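To make the GT encoding quoted above concrete, here is a minimal Python sketch. It is not part of the pipeline, and the function name =parse_gt= is made up for this note. It splits a GT value into allele indices and a phased flag.
#+begin_src python
import re


def parse_gt(gt):
    """Split a VCF GT value into (alleles, phased).

    Alleles are ints (0 = REF, 1 = first ALT, ...) or None for a
    missing allele ("."). `phased` is True when "|" is the separator.
    """
    alleles = [None if a == "." else int(a) for a in re.split(r"[/|]", gt)]
    return alleles, "|" in gt
#+end_src
For example, =parse_gt("1|0")= gives the alleles 1 and 0 with the phased flag set, while =parse_gt("./.")= gives two missing alleles.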
** Validation
*** NA12878
**** KILL [[https://precision.fda.gov/challenges/truth/results][fdaPrecision challenge]]
Caution: whole genome, and in hg19, so the comparison is not suitable.
**** TODO Best practices for the analytical validation of clinical whole-genome sequencing intended for the diagnosis of germline disease
https://www.nature.com/articles/s41525-020-00154-9
General recommendations for whole genomes, without raw data
**** TODO [#A] Performance assessment of variant calling pipelines using human whole exome sequencing and simulated data
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2928-9
1. variant calling only
2. NA12878 + simulated data
3. exome
4. evaluated with the F-score
Code available: https://github.com/bharani-lab/WES-Benchmarking-Pipeline_Manoj/tree/master/Script
Result: BWA/Novoalign_DeepVariant
Aligners
- BWA-MEM 0.7.16
- Bowtie2 2.2.6
- Novoalign 3.08.02
- SOAP 2.21
- MOSAIK 2.2.3
Variant callers
- GATK HaplotypeCaller 4
- FreeBayes 1.1.0
- SAMtools mpileup 1.7
- DeepVariant r0.4
SNV
| Exome | Pipeline |    TP |   FP |  FN | Sensitivity | Precision | F-Score |   FDR |
|-------+----------+-------+------+-----+-------------+-----------+---------+-------|
|     1 | BWA_GATK | 23689 | 1397 | 613 |       0.975 |     0.944 |   0.959 | 0.057 |
|     2 | BWA_GATK | 23946 |  865 | 356 |       0.985 |     0.965 |   0.975 | 0.036 |
Indel
|   TP | FP | FN | Sensitivity | Precision | F-Score |   FDR |
|------+----+----+-------------+-----------+---------+-------|
| 1254 | 72 | 75 |       0.944 |     0.946 |   0.945 | 0.054 |
| 1309 | 10 | 20 |       0.985 |     0.992 |   0.989 | 0.008 |
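As a sanity check on the tables above, sensitivity, precision and F-score can be recomputed from the TP/FP/FN counts. This is a minimal Python sketch that assumes the standard definitions; the function name =metrics= is made up for this note. The FDR column is left out because its exact definition in the article was not checked here.
#+begin_src python
def metrics(tp, fp, fn):
    """Return (sensitivity, precision, F-score), rounded to 3 decimals."""
    sensitivity = tp / (tp + fn)  # fraction of true variants that were called
    precision = tp / (tp + fp)    # fraction of calls that are true variants
    # F-score = harmonic mean of precision and sensitivity
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    return round(sensitivity, 3), round(precision, 3), round(f_score, 3)
#+end_src
With the counts of the first SNV row, =metrics(23689, 1397, 613)= reproduces the three values reported in the table.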
Raw values:
https://static-content.springer.com/esm/art%3A10.1186%2Fs12859-019-2928-9/MediaObjects/12859_2019_2928_MOESM8_ESM.pdf
Other articles with the same exome comparison on NA12878
- Hwang et al., 2015
- Highnam et al., 2015
- Cornish and Guda, 2015
Variant type
|                                           | SNVs & Indels | CNVs (>10Kb) | SVs | Mitochondrial variants | Pseudogenes | REs | Somatic/mosaic | Literature/Data    | Source   |
|-------------------------------------------+---------------+--------------+-----+------------------------+-------------+-----+----------------+--------------------+----------|
| NA12878                                   | 100% (a)      |          40% |   0 |                      0 |           0 |   0 |              0 | Zook et al. (18)   | NIST     |
| Other NIST standard (e.g. AJ/Asian trios) | 71%           |          40% | 50% |                      0 |           0 |   0 |              0 | Zook et al. (18)   |          |
| Platinum Genomes                          | 29%           |            0 |   0 |                      0 |           0 |   0 |              0 | Eberle et al. (8)  | Platinum |
| Venter/HuRef                              | 14%           |          40% |   0 |                      0 |           0 |   0 |              0 | Trost et al. (1)   | HuRef    |
**** Systematic comparison of germline variant calling pipelines cross multiple next-generation sequencers
#+begin_src bibtex
@ARTICLE{Chen2019-fp,
  title     = "Systematic comparison of germline variant calling pipelines
               cross multiple next-generation sequencers",
  author    = "Chen, Jiayun and Li, Xingsong and Zhong, Hongbin and Meng,
               Yuhuan and Du, Hongli",
  abstract  = "The development and innovation of next generation sequencing
               (NGS) and the subsequent analysis tools have gain popularity in
               scientific researches and clinical diagnostic applications.
               Hence, a systematic comparison of the sequencing platforms and
               variant calling pipelines could provide significant guidance to
               NGS-based scientific and clinical genomics. In this study, we
               compared the performance, concordance and operating efficiency
               of 27 combinations of sequencing platforms and variant calling
               pipelines, testing three variant calling pipelines-Genome
               Analysis Tool Kit HaplotypeCaller, Strelka2 and
               Samtools-Varscan2 for nine data sets for the NA12878 genome
               sequenced by different platforms including BGISEQ500,
               MGISEQ2000, HiSeq4000, NovaSeq and HiSeq Xten. For the variants
               calling performance of 12 combinations in WES datasets, all
               combinations displayed good performance in calling SNPs, with
               their F-scores entirely higher than 0.96, and their performance
               in calling INDELs varies from 0.75 to 0.91. And all 15
               combinations in WGS datasets also manifested good performance,
               with F-scores in calling SNPs were entirely higher than 0.975
               and their performance in calling INDELs varies from 0.71 to
               0.93. All of these combinations manifested high concordance in
               variant identification, while the divergence of variants
               identification in WGS datasets were larger than that in WES
               datasets. We also down-sampled the original WES and WGS datasets
               at a series of gradient coverage across multiple platforms, then
               the variants calling period consumed by the three pipelines at
               each coverage were counted, respectively. For the GIAB datasets
               on both BGI and Illumina platforms, Strelka2 manifested its
               ultra-performance in detecting accuracy and processing
               efficiency compared with other two pipelines on each sequencing
               platform, which was recommended in the further promotion and
               application of next generation sequencing technology. The
               results of our researches will provide useful and comprehensive
               guidelines for personal or organizational researchers in
               reliable and consistent variants identification.",
  journal   = "Sci. Rep.",
  publisher = "Springer Science and Business Media LLC",
  volume    =  9,
  number    =  1,
  pages     = "9345",
  month     =  jun,
  year      =  2019,
  copyright = "https://creativecommons.org/licenses/by/4.0",
  language  = "en"
}
#+end_src
Comparison of several pipelines (2019)
https://www.nature.com/articles/s41598-019-45835-3
Combinations
- variant calling = GATK, Strelka2 and Samtools-Varscan2
- on NA12878
- sequenced on BGISEQ500, MGISEQ2000, HiSeq4000, NovaSeq and HiSeq Xten.
Conclusion: Strelka2 comes out best, but is there a bias on NA12878?
Illumina > BGI for indels, probably because the reads are longer
#+begin_quote
 For WES datasets, the BGI platforms displayed the superior performance in SNPs
 calling while Illumina platforms manifested the better variants calling
 performance in INDELs calling, which could be explained by their divergence in
 sequencing strategy that producing different length of reads (all BGI platforms
 were 100 base pair read length while all Illumina platforms were 150 base pair
 read length). The read length effects, as a key factor between two platforms,
 would bring alignment bias and error which are higher for short reads and
 ultimately affect the variants calling especially the INDELs identification
#+end_quote
*** Debugging variant calling (HaplotypeCaller)
https://gatk.broadinstitute.org/hc/en-us/articles/360043491652-When-HaplotypeCaller-and-Mutect2-do-not-call-an-expected-variant
https://gatk.broadinstitute.org/hc/en-us/articles/360035891111-Expected-variant-at-a-specific-site-was-not-called
*** Hap.py
Output format:
#+begin_src r
vcf_field_names(vcf, tag = "FORMAT")
#+end_src
#+RESULTS:
: FORMAT BD    1      String  Decision for call (TP/FP/FN/N)
: FORMAT BK    1      String  Sub-type for decision (match/mismatch type)
: FORMAT BVT   1      String  High-level variant type (SNP|INDEL).
: FORMAT BLT   1      String  High-level location type (het|homref|hetalt|homa
- am = genotype mismatch
- lm = allele/haplotype mismatch
- . = not seen
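To read these codes from a hap.py output record, the FORMAT column and the sample column can be zipped together. This is a minimal Python sketch. The name =decode_call= is hypothetical, and only the three BK codes listed in these notes are mapped; any other code is returned unchanged.
#+begin_src python
# BK sub-types documented in these notes; other codes are passed through.
BK_MEANING = {
    "am": "genotype mismatch",
    "lm": "allele/haplotype mismatch",
    ".": "not seen",
}


def decode_call(format_keys, sample_value):
    """Return (BD decision, readable BK sub-type) for one sample column."""
    fields = dict(zip(format_keys.split(":"), sample_value.split(":")))
    bk = fields.get("BK", ".")
    return fields.get("BD"), BK_MEANING.get(bk, bk)
#+end_src
For instance, a sample column =1/1:FP:am= under the FORMAT keys =GT:BD:BK= decodes to a false positive caused by a genotype mismatch.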
**** Checking that am = genotype mismatch
reference = T/T
high-confidence = T/C
ours = C/C
#+begin_src sh
bcftools filter -i 'POS=19196584'  /Work/Groups/bisonex/data/giab/GRCh38/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz | grep -v '#'
bcftools filter -i 'POS=19196584'  ../out/NA12878_NIST7035-dbsnp/variantCalling/haplotypecaller/NA12878_NIST.vcf.gz | grep -v '#'
#+end_src
#+RESULTS:
: NC_000022.11    19196584        .       T       C       50      PASS    platforms=5;platformnames=Illumina,PacBio,10X,Ion,Solid;datasets=5;datasetnames=HiSeqPE300x,CCS15kb_20kb,10XChromiumLR,IonExome,SolidSE75bp;callsets=7;callsetnames=HiSeqPE300xGATK,CCS15kb_20kbDV,CCS15kb_20kbGATK4,HiSeqPE300xfreebayes,10XLRGATK,IonExomeTVC,SolidSE75GATKHC;datasetsmissingcall=CGnormal;callable=CS_HiSeqPE300xGATK_callable,CS_CCS15kb_20kbDV_callable,CS_10XLRGATK_callable,CS_CCS15kb_20kbGATK4_callable,CS_HiSeqPE300xfreebayes_callable GT:PS:DP:ADALL:AD:GQ    0/1:.:781:109,123:138,150:348
: NC_000022.11    19196584        rs1061325       T       C       59.32   PASS    AC=2;AF=1;AN=2;DB;DP=2;ExcessHet=0;FS=0;MLEAC=1;MLEAF=0.5;MQ=60;QD=29.66;SOR=2.303      GT:AD:DP:GQ:PL  1/1:0,2:2:6:71,6,0
**** Checking that lm = allele/haplotype mismatch
reference = CAA/CAA
high-confidence = CA/CA
ours = C/CA
#+begin_src sh
 bcftools filter -i 'POS=31277416'  /Work/Groups/bisonex/data/giab/GRCh38/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz | grep -v '#'
 bcftools filter -i 'POS=31277416'  ../out/NA12878_NIST7035-dbsnp/variantCalling/haplotypecaller/NA12878_NIST.vcf.gz | grep -v '#'
#+end_src
#+RESULTS:
: NC_000022.11    31277416        .       CA      C       50      PASS    platforms=3;platformnames=Illumina,PacBio,10X;datasets=3;datasetnames=HiSeqPE300x,CCS15kb_20kb,10XChromiumLR;callsets=4;callsetnames=HiSeqPE300xGATK,CCS15kb_20kbDV,10XLRGATK,HiSeqPE300xfreebayes;datasetsmissingcall=CCS15kb_20kb,CGnormal,IonExome,SolidSE75bp;callable=CS_HiSeqPE300xGATK_callable;difficultregion=GRCh38_AllHomopolymers_gt6bp_imperfectgt10bp_slop5,GRCh38_SimpleRepeat_imperfecthomopolgt10_slop5  GT:PS:DP:ADALL:AD:GQ    1/1:.:465:16,229:0,190:129
: NC_000022.11    31277416        rs57244615      CAA     C,CA    389.02  PASS    AC=1,1;AF=0.5,0.5;AN=2;BaseQRankSum=0.37;DB;DP=37;ExcessHet=0;FS=0;MLEAC=1,1;MLEAF=0.5,0.5;MQ=60;MQRankSum=0;QD=13.41;ReadPosRankSum=-0.651;SOR=0.572    GT:AD:DP:GQ:PL  1/2:5,10,14:29:64:406,202,313,64,0,88
*** Read generation
Recent literature
https://www.biorxiv.org/content/10.1101/2022.03.29.486262v1.full.pdf
Among the simulators that handle variants:
- *simuscop*: reads not centered on the capture regions
- *NEAT*: exome, but too slow in practice
- *Reseq*: exome
- gensim: no exome
- pIRS: no exome either
- varsim: no exome either
  ...
  Compute time according to the ReSeq article: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02265-7
  #+begin_quote
  Due to ReSeq’s effective parallelization, its elapsed times are low for this benchmark with 48 virtual CPUs (Additional file 1: Figure S34b,e). In contrast, the single-threaded processes implemented in perl or python have strikingly high elapsed times. This is well visible in Hs-HiX-TruSeq and applies to the training of pIRS (over a week), NEAT (several days), and BEAR (half a week) as well as the simulation of NEAT (close to 2 weeks) and BEAR (several weeks).
  #+end_quote
Bibliography: https://www.nature.com/articles/s41437-022-00577-3
Miscellaneous
- Older list: https://www.biostars.org/p/128762/
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02265-7
* Ideas
** Analytical validation
Email from Yannis: patient data +/- simulated data
*** Use the GCAT data and upload ours?
https://www.nature.com/articles/ncomms7275
*** [#A] Variant calling: Genome in a Bottle: NA12878 + others
Summary: https://www.nist.gov/programs-projects/genome-bottle
Manuscript : https://www.nature.com/articles/s41587-019-0054-x.epdf?author_access_token=E_1bL0MtBBwZr91xEsy6B9RgN0jAjWel9jnR3ZoTv0OLNnFBR7rUIZNDXq0DIKdg3w6KhBF8Rz2RWQFFc0St45kC6CZs3cDYc87HNHovbWSOubJHDa9CeJV-pN0BW_mQ0n7cM13KF2JRr_wAAn524w%3D%3D
Article comparing variant callers: https://www.biorxiv.org/content/10.1101/2020.12.11.422022v1.full.pdf
**** KILL Also test the sequencing
CLOSED: [2023-01-30 lun. 18:30]
Starting from a fastq corresponding to Illumina  https://github.com/genome-in-a-bottle/giab_data_indexes
   then compare the VCF with the "high confidence" calls
This amounts to sequencing NA12878 directly -> pointless for the pipeline alone
**** TODO Test the bioinformatics part alone
   Everything is summarized here: https://www.nist.gov/programs-projects/genome-bottle
- method https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/analysis/Illumina_PlatinumGenomes_NA12877_NA12878_09162015/IlluminaPlatinumGenomes-user-guide.pdf
- vcf
     https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/NA12878_HG001/latest/GRCh38/
NB: what does https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/analysis/Illumina_PlatinumGenomes_NA12877_NA12878_09162015/hg38/2.0.1/NA12878/ correspond to?
   Article comparing variant callers: https://www.biorxiv.org/content/10.1101/2020.12.11.422022v1.full.pdf
   Article for vcfeval: https://www.nature.com/articles/s41587-019-0054-x
   Version 4 adds 273 "clinically relevant" genes https://www.biorxiv.org/content/10.1101/2021.06.07.444885v3.full.pdf
   Addition of the "difficult" regions
   https://www.biorxiv.org/content/10.1101/2020.07.24.212712v5.full.pdf
*** [#B] Pipeline: generate a patient with all the variants found at Centogene
Comparison of DNA read simulators (2019)
https://academic.oup.com/bfg/article/19/1/49/5680294
**** SimuSCop (exome)
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03665-5
https://github.com/qasimyu/simuscop
1. Create a model from bam + vcf: seqToProfile
2. Generate NGS data
** Annotation :
*** Comparison of VEP, SnpEff and ANNOVAR
* Changes in the new version
- Latest genome version (the "ready-to-use" version is only GRCh38, without the patched versions)
* Notes
** Nextflow
*** Displaying the results of a process/workflow
#+begin_src
lol.out.view()
#+end_src
Caution: this does not work if there are several outputs. In that case, use:
#+begin_src
lol.out[0].view()
#+end_src
or, if /a/ is the name of the output:
#+begin_src
lol.out.a.view()
#+end_src
** Which genome version?
- T2T: chromosome notation = chr1, chr2, ...: OK for genome, clinvar, dbSNP
- GRCh38: chromosome notation = NC_...: OK for genome, clinvar, dbSNP
** Performance
Carine's computer (WSL2): 4 h in total, including 1 h 15 for alignment (parallelized) and 1 h 15 for HaplotypeCaller (sequential)
** Chromosomes NC, NT, NW
Mapping:
https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&chromInfoPage=
Meaning:
https://genome.ucsc.edu/FAQ/FAQdownloads.html#downloadAlt
- alt = alternative sequences (usable)
- fix = patch (correction or improvement)
- random = sequence known to be on a chromosome but not yet used
** Ready-to-use Nextflow pipelines
Problem: they require singularity or docker (or conda)
Potentially usable with nix...
** Validation: which reference data?
Discussion with Alexis
- Platinum genomes = whole genome only
*** [[https://github.com/genome-in-a-bottle/giab_data_indexes][Genome in a bottle]]
  - NA12878 :
    - Illumina HiSeq Exome: fastq + capture in hg37
    - Illumina TruSeq Exome: bam, no capture
    - Exomes in hg37 https://zenodo.org/record/3597727 with capture
      - HiSeq2000
      - NextSeq 500
      - HiSeq 2500
  - HG002,3,4
    - Illumina Whole Exome: bam. The capture kit is "Agilent SureSelect Human All Exon V5 kit" according to the [[https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/OsloUniversityHospital_Exome_GATK_jointVC_11242015/README.txt][README]]. However, the target regions are required, [[https://kb.10xgenomics.com/hc/en-us/articles/115004150923-Where-can-I-find-the-Agilent-Target-BED-files-][according to this site]].
      Another file is available (capture?):
    https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/OsloUniversityHospital_Exome_GATK_jointVC_11242015/wex_Agilent_SureSelect_v05_b37.baits.slop50.merged.list
  "target region" +/- 50bp
    tested on chr311780-312086: OK
Other technologies are not suited to the pipeline (checked with Alexis)
*** [[https://www.illumina.com/platinumgenomes.html][Platinum genome]]
Whole genome only: "sequenced to 50x depth on a HiSeq 2000 system"
Whole genome possible
*** 1000 genomes
- intersection of the captures + CCDS  [[id:b77e64fa-06a8-4ffa-8b5b-ab3fda684b61][1000 Genomes exome raw data (fastq + capture)]]
- Broad Institute: SureSelect human all exon v2 target capture kit: not available on the Agilent website (V6 or later only)
*** Capture region
GIAB provides the .bed for the exome. Info: https://support.illumina.com/sequencing/sequencing_kits/nextera-rapid-capture-exome-kit/downloads.html
*** Validating the method
- 1000 genomes + SureSelect human all exon v2 target capture kit: not available on the Agilent website (V6 or later only)
  https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2928-9
- GIAB + liftover of the capture file to hg38
This is also what is done by
https://bcbio-nextgen.readthedocs.io/en/stable/contents/germline_variants.html
but with UCSC liftover
** Centogene
https://www.twistbioscience.com/node/23906
No BED file is provided for exactly this capture
We use https://www.twistbioscience.com/resources/data-files/twist-alliance-vcgs-exome-401mb-bed-files
which contains most of it
* Data :data:
** DONE Replace bam with fastq on the mesocentre
CLOSED: [2023-04-16 Sun 16:33]
Command
*** DONE Delete the fastq files that are not "paired"
CLOSED: [2023-04-16 Sun 16:33]
nushell
List of fastq files whose paired-end mate is missing
#+begin_src nu
ls **/*.fastq.gz | get name | path basename | split column "_" | get column1 | uniq -u | save single.txt
#+end_src
#+RESULTS:
: 62907927
: 62907970
: 62899606
: 62911287
: 62913201
: 62914084
: 62915905
: 62921595
: 62923065
: 62925220
: 62926503
: 62926502
: 62926500
: 62926499
: 62926498
: 62931719
: 62943423
: 62943400
: 62948290
: 62949205
: 62949206
: 62949118
: 62951284
: 62960792
: 62960785
: 62960787
: 62960617
: 62962561
: 62962692
: 62967473
: 62972194
: 62979102
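For readers who do not use nushell, the same idea can be written in Python: take the prefix before the first "_" of each file name and keep the prefixes that occur exactly once. This is a minimal sketch, not the command that was actually run; the function name =unpaired_ids= is made up for this note.
#+begin_src python
from collections import Counter
from pathlib import PurePosixPath


def unpaired_ids(paths):
    """Return the sample ids that appear in exactly one fastq file name.

    The id is the part of the file name before the first "_",
    mirroring `split column "_" | get column1 | uniq -u`.
    """
    ids = [PurePosixPath(p).name.split("_")[0] for p in paths]
    counts = Counter(ids)
    return sorted(i for i, n in counts.items() if n == 1)
#+end_src
A sample with both R1 and R2 files is counted twice and is dropped; a sample with a single file is reported.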
Check:
#+begin_src nu
open single.txt  | lines | each {|e| ls $"fastq/*_($in)/*" | get 0  }
open single.txt  | lines | each {|e| ls $"fastq/*_($in)/*" | get 0.name }  | path basename | split column "_" | get column1 | uniq -c
#+end_src
Move everything into one directory (no deletion)
#+begin_src nu
open single.txt  | lines | each {|e| ls $"fastq/*_($in)/*" | get 0  }  | each {|e| ^mv $e.name bad-fastq/}
#+end_src
Check that the directories are empty
 open single.txt  | lines | each {|e| ls $"fastq/*_($in)" | get 0.name } | ^ls -l $in
 Then delete them
 open single.txt  | lines | each {|e| ls $"fastq/*_($in)" | get 0.name } | ^rm -r $in
*** DONE Delete the bam files that have fastq files
CLOSED: [2023-04-16 Sun 16:33]
List the fastq and bam identifiers in a table, with their type:
#+begin_src nu
let fastq = (ls fastq/*/*.fastq.gz | get name | parse "{dir}/{full_id}/{id}_{R}_001.fastq.gz"  | select dir id | uniq )
let bam = (ls bam/*/*.bam | get name | parse "{dir}/{full_id}/{id}_{S}.bqrt.bam"  | select dir id)
#+end_src
Group the results by identifier (the result is a list of records that must be converted to a table)
and keep the identifiers that have only a fastq or only a bam
#+begin_src nu
let single = ( $bam | append $fastq | group-by id | transpose id files | get files | where {|x| ($x | length) == 1})
#+end_src
Convert to a table and keep only the bam files
#+begin_src nu
$single | reduce {|it, acc| $acc | append $it} | where dir == bam | get id | each {|e| ^ls $"bam/*_($e)/*.bam"}
#+end_src
#+RESULTS:
: bam/2100656174_62913201/62913201_S52.bqrt.bam
: bam/2100733271_62925220/62925220_S33.bqrt.bam
: bam/2100738763_62926502/62926502_S108.bqrt.bam
: bam/2100746726_62926498/62926498_S105.bqrt.bam
: bam/2100787936_62931955/62931955_S4.bqrt.bam
: bam/2200066374_62948290/62948290_S130.bqrt.bam
: bam/2200074722_62948298/62948298_S131.bqrt.bam
: bam/2200074990_62948306/62948306_S218.bqrt.bam
: bam/2200214581_62967331/62967331_S267.bqrt.bam
: bam/2200225399_62972187/62972187_S85.bqrt.bam
: bam/2200293962_62979117/62979117_S63.bqrt.bam
: bam/2200423985_62999352/62999352_S1.bqrt.bam
: bam/2200495073_63010427/63010427_S20.bqrt.bam
: bam/2200511274_63012586/63012586_S114.bqrt.bam
: bam/2200669188_63036688/63036688_S150.bqrt.bam
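The grouping logic above can be sketched in Python as follows: collect the file types seen for each identifier and keep the identifiers for which only a bam exists. This is an illustrative sketch; the function name =bam_only_ids= and the =(type, id)= record layout are assumptions made for this note.
#+begin_src python
from collections import defaultdict


def bam_only_ids(records):
    """records: iterable of (type, id) pairs, with type "bam" or "fastq".

    Returns the sorted ids that have a bam but no fastq.
    """
    kinds_by_id = defaultdict(set)
    for kind, sample_id in records:
        kinds_by_id[sample_id].add(kind)
    return sorted(i for i, kinds in kinds_by_id.items() if kinds == {"bam"})
#+end_src
Using a set per identifier also makes the result independent of how many fastq files (R1, R2) each sample has.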
* New workflow :workflow:
** TODO Databases
*** KILL Nix to download the raw data
**** Conclusion
Not viable on the cluster because the data live outside /nix/store
Symlinks could be used, but that is too complicated
**** KILL Axel instead of curl to handle timeouts?
CLOSED: [2022-08-19 Fri 15:18]
*** DONE Test @pennae's patch for large files
SCHEDULED: <2022-08-19 Fri>
*** KILL Download the data with nextflow: hg38
CLOSED: [2023-06-12 Mon 23:29]
**** DONE Reference genome
**** DONE dbSNP
**** DONE VEP 20G
CLOSED: [2023-06-12 Mon 23:29]
Added checksum verification -> to be checked
**** DONE transcriptome (spip)
CLOSED: [2023-06-12 Mon 23:29]
Add a manual checksum
**** KILL Refseq
**** KILL OMIM
CLOSED: [2023-06-12 Mon 23:29]
coded, to be checked
**** KILL ACMG incidental
CLOSED: [2023-06-12 Mon 23:29]
*** TODO Data :T2T:
SCHEDULED: <2023-06-15 Thu>
:PROPERTIES:
:ID:       5d915178-ca96-44ef-87f1-6702af114f2b
:END:
**** DONE fasta
CLOSED: [2023-06-12 Mon 23:30]
***** DONE hg38 compatibility
CLOSED: [2023-06-12 Mon 23:30]
**** DONE fasta index
CLOSED: [2023-06-13 Tue 00:07]
***** DONE hg38 compatibility
CLOSED: [2023-06-13 Tue 00:07]
**** DONE fasta dictionary
CLOSED: [2023-06-13 Tue 00:07]
**** DONE dbSNP
CLOSED: [2023-06-12 Mon 23:30]
***** DONE hg38 compatibility
CLOSED: [2023-06-12 Mon 23:30]
**** DONE commonSNP
CLOSED: [2023-06-14 Wed 22:32]
***** DONE hg38 compatibility
CLOSED: [2023-06-14 Wed 22:32]
cd /Work/Groups/bisonex/data/dbsnp/GRCh38.p14
[apraga@mesointeractive GRCh38.p14]$ zgrep -c '^NC' dbSNP_common.vcf.gz
21340485
[apraga@mesointeractive dbsnp]$ cd chm13v2.0/
[apraga@mesointeractive chm13v2.0]$ ls
chm13v2.0_dbSNPv155.vcf.gz      dbSNP_common.vcf.gz.tbi                 versions.yml
chm13v2.0_dbSNPv155.vcf.gz.tbi  ID_of_common_snp_not_clinvar_patho.txt
dbSNP_common.vcf.gz             ID_of_common_snp.txt
[apraga@mesointeractive chm13v2.0]$ zgrep -c '^chr' dbSNP_common.vcf.gz
19433713
**** DONE commonSNP, non-pathogenic
CLOSED: [2023-06-14 Wed 22:35]
***** DONE hg38 compatibility
CLOSED: [2023-06-14 Wed 22:35]
**** TODO cache vep
SCHEDULED: <2023-07-16 Sun>
*** HOLD Database processing
**** DONE dbSNP common
**** DONE Only the IDs in dbSNP common!
CLOSED: [2022-11-19 Sat 21:42]
172G instead of 253M...
**** HOLD common dbSNP not clinvar patho
***** DONE Partial conclusion
CLOSED: [2022-12-12 Mon 22:25]
- vcfeval: promising, but cannot process all regions
- isec: too many problems
- clinvar classification read directly from dbSNP: the simplest option
  It also catches a few errors in Alexis's script
***** KILL Use the dbSNP number in clinvar directly? No
CLOSED: [2022-11-20 Sun 19:51]
Ex: chr20
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f 'rs%INFO/RS \n' -i 'INFO/RS != "." & INFO/CLNSIG="Pathogenic"' clinvar_chr20.vcf.gz | sort > ID_clinvar_patho.txt
bcftools query -f '%ID\n' dbSNP_common_chr20.vcf.gz | sort > ID_of_common_snp.txt
comm -23 ID_of_common_snp.txt ID_clinvar_patho.txt > ID_of_common_snp_not_clinvar_patho.txt
wc -l ID_of_common_snp_not_clinvar_patho.txt
# sort ID
#+end_src
#+RESULTS:
: 518846 ID_of_common_snp_not_clinvar_patho.txt
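The =comm -23= step above keeps the IDs that are in the first file but not in the second. As a minimal Python sketch of the same operation (the function name =ids_not_in= is made up for this note), a set difference gives the same result without requiring sorted input:
#+begin_src python
def ids_not_in(common_ids, patho_ids):
    """Return the sorted IDs present in common_ids but absent from patho_ids.

    Equivalent to `comm -23 common.txt patho.txt` on sorted, de-duplicated files.
    """
    patho = {line.strip() for line in patho_ids}
    return sorted({line.strip() for line in common_ids} - patho)
#+end_src
Both arguments can be open file handles or plain lists of strings, since only iteration over lines is needed.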
Alexis's version
#+begin_src sh :dir ~/code/bisonex/test_isec
snp=dbSNP_common_chr20.vcf.gz
clinvar=clinvar_chr20_notremapped.vcf.gz
python ../script/pythonScript/clinvar_sbSNP.py \
    --clinvar $clinvar \
    --chrm_name_table ../database/RefSeq/refseq_to_number_only_consensual.txt \
    --dbSNP $snp --output prod.txt
wc -l prod.txt
zgrep '^NC' dbSNP_common_chr20.vcf.gz | wc -l
#+end_src
#+RESULTS:
| 518832 | prod.txt |
| 518846 |          |
***** KILL clinvar classification as encoded in dbSNP?
CLOSED: [2022-12-04 Sun 14:38]
On chromosome 20
*Caution*: CLNSIG has several fields (comma-separated)
They are accessed with INFO/CLNSIG[*]
Each item can in turn hold several values (separated by |), so a regexp is needed
NB: *do not put the condition* in a variable!
To get the clinvar pathogenic entries, we want the classification code 5 but not 255 (= other).
Likely pathogenic and conflicting entries are also needed
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f '%INFO/CLNSIG\n' dbSNP_common_chr20.vcf.gz -i \
'INFO/CLNSIG[*]~"^5|" | INFO/CLNSIG[*]=="5" | INFO/CLNSIG[*]~"|5" | INFO/CLNSIG[*]~"^4|" | INFO/CLNSIG[*]=="4" | INFO/CLNSIG[*]~"|4" | INFO/CLNSIG[*]~"^12|" | INFO/CLNSIG[*]=="12" | INFO/CLNSIG[*]~"|12"' | sort
#+end_src
#+RESULTS:
| . |  . | 12 |    |   |   |   |   |   |   |   |
| . | 12 |  0 |  2 |   |   |   |   |   |   |   |
| 2 |  3 |  2 |  2 | 2 | 5 | . |   |   |   |   |
| . |  2 |  3 |  2 | 2 | 4 |   |   |   |   |   |
| . |  . |  3 | 12 | 3 |   |   |   |   |   |   |
| . |  5 |  2 |  . |   |   |   |   |   |   |   |
| . |  . |  . |  5 | 2 | 2 |   |   |   |   |   |
| . |  9 |  9 |  9 | 5 | 5 | 2 | 3 | 2 | 3 | 2 |
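The matching rule described above (5, 4 or 12 as a whole code, never as a substring of another code such as 255) can be sketched in Python by splitting the CLNSIG value on both separators. This is an illustration of the rule, not the filter used in the pipeline; the names =PATHO_CODES= and =has_patho_code= are made up for this note.
#+begin_src python
import re

# Codes targeted in these notes: 4 = likely pathogenic, 5 = pathogenic,
# 12 = conflicting. 255 (= other) must not match.
PATHO_CODES = {"4", "5", "12"}


def has_patho_code(clnsig):
    """True if any comma- or pipe-separated CLNSIG code is in PATHO_CODES."""
    codes = re.split(r"[,|]", clnsig)
    return any(code in PATHO_CODES for code in codes)
#+end_src
Comparing whole codes avoids the prefix, exact and suffix regexps needed in the bcftools expression.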
If we exclude them:
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f '%ID\n' dbSNP_common_chr20.vcf.gz -e \
'INFO/CLNSIG[*]~"^5|" | INFO/CLNSIG[*]=="5" | INFO/CLNSIG[*]~"|5" | INFO/CLNSIG[*]~"4" | INFO/CLNSIG[*]~"12"' | sort | uniq > common-notpatho.txt
#+end_src
#+RESULTS:
 #+begin_src sh :dir ~/code/bisonex/test_isec
snp=dbSNP_common_chr20.vcf.gz
clinvar=clinvar_chr20_notremapped.vcf.gz
python ../script/pythonScript/clinvar_sbSNP.py \
    --clinvar $clinvar \
    --chrm_name_table ../database/RefSeq/refseq_to_number_only_consensual.txt \
    --dbSNP $snp --output tmp.txt
sort tmp.txt | uniq > common-notpatho-alexis.txt
wc -l common-notpatho-alexis.txt
 #+end_src
 #+RESULTS:
 : 518832 common-notpatho-alexis.txt
We get 6 more than Alexis's version, with a few differences
Entries from Alexis's version that are missing:
#+begin_src sh :dir ~/code/bisonex/test_isec
comm -23 common-notpatho-alexis.txt common-notpatho.txt > alexis-only.txt
cat alexis-only.txt
#+end_src
#+RESULTS:
| rs1064039  |
| rs3833341  |
| rs73598374 |
Check them in clinvar and dbSNP
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'ID=@alexis-only.txt' dbSNP_common_chr20.vcf.gz
bcftools query -f '%POS\n' -i 'ID=@alexis-only.txt' dbSNP_common_chr20.vcf.gz > alexis-only-pos.txt
while read  -r line; do
bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'POS='$line clinvar_chr20.vcf.gz
done < alexis-only-pos.txt
# bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'POS=23637790' clinvar_chr20.vcf.gz
#+end_src
#+RESULTS:
|   764018 | A | ACAGGTCAAT,ACAGGT | .,5     | 2,. |   |
| 23637790 | C | G,T               | .,.,12  |     |   |
| 44651586 | C | A,G,T             | .,.,.,5 |   2 | 2 |
|   764018 | A | ACAGGTCAAT        | Benign  |     |   |
| 23637790 | C | T                 | Benign  |     |   |
| 44651586 | C | T                 | Benign  |     |   |
So there is a discrepancy between clinvar and dbSNP.
It looks as if the intersection with clinvar was done incorrectly.
For example https://www.ncbi.nlm.nih.gov/snp/rs3833341#clinical_significance
It looks as if there is 1 benign clinvar entry and 1 pathogenic one.
Searching by NM shows that it is benign in clinvar because there are other submissions: https://www.ncbi.nlm.nih.gov/clinvar/variation/262235/
Confirmation on our databases:
$ bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'POS=764018' dbSNP_common_chr20.vcf.gz
764018 A ACAGGTCAAT,ACAGGT .,5|2,.
$ bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'POS=764018' clinvar_chr20.vcf.gz
764018 A ACAGGTCAAT Benign
***** KILL Fix Alexis's script
CLOSED: [2022-12-04 Sun 13:03]
Handles clinvar pathogenic, likely pathogenic or conflicting!
***** HOLD Rtg tools
****** Test
1. Generate the SDF file
   #+begin_src sh
rtg format genomeRef.fna  -o genomeRef.sdf
   #+end_src
2. For the databases, the --sample ALT option is required, otherwise we get
 #+begin_src
$ rtg vcfeval -b dbSNP_common.vcf.gz -c clinvar.vcf.gz -o test -t genomeRef.sdf/^C
VCF header does not contain a FORMAT field named GQ
Error: Record did not contain enough samples: NC_000001.11	10001	rs1570391677	A,C	.	PASS	RS=1570391677;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;R5;GNO;FREQ=KOREAN:0.9891,0.0109,.|SGDP_PRJ:0,1,.|dbGaP_PopFreq:1,.,0;COMMON
 #+end_src
 Trial intersection of clinvar (pathogenic or not) with dbSNP
   - false negatives = dbSNP common entries that are not in clinvar
   - false positives = clinvar entries that are not in dbSNP common
   - true positives = clinvar entries that are in dbSNP common
   - true positives (baseline) = dbSNP common entries that are in clinvar
 Count the lines
 #+begin_src sh
zgrep '^[^#]' /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz | wc -l
for i in *.vcf.gz; do echo $i; zgrep '^[^#]' $i | wc -l; done
 #+end_src
 | clinvar            |  1493470 |
 | fn.vcf.gz          | 22330220 |
 | fp.vcf.gz          |  1222529 |
 | tp-baseline.vcf.gz |   131040 |
 | tp.vcf.gz          |   136638 |
Note that not all of clinvar is recovered...
1222529 + 131040 = 1353569 < 1493470
Some regions are not processed:
#+begin_quote
Evaluation too complex (50002 unresolved paths, 34891 iterations) at reference region NC_000001.11:790930-790970. Variants in this region will not be included in results
#+end_quote
#+begin_src sh
grep 'not be included' vcfeval.log | wc -l
56192
#+end_src
The total is still lower
We want the non-pathogenic clinvar entries in dbSNP, i.e. the false negatives (dbSNP common not contained in clinvar pathogenic)
#+begin_src sh
bcftools filter -i 'INFO/CLNSIG="Pathogenic"' /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz -o /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar-patho.vcf.gz
tabix /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar-patho.vcf.gz
#+end_src
We run the script (dbSNP common and clinvar = 9 h)
#+begin_src sh
#!/bin/bash
#SBATCH --nodes=1
#SBATCH -p smp
#SBATCH --time=12:00:00
#SBATCH --mem=12G
dir=/Work/Groups/bisonex/data
dbSNP=$dir/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz
clinvar=$dir/clinvar/GRCh38/cl

[3.35]

[9.9704]

#+title: Bisonex
* Biblio
:PROPERTIES:
:CATEGORY: biblio
:END:
** Workflow
Comparison of WDL, Cromwell, nextflow
https://www.nature.com/articles/s41598-021-99288-8
Nextflow = a good compromise?
Comparison of aligners and variant callers (2021)
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04144-1
** Pipeline steps
*** Variant calling: Haplotype caller
https://gatk.broadinstitute.org/hc/en-us/articles/360035531412
Defines the algorithm + figure
*** Phred score
https://gatk.broadinstitute.org/hc/en-us/articles/360035531872-Phred-scaled-quality-scores
** VCF
*** GT genotype
encoded as allele values separated by either “/” or “|”. The allele values are 0 for the reference allele (what is in the reference sequence), 1 for the first allele listed in ALT, 2 for the second allele listed in ALT, and so on. For diploid calls, examples could be 0/1 or 1|0. For haploid calls, e.g. on Y, male X, mitochondrion, only one allele value should be given. All samples must have GT call information; if a call cannot be made for a sample at a given locus, “.” must be specified for each missing allele in the GT field (for example ./. for a diploid). The meanings of the separators are:
    / : genotype unphased
    | : genotype phased
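As a minimal sketch of the GT encoding described above (the helper name =parse_gt= is hypothetical, not part of any VCF library), a GT string can be split into allele indices and a phasing flag:
#+begin_src python
import re

def parse_gt(gt):
    """Split a VCF GT string into (alleles, phased).

    Alleles are ints (0 = REF, 1 = first ALT, ...) or None for a missing '.'.
    """
    # '|' marks a phased genotype, '/' an unphased one
    phased = "|" in gt
    alleles = [None if a == "." else int(a) for a in re.split(r"[/|]", gt)]
    return alleles, phased

print(parse_gt("0/1"))
print(parse_gt("1|0"))
print(parse_gt("./."))
#+end_src
A haploid call such as =1= simply yields a single-element allele list.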
** Validation
*** NA12878
**** KILL [[https://precision.fda.gov/challenges/truth/results][fdaPrecision challenge]]
Caution: whole genome and in hg19, so the comparison is not suitable ...
**** TODO Best practices for the analytical validation of clinical whole-genome sequencing intended for the diagnosis of germline disease
https://www.nature.com/articles/s41525-020-00154-9
General recommendations for genomes, without raw data
**** TODO [#A] Performance assessment of variant calling pipelines using human whole exome sequencing and simulated data
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2928-9
1. variant calling only
2. NA12878 + simulated data
3. exome
4. evaluated via F-score
Code available! https://github.com/bharani-lab/WES-Benchmarking-Pipeline_Manoj/tree/master/Script
Result: BWA/Novoalign_DeepVariant
Aligners
- BWA-MEM 0.7.16
- Bowtie2 2.2.6
- Novoalign 3.08.02
- SOAP 2.21
- MOSAIK 2.2.3
Variant calling
- GATK HaplotypeCaller 4
- FreeBayes 1.1.0
- SAMtools mpileup 1.7
- DeepVariant r0.4
SNV
| Exome | Pipeline |    TP |   FP |  FN | Sensitivity | Precision | F-Score |   FDR |
|-------+----------+-------+------+-----+-------------+-----------+---------+-------|
|     1 | BWA_GATK | 23689 | 1397 | 613 |       0.975 |     0.944 |   0.959 | 0.057 |
|     2 | BWA_GATK | 23946 |  865 | 356 |       0.985 |     0.965 |   0.975 | 0.036 |
Indel
|   TP | FP | FN | Sensitivity | Precision | F-Score |   FDR |
|------+----+----+-------------+-----------+---------+-------|
| 1254 | 72 | 75 |       0.944 |     0.946 |   0.945 | 0.054 |
| 1309 | 10 | 20 |       0.985 |     0.992 |   0.989 | 0.008 |
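The sensitivity, precision and F-score columns of the SNV table can be recomputed from the TP/FP/FN counts. This is a sketch using the standard definitions, which are assumed here to be the ones used by the article; the function name =metrics= is hypothetical and the FDR column is not recomputed.
#+begin_src python
def metrics(tp, fp, fn):
    """Return sensitivity, precision and F-score rounded to 3 decimals."""
    sensitivity = tp / (tp + fn)   # fraction of true variants that were called
    precision = tp / (tp + fp)     # fraction of calls that are true variants
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": round(sensitivity, 3),
        "precision": round(precision, 3),
        "f_score": round(f_score, 3),
    }

# SNV counts for exome 1 and exome 2 (BWA_GATK), taken from the table
print(metrics(23689, 1397, 613))
print(metrics(23946, 865, 356))
#+end_src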
Raw values:
https://static-content.springer.com/esm/art%3A10.1186%2Fs12859-019-2928-9/MediaObjects/12859_2019_2928_MOESM8_ESM.pdf
Other articles with the same exome comparison on NA12878
- Hwang et al., 2015
- Highnam et al., 2015
- Cornish and Guda, 2015
Variant type
|                                           | SNVs & Indels | CNVs (>10Kb) | SVs | Mitochondrial variants | Pseudogenes | REs | Somatic/mosaic | Literature/Data | Source   |
|-------------------------------------------+---------------+--------------+-----+------------------------+-------------+-----+----------------+-----------------+----------|
| NA12878                                   |         100%a |          40% |   0 |                      0 |           0 |   0 |              0 | Zook et al18    | NIST     |
| Other NIST standard (e.g. AJ/Asian trios) |           71% |          40% | 50% |                      0 |           0 |   0 |              0 | Zook et al18    |          |
| Platinum Genomes                          |           29% |            0 |   0 |                      0 |           0 |   0 |              0 | Eberle et al8   | Platinum |
| Venter/HuRef                              |           14% |          40% |   0 |                      0 |           0 |   0 |              0 | Trost et al1    | HuRef    |
**** Systematic comparison of germline variant calling pipelines cross multiple next-generation sequencers
#+begin_src bibtex
@ARTICLE{Chen2019-fp,
  title     = "Systematic comparison of germline variant calling pipelines
               cross multiple next-generation sequencers",
  author    = "Chen, Jiayun and Li, Xingsong and Zhong, Hongbin and Meng,
               Yuhuan and Du, Hongli",
  abstract  = "The development and innovation of next generation sequencing
               (NGS) and the subsequent analysis tools have gain popularity in
               scientific researches and clinical diagnostic applications.
               Hence, a systematic comparison of the sequencing platforms and
               variant calling pipelines could provide significant guidance to
               NGS-based scientific and clinical genomics. In this study, we
               compared the performance, concordance and operating efficiency
               of 27 combinations of sequencing platforms and variant calling
               pipelines, testing three variant calling pipelines-Genome
               Analysis Tool Kit HaplotypeCaller, Strelka2 and
               Samtools-Varscan2 for nine data sets for the NA12878 genome
               sequenced by different platforms including BGISEQ500,
               MGISEQ2000, HiSeq4000, NovaSeq and HiSeq Xten. For the variants
               calling performance of 12 combinations in WES datasets, all
               combinations displayed good performance in calling SNPs, with
               their F-scores entirely higher than 0.96, and their performance
               in calling INDELs varies from 0.75 to 0.91. And all 15
               combinations in WGS datasets also manifested good performance,
               with F-scores in calling SNPs were entirely higher than 0.975
               and their performance in calling INDELs varies from 0.71 to
               0.93. All of these combinations manifested high concordance in
               variant identification, while the divergence of variants
               identification in WGS datasets were larger than that in WES
               datasets. We also down-sampled the original WES and WGS datasets
               at a series of gradient coverage across multiple platforms, then
               the variants calling period consumed by the three pipelines at
               each coverage were counted, respectively. For the GIAB datasets
               on both BGI and Illumina platforms, Strelka2 manifested its
               ultra-performance in detecting accuracy and processing
               efficiency compared with other two pipelines on each sequencing
               platform, which was recommended in the further promotion and
               application of next generation sequencing technology. The
               results of our researches will provide useful and comprehensive
               guidelines for personal or organizational researchers in
               reliable and consistent variants identification.",
  journal   = "Sci. Rep.",
  publisher = "Springer Science and Business Media LLC",
  volume    =  9,
  number    =  1,
  pages     = "9345",
  month     =  jun,
  year      =  2019,
  copyright = "https://creativecommons.org/licenses/by/4.0",
  language  = "en"
}
#+end_src
Comparison of different pipelines (2019)
https://www.nature.com/articles/s41598-019-45835-3
Combinations
- variant calling = GATK, Strelka2 and Samtools-Varscan2
- on NA12878
- sequenced on BGISEQ500, MGISEQ2000, HiSeq4000, NovaSeq and HiSeq Xten.
  Conclusion: Strelka2 is superior, but is there a bias on NA12878?
Illumina > BGI for indels, probably because the reads are longer
#+begin_quote
 For WES datasets, the BGI platforms displayed the superior performance in SNPs
 calling while Illumina platforms manifested the better variants calling
 performance in INDELs calling, which could be explained by their divergence in
 sequencing strategy that producing different length of reads (all BGI platforms
 were 100 base pair read length while all Illumina platforms were 150 base pair
 read length). The read length effects, as a key factor between two platforms,
 would bring alignment bias and error which are higher for short reads and
 ultimately affect the variants calling especially the INDELs identification
#+end_quote
*** Debugging variant calling (haplotypecaller)
https://gatk.broadinstitute.org/hc/en-us/articles/360043491652-When-HaplotypeCaller-and-Mutect2-do-not-call-an-expected-variant
https://gatk.broadinstitute.org/hc/en-us/articles/360035891111-Expected-variant-at-a-specific-site-was-not-called
*** Hap.py
Output format:
#+begin_src r
vcf_field_names(vcf, tag = "FORMAT")
#+end_src
#+RESULTS:
: FORMAT BD    1      String  Decision for call (TP/FP/FN/N)
: FORMAT BK    1      String  Sub-type for decision (match/mismatch type)
: FORMAT BVT   1      String  High-level variant type (SNP|INDEL).
: FORMAT BLT   1      String  High-level location type (het|homref|hetalt|homa
am = genotype mismatch
lm = allele/haplotype mismatch
. = not seen
**** Checking that am = genotype mismatch
reference = T/T
high-confidence = T/C
ours = C/C
#+begin_src sh
bcftools filter -i 'POS=19196584'  /Work/Groups/bisonex/data/giab/GRCh38/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz | grep -v '#'
bcftools filter -i 'POS=19196584'  ../out/NA12878_NIST7035-dbsnp/variantCalling/haplotypecaller/NA12878_NIST.vcf.gz | grep -v '#'
#+end_src
#+RESULTS:
: NC_000022.11    19196584        .       T       C       50      PASS    platforms=5;platformnames=Illumina,PacBio,10X,Ion,Solid;datasets=5;datasetnames=HiSeqPE300x,CCS15kb_20kb,10XChromiumLR,IonExome,SolidSE75bp;callsets=7;callsetnames=HiSeqPE300xGATK,CCS15kb_20kbDV,CCS15kb_20kbGATK4,HiSeqPE300xfreebayes,10XLRGATK,IonExomeTVC,SolidSE75GATKHC;datasetsmissingcall=CGnormal;callable=CS_HiSeqPE300xGATK_callable,CS_CCS15kb_20kbDV_callable,CS_10XLRGATK_callable,CS_CCS15kb_20kbGATK4_callable,CS_HiSeqPE300xfreebayes_callable GT:PS:DP:ADALL:AD:GQ    0/1:.:781:109,123:138,150:348
: NC_000022.11    19196584        rs1061325       T       C       59.32   PASS    AC=2;AF=1;AN=2;DB;DP=2;ExcessHet=0;FS=0;MLEAC=1;MLEAF=0.5;MQ=60;QD=29.66;SOR=2.303      GT:AD:DP:GQ:PL  1/1:0,2:2:6:71,6,0
**** Checking that lm = allele/haplotype mismatch
reference = CAA/CAA
high-confidence = CA/CA
ours = C/CA
#+begin_src sh
 bcftools filter -i 'POS=31277416'  /Work/Groups/bisonex/data/giab/GRCh38/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz | grep -v '#'
 bcftools filter -i 'POS=31277416'  ../out/NA12878_NIST7035-dbsnp/variantCalling/haplotypecaller/NA12878_NIST.vcf.gz | grep -v '#'
#+end_src
#+RESULTS:
: NC_000022.11    31277416        .       CA      C       50      PASS    platforms=3;platformnames=Illumina,PacBio,10X;datasets=3;datasetnames=HiSeqPE300x,CCS15kb_20kb,10XChromiumLR;callsets=4;callsetnames=HiSeqPE300xGATK,CCS15kb_20kbDV,10XLRGATK,HiSeqPE300xfreebayes;datasetsmissingcall=CCS15kb_20kb,CGnormal,IonExome,SolidSE75bp;callable=CS_HiSeqPE300xGATK_callable;difficultregion=GRCh38_AllHomopolymers_gt6bp_imperfectgt10bp_slop5,GRCh38_SimpleRepeat_imperfecthomopolgt10_slop5  GT:PS:DP:ADALL:AD:GQ    1/1:.:465:16,229:0,190:129
: NC_000022.11    31277416        rs57244615      CAA     C,CA    389.02  PASS    AC=1,1;AF=0.5,0.5;AN=2;BaseQRankSum=0.37;DB;DP=37;ExcessHet=0;FS=0;MLEAC=1,1;MLEAF=0.5,0.5;MQ=60;MQRankSum=0;QD=13.41;ReadPosRankSum=-0.651;SOR=0.572    GT:AD:DP:GQ:PL  1/2:5,10,14:29:64:406,202,313,64,0,88
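The two checks above can be summarized with a deliberately simplified rule: if the truth and query carry the same non-reference alleles but in a different genotype, call it =am=; if the non-reference alleles themselves differ, call it =lm=. This is only an illustration of the distinction, not hap.py's actual matching algorithm, and =mismatch_type= is a hypothetical helper.
#+begin_src python
def mismatch_type(ref, truth, query):
    """Classify a truth/query genotype pair given as tuples of haplotype alleles."""
    if sorted(truth) == sorted(query):
        return "match"
    # Compare only the non-reference alleles carried by each genotype
    truth_alts = set(truth) - {ref}
    query_alts = set(query) - {ref}
    # Same alleles, different zygosity -> genotype mismatch; otherwise allele mismatch
    return "am" if truth_alts == query_alts else "lm"

# Position 19196584: reference T, high-confidence T/C, ours C/C
print(mismatch_type("T", ("T", "C"), ("C", "C")))
# Position 31277416: reference CAA, high-confidence CA/CA, ours C/CA
print(mismatch_type("CAA", ("CA", "CA"), ("C", "CA")))
#+end_src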
*** Read generation
Recent literature
https://www.biorxiv.org/content/10.1101/2022.03.29.486262v1.full.pdf
Among the tools that handle variants
- *simuscop*: reads not centred on the capture regions
- *NEAT: exome* but too slow in practice
- *Reseq* exome
- gensim: no exome
- pIRS: no exome either
- varsim: no exome either
  ...
  Computation time according to the ReSeq article https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02265-7
  #+begin_quote
  Due to ReSeq’s effective parallelization, its elapsed times are low for this benchmark with 48 virtual CPUs (Additional file 1: Figure S34b,e). In contrast, the single-threaded processes implemented in perl or python have strikingly high elapsed times. This is well visible in Hs-HiX-TruSeq and applies to the training of pIRS (over a week), NEAT (several days), and BEAR (half a week) as well as the simulation of NEAT (close to 2 weeks) and BEAR (several weeks).
Biblio : https://www.nature.com/articles/s41437-022-00577-3
  #+end_quote
Miscellaneous
- Older list: https://www.biostars.org/p/128762/
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02265-7
* Ideas
** Analytical validation
email from Yannis: patient data +/- simulated
*** Use GCAT data and upload ours?
https://www.nature.com/articles/ncomms7275
*** [#A] Variant calling: Genome in a Bottle: NA12878 + others
Summary: https://www.nist.gov/programs-projects/genome-bottle
Manuscript : https://www.nature.com/articles/s41587-019-0054-x.epdf?author_access_token=E_1bL0MtBBwZr91xEsy6B9RgN0jAjWel9jnR3ZoTv0OLNnFBR7rUIZNDXq0DIKdg3w6KhBF8Rz2RWQFFc0St45kC6CZs3cDYc87HNHovbWSOubJHDa9CeJV-pN0BW_mQ0n7cM13KF2JRr_wAAn524w%3D%3D
Article comparing variant callers: https://www.biorxiv.org/content/10.1101/2020.12.11.422022v1.full.pdf
**** KILL Also test the sequencing
CLOSED: [2023-01-30 lun. 18:30]
From a fastq corresponding to Illumina  https://github.com/genome-in-a-bottle/giab_data_indexes
   then we compare the VCF with the "high confidence" calls
We would sequence NA12878 directly -> pointless for the pipeline alone
**** TODO Test only the bioinformatics part
   Everything is summarized here: https://www.nist.gov/programs-projects/genome-bottle
- method https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/analysis/Illumina_PlatinumGenomes_NA12877_NA12878_09162015/IlluminaPlatinumGenomes-user-guide.pdf
- vcf
     https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/NA12878_HG001/latest/GRCh38/
NB: what does https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/analysis/Illumina_PlatinumGenomes_NA12877_NA12878_09162015/hg38/2.0.1/NA12878/ correspond to?
   Article comparing variant callers: https://www.biorxiv.org/content/10.1101/2020.12.11.422022v1.full.pdf
   Article for vcfeval: https://www.nature.com/articles/s41587-019-0054-x
   Version 4 adds 273 "clinically relevant" genes https://www.biorxiv.org/content/10.1101/2021.06.07.444885v3.full.pdf
   Addition of the "difficult" regions
   https://www.biorxiv.org/content/10.1101/2020.07.24.212712v5.full.pdf
*** [#B] Pipeline: generate a patient with all the variants found at Cento
Comparison of DNA generation tools (2019)
https://academic.oup.com/bfg/article/19/1/49/5680294
**** SimuSCop (exome)
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03665-5
https://github.com/qasimyu/simuscop
1. Create a model from bam + vcf: Setoprofile
2. Generate NGS data
** Annotation :
*** Comparison of vep / snpeff and annovar
* Changes in the new version
- Latest genome version (the "ready-to-use" version is only GRCh38, without the patched releases)
* Notes
** Nextflow
*** Display the results of a process/workflow
#+begin_src
lol.out.view()
#+end_src
Caution: this does not work when there are several outputs; use an index instead:
#+begin_src
lol.out[0].view()
#+end_src
or, if /a/ is the name of the output
#+begin_src
lol.out.a.view()
#+end_src
** Which genome version?
- T2T: chromosome notation = chr1,2 : ok for genome, clinvar, dbSNP
- GRCh38: chromosome notation = NC_... : ok for genome, clinvar, dbSNP
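Since the two builds use different chromosome notations, comparing them requires a conversion. A minimal sketch, assuming only the primary human chromosomes are needed (NC_000001 to NC_000022, NC_000023 = X, NC_000024 = Y, NC_012920 = mitochondrion); the function name =refseq_to_chr= is hypothetical.
#+begin_src python
import re

def refseq_to_chr(accession):
    """Convert a RefSeq accession such as 'NC_000020.11' to a 'chr20'-style name."""
    if accession.startswith("NC_012920."):
        return "chrM"
    m = re.fullmatch(r"NC_0000(\d{2})\.\d+", accession)
    if m is None:
        raise ValueError(f"not a primary chromosome accession: {accession}")
    n = int(m.group(1))
    if 1 <= n <= 22:
        return f"chr{n}"
    if n == 23:
        return "chrX"
    if n == 24:
        return "chrY"
    raise ValueError(f"unknown chromosome number in {accession}")

print(refseq_to_chr("NC_000020.11"))
#+end_src
Alternate, fix and unlocalized sequences (NT_, NW_) are deliberately rejected here rather than guessed.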
** Performances
Carine's computer (WSL2): 4 h, including 1 h 15 for alignment (parallelized) and 1 h 15 for haplotypecaller (sequential)
** Chromosomes NC, NT, NW
Mapping:
https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&chromInfoPage=
Meaning
https://genome.ucsc.edu/FAQ/FAQdownloads.html#downloadAlt
- alt = alternate sequences (usable)
- fix = patch (fix or improvement)
- random = sequence known to be on a chromosome but not yet used
** Ready-to-use nextflow pipelines
Problem: requires singularity or docker (or conda)
Potentially usable with nix...
** Validation: which reference data?
Discussion with Alexis
- Platinum genomes = whole genome only
*** [[https://github.com/genome-in-a-bottle/giab_data_indexes][Genome in a bottle]]
  - NA12878 :
    - Illumina HiSeq Exome: fastq + capture in hg37
    - Illumina TruSeq Exome: bam, no capture
    - Exomes in hg37 https://zenodo.org/record/3597727 with capture
      - HiSeq2000
      - NextSeq 500
      - HiSeq 2500
  - HG002,3,4
    - Illumina Whole Exome: bam. The capture kit is "Agilent SureSelect Human All Exon V5 kit" according to the [[https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/OsloUniversityHospital_Exome_GATK_jointVC_11242015/README.txt][README]]. The regions are needed, [[https://kb.10xgenomics.com/hc/en-us/articles/115004150923-Where-can-I-find-the-Agilent-Target-BED-files-][according to this site]]
      Another file is available (capture?)
    https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/OsloUniversityHospital_Exome_GATK_jointVC_11242015/wex_Agilent_SureSelect_v05_b37.baits.slop50.merged.list
  "target region" +/- 50bp
    tested on chr311780-312086: ok
Other technologies are not suited to the pipeline (checked with Alexis)
*** [[https://www.illumina.com/platinumgenomes.html][Platinum genome]]
Whole genome only: “sequenced to 50x depth on a HiSeq 2000 system”
Whole genome is possible
*** 1000 genomes
- intersection of the capture regions + CCDS  [[id:b77e64fa-06a8-4ffa-8b5b-ab3fda684b61][1000 Genomes exome raw data (fastq + capture)]]
- Broad Institute: SureSelect human all exon v2 target capture kit: not available on the Agilent website (V6 or later only)
*** Capture region
GIAB provides the .bed for the exome. Info: https://support.illumina.com/sequencing/sequencing_kits/nextera-rapid-capture-exome-kit/downloads.html
*** Validating the method
- 1000 genomes + SureSelect human all exon v2 target capture kit: not available on the Agilent website (V6 or later only)
  https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2928-9
- GIAB + liftover of the capture file to hg38
This is also what is done by
https://bcbio-nextgen.readthedocs.io/en/stable/contents/germline_variants.html
But with UCSC liftover
** Centogène
https://www.twistbioscience.com/node/23906
No bed file is provided for exactly this capture
We use https://www.twistbioscience.com/resources/data-files/twist-alliance-vcgs-exome-401mb-bed-files
which contains most of it
* Data
:PROPERTIES:
:CATEGORY: data
:END:
** DONE Replace bam with fastq on the mesocentre
CLOSED: [2023-04-16 Sun 16:33]
Command
*** DONE Delete fastq files that are not "paired"
CLOSED: [2023-04-16 Sun 16:33]
nushell
List of fastq files whose "paired-end" mate is missing
#+begin_src nu
ls **/*.fastq.gz | get name | path basename | split column "_" | get column1 | uniq -u | save single.txt
#+end_src
#+RESULTS:
: 62907927
: 62907970
: 62899606
: 62911287
: 62913201
: 62914084
: 62915905
: 62921595
: 62923065
: 62925220
: 62926503
: 62926502
: 62926500
: 62926499
: 62926498
: 62931719
: 62943423
: 62943400
: 62948290
: 62949205
: 62949206
: 62949118
: 62951284
: 62960792
: 62960785
: 62960787
: 62960617
: 62962561
: 62962692
: 62967473
: 62972194
: 62979102
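For readers less familiar with nushell, the same idea (keep the identifiers that appear exactly once among the fastq basenames, like =uniq -u=) can be sketched in Python. The file names in the example are hypothetical and only follow the ={id}_{R}_001.fastq.gz= pattern used later in these notes.
#+begin_src python
from collections import Counter
from pathlib import PurePosixPath

def unpaired_ids(paths):
    """Return the sample ids that have a single fastq file, i.e. no mate."""
    # The id is the part of the basename before the first underscore
    ids = [PurePosixPath(p).name.split("_")[0] for p in paths]
    counts = Counter(ids)
    return sorted(i for i, n in counts.items() if n == 1)

example = [
    "fastq/2100000000_62907927/62907927_R1_001.fastq.gz",   # mate missing
    "fastq/2100000001_62900000/62900000_R1_001.fastq.gz",
    "fastq/2100000001_62900000/62900000_R2_001.fastq.gz",
]
print(unpaired_ids(example))
#+end_src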
We check
#+begin_src nu
open single.txt  | lines | each {|e| ls $"fastq/*_($in)/*" | get 0  }
open single.txt  | lines | each {|e| ls $"fastq/*_($in)/*" | get 0.name }  | path basename | split column "_" | get column1 | uniq -c
#+end_src
We move everything into one folder (no deletion)
#+begin_src
open single.txt  | lines | each {|e| ls $"fastq/*_($in)/*" | get 0  }  | each {|e| ^mv $e.name bad-fastq/}
#+end_src
We check that the directories are empty
 open single.txt  | lines | each {|e| ls $"fastq/*_($in)" | get 0.name } | ^ls -l $in
 Then we delete them
 open single.txt  | lines | each {|e| ls $"fastq/*_($in)" | get 0.name } | ^rm -r $in
*** DONE Delete bam files that have fastq
CLOSED: [2023-04-16 Sun 16:33]
We list the fastq and bam identifiers in a table together with their type:
#+begin_src
let fastq = (ls fastq/*/*.fastq.gz | get name | parse "{dir}/{full_id}/{id}_{R}_001.fastq.gz"  | select dir id | uniq )
let bam = (ls bam/*/*.bam | get name | parse "{dir}/{full_id}/{id}_{S}.bqrt.bam"  | select dir id)
#+end_src
We group the results by identifier (results = a list of records that must be converted to a table)
and keep those that have only a fastq or only a bam
#+begin_src
let single = ( $bam | append $fastq | group-by id | transpose id files | get files | where {|x| ($x | length) == 1})
#+end_src
We convert to a table and keep only the bam files
#+begin_src
$single | reduce {|it, acc| $acc | append $it} | where dir == bam | get id | each {|e| ^ls $"bam/*_($e)/*.bam"}
#+end_src
#+RESULTS:
: bam/2100656174_62913201/62913201_S52.bqrt.bam
: bam/2100733271_62925220/62925220_S33.bqrt.bam
: bam/2100738763_62926502/62926502_S108.bqrt.bam
: bam/2100746726_62926498/62926498_S105.bqrt.bam
: bam/2100787936_62931955/62931955_S4.bqrt.bam
: bam/2200066374_62948290/62948290_S130.bqrt.bam
: bam/2200074722_62948298/62948298_S131.bqrt.bam
: bam/2200074990_62948306/62948306_S218.bqrt.bam
: bam/2200214581_62967331/62967331_S267.bqrt.bam
: bam/2200225399_62972187/62972187_S85.bqrt.bam
: bam/2200293962_62979117/62979117_S63.bqrt.bam
: bam/2200423985_62999352/62999352_S1.bqrt.bam
: bam/2200495073_63010427/63010427_S20.bqrt.bam
: bam/2200511274_63012586/63012586_S114.bqrt.bam
: bam/2200669188_63036688/63036688_S150.bqrt.bam
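The group-by logic above (append both lists, group by id, keep groups of size one, then keep the bam entries) can be sketched in Python as follows. The record layout =(kind, id)= and the function name are assumptions made for the illustration, and each id is assumed to appear at most once per kind.
#+begin_src python
from collections import defaultdict

def bam_without_fastq(records):
    """records: iterable of (kind, sample_id) with kind in {'bam', 'fastq'}."""
    groups = defaultdict(list)
    for kind, sample_id in records:
        groups[sample_id].append(kind)
    # Keep ids seen only once, and only when that single entry is a bam
    return sorted(sid for sid, kinds in groups.items() if kinds == ["bam"])

records = [
    ("bam", "62913201"),
    ("bam", "62900000"),
    ("fastq", "62900000"),
    ("fastq", "62907927"),
]
print(bam_without_fastq(records))
#+end_src
The ids returned are the bam files to keep; every other bam has a matching fastq and can be removed.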
* New workflow
:PROPERTIES:
:CATEGORY: workflow
:END:
** TODO Databases
*** KILL Nix to download the raw data
**** Conclusion
Not viable on the cluster because it lives outside /nix/store
Symlinks could be used, but that is too complicated
**** KILL Axel instead of curl to handle timeouts?
CLOSED: [2022-08-19 Fri 15:18]
*** DONE Test @pennae's patch for large files
SCHEDULED: <2022-08-19 Fri>
*** KILL Download the data with nextflow: hg38
CLOSED: [2023-06-12 Mon 23:29]
**** DONE Reference genome
**** DONE dbSNP
**** DONE VEP 20G
CLOSED: [2023-06-12 Mon 23:29]
Added checksum verification -> to be checked
**** DONE transcriptome (spip)
CLOSED: [2023-06-12 Mon 23:29]
Add a manual checksum
**** KILL Refseq
**** KILL OMIM
CLOSED: [2023-06-12 Mon 23:29]
implemented, to be checked
**** KILL ACMG incidental
CLOSED: [2023-06-12 Mon 23:29]
*** TODO Data :T2T:
:PROPERTIES:
:ID:       5d915178-ca96-44ef-87f1-6702af114f2b
:END:
**** DONE fasta
CLOSED: [2023-06-12 Mon 23:30]
***** DONE hg38 compatibility
CLOSED: [2023-06-12 Mon 23:30]
**** DONE fasta index
CLOSED: [2023-06-13 Tue 00:07]
***** DONE hg38 compatibility
CLOSED: [2023-06-13 Tue 00:07]
**** DONE fasta dictionary
CLOSED: [2023-06-13 Tue 00:07]
**** DONE dbSNP
CLOSED: [2023-06-12 Mon 23:30]
***** DONE hg38 compatibility
CLOSED: [2023-06-12 Mon 23:30]
**** DONE commonSNP
CLOSED: [2023-06-14 Wed 22:32]
***** DONE hg38 compatibility
CLOSED: [2023-06-14 Wed 22:32]
cd /Work/Groups/bisonex/data/dbsnp/GRCh38.p14
[apraga@mesointeractive GRCh38.p14]$ zgrep -c '^NC' dbSNP_common.vcf.gz
21340485
[apraga@mesointeractive dbsnp]$ cd chm13v2.0/
[apraga@mesointeractive chm13v2.0]$ ls
chm13v2.0_dbSNPv155.vcf.gz      dbSNP_common.vcf.gz.tbi                 versions.yml
chm13v2.0_dbSNPv155.vcf.gz.tbi  ID_of_common_snp_not_clinvar_patho.txt
dbSNP_common.vcf.gz             ID_of_common_snp.txt
[apraga@mesointeractive chm13v2.0]$ zgrep -c '^chr' dbSNP_common.vcf.gz
19433713
**** DONE non-pathogenic commonSNP
CLOSED: [2023-06-14 Wed 22:35]
***** DONE hg38 compatibility
CLOSED: [2023-06-14 Wed 22:35]
**** TODO vep cache
SCHEDULED: <2023-07-25 Tue>
*** HOLD Database processing
**** DONE dbSNP common
**** DONE Only the IDs in dbSNP common!
CLOSED: [2022-11-19 Sat 21:42]
172G instead of 253M...
**** HOLD common dbSNP not clinvar patho
***** DONE Partial conclusion
CLOSED: [2022-12-12 Mon 22:25]
- vcfeval: promising, but it cannot process all regions
- isec: too many problems with it
- clinvar classification directly in dbSNP: the simplest option
  It also catches a few errors in Alexis's script
***** KILL Use the dbSNP number in clinvar directly? No
CLOSED: [2022-11-20 Sun 19:51]
Ex: chr20
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f 'rs%INFO/RS \n' -i 'INFO/RS != "." & INFO/CLNSIG="Pathogenic"' clinvar_chr20.vcf.gz | sort > ID_clinvar_patho.txt
bcftools query -f '%ID\n' dbSNP_common_chr20.vcf.gz | sort > ID_of_common_snp.txt
comm -23 ID_of_common_snp.txt ID_clinvar_patho.txt > ID_of_common_snp_not_clinvar_patho.txt
wc -l ID_of_common_snp_not_clinvar_patho.txt
# sort ID
#+end_src
#+RESULTS:
: 518846 ID_of_common_snp_not_clinvar_patho.txt
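The =comm -23= step above keeps the lines that appear only in the first sorted file. A rough Python equivalent, as a sketch (unlike =comm=, a set removes duplicate IDs; the function name is hypothetical):
#+begin_src python
def common_not_patho(common_ids, patho_ids):
    """Return the common-SNP ids that are absent from the pathogenic id list."""
    patho = set(patho_ids)
    return sorted(i for i in set(common_ids) if i not in patho)

# Toy rs identifiers, not real data
common = ["rs1", "rs2", "rs3"]
patho = ["rs2", "rs9"]
print(common_not_patho(common, patho))
#+end_src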
Alexis's version
#+begin_src sh :dir ~/code/bisonex/test_isec
snp=dbSNP_common_chr20.vcf.gz
clinvar=clinvar_chr20_notremapped.vcf.gz
python ../script/pythonScript/clinvar_sbSNP.py \
    --clinvar $clinvar \
    --chrm_name_table ../database/RefSeq/refseq_to_number_only_consensual.txt \
    --dbSNP $snp --output prod.txt
wc -l prod.txt
zgrep '^NC' dbSNP_common_chr20.vcf.gz | wc -l
#+end_src
#+RESULTS:
| 518832 | prod.txt |
| 518846 |          |
***** KILL clinvar classification as encoded in dbSNP?
CLOSED: [2022-12-04 Sun 14:38]
On chromosome 20
*Caution*: CLNSIG has several fields (comma-separated)
They are accessed with INFO/CLNSIG[*]
Each item can then hold several values (separated by |), so a regexp is needed
NB: *do not put the condition* in a variable!
To get the pathogenic clinvar entries, we want 5 but not 255 (= other) for the classification!
We also need likely pathogenic and conflicting
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f '%INFO/CLNSIG\n' dbSNP_common_chr20.vcf.gz -i \
'INFO/CLNSIG[*]~"^5|" | INFO/CLNSIG[*]=="5" | INFO/CLNSIG[*]~"|5" | INFO/CLNSIG[*]~"^4|" | INFO/CLNSIG[*]=="4" | INFO/CLNSIG[*]~"|4" | INFO/CLNSIG[*]~"^12|" | INFO/CLNSIG[*]=="12" | INFO/CLNSIG[*]~"|12"' | sort
#+end_src
#+RESULTS:
| . |  . | 12 |    |   |   |   |   |   |   |   |
| . | 12 |  0 |  2 |   |   |   |   |   |   |   |
| 2 |  3 |  2 |  2 | 2 | 5 | . |   |   |   |   |
| . |  2 |  3 |  2 | 2 | 4 |   |   |   |   |   |
| . |  . |  3 | 12 | 3 |   |   |   |   |   |   |
| . |  5 |  2 |  . |   |   |   |   |   |   |   |
| . |  . |  . |  5 | 2 | 2 |   |   |   |   |   |
| . |  9 |  9 |  9 | 5 | 5 | 2 | 3 | 2 | 3 | 2 |
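To make the CLNSIG layout explicit, here is a sketch that splits the value first on commas (one field per allele) and then on =|= (several codes per field), and tests for the codes used in these notes: 5 (pathogenic), 4 (likely pathogenic) and 12 (conflicting). Comparing whole codes avoids accidentally matching 255. The helper names are hypothetical.
#+begin_src python
PATHO_CODES = {"4", "5", "12"}

def clnsig_codes(value):
    """Return the set of CLNSIG codes in a dbSNP-style value such as '.,5|2,.'."""
    codes = set()
    for per_allele in value.split(","):
        codes.update(per_allele.split("|"))
    codes.discard(".")  # '.' means no classification for that allele
    return codes

def is_patho(value):
    """True if any code is pathogenic, likely pathogenic or conflicting."""
    return bool(clnsig_codes(value) & PATHO_CODES)

print(is_patho(".,5|2,."))
print(is_patho("2|3|255"))
#+end_src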
If we exclude them:
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f '%ID\n' dbSNP_common_chr20.vcf.gz -e \
'INFO/CLNSIG[*]~"^5|" | INFO/CLNSIG[*]=="5" | INFO/CLNSIG[*]~"|5" | INFO/CLNSIG[*]~"4" | INFO/CLNSIG[*]~"12"' | sort | uniq > common-notpatho.txt
#+end_src
#+RESULTS:
 #+begin_src sh :dir ~/code/bisonex/test_isec
snp=dbSNP_common_chr20.vcf.gz
clinvar=clinvar_chr20_notremapped.vcf.gz
python ../script/pythonScript/clinvar_sbSNP.py \
    --clinvar $clinvar \
    --chrm_name_table ../database/RefSeq/refseq_to_number_only_consensual.txt \
    --dbSNP $snp --output tmp.txt
sort tmp.txt | uniq > common-notpatho-alexis.txt
wc -l common-notpatho-alexis.txt
 #+end_src
 #+RESULTS:
 : 518832 common-notpatho-alexis.txt
We have 6 more than Alexis's version, with a few differences
Those from Alexis that are missing:
#+begin_src sh :dir ~/code/bisonex/test_isec
comm -23 common-notpatho-alexis.txt common-notpatho.txt > alexis-only.txt
cat alexis-only.txt
#+end_src
#+RESULTS:
| rs1064039  |
| rs3833341  |
| rs73598374 |
We check them in clinvar and dbSNP
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'ID=@alexis-only.txt' dbSNP_common_chr20.vcf.gz
bcftools query -f '%POS\n' -i 'ID=@alexis-only.txt' dbSNP_common_chr20.vcf.gz > alexis-only-pos.txt
while read  -r line; do
bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'POS='$line clinvar_chr20.vcf.gz
done < alexis-only-pos.txt
# bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'POS=23637790' clinvar_chr20.vcf.gz
#+end_src
#+RESULTS:
: 764018 A ACAGGTCAAT,ACAGGT .,5|2,.
: 23637790 C G,T .,.,12
: 44651586 C A,G,T .,.,.,5|2|2
: 764018 A ACAGGTCAAT Benign
: 23637790 C T Benign
: 44651586 C T Benign
So there is a discrepancy between clinvar and dbSNP.
It looks as if the intersection with clinvar was done incorrectly.
For example https://www.ncbi.nlm.nih.gov/snp/rs3833341#clinical_significance
It gives the impression that there is 1 benign and 1 pathogenic clinvar entry.
Searching by NM shows that it is benign in clinvar because there are other submissions! https://www.ncbi.nlm.nih.gov/clinvar/variation/262235/
Confirmation on our databases:
$ bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'POS=764018' dbSNP_common_chr20.vcf.gz
764018 A ACAGGTCAAT,ACAGGT .,5|2,.
$ bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'POS=764018' clinvar_chr20.vcf.gz
764018 A ACAGGTCAAT Benign
***** KILL Fix Alexis's script
CLOSED: [2022-12-04 Sun 13:03]
Handles pathogenic, likely pathogenic or conflicting clinvar entries!
***** HOLD Rtg tools
****** Test
1. Generate the SDF file
   #+begin_src sh
rtg format genomeRef.fna  -o genomeRef.sdf
   #+end_src
2. For the databases, the --sample ALT option is required; otherwise we get
 #+begin_src
$ rtg vcfeval -b dbSNP_common.vcf.gz -c clinvar.vcf.gz -o test -t genomeRef.sdf/^C
VCF header does not contain a FORMAT field named GQ
Error: Record did not contain enough samples: NC_000001.11	10001	rs1570391677	A,C	.	PASS	RS=1570391677;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;R5;GNO;FREQ=KOREAN:0.9891,0.0109,.|SGDP_PRJ:0,1,.|dbGaP_PopFreq:1,.,0;COMMON
 #+end_src
 Attempted intersection of clinvar (pathogenic or not) with dbSNP
   - false negative = dbSNP common variants that are not in clinvar
   - false positive = clinvar variants that are not in dbSNP common
   - true positive = clinvar variants that are in dbSNP common
   - true positive baseline = dbSNP common variants that are in clinvar
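 The four categories above can be illustrated with plain set operations on variant keys, with dbSNP common as the baseline and clinvar as the calls. This is only a sketch: vcfeval matches haplotypes rather than exact keys, which is why its tp and tp-baseline counts can differ, whereas here they are the same set. The key layout =(chrom, pos, ref, alt)= and the function name are assumptions.
#+begin_src python
def compare(baseline, calls):
    """Exact-key comparison of two variant collections."""
    baseline, calls = set(baseline), set(calls)
    return {
        "tp": calls & baseline,   # calls also present in the baseline
        "fp": calls - baseline,   # calls absent from the baseline
        "fn": baseline - calls,   # baseline variants that were not called
    }

# Toy variants, not real data
dbsnp = {("chr20", 100, "A", "G"), ("chr20", 200, "C", "T")}
clinvar = {("chr20", 200, "C", "T"), ("chr20", 300, "G", "A")}
result = compare(dbsnp, clinvar)
print({k: len(v) for k, v in result.items()})
#+end_src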
 We count the number of lines
 #+begin_src sh
zgrep '^[^#]' /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz | wc -l
for i in *.vcf.gz; do echo $i; zgrep '^[^#]' $i | wc -l; done
 #+end_src
 | clinvar            |  1493470 |
 | fn.vcf.gz          | 22330220 |
 | fp.vcf.gz          |  1222529 |
 | tp-baseline.vcf.gz |   131040 |
 | tp.vcf.gz          |   136638 |
Note that we do not recover all of clinvar...
1222529 + 131040 = 1353569 < 1493470
Some regions are not processed:
#+begin_quote
Evaluation too complex (50002 unresolved paths, 34891 iterations) at reference region NC_000001.11:790930-790970. Variants in this region will not be included in results
#+end_quote
#+begin_src sh
grep 'not be included' vcfeval.log | wc -l
56192
#+end_src
Even so, the total is still lower
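The gap can be checked with a few lines of arithmetic on the counts from the table. The notes sum fp with tp-baseline; assuming tp.vcf.gz holds the call-side (clinvar) matches, fp + tp may be the more natural sum, so both are shown. Either way the total stays below the clinvar record count.
#+begin_src python
clinvar_total = 1493470
fp = 1222529
tp_baseline = 131040
tp = 136638

# Sum used in the notes (fp + tp-baseline)
recovered_notes = fp + tp_baseline
# Alternative, assuming tp.vcf.gz contains the clinvar-side matches
recovered_calls = fp + tp

print(recovered_notes, clinvar_total - recovered_notes)
print(recovered_calls, clinvar_total - recovered_calls)
#+end_src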
We want the non-pathogenic clinvar variants in dbSNP, i.e. the false negatives (dbSNP common variants not contained in pathogenic clinvar)
#+begin_src sh
bcftools filter -i 'INFO/CLNSIG="Pathogenic"' /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz -o /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar-patho.vcf.gz
tabix /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar-patho.vcf.gz
#+end_src
We run the script (dbSNP common and clinvar = 9 h)
#+begin_src sh
#!/bin/bash
#SBATCH --nodes=1
#SBATCH -p smp
#SBATCH --time=12:00:00
#SBATCH --mem=12G
dir=/Work/Groups/bisonex/data
dbSNP=$dir/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz
clinvar=$dir/clinvar/GRCh38/cl

Replacement in projects/bisonex.org at line 4 [3.35]

B:BD[10.12744] → [10.12744:13567]

B:BD[10.13567] → [7.9023:14909]

B:BD[7.14909] → [11.6717:8200]

B:BD[11.8200] → [2.29:8358]

∅:D[12.8221] → [4.49917:49939]

∅:D[2.8358] → [4.49917:49939]

∅:D[8.16414] → [4.49917:49939]

B:BD[4.49917] → [4.49917:49939]

∅:D[9.17919] → [13.16414:16524]

∅:D[4.49939] → [13.16414:16524]

B:BD[13.16414] → [13.16414:16524]

∅:D[13.16524] → [7.24606:25407]

B:BD[7.24606] → [7.24606:25407]

∅:D[7.25407] → [14.8221:8354]

B:BD[14.8221] → [14.8221:8354]

∅:D[14.8354] → [15.8221:8850]

B:BD[15.8221] → [15.8221:8850]

∅:D[15.8850] → [16.24606:30966]

B:BD[16.24606] → [16.24606:30966]

tho.
   Alexis's version handles this correctly
| NC_000020.11 | 3234173 | rs3827075 | T         | A,C,G     |                                              |
| NC_000020.11 | 3234173 |    262001 | T         | G         | Conflicting_interpretations_of_pathogenicity |
| NC_000020.11 | 3234173 |   1072511 | T         | TGGCGAAGC | Pathogenic                                   |
| NC_000020.11 | 3234173 |    208613 | TGGCGAAGC | G         | Pathogenic                                   |
| NC_000020.11 | 3234173 |      1312 | TGGCGAAGC | T         | Pathogenic                                   |
****** DONE Check whether isec handles multiallelic sites (chr20): no, could not get it to work
CLOSED: [2022-11-27 Sun 00:37]
******* DONE chr20, picking a pathogenic clinvar variant that is also in dbSNP
CLOSED: [2022-11-27 Sun 00:37]
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools filter dbSNP_common_chr20.vcf.gz -i 'POS=10652589' -o test_dbsnp.vcf.gz
bcftools filter clinvar_chr20.vcf.gz -i 'POS=10652589' -o test_clinvar.vcf.gz
bcftools index test_dbsnp.vcf.gz
bcftools index test_clinvar.vcf.gz
#+end_src
#+RESULTS:
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools isec test_dbsnp.vcf.gz test_clinvar.vcf.gz -p tmp
grep '^[^#]' tmp/0002.vcf
grep '^[^#]' tmp/0003.vcf
#+end_src
#+RESULTS:
Does not work, even with biallelic sites.
Tested by editing test_dbsnp!
Works with one variant per line
****** DONE isec after splitting multiallelic sites: no
CLOSED: [2022-11-27 Sun 00:37]
******* DONE Simple example OK
CLOSED: [2022-11-27 Sun 00:34]
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools filter -i 'POS=10652589' dbSNP_common_chr20.vcf.gz -o dbsnp_mwi.vcf.gz
bcftools filter -i 'POS=10652589' clinvar_chr20.vcf.gz -o clinvar_mwi.vcf.gz
bcftools index -f dbsnp_mwi.vcf.gz
bcftools index -f clinvar_mwi.vcf.gz
bcftools isec dbsnp_mwi.vcf.gz clinvar_mwi.vcf.gz -n=2
#+end_src
#+RESULTS:
Does not work, even with biallelic sites.
Chr 20
With the files from the previous test
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools norm -m -any dbsnp_mwi.vcf.gz -o dbsnp_mwi_norm.vcf.gz
bcftools index dbsnp_mwi_norm.vcf.gz
bcftools isec dbsnp_mwi_norm.vcf.gz clinvar_mwi.vcf.gz -n=2
#+end_src
#+RESULTS:
| NC_000020.11 | 10652589 | G | A | 11 |
| NC_000020.11 | 10652589 | G | C | 11 |
******* TODO On dbSNP chr20: no
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools norm -m -any dbSNP_common_chr20.vcf.gz -o dbSNP_common_chr20_norm.vcf.gz
#+end_src
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools isec -i 'INFO/CLNSIG="Pathogenic"' dbSNP_common_chr20_norm.vcf.gz clinvar_chr20.vcf.gz -p tmp
#+end_src
#+RESULTS:
***** DONE Trial with bedtools intersect
#+begin_src sh
bedtools intersect -a  dbSNP_common.vcf.gz -b clinvar.vcf.gz
#+end_src
$ wc -l intersect.vcf
220206 intersect.vcf
** TODO Dependencies with Nix
*** DONE GATK
CLOSED: [2022-10-21 Fri 21:59]
*** WAIT BioDBHTS
Contribute a pull request
*** DONE BioExtAlign
CLOSED: [2022-10-22 Sat 00:38]
*** WAIT BioBigFile
Check whether the latest kent version can be used
Contribute a pull request
*** HOLD rtg-tools
Convert clinvar to NC names
*** DONE simuscop
CLOSED: [2022-12-30 Fri 22:31]
*** DONE Spip
CLOSED: [2022-12-04 Sun 12:49]
No pull request
*** DONE R + packages
CLOSED: [2022-11-19 Sat 21:05]
*** TODO hap.py
https://github.com/Illumina/hap.py
**** DONE Version without rtgtools, with Python 3
CLOSED: [2023-02-02 Thu 22:15]
Procedure to test
#+begin_src
nix develop .#hap-py
$ genericBuild
#+end_src
1. Remove the call to make_dependencies in cmakelist.txt: everything can be installed with nix
2. Patch Roc.cpp to get numeric_limits (error: 'numeric_limits' is not a member of 'std')
3. Add link flags (trial and error)
set(ZLIB_LIBRARIES -lz -lbz2 -lcurl -lcrypto -llzma)
4. Replace print statements with print() in the Python code and remove a few imports
[nix-shell:~/source]$ sed -i.orig 's/print \"\(.*\)"/print(\1)/' src/python/*.py
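Step 4 can be illustrated with a small stand-alone filter. Note that in the sed expression as recorded, the capture group sits inside the quotes, so `print "x"` would come out as `print(x)`; the sketch below moves the quotes into the group. `py2_print_to_call` is a hypothetical helper name, and it only handles the simplest single-string case.

```shell
# Hypothetical helper: rewrite simple Python 2 `print "..."` statements read
# from stdin into Python 3 `print("...")` calls, keeping the quotes.
py2_print_to_call() {
    sed 's/print \(".*"\)/print(\1)/'
}

printf '%s\n' 'print "hello"' | py2_print_to_call   # prints: print("hello")
```

For anything beyond plain string prints (several arguments, `%` formatting, trailing commas), a dedicated converter such as 2to3 is probably more reliable than a regular expression.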
**** DONE Serialize JSON to write the output data
CLOSED: [2023-02-17 Fri 19:25]
**** DONE Test on the example
CLOSED: [2023-02-04 Sat 00:25]
#+begin_src sh
$ cd hap.py
$ ../result/bin/hap.py example/happy/PG_NA12878_chr21.vcf.gz       example/happy/NA12878_chr21.vcf.gz       -f example/happy/PG_Conf_chr21.bed.gz       -o test -r example/chr21.fa
#+end_src
#+RESULTS:
| Type  | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score |
| INDEL | ALL    |        8937 |     7839 |     1098 |       11812 |      343 |      3520 |    45 |   283 |      0.877140 |         0.958635 |       0.298002 |        0.916079 |
| INDEL | PASS   |        8937 |     7550 |     1387 |        9971 |      283 |      1964 |    30 |   242 |      0.844803 |         0.964656 |       0.196971 |        0.900760 |
| SNP   | ALL    |       52494 |    52125 |      369 |       90092 |      582 |     37348 |   107 |   354 |      0.992971 |         0.988966 |       0.414554 |        0.990964 |
| SNP   | PASS   |       52494 |    46920 |     5574 |       48078 |      143 |       992 |     8 |    97 |      0.893816 |         0.996963 |       0.020633 |        0.942576 |
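The metric columns in the table above can be recomputed from the count columns. The sketch below is a hedged reconstruction, not hap.py code: `happy_metrics` is a hypothetical helper, and it assumes that the query true positives are what remains of QUERY.TOTAL after removing QUERY.UNK and QUERY.FP, which reproduces the INDEL/ALL row.

```shell
# Hypothetical helper: recompute recall, precision, Frac_NA and F1 from counts.
# Arguments: TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK
happy_metrics() {
    awk -v ttp="$1" -v tfn="$2" -v qtot="$3" -v qfp="$4" -v qunk="$5" 'BEGIN {
        recall = ttp / (ttp + tfn)
        qtp = qtot - qunk - qfp              # assumption: query true positives
        prec = qtp / (qtp + qfp)
        frac_na = qunk / qtot
        f1 = 2 * prec * recall / (prec + recall)
        printf "%.6f %.6f %.6f %.6f\n", recall, prec, frac_na, f1
    }'
}

# INDEL / ALL row of the table above.
happy_metrics 7839 1098 11812 343 3520   # prints 0.877140 0.958635 0.298002 0.916079
```

The output order is recall, precision, Frac_NA, F1.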
**** TODO Version with rtg-tools
**** TODO Get the tests working
***** TODO Attempt 2: from nix develop:
SCHEDULED: <2023-07-09 Sun>
#+begin_src
nix develop .#hap-py
genericBuild
#+end_src
Initially run by hand, but run_tests can now be used
#+begin_src
HCDIR=bin/ ../src/sh/run_tests.sh
#+end_src
- [X] test boost
- [X] multimerge
- [X] hapenum
- [X] fp accuracy
- [X] faulty variant
- [ ] leftshift fails
- [X] other vcf
- [X] chr prefix
- [X] gvcf
- [X] decomp
- [X] contig length
- [X] integration test
- [ ] scmp fails on the type
- [X] giab
- [X] performance
- [ ] quantify fails on the type
- [ ] stratified: fails on the results!
- [X] pg counting
- [ ] sompy: cannot find Strelka in somatic
#+begin_src
phases="buildPhase checkPhase installPhase fixupPhase" genericBuild
#+end_src
**** KILL Reproduce the precisionchallenge performance: beware of HG002 vs HG001!
CLOSED: [2023-04-01 Sat 19:43]
https://www.nist.gov/programs-projects/genome-bottle
***** KILL 0GOOR
CLOSED: [2023-04-01 Sat 19:40]
The problem came from (1) the DNA and (2) the chromosome renaming, which was wrong
****** DONE HG002
CLOSED: [2023-02-17 Fri 19:31]
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score
INDEL    ALL       525466    491355     34111      1156702     57724     605307   9384  25027       0.935084          0.895313        0.523304         0.914766
INDEL   PASS       525466    491355     34111      1156702     57724     605307   9384  25027       0.935084          0.895313        0.523304         0.914766
  SNP    ALL      3365115   3358399      6716      5666020     21995    2284364   4194   1125       0.998004          0.993496        0.403169         0.995745
  SNP   PASS      3365115   3358399      6716      5666020     21995    2284364   4194   1125       0.998004          0.993496        0.403169         0.995745
 TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
                    NaN                     NaN                   1.528276                   2.752637
                    NaN                     NaN                   1.528276                   2.752637
               2.100129                1.473519                   1.581196                   1.795603
               2.100129                1.473519                   1.581196                   1.795603
***** KILL With python2
CLOSED: [2023-02-17 Fri 19:25]
****** KILL with nix
CLOSED: [2023-02-17 Fri 19:25]
conda create -n python2 python=2.7 anaconda
****** KILL with conda
CLOSED: [2023-02-17 Fri 19:25]
******* Gentoo: regex_error on test...
OK with bash!
#+begin_src
anaconda3/bin/conda create --name py2 python=2.7
conda activate py2
conda install -c bioconda hap.py
#+end_src
******** Run the tests.
bin/test_haplotypes must be replaced with test_haplotypes in src/sh/run_tests.sh
#+begin_src sh
 HGREF=../genome/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta HCDIR=~/anaconda3/envs/py2/bin bash src/sh/run_tests.sh
#+end_src
Failure:
test_haplotypes: /opt/conda/conda-bld/work/hap.py-0.3.7/src/c++/lib/tools/Fasta.cpp:81: MMappedFastaFile::MMappedFastaFile(const string&): Assertion `fd != -1' failed.
unknown location(0): fatal error in "testVariantPrimitiveSplitter": signal: SIGABRT (application abort requested)
/opt/conda/conda-bld/work/hap.py-0.3.7/src/c++/test/test_align.cpp(298): last checkpoint
******** Chr21
HGREF=../genome/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta hap.py        example/happy/PG_NA12878_chr21.vcf.gz       example/happy/NA12878_chr21.vcf.gz       -f example/happy/PG_Conf_chr21.bed.gz       -o test
******* Helios
failure
** TODO T2T :T2T:
All resources are described here
https://github.com/marbl/CHM13
Details on the pipeline
https://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hub_3267197_GCA_009914755.4&c=CP068277.2&g=hub_3267197_hgLiftOver
*** DONE Alignment
CLOSED: [2023-06-26 Mon 19:42]
NXF_OPTS=-D"user.name=${USER}" nextflow run main.nf -profile standard,helios  --input="/Work/Groups/bisonex/data/giab/*_R{1,2}_001.fastq.gz" --id=NA12878-T2T -bg
SCHEDULED: <2023-06-14 Wed>
*** DONE Haplotypecaller
CLOSED: [2023-06-26 Mon 19:42] SCHEDULED: <2023-06-15 Thu>
*** TODO Filters
SCHEDULED: <2023-07-16 Sun>
*** Liftover pipelines
:PROPERTIES:
:ID:       d2280207-3f65-4a31-a291-41fa9a9658c2
:END:
Contains the chain files
** TODO Quality metrics
SCHEDULED: <2023-07-26 Wed>
*** Idea
Raredisease:
- FastQC: many statistics. Not available in Nix
- Mosdepth: computes depth (2x faster than samtools depth). Available in Nix
- MultiQC: only merges the analysis results. Not available in Nix
- Picard's CollectMultipleMetrics, CollectHsMetrics, and CollectWgsMetrics
- Qualimap: alternative to FastQC? Not available in Nix
- Sentieon's WgsMetricsAlgo: proprietary
- TIDDIT's cov: TIDDIT = chromosomal rearrangements
Sarek:
- alignment statistics: samtools stats, mosdepth
- QC: MultiQC
MultiQC: not available in Nix
** TODO Check whether normalization is done
SCHEDULED: <2023-07-26 Wed>
** TODO Add HGVS check
SCHEDULED: <2023-07-26 Wed>
** DONE Execution
CLOSED: [2022-09-13 Tue 21:37]
*** KILL test Bionix
*** KILL Implement execution with Nix?
See https://academic.oup.com/gigascience/article/9/11/giaa121/5987272?login=false
for an example.
Probably simpler to use Nix for environment management and snakemake for execution
No internet access from the cluster
*** DONE nextflow
CLOSED: [2022-09-13 Tue 21:37]
**** TODO Bug scheduler SGE
The job gets killed because the user is not passed correctly to nextflow
***** DONE Force the user at run time
CLOSED: [2023-04-01 Sat 17:57]
NXF_OPTS=-D"user.name=alex"
***** DONE Check whether the problem persists with 22.10.6
CLOSED: [2023-04-01 Sat 18:38] SCHEDULED: <2023-04-01 Sat>
yes
***** KILL Package the user into the program?
Bad idea...
** TODO Preprocessing with nextflow
*** TODO Map to reference
**** TODO Sample ID dans header
/Work/Users/apraga/bisonex/out/63003856_S135/preprocessing/baserecalibrator
*** DONE Mark duplicate
CLOSED: [2022-10-09 Sun 22:30]
*** DONE Recalibrate base quality score
CLOSED: [2022-10-09 Sun 22:30]
** DONE Variant calling with Nextflow
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Haplotype caller
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter variants
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter common snp not clinvar path
CLOSED: [2022-11-07 Mon 23:00]
Voir [[*common dbSNP not clinvar patho][common dbSNP not clinvar patho]]
*** DONE Filter variant only in consensual sequence
CLOSED: [2022-11-08 Tue 22:23]
*** DONE Filter technical variants
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Use AVX to speed up execution
CLOSED: [2023-04-29 Sat 15:46]
Without it, we get this warning
#+begin_quote
17:28:00.720 INFO  PairHMM - OpenMP multi-threaded AVX-accelerated native PairHMM implementation is not supported
17:28:00.721 INFO  NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/nix/store/cy9ckxqwrkifx7wf02hm4ww1p6lnbxg9-gatk-4.2.4.1/bin/gatk-package-4.2.4.1-local.jar!/com/intel/gkl/native/libgkl_utils.so
17:28:00.733 WARN  NativeLibraryLoader - Unable to load libgkl_utils.so from native/libgkl_utils.so (/Work/Users/apraga/bisonex/out/NA12878_NIST7035/preprocessing/applybqsr/libgkl_utils821485189051585397.so: libgomp.so.1: cannot open shared object file: No such file or directory)
17:28:00.733 WARN  IntelPairHmm - Intel GKL Utils not loaded
17:28:00.733 WARN  PairHMM - ***WARNING: Machine does not have the AVX instruction set support needed for the accelerated AVX PairHmm. Falling back to the MUCH slower LOGLESS_CACHING implementation!
17:28:00.763 INFO  ProgressMeter - Starting traversal
#+end_quote
libgomp.so is provided by gcc, so the module must be loaded
 module load gcc@11.3.0/gcc-12.1.0
** KILL Use subworkflows
CLOSED: [2023-04-02 Sun 18:08]
Our version is more flexible
*** KILL Alignment
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
*** KILL Vep
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
vcf_annotate_ensemblvep
** TODO Annotation with nextflow :annotation:
*** KILL VEP: --gene-phenotype?
CLOSED: [2023-04-18 mar. 18:32]
Discussed with Alexis: databases are out of date
https://www.ensembl.org/info/genome/variation/phenotype/sources_phenotype_documentation.html
*** DONE plugin VEP
CLOSED: [2023-04-18 mar. 18:32]
Clone the git repository containing the plugins
Then use --dir_plugins
*** HOLD Use Alexis's code
*** TODO New version with VEP
Example with --custom
https://www.ensembl.org/info/docs/tools/vep/script/vep_custom.html
**** DONE Add spliceAI
CLOSED: [2023-05-18 Thu 11:02] SCHEDULED: <2023-04-30 Sun>
VEP plugin
***** DONE Download the data
CLOSED: [2023-05-11 Thu 19:01]
Hard to automate, the link is temporary...
***** DONE Plugin
CLOSED: [2023-05-11 Thu 20:16]
***** DONE Split the score into several columns
CLOSED: [2023-05-11 Thu 20:16]
Test with this file, to get one line with an annotation and one line without
#CHROM	POS	ID	REF	ALT
1	9091	.	A	C
1	69091	.	A	C
and
#+begin_src sh
rm -f postvep.tsv* && vep -i testspliceai.vcf.gz -o postvep.tsv --tab  --dir 109 --merged --pick --use_given_ref   --offline  --plugin SpliceAI,snv=spliceai_scores.raw.snv.hg38.vcf.gz,indel=spliceai_scores.raw.indel.hg38.vcf.gz
#+end_src
#+begin_src
$ bgzip postvep.tsv
$ python spliceai.py
$ cat postvep2.tsv
,variation,Location,Allele,Gene,Feature,Feature_type,Consequence,cDNA_position,CDS_position,Protein_position,Amino_acids,Codons,Existing_variation,IMPACT,DISTANCE,STRAND,FLAGS,REFSEQ_MATCH,SOURCE,REFSEQ_OFFSET,SpliceAI_AG,SpliceAI_AL,SpliceAI_DG,SpliceAI_DL
0,1_9091_A/C,1:9091,C,ENSG00000290825,ENST00000456328,Transcript,upstream_gene_variant,-,-,-,-,-,-,MODIFIER,2778,1,-,-,Ensembl,-,,,,
1,1_69091_A/C,1:69091,C,ENSG00000186092,ENST00000641515,Transcript,missense_variant,124,64,22,M/L,Atg/Ctg,-,MODERATE,-,1,-,-,Ensembl,-,0.01,0.00,0.00,0.01
#+end_src
Test
cp work/bf/437ae511958509e43072f032f4d495/small.tab.gz tests/vep-spip.tab.gz
cp work/d5/3b1244b5ae83d54409ee0d456e8c55/small_cadd.tab.gz tests/vep-cadd-splice.tab.gz
**** TODO Package Nix spliceAI ?
nix profile install nixpkgs#python3Packages.tensorflow
+ add dependencies ("grep import" or conda)
**** TODO Add LOEUF and pLI
VEP plugin
**** TODO NMD
**** KILL Add LOEUF
CLOSED: [2023-04-19 mer. 16:32]
VEP plugin
**** DONE Spip
CLOSED: [2023-05-01 Mon 23:07] SCHEDULED: <2023-04-30 Sun>
BED does not seem to work well (a region has to be defined)
VCF: too much information
Note: several transcripts but identical results. We remove the duplicates
***** DONE interpretation + score + confidence interval, separated
CLOSED: [2023-05-01 Mon 23:07] SCHEDULED: <2023-04-30 Sun>
Tests, in tests/:
vep -i 63004925-small.vcf -o postvep.vcf --vcf --fasta genomeRef.fna --dir 109 --merged --pick  --offline --custom ../script/spip_annotation.vcf.gz,SPIP,vcf,exact,0,spipInterp,spipScore,spipConfidence
***** DONE Score
CLOSED: [2023-04-22 Sat 15:30]
**** DONE CADD: replace with VEP plugin
CLOSED: [2023-05-07 Sun 14:45] SCHEDULED: <2023-05-07 Sun>
***** Test
#+begin_src
vep  -i test.vcf  -o lol.vcf --offline --dir  /Work/Projects/bisonex/data/vep/GRCh38/ --merged --vcf --fasta /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna --plugin CADD,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.snv.tsv.gz,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.indel.tsv.gz  --dir_plugins ../VEP_plugins/ -v
#+end_src
Test
#+begin_src sh
vep --id "1  230710048 230710048 A/G 1"   --offline --dir  /Work/Projects/bisonex/data/vep/GRCh38/ --merged --vcf --fasta /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna --plugin CADD,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.snv.tsv.gz,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.indel.tsv.gz  --hgvsg --plugin pLI --plugin LOEUF -o lol
#+end_src
CSQ=G|missense_variant|MODERATE|AGT|ENSG00000135744|Transcript|ENST00000366667|protein_coding|2/5||||843|776|259|M/T|aTg/aCg|||-1||HGNC|HGNC:333||Ensembl||A|A||1:g.230710048A>G|0.347|-0.277922|
Matches https://www.ensembl.org/Homo_sapiens/Tools/VEP/Results?tl=I7ZsIbrj14P6lD43-9115494
***** DONE Use whole genome
CLOSED: [2023-04-29 Sat 15:46]
***** KILL Rename the chromosomes beforehand...
CLOSED: [2023-05-01 Mon 09:14] SCHEDULED: <2023-04-30 Sun>
Too slow!
- Downloading CADD: 4h20
- renaming the chromosomes for SNVs: 6h20
- tabix on the SNVs: job killed after 21h...
***** DONE annotate separately and merge the tables
CLOSED: [2023-05-07 Sun 14:45] SCHEDULED: <2023-05-01 Mon>
NB: CADD could be filtered with tabix to restrict it to our variants
**** DONE clinvar
CLOSED: [2023-04-22 Sat 15:31]
**** KILL Check HGVS results with mutalyzer
CLOSED: [2023-05-01 Mon 09:26]
**** TODO Parallelization
***** HOLD per chromosome with the VEP workflow
https://github.com/Ensembl/ensembl-vep/blob/release/109/nextflow/workflows/run_vep.nf
***** HOLD With the --fork option
**** DONE Use the nf-core version of VEP
CLOSED: [2023-05-13 Sat 18:27] SCHEDULED: <2023-05-07 Sun>
**** DONE OMIM
CLOSED: [2023-05-08 Mon 15:02] SCHEDULED: <2023-05-01 Mon>
**** TODO Grantham
**** TODO ACMG incidental
**** TODO Gnomad ?
**** DONE Filter after VEP with filter_vep
CLOSED: [2023-04-29 Sat 15:47]
Not tested
*** TODO Compare the annotations on 63003856
**** Rerun the new pipeline
*** HOLD Old version
**** TODO HGVS
**** TODO Filter after VEP
**** TODO OMIM
**** TODO clinvar
**** TODO ACMG incidental
**** TODO Grantham
**** KILL LRG
CLOSED: [2023-04-18 mar. 17:22] SCHEDULED: <2023-04-18 Tue>
Discussed with Alexis: no longer up to date
**** TODO Gnomad
** DONE Port Alexis's version exactly to Helios
CLOSED: [2023-01-14 Sat 17:56]
"prod" branch
** KILL Test Alexis's version with Nix
CLOSED: [2023-06-14 Wed 22:37]
*** DONE Add clinvar
CLOSED: [2022-11-13 Sun 19:37]
*** DONE Alignment
CLOSED: [2022-11-13 Sun 12:52]
*** DONE Haplotype caller
CLOSED: [2022-11-13 Sun 13:00]
*** KILL Filter
CLOSED: [2023-06-14 Wed 22:37]
- [X] depth
- [ ] common snp not patho
Problem with the ID list
**** KILL variant annotation
CLOSED: [2023-06-14 Wed 22:37]
Requires vep
*** KILL Variant calling
CLOSED: [2023-06-14 Wed 22:37]
** STRT Test sarek
#+begin_src sh
 module load apptainer/1.1.8
 nextflow run nf-core/sarek -profile test,singularity --outdir test-sarek
#+end_src
The dependencies do not download correctly, so we extract them by hand
#+begin_src sh
 rg -IN galaxyproject modules  | sed 's/ //g;s/:$//' | sort | uniq > deps.txt
#+end_src
 Manual cleanup
 Then
 #+begin_src sh
 cat deps.txt | xargs -L1 singularity pull
 #+end_src
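The manual cleanup step could probably be folded into the extraction itself. The sketch below is an assumption about what the matched lines look like (a quoted URL followed by ` :`, as in nf-core module `container` directives); `extract_container_urls` is a hypothetical helper, it uses grep rather than rg, and the fastqc line is purely illustrative sample data.

```shell
# Hypothetical helper: keep lines mentioning galaxyproject, strip spaces,
# quotes and the trailing ':' so that only the bare URL remains.
extract_container_urls() {
    grep -h 'galaxyproject' "$@" | sed "s/[ '\"]//g; s/:\$//" | sort -u
}

# Illustrative sample standing in for a module file (assumed layout).
sample=$(mktemp)
cat > "$sample" <<'EOF'
    container "${ workflow.containerEngine == 'singularity' ?
        'https://depot.galaxyproject.org/singularity/fastqc:0.11.9--0' :
        'biocontainers/fastqc:0.11.9--0' }"
EOF
extract_container_urls "$sample"
```

The resulting list could then be fed to `xargs -L1 singularity pull` as above.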
* Improvements :amelioration:
* Documentation :doc:
** DONE Installation procedure for nix + dependencies on the CHU VM
CLOSED: [2023-04-22 Sat 15:27] SCHEDULED: <2023-04-13 Thu>
* Manuscript :manuscript:
* Tests :tests:
** KILL Non-regression: prod version
CLOSED: [2023-05-23 Tue 08:46]
*** DONE ID common snp
CLOSED: [2022-11-19 Sat 21:36]
#+begin_src
$ wc -l ID_of_common_snp.txt
23194290 ID_of_common_snp.txt
$ wc -l /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp.txt
23194290 /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp.txt
#+end_src
*** DONE ID common snp not clinvar patho
CLOSED: [2022-12-11 Sun 20:11]
**** DONE Checking the problem
CLOSED: [2022-12-11 Sun 16:30]
On J:
21155134 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt.ref
"Non-regression" version
21155076 database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt
New version
23193391 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt
After removing duplicates
$ sort database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt | uniq > old.txt
$ wc -l old.txt
21107097 old.txt
$ sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt | uniq > new.txt
$ wc -l new.txt
21174578 new.txt
$ sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt.ref | uniq > ref.txt
$ wc -l ref.txt
21107155 ref.txt
Looking at the difference
 comm -23 ref.txt old.txt
rs1052692
rs1057518973
rs1057518973
rs11074121
rs112848754
rs12573787
rs145033890
rs147889095
rs1553904159
rs1560294695
rs1560296615
rs1560310926
rs1560325547
rs1560342418
rs1560356225
rs1578287542
...
We look up the first one
bcftools query -i 'ID="rs1052692"' database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 1619351 C A,T
It is indeed pathogenic...
$ bcftools query -i 'POS=1619351' database/clinvar/clinvar.vcf.gz -f '%CHROM %POS %REF %ALT %INFO/CLNSIG\n'
19 1619351 C T Conflicting_interpretations_of_pathogenicity
We check all the others
$ comm -23 ref.txt old.txt > tocheck.txt
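The deduplicate-then-compare step used to build the list of IDs to check can be wrapped in one function. This is a minimal sketch on toy data: `ids_only_in_first` and the file names under `demo_dir` are hypothetical, and `LC_ALL=C` is added here so that sort and comm agree on ordering.

```shell
# Hypothetical helper: print the IDs found in the first file but not in the
# second, ignoring duplicates in either file.
ids_only_in_first() {
    first_sorted=$(mktemp) || return 1
    second_sorted=$(mktemp) || return 1
    # comm requires sorted input; LC_ALL=C keeps sort and comm consistent.
    LC_ALL=C sort -u "$1" > "$first_sorted"
    LC_ALL=C sort -u "$2" > "$second_sorted"
    LC_ALL=C comm -23 "$first_sorted" "$second_sorted"
    rm -f "$first_sorted" "$second_sorted"
}

# Toy data standing in for the reference list and the non-regression list.
demo_dir=$(mktemp -d)
printf 'rs3\nrs1\nrs2\nrs1\n' > "$demo_dir/ref_ids.txt"
printf 'rs2\nrs3\n' > "$demo_dir/old_ids.txt"
ids_only_in_first "$demo_dir/ref_ids.txt" "$demo_dir/old_ids.txt"   # prints rs1
```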
We generate the regions to check (chromosome number:position)
$ bcftools query -i 'ID=@tocheck.txt' database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM\t%POS\n' > tocheck.pos
We generate the reverse mapping (chromosome number -> NC)
$ awk ' { t = $1; $1 = $2; $2 = t; print; } ' database/RefSeq/refseq_to_number_only_consensual.txt  > mapping.txt
We remap clinvar
$ bcftools annotate --rename-chrs mapping.txt database/clinvar/clinvar.vcf.gz -o clinvar_remapped.vcf.gz
$ tabix clinvar_remapped.vcf.gz
Finally, we look up the classification in clinvar
$ bcftools query -R tocheck.pos clinvar_remapped.vcf.gz -f '%CHROM %POS %INFO/CLNSIG\n'
$ bcftools query -R tocheck.pos database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM %POS %ID \n' | grep '^NC'
#+RESULTS:
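The reverse-mapping step above relies on swapping the first two columns of the mapping file. The sketch below shows that swap on two made-up lines; it assumes the file holds a RefSeq accession followed by a chromosome number (inferred from the file name, not checked), and `swap_first_two_columns` is a hypothetical helper name.

```shell
# Hypothetical helper: swap the first two whitespace-separated columns of each
# line, as needed to turn an "NC number" table into a "number NC" table.
swap_first_two_columns() {
    awk '{ t = $1; $1 = $2; $2 = t; print }' "$@"
}

# Made-up sample lines in the assumed "accession number" layout.
printf 'NC_000001.11 1\nNC_000020.11 20\n' | swap_first_two_columns
# prints:
# 1 NC_000001.11
# 20 NC_000020.11
```

The swapped table has the "old name, new name" layout that `bcftools annotate --rename-chrs` expects.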
**** DONE Understand why the new version gives a different result
CLOSED: [2022-12-11 Sun 20:11]
***** DONE Same dbsnp and clinvar versions?
CLOSED: [2022-12-10 Sat 23:02]
Clinvar differs!
  $ bcftools stats clinvar.gz
  clinvar (Alexis)
SN	0	number of samples:	0
SN	0	number of records:	1492828
SN	0	number of no-ALTs:	965
SN	0	number of SNPs:	1338007
SN	0	number of MNPs:	5562
SN	0	number of indels:	144580
SN	0	number of others:	3714
SN	0	number of multiallelic sites:	0
SN	0	number of multiallelic SNP sites:	0
clinvar (new)
SN	0	number of samples:	0
SN	0	number of records:	1493470
SN	0	number of no-ALTs:	965
SN	0	number of SNPs:	1338561
SN	0	number of MNPs:	5565
SN	0	number of indels:	144663
SN	0	number of others:	3716
SN	0	number of multiallelic sites:	0
SN	0	number of multiallelic SNP sites:	0
***** DONE Mettre à jour c

[10.12744]

[16.30966]

tho.
   Alexis's version handles this correctly
| NC_000020.11 | 3234173 | rs3827075 | T         | A,C,G     |                                              |
| NC_000020.11 | 3234173 |    262001 | T         | G         | Conflicting_interpretations_of_pathogenicity |
| NC_000020.11 | 3234173 |   1072511 | T         | TGGCGAAGC | Pathogenic                                   |
| NC_000020.11 | 3234173 |    208613 | TGGCGAAGC | G         | Pathogenic                                   |
| NC_000020.11 | 3234173 |      1312 | TGGCGAAGC | T         | Pathogenic                                   |
****** DONE Check whether isec handles multiallelic sites (chr20): no, could not get it to work
CLOSED: [2022-11-27 Sun 00:37]
******* DONE chr20, picking a pathogenic clinvar variant that is also in dbSNP
CLOSED: [2022-11-27 Sun 00:37]
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools filter dbSNP_common_chr20.vcf.gz -i 'POS=10652589' -o test_dbsnp.vcf.gz
bcftools filter clinvar_chr20.vcf.gz -i 'POS=10652589' -o test_clinvar.vcf.gz
bcftools index test_dbsnp.vcf.gz
bcftools index test_clinvar.vcf.gz
#+end_src
#+RESULTS:
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools isec test_dbsnp.vcf.gz test_clinvar.vcf.gz -p tmp
grep '^[^#]' tmp/0002.vcf
grep '^[^#]' tmp/0003.vcf
#+end_src
#+RESULTS:
Does not work, even with biallelic sites.
Tested by editing test_dbsnp!
Works with one variant per line
****** DONE isec after splitting multiallelic sites: no
CLOSED: [2022-11-27 Sun 00:37]
******* DONE Simple example OK
CLOSED: [2022-11-27 Sun 00:34]
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools filter -i 'POS=10652589' dbSNP_common_chr20.vcf.gz -o dbsnp_mwi.vcf.gz
bcftools filter -i 'POS=10652589' clinvar_chr20.vcf.gz -o clinvar_mwi.vcf.gz
bcftools index -f dbsnp_mwi.vcf.gz
bcftools index -f clinvar_mwi.vcf.gz
bcftools isec dbsnp_mwi.vcf.gz clinvar_mwi.vcf.gz -n=2
#+end_src
#+RESULTS:
Does not work, even with biallelic sites.
Chr 20
With the files from the previous test
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools norm -m -any dbsnp_mwi.vcf.gz -o dbsnp_mwi_norm.vcf.gz
bcftools index dbsnp_mwi_norm.vcf.gz
bcftools isec dbsnp_mwi_norm.vcf.gz clinvar_mwi.vcf.gz -n=2
#+end_src
#+RESULTS:
| NC_000020.11 | 10652589 | G | A | 11 |
| NC_000020.11 | 10652589 | G | C | 11 |
******* TODO On dbSNP chr20: no
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools norm -m -any dbSNP_common_chr20.vcf.gz -o dbSNP_common_chr20_norm.vcf.gz
#+end_src
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools isec -i 'INFO/CLNSIG="Pathogenic"' dbSNP_common_chr20_norm.vcf.gz clinvar_chr20.vcf.gz -p tmp
#+end_src
#+RESULTS:
***** DONE Trial with bedtools intersect
#+begin_src sh
bedtools intersect -a  dbSNP_common.vcf.gz -b clinvar.vcf.gz
#+end_src
$ wc -l intersect.vcf
220206 intersect.vcf
** TODO Dependencies with Nix
*** DONE GATK
CLOSED: [2022-10-21 Fri 21:59]
*** WAIT BioDBHTS
Contribute a pull request
*** DONE BioExtAlign
CLOSED: [2022-10-22 Sat 00:38]
*** WAIT BioBigFile
Check whether the latest kent version can be used
Contribute a pull request
*** HOLD rtg-tools
Convert clinvar to NC names
*** DONE simuscop
CLOSED: [2022-12-30 Fri 22:31]
*** DONE Spip
CLOSED: [2022-12-04 Sun 12:49]
No pull request
*** DONE R + packages
CLOSED: [2022-11-19 Sat 21:05]
*** TODO hap.py
https://github.com/Illumina/hap.py
**** DONE Version without rtgtools, with Python 3
CLOSED: [2023-02-02 Thu 22:15]
Procedure to test
#+begin_src
nix develop .#hap-py
$ genericBuild
#+end_src
1. Remove the call to make_dependencies in cmakelist.txt: everything can be installed with nix
2. Patch Roc.cpp to get numeric_limits (error: 'numeric_limits' is not a member of 'std')
3. Add link flags (trial and error)
set(ZLIB_LIBRARIES -lz -lbz2 -lcurl -lcrypto -llzma)
4. Replace print statements with print() in the Python code and remove a few imports
[nix-shell:~/source]$ sed -i.orig 's/print \"\(.*\)"/print(\1)/' src/python/*.py
**** DONE Serialize JSON to write the output data
CLOSED: [2023-02-17 Fri 19:25]
**** DONE Test on the example
CLOSED: [2023-02-04 Sat 00:25]
#+begin_src sh
$ cd hap.py
$ ../result/bin/hap.py example/happy/PG_NA12878_chr21.vcf.gz       example/happy/NA12878_chr21.vcf.gz       -f example/happy/PG_Conf_chr21.bed.gz       -o test -r example/chr21.fa
#+end_src
#+RESULTS:
| Type  | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score |
| INDEL | ALL    |        8937 |     7839 |     1098 |       11812 |      343 |      3520 |    45 |   283 |      0.877140 |         0.958635 |       0.298002 |        0.916079 |
| INDEL | PASS   |        8937 |     7550 |     1387 |        9971 |      283 |      1964 |    30 |   242 |      0.844803 |         0.964656 |       0.196971 |        0.900760 |
| SNP   | ALL    |       52494 |    52125 |      369 |       90092 |      582 |     37348 |   107 |   354 |      0.992971 |         0.988966 |       0.414554 |        0.990964 |
| SNP   | PASS   |       52494 |    46920 |     5574 |       48078 |      143 |       992 |     8 |    97 |      0.893816 |         0.996963 |       0.020633 |        0.942576 |
**** TODO Version with rtg-tools
**** TODO Get the tests working
***** TODO Attempt 2: from nix develop:
SCHEDULED: <2023-07-25 Tue>
#+begin_src
nix develop .#hap-py
genericBuild
#+end_src
Initially run by hand, but run_tests can now be used
#+begin_src
HCDIR=bin/ ../src/sh/run_tests.sh
#+end_src
- [X] test boost
- [X] multimerge
- [X] hapenum
- [X] fp accuracy
- [X] faulty variant
- [ ] leftshift fails
- [X] other vcf
- [X] chr prefix
- [X] gvcf
- [X] decomp
- [X] contig length
- [X] integration test
- [ ] scmp fails on the type
- [X] giab
- [X] performance
- [ ] quantify fails on the type
- [ ] stratified: fails on the results!
- [X] pg counting
- [ ] sompy: cannot find Strelka in somatic
#+begin_src
phases="buildPhase checkPhase installPhase fixupPhase" genericBuild
#+end_src
**** KILL Reproduce the precisionchallenge performance: beware of HG002 vs HG001!
CLOSED: [2023-04-01 Sat 19:43]
https://www.nist.gov/programs-projects/genome-bottle
***** KILL 0GOOR
CLOSED: [2023-04-01 Sat 19:40]
The problem came from (1) the DNA and (2) the chromosome renaming, which was wrong
****** DONE HG002
CLOSED: [2023-02-17 Fri 19:31]
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score
INDEL    ALL       525466    491355     34111      1156702     57724     605307   9384  25027       0.935084          0.895313        0.523304         0.914766
INDEL   PASS       525466    491355     34111      1156702     57724     605307   9384  25027       0.935084          0.895313        0.523304         0.914766
  SNP    ALL      3365115   3358399      6716      5666020     21995    2284364   4194   1125       0.998004          0.993496        0.403169         0.995745
  SNP   PASS      3365115   3358399      6716      5666020     21995    2284364   4194   1125       0.998004          0.993496        0.403169         0.995745
 TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
                    NaN                     NaN                   1.528276                   2.752637
                    NaN                     NaN                   1.528276                   2.752637
               2.100129                1.473519                   1.581196                   1.795603
               2.100129                1.473519                   1.581196                   1.795603
***** KILL With python2
CLOSED: [2023-02-17 Fri 19:25]
****** KILL with nix
CLOSED: [2023-02-17 Fri 19:25]
conda create -n python2 python=2.7 anaconda
****** KILL with conda
CLOSED: [2023-02-17 Fri 19:25]
******* Gentoo: regex_error on test...
OK with bash!
#+begin_src
anaconda3/bin/conda create --name py2 python=2.7
conda activate py2
conda install -c bioconda hap.py
#+end_src
******** Run the tests.
bin/test_haplotypes must be replaced with test_haplotypes in src/sh/run_tests.sh
#+begin_src sh
 HGREF=../genome/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta HCDIR=~/anaconda3/envs/py2/bin bash src/sh/run_tests.sh
#+end_src
Failure:
test_haplotypes: /opt/conda/conda-bld/work/hap.py-0.3.7/src/c++/lib/tools/Fasta.cpp:81: MMappedFastaFile::MMappedFastaFile(const string&): Assertion `fd != -1' failed.
unknown location(0): fatal error in "testVariantPrimitiveSplitter": signal: SIGABRT (application abort requested)
/opt/conda/conda-bld/work/hap.py-0.3.7/src/c++/test/test_align.cpp(298): last checkpoint
******** Chr21
HGREF=../genome/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta hap.py        example/happy/PG_NA12878_chr21.vcf.gz       example/happy/NA12878_chr21.vcf.gz       -f example/happy/PG_Conf_chr21.bed.gz       -o test
******* Helios
failure
** TODO T2T :T2T:
All the resources are described here
https://github.com/marbl/CHM13
Details on the pipeline
https://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hub_3267197_GCA_009914755.4&c=CP068277.2&g=hub_3267197_hgLiftOver
*** DONE Alignment
CLOSED: [2023-06-26 Mon 19:42]
NXF_OPTS=-D"user.name=${USER}" nextflow run main.nf -profile standard,helios  --input="/Work/Groups/bisonex/data/giab/*_R{1,2}_001.fastq.gz" --id=NA12878-T2T -bg
SCHEDULED: <2023-06-14 Wed>
*** DONE Haplotypecaller
CLOSED: [2023-06-26 Mon 19:42] SCHEDULED: <2023-06-15 Thu>
*** TODO Filters
SCHEDULED: <2023-07-27 Thu>
*** Liftover pipelines
:PROPERTIES:
:ID:       d2280207-3f65-4a31-a291-41fa9a9658c2
:END:
Contains the chain files
** TODO Quality indicators
SCHEDULED: <2023-07-26 Wed>
*** Idea
Raredisease:
- FastQC: many statistics. Not available in Nix
- Mosdepth: computes depth (2x faster than samtools depth). Available in Nix
- MultiQC: only merges the analysis results. Not available in Nix
- Picard's CollectMultipleMetrics, CollectHsMetrics, and CollectWgsMetrics
- Qualimap: alternative to FastQC? Not available in Nix
- Sentieon's WgsMetricsAlgo: proprietary
- TIDDIT's cov: TIDDIT = chromosomal rearrangements
Sarek:
- alignment statistics: samtools stats, mosdepth
- QC: MultiQC
MultiQC: not available in Nix
** TODO check whether normalisation is done
SCHEDULED: <2023-07-26 Wed>
** TODO Add hgvs check
SCHEDULED: <2023-07-26 Wed>
** DONE Execution
CLOSED: [2022-09-13 Tue 21:37]
*** KILL test Bionix
*** KILL Implement execution with Nix?
See https://academic.oup.com/gigascience/article/9/11/giaa121/5987272?login=false
for an example.
Probably simpler to use Nix for environment management and snakemake for execution
No internet access from the cluster
*** DONE nextflow
CLOSED: [2022-09-13 Tue 21:37]
**** TODO Bug scheduler SGE
The job gets killed because the user is not passed correctly to nextflow
***** DONE Force the user at execution
CLOSED: [2023-04-01 Sat 17:57]
NXF_OPTS=-D"user.name=alex"
***** DONE Check whether the problem persists with 22.10.6
CLOSED: [2023-04-01 Sat 18:38] SCHEDULED: <2023-04-01 Sat>
yes
***** KILL Package the user into the program?
Bad idea..
** TODO Preprocessing with nextflow
*** TODO Map to reference
**** TODO Sample ID dans header
/Work/Users/apraga/bisonex/out/63003856_S135/preprocessing/baserecalibrator
*** DONE Mark duplicate
CLOSED: [2022-10-09 Sun 22:30]
*** DONE Recalibrate base quality score
CLOSED: [2022-10-09 Sun 22:30]
** DONE Variant calling with Nextflow
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Haplotype caller
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter variants
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter common snp not clinvar path
CLOSED: [2022-11-07 Mon 23:00]
See [[*common dbSNP not clinvar patho][common dbSNP not clinvar patho]]
*** DONE Filter variant only in consensual sequence
CLOSED: [2022-11-08 Tue 22:23]
*** DONE Filter technical variants
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Use AVX to speed up execution
CLOSED: [2023-04-29 Sat 15:46]
Without it, we get the warning
#+begin_quote
17:28:00.720 INFO  PairHMM - OpenMP multi-threaded AVX-accelerated native PairHMM implementation is not supported
17:28:00.721 INFO  NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/nix/store/cy9ckxqwrkifx7wf02hm4ww1p6lnbxg9-gatk-4.2.4.1/bin/gatk-package-4.2.4.1-local.jar!/com/intel/gkl/native/libgkl_utils.so
17:28:00.733 WARN  NativeLibraryLoader - Unable to load libgkl_utils.so from native/libgkl_utils.so (/Work/Users/apraga/bisonex/out/NA12878_NIST7035/preprocessing/applybqsr/libgkl_utils821485189051585397.so: libgomp.so.1: cannot open shared object file: No such file or directory)
17:28:00.733 WARN  IntelPairHmm - Intel GKL Utils not loaded
17:28:00.733 WARN  PairHMM - ***WARNING: Machine does not have the AVX instruction set support needed for the accelerated AVX PairHmm. Falling back to the MUCH slower LOGLESS_CACHING implementation!
17:28:00.763 INFO  ProgressMeter - Starting traversal
#+end_quote
libgomp.so is provided by gcc, so the module must be loaded
 module load gcc@11.3.0/gcc-12.1.0
** KILL Use subworkflows
CLOSED: [2023-04-02 Sun 18:08]
Our version allows more flexibility
*** KILL Alignment
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
*** KILL Vep
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
vcf_annotate_ensemblvep
** TODO Annotation with nextflow :annotation:
*** KILL VEP : --gene-phenotype ?
CLOSED: [2023-04-18 mar. 18:32]
Discussed with Alexis: databases not up to date
https://www.ensembl.org/info/genome/variation/phenotype/sources_phenotype_documentation.html
*** DONE VEP plugin
CLOSED: [2023-04-18 mar. 18:32]
Clone the git repository containing the plugins
Then use --dir_plugins
*** HOLD Use Alexis's code
*** TODO New version with VEP
Example with --custom
https://www.ensembl.org/info/docs/tools/vep/script/vep_custom.html
**** DONE Add spliceAI
CLOSED: [2023-05-18 Thu 11:02] SCHEDULED: <2023-04-30 Sun>
plugin VEP
***** DONE Download the data
CLOSED: [2023-05-11 Thu 19:01]
Hard to automate, the link is temporary...
***** DONE Plugin
CLOSED: [2023-05-11 Thu 20:16]
***** DONE Split the score into several columns
CLOSED: [2023-05-11 Thu 20:16]
Test with this file, to get one line with an annotation and one line without
#CHROM	POS	ID	REF	ALT
1	9091	.	A	C
1	69091	.	A	C
and
#+begin_src sh
rm -f postvep.tsv* && vep -i testspliceai.vcf.gz -o postvep.tsv --tab  --dir 109 --merged --pick --use_given_ref   --offline  --plugin SpliceAI,snv=spliceai_scores.raw.snv.hg38.vcf.gz,indel=spliceai_scores.raw.indel.hg38.vcf.gz
#+end_src
#+begin_src
$ bgzip postvep.tsv
$ python spliceai.py
$ cat postvep2.tsv
,variation,Location,Allele,Gene,Feature,Feature_type,Consequence,cDNA_position,CDS_position,Protein_position,Amino_acids,Codons,Existing_variation,IMPACT,DISTANCE,STRAND,FLAGS,REFSEQ_MATCH,SOURCE,REFSEQ_OFFSET,SpliceAI_AG,SpliceAI_AL,SpliceAI_DG,SpliceAI_DL
0,1_9091_A/C,1:9091,C,ENSG00000290825,ENST00000456328,Transcript,upstream_gene_variant,-,-,-,-,-,-,MODIFIER,2778,1,-,-,Ensembl,-,,,,
1,1_69091_A/C,1:69091,C,ENSG00000186092,ENST00000641515,Transcript,missense_variant,124,64,22,M/L,Atg/Ctg,-,MODERATE,-,1,-,-,Ensembl,-,0.01,0.00,0.00,0.01
#+end_src
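A minimal sketch of the splitting step, not the actual spliceai.py (which is not shown in these notes). It assumes the VEP tab output carries one combined SpliceAI field of the form SYMBOL|DS_AG|DS_AL|DS_DG|DS_DL|DP_AG|DP_AL|DP_DG|DP_DL, and "-" when there is no annotation; the function name is hypothetical.
#+begin_src python
# Hypothetical helper: split a combined SpliceAI prediction string into
# the four delta-score columns seen in postvep2.tsv.
SCORE_COLUMNS = ["SpliceAI_AG", "SpliceAI_AL", "SpliceAI_DG", "SpliceAI_DL"]


def split_spliceai(pred):
    """Return one entry per delta score (empty string when not annotated)."""
    # Assumption: unannotated rows hold "-" (VEP placeholder) or are empty.
    if pred in ("", "-"):
        return {col: "" for col in SCORE_COLUMNS}
    fields = pred.split("|")
    # fields[0] is the gene symbol, fields[1:5] are the four delta scores;
    # the delta positions (fields[5:]) are dropped.
    return dict(zip(SCORE_COLUMNS, fields[1:5]))
#+end_src
Returning empty strings for unannotated rows reproduces the trailing ",,,," of the first row above.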
Test
cp work/bf/437ae511958509e43072f032f4d495/small.tab.gz tests/vep-spip.tab.gz
cp work/d5/3b1244b5ae83d54409ee0d456e8c55/small_cadd.tab.gz tests/vep-cadd-splice.tab.gz
**** TODO Package Nix spliceAI ?
nix profile install nixpkgs#python3Packages.tensorflow
+ add dependencies ("grep import" or cnad)
**** TODO Add LOEUF and pLI
plugin VEP
**** TODO NMD
**** KILL Add LOEUF
CLOSED: [2023-04-19 mer. 16:32]
plugin VEP
**** DONE Spip
CLOSED: [2023-05-01 Mon 23:07] SCHEDULED: <2023-04-30 Sun>
BED does not seem to work well (a region has to be defined)
VCF: too much information
Caution: several transcripts but identical results. Duplicates are removed
***** DONE separate interpretation + score + confidence interval
CLOSED: [2023-05-01 Mon 23:07] SCHEDULED: <2023-04-30 Sun>
Tests:
in tests/
vep -i 63004925-small.vcf -o postvep.vcf --vcf --fasta genomeRef.fna --dir 109 --merged --pick  --offline --custom ../script/spip_annotation.vcf.gz,SPIP,vcf,exact,0,spipInterp,spipScore,spipConfidence
***** DONE Score
CLOSED: [2023-04-22 Sat 15:30]
**** DONE CADD: replace with the VEP plugin
CLOSED: [2023-05-07 Sun 14:45] SCHEDULED: <2023-05-07 Sun>
***** Test
#+begin_src
vep  -i test.vcf  -o lol.vcf --offline --dir  /Work/Projects/bisonex/data/vep/GRCh38/ --merged --vcf --fasta /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna --plugin CADD,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.snv.tsv.gz,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.indel.tsv.gz  --dir_plugins ../VEP_plugins/ -v
#+end_src
Test
#+begin_src sh
vep --id "1  230710048 230710048 A/G 1"   --offline --dir  /Work/Projects/bisonex/data/vep/GRCh38/ --merged --vcf --fasta /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna --plugin CADD,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.snv.tsv.gz,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.indel.tsv.gz  --hgvsg --plugin pLI --plugin LOEUF -o lol
#+end_src
CSQ=G|missense_variant|MODERATE|AGT|ENSG00000135744|Transcript|ENST00000366667|protein_coding|2/5||||843|776|259|M/T|aTg/aCg|||-1||HGNC|HGNC:333||Ensembl||A|A||1:g.230710048A>G|0.347|-0.277922|
Matches https://www.ensembl.org/Homo_sapiens/Tools/VEP/Results?tl=I7ZsIbrj14P6lD43-9115494
***** DONE Use whole genome
CLOSED: [2023-04-29 Sat 15:46]
***** KILL Rename the chromosomes beforehand ...
CLOSED: [2023-05-01 Mon 09:14] SCHEDULED: <2023-04-30 Sun>
Too slow!
- CADD download: 4h20
- renaming the chromosomes for the SNVs: 6h20
- tabix on the SNVs: job killed after 21h....
***** DONE annotate separately and merge the tables
CLOSED: [2023-05-07 Sun 14:45] SCHEDULED: <2023-05-01 Mon>
NB: CADD could be filtered with tabix to restrict it to our variants
**** DONE clinvar
CLOSED: [2023-04-22 Sat 15:31]
**** KILL Check HGVS results with mutalyzer
CLOSED: [2023-05-01 Mon 09:26]
**** TODO Parallelisation
***** HOLD per chromosome with the VEP workflow
https://github.com/Ensembl/ensembl-vep/blob/release/109/nextflow/workflows/run_vep.nf
***** HOLD With the --fork option
**** DONE Use the nf-core version of VEP
CLOSED: [2023-05-13 Sat 18:27] SCHEDULED: <2023-05-07 Sun>
**** DONE OMIM
CLOSED: [2023-05-08 Mon 15:02] SCHEDULED: <2023-05-01 Mon>
**** TODO Grantham
**** TODO ACMG incidental
**** TODO Gnomad ?
**** DONE Filter after VEP with filter_vep
CLOSED: [2023-04-29 Sat 15:47]
Not tested
*** TODO Compare the annotations on 63003856
**** Rerun the new pipeline
*** HOLD Old version
**** TODO HGVS
**** TODO Filter after VEP
**** TODO OMIM
**** TODO clinvar
**** TODO ACMG incidental
**** TODO Grantham
**** KILL LRG
CLOSED: [2023-04-18 mar. 17:22] SCHEDULED: <2023-04-18 Tue>
Discussed with Alexis: no longer up to date
**** TODO Gnomad
** DONE Port Alexis's version exactly to Helios
CLOSED: [2023-01-14 Sat 17:56]
Branch "prod"
** KILL Test Alexis's version with Nix
CLOSED: [2023-06-14 Wed 22:37]
*** DONE Add clinvar
CLOSED: [2022-11-13 Sun 19:37]
*** DONE Alignment
CLOSED: [2022-11-13 Sun 12:52]
*** DONE Haplotype caller
CLOSED: [2022-11-13 Sun 13:00]
*** KILL Filter
CLOSED: [2023-06-14 Wed 22:37]
- [X] depth
- [ ] common snp not patho
Problem with the list of IDs
**** KILL variant annotation
CLOSED: [2023-06-14 Wed 22:37]
Needs vep
*** KILL Variant calling
CLOSED: [2023-06-14 Wed 22:37]
** STRT Test sarek
#+begin_src sh
 module load apptainer/1.1.8
 nextflow run nf-core/sarek -profile test,singularity --outdir test-sarek
#+end_src
The dependencies are not downloaded correctly, so we extract them by hand
#+begin_src sh
 rg -IN galaxyproject modules  | sed 's/ //g;s/:$//' | sort | uniq > deps.txt
#+end_src
 Manual cleanup
 Then
 #+begin_src sh
 cat deps.txt | xargs -L1 singularity pull
 #+end_src
* Improvement :amelioration:
* Documentation
:PROPERTIES:
:CATEGORY: doc
:END:
** DONE Nix installation procedure + dependencies for the CHU VM
CLOSED: [2023-04-22 Sat 15:27] SCHEDULED: <2023-04-13 Thu>
* Manuscript
:PROPERTIES:
:CATEGORY: manuscript
:END:
* Tests
:PROPERTIES:
:CATEGORY: tests
:END:
** KILL Non-regression: prod version
CLOSED: [2023-05-23 Tue 08:46]
*** DONE ID common snp
CLOSED: [2022-11-19 Sat 21:36]
#+begin_src
$ wc -l ID_of_common_snp.txt
23194290 ID_of_common_snp.txt
$ wc -l /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp.txt
23194290 /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp.txt
#+end_src
*** DONE ID common snp not clinvar patho
CLOSED: [2022-12-11 Sun 20:11]
**** DONE Checking the problem
CLOSED: [2022-12-11 Sun 16:30]
On J:
21155134 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt.ref
"Non-regression" version
21155076 database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt
New version
23193391 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt
After removing duplicates
$ sort database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt | uniq > old.txt
$ wc -l old.txt
21107097 old.txt
$ sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt | uniq > new.txt
$ wc -l new.txt
21174578 new.txt
$ sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt.ref | uniq > ref.txt
$ wc -l ref.txt
21107155 ref.txt
Looking at the difference
 comm -23 ref.txt old.txt
rs1052692
rs1057518973
rs1057518973
rs11074121
rs112848754
rs12573787
rs145033890
rs147889095
rs1553904159
rs1560294695
rs1560296615
rs1560310926
rs1560325547
rs1560342418
rs1560356225
rs1578287542
...
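For reference, a minimal sketch of what =comm -23 ref.txt old.txt= computes, namely the IDs present in the first file but absent from the second. The function name is hypothetical, and both files are loaded into memory, which may be heavy for lists of about 21 million IDs.
#+begin_src python
def only_in_first(first_path, second_path):
    """IDs present in first_path but absent from second_path, sorted."""
    with open(first_path) as handle:
        first = {line.strip() for line in handle if line.strip()}
    with open(second_path) as handle:
        second = {line.strip() for line in handle if line.strip()}
    # Set difference; unlike comm, the inputs do not need to be sorted.
    return sorted(first - second)
#+end_src
Because sets are used, an ID can never be reported twice, which avoids any dependence on the sort order of the inputs.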
Looking up the first one
bcftools query -i 'ID="rs1052692"' database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 1619351 C A,T
It is indeed pathogenic...
$ bcftools query -i 'POS=1619351' database/clinvar/clinvar.vcf.gz -f '%CHROM %POS %REF %ALT %INFO/CLNSIG\n'
19 1619351 C T Conflicting_interpretations_of_pathogenicity
Checking all the others
$ comm -23 ref.txt old.txt > tocheck.txt
Generate the regions to check (chromosome number:position)
$ bcftools query -i 'ID=@tocheck.txt' database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM\t%POS\n' > tocheck.pos
Generate the reverse mapping (chromosome number -> NC)
$ awk ' { t = $1; $1 = $2; $2 = t; print; } ' database/RefSeq/refseq_to_number_only_consensual.txt  > mapping.txt
Remap clinvar
$ bcftools annotate --rename-chrs mapping.txt database/clinvar/clinvar.vcf.gz -o clinvar_remapped.vcf.gz
$ tabix clinvar_remapped.vcf.gz
Finally, look up the classification in clinvar
$ bcftools query -R tocheck.pos clinvar_remapped.vcf.gz -f '%CHROM %POS %INFO/CLNSIG\n'
$ bcftools query -R tocheck.pos database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM %POS %ID \n' | grep '^NC'
#+RESULTS:
**** DONE Understand why the new version gives a different result
CLOSED: [2022-12-11 Sun 20:11]
***** DONE Same dbsnp and clinvar version?
CLOSED: [2022-12-10 Sat 23:02]
Different clinvar!
  $ bcftools stats clinvar.gz
  clinvar (Alexis)
SN	0	number of samples:	0
SN	0	number of records:	1492828
SN	0	number of no-ALTs:	965
SN	0	number of SNPs:	1338007
SN	0	number of MNPs:	5562
SN	0	number of indels:	144580
SN	0	number of others:	3714
SN	0	number of multiallelic sites:	0
SN	0	number of multiallelic SNP sites:	0
clinvar (new)
SN	0	number of samples:	0
SN	0	number of records:	1493470
SN	0	number of no-ALTs:	965
SN	0	number of SNPs:	1338561
SN	0	number of MNPs:	5565
SN	0	number of indels:	144663
SN	0	number of others:	3716
SN	0	number of multiallelic sites:	0
SN	0	number of multiallelic SNP sites:	0
***** DONE Mettre à jour c

Replacement in projects/bisonex.org at line 29 [3.35]

B:BD[17.8134] → [17.8134:8271]

B:BD[17.8271] → [12.8222:16277]

_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL          413       246       167          751       289        215      2     98       0.595642          0.460821        0.286285         0.519629                     NaN                     NaN                   2.428571                   2.465116
INDEL   PASS          413       246       167          751       289        215      2     98       0.595642          0.460821        0.286285         0.519629                     NaN                     NaN                   2.428571                   2.465116
  SNP    ALL        15883     15479       404        23597      5277       2841     46     44       0.974564          0.745760        0.120397         0.844947                3.017198                 2.85705                   5.560099                   2.114633
  SNP   PASS        15883     15479       404        23597      5277       2841     46     44       0.974564          0.745760        0.120397         0.844947                3.017198                 2.85705                   5.560099                   2.114633
******* DONE Check that no filter other than PASS remains
CLOSED: [2023-07-08 Sat 15:19]
#+begin_src
$ zgrep -c 'PASS' HG001_GRCh38_1_22_v4_lifted_merged.vcf.gz
3730505
$ zgrep -c '^chr' HG001_GRCh38_1_22_v4_lifted_merged.vcf.gz
3730506
#+end_src
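=zgrep -c 'PASS'= counts every line containing the string, wherever it appears, so the two counts above are only an approximate check. A stricter sketch, assuming a standard VCF layout with FILTER as the 7th tab-separated column, tallies that column directly (the function name is hypothetical):
#+begin_src python
import gzip
from collections import Counter


def count_filters(vcf_path):
    """Tally the FILTER column (7th field) over all records of a gzipped VCF."""
    counts = Counter()
    with gzip.open(vcf_path, "rt") as vcf:
        for line in vcf:
            if line.startswith("#"):
                continue  # skip meta-information and header lines
            fields = line.rstrip("\n").split("\t")
            counts[fields[6]] += 1
    return counts


# Usage (path taken from the commands above):
# print(count_filters("HG001_GRCh38_1_22_v4_lifted_merged.vcf.gz"))
#+end_src
If only PASS remains, the returned counter has a single key.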
****** TODO 1/4 of SNPs missing?
SCHEDULED: <2023-07-08 Sat>
******* DONE Check with Julia whether these really are FPs: 61/5277 are not
CLOSED: [2023-07-09 Sun 12:09]
******* TODO Examine the FPs
******* TODO Test one FP
  2 │ chr1        608765  A           G           ./.:.:.:.:NOCALL:nocall:.  1/1:FP:.:ti:SNP:homalt:188
  liftDown UCSC: nothing in GIAB: true FP
 3 │ chr1        762943  A           G           ./.:.:.:.:NOCALL:nocall:.  1/1:FP:.:ti:SNP:homalt:287
 4 │ chr1        762945  A           T           ./.:.:.:.:NOCALL:nocall:.  1/1:FP:.:tv:SNP:homalt:287
 Complex rearrangements? Not in the gene in HG38
******* DONE Most FPs (4705/5566) are homozygous: reference error?
CLOSED: [2023-07-12 Wed 21:10] SCHEDULED: <2023-07-09 Sun>
The first 2 variants actually show the difference between T2T and GRCh38
Alignment error?
******** KILL rerun the alignment
CLOSED: [2023-07-09 Sun 17:36]
******** DONE check that the reads are identical between hg38 and T2T: yes
CLOSED: [2023-07-09 Sun 16:36]
T2T  	chr1:608765
38   	chr1:1180168-1180168 (
SRR14724513.24448214
SRR14724513.24448214
******* TODO Remove the FPs that correspond to a change in the genome
SCHEDULED: <2023-07-09 Sun>
Conditions:
- no variation at the position in GRCh38
- homozygous variant
- the variant in T2T corresponds to the base-pair change GRCh38 -> T2T
  for SNPs:
  alt_T2T[i] = DNA_GRC38[j]
  with i the position in T2T and j the position in GRCh38
  Note: defining an ID is not reliable because the variants may be modified by happy!
  Algorithm
  1. For each FP, it is a "false" FP if
     - REF in hg38 == ALT in T2T
     - and REF in hg38 != REF in T2T
     - and the variant is homozygous
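The three conditions above can be sketched as a single predicate. This is only an illustration for SNPs: it assumes the hg38 base at the lifted-over position has already been looked up (that step is not shown), and the function and argument names are hypothetical.
#+begin_src python
def is_reference_artifact(ref_hg38, ref_t2t, alt_t2t, genotype):
    """True if an FP SNP called on T2T only reflects a GRCh38/T2T base difference."""
    # Accept phased ("1|1") and unphased ("1/1") genotypes.
    alleles = genotype.replace("|", "/").split("/")
    # Homozygous for an ALT allele: one distinct allele, neither reference nor missing.
    homozygous_alt = len(set(alleles)) == 1 and alleles[0] not in ("0", ".")
    return ref_hg38 == alt_t2t and ref_hg38 != ref_t2t and homozygous_alt
#+end_src
For instance, a 1/1 call A>G on T2T is flagged when the corresponding hg38 base is G, and kept as a genuine FP when the hg38 base is A.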
******* DONE Check a few variants in IGV
CLOSED: [2023-07-09 Sun 17:36]
******* KILL Distribution of the FPs: clusters?
CLOSED: [2023-07-09 Sun 17:36]
******* TODO Pangenome methodology
***** KILL Email Yannis
CLOSED: [2023-07-08 Sat 10:44]
***** DONE Email GIAB about the T2T version
CLOSED: [2023-07-07 Fri 18:37]
**** DONE NA12878 :na12878:hg38:
CLOSED: [2023-06-30 Fri 22:30]
***** DONE Discussion with Alexis: email
CLOSED: [2023-03-29 Wed 22:40]
As a reminder, with patient NA12878 and a hap.py comparison against the Genome In A Bottle VCF ("gold" standard), we had
- sensitivity (= recall): 71% for indels, 85% for SNPs
- precision (= PPV): 69% and 97% respectively
| Type  | TRUTH |    TP |   FN | QUERY |   FP |  UNK | FP.gt | FP.al |   Recall | Precision |
| INDEL |  4871 |  3461 | 1410 |  7048 | 1554 | 1987 |   193 |   346 | 0.710532 |  0.692946 |
| SNP   | 46032 | 39369 | 6663 | 44600 | 1186 | 4041 |   304 |    30 | 0.855253 |  0.970759 |
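A minimal sketch of how the Recall and Precision columns follow from the counts, assuming the hap.py convention that query calls outside the confident regions (UNK) are excluded from precision. The function name is hypothetical.
#+begin_src python
def happy_metrics(truth_tp, truth_fn, query_total, query_fp, query_unk):
    """Recall, precision and F1 from hap.py-style counts."""
    # Query true positives: what remains once FP and UNK calls are removed.
    query_tp = query_total - query_fp - query_unk
    recall = truth_tp / (truth_tp + truth_fn)
    precision = query_tp / (query_tp + query_fp)
    f1 = 2 * recall * precision / (recall + precision)
    return recall, precision, f1


# INDEL row of the table above
recall, precision, f1 = happy_metrics(3461, 1410, 7048, 1554, 1987)
#+end_src
With the INDEL row this gives a recall of about 0.7105 and a precision of about 0.6929, in line with the table.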
The statistics on genomes are much better (cf. precisionFDA challenge).
For exomes, a paper [1] reports better statistics on this patient with BWA and GATK, but they have fewer variants (we have almost a factor of 2!).
I suspect we are not working on the same capture regions (I could not get hold of their .bed)
| Exome | Type  |    TP |   FP |  FN | Sensitivity | Precision | F-Score |   FDR |
|     1 | SNV   | 23689 | 1397 | 613 |       0.975 |     0.944 |   0.959 | 0.057 |
|     2 | SNV   | 23946 |  865 | 356 |       0.985 |     0.965 |   0.975 | 0.036 |
|     1 | indel |  1254 |   72 |  75 |       0.944 |     0.946 |   0.945 | 0.054 |
|     2 | indel |  1309 |   10 |  20 |       0.985 |     0.992 |   0.989 | 0.008 |
To try to improve the statistics:
- The genome version (GRCh38 vs GRCh38.p13) makes almost no difference
- Disabling dbSNP changes strictly nothing for variant calling
I explored the false negatives:
- the vast majority are simply not seen (it is not a haploid/genotype problem)
- the distribution across chromosomes is fairly homogeneous, except on chromosome 6 ()
- most are in the 5' and 3' UTRs (according to Best refseq)
Conclusion: I plan to stop here for the validation of variant calling, for lack of time. We would need to dig further to understand why some variants are not seen by GATK, but this is not the majority. In any case, I can justify a first analysis for the thesis.
Is that OK with you?
[1]
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2928-9
Results here https://static-content.springer.com/esm/art%3A10.1186%2Fs12859-019-2928-9/MediaObjects/12859_2019_2928_MOESM8_ESM.pdf
***** DONE Comparison
CLOSED: [2023-03-04 Sat 11:14]
HGREF=/Work/Groups/bisonex/data-alexis-reference/genome/GRCh38_latest_genomic.fna ./result/bin/hap.py /Work/Groups/bisonex/NA12878/HG001_GRCh38_1_22_v4.2.1_benchmark_renamed.vcf.gz script/files/vcf/NA12878_NIST7035_vep_annot.vcf -f /Work/Groups/bisonex/NA12878/HG001_GRCh38_1_22_v4.2.1_benchmark.bed -o test
na1878.slurm
#+begin_src slurm
#!/bin/bash
#SBATCH -c 4
#SBATCH -p smp
#SBATCH --time=01:00:00
#SBATCH --mem=32G
module load nix/2.11.0
export HGREF=/Work/Groups/bisonex/data-alexis-reference/genome/GRCh38_latest_genomic.fna
dir=/Work/Groups/bisonex/data/NA12878/GRCh38
hap.py ${dir}/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz script/files/vcf/NA12878_NIST7035.vcf -f ${dir}/HG001_GRCh38_1_22_v4.2.1_benchmark.bed -o test
#+end_src
****** KILL far too many false negatives
CLOSED: [2023-02-17 Fri 19:37]
******* DONE Test 1: vep annot: far too many false negatives
CLOSED: [2023-02-06 lun. 13:40]
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL       276768       274    276494         1500       257        968     26     15       0.000990          0.516917        0.645333         0.001976                     NaN                     NaN                   1.483361                   6.129187
INDEL   PASS       276768       274    276494         1500       257        968     26     15       0.000990          0.516917        0.645333         0.001976                     NaN                     NaN                   1.483361                   6.129187
  SNP    ALL      1937706      1193   1936513         3338       106       2037     11      2       0.000616          0.918524        0.610246         0.001231                  2.0785                1.861183                   1.539064                   2.703663
  SNP   PASS      1937706      1193   1936513         3338       106       2037     11      2       0.000616          0.918524        0.610246         0.001231                  2.0785

[17.8134]

[12.16277]

_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL          413       246       167          751       289        215      2     98       0.595642          0.460821        0.286285         0.519629                     NaN                     NaN                   2.428571                   2.465116
INDEL   PASS          413       246       167          751       289        215      2     98       0.595642          0.460821        0.286285         0.519629                     NaN                     NaN                   2.428571                   2.465116
  SNP    ALL        15883     15479       404        23597      5277       2841     46     44       0.974564          0.745760        0.120397         0.844947                3.017198                 2.85705                   5.560099                   2.114633
  SNP   PASS        15883     15479       404        23597      5277       2841     46     44       0.974564          0.745760        0.120397         0.844947                3.017198                 2.85705                   5.560099                   2.114633
******* DONE Check that no filter other than PASS remains
CLOSED: [2023-07-08 Sat 15:19]
#+begin_src
$ zgrep -c 'PASS' HG001_GRCh38_1_22_v4_lifted_merged.vcf.gz
3730505
$ zgrep -c '^chr' HG001_GRCh38_1_22_v4_lifted_merged.vcf.gz
3730506
#+end_src
****** TODO 1/4 of SNPs missing?
SCHEDULED: <2023-07-08 Sat>
******* DONE Check with Julia whether these really are FPs: 61/5277 are not
CLOSED: [2023-07-09 Sun 12:09]
******* HOLD Examine the FPs
******* HOLD Test one FP
  2 │ chr1        608765  A           G           ./.:.:.:.:NOCALL:nocall:.  1/1:FP:.:ti:SNP:homalt:188
  liftDown UCSC: nothing in GIAB: true FP
 3 │ chr1        762943  A           G           ./.:.:.:.:NOCALL:nocall:.  1/1:FP:.:ti:SNP:homalt:287
 4 │ chr1        762945  A           T           ./.:.:.:.:NOCALL:nocall:.  1/1:FP:.:tv:SNP:homalt:287
 Complex rearrangements? Not in the gene in HG38
******* DONE Most FPs (4705/5566) are homozygous: reference error?
CLOSED: [2023-07-12 Wed 21:10] SCHEDULED: <2023-07-09 Sun>
The first 2 variants actually show the difference between T2T and GRCh38
Alignment error?
******** KILL rerun the alignment
CLOSED: [2023-07-09 Sun 17:36]
******** DONE check that the reads are identical between hg38 and T2T: yes
CLOSED: [2023-07-09 Sun 16:36]
T2T  	chr1:608765
38   	chr1:1180168-1180168 (
SRR14724513.24448214
SRR14724513.24448214
******* TODO Remove the FPs that correspond to a change in the genome
SCHEDULED: <2023-07-09 Sun>
Conditions:
- no variation at the position in GRCh38
- homozygous variant
- the variant in T2T corresponds to the base-pair change GRCh38 -> T2T
  for SNPs:
  alt_T2T[i] = DNA_GRC38[j]
  with i the position in T2T and j the position in GRCh38
  Note: defining an ID is not reliable because the variants may be modified by happy!
  Algorithm
  1. For each FP, it is a "false" FP if
     - REF in hg38 == ALT in T2T
     - and REF in hg38 != REF in T2T
     - and the variant is homozygous
******* DONE Check a few variants in IGV
CLOSED: [2023-07-09 Sun 17:36]
******* KILL Distribution of the FPs: clusters?
CLOSED: [2023-07-09 Sun 17:36]
******* TODO Pangenome methodology
***** KILL Email Yannis
CLOSED: [2023-07-08 Sat 10:44]
***** DONE Email GIAB about the T2T version
CLOSED: [2023-07-07 Fri 18:37]
**** DONE NA12878 :na12878:hg38:
CLOSED: [2023-06-30 Fri 22:30]
***** DONE Discussion with Alexis: email
CLOSED: [2023-03-29 Wed 22:40]
As a reminder, with patient NA12878 and a hap.py comparison against the Genome In A Bottle VCF ("gold" standard), we had
- sensitivity (= recall): 71% for indels, 85% for SNPs
- precision (= PPV): 69% and 97% respectively
| Type  | TRUTH |    TP |   FN | QUERY |   FP |  UNK | FP.gt | FP.al |   Recall | Precision |
| INDEL |  4871 |  3461 | 1410 |  7048 | 1554 | 1987 |   193 |   346 | 0.710532 |  0.692946 |
| SNP   | 46032 | 39369 | 6663 | 44600 | 1186 | 4041 |   304 |    30 | 0.855253 |  0.970759 |
The statistics on genomes are much better (cf. precisionFDA challenge).
For exomes, a paper [1] reports better statistics on this patient with BWA and GATK, but they have fewer variants (we have almost a factor of 2!).
I suspect we are not working on the same capture regions (I could not get hold of their .bed)
| Exome | Type  |    TP |   FP |  FN | Sensitivity | Precision | F-Score |   FDR |
|     1 | SNV   | 23689 | 1397 | 613 |       0.975 |     0.944 |   0.959 | 0.057 |
|     2 | SNV   | 23946 |  865 | 356 |       0.985 |     0.965 |   0.975 | 0.036 |
|     1 | indel |  1254 |   72 |  75 |       0.944 |     0.946 |   0.945 | 0.054 |
|     2 | indel |  1309 |   10 |  20 |       0.985 |     0.992 |   0.989 | 0.008 |
To try to improve the statistics:
- The genome version (GRCh38 vs GRCh38.p13) makes almost no difference
- Disabling dbSNP changes strictly nothing for variant calling
I explored the false negatives:
- the vast majority are simply not seen (it is not a haploid/genotype problem)
- the distribution across chromosomes is fairly homogeneous, except on chromosome 6 ()
- most are in the 5' and 3' UTRs (according to Best refseq)
Conclusion: I plan to stop here for the validation of variant calling, for lack of time. We would need to dig further to understand why some variants are not seen by GATK, but this is not the majority. In any case, I can justify a first analysis for the thesis.
Is that OK with you?
[1]
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2928-9
Results here https://static-content.springer.com/esm/art%3A10.1186%2Fs12859-019-2928-9/MediaObjects/12859_2019_2928_MOESM8_ESM.pdf
***** DONE Comparison
CLOSED: [2023-03-04 Sat 11:14]
HGREF=/Work/Groups/bisonex/data-alexis-reference/genome/GRCh38_latest_genomic.fna ./result/bin/hap.py /Work/Groups/bisonex/NA12878/HG001_GRCh38_1_22_v4.2.1_benchmark_renamed.vcf.gz script/files/vcf/NA12878_NIST7035_vep_annot.vcf -f /Work/Groups/bisonex/NA12878/HG001_GRCh38_1_22_v4.2.1_benchmark.bed -o test
na1878.slurm
#+begin_src slurm
#!/bin/bash
#SBATCH -c 4
#SBATCH -p smp
#SBATCH --time=01:00:00
#SBATCH --mem=32G
module load nix/2.11.0
export HGREF=/Work/Groups/bisonex/data-alexis-reference/genome/GRCh38_latest_genomic.fna
dir=/Work/Groups/bisonex/data/NA12878/GRCh38
hap.py ${dir}/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz script/files/vcf/NA12878_NIST7035.vcf -f ${dir}/HG001_GRCh38_1_22_v4.2.1_benchmark.bed -o test
#+end_src
****** KILL far too many false negatives
CLOSED: [2023-02-17 Fri 19:37]
******* DONE Test 1: vep annot: far too many false negatives
CLOSED: [2023-02-06 lun. 13:40]
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL       276768       274    276494         1500       257        968     26     15       0.000990          0.516917        0.645333         0.001976                     NaN                     NaN                   1.483361                   6.129187
INDEL   PASS       276768       274    276494         1500       257        968     26     15       0.000990          0.516917        0.645333         0.001976                     NaN                     NaN                   1.483361                   6.129187
  SNP    ALL      1937706      1193   1936513         3338       106       2037     11      2       0.000616          0.918524        0.610246         0.001231                  2.0785                1.861183                   1.539064                   2.703663
  SNP   PASS      1937706      1193   1936513         3338       106       2037     11      2       0.000616          0.918524        0.610246         0.001231                  2.0785

Replacement in projects/bisonex.org at line 35 [3.35]

B:BD[4.68004] → [4.68004:76364]

B:BD[4.76364] → [12.16446:32830]

∅:D[12.32830] → [8.33004:34581]

B:BD[8.33004] → [8.33004:34581]

∅:D[8.34581] → [14.21608:52631]

B:BD[14.21608] → [14.21608:52631]

6}'
0.89370.9621
indel
$ zcat NA12878.non_snp_roc.tsv.gz  | tail -n 1 | awk '{print $7 $6}'
0.75980.7445
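The awk call prints =$7= and =$6= without a separator, so =0.89370.9621= reads as 0.8937 followed by 0.9621. A minimal Python sketch of the same extraction is given here; it assumes column 6 is precision and column 7 is sensitivity (consistent with the recall and precision in the happy summary), and the function name and sample rows are hypothetical.

#+begin_src python
def last_roc_point(lines):
    """Return (sensitivity, precision) from the last data row of a ROC table.

    Assumption: whitespace/tab-separated columns, with precision in column 6
    and sensitivity in column 7 (1-based), as implied by the awk command.
    """
    rows = [ln.split() for ln in lines if ln.strip() and not ln.startswith("#")]
    last = rows[-1]
    return float(last[6]), float(last[5])


# Illustrative rows: only the last precision/sensitivity pair comes from the
# notes (0.9621 / 0.8937); header names and all other values are placeholders.
sample = [
    "#score\ttp_baseline\tfp\ttp_call\tfn\tprecision\tsensitivity\tf_measure\n",
    "10.0\t1\t0\t1\t9\t1.0000\t0.1000\t0.1818\n",
    "0.0\t9\t1\t9\t1\t0.9621\t0.8937\t0.9266\n",
]
sensitivity, precision = last_roc_point(sample)
print(sensitivity, precision)  # two values separated by a space
#+end_src

On the real file, the lines could come from =gzip.open("NA12878.snp_roc.tsv.gz", "rt")=.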
compareNA12878-giab/happy/NA12878.summary.csv
| Type  | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
|-------+--------+-------------+----------+----------+-------------+----------+-----------+-------+-------+---------------+------------------+----------------+-----------------+------------------------+------------------------+---------------------------+---------------------------|
| INDEL | ALL    |        4871 |     3678 |     1193 |        7036 |     1299 |      2011 |   208 |   217 |      0.755081 |         0.741493 |       0.285816 |        0.748225 |                        |                        |        1.6174985978687606 |        2.5240506329113925 |
| INDEL | PASS   |        4871 |     3678 |     1193 |        7036 |     1299 |      2011 |   208 |   217 |      0.755081 |         0.741493 |       0.285816 |        0.748225 |                        |                        |        1.6174985978687606 |        2.5240506329113925 |
| SNP   | ALL    |       46032 |    41138 |     4894 |       47694 |     1622 |      4930 |   362 |    31 |      0.893683 |         0.962071 |       0.103367 |        0.926617 |      2.529551552318896 |     2.4124463519313304 |        1.6206857273037931 |        1.6888675840288743 |
| SNP   | PASS   |       46032 |    41138 |     4894 |       47694 |     1622 |      4930 |   362 |    31 |      0.893683 |         0.962071 |       0.103367 |        0.926617 |      2.529551552318896 |     2.4124463519313304 |        1.6206857273037931 |         1.688867584028874 |
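As a sanity check, the METRIC columns can be recomputed from the count columns. The sketch below assumes that the query true positives equal QUERY.TOTAL - QUERY.UNK - QUERY.FP (this column is not shown in the summary, but the assumption reproduces the values in the table); the function name is hypothetical.

#+begin_src python
def happy_metrics(truth_tp, truth_fn, query_total, query_fp, query_unk):
    """Recompute recall, precision, F1 and Frac_NA from summary counts."""
    # Assumption: query-side true positives are what remains once unknown
    # and false-positive calls are removed from the query total.
    query_tp = query_total - query_unk - query_fp
    recall = truth_tp / (truth_tp + truth_fn)
    precision = query_tp / (query_tp + query_fp)
    f1 = 2 * precision * recall / (precision + recall)
    frac_na = query_unk / query_total
    return {"recall": recall, "precision": precision, "f1": f1, "frac_na": frac_na}


# Counts copied from the SNP and INDEL rows of the table above.
snp = happy_metrics(41138, 4894, 47694, 1622, 4930)
indel = happy_metrics(3678, 1193, 7036, 1299, 2011)
for name, m in (("SNP", snp), ("INDEL", indel)):
    print(name, {k: round(v, 6) for k, v in m.items()})
#+end_src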
***** KILL Results without trimming
CLOSED: [2023-06-25 Sun 15:53] SCHEDULED: <2023-06-26 Mon>
***** DONE Redo: HiSeq4000 + Agilent SureSelect + "ready-to-use" genome
CLOSED: [2023-06-30 Fri 22:08] SCHEDULED: <2023-06-25 Sun>
#+begin_src
nextflow run workflows/compareVCF.nf -profile standard,helios --outdir=out/HG001-SRX11061486_SRR14724513-GRCh38 --query=out/HG001-SRX11061486_SRR14724513-GRCh38/callVariant/haplotypecaller/HG001-SRX11061486_SRR14724513-GRCh38.vcf.gz --compare=vcfeval,happy -lib lib --capture=capture/Agilent_SureSelect_All_Exons_v7_hg38_Regions.bed  --id=HG001
#+end_src
Better results!
| Type  | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL    |         549 |      489 |       60 |         899 |       64 |       340 |     8 |    17 |       0.89071 |          0.88551 |       0.378198 |        0.888102 |                        |                        |          1.86096256684492 |         2.247272727272727 |
| INDEL | PASS   |         549 |      489 |       60 |         899 |       64 |       340 |     8 |    17 |       0.89071 |          0.88551 |       0.378198 |        0.888102 |                        |                        |          1.86096256684492 |         2.247272727272727 |
| SNP   | ALL    |       21973 |    21462 |      511 |       26285 |      563 |      4263 |    68 |    16 |      0.976744 |         0.974435 |       0.162184 |        0.975588 |      3.007110300820419 |       2.78468624064479 |        1.5918102430965306 |        1.8161449399656946 |
| SNP   | PASS   |       21973 |    21462 |      511 |       26285 |      563 |      4263 |    68 |    16 |      0.976744 |         0.974435 |       0.162184 |        0.975588 |      3.007110300820419 |       2.78468624064479 |        1.5918102430965306 |        1.8161449399656946 |
***** KILL Use other raw data?
CLOSED: [2023-06-25 Sun 15:58]
https://zenodo.org/record/3597727
The capture is also in GRCh37 (hg19). Would be interesting, but no time.
***** KILL Compare with UCSC liftOver
CLOSED: [2023-06-26 Mon 19:02] SCHEDULED: <2023-06-25 Sun>
Picard LiftOverIntervalList is based on UCSC liftOver.
But then we would not get the difference we see for NA12878...
**** TODO HG002 :hg002:hg38:
SCHEDULED: <2023-07-14 Fri>
#+begin_src
    NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/giabFastq.nf -profile standard,helios
    NXF_OPTS=-D"user.name=${USER}" nextflow run main.nf -profile standard,helios -resume --input="/Work/Groups/bisonex/data/giab/GRCh38/HG002_{1,2}.fq.gz" --test.id=HG002
# Only the capture file differs. Results are better using the capture file given by Agilent, stored in data/
    NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/compareVCF.nf -profile standard,helios -resume --outdir=compareHG002 --test.id=HG002 --test.query=out/HG002_1/variantCalling/haplotypecaller/HG002_1.vcf.gz  --test.compare=vcfeval,happy --test.capture=data/AgilentSureSelectv05_hg38.bed
#+end_src
***** DONE Poor results
CLOSED: [2023-04-14 Fri 09:42]
with vcfeval
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    0.000              24585          24390      10060      39415     0.7080       0.3841     0.4980
     None              24585          24390      10060      39415     0.7080       0.3841     0.4980
Is the variantCalling output the one from happy???
Rerunning...
***** DONE Check that the VCF is in hg38
CLOSED: [2023-04-12 Wed 10:33] SCHEDULED: <2023-04-12 Wed>
***** KILL Capture in hg19?
CLOSED: [2023-04-13 Thu 09:46] SCHEDULED: <2023-04-12 Wed>
***** KILL Really a capture file, or a region of interest?
CLOSED: [2023-04-13 Thu 09:45] SCHEDULED: <2023-04-12 Wed>
"target region" +/- 50bp
[[https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/OsloUniversityHospital_Exome_GATK_jointVC_11242015/README.txt][README]]
 list file describing the variant calling regions (target regions extended with 50 bp on each end)
***** DONE .bed provided by Agilent: very poor sensitivity
CLOSED: [2023-04-13 Thu 09:46] SCHEDULED: <2023-04-13 Thu>
Agilent SureSelect Human All Exon V5 kit
Available in hg38
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    0.000              19653          19501       6410      21657     0.7526       0.4757     0.5830
     None              19653          19501       6410      21657     0.7526       0.4757     0.5830
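For reference, the three vcfeval metrics follow directly from the four counts in each row. A minimal sketch, with a hypothetical function name, using the counts from the table above:

#+begin_src python
def vcfeval_metrics(tp_baseline, tp_call, fp, fn):
    """Precision, sensitivity and F-measure from vcfeval summary counts."""
    # Precision is computed on the call side, sensitivity on the baseline side.
    precision = tp_call / (tp_call + fp)
    sensitivity = tp_baseline / (tp_baseline + fn)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return precision, sensitivity, f_measure


# Counts from the Agilent .bed run above.
precision, sensitivity, f_measure = vcfeval_metrics(19653, 19501, 6410, 21657)
print(round(precision, 4), round(sensitivity, 4), round(f_measure, 4))
#+end_src

With under half of the baseline variants recovered, the low F-measure here is driven by sensitivity rather than precision.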
***** DONE Sort by name with samtools sort: good results
CLOSED: [2023-04-14 Fri 09:25] SCHEDULED: <2023-04-13 Thu>
With the capture file provided by GIAB
vcfeval
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    5.000              57443          57032        984       6557     0.9830       0.8975     0.9383
     None              57457          57046       1009       6543     0.9826       0.8978     0.9383
Happy
| Type  | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
|-------+--------+-------------+----------+----------+-------------+----------+-----------+-------+-------+---------------+------------------+----------------+-----------------+------------------------+------------------------+---------------------------+---------------------------|
| INDEL | ALL    |        6150 |     5007 |     1143 |        6978 |      556 |      1346 |   151 |   168 |      0.814146 |         0.901278 |       0.192892 |          0.8555 |                        |                        |        1.5434221840068787 |        1.9467178175618074 |
| INDEL | PASS   |        6150 |     5007 |     1143 |        6978 |      556 |      1346 |   151 |   168 |      0.814146 |         0.901278 |       0.192892 |          0.8555 |                        |                        |        1.5434221840068787 |        1.9467178175618074 |
| SNP   | ALL    |       57818 |    52464 |     5354 |       56016 |      500 |      3046 |    90 |    30 |      0.907399 |         0.990561 |       0.054377 |        0.947158 |     2.4892012548262548 |      2.426824047458871 |        1.5904527117884357 |        1.6107795598657217 |
| SNP   | PASS   |       57818 |    52464 |     5354 |       56016 |      500 |      3046 |    90 |    30 |      0.907399 |         0.990561 |       0.054377 |        0.947158 |     2.4892012548262548 |      2.426824047458871 |        1.5904527117884357 |        1.6107795598657217 |
***** DONE Agilent capture slightly better than the one provided by GIAB (padding?)
CLOSED: [2023-04-14 Fri 09:48]
GIAB:
vcfeval
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    5.000              57443          57032        984       6557     0.9830       0.8975     0.9383
     None              57457          57046       1009       6543     0.9826       0.8978     0.9383
Happy
| Type  | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
|-------+--------+-------------+----------+----------+-------------+----------+-----------+-------+-------+---------------+------------------+----------------+-----------------+------------------------+------------------------+---------------------------+---------------------------|
| INDEL | ALL    |        6150 |     5007 |     1143 |        6978 |      556 |      1346 |   151 |   168 |      0.814146 |         0.901278 |       0.192892 |          0.8555 |                        |                        |        1.5434221840068787 |        1.9467178175618074 |
| INDEL | PASS   |        6150 |     5007 |     1143 |        6978 |      556 |      1346 |   151 |   168 |      0.814146 |         0.901278 |       0.192892 |          0.8555 |                        |                        |        1.5434221840068787 |        1.9467178175618074 |
| SNP   | ALL    |       57818 |    52464 |     5354 |       56016 |      500 |      3046 |    90 |    30 |      0.907399 |         0.990561 |       0.054377 |        0.947158 |     2.4892012548262548 |      2.426824047458871 |        1.5904527117884357 |        1.6107795598657217 |
| SNP   | PASS   |       57818 |    52464 |     5354 |       56016 |      500 |      3046 |    90 |    30 |      0.907399 |         0.990561 |       0.054377 |        0.947158 |     2.4892012548262548 |      2.426824047458871 |        1.5904527117884357 |        1.6107795598657217 |
Agilent
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    6.000              37241          36965        449       4069     0.9880       0.9015     0.9428
     None              37248          36972        461       4062     0.9877       0.9017     0.9427
| Type  | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL    |        2909 |     2477 |      432 |        3229 |      207 |       519 |    52 |    50 |      0.851495 |         0.923616 |       0.160731 |        0.886091 |                        |                        |        1.4964850615114236 |        1.8339222614840989 |
| INDEL | PASS   |        2909 |     2477 |      432 |        3229 |      207 |       519 |    52 |    50 |      0.851495 |         0.923616 |       0.160731 |        0.886091 |                        |                        |        1.4964850615114236 |        1.8339222614840989 |
| SNP   | ALL    |       38406 |    34793 |     3613 |       36935 |      275 |      1868 |    37 |    15 |      0.905926 |         0.992158 |       0.050575 |        0.947083 |     2.6247759222568168 |     2.5752854654538417 |         1.588953331534934 |        1.6192536889897844 |
| SNP   | PASS   |       38406 |    34793 |     3613 |       36935 |      275 |      1868 |    37 |    15 |      0.905926 |         0.992158 |       0.050575 |        0.947083 |     2.6247759222568168 |     2.5752854654538417 |         1.588953331534934 |        1.6192536889897844 |
***** TODO Redo: HiSeq4000 + Agilent SureSelect + "ready-to-use" genome
SCHEDULED: <2023-07-19 Wed>
**** TODO HG003 :hg003:hg38:
***** Notes
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run main.nf -profile standard,helios  --input /Work/Groups/bisonex/data/giab/GRCh38/HG003_{1,2}.fq.gz -bg
#+end_src
#+begin_src  sh
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/compareVCF.nf -profile standard,helios -resume --outdir=compareHG003  --test.id=HG003 --test.query=out/HG003_1/variantCalling/haplotypecaller/HG003_1.vcf.gz  --test.compare=vcfeval,happy --test.capture=data/AgilentSureSelectv05_hg38.bed
#+end_src
vcfeval
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    5.000              36745          36473        486       3988     0.9869       0.9021     0.9426
     None              36748          36476        495       3985     0.9866       0.9022     0.9425
$ zcat NA12878.snp_roc.tsv.gz  | tail -n 1 | awk '{print $7 $6}'
happy
Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL         2731      2290       441         3092       208        577     62     53       0.838521          0.917296        0.186611         0.876141                     NaN                     NaN                   1.505145                   1.888993
INDEL   PASS         2731      2290       441         3092       208        577     62     53       0.838521          0.917296        0.186611         0.876141                     NaN                     NaN                   1.505145                   1.888993
  SNP    ALL        37997     34481      3516        36861       306       2074     33     13       0.907466          0.991204        0.056265         0.947488                2.611269                2.565915                   1.555780                   1.621727
  SNP   PASS        37997     34481      3516        36861       306       2074     33     13       0.907466          0.991204        0.056265         0.947488                2.611269                2.5659
***** TODO Redo: HiSeq4000 + Agilent SureSelect + "ready-to-use" genome
SCHEDULED: <2023-07-19 Wed>
**** TODO HG004 :hg38:hg004:
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run main.nf -profile standard,helios  --input /Work/Groups/bisonex/data/giab/GRCh38/HG004_{1,2}.fq.gz -bg
#+end_src
vcfeval
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    6.000              36938          36678        421       4040     0.9887       0.9014     0.9430
     None              36942          36682        432       4036     0.9884       0.9015     0.9429
happy
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL         2787      2388       399         3183       195        580     53     38       0.856835          0.925086        0.182218         0.889654                     NaN                     NaN                   1.507834                   1.848649
INDEL   PASS         2787      2388       399         3183       195        580     53     38       0.856835          0.925086        0.182218         0.889654                     NaN                     NaN                   1.507834                   1.848649
  SNP    ALL        38185     34560      3625        36921       254       2107     46      7       0.905067          0.992704        0.057068         0.946862                2.589175                2.553546                   1.632595                   1.653534
  SNP   PASS        38185     34560      3625        36921       254       2107     46      7       0.905067          0.992704        0.057068         0.946862                2.589175                2.553546                   1.632595                   1.653534
***** TODO Redo: HiSeq4000 + Agilent SureSelect + "ready-to-use" genome
SCHEDULED: <2023-07-19 Wed>
**** STRT HG001 :hg001:T2T:
SCHEDULED: <2023-07-03 Mon>
With liftover: 10x fewer variants...
Type,Filter,TRUTH.TOTAL,TRUTH.TP,TRUTH.FN,QUERY.TOTAL,QUERY.FP,QUERY.UNK,FP.gt,FP.al,METRIC.Recall,METRIC.Precision,METRIC.Frac_NA,METRIC.F1_Score,TRUTH.TOTAL.TiTv_ratio,QUERY.TOTAL.TiTv_ratio,TRUTH.TOTAL.het_hom_ratio,QUERY.TOTAL.het_hom_ratio
INDEL,ALL,413,246,167,751,289,215,2,93,0.595642,0.460821,0.286285,0.519629,,,2.4285714285714284,2.4651162790697674
INDEL,PASS,413,246,167,751,289,215,2,93,0.595642,0.460821,0.286285,0.519629,,,2.4285714285714284,2.4651162790697674
SNP,ALL,11236,10985,251,23597,9771,2841,26,58,0.977661,0.529245,0.120397,0.686734,3.1146100329549617,2.857049501715406,3.640644361833953,2.1146328578975173
SNP,PASS,11236,10985,251,23597,9771,2841,26,58,0.977661,0.529245,0.120397,0.686734,3.1146100329549617,2.857049501715406,3.640644361833953,2.1146328578975173
**** TODO HG002 :hg002:T2T:
**** TODO HG003 :hg003:T2T:
**** TODO HG004 :hg004:T2T:
**** TODO Summarize results for Paul + article :resultats:hg38:
SCHEDULED: <2023-07-16 Sun>
Redo the results
**** TODO Plot: Ashkenazim trio :hg38:
SCHEDULED: <2023-07-16 Sun>
/Entered on/ [2023-04-16 Sun 17:29]
Redo the results
*** KILL Platinum genome
CLOSED: [2023-06-14 Wed 22:37]
https://emea.illumina.com/platinumgenomes.html
*** TODO Sequence NA12878
Discussion with Paul: the subcontractor will not give us the data, so we need to order the DNA
**** DONE DNA ordered
CLOSED: [2023-06-30 Fri 22:29]
** TODO Insilico :centogene:
*** TODO All Centogene variants
**** DONE Extract the list of SNVs
CLOSED: [2023-04-22 Sat 17:32] SCHEDULED: <2023-04-17 Mon>
***** DONE Fix the missing ones by hand
CLOSED: [2023-04-22 Sat 17:31]
The output is saved in git-annex: variants_success.csv
***** DONE Automatic
CLOSED: [2023-04-22 Sat 17:31]
**** DONE Convert SNVs: transcript -> genomic
CLOSED: [2023-06-03 Sat 17:16]
***** DONE Variant_recoder
CLOSED: [2023-04-26 Wed 21:21] SCHEDULED: <2023-04-22 Sat>
****** KILL Haskell: 160 missing: recoded-success.csv
CLOSED: [2023-04-25 Tue 18:32]
The variant list was generated in Haskell and cleaned by hand.
We generate a list of variants for variant_recoder and submit everything at once.
[[file:~/recherche/bisonex/parsevariants/app/Main.hs][parsevariant]]
#+begin_src haskell
recodeVariant = do
  prepareVariantRecoder "variant_success.csv" "renamed.csv"
  runVariantRecoder "renamed.csv" "recoded.json"
#+end_src
#+RESULTS:
: <interactive>:4:3-19: error:
:     Variable not in scope: runVariantRecoder :: String -> String -> t
: gh
Problem: 160 out of 820 could not be read, probably because of the minor version number of the transcript
The output is saved in git-annex: variants-recoded-raw.json.
****** KILL Julia
CLOSED: [2023-04-25 Tue 18:32]
We regenerate the variant list and switch to Julia to prepare the parallel calls to variant recoder
[[file:~/recherche/bisonex/parsevariants/variantRecoder.jl][variantRecoder.jl]]
#+begin_src julia
setupVariantRecoder(unique(init), n)
#+end_src
Then
#+begin_src sh
parallel -a parallel-recoder.sh --jobs 10
#+end_src
We collect the results
#+begin_src julia
(fails, success) = mergeVariantRecoder(n)
CSV.write(fSuccess, success)
CSV.write(fFailures, fails)
#+end_src
Some variants are not found, so we prepare a new job with the minor versions removed from the transcripts
#+begin_src julia
# Cleanup json and txt
if isfile(fSuccess) && isfile(fFailures)
    foreach(rm, variantRecoderInput())
    foreach(rm, variantRecoderOutput())
end
redoFails(fFailures)
#+end_src
Then
#+begin_src sh
parallel -a parallel-recoder.sh --jobs 3
#+end_src
70 transcripts are still missing
***** DONE Julia with mobidetails: recode-failures-mobidetails.csv
CLOSED: [2023-04-25 Tue 18:58]
New strategy: we try variant recoder once.
For all failures, we use mobidetails (~170).
If the ID is not found, we increment the version number twice
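The two fallbacks on transcript versions (dropping the minor version, then bumping it up to two times) can be sketched as below. This is only an illustration, not the Julia code used here: the function names are hypothetical, =NM_000000.3= is a made-up accession, and the pattern assumes RefSeq-style IDs of the form =XX_digits.version=.

#+begin_src python
import re

# Assumption: RefSeq-style accession such as NM_000000.3 (prefix, digits, version).
_VERSIONED = re.compile(r"^(?P<acc>[A-Z]{2}_\d+)\.(?P<ver>\d+)$")


def strip_version(transcript):
    """Drop the minor version; IDs without a version are returned unchanged."""
    m = _VERSIONED.match(transcript)
    return m.group("acc") if m else transcript


def version_candidates(transcript, bumps=2):
    """Original ID followed by the same accession with the version incremented."""
    m = _VERSIONED.match(transcript)
    if not m:
        return [transcript]
    acc, ver = m.group("acc"), int(m.group("ver"))
    return [f"{acc}.{ver + i}" for i in range(bumps + 1)]


print(strip_version("NM_000000.3"))
print(version_candidates("NM_000000.3"))
#+end_src

Each candidate would then be submitted in turn until one is recognised.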
***** DONE About ten left to fix by hand
CLOSED: [2023-04-26 Wed 21:21]
- [X] some transcripts have simply been removed
- [X] Parsing error, a - is often missing
#+begin_src julia
lastTryMobidetails("recoded-failures-mobidetails.csv")
#+end_src
***** DONE Merge the data
CLOSED: [2023-04-26 Wed 22:35]
#+begin_src julia
function mergeAllGenomic()
    dNew = mergeAll("recoded-success.csv",
                    "recoded-failures-mobidetails.csv",
                    "recoded-failures-mobidetails-redo.csv")
    dInit = @chain DataFrame(CSV.File("variant_success.csv")) begin
        @transform :transcript = :transcript .* ":" .* :coding .* :codingPos .* :codingChange
        @select :file :transcript :classification :zygosity
        @rename :classificationCentogene = :classification
    end
    dTmp = outerjoin(dInit, dNew, on = :transcript)
    CSV.write("variant_genomic.csv", dTmp)
end
fSuccess = "recoded-success.csv"
fFailures = "recoded-failures.csv"
# variantRecoder(fSuccess, fFailures)
# mobidetailsOnFailures(fFailures)
# lastTryMobidetails("recoded-failures-mobidetails.csv")
mergeAllGenomic()
#+end_src
***** DONE Format the data for simuscop
CLOSED: [2023-04-28 Fri 11:55] SCHEDULED: <2023-04-26 Wed>
**** TODO Extract the list of CNVs
SCHEDULED: <2023-04-17 Mon>
**** TODO Simuscop :simuscop:
***** DONE Train the model on 63003856/
CLOSED: [2023-04-29 Sat 19:56]
Rerun the model to be sure
***** DONE Generate fastq with simuscop (del and ins only) 20x
CLOSED: [2023-04-28 Fri 23:35] SCHEDULED: <2023-04-22 Sat>
****** DONE Generate a profile with the Centogene bed
CLOSED: [2023-04-28 Fri 11:54] SCHEDULED: <2023-04-22 Sat>
NA12878, but to be redone with real sequencing
See [[*Centogène][Centogene bed]] for the choice
****** DONE Generate the data at 20x
CLOSED: [2023-04-28 Fri 11:54] SCHEDULED: <2023-04-22 Sat>
Centogene capture
****** DONE Regenerate after removing duplicates
CLOSED: [2023-04-28 Fri 17:28]
***** DONE Which coverage?
CLOSED: [2023-04-29 Sat 18:26]
e.g. at chr11:16,014,966, where we have 11 reads in the simulation versus 200!
****** 200 is the closest
#+attr_html: :width 500px
[[./simuscop-200-chr1-1.png]]
#+attr_html: :width 500px
[[./simuscop-200-chr1-2.png]]
****** DONE 20x
CLOSED: [2023-04-29 Sat 15:38]
****** DONE 50x
CLOSED: [2023-04-29 Sat 15:38]
****** DONE 100x
CLOSED: [2023-04-29 Sat 15:39]
****** DONE 200x
CLOSED: [2023-04-29 Sat 15:39]
***** DONE Reads poorly centered on small isolated exons
CLOSED: [2023-04-29 Sat 19:56] SCHEDULED: <2023-04-29 Sat>
Capture OK: [[https://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr1%3A153817168%2D153817824&hgsid=296556270_F4fkENLPXHXidi2oALXls2jxNH9l][UCSC]] (black track)
But poor distribution
#+attr_html: :width 800px
[[./simuscop-error.png]]
To test
- Profile problem?
  - wrong patient?
  - wrong generation? -> compare with those provided on GitHub
- chromosome names?
****** DONE [#A] Test on GATAD2B exon 6 for NC_000001.11:g.153817496A>T
CLOSED: [2023-04-29 Sat 19:56] SCHEDULED: <2023-04-29 Sat>
******* DONE Configuration + profile 63003856.profile: same, poorly centered
CLOSED: [2023-04-29 Sat 19:18]
Downloading the data
#+begin_src sh :dir ~/code/bisonex/test-simuscop
scp meso:/Work/Projects/bisonex/data/genome/GRCh38.p14/genomeRef.fna .
scp meso:/Work/Projects/bisonex/data/simuscop/*.profile .
scp -r meso:/Work/Projects/bisonex/data/genome/GRCh38.p13/bwa .
#+end_src
We retrieve the exon (NB: org-mode does not run the code...)
#+begin_src julia
using CSV,DataFramesMeta
d = CSV.read("VCGS_Exome_Covered_Targets_hg38_40.1MB_renamed.bed", header=false, delim="\t", DataFrame)
@subset d :Column1 .== "NC_000001.11" :Column2 .<= 153817496 :Column3 .>= 153817496
#+end_src
NC_000001.11  153817371  153817542
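The same lookup can be sketched in Python without a dataframe library. The helper names are hypothetical, and the first interval in the example is made up; only the second one comes from these notes. The comparison mirrors the inclusive test of the =@subset= call. Since BED starts are 0-based and ends are exclusive, a stricter test for a 1-based position would be =start < pos <= end=.

#+begin_src python
def parse_bed(lines):
    """Parse the first three BED columns into (chrom, start, end) tuples."""
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 3:
            rows.append((fields[0], int(fields[1]), int(fields[2])))
    return rows


def intervals_containing(rows, chrom, pos):
    """Same inclusive test as the @subset call: start <= pos <= end."""
    return [(c, s, e) for c, s, e in rows if c == chrom and s <= pos <= e]


bed_lines = [
    "NC_000001.11\t100\t200",              # made-up interval
    "NC_000001.11\t153817371\t153817542",  # interval reported in these notes
]
hits = intervals_containing(parse_bed(bed_lines), "NC_000001.11", 153817496)
print(hits)
#+end_src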
Generating the bed
#+begin_src sh :dir ~/code/bisonex/test-simuscop
echo -e "NC_000001.11\t153817371\t153817542" > gatad2b-exon6.bed
#+end_src
#+RESULTS:
Generating a variant
#+begin_src sh :dir ~/code/bisonex/test-simuscop
echo -e "s\tsingle\tNC_000001.11\t153817496\tA\tT\thet"> variant.txt
#+end_src
#+RESULTS:

Generating the config file
#+begin_src sh :dir ~/code/bisonex/test-simuscop
cat > config_wes.txt << EOL
ref = genomeRef.fna
profile = ./63003856.profile
variation = ./variant.txt
target = ./gatad2b-exon6.bed
layout = PE
threads = 1
name = single
output = test-gatad2b
coverage = 20
EOL
#+end_src
#+RESULTS:
We start the simulation
#+begin_src sh :dir ~/code/bisonex/test-simuscop
simuReads config_wes.txt
#+end_src
#+RESULTS:
Alignment
#+begin_src sh :dir ~/code/bisonex/test-simuscop
bwa mem -R '@RG\tID:sample\tSM:sample\tPL:ILLUMINA\tPM:Miseq\tCN:lol\tLB:definition_to_add' bwa/genomeRef test-gatad2b/single_1.fq  test-gatad2b/single_2.fq | samtools sort  -o single.bam
#+end_src
#+RESULTS:
******* DONE GitHub profile HiSeq2000
CLOSED: [2023-04-29 Sat 19:56]
#+begin_src sh :dir ~/code/bisonex/test-simuscop :results file
wget https://raw.githubusercontent.com/qasimyu/simuscop/master/testData/Illumina_HiSeq2000.profile
#+end_src
#+RESULTS:
#+begin_src sh :dir ~/code/bisonex/test-simuscop
cat > config_wes.txt << EOL
ref = genomeRef.fna
profile = ./Illumina_HiSeq2000.profile
variation = ./variant.txt
target = ./gatad2b-exon6.bed
layout = PE
threads = 1
name = single
output = test-gatad2b-hiseq2000
coverage = 20
EOL
simuReads config_wes.txt
bwa mem -R '@RG\tID:sample\tSM:sample\tPL:ILLUMINA\tPM:Miseq\tCN:lol\tLB:definition_to_add' bwa/genomeRef test-gatad2b-hiseq2000/single_1.fq  test-gatad2b-hiseq2000/single_2.fq | samtools sort  -o single-hiseq2000.bam
samtools index single-hiseq2000.bam
#+end_src
#+RESULTS:
******* KILL Test the example on GitHub
CLOSED: [2023-04-29 Sat 19:56]
#+begin_src sh
git clone https://github.com/qasimyu/simuscop/
cd simuscop
simuReads configFiles/config_test_wes.txt
#+end_src
******* KILL Center the window on the capture regions
CLOSED: [2023-04-30 Sun 13:28] SCHEDULED: <2023-04-29 Sat>
1000bp by default, which is larger than the capture regions...
Changing fragzip does not work
If we add a 200bp offset on each side of the exon, the result is even more stretched
NC_000001.11 153817371 153817542 ->
NC_000001.11 153817171 153817742
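The padded coordinates can be reproduced with a small helper. The function name is hypothetical, and clamping the start at 0 is an added safeguard that is not mentioned in the notes.

#+begin_src python
def pad_interval(start, end, pad):
    """Extend an interval by `pad` bases on each side, never going below 0."""
    return max(0, start - pad), end + pad


start, end = pad_interval(153817371, 153817542, 200)
print(start, end, end - start)  # padded interval and its new width
#+end_src

The exon grows from 171bp to 571bp, which is still shorter than the 1000bp default window.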
What if we disable the targets?
Look at the targets on chromosome 1
#+begin_src sh :dir ~/code/bisonex/test-simuscop :results silent
scp meso:/Work/Projects/bisonex/data/simuscop/VCGS_Exome_Covered_Targets_hg38_40.1MB_renamed.bed .
#+end_src
#+begin_src sh :dir ~/code/bisonex/test-simuscop :results silent
head -n 100 VCGS_Exome_Covered_Targets_hg38_40.1MB_renamed.bed > 100exons.bed
echo -e "s\tsingle\tNC_000001.11\t153817496\tA\tT\thet"> variant.txt
cat > config_wes.txt << EOL
ref = genomeRef.fna
profile = ./63003856.profile
variation = ./variant.txt
layout = PE
threads = 4
target = 100exons.bed
name = single
output = test-gatad2b
coverage = 200
EOL
./simuscop/bin/simuReads config_wes.txt
bwa mem bwa/genomeRef test-gatad2b/single_1.fq  test-gatad2b/single_2.fq | samtools sort  -o single.bam
samtools index single.bam
#+end_src
***** KILL Check that all variants are recovered at 200x: hg38
CLOSED: [2023-06-12 Mon 23:25]
****** DONE After alignment
CLOSED: [2023-04-29 Sat 18:27] SCHEDULED: <2023-04-28 Fri>
******* DONE SNV: with duplicates
CLOSED: [2023-04-28 Fri 18:12]
We use [[file:~/recherche/bisonex/simuscop/checkBam.jl][checkBam.jl]]
#+begin_src julia
d = prepareVariant("../parsevariants/variant_genomic.csv")
root = "/home/alex/code/bisonex/simuscop-centogene/cento"
bam = root * "/preprocessing/applybqsr/cento.bam"
bai = root * "/preprocessing/recalibrated/cento.bam.bai"
snv = getSNV(d, bam, bai)
#+end_src
Many false homozygotes
Check with checkFalseHemizygous(snv): many duplicates in the file for simuscop...
******* DONE SNV without duplicates
CLOSED: [2023-04-29 Sat 18:27]
******** DONE 18 false homozygotes, but with few reads
CLOSED: [2023-04-29 Sat 18:27]
julia> @subset snv :refCount .== 0 :altCount .> 0 :zygosity .== "heterozygous"
18×10 DataFrame
 Row │ chrom         pos        variant         variantType  zygosity      ref        alt        refCount  altCount  readsCount
     │ SubStrin…?    Int64      SubStrin…?      String?      String15      SubStrin…  SubStrin…  Int64     Int64     Int64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ NC_000022.11   42213078  g.42213078T>G   snv          heterozygous  T          G                 0         1           1
   2 │ NC_000012.12  101680427  g.101680427C>A  snv          heterozygous  C          A                 0         3           3
   3 │ NC_000014.9   105385684  g.105385684G>C  snv          heterozygous  G          C                 0         4           4
   4 │ NC_000011.10  125978299  g.125978299C>T  snv          heterozygous  C          T                 0         3           3
   5 │ NC_000023.11   77998618  g.77998618C>T   snv          heterozygous  C          T                 0         2           2
   6 │ NC_000015.10   66703292  g.66703292C>T   snv          heterozygous  C          T                 0         3           3
   7 │ NC_000010.11   87961118  g.87961118G>A   snv          heterozygous  G          A                 0         3           3
   8 │ NC_000012.12  112477719  g.112477719A>G  snv          heterozygous  A          G                 0         2           2
   9 │ NC_000020.11    6778406  g.6778406C>T    snv          heterozygous  C          T                 0         3           3
  10 │ NC_000023.11   68192943  g.68192943G>A   snv          heterozygous  G          A                 0         2           2
  11 │ NC_000004.12     987858  g.987858C>T     snv          heterozygous  C          T                 0         3           4
  12 │ NC_000015.10   66435145  g.66435145G>A   snv          heterozygous  G          A                 0         1           2
  13 │ NC_000002.12   47809595  g.47809595C>T   snv          heterozygous  C          T                 0         2           2
  14 │ NC_000003.12  136477305  g.136477305C>G  snv          heterozygous  C          G                 0         4           4
  15 │ NC_000005.10  157285458  g.157285458C>T  snv          heterozygous  C          T                 0         3           3
  16 │ NC_000012.12   23604413  g.23604413T>G   snv          heterozygous  T          G                 0         5           5
  17 │ NC_000019.10   52219703  g.52219703C>T   snv          heterozygous  C          T                 0         1           1
  18 │ NC_000016.10   88856757  g.88856757C>T   snv          heterozygous  C          T                 0         8           8
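Why "few reads" matters here: a minimal sketch, assuming each read samples one of the two alleles of a true heterozygote with probability 1/2 and ignoring sequencing errors. Under that assumption the chance of seeing no reference read among n reads is 0.5^n, which is far from negligible at the depths listed in the table (1 to 8 reads). The function name is hypothetical.

#+begin_src sh
# P(no reference read | heterozygous site, n reads) = 0.5^n
# Assumption: unbiased allele sampling, no sequencing or mapping errors.
p_all_alt() {
    awk -v n="$1" 'BEGIN { printf "%.4f\n", 0.5 ^ n }'
}

for n in 1 3 8; do
    printf '%s reads: %s\n' "$n" "$(p_all_alt "$n")"
done
#+end_src

With 3 reads, one heterozygous site in eight would look homozygous by chance alone, so these 18 rows are weak evidence of a real zygosity error.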
******** DONE 8 not found => probably outside the capture region
CLOSED: [2023-04-28 Fri 19:49]
julia> @subset snv :refCount .== 0 :altCount .== 0
8×10 DataFrame
 Row │ chrom         pos        variant         variantType  zygosity      ref        alt        refCount  altCount  readsCount
     │ SubStrin…?    Int64      SubStrin…?      String?      String15      SubStrin…  SubStrin…  Int64     Int64     Int64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ NC_000015.10   74343027  g.74343027C>T   snv          heterozygous  C          T                 0         0           0
   2 │ NC_000011.10   20638345  g.20638345A>G   snv          heterozygous  A          G                 0         0           0
   3 │ NC_000004.12  139370252  g.139370252C>T  snv          heterozygous  C          T                 0         0           2
   4 │ NC_000017.11   61966475  g.61966475G>T   snv          heterozygous  G          T                 0         0           0
   5 │ NC_000019.10   54144058  g.54144058G>A   snv          heterozygous  G          A                 0         0           0
   6 │ NC_000023.11   77635947  g.77635947A>G   snv          hemizygous    A          G                 0         0           0
   7 │ NC_000005.10    1258495  g.1258495G>A    snv          heterozygous  G          A                 0         0           0
   8 │ NC_000012.12    2449086  g.2449086C>G    snv          heterozygous  C          G                 0         0           0
****** KILL After haplotypecaller
CLOSED: [2023-06-12 Mon 23:24]
******* KILL 20x
CLOSED: [2023-04-29 Sat 15:39]
183 missing out of 766
[[file:~/recherche/bisonex/simuscop/checkVCF.jl][checkVCF.jl]]
#+begin_src julia
@subset leftjoin(d2, dHaplo2, on=:genomic) ismissing.(:Column1)
#+end_src
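The same anti-join (expected variants with no match among the calls) can be sketched in shell. This is only an illustration: the function name and the one-key-per-line file layout are assumptions, not what checkVCF.jl does.

#+begin_src sh
# Print the lines of the "truth" file whose first column is absent from the
# "calls" file. Hypothetical layout: one genomic key per line in both files,
# e.g. "NC_000015.10:g.74343027C>T".
missing_variants() {
    # $1 = truth file, $2 = calls file
    awk 'FILENAME == ARGV[1] { seen[$1] = 1; next }
         !($1 in seen)' "$2" "$1"
}

# Tiny demonstration with made-up keys.
demo=$(mktemp -d)
printf 'v1\nv2\nv3\n' > "$demo/truth.txt"
printf 'v2\n'         > "$demo/calls.txt"
missing_variants "$demo/truth.txt" "$demo/calls.txt"
rm -r "$demo"
#+end_src

Piping the result to =wc -l= gives the number of missing variants directly.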
Depth problem?
E.g. chr13, count at 101081606
NC_000011.10   16014966  g.16014966G>A
1 read out of 11 for the alternate allele
On the reference patient, 202 reads!
This one is not in the capture file (nor in the BAM!)
e.g. NC_000015.10   74343027  g.74343027C>T
The others should be found...
Check the number of reads at 63003856
Also check the model parameterization
******* DONE [#B] 200x
CLOSED: [2023-05-18 Thu 11:04] SCHEDULED: <2023-04-30 Sun>
120 missing (99 without duplicates)!
Check in IGV (VCF + BAM after alignment):
******** snv NC_000015.10   74343027
- nothing called
- not a repeated region
- base quality (see [[*Phred score][Phred score]]) of 37, so OK
- variant found at 26/42
- BAM after ApplyBQSR: base quality 35, so OK
chr15 also at 89318565: variant found at 25/33 with a base quality of 37
Do not forget to load the AVX instructions
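For reference, a Phred score Q corresponds to an error probability of 10^(-Q/10). The small helper below (hypothetical name) converts the two base qualities seen in IGV, which shows why 35 and 37 are both considered fine.

#+begin_src sh
# Convert a Phred-scaled quality Q into an error probability: P = 10^(-Q/10).
phred_to_perr() {
    awk -v q="$1" 'BEGIN { printf "%.6f\n", 10 ^ (-q / 10) }'
}

phred_to_perr 37   # base quality before recalibration
phred_to_perr 35   # base quality after ApplyBQSR
#+end_src

Both values are well below one error per thousand bases, so base quality is unlikely to explain the missing call.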
#+begin_src sh
module load gcc@11.3.0/gcc-12.1.0
#+end_src
Split the .bam by chromosome for debugging (on the mesocentre)
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/simuscop-centogene-200x/cento/testing :results silent
ln -s ../preprocessing/applybqsr/cento.bam .
ln -s ../preprocessing/recalibrated/cento.bam.bai .
ln -s /Work/Projects/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz .
ln -s /Work/Projects/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz.tbi .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.dict .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna.fai .
#+end_src
This has to be run by hand (org-mode does not know the samtools path)
samtools view -b cento.bam NC_000015.10 > cento_chr15.bam
samtools index cento_chr15.bam
Then restrict the reference to chromosome 15
samtools faidx genomeRef.fna NC_000015.10 > genomeRef_chr15.fa
samtools faidx genomeRef_chr15.fa
gatk CreateSequenceDictionary -R genomeRef_chr15.fa -O genomeRef_chr15.dict
Restrict calling to chromosome 15 with the -L option (takes about 1 min)
gatk --java-options "-Xmx3072M" HaplotypeCaller --input cento_chr15.bam \
    --output test.vcf.gz --reference genomeRef.fna --dbsnp dbSNP.gz --tmp-dir . --max-mnp-distance 2 -L NC_000015.10
******** DONE HaplotypeCaller tutorial
CLOSED: [2023-05-01 Mon 19:58]
Procedure: https://gatk.broadinstitute.org/hc/en-us/articles/360043491652-When-HaplotypeCaller-and-Mutect2-do-not-call-an-expected-variant
********* DONE Remove --max-mnp-distance 2: same result
CLOSED: [2023-04-30 Sun 15:42]
********* DONE --debug &> run.log: not called...
CLOSED: [2023-04-30 Sun 15:52]
********* DONE --linked-de-bruijn-graph: same result
CLOSED: [2023-04-30 Sun 15:55]
********* DONE --recover-all-dangling-branches
CLOSED: [2023-04-30 Sun 16:01]
********* DONE --min-pruning 0: more variants, but not this one
CLOSED: [2023-04-30 Sun 15:59]
********* DONE --bam-output
CLOSED: [2023-04-30 Sun 16:50]
********** DONE : nothing!
CLOSED: [2023-04-30 Sun 16:08]
********** DONE + --recover-all-dangling-branches: nothing!
CLOSED: [2023-04-30 Sun 16:08]
********* DONE Filtered data? Apparently not
CLOSED: [2023-04-30 Sun 16:41]
183122 read(s) filtered by: MappingQualityReadFilter
3674 read(s) filtered by: NotDuplicateReadFilter
********** DONE --disable-read-filter MappingQualityReadFilter: same result
CLOSED: [2023-04-30 Sun 16:34]
We do get: 0 read(s) filtered by: MappingQualityAvailableReadFilter
********** DONE --disable-read-filter NotDuplicateReadFilter: same result
CLOSED: [2023-04-30 Sun 16:40]
********* DONE Try freebayes: same result
CLOSED: [2023-04-30 Sun 16:22]
freebayes -f genomeRef.fna -r NC_000015.10 cento_chr15.bam > freebayes-test-chr15.vcf
********* DONE With all the options: same result
CLOSED: [2023-04-30 Sun 16:50]
--linked-de-bruijn-graph --recover-all-dangling-branches --min-pruning 0 --bam-output debug.bam
********* DONE Check that we are looking at the same BAM: yes
CLOSED: [2023-04-30 Sun 16:50]
********* DONE Disable dbSNP: same result
CLOSED: [2023-04-30 Sun 16:52]
********* DONE Change kmer size: same result
CLOSED: [2023-04-30 Sun 16:56]
e.g. [[https://gatk.broadinstitute.org/hc/en-us/community/posts/360075653152-REAL-Variant-not-called-by-HaplotypeCaller][GATK forum]] --kmer-size 18 --kmer-size 22
********* DONE --adaptive-pruning true
CLOSED: [2023-05-01 Mon 19:57]
******** DONE Mapping quality: it is 0!!!!
CLOSED: [2023-05-01 Mon 19:58]
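A quick way to spot this earlier is to tabulate the MAPQ column of the reads covering the missed position. The sketch below only assumes standard SAM text on stdin (MAPQ is field 5); the function name is hypothetical and the samtools call in the comment is an example, not a command taken from these notes.

#+begin_src sh
# Histogram of MAPQ values (SAM field 5) for records read from stdin.
# Header lines starting with "@" are skipped. Output: MAPQ <tab> count.
mapq_hist() {
    awk -F'\t' '!/^@/ { count[$5]++ }
                END   { for (q in count) print q "\t" count[q] }' | sort -n
}

# Example (assumed invocation):
#   samtools view cento_chr15.bam NC_000015.10:74343027-74343027 | mapq_hist
#+end_src

If nearly all reads fall in the MAPQ 0 bin, the caller has nothing usable at that site, whatever assembly options are tried.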
******* KILL Compare VCF with vcfeval :haplotypecaller:
CLOSED: [2023-06-12 Mon 23:24]
Prepare the data in Julia
#+begin_src sh :dir ~/recherche/bisonex/simuscop
julia --project=. toVCF.jl
#+end_src
Then export to the mesocentre
#+begin_src
scp variants_for_vcfeval.tsv.gz* meso:centogene_variants/
#+end_src
#+begin_src
z bis
cd simuscop-200x
rtg vcfeval -b ~/centogene_variants/variants_for_vcfeval.tsv.gz -c cento/variantCalling/haplotypecaller/cento.vcf.gz -o compare-haplotypecaller -t /Work/Groups/bisonex/data/giab/GRCh38/genomeRef.sdf
#+end_src
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
   82.000                540            540         60         45     0.9000       0.9231     0.9114
     None                546            546        329         39     0.6240       0.9333     0.7479
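The three metric columns can be recomputed from the counts, which is a useful sanity check when reading these tables. A minimal sketch (hypothetical function name): precision uses the call-side true positives, sensitivity the baseline-side ones, and the F-measure is their harmonic mean. The definitions reproduce the figures printed by vcfeval in these notes.

#+begin_src sh
# Recompute vcfeval-style metrics from raw counts.
#   $1 = true positives (baseline)   $2 = true positives (call)
#   $3 = false positives             $4 = false negatives
vcfeval_metrics() {
    awk -v tpb="$1" -v tpc="$2" -v fpos="$3" -v fneg="$4" 'BEGIN {
        precision   = tpc / (tpc + fpos)
        sensitivity = tpb / (tpb + fneg)
        fmeasure    = 2 * precision * sensitivity / (precision + sensitivity)
        printf "%.4f %.4f %.4f\n", precision, sensitivity, fmeasure
    }'
}

vcfeval_metrics 540 540 60 45     # row with threshold 82.000
vcfeval_metrics 546 546 329 39    # row without threshold
#+end_src

Dropping the score threshold recovers only 6 more true positives but adds 269 false positives, which is what pulls the precision down to 0.6240.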
******* KILL Compare with hap.py :haplotypecaller:
CLOSED: [2023-06-12 Mon 23:24]
 NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/checkInserted.nf -profile standard,helios --outdir=compare-simuscop-200x  --query=out/simuscop-centogene-200x/cento/callVariant/haplotypecaller/cento.vcf.gz --truth=centogene_variants/variants_for_vcfeval.tsv.gz --id=simuscop-200x-check
******* DONE Naive method 549/585
CLOSED: [2023-05-04 Thu 21:57]
Haplotypecaller: Nb reference SNV 692 vs found 585
Variant calling, filter technical: reference SNV 692 vs found 521
****** KILL Before annotation
CLOSED: [2023-06-12 Mon 23:25] SCHEDULED: <2023-04-28 Fri>
#+begin_src
cd cento/variantCalling
bgzip filter-technical.vcf
tabix -p vcf filter-technical.vcf.gz -f
#+end_src
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
   12.000                519            519         55         66     0.9042       0.8872     0.8956
     None                519            519         55         66     0.9042       0.8872     0.8956
******* DONE Naive method 521/585
CLOSED: [2023-05-04 Thu 21:57]
Haplotypecaller: Nb reference SNV 692 vs found 585
Variant calling, filter technical: reference SNV 692 vs found 521
******* KILL Compare with hap.py
CLOSED: [2023-06-12 Mon 23:24]
****** KILL After annotation filter
CLOSED: [2023-06-12 Mon 23:25]
******* DONE Naive method: 493/585
CLOSED: [2023-05-04 Thu 22:09]
******* KILL Compare with hap.py
CLOSED: [2023-06-12 Mon 23:25]
******* KILL vcfeval
CLOSED: [2023-06-12 Mon 23:25]
 cd cento/annotation/
 bgzip postvep-filter.vcf
 tabix postvep-filter.vcf.gz
 cd ../..
 rtg vcfeval -b ~/centogene_variants/variants_for_vcfeval.tsv.gz -c cento/annotation/postvep-filter.vcf.gz  -o compare-vepfilter -t /Work/Groups/bisonex/data/giab/GRCh38/genomeRef.sdf
 Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
   12.000                491            491         50         94     0.9076       0.8393     0.8721
     None                491            491         50         94     0.9076       0.8393     0.8721
***** TODO Check that all variants are found at 200x: hg38 :T2T:
****** DONE After alignment
CLOSED: [2023-04-29 Sat 18:27] SCHEDULED: <2023-04-28 Fri>
******* DONE SNV: with duplicates
CLOSED: [2023-04-28 Fri 18:12]
We use [[file:~/recherche/bisonex/simuscop/checkBam.jl][checkBam.jl]]
#+begin_src julia
d = prepareVariant("../parsevariants/variant_genomic.csv")
root = "/home/alex/code/bisonex/simuscop-centogene/cento"
bam = root * "/preprocessing/applybqsr/cento.bam"
bai = root * "/preprocessing/recalibrated/cento.bam.bai"
snv = getSNV(d, bam, bai)
#+end_src
Many false homozygotes
Check with checkFalseHemizygous(snv): many duplicates in the simuscop file...
6}'
0.89370.9621
indel
$ zcat NA12878.non_snp_roc.tsv.gz  | tail -n 1 | awk '{print $7 $6}'
0.75980.7445
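The two numbers run together because =print $7 $6= concatenates the fields in awk; a comma makes awk insert the output separator instead. A small demonstration on a made-up seven-column row (the first five fields are placeholders):

#+begin_src sh
# Made-up row: five placeholder fields, then two values in columns 6 and 7.
row='score tpb fp tpc fn 0.7445 0.7598'

printf '%s\n' "$row" | awk '{ print $7 $6 }'    # concatenated: 0.75980.7445
printf '%s\n' "$row" | awk '{ print $7, $6 }'   # separated:    0.7598 0.7445
#+end_src

Read that way, the SNP line is 0.8937 and 0.9621, which agree with the recall and precision reported for SNPs in the hap.py summary, so column 7 appears to hold sensitivity and column 6 precision.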
compareNA12878-giab/happy/NA12878.summary.csv
| Type  | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
|-------+--------+-------------+----------+----------+-------------+----------+-----------+-------+-------+---------------+------------------+----------------+-----------------+------------------------+------------------------+---------------------------+---------------------------|
| INDEL | ALL    |        4871 |     3678 |     1193 |        7036 |     1299 |      2011 |   208 |   217 |      0.755081 |         0.741493 |       0.285816 |        0.748225 |                        |                        |        1.6174985978687606 |        2.5240506329113925 |
| INDEL | PASS   |        4871 |     3678 |     1193 |        7036 |     1299 |      2011 |   208 |   217 |      0.755081 |         0.741493 |       0.285816 |        0.748225 |                        |                        |        1.6174985978687606 |        2.5240506329113925 |
| SNP   | ALL    |       46032 |    41138 |     4894 |       47694 |     1622 |      4930 |   362 |    31 |      0.893683 |         0.962071 |       0.103367 |        0.926617 |      2.529551552318896 |     2.4124463519313304 |        1.6206857273037931 |        1.6888675840288743 |
| SNP   | PASS   |       46032 |    41138 |     4894 |       47694 |     1622 |      4930 |   362 |    31 |      0.893683 |         0.962071 |       0.103367 |        0.926617 |      2.529551552318896 |     2.4124463519313304 |        1.6206857273037931 |         1.688867584028874 |
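To pull the headline figures out of a hap.py summary without opening the whole table, the columns can be looked up by header name. This is a sketch assuming a plain comma-separated file with the header names shown in the table; the function name is hypothetical.

#+begin_src sh
# Print Recall, Precision and F1 for one Type/Filter pair of a hap.py
# summary CSV. Columns are located through the header line, so their
# order in the file does not matter.
#   $1 = summary.csv   $2 = Type (SNP or INDEL)   $3 = Filter (ALL or PASS)
happy_metrics() {
    awk -F',' -v type="$2" -v filter="$3" '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        $(col["Type"]) == type && $(col["Filter"]) == filter {
            print $(col["METRIC.Recall"]), $(col["METRIC.Precision"]), $(col["METRIC.F1_Score"])
        }' "$1"
}

# Example:
#   happy_metrics compareNA12878-giab/happy/NA12878.summary.csv SNP PASS
#+end_src

The F1 score is the harmonic mean of the other two: for SNPs, 2 × 0.893683 × 0.962071 / (0.893683 + 0.962071) ≈ 0.9266, matching the table.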
***** KILL Results without trimming
CLOSED: [2023-06-25 Sun 15:53] SCHEDULED: <2023-06-26 Mon>
***** DONE Redo: HiSeq4000 + Agilent SureSelect + "ready-to-use" genome
CLOSED: [2023-06-30 Fri 22:08] SCHEDULED: <2023-06-25 Sun>
#+begin_src
nextflow run workflows/compareVCF.nf -profile standard,helios --outdir=out/HG001-SRX11061486_SRR14724513-GRCh38 --query=out/HG001-SRX11061486_SRR14724513-GRCh38/callVariant/haplotypecaller/HG001-SRX11061486_SRR14724513-GRCh38.vcf.gz --compare=vcfeval,happy -lib lib --capture=capture/Agilent_SureSelect_All_Exons_v7_hg38_Regions.bed  --id=HG001
#+end_src
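Better results!
| Type  | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
|-------+--------+-------------+----------+----------+-------------+----------+-----------+-------+-------+---------------+------------------+----------------+-----------------+------------------------+------------------------+---------------------------+---------------------------|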
| INDEL | ALL    |         549 |      489 |       60 |         899 |       64 |       340 |     8 |    17 |       0.89071 |          0.88551 |       0.378198 |        0.888102 |                        |                        |          1.86096256684492 |         2.247272727272727 |
| INDEL | PASS   |         549 |      489 |       60 |         899 |       64 |       340 |     8 |    17 |       0.89071 |          0.88551 |       0.378198 |        0.888102 |                        |                        |          1.86096256684492 |         2.247272727272727 |
| SNP   | ALL    |       21973 |    21462 |      511 |       26285 |      563 |      4263 |    68 |    16 |      0.976744 |         0.974435 |       0.162184 |        0.975588 |      3.007110300820419 |       2.78468624064479 |        1.5918102430965306 |        1.8161449399656946 |
| SNP   | PASS   |       21973 |    21462 |      511 |       26285 |      563 |      4263 |    68 |    16 |      0.976744 |         0.974435 |       0.162184 |        0.975588 |      3.007110300820419 |       2.78468624064479 |        1.5918102430965306 |        1.8161449399656946 |
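The TiTv_ratio columns are the ratio of transitions (A<->G, C<->T) to transversions (all other single-base substitutions). A rough way to recompute it from a VCF is sketched below; it only looks at biallelic single-base records, so it may not match hap.py's own accounting exactly. The function name is hypothetical.

#+begin_src sh
# Transition/transversion ratio over the biallelic SNVs of an uncompressed
# VCF read from stdin. REF is column 4, ALT is column 5.
titv() {
    awk -F'\t' '
        /^#/ { next }
        $4 ~ /^[ACGT]$/ && $5 ~ /^[ACGT]$/ && $4 != $5 {
            pair = $4 $5
            if (pair == "AG" || pair == "GA" || pair == "CT" || pair == "TC") ti++
            else tv++
        }
        END { if (tv > 0) printf "%.3f\n", ti / tv; else print "NA" }'
}

# Example (assumed file name):
#   zcat query.vcf.gz | titv
#+end_src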
***** KILL Use other raw data?
CLOSED: [2023-06-25 Sun 15:58]
https://zenodo.org/record/3597727
The capture is also in hg37. Would be interesting, but no time.
***** KILL Compare with UCSC liftOver
CLOSED: [2023-06-26 Mon 19:02] SCHEDULED: <2023-06-25 Sun>
Picard LiftOverIntervalList is based on UCSC liftOver
But then we would not get the difference that we see for NA12878...
**** TODO HG002 :hg002:hg38:
SCHEDULED: <2023-07-25 Tue>
#+begin_src
    NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/giabFastq.nf -profile standard,helios
    NXF_OPTS=-D"user.name=${USER}" nextflow run main.nf -profile standard,helios -resume --input="/Work/Groups/bisonex/data/giab/GRCh38/HG002_{1,2}.fq.gz" --test.id=HG002
    # Only the capture file differs. Results are better using the capture file given by Agilent, stored in data/
    NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/compareVCF.nf -profile standard,helios -resume --outdir=compareHG002 --test.id=HG002 --test.query=out/HG002_1/variantCalling/haplotypecaller/HG002_1.vcf.gz  --test.compare=vcfeval,happy --test.capture=data/AgilentSureSelectv05_hg38.bed
#+end_src
***** DONE Poor results
CLOSED: [2023-04-14 Fri 09:42]
with vcfeval
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    0.000              24585          24390      10060      39415     0.7080       0.3841     0.4980
     None              24585          24390      10060      39415     0.7080       0.3841     0.4980
The variantCalling output is the one from hap.py???
Rerunning...
***** DONE Check that the VCF is in hg38
CLOSED: [2023-04-12 Wed 10:33] SCHEDULED: <2023-04-12 Wed>
***** KILL Capture in hg19?
CLOSED: [2023-04-13 Thu 09:46] SCHEDULED: <2023-04-12 Wed>
***** KILL Really a capture file, or a region of interest?
CLOSED: [2023-04-13 Thu 09:45] SCHEDULED: <2023-04-12 Wed>
"target region" +/- 50bp
[[https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/OsloUniversityHospital_Exome_GATK_jointVC_11242015/README.txt][README]]
 list file describing the variant calling regions (target regions extended with 50 bp on each end)
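The "target regions extended with 50 bp on each end" described in the README can be reproduced on any BED file with a few lines of awk. This is a sketch with a hypothetical function name: starts are clamped at 0, ends are not clamped to chromosome length, and overlapping intervals are not merged (bedtools slop and bedtools merge handle those cases properly).

#+begin_src sh
# Extend every BED interval by $1 bp on each side (tab-separated input on
# stdin). track/browser/comment lines are passed through unchanged.
pad_bed() {
    awk -v pad="$1" 'BEGIN { FS = OFS = "\t" }
        /^(track|browser|#)/ { print; next }
        {
            $2 = ($2 + 0 > pad + 0) ? $2 - pad : 0   # clamp start at 0
            $3 = $3 + pad
            print
        }'
}

# Example (assumed output name):
#   pad_bed 50 < data/AgilentSureSelectv05_hg38.bed > capture_padded50.bed
#+end_src

Comparing results with a padded and an unpadded capture file is one way to test whether padding explains a difference between two capture definitions.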
***** DONE .bed provided by Agilent: very poor sensitivity
CLOSED: [2023-04-13 Thu 09:46] SCHEDULED: <2023-04-13 Thu>
Agilent SureSelect Human All Exon V5 kit
Available in hg38
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    0.000              19653          19501       6410      21657     0.7526       0.4757     0.5830
     None              19653          19501       6410      21657     0.7526       0.4757     0.5830
***** DONE Sort by name with samtools sort: good results
CLOSED: [2023-04-14 Fri 09:25] SCHEDULED: <2023-04-13 Thu>
With the capture file provided by GIAB
vcf eval
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    5.000              57443          57032        984       6557     0.9830       0.8975     0.9383
     None              57457          57046       1009       6543     0.9826       0.8978     0.9383
Happy
| Type  | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
|-------+--------+-------------+----------+----------+-------------+----------+-----------+-------+-------+---------------+------------------+----------------+-----------------+------------------------+------------------------+---------------------------+---------------------------|
| INDEL | ALL    |        6150 |     5007 |     1143 |        6978 |      556 |      1346 |   151 |   168 |      0.814146 |         0.901278 |       0.192892 |          0.8555 |                        |                        |        1.5434221840068787 |        1.9467178175618074 |
| INDEL | PASS   |        6150 |     5007 |     1143 |        6978 |      556 |      1346 |   151 |   168 |      0.814146 |         0.901278 |       0.192892 |          0.8555 |                        |                        |        1.5434221840068787 |        1.9467178175618074 |
| SNP   | ALL    |       57818 |    52464 |     5354 |       56016 |      500 |      3046 |    90 |    30 |      0.907399 |         0.990561 |       0.054377 |        0.947158 |     2.4892012548262548 |      2.426824047458871 |        1.5904527117884357 |        1.6107795598657217 |
| SNP   | PASS   |       57818 |    52464 |     5354 |       56016 |      500 |      3046 |    90 |    30 |      0.907399 |         0.990561 |       0.054377 |        0.947158 |     2.4892012548262548 |      2.426824047458871 |        1.5904527117884357 |        1.6107795598657217 |
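The hap.py summary columns are tied together by simple ratios. A minimal sketch, assuming that QUERY.TOTAL splits into true positives, false positives and unknown calls (this assumption is consistent with the numbers in the table); the function name is hypothetical:

#+begin_src python
def happy_metrics(truth_tp, truth_fn, query_total, query_fp, query_unk):
    """Recompute hap.py recall, precision, Frac_NA and F1 from its counts."""
    recall = truth_tp / (truth_tp + truth_fn)
    # Assumption: QUERY.TOTAL = QUERY.TP + QUERY.FP + QUERY.UNK
    query_tp = query_total - query_fp - query_unk
    precision = query_tp / (query_tp + query_fp)
    frac_na = query_unk / query_total
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, frac_na, f1


# INDEL row of the table above
recall, precision, frac_na, f1 = happy_metrics(5007, 1143, 6978, 556, 1346)
print(f"recall={recall:.6f} precision={precision:.6f} frac_na={frac_na:.6f} f1={f1:.4f}")
#+end_src

Unknown calls (QUERY.UNK) are left out of the precision, which is why a large Frac_NA does not lower it.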
***** DONE Agilent capture slightly better than the one provided by GIAB (padding?)
CLOSED: [2023-04-14 Fri 09:48]
GIAB:
vcfeval
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    5.000              57443          57032        984       6557     0.9830       0.8975     0.9383
     None              57457          57046       1009       6543     0.9826       0.8978     0.9383
hap.py
| Type  | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
|-------+--------+-------------+----------+----------+-------------+----------+-----------+-------+-------+---------------+------------------+----------------+-----------------+------------------------+------------------------+---------------------------+---------------------------|
| INDEL | ALL    |        6150 |     5007 |     1143 |        6978 |      556 |      1346 |   151 |   168 |      0.814146 |         0.901278 |       0.192892 |          0.8555 |                        |                        |        1.5434221840068787 |        1.9467178175618074 |
| INDEL | PASS   |        6150 |     5007 |     1143 |        6978 |      556 |      1346 |   151 |   168 |      0.814146 |         0.901278 |       0.192892 |          0.8555 |                        |                        |        1.5434221840068787 |        1.9467178175618074 |
| SNP   | ALL    |       57818 |    52464 |     5354 |       56016 |      500 |      3046 |    90 |    30 |      0.907399 |         0.990561 |       0.054377 |        0.947158 |     2.4892012548262548 |      2.426824047458871 |        1.5904527117884357 |        1.6107795598657217 |
| SNP   | PASS   |       57818 |    52464 |     5354 |       56016 |      500 |      3046 |    90 |    30 |      0.907399 |         0.990561 |       0.054377 |        0.947158 |     2.4892012548262548 |      2.426824047458871 |        1.5904527117884357 |        1.6107795598657217 |
Agilent
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    6.000              37241          36965        449       4069     0.9880       0.9015     0.9428
     None              37248          36972        461       4062     0.9877       0.9017     0.9427
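| Type  | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
|-------+--------+-------------+----------+----------+-------------+----------+-----------+-------+-------+---------------+------------------+----------------+-----------------+------------------------+------------------------+---------------------------+---------------------------|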
| INDEL | ALL    |        2909 |     2477 |      432 |        3229 |      207 |       519 |    52 |    50 |      0.851495 |         0.923616 |       0.160731 |        0.886091 |                        |                        |        1.4964850615114236 |        1.8339222614840989 |
| INDEL | PASS   |        2909 |     2477 |      432 |        3229 |      207 |       519 |    52 |    50 |      0.851495 |         0.923616 |       0.160731 |        0.886091 |                        |                        |        1.4964850615114236 |        1.8339222614840989 |
| SNP   | ALL    |       38406 |    34793 |     3613 |       36935 |      275 |      1868 |    37 |    15 |      0.905926 |         0.992158 |       0.050575 |        0.947083 |     2.6247759222568168 |     2.5752854654538417 |         1.588953331534934 |        1.6192536889897844 |
| SNP   | PASS   |       38406 |    34793 |     3613 |       36935 |      275 |      1868 |    37 |    15 |      0.905926 |         0.992158 |       0.050575 |        0.947083 |     2.6247759222568168 |     2.5752854654538417 |         1.588953331534934 |        1.6192536889897844 |
***** TODO Redo: HiSeq4000 + Agilent SureSelect + "ready-to-use" genome
SCHEDULED: <2023-07-23 Sun>
**** TODO HG003 :hg003:hg38:
***** Notes
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run main.nf -profile standard,helios  --input /Work/Groups/bisonex/data/giab/GRCh38/HG003_{1,2}.fq.gz -bg
#+end_src
#+begin_src  sh
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/compareVCF.nf -profile standard,helios -resume --outdir=compareHG003  --test.id=HG003 --test.query=out/HG003_1/variantCalling/haplotypecaller/HG003_1.vcf.gz  --test.compare=vcfeval,happy --test.capture=data/AgilentSureSelectv05_hg38.bed
#+end_src
vcfeval
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    5.000              36745          36473        486       3988     0.9869       0.9021     0.9426
     None              36748          36476        495       3985     0.9866       0.9022     0.9425
$ zcat NA12878.snp_roc.tsv.gz  | tail -n 1 | awk '{print $7, $6}'
happy
Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL         2731      2290       441         3092       208        577     62     53       0.838521          0.917296        0.186611         0.876141                     NaN                     NaN                   1.505145                   1.888993
INDEL   PASS         2731      2290       441         3092       208        577     62     53       0.838521          0.917296        0.186611         0.876141                     NaN                     NaN                   1.505145                   1.888993
  SNP    ALL        37997     34481      3516        36861       306       2074     33     13       0.907466          0.991204        0.056265         0.947488                2.611269                2.565915                   1.555780                   1.621727
  SNP   PASS        37997     34481      3516        36861       306       2074     33     13       0.907466          0.991204        0.056265         0.947488                2.611269                2.5659
***** TODO Redo: HiSeq4000 + Agilent SureSelect + "ready-to-use" genome
SCHEDULED: <2023-07-23 Sun>
**** TODO HG004 :hg38:hg004:
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run main.nf -profile standard,helios  --input /Work/Groups/bisonex/data/giab/GRCh38/HG004_{1,2}.fq.gz -bg
#+end_src
vcfeval
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    6.000              36938          36678        421       4040     0.9887       0.9014     0.9430
     None              36942          36682        432       4036     0.9884       0.9015     0.9429
happy
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL         2787      2388       399         3183       195        580     53     38       0.856835          0.925086        0.182218         0.889654                     NaN                     NaN                   1.507834                   1.848649
INDEL   PASS         2787      2388       399         3183       195        580     53     38       0.856835          0.925086        0.182218         0.889654                     NaN                     NaN                   1.507834                   1.848649
  SNP    ALL        38185     34560      3625        36921       254       2107     46      7       0.905067          0.992704        0.057068         0.946862                2.589175                2.553546                   1.632595                   1.653534
  SNP   PASS        38185     34560      3625        36921       254       2107     46      7       0.905067          0.992704        0.057068         0.946862                2.589175                2.553546                   1.632595                   1.653534
***** TODO Redo: HiSeq4000 + Agilent SureSelect + "ready-to-use" genome
SCHEDULED: <2023-07-23 Sun>
**** STRT HG001 :hg001:T2T:
SCHEDULED: <2023-07-03 Mon>
With liftover: 10x fewer variants...
Type,Filter,TRUTH.TOTAL,TRUTH.TP,TRUTH.FN,QUERY.TOTAL,QUERY.FP,QUERY.UNK,FP.gt,FP.al,METRIC.Recall,METRIC.Precision,METRIC.Frac_NA,METRIC.F1_Score,TRUTH.TOTAL.TiTv_ratio,QUERY.TOTAL.TiTv_ratio,TRUTH.TOTAL.het_hom_ratio,QUERY.TOTAL.het_hom_ratio
INDEL,ALL,413,246,167,751,289,215,2,93,0.595642,0.460821,0.286285,0.519629,,,2.4285714285714284,2.4651162790697674
INDEL,PASS,413,246,167,751,289,215,2,93,0.595642,0.460821,0.286285,0.519629,,,2.4285714285714284,2.4651162790697674
SNP,ALL,11236,10985,251,23597,9771,2841,26,58,0.977661,0.529245,0.120397,0.686734,3.1146100329549617,2.857049501715406,3.640644361833953,2.1146328578975173
SNP,PASS,11236,10985,251,23597,9771,2841,26,58,0.977661,0.529245,0.120397,0.686734,3.1146100329549617,2.857049501715406,3.640644361833953,2.1146328578975173
**** TODO HG002 :hg002:T2T:
**** TODO HG003 :hg003:T2T:
**** TODO HG004 :hg004:T2T:
**** TODO Summarize results for Paul + article :resultats:hg38:
SCHEDULED: <2023-07-29 Sat>
Redo the results
**** TODO Plot: Ashkenazim trio :hg38:
SCHEDULED: <2023-07-29 Sat>
/Entered on/ [2023-04-16 Sun 17:29]
Redo the results
*** KILL Platinum genome
CLOSED: [2023-06-14 Wed 22:37]
https://emea.illumina.com/platinumgenomes.html
*** TODO Sequence NA12878 :cento:hg001:
Discussion with Paul: the subcontractor will not give us the data, so we need to order the DNA
**** DONE DNA ordered
CLOSED: [2023-06-30 Fri 22:29]
**** TODO Back up the raw data
SCHEDULED: <2023-07-19 Wed>
K, scality, S
**** STRT Compare to GIAB
SCHEDULED: <2023-07-18 Tue>
#+begin_src sh
nextflow run main.nf -profile standard,helios --input="/Work/Groups/bisonex/centogene/2300346867_63118093_NA12878/63118093_S260_R{1,2}_001.fastq.gz"  --id=2300346867_63118093_NA12878-GRCh38 --genome=GRCh38 -bg
#+end_src
** TODO Insilico :cento:
*** TODO All Centogene variants
**** DONE Extract the list of SNVs
CLOSED: [2023-04-22 Sat 17:32] SCHEDULED: <2023-04-17 Mon>
***** DONE Fix the missing ones by hand
CLOSED: [2023-04-22 Sat 17:31]
The output is saved in git-annex: variants_success.csv
***** DONE Automatic
CLOSED: [2023-04-22 Sat 17:31]
**** DONE Convert SNVs: transcript -> genomic
CLOSED: [2023-06-03 Sat 17:16]
***** DONE Variant_recoder
CLOSED: [2023-04-26 Wed 21:21] SCHEDULED: <2023-04-22 Sat>
****** KILL Haskell: 160 missing: recoded-success.csv
CLOSED: [2023-04-25 Tue 18:32]
The variant list was generated in Haskell and cleaned up by hand.
We generate a list of variants for variant_recoder and submit everything at once.
[[file:~/recherche/bisonex/parsevariants/app/Main.hs][parsevariant]]
#+begin_src haskell
recodeVariant = do
  prepareVariantRecoder "variant_success.csv" "renamed.csv"
  runVariantRecoder "renamed.csv" "recoded.json"
#+end_src
#+RESULTS:
: <interactive>:4:3-19: error:
:     Variable not in scope: runVariantRecoder :: String -> String -> t
: gh
Problem: 160 out of 820 could not be read, probably because of the transcript minor version number.
The output is saved in git-annex: variants-recoded-raw.json.
****** KILL Julia
CLOSED: [2023-04-25 Tue 18:32]
We regenerate the variant list and switch to Julia to prepare the parallel calls to variant_recoder
[[file:~/recherche/bisonex/parsevariants/variantRecoder.jl][variantRecoder.jl]]
#+begin_src julia
setupVariantRecoder(unique(init), n)
#+end_src
Then
#+begin_src sh
parallel -a parallel-recoder.sh --jobs 10
#+end_src
We collect the results
#+begin_src julia
(fails, success) = mergeVariantRecoder(n)
CSV.write(fSuccess, success)
CSV.write(fFailures, fails)
#+end_src
Some variants are not found, so we prepare a new job after removing the minor versions of the transcripts
#+begin_src julia
# Cleanup json and txt
if isfile(fSuccess) && isfile(fFailures)
    foreach(rm, variantRecoderInput())
    foreach(rm, variantRecoderOutput())
end
redoFails(fFailures)
#+end_src
Then
#+begin_src sh
parallel -a parallel-recoder.sh --jobs 3
#+end_src
70 transcripts are still missing
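For reference, removing the minor version of a transcript amounts to a small string rewrite. A minimal sketch, assuming RefSeq-style accessions (two letters, underscore, digits, then =.version=); the function name and the example variant are made up and this is not the code in variantRecoder.jl:

#+begin_src python
import re


def strip_minor_version(hgvs):
    """Drop the '.N' version suffix of the transcript accession, if any."""
    return re.sub(r"^([A-Z]{2}_\d+)\.\d+", r"\1", hgvs)


# Hypothetical variant, for illustration only
print(strip_minor_version("NM_000123.3:c.100A>G"))
#+end_src

An identifier without a version is returned unchanged, so the function can be applied to the whole failure list.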
***** DONE Julia with mobidetails: recode-failures-mobidetails.csv
CLOSED: [2023-04-25 Tue 18:58]
New strategy: we try variant_recoder once.
For all the failures, we use mobidetails (~170).
If the ID is not found, we increment the version number twice
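The retry on the version number can be sketched as follows. This is only an illustration of the idea, assuming identifiers of the form =accession.version=; the function name and the example accession are hypothetical, and the actual lookup against mobidetails is left out:

#+begin_src python
def bumped_versions(transcript, tries=2):
    """Return the next `tries` versions of a versioned transcript ID."""
    accession, sep, version = transcript.partition(".")
    if not sep or not version.isdigit():
        raise ValueError(f"no numeric version in {transcript!r}")
    return [f"{accession}.{int(version) + i}" for i in range(1, tries + 1)]


# Hypothetical accession, for illustration only
print(bumped_versions("NM_000123.3"))
#+end_src

Each candidate would then be queried in turn, stopping at the first one that is found.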
***** DONE About ten left to fix by hand
CLOSED: [2023-04-26 Wed 21:21]
- [X] some transcripts have simply been removed
- [X] Parsing error, a - is often missing
#+begin_src julia
lastTryMobidetails("recoded-failures-mobidetails.csv")
#+end_src
***** DONE Merge the data
CLOSED: [2023-04-26 Wed 22:35]
#+begin_src julia
function mergeAllGenomic()
    dNew = mergeAll("recoded-success.csv",
                    "recoded-failures-mobidetails.csv",
                    "recoded-failures-mobidetails-redo.csv")
    dInit = @chain DataFrame(CSV.File("variant_success.csv")) begin
        @transform :transcript = :transcript .* ":" .* :coding .* :codingPos .* :codingChange
        @select :file :transcript :classification :zygosity
        @rename :classificationCento = :classification
    end
    dTmp = outerjoin(dInit, dNew, on = :transcript)
    CSV.write("variant_genomic.csv", dTmp)
end
fSuccess = "recoded-success.csv"
fFailures = "recoded-failures.csv"
# variantRecoder(fSuccess, fFailures)
# mobidetailsOnFailures(fFailures)
# lastTryMobidetails("recoded-failures-mobidetails.csv")
mergeAllGenomic()
#+end_src
***** DONE Format the data for simuscop
CLOSED: [2023-04-28 Fri 11:55] SCHEDULED: <2023-04-26 Wed>
**** TODO Extract the list of CNVs
SCHEDULED: <2023-04-17 Mon>
**** TODO Simuscop :simuscop:
***** DONE Train the model on 63003856/
CLOSED: [2023-04-29 Sat 19:56]
Rerun the model to be sure
***** DONE Generate fastq with simuscop (del and ins only) 20x
CLOSED: [2023-04-28 Fri 23:35] SCHEDULED: <2023-04-22 Sat>
****** DONE Generate a profile with the Centogene BED
CLOSED: [2023-04-28 Fri 11:54] SCHEDULED: <2023-04-22 Sat>
NA12878, but to be redone with real sequencing
See [[*Centogène][Centogene BED]] for the choice
****** DONE Generate the data at 20x
CLOSED: [2023-04-28 Fri 11:54] SCHEDULED: <2023-04-22 Sat>
Centogene capture
****** DONE Regenerate after removing duplicates
CLOSED: [2023-04-28 Fri 17:28]
***** DONE Which coverage?
CLOSED: [2023-04-29 Sat 18:26]
e.g. at chr11:16,014,966, where we have 11 reads in the simulation versus 200!
****** 200 is the closest
#+attr_html: :width 500px
[[./simuscop-200-chr1-1.png]]
#+attr_html: :width 500px
[[./simuscop-200-chr1-2.png]]
****** DONE 20x
CLOSED: [2023-04-29 Sat 15:38]
****** DONE 50x
CLOSED: [2023-04-29 Sat 15:38]
****** DONE 100x
CLOSED: [2023-04-29 Sat 15:39]
****** DONE 200x
CLOSED: [2023-04-29 Sat 15:39]
***** DONE Reads badly centered on small isolated exons
CLOSED: [2023-04-29 Sat 19:56] SCHEDULED: <2023-04-29 Sat>
Capture ok: [[https://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr1%3A153817168%2D153817824&hgsid=296556270_F4fkENLPXHXidi2oALXls2jxNH9l][UCSC]] (black track)
But the reads are badly distributed
#+attr_html: :width 800px
[[./simuscop-error.png]]
To test
- Profile problem?
  - wrong patient?
  - wrong generation? -> compare with those provided on GitHub
- chromosome names?
****** DONE [#A] Test on exon 6 of GATAD2B for NC_000001.11:g.153817496A>T
CLOSED: [2023-04-29 Sat 19:56] SCHEDULED: <2023-04-29 Sat>
******* DONE Configuration + Profile 63003856.profile: same, badly centered
CLOSED: [2023-04-29 Sat 19:18]
Downloading the data
#+begin_src sh :dir ~/code/bisonex/test-simuscop
scp meso:/Work/Projects/bisonex/data/genome/GRCh38.p14/genomeRef.fna .
scp meso:Work/Projects/bisonex/data/simuscop/*.profile .
scp -r meso:/Work/Projects/bisonex/data/genome/GRCh38.p13/bwa .
#+end_src
We retrieve the exon (NB: org-mode does not run the code...)
#+begin_src julia
using CSV,DataFramesMeta
d = CSV.read("VCGS_Exome_Covered_Targets_hg38_40.1MB_renamed.bed", header=false, delim="\t", DataFrame)
@subset d :Column1 .== "NC_000001.11" :Column2 .<= 153817496 :Column3 .>= 153817496
#+end_src
NC_000001.11  153817371  153817542
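The same lookup can be sketched without DataFrames. A minimal sketch that mirrors the Julia filter above (start <= position <= end, with no correction for the 0-based half-open BED convention); the function name is made up and the second interval in the example is invented for illustration:

#+begin_src python
import csv


def targets_containing(bed_lines, chrom, pos):
    """Return the (chrom, start, end) BED intervals that contain pos."""
    hits = []
    for row in csv.reader(bed_lines, delimiter="\t"):
        if len(row) < 3:
            continue  # skip empty or malformed lines
        start, end = int(row[1]), int(row[2])
        if row[0] == chrom and start <= pos <= end:
            hits.append((row[0], start, end))
    return hits


bed = [
    "NC_000001.11\t153817371\t153817542",
    "NC_000001.11\t153900000\t153900100",  # invented interval
]
hits = targets_containing(bed, "NC_000001.11", 153817496)
print(hits)
#+end_src

With a real file, an open file handle can be passed in place of the list.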
Generating the BED
#+begin_src sh :dir ~/code/bisonex/test-simuscop
echo -e "NC_000001.11\t153817371\t153817542" > gatad2b-exon6.bed
#+end_src
#+RESULTS:
Generating a variant
#+begin_src sh :dir ~/code/bisonex/test-simuscop
echo -e "s\tsingle\tNC_000001.11\t153817496\tA\tT\thet"> variant.txt
#+end_src
#+RESULTS:
Generating the config file
#+begin_src sh :dir ~/code/bisonex/test-simuscop
cat > config_wes.txt << EOL
ref = genomeRef.fna
profile = ./63003856.profile
variation = ./variant.txt
target = ./gatad2b-exon6.bed
layout = PE
threads = 1
name = single
output = test-gatad2b
coverage = 20
EOL
#+end_src
#+RESULTS:
We start the simulation
#+begin_src sh :dir ~/code/bisonex/test-simuscop
simuReads config_wes.txt
#+end_src
#+RESULTS:
Alignment
#+begin_src sh :dir ~/code/bisonex/test-simuscop
bwa mem -R '@RG\tID:sample\tSM:sample\tPL:ILLUMINA\tPM:Miseq\tCN:lol\tLB:definition_to_add' bwa/genomeRef test-gatad2b/single_1.fq  test-gatad2b/single_2.fq | samtools sort  -o single.bam
#+end_src
#+RESULTS:
******* DONE GitHub profile HiSeq2000
CLOSED: [2023-04-29 Sat 19:56]
#+begin_src sh :dir ~/code/bisonex/test-simuscop :results file
wget https://raw.githubusercontent.com/qasimyu/simuscop/master/testData/Illumina_HiSeq2000.profile
#+end_src
#+RESULTS:
#+begin_src sh :dir ~/code/bisonex/test-simuscop
cat > config_wes.txt << EOL
ref = genomeRef.fna
profile = ./Illumina_HiSeq2000.profile
variation = ./variant.txt
target = ./gatad2b-exon6.bed
layout = PE
threads = 1
name = single
output = test-gatad2b-hiseq2000
coverage = 20
EOL
simuReads config_wes.txt
bwa mem -R '@RG\tID:sample\tSM:sample\tPL:ILLUMINA\tPM:Miseq\tCN:lol\tLB:definition_to_add' bwa/genomeRef test-gatad2b-hiseq2000/single_1.fq  test-gatad2b-hiseq2000/single_2.fq | samtools sort  -o single-hiseq2000.bam
samtools index single-hiseq2000.bam
#+end_src
#+RESULTS:
******* KILL Test the example from GitHub
CLOSED: [2023-04-29 Sat 19:56]
#+begin_src sh
git clone https://github.com/qasimyu/simuscop/
cd simuscop
simuReads configFiles/config_test_wes.txt
#+end_src
******* KILL Center the window on the capture regions
CLOSED: [2023-04-30 Sun 13:28] SCHEDULED: <2023-04-29 Sat>
1000bp by default, which is larger than the capture regions...
Changing fragzip does not work
If we add an offset of 200bp on the exon, the result is even more stretched out
NC_000001.11 153817371 153817542 ->
NC_000001.11 153817171 153817742
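The padded interval is just the original one extended by 200bp on each side. A minimal sketch (the function name is made up), clamping the start at 0 so that targets near the beginning of a contig stay valid:

#+begin_src python
def pad_interval(chrom, start, end, padding):
    """Extend a BED interval by `padding` bases on both sides."""
    return chrom, max(0, start - padding), end + padding


padded = pad_interval("NC_000001.11", 153817371, 153817542, 200)
print("\t".join(str(field) for field in padded))
#+end_src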
What if we disable the targets?
Look at the targets on chromosome 1
#+begin_src sh :dir ~/code/bisonex/test-simuscop :results silent
scp meso:/Work/Projects/bisonex/data/simuscop/VCGS_Exome_Covered_Targets_hg38_40.1MB_renamed.bed .
#+end_src
#+begin_src sh :dir ~/code/bisonex/test-simuscop :results silent
head -n 100 VCGS_Exome_Covered_Targets_hg38_40.1MB_renamed.bed > 100exons.bed
echo -e "s\tsingle\tNC_000001.11\t153817496\tA\tT\thet"> variant.txt
cat > config_wes.txt << EOL
ref = genomeRef.fna
profile = ./63003856.profile
variation = ./variant.txt
layout = PE
threads = 4
target = 100exons.bed
name = single
output = test-gatad2b
coverage = 200
EOL
./simuscop/bin/simuReads config_wes.txt
bwa mem bwa/genomeRef test-gatad2b/single_1.fq  test-gatad2b/single_2.fq | samtools sort  -o single.bam
samtools index single.bam
#+end_src
***** KILL Check that all variants are recovered at 200x: hg38
CLOSED: [2023-06-12 Mon 23:25]
****** DONE After alignment
CLOSED: [2023-04-29 Sat 18:27] SCHEDULED: <2023-04-28 Fri>
******* DONE SNV: with duplicates
CLOSED: [2023-04-28 Fri 18:12]
We use [[file:~/recherche/bisonex/simuscop/checkBam.jl][checkBam.jl]]
#+begin_src julia
d = prepareVariant("../parsevariants/variant_genomic.csv")
root = "/home/alex/code/bisonex/simuscop-cento/cento"
bam = root * "/preprocessing/applybqsr/cento.bam"
bai = root * "/preprocessing/recalibrated/cento.bam.bai"
snv = getSNV(d, bam, bai)
#+end_src
Many false homozygotes
Checking with checkFalseHemizygous(snv): many duplicates in the file for simuscop...
******* DONE SNV without duplicates
CLOSED: [2023-04-29 Sat 18:27]
******** DONE 18 false homozygotes but with few reads
CLOSED: [2023-04-29 Sat 18:27]
julia> @subset snv :refCount .== 0 :altCount .> 0 :zygosity .== "heterozygous"
18×10 DataFrame
 Row │ chrom         pos        variant         variantType  zygosity      ref        alt        refCount  altCount  readsCount
     │ SubStrin…?    Int64      SubStrin…?      String?      String15      SubStrin…  SubStrin…  Int64     Int64     Int64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ NC_000022.11   42213078  g.42213078T>G   snv          heterozygous  T          G                 0         1           1
   2 │ NC_000012.12  101680427  g.101680427C>A  snv          heterozygous  C          A                 0         3           3
   3 │ NC_000014.9   105385684  g.105385684G>C  snv          heterozygous  G          C                 0         4           4
   4 │ NC_000011.10  125978299  g.125978299C>T  snv          heterozygous  C          T                 0         3           3
   5 │ NC_000023.11   77998618  g.77998618C>T   snv          heterozygous  C          T                 0         2           2
   6 │ NC_000015.10   66703292  g.66703292C>T   snv          heterozygous  C          T                 0         3           3
   7 │ NC_000010.11   87961118  g.87961118G>A   snv          heterozygous  G          A                 0         3           3
   8 │ NC_000012.12  112477719  g.112477719A>G  snv          heterozygous  A          G                 0         2           2
   9 │ NC_000020.11    6778406  g.6778406C>T    snv          heterozygous  C          T                 0         3           3
  10 │ NC_000023.11   68192943  g.68192943G>A   snv          heterozygous  G          A                 0         2           2
  11 │ NC_000004.12     987858  g.987858C>T     snv          heterozygous  C          T                 0         3           4
  12 │ NC_000015.10   66435145  g.66435145G>A   snv          heterozygous  G          A                 0         1           2
  13 │ NC_000002.12   47809595  g.47809595C>T   snv          heterozygous  C          T                 0         2           2
  14 │ NC_000003.12  136477305  g.136477305C>G  snv          heterozygous  C          G                 0         4           4
  15 │ NC_000005.10  157285458  g.157285458C>T  snv          heterozygous  C          T                 0         3           3
  16 │ NC_000012.12   23604413  g.23604413T>G   snv          heterozygous  T          G                 0         5           5
  17 │ NC_000019.10   52219703  g.52219703C>T   snv          heterozygous  C          T                 0         1           1
  18 │ NC_000016.10   88856757  g.88856757C>T   snv          heterozygous  C          T                 0         8           8
******** DONE 8 not recovered => probably outside the capture region
CLOSED: [2023-04-28 Fri 19:49]
julia> @subset snv :refCount .== 0 :altCount .== 0
8×10 DataFrame
 Row │ chrom         pos        variant         variantType  zygosity      ref        alt        refCount  altCount  readsCount
     │ SubStrin…?    Int64      SubStrin…?      String?      String15      SubStrin…  SubStrin…  Int64     Int64     Int64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ NC_000015.10   74343027  g.74343027C>T   snv          heterozygous  C          T                 0         0           0
   2 │ NC_000011.10   20638345  g.20638345A>G   snv          heterozygous  A          G                 0         0           0
   3 │ NC_000004.12  139370252  g.139370252C>T  snv          heterozygous  C          T                 0         0           2
   4 │ NC_000017.11   61966475  g.61966475G>T   snv          heterozygous  G          T                 0         0           0
   5 │ NC_000019.10   54144058  g.54144058G>A   snv          heterozygous  G          A                 0         0           0
   6 │ NC_000023.11   77635947  g.77635947A>G   snv          hemizygous    A          G                 0         0           0
   7 │ NC_000005.10    1258495  g.1258495G>A    snv          heterozygous  G          A                 0         0           0
   8 │ NC_000012.12    2449086  g.2449086C>G    snv          heterozygous  C          G                 0         0           0
****** KILL After haplotypecaller
CLOSED: [2023-06-12 Mon 23:24]
******* KILL 20x
CLOSED: [2023-04-29 Sat 15:39]
183 out of 766 missing
[[file:~/recherche/bisonex/simuscop/checkVCF.jl][checkVCF.jl]]
#+begin_src julia
@subset leftjoin(d2, dHaplo2, on=:genomic) ismissing.(:Column1)
#+end_src
Depth problem?
E.g. chr13, number of 101081606
NC_000011.10   16014966  g.16014966G>A
1 read out of 11 for the alternate allele
On the reference patient, 202 reads!
This one is not in the capture file (nor in the bam!)
e.g. NC_000015.10   74343027  g.74343027C>T
For the others, we should recover them...
Check the number of reads on 63003856
Also check the parametrization of the model
******* DONE [#B] 200x
CLOSED: [2023-05-18 Thu 11:04] SCHEDULED: <2023-04-30 Sun>
120 missing (99 without duplicates)!
We check in IGV (vcf + bam after alignment):
******** snv NC_000015.10   74343027
- nothing called
- not a repeated region
- base quality (see [[*Phred score][Phred score]] ) is 37, so ok
- variant found at 26/42
- Bam after applybqsr: base quality is 35, so ok
chr15 also at 89318565, variant found at 25/33 with a base quality of 37
Do not forget to load the AVX instructions
#+begin_src sh
module load gcc@11.3.0/gcc-12.1.0
#+end_src
We split the .bam by chromosome to debug (on the mesocentre)
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/simuscop-cento-200x/cento/testing :results silent
ln -s ../preprocessing/applybqsr/cento.bam .
ln -s ../preprocessing/recalibrated/cento.bam.bai .
ln -s /Work/Projects/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz .
ln -s /Work/Projects/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz.tbi .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.dict .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna.fai .
#+end_src
We have to run this by hand (org-mode does not know the path to samtools)
samtools view -b cento.bam NC_000015.10 > cento_chr15.bam
samtools index cento_chr15.bam
Then we restrict to chromosome 15
samtools faidx genomeRef.fna NC_000015.10 > genomeRef_chr15.fa
samtools faidx genomeRef_chr15.fa
gatk CreateSequenceDictionary -R genomeRef_chr15.fa -O genomeRef_chr15.dict
We restrict to chromosome 15 with the -L option (takes about 1 min)
gatk --java-options "-Xmx3072M" HaplotypeCaller --input cento_chr15.bam \
    --output test.vcf.gz --reference genomeRef.fna --dbsnp dbSNP.gz --tmp-dir . --max-mnp-distance 2 -L NC_000015.10
******** DONE HaplotypeCaller tutorial
CLOSED: [2023-05-01 Mon 19:58]
Procedure: https://gatk.broadinstitute.org/hc/en-us/articles/360043491652-When-HaplotypeCaller-and-Mutect2-do-not-call-an-expected-variant
********* DONE Remove --max-mnp-distance = 2: same
CLOSED: [2023-04-30 Sun 15:42]
********* DONE --debug &> run.log: not called...
CLOSED: [2023-04-30 Sun 15:52]
********* DONE --linked-de-bruijn-graph: same
CLOSED: [2023-04-30 Sun 15:55]
********* DONE --recover-all-dangling-branches
CLOSED: [2023-04-30 Sun 16:01]
********* DONE --min-pruning 0: more calls but not this one
CLOSED: [2023-04-30 Sun 15:59]
********* DONE --bam-output
CLOSED: [2023-04-30 Sun 16:50]
********** DONE : nothing!
CLOSED: [2023-04-30 Sun 16:08]
********** DONE + --recover-all-dangling-branches: nothing!
CLOSED: [2023-04-30 Sun 16:08]
********* DONE Filtered data? Apparently not
CLOSED: [2023-04-30 Sun 16:41]
183122 read(s) filtered by: MappingQualityReadFilter
3674 read(s) filtered by: NotDuplicateReadFilter
********** DONE --disable-read-filter MappingQualityReadFilter: same
CLOSED: [2023-04-30 Sun 16:34]
We do get: 0 read(s) filtered by: MappingQualityAvailableReadFilter
********** DONE --disable-read-filter NotDuplicateReadFilter: same
CLOSED: [2023-04-30 Sun 16:40]
********* DONE Try freebayes: same
CLOSED: [2023-04-30 Sun 16:22]
freebayes -f genomeRef.fna -r NC_000015.10 cento_chr15.bam > freebayes-test-chr15.vcf
********* DONE With all the options: same
CLOSED: [2023-04-30 Sun 16:50]
--linked-de-bruijn-graph --recover-all-dangling-branches --min-pruning 0 --bam-output debug.bam
********* DONE Check that we are looking at the same bam: yes
CLOSED: [2023-04-30 Sun 16:50]
********* DONE Disable dbSNP: same
CLOSED: [2023-04-30 Sun 16:52]
********* DONE Change kmer size: same
CLOSED: [2023-04-30 Sun 16:56]
for example [[https://gatk.broadinstitute.org/hc/en-us/community/posts/360075653152-REAL-Variant-not-called-by-HaplotypeCaller][GATK forum]] --kmer-size 18 --kmer-size 22
********* DONE --adaptive-pruning true
CLOSED: [2023-05-01 Mon 19:57]
******** DONE Mapping quality: it is 0!!!!
CLOSED: [2023-05-01 Mon 19:58]
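A quick way to spot this earlier is to tabulate the MAPQ column (5th field) of the SAM records around the variant. A minimal sketch working on SAM text lines such as those printed by =samtools view=; the function name is hypothetical and the three records below are invented for illustration:

#+begin_src python
from collections import Counter


def mapq_histogram(sam_lines):
    """Count reads per mapping quality from SAM text records."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a valid alignment record
        counts[int(fields[4])] += 1
    return counts


# Invented records (read name, flag, contig, position, MAPQ, ...)
records = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t99\tNC_000015.10\t74342990\t0\t100M\t=\t74343100\t210\t*\t*",
    "r2\t163\tNC_000015.10\t74343001\t0\t100M\t=\t74343120\t219\t*\t*",
    "r3\t99\tNC_000015.10\t74343010\t60\t100M\t=\t74343130\t220\t*\t*",
]
hist = mapq_histogram(records)
print(dict(hist))
#+end_src

A histogram dominated by MAPQ 0 is consistent with the 183122 reads reported as filtered by MappingQualityReadFilter in the log above.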
******* KILL Compare VCF with vcfeval :haplotypecaller:
CLOSED: [2023-06-12 Mon 23:24]
We prepare the data in Julia
#+begin_src sh :dir ~/recherche/bisonex/simuscop
julia --project=. toVCF.jl
#+end_src
Then we export to the mesocentre
#+begin_src sh
scp variants_for_vcfeval.tsv.gz* meso:cento_variants/
#+end_src
#+begin_src sh
z bis
cd simuscop-200x
rtg vcfeval -b ~/cento_variants/variants_for_vcfeval.tsv.gz -c cento/variantCalling/haplotypecaller/cento.vcf.gz -o compare-haplotypecaller -t /Work/Groups/bisonex/data/giab/GRCh38/genomeRef.sdf
#+end_src
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
   82.000                540            540         60         45     0.9000       0.9231     0.9114
     None                546            546        329         39     0.6240       0.9333     0.7479
******* KILL Compare with hap.py :haplotypecaller:
CLOSED: [2023-06-12 Mon 23:24]
 NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/checkInserted.nf -profile standard,helios --outdir=compare-simuscop-200x  --query=out/simuscop-cento-200x/cento/callVariant/haplotypecaller/cento.vcf.gz --truth=cento_variants/variants_for_vcfeval.tsv.gz --id=simuscop-200x-check
******* DONE Naive method 549/585
CLOSED: [2023-05-04 Thu 21:57]
Haplotypecaller: Nb reference SNV 692 vs found 585
Variant calling, filter technical: reference SNV 692 vs found 521
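What the naive method does is not spelled out here. One plausible reading, given purely as an assumption, is an exact match on (chromosome, position, ref, alt) between the inserted variants and the calls, with no normalization or haplotype-aware comparison as in vcfeval or hap.py. A minimal sketch with a hypothetical function name; the second variant is invented so that the example has one miss:

#+begin_src python
def naive_recovered(reference, called):
    """Return the reference variants found verbatim among the calls."""
    return set(reference) & set(called)


reference = [
    ("NC_000015.10", 74343027, "C", "T"),
    ("NC_000011.10", 16014966, "G", "A"),
]
called = [
    ("NC_000011.10", 16014966, "G", "A"),
    ("NC_000001.11", 153817496, "A", "T"),  # invented extra call
]
found = naive_recovered(reference, called)
print(f"{len(found)}/{len(reference)} recovered")
#+end_src

Such a comparison misses variants that are represented differently in the VCF (for example merged into an MNP), so it can give a different count from vcfeval.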
****** KILL Before annotation
CLOSED: [2023-06-12 Mon 23:25] SCHEDULED: <2023-04-28 Fri>
#+begin_src sh
cd cento/variantCalling
bgzip filter-technical.vcf
tabix -p vcf filter-technical.vcf.gz -f
#+end_src
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
   12.000                519            519         55         66     0.9042       0.8872     0.8956
     None                519            519         55         66     0.9042       0.8872     0.8956
******* DONE Naive method: 521/585
CLOSED: [2023-05-04 Thu 21:57]
Haplotypecaller: Nb reference SNV 692 vs found 585
Variant calling, filter technical: reference SNV 692 vs found 521
******* KILL Compare with hap.py
CLOSED: [2023-06-12 Mon 23:24]
****** KILL After annotation filter
CLOSED: [2023-06-12 Mon 23:25]
******* DONE Naive method: 493/585
CLOSED: [2023-05-04 Thu 22:09]
******* KILL Compare with hap.py
CLOSED: [2023-06-12 Mon 23:25]
******* KILL vcfeval
CLOSED: [2023-06-12 Mon 23:25]
 cd cento/annotation/
 bgzip postvep-filter.vcf
 tabix postvep-filter.vcf.gz
 cd ../..
 rtg vcfeval -b ~/cento_variants/variants_for_vcfeval.tsv.gz -c cento/annotation/postvep-filter.vcf.gz  -o compare-vepfilter -t /Work/Groups/bisonex/data/giab/GRCh38/genomeRef.sdf
 Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
   12.000                491            491         50         94     0.9076       0.8393     0.8721
     None                491            491         50         94     0.9076       0.8393     0.8721
***** TODO Check that all variants are recovered at 200x: hg38 :T2T:
****** DONE After alignment
CLOSED: [2023-04-29 Sat 18:27] SCHEDULED: <2023-04-28 Fri>
******* DONE SNV: with duplicates
CLOSED: [2023-04-28 Fri 18:12]
We use [[file:~/recherche/bisonex/simuscop/checkBam.jl][checkBam.jl]]
#+begin_src julia
d = prepareVariant("../parsevariants/variant_genomic.csv")
root = "/home/alex/code/bisonex/simuscop-cento/cento"
bam = root * "/preprocessing/applybqsr/cento.bam"
bai = root * "/preprocessing/recalibrated/cento.bam.bai"
snv = getSNV(d, bam, bai)
#+end_src
Many false homozygotes
Checking with checkFalseHemizygous(snv): many duplicates in the file for simuscop...
******* DONE SNV without duplicates
CLOSED: [2023-04-29 Sat 18:27]
******** DONE 18 false homozygotes, but with few reads
CLOSED: [2023-04-29 Sat 18:27]
julia> @subset snv :refCount .== 0 :altCount .> 0 :zygosity .== "heterozygous"
18×10 DataFrame
 Row │ chrom         pos        variant         variantType  zygosity      ref        alt        refCount  altCount  readsCount
     │ SubStrin…?    Int64      SubStrin…?      String?      String15      SubStrin…  SubStrin…  Int64     Int64     Int64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ NC_000022.11   42213078  g.42213078T>G   snv          heterozygous  T          G                 0         1           1
   2 │ NC_000012.12  101680427  g.101680427C>A  snv          heterozygous  C          A                 0         3           3
   3 │ NC_000014.9   105385684  g.105385684G>C  snv          heterozygous  G          C                 0         4           4
   4 │ NC_000011.10  125978299  g.125978299C>T  snv          heterozygous  C          T                 0         3           3
   5 │ NC_000023.11   77998618  g.77998618C>T   snv          heterozygous  C          T                 0         2           2
   6 │ NC_000015.10   66703292  g.66703292C>T   snv          heterozygous  C          T                 0         3           3
   7 │ NC_000010.11   87961118  g.87961118G>A   snv          heterozygous  G          A                 0         3           3
   8 │ NC_000012.12  112477719  g.112477719A>G  snv          heterozygous  A          G                 0         2           2
   9 │ NC_000020.11    6778406  g.6778406C>T    snv          heterozygous  C          T                 0         3           3
  10 │ NC_000023.11   68192943  g.68192943G>A   snv          heterozygous  G          A                 0         2           2
  11 │ NC_000004.12     987858  g.987858C>T     snv          heterozygous  C          T                 0         3           4
  12 │ NC_000015.10   66435145  g.66435145G>A   snv          heterozygous  G          A                 0         1           2
  13 │ NC_000002.12   47809595  g.47809595C>T   snv          heterozygous  C          T                 0         2           2
  14 │ NC_000003.12  136477305  g.136477305C>G  snv          heterozygous  C          G                 0         4           4
  15 │ NC_000005.10  157285458  g.157285458C>T  snv          heterozygous  C          T                 0         3           3
  16 │ NC_000012.12   23604413  g.23604413T>G   snv          heterozygous  T          G                 0         5           5
  17 │ NC_000019.10   52219703  g.52219703C>T   snv          heterozygous  C          T                 0         1           1
  18 │ NC_000016.10   88856757  g.88856757C>T   snv          heterozygous  C          T                 0         8           8
******** DONE 8 not found => probably outside the capture region
CLOSED: [2023-04-28 Fri 19:49]
julia> @subset snv :refCount .== 0 :altCount .== 0
8×10 DataFrame
 Row │ chrom         pos        variant         variantType  zygosity      ref        alt        refCount  altCount  readsCount
     │ SubStrin…?    Int64      SubStrin…?      String?      String15      SubStrin…  SubStrin…  Int64     Int64     Int64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ NC_000015.10   74343027  g.74343027C>T   snv          heterozygous  C          T                 0         0           0
   2 │ NC_000011.10   20638345  g.20638345A>G   snv          heterozygous  A          G                 0         0           0
   3 │ NC_000004.12  139370252  g.139370252C>T  snv          heterozygous  C          T                 0         0           2
   4 │ NC_000017.11   61966475  g.61966475G>T   snv          heterozygous  G          T                 0         0           0
   5 │ NC_000019.10   54144058  g.54144058G>A   snv          heterozygous  G          A                 0         0           0
   6 │ NC_000023.11   77635947  g.77635947A>G   snv          hemizygous    A          G                 0         0           0
   7 │ NC_000005.10    1258495  g.1258495G>A    snv          heterozygous  G          A                 0         0           0
   8 │ NC_000012.12    2449086  g.2449086C>G    snv          heterozygous  C          G                 0         0           0
****** KILL After HaplotypeCaller
CLOSED: [2023-06-12 Mon 23:24]
******* KILL 20x
CLOSED: [2023-04-29 Sat 15:39]
183 missing out of 766
[[file:~/recherche/bisonex/simuscop/checkVCF.jl][checkVCF.jl]]
#+begin_src julia
@subset leftjoin(d2, dHaplo2, on=:genomic) ismissing.(:Column1)
#+end_src
Depth problem?
E.g. chr13 at 101081606
NC_000011.10   16014966  g.16014966G>A
1 read out of 11 for the alternate allele
On the reference patient, 202 reads!
This one is not in the capture file (nor in the BAM!)
E.g. NC_000015.10   74343027  g.74343027C>T
The others should be recovered...
Check the number of reads on 63003856
Also check the model parameterization
******* DONE [#B] 200x
CLOSED: [2023-05-18 Thu 11:04] SCHEDULED: <2023-04-30 Sun>
120 missing (99 without duplicates)!
Checked in IGV (VCF + BAM after alignment):
******** snv NC_000015.10   74343027
- nothing called
- not a repeated region
- base quality (see [[*Phred score][Phred score]]) is 37, so OK
- variant found at 26/42
- BAM after ApplyBQSR: base quality 35, so OK
chr15 also at 89318565: variant found at 25/33 with a base quality of 37
Do not forget to load the AVX instructions
#+begin_src sh
module load gcc@11.3.0/gcc-12.1.0
#+end_src
Split the .bam by chromosome for debugging (on the mesocentre)
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/simuscop-cento-200x/cento/testing :results silent
ln -s ../preprocessing/applybqsr/cento.bam .
ln -s ../preprocessing/recalibrated/cento.bam.bai .
ln -s /Work/Projects/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz .
ln -s /Work/Projects/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz.tbi .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.dict .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna.fai .
#+end_src
This has to be run by hand (org-mode does not know the path to samtools)
samtools view -b cento.bam NC_000015.10 > cento_chr15.bam
samtools index cento_chr15.bam
Then restrict the reference to chromosome 15
samtools faidx genomeRef.fna NC_000015.10 > genomeRef_chr15.fa
samtools faidx genomeRef_chr15.fa
gatk CreateSequenceDictionary -R genomeRef_chr15.fa -O genomeRef_chr15.dict
Restrict the call to chromosome 15 with the -L option (takes about 1 min)
gatk --java-options "-Xmx3072M" HaplotypeCaller --input cento_chr15.bam \
    --output test.vcf.gz --reference genomeRef.fna --dbsnp dbSNP.gz --tmp-dir . --max-mnp-distance 2 -L NC_000015.10
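To check quickly whether an expected site made it into the resulting VCF without opening IGV, the decompressed records can be scanned for a given contig and position. A minimal sketch, assuming the VCF is streamed on standard input; the helper name =vcf_has_site= is hypothetical, and it only matches records whose POS is exactly the requested position (an indel or MNP anchored one base earlier would be missed):

#+begin_src sh
# Hypothetical helper: exit 0 if the VCF on standard input contains a record
# at the given contig and position, exit 1 otherwise.
vcf_has_site() {
    awk -v chrom="$1" -v pos="$2" '
        BEGIN { FS = "\t" }
        !/^#/ && $1 == chrom && $2 == pos { found = 1 }
        END { exit !found }
    '
}
#+end_src

Usage on the output above: =gzip -dc test.vcf.gz | vcf_has_site NC_000015.10 74343027 && echo called=.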
******** DONE HaplotypeCaller tutorial
CLOSED: [2023-05-01 Mon 19:58]
Procedure: https://gatk.broadinstitute.org/hc/en-us/articles/360043491652-When-HaplotypeCaller-and-Mutect2-do-not-call-an-expected-variant
********* DONE Remove --max-mnp-distance 2: same result
CLOSED: [2023-04-30 Sun 15:42]
********* DONE --debug &> run.log: not called...
CLOSED: [2023-04-30 Sun 15:52]
********* DONE --linked-de-bruijn-graph: same result
CLOSED: [2023-04-30 Sun 15:55]
********* DONE --recover-all-dangling-branches
CLOSED: [2023-04-30 Sun 16:01]
********* DONE --min-pruning 0: more calls, but not this one
CLOSED: [2023-04-30 Sun 15:59]
********* DONE --bam-output
CLOSED: [2023-04-30 Sun 16:50]
********** DONE : nothing!
CLOSED: [2023-04-30 Sun 16:08]
********** DONE + --recover-all-dangling-branches: nothing!
CLOSED: [2023-04-30 Sun 16:08]
********* DONE Filtered data? Apparently not
CLOSED: [2023-04-30 Sun 16:41]
183122 read(s) filtered by: MappingQualityReadFilter
3674 read(s) filtered by: NotDuplicateReadFilter
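Summary lines of this form can be pulled out of a long run log to see at a glance which read filters actually removed reads. A minimal sketch, assuming the log is read on standard input and that the filter name is the last whitespace-separated token of the line (compound filter names would be cut short); the helper name =filtered_reads= is hypothetical:

#+begin_src sh
# Hypothetical helper: print "count filter" for every read filter that
# removed at least one read, largest count first.
filtered_reads() {
    awk '/read\(s\) filtered by:/ {
        # The count is the token right before "read(s)", wherever it sits
        # on the line (a log prefix may precede it).
        for (i = 2; i <= NF; i++)
            if ($i == "read(s)" && $(i - 1) > 0) print $(i - 1), $NF
    }' | sort -k1,1nr
}
#+end_src

For example =filtered_reads < run.log=, where run.log is the file written by the --debug run.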
********** DONE --disable-read-filter MappingQualityReadFilter: same result
CLOSED: [2023-04-30 Sun 16:34]
We do get: 0 read(s) filtered by: MappingQualityAvailableReadFilter
********** DONE --disable-read-filter NotDuplicateReadFilter: same result
CLOSED: [2023-04-30 Sun 16:40]
********* DONE Try freebayes: same result
CLOSED: [2023-04-30 Sun 16:22]
freebayes -f genomeRef.fna -r NC_000015.10 cento_chr15.bam > freebayes-test-chr15.vcf
********* DONE With all the options: same result
CLOSED: [2023-04-30 Sun 16:50]
--linked-de-bruijn-graph --recover-all-dangling-branches --min-pruning 0 --bam-output debug.bam
********* DONE Check that we are looking at the same BAM: yes
CLOSED: [2023-04-30 Sun 16:50]
********* DONE Disable dbSNP: same result
CLOSED: [2023-04-30 Sun 16:52]
********* DONE Change kmer size: same result
CLOSED: [2023-04-30 Sun 16:56]
For example ([[https://gatk.broadinstitute.org/hc/en-us/community/posts/360075653152-REAL-Variant-not-called-by-HaplotypeCaller][GATK forum]]): --kmer-size 18 --kmer-size 22
********* DONE --adaptive-pruning true
CLOSED: [2023-05-01 Mon 19:57]
******** DONE Mapping quality: it is 0!
CLOSED: [2023-05-01 Mon 19:58]
******* KILL Compare VCF with vcfeval :haplotypecaller:
CLOSED: [2023-06-12 Mon 23:24]
Prepare the data in Julia
#+begin_src sh :dir ~/recherche/bisonex/simuscop
julia --project=. toVCF.jl
#+end_src
Then export to the mesocentre
#+begin_src sh
scp variants_for_vcfeval.tsv.gz* meso:cento_variants/
#+end_src
#+begin_src sh
z bis
cd simuscop-200x
rtg vcfeval -b ~/cento_variants/variants_for_vcfeval.tsv.gz -c cento/variantCalling/haplotypecaller/cento.vcf.gz -o compare-haplotypecaller -t /Work/Groups/bisonex/data/giab/GRCh38/genomeRef.sdf
#+end_src
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
   82.000                540            540         60         45     0.9000       0.9231     0.9114
     None                546            546        329         39     0.6240       0.9333     0.7479
******* KILL Compare with hap.py :haplotypecaller:
CLOSED: [2023-06-12 Mon 23:24]
 NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/checkInserted.nf -profile standard,helios --outdir=compare-simuscop-200x  --query=out/simuscop-cento-200x/cento/callVariant/haplotypecaller/cento.vcf.gz --truth=cento_variants/variants_for_vcfeval.tsv.gz --id=simuscop-200x-check
******* DONE Naive method: 549/585
CLOSED: [2023-05-04 Thu 21:57]
Haplotypecaller: Nb reference SNV 692 vs found 585
Variant calling, filter technical: reference SNV 692 vs found 521
****** KILL Before annotation
CLOSED: [2023-06-12 Mon 23:25] SCHEDULED: <2023-04-28 Fri>
#+begin_src sh
cd cento/variantCalling
bgzip filter-technical.vcf
tabix -p vcf filter-technical.vcf.gz -f
#+end_src
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
   12.000                519            519         55         66     0.9042       0.8872     0.8956
     None                519            519         55         66     0.9042       0.8872     0.8956
******* DONE Naive method: 521/585
CLOSED: [2023-05-04 Thu 21:57]
Haplotypecaller: Nb reference SNV 692 vs found 585
Variant calling, filter technical: reference SNV 692 vs found 521
******* KILL Compare with hap.py
CLOSED: [2023-06-12 Mon 23:24]
****** KILL After annotation filter
CLOSED: [2023-06-12 Mon 23:25]
******* DONE Naive method: 493/585
CLOSED: [2023-05-04 Thu 22:09]
******* KILL Compare with hap.py
CLOSED: [2023-06-12 Mon 23:25]
******* KILL vcfeval
CLOSED: [2023-06-12 Mon 23:25]
 cd cento/annotation/
 bgzip postvep-filter.vcf
 tabix postvep-filter.vcf.gz
 cd ../..
 rtg vcfeval -b ~/cento_variants/variants_for_vcfeval.tsv.gz -c cento/annotation/postvep-filter.vcf.gz  -o compare-vepfilter -t /Work/Groups/bisonex/data/giab/GRCh38/genomeRef.sdf
 Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
   12.000                491            491         5

Replacement in projects/bisonex.org at line 38 [3.35]

B:BD[14.69015] → [14.69015:77207]

gz --reference genomeRef.fna  --tmp-dir . -L NC_000015.10
#+end_src
scp meso:/Work/Users/apraga/bisonex/tests/synthetic/testchr15.vcf.gz haplotypecaller-chr15.vcf.gz
No variant inserted
- base quality ok
  -
******* DONE BAM output: not called
CLOSED: [2023-05-01 Mon 21:57]
gatk --java-options "-Xmx3072M" HaplotypeCaller --input 63003856_S135_chr15_inserted.bam     --output haplotypecaller-chr15.vcf.gz --reference genomeRef.fna  --tmp-dir . -L NC_000015.10  --bam-output debug.bam
******* DONE --linked-de-bruijn-graph: same result
CLOSED: [2023-05-01 Mon 21:57]
readlink testchr15.vcf.gz -f^C
[apraga@mesointeractive synthetic]$ gatk --java-options "-Xmx3072M" HaplotypeCaller --input 63003856_S135_chr15_inserted.bam     --output haplotypecaller-chr15.vcf.gz --reference genomeRef.fna  --tmp-dir . -L NC_000015.10  --linked-de-bruijn-graph
******* KILL Regenerate FASTQ
CLOSED: [2023-05-13 Sat 18:29]
No
****** KILL Generate BAM data for all chromosomes
CLOSED: [2023-05-13 Sat 18:29]
 timeit julia -Jbisonex.so --project=. insertVariants.jl ~/code/bisonex/out/63003856/preprocessing/63003856_S135.bam 63003856_S135_inserted.bam
40min 516ms 835µs 405ns
Warning:
 [W::bam_hdr_read] EOF marker is absent. The input is probably truncated
inserted.bam and excluded.bam (the files before the merge) look OK...
Retrying by hand: it works
#+begin_src sh
samtools merge test-all.bam inserted.bam excluded.bam
mv -f test-all.bam 63003856_S135_inserted.bam
mv -f test-all.bam.bai 63003856_S135_chr15_inserted.bam.bai
#+end_src
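The "EOF marker is absent" warning refers to the 28-byte empty BGZF block that terminates a complete BAM file, so a truncated output can be detected before running anything downstream. A minimal sketch, assuming only POSIX =tail=, =od= and =tr=; the helper name =has_bgzf_eof= is hypothetical, and =samtools quickcheck= performs a similar test when samtools is on the path:

#+begin_src sh
# Hypothetical helper: succeed only if the file ends with the 28-byte empty
# BGZF block that marks the end of a complete BAM file.
has_bgzf_eof() {
    expected="1f8b08040000000000ff0600424302001b0003000000000000000000"
    # Dump the last 28 bytes as one contiguous lowercase hex string.
    actual=$(tail -c 28 "$1" | od -An -v -tx1 | tr -d ' \n')
    [ "$actual" = "$expected" ]
}
#+end_src

For example =has_bgzf_eof 63003856_S135_inserted.bam || echo truncated=.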
****** DONE bam2fastq to get an up-to-date CIGAR: failed (variants "hidden")
CLOSED: [2023-05-04 Thu 20:30] SCHEDULED: <2023-05-01 Mon>
Run the BAM generation from the mesocentre (the copy crashes over the VPN)
#+begin_src sh
cd /Work/Users/apraga/recherche/bisonex/generate
julia --project=. insertVariants.jl  ../../../bisonex/out/63003856_S135/preprocessing/applybqsr/63003856_S135.bam 63003856_S135_inserted.bam
#+end_src
Workflow afterwards, with storeDir disabled for SAMTOOLS_BAM2FQ in nextflow.config (why??)
#+begin_src nextflow
include { SAMTOOLS_BAM2FQ }                            from "${params.modulesDir}/samtools/bam2fq/main"
include { SAMTOOLS_SORT as sortBamByName }             from "${params.modulesDir}/samtools/sort/main"
workflow {
    f = Channel.fromPath("${params.dataDir}/synthetic/63003856_S135_inserted.bam",
                         checkIfExists: true).map{it -> [["id": "synthetic_63003856"], it]}
    // Important: use "-n" option !!
    sortBamByName(f)
    SAMTOOLS_BAM2FQ(sortBamByName.out.bam, true)
}
#+end_src
Then
#+begin_src sh
cp work/34/fb2fc136f6f6d7f42d0960512f06de/*.fq.gz /Work/Groups/bisonex/data/synthetic/
#+end_src
****** KILL Run the pipeline
CLOSED: [2023-05-04 Thu 20:30] SCHEDULED: <2023-05-01 Mon>
NXF_OPTS=-D"user.name=apraga" nextflow run   main.nf -c nextflow.config  -profile standard,helios -bg --input="/Work/Groups/bisonex/data/synthetic/synthetic_63003856_{1,2}.fq.gz" --outdir out/synthetic_63003856
**** HOLD Bamsurgeon :bamsurgeon:
***** HOLD Nix package
1. Patch the reference genome lookup so that the indexes are found (using a regexp, like nf-core)
2. Add the path to picard to the arguments
3. -O3 option for performance
****** DONE Error ValueError: quality and sequence mismatch
CLOSED: [2023-05-19 Fri 18:44]
******* DONE Same with the latest version on GitHub
CLOSED: [2023-05-18 Thu 14:36]
******* DONE Version 1.3: OK, but
CLOSED: [2023-05-19 Fri 18:44]
Test on chr22: variants OK but VAF=1...
Runs "normally" (failure according to nextflow), but the output BAM is almost empty
Merging the BAM containing the variants with the reference file did not work correctly.
******** DONE Run replacereads.py by hand
CLOSED: [2023-05-18 Thu 21:41]
In /Work/Users/apraga/bisonex/work/f3/ce044f80ca91016d68d1bc4f4f5301
#+begin_src sh
 /nix/store/xw277la6w4sjqlsvw9h32cvrlacrfkgm-python3-3.10.9-env/bin/python3.10 /nix/store/abzangf0q8k37053p776cfkw181dzjn3-bamsurgeon-1.3/bin/bamsurgeon/replacereads.py -b cento.bam -r addsnv.e14561be-4fdd-45cc-9989-048ab6da6cc6.muts.bam -o snv-manual.bam
#+end_src
Then
#+begin_src sh
samtools sort snv-manual.bam -o snv-manual-sorted.bam
#+end_src
Seems to work
****** HOLD Fix the nextflow run to avoid the errors
Too many error messages in the output?
****** DONE Test on mini-BAM: failed
CLOSED: [2023-05-14 Sun 21:12]
❯ samtools view -h ~/code/bisonex/simuscop-centogene-200x/cento/preprocessing/mapped/cento.bam | head -n1000 | samtools view -Sb - > mini.bam
❯ samtools index mini.bam
Without specifying the variant:
#+begin_quote
NC_000001.11	17651	17651
#+end_quote
./result/bin/addsnv -v snv.txt -f mini.bam -r ../data/genomeRef.fna -o test.bam
****** DONE Test chr22
CLOSED: [2023-05-15 Mon 23:24]
Not enough reads, so we take chromosome 22
#+begin_src sh
samtools view ../simuscop-centogene-200x/cento/preprocessing/mapped/cento.bam NC_000022.11 -b -o chr22.bam
samtools index chr22.bam
#+end_src
On the mesocentre, in tests/bamsurgeon
#+begin_src sh
addsnv -v snv.txt -f chr22.bam -r ../genomeRef.fna -o test.bam --aligner mem
#+end_src
******* DONE Random SNV:
CLOSED: [2023-05-15 Mon 23:13]
NC_000022.11	17499704	17499704    0.2
A variant is indeed found at this position: A > T
******* DONE SNV with predefined ALT: found in IGV (but not in pileup)
CLOSED: [2023-05-15 Mon 23:13]
NC_000022.11	17499704	17499704    0.2 G
******* DONE Patient variants on chr22: OK in IGV
CLOSED: [2023-05-15 Mon 23:23]
The file is not sorted, so
samtools sort test.bam -o test-sorted.bam
samtools index test-sorted.bam
******* DONE Check whether POS and POS+1 are needed: no
CLOSED: [2023-05-14 Sun 21:21]
***** HOLD Cento variants
****** STRT SNV
Watch the memory: 32G does not seem to be enough with 12 threads
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/bamsurgeon.nf -profile standard,helios --input=tests/bamsurgeon/snv-cento.tsv -bg
#+end_src
and
#+begin_src nextflow
workflow {
    f = Channel.fromPath(params.input, checkIfExists: true)
    bam = Channel.fromPath("simuscop-centogene-200x/cento/preprocessing/mapped/cento.bam",
                           checkIfExists: true)
    bamIndex = bam.map { it -> it + ".bai" }
    downloadGenome | indexGenome
    indexGenome.out.index | view
    addSNV(f, bam, bamIndex, downloadGenome.out, indexGenome.out.index, indexGenome.out.dict, indexGenome.out.fai)
}
#+end_src
******* DONE v1.3: run the pipeline to check that the variants are recovered
CLOSED: [2023-05-19 Fri 18:41] SCHEDULED: <2023-05-18 Thu>
******* HOLD Fix the position to get a correct VAF
POS POS+1 VAF ALT
Careful: the modified base is at POS+1...
******* DONE Manual comparison with Julia (VAF = 1...)
CLOSED: [2023-05-19 Fri 21:58]
552 found out of 585
****** TODO del
****** TODO ins
**** TODO [[id:966a298c-948a-4694-a6f5-c326b1046a05][XAMscissors.jl]] :xamscissors:
***** TODO Test SNV
****** DONE Phase 1 : chr22, VAF=1
CLOSED: [2023-05-29 Mon 15:36]
******* DONE 1 SNV  : ok !
CLOSED: [2023-05-20 Sat 19:35] SCHEDULED: <2023-05-20 Sat>
#+begin_src sh
 make run READS="tests/bamscissors/corrected_{1,2}.fq.gz"
#+end_src
Then run the pipeline on correct_1
- [X] Variant visible in IGV
- [X] Variant visible after alignment
- [X] Variant visible after variant calling
******* KILL Test SNV on chromosome 22
CLOSED: [2023-05-29 Mon 15:36] SCHEDULED: <2023-05-20 Sat>
****** KILL Phase 2: chr22, variable VAF
CLOSED: [2023-06-12 Mon 23:27] SCHEDULED: <2023-06-03 Sat>
******* DONE Many reads are lost -> OK on one SNV when aligning to chromosome 22
CLOSED: [2023-05-29 Mon 15:38]
Problem in the BAM, as it does not occur without the insertions
******** DONE Filter out reads without a mate: same result
CLOSED: [2023-05-23 Tue 00:01]
FOUND:Found 16662 unpaired mates
	at htsjdk.samtools.SAMUtils.processValidationError(SAMUtils.java:470)
	at picard.sam.SamToFastq.doWork(SamToFastq.java:224)
	at picard.cmdline.CommandLineProgr

[14.69015]

[14.77207]

gz --reference genomeRef.fna  --tmp-dir . -L NC_000015.10
#+end_src
scp meso:/Work/Users/apraga/bisonex/tests/synthetic/testchr15.vcf.gz haplotypecaller-chr15.vcf.gz
No variant inserted
- base quality ok
  -
******* DONE BAM output: not called
CLOSED: [2023-05-01 Mon 21:57]
gatk --java-options "-Xmx3072M" HaplotypeCaller --input 63003856_S135_chr15_inserted.bam     --output haplotypecaller-chr15.vcf.gz --reference genomeRef.fna  --tmp-dir . -L NC_000015.10  --bam-output debug.bam
******* DONE --linked-de-bruijn-graph: same result
CLOSED: [2023-05-01 Mon 21:57]
readlink testchr15.vcf.gz -f^C
[apraga@mesointeractive synthetic]$ gatk --java-options "-Xmx3072M" HaplotypeCaller --input 63003856_S135_chr15_inserted.bam     --output haplotypecaller-chr15.vcf.gz --reference genomeRef.fna  --tmp-dir . -L NC_000015.10  --linked-de-bruijn-graph
******* KILL Regenerate FASTQ
CLOSED: [2023-05-13 Sat 18:29]
No
****** KILL Generate BAM data for all chromosomes
CLOSED: [2023-05-13 Sat 18:29]
 timeit julia -Jbisonex.so --project=. insertVariants.jl ~/code/bisonex/out/63003856/preprocessing/63003856_S135.bam 63003856_S135_inserted.bam
40min 516ms 835µs 405ns
Warning:
 [W::bam_hdr_read] EOF marker is absent. The input is probably truncated
inserted.bam and excluded.bam (the files before the merge) look OK...
Retrying by hand: it works
#+begin_src sh
samtools merge test-all.bam inserted.bam excluded.bam
mv -f test-all.bam 63003856_S135_inserted.bam
mv -f test-all.bam.bai 63003856_S135_chr15_inserted.bam.bai
#+end_src
****** DONE bam2fastq to get an up-to-date CIGAR: failed (variants "hidden")
CLOSED: [2023-05-04 Thu 20:30] SCHEDULED: <2023-05-01 Mon>
Run the BAM generation from the mesocentre (the copy crashes over the VPN)
#+begin_src sh
cd /Work/Users/apraga/recherche/bisonex/generate
julia --project=. insertVariants.jl  ../../../bisonex/out/63003856_S135/preprocessing/applybqsr/63003856_S135.bam 63003856_S135_inserted.bam
#+end_src
Workflow afterwards, with storeDir disabled for SAMTOOLS_BAM2FQ in nextflow.config (why??)
#+begin_src nextflow
include { SAMTOOLS_BAM2FQ }                            from "${params.modulesDir}/samtools/bam2fq/main"
include { SAMTOOLS_SORT as sortBamByName }             from "${params.modulesDir}/samtools/sort/main"
workflow {
    f = Channel.fromPath("${params.dataDir}/synthetic/63003856_S135_inserted.bam",
                         checkIfExists: true).map{it -> [["id": "synthetic_63003856"], it]}
    // Important: use "-n" option !!
    sortBamByName(f)
    SAMTOOLS_BAM2FQ(sortBamByName.out.bam, true)
}
#+end_src
Then
#+begin_src sh
cp work/34/fb2fc136f6f6d7f42d0960512f06de/*.fq.gz /Work/Groups/bisonex/data/synthetic/
#+end_src
****** KILL Run the pipeline
CLOSED: [2023-05-04 Thu 20:30] SCHEDULED: <2023-05-01 Mon>
NXF_OPTS=-D"user.name=apraga" nextflow run   main.nf -c nextflow.config  -profile standard,helios -bg --input="/Work/Groups/bisonex/data/synthetic/synthetic_63003856_{1,2}.fq.gz" --outdir out/synthetic_63003856
**** HOLD Bamsurgeon :bamsurgeon:
***** HOLD Nix package
1. Patch the reference genome lookup so that the indexes are found (using a regexp, like nf-core)
2. Add the path to picard to the arguments
3. -O3 option for performance
****** DONE Error ValueError: quality and sequence mismatch
CLOSED: [2023-05-19 Fri 18:44]
******* DONE Same with the latest version on GitHub
CLOSED: [2023-05-18 Thu 14:36]
******* DONE Version 1.3: OK, but
CLOSED: [2023-05-19 Fri 18:44]
Test on chr22: variants OK but VAF=1...
Runs "normally" (failure according to nextflow), but the output BAM is almost empty
Merging the BAM containing the variants with the reference file did not work correctly.
******** DONE Run replacereads.py by hand
CLOSED: [2023-05-18 Thu 21:41]
In /Work/Users/apraga/bisonex/work/f3/ce044f80ca91016d68d1bc4f4f5301
#+begin_src sh
 /nix/store/xw277la6w4sjqlsvw9h32cvrlacrfkgm-python3-3.10.9-env/bin/python3.10 /nix/store/abzangf0q8k37053p776cfkw181dzjn3-bamsurgeon-1.3/bin/bamsurgeon/replacereads.py -b cento.bam -r addsnv.e14561be-4fdd-45cc-9989-048ab6da6cc6.muts.bam -o snv-manual.bam
#+end_src
Then
#+begin_src sh
samtools sort snv-manual.bam -o snv-manual-sorted.bam
#+end_src
Seems to work
****** HOLD Fix the nextflow run to avoid the errors
Too many error messages in the output?
****** DONE Test on mini-BAM: failed
CLOSED: [2023-05-14 Sun 21:12]
❯ samtools view -h ~/code/bisonex/simuscop-cento-200x/cento/preprocessing/mapped/cento.bam | head -n1000 | samtools view -Sb - > mini.bam
❯ samtools index mini.bam
Without specifying the variant:
#+begin_quote
NC_000001.11	17651	17651
#+end_quote
./result/bin/addsnv -v snv.txt -f mini.bam -r ../data/genomeRef.fna -o test.bam
****** DONE Test chr22
CLOSED: [2023-05-15 Mon 23:24]
Not enough reads, so we take chromosome 22
#+begin_src sh
samtools view ../simuscop-cento-200x/cento/preprocessing/mapped/cento.bam NC_000022.11 -b -o chr22.bam
samtools index chr22.bam
#+end_src
On the mesocentre, in tests/bamsurgeon
#+begin_src sh
addsnv -v snv.txt -f chr22.bam -r ../genomeRef.fna -o test.bam --aligner mem
#+end_src
******* DONE Random SNV:
CLOSED: [2023-05-15 Mon 23:13]
NC_000022.11	17499704	17499704    0.2
A variant is indeed found at this position: A > T
******* DONE SNV with predefined ALT: found in IGV (but not in pileup)
CLOSED: [2023-05-15 Mon 23:13]
NC_000022.11	17499704	17499704    0.2 G
******* DONE Patient variants on chr22: OK in IGV
CLOSED: [2023-05-15 Mon 23:23]
The file is not sorted, so
samtools sort test.bam -o test-sorted.bam
samtools index test-sorted.bam
******* DONE Check whether POS and POS+1 are needed: no
CLOSED: [2023-05-14 Sun 21:21]
***** HOLD Cento variants
****** STRT SNV
Watch the memory: 32G does not seem to be enough with 12 threads
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/bamsurgeon.nf -profile standard,helios --input=tests/bamsurgeon/snv-cento.tsv -bg
#+end_src
and
#+begin_src nextflow
workflow {
    f = Channel.fromPath(params.input, checkIfExists: true)
    bam = Channel.fromPath("simuscop-cento-200x/cento/preprocessing/mapped/cento.bam",
                           checkIfExists: true)
    bamIndex = bam.map { it -> it + ".bai" }
    downloadGenome | indexGenome
    indexGenome.out.index | view
    addSNV(f, bam, bamIndex, downloadGenome.out, indexGenome.out.index, indexGenome.out.dict, indexGenome.out.fai)
}
#+end_src
******* DONE v1.3: run the pipeline to check that the variants are recovered
CLOSED: [2023-05-19 Fri 18:41] SCHEDULED: <2023-05-18 Thu>
******* HOLD Fix the position to get a correct VAF
POS POS+1 VAF ALT
Careful: the modified base is at POS+1...
******* DONE Manual comparison with Julia (VAF = 1...)
CLOSED: [2023-05-19 Fri 21:58]
552 found out of 585
****** TODO del
****** TODO ins
**** TODO [[id:966a298c-948a-4694-a6f5-c326b1046a05][XAMscissors.jl]] :xamscissors:
***** TODO Test SNV
****** DONE Phase 1 : chr22, VAF=1
CLOSED: [2023-05-29 Mon 15:36]
******* DONE 1 SNV  : ok !
CLOSED: [2023-05-20 Sat 19:35] SCHEDULED: <2023-05-20 Sat>
#+begin_src sh
 make run READS="tests/bamscissors/corrected_{1,2}.fq.gz"
#+end_src
Then run the pipeline on correct_1
- [X] Variant visible in IGV
- [X] Variant visible after alignment
- [X] Variant visible after variant calling
******* KILL Test SNV on chromosome 22
CLOSED: [2023-05-29 Mon 15:36] SCHEDULED: <2023-05-20 Sat>
****** KILL Phase 2: chr22, variable VAF
CLOSED: [2023-06-12 Mon 23:27] SCHEDULED: <2023-06-03 Sat>
******* DONE Many reads are lost -> OK on one SNV when aligning to chromosome 22
CLOSED: [2023-05-29 Mon 15:38]
Problem in the BAM, as it does not occur without the insertions
******** DONE Filter out reads without a mate: same result
CLOSED: [2023-05-23 Tue 00:01]
FOUND:Found 16662 unpaired mates
	at htsjdk.samtools.SAMUtils.processValidationError(SAMUtils.java:470)
	at picard.sam.SamToFastq.doWork(SamToFastq.java:224)
	at picard.cmdline.CommandLineProgr

Replacement in projects/bisonex.org at line 40 [3.35]

B:BD[14.85399] → [14.85399:93591]

 high boundaries for computing mean and std.dev: (1, 605)
[M::mem_pestat] mean and std.dev: (158.00, 110.30)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 788)
[M::mem_pestat] skip orientation FF
[M::mem_pestat] skip orientation RR
[M::process] read 839618 sequences (114952336 bp)...
[M::mem_process_seqs] Processed 1752492 reads in 375.714 CPU sec, 17.645 real sec
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 336379, 0, 5)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (128, 174, 232)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 440)
[M::mem_pestat] mean and std.dev: (184.73, 74.63)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 544)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 839618 reads in 183.039 CPU sec, 7.961 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem -t 24 -o wtf.bam /Work/Projects/bisonex/data/genome/GRCh38.p13/bwa/genomeRef 63003856_chr22_1.fq.gz 63003856_chr22_2.fq.gz
[main] Real time: 38.278 sec; CPU: 565.821 sec
The read count is correct, though
 samtools flagstat wtf.bam
2611059 + 0 in total (QC-passed reads + QC-failed reads)
2592110 + 0 primary
0 + 0 secondary
18949 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
2611058 + 0 mapped (100.00% : N/A)
2592109 + 0 primary mapped (100.00% : N/A)
2592110 + 0 paired in sequencing
1296055 + 0 read1
1296055 + 0 read2
2590970 + 0 properly paired (99.96% : N/A)
2592108 + 0 with itself and mate mapped
1 + 0 singletons (0.00% : N/A)
458 + 0 with mate mapped to a different chr
63 + 0 with mate mapped to a different chr (mapQ>=5)
$ samtools sort -@24 -o wtf_sorted.bam wtf.sam
[bam_sort_core] merging from 0 files and 24 in-memory blocks...
 samtools flagstat wtf.bam
2611059 + 0 in total (QC-passed reads + QC-failed reads)
2592110 + 0 primary
0 + 0 secondary
18949 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
2611058 + 0 mapped (100.00% : N/A)
2592109 + 0 primary mapped (100.00% : N/A)
2592110 + 0 paired in sequencing
1296055 + 0 read1
1296055 + 0 read2
2590970 + 0 properly paired (99.96% : N/A)
2592108 + 0 with itself and mate mapped
1 + 0 singletons (0.00% : N/A)
458 + 0 with mate mapped to a different chr
63 + 0 with mate mapped to a different chr (mapQ>=5)
Indeed, this is not an IGV problem
❯ samtools mpileup 63003856_chr22.bam -r NC_000022.11:42213078-42213078
[mpileup] 1 samples in 1 input files
NC_000022.11	42213078	N	19	TTtTTtTTTtTTtTtTttt	kkk_kkkkFkFF_FkFkQk
bisonex/code/BamScissors.jl on  bamscissors [!?] via ஃ v1.9.0
❯ samtools mpileup aligned/wtf_sorted.bam -r NC_000022.11:42213078-42213078
[mpileup] 1 samples in 1 input files
NC_000022.11	42213078	N	5	TTtTT	_FkFF
******** DONE Check where the reads were aligned (new run)
CLOSED: [2023-05-24 Wed 23:18]
********* DONE Preparation
CLOSED: [2023-05-24 Wed 21:59]
Rerun the pipeline to get a clean BAM
Keep the unmapped reads from the applybqsr output
#+begin_src sh
NXF_OPTS=-D"user.name=apraga" nextflow run main.nf -c nextflow.config  -profile standard,helios --input="/Work/Projects/bisonex/centogene/fastq/2200467051_63003856/63003856_S135_R{1,2}_001.fastq.gz" --outdir=out -bg
cd out/63003856_S135_R/preprocessing/applybqsr/
samtools view 63003856_S135_R.bam NC_000022.11  -f 0x2 -o 63003856_chr22.bam
samtools sort -n 63003856_chr22.bam -o 63003856_chr22_sorted.bam
samtools fastq -1 63003856_chr22_1.fq.gz -2 63003856_chr22_2.fq.gz -0 /dev/null -s /dev/null -n 63003856_chr22_sorted.bam
make run BG= READS="out/63003856_S135_R/preprocessing/applybqsr/63003856_chr22_{1,2}.fq.gz"
cd out/63003856_chr22/preprocessing/mapped/
samtools index 63003856_chr22.bam
samtools mpileup 63003856_chr22.bam -r NC_000022.11:42213078-42213078
#+end_src
Fetch the two BAMs into
#+begin_src
cd /home/alex/recherche/bisonex/code/BamScissors.jl/
rsync -avz meso:/Work/Users/apraga/bisonex/out/63003856_chr22/preprocessing/mapped data
rsync -avz meso:/Work/Users/apraga/bisonex/out/63003856_S135_R/preprocessing/applybqsr/ data/init/
#+end_src
********* Check that the read is elsewhere
Look for a read that is missing from the second alignment
#+begin_src sh
samtools view data/init/63003856_chr22.bam | rg "A00853:477:HMLWYDSX3:1:1413:4390:28573"
#+end_src
#+RESULTS:
: A00853:477:HMLWYDSX3:1:1413:4390:28573	163	NC_000022.11	42212845	0	151M	=	42212883	189	CCCAGGGGCCCCAGTGGGGATTTTCTAATAGAGACCCAATGCTTTTGTTCAGAAGGCCCCTGCTAGCTAATTCATTGGTTTGACTAACCAAGACATTGGGCCTTGTGGTTCCTTCTAGGCTACCAGCCATCCCCTGATGCTCTTGAGTACT	ACC+FBCDCBBBAEAEDEEBBCCCECACBAEBEBDCCBCBFDCCCCFACEBEBCEEDCCCCFDCAEDCACBCEBBCFEACCFBDCACDCBCEBDBBCFEEDCCCFAFEACECCCECAEEDCADCBEDC7BEBCCCFBAFDCECCFBEAACA	MC:Z:151M	MD:Z:151	PG:Z:MarkDuplicates	RG:Z:sample	NM:i:0	AS:i:151	XS:i:151
: A00853:477:HMLWYDSX3:1:1413:4390:28573	83	NC_000022.11	42212883	0	151M	=	42212845	-189	ATGCTTTTGTTCAGAAGGCCCCTGCTAGCTAATTCATTGGTTTGACTAACCAAGACATTGGGCCTTGTGGTTCCTTCTAGGCTACCAGCCATCCCCTGATGCTCTTGAGTACTCCTAGAATATCTCCTGTCAGGGTGGTGGTGGTAACCCT	AADECCCBDCBFCE<?CDEEEEBDEACDEAC;:BFBCBCDCCBEAEACAEFCCEAFBCBCCDEECBDBCECBEECCEACDEEBBFGDEFGCCFFFFCFCCEFBFDCFCDAAEBEE:CECBABBEBEE;DBFCCCDBCDBCCBBC?@BEEDA	MC:Z:151M	MD:Z:151	PG:Z:MarkDuplicates	RG:Z:sample	NM:i:0	AS:i:151	XS:i:151
#+begin_src sh
samtools view data/mapped/63003856_chr22.bam | rg "A00853:477:HMLWYDSX3:1:1413:4390:28573"
#+end_src
#+RESULTS:
: A00853:477:HMLWYDSX3:1:1413:4390:28573	163	NW_014040930.1	115017	0	151M	=	115055	189	CCCAGGGGCCCCAGTGGGGATTTTCTAATAGAGACCCAATGCTTTTGTTCAGAAGGCCCCTGCTAGCTAATTCATTGGTTTGACTAACCAAGACATTGGGCCTTGTGGTTCCTTCTAGGCTACCAGCCATCCCCTGATGCTCTTGAGTACT	ACC+FBCDCBBBAEAEDEEBBCCCECACBAEBEBDCCBCBFDCCCCFACEBEBCEEDCCCCFDCAEDCACBCEBBCFEACCFBDCACDCBCEBDBBCFEEDCCCFAFEACECCCECAEEDCADCBEDC7BEBCCCFBAFDCECCFBEAACA	NM:i:0	MD:Z:151	MC:Z:151M	AS:i:151	XS:i:151	RG:Z:sample
: A00853:477:HMLWYDSX3:1:1413:4390:28573	83	NW_014040930.1	115055	0	151M	=	115017	-189	ATGCTTTTGTTCAGAAGGCCCCTGCTAGCTAATTCATTGGTTTGACTAACCAAGACATTGGGCCTTGTGGTTCCTTCTAGGCTACCAGCCATCCCCTGATGCTCTTGAGTACTCCTAGAATATCTCCTGTCAGGGTGGTGGTGGTAACCCT	AADECCCBDCBFCE<?CDEEEEBDEACDEAC;:BFBCBCDCCBEAEACAEFCCEAFBCBCCDEECBDBCECBEECCEACDEEBBFGDEFGCCFFFFCFCCEFBFDCFCDAAEBEE:CECBABBEBEE;DBFCCCDBCDBCCBBC?@BEEDA	NM:i:0	MD:Z:151	MC:Z:151M	AS:i:151	XS:i:151	RG:Z:sample
Indeed, the reads are aligned to a removed region!
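The two records above carry flags 163 and 83, MAPQ 0, and AS:i:151 equal to XS:i:151 in both alignments, so the pair is properly paired but its placement is ambiguous. To read the flags without a lookup table, here is a minimal sketch in plain POSIX shell. The function name `decode_flag` and the short bit labels are hypothetical; the bit order follows the SAM FLAG definition.

```shell
# Decode a SAM FLAG value into the names of the bits that are set.
# Bit labels are illustrative shorthand, listed from 0x1 up to 0x800.
decode_flag() {
    flag=$1
    out=""
    bit=1
    for name in paired proper_pair unmapped mate_unmapped reverse \
                mate_reverse read1 read2 secondary qc_fail duplicate supplementary; do
        if [ $(( flag & bit )) -ne 0 ]; then
            # Append with a comma separator once the list is non-empty.
            out="${out:+$out,}$name"
        fi
        bit=$(( bit * 2 ))
    done
    echo "$out"
}

decode_flag 163
decode_flag 83
```

Both flags include `proper_pair`, which fits the observation that the pair is intact and only the target contig changed between the two runs.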
******** DONE Fix the quality: no
CLOSED: [2023-05-24 Wed 22:19]
********* DONE Comparison with the reference FASTQ: the quality differs!!
CLOSED: [2023-05-24 Wed 22:17]
#+begin_src sh
cd /Work/Users/apraga/bisonex/work/6e/8548fc90263830bf677f36585f11dc
zgrep -A 3 "A00853:477:HMLWYDSX3:1:1413:4390:28573" 63003856_chr22_1.fq.gz
#+end_src
@A00853:477:HMLWYDSX3:1:1413:4390:28573
AGGGTTACCACCACCACCCTGACAGGAGATATTCTAGGAGTACTCAAGAGCATCAGGGGATGGCTGGTAGCCTAGAAGGAACCACAAGGCCCAATGTCTTGGTTAGTCAAACCAATGAATTAGCTAGCAGGGGCCTTCTGAACAAAAGCAT
+
ADEEB@?CBBCCBDCBDCCCFBD;EEBEBBABCEC:EEBEAADCFCDFBFECCFCFFFFCCGFEDGFBBEEDCAECCEEBCECBDBCEEDCCBCBFAECCFEACAEAEBCCDCBCBFB:;CAEDCAEDBEEEEDC?<ECFBCDBCCCEDAA
#+begin_src
zgrep -A 3 "A00853:477:HMLWYDSX3:1:1413:4390:28573" /Work/Projects/bisonex/centogene/fastq/2200467051_63003856/63003856_S135_R1_001.fastq.gz
#+end_src
#+RESULTS:
: @A00853:477:HMLWYDSX3:1:1413:4390:28573 1:N:0:ATTCCACACA+TAGGCGATTG
AGGGTTACCACCACCACCCTGACAGGAGATATTCTAGGAGTACTCAAGAGCATCAGGGGATGGCTGGTAGCCTAGAAGGAACCACAAGGCCCAATGTCTTGGTTAGTCAAACCAATGAATTAGCTAGCAGGGGCCTTCTGAACAAAAGCAT
: +
: FFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF::FFFFFFFFFFFFFFF::FFFFFFFFFFFFFF
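The re-extracted FASTQ and the original FASTQ hold the same bases but different quality strings (mostly `F` in the original, letters around `A` to `G` after the pipeline), while the BAM record with flag 83 stores the reverse complement of the FASTQ read. The sketch below makes both comparisons easy to check by hand. The helper names `revcomp` and `phred` are hypothetical, and Phred+33 encoding is assumed.

```shell
# Reverse-complement a DNA sequence, as needed to compare a FASTQ read
# with a BAM record stored on the reverse strand (flag bit 0x10).
revcomp() {
    printf '%s\n' "$1" \
        | awk '{ s = ""; for (i = length($0); i > 0; i--) s = s substr($0, i, 1); print s }' \
        | tr 'ACGTacgt' 'TGCAtgca'
}

# Convert one quality character to its Phred score, assuming Phred+33.
phred() {
    printf '%d\n' "$(( $(printf '%d' "'$1") - 33 ))"
}

revcomp AGGGTTACC   # first 9 bases of the FASTQ read above
phred F             # typical character in the original FASTQ
phred A             # typical character after the pipeline
```

The first call returns `GGTAACCCT`, which is the end of the flag-83 sequence in the BAM, so the bases agree and only the qualities changed.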
********* DONE Compare the quality after bwa mem vs applybqsr: different
CLOSED: [2023-05-24 Wed 22:18]
On the mesocentre cluster, in /Work/Users/apraga/bisonex/out/63003856_S135_R/preprocessing
$ samtools view mapped/63003856_S135_R.bam NC_0

[14.85399]

[14.93591]

 high boundaries for computing mean and std.dev: (1, 605)
[M::mem_pestat] mean and std.dev: (158.00, 110.30)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 788)
[M::mem_pestat] skip orientation FF
[M::mem_pestat] skip orientation RR
[M::process] read 839618 sequences (114952336 bp)...
[M::mem_process_seqs] Processed 1752492 reads in 375.714 CPU sec, 17.645 real sec
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 336379, 0, 5)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (128, 174, 232)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 440)
[M::mem_pestat] mean and std.dev: (184.73, 74.63)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 544)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 839618 reads in 183.039 CPU sec, 7.961 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem -t 24 -o wtf.bam /Work/Projects/bisonex/data/genome/GRCh38.p13/bwa/genomeRef 63003856_chr22_1.fq.gz 63003856_chr22_2.fq.gz
[main] Real time: 38.278 sec; CPU: 565.821 sec
The read count is correct, though
 samtools flagstat wtf.bam
2611059 + 0 in total (QC-passed reads + QC-failed reads)
2592110 + 0 primary
0 + 0 secondary
18949 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
2611058 + 0 mapped (100.00% : N/A)
2592109 + 0 primary mapped (100.00% : N/A)
2592110 + 0 paired in sequencing
1296055 + 0 read1
1296055 + 0 read2
2590970 + 0 properly paired (99.96% : N/A)
2592108 + 0 with itself and mate mapped
1 + 0 singletons (0.00% : N/A)
458 + 0 with mate mapped to a different chr
63 + 0 with mate mapped to a different chr (mapQ>=5)
$ samtools sort -@24 -o wtf_sorted.bam wtf.sam
[bam_sort_core] merging from 0 files and 24 in-memory blocks...
 samtools flagstat wtf.bam
2611059 + 0 in total (QC-passed reads + QC-failed reads)
2592110 + 0 primary
0 + 0 secondary
18949 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
2611058 + 0 mapped (100.00% : N/A)
2592109 + 0 primary mapped (100.00% : N/A)
2592110 + 0 paired in sequencing
1296055 + 0 read1
1296055 + 0 read2
2590970 + 0 properly paired (99.96% : N/A)
2592108 + 0 with itself and mate mapped
1 + 0 singletons (0.00% : N/A)
458 + 0 with mate mapped to a different chr
63 + 0 with mate mapped to a different chr (mapQ>=5)
Indeed, this is not an IGV problem
❯ samtools mpileup 63003856_chr22.bam -r NC_000022.11:42213078-42213078
[mpileup] 1 samples in 1 input files
NC_000022.11	42213078	N	19	TTtTTtTTTtTTtTtTttt	kkk_kkkkFkFF_FkFkQk
bisonex/code/BamScissors.jl on  bamscissors [!?] via ஃ v1.9.0
❯ samtools mpileup aligned/wtf_sorted.bam -r NC_000022.11:42213078-42213078
[mpileup] 1 samples in 1 input files
NC_000022.11	42213078	N	5	TTtTT	_FkFF
******** DONE Check where the reads were aligned (new run)
CLOSED: [2023-05-24 Wed 23:18]
********* DONE Preparation
CLOSED: [2023-05-24 Wed 21:59]
Rerun the pipeline to get a clean BAM
Keep the unmapped reads from the applybqsr output
#+begin_src sh
NXF_OPTS=-D"user.name=apraga" nextflow run main.nf -c nextflow.config  -profile standard,helios --input="/Work/Projects/bisonex/cento/fastq/2200467051_63003856/63003856_S135_R{1,2}_001.fastq.gz" --outdir=out -bg
cd out/63003856_S135_R/preprocessing/applybqsr/
samtools view 63003856_S135_R.bam NC_000022.11  -f 0x2 -o 63003856_chr22.bam
samtools sort -n 63003856_chr22.bam -o 63003856_chr22_sorted.bam
samtools fastq -1 63003856_chr22_1.fq.gz -2 63003856_chr22_2.fq.gz -0 /dev/null -s /dev/null -n 63003856_chr22_sorted.bam
make run BG= READS="out/63003856_S135_R/preprocessing/applybqsr/63003856_chr22_{1,2}.fq.gz"
cd out/63003856_chr22/preprocessing/mapped/
samtools index 63003856_chr22.bam
samtools mpileup 63003856_chr22.bam -r NC_000022.11:42213078-42213078
#+end_src
Fetch the two BAMs into
#+begin_src
cd /home/alex/recherche/bisonex/code/BamScissors.jl/
rsync -avz meso:/Work/Users/apraga/bisonex/out/63003856_chr22/preprocessing/mapped data
rsync -avz meso:/Work/Users/apraga/bisonex/out/63003856_S135_R/preprocessing/applybqsr/ data/init/
#+end_src
********* Check that the read is elsewhere
Look for a read that is missing from the second alignment
#+begin_src sh
samtools view data/init/63003856_chr22.bam | rg "A00853:477:HMLWYDSX3:1:1413:4390:28573"
#+end_src
#+RESULTS:
: A00853:477:HMLWYDSX3:1:1413:4390:28573	163	NC_000022.11	42212845	0	151M	=	42212883	189	CCCAGGGGCCCCAGTGGGGATTTTCTAATAGAGACCCAATGCTTTTGTTCAGAAGGCCCCTGCTAGCTAATTCATTGGTTTGACTAACCAAGACATTGGGCCTTGTGGTTCCTTCTAGGCTACCAGCCATCCCCTGATGCTCTTGAGTACT	ACC+FBCDCBBBAEAEDEEBBCCCECACBAEBEBDCCBCBFDCCCCFACEBEBCEEDCCCCFDCAEDCACBCEBBCFEACCFBDCACDCBCEBDBBCFEEDCCCFAFEACECCCECAEEDCADCBEDC7BEBCCCFBAFDCECCFBEAACA	MC:Z:151M	MD:Z:151	PG:Z:MarkDuplicates	RG:Z:sample	NM:i:0	AS:i:151	XS:i:151
: A00853:477:HMLWYDSX3:1:1413:4390:28573	83	NC_000022.11	42212883	0	151M	=	42212845	-189	ATGCTTTTGTTCAGAAGGCCCCTGCTAGCTAATTCATTGGTTTGACTAACCAAGACATTGGGCCTTGTGGTTCCTTCTAGGCTACCAGCCATCCCCTGATGCTCTTGAGTACTCCTAGAATATCTCCTGTCAGGGTGGTGGTGGTAACCCT	AADECCCBDCBFCE<?CDEEEEBDEACDEAC;:BFBCBCDCCBEAEACAEFCCEAFBCBCCDEECBDBCECBEECCEACDEEBBFGDEFGCCFFFFCFCCEFBFDCFCDAAEBEE:CECBABBEBEE;DBFCCCDBCDBCCBBC?@BEEDA	MC:Z:151M	MD:Z:151	PG:Z:MarkDuplicates	RG:Z:sample	NM:i:0	AS:i:151	XS:i:151
#+begin_src sh
samtools view data/mapped/63003856_chr22.bam | rg "A00853:477:HMLWYDSX3:1:1413:4390:28573"
#+end_src
#+RESULTS:
: A00853:477:HMLWYDSX3:1:1413:4390:28573	163	NW_014040930.1	115017	0	151M	=	115055	189	CCCAGGGGCCCCAGTGGGGATTTTCTAATAGAGACCCAATGCTTTTGTTCAGAAGGCCCCTGCTAGCTAATTCATTGGTTTGACTAACCAAGACATTGGGCCTTGTGGTTCCTTCTAGGCTACCAGCCATCCCCTGATGCTCTTGAGTACT	ACC+FBCDCBBBAEAEDEEBBCCCECACBAEBEBDCCBCBFDCCCCFACEBEBCEEDCCCCFDCAEDCACBCEBBCFEACCFBDCACDCBCEBDBBCFEEDCCCFAFEACECCCECAEEDCADCBEDC7BEBCCCFBAFDCECCFBEAACA	NM:i:0	MD:Z:151	MC:Z:151M	AS:i:151	XS:i:151	RG:Z:sample
: A00853:477:HMLWYDSX3:1:1413:4390:28573	83	NW_014040930.1	115055	0	151M	=	115017	-189	ATGCTTTTGTTCAGAAGGCCCCTGCTAGCTAATTCATTGGTTTGACTAACCAAGACATTGGGCCTTGTGGTTCCTTCTAGGCTACCAGCCATCCCCTGATGCTCTTGAGTACTCCTAGAATATCTCCTGTCAGGGTGGTGGTGGTAACCCT	AADECCCBDCBFCE<?CDEEEEBDEACDEAC;:BFBCBCDCCBEAEACAEFCCEAFBCBCCDEECBDBCECBEECCEACDEEBBFGDEFGCCFFFFCFCCEFBFDCFCDAAEBEE:CECBABBEBEE;DBFCCCDBCDBCCBBC?@BEEDA	NM:i:0	MD:Z:151	MC:Z:151M	AS:i:151	XS:i:151	RG:Z:sample
Indeed, the reads are aligned to a removed region!
******** DONE Fix the quality: no
CLOSED: [2023-05-24 Wed 22:19]
********* DONE Comparison with the reference FASTQ: the quality differs!!
CLOSED: [2023-05-24 Wed 22:17]
#+begin_src sh
cd /Work/Users/apraga/bisonex/work/6e/8548fc90263830bf677f36585f11dc
zgrep -A 3 "A00853:477:HMLWYDSX3:1:1413:4390:28573" 63003856_chr22_1.fq.gz
#+end_src
@A00853:477:HMLWYDSX3:1:1413:4390:28573
AGGGTTACCACCACCACCCTGACAGGAGATATTCTAGGAGTACTCAAGAGCATCAGGGGATGGCTGGTAGCCTAGAAGGAACCACAAGGCCCAATGTCTTGGTTAGTCAAACCAATGAATTAGCTAGCAGGGGCCTTCTGAACAAAAGCAT
+
ADEEB@?CBBCCBDCBDCCCFBD;EEBEBBABCEC:EEBEAADCFCDFBFECCFCFFFFCCGFEDGFBBEEDCAECCEEBCECBDBCEEDCCBCBFAECCFEACAEAEBCCDCBCBFB:;CAEDCAEDBEEEEDC?<ECFBCDBCCCEDAA
#+begin_src
zgrep -A 3 "A00853:477:HMLWYDSX3:1:1413:4390:28573" /Work/Projects/bisonex/cento/fastq/2200467051_63003856/63003856_S135_R1_001.fastq.gz
#+end_src
AGGGTTACCACCACCACCCTGACAGGAGATATTCTAGGAGTACTCAAGAGCATCAGGGGATGGCTGGTAGCCTAGAAGGAACCACAAGGCCCAATGTCTTGGTTAGTCAAACCAATGAATTAGCTAGCAGGGGCCTTCTGAACAAAAGCAT
: +
: FFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF::FFFFFFFFFFFFFFF::FFFFFFFFFFFFFF
********* DONE Compare the quality after bwa mem vs applybqsr: different
CLOSED: [2023-05-24 Wed 22:18]
On the mesocentre cluster, in /Work/Users/apraga/bisonex/out/63003856_S135_R/preprocessing
$ samtools view mapped/63003856_S135_R.bam NC_0

Replacement in projects/bisonex.org at line 43 [3.35]

B:BD[4.92581] → [4.92581:92749]

B:BD[4.92749] → [12.32831:35141]

SED: [2023-06-04 Sun 21:51]
A00853:477:HMLWYDSX3:2:2444:22354:28870	97	NW_021160016.1	172243	0	128M	=	172243	128	CACCGTGTCCACCCCTCCTGCCGGCATCTCTGTGACGTTGGCCTTGATGTCCTTGAAGGACATCTTGCTGTCTCCCAGGAGTCTGTAGAGGATGCCACGGTAATCGTGGTGAACACTTCCTTTCTGTC	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF:FFFFFFFFFF::FFFFFFFFFFF:FFFFFFFFFFFFFF:FFFFFFF,FFFFFF,FFFFFFFFFFFF:FF::FF	NM:i:2	MD:Z:22A30C7MC:Z:128M	AS:i:118	XS:i:118	XA:Z:NC_000015.10,+74342974,128M,2;
A00853:477:HMLWYDSX3:2:2444:22354:28870	145	NW_021160016.1	172243	0	128M	=	172243	-128	CACCGTGTCCACCCCTCCTGCCGGCATCTCTGTGACGTTGGCCTTGATGTCCTCGAAGGACATCTTGCTGTCTCCCAGGAGTCTGTAGAGGATGCCACGGTAATCGTGGTGAACACTTCCTTTCTGTC	FFFFFFFFFFFFF:FFFFFF,FFF:,FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF:FF:F:FFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF:FFF:FF	NM:i:1	MD:Z:22A105	MC:Z:128M	AS:i:123	XS:i:123	XA:Z:NC_000015.10,-74342974,128M,1;
******** DONE GRCh38 : ok
CLOSED: [2023-06-04 Sun 22:15]
 bwa mem /Work/Projects/bisonex/data/genome/GRCh38/GCA_000001405.15_GRCh38_full_analysis_set.fna test1.fq test2.fq
******* DONE Check that the reads have the same quality in the original files: yes
CLOSED: [2023-06-04 Sun 21:07]
******* DONE Remove the NW_ contigs?
CLOSED: [2023-06-10 Sat 10:40] SCHEDULED: <2023-06-04 Sun>
@A00853:477:HMLWYDSX3:3:2114:14742:8860
CAGGCCAGCCGCTCAGCCCGCTCCTTTCACCCTCTGCAGGAGAGCCTCGTGGCAGGCCAGTGGAGGGACATGATGGACTACATGCTCCAAGGGGTGGCGCAGCCGAGCATGGAAGAGGGCTCTGGACAGCTCCTGGAAGGGCACTTGCAC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00853:477:HMLWYDSX3:3:2114:14742:8860
CTTTTGCTTGTCCCCAGGACGCACCTCAGGGTGGTGAAGCAAAAAAACCACGGCCCAGGAGAGGGTGGGTGCTGTGGTCTCAGTGCCACCGATCAGGAGGTCCACTGCAGCCATGTGCAAGTGCCCTTCCAGGAGCTGTCCAGAGCCCTCT
+
FFFFFFFFFFFFFFFFFFFFFFF:FFF:FFFFFFFFFFFFF,FFFFFFFFFFFF:F:FFFF:FFFFF,,FFF:FFFFFFFFFF,FFFFFFF,FFFFFFFFFFF,FFFFFFFFF:FFFF,F:FFFFF:FFFFFFFFF:FFFF,FFFFFFFFF
******* DONE Remove NW_ and NT_
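How the removal was actually done is not recorded here. One possible approach, sketched under that caveat, is to filter the contig names (for example the first column of a FASTA index) and keep only those that do not start with `NW_` or `NT_`. The function name `keep_primary_contigs` is hypothetical, and `NT_000000.1` is a made-up placeholder accession. The other names come from the alignments above.

```shell
# Read contig names on stdin, one per line, and drop every accession
# starting with NW_ or NT_ (the scaffolds that attracted the chr22 reads).
keep_primary_contigs() {
    grep -v -E '^(NW_|NT_)'
}

printf '%s\n' NC_000022.11 NW_014040930.1 NW_021160016.1 NT_000000.1 \
    | keep_primary_contigs
```

Only `NC_000022.11` survives the filter. The kept names could then be handed to a tool such as `samtools faidx` to extract the matching sequences before rebuilding the bwa index.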
***** TODO Phase 2: chr22, variable VAF :T2T:
SCHEDULED: <2023-08-02 Wed>
****** TODO Phase 3: all SNVs, variable VAF :T2T:
SCHEDULED: <2023-07-12 Wed>
***** TODO Test Indel
**** Miscellaneous
***** DONE Check read counts, FASTQ vs BAM
CLOSED: [2022-10-09 Sun 22:31]
*** KILL List of "clinically relevant" variants (Clinge - CT-R d)
CLOSED: [2023-06-25 Sun 15:53] SCHEDULED: <2023-06-25 Sun>
[cite:@wilcox2021]
Discussed with Alexis: not our use case

[4.92581]

SED: [2023-06-04 Sun 21:51]
A00853:477:HMLWYDSX3:2:2444:22354:28870	97	NW_021160016.1	172243	0	128M	=	172243	128	CACCGTGTCCACCCCTCCTGCCGGCATCTCTGTGACGTTGGCCTTGATGTCCTTGAAGGACATCTTGCTGTCTCCCAGGAGTCTGTAGAGGATGCCACGGTAATCGTGGTGAACACTTCCTTTCTGTC	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF:FFFFFFFFFF::FFFFFFFFFFF:FFFFFFFFFFFFFF:FFFFFFF,FFFFFF,FFFFFFFFFFFF:FF::FF	NM:i:2	MD:Z:22A30C7MC:Z:128M	AS:i:118	XS:i:118	XA:Z:NC_000015.10,+74342974,128M,2;
A00853:477:HMLWYDSX3:2:2444:22354:28870	145	NW_021160016.1	172243	0	128M	=	172243	-128	CACCGTGTCCACCCCTCCTGCCGGCATCTCTGTGACGTTGGCCTTGATGTCCTCGAAGGACATCTTGCTGTCTCCCAGGAGTCTGTAGAGGATGCCACGGTAATCGTGGTGAACACTTCCTTTCTGTC	FFFFFFFFFFFFF:FFFFFF,FFF:,FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF:FF:F:FFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF:FFF:FF	NM:i:1	MD:Z:22A105	MC:Z:128M	AS:i:123	XS:i:123	XA:Z:NC_000015.10,-74342974,128M,1;
******** DONE GRCh38 : ok
CLOSED: [2023-06-04 Sun 22:15]
 bwa mem /Work/Projects/bisonex/data/genome/GRCh38/GCA_000001405.15_GRCh38_full_analysis_set.fna test1.fq test2.fq
******* DONE Check that the reads have the same quality in the original files: yes
CLOSED: [2023-06-04 Sun 21:07]
******* DONE Remove the NW_ contigs?
CLOSED: [2023-06-10 Sat 10:40] SCHEDULED: <2023-06-04 Sun>
@A00853:477:HMLWYDSX3:3:2114:14742:8860
CAGGCCAGCCGCTCAGCCCGCTCCTTTCACCCTCTGCAGGAGAGCCTCGTGGCAGGCCAGTGGAGGGACATGATGGACTACATGCTCCAAGGGGTGGCGCAGCCGAGCATGGAAGAGGGCTCTGGACAGCTCCTGGAAGGGCACTTGCAC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00853:477:HMLWYDSX3:3:2114:14742:8860
CTTTTGCTTGTCCCCAGGACGCACCTCAGGGTGGTGAAGCAAAAAAACCACGGCCCAGGAGAGGGTGGGTGCTGTGGTCTCAGTGCCACCGATCAGGAGGTCCACTGCAGCCATGTGCAAGTGCCCTTCCAGGAGCTGTCCAGAGCCCTCT
+
FFFFFFFFFFFFFFFFFFFFFFF:FFF:FFFFFFFFFFFFF,FFFFFFFFFFFF:F:FFFF:FFFFF,,FFF:FFFFFFFFFF,FFFFFFF,FFFFFFFFFFF,FFFFFFFFF:FFFF,F:FFFFF:FFFFFFFFF:FFFF,FFFFFFFFF
******* DONE Remove NW_ and NT_
***** TODO Phase 2: chr22, variable VAF :T2T:
SCHEDULED: <2023-08-02 Wed>
****** TODO Phase 3: all SNVs, variable VAF :T2T:
SCHEDULED: <2023-07-27 Thu>
***** TODO Test Indel
**** Miscellaneous
***** DONE Check read counts, FASTQ vs BAM
CLOSED: [2022-10-09 Sun 22:31]
*** KILL List of "clinically relevant" variants (Clinge - CT-R d)
CLOSED: [2023-06-25 Sun 15:53] SCHEDULED: <2023-06-25 Sun>
[cite:@wilcox2021]
Discussed with Alexis: not our use case