apraga/org - Change VBMV6H3DSTLSVZCQUAKSA4CFT7LSANFNALLETHLCPFH6Y3XPEAZAC

Workout

Created by Alexis Praga on April 2, 2023

VBMV6H3DSTLSVZCQUAKSA4CFT7LSANFNALLETHLCPFH6Y3XPEAZAC

Dependencies

In channels

main

Change contents

Insertion in workout.org at line 6091 [4.1]

[3.332]

* <2023-04-02 Sun> Workout
**lol
 ;; - - lol
- test
- 3
* <2023-03-31 Fri> Tricking
- bkick 5
- aerial 5
- tornado 5
- preparation 540 x5 pour 3 exercices
- frontside 900 x5
- frontside 720 x5
Combo x3 : bkick - tornado - aerial - 720 - raiz
* <2023-04-01 Sat> Workout
 Circuit x3
- 1 muscle-up, 1 RTO, 1 L-sit, 1 push-up, 1 negative, 1 pull-up, 1 skin the cat, 1 straddle
- 4 pistols, 4 norwegian curl
* <2023-04-02 Sun> Running
45min, 2 montées

Replacement in projects.org at line 2147 [4.123895]

B:BD[5.47] → [5.47:77]

*** WAIT version script seule

[5.47]

[6.622]

*** KILL version script seule
CLOSED: [2023-04-01 Sat 18:29]

Insertion in projects.org at line 2153 [4.123895]

[6.708]

[5.153]

*** WAIT Bug report Version 22.10.6
https://github.com/NixOS/nixpkgs/issues/192396
Erreur :
ERROR: Cannot download nextflow required file -- make sure you can connect to the internet
Alternatively you can try to download this file:
    https://www.nextflow.io/releases/v22.10.6/nextflow-22.10.6-all.jar
and save it as:
    .//nix/store/md2b1ah4d7ivj82k8xxap30dmdci00pa-nextflow-22.10.6/bin/.nextflow-wrapped
Dans la mise à jour, il y a la création d'un environnement virtuel qui casse l'exécution de nextflow (besoin de télécharger)
Fix = désactiver

Replacement in projects.org at line 2219 [4.123895]
∅:D[2.221] → [7.993:1020]
B:BD[7.993] → [7.993:1020]
B:BD[7.1020] → [8.197:225]
```
** TODO Changer kit chaine
SCHEDULED: <2023-04-01 Sat>
```
[2.221]
[8.225]
```
** DONE Changer kit chaine
CLOSED: [2023-04-01 Sat 17:24] SCHEDULED: <2023-04-01 Sat>
```
Insertion in projects.org at line 2222 [4.123895]
[8.239]
[8.239]
```
41000km
```

Replacement in projects/bisonex.org at line 1 [9.35]

B:BD[9.35] → [2.226:16610]

#+title: Bisonex
* Biblio :biblio:
** Workflow
Comparaison WDL, Cromwell, nextflow
https://www.nature.com/articles/s41598-021-99288-8
Nextflow = bon compromis ?
Comparison alignement, variant caller (2021)
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04144-1
** Étapes du pipeline
*** Variant calling: Haplotype caller
https://gatk.broadinstitute.org/hc/en-us/articles/360035531412
Définis l'algorithme + image
** VCF
*** GT genotype
encoded as alleles values separated by either of ”/” or “|”, e.g. The allele values are 0 for the reference allele (what is in the reference sequence), 1 for the first allele listed in ALT, 2 for the second allele list in ALT and so on. For diploid calls examples could be 0/1 or 1|0 etc. For haploid calls, e.g. on Y, male X, mitochondrion, only one allele value should be given. All samples must have GT call information; if a call cannot be made for a sample at a given locus, ”.” must be specified for each missing allele in the GT field (for example ./. for a diploid). The meanings of the separators are:
    / : genotype unphased
    | : genotype phased
** Validation
*** NA12878
**** KILL [[https://precision.fda.gov/challenges/truth/results][fdaPrecision challenge]]
Attention, génome et en hg19 donc comparaison non adaptée ...
**** TODO Best practices for the analytical validation of clinical whole-genome sequencing intended for the diagnosis of germline disease
SCHEDULED: <2023-03-13 lun.>
https://www.nature.com/articles/s41525-020-00154-9
Recommandations générale pour genome, sans données brutes
**** TODO [#A] Performance assessment of variant calling pipelines using human whole exome sequencing and simulated data
SCHEDULED: <2023-03-04 Sat>
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2928-9
1. variant calling seul
2. NA12878 + données simulées
3. exome
4. évalué via F-score
Code disponible ! https://github.com/bharani-lab/WES-Benchmarking-Pipeline_Manoj/tree/master/Script
Résultat: BWA/Novoalign_DeepVariant
Aligneurs
- BWA-MEM 0.7.16
- Bowtie2 2.2.6
- Novoalign 3.08.02
- SOAP 2.21
- MOSAIK 2.2.3
Variantcalling
- GATK HaplotypeCaller 4
- FreeBayes 1.1.0
- SAMtools mpileup 1.7
- DeepVariant r0.4
  SNV
| Exome | Pipeline |    TP |   FP |  FN | Sensitivity | Precision | F-Score |   FDR |
|     1 | BWA_GATK | 23689 | 1397 | 613 |       0.975 |     0.944 |   0.959 | 0.057 |
|     2 | BWA_GATK | 23946 |  865 | 356 |       0.985 |     0.965 |   0.975 | 0.036 |
indel
 |   TP | FP | FN | Sensitivity | Precision | F-Score |   FDR |   |
 | 1254 | 72 | 75 |       0.944 |     0.946 |   0.945 | 0.054 |   |
 | 1309 | 10 | 20 |       0.985 |     0.992 |   0.989 | 0.008 |   |
Valeur brutes :
https://static-content.springer.com/esm/art%3A10.1186%2Fs12859-019-2928-9/MediaObjects/12859_2019_2928_MOESM8_ESM.pdf
Autres articles avec même comparaison en exome sur NA12878
- Hwang et al., 2015 studyi
- Highnam et al, 2015
-  Cornish and Guda, 2015
Variant Type
|                       | SNVs & Indels | CNVs (>10Kb) | SVs | Mitochondrial variants | Pseudogenes | REs | Somatic/ mosaic | Literature/Data | Source   |
| NA12878               |         100%a |          40% |   0 |                      0 |           0 |   0 |               0 | Zook et  al18   | NIST     |
| Other NIST standard   |           71% |          40% | 50% |                      0 |           0 |   0 |               0 | Zook  et al18   |          |
| (e.g. AJ/Asian trios) |               |              |     |                        |             |     |                 |                 |          |
| Platinum              |           29% |            0 |   0 |                      0 |           0 |   0 |               0 | Eberle et  al8  | Platinum |
| Genomes               |               |              |     |                        |             |     |                 |                 |          |
| Venter/HuRef          |           14% |          40% |   0 |                      0 |           0 |   0 |               0 | Trost et al1    | HuRef    |
**** Systematic comparison of germline variant calling pipelines cross multiple next-generation sequencers
#+begin_src bibtex
@ARTICLE{Chen2019-fp,
  title     = "Systematic comparison of germline variant calling pipelines
               cross multiple next-generation sequencers",
  author    = "Chen, Jiayun and Li, Xingsong and Zhong, Hongbin and Meng,
               Yuhuan and Du, Hongli",
  abstract  = "The development and innovation of next generation sequencing
               (NGS) and the subsequent analysis tools have gain popularity in
               scientific researches and clinical diagnostic applications.
               Hence, a systematic comparison of the sequencing platforms and
               variant calling pipelines could provide significant guidance to
               NGS-based scientific and clinical genomics. In this study, we
               compared the performance, concordance and operating efficiency
               of 27 combinations of sequencing platforms and variant calling
               pipelines, testing three variant calling pipelines-Genome
               Analysis Tool Kit HaplotypeCaller, Strelka2 and
               Samtools-Varscan2 for nine data sets for the NA12878 genome
               sequenced by different platforms including BGISEQ500,
               MGISEQ2000, HiSeq4000, NovaSeq and HiSeq Xten. For the variants
               calling performance of 12 combinations in WES datasets, all
               combinations displayed good performance in calling SNPs, with
               their F-scores entirely higher than 0.96, and their performance
               in calling INDELs varies from 0.75 to 0.91. And all 15
               combinations in WGS datasets also manifested good performance,
               with F-scores in calling SNPs were entirely higher than 0.975
               and their performance in calling INDELs varies from 0.71 to
               0.93. All of these combinations manifested high concordance in
               variant identification, while the divergence of variants
               identification in WGS datasets were larger than that in WES
               datasets. We also down-sampled the original WES and WGS datasets
               at a series of gradient coverage across multiple platforms, then
               the variants calling period consumed by the three pipelines at
               each coverage were counted, respectively. For the GIAB datasets
               on both BGI and Illumina platforms, Strelka2 manifested its
               ultra-performance in detecting accuracy and processing
               efficiency compared with other two pipelines on each sequencing
               platform, which was recommended in the further promotion and
               application of next generation sequencing technology. The
               results of our researches will provide useful and comprehensive
               guidelines for personal or organizational researchers in
               reliable and consistent variants identification.",
  journal   = "Sci. Rep.",
  publisher = "Springer Science and Business Media LLC",
  volume    =  9,
  number    =  1,
  pages     = "9345",
  month     =  jun,
  year      =  2019,
  copyright = "https://creativecommons.org/licenses/by/4.0",
  language  = "en"
}
#+end_src
Comparaison de différents pipeline 2019
https://www.nature.com/articles/s41598-019-45835-3
Combinaison
- variant calling = GATK, Strelka2 and Samtools-Varscan2
- sur NA12878
- séquencé sur BGISEQ500, MGISEQ2000, HiSeq4000, NovaSeq and HiSeq Xten.
  Conclusion: strelka2 supérieur mais biais sur NA12878 ?
Illumina > BGI pour indel, probablement car reads plus grand
#+begin_quote
 For WES datasets, the BGI platforms displayed the superior performance in SNPs
 calling while Illumina platforms manifested the better variants calling
 performance in INDELs calling, which could be explained by their divergence in
 sequencing strategy that producing different length of reads (all BGI platforms
 were 100 base pair read length while all Illumina platforms were 150 base pair
 read length). The read length effects, as a key factor between two platforms,
 would bring alignment bias and error which are higher for short reads and
 ultimately affect the variants calling especially the INDELs identification
#+end_quote
*** Débugger variant calling (haplotypecaller)
https://gatk.broadinstitute.org/hc/en-us/articles/360043491652-When-HaplotypeCaller-and-Mutect2-do-not-call-an-expected-variant
https://gatk.broadinstitute.org/hc/en-us/articles/360035891111-Expected-variant-at-a-specific-site-was-not-called
*** Hap.py
Format de sortie :
#+begin_src r
vcf_field_names(vcf, tag = "FORMAT")
#+end_src
#+RESULTS:
: FORMAT BD    1      String  Decision for call (TP/FP/FN/N)
: FORMAT BK    1      String  Sub-type for decision (match/mismatch type)
: FORMAT BVT   1      String  High-level variant type (SNP|INDEL).
: FORMAT BLT   1      String  High-level location type (het|homref|hetalt|homa
am = genotype mismatch
lm = allele/haplotype mismatch
. = non vu
**** On vérifie que am = genotype mismatch
référence  = T/T
high-confidence = T/C
notre = C/C
#+begin_src sh
bcftools filter -i 'POS=19196584'  /Work/Groups/bisonex/data/giab/GRCh38/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz | grep -v '#'
bcftools filter -i 'POS=19196584'  ../out/NA12878_NIST7035-dbsnp/variantCalling/haplotypecaller/NA12878_NIST.vcf.gz | grep -v '#'
#+end_src
#+RESULTS:
: NC_000022.11    19196584        .       T       C       50      PASS    platforms=5;platformnames=Illumina,PacBio,10X,Ion,Solid;datasets=5;datasetnames=HiSeqPE300x,CCS15kb_20kb,10XChromiumLR,IonExome,SolidSE75bp;callsets=7;callsetnames=HiSeqPE300xGATK,CCS15kb_20kbDV,CCS15kb_20kbGATK4,HiSeqPE300xfreebayes,10XLRGATK,IonExomeTVC,SolidSE75GATKHC;datasetsmissingcall=CGnormal;callable=CS_HiSeqPE300xGATK_callable,CS_CCS15kb_20kbDV_callable,CS_10XLRGATK_callable,CS_CCS15kb_20kbGATK4_callable,CS_HiSeqPE300xfreebayes_callable GT:PS:DP:ADALL:AD:GQ    0/1:.:781:109,123:138,150:348
: NC_000022.11    19196584        rs1061325       T       C       59.32   PASS    AC=2;AF=1;AN=2;DB;DP=2;ExcessHet=0;FS=0;MLEAC=1;MLEAF=0.5;MQ=60;QD=29.66;SOR=2.303      GT:AD:DP:GQ:PL  1/1:0,2:2:6:71,6,0
**** On vérifie que lm = allele/haplotype mismatch
référence  = CAA/CAA
high-confidence = CA/CA
notre = C/CA
#+begin_src sh
 bcftools filter -i 'POS=31277416'  /Work/Groups/bisonex/data/giab/GRCh38/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz | grep -v '#'
 bcftools filter -i 'POS=31277416'  ../out/NA12878_NIST7035-dbsnp/variantCalling/haplotypecaller/NA12878_NIST.vcf.gz | grep -v '#'
#+end_src
#+RESULTS:
: NC_000022.11    31277416        .       CA      C       50      PASS    platforms=3;platformnames=Illumina,PacBio,10X;datasets=3;datasetnames=HiSeqPE300x,CCS15kb_20kb,10XChromiumLR;callsets=4;callsetnames=HiSeqPE300xGATK,CCS15kb_20kbDV,10XLRGATK,HiSeqPE300xfreebayes;datasetsmissingcall=CCS15kb_20kb,CGnormal,IonExome,SolidSE75bp;callable=CS_HiSeqPE300xGATK_callable;difficultregion=GRCh38_AllHomopolymers_gt6bp_imperfectgt10bp_slop5,GRCh38_SimpleRepeat_imperfecthomopolgt10_slop5  GT:PS:DP:ADALL:AD:GQ    1/1:.:465:16,229:0,190:129
: NC_000022.11    31277416        rs57244615      CAA     C,CA    389.02  PASS    AC=1,1;AF=0.5,0.5;AN=2;BaseQRankSum=0.37;DB;DP=37;ExcessHet=0;FS=0;MLEAC=1,1;MLEAF=0.5,0.5;MQ=60;MQRankSum=0;QD=13.41;ReadPosRankSum=-0.651;SOR=0.572    GT:AD:DP:GQ:PL  1/2:5,10,14:29:64:406,202,313,64,0,88
* Idées
** Validation analytique
mail Yannis : données patients +/- simulées
*** Utiliser données GCAT et uploader le notre ?
https://www.nature.com/articles/ncomms7275
*** [#A] Variant calling : Genome in a bottle : NA12878 + autres
Résumé : https://www.nist.gov/programs-projects/genome-bottle
Manuscript : https://www.nature.com/articles/s41587-019-0054-x.epdf?author_access_token=E_1bL0MtBBwZr91xEsy6B9RgN0jAjWel9jnR3ZoTv0OLNnFBR7rUIZNDXq0DIKdg3w6KhBF8Rz2RWQFFc0St45kC6CZs3cDYc87HNHovbWSOubJHDa9CeJV-pN0BW_mQ0n7cM13KF2JRr_wAAn524w%3D%3D
Article comparant les variant calling : https://www.biorxiv.org/content/10.1101/2020.12.11.422022v1.full.pdf
**** KILL Tester le séquencage aussi
CLOSED: [2023-01-30 lun. 18:30]
Depuis un fastq correspondant à Illumina  https://github.com/genome-in-a-bottle/giab_data_indexes
   puis on compare le VCF avec les "high confidence"
On séquence directement NA12878 -> inutile pour le pipeline seul
**** TODO Tester seul la partie bioinformatique
   Tout résumé ici : https://www.nist.gov/programs-projects/genome-bottle
- methode https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/analysis/Illumina_PlatinumGenomes_NA12877_NA12878_09162015/IlluminaPlatinumGenomes-user-guide.pdf
- vcf
     https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/NA12878_HG001/latest/GRCh38/
NB: à quoi correspond https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/analysis/Illumina_PlatinumGenomes_NA12877_NA12878_09162015/hg38/2.0.1/NA12878/ ??
   Article comparant les variant calling : https://www.biorxiv.org/content/10.1101/2020.12.11.422022v1.full.pdf
   Article pour vcfeval : https://www.nature.com/articles/s41587-019-0054-x
   La version 4 ajoute 273 gènes "clinically relevant" https://www.biorxiv.org/content/10.1101/2021.06.07.444885v3.full.pdf
   Ajout des zones "difficiles"
   https://www.biorxiv.org/content/10.1101/2020.07.24.212712v5.full.pdf
*** [#B] Pipeline : générer patient avec tous les variants retrouvés à Centogene
Comparaison de génération ADN (2019)
https://academic.oup.com/bfg/article/19/1/49/5680294
**** SimuSCop
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03665-5
https://github.com/qasimyu/simuscop
1. Crééer un modèle depuis bam + vcf : Setoprofile
2. Génerer données NGS
** Annotation :
*** Comparaison vep / snpeff et annovar
* Changement nouvelle version
- Dernière version du génome (la version "prête à l'emploi" est seulement GRCh38 sans les version patchées)
* Notes
** Nextflow
*** afficher les résultats d'un process/workflow
#+begin_src
lol.out.view()
#+end_src
Attention, ne fonctionne pas si plusieurs sortie:
#+begin_src
lol.out[0].view()
#+end_src
ou si /a/ est le nom de la sortie
#+begin_src
lol.out.a.view()
#+end_src
** Quelle version du génome ?
Il y a 2 notations pour les chrosome: Refseq (NC_0001) ou chr1, chr2...
dbSNP utilise Refseq
pour le fasta, 2 solutions
- refseq : "https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/${genome}_latest/refseq_identifiers/${fna}.gz"
  -> nécessite d'indexer le fichier (long !)
- chromosome https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/
  -> nécessite d'annoter les chromosomes pour corriger (avec le fichier gff)
  On utilise la version chromosome donc on annote dbSNP (à faire)
** Performances
Ordinateur de Carine (WSL2) : 4h dont 1h15 alignement (parallélisé) et 1h15 haplotypecaller (séquentiel)
** Chromosomes NC, NT, NW
Correspondance :
https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&chromInfoPage=
Signification
https://genome.ucsc.edu/FAQ/FAQdownloads.html#downloadAlt
- alt = séquences alternatives (utilisables)
- fix = patch (correction ou amélioration)
- random = séquence connue sur un chromosome mais non encore utilisée
** Pipelines prêt-à-l’emploi nextflow
Problème : nécessite singularity ou docker (ou conda)
Potentiellement utilisable avec nix...
* Données :data:
** TODO Remplacer bam par fastq sur mesocentre
Commande
*** STRT Supprimer les fastq non "paired"
nushell
Liste des fastq avec "paired-end" manquant
#+begin_src nu
ls **/*.fastq.gz | get name | path basename | split column "_" | get column1 | uniq -u | save single.txt
#+end_src
#+RESULTS:
: 62907927
: 62907970
: 62899606
: 62911287
: 62913201
: 62914084
: 62915905
: 62921595
: 62923065
: 62925220
: 62926503
: 62926502
: 62926500
: 62926499
: 62926498
: 62931719
: 629434

[9.35]

[2.16610]

#+title: Bisonex
* Biblio :biblio:
** Workflow
Comparaison WDL, Cromwell, nextflow
https://www.nature.com/articles/s41598-021-99288-8
Nextflow = bon compromis ?
Comparison alignement, variant caller (2021)
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04144-1
** Étapes du pipeline
*** Variant calling: Haplotype caller
https://gatk.broadinstitute.org/hc/en-us/articles/360035531412
Définis l'algorithme + image
** VCF
*** GT genotype
encoded as alleles values separated by either of ”/” or “|”, e.g. The allele values are 0 for the reference allele (what is in the reference sequence), 1 for the first allele listed in ALT, 2 for the second allele list in ALT and so on. For diploid calls examples could be 0/1 or 1|0 etc. For haploid calls, e.g. on Y, male X, mitochondrion, only one allele value should be given. All samples must have GT call information; if a call cannot be made for a sample at a given locus, ”.” must be specified for each missing allele in the GT field (for example ./. for a diploid). The meanings of the separators are:
    / : genotype unphased
    | : genotype phased
** Validation
*** NA12878
**** KILL [[https://precision.fda.gov/challenges/truth/results][fdaPrecision challenge]]
Attention, génome et en hg19 donc comparaison non adaptée ...
**** TODO Best practices for the analytical validation of clinical whole-genome sequencing intended for the diagnosis of germline disease
SCHEDULED: <2023-04-06 Thu>
https://www.nature.com/articles/s41525-020-00154-9
Recommandations générale pour genome, sans données brutes
**** TODO [#A] Performance assessment of variant calling pipelines using human whole exome sequencing and simulated data
SCHEDULED: <2023-04-06 Thu>
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2928-9
1. variant calling seul
2. NA12878 + données simulées
3. exome
4. évalué via F-score
Code disponible ! https://github.com/bharani-lab/WES-Benchmarking-Pipeline_Manoj/tree/master/Script
Résultat: BWA/Novoalign_DeepVariant
Aligneurs
- BWA-MEM 0.7.16
- Bowtie2 2.2.6
- Novoalign 3.08.02
- SOAP 2.21
- MOSAIK 2.2.3
Variantcalling
- GATK HaplotypeCaller 4
- FreeBayes 1.1.0
- SAMtools mpileup 1.7
- DeepVariant r0.4
  SNV
| Exome | Pipeline |    TP |   FP |  FN | Sensitivity | Precision | F-Score |   FDR |
|     1 | BWA_GATK | 23689 | 1397 | 613 |       0.975 |     0.944 |   0.959 | 0.057 |
|     2 | BWA_GATK | 23946 |  865 | 356 |       0.985 |     0.965 |   0.975 | 0.036 |
indel
 |   TP | FP | FN | Sensitivity | Precision | F-Score |   FDR |   |
 | 1254 | 72 | 75 |       0.944 |     0.946 |   0.945 | 0.054 |   |
 | 1309 | 10 | 20 |       0.985 |     0.992 |   0.989 | 0.008 |   |
Valeur brutes :
https://static-content.springer.com/esm/art%3A10.1186%2Fs12859-019-2928-9/MediaObjects/12859_2019_2928_MOESM8_ESM.pdf
Autres articles avec même comparaison en exome sur NA12878
- Hwang et al., 2015 studyi
- Highnam et al, 2015
-  Cornish and Guda, 2015
Variant Type
|                       | SNVs & Indels | CNVs (>10Kb) | SVs | Mitochondrial variants | Pseudogenes | REs | Somatic/ mosaic | Literature/Data | Source   |
| NA12878               |         100%a |          40% |   0 |                      0 |           0 |   0 |               0 | Zook et  al18   | NIST     |
| Other NIST standard   |           71% |          40% | 50% |                      0 |           0 |   0 |               0 | Zook  et al18   |          |
| (e.g. AJ/Asian trios) |               |              |     |                        |             |     |                 |                 |          |
| Platinum              |           29% |            0 |   0 |                      0 |           0 |   0 |               0 | Eberle et  al8  | Platinum |
| Genomes               |               |              |     |                        |             |     |                 |                 |          |
| Venter/HuRef          |           14% |          40% |   0 |                      0 |           0 |   0 |               0 | Trost et al1    | HuRef    |
**** Systematic comparison of germline variant calling pipelines cross multiple next-generation sequencers
#+begin_src bibtex
@ARTICLE{Chen2019-fp,
  title     = "Systematic comparison of germline variant calling pipelines
               cross multiple next-generation sequencers",
  author    = "Chen, Jiayun and Li, Xingsong and Zhong, Hongbin and Meng,
               Yuhuan and Du, Hongli",
  abstract  = "The development and innovation of next generation sequencing
               (NGS) and the subsequent analysis tools have gain popularity in
               scientific researches and clinical diagnostic applications.
               Hence, a systematic comparison of the sequencing platforms and
               variant calling pipelines could provide significant guidance to
               NGS-based scientific and clinical genomics. In this study, we
               compared the performance, concordance and operating efficiency
               of 27 combinations of sequencing platforms and variant calling
               pipelines, testing three variant calling pipelines-Genome
               Analysis Tool Kit HaplotypeCaller, Strelka2 and
               Samtools-Varscan2 for nine data sets for the NA12878 genome
               sequenced by different platforms including BGISEQ500,
               MGISEQ2000, HiSeq4000, NovaSeq and HiSeq Xten. For the variants
               calling performance of 12 combinations in WES datasets, all
               combinations displayed good performance in calling SNPs, with
               their F-scores entirely higher than 0.96, and their performance
               in calling INDELs varies from 0.75 to 0.91. And all 15
               combinations in WGS datasets also manifested good performance,
               with F-scores in calling SNPs were entirely higher than 0.975
               and their performance in calling INDELs varies from 0.71 to
               0.93. All of these combinations manifested high concordance in
               variant identification, while the divergence of variants
               identification in WGS datasets were larger than that in WES
               datasets. We also down-sampled the original WES and WGS datasets
               at a series of gradient coverage across multiple platforms, then
               the variants calling period consumed by the three pipelines at
               each coverage were counted, respectively. For the GIAB datasets
               on both BGI and Illumina platforms, Strelka2 manifested its
               ultra-performance in detecting accuracy and processing
               efficiency compared with other two pipelines on each sequencing
               platform, which was recommended in the further promotion and
               application of next generation sequencing technology. The
               results of our researches will provide useful and comprehensive
               guidelines for personal or organizational researchers in
               reliable and consistent variants identification.",
  journal   = "Sci. Rep.",
  publisher = "Springer Science and Business Media LLC",
  volume    =  9,
  number    =  1,
  pages     = "9345",
  month     =  jun,
  year      =  2019,
  copyright = "https://creativecommons.org/licenses/by/4.0",
  language  = "en"
}
#+end_src
Comparaison de différents pipeline 2019
https://www.nature.com/articles/s41598-019-45835-3
Combinaison
- variant calling = GATK, Strelka2 and Samtools-Varscan2
- sur NA12878
- séquencé sur BGISEQ500, MGISEQ2000, HiSeq4000, NovaSeq and HiSeq Xten.
  Conclusion: strelka2 supérieur mais biais sur NA12878 ?
Illumina > BGI pour indel, probablement car reads plus grand
#+begin_quote
 For WES datasets, the BGI platforms displayed the superior performance in SNPs
 calling while Illumina platforms manifested the better variants calling
 performance in INDELs calling, which could be explained by their divergence in
 sequencing strategy that producing different length of reads (all BGI platforms
 were 100 base pair read length while all Illumina platforms were 150 base pair
 read length). The read length effects, as a key factor between two platforms,
 would bring alignment bias and error which are higher for short reads and
 ultimately affect the variants calling especially the INDELs identification
#+end_quote
*** Débugger variant calling (haplotypecaller)
https://gatk.broadinstitute.org/hc/en-us/articles/360043491652-When-HaplotypeCaller-and-Mutect2-do-not-call-an-expected-variant
https://gatk.broadinstitute.org/hc/en-us/articles/360035891111-Expected-variant-at-a-specific-site-was-not-called
*** Hap.py
Format de sortie :
#+begin_src r
vcf_field_names(vcf, tag = "FORMAT")
#+end_src
#+RESULTS:
: FORMAT BD    1      String  Decision for call (TP/FP/FN/N)
: FORMAT BK    1      String  Sub-type for decision (match/mismatch type)
: FORMAT BVT   1      String  High-level variant type (SNP|INDEL).
: FORMAT BLT   1      String  High-level location type (het|homref|hetalt|homa
am = genotype mismatch
lm = allele/haplotype mismatch
. = non vu
**** On vérifie que am = genotype mismatch
référence  = T/T
high-confidence = T/C
notre = C/C
#+begin_src sh
bcftools filter -i 'POS=19196584'  /Work/Groups/bisonex/data/giab/GRCh38/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz | grep -v '#'
bcftools filter -i 'POS=19196584'  ../out/NA12878_NIST7035-dbsnp/variantCalling/haplotypecaller/NA12878_NIST.vcf.gz | grep -v '#'
#+end_src
#+RESULTS:
: NC_000022.11    19196584        .       T       C       50      PASS    platforms=5;platformnames=Illumina,PacBio,10X,Ion,Solid;datasets=5;datasetnames=HiSeqPE300x,CCS15kb_20kb,10XChromiumLR,IonExome,SolidSE75bp;callsets=7;callsetnames=HiSeqPE300xGATK,CCS15kb_20kbDV,CCS15kb_20kbGATK4,HiSeqPE300xfreebayes,10XLRGATK,IonExomeTVC,SolidSE75GATKHC;datasetsmissingcall=CGnormal;callable=CS_HiSeqPE300xGATK_callable,CS_CCS15kb_20kbDV_callable,CS_10XLRGATK_callable,CS_CCS15kb_20kbGATK4_callable,CS_HiSeqPE300xfreebayes_callable GT:PS:DP:ADALL:AD:GQ    0/1:.:781:109,123:138,150:348
: NC_000022.11    19196584        rs1061325       T       C       59.32   PASS    AC=2;AF=1;AN=2;DB;DP=2;ExcessHet=0;FS=0;MLEAC=1;MLEAF=0.5;MQ=60;QD=29.66;SOR=2.303      GT:AD:DP:GQ:PL  1/1:0,2:2:6:71,6,0
**** On vérifie que lm = allele/haplotype mismatch
référence  = CAA/CAA
high-confidence = CA/CA
notre = C/CA
#+begin_src sh
 bcftools filter -i 'POS=31277416'  /Work/Groups/bisonex/data/giab/GRCh38/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz | grep -v '#'
 bcftools filter -i 'POS=31277416'  ../out/NA12878_NIST7035-dbsnp/variantCalling/haplotypecaller/NA12878_NIST.vcf.gz | grep -v '#'
#+end_src
#+RESULTS:
: NC_000022.11    31277416        .       CA      C       50      PASS    platforms=3;platformnames=Illumina,PacBio,10X;datasets=3;datasetnames=HiSeqPE300x,CCS15kb_20kb,10XChromiumLR;callsets=4;callsetnames=HiSeqPE300xGATK,CCS15kb_20kbDV,10XLRGATK,HiSeqPE300xfreebayes;datasetsmissingcall=CCS15kb_20kb,CGnormal,IonExome,SolidSE75bp;callable=CS_HiSeqPE300xGATK_callable;difficultregion=GRCh38_AllHomopolymers_gt6bp_imperfectgt10bp_slop5,GRCh38_SimpleRepeat_imperfecthomopolgt10_slop5  GT:PS:DP:ADALL:AD:GQ    1/1:.:465:16,229:0,190:129
: NC_000022.11    31277416        rs57244615      CAA     C,CA    389.02  PASS    AC=1,1;AF=0.5,0.5;AN=2;BaseQRankSum=0.37;DB;DP=37;ExcessHet=0;FS=0;MLEAC=1,1;MLEAF=0.5,0.5;MQ=60;MQRankSum=0;QD=13.41;ReadPosRankSum=-0.651;SOR=0.572    GT:AD:DP:GQ:PL  1/2:5,10,14:29:64:406,202,313,64,0,88
* Idées
** Validation analytique
mail Yannis : données patients +/- simulées
*** Utiliser données GCAT et uploader le notre ?
https://www.nature.com/articles/ncomms7275
*** [#A] Variant calling : Genome in a bottle : NA12878 + autres
Résumé : https://www.nist.gov/programs-projects/genome-bottle
Manuscript : https://www.nature.com/articles/s41587-019-0054-x.epdf?author_access_token=E_1bL0MtBBwZr91xEsy6B9RgN0jAjWel9jnR3ZoTv0OLNnFBR7rUIZNDXq0DIKdg3w6KhBF8Rz2RWQFFc0St45kC6CZs3cDYc87HNHovbWSOubJHDa9CeJV-pN0BW_mQ0n7cM13KF2JRr_wAAn524w%3D%3D
Article comparant les variant calling : https://www.biorxiv.org/content/10.1101/2020.12.11.422022v1.full.pdf
**** KILL Tester le séquencage aussi
CLOSED: [2023-01-30 lun. 18:30]
Depuis un fastq correspondant à Illumina  https://github.com/genome-in-a-bottle/giab_data_indexes
   puis on compare le VCF avec les "high confidence"
On séquence directement NA12878 -> inutile pour le pipeline seul
**** TODO Tester seul la partie bioinformatique
   Tout résumé ici : https://www.nist.gov/programs-projects/genome-bottle
- methode https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/analysis/Illumina_PlatinumGenomes_NA12877_NA12878_09162015/IlluminaPlatinumGenomes-user-guide.pdf
- vcf
     https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/NA12878_HG001/latest/GRCh38/
NB: à quoi correspond https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/analysis/Illumina_PlatinumGenomes_NA12877_NA12878_09162015/hg38/2.0.1/NA12878/ ??
   Article comparant les variant calling : https://www.biorxiv.org/content/10.1101/2020.12.11.422022v1.full.pdf
   Article pour vcfeval : https://www.nature.com/articles/s41587-019-0054-x
   La version 4 ajoute 273 gènes "clinically relevant" https://www.biorxiv.org/content/10.1101/2021.06.07.444885v3.full.pdf
   Ajout des zones "difficiles"
   https://www.biorxiv.org/content/10.1101/2020.07.24.212712v5.full.pdf
*** [#B] Pipeline : générer patient avec tous les variants retrouvés à Centogene
Comparaison de génération ADN (2019)
https://academic.oup.com/bfg/article/19/1/49/5680294
**** SimuSCop
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03665-5
https://github.com/qasimyu/simuscop
1. Crééer un modèle depuis bam + vcf : Setoprofile
2. Génerer données NGS
** Annotation :
*** Comparaison vep / snpeff et annovar
* Changement nouvelle version
- Dernière version du génome (la version "prête à l'emploi" est seulement GRCh38 sans les version patchées)
* Notes
** Nextflow
*** afficher les résultats d'un process/workflow
#+begin_src
lol.out.view()
#+end_src
Attention, ne fonctionne pas si plusieurs sortie:
#+begin_src
lol.out[0].view()
#+end_src
ou si /a/ est le nom de la sortie
#+begin_src
lol.out.a.view()
#+end_src
** Quelle version du génome ?
Il y a 2 notations pour les chrosome: Refseq (NC_0001) ou chr1, chr2...
dbSNP utilise Refseq
pour le fasta, 2 solutions
- refseq : "https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/${genome}_latest/refseq_identifiers/${fna}.gz"
  -> nécessite d'indexer le fichier (long !)
- chromosome https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/
  -> nécessite d'annoter les chromosomes pour corriger (avec le fichier gff)
  On utilise la version chromosome donc on annote dbSNP (à faire)
** Performances
Ordinateur de Carine (WSL2) : 4h dont 1h15 alignement (parallélisé) et 1h15 haplotypecaller (séquentiel)
** Chromosomes NC, NT, NW
Correspondance :
https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&chromInfoPage=
Signification
https://genome.ucsc.edu/FAQ/FAQdownloads.html#downloadAlt
- alt = séquences alternatives (utilisables)
- fix = patch (correction ou amélioration)
- random = séquence connue sur un chromosome mais non encore utilisée
** Pipelines prêt-à-l’emploi nextflow
Problème : nécessite singularity ou docker (ou conda)
Potentiellement utilisable avec nix...
** Zone de capture
GIAB fourni le .bed. INfo : https://support.illumina.com/sequencing/sequencing_kits/nextera-rapid-capture-exome-kit/downloads.html
* Données :data:
** TODO Remplacer bam par fastq sur mesocentre
Commande
*** STRT Supprimer les fastq non "paired"
nushell
Liste des fastq avec "paired-end" manquant
#+begin_src nu
ls **/*.fastq.gz | get name | path basename | split column "_" | get column1 | uniq -u | save single.txt
#+end_src
#+RESULTS:
: 62907927
: 62907970
: 62899606
: 62911287
: 62913201
: 62914084
: 62915905
: 62921595
: 62923065
: 62925220
: 62926503
: 62926502
: 62926500
: 62926499
: 62926498
: 62931719
: 629434

Replacement in projects/bisonex.org at line 6 [9.35]

B:BD[10.49141] → [10.49141:57333]

298002 |        0.916079 |
| INDEL | PASS   |        8937 |     7550 |     1387 |        9971 |      283 |      1964 |    30 |   242 |      0.844803 |         0.964656 |       0.196971 |        0.900760 |
| SNP   | ALL    |       52494 |    52125 |      369 |       90092 |      582 |     37348 |   107 |   354 |      0.992971 |         0.988966 |       0.414554 |        0.990964 |
| SNP   | PASS   |       52494 |    46920 |     5574 |       48078 |      143 |       992 |     8 |    97 |      0.893816 |         0.996963 |       0.020633 |        0.942576 |
**** TODO Version avec rtg-tools
**** TODO Faire fonctionner Tests
***** TODO Essai 2 : depuis nix develop:
#+begin_src
nix develop .#hap-py
genericBuild
#+end_src
Lancé initialement à la main, mais on peut maintenant utiliser run_tests
#+begin_src
HCDIR=bin/ ../src/sh/run_tests.sha
#+end_src
- [X] test boost
- [X] multimerge
- [X] hapenum
- [X] fp accuracy
- [X] faulty variant
- leftshift fails
- [X] other vcf
- [X] chr prefix
- [X] gvcf
- [X] decomp
- [X] contig lengt
- [X]  integration test
- [ ] scmp fails sur le type
- [X] giab
- [X] performance
- [ ] quantify fails sur le type
- [ ] stratified échec sur les résultats !
- [X] pg counting
- [ ] sompy: ne trouve pas Strelka dans somatic
phases="buildPhase checkPhase installPhase fixupPhase" genericBuild
#+end_src
**** TODO Reproduire les performances precisionchallenge NA12878
https://www.nist.gov/programs-projects/genome-bottle
***** TODO 0GOOR
Le problème venait 1. de l'ADN et 2. du renommage des chromosomes qui était faux
****** DONE HG002
CLOSED: [2023-02-17 Fri 19:31]
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score
INDEL    ALL       525466    491355     34111      1156702     57724     605307   9384  25027       0.935084          0.895313        0.523304         0.914766
INDEL   PASS       525466    491355     34111      1156702     57724     605307   9384  25027       0.935084          0.895313        0.523304         0.914766
  SNP    ALL      3365115   3358399      6716      5666020     21995    2284364   4194   1125       0.998004          0.993496        0.403169         0.995745
  SNP   PASS      3365115   3358399      6716      5666020     21995    2284364   4194   1125       0.998004          0.993496        0.403169         0.995745
 TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
                    NaN                     NaN                   1.528276                   2.752637
                    NaN                     NaN                   1.528276                   2.752637
               2.100129                1.473519                   1.581196                   1.795603
               2.100129                1.473519                   1.581196                   1.795603
***** KILL Avec python2
CLOSED: [2023-02-17 Fri 19:25]
****** KILL avec nix
CLOSED: [2023-02-17 Fri 19:25]
conda create -n python2 python=2.7 anaconda
****** KILL avec conda
CLOSED: [2023-02-17 Fri 19:25]
******* Gentoo: regex_error sur test...
Ok avec bash !
#+begin_src
anaconda3/bin/conda create --name py2 python=2.7
conda activate py2
conda install -c bioconda hap.py
#+end_src
******** Faire tourner les tests.
Il faut remplace bin/test_haplotypes par test_haplotypes dans src/sh/run_tests.sh
#+begin_src sh
 HGREF=../genome/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta HCDIR=~/anaconda3/envs/py2/bin bash src/sh/run_tests.sh
#+end_src
Echec:
test_haplotypes: /opt/conda/conda-bld/work/hap.py-0.3.7/src/c++/lib/tools/Fasta.cpp:81: MMappedFastaFile::MMappedFastaFile(const string&): Assertion `fd != -1' failed.
unknown location(0): fatal error in "testVariantPrimitiveSplitter": signal: SIGABRT (application abort requested)
/opt/conda/conda-bld/work/hap.py-0.3.7/src/c++/test/test_align.cpp(298): last checkpoint
******** Chr21
HGREF=../genome/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta hap.py        example/happy/PG_NA12878_chr21.vcf.gz       example/happy/NA12878_chr21.vcf.gz       -f example/happy/PG_Conf_chr21.bed.gz       -o test
******* Helios
échec
****** KILL avec docker
CLOSED: [2023-02-17 Fri 19:25]
** DONE Exécution
CLOSED: [2022-09-13 Tue 21:37]
*** KILL test Bionix
*** KILL Implémenter execution avec Nix ?
Voir https://academic.oup.com/gigascience/article/9/11/giaa121/5987272?login=false
pour un exemple.
Probablement plus simple d’utiliser Nix pour gestion de l’environnement et snakemake pour l’exécution
Pas d’accès internet depuis le cluster
*** DONE nextflow
CLOSED: [2022-09-13 Tue 21:37]
** TODO Preprocessing avec nextflow
*** TODO Map to reference
**** TODO Sample ID dans header
/Work/Users/apraga/bisonex/out/63003856_S135/preprocessing/baserecalibrator
*** DONE Mark duplicate
CLOSED: [2022-10-09 Sun 22:30]
*** DONE Recalibrate base quality score
CLOSED: [2022-10-09 Sun 22:30]
** DONE Variant calling avec Nextflow
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Haplotype caller
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter variants
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter common snp not clinvar path
CLOSED: [2022-11-07 Mon 23:00]
Voir [[*common dbSNP not clinvar patho][common dbSNP not clinvar patho]]
*** DONE Filter variant only in consensual sequence
CLOSED: [2022-11-08 Tue 22:23]
*** DONE Filter technical variants
CLOSED: [2022-11-19 Sat 21:34]
*** TODO Utilise AVX pour accélerer l'exécution
Sans cela, on a l'avertissement
#+begin_quote
17:28:00.720 INFO  PairHMM - OpenMP multi-threaded AVX-accelerated native PairHMM implementation is not supported
17:28:00.721 INFO  NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/nix/store/cy9ckxqwrkifx7wf02hm4ww1p6lnbxg9-gatk-4.2.4.1/bin/gatk-package-4.2.4.1-local.jar!/com/intel/gkl/native/libgkl_utils.so
17:28:00.733 WARN  NativeLibraryLoader - Unable to load libgkl_utils.so from native/libgkl_utils.so (/Work/Users/apraga/bisonex/out/NA12878_NIST7035/preprocessing/applybqsr/libgkl_utils821485189051585397.so: libgomp.so.1: cannot open shared object file: No such file or directory)
17:28:00.733 WARN  IntelPairHmm - Intel GKL Utils not loaded
17:28:00.733 WARN  PairHMM - ***WARNING: Machine does not have the AVX instruction set support needed for the accelerated AVX PairHmm. Falling back to the MUCH slower LOGLESS_CACHING implementation!
17:28:00.763 INFO  ProgressMeter - Starting traversal
#+end_quote
libgomp.so est fourni par gcc donc il faut charger le module
 module load gcc@11.3.0/gcc-12.1.0
** TODO Annotation avec nextflow
*** TODO VEP
***** KILL Utiliser --gene-phenotype ?
CLOSED: [2023-03-15 mer. 13:43]
Vu avec alexis : bases de données non à jour
https://www.ensembl.org/info/genome/variation/phenotype/sources_phenotype_documentation.html
***** TODO Plugin pour CADD, pLI, LOEUF ?
https://www.ensembl.org/info/docs/tools/vep/script/vep_plugins.html#cadd
CADD: n’a pas réussi à le faire fonctionner
pLI, LOEUF : non demandé
***** TODO Utiliser l'option --hgvsg pour remplaer hgvsg.r ?
Non fait par Alexis, oubli a priori
***** TODO Ajout spliceAI ?
*** TODO Spip
**** TODO Checksum sur données
*** TODO Filtrer après VEP
**** TODO Remplacer avec simplement bcftools filter ?
*** TODO OMIM
**** TODO Remplacer script R par bcftools ?
**** TODO Remplacer script R par vep ?
*** TODO clinvar
**** TODO Remplacer script R par bcftools ?
**** TODO Remplacer script R par vep ?
*** TODO ACMG incidental
**** TODO Inclure dans vep ?
*** TODO Grantham
*** TODO LRG
*** TODO Gnomad
** DONE Porter exament la version d'Alexis sur Helios
CLOSED: [2023-01-14 Sat 17:56]
Branche "prod"
** STRT Tester version d'alexis avec Nix
*** DONE Ajouter clinvar
CLOSED: [2022-11-13 Sun 19:37]
*** DONE Alignement
CLOSED: [2022-11-13 Sun 12:52]
*** DONE Haplotype caller
CLOSED: [2022-11-13 Sun 13:00]
*** TODO Filter
- [X] depth
- [ ] comon snp not path
Problème avec liste des ID
**** TODO variant annotation
Besoin de vep
*** TO

[10.49141]

[10.57333]

298002 |        0.916079 |
| INDEL | PASS   |        8937 |     7550 |     1387 |        9971 |      283 |      1964 |    30 |   242 |      0.844803 |         0.964656 |       0.196971 |        0.900760 |
| SNP   | ALL    |       52494 |    52125 |      369 |       90092 |      582 |     37348 |   107 |   354 |      0.992971 |         0.988966 |       0.414554 |        0.990964 |
| SNP   | PASS   |       52494 |    46920 |     5574 |       48078 |      143 |       992 |     8 |    97 |      0.893816 |         0.996963 |       0.020633 |        0.942576 |
**** TODO Version avec rtg-tools
**** TODO Faire fonctionner Tests
***** TODO Essai 2 : depuis nix develop:
#+begin_src
nix develop .#hap-py
genericBuild
#+end_src
Lancé initialement à la main, mais on peut maintenant utiliser run_tests
#+begin_src
HCDIR=bin/ ../src/sh/run_tests.sha
#+end_src
- [X] test boost
- [X] multimerge
- [X] hapenum
- [X] fp accuracy
- [X] faulty variant
- leftshift fails
- [X] other vcf
- [X] chr prefix
- [X] gvcf
- [X] decomp
- [X] contig lengt
- [X]  integration test
- [ ] scmp fails sur le type
- [X] giab
- [X] performance
- [ ] quantify fails sur le type
- [ ] stratified échec sur les résultats !
- [X] pg counting
- [ ] sompy: ne trouve pas Strelka dans somatic
phases="buildPhase checkPhase installPhase fixupPhase" genericBuild
#+end_src
**** KILL Reproduire les performances precisionchallenge : attention à HG002 et HG001!
CLOSED: [2023-04-01 Sat 19:43]
https://www.nist.gov/programs-projects/genome-bottle
***** KILL 0GOOR
CLOSED: [2023-04-01 Sat 19:40]
Le problème venait 1. de l'ADN et 2. du renommage des chromosomes qui était faux
****** DONE HG002
CLOSED: [2023-02-17 Fri 19:31]
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score
INDEL    ALL       525466    491355     34111      1156702     57724     605307   9384  25027       0.935084          0.895313        0.523304         0.914766
INDEL   PASS       525466    491355     34111      1156702     57724     605307   9384  25027       0.935084          0.895313        0.523304         0.914766
  SNP    ALL      3365115   3358399      6716      5666020     21995    2284364   4194   1125       0.998004          0.993496        0.403169         0.995745
  SNP   PASS      3365115   3358399      6716      5666020     21995    2284364   4194   1125       0.998004          0.993496        0.403169         0.995745
 TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
                    NaN                     NaN                   1.528276                   2.752637
                    NaN                     NaN                   1.528276                   2.752637
               2.100129                1.473519                   1.581196                   1.795603
               2.100129                1.473519                   1.581196                   1.795603
***** KILL Avec python2
CLOSED: [2023-02-17 Fri 19:25]
****** KILL avec nix
CLOSED: [2023-02-17 Fri 19:25]
conda create -n python2 python=2.7 anaconda
****** KILL avec conda
CLOSED: [2023-02-17 Fri 19:25]
******* Gentoo: regex_error sur test...
Ok avec bash !
#+begin_src
anaconda3/bin/conda create --name py2 python=2.7
conda activate py2
conda install -c bioconda hap.py
#+end_src
******** Faire tourner les tests.
Il faut remplace bin/test_haplotypes par test_haplotypes dans src/sh/run_tests.sh
#+begin_src sh
 HGREF=../genome/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta HCDIR=~/anaconda3/envs/py2/bin bash src/sh/run_tests.sh
#+end_src
Echec:
test_haplotypes: /opt/conda/conda-bld/work/hap.py-0.3.7/src/c++/lib/tools/Fasta.cpp:81: MMappedFastaFile::MMappedFastaFile(const string&): Assertion `fd != -1' failed.
unknown location(0): fatal error in "testVariantPrimitiveSplitter": signal: SIGABRT (application abort requested)
/opt/conda/conda-bld/work/hap.py-0.3.7/src/c++/test/test_align.cpp(298): last checkpoint
******** Chr21
HGREF=../genome/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta hap.py        example/happy/PG_NA12878_chr21.vcf.gz       example/happy/NA12878_chr21.vcf.gz       -f example/happy/PG_Conf_chr21.bed.gz       -o test
******* Helios
échec
** DONE Exécution
CLOSED: [2022-09-13 Tue 21:37]
*** KILL test Bionix
*** KILL Implémenter execution avec Nix ?
Voir https://academic.oup.com/gigascience/article/9/11/giaa121/5987272?login=false
pour un exemple.
Probablement plus simple d’utiliser Nix pour gestion de l’environnement et snakemake pour l’exécution
Pas d’accès internet depuis le cluster
*** DONE nextflow
CLOSED: [2022-09-13 Tue 21:37]
**** TODO Bug scheduler SGE
Le job se fait tuer car l'utilisateur n'est pas passé correctement à nextflow
***** DONE Forcer l'utilisateur à l'exécution
CLOSED: [2023-04-01 Sat 17:57]
NXF_OPTS=-D"user.name=alex"
***** DONE Vérifier si le problème persiste avec 22.10.6
CLOSED: [2023-04-01 Sat 18:38] SCHEDULED: <2023-04-01 Sat>
oui
***** KILL Packager l'utilisateur dans le programme ?
Mauvaise idée..
** TODO Preprocessing avec nextflow
*** TODO Map to reference
**** TODO Sample ID dans header
/Work/Users/apraga/bisonex/out/63003856_S135/preprocessing/baserecalibrator
*** DONE Mark duplicate
CLOSED: [2022-10-09 Sun 22:30]
*** DONE Recalibrate base quality score
CLOSED: [2022-10-09 Sun 22:30]
** DONE Variant calling avec Nextflow
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Haplotype caller
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter variants
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter common snp not clinvar path
CLOSED: [2022-11-07 Mon 23:00]
Voir [[*common dbSNP not clinvar patho][common dbSNP not clinvar patho]]
*** DONE Filter variant only in consensual sequence
CLOSED: [2022-11-08 Tue 22:23]
*** DONE Filter technical variants
CLOSED: [2022-11-19 Sat 21:34]
*** TODO Utilise AVX pour accélerer l'exécution
Sans cela, on a l'avertissement
#+begin_quote
17:28:00.720 INFO  PairHMM - OpenMP multi-threaded AVX-accelerated native PairHMM implementation is not supported
17:28:00.721 INFO  NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/nix/store/cy9ckxqwrkifx7wf02hm4ww1p6lnbxg9-gatk-4.2.4.1/bin/gatk-package-4.2.4.1-local.jar!/com/intel/gkl/native/libgkl_utils.so
17:28:00.733 WARN  NativeLibraryLoader - Unable to load libgkl_utils.so from native/libgkl_utils.so (/Work/Users/apraga/bisonex/out/NA12878_NIST7035/preprocessing/applybqsr/libgkl_utils821485189051585397.so: libgomp.so.1: cannot open shared object file: No such file or directory)
17:28:00.733 WARN  IntelPairHmm - Intel GKL Utils not loaded
17:28:00.733 WARN  PairHMM - ***WARNING: Machine does not have the AVX instruction set support needed for the accelerated AVX PairHmm. Falling back to the MUCH slower LOGLESS_CACHING implementation!
17:28:00.763 INFO  ProgressMeter - Starting traversal
#+end_quote
libgomp.so est fourni par gcc donc il faut charger le module
 module load gcc@11.3.0/gcc-12.1.0
** TODO Annotation avec nextflow
*** TODO VEP
***** KILL Utiliser --gene-phenotype ?
CLOSED: [2023-03-15 mer. 13:43]
Vu avec alexis : bases de données non à jour
https://www.ensembl.org/info/genome/variation/phenotype/sources_phenotype_documentation.html
***** TODO Plugin pour CADD, pLI, LOEUF ?
https://www.ensembl.org/info/docs/tools/vep/script/vep_plugins.html#cadd
CADD: n’a pas réussi à le faire fonctionner
pLI, LOEUF : non demandé
***** TODO Utiliser l'option --hgvsg pour remplaer hgvsg.r ?
Non fait par Alexis, oubli a priori
***** TODO Ajout spliceAI ?
*** TODO Spip
**** TODO Checksum sur données
*** TODO Filtrer après VEP
**** TODO Remplacer avec simplement bcftools filter ?
*** TODO OMIM
**** TODO Remplacer script R par bcftools ?
**** TODO Remplacer script R par vep ?
*** TODO clinvar
**** TODO Remplacer script R par bcftools ?
**** TODO Remplacer script R par vep ?
*** TODO ACMG incidental
**** TODO Inclure dans vep ?
*** TODO Grantham
*** TODO LRG
*** TODO Gnomad
** DONE Porter exament la version d'Alexis sur Helios
CLOSED: [2023-01-14 Sat 17:56]
Branche "prod"
** STRT Tester version d'alexis avec Nix
*** DONE Ajouter clinvar
CLOSED: [2022-11-13 Sun 19:37]
*** DONE Alignement
CLOSED: [2022-11-13 Sun 12:52]
*** DONE Haplotype caller
CLOSED: [2022-11-13 Sun 13:00]
*** TODO Filter
- [X] depth
- [ ] comon snp not path
Problème avec liste des ID
**** TODO variant annotation
Besoin de vep
*** TO

Replacement in projects/bisonex.org at line 29 [9.35]

B:BD[2.32996] → [2.32996:39217]

∅:D[2.39217] → [10.245776:278544]

B:BD[10.245776] → [10.245776:278544]

B:BD[10.278544] → [2.39218:45140]

6_S135/63003856_S135.bam /Work/Groups/bisonex/ref/tmp_63003856_S135/63003856_S135.bam O=compare-bam.tsv
picard CompareSAMs -LENIENT_LOW_MQ_ALIGNMENT true -LENIENT_DUP true tmp_63003856_S135/63003856_S135.bam /Work/Groups/bisonex/ref/tmp_63003856_S135/63003856_S135.bam -O compare-bam.tsv
VN Program Record attribute differs.
File 1: 1.13
File 2: 1.10
SAM files differ.
[Tue Jan 24 23:12:50 CET 2023] picard.sam.CompareSAMs done. Elapsed time: 7.32 minutes.
***** DONE Relancer avec la même version de samtools
CLOSED: [2023-01-25 Wed 21:58]
Pas d'impact
***** TODO Comparer tsv de sortie
***** TODO Regarder où sont les variants différents
** TODO Validation : NA12878 :na12878:
https://github.com/ga4gh/benchmarking-tools
Prérequis :
- [[*hap.py][hap.py]]
- [[*NA12878][NA12878]]
*** TODO GIAB
**** DONE télécharger données avec Nextflow
CLOSED: [2023-02-25 Sat 19:47]
***** DONE Télécharger NA12878 (HG001)
CLOSED: [2023-02-25 Sat 19:46]
****** DONE Fastq HiSeq
CLOSED: [2023-02-25 Sat 19:46]
On prend le Hiseq, qui est probablement ce qu'utilise Centogène :
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/
On utilisé les données "trimmés" (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1069-7), i.e qui ont enlevé les fragments plus petits que la taille d'un read.
Informations:
- https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/Garvan_NA12878_HG001_HiSeq_Exome.README
- Sequencer: HiSeq2500
- kit: Nextera Rapid Capture Exome and Expanded Exome
Il y a 2 samples (NIST7035 et NIST7086), chacun sur 2 lanes -> à concaténer
NB : liste techno illumina https://www.illumina.com/systems/sequencing-platforms.html
Hiseq postérieur nextseq 550
****** DONE Capture : Exons (bed)
CLOSED: [2023-02-25 Sat 19:46]
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/nexterarapidcapture_expandedexome_targetedregions.bed.gz
****** DONE Bed, vcf
CLOSED: [2023-02-24 Fri 23:45]
***** DONE Télécharger Ashkenazy trio HG002
CLOSED: [2023-02-17 Fri 19:29]
***** DONE Renommer les chromosomes
CLOSED: [2023-02-17 Fri 19:30]
***** DONE Genome de reference NCBI
CLOSED: [2023-02-25 Sat 19:46]
***** DONE Bed avec les exons
CLOSED: [2023-03-29 Wed 23:04]
****** DONE hg19
CLOSED: [2023-02-26 Sun 22:37]
****** DONE hg38
CLOSED: [2023-03-29 Wed 23:04]
- [X] Télécharger hg19 : ok
- [X] convertir bed en interval list
picard BedToIntervalList -I exons_illumina.bed  -O exons_illumina.list -SD  ../../genome/GRCh19/genomeRef.dict
- [X] puis en hg38
picard LiftOverIntervalList -I exons_illumina.list  -O exons_illumina_hg38.list --CHAIN hg19ToHg38.over.chain -SD  ../../genome/GRCh38.p13/genomeRef.dict
- [X] puis en bed
**** Notes
https://github.com/genome-in-a-bottle/giab_FAQ
**** DONE Discussion alexis : Mail
CLOSED: [2023-03-29 Wed 22:40]
Avec le patient NA12878 et comparaison avec hap.py du VCF de Genome In A Bottle ("gold" standard), on avait pour rappel
- sensibilité (=recall) 71% pour indel, 85% SNP
- précision  (= VPP) 69 et 97% respectivement
| Type  | TRUTH |    TP |   FN | QUERY |   FP |  UNK | FP.gt | FP.al |   Recall | Precision |
| INDEL |  4871 |  3461 | 1410 |  7048 | 1554 | 1987 |   193 |   346 | 0.710532 |  0.692946 |
| SNP   | 46032 | 39369 | 6663 | 44600 | 1186 | 4041 |   304 |    30 | 0.855253 |  0.970759 |
Les statistiques sur les génomes sont bien meilleurs (cf precisionFDA challenge).
Pour les exome, un article [1] a fait a des meilleures stats sur ce patient avec BWA et GATK mais ils ont moins de variant (on a presque un facteur 2 !).
Je soupçonne qu'on ne travaille pas sur les mêmes zones de capture (pas réussi à récupérer leur .bed)
| Exome | Type  |    TP |   FP |  FN | Sensitivity | Precision | F-Score |   FDR |
|     1 | SNV   | 23689 | 1397 | 613 |       0.975 |     0.944 |   0.959 | 0.057 |
|     2 | SNV   | 23946 |  865 | 356 |       0.985 |     0.965 |   0.975 | 0.036 |
|     1 | indel |  1254 |   72 |  75 |       0.944 |     0.946 |   0.945 | 0.054 |
|     2 | indel |  1309 |   10 |  20 |       0.985 |     0.992 |   0.989 | 0.008 |
Pour essayer d'améliorer les statistiques :
- La version du génome GRC38 vs GRCh38.p13 ne change quasiment rien
- Désactiver dbSNP ne change strictement rien pour le variant calling
J'ai exploré les faux négatifs :
- la grande majorité n'est juste pas vue (ce n'est pas un problème d'haploïde/génotype)
- la répartition par chromosome est relativement homogène, sauf sur le 6 ()
- la majorité est en 5' et 3'UTR (selon Best refseq)
Conclusion: je pense m'arrêter là pour la validation du variant calling par manque de temps. Il faudrait creuser pour savoir pourquoi certains variants ne sont pas vus par GATK mais ce n'est pas la majorité. En tout cas, je peux justifier d'une première analyse pour la thèse.
Ça te va ?
[1]
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2928-9
Résultats ici https://static-content.springer.com/esm/art%3A10.1186%2Fs12859-019-2928-9/MediaObjects/12859_2019_2928_MOESM8_ESM.pdf
**** DONE Comparaison
CLOSED: [2023-03-04 Sat 11:14]
HGREF=/Work/Groups/bisonex/data-alexis-reference/genome/GRCh38_latest_genomic.fna ./result/bin/hap.py /Work/Groups/bisonex/NA12878/HG001_GRCh38_1_22_v4.2.1
_benchmark_renamed.vcf.gz script/files/vcf/NA12878_NIST7035_vep_annot.vcf -f /Work/Groups/bison
ex/NA12878/HG001_GRCh38_1_22_v4.2.1_benchmark.bed -o test
na1878.slurm
#+begin_src slurm
#!/bin/bash
#SBATCH -c 4
#SBATCH -p smp
#SBATCH --time=01:00:00
#SBATCH --mem=32G
module load nix/2.11.0
export HGREF=/Work/Groups/bisonex/data-alexis-reference/genome/GRCh38_latest_genomic.fna
dir=/Work/Groups/bisonex/data/NA12878/GRCh38
hap.py ${dir}/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz script/files/vcf/NA12878_NIST7035.vcf -f ${dir}/HG001_GRCh38_1_22_v4.2.1_benchmark.bed -o test
#+end_src
***** KILL beaucoup trop de faux négatifs
CLOSED: [2023-02-17 Fri 19:37]
****** DONE Test 1 : vep annot : beaucoup trop de faux négatif
CLOSED: [202
3-02-06 lun. 13:40]
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL       276768       274    276494         1500       257        968     26     15       0.000990          0.516917        0.645333         0.001976                     NaN                     NaN                   1.483361                   6.129187
INDEL   PASS       276768       274    276494         1500       257        968     26     15       0.000990          0.516917        0.645333         0.001976                     NaN                     NaN                   1.483361                   6.129187
  SNP    ALL      1937706      1193   1936513         3338       106       2037     11      2       0.000616          0.918524        0.610246         0.001231                  2.0785                1.861183                   1.539064                   2.703663
  SNP   PASS      1937706      1193   1936513         3338       106       2037     11      2       0.000616          0.918524        0.610246         0.001231                  2.0785                1.861183                   1.539064                   2.703663
****** KILL Test 3 : indexer vcf de reference
CLOSED: [2023-02-06 lun. 17:19]
Même résultat avec vcfeval, qui a besoin de la version indexée
****** DONE Test 3 sans filtre vep : idem
CLOSED: [2023-02-06 lun. 17:19]
Benchmarking Summary:
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL       276768     10535    266233        52169     10969      30616   3552   2122       0.038064          0.491069        0.586862         0.070652                     NaN                     NaN                   1.483361                   0.509510
INDEL   PASS       276768     10535    266233        52169     10969      30616   3552   2122       0.038064          0.491069        0.586862         0.070652                     NaN                     NaN                   1.483361                   0.509510
  SNP    ALL      1937706    105753   1831953       357652     74634     177259  35111    797       0.054576          0.586270        0.495619         0.099857                  2.0785                 1.42954                   1.539064                   0.324923
  SNP   PASS      1937706    105753   1831953       357652     74634     177259  35111    797       0.054576          0.586270        0.495619         0.099857                  2.0785                 1.42954                   1.539064                   0.324923
****** DONE Test 4 avec vcfeval sur vep_annot : idem
CLOSED: [2023-02-06 lun. 17:18]
#+begin_src
#!/bin/bash
#SBATCH -c 4
#SBATCH -p smp
#SBATCH --time=01:00:00
#SBATCH --mem=32G
module load nix/2.11.0
export HGREF=/Work/Groups/bisonex/data-alexis-reference/genome/GRCh38_latest_genomic.fna dir=/Work/Groups/bisonex/data/NA12878/GRCh38
rtg vcfeval -b  /Work/Groups/bisonex/data/NA12878/GRCh38/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz  -c files/vcf/NA12878_NIST7035_vep_annot.vcf.gz  -o test-rtg -t /Work/Groups/bisonex/data/genome/GRCh38.p13/genomeRef.sdf
#+end_src
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    1.000               2984           2682       1840    3890296     0.5931       0.0008     0.0015
     None               2984           2682       1841    3890296     0.5930       0.0008     0.0015
Exemple du log
2023-02-06 13:50:14 Reference NC_000001.11 baseline contains 307854 variants.
2023-02-06 13:50:14 Reference NC_000001.11 calls contains 426 variants.
2023-02-06 13:50:15 Reference NC_000002.12 baseline contains 325877 variants.
2023-02-06 13:50:15 Reference NC_000002.12 calls contains 320 variants.
****** DONE Regarder quelques variants à la main
CLOSED: [2023-02-07 Tue 22:01]
Ex:
Il manque NC_000001.11    783006  .       A       G       50      PASS
Il y a A -> G et C -> A sur cette position
**** DONE Restreindre genome de référence
CLOSED: [2023-03-04 Sat 11:15]
***** Discussion Alexis
le pipeline prend en compte 5', 3', variant canoniques d'épissage + prédit spip
Le plus simple pour le moment est de restreindre seulement aux exons
GENCODE (version europénne) vs RefSeq: [[https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-16-S8-S2][article 2015]] en faveur de GENCODE mais Alexis conseille Refseq
***** DONE -f, -R ou -T ?
CLOSED: [2023-02-25 Sat 19:47]
Selon la doc : -f avec le bed fourni et -T pour filtrer sor les exons
****** rtg tools
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    3.000               1015            910        206      32531     0.8154       0.0303     0.0583
     None               1015            910        206      32531     0.8154       0.0303     0.0583
**** TODO Exons seuls
***** DONE BestRefSeq 
CLOSED: [2023-02-19 Sun 12:05]
Dans refseq,[[https://www.ncbi.nlm.nih.gov/genome/annotation_euk/process/][2 types de modèles pour le gène]]
- basé sur refseq (NM_, NP_), curée
- basé sur gnomon (XM_ , XP_), prédite
Les modèles basés sur refseq ont la préférénce (cf lien)
On se restreint donc à bestrefseq
#+begin_src sh
wget https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.gff.gz
gunzip https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.gff.gz
#+end_src
On se restrein aux exons codant (NM_)
#+begin_src
awk '/BestRefSeq\texon/ && /transcript_id=NM/ {print $1"\t"$4"\t"$5;}' GRCh38_latest_genomic.gff > exons.csv
#+end_src
Puis intersection
****** DONE Tests après correction bug dans noms de chromosome : precision ~ ok, recall très mauvais -> trop de FN ?
CLOSED: [2023-02-19 Sun 12:05]
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision
INDEL    ALL         7230       321      6909         1500       290        888     27     18       0.044398          0.526144
INDEL   PASS         7230       321      6909         1500       290        888     27     18       0.044398          0.526144
  SNP    ALL        59052      1653     57399         3338       101       1583     12      2       0.027992          0.942450
  SNP   PASS        59052      1653     57399         3338       101       1583     12      2       0.027992          0.942450
 METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
       0.592000         0.081887                     NaN                     NaN                    1.54733                   6.129187
       0.592000         0.081887                     NaN                     NaN                    1.54733                   6.129187
       0.474236         0.054370                2.433271                1.861183                    1.57523                   2.703663
       0.474236         0.054370                2.433271                1.861183                    1.57523                   2.703663
****** DONE Vérifier exons: on a l'union des exons de tous les transcripts...
CLOSED: [2023-02-19 Sun 12:05]
Il faudrait un .bed d'illumina
On teste Twist for Illumina Exome 2.0 Plus BED File (hg19) sur https://support.illumina.com/downloads/nextera-flex-for-enrichment-BED-files.html
Conversion en hg38 avec ucsc
Renommage des chromosomes
#+begin_src
sed 's:^:s/chr:;s:chrMT:chrM:;s:\s:\\t/:;s:$:\\t/:' ../../genome/GRCh38.p13/chromosome_mapping.txt > pattern.sed
sed -i.bak -f pattern.sed illumina_exons.bed
bedtools intersect -a HG001_GRCh38_1_22_v4.2.1_benchmark.bed -b illumina_exons.bed > HG001_GRCh38_1_22_v4.2.1_benchmark_illumina_exons.bed
#+end_src
Intersection
***** KILL Bed illumina
CLOSED: [2023-02-24 Fri 23:44]
****** KILL Sans filtre vep: Inversion truth et query... on recommence
CLOSED: [2023-02-19 Sun 13:15]
****** KILL Sans filtre vep: mieux mais pas exceptionnel
CLOSED: [2023-02-24 Fri 23:44]
cd work/00/2c72e62400956c96fb101ac7af405e/
$ cat .command.out
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision
INDEL    ALL          922       490       432          942       439          0     32     56       0.531453          0.533970
INDEL   PASS          922       490       432          942       439          0     32     56       0.531453          0.533970
  SNP    ALL        24618     18689      5929        21257      2571          0    239     17       0.759160          0.879052
  SNP   PASS        24618     18689      5929        21257      2571          0    239     17       0.759160          0.879052
 METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
            0.0         0.532709                     NaN                     NaN                   1.628743                   2.799180
            0.0         0.532709                     NaN                     NaN                   1.628743                   2.799180
            0.0         0.814719                2.899604                2.881526                   1.598479                   1.564378
            0.0         0.814719                2.899604                2.881526                   1.598479                   1.564378
****** KILL Comprendre pour les FN explosent avec vep annot
CLOSED: [2023-02-24 Fri 23:44]
NC_000001.11	924024	.	C	G
non couverte dans bam -> 1er exons de SAMD11 mais non couverte
Il est dans NM_001385641.1 mais pas dans NM_152486.4
***** KILL Obtenir les zones couvertes depuis le bam directement
CLOSED: [2023-02-24 Fri 23:44]
#+begin_src  sh
samtools sort NA12878_chr1.bam -o NA12878_chr1_sorted.bam
bedtools genomecov -ibam NA12878_chr1_sorted.bam  -bg > chr1.bedgraph
head chr1.bedgraph
#+end_src
#+RESULTS:
: NC_000001.11	10001	10021	1
: NC_000001.11	10021	10059	2
: NC_000001.11	10059	10063	1
: NC_000001.11	10063	10087	2
: NC_000001.11	10087	10110	1
: NC_000001.11	10110	10113	3
: NC_000001.11	10113	10121	2
On prend les régions avec 20 reads
awk '$4 > 20 {print $1"\t"$2"\t"$3}' chr1.bedgraph  > tomerge.bed
On fusionne les régions
bedtools merge -i tomerge.bed > exons_from_bam.bed
Problème : on manque parfois le bord des exons...
***** KILL Vérifier un bam de référence de NA12878
CLOSED: [2023-02-24 Fri 23:44]
Notamment pour la définitions des exons, on a des reads qui ne semblent pas alignés sur les bons exons selon IGV
On donne directemet à IGV le bam d'illuma WES ( https://github.com/genome-in-a-bottle/giab_data_indexes )
IGV n'aime pas le ftp apparement...
On récupére 2 échantillons pour ce patient sur HiSeq Exome
#+begin_src sh :dir /ssh:meso:/Work/Groups/bisonex/data/giab/GRCh38
wget https://raw.githubusercontent.com/genome-in-a-bottle/giab_data_indexes/master/NA12878/alignment.index.NA12878_HiSeq_Exome_Garvan_GRCh37_09252015 -O todl.txt
awk '{print $1}' todl.txt | wget -i -
#+end_src
On récupére le premier échantillons en prenant just le chromosome 1
#+begin_src
samtools merge NA12878-NIST7035-HiSeq_Exome_Garan_GRCh37.bam project
.NIST_NIST7035_H7AP8ADXX_TAAGGCGA_1_NA12878.bwa.markDuplicates.bam project.NIST_NIST7035_H7AP8ADXX_TA
AGGCGA_2_NA12878.bwa.markDuplicates.bam
samtools index NA12878-NIST7035-HiSeq_Exome_Garan_GRCh37.bam
#+end_src
****** Étude
NC_000001.11:924024 :
- bed d’illumina : pas correct
- bed refseq : meux
- bam :  exon non ciblé, probablement parce que Mane select n’a pas été utilisé mais refseq (https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:28706)
NC_000001.11:924310: idem
NC_000001.11:924321: idem
NC_000001.11:924533: idem
NC_000001.11:966227: le bed ne couvre pas le bon exon !
https://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr1%3A965863%2D966591&hgsid=295298291_vhDbIv5YIckCKLpGB7UUaZi2cAkr
NC_000001.11:966272
***** KILL Filtrer variants introniques de référence avec vep
CLOSED: [2023-02-24 Fri 23:44]
****** KILL variant calling seulf + seulement -f: nombreux FP
CLOSED: [2023-02-24 Fri 23:44]
/Work/Users/apraga/bisonex/work/68/f1cf72a5a4078fdf743fb3844b369a
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision
INDEL    ALL          519       284       235        52169     37789      14094     12     64       0.547206          0.007511
INDEL   PASS          519       284       235        52169     37789      14094     12     64       0.547206          0.007511
  SNP    ALL        22131     17434      4697       357652    305313      34904    189     32       0.787764          0.054020
  SNP   PASS        22131     17434      4697       357652    305313      34904    189     32       0.787764          0.054020
 METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
       0.270160         0.014820                     NaN                     NaN                   1.775956                   0.509510
       0.270160         0.014820                     NaN                     NaN                   1.775956                   0.509510
       0.097592         0.101108                2.971834                1.429533                   1.579776                   0.324923
       0.097592         0.101108                2.971834                1.429533                   1.579776                   0.324923
***** DONE Bed donnés par GIAB (en hg19) : résultats corrects
CLOSED: [2023-03-09 Thu 22:42] SCHEDULED: <2023-03-02 Thu>
Téléchargé à la main et converti via UCSCS. À automatiser, voir : *hg38
Sans oublier de renommer les chromosomes !
| Type  | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision |
|-------+-------------+----------+----------+-------------+----------+-----------+-------+-------+---------------+------------------|
| INDEL |        4871 |     3461 |     1410 |        7048 |     1554 |      1987 |   193 |   346 |      0.710532 |         0.692946 |
| SNP   |       46032 |    39369 |     6663 |       44600 |     1186 |      4041 |   304 |    30 |      0.855253 |         0.970759 |
 Type   Ssensibilité    VPP
 INDEL       0.710532   0.692946 
 SNP         0.855253   0.970759
 | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
 |----------------+-----------------+------------------------+------------------------+---------------------------+---------------------------|
 |       0.281924 |        0.701629 |                        |                        |        1.6174985978687606 |        3.0674091441969518 |
 |       0.090605 |        0.909353 |      2.529551552318896 |     2.4019675131548843 |        1.6206857273037931 |        1.6274012964054214 |
***** DONE Re-tester avec exons Refseq et option -T: résultats un peu moins bons
CLOSED: [2023-03-09 Thu 22:42] SCHEDULED: <2023-03-08 Wed>
On utilise directement les coordonées données par refseq
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision
INDEL    ALL         7226      3417      3809         6978      1599       1918    228    353       0.472876          0.683992
  SNP    ALL        59052     37825     21227        43480      1913       3740    675     35       0.640537          0.951862
 METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
       0.274864         0.559171                     NaN                     NaN                   1.547733                   2.756151
       0.086017         0.765767                2.433271                2.350281                   1.575230                   1.492346
***** TODO Vérifier que la zone de capture a vraiment besoin d'un liftover...
**** TODO D'où vient la différence ?
SCHEDULED: <2023-03-09 Thu>
***** DONE Statistiques
CLOSED: [2023-03-15 mer. 13:41] SCHEDULED: <2023-03-04 Sat>
On confirnme le nombre de SNP 6665
- 304 ont un problème d’haploide/allèle différente
- 28 avec une différence de génotype
- La majorité ne sont pas vu 6361
[1] "ALl FN 6665"
[1] "Genotype mismatch 28"
[1] "haplotype mismatch 304"
[1] "Missing  6333"
***** DONE Majorité des FN pour SPN sont peu couverts
CLOSED: [2023-03-16 Thu 16:54]
Chromosome 1 : majorité des FN ne sont pas vus...
cf figure
#+attr_html: :width 500px
[[file:reads.png]]
 1623412 : 5’UTR
 1953616 : intronique mais dans bed...
 2589991
 3488573
 3488732
 3780326
 3884683
 3914055
 4711993
 4712657
***** DONE Alignement
CLOSED: [2023-03-15 mer. 13:41]
****** DONE Impact de la version mineure du genome: non
CLOSED: [2023-03-09 Thu 23:13]
******* DONE GHC38 + version alexis + exons refseq
CLOSED: [2023-03-22 Wed 00:02]
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision
INDEL    ALL         7226      3417      3809         6979      1599       1919    228    353       0.472876          0.683992
  SNP    ALL        59052     37827     21225        43483      1913       3741    676     35       0.640571          0.951865
 METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
       0.274968         0.559171                     NaN                     NaN                   1.547733                   2.756698
       0.086034         0.765792                2.433271                2.350254                   1.575230                   1.492375
******* DONE GHC38.p13 + notre version alexis + exons refseq
CLOSED: [2023-03-22 Wed 00:02]
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision
INDEL    ALL         7226      3417      3809         6978      1599       1918    228    353       0.472876          0.683992
  SNP    ALL        59052     37825     21227        43480      1913       3740    675     35       0.640537          0.951862
 METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
       0.274864         0.559171                     NaN                     NaN                   1.547733                   2.756151
       0.086017         0.765767                2.433271                2.350281                   1.575230                   1.492346
***** DONE Vérifier si coordonnées génomiques (vcf)
CLOSED: [2023-03-09 Thu 23:05]
***** DONE Variant calling
CLOSED: [2023-03-22 Wed 23:11]
****** DONE Désactiver dbSNP : idem
CLOSED: [2023-03-22 Wed 15:00]
Après haplotycaller
$ sha256sum work/94/d4ae5999ca59a25469b1ed221b7b1b/NA12878_NIST.vcf.gz
193fb37e2a581bb2855ae6fd11638f3b97958515def0e4295005e6a60c9dc650  work/94/d4ae5999ca59a25469b1ed221b7b1b/NA12878_NIST.vcf.gz
Fichier utilisé par hap.py
$ sha256sum work/5b/199360963641524a076d06b621e7e0/NA12878_NIST.vcf.gz
193fb37e2a581bb2855ae6fd11638f3b97958515def0e4295005e6a60c9dc650  work/5b/199360963641524a076d06b621e7e0/NA12878_NIST.vcf.gz
On a bien le même que [[*Bed donnés par GIAB (en hg19) : résultats corrects][Bed donnés par GIAB (en hg19) : résultats corrects]]
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision
INDEL    ALL         4871      3461      1410         7048      1554       1987    193    346       0.710532          0.692946
  SNP    ALL        46032     39367      6665        44599      1186       4042    304     30       0.855209          0.970757
 METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
       0.281924         0.701629                     NaN                     NaN                   1.617499                   3.067409
       0.090630         0.909327                2.529552                2.402151                   1.620686                   1.627342
***** DONE Définition des exons
CLOSED: [2023-03-22 Wed 23:16]
****** KILL Vérifier qu'il ne manque pas des exons (avec bam ?)
CLOSED: [2023-03-16 Thu 16:54] SCHEDULED: <2023-03-05 Sun>
****** KILL Vérifier la couverture des variants introniques
CLOSED: [2023-03-22 Wed 15:01]
NC_000001.11  23344310  : intronique mais la région dans le bed couvre en partie l'exon
Profondeur sur le variant : 14 reads sur l'alt
****** Notes: UTR = exon - CDS
intron dans le gene = mRNA - exon
On le vérifie avec
  25 │ NC_000001.11  BestRefSeq  exon      65520   65573
  27 │ NC_000001.11  BestRefSeq  CDS       65565   65573
  https://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr1%3A65520%2D65573&hgsid=295007972_J1rAQiur4dC2KLadnaOCw27WUihG
  mRNA :
  https://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr1%3A65419%2D71585&hgsid=295007972_J1rAQiur4dC2KLadnaOCw27WUihG
****** DONE Statistiques : reads par type (introns, UTR, exons)
CLOSED: [2023-03-22 Wed 23:16]
- la répartition par chromosome est relativement homogène, sauf sur le 6 ()
- la majorité est en 5' et 3'UTR (selon Best refseq)
#+begin_src julia
using CSV
using DataFrames, DataFramesMeta
using XAM, GenomicFeatures
function exonFromBestRefSeq(f)
    println("Reading exons from $(f)")
    cols = ["seqname", "source", "feature", "start", "end", "score", "strand", "frame", "attribute"]
    d = DataFrame(CSV.File(f ,header=cols, delim="\t",comment="#"))
    @chain d begin
        @rsubset :feature in ["exon", "CDS", "mRNA"] :source .== "BestRefSeq"
        @select :seqname :source :feature :start :end
        @distinct
    end
end
# Get exons from refseq (Bestrefeq, gnomon or just refseq)
function getExon(f)
    if isfile(f)
        println("Reading exons from preprocessed $(f)")
        d = DataFrame(CSV.File(f))
    else
        d = exonFromBestRefSeq("GRCh38_latest_genomic.gff.gz")
        CSV.write(f, d)
    end
    return d
end
# Return false negative and TRUTH annotation (genotype, type...)
# Read vcf manually: it's a tab-separated file and we extract the TRUTH column manually
#BD = Decision for call (TP/FP/FN/N)
#BK = tSub-type for decision (match/mismatch type)
#BI = Additional comparison information
#QQ = Variant quality for ROC creation.
#BV = High-level variant type (SNP|INDEL).
#BL = High-level location type (het|homref|hetalt|homalt|nocall).
function getFN(f)
    cols = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT", "TRUTH", "QUERY"]
    vcf = DataFrame(CSV.File(f,comment="#", delim="\t",
                             header = cols))
    # Extract genothype and other subcolumn fronm FORMAT
    format = split(vcf[1,:FORMAT], ":")
    @chain vcf begin
        @rtransform $format=split(:TRUTH, ':')
        @subset :BD .== "FN"
        @select :type=:BVT :CHROM :POS :REF :ALT :GT :BK :BLT
    end
end
# Search a variant in refseq exons
# Return the smallest interval (CDS)
function searchVariant(chrom, pos, exons)
    res = @subset exons :seqname .== chrom :start .<= pos :end .>= pos
    if isempty(res)
        return "intronicOutsideGene"
    elseif "CDS" in res.feature
        return "CDS"
    elseif "exon" in res.feature
        # UTR = exon - CDS
        return "UTR"
    elseif "mRNA" in res.feature
        # intron = mRNA - exon
        return "intronicGene"
    else
        return "intronicOutsideGene"
    end
end
# Once the bam file has been opened, count the reads for a single variant
function countReadsVariant(reader, chrom, pos)
    n = 0
    for record in eachoverlap(reader, chrom,pos:pos)
        n += 1
    end
    return n
end
# For all variants, get the number of reads from the bam file
function countReads(d)
    bamDir = "../script/files/bam"
    bam = bamDir * "/NA12878_NIST7035_recalibrated_hg38.bam"
    bai = bam * ".bai"
    reader = open(BAM.Reader, bam, index=bai)
    d2 = @transform(d, @byrow :reads = countReadsVariant(reader, :CHROM, :POS))
    close(reader)
    return d2
end
function exonicStatus(d, exons)
    # Check variants are in exon: for each variant, search in all exons
    @transform(d, @byrow :match = searchVariant(:CHROM, :POS, exons))
end
exons = getExon("GRCh38_latest_genomic_exon.csv")
d = getFN("../out/test-bed/test-allchr.vcf") |>
    x -> exonicStatus(x, exons) |>
    countReads
# # bed = DataFrame(CSV.File("../out/test-bed/exons_illumina_hg38.bed",comment="#", delim="\t", header=false))
CSV.write("falseNegatives.csv", d)
#+end_src
Discordance nombre de reads avec igv mais cohérent avec samtools : vu avec Alexis : probablement du à IVG, on reste sur samtools
$ samtools view   NA12878_NIST7035_recalibrated_hg38.bam NC_000001.11:1734812-1734812  -F "0x200" | wc -l
164
***** KILL Comprendre pourquoi le chromosome 6 a plus de FN
CLOSED: [2023-03-29 Wed 22:39]
Vérifier le graphe fait en julia
$ grep "FN.*SNP" test-allchr.vcf | grep NC_000006 -c
1264
@subset d :type .== "SNP" :CHROM .== "NC_000006.12"
1264×10 DataFrame
Région HLA: difficile car nombreux allèles alternatifs.
Mais dans nos données, sont principalement non vus.
La répartition des reads est similaire aux autres régions...
#+begin_src julia
vcf = vcfHappy("test-allchr.vcf")
d = DataFrame(CSV.File("data/falseNegatives.csv"))
d2 = @subset d :POS .>= 29600000 :POS .<= 32000000 :type .== "SNP" :CHROM .=="NC_000006.12"
#+end_src
#+RESULTS:
: julia> nrow(@subset d2 :BK .== ".")
: 612
: julia> nrow(@subset d2 :BK .!= ".")
: 6
: julia> nrow(@subset d2 :BK .!= "am")
: 612
***** KILL comparer avec Kumaran 2021
CLOSED: [2023-03-22 Wed 23:16]
Bed probablement différent, on a presque un facteur 2 sur le nombre de variant (TP..)
***** KILL Comparer avec stats de NA12878 dans example/happy sur chr21 (exons fournis)
CLOSED: [2023-03-04 Sat 11:01]
****** DONE DP_over_30_not_SNP_consensual_sequence.vcf: horrible
CLOSED: [2023-02-25 Sat 19:47]
déplorable
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL          519        16       503         1579      1032        531      2      2       0.030829          0.015267        0.336289         0.020421                     NaN                     NaN                   1.775956                   4.768382
INDEL   PASS          519        16       503         1579      1032        531      2      2       0.030829          0.015267        0.336289         0.020421                     NaN                     NaN                   1.775956                   4.768382
  SNP    ALL        22131      1191     20940         4346      2342        813      4      1       0.053816          0.337107        0.187069         0.092815                2.971834                1.967235                   1.579776                   1.492828
  SNP   PASS        22131      1191     20940         4346      2342        813      4      1       0.053816          0.337107        0.187069         0.092815                2.971834                1.967235                   1.579776                   1.492828
****** DONE DP _over_30 : mieux
CLOSED: [2023-03-04 Sat 11:01]
/Work/Users/apraga/bisonex/work/69/cc1391f5a0fbf4db958a370ec17a19
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision
INDEL    ALL           76        20        56           36         6         10      1      3       0.263158          0.769231
INDEL   PASS           76        20        56           36         6         10      1      3       0.263158          0.769231
  SNP    ALL          582       312       270          370         9         49      0      0       0.536082          0.971963
  SNP   PASS          582       312       270          370         9         49      0      0       0.536082          0.971963
******* DONE NC_000021.9:45515163 TA -> T : référence non normalisée ??
CLOSED: [2023-03-04 Sat 11:01]
NC_000021.9     45515163        .       TA      T       50      PASS    BS=45515163;Regions=CONF,TS_contained   GT:BD:BK:BI:BVT:BLT:QQ  0/1:FN:lm:d1_5:INDEL:het:.      ./.:.:.:.:NOCALL:nocall:0
NC_000021.9     45515163        .       TAA     T       694.02  PASS    BS=45515163;Regions=CONF,TS_contained   GT:BD:BK:BI:BVT:BLT:QQ  ./.:.:.:.:NOCALL:nocall:.       0/1:FP:lm:d1_5:INDEL:het:694.02
Mais dans notre vcf
NC_000021.9     45515163        rs11284347      TAA     T,TA
La séquence est TAAAA donc notre TAA -> TA aurait du être reconnu comme identique. Faut-il utiliser un autre fasta ?
La décomposition de xcmp est
45515163        TAA -> T
45515164        AA -> A
NB: testée avec pre.py
#+begin_src
grep '^#' NA12878_NIST7035_DP_over_30.vcf > NA12878_NIST7035_DP_over_30_chr21.vcf
grep '^NC_000021' NA12878_NIST7035_DP_over_30.vcf >> NA12878_NIST7035_DP_over_30_chr21.vcf
pre.py NA12878_NIST7035_DP_over_30_chr21.vcf out.vcf.gz -r GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta
#+end_src
Selon l'algorithm, on devrait avoir TA -> T et non AA -> A (non left-align)
Si on débug:
#+begin_src
grep "^#" NA12878_NIST7035_DP_over_30_chr21.vcf > test_align.vcf
grep 45515163 NA12878_NIST7035_DP_over_30_chr21.vcf  >> test_align.vcf.gz
bgzip test_align.vcf.gz
bcftools index test_align.vcf.gz
multimerge test_align.vcf.gz -o out.vcf.gz -r GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta
#+end_src
Idem... Si on utile un genome de référence plus récent ?
#+begin_src
multimerge test_align.vcf.gz -o out.vcf.gz -r /Work/Groups/bisonex/data/genome/GRCh38.p13/genomeRef.fna --process-full=1
#+end_src
Idem. Et vcfeval ?
#+begin_src
tabix HG001_GRCh38_1_22_v4.2.1_benchmark_chr21.vcf.gz
tabix test_align.vcf.gz
rtg vcfeval -b HG001_GRCh38_1_22_v4.2.1_benchmark_chr21.vcf.gz -c test_align.vcf.gz -t /Work/Groups/bisonex/data/genome/GRCh38.p13/genomeRef.sdf -o vcfeval-test
#+end_src
Selected score threshold using: maximized F-measure
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
   99.000                  0              0          1      54828     0.0000       0.0000     0.0000
     None                  0              0          1      54828     0.0000       0.0000     0.0000
     On essaie les différentes étapes
#+begin_src
multimerge test_align.vcf.gz -o out.vcf.gz -r /Work/Groups/bisonex/data/genome/GRCh38.p13/genomeRef.fna --homref-split 1 --homref-vcf-out 1 --trimalleles 1 --splitalleles 1
#+end_src
1er changement
TAA T
TAA TA
Après réflexion, c'est la référence qui n'est pas normalisée ! (left-trimmed)
******* DONE NC_000021.9     14108836  T -> C
CLOSED: [2023-03-04 Sat 11:01]
Dans le bam mais filtré par DP
****** DONE variant calling seul : meilleur score pour l'instant (77% recall, 95% precision)
CLOSED: [2023-03-04 Sat 11:01]
tests/chr21-alexis
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision
INDEL    ALL           76        
43        33           82        18         20      3      5       0.565789          0.709677
  SNP    ALL          582       448       134          530        25         57      6      1       0.769759          0.947146
 METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
       0.243902         0.629617                     NaN                     NaN                   1.181818                   1.612903
       0.107547         0.849289                3.098592                2.925926                   1.530435                   1.774869
******* NC_000021.9:14144627 : FN, dans le bam mais pas dans le vcf (2 reads/9)
 On récupére tout le vcf: pas dedans
 Dans le bam : 9/2
 Idem pour le bam dans notre pipeline
 - base qualité 33 sur les 2 reads
https://gatk.broadinstitute.org/hc/en-us/articles/360043491652-When-HaplotypeCaller-and-Mutect2-do-not-call-an-expected-variant
https://gatk.broadinstitute.org/hc/en-us/articles/360035891111-Expected-variant-at-a-specific-site-was-not-called
On debug
#+begin_src
cd /Work/Users/apraga/bisonex/out/NA12878_NIST7035/preprocessing/applybqsr
samtools view -b NA12878_NIST7035.bam NC_000021.9 -o NA12878_NIST7035_chr21.bam
samtools index NA12878_NIST7035_chr21.bam
#+end_src
******** --debug
#+begin_src
gatk --java-options "-Xmx3g" HaplotypeCaller --input NA12878_NIST7035_chr21.bam \
    --output debug_chr1.vcf.gz \
    --reference /Work/Groups/bisonex/data/genome/GRCh38.p13/genomeRef.fna \
    --dbsnp /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz \
    --tmp-dir . \
    --max-mnp-distance 2 --debug &> lol.txt
#+end_src
Les allèles sont bien retrouvées
#+begin_quote
14:41:05.530 INFO  EventMap - === Best Haplotypes ===
14:41:05.530 INFO  EventMap - AATTTAATTTTCTTACCTTTCTGGGTATGTAAGTGATTTTA
14:41:05.530 INFO  EventMap - > Cigar = 41M
14:41:05.530 INFO  EventMap - >> Events = EventMap{}
14:41:05.530 INFO  EventMap - AATTTAATTTTCTTACCTTTTTGGGTATGTAAGTGATTTTA
14:41:05.530 INFO  EventMap - > Cigar = 41M
14:41:05.530 INFO  EventMap - >> Events = EventMap{NC_000021.9:14144627-14144627 [C*, T],}
14:41:05.530 INFO  HaplotypeCallerGenotypingEngine - Genotyping event at 14144627 with alleles = [C*, T]
#+end_quote
NB: même en filtranrt sur le chromosome 21 avec julia, haplotypecaller parcourt tous les chromosomes
******** DONE --linked-de-bruijn-graph : idem
CLOSED: [2023-02-26 Sun 17:26]
******** DONE examine sortie --bamout : non présent
CLOSED: [2023-02-26 Sun 19:53]
#+begin_src
cd test/chr21-alexis
gatk --java-options "-Xmx3g" HaplotypeCaller \
    --input /Work/Users/apraga/bisonex/script/files/bam/NA12878_chr21.bam \
    --output debug_chr1.vcf.gz \
    --reference /Work/Groups/bisonex/data/genome/GRCh38.p13/genomeRef.fna \
    --dbsnp /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz \
    --tmp-dir . \
    --max-mnp-distance 2 -bamout debug.bam
#+end_src
Pas de reads
 #+begin_quote
If you see nothing overlapping your region, then it might not have been flagged as active, or could have failed to assemble.
 #+end_quote
******** 14582339: FN mais pas de reads...
******** 14583327 idem
******** 17512551 idem
******** 17567111: difference d'haplotype
******** 17567621 pas de reads
***** TODO Comparer avec sortie du variant calling vcf donné par GIAB
****** STRT vcfeval
#+begin_src sh
wget https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/nexterarapidcapture_expandedexome_targetedregions.bed.gz
#+end_src
On renomme les chromosomes
#+begin_src
sed -i 's/^chrM/MT/' project.NIST.hc.snps.indels.vcf
sed -i 's/^chr//' project.NIST.hc.snps.indels.vcf
bcftools annotate --rename-chrs ../../genome/GRCh38.p13/chromosome_mapping.txt  project.NIST.hc.snps.indels.vcf -o project.NIST.hc.snps.indels.vcf.gz
#+end_src
On compresse et index
#+begin_src
bgzip HG001_GRCh38_1_22_v4.2.1_benchmark.vcf
tabix HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
tabix project.NIST.hc.snps.indels.vcf.gz
#+end_src
Conversion chromosome pour le bed définissant les exons :
#+begin_src
sed 's:^:s/chr:;s:chrMT:chrM:;s:\s:\t\/:;s:$:\t/:' ../../genome/GRCh38.p13/chromosome_mapping.txt  > pattern.sed
sed -f pattern.sed nexterarapidcapture_expandedexome_targetedregions.bed  -i.bak
#+end_src
Attention aux samples qui sont différents :
#+begin_src
 rtg vcfeval -b HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz -c project.NIST.hc.snps.indels.vcf.gz --bed-regions nexterarapidcapture_expandedexome_targetedregions.bed -e HG001_GRCh38_1_22_v4.2.1_benchmark.bed.gz -o lol -t ../../genome/GRCh38.p13/genomeRef.sdf/ --sample HG001,NIST7035
#+end_src
****** TODO happy
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.PrecisioN
INDEL   PASS         4871      3678      1193         7036      1299       2011    208    217       0.755081          0.741493
  SNP   PASS        46032     41138      4894        47694      1622       4930    362     31       0.893683          0.962071
 METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
       0.285816         0.748225                     NaN                     NaN                   1.617499                   2.524051
       0.103367         0.926617                2.529552                2.412446                   1.620686                   1.688868
***** TODO Statistiques avec vcfeval
**** TODO Résumer résultats pour Paul
*** TODO Platinum genome
https://emea.illumina.com/platinumgenomes.html
*** TODO Séquencer NA12878
Discussion avec Paul : sous-traitant ne nous donnera pas les données, il faut commander l'ADN
** Divers
*** DONE Vérifier nombre de reads fastq - bam
CLOSED: [2022-10-09 Sun 22:31]

[2.32996]

6_S135/63003856_S135.bam /Work/Groups/bisonex/ref/tmp_63003856_S135/63003856_S135.bam O=compare-bam.tsv
picard CompareSAMs -LENIENT_LOW_MQ_ALIGNMENT true -LENIENT_DUP true tmp_63003856_S135/63003856_S135.bam /Work/Groups/bisonex/ref/tmp_63003856_S135/63003856_S135.bam -O compare-bam.tsv
VN Program Record attribute differs.
File 1: 1.13
File 2: 1.10
SAM files differ.
[Tue Jan 24 23:12:50 CET 2023] picard.sam.CompareSAMs done. Elapsed time: 7.32 minutes.
***** DONE Relancer avec la même version de samtools
CLOSED: [2023-01-25 Wed 21:58]
Pas d'impact
***** TODO Comparer tsv de sortie
***** TODO Regarder où sont les variants différents
** TODO GIAB Validation : GIAB
https://github.com/ga4gh/benchmarking-tools
Prérequis :
- [[*hap.py][hap.py]]
- [[*NA12878][NA12878]]
*** TODO GIAB :giab:
**** Notes
https://github.com/genome-in-a-bottle/giab_FAQ
**** TODO télécharger données avec Nextflow
***** DONE Renommer les chromosomes
CLOSED: [2023-02-17 Fri 19:30]
****** DONE Genome de reference NCBI
CLOSED: [2023-02-25 Sat 19:46]
****** DONE Bed avec les exons
CLOSED: [2023-03-29 Wed 23:04]
****** DONE hg19
CLOSED: [2023-02-26 Sun 22:37]
****** DONE hg38
CLOSED: [2023-03-29 Wed 23:04]
- [X] Télécharger hg19 : ok
- [X] convertir bed en interval list
picard BedToIntervalList -I exons_illumina.bed  -O exons_illumina.list -SD  ../../genome/GRCh19/genomeRef.dict
- [X] puis en hg38
picard LiftOverIntervalList -I exons_illumina.list  -O exons_illumina_hg38.list --CHAIN hg19ToHg38.over.chain -SD  ../../genome/GRCh38.p13/genomeRef.dict
- [X] puis en bed
***** TODO VCF de référence
****** DONE NA12878 (HG001)
CLOSED: [2023-02-25 Sat 19:46]
******* DONE Fastq HiSeq
CLOSED: [2023-02-25 Sat 19:46]
On prend le Hiseq, qui est probablement ce qu'utilise Centogène :
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/
On utilisé les données "trimmés" (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1069-7), i.e qui ont enlevé les fragments plus petits que la taille d'un read.
Informations:
- https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/Garvan_NA12878_HG001_HiSeq_Exome.README
- Sequencer: HiSeq2500
- kit: Nextera Rapid Capture Exome and Expanded Exome
Il y a 2 samples (NIST7035 et NIST7086), chacun sur 2 lanes -> à concaténer
NB : liste techno illumina https://www.illumina.com/systems/sequencing-platforms.html
Hiseq postérieur nextseq 550
******* DONE Capture : Exons (bed)
CLOSED: [2023-02-25 Sat 19:46]
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/nexterarapidcapture_expandedexome_targetedregions.bed.gz
******* DONE Bed, vcf
CLOSED: [2023-02-24 Fri 23:45]
****** TODO Ashkenazy trio HG002
SCHEDULED: <2023-04-01 Sat>
****** TODO Ashkenazy trio HG003
SCHEDULED: <2023-04-01 Sat>
****** TODO Ashkenazy trio HG004
SCHEDULED: <2023-04-01 Sat>
****** TODO Chinese trio HG005
SCHEDULED: <2023-04-01 Sat>
****** TODO Chinese trio HG006
SCHEDULED: <2023-04-01 Sat>
****** TODO Chinese trio HG007
SCHEDULED: <2023-04-01 Sat>
*
***** TODO Fastq
****** DONE NA12878 (HG001)
CLOSED: [2023-02-25 Sat 19:46]
******* DONE Fastq HiSeq
CLOSED: [2023-02-25 Sat 19:46]
On prend le Hiseq, qui est probablement ce qu'utilise Centogène :
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/
On utilisé les données "trimmés" (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1069-7), i.e qui ont enlevé les fragments plus petits que la taille d'un read.
Informations:
- https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/Garvan_NA12878_HG001_HiSeq_Exome.README
- Sequencer: HiSeq2500
- kit: Nextera Rapid Capture Exome and Expanded Exome
Il y a 2 samples (NIST7035 et NIST7086), chacun sur 2 lanes -> à concaténer
NB : liste techno illumina https://www.illumina.com/systems/sequencing-platforms.html
Hiseq postérieur nextseq 550
******* DONE Capture : Exons (bed)
CLOSED: [2023-02-25 Sat 19:46]
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/nexterarapidcapture_expandedexome_targetedregions.bed.gz
******* DONE Bed, vcf
CLOSED: [2023-02-24 Fri 23:45]
****** TODO Ashkenazy trio HG002
SCHEDULED: <2023-04-01 Sat>
****** TODO Ashkenazy trio HG003
SCHEDULED: <2023-04-01 Sat>
****** TODO Ashkenazy trio HG004
SCHEDULED: <2023-04-01 Sat>
****** TODO Chinese trio HG005
SCHEDULED: <2023-04-01 Sat>
****** TODO Chinese trio HG006
SCHEDULED: <2023-04-01 Sat>
****** TODO Chinese trio HG007
SCHEDULED: <2023-04-01 Sat>
*
**** TODO NA12878 / HG001 :na12878:
***** DONE Discussion alexis : Mail
CLOSED: [2023-03-29 Wed 22:40]
Avec le patient NA12878 et comparaison avec hap.py du VCF de Genome In A Bottle ("gold" standard), on avait pour rappel
- sensibilité (=recall) 71% pour indel, 85% SNP
- précision  (= VPP) 69 et 97% respectivement
| Type  | TRUTH |    TP |   FN | QUERY |   FP |  UNK | FP.gt | FP.al |   Recall | Precision |
| INDEL |  4871 |  3461 | 1410 |  7048 | 1554 | 1987 |   193 |   346 | 0.710532 |  0.692946 |
| SNP   | 46032 | 39369 | 6663 | 44600 | 1186 | 4041 |   304 |    30 | 0.855253 |  0.970759 |
Les statistiques sur les génomes sont bien meilleurs (cf precisionFDA challenge).
Pour les exome, un article [1] a fait a des meilleures stats sur ce patient avec BWA et GATK mais ils ont moins de variant (on a presque un facteur 2 !).
Je soupçonne qu'on ne travaille pas sur les mêmes zones de capture (pas réussi à récupérer leur .bed)
| Exome | Type  |    TP |   FP |  FN | Sensitivity | Precision | F-Score |   FDR |
|     1 | SNV   | 23689 | 1397 | 613 |       0.975 |     0.944 |   0.959 | 0.057 |
|     2 | SNV   | 23946 |  865 | 356 |       0.985 |     0.965 |   0.975 | 0.036 |
|     1 | indel |  1254 |   72 |  75 |       0.944 |     0.946 |   0.945 | 0.054 |
|     2 | indel |  1309 |   10 |  20 |       0.985 |     0.992 |   0.989 | 0.008 |
Pour essayer d'améliorer les statistiques :
- La version du génome GRC38 vs GRCh38.p13 ne change quasiment rien
- Désactiver dbSNP ne change strictement rien pour le variant calling
J'ai exploré les faux négatifs :
- la grande majorité n'est juste pas vue (ce n'est pas un problème d'haploïde/génotype)
- la répartition par chromosome est relativement homogène, sauf sur le 6 ()
- la majorité est en 5' et 3'UTR (selon Best refseq)
Conclusion: je pense m'arrêter là pour la validation du variant calling par manque de temps. Il faudrait creuser pour savoir pourquoi certains variants ne sont pas vus par GATK mais ce n'est pas la majorité. En tout cas, je peux justifier d'une première analyse pour la thèse.
Ça te va ?
[1]
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2928-9
Résultats ici https://static-content.springer.com/esm/art%3A10.1186%2Fs12859-019-2928-9/MediaObjects/12859_2019_2928_MOESM8_ESM.pdf
***** DONE Comparaison
CLOSED: [2023-03-04 Sat 11:14]
HGREF=/Work/Groups/bisonex/data-alexis-reference/genome/GRCh38_latest_genomic.fna ./result/bin/hap.py /Work/Groups/bisonex/NA12878/HG001_GRCh38_1_22_v4.2.1
_benchmark_renamed.vcf.gz script/files/vcf/NA12878_NIST7035_vep_annot.vcf -f /Work/Groups/bison
ex/NA12878/HG001_GRCh38_1_22_v4.2.1_benchmark.bed -o test
na1878.slurm
#+begin_src slurm
#!/bin/bash
#SBATCH -c 4
#SBATCH -p smp
#SBATCH --time=01:00:00
#SBATCH --mem=32G
module load nix/2.11.0
export HGREF=/Work/Groups/bisonex/data-alexis-reference/genome/GRCh38_latest_genomic.fna
dir=/Work/Groups/bisonex/data/NA12878/GRCh38
hap.py ${dir}/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz script/files/vcf/NA12878_NIST7035.vcf -f ${dir}/HG001_GRCh38_1_22_v4.2.1_benchmark.bed -o test
#+end_src
****** KILL beaucoup trop de faux négatifs
CLOSED: [2023-02-17 Fri 19:37]
******* DONE Test 1 : vep annot : beaucoup trop de faux négatif
CLOSED: [2023-02-06 lun. 13:40]
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL       276768       274    276494         1500       257        968     26     15       0.000990          0.516917        0.645333         0.001976                     NaN                     NaN                   1.483361                   6.129187
INDEL   PASS       276768       274    276494         1500       257        968     26     15       0.000990          0.516917        0.645333         0.001976                     NaN                     NaN                   1.483361                   6.129187
  SNP    ALL      1937706      1193   1936513         3338       106       2037     11      2       0.000616          0.918524        0.610246         0.001231                  2.0785                1.861183                   1.539064                   2.703663
  SNP   PASS      1937706      1193   1936513         3338       106       2037     11      2       0.000616          0.918524        0.610246         0.001231                  2.0785                1.861183                   1.539064                   2.703663
******* KILL Test 3 : indexer vcf de reference
CLOSED: [2023-02-06 lun. 17:19]
Même résultat avec vcfeval, qui a besoin de la version indexée
******* DONE Test 3 sans filtre vep : idem
CLOSED: [2023-02-06 lun. 17:19]
Benchmarking Summary:
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL       276768     10535    266233        52169     10969      30616   3552   2122       0.038064          0.491069        0.586862         0.070652                     NaN                     NaN                   1.483361                   0.509510
INDEL   PASS       276768     10535    266233        52169     10969      30616   3552   2122       0.038064          0.491069        0.586862         0.070652                     NaN                     NaN                   1.483361                   0.509510
  SNP    ALL      1937706    105753   1831953       357652     74634     177259  35111    797       0.054576          0.586270        0.495619         0.099857                  2.0785                 1.42954                   1.539064                   0.324923
  SNP   PASS      1937706    105753   1831953       357652     74634     177259  35111    797       0.054576          0.586270        0.495619         0.099857                  2.0785                 1.42954                   1.539064                   0.324923
******* DONE Test 4 avec vcfeval sur vep_annot : idem
CLOSED: [2023-02-06 lun. 17:18]
#+begin_src
#!/bin/bash
#SBATCH -c 4
#SBATCH -p smp
#SBATCH --time=01:00:00
#SBATCH --mem=32G
module load nix/2.11.0
export HGREF=/Work/Groups/bisonex/data-alexis-reference/genome/GRCh38_latest_genomic.fna dir=/Work/Groups/bisonex/data/NA12878/GRCh38
rtg vcfeval -b  /Work/Groups/bisonex/data/NA12878/GRCh38/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz  -c files/vcf/NA12878_NIST7035_vep_annot.vcf.gz  -o test-rtg -t /Work/Groups/bisonex/data/genome/GRCh38.p13/genomeRef.sdf
#+end_src
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    1.000               2984           2682       1840    3890296     0.5931       0.0008     0.0015
     None               2984           2682       1841    3890296     0.5930       0.0008     0.0015
Exemple du log
2023-02-06 13:50:14 Reference NC_000001.11 baseline contains 307854 variants.
2023-02-06 13:50:14 Reference NC_000001.11 calls contains 426 variants.
2023-02-06 13:50:15 Reference NC_000002.12 baseline contains 325877 variants.
2023-02-06 13:50:15 Reference NC_000002.12 calls contains 320 variants.
******* DONE Regarder quelques variants à la main
CLOSED: [2023-02-07 Tue 22:01]
Ex:
Il manque NC_000001.11    783006  .       A       G       50      PASS
Il y a A -> G et C -> A sur cette position
***** DONE Restreindre genome de référence
CLOSED: [2023-03-04 Sat 11:15]
****** Discussion Alexis
le pipeline prend en compte 5', 3', variant canoniques d'épissage + prédit spip
Le plus simple pour le moment est de restreindre seulement aux exons
GENCODE (version europénne) vs RefSeq: [[https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-16-S8-S2][article 2015]] en faveur de GENCODE mais Alexis conseille Refseq
****** DONE -f, -R ou -T ?
CLOSED: [2023-02-25 Sat 19:47]
Selon la doc : -f avec le bed fourni et -T pour filtrer sor les exons
******* rtg tools
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    3.000               1015            910        206      32531     0.8154       0.0303     0.0583
     None               1015            910        206      32531     0.8154       0.0303     0.0583
***** KILL Exons seuls
CLOSED: [2023-04-02 Sun 17:11]
****** DONE BestRefSeq 
CLOSED: [2023-02-19 Sun 12:05]
Dans refseq,[[https://www.ncbi.nlm.nih.gov/genome/annotation_euk/process/][2 types de modèles pour le gène]]
- basé sur refseq (NM_, NP_), curée
- basé sur gnomon (XM_ , XP_), prédite
Les modèles basés sur refseq ont la préférénce (cf lien)
On se restreint donc à bestrefseq
#+begin_src sh
wget https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.gff.gz
gunzip https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.gff.gz
#+end_src
On se restrein aux exons codant (NM_)
#+begin_src
awk '/BestRefSeq\texon/ && /transcript_id=NM/ {print $1"\t"$4"\t"$5;}' GRCh38_latest_genomic.gff > exons.csv
#+end_src
Puis intersection
******* DONE Tests après correction bug dans noms de chromosome : precision ~ ok, recall très mauvais -> trop de FN ?
CLOSED: [2023-02-19 Sun 12:05]
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision
INDEL    ALL         7230       321      6909         1500       290        888     27     18       0.044398          0.526144
INDEL   PASS         7230       321      6909         1500       290        888     27     18       0.044398          0.526144
  SNP    ALL        59052      1653     57399         3338       101       1583     12      2       0.027992          0.942450
  SNP   PASS        59052      1653     57399         3338       101       1583     12      2       0.027992          0.942450
 METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
       0.592000         0.081887                     NaN                     NaN                    1.54733                   6.129187
       0.592000         0.081887                     NaN                     NaN                    1.54733                   6.129187
       0.474236         0.054370                2.433271                1.861183                    1.57523                   2.703663
       0.474236         0.054370                2.433271                1.861183                    1.57523                   2.703663
******* DONE Vérifier exons: on a l'union des exons de tous les transcripts...
CLOSED: [2023-02-19 Sun 12:05]
Il faudrait un .bed d'illumina
On teste Twist for Illumina Exome 2.0 Plus BED File (hg19) sur https://support.illumina.com/downloads/nextera-flex-for-enrichment-BED-files.html
Conversion en hg38 avec ucsc
Renommage des chromosomes
#+begin_src
sed 's:^:s/chr:;s:chrMT:chrM:;s:\s:\\t/:;s:$:\\t/:' ../../genome/GRCh38.p13/chromosome_mapping.txt > pattern.sed
sed -i.bak -f pattern.sed illumina_exons.bed
bedtools intersect -a HG001_GRCh38_1_22_v4.2.1_benchmark.bed -b illumina_exons.bed > HG001_GRCh38_1_22_v4.2.1_benchmark_illumina_exons.bed
#+end_src
Intersection
****** KILL Bed illumina
CLOSED: [2023-02-24 Fri 23:44]
******* KILL Sans filtre vep: Inversion truth et query... on recommence
CLOSED: [2023-02-19 Sun 13:15]
******* KILL Sans filtre vep: mieux mais pas exceptionnel
CLOSED: [2023-02-24 Fri 23:44]
cd work/00/2c72e62400956c96fb101ac7af405e/
$ cat .command.out
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision
INDEL    ALL          922       490       432          942       439          0     32     56       0.531453          0.533970
INDEL   PASS          922       490       432          942       439          0     32     56       0.531453          0.533970
  SNP    ALL        24618     18689      5929        21257      2571          0    239     17       0.759160          0.879052
  SNP   PASS        24618     18689      5929        21257      2571          0    239     17       0.759160          0.879052
 METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
            0.0         0.532709                     NaN                     NaN                   1.628743                   2.799180
            0.0         0.532709                     NaN                     NaN                   1.628743                   2.799180
            0.0         0.814719                2.899604                2.881526                   1.598479                   1.564378
            0.0         0.814719                2.899604                2.881526                   1.598479                   1.564378
******* KILL Comprendre pour les FN explosent avec vep annot
CLOSED: [2023-02-24 Fri 23:44]
NC_000001.11	924024	.	C	G
non couverte dans bam -> 1er exons de SAMD11 mais non couverte
Il est dans NM_001385641.1 mais pas dans NM_152486.4
****** KILL Obtenir les zones couvertes depuis le bam directement
CLOSED: [2023-02-24 Fri 23:44]
#+begin_src  sh
samtools sort NA12878_chr1.bam -o NA12878_chr1_sorted.bam
bedtools genomecov -ibam NA12878_chr1_sorted.bam  -bg > chr1.bedgraph
head chr1.bedgraph
#+end_src
#+RESULTS:
: NC_000001.11	10001	10021	1
: NC_000001.11	10021	10059	2
: NC_000001.11	10059	10063	1
: NC_000001.11	10063	10087	2
: NC_000001.11	10087	10110	1
: NC_000001.11	10110	10113	3
: NC_000001.11	10113	10121	2
On prend les régions avec 20 reads
awk '$4 > 20 {print $1"\t"$2"\t"$3}' chr1.bedgraph  > tomerge.bed
On fusionne les régions
bedtools merge -i tomerge.bed > exons_from_bam.bed
Problème : on manque parfois le bord des exons...
****** KILL Vérifier un bam de référence de NA12878
CLOSED: [2023-02-24 Fri 23:44]
Notamment pour la définitions des exons, on a des reads qui ne semblent pas alignés sur les bons exons selon IGV
On donne directemet à IGV le bam d'illuma WES ( https://github.com/genome-in-a-bottle/giab_data_indexes )
IGV n'aime pas le ftp apparement...
On récupére 2 échantillons pour ce patient sur HiSeq Exome
#+begin_src sh :dir /ssh:meso:/Work/Groups/bisonex/data/giab/GRCh38
wget https://raw.githubusercontent.com/genome-in-a-bottle/giab_data_indexes/master/NA12878/alignment.index.NA12878_HiSeq_Exome_Garvan_GRCh37_09252015 -O todl.txt
awk '{print $1}' todl.txt | wget -i -
#+end_src
On récupére le premier échantillons en prenant just le chromosome 1
#+begin_src
samtools merge NA12878-NIST7035-HiSeq_Exome_Garan_GRCh37.bam project
.NIST_NIST7035_H7AP8ADXX_TAAGGCGA_1_NA12878.bwa.markDuplicates.bam project.NIST_NIST7035_H7AP8ADXX_TA
AGGCGA_2_NA12878.bwa.markDuplicates.bam
samtools index NA12878-NIST7035-HiSeq_Exome_Garan_GRCh37.bam
#+end_src
******* Étude
NC_000001.11:924024 :
- bed d’illumina : pas correct
- bed refseq : meux
- bam :  exon non ciblé, probablement parce que Mane select n’a pas été utilisé mais refseq (https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:28706)
NC_000001.11:924310: idem
NC_000001.11:924321: idem
NC_000001.11:924533: idem
NC_000001.11:966227: le bed ne couvre pas le bon exon !
https://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr1%3A965863%2D966591&hgsid=295298291_vhDbIv5YIckCKLpGB7UUaZi2cAkr
NC_000001.11:966272
****** KILL Filtrer variants introniques de référence avec vep
CLOSED: [2023-02-24 Fri 23:44]
******* KILL variant calling seulf + seulement -f: nombreux FP
CLOSED: [2023-02-24 Fri 23:44]
/Work/Users/apraga/bisonex/work/68/f1cf72a5a4078fdf743fb3844b369a
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision
INDEL    ALL          519       284       235        52169     37789      14094     12     64       0.547206          0.007511
INDEL   PASS          519       284       235        52169     37789      14094     12     64       0.547206          0.007511
  SNP    ALL        22131     17434      4697       357652    305313      34904    189     32       0.787764          0.054020
  SNP   PASS        22131     17434      4697       357652    305313      34904    189     32       0.787764          0.054020
 METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
       0.270160         0.014820                     NaN                     NaN                   1.775956                   0.509510
       0.270160         0.014820                     NaN                     NaN                   1.775956                   0.509510
       0.097592         0.101108                2.971834                1.429533                   1.579776                   0.324923
       0.097592         0.101108                2.971834                1.429533                   1.579776                   0.324923
****** DONE Bed donnés par GIAB (en hg19) : résultats corrects :resultats:
CLOSED: [2023-03-09 Thu 22:42] SCHEDULED: <2023-03-02 Thu>
Téléchargé à la main et converti via UCSCS. À automatiser, voir : *hg38
Sans oublier de renommer les chromosomes !
| Type  | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision |
|-------+-------------+----------+----------+-------------+----------+-----------+-------+-------+---------------+------------------|
| INDEL |        4871 |     3461 |     1410 |        7048 |     1554 |      1987 |   193 |   346 |      0.710532 |         0.692946 |
| SNP   |       46032 |    39369 |     6663 |       44600 |     1186 |      4041 |   304 |    30 |      0.855253 |         0.970759 |
 Type   Ssensibilité    VPP
 INDEL       0.710532   0.692946 
 SNP         0.855253   0.970759
 | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
 |----------------+-----------------+------------------------+------------------------+---------------------------+---------------------------|
 |       0.281924 |        0.701629 |                        |                        |        1.6174985978687606 |        3.0674091441969518 |
 |       0.090605 |        0.909353 |      2.529551552318896 |     2.4019675131548843 |        1.6206857273037931 |        1.6274012964054214 |
****** DONE Re-tester avec exons Refseq et option -T: résultats un peu moins bons
CLOSED: [2023-03-09 Thu 22:42] SCHEDULED: <2023-03-08 Wed>
On utilise directement les coordonées données par refseq
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision
INDEL    ALL         7226      3417      3809         6978      1599       1918    228    353       0.472876          0.683992
  SNP    ALL        59052     37825     21227        43480      1913       3740    675     35       0.640537          0.951862
 METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
       0.274864         0.559171                     NaN                     NaN                   1.547733                   2.756151
       0.086017         0.765767                2.433271                2.350281                   1.575230                   1.492346
****** DONE Vérifier que la zone de capture a vraiment besoin d'un liftover...
CLOSED: [2023-04-02 Sun 14:13]
D'après la documentation, oui
https://support.illumina.com/sequencing/sequencing_kits/nextera-rapid-capture-exome-kit/downloads.html
***** DONE D'où vient la différence ?
CLOSED: [2023-04-02 Sun 17:11] SCHEDULED: <2023-03-09 Thu>
****** DONE Statistiques
CLOSED: [2023-03-15 mer. 13:41] SCHEDULED: <2023-03-04 Sat>
On confirnme le nombre de SNP 6665
- 304 ont un problème d’haploide/allèle différente
- 28 avec une différence de génotype
- La majorité ne sont pas vu 6361
[1] "ALl FN 6665"
[1] "Genotype mismatch 28"
[1] "haplotype mismatch 304"
[1] "Missing  6333"
****** DONE Majorité des FN pour SPN sont peu couverts
CLOSED: [2023-03-16 Thu 16:54]
Chromosome 1 : majorité des FN ne sont pas vus...
cf figure
#+attr_html: :width 500px
[[file:reads.png]]
 1623412 : 5’UTR
 1953616 : intronique mais dans bed...
 2589991
 3488573
 3488732
 3780326
 3884683
 3914055
 4711993
 4712657
****** DONE Alignement
CLOSED: [2023-03-15 mer. 13:41]
******* DONE Impact de la version mineure du genome: non
CLOSED: [2023-03-09 Thu 23:13]
******** DONE GHC38 + version alexis + exons refseq
CLOSED: [2023-03-22 Wed 00:02]
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision
INDEL    ALL         7226      3417      3809         6979      1599       1919    228    353       0.472876          0.683992
  SNP    ALL        59052     37827     21225        43483      1913       3741    676     35       0.640571          0.951865
 METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
       0.274968         0.559171                     NaN                     NaN                   1.547733                   2.756698
       0.086034         0.765792                2.433271                2.350254                   1.575230                   1.492375
******** DONE GHC38.p13 + notre version alexis + exons refseq
CLOSED: [2023-03-22 Wed 00:02]
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision
INDEL    ALL         7226      3417      3809         6978      1599       1918    228    353       0.472876          0.683992
  SNP    ALL        59052     37825     21227        43480      1913       3740    675     35       0.640537          0.951862
 METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
       0.274864         0.559171                     NaN                     NaN                   1.547733                   2.756151
       0.086017         0.765767                2.433271                2.350281                   1.575230                   1.492346
****** DONE Vérifier si coordonnées génomiques (vcf)
CLOSED: [2023-03-09 Thu 23:05]
****** DONE Variant calling
CLOSED: [2023-03-22 Wed 23:11]
******* DONE Désactiver dbSNP : idem
CLOSED: [2023-03-22 Wed 15:00]
Après haplotycaller
$ sha256sum work/94/d4ae5999ca59a25469b1ed221b7b1b/NA12878_NIST.vcf.gz
193fb37e2a581bb2855ae6fd11638f3b97958515def0e4295005e6a60c9dc650  work/94/d4ae5999ca59a25469b1ed221b7b1b/NA12878_NIST.vcf.gz
Fichier utilisé par hap.py
$ sha256sum work/5b/199360963641524a076d06b621e7e0/NA12878_NIST.vcf.gz
193fb37e2a581bb2855ae6fd11638f3b97958515def0e4295005e6a60c9dc650  work/5b/199360963641524a076d06b621e7e0/NA12878_NIST.vcf.gz
On a bien le même que [[*Bed donnés par GIAB (en hg19) : résultats corrects][Bed donnés par GIAB (en hg19) : résultats corrects]]
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision
INDEL    ALL         4871      3461      1410         7048      1554       1987    193    346       0.710532          0.692946
  SNP    ALL        46032     39367      6665        44599      1186       4042    304     30       0.855209          0.970757
 METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
       0.281924         0.701629                     NaN                     NaN                   1.617499                   3.067409
       0.090630         0.909327                2.529552                2.402151                   1.620686                   1.627342
****** DONE Définition des exons
CLOSED: [2023-03-22 Wed 23:16]
******* KILL Vérifier qu'il ne manque pas des exons (avec bam ?)
CLOSED: [2023-03-16 Thu 16:54] SCHEDULED: <2023-03-05 Sun>
******* KILL Vérifier la couverture des variants introniques
CLOSED: [2023-03-22 Wed 15:01]
NC_000001.11  23344310  : intronique mais la région dans le bed couvre en partie l'exon
Profondeur sur le variant : 14 reads sur l'alt
******* Notes: UTR = exon - CDS
intron dans le gene = mRNA - exon
On le vérifie avec
  25 │ NC_000001.11  BestRefSeq  exon      65520   65573
  27 │ NC_000001.11  BestRefSeq  CDS       65565   65573
  https://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr1%3A65520%2D65573&hgsid=295007972_J1rAQiur4dC2KLadnaOCw27WUihG
  mRNA :
  https://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr1%3A65419%2D71585&hgsid=295007972_J1rAQiur4dC2KLadnaOCw27WUihG
******* DONE Statistiques : reads par type (introns, UTR, exons)
CLOSED: [2023-03-22 Wed 23:16]
- la répartition par chromosome est relativement homogène, sauf sur le 6 ()
- la majorité est en 5' et 3'UTR (selon Best refseq)
#+begin_src julia
using CSV
using DataFrames, DataFramesMeta
using XAM, GenomicFeatures
function exonFromBestRefSeq(f)
    println("Reading exons from $(f)")
    cols = ["seqname", "source", "feature", "start", "end", "score", "strand", "frame", "attribute"]
    d = DataFrame(CSV.File(f ,header=cols, delim="\t",comment="#"))
    @chain d begin
        @rsubset :feature in ["exon", "CDS", "mRNA"] :source .== "BestRefSeq"
        @select :seqname :source :feature :start :end
        @distinct
    end
end
# Get exons from refseq (Bestrefeq, gnomon or just refseq)
function getExon(f)
    if isfile(f)
        println("Reading exons from preprocessed $(f)")
        d = DataFrame(CSV.File(f))
    else
        d = exonFromBestRefSeq("GRCh38_latest_genomic.gff.gz")
        CSV.write(f, d)
    end
    return d
end
# Return false negative and TRUTH annotation (genotype, type...)
# Read vcf manually: it's a tab-separated file and we extract the TRUTH column manually
#BD = Decision for call (TP/FP/FN/N)
#BK = tSub-type for decision (match/mismatch type)
#BI = Additional comparison information
#QQ = Variant quality for ROC creation.
#BV = High-level variant type (SNP|INDEL).
#BL = High-level location type (het|homref|hetalt|homalt|nocall).
function getFN(f)
    cols = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT", "TRUTH", "QUERY"]
    vcf = DataFrame(CSV.File(f,comment="#", delim="\t",
                             header = cols))
    # Extract genothype and other subcolumn fronm FORMAT
    format = split(vcf[1,:FORMAT], ":")
    @chain vcf begin
        @rtransform $format=split(:TRUTH, ':')
        @subset :BD .== "FN"
        @select :type=:BVT :CHROM :POS :REF :ALT :GT :BK :BLT
    end
end
# Search a variant in refseq exons
# Return the smallest interval (CDS)
function searchVariant(chrom, pos, exons)
    res = @subset exons :seqname .== chrom :start .<= pos :end .>= pos
    if isempty(res)
        return "intronicOutsideGene"
    elseif "CDS" in res.feature
        return "CDS"
    elseif "exon" in res.feature
        # UTR = exon - CDS
        return "UTR"
    elseif "mRNA" in res.feature
        # intron = mRNA - exon
        return "intronicGene"
    else
        return "intronicOutsideGene"
    end
end
# Once the bam file has been opened, count the reads for a single variant
function countReadsVariant(reader, chrom, pos)
    n = 0
    for record in eachoverlap(reader, chrom,pos:pos)
        n += 1
    end
    return n
end
# For all variants, get the number of reads from the bam file
function countReads(d)
    bamDir = "../script/files/bam"
    bam = bamDir * "/NA12878_NIST7035_recalibrated_hg38.bam"
    bai = bam * ".bai"
    reader = open(BAM.Reader, bam, index=bai)
    d2 = @transform(d, @byrow :reads = countReadsVariant(reader, :CHROM, :POS))
    close(reader)
    return d2
end
function exonicStatus(d, exons)
    # Check variants are in exon: for each variant, search in all exons
    @transform(d, @byrow :match = searchVariant(:CHROM, :POS, exons))
end
exons = getExon("GRCh38_latest_genomic_exon.csv")
d = getFN("../out/test-bed/test-allchr.vcf") |>
    x -> exonicStatus(x, exons) |>
    countReads
# # bed = DataFrame(CSV.File("../out/test-bed/exons_illumina_hg38.bed",comment="#", delim="\t", header=false))
CSV.write("falseNegatives.csv", d)
#+end_src
Discordance nombre de reads avec igv mais cohérent avec samtools : vu avec Alexis : probablement du à IVG, on reste sur samtools
$ samtools view   NA12878_NIST7035_recalibrated_hg38.bam NC_000001.11:1734812-1734812  -F "0x200" | wc -l
164
****** KILL Comprendre pourquoi le chromosome 6 a plus de FN
CLOSED: [2023-03-29 Wed 22:39]
Vérifier le graphe fait en julia
$ grep "FN.*SNP" test-allchr.vcf | grep NC_000006 -c
1264
@subset d :type .== "SNP" :CHROM .== "NC_000006.12"
1264×10 DataFrame
Région HLA: difficile car nombreux allèles alternatifs.
Mais dans nos données, sont principalement non vus.
La répartition des reads est similaire aux autres régions...
#+begin_src julia
vcf = vcfHappy("test-allchr.vcf")
d = DataFrame(CSV.File("data/falseNegatives.csv"))
d2 = @subset d :POS .>= 29600000 :POS .<= 32000000 :type .== "SNP" :CHROM .=="NC_000006.12"
#+end_src
#+RESULTS:
: julia> nrow(@subset d2 :BK .== ".")
: 612
: julia> nrow(@subset d2 :BK .!= ".")
: 6
: julia> nrow(@subset d2 :BK .!= "am")
: 612
****** KILL comparer avec Kumaran 2021
CLOSED: [2023-03-22 Wed 23:16]
Bed probablement différent, on a presque un facteur 2 sur le nombre de variant (TP..)
****** KILL Comparer avec stats de NA12878 dans example/happy sur chr21 (exons fournis)
CLOSED: [2023-03-04 Sat 11:01]
******* DONE DP_over_30_not_SNP_consensual_sequence.vcf: horrible
CLOSED: [2023-02-25 Sat 19:47]
déplorable
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL          519        16       503         1579      1032        531      2      2       0.030829          0.015267        0.336289         0.020421                     NaN                     NaN                   1.775956                   4.768382
INDEL   PASS          519        16       503         1579      1032        531      2      2       0.030829          0.015267        0.336289         0.020421                     NaN                     NaN                   1.775956                   4.768382
  SNP    ALL        22131      1191     20940         4346      2342        813      4      1       0.053816          0.337107        0.187069         0.092815                2.971834                1.967235                   1.579776                   1.492828
  SNP   PASS        22131      1191     20940         4346      2342        813      4      1       0.053816          0.337107        0.187069         0.092815                2.971834                1.967235                   1.579776                   1.492828
******* DONE DP _over_30 : mieux
CLOSED: [2023-03-04 Sat 11:01]
/Work/Users/apraga/bisonex/work/69/cc1391f5a0fbf4db958a370ec17a19
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision
INDEL    ALL           76        20        56           36         6         10      1      3       0.263158          0.769231
INDEL   PASS           76        20        56           36         6         10      1      3       0.263158          0.769231
  SNP    ALL          582       312       270          370         9         49      0      0       0.536082          0.971963
  SNP   PASS          582       312       270          370         9         49      0      0       0.536082          0.971963
******** DONE NC_000021.9:45515163 TA -> T : référence non normalisée ??
CLOSED: [2023-03-04 Sat 11:01]
NC_000021.9     45515163        .       TA      T       50      PASS    BS=45515163;Regions=CONF,TS_contained   GT:BD:BK:BI:BVT:BLT:QQ  0/1:FN:lm:d1_5:INDEL:het:.      ./.:.:.:.:NOCALL:nocall:0
NC_000021.9     45515163        .       TAA     T       694.02  PASS    BS=45515163;Regions=CONF,TS_contained   GT:BD:BK:BI:BVT:BLT:QQ  ./.:.:.:.:NOCALL:nocall:.       0/1:FP:lm:d1_5:INDEL:het:694.02
Mais dans notre vcf
NC_000021.9     45515163        rs11284347      TAA     T,TA
La séquence est TAAAA donc notre TAA -> TA aurait du être reconnu comme identique. Faut-il utiliser un autre fasta ?
La décomposition de xcmp est
45515163        TAA -> T
45515164        AA -> A
NB: testée avec pre.py
#+begin_src
grep '^#' NA12878_NIST7035_DP_over_30.vcf > NA12878_NIST7035_DP_over_30_chr21.vcf
grep '^NC_000021' NA12878_NIST7035_DP_over_30.vcf >> NA12878_NIST7035_DP_over_30_chr21.vcf
pre.py NA12878_NIST7035_DP_over_30_chr21.vcf out.vcf.gz -r GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta
#+end_src
Selon l'algorithm, on devrait avoir TA -> T et non AA -> A (non left-align)
Si on débug:
#+begin_src
grep "^#" NA12878_NIST7035_DP_over_30_chr21.vcf > test_align.vcf
grep 45515163 NA12878_NIST7035_DP_over_30_chr21.vcf  >> test_align.vcf.gz
bgzip test_align.vcf.gz
bcftools index test_align.vcf.gz
multimerge test_align.vcf.gz -o out.vcf.gz -r GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta
#+end_src
Idem... Si on utile un genome de référence plus récent ?
#+begin_src
multimerge test_align.vcf.gz -o out.vcf.gz -r /Work/Groups/bisonex/data/genome/GRCh38.p13/genomeRef.fna --process-full=1
#+end_src
Idem. Et vcfeval ?
#+begin_src
tabix HG001_GRCh38_1_22_v4.2.1_benchmark_chr21.vcf.gz
tabix test_align.vcf.gz
rtg vcfeval -b HG001_GRCh38_1_22_v4.2.1_benchmark_chr21.vcf.gz -c test_align.vcf.gz -t /Work/Groups/bisonex/data/genome/GRCh38.p13/genomeRef.sdf -o vcfeval-test
#+end_src
Selected score threshold using: maximized F-measure
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
   99.000                  0              0          1      54828     0.0000       0.0000     0.0000
     None                  0              0          1      54828     0.0000       0.0000     0.0000
     On essaie les différentes étapes
#+begin_src
multimerge test_align.vcf.gz -o out.vcf.gz -r /Work/Groups/bisonex/data/genome/GRCh38.p13/genomeRef.fna --homref-split 1 --homref-vcf-out 1 --trimalleles 1 --splitalleles 1
#+end_src
1er changement
TAA T
TAA TA
Après réflexion, c'est la référence qui n'est pas normalisée ! (left-trimmed)
******** DONE NC_000021.9     14108836  T -> C
CLOSED: [2023-03-04 Sat 11:01]
Dans le bam mais filtré par DP
******* DONE variant calling seul : meilleur score pour l'instant (77% recall, 95% precision)
CLOSED: [2023-03-04 Sat 11:01]
tests/chr21-alexis
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision
INDEL    ALL           76        43        33           82        18         20      3      5       0.565789          0.709677
  SNP    ALL          582       448       134          530        25         57      6      1       0.769759          0.947146
 METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
       0.243902         0.629617                     NaN                     NaN                   1.181818                   1.612903
       0.107547         0.849289                3.098592                2.925926                   1.530435                   1.774869
******** NC_000021.9:14144627 : FN, dans le bam mais pas dans le vcf (2 reads/9)
 On récupére tout le vcf: pas dedans
 Dans le bam : 9/2
 Idem pour le bam dans notre pipeline
 - base qualité 33 sur les 2 reads
https://gatk.broadinstitute.org/hc/en-us/articles/360043491652-When-HaplotypeCaller-and-Mutect2-do-not-call-an-expected-variant
https://gatk.broadinstitute.org/hc/en-us/articles/360035891111-Expected-variant-at-a-specific-site-was-not-called
On debug
#+begin_src
cd /Work/Users/apraga/bisonex/out/NA12878_NIST7035/preprocessing/applybqsr
samtools view -b NA12878_NIST7035.bam NC_000021.9 -o NA12878_NIST7035_chr21.bam
samtools index NA12878_NIST7035_chr21.bam
#+end_src
********* --debug
#+begin_src
gatk --java-options "-Xmx3g" HaplotypeCaller --input NA12878_NIST7035_chr21.bam \
    --output debug_chr1.vcf.gz \
    --reference /Work/Groups/bisonex/data/genome/GRCh38.p13/genomeRef.fna \
    --dbsnp /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz \
    --tmp-dir . \
    --max-mnp-distance 2 --debug &> lol.txt
#+end_src
Les allèles sont bien retrouvées
#+begin_quote
14:41:05.530 INFO  EventMap - === Best Haplotypes ===
14:41:05.530 INFO  EventMap - AATTTAATTTTCTTACCTTTCTGGGTATGTAAGTGATTTTA
14:41:05.530 INFO  EventMap - > Cigar = 41M
14:41:05.530 INFO  EventMap - >> Events = EventMap{}
14:41:05.530 INFO  EventMap - AATTTAATTTTCTTACCTTTTTGGGTATGTAAGTGATTTTA
14:41:05.530 INFO  EventMap - > Cigar = 41M
14:41:05.530 INFO  EventMap - >> Events = EventMap{NC_000021.9:14144627-14144627 [C*, T],}
14:41:05.530 INFO  HaplotypeCallerGenotypingEngine - Genotyping event at 14144627 with alleles = [C*, T]
#+end_quote
NB: même en filtranrt sur le chromosome 21 avec julia, haplotypecaller parcourt tous les chromosomes
********* DONE --linked-de-bruijn-graph : idem
CLOSED: [2023-02-26 Sun 17:26]
********* DONE examine sortie --bamout : non présent
CLOSED: [2023-02-26 Sun 19:53]
#+begin_src
cd test/chr21-alexis
gatk --java-options "-Xmx3g" HaplotypeCaller \
    --input /Work/Users/apraga/bisonex/script/files/bam/NA12878_chr21.bam \
    --output debug_chr1.vcf.gz \
    --reference /Work/Groups/bisonex/data/genome/GRCh38.p13/genomeRef.fna \
    --dbsnp /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz \
    --tmp-dir . \
    --max-mnp-distance 2 -bamout debug.bam
#+end_src
Pas de reads
 #+begin_quote
If you see nothing overlapping your region, then it might not have been flagged as active, or could have failed to assemble.
 #+end_quote
********* 14582339: FN mais pas de reads...
********* 14583327 idem
********* 17512551 idem
********* 17567111: difference d'haplotype
********* 17567621 pas de reads
****** DONE Comparer avec sortie du variant calling vcf donné par GIAB
CLOSED: [2023-04-02 Sun 17:11]
******* DONE vcfeval
CLOSED: [2023-04-01 Sat 11:59] SCHEDULED: <2023-04-01 Sat>
#+begin_src sh
 nextflow run workflows/test.nf -profile standard,helios -resume --test.vcfeval --test.giabVCF --outdir=test-giabVCF
cat test-giabVCF/vcfeval/output/summary.txt
#+end_src
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    1.000              44818          44818       2892       6087     0.9394       0.8804     0.9089
     None              44819          44819       2896       6086     0.9393       0.8804     0.9089
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    1.000              44818          44818       2892       6087     0.9394       0.8804     0.9089
     None              44819          44819       2896       6086     0.9393       0.8804     0.9089
******* DONE happy
CLOSED: [2023-04-01 Sat 11:56]
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.PrecisioN
INDEL   PASS         4871      3678      1193         7036      1299       2011    208    217       0.755081          0.741493
  SNP   PASS        46032     41138      4894        47694      1622       4930    362     31       0.893683          0.962071
 METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
       0.285816         0.748225                     NaN                     NaN                   1.617499                   2.524051
       0.103367         0.926617                2.529552                2.412446                   1.620686                   1.688868
****** DONE Statistiques avec vcfeval
CLOSED: [2023-04-02 Sun 17:10] SCHEDULED: <2023-04-01 Sat>
**** TODO Résumer résultats pour Paul + article :resultats:
SCHEDULED: <2023-04-02 Sun>
***** TODO HG001 :
SCHEDULED: <2023-04-02 Sun>
- notre version avec hap.py + vcfeval
- version GIAB
- référence ??
***** TODO HG002
SCHEDULED: <2023-04-02 Sun>
*** TODO Platinum genome
https://emea.illumina.com/platinumgenomes.html
*** TODO Séquencer NA12878
Discussion avec Paul : sous-traitant ne nous donnera pas les données, il faut commander l'ADN
** Divers
*** DONE Vérifier nombre de reads fastq - bam
CLOSED: [2022-10-09 Sun 22:31]