apraga/org - Change 7T64XRC6YWKRZPFWTH27XQQZP6KEYTLCPEJ4SGPTLYRKZYUNBRQAC

Merge branch 'master' of git.sr.ht:~scut/org

Created by Alexis Praga on July 17, 2023

7T64XRC6YWKRZPFWTH27XQQZP6KEYTLCPEJ4SGPTLYRKZYUNBRQAC

Dependencies

In channels

main

Change contents

Replacement in projects/bisonex.org at line 34 [5.35]

B:BD[4.8134] → [4.8134:8271]

B:BD[4.8271] → [2.29:8475]

B:BD[4.8271] → [3.8222:16445]

∅:D[2.8475] → [6.24811:26388]

∅:D[3.16445] → [6.24811:26388]

B:BD[6.24811] → [6.24811:26388]

∅:D[6.26388] → [7.49104:49455]

B:BD[7.49104] → [7.49104:49455]

∅:D[7.49455] → [8.48394:49010]

B:BD[8.48394] → [8.48394:49010]

∅:D[8.49010] → [9.24847:25560]

B:BD[9.24847] → [9.24847:25560]

∅:D[9.25560] → [10.39190:43625]

B:BD[10.39190] → [10.39190:43625]

_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL          413       246       167          751       289        215      2     98       0.595642          0.460821        0.286285         0.519629                     NaN                     NaN                   2.428571                   2.465116
INDEL   PASS          413       246       167          751       289        215      2     98       0.595642          0.460821        0.286285         0.519629                     NaN                     NaN                   2.428571                   2.465116
  SNP    ALL        15883     15479       404        23597      5277       2841     46     44       0.974564          0.745760        0.120397         0.844947                3.017198                 2.85705                   5.560099                   2.114633
  SNP   PASS        15883     15479       404        23597      5277       2841     46     44       0.974564          0.745760        0.120397         0.844947                3.017198                 2.85705                   5.560099                   2.114633
******* DONE Check that no filter other than PASS remains
CLOSED: [2023-07-08 Sat 15:19]
#+begin_src
$ zgrep -c 'PASS' HG001_GRCh38_1_22_v4_lifted_merged.vcf.gz
3730505
$ zgrep -c '^chr' HG001_GRCh38_1_22_v4_lifted_merged.vcf.gz
3730506
#+end_src
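Note that =zgrep -c 'PASS'= counts every line containing the string, header lines included, so the two counts above are only an approximate check. A minimal sketch of a stricter tally, assuming a standard tab-separated VCF where FILTER is the 7th column; the function name and the demo records are hypothetical:

#+begin_src python
from collections import Counter

def count_filters(lines):
    """Count FILTER values (7th column) over VCF data lines, skipping headers."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue  # meta-information and column header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 6:
            counts[fields[6]] += 1
    return counts

# Hypothetical records, for illustration only.
demo = [
    "##FILTER=<ID=PASS,Description=\"All filters passed\">\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\n",
    "chr1\t200\t.\tC\tT\t50\tPASS\t.\n",
    "chr2\t300\t.\tG\tA\t10\tLowQual\t.\n",
]
print(count_filters(demo))
#+end_src

On the real file, the same function could be fed from =gzip.open(path, "rt")=; any key other than PASS in the result would indicate a remaining filter.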
****** TODO 1/4 of SNPs missing?
SCHEDULED: <2023-07-08 Sat>
******* DONE Check with Julia whether these really are FPs: 61/5277 are not
CLOSED: [2023-07-09 Sun 12:09]
******* TODO Examine the FPs
******* TODO Test one FP
  2 │ chr1        608765  A           G           ./.:.:.:.:NOCALL:nocall:.  1/1:FP:.:ti:SNP:homalt:188
  UCSC liftDown: nothing in GIAB, so a true FP
 3 │ chr1        762943  A           G           ./.:.:.:.:NOCALL:nocall:.  1/1:FP:.:ti:SNP:homalt:287
 4 │ chr1        762945  A           T           ./.:.:.:.:NOCALL:nocall:.  1/1:FP:.:tv:SNP:homalt:287
 Complex rearrangements? Not in the gene in hg38
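The QUERY column of the rows above can be tallied by decision and zygosity. A minimal sketch, assuming the sample fields follow the order GT:BD:BK:BI:BVT:BLT:QQ (inferred from the rows shown, where BD holds FP and BLT holds homalt); the helper names are hypothetical:

#+begin_src python
from collections import Counter

# Assumed field order of the annotated sample columns.
FIELDS = ("GT", "BD", "BK", "BI", "BVT", "BLT", "QQ")

def parse_sample(sample):
    """Split one colon-separated sample column into a field dict."""
    return dict(zip(FIELDS, sample.split(":")))

def count_fp_zygosity(query_samples):
    """Count the zygosity label (BLT) of every QUERY call marked FP."""
    counts = Counter()
    for sample in query_samples:
        record = parse_sample(sample)
        if record["BD"] == "FP":
            counts[record["BLT"]] += 1
    return counts

# QUERY columns copied from the three rows listed above.
rows = [
    "1/1:FP:.:ti:SNP:homalt:188",
    "1/1:FP:.:ti:SNP:homalt:287",
    "1/1:FP:.:tv:SNP:homalt:287",
]
counts = count_fp_zygosity(rows)
print(counts)
#+end_src

Run over the whole annotated VCF, the same tally would give the homozygous share of the FPs.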
******* TODO Most FPs (4705/5566) are homozygous: reference error?
SCHEDULED: <2023-07-09 Sun>
The first 2 variants actually reflect the difference between T2T and GRCh38
Alignment error?
******** KILL rerun the alignment
CLOSED: [2023-07-09 Sun 17:36]
******** DONE check that the reads are identical in hg38 and T2T: yes
CLOSED: [2023-07-09 Sun 16:36]
T2T  chr1:608765
38   	chr1:1180168-1180168 (
SRR14724513.24448214
SRR14724513.24448214
******* TODO Remove FPs that correspond to a change in the reference genome
SCHEDULED: <2023-07-09 Sun>
******** Conditions:
- no variant at the position in GRCh38
- homozygous variant
- the T2T variant matches the base-pair change GRCh38 -> T2T
  for SNPs:
  alt_T2T[i] = DNA_GRCh38[j]
  where i is the position in T2T and j the position in GRCh38
  Note: defining an ID is not valid, because hap.py may modify the variants!
******** Algorithm
 - For each FP, it is a "false" FP if
     - REF in hg38 == ALT in T2T
     - and REF in hg38 != REF in T2T
     - and the variant is homozygous
How can we obtain the reference sequences?
1. liftover
2. BLAT on the sequence around the variant
3. identify a few reads containing the variant and look at their alignment in hg38
After discussion with Alexis: option 3
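The three tests of the algorithm can be sketched as a single predicate. This is only an illustration: the function name is hypothetical, the hg38 base is assumed to have been looked up already (by whichever of the three options is chosen), and the "no variant at the GRCh38 position" condition is not checked here.

#+begin_src python
def is_false_fp(ref_hg38, ref_t2t, alt_t2t, genotype):
    """True if an FP called on T2T only mirrors a GRCh38 -> T2T base change."""
    alleles = genotype.replace("|", "/").split("/")
    # Homozygous for a non-reference allele, e.g. 1/1 or 1|1.
    homozygous_alt = (
        len(alleles) == 2
        and alleles[0] == alleles[1]
        and alleles[0] not in ("0", ".")
    )
    return homozygous_alt and ref_hg38 == alt_t2t and ref_hg38 != ref_t2t

# Hypothetical case: T2T has A, the sample is G/G, and hg38 also has G.
print(is_false_fp(ref_hg38="G", ref_t2t="A", alt_t2t="G", genotype="1/1"))
# Same call, but heterozygous: kept as a real FP.
print(is_false_fp(ref_hg38="G", ref_t2t="A", alt_t2t="G", genotype="0/1"))
#+end_src

Comparing bases rather than variant IDs sidesteps the issue noted above, that hap.py may rewrite the variants.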
******* DONE Check a few variants in IGV
CLOSED: [2023-07-09 Sun 17:36]
******* KILL Distribution of the FPs: clusters?
CLOSED: [2023-07-09 Sun 17:36]
******* TODO Pangenome methodology
***** KILL Email Yannis
CLOSED: [2023-07-08 Sat 10:44]
***** DONE Email GIAB about a T2T version
CLOSED: [2023-07-07 Fri 18:37]
**** DONE NA12878 :na12878:hg38:
CLOSED: [2023-06-30 Fri 22:30]
***** DONE Discussion with Alexis: email
CLOSED: [2023-03-29 Wed 22:40]
With patient NA12878, comparing against the Genome In A Bottle VCF ("gold" standard) using hap.py, as a reminder we had:
- sensitivity (= recall): 71% for indels, 85% for SNPs
- precision (= PPV): 69% and 97% respectively
| Type  | TRUTH |    TP |   FN | QUERY |   FP |  UNK | FP.gt | FP.al |   Recall | Precision |
| INDEL |  4871 |  3461 | 1410 |  7048 | 1554 | 1987 |   193 |   346 | 0.710532 |  0.692946 |
| SNP   | 46032 | 39369 | 6663 | 44600 | 1186 | 4041 |   304 |    30 | 0.855253 |  0.970759 |
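The Recall and Precision columns can be recomputed from the counts. A minimal sketch, assuming the number of true-positive calls on the query side is QUERY - FP - UNK (an assumption, but one that reproduces both rows of the table); the function name is hypothetical:

#+begin_src python
def happy_metrics(truth_tp, truth_fn, query_total, query_fp, query_unk):
    """Recompute recall, precision, Frac_NA and F1 from hap.py-style counts."""
    recall = truth_tp / (truth_tp + truth_fn)
    # Assumption: query-side TP = all calls minus FP minus UNK (outside truth regions).
    query_tp = query_total - query_fp - query_unk
    precision = query_tp / (query_tp + query_fp)
    frac_na = query_unk / query_total
    f1 = 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision, "frac_na": frac_na, "f1": f1}

indel = happy_metrics(3461, 1410, 7048, 1554, 1987)
snp = happy_metrics(39369, 6663, 44600, 1186, 4041)
print(round(indel["recall"], 6), round(indel["precision"], 6))
print(round(snp["recall"], 6), round(snp["precision"], 6))
#+end_src

UNK calls are excluded from precision, which is why the precision is not simply TP / QUERY.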
Statistics on whole genomes are much better (cf. precisionFDA challenge).
For exomes, one article [1] reports better stats on this patient with BWA and GATK, but they have fewer variants (we have almost a factor of 2!).
I suspect we are not working on the same capture regions (I could not retrieve their .bed)
| Exome | Type  |    TP |   FP |  FN | Sensitivity | Precision | F-Score |   FDR |
|     1 | SNV   | 23689 | 1397 | 613 |       0.975 |     0.944 |   0.959 | 0.057 |
|     2 | SNV   | 23946 |  865 | 356 |       0.985 |     0.965 |   0.975 | 0.036 |
|     1 | indel |  1254 |   72 |  75 |       0.944 |     0.946 |   0.945 | 0.054 |
|     2 | indel |  1309 |   10 |  20 |       0.985 |     0.992 |   0.989 | 0.008 |
To try to improve the statistics:
- The genome version, GRCh38 vs GRCh38.p13, changes almost nothing
- Disabling dbSNP changes strictly nothing for variant calling
I explored the false negatives:
- the vast majority are simply not seen (it is not a haploid/genotype issue)
- the distribution across chromosomes is fairly even, except on chromosome 6 ()
- the majority are in the 5' and 3' UTRs (according to BestRefSeq)
Conclusion: I plan to stop here for the variant-calling validation, for lack of time. We would need to dig into why some variants are not seen by GATK, but that is not the majority. In any case, I can justify a first analysis for the thesis.
Does that work for you?
[1]
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2928-9
Results here https://static-content.springer.com/esm/art%3A10.1186%2Fs12859-019-2928-9/MediaObjects/12859_2019_2928_MOESM8_ESM.pdf
***** DONE Comparison
CLOSED: [2023-03-04 Sat 11:14]
HGREF=/Work/Groups/bisonex/data-alexis-reference/genome/GRCh38_latest_genomic.fna ./result/bin/hap.py /Work/Groups/bisonex/NA12878/HG001_GRCh38_1_22_v4.2.1_benchmark_renamed.vcf.gz script/files/vcf/NA12878_NIST7035_vep_annot.vcf -f /Work/Groups/bisonex/NA12878/HG001_GRCh38_1_22_v4.2.1_benchmark.bed -o test
na1878.slurm
#+begin_src slurm
#!/bin/bash
#SBATCH -c 4
#SBATCH -p smp
#SBATCH --time=01:00:00
#SBATCH --mem=32G
module load nix/2.11.0
export HGREF=/Work/Groups/bisonex/data-alexis-reference/genome/GRCh38_latest_genomic.fna
dir=/Work/Groups/bisonex/data/NA12878/GRCh38
hap.py ${dir}/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz script/files/vcf/NA12878_NIST7035.vcf -f ${dir}/HG001_GRCh38_1_22_v4.2.1_benchmark.bed -o test
#+end_src
****** KILL far too many false negatives
CLOSED: [2023-02-17 Fri 19:37]
******* DONE Test 1: vep annot: far too many false negatives
CLOSED: [2023-02-06 lun. 13:40]
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL       276768       274    276494         1500       257        968     26     15       0.000990          0.516917        0.645333         0.001976                     NaN                     NaN                   1.483361                   6.129187
INDEL   PASS       276768       274    276494         1500       257        968     26     15       0.000990          0.516917        0.645333         0.001976                     NaN                     NaN                   1.483361                   6.129187
  SNP    ALL      1937706      1193   1936513         3338       106       2037     11      2       0.000616          0.918524        0.610246         0.001231                  2.0785                1.861183                   1.539064                   2.703663
  SNP   PASS      1937706      1193   1936513         3338       106       2037     11      2       0.000616          0.918524        0.610246         0.001231                  2.0785                1.861183                   1.539064                   2.703663
******* KILL Test 3: index the reference VCF
CLOSED: [2023-02-06 lun. 17:19]
Même résultat av
         413       246       167          751       289        215      2     98       0.595642          0.460821        0.286285         0.519629                     NaN                     NaN                   2.428571                   2.465116
INDEL   PASS          413       246       167          751       289        215      2     98       0.595642          0.460821        0.286285         0.519629                     NaN                     NaN                   2.428571                   2.465116
  SNP    ALL        15883     15479       404        23597      5277       2841     46     44       0.974564          0.745760        0.120397         0.844947                3.017198                 2.85705                   5.560099                   2.114633
  SNP   PASS        15883     15479       404        23597      5277       2841     46     44       0.974564          0.745760        0.120397         0.844947                3.017198                 2.85705                   5.560099                   2.114633
******* DONE Check that no filter other than PASS remains
CLOSED: [2023-07-08 Sat 15:19]
#+begin_src
$ zgrep -c 'PASS' HG001_GRCh38_1_22_v4_lifted_merged.vcf.gz
3730505
$ zgrep -c '^chr' HG001_GRCh38_1_22_v4_lifted_merged.vcf.gz
3730506
#+end_src
****** TODO 1/4 of SNPs missing?
SCHEDULED: <2023-07-08 Sat>
******* DONE Check with Julia whether these really are FPs: 61/5277 are not
CLOSED: [2023-07-09 Sun 12:09]
******* TODO Examine the FPs
******* TODO Test one FP
  2 │ chr1        608765  A           G           ./.:.:.:.:NOCALL:nocall:.  1/1:FP:.:ti:SNP:homalt:188
  UCSC liftDown: nothing in GIAB, so a true FP
 3 │ chr1        762943  A           G           ./.:.:.:.:NOCALL:nocall:.  1/1:FP:.:ti:SNP:homalt:287
 4 │ chr1        762945  A           T           ./.:.:.:.:NOCALL:nocall:.  1/1:FP:.:tv:SNP:homalt:287
 Complex rearrangements? Not in the gene in hg38
******* DONE Most FPs (4705/5566) are homozygous: reference error?
CLOSED: [2023-07-12 Wed 21:10] SCHEDULED: <2023-07-09 Sun>
The first 2 variants actually reflect the difference between T2T and GRCh38
Alignment error?
******** KILL rerun the alignment
CLOSED: [2023-07-09 Sun 17:36]
******** DONE check that the reads are identical in hg38 and T2T: yes
CLOSED: [2023-07-09 Sun 16:36]
T2T  chr1:608765
38   	chr1:1180168-1180168 (
SRR14724513.24448214
SRR14724513.24448214
******* TODO Remove FPs that correspond to a change in the reference genome
SCHEDULED: <2023-07-09 Sun>
Conditions:
- no variant at the position in GRCh38
- homozygous variant
- the T2T variant matches the base-pair change GRCh38 -> T2T
  for SNPs:
  alt_T2T[i] = DNA_GRCh38[j]
  where i is the position in T2T and j the position in GRCh38
  Note: defining an ID is not valid, because hap.py may modify the variants!
  Algorithm
  1. For each FP, it is a "false" FP if
     - REF in hg38 == ALT in T2T
     - and REF in hg38 != REF in T2T
     - and the variant is homozygous
******* DONE Check a few variants in IGV
CLOSED: [2023-07-09 Sun 17:36]
******* KILL Distribution of the FPs: clusters?
CLOSED: [2023-07-09 Sun 17:36]
******* TODO Pangenome methodology
***** KILL Email Yannis
CLOSED: [2023-07-08 Sat 10:44]
***** DONE Email GIAB about a T2T version
CLOSED: [2023-07-07 Fri 18:37]
**** DONE NA12878 :na12878:hg38:
CLOSED: [2023-06-30 Fri 22:30]
***** DONE Discussion with Alexis: email
CLOSED: [2023-03-29 Wed 22:40]
With patient NA12878, comparing against the Genome In A Bottle VCF ("gold" standard) using hap.py, as a reminder we had:
- sensitivity (= recall): 71% for indels, 85% for SNPs
- precision (= PPV): 69% and 97% respectively
| Type  | TRUTH |    TP |   FN | QUERY |   FP |  UNK | FP.gt | FP.al |   Recall | Precision |
| INDEL |  4871 |  3461 | 1410 |  7048 | 1554 | 1987 |   193 |   346 | 0.710532 |  0.692946 |
| SNP   | 46032 | 39369 | 6663 | 44600 | 1186 | 4041 |   304 |    30 | 0.855253 |  0.970759 |
Statistics on whole genomes are much better (cf. precisionFDA challenge).
For exomes, one article [1] reports better stats on this patient with BWA and GATK, but they have fewer variants (we have almost a factor of 2!).
I suspect we are not working on the same capture regions (I could not retrieve their .bed)
| Exome | Type  |    TP |   FP |  FN | Sensitivity | Precision | F-Score |   FDR |
|     1 | SNV   | 23689 | 1397 | 613 |       0.975 |     0.944 |   0.959 | 0.057 |
|     2 | SNV   | 23946 |  865 | 356 |       0.985 |     0.965 |   0.975 | 0.036 |
|     1 | indel |  1254 |   72 |  75 |       0.944 |     0.946 |   0.945 | 0.054 |
|     2 | indel |  1309 |   10 |  20 |       0.985 |     0.992 |   0.989 | 0.008 |
To try to improve the statistics:
- The genome version, GRCh38 vs GRCh38.p13, changes almost nothing
- Disabling dbSNP changes strictly nothing for variant calling
I explored the false negatives:
- the vast majority are simply not seen (it is not a haploid/genotype issue)
- the distribution across chromosomes is fairly even, except on chromosome 6 ()
- the majority are in the 5' and 3' UTRs (according to BestRefSeq)
Conclusion: I plan to stop here for the variant-calling validation, for lack of time. We would need to dig into why some variants are not seen by GATK, but that is not the majority. In any case, I can justify a first analysis for the thesis.
Does that work for you?
[1]
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2928-9
Results here https://static-content.springer.com/esm/art%3A10.1186%2Fs12859-019-2928-9/MediaObjects/12859_2019_2928_MOESM8_ESM.pdf
***** DONE Comparison
CLOSED: [2023-03-04 Sat 11:14]
HGREF=/Work/Groups/bisonex/data-alexis-reference/genome/GRCh38_latest_genomic.fna ./result/bin/hap.py /Work/Groups/bisonex/NA12878/HG001_GRCh38_1_22_v4.2.1_benchmark_renamed.vcf.gz script/files/vcf/NA12878_NIST7035_vep_annot.vcf -f /Work/Groups/bisonex/NA12878/HG001_GRCh38_1_22_v4.2.1_benchmark.bed -o test
na1878.slurm
#+begin_src slurm
#!/bin/bash
#SBATCH -c 4
#SBATCH -p smp
#SBATCH --time=01:00:00
#SBATCH --mem=32G
module load nix/2.11.0
export HGREF=/Work/Groups/bisonex/data-alexis-reference/genome/GRCh38_latest_genomic.fna
dir=/Work/Groups/bisonex/data/NA12878/GRCh38
hap.py ${dir}/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz script/files/vcf/NA12878_NIST7035.vcf -f ${dir}/HG001_GRCh38_1_22_v4.2.1_benchmark.bed -o test
#+end_src
****** KILL far too many false negatives
CLOSED: [2023-02-17 Fri 19:37]
******* DONE Test 1: vep annot: far too many false negatives
CLOSED: [2023-02-06 lun. 13:40]
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL       276768       274    276494         1500       257        968     26     15       0.000990          0.516917        0.645333         0.001976                     NaN                     NaN                   1.483361                   6.129187
INDEL   PASS       276768       274    276494         1500       257        968     26     15       0.000990          0.516917        0.645333         0.001976                     NaN                     NaN                   1.483361                   6.129187
  SNP    ALL      1937706      1193   1936513         3338       106       2037     11      2       0.000616          0.918524        0.610246         0.001231                  2.0785                1.861183                   1.539064                   2.703663
  SNP   PASS      1937706      1193   1936513         3338       106       2037     11      2       0.000616          0.918524        0.610246         0.001231                  2.0785                1.861183                   1.539064                   2.703663
******* KILL Test 3: index the reference VCF
CLOSED: [2023-02-06 lun. 17:19]
Same result with vcfeval, which needs the indexed version
******* DONE Test 3 without the VEP filter: same
CLOSED: [2023-02-06 lun. 17:19]
Benchmarking Summary:
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL       276768     10535    266233        52169     10969      30616   3552   2122       0.038064          0.491069        0.586862         0.070652                     NaN                     NaN                   1.483361                   0.509510
INDEL   PASS       276768     10535    266233        52169     10969      30616   3552   2122       0.038064          0.491069        0.586862         0.070652                     NaN                     NaN                   1.483361                   0.509510
  SNP    ALL      1937706    105753   1831953       357652     74634     177259  35111    797       0.054576          0.586270        0.495619         0.099857                  2.0785                 1.42954                   1.539064                   0.324923
  SNP   PASS      1937706    105753   1831953       357652     74634     177259  35111    797       0.054576          0.586270        0.495619         0.099857                  2.0785                 1.42954                   1.539064                   0.324923
******* DONE Test 4 with vcfeval on vep_annot: same
CLOSED: [2023-02-06 lun. 17:18]
#+begin_src
#!/bin/bash
#SBATCH -c 4
#SBATCH -p smp
#SBATCH --time=01:00:00
#SBATCH --mem=32G
module load nix/2.11.0
export HGREF=/Work/Groups/bisonex/data-alexis-reference/genome/GRCh38_latest_genomic.fna dir=/Work/Groups/bisonex/data/NA12878/GRCh38
rtg vcfeval -b  /Work/Groups/bisonex/data/NA12878/GRCh38/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz  -c files/vcf/NA12878_NIST7035_vep_annot.vcf.gz  -o test-rtg -t /Work/Groups/bisonex/data/genome/GRCh38.p13/genomeRef.sdf
#+end_src
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    1.000               2984           2682       1840    3890296     0.5931       0.0008     0.0015
     None               2984           2682       1841    3890296     0.5930       0.0008     0.0015
Log excerpt
2023-02-06 13:50:14 Reference NC_000001.11 baseline contains 307854 variants.
2023-02-06 13:50:14 Reference NC_000001.11 calls contains 426 variants.
2023-02-06 13:50:15 Reference NC_000002.12 baseline contains 325877 variants.
2023-02-06 13:50:15 Reference NC_000002.12 calls contains 320 variants.
******* DONE Inspect a few variants by hand
CLOSED: [2023-02-07 Tue 22:01]
Example:
Missing: NC_000001.11    783006  .       A       G       50      PASS
There are both A -> G and C -> A at this position
***** DONE Restrict the reference genome
CLOSED: [2023-03-04 Sat 11:15]
****** Discussion with Alexis
The pipeline takes into account 5', 3', canonical splice variants + those predicted by SPiP
The simplest option for now is to restrict to exons only
GENCODE (European version) vs RefSeq: [[https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-16-S8-S2][2015 article]] favors GENCODE, but Alexis recommends RefSeq
****** DONE -f, -R or -T?
CLOSED: [2023-02-25 Sat 19:47]
According to the docs: -f with the provided BED and -T to filter on exons
******* rtg tools
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    3.000               1015            910        206      32531     0.8154       0.0303     0.0583
     None               1015            910        206      32531     0.8154       0.0303     0.0583
***** KILL Exons only
CLOSED: [2023-04-02 Sun 17:11]
****** DONE BestRefSeq
CLOSED: [2023-02-19 Sun 12:05]
In RefSeq, there are [[https://www.ncbi.nlm.nih.gov/genome/annotation_euk/process/][2 types of gene models]]:
- RefSeq-based (NM_, NP_), curated
- Gnomon-based (XM_, XP_), predicted
RefSeq-based models take precedence (see link)
We therefore restrict to BestRefSeq
#+begin_src sh
wget https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.gff.gz
# gunzip takes the downloaded local file, not the URL
gunzip GRCh38_latest_genomic.gff.gz
#+end_src
We restrict to coding exons (NM_)
#+begin_src
awk '/BestRefSeq\texon/ && /transcript_id=NM/ {print $1"\t"$4"\t"$5;}' GRCh38_latest_genomic.gff > exons.csv
#+end_src
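One caveat, in case exons.csv is later consumed as a BED file: GFF coordinates are 1-based and inclusive, whereas BED is 0-based and half-open, so the start column would need to be shifted by one. A minimal sketch of the same selection with that conversion; the function name and the demo line are illustrative, and the filter only approximates the awk pattern:

#+begin_src python
def gff_exons_to_bed(lines):
    """Yield (chrom, start, end) BED tuples for BestRefSeq NM_ exons in a GFF."""
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        seqid, source, ftype, start, end, attrs = (
            fields[0], fields[1], fields[2], fields[3], fields[4], fields[8]
        )
        # Close to the awk filter: /BestRefSeq\texon/ && /transcript_id=NM/
        if source.endswith("BestRefSeq") and ftype == "exon" and "transcript_id=NM" in attrs:
            # GFF start is 1-based inclusive; BED start is 0-based.
            yield (seqid, int(start) - 1, int(end))

# Illustrative GFF line, not taken from the notes.
demo = [
    "NC_000001.11\tBestRefSeq\texon\t65419\t65433\t.\t+\t.\t"
    "ID=exon-NM_001005484.2-1;transcript_id=NM_001005484.2\n",
    "NC_000001.11\tGnomon\texon\t100\t200\t.\t+\t.\ttranscript_id=XM_000001.1\n",
]
bed = list(gff_exons_to_bed(demo))
print(bed)
#+end_src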
Then intersect
******* DONE Tests after fixing the chromosome-name bug: precision ~ok, recall very poor -> too many FNs?
CLOSED: [2023-02-19 Sun 12:05]
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL         7230       321      6909         1500       290        888     27     18       0.044398          0.526144        0.592000         0.081887                     NaN                     NaN                    1.54733                   6.129187
INDEL   PASS         7230       321      6909         1500       290        888     27     18       0.044398          0.526144        0.592000         0.081887                     NaN                     NaN                    1.54733                   6.129187
  SNP    ALL        59052      1653     57399         3338       101       1583     12      2       0.027992          0.942450        0.474236         0.054370                2.433271                1.861183                    1.57523                   2.703663
  SNP   PASS        59052      1653     57399         3338       101       1583     12      2       0.027992          0.942450        0.474236         0.054370                2.433271                1.861183                    1.57523                   2.703663
******* DONE Check exons: we have the union of the exons of all transcripts...
CLOSED: [2023-02-19 Sun 12:05]
We would need a .bed from Illumina
We test the Twist for Illumina Exome 2.0 Plus BED File (hg19) from https://support.illumina.com/downloads/nextera-flex-for-enrichment-BED-files.html
Conversion to hg38 with UCSC
Renaming of the chromosomes
#+begin_src
sed 's:^:s/chr:;s:chrMT:chrM:;s:\s:\\t/:;s:$:\\t/:' ../../genome/GRCh38.p13/chromosome_mapping.txt > pattern.sed
sed -i.bak -f pattern.sed illumina_exons.bed
bedtools intersect -a HG001_GRCh38_1_22_v4.2.1_benchmark.bed -b illumina_exons.bed > HG001_GRCh38_1_22_v4.2.1_benchmark_illumina_exons.bed
#+end_src
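The first =sed= call turns each line of the mapping file into a substitution rule, which the second call applies to the BED file. A minimal sketch of the same renaming, assuming chromosome_mapping.txt has two whitespace-separated columns (short name such as 1 or MT, then the RefSeq accession); that layout is inferred from the sed expression, and the function names are hypothetical:

#+begin_src python
def load_mapping(lines):
    """Build {UCSC name: RefSeq accession} from 'short accession' lines."""
    mapping = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue
        short, accession = parts[0], parts[1]
        # Mirrors the sed rule s:chrMT:chrM: for the mitochondrial genome.
        ucsc = "chrM" if short == "MT" else "chr" + short
        mapping[ucsc] = accession
    return mapping

def rename_bed(lines, mapping):
    """Rewrite the first BED column; unknown names are left unchanged."""
    for line in lines:
        chrom, sep, rest = line.partition("\t")
        yield mapping.get(chrom, chrom) + sep + rest

# Hypothetical mapping and BED lines, for illustration only.
mapping = load_mapping(["1\tNC_000001.11", "MT\tNC_012920.1"])
renamed = list(rename_bed(["chr1\t100\t200\n", "chrM\t5\t9\n", "chrUn\t1\t2\n"], mapping))
print(renamed)
#+end_src

Matching on the whole first column keeps a rule for chr1 from touching chr10, which the sed rules achieve by including the trailing tab in the pattern.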
Intersection
****** KILL Illumina BED
CLOSED: [2023-02-24 Fri 23:44]
******* KILL Without VEP filter: truth and query were swapped... starting over
CLOSED: [2023-02-19 Sun 13:15]
******* KILL Without VEP filter: better but not great
CLOSED: [2023-02-24 Fri 23:44]
cd work/00/2c72e62400956c96fb101ac7af405e/
$ cat .command.out
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision
INDEL    ALL          922       490       432          942       439          0     32     56       0.531453          0.533970
INDEL   PASS          922       490       432          942       439          0     32     56

[4.8134]

[10.43625]

_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL          413       246       167          751       289        215      2     98       0.595642          0.460821        0.286285         0.519629                     NaN                     NaN                   2.428571                   2.465116
INDEL   PASS          413       246       167          751       289        215      2     98       0.595642          0.460821        0.286285         0.519629                     NaN                     NaN                   2.428571                   2.465116
  SNP    ALL        15883     15479       404        23597      5277       2841     46     44       0.974564          0.745760        0.120397         0.844947                3.017198                 2.85705                   5.560099                   2.114633
  SNP   PASS        15883     15479       404        23597      5277       2841     46     44       0.974564          0.745760        0.120397         0.844947                3.017198                 2.85705                   5.560099                   2.114633
******* DONE Check that no filter other than PASS remains
CLOSED: [2023-07-08 Sat 15:19]
#+begin_src
$ zgrep -c 'PASS' HG001_GRCh38_1_22_v4_lifted_merged.vcf.gz
3730505
$ zgrep -c '^chr' HG001_GRCh38_1_22_v4_lifted_merged.vcf.gz
3730506
#+end_src
****** TODO 1/4 of SNPs missing?
SCHEDULED: <2023-07-08 Sat>
******* DONE Check with Julia whether these really are FPs: 61/5277 are not
CLOSED: [2023-07-09 Sun 12:09]
******* TODO Examine the FPs
******* TODO Test one FP
  2 │ chr1        608765  A           G           ./.:.:.:.:NOCALL:nocall:.  1/1:FP:.:ti:SNP:homalt:188
  UCSC liftDown: nothing in GIAB, so a true FP
 3 │ chr1        762943  A           G           ./.:.:.:.:NOCALL:nocall:.  1/1:FP:.:ti:SNP:homalt:287
 4 │ chr1        762945  A           T           ./.:.:.:.:NOCALL:nocall:.  1/1:FP:.:tv:SNP:homalt:287
 Complex rearrangements? Not in the gene in hg38
******* DONE Most FPs (4705/5566) are homozygous: reference error?
CLOSED: [2023-07-12 Wed 21:10] SCHEDULED: <2023-07-09 Sun>
The first 2 variants actually reflect the difference between T2T and GRCh38
Alignment error?
******** KILL rerun the alignment
CLOSED: [2023-07-09 Sun 17:36]
******** DONE check that the reads are identical in hg38 and T2T: yes
CLOSED: [2023-07-09 Sun 16:36]
T2T  chr1:608765
38   	chr1:1180168-1180168 (
SRR14724513.24448214
SRR14724513.24448214
******* TODO Remove FPs that correspond to a change in the reference genome
SCHEDULED: <2023-07-09 Sun>
******** Conditions:
- no variant at the position in GRCh38
- homozygous variant
- the T2T variant matches the base-pair change GRCh38 -> T2T
  for SNPs:
  alt_T2T[i] = DNA_GRCh38[j]
  where i is the position in T2T and j the position in GRCh38
  Note: defining an ID is not valid, because hap.py may modify the variants!
******** Algorithm
 - For each FP, it is a "false" FP if
     - REF in hg38 == ALT in T2T
     - and REF in hg38 != REF in T2T
     - and the variant is homozygous
How can we obtain the reference sequences?
1. liftover
2. BLAT on the sequence around the variant
3. identify a few reads containing the variant and look at their alignment in hg38
After discussion with Alexis: option 3
******* DONE Check a few variants in IGV
CLOSED: [2023-07-09 Sun 17:36]
******* KILL Distribution of the FPs: clusters?
CLOSED: [2023-07-09 Sun 17:36]
******* TODO Pangenome methodology
***** KILL Email Yannis
CLOSED: [2023-07-08 Sat 10:44]
***** DONE Email GIAB about a T2T version
CLOSED: [2023-07-07 Fri 18:37]
**** DONE NA12878 :na12878:hg38:
CLOSED: [2023-06-30 Fri 22:30]
***** DONE Discussion with Alexis: email
CLOSED: [2023-03-29 Wed 22:40]
With patient NA12878, comparing against the Genome In A Bottle VCF ("gold" standard) using hap.py, as a reminder we had:
- sensitivity (= recall): 71% for indels, 85% for SNPs
- precision (= PPV): 69% and 97% respectively
| Type  | TRUTH |    TP |   FN | QUERY |   FP |  UNK | FP.gt | FP.al |   Recall | Precision |
| INDEL |  4871 |  3461 | 1410 |  7048 | 1554 | 1987 |   193 |   346 | 0.710532 |  0.692946 |
| SNP   | 46032 | 39369 | 6663 | 44600 | 1186 | 4041 |   304 |    30 | 0.855253 |  0.970759 |
Statistics on whole genomes are much better (cf. precisionFDA challenge).
For exomes, one article [1] reports better stats on this patient with BWA and GATK, but they have fewer variants (we have almost a factor of 2!).
I suspect we are not working on the same capture regions (I could not retrieve their .bed)
| Exome | Type  |    TP |   FP |  FN | Sensitivity | Precision | F-Score |   FDR |
|     1 | SNV   | 23689 | 1397 | 613 |       0.975 |     0.944 |   0.959 | 0.057 |
|     2 | SNV   | 23946 |  865 | 356 |       0.985 |     0.965 |   0.975 | 0.036 |
|     1 | indel |  1254 |   72 |  75 |       0.944 |     0.946 |   0.945 | 0.054 |
|     2 | indel |  1309 |   10 |  20 |       0.985 |     0.992 |   0.989 | 0.008 |
To try to improve the statistics:
- The genome version, GRCh38 vs GRCh38.p13, changes almost nothing
- Disabling dbSNP changes strictly nothing for variant calling
I explored the false negatives:
- the vast majority are simply not seen (it is not a haploid/genotype issue)
- the distribution across chromosomes is fairly even, except on chromosome 6 ()
- the majority are in the 5' and 3' UTRs (according to BestRefSeq)
Conclusion: I plan to stop here for the variant-calling validation, for lack of time. We would need to dig into why some variants are not seen by GATK, but that is not the majority. In any case, I can justify a first analysis for the thesis.
Does that work for you?
[1]
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2928-9
Results here https://static-content.springer.com/esm/art%3A10.1186%2Fs12859-019-2928-9/MediaObjects/12859_2019_2928_MOESM8_ESM.pdf
***** DONE Comparison
CLOSED: [2023-03-04 Sat 11:14]
HGREF=/Work/Groups/bisonex/data-alexis-reference/genome/GRCh38_latest_genomic.fna ./result/bin/hap.py /Work/Groups/bisonex/NA12878/HG001_GRCh38_1_22_v4.2.1_benchmark_renamed.vcf.gz script/files/vcf/NA12878_NIST7035_vep_annot.vcf -f /Work/Groups/bisonex/NA12878/HG001_GRCh38_1_22_v4.2.1_benchmark.bed -o test
na1878.slurm
#+begin_src slurm
#!/bin/bash
#SBATCH -c 4
#SBATCH -p smp
#SBATCH --time=01:00:00
#SBATCH --mem=32G
module load nix/2.11.0
export HGREF=/Work/Groups/bisonex/data-alexis-reference/genome/GRCh38_latest_genomic.fna
dir=/Work/Groups/bisonex/data/NA12878/GRCh38
hap.py ${dir}/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz script/files/vcf/NA12878_NIST7035.vcf -f ${dir}/HG001_GRCh38_1_22_v4.2.1_benchmark.bed -o test
#+end_src
****** KILL far too many false negatives
CLOSED: [2023-02-17 Fri 19:37]
******* DONE Test 1: vep annot: far too many false negatives
CLOSED: [2023-02-06 lun. 13:40]
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL       276768       274    276494         1500       257        968     26     15       0.000990          0.516917        0.645333         0.001976                     NaN                     NaN                   1.483361                   6.129187
INDEL   PASS       276768       274    276494         1500       257        968     26     15       0.000990          0.516917        0.645333         0.001976                     NaN                     NaN                   1.483361                   6.129187
  SNP    ALL      1937706      1193   1936513         3338       106       2037     11      2       0.000616          0.918524        0.610246         0.001231                  2.0785                1.861183                   1.539064                   2.703663
  SNP   PASS      1937706      1193   1936513         3338       106       2037     11      2       0.000616          0.918524        0.610246         0.001231                  2.0785                1.861183                   1.539064                   2.703663
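As a sanity check on how these metrics relate to the raw counts, the following Python sketch recomputes them for the INDEL row above. It assumes the usual hap.py definitions (in particular QUERY.TP = QUERY.TOTAL - QUERY.FP - QUERY.UNK, so calls outside the confident regions do not count against precision). The helper name `happy_metrics` is only illustrative and is not part of hap.py.

```python
def happy_metrics(truth_tp, truth_fn, query_total, query_fp, query_unk):
    """Recompute hap.py-style summary metrics from raw counts (illustrative helper)."""
    # Assumption: UNK calls (outside the confident regions) are excluded from precision.
    query_tp = query_total - query_fp - query_unk
    recall = truth_tp / (truth_tp + truth_fn)
    precision = query_tp / (query_tp + query_fp)
    frac_na = query_unk / query_total
    f1 = 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision, "frac_na": frac_na, "f1": f1}


# Counts copied from the INDEL row of Test 1.
indel = happy_metrics(truth_tp=274, truth_fn=276494,
                      query_total=1500, query_fp=257, query_unk=968)
for name, value in indel.items():
    print(f"{name}: {value:.6f}")
```

Since F1 can never exceed twice the recall, a recall this close to zero drives F1 to almost zero whatever the precision. This points to a mismatch between the truth set and the regions actually sequenced rather than to the caller alone.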
******* KILL Test 3: index the reference VCF
CLOSED: [2023-02-06 lun. 17:19]
Same result with vcfeval, which needs the indexed version
******* DONE Test 3 without the vep filter: same
CLOSED: [2023-02-06 lun. 17:19]
Benchmarking Summary:
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL       276768     10535    266233        52169     10969      30616   3552   2122       0.038064          0.491069        0.586862         0.070652                     NaN                     NaN                   1.483361                   0.509510
INDEL   PASS       276768     10535    266233        52169     10969      30616   3552   2122       0.038064          0.491069        0.586862         0.070652                     NaN                     NaN                   1.483361                   0.509510
  SNP    ALL      1937706    105753   1831953       357652     74634     177259  35111    797       0.054576          0.586270        0.495619         0.099857                  2.0785                 1.42954                   1.539064                   0.324923
  SNP   PASS      1937706    105753   1831953       357652     74634     177259  35111    797       0.054576          0.586270        0.495619         0.099857                  2.0785                 1.42954                   1.539064                   0.324923
******* DONE Test 4 with vcfeval on vep_annot: same
CLOSED: [2023-02-06 lun. 17:18]
#+begin_src
#!/bin/bash
#SBATCH -c 4
#SBATCH -p smp
#SBATCH --time=01:00:00
#SBATCH --mem=32G
module load nix/2.11.0
export HGREF=/Work/Groups/bisonex/data-alexis-reference/genome/GRCh38_latest_genomic.fna
dir=/Work/Groups/bisonex/data/NA12878/GRCh38
rtg vcfeval -b ${dir}/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz -c files/vcf/NA12878_NIST7035_vep_annot.vcf.gz -o test-rtg -t /Work/Groups/bisonex/data/genome/GRCh38.p13/genomeRef.sdf
#+end_src
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    1.000               2984           2682       1840    3890296     0.5931       0.0008     0.0015
     None               2984           2682       1841    3890296     0.5930       0.0008     0.0015
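The vcfeval summary can be cross-checked in the same spirit. The sketch below parses the first row of the table above and recomputes the three ratios, assuming precision is computed from the call-side true positives and sensitivity from the baseline-side true positives. The function name `vcfeval_metrics` is hypothetical.

```python
def vcfeval_metrics(tp_baseline, tp_call, false_pos, false_neg):
    """Recompute precision, sensitivity and F-measure from vcfeval counts (illustrative)."""
    precision = tp_call / (tp_call + false_pos)
    sensitivity = tp_baseline / (tp_baseline + false_neg)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return precision, sensitivity, f_measure


# First row of the summary table, copied verbatim.
row = "1.000               2984           2682       1840    3890296     0.5931       0.0008     0.0015"
fields = row.split()
tp_baseline, tp_call, false_pos, false_neg = (int(x) for x in fields[1:5])

precision, sensitivity, f_measure = vcfeval_metrics(tp_baseline, tp_call, false_pos, false_neg)
print(f"precision={precision:.4f} sensitivity={sensitivity:.4f} f_measure={f_measure:.4f}")
```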
Example from the log
2023-02-06 13:50:14 Reference NC_000001.11 baseline contains 307854 variants.
2023-02-06 13:50:14 Reference NC_000001.11 calls contains 426 variants.
2023-02-06 13:50:15 Reference NC_000002.12 baseline contains 325877 variants.
2023-02-06 13:50:15 Reference NC_000002.12 calls contains 320 variants.
******* DONE Inspect a few variants by hand
CLOSED: [2023-02-07 Tue 22:01]
E.g.:
Missing: NC_000001.11    783006  .       A       G       50      PASS
There are both A -> G and C -> A at this position
***** DONE Restrict the reference genome
CLOSED: [2023-03-04 Sat 11:15]
****** Discussion with Alexis
The pipeline takes into account the 5' and 3' UTRs, canonical splice variants, plus those predicted by SPiP
The simplest option for now is to restrict to exons only
GENCODE (European version) vs RefSeq: [[https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-16-S8-S2][2015 article]] in favour of GENCODE, but Alexis recommends RefSeq
****** DONE -f, -R or -T?
CLOSED: [2023-02-25 Sat 19:47]
According to the docs: -f with the provided BED and -T to filter on exons
******* rtg tools
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    3.000               1015            910        206      32531     0.8154       0.0303     0.0583
     None               1015            910        206      32531     0.8154       0.0303     0.0583
***** KILL Exons only
CLOSED: [2023-04-02 Sun 17:11]
****** DONE BestRefSeq 
CLOSED: [2023-02-19 Sun 12:05]
In RefSeq, there are [[https://www.ncbi.nlm.nih.gov/genome/annotation_euk/process/][2 types of gene models]]
- based on RefSeq (NM_, NP_), curated
- based on Gnomon (XM_, XP_), predicted
RefSeq-based models take precedence (see link)
We therefore restrict to BestRefSeq
#+begin_src sh
wget https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.gff.gz
gunzip GRCh38_latest_genomic.gff.gz
#+end_src
We restrict to exons of coding transcripts (NM_)
#+begin_src
awk '/BestRefSeq\texon/ && /transcript_id=NM/ {print $1"\t"$4"\t"$5;}' GRCh38_latest_genomic.gff > exons.csv
#+end_src
Then intersect
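For reference, the same BestRefSeq filter can be sketched in Python. This version differs from the awk one-liner in one respect, which is an assumption about what the downstream tools expect: GFF3 coordinates are 1-based and inclusive whereas BED is 0-based and half-open, so the start is shifted by one. The function name and the sample records are illustrative only (toy lines, not copied from the real annotation).

```python
def bestrefseq_exons_to_bed(gff_lines):
    """Yield (chrom, start0, end) for BestRefSeq exons belonging to NM_ transcripts."""
    for line in gff_lines:
        if line.startswith("#"):
            continue  # skip GFF header and directive lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GFF3 record
        seqid, source, feature, start, end = cols[:5]
        if source == "BestRefSeq" and feature == "exon" and "transcript_id=NM_" in cols[8]:
            # GFF3 is 1-based inclusive, BED is 0-based half-open.
            yield seqid, int(start) - 1, int(end)


# Toy records for illustration only.
sample = [
    "##gff-version 3\n",
    "NC_000001.11\tBestRefSeq\texon\t65419\t65433\t.\t+\t.\tID=exon-1;transcript_id=NM_001005484.2\n",
    "NC_000001.11\tGnomon\texon\t100\t200\t.\t+\t.\tID=exon-2;transcript_id=XM_000000.1\n",
    "NC_000001.11\tBestRefSeq\texon\t11874\t12227\t.\t+\t.\tID=exon-3;transcript_id=NR_046018.2\n",
]
bed = list(bestrefseq_exons_to_bed(sample))
for chrom, start, end in bed:
    print(f"{chrom}\t{start}\t{end}")
```

Only the first record survives: the Gnomon model and the non-coding NR_ transcript are both dropped.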
******* DONE Tests after fixing the bug in chromosome names: precision ~ ok, recall very poor -> too many FN?
CLOSED: [2023-02-19 Sun 12:05]
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL         7230       321      6909         1500       290        888     27     18       0.044398          0.526144        0.592000         0.081887                     NaN                     NaN                    1.54733                   6.129187
INDEL   PASS         7230       321      6909         1500       290        888     27     18       0.044398          0.526144        0.592000         0.081887                     NaN                     NaN                    1.54733                   6.129187
  SNP    ALL        59052      1653     57399         3338       101       1583     12      2       0.027992          0.942450        0.474236         0.054370                2.433271                1.861183                    1.57523                   2.703663
  SNP   PASS        59052      1653     57399         3338       101       1583     12      2       0.027992          0.942450        0.474236         0.054370                2.433271                1.861183                    1.57523                   2.703663
******* DONE Check exons: we have the union of the exons of all transcripts...
CLOSED: [2023-02-19 Sun 12:05]
We would need a .bed from Illumina
We test the Twist for Illumina Exome 2.0 Plus BED File (hg19) from https://support.illumina.com/downloads/nextera-flex-for-enrichment-BED-files.html
Conversion to hg38 with UCSC
Chromosome renaming
#+begin_src
sed 's:^:s/chr:;s:chrMT:chrM:;s:\s:\\t/:;s:$:\\t/:' ../../genome/GRCh38.p13/chromosome_mapping.txt > pattern.sed
sed -i.bak -f pattern.sed illumina_exons.bed
bedtools intersect -a HG001_GRCh38_1_22_v4.2.1_benchmark.bed -b illumina_exons.bed > HG001_GRCh38_1_22_v4.2.1_benchmark_illumina_exons.bed
#+end_src
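The renaming step can also be sketched in Python, which may be easier to audit than the generated sed script. The mapping file format is an assumption inferred from the sed pattern: one line per chromosome, with the short name and the RefSeq accession separated by whitespace, and `MT` standing for the `chrM` used in the BED. The function names are hypothetical.

```python
def load_mapping(lines):
    """Build {UCSC-style name: RefSeq accession} from 'short<ws>accession' lines (assumed format)."""
    mapping = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        short, accession = parts[0], parts[1]
        # The BED uses chrM where the mapping uses MT.
        ucsc = "chrM" if short == "MT" else "chr" + short
        mapping[ucsc] = accession
    return mapping


def rename_bed(lines, mapping):
    """Replace the first BED column using the mapping; unmapped contigs are left untouched."""
    renamed = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        cols[0] = mapping.get(cols[0], cols[0])
        renamed.append("\t".join(cols))
    return renamed


# Toy input for illustration only.
mapping = load_mapping(["1\tNC_000001.11", "MT\tNC_012920.1"])
renamed = rename_bed(["chr1\t100\t200", "chrM\t5\t50", "chrUn_x\t1\t2"], mapping)
print("\n".join(renamed))
```

Matching on the whole first column avoids any risk of a short name such as `chr1` being rewritten inside a longer contig name.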
Intersection
****** KILL Illumina BED
CLOSED: [2023-02-24 Fri 23:44]
******* KILL Without the vep filter: truth and query were swapped... starting over
CLOSED: [2023-02-19 Sun 13:15]
******* KILL Without the vep filter: better but not great
CLOSED: [2023-02-24 Fri 23:44]
cd work/00/2c72e62400956c96fb101ac7af405e/
$ cat .command.out
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision
INDEL    ALL          922       490       432          942       439          0     32     56       0.531453          0.533970
INDEL   PASS          922       490       432          942       439          0     32     56