B:BD[
3.58110] → [
2.29:8221]
uvais résultat
SCHEDULED: <2023-07-03 Mon>
****** TODO Beaucoup trop de FP : erreur de FILTER
SCHEDULED: <2023-07-07 Fri>
******* KILL Répartition des FP : cluster ?
CLOSED: [2023-07-08 Sat 10:43]
******* KILL Examiner quelques FP
CLOSED: [2023-07-08 Sat 10:43]
******* KILL Méthodologie du pangenome
CLOSED: [2023-07-08 Sat 10:43]
******* DONE [#A] Certains FP sont des vrais FP: erreur de flag et non d'happy
CLOSED: [2023-07-08 Sat 10:43] SCHEDULED: <2023-07-07 Fri>
******** Ex: chr1:408793
G C 600.64 . BS=408793;Regions=CONF,TS_contained GT:BD:BK:BI:BVT:BLT:QQ ./.:.:.:.:NOCALL:nocall:. 0/1:FP:.:tv:SNP:het:600.64
CLOSED: [2023-07-07 Fri 20:03]
********* DONE Vérifier que l'on est bien dans la zone "high confidence" définie dans le bed
oui:
chr1 408514 414397 . 500 +
********* DONE Vérifier que le variant est bien dans le truth et query
CLOSED: [2023-07-07 Fri 20:14]
truth
chr1 408793 . G C 50 MismatchedRefAllele AttemptedAlleles=C*->G;AttemptedLocus=chr1:408793-408793;SwappedAlleles;callable=CS_HiSeqPE300xGATK_callable,CS_10XLRGATK_callable,CS_CCS15kb_20kbGATK4_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable;callsetnames=HiSeqPE300xGATK,CCS15kb_20kbGATK4,CGnormal,HiSeqPE300xfreebayes,CCS15kb_20kbDV,10XLRGATK;callsets=6;datasetnames=HiSeqPE300x,CCS15kb_20kb,CGnormal,10XChromiumLR;datasets=4;datasetsmissingcall=IonExome,SolidSE75bp;filt=CS_CCS15kb_20kbDV_filt,CS_CCS15kb_20kbGATK4_filt;platformnames=Illumina,PacBio,CG,10X;platforms=4 GT:AD:ADALL:DP:GQ 1/0:153,159:141,132:689:99
query
chr1 408793 . G C 600.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.390;DP=64;ExcessHet=0.0000;FS=0.961;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=9.53;ReadPosRankSum=-3.146;SOR=0.906 GT:AD:DP:GQ:PL 0/1:33,30:63:99:608,0,798
********* DONE Tester happy sur ce variant: non retrouvé
CLOSED: [2023-07-08 Sat 10:42]
Il faut au moins 2 variant + les index (sinon "no chromosome in common")
Donc on augmente la taille de la région
#+begin_src sh :dir ~/code/bisonex/out
bcftools view HG001_GRCh38_1_22_v4_lifted_merged.vcf.gz -o lifted_test.vcf.gz chr1:408793-408900
bcftools view HG001-SRX11061486_SRR14724513-T2T.vcf.gz -o run_test.vcf.gz chr1:408793-408900
bcftools index lifted_test.vcf.gz
bcftools index run_test.vcf.gz
#+end_src
#+begin_src sh
hap.py lifted_test.vcf.gz run_test.vcf.gz --reference ../data/chm13v2.0.fa -o test
#+end_src
On trouve bien les 2 variants. L'indel est retrouvé mais pas le SNP
chr1 408793 . G C 600.64 . BS=408793 GT:BD:BK:BI:BVT:BLT:QQ ./.:.:.:.:NOCALL:nocall:. 0/1:FP:.:tv:SNP:het:600.64
chr1 408874 . CGAA C 1476.6 . BS=408874 GT:BD:BK:BI:BVT:BLT:QQ 0/1:TP:gm:d1_5:INDEL:het:1476.6 0/1:TP:gm:d1_5:INDEL:het:1476.6
********* DONE Changer flag : mismatch -> PASS: OUI
CLOSED: [2023-07-08 Sat 10:42]
********* DONE Changer haplotype 1/0 -> 0/1 truth: idem
CLOSED: [2023-07-08 Sat 10:35]
********* DONE Tester avec vcfeval : idemnt
CLOSED: [2023-07-08 Sat 10:43]
****** TODO 2x moins de variants
****** DONE Comparer hg38 et T2T: 2x moinsr de variants, trop de FP et FN
CLOSED: [2023-07-07 Fri 18:38]
T2T
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL | 413 | 246 | 167 | 751 | 289 | 215 | 2 | 93 | 0.595642 | 0.460821 | 0.286285 | 0.519629 | NaN | NaN | 2.428571 | 2.465116 |
| INDEL | PASS | 413 | 246 | 167 | 751 | 289 | 215 | 2 | 93 | 0.595642 | 0.460821 | 0.286285 | 0.519629 | NaN | NaN | 2.428571 | 2.465116 |
| SNP | ALL | 11236 | 10985 | 251 | 23597 | 9771 | 2841 | 26 | 58 | 0.977661 | 0.529245 | 0.120397 | 0.686734 | 3.11461 | 2.85705 | 3.640644 | 2.114633 |
| SNP | PASS | 11236 | 10985 | 251 | 23597 | 9771 | 2841 | 26 | 58 | 0.977661 | 0.529245 | 0.120397 | 0.686734 | 3.11461 | 2.85705 | 3.640644 | 2.114633 |
Hg38
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL | 549 | 489 | 60 | 899 | 64 | 340 | 8 | 17 | 0.890710 | 0.885510 | 0.378198 | 0.888102 | NaN | NaN | 1.860963 | 2.247273 |
| INDEL | PASS | 549 | 489 | 60 | 899 | 64 | 340 | 8 | 17 | 0.890710 | 0.885510 | 0.378198 | 0.888102 | NaN | NaN | 1.860963 | 2.247273 |
| SNP | ALL | 21973 | 21462 | 511 | 26285 | 563 | 4263 | 68 | 16 | 0.976744 | 0.974435 | 0.162184 | 0.975588 | 3.00711 | 2.784686 | 1.591810 | 1.816145 |
| SNP | PASS | 21973 | 21462 | 511 | 26285 | 563 | 4263 | 68 | 16 | 0.976744 | 0.974435 | 0.162184 | 0.975588 | 3.00711 | 2.784686 | 1.591810 | 1.816145 |
******* Résumé
T2T
| Type | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision |
| INDEL | 413 | 246 | 167 | 751 | 289 | 215 | 2 | 93 | 0.595642 | 0.460821 |
| SNP | 11236 | 10985 | 251 | 23597 | 9771 | 2841 | 26 | 58 | 0.977661 | 0.529245 |
Hg38
| Type | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision |
| INDEL | 549 | 489 | 60 | 899 | 64 | 340 | 8 | 17 | 0.890710 | 0.885510 |
| SNP | 21973 | 21462 | 511 | 26285 | 563 | 4263 | 68 | 16 | 0.976744 | 0.974435 |
****** DONE Interesection des bed: similaire
CLOSED: [2023-07-04 Tue 23:11]
HG38
#+begin_src sh
bedtools intersect -a capture/Agilent_SureSelect_All_Exons_v7_hg38_Regions.bed -b /Work/Groups/bisonex/data/giab/GRCh38/HG001_GRCh38_1_22_v4.2.1_benchmark.bed | wc -l
#+end_src
204280
T2T
#+begin_src sh
bedtools intersect -a /Work/Groups/bisonex/data/giab/T2T/Agilent_SureSelect_All_Exons_v7_hg38_Regions_hg38_T2T.bed -b /Work/Groups/bisonex/data/giab/T2T/HG001_GRCh38_1_22_v4.2.1_benchmark_hg38_T2T.bed | wc -l
#+end_src
204021
****** DONE Vérifier la ligne de commande
CLOSED: [2023-07-04 Tue 23:38]
#+begin_src sh
hap.py \
HG001_GRCh38_1_22_v4_lifted_merged.vcf.gz \
HG001-SRX11061486_SRR14724513-T2T.vcf.gz \
\
--reference chm13v2.0.fa \
--threads 6 \
\
-T Agilent_SureSelect_All_Exons_v7_hg38_Regions_hg38_T2T.bed \
--false-positives HG001_GRCh38_1_22_v4.2.1_benchmark_hg38_T2T.bed \
\
-o HG001
#+end_src
***** KILL Mail Yannis
CLOSED: [2023-07-08 Sat 10:44]
***** TODO Corriger FILTER
***** DONE Mail GIAB pour version T2T
CLOSED: [2023-07-07 Fri 18:37]
**** DONE NA12878 :na12878:hg38:
CLOSED: [2023-06-30 Fri 22:30]
***** D
uvais résultat
SCHEDULED: <2023-07-03 Mon>
****** DONE 2x moins de variants, beaucoup trop de FP : erreur de FILTER
CLOSED: [2023-07-08 Sat 11:17] SCHEDULED: <2023-07-07 Fri>
******* DONE [#A] Certains FP sont des vrais FP: erreur de flag et non d'happy
CLOSED: [2023-07-08 Sat 10:43] SCHEDULED: <2023-07-07 Fri>
******** Ex: chr1:408793
G C 600.64 . BS=408793;Regions=CONF,TS_contained GT:BD:BK:BI:BVT:BLT:QQ ./.:.:.:.:NOCALL:nocall:. 0/1:FP:.:tv:SNP:het:600.64
CLOSED: [2023-07-07 Fri 20:03]
********* DONE Vérifier que l'on est bien dans la zone "high confidence" définie dans le bed
oui:
chr1 408514 414397 . 500 +
********* DONE Vérifier que le variant est bien dans le truth et query
CLOSED: [2023-07-07 Fri 20:14]
truth
chr1 408793 . G C 50 MismatchedRefAllele AttemptedAlleles=C*->G;AttemptedLocus=chr1:408793-408793;SwappedAlleles;callable=CS_HiSeqPE300xGATK_callable,CS_10XLRGATK_callable,CS_CCS15kb_20kbGATK4_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable;callsetnames=HiSeqPE300xGATK,CCS15kb_20kbGATK4,CGnormal,HiSeqPE300xfreebayes,CCS15kb_20kbDV,10XLRGATK;callsets=6;datasetnames=HiSeqPE300x,CCS15kb_20kb,CGnormal,10XChromiumLR;datasets=4;datasetsmissingcall=IonExome,SolidSE75bp;filt=CS_CCS15kb_20kbDV_filt,CS_CCS15kb_20kbGATK4_filt;platformnames=Illumina,PacBio,CG,10X;platforms=4 GT:AD:ADALL:DP:GQ 1/0:153,159:141,132:689:99
query
chr1 408793 . G C 600.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.390;DP=64;ExcessHet=0.0000;FS=0.961;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=9.53;ReadPosRankSum=-3.146;SOR=0.906 GT:AD:DP:GQ:PL 0/1:33,30:63:99:608,0,798
********* DONE Tester happy sur ce variant: non retrouvé
CLOSED: [2023-07-08 Sat 10:42]
Il faut au moins 2 variant + les index (sinon "no chromosome in common")
Donc on augmente la taille de la région
#+begin_src sh :dir ~/code/bisonex/out
bcftools view HG001_GRCh38_1_22_v4_lifted_merged.vcf.gz -o lifted_test.vcf.gz chr1:408793-408900
bcftools view HG001-SRX11061486_SRR14724513-T2T.vcf.gz -o run_test.vcf.gz chr1:408793-408900
bcftools index lifted_test.vcf.gz
bcftools index run_test.vcf.gz
#+end_src
#+begin_src sh
hap.py lifted_test.vcf.gz run_test.vcf.gz --reference ../data/chm13v2.0.fa -o test
#+end_src
On trouve bien les 2 variants. L'indel est retrouvé mais pas le SNP
chr1 408793 . G C 600.64 . BS=408793 GT:BD:BK:BI:BVT:BLT:QQ ./.:.:.:.:NOCALL:nocall:. 0/1:FP:.:tv:SNP:het:600.64
chr1 408874 . CGAA C 1476.6 . BS=408874 GT:BD:BK:BI:BVT:BLT:QQ 0/1:TP:gm:d1_5:INDEL:het:1476.6 0/1:TP:gm:d1_5:INDEL:het:1476.6
********* DONE Changer flag : mismatch -> PASS: OUI
CLOSED: [2023-07-08 Sat 10:42]
********* DONE Changer haplotype 1/0 -> 0/1 truth: idem
CLOSED: [2023-07-08 Sat 10:35]
********* DONE Tester avec vcfeval : idemnt
CLOSED: [2023-07-08 Sat 10:43]
******* DONE Comparer hg38 et T2T: 2x moinsr de variants, trop de FP et FN
CLOSED: [2023-07-07 Fri 18:38]
T2T
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL | 413 | 246 | 167 | 751 | 289 | 215 | 2 | 93 | 0.595642 | 0.460821 | 0.286285 | 0.519629 | NaN | NaN | 2.428571 | 2.465116 |
| INDEL | PASS | 413 | 246 | 167 | 751 | 289 | 215 | 2 | 93 | 0.595642 | 0.460821 | 0.286285 | 0.519629 | NaN | NaN | 2.428571 | 2.465116 |
| SNP | ALL | 11236 | 10985 | 251 | 23597 | 9771 | 2841 | 26 | 58 | 0.977661 | 0.529245 | 0.120397 | 0.686734 | 3.11461 | 2.85705 | 3.640644 | 2.114633 |
| SNP | PASS | 11236 | 10985 | 251 | 23597 | 9771 | 2841 | 26 | 58 | 0.977661 | 0.529245 | 0.120397 | 0.686734 | 3.11461 | 2.85705 | 3.640644 | 2.114633 |
Hg38
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL | 549 | 489 | 60 | 899 | 64 | 340 | 8 | 17 | 0.890710 | 0.885510 | 0.378198 | 0.888102 | NaN | NaN | 1.860963 | 2.247273 |
| INDEL | PASS | 549 | 489 | 60 | 899 | 64 | 340 | 8 | 17 | 0.890710 | 0.885510 | 0.378198 | 0.888102 | NaN | NaN | 1.860963 | 2.247273 |
| SNP | ALL | 21973 | 21462 | 511 | 26285 | 563 | 4263 | 68 | 16 | 0.976744 | 0.974435 | 0.162184 | 0.975588 | 3.00711 | 2.784686 | 1.591810 | 1.816145 |
| SNP | PASS | 21973 | 21462 | 511 | 26285 | 563 | 4263 | 68 | 16 | 0.976744 | 0.974435 | 0.162184 | 0.975588 | 3.00711 | 2.784686 | 1.591810 | 1.816145 |
******** Résumé
T2T
| Type | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision |
| INDEL | 413 | 246 | 167 | 751 | 289 | 215 | 2 | 93 | 0.595642 | 0.460821 |
| SNP | 11236 | 10985 | 251 | 23597 | 9771 | 2841 | 26 | 58 | 0.977661 | 0.529245 |
Hg38
| Type | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision |
| INDEL | 549 | 489 | 60 | 899 | 64 | 340 | 8 | 17 | 0.890710 | 0.885510 |
| SNP | 21973 | 21462 | 511 | 26285 | 563 | 4263 | 68 | 16 | 0.976744 | 0.974435 |
****** DONE Interesection des bed: similaire
CLOSED: [2023-07-04 Tue 23:11]
HG38
#+begin_src sh
bedtools intersect -a capture/Agilent_SureSelect_All_Exons_v7_hg38_Regions.bed -b /Work/Groups/bisonex/data/giab/GRCh38/HG001_GRCh38_1_22_v4.2.1_benchmark.bed | wc -l
#+end_src
204280
T2T
#+begin_src sh
bedtools intersect -a /Work/Groups/bisonex/data/giab/T2T/Agilent_SureSelect_All_Exons_v7_hg38_Regions_hg38_T2T.bed -b /Work/Groups/bisonex/data/giab/T2T/HG001_GRCh38_1_22_v4.2.1_benchmark_hg38_T2T.bed | wc -l
#+end_src
204021
****** DONE Vérifier la ligne de commande
CLOSED: [2023-07-04 Tue 23:38]
#+begin_src sh
hap.py \
HG001_GRCh38_1_22_v4_lifted_merged.vcf.gz \
HG001-SRX11061486_SRR14724513-T2T.vcf.gz \
\
--reference chm13v2.0.fa \
--threads 6 \
\
-T Agilent_SureSelect_All_Exons_v7_hg38_Regions_hg38_T2T.bed \
--false-positives HG001_GRCh38_1_22_v4.2.1_benchmark_hg38_T2T.bed \
\
-o HG001
#+end_src
****** DONE Corriger FILTER : mieux mais toujours trop de négatifs. 3/4 SNP retrouvés
CLOSED: [2023-07-08 Sat 15:19] SCHEDULED: <2023-07-08 Sat>
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 413 246 167 751 289 215 2 98 0.595642 0.460821 0.286285 0.519629 NaN NaN 2.428571 2.465116
INDEL PASS 413 246 167 751 289 215 2 98 0.595642 0.460821 0.286285 0.519629 NaN NaN 2.428571 2.465116
SNP ALL 15883 15479 404 23597 5277 2841 46 44 0.974564 0.745760 0.120397 0.844947 3.017198 2.85705 5.560099 2.114633
SNP PASS 15883 15479 404 23597 5277 2841 46 44 0.974564 0.745760 0.120397 0.844947 3.017198 2.85705 5.560099 2.114633
******* DONE Vérifier qu'il ne reste plus de filtre autre que PASS
CLOSED: [2023-07-08 Sat 15:19]
#+begin_src
$ zgrep -c 'PASS' HG001_GRCh38_1_22_v4_lifted_merged.vcf.gz
3730505
$ zgrep -c '^chr' HG001_GRCh38_1_22_v4_lifted_merged.vcf.gz
3730506
#+end_src
****** TODO 1/4 SNP manquant ?
SCHEDULED: <2023-07-08 Sat>
******* DONE Regarder avec Julia si ce sont vraiment des FP: 61/5277 qui ne le sont pas
CLOSED: [2023-07-09 Sun 12:09]
4155
******* TODO Tester un FP
******* TODO Vérifier quelques variants sur IGV
******* TODO Répartition des FP : cluster ?
******* TODO Examiner quelques FP
******* TODO Méthodologie du pangenome
***** KILL Mail Yannis
CLOSED: [2023-07-08 Sat 10:44]
***** DONE Mail GIAB pour version T2T
CLOSED: [2023-07-07 Fri 18:37]
**** DONE NA12878 :na12878:hg38:
CLOSED: [2023-06-30 Fri 22:30]
***** D