As expected, slightly better results due to alignement (most likely)
2HCVM6WPQZ6UQKODSOGXVOMUA3OREJJW3AUNVDF6VNOHCMU47NSQC
AAGTGATTTTA
14:41:05.530 INFO EventMap - > Cigar = 41M
14:41:05.530 INFO EventMap - >> Events = EventMap{NC_000021.9:14144627-14144627 [C*, T],}
14:41:05.530 INFO HaplotypeCallerGenotypingEngine - Genotyping event at 14144627 with alleles = [C*, T]
#+end_quote
NB: même en filtranrt sur le chromosome 21 avec julia, haplotypecaller parcourt tous les chromosomes
********* DONE --linked-de-bruijn-graph : idem
CLOSED: [2023-02-26 Sun 17:26]
********* DONE examine sortie --bamout : non présent
CLOSED: [2023-02-26 Sun 19:53]
#+begin_src
cd test/chr21-alexis
gatk --java-options "-Xmx3g" HaplotypeCaller \
--input /Work/Users/apraga/bisonex/script/files/bam/NA12878_chr21.bam \
--output debug_chr1.vcf.gz \
--reference /Work/Groups/bisonex/data/genome/GRCh38.p13/genomeRef.fna \
--dbsnp /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz \
--tmp-dir . \
--max-mnp-distance 2 -bamout debug.bam
#+end_src
Pas de reads
#+begin_quote
If you see nothing overlapping your region, the
n it might not have been flagged as active, or could have failed to assemble.
#+end_quote
********* 14582339: FN mais pas de reads...
********* 14583327 idem
********* 17512551 idem
********* 17567111: difference d'haplotype
********* 17567621 pas de reads
****** DONE Comparer avec sortie du variant calling vcf donné par GIAB
CLOSED: [2023-04-02 Sun 17:11]
******* DONE vcfeval
CLOSED: [2023-04-01 Sat 11:59] SCHEDULED: <2023-04-01 Sat>
#+begin_src sh
nextflow run workflows/test.nf -profile standard,helios -resume --test.vcfeval --test.giabVCF --outdir=test-giabVCF
cat test-giabVCF/vcfeval/output/summary.txt
#+end_src
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
1.000 44818 44818 2892 6087 0.9394 0.8804 0.9089
None 44819 44819 2896 6086 0.9393 0.8804 0.9089
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
1.000 44818 44818 2892 6087 0.9394 0.8804 0.9089
None 44819 44819 2896 6086 0.9393 0.8804 0.9089
******* DONE happy
CLOSED: [2023-04-01 Sat 11:56]
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.PrecisioN
INDEL PASS 4871 3678 1193 7036 1299 2011 208 217 0.755081 0.741493
SNP PASS 46032 41138 4894 47694 1622 4930 362 31 0.893683 0.962071
METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
0.285816 0.748225 NaN NaN 1.617499 2.524051
0.103367 0.926617 2.529552 2.412446 1.620686 1.688868
****** DONE Statistiques avec vcfeval
CLOSED: [2023-04-02 Sun 17:10] SCHEDULED: <2023-04-01 Sat>
***** DONE Résultats finaux
CLOSED: [2023-04-14 Fri 09:53]
Version GIAB avec hap.py + vcfeval:
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/compareVCF.nf -profile standard,helios -resume --outdir=compareNA12878-giab --test.compare=happy,vcfeval --test.query=giab --test.id=HG001
#+end_src
Notre version avec hap.py + vcfeval
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/compareVCF.nf -profile standard,helios -resume --outdir=compareNA12878 --test.vcfeval --test.query="out/NA12878_NIST/variantCalling/haplotypecaller/NA12878_NIST.vcf.gz" --test.happy
#+end_src
On concatene les csv avec une colonne indicant le type
# awk '{if (NR==1) {print "Data,Algorithm" $0} else {print "bisonx,happy,"$0}}' compareNA12878/happy/NA12878.summary.csv
compareNA12878/happy/NA12878.summary.csv
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL | 4871 | 3461 | 1410 | 7048 | 1554 | 1987 | 193 | 346 | 0.710532 | 0.692946 | 0.281924 | 0.701629 | | | 1.6174985978687606 | 3.0674091441969518 |
| INDEL | PASS | 4871 | 3461 | 1410 | 7048 | 1554 | 1987 | 193 | 346 | 0.710532 | 0.692946 | 0.281924 | 0.701629 | | | 1.6174985978687606 | 3.0674091441969518 |
| SNP | ALL | 46032 | 39367 | 6665 | 44599 | 1186 | 4042 | 304 | 30 | 0.855209 | 0.970757 | 0.09063 | 0.909327 | 2.529551552318896 | 2.402150701647346 | 1.6206857273037931 | 1.6273423688862698 |
| SNP | PASS | 46032 | 39367 | 6665 | 44599 | 1186 | 4042 | 304 | 30 | 0.855209 | 0.970757 | 0.09063 | 0.909327 | 2.529551552318896 | 2.402150701647346 | 1.6206857273037931 | 1.6273423688862698 |
compareNA12878/vcfeval/NA12878.summary.txt
| Threshold | True-pos-baseline | True-pos-call | False-pos | False-neg | Precision | Sensitivity | F-measure |
|-----------+-------------------+---------------+-----------+-----------+-----------+-------------+-----------|
| 3.000 | 42789 | 42416 | 2598 | 8080 | 0.9423 | 0.8412 | 0.8889 |
| None | 42798 | 42425 | 2616 | 8071 | 0.9419 | 0.8413 | 0.8888 |
Indel avec le plus petit seuil : zcat NA12878.non_snp_roc.tsv.gz
Attention à inverser precision et recall !
zcat NA12878.non_snp_roc.tsv.gz | tail -n 1 | awk '{print $7 $6}'
0.71390.7136
SNP avec le plus petit seuil : zcat NA12878.non_snp_roc.tsv.gz
Attention à inverser precision et recall !
$ zcat NA12878.snp_roc.tsv.gz | tail -n 1 | awk '{print $7 $6}'
0.85470.9727
compareNA12878-giab/vcfeval/NA12878.summary.txt
| Threshold | True-pos-baseline | True-pos-call | False-pos | False-neg | Precision | Sensitivity | F-measure |
| 1.000 | 44812 | 44812 | 2878 | 6057 | 0.9397 | 0.8809 | 0.9093 |
| None | 44813 | 44813 | 2882 | 6056 | 0.9396 | 0.8809 | 0.9093 |
SNP:
$ zcat NA12878.snp_roc.tsv.gz | tail -n 1 | awk '{print $7 $6}'
0.89370.9621
indel
$ zcat NA12878.non_snp_roc.tsv.gz | tail -n 1 | awk '{print $7 $6}'
0.75980.7445
compareNA12878-giab/happy/NA12878.summary.csv
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
|-------+--------+-------------+----------+----------+-------------+----------+-----------+-------+-------+---------------+------------------+----------------+-----------------+------------------------+------------------------+---------------------------+---------------------------|
| INDEL | ALL | 4871 | 3678 | 1193 | 7036 | 1299 | 2011 | 208 | 217 | 0.755081 | 0.741493 | 0.285816 | 0.748225 | | | 1.6174985978687606 | 2.5240506329113925 |
| INDEL | PASS | 4871 | 3678 | 1193 | 7036 | 1299 | 2011 | 208 | 217 | 0.7
AAGTGATTTTA
14:41:05.530 INFO EventMap - > Cigar = 41M
14:41:05.530 INFO EventMap - >> Events = EventMap{NC_000021.9:14144627-14144627 [C*, T],}
14:41:05.530 INFO HaplotypeCallerGenotypingEngine - Genotyping event at 14144627 with alleles = [C*, T]
#+end_quote
NB: même en filtranrt sur le chromosome 21 avec julia, haplotypecaller parcourt tous les chromosomes
********* DONE --linked-de-bruijn-graph : idem
CLOSED: [2023-02-26 Sun 17:26]
********* DONE examine sortie --bamout : non présent
CLOSED: [2023-02-26 Sun 19:53]
#+begin_src
cd test/chr21-alexis
gatk --java-options "-Xmx3g" HaplotypeCaller \
--input /Work/Users/apraga/bisonex/script/files/bam/NA12878_chr21.bam \
--output debug_chr1.vcf.gz \
--reference /Work/Groups/bisonex/data/genome/GRCh38.p13/genomeRef.fna \
--dbsnp /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz \
--tmp-dir . \
--max-mnp-distance 2 -bamout debug.bam
#+end_src
Pas de reads
#+begin_quote
If you see nothing overlapping your region, then it might not have been flagged as active, or could have failed to assemble.
#+end_quote
********* 14582339: FN mais pas de reads...
********* 14583327 idem
********* 17512551 idem
********* 17567111: difference d'haplotype
********* 17567621 pas de reads
****** DONE Comparer avec sortie du variant calling vcf donné par GIAB
CLOSED: [2023-04-02 Sun 17:11]
******* DONE vcfeval
CLOSED: [2023-04-01 Sat 11:59] SCHEDULED: <2023-04-01 Sat>
#+begin_src sh
nextflow run workflows/test.nf -profile standard,helios -resume --test.vcfeval --test.giabVCF --outdir=test-giabVCF
cat test-giabVCF/vcfeval/output/summary.txt
#+end_src
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
1.000 44818 44818 2892 6087 0.9394 0.8804 0.9089
None 44819 44819 2896 6086 0.9393 0.8804 0.9089
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
1.000 44818 44818 2892 6087 0.9394 0.8804 0.9089
None 44819 44819 2896 6086 0.9393 0.8804 0.9089
******* DONE happy
CLOSED: [2023-04-01 Sat 11:56]
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.PrecisioN
INDEL PASS 4871 3678 1193 7036 1299 2011 208 217 0.755081 0.741493
SNP PASS 46032 41138 4894 47694 1622 4930 362 31 0.893683 0.962071
METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
0.285816 0.748225 NaN NaN 1.617499 2.524051
0.103367 0.926617 2.529552 2.412446 1.620686 1.688868
****** DONE Statistiques avec vcfeval
CLOSED: [2023-04-02 Sun 17:10] SCHEDULED: <2023-04-01 Sat>
***** DONE Résultats finaux
CLOSED: [2023-04-14 Fri 09:53]
:PROPERTIES:
:ID: cd79a77c-a0b6-4bb1-9e08-fe08dc89e3aa
:END:
Version GIAB avec hap.py + vcfeval:
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/compareVCF.nf -profile standard,helios -resume --outdir=compareNA12878-giab --test.compare=happy,vcfeval --test.query=giab --test.id=HG001
#+end_src
Notre version avec hap.py + vcfeval
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/compareVCF.nf -profile standard,helios -resume --outdir=compareNA12878 --test.vcfeval --test.query="out/NA12878_NIST/variantCalling/haplotypecaller/NA12878_NIST.vcf.gz" --test.happy
#+end_src
On concatene les csv avec une colonne indicant le type
# awk '{if (NR==1) {print "Data,Algorithm" $0} else {print "bisonx,happy,"$0}}' compareNA12878/happy/NA12878.summary.csv
compareNA12878/happy/NA12878.summary.csv
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL | 4871 | 3461 | 1410 | 7048 | 1554 | 1987 | 193 | 346 | 0.710532 | 0.692946 | 0.281924 | 0.701629 | | | 1.6174985978687606 | 3.0674091441969518 |
| INDEL | PASS | 4871 | 3461 | 1410 | 7048 | 1554 | 1987 | 193 | 346 | 0.710532 | 0.692946 | 0.281924 | 0.701629 | | | 1.6174985978687606 | 3.0674091441969518 |
| SNP | ALL | 46032 | 39367 | 6665 | 44599 | 1186 | 4042 | 304 | 30 | 0.855209 | 0.970757 | 0.09063 | 0.909327 | 2.529551552318896 | 2.402150701647346 | 1.6206857273037931 | 1.6273423688862698 |
| SNP | PASS | 46032 | 39367 | 6665 | 44599 | 1186 | 4042 | 304 | 30 | 0.855209 | 0.970757 | 0.09063 | 0.909327 | 2.529551552318896 | 2.402150701647346 | 1.6206857273037931 | 1.6273423688862698 |
compareNA12878/vcfeval/NA12878.summary.txt
| Threshold | True-pos-baseline | True-pos-call | False-pos | False-neg | Precision | Sensitivity | F-measure |
|-----------+-------------------+---------------+-----------+-----------+-----------+-------------+-----------|
| 3.000 | 42789 | 42416 | 2598 | 8080 | 0.9423 | 0.8412 | 0.8889 |
| None | 42798 | 42425 | 2616 | 8071 | 0.9419 | 0.8413 | 0.8888 |
Indel avec le plus petit seuil : zcat NA12878.non_snp_roc.tsv.gz
Attention à inverser precision et recall !
zcat NA12878.non_snp_roc.tsv.gz | tail -n 1 | awk '{print $7 $6}'
0.71390.7136
SNP avec le plus petit seuil : zcat NA12878.non_snp_roc.tsv.gz
Attention à inverser precision et recall !
$ zcat NA12878.snp_roc.tsv.gz | tail -n 1 | awk '{print $7 $6}'
0.85470.9727
compareNA12878-giab/vcfeval/NA12878.summary.txt
| Threshold | True-pos-baseline | True-pos-call | False-pos | False-neg | Precision | Sensitivity | F-measure |
| 1.000 | 44812 | 44812 | 2878 | 6057 | 0.9397 | 0.8809 | 0.9093 |
| None | 44813 | 44813 | 2882 | 6056 | 0.9396 | 0.8809 | 0.9093 |
SNP:
$ zcat NA12878.snp_roc.tsv.gz | tail -n 1 | awk '{print $7 $6}'
0.89370.9621
indel
$ zcat NA12878.non_snp_roc.tsv.gz | tail -n 1 | awk '{print $7 $6}'
0.75980.7445
compareNA12878-giab/happy/NA12878.summary.csv
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
|-------+--------+-------------+----------+----------+-------------+----------+-----------+-------+-------+---------------+------------------+----------------+-----------------+------------------------+------------------------+---------------------------+---------------------------|
| INDEL | ALL | 4871 | 3678 | 1193 | 7036 | 1299 | 2011 | 208 | 217 | 0.755081 | 0.741493 | 0.285816 | 0.748225 | | | 1.6174985978687606 | 2.5240506329113925 |
| INDEL | PASS | 4871 | 3678 | 1193 | 7036 | 1299 | 2011 | 208 | 217 | 0.7
0.9015 0.9428
None 37248 36972 461 4062 0.9877 0.9017 0.9427
| Type |
Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL | 2909 | 2477 | 432 | 3229 | 207 | 519 | 52 | 50 | 0.851495 | 0.923616 | 0.160731 | 0.886091 | | | 1.4964850615114236 | 1.8339222614840989 |
| INDEL | PASS | 2909 | 2477 | 432 | 3229 | 207 | 519 | 52 | 50 | 0.851495 | 0.923616 | 0.160731 | 0.886091 | | | 1.4964850615114236 | 1.8339222614840989 |
| SNP | ALL | 38406 | 34793 | 3613 | 36935 | 275 | 1868 | 37 | 15 | 0.905926 | 0.992158 | 0.050575 | 0.947083 | 2.6247759222568168 | 2.5752854654538417 | 1.588953331534934 | 1.6192536889897844 |
| SNP | PASS | 38406 | 34793 | 3613 | 36935 | 275 | 1868 | 37 | 15 | 0.905926 | 0.992158 | 0.050575 | 0.947083 | 2.6247759222568168 | 2.5752854654538417 | 1.588953331534934 | 1.6192536889897844 |
**** DONE HG003 :hg003:hg38:
CLOSED: [2023-04-16 Sun 00:20]
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run main.nf -profile standard,helios --input /Work/Groups/bisonex/data/giab/GRCh38/HG003_{1,2}.fq.gz -bg
#+end_src
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/compareVCF.nf -profile standard,helios -resume --outdir=compareHG003 --test.id=HG003 --test.query=out/HG003_1/variantCalling/haplotypecaller/HG003_1.vcf.gz --test.compare=vcfeval,happy --test.capture=data/AgilentSureSelectv05_hg38.bed
#+end_src
vcfeval
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
5.000 36745 36473 486 3988 0.9869 0.9021 0.9426
None 36748 36476 495 3985 0.9866 0.9022 0.9425
$ zcat NA12878.snp_roc.tsv.gz | tail -n 1 | awk '{print $7 $6}'
happy
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 2731 2290 441 3092 208 577 62 53 0.838521 0.917296 0.186611 0.876141 NaN NaN 1.505145 1.888993
INDEL PASS 2731 2290 441 3092 208 577 62 53 0.838521 0.917296 0.186611 0.876141 NaN NaN 1.505145 1.888993
SNP ALL 37997 34481 3516 36861 306 2074 33 13 0.907466 0.991204 0.056265 0.947488 2.611269 2.565915 1.555780 1.621727
SNP PASS 37997 34481 3516 36861 306 2074 33 13 0.907466 0.991204 0.056265 0.947488 2.611269 2.5659
**** DONE HG004 :hg38:
CLOSED: [2023-04-16 Sun 00:20]
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run main.nf -profile standard,helios --input /Work/Groups/bisonex/data/giab/GRCh38/HG004_{1,2}.fq.gz -bg
#+end_src
vcfeval
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
6.000 36938 36678 421 4040 0.9887 0.9014 0.9430
None 36942 36682 432 4036 0.9884 0.9015 0.9429
happy
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 2787 2388 399 3183 195 580 53 38 0.856835 0.925086 0.182218 0.889654 NaN NaN 1.507834 1.848649
INDEL PASS 2787 2388 399 3183 195 580 53 38 0.856835 0.925086 0.182218 0.889654 NaN NaN 1.507834 1.848649
SNP ALL 38185 34560 3625 36921 254 2107 46 7 0.905067 0.992704 0.057068 0.946862 2.589175 2.553546 1.632595 1.653534
SNP PASS 38185 34560 3625 36921 254 2107 46 7 0.905067 0.992704 0.057068 0.946862 2.589175 2.553546 1.632595 1.653534
**** DONE Résumer résultats pour Paul + article :resultats:hg38:
CLOSED: [2023-04-06 Thu 21:41] SCHEDULED: <2023-04-02 Sun>
**** DONE Plot : ashkenazim trio :hg38:
CLOSED: [2023-04-18 Tue 21:27] SCHEDULED: <2023-04-16 Sun>
/Entered on/ [2023-04-16 Sun 17:29]
**** TODO NA12878 :na12878:T2T:
SCHEDULED: <2023-06-15 Thu>
*** KILL Platinum genome
CLOSED: [2023-06-14 Wed 22:37]
https://emea.illumina.com/platinumgenomes.html
*** TODO Séquencer NA12878
Discussion avec Paul : sous-traitant ne nous donnera pas les données, il faut commander l'ADN
** TODO Insilico :centogene:
*** TODO tous les variants centogène
**** DONE Extraire liste des SNVs
CLOSED: [2023-04-22 Sat 17:32] SCHEDULED: <2023-04-17 Mon>
***** DONE Corriger manquant à la main
CLOSED: [2023-04-22 Sat 17:31]
La sortie est sauvegardé dans git-annex : variants_success.csv
***** DONE Automatique
CLOSED: [2023-04-22 Sat 17:31]
**** DONE Convert SNVs : transcript -> génomique
CLOSED: [2023-06-03 Sat 17:16]
***** DONE Variant_recoder
CLOSED: [2023-04-26 Wed 21:21] SCHEDULED: <2023-04-22 Sat>
****** KILL Haskell: 160 manquant : recoded-success.csv
CLOSED: [2023-04-25 Tue 18:32]
La liste des variants a été générée en Haskel l et nettoyée à la main.
On générer une liste de variant pour variant_rec oder et on soumet tout d'un coup.
[[file:~/recherche/bisonex/parsevariants/app/Main.hs][parsevariant]]
#+begin_src haskell
recodeVariant = do
prepareVariantRecod er "variant_success.csv" "renamed.csv"
runVariantRecoder "renamed.csv" "recoded.json"
#+end_src
#+RESULTS:
: <interactive>:4:3-19: error:
: Variable not in scope: runVariantRecoder :: String -> String -> t
: gh
Problème : 160 n'ont pas pu être lu sur 820, probablement à cause du numéro mineur de transcrit
La sortie est sauvegardé dans git-annex : variants-recoded-raw.json.
****** KILL Julia
CLOSED: [2023-04-25 Tue 18:32]
On regénère la liste de variant et on passe à Julia pour préparer l'appel en parallèle à variant recoder
[[file:~/recherche/bisonex/parsevariants/variantRecoder.jl][variantRecoder.jl]]
#+begin_src julia
setupVariantRecoder(unique(init), n)
#+end_src
Puis
#+begin_src sh
parallel -a parallel-recoder.sh --jobs 10
#+end_src
On récupère les résultats
#+begin_src julia
(fails, success) = mergeVaria
0.9015 0.9428
None 37248 36972 461 4062 0.9877 0.9017 0.9427
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL | 2909 | 2477 | 432 | 3229 | 207 | 519 | 52 | 50 | 0.851495 | 0.923616 | 0.160731 | 0.886091 | | | 1.4964850615114236 | 1.8339222614840989 |
| INDEL | PASS | 2909 | 2477 | 432 | 3229 | 207 | 519 | 52 | 50 | 0.851495 | 0.923616 | 0.160731 | 0.886091 | | | 1.4964850615114236 | 1.8339222614840989 |
| SNP | ALL | 38406 | 34793 | 3613 | 36935 | 275 | 1868 | 37 | 15 | 0.905926 | 0.992158 | 0.050575 | 0.947083 | 2.6247759222568168 | 2.5752854654538417 | 1.588953331534934 | 1.6192536889897844 |
| SNP | PASS | 38406 | 34793 | 3613 | 36935 | 275 | 1868 | 37 | 15 | 0.905926 | 0.992158 | 0.050575 | 0.947083 | 2.6247759222568168 | 2.5752854654538417 | 1.588953331534934 | 1.6192536889897844 |
**** DONE HG003 :hg003:hg38:
CLOSED: [2023-04-16 Sun 00:20]
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run main.nf -profile standard,helios --input /Work/Groups/bisonex/data/giab/GRCh38/HG003_{1,2}.fq.gz -bg
#+end_src
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/compareVCF.nf -profile standard,helios -resume --outdir=compareHG003 --test.id=HG003 --test.query=out/HG003_1/variantCalling/haplotypecaller/HG003_1.vcf.gz --test.compare=vcfeval,happy --test.capture=data/AgilentSureSelectv05_hg38.bed
#+end_src
vcfeval
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
5.000 36745 36473 486 3988 0.9869 0.9021 0.9426
None 36748 36476 495 3985 0.9866 0.9022 0.9425
$ zcat NA12878.snp_roc.tsv.gz | tail -n 1 | awk '{print $7 $6}'
happy
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 2731 2290 441 3092 208 577 62 53 0.838521 0.917296 0.186611 0.876141 NaN NaN 1.505145 1.888993
INDEL PASS 2731 2290 441 3092 208 577 62 53 0.838521 0.917296 0.186611 0.876141 NaN NaN 1.505145 1.888993
SNP ALL 37997 34481 3516 36861 306 2074 33 13 0.907466 0.991204 0.056265 0.947488 2.611269 2.565915 1.555780 1.621727
SNP PASS 37997 34481 3516 36861 306 2074 33 13 0.907466 0.991204 0.056265 0.947488 2.611269 2.5659
**** DONE HG004 :hg38:
CLOSED: [2023-04-16 Sun 00:20]
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run main.nf -profile standard,helios --input /Work/Groups/bisonex/data/giab/GRCh38/HG004_{1,2}.fq.gz -bg
#+end_src
vcfeval
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
6.000 36938 36678 421 4040 0.9887 0.9014 0.9430
None 36942 36682 432 4036 0.9884 0.9015 0.9429
happy
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 2787 2388 399 3183 195 580 53 38 0.856835 0.925086 0.182218 0.889654 NaN NaN 1.507834 1.848649
INDEL PASS 2787 2388 399 3183 195 580 53 38 0.856835 0.925086 0.182218 0.889654 NaN NaN 1.507834 1.848649
SNP ALL 38185 34560 3625 36921 254 2107 46 7 0.905067 0.992704 0.057068 0.946862 2.589175 2.553546 1.632595 1.653534
SNP PASS 38185 34560 3625 36921 254 2107 46 7 0.905067 0.992704 0.057068 0.946862 2.589175 2.553546 1.632595 1.653534
**** DONE Résumer résultats pour Paul + article :resultats:hg38:
CLOSED: [2023-04-06 Thu 21:41] SCHEDULED: <2023-04-02 Sun>
**** DONE Plot : ashkenazim trio :hg38:
CLOSED: [2023-04-18 Tue 21:27] SCHEDULED: <2023-04-16 Sun>
/Entered on/ [2023-04-16 Sun 17:29]
**** TODO NA12878 :na12878:T2T:
SCHEDULED: <2023-06-15 Thu>
nextflow run workflows/compareVCF.nf -profile standard,helios --output=validation/na12878-grch38 --test.query=out/NA12878-GRCh38/callVariant/haplotypecaller/NA12878-GRCh38.vcf.gz --test.compare=happy,vcfeval -lib lib/ --test.id=HG001 --genome=GRCh38
***** GRCh38
****** vcfeval
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
3.000 45121 44771 2961 5402 0.9380 0.8931
0.9150
None 45130 44780 2981 5393 0.9376 0.8933
0.9149
***** happy
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL | 4871 | 3636 | 1235 | 7498 | 1629 | 2185 | 195 | 363 | 0.746459 | 0.693394 | 0.291411 | 0.718948 | | | 1.6174985978687606 | 3.083517699115044 |
| INDEL | PASS | 4871 | 3636 | 1235 | 7498 | 1629 | 2185 | 195 | 363 | 0.746459 | 0.693394 | 0.291411 | 0.718948 | | | 1.6174985978687606 | 3.083517699115044 |
| SNP | ALL | 46032 | 41994 | 4038 | 48715 | 1438 | 5282 | 299 | 48 | 0.912278 | 0.966892 | 0.108427 | 0.938791 | 2.529551552318896 | 2.375987252320909 | 1.6206857273037931 | 1.6987584524997228 |
| SNP | PASS | 46032 | 41994 | 4038 | 48715 | 1438 | 5282 | 299 | 48 | 0.912278 | 0.966892 | 0.108427 | 0.938791 | 2.529551552318896 | 2.375987252320909 | 1.6206857273037931 | 1.6987584524997228 |
A comparer avec la version patchée génome mais mal alignée
[[id:cd79a77c-a0b6-4bb1-9e08-fe08dc89e3aa][Résultats finaux]]
Un peu plus de vrais positifs, logique
*** KILL Platinum genome
CLOSED: [2023-06-14 Wed 22:37]
https://emea.illumina.com/platinumgenomes.html
*** TODO Séquencer NA12878
Discussion avec Paul : sous-traitant ne nous donnera pas les données, il faut commander l'ADN
** TODO Insilico :centogene:
*** TODO tous les variants centogène
**** DONE Extraire liste des SNVs
CLOSED: [2023-04-22 Sat 17:32] SCHEDULED: <2023-04-17 Mon>
***** DONE Corriger manquant à la main
CLOSED: [2023-04-22 Sat 17:31]
La sortie est sauvegardé dans git-annex : variants_success.csv
***** DONE Automatique
CLOSED: [2023-04-22 Sat 17:31]
**** DONE Convert SNVs : transcript -> génomique
CLOSED: [2023-06-03 Sat 17:16]
***** DONE Variant_recoder
CLOSED: [2023-04-26 Wed 21:21] SCHEDULED: <2023-04-22 Sat>
****** KILL Haskell: 160 manquant : recoded-success.csv
CLOSED: [2023-04-25 Tue 18:32]
La liste des variants a été générée en Haskel l et nettoyée à la main.
On générer une liste de variant pour variant_rec oder et on soumet tout d'un coup.
[[file:~/recherche/bisonex/parsevariants/app/Main.hs][parsevariant]]
#+begin_src haskell
recodeVariant = do
prepareVariantRecod er "variant_success.csv" "renamed.csv"
runVariantRecoder "renamed.csv" "recoded.json"
#+end_src
#+RESULTS:
: <interactive>:4:3-19: error:
: Variable not in scope: runVariantRecoder :: String -> String -> t
: gh
Problème : 160 n'ont pas pu être lu sur 820, probablement à cause du numéro mineur de transcrit
La sortie est sauvegardé dans git-annex : variants-recoded-raw.json.
****** KILL Julia
CLOSED: [2023-04-25 Tue 18:32]
On regénère la liste de variant et on passe à Julia pour préparer l'appel en parallèle à variant recoder
[[file:~/recherche/bisonex/parsevariants/variantRecoder.jl][variantRecoder.jl]]
#+begin_src julia
setupVariantRecoder(unique(init), n)
#+end_src
Puis
#+begin_src sh
parallel -a parallel-recoder.sh --jobs 10
#+end_src
On récupère les résultats
#+begin_src julia
(fails, success) = mergeVaria