OCOSI7GGMNSCDI2BSQLDOXZX5PS2FY4DHPAPKZ3RFAWZAOTPELJQC
7LNK2TK2ELYWDUZ3FVQM2VRSOACROZM4A5KZHVURYSFECLIMQSZQC
XZOQO7A5LE72EEGJRR6USUXL4S4L5DKQAHPJMUPXHF2MKLK4AK5AC
FXA3ZBV64FML7W47IPHTAJFJHN3J3XHVHFVNYED47XFSBIGMBKRQC
7NDZXAGUWT2JOXEDHCP7OGVNLC7XUNJ2MNEBKDJKSJ2ZFL5HF4YAC
2TS7XBTNIT3BJUMSFVQKB6XK6GSSGMSZGDSRRZKMAQBMAX6MSCJAC
RUR2HQCDDUOIP7GD2GZV2UPCUF5QS6COVIVDKVIGZF22TVRFBKPAC
fonctionner le filtre technical variant
CLOSED: [2023-08-03 Thu 14:24] SCHEDULED: <2023-08-03 Wed 10:30>
*** TODO Générer la base de donnée spip
SCHEDULED: <2023-08-03 Thu 11:30>
**** TODO Vérifier la génération du transcriptome en hg38: checksum différent
SCHEDULED: <2023-08-03 Thu 17:00>
- [X] Nettoyer et vérifier sur hg38 avec ediff les RData : différent
- [X] Sinon, ne pas nettoyer et générer: idem
**** TODO Récupérer ncbi RefSeq curated
.txt sur UCSC mais pas en T2T: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/
Format: https://genome.ucsc.edu/cgi-bin/hgTables?hgsid=1173061381_UepaHnvaOKFZKMOV4o7DtcNUHGVa&hgta_doSchemaDb=chlSab2&hgta_doSchemaTable=ncbiRefSeqCurated
Avec "*", les champs conservés (a priori)
| 1 | bin | Indexing field to speed chromosome range queries. |
| 2 | *name | Name of gene (usually transcript_id from GTF) |
| 3 | *chrom | Reference sequence chromosome or scaffold |
| 4 | *strand | + or - for strand |
| 5 | *txStart | Transcription start position (or end position for minus strand item) |
| 6 | *txEnd | Transcription end position (or start position for minus strand item) |
| 7 | *cdsStart | Coding region start (or end position for minus strand item) |
| 8 | *cdsEnd | Coding region end (or start position for minus strand item) |
| 9 | *exonCount | Number of exons |
| 10 | *exonStarts | Exon start positions (or end positions for minus strand item) |
| 11 | *exonEnds | Exon end positions (or start positions for minus strand item) |
| 12 | *score | score |
| 13 | *name2 | Alternate name (e.g. gene_id from GTF) |
| 14 | cdsStartStat | Status of CDS start annotation (none, unknown, incomplete, or complete) |
| 15 | cdsEndStat | Status of CDS end annotation (none, unknown, incomplete, or complete) |
| 16 | exonFrames | Exon frame {0,1,2}, or -1 if no frame for exon |
En T2T, seulement au format bigBed : https://hgdownload.soe.ucsc.edu/gbdb/hs1/ncbiRefSeq/
Il y a un exécutable pour convertir en bed : http://hgdownload.soe.ucsc.edu/admin/exe/
Sous gentoo, il faut instaler mit-krb5 (pour libkrb5)
#+begin_src
./bigBedToBed ncbiRefSeqCurated.bb ncbiRefSeqCurated.bed
#+end_src
Mais format différent...
Exemple
chr1 7505 13582 NR_182076.1 0 - 13582 13582 0 2 5477,138, 0,5939, LOC127239154 none none -1,-1, NR_182076.1 LOC127239154
Selon https://genome.ucsc.edu/cgi-bin/hgc?hgsid=1668458090_qvlzLgWgwTr5pQ0m7N3VuvgVlMSo&db=hub_3671779_hs1&c=chr1&l=7889&r=13966&o=7505&t=13582&g=hub_3671779_ncbiRefSeq&i=NR_182076.1
On a les dernière colonnes
- status CDS start annotation (none)
- status CDS end annotation (none)
- exome frame -1,-1
- transcript type
- identifiant gène
- nom du gene
- type du gene
Plus simple : regarder le bigBed pour hg19 ?
https://hgdownload.soe.ucsc.edu/gbdb/hg19/ncbiRefSeq/
On n'a pas le refsefcurated mais just other (idem en hg38)
*** PROJ Porter Spip
** PROJ Indicateurs qualité
*** Idée
Raredisease:
- FastQC : nombreuses statistiques. Non disponible Nix
- Mosdepth : calcule la profondeur (2x plus rapide que samtools depth). Nix
- MultiQC : fusionne juste les résultats des analyses. Non disponible nix
- Picard's CollectMutipleMetrics, CollectHsMetrics, and CollectWgsMetrics
- Qualimap : alternative fastqc ? Non disponible nix
- Sentieon's WgsMetricsAlgo : propriétaire
- TIDDIT's cov : TIDIT = remaninement chromosomique
Sarek:
- alignment statistics : samtools stats, mosdepth
- QC : MultiQC
MultiQC : non disponible Nix
** PROJ vérifier si normalisation
** PROJ Rajouter vérification hgvs
** DONE Exécution
CLOSED: [2022-09-13 Tue 21:37]
*** KILL test Bionix
*** KILL Implémenter execution avec Nix ?
Voir https://academic.oup.com/gigascience/article/9/11/giaa121/5987272?login=false
pour un exemple.
Probablement plus simple d’utiliser Nix pour gestion de l’environnement et snakemake pour l’exécution
Pas d’accès internet depuis le cluster
*** DONE nextflow
CLOSED: [2022-09-13 Tue 21:37]
**** TODO Bug scheduler SGE
Le job se fait tuer car l'utilisateur n'est pas passé correctement à nextflow
***** DONE Forcer l'utilisateur à l'exécution
CLOSED: [2023-04-01 Sat 17:57]
NXF_OPTS=-D"user.name=alex"
***** DONE Vérifier si le problème persiste avec 22.10.6
CLOSED: [2023-04-01 Sat 18:38] SCHEDULED: <2023-04-01 Sat>
oui
***** KILL Packager l'utilisateur dans le programme ?
Mauvaise idée..
** TODO Preprocessing avec nextflow
*** TODO Map to reference
**** TODO Sample ID dans header
/Work/Users/apraga/bisonex/out/63003856_S135/preprocessing/baserecalibrator
*** DONE Mark duplicate
CLOSED: [2022-10-09 Sun 22:30]
*** DONE Recalibrate base quality score
CLOSED: [2022-10-09 Sun 22:30]
** DONE Variant calling avec Nextflow
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Haplotype caller
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter variants
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter common snp not clinvar path
CLOSED: [2022-11-07 Mon 23:00]
Voir [[*common dbSNP not clinvar patho][common dbSNP not clinvar patho]]
*** DONE Filter variant only in consensual sequence
CLOSED: [2022-11-08 Tue 22:23]
*** DONE Filter technical variants
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Utilise AVX pour accélerer l'exécution
CLOSED: [2023-04-29 Sat 15:46]
Sans cela, on a l'avertissement
#+begin_quote
17:28:00.720 INFO PairHMM - OpenMP multi-threaded AVX-accelerated native PairHMM implementation is not supported
17:28:00.721 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/nix/store/cy9ckxqwrkifx7wf02hm4ww1p6lnbxg9-gatk-4.2.4.1/bin/gatk-package-4.2.4.1-local.jar!/com/intel/gkl/native/libgkl_utils.so
17:28:00.733 WARN NativeLibraryLoader - Unable to load libgkl_utils.so from native/libgkl_utils.so (/Work/Users/apraga/bisonex/out/NA12878_NIST7035/preprocessing/applybqsr/libgkl_utils821485189051585397.so: libgomp.so.1: cannot open shared object file: No such file or directory)
17:28:00.733 WARN IntelPairHmm - Intel GKL Utils not loaded
17:28:00.733 WARN PairHMM - ***WARNING: Machine does not have the AVX instruction set support needed for the accelerated AVX PairHmm. Falling back to the MUCH slower LOGLESS_CACHING implementation!
17:28:00.763 INFO ProgressMeter - Starting traversal
#+end_quote
libgomp.so est fourni par gcc donc il faut charger le module
module load gcc@11.3.0/gcc-12.1.0
** KILL Utiliser subworkflow
CLOSED: [2023-04-02 Sun 18:08]
Notre version permet d'être plus souple
*** KILL Alignement
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
*** KILL Vep
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
vcf_annotate_ensemblvep
** TODO Annotation avec nextflow :annotation:
*** KILL VEP : --gene-phenotype ?
CLOSED: [2023-04-18 mar. 18:32]
Vu avec alexis : bases de données non à jour
https://www.ensembl.org/info/genome/variation/phenotype/sources_phenotype_documentation.html
*** DONE plugin VEP
CLOSED: [2023-04-18 mar. 18:32]
Cloner dépôt git avec plugin
Puis utiliser --dir_plugins
*** HOLD Utiliser code d’Alexis
*** TODO Nouvelle version avec VEP
Example avec --custom
https://www.ensembl.org/info/docs/tools/vep/script/vep_custom.html
**** DONE Ajout spliceAI
CLOSED: [2023-05-18 Thu 11:02] SCHEDULED: <2023-04-30 Sun>
plugin VEP
***** DONE Télécharger les données
CLOSED: [2023-05-11 Thu 19:01]
Difficile d'automatiser, le lien est temporaire...
***** DONE PLugin
CLOSED: [2023-05-11 Thu 20:16]
***** DONE Séparer score en plusieurs colonnes
CLOSED: [2023-05-11 Thu 20:16]
Test avec ce fichier pour avoir une ligne avec annotation et une ligne sans
#CHROM POS ID REF ALT
1 9091 . A C
1 69091 . A C
et
#+begin_src sh
rm -f postvep.tsv* && vep -i te
fonctionner le filtre technical variant
CLOSED: [2023-08-03 Thu 14:24] SCHEDULED: <2023-08-03 Wed 10:30>
*** TODO Générer la base de donnée spip
SCHEDULED: <2023-08-03 Thu 11:30>
**** TODO Vérifier la génération du transcriptome en hg38: checksum différent
SCHEDULED: <2023-08-03 Thu 17:00>
- [X] Nettoyer et vérifier sur hg38 avec ediff les RData : différent
- [X] Sinon, ne pas nettoyer et générer: idem
**** TODO Récupérer ncbi RefSeq curated
.txt sur UCSC mais pas en T2T: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/
Format: https://genome.ucsc.edu/cgi-bin/hgTables?hgsid=1173061381_UepaHnvaOKFZKMOV4o7DtcNUHGVa&hgta_doSchemaDb=chlSab2&hgta_doSchemaTable=ncbiRefSeqCurated
Avec "*", les champs conservés (a priori)
| 1 | bin | Indexing field to speed chromosome range queries. |
| 2 | *name | Name of gene (usually transcript_id from GTF) |
| 3 | *chrom | Reference sequence chromosome or scaffold |
| 4 | *strand | + or - for strand |
| 5 | *txStart | Transcription start position (or end position for minus strand item) |
| 6 | *txEnd | Transcription end position (or start position for minus strand item) |
| 7 | *cdsStart | Coding region start (or end position for minus strand item) |
| 8 | *cdsEnd | Coding region end (or start position for minus strand item) |
| 9 | *exonCount | Number of exons |
| 10 | *exonStarts | Exon start positions (or end positions for minus strand item) |
| 11 | *exonEnds | Exon end positions (or start positions for minus strand item) |
| 12 | *score | score |
| 13 | *name2 | Alternate name (e.g. gene_id from GTF) |
| 14 | cdsStartStat | Status of CDS start annotation (none, unknown, incomplete, or complete) |
| 15 | cdsEndStat | Status of CDS end annotation (none, unknown, incomplete, or complete) |
| 16 | exonFrames | Exon frame {0,1,2}, or -1 if no frame for exon |
En T2T, seulement au format bigBed : https://hgdownload.soe.ucsc.edu/gbdb/hs1/ncbiRefSeq/
Il y a un exécutable pour convertir en bed : http://hgdownload.soe.ucsc.edu/admin/exe/
Sous gentoo, il faut instaler mit-krb5 (pour libkrb5)
#+begin_src
./bigBedToBed ncbiRefSeqCurated.bb ncbiRefSeqCurated.bed
#+end_src
Ne pas oublier les headers car ils sont dans un ordre différent:
#chrom chromStart chromEnd name score strand thickStart thickEnd reserved blockCount blockSizes chromStarts name2 cdsStartStat cdsEndStat exonFrames type geneName geneName2 geneType
*** PROJ Porter Spip
** PROJ Indicateurs qualité
*** Idée
Raredisease:
- FastQC : nombreuses statistiques. Non disponible Nix
- Mosdepth : calcule la profondeur (2x plus rapide que samtools depth). Nix
- MultiQC : fusionne juste les résultats des analyses. Non disponible nix
- Picard's CollectMutipleMetrics, CollectHsMetrics, and CollectWgsMetrics
- Qualimap : alternative fastqc ? Non disponible nix
- Sentieon's WgsMetricsAlgo : propriétaire
- TIDDIT's cov : TIDIT = remaninement chromosomique
Sarek:
- alignment statistics : samtools stats, mosdepth
- QC : MultiQC
MultiQC : non disponible Nix
** PROJ vérifier si normalisation
** PROJ Rajouter vérification hgvs
** DONE Exécution
CLOSED: [2022-09-13 Tue 21:37]
*** KILL test Bionix
*** KILL Implémenter execution avec Nix ?
Voir https://academic.oup.com/gigascience/article/9/11/giaa121/5987272?login=false
pour un exemple.
Probablement plus simple d’utiliser Nix pour gestion de l’environnement et snakemake pour l’exécution
Pas d’accès internet depuis le cluster
*** DONE nextflow
CLOSED: [2022-09-13 Tue 21:37]
**** TODO Bug scheduler SGE
Le job se fait tuer car l'utilisateur n'est pas passé correctement à nextflow
***** DONE Forcer l'utilisateur à l'exécution
CLOSED: [2023-04-01 Sat 17:57]
NXF_OPTS=-D"user.name=alex"
***** DONE Vérifier si le problème persiste avec 22.10.6
CLOSED: [2023-04-01 Sat 18:38] SCHEDULED: <2023-04-01 Sat>
oui
***** KILL Packager l'utilisateur dans le programme ?
Mauvaise idée..
** TODO Preprocessing avec nextflow
*** TODO Map to reference
**** TODO Sample ID dans header
/Work/Users/apraga/bisonex/out/63003856_S135/preprocessing/baserecalibrator
*** DONE Mark duplicate
CLOSED: [2022-10-09 Sun 22:30]
*** DONE Recalibrate base quality score
CLOSED: [2022-10-09 Sun 22:30]
** DONE Variant calling avec Nextflow
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Haplotype caller
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter variants
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter common snp not clinvar path
CLOSED: [2022-11-07 Mon 23:00]
Voir [[*common dbSNP not clinvar patho][common dbSNP not clinvar patho]]
*** DONE Filter variant only in consensual sequence
CLOSED: [2022-11-08 Tue 22:23]
*** DONE Filter technical variants
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Utilise AVX pour accélerer l'exécution
CLOSED: [2023-04-29 Sat 15:46]
Sans cela, on a l'avertissement
#+begin_quote
17:28:00.720 INFO PairHMM - OpenMP multi-threaded AVX-accelerated native PairHMM implementation is not supported
17:28:00.721 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/nix/store/cy9ckxqwrkifx7wf02hm4ww1p6lnbxg9-gatk-4.2.4.1/bin/gatk-package-4.2.4.1-local.jar!/com/intel/gkl/native/libgkl_utils.so
17:28:00.733 WARN NativeLibraryLoader - Unable to load libgkl_utils.so from native/libgkl_utils.so (/Work/Users/apraga/bisonex/out/NA12878_NIST7035/preprocessing/applybqsr/libgkl_utils821485189051585397.so: libgomp.so.1: cannot open shared object file: No such file or directory)
17:28:00.733 WARN IntelPairHmm - Intel GKL Utils not loaded
17:28:00.733 WARN PairHMM - ***WARNING: Machine does not have the AVX instruction set support needed for the accelerated AVX PairHmm. Falling back to the MUCH slower LOGLESS_CACHING implementation!
17:28:00.763 INFO ProgressMeter - Starting traversal
#+end_quote
libgomp.so est fourni par gcc donc il faut charger le module
module load gcc@11.3.0/gcc-12.1.0
** KILL Utiliser subworkflow
CLOSED: [2023-04-02 Sun 18:08]
Notre version permet d'être plus souple
*** KILL Alignement
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
*** KILL Vep
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
vcf_annotate_ensemblvep
** TODO Annotation avec nextflow :annotation:
*** KILL VEP : --gene-phenotype ?
CLOSED: [2023-04-18 mar. 18:32]
Vu avec alexis : bases de données non à jour
https://www.ensembl.org/info/genome/variation/phenotype/sources_phenotype_documentation.html
*** DONE plugin VEP
CLOSED: [2023-04-18 mar. 18:32]
Cloner dépôt git avec plugin
Puis utiliser --dir_plugins
*** HOLD Utiliser code d’Alexis
*** TODO Nouvelle version avec VEP
Example avec --custom
https://www.ensembl.org/info/docs/tools/vep/script/vep_custom.html
**** DONE Ajout spliceAI
CLOSED: [2023-05-18 Thu 11:02] SCHEDULED: <2023-04-30 Sun>
plugin VEP
***** DONE Télécharger les données
CLOSED: [2023-05-11 Thu 19:01]
Difficile d'automatiser, le lien est temporaire...
***** DONE PLugin
CLOSED: [2023-05-11 Thu 20:16]
***** DONE Séparer score en plusieurs colonnes
CLOSED: [2023-05-11 Thu 20:16]
Test avec ce fichier pour avoir une ligne avec annotation et une ligne sans
#CHROM POS ID REF ALT
1 9091 . A C
1 69091 . A C
et
#+begin_src sh
rm -f postvep.tsv* && vep -i te
.893683 | 0.962071 | 0.103367 | 0.926617 | 2.529551552318896 | 2.4124463519313304 | 1.6206857273037931 | 1.688867584028874 |
***** KILL Résultats sans trimming
CLOSED: [2023-06-25 Sun 15:53] SCHEDULED: <2023-06-26 Mon>
***** DONE Refaire : HiSeq4000 + agilent sureselect + génome "prêt à l'emploi"
CLOSED: [2023-06-30 Fri 22:08] SCHEDULED: <2023-06-25 Sun>
#+begin_src
nextflow run workflows/compareVCF.nf -profile standard,helios --outdir=out/HG001-SRX11061486_SRR14724513-GRCh38 --query=out/HG001-SRX11061486_SRR14724513-GRCh38/callVariant/haplotypecaller/HG001-SRX11061486_SRR14724513-GRCh38.vcf.gz --compare=vcfeval,happy -lib lib --capture=capture/Agilent_SureSelect_All_Exons_v7_hg38_Regions.bed --id=HG001
#+end_src
Meilleurs résultats !
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL | 549 | 489 | 60 | 899 | 64 | 340 | 8 | 17 | 0.89071 | 0.88551 | 0.378198 | 0.888102 | | | 1.86096256684492 | 2.247272727272727 |
| INDEL | PASS | 549 | 489 | 60 | 899 | 64 | 340 | 8 | 17 | 0.89071 | 0.88551 | 0.378198 | 0.888102 | | | 1.86096256684492 | 2.247272727272727 |
| SNP | ALL | 21973 | 21462 | 511 | 26285 | 563 | 4263 | 68 | 16 | 0.976744 | 0.974435 | 0.162184 | 0.975588 | 3.007110300820419 | 2.78468624064479 | 1.5918102430965306 | 1.8161449399656946 |
| SNP | PASS | 21973 | 21462 | 511 | 26285 | 563 | 4263 | 68 | 16 | 0.976744 | 0.974435 | 0.162184 | 0.975588 | 3.007110300820419 | 2.78468624064479 | 1.5918102430965306 | 1.8161449399656946 |
***** KILL Utiliser d'autres données brutes ?
CLOSED: [2023-06-25 Sun 15:58]
https://zenodo.org/record/3597727
Capture en hg37 également. Serait intéressant mais pas le temps..
***** KILL Comparer avec UCSCS liftover
CLOSED: [2023-06-26 Mon 19:02] SCHEDULED: <2023-06-25 Sun>
Picard liftoverinterval est basé sur UCSCS
Mais on n'aurait pas la différence pour NA12878 qu'on voit...
**** DONE HG002 :hg002:hg38:
CLOSED: [2023-07-30 Sun 14:25] SCHEDULED: <2023-07-25 Tue>
#+begin_src
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/giabFastq.nf -profile standard,helios
NXF_OPTS=-D"user.name=${USER}" nextflow run main.nf -profile standard,helios -resume --input="/Work/Groups/bisonex/data/giab/GRCh38/HG002_{1,2}.fq.gz --test.id=HG002
Only the capture file differs. Results are better using the capture file given by Agilent, stored in data/
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/compareVCF.nf -profile standard,helios -resume --outdir=compareHG002 --test.id=HG002 --test.query=out/HG002_1/variantCalling/haplotypecaller/HG002_1.vcf.gz --test.compare=vcfeval,happy --test.capture=data/AgilentSureSelectv05_hg38.bed
#
#+end_src
***** DONE Mauvais résultats
CLOSED: [2023-04-14 Fri 09:42]
avec vcfeval
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
0.000 24585 24390 10060 39415 0.7080 0.3841 0.4980
None 24585 24390 10060 39415 0.7080 0.3841 0.4980
La sortie du variantCalling est celle d'happy ???
On relance...
***** DONE Vérifier vcf en hg38
CLOSED: [2023-04-12 Wed 10:33] SCHEDULED: <2023-04-12 Wed>
***** KILL Capture en hg19 ?
CLOSED: [2023-04-13 Thu 09:46] SCHEDULED: <2023-04-12 Wed>
***** KILL Vraiment fichier de capture ou zone d'intérêt ?
CLOSED: [2023-04-13 Thu 09:45] SCHEDULED: <2023-04-12 Wed>
"target region" +/- 50bp
[[https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/OsloUniversityHospital_Exome_GATK_jointVC_11242015/README.txt][README]]
list file describing the variant calling regions (target regions extended with 50 bp on each end)
***** DONE .bed fourni par AGilent: sensbilité très mauvaise
CLOSED: [2023-04-13 Thu 09:46] SCHEDULED: <2023-04-13 Thu>
Agilent SureSelect Human All Exon V5 kit
Disponible en hg38
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
0.000 19653 19501 6410 21657 0.7526 0.4757 0.5830
None 19653 19501 6410 21657 0.7526 0.4757 0.5830
***** DONE Trier par nom avec samtools sort : bons résultats
CLOSED: [2023-04-14 Fri 09:25] SCHEDULED: <2023-04-13 Thu>
Avec capture fourni par GIAB
vcf eval
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
5.000 57443 57032 984 6557 0.9830 0.8975 0.9383
None 57457 57046 1009 6543 0.9826 0.8978 0.9383
Happy
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
|-------+--------+-------------+----------+----------+-------------+----------+-----------+-------+-------+---------------+------------------+----------------+-----------------+------------------------+------------------------+---------------------------+---------------------------|
| INDEL | ALL | 6150 | 5007 | 1143 | 6978 | 556 | 1346 | 151 | 168 | 0.814146 | 0.901278 | 0.192892 | 0.8555 | | | 1.5434221840068787 | 1.9467178175618074 |
| INDEL | PASS | 6150 | 5007 | 1143 | 6978 | 556 | 1346 | 151 | 168 | 0.814146 | 0.901278 | 0.192892 | 0.8555 | | | 1.5434221840068787 | 1.9467178175618074 |
| SNP | ALL | 57818 | 52464 | 5354 | 56016 | 500 | 3046 | 90 | 30 | 0.907399 | 0.990561 | 0.054377 | 0.947158 | 2.4892012548262548 | 2.426824047458871 | 1.5904527117884357 | 1.6107795598657217 |
| SNP | PASS | 57818 | 52464 | 5354 | 56016 | 500 | 3046 | 90 | 30 | 0.907399 | 0.990561 | 0.054377 | 0.947158 | 2.4892012548262548 | 2.426824047458871 | 1.5904527117884357 | 1.6107795598657217 |
***** DONE Capture agilent légment meilleur que celui fourni par GIAB (padding ?)
CLOSED: [2023-04-14 Fri 09:48]
GIAB:
vcf eval
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
5.000 57443 57032 984 6557 0.9830 0.8975 0.9383
None 57457 57046 1009 6543 0.9826 0.8978 0.9383
Happy
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
|-------+--------+-------------+----------+----------+-------------+----------+-----------+-------+-------+---------------+------------------+----------------+-----------------+------------------------+------------------------+---------------------------+---------------------------|
| INDEL | ALL | 6150 | 5007 | 1143 | 6978 | 556 | 1346 | 151 | 168 | 0.814146 | 0.901278 | 0.192892 | 0.8555 | | | 1.5434221840068787 | 1.9467178175618074 |
| INDEL | PASS | 6150 | 5007 | 1143 | 6978 | 556 | 1346 | 151 | 168 | 0.814146 | 0.901278 | 0.192892 | 0.8555 | | | 1.5434221840068787 | 1.9467178175618074 |
| SNP | ALL | 57818 | 52464 | 5354 | 56016 | 500 | 3046 | 90 | 30 | 0.907399 | 0.990561 | 0.054377 | 0.947158 | 2.4892012548262548 | 2.426824047458871 | 1.5904527117884357 | 1.6107795598657217 |
| SNP | PASS | 57818 | 52464 | 5354 | 56016 | 500 | 3046 | 90 | 30 | 0.907399 | 0.990561 | 0.054377 | 0.947158 | 2.4892012548262548 | 2.426824047458871 | 1.5904527117884357 | 1.6107795598657217 |
Agilent
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
6.000 37241 36965 449 4069 0.9880 0.9015 0.9428
None 37248 36972 461 4062 0.9877 0.9017 0.9427
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL | 2909 | 2477 | 432 | 3229 | 207 | 519 | 52 | 50 | 0.851495 | 0.923616 | 0.160731 | 0.886091 | | | 1.4964850615114236 | 1.8339222614840989 |
| INDEL | PASS | 2909 | 2477 | 432 | 3229 | 207 | 519 | 52 | 50 | 0.851495 | 0.923616 | 0.160731 | 0.886091 | | | 1.4964850615114236 | 1.8339222614840989 |
| SNP | ALL | 38406 | 34793 | 3613 | 36935 | 275 | 1868 | 37 | 15 | 0.905926 | 0.992158 | 0.050575 | 0.947083 | 2.6247759222568168 | 2.5752854654538417 | 1.588953331534934 | 1.6192536889897844 |
| SNP | PASS | 38406 | 34793 | 3613 | 36935 | 275 | 1868 | 37 | 15 | 0.905926 | 0.992158 | 0.050575 | 0.947083 | 2.6247759222568168 | 2.5752854654538417 | 1.588953331534934 | 1.6192536889897844 |
***** DONE Refaire : HiSeq4000 + agilent sureselect + génome "prêt à l'emploi"
CLOSED: [2023-07-30 Sun 14:24] SCHEDULED: <2023-07-23 Sun>
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL | 626 | 536 | 90 | 900 | 23 | 334 | 6 | 14 | 0.85623 | 0.959364 | 0.371111 | 0.904868 | | | 1.5696202531645569 | 2.406844106463878 |
| INDEL | PASS | 626 | 536 | 90 | 900 | 23 | 334 | 6 | 14 | 0.85623 | 0.959364 | 0.371111 | 0.904868 | | | 1.5696202531645569 | 2.406844106463878 |
| SNP | ALL | 23136 | 22346 | 790 | 26289 | 103 | 3845 | 77 | 4 | 0.965854 | 0.995411 | 0.146259 | 0.98041 | 2.9282199219412863 | 2.7752583237657866 | 1.6348301800775018 | 1.8423330808354075 |
| SNP | PASS | 23136 | 22346 | 790 | 26289 | 103 | 3845 | 77 | 4 | 0.965854 | 0.995411 | 0.146259 | 0.98041 | 2.9282199219412863 | 2.7752583237657866 | 1.6348301800775018 | 1.8423330808354075 |
**** DONE HG003 :hg003:hg38:
CLOSED: [2023-07-30 Sun 14:26]
***** Notes
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run main.nf -profile standard,helios --input /Work/Groups/bisonex/data/giab/GRCh38/HG003_{1,2}.fq.gz -bg
#+end_src
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/compareVCF.nf -profile standard,helios -resume --outdir=compareHG003 --test.id=HG003 --test.query=out/HG003_1/variantCalling/haplotypecaller/HG003_1.vcf.gz --test.compare=vcfeval,happy --test.capture=data/AgilentSureSelectv05_hg38.bed
#+end_src
vcfeval
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
5.000 36745 36473 486 3988 0.9869 0.9021 0.9426
None 36748 36476 495 3985 0.9866 0.9022 0.9425
$ zcat NA12878.snp_roc.tsv.gz | tail -n 1 | awk '{print $7 $6}'
happy
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 2731 2290 441 3092 208 577 62 53 0.838521 0.917296 0.186611 0.876141 NaN NaN 1.505145 1.888993
INDEL PASS 2731 2290 441 3092 208 577 62 53 0.838521 0.917296 0.186611 0.876141 NaN NaN 1.505145 1.888993
SNP ALL 37997 34481 3516 36861 306 2074 33 13 0.907466 0.991204 0.056265 0.947488 2.611269 2.565915 1.555780 1.621727
SNP PASS 37997 34481 3516 36861 306 2074 33 13 0.907466 0.991204 0.056265 0.947488 2.611269 2.5659
***** DONE Refaire : HiSeq4000 + agilent sureselect + génome "prêt à l'emploi"
CLOSED: [2023-07-30 Sun 14:25] SCHEDULED: <2023-07-23 Sun>
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL | 644 | 538 | 106 | 914 | 32 | 337 | 7 | 19 | 0.835404 | 0.944541 | 0.368709 | 0.886626 | | | 1.7444933920704846 | 2.3138686131386863 |
| INDEL | PASS | 644 | 538 | 106 | 914 | 32 | 337 | 7 | 19 | 0.835404 | 0.944541 | 0.368709 | 0.886626 | | | 1.7444933920704846 | 2.3138686131386863 |
| SNP | ALL | 23126 | 22271 | 855 | 26405 | 135 | 4002 | 90 | 20 | 0.963029 | 0.993974 | 0.151562 | 0.978257 | 2.949462182004439 | 2.7766657134686876 | 1.6080333972695475 | 1.8465106245280984 |
| SNP | PASS | 23126 | 22271 | 855 | 26405 | 135 | 4002 | 90 | 20 | 0.963029 | 0.993974 | 0.151562 | 0.978257 | 2.949462182004439 | 2.7766657134686876 | 1.6080333972695475 | 1.8465106245280984 |
**** TODO HG004 :hg38:hg004:
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run main.nf -profile standard,helios --input /Work/Groups/bisonex/data/giab/GRCh38/HG004_{1,2}.fq.gz -bg
#+end_src
vcfeval
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
6.000 36938 36678 421 4040 0.9887 0.9014 0.9430
None 36942 36682 432 4036 0.9884 0.9015 0.9429
happy
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 2787 2388 399 3183 195 580 53 38 0.856835 0.925086 0.182218 0.889654 NaN NaN 1.507834 1.848649
INDEL PASS 2787 2388 399 3183 195 580 53 38 0.856835 0.925086 0.182218 0.889654 NaN NaN 1.507834 1.848649
SNP ALL 38185 34560 3625 36921 254 2107 46 7 0.905067 0.992704 0.057068 0.946862 2.589175 2.553546 1.632595 1.653534
SNP PASS 38185 34560 3625 36921 254 2107 46 7 0.905067 0.992704 0.057068 0.946862 2.589175 2.553546 1.632595 1.653534
***** DONE Refaire : HiSeq4000 + agilent sureselect + génome "prêt à l'emploi"
CLOSED: [2023-07-30 Sun 14:39] SCHEDULED: <2023-07-23 Sun>
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL | 588 | 511 | 77 | 873 | 15 | 332 | 7 | 8 | 0.869048 | 0.972274 | 0.380298 | 0.917767 | | | 1.6111111111111112 | 2.3984375 |
| INDEL | PASS | 588 | 511 | 77 | 873 | 15 | 332 | 7 | 8 | 0.869048 | 0.972274 | 0.380298 | 0.917767 | | | 1.6111111111111112 | 2.3984375 |
| SNP | ALL | 23041 | 22225 | 816 | 26302 | 114 | 3972 | 87 | 5 | 0.964585 | 0.994895 | 0.151015 | 0.979505 | 2.91407709288504 | 2.7368645271229766 | 1.6883829538820783 | 1.8632748665431964 |
| SNP | PASS | 23041 | 22225 | 816 | 26302 | 114 | 3972 | 87 | 5 | 0.964585 | 0.994895 | 0.151015 | 0.979505 | 2.91407709288504 | 2.7368645271229766 | 1.6883829538820783 | 1.8632748665431964 |
**** STRT HG001 :hg001:T2T:
Avec liftover : 10x moins de variants...
Type,Filter,TRUTH.TOTAL,TRUTH.TP,TRUTH.FN,QUERY.TOTAL,QUERY.FP,QUERY.UNK,FP.gt,FP.al,METRIC.Recall,METRIC.Precision,METRIC.Frac_NA,METRIC.F1_Score,TRUTH.TOTAL.TiTv_ratio,QUERY.TOTAL.TiTv_ratio,TRUTH.TOTAL.het_hom_ratio,QUERY.TOTAL.het_hom_ratio
INDEL,ALL,413,246,167,751,289,215,2,93,0.595642,0.460821,0.286285,0.519629,,,2.4285714285714284,2.4651162790697674
INDEL,PASS,413,246,167,751,289,215,2,93,0.595642,0.460821,0.286285,0.519629,,,2.4285714285714284,2.4651162790697674
SNP,ALL,11236,10985,251,23597,9771,2841,26,58,0.977661,0.529245,0.120397,0.686734,3.1146100329549617,2.857049501715406,3.640644361833953,2.1146328578975173
SNP,PASS,11236,10985,251,23597,9771,2841,26,58,0.977661,0.529245,0.120397,0.686734,3.1146100329549617,2.857049501715406,3.640644361833953,2.1146328578975173
***** TODO Comprendre mauvais résultat
****** DONE 2x moins de variants, beaucoup trop de FP : erreur de FILTER
CLOSED: [2023-07-08 Sat 11:17] SCHEDULED: <2023-07-07 Fri>
******* DONE [#A] Certains FP sont des vrais FP: erreur de flag et non d'happy
CLOSED: [2023-07-08 Sat 10:43] SCHEDULED: <2023-07-07 Fri>
******** Ex: chr1:408793
G C 600.64 . BS=408793;Regions=CONF,TS_contained GT:BD:BK:BI:BVT:BLT:QQ ./.:.:.:.:NOCALL:nocall:. 0/1:FP:.:tv:SNP:het:600.64
CLOSED: [2023-07-07 Fri 20:03]
********* DONE Vérifier que l'on est bien dans la zone "high confidence" définie dans le bed
oui:
chr1 408514 414397 . 500 +
********* DONE Vérifier que le variant est bien dans le truth et query
CLOSED: [2023-07-07 Fri 20:14]
truth
chr1 408793 . G C 50 MismatchedRefAllele AttemptedAlleles=C*->G;AttemptedLocus=chr1:408793-408793;SwappedAlleles;callable=CS_HiSeqPE300xGATK_callable,CS_10XLRGATK_callable,CS_CCS15kb_20kbGATK4_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable;callsetnames=HiSeqPE300xGATK,CCS15kb_20kbGATK4,CGnormal,HiSeqPE300xfreebayes,CCS15kb_20kbDV,10XLRGATK;callsets=6;datasetnames=HiSeqPE300x,CCS15kb_20kb,CGnormal,10XChromiumLR;datasets=4;datasetsmissingcall=IonExome,SolidSE75bp;filt=CS_CCS15kb_20kbDV_filt,CS_CCS15kb_20kbGATK4_filt;platformnames=Illumina,PacBio,CG,10X;platforms=4 GT:AD:ADALL:DP:GQ 1/0:153,159:141,132:689:99
query
chr1 408793 . G C 600.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.390;DP=64;ExcessHet=0.0000;FS=0.961;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=9.53;ReadPosRankSum=-3.146;SOR=0.906 GT:AD:DP:GQ:PL 0/1:33,30:63:99:608,0,798
********* DONE Tester happy sur ce variant: non retrouvé
CLOSED: [2023-07-08 Sat 10:42]
Il faut au moins 2 variant + les index (sinon "no chromosome in common")
Donc on augmente la taille de la région
#+begin_src sh :dir ~/code/bisonex/out
bcftools view HG001_GRCh38_1_22_v4_lifted_merged.vcf.gz -o lifted_test.vcf.gz chr1:408793-408900
bcftools view HG001-SRX11061486_SRR14724513-T2T.vcf.gz -o run_test.vcf.gz chr1:408793-408900
bcftools index lifted_test.vcf.gz
bcftools index run_test.vcf.gz
#+end_src
#+begin_src sh
hap.py lifted_test.vcf.gz run_test.vcf.gz --reference ../data/chm13v2.0.fa -o test
#+end_src
On trouve bien les 2 variants. L'indel est retrouvé mais pas le SNP
chr1 408793 . G C 600.64 . BS=408793 GT:BD:BK:BI:BVT:BLT:QQ ./.:.:.:.:NOCALL:nocall:. 0/1:FP:.:tv:SNP:het:600.64
chr1 408874 . CGAA C 1476.6 . BS=408874 GT:BD:BK:BI:BVT:BLT:QQ 0/1:TP:gm:d1_5:INDEL:het:1476.6 0/1:TP:gm:d1_5:INDEL:het:1476.6
********* DONE Changer flag : mismatch -> PASS: OUI
CLOSED: [2023-07-08 Sat 10:42]
********* DONE Changer haplotype 1/0 -> 0/1 truth: idem
CLOSED: [2023-07-08 Sat 10:35]
********* DONE Tester avec vcfeval : idemnt
CLOSED: [2023-07-08 Sat 10:43]
******* DONE Comparer hg38 et T2T: 2x moinsr de variants, trop de FP et FN
CLOSED: [2023-07-07 Fri 18:38]
T2T
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | F
.893683 | 0.962071 | 0.103367 | 0.926617 | 2.529551552318896 | 2.4124463519313304 | 1.6206857273037931 | 1.688867584028874 |
***** KILL Résultats sans trimming
CLOSED: [2023-06-25 Sun 15:53] SCHEDULED: <2023-06-26 Mon>
***** DONE Refaire : HiSeq4000 + agilent sureselect + génome "prêt à l'emploi"
CLOSED: [2023-06-30 Fri 22:08] SCHEDULED: <2023-06-25 Sun>
#+begin_src
nextflow run workflows/compareVCF.nf -profile standard,helios --outdir=out/HG001-SRX11061486_SRR14724513-GRCh38 --query=out/HG001-SRX11061486_SRR14724513-GRCh38/callVariant/haplotypecaller/HG001-SRX11061486_SRR14724513-GRCh38.vcf.gz --compare=vcfeval,happy -lib lib --capture=capture/Agilent_SureSelect_All_Exons_v7_hg38_Regions.bed --id=HG001
#+end_src
Meilleurs résultats !
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL | 549 | 489 | 60 | 899 | 64 | 340 | 8 | 17 | 0.89071 | 0.88551 | 0.378198 | 0.888102 | | | 1.86096256684492 | 2.247272727272727 |
| INDEL | PASS | 549 | 489 | 60 | 899 | 64 | 340 | 8 | 17 | 0.89071 | 0.88551 | 0.378198 | 0.888102 | | | 1.86096256684492 | 2.247272727272727 |
| SNP | ALL | 21973 | 21462 | 511 | 26285 | 563 | 4263 | 68 | 16 | 0.976744 | 0.974435 | 0.162184 | 0.975588 | 3.007110300820419 | 2.78468624064479 | 1.5918102430965306 | 1.8161449399656946 |
| SNP | PASS | 21973 | 21462 | 511 | 26285 | 563 | 4263 | 68 | 16 | 0.976744 | 0.974435 | 0.162184 | 0.975588 | 3.007110300820419 | 2.78468624064479 | 1.5918102430965306 | 1.8161449399656946 |
***** DONE Refaire : HiSeq4000 + agilent sureselect + génome "prêt à l'emploi" gatk-4.4
CLOSED: [2023-08-03 Thu 23:55] SCHEDULED: <2023-06-25 Sun>
#+begin_src
ID="HG001-SRX11061486_SRR14724513-GRCh38"; nextflow run workflows/compareVCF.nf -profile standard,helios --outdir=out/${ID} --query=out/${ID}/callVariant/haplotypecaller/${ID}.vcf.gz --compare=vcfeval,hay -lib lib --capture=capture/Agilent_SureSelect_All_Exons_v7_hg38_Regions.bed --id=HG001 --genome=GRCh38 -bg
#+end_src
***** KILL Utiliser d'autres données brutes ?
CLOSED: [2023-06-25 Sun 15:58]
https://zenodo.org/record/3597727
Capture en hg37 également. Serait intéressant mais pas le temps..
***** KILL Comparer avec UCSCS liftover
CLOSED: [2023-06-26 Mon 19:02] SCHEDULED: <2023-06-25 Sun>
Picard liftoverinterval est basé sur UCSCS
Mais on n'aurait pas la différence pour NA12878 qu'on voit...
**** DONE HG002 :hg002:hg38:
CLOSED: [2023-07-30 Sun 14:25] SCHEDULED: <2023-07-25 Tue>
#+begin_src
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/giabFastq.nf -profile standard,helios
NXF_OPTS=-D"user.name=${USER}" nextflow run main.nf -profile standard,helios -resume --input="/Work/Groups/bisonex/data/giab/GRCh38/HG002_{1,2}.fq.gz --test.id=HG002
Only the capture file differs. Results are better using the capture file given by Agilent, stored in data/
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/compareVCF.nf -profile standard,helios -resume --outdir=compareHG002 --test.id=HG002 --test.query=out/HG002_1/variantCalling/haplotypecaller/HG002_1.vcf.gz --test.compare=vcfeval,happy --test.capture=data/AgilentSureSelectv05_hg38.bed
#
#+end_src
***** DONE Mauvais résultats
CLOSED: [2023-04-14 Fri 09:42]
avec vcfeval
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
0.000 24585 24390 10060 39415 0.7080 0.3841 0.4980
None 24585 24390 10060 39415 0.7080 0.3841 0.4980
La sortie du variantCalling est celle d'happy ???
On relance...
***** DONE Vérifier vcf en hg38
CLOSED: [2023-04-12 Wed 10:33] SCHEDULED: <2023-04-12 Wed>
***** KILL Capture en hg19 ?
CLOSED: [2023-04-13 Thu 09:46] SCHEDULED: <2023-04-12 Wed>
***** KILL Vraiment fichier de capture ou zone d'intérêt ?
CLOSED: [2023-04-13 Thu 09:45] SCHEDULED: <2023-04-12 Wed>
"target region" +/- 50bp
[[https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/OsloUniversityHospital_Exome_GATK_jointVC_11242015/README.txt][README]]
list file describing the variant calling regions (target regions extended with 50 bp on each end)
***** DONE .bed fourni par AGilent: sensbilité très mauvaise
CLOSED: [2023-04-13 Thu 09:46] SCHEDULED: <2023-04-13 Thu>
Agilent SureSelect Human All Exon V5 kit
Disponible en hg38
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
0.000 19653 19501 6410 21657 0.7526 0.4757 0.5830
None 19653 19501 6410 21657 0.7526 0.4757 0.5830
***** DONE Trier par nom avec samtools sort : bons résultats
CLOSED: [2023-04-14 Fri 09:25] SCHEDULED: <2023-04-13 Thu>
Avec capture fourni par GIAB
vcf eval
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
5.000 57443 57032 984 6557 0.9830 0.8975 0.9383
None 57457 57046 1009 6543 0.9826 0.8978 0.9383
Happy
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
|-------+--------+-------------+----------+----------+-------------+----------+-----------+-------+-------+---------------+------------------+----------------+-----------------+------------------------+------------------------+---------------------------+---------------------------|
| INDEL | ALL | 6150 | 5007 | 1143 | 6978 | 556 | 1346 | 151 | 168 | 0.814146 | 0.901278 | 0.192892 | 0.8555 | | | 1.5434221840068787 | 1.9467178175618074 |
| INDEL | PASS | 6150 | 5007 | 1143 | 6978 | 556 | 1346 | 151 | 168 | 0.814146 | 0.901278 | 0.192892 | 0.8555 | | | 1.5434221840068787 | 1.9467178175618074 |
| SNP | ALL | 57818 | 52464 | 5354 | 56016 | 500 | 3046 | 90 | 30 | 0.907399 | 0.990561 | 0.054377 | 0.947158 | 2.4892012548262548 | 2.426824047458871 | 1.5904527117884357 | 1.6107795598657217 |
| SNP | PASS | 57818 | 52464 | 5354 | 56016 | 500 | 3046 | 90 | 30 | 0.907399 | 0.990561 | 0.054377 | 0.947158 | 2.4892012548262548 | 2.426824047458871 | 1.5904527117884357 | 1.6107795598657217 |
***** DONE Capture agilent légment meilleur que celui fourni par GIAB (padding ?)
CLOSED: [2023-04-14 Fri 09:48]
GIAB:
vcf eval
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
5.000 57443 57032 984 6557 0.9830 0.8975 0.9383
None 57457 57046 1009 6543 0.9826 0.8978 0.9383
Happy
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
|-------+--------+-------------+----------+----------+-------------+----------+-----------+-------+-------+---------------+------------------+----------------+-----------------+------------------------+------------------------+---------------------------+---------------------------|
| INDEL | ALL | 6150 | 5007 | 1143 | 6978 | 556 | 1346 | 151 | 168 | 0.814146 | 0.901278 | 0.192892 | 0.8555 | | | 1.5434221840068787 | 1.9467178175618074 |
| INDEL | PASS | 6150 | 5007 | 1143 | 6978 | 556 | 1346 | 151 | 168 | 0.814146 | 0.901278 | 0.192892 | 0.8555 | | | 1.5434221840068787 | 1.9467178175618074 |
| SNP | ALL | 57818 | 52464 | 5354 | 56016 | 500 | 3046 | 90 | 30 | 0.907399 | 0.990561 | 0.054377 | 0.947158 | 2.4892012548262548 | 2.426824047458871 | 1.5904527117884357 | 1.6107795598657217 |
| SNP | PASS | 57818 | 52464 | 5354 | 56016 | 500 | 3046 | 90 | 30 | 0.907399 | 0.990561 | 0.054377 | 0.947158 | 2.4892012548262548 | 2.426824047458871 | 1.5904527117884357 | 1.6107795598657217 |
Agilent
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
6.000 37241 36965 449 4069 0.9880 0.9015 0.9428
None 37248 36972 461 4062 0.9877 0.9017 0.9427
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL | 2909 | 2477 | 432 | 3229 | 207 | 519 | 52 | 50 | 0.851495 | 0.923616 | 0.160731 | 0.886091 | | | 1.4964850615114236 | 1.8339222614840989 |
| INDEL | PASS | 2909 | 2477 | 432 | 3229 | 207 | 519 | 52 | 50 | 0.851495 | 0.923616 | 0.160731 | 0.886091 | | | 1.4964850615114236 | 1.8339222614840989 |
| SNP | ALL | 38406 | 34793 | 3613 | 36935 | 275 | 1868 | 37 | 15 | 0.905926 | 0.992158 | 0.050575 | 0.947083 | 2.6247759222568168 | 2.5752854654538417 | 1.588953331534934 | 1.6192536889897844 |
| SNP | PASS | 38406 | 34793 | 3613 | 36935 | 275 | 1868 | 37 | 15 | 0.905926 | 0.992158 | 0.050575 | 0.947083 | 2.6247759222568168 | 2.5752854654538417 | 1.588953331534934 | 1.6192536889897844 |
***** DONE Refaire : HiSeq4000 + agilent sureselect + génome "prêt à l'emploi"
CLOSED: [2023-07-30 Sun 14:24] SCHEDULED: <2023-07-23 Sun>
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL | 626 | 536 | 90 | 900 | 23 | 334 | 6 | 14 | 0.85623 | 0.959364 | 0.371111 | 0.904868 | | | 1.5696202531645569 | 2.406844106463878 |
| INDEL | PASS | 626 | 536 | 90 | 900 | 23 | 334 | 6 | 14 | 0.85623 | 0.959364 | 0.371111 | 0.904868 | | | 1.5696202531645569 | 2.406844106463878 |
| SNP | ALL | 23136 | 22346 | 790 | 26289 | 103 | 3845 | 77 | 4 | 0.965854 | 0.995411 | 0.146259 | 0.98041 | 2.9282199219412863 | 2.7752583237657866 | 1.6348301800775018 | 1.8423330808354075 |
| SNP | PASS | 23136 | 22346 | 790 | 26289 | 103 | 3845 | 77 | 4 | 0.965854 | 0.995411 | 0.146259 | 0.98041 | 2.9282199219412863 | 2.7752583237657866 | 1.6348301800775018 | 1.8423330808354075 |
***** DONE Refaire : HiSeq4000 + agilent sureselect + génome "prêt à l'emploi" gatk-4.4
CLOSED: [2023-08-03 Thu 23:24] SCHEDULED: <2023-08-03 Thu>
#+begin_src sh
ID="HG002-SRX11061487_SRR14724512-GRCh38" ; nextflow run workflows/compareVCF.nf -profile standard,helios --outdir=out/${ID} --query=out/${ID}/callVariant/haplotypecaller/${ID}.vcf.gz --compare=vcfeval,happy -lib lib --capture=capture/Agilent_SureSelect_All_Exons_v7_hg38_Regions.bed --id=HG002 --genome=GRCh38
#+end_src
**** DONE HG003 :hg003:hg38:
CLOSED: [2023-07-30 Sun 14:26]
***** Notes
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run main.nf -profile standard,helios --input /Work/Groups/bisonex/data/giab/GRCh38/HG003_{1,2}.fq.gz -bg
#+end_src
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/compareVCF.nf -profile standard,helios -resume --outdir=compareHG003 --test.id=HG003 --test.query=out/HG003_1/variantCalling/haplotypecaller/HG003_1.vcf.gz --test.compare=vcfeval,happy --test.capture=data/AgilentSureSelectv05_hg38.bed
#+end_src
vcfeval
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
5.000 36745 36473 486 3988 0.9869 0.9021 0.9426
None 36748 36476 495 3985 0.9866 0.9022 0.9425
$ zcat NA12878.snp_roc.tsv.gz | tail -n 1 | awk '{print $7 $6}'
happy
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 2731 2290 441 3092 208 577 62 53 0.838521 0.917296 0.186611 0.876141 NaN NaN 1.505145 1.888993
INDEL PASS 2731 2290 441 3092 208 577 62 53 0.838521 0.917296 0.186611 0.876141 NaN NaN 1.505145 1.888993
SNP ALL 37997 34481 3516 36861 306 2074 33 13 0.907466 0.991204 0.056265 0.947488 2.611269 2.565915 1.555780 1.621727
SNP PASS 37997 34481 3516 36861 306 2074 33 13 0.907466 0.991204 0.056265 0.947488 2.611269 2.5659
***** DONE Refaire : HiSeq4000 + agilent sureselect + génome "prêt à l'emploi"
CLOSED: [2023-07-30 Sun 14:25] SCHEDULED: <2023-07-23 Sun>
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL | 644 | 538 | 106 | 914 | 32 | 337 | 7 | 19 | 0.835404 | 0.944541 | 0.368709 | 0.886626 | | | 1.7444933920704846 | 2.3138686131386863 |
| INDEL | PASS | 644 | 538 | 106 | 914 | 32 | 337 | 7 | 19 | 0.835404 | 0.944541 | 0.368709 | 0.886626 | | | 1.7444933920704846 | 2.3138686131386863 |
| SNP | ALL | 23126 | 22271 | 855 | 26405 | 135 | 4002 | 90 | 20 | 0.963029 | 0.993974 | 0.151562 | 0.978257 | 2.949462182004439 | 2.7766657134686876 | 1.6080333972695475 | 1.8465106245280984 |
| SNP | PASS | 23126 | 22271 | 855 | 26405 | 135 | 4002 | 90 | 20 | 0.963029 | 0.993974 | 0.151562 | 0.978257 | 2.949462182004439 | 2.7766657134686876 | 1.6080333972695475 | 1.8465106245280984 |
***** DONE Refaire : HiSeq4000 + agilent sureselect + génome "prêt à l'emploi" gatk-4.4
CLOSED: [2023-08-03 Thu 23:24] SCHEDULED: <2023-08-03 Thu>
#+begin_src
ID="HG003-SRX11061488_SRR14724511-GRCh38" ; nextflow run workflows/compareVCF.nf -profile standard,helios --outdir=out/${ID} --query=out/${ID}/callVariant/haplotypecaller/${ID}.vcf.gz --compare=vcfeval,happy -lib lib --capture=capture/Agilent_SureSelect_All_Exons_v7_hg38_Regions.bed --id=HG003 --genome=GRCh38
#+end_src
**** TODO HG004 :hg38:hg004:
***** Notes
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run main.nf -profile standard,helios --input /Work/Groups/bisonex/data/giab/GRCh38/HG004_{1,2}.fq.gz -bg
#+end_src
vcfeval
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
6.000 36938 36678 421 4040 0.9887 0.9014 0.9430
None 36942 36682 432 4036 0.9884 0.9015 0.9429
happy
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 2787 2388 399 3183 195 580 53 38 0.856835 0.925086 0.182218 0.889654 NaN NaN 1.507834 1.848649
INDEL PASS 2787 2388 399 3183 195 580 53 38 0.856835 0.925086 0.182218 0.889654 NaN NaN 1.507834 1.848649
SNP ALL 38185 34560 3625 36921 254 2107 46 7 0.905067 0.992704 0.057068 0.946862 2.589175 2.553546 1.632595 1.653534
SNP PASS 38185 34560 3625 36921 254 2107 46 7 0.905067 0.992704 0.057068 0.946862 2.589175 2.553546 1.632595 1.653534
***** DONE Refaire : HiSeq4000 + agilent sureselect + génome "prêt à l'emploi"
CLOSED: [2023-07-30 Sun 14:39] SCHEDULED: <2023-07-23 Sun>
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL | 588 | 511 | 77 | 873 | 15 | 332 | 7 | 8 | 0.869048 | 0.972274 | 0.380298 | 0.917767 | | | 1.6111111111111112 | 2.3984375 |
| INDEL | PASS | 588 | 511 | 77 | 873 | 15 | 332 | 7 | 8 | 0.869048 | 0.972274 | 0.380298 | 0.917767 | | | 1.6111111111111112 | 2.3984375 |
| SNP | ALL | 23041 | 22225 | 816 | 26302 | 114 | 3972 | 87 | 5 | 0.964585 | 0.994895 | 0.151015 | 0.979505 | 2.91407709288504 | 2.7368645271229766 | 1.6883829538820783 | 1.8632748665431964 |
| SNP | PASS | 23041 | 22225 | 816 | 26302 | 114 | 3972 | 87 | 5 | 0.964585 | 0.994895 | 0.151015 | 0.979505 | 2.91407709288504 | 2.7368645271229766 | 1.6883829538820783 | 1.8632748665431964 |
***** DONE Refaire : HiSeq4000 + agilent sureselect + génome "prêt à l'emploi" gatk4.4
CLOSED: [2023-08-03 Thu 23:24] SCHEDULED: <2023-08-03 Thu>
#+begin_src sh
ID="HG004-SRX11061489_SRR14724510-GRCh38" ; nextflow run workflows/compareVCF.nf -profile standard,helios --outdir=out/${ID} --query=out/${ID}/callVariant/haplotypecaller/${ID}.vcf.gz --compare=vcfeval,happy -lib lib --capture=capture/Agilent_SureSelect_All_Exons_v7_hg38_Regions.bed --id=HG003 --genome=GRCh38
#+end_src
**** TODO Refaire la figure pour HG001,2,3,4
SCHEDULED: <2023-08-04 Fri>
**** STRT HG001 :hg001:T2T:
Avec liftover : 10x moins de variants...
Type,Filter,TRUTH.TOTAL,TRUTH.TP,TRUTH.FN,QUERY.TOTAL,QUERY.FP,QUERY.UNK,FP.gt,FP.al,METRIC.Recall,METRIC.Precision,METRIC.Frac_NA,METRIC.F1_Score,TRUTH.TOTAL.TiTv_ratio,QUERY.TOTAL.TiTv_ratio,TRUTH.TOTAL.het_hom_ratio,QUERY.TOTAL.het_hom_ratio
INDEL,ALL,413,246,167,751,289,215,2,93,0.595642,0.460821,0.286285,0.519629,,,2.4285714285714284,2.4651162790697674
INDEL,PASS,413,246,167,751,289,215,2,93,0.595642,0.460821,0.286285,0.519629,,,2.4285714285714284,2.4651162790697674
SNP,ALL,11236,10985,251,23597,9771,2841,26,58,0.977661,0.529245,0.120397,0.686734,3.1146100329549617,2.857049501715406,3.640644361833953,2.1146328578975173
SNP,PASS,11236,10985,251,23597,9771,2841,26,58,0.977661,0.529245,0.120397,0.686734,3.1146100329549617,2.857049501715406,3.640644361833953,2.1146328578975173
***** TODO Comprendre mauvais résultat
****** DONE 2x moins de variants, beaucoup trop de FP : erreur de FILTER
CLOSED: [2023-07-08 Sat 11:17] SCHEDULED: <2023-07-07 Fri>
******* DONE [#A] Certains FP sont des vrais FP: erreur de flag et non d'happy
CLOSED: [2023-07-08 Sat 10:43] SCHEDULED: <2023-07-07 Fri>
******** Ex: chr1:408793
G C 600.64 . BS=408793;Regions=CONF,TS_contained GT:BD:BK:BI:BVT:BLT:QQ ./.:.:.:.:NOCALL:nocall:. 0/1:FP:.:tv:SNP:het:600.64
CLOSED: [2023-07-07 Fri 20:03]
********* DONE Vérifier que l'on est bien dans la zone "high confidence" définie dans le bed
oui:
chr1 408514 414397 . 500 +
********* DONE Vérifier que le variant est bien dans le truth et query
CLOSED: [2023-07-07 Fri 20:14]
truth
chr1 408793 . G C 50 MismatchedRefAllele AttemptedAlleles=C*->G;AttemptedLocus=chr1:408793-408793;SwappedAlleles;callable=CS_HiSeqPE300xGATK_callable,CS_10XLRGATK_callable,CS_CCS15kb_20kbGATK4_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable;callsetnames=HiSeqPE300xGATK,CCS15kb_20kbGATK4,CGnormal,HiSeqPE300xfreebayes,CCS15kb_20kbDV,10XLRGATK;callsets=6;datasetnames=HiSeqPE300x,CCS15kb_20kb,CGnormal,10XChromiumLR;datasets=4;datasetsmissingcall=IonExome,SolidSE75bp;filt=CS_CCS15kb_20kbDV_filt,CS_CCS15kb_20kbGATK4_filt;platformnames=Illumina,PacBio,CG,10X;platforms=4 GT:AD:ADALL:DP:GQ 1/0:153,159:141,132:689:99
query
chr1 408793 . G C 600.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.390;DP=64;ExcessHet=0.0000;FS=0.961;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=9.53;ReadPosRankSum=-3.146;SOR=0.906 GT:AD:DP:GQ:PL 0/1:33,30:63:99:608,0,798
********* DONE Tester happy sur ce variant: non retrouvé
CLOSED: [2023-07-08 Sat 10:42]
Il faut au moins 2 variant + les index (sinon "no chromosome in common")
Donc on augmente la taille de la région
#+begin_src sh :dir ~/code/bisonex/out
bcftools view HG001_GRCh38_1_22_v4_lifted_merged.vcf.gz -o lifted_test.vcf.gz chr1:408793-408900
bcftools view HG001-SRX11061486_SRR14724513-T2T.vcf.gz -o run_test.vcf.gz chr1:408793-408900
bcftools index lifted_test.vcf.gz
bcftools index run_test.vcf.gz
#+end_src
#+begin_src sh
hap.py lifted_test.vcf.gz run_test.vcf.gz --reference ../data/chm13v2.0.fa -o test
#+end_src
On trouve bien les 2 variants. L'indel est retrouvé mais pas le SNP
chr1 408793 . G C 600.64 . BS=408793 GT:BD:BK:BI:BVT:BLT:QQ ./.:.:.:.:NOCALL:nocall:. 0/1:FP:.:tv:SNP:het:600.64
chr1 408874 . CGAA C 1476.6 . BS=408874 GT:BD:BK:BI:BVT:BLT:QQ 0/1:TP:gm:d1_5:INDEL:het:1476.6 0/1:TP:gm:d1_5:INDEL:het:1476.6
********* DONE Changer flag : mismatch -> PASS: OUI
CLOSED: [2023-07-08 Sat 10:42]
********* DONE Changer haplotype 1/0 -> 0/1 truth: idem
CLOSED: [2023-07-08 Sat 10:35]
********* DONE Tester avec vcfeval : idemnt
CLOSED: [2023-07-08 Sat 10:43]
******* DONE Comparer hg38 et T2T: 2x moinsr de variants, trop de FP et FN
CLOSED: [2023-07-07 Fri 18:38]
T2T
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | F
r
2. blat sur la séquence autour du variant
3. identifier quelques reads contenant le variant et regarder leur aligneement en hg38
Après discussion avec Alexis: solution 3
******** Algorithme
1. Extraire les coordonnées en T2T des faux positifs *homozygote*
2. Pour chaque faux positif
1. lister 10 reads contenant le variant
2. pour chacun de ces reads, récupérer la séquence en T2T et GRCh38 via le nom du read dans le bam
3. si la séquence en T2T modifiée par le variant est "identique" à celle en GRCh38, alors on ignore ce faux positif
Note: on ignore les reads qui ont changé de chromosome entre les version
******** DONE Résultat préliminaire
CLOSED: [2023-07-23 Sun 14:30]
cf [[file:~/roam/research/bisonex/code/giab/giab-corrected.csv][script julia]]
3498 faux positifs en moins, soit 0.89 sensibilité
julia> tp=15479
julia> fp=5277
julia> tp/(tp+fp)
0.7457602620928888
julia> tp/(tp+(fp-3498))
0.8969173716537258
On est toujours en dessous des 97%
******** PROJ Corriger proprement VCF ou résultats Happy
******* DONE Vérifier quelques variants sur IGV
CLOSED: [2023-07-09 Sun 17:36]
******* KILL Répartition des FP : cluster ?
CLOSED: [2023-07-09 Sun 17:36]
******* PROJ Examiner les FP restant après correction selon séquence de référence
******* PROJ Examiner les variants supprimé
****** KILL Méthodologie du pangenome
CLOSED: [2023-07-31 Mon 22:29] SCHEDULED: <2023-07-30 Sun>
***** KILL Mail Yannis
CLOSED: [2023-07-08 Sat 10:44]
***** DONE Mail GIAB pour version T2T
CLOSED: [2023-07-07 Fri 18:37]
**** TODO HG002 :hg002:T2T:
**** TODO HG003 :hg003:T2T:
**** TODO HG004 :hg004:T2T:
**** DONE Plot : ashkenazim trio :hg38:
CLOSED: [2023-07-30 Sun 16:49] SCHEDULED: <2023-07-30 Sun 15:00>
:LOGBOOK:
CLOCK: [2023-07-30 Sun 16:06]--[2023-07-30 Sun 16:35] => 0:29
CLOCK: [2023-07-30 Sun 15:39]--[2023-07-30 Sun 15:40] => 0:01
:END:
/Entered on/ [2023-04-16 Sun 17:29]
Refaire résultats
*** KILL Platinum genome
CLOSED: [2023-06-14 Wed 22:37]
https://emea.illumina.com/platinumgenomes.html
*** TODO Séquencer NA12878 :cento:hg001:
Discussion avec Paul : sous-traitant ne nous donnera pas les données, il faut commander l'ADN
**** DONE ADN commandé
CLOSED: [2023-06-30 Fri 22:29]
**** DONE Sauvegarder les données brutes
CLOSED: [2023-07-30 Sun 14:22] SCHEDULED: <2023-07-19 Wed>
K, scality, S
**** KILL Récupérer le fichier de capture
CLOSED: [2023-07-30 Sun 14:25] SCHEDULED: <2023-07-23 Sun>
Candidats donnés dans publication https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8354858/
#+begin_quote
In short, the Nextera Rapid Capture Exome Kit (Illumina, San Diego, CA), the SureSelect Human All Exon kit (Agilent, Santa Clara, CA) or the Twist Human Core Exome was used for enrichment, and a Nextseq500, HiSeq4000, or Novoseq 6000 (Illumina) instrument was used for the actual sequencing, with the average coverage targeted to at least 100× or at least 98% of the target DNA covered 20×.
#+end_quote
Par défaut, on utilisera https://www.twistbioscience.com/products/ngs/alliance-panels#tab-3
ANnonce récente pour nouveau panel Twist : https://www.centogene.com/news-events/news/newsdetails/twist-bioscience-and-centogene-launch-three-panels-to-advance-rare-disease-and-hereditary-cancer-research-and-support-diagnostics
Masi pas de fichier BED
***** DONE Mail centogène
CLOSED: [2023-07-30 Sun 14:22] DEADLINE: <2023-07-23 Sun>
**** TODO Tester Nextera Rapid Capture Exome v1.2 (hg19) :giab:
SCHEDULED: <2023-08-03 Thu 19:00>
https://support.illumina.com/downloads/nextera-rapid-capture-exome-v1-2-product-files.html
**** DONE Tester Agilent SureSelect All Exon V8 (hg38) :giab:
CLOSED: [2023-07-31 Mon 23:09] SCHEDULED: <2023-07-31 Mon>
https://earray.chem.agilent.com/suredesign/index.htm
"Find design"
"Agilent catalog"
Fichiers:
- Regions.bed: Targeted exon intervals, curated and targeted by Agilent Technologies
- MergedProbes.bed: Merged probes for targeted enrichment of exons described in Regions.bed
- Covered.bed: Merged probes and sequences with 95% homology or above
- Padded.bed: Merged probes and sequences with 95% homology or above extended 50 bp at each side
- AllTracks.bed: Targeted regions and covered tracks
#+begin_src sh
nextflow run workflows/compareVCF.nf -profile standard,helios --query=out/2300346867_63118093_NA12878-GRCh38/callVariant/haplotypecaller/2300346867_63118093_NA12878-GRCh38.vcf.gz --outdir=out/2300346867_63118093_NA12878-GRCh38/happy/ --compare=happy -lib lib --capture=capture/Agilent_SureSelect_All_Exons_v8_hg38_Regions.bed --id=HG001 --genome=GRCh38
#+end_src
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL | 423 | 395 | 28 | 915 | 108 | 405 | 4 | 13 | 0.933806 | 0.788235 | 0.442623 | 0.854868 | | | 1.7012987012987013 | 2.7916666666666665 |
| INDEL | PASS | 423 | 395 | 28 | 915 | 108 | 405 | 4 | 13 | 0.933806 | 0.788235 | 0.442623 | 0.854868 | | | 1.7012987012987013 | 2.7916666666666665 |
| SNP | ALL | 20984 | 20600 | 384 | 26080 | 780 | 4703 | 62 | 10 | 0.9817 | 0.963512 | 0.18033 | 0.972521 | 3.0499710592321048 | 2.7596541786743516 | 1.58256372367935 | 1.8978207694018234 |
| SNP | PASS | 20984 | 20600 | 384 | 26080 | 780 | 4703 | 62 | 10 | 0.9817 | 0.963512 | 0.18033 | 0.972521 | 3.0499710592321048 | 2.7596541786743516 | 1.58256372367935 | 1.8978207694018234 |
**** DONE Test Twist Human core Exome (hg38):giab:
CLOSED: [2023-08-01 Tue 23:16] SCHEDULED: <202 3-08-02 Wed>
https://www.twistbioscience.com/resources/data-files/ngs-human-core-exome-panel-bed-file
#+begin_src
nextflow run workflows/compareVCF.nf -profile standard,helios --query=out/2300346867_63118093_NA12878-GRCh38/callVariant/haplotypecaller/2300346867_63118093_NA12878-GRCh38.vcf.gz --outdir=out/2300346867_63118093_NA12878-GRCh38/happy-twist-exome-core/ --compare=happy -lib lib --capture=capture/Twist_Exome_Core_Covered_Targets_hg38.bed --id=HG001 --genome=GRCh38 -bg
#+end_src
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL | 328 | 313 | 15 | 722 | 95 | 309 | 4 | 13 | 0.954268 | 0.769976 | 0.427978 | 0.852273 | | | 1.8584070796460177 | 2.8967391304347827 |
| INDEL | PASS | 328 | 313 | 15 | 722 | 95 | 309 | 4 | 13 | 0.954268 | 0.769976 | 0.427978 | 0.852273 | | | 1.8584070796460177 | 2.8967391304347827 |
| SNP | ALL | 19198 | 18962 | 236 | 23381 | 684 | 3738 | 48 | 10 | 0.987707 | 0.965178 | 0.159873 | 0.976313 | 3.1034188034188035 | 2.859264147830391 | 1.5669565217391304 | 1.8578767123287672 |
| SNP | PASS | 19198 | 18962 | 236 | 23381 | 684 | 3738 | 48 | 10 | 0.987707 | 0.965178 | 0.159873 | 0.976313 | 3.1034188034188035 | 2.859264147830391 | 1.5669565217391304 | 1.8578767123287672 |
**** TODO Test Twist Human core Exome (hg38):giab:
SCHEDULED: <2023-08-03 Thu 20:00>
**** TODO Tester Agilen SureSelect All Exon V8 (hg38) GATK-4.4:giab:
SCHEDULED: <2023-08-03 Thu 20:00>
**** KILL Tester si le panel Twist Alliance VCGS Exome suffit
CLOSED: [2023-07-31 Mon 22:31] SCHEDULED: <2023-07-30 Sun>
**** TODO Comparer happy et happy-vcfeval :giab:
SCHEDULED: <2023-08-03 Thu>
** TODO Insilico :cento:
*** TODO tous les variants centogène
**** DONE Extraire liste des SNVs
CLOSED: [2023-04-22 Sat 17:32] SCHEDULED: <2023-04-17 Mon>
***** DONE Corriger manquant à la main
CLOSED: [2023-04-22 Sat 17:31]
La sortie est sauvegardé dans git-annex : variants_success.csv
***** DONE Automatique
CLOSED: [2023-04-22 Sat 17:31]
**** DONE Convert SNVs : transcript -> génomique
CLOSED: [2023-06-03 Sat 17:16]
***** DONE Variant_recoder
CLOSED: [2023-04-26 Wed 21:21] SCHEDULED: <2023-04-22 Sat>
****** KILL Haskell: 160 manquant : recoded-success.csv
CLOSED: [2023-04-25 Tue 18:32]
La liste des variants a été générée en Haskel l et nettoyée à la main.
On générer une liste de variant pour variant_rec oder et on soumet tout d'un coup.
[[file:~/recherche/bisonex/parsevariants/app/Main.hs][parsevariant]]
#+begin_src haskell
recodeVariant = do
prepareVariantRecod er "variant_success.csv" "renamed.csv"
runVariantRecoder "renamed.csv" "recoded.json"
#+end_src
#+RESULTS:
: <interactive>:4:3-19: error:
: Variable not in scope: runVariantRecoder :: String -> String -> t
: gh
Problème : 160 n'ont pas pu être lu sur 820, probablement à cause du numéro mineur de transcrit
La sortie est sauvegardé dans git-annex : variants-recoded-raw.json.
****** KILL Julia
CLOSED: [2023-04-25 Tue 18:32]
On regénère la liste de variant et on passe à Julia pour préparer l'appel en parallèle à variant recoder
[[file:~/recherche/bisonex/parsevariants/variantRecoder.jl][variantRecoder.jl]]
#+begin_src julia
setupVariantRecoder(unique(init), n)
#+end_src
Puis
#+begin_src sh
parallel -a parallel-recoder.sh --jobs 10
#+end_src
On récupère les résultats
#+begin_src julia
(fails, success) = mergeVariantRecoder(n)
CSV.write(fSuccess, success)
CSV.write(fFailures, fails)
#+end_src
Certains variants ne sont pas trouvé, donc on prépare un nouveau job en enlevant les versionrs mineures des transcrits
#+begin_src julia
# Cleanup json and txt
if isfile(fSuccess) && isfile(fFailures)
foreach(rm, variantRecoderInput())
foreach(rm, variantRecoderOutput())
end
redoFails(fFailures)
#+end_src
Puis
#+begin_src sh
parallel -a parallel-recoder.sh --jobs 3
#+end_src
Il manque encore 70 transcrits
***** DONE Julia avec mobidetails: recode-failures-mobidetails.csv
CLOSED: [2023-04-25 Tue 18:58]
Nouvelle stratégie : on essaie une fois variant recoder.
Pour tous les échecs, on utilise mobidetails (~170).
Si l'ID n'est pas trouvé, on incrémente le numéro de version 2 fois
***** DONE Reste une dizaine à corriger à la main
CLOSED: [2023-04-26 Wed 21:21]
- [X] certains transcrits ont juste été supprimé
- [X] Erreur de parsing, manque souvent un -
#+begin_src julia
lastTryMobidetails("recoded-failures-mobidetails.csv")
#+end_src
***** DONE Fusionner données
CLOSED: [2023-04-26 Wed 22:35]
#+begin_src julia
function mergeAllGenomic()
dNew = mergeAll("recoded-success.csv",
"recoded-failures-mobidetails.csv",
"recoded-failures-mobidetails-redo.csv")
dInit = @chain DataFrame(CSV.File("variant_success.csv")) begin
@transform :transcript = :
transcript .* ":" .* :coding .* :codingPos .* :codingChange
@select :file :transcript :classification :zygosity
@rename :classificationCento = :classification
end
dTmp = outerjoin(dInit, dNew, on = :transcript)
CSV.write("variant_genomic.csv", dTmp)
end
fSuccess = "recoded-success.csv"
fFailures = "recoded-failures.csv"
# variantRecoder(fSuccess, fFailures)
# mobidetailsOnFailures(fFailures)
# lastTryMobidetails("recoded-failures-mobidetails.csv")
mergeAllGenomic()
#+end_src
***** DONE Formatter donner pour simuscop
CLOSED: [2023-04-28 Fri 11:55] SCHEDULED: <2023-04-26 Wed>
**** TODO Extraire liste des CNVs
SCHEDULED: <2023-04-17 Mon>
**** TODO Simuscop :simuscop:
***** DONE Entrainer le modèle sur 63003856/
CLOSED: [2023-04-29 Sat 19:56]
Relancer le modèle pour être sûr
***** DONE Générer fastq avec simuscop (del et ins seulement) 20x
CLOSED: [2023-04-28 Fri 23:35] SCHEDULED: <2023-04-22 Sat>
****** DONE Génerer un profile avec bed de centogène
CLOSED: [2023-04-28 Fri 11:54] SCHEDULED: <2023-04-22 Sat>
NA12878 mais à refaire avec un vrai séquencage
Voir [[*Centogène][Bed Centogène]] pour choix
****** DONE Générer les données en 20x
CLOSED: [2023-04-28 Fri 11:54] SCHEDULED: <2023-04-22 Sat>
capture de cento
****** DONE Regénérer en supprimant les doublons
CLOSED: [2023-04-28 Fri 17:28]
***** DONE Quelle couverture ?
CLOSED: [2023-04-29 Sat 18:26]
ex sur chr11:16,014,966 où on a 11 reads dans la simulation contre 200 !
****** 200 est la plus proche
#+attr_html: :width 500px
[[./simuscop-200-chr1-1.png]]
#+attr_html: :width 500px
[[./simuscop-200-chr1-2.png]]
****** DONE 20x
CLOSED: [2023-04-29 Sat 15:38]
****** DONE 50x
CLOSED: [2023-04-29 Sat 15:38]
****** DONE 100x
CLOSED: [2023-04-29 Sat 15:39]
****** DONE 200x
CLOSED: [2023-04-29 Sat 15:39]
***** DONE Reads mal centrés sur des petits exons seuls
CLOSED: [2023-04-29 Sat 19:56] SCHEDULED: <2023-04-29 Sat>
Capture ok : [[https://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr1%3A153817168%2D153817824&hgsid=296556270_F4fkENLPXHXidi2oALXls2jxNH9l][UCSC]] (track noire)
Mais mauvaise répartitiopn
#+attr_html: :width 800px
[[./simuscop-error.png]]
À tester
- Problème de profile ?
- mauvais patient ?
- mauvaise génération ? -> comparer avec ceux donnés sur github
- nom des chromosomes ?
****** DONE [#A] Tester sur exon 6 GATAD2B pour NC_000001.11:g.153817496A>T
CLOSED: [2023-04-29 Sat 19:56] SCHEDULED: <2023-04-29 Sat>
******* DONE Configuration + Profile 63003856.profile: idem, mal centré
CLOSED: [2023-04-29 Sat 19:18]
Téléchargement des données
#+begin_src sh :dir ~/code/bisonex/test-simuscop
scp meso:/Work/Projects/bisonex/data/genome/GRCh38.p14/genomeRef.fna .
scp meso:Work/Projects/bisonex/data/simuscop/*.profile .
scp -r meso:/Work/Projects/bisonex/data/genome/GRCh38.p13/bwa .
#+end_src
On récupère l'exon (NB: org-mode ne lance pas le code...)
#+begin_src julia
using CSV,DataFramesMeta
d = CSV.read("VCGS_Exome_Covered_Targets_hg38_40.1MB_renamed.bed", header=false, delim="\t", DataFrame)
@subset d :Column1 .== "NC_000001.11" :Column2 .<= 153817496 :Column3 .>= 153817496
#+end_src
NC_000001.11 153817371 153817542
Génération du bed
#+begin_src sh :dir ~/code/bisonex/test-simuscop
echo -e "NC_000001.11\t153817371\t153817542" > gatad2b-exon6.bed
#+end_src
#+RESULTS:
Génération d'un variant
#+begin_src sh :dir ~/code/bisonex/test-simuscop
echo -e "s\tsingle\tNC_000001.11\t153817496\tA\tT\thet"> variant.txt
#+end_src
#+RESULTS:
Génération du fichier de config
#+begin_src sh :dir ~/code/bisonex/test-simuscop
cat > config_wes.txt << EOL
ref = genomeRef.fna
profile = ./63003856.profile
variation = ./variant.txt
target = ./gatad2b-exon6.bed
layout = PE
threads = 1
name = single
output = test-gatad2b
coverage = 20
EOL
#+end_src
#+RESULTS:
On démarre la simulation
#+begin_src sh :dir ~/code/bisonex/test-simuscop
simuReads config_wes.txt
#+end_src
#+RESULTS:
Alignement
#+begin_src sh :dir ~/code/bisonex/test-simuscop
bwa mem -R '@RG\tID:sample\tSM:sample\tPL:ILLUMINA\tPM:Miseq\tCN:lol\tLB:definition_to_add' bwa/genomeRef test-gatad2b/single_1.fq test-gatad2b/single_2.fq | samtools sort -o single.bam
#+end_src
#+RESULTS:
******* DONE Profile github HiSeq2000
CLOSED: [2023-04-29 Sat 19:56]
#+begin_src sh :dir ~/code/bisonex/test-simuscop :result file
wget https://ra
r
2. blat sur la séquence autour du variant
3. identifier quelques reads contenant le variant et regarder leur aligneement en hg38
Après discussion avec Alexis: solution 3
******** Algorithme
1. Extraire les coordonnées en T2T des faux positifs *homozygote*
2. Pour chaque faux positif
1. lister 10 reads contenant le variant
2. pour chacun de ces reads, récupérer la séquence en T2T et GRCh38 via le nom du read dans le bam
3. si la séquence en T2T modifiée par le variant est "identique" à celle en GRCh38, alors on ignore ce faux positif
Note: on ignore les reads qui ont changé de chromosome entre les version
******** DONE Résultat préliminaire
CLOSED: [2023-07-23 Sun 14:30]
cf [[file:~/roam/research/bisonex/code/giab/giab-corrected.csv][script julia]]
3498 faux positifs en moins, soit 0.89 sensibilité
julia> tp=15479
julia> fp=5277
julia> tp/(tp+fp)
0.7457602620928888
julia> tp/(tp+(fp-3498))
0.8969173716537258
On est toujours en dessous des 97%
******** PROJ Corriger proprement VCF ou résultats Happy
******* DONE Vérifier quelques variants sur IGV
CLOSED: [2023-07-09 Sun 17:36]
******* KILL Répartition des FP : cluster ?
CLOSED: [2023-07-09 Sun 17:36]
******* PROJ Examiner les FP restant après correction selon séquence de référence
******* PROJ Examiner les variants supprimé
****** KILL Méthodologie du pangenome
CLOSED: [2023-07-31 Mon 22:29] SCHEDULED: <2023-07-30 Sun>
***** KILL Mail Yannis
CLOSED: [2023-07-08 Sat 10:44]
***** DONE Mail GIAB pour version T2T
CLOSED: [2023-07-07 Fri 18:37]
**** TODO HG002 :hg002:T2T:
**** TODO HG003 :hg003:T2T:
**** TODO HG004 :hg004:T2T:
**** DONE Plot : ashkenazim trio :hg38:
CLOSED: [2023-07-30 Sun 16:49] SCHEDULED: <2023-07-30 Sun 15:00>
:LOGBOOK:
CLOCK: [2023-07-30 Sun 16:06]--[2023-07-30 Sun 16:35] => 0:29
CLOCK: [2023-07-30 Sun 15:39]--[2023-07-30 Sun 15:40] => 0:01
:END:
/Entered on/ [2023-04-16 Sun 17:29]
Refaire résultats
*** KILL Platinum genome
CLOSED: [2023-06-14 Wed 22:37]
https://emea.illumina.com/platinumgenomes.html
*** TODO Séquencer NA12878 :cento:hg001:
Discussion avec Paul : sous-traitant ne nous donnera pas les données, il faut commander l'ADN
**** DONE ADN commandé
CLOSED: [2023-06-30 Fri 22:29]
**** DONE Sauvegarder les données brutes
CLOSED: [2023-07-30 Sun 14:22] SCHEDULED: <2023-07-19 Wed>
K, scality, S
**** KILL Récupérer le fichier de capture
CLOSED: [2023-07-30 Sun 14:25] SCHEDULED: <2023-07-23 Sun>
Candidats donnés dans publication https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8354858/
#+begin_quote
In short, the Nextera Rapid Capture Exome Kit (Illumina, San Diego, CA), the SureSelect Human All Exon kit (Agilent, Santa Clara, CA) or the Twist Human Core Exome was used for enrichment, and a Nextseq500, HiSeq4000, or Novoseq 6000 (Illumina) instrument was used for the actual sequencing, with the average coverage targeted to at least 100× or at least 98% of the target DNA covered 20×.
#+end_quote
Par défaut, on utilisera https://www.twistbioscience.com/products/ngs/alliance-panels#tab-3
ANnonce récente pour nouveau panel Twist : https://www.centogene.com/news-events/news/newsdetails/twist-bioscience-and-centogene-launch-three-panels-to-advance-rare-disease-and-hereditary-cancer-research-and-support-diagnostics
Masi pas de fichier BED
***** DONE Mail centogène
CLOSED: [2023-07-30 Sun 14:22] DEADLINE: <2023-07-23 Sun>
**** TODO Tester Nextera Rapid Capture Exome v1.2 (hg19) :giab:
SCHEDULED: <2023-08-03 Thu 19:00>
https://support.illumina.com/downloads/nextera-rapid-capture-exome-v1-2-product-files.html
**** DONE Tester Agilent SureSelect All Exon V8 (hg38) :giab:
CLOSED: [2023-07-31 Mon 23:09] SCHEDULED: <2023-07-31 Mon>
https://earray.chem.agilent.com/suredesign/index.htm
"Find design"
"Agilent catalog"
Fichiers:
- Regions.bed: Targeted exon intervals, curated and targeted by Agilent Technologies
- MergedProbes.bed: Merged probes for targeted enrichment of exons described in Regions.bed
- Covered.bed: Merged probes and sequences with 95% homology or above
- Padded.bed: Merged probes and sequences with 95% homology or above extended 50 bp at each side
- AllTracks.bed: Targeted regions and covered tracks
#+begin_src sh
nextflow run workflows/compareVCF.nf -profile standard,helios --query=out/2300346867_63118093_NA12878-GRCh38/callVariant/haplotypecaller/2300346867_63118093_NA12878-GRCh38.vcf.gz --outdir=out/2300346867_63118093_NA12878-GRCh38/happy/ --compare=happy -lib lib --capture=capture/Agilent_SureSelect_All_Exons_v8_hg38_Regions.bed --id=HG001 --genome=GRCh38
#+end_src
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL | 423 | 395 | 28 | 915 | 108 | 405 | 4 | 13 | 0.933806 | 0.788235 | 0.442623 | 0.854868 | | | 1.7012987012987013 | 2.7916666666666665 |
| INDEL | PASS | 423 | 395 | 28 | 915 | 108 | 405 | 4 | 13 | 0.933806 | 0.788235 | 0.442623 | 0.854868 | | | 1.7012987012987013 | 2.7916666666666665 |
| SNP | ALL | 20984 | 20600 | 384 | 26080 | 780 | 4703 | 62 | 10 | 0.9817 | 0.963512 | 0.18033 | 0.972521 | 3.0499710592321048 | 2.7596541786743516 | 1.58256372367935 | 1.8978207694018234 |
| SNP | PASS | 20984 | 20600 | 384 | 26080 | 780 | 4703 | 62 | 10 | 0.9817 | 0.963512 | 0.18033 | 0.972521 | 3.0499710592321048 | 2.7596541786743516 | 1.58256372367935 | 1.8978207694018234 |
**** DONE Test Twist Human core Exome (hg38):giab:
CLOSED: [2023-08-01 Tue 23:16] SCHEDULED: <202 3-08-02 Wed>
https://www.twistbioscience.com/resources/data-files/ngs-human-core-exome-panel-bed-file
#+begin_src
nextflow run workflows/compareVCF.nf -profile standard,helios --query=out/2300346867_63118093_NA12878-GRCh38/callVariant/haplotypecaller/2300346867_63118093_NA12878-GRCh38.vcf.gz --outdir=out/2300346867_63118093_NA12878-GRCh38/happy-twist-exome-core/ --compare=happy -lib lib --capture=capture/Twist_Exome_Core_Covered_Targets_hg38.bed --id=HG001 --genome=GRCh38 -bg
#+end_src
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL | 328 | 313 | 15 | 722 | 95 | 309 | 4 | 13 | 0.954268 | 0.769976 | 0.427978 | 0.852273 | | | 1.8584070796460177 | 2.8967391304347827 |
| INDEL | PASS | 328 | 313 | 15 | 722 | 95 | 309 | 4 | 13 | 0.954268 | 0.769976 | 0.427978 | 0.852273 | | | 1.8584070796460177 | 2.8967391304347827 |
| SNP | ALL | 19198 | 18962 | 236 | 23381 | 684 | 3738 | 48 | 10 | 0.987707 | 0.965178 | 0.159873 | 0.976313 | 3.1034188034188035 | 2.859264147830391 | 1.5669565217391304 | 1.8578767123287672 |
| SNP | PASS | 19198 | 18962 | 236 | 23381 | 684 | 3738 | 48 | 10 | 0.987707 | 0.965178 | 0.159873 | 0.976313 | 3.1034188034188035 | 2.859264147830391 | 1.5669565217391304 | 1.8578767123287672 |
**** STRT Test Twist Human core Exome (hg38):giab:
SCHEDULED: <2023-08-03 Thu 20:00>
#+begin_src sh
ID="2300346867_NA12878-63118093_S260-GRCh38"; nextflow run workflows/compareVCF.nf -profile standard,helios --query=out/${ID}/callVariant/haplotypecaller/${ID}.vcf.gz --outdir=out/${ID}/happy-twist-exome-core/ --compare=happy -lib lib --capture=capture/Twist_Exome_Core_Covered_Targets_hg38.bed --id=HG001 --genome=GRCh38 -bg
#+end_src
**** TODO Tester Agilen SureSelect All Exon V8 (hg38) GATK-4.4:giab:
SCHEDULED: <2023-08-03 Thu 20:00>
**** KILL Tester si le panel Twist Alliance VCGS Exome suffit
CLOSED: [2023-07-31 Mon 22:31] SCHEDULED: <2023-07-30 Sun>
**** PROJ Comparer happy et happy-vcfeval :giab:
** TODO Insilico :cento:
*** TODO tous les variants centogène
**** DONE Extraire liste des SNVs
CLOSED: [2023-04-22 Sat 17:32] SCHEDULED: <2023-04-17 Mon>
***** DONE Corriger manquant à la main
CLOSED: [2023-04-22 Sat 17:31]
La sortie est sauvegardé dans git-annex : variants_success.csv
***** DONE Automatique
CLOSED: [2023-04-22 Sat 17:31]
**** DONE Convert SNVs : transcript -> génomique
CLOSED: [2023-06-03 Sat 17:16]
***** DONE Variant_recoder
CLOSED: [2023-04-26 Wed 21:21] SCHEDULED: <2023-04-22 Sat>
****** KILL Haskell: 160 manquant : recoded-success.csv
CLOSED: [2023-04-25 Tue 18:32]
La liste des variants a été générée en Haskel l et nettoyée à la main.
On générer une liste de variant pour variant_rec oder et on soumet tout d'un coup.
[[file:~/recherche/bisonex/parsevariants/app/Main.hs][parsevariant]]
#+begin_src haskell
recodeVariant = do
prepareVariantRecod er "variant_success.csv" "renamed.csv"
runVariantRecoder "renamed.csv" "recoded.json"
#+end_src
#+RESULTS:
: <interactive>:4:3-19: error:
: Variable not in scope: runVariantRecoder :: String -> String -> t
: gh
Problème : 160 n'ont pas pu être lu sur 820, probablement à cause du numéro mineur de transcrit
La sortie est sauvegardé dans git-annex : variants-recoded-raw.json.
****** KILL Julia
CLOSED: [2023-04-25 Tue 18:32]
On regénère la liste de variant et on passe à Julia pour préparer l'appel en parallèle à variant recoder
[[file:~/recherche/bisonex/parsevariants/variantRecoder.jl][variantRecoder.jl]]
#+begin_src julia
setupVariantRecoder(unique(init), n)
#+end_src
Puis
#+begin_src sh
parallel -a parallel-recoder.sh --jobs 10
#+end_src
On récupère les résultats
#+begin_src julia
(fails, success) = mergeVariantRecoder(n)
CSV.write(fSuccess, success)
CSV.write(fFailures, fails)
#+end_src
Certains variants ne sont pas trouvé, donc on prépare un nouveau job en enlevant les versionrs mineures des transcrits
#+begin_src julia
# Cleanup json and txt
if isfile(fSuccess) && isfile(fFailures)
foreach(rm, variantRecoderInput())
foreach(rm, variantRecoderOutput())
end
redoFails(fFailures)
#+end_src
Puis
#+begin_src sh
parallel -a parallel-recoder.sh --jobs 3
#+end_src
Il manque encore 70 transcrits
***** DONE Julia avec mobidetails: recode-failures-mobidetails.csv
CLOSED: [2023-04-25 Tue 18:58]
Nouvelle stratégie : on essaie une fois variant recoder.
Pour tous les échecs, on utilise mobidetails (~170).
Si l'ID n'est pas trouvé, on incrémente le numéro de version 2 fois
***** DONE Reste une dizaine à corriger à la main
CLOSED: [2023-04-26 Wed 21:21]
- [X] certains transcrits ont juste été supprimé
- [X] Erreur de parsing, manque souvent un -
#+begin_src julia
lastTryMobidetails("recoded-failures-mobidetails.csv")
#+end_src
***** DONE Fusionner données
CLOSED: [2023-04-26 Wed 22:35]
#+begin_src julia
function mergeAllGenomic()
dNew = mergeAll("recoded-success.csv",
"recoded-failures-mobidetails.csv",
"recoded-failures-mobidetails-redo.csv")
dInit = @chain DataFrame(CSV.File("variant_success.csv")) begin
@transform :transcript = :transcript .* ":" .* :coding .* :codingPos .* :codingChange
@select :file :transcript :classification :zygosity
@rename :classificationCento = :classification
end
dTmp = outerjoin(dInit, dNew, on = :transcript)
CSV.write("variant_genomic.csv", dTmp)
end
fSuccess = "recoded-success.csv"
fFailures = "recoded-failures.csv"
# variantRecoder(fSuccess, fFailures)
# mobidetailsOnFailures(fFailures)
# lastTryMobidetails("recoded-failures-mobidetails.csv")
mergeAllGenomic()
#+end_src
***** DONE Formatter donner pour simuscop
CLOSED: [2023-04-28 Fri 11:55] SCHEDULED: <2023-04-26 Wed>
**** TODO Extraire liste des CNVs
SCHEDULED: <2023-04-17 Mon>
**** TODO Simuscop :simuscop:
***** DONE Entrainer le modèle sur 63003856/
CLOSED: [2023-04-29 Sat 19:56]
Relancer le modèle pour être sûr
***** DONE Générer fastq avec simuscop (del et ins seulement) 20x
CLOSED: [2023-04-28 Fri 23:35] SCHEDULED: <2023-04-22 Sat>
****** DONE Génerer un profile avec bed de centogène
CLOSED: [2023-04-28 Fri 11:54] SCHEDULED: <2023-04-22 Sat>
NA12878 mais à refaire avec un vrai séquencage
Voir [[*Centogène][Bed Centogène]] pour choix
****** DONE Générer les données en 20x
CLOSED: [2023-04-28 Fri 11:54] SCHEDULED: <2023-04-22 Sat>
capture de cento
****** DONE Regénérer en supprimant les doublons
CLOSED: [2023-04-28 Fri 17:28]
***** DONE Quelle couverture ?
CLOSED: [2023-04-29 Sat 18:26]
ex sur chr11:16,014,966 où on a 11 reads dans la simulation contre 200 !
****** 200 est la plus proche
#+attr_html: :width 500px
[[./simuscop-200-chr1-1.png]]
#+attr_html: :width 500px
[[./simuscop-200-chr1-2.png]]
****** DONE 20x
CLOSED: [2023-04-29 Sat 15:38]
****** DONE 50x
CLOSED: [2023-04-29 Sat 15:38]
****** DONE 100x
CLOSED: [2023-04-29 Sat 15:39]
****** DONE 200x
CLOSED: [2023-04-29 Sat 15:39]
***** DONE Reads mal centrés sur des petits exons seuls
CLOSED: [2023-04-29 Sat 19:56] SCHEDULED: <2023-04-29 Sat>
Capture ok : [[https://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr1%3A153817168%2D153817824&hgsid=296556270_F4fkENLPXHXidi2oALXls2jxNH9l][UCSC]] (track noire)
Mais mauvaise répartitiopn
#+attr_html: :width 800px
[[./simuscop-error.png]]
À tester
- Problème de profile ?
- mauvais patient ?
- mauvaise génération ? -> comparer avec ceux donnés sur github
- nom des chromosomes ?
****** DONE [#A] Tester sur exon 6 GATAD2B pour NC_000001.11:g.153817496A>T
CLOSED: [2023-04-29 Sat 19:56] SCHEDULED: <2023-04-29 Sat>
******* DONE Configuration + Profile 63003856.profile: idem, mal centré
CLOSED: [2023-04-29 Sat 19:18]
Téléchargement des données
#+begin_src sh :dir ~/code/bisonex/test-simuscop
scp meso:/Work/Projects/bisonex/data/genome/GRCh38.p14/genomeRef.fna .
scp meso:Work/Projects/bisonex/data/simuscop/*.profile .
scp -r meso:/Work/Projects/bisonex/data/genome/GRCh38.p13/bwa .
#+end_src
On récupère l'exon (NB: org-mode ne lance pas le code...)
#+begin_src julia
using CSV,DataFramesMeta
d = CSV.read("VCGS_Exome_Covered_Targets_hg38_40.1MB_renamed.bed", header=false, delim="\t", DataFrame)
@subset d :Column1 .== "NC_000001.11" :Column2 .<= 153817496 :Column3 .>= 153817496
#+end_src
NC_000001.11 153817371 153817542
Génération du bed
#+begin_src sh :dir ~/code/bisonex/test-simuscop
echo -e "NC_000001.11\t153817371\t153817542" > gatad2b-exon6.bed
#+end_src
#+RESULTS:
Génération d'un variant
#+begin_src sh :dir ~/code/bisonex/test-simuscop
echo -e "s\tsingle\tNC_000001.11\t153817496\tA\tT\thet"> variant.txt
#+end_src
#+RESULTS:
Génération du fichier de config
#+begin_src sh :dir ~/code/bisonex/test-simuscop
cat > config_wes.txt << EOL
ref = genomeRef.fna
profile = ./63003856.profile
variation = ./variant.txt
target = ./gatad2b-exon6.bed
layout = PE
threads = 1
name = single
output = test-gatad2b
coverage = 20
EOL
#+end_src
#+RESULTS:
On démarre la simulation
#+begin_src sh :dir ~/code/bisonex/test-simuscop
simuReads config_wes.txt
#+end_src
#+RESULTS:
Alignement
#+begin_src sh :dir ~/code/bisonex/test-simuscop
bwa mem -R '@RG\tID:sample\tSM:sample\tPL:ILLUMINA\tPM:Miseq\tCN:lol\tLB:definition_to_add' bwa/genomeRef test-gatad2b/single_1.fq test-gatad2b/single_2.fq | samtools sort -o single.bam
#+end_src
#+RESULTS:
******* DONE Profile github HiSeq2000
CLOSED: [2023-04-29 Sat 19:56]
#+begin_src sh :dir ~/code/bisonex/test-simuscop :result file
wget https://ra