#+title: Bisonex
* Biblio
** Workflow
Comparaison WDL, Cromwell, nextflow
https://www.nature.com/articles/s41598-021-99288-8
Nextflow = bon compromis ?
Comparison alignement, variant caller (2021)
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04144-1
** Étapes du pipeline
*** Variant calling: Haplotype caller
https://gatk.broadinstitute.org/hc/en-us/articles/360035531412
Définis l'algorithme + image
** VCF
*** GT genotype
encoded as alleles values separated by either of ”/” or “|”, e.g. The allele values are 0 for the reference allele (what is in the reference sequence), 1 for the first allele listed in ALT, 2 for the second allele list in ALT and so on. For diploid calls examples could be 0/1 or 1|0 etc. For haploid calls, e.g. on Y, male X, mitochondrion, only one allele value should be given. All samples must have GT call information; if a call cannot be made for a sample at a given locus, ”.” must be specified for each missing allele in the GT field (for example ./. for a diploid). The meanings of the separators are:
/ : genotype unphased
| : genotype phased
** Validation
*** NA12878
**** KILL [[https://precision.fda.gov/challenges/truth/results][fdaPrecision challenge]]
Attention, génome et en hg19 donc comparaison non adaptée ...
**** TODO Best practices for the analytical validation of clinical whole-genome sequencing intended for the diagnosis of germline disease
SCHEDULED: <2023-03-13 lun.>
https://www.nature.com/articles/s41525-020-00154-9
Recommandations générale pour genome, sans données brutes
**** TODO [#A] Performance assessment of variant calling pipelines using human whole exome sequencing and simulated data
SCHEDULED: <2023-03-04 Sat>
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2928-9
1. variant calling seul
2. NA12878 + données simulées
3. exome
4. évalué via F-score
Code disponible ! https://github.com/bharani-lab/WES-Benchmarking-Pipeline_Manoj/tree/master/Script
Résultat: BWA/Novoalign_DeepVariant
Aligneurs
- BWA-MEM 0.7.16
- Bowtie2 2.2.6
- Novoalign 3.08.02
- SOAP 2.21
- MOSAIK 2.2.3
Variantcalling
- GATK HaplotypeCaller 4
- FreeBayes 1.1.0
- SAMtools mpileup 1.7
- DeepVariant r0.4
SNV
| Exome | Pipeline | TP | FP | FN | Sensitivity | Precision | F-Score | FDR |
| 1 | BWA_GATK | 23689 | 1397 | 613 | 0.975 | 0.944 | 0.959 | 0.057 |
| 2 | BWA_GATK | 23946 | 865 | 356 | 0.985 | 0.965 | 0.975 | 0.036 |
indel
| TP | FP | FN | Sensitivity | Precision | F-Score | FDR | |
| 1254 | 72 | 75 | 0.944 | 0.946 | 0.945 | 0.054 | |
| 1309 | 10 | 20 | 0.985 | 0.992 | 0.989 | 0.008 | |
Valeur brutes :
https://static-content.springer.com/esm/art%3A10.1186%2Fs12859-019-2928-9/MediaObjects/12859_2019_2928_MOESM8_ESM.pdf
Autres articles avec même comparaison en exome sur NA12878
- Hwang et al., 2015 studyi
- Highnam et al, 2015
- Cornish and Guda, 2015
Variant Type
| | SNVs & Indels | CNVs (>10Kb) | SVs | Mitochondrial variants | Pseudogenes | REs | Somatic/ mosaic | Literature/Data | Source |
| NA12878 | 100%a | 40% | 0 | 0 | 0 | 0 | 0 | Zook et al18 | NIST |
| Other NIST standard | 71% | 40% | 50% | 0 | 0 | 0 | 0 | Zook et al18 | |
| (e.g. AJ/Asian trios) | | | | | | | | | |
| Platinum | 29% | 0 | 0 | 0 | 0 | 0 | 0 | Eberle et al8 | Platinum |
| Genomes | | | | | | | | | |
| Venter/HuRef | 14% | 40% | 0 | 0 | 0 | 0 | 0 | Trost et al1 | HuRef |
**** Systematic comparison of germline variant calling pipelines cross multiple next-generation sequencers
#+begin_src bibtex
@ARTICLE{Chen2019-fp,
title = "Systematic comparison of germline variant calling pipelines
cross multiple next-generation sequencers",
author = "Chen, Jiayun and Li, Xingsong and Zhong, Hongbin and Meng,
Yuhuan and Du, Hongli",
abstract = "The development and innovation of next generation sequencing
(NGS) and the subsequent analysis tools have gain popularity in
scientific researches and clinical diagnostic applications.
Hence, a systematic comparison of the sequencing platforms and
variant calling pipelines could provide significant guidance to
NGS-based scientific and clinical genomics. In this study, we
compared the performance, concordance and operating efficiency
of 27 combinations of sequencing platforms and variant calling
pipelines, testing three variant calling pipelines-Genome
Analysis Tool Kit HaplotypeCaller, Strelka2 and
Samtools-Varscan2 for nine data sets for the NA12878 genome
sequenced by different platforms including BGISEQ500,
MGISEQ2000, HiSeq4000, NovaSeq and HiSeq Xten. For the variants
calling performance of 12 combinations in WES datasets, all
combinations displayed good performance in calling SNPs, with
their F-scores entirely higher than 0.96, and their performance
in calling INDELs varies from 0.75 to 0.91. And all 15
combinations in WGS datasets also manifested good performance,
with F-scores in calling SNPs were entirely higher than 0.975
and their performance in calling INDELs varies from 0.71 to
0.93. All of these combinations manifested high concordance in
variant identification, while the divergence of variants
identification in WGS datasets were larger than that in WES
datasets. We also down-sampled the original WES and WGS datasets
at a series of gradient coverage across multiple platforms, then
the variants calling period consumed by the three pipelines at
each coverage were counted, respectively. For the GIAB datasets
on both BGI and Illumina platforms, Strelka2 manifested its
ultra-performance in detecting accuracy and processing
efficiency compared with other two pipelines on each sequencing
platform, which was recommended in the further promotion and
application of next generation sequencing technology. The
results of our researches will provide useful and comprehensive
guidelines for personal or organizational researchers in
reliable and consistent variants identification.",
journal = "Sci. Rep.",
publisher = "Springer Science and Business Media LLC",
volume = 9,
number = 1,
pages = "9345",
month = jun,
year = 2019,
copyright = "https://creativecommons.org/licenses/by/4.0",
language = "en"
}
#+end_src
Comparaison de différents pipeline 2019
https://www.nature.com/articles/s41598-019-45835-3
Combinaison
- variant calling = GATK, Strelka2 and Samtools-Varscan2
- sur NA12878
- séquencé sur BGISEQ500, MGISEQ2000, HiSeq4000, NovaSeq and HiSeq Xten.
Conclusion: strelka2 supérieur mais biais sur NA12878 ?
Illumina > BGI pour indel, probablement car reads plus grand
#+begin_quote
For WES datasets, the BGI platforms displayed the superior performance in SNPs
calling while Illumina platforms manifested the better variants calling
performance in INDELs calling, which could be explained by their divergence in
sequencing strategy that producing different length of reads (all BGI platforms
were 100 base pair read length while all Illumina platforms were 150 base pair
read length). The read length effects, as a key factor between two platforms,
would bring alignment bias and error which are higher for short reads and
ultimately affect the variants calling especially the INDELs identification
#+end_quote
*** Débugger variant calling (haplotypecaller)
https://gatk.broadinstitute.org/hc/en-us/articles/360043491652-When-HaplotypeCaller-and-Mutect2-do-not-call-an-expected-variant
https://gatk.broadinstitute.org/hc/en-us/articles/360035891111-Expected-variant-at-a-specific-site-was-not-called
*** Hap.py
Format de sortie :
#+begin_src r
vcf_field_names(vcf, tag = "FORMAT")
#+end_src
#+RESULTS:
: FORMAT BD 1 String Decision for call (TP/FP/FN/N)
: FORMAT BK 1 String Sub-type for decision (match/mismatch type)
: FORMAT BVT 1 String High-level variant type (SNP|INDEL).
: FORMAT BLT 1 String High-level location type (het|homref|hetalt|homa
am = genotype mismatch
lm = allele/haplotype mismatch
. = non vu
**** On vérifie que am = genotype mismatch
référence = T/T
high-confidence = T/C
notre = C/C
#+begin_src sh
bcftools filter -i 'POS=19196584' /Work/Groups/bisonex/data/giab/GRCh38/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz | grep -v '#'
bcftools filter -i 'POS=19196584' ../out/NA12878_NIST7035-dbsnp/variantCalling/haplotypecaller/NA12878_NIST.vcf.gz | grep -v '#'
#+end_src
#+RESULTS:
: NC_000022.11 19196584 . T C 50 PASS platforms=5;platformnames=Illumina,PacBio,10X,Ion,Solid;datasets=5;datasetnames=HiSeqPE300x,CCS15kb_20kb,10XChromiumLR,IonExome,SolidSE75bp;callsets=7;callsetnames=HiSeqPE300xGATK,CCS15kb_20kbDV,CCS15kb_20kbGATK4,HiSeqPE300xfreebayes,10XLRGATK,IonExomeTVC,SolidSE75GATKHC;datasetsmissingcall=CGnormal;callable=CS_HiSeqPE300xGATK_callable,CS_CCS15kb_20kbDV_callable,CS_10XLRGATK_callable,CS_CCS15kb_20kbGATK4_callable,CS_HiSeqPE300xfreebayes_callable GT:PS:DP:ADALL:AD:GQ 0/1:.:781:109,123:138,150:348
: NC_000022.11 19196584 rs1061325 T C 59.32 PASS AC=2;AF=1;AN=2;DB;DP=2;ExcessHet=0;FS=0;MLEAC=1;MLEAF=0.5;MQ=60;QD=29.66;SOR=2.303 GT:AD:DP:GQ:PL 1/1:0,2:2:6:71,6,0
**** On vérifie que lm = allele/haplotype mismatch
référence = CAA/CAA
high-confidence = CA/CA
notre = C/CA
#+begin_src sh
bcftools filter -i 'POS=31277416' /Work/Groups/bisonex/data/giab/GRCh38/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz | grep -v '#'
bcftools filter -i 'POS=31277416' ../out/NA12878_NIST7035-dbsnp/variantCalling/haplotypecaller/NA12878_NIST.vcf.gz | grep -v '#'
#+end_src
#+RESULTS:
: NC_000022.11 31277416 . CA C 50 PASS platforms=3;platformnames=Illumina,PacBio,10X;datasets=3;datasetnames=HiSeqPE300x,CCS15kb_20kb,10XChromiumLR;callsets=4;callsetnames=HiSeqPE300xGATK,CCS15kb_20kbDV,10XLRGATK,HiSeqPE300xfreebayes;datasetsmissingcall=CCS15kb_20kb,CGnormal,IonExome,SolidSE75bp;callable=CS_HiSeqPE300xGATK_callable;difficultregion=GRCh38_AllHomopolymers_gt6bp_imperfectgt10bp_slop5,GRCh38_SimpleRepeat_imperfecthomopolgt10_slop5 GT:PS:DP:ADALL:AD:GQ 1/1:.:465:16,229:0,190:129
: NC_000022.11 31277416 rs57244615 CAA C,CA 389.02 PASS AC=1,1;AF=0.5,0.5;AN=2;BaseQRankSum=0.37;DB;DP=37;ExcessHet=0;FS=0;MLEAC=1,1;MLEAF=0.5,0.5;MQ=60;MQRankSum=0;QD=13.41;ReadPosRankSum=-0.651;SOR=0.572 GT:AD:DP:GQ:PL 1/2:5,10,14:29:64:406,202,313,64,0,88
* Idées
** Validation analytique
mail Yannis : données patients +/- simulées
*** Utiliser données GCAT et uploader le notre ?
https://www.nature.com/articles/ncomms7275
*** [#A] Variant calling : Genome in a bottle : NA12878 + autres
Résumé : https://www.nist.gov/programs-projects/genome-bottle
Manuscript : https://www.nature.com/articles/s41587-019-0054-x.epdf?author_access_token=E_1bL0MtBBwZr91xEsy6B9RgN0jAjWel9jnR3ZoTv0OLNnFBR7rUIZNDXq0DIKdg3w6KhBF8Rz2RWQFFc0St45kC6CZs3cDYc87HNHovbWSOubJHDa9CeJV-pN0BW_mQ0n7cM13KF2JRr_wAAn524w%3D%3D
Article comparant les variant calling : https://www.biorxiv.org/content/10.1101/2020.12.11.422022v1.full.pdf
**** KILL Tester le séquencage aussi
CLOSED: [2023-01-30 lun. 18:30]
Depuis un fastq correspondant à Illumina https://github.com/genome-in-a-bottle/giab_data_indexes
puis on compare le VCF avec les "high confidence"
On séquence directement NA12878 -> inutile pour le pipeline seul
**** TODO Tester seul la partie bioinformatique
Tout résumé ici : https://www.nist.gov/programs-projects/genome-bottle
- methode https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/analysis/Illumina_PlatinumGenomes_NA12877_NA12878_09162015/IlluminaPlatinumGenomes-user-guide.pdf
- vcf
https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/NA12878_HG001/latest/GRCh38/
NB: à quoi correspond https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/analysis/Illumina_PlatinumGenomes_NA12877_NA12878_09162015/hg38/2.0.1/NA12878/ ??
Article comparant les variant calling : https://www.biorxiv.org/content/10.1101/2020.12.11.422022v1.full.pdf
Article pour vcfeval : https://www.nature.com/articles/s41587-019-0054-x
La version 4 ajoute 273 gènes "clinically relevant" https://www.biorxiv.org/content/10.1101/2021.06.07.444885v3.full.pdf
Ajout des zones "difficiles"
https://www.biorxiv.org/content/10.1101/2020.07.24.212712v5.full.pdf
*** [#B] Pipeline : générer patient avec tous les variants retrouvés à Centogene
Comparaison de génération ADN (2019)
https://academic.oup.com/bfg/article/19/1/49/5680294
**** SimuSCop
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03665-5
https://github.com/qasimyu/simuscop
1. Crééer un modèle depuis bam + vcf : Setoprofile
2. Génerer données NGS
** Annotation :
*** Comparaison vep / snpeff et annovar
* Changement nouvelle version
- Dernière version du génome (la version "prête à l'emploi" est seulement GRCh38 sans les version patchées)
* Notes
** Nextflow
*** afficher les résultats d'un process/workflow
#+begin_src
lol.out.view()
#+end_src
Attention, ne fonctionne pas si plusieurs sortie:
#+begin_src
lol.out[0].view()
#+end_src
ou si /a/ est le nom de la sortie
#+begin_src
lol.out.a.view()
#+end_src
** Quelle version du génome ?
Il y a 2 notations pour les chrosome: Refseq (NC_0001) ou chr1, chr2...
dbSNP utilise Refseq
pour le fasta, 2 solutions
- refseq : "https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/${genome}_latest/refseq_identifiers/${fna}.gz"
-> nécessite d'indexer le fichier (long !)
- chromosome https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/
-> nécessite d'annoter les chromosomes pour corriger (avec le fichier gff)
On utilise la version chromosome donc on annote dbSNP (à faire)
** Performances
Ordinateur de Carine (WSL2) : 4h dont 1h15 alignement (parallélisé) et 1h15 haplotypecaller (séquentiel)
** Chromosomes NC, NT, NW
Correspondance :
https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&chromInfoPage=
Signification
https://genome.ucsc.edu/FAQ/FAQdownloads.html#downloadAlt
- alt = séquences alternatives (utilisables)
- fix = patch (correction ou amélioration)
- random = séquence connue sur un chromosome mais non encore utilisée
** Pipelines prêt-à-l’emploi nextflow
Problème : nécessite singularity ou docker (ou conda)
Potentiellement utilisable avec nix...
* Données
** TODO Remplacer bam par fastq sur mesocentre
Commande
*** STRT Supprimer les fastq non "paired"
nushell
Liste des fastq avec "paired-end" manquant
#+begin_src nu
ls **/*.fastq.gz | get name | path basename | split column "_" | get column1 | uniq -u | save single.txt
#+end_src
#+RESULTS:
: 62907927
: 62907970
: 62899606
: 62911287
: 62913201
: 62914084
: 62915905
: 62921595
: 62923065
: 62925220
: 62926503
: 62926502
: 62926500
: 62926499
: 62926498
: 62931719
: 62943423
: 62943400
: 62948290
: 62949205
: 62949206
: 62949118
: 62951284
: 62960792
: 62960785
: 62960787
: 62960617
: 62962561
: 62962692
: 62967473
: 62972194
: 62979102
On vérifie
#+begin_src nu
open single.txt | lines | each {|e| ls $"fastq/*_($in)/*" | get 0 }
open single.txt | lines | each {|e| ls $"fastq/*_($in)/*" | get 0.name } | path basename | split column "_" | get column1 | uniq -c
#+end_src
On met tous dans un dossier (pas de suppression )
#+begin_src
open single.txt | lines | each {|e| ls $"fastq/*_($in)/*" | get 0 } | each {|e| ^mv $e.name bad-fastq/}
#+end_src
On vérifie que les dossiier sont videsj
open single.txt | lines | each {|e| ls $"fastq/*_($in)" | get 0.name } | ^ls -l $in
Puis on supprime
open single.txt | lines | each {|e| ls $"fastq/*_($in)" | get 0.name } | ^rm -r $in
*** TODO Supprimer bam qui ont des fastq
On liste les identifiants des fastq et bam dans un tableau avec leur type :
#+begin_src
let fastq = (ls fastq/*/*.fastq.gz | get name | parse "{dir}/{full_id}/{id}_{R}_001.fastq.gz" | select dir id | uniq )
let bam = (ls bam/*/*.bam | get name | parse "{dir}/{full_id}/{id}_{S}.bqrt.bam" | select dir id)
#+end_src
On groupe les résultat par identifiant (résultats = liste de records qui doit être convertie en table)
et on trie ceux qui n'ont qu'un fastq ou un bam
#+begin_src
let single = ( $bam | append $fastq | group-by id | transpose id files | get files | where {|x| ($x | length) == 1})
#+end_src
On convertit en table et on récupère seulement les bam
#+begin_src
$single | reduce {|it, acc| $acc | append $it} | where dir == bam | get id | each {|e| ^ls $"bam/*_($e)/*.bam"}
#+end_src
#+RESULTS:
: bam/2100656174_62913201/62913201_S52.bqrt.bam
: bam/2100733271_62925220/62925220_S33.bqrt.bam
: bam/2100738763_62926502/62926502_S108.bqrt.bam
: bam/2100746726_62926498/62926498_S105.bqrt.bam
: bam/2100787936_62931955/62931955_S4.bqrt.bam
: bam/2200066374_62948290/62948290_S130.bqrt.bam
: bam/2200074722_62948298/62948298_S131.bqrt.bam
: bam/2200074990_62948306/62948306_S218.bqrt.bam
: bam/2200214581_62967331/62967331_S267.bqrt.bam
: bam/2200225399_62972187/62972187_S85.bqrt.bam
: bam/2200293962_62979117/62979117_S63.bqrt.bam
: bam/2200423985_62999352/62999352_S1.bqrt.bam
: bam/2200495073_63010427/63010427_S20.bqrt.bam
: bam/2200511274_63012586/63012586_S114.bqrt.bam
: bam/2200669188_63036688/63036688_S150.bqrt.bam
* Nouveau workflow
** TODO Bases de données
*** KILL Nix pour télécharger les données brutes
**** Conclusion
Non viable sur cluster car en dehors de /nix/store
On peut utiliser des symlink mais trop compliqué
**** KILL Axel au lieu de curl pour gérer les timeout?
CLOSED: [2022-08-19 Fri 15:18]
*** DONE Tester patch de @pennae pour gros fichiers
SCHEDULED: <2022-08-19 Fri>
*** STRT Télécharger les données avec nextflow
**** DONE Genome de référence
**** DONE dbSNP
**** TODO VEP 20G
Ajout vérification checksum -> à vérifier
**** TODO transcriptome (spip)
Rajouter checksum manuel
**** KILL Refseq
**** STRT OMIM
codé, à vérifier
**** TODO ACMG incidental
*** HOLD Processing bases de données
**** DONE dbSNP common
**** DONE Seulement les ID dans dbSNP common !
CLOSED: [2022-11-19 Sat 21:42]
172G au lieu de 253M...
**** HOLD common dbSNP not clinvar patho
***** DONE Conclusion partielle
CLOSED: [2022-12-12 Mon 22:25]
- vcfeval : prometteur mais n'arrive pas à traiter toutes les régions
- isec : trop de problèmes avec
- classif clinvar directement dans dbSNP: le plus simple
Et ça permet de rattraper quelques erreurs dans le script d'Alexis
***** KILL Utiliser directement le numéro dbSNP dans clinvar ? Non
CLOSED: [2022-11-20 Sun 19:51]
Ex: chr20
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f 'rs%INFO/RS \n' -i 'INFO/RS != "." & INFO/CLNSIG="Pathogenic"' clinvar_chr20.vcf.gz | sort > ID_clinvar_patho.txt
bcftools query -f '%ID\n' dbSNP_common_chr20.vcf.gz | sort > ID_of_common_snp.txt
comm -23 ID_of_common_snp.txt ID_clinvar_patho.txt > ID_of_common_snp_not_clinvar_patho.txt
wc -l ID_of_common_snp_not_clinvar_patho.txt
# sort ID
#+end_src
#+RESULTS:
: 518846 ID_of_common_snp_not_clinvar_patho.txt
Version d'alexis
#+begin_src sh :dir ~/code/bisonex/test_isec
snp=dbSNP_common_chr20.vcf.gz
clinvar=clinvar_chr20_notremapped.vcf.gz
python ../script/pythonScript/clinvar_sbSNP.py \
--clinvar $clinvar \
--chrm_name_table ../database/RefSeq/refseq_to_number_only_consensual.txt \
--dbSNP $snp --output prod.txt
wc -l prod.txt
zgrep '^NC' dbSNP_common_chr20.vcf.gz | wc -l
#+end_src
#+RESULTS:
| 518832 | prod.txt |
| 518846 | |
***** KILL classification clinvar codée dbSNP ?
CLOSED: [2022-12-04 Sun 14:38]
Sur le chromosome 20
*Attention* CLNSIG a plusieurs champs (séparé par une virgule)
On y accède avec INFO/CLNSIG[*]
Ensuite, chaque item peut avoir plusieurs haploïdie (séparé par un |). IL faut donc utiliser une regexp
NB: *ne pas mettre la condition* dans une variable !!
Pour avoir les clinvar patho, on veut 5 mais pas 255 (= autre) pour la classification !`
Il faut également les likely patho et conflicting
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f '%INFO/CLNSIG\n' dbSNP_common_chr20.vcf.gz -i \
'INFO/CLNSIG[*]~"^5|" | INFO/CLNSIG[*]=="5" | INFO/CLNSIG[*]~"|5" | INFO/CLNSIG[*]~"^4|" | INFO/CLNSIG[*]=="4" | INFO/CLNSIG[*]~"|4" | INFO/CLNSIG[*]~"^12|" | INFO/CLNSIG[*]=="12" | INFO/CLNSIG[*]~"|12"' | sort
#+end_src
#+RESULTS:
| . | . | 12 | | | | | | | | |
| . | 12 | 0 | 2 | | | | | | | |
| 2 | 3 | 2 | 2 | 2 | 5 | . | | | | |
| . | 2 | 3 | 2 | 2 | 4 | | | | | |
| . | . | 3 | 12 | 3 | | | | | | |
| . | 5 | 2 | . | | | | | | | |
| . | . | . | 5 | 2 | 2 | | | | | |
| . | 9 | 9 | 9 | 5 | 5 | 2 | 3 | 2 | 3 | 2 |
Si on les exclut :
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f '%ID\n' dbSNP_common_chr20.vcf.gz -e \
'INFO/CLNSIG[*]~"^5|" | INFO/CLNSIG[*]=="5" | INFO/CLNSIG[*]~"|5" | INFO/CLNSIG[*]~"4" | INFO/CLNSIG[*]~"12"' | sort | uniq > common-notpatho.txt
#+end_src
#+RESULTS:
#+begin_src sh :dir ~/code/bisonex/test_isec
snp=dbSNP_common_chr20.vcf.gz
clinvar=clinvar_chr20_notremapped.vcf.gz
python ../script/pythonScript/clinvar_sbSNP.py \
--clinvar $clinvar \
--chrm_name_table ../database/RefSeq/refseq_to_number_only_consensual.txt \
--dbSNP $snp --output tmp.txt
sort tmp.txt | uniq > common-notpatho-alexis.txt
wc -l common-notpatho-alexis.txt
#+end_src
#+RESULTS:
: 518832 common-notpatho-alexis.txt
On en a 6 de plus que la version d'Alexis mais quelques différences
Ceux d'Alexis qui manquent:
#+begin_src sh :dir ~/code/bisonex/test_isec
comm -23 common-notpatho-alexis.txt common-notpatho.txt > alexis-only.txt
cat alexis-only.txt
#+end_src
#+RESULTS:
| rs1064039 |
| rs3833341 |
| rs73598374 |
On les teste dans clinvar et dbSNP
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'ID=@alexis-only.txt' dbSNP_common_chr20.vcf.gz
bcftools query -f '%POS\n' -i 'ID=@alexis-only.txt' dbSNP_common_chr20.vcf.gz > alexis-only-pos.txt
while read -r line; do
bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'POS='$line clinvar_chr20.vcf.gz
done < alexis-only-pos.txt
# bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'POS=23637790' clinvar_chr20.vcf.gz
#+end_src
#+RESULTS:
| 764018 | A | ACAGGTCAAT,ACAGGT | .,5 | 2,. | |
| 23637790 | C | G,T | .,.,12 | | |
| 44651586 | C | A,G,T | .,.,.,5 | 2 | 2 |
| 764018 | A | ACAGGTCAAT | Benign | | |
| 23637790 | C | T | Benign | | |
| 44651586 | C | T | Benign | | |
On a donc une discordance entre clinvar et dbSNP.
On dirait qu'ils ont mal fait l'intersection avec clinvar.
Par exemple https://www.ncbi.nlm.nih.gov/snp/rs3833341#clinical_significance
Tu as l'impression qu'il y a un 1 clinvar bénin et 1 patho.
En cherchant par NM, tu vois qu'il est bénin sur clinvar car il y a d'autres soumissions ! https://www.ncbi.nlm.nih.gov/clinvar/variation/262235/
Confirmation sur nos bases de données :
$ bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'POS=764018' dbSNP_common_chr20.vcf.gz
764018 A ACAGGTCAAT,ACAGGT .,5|2,.
$ bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'POS=764018' clinvar_chr20.vcf.gz
764018 A ACAGGTCAAT Benign
***** KILL Corriger script alexi
CLOSED: [2022-12-04 Sun 13:03]
Gère clinvar patho, probablement patho ou conflicting !
***** HOLD Rtg tools
****** Test
1. Générer SDf file
#+begin_src sh
rtg format genomeRef.fna -o genomeRef.sdf
#+end_src
2. Pour les bases de donnés, il faut l'option --sample ALT sinon on a
#+begin_src
$ rtg vcfeval -b dbSNP_common.vcf.gz -c clinvar.vcf.gz -o test -t genomeRef.sdf/^C
VCF header does not contain a FORMAT field named GQ
Error: Record did not contain enough samples: NC_000001.11 10001 rs1570391677 A,C . PASS RS=1570391677;dbSNPBuildID=154;SSR=0;PSEUDOGENEINFO=DDX11L1:100287102;VC=SNV;R5;GNO;FREQ=KOREAN:0.9891,0.0109,.|SGDP_PRJ:0,1,.|dbGaP_PopFreq:1,.,0;COMMON
#+end_src
Essai intersection clinvar (patho ou non) dbSNP
- faux négatif = dbSNP common qui ne sont pas dans clinvar
- faux positif = clinvar qui ne sont pas dbSNP common
- vrai positif = clinvar qui sont dans dbSNP common
- vrai positif baseline = dbSNP common qui sont dans clinvar
On calcule le nombre de lignes
#+begin_src ssh
zgrep '^[^#]' /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz | wc -l
for i in *.vcf.gz; do echo $i; zgrep '^[^#]' $i | wc -l; done
#+end_src
| clinvar | 1493470 |
| fn.vcf.gz | 22330220 |
| fp.vcf.gz | 1222529 |
| tp-baseline.vcf.gz | 131040 |
| tp.vcf.gz | 136638 |
À noter qu'on ne retrouve pas tout clinvar...
1222529 + 131040 = 1353569 < 1493470
certains régions ne sont pas traitées :
#+begin_quote
Evaluation too complex (50002 unresolved paths, 34891 iterations) at reference region NC_000001.11:790930-790970. Variants in this region will not be included in results
#+end_quote
#+begin_src sh
grep 'not be included' vcfeval.log | wc -l
56192
#+end_src
Le total est quand même inférieur
On veut les clinvar non patho dans dbSNP soit les faux négatif (dbSNP common not contenu dans clinvar patho)
#+begin_src sh
bcftools filter -i 'INFO/CLNSIG="Pathogenic"' /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz -o /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar-patho.vcf.gz
tabix /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar-patho.vcf.gz
#+end_src
On lance le script (dbSNP common et clinvar = 9h)
#+begin_src sh
#!/bin/bash
#SBATCH --nodes=1
#SBATCH -p smp
#SBATCH --time=12:00:00
#SBATCH --mem=12G
dir=/Work/Groups/bisonex/data
dbSNP=$dir/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz
clinvar=$dir/clinvar/GRCh38/clinvar-patho.vcf.gz
genome=$dir/genome/GRCh38.p13/genomeRef.sdf
srun rtg vcfeval -b $dbSNP -c $clinvar -o common-not-patho -t $genome --sample ALT
#+end_src
****** HOLD Voir pour régions complexes non traitées
***** DONE bcftools isec : non
CLOSED: [2022-11-27 Sun 00:38]
#+begin_src sh
bcftools isec dbSNP_common.vcf.gz clinvar.vcf.gz -p common
#+end_src
On vérifie bien que les 2 fichiers commons on le même nombre de lignes
#+begin_src sh
$ grep -e '^NC' 0002.vcf | wc -l
74302
alex@gentoo ~/code/bisonex/data/common $ grep -e '^NC' 0003.vcf | wc -l
74302
#+end_src
****** DONE Impact option -n
CLOSED: [2022-10-23 Sun 13:56]
Mais en spécifiant -n =2:
#+begin_src sh
$ bedtools intersect -a dbSNP_common.vcf.gz -b clinvar.vcf.gz
74978
#+end_src
Si on ne regarde que les variants, on retrouve bien 74302
#+begin_src sh
rg "^NC" none_sorted.vcf | wc -l
#+end_src
NB : test fait avec
#+begin_src
bcftools isec dbSNP_common.vcf.gz clinvar.vcf.gz -c none -n =2 -w 1 | sort > none.vcf
sort common/0003.vcf > common/0003_sorted.vcf
comm -13 common/0003_sorted.vcf none_sorted.vcf
#+end_src
****** DONE Géstion des duplicates: -c none
CLOSED: [2022-10-23 Sun 13:56]
Si on ne garde que ceux avec REF et ALT identiques
#+begin_src sh
bcftools isec dbSNP_common.vcf.gz clinvar.vcf.gz -c none -n =2 -w 1 | wc -l
74978
#+end_src
Si on garde tout
#+begin_src sh
bcftools isec dbSNP_common.vcf.gz clinvar.vcf.gz -c all -n =2 -w 1 | wc -l
137777
#+end_src
Pour regarder la différence :
#+begin_src sh
bcftools isec dbSNP_common.vcf.gz clinvar.vcf.gz -c none -n =2 -w 1 | sort > none_sorted.vcf
bcftools isec dbSNP_common.vcf.gz clinvar.vcf.gz -c all -n =2 -w 1 | sort > all_sorted.vcf
comm -13 none_sorted.vcf all_sorted.vcf | head
#+end_src
Sur un exemple,on a bien des variants différents
****** DONE Suppression des clinvar patho
CLOSED: [2022-10-23 Sun 18:55]
Semble faire le travail vu que dbSNP_commo a 23194960 lignes (donc ~80 000 de moins)
#+begin_src sh
$ bcftools isec -e 'INFO/CLNSIG="Pathogenic" & INFO/CLNSIG="Pathogenic/Likely_pathogenic"' -c none -n~10 dbSNP_common.vcf.gz clinvar.vcf.gz | wc -l
Note: -w option not given, printing list of sites...
23119984
#+end_src
Par contre, l'o'ption -w ou -p fait des ficher "data"...
Après un nouvel essai, plus de problème
#+begin_src
$ bcftools isec -e 'INFO/CLNSIG="Pathogenic" & INFO/CLNSIG="Pathogenic/Likely_pathogenic"' -c none -n=1 dbSNP_common.vcf.gz clinvar.vcf.gz -w 1 -o lol.vcf.gz
$ zcat lol.vcf.gz | wc -l
23120660
#+end_src
À noter le choix de l'option -n qui change entre "=1" et "~10"...
En effet "=1" = au moins 1 fichier et "~10" fait exactement dans le premier et non dans le second
#+begin_src
$ bcftools isec -e 'INFO/CLNSIG="Pathogenic" & INFO/CLNSIG="Pathogenic/Likely_pathogenic"' -c none -n~10 dbSNP_common.vcf.gz clinvar.vcf.gz -w 1 -o lol.vcf.gz
$ zcat lol.vcf.gz | wc -l
23120660
#+end_src
****** DONE Valider avec Alexis : bcftool isec
CLOSED: [2022-11-07 Mon 21:42 ]
****** DONE Pourquoi nombre de lignes différentes avec la version d'Alexis -> isec ne gère pas plusieurs ALT
CLOSED: [2022-11-26 Sat 23:36]
Grosse différence !
#+begin_src
$ wc -l ID_of_common_snp_not_clinvar_patho.txt
23119915 ID_of_common_snp_not_clinvar_patho.txt
$ wc -l /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt
85820 /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt
#+end_src
À noter que tout dbSNP = 23194960
******* Clinvar classe 4 ? Moins mais toujours trop
#+begin_src
$ zgrep '^NC' tmp.vcf.gz | wc -l
21081654
#+end_src
******* Comparer les ID et regarder ceux en plus
#+begin_src sh
bcftools isec -e 'INFO/CLNSIG="Pathogenic"' -c none -n~10 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz -w 1 -o tmp.vcf.gz
zgrep -o -e 'rs[[:digit:]]\' tmp.vcf.gz | sort | id_sorted.txt
sort ../database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt > reference_sorted.txt
comm -23 id_sorted.txt reference_sorted.txt > unique1.txt
#+end_src
Par exemple
#+begin_src sh
zgrep rs1000000561 ../database/dbSNP/dbSNP_common.vcf.gz
#+end_src
NC_000002.12 136732859 rs1000000561 ACG A,ACGCG . PASS RS=1000000561;dbSNPBuildID=151;SSR=0;VC=INDEL;GNO;FREQ=ALSPAC:0.2506,0.7494,.|TOMMO:0.9971,0.002865,.|TWINSUK:0.2473,0.7527,.|dbGaP_PopFreq:0.993,0.006943,8.902e-05;COMMON
Attention, clinvar est en numéro de chromosomoe et dbSNP en NC...
Normalement, géré lors du calcul d'intersection !
Ce SNP n'est pas dans clinvar (vérifié dans UCSC)
******* Tester sur chromosome 20
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools view --regions NC_000020.11 ../database/dbSNP/dbSNP_common.vcf.gz -o dbSNP_common_chr20.vcf.gz
bcftools view --regions 20 ../database/clinvar/clinvar.vcf.gz -o clinvar_chr20.vcf.gz
tabix dbSNP_common_chr20.vcf.gz
tabix clinvar_chr20.vcf.gz
#+end_src
#+RESULTS:
Attention à bien renommer clinvar !
#+begin_src sh :dir ~/code/bisonex/test_isec
mv clinvar_chr20.vcf.gz clinvar_chr20_notremapped.vcf.gz
bcftools annotate --rename-chrs chromosome_mapping.txt clinvar_chr20_notremapped.vcf.gz -o clinvar_chr20.vcf.gz
#+end_src
#+RESULTS:
*ATTENTION*: sans indexer les vcf, les fichiers seront *VIDES*
*ATTENTION*: par défaut les filtres s'appliquent sur les 2. Cela est un problème si on joue sur l'inclusion et non l'exclusion
Attention: vérifier la conventdion de nommage des chromosomes
******** Test pathogene: ne prend pas en compte les multi-allèles ????
On teste l'intersection dbsnp et clinvar patho ainsi que le complémentaire
#+begin_src sh :dir ~/code/bisonex/test_isec
clinvar=clinvar_chr20_patho.vcf.gz
snp=dbSNP_common_chr20.vcf.gz
bcftools index $clinvar
bcftools index $snp
bcftools filter -i 'INFO/CLNSIG="Pathogenic"' clinvar_chr20.vcf.gz -o $clinvar
bcftools isec $snp $clinvar -p tmp
for i in tmp/*.vcf ; do echo $i; grep '^[^#]' $i | wc -l; done
#+end_src
#+RESULTS:
| tmp/0000.vcf |
| 518846 |
| tmp/0001.vcf |
| 0 |
| tmp/0002.vcf |
| 0 |
| tmp/0003.vcf |
| 0 |
Aucun clinvar patho... Clairement faux !
Autre méthode : on inclut tous les SNP et clinvar patho et on regarde ceux uniquement dans dbsnp
#+begin_src sh :dir ~/code/bisonex/test_isec
snp=dbSNP_common_chr20.vcf.gz
clinvar=clinvar_chr20.vcf.gz
bcftools isec -n=2 -i - -i 'INFO/CLNSIG="Pathogenic"' $snp $clinvar -p tmp
# grep '^[^#]' tmp/0000.vcf | wc -l
#+end_src
#+RESULTS:
Soit tout dbsnp donc rien
Note : on ne peut pas exclure les clinvar patho directement
#+begin_src sh :dir ~/code/bisonex/test_isec
snp=dbSNP_common_chr20.vcf.gz
clinvar=clinvar_chr20.vcf.gz
bcftools isec -i - -e 'INFO/CLNSIG="Pathogenic"' $snp $clinvar -p tmp
for i in tmp/*.vcf ; do echo $i; grep '^[^#]' $i | wc -l; done
#+end_src
Car on ne peut plus faire la différence !
Si on utilise la version d'Alexis
#+begin_src sh :dir ~/code/bisonex/test_isec
snp=dbSNP_common_chr20.vcf.gz
clinvar=clinvar_chr20_notremapped.vcf.gz
python ../script/pythonScript/clinvar_sbSNP.py \
--clinvar $clinvar \
--chrm_name_table ../database/RefSeq/refseq_to_number_only_consensual.txt \
--dbSNP $snp --output tmp.txt
sort tmp.txt > common-notpatho-alexis.txt
wc -l common-notpatho-alexis.txt
#+end_src
#+RESULTS:
: 518832 common-notpatho-alexis.txt
Si on cherche les clinvar patho (donc non présent dans la sortie)
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f '%ID\n' dbSNP_common_chr20.vcf.gz | sort > all.txt
sort common-notpatho-alexis.txt > alexis.txt
comm -23 all.txt alexis.txt > patho.txt
#+end_src
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f '%POS\n' -i 'ID=@patho.txt' dbSNP_common_chr20.vcf.gz -o pos.txt
for pos in $(cat pos.txt); do
bcftools query -f '%CHROM %POS %ID %REF %ALT\n' -i 'POS='$pos dbSNP_common_chr20.vcf.gz
bcftools query -f '%CHROM %POS %ID %REF %ALT %INFO/CLNSIG\n' -i 'POS='$pos clinvar_chr20.vcf.gz
echo "------"
done
#+end_src
#+RESULTS:
| NC_000020.11 | 3234173 | rs3827075 | T | A,C,G | |
| NC_000020.11 | 3234173 | 262001 | T | G | Conflicting_interpretations_of_pathogenicity |
| NC_000020.11 | 3234173 | 1072511 | T | TGGCGAAGC | Pathogenic |
| NC_000020.11 | 3234173 | 208613 | TGGCGAAGC | G | Pathogenic |
| NC_000020.11 | 3234173 | 1312 | TGGCGAAGC | T | Pathogenic |
| ------ | | | | | |
| NC_000020.11 | 4699605 | rs1799990 | A | G | |
| NC_000020.11 | 4699605 | 13397 | A | G | Benign/Likely_benign |
| ------ | | | | | |
| NC_000020.11 | 10652589 | rs1131695 | G | A,C,T | |
| NC_000020.11 | 10652589 | 163705 | G | . | Benign |
| NC_000020.11 | 10652589 | 143063 | G | A | Benign |
| NC_000020.11 | 10652589 | 234555 | G | C | Pathogenic |
| ------ | | | | | |
| NC_000020.11 | 10658574 | rs1801138 | G | A,T | |
| NC_000020.11 | 10658574 | 42481 | G | A | Benign |
| NC_000020.11 | 10658574 | 992651 | G | T | Likely_pathogenic |
| NC_000020.11 | 10658574 | 213550 | GC | A | Pathogenic |
| ------ | | | | | |
| NC_000020.11 | 10672794 | rs79338570 | G | A,C | |
| NC_000020.11 | 10672794 | 255557 | G | A | Benign/Likely_benign |
| NC_000020.11 | 10672794 | 594067 | G | C | Conflicting_interpretations_of_pathogenicity |
| NC_000020.11 | 10672794 | 1324603 | G | GGA | Likely_pathogenic |
| ------ | | | | | |
| NC_000020.11 | 18525868 | rs146917730 | C | T | |
| NC_000020.11 | 18525868 | 811603 | C | T | Conflicting_interpretations_of_pathogenicity |
| ------ | | | | | |
| NC_000020.11 | 25390747 | rs373200654 | G | C | |
| NC_000020.11 | 25390747 | 338000 | G | C | Conflicting_interpretations_of_pathogenicity |
| ------ | | | | | |
| NC_000020.11 | 32800145 | rs2424926 | C | G,T | |
| NC_000020.11 | 32800145 | 338173 | C | G | Benign |
| NC_000020.11 | 32800145 | 338174 | C | T | Conflicting_interpretations_of_pathogenicity |
| ------ | | | | | |
| NC_000020.11 | 33412656 | rs35938843 | C | G,T | |
| NC_000020.11 | 33412656 | 220958 | C | T | Conflicting_interpretations_of_pathogenicity |
| ------ | | | | | |
| NC_000020.11 | 45891622 | rs181943893 | G | A,C,T | |
| NC_000020.11 | 45891622 | 459632 | G | C | Conflicting_interpretations_of_pathogenicity |
| NC_000020.11 | 45891622 | 797035 | G | T | Likely_benign |
| NC_000020.11 | 45891622 | 1572689 | GCTA | G | Likely_benign |
| ------ | | | | | |
| NC_000020.11 | 54171651 | rs35873579 | G | A,T | |
| NC_000020.11 | 54171651 | 285894 | G | A | Conflicting_interpretations_of_pathogenicity |
| NC_000020.11 | 54171651 | 1373583 | G | C | Uncertain_significance |
| NC_000020.11 | 54171651 | 895614 | G | T | Benign/Likely_benign |
| ------ | | | | | |
| NC_000020.11 | 62172726 | rs36106901 | G | A | |
| NC_000020.11 | 62172726 | 981031 | G | A | Conflicting_interpretations_of_pathogenicity |
| ------ | | | | | |
| NC_000020.11 | 63349782 | rs1044396 | G | A,C | |
| NC_000020.11 | 63349782 | 93427 | G | A | Benign |
| NC_000020.11 | 63349782 | 857384 | G | C | Conflicting_interpretations_of_pathogenicity |
| ------ | | | | | |
| NC_000020.11 | 63414925 | rs1801545 | G | A,C,T | |
| NC_000020.11 | 63414925 | 194284 | G | A | Conflicting_interpretations_of_pathogenicity |
| NC_000020.11 | 63414925 | 129337 | G | C | Benign |
| NC_000020.11 | 63414925 | 851545 | GG | CA | Uncertain_significance |
| ------ | | | | | |
On a donc plusieurs problèmes :
1. isec devrait fonctionner au moins sur
| NC_000020.11 | 25390747 | rs373200654 | G | C | |
| NC_000020.11 | 25390747 | 338000 | G | C | Conflicting_interpretations_of_pathogenicity |
On teste juste sur cette ligne
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools filter -i 'POS=25390747' clinvar_chr20.vcf.gz -o clinvar_test.vcf.gz
bcftools filter -i 'POS=25390747' dbSNP_common_chr20.vcf.gz -o dbSNP_test.vcf.gz
#+end_src
On retrouve bien la ligne dans l'intersection...
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools filter -i 'POS=25390747' clinvar_chr20.vcf.gz -o clinvar_test.vcf.gz
bcftools index dbSNP_test.vcf.gz dbSNP_test.vcf.gz
bcftools index dbSNP_test.vcf.gz clinvar_test.vcf.gz
bcftools isec dbSNP_test.vcf.gz clinvar_test.vcf.gz -p test
#+end_src
#+RESULTS:
2. isec ne semble pas fonctionner sur en cas d'ALT multiples
| NC_000020.11 | 32800145 | rs2424926 | C | G,T | |
| NC_000020.11 | 32800145 | 338173 | C | G | Benign |
| NC_000020.11 | 32800145 | 338174 | C | T | Conflicting_interpretations_of_pathogenicity |
| | | | | | |
3. s'il y a plusieurs variantions à une position, il faut bien vérifier que tous ne sont pas patho.
La version d'Alexis le fait bien
| NC_000020.11 | 3234173 | rs3827075 | T | A,C,G | |
| NC_000020.11 | 3234173 | 262001 | T | G | Conflicting_interpretations_of_pathogenicity |
| NC_000020.11 | 3234173 | 1072511 | T | TGGCGAAGC | Pathogenic |
| NC_000020.11 | 3234173 | 208613 | TGGCGAAGC | G | Pathogenic |
| NC_000020.11 | 3234173 | 1312 | TGGCGAAGC | T | Pathogenic |
****** DONE Voir si isec gère les multiallélique (chr20) : non, impossible de faire marcher
CLOSED: [2022-11-27 Sun 00:37]
******* DONE chr20 en prenant un patho clinvar aussi dans dbSNP
CLOSED: [2022-11-27 Sun 00:37]
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools filter dbSNP_common_chr20.vcf.gz -i 'POS=10652589' -o test_dbsnp.vcf.gz
bcftools filter clinvar_chr20.vcf.gz -i 'POS=10652589' -o test_clinvar.vcf.gz
bcftools index test_dbsnp.vcf.gz
bcftools index test_clinvar.vcf.gz
#+end_src
#+RESULTS:
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools isec test_dbsnp.vcf.gz test_clinvar.vcf.gz -p tmp
grep '^[^#]' tmp/0002.vcf
grep '^[^#]' tmp/0003.vcf
#+end_src
#+RESULTS:
Même en biallélique, ne fonctionne pas.
Testé en modifiant test_dbsnp !
Fonctionne avec un variant par ligne
****** DONE isec en coupant les sites multialléliques: non
CLOSED: [2022-11-27 Sun 00:37]
******* DONE Exemple simple ok
CLOSED: [2022-11-27 Sun 00:34]
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools filter -i 'POS=10652589' dbSNP_common_chr20.vcf.gz -o dbsnp_mwi.vcf.gz
bcftools filter -i 'POS=10652589' clinvar_chr20.vcf.gz -o clinvar_mwi.vcf.gz
bcftools index -f dbsnp_mwi.vcf.gz
bcftools index -f clinvar_mwi.vcf.gz
bcftools isec dbsnp_mwi.vcf.gz clinvar_mwi.vcf.gz -n=2
#+end_src
#+RESULTS:
Même en biallélique, ne fonctionne pas.
Chr 20
Avec les fichiers du teste précédent
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools norm -m -any dbsnp_mwi.vcf.gz -o dbsnp_mwi_norm.vcf.gz
bcftools index dbsnp_mwi_norm.vcf.gz
bcftools isec dbsnp_mwi_norm.vcf.gz clinvar_mwi.vcf.gz -n=2
#+end_src
#+RESULTS:
| NC_000020.11 | 10652589 | G | A | 11 |
| NC_000020.11 | 10652589 | G | C | 11 |
******* TODO Sur dbSNP chr20 non
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools norm -m -any dbSNP_common_chr20 -o dbSNP_common_chr20_norm.vcf.gz
#+end_src
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools isec -i 'INFO/CLNSIG="Pathogenic"' dbSNP_common_chr20_norm.vcf.gz clinvar_chr20.vcf.gz -p tmp
#+end_src
#+RESULTS:
***** DONE Essai bedtools intersect
#+begin_src sh
bedtools intersect -a dbSNP_common.vcf.gz -b clinvar.vcf.gz
#+end_src
$ wc -l intersect.vcf
220206 intersect.vcf
*** TODO NA12878
Doc: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/NA12878_HG001/latest/README_v4.2.1.txt
Bed : https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/NA12878_HG001/latest/GRCh38/HG001_GRCh38_1_22_v4.2.1_benchmark.bed
https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/NA12878_HG001/latest/GRCh38/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
** TODO Dépendences avec Nix
*** DONE GATK
CLOSED: [2022-10-21 Fri 21:59]
*** WAIT BioDBHTS
Contribuer pull request
*** DONE BioExtAlign
CLOSED: [2022-10-22 Sat 00:38]
*** WAIT BioBigFile
Revoir si on peut utliser kent dernière version
Contribuer pull request
*** HOLD rtg-tools
Convertir clinvar NC
*** DONE simuscop
CLOSED: [2022-12-30 Fri 22:31]
*** DONE Spip
CLOSED: [2022-12-04 Sun 12:49]
Pas de pull request
*** DONE R + packages
CLOSED: [2022-11-19 Sat 21:05]
*** TODO hap.py
https://github.com/Illumina/hap.py
**** DONE Version sans rtgtools avec python 3
CLOSED: [2023-02-02 Thu 22:15]
Procédure pour tester
#+begin_src
nix develop .#hap-py
$ genericBuild
#+end_src
1. Supprimer l’appel à make_dependencies dans cmakelist.txt : on peut tout installer avec nix
2. Patch Roc.cpp pour avoir numeric_limits ( error: 'numeric_limits' is not a member of 'std')
3. ajout de flags de link (essai, error)
set(ZLIB_LIBRARIES -lz -lbz2 -lcurl -lcrypto -llzma)
4. Changer les appels à print en print() dans le code python et suppression de quelques import
[nix-shell:~/source]$ sed -i.orig 's/print \"\(.*\)"/print(\1)/' src/python/*.py
**** DONE Sérialiser json pour écrire données de sorties
CLOSED: [2023-02-17 Fri 19:25]
**** DONE Tester sur example
CLOSED: [2023-02-04 Sat 00:25]
#+begin_src sh
$ cd hap.py
$ ../result/bin/hap.py example/happy/PG_NA12878_chr21.vcf.gz example/happy/NA12878_chr21.vcf.gz -f example/happy/PG_Conf_chr21.bed.gz -o test -r example/chr21.fa
#+end_src
#+RESULTS:
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score |
| INDEL | ALL | 8937 | 7839 | 1098 | 11812 | 343 | 3520 | 45 | 283 | 0.877140 | 0.958635 | 0.298002 | 0.916079 |
| INDEL | PASS | 8937 | 7550 | 1387 | 9971 | 283 | 1964 | 30 | 242 | 0.844803 | 0.964656 | 0.196971 | 0.900760 |
| SNP | ALL | 52494 | 52125 | 369 | 90092 | 582 | 37348 | 107 | 354 | 0.992971 | 0.988966 | 0.414554 | 0.990964 |
| SNP | PASS | 52494 | 46920 | 5574 | 48078 | 143 | 992 | 8 | 97 | 0.893816 | 0.996963 | 0.020633 | 0.942576 |
**** TODO Version avec rtg-tools
**** TODO Faire fonctionner Tests
***** TODO Essai 2 : depuis nix develop:
#+begin_src
nix develop .#hap-py
genericBuild
#+end_src
Lancé initialement à la main, mais on peut maintenant utiliser run_tests
#+begin_src
HCDIR=bin/ ../src/sh/run_tests.sha
#+end_src
- [X] test boost
- [X] multimerge
- [X] hapenum
- [X] fp accuracy
- [X] faulty variant
- leftshift fails
- [X] other vcf
- [X] chr prefix
- [X] gvcf
- [X] decomp
- [X] contig lengt
- [X] integration test
- [ ] scmp fails sur le type
- [X] giab
- [X] performance
- [ ] quantify fails sur le type
- [ ] stratified échec sur les résultats !
- [X] pg counting
- [ ] sompy: ne trouve pas Strelka dans somatic
phases="buildPhase checkPhase installPhase fixupPhase" genericBuild
#+end_src
**** TODO Reproduire les performances precisionchallenge NA12878
https://www.nist.gov/programs-projects/genome-bottle
***** TODO 0GOOR
Le problème venait 1. de l'ADN et 2. du renommage des chromosomes qui était faux
****** DONE HG002
CLOSED: [2023-02-17 Fri 19:31]
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score
INDEL ALL 525466 491355 34111 1156702 57724 605307 9384 25027 0.935084 0.895313 0.523304 0.914766
INDEL PASS 525466 491355 34111 1156702 57724 605307 9384 25027 0.935084 0.895313 0.523304 0.914766
SNP ALL 3365115 3358399 6716 5666020 21995 2284364 4194 1125 0.998004 0.993496 0.403169 0.995745
SNP PASS 3365115 3358399 6716 5666020 21995 2284364 4194 1125 0.998004 0.993496 0.403169 0.995745
TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
NaN NaN 1.528276 2.752637
NaN NaN 1.528276 2.752637
2.100129 1.473519 1.581196 1.795603
2.100129 1.473519 1.581196 1.795603
***** KILL Avec python2
CLOSED: [2023-02-17 Fri 19:25]
****** KILL avec nix
CLOSED: [2023-02-17 Fri 19:25]
conda create -n python2 python=2.7 anaconda
****** KILL avec conda
CLOSED: [2023-02-17 Fri 19:25]
******* Gentoo: regex_error sur test...
Ok avec bash !
#+begin_src
anaconda3/bin/conda create --name py2 python=2.7
conda activate py2
conda install -c bioconda hap.py
#+end_src
******** Faire tourner les tests.
Il faut remplace bin/test_haplotypes par test_haplotypes dans src/sh/run_tests.sh
#+begin_src sh
HGREF=../genome/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta HCDIR=~/anaconda3/envs/py2/bin bash src/sh/run_tests.sh
#+end_src
Echec:
test_haplotypes: /opt/conda/conda-bld/work/hap.py-0.3.7/src/c++/lib/tools/Fasta.cpp:81: MMappedFastaFile::MMappedFastaFile(const string&): Assertion `fd != -1' failed.
unknown location(0): fatal error in "testVariantPrimitiveSplitter": signal: SIGABRT (application abort requested)
/opt/conda/conda-bld/work/hap.py-0.3.7/src/c++/test/test_align.cpp(298): last checkpoint
******** Chr21
HGREF=../genome/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta hap.py example/happy/PG_NA12878_chr21.vcf.gz example/happy/NA12878_chr21.vcf.gz -f example/happy/PG_Conf_chr21.bed.gz -o test
******* Helios
échec
****** KILL avec docker
CLOSED: [2023-02-17 Fri 19:25]
** DONE Exécution
CLOSED: [2022-09-13 Tue 21:37]
*** KILL test Bionix
*** KILL Implémenter execution avec Nix ?
Voir https://academic.oup.com/gigascience/article/9/11/giaa121/5987272?login=false
pour un exemple.
Probablement plus simple d’utiliser Nix pour gestion de l’environnement et snakemake pour l’exécution
Pas d’accès internet depuis le cluster
*** DONE nextflow
CLOSED: [2022-09-13 Tue 21:37]
** TODO Preprocessing avec nextflow
*** TODO Map to reference
**** TODO Sample ID dans header
/Work/Users/apraga/bisonex/out/63003856_S135/preprocessing/baserecalibrator
*** DONE Mark duplicate
CLOSED: [2022-10-09 Sun 22:30]
*** DONE Recalibrate base quality score
CLOSED: [2022-10-09 Sun 22:30]
** DONE Variant calling avec Nextflow
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Haplotype caller
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter variants
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter common snp not clinvar path
CLOSED: [2022-11-07 Mon 23:00]
Voir [[*common dbSNP not clinvar patho][common dbSNP not clinvar patho]]
*** DONE Filter variant only in consensual sequence
CLOSED: [2022-11-08 Tue 22:23]
*** DONE Filter technical variants
CLOSED: [2022-11-19 Sat 21:34]
*** TODO Utilise AVX pour accélerer l'exécution
Sans cela, on a l'avertissement
#+begin_quote
17:28:00.720 INFO PairHMM - OpenMP multi-threaded AVX-accelerated native PairHMM implementation is not supported
17:28:00.721 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/nix/store/cy9ckxqwrkifx7wf02hm4ww1p6lnbxg9-gatk-4.2.4.1/bin/gatk-package-4.2.4.1-local.jar!/com/intel/gkl/native/libgkl_utils.so
17:28:00.733 WARN NativeLibraryLoader - Unable to load libgkl_utils.so from native/libgkl_utils.so (/Work/Users/apraga/bisonex/out/NA12878_NIST7035/preprocessing/applybqsr/libgkl_utils821485189051585397.so: libgomp.so.1: cannot open shared object file: No such file or directory)
17:28:00.733 WARN IntelPairHmm - Intel GKL Utils not loaded
17:28:00.733 WARN PairHMM - ***WARNING: Machine does not have the AVX instruction set support needed for the accelerated AVX PairHmm. Falling back to the MUCH slower LOGLESS_CACHING implementation!
17:28:00.763 INFO ProgressMeter - Starting traversal
#+end_quote
libgomp.so est fourni par gcc donc il faut charger le module
module load gcc@11.3.0/gcc-12.1.0
** TODO Annotation avec nextflow
*** TODO VEP
***** KILL Utiliser --gene-phenotype ?
CLOSED: [2023-03-15 mer. 13:43]
Vu avec alexis : bases de données non à jour
https://www.ensembl.org/info/genome/variation/phenotype/sources_phenotype_documentation.html
***** TODO Plugin pour CADD, pLI, LOEUF ?
https://www.ensembl.org/info/docs/tools/vep/script/vep_plugins.html#cadd
CADD: n’a pas réussi à le faire fonctionner
pLI, LOEUF : non demandé
***** TODO Utiliser l'option --hgvsg pour remplaer hgvsg.r ?
Non fait par Alexis, oubli a priori
***** TODO Ajout spliceAI ?
*** TODO Spip
**** TODO Checksum sur données
*** TODO Filtrer après VEP
**** TODO Remplacer avec simplement bcftools filter ?
*** TODO OMIM
**** TODO Remplacer script R par bcftools ?
**** TODO Remplacer script R par vep ?
*** TODO clinvar
**** TODO Remplacer script R par bcftools ?
**** TODO Remplacer script R par vep ?
*** TODO ACMG incidental
**** TODO Inclure dans vep ?
*** TODO Grantham
*** TODO LRG
*** TODO Gnomad
** DONE Porter exament la version d'Alexis sur Helios
CLOSED: [2023-01-14 Sat 17:56]
Branche "prod"
** STRT Tester version d'alexis avec Nix
*** DONE Ajouter clinvar
CLOSED: [2022-11-13 Sun 19:37]
*** DONE Alignement
CLOSED: [2022-11-13 Sun 12:52]
*** DONE Haplotype caller
CLOSED: [2022-11-13 Sun 13:00]
*** TODO Filter
- [X] depth
- [ ] comon snp not path
Problème avec liste des ID
**** TODO variant annotation
Besoin de vep
*** TODO Variant calling
* Amélioration
** TODO MAJ avec picard
Normalement, GATK inclut picard mais la dernière version utilise picard pour certains outils
https://gatk.broadinstitute.org/hc/en-us/articles/9570266920219--Tool-Documentation-Index
A compléter après validation
*** TODO markduplicates
La dernière version dans la documentation utilise picard !!
** TODO Bwa-mem2 au lieu de bwa mem
https://github.com/bwa-mem2/bwa-mem2
** TODO Parallélisation haplotypecaller
spark est en beta, ne pas utiliser
parallélisation du pauvre : se restreindre à un chromosome avec -L et paralléliser sur le nombre de chromosome
** KILL CRAM au lieu de SAM ?
CLOSED: [2022-12-30 Fri 20:38]
Version compressée de bam mais :
#+begin_quote
All GATK tools that take in mapped read data expect a BAM file as primary format. Some support the CRAM format, but we have observed performance issues when working directly from CRAM files, so in our own work we convert CRAM to BAM first, and we only use CRAM for archival purposes
#+end_quote
Source: https://gatk.broadinstitute.org/hc/en-us/articles/360035890791-SAM-or-BAM-or-CRAM-Mapped-sequence-data-formats
** TODO Quality score recalibration avec un ensemble de fichier
Voir GATK best practice
** KILL Utiliser T-to-T comme références
CLOSED: [2023-01-01 Sun 21:35]
Semble compliqué avec les nouvelles bases de données
** TODO Macro excel
** TODO Utiliser le XML de clinvar
Extraction sous VCF possible avec
https://github.com/SeqOne/clinvcf
** TODO Franklin
*** TODO Classification via API
https://www.postman.com/genoox-ps/workspace/franklin-api-documentation-s-public-workspace/documentation/6621518-4335389d-12e3-445f-8182-339df95b2a09
** Annotation
Liste complète
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9252745/
*** TODO Utilise une version allégée de GnomAD (une seule colonne)
*** TODO Digenisme (cf nomenclature omim)
C’est dans le nom de la maladie
* TODO Tests
:PROPERTIES:
:CATEGORY: tests
:END:
** TODO Non régression : version prod
*** DONE ID common snp
CLOSED: [2022-11-19 Sat 21:36]
#+begin_src
$ wc -l ID_of_common_snp.txt
23194290 ID_of_common_snp.txt
$ wc -l /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp.txt
23194290 /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp.txt
#+end_src
*** DONE ID common snp not clinvar patho
CLOSED: [2022-12-11 Sun 20:11]
**** DONE Vérification du problème
CLOSED: [2022-12-11 Sun 16:30]
Sur le J:
21155134 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt.ref
Version de "non-régression"
21155076 database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt
Nouvelle version
23193391 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt
Si on enlève les doublons
$ sort database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt | uniq > old.txt
$ wc -l old.txt
21107097 old.txt
$ sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt | uniq > new.txt
$ wc -l new.txt
21174578 new.txt
$ sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt.ref | uniq > ref.txt
$ wc -l ref.txt
21107155 ref.txt
Si on regarde la différence
comm -23 ref.txt old.txt
rs1052692
rs1057518973
rs1057518973
rs11074121
rs112848754
rs12573787
rs145033890
rs147889095
rs1553904159
rs1560294695
rs1560296615
rs1560310926
rs1560325547
rs1560342418
rs1560356225
rs1578287542
...
On cherche le premier
bcftools query -i 'ID="rs1052692"' database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 1619351 C A,T
Il est bien patho...
$ bcftools query -i 'POS=1619351' database/clinvar/clinvar.vcf.gz -f '%CHROM %POS %REF %ALT %INFO/CLNSIG\n'
19 1619351 C T Conflicting_interpretations_of_pathogenicity
On vérifie pour tous les autres
$ comm -23 ref.txt old.txt > tocheck.txt
On génère les régions à vérifier (chromosome number:position)
$ bcftools query -i 'ID=@tocheck.txt' database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM\t%POS\n' > tocheck.pos
On génère le mapping inverse (chromosome number -> NC)
$ awk ' { t = $1; $1 = $2; $2 = t; print; } ' database/RefSeq/refseq_to_number_only_consensual.txt > mapping.txt
On remap clinvar
$ bcftools annotate --rename-chrs mapping.txt database/clinvar/clinvar.vcf.gz -o clinvar_remapped.vcf.gz
$ tabix clinvar_remapped.vcf.gz
Enfin, on cherche dans clinvar la classification
$ bcftools query -R tocheck.pos clinvar_remapped.vcf.gz -f '%CHROM %POS %INFO/CLNSIG\n'
$ bcftools query -R tocheck.pos database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM %POS %ID \n' | grep '^NC'
#+RESULTS:
**** DONE Comprendre pourquoi la nouvelle version donne un résultat différent
CLOSED: [2022-12-11 Sun 20:11]
***** DONE Même version dbsnp et clinvar ?
CLOSED: [2022-12-10 Sat 23:02]
Clinvar différent !
$ bcftools stats clinvar.gz
clinvar (Alexis)
SN 0 number of samples: 0
SN 0 number of records: 1492828
SN 0 number of no-ALTs: 965
SN 0 number of SNPs: 1338007
SN 0 number of MNPs: 5562
SN 0 number of indels: 144580
SN 0 number of others: 3714
SN 0 number of multiallelic sites: 0
SN 0 number of multiallelic SNP sites: 0
clinvar (new)
SN 0 number of samples: 0
SN 0 number of records: 1493470
SN 0 number of no-ALTs: 965
SN 0 number of SNPs: 1338561
SN 0 number of MNPs: 5565
SN 0 number of indels: 144663
SN 0 number of others: 3716
SN 0 number of multiallelic sites: 0
SN 0 number of multiallelic SNP sites: 0
***** DONE Mettre à jour clinvar et dbnSNP pour travailler sur les mêm bases
CLOSED: [2022-12-11 Sun 12:10]
Problème persiste
***** DONE Supprimer la conversion en int du chromosome
CLOSED: [2022-12-10 Sat 19:29]
***** KILL Même NC ?
CLOSED: [2022-12-10 Sat 19:29]
$ zgrep "contig=<ID=NC_\(.*\)" clinvar/GRCh38/clinvar.vcf.gz > contig.clinvar
$ diff contig.txt contig.clinvar
< ##contig=<ID=NC_012920.1>
***** DONE Tester sur chromosome 19: ok
CLOSED: [2022-12-11 Sun 13:53]
On prépare les données
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz -o dbSNP_common_19.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz -o clinvar_19.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data-alexis/dbSNP/dbSNP_common.vcf.gz -o dbSNP_common_19_old.vcf.gz
bcftools filter -i 'CHROM="19"' /Work/Groups/bisonex/data-alexis/clinvar/clinvar.vcf.gz -o clinvar_19_old.vcf.gz
#+end_src
On récupère les 2 versions du script
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
git checkout regression ../../script/pythonScript/clinvar_sbSNP.py
cp ../../script/pythonScript/clinvar_sbSNP.py clinvar_sbSNP_old.py
git checkout HEAD ../../script/pythonScript/clinvar_sbSNP.py
#+end_src
#+RESULTS:
On compare
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
python ../../script/pythonScript/clinvar_sbSNP.py clinvar_sbSNP.py --clinvar clinvar_19.vcf.gz --dbSNP dbSNP_common_19.vcf.gz --output tmp.txt
sort tmp.txt | uniq > new.txt
table=/Work/Groups/bisonex/data-alexis/RefSeq/refseq_to_number_only_consensual.txt
python clinvar_sbSNP_old.py --clinvar clinvar_19_old.vcf.gz --dbSNP dbSNP_common_19_old.vcf.gz --output tmp_old.txt --chrm_name_table $table
sort tmp_old.txt | uniq > old.txt
wc -l old.txt new.txt
#+end_src
#+RESULTS:
| 535155 | old.txt |
| 535194 | new.txt |
| 1070349 | total |
Si on prend le premier manquant dans new, il est conflicting patho donc il ne devrait pas y être...
$ bcftools query -i 'ID="rs10418277"' dbSNP
_common_19.vcf.gz -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 54939682 C G,T
$ bcftools query -i 'ID="rs10418277"' dbSNP_common_19_old.vcf.gz -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 54939682 C G,T
$ bcftools query -i 'POS=54939682' clinvar_19.vcf.gz -f '%POS %REF %ALT %INFO/CLNSIG\n'
54939682 C G Conflicting_interpretations_of_pathogenicity
54939682 C T Benign
$ bcftools query -i 'POS=54939682' clinvar_19_old.vcf.gz -f '%POS %REF %ALT %INFO/CLNSIG\n'
54939682 C G Conflicting_interpretations_of_pathogenicity
54939682 C T Benign
$ grep rs10418277 *.txt
new.txt:rs10418277
tmp.txt:rs10418277
Le problème venait de la POS qui n'était plus convertie en int (suppression de la ligne par erreur ??)
On vérifie
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
python ../../script/pythonScript/clinvar_sbSNP.py --clinvar clinvar_19.vcf.gz --dbSNP dbSNP_common_19.vcf.gz --output tmp.txt
sort tmp.txt | uniq > new.txt
table=/Work/Groups/bisonex/data-alexis/RefSeq/refseq_to_number_only_consensual.txt
python clinvar_sbSNP_old.py --clinvar clinvar_19_old.vcf.gz --dbSNP dbSNP_common_19_old.vcf.gz --output tmp_old.txt --chrm_name_table $table
sort tmp_old.txt | uniq > old.txt
wc -l old.txt new.txt
diff old.txt new.txt
#+end_src
#+RESULTS:
| 535155 | old.txt |
| 535155 | new.txt |
| 1070310 | total |
***** DONE Tester sur chromosome 19 et 20: ok
CLOSED: [2022-12-11 Sun 15:56]
On prépare les données
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
bcftools filter -i 'CHROM="NC_000019.10" | CHROM="NC_000020.11"' /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz -o dbSNP_common_19_20.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10" | CHROM="NC_000020.11"' /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz -o clinvar_19_20.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10" | CHROM="NC_000020.11"' /Work/Groups/bisonex/data-alexis/dbSNP/dbSNP_common.vcf.gz -o dbSNP_common_19_20_old.vcf.gz
bcftools filter -i 'CHROM="19" | CHROM="20"' /Work/Groups/bisonex/data-alexis/clinvar/clinvar.vcf.gz -o clinvar_19_20_old.vcf.gz
#+end_src
#+RESULTS:
On récupère les 2 versions du script
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
git checkout regression ../../script/pythonScript/clinvar_sbSNP.py
cp ../../script/pythonScript/clinvar_sbSNP.py clinvar_sbSNP_old.py
git checkout HEAD ../../script/pythonScript/clinvar_sbSNP.py
#+end_src
#+RESULTS:
On compare
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
python ../../script/pythonScript/clinvar_sbSNP.py clinvar_sbSNP.py --clinvar clinvar_19_20.vcf.gz --dbSNP dbSNP_common_19_20.vcf.gz --output tmp.txt
sort tmp.txt | uniq > new.txt
table=/Work/Groups/bisonex/data-alexis/RefSeq/refseq_to_number_only_consensual.txt
python clinvar_sbSNP_old.py --clinvar clinvar_19_20_old.vcf.gz --dbSNP dbSNP_common_19_20_old.vcf.gz --output tmp_old.txt --chrm_name_table $table
sort tmp_old.txt | uniq > old.txt
wc -l old.txt new.txt
#+end_src
***** DONE Regarder la répartition des différences
CLOSED: [2022-12-11 Sun 16:29]
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt | uniq > notpatho.new
sort /Work/Groups/bisonex/data-alexis/dbSNP/ID_of_common_snp_not_clinvar_patho.txt | uniq > notpatho.old
comm -23 notpatho.new notpatho.old > nopatho.diff
#+end_src
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
bcftools query -i 'ID=@nopatho.diff' /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz -f '%CHROM\n' | sort | uniq -c
#+end_src
On a principalement des coordonnées non consensuelles (non "NC_", voir notes)
#+RESULTS:
: 2 NC_000002.12
: 18 NC_000003.12
: 2 NC_000004.12
: 2 NC_000005.10
: 14 NC_000006.12
: 6 NC_000007.14
: 2 NC_000009.12
: 1 NC_000010.11
: 6 NC_000014.9
: 1 NC_000015.10
: 3 NC_000016.10
: 3 NC_000017.11
: 1 NC_000019.10
: 1 NC_000020.11
: 1 NC_000021.9
: 2 NC_000022.11
: 16018 NT_113793.3
: 17010 NT_113796.3
: 14 NT_113891.3
: 1 NT_167244.2
: 13 NT_167245.2
: 2 NT_167246.2
: 13 NT_167247.2
: 7 NT_167248.2
: 14 NT_167249.2
: 14857 NT_187361.1
: 92 NT_187367.1
: 1 NT_187369.1
: 13 NT_187381.1
: 54 NT_187383.1
: 6 NT_187499.1
: 46 NT_187502.1
: 13754 NT_187513.1
: 611 NT_187517.1
: 1 NT_187520.1
: 1 NT_187524.1
: 249 NT_187526.1
: 18 NT_187532.1
: 1 NT_187546.1
: 886 NT_187562.1
: 1 NT_187564.1
: 346 NT_187576.1
: 13 NT_187600.1
: 5 NT_187601.1
: 494 NT_187606.1
: 1 NT_187607.1
: 12 NT_187613.1
: 307 NT_187614.1
: 1 NT_187625.1
: 445 NT_187633.1
: 43 NT_187648.1
: 18 NT_187649.1
: 1 NT_187652.1
: 512 NT_187661.1
: 18 NT_187678.1
: 49 NT_187681.1
: 1 NT_187682.1
: 18 NT_187688.1
: 12 NT_187689.1
: 18 NT_187690.1
: 18 NT_187691.1
: 404 NT_187693.1
: 2 NW_003315952.3
: 1 NW_003315970.2
: 203 NW_003571054.1
: 322 NW_003571055.2
: 16 NW_003571056.2
: 16 NW_003571057.2
: 16 NW_003571058.2
: 16 NW_003571059.2
: 16 NW_003571060.1
: 213 NW_003571061.2
: 2 NW_009646201.1
: 322 NW_009646205.1
: 321 NW_009646206.1
: 371 NW_012132914.1
: 1 NW_012132915.1
: 13 NW_012132918.1
: 2 NW_013171801.1
: 1 NW_013171807.1
: 49 NW_015148966.1
: 14 NW_015495298.1
: 2 NW_015495299.1
: 1 NW_016107298.1
: 4 NW_017363813.1
: 2 NW_017852933.1
: 1 NW_018654722.1
: 38 NW_021160001.1
: 1 NW_021160003.1
: 1 NW_021160007.1
: 7 NW_021160017.1
***** DONE Regarder la différence avec la version sans les sites non consensuels: ok !
CLOSED: [2022-12-11 Sun 20:11]
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt | uniq > notpatho.new
sort /Work/Groups/bisonex/data-alexis/dbSNP/ID_of_common_snp_not_clinvar_patho.txt | uniq > notpatho.old
comm -13 notpatho.new notpatho.old > notpatho.diff
wc -l
#+end_src
#+RESULTS:
: 528 notpatho.diff
Il manque 528 variants
rs1057520103
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
bcftools query -i 'ID=@notpatho.diff' /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz -f '%CHROM\n' | sort | uniq -c
#+end_src
#+RESULTS:
: 528 NC_012920.1
Donc la nouvelle version fonctionne mieux !
ON vérifie bien qu'ils sont dans l'ancienne version et la nouvelle:
$ grep -w -f notpatho.diff /Work/Groups/bisonex/data-alexis/dbSNP/ID_of_common_snp_not_clinvar_patho.txt | wc -l
528
$ grep -w -f notpatho.diff /Work/Groups/bisonex/data/d
bSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt
#+end_src
**** DONE Supprimer les sites non consensuels
CLOSED: [2022-12-11 Sun 19:51]
**** DONE Rajouter les mitochondries (vu avec Paul)
CLOSED: [2022-12-13 Tue 17:26]
Ok avec notre version générée. Sur le J: 21155134
$ wc -l dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt
21155065 dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt
$ wc -l ../data-alexis/dbSNP/ID_of_common_snp_not_clinvar_patho.txt
21155065 ../data-alexis/dbSNP/ID_of_common_snp_not_clinvar_patho.txt
La différence vient probablement d'une vieille version de clinvar
**** TODO Comprendre la différence nouvelle version et prod
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
#sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt | uniq > notpatho.new
#sort /Work/Groups/bisonex/data-alexis/dbSNP/ID_of_common_snp_not_clinvar_patho.txt | uniq > notpatho.old
comm -13 notpatho.new notpatho.old > missing-from-old
comm -23 notpatho.new notpatho.old > missing-from-new
wc -l missing-from-old
wc -l missing-from-new
#+end_src
#+RESULTS:
| 75 | missing-from-old |
| 6 | missing-from-new |
Il manque 75 variants et on a 6 en trop
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
bcftools query -i 'ID=@missing-from-old' /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz -f '%CHROM\n' | sort | uniq -c
#+end_src
#+RESULTS:
| 16 | NC_000001.11 |
| 2 | NC_000002.12 |
| 18 | NC_000003.12 |
| 7 | NC_000004.12 |
| 1 | NC_000005.10 |
| 5 | NC_000006.12 |
| 3 | NC_000007.14 |
| 2 | NC_000009.12 |
| 1 | NC_000010.11 |
| 5 | NC_000011.10 |
| 3 | NC_000015.10 |
| 1 | NC_000016.10 |
| 4 | NC_000017.11 |
| 2 | NC_000019.10 |
| 1 | NC_000020.11 |
| 3 | NC_000022.11 |
| 1 | NC_000023.11 |
| 2 | NT_113891.3 |
| 2 | NT_167245.2 |
| 2 | NT_167247.2 |
| 1 | NT_167248.2 |
| 2 | NT_167249.2 |
| 18 | NT_187532.1 |
| 1 | NT_187562.1 |
| 1 | NT_187633.1 |
| 18 | NT_187649.1 |
| 18 | NT_187678.1 |
| 18 | NT_187688.1 |
| 12 | NT_187689.1 |
| 18 | NT_187690.1 |
| 18 | NT_187691.1 |
| 1 | NW_013171807.1 |
Donc la nouvelle version fonctionne mieux !
ON vérifie bien qu'ils sont dans l'ancienne version et la nouvelle:
$ grep -w -f notpatho.diff /Work/Groups/bisonex/data-alexis/dbSNP/ID_of_common_snp_not_clinvar_patho.txt | wc -l
528
$ grep -w -f notpatho.diff /Work/Groups/bisonex/data/d
bSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt
#+end_src
*** DONE Comparer les versions
CLOSED: [2023-01-26 jeu. 17:42]
**** DONE bases de données
CLOSED: [2023-01-26 jeu. 17:42]
***** DONE Genome de référénce:
CLOSED: [2023-01-06 Fri 00:00]
Version calculée
$ find . -type f -name "genomeRef.*" -exec sh -c 'echo {}; sha256sum {}' \;
e0761a7ba5d10de9e7e97fa331667963925531c0199575bcceafbb13c3147e3f ./genomeRef.fna
d121084c35037763ea58c59726545eaa1c11025a7bf2d75634677c72ddb72fd1 ./genomeRef.dict
0a6e215314659929dbcdffc1881714e311e3d149bdc33978a6f5e28206fcc675 ./genomeRef.fna.fai
45a4aa0d8dc1095d090b13e4df180763b2e24d133ec81f026beaead6d41ebafc ./bwa/genomeRef.ann
ac6e465f230da6d9f8339ebdf4cb05bcfc47d03f3a3889cae7256983c7809210 ./bwa/genomeRef.bwt
058ffaf8cd38e7bc33c31e86e54e99869a8a4fbabb6737d7420d6b89e8b5988e ./bwa/genomeRef.sa
f665b64275eb76111463966bcb8e91e550c63c9b58263d43e19bae8552be2815 ./bwa/genomeRef.amb
d0d3731d1203cb4a0d0dd1279c37c85480b85a91da2d1dc543ac391ff927c272 ./bwa/genomeRef.pac
Version de référence
$ find . -type f -exec sh -c 'echo {}; sha256sum {}' \;
e0761a7ba5d10de9e7e97fa331667963925531c0199575bcceafbb13c3147e3f ./GRCh38_latest_genomic.fna
dd87d628a8179fc1e37a33b99101eece8282bec88dc17b1998a07ff0b912d4a3 ./GRCh38_latest_genomic.fna.index
0a6e215314659929dbcdffc1881714e311e3d149bdc33978a6f5e28206fcc675 ./GRCh38_latest_genomic.fna.fai
974cc313146aedd9ec2ae3f86b382b1317180e078621c1d53e8a803d4ec0d3a9 ./GRCh38_latest_genomic.dict
ac6e465f230da6d9f8339ebdf4cb05bcfc47d03f3a3889cae7256983c7809210 ./GRCh38_latest_genomic.fna.bwt
45a4aa0d8dc1095d090b13e4df180763b2e24d133ec81f026beaead6d41ebafc ./GRCh38_latest_genomic.fna.ann
f665b64275eb76111463966bcb8e91e550c63c9b58263d43e19bae8552be2815 ./GRCh38_latest_genomic.fna.amb
058ffaf8cd38e7bc33c31e86e54e99869a8a4fbabb6737d7420d6b89e8b5988e ./GRCh38_latest_genomic.fna.sa
d0d3731d1203cb4a0d0dd1279c37c85480b85a91da2d1dc543ac391ff927c272 ./GRCh38_latest_genomic.fna.pac
Dict ok si on renome le ficdhier d'origine
#+begin_src sh :dir /ssh:meso:/Work/Groups/bisonex/
sed 's/UR:.*/UR:genomeRef.fna/' data-alexis-reference/genome/GRCh38_latest_genomic.dict > lol.dict
diff lol.dict data/genome/GRCh38.p13/genomeRef.dict
#+end_src
#+RESULTS:
***** DONE dbSNP et dbSNP common: ok
CLOSED: [2023-01-03 Tue 23:17]
sha256sum GCF_000001405.39.gz
452e1112b6339a9b19821c2a226a8a3ba946e92a47e03e6ae464ef8820ee130d GCF_000001405.39.gz
sha256sum data-alexis-reference/dbSNP/GCF_000001405.39.gz
452e1112b6339a9b19821c2a226a8a3ba946e92a47e03e6ae464ef8820ee130d data-alexis-reference/dbSNP/GCF_000001405.39.gz"
sha256sum dbSNP_common.vcf.gz
70dfd9be859c39916598d23b5744cc1fbda04add5840cd90a6d0cd005bd3075b dbSNP_common.vcf.gz
sha256sum data-alexis-reference/dbSNP/dbSNP_common.vcf.gz
70dfd9be859c39916598d23b5744cc1fbda04add5840cd90a6d0cd005bd3075b data-alexis-reference/dbSNP/dbSNP_common.vcf.gz
***** DONE Clinvar : version différente
CLOSED: [2023-01-06 Fri 22:51]
$ zgrep -v '^#' data-alexis-reference/clinvar/clinvar.vcf.gz | wc -l
1474547
$ zgrep -v '^#' data/clinvar/GRCh38/clinvar.vcf.gz | wc -l
1571404
***** DONE Revérifer checksum de GRCh38 et dbSNP
CLOSED: [2023-01-26 jeu. 17:42]
****** DONE GRCh38
CLOSED: [2023-01-26 jeu. 17:42]
/Work/Projects/bisonex/data-alexis-reference/genome>sha256sum GRCh38_latest_genomic.fna
e0761a7ba5d10de9e7e97fa331667963925531c0199575bcceafbb13c3147e3f GRCh38_latest_genomic.fna
PS J:\bases_de_donnees\genome> cat .\checksum.txt
e0761a7ba5d10de9e7e97fa331667963925531c0199575bcceafbb13c3147e3f GRCh38_latest_genomic.fna
****** DONE dbSNP
CLOSED: [2023-01-26 jeu. 17:42]
/Work/Projects/bisonex/data-alexis-reference/dbSNP>sha256sum GCF_000001405.39.gz 01/26/2023 04:40:48 PM
452e1112b6339a9b19821c2a226a8a3ba946e92a47e03e6ae464ef8820ee130d GCF_000001405.39.gz
PS J:\bases_de_donnees\dbSNP> cat .\checksums.txt
452e1112b6339a9b19821c2a226a8a3ba946e92a47e03e6ae464ef8820ee130d GCF_000001405.39.gz
**** DONE Outils
CLOSED: [2023-01-26 jeu. 17:42]
| | Prod | Test |
| VCFtools | 0.1.17 | 0.1.16 |
| bcftools | 1.14 | 1.16 |
| samtools | 1.14 | 1.13 |
| gatk | 4.2.4.1 | 4.3.0.0 |
| bwa | ? | 0.7.17-r1188 |
On a des versions plus vieilles sauf (le plus important) Gatk
*** HOLD 63003856_S135
**** KILL Notes : bonne reproductibilité sur le cluster mais diff avec la version "de prod"
CLOSED: [2023-01-09 Mon 23:03]
tester sequential puis version spécifique bwa mem
Note: clinvar est plus ancien dans la version d'Alexis, cela explique les 8884 de la version pseudo-prod (via ID not clinvar patho)
Attention : le nombre de lignes = celles sans commentaires (et pas just NC... car il devrait y avoir les mitochondries !)
$ find . -name filter-depth.vcf -exec sh -c 'echo {}; grep -c -v '^#' {}' \;
Attention, on a un vcf.gz pour la version de test !!! ne pas utiliser le vcf
| | prod | prod | prod | prod | prod | test | test | test |
|---------------------+------------+-----------+------------+------------+------------+------------+------------+------------|
| PC | Karine | helios | helios | helios | Helios | Helios | Helios | Helios |
| cores | 4 | 4 | 4 | 24 | 1 | 24 | 24 | 1 |
| gatk | 4.2.4.1 | 4.2.4.1 | 4.2.4.1 | 4.2.4.1 | 4.2.4.1 | 4.2.4.1 | 4.3.0 | 4.2.3.1 |
| samtools | 1.10 | 1.10 | 1.13 | 1.13 | 1.13 | 1.13 | | |
| clinvar | 2022-09-03 | | 2022-09-03 | 2022-09-03 | 2022-09-03 | 2022-12-03 | 2022-12-03 | 2022-12-03 |
|---------------------+------------+-----------+------------+------------+------------+------------+------------+------------|
| bwa mem | 128077207 | 128077207 | 128077207 | 128077211 | 128077210 | 128077211 | 128077211 | 128077210 |
| cleanSam | 128077207 | | | | 128077210 | 128077211 | 128077211 | 128077210 |
| applybqsr | | | | 128077211 | | | | |
| haplotypecaller | 1528059 | 1528066 | 1528066 | 1528093 | 1528089 | 1528093 | 1528093 | 1528093 |
| DP_over_30 | 84708 | 84716 | 84716 | 84724 | 84725 | 84724 | 84724 | 84725 |
| not_SNP | 11362 | 11366 | 11366 | 11377 | 11384 | NA | NA | NA |
| consensual_sequence | 8864 | 8868 | 8868 | 8884 | 8886 | 8898 | 8898 | 8900 |
| tsv | 1087 | | 1121 | | | | | |
***** Convention
post_cleanSam = _cleaned.bam
post_markDuplicate = _marked_dup.bam
post_BaseRecalibrator = _recal.table
post_ApplyBQSR = _recalibrated_hg38.bam
post_haplotypecaller = .vcf
post_depth_filter = _DP_over_30.vcf
post_exclude_SNP = _DP_over_30_not_SNP
post_consensual = _DP_over_30_not_SNP_consensual_sequence.vcf
post_technical = _DP_over_30_not_SNP_consensual_sequence_not_technical.vcf.gz
On a vérifié que grep -c et grep | wc -l donnent le même résultat
#
**** KILL Gatk 4.3.0
CLOSED: [2023-01-04 Wed 19:16]
***** KILL Alignement
CLOSED: [2023-01-04 Wed 19:16]
****** DONE Brut
CLOSED: [2022-12-26 Mon 22:03]
Bam alexis
$ samtools view -c /Work/Groups/bisonex/ref_63003856_S135/63003856_S135.bam
128077207
Notre
9f/26cf3d] Cached process > preprocess:BWA_MEM (63003856_S135)
$ samtools view -c work/9f/26cf3deb07b425a3e851be2a7bd782/63003856_S135.bam
On vérifie la sortie
$ samtools view -c out/63003856_S135/preprocessing/mapped/63003856_S135.bam
128077211
Petite différence (< 1e-8) mais selon Alexis, bwa mem est non reproductible. d'autant qu'on utilise une version parallélisée
128077211
****** On vérifie les arguments: ok
#+begin_src
bwa mem \
-R '@RG\tID:sample\tSM:sample\tPL:ILLUMINA\tPM:Miseq\tCN:CHU_Minjoz\tLB:definition_to_add' \
-t 24 \
$INDEX \
63003856_S135_R1_001.fastq.gz 63003856_S135_R2_001.fastq.gz \
| samtools sort --threads 24 -o 63003856_S135.bam -
#+end_src
#+begin_src
bwa mem -R "@RG\tID:$sample\tSM:$sample\tPL:ILLUMINA\tPM:Miseq\tCN:CHU_Minjoz\tLB:definition_to_add" -v 2 -t `nproc` $genomeRef ${fastq[0]} ${fastq[1]} | $samtools sort -@ `nproc` -O BAM -o $tmpDir/$post_bwa
#+end_src
****** DONE Avec gatk 4.2 (version alexis) : idem
$ samtools view -c mapped/63003856_S135.bam
128077211
****** DONE Nettoyé
CLOSED: [2022-12-26 Mon 22:08]
[57/4b5b4c] Cached process > preprocess:GATK4_CLEANSAM (63003856_S135)
$ samtools view -c work/57/4b5b4c647b98bb7099c4d1ba24bd75/63003856_S135.bam
128077211
Et la sortie
$ samtools view -c out/63003856_S135/preprocessing/clean-sam/sorted.bam
128077211
contre
$ samtools view -c /Work/Groups/bisonex/ref_63003856_S135/63003856_S135_cleaned.bam
128077207
On regarde les stats en détails (de la version nettoyée)
$ samtools flagstat work/57/4b5b4c647b98bb7099c4d1ba24bd75/63003856_S135.bam
128077211 + 0 in total (QC-passed reads + QC-failed reads)
126905130 + 0 primary
0 + 0 secondary
1172081 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
127941054 + 0 mapped (99.89% : N/A)
126768973 + 0 primary mapped (99.89% : N/A)
126905130 + 0 paired in sequencing
63452565 + 0 read1
63452565 + 0 read2
125263664 + 0 properly paired (98.71% : N/A)
126676024 + 0 with itself and mate mapped
92949 + 0 singletons (0.07% : N/A)
979608 + 0 with mate mapped to a different chr
675398 + 0 with mate mapped to a different chr (mapQ>=5)
$ samtools flagstat /Work/Groups/bisonex/ref_63003856_S135/63003856_S135_cleaned.bam
128077207 + 0 in total (QC-passed reads + QC-failed reads)
126905130 + 0 primary
0 + 0 secondary
1172077 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
127941051 + 0 mapped (99.89% : N/A)
126768974 + 0 primary mapped (99.89% : N/A)
126905130 + 0 paired in sequencing
63452565 + 0 read1
63452565 + 0 read2
125263790 + 0 properly paired (98.71% : N/A)
126676026 + 0 with itself and mate mapped
92948 + 0 singletons (0.07% : N/A)
979618 + 0 with mate mapped to a different chr
675412 + 0 with mate mapped to a different chr (mapQ>=5)
****** DONE mark duplicate
CLOSED: [2022-12-26 Mon 22:27]
Alexis
$ samtools view -c /Work/Groups/bisonex/ref_63003856_S135/63003856_S135_marked_dup.bam
128077207
Nous (pas de sortie dans out/)
$ samtools view -c work/46/bd75b4547452af36ee2c6b45362922/63003856_S135
128077211
logique car on ne supprime pas de donné...
******* Arguments ok
#+begin_src
gatk --java-options "-Xmx3g" MarkDuplicates \
--INPUT sorted.bam \
--OUTPUT marked_dups.bam \
--METRICS_FILE marked_dups.bam.metrics \
--TMP_DIR . \
--REFERENCE_SEQUENCE genomeRef.fna \
#+end_src
#+begin_src
$gatk MarkDuplicates \
-I $tmpDir/$post_cleanSam \
-O $tmpDir/$post_markDuplicate \
-M $tmpDir/"$sample"_marked_dup.metrix \
--CREATE_INDEX true \
--VERBOSITY WARNING
#+end_src
****** KILL baserecalibrator
CLOSED: [2023-01-04 Wed 19:15]
#+begin_src
$gatk BaseRecalibrator \
-I $tmpDir/$post_markDuplicate \
-R $genomeRef \
--known-sites $dbsnpDir/dbSNP_common.vcf.gz \
-O $tmpDir/$post_BaseRecalibrator
#+end_src
$ cd /Work/Users/apraga/bisonex/out/63003856_S135/preprocessing/baserecalibrator
$ sed 's/sample/63003856_S135/' 63003856_S135.table > 63003856_S135.table2
Les fichiers n'ont pas le même nombre d'erreurs mais assez proches. Sur le premier table, 3 score
#:GATKTable:3:94:%d:%d:%d:;
#:GATKTable:Quantized:Quality quantization map
QualityScore Count QuantizedScore
11 298878631 11
25 542282996 25
34 12846268833 34
vs (référence0)
11 298877785 11
25 542282089 25
34 12846264839 34
******* options ok
#+begin_src
gatk --java-options "-Xmx3g" BaseRecalibrator \
--input marked_dups.bam \
--output 63003856_S135.table \
--reference genomeRef.fna \
\
--known-sites dbSNP_common.vcf.gz \
--tmp-dir . \
#+end_src
****** KILL applybqsr
CLOSED: [2023-01-04 Wed 19:15]
options ok
#+begin_src
gatk --java-options "-Xmx3g" ApplyBQSR \
--input marked_dups.bam \
--output 63003856_S135.bam \
--reference genomeRef.fna \
--bqsr-recal-file 63003856_S135.table \
\
--tmp-dir . \
#+end_src
#+begin_src
$gatk ApplyBQSR -R $genomeRef \
-I $tmpDir/$post_markDuplicate \
--bqsr-recal-file $tmpDir/$post_BaseRecalibrator \
-O $bamDir/$post_ApplyBQSR \
--verbosity WARNING
#+end_src
missing file
***** KILL Variant caling
CLOSED: [2023-01-04 Wed 19:16]
****** KILL haplotypecaller
CLOSED: [2023-01-04 Wed 19:15]
******** options ok
#+begin_src
gatk --java-options "-Xmx3g" HaplotypeCaller \
--input 63003856_S135.bam \
--output 63003856_S135.vcf.gz \
--reference genomeRef.fna \
--dbsnp dbSNP.gz \
\
\
--tmp-dir . \
--max-mnp-distance 2
#+end_src
#+begin_src
$gatk --java-options "-Xmx32g" HaplotypeCaller \
-R $genomeRef \
-I $bamDir/$post_ApplyBQSR \
-O $vcfDir/$post_haplotypecaller \
-D "$dbsnpDir"/GCF_000001405.39.gz \
--max-mnp-distance 2 \
--verbosity WARNING
#+end_src
******** DONE Nombres lignes gatk 4.3.0 : trop différent
CLOSED: [2023-01-04 Wed 19:11]
$ grep '^NC' out/63003856_S135/variantCalling/haplotypecaller/63003856_S135.vcf | wc -l
1631935
$ grep '^NC' /Work/Groups/bisonex/ref-vcf/63003856_S135 .vcf | wc -l
1506894
******** DONE Regarder les flags d'haplotypecaller : nombreuses différences...
CLOSED: [2023-01-04 Wed 19:02]
| --dbsnp /Work/Users/apraga/bisonex/work/08/fca52ac598f21a2812f866bd590792/dbSNP.gz | --dbsnp /mnt/j/bases_de_donnees/dbSNP/GCF_000001405.39.gz |
| --max-mnp-distance 2 | --max-mnp-distance 2 |
| --output 63003856_S135.vcf.gz | --output /mnt/j/working_directory_pipeline_analyse_exome/vcf/63003856_S135.vcf |
| --input 63003856_S135.bam | --input /mnt/j/working_directory_pipeline_analyse_exome/bam/63003856_S135_recalibrated_hg38.bam |
| --reference genomeRef.fna | --reference /mnt/j/bases_de_donnees/genome/GRCh38_latest_genomic.fna |
| --tmp-dir . | --verbosity WARNING |
| --use-posteriors-to-calculate-qual false | --use-posteriors-to-calculate-qual false |
| --dont-use-dragstr-priors false | --dont-use-dragstr-priors false |
| --use-new-qual-calculator true | --use-new-qual-calculator true |
| --annotate-with-num-discovered-alleles false | --annotate-with-num-discovered-alleles false |
| --heterozygosity 0.001 | --heterozygosity 0.001 |
| --indel-heterozygosity 1.25E-4 | --indel-heterozygosity 1.25E-4 |
| --heterozygosity-stdev 0.01 | --heterozygosity-stdev 0.01 |
| --standard-min-confidence-threshold-for-calling 30.0 | --standard-min-confidence-threshold-for-calling 30.0 |
| --max-alternate-alleles 6 | --max-alternate-alleles 6 |
| --max-genotype-count 1024 | --max-genotype-count 1024 |
| --sample-ploidy 2 | --sample-ploidy 2 |
| --num-reference-samples-if-no-call 0 | --num-reference-samples-if-no-call 0 |
| --genotype-assignment-method USE_PLS_TO_ASSIGN | --genotype-assignment-method USE_PLS_TO_ASSIGN |
| --contamination-fraction-to-filter 0.0 | --contamination-fraction-to-filter 0.0 |
| --output-mode EMIT_VARIANTS_ONLY | --output-mode EMIT_VARIANTS_ONLY |
| --all-site-pls false | --all-site-pls false |
| --flow-likelihood-parallel-threads 0 | --gvcf-gq-bands 1 |
| --flow-likelihood-optimized-comp false | --gvcf-gq-bands 2 |
| --flow-use-t0-tag false | --gvcf-gq-bands 3 |
| --flow-probability-threshold 0.003 | --gvcf-gq-bands 4 |
| --flow-remove-non-single-base-pair-indels false | --gvcf-gq-bands 5 |
| --flow-remove-one-zero-probs false | --gvcf-gq-bands 6 |
| --flow-quantization-bins 121 | --gvcf-gq-bands 7 |
| --flow-fill-empty-bins-value 0.001 | --gvcf-gq-bands 8 |
| --flow-symmetric-indel-probs false | --gvcf-gq-bands 9 |
| --flow-report-insertion-or-deletion false | --gvcf-gq-bands 10 |
| --flow-disallow-probs-larger-than-call false | --gvcf-gq-bands 11 |
| --flow-lump-probs false | --gvcf-gq-bands 12 |
| --flow-retain-max-n-probs-base-format false | --gvcf-gq-bands 13 |
| --flow-probability-scaling-factor 10 | --gvcf-gq-bands 14 |
| --flow-order-cycle-length 4 | --gvcf-gq-bands 15 |
| --flow-number-of-uncertain-flows-to-clip 0 | --gvcf-gq-bands 16 |
| --flow-nucleotide-of-first-uncertain-flow T | --gvcf-gq-bands 17 |
| --keep-boundary-flows false | --gvcf-gq-bands 18 |
| --gvcf-gq-bands 1 | --gvcf-gq-bands 19 |
| --gvcf-gq-bands 2 | --gvcf-gq-bands 20 |
| --gvcf-gq-bands 3 | --gvcf-gq-bands 21 |
| --gvcf-gq-bands 4 | --gvcf-gq-bands 22 |
| --gvcf-gq-bands 5 | --gvcf-gq-bands 23 |
| --gvcf-gq-bands 6 | --gvcf-gq-bands 24 |
| --gvcf-gq-bands 7 | --gvcf-gq-bands 25 |
| --gvcf-gq-bands 8 | --gvcf-gq-bands 26 |
| --gvcf-gq-bands 9 | --gvcf-gq-bands 27 |
| --gvcf-gq-bands 10 | --gvcf-gq-bands 28 |
| --gvcf-gq-bands 11 | --gvcf-gq-bands 29 |
| --gvcf-gq-bands 12 | --gvcf-gq-bands 30 |
| --gvcf-gq-bands 13 | --gvcf-gq-bands 31 |
| --gvcf-gq-bands 14 | --gvcf-gq-bands 32 |
| --gvcf-gq-bands 15 | --gvcf-gq-bands 33 |
| --gvcf-gq-bands 16 | --gvcf-gq-bands 34 |
| --gvcf-gq-bands 17 | --gvcf-gq-bands 35 |
| --gvcf-gq-bands 18 | --gvcf-gq-bands 36 |
| --gvcf-gq-bands 19 | --gvcf-gq-bands 37 |
| --gvcf-gq-bands 20 | --gvcf-gq-bands 38 |
| --gvcf-gq-bands 21 | --gvcf-gq-bands 39 |
| --gvcf-gq-bands 22 | --gvcf-gq-bands 40 |
| --gvcf-gq-bands 23 | --gvcf-gq-bands 41 |
| --gvcf-gq-bands 24 | --gvcf-gq-bands 42 |
| --gvcf-gq-bands 25 | --gvcf-gq-bands 43 |
| --gvcf-gq-bands 26 | --gvcf-gq-bands 44 |
| --gvcf-gq-bands 27 | --gvcf-gq-bands 45 |
| --gvcf-gq-bands 28 | --gvcf-gq-bands 46 |
| --gvcf-gq-bands 29 | --gvcf-gq-bands 47 |
| --gvcf-gq-bands 30 | --gvcf-gq-bands 48 |
| --gvcf-gq-bands 31 | --gvcf-gq-bands 49 |
| --gvcf-gq-bands 32 | --gvcf-gq-bands 50 |
| --gvcf-gq-bands 33 | --gvcf-gq-bands 51 |
| --gvcf-gq-bands 34 | --gvcf-gq-bands 52 |
| --gvcf-gq-bands 35 | --gvcf-gq-bands 53 |
| --gvcf-gq-bands 36 | --gvcf-gq-bands 54 |
| --gvcf-gq-bands 37 | --gvcf-gq-bands 55 |
| --gvcf-gq-bands 38 | --gvcf-gq-bands 56 |
| --gvcf-gq-bands 39 | --gvcf-gq-bands 57 |
| --gvcf-gq-bands 40 | --gvcf-gq-bands 58 |
| --gvcf-gq-bands 41 | --gvcf-gq-bands 59 |
| --gvcf-gq-bands 42 | --gvcf-gq-bands 60 |
| --gvcf-gq-bands 43 | --gvcf-gq-bands 70 |
| --gvcf-gq-bands 44 | --gvcf-gq-bands 80 |
| --gvcf-gq-bands 45 | --gvcf-gq-bands 90 |
| --gvcf-gq-bands 46 | --gvcf-gq-bands 99 |
| --gvcf-gq-bands 47 | --floor-blocks false |
| --gvcf-gq-bands 48 | --indel-size-to-eliminate-in-ref-model 10 |
| --gvcf-gq-bands 49 | --disable-optimizations false |
| --gvcf-gq-bands 50 | --dragen-mode false |
| --gvcf-gq-bands 51 | --apply-bqd false |
| --gvcf-gq-bands 52 | --apply-frd false |
| --gvcf-gq-bands 53 | --disable-spanning-event-genotyping false |
| --gvcf-gq-bands 54 | --transform-dragen-mapping-quality false |
| --gvcf-gq-bands 55 | --mapping-quality-threshold-for-genotyping 20 |
| --gvcf-gq-bands 56 | --max-effective-depth-adjustment-for-frd 0 |
| --gvcf-gq-bands 57 | --just-determine-active-regions false |
| --gvcf-gq-bands 58 | --dont-genotype false |
| --gvcf-gq-bands 59 | --do-not-run-physical-phasing false |
| --gvcf-gq-bands 60 | --do-not-correct-overlapping-quality false |
| --gvcf-gq-bands 70 | --use-filtered-reads-for-annotations false |
| --gvcf-gq-bands 80 | --adaptive-pruning false |
| --gvcf-gq-bands 90 | --do-not-recover-dangling-branches false |
| --gvcf-gq-bands 99 | --recover-dangling-heads false |
| --floor-blocks false | --kmer-size 10 |
| --indel-size-to-eliminate-in-ref-model 10 | --kmer-size 25 |
| --disable-optimizations false | --dont-increase-kmer-sizes-for-cycles false |
| --dragen-mode false | --allow-non-unique-kmers-in-ref false |
| --flow-mode NONE | --num-pruning-samples 1 |
| --apply-bqd false | --min-dangling-branch-length 4 |
| --apply-frd false | --recover-all-dangling-branches false |
| --disable-spanning-event-genotyping false | --max-num-haplotypes-in-population 128 |
| --transform-dragen-mapping-quality false | --min-pruning 2 |
| --mapping-quality-threshold-for-genotyping 20 | --adaptive-pruning-initial-error-rate 0.001 |
| --max-effective-depth-adjustment-for-frd 0 | --pruning-lod-threshold 2.302585092994046 |
| --just-determine-active-regions false | --pruning-seeding-lod-threshold 9.210340371976184 |
| --dont-genotype false | --max-unpruned-variants 100 |
| --do-not-run-physical-phasing false | --linked-de-bruijn-graph false |
| --do-not-correct-overlapping-quality false | --disable-artificial-haplotype-recovery false |
| --use-filtered-reads-for-annotations false | --enable-legacy-graph-cycle-detection false |
| --use-flow-aligner-for-stepwise-hc-filtering false | --debug-assembly false |
| --adaptive-pruning false | --debug-graph-transformations false |
| --do-not-recover-dangling-branches false | --capture-assembly-failure-bam false |
| --recover-dangling-heads false | --num-matching-bases-in-dangling-end-to-recover -1 |
| --kmer-size 10 | --error-correction-log-odds -Infinity |
| --kmer-size 25 | --error-correct-reads false |
| --dont-increase-kmer-sizes-for-cycles false | --kmer-length-for-read-error-correction 25 |
| --allow-non-unique-kmers-in-ref false | --min-observations-for-kmer-to-be-solid 20 |
| --num-pruning-samples 1 | --base-quality-score-threshold 18 |
| --min-dangling-branch-length 4 | --dragstr-het-hom-ratio 2 |
| --recover-all-dangling-branches false | --dont-use-dragstr-pair-hmm-scores false |
| --max-num-haplotypes-in-population 128 | --pair-hmm-gap-continuation-penalty 10 |
| --min-pruning 2 | --expected-mismatch-rate-for-read-disqualification 0.02 |
| --adaptive-pruning-initial-error-rate 0.001 | --pair-hmm-implementation FASTEST_AVAILABLE |
| --pruning-lod-threshold 2.302585092994046 | --pcr-indel-model CONSERVATIVE |
| --pruning-seeding-lod-threshold 9.210340371976184 | --phred-scaled-global-read-mismapping-rate 45 |
| --max-unpruned-variants 100 | --disable-symmetric-hmm-normalizing false |
| --linked-de-bruijn-graph false | --disable-cap-base-qualities-to-map-quality false |
| --disable-artificial-haplotype-recovery false | --enable-dynamic-read-disqualification-for-genotyping false |
| --enable-legacy-graph-cycle-detection false | --dynamic-read-disqualification-threshold 1.0 |
| --debug-assembly false | --native-pair-hmm-threads 4 |
| --debug-graph-transformations false | --native-pair-hmm-use-double-precision false |
| --capture-assembly-failure-bam false | --bam-writer-type CALLED_HAPLOTYPES |
| --num-matching-bases-in-dangling-end-to-recover -1 | --dont-use-soft-clipped-bases false |
| --error-correction-log-odds -Infinity | --min-base-quality-score 10 |
| --error-correct-reads false | --smith-waterman JAVA |
| --kmer-length-for-read-error-correction 25 | --emit-ref-confidence NONE |
| --min-observations-for-kmer-to-be-solid 20 | --force-call-filtered-alleles false |
| --likelihood-calculation-engine PairHMM | --soft-clip-low-quality-ends false |
| --base-quality-score-threshold 18 | --allele-informative-reads-overlap-margin 2 |
| --dragstr-het-hom-ratio 2 | --smith-waterman-dangling-end-match-value 25 |
| --dont-use-dragstr-pair-hmm-scores false | --smith-waterman-dangling-end-mismatch-penalty -50 |
| --pair-hmm-gap-continuation-penalty 10 | --smith-waterman-dangling-end-gap-open-penalty -110 |
| --expected-mismatch-rate-for-read-disqualification 0.02 | --smith-waterman-dangling-end-gap-extend-penalty -6 |
| --pair-hmm-implementation FASTEST_AVAILABLE | --smith-waterman-haplotype-to-reference-match-value 200 |
| --pcr-indel-model CONSERVATIVE | --smith-waterman-haplotype-to-reference-mismatch-penalty -150 |
| --phred-scaled-global-read-mismapping-rate 45 | --smith-waterman-haplotype-to-reference-gap-open-penalty -260 |
| --disable-symmetric-hmm-normalizing false | --smith-waterman-haplotype-to-reference-gap-extend-penalty -11 |
| --disable-cap-base-qualities-to-map-quality false | --smith-waterman-read-to-haplotype-match-value 10 |
| --enable-dynamic-read-disqualification-for-genotyping false | --smith-waterman-read-to-haplotype-mismatch-penalty -15 |
| --dynamic-read-disqualification-threshold 1.0 | --smith-waterman-read-to-haplotype-gap-open-penalty -30 |
| --native-pair-hmm-threads 4 | --smith-waterman-read-to-haplotype-gap-extend-penalty -5 |
| --native-pair-hmm-use-double-precision false | --min-assembly-region-size 50 |
| --flow-hmm-engine-min-indel-adjust 6 | --max-assembly-region-size 300 |
| --flow-hmm-engine-flat-insertion-penatly 45 | --active-probability-threshold 0.002 |
| --flow-hmm-engine-flat-deletion-penatly 45 | --max-prob-propagation-distance 50 |
| --pileup-detection false | --force-active false |
| --pileup-detection-enable-indel-pileup-calling false | --assembly-region-padding 100 |
| --num-artificial-haplotypes-to-add-per-allele 5 | --padding-around-indels 75 |
| --artifical-haplotype-filtering-kmer-size 10 | --padding-around-snps 20 |
| --pileup-detection-snp-alt-threshold 0.1 | --padding-around-strs 75 |
| --pileup-detection-indel-alt-threshold 0.5 | --max-extension-into-assembly-region-padding-legacy 25 |
| --pileup-detection-absolute-alt-depth 0.0 | --max-reads-per-alignment-start 50 |
| --pileup-detection-snp-adjacent-to-assembled-indel-range 5 | --enable-legacy-assembly-region-trimming false |
| --pileup-detection-bad-read-tolerance 0.0 | --interval-set-rule UNION |
| --pileup-detection-proper-pair-read-badness true | --interval-padding 0 |
| --pileup-detection-edit-distance-read-badness-threshold 0.08 | --interval-exclusion-padding 0 |
| --pileup-detection-chimeric-read-badness true | --interval-merging-rule ALL |
| --pileup-detection-template-mean-badness-threshold 0.0 | --read-validation-stringency SILENT |
| --pileup-detection-template-std-badness-threshold 0.0 | --seconds-between-progress-updates 10.0 |
| --bam-writer-type CALLED_HAPLOTYPES | --disable-sequence-dictionary-validation false |
| --dont-use-soft-clipped-bases false | --create-output-bam-index true |
| --override-fragment-softclip-check false | --create-output-bam-md5 false |
| --min-base-quality-score 10 | --create-output-variant-index true |
| --smith-waterman JAVA | --create-output-variant-md5 false |
| --emit-ref-confidence NONE | --max-variants-per-shard 0 |
| --force-call-filtered-alleles false | --lenient false |
| --reference-model-deletion-quality 30 | --add-output-sam-program-record true |
| --soft-clip-low-quality-ends false | --add-output-vcf-command-line true |
| --allele-informative-reads-overlap-margin 2 | --cloud-prefetch-buffer 40 |
| --smith-waterman-dangling-end-match-value 25 | --cloud-index-prefetch-buffer -1 |
| --smith-waterman-dangling-end-mismatch-penalty -50 | --disable-bam-index-caching false |
| --smith-waterman-dangling-end-gap-open-penalty -110 | --sites-only-vcf-output false |
| --smith-waterman-dangling-end-gap-extend-penalty -6 | --help false |
| --smith-waterman-haplotype-to-reference-match-value 200 | --version false |
| --smith-waterman-haplotype-to-reference-mismatch-penalty -150 | --showHidden false |
| --smith-waterman-haplotype-to-reference-gap-open-penalty -260 | --QUIET false |
| --smith-waterman-haplotype-to-reference-gap-extend-penalty -11 | --use-jdk-deflater false |
| --smith-waterman-read-to-haplotype-match-value 10 | --use-jdk-inflater false |
| --smith-waterman-read-to-haplotype-mismatch-penalty -15 | --gcs-max-retries 20 |
| --smith-waterman-read-to-haplotype-gap-open-penalty -30 | --gcs-project-for-requester-pays |
| --smith-waterman-read-to-haplotype-gap-extend-penalty -5 | --disable-tool-default-read-filters false |
| --flow-assembly-collapse-hmer-size 0 | --minimum-mapping-quality 20 |
| --flow-assembly-collapse-partial-mode false | --disable-tool-default-annotations false |
| --flow-filter-alleles false | --enable-all-annotations false |
| --flow-filter-alleles-qual-threshold 30.0 | --allow-old-rms-mapping-quality-annotation-data false |
| --flow-filter-alleles-sor-threshold 3.0 | Version="4.2.4.1",Date="December 3, 2022 at 1:20:38 AM CET"> |
| --flow-filter-lone-alleles false |
| --flow-filter-alleles-debug-graphs false |
| --min-assembly-region-size 50 |
| --max-assembly-region-size 300 |
| --active-probability-threshold 0.002 |
| --max-prob-propagation-distance 50 |
| --force-active false |
| --assembly-region-padding 100 |
| --padding-around-indels 75 |
| --padding-around-snps 20 |
| --padding-around-strs 75 |
| --max-extension-into-assembly-region-padding-legacy 25 |
| --max-reads-per-alignment-start 50 |
| --enable-legacy-assembly-region-trimming false |
| --interval-set-rule UNION |
| --interval-padding 0 |
| --interval-exclusion-padding 0 |
| --interval-merging-rule ALL |
| --read-validation-stringency SILENT |
| --seconds-between-progress-updates 10.0 |
| --disable-sequence-dictionary-validation false |
| --create-output-bam-index true |
| --create-output-bam-md5 false |
| --create-output-variant-index true |
| --create-output-variant-md5 false |
| --max-variants-per-shard 0 |
| --lenient false |
| --add-output-sam-program-record true |
| --add-output-vcf-command-line true |
| --cloud-prefetch-buffer 40 |
| --cloud-index-prefetch-buffer -1 |
| --disable-bam-index-caching false |
| --sites-only-vcf-output false |
| --help false |
| --version false |
| --showHidden false |
| --verbosity INFO |
| --QUIET false |
| --use-jdk-deflater false |
| --use-jdk-inflater false |
| --gcs-max-retries 20 |
| --gcs-project-for-requester-pays |
| --disable-tool-default-read-filters false |
| --minimum-mapping-quality 20 |
| --disable-tool-default-annotations false |
| --enable-all-annotations false |
| --allow-old-rms-mapping-quality-annotation-data false" |
| Version="4.3.0.0",Date="December 16, 2022 at 12:51:03 AM CET"> |
****** KILL [#B] filterDepth : 21 en trop
CLOSED: [2023-01-04 Wed 19:16]
Nouvelle version (correcton bug markdupicates)
#+begin_src sh
grep '^NC' out/63003856_S135/variantCalling/filter-depth.vcf |wc -l
82054
#+end_src
Alexis
#+begin_src sh
zgrep '^NC' /Work/Groups/bisonex/ref_63003856_S135/63003856_S135_DP_over_30.vcf | wc -l
82033
#+end_src
Ne vient pas du filtre sur la profondeur:
bcftools filter -i 'FORMAT/AD[0:1]<=10' 63003856_S135_DP_over_30.vcf
bcftools filter -i 'FORMAT/DP<=30' 63003856_S135_DP_over_30.vcf
Idem pour notre version. Rien ne sort.
On compare le nombre de lignes
#+begin_src sh
bgzip out/63003856_S135/variantCalling/filter-depth.vcf
tabix /Work/Groups/bisonex/ref_63003856_S135/63003856_S135_DP_over_30.vcf.gz
tabix out/63003856_S135/variantCalling/filter-depth.vcf.gz
bcftools isec out/63003856_S135/variantCalling/filter-depth.vcf.gz /Work/Groups/bisonex/ref_63003856_S135/63003856_S135_DP_over_30.vcf.gz -p compare-filter-depth
find compare-filter-depth/ -type f -exec wc -l {} \;
84763 compare-filter-depth/sites.txt
710 compare-filter-depth/0001.vcf
8 compare-filter-depth/README.txt
85339 compare-filter-depth/0002.vcf
85340 compare-filter-depth/0003.vcf
725 compare-filter-depth/0000.vcf
#+end_src
****** KILL exclude SNP + consensual : 34 en trop !!
CLOSED: [2023-01-04 Wed 19:16]
$ grep '^NC' out/63003856_S135/variantCalling/filter-polymorphisms.vcf | wc -l
8898
vs
$ grep '^NC' /Work/Groups/bisonex/ref_63003856_S135/63003856_S135_DP_over_30_not_SNP_consensual_sequence.vcf | wc -l
8864
****** KILL Filter technical variants
CLOSED: [2023-01-04 Wed 19:16]
**** DONE Gatk 4.2.4 : idem
CLOSED: [2023-01-07 Sat 00:05]
***** DONE Variant calling
CLOSED: [2023-01-07 Sat 00:05]
****** DONE haplotypecaller: mieux mais non identique !
CLOSED: [2023-01-07 Sat 00:05]
******* DONE Nombres lignes gatk 4.2.2 : faible différence
CLOSED: [2023-01-04 Wed 19:18]
$ zgrep '^NC' 63003856_S135.vcf.gz | wc -l
1506931
$ grep '^NC' /Work/Groups/bisonex/ref-vcf/63003856_S135 .vcf | wc -l
1506894
******* DONE Flags la même version de gatk 4.2.2 : ok identique
CLOSED: [2023-01-04 Wed 19:09]
##GATKCommandLine=<ID=HaplotypeCaller,CommandLine="HaplotypeCaller
| ",Version="4.2.4.1",Date="January 4, 2023 at 1:46:41 AM CET"> | Version="4.2.4.1",Date="December 3, 2022 at 1:20:38 AM CET"> |
| --dbsnp /Work/Users/apraga/bisonex/work/5d/feb81028d262d7701bed0a759ff6f6/dbSNP.gz | --dbsnp /mnt/j/bases_de_donnees/dbSNP/GCF_000001405.39.gz |
| --max-mnp-distance 2 | --max-mnp-distance 2 |
| --output 63003856_S135.vcf.gz | --output /mnt/j/working_directory_pipeline_analyse_exome/vcf/63003856_S135.vcf |
| --input 63003856_S135.bam | --input /mnt/j/working_directory_pipeline_analyse_exome/bam/63003856_S135_recalibrated_hg38.bam |
| --reference genomeRef.fna | --reference /mnt/j/bases_de_donnees/genome/GRCh38_latest_genomic.fna |
| --tmp-dir . | --verbosity WARNING |
| --use-posteriors-to-calculate-qual false | --use-posteriors-to-calculate-qual false |
| --dont-use-dragstr-priors false | --dont-use-dragstr-priors false |
| --use-new-qual-calculator true | --use-new-qual-calculator true |
| --annotate-with-num-discovered-alleles false | --annotate-with-num-discovered-alleles false |
| --heterozygosity 0.001 | --heterozygosity 0.001 |
| --indel-heterozygosity 1.25E-4 | --indel-heterozygosity 1.25E-4 |
| --heterozygosity-stdev 0.01 | --heterozygosity-stdev 0.01 |
| --standard-min-confidence-threshold-for-calling 30.0 | --standard-min-confidence-threshold-for-calling 30.0 |
| --max-alternate-alleles 6 | --max-alternate-alleles 6 |
| --max-genotype-count 1024 | --max-genotype-count 1024 |
| --sample-ploidy 2 | --sample-ploidy 2 |
| --num-reference-samples-if-no-call 0 | --num-reference-samples-if-no-call 0 |
| --genotype-assignment-method USE_PLS_TO_ASSIGN | --genotype-assignment-method USE_PLS_TO_ASSIGN |
| --contamination-fraction-to-filter 0.0 | --contamination-fraction-to-filter 0.0 |
| --output-mode EMIT_VARIANTS_ONLY | --output-mode EMIT_VARIANTS_ONLY |
| --all-site-pls false | --all-site-pls false |
| --gvcf-gq-bands 1 | --gvcf-gq-bands 1 |
| --gvcf-gq-bands 2 | --gvcf-gq-bands 2 |
| --gvcf-gq-bands 3 | --gvcf-gq-bands 3 |
| --gvcf-gq-bands 4 | --gvcf-gq-bands 4 |
| --gvcf-gq-bands 5 | --gvcf-gq-bands 5 |
| --gvcf-gq-bands 6 | --gvcf-gq-bands 6 |
| --gvcf-gq-bands 7 | --gvcf-gq-bands 7 |
| --gvcf-gq-bands 8 | --gvcf-gq-bands 8 |
| --gvcf-gq-bands 9 | --gvcf-gq-bands 9 |
| --gvcf-gq-bands 10 | --gvcf-gq-bands 10 |
| --gvcf-gq-bands 11 | --gvcf-gq-bands 11 |
| --gvcf-gq-bands 12 | --gvcf-gq-bands 12 |
| --gvcf-gq-bands 13 | --gvcf-gq-bands 13 |
| --gvcf-gq-bands 14 | --gvcf-gq-bands 14 |
| --gvcf-gq-bands 15 | --gvcf-gq-bands 15 |
| --gvcf-gq-bands 16 | --gvcf-gq-bands 16 |
| --gvcf-gq-bands 17 | --gvcf-gq-bands 17 |
| --gvcf-gq-bands 18 | --gvcf-gq-bands 18 |
| --gvcf-gq-bands 19 | --gvcf-gq-bands 19 |
| --gvcf-gq-bands 20 | --gvcf-gq-bands 20 |
| --gvcf-gq-bands 21 | --gvcf-gq-bands 21 |
| --gvcf-gq-bands 22 | --gvcf-gq-bands 22 |
| --gvcf-gq-bands 23 | --gvcf-gq-bands 23 |
| --gvcf-gq-bands 24 | --gvcf-gq-bands 24 |
| --gvcf-gq-bands 25 | --gvcf-gq-bands 25 |
| --gvcf-gq-bands 26 | --gvcf-gq-bands 26 |
| --gvcf-gq-bands 27 | --gvcf-gq-bands 27 |
| --gvcf-gq-bands 28 | --gvcf-gq-bands 28 |
| --gvcf-gq-bands 29 | --gvcf-gq-bands 29 |
| --gvcf-gq-bands 30 | --gvcf-gq-bands 30 |
| --gvcf-gq-bands 31 | --gvcf-gq-bands 31 |
| --gvcf-gq-bands 32 | --gvcf-gq-bands 32 |
| --gvcf-gq-bands 33 | --gvcf-gq-bands 33 |
| --gvcf-gq-bands 34 | --gvcf-gq-bands 34 |
| --gvcf-gq-bands 35 | --gvcf-gq-bands 35 |
| --gvcf-gq-bands 36 | --gvcf-gq-bands 36 |
| --gvcf-gq-bands 37 | --gvcf-gq-bands 37 |
| --gvcf-gq-bands 38 | --gvcf-gq-bands 38 |
| --gvcf-gq-bands 39 | --gvcf-gq-bands 39 |
| --gvcf-gq-bands 40 | --gvcf-gq-bands 40 |
| --gvcf-gq-bands 41 | --gvcf-gq-bands 41 |
| --gvcf-gq-bands 42 | --gvcf-gq-bands 42 |
| --gvcf-gq-bands 43 | --gvcf-gq-bands 43 |
| --gvcf-gq-bands 44 | --gvcf-gq-bands 44 |
| --gvcf-gq-bands 45 | --gvcf-gq-bands 45 |
| --gvcf-gq-bands 46 | --gvcf-gq-bands 46 |
| --gvcf-gq-bands 47 | --gvcf-gq-bands 47 |
| --gvcf-gq-bands 48 | --gvcf-gq-bands 48 |
| --gvcf-gq-bands 49 | --gvcf-gq-bands 49 |
| --gvcf-gq-bands 50 | --gvcf-gq-bands 50 |
| --gvcf-gq-bands 51 | --gvcf-gq-bands 51 |
| --gvcf-gq-bands 52 | --gvcf-gq-bands 52 |
| --gvcf-gq-bands 53 | --gvcf-gq-bands 53 |
| --gvcf-gq-bands 54 | --gvcf-gq-bands 54 |
| --gvcf-gq-bands 55 | --gvcf-gq-bands 55 |
| --gvcf-gq-bands 56 | --gvcf-gq-bands 56 |
| --gvcf-gq-bands 57 | --gvcf-gq-bands 57 |
| --gvcf-gq-bands 58 | --gvcf-gq-bands 58 |
| --gvcf-gq-bands 59 | --gvcf-gq-bands 59 |
| --gvcf-gq-bands 60 | --gvcf-gq-bands 60 |
| --gvcf-gq-bands 70 | --gvcf-gq-bands 70 |
| --gvcf-gq-bands 80 | --gvcf-gq-bands 80 |
| --gvcf-gq-bands 90 | --gvcf-gq-bands 90 |
| --gvcf-gq-bands 99 | --gvcf-gq-bands 99 |
| --floor-blocks false | --floor-blocks false |
| --indel-size-to-eliminate-in-ref-model 10 | --indel-size-to-eliminate-in-ref-model 10 |
| --disable-optimizations false | --disable-optimizations false |
| --dragen-mode false | --dragen-mode false |
| --apply-bqd false | --apply-bqd false |
| --apply-frd false | --apply-frd false |
| --disable-spanning-event-genotyping false | --disable-spanning-event-genotyping false |
| --transform-dragen-mapping-quality false | --transform-dragen-mapping-quality false |
| --mapping-quality-threshold-for-genotyping 20 | --mapping-quality-threshold-for-genotyping 20 |
| --max-effective-depth-adjustment-for-frd 0 | --max-effective-depth-adjustment-for-frd 0 |
| --just-determine-active-regions false | --just-determine-active-regions false |
| --dont-genotype false | --dont-genotype false |
| --do-not-run-physical-phasing false | --do-not-run-physical-phasing false |
| --do-not-correct-overlapping-quality false | --do-not-correct-overlapping-quality false |
| --use-filtered-reads-for-annotations false | --use-filtered-reads-for-annotations false |
| --adaptive-pruning false | --adaptive-pruning false |
| --do-not-recover-dangling-branches false | --do-not-recover-dangling-branches false |
| --recover-dangling-heads false | --recover-dangling-heads false |
| --kmer-size 10 | --kmer-size 10 |
| --kmer-size 25 | --kmer-size 25 |
| --dont-increase-kmer-sizes-for-cycles false | --dont-increase-kmer-sizes-for-cycles false |
| --allow-non-unique-kmers-in-ref false | --allow-non-unique-kmers-in-ref false |
| --num-pruning-samples 1 | --num-pruning-samples 1 |
| --min-dangling-branch-length 4 | --min-dangling-branch-length 4 |
| --recover-all-dangling-branches false | --recover-all-dangling-branches false |
| --max-num-haplotypes-in-population 128 | --max-num-haplotypes-in-population 128 |
| --min-pruning 2 | --min-pruning 2 |
| --adaptive-pruning-initial-error-rate 0.001 | --adaptive-pruning-initial-error-rate 0.001 |
| --pruning-lod-threshold 2.302585092994046 | --pruning-lod-threshold 2.302585092994046 |
| --pruning-seeding-lod-threshold 9.210340371976184 | --pruning-seeding-lod-threshold 9.210340371976184 |
| --max-unpruned-variants 100 | --max-unpruned-variants 100 |
| --linked-de-bruijn-graph false | --linked-de-bruijn-graph false |
| --disable-artificial-haplotype-recovery false | --disable-artificial-haplotype-recovery false |
| --enable-legacy-graph-cycle-detection false | --enable-legacy-graph-cycle-detection false |
| --debug-assembly false | --debug-assembly false |
| --debug-graph-transformations false | --debug-graph-transformations false |
| --capture-assembly-failure-bam false | --capture-assembly-failure-bam false |
| --num-matching-bases-in-dangling-end-to-recover -1 | --num-matching-bases-in-dangling-end-to-recover -1 |
| --error-correction-log-odds -Infinity | --error-correction-log-odds -Infinity |
| --error-correct-reads false | --error-correct-reads false |
| --kmer-length-for-read-error-correction 25 | --kmer-length-for-read-error-correction 25 |
| --min-observations-for-kmer-to-be-solid 20 | --min-observations-for-kmer-to-be-solid 20 |
| --base-quality-score-threshold 18 | --base-quality-score-threshold 18 |
| --dragstr-het-hom-ratio 2 | --dragstr-het-hom-ratio 2 |
| --dont-use-dragstr-pair-hmm-scores false | --dont-use-dragstr-pair-hmm-scores false |
| --pair-hmm-gap-continuation-penalty 10 | --pair-hmm-gap-continuation-penalty 10 |
| --expected-mismatch-rate-for-read-disqualification 0.02 | --expected-mismatch-rate-for-read-disqualification 0.02 |
| --pair-hmm-implementation FASTEST_AVAILABLE | --pair-hmm-implementation FASTEST_AVAILABLE |
| --pcr-indel-model CONSERVATIVE | --pcr-indel-model CONSERVATIVE |
| --phred-scaled-global-read-mismapping-rate 45 | --phred-scaled-global-read-mismapping-rate 45 |
| --disable-symmetric-hmm-normalizing false | --disable-symmetric-hmm-normalizing false |
| --disable-cap-base-qualities-to-map-quality false | --disable-cap-base-qualities-to-map-quality false |
| --enable-dynamic-read-disqualification-for-genotyping false | --enable-dynamic-read-disqualification-for-genotyping false |
| --dynamic-read-disqualification-threshold 1.0 | --dynamic-read-disqualification-threshold 1.0 |
| --native-pair-hmm-threads 4 | --native-pair-hmm-threads 4 |
| --native-pair-hmm-use-double-precision false | --native-pair-hmm-use-double-precision false |
| --bam-writer-type CALLED_HAPLOTYPES | --bam-writer-type CALLED_HAPLOTYPES |
| --dont-use-soft-clipped-bases false | --dont-use-soft-clipped-bases false |
| --min-base-quality-score 10 | --min-base-quality-score 10 |
| --smith-waterman JAVA | --smith-waterman JAVA |
| --emit-ref-confidence NONE | --emit-ref-confidence NONE |
| --force-call-filtered-alleles false | --force-call-filtered-alleles false |
| --soft-clip-low-quality-ends false | --soft-clip-low-quality-ends false |
| --allele-informative-reads-overlap-margin 2 | --allele-informative-reads-overlap-margin 2 |
| --smith-waterman-dangling-end-match-value 25 | --smith-waterman-dangling-end-match-value 25 |
| --smith-waterman-dangling-end-mismatch-penalty -50 | --smith-waterman-dangling-end-mismatch-penalty -50 |
| --smith-waterman-dangling-end-gap-open-penalty -110 | --smith-waterman-dangling-end-gap-open-penalty -110 |
| --smith-waterman-dangling-end-gap-extend-penalty -6 | --smith-waterman-dangling-end-gap-extend-penalty -6 |
| --smith-waterman-haplotype-to-reference-match-value 200 | --smith-waterman-haplotype-to-reference-match-value 200 |
| --smith-waterman-haplotype-to-reference-gap-open-penalty -260 | --smith-waterman-haplotype-to-reference-gap-open-penalty -260 |
| --smith-waterman-haplotype-to-reference-mismatch-penalty -150 | --smith-waterman-haplotype-to-reference-mismatch-penalty -150 |
| --smith-waterman-haplotype-to-reference-gap-extend-penalty -11 | --smith-waterman-haplotype-to-reference-gap-extend-penalty -11 |
| --smith-waterman-read-to-haplotype-match-value 10 | --smith-waterman-read-to-haplotype-match-value 10 |
| --smith-waterman-read-to-haplotype-mismatch-penalty -15 | --smith-waterman-read-to-haplotype-mismatch-penalty -15 |
| --smith-waterman-read-to-haplotype-gap-open-penalty -30 | --smith-waterman-read-to-haplotype-gap-open-penalty -30 |
| --smith-waterman-read-to-haplotype-gap-extend-penalty -5 | --smith-waterman-read-to-haplotype-gap-extend-penalty -5 |
| --min-assembly-region-size 50 | --min-assembly-region-size 50 |
| --max-assembly-region-size 300 | --max-assembly-region-size 300 |
| --active-probability-threshold 0.002 | --active-probability-threshold 0.002 |
| --max-prob-propagation-distance 50 | --max-prob-propagation-distance 50 |
| --force-active false | --force-active false |
| --assembly-region-padding 100 | --assembly-region-padding 100 |
| --padding-around-indels 75 | --padding-around-indels 75 |
| --padding-around-snps 20 | --padding-around-snps 20 |
| --padding-around-strs 75 | --padding-around-strs 75 |
| --max-extension-into-assembly-region-padding-legacy 25 | --max-extension-into-assembly-region-padding-legacy 25 |
| --max-reads-per-alignment-start 50 | --max-reads-per-alignment-start 50 |
| --enable-legacy-assembly-region-trimming false | --enable-legacy-assembly-region-trimming false |
| --interval-set-rule UNION | --interval-set-rule UNION |
| --interval-padding 0 | --interval-padding 0 |
| --interval-exclusion-padding 0 | --interval-exclusion-padding 0 |
| --interval-merging-rule ALL | --interval-merging-rule ALL |
| --read-validation-stringency SILENT | --read-validation-stringency SILENT |
| --seconds-between-progress-updates 10.0 | --seconds-between-progress-updates 10.0 |
| --disable-sequence-dictionary-validation false | --disable-sequence-dictionary-validation false |
| --create-output-bam-index true | --create-output-bam-index true |
| --create-output-bam-md5 false | --create-output-bam-md5 false |
| --create-output-variant-index true | --create-output-variant-index true |
| --create-output-variant-md5 false | --create-output-variant-md5 false |
| --max-variants-per-shard 0 | --max-variants-per-shard 0 |
| --lenient false | --lenient false |
| --add-output-sam-program-record true | --add-output-sam-program-record true |
| --add-output-vcf-command-line true | --add-output-vcf-command-line true |
| --cloud-prefetch-buffer 40 | --cloud-prefetch-buffer 40 |
| --cloud-index-prefetch-buffer -1 | --cloud-index-prefetch-buffer -1 |
| --disable-bam-index-caching false | --disable-bam-index-caching false |
| --sites-only-vcf-output false | --sites-only-vcf-output false |
| --help false | --help false |
| --version false | --version false |
| --showHidden false | --showHidden false |
| --verbosity INFO | --QUIET false |
| --QUIET false | --use-jdk-deflater false |
| --use-jdk-deflater false | --use-jdk-inflater false |
| --use-jdk-inflater false | --gcs-max-retries 20 |
| --gcs-max-retries 20 | --gcs-project-for-requester-pays |
| --gcs-project-for-requester-pays | --disable-tool-default-read-filters false |
| --disable-tool-default-read-filters false | --minimum-mapping-quality 20 |
| --minimum-mapping-quality 20 | --disable-tool-default-annotations false |
| --disable-tool-default-annotations false | --enable-all-annotations false |
| --enable-all-annotations false | --allow-old-rms-mapping-quality-annotation-data false |
| --allow-old-rms-mapping-quality-annotation-data false | |
****** DONE filter depth : Toujours la même différence...
CLOSED: [2023-01-07 Sat 00:05]
$ grep '^NC' filter-depth.vcf | wc -l
82054
$ zgrep '^NC' /Work/Groups/bisonex/ref_63003856_S135/63003856_S135_DP_over_30.vcf.gz | wc -l
82033
Non lié à la profondeur : on teste avec
bcftools filter -i 'FORMAT/DP<=30' filter-depth.vcf
bcftools filter -i 'FORMAT/AD[0:1]<=10' filter-depth.vcf
****** DONE Vérifier qu'en utilsant 2 filtres différents on a bien la même chose : oui
CLOSED: [2023-01-07 Sat 00:05]
$ bcftools filter -e 'FORMAT/DP<=30' 63003856_S135.vcf.gz | bcftools filter -e 'FORMAT/AD[0:1]<=10' -o two-filters.vcf
$ grep '^NC' two-filters.vcf | wc -l
82054
***** DONE Tester bwa en séquentiel
CLOSED: [2023-01-07 Sat 00:05]
**** DONE (save) Version d'Alexis 24 threads, sans télécharger les bases de données, gatk 4.2.4.1
CLOSED: [2023-01-07 Sat 00:06]
***** Bwa mem 24 threads: comme la version test...
$ cd /Work/Users/apraga/bisonex/script/files/tmp_63003856_S135
$ samtools view -c 63003856_S135.bam
128077211
$ samtools view -c /Work/Groups/bisonex/ref_63003856_S135/63003856_S135.bam
128077207
En ne conservant que les mapped reads, minime différence
$ samtools view -c -F 260 /Work/Groups/bisonex/ref_63003856_S135/63003856_S135.bam
127941051
$ samtools view -c -F 260 files/tmp_63003856_S135/63003856_S135.bam
127941054
**** DONE Version prod à la maison
CLOSED: [2023-01-15 Sun 23:22]
preprocessing + variant calling sans alignement
***** DONE GATK nix
CLOSED: [2023-01-15 Sun 23:22]
❯ samtools view -c 63003856_S135_marked_dup.bam
128077211
On a le même nombre
***** DONE GATK compilé: idem
CLOSED: [2023-01-15 Sun 23:22]
script/files-src/tmp_63003856_S135 on prod [$!?]
❯ samtools view -c 63003856_S135_marked_dup.bam
128077211
Idem...
**** TODO GATK : tests de non régression ??
**** DONE Compiler gatk: idem
CLOSED: [2023-01-19 Thu 22:03]
***** DONE Compilation
CLOSED: [2023-01-14 Sat 22:45]
Requirements :
- java8 avec jDK
- git lfs
#+begin_src sh
git clone https://github.com/broadinstitute/gatk
git checkout tags/4.2.4.0 -b 4.2.4.0
nix-shell -p jdk8 git-lfs
# We need the datasets for testing (otherwise it fails)
git lfs install
git lfs pull --include src/main/resources/large
./gradlew bundle
#+end_src
***** DONE Vérifier tests
CLOSED: [2023-01-19 Thu 22:03]
#+begin_src
./gradlew test
#+end_src
267064 tests completed, 444 failed, 1966 skipped
> There were failing tests. See the report at: file:///Home/Users/apraga/gatk/build/reports/tests/test/index.html
BUILD FAILED in 37m 1s
6 actionable tasks: 1 executed, 5 up-to-date
Ceux qui ont planté sont liés à spark + erreurs d'authentification kerberos
***** KILL Lancer calcul : maison
CLOSED: [2023-01-19 Thu 22:03]
JAVA_HOME=/nix/store/r1r5jr7gv6hcchpiggjmfqjkzbi8y5ja-openjdk-8u322-ga/lib/openjdk PATH=$PATH:$JAVA_HOME/bin ./gatk --version
**** TODO Comparer tsv à la sortie (Alexis)
- diff
- Nombre de lignes
**** TODO Version de prod + nix sur Helios
***** DONE 24 coeurs : idem
CLOSED: [2023-01-20 Fri 22:49]
****** DONE Preprocessing : idem
CLOSED: [2023-01-20 Fri 22:49]
******* DONE Alignement
CLOSED: [2023-01-19 Thu 22:28]
******** DONE Nombre lignes : idem
CLOSED: [2023-01-19 Thu 22:48]
/Work/Users/apraga/bisonex/script/files/tmp_63003856_S135〉samtools view -c 63003856_S135.bam 01/19/2023 10:16:19 PM
128077211
/Work/Projects/bisonex/ref-63003856_S135〉samtools view -c 63003856_S135.bam 01/19/2023 10:11:39 PM
128077207
******** DONE Bwa version ok, arguments ok
CLOSED: [2023-01-20 Fri 22:49]
/Work/Users/apraga/bisonex/script/files/tmp_63003856_S135〉samtools view -H 63003856_S135.bam | grep PN 01/19/2023 10:46:55 PM
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem -R @RG\tID:63003856_S135\tSM:63003856_S135\tPL:ILLUMINA\tPM:Miseq\tCN:CHU_Minjoz\tLB:definition_to_add -v 2 -t 24 /Work/Groups/bisonex/data-alexis-reference/genome/GRCh38_latest_genomic.fna files/fastq/63003856_S135_R1_001.fastq.gz files/fastq/63003856_S135_R2_001.fastq.gz
@PG ID:samtools PN:samtools PP:bwa VN:1.13 CL:samtools sort -@ 24 -O BAM -o files/tmp_63003856_S135/63003856_S135.bam
@PG ID:samtools.1 PN:samtools PP:samtools VN:1.13 CL:samtools view -H 63003856_S135.bam
/Work/Projects/bisonex/ref-63003856_S135〉samtools view -H 63003856_S135.bam | grep PN 01/19/2023 10:49:25 PM
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:/bin/bwa mem -R @RG\tID:63003856_S135\tSM:63003856_S135\tPL:ILLUMINA\tPM:Miseq\tCN:CHU_Minjoz\tLB:definition_to_add -t 4 /mnt/j/bases_de_donnees/genome/GRCh38_latest_genomic.fna /mnt/j/working_directory_pipeline_analyse_exome/fastq/2200467051_63003856/63003856_S135_R1_001.fastq.gz /mnt/j/working_directory_pipeline_analyse_exome/fastq/2200467051_63003856/63003856_S135_R2_001.fastq.gz
@PG ID:samtools PN:samtools PP:bwa VN:1.10 CL:/mnt/h/tools/samtools sort -@ 4 -O BAM -o /mnt/j/working_directory_pipeline_analyse_exome/tmp_63003856_S135/63003856_S135.bam
@PG ID:samtools.1 PN:samtools PP:samtools VN:1.13 CL:samtools view -H 63003856_S135.bam
******* KILL ApplyBQSR
CLOSED: [2023-01-20 Fri 22:48]
/Work/Users/apraga/bisonex/script/files/bam〉samtools view -c 63003856_S135_recalibrated_hg38.bam 01/19/2023 10:21:00 PM
128077211
??
****** DONE Variant calling
CLOSED: [2023-01-20 Fri 22:48]
******* DONE Re vérifier flags
CLOSED: [2023-01-19 Thu 22:44]
/Work/Projects/bisonex/ref-63003856_S135〉less 63003856_S135_DP_over_30.vcf.gz
##GATKCommandLine=<ID=HaplotypeCaller,CommandLine="HaplotypeCaller
| Ref | Prod helios |
|-------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------|
| --dbsnp /mnt/j/bases_de_donnees/dbSNP/GCF_000001405.39.gz | --dbsnp /Work/Groups/bisonex/data-alexis-reference/dbSNP/GCF_000001405.39.gz |
| --max-mnp-distance 2 | --max-mnp-distance 2 |
| --output /mnt/j/working_directory_pipeline_analyse_exome/vcf/63003856_S135.vcf | --output files/vcf/63003856_S135.vcf |
| --input /mnt/j/working_directory_pipeline_analyse_exome/bam/63003856_S135_recalibrated_hg38.bam | --input files/bam/63003856_S135_recalibrated_hg38.bam |
| --reference /mnt/j/bases_de_donnees/genome/GRCh38_latest_genomic.fna | --reference /Work/Groups/bisonex/data-alexis-reference/genome/GRCh38_latest_genomic.fna |
| --verbosity WARNING | --verbosity WARNING |
| --use-posteriors-to-calculate-qual false | --use-posteriors-to-calculate-qual false |
| --dont-use-dragstr-priors false | --dont-use-dragstr-priors false |
| --use-new-qual-calculator true | --use-new-qual-calculator true |
| --annotate-with-num-discovered-alleles false | --annotate-with-num-discovered-alleles false |
| --heterozygosity 0.001 | --heterozygosity 0.001 |
| --indel-heterozygosity 1.25E-4 | --indel-heterozygosity 1.25E-4 |
| --heterozygosity-stdev 0.01 | --heterozygosity-stdev 0.01 |
| --standard-min-confidence-threshold-for-calling 30.0 | --standard-min-confidence-threshold-for-calling 30.0 |
| --max-alternate-alleles 6 | --max-alternate-alleles 6 |
| --max-genotype-count 1024 | --max-genotype-count 1024 |
| --sample-ploidy 2 | --sample-ploidy 2 |
| --num-reference-samples-if-no-call 0 | --num-reference-samples-if-no-call 0 |
| --genotype-assignment-method USE_PLS_TO_ASSIGN | --genotype-assignment-method USE_PLS_TO_ASSIGN |
| --contamination-fraction-to-filter 0.0 | --contamination-fraction-to-filter 0.0 |
| --output-mode EMIT_VARIANTS_ONLY | --output-mode EMIT_VARIANTS_ONLY |
| --all-site-pls false | --all-site-pls false |
| --gvcf-gq-bands 1 | --gvcf-gq-bands 1 |
| --gvcf-gq-bands 2 | --gvcf-gq-bands 2 |
| --gvcf-gq-bands 3 | --gvcf-gq-bands 3 |
| --gvcf-gq-bands 4 | --gvcf-gq-bands 4 |
| --gvcf-gq-bands 5 | --gvcf-gq-bands 5 |
| --gvcf-gq-bands 6 | --gvcf-gq-bands 6 |
| --gvcf-gq-bands 7 | --gvcf-gq-bands 7 |
| --gvcf-gq-bands 8 | --gvcf-gq-bands 8 |
| --gvcf-gq-bands 9 | --gvcf-gq-bands 9 |
| --gvcf-gq-bands 10 | --gvcf-gq-bands 10 |
| --gvcf-gq-bands 11 | --gvcf-gq-bands 11 |
| --gvcf-gq-bands 12 | --gvcf-gq-bands 12 |
| --gvcf-gq-bands 13 | --gvcf-gq-bands 13 |
| --gvcf-gq-bands 14 | --gvcf-gq-bands 14 |
| --gvcf-gq-bands 15 | --gvcf-gq-bands 15 |
| --gvcf-gq-bands 16 | --gvcf-gq-bands 16 |
| --gvcf-gq-bands 17 | --gvcf-gq-bands 17 |
| --gvcf-gq-bands 18 | --gvcf-gq-bands 18 |
| --gvcf-gq-bands 19 | --gvcf-gq-bands 19 |
| --gvcf-gq-bands 20 | --gvcf-gq-bands 20 |
| --gvcf-gq-bands 21 | --gvcf-gq-bands 21 |
| --gvcf-gq-bands 22 | --gvcf-gq-bands 22 |
| --gvcf-gq-bands 23 | --gvcf-gq-bands 23 |
| --gvcf-gq-bands 24 | --gvcf-gq-bands 24 |
| --gvcf-gq-bands 25 | --gvcf-gq-bands 25 |
| --gvcf-gq-bands 26 | --gvcf-gq-bands 26 |
| --gvcf-gq-bands 27 | --gvcf-gq-bands 27 |
| --gvcf-gq-bands 28 | --gvcf-gq-bands 28 |
| --gvcf-gq-bands 29 | --gvcf-gq-bands 29 |
| --gvcf-gq-bands 30 | --gvcf-gq-bands 30 |
| --gvcf-gq-bands 31 | --gvcf-gq-bands 31 |
| --gvcf-gq-bands 32 | --gvcf-gq-bands 32 |
| --gvcf-gq-bands 33 | --gvcf-gq-bands 33 |
| --gvcf-gq-bands 34 | --gvcf-gq-bands 34 |
| --gvcf-gq-bands 35 | --gvcf-gq-bands 35 |
| --gvcf-gq-bands 36 | --gvcf-gq-bands 36 |
| --gvcf-gq-bands 37 | --gvcf-gq-bands 37 |
| --gvcf-gq-bands 38 | --gvcf-gq-bands 38 |
| --gvcf-gq-bands 39 | --gvcf-gq-bands 39 |
| --gvcf-gq-bands 40 | --gvcf-gq-bands 40 |
| --gvcf-gq-bands 41 | --gvcf-gq-bands 41 |
| --gvcf-gq-bands 42 | --gvcf-gq-bands 42 |
| --gvcf-gq-bands 43 | --gvcf-gq-bands 43 |
| --gvcf-gq-bands 44 | --gvcf-gq-bands 44 |
| --gvcf-gq-bands 45 | --gvcf-gq-bands 45 |
| --gvcf-gq-bands 46 | --gvcf-gq-bands 46 |
| --gvcf-gq-bands 47 | --gvcf-gq-bands 47 |
| --gvcf-gq-bands 48 | --gvcf-gq-bands 48 |
| --gvcf-gq-bands 49 | --gvcf-gq-bands 49 |
| --gvcf-gq-bands 50 | --gvcf-gq-bands 50 |
| --gvcf-gq-bands 51 | --gvcf-gq-bands 51 |
| --gvcf-gq-bands 52 | --gvcf-gq-bands 52 |
| --gvcf-gq-bands 53 | --gvcf-gq-bands 53 |
| --gvcf-gq-bands 54 | --gvcf-gq-bands 54 |
| --gvcf-gq-bands 55 | --gvcf-gq-bands 55 |
| --gvcf-gq-bands 56 | --gvcf-gq-bands 56 |
| --gvcf-gq-bands 57 | --gvcf-gq-bands 57 |
| --gvcf-gq-bands 58 | --gvcf-gq-bands 58 |
| --gvcf-gq-bands 59 | --gvcf-gq-bands 59 |
| --gvcf-gq-bands 60 | --gvcf-gq-bands 60 |
| --gvcf-gq-bands 70 | --gvcf-gq-bands 70 |
| --gvcf-gq-bands 80 | --gvcf-gq-bands 80 |
| --gvcf-gq-bands 90 | --gvcf-gq-bands 90 |
| --gvcf-gq-bands 99 | --gvcf-gq-bands 99 |
| --floor-blocks false | --floor-blocks false |
| --indel-size-to-eliminate-in-ref-model 10 | --indel-size-to-eliminate-in-ref-model 10 |
| --disable-optimizations false | --disable-optimizations false |
| --dragen-mode false | --dragen-mode false |
| --apply-bqd false | --apply-bqd false |
| --apply-frd false | --apply-frd false |
| --disable-spanning-event-genotyping false | --disable-spanning-event-genotyping false |
| --transform-dragen-mapping-quality false | --transform-dragen-mapping-quality false |
| --mapping-quality-threshold-for-genotyping 20 | --mapping-quality-threshold-for-genotyping 20 |
| --max-effective-depth-adjustment-for-frd 0 | --max-effective-depth-adjustment-for-frd 0 |
| --just-determine-active-regions false | --just-determine-active-regions false |
| --dont-genotype false | --dont-genotype false |
| --do-not-run-physical-phasing false | --do-not-run-physical-phasing false |
| --do-not-correct-overlapping-quality false | --do-not-correct-overlapping-quality false |
| --use-filtered-reads-for-annotations false | --use-filtered-reads-for-annotations false |
| --adaptive-pruning false | --adaptive-pruning false |
| --do-not-recover-dangling-branches false | --do-not-recover-dangling-branches false |
| --recover-dangling-heads false | --recover-dangling-heads false |
| --kmer-size 10 | --kmer-size 10 |
| --kmer-size 25 | --kmer-size 25 |
| --dont-increase-kmer-sizes-for-cycles false | --dont-increase-kmer-sizes-for-cycles false |
| --allow-non-unique-kmers-in-ref false | --allow-non-unique-kmers-in-ref false |
| --num-pruning-samples 1 | --num-pruning-samples 1 |
| --min-dangling-branch-length 4 | --min-dangling-branch-length 4 |
| --recover-all-dangling-branches false | --recover-all-dangling-branches false |
| --max-num-haplotypes-in-population 128 | --max-num-haplotypes-in-population 128 |
| --min-pruning 2 | --min-pruning 2 |
| --adaptive-pruning-initial-error-rate 0.001 | --adaptive-pruning-initial-error-rate 0.001 |
| --pruning-lod-threshold 2.302585092994046 | --pruning-lod-threshold 2.302585092994046 |
| --pruning-seeding-lod-threshold 9.210340371976184 | --pruning-seeding-lod-threshold 9.210340371976184 |
| --max-unpruned-variants 100 | --max-unpruned-variants 100 |
| --linked-de-bruijn-graph false | --linked-de-bruijn-graph false |
| --disable-artificial-haplotype-recovery false | --disable-artificial-haplotype-recovery false |
| --enable-legacy-graph-cycle-detection false | --enable-legacy-graph-cycle-detection false |
| --debug-assembly false | --debug-assembly false |
| --debug-graph-transformations false | --debug-graph-transformations false |
| --capture-assembly-failure-bam false | --capture-assembly-failure-bam false |
| --num-matching-bases-in-dangling-end-to-recover -1 | --num-matching-bases-in-dangling-end-to-recover -1 |
| --error-correction-log-odds -Infinity | --error-correction-log-odds -Infinity |
| --error-correct-reads false | --error-correct-reads false |
| --kmer-length-for-read-error-correction 25 | --kmer-length-for-read-error-correction 25 |
| --min-observations-for-kmer-to-be-solid 20 | --min-observations-for-kmer-to-be-solid 20 |
| --base-quality-score-threshold 18 | --base-quality-score-threshold 18 |
| --dragstr-het-hom-ratio 2 | --dragstr-het-hom-ratio 2 |
| --dont-use-dragstr-pair-hmm-scores false | --dont-use-dragstr-pair-hmm-scores false |
| --pair-hmm-gap-continuation-penalty 10 | --pair-hmm-gap-continuation-penalty 10 |
| --expected-mismatch-rate-for-read-disqualification 0.02 | --expected-mismatch-rate-for-read-disqualification 0.02 |
| --pair-hmm-implementation FASTEST_AVAILABLE | --pair-hmm-implementation FASTEST_AVAILABLE |
| --pcr-indel-model CONSERVATIVE | --pcr-indel-model CONSERVATIVE |
| --phred-scaled-global-read-mismapping-rate 45 | --phred-scaled-global-read-mismapping-rate 45 |
| --disable-symmetric-hmm-normalizing false | --disable-symmetric-hmm-normalizing false |
| --disable-cap-base-qualities-to-map-quality false | --disable-cap-base-qualities-to-map-quality false |
| --enable-dynamic-read-disqualification-for-genotyping false | --enable-dynamic-read-disqualification-for-genotyping false |
| --dynamic-read-disqualification-threshold 1.0 | --dynamic-read-disqualification-threshold 1.0 |
| --native-pair-hmm-threads 4 | --native-pair-hmm-threads 4 |
| --native-pair-hmm-use-double-precision false | --native-pair-hmm-use-double-precision false |
| --bam-writer-type CALLED_HAPLOTYPES | --bam-writer-type CALLED_HAPLOTYPES |
| --dont-use-soft-clipped-bases false | --dont-use-soft-clipped-bases false |
| --min-base-quality-score 10 | --min-base-quality-score 10 |
| --smith-waterman JAVA | --smith-waterman JAVA |
| --emit-ref-confidence NONE | --emit-ref-confidence NONE |
| --force-call-filtered-alleles false | --force-call-filtered-alleles false |
| --soft-clip-low-quality-ends false | --soft-clip-low-quality-ends false |
| --allele-informative-reads-overlap-margin 2 | --allele-informative-reads-overlap-margin 2 |
| --smith-waterman-dangling-end-match-value 25 | --smith-waterman-dangling-end-match-value 25 |
| --smith-waterman-dangling-end-mismatch-penalty -50 | --smith-waterman-dangling-end-mismatch-penalty -50 |
| --smith-waterman-dangling-end-gap-open-penalty -110 | --smith-waterman-dangling-end-gap-open-penalty -110 |
| --smith-waterman-dangling-end-gap-extend-penalty -6 | --smith-waterman-dangling-end-gap-extend-penalty -6 |
| --smith-waterman-haplotype-to-reference-match-value 200 | --smith-waterman-haplotype-to-reference-match-value 200 |
| --smith-waterman-haplotype-to-reference-mismatch-penalty -150 | --smith-waterman-haplotype-to-reference-mismatch-penalty -150 |
| --smith-waterman-haplotype-to-reference-gap-open-penalty -260 | --smith-waterman-haplotype-to-reference-gap-open-penalty -260 |
| --smith-waterman-haplotype-to-reference-gap-extend-penalty -11 | --smith-waterman-haplotype-to-reference-gap-extend-penalty -11 |
| --smith-waterman-read-to-haplotype-match-value 10 | --smith-waterman-read-to-haplotype-match-value 10 |
| --smith-waterman-read-to-haplotype-mismatch-penalty -15 | --smith-waterman-read-to-haplotype-mismatch-penalty -15 |
| --smith-waterman-read-to-haplotype-gap-open-penalty -30 | --smith-waterman-read-to-haplotype-gap-open-penalty -30 |
| --smith-waterman-read-to-haplotype-gap-extend-penalty -5 | --smith-waterman-read-to-haplotype-gap-extend-penalty -5 |
| --min-assembly-region-size 50 | --min-assembly-region-size 50 |
| --max-assembly-region-size 300 | --max-assembly-region-size 300 |
| --active-probability-threshold 0.002 | --active-probability-threshold 0.002 |
| --max-prob-propagation-distance 50 | --max-prob-propagation-distance 50 |
| --force-active false | --force-active false |
| --assembly-region-padding 100 | --assembly-region-padding 100 |
| --padding-around-indels 75 | --padding-around-indels 75 |
| --padding-around-snps 20 | --padding-around-snps 20 |
| --padding-around-strs 75 | --padding-around-strs 75 |
| --max-extension-into-assembly-region-padding-legacy 25 | --max-extension-into-assembly-region-padding-legacy 25 |
| --max-reads-per-alignment-start 50 | --max-reads-per-alignment-start 50 |
| --enable-legacy-assembly-region-trimming false | --enable-legacy-assembly-region-trimming false |
| --interval-set-rule UNION | --interval-set-rule UNION |
| --interval-padding 0 | --interval-padding 0 |
| --interval-exclusion-padding 0 | --interval-exclusion-padding 0 |
| --interval-merging-rule ALL | --interval-merging-rule ALL |
| --read-validation-stringency SILENT | --read-validation-stringency SILENT |
| --seconds-between-progress-updates 10.0 | --seconds-between-progress-updates 10.0 |
| --disable-sequence-dictionary-validation false | --disable-sequence-dictionary-validation false |
| --create-output-bam-index true | --create-output-bam-index true |
| --create-output-bam-md5 false | --create-output-bam-md5 false |
| --create-output-variant-index true | --create-output-variant-index true |
| --create-output-variant-md5 false | --create-output-variant-md5 false |
| --max-variants-per-shard 0 | --max-variants-per-shard 0 |
| --lenient false | --lenient false |
| --add-output-sam-program-record true | --add-output-sam-program-record true |
| --add-output-vcf-command-line true | --add-output-vcf-command-line true |
| --cloud-prefetch-buffer 40 | --cloud-prefetch-buffer 40 |
| --cloud-index-prefetch-buffer -1 | --cloud-index-prefetch-buffer -1 |
| --disable-bam-index-caching false | --disable-bam-index-caching false |
| --sites-only-vcf-output false | --sites-only-vcf-output false |
| --help false | --help false |
| --version false | --version false |
| --showHidden false | --showHidden false |
| --QUIET false | --QUIET false |
| --use-jdk-deflater false | --use-jdk-deflater false |
| --use-jdk-inflater false | --use-jdk-inflater false |
| --gcs-max-retries 20 | --gcs-max-retries 20 |
| --gcs-project-for-requester-pays | --gcs-project-for-requester-pays |
| --disable-tool-default-read-filters false | --disable-tool-default-read-filters false |
| --minimum-mapping-quality 20 | --minimum-mapping-quality 20 |
| --disable-tool-default-annotations false | --disable-tool-default-annotations false |
| --enable-all-annotations false | --enable-all-annotations false |
| --allow-old-rms-mapping-quality-annotation-data false | --allow-old-rms-mapping-quality-annotation-data false" |
| ",Version="4.2.4.1" | Version="4.2.4.1", |
Prod helios
/Work/Users/apraga/bisonex/script/files/tmp_63003856_S135〉less 63003856_S135_DP_over_30.vcf
##GATKCommandLine=<ID=HaplotypeCaller,CommandLine="HaplotypeCaller
,Date="January 10, 2023 at 12:26:57 AM CET">
******* DONE VCF : même différence
CLOSED: [2023-01-20 Fri 22:48]
/Work/Projects/bisonex/ref-63003856_S135〉ls *.vcf* | insert nblines {|e| (^zgrep -v '^#' $e.name | wc -l)} | select name nblines 01/14/2023 08:04:02 PM
╭───┬───────────────────────────────────────────────────────────────────────────┬─────────╮
│ # │ name │ nblines │
├───┼───────────────────────────────────────────────────────────────────────────┼─────────┤
│ 0 │ 63003856_S135_DP_over_30.vcf.gz │ 84708 │
│ 1 │ 63003856_S135_DP_over_30.vcf.gz.tbi │ 1 │
│ 2 │ 63003856_S135_DP_over_30_not_SNP.recode.vcf │ 11362 │
│ 3 │ 63003856_S135_DP_over_30_not_SNP_consensual_sequence.vcf │ 8864 │
│ 4 │ 63003856_S135_DP_over_30_not_SNP_consensual_sequence_not_technical.vcf.gz │ 6478 │
╰───┴───────────────────────────────────────────────────────────────────────────┴─────────╯
/Work/Users/apraga/bisonex/script/files/tmp_63003856_S135〉ls *.vcf* | insert nblines {|e| (^zgrep -v '^#' $e.name | wc -l)} | select name nblines 01/14/2023 08:05:23 PM
╭───┬───────────────────────────────────────────────────────────────────────────┬─────────╮
│ # │ name │ nblines │
├───┼───────────────────────────────────────────────────────────────────────────┼─────────┤
│ 0 │ 63003856_S135_DP_over_30.vcf │ 84724 │
│ 1 │ 63003856_S135_DP_over_30_not_SNP.recode.vcf │ 11377 │
│ 2 │ 63003856_S135_DP_over_30_not_SNP_consensual_sequence.vcf │ 8884 │
│ 3 │ 63003856_S135_DP_over_30_not_SNP_consensual_sequence_not_technical.vcf.gz │ 6759 │
╰───┴───────────────────────────────────────────────────────────────────────────┴─────────╯
***** TODO Relancer avec 4 coeurs
-c 4
****** DONE Alignement ok !!
CLOSED: [2023-01-20 Fri 23:24]
$ samtools view -c files/tmp_63003856_S135/63003856_S135.bam
128077207
****** TODO Vérifier les flags de chaque étape
post_cleanSam = _cleaned.bam
post_markDuplicate = _marked_dup.bam
post_BaseRecalibrator = _recal.table
post_ApplyBQSR = _recalibrated_hg38.bam
post_haplotypecaller = .vcf
post_depth_filter = _DP_over_30.vcf
post_exclude_SNP = _DP_over_30_not_SNP
post_consensual = _DP_over_30_not_SNP_consensual_sequence.vcf
post_technical = _DP_over_30_not_SNP_consensual_sequence_not_technical.vcf.gz
******* DONE Clean sam
CLOSED: [2023-01-20 Fri 23:44]
ref
$ samtools view -H 63003856_S135_cleaned.bam | grep PN
@PG ID:samtools PN:samtools PP:bwa VN:1.10 CL:/mnt/h/tools/samtools sort -@ 4 -O BAM -o /mnt/j/working_directory_pipeline_analyse_exome/tmp_63003856_S135/63003856_S135.bam
@PG ID:samtools PN:samtools PP:bwa VN:1.13 CL:samtools sort -@ 4 -O BAM -o files/tmp_63003856_S135/63003856_S135.bam
******* DONE Markduplicates
CLOSED: [2023-01-20 Fri 23:44]
ref
| @PG ID:MarkDuplicates VN:Version:4.2.4.1 CL:MarkDuplicates | @PG ID:MarkDuplicates VN:Version:4.2.4.1 CL:MarkDuplicates | |
| --INPUT /mnt/j/working_directory_pipeline_analyse_exome/tmp_63003856_S135/63003856_S135_cleaned.bam | --INPUT files/tmp_63003856_S135/63003856_S135_cleaned.bam | |
| --OUTPUT /mnt/j/working_directory_pipeline_analyse_exome/tmp_63003856_S135/63003856_S135_marked_dup.bam | --OUTPUT files/tmp_63003856_S135/63003856_S135_marked_dup.bam | |
| --METRICS_FILE /mnt/j/working_directory_pipeline_analyse_exome/tmp_63003856_S135/63003856_S135_marked_dup.metrix | --METRICS_FILE files/tmp_63003856_S135/63003856_S135_marked_dup.metrix | |
| --VERBOSITY WARNING | --VERBOSITY WARNING | |
| --CREATE_INDEX true | --CREATE_INDEX true | |
| --MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP 50000 | --MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP 50000 | |
| --MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 8000 | --MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 8000 | |
| --SORTING_COLLECTION_SIZE_RATIO 0.25 | --SORTING_COLLECTION_SIZE_RATIO 0.25 | |
| --TAG_DUPLICATE_SET_MEMBERS false | --TAG_DUPLICATE_SET_MEMBERS false | |
| --REMOVE_SEQUENCING_DUPLICATES false | --REMOVE_SEQUENCING_DUPLICATES false | |
| --TAGGING_POLICY DontTag | --TAGGING_POLICY DontTag | |
| --CLEAR_DT true | --CLEAR_DT true | |
| --DUPLEX_UMI false | --DUPLEX_UMI false | |
| --ADD_PG_TAG_TO_READS true | --ADD_PG_TAG_TO_READS true | |
| --REMOVE_DUPLICATES false | --REMOVE_DUPLICATES false | |
| --ASSUME_SORTED false | --ASSUME_SORTED false | |
| --DUPLICATE_SCORING_STRATEGY SUM_OF_BASE_QUALITIES | --DUPLICATE_SCORING_STRATEGY SUM_OF_BASE_QUALITIES | |
| --PROGRAM_RECORD_ID MarkDuplicates | --PROGRAM_RECORD_ID MarkDuplicates | |
| --PROGRAM_GROUP_NAME MarkDuplicates | --PROGRAM_GROUP_NAME MarkDuplicates | |
| --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> | --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> | |
| --OPTICAL_DUPLICATE_PIXEL_DISTANCE 100 | --OPTICAL_DUPLICATE_PIXEL_DISTANCE 100 | |
| --MAX_OPTICAL_DUPLICATE_SET_SIZE 300000 | --MAX_OPTICAL_DUPLICATE_SET_SIZE 300000 | |
| --QUIET false | --QUIET false | |
| --VALIDATION_STRINGENCY STRICT | --VALIDATION_STRINGENCY STRICT | |
| --COMPRESSION_LEVEL 2 | --COMPRESSION_LEVEL 2 | |
| --MAX_RECORDS_IN_RAM 500000 | --MAX_RECORDS_IN_RAM 500000 | |
| --CREATE_MD5_FILE false | --CREATE_MD5_FILE false | |
| --GA4GH_CLIENT_SECRETS client_secrets.json | --GA4GH_CLIENT_SECRETS client_secrets.json | |
| --help false | --help false | |
| --version false | --version false | |
| --showHidden false | --showHidden false | |
| --USE_JDK_DEFLATER false | --USE_JDK_DEFLATER false | |
| --USE_JDK_INFLATER false | --USE_JDK_INFLATER false | |
| @PG ID:samtools.1 PN:samtools PP:samtools VN:1.13 CL:samtools view -H 63003856_S135_marked_dup.bam |
******* DONE ApplyBQSR
CLOSED: [2023-01-20 Fri 23:48]
Ref
@PG ID:GATK ApplyBQSR VN:4.2.4.1 CL:ApplyBQSR
| --output /mnt/j/working_directory_pipeline_analyse_exome/bam/63003856_S135_recalibrated_hg38.bam | --output files/bam/63003856_S135_recalibrated_hg38.bam |
| --bqsr-recal-file /mnt/j/working_directory_pipeline_analyse_exome/tmp_63003856_S135/63003856_S135_recal.table | --bqsr-recal-file files/tmp_63003856_S135/63003856_S135_recal.table |
| --input /mnt/j/working_directory_pipeline_analyse_exome/tmp_63003856_S135/63003856_S135_marked_dup.bam | --input files/tmp_63003856_S135/63003856_S135_marked_dup.bam |
| --reference /mnt/j/bases_de_donnees/genome/GRCh38_latest_genomic.fna | --reference /Work/Groups/bisonex/data-alexis-reference/genome/GRCh38_latest_genomic.fna |
| --verbosity WARNING | --verbosity WARNING |
| --preserve-qscores-less-than 6 | --preserve-qscores-less-than 6 |
| --use-original-qualities false | --use-original-qualities false |
| --quantize-quals 0 | --quantize-quals 0 |
| --round-down-quantized false | --round-down-quantized false |
| --emit-original-quals false | --emit-original-quals false |
| --global-qscore-prior -1.0 | --global-qscore-prior -1.0 |
| --interval-set-rule UNION | --interval-set-rule UNION |
| --interval-padding 0 | --interval-padding 0 |
| --interval-exclusion-padding 0 | --interval-exclusion-padding 0 |
| --interval-merging-rule ALL | --interval-merging-rule ALL |
| --read-validation-stringency SILENT | --read-validation-stringency SILENT |
| --seconds-between-progress-updates 10.0 | --seconds-between-progress-updates 10.0 |
| --disable-sequence-dictionary-validation false | --disable-sequence-dictionary-validation false |
| --create-output-bam-index true | --create-output-bam-index true |
| --create-output-bam-md5 false | --create-output-bam-md5 false |
| --create-output-variant-index true | --create-output-variant-index true |
| --create-output-variant-md5 false | --create-output-variant-md5 false |
| --max-variants-per-shard 0 | --max-variants-per-shard 0 |
| --lenient false | --lenient false |
| --add-output-sam-program-record true | --add-output-sam-program-record true |
| --add-output-vcf-command-line true | --add-output-vcf-command-line true |
| --cloud-prefetch-buffer 40 | --cloud-prefetch-buffer 40 |
| --cloud-index-prefetch-buffer -1 | --cloud-index-prefetch-buffer -1 |
| --disable-bam-index-caching false | --disable-bam-index-caching false |
| --sites-only-vcf-output false | --sites-only-vcf-output false |
| --help false | --help false |
| --version false | --version false |
| --showHidden false | --showHidden false |
| --QUIET false | --QUIET false |
| --use-jdk-deflater false | --use-jdk-deflater false |
| --use-jdk-inflater false | --use-jdk-inflater false |
| --gcs-max-retries 20 | --gcs-max-retries 20 |
| --gcs-project-for-requester-pays | --gcs-project-for-requester-pays |
| --disable-tool-default-read-filters false PN:GATK ApplyBQSR | --disable-tool-default-read-filters false PN:GATK ApplyBQSR |
****** KILL Vérifier sha256sum
CLOSED: [2023-01-24 Tue 23:00]
alignment: différent
****** KILL Comparer bam
CLOSED: [2023-01-25 Wed 21:58]
/Work/Users/apraga/bisonex/script/files〉picard CompareSAMs LENIENT_LOW_MQ_ALIGNMENT=true LENIENT_DUP=true tmp_63003856_S135/63003856_S135.bam /Work/Groups/bisonex/ref/tmp_63003856_S135/63003856_S135.bam O=compare-bam.tsv
picard CompareSAMs -LENIENT_LOW_MQ_ALIGNMENT true -LENIENT_DUP true tmp_63003856_S135/63003856_S135.bam /Work/Groups/bisonex/ref/tmp_63003856_S135/63003856_S135.bam -O compare-bam.tsv
VN Program Record attribute differs.
File 1: 1.13
File 2: 1.10
SAM files differ.
[Tue Jan 24 23:12:50 CET 2023] picard.sam.CompareSAMs done. Elapsed time: 7.32 minutes.
***** DONE Relancer avec la même version de samtools
CLOSED: [2023-01-25 Wed 21:58]
Pas d'impact
***** TODO Comparer tsv de sortie
***** TODO Regarder où sont les variants différents
** TODO Validation : NA12878
:PROPERTIES:
:CATEGORY: NA18278
:END:
https://github.com/ga4gh/benchmarking-tools
Prérequis :
- [[*hap.py][hap.py]]
- [[*NA12878][NA12878]]
*** TODO GIAB
:PROPERTIES:
:CATEGORY: GIAB
:END:
**** DONE télécharger données avec Nextflow
CLOSED: [2023-02-25 Sat 19:47]
***** DONE Télécharger NA12878 (HG001)
CLOSED: [2023-02-25 Sat 19:46]
****** DONE Fastq HiSeq
CLOSED: [2023-02-25 Sat 19:46]
On prend le Hiseq, qui est probablement ce qu'utilise Centogène :
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/
On utilisé les données "trimmés" (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1069-7), i.e qui ont enlevé les fragments plus petits que la taille d'un read.
Informations:
- https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/Garvan_NA12878_HG001_HiSeq_Exome.README
- Sequencer: HiSeq2500
- kit: Nextera Rapid Capture Exome and Expanded Exome
Il y a 2 samples (NIST7035 et NIST7086), chacun sur 2 lanes -> à concaténer
NB : liste techno illumina https://www.illumina.com/systems/sequencing-platforms.html
Hiseq postérieur nextseq 550
****** DONE Capture : Exons (bed)
CLOSED: [2023-02-25 Sat 19:46]
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/nexterarapidcapture_expandedexome_targetedregions.bed.gz
****** DONE Bed, vcf
CLOSED: [2023-02-24 Fri 23:45]
***** DONE Télécharger Ashkenazy trio HG002
CLOSED: [2023-02-17 Fri 19:29]
***** DONE Renommer les chromosomes
CLOSED: [2023-02-17 Fri 19:30]
***** DONE Genome de reference NCBI
CLOSED: [2023-02-25 Sat 19:46]
***** DONE Bed avec les exons
CLOSED: [2023-03-29 Wed 23:04]
****** DONE hg19
CLOSED: [2023-02-26 Sun 22:37]
****** DONE hg38
CLOSED: [2023-03-29 Wed 23:04]
- [X] Télécharger hg19 : ok
- [X] convertir bed en interval list
picard BedToIntervalList -I exons_illumina.bed -O exons_illumina.list -SD ../../genome/GRCh19/genomeRef.dict
- [X] puis en hg38
picard LiftOverIntervalList -I exons_illumina.list -O exons_illumina_hg38.list --CHAIN hg19ToHg38.over.chain -SD ../../genome/GRCh38.p13/genomeRef.dict
- [X] puis en bed
**** Notes
https://github.com/genome-in-a-bottle/giab_FAQ
**** DONE Discussion alexis : Mail
CLOSED: [2023-03-29 Wed 22:40]
Avec le patient NA12878 et comparaison avec hap.py du VCF de Genome In A Bottle ("gold" standard), on avait pour rappel
- sensibilité (=recall) 71% pour indel, 85% SNP
- précision (= VPP) 69 et 97% respectivement
| Type | TRUTH | TP | FN | QUERY | FP | UNK | FP.gt | FP.al | Recall | Precision |
| INDEL | 4871 | 3461 | 1410 | 7048 | 1554 | 1987 | 193 | 346 | 0.710532 | 0.692946 |
| SNP | 46032 | 39369 | 6663 | 44600 | 1186 | 4041 | 304 | 30 | 0.855253 | 0.970759 |
Les statistiques sur les génomes sont bien meilleurs (cf precisionFDA challenge).
Pour les exome, un article [1] a fait a des meilleures stats sur ce patient avec BWA et GATK mais ils ont moins de variant (on a presque un facteur 2 !).
Je soupçonne qu'on ne travaille pas sur les mêmes zones de capture (pas réussi à récupérer leur .bed)
| Exome | Type | TP | FP | FN | Sensitivity | Precision | F-Score | FDR |
| 1 | SNV | 23689 | 1397 | 613 | 0.975 | 0.944 | 0.959 | 0.057 |
| 2 | SNV | 23946 | 865 | 356 | 0.985 | 0.965 | 0.975 | 0.036 |
| 1 | indel | 1254 | 72 | 75 | 0.944 | 0.946 | 0.945 | 0.054 |
| 2 | indel | 1309 | 10 | 20 | 0.985 | 0.992 | 0.989 | 0.008 |
Pour essayer d'améliorer les statistiques :
- La version du génome GRC38 vs GRCh38.p13 ne change quasiment rien
- Désactiver dbSNP ne change strictement rien pour le variant calling
J'ai exploré les faux négatifs :
- la grande majorité n'est juste pas vue (ce n'est pas un problème d'haploïde/génotype)
- la répartition par chromosome est relativement homogène, sauf sur le 6 ()
- la majorité est en 5' et 3'UTR (selon Best refseq)
Conclusion: je pense m'arrêter là pour la validation du variant calling par manque de temps. Il faudrait creuser pour savoir pourquoi certains variants ne sont pas vus par GATK mais ce n'est pas la majorité. En tout cas, je peux justifier d'une première analyse pour la thèse.
Ça te va ?
[1]
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2928-9
Résultats ici https://static-content.springer.com/esm/art%3A10.1186%2Fs12859-019-2928-9/MediaObjects/12859_2019_2928_MOESM8_ESM.pdf
**** DONE Comparaison
CLOSED: [2023-03-04 Sat 11:14]
HGREF=/Work/Groups/bisonex/data-alexis-reference/genome/GRCh38_latest_genomic.fna ./result/bin/hap.py /Work/Groups/bisonex/NA12878/HG001_GRCh38_1_22_v4.2.1
_benchmark_renamed.vcf.gz script/files/vcf/NA12878_NIST7035_vep_annot.vcf -f /Work/Groups/bison
ex/NA12878/HG001_GRCh38_1_22_v4.2.1_benchmark.bed -o test
na1878.slurm
#+begin_src slurm
#!/bin/bash
#SBATCH -c 4
#SBATCH -p smp
#SBATCH --time=01:00:00
#SBATCH --mem=32G
module load nix/2.11.0
export HGREF=/Work/Groups/bisonex/data-alexis-reference/genome/GRCh38_latest_genomic.fna
dir=/Work/Groups/bisonex/data/NA12878/GRCh38
hap.py ${dir}/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz script/files/vcf/NA12878_NIST7035.vcf -f ${dir}/HG001_GRCh38_1_22_v4.2.1_benchmark.bed -o test
#+end_src
***** KILL beaucoup trop de faux négatifs
CLOSED: [2023-02-17 Fri 19:37]
****** DONE Test 1 : vep annot : beaucoup trop de faux négatif
CLOSED: [2023-02-06 lun. 13:40]
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 276768 274 276494 1500 257 968 26 15 0.000990 0.516917 0.645333 0.001976 NaN NaN 1.483361 6.129187
INDEL PASS 276768 274 276494 1500 257 968 26 15 0.000990 0.516917 0.645333 0.001976 NaN NaN 1.483361 6.129187
SNP ALL 1937706 1193 1936513 3338 106 2037 11 2 0.000616 0.918524 0.610246 0.001231 2.0785 1.861183 1.539064 2.703663
SNP PASS 1937706 1193 1936513 3338 106 2037 11 2 0.000616 0.918524 0.610246 0.001231 2.0785 1.861183 1.539064 2.703663
****** KILL Test 3 : indexer vcf de reference
CLOSED: [2023-02-06 lun. 17:19]
Même résultat avec vcfeval, qui a besoin de la version indexée
****** DONE Test 3 sans filtre vep : idem
CLOSED: [2023-02-06 lun. 17:19]
Benchmarking Summary:
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 276768 10535 266233 52169 10969 30616 3552 2122 0.038064 0.491069 0.586862 0.070652 NaN NaN 1.483361 0.509510
INDEL PASS 276768 10535 266233 52169 10969 30616 3552 2122 0.038064 0.491069 0.586862 0.070652 NaN NaN 1.483361 0.509510
SNP ALL 1937706 105753 1831953 357652 74634 177259 35111 797 0.054576 0.586270 0.495619 0.099857 2.0785 1.42954 1.539064 0.324923
SNP PASS 1937706 105753 1831953 357652 74634 177259 35111 797 0.054576 0.586270 0.495619 0.099857 2.0785 1.42954 1.539064 0.324923
****** DONE Test 4 avec vcfeval sur vep_annot : idem
CLOSED: [2023-02-06 lun. 17:18]
#+begin_src
#!/bin/bash
#SBATCH -c 4
#SBATCH -p smp
#SBATCH --time=01:00:00
#SBATCH --mem=32G
module load nix/2.11.0
export HGREF=/Work/Groups/bisonex/data-alexis-reference/genome/GRCh38_latest_genomic.fna dir=/Work/Groups/bisonex/data/NA12878/GRCh38
rtg vcfeval -b /Work/Groups/bisonex/data/NA12878/GRCh38/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz -c files/vcf/NA12878_NIST7035_vep_annot.vcf.gz -o test-rtg -t /Work/Groups/bisonex/data/genome/GRCh38.p13/genomeRef.sdf
#+end_src
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
1.000 2984 2682 1840 3890296 0.5931 0.0008 0.0015
None 2984 2682 1841 3890296 0.5930 0.0008 0.0015
Exemple du log
2023-02-06 13:50:14 Reference NC_000001.11 baseline contains 307854 variants.
2023-02-06 13:50:14 Reference NC_000001.11 calls contains 426 variants.
2023-02-06 13:50:15 Reference NC_000002.12 baseline contains 325877 variants.
2023-02-06 13:50:15 Reference NC_000002.12 calls contains 320 variants.
****** DONE Regarder quelques variants à la main
CLOSED: [2023-02-07 Tue 22:01]
Ex:
Il manque NC_000001.11 783006 . A G 50 PASS
Il y a A -> G et C -> A sur cette position
**** DONE Restreindre genome de référence
CLOSED: [2023-03-04 Sat 11:15]
***** Discussion Alexis
le pipeline prend en compte 5', 3', variant canoniques d'épissage + prédit spip
Le plus simple pour le moment est de restreindre seulement aux exons
GENCODE (version europénne) vs RefSeq: [[https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-16-S8-S2][article 2015]] en faveur de GENCODE mais Alexis conseille Refseq
***** DONE -f, -R ou -T ?
CLOSED: [2023-02-25 Sat 19:47]
Selon la doc : -f avec le bed fourni et -T pour filtrer sor les exons
****** rtg tools
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
3.000 1015 910 206 32531 0.8154 0.0303 0.0583
None 1015 910 206 32531 0.8154 0.0303 0.0583
**** TODO Exons seuls
***** DONE BestRefSeq
CLOSED: [2023-02-19 Sun 12:05]
Dans refseq,[[https://www.ncbi.nlm.nih.gov/genome/annotation_euk/process/][2 types de modèles pour le gène]]
- basé sur refseq (NM_, NP_), curée
- basé sur gnomon (XM_ , XP_), prédite
Les modèles basés sur refseq ont la préférénce (cf lien)
On se restreint donc à bestrefseq
#+begin_src sh
wget https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.gff.gz
gunzip https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.gff.gz
#+end_src
On se restrein aux exons codant (NM_)
#+begin_src
awk '/BestRefSeq\texon/ && /transcript_id=NM/ {print $1"\t"$4"\t"$5;}' GRCh38_latest_genomic.gff > exons.csv
#+end_src
Puis intersection
****** DONE Tests après correction bug dans noms de chromosome : precision ~ ok, recall très mauvais -> trop de FN ?
CLOSED: [2023-02-19 Sun 12:05]
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision
INDEL ALL 7230 321 6909 1500 290 888 27 18 0.044398 0.526144
INDEL PASS 7230 321 6909 1500 290 888 27 18 0.044398 0.526144
SNP ALL 59052 1653 57399 3338 101 1583 12 2 0.027992 0.942450
SNP PASS 59052 1653 57399 3338 101 1583 12 2 0.027992 0.942450
METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
0.592000 0.081887 NaN NaN 1.54733 6.129187
0.592000 0.081887 NaN NaN 1.54733 6.129187
0.474236 0.054370 2.433271 1.861183 1.57523 2.703663
0.474236 0.054370 2.433271 1.861183 1.57523 2.703663
****** DONE Vérifier exons: on a l'union des exons de tous les transcripts...
CLOSED: [2023-02-19 Sun 12:05]
Il faudrait un .bed d'illumina
On teste Twist for Illumina Exome 2.0 Plus BED File (hg19) sur https://support.illumina.com/downloads/nextera-flex-for-enrichment-BED-files.html
Conversion en hg38 avec ucsc
Renommage des chromosomes
#+begin_src
sed 's:^:s/chr:;s:chrMT:chrM:;s:\s:\\t/:;s:$:\\t/:' ../../genome/GRCh38.p13/chromosome_mapping.txt > pattern.sed
sed -i.bak -f pattern.sed illumina_exons.bed
bedtools intersect -a HG001_GRCh38_1_22_v4.2.1_benchmark.bed -b illumina_exons.bed > HG001_GRCh38_1_22_v4.2.1_benchmark_illumina_exons.bed
#+end_src
Intersection
***** KILL Bed illumina
CLOSED: [2023-02-24 Fri 23:44]
****** KILL Sans filtre vep: Inversion truth et query... on recommence
CLOSED: [2023-02-19 Sun 13:15]
****** KILL Sans filtre vep: mieux mais pas exceptionnel
CLOSED: [2023-02-24 Fri 23:44]
cd work/00/2c72e62400956c96fb101ac7af405e/
$ cat .command.out
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision
INDEL ALL 922 490 432 942 439 0 32 56 0.531453 0.533970
INDEL PASS 922 490 432 942 439 0 32 56 0.531453 0.533970
SNP ALL 24618 18689 5929 21257 2571 0 239 17 0.759160 0.879052
SNP PASS 24618 18689 5929 21257 2571 0 239 17 0.759160 0.879052
METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
0.0 0.532709 NaN NaN 1.628743 2.799180
0.0 0.532709 NaN NaN 1.628743 2.799180
0.0 0.814719 2.899604 2.881526 1.598479 1.564378
0.0 0.814719 2.899604 2.881526 1.598479 1.564378
****** KILL Comprendre pour les FN explosent avec vep annot
CLOSED: [2023-02-24 Fri 23:44]
NC_000001.11 924024 . C G
non couverte dans bam -> 1er exons de SAMD11 mais non couverte
Il est dans NM_001385641.1 mais pas dans NM_152486.4
***** KILL Obtenir les zones couvertes depuis le bam directement
CLOSED: [2023-02-24 Fri 23:44]
#+begin_src sh
samtools sort NA12878_chr1.bam -o NA12878_chr1_sorted.bam
bedtools genomecov -ibam NA12878_chr1_sorted.bam -bg > chr1.bedgraph
head chr1.bedgraph
#+end_src
#+RESULTS:
: NC_000001.11 10001 10021 1
: NC_000001.11 10021 10059 2
: NC_000001.11 10059 10063 1
: NC_000001.11 10063 10087 2
: NC_000001.11 10087 10110 1
: NC_000001.11 10110 10113 3
: NC_000001.11 10113 10121 2
On prend les régions avec 20 reads
awk '$4 > 20 {print $1"\t"$2"\t"$3}' chr1.bedgraph > tomerge.bed
On fusionne les régions
bedtools merge -i tomerge.bed > exons_from_bam.bed
Problème : on manque parfois le bord des exons...
***** KILL Vérifier un bam de référence de NA12878
CLOSED: [2023-02-24 Fri 23:44]
Notamment pour la définitions des exons, on a des reads qui ne semblent pas alignés sur les bons exons selon IGV
On donne directemet à IGV le bam d'illuma WES ( https://github.com/genome-in-a-bottle/giab_data_indexes )
IGV n'aime pas le ftp apparement...
On récupére 2 échantillons pour ce patient sur HiSeq Exome
#+begin_src sh :dir /ssh:meso:/Work/Groups/bisonex/data/giab/GRCh38
wget https://raw.githubusercontent.com/genome-in-a-bottle/giab_data_indexes/master/NA12878/alignment.index.NA12878_HiSeq_Exome_Garvan_GRCh37_09252015 -O todl.txt
awk '{print $1}' todl.txt | wget -i -
#+end_src
On récupére le premier échantillons en prenant just le chromosome 1
#+begin_src
samtools merge NA12878-NIST7035-HiSeq_Exome_Garan_GRCh37.bam project
.NIST_NIST7035_H7AP8ADXX_TAAGGCGA_1_NA12878.bwa.markDuplicates.bam project.NIST_NIST7035_H7AP8ADXX_TA
AGGCGA_2_NA12878.bwa.markDuplicates.bam
samtools index NA12878-NIST7035-HiSeq_Exome_Garan_GRCh37.bam
#+end_src
****** Étude
NC_000001.11:924024 :
- bed d’illumina : pas correct
- bed refseq : meux
- bam : exon non ciblé, probablement parce que Mane select n’a pas été utilisé mais refseq (https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:28706)
NC_000001.11:924310: idem
NC_000001.11:924321: idem
NC_000001.11:924533: idem
NC_000001.11:966227: le bed ne couvre pas le bon exon !
https://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr1%3A965863%2D966591&hgsid=295298291_vhDbIv5YIckCKLpGB7UUaZi2cAkr
NC_000001.11:966272
***** KILL Filtrer variants introniques de référence avec vep
CLOSED: [2023-02-24 Fri 23:44]
****** KILL variant calling seulf + seulement -f: nombreux FP
CLOSED: [2023-02-24 Fri 23:44]
/Work/Users/apraga/bisonex/work/68/f1cf72a5a4078fdf743fb3844b369a
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision
INDEL ALL 519 284 235 52169 37789 14094 12 64 0.547206 0.007511
INDEL PASS 519 284 235 52169 37789 14094 12 64 0.547206 0.007511
SNP ALL 22131 17434 4697 357652 305313 34904 189 32 0.787764 0.054020
SNP PASS 22131 17434 4697 357652 305313 34904 189 32 0.787764 0.054020
METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
0.270160 0.014820 NaN NaN 1.775956 0.509510
0.270160 0.014820 NaN NaN 1.775956 0.509510
0.097592 0.101108 2.971834 1.429533 1.579776 0.324923
0.097592 0.101108 2.971834 1.429533 1.579776 0.324923
***** DONE Bed donnés par GIAB (en hg19) : résultats corrects
CLOSED: [2023-03-09 Thu 22:42] SCHEDULED: <2023-03-02 Thu>
Téléchargé à la main et converti via UCSCS. À automatiser, voir : *hg38
Sans oublier de renommer les chromosomes !
| Type | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision |
|-------+-------------+----------+----------+-------------+----------+-----------+-------+-------+---------------+------------------|
| INDEL | 4871 | 3461 | 1410 | 7048 | 1554 | 1987 | 193 | 346 | 0.710532 | 0.692946 |
| SNP | 46032 | 39369 | 6663 | 44600 | 1186 | 4041 | 304 | 30 | 0.855253 | 0.970759 |
Type Ssensibilité VPP
INDEL 0.710532 0.692946
SNP 0.855253 0.970759
| METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
|----------------+-----------------+------------------------+------------------------+---------------------------+---------------------------|
| 0.281924 | 0.701629 | | | 1.6174985978687606 | 3.0674091441969518 |
| 0.090605 | 0.909353 | 2.529551552318896 | 2.4019675131548843 | 1.6206857273037931 | 1.6274012964054214 |
***** DONE Re-tester avec exons Refseq et option -T: résultats un peu moins bons
CLOSED: [2023-03-09 Thu 22:42] SCHEDULED: <2023-03-08 Wed>
On utilise directement les coordonées données par refseq
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision
INDEL ALL 7226 3417 3809 6978 1599 1918 228 353 0.472876 0.683992
SNP ALL 59052 37825 21227 43480 1913 3740 675 35 0.640537 0.951862
METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
0.274864 0.559171 NaN NaN 1.547733 2.756151
0.086017 0.765767 2.433271 2.350281 1.575230 1.492346
***** TODO Vérifier que la zone de capture a vraiment besoin d'un liftover...
**** TODO D'où vient la différence ?
SCHEDULED: <2023-03-09 Thu>
***** DONE Statistiques
CLOSED: [2023-03-15 mer. 13:41] SCHEDULED: <2023-03-04 Sat>
On confirnme le nombre de SNP 6665
- 304 ont un problème d’haploide/allèle différente
- 28 avec une différence de génotype
- La majorité ne sont pas vu 6361
[1] "ALl FN 6665"
[1] "Genotype mismatch 28"
[1] "haplotype mismatch 304"
[1] "Missing 6333"
***** DONE Majorité des FN pour SPN sont peu couverts
CLOSED: [2023-03-16 Thu 16:54]
Chromosome 1 : majorité des FN ne sont pas vus...
cf figure
#+attr_html: :width 500px
[[file:reads.png]]
1623412 : 5’UTR
1953616 : intronique mais dans bed...
2589991
3488573
3488732
3780326
3884683
3914055
4711993
4712657
***** DONE Alignement
CLOSED: [2023-03-15 mer. 13:41]
****** DONE Impact de la version mineure du genome: non
CLOSED: [2023-03-09 Thu 23:13]
******* DONE GHC38 + version alexis + exons refseq
CLOSED: [2023-03-22 Wed 00:02]
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision
INDEL ALL 7226 3417 3809 6979 1599 1919 228 353 0.472876 0.683992
SNP ALL 59052 37827 21225 43483 1913 3741 676 35 0.640571 0.951865
METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
0.274968 0.559171 NaN NaN 1.547733 2.756698
0.086034 0.765792 2.433271 2.350254 1.575230 1.492375
******* DONE GHC38.p13 + notre version alexis + exons refseq
CLOSED: [2023-03-22 Wed 00:02]
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision
INDEL ALL 7226 3417 3809 6978 1599 1918 228 353 0.472876 0.683992
SNP ALL 59052 37825 21227 43480 1913 3740 675 35 0.640537 0.951862
METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
0.274864 0.559171 NaN NaN 1.547733 2.756151
0.086017 0.765767 2.433271 2.350281 1.575230 1.492346
***** DONE Vérifier si coordonnées génomiques (vcf)
CLOSED: [2023-03-09 Thu 23:05]
***** DONE Variant calling
CLOSED: [2023-03-22 Wed 23:11]
****** DONE Désactiver dbSNP : idem
CLOSED: [2023-03-22 Wed 15:00]
Après haplotycaller
$ sha256sum work/94/d4ae5999ca59a25469b1ed221b7b1b/NA12878_NIST.vcf.gz
193fb37e2a581bb2855ae6fd11638f3b97958515def0e4295005e6a60c9dc650 work/94/d4ae5999ca59a25469b1ed221b7b1b/NA12878_NIST.vcf.gz
Fichier utilisé par hap.py
$ sha256sum work/5b/199360963641524a076d06b621e7e0/NA12878_NIST.vcf.gz
193fb37e2a581bb2855ae6fd11638f3b97958515def0e4295005e6a60c9dc650 work/5b/199360963641524a076d06b621e7e0/NA12878_NIST.vcf.gz
On a bien le même que [[*Bed donnés par GIAB (en hg19) : résultats corrects][Bed donnés par GIAB (en hg19) : résultats corrects]]
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision
INDEL ALL 4871 3461 1410 7048 1554 1987 193 346 0.710532 0.692946
SNP ALL 46032 39367 6665 44599 1186 4042 304 30 0.855209 0.970757
METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
0.281924 0.701629 NaN NaN 1.617499 3.067409
0.090630 0.909327 2.529552 2.402151 1.620686 1.627342
***** DONE Définition des exons
CLOSED: [2023-03-22 Wed 23:16]
****** KILL Vérifier qu'il ne manque pas des exons (avec bam ?)
CLOSED: [2023-03-16 Thu 16:54] SCHEDULED: <2023-03-05 Sun>
****** KILL Vérifier la couverture des variants introniques
CLOSED: [2023-03-22 Wed 15:01]
NC_000001.11 23344310 : intronique mais la région dans le bed couvre en partie l'exon
Profondeur sur le variant : 14 reads sur l'alt
****** Notes: UTR = exon - CDS
intron dans le gene = mRNA - exon
On le vérifie avec
25 │ NC_000001.11 BestRefSeq exon 65520 65573
27 │ NC_000001.11 BestRefSeq CDS 65565 65573
https://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr1%3A65520%2D65573&hgsid=295007972_J1rAQiur4dC2KLadnaOCw27WUihG
mRNA :
https://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr1%3A65419%2D71585&hgsid=295007972_J1rAQiur4dC2KLadnaOCw27WUihG
****** DONE Statistiques : reads par type (introns, UTR, exons)
CLOSED: [2023-03-22 Wed 23:16]
- la répartition par chromosome est relativement homogène, sauf sur le 6 ()
- la majorité est en 5' et 3'UTR (selon Best refseq)
#+begin_src julia
using CSV
using DataFrames, DataFramesMeta
using XAM, GenomicFeatures
function exonFromBestRefSeq(f)
println("Reading exons from $(f)")
cols = ["seqname", "source", "feature", "start", "end", "score", "strand", "frame", "attribute"]
d = DataFrame(CSV.File(f ,header=cols, delim="\t",comment="#"))
@chain d begin
@rsubset :feature in ["exon", "CDS", "mRNA"] :source .== "BestRefSeq"
@select :seqname :source :feature :start :end
@distinct
end
end
# Get exons from refseq (Bestrefeq, gnomon or just refseq)
function getExon(f)
if isfile(f)
println("Reading exons from preprocessed $(f)")
d = DataFrame(CSV.File(f))
else
d = exonFromBestRefSeq("GRCh38_latest_genomic.gff.gz")
CSV.write(f, d)
end
return d
end
# Return false negative and TRUTH annotation (genotype, type...)
# Read vcf manually: it's a tab-separated file and we extract the TRUTH column manually
#BD = Decision for call (TP/FP/FN/N)
#BK = tSub-type for decision (match/mismatch type)
#BI = Additional comparison information
#QQ = Variant quality for ROC creation.
#BV = High-level variant type (SNP|INDEL).
#BL = High-level location type (het|homref|hetalt|homalt|nocall).
function getFN(f)
cols = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT", "TRUTH", "QUERY"]
vcf = DataFrame(CSV.File(f,comment="#", delim="\t",
header = cols))
# Extract genothype and other subcolumn fronm FORMAT
format = split(vcf[1,:FORMAT], ":")
@chain vcf begin
@rtransform $format=split(:TRUTH, ':')
@subset :BD .== "FN"
@select :type=:BVT :CHROM :POS :REF :ALT :GT :BK :BLT
end
end
# Search a variant in refseq exons
# Return the smallest interval (CDS)
function searchVariant(chrom, pos, exons)
res = @subset exons :seqname .== chrom :start .<= pos :end .>= pos
if isempty(res)
return "intronicOutsideGene"
elseif "CDS" in res.feature
return "CDS"
elseif "exon" in res.feature
# UTR = exon - CDS
return "UTR"
elseif "mRNA" in res.feature
# intron = mRNA - exon
return "intronicGene"
else
return "intronicOutsideGene"
end
end
# Once the bam file has been opened, count the reads for a single variant
function countReadsVariant(reader, chrom, pos)
n = 0
for record in eachoverlap(reader, chrom,pos:pos)
n += 1
end
return n
end
# For all variants, get the number of reads from the bam file
function countReads(d)
bamDir = "../script/files/bam"
bam = bamDir * "/NA12878_NIST7035_recalibrated_hg38.bam"
bai = bam * ".bai"
reader = open(BAM.Reader, bam, index=bai)
d2 = @transform(d, @byrow :reads = countReadsVariant(reader, :CHROM, :POS))
close(reader)
return d2
end
function exonicStatus(d, exons)
# Check variants are in exon: for each variant, search in all exons
@transform(d, @byrow :match = searchVariant(:CHROM, :POS, exons))
end
exons = getExon("GRCh38_latest_genomic_exon.csv")
d = getFN("../out/test-bed/test-allchr.vcf") |>
x -> exonicStatus(x, exons) |>
countReads
# # bed = DataFrame(CSV.File("../out/test-bed/exons_illumina_hg38.bed",comment="#", delim="\t", header=false))
CSV.write("falseNegatives.csv", d)
#+end_src
Discordance nombre de reads avec igv mais cohérent avec samtools : vu avec Alexis : probablement du à IVG, on reste sur samtools
$ samtools view NA12878_NIST7035_recalibrated_hg38.bam NC_000001.11:1734812-1734812 -F "0x200" | wc -l
164
***** KILL Comprendre pourquoi le chromosome 6 a plus de FN
CLOSED: [2023-03-29 Wed 22:39]
Vérifier le graphe fait en julia
$ grep "FN.*SNP" test-allchr.vcf | grep NC_000006 -c
1264
@subset d :type .== "SNP" :CHROM .== "NC_000006.12"
1264×10 DataFrame
Région HLA: difficile car nombreux allèles alternatifs.
Mais dans nos données, sont principalement non vus.
La répartition des reads est similaire aux autres régions...
#+begin_src julia
vcf = vcfHappy("test-allchr.vcf")
d = DataFrame(CSV.File("data/falseNegatives.csv"))
d2 = @subset d :POS .>= 29600000 :POS .<= 32000000 :type .== "SNP" :CHROM .=="NC_000006.12"
#+end_src
#+RESULTS:
: julia> nrow(@subset d2 :BK .== ".")
: 612
: julia> nrow(@subset d2 :BK .!= ".")
: 6
: julia> nrow(@subset d2 :BK .!= "am")
: 612
***** KILL comparer avec Kumaran 2021
CLOSED: [2023-03-22 Wed 23:16]
Bed probablement différent, on a presque un facteur 2 sur le nombre de variant (TP..)
***** KILL Comparer avec stats de NA12878 dans example/happy sur chr21 (exons fournis)
CLOSED: [2023-03-04 Sat 11:01]
****** DONE DP_over_30_not_SNP_consensual_sequence.vcf: horrible
CLOSED: [2023-02-25 Sat 19:47]
déplorable
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 519 16 503 1579 1032 531 2 2 0.030829 0.015267 0.336289 0.020421 NaN NaN 1.775956 4.768382
INDEL PASS 519 16 503 1579 1032 531 2 2 0.030829 0.015267 0.336289 0.020421 NaN NaN 1.775956 4.768382
SNP ALL 22131 1191 20940 4346 2342 813 4 1 0.053816 0.337107 0.187069 0.092815 2.971834 1.967235 1.579776 1.492828
SNP PASS 22131 1191 20940 4346 2342 813 4 1 0.053816 0.337107 0.187069 0.092815 2.971834 1.967235 1.579776 1.492828
****** DONE DP _over_30 : mieux
CLOSED: [2023-03-04 Sat 11:01]
/Work/Users/apraga/bisonex/work/69/cc1391f5a0fbf4db958a370ec17a19
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision
INDEL ALL 76 20 56 36 6 10 1 3 0.263158 0.769231
INDEL PASS 76 20 56 36 6 10 1 3 0.263158 0.769231
SNP ALL 582 312 270 370 9 49 0 0 0.536082 0.971963
SNP PASS 582 312 270 370 9 49 0 0 0.536082 0.971963
******* DONE NC_000021.9:45515163 TA -> T : référence non normalisée ??
CLOSED: [2023-03-04 Sat 11:01]
NC_000021.9 45515163 . TA T 50 PASS BS=45515163;Regions=CONF,TS_contained GT:BD:BK:BI:BVT:BLT:QQ 0/1:FN:lm:d1_5:INDEL:het:. ./.:.:.:.:NOCALL:nocall:0
NC_000021.9 45515163 . TAA T 694.02 PASS BS=45515163;Regions=CONF,TS_contained GT:BD:BK:BI:BVT:BLT:QQ ./.:.:.:.:NOCALL:nocall:. 0/1:FP:lm:d1_5:INDEL:het:694.02
Mais dans notre vcf
NC_000021.9 45515163 rs11284347 TAA T,TA
La séquence est TAAAA donc notre TAA -> TA aurait du être reconnu comme identique. Faut-il utiliser un autre fasta ?
La décomposition de xcmp est
45515163 TAA -> T
45515164 AA -> A
NB: testée avec pre.py
#+begin_src
grep '^#' NA12878_NIST7035_DP_over_30.vcf > NA12878_NIST7035_DP_over_30_chr21.vcf
grep '^NC_000021' NA12878_NIST7035_DP_over_30.vcf >> NA12878_NIST7035_DP_over_30_chr21.vcf
pre.py NA12878_NIST7035_DP_over_30_chr21.vcf out.vcf.gz -r GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta
#+end_src
Selon l'algorithm, on devrait avoir TA -> T et non AA -> A (non left-align)
Si on débug:
#+begin_src
grep "^#" NA12878_NIST7035_DP_over_30_chr21.vcf > test_align.vcf
grep 45515163 NA12878_NIST7035_DP_over_30_chr21.vcf >> test_align.vcf.gz
bgzip test_align.vcf.gz
bcftools index test_align.vcf.gz
multimerge test_align.vcf.gz -o out.vcf.gz -r GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta
#+end_src
Idem... Si on utile un genome de référence plus récent ?
#+begin_src
multimerge test_align.vcf.gz -o out.vcf.gz -r /Work/Groups/bisonex/data/genome/GRCh38.p13/genomeRef.fna --process-full=1
#+end_src
Idem. Et vcfeval ?
#+begin_src
tabix HG001_GRCh38_1_22_v4.2.1_benchmark_chr21.vcf.gz
tabix test_align.vcf.gz
rtg vcfeval -b HG001_GRCh38_1_22_v4.2.1_benchmark_chr21.vcf.gz -c test_align.vcf.gz -t /Work/Groups/bisonex/data/genome/GRCh38.p13/genomeRef.sdf -o vcfeval-test
#+end_src
Selected score threshold using: maximized F-measure
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
99.000 0 0 1 54828 0.0000 0.0000 0.0000
None 0 0 1 54828 0.0000 0.0000 0.0000
On essaie les différentes étapes
#+begin_src
multimerge test_align.vcf.gz -o out.vcf.gz -r /Work/Groups/bisonex/data/genome/GRCh38.p13/genomeRef.fna --homref-split 1 --homref-vcf-out 1 --trimalleles 1 --splitalleles 1
#+end_src
1er changement
TAA T
TAA TA
Après réflexion, c'est la référence qui n'est pas normalisée ! (left-trimmed)
******* DONE NC_000021.9 14108836 T -> C
CLOSED: [2023-03-04 Sat 11:01]
Dans le bam mais filtré par DP
****** DONE variant calling seul : meilleur score pour l'instant (77% recall, 95% precision)
CLOSED: [2023-03-04 Sat 11:01]
tests/chr21-alexis
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision
INDEL ALL 76 43 33 82 18 20 3 5 0.565789 0.709677
SNP ALL 582 448 134 530 25 57 6 1 0.769759 0.947146
METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
0.243902 0.629617 NaN NaN 1.181818 1.612903
0.107547 0.849289 3.098592 2.925926 1.530435 1.774869
******* NC_000021.9:14144627 : FN, dans le bam mais pas dans le vcf (2 reads/9)
On récupére tout le vcf: pas dedans
Dans le bam : 9/2
Idem pour le bam dans notre pipeline
- base qualité 33 sur les 2 reads
https://gatk.broadinstitute.org/hc/en-us/articles/360043491652-When-HaplotypeCaller-and-Mutect2-do-not-call-an-expected-variant
https://gatk.broadinstitute.org/hc/en-us/articles/360035891111-Expected-variant-at-a-specific-site-was-not-called
On debug
#+begin_src
cd /Work/Users/apraga/bisonex/out/NA12878_NIST7035/preprocessing/applybqsr
samtools view -b NA12878_NIST7035.bam NC_000021.9 -o NA12878_NIST7035_chr21.bam
samtools index NA12878_NIST7035_chr21.bam
#+end_src
******** --debug
#+begin_src
gatk --java-options "-Xmx3g" HaplotypeCaller --input NA12878_NIST7035_chr21.bam \
--output debug_chr1.vcf.gz \
--reference /Work/Groups/bisonex/data/genome/GRCh38.p13/genomeRef.fna \
--dbsnp /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz \
--tmp-dir . \
--max-mnp-distance 2 --debug &> lol.txt
#+end_src
Les allèles sont bien retrouvées
#+begin_quote
14:41:05.530 INFO EventMap - === Best Haplotypes ===
14:41:05.530 INFO EventMap - AATTTAATTTTCTTACCTTTCTGGGTATGTAAGTGATTTTA
14:41:05.530 INFO EventMap - > Cigar = 41M
14:41:05.530 INFO EventMap - >> Events = EventMap{}
14:41:05.530 INFO EventMap - AATTTAATTTTCTTACCTTTTTGGGTATGTAAGTGATTTTA
14:41:05.530 INFO EventMap - > Cigar = 41M
14:41:05.530 INFO EventMap - >> Events = EventMap{NC_000021.9:14144627-14144627 [C*, T],}
14:41:05.530 INFO HaplotypeCallerGenotypingEngine - Genotyping event at 14144627 with alleles = [C*, T]
#+end_quote
NB: même en filtranrt sur le chromosome 21 avec julia, haplotypecaller parcourt tous les chromosomes
******** DONE --linked-de-bruijn-graph : idem
CLOSED: [2023-02-26 Sun 17:26]
******** DONE examine sortie --bamout : non présent
CLOSED: [2023-02-26 Sun 19:53]
#+begin_src
cd test/chr21-alexis
gatk --java-options "-Xmx3g" HaplotypeCaller \
--input /Work/Users/apraga/bisonex/script/files/bam/NA12878_chr21.bam \
--output debug_chr1.vcf.gz \
--reference /Work/Groups/bisonex/data/genome/GRCh38.p13/genomeRef.fna \
--dbsnp /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz \
--tmp-dir . \
--max-mnp-distance 2 -bamout debug.bam
#+end_src
Pas de reads
#+begin_quote
If you see nothing overlapping your region, then it might not have been flagged as active, or could have failed to assemble.
#+end_quote
******** 14582339: FN mais pas de reads...
******** 14583327 idem
******** 17512551 idem
******** 17567111: difference d'haplotype
******** 17567621 pas de reads
***** TODO Comparer avec sortie du variant calling vcf donné par GIAB
****** STRT vcfeval
#+begin_src sh
wget https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/nexterarapidcapture_expandedexome_targetedregions.bed.gz
#+end_src
On renomme les chromosomes
#+begin_src
sed -i 's/^chrM/MT/' project.NIST.hc.snps.indels.vcf
sed -i 's/^chr//' project.NIST.hc.snps.indels.vcf
bcftools annotate --rename-chrs ../../genome/GRCh38.p13/chromosome_mapping.txt project.NIST.hc.snps.indels.vcf -o project.NIST.hc.snps.indels.vcf.gz
#+end_src
On compresse et index
#+begin_src
bgzip HG001_GRCh38_1_22_v4.2.1_benchmark.vcf
tabix HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
tabix project.NIST.hc.snps.indels.vcf.gz
#+end_src
Conversion chromosome pour le bed définissant les exons :
#+begin_src
sed 's:^:s/chr:;s:chrMT:chrM:;s:\s:\t\/:;s:$:\t/:' ../../genome/GRCh38.p13/chromosome_mapping.txt > pattern.sed
sed -f pattern.sed nexterarapidcapture_expandedexome_targetedregions.bed -i.bak
#+end_src
Attention aux samples qui sont différents :
#+begin_src
rtg vcfeval -b HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz -c project.NIST.hc.snps.indels.vcf.gz --bed-regions nexterarapidcapture_expandedexome_targetedregions.bed -e HG001_GRCh38_1_22_v4.2.1_benchmark.bed.gz -o lol -t ../../genome/GRCh38.p13/genomeRef.sdf/ --sample HG001,NIST7035
#+end_src
****** TODO happy
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.PrecisioN
INDEL PASS 4871 3678 1193 7036 1299 2011 208 217 0.755081 0.741493
SNP PASS 46032 41138 4894 47694 1622 4930 362 31 0.893683 0.962071
METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
0.285816 0.748225 NaN NaN 1.617499 2.524051
0.103367 0.926617 2.529552 2.412446 1.620686 1.688868
***** TODO Statistiques avec vcfeval
**** TODO Résumer résultats pour Paul
*** TODO Platinum genome
https://emea.illumina.com/platinumgenomes.html
*** TODO Séquence NA12878
Discussion avec Paul : sous-traitant ne nous donnera pas les données, il faut commander l'ADN
** Divers
*** DONE Vérifier nombre de reads fastq - bam
CLOSED: [2022-10-09 Sun 22:31]