B:BD[
18.35] → [
2.16:24592]
#+title: Bisonex
* Biblio
** Workflow
Comparaison WDL, Cromwell, nextflow
https://www.nature.com/articles/s41598-021-99288-8
Nextflow = bon compromis ?
Comparison alignement, variant caller (2021)
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04144-1
** Étapes du pipeline
*** Variant calling: Haplotype caller
https://gatk.broadinstitute.org/hc/en-us/articles/360035531412
Définis l'algorithme + image
** VCF
*** GT genotype
encoded as alleles values separated by either of ”/” or “|”, e.g. The allele values are 0 for the reference allele (what is in the reference sequence), 1 for the first allele listed in ALT, 2 for the second allele list in ALT and so on. For diploid calls examples could be 0/1 or 1|0 etc. For haploid calls, e.g. on Y, male X, mitochondrion, only one allele value should be given. All samples must have GT call information; if a call cannot be made for a sample at a given locus, ”.” must be specified for each missing allele in the GT field (for example ./. for a diploid). The meanings of the separators are:
/ : genotype unphased
| : genotype phased
** Validation
*** NA12878
**** KILL [[https://precision.fda.gov/challenges/truth/results][fdaPrecision challenge]]
Attention, génome et en hg19 donc comparaison non adaptée ...
**** TODO Best practices for the analytical validation of clinical whole-genome sequencing intended for the diagnosis of germline disease
SCHEDULED: <2023-03-13 lun.>
https://www.nature.com/articles/s41525-020-00154-9
Recommandations générale pour genome, sans données brutes
**** TODO [#A] Performance assessment of variant calling pipelines using human whole exome sequencing and simulated data
SCHEDULED: <2023-03-04 Sat>
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2928-9
1. variant calling seul
2. NA12878 + données simulées
3. exome
4. évalué via F-score
Code disponible ! https://github.com/bharani-lab/WES-Benchmarking-Pipeline_Manoj/tree/master/Script
Résultat: BWA/Novoalign_DeepVariant
Aligneurs
- BWA-MEM 0.7.16
- Bowtie2 2.2.6
- Novoalign 3.08.02
- SOAP 2.21
- MOSAIK 2.2.3
Variantcalling
- GATK HaplotypeCaller 4
- FreeBayes 1.1.0
- SAMtools mpileup 1.7
- DeepVariant r0.4
SNV
| Exome | Pipeline | TP | FP | FN | Sensitivity | Precision | F-Score | FDR |
| 1 | BWA_GATK | 23689 | 1397 | 613 | 0.975 | 0.944 | 0.959 | 0.057 |
| 2 | BWA_GATK | 23946 | 865 | 356 | 0.985 | 0.965 | 0.975 | 0.036 |
indel
| TP | FP | FN | Sensitivity | Precision | F-Score | FDR | |
| 1254 | 72 | 75 | 0.944 | 0.946 | 0.945 | 0.054 | |
| 1309 | 10 | 20 | 0.985 | 0.992 | 0.989 | 0.008 | |
Valeur brutes :
https://static-content.springer.com/esm/art%3A10.1186%2Fs12859-019-2928-9/MediaObjects/12859_2019_2928_MOESM8_ESM.pdf
Autres articles avec même comparaison en exome sur NA12878
- Hwang et al., 2015 studyi
- Highnam et al, 2015
- Cornish and Guda, 2015
Variant Type
| | SNVs & Indels | CNVs (>10Kb) | SVs | Mitochondrial variants | Pseudogenes | REs | Somatic/ mosaic | Literature/Data | Source |
| NA12878 | 100%a | 40% | 0 | 0 | 0 | 0 | 0 | Zook et al18 | NIST |
| Other NIST standard | 71% | 40% | 50% | 0 | 0 | 0 | 0 | Zook et al18 | |
| (e.g. AJ/Asian trios) | | | | | | | | | |
| Platinum | 29% | 0 | 0 | 0 | 0 | 0 | 0 | Eberle et al8 | Platinum |
| Genomes | | | | | | | | | |
| Venter/HuRef | 14% | 40% | 0 | 0 | 0 | 0 | 0 | Trost et al1 | HuRef |
**** Systematic comparison of germline variant calling pipelines cross multiple next-generation sequencers
#+begin_src bibtex
@ARTICLE{Chen2019-fp,
title = "Systematic comparison of germline variant calling pipelines
cross multiple next-generation sequencers",
author = "Chen, Jiayun and Li, Xingsong and Zhong, Hongbin and Meng,
Yuhuan and Du, Hongli",
abstract = "The development and innovation of next generation sequencing
(NGS) and the subsequent analysis tools have gain popularity in
scientific researches and clinical diagnostic applications.
Hence, a systematic comparison of the sequencing platforms and
variant calling pipelines could provide significant guidance to
NGS-based scientific and clinical genomics. In this study, we
compared the performance, concordance and operating efficiency
of 27 combinations of sequencing platforms and variant calling
pipelines, testing three variant calling pipelines-Genome
Analysis Tool Kit HaplotypeCaller, Strelka2 and
Samtools-Varscan2 for nine data sets for the NA12878 genome
sequenced by different platforms including BGISEQ500,
MGISEQ2000, HiSeq4000, NovaSeq and HiSeq Xten. For the variants
calling performance of 12 combinations in WES datasets, all
combinations displayed good performance in calling SNPs, with
their F-scores entirely higher than 0.96, and their performance
in calling INDELs varies from 0.75 to 0.91. And all 15
combinations in WGS datasets also manifested good performance,
with F-scores in calling SNPs were entirely higher than 0.975
and their performance in calling INDELs varies from 0.71 to
0.93. All of these combinations manifested high concordance in
variant identification, while the divergence of variants
identification in WGS datasets were larger than that in WES
datasets. We also down-sampled the original WES and WGS datasets
at a series of gradient coverage across multiple platforms, then
the variants calling period consumed by the three pipelines at
each coverage were counted, respectively. For the GIAB datasets
on both BGI and Illumina platforms, Strelka2 manifested its
ultra-performance in detecting accuracy and processing
efficiency compared with other two pipelines on each sequencing
platform, which was recommended in the further promotion and
application of next generation sequencing technology. The
results of our researches will provide useful and comprehensive
guidelines for personal or organizational researchers in
reliable and consistent variants identification.",
journal = "Sci. Rep.",
publisher = "Springer Science and Business Media LLC",
volume = 9,
number = 1,
pages = "9345",
month = jun,
year = 2019,
copyright = "https://creativecommons.org/licenses/by/4.0",
language = "en"
}
#+end_src
Comparaison de différents pipeline 2019
https://www.nature.com/articles/s41598-019-45835-3
Combinaison
- variant calling = GATK, Strelka2 and Samtools-Varscan2
- sur NA12878
- séquencé sur BGISEQ500, MGISEQ2000, HiSeq4000, NovaSeq and HiSeq Xten.
Conclusion: strelka2 supérieur mais biais sur NA12878 ?
Illumina > BGI pour indel, probablement car reads plus grand
#+begin_quote
For WES datasets, the BGI platforms displayed the superior performance in SNPs
calling while Illumina platforms manifested the better variants calling
performance in INDELs calling, which could be explained by their divergence in
sequencing strategy that producing different length of reads (all BGI platforms
were 100 base pair read length while all Illumina platforms were 150 base pair
read length). The read length effects, as a key factor between two platforms,
would bring alignment bias and error which are higher for short reads and
ultimately affect the variants calling especially the INDELs identification
#+end_quote
*** Débugger variant calling (haplotypecaller)
https://gatk.broadinstitute.org/hc/en-us/articles/360043491652-When-HaplotypeCaller-and-Mutect2-do-not-call-an-expected-variant
https://gatk.broadinstitute.org/hc/en-us/articles/360035891111-Expected-variant-at-a-specific-site-was-not-called
*** Hap.py
Format de sortie :
#+begin_src r
vcf_field_names(vcf, tag = "FORMAT")
#+end_src
#+RESULTS:
: FORMAT BD 1 String Decision for call (TP/FP/FN/N)
: FORMAT BK 1 String Sub-type for decision (match/mismatch type)
: FORMAT BVT 1 String High-level variant type (SNP|INDEL).
: FORMAT BLT 1 String High-level location type (het|homref|hetalt|homa
am = genotype mismatch
lm = allele/haplotype mismatch
. = non vu
**** On vérifie que am = genotype mismatch
référence = T/T
high-confidence = T/C
notre = C/C
#+begin_src sh
bcftools filter -i 'POS=19196584' /Work/Groups/bisonex/data/giab/GRCh38/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz | grep -v '#'
bcftools filter -i 'POS=19196584' ../out/NA12878_NIST7035-dbsnp/variantCalling/haplotypecaller/NA12878_NIST.vcf.gz | grep -v '#'
#+end_src
#+RESULTS:
: NC_000022.11 19196584 . T C 50 PASS platforms=5;platformnames=Illumina,PacBio,10X,Ion,Solid;datasets=5;datasetnames=HiSeqPE300x,CCS15kb_20kb,10XChromiumLR,IonExome,SolidSE75bp;callsets=7;callsetnames=HiSeqPE300xGATK,CCS15kb_20kbDV,CCS15kb_20kbGATK4,HiSeqPE300xfreebayes,10XLRGATK,IonExomeTVC,SolidSE75GATKHC;datasetsmissingcall=CGnormal;callable=CS_HiSeqPE300xGATK_callable,CS_CCS15kb_20kbDV_callable,CS_10XLRGATK_callable,CS_CCS15kb_20kbGATK4_callable,CS_HiSeqPE300xfreebayes_callable GT:PS:DP:ADALL:AD:GQ 0/1:.:781:109,123:138,150:348
: NC_000022.11 19196584 rs1061325 T C 59.32 PASS AC=2;AF=1;AN=2;DB;DP=2;ExcessHet=0;FS=0;MLEAC=1;MLEAF=0.5;MQ=60;QD=29.66;SOR=2.303 GT:AD:DP:GQ:PL 1/1:0,2:2:6:71,6,0
**** On vérifie que lm = allele/haplotype mismatch
référence = CAA/CAA
high-confidence = CA/CA
notre = C/CA
#+begin_src sh
bcftools filter -i 'POS=31277416' /Work/Groups/bisonex/data/giab/GRCh38/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz | grep -v '#'
bcftools filter -i 'POS=31277416' ../out/NA12878_NIST7035-dbsnp/variantCalling/haplotypecaller/NA12878_NIST.vcf.gz | grep -v '#'
#+end_src
#+RESULTS:
: NC_000022.11 31277416 . CA C 50 PASS platforms=3;platformnames=Illumina,PacBio,10X;datasets=3;datasetnames=HiSeqPE300x,CCS15kb_20kb,10XChromiumLR;callsets=4;callsetnames=HiSeqPE300xGATK,CCS15kb_20kbDV,10XLRGATK,HiSeqPE300xfreebayes;datasetsmissingcall=CCS15kb_20kb,CGnormal,IonExome,SolidSE75bp;callable=CS_HiSeqPE300xGATK_callable;difficultregion=GRCh38_AllHomopolymers_gt6bp_imperfectgt10bp_slop5,GRCh38_SimpleRepeat_imperfecthomopolgt10_slop5 GT:PS:DP:ADALL:AD:GQ 1/1:.:465:16,229:0,190:129
: NC_000022.11 31277416 rs57244615 CAA C,CA 389.02 PASS AC=1,1;AF=0.5,0.5;AN=2;BaseQRankSum=0.37;DB;DP=37;ExcessHet=0;FS=0;MLEAC=1,1;MLEAF=0.5,0.5;MQ=60;MQRankSum=0;QD=13.41;ReadPosRankSum=-0.651;SOR=0.572 GT:AD:DP:GQ:PL 1/2:5,10,14:29:64:406,202,313,64,0,88
* Idées
** Validation analytique
mail Yannis : données patients +/- simulées
*** Utiliser données GCAT et uploader le notre ?
https://www.nature.com/articles/ncomms7275
*** [#A] Variant calling : Genome in a bottle : NA12878 + autres
Résumé : https://www.nist.gov/programs-projects/genome-bottle
Manuscript : https://www.nature.com/articles/s41587-019-0054-x.epdf?author_access_token=E_1bL0MtBBwZr91xEsy6B9RgN0jAjWel9jnR3ZoTv0OLNnFBR7rUIZNDXq0DIKdg3w6KhBF8Rz2RWQFFc0St45kC6CZs3cDYc87HNHovbWSOubJHDa9CeJV-pN0BW_mQ0n7cM13KF2JRr_wAAn524w%3D%3D
Article comparant les variant calling : https://www.biorxiv.org/content/10.1101/2020.12.11.422022v1.full.pdf
**** KILL Tester le séquencage aussi
CLOSED: [2023-01-30 lun. 18:30]
Depuis un fastq correspondant à Illumina https://github.com/genome-in-a-bottle/giab_data_indexes
puis on compare le VCF avec les "high confidence"
On séquence directement NA12878 -> inutile pour le pipeline seul
**** TODO Tester seul la partie bioinformatique
Tout résumé ici : https://www.nist.gov/programs-projects/genome-bottle
- methode https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/analysis/Illumina_PlatinumGenomes_NA12877_NA12878_09162015/IlluminaPlatinumGenomes-user-guide.pdf
- vcf
https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/NA12878_HG001/latest/GRCh38/
NB: à quoi correspond https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/analysis/Illumina_PlatinumGenomes_NA12877_NA12878_09162015/hg38/2.0.1/NA12878/ ??
Article comparant les variant calling : https://www.biorxiv.org/content/10.1101/2020.12.11.422022v1.full.pdf
Article pour vcfeval : https://www.nature.com/articles/s41587-019-0054-x
La version 4 ajoute 273 gènes "clinically relevant" https://www.biorxiv.org/content/10.1101/2021.06.07.444885v3.full.pdf
Ajout des zones "difficiles"
https://www.biorxiv.org/content/10.1101/2020.07.24.212712v5.full.pdf
*** [#B] Pipeline : générer patient avec tous les variants retrouvés à Centogene
Comparaison de génération ADN (2019)
https://academic.oup.com/bfg/article/19/1/49/5680294
**** SimuSCop
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03665-5
https://github.com/qasimyu/simuscop
1. Crééer un modèle depuis bam + vcf : Setoprofile
2. Génerer données NGS
** Annotation :
*** Comparaison vep / snpeff et annovar
* Changement nouvelle version
- Dernière version du génome (la version "prête à l'emploi" est seulement GRCh38 sans les version patchées)
* Notes
** Nextflow
*** afficher les résultats d'un process/workflow
#+begin_src
lol.out.view()
#+end_src
Attention, ne fonctionne pas si plusieurs sortie:
#+begin_src
lol.out[0].view()
#+end_src
ou si /a/ est le nom de la sortie
#+begin_src
lol.out.a.view()
#+end_src
** Quelle version du génome ?
Il y a 2 notations pour les chrosome: Refseq (NC_0001) ou chr1, chr2...
dbSNP utilise Refseq
pour le fasta, 2 solutions
- refseq : "https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/${genome}_latest/refseq_identifiers/${fna}.gz"
-> nécessite d'indexer le fichier (long !)
- chromosome https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/
-> nécessite d'annoter les chromosomes pour corriger (avec le fichier gff)
On utilise la version chromosome donc on annote dbSNP (à faire)
** Performances
Ordinateur de Carine (WSL2) : 4h dont 1h15 alignement (parallélisé) et 1h15 haplotypecaller (séquentiel)
** Chromosomes NC, NT, NW
Correspondance :
https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&chromInfoPage=
Signification
https://genome.ucsc.edu/FAQ/FAQdownloads.html#downloadAlt
- alt = séquences alternatives (utilisables)
- fix = patch (correction ou amélioration)
- random = séquence connue sur un chromosome mais non encore utilisée
** Pipelines prêt-à-l’emploi nextflow
Problème : nécessite singularity ou docker (ou conda)
Potentiellement utilisable avec nix...
* Données
** TODO Remplacer bam par fastq sur mesocentre
Commande
*** STRT Supprimer les fastq non "paired"
nushell
Liste des fastq avec "paired-end" manquant
#+begin_src nu
ls **/*.fastq.gz | get name | path basename | split column "_" | get column1 | uniq -u | save single.txt
#+end_src
#+RESULTS:
: 62907927
: 62907970
: 62899606
: 62911287
: 62913201
: 62914084
: 62915905
: 62921595
: 62923065
: 62925220
: 62926503
: 62926502
: 62926500
: 62926499
: 62926498
: 62931719
: 62943423
: 62943400
: 62948290
: 62949205
: 62949206
: 62949118
: 62951284
: 62960792
: 62960785
: 62960787
: 62960617
: 62962561
: 62962692
: 62967473
: 62972194
: 62979102
On vérifie
#+begin_src nu
open single.txt | lines | each {|e| ls $"fastq/*_($in)/*" | get 0 }
open single.txt | lines | each {|e| ls $"fastq/*_($in)/*" | get 0.name } | path basename | split column "_" | get column1 | uniq -c
#+end_src
On met tous dans un dossier (pas de suppression )
#+begin_src
open single.txt | lines | each {|e| ls $"fastq/*_($in)/*" | get 0 } | each {|e| ^mv $e.name bad-fastq/}
#+end_src
On vérifie que les dossiier sont videsj
open single.txt | lines | each {|e| ls $"fastq/*_($in)" | get 0.name } | ^ls -l $in
Puis on supprime
open single.txt | lines | each {|e| ls $"fastq/*_($in)" | get 0.name } | ^rm -r $in
*** TODO Supprimer bam qui ont des fastq
On liste les identifiants des fastq et bam dans un tableau avec leur type :
#+begin_src
let fastq = (ls fastq/*/*.fastq.gz | get name | parse "{dir}/{full_id}/{id}_{R}_001.fastq.gz" | select dir id | uniq )
let bam = (ls bam/*/*.bam | get name | parse "{dir}/{full_id}/{id}_{S}.bqrt.bam" | select dir id)
#+end_src
On groupe les résultat par identifiant (résultats = liste de records qui doit être convertie en table)
et on trie ceux qui n'ont qu'un fastq ou un bam
#+begin_src
let single = ( $bam | append $fastq | group-by id | transpose id files | get files | where {|x| ($x | length) == 1})
#+end_src
On convertit en table et on récupère seulement les bam
#+begin_src
$single | reduce {|it, acc| $acc | append $it} | where dir == bam | get id | each {|e| ^ls $"bam/*_($e)/*.bam"}
#+end_src
#+RESULTS:
: bam/2100656174_62913201/62913201_S52.bqrt.bam
: bam/2100733271_62925220/62925220_S33.bqrt.bam
: bam/2100738763_62926502/62926502_S108.bqrt.bam
: bam/2100746726_62926498/62926498_S105.bqrt.bam
: bam/2100787936_62931955/62931955_S4.bqrt.bam
: bam/2200066374_62948290/62948290_S130.bqrt.bam
: bam/2200074722_62948298/62948298_S131.bqrt.bam
: bam/2200074990_62948306/62948306_S218.bqrt.bam
: bam/2200214581_62967331/62967331_S267.bqrt.bam
: bam/2200225399_62972187/62972187_S85.bqrt.bam
: bam/2200293962_62979117/62979117_S63.bqrt.bam
: bam/2200423985_62999352/62999352_S1.bqrt.bam
: bam/2200495073_63010427/63010427_S20.bqrt.bam
: bam/2200511274_63012586/63012586_S114.bqrt.bam
: bam/2200669188_63036688/63036688_S150.bqrt.bam
* Nouveau workflow
** TODO Bases de données
*** KILL Nix pour télécharger les données brutes
**** Conclusion
Non viable sur cluster car en dehors de /nix/store
On peut utiliser des symlink mais trop compliqué
**** KILL Axel au lieu de curl pour gérer les timeout?
CLOSED: [2022-08-19 Fri 15:18]
*** DONE Tester patch de @pennae pour gros fichiers
SCHEDULED: <2022-08-19 Fri>
*** STRT Télécharger les données avec nextflow
**** DONE Genome de référence
**** DONE dbSNP
**** TODO VEP 20G
Ajout vérification checksum -> à vérifier
**** TODO transcriptome (spip)
Rajouter checksum manuel
**** KILL Refseq
**** STRT OMIM
codé, à vérifier
**** TODO ACMG incidental
*** HOLD Processing bases de données
**** DONE dbSNP common
**** DONE Seulement les ID dans dbSNP common !
CLOSED: [2022-11-19 Sat 21:42]
172G au lieu de 253M...
**** HOLD common dbSNP not clinvar patho
***** DONE Conclusion partielle
CLOSED: [2022-12-12 Mon 22:25]
- vcfeval : prometteur mais n'arrive pas à traiter toutes les régions
- isec : trop de problèmes avec
- classif clinvar directement dans dbSNP: le plus simple
Et ça permet de rattraper quelques erreurs dans le script d'Alexis
***** KILL Utiliser directement le numéro dbSNP dans clinvar ? Non
CLOSED: [2022-11-20 Sun 19:51]
Ex: chr20
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f 'rs%INFO/RS \n' -i 'INFO/RS != "." & INFO/CLNSIG="Pathogenic"' clinvar_chr20.vcf.gz | sort > ID_clinvar_patho.txt
bcftools query -f '%ID\n' dbSNP_common_chr20.vcf.gz | sort > ID_of_common_snp.txt
comm -23 ID_of_common_snp.txt ID_clinvar_patho.txt > ID_of_common_snp_not_clinvar_patho.txt
wc -l ID_of_common_snp_not_clinvar_patho.txt
# sort ID
#+end_src
#+RESULTS:
: 518846 ID_of_common_snp_not_clinvar_patho.txt
Version d'alexis
#+begin_src sh :dir ~/code/bisonex/test_isec
snp=dbSNP_common_chr20.vcf.gz
clinvar=clinvar_chr20_notremapped.vcf.gz
python ../script/pythonScript/clinvar_sbSNP.py \
--clinvar $clinvar \
--chrm_name_table ../database/RefSeq/refseq_to_number_only_consensual.txt \
--dbSNP $snp --output prod.txt
wc -l prod.txt
zgrep '^NC' dbSNP_common_chr20.vcf.gz | wc -l
#+end_src
#+RESULTS:
| 518832 | prod.txt |
| 518846 | |
***** KILL classification clinvar codée dbSNP ?
CLOSED: [2022-12-04 Sun 14:38]
Sur le chromosome 20
*Attention* CLNSIG a plusieurs champs (séparé par une virgule)
On y accède avec INFO/CLNSIG[*]
Ensuite, chaque item peut avoir plusieurs haploïdie (séparé par un |). IL faut donc utiliser une regexp
NB: *ne pas mettre la condition* dans une variable !!
Pour avoir les clinvar patho, on veut 5 mais pas 255 (= autre) pour la classification !`
Il faut également les likely patho et conflicting
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f '%INFO/CLNSIG\n' dbSNP_common_chr20.vcf.gz -i \
'INFO/CLNSIG[*]~"^5|" | INFO/CLNSIG[*]=="5" | INFO/CLNSIG[*]~"|5" | INFO/CLNSIG[*]~"^4|" | INFO/CLNSIG[*]=="4" | INFO/CLNSIG[*]~"|4" | INFO/CLNSIG[*]~"^12|" | INFO/CLNSIG[*]=="12" | INFO/CLNSIG[*]~"|12"' | sort
#+end_src
#+RESULTS:
| . | . | 12 | | | | | | | | |
| . | 12 | 0 | 2 | | | | | | | |
| 2 | 3 | 2 | 2 | 2 | 5 | . | | | | |
| . | 2 | 3 | 2 | 2 | 4 | | | | | |
| . | . | 3 | 12 | 3 | | | | | | |
| . | 5 | 2 | . | | | | | | | |
| . | . | . | 5 | 2 | 2 | | | | | |
| . | 9 | 9 | 9 | 5 | 5 | 2 | 3 | 2 | 3 | 2 |
Si on les exclut :
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f '%ID\n' dbSNP_common_chr20.vcf.gz -e \
'INFO/CLNSIG[*]~"^5|" | INFO/CLNSIG[*]=="5" | INFO/CLNSIG[*]~"|5" | INFO/CLNSIG[*]~"4" | INFO/CLNSIG[*]~"12"' | sort | uniq > common-notpatho.txt
#+end_src
#+RESULTS:
#+begin_src sh :dir ~/code/bisonex/test_isec
snp=dbSNP_common_chr20.vcf.gz
clinvar=clinvar_chr20_notremapped.vcf.gz
python ../script/pythonScript/clinvar_sbSNP.py \
--clinvar $clinvar \
--chrm_name_table ../database/RefSeq/refseq_to_number_only_consensual.txt \
--dbSNP $snp --output tmp.txt
sort tmp.txt | uniq > common-notpatho-alexis.txt
wc -l common-notpatho-alexis.txt
#+end_src
#+RESULTS:
: 518832 common-notpatho-alexis.txt
On en a 6 de plus que la version d'Alexis mais quelques différences
Ceux d'Alexis qui manquent:
#+begin_src sh :dir ~/code/bisonex/test_isec
comm -23 common-notpatho-alexis.txt common-notpatho.txt > alexis-only.txt
cat alexis-only.txt
#+end_src
#+RESULTS:
| rs1064039 |
| rs3833341 |
| rs73598374 |
On les teste dans clinvar et dbSNP
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'ID=@alexis-only.txt' dbSNP_common_chr20.vcf.gz
bcftools query -f '%POS\n' -i 'ID=@alexis-only.txt' dbSNP_common_chr20.vcf.gz > alexis-only-pos.txt
while read -r line; do
bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'POS='$line clinvar_chr20.vcf.gz
done < alexis-only-pos.txt
# bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'POS=23637790' clinvar_chr20.vcf.gz
#+end_src
#+RESULTS:
| 764018 | A | ACAGGTCAAT,ACAGGT | .,5 | 2,. | |
| 23637790 | C | G,T | .,.,12 | | |
| 44651586 | C | A,G,T | .,.,.,5 | 2 | 2 |
| 764018 | A | ACAGGTCAAT | Benign | | |
| 23637790 | C | T | Benign | | |
| 44651586 | C | T | Benign | | |
On a donc une discordance entre clinvar et dbSNP.
On dirait qu'ils ont mal fait l'intersection avec clinvar.
Par exemple https://www.ncbi.nlm.nih.gov/snp/rs3833341#clinical_significance
Tu as l'impression qu'il y a un 1 cli
#+title: Bisonex
* Biblio :biblio:
** Workflow
Comparaison WDL, Cromwell, nextflow
https://www.nature.com/articles/s41598-021-99288-8
Nextflow = bon compromis ?
Comparison alignement, variant caller (2021)
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04144-1
** Étapes du pipeline
*** Variant calling: Haplotype caller
https://gatk.broadinstitute.org/hc/en-us/articles/360035531412
Définis l'algorithme + image
** VCF
*** GT genotype
encoded as alleles values separated by either of ”/” or “|”, e.g. The allele values are 0 for the reference allele (what is in the reference sequence), 1 for the first allele listed in ALT, 2 for the second allele list in ALT and so on. For diploid calls examples could be 0/1 or 1|0 etc. For haploid calls, e.g. on Y, male X, mitochondrion, only one allele value should be given. All samples must have GT call information; if a call cannot be made for a sample at a given locus, ”.” must be specified for each missing allele in the GT field (for example ./. for a diploid). The meanings of the separators are:
/ : genotype unphased
| : genotype phased
** Validation
*** NA12878
**** KILL [[https://precision.fda.gov/challenges/truth/results][fdaPrecision challenge]]
Attention, génome et en hg19 donc comparaison non adaptée ...
**** TODO Best practices for the analytical validation of clinical whole-genome sequencing intended for the diagnosis of germline disease
SCHEDULED: <2023-03-13 lun.>
https://www.nature.com/articles/s41525-020-00154-9
Recommandations générale pour genome, sans données brutes
**** TODO [#A] Performance assessment of variant calling pipelines using human whole exome sequencing and simulated data
SCHEDULED: <2023-03-04 Sat>
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2928-9
1. variant calling seul
2. NA12878 + données simulées
3. exome
4. évalué via F-score
Code disponible ! https://github.com/bharani-lab/WES-Benchmarking-Pipeline_Manoj/tree/master/Script
Résultat: BWA/Novoalign_DeepVariant
Aligneurs
- BWA-MEM 0.7.16
- Bowtie2 2.2.6
- Novoalign 3.08.02
- SOAP 2.21
- MOSAIK 2.2.3
Variantcalling
- GATK HaplotypeCaller 4
- FreeBayes 1.1.0
- SAMtools mpileup 1.7
- DeepVariant r0.4
SNV
| Exome | Pipeline | TP | FP | FN | Sensitivity | Precision | F-Score | FDR |
| 1 | BWA_GATK | 23689 | 1397 | 613 | 0.975 | 0.944 | 0.959 | 0.057 |
| 2 | BWA_GATK | 23946 | 865 | 356 | 0.985 | 0.965 | 0.975 | 0.036 |
indel
| TP | FP | FN | Sensitivity | Precision | F-Score | FDR | |
| 1254 | 72 | 75 | 0.944 | 0.946 | 0.945 | 0.054 | |
| 1309 | 10 | 20 | 0.985 | 0.992 | 0.989 | 0.008 | |
Valeur brutes :
https://static-content.springer.com/esm/art%3A10.1186%2Fs12859-019-2928-9/MediaObjects/12859_2019_2928_MOESM8_ESM.pdf
Autres articles avec même comparaison en exome sur NA12878
- Hwang et al., 2015 studyi
- Highnam et al, 2015
- Cornish and Guda, 2015
Variant Type
| | SNVs & Indels | CNVs (>10Kb) | SVs | Mitochondrial variants | Pseudogenes | REs | Somatic/ mosaic | Literature/Data | Source |
| NA12878 | 100%a | 40% | 0 | 0 | 0 | 0 | 0 | Zook et al18 | NIST |
| Other NIST standard | 71% | 40% | 50% | 0 | 0 | 0 | 0 | Zook et al18 | |
| (e.g. AJ/Asian trios) | | | | | | | | | |
| Platinum | 29% | 0 | 0 | 0 | 0 | 0 | 0 | Eberle et al8 | Platinum |
| Genomes | | | | | | | | | |
| Venter/HuRef | 14% | 40% | 0 | 0 | 0 | 0 | 0 | Trost et al1 | HuRef |
**** Systematic comparison of germline variant calling pipelines cross multiple next-generation sequencers
#+begin_src bibtex
@ARTICLE{Chen2019-fp,
title = "Systematic comparison of germline variant calling pipelines
cross multiple next-generation sequencers",
author = "Chen, Jiayun and Li, Xingsong and Zhong, Hongbin and Meng,
Yuhuan and Du, Hongli",
abstract = "The development and innovation of next generation sequencing
(NGS) and the subsequent analysis tools have gain popularity in
scientific researches and clinical diagnostic applications.
Hence, a systematic comparison of the sequencing platforms and
variant calling pipelines could provide significant guidance to
NGS-based scientific and clinical genomics. In this study, we
compared the performance, concordance and operating efficiency
of 27 combinations of sequencing platforms and variant calling
pipelines, testing three variant calling pipelines-Genome
Analysis Tool Kit HaplotypeCaller, Strelka2 and
Samtools-Varscan2 for nine data sets for the NA12878 genome
sequenced by different platforms including BGISEQ500,
MGISEQ2000, HiSeq4000, NovaSeq and HiSeq Xten. For the variants
calling performance of 12 combinations in WES datasets, all
combinations displayed good performance in calling SNPs, with
their F-scores entirely higher than 0.96, and their performance
in calling INDELs varies from 0.75 to 0.91. And all 15
combinations in WGS datasets also manifested good performance,
with F-scores in calling SNPs were entirely higher than 0.975
and their performance in calling INDELs varies from 0.71 to
0.93. All of these combinations manifested high concordance in
variant identification, while the divergence of variants
identification in WGS datasets were larger than that in WES
datasets. We also down-sampled the original WES and WGS datasets
at a series of gradient coverage across multiple platforms, then
the variants calling period consumed by the three pipelines at
each coverage were counted, respectively. For the GIAB datasets
on both BGI and Illumina platforms, Strelka2 manifested its
ultra-performance in detecting accuracy and processing
efficiency compared with other two pipelines on each sequencing
platform, which was recommended in the further promotion and
application of next generation sequencing technology. The
results of our researches will provide useful and comprehensive
guidelines for personal or organizational researchers in
reliable and consistent variants identification.",
journal = "Sci. Rep.",
publisher = "Springer Science and Business Media LLC",
volume = 9,
number = 1,
pages = "9345",
month = jun,
year = 2019,
copyright = "https://creativecommons.org/licenses/by/4.0",
language = "en"
}
#+end_src
Comparaison de différents pipeline 2019
https://www.nature.com/articles/s41598-019-45835-3
Combinaison
- variant calling = GATK, Strelka2 and Samtools-Varscan2
- sur NA12878
- séquencé sur BGISEQ500, MGISEQ2000, HiSeq4000, NovaSeq and HiSeq Xten.
Conclusion: strelka2 supérieur mais biais sur NA12878 ?
Illumina > BGI pour indel, probablement car reads plus grand
#+begin_quote
For WES datasets, the BGI platforms displayed the superior performance in SNPs
calling while Illumina platforms manifested the better variants calling
performance in INDELs calling, which could be explained by their divergence in
sequencing strategy that producing different length of reads (all BGI platforms
were 100 base pair read length while all Illumina platforms were 150 base pair
read length). The read length effects, as a key factor between two platforms,
would bring alignment bias and error which are higher for short reads and
ultimately affect the variants calling especially the INDELs identification
#+end_quote
*** Débugger variant calling (haplotypecaller)
https://gatk.broadinstitute.org/hc/en-us/articles/360043491652-When-HaplotypeCaller-and-Mutect2-do-not-call-an-expected-variant
https://gatk.broadinstitute.org/hc/en-us/articles/360035891111-Expected-variant-at-a-specific-site-was-not-called
*** Hap.py
Format de sortie :
#+begin_src r
vcf_field_names(vcf, tag = "FORMAT")
#+end_src
#+RESULTS:
: FORMAT BD 1 String Decision for call (TP/FP/FN/N)
: FORMAT BK 1 String Sub-type for decision (match/mismatch type)
: FORMAT BVT 1 String High-level variant type (SNP|INDEL).
: FORMAT BLT 1 String High-level location type (het|homref|hetalt|homa
am = genotype mismatch
lm = allele/haplotype mismatch
. = non vu
**** On vérifie que am = genotype mismatch
référence = T/T
high-confidence = T/C
notre = C/C
#+begin_src sh
bcftools filter -i 'POS=19196584' /Work/Groups/bisonex/data/giab/GRCh38/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz | grep -v '#'
bcftools filter -i 'POS=19196584' ../out/NA12878_NIST7035-dbsnp/variantCalling/haplotypecaller/NA12878_NIST.vcf.gz | grep -v '#'
#+end_src
#+RESULTS:
: NC_000022.11 19196584 . T C 50 PASS platforms=5;platformnames=Illumina,PacBio,10X,Ion,Solid;datasets=5;datasetnames=HiSeqPE300x,CCS15kb_20kb,10XChromiumLR,IonExome,SolidSE75bp;callsets=7;callsetnames=HiSeqPE300xGATK,CCS15kb_20kbDV,CCS15kb_20kbGATK4,HiSeqPE300xfreebayes,10XLRGATK,IonExomeTVC,SolidSE75GATKHC;datasetsmissingcall=CGnormal;callable=CS_HiSeqPE300xGATK_callable,CS_CCS15kb_20kbDV_callable,CS_10XLRGATK_callable,CS_CCS15kb_20kbGATK4_callable,CS_HiSeqPE300xfreebayes_callable GT:PS:DP:ADALL:AD:GQ 0/1:.:781:109,123:138,150:348
: NC_000022.11 19196584 rs1061325 T C 59.32 PASS AC=2;AF=1;AN=2;DB;DP=2;ExcessHet=0;FS=0;MLEAC=1;MLEAF=0.5;MQ=60;QD=29.66;SOR=2.303 GT:AD:DP:GQ:PL 1/1:0,2:2:6:71,6,0
**** On vérifie que lm = allele/haplotype mismatch
référence = CAA/CAA
high-confidence = CA/CA
notre = C/CA
#+begin_src sh
bcftools filter -i 'POS=31277416' /Work/Groups/bisonex/data/giab/GRCh38/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz | grep -v '#'
bcftools filter -i 'POS=31277416' ../out/NA12878_NIST7035-dbsnp/variantCalling/haplotypecaller/NA12878_NIST.vcf.gz | grep -v '#'
#+end_src
#+RESULTS:
: NC_000022.11 31277416 . CA C 50 PASS platforms=3;platformnames=Illumina,PacBio,10X;datasets=3;datasetnames=HiSeqPE300x,CCS15kb_20kb,10XChromiumLR;callsets=4;callsetnames=HiSeqPE300xGATK,CCS15kb_20kbDV,10XLRGATK,HiSeqPE300xfreebayes;datasetsmissingcall=CCS15kb_20kb,CGnormal,IonExome,SolidSE75bp;callable=CS_HiSeqPE300xGATK_callable;difficultregion=GRCh38_AllHomopolymers_gt6bp_imperfectgt10bp_slop5,GRCh38_SimpleRepeat_imperfecthomopolgt10_slop5 GT:PS:DP:ADALL:AD:GQ 1/1:.:465:16,229:0,190:129
: NC_000022.11 31277416 rs57244615 CAA C,CA 389.02 PASS AC=1,1;AF=0.5,0.5;AN=2;BaseQRankSum=0.37;DB;DP=37;ExcessHet=0;FS=0;MLEAC=1,1;MLEAF=0.5,0.5;MQ=60;MQRankSum=0;QD=13.41;ReadPosRankSum=-0.651;SOR=0.572 GT:AD:DP:GQ:PL 1/2:5,10,14:29:64:406,202,313,64,0,88
* Idées
** Validation analytique
mail Yannis : données patients +/- simulées
*** Utiliser données GCAT et uploader le notre ?
https://www.nature.com/articles/ncomms7275
*** [#A] Variant calling : Genome in a bottle : NA12878 + autres
Résumé : https://www.nist.gov/programs-projects/genome-bottle
Manuscript : https://www.nature.com/articles/s41587-019-0054-x.epdf?author_access_token=E_1bL0MtBBwZr91xEsy6B9RgN0jAjWel9jnR3ZoTv0OLNnFBR7rUIZNDXq0DIKdg3w6KhBF8Rz2RWQFFc0St45kC6CZs3cDYc87HNHovbWSOubJHDa9CeJV-pN0BW_mQ0n7cM13KF2JRr_wAAn524w%3D%3D
Article comparant les variant calling : https://www.biorxiv.org/content/10.1101/2020.12.11.422022v1.full.pdf
**** KILL Tester le séquencage aussi
CLOSED: [2023-01-30 lun. 18:30]
Depuis un fastq correspondant à Illumina https://github.com/genome-in-a-bottle/giab_data_indexes
puis on compare le VCF avec les "high confidence"
On séquence directement NA12878 -> inutile pour le pipeline seul
**** TODO Tester seul la partie bioinformatique
Tout résumé ici : https://www.nist.gov/programs-projects/genome-bottle
- methode https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/analysis/Illumina_PlatinumGenomes_NA12877_NA12878_09162015/IlluminaPlatinumGenomes-user-guide.pdf
- vcf
https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/NA12878_HG001/latest/GRCh38/
NB: à quoi correspond https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/analysis/Illumina_PlatinumGenomes_NA12877_NA12878_09162015/hg38/2.0.1/NA12878/ ??
Article comparant les variant calling : https://www.biorxiv.org/content/10.1101/2020.12.11.422022v1.full.pdf
Article pour vcfeval : https://www.nature.com/articles/s41587-019-0054-x
La version 4 ajoute 273 gènes "clinically relevant" https://www.biorxiv.org/content/10.1101/2021.06.07.444885v3.full.pdf
Ajout des zones "difficiles"
https://www.biorxiv.org/content/10.1101/2020.07.24.212712v5.full.pdf
*** [#B] Pipeline : générer patient avec tous les variants retrouvés à Centogene
Comparaison de génération ADN (2019)
https://academic.oup.com/bfg/article/19/1/49/5680294
**** SimuSCop
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03665-5
https://github.com/qasimyu/simuscop
1. Crééer un modèle depuis bam + vcf : Setoprofile
2. Génerer données NGS
** Annotation :
*** Comparaison vep / snpeff et annovar
* Changement nouvelle version
- Dernière version du génome (la version "prête à l'emploi" est seulement GRCh38 sans les version patchées)
* Notes
** Nextflow
*** afficher les résultats d'un process/workflow
#+begin_src
lol.out.view()
#+end_src
Attention, ne fonctionne pas si plusieurs sortie:
#+begin_src
lol.out[0].view()
#+end_src
ou si /a/ est le nom de la sortie
#+begin_src
lol.out.a.view()
#+end_src
** Quelle version du génome ?
Il y a 2 notations pour les chrosome: Refseq (NC_0001) ou chr1, chr2...
dbSNP utilise Refseq
pour le fasta, 2 solutions
- refseq : "https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/${genome}_latest/refseq_identifiers/${fna}.gz"
-> nécessite d'indexer le fichier (long !)
- chromosome https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/
-> nécessite d'annoter les chromosomes pour corriger (avec le fichier gff)
On utilise la version chromosome donc on annote dbSNP (à faire)
** Performances
Ordinateur de Carine (WSL2) : 4h dont 1h15 alignement (parallélisé) et 1h15 haplotypecaller (séquentiel)
** Chromosomes NC, NT, NW
Correspondance :
https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&chromInfoPage=
Signification
https://genome.ucsc.edu/FAQ/FAQdownloads.html#downloadAlt
- alt = séquences alternatives (utilisables)
- fix = patch (correction ou amélioration)
- random = séquence connue sur un chromosome mais non encore utilisée
** Pipelines prêt-à-l’emploi nextflow
Problème : nécessite singularity ou docker (ou conda)
Potentiellement utilisable avec nix...
* Données :data:
** TODO Remplacer bam par fastq sur mesocentre
Commande
*** STRT Supprimer les fastq non "paired"
nushell
Liste des fastq avec "paired-end" manquant
#+begin_src nu
ls **/*.fastq.gz | get name | path basename | split column "_" | get column1 | uniq -u | save single.txt
#+end_src
#+RESULTS:
: 62907927
: 62907970
: 62899606
: 62911287
: 62913201
: 62914084
: 62915905
: 62921595
: 62923065
: 62925220
: 62926503
: 62926502
: 62926500
: 62926499
: 62926498
: 62931719
: 62943423
: 62943400
: 62948290
: 62949205
: 62949206
: 62949118
: 62951284
: 62960792
: 62960785
: 62960787
: 62960617
: 62962561
: 62962692
: 62967473
: 62972194
: 62979102
On vérifie
#+begin_src nu
open single.txt | lines | each {|e| ls $"fastq/*_($in)/*" | get 0 }
open single.txt | lines | each {|e| ls $"fastq/*_($in)/*" | get 0.name } | path basename | split column "_" | get column1 | uniq -c
#+end_src
On met tous dans un dossier (pas de suppression )
#+begin_src
open single.txt | lines | each {|e| ls $"fastq/*_($in)/*" | get 0 } | each {|e| ^mv $e.name bad-fastq/}
#+end_src
On vérifie que les dossiier sont videsj
open single.txt | lines | each {|e| ls $"fastq/*_($in)" | get 0.name } | ^ls -l $in
Puis on supprime
open single.txt | lines | each {|e| ls $"fastq/*_($in)" | get 0.name } | ^rm -r $in
*** TODO Supprimer bam qui ont des fastq
On liste les identifiants des fastq et bam dans un tableau avec leur type :
#+begin_src
let fastq = (ls fastq/*/*.fastq.gz | get name | parse "{dir}/{full_id}/{id}_{R}_001.fastq.gz" | select dir id | uniq )
let bam = (ls bam/*/*.bam | get name | parse "{dir}/{full_id}/{id}_{S}.bqrt.bam" | select dir id)
#+end_src
On groupe les résultat par identifiant (résultats = liste de records qui doit être convertie en table)
et on trie ceux qui n'ont qu'un fastq ou un bam
#+begin_src
let single = ( $bam | append $fastq | group-by id | transpose id files | get files | where {|x| ($x | length) == 1})
#+end_src
On convertit en table et on récupère seulement les bam
#+begin_src
$single | reduce {|it, acc| $acc | append $it} | where dir == bam | get id | each {|e| ^ls $"bam/*_($e)/*.bam"}
#+end_src
#+RESULTS:
: bam/2100656174_62913201/62913201_S52.bqrt.bam
: bam/2100733271_62925220/62925220_S33.bqrt.bam
: bam/2100738763_62926502/62926502_S108.bqrt.bam
: bam/2100746726_62926498/62926498_S105.bqrt.bam
: bam/2100787936_62931955/62931955_S4.bqrt.bam
: bam/2200066374_62948290/62948290_S130.bqrt.bam
: bam/2200074722_62948298/62948298_S131.bqrt.bam
: bam/2200074990_62948306/62948306_S218.bqrt.bam
: bam/2200214581_62967331/62967331_S267.bqrt.bam
: bam/2200225399_62972187/62972187_S85.bqrt.bam
: bam/2200293962_62979117/62979117_S63.bqrt.bam
: bam/2200423985_62999352/62999352_S1.bqrt.bam
: bam/2200495073_63010427/63010427_S20.bqrt.bam
: bam/2200511274_63012586/63012586_S114.bqrt.bam
: bam/2200669188_63036688/63036688_S150.bqrt.bam
* Nouveau workflow :workflow:
** TODO Bases de données
*** KILL Nix pour télécharger les données brutes
**** Conclusion
Non viable sur cluster car en dehors de /nix/store
On peut utiliser des symlink mais trop compliqué
**** KILL Axel au lieu de curl pour gérer les timeout?
CLOSED: [2022-08-19 Fri 15:18]
*** DONE Tester patch de @pennae pour gros fichiers
SCHEDULED: <2022-08-19 Fri>
*** STRT Télécharger les données avec nextflow
**** DONE Genome de référence
**** DONE dbSNP
**** TODO VEP 20G
Ajout vérification checksum -> à vérifier
**** TODO transcriptome (spip)
Rajouter checksum manuel
**** KILL Refseq
**** STRT OMIM
codé, à vérifier
**** TODO ACMG incidental
*** HOLD Processing bases de données
**** DONE dbSNP common
**** DONE Seulement les ID dans dbSNP common !
CLOSED: [2022-11-19 Sat 21:42]
172G au lieu de 253M...
**** HOLD common dbSNP not clinvar patho
***** DONE Conclusion partielle
CLOSED: [2022-12-12 Mon 22:25]
- vcfeval : prometteur mais n'arrive pas à traiter toutes les régions
- isec : trop de problèmes avec
- classif clinvar directement dans dbSNP: le plus simple
Et ça permet de rattraper quelques erreurs dans le script d'Alexis
***** KILL Utiliser directement le numéro dbSNP dans clinvar ? Non
CLOSED: [2022-11-20 Sun 19:51]
Ex: chr20
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f 'rs%INFO/RS \n' -i 'INFO/RS != "." & INFO/CLNSIG="Pathogenic"' clinvar_chr20.vcf.gz | sort > ID_clinvar_patho.txt
bcftools query -f '%ID\n' dbSNP_common_chr20.vcf.gz | sort > ID_of_common_snp.txt
comm -23 ID_of_common_snp.txt ID_clinvar_patho.txt > ID_of_common_snp_not_clinvar_patho.txt
wc -l ID_of_common_snp_not_clinvar_patho.txt
# sort ID
#+end_src
#+RESULTS:
: 518846 ID_of_common_snp_not_clinvar_patho.txt
Version d'alexis
#+begin_src sh :dir ~/code/bisonex/test_isec
snp=dbSNP_common_chr20.vcf.gz
clinvar=clinvar_chr20_notremapped.vcf.gz
python ../script/pythonScript/clinvar_sbSNP.py \
--clinvar $clinvar \
--chrm_name_table ../database/RefSeq/refseq_to_number_only_consensual.txt \
--dbSNP $snp --output prod.txt
wc -l prod.txt
zgrep '^NC' dbSNP_common_chr20.vcf.gz | wc -l
#+end_src
#+RESULTS:
| 518832 | prod.txt |
| 518846 | |
***** KILL classification clinvar codée dbSNP ?
CLOSED: [2022-12-04 Sun 14:38]
Sur le chromosome 20
*Attention* CLNSIG a plusieurs champs (séparé par une virgule)
On y accède avec INFO/CLNSIG[*]
Ensuite, chaque item peut avoir plusieurs haploïdie (séparé par un |). IL faut donc utiliser une regexp
NB: *ne pas mettre la condition* dans une variable !!
Pour avoir les clinvar patho, on veut 5 mais pas 255 (= autre) pour la classification !`
Il faut également les likely patho et conflicting
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f '%INFO/CLNSIG\n' dbSNP_common_chr20.vcf.gz -i \
'INFO/CLNSIG[*]~"^5|" | INFO/CLNSIG[*]=="5" | INFO/CLNSIG[*]~"|5" | INFO/CLNSIG[*]~"^4|" | INFO/CLNSIG[*]=="4" | INFO/CLNSIG[*]~"|4" | INFO/CLNSIG[*]~"^12|" | INFO/CLNSIG[*]=="12" | INFO/CLNSIG[*]~"|12"' | sort
#+end_src
#+RESULTS:
| . | . | 12 | | | | | | | | |
| . | 12 | 0 | 2 | | | | | | | |
| 2 | 3 | 2 | 2 | 2 | 5 | . | | | | |
| . | 2 | 3 | 2 | 2 | 4 | | | | | |
| . | . | 3 | 12 | 3 | | | | | | |
| . | 5 | 2 | . | | | | | | | |
| . | . | . | 5 | 2 | 2 | | | | | |
| . | 9 | 9 | 9 | 5 | 5 | 2 | 3 | 2 | 3 | 2 |
Si on les exclut :
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f '%ID\n' dbSNP_common_chr20.vcf.gz -e \
'INFO/CLNSIG[*]~"^5|" | INFO/CLNSIG[*]=="5" | INFO/CLNSIG[*]~"|5" | INFO/CLNSIG[*]~"4" | INFO/CLNSIG[*]~"12"' | sort | uniq > common-notpatho.txt
#+end_src
#+RESULTS:
#+begin_src sh :dir ~/code/bisonex/test_isec
snp=dbSNP_common_chr20.vcf.gz
clinvar=clinvar_chr20_notremapped.vcf.gz
python ../script/pythonScript/clinvar_sbSNP.py \
--clinvar $clinvar \
--chrm_name_table ../database/RefSeq/refseq_to_number_only_consensual.txt \
--dbSNP $snp --output tmp.txt
sort tmp.txt | uniq > common-notpatho-alexis.txt
wc -l common-notpatho-alexis.txt
#+end_src
#+RESULTS:
: 518832 common-notpatho-alexis.txt
On en a 6 de plus que la version d'Alexis mais quelques différences
Ceux d'Alexis qui manquent:
#+begin_src sh :dir ~/code/bisonex/test_isec
comm -23 common-notpatho-alexis.txt common-notpatho.txt > alexis-only.txt
cat alexis-only.txt
#+end_src
#+RESULTS:
| rs1064039 |
| rs3833341 |
| rs73598374 |
On les teste dans clinvar et dbSNP
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'ID=@alexis-only.txt' dbSNP_common_chr20.vcf.gz
bcftools query -f '%POS\n' -i 'ID=@alexis-only.txt' dbSNP_common_chr20.vcf.gz > alexis-only-pos.txt
while read -r line; do
bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'POS='$line clinvar_chr20.vcf.gz
done < alexis-only-pos.txt
# bcftools query -f '%POS %REF %ALT %INFO/CLNSIG\n' -i 'POS=23637790' clinvar_chr20.vcf.gz
#+end_src
#+RESULTS:
| 764018 | A | ACAGGTCAAT,ACAGGT | .,5 | 2,. | |
| 23637790 | C | G,T | .,.,12 | | |
| 44651586 | C | A,G,T | .,.,.,5 | 2 | 2 |
| 764018 | A | ACAGGTCAAT | Benign | | |
| 23637790 | C | T | Benign | | |
| 44651586 | C | T | Benign | | |
On a donc une discordance entre clinvar et dbSNP.
On dirait qu'ils ont mal fait l'intersection avec clinvar.
Par exemple https://www.ncbi.nlm.nih.gov/snp/rs3833341#clinical_significance
Tu as l'impression qu'il y a un 1 cli