7SXNKSNEXIANSPPAFEK6SS3JXEJIWXLBYLTOHNSF3GO7WJNEQTTAC
LCQZYNZN4J37G3KMSPG6VZSQKU65XRDUAJVH4YOWPPJTTMO5SLZQC
MN5IIXFRUQQTOCWN66V6ZWIPUOUOM62EWE4BAKKYCTK6FOWR3ZDAC
FXA3ZBV64FML7W47IPHTAJFJHN3J3XHVHFVNYED47XFSBIGMBKRQC
2HYO3KBJWTOQFX2B3H4NHADODGNYOIDVINAUHC7JBTEKZMGE2J3AC
K54WC2QNEXHVKE2YFU3DOCEUYZTGLINXBRLSHQX4BNDADWUMJGFAC
KNZKRW5WZVVQL73LUG3GGFR5QHJJWI4LMOLHJUXUU4PHNBCBLJLAC
4FZ6627CHEHJTPLHE7MF6ZVFJKSDUAKOBGMJZ6QBR2HARWWCXOFQC
be explained by their divergence in
sequencing strategy that producing different length of reads (all BGI pl
atforms
were 100 base pair read length while all Illumina platforms were 150 base pair
read length). The read length effects, as a key factor between two platforms,
would bring alignment bias and error which are higher for short reads and
ultimately affect the variants calling especially the INDELs identification
#+end_quote
*** Débugger variant calling (haplotypecaller)
https://gatk.broadinstitute.org/hc/en-us/articles/360043491652-When-HaplotypeCaller-and-Mutect2-do-not-call-an-expected-variant
https://gatk.broadinstitute.org/hc/en-us/articles/360035891111-Expected-variant-at-a-specific-site-was-not-called
*** Hap.py
Format de sortie :
#+begin_src r
vcf_field_names(vcf, tag = "FORMAT")
#+end_src
#+RESULTS:
: FORMAT BD 1 String Decision for call (TP/FP/FN/N)
: FORMAT BK 1 String Sub-type for decision (match/mismatch type)
: FORMAT BVT 1 String High-level variant type (SNP|INDEL).
: FORMAT BLT 1 String High-level location type (het|homref|hetalt|homa
am = genotype mismatch
lm = allele/haplotype mismatch
. = non vu
**** On vérifie que am = genotype mismatch
référence = T/T
high-confidence = T/C
notre = C/C
#+begin_src sh
bcftools filter -i 'POS=19196584' /Work/Groups/bisonex/data/giab/GRCh38/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz | grep -v '#'
bcftools filter -i 'POS=19196584' ../out/NA12878_NIST7035-dbsnp/variantCalling/haplotypecaller/NA12878_NIST.vcf.gz | grep -v '#'
#+end_src
#+RESULTS:
: NC_000022.11 19196584 . T C 50 PASS platforms=5;platformnames=Illumina,PacBio,10X,Ion,Solid;datasets=5;datasetnames=HiSeqPE300x,CCS15kb_20kb,10XChromiumLR,IonExome,SolidSE75bp;callsets=7;callsetnames=HiSeqPE300xGATK,CCS15kb_20kbDV,CCS15kb_20kbGATK4,HiSeqPE300xfreebayes,10XLRGATK,IonExomeTVC,SolidSE75GATKHC;datasetsmissingcall=CGnormal;callable=CS_HiSeqPE300xGATK_callable,CS_CCS15kb_20kbDV_callable,CS_10XLRGATK_callable,CS_CCS15kb_20kbGATK4_callable,CS_HiSeqPE300xfreebayes_callable GT:PS:DP:ADALL:AD:GQ 0/1:.:781:109,123:138,150:348
: NC_000022.11 19196584 rs1061325 T C 59.32 PASS AC=2;AF=1;AN=2;DB;DP=2;ExcessHet=0;FS=0;MLEAC=1;MLEAF=0.5;MQ=60;QD=29.66;SOR=2.303 GT:AD:DP:GQ:PL 1/1:0,2:2:6:71,6,0
**** On vérifie que lm = allele/haplotype mismatch
référence = CAA/CAA
high-confidence = CA/CA
notre = C/CA
#+begin_src sh
bcftools filter -i 'POS=31277416' /Work/Groups/bisonex/data/giab/GRCh38/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz | grep -v '#'
bcftools filter -i 'POS=31277416' ../out/NA12878_NIST7035-dbsnp/variantCalling/haplotypecaller/NA12878_NIST.vcf.gz | grep -v '#'
#+end_src
#+RESULTS:
: NC_000022.11 31277416 . CA C 50 PASS platforms=3;platformnames=Illumina,PacBio,10X;datasets=3;datasetnames=HiSeqPE300x,CCS15kb_20kb,10XChromiumLR;callsets=4;callsetnames=HiSeqPE300xGATK,CCS15kb_20kbDV,10XLRGATK,HiSeqPE300xfreebayes;datasetsmissingcall=CCS15kb_20kb,CGnormal,IonExome,SolidSE75bp;callable=CS_HiSeqPE300xGATK_callable;difficultregion=GRCh38_AllHomopolymers_gt6bp_imperfectgt10bp_slop5,GRCh38_SimpleRepeat_imperfecthomopolgt10_slop5 GT:PS:DP:ADALL:AD:GQ 1/1:.:465:16,229:0,190:129
: NC_000022.11 31277416 rs57244615 CAA C,CA 389.02 PASS AC=1,1;AF=0.5,0.5;AN=2;BaseQRankSum=0.37;DB;DP=37;ExcessHet=0;FS=0;MLEAC=1,1;MLEAF=0.5,0.5;MQ=60;MQRankSum=0;QD=13.41;ReadPosRankSum=-0.651;SOR=0.572 GT:AD:DP:GQ:PL 1/2:5,10,14:29:64:406,202,313,64,0,88
*** Génération de reads
Biblio récente
https://www.biorxiv.org/content/10.1101/2022.03.29.486262v1.full.pdf
Parmi ceux qui gèrent les variations
- *simuscop* reads non centré sur les zones de capture
- *NEAT: exome* mais trop lent en pratique
- *Reseq* exome
- gensim : pas d'exome
- pIRS : non plus
- varsim : non plus
...
Biblio : https://www.nature.com/articles/s41437-022-00577-3
Divers
- Liste ancienne : https://www.biostars.org/p/128762/
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02265-7
* Idées
** Validation analytique
mail Yannis : données patients +/- simulées
*** Utiliser données GCAT et uploader le notre ?
https://www.nature.com/articles/ncomms7275
*** [#A] Variant calling : Genome in a bottle : NA12878 + autres
Résumé : https://www.nist.gov/programs-projects/genome-bottle
Manuscript : https://www.nature.com/articles/s41587-019-0054-x.epdf?author_access_token=E_1bL0MtBBwZr91xEsy6B9RgN0jAjWel9jnR3ZoTv0OLNnFBR7rUIZNDXq0DIKdg3w6KhBF8Rz2RWQFFc0St45kC6CZs3cDYc87HNHovbWSOubJHDa9CeJV-pN0BW_mQ0n7cM13KF2JRr_wAAn524w%3D%3D
Article comparant les variant calling : https://www.biorxiv.org/content/10.1101/2020.12.11.422022v1.full.pdf
**** KILL Tester le séquencage aussi
CLOSED: [2023-01-30 lun. 18:30]
Depuis un fastq correspondant à Illumina https://github.com/genome-in-a-bottle/giab_data_indexes
puis on compare le VCF avec les "high confidence"
On séquence directement NA12878 -> inutile pour le pipeline seul
**** TODO Tester seul la partie bioinformatique
Tout résumé ici : https://www.nist.gov/programs-projects/genome-bottle
- methode https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/analysis/Illumina_PlatinumGenomes_NA12877_NA12878_09162015/IlluminaPlatinumGenomes-user-guide.pdf
- vcf
https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/NA12878_HG001/latest/GRCh38/
NB: à quoi correspond https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/analysis/Illumina_PlatinumGenomes_NA12877_NA12878_09162015/hg38/2.0.1/NA12878/ ??
Article comparant les variant calling : https://www.biorxiv.org/content/10.1101/2020.12.11.422022v1.full.pdf
Article pour vcfeval : https://www.nature.com/articles/s41587-019-0054-x
La version 4 ajoute 273 gènes "clinically relevant" https://www.biorxiv.org/content/10.1101/2021.06.07.444885v3.full.pdf
Ajout des zones "difficiles"
https://www.biorxiv.org/content/10.1101/2020.07.24.212712v5.full.pdf
*** [#B] Pipeline : générer patient avec tous les variants retrouvés à Centogene
Comparaison de génération ADN (2019)
https://academic.oup.com/bfg/article/19/1/49/5680294
**** SimuSCop (exome)
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03665-5
https://github.com/qasimyu/simuscop
1. Crééer un modèle depuis bam + vcf : Setoprofile
2. Génerer données NGS
** Annotation :
*** Comparaison vep / snpeff et annovar
* Changement nouvelle version
- Dernière version du génome (la version "prête à l'emploi" est seulement GRCh38 sans les version patchées)
* Notes
** Nextflow
*** afficher les résultats d'un process/workflow
#+begin_src
lol.out.view()
#+end_src
Attention, ne fonctionne pas si plusieurs sortie:
#+begin_src
lol.out[0].view()
#+end_src
ou si /a/ est le nom de la sortie
#+begin_src
lol.out.a.view()
#+end_src
** Quelle version du génome ?
Il y a 2 notations pour les chrosome: Refseq (NC_0001) ou chr1, chr2...
dbSNP utilise Refseq
pour le fasta, 2 solutions
- refseq : "https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/${genome}_latest/refseq_identifiers/${fna}.gz"
-> nécessite d'indexer le fichier (long !)
- chromosome https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/
-> nécessite d'annoter les chromosomes pour corriger (avec le fichier gff)
On utilise la version chromosome donc on annote dbSNP (à faire)
** Performances
Ordinateur de Carine (WSL2) : 4h dont 1h15 alignement (parallélisé) et 1h15 haplotypecaller (séquentiel)
** Chromosomes NC, NT, NW
Correspondance :
https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&chromInfoPage=
Signification
https://genome.ucsc.edu/FAQ/FAQdownloads.html#downloadAlt
- alt = séquences alternatives (utilisables)
- fix = patch (correction ou amélioration)
- random = séquence connue sur un chromosome mais non encore utilisée
** Pipelines prêt-à-l’emploi nextflow
Problème :
be explained by their divergence in
sequencing strategy that producing different length of reads (all BGI platforms
were 100 base pair read length while all Illumina platforms were 150 base pair
read length). The read length effects, as a key factor between two platforms,
would bring alignment bias and error which are higher for short reads and
ultimately affect the variants calling especially the INDELs identification
#+end_quote
*** Débugger variant calling (haplotypecaller)
https://gatk.broadinstitute.org/hc/en-us/articles/360043491652-When-HaplotypeCaller-and-Mutect2-do-not-call-an-expected-variant
https://gatk.broadinstitute.org/hc/en-us/articles/360035891111-Expected-variant-at-a-specific-site-was-not-called
*** Hap.py
Format de sortie :
#+begin_src r
vcf_field_names(vcf, tag = "FORMAT")
#+end_src
#+RESULTS:
: FORMAT BD 1 String Decision for call (TP/FP/FN/N)
: FORMAT BK 1 String Sub-type for decision (match/mismatch type)
: FORMAT BVT 1 String High-level variant type (SNP|INDEL).
: FORMAT BLT 1 String High-level location type (het|homref|hetalt|homa
am = genotype mismatch
lm = allele/haplotype mismatch
. = non vu
**** On vérifie que am = genotype mismatch
référence = T/T
high-confidence = T/C
notre = C/C
#+begin_src sh
bcftools filter -i 'POS=19196584' /Work/Groups/bisonex/data/giab/GRCh38/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz | grep -v '#'
bcftools filter -i 'POS=19196584' ../out/NA12878_NIST7035-dbsnp/variantCalling/haplotypecaller/NA12878_NIST.vcf.gz | grep -v '#'
#+end_src
#+RESULTS:
: NC_000022.11 19196584 . T C 50 PASS platforms=5;platformnames=Illumina,PacBio,10X,Ion,Solid;datasets=5;datasetnames=HiSeqPE300x,CCS15kb_20kb,10XChromiumLR,IonExome,SolidSE75bp;callsets=7;callsetnames=HiSeqPE300xGATK,CCS15kb_20kbDV,CCS15kb_20kbGATK4,HiSeqPE300xfreebayes,10XLRGATK,IonExomeTVC,SolidSE75GATKHC;datasetsmissingcall=CGnormal;callable=CS_HiSeqPE300xGATK_callable,CS_CCS15kb_20kbDV_callable,CS_10XLRGATK_callable,CS_CCS15kb_20kbGATK4_callable,CS_HiSeqPE300xfreebayes_callable GT:PS:DP:ADALL:AD:GQ 0/1:.:781:109,123:138,150:348
: NC_000022.11 19196584 rs1061325 T C 59.32 PASS AC=2;AF=1;AN=2;DB;DP=2;ExcessHet=0;FS=0;MLEAC=1;MLEAF=0.5;MQ=60;QD=29.66;SOR=2.303 GT:AD:DP:GQ:PL 1/1:0,2:2:6:71,6,0
**** On vérifie que lm = allele/haplotype mismatch
référence = CAA/CAA
high-confidence = CA/CA
notre = C/CA
#+begin_src sh
bcftools filter -i 'POS=31277416' /Work/Groups/bisonex/data/giab/GRCh38/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz | grep -v '#'
bcftools filter -i 'POS=31277416' ../out/NA12878_NIST7035-dbsnp/variantCalling/haplotypecaller/NA12878_NIST.vcf.gz | grep -v '#'
#+end_src
#+RESULTS:
: NC_000022.11 31277416 . CA C 50 PASS platforms=3;platformnames=Illumina,PacBio,10X;datasets=3;datasetnames=HiSeqPE300x,CCS15kb_20kb,10XChromiumLR;callsets=4;callsetnames=HiSeqPE300xGATK,CCS15kb_20kbDV,10XLRGATK,HiSeqPE300xfreebayes;datasetsmissingcall=CCS15kb_20kb,CGnormal,IonExome,SolidSE75bp;callable=CS_HiSeqPE300xGATK_callable;difficultregion=GRCh38_AllHomopolymers_gt6bp_imperfectgt10bp_slop5,GRCh38_SimpleRepeat_imperfecthomopolgt10_slop5 GT:PS:DP:ADALL:AD:GQ 1/1:.:465:16,229:0,190:129
: NC_000022.11 31277416 rs57244615 CAA C,CA 389.02 PASS AC=1,1;AF=0.5,0.5;AN=2;BaseQRankSum=0.37;DB;DP=37;ExcessHet=0;FS=0;MLEAC=1,1;MLEAF=0.5,0.5;MQ=60;MQRankSum=0;QD=13.41;ReadPosRankSum=-0.651;SOR=0.572 GT:AD:DP:GQ:PL 1/2:5,10,14:29:64:406,202,313,64,0,88
*** Génération de reads
Biblio récente
https://www.biorxiv.org/content/10.1101/2022.03.29.486262v1.full.pdf
Parmi ceux qui gèrent les variations
- *simuscop* reads non centré sur les zones de capture
- *NEAT: exome* mais trop lent en pratique
- *Reseq* exome
- gensim : pas d'exome
- pIRS : non plus
- varsim : non plus
...
Temps de calcul selon l'article de reseq https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02265-7
#+begin_quote
Due to ReSeq’s effective parallelization, its elapsed times are low for this benchmark with 48 virtual CPUs (Additional file 1: Figure S34b,e). In contrast, the single-threaded processes implemented in perl or python have strikingly high elapsed times. This is well visible in Hs-HiX-TruSeq and applies to the training of pIRS (over a week), NEAT (several days), and BEAR (half a week) as well as the simulation of NEAT (close to 2 weeks) and BEAR (several weeks).
Biblio : https://www.nature.com/articles/s41437-022-00577-3
#+end_quote
Divers
- Liste ancienne : https://www.biostars.org/p/128762/
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02265-7
* Idées
** Validation analytique
mail Yannis : données patients +/- simulées
*** Utiliser données GCAT et uploader le notre ?
https://www.nature.com/articles/ncomms7275
*** [#A] Variant calling : Genome in a bottle : NA12878 + autres
Résumé : https://www.nist.gov/programs-projects/genome-bottle
Manuscript : https://www.nature.com/articles/s41587-019-0054-x.epdf?author_access_token=E_1bL0MtBBwZr91xEsy6B9RgN0jAjWel9jnR3ZoTv0OLNnFBR7rUIZNDXq0DIKdg3w6KhBF8Rz2RWQFFc0St45kC6CZs3cDYc87HNHovbWSOubJHDa9CeJV-pN0BW_mQ0n7cM13KF2JRr_wAAn524w%3D%3D
Article comparant les variant calling : https://www.biorxiv.org/content/10.1101/2020.12.11.422022v1.full.pdf
**** KILL Tester le séquencage aussi
CLOSED: [2023-01-30 lun. 18:30]
Depuis un fastq correspondant à Illumina https://github.com/genome-in-a-bottle/giab_data_indexes
puis on compare le VCF avec les "high confidence"
On séquence directement NA12878 -> inutile pour le pipeline seul
**** TODO Tester seul la partie bioinformatique
Tout résumé ici : https://www.nist.gov/programs-projects/genome-bottle
- methode https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/analysis/Illumina_PlatinumGenomes_NA12877_NA12878_09162015/IlluminaPlatinumGenomes-user-guide.pdf
- vcf
https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/NA12878_HG001/latest/GRCh38/
NB: à quoi correspond https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/analysis/Illumina_PlatinumGenomes_NA12877_NA12878_09162015/hg38/2.0.1/NA12878/ ??
Article comparant les variant calling : https://www.biorxiv.org/content/10.1101/2020.12.11.422022v1.full.pdf
Article pour vcfeval : https://www.nature.com/articles/s41587-019-0054-x
La version 4 ajoute 273 gènes "clinically relevant" https://www.biorxiv.org/content/10.1101/2021.06.07.444885v3.full.pdf
Ajout des zones "difficiles"
https://www.biorxiv.org/content/10.1101/2020.07.24.212712v5.full.pdf
*** [#B] Pipeline : générer patient avec tous les variants retrouvés à Centogene
Comparaison de génération ADN (2019)
https://academic.oup.com/bfg/article/19/1/49/5680294
**** SimuSCop (exome)
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03665-5
https://github.com/qasimyu/simuscop
1. Crééer un modèle depuis bam + vcf : Setoprofile
2. Génerer données NGS
** Annotation :
*** Comparaison vep / snpeff et annovar
* Changement nouvelle version
- Dernière version du génome (la version "prête à l'emploi" est seulement GRCh38 sans les version patchées)
* Notes
** Nextflow
*** afficher les résultats d'un process/workflow
#+begin_src
lol.out.view()
#+end_src
Attention, ne fonctionne pas si plusieurs sortie:
#+begin_src
lol.out[0].view()
#+end_src
ou si /a/ est le nom de la sortie
#+begin_src
lol.out.a.view()
#+end_src
** Quelle version du génome ?
Il y a 2 notations pour les chrosome: Refseq (NC_0001) ou chr1, chr2...
dbSNP utilise Refseq
pour le fasta, 2 solutions
- refseq : "https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/${genome}_latest/refseq_identifiers/${fna}.gz"
-> nécessite d'indexer le fichier (long !)
- chromosome https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/
-> nécessite d'annoter les chromosomes pour corriger (avec le fichier gff)
On utilise la version chromosome donc on annote dbSNP (à faire)
** Performances
Ordinateur de Carine (WSL2) : 4h dont 1h15 alignement (parallélisé) et 1h15 haplotypecaller (séquentiel)
** Chromosomes NC, NT, NW
Correspondance :
https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&chromInfoPage=
Signification
https://genome.ucsc.edu/FAQ/FAQdownloads.html#downloadAlt
- alt = séquences alternatives (utilisables)
- fix = patch (correction ou amélioration)
- random = séquence connue sur un chromosome mais non encore utilisée
** Pipelines prêt-à-l’emploi nextflow
Problème :
rated native PairHMM implementation is not supported
17:28:00.721 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/nix/store/cy9ckxqwrkifx7wf02hm4ww1p6lnbxg9-gatk-4.2.4.1/bin/gatk-package-4.2.4.1-local.jar!/com/intel/gkl/native/libgkl_utils.so
17:28:00.733 WARN NativeLibraryLoader - Unable to load libgkl_utils.so from native/libgkl_utils.so (/Work/Users/apraga/bisonex/out/NA12878_NIST7035/preprocessing/applybqsr/libgkl_utils821485189051585397.so: libgomp.so.1: cannot open shared object file: No such file or directory)
17:28:00.733 WARN IntelPairHmm - Intel GKL Utils not loaded
17:28:00.733 WARN PairHMM - ***WARNING: Machine does not have the AVX instruction set support needed for the accelerated AVX PairHmm. Falling back to the MUCH slower LOGLESS_CACHING implementation!
17:28:00.763 INFO ProgressMeter - Starting traversal
#+end_quote
libgomp.so est fourni par gcc donc il faut charger le module
module load gcc@11.3.0/gcc-12.1.0
** KILL Utiliser subworkflow
CLOSED: [2023-04-02 Sun 18:08]
Notre version permet d'être plus souple
*** KILL Alignement
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
*** KILL Vep
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
vcf_annotate_ensemblvep
** TODO Annotation avec nextflow :annotation:
*** KILL VEP : --gene-phenotype ?
CLOSED: [2023-04-18 mar. 18:32]
Vu avec alexis : bases de données non à jour
https://www.ensembl.org/info/genome/variation/phenotype/sources_phenotype_documentation.html
*** DONE plugin VEP
CLOSED: [2023-04-18 mar. 18:32]
Cloner dépôt git avec plugin
Puis utiliser --dir_plugins
*** HOLD Utiliser code d’Alexis
*** TODO Nouvelle version avec VEP
**** TODO Ajout spliceAI
SCHEDULED: <2023-04-30 Sun>
plugin VEP
**** TODO Ajout pLI
plugin VEP
**** KILL Ajout LOEUF
CLOSED: [2023-04-19 mer. 16:32]
plugin VEP
**** TODO Spip avec export VCF
SCHEDULED: <2023-04-30 Sun>
BED ne semble pas bien marcher (il faut définir une zone)
VCF : trop d’information
Attention, plusieurs transcripts mais résultats identiques. On supprimer les doublons
***** TODO interpretation + score + intervalle de confiance séparé
SCHEDULED: <2023-04-30 Sun>
***** DONE Score
CLOSED: [2023-04-22 Sat 15:30]
**** TODO CADD: remplacer par plugin VEP
***** Test
#+begin_src
vep -i test.vcf -o lol.vcf --offline --dir /Work/Projects/bisonex/data/vep/GRCh38/ --merged --vcf --fasta /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna --plugin CADD,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.snv.tsv.gz,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.indel.tsv.gz --dir_plugins ../VEP_plugins/ -v
#+end_src
Test
#+begin_src sh
vep --id "1 230710048 230710048 A/G 1" --offline --dir /Work/Projects/bisonex/data/vep/GRCh38/ --merged --vcf --fasta /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna --plugin CADD,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.snv.tsv.gz,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.indel.tsv.gz --hgvsg --plugin pLI --plugin LOEUF -o lol
#+end_src
CSQ=G|missense_variant|MODERATE|AGT|ENSG00000135744|Transcript|ENST00000366667|protein_coding|2/5||||843|776|259|M/T|aTg/aCg|||-1||HGNC|HGNC:333||Ensembl||A|A||1:g.230710048A>G|0.347|-0.277922|
Correspond bien à https://www.ensembl.org/Homo_sapiens/Tools/VEP/Results?tl=I7ZsIbrj14P6lD43-9115494
***** DONE Utiliser whole genome
CLOSED: [2023-04-29 Sat 15:46]
***** TODO Renommer les chromosome avant ...
SCHEDULED: <2023-04-30 Sun>
zcat whole_genome_SNVs.tsv.gz | head -n 100 | gzip -c > cadd-small-test.tsv.gz
**** DONE clinvar
CLOSED: [2023-04-22 Sat 15:31]
**** STRT Remplacer script R par vep ?
Example avec --custom
https://www.ensembl.org/info/docs/tools/vep/script/vep_custom.html
**** TODO Vérifier résultats HGVS avec mutalyzer
**** TODO OMIM
**** TODO Grantham
**** TODO ACMG incidental
**** TODO Gnomad ?
**** DONE Filtrer après VEP avec filter_vep
CLOSED: [2023-04-29 Sat 15:47]
nNon testé
*** HOLD Ancienne version
**** TODO HGVS
**** TODO Filtrer après VEP
**** TODO OMIM
**** TODO clinvar
**** TODO ACMG incidental
**** TODO Grantham
**** KILL LRG
CLOSED: [2023-04-18 mar. 17:22] SCHEDULED: <2023-04-18 Tue>
Vu avec alexis, n’est plus à jour
**** TODO Gnomad
** DONE Porter exactement la version d'Alexis sur Helios
CLOSED: [2023-01-14 Sat 17:56]
Branche "prod"
** STRT Tester version d'alexis avec Nix
*** DONE Ajouter clinvar
CLOSED: [2022-11-13 Sun 19:37]
*** DONE Alignement
CLOSED: [2022-11-13 Sun 12:52]
*** DONE Haplotype caller
CLOSED: [2022-11-13 Sun 13:00]
*** TODO Filter
- [X] depth
- [ ] comon snp not path
Problème avec liste des ID
**** TODO variant annotation
Besoin de vep
*** TODO Variant calling
* Amélioration :amelioration:
* Documentation :doc:
** DONE Procédure d'installation nix + dependences pour VM CHU
CLOSED: [2023-04-22 Sat 15:27] SCHEDULED: <2023-04-13 Thu>
* Manuscript :manuscript:
* Tests :tests:
** WAIT Non régression : version prod
*** DONE ID common snp
CLOSED: [2022-11-19 Sat 21:36]
#+begin_src
$ wc -l ID_of_common_snp.txt
23194290 ID_of_common_snp.txt
$ wc -l /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp.txt
23194290 /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp.txt
#+end_src
*** DONE ID common snp not clinvar patho
CLOSED: [2022-12-11 Sun 20:11]
**** DONE Vérification du problème
CLOSED: [2022-12-11 Sun 16:30]
Sur le J:
21155134 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt.ref
Version de "non-régression"
21155076 database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt
Nouvelle version
23193391 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt
Si on enlève les doublons
$ sort database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt | uniq > old.txt
$ wc -l old.txt
21107097 old.txt
$ sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt | uniq > new.txt
$ wc -l new.txt
21174578 new.txt
$ sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt.ref | uniq > ref.txt
$ wc -l ref.txt
21107155 ref.txt
Si on regarde la différence
comm -23 ref.txt old.txt
rs1052692
rs1057518973
rs1057518973
rs11074121
rs112848754
rs12573787
rs145033890
rs147889095
rs1553904159
rs1560294695
rs1560296615
rs1560310926
rs1560325547
rs1560342418
rs1560356225
rs1578287542
...
On cherche le premier
bcftools query -i 'ID="rs1052692"' database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 1619351 C A,T
Il est bien patho...
$ bcftools query -i 'POS=1619351' database/clinvar/clinvar.vcf.gz -f '%CHROM %POS %REF %ALT %INFO/CLNSIG\n'
19 1619351 C T Conflicting_interpretations_of_pathogenicity
On vérifie pour tous les autres
$ comm -23 ref.txt old.txt > tocheck.txt
On génère les régions à vérifier (chromosome number:position)
$ bcftools query -i 'ID=@tocheck.txt' database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM\t%POS\n' > tocheck.pos
On génère le mapping inverse (chromosome number -> NC)
$ awk ' { t = $1; $1 = $2; $2 = t; print; } ' database/RefSeq/refseq_to_number_only_consensual.txt > mapping.txt
On remap clinvar
$ bcftools annotate --rename-chrs mapping.txt database/clinvar/clinvar.vcf.gz -o clinvar_remapped.vcf.gz
$ tabix clinvar_remapped.vcf.gz
Enfin, on cherche dans clinvar la classification
$ bcftools query -R tocheck.pos clinvar_remapped.vcf.gz -f '%CHROM %POS %INFO/CLNSIG\n'
$ bcftools query -R tocheck.pos database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM %POS %ID \n' | grep '^NC'
#+RESULTS:
**** DONE Comprendre pourquoi la nouvelle version donne un résultat différent
CLOSED: [2022-12-11 Sun 20:11]
***** DONE Même version dbsnp et clinvar ?
CLOSED: [2022-12-10 Sat 23:02]
Clinvar différent !
$ bcftools stats clinvar.gz
clinvar (Alexis)
SN 0 number of samples: 0
SN 0 number of records: 1492828
SN 0 number of no-ALTs: 965
SN 0 number of
rated native PairHMM implementation is not supported
17:28:00.721 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/nix/store/cy9ckxqwrkifx7wf02hm4ww1p6lnbxg9-gatk-4.2.4.1/bin/gatk-package-4.2.4.1-local.jar!/com/intel/gkl/native/libgkl_utils.so
17:28:00.733 WARN NativeLibraryLoader - Unable to load libgkl_utils.so from native/libgkl_utils.so (/Work/Users/apraga/bisonex/out/NA12878_NIST7035/preprocessing/applybqsr/libgkl_utils821485189051585397.so: libgomp.so.1: cannot open shared object file: No such file or directory)
17:28:00.733 WARN IntelPairHmm - Intel GKL Utils not loaded
17:28:00.733 WARN PairHMM - ***WARNING: Machine does not have the AVX instruction set support needed for the accelerated AVX PairHmm. Falling back to the MUCH slower LOGLESS_CACHING implementation!
17:28:00.763 INFO ProgressMeter - Starting traversal
#+end_quote
libgomp.so est fourni par gcc donc il faut charger le module
module load gcc@11.3.0/gcc-12.1.0
** KILL Utiliser subworkflow
CLOSED: [2023-04-02 Sun 18:08]
Notre version permet d'être plus souple
*** KILL Alignement
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
*** KILL Vep
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
vcf_annotate_ensemblvep
** TODO Annotation avec nextflow :annotation:
*** KILL VEP : --gene-phenotype ?
CLOSED: [2023-04-18 mar. 18:32]
Vu avec alexis : bases de données non à jour
https://www.ensembl.org/info/genome/variation/phenotype/sources_phenotype_documentation.html
*** DONE plugin VEP
CLOSED: [2023-04-18 mar. 18:32]
Cloner dépôt git avec plugin
Puis utiliser --dir_plugins
*** HOLD Utiliser code d’Alexis
*** TODO Nouvelle version avec VEP
**** TODO Ajout spliceAI
SCHEDULED: <2023-04-30 Sun>
plugin VEP
**** TODO Ajout pLI
plugin VEP
**** KILL Ajout LOEUF
CLOSED: [2023-04-19 mer. 16:32]
plugin VEP
**** TODO Spip
SCHEDULED: <2023-04-30 Sun>
BED ne semble pas bien marcher (il faut définir une zone)
VCF : trop d’information
Attention, plusieurs transcripts mais résultats identiques. On supprimer les doublons
***** TODO interpretation + score + intervalle de confiance séparé
SCHEDULED: <2023-04-30 Sun>
***** DONE Score
CLOSED: [2023-04-22 Sat 15:30]
**** TODO CADD: remplacer par plugin VEP
***** Test
#+begin_src
vep -i test.vcf -o lol.vcf --offline --dir /Work/Projects/bisonex/data/vep/GRCh38/ --merged --vcf --fasta /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna --plugin CADD,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.snv.tsv.gz,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.indel.tsv.gz --dir_plugins ../VEP_plugins/ -v
#+end_src
Test
#+begin_src sh
vep --id "1 230710048 230710048 A/G 1" --offline --dir /Work/Projects/bisonex/data/vep/GRCh38/ --merged --vcf --fasta /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna --plugin CADD,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.snv.tsv.gz,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.indel.tsv.gz --hgvsg --plugin pLI --plugin LOEUF -o lol
#+end_src
CSQ=G|missense_variant|MODERATE|AGT|ENSG00000135744|Transcript|ENST00000366667|protein_coding|2/5||||843|776|259|M/T|aTg/aCg|||-1||HGNC|HGNC:333||Ensembl||A|A||1:g.230710048A>G|0.347|-0.277922|
Correspond bien à https://www.ensembl.org/Homo_sapiens/Tools/VEP/Results?tl=I7ZsIbrj14P6lD43-9115494
***** DONE Utiliser whole genome
CLOSED: [2023-04-29 Sat 15:46]
***** KILL Renommer les chromosome avant ...
CLOSED: [2023-05-01 Mon 09:14] SCHEDULED: <2023-04-30 Sun>
Trop long !
- Téléchargement de CADD: 4h20
- renommer les chromosome pour SNV : 6h20
- tabix sur les SNV : job tué au bout de 21h....
***** TODO annoter séparément et fusionner les VCF
SCHEDULED: <2023-05-01 Mon>
NB: on pourrait filtrer CADD avec tabix pour se restreindre à nos variants
**** DONE clinvar
CLOSED: [2023-04-22 Sat 15:31]
**** STRT Remplacer script R par vep ?
Example avec --custom
https://www.ensembl.org/info/docs/tools/vep/script/vep_custom.html
**** KILL Vérifier résultats HGVS avec mutalyzer
CLOSED: [2023-05-01 Mon 09:26]
**** TODO Parallélisation par chromosome avec workflow VEP
https://github.com/Ensembl/ensembl-vep/blob/release/109/nextflow/workflows/run_vep.nf
**** TODO OMIM
SCHEDULED: <2023-05-01 Mon>
**** TODO Grantham
SCHEDULED: <2023-05-01 Mon>
**** TODO ACMG incidental
SCHEDULED: <2023-05-01 Mon>
**** TODO Gnomad ?
SCHEDULED: <2023-05-01 Mon>
**** DONE Filtrer après VEP avec filter_vep
CLOSED: [2023-04-29 Sat 15:47]
nNon testé
*** HOLD Ancienne version
**** TODO HGVS
**** TODO Filtrer après VEP
**** TODO OMIM
**** TODO clinvar
**** TODO ACMG incidental
**** TODO Grantham
**** KILL LRG
CLOSED: [2023-04-18 mar. 17:22] SCHEDULED: <2023-04-18 Tue>
Vu avec alexis, n’est plus à jour
**** TODO Gnomad
** DONE Porter exactement la version d'Alexis sur Helios
CLOSED: [2023-01-14 Sat 17:56]
Branche "prod"
** STRT Tester version d'alexis avec Nix
*** DONE Ajouter clinvar
CLOSED: [2022-11-13 Sun 19:37]
*** DONE Alignement
CLOSED: [2022-11-13 Sun 12:52]
*** DONE Haplotype caller
CLOSED: [2022-11-13 Sun 13:00]
*** TODO Filter
- [X] depth
- [ ] comon snp not path
Problème avec liste des ID
**** TODO variant annotation
Besoin de vep
*** TODO Variant calling
* Amélioration :amelioration:
* Documentation :doc:
** DONE Procédure d'installation nix + dependences pour VM CHU
CLOSED: [2023-04-22 Sat 15:27] SCHEDULED: <2023-04-13 Thu>
* Manuscript :manuscript:
* Tests :tests:
** WAIT Non régression : version prod
*** DONE ID common snp
CLOSED: [2022-11-19 Sat 21:36]
#+begin_src
$ wc -l ID_of_common_snp.txt
23194290 ID_of_common_snp.txt
$ wc -l /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp.txt
23194290 /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp.txt
#+end_src
*** DONE ID common snp not clinvar patho
CLOSED: [2022-12-11 Sun 20:11]
**** DONE Vérification du problème
CLOSED: [2022-12-11 Sun 16:30]
Sur le J:
21155134 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt.ref
Version de "non-régression"
21155076 database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt
Nouvelle version
23193391 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt
Si on enlève les doublons
$ sort database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt | uniq > old.txt
$ wc -l old.txt
21107097 old.txt
$ sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt | uniq > new.txt
$ wc -l new.txt
21174578 new.txt
$ sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt.ref | uniq > ref.txt
$ wc -l ref.txt
21107155 ref.txt
Si on regarde la différence
comm -23 ref.txt old.txt
rs1052692
rs1057518973
rs1057518973
rs11074121
rs112848754
rs12573787
rs145033890
rs147889095
rs1553904159
rs1560294695
rs1560296615
rs1560310926
rs1560325547
rs1560342418
rs1560356225
rs1578287542
...
On cherche le premier
bcftools query -i 'ID="rs1052692"' database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 1619351 C A,T
Il est bien patho...
$ bcftools query -i 'POS=1619351' database/clinvar/clinvar.vcf.gz -f '%CHROM %POS %REF %ALT %INFO/CLNSIG\n'
19 1619351 C T Conflicting_interpretations_of_pathogenicity
On vérifie pour tous les autres
$ comm -23 ref.txt old.txt > tocheck.txt
On génère les régions à vérifier (chromosome number:position)
$ bcftools query -i 'ID=@tocheck.txt' database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM\t%POS\n' > tocheck.pos
On génère le mapping inverse (chromosome number -> NC)
$ awk ' { t = $1; $1 = $2; $2 = t; print; } ' database/RefSeq/refseq_to_number_only_consensual.txt > mapping.txt
On remap clinvar
$ bcftools annotate --rename-chrs mapping.txt database/clinvar/clinvar.vcf.gz -o clinvar_remapped.vcf.gz
$ tabix clinvar_remapped.vcf.gz
Enfin, on cherche dans clinvar la classification
$ bcftools query -R tocheck.pos clinvar_remapped.vcf.gz -f '%CHROM %POS %INFO/CLNSIG\n'
$ bcftools query -R tocheck.pos database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM %POS %ID \n' | grep '^NC'
#+RESULTS:
**** DONE Comprendre pourquoi la nouvelle version donne un résultat différent
CLOSED: [2022-12-11 Sun 20:11]
***** DONE Même version dbsnp et clinvar ?
CLOSED: [2022-12-10 Sat 23:02]
Clinvar différent !
$ bcftools stats clinvar.gz
clinvar (Alexis)
SN 0 number of samples: 0
SN 0 number of records: 1492828
SN 0 number of no-ALTs: 965
SN 0 number of
26/42
- Bam après aplybqsr: base qualità 35 donc ok
chr15 également à 89318565, variant retrouvé à 25/33 avec basequal de 37
Sans oublier de charger les instructions avx
#+begin_src sh
module load gcc@11.3.0/gcc-12.1.0
#+end_src
On coupe le .bam par chromosome pour débugger (sur le mesocentre)
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/simuscop-centogene-200x/cento/testing :results silent
ln -s ../preprocessing/applybqsr/cento.bam .
ln -s ../preprocessing/recalibrated/cento.bam.bai .
ln -s /Work/Projects/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz .
ln -s /Work/Projects/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz.tbi .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.dict .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna.fai .
#+end_src
On doit lancer à la main (org-mode ne connait pas le chemin de samtools)
samtools view -b cento.bam NC_000015.10 > cento_chr15.bam
samtools index cento_chr15.bam
Puis on se restreint au chronmosome 15
samtools faidx genomeRef.fna NC_000015.10 > genomeRef_chr15.fa
samtools faidx genomeRef_chr15.fa
gatk CreateSequenceDictionary -R genomeRef_chr15.fa -O genomeRef_chr15.dict
On restreint au chromosome 15 avec l'option -L (dure = 1min)
gatk --java-options "-Xmx3072M" HaplotypeCaller --input cento_chr15.bam \
--output test.vcf.gz --reference genomeRef.fna --dbsnp dbSNP.gz --tmp-dir . --max-mnp-distance 2 -L NC_000015.10
******* DONE Supprimer --max-mnp-distance = 2: idem
CLOSED: [2023-04-30 Sun 15:42]
******* DONE --debug &> run.log : Non appelé...
CLOSED: [2023-04-30 Sun 15:52]
******* DONE --linked-de-bruijn-graph: idem
CLOSED: [2023-04-30 Sun 15:55]
******* DONE --recover-all-dangling-branches
CLOSED: [2023-04-30 Sun 16:01]
******* DONE --min-pruning 0 : plus mais pas celui là
CLOSED: [2023-04-30 Sun 15:59]
******* DONE --bam-output
CLOSED: [2023-04-30 Sun 16:50]
******** DONE : rien !
CLOSED: [2023-04-30 Sun 16:08]
******** DONE + --recover-all-dangling-branches : rien !
CLOSED: [2023-04-30 Sun 16:08]
******* DONE Données filtrées ? apparement non
CLOSED: [2023-04-30 Sun 16:41]
183122 read(s) filtered by: MappingQualityReadFilter
3674 read(s) filtered by: NotDuplicateReadFilter
******** DONE --disable-read-filter MappingQualityReadFilter: idem
CLOSED: [2023-04-30 Sun 16:34]
On a bien - 0 read(s) filtered by: MappingQualityAvailableReadFilter
******** DONE --disable-read-filter NotDuplicateReadFilter: idem
CLOSED: [2023-04-30 Sun 16:40]
******* DONE Essayer freebayes : idem
CLOSED: [2023-04-30 Sun 16:22]
freebayes -f genomeRef.fna -r NC_000015.10 cento_chr15.bam > freebayes-test-chr15.vcf
******* DONE Avec toutes les options : idem
--linked-de-bruijn-graph --recover-all-dangling-branches --min-pruning 0 --bam-output debug.bam
CLOSED: [2023-04-30 Sun 16:50]
******* DONE Vérifier qu'on regarde le même bam : oui
CLOSED: [2023-04-30 Sun 16:50]
******* DONE Désactiver dbSNP : idem
CLOSED: [2023-04-30 Sun 16:52]
******* DONE Changer kmer size : idem
CLOSED: [2023-04-30 Sun 16:56]
par exemple[[https://gatk.broadinstitute.org/hc/en-us/community/posts/360075653152-REAL-Variant-not-called-by-HaplotypeCaller][forum gatk]] --kmer-size 18 --kmer-size 22
******* TODO --adaptive-pruning true
***** TODO Après annotation
SCHEDULED: <2023-04-28 Fri>
***** TODO Après filtre annotation
*** KILL NEAT : trop lent :neat:
CLOSED: [2023-04-29 Sat 22:06]
**** KILL Génération fastq sur exno 5 GATAD2B
CLOSED: [2023-04-29 Sat 22:06]
Trop lent : pour 1 exon : 1500 secondes !
#+begin_src sh
samtools faidx genomeRef.fna NC_000001.11 | save -f genomeRef_chr1.fna
python gen_reads.py -r ../test-simuscop/genomeRef_chr1.fna -o lol -tr ../test-simuscop/gatad2b-exon6.bed -R 147 --pe 150 10
#+end_src
*** KILL ReSeq : ne gère pas des exons trop petits :reseq:
CLOSED: [2023-04-30 Sun 09:52] SCHEDULED: <2023-04-29 Sat>
#+begin_quote
Can I simulate exome sequencing? Yes. You need to use a reference that only contains the exons as individual scaffolds. Using --refBiasFile you can specify the coverage of individual exons. To simulate intron contamination you can add the whole reference to the reference containing the exons and strongly reduce the coverage for these scaffolds using --refBiasFile.
#+end_quote
**** KILL Fasta pour exons seuls
CLOSED: [2023-04-30 Sun 09:52]
Depuis le GFF
#+begin_src sh :dir ~/code/bisonex/test-reseq
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.gff.gz
#+end_src
#+RESULTS:
#+begin_src sh :dir ~/code/bisonex/test-reseq
gunzip -c GCF_000001405.25_GRCh37.p13_genomic.gff.gz | grep -w "exon" > exons.gff
#+end_src
#+RESULTS:
Attention, discordance entre les version de refseq..
On retélécharge fasata
#+begin_src
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.fna.gz
# Samtools wants bgzip
gunzip GCF_000001405.25_GRCh37.p13_genomic.fna.gz
samtools faidx GCF_000001405.25_GRCh37.p13_genomic.fna
#+end_src
#+begin_src sh :dir ~/code/bisonex/test-reseq
bedtools getfasta -fi GCF_000001405.25_GRCh37.p13_genomic.fna -bed exons.gff -fo exons.fa
#+end_src
A tester avec un profile déjà fait :
https://github.com/schmeing/ReSeq-profiles/tree/master/profiles
On cherche l'exons qui nous intéresse
NC_000001.11 g.153817496 A>T
N'y est pas ??
On test sur les 2 premiers
#+begin_src
head exons.fa -n 2 > 2exons.fna
#+end_src
#+begin_src sh
../ReSeq/bin/reseq illuminaPE -j 32 -R exons.fa -s Ec-Hi2000-TruSeq.reseq --ipfIterations 0 -1 reseq-sim_1.fq reseq_sim_2.fq
#+end_src
#+begin_quote
error: All reference sequences are too short for simulating. They should have at least 1991 bases
#+end_quote
#+begin_src sh
grep '^>NC_000001.10' exons.fa | sed 's/:/,/;s/-/,/;s/^>//' > exons.csv
#+end_src
*** DONE ART : fonctionne très mal en targeted
CLOSED: [2023-04-30 Sun 11:49]
**** DONE Génération de reads
CLOSED: [2023-04-30 Sun 11:49]
***** DONE Avec seulement les exons en séquence
CLOSED: [2023-04-30 Sun 10:24]
head -n6 exons.fa | save three-exons.fna
../art_bin_MountRainier/art_illumina -ss HS25 -i three-exons.fna -o ./paired_end_com -l 150 -f 10 -p -m 500 -s 10 -sam
Le sam n'est pas visible sur igv mais si on aligne avec bwa mem, on a quelques reads
***** DONE Extraire une zone de capture dans le fasta
CLOSED: [2023-04-30 Sun 11:49]
NC_000001.11 g.153817496 A>T
****** DONE Essai 1: ne dépasse pas la zone
CLOSED: [2023-04-30 Sun 10:49]
#+begin_src sh :dir ~/code/bisonex/test-art :results silent
echo -e "NC_000001.11\t153817371\t153817542" > gatad2b-exon6.bed
bedtools getfasta -fi ../test-simuscop/genomeRef.fna -bed gatad2b-exon6.bed -fo gatad2b-exon6.fa
#+end_src
-ss HS25 : nom du profile illumina
-l 150 : reads de 150
-f 10 : coverage de 10
-p : paired end
-m 500 : longueur moyenne des fragment d'ADN
-s 10 : déviation standard
#+begin_src sh :dir ~/code/bisonex/test-art :results silent
../art_bin_MountRainier/art_illumina -ss HS25 -i gatad2b-exon6.fa -o ./gatad2b -l 150 -f 100 -p -m 500 -s 10
#+end_src
#+begin_src sh :dir ~/code/bisonex/test-art :results silent
bwa mem ../test-simuscop/bwa/genomeRef gatad2b1.fq gatad2b2.fq | samtools sort -o gatad2b.b
am
samtools index gatad2b.bam
#+end_src
#+attr_html: :width 800px
[[./art-capture-1.png]]
****** Avec offset
50bp idem
NC_000001.11 153817371 153817542 -> NC_000001.11 153817321 153817592
On essaie 1000
NC_000001.11 153817371 153817542 -> NC_000001.11 153816371 153818542 ->
#+begin_src sh :dir ~/code/bisonex/test-art :results silent
echo -e "NC_000001.11\t153816371\t153818542" > gatad2b-exon6-offset.bed
bedtools getfasta -fi ../test-simuscop/genomeRef.fna -bed gatad2b-exon6-offset.bed -fo gatad2b-exon6-offset.fa
../art_bin_MountRainier/art_illumina -ss HS25 -i gatad2b-exon6-offset.fa -o ./gatad2b -l 150 -f 100 -p -m 500 -s 10
bwa mem ../test-simuscop/bwa/genomeRef gatad2b1.fq gatad2b2.fq | samtools sort -o gatad2b.bam
samtools index gatad2b.bam
#+end_src
mieux mais trop large
#+attr_html: :width 800px
[[./art-exon6-offset1000.png]]
Sur les vraies données, on a une large de 500bp environ
#+attr_html: :width 800px
[[./illumina-ref-exon6.png]]
Ici l'exon fait 250bp donc on rajouter 125bp de chaque côté
NC_000001.11 153817371 153817542
NC_000001.11 153817246 153817667
Résulats incohérents :on a 2 colonnes séparées !
Essai avec +200bp de chaque côt
NC_000001.11 153817371 153817542
NC_000001.11 153817171 153817742
#+begin_src sh :dir ~/code/bisonex/test-art :results silent
echo -e "NC_000001.11\t153817171\t153817742"> gatad2b-exon6-offset.bed
bedtools getfasta -fi ../test-simuscop/genomeRef.fna -bed gatad2b-exon6-offset.bed -fo gatad2b-exon6-offset.fa
../art_bin_MountRainier/art_illumina -ss HS25 -i gatad2b-exon6-offset.fa -o ./gatad2b -l 150 -f 100 -p -m 500 -s 50
bwa mem ../test-simuscop/bwa/genomeRef gatad2b1.fq gatad2b2.fq | samtools sort -o gatad2b.bam
samtools index gatad2b.bam
#+end_src
En changeant la longueur des reads
#+begin_src sh :dir ~/code/bisonex/test-art :results silent
echo -e "NC_000001.11\t153817171\t153817742"> gatad2b-exon6-offset.bed
bedtools getfasta -fi ../test-simuscop/genomeRef.fna -bed gatad2b-exon6-offset.bed -fo gatad2b-exon6-offset.fa
../art_bin_MountRainier/art_illumina -ss HS25 -i gatad2b-exon6-offset.fa -o ./gatad2b -l 126 -f 100 -p -m 500 -s 50
bwa mem ../test-simuscop/bwa/genomeRef gatad2b1.fq gatad2b2.fq | samtools sort -o gatad2b.bam
samtools index gatad2b.bam
#+end_src
Idem, fait 2 "tas" séparés de chaque côté de l'exon
+/- 100
#+begin_src sh :dir ~/code/bisonex/test-art :results silent
echo -e "NC_000001.11\t153817271\t153817642" > gatad2b-exon6-offset.bed
bedtools getfasta -fi ../test-simuscop/genomeRef.fna -bed gatad2b-exon6-offset.bed -fo gatad2b-exon6-offset.fa
../art_bin_MountRainier/art_illumina -ss HS25 -i gatad2b-exon6-offset.fa -o ./gatad2b -l 125 -f 100 -p -m 500 -s 30
bwa mem ../test-simuscop/bwa/genomeRef gatad2b1.fq gatad2b2.fq | samtools sort -o gatad2b.bam
samtools index gatad2b.bam
#+end_src
*** NGSNG
https://github.com/RAHenriksen/NGSNGS
**** Essai avec fasta
#+begin_src sh :dir ~/code/bisonex/test-ngsngs :results silent
echo -e "NC_000001.11\t153817371\t153817542" > gatad2b-exon6.bed
bedtools getfasta -fi ../test-simuscop/genomeRef.fna -bed gatad2b-exon6.bed -fo gatad2b-exon6.fa
#+end_src
-ss HS25 : nom du profile illumina
-l 150 : reads de 150
-f 10 : coverage de 10
-p : paired end
-m 500 : longueur moyenne des fragment d'ADN
-s 10 : déviation standard
#+begin_src sh :dir ~/code/bisonex/test-art :results silent
../art_bin_MountRainier/art_illumina -ss HS25 -i gatad2b-exon6.fa -o ./gatad2b -l 150 -f 100 -p -m 500 -s 10
#+end_src
#+begin_src sh :dir ~/code/bisonex/test-art :results silent
bwa mem ../test-simuscop/bwa/genomeRef gatad2b1.fq gatad2b2.fq | samtools sort -o gatad2b.b
am
samtools index gatad2b.bam
#+end_src
*** Divers
**** DONE Vérifier nombre de reads fastq - bam
CLOSED: [2022-10-09 Sun 22:31]
* DONE Plot : ashkenazim trio
CLOSED: [2023-04-18 Tue 21:28] SCHEDULED: <2023-04-16 Sun>
/Entered on/ [2023-04-16 Sun 17:29]
26/42
- Bam après aplybqsr: base qualità 35 donc ok
chr15 également à 89318565, variant retrouvé à 25/33 avec basequal de 37
Sans oublier de charger les instructions avx
#+begin_src sh
module load gcc@11.3.0/gcc-12.1.0
#+end_src
On coupe le .bam par chromosome pour débugger (sur le mesocentre)
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/simuscop-centogene-200x/cento/testing :results silent
ln -s ../preprocessing/applybqsr/cento.bam .
ln -s ../preprocessing/recalibrated/cento.bam.bai .
ln -s /Work/Projects/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz .
ln -s /Work/Projects/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz.tbi .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.dict .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna.fai .
#+end_src
On doit lancer à la main (org-mode ne connait pas le chemin de samtools)
samtools view -b cento.bam NC_000015.10 > cento_chr15.bam
samtools index cento_chr15.bam
Puis on se restreint au chronmosome 15
samtools faidx genomeRef.fna NC_000015.10 > genomeRef_chr15.fa
samtools faidx genomeRef_chr15.fa
gatk CreateSequenceDictionary -R genomeRef_chr15.fa -O genomeRef_chr15.dict
On restreint au chromosome 15 avec l'option -L (dure = 1min)
gatk --java-options "-Xmx3072M" HaplotypeCaller --input cento_chr15.bam \
--output test.vcf.gz --reference genomeRef.fna --dbsnp dbSNP.gz --tmp-dir . --max-mnp-distance 2 -L NC_000015.10
******* DONE Supprimer --max-mnp-distance = 2: idem
CLOSED: [2023-04-30 Sun 15:42]
******* DONE --debug &> run.log : Non appelé...
CLOSED: [2023-04-30 Sun 15:52]
******* DONE --linked-de-bruijn-graph: idem
CLOSED: [2023-04-30 Sun 15:55]
******* DONE --recover-all-dangling-branches
CLOSED: [2023-04-30 Sun 16:01]
******* DONE --min-pruning 0 : plus mais pas celui là
CLOSED: [2023-04-30 Sun 15:59]
******* DONE --bam-output
CLOSED: [2023-04-30 Sun 16:50]
******** DONE : rien !
CLOSED: [2023-04-30 Sun 16:08]
******** DONE + --recover-all-dangling-branches : rien !
CLOSED: [2023-04-30 Sun 16:08]
******* DONE Données filtrées ? apparement non
CLOSED: [2023-04-30 Sun 16:41]
183122 read(s) filtered by: MappingQualityReadFilter
3674 read(s) filtered by: NotDuplicateReadFilter
******** DONE --disable-read-filter MappingQualityReadFilter: idem
CLOSED: [2023-04-30 Sun 16:34]
On a bien - 0 read(s) filtered by: MappingQualityAvailableReadFilter
******** DONE --disable-read-filter NotDuplicateReadFilter: idem
CLOSED: [2023-04-30 Sun 16:40]
******* DONE Essayer freebayes : idem
CLOSED: [2023-04-30 Sun 16:22]
freebayes -f genomeRef.fna -r NC_000015.10 cento_chr15.bam > freebayes-test-chr15.vcf
******* DONE Avec toutes les options : idem
--linked-de-bruijn-graph --recover-all-dangling-branches --min-pruning 0 --bam-output debug.bam
CLOSED: [2023-04-30 Sun 16:50]
******* DONE Vérifier qu'on regarde le même bam : oui
CLOSED: [2023-04-30 Sun 16:50]
******* DONE Désactiver dbSNP : idem
CLOSED: [2023-04-30 Sun 16:52]
******* DONE Changer kmer size : idem
CLOSED: [2023-04-30 Sun 16:56]
par exemple[[https://gatk.broadinstitute.org/hc/en-us/community/posts/360075653152-REAL-Variant-not-called-by-HaplotypeCaller][forum gatk]] --kmer-size 18 --kmer-size 22
******* TODO --adaptive-pruning true
****** TODO Comparer VCF avec vcfeval
SCHEDULED: <2023-05-01 Mon>
***** TODO Après annotation
SCHEDULED: <2023-04-28 Fri>
***** TODO Après filtre annotation
*** KILL NEAT : trop lent :neat:
CLOSED: [2023-04-29 Sat 22:06]
**** KILL Génération fastq sur exno 5 GATAD2B
CLOSED: [2023-04-29 Sat 22:06]
Trop lent : pour 1 exon : 1500 secondes !
#+begin_src sh
samtools faidx genomeRef.fna NC_000001.11 | save -f genomeRef_chr1.fna
python gen_reads.py -r ../test-simuscop/genomeRef_chr1.fna -o lol -tr ../test-simuscop/gatad2b-exon6.bed -R 147 --pe 150 10
#+end_src
*** KILL ReSeq : exome avec exons comme fasta mais ne gère pas des exons trop petits :reseq:
CLOSED: [2023-04-30 Sun 19:44] SCHEDULED: <2023-04-29 Sat>
#+begin_quote
Can I simulate exome sequencing? Yes. You need to use a reference that only contains the exons as individual scaffolds. Using --refBiasFile you can specify the coverage of individual exons. To simulate intron contamination you can add the whole reference to the reference containing the exons and strongly reduce the coverage for these scaffolds using --refBiasFile.
#+end_quote
Par contre, rapide
**** DONE Fasta pour exons seuls
CLOSED: [2023-04-30 Sun 19:25]
Depuis le GFF
#+begin_src sh :dir ~/code/bisonex/test-reseq :results silent
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz
#+end_src
#+begin_src sh :dir ~/code/bisonex/test-reseq :results silent
gunzip -c GCF_000001405.39_GRCh38.p13_genomic.gff.gz | grep -w "exon" > exons.gff
#+end_src
On génère les exons
#+begin_src sh :dir ~/code/bisonex/test-reseq
bedtools getfasta -fi ../test-simuscop/genomeRef.fna -bed exons.gff -fo exons.fna
#+end_src
A tester avec un profile déjà fait :
https://github.com/schmeing/ReSeq-profiles/tree/master/profiles
On cherche l'exons qui nous intéresse
NC_000001.11 g.153817496 A>T
N'y est pas ??
***** DONE On test sur les 2 premiers : exec
CLOSED: [2023-04-30 Sun 18:39]
#+begin_src
head exons.fa -n 2 > 2exons.fna
#+end_src
#+begin_src sh
../ReSeq/bin/reseq illuminaPE -j 32 -R exons.fa -s Ec-Hi2000-TruSeq.reseq --ipfIterations 0 -1 reseq-sim_1.fq reseq_sim_2.fq
#+end_src
#+begin_quote
error: All reference sequences are too short for simulating. They should have at least 1991 bases
#+end_quote
#+begin_src sh
grep '^>NC_000001.10' exons.fa | sed 's/:/,/;s/-/,/;s/^>//' > exons.csv
#+end_src
***** DONE Sur 200 premiers exons du chr1
CLOSED: [2023-04-30 Sun 19:17]
#+begin_src sh :dir ~/code/bisonex/test-reseq :results silent
head -n200 exons.fna > exons-200.fna
bwa index exons-200.fna
#+end_src
Simulation avec 30x
#+begin_src sh :dir ~/code/bisonex/test-reseq :results silent
../ReSeq/bin/reseq illuminaPE -R exons-200.fna -s Ec-Hi2000-TruSeq.reseq --ipfIterations 0 -1 reseq1.fq -2 reseq2.fq -c 30
#+end_src
Attention, pour l'alignement, il faut le nfa complet ! Sinon erreur du type
Erreurs:::sam_hdr_create] Duplicated sequence "NC_000001.10:762970-763155" in file "-"
Et pas de bam avec
samtools sort: failed to change sort order header to 'coordinate'
#+begin_src
bwa mem ../test-simuscop/bwa/genomeRef.fna reseq1.fq reseq2.fq | samtools sort -o reseq.bam
#+end_src
Manque des exons et l'allure ne correspond pas...
***** DONE Utiliser le fichier de capture : exons trop petits
CLOSED: [2023-04-30 Sun 19:25]
Comme pour ART
Trop court avec
echo -e "NC_000001.11\t153817371\t153817542" > gatad2b-exon6.bed
Donc on ajoute 1000 de chaque côté
#+begin_src sh :dir ~/code/bisonex/test-reseq :results silent
echo -e "NC_000001.11\t153816371\t153818542" > gatad2b-exon6.bed
bedtools getfasta -fi ../test-simuscop/genomeRef.fna -bed gatad2b-exon6.bed -fo gatad2b-exon6.fna
bwa index gatad2b-exon6.bed
../ReSeq/bin/reseq illuminaPE -R gatad2b-exon6.fna -s Ec-Hi2000-TruSeq.reseq --ipfIterations 0 -1 reseq1.fq -2 reseq2.fq -c 30
bwa mem ../test-simuscop/bwa/genomeRef.fna reseq1.fq reseq2.fq | samtools sort -o reseq.bam
samtools index reseq.bam
#+end_src
**** KILL Sur le chromosome 15 puis trier à la main sur les zones de capture ?
CLOSED: [2023-04-30 Sun 19:44]
#+begin_src sh :dir ~/code/bisonex/test-reseq :results silent
samtools faidx ../test-simuscop/genomeRef.fna NC_000015.10 > chr15.fna
../ReSeq/bin/reseq illuminaPE -R chr15.fna -s Ec-Hi2000-TruSeq.reseq --ipfIterations 0 -1 reseq1.fq -2 reseq2.fq -c 30
#+end_src
*** DONE ART : fonctionne très mal en targeted
CLOSED: [2023-04-30 Sun 11:49]
**** DONE Génération de reads
CLOSED: [2023-04-30 Sun 11:49]
***** DONE Avec seulement les exons en séquence
CLOSED: [2023-04-30 Sun 10:24]
head -n6 exons.fa | save three-exons.fna
../art_bin_MountRainier/art_illumina -ss HS25 -i three-exons.fna -o ./paired_end_com -l 150 -f 10 -p -m 500 -s 10 -sam
Le sam n'est pas visible sur igv mais si on aligne avec bwa mem, on a quelques reads
***** DONE Extraire une zone de capture dans le fasta
CLOSED: [2023-04-30 Sun 11:49]
NC_000001.11 g.153817496 A>T
****** DONE Essai 1: ne dépasse pas la zone
CLOSED: [2023-04-30 Sun 10:49]
#+begin_src sh :dir ~/code/bisonex/test-art :results silent
echo -e "NC_000001.11\t153817371\t153817542" > gatad2b-exon6.bed
bedtools getfasta -fi ../test-simuscop/genomeRef.fna -bed gatad2b-exon6.bed -fo gatad2b-exon6.fa
#+end_src
-ss HS25 : nom du profile illumina
-l 150 : reads de 150
-f 10 : coverage de 10
-p : paired end
-m 500 : longueur moyenne des fragment d'ADN
-s 10 : déviation standard
#+begin_src sh :dir ~/code/bisonex/test-art :results silent
../art_bin_MountRainier/art_illumina -ss HS25 -i gatad2b-exon6.fa -o ./gatad2b -l 150 -f 100 -p -m 500 -s 10
#+end_src
#+begin_src sh :dir ~/code/bisonex/test-art :results silent
bwa mem ../test-simuscop/bwa/genomeRef gatad2b1.fq gatad2b2.fq | samtools sort -o gatad2b.b
am
samtools index gatad2b.bam
#+end_src
#+attr_html: :width 800px
[[./art-capture-1.png]]
****** Avec offset
50bp idem
NC_000001.11 153817371 153817542 -> NC_000001.11 153817321 153817592
On essaie 1000
NC_000001.11 153817371 153817542 -> NC_000001.11 153816371 153818542 ->
#+begin_src sh :dir ~/code/bisonex/test-art :results silent
echo -e "NC_000001.11\t153816371\t153818542" > gatad2b-exon6-offset.bed
bedtools getfasta -fi ../test-simuscop/genomeRef.fna -bed gatad2b-exon6-offset.bed -fo gatad2b-exon6-offset.fa
../art_bin_MountRainier/art_illumina -ss HS25 -i gatad2b-exon6-offset.fa -o ./gatad2b -l 150 -f 100 -p -m 500 -s 10
bwa mem ../test-simuscop/bwa/genomeRef gatad2b1.fq gatad2b2.fq | samtools sort -o gatad2b.bam
samtools index gatad2b.bam
#+end_src
mieux mais trop large
#+attr_html: :width 800px
[[./art-exon6-offset1000.png]]
Sur les vraies données, on a une large de 500bp environ
#+attr_html: :width 800px
[[./illumina-ref-exon6.png]]
Ici l'exon fait 250bp donc on rajouter 125bp de chaque côté
NC_000001.11 153817371 153817542
NC_000001.11 153817246 153817667
Résulats incohérents :on a 2 colonnes séparées !
Essai avec +200bp de chaque côt
NC_000001.11 153817371 153817542
NC_000001.11 153817171 153817742
#+begin_src sh :dir ~/code/bisonex/test-art :results silent
echo -e "NC_000001.11\t153817171\t153817742"> gatad2b-exon6-offset.bed
bedtools getfasta -fi ../test-simuscop/genomeRef.fna -bed gatad2b-exon6-offset.bed -fo gatad2b-exon6-offset.fa
../art_bin_MountRainier/art_illumina -ss HS25 -i gatad2b-exon6-offset.fa -o ./gatad2b -l 150 -f 100 -p -m 500 -s 50
bwa mem ../test-simuscop/bwa/genomeRef gatad2b1.fq gatad2b2.fq | samtools sort -o gatad2b.bam
samtools index gatad2b.bam
#+end_src
En changeant la longueur des reads
#+begin_src sh :dir ~/code/bisonex/test-art :results silent
echo -e "NC_000001.11\t153817171\t153817742"> gatad2b-exon6-offset.bed
bedtools getfasta -fi ../test-simuscop/genomeRef.fna -bed gatad2b-exon6-offset.bed -fo gatad2b-exon6-offset.fa
../art_bin_MountRainier/art_illumina -ss HS25 -i gatad2b-exon6-offset.fa -o ./gatad2b -l 126 -f 100 -p -m 500 -s 50
bwa mem ../test-simuscop/bwa/genomeRef gatad2b1.fq gatad2b2.fq | samtools sort -o gatad2b.bam
samtools index gatad2b.bam
#+end_src
Idem, fait 2 "tas" séparés de chaque côté de l'exon
+/- 100
#+begin_src sh :dir ~/code/bisonex/test-art :results silent
echo -e "NC_000001.11\t153817271\t153817642" > gatad2b-exon6-offset.bed
bedtools getfasta -fi ../test-simuscop/genomeRef.fna -bed gatad2b-exon6-offset.bed -fo gatad2b-exon6-offset.fa
../art_bin_MountRainier/art_illumina -ss HS25 -i gatad2b-exon6-offset.fa -o ./gatad2b -l 125 -f 100 -p -m 500 -s 30
bwa mem ../test-simuscop/bwa/genomeRef gatad2b1.fq gatad2b2.fq | samtools sort -o gatad2b.bam
samtools index gatad2b.bam
#+end_src
*** KILL NGSNG : plante
CLOSED: [2023-05-01 Mon 09:27]
https://github.com/RAHenriksen/NGSNGS
**** KILL Essai avec fasta
CLOSED: [2023-05-01 Mon 09:27]
#+begin_src sh :dir ~/code/bisonex/test-ngsngs :results silent
echo -e "NC_000001.11\t153817371\t153817542" > gatad2b-exon6.bed
bedtools getfasta -fi ../test-simuscop/genomeRef.fna -bed gatad2b-exon6.bed -fo gatad2b-exon6.fa
#+end_src
-ss HS25 : nom du profile illumina
-l 150 : reads de 150
-f 10 : coverage de 10
-p : paired end
-m 500 : longueur moyenne des fragment d'ADN
-s 10 : déviation standard
#+begin_src sh :dir ~/code/bisonex/test-art :results silent
../art_bin_MountRainier/art_illumina -ss HS25 -i gatad2b-exon6.fa -o ./gatad2b -l 150 -f 100 -p -m 500 -s 10
#+end_src
#+begin_src sh :dir ~/code/bisonex/test-art :results silent
bwa mem ../test-simuscop/bwa/genomeRef gatad2b1.fq gatad2b2.fq | samtools sort -o gatad2b.b
am
samtools index gatad2b.bam
#+end_src
*** TODO Insérer variants dans patients centogène :generate:
SCHEDULED: <2023-05-01 Mon>
**** TODO SNV
SCHEDULED: <2023-05-01 Mon>
*** Divers
**** DONE Vérifier nombre de reads fastq - bam
CLOSED: [2022-10-09 Sun 22:31]
* DONE Plot : ashkenazim trio
CLOSED: [2023-04-18 Tue 21:28] SCHEDULED: <2023-04-16 Sun>
/Entered on/ [2023-04-16 Sun 17:29]