apraga/org - Change BMKGT3OO2GYX72RBA33J4KG2TRWKTLM2DAV4LITWZB6T3KFGIY3AC

Comparison simuscop 200x: promising

Created by Alexis Praga on May 4, 2023

BMKGT3OO2GYX72RBA33J4KG2TRWKTLM2DAV4LITWZB6T3KFGIY3AC

Dependencies

In channels

main

Change contents

Replacement in projects/bisonex.org at line 41 [4.35]

B:BD[3.23143] → [3.23143:23434]

B:BD[3.23434] → [5.361301:369202]

   0           0
   7 │ NC_000005.10    1258495  g.1258495G>A    snv          heterozygous  G          A                 0         0           0
   8 │ NC_000012.12    2449086  g.2449086C>G    snv          heterozygous  C          G                 0         0           0
***** TODO After HaplotypeCaller
SCHEDULED: <2023-04-28 Fri>
****** KILL 20x
CLOSED: [2023-04-29 Sat 15:39]
183 missing out of 766
[[file:~/recherche/bisonex/simuscop/checkVCF.jl][checkVCF.jl]]
#+begin_src julia
@subset leftjoin(d2, dHaplo2, on=:genomic) ismissing.(:Column1)
#+end_src
Depth problem?
E.g. chr13, count at 101081606
NC_000011.10   16014966  g.16014966G>A
1 read out of 11 for the alternate allele
On the reference patient, 202 reads!
This one is not in the capture file (nor in the bam!)
e.g. NC_000015.10   74343027  g.74343027C>T
The others should be recoverable...
Check the number of reads on 63003856
Also check the model parameterization
****** TODO [#B] 200x
SCHEDULED: <2023-04-30 Sun>
120 missing (99 without duplicates)!
Checking in IGV (vcf + bam after alignment):
******* snv NC_000015.10   74343027
- nothing called
- not a repeated region
- base quality (see [[*Phred score][Phred score]]) is 37, so ok
- variant found at 26/42
- Bam after ApplyBQSR: base quality 35, so ok
chr15 also at 89318565: variant found at 25/33 with a base quality of 37
Do not forget to load the AVX instructions
#+begin_src sh
module load gcc@11.3.0/gcc-12.1.0
#+end_src
Split the .bam by chromosome for debugging (on the mesocentre)
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/simuscop-centogene-200x/cento/testing :results silent
ln -s ../preprocessing/applybqsr/cento.bam .
ln -s ../preprocessing/recalibrated/cento.bam.bai .
ln -s /Work/Projects/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz .
ln -s /Work/Projects/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz.tbi .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.dict .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna.fai .
#+end_src
This has to be run by hand (org-mode does not know the path to samtools)
samtools view -b cento.bam NC_000015.10 > cento_chr15.bam
samtools index cento_chr15.bam
Then restrict the reference to chromosome 15
samtools faidx genomeRef.fna NC_000015.10 > genomeRef_chr15.fa
samtools faidx genomeRef_chr15.fa
gatk CreateSequenceDictionary -R genomeRef_chr15.fa -O genomeRef_chr15.dict
Restrict calling to chromosome 15 with the -L option (takes about 1 min)
gatk --java-options "-Xmx3072M" HaplotypeCaller --input cento_chr15.bam \
    --output test.vcf.gz --reference genomeRef.fna --dbsnp dbSNP.gz --tmp-dir . --max-mnp-distance 2 -L NC_000015.10
******* DONE HaplotypeCaller tutorial
CLOSED: [2023-05-01 Mon 19:58]
Procedure: https://gatk.broadinstitute.org/hc/en-us/articles/360043491652-When-HaplotypeCaller-and-Mutect2-do-not-call-an-expected-variant
******** DONE Remove --max-mnp-distance 2: same result
CLOSED: [2023-04-30 Sun 15:42]
******** DONE --debug &> run.log: not called...
CLOSED: [2023-04-30 Sun 15:52]
******** DONE --linked-de-bruijn-graph: same result
CLOSED: [2023-04-30 Sun 15:55]
******** DONE --recover-all-dangling-branches
CLOSED: [2023-04-30 Sun 16:01]
******** DONE --min-pruning 0: more variants called, but not this one
CLOSED: [2023-04-30 Sun 15:59]
******** DONE --bam-output
CLOSED: [2023-04-30 Sun 16:50]
********* DONE nothing!
CLOSED: [2023-04-30 Sun 16:08]
********* DONE + --recover-all-dangling-branches: nothing!
CLOSED: [2023-04-30 Sun 16:08]
******** DONE Filtered data? Apparently not
CLOSED: [2023-04-30 Sun 16:41]
183122 read(s) filtered by: MappingQualityReadFilter
3674 read(s) filtered by: NotDuplicateReadFilter
********* DONE --disable-read-filter MappingQualityReadFilter: same result
CLOSED: [2023-04-30 Sun 16:34]
We do get: 0 read(s) filtered by: MappingQualityAvailableReadFilter
********* DONE --disable-read-filter NotDuplicateReadFilter: same result
CLOSED: [2023-04-30 Sun 16:40]
******** DONE Try freebayes: same result
CLOSED: [2023-04-30 Sun 16:22]
freebayes -f genomeRef.fna -r NC_000015.10 cento_chr15.bam > freebayes-test-chr15.vcf
******** DONE With all the options: same result
CLOSED: [2023-04-30 Sun 16:50]
--linked-de-bruijn-graph --recover-all-dangling-branches --min-pruning 0 --bam-output debug.bam
******** DONE Check that we are looking at the same bam: yes
CLOSED: [2023-04-30 Sun 16:50]
******** DONE Disable dbSNP: same result
CLOSED: [2023-04-30 Sun 16:52]
******** DONE Change kmer size: same result
CLOSED: [2023-04-30 Sun 16:56]
For example ([[https://gatk.broadinstitute.org/hc/en-us/community/posts/360075653152-REAL-Variant-not-called-by-HaplotypeCaller][GATK forum]]): --kmer-size 18 --kmer-size 22
******** DONE --adaptive-pruning true
CLOSED: [2023-05-01 Mon 19:57]
******* DONE Mapping quality: it is 0!
CLOSED: [2023-05-01 Mon 19:58]
****** TODO Compare VCF with vcfeval
SCHEDULED: <2023-05-01 Mon>
***** TODO After annotation
SCHEDULED: <2023-04-28 Fri>
***** TODO After annotation filter
*** KILL NEAT: too slow :neat:
CLOSED: [2023-04-29 Sat 22:06]
**** KILL Fastq generation on exon 5 of GATAD2B
CLOSED: [2023-04-29 Sat 22:06]
Too slow: 1500 seconds for a single exon!
#+begin_src sh
samtools faidx genomeRef.fna NC_000001.11 > genomeRef_chr1.fna
python gen_reads.py  -r ../test-simuscop/genomeRef_chr1.fna -o lol  -tr ../test-simuscop/gatad2b-exon6.bed  -R 147 --pe 150 10
#+end_src
*** KILL ReSeq: exome with exons as fasta, but cannot handle exons that are too short :reseq:
CLOSED: [2023-04-30 Sun 19:44] SCHEDULED: <2023-04-29 Sat>
#+begin_quote
Can I simulate exome sequencing? Yes. You need to use a reference that only contains the exons as individual scaffolds. Using --refBiasFile you can specify the coverage of individual exons. To simulate intron contamination you can add the whole reference to the reference containing the exons and strongly reduce the coverage for these scaffolds using --refBiasFile.
#+end_quote
It is fast, though
**** DONE Fasta for exons only
CLOSED: [2023-04-30 Sun 19:25]
From the GFF
#+begin_src sh :dir ~/code/bisonex/test-reseq :results silent
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz
#+end_src
#+begin_src sh :dir ~/code/bisonex/test-reseq  :results silent
gunzip -c GCF_000001405.39_GRCh38.p13_genomic.gff.gz | grep -w "exon" > exons.gff
#+end_src
Generate the exon sequences
#+begin_src sh :dir ~/code/bisonex/test-reseq
bedtools getfasta -fi ../test-simuscop/genomeRef.fna -bed exons.gff -fo exons.fna
#+end_src
To test with an existing profile:
https://github.com/schmeing/ReSeq-profiles/tree/master/profiles
Look for the exon of interest
 NC_000001.11 g.153817496 A>T
It is not there??
***** DONE Test on the first 2: exec
CLOSED: [2023-04-30 Sun 18:39]
#+begin_src sh
head -n 2 exons.fa > 2exons.fna
#+end_src
#+begin_src sh
../ReSeq/bin/reseq illuminaPE -j 32 -R exons.fa -s Ec-Hi2000-TruSeq.reseq --ipfIterations 0 -1 reseq-sim_1.fq -2 reseq_sim_2.fq
#+end_src
#+begin_quote
error: All reference sequences are too short for simulating. They should have at least 1991 bases
#+end_quote
#+begin_src sh
grep '^>NC_000001.10' exons.fa  | sed 's/:/,/;s/-/,/;s/^>//' > exons.csv
#+end_src
***** DONE On the first 200 exons of chr1
CLOSED: [2023-04-30 Sun 19:17]
#+begin_src sh :dir ~/code/bisonex/test-reseq  :results silent
head -n200 exons.fna > exons-200.fna
bwa index exons-200.fna
#+end_src
Simulation at 30x
#+begin_src sh :dir ~/code/bisonex/test-reseq  :results silent
../ReSeq/bin/reseq illuminaPE -R exons-200.fna -s Ec-Hi2000-TruSeq.reseq --ipfIterations 0 -1 reseq1.fq -2 reseq2.fq -c 30
#+end_src
Caution: the alignment needs the complete fna! Otherwise, errors such as
[E::sam_hdr_create] Duplicated sequence "NC_000001.10:762970-763155" in file "-"
And no bam is produced, with
samtools sort: failed to chang

[3.23143]

[5.369202]

   0           0
   7 │ NC_000005.10    1258495  g.1258495G>A    snv          heterozygous  G          A                 0         0           0
   8 │ NC_000012.12    2449086  g.2449086C>G    snv          heterozygous  C          G                 0         0           0
***** TODO After HaplotypeCaller
SCHEDULED: <2023-04-28 Fri>
****** KILL 20x
CLOSED: [2023-04-29 Sat 15:39]
183 missing out of 766
[[file:~/recherche/bisonex/simuscop/checkVCF.jl][checkVCF.jl]]
#+begin_src julia
@subset leftjoin(d2, dHaplo2, on=:genomic) ismissing.(:Column1)
#+end_src
Depth problem?
E.g. chr13, count at 101081606
NC_000011.10   16014966  g.16014966G>A
1 read out of 11 for the alternate allele
On the reference patient, 202 reads!
This one is not in the capture file (nor in the bam!)
e.g. NC_000015.10   74343027  g.74343027C>T
The others should be recoverable...
Check the number of reads on 63003856
Also check the model parameterization
****** TODO [#B] 200x
SCHEDULED: <2023-04-30 Sun>
120 missing (99 without duplicates)!
Checking in IGV (vcf + bam after alignment):
******* snv NC_000015.10   74343027
- nothing called
- not a repeated region
- base quality (see [[*Phred score][Phred score]]) is 37, so ok
- variant found at 26/42
- Bam after ApplyBQSR: base quality 35, so ok
chr15 also at 89318565: variant found at 25/33 with a base quality of 37
Do not forget to load the AVX instructions
#+begin_src sh
module load gcc@11.3.0/gcc-12.1.0
#+end_src
Split the .bam by chromosome for debugging (on the mesocentre)
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/simuscop-centogene-200x/cento/testing :results silent
ln -s ../preprocessing/applybqsr/cento.bam .
ln -s ../preprocessing/recalibrated/cento.bam.bai .
ln -s /Work/Projects/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz .
ln -s /Work/Projects/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz.tbi .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.dict .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna.fai .
#+end_src
This has to be run by hand (org-mode does not know the path to samtools)
samtools view -b cento.bam NC_000015.10 > cento_chr15.bam
samtools index cento_chr15.bam
Then restrict the reference to chromosome 15
samtools faidx genomeRef.fna NC_000015.10 > genomeRef_chr15.fa
samtools faidx genomeRef_chr15.fa
gatk CreateSequenceDictionary -R genomeRef_chr15.fa -O genomeRef_chr15.dict
Restrict calling to chromosome 15 with the -L option (takes about 1 min)
gatk --java-options "-Xmx3072M" HaplotypeCaller --input cento_chr15.bam \
    --output test.vcf.gz --reference genomeRef.fna --dbsnp dbSNP.gz --tmp-dir . --max-mnp-distance 2 -L NC_000015.10
******* DONE HaplotypeCaller tutorial
CLOSED: [2023-05-01 Mon 19:58]
Procedure: https://gatk.broadinstitute.org/hc/en-us/articles/360043491652-When-HaplotypeCaller-and-Mutect2-do-not-call-an-expected-variant
******** DONE Remove --max-mnp-distance 2: same result
CLOSED: [2023-04-30 Sun 15:42]
******** DONE --debug &> run.log: not called...
CLOSED: [2023-04-30 Sun 15:52]
******** DONE --linked-de-bruijn-graph: same result
CLOSED: [2023-04-30 Sun 15:55]
******** DONE --recover-all-dangling-branches
CLOSED: [2023-04-30 Sun 16:01]
******** DONE --min-pruning 0: more variants called, but not this one
CLOSED: [2023-04-30 Sun 15:59]
******** DONE --bam-output
CLOSED: [2023-04-30 Sun 16:50]
********* DONE nothing!
CLOSED: [2023-04-30 Sun 16:08]
********* DONE + --recover-all-dangling-branches: nothing!
CLOSED: [2023-04-30 Sun 16:08]
******** DONE Filtered data? Apparently not
CLOSED: [2023-04-30 Sun 16:41]
183122 read(s) filtered by: MappingQualityReadFilter
3674 read(s) filtered by: NotDuplicateReadFilter
********* DONE --disable-read-filter MappingQualityReadFilter: same result
CLOSED: [2023-04-30 Sun 16:34]
We do get: 0 read(s) filtered by: MappingQualityAvailableReadFilter
********* DONE --disable-read-filter NotDuplicateReadFilter: same result
CLOSED: [2023-04-30 Sun 16:40]
******** DONE Try freebayes: same result
CLOSED: [2023-04-30 Sun 16:22]
freebayes -f genomeRef.fna -r NC_000015.10 cento_chr15.bam > freebayes-test-chr15.vcf
******** DONE With all the options: same result
CLOSED: [2023-04-30 Sun 16:50]
--linked-de-bruijn-graph --recover-all-dangling-branches --min-pruning 0 --bam-output debug.bam
******** DONE Check that we are looking at the same bam: yes
CLOSED: [2023-04-30 Sun 16:50]
******** DONE Disable dbSNP: same result
CLOSED: [2023-04-30 Sun 16:52]
******** DONE Change kmer size: same result
CLOSED: [2023-04-30 Sun 16:56]
For example ([[https://gatk.broadinstitute.org/hc/en-us/community/posts/360075653152-REAL-Variant-not-called-by-HaplotypeCaller][GATK forum]]): --kmer-size 18 --kmer-size 22
******** DONE --adaptive-pruning true
CLOSED: [2023-05-01 Mon 19:57]
******* DONE Mapping quality: it is 0!
CLOSED: [2023-05-01 Mon 19:58]
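The mapping-quality finding can also be checked without IGV by counting MAPQ values in SAM text. A minimal Python sketch, assuming tab-separated SAM records such as those printed by `samtools view`; the function name `mapq_zero_fraction` and the example records are made up for illustration and are not taken from cento.bam.

```python
def mapq_zero_fraction(sam_lines):
    """Return (reads with MAPQ 0, total reads) for SAM text lines."""
    zero = 0
    total = 0
    for line in sam_lines:
        # Skip blank lines and header lines (they start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        mapq = int(fields[4])  # column 5 of a SAM record is MAPQ
        total += 1
        if mapq == 0:
            zero += 1
    return zero, total


# Hypothetical records with the 11 mandatory SAM columns.
example = [
    "@HD\tVN:1.6",
    "r1\t99\tNC_000015.10\t74343000\t0\t4M\t=\t74343100\t104\tACGT\tFFFF",
    "r2\t99\tNC_000015.10\t74343010\t60\t4M\t=\t74343110\t104\tACGT\tFFFF",
    "r3\t99\tNC_000015.10\t74343020\t0\t4M\t=\t74343120\t104\tACGT\tFFFF",
]

zero, total = mapq_zero_fraction(example)
print(f"{zero}/{total} reads have MAPQ 0")
```

On real data the same function could be fed the lines of `samtools view cento_chr15.bam NC_000015.10:74343027-74343027` (for instance through `sys.stdin`), which would show whether the reads covering the uncalled variant all carry a mapping quality of 0.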
****** TODO Compare VCF with vcfeval
SCHEDULED: <2023-05-01 Mon>
Prepare the data in Julia
#+begin_src sh :dir ~/recherche/bisonex/simuscop
julia --project=. toVCF.jl
#+end_src
Then export to the mesocentre
#+begin_src
scp variants_for_vcfeval.tsv.gz* meso:centogene_variants/
#+end_src
#+begin_src
z bis
cd simuscop-200x
rtg vcfeval -b ~/centogene_variants/variants_for_vcfeval.tsv.gz -c cento/variantCalling/haplotypecaller/cento.vcf.gz -o compare-haplotypecaller -t /Work/Groups/bisonex/data/giab/GRCh38/genomeRef.sdf
#+end_src
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
   82.000                540            540         60         45     0.9000       0.9231     0.9114
     None                546            546        329         39     0.6240       0.9333     0.7479
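The precision, sensitivity and F-measure columns above can be recomputed from the counts. A minimal Python sketch, assuming the usual definitions (precision from the call-side true positives and false positives, sensitivity from the baseline-side true positives and false negatives); the function name is hypothetical.

```python
def vcfeval_metrics(tp_baseline, tp_call, false_pos, false_neg):
    """Recompute precision, sensitivity and F-measure from vcfeval counts."""
    precision = tp_call / (tp_call + false_pos)
    sensitivity = tp_baseline / (tp_baseline + false_neg)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return precision, sensitivity, f_measure


# Counts from the two rows of the table above (threshold 82 and None).
for counts in [(540, 540, 60, 45), (546, 546, 329, 39)]:
    p, s, f = vcfeval_metrics(*counts)
    print(f"{p:.4f} {s:.4f} {f:.4f}")
```

Both rows reproduce the reported values. Dropping the score threshold recovers only 6 more true positives while adding 269 false positives, which is why precision falls from 0.90 to 0.62.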
****** DONE Naive method: 549/585
CLOSED: [2023-05-04 Thu 21:57]
Haplotypecaller: Nb reference SNV 692 vs found 585
Variant calling, filter technical: reference SNV 692 vs found 521
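The naive method can be pictured as an exact match on (chromosome, position, ref, alt). A minimal Python sketch under that assumption; it is not the author's checkVCF.jl, the `called` list is hypothetical, and only the three reference variants are taken from these notes.

```python
def naive_compare(reference, called):
    """Exact-match comparison on (chrom, pos, ref, alt) tuples.

    Sets are used, so duplicated reference entries are counted once.
    """
    ref_keys = set(reference)
    call_keys = set(called)
    found = ref_keys & call_keys
    missing = ref_keys - call_keys
    return found, missing


reference = [
    ("NC_000015.10", 74343027, "C", "T"),
    ("NC_000005.10", 1258495, "G", "A"),
    ("NC_000012.12", 2449086, "C", "G"),
    ("NC_000015.10", 74343027, "C", "T"),  # duplicated entry
]
# Hypothetical calls, for illustration only.
called = [
    ("NC_000005.10", 1258495, "G", "A"),
    ("NC_000012.12", 2449086, "C", "G"),
]

found, missing = naive_compare(reference, called)
print(len(found), "found;", sorted(missing), "missing")
```

Unlike vcfeval, an exact match of this kind does not reconcile different representations of the same variant, so the two methods can disagree on the counts.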
***** TODO Before annotation
SCHEDULED: <2023-04-28 Fri>
#+begin_src
cd cento/variantCalling
bgzip filter-technical.vcf
tabix -p vcf filter-technical.vcf.gz -f
#+end_src
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
   12.000                519            519         55         66     0.9042       0.8872     0.8956
     None                519            519         55         66     0.9042       0.8872     0.8956
****** DONE Naive method: 521/585
CLOSED: [2023-05-04 Thu 21:57]
Haplotypecaller: Nb reference SNV 692 vs found 585
Variant calling, filter technical: reference SNV 692 vs found 521
***** TODO After annotation filter
****** DONE Naive method: 493/585
CLOSED: [2023-05-04 Thu 22:09]
****** TODO vcfeval
 cd cento/annotation/
 bgzip postvep-filter.vcf
 tabix postvep-filter.vcf.gz
 cd ../..
 rtg vcfeval -b ~/centogene_variants/variants_for_vcfeval.tsv.gz -c cento/annotation/postvep-filter.vcf.gz  -o compare-vepfilter -t /Work/Groups/bisonex/data/giab/GRCh38/genomeRef.sdf
 Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
   12.000                491            491         50         94     0.9076       0.8393     0.8721
     None                491            491         50         94     0.9076       0.8393     0.8721
*** KILL NEAT: too slow :neat:
CLOSED: [2023-04-29 Sat 22:06]
**** KILL Fastq generation on exon 5 of GATAD2B
CLOSED: [2023-04-29 Sat 22:06]
Too slow: 1500 seconds for a single exon!
#+begin_src sh
samtools faidx genomeRef.fna NC_000001.11 > genomeRef_chr1.fna
python gen_reads.py  -r ../test-simuscop/genomeRef_chr1.fna -o lol  -tr ../test-simuscop/gatad2b-exon6.bed  -R 147 --pe 150 10
#+end_src
*** KILL ReSeq: exome with exons as fasta, but cannot handle exons that are too short :reseq:
CLOSED: [2023-04-30 Sun 19:44] SCHEDULED: <2023-04-29 Sat>
#+begin_quote
Can I simulate exome sequencing? Yes. You need to use a reference that only contains the exons as individual scaffolds. Using --refBiasFile you can specify the coverage of individual exons. To simulate intron contamination you can add the whole reference to the reference containing the exons and strongly reduce the coverage for these scaffolds using --refBiasFile.
#+end_quote
It is fast, though
**** DONE Fasta for exons only
CLOSED: [2023-04-30 Sun 19:25]
From the GFF
#+begin_src sh :dir ~/code/bisonex/test-reseq :results silent
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz
#+end_src
#+begin_src sh :dir ~/code/bisonex/test-reseq  :results silent
gunzip -c GCF_000001405.39_GRCh38.p13_genomic.gff.gz | grep -w "exon" > exons.gff
#+end_src
Generate the exon sequences
#+begin_src sh :dir ~/code/bisonex/test-reseq
bedtools getfasta -fi ../test-simuscop/genomeRef.fna -bed exons.gff -fo exons.fna
#+end_src
To test with an existing profile:
https://github.com/schmeing/ReSeq-profiles/tree/master/profiles
Look for the exon of interest
 NC_000001.11 g.153817496 A>T
It is not there??
***** DONE Test on the first 2: exec
CLOSED: [2023-04-30 Sun 18:39]
#+begin_src sh
head -n 2 exons.fa > 2exons.fna
#+end_src
#+begin_src sh
../ReSeq/bin/reseq illuminaPE -j 32 -R exons.fa -s Ec-Hi2000-TruSeq.reseq --ipfIterations 0 -1 reseq-sim_1.fq -2 reseq_sim_2.fq
#+end_src
#+begin_quote
error: All reference sequences are too short for simulating. They should have at least 1991 bases
#+end_quote
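To see how many exon sequences would survive such a length requirement, the fasta can be filtered by sequence length. A minimal Python sketch; the function names and the toy records are hypothetical, and the 1991-base threshold is simply the value quoted in the error message for this profile.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA text lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_long_sequences(records, min_length):
    """Keep only the records whose sequence has at least min_length bases."""
    return [(h, s) for h, s in records if len(s) >= min_length]


# Toy records for illustration; real input would be the lines of exons.fna.
example = [">exonA:0-8", "ACGTACGT", ">exonB:10-14", "ACGT"]
kept = keep_long_sequences(read_fasta(example), min_length=6)
print([header for header, _ in kept])
```

With `min_length=1991` on the real exon fasta, most records would probably be dropped, since typical exons are far shorter than that, which is consistent with abandoning ReSeq for this use.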
#+begin_src sh
grep '^>NC_000001.10' exons.fa  | sed 's/:/,/;s/-/,/;s/^>//' > exons.csv
#+end_src
***** DONE On the first 200 exons of chr1
CLOSED: [2023-04-30 Sun 19:17]
#+begin_src sh :dir ~/code/bisonex/test-reseq  :results silent
head -n200 exons.fna > exons-200.fna
bwa index exons-200.fna
#+end_src
Simulation at 30x
#+begin_src sh :dir ~/code/bisonex/test-reseq  :results silent
../ReSeq/bin/reseq illuminaPE -R exons-200.fna -s Ec-Hi2000-TruSeq.reseq --ipfIterations 0 -1 reseq1.fq -2 reseq2.fq -c 30
#+end_src
Caution: the alignment needs the complete fna! Otherwise, errors such as
[E::sam_hdr_create] Duplicated sequence "NC_000001.10:762970-763155" in file "-"
And no bam is produced, with
samtools sort: failed to chang

Replacement in projects/bisonex.org at line 43 [4.35]

B:BD[5.377394] → [2.314:7241]


26×10 DataFrame
 Row │ chrom         pos       variant        variantType  zygosity      ref        alt        refCount  altCount  readsCount
     │ SubStrin…?    Int64     SubStrin…?     String?      String15      SubStrin…  SubStrin…  Int64     Int64     Int64
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ NC_000015.10  74343027  g.74343027C>T  snv          heterozygous  C          T                61        58         120
   2 │ NC_000015.10  75400778  g.75400778C>G  snv          heterozygous  C          G               108        79         187
   3 │ NC_000015.10  89327201  g.89327201C>T  snv          heterozygous  C          T               243       241         486
   4 │ NC_000015.10  48767448  g.48767448A>C  snv          heterozygous  A          C                72        70         142
   5 │ NC_000015.10  75411685  g.75411685T>C  snv          heterozygous  T          C                79        81         160
   6 │ NC_000015.10  66703292  g.66703292C>T  snv          heterozygous  C          T                70        60         130
   7 │ NC_000015.10  89325639  g.89325639G>A  snv          heterozygous  G          A               257       267         524
   8 │ NC_000015.10  89330184  g.89330184G>A  snv          heterozygous  G          A               258       287         548
   9 │ NC_000015.10  89330184  g.89330184G>A  snv          heterozygous  G          A               258       287         548
  10 │ NC_000015.10  89325639  g.89325639G>A  snv          heterozygous  G          A               257       267         524
  11 │ NC_000015.10  42401752  g.42401752G>A  snv          homozygous    G          A                61       212         273
  12 │ NC_000015.10  89327201  g.89327201C>T  snv          heterozygous  C          T               243       241         486
  13 │ NC_000015.10  38339896  g.38339896G>A  snv          heterozygous  G          A                56        86         144
  14 │ NC_000015.10  26869324  g.26869324A>T  snv          heterozygous  A          T                62        49         113
  15 │ NC_000015.10  66435145  g.66435145G>A  snv          heterozygous  G          A                98        95         193
  16 │ NC_000015.10  60514655  g.60514655G>A  snv          heterozygous  G          A                94        99         194
  17 │ NC_000015.10  42410947  g.42410947A>G  snv          heterozygous  A          G               153       123         276
  18 │ NC_000015.10  75430368  g.75430368C>T  snv          heterozygous  C          T                80        62         142
  19 │ NC_000015.10  25375494  g.25375494T>C  snv          heterozygous  T          C               103       104         207
  20 │ NC_000015.10  60497497  g.60497497C>A  snv          heterozygous  C          A                61        65         126
  21 │ NC_000015.10  74891539  g.74891539C>T  snv          heterozygous  C          T               118       124         242
  22 │ NC_000015.10  48488433  g.48488433A>G  snv          heterozygous  A          G               367       122         492
  23 │ NC_000015.10  89318565  g.89318565A>G  snv          heterozygous  A          G               303        98         404
  24 │ NC_000015.10  89323426  g.89323426C>G  snv          heterozygous  C          G                93       109         202
  25 │ NC_000015.10  89318595  g.89318595T>C  snv          heterozygous  T          C               321       128         453
  26 │ NC_000015.10  48488437  g.48488437T>C  snv          heterozygous  T          C               356       132         488
CLOSED: [2023-05-01 Mon 17:18]
***** TODO Chromosome 15: HaplotypeCaller test: failed because the CIGAR was not updated
SCHEDULED: <2023-05-01 Mon>
#+begin_src
julia -Jbisonex.so --project=. insertVariants.jl 63003856_S135_chr15.bam 63003856_S135_chr15_inserted.bam
scp 63003856_S135_chr15_inserted.bam* meso:/Work/Users/apraga/bisonex/tests/synthetic/
#+end_src
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/synthetic :results silent
ln -s /Work/Projects/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz .
ln -s /Work/Projects/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz.tbi
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.dict .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna.fai .
#+end_src
then
#+begin_src
gatk --java-options "-Xmx3072M" HaplotypeCaller --input 63003856_S135_chr15_inserted.bam --output testchr15.vcf.gz --reference genomeRef.fna  --tmp-dir . -L NC_000015.10
#+end_src
scp meso:/Work/Users/apraga/bisonex/tests/synthetic/testchr15.vcf.gz haplotypecaller-chr15.vcf.gz
None of the inserted variants is called
- base quality ok
****** DONE --bam-output: not called
CLOSED: [2023-05-01 Mon 21:57]
gatk --java-options "-Xmx3072M" HaplotypeCaller --input 63003856_S135_chr15_inserted.bam --output haplotypecaller-chr15.vcf.gz --reference genomeRef.fna --tmp-dir . -L NC_000015.10 --bam-output debug.bam
****** DONE --linked-de-bruijn-graph: same result
CLOSED: [2023-05-01 Mon 21:57]
gatk --java-options "-Xmx3072M" HaplotypeCaller --input 63003856_S135_chr15_inserted.bam --output haplotypecaller-chr15.vcf.gz --reference genomeRef.fna --tmp-dir . -L NC_000015.10 --linked-de-bruijn-graph
****** HOLD Regenerate fastq
***** TODO Generate the bam data for all chromosomes
 timeit julia -Jbisonex.so --project=. insertVariants.jl ~/code/bisonex/out/63003856/preprocessing/63003856_S135.bam 63003856_S135_inserted.bam
40min 516ms 835µs 405ns
Warning:
 [W::bam_hdr_read] EOF marker is absent. The input is probably truncated
inserted.bam and excluded.bam (the files before the merge) look ok...
Retrying by hand: it works
#+begin_src
samtools merge test-all.bam inserted.bam excluded.bam
mv -f test-all.bam 63003856_S135_inserted.bam
mv -f test-all.bam.bai 63003856_S135_chr15_inserted.bam.bai
#+end_src
***** TODO bam2fastq to get an up-to-date CIGAR
SCHEDULED: <2023-05-01 Mon>
 rsync -avz 63003856_S135_inserted.bam* meso:/Work/Groups/bisonex/data/synthetic/
***** TODO Run the pipeline
SCHEDULED: <2023-05-01 Mon>
*** Miscellaneous
**** DONE Check the number of reads: fastq vs bam
CLOSED: [2022-10-09 Sun 22:31]
* DONE Plot: Ashkenazim trio
CLOSED: [2023-04-18 Tue 21:28] SCHEDULED: <2023-04-16 Sun>
/Entered on/ [2023-04-16 Sun 17:29]

[5.377394]


26×10 DataFrame
 Row │ chrom         pos       variant        variantType  zygosity      ref        alt        refCount  altCount  readsCount
     │ SubStrin…?    Int64     SubStrin…?     String?      String15      SubStrin…  SubStrin…  Int64     Int64     Int64
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ NC_000015.10  74343027  g.74343027C>T  snv          heterozygous  C          T                61        58         120
   2 │ NC_000015.10  75400778  g.75400778C>G  snv          heterozygous  C          G               108        79         187
   3 │ NC_000015.10  89327201  g.89327201C>T  snv          heterozygous  C          T               243       241         486
   4 │ NC_000015.10  48767448  g.48767448A>C  snv          heterozygous  A          C                72        70         142
   5 │ NC_000015.10  75411685  g.75411685T>C  snv          heterozygous  T          C                79        81         160
   6 │ NC_000015.10  66703292  g.66703292C>T  snv          heterozygous  C          T                70        60         130
   7 │ NC_000015.10  89325639  g.89325639G>A  snv          heterozygous  G          A               257       267         524
   8 │ NC_000015.10  89330184  g.89330184G>A  snv          heterozygous  G          A               258       287         548
   9 │ NC_000015.10  89330184  g.89330184G>A  snv          heterozygous  G          A               258       287         548
  10 │ NC_000015.10  89325639  g.89325639G>A  snv          heterozygous  G          A               257       267         524
  11 │ NC_000015.10  42401752  g.42401752G>A  snv          homozygous    G          A                61       212         273
  12 │ NC_000015.10  89327201  g.89327201C>T  snv          heterozygous  C          T               243       241         486
  13 │ NC_000015.10  38339896  g.38339896G>A  snv          heterozygous  G          A                56        86         144
  14 │ NC_000015.10  26869324  g.26869324A>T  snv          heterozygous  A          T                62        49         113
  15 │ NC_000015.10  66435145  g.66435145G>A  snv          heterozygous  G          A                98        95         193
  16 │ NC_000015.10  60514655  g.60514655G>A  snv          heterozygous  G          A                94        99         194
  17 │ NC_000015.10  42410947  g.42410947A>G  snv          heterozygous  A          G               153       123         276
  18 │ NC_000015.10  75430368  g.75430368C>T  snv          heterozygous  C          T                80        62         142
  19 │ NC_000015.10  25375494  g.25375494T>C  snv          heterozygous  T          C               103       104         207
  20 │ NC_000015.10  60497497  g.60497497C>A  snv          heterozygous  C          A                61        65         126
  21 │ NC_000015.10  74891539  g.74891539C>T  snv          heterozygous  C          T               118       124         242
  22 │ NC_000015.10  48488433  g.48488433A>G  snv          heterozygous  A          G               367       122         492
  23 │ NC_000015.10  89318565  g.89318565A>G  snv          heterozygous  A          G               303        98         404
  24 │ NC_000015.10  89323426  g.89323426C>G  snv          heterozygous  C          G                93       109         202
  25 │ NC_000015.10  89318595  g.89318595T>C  snv          heterozygous  T          C               321       128         453
  26 │ NC_000015.10  48488437  g.48488437T>C  snv          heterozygous  T          C               356       132         488
CLOSED: [2023-05-01 Mon 17:18]
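The table above contains repeated rows (for instance rows 8 and 9), and the alternate-allele fraction of each variant follows from altCount and readsCount. A minimal Python sketch with four rows copied from the table; the tuple layout and the function names are assumptions made for illustration.

```python
# (chrom, pos, ref, alt, refCount, altCount, readsCount), copied from the table.
rows = [
    ("NC_000015.10", 74343027, "C", "T", 61, 58, 120),
    ("NC_000015.10", 89330184, "G", "A", 258, 287, 548),
    ("NC_000015.10", 89330184, "G", "A", 258, 287, 548),  # repeated row
    ("NC_000015.10", 89318565, "A", "G", 303, 98, 404),
]


def deduplicate(rows):
    """Drop exact duplicate rows while keeping the original order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


def alt_fraction(alt_count, reads_count):
    """Fraction of reads supporting the alternate allele."""
    return alt_count / reads_count


unique = deduplicate(rows)
for chrom, pos, ref, alt, _, alt_count, reads_count in unique:
    print(chrom, pos, f"{ref}>{alt}", round(alt_fraction(alt_count, reads_count), 4))
```

Removing the repeated rows before comparing against a VCF avoids counting the same variant twice.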
***** TODO Chromosome 15: HaplotypeCaller test: failed because the CIGAR was not updated
SCHEDULED: <2023-05-01 Mon>
#+begin_src
julia -Jbisonex.so --project=. insertVariants.jl 63003856_S135_chr15.bam 63003856_S135_chr15_inserted.bam
scp 63003856_S135_chr15_inserted.bam* meso:/Work/Users/apraga/bisonex/tests/synthetic/
#+end_src
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/synthetic :results silent
ln -s /Work/Projects/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz .
ln -s /Work/Projects/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz.tbi
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.dict .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna.fai .
#+end_src
then
#+begin_src
gatk --java-options "-Xmx3072M" HaplotypeCaller --input 63003856_S135_chr15_inserted.bam --output testchr15.vcf.gz --reference genomeRef.fna  --tmp-dir . -L NC_000015.10
#+end_src
scp meso:/Work/Users/apraga/bisonex/tests/synthetic/testchr15.vcf.gz haplotypecaller-chr15.vcf.gz
None of the inserted variants is called
- base quality ok
****** DONE --bam-output: not called
CLOSED: [2023-05-01 Mon 21:57]
gatk --java-options "-Xmx3072M" HaplotypeCaller --input 63003856_S135_chr15_inserted.bam --output haplotypecaller-chr15.vcf.gz --reference genomeRef.fna --tmp-dir . -L NC_000015.10 --bam-output debug.bam
****** DONE --linked-de-bruijn-graph: same result
CLOSED: [2023-05-01 Mon 21:57]
gatk --java-options "-Xmx3072M" HaplotypeCaller --input 63003856_S135_chr15_inserted.bam --output haplotypecaller-chr15.vcf.gz --reference genomeRef.fna --tmp-dir . -L NC_000015.10 --linked-de-bruijn-graph
****** HOLD Regenerate fastq
***** TODO Generate the bam data for all chromosomes
 timeit julia -Jbisonex.so --project=. insertVariants.jl ~/code/bisonex/out/63003856/preprocessing/63003856_S135.bam 63003856_S135_inserted.bam
40min 516ms 835µs 405ns
Warning:
 [W::bam_hdr_read] EOF marker is absent. The input is probably truncated
inserted.bam and excluded.bam (the files before the merge) look ok...
Retrying by hand: it works
#+begin_src
samtools merge test-all.bam inserted.bam excluded.bam
mv -f test-all.bam 63003856_S135_inserted.bam
mv -f test-all.bam.bai 63003856_S135_chr15_inserted.bam.bai
#+end_src
***** DONE bam2fastq to get an up-to-date CIGAR: failed ("hidden" variants)
CLOSED: [2023-05-04 Thu 20:30] SCHEDULED: <2023-05-01 Mon>
Run the bam generation from the mesocentre (copying crashes over the VPN)
#+begin_src sh
cd /Work/Users/apraga/recherche/bisonex/generate
julia --project=. insertVariants.jl  ../../../bisonex/out/63003856_S135/preprocessing/applybqsr/63003856_S135.bam 63003856_S135_inserted.bam
#+end_src
Then the workflow, with storeDir disabled for SAMTOOLS_BAM2FQ in nextflow.config (why??)
#+begin_src nextflow
include { SAMTOOLS_BAM2FQ }                            from "${params.modulesDir}/samtools/bam2fq/main"
include { SAMTOOLS_SORT as sortBamByName }             from "${params.modulesDir}/samtools/sort/main"
workflow {
    f = Channel.fromPath("${params.dataDir}/synthetic/63003856_S135_inserted.bam",
                         checkIfExists: true).map{it -> [["id": "synthetic_63003856"], it]}
    // Important: use "-n" option !!
    sortBamByName(f)
    SAMTOOLS_BAM2FQ(sortBamByName.out.bam, true)
}
#+end_src
Then
#+begin_src
cp work/34/fb2fc136f6f6d7f42d0960512f06de/*.fq.gz /Work/Groups/bisonex/data/synthetic/
#+end_src
***** KILL Run the pipeline
CLOSED: [2023-05-04 Thu 20:30] SCHEDULED: <2023-05-01 Mon>
NXF_OPTS=-D"user.name=apraga" nextflow run   main.nf -c nextflow.config  -profile standard,helios -bg --input="/Work/Groups/bisonex/data/synthetic/synthetic_63003856_{1,2}.fq.gz" --outdir out/synthetic_63003856
*** Miscellaneous
**** DONE Check the number of reads: fastq vs bam
CLOSED: [2022-10-09 Sun 22:31]
* DONE Plot: Ashkenazim trio
CLOSED: [2023-04-18 Tue 21:28] SCHEDULED: <2023-04-16 Sun>
/Entered on/ [2023-04-16 Sun 17:29]