apraga/org - Change Z4JC5QKADUJB7BIW2IR34WPGU4ELNQW2LTOLG3OYMHCXALWKFEQAC

Tasks

Created by Alexis Praga on June 11, 2023

Z4JC5QKADUJB7BIW2IR34WPGU4ELNQW2LTOLG3OYMHCXALWKFEQAC

Dependencies

In channels

main

Change contents

Replacement in projects.org at line 352 [5.123895]

B:BD[4.148] → [2.12:77]

***** TODO Correction Hirsch + Vidau
SCHEDULED: <2023-05-28 Sun>

[4.148]

[6.503]

***** DONE Correction Hirsch + Vidau
CLOSED: [2023-06-11 Sun 18:39] SCHEDULED: <2023-05-28 Sun>
***** TODO Final corrections JP
SCHEDULED: <2023-06-11 Sun>

Replacement in projects.org at line 474 [5.123895]
∅:D[7.617] → [8.648:688]
B:BD[9.376] → [8.648:688]
B:BD[8.688] → [10.3:31]
```
***** TODO Executable version for Paul
SCHEDULED: <2023-06-06 Tue>
```
[7.617]
[8.716]
```
***** HOLD Executable version for Paul
```
Insertion in projects.org at line 2217 [5.123895]
[10.390]
[11.608]
```
**** Notes
```
Insertion in projects.org at line 2229 [5.123895]
[11.1084]
[12.153]
```
**** TODO Patch NXF_OFFLINE=true
SCHEDULED: <2023-06-11 Sun>
```
Replacement in projects.org at line 2276 [5.123895]
B:BD[10.1384] → [10.1384:1398]
```
** TODO Happy
```
[10.1384]
[10.1398]
```
** TODO Happy :happy:
```
Insertion in projects.org at line 2285 [5.123895]
[13.85]
[10.1513]

Replacement in projects.org at line 2294 [5.123895]

B:BD[14.986] → [2.417:457]

∅:D[2.457] → [15.90:118]

B:BD[16.104] → [15.90:118]

*** HOLD Replace brake pads +/- discs
SCHEDULED: <2023-05-09 Tue>

[14.986]

[16.104]

*** KILL Replace brake pads +/- discs
CLOSED: [2023-06-11 Sun 18:39] SCHEDULED: <2023-05-09 Tue>

Insertion in projects.org at line 2305 [5.123895]

[2.565]

[14.986]

**** TODO Replace shock absorbers
SCHEDULED: <2023-06-13 Tue>
**** TODO Replace headlight
SCHEDULED: <2023-06-13 Tue>
**** TODO Emissions test
SCHEDULED: <2023-06-13 Tue>
*** TODO Fine
**** DONE Change of address on the vehicle registration certificate
CLOSED: [2023-06-11 Sun 21:40]
**** TODO Send a photocopy of the vehicle registration certificate to avoid a surcharge
SCHEDULED: <2023-06-18 Sun>

Insertion in projects.org at line 2326 [5.123895]

[14.1057]

[17.5260]

** DONE Household waste
CLOSED: [2023-06-11 Sun 21:40]
Sent bank details (RIB) on <2023-06-11 Sun>

Replacement in projects/bisonex.org at line 7 [19.35]

B:BD[18.14923] → [13.101:8293]

B:BD[13.8293] → [20.27:8219]

_norm.vcf.gz
bcftools isec dbsnp_mwi_norm.vcf.gz clinvar_mwi.vcf.gz -n=2
#+end_src
#+RESULTS:
| NC_000020.11 | 10652589 | G | A | 11 |
| NC_000020.11 | 10652589 | G | C | 11 |
******* TODO Sur dbSNP chr20 non
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools norm -m -any dbSNP_common_chr20 -o dbSNP_common_chr20_norm.vcf.gz
#+end_src
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools isec -i 'INFO/CLNSIG="Pathogenic"' dbSNP_common_chr20_norm.vcf.gz clinvar_chr20.vcf.gz -p tmp
#+end_src
#+RESULTS:
***** DONE Trying bedtools intersect
#+begin_src sh
bedtools intersect -a  dbSNP_common.vcf.gz -b clinvar.vcf.gz
#+end_src
$ wc -l intersect.vcf
220206 intersect.vcf
** TODO Dependencies with Nix
*** DONE GATK
CLOSED: [2022-10-21 Fri 21:59]
*** WAIT BioDBHTS
Contribute a pull request
*** DONE BioExtAlign
CLOSED: [2022-10-22 Sat 00:38]
*** WAIT BioBigFile
Check whether the latest version of kent can be used
Contribute a pull request
*** HOLD rtg-tools
Convert clinvar NC
*** DONE simuscop
CLOSED: [2022-12-30 Fri 22:31]
*** DONE Spip
CLOSED: [2022-12-04 Sun 12:49]
No pull request
*** DONE R + packages
CLOSED: [2022-11-19 Sat 21:05]
*** TODO hap.py
https://github.com/Illumina/hap.py
**** DONE Version without rtgtools, with Python 3
CLOSED: [2023-02-02 Thu 22:15]
Test procedure
#+begin_src
nix develop .#hap-py
$ genericBuild
#+end_src
1. Remove the call to make_dependencies in cmakelist.txt: everything can be installed with nix
2. Patch Roc.cpp to get numeric_limits (error: 'numeric_limits' is not a member of 'std')
3. Add link flags (trial and error)
set(ZLIB_LIBRARIES -lz -lbz2 -lcurl -lcrypto -llzma)
4. Change print calls to print() in the Python code and remove a few imports
[nix-shell:~/source]$ sed -i.orig 's/print \"\(.*\)"/print(\1)/' src/python/*.py
**** DONE Serialize JSON to write the output data
CLOSED: [2023-02-17 Fri 19:25]
**** DONE Test on the example
CLOSED: [2023-02-04 Sat 00:25]
#+begin_src sh
$ cd hap.py
$ ../result/bin/hap.py example/happy/PG_NA12878_chr21.vcf.gz       example/happy/NA12878_chr21.vcf.gz       -f example/happy/PG_Conf_chr21.bed.gz       -o test -r example/chr21.fa
#+end_src
#+RESULTS:
| Type  | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score |
| INDEL | ALL    |        8937 |     7839 |     1098 |       11812 |      343 |      3520 |    45 |   283 |      0.877140 |         0.958635 |       0.298002 |        0.916079 |
| INDEL | PASS   |        8937 |     7550 |     1387 |        9971 |      283 |      1964 |    30 |   242 |      0.844803 |         0.964656 |       0.196971 |        0.900760 |
| SNP   | ALL    |       52494 |    52125 |      369 |       90092 |      582 |     37348 |   107 |   354 |      0.992971 |         0.988966 |       0.414554 |        0.990964 |
| SNP   | PASS   |       52494 |    46920 |     5574 |       48078 |      143 |       992 |     8 |    97 |      0.893816 |         0.996963 |       0.020633 |        0.942576 |
**** TODO Version with rtg-tools
**** TODO Get the tests working
***** TODO Attempt 2: from nix develop
SCHEDULED: <2023-05-20 Sat>
#+begin_src
nix develop .#hap-py
genericBuild
#+end_src
Initially run by hand, but run_tests can now be used
#+begin_src
HCDIR=bin/ ../src/sh/run_tests.sh
#+end_src
- [X] test boost
- [X] multimerge
- [X] hapenum
- [X] fp accuracy
- [X] faulty variant
- [ ] leftshift fails
- [X] other vcf
- [X] chr prefix
- [X] gvcf
- [X] decomp
- [X] contig length
- [X] integration test
- [ ] scmp fails on the type
- [X] giab
- [X] performance
- [ ] quantify fails on the type
- [ ] stratified: fails on the results!
- [X] pg counting
- [ ] sompy: cannot find Strelka in somatic
#+begin_src sh
phases="buildPhase checkPhase installPhase fixupPhase" genericBuild
#+end_src
**** KILL Reproduce the precisionchallenge performance: watch out for HG002 vs HG001!
CLOSED: [2023-04-01 Sat 19:43]
https://www.nist.gov/programs-projects/genome-bottle
***** KILL 0GOOR
CLOSED: [2023-04-01 Sat 19:40]
The problem came from (1) the DNA and (2) the chromosome renaming, which was wrong
****** DONE HG002
CLOSED: [2023-02-17 Fri 19:31]
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score
INDEL    ALL       525466    491355     34111      1156702     57724     605307   9384  25027       0.935084          0.895313        0.523304         0.914766
INDEL   PASS       525466    491355     34111      1156702     57724     605307   9384  25027       0.935084          0.895313        0.523304         0.914766
  SNP    ALL      3365115   3358399      6716      5666020     21995    2284364   4194   1125       0.998004          0.993496        0.403169         0.995745
  SNP   PASS      3365115   3358399      6716      5666020     21995    2284364   4194   1125       0.998004          0.993496        0.403169         0.995745
 TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
                    NaN                     NaN                   1.528276                   2.752637
                    NaN                     NaN                   1.528276                   2.752637
               2.100129                1.473519                   1.581196                   1.795603
               2.100129                1.473519                   1.581196                   1.795603
***** KILL With python2
CLOSED: [2023-02-17 Fri 19:25]
****** KILL with nix
CLOSED: [2023-02-17 Fri 19:25]
conda create -n python2 python=2.7 anaconda
****** KILL with conda
CLOSED: [2023-02-17 Fri 19:25]
******* Gentoo: regex_error on test...
OK with bash!
#+begin_src
anaconda3/bin/conda create --name py2 python=2.7
conda activate py2
conda install -c bioconda hap.py
#+end_src
******** Run the tests.
bin/test_haplotypes must be replaced with test_haplotypes in src/sh/run_tests.sh
#+begin_src sh
 HGREF=../genome/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta HCDIR=~/anaconda3/envs/py2/bin bash src/sh/run_tests.sh
#+end_src
Failure:
test_haplotypes: /opt/conda/conda-bld/work/hap.py-0.3.7/src/c++/lib/tools/Fasta.cpp:81: MMappedFastaFile::MMappedFastaFile(const string&): Assertion `fd != -1' failed.
unknown location(0): fatal error in "testVariantPrimitiveSplitter": signal: SIGABRT (application abort requested)
/opt/conda/conda-bld/work/hap.py-0.3.7/src/c++/test/test_align.cpp(298): last checkpoint
******** Chr21
HGREF=../genome/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta hap.py        example/happy/PG_NA12878_chr21.vcf.gz       example/happy/NA12878_chr21.vcf.gz       -f example/happy/PG_Conf_chr21.bed.gz       -o test
******* Helios
failure
** DONE Execution
CLOSED: [2022-09-13 Tue 21:37]
*** KILL test Bionix
*** KILL Implement execution with Nix?
See https://academic.oup.com/gigascience/article/9/11/giaa121/5987272?login=false
for an example.
Probably simpler to use Nix for environment management and snakemake for execution
No internet access from the cluster
*** DONE nextflow
CLOSED: [2022-09-13 Tue 21:37]
**** TODO Bug scheduler SGE
The job gets killed because the user is not passed correctly to nextflow
***** DONE Force the user at run time
CLOSED: [2023-04-01 Sat 17:57]
NXF_OPTS=-D"user.name=alex"
***** DONE Check whether the problem persists with 22.10.6
CLOSED: [2023-04-01 Sat 18:38] SCHEDULED: <2023-04-01 Sat>
yes
***** KILL Package the user into the program?
Bad idea.
** TODO Preprocessing with nextflow
*** TODO Map to reference
**** TODO Sample ID in header
/Work/Users/apraga/bisonex/out/63003856_S135/preprocessing/baserecalibrator
*** DONE Mark duplicate
CLOSED: [2022-10-09 Sun 22:30]
*** DONE Recalibrate base quality score
CLOSED: [2022-10-09 Sun 22:30]
** DONE Variant calling with Nextflow
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Haplotype caller
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter variants
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter common snp not clinvar path
CLOSED: [2022-11-07 Mon 23:00]
See [[*common dbSNP not clinvar patho][common dbSNP not clinvar patho]]
*** DONE Filter variant only in consensual sequence
CLOSED: [2022-11-08 Tue 22:23]
*** DONE Filter technical variants
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Use AVX to speed up execution
CLOSED: [2023-04-29 Sat 15:46]
Without it, we get the warning
#+begin_quote
17:28:00.720 INFO  PairHMM - OpenMP multi-threaded AVX-accelerated native PairHMM implementation is not supported
17:28:00.721 INFO  NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/nix/store/cy9ckxqwrkifx7wf02hm4ww1p6lnbxg9-gatk-4.2.4.1/bin/gatk-package-4.2.4.1-local.jar!/com/intel/gkl/native/libgkl_utils.so
17:28:00.733 WARN  NativeLibraryLoader - Unable to load libgkl_utils.so from native/libgkl_utils.so (/Work/Users/apraga/bisonex/out/NA12878_NIST7035/preprocessing/applybqsr/libgkl_utils821485189051585397.so: libgomp.so.1: cannot open shared object file: No such file or directory)
17:28:00.733 WARN  IntelPairHmm - Intel GKL Utils not loaded
17:28:00.733 WARN  PairHMM - ***WARNING: Machine does not have the AVX instruction set support needed for the accelerated AVX PairHmm. Falling back to the MUCH slower LOGLESS_CACHING implementation!
17:28:00.763 INFO  ProgressMeter - Starting traversal
#+end_quote
libgomp.so is provided by gcc, so the module must be loaded
 module load gcc@11.3.0/gcc-12.1.0
** KILL Use subworkflow
CLOSED: [2023-04-02 Sun 18:08]
Our version is more flexible
*** KILL Alignment
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
*** KILL Vep
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
vcf_annotate_ensemblvep
** TODO Annotation with nextflow :annotation:
*** KILL VEP : --gene-phenotype ?
CLOSED: [2023-04-18 mar. 18:32]
Discussed with Alexis: the databases are not up to date
https://www.ensembl.org/info/genome/variation/phenotype/sources_phenotype_documentation.html
*** DONE VEP plugin
CLOSED: [2023-04-18 mar. 18:32]
Clone the git repository containing the plugins
Then use --dir_plugins
*** HOLD Use Alexis's code
*** TODO New version with VEP
Example with --custom
https://www.ensembl.org/info/docs/tools/vep/script/vep_custom.html
**** DONE Add spliceAI
CLOSED: [2023-05-18 Thu 11:02] SCHEDULED: <2023-04-30 Sun>
VEP plugin
***** DONE Download the data
CLOSED: [2023-05-11 Thu 19:01]
Hard to automate: the link is temporary...
***** DONE Plugin
CLOSED: [2023-05-11 Thu 20:16]
***** DONE Split the score into several columns
CLOSED: [2023-05-11 Thu 20:16]
Test with this file, to have one line with an annotation and one line without
#CHROM	POS	ID	REF	ALT
1	9091	.	A	C
1	69091	.	A	C
and
#+begin_src sh
rm -f postvep.tsv* && vep -i testspliceai.vcf.gz -o postvep.tsv --tab  --dir 109 --merged --pick --use_given_ref   --offline  --plugin SpliceAI,snv=spliceai_scores.raw.snv.hg38.vcf.gz,indel=spliceai_scores.raw.indel.hg38.vcf.gz
#+end_src
#+begin_src
$ bgzip postvep.tsv
$ python spliceai.py
$ cat postvep2.tsv
,variation,Location,Allele,Gene,Feature,Feature_type,Consequence,cDNA_position,CDS_position,Protein_position,Amino_acids,Codons,Existing_variation,IMPACT,DISTANCE,STRAND,FLAGS,REFSEQ_MATCH,SOURCE,REFSEQ_OFFSET,SpliceAI_AG,SpliceAI_AL,SpliceAI_DG,SpliceAI_DL
0,1_9091_A/C,1:9091,C,ENSG00000290825,ENST00000456328,Transcript,upstream_gene_variant,-,-,-,-,-,-,MODIFIER,2778,1,-,-,Ensembl,-,,,,
1,1_69091_A/C,1:69091,C,ENSG00000186092,ENST00000641515,Transcript,missense_variant,124,64,22,M/L,Atg/Ctg,-,MODERATE,-,1,-,-,Ensembl,-,0.01,0.00,0.00,0.01
#+end_src
Test
cp work/bf/437ae511958509e43072f032f4d495/small.tab.gz tests/vep-spip.tab.gz
cp work/d5/3b1244b5ae83d54409ee0d456e8c55/small_cadd.tab.gz tests/vep-cadd-splice.tab.gz
**** TODO Add LOEUF and pLI
VEP plugin
**** TODO NMD
**** KILL Add LOEUF
CLOSED: [2023-04-19 mer. 16:32]
VEP plugin
**** DONE Spip
CLOSED: [2023-05-01 Mon 23:07] SCHEDULED: <2023-04-30 Sun>
BED does not seem to work well (a region must be defined)
VCF: too much information
Note: several transcripts but identical results. We remove the duplicates
***** DONE Separate interpretation, score and confidence interval
CLOSED: [2023-05-01 Mon 23:07] SCHEDULED: <2023-04-30 Sun>
Tests:
in tests/
vep -i 63004925-small.vcf -o postvep.vcf --vcf --fasta genomeRef.fna --dir 109 --merged --pick  --offline --custom ../script/spip_annotation.vcf.gz,SPIP,vcf,exact,0,spipInterp,spipScore,spipConfidence
***** DONE Score
CLOSED: [2023-04-22 Sat 15:30]
**** DONE CADD: replace with VEP plugin
CLOSED: [2023-05-07 Sun 14:45] SCHEDULED: <2023-05-07 Sun>
***** Test
#+begin_src
vep  -i test.vcf  -o lol.vcf --offline --dir  /Work/Projects/bisonex/data/vep/GRCh38/ --merged --vcf --fasta /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna --plugin CADD,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.snv.tsv.gz,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.indel.tsv.gz  --dir_plugins ../VEP_plugins/ -v
#+end_src
Test
#+begin_src sh
vep --id "1  230710048 230710048 A/G 1"   --offline --dir  /Work/Projects/bisonex/data/vep/GRCh38/ --merged --vcf --fasta /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna --plugin CADD,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.snv.tsv.gz,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.indel.tsv.gz  --hgvsg --plugin pLI --plugin LOEUF -o lol
#+end_src
CSQ=G|missense_variant|MODERATE|AGT|ENSG00000135744|Transcript|ENST00000366667|protein_coding|2/5||||843|776|259|M/T|aTg/aCg|||-1||HGNC|HGNC:333||Ensembl||A|A||1:g.230710048A>G|0.347|-0.277922|
Matches https://www.ensembl.org/Homo_sapiens/Tools/VEP/Results?tl=I7ZsIbrj14P6lD43-9115494
***** DONE Use whole genome
CLOSED: [2023-04-29 Sat 15:46]
***** KILL Rename the chromosomes beforehand ...
CLOSED: [2023-05-01 Mon 09:14] SCHEDULED: <2023-04-30 Sun>
Too slow!
- Downloading CADD: 4h20
- Renaming the chromosomes for SNVs: 6h20
- tabix on the SNVs: job killed after 21h
***** DONE Annotate separately and merge the tables
CLOSED: [2023-05-07 Sun 14:45] SCHEDULED: <2023-05-01 Mon>
NB: CADD could be filtered with tabix to restrict it to our variants
**** DONE clinvar
CLOSED: [2023-04-22 Sat 15:31]
**** KILL Check HGVS results with mutalyzer
CLOSED: [2023-05-01 Mon 09:26]
**** TODO Parallelization
***** HOLD per chromosome with the VEP workflow
https://github.com/Ensembl/ensembl-vep/blob/release/109/nextflow/workflows/run_vep.nf
***** HOLD With the --fork option
**** DONE Use the nf-core version of VEP
CLOSED: [2023-05-13 Sat 18:27] SCHEDULED: <2023-05-07 Sun>
**** DONE OMIM
CLOSED: [2023-05-08 Mon 15:02] SCHEDULED: <2023-05-01 Mon>
**** TODO Grantham
SCHEDULED: <2023-05-01 Mon>
**** TODO ACMG incidental
SCHEDULED: <2023-05-01 Mon>
**** TODO Gnomad ?
SCHEDULED: <2023-05-01 Mon>
**** DONE Filter after VEP with filter_vep
CLOSED: [2023-04-29 Sat 15:47]
Not tested
*** TODO Compare the annotations on 63003856
SCHEDULED: <2023-05-18 Thu>
**** Rerun the new pipeline
*** HOLD Old version
**** TODO HGVS
**** TODO Filter after VEP
**** TODO OMIM
**** TODO clinvar
**** TODO ACMG incidental
**** TODO Grantham
**** KILL LRG
CLOSED: [2023-04-18 mar. 17:22] SCHEDULED: <2023-04-18 Tue>
Discussed with Alexis: no longer up to date
**** TODO Gnomad
** DONE Port Alexis's version exactly to Helios
CLOSED: [2023-01-14 Sat 17:56]
"prod" branch
** STRT Test Alexis's version with Nix
*** DONE Add clinvar
CLOSED: [2022-11-13 Sun 19:37]
*** DONE Alignment
CLOSED: [2022-11-13 Sun 12:52]
*** DONE Haplotype caller
CLOSED: [2022-11-13 Sun 13:00]
*** TODO Filter
- [X] depth
- [ ] common snp not path
Problem with the list of IDs
**** TODO var

[18.14923]

[3.29]

_norm.vcf.gz
bcftools isec dbsnp_mwi_norm.vcf.gz clinvar_mwi.vcf.gz -n=2
#+end_src
#+RESULTS:
| NC_000020.11 | 10652589 | G | A | 11 |
| NC_000020.11 | 10652589 | G | C | 11 |
******* TODO Sur dbSNP chr20 non
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools norm -m -any dbSNP_common_chr20 -o dbSNP_common_chr20_norm.vcf.gz
#+end_src
#+begin_src sh :dir ~/code/bisonex/test_isec
bcftools isec -i 'INFO/CLNSIG="Pathogenic"' dbSNP_common_chr20_norm.vcf.gz clinvar_chr20.vcf.gz -p tmp
#+end_src
#+RESULTS:
***** DONE Trying bedtools intersect
#+begin_src sh
bedtools intersect -a  dbSNP_common.vcf.gz -b clinvar.vcf.gz
#+end_src
$ wc -l intersect.vcf
220206 intersect.vcf
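As a cross-check on the bcftools and bedtools attempts above, an exact-allele intersection can be sketched in plain Python. This is only a minimal sketch under stated assumptions, not what the pipeline does: it expects uncompressed VCF text, splits multiallelic ALT fields (a crude stand-in for `bcftools norm -m -any`, without left-alignment), and keeps the (CHROM, POS, REF, ALT) keys present in both inputs. The function names and the sample records are hypothetical.

```python
from typing import Iterable, Iterator, Set, Tuple

Key = Tuple[str, int, str, str]


def variant_keys(vcf_lines: Iterable[str]) -> Iterator[Key]:
    """Yield one (chrom, pos, ref, alt) key per ALT allele, skipping headers."""
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        for alt in alts.split(","):  # one key per allele of a multiallelic record
            yield chrom, int(pos), ref, alt


def shared_variants(a: Iterable[str], b: Iterable[str]) -> Set[Key]:
    """Return the keys found in both VCFs (same position and same alleles)."""
    return set(variant_keys(a)) & set(variant_keys(b))


# Illustrative records only (not taken from dbSNP or ClinVar).
dbsnp = ["#CHROM\tPOS\tID\tREF\tALT", "20\t100\trs1\tG\tA,C", "20\t200\trs2\tT\tG"]
clinvar = ["#CHROM\tPOS\tID\tREF\tALT", "20\t100\t1\tG\tC", "20\t300\t2\tA\tT"]
print(sorted(shared_variants(dbsnp, clinvar)))
```

Both inputs must use the same chromosome naming for the keys to match, which is the same constraint as for `bcftools isec`.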
** TODO Dependencies with Nix
*** DONE GATK
CLOSED: [2022-10-21 Fri 21:59]
*** WAIT BioDBHTS
Contribute a pull request
*** DONE BioExtAlign
CLOSED: [2022-10-22 Sat 00:38]
*** WAIT BioBigFile
Check whether the latest version of kent can be used
Contribute a pull request
*** HOLD rtg-tools
Convert clinvar NC
*** DONE simuscop
CLOSED: [2022-12-30 Fri 22:31]
*** DONE Spip
CLOSED: [2022-12-04 Sun 12:49]
No pull request
*** DONE R + packages
CLOSED: [2022-11-19 Sat 21:05]
*** TODO hap.py
https://github.com/Illumina/hap.py
**** DONE Version without rtgtools, with Python 3
CLOSED: [2023-02-02 Thu 22:15]
Test procedure
#+begin_src
nix develop .#hap-py
$ genericBuild
#+end_src
1. Remove the call to make_dependencies in cmakelist.txt: everything can be installed with nix
2. Patch Roc.cpp to get numeric_limits (error: 'numeric_limits' is not a member of 'std')
3. Add link flags (trial and error)
set(ZLIB_LIBRARIES -lz -lbz2 -lcurl -lcrypto -llzma)
4. Change print calls to print() in the Python code and remove a few imports
[nix-shell:~/source]$ sed -i.orig 's/print \"\(.*\)"/print(\1)/' src/python/*.py
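As written, the sed expression above keeps only the text between the quotes (`\1`), so `print "x"` would become `print(x)`, and it ignores anything after the last quote. A slightly more careful conversion can be sketched in Python. This is a hypothetical helper, not the patch actually applied to hap.py: it wraps the whole argument list of simple one-line print statements and leaves anything ambiguous (stream redirection, trailing comma, trailing comment) untouched for manual review.

```python
import re

# Whole-line Python 2 print statement: indentation, "print", then its arguments.
# The negative lookahead skips lines that already look like a print(...) call.
PRINT_STMT = re.compile(r"^(\s*)print\s+(?!\()(.+?)\s*$")


def convert_print(line: str) -> str:
    """Wrap the arguments of a simple Python 2 print statement in parentheses.

    Expects a line without its trailing newline (e.g. from str.splitlines()).
    """
    match = PRINT_STMT.match(line)
    if match is None:
        return line
    indent, args = match.groups()
    if args.startswith(">>") or args.endswith(",") or "#" in args:
        return line  # redirection, trailing comma or comment: review by hand
    return f"{indent}print({args})"


for sample in ['print "a %s" % x', '    print "hello"', 'print("done")']:
    print(convert_print(sample))
```

Unlike the sed one-liner, this keeps the quotes and any `%` formatting inside the call, at the price of skipping more lines.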
**** DONE Serialize JSON to write the output data
CLOSED: [2023-02-17 Fri 19:25]
**** DONE Test on the example
CLOSED: [2023-02-04 Sat 00:25]
#+begin_src sh
$ cd hap.py
$ ../result/bin/hap.py example/happy/PG_NA12878_chr21.vcf.gz       example/happy/NA12878_chr21.vcf.gz       -f example/happy/PG_Conf_chr21.bed.gz       -o test -r example/chr21.fa
#+end_src
#+RESULTS:
| Type  | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score |
| INDEL | ALL    |        8937 |     7839 |     1098 |       11812 |      343 |      3520 |    45 |   283 |      0.877140 |         0.958635 |       0.298002 |        0.916079 |
| INDEL | PASS   |        8937 |     7550 |     1387 |        9971 |      283 |      1964 |    30 |   242 |      0.844803 |         0.964656 |       0.196971 |        0.900760 |
| SNP   | ALL    |       52494 |    52125 |      369 |       90092 |      582 |     37348 |   107 |   354 |      0.992971 |         0.988966 |       0.414554 |        0.990964 |
| SNP   | PASS   |       52494 |    46920 |     5574 |       48078 |      143 |       992 |     8 |    97 |      0.893816 |         0.996963 |       0.020633 |        0.942576 |
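The metric columns of this table can be recomputed from the count columns, which is a quick sanity check on a patched build. The sketch below assumes the usual definitions (recall from the truth counts, precision from the query counts, F1 as their harmonic mean); query true positives are not a column of the table, so deriving them as QUERY.TOTAL minus FP minus UNK is an assumption, although it does reproduce the INDEL/ALL row. The function name is hypothetical.

```python
def happy_metrics(truth_tp, truth_fn, query_total, query_fp, query_unk):
    """Recompute hap.py-style summary metrics from the count columns."""
    # Assumption: query TP = calls that are neither false positives nor unknown.
    query_tp = query_total - query_fp - query_unk
    recall = truth_tp / (truth_tp + truth_fn)
    precision = query_tp / (query_tp + query_fp)
    frac_na = query_unk / query_total
    f1 = 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision, "frac_na": frac_na, "f1": f1}


# INDEL / ALL row of the table above.
indel_all = happy_metrics(truth_tp=7839, truth_fn=1098,
                          query_total=11812, query_fp=343, query_unk=3520)
for name, value in indel_all.items():
    print(f"{name}: {value:.6f}")
```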
**** TODO Version with rtg-tools
**** TODO Get the tests working
***** TODO Attempt 2: from nix develop
SCHEDULED: <2023-05-20 Sat>
#+begin_src
nix develop .#hap-py
genericBuild
#+end_src
Initially run by hand, but run_tests can now be used
#+begin_src
HCDIR=bin/ ../src/sh/run_tests.sh
#+end_src
- [X] test boost
- [X] multimerge
- [X] hapenum
- [X] fp accuracy
- [X] faulty variant
- [ ] leftshift fails
- [X] other vcf
- [X] chr prefix
- [X] gvcf
- [X] decomp
- [X] contig length
- [X] integration test
- [ ] scmp fails on the type
- [X] giab
- [X] performance
- [ ] quantify fails on the type
- [ ] stratified: fails on the results!
- [X] pg counting
- [ ] sompy: cannot find Strelka in somatic
#+begin_src sh
phases="buildPhase checkPhase installPhase fixupPhase" genericBuild
#+end_src
**** KILL Reproduce the precisionchallenge performance: watch out for HG002 vs HG001!
CLOSED: [2023-04-01 Sat 19:43]
https://www.nist.gov/programs-projects/genome-bottle
***** KILL 0GOOR
CLOSED: [2023-04-01 Sat 19:40]
The problem came from (1) the DNA and (2) the chromosome renaming, which was wrong
****** DONE HG002
CLOSED: [2023-02-17 Fri 19:31]
 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score
INDEL    ALL       525466    491355     34111      1156702     57724     605307   9384  25027       0.935084          0.895313        0.523304         0.914766
INDEL   PASS       525466    491355     34111      1156702     57724     605307   9384  25027       0.935084          0.895313        0.523304         0.914766
  SNP    ALL      3365115   3358399      6716      5666020     21995    2284364   4194   1125       0.998004          0.993496        0.403169         0.995745
  SNP   PASS      3365115   3358399      6716      5666020     21995    2284364   4194   1125       0.998004          0.993496        0.403169         0.995745
 TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
                    NaN                     NaN                   1.528276                   2.752637
                    NaN                     NaN                   1.528276                   2.752637
               2.100129                1.473519                   1.581196                   1.795603
               2.100129                1.473519                   1.581196                   1.795603
***** KILL With python2
CLOSED: [2023-02-17 Fri 19:25]
****** KILL with nix
CLOSED: [2023-02-17 Fri 19:25]
conda create -n python2 python=2.7 anaconda
****** KILL with conda
CLOSED: [2023-02-17 Fri 19:25]
******* Gentoo: regex_error on test...
OK with bash!
#+begin_src
anaconda3/bin/conda create --name py2 python=2.7
conda activate py2
conda install -c bioconda hap.py
#+end_src
******** Run the tests.
bin/test_haplotypes must be replaced with test_haplotypes in src/sh/run_tests.sh
#+begin_src sh
 HGREF=../genome/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta HCDIR=~/anaconda3/envs/py2/bin bash src/sh/run_tests.sh
#+end_src
Failure:
test_haplotypes: /opt/conda/conda-bld/work/hap.py-0.3.7/src/c++/lib/tools/Fasta.cpp:81: MMappedFastaFile::MMappedFastaFile(const string&): Assertion `fd != -1' failed.
unknown location(0): fatal error in "testVariantPrimitiveSplitter": signal: SIGABRT (application abort requested)
/opt/conda/conda-bld/work/hap.py-0.3.7/src/c++/test/test_align.cpp(298): last checkpoint
******** Chr21
HGREF=../genome/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta hap.py        example/happy/PG_NA12878_chr21.vcf.gz       example/happy/NA12878_chr21.vcf.gz       -f example/happy/PG_Conf_chr21.bed.gz       -o test
******* Helios
failure
** TODO T2T
All the resources are described here
https://github.com/marbl/CHM13
Details on the pipeline
https://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hub_3267197_GCA_009914755.4&c=CP068277.2&g=hub_3267197_hgLiftOver
*** TODO URL hg38 + T2T
- [ ] genome
- [ ] dbsnp
- [ ] clinvar
- [ ] vep
https://github.com/Ensembl/ensembl-vep/issues/1409
*** TODO Generic download (aws + ftp)
*** TODO Fix download: URL directly in the config file
*** TODO Switch to the "chr" naming convention for chromosomes
**** TODO Do not rename clinvar (hg38)
**** TODO Rename dbsnp (hg38)
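dbSNP names contigs by RefSeq accession (the `NC_000020.11` seen in the isec results), so moving to the "chr" convention needs a name mapping. A minimal sketch, assuming only the primary human chromosomes matter: the numeric part of the accession gives the chromosome, with 23 and 24 standing for X and Y and `NC_012920` for the mitochondrion. The function name is hypothetical, and any other contig is returned unchanged.

```python
import re

# Primary chromosome accessions look like NC_000020.11 (six digits, then a version).
_REFSEQ = re.compile(r"^NC_(\d{6})\.\d+$")


def refseq_to_chr(name: str) -> str:
    """Map a RefSeq chromosome accession to a chr-prefixed name."""
    match = _REFSEQ.match(name)
    if match is None:
        return name  # not a chromosome accession: leave as-is
    number = int(match.group(1))
    if number == 12920:
        return "chrM"
    if 1 <= number <= 22:
        return f"chr{number}"
    if number == 23:
        return "chrX"
    if number == 24:
        return "chrY"
    return name


# Build "old<TAB>new" lines from contig names read from a VCF header.
contigs = ["NC_000001.11", "NC_000020.11", "NC_000023.11", "NC_012920.1"]
for contig in contigs:
    print(f"{contig}\t{refseq_to_chr(contig)}")
```

The printed two-column lines have the shape expected by `bcftools annotate --rename-chrs`; taking the accessions from the file's own header avoids hard-coding assembly-specific version suffixes.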
*** Liftover pipelines
:PROPERTIES:
:ID:       d2280207-3f65-4a31-a291-41fa9a9658c2
:END:
Contains the chain files
** DONE Execution
CLOSED: [2022-09-13 Tue 21:37]
*** KILL test Bionix
*** KILL Implement execution with Nix?
See https://academic.oup.com/gigascience/article/9/11/giaa121/5987272?login=false
for an example.
Probably simpler to use Nix for environment management and snakemake for execution
No internet access from the cluster
*** DONE nextflow
CLOSED: [2022-09-13 Tue 21:37]
**** TODO Bug scheduler SGE
The job gets killed because the user is not passed correctly to nextflow
***** DONE Force the user at run time
CLOSED: [2023-04-01 Sat 17:57]
NXF_OPTS=-D"user.name=alex"
***** DONE Check whether the problem persists with 22.10.6
CLOSED: [2023-04-01 Sat 18:38] SCHEDULED: <2023-04-01 Sat>
yes
***** KILL Package the user into the program?
Bad idea.
** TODO Preprocessing with nextflow
*** TODO Map to reference
**** TODO Sample ID in header
/Work/Users/apraga/bisonex/out/63003856_S135/preprocessing/baserecalibrator
*** DONE Mark duplicate
CLOSED: [2022-10-09 Sun 22:30]
*** DONE Recalibrate base quality score
CLOSED: [2022-10-09 Sun 22:30]
** DONE Variant calling with Nextflow
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Haplotype caller
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter variants
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter common snp not clinvar path
CLOSED: [2022-11-07 Mon 23:00]
See [[*common dbSNP not clinvar patho][common dbSNP not clinvar patho]]
*** DONE Filter variant only in consensual sequence
CLOSED: [2022-11-08 Tue 22:23]
*** DONE Filter technical variants
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Use AVX to speed up execution
CLOSED: [2023-04-29 Sat 15:46]
Without it, we get the warning
#+begin_quote
17:28:00.720 INFO  PairHMM - OpenMP multi-threaded AVX-accelerated native PairHMM implementation is not supported
17:28:00.721 INFO  NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/nix/store/cy9ckxqwrkifx7wf02hm4ww1p6lnbxg9-gatk-4.2.4.1/bin/gatk-package-4.2.4.1-local.jar!/com/intel/gkl/native/libgkl_utils.so
17:28:00.733 WARN  NativeLibraryLoader - Unable to load libgkl_utils.so from native/libgkl_utils.so (/Work/Users/apraga/bisonex/out/NA12878_NIST7035/preprocessing/applybqsr/libgkl_utils821485189051585397.so: libgomp.so.1: cannot open shared object file: No such file or directory)
17:28:00.733 WARN  IntelPairHmm - Intel GKL Utils not loaded
17:28:00.733 WARN  PairHMM - ***WARNING: Machine does not have the AVX instruction set support needed for the accelerated AVX PairHmm. Falling back to the MUCH slower LOGLESS_CACHING implementation!
17:28:00.763 INFO  ProgressMeter - Starting traversal
#+end_quote
libgomp.so is provided by gcc, so the module must be loaded
 module load gcc@11.3.0/gcc-12.1.0
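In the log above the final warning blames missing AVX support, but the earlier line shows that `libgomp.so.1` could not be opened. Both conditions can be checked on a compute node before launching GATK. This is a minimal, Linux-only sketch with hypothetical function names: it reads the CPU flags from `/proc/cpuinfo` and tries to load the library through the dynamic loader.

```python
import ctypes


def cpu_has_avx(cpuinfo_text: str) -> bool:
    """Return True if a 'flags' line of /proc/cpuinfo-style text lists avx."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and "avx" in value.split():
            return True
    return False


def can_load(library: str = "libgomp.so.1") -> bool:
    """Return True if the dynamic loader can open the shared library."""
    try:
        ctypes.CDLL(library)
    except OSError:
        return False
    return True


def read_cpuinfo(path: str = "/proc/cpuinfo") -> str:
    """Read the CPU description, or return an empty string if unavailable."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError:
        return ""  # not Linux, or the file is unreadable


print("AVX flag present:", cpu_has_avx(read_cpuinfo()))
print("libgomp.so.1 loadable:", can_load())
```

If the first line prints True and the second False, the fallback to the slow PairHMM is most likely due to the missing library, and loading the gcc module should be enough.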
** KILL Use subworkflow
CLOSED: [2023-04-02 Sun 18:08]
Our version is more flexible
*** KILL Alignment
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
*** KILL Vep
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
vcf_annotate_ensemblvep
** TODO Annotation with nextflow :annotation:
*** KILL VEP : --gene-phenotype ?
CLOSED: [2023-04-18 mar. 18:32]
Discussed with Alexis: the databases are not up to date
https://www.ensembl.org/info/genome/variation/phenotype/sources_phenotype_documentation.html
*** DONE VEP plugin
CLOSED: [2023-04-18 mar. 18:32]
Clone the git repository containing the plugins
Then use --dir_plugins
*** HOLD Use Alexis's code
*** TODO New version with VEP
Example with --custom
https://www.ensembl.org/info/docs/tools/vep/script/vep_custom.html
**** DONE Add spliceAI
CLOSED: [2023-05-18 Thu 11:02] SCHEDULED: <2023-04-30 Sun>
VEP plugin
***** DONE Download the data
CLOSED: [2023-05-11 Thu 19:01]
Hard to automate: the link is temporary...
***** DONE Plugin
CLOSED: [2023-05-11 Thu 20:16]
***** DONE Split the score into several columns
CLOSED: [2023-05-11 Thu 20:16]
Test with this file, to have one line with an annotation and one line without
#CHROM	POS	ID	REF	ALT
1	9091	.	A	C
1	69091	.	A	C
and
#+begin_src sh
rm -f postvep.tsv* && vep -i testspliceai.vcf.gz -o postvep.tsv --tab  --dir 109 --merged --pick --use_given_ref   --offline  --plugin SpliceAI,snv=spliceai_scores.raw.snv.hg38.vcf.gz,indel=spliceai_scores.raw.indel.hg38.vcf.gz
#+end_src
#+begin_src
$ bgzip postvep.tsv
$ python spliceai.py
$ cat postvep2.tsv
,variation,Location,Allele,Gene,Feature,Feature_type,Consequence,cDNA_position,CDS_position,Protein_position,Amino_acids,Codons,Existing_variation,IMPACT,DISTANCE,STRAND,FLAGS,REFSEQ_MATCH,SOURCE,REFSEQ_OFFSET,SpliceAI_AG,SpliceAI_AL,SpliceAI_DG,SpliceAI_DL
0,1_9091_A/C,1:9091,C,ENSG00000290825,ENST00000456328,Transcript,upstream_gene_variant,-,-,-,-,-,-,MODIFIER,2778,1,-,-,Ensembl,-,,,,
1,1_69091_A/C,1:69091,C,ENSG00000186092,ENST00000641515,Transcript,missense_variant,124,64,22,M/L,Atg/Ctg,-,MODERATE,-,1,-,-,Ensembl,-,0.01,0.00,0.00,0.01
#+end_src
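The splitting step itself is not shown in these notes, so the following is only a guess at its core, not the contents of spliceai.py. It assumes the plugin emits a single pipe-separated prediction of the form `SYMBOL|DS_AG|DS_AL|DS_DG|DS_DL|...` (to be checked against the plugin's documentation) and that a missing annotation appears as `-` or an empty string. The output column names are the ones from the header above; the function name and the sample value are hypothetical.

```python
DELTA_SCORES = ("SpliceAI_AG", "SpliceAI_AL", "SpliceAI_DG", "SpliceAI_DL")


def split_spliceai(pred: str) -> dict:
    """Split a combined SpliceAI prediction into four delta-score columns.

    Assumed input layout: SYMBOL|DS_AG|DS_AL|DS_DG|DS_DL|<position fields...>
    """
    empty = dict.fromkeys(DELTA_SCORES, "")
    if pred in ("", "-"):
        return empty  # variant without a SpliceAI annotation
    fields = pred.split("|")
    if len(fields) < 5:
        return empty  # unexpected layout: keep the columns empty
    return dict(zip(DELTA_SCORES, fields[1:5]))


# Illustrative value only; the trailing position fields are made up.
print(split_spliceai("GENE|0.01|0.00|0.00|0.01|1|2|3|4"))
print(split_spliceai("-"))
```

Keeping the scores as strings avoids reformatting values such as `0.00`; convert them to floats only where filtering needs it.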
Test
cp work/bf/437ae511958509e43072f032f4d495/small.tab.gz tests/vep-spip.tab.gz
cp work/d5/3b1244b5ae83d54409ee0d456e8c55/small_cadd.tab.gz tests/vep-cadd-splice.tab.gz
**** TODO Add LOEUF and pLI
VEP plugin
**** TODO NMD
**** KILL Add LOEUF
CLOSED: [2023-04-19 mer. 16:32]
VEP plugin
**** DONE Spip
CLOSED: [2023-05-01 Mon 23:07] SCHEDULED: <2023-04-30 Sun>
BED does not seem to work well (a region must be defined)
VCF: too much information
Note: several transcripts but identical results. We remove the duplicates
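Removing the duplicated per-transcript rows can be sketched as an order-preserving de-duplication on the variant plus the three SPIP fields. This is a minimal sketch rather than the script used for the annotation file: the `variant` key and the sample rows are hypothetical, while `spipInterp`, `spipScore` and `spipConfidence` are the field names passed to `--custom` in the test command.

```python
def drop_duplicate_annotations(rows, key_fields):
    """Keep the first row for each distinct combination of key_fields."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


KEY = ("variant", "spipInterp", "spipScore", "spipConfidence")

# Hypothetical rows: the same variant reported for two transcripts.
rows = [
    {"variant": "1:100:A:C", "transcript": "T1", "spipInterp": "NTR",
     "spipScore": "0.1", "spipConfidence": "0.0-0.2"},
    {"variant": "1:100:A:C", "transcript": "T2", "spipInterp": "NTR",
     "spipScore": "0.1", "spipConfidence": "0.0-0.2"},
    {"variant": "1:200:G:T", "transcript": "T1", "spipInterp": "Alter",
     "spipScore": "0.9", "spipConfidence": "0.8-1.0"},
]
print(len(drop_duplicate_annotations(rows, KEY)))
```

Rows that differ in any SPIP field are kept, so a variant whose transcripts disagree would still show up more than once and can be inspected.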
***** DONE Separate interpretation, score and confidence interval
CLOSED: [2023-05-01 Mon 23:07] SCHEDULED: <2023-04-30 Sun>
Tests:
in tests/
vep -i 63004925-small.vcf -o postvep.vcf --vcf --fasta genomeRef.fna --dir 109 --merged --pick  --offline --custom ../script/spip_annotation.vcf.gz,SPIP,vcf,exact,0,spipInterp,spipScore,spipConfidence
***** DONE Score
CLOSED: [2023-04-22 Sat 15:30]
**** DONE CADD: replace with VEP plugin
CLOSED: [2023-05-07 Sun 14:45] SCHEDULED: <2023-05-07 Sun>
***** Test
#+begin_src
vep  -i test.vcf  -o lol.vcf --offline --dir  /Work/Projects/bisonex/data/vep/GRCh38/ --merged --vcf --fasta /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna --plugin CADD,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.snv.tsv.gz,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.indel.tsv.gz  --dir_plugins ../VEP_plugins/ -v
#+end_src
Test
#+begin_src sh
vep --id "1  230710048 230710048 A/G 1"   --offline --dir  /Work/Projects/bisonex/data/vep/GRCh38/ --merged --vcf --fasta /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna --plugin CADD,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.snv.tsv.gz,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.indel.tsv.gz  --hgvsg --plugin pLI --plugin LOEUF -o lol
#+end_src
CSQ=G|missense_variant|MODERATE|AGT|ENSG00000135744|Transcript|ENST00000366667|protein_coding|2/5||||843|776|259|M/T|aTg/aCg|||-1||HGNC|HGNC:333||Ensembl||A|A||1:g.230710048A>G|0.347|-0.277922|
This matches https://www.ensembl.org/Homo_sapiens/Tools/VEP/Results?tl=I7ZsIbrj14P6lD43-9115494
***** DONE Use whole genome
CLOSED: [2023-04-29 Sat 15:46]
***** KILL Rename the chromosomes beforehand ...
CLOSED: [2023-05-01 Mon 09:14] SCHEDULED: <2023-04-30 Sun>
Too slow!
- Downloading CADD: 4h20
- Renaming the chromosomes for the SNVs: 6h20
- tabix on the SNVs: job killed after 21h
***** DONE annotate separately and merge the tables
CLOSED: [2023-05-07 Sun 14:45] SCHEDULED: <2023-05-01 Mon>
NB: CADD could be filtered with tabix to restrict it to our variants
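One possible way to do that filtering, sketched under assumptions: collect the variant positions from the VCF into a tab-delimited CHROM/POS list, which recent tabix versions accept through `-R` (1-based positions). The helper names and the `regions.tsv` file name are hypothetical, and the chromosome names must already match those used in the CADD file.

```python
def vcf_to_positions(vcf_lines):
    """Return sorted, unique (chrom, pos) pairs from VCF data lines, skipping headers."""
    positions = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos = line.split("\t")[:2]
        positions.add((chrom, int(pos)))
    return sorted(positions)


def format_regions(positions):
    """Format positions as tab-delimited CHROM, POS lines."""
    return "".join(f"{chrom}\t{pos}\n" for chrom, pos in positions)


# Made-up VCF content, for illustration only.
example_vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t69091\t.\tA\tC\t.\t.\t.\n",
    "1\t9091\t.\tA\tC\t.\t.\t.\n",
    "1\t69091\t.\tA\tG\t.\t.\t.\n",
]
regions_text = format_regions(vcf_to_positions(example_vcf))
# The text could then be written to regions.tsv (hypothetical name) and used as:
#   tabix -R regions.tsv <CADD file>.tsv.gz
```

This only narrows the CADD table by position; matching on REF/ALT would still be left to the annotation step.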
**** DONE clinvar
CLOSED: [2023-04-22 Sat 15:31]
**** KILL Check HGVS results with mutalyzer
CLOSED: [2023-05-01 Mon 09:26]
**** TODO Parallelization
***** HOLD per chromosome with the VEP workflow
https://github.com/Ensembl/ensembl-vep/blob/release/109/nextflow/workflows/run_vep.nf
***** HOLD With the --fork option
**** DONE Use the nf-core version of VEP
CLOSED: [2023-05-13 Sat 18:27] SCHEDULED: <2023-05-07 Sun>
**** DONE OMIM
CLOSED: [2023-05-08 Mon 15:02] SCHEDULED: <2023-05-01 Mon>
**** TODO Grantham
**** TODO ACMG incidental
**** TODO Gnomad ?
**** DONE Filter after VEP with filter_vep
CLOSED: [2023-04-29 Sat 15:47]
Not tested
*** TODO Compare the annotations on 63003856
**** Rerun the new pipeline
*** HOLD Old version
**** TODO HGVS
**** TODO Filter after VEP
**** TODO OMIM
**** TODO clinvar
**** TODO ACMG incidental
**** TODO Grantham
**** KILL LRG
CLOSED: [2023-04-18 mar. 17:22] SCHEDULED: <2023-04-18 Tue>
Discussed with Alexis: no longer up to date
**** TODO Gnomad
** DONE Port Alexis's exact version to Helios
CLOSED: [2023-01-14 Sat 17:56]
Branch "prod"
** STRT Test Alexis's version with Nix
*** DONE Add clinvar
CLOSED: [2022-11-13 Sun 19:37]
*** DONE Alignment
CLOSED: [2022-11-13 Sun 12:52]
*** DONE Haplotype caller
CLOSED: [2022-11-13 Sun 13:00]
*** TODO Filter
- [X] depth
- [ ] common snp not path
Problem with the list of IDs
**** TODO var

Replacement in projects/bisonex.org at line 30 [19.35]

B:BD[21.28682] → [21.28682:29386]

B:BD[21.29386] → [22.41246:48734]

                           |
| --gcs-max-retries 20                                                                                          | --gcs-max-retries 20                                                                    |
| --gcs-project-for-requester-pays                                                                              | --gcs-project-for-requester-pays                                                        |
| --disable-tool-default-read-filters false	PN:GATK ApplyBQSR                                                 | --disable-tool-default-read-filters false     PN:GATK ApplyBQSR                         |
****** KILL Check sha256sum
CLOSED: [2023-01-24 Tue 23
:00]
alignment: different
****** KILL Compare the BAMs
CLOSED: [2023-01-25 Wed 21:58]
/Work/Users/apraga/bisonex/script/files〉picard CompareSAMs LENIENT_LOW_MQ_ALIGNMENT=true LENIENT_DUP=true tmp_63003856_S135/63003856_S135.bam /Work/Groups/bisonex/ref/tmp_63003856_S135/63003856_S135.bam O=compare-bam.tsv
picard CompareSAMs -LENIENT_LOW_MQ_ALIGNMENT true -LENIENT_DUP true tmp_63003856_S135/63003856_S135.bam /Work/Groups/bisonex/ref/tmp_63003856_S135/63003856_S135.bam -O compare-bam.tsv
VN Program Record attribute differs.
File 1: 1.13
File 2: 1.10
SAM files differ.
[Tue Jan 24 23:12:50 CET 2023] picard.sam.CompareSAMs done. Elapsed time: 7.32 minutes.
***** DONE Rerun with the same samtools version
CLOSED: [2023-01-25 Wed 21:58]
No impact
***** KILL Compare the output TSVs
CLOSED: [2023-05-23 Tue 08:45]
***** KILL Look at where the differing variants are
CLOSED: [2023-05-23 Tue 08:45]
** TODO GIAB Validation :giab:
https://github.com/ga4gh/benchmarking-tools
Prerequisites:
- [[*hap.py][hap.py]]
- [[*NA12878][NA12878]]
*** DONE GIAB : exome :giab:
CLOSED: [2023-04-16 Sun 16:33]
**** Notes
https://github.com/genome-in-a-bottle/giab_FAQ
**** Summarized results :resultats:
***** DONE HG001 :
CLOSED: [2023-04-06 Thu 21:41] SCHEDULED: <2023-04-02 Sun>
| Data    | Algorithm | Type    | Recall | Precision |
|---------+-----------+---------+--------+-----------|
| Bisonex | happy     | SNP     | 0.8552 |    0.9708 |
| Bisonex | vcfeval   | SNP     | 0.8547 |    0.9727 |
| Bisonex | happy     | INDEL   | 0.7105 |    0.6929 |
| Bisonex | vcfeval   | Non-SNP | 0.7139 |    0.7136 |
|---------+-----------+---------+--------+-----------|
| GIAB    | happy     | INDEL   | 0.7551 |    0.7415 |
| GIAB    | vcfeval   | INDEL   | 0.7598 |    0.7445 |
| GIAB    | happy     | SNP     | 0.8937 |    0.9621 |
| GIAB    | vcfeval   | SNP     | 0.8937 |    0.9621 |
***** DONE HG002, HG003, HG004
CLOSED: [2023-04-14 Fri 11:36] SCHEDULED: <2023-04-14 Fri>
Capture Agilent
| Patient | Algorithm | Type  |   Recall | Precision |
| HG002   | happy     | INDEL | 0.851495 |  0.923616 |
| HG002   | happy     | SNP   | 0.905926 |  0.992158 |
| HG002   | vcfeval   | indel |   0.8523 |    0.9212 |
| HG002   | vcfeval   | snp   |   0.9054 |    0.9934 |
| HG003   | vcfeval   | indel |   0.8363 |    0.9115 |
| HG003   | vcfeval   | snp   |   0.9069 |    0.9928 |
| HG003   | happy     | INDEL | 0.838521 |  0.917296 |
| HG003   | happy     | SNP   | 0.907466 |  0.991204 |
| HG004   | happy     | INDEL | 0.856835 |  0.925086 |
| HG004   | happy     | SNP   | 0.905067 |  0.992704 |
| HG004   | vcfeval   | indel |   0.8568 |    0.9240 |
| HG004   | vcfeval   | snp   |   0.9048 |    0.9938 |
**** DONE Download the data with Nextflow
CLOSED: [2023-04-16 Sun 16:32]
***** DONE Rename the chromosomes
CLOSED: [2023-02-17 Fri 19:30]
****** DONE NCBI reference genome
CLOSED: [2023-02-25 Sat 19:46]
****** DONE BED with the exons
CLOSED: [2023-03-29 Wed 23:04]
****** DONE hg19
CLOSED: [2023-02-26 Sun 22:37]
****** DONE hg38
CLOSED: [2023-03-29 Wed 23:04]
- [X] Download hg19: ok
- [X] convert the BED to an interval list
picard BedToIntervalList -I exons_illumina.bed  -O exons_illumina.list -SD  ../../genome/GRCh19/genomeRef.dict
- [X] then lift it over to hg38
picard LiftOverIntervalList -I exons_illumina.list  -O exons_illumina_hg38.list --CHAIN hg19ToHg38.over.chain -SD  ../../genome/GRCh38.p13/genomeRef.dict
- [X] then convert back to BED
***** KILL Reference VCF
CLOSED: [2023-04-16 Sun 16:32]
****** TODO NA12878 (HG001)
******* DONE Fastq HiSeq
CLOSED: [2023-02-25 Sat 19:46]
We take the HiSeq data, which is probably what Centogène uses:
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/
We used the "trimmed" data (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1069-7), i.e. with the fragments shorter than the read length removed.
Information:
- https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/Garvan_NA12878_HG001_HiSeq_Exome.README
- Sequencer: HiSeq2500
- kit: Nextera Rapid Capture Exome and Expanded Exome
There are 2 samples (NIST7035 and NIST7086), each on 2 lanes -> to be concatenated
NB: list of Illumina technologies https://www.illumina.com/systems/sequencing-platforms.html
HiSeq came after the NextSeq 550
******* TODO HiSeq FASTQ without trimming
SCHEDULED: <2023-05-25 Thu>
******* DONE Capture : Exons (bed)
CLOSED: [2023-02-25 Sat 19:46]
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/nexterarapidcapture_expandedexome_targetedregions.bed.gz
******* DONE Bed, vcf
CLOSED: [2023-02-24 Fri 23:45]
****** DONE Ashkenazim trio HG002, HG003, HG004
CLOSED: [2023-04-06 Thu 21:43] SCHEDULED: <2023-04-01 Sat>
****** KILL Chinese trio HG005, 6, 7
CLOSED: [2023-04-16 Sun 16:32]
***** KILL Fastq :fastq:
CLOSED: [2023-04-16 Sun 16:32]
****** DONE NA12878 (HG001)
CLOSED: [2023-02-25 Sat 19:46]
******* DONE Fastq HiSeq
CLOSED: [2023-02-25 Sat 19:46]
We take the HiSeq data, which is probably what Centogène uses:
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/
We used the "trimmed" data (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1069-7), i.e. with the fragments shorter than the read length removed.
Information:
- https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/Garvan_NA12878_HG001_HiSeq_Exome.README
- Sequencer: HiSeq2500
- kit: Nextera Rapid Capture Exome and Expanded Exome
There are 2 samples (NIST7035 and NIST7086), each on 2 lanes -> to be concatenated
NB: list of Illumina technologies https://www.illumina.com/systems/sequencing-platforms.html
HiSeq came after the NextSeq 550
******* DONE Capture : Exons (bed)
CLOSED: [2023-02-25 Sat 19:46]
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/nexterarapidcapture_expandedexome_targetedregions.bed.gz
****** DONE Ashkenazim trio HG002, HG003, HG004
CLOSED: [2023-04-15 Sat 23:24] SCHEDULED: <2023-04-05 Wed>
******* DONE Capture
CLOSED: [2023-04-15 Sat 23:24]
https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/OsloUniversityHospital_Exome_GATK_jointVC_11242015/wex_Agilent_SureSelect_v05_b37.baits.slop50.merged.list
******* DONE Capture Agilent
CLOSED: [2023-04-15 Sat 23:24]
******* DONE BAM from the FASTQs
CLOSED: [2023-04-15 Sat 23:24]
Bam + index + checksum
https://raw.githubusercontent.com/genome-in-a-bottle/giab_data_indexes/master/AshkenazimTrio/alignment.index.AJtrio_OsloUniversityHospital_IlluminaExome_bwamem_GRCh37_11252015
****** KILL Chinese trio
CLOSED: [2023-04-16 Sun 16:32]
Whole exome for HG005 only
******* KILL HG005
CLOSED: [2023-04-16 Sun 16:32]
https://raw.githubusercontent.com/genome-in-a-bottle/giab_data_indexes/master/ChineseTrio/alignment.index.Chinesetrio_HG005_OsloUniversityHospital_IlluminaExome_bwamem_GRCh37_11252015
**** DONE NA12878 / HG001 :na12878:
CLOSED: [2023-04-15 Sat 23:53]
***** DONE Discussion alexis : Mail
CLOSED: [2023-03-29 Wed 22:40]
As a reminder, with patient NA12878 compared by hap.py against the Genome In A Bottle VCF ("gold" standard), we had
- sensitivity (=reca

[21.28682]

[22.48734]

                           |
| --gcs-max-retries 20                                                                                          | --gcs-max-retries 20                                                                    |
| --gcs-project-for-requester-pays                                                                              | --gcs-project-for-requester-pays                                                        |
| --disable-tool-default-read-filters false	PN:GATK ApplyBQSR                                                 | --disable-tool-default-read-filters false     PN:GATK ApplyBQSR                         |
****** KILL Check sha256sum
CLOSED: [2023-01-24 Tue 23:00]
alignment: different
****** KILL Compare the BAMs
CLOSED: [2023-01-25 Wed 21:58]
/Work/Users/apraga/bisonex/script/files〉picard CompareSAMs LENIENT_LOW_MQ_ALIGNMENT=true LENIENT_DUP=true tmp_63003856_S135/63003856_S135.bam /Work/Groups/bisonex/ref/tmp_63003856_S135/63003856_S135.bam O=compare-bam.tsv
picard CompareSAMs -LENIENT_LOW_MQ_ALIGNMENT true -LENIENT_DUP true tmp_63003856_S135/63003856_S135.bam /Work/Groups/bisonex/ref/tmp_63003856_S135/63003856_S135.bam -O compare-bam.tsv
VN Program Record attribute differs.
File 1: 1.13
File 2: 1.10
SAM files differ.
[Tue Jan 24 23:12:50 CET 2023] picard.sam.CompareSAMs done. Elapsed time: 7.32 minutes.
***** DONE Rerun with the same samtools version
CLOSED: [2023-01-25 Wed 21:58]
No impact
***** KILL Compare the output TSVs
CLOSED: [2023-05-23 Tue 08:45]
***** KILL Look at where the differing variants are
CLOSED: [2023-05-23 Tue 08:45]
** TODO GIAB Validation :giab:
https://github.com/ga4gh/benchmarking-tools
Prerequisites:
- [[*hap.py][hap.py]]
- [[*NA12878][NA12878]]
*** DONE GIAB : exome :giab:
CLOSED: [2023-04-16 Sun 16:33]
**** Notes
https://github.com/genome-in-a-bottle/giab_FAQ
**** Summarized results :resultats:
***** DONE HG001 :
CLOSED: [2023-04-06 Thu 21:41] SCHEDULED: <2023-04-02 Sun>
| Data    | Algorithm | Type    | Recall | Precision |
|---------+-----------+---------+--------+-----------|
| Bisonex | happy     | SNP     | 0.8552 |    0.9708 |
| Bisonex | vcfeval   | SNP     | 0.8547 |    0.9727 |
| Bisonex | happy     | INDEL   | 0.7105 |    0.6929 |
| Bisonex | vcfeval   | Non-SNP | 0.7139 |    0.7136 |
|---------+-----------+---------+--------+-----------|
| GIAB    | happy     | INDEL   | 0.7551 |    0.7415 |
| GIAB    | vcfeval   | INDEL   | 0.7598 |    0.7445 |
| GIAB    | happy     | SNP     | 0.8937 |    0.9621 |
| GIAB    | vcfeval   | SNP     | 0.8937 |    0.9621 |
***** DONE HG002, HG003, HG004
CLOSED: [2023-04-14 Fri 11:36] SCHEDULED: <2023-04-14 Fri>
Capture Agilent
| Patient | Algorithm | Type  |   Recall | Precision |
| HG002   | happy     | INDEL | 0.851495 |  0.923616 |
| HG002   | happy     | SNP   | 0.905926 |  0.992158 |
| HG002   | vcfeval   | indel |   0.8523 |    0.9212 |
| HG002   | vcfeval   | snp   |   0.9054 |    0.9934 |
| HG003   | vcfeval   | indel |   0.8363 |    0.9115 |
| HG003   | vcfeval   | snp   |   0.9069 |    0.9928 |
| HG003   | happy     | INDEL | 0.838521 |  0.917296 |
| HG003   | happy     | SNP   | 0.907466 |  0.991204 |
| HG004   | happy     | INDEL | 0.856835 |  0.925086 |
| HG004   | happy     | SNP   | 0.905067 |  0.992704 |
| HG004   | vcfeval   | indel |   0.8568 |    0.9240 |
| HG004   | vcfeval   | snp   |   0.9048 |    0.9938 |
**** DONE Download the data with Nextflow
CLOSED: [2023-04-16 Sun 16:32]
***** DONE Rename the chromosomes
CLOSED: [2023-02-17 Fri 19:30]
****** DONE NCBI reference genome
CLOSED: [2023-02-25 Sat 19:46]
****** DONE BED with the exons
CLOSED: [2023-03-29 Wed 23:04]
****** DONE hg19
CLOSED: [2023-02-26 Sun 22:37]
****** DONE hg38
CLOSED: [2023-03-29 Wed 23:04]
- [X] Download hg19: ok
- [X] convert the BED to an interval list
picard BedToIntervalList -I exons_illumina.bed  -O exons_illumina.list -SD  ../../genome/GRCh19/genomeRef.dict
- [X] then lift it over to hg38
picard LiftOverIntervalList -I exons_illumina.list  -O exons_illumina_hg38.list --CHAIN hg19ToHg38.over.chain -SD  ../../genome/GRCh38.p13/genomeRef.dict
- [X] then convert back to BED
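The notes do not record how the last step (interval list back to BED) was done. As an illustration only, here is a minimal sketch of that conversion, assuming the usual conventions: a Picard interval list has `@`-prefixed header lines and 1-based closed coordinates, while BED is 0-based and half-open. The function name and the example interval are hypothetical.

```python
def interval_list_to_bed(lines):
    """Convert interval_list lines (1-based, closed) to BED lines (0-based, half-open)."""
    bed = []
    for line in lines:
        if not line.strip() or line.startswith("@"):  # skip the SAM-style header
            continue
        contig, start, end, strand, name = line.rstrip("\n").split("\t")[:5]
        # Only the start shifts by one; the closed end equals the half-open end.
        bed.append(f"{contig}\t{int(start) - 1}\t{end}\t{name}\t0\t{strand}")
    return bed


# Made-up interval, for illustration only.
example = [
    "@HD\tVN:1.6\n",
    "@SQ\tSN:chr1\tLN:248956422\n",
    "chr1\t69091\t70008\t+\texon_1\n",
]
bed_lines = interval_list_to_bed(example)
```

Picard also provides an `IntervalListToBed` tool for the same purpose; the sketch only makes the coordinate shift explicit.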
***** KILL Reference VCF
CLOSED: [2023-04-16 Sun 16:32]
****** TODO NA12878 (HG001)
******* DONE Fastq HiSeq
CLOSED: [2023-02-25 Sat 19:46]
We take the HiSeq data, which is probably what Centogène uses:
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/
We used the "trimmed" data (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1069-7), i.e. with the fragments shorter than the read length removed.
Information:
- https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/Garvan_NA12878_HG001_HiSeq_Exome.README
- Sequencer: HiSeq2500
- kit: Nextera Rapid Capture Exome and Expanded Exome
There are 2 samples (NIST7035 and NIST7086), each on 2 lanes -> to be concatenated
NB: list of Illumina technologies https://www.illumina.com/systems/sequencing-platforms.html
HiSeq came after the NextSeq 550
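For the lane concatenation mentioned above: a gzip stream may contain several members back to back, so compressed FASTQ files can be joined at the byte level (the equivalent of `cat`) without recompressing. A small sketch with made-up reads; the lane order must be the same for the R1 and R2 files so that read pairs stay aligned.

```python
import gzip


def concatenate_gzip(chunks):
    """Concatenate gzip-compressed byte strings without recompressing them."""
    return b"".join(chunks)


# Hypothetical reads standing in for two lanes of the same sample.
lane1 = gzip.compress(b"@read1\nACGT\n+\nIIII\n")
lane2 = gzip.compress(b"@read2\nTTGA\n+\nIIII\n")
merged = concatenate_gzip([lane1, lane2])

# gzip.decompress handles multi-member data, so both lanes come back in order.
reads = gzip.decompress(merged).decode()
```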
******* TODO HiSeq FASTQ without trimming
******* DONE Capture : Exons (bed)
CLOSED: [2023-02-25 Sat 19:46]
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/nexterarapidcapture_expandedexome_targetedregions.bed.gz
******* DONE Bed, vcf
CLOSED: [2023-02-24 Fri 23:45]
****** DONE Ashkenazim trio HG002, HG003, HG004
CLOSED: [2023-04-06 Thu 21:43] SCHEDULED: <2023-04-01 Sat>
****** KILL Chinese trio HG005, 6, 7
CLOSED: [2023-04-16 Sun 16:32]
***** KILL Fastq :fastq:
CLOSED: [2023-04-16 Sun 16:32]
****** DONE NA12878 (HG001)
CLOSED: [2023-02-25 Sat 19:46]
******* DONE Fastq HiSeq
CLOSED: [2023-02-25 Sat 19:46]
We take the HiSeq data, which is probably what Centogène uses:
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/
We used the "trimmed" data (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1069-7), i.e. with the fragments shorter than the read length removed.
Information:
- https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/Garvan_NA12878_HG001_HiSeq_Exome.README
- Sequencer: HiSeq2500
- kit: Nextera Rapid Capture Exome and Expanded Exome
There are 2 samples (NIST7035 and NIST7086), each on 2 lanes -> to be concatenated
NB: list of Illumina technologies https://www.illumina.com/systems/sequencing-platforms.html
HiSeq came after the NextSeq 550
******* DONE Capture : Exons (bed)
CLOSED: [2023-02-25 Sat 19:46]
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/nexterarapidcapture_expandedexome_targetedregions.bed.gz
****** DONE Ashkenazim trio HG002, HG003, HG004
CLOSED: [2023-04-15 Sat 23:24] SCHEDULED: <2023-04-05 Wed>
******* DONE Capture
CLOSED: [2023-04-15 Sat 23:24]
https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/OsloUniversityHospital_Exome_GATK_jointVC_11242015/wex_Agilent_SureSelect_v05_b37.baits.slop50.merged.list
******* DONE Capture Agilent
CLOSED: [2023-04-15 Sat 23:24]
******* DONE BAM from the FASTQs
CLOSED: [2023-04-15 Sat 23:24]
Bam + index + checksum
https://raw.githubusercontent.com/genome-in-a-bottle/giab_data_indexes/master/AshkenazimTrio/alignment.index.AJtrio_OsloUniversityHospital_IlluminaExome_bwamem_GRCh37_11252015
****** KILL Chinese trio
CLOSED: [2023-04-16 Sun 16:32]
Whole exome for HG005 only
******* KILL HG005
CLOSED: [2023-04-16 Sun 16:32]
https://raw.githubusercontent.com/genome-in-a-bottle/giab_data_indexes/master/ChineseTrio/alignment.index.Chinesetrio_HG005_OsloUniversityHospital_IlluminaExome_bwamem_GRCh37_11252015
**** DONE NA12878 / HG001 :na12878:
CLOSED: [2023-04-15 Sat 23:53]
***** DONE Discussion alexis : Mail
CLOSED: [2023-03-29 Wed 22:40]
As a reminder, with patient NA12878 compared by hap.py against the Genome In A Bottle VCF ("gold" standard), we had
- sensitivity (=reca

Replacement in projects/bisonex.org at line 36 [19.35]

B:BD[23.16805] → [23.16805:17716]

B:BD[23.17716] → [20.16556:23837]

0.755081          0.741493
  SNP   PASS        46032     41138      4894        47694      1622       4930    362     31       0.893683          0.962071
 METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
       0.285816         0.748225                     NaN                     NaN                   1.617499                   2.524051
       0.103367         0.926617                2.529552                2.412446                   1.620686                   1.688868
****** DONE Statistics with vcfeval
CLOSED: [2023-04-02 Sun 17:10] SCHEDULED: <2023-04-01 Sat>
***** DONE Final results
CLOSED: [2023-04-14 Fri 09:53]
GIAB version with hap.py + vcfeval:
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/compareVCF.nf -profile standard,helios -resume --outdir=compareNA12878-giab
 --test.compare=happy,vcfeval  --test.query=giab --test.id=HG001
#+end_src
Our version with hap.py + vcfeval
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/compareVCF.nf -profile standard,helios -resume --outdir=compareNA12878 --test.vcfeval --test.query="out/NA12878_NIST/variantCalling/haplotypecaller/NA12878_NIST.vcf.gz" --test.happy
#+end_src
We concatenate the CSVs, adding columns that indicate the data source and the algorithm
# awk '{if (NR==1) {print "Data,Algorithm," $0} else {print "bisonex,happy," $0}}' compareNA12878/happy/NA12878.summary.csv
compareNA12878/happy/NA12878.summary.csv
| Type  | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL    |        4871 |     3461 |     1410 |        7048 |     1554 |      1987 |   193 |   346 |      0.710532 |         0.692946 |       0.281924 |        0.701629 |                        |                        |        1.6174985978687606 |        3.0674091441969518 |
| INDEL | PASS   |        4871 |     3461 |     1410 |        7048 |     1554 |      1987 |   193 |   346 |      0.710532 |         0.692946 |       0.281924 |        0.701629 |                        |                        |        1.6174985978687606 |        3.0674091441969518 |
| SNP   | ALL    |       46032 |    39367 |     6665 |       44599 |     1186 |      4042 |   304 |    30 |      0.855209 |         0.970757 |        0.09063 |        0.909327 |      2.529551552318896 |      2.402150701647346 |        1.6206857273037931 |        1.6273423688862698 |
| SNP   | PASS   |       46032 |    39367 |     6665 |       44599 |     1186 |      4042 |   304 |    30 |      0.855209 |         0.970757 |        0.09063 |        0.909327 |      2.529551552318896 |      2.402150701647346 |        1.6206857273037931 |        1.6273423688862698 |
compareNA12878/vcfeval/NA12878.summary.txt
| Threshold | True-pos-baseline | True-pos-call | False-pos | False-neg | Precision | Sensitivity | F-measure |
|-----------+-------------------+---------------+-----------+-----------+-----------+-------------+-----------|
| 3.000     |             42789 |         42416 |      2598 |      8080 |    0.9423 |      0.8412 |    0.8889 |
| None      |             42798 |         42425 |      2616 |      8071 |    0.9419 |      0.8413 |    0.8888 |
Indels at the lowest threshold: zcat NA12878.non_snp_roc.tsv.gz
Careful: swap the columns to get recall first, then precision ($7 is recall, $6 is precision)!
 zcat NA12878.non_snp_roc.tsv.gz  | tail -n 1 | awk '{print $7 $6}'
0.71390.7136
SNPs at the lowest threshold: zcat NA12878.snp_roc.tsv.gz
Careful: swap the columns to get recall first, then precision ($7 is recall, $6 is precision)!
$ zcat NA12878.snp_roc.tsv.gz  | tail -n 1 | awk '{print $7 $6}'
0.85470.9727
compareNA12878-giab/vcfeval/NA12878.summary.txt
| Threshold | True-pos-baseline | True-pos-call | False-pos | False-neg | Precision | Sensitivity | F-measure |
| 1.000     |             44812 |         44812 |      2878 |      6057 |    0.9397 |      0.8809 |    0.9093 |
| None      |             44813 |         44813 |      2882 |      6056 |    0.9396 |      0.8809 |    0.9093 |
SNP:
$ zcat NA12878.snp_roc.tsv.gz  | tail -n 1 | awk '{print $7 $6}'
0.89370.9621
indel
$ zcat NA12878.non_snp_roc.tsv.gz  | tail -n 1 | awk '{print $7 $6}'
0.75980.7445
compareNA12878-giab/happy/NA12878.summary.csv
| Type  | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
|-------+--------+-------------+----------+----------+-------------+----------+-----------+-------+-------+---------------+------------------+----------------+-----------------+------------------------+------------------------+---------------------------+---------------------------|
| INDEL | ALL    |        4871 |     3678 |     1193 |        7036 |     1299 |      2011 |   208 |   217 |      0.755081 |         0.741493 |       0.285816 |        0.748225 |                        |                        |        1.6174985978687606 |        2.5240506329113925 |
| INDEL | PASS   |        4871 |     3678 |     1193 |        7036 |     1299 |      2011 |   208 |   217 |      0.755081 |         0.741493 |       0.285816 |        0.748225 |                        |                        |        1.6174985978687606 |        2.5240506329113925 |
| SNP   | ALL    |       46032 |    41138 |     4894 |       47694 |     1622 |      4930 |   362 |    31 |      0.893683 |         0.962071 |       0.103367 |        0.926617 |      2.529551552318896 |     2.4124463519313304 |        1.6206857273037931 |        1.6888675840288743 |
| SNP   | PASS   |       46032 |    41138 |     4894 |       47694 |     1622 |      4930 |   362 |    31 |      0.893683 |         0.962071 |       0.103367 |        0.926617 |      2.529551552318896 |     2.4124463519313304 |        1.6206857273037931 |         1.688867584028874 |
***** TODO Results without trimming
SCHEDULED: <2023-05-25 Thu>
**** DONE HG002 :hg002:
CLOSED: [2023-04-14 Fri 09:54] SCHEDULED: <2023-04-10 Mon>
#+begin_src
    NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/giabFastq.nf -profile standard,helios
    NXF_OPTS=-D"user.name=${USER}" nextflow run main.nf -profile standard,helios -resume --input="/Work/Groups/bisonex/data/giab/GRCh38/HG002_{1,2}.fq.gz" --test.id=HG002
    # Only the capture file differs. Results are better using the capture file given by Agilent, stored in data/
    NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/compareVCF.nf -profile standard,helios -resume --outdir=compareHG002 --test.id=HG002 --test.query=out/HG002_1/variantCalling/haplotypecaller/HG002_1.vcf.gz  --test.compare=vcfeval,happy --test.capture=data/AgilentSureSelectv05_hg38.bed
#
#+end_src
***** DONE Poor results
CLOSED: [2023-04-14 Fri 09:42]
with vcfeval
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    0.000              24585          24390      10060      39415     0.7080       0.3841     0.4980
     None              24585          24390      10060      39415     0.7080       0.3841     0.4980
Is the variantCalling output actually hap.py's output???
Rerunning...
***** DONE Check that the VCF is in hg38
CLOSED: [2023-04-12 Wed 10:33] SCHEDULED: <2023-04-12 Wed>
***** KILL Capture in hg19?
CLOSED: [2023-04-13 Thu 09:46] SCHEDULED: <2023-04-12 Wed>
***** KILL Really the capture file, or the region of interest?
CLOSED: [2023-04-13 Thu 09:45] SCHEDULED: <2023-04-12 Wed>
"target region" +/- 50bp
[[https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/OsloUniversityHospital_Exome_GATK_jointVC_11242015/README.txt][README]]
 list file describing the variant calling regions (target regions extended with 50 bp on each end)
***** DONE .bed provided by Agilent:

[23.16805]

[20.23837]

0.755081          0.741493
  SNP   PASS        46032     41138      4894        47694      1622       4930    362     31       0.893683          0.962071
 METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
       0.285816         0.748225                     NaN                     NaN                   1.617499                   2.524051
       0.103367         0.926617                2.529552                2.412446                   1.620686                   1.688868
****** DONE Statistics with vcfeval
CLOSED: [2023-04-02 Sun 17:10] SCHEDULED: <2023-04-01 Sat>
***** DONE Final results
CLOSED: [2023-04-14 Fri 09:53]
GIAB version with hap.py + vcfeval:
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/compareVCF.nf -profile standard,helios -resume --outdir=compareNA12878-giab --test.compare=happy,vcfeval  --test.query=giab --test.id=HG001
#+end_src
Our version with hap.py + vcfeval
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/compareVCF.nf -profile standard,helios -resume --outdir=compareNA12878 --test.vcfeval --test.query="out/NA12878_NIST/variantCalling/haplotypecaller/NA12878_NIST.vcf.gz" --test.happy
#+end_src
We concatenate the CSVs, adding columns that indicate the data source and the algorithm
# awk '{if (NR==1) {print "Data,Algorithm," $0} else {print "bisonex,happy," $0}}' compareNA12878/happy/NA12878.summary.csv
compareNA12878/happy/NA12878.summary.csv
| Type  | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL    |        4871 |     3461 |     1410 |        7048 |     1554 |      1987 |   193 |   346 |      0.710532 |         0.692946 |       0.281924 |        0.701629 |                        |                        |        1.6174985978687606 |        3.0674091441969518 |
| INDEL | PASS   |        4871 |     3461 |     1410 |        7048 |     1554 |      1987 |   193 |   346 |      0.710532 |         0.692946 |       0.281924 |        0.701629 |                        |                        |        1.6174985978687606 |        3.0674091441969518 |
| SNP   | ALL    |       46032 |    39367 |     6665 |       44599 |     1186 |      4042 |   304 |    30 |      0.855209 |         0.970757 |        0.09063 |        0.909327 |      2.529551552318896 |      2.402150701647346 |        1.6206857273037931 |        1.6273423688862698 |
| SNP   | PASS   |       46032 |    39367 |     6665 |       44599 |     1186 |      4042 |   304 |    30 |      0.855209 |         0.970757 |        0.09063 |        0.909327 |      2.529551552318896 |      2.402150701647346 |        1.6206857273037931 |        1.6273423688862698 |
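As a sanity check on the table above, the hap.py metrics can be recomputed from the raw counts. This is a sketch, not hap.py's own code: the summary has no QUERY.TP column, so it is derived here as QUERY.TOTAL - QUERY.UNK - QUERY.FP, an assumption that reproduces the reported values for the INDEL row.

```python
def happy_metrics(truth_tp, truth_fn, query_total, query_fp, query_unk):
    """Recompute recall, precision, Frac_NA and F1 from hap.py summary counts."""
    # Assumption: query-side true positives are the calls left once the
    # unknown (outside the confident regions) and false-positive calls are removed.
    query_tp = query_total - query_unk - query_fp
    recall = truth_tp / (truth_tp + truth_fn)
    precision = query_tp / (query_tp + query_fp)
    frac_na = query_unk / query_total
    f1 = 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision, "frac_na": frac_na, "f1": f1}


# Counts from the INDEL / ALL row of compareNA12878/happy/NA12878.summary.csv
indel = happy_metrics(truth_tp=3461, truth_fn=1410,
                      query_total=7048, query_fp=1554, query_unk=1987)
```

Because precision ignores the QUERY.UNK calls, a high Frac_NA (0.28 for indels here) means a large share of the calls is simply not assessed.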
compareNA12878/vcfeval/NA12878.summary.txt
| Threshold | True-pos-baseline | True-pos-call | False-pos | False-neg | Precision | Sensitivity | F-measure |
|-----------+-------------------+---------------+-----------+-----------+-----------+-------------+-----------|
| 3.000     |             42789 |         42416 |      2598 |      8080 |    0.9423 |      0.8412 |    0.8889 |
| None      |             42798 |         42425 |      2616 |      8071 |    0.9419 |      0.8413 |    0.8888 |
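The vcfeval summary columns are related in the same way. A short sketch, assuming precision is computed on the call side (True-pos-call) and sensitivity on the baseline side (True-pos-baseline), which matches the 3.000 threshold row above:

```python
def vcfeval_metrics(tp_baseline, tp_call, false_pos, false_neg):
    """Recompute precision, sensitivity and F-measure from vcfeval summary counts."""
    precision = tp_call / (tp_call + false_pos)
    sensitivity = tp_baseline / (tp_baseline + false_neg)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return precision, sensitivity, f_measure


# Counts from the "3.000" row of compareNA12878/vcfeval/NA12878.summary.txt
precision, sensitivity, f_measure = vcfeval_metrics(
    tp_baseline=42789, tp_call=42416, false_pos=2598, false_neg=8080
)
```

The two true-positive counts differ because the same variant can be represented differently in the baseline and in the calls.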
Indels at the lowest threshold: zcat NA12878.non_snp_roc.tsv.gz
Careful: swap the columns to get recall first, then precision ($7 is recall, $6 is precision)!
 zcat NA12878.non_snp_roc.tsv.gz  | tail -n 1 | awk '{print $7 $6}'
0.71390.7136
SNPs at the lowest threshold: zcat NA12878.snp_roc.tsv.gz
Careful: swap the columns to get recall first, then precision ($7 is recall, $6 is precision)!
$ zcat NA12878.snp_roc.tsv.gz  | tail -n 1 | awk '{print $7 $6}'
0.85470.9727
compareNA12878-giab/vcfeval/NA12878.summary.txt
| Threshold | True-pos-baseline | True-pos-call | False-pos | False-neg | Precision | Sensitivity | F-measure |
| 1.000     |             44812 |         44812 |      2878 |      6057 |    0.9397 |      0.8809 |    0.9093 |
| None      |             44813 |         44813 |      2882 |      6056 |    0.9396 |      0.8809 |    0.9093 |
SNP:
$ zcat NA12878.snp_roc.tsv.gz  | tail -n 1 | awk '{print $7 $6}'
0.89370.9621
indel
$ zcat NA12878.non_snp_roc.tsv.gz  | tail -n 1 | awk '{print $7 $6}'
0.75980.7445
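`awk '{print $7 $6}'` prints the two fields with no separator, which is why the outputs above read as a single number such as `0.75980.7445`. A minimal sketch of the same extraction that returns the two values separately, using the column positions from the awk calls (6 for precision, 7 for sensitivity). The header line and the rows in the example are made up for illustration and are an assumption about the ROC file layout.

```python
def last_recall_precision(roc_lines, precision_col=6, sensitivity_col=7):
    """Return (recall, precision) from the last data row of a vcfeval ROC table.

    Column numbers are 1-based, like awk's $6 and $7.
    """
    data = [line for line in roc_lines if line.strip() and not line.startswith("#")]
    fields = data[-1].split()
    return float(fields[sensitivity_col - 1]), float(fields[precision_col - 1])


# Hypothetical ROC content, not taken from the real files.
roc = [
    "#score\ttrue_positives_baseline\tfalse_positives\ttrue_positives_call"
    "\tfalse_negatives\tprecision\tsensitivity\tf_measure\n",
    "10.0\t100.00\t5\t100.00\t900\t0.9524\t0.1000\t0.1810\n",
    "0.0\t500.00\t200\t500.00\t500\t0.7143\t0.5000\t0.5882\n",
]
recall, precision_at_lowest = last_recall_precision(roc)
```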
compareNA12878-giab/happy/NA12878.summary.csv
| Type  | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
|-------+--------+-------------+----------+----------+-------------+----------+-----------+-------+-------+---------------+------------------+----------------+-----------------+------------------------+------------------------+---------------------------+---------------------------|
| INDEL | ALL    |        4871 |     3678 |     1193 |        7036 |     1299 |      2011 |   208 |   217 |      0.755081 |         0.741493 |       0.285816 |        0.748225 |                        |                        |        1.6174985978687606 |        2.5240506329113925 |
| INDEL | PASS   |        4871 |     3678 |     1193 |        7036 |     1299 |      2011 |   208 |   217 |      0.755081 |         0.741493 |       0.285816 |        0.748225 |                        |                        |        1.6174985978687606 |        2.5240506329113925 |
| SNP   | ALL    |       46032 |    41138 |     4894 |       47694 |     1622 |      4930 |   362 |    31 |      0.893683 |         0.962071 |       0.103367 |        0.926617 |      2.529551552318896 |     2.4124463519313304 |        1.6206857273037931 |        1.6888675840288743 |
| SNP   | PASS   |       46032 |    41138 |     4894 |       47694 |     1622 |      4930 |   362 |    31 |      0.893683 |         0.962071 |       0.103367 |        0.926617 |      2.529551552318896 |     2.4124463519313304 |        1.6206857273037931 |         1.688867584028874 |
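The F1 column of the hap.py summary can be re-derived from precision and recall as their harmonic mean. A minimal sketch (`f_measure` is a hypothetical helper name), fed with the recall and precision values from the table above:

```shell
# F-measure = 2 * precision * recall / (precision + recall)
f_measure() {
    awk -v p="$1" -v r="$2" 'BEGIN {
        if (p + r == 0) { print "NA"; exit }   # avoid dividing by zero
        printf "%.4f\n", 2 * p * r / (p + r)
    }'
}

f_measure 0.962071 0.893683   # SNP row (precision, recall)
f_measure 0.741493 0.755081   # INDEL row (precision, recall)
```

Both results agree with the METRIC.F1_Score column to four decimals (0.9266 and 0.7482).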
***** TODO Results without trimming
**** DONE HG002 :hg002:
CLOSED: [2023-04-14 Fri 09:54] SCHEDULED: <2023-04-10 Mon>
#+begin_src
    NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/giabFastq.nf -profile standard,helios
    NXF_OPTS=-D"user.name=${USER}" nextflow run main.nf -profile standard,helios -resume --input="/Work/Groups/bisonex/data/giab/GRCh38/HG002_{1,2}.fq.gz" --test.id=HG002
    # Only the capture file differs. Results are better using the capture file given by Agilent, stored in data/
    NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/compareVCF.nf -profile standard,helios -resume --outdir=compareHG002 --test.id=HG002 --test.query=out/HG002_1/variantCalling/haplotypecaller/HG002_1.vcf.gz  --test.compare=vcfeval,happy --test.capture=data/AgilentSureSelectv05_hg38.bed
#+end_src
***** DONE Poor results
CLOSED: [2023-04-14 Fri 09:42]
with vcfeval
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
    0.000              24585          24390      10060      39415     0.7080       0.3841     0.4980
     None              24585          24390      10060      39415     0.7080       0.3841     0.4980
The variantCalling output is the hap.py one???
Rerunning...
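The precision and sensitivity columns of a vcfeval summary can be checked against the raw counts: precision uses the call-side true positives, sensitivity the baseline-side ones. A minimal sketch (`vcfeval_metrics` is a hypothetical helper name), applied to the row above:

```shell
# args: True-pos-baseline True-pos-call False-pos False-neg
vcfeval_metrics() {
    awk -v tpb="$1" -v tpc="$2" -v fp="$3" -v fn="$4" 'BEGIN {
        if (tpc + fp == 0 || tpb + fn == 0) { print "NA"; exit }
        precision   = tpc / (tpc + fp)     # share of calls that are correct
        sensitivity = tpb / (tpb + fn)     # share of baseline variants recovered
        printf "precision=%.4f sensitivity=%.4f\n", precision, sensitivity
    }'
}

vcfeval_metrics 24585 24390 10060 39415
```

This prints precision=0.7080 sensitivity=0.3841, matching the table: the low sensitivity comes from the 39415 false negatives.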
***** DONE Check that the VCF is in hg38
CLOSED: [2023-04-12 Wed 10:33] SCHEDULED: <2023-04-12 Wed>
***** KILL Capture in hg19?
CLOSED: [2023-04-13 Thu 09:46] SCHEDULED: <2023-04-12 Wed>
***** KILL Really the capture file, or the region of interest?
CLOSED: [2023-04-13 Thu 09:45] SCHEDULED: <2023-04-12 Wed>
"target region" +/- 50bp
[[https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/OsloUniversityHospital_Exome_GATK_jointVC_11242015/README.txt][README]]
 list file describing the variant calling regions (target regions extended with 50 bp on each end)
***** DONE .bed provided by Agilent:

Replacement in projects/bisonex.org at line 40 [19.35]

B:BD[24.16182] → [24.16182:16630]

B:BD[24.16630] → [3.8670:24606]

  0         2           2
  11 │ NC_000004.12     987858  g.987858C>T     snv          heterozygous  C          T                 0         3           4
  12 │ NC_000015.10   66435145  g.66435145G>A   snv          heterozygous  G          A                 0         1           2
  13 │ NC_000002.12   47809595  g.47809595C>T   snv          heterozygous  C          T                 0         2           2
  14 │ NC_000003.12  136477305  g.136477305C>G  snv          heterozygous  C          G                 0         4           4
  15 │ NC_000005.10  157285458  g.157285458C>T  snv          heterozygous  C          T                 0         3           3
  16 │ NC_000012.12   23604413  g.23604413T>G   snv          heterozygous  T          G                 0         5           5
  17 │ NC_000019.10   52219703  g.52219703C>T   snv          heterozygous  C          T                 0         1           1
  18 │ NC_000016.10   88856757  g.88856757C>T   snv          heterozygous  C          T                 0         8           8
******* DONE 8 not found => probably outside the capture region
CLOSED: [2023-04-28 Fri 19:49]
julia> @subset snv :refCount .== 0 :altCount .== 0
8×10 DataFrame
 Row │ chrom         pos        variant         variantType  zygosity      ref        alt        refCount  altCount  readsCount
     │ SubStrin…?    Int64      SubStrin…?      String?      String15      SubStrin…  SubStrin…  Int64     Int64     Int64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ NC_000015.10   74343027  g.74343027C>T   snv          heterozygous  C          T                 0         0           0
   2 │ NC_000011.10   20638345  g.20638345A>G   snv          heterozygous  A          G                 0         0           0
   3 │ NC_000004.12  139370252  g.139370252C>T  snv          heterozygous  C          T                 0         0           2
   4 │ NC_000017.11   61966475  g.61966475G>T   snv          heterozygous  G          T                 0         0           0
   5 │ NC_000019.10   54144058  g.54144058G>A   snv          heterozygous  G          A                 0         0           0
   6 │ NC_000023.11   77635947  g.77635947A>G   snv          hemizygous    A          G                 0         0           0
   7 │ NC_000005.10    1258495  g.1258495G>A    snv          heterozygous  G          A                 0         0           0
   8 │ NC_000012.12    2449086  g.2449086C>G    snv          heterozygous  C          G                 0         0           0
***** TODO After haplotypecaller
SCHEDULED: <2023-04-28 Fri>
****** KILL 20x
CLOSED: [2023-04-29 Sat 15:39]
183 of 766 missing
[[file:~/recherche/bisonex/simuscop/checkVCF.jl][checkVCF.jl]]
#+begin_src julia
@subset leftjoin(d2, dHaplo2, on=:genomic) ismissing.(:Column1)
#+end_src
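The left join above keeps the expected variants with no match in the caller output. The same anti-join can be approximated in shell when both sides are exported as one identifier per line. A minimal sketch: `missing_variants`, the file names and the export format are assumptions, not part of checkVCF.jl.

```shell
# $1: expected identifiers, one per line; $2: identifiers found in the caller output
missing_variants() {
    # -F fixed strings, -x whole-line match, -v invert, -f read patterns from file
    grep -Fvxf "$2" "$1" | sort -u    # sort -u drops duplicated variants
}

# Demo with identifiers taken from the tables above.
tmp=$(mktemp -d)
printf 'g.74343027C>T\ng.20638345A>G\ng.74343027C>T\ng.987858C>T\n' > "$tmp/expected.txt"
printf 'g.987858C>T\n' > "$tmp/called.txt"
missing_variants "$tmp/expected.txt" "$tmp/called.txt"
rm -r "$tmp"
```

Piping the result through `wc -l` gives the count without duplicates. Dropping the `sort -u` gives the raw count.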
Depth problem?
E.g. chr13, count at 101081606
NC_000011.10   16014966  g.16014966G>A
1 read out of 11 for the alternate allele
On the reference patient, 202 reads!
This one is not in the capture file (nor in the BAM!)
e.g. NC_000015.10   74343027  g.74343027C>T
The others should be found...
Check the number of reads at 63003856
Also check the model parameterization
****** DONE [#B] 200x
CLOSED: [2023-05-18 Thu 11:04] SCHEDULED: <2023-04-30 Sun>
120 missing (99 without duplicates)!
Checking in IGV (VCF + BAM after alignment):
******* snv NC_000015.10   74343027
- nothing called
- not a repeated region
- base quality (see [[*Phred score][Phred score]]) is 37, so OK
- variant found at 26/42
- BAM after ApplyBQSR: base quality 35, so OK
chr15 also at 89318565: variant found at 25/33 with a base quality of 37
Don't forget to load the AVX instructions
#+begin_src sh
module load gcc@11.3.0/gcc-12.1.0
#+end_src
Split the .bam by chromosome for debugging (on the mesocentre)
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/simuscop-centogene-200x/cento/testing :results silent
ln -s ../preprocessing/applybqsr/cento.bam .
ln -s ../preprocessing/recalibrated/cento.bam.bai .
ln -s /Work/Projects/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz .
ln -s /Work/Projects/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz.tbi .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.dict .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna.fai .
#+end_src
This has to be run by hand (org-mode does not know the path to samtools)
samtools view -b cento.bam NC_000015.10 > cento_chr15.bam
samtools index cento_chr15.bam
Then restrict the reference to chromosome 15
samtools faidx genomeRef.fna NC_000015.10 > genomeRef_chr15.fa
samtools faidx genomeRef_chr15.fa
gatk CreateSequenceDictionary -R genomeRef_chr15.fa -O genomeRef_chr15.dict
Restrict the calling to chromosome 15 with the -L option (takes about 1 min)
gatk --java-options "-Xmx3072M" HaplotypeCaller --input cento_chr15.bam \
    --output test.vcf.gz --reference genomeRef.fna --dbsnp dbSNP.gz --tmp-dir . --max-mnp-distance 2 -L NC_000015.10
******* DONE HaplotypeCaller tutorial
CLOSED: [2023-05-01 Mon 19:58]
Procedure: https://gatk.broadinstitute.org/hc/en-us/articles/360043491652-When-HaplotypeCaller-and-Mutect2-do-not-call-an-expected-variant
******** DONE Remove --max-mnp-distance 2: same result
CLOSED: [2023-04-30 Sun 15:42]
******** DONE --debug &> run.log: not called...
CLOSED: [2023-04-30 Sun 15:52]
******** DONE --linked-de-bruijn-graph: same result
CLOSED: [2023-04-30 Sun 15:55]
******** DONE --recover-all-dangling-branches
CLOSED: [2023-04-30 Sun 16:01]
******** DONE --min-pruning 0: more calls, but not this one
CLOSED: [2023-04-30 Sun 15:59]
******** DONE --bam-output
CLOSED: [2023-04-30 Sun 16:50]
********* DONE : nothing!
CLOSED: [2023-04-30 Sun 16:08]
********* DONE + --recover-all-dangling-branches: nothing!
CLOSED: [2023-04-30 Sun 16:08]
******** DONE Reads filtered out? Apparently not
CLOSED: [2023-04-30 Sun 16:41]
183122 read(s) filtered by: MappingQualityReadFilter
3674 read(s) filtered by: NotDuplicateReadFilter
********* DONE --disable-read-filter MappingQualityReadFilter: same result
CLOSED: [2023-04-30 Sun 16:34]
We do get  - 0 read(s) filtered by: MappingQualityAvailableReadFilter
********* DONE --disable-read-filter NotDuplicateReadFilter: same result
CLOSED: [2023-04-30 Sun 16:40]
******** DONE Try freebayes: same result
CLOSED: [2023-04-30 Sun 16:22]
freebayes -f genomeRef.fna -r NC_000015.10 cento_chr15.bam > freebayes-test-chr15.vcf
******** DONE With all the options: same result
CLOSED: [2023-04-30 Sun 16:50]
--linked-de-bruijn-graph --recover-all-dangling-branches --min-pruning 0 --bam-output debug.bam
******** DONE Check that we are looking at the same BAM: yes
CLOSED: [2023-04-30 Sun 16:50]
******** DONE Disable dbSNP: same result
CLOSED: [2023-04-30 Sun 16:52]
******** DONE Change the kmer size: same result
CLOSED: [2023-04-30 Sun 16:56]
for example ([[https://gatk.broadinstitute.org/hc/en-us/community/posts/360075653152-REAL-Variant-not-called-by-HaplotypeCaller][GATK forum]]) --kmer-size 18 --kmer-size 22
******** DONE --adaptive-pruning true
CLOSED: [2023-05-01 Mon 19:57]
******* DONE Mapping quality: it is 0!!!!
CLOSED: [2023-05-01 Mon 19:58]
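A quick way to confirm a mapping-quality problem before going through caller options is to count MAPQ 0 reads at the site. MAPQ is the fifth column of a SAM record. A minimal sketch: `mapq_zero_fraction` is a hypothetical helper, and the demo lines are synthetic.

```shell
# Read SAM text on stdin and report how many reads have MAPQ 0 (column 5).
mapq_zero_fraction() {
    awk -F'\t' '
        /^@/ { next }                          # skip SAM header lines, if any
        { total++; if ($5 == 0) zero++ }
        END {
            if (total) printf "%d/%d reads with MAPQ 0\n", zero, total
            else print "no reads"
        }
    '
}

# Demo on three synthetic alignments (only the first six SAM columns are given).
printf 'r1\t0\tNC_000015.10\t74343000\t0\t150M\nr2\t0\tNC_000015.10\t74343010\t0\t150M\nr3\t16\tNC_000015.10\t74343020\t60\t150M\n' \
    | mapq_zero_fraction
```

On the real data this would be something like `samtools view cento_chr15.bam NC_000015.10:74343027-74343027 | mapq_zero_fraction`.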
****** TODO Compare VCF with vcfeval :haplotypecaller:
Prepare the data in Julia
#+begin_src sh :dir ~/recherche/bisonex/simuscop
julia --project=. toVCF.jl
#+end_src
Then export to the mesocentre
#+begin_src
scp variants_for_vcfeval.tsv.gz* meso:centogene_variants/
#+end_src
#+begin_src
z bis
cd simuscop-200x
rtg vcfeval -b ~/centogene_variants/variants_for_vcfeval.tsv.gz -c cento/variantCalling/haplotypecaller/cento.vcf.gz -o compare-haplotypecaller -t /Work/Groups/bisonex/data/giab/GRCh38/genomeRef.sdf
#+end_src
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
   82.000                540            540         60         45     0.9000       0.9231     0.9114
     None                546            546        329         39     0.6240       0.9333     0.7479
****** TODO Compare with hap.py :haplotypecaller:
 NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/checkInserted.nf -profile standard,helios --outdir=compare-simuscop-200x  --query=out/simuscop-centogene-200x/cento/callVariant/haplotypecaller/cento.vcf.gz --truth=centogene_variants/variants_for_vcfeval.tsv.gz --id=simuscop-200x-check
****** DONE Naive method 549/585
CLOSED: [2023-05-04 Thu 21:57]
Haplotypecaller: Nb reference SNV 692 vs found 585
Variant calling, filter technical: reference SNV 692 vs found 521
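The notes do not say how the naive count is obtained. One plausible version, given here only as an assumption, counts how many reference sites (chromosome and position) appear at least once in the VCF, ignoring alleles and genotypes. `count_found` and the two-column truth file are hypothetical.

```shell
# $1: truth TSV (chrom<TAB>pos); $2: uncompressed VCF
count_found() {
    awk -F'\t' '
        /^#/ { next }                                  # header/comment lines in either file
        NR == FNR {                                    # first file: expected sites
            k = $1 ":" $2
            if (!(k in want)) { want[k] = 1; n++ }
            next
        }
        {                                              # second file: called sites
            k = $1 ":" $2
            if ((k in want) && !(k in seen)) { seen[k] = 1; found++ }
        }
        END { printf "%d/%d\n", found, n }
    ' "$1" "$2"
}

# Demo on synthetic files.
tmp=$(mktemp -d)
printf 'NC_000015.10\t74343027\nNC_000004.12\t987858\n' > "$tmp/truth.tsv"
printf '#CHROM\tPOS\tID\tREF\tALT\nNC_000004.12\t987858\t.\tC\tT\n' > "$tmp/calls.vcf"
count_found "$tmp/truth.tsv" "$tmp/calls.vcf"
rm -r "$tmp"
```

Unlike vcfeval, such a count does not check the alternate allele or the genotype, so it can only overestimate the matches.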
***** TODO Before annotation
SCHEDULED: <2023-04-28 Fri>
#+begin_src
cd cento/variantCalling
bgzip filter-technical.vcf
tabix -p vcf filter-technical.vcf.gz -f
#+end_src
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
   12.000                519            519         55         66     0.9042       0.8872     0.8956
     None                519            519         55         66     0.9042       0.8872     0.8956
****** DONE Naive method 521/585
CLOSED: [2023-05-04 Thu 21:57]
Haplotypecaller: Nb reference SNV 692 vs found 585
Variant calling, filter technical: reference SNV 692 vs found 521
****** TODO Compare with hap.py
***** TODO After annotation filter
****** DONE Naive method: 493/585
CLOSED: [2023-05-04 Thu 22:09]
****** TODO Compare with hap.py
****** TODO vcfeval
 cd cento/annotation/
 bgzip postvep-filter.vcf
 tabix postvep-filter.vcf.gz
 cd ../..
 rtg vcfeval -b ~/centogene_variants/variants_for_vcfeval.tsv.gz -c cento/annotation/postvep-filter.vcf.gz  -o compare-vepfilter -t /Work/Groups/bisonex/data/giab/GRCh38/genomeRef.sdf
 Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
   12.000                491            491         50         94     0.9076       0.8393     0.8721
     None                491            491         50         94     0.9076       0.8393     0.8721
*** KILL NEAT: too slow :neat:
CLOSED: [2023-04-29 Sat 22:06]
**** KILL FASTQ generation on exon 5 of GATAD2B
CLOSED: [2023-04-29 Sat 22:06]
Too slow: 1500 seconds for 1 exon!
#+begin_src sh
samtools faidx genomeRef.fna NC_000001.11 > genomeRef_chr1.fna
python gen_reads.py  -r ../test-simuscop/genomeRef_chr1.fna -o lol  -tr ../test-simuscop/gatad2b-exon6.bed  -R 147 --pe 150 10
#+end_src
*** KILL ReSeq: exome with exons as FASTA, but does not handle exons that are too small :reseq:
CLOSED: [2023-04-30 Sun 19:44] SCHEDULED: <2023-04-29 Sat>
#+begin_quote
Can I simulate exome sequencing? Yes. You need to use a reference that only contains the exons as individual scaffolds. Using --refBiasFile you can specify the coverage of individual exons. To simulate intron contamination you can add the whole reference to the reference containing the exons and strongly reduce the coverage for these scaffolds using --refBiasFile.
#+end_quote
On the other hand, it is fast
**** DONE FASTA for exons only
CLOSED: [2023-04-30 Sun 19:25]
From the GFF
#+begin_src sh :dir ~/code/bisonex/test-reseq :results silent
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz
#+end_src
#+begin_src sh :dir ~/code/bisonex/test-reseq  :results silent
gunzip -c GCF_000001405.39_GRCh38.p13_genomic.gff.gz | grep -w "exon" > exons.gff
#+end_src
Generate the exon sequences
#+begin_src sh :dir ~/code/bisonex/test-reseq
bedtools getfasta -fi ../test-simuscop/genomeRef.fna -bed exons.gff -fo exons.fna
#+end_src
To test with a ready-made profile:
https://github.com/schmeing/ReSeq-profiles/tree/master/profiles
Looking for the exon of interest
 NC_000001.11 g.153817496 A>T
Not there??
***** DONE Test on the first 2: exec
CLOSED: [2023-04-30 Sun 18:39]
#+begin_src
head exons.fa -n 2 > 2exons.fna
#+end_src
#+begin_src sh
../ReSeq/bin/reseq illuminaPE -j 32 -R exons.fa -s Ec-Hi2000-TruSeq.reseq --ipfIterations 0 -1 reseq-sim_1.fq -2 reseq_sim_2.fq
#+end_src
#+begin_quote
error: All reference sequences are too short for simulating. They should have at least 1991 bases
#+end_quote
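Given this error, it helps to see how many exon records reach the minimum length before running the simulator. A minimal sketch, assuming a FASTA with one sequence line per record. `fasta_min_length` is a hypothetical helper, and 1991 is the threshold quoted in the error for this profile.

```shell
# $1: minimum sequence length; FASTA on stdin, one sequence line per record
fasta_min_length() {
    awk -v min="$1" '
        /^>/ { header = $0; next }                       # remember the current header
        length($0) >= min { print header; print $0 }     # keep long-enough records only
    '
}

# Demo: only the 8-base record passes a threshold of 5.
printf '>exon_a\nACGTACGT\n>exon_b\nACG\n' | fasta_min_length 5
```

With the real file, `fasta_min_length 1991 < exons.fa | grep -c '^>'` would give the number of usable exons.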
#+begin_src sh
grep '^>NC_000001.10' exons.fa  | sed 's/:/,/;s/-/,/;s/^>//' > exons.csv
#+end_src
***** DONE On the first 200 exons of chr1
CLOSED: [2023-04-30 Sun 19:17]
#+begin_src sh :dir ~/code/bisonex/test-reseq  :results silent
head -n200 exons.fna > exons-200.fna
 bwa index exons-200.fna
 #+end_src
Simulation at 30x
#+begin_src sh :dir ~/code/bisonex/test-reseq  :results silent
 ../ReSeq/bin/reseq illuminaPE -R exons-200.fna -s Ec-Hi2000-TruSeq.reseq --ipfIterations 0 -1 reseq1.fq -2 reseq2.fq -c 30
 #+end_src
 Careful: the alignment needs the complete .fna! Otherwise errors such as
 [E::sam_hdr_create] Duplicated sequence "NC_000001.10:762970-763155" in file "-"
 and no BAM is produced, with
 samtools sort: failed to change sort order header to 'coordinate'
 #+begin_src
 bwa mem ../test-simuscop/bwa/genomeRef.fna reseq1.fq reseq2.fq | samtools sort -o reseq.bam
 #+end_src
 Exons are missing and the shape does not match...
***** DONE Use the capture file: exons too small
CLOSED: [2023-04-30 Sun 19:25]
As for ART
Too short with
echo -e "NC_000001.11\t153817371\t153817542" > gatad2b-exon6.bed
So we add 1000 on each side
#+begin_src sh :dir ~/code/bisonex/test-reseq :results silent
echo -e "NC_000001.11\t153816371\t153818542" > gatad2b-exon6.bed
bedtools getfasta -fi ../test-simuscop/genomeRef.fna -bed gatad2b-exon6.bed -fo gatad2b-exon6.fna
bwa index gatad2b-exon6.fna
 ../ReSeq/bin/reseq illuminaPE -R gatad2b-exon6.fna -s Ec-Hi2000-TruSeq.reseq --ipfIterations 0 -1 reseq1.fq -2 reseq2.fq -c 30
 bwa mem ../test-simuscop/bwa/genomeRef.fna reseq1.fq reseq2.fq | samtools sort -o reseq.bam
 samtools index reseq.bam
#+end_src
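The padded coordinates above were typed by hand. The same padding can be computed from the original interval, which also makes it easy to check the resulting length against the 1991-base minimum reported by ReSeq. A minimal sketch (`pad_bed` is a hypothetical helper name):

```shell
# $1: padding in bases; 3-column BED (chrom, start, end) on stdin
pad_bed() {
    awk -v pad="$1" 'BEGIN { FS = OFS = "\t" }
        {
            start = $2 - pad
            if (start < 0) start = 0       # never go below the start of the sequence
            print $1, start, $3 + pad
        }'
}

# Demo on the exon interval used above, followed by the padded length.
printf 'NC_000001.11\t153817371\t153817542\n' | pad_bed 1000
printf 'NC_000001.11\t153817371\t153817542\n' | pad_bed 1000 | awk -F'\t' '{ print $3 - $2 }'
```

The padded interval is 153816371-153818542, 2171 bases long, which is above the 1991-base minimum. The original interval is only 171 bases. The end coordinate is not clamped to the chromosome length, which would need the .fai index.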
**** KILL On chromosome 15, then filter by hand on the capture regions?
CLOSED: [2023-04-30 Sun 19:44]
#+begin_src sh :dir ~/code/bisonex/test-reseq :results silent
samtools faidx ../test-simuscop/genomeRef.fna NC_000015.10 > chr15.fna
 ../ReSeq/bin/reseq illuminaPE -R chr15.fna -s Ec-Hi2000-TruSeq.reseq --ipfIterations 0 -1 reseq1.fq -2 reseq2.fq -c 30
#+end_src
*** DONE ART: works very poorly in targeted mode
CLOSED: [2023-04-30 Sun 11:49]
**** DONE Read generation
CLOSED: [2023-04-30 Sun 11:49]
***** DONE With only the exons as sequence
CLOSED: [2023-04-30 Sun 10:24]
 head -n6 exons.fa | save three-exons.fna
../art_bin_MountRainier/art_illumina -ss HS25 -i three-exons.fna -o ./paired_end_com -l 150 -f 10 -p -m 500 -s 10 -sam
The SAM is not visible in IGV, but aligning with bwa mem yields a few reads
***** DONE Extract a capture region from the FASTA
CLOSED: [2023-04-30 Sun 11:49]
 NC_000001.11 g.153817496 A>T
****** DONE Attempt 1: does not extend beyond the region
CLOSED: [2023-04-30 Sun 10:49]
#+begin_src sh :dir ~/code/bisonex/test-art :results silent
echo -e "NC_000001.11\t153817371\t153817542" > gatad2b-exon6.bed
bedtools getfasta -fi ../test-simuscop/genomeRef.fna -bed gatad2b-exon6.bed -fo gatad2b-exon6.fa
#+end_src
-ss HS25 : name of the Illumina profile
-l 150 : 150-base reads
-f 10 : coverage of 10
-p : paired end
-m 500 : mean length of the DNA fragments
-s 10 : standard deviation
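With these settings, a rough read count for a single exon shows how little data to expect. This sketch assumes the usual definition of fold coverage (reads × read length / reference length). ART's internal computation may differ, and `expected_reads` is a hypothetical helper.

```shell
# $1: fold coverage, $2: reference length in bases, $3: read length in bases
expected_reads() {
    awk -v f="$1" -v g="$2" -v l="$3" 'BEGIN { printf "%.1f\n", f * g / l }'
}

# GATAD2B exon interval 153817371-153817542 is 171 bases long.
expected_reads 10 171 150
```

This gives about 11 reads, that is 5 or 6 pairs, for the 171-base interval. The requested mean fragment length (500) is also longer than the interval itself, which may be part of why so few reads come out.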
#+begin_src sh :dir ~/code/bisonex/test-art :results silent
../art_bin_MountRain

[24.16182]

[3.24606]

  0         2           2
  11 │ NC_000004.12     987858  g.987858C>T     snv          heterozygous  C          T                 0         3           4
  12 │ NC_000015.10   66435145  g.66435145G>A   snv          heterozygous  G          A                 0         1           2
  13 │ NC_000002.12   47809595  g.47809595C>T   snv          heterozygous  C          T                 0         2           2
  14 │ NC_000003.12  136477305  g.136477305C>G  snv          heterozygous  C          G                 0         4           4
  15 │ NC_000005.10  157285458  g.157285458C>T  snv          heterozygous  C          T                 0         3           3
  16 │ NC_000012.12   23604413  g.23604413T>G   snv          heterozygous  T          G                 0         5           5
  17 │ NC_000019.10   52219703  g.52219703C>T   snv          heterozygous  C          T                 0         1           1
  18 │ NC_000016.10   88856757  g.88856757C>T   snv          heterozygous  C          T                 0         8           8
******* DONE 8 not found => probably outside the capture region
CLOSED: [2023-04-28 Fri 19:49]
julia> @subset snv :refCount .== 0 :altCount .== 0
8×10 DataFrame
 Row │ chrom         pos        variant         variantType  zygosity      ref        alt        refCount  altCount  readsCount
     │ SubStrin…?    Int64      SubStrin…?      String?      String15      SubStrin…  SubStrin…  Int64     Int64     Int64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ NC_000015.10   74343027  g.74343027C>T   snv          heterozygous  C          T                 0         0           0
   2 │ NC_000011.10   20638345  g.20638345A>G   snv          heterozygous  A          G                 0         0           0
   3 │ NC_000004.12  139370252  g.139370252C>T  snv          heterozygous  C          T                 0         0           2
   4 │ NC_000017.11   61966475  g.61966475G>T   snv          heterozygous  G          T                 0         0           0
   5 │ NC_000019.10   54144058  g.54144058G>A   snv          heterozygous  G          A                 0         0           0
   6 │ NC_000023.11   77635947  g.77635947A>G   snv          hemizygous    A          G                 0         0           0
   7 │ NC_000005.10    1258495  g.1258495G>A    snv          heterozygous  G          A                 0         0           0
   8 │ NC_000012.12    2449086  g.2449086C>G    snv          heterozygous  C          G                 0         0           0
***** TODO After haplotypecaller
****** KILL 20x
CLOSED: [2023-04-29 Sat 15:39]
183 of 766 missing
[[file:~/recherche/bisonex/simuscop/checkVCF.jl][checkVCF.jl]]
#+begin_src julia
@subset leftjoin(d2, dHaplo2, on=:genomic) ismissing.(:Column1)
#+end_src
Depth problem?
E.g. chr13, count at 101081606
NC_000011.10   16014966  g.16014966G>A
1 read out of 11 for the alternate allele
On the reference patient, 202 reads!
This one is not in the capture file (nor in the BAM!)
e.g. NC_000015.10   74343027  g.74343027C>T
The others should be found...
Check the number of reads at 63003856
Also check the model parameterization
****** DONE [#B] 200x
CLOSED: [2023-05-18 Thu 11:04] SCHEDULED: <2023-04-30 Sun>
120 missing (99 without duplicates)!
Checking in IGV (VCF + BAM after alignment):
******* snv NC_000015.10   74343027
- nothing called
- not a repeated region
- base quality (see [[*Phred score][Phred score]]) is 37, so OK
- variant found at 26/42
- BAM after ApplyBQSR: base quality 35, so OK
chr15 also at 89318565: variant found at 25/33 with a base quality of 37
Don't forget to load the AVX instructions
#+begin_src sh
module load gcc@11.3.0/gcc-12.1.0
#+end_src
Split the .bam by chromosome for debugging (on the mesocentre)
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/simuscop-centogene-200x/cento/testing :results silent
ln -s ../preprocessing/applybqsr/cento.bam .
ln -s ../preprocessing/recalibrated/cento.bam.bai .
ln -s /Work/Projects/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz .
ln -s /Work/Projects/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz.tbi .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.dict .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna.fai .
#+end_src
This has to be run by hand (org-mode does not know the path to samtools)
samtools view -b cento.bam NC_000015.10 > cento_chr15.bam
samtools index cento_chr15.bam
Then restrict the reference to chromosome 15
samtools faidx genomeRef.fna NC_000015.10 > genomeRef_chr15.fa
samtools faidx genomeRef_chr15.fa
gatk CreateSequenceDictionary -R genomeRef_chr15.fa -O genomeRef_chr15.dict
Restrict the calling to chromosome 15 with the -L option (takes about 1 min)
gatk --java-options "-Xmx3072M" HaplotypeCaller --input cento_chr15.bam \
    --output test.vcf.gz --reference genomeRef.fna --dbsnp dbSNP.gz --tmp-dir . --max-mnp-distance 2 -L NC_000015.10
******* DONE HaplotypeCaller tutorial
CLOSED: [2023-05-01 Mon 19:58]
Procedure: https://gatk.broadinstitute.org/hc/en-us/articles/360043491652-When-HaplotypeCaller-and-Mutect2-do-not-call-an-expected-variant
******** DONE Remove --max-mnp-distance 2: same result
CLOSED: [2023-04-30 Sun 15:42]
******** DONE --debug &> run.log: not called...
CLOSED: [2023-04-30 Sun 15:52]
******** DONE --linked-de-bruijn-graph: same result
CLOSED: [2023-04-30 Sun 15:55]
******** DONE --recover-all-dangling-branches
CLOSED: [2023-04-30 Sun 16:01]
******** DONE --min-pruning 0: more calls, but not this one
CLOSED: [2023-04-30 Sun 15:59]
******** DONE --bam-output
CLOSED: [2023-04-30 Sun 16:50]
********* DONE : nothing!
CLOSED: [2023-04-30 Sun 16:08]
********* DONE + --recover-all-dangling-branches: nothing!
CLOSED: [2023-04-30 Sun 16:08]
******** DONE Reads filtered out? Apparently not
CLOSED: [2023-04-30 Sun 16:41]
183122 read(s) filtered by: MappingQualityReadFilter
3674 read(s) filtered by: NotDuplicateReadFilter
********* DONE --disable-read-filter MappingQualityReadFilter: same result
CLOSED: [2023-04-30 Sun 16:34]
We do get  - 0 read(s) filtered by: MappingQualityAvailableReadFilter
********* DONE --disable-read-filter NotDuplicateReadFilter: same result
CLOSED: [2023-04-30 Sun 16:40]
******** DONE Try freebayes: same result
CLOSED: [2023-04-30 Sun 16:22]
freebayes -f genomeRef.fna -r NC_000015.10 cento_chr15.bam > freebayes-test-chr15.vcf
******** DONE With all the options: same result
CLOSED: [2023-04-30 Sun 16:50]
--linked-de-bruijn-graph --recover-all-dangling-branches --min-pruning 0 --bam-output debug.bam
******** DONE Check that we are looking at the same BAM: yes
CLOSED: [2023-04-30 Sun 16:50]
******** DONE Disable dbSNP: same result
CLOSED: [2023-04-30 Sun 16:52]
******** DONE Change the kmer size: same result
CLOSED: [2023-04-30 Sun 16:56]
for example ([[https://gatk.broadinstitute.org/hc/en-us/community/posts/360075653152-REAL-Variant-not-called-by-HaplotypeCaller][GATK forum]]) --kmer-size 18 --kmer-size 22
******** DONE --adaptive-pruning true
CLOSED: [2023-05-01 Mon 19:57]
******* DONE Mapping quality: it is 0!!!!
CLOSED: [2023-05-01 Mon 19:58]
****** TODO Compare VCF with vcfeval :haplotypecaller:
Prepare the data in Julia
#+begin_src sh :dir ~/recherche/bisonex/simuscop
julia --project=. toVCF.jl
#+end_src
Then export to the mesocentre
#+begin_src
scp variants_for_vcfeval.tsv.gz* meso:centogene_variants/
#+end_src
#+begin_src
z bis
cd simuscop-200x
rtg vcfeval -b ~/centogene_variants/variants_for_vcfeval.tsv.gz -c cento/variantCalling/haplotypecaller/cento.vcf.gz -o compare-haplotypecaller -t /Work/Groups/bisonex/data/giab/GRCh38/genomeRef.sdf
#+end_src
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
   82.000                540            540         60         45     0.9000       0.9231     0.9114
     None                546            546        329         39     0.6240       0.9333     0.7479
****** TODO Compare with hap.py :haplotypecaller:
 NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/checkInserted.nf -profile standard,helios --outdir=compare-simuscop-200x  --query=out/simuscop-centogene-200x/cento/callVariant/haplotypecaller/cento.vcf.gz --truth=centogene_variants/variants_for_vcfeval.tsv.gz --id=simuscop-200x-check
****** DONE Naive method 549/585
CLOSED: [2023-05-04 Thu 21:57]
Haplotypecaller: Nb reference SNV 692 vs found 585
Variant calling, filter technical: reference SNV 692 vs found 521
***** TODO Before annotation
#+begin_src
cd cento/variantCalling
bgzip filter-technical.vcf
tabix -p vcf filter-technical.vcf.gz -f
#+end_src
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
   12.000                519            519         55         66     0.9042       0.8872     0.8956
     None                519            519         55         66     0.9042       0.8872     0.8956
****** DONE Naive method 521/585
CLOSED: [2023-05-04 Thu 21:57]
Haplotypecaller: Nb reference SNV 692 vs found 585
Variant calling, filter technical: reference SNV 692 vs found 521
****** TODO Compare with hap.py
***** TODO After annotation filter
****** DONE Naive method: 493/585
CLOSED: [2023-05-04 Thu 22:09]
****** TODO Compare with hap.py
****** TODO vcfeval
 cd cento/annotation/
 bgzip postvep-filter.vcf
 tabix postvep-filter.vcf.gz
 cd ../..
 rtg vcfeval -b ~/centogene_variants/variants_for_vcfeval.tsv.gz -c cento/annotation/postvep-filter.vcf.gz  -o compare-vepfilter -t /Work/Groups/bisonex/data/giab/GRCh38/genomeRef.sdf
 Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
   12.000                491            491         50         94     0.9076       0.8393     0.8721
     None                491            491         50         94     0.9076       0.8393     0.8721
*** KILL NEAT: too slow :neat:
CLOSED: [2023-04-29 Sat 22:06]
**** KILL FASTQ generation on exon 5 of GATAD2B
CLOSED: [2023-04-29 Sat 22:06]
Too slow: 1500 seconds for 1 exon!
#+begin_src sh
samtools faidx genomeRef.fna NC_000001.11 > genomeRef_chr1.fna
python gen_reads.py  -r ../test-simuscop/genomeRef_chr1.fna -o lol  -tr ../test-simuscop/gatad2b-exon6.bed  -R 147 --pe 150 10
#+end_src
*** KILL ReSeq: exome with exons as FASTA, but does not handle exons that are too small :reseq:
CLOSED: [2023-04-30 Sun 19:44] SCHEDULED: <2023-04-29 Sat>
#+begin_quote
Can I simulate exome sequencing? Yes. You need to use a reference that only contains the exons as individual scaffolds. Using --refBiasFile you can specify the coverage of individual exons. To simulate intron contamination you can add the whole reference to the reference containing the exons and strongly reduce the coverage for these scaffolds using --refBiasFile.
#+end_quote
On the other hand, it is fast
**** DONE FASTA for exons only
CLOSED: [2023-04-30 Sun 19:25]
From the GFF
#+begin_src sh :dir ~/code/bisonex/test-reseq :results silent
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz
#+end_src
#+begin_src sh :dir ~/code/bisonex/test-reseq  :results silent
gunzip -c GCF_000001405.39_GRCh38.p13_genomic.gff.gz | grep -w "exon" > exons.gff
#+end_src
Generate the exon sequences
#+begin_src sh :dir ~/code/bisonex/test-reseq
bedtools getfasta -fi ../test-simuscop/genomeRef.fna -bed exons.gff -fo exons.fna
#+end_src
To test with a ready-made profile:
https://github.com/schmeing/ReSeq-profiles/tree/master/profiles
Looking for the exon of interest
 NC_000001.11 g.153817496 A>T
Not there??
***** DONE Test on the first 2: exec
CLOSED: [2023-04-30 Sun 18:39]
#+begin_src
head exons.fa -n 2 > 2exons.fna
#+end_src
#+begin_src sh
../ReSeq/bin/reseq illuminaPE -j 32 -R exons.fa -s Ec-Hi2000-TruSeq.reseq --ipfIterations 0 -1 reseq-sim_1.fq -2 reseq_sim_2.fq
#+end_src
#+begin_quote
error: All reference sequences are too short for simulating. They should have at least 1991 bases
#+end_quote
#+begin_src sh
grep '^>NC_000001.10' exons.fa  | sed 's/:/,/;s/-/,/;s/^>//' > exons.csv
#+end_src
***** DONE On the first 200 exons of chr1
CLOSED: [2023-04-30 Sun 19:17]
#+begin_src sh :dir ~/code/bisonex/test-reseq  :results silent
head -n200 exons.fna > exons-200.fna
 bwa index exons-200.fna
 #+end_src
Simulation at 30x
#+begin_src sh :dir ~/code/bisonex/test-reseq  :results silent
 ../ReSeq/bin/reseq illuminaPE -R exons-200.fna -s Ec-Hi2000-TruSeq.reseq --ipfIterations 0 -1 reseq1.fq -2 reseq2.fq -c 30
 #+end_src
 Careful: the alignment needs the complete .fna! Otherwise errors such as
 [E::sam_hdr_create] Duplicated sequence "NC_000001.10:762970-763155" in file "-"
 and no BAM is produced, with
 samtools sort: failed to change sort order header to 'coordinate'
 #+begin_src
 bwa mem ../test-simuscop/bwa/genomeRef.fna reseq1.fq reseq2.fq | samtools sort -o reseq.bam
 #+end_src
 Exons are missing and the shape does not match...
***** DONE Use the capture file: exons too small
CLOSED: [2023-04-30 Sun 19:25]
As for ART
Too short with
echo -e "NC_000001.11\t153817371\t153817542" > gatad2b-exon6.bed
So we add 1000 on each side
#+begin_src sh :dir ~/code/bisonex/test-reseq :results silent
echo -e "NC_000001.11\t153816371\t153818542" > gatad2b-exon6.bed
bedtools getfasta -fi ../test-simuscop/genomeRef.fna -bed gatad2b-exon6.bed -fo gatad2b-exon6.fna
bwa index gatad2b-exon6.fna
 ../ReSeq/bin/reseq illuminaPE -R gatad2b-exon6.fna -s Ec-Hi2000-TruSeq.reseq --ipfIterations 0 -1 reseq1.fq -2 reseq2.fq -c 30
 bwa mem ../test-simuscop/bwa/genomeRef.fna reseq1.fq reseq2.fq | samtools sort -o reseq.bam
 samtools index reseq.bam
#+end_src
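The padding by 1000 bases on each side was done by hand above. A minimal sketch of the same step for any BED file; =pad_bed= is a hypothetical helper. It clamps the start at 0 but does not know the chromosome length, so the end is not clamped (=bedtools slop= with a genome file handles that case).

```sh
# Hypothetical helper: widen every BED interval by PAD bases on each side,
# clamping the start at 0 so the interval stays valid.
pad_bed() {
  awk -v pad="$1" 'BEGIN { FS = OFS = "\t" }
    { start = $2 - pad; if (start < 0) start = 0; print $1, start, $3 + pad }'
}

# Usage: recompute the padded interval from the original exon coordinates
printf 'NC_000001.11\t153817371\t153817542\n' | pad_bed 1000
```

This prints the interval 153816371-153818542, the same coordinates as in the block above.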
**** KILL On chromosome 15, then filter by hand on the capture regions?
CLOSED: [2023-04-30 Sun 19:44]
#+begin_src sh :dir ~/code/bisonex/test-reseq :results silent
samtools faidx ../test-simuscop/genomeRef.fna NC_000015.10 > chr15.fna
 ../ReSeq/bin/reseq illuminaPE -R chr15.fna -s Ec-Hi2000-TruSeq.reseq --ipfIterations 0 -1 reseq1.fq -2 reseq2.fq -c 30
#+end_src
*** DONE ART: works very poorly in targeted mode
CLOSED: [2023-04-30 Sun 11:49]
**** DONE Read generation
CLOSED: [2023-04-30 Sun 11:49]
***** DONE With only the exons as sequence
CLOSED: [2023-04-30 Sun 10:24]
 head -n6 exons.fa | save three-exons.fna
../art_bin_MountRainier/art_illumina -ss HS25 -i three-exons.fna -o ./paired_end_com -l 150 -f 10 -p -m 500 -s 10 -sam
The SAM file is not visible in IGV, but aligning with bwa mem gives a few reads
***** DONE Extract a capture region from the FASTA
CLOSED: [2023-04-30 Sun 11:49]
 NC_000001.11 g.153817496 A>T
****** DONE Attempt 1: does not extend beyond the region
CLOSED: [2023-04-30 Sun 10:49]
#+begin_src sh :dir ~/code/bisonex/test-art :results silent
echo -e "NC_000001.11\t153817371\t153817542" > gatad2b-exon6.bed
bedtools getfasta -fi ../test-simuscop/genomeRef.fna -bed gatad2b-exon6.bed -fo gatad2b-exon6.fa
#+end_src
-ss HS25: name of the Illumina profile
-l 150: 150-base reads
-f 10: 10x coverage
-p: paired-end
-m 500: mean length of the DNA fragments
-s 10: standard deviation of the fragment length
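To get a feeling for how few reads these settings give on a single exon, the arithmetic can be sketched as below. It assumes fold coverage = reads × read length / region length. =reads_for_coverage= is a hypothetical helper, and 171 is the length of the BED interval above (153817542 - 153817371).

```sh
# Hypothetical helper: number of reads needed to reach a given fold coverage,
# assuming coverage = reads * read_length / region_length.
reads_for_coverage() {
  # $1 = region length (bases), $2 = read length (bases), $3 = fold coverage
  LC_ALL=C awk -v region="$1" -v len="$2" -v cov="$3" \
    'BEGIN { printf "%.1f\n", cov * region / len }'
}

reads_for_coverage 171 150 10
```

That is about 11 reads for the whole interval. Note that the interval is also shorter than the 500-base mean fragment length requested with -m.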
#+begin_src sh :dir ~/code/bisonex/test-art :results silent
../art_bin_MountRain

Replacement in projects/bisonex.org at line 46 [19.35]

B:BD[25.7495] → [25.7495:7883]

B:BD[25.7883] → [3.24995:36464]

us les variants
****** DONE Understand why the distribution does not follow a normal distribution
CLOSED: [2023-06-01 Thu 21:44]
Some heterozygous variants are at 0.01 or 1...
******* DONE Increase the number of samples: same result
CLOSED: [2023-05-31 Wed 22:24]
******* DONE Check the number of marked vs. edited reads
CLOSED: [2023-06-01 Thu 21:44]
******* DONE Check that 100 calls to rand(d, 1)[1] behave like one call to rand(d, 100)
CLOSED: [2023-05-31 Wed 22:24]
julia> y = [rand(d, 1)[1] for x in 1:1000];
julia> z = rand(d,1000);
julia> df = vcat(DataFrame(:y => z, :type => "z"), DataFrame(:y => y, :type => "y"));
 draw(data(df)*histogram(bins=100)*mapping(:y, color=:type,dodge=:type))
****** DONE Improve performance
CLOSED: [2023-06-02 Fri 23:39]
#+begin_src julia
@time include("xamscissors.jl")
#+end_src
430 s for chromosome 22. Most of the time is spent editing reads:
******* DONE Insert all the variants of a read at once
CLOSED: [2023-06-01 Thu 23:09]
No change
******* DONE Test with -t4: same result
CLOSED: [2023-06-01 Thu 23:17]
******* DONE Test on the mesocentre cluster: same result
CLOSED: [2023-06-01 Thu 23:40]
348 s
******* Change the data structure
DataFrame -> Dict: the terrible performance is gone
****** TODO Redo the test with the new version
******* DONE Data generation
CLOSED: [2023-06-02 Fri 23:40]
Mesocentre cluster
#+begin_src sh
cd /Work/Users/apraga/bisonex/out/63003856_S135_R/preprocessing/mapped
samtools view 63003856_S135_R.bam NC_000022.11 -o 63003856_S135_R_chr22.bam
samtools index 63003856_S135_R_chr22.bam
cp 63003856_S135_R_chr22.bam* /Work/Users/apraga/bisonex/tests/xamscissors/
cd /Work/Users/apraga/bisonex/tests/xamscissors
#+end_src
Generate the data
#+begin_src julia
using XAMScissors
insertSNV("./63003856_S135_R_chr22.bam", "snvs_chr22.csv", "out")
#+end_src
Then
#+begin_src sh
julia xamscissors.jl
#+end_src
******* DONE Run
CLOSED: [2023-06-03 Sat 18:26]
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/runInserted.nf -profile standard,helios  --input="tests/xamscissors/out/inserted_{1,2}.fq.gz"
******* DONE After haplotypecaller: ok
CLOSED: [2023-06-03 Sat 18:27] SCHEDULED: <2023-06-03 Sat>
******* TODO After the VEP filter
SCHEDULED: <2023-06-03 Sat>
***** TODO Phase 3: all SNVs
SCHEDULED: <2023-06-03 Sat>
****** DONE Generate the data
CLOSED: [2023-06-03 Sat 20:16] SCHEDULED: <2023-06-03 Sat>
#+begin_src julia
using XAMScissors
insertSNV("../../out/63003856_S135_R/preprocessing/mapped/63003856_S135_R.bam", "snvs.csv", "out")
#+end_src
Run time: 73 min
#+begin_src sh
nohup bash -c 'time julia xamscissors.jl' &
xamscissors-63003856/*.fq.gz /Work/Groups/bisonex/data/xamscissors/
#+end_src
****** DONE Regenerate with @time to get the timings
CLOSED: [2023-06-03 Sat 21:45]
markReads 6.265202 seconds (1.36 M allocations: 137.090 MiB, 1.00% gc time, 9.79% compilation time)
editReads 1327.701623 seconds (1.03 G allocations: 81.996 GiB, 0.59% gc time, 0.03% compilation time)
samtools index 117.743727 seconds (53 allocations: 1.883 KiB)
samtools sort 2820.074930 seconds (66 allocations: 2.789 KiB)
bam2fastq 134.148952 seconds (794 allocations: 40.539 KiB, 0.01% compilation time)
real	73m33.273s
user	77m38.194s
sys	1m26.684s
[bam_sort_core] merging from 60 files and 1 in-memory blocks...
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 126905130 reads
real	73m6.934s
user	77m25.397s
sys	1m21.339s
****** TODO After haplotypecaller 556/590, mostly alignment failures
SCHEDULED: <2023-06-04 Sun>
Haplotypecaller: 556 found out of 590
Among the 34 missed variants, 2 have a mapping quality > 0
2×7 DataFrame
 Row │ chrom         pos        ref      alt      zygosity  meanQual   stdQual
     │ String15      Int64      String1  String1  String7   Float64    Float64
─────┼─────────────────────────────────────────────────────────────────────────
   1 │ NC_000017.11   39672244  G        A        het       60.0       0.0
   2 │ NC_000001.11  155235252  A        G        het        0.258065  2.48868
NC_000017.11   39672244  G        A        het => ok, representation issue because 2 variants are adjacent
NC_000001.11  155235252  A        G        het => few alternate reads (9/93, so ok)
Position: mostly chromosomes 1 and 6
34×7 DataFrame
 Row │ chrom         pos        ref      alt      zygosity
     │ String15      Int64      String1  String1  String7
─────┼──────────────────────────────────────────────────────
   1 │ NC_000001.11  153817496  A        T        het
   2 │ NC_000001.11  155235252  A        G        het
   3 │ NC_000001.11  155236268  G        A        het
   4 │ NC_000001.11  155290591  C        T        het
   5 │ NC_000001.11  155291918  G        A        het
   6 │ NC_000001.11  155294358  G        T        het
   7 │ NC_000002.12  149010343  C        T        het
   8 │ NC_000006.12   32039426  T        A        het
   9 │ NC_000006.12   32040110  G        T        het
  10 │ NC_000006.12   32040723  G        A        het
  11 │ NC_000006.12   32041006  C        T        het
  12 │ NC_000006.12   32041147  G        A        het
  13 │ NC_000006.12   33443054  G        T        het
  14 │ NC_000006.12   33451815  C        T        het
  15 │ NC_000006.12  170283230  C        A        het
  16 │ NC_000006.12  170283754  G        A        het
  17 │ NC_000006.12  170285637  T        C        het
  18 │ NC_000006.12  170289678  A        C        het
  19 │ NC_000010.11   87961118  G        A        het
  20 │ NC_000012.12    2449086  C        G        het
  21 │ NC_000015.10   74343027  C        T        het
  22 │ NC_000016.10   16163078  G        A        het
  23 │ NC_000016.10   21262032  C        G        het
  24 │ NC_000016.10   21962506  C        T        homo
  25 │ NC_000017.11    7513122  C        T        het
  26 │ NC_000017.11    7513752  C        T        het
  27 │ NC_000017.11   39672244  G        A        het
  28 │ NC_000017.11   46018710  C        T        het
  29 │ NC_000019.10   54144058  G        A        het
  30 │ NC_000021.9    43063074  A        G        het
  31 │ NC_000021.9    43426167  C        T        het
  32 │ NC_000022.11   18918421  A        G        het
  33 │ NC_000022.11   42087168  T        A        homo
  34 │ NC_000022.11   42213078  T        G        het
****** DONE Find where the alternative alignment is: on NW_ (removed region)
CLOSED: [2023-06-04 Sun 22:15] SCHEDULED: <2023-06-04 Sun>
e.g. chr15 74343027
  A00853:477:HMLWYDSX3:2:2444:22354:28870
#+begin_src sh
cd /Work/Groups/bisonex/data/xamscissors
zgrep -A4 "A00853:477:HMLWYDSX3:2:2444:22354:28870" *.fq.gz
#+end_src
63003856_xamscissors_1.fq.gz:@A00853:477:HMLWYDSX3:2:2444:22354:28870
63003856_xamscissors_1.fq.gz:CACCGTGTCCACCCCTCCTGCCGGCATCTCTGTGACGTTGGCCTTGATGTCCTTGAAGGACATCTTGCTGTCTCCCAGGAGTCTGTAGAGGATGCCACGGTAATCGTGGTGAACACTTCCTTTCTGTC
63003856_xamscissors_1.fq.gz:+
63003856_xamscissors_1.fq.gz:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF:FFFFFFFFFF::FFFFFFFFFFF:FFFFFFFFFFFFFF:FFFFFFF,FFFFFF,FFFFFFFFFFFF:FF::FF
63003856_xamscissors_2.fq.gz:@A00853:477:HMLWYDSX3:2:2444:22354:28870
63003856_xamscissors_2.fq.gz:GACAGAAAGGAAGTGTTCACCACGATTACCGTGGCATCCTCTACAGACTCCTGGGAGACAGCAAGATGTCCTTCGAGGACATCAAGGCCAACGTCACAGAGATGCCGGCAGGAGGGGTGGACACGGTG
63003856_xamscissors_2.fq.gz:+
63003856_xamscissors_2.fq.gz:FF:FFF:FF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFF:F:FF:FFFFFFFFFFFFFF:FFFFFFFFFFFFFFFF,:FFF,FFFFFF:FFFFFFFFFFFFF
******* DONE With BLAT: on _fix
CLOSED: [2023-06-04 Sun 21:07]
First read:
   ACTIONS      QUERY   SCORE START   END QSIZE IDENTITY  CHROM                 STRAND  START       END   SPAN
--------------------------------------------------------------------------------------------------------------
browser details YourSeq   124     1   128   128    98.5%  chr15_ML143370v1_fix  +      172243    172370    128
browser details YourSeq   124     1   128   128    98.5%  chr15                 +    74342974  74343101    128
browser details YourSeq    23     1    25   128    96.0%  chr19                 -    33396097  33396121     25
Second read:
--------------------------------------------------------------------------------------------------------------
browser details YourSeq   126     1   128   128    99.3%  chr15_ML143370v1_fix  -      172243    172370    128
browser details YourSeq   126     1   128   128    99.3%  chr15                 -    74342974  74343101    128
browser details YourSeq    23   104   128   128    96.0%  chr19                 +    33396097  33396121     25
******* DONE bwa mem by hand on GRCh38.p13: we land in an NW region
CLOSED: [2023-06-04 Sun 21:51]
Put the 2 reads in separate files, then
#+begin_src sh
cd /Work/Users/apraga/bisonex/tests/xamscissors/align
bwa mem /Work/Groups/bisonex/data/genome/GRCh38.p13/bwa/genomeRef test1.fq test2.fq
#+end_src
A00853:477:HMLWYDSX3:2:2444:22354:28870	97	NW_021160016.1	172243	0	128M	=	172243	128	CACCGTGTCCACCCCTCCTGCCGGCATCTCTGTGACGTTGGCCTTGATGTCCTTGAAGGACATCTTGCTGTCTCCCAGGAGTCTGTAGAGGATGCCACGGTAATCGTGGTGAACACTTCCTTTCTGTC	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF:FFFFFFFFFF::FFFFFFFFFFF:FFFFFFFFFFFFFF:FFFFFFF,FFFFFF,FFFFFFFFFFFF:FF::FF	NM:i:2	MD:Z:22A30C74	MC:Z:128M	AS:i:118	XS:i:118	XA:Z:NC_000015.10,+74342974,128M,2;
A00853:477:HMLWYDSX3:2:2444:22354:28870	145	NW_021160016.1	172243	0	128M	=	172243	-128	CACCGTGTCCACCCCTCCTGCCGGCATCTCTGTGACGTTGGCCTTGATGTCCTCGAAGGACATCTTGCTGTCTCCCAGGAGTCTGTAGAGGATGCCACGGTAATCGTGGTGAACACTTCCTTTCTGTC	FFFFFFFFFFFFF:FFFFFF,FFF:,FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF:FF:F:FFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF:FFF:FF	NM:i:1	MD:Z:22A105	MC:Z:128M	AS:i:123	XS:i:123	XA:Z:NC_000015.10,-74342974,128M,1;
******* DONE GRCh38.p14: idem
CLOSED: [2023-06-04 Sun 21:51]
A00853:477:HMLWYDSX3:2:2444:22354:28870	97	NW_021160016.1	172243	0	128M	=	172243	128	CACCGTGTCCACCCCTCCTGCCGGCATCTCTGTGACGTTGGCCTTGATGTCCTTGAAGGACATCTTGCTGTCTCCCAGGAGTCTGTAGAGGATGCCACGGTAATCGTGGTGAACACTTCCTTTCTGTC	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF:FFFFFFFFFF::FFFFFFFFFFF:FFFFFFFFFFFFFF:FFFFFFF,FFFFFF,FFFFFFFFFFFF:FF::FF	NM:i:2	MD:Z:22A30C74	MC:Z:128M	AS:i:118	XS:i:118	XA:Z:NC_000015.10,+74342974,128M,2;
A00853:477:HMLWYDSX3:2:2444:22354:28870	145	NW_021160016.1	172243	0	128M	=	172243	-128	CACCGTGTCCACCCCTCCTGCCGGCATCTCTGTGACGTTGGCCTTGATGTCCTCGAAGGACATCTTGCTGTCTCCCAGGAGTCTGTAGAGGATGCCACGGTAATCGTGGTGAACACTTCCTTTCTGTC	FFFFFFFFFFFFF:FFFFFF,FFF:,FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF:FF:F:FFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF:FFF:FF	NM:i:1	MD:Z:22A105	MC:Z:128M	AS:i:123	XS:i:123	XA:Z:NC_000015.10,-74342974,128M,1;
******* DONE GRCh38 : ok
CLOSED: [2023-06-04 Sun 22:15]
 bwa mem /Work/Projects/bisonex/data/genome/GRCh38/GCA_000001405.15_GRCh38_full_analysis_set.fna test1.fq test2.fq
****** DONE Check that the reads have the same quality in the original files: yes
CLOSED: [2023-06-04 Sun 21:07]
****** TODO Remove the NW_ contigs?
SCHEDULED: <2023-06-04 Sun>
Waiting for Alexis's reply
**** TODO Test Indel
*** Miscellaneous
**** DONE Check the number of reads, FASTQ vs. BAM
CLOSED: [2022-10-09 Sun 22:31]

[25.7495]

us les variants
****** DONE Understand why the distribution does not follow a normal distribution
CLOSED: [2023-06-01 Thu 21:44]
Some heterozygous variants are at 0.01 or 1...
******* DONE Increase the number of samples: same result
CLOSED: [2023-05-31 Wed 22:24]
******* DONE Check the number of marked vs. edited reads
CLOSED: [2023-06-01 Thu 21:44]
******* DONE Check that 100 calls to rand(d, 1)[1] behave like one call to rand(d, 100)
CLOSED: [2023-05-31 Wed 22:24]
julia> y = [rand(d, 1)[1] for x in 1:1000];
julia> z = rand(d,1000);
julia> df = vcat(DataFrame(:y => z, :type => "z"), DataFrame(:y => y, :type => "y"));
 draw(data(df)*histogram(bins=100)*mapping(:y, color=:type,dodge=:type))
****** DONE Improve performance
CLOSED: [2023-06-02 Fri 23:39]
#+begin_src julia
@time include("xamscissors.jl")
#+end_src
430 s for chromosome 22. Most of the time is spent editing reads:
******* DONE Insert all the variants of a read at once
CLOSED: [2023-06-01 Thu 23:09]
No change
******* DONE Test with -t4: same result
CLOSED: [2023-06-01 Thu 23:17]
******* DONE Test on the mesocentre cluster: same result
CLOSED: [2023-06-01 Thu 23:40]
348 s
******* Change the data structure
DataFrame -> Dict: the terrible performance is gone
****** TODO Redo the test with the new version
******* DONE Data generation
CLOSED: [2023-06-02 Fri 23:40]
Mesocentre cluster
#+begin_src sh
cd /Work/Users/apraga/bisonex/out/63003856_S135_R/preprocessing/mapped
samtools view 63003856_S135_R.bam NC_000022.11 -o 63003856_S135_R_chr22.bam
samtools index 63003856_S135_R_chr22.bam
cp 63003856_S135_R_chr22.bam* /Work/Users/apraga/bisonex/tests/xamscissors/
cd /Work/Users/apraga/bisonex/tests/xamscissors
#+end_src
Generate the data
#+begin_src julia
using XAMScissors
insertSNV("./63003856_S135_R_chr22.bam", "snvs_chr22.csv", "out")
#+end_src
Then
#+begin_src sh
julia xamscissors.jl
#+end_src
******* DONE Run
CLOSED: [2023-06-03 Sat 18:26]
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/runInserted.nf -profile standard,helios  --input="tests/xamscissors/out/inserted_{1,2}.fq.gz"
******* DONE After haplotypecaller: ok
CLOSED: [2023-06-03 Sat 18:27] SCHEDULED: <2023-06-03 Sat>
******* TODO After the VEP filter
SCHEDULED: <2023-06-03 Sat>
***** TODO Phase 3: all SNVs
SCHEDULED: <2023-06-03 Sat>
****** DONE Generate the data
CLOSED: [2023-06-03 Sat 20:16] SCHEDULED: <2023-06-03 Sat>
#+begin_src julia
using XAMScissors
insertSNV("../../out/63003856_S135_R/preprocessing/mapped/63003856_S135_R.bam", "snvs.csv", "out")
#+end_src
Run time: 73 min
#+begin_src sh
nohup bash -c 'time julia xamscissors.jl' &
xamscissors-63003856/*.fq.gz /Work/Groups/bisonex/data/xamscissors/
#+end_src
****** DONE Regenerate with @time to get the timings
CLOSED: [2023-06-03 Sat 21:45]
markReads 6.265202 seconds (1.36 M allocations: 137.090 MiB, 1.00% gc time, 9.79% compilation time)
editReads 1327.701623 seconds (1.03 G allocations: 81.996 GiB, 0.59% gc time, 0.03% compilation time)
samtools index 117.743727 seconds (53 allocations: 1.883 KiB)
samtools sort 2820.074930 seconds (66 allocations: 2.789 KiB)
bam2fastq 134.148952 seconds (794 allocations: 40.539 KiB, 0.01% compilation time)
real	73m33.273s
user	77m38.194s
sys	1m26.684s
[bam_sort_core] merging from 60 files and 1 in-memory blocks...
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 126905130 reads
real	73m6.934s
user	77m25.397s
sys	1m21.339s
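The per-step timings above add up to roughly the 73 min of wall-clock time. A minimal sketch of how each step's share can be read off the log, assuming lines of the form "<step name> <number> seconds ..." as printed above; =time_shares= is a hypothetical helper.

```sh
# Hypothetical helper: read "@time"-style lines on stdin
# ("<step name> <number> seconds ...") and print each step's share of the total.
time_shares() {
  LC_ALL=C awk '
    {
      for (i = 2; i <= NF; i++) if ($i == "seconds") {
        name = $1
        for (j = 2; j < i - 1; j++) name = name " " $j   # multi-word step names
        n++; names[n] = name; secs[n] = $(i - 1); total += $(i - 1)
        break
      }
    }
    END { for (k = 1; k <= n; k++) printf "%s\t%.1f%%\n", names[k], 100 * secs[k] / total }
  '
}

# Usage (timings.log is an assumed file holding the lines above):
# time_shares < timings.log
```

With the figures above, samtools sort accounts for roughly 64% of the time and editReads for roughly 30%, so the sort is the first thing to look at.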
****** DONE After haplotypecaller 556/590, mostly alignment failures
CLOSED: [2023-06-10 Sat 10:40] SCHEDULED: <2023-06-04 Sun>
Haplotypecaller: 556 found out of 590
Among the 34 missed variants, 2 have a mapping quality > 0
2×7 DataFrame
 Row │ chrom         pos        ref      alt      zygosity  meanQual   stdQual
     │ String15      Int64      String1  String1  String7   Float64    Float64
─────┼─────────────────────────────────────────────────────────────────────────
   1 │ NC_000017.11   39672244  G        A        het       60.0       0.0
   2 │ NC_000001.11  155235252  A        G        het        0.258065  2.48868
NC_000017.11   39672244  G        A        het => ok, representation issue because 2 variants are adjacent
NC_000001.11  155235252  A        G        het => few alternate reads (9/93, so ok)
Position: mostly chromosomes 1 and 6
34×7 DataFrame
 Row │ chrom         pos        ref      alt      zygosity
     │ String15      Int64      String1  String1  String7
─────┼──────────────────────────────────────────────────────
   1 │ NC_000001.11  153817496  A        T        het
   2 │ NC_000001.11  155235252  A        G        het
   3 │ NC_000001.11  155236268  G        A        het
   4 │ NC_000001.11  155290591  C        T        het
   5 │ NC_000001.11  155291918  G        A        het
   6 │ NC_000001.11  155294358  G        T        het
   7 │ NC_000002.12  149010343  C        T        het
   8 │ NC_000006.12   32039426  T        A        het
   9 │ NC_000006.12   32040110  G        T        het
  10 │ NC_000006.12   32040723  G        A        het
  11 │ NC_000006.12   32041006  C        T        het
  12 │ NC_000006.12   32041147  G        A        het
  13 │ NC_000006.12   33443054  G        T        het
  14 │ NC_000006.12   33451815  C        T        het
  15 │ NC_000006.12  170283230  C        A        het
  16 │ NC_000006.12  170283754  G        A        het
  17 │ NC_000006.12  170285637  T        C        het
  18 │ NC_000006.12  170289678  A        C        het
  19 │ NC_000010.11   87961118  G        A        het
  20 │ NC_000012.12    2449086  C        G        het
  21 │ NC_000015.10   74343027  C        T        het
  22 │ NC_000016.10   16163078  G        A        het
  23 │ NC_000016.10   21262032  C        G        het
  24 │ NC_000016.10   21962506  C        T        homo
  25 │ NC_000017.11    7513122  C        T        het
  26 │ NC_000017.11    7513752  C        T        het
  27 │ NC_000017.11   39672244  G        A        het
  28 │ NC_000017.11   46018710  C        T        het
  29 │ NC_000019.10   54144058  G        A        het
  30 │ NC_000021.9    43063074  A        G        het
  31 │ NC_000021.9    43426167  C        T        het
  32 │ NC_000022.11   18918421  A        G        het
  33 │ NC_000022.11   42087168  T        A        homo
  34 │ NC_000022.11   42213078  T        G        het
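To see how the missed variants are spread over the chromosomes, the table above can be tallied per chromosome. A minimal sketch, assuming the rows are pasted as plain text and that the first token starting with NC_ is the chromosome accession; =count_by_chrom= is a hypothetical helper.

```sh
# Hypothetical helper: count table rows per chromosome accession.
# The first token starting with "NC_" is taken as the chromosome of that row.
count_by_chrom() {
  awk '
    { for (i = 1; i <= NF; i++) if ($i ~ /^NC_/) { count[$i]++; break } }
    END { for (c in count) print count[c] "\t" c }
  ' | sort -k1,1nr -k2,2
}

# Usage (missed.txt is an assumed file holding the 34 rows above):
# count_by_chrom < missed.txt
```

On the 34 rows above this gives 11 rows for NC_000006.12 and 6 for NC_000001.11, i.e. half of the misses. Overall, 556 found out of 590 is about 94%.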
****** DONE Find where the alternative alignment is: on NW_ (removed region)
CLOSED: [2023-06-04 Sun 22:15] SCHEDULED: <2023-06-04 Sun>
e.g. chr15 74343027
  A00853:477:HMLWYDSX3:2:2444:22354:28870
#+begin_src sh
cd /Work/Groups/bisonex/data/xamscissors
zgrep -A4 "A00853:477:HMLWYDSX3:2:2444:22354:28870" *.fq.gz
#+end_src
63003856_xamscissors_1.fq.gz:@A00853:477:HMLWYDSX3:2:2444:22354:28870
63003856_xamscissors_1.fq.gz:CACCGTGTCCACCCCTCCTGCCGGCATCTCTGTGACGTTGGCCTTGATGTCCTTGAAGGACATCTTGCTGTCTCCCAGGAGTCTGTAGAGGATGCCACGGTAATCGTGGTGAACACTTCCTTTCTGTC
63003856_xamscissors_1.fq.gz:+
63003856_xamscissors_1.fq.gz:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF:FFFFFFFFFF::FFFFFFFFFFF:FFFFFFFFFFFFFF:FFFFFFF,FFFFFF,FFFFFFFFFFFF:FF::FF
63003856_xamscissors_2.fq.gz:@A00853:477:HMLWYDSX3:2:2444:22354:28870
63003856_xamscissors_2.fq.gz:GACAGAAAGGAAGTGTTCACCACGATTACCGTGGCATCCTCTACAGACTCCTGGGAGACAGCAAGATGTCCTTCGAGGACATCAAGGCCAACGTCACAGAGATGCCGGCAGGAGGGGTGGACACGGTG
63003856_xamscissors_2.fq.gz:+
63003856_xamscissors_2.fq.gz:FF:FFF:FF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFF:F:FF:FFFFFFFFFFFFFF:FFFFFFFFFFFFFFFF,:FFF,FFFFFF:FFFFFFFFFFFFF
******* DONE With BLAT: on _fix
CLOSED: [2023-06-04 Sun 21:07]
First read:
   ACTIONS      QUERY   SCORE START   END QSIZE IDENTITY  CHROM                 STRAND  START       END   SPAN
--------------------------------------------------------------------------------------------------------------
browser details YourSeq   124     1   128   128    98.5%  chr15_ML143370v1_fix  +      172243    172370    128
browser details YourSeq   124     1   128   128    98.5%  chr15                 +    74342974  74343101    128
browser details YourSeq    23     1    25   128    96.0%  chr19                 -    33396097  33396121     25
Second read:
--------------------------------------------------------------------------------------------------------------
browser details YourSeq   126     1   128   128    99.3%  chr15_ML143370v1_fix  -      172243    172370    128
browser details YourSeq   126     1   128   128    99.3%  chr15                 -    74342974  74343101    128
browser details YourSeq    23   104   128   128    96.0%  chr19                 +    33396097  33396121     25
******* DONE bwa mem by hand on GRCh38.p13: we land in an NW region
CLOSED: [2023-06-04 Sun 21:51]
Put the 2 reads in separate files, then
#+begin_src sh
cd /Work/Users/apraga/bisonex/tests/xamscissors/align
bwa mem /Work/Groups/bisonex/data/genome/GRCh38.p13/bwa/genomeRef test1.fq test2.fq
#+end_src
A00853:477:HMLWYDSX3:2:2444:22354:28870	97	NW_021160016.1	172243	0	128M	=	172243	128	CACCGTGTCCACCCCTCCTGCCGGCATCTCTGTGACGTTGGCCTTGATGTCCTTGAAGGACATCTTGCTGTCTCCCAGGAGTCTGTAGAGGATGCCACGGTAATCGTGGTGAACACTTCCTTTCTGTC	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF:FFFFFFFFFF::FFFFFFFFFFF:FFFFFFFFFFFFFF:FFFFFFF,FFFFFF,FFFFFFFFFFFF:FF::FF	NM:i:2	MD:Z:22A30C74	MC:Z:128M	AS:i:118	XS:i:118	XA:Z:NC_000015.10,+74342974,128M,2;
A00853:477:HMLWYDSX3:2:2444:22354:28870	145	NW_021160016.1	172243	0	128M	=	172243	-128	CACCGTGTCCACCCCTCCTGCCGGCATCTCTGTGACGTTGGCCTTGATGTCCTCGAAGGACATCTTGCTGTCTCCCAGGAGTCTGTAGAGGATGCCACGGTAATCGTGGTGAACACTTCCTTTCTGTC	FFFFFFFFFFFFF:FFFFFF,FFF:,FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF:FF:F:FFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF:FFF:FF	NM:i:1	MD:Z:22A105	MC:Z:128M	AS:i:123	XS:i:123	XA:Z:NC_000015.10,-74342974,128M,1;
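Both records above have a mapping quality of 0 (column 5) and an XA tag pointing to NC_000015.10: bwa found an equally good hit on the primary chromosome and placed the pair on the NW_ contig. A minimal sketch for listing such reads from SAM text. Column positions follow the SAM format, and =mapq0_alt_hits= is a hypothetical helper.

```sh
# Hypothetical helper: from SAM text on stdin, print reads with MAPQ 0
# together with the contig they were placed on and their XA alternative hits.
mapq0_alt_hits() {
  awk 'BEGIN { FS = OFS = "\t" }
    /^@/ { next }                      # skip header lines
    $5 == 0 {
      xa = "-"                         # "-" when no alternative hit is reported
      for (i = 12; i <= NF; i++) if ($i ~ /^XA:Z:/) xa = substr($i, 6)
      print $1, $3, xa
    }'
}

# Usage, with the command from the block above:
# bwa mem /Work/Groups/bisonex/data/genome/GRCh38.p13/bwa/genomeRef test1.fq test2.fq | mapq0_alt_hits
```

Reads listed with an NW_ contig in the second column and an NC_ hit in the third are the ones affected by the extra contigs.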
******* DONE GRCh38.p14: idem
CLOSED: [2023-06-04 Sun 21:51]
A00853:477:HMLWYDSX3:2:2444:22354:28870	97	NW_021160016.1	172243	0	128M	=	172243	128	CACCGTGTCCACCCCTCCTGCCGGCATCTCTGTGACGTTGGCCTTGATGTCCTTGAAGGACATCTTGCTGTCTCCCAGGAGTCTGTAGAGGATGCCACGGTAATCGTGGTGAACACTTCCTTTCTGTC	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF:FFFFFFFFFF::FFFFFFFFFFF:FFFFFFFFFFFFFF:FFFFFFF,FFFFFF,FFFFFFFFFFFF:FF::FF	NM:i:2	MD:Z:22A30C74	MC:Z:128M	AS:i:118	XS:i:118	XA:Z:NC_000015.10,+74342974,128M,2;
A00853:477:HMLWYDSX3:2:2444:22354:28870	145	NW_021160016.1	172243	0	128M	=	172243	-128	CACCGTGTCCACCCCTCCTGCCGGCATCTCTGTGACGTTGGCCTTGATGTCCTCGAAGGACATCTTGCTGTCTCCCAGGAGTCTGTAGAGGATGCCACGGTAATCGTGGTGAACACTTCCTTTCTGTC	FFFFFFFFFFFFF:FFFFFF,FFF:,FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF:FF:F:FFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF:FFF:FF	NM:i:1	MD:Z:22A105	MC:Z:128M	AS:i:123	XS:i:123	XA:Z:NC_000015.10,-74342974,128M,1;
******* DONE GRCh38 : ok
CLOSED: [2023-06-04 Sun 22:15]
 bwa mem /Work/Projects/bisonex/data/genome/GRCh38/GCA_000001405.15_GRCh38_full_analysis_set.fna test1.fq test2.fq
****** DONE Check that the reads have the same quality in the original files: yes
CLOSED: [2023-06-04 Sun 21:07]
****** DONE Remove the NW_ contigs?
CLOSED: [2023-06-10 Sat 10:40] SCHEDULED: <2023-06-04 Sun>
**** TODO Test Indel
*** Miscellaneous
**** DONE Check the number of reads, FASTQ vs. BAM
CLOSED: [2022-10-09 Sun 22:31]

Insertion in notes/20230511172909-livres.org at line 13 [5.675275]

[26.148]

[27.1275]

** [[https://softwarefoundations.cis.upenn.edu/][Software foundation]] :coq:
*** STRT [[https://softwarefoundations.cis.upenn.edu/lf-current/toc.html][Volume 1]]