B:BD[
5.5956] → [
5.5956:7460]
B:BD[
5.7460] → [
3.34056:56478]
∅:D[
3.56478] → [
6.20953:21603]
B:BD[
6.20953] → [
6.20953:21603]
ty | F-measure |
|-----------+-------------------+---------------+-----------+-----------+-----------+-------------+-----------|
| 3.000 | 42789 | 42416 | 2598 | 8080 | 0.9423 | 0.8412 | 0.8889 |
| None | 42798 | 42425 | 2616 | 8071 | 0.9419 | 0.8413 | 0.8888 |
Indel avec le plus petit seuil : zcat NA12878.non_snp_roc.tsv.gz
Attention à inverser precision et recall !
zcat NA12878.non_snp_roc.tsv.gz | tail -n 1 | awk '{print $7 $6}'
0.71390.7136
SNP avec le plus petit seuil : zcat NA12878.non_snp_roc.tsv.gz
Attention à inverser precision et recall !
$ zcat NA12878.snp_roc.tsv.gz | tail -n 1 | awk '{print $7 $6}'
0.85470.9727
compareNA12878-giab/vcfeval/NA12878.summary.txt
| Threshold | True-pos-baseline | True-pos-call | False-pos | False-neg | Precision | Sensitivity | F-measure |
| 1.000 | 44812 | 44812 | 2878 | 6057 | 0.9397 | 0.8809 | 0.9093 |
| None | 44813 | 44813 | 2882 | 6056 | 0.9396 | 0.8809 | 0.9093 |
SNP:
$ zcat NA12878.snp_roc.tsv.gz | tail -n 1 | awk '{print $7 $6}'
0.89370.9621
indel
$ zcat NA12878.non_snp_roc.tsv.gz | tail -n 1 | awk '{print $7 $6}'
0.75980.7445
compareNA12878-giab/happy/NA12878.summary.csv
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision
| METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
|-------+--------+-------------+----------+----------+-------------+----------+-----------+-------+-------+---------------+------------------+----------------+-----------------+------------------------+------------------------+---------------------------+---------------------------|
| INDEL | ALL | 4871 | 3678 | 1193 | 7036 | 1299 | 2011 | 208 | 217 | 0.755081 | 0.741493 | 0.285816 | 0.748225 | | | 1.6174985978687606 | 2.5240506329113925 |
| INDEL | PASS | 4871 | 3678 | 1193 | 7036 | 1299 | 2011 | 208 | 217 | 0.755081 | 0.741493 | 0.285816 | 0.748225 | | | 1.6174985978687606 | 2.5240506329113925 |
| SNP | ALL | 46032 | 41138 | 4894 | 47694 | 1622 | 4930 | 362 | 31 | 0.893683 | 0.962071 | 0.103367 | 0.926617 | 2.529551552318896 | 2.4124463519313304 | 1.6206857273037931 | 1.6888675840288743 |
| SNP | PASS | 46032 | 41138 | 4894 | 47694 | 1622 | 4930 | 362 | 31 | 0.893683 | 0.962071 | 0.103367 | 0.926617 | 2.529551552318896 | 2.4124463519313304 | 1.6206857273037931 | 1.688867584028874 |
***** KILL Résultats sans trimming
CLOSED: [2023-06-25 Sun 15:53] SCHEDULED: <2023-06-26 Mon>
***** TODO Refaire avec génome "prêt à l'emploi"
SCHEDULED: <2023-06-25 Sun>
***** KILL Utiliser d'autres données brutes ?
CLOSED: [2023-06-25 Sun 15:58]
https://zenodo.org/record/3597727
Capture en hg37 également. Serait intéressant mais pas le temps..
***** KILL Comparer avec UCSCS liftover
CLOSED: [2023-06-26 Mon 19:02] SCHEDULED: <2023-06-25 Sun>
Picard liftoverinterval est basé sur UCSCS
Mais on n'aurait pas la différence pour NA12878 qu'on voit...
**** TODO HG002 :hg002:hg38:
SCHEDULED: <2023-04-10 Mon>
#+begin_src
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/giabFastq.nf -profile standard,helios
NXF_OPTS=-D"user.name=${USER}" nextflow run main.nf -profile standard,helios -resume --input="/Work/Groups/bisonex/data/giab/GRCh38/HG002_{1,2}.fq.gz --test.id=HG002
Only the capture file differs. Results are better using the capture file given by Agilent, stored in data/
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/compareVCF.nf -profile standard,helios -resume --outdir=compareHG002 --test.id=HG002 --test.query=out/HG002_1/variantCalling/haplotypecaller/HG002_1.vcf.gz --test.compare=vcfeval,happy --test.capture=data/AgilentSureSelectv05_hg38.bed
#
#+end_src
***** DONE Mauvais résultats
CLOSED: [2023-04-14 Fri 09:42]
avec vcfeval
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
0.000 24585 24390 10060 39415 0.7080 0.3841 0.4980
None 24585 24390 10060 39415 0.7080 0.3841 0.4980
La sortie du variantCalling est celle d'happy ???
On relance...
***** DONE Vérifier vcf en hg38
CLOSED: [2023-04-12 Wed 10:33] SCHEDULED: <2023-04-12 Wed>
***** KILL Capture en hg19 ?
CLOSED: [2023-04-13 Thu 09:46] SCHEDULED: <2023-04-12 Wed>
***** KILL Vraiment fichier de capture ou zone d'intérêt ?
CLOSED: [2023-04-13 Thu 09:45] SCHEDULED: <2023-04-12 Wed>
"target region" +/- 50bp
[[https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/OsloUniversityHospital_Exome_GATK_jointVC_11242015/README.txt][README]]
list file describing the variant calling regions (target regions extended with 50 bp on each end)
***** DONE .bed fourni par AGilent: sensbilité très mauvaise
CLOSED: [2023-04-13 Thu 09:46] SCHEDULED: <2023-04-13 Thu>
Agilent SureSelect Human All Exon V5 kit
Disponible en hg38
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
0.000 19653 19501 6410 21657 0.7526 0.4757 0.5830
None 19653 19501 6410 21657 0.7526 0.4757 0.5830
***** DONE Trier par nom avec samtools sort : bons résultats
CLOSED: [2023-04-14 Fri 09:25] SCHEDULED: <2023-04-13 Thu>
Avec capture fourni par GIAB
vcf eval
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
5.000 57443 57032 984 6557 0.9830 0.8975 0.9383
None 57457 57046 1009 6543 0.9826 0.8978 0.9383
Happy
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
|-------+--------+-------------+----------+----------+-------------+----------+-----------+-------+-------+---------------+------------------+----------------+-----------------+------------------------+------------------------+---------------------------+---------------------------|
| INDEL | ALL | 6150 | 5007 | 1143 | 6978 | 556 | 1346 | 151 | 168 | 0.814146 | 0.901278 | 0.192892 | 0.8555 | | | 1.5434221840068787 | 1.9467178175618074 |
| INDEL | PASS | 6150 | 5007 | 1143 | 6978 | 556 | 1346 | 151 | 168 | 0.814146 | 0.901278 | 0.192892 | 0.8555 | | | 1.5434221840068787 | 1.9467178175618074 |
| SNP | ALL | 57818 | 52464 | 5354 | 56016 | 500 | 3046 | 90 | 30 | 0.907399 | 0.990561 | 0.054377 | 0.947158 | 2.4892012548262548 | 2.426824047458871 | 1.5904527117884357 | 1.6107795598657217 |
| SNP | PASS | 57818 | 52464 | 5354 | 56016 | 500 | 3046 | 90 | 30 | 0.907399 | 0.990561 | 0.054377 | 0.947158 | 2.4892012548262548 | 2.426824047458871 | 1.5904527117884357 | 1.6107795598657217 |
***** DONE Capture agilent légment meilleur que celui fourni par GIAB (padding ?)
CLOSED: [2023-04-14 Fri 09:48]
GIAB:
vcf eval
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
5.000 57443 57032 984 6557 0.9830 0.8975 0.9383
None 57457 57046 1009 6543 0.9826 0.8978 0.9383
Happy
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
|-------+--------+-------------+----------+----------+-------------+----------+-----------+-------+-------+---------------+------------------+----------------+-----------------+------------------------+------------------------+---------------------------+---------------------------|
| INDEL | ALL | 6150 | 5007 | 1143 | 6978 | 556 | 1346 | 151 | 168 | 0.814146 | 0.901278 | 0.192892 | 0.8555 | | | 1.5434221840068787 | 1.9467178175618074 |
| INDEL | PASS | 6150 | 5007 | 1143 | 6978 | 556 | 1346 | 151 | 168 | 0.814146 | 0.901278 | 0.192892 | 0.8555 | | | 1.5434221840068787 | 1.9467178175618074 |
| SNP | ALL | 57818 | 52464 | 5354 | 56016 | 500 | 3046 | 90 | 30 | 0.907399 | 0.990561 | 0.054377 | 0.947158 | 2.4892012548262548 | 2.426824047458871 | 1.5904527117884357 | 1.6107795598657217 |
| SNP | PASS | 57818 | 52464 | 5354 | 56016 | 500 | 3046 | 90 | 30 | 0.907399 | 0.990561 | 0.054377 | 0.947158 | 2.4892012548262548 | 2.426824047458871 | 1.5904527117884357 | 1.6107795598657217 |
Agilent
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
6.000 37241 36965 449 4069 0.9880 0.9015 0.9428
None 37248 36972 461 4062 0.9877 0.9017 0.9427
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL | 2909 | 2477 | 432 | 3229 | 207 | 519 | 52 | 50 | 0.851495 | 0.923616 | 0.160731 | 0.886091 | | | 1.4964850615114236 | 1.8339222614840989 |
| INDEL | PASS | 2909 | 2477 | 432 | 3229 | 207 | 519 | 52 | 50 | 0.851495 | 0.923616 | 0.160731 | 0.886091 | | | 1.4964850615114236 | 1.8339222614840989 |
| SNP | ALL | 38406 | 34793 | 3613 | 36935 | 275 | 1868 | 37 | 15 | 0.905926 | 0.992158 | 0.050575 | 0.947083 | 2.6247759222568168 | 2.5752854654538417 | 1.588953331534934 | 1.6192536889897844 |
| SNP | PASS | 38406 | 34793 | 3613 | 36935 | 275 | 1868 | 37 | 15 | 0.905926 | 0.992158 | 0.050575 | 0.947083 | 2.6247759222568168 | 2.5752854654538417 | 1.588953331534934 | 1.6192536889897844 |
***** TODO Refaire avec génome "prêt à l'emploi"
SCHEDULED: <2023-06-25 Sun>
**** TODO HG003 :hg003:hg38:
***** Notes
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run main.nf -profile standard,helios --input /Work/Groups/bisonex/data/giab/GRCh38/HG003_{1,2}.fq.gz -bg
#+end_src
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/compareVCF.nf -profile standard,helios -resume --outdir=compareHG003 --test.id=HG003 --test.query=out/HG003_1/variantCalling/haplotypecaller/HG003_1.vcf.gz --test.compare=vcfeval,happy --test.capture=data/AgilentSureSelectv05_hg38.bed
#+end_src
vcfeval
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
5.000 36745 36473 486 3988 0.9869 0.9021 0.9426
None 36748 36476 495 3985 0.9866 0.9022 0.9425
$ zcat NA12878.snp_roc.tsv.gz | tail -n 1 | awk '{print $7 $6}'
happy
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 2731 2290 441 3092 208 577 62 53 0.838521 0.917296 0.186611 0.876141 NaN NaN 1.505145 1.888993
INDEL PASS 2731 2290 441 3092 208 577 62 53 0.838521 0.917296 0.186611 0.876141 NaN NaN 1.505145 1.888993
SNP ALL 37997 34481 3516 36861 306 2074 33 13 0.907466 0.991204 0.056265 0.947488 2.611269 2.565915 1.555780 1.621727
SNP PASS 37997 34481 3516 36861 306 2074 33 13 0.907466 0.991204 0.056265 0.947488 2.611269 2.5659
***** TODO Refaire avec génome "prêt à l'emploi"
SCHEDULED: <2023-06-25 Sun>
**** TODO HG004 :hg38:
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run main.nf -profile standard,helios --input /Work/Groups/bisonex/data/giab/GRCh38/HG004_{1,2}.fq.gz -bg
#+end_src
vcfeval
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
6.000 36938 36678 421 4040 0.9887 0.9014 0.9430
None 36942 36682 432 4036 0.9884 0.9015 0.9429
happy
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 2787 2388 399 3183 195 580 53 38 0.856835 0.925086 0.182218 0.889654 NaN NaN 1.507834 1.848649
INDEL PASS 2787 2388 399 3183 195 580 53 38 0.856835 0.925086 0.182218 0.889654 NaN NaN 1.507834 1.848649
SNP ALL 38185 34560 3625 36921 254 2107 46 7 0.905067 0.992704 0.057068 0.946862 2.589175 2.553546 1.632595 1.653534
SNP PASS 38185 34560 3625 36921 254 2107 46 7 0.905067 0.992704 0.057068 0.946862 2.589175 2.553546 1.632595 1.653534
**** DONE Résumer résultats pour Paul + article :resultats:hg38:
CLOSED: [2023-04-06 Thu 21:41] SCHEDULED: <2023-04-02 Sun>
**** DONE Plot : ashkenazim trio :hg38:
CLOSED: [2023-04-18 Tue 21:27] SCHEDULED: <2023-04-16 Sun>
/Entered on/ [2023-04-16 Sun 17:29]
**** TODO Refaire avec génome "prêt l'emploi"
SCHEDULED: <2023-06-25 Sun>
**** TODO NA12878 :na12878:T2T:
SCHEDULED: <2023-06-15 Thu>
*** KILL Platinum genome
CLOSED: [2023-06-14 Wed 22:37]
https://emea.illumina.com/platinumgenomes.html
*** TODO Séquencer NA12878
Discussion avec Paul : sous-traitant ne nous donnera pas les données, il faut commander l'ADN
** TODO Insilico :centogene:
*** TODO tous les variants centogène
**** DONE Extraire liste des SNVs
CLOSED: [2023-04-22 Sat 17:32] SCHEDULED: <2023-04-17 Mon>
***** DONE Corriger manquant à la main
CLOSED: [2023-04-22 Sat 17:31]
La sortie est sauvegardé dans git-annex : variants_success.csv
***** DONE Automatique
CLOSED: [2023-04-22 Sat 17:31]
**** DONE Convert SNVs : transcript -> génomique
CLOSED: [2023-06-03 Sat 17:16]
***** DONE Variant_recoder
CLOSED: [2023-04-26 Wed 21:21] SCHEDULED: <2023-04-22 Sat>
****** KILL Haskell: 160 manquant : recoded-success.csv
CLOSED: [2023-04-25 Tue 18:32]
La liste des variants a été générée en Haskel l et nettoyée à la main.
On générer une liste de variant pour variant_rec oder et on soumet tout d'un coup.
[[file:~/recherche/bisonex/parsevariants/app/Main.hs][parsevariant]]
#+begin_src haskell
recodeVariant = do
prepareVariantRecod er "variant_success.csv" "renamed.csv"
runVariantRecoder "renamed.csv" "recoded.json"
#+end_src
#+RESULTS:
: <interactive>:4:3-19: error:
: Variable not in scope: runVariantRecoder :: String -> String -> t
: gh
Problème : 160 n'ont pas pu être lu sur 820, probablement à cause du numéro mineur de transcrit
La sortie est sauvegardé dans git-annex : variants-recoded-raw.json.
****** KILL Julia
CLOSED: [2023-04-25 Tue 18:32]
On regénère la liste de variant et on passe à Julia pour préparer l'appel en parallèle à variant recoder
[[file:~/recherche/bisonex/parsevariants/variantRecoder.jl][variantRecoder.jl]]
#+begin_src julia
setupVariantRecoder(unique(init), n)
#+end_src
Puis
#+begin_src sh
parallel -a parallel-recoder.sh --jobs 10
#+end_src
On récupère les résultats
#+begin_src julia
(fails, success) = mergeVariantRecoder(n)
CSV.write(fSuccess, success)
CSV.write(fFailures, fails)
#+end_src
Certains variants ne sont pas trouvé, donc on prépare un nouveau job en enlevant les versionrs mineures des transcrits
#+begin_src julia
# Cleanup json and txt
if isfile(fSuccess) && isfile(fFailures)
foreach(rm, variantRecoderInput())
foreach(rm, variantRecoderOutput())
end
redoFails(fFailures)
#+end_src
Puis
#+begin_src sh
parallel -a parallel-recoder.sh --jobs 3
#+end_src
Il manque encore 70 transcrits
***** DONE Julia avec mobidetails: recode-failures-mobidetails.csv
CLOSED: [2023-04-25 Tue 18:58]
Nouvelle stratégie : on essaie une fois variant recoder.
Pour tous les échecs, on utilise mobidetails (~170).
Si l'ID n'est pas trouvé, on incrémente le numéro de version 2 fois
***** DONE Reste une dizaine à corriger à la main
CLOSED: [2023-04-26 Wed 21:21]
- [X] certains transcrits ont juste été supprimé
- [X] Erreur de parsing, manque souvent un -
#+begin_src julia
lastTryMobidetails("recoded-failures-mobidetails.csv")
#+end_src
***** DONE Fusionner données
CLOSED: [2023-04-26 Wed 22:35]
#+begin_src julia
function mergeAllGenomic()
dNew = mergeAll("recoded-success.csv",
"recoded-failures-mobidetails.csv",
"recoded-failures-mobidetails-redo.csv")
dInit = @chain DataFrame(CSV.File("variant_success.csv")) begin
@transform :transcript = :transcript .* ":" .* :coding .* :codingPos .* :codingChange
@select :file :transcript :classification :zygosity
@rename :classificationCentogene = :classification
end
dTmp = outerjoin(dInit, dNew, on = :transcript)
CSV.write("variant_genomic.csv", dTmp)
end
fSuccess = "recoded-success.csv"
fFailures = "recoded-failures.csv"
# variantRecoder(fSuccess, fFailures)
# mobidetailsOnFailures(fFailures)
# lastTryMobidetails("recoded-failures-mobidetails.csv")
mergeAllGenomic()
#+end_src
***** DONE Formatter donner pour simuscop
CLOSED: [2023-04-28 Fri 11:55] SCHEDULED: <2023-04-26 Wed>
**** TODO Extraire liste des CNVs
SCHEDULED: <2023-04-17 Mon>
**** TODO Simuscop :simuscop:
***** DONE Entrainer le modèle sur 63003856/
CLOSED: [2023-04-29 Sat 19:56]
Relancer le modèle pour être sûr
***** DONE Générer fastq avec simuscop (del et ins seulement) 20x
CLOSED: [2023-04-28 Fri 23:35] SCHEDULED: <2023-04-22 Sat>
****** DONE Génerer un profile avec bed de centogène
CLOSED: [2023-04-28 Fri 11:54] SCHEDULED: <2023-04-22 Sat>
NA12878 mais à refaire avec un vrai séquencage
Voir [[*Centogène][Bed Centogène]] pour choix
****** DONE Générer les données en 20x
CLOSED: [2023-04-28 Fri 11:54] SCHEDULED: <2023-04-22 Sat>
capture de centogene
****** DONE Regénérer en supprimant les doublons
CLOSED: [2023-04-28 Fri 17:28]
***** DONE Quelle couverture ?
CLOSED: [2023-04-29 Sat 18:26]
ex sur chr11:16,014,966 où on a 11 reads dans la simulation contre 200 !
****** 200 est la plus proche
#+attr_html: :width 500px
[[./simuscop-200-chr1-1.png]]
#+attr_html: :width 500px
[[./simuscop-200-chr1-2.png]]
****** DONE 20x
CLOSED: [2023-04-29 Sat 15:38]
****** DONE 50x
CLOSED: [2023-04-29 Sat 15:38]
****** DONE 100x
CLOSED: [2023-04-29 Sat 15:39]
****** DONE 200x
CLOSED: [2023-04-29 Sat 15:39]
***** DONE Reads mal centrés sur des petits exons seuls
CLOSED: [2023-04-29 Sat 19:56] SCHEDULED: <2023-04-29 Sat>
Capture ok : [[https://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr1%3A153817168%2D153817824&hgsid=296556270_F4fkENLPXHXidi2oALXls2jxNH9l][UCSC]] (track noire)
Mais mauvaise répartitiopn
#+attr_html: :width 800px
[[./simuscop-error.png]]
À tester
- Problème de profile ?
- mauvais patient ?
- mauvaise génération ? -> comparer avec ceux donnés sur github
- nom des chromosomes ?
****** DONE [#A] Tester sur exon 6 GATAD2B pour NC_000001.11:g.153817496A>T
CLOSED: [2023-04-29 Sat 19:56] SCHEDULED: <2023-04-29 Sat>
******* DONE Configuration + Profile 63003856.profile: idem, mal centré
CLOSED: [2023-04-29 Sat 19:18]
Téléchargement des données
#+begin_src sh :dir ~/code/bisonex/test-simuscop
scp meso:/Work/Projects/bisonex/data/genome/GRCh38.p14/genomeRef.fna .
scp meso:Work/Projects/bisonex/data/simuscop/*.profile .
scp -r meso:/Work/Projects/bisonex/data/genome/GRCh38.p13/bwa .
#+end_src
On récupère l'exon (NB: org
-mode ne lance pas le code...)
#+begin_src julia
using CSV,DataFramesMeta
d = CSV.read("VCGS_Exome_Covered_Targets_hg38_40.1MB_renamed.bed", header=false, delim="\t", DataFrame)
@subset d :Column1 .== "NC_000001.11" :Column2 .<= 153817496 :Column3 .>= 153817496
#+end_src
NC_000001.11 153817371 153817542
Génération du bed
#+begin_src sh :dir ~/code/bisonex/test-simuscop
echo -e "NC_000001.11\t153817371\t153817542" > gatad2b-exon6.bed
#+end_src
#+RESULTS:
Génération d'un variant
#+begin_src sh :dir ~/code/bisonex/test-simuscop
echo -e "s\tsingle\tNC_000001.11\t153817496\tA\tT\thet"> variant.txt
#+end_src
#+RESULT
ty | F-measure |
|-----------+-------------------+---------------+-----------+-----------+-----------+-------------+-----------|
| 3.000 | 42789 | 42416 | 2598 | 8080 | 0.9423 | 0.8412 | 0.8889 |
| None | 42798 | 42425 | 2616 | 8071 | 0.9419 | 0.8413 | 0.8888 |
Indel avec le plus petit seuil : zcat NA12878.non_snp_roc.tsv.gz
Attention à inverser precision et recall !
zcat NA12878.non_snp_roc.tsv.gz | tail -n 1 | awk '{print $7 $6}'
0.71390.7136
SNP avec le plus petit seuil : zcat NA12878.non_snp_roc.tsv.gz
Attention à inverser precision et recall !
$ zcat NA12878.snp_roc.tsv.gz | tail -n 1 | awk '{print $7 $6}'
0.85470.9727
compareNA12878-giab/vcfeval/NA12878.summary.txt
| Threshold | True-pos-baseline | True-pos-call | False-pos | False-neg | Precision | Sensitivity | F-measure |
| 1.000 | 44812 | 44812 | 2878 | 6057 | 0.9397 | 0.8809 | 0.9093 |
| None | 44813 | 44813 | 2882 | 6056 | 0.9396 | 0.8809 | 0.9093 |
SNP:
$ zcat NA12878.snp_roc.tsv.gz | tail -n 1 | awk '{print $7 $6}'
0.89370.9621
indel
$ zcat NA12878.non_snp_roc.tsv.gz | tail -n 1 | awk '{print $7 $6}'
0.75980.7445
compareNA12878-giab/happy/NA12878.summary.csv
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
|-------+--------+-------------+----------+----------+-------------+----------+-----------+-------+-------+---------------+------------------+----------------+-----------------+------------------------+------------------------+---------------------------+---------------------------|
| INDEL | ALL | 4871 | 3678 | 1193 | 7036 | 1299 | 2011 | 208 | 217 | 0.755081 | 0.741493 | 0.285816 | 0.748225 | | | 1.6174985978687606 | 2.5240506329113925 |
| INDEL | PASS | 4871 | 3678 | 1193 | 7036 | 1299 | 2011 | 208 | 217 | 0.755081 | 0.741493 | 0.285816 | 0.748225 | | | 1.6174985978687606 | 2.5240506329113925 |
| SNP | ALL | 46032 | 41138 | 4894 | 47694 | 1622 | 4930 | 362 | 31 | 0.893683 | 0.962071 | 0.103367 | 0.926617 | 2.529551552318896 | 2.4124463519313304 | 1.6206857273037931 | 1.6888675840288743 |
| SNP | PASS | 46032 | 41138 | 4894 | 47694 | 1622 | 4930 | 362 | 31 | 0.893683 | 0.962071 | 0.103367 | 0.926617 | 2.529551552318896 | 2.4124463519313304 | 1.6206857273037931 | 1.688867584028874 |
***** KILL Résultats sans trimming
CLOSED: [2023-06-25 Sun 15:53] SCHEDULED: <2023-06-26 Mon>
***** TODO Refaire : HiSeq4000 + agilent sureselect + génome "prêt à l'emploi"
SCHEDULED: <2023-06-25 Sun>
***** KILL Utiliser d'autres données brutes ?
CLOSED: [2023-06-25 Sun 15:58]
https://zenodo.org/record/3597727
Capture en hg37 également. Serait intéressant mais pas le temps..
***** KILL Comparer avec UCSCS liftover
CLOSED: [2023-06-26 Mon 19:02] SCHEDULED: <2023-06-25 Sun>
Picard liftoverinterval est basé sur UCSCS
Mais on n'aurait pas la différence pour NA12878 qu'on voit...
**** TODO HG002 :hg002:hg38:
SCHEDULED: <2023-04-10 Mon>
#+begin_src
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/giabFastq.nf -profile standard,helios
NXF_OPTS=-D"user.name=${USER}" nextflow run main.nf -profile standard,helios -resume --input="/Work/Groups/bisonex/data/giab/GRCh38/HG002_{1,2}.fq.gz --test.id=HG002
Only the capture file differs. Results are better using the capture file given by Agilent, stored in data/
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/compareVCF.nf -profile standard,helios -resume --outdir=compareHG002 --test.id=HG002 --test.query=out/HG002_1/variantCalling/haplotypecaller/HG002_1.vcf.gz --test.compare=vcfeval,happy --test.capture=data/AgilentSureSelectv05_hg38.bed
#
#+end_src
***** DONE Mauvais résultats
CLOSED: [2023-04-14 Fri 09:42]
avec vcfeval
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
0.000 24585 24390 10060 39415 0.7080 0.3841 0.4980
None 24585 24390 10060 39415 0.7080 0.3841 0.4980
La sortie du variantCalling est celle d'happy ???
On relance...
***** DONE Vérifier vcf en hg38
CLOSED: [2023-04-12 Wed 10:33] SCHEDULED: <2023-04-12 Wed>
***** KILL Capture en hg19 ?
CLOSED: [2023-04-13 Thu 09:46] SCHEDULED: <2023-04-12 Wed>
***** KILL Vraiment fichier de capture ou zone d'intérêt ?
CLOSED: [2023-04-13 Thu 09:45] SCHEDULED: <2023-04-12 Wed>
"target region" +/- 50bp
[[https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/OsloUniversityHospital_Exome_GATK_jointVC_11242015/README.txt][README]]
list file describing the variant calling regions (target regions extended with 50 bp on each end)
***** DONE .bed fourni par AGilent: sensbilité très mauvaise
CLOSED: [2023-04-13 Thu 09:46] SCHEDULED: <2023-04-13 Thu>
Agilent SureSelect Human All Exon V5 kit
Disponible en hg38
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
0.000 19653 19501 6410 21657 0.7526 0.4757 0.5830
None 19653 19501 6410 21657 0.7526 0.4757 0.5830
***** DONE Trier par nom avec samtools sort : bons résultats
CLOSED: [2023-04-14 Fri 09:25] SCHEDULED: <2023-04-13 Thu>
Avec capture fourni par GIAB
vcf eval
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
5.000 57443 57032 984 6557 0.9830 0.8975 0.9383
None 57457 57046 1009 6543 0.9826 0.8978 0.9383
Happy
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
|-------+--------+-------------+----------+----------+-------------+----------+-----------+-------+-------+---------------+------------------+----------------+-----------------+------------------------+------------------------+---------------------------+---------------------------|
| INDEL | ALL | 6150 | 5007 | 1143 | 6978 | 556 | 1346 | 151 | 168 | 0.814146 | 0.901278 | 0.192892 | 0.8555 | | | 1.5434221840068787 | 1.9467178175618074 |
| INDEL | PASS | 6150 | 5007 | 1143 | 6978 | 556 | 1346 | 151 | 168 | 0.814146 | 0.901278 | 0.192892 | 0.8555 | | | 1.5434221840068787 | 1.9467178175618074 |
| SNP | ALL | 57818 | 52464 | 5354 | 56016 | 500 | 3046 | 90 | 30 | 0.907399 | 0.990561 | 0.054377 | 0.947158 | 2.4892012548262548 | 2.426824047458871 | 1.5904527117884357 | 1.6107795598657217 |
| SNP | PASS | 57818 | 52464 | 5354 | 56016 | 500 | 3046 | 90 | 30 | 0.907399 | 0.990561 | 0.054377 | 0.947158 | 2.4892012548262548 | 2.426824047458871 | 1.5904527117884357 | 1.6107795598657217 |
***** DONE Capture agilent légment meilleur que celui fourni par GIAB (padding ?)
CLOSED: [2023-04-14 Fri 09:48]
GIAB:
vcf eval
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
5.000 57443 57032 984 6557 0.9830 0.8975 0.9383
None 57457 57046 1009 6543 0.9826 0.8978 0.9383
Happy
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
|-------+--------+-------------+----------+----------+-------------+----------+-----------+-------+-------+---------------+------------------+----------------+-----------------+------------------------+------------------------+---------------------------+---------------------------|
| INDEL | ALL | 6150 | 5007 | 1143 | 6978 | 556 | 1346 | 151 | 168 | 0.814146 | 0.901278 | 0.192892 | 0.8555 | | | 1.5434221840068787 | 1.9467178175618074 |
| INDEL | PASS | 6150 | 5007 | 1143 | 6978 | 556 | 1346 | 151 | 168 | 0.814146 | 0.901278 | 0.192892 | 0.8555 | | | 1.5434221840068787 | 1.9467178175618074 |
| SNP | ALL | 57818 | 52464 | 5354 | 56016 | 500 | 3046 | 90 | 30 | 0.907399 | 0.990561 | 0.054377 | 0.947158 | 2.4892012548262548 | 2.426824047458871 | 1.5904527117884357 | 1.6107795598657217 |
| SNP | PASS | 57818 | 52464 | 5354 | 56016 | 500 | 3046 | 90 | 30 | 0.907399 | 0.990561 | 0.054377 | 0.947158 | 2.4892012548262548 | 2.426824047458871 | 1.5904527117884357 | 1.6107795598657217 |
Agilent
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
6.000 37241 36965 449 4069 0.9880 0.9015 0.9428
None 37248 36972 461 4062 0.9877 0.9017 0.9427
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL | 2909 | 2477 | 432 | 3229 | 207 | 519 | 52 | 50 | 0.851495 | 0.923616 | 0.160731 | 0.886091 | | | 1.4964850615114236 | 1.8339222614840989 |
| INDEL | PASS | 2909 | 2477 | 432 | 3229 | 207 | 519 | 52 | 50 | 0.851495 | 0.923616 | 0.160731 | 0.886091 | | | 1.4964850615114236 | 1.8339222614840989 |
| SNP | ALL | 38406 | 34793 | 3613 | 36935 | 275 | 1868 | 37 | 15 | 0.905926 | 0.992158 | 0.050575 | 0.947083 | 2.6247759222568168 | 2.5752854654538417 | 1.588953331534934 | 1.6192536889897844 |
| SNP | PASS | 38406 | 34793 | 3613 | 36935 | 275 | 1868 | 37 | 15 | 0.905926 | 0.992158 | 0.050575 | 0.947083 | 2.6247759222568168 | 2.5752854654538417 | 1.588953331534934 | 1.6192536889897844 |
***** TODO Refaire : HiSeq4000 + agilent sureselect + génome "prêt à l'emploi"
SCHEDULED: <2023-06-25 Sun>
**** TODO HG003 :hg003:hg38:
***** Notes
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run main.nf -profile standard,helios --input /Work/Groups/bisonex/data/giab/GRCh38/HG003_{1,2}.fq.gz -bg
#+end_src
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/compareVCF.nf -profile standard,helios -resume --outdir=compareHG003 --test.id=HG003 --test.query=out/HG003_1/variantCalling/haplotypecaller/HG003_1.vcf.gz --test.compare=vcfeval,happy --test.capture=data/AgilentSureSelectv05_hg38.bed
#+end_src
vcfeval
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
5.000 36745 36473 486 3988 0.9869 0.9021 0.9426
None 36748 36476 495 3985 0.9866 0.9022 0.9425
$ zcat NA12878.snp_roc.tsv.gz | tail -n 1 | awk '{print $7 $6}'
happy
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 2731 2290 441 3092 208 577 62 53 0.838521 0.917296 0.186611 0.876141 NaN NaN 1.505145 1.888993
INDEL PASS 2731 2290 441 3092 208 577 62 53 0.838521 0.917296 0.186611 0.876141 NaN NaN 1.505145 1.888993
SNP ALL 37997 34481 3516 36861 306 2074 33 13 0.907466 0.991204 0.056265 0.947488 2.611269 2.565915 1.555780 1.621727
SNP PASS 37997 34481 3516 36861 306 2074 33 13 0.907466 0.991204 0.056265 0.947488 2.611269 2.5659
***** TODO Refaire : HiSeq4000 + agilent sureselect + génome "prêt à l'emploi"
SCHEDULED: <2023-06-25 Sun>
**** TODO HG004 :hg38:
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run main.nf -profile standard,helios --input /Work/Groups/bisonex/data/giab/GRCh38/HG004_{1,2}.fq.gz -bg
#+end_src
vcfeval
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
6.000 36938 36678 421 4040 0.9887 0.9014 0.9430
None 36942 36682 432 4036 0.9884 0.9015 0.9429
happy
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 2787 2388 399 3183 195 580 53 38 0.856835 0.925086 0.182218 0.889654 NaN NaN 1.507834 1.848649
INDEL PASS 2787 2388 399 3183 195 580 53 38 0.856835 0.925086 0.182218 0.889654 NaN NaN 1.507834 1.848649
SNP ALL 38185 34560 3625 36921 254 2107 46 7 0.905067 0.992704 0.057068 0.946862 2.589175 2.553546 1.632595 1.653534
SNP PASS 38185 34560 3625 36921 254 2107 46 7 0.905067 0.992704 0.057068 0.946862 2.589175 2.553546 1.632595 1.653534
**** DONE Résumer résultats pour Paul + article :resultats:hg38:
CLOSED: [2023-04-06 Thu 21:41] SCHEDULED: <2023-04-02 Sun>
**** DONE Plot : ashkenazim trio :hg38:
CLOSED: [2023-04-18 Tue 21:27] SCHEDULED: <2023-04-16 Sun>
/Entered on/ [2023-04-16 Sun 17:29]
**** TODO Refaire : HiSeq4000 + agilent sureselect + génome "prêt à l'emploi"
SCHEDULED: <2023-06-25 Sun>
**** TODO NA12878 :na12878:T2T:
SCHEDULED: <2023-06-15 Thu>
*** KILL Platinum genome
CLOSED: [2023-06-14 Wed 22:37]
https://emea.illumina.com/platinumgenomes.html
*** TODO Séquencer NA12878
Discussion avec Paul : sous-traitant ne nous donnera pas les données, il faut commander l'ADN
** TODO Insilico :centogene:
*** TODO tous les variants centogène
**** DONE Extraire liste des SNVs
CLOSED: [2023-04-22 Sat 17:32] SCHEDULED: <2023-04-17 Mon>
***** DONE Corriger manquant à la main
CLOSED: [2023-04-22 Sat 17:31]
La sortie est sauvegardé dans git-annex : variants_success.csv
***** DONE Automatique
CLOSED: [2023-04-22 Sat 17:31]
**** DONE Convert SNVs : transcript -> génomique
CLOSED: [2023-06-03 Sat 17:16]
***** DONE Variant_recoder
CLOSED: [2023-04-26 Wed 21:21] SCHEDULED: <2023-04-22 Sat>
****** KILL Haskell: 160 manquant : recoded-success.csv
CLOSED: [2023-04-25 Tue 18:32]
La liste des variants a été générée en Haskel l et nettoyée à la main.
On générer une liste de variant pour variant_rec oder et on soumet tout d'un coup.
[[file:~/recherche/bisonex/parsevariants/app/Main.hs][parsevariant]]
#+begin_src haskell
recodeVariant = do
prepareVariantRecod er "variant_success.csv" "renamed.csv"
runVariantRecoder "renamed.csv" "recoded.json"
#+end_src
#+RESULTS:
: <interactive>:4:3-19: error:
: Variable not in scope: runVariantRecoder :: String -> String -> t
: gh
Problème : 160 n'ont pas pu être lu sur 820, probablement à cause du numéro mineur de transcrit
La sortie est sauvegardé dans git-annex : variants-recoded-raw.json.
****** KILL Julia
CLOSED: [2023-04-25 Tue 18:32]
On regénère la liste de variant et on passe à Julia pour préparer l'appel en parallèle à variant recoder
[[file:~/recherche/bisonex/parsevariants/variantRecoder.jl][variantRecoder.jl]]
#+begin_src julia
setupVariantRecoder(unique(init), n)
#+end_src
Puis
#+begin_src sh
parallel -a parallel-recoder.sh --jobs 10
#+end_src
On récupère les résultats
#+begin_src julia
(fails, success) = mergeVariantRecoder(n)
CSV.write(fSuccess, success)
CSV.write(fFailures, fails)
#+end_src
Certains variants ne sont pas trouvé, donc on prépare un nouveau job en enlevant les versionrs mineures des transcrits
#+begin_src julia
# Cleanup json and txt
if isfile(fSuccess) && isfile(fFailures)
foreach(rm, variantRecoderInput())
foreach(rm, variantRecoderOutput())
end
redoFails(fFailures)
#+end_src
Puis
#+begin_src sh
parallel -a parallel-recoder.sh --jobs 3
#+end_src
Il manque encore 70 transcrits
***** DONE Julia avec mobidetails: recode-failures-mobidetails.csv
CLOSED: [2023-04-25 Tue 18:58]
Nouvelle stratégie : on essaie une fois variant recoder.
Pour tous les échecs, on utilise mobidetails (~170).
Si l'ID n'est pas trouvé, on incrémente le numéro de version 2 fois
***** DONE Reste une dizaine à corriger à la main
CLOSED: [2023-04-26 Wed 21:21]
- [X] certains transcrits ont juste été supprimé
- [X] Erreur de parsing, manque souvent un -
#+begin_src julia
lastTryMobidetails("recoded-failures-mobidetails.csv")
#+end_src
***** DONE Fusionner données
CLOSED: [2023-04-26 Wed 22:35]
#+begin_src julia
function mergeAllGenomic()
dNew = mergeAll("recoded-success.csv",
"recoded-failures-mobidetails.csv",
"recoded-failures-mobidetails-redo.csv")
dInit = @chain DataFrame(CSV.File("variant_success.csv")) begin
@transform :transcript = :transcript .* ":" .* :coding .* :codingPos .* :codingChange
@select :file :transcript :classification :zygosity
@rename :classificationCentogene = :classification
end
dTmp = outerjoin(dInit, dNew, on = :transcript)
CSV.write("variant_genomic.csv", dTmp)
end
fSuccess = "recoded-success.csv"
fFailures = "recoded-failures.csv"
# variantRecoder(fSuccess, fFailures)
# mobidetailsOnFailures(fFailures)
# lastTryMobidetails("recoded-failures-mobidetails.csv")
mergeAllGenomic()
#+end_src
***** DONE Formatter donner pour simuscop
CLOSED: [2023-04-28 Fri 11:55] SCHEDULED: <2023-04-26 Wed>
**** TODO Extraire liste des CNVs
SCHEDULED: <2023-04-17 Mon>
**** TODO Simuscop :simuscop:
***** DONE Entrainer le modèle sur 63003856/
CLOSED: [2023-04-29 Sat 19:56]
Relancer le modèle pour être sûr
***** DONE Générer fastq avec simuscop (del et ins seulement) 20x
CLOSED: [2023-04-28 Fri 23:35] SCHEDULED: <2023-04-22 Sat>
****** DONE Génerer un profile avec bed de centogène
CLOSED: [2023-04-28 Fri 11:54] SCHEDULED: <2023-04-22 Sat>
NA12878 mais à refaire avec un vrai séquencage
Voir [[*Centogène][Bed Centogène]] pour choix
****** DONE Générer les données en 20x
CLOSED: [2023-04-28 Fri 11:54] SCHEDULED: <2023-04-22 Sat>
capture de centogene
****** DONE Regénérer en supprimant les doublons
CLOSED: [2023-04-28 Fri 17:28]
***** DONE Quelle couverture ?
CLOSED: [2023-04-29 Sat 18:26]
ex sur chr11:16,014,966 où on a 11 reads dans la simulation contre 200 !
****** 200 est la plus proche
#+attr_html: :width 500px
[[./simuscop-200-chr1-1.png]]
#+attr_html: :width 500px
[[./simuscop-200-chr1-2.png]]
****** DONE 20x
CLOSED: [2023-04-29 Sat 15:38]
****** DONE 50x
CLOSED: [2023-04-29 Sat 15:38]
****** DONE 100x
CLOSED: [2023-04-29 Sat 15:39]
****** DONE 200x
CLOSED: [2023-04-29 Sat 15:39]
***** DONE Reads mal centrés sur des petits exons seuls
CLOSED: [2023-04-29 Sat 19:56] SCHEDULED: <2023-04-29 Sat>
Capture ok : [[https://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr1%3A153817168%2D153817824&hgsid=296556270_F4fkENLPXHXidi2oALXls2jxNH9l][UCSC]] (track noire)
Mais mauvaise répartitiopn
#+attr_html: :width 800px
[[./simuscop-error.png]]
À tester
- Problème de profile ?
- mauvais patient ?
- mauvaise génération ? -> comparer avec ceux donnés sur github
- nom des chromosomes ?
****** DONE [#A] Tester sur exon 6 GATAD2B pour NC_000001.11:g.153817496A>T
CLOSED: [2023-04-29 Sat 19:56] SCHEDULED: <2023-04-29 Sat>
******* DONE Configuration + Profile 63003856.profile: idem, mal centré
CLOSED: [2023-04-29 Sat 19:18]
Téléchargement des données
#+begin_src sh :dir ~/code/bisonex/test-simuscop
scp meso:/Work/Projects/bisonex/data/genome/GRCh38.p14/genomeRef.fna .
scp meso:Work/Projects/bisonex/data/simuscop/*.profile .
scp -r meso:/Work/Projects/bisonex/data/genome/GRCh38.p13/bwa .
#+end_src
On récupère l'exon (NB: org-mode ne lance pas le code...)
#+begin_src julia
using CSV,DataFramesMeta
d = CSV.read("VCGS_Exome_Covered_Targets_hg38_40.1MB_renamed.bed", header=false, delim="\t", DataFrame)
@subset d :Column1 .== "NC_000001.11" :Column2 .<= 153817496 :Column3 .>= 153817496
#+end_src
NC_000001.11 153817371 153817542
Génération du bed
#+begin_src sh :dir ~/code/bisonex/test-simuscop
echo -e "NC_000001.11\t153817371\t153817542" > gatad2b-exon6.bed
#+end_src
#+RESULTS:
Génération d'un variant
#+begin_src sh :dir ~/code/bisonex/test-simuscop
echo -e "s\tsingle\tNC_000001.11\t153817496\tA\tT\thet"> variant.txt
#+end_src
#+RESULT