* Configuration
:PROPERTIES:
:CUSTOM_ID: configuration
:END:
** Choix du kit de capture
:PROPERTIES:
:CUSTOM_ID: choix-du-kit-de-capture
:END:
Dans les compte-rendu, "Twist Human Core Exome Plus" On utilise la
version en ligne
https://www.twistbioscience.com/resources/data-files/ngs-human-core-exome-panel-bed-file
, qui n'est pas "plus"

Le problème est que cela ne couvre pas un variant rendu et confirmé en
Sanger... Plusieurs possibilités - la version Plus n'est pas disponible
en ligne mais la contient -> possible - le kit a été mis à jour (2021 ?)
-> possible - c'est un problème lié à hg38 -> non, idem en hg19 (et il a
l'air lifté du hg38)

En comparant avec le BAM, il y a clairement des endroits manquants et ce
n'est pas le meilleur... En terme de couverture, oui.

*** Vérification de la qualité
:PROPERTIES:
:CUSTOM_ID: vérification-de-la-qualité
:END:
**** Couverture
:PROPERTIES:
:CUSTOM_ID: couverture
:END:
Source:
https://www.twistbioscience.com/resources/technical-resources?twist-pick=-1&products=-1&sub-products=-1&application=411&year=-1

- hg38_exome_comp_spikein_v2.0.2_targets_sorted.re_annotated.bed
- hg38_exome_v2.0.2_targets_sorted_validated.re_annotated.bed
- Twist_Comprehensive_Exome_Covered_Targets_hg38.bed
- Twist_Exome_Core_Covered_Targets_hg38.bed
- Twist_Exome_RefSeq_targets_hg38.bed

#+begin_src slurm
#!/bin/bash -l
# Fichier submission.SBATCH

#SBATCH --job-name="mosdepth"
#SBATCH --output=%x.%J.out   ## %x=job name, %J=job id
#SBATCH --error=%x.%J.out
 # walltime (hh:mm::ss) max is 8 days
#SBATCH -t 4:00:00
#SBATCH --partition=smp
#SBATCH -c 4  ## request 16 cores (MAX is 32)
#SBATCH --mem=12G ## (MAX is 96G)
#SBATCH --mail-user=apraga@chu-besancon.fr
#SBATCH --mail-type=END,FAIL   # notify when job end/fail

module load nix/2.11.0
# JULIA_HISTORY=.julia julia mosdepth.jl

bam=out/preprocessing/recalibrated/2300346867_63118093-NA12878-GRCh38/2300346867_63118093-NA12878-GRCh38.recal.bam

mkdir test-capture
mosdepth -b capture/hg38_exome_comp_spikein_v2.0.2_targets_sorted.re_annotated.bed test-capture/na12878-hg38_exome_comp_spikein_v2.0 $bam
mosdepth -b capture/hg38_exome_v2.0.2_targets_sorted_validated.re_annotated.bed test-capture/na12878-hg38_exome_v2.0.2 $bam
mosdepth -b capture/Twist_Comprehensive_Exome_Covered_Targets_hg38.bed test-capture/na12878-Twist_Comprehensive_Exome $bam
mosdepth -b capture/Twist_Exome_Core_Covered_Targets_hg38.bed test-capture/na12878-Twist_Exome_Core $bam
mosdepth -b capture/Twist_Exome_RefSeq_targets_hg38.bed test-capture/na12878-Twist_Exome $bam
#+end_src

Résultat:

Sample Name ≥ 30X Median Mean Cov. na12878-Twist_Comprehensive_Exome
96.0% 66.0X 69.4X na12878-Twist_Exome 96.0% 66.0X 69.3X
na12878-Twist_Exome_Core 97.0% 66.0X 69.4X
na12878-hg38_exome_comp_spikein_v2.0 94.0% 65.0X 68.2X
na12878-hg38_exome_v2.0.2 95.0% 65.0X 67.4X

Twist Exome Core a les meilleurs résultats mais il y a des zones qui
devraient être couvertes mais qui ne le sont pas.

*** Par rapport au bam
:PROPERTIES:
:CUSTOM_ID: par-rapport-au-bam
:END:
On utilise [[https://github.com/telatin/bamtocov][bamtocov]] pour
générer un fichier de captuer à partir du bam. (NB: covtobam ne fait pas
l'union non plus des intervalles...) Mais cela donne les couvertures par
position alors que covtobed donne l'union des positions >= seuil
(pourtant bamtocov est plus récent). On prend le second

#+begin_src sh
conda create --name bamtocov-env
conda activate bamtocov-env
conda install -y -c bioconda bamtocov

bamtocov 2300346867_63118093-NA12878-GRCh38.recal.bam | awk '$4 >= 30' > coverage_gt30.bed
#+end_src

On fusionne les intervalles consécutifs

#+begin_src sh
bedtools merge -i coverage_gt30.bed > coverage_gt30_merged.bed
#+end_src

On regarde s'il existe des régions dans notre BAM qui ne sont pas dans
les différents kit de capture: option -v de bedtools. En calculant le
nombre d'intervalles absents du kit et la taille totale des absences:

#+begin_src nu
[Twist_Comprehensive_Exome_Covered_Targets_hg38.bed Twist_Exome_Core_Covered_Targets_hg38.bed Twist_Exome_RefSeq_ta
rgets_hg38.bed hg38_exome_comp_spikein_v2.0.2_targets_sorted.re_annotated.bed hg38_exome_v2.0.2_targets_sorted_valida
ted.re_annotated.bed] | each {|e| print $e; bedtools intersect -v -a coverage_gt30_merged.bed -b $e | awk  'BEGIN{SUM
=0}{SUM+=$3-$2}END{print SUM}' }

Twist_Comprehensive_Exome_Covered_Targets_hg38.bed
9817186
Twist_Exome_Core_Covered_Targets_hg38.bed
15407667
Twist_Exome_RefSeq_targets_hg38.bed
9715995
hg38_exome_comp_spikein_v2.0.2_targets_sorted.re_annotated.bed
9479802
hg38_exome_v2.0.2_targets_sorted_validated.re_annotated.bed
10820313
#+end_src

#+begin_src nu
[Twist_Comprehensive_Exome_Covered_Targets_hg38.bed Twist_Exome_Core_Covered_Targets_hg38.bed Twist_Exome_RefSeq_ta
rgets_hg38.bed hg38_exome_comp_spikein_v2.0.2_targets_sorted.re_annotated.bed hg38_exome_v2.0.2_targets_sorted_valida
ted.re_annotated.bed] | each {|e| print $e; bedtools intersect -v -a coverage_gt30_merged.bed -b $e | wc -l }
Twist_Comprehensive_Exome_Covered_Targets_hg38.bed
95582
Twist_Exome_Core_Covered_Targets_hg38.bed
114651
Twist_Exome_RefSeq_targets_hg38.bed
95271
hg38_exome_comp_spikein_v2.0.2_targets_sorted.re_annotated.bed
93839
hg38_exome_v2.0.2_targets_sorted_validated.re_annotated.bed
97634
#+end_src

Selon ces mesures, le meilleur est donc Twist Exome 2.0 plus
Comprehensive Exome Spike-in...