***** DONE Download NA12878 (HG001)
CLOSED: [2023-02-17 Fri 19:29]
***** TODO Download NA12878 (HG001)
****** TODO Fastq HiSeq
We use the HiSeq data, which is probably what Centogene uses:
https://github.com/genome-in-a-bottle/giab_data_indexes/blob/master/NA12878/sequence.index.NA12878_Illumina_HiSeq_Exome_Garvan_trimmed_fastq_09252015
We use the "trimmed" data (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1069-7), i.e. with the fragments shorter than the read length removed.
#+begin_src
for j in 1 2; do
for i in 1 2; do wget "ftp://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/NIST7035_TAAGGCGA_L00${j}_R${i}_001_trimmed.fastq.gz"; done; done
#+end_src
The 4 FASTQ files must be concatenated into 2 (one per read direction, merging both lanes):
#+begin_src
zcat NIST7035_TAAGGCGA_L00*_R1_* | gzip -c > NIST7035_TAAGGCGA_R1_001_trimmed.fastq.gz
zcat NIST7035_TAAGGCGA_L00*_R2_* | gzip -c > NIST7035_TAAGGCGA_R2_001_trimmed.fastq.gz
#+end_src
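A quick sanity check after the concatenation: R1 and R2 should contain the same number of reads. A minimal sketch, assuming standard 4-line FASTQ records; =count_reads= and =check_pairs= are helper names introduced here, not part of the pipeline.

#+begin_src shell
# Count the reads in a gzipped FASTQ file (4 lines per record).
count_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Compare the read counts of an R1/R2 pair; fail if they differ.
check_pairs() {
    r1=$(count_reads "$1")
    r2=$(count_reads "$2")
    if [ "$r1" = "$r2" ]; then
        echo "OK: $r1 read pairs"
    else
        echo "MISMATCH: R1=$r1 R2=$r2" >&2
        return 1
    fi
}
#+end_src

Usage, once both concatenated files exist: =check_pairs NIST7035_TAAGGCGA_R1_001_trimmed.fastq.gz NIST7035_TAAGGCGA_R2_001_trimmed.fastq.gz=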
NB: list of Illumina sequencing platforms: https://www.illumina.com/systems/sequencing-platforms.html
HiSeq is later than the NextSeq 550
****** TODO Exons (bed)
****** DONE Bed, vcf
CLOSED: [2023-02-24 Fri 23:45]
******* TODO Filter reference intronic variants with VEP
******** TODO Variant calling alone + only -f: many FPs
******* KILL Filter reference intronic variants with VEP
CLOSED: [2023-02-24 Fri 23:44]
******** KILL Variant calling alone + only -f: many FPs
CLOSED: [2023-02-24 Fri 23:44]
******* Variant calling alone: best score so far
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision |
| INDEL | ALL | 76 | 43 | 33 | 82 | 18 | 20 | 3 | 5 | 0.565789 | 0.709677 |
| INDEL | PASS | 76 | 43 | 33 | 82 | 18 | 20 | 3 | 5 | 0.565789 | 0.709677 |
| SNP | ALL | 582 | 448 | 134 | 530 | 25 | 57 | 6 | 1 | 0.769759 | 0.947146 |
| SNP | PASS | 582 | 448 | 134 | 530 | 25 | 57 | 6 | 1 | 0.769759 | 0.947146 |
******* Variant calling alone: best score so far (SNP: 77% recall, 95% precision)
cd /Work/Users/apraga/bisonex/work/3a/ebf0249db81166c5b12f03cc8167b2
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision |
| INDEL | ALL | 76 | 43 | 33 | 82 | 18 | 20 | 3 | 5 | 0.565789 | 0.709677 |
| INDEL | PASS | 76 | 43 | 33 | 82 | 18 | 20 | 3 | 5 | 0.565789 | 0.709677 |
| SNP | ALL | 582 | 448 | 134 | 530 | 25 | 57 | 6 | 1 | 0.769759 | 0.947146 |
| SNP | PASS | 582 | 448 | 134 | 530 | 25 | 57 | 6 | 1 | 0.769759 | 0.947146 |
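The two metric columns can be recomputed from the counts in each row. A minimal sketch, assuming recall = TRUTH.TP / (TRUTH.TP + TRUTH.FN) and precision computed on the query calls after excluding the unknown ones (QUERY.UNK); this reproduces the values in the table, but the helper =metrics= is only an illustration, not part of hap.py.

#+begin_src shell
# Recompute recall and precision from the counts of one table row.
# Arguments: TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK
metrics() {
    awk -v tp="$1" -v fneg="$2" -v qtot="$3" -v fpos="$4" -v unk="$5" 'BEGIN {
        recall = tp / (tp + fneg)
        # Unknown calls (QUERY.UNK) are excluded from the denominator.
        precision = (qtot - unk - fpos) / (qtot - unk)
        printf "recall=%.6f precision=%.6f\n", recall, precision
    }'
}

metrics 448 134 530 25 57   # SNP row:   recall=0.769759 precision=0.947146
metrics 43 33 82 18 20      # INDEL row: recall=0.565789 precision=0.709677
#+end_src

For indels, the precision numerator is 82 - 20 - 18 = 44 rather than TRUTH.TP = 43: true positives are counted separately on the truth side and on the query side.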
******** NC_000021.9 14109044:
The variant is present in the BAM filtered on chromosome 21 but not in the VCF...
We run HaplotypeCaller without --max-mnp-distance:
#+begin_src slurm
#!/bin/bash
#SBATCH -c 4
#SBATCH -p smp
#SBATCH --time=08:00:00
#SBATCH --mem=32G
module load nix/2.11.0
# Input and output paths
dir=/Work/Groups/bisonex/data/giab/
genomeRef=/Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna
dbsnpDir=/Work/Projects/bisonex/data-alexis-reference/dbSNP
bam=/Work/Users/apraga/bisonex/script/files/bam/NA12878_NIST7035_recalibrated_hg38.bam
vcf=/Work/Users/apraga/bisonex/script/files/vcf/NA12878_NIST7035_nodistance.vcf
# --max-mnp-distance is deliberately not set (default behaviour)
gatk --java-options "-Xmx32g" HaplotypeCaller \
    -R "$genomeRef" \
    -I "$bam" \
    -O "$vcf" \
    -D "$dbsnpDir"/GCF_000001405.39.gz
#+end_src
Same result.
On the whole BAM, the variant is indeed found as well...
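To check whether a position ends up in a VCF, the records can be filtered on contig and position. A minimal sketch for an uncompressed VCF; =vcf_near= is a helper name introduced here, and the optional window is an assumption meant to catch records (deletions, MNPs) that start a few bases upstream and may span the position.

#+begin_src shell
# Print the VCF records on contig $2 whose POS lies in [$3 - window, $3].
# Arguments: vcf_file contig position [window, default 0]
vcf_near() {
    awk -F'\t' -v chrom="$2" -v pos="$3" -v win="${4:-0}" \
        '!/^#/ && $1 == chrom && $2 >= pos - win && $2 <= pos' "$1"
}
#+end_src

Usage: =vcf_near "$vcf" NC_000021.9 14109044 10= lists the calls starting at most 10 bases before the position; empty output means that no record starts in that window.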