KXGS6Y7FYRAG6W5UCCNQ44ESWXXU4FQO6KQY6E2TUKLRJXVJJP6QC
4RX6SVVGVS3LKLZGNTWY5S6DOJGS5AJWZNDM3D3SEWXXLZG6PTAQC
TCHORBND7SM4UP4YRQXPB3LG6WEBC6LLED4QCWPKFSF5VKIIXU5QC
X77YO7PTCUVY2ISDKAFXM5W2NE6WJUGPJMTVY2FZNMKUXGE5Q5HQC
VK6K2ZKRI4XTJKHSJKHC3F4F447JCPEVHX4TQBJJPVUIB3WURIOAC
FXA3ZBV64FML7W47IPHTAJFJHN3J3XHVHFVNYED47XFSBIGMBKRQC
RWZ457JG7GN4JJKPRNLB22OHOGYSKR756OQUWLJDAEXRCHFBK42AC
QZSQEAJE6R2V5ABFIHYBTSDPEY2OW7RN5SQ7QXTVKB4T6PESENZQC
MTCD3SJJNAL53NRAVPZYWDHH2JGMAMF6KHX5M6WRZ3KDNSKWTGUQC
GRF2ZGQU3R4G5ZI4BSFSHZW6D7ZSALA3F3OWP4RRRPV4YZOODFGQC
7SXNKSNEXIANSPPAFEK6SS3JXEJIWXLBYLTOHNSF3GO7WJNEQTTAC
2HYO3KBJWTOQFX2B3H4NHADODGNYOIDVINAUHC7JBTEKZMGE2J3AC
JQUKAFWIHMUVF73PZMBQLSGQYWEJWZ2VK55P7JXPNH45MKEFO5UAC
3GGCMWGQAPZ2O7C3ZEOE34WSCUIYYQFGIY7U55VGCUPUCOUQ6L4QC
N5WHBFPQS5XZRXGLKUJTHN752TUA2SGAUSLR3LLP3DSLLEO3A5WQC
CWUL3YKSB42I6N7FAZTM5AFF32TELNTIMC3VY2QJXPGQYG5Y72DAC
2V5FSCDK24GEKDAEIOM3IFRQYQC2TMJXABXS4WF4F23XCN4GUNUQC
DZ6GQN2ERJAZG3PWZC34EN32YZOAPVFNGVCMJQNMFZVK4TR57UTAC
ZW5JL2RSM3UIBU4DDWMALMRDXGDWGOBW4YFUISRJY3HDEXVXYB7AC
MG6QNY2AZM3YPWL2PTIQ6NNWFFO6BRKWL3MLYJOSOAMNW6HELHEQC
ZBQKEHTAU5FNORD2IBL2K3KWBXETUVBYNVREGQWPAIAV6BYRATUAC
CITWHSGYZUZTEAS2OR566P67LCIOGQ5LDNCZNNF7PV567U5QR5BAC
2ZULESGU4H6GRQUHINYMYXW2FR33QNNA5OWBCCRAD26CX7WHT57QC
RHWQQAAHNHFO3FLCGVB3SIDKNOUFJGZTDNN57IQVBMXXCWX74MKAC
iant annotation
Besoin de vep
*** TODO Variant calling
* Amélioration :amelioration:
* Documentation :doc:
** DONE Procédure d'installat
ion nix + dependences pour VM CHU
CLOSED: [2023-04-22 Sat 1
5:27] SCHEDULED: <2023-04-13 Thu>
* Manuscript :manuscript:
* Tests :tests:
** WAIT Non régression : version prod
*** DONE ID common snp
CLOSED: [2022-11-19 Sat 21:36]
#+begin_src
$ wc -l ID_of_common_snp.txt
23194290 ID_of_common_snp.txt
$ wc -l /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp.txt
23194290 /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp.txt
#+end_src
*** DONE ID common snp not clinvar patho
CLOSED: [2022-12-11 Sun 20:11]
**** DONE Vérification du problème
CLOSED: [2022-12-11 Sun 16:30]
Sur le J:
21155134 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt.ref
Version de "non-régression"
21155076 database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt
Nouvelle version
23193391 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt
Si on enlève les doublons
$ sort database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt | uniq > old.txt
$ wc -l old.txt
21107097 old.txt
$ sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt | uniq > new.txt
$ wc -l new.txt
21174578 new.txt
$ sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt.ref | uniq > ref.txt
$ wc -l ref.txt
21107155 ref.txt
Si on regarde la différence
comm -23 ref.txt old.txt
rs1052692
rs105751
8973
rs1057518973
rs11074121
rs112848754
rs12573787
rs145033890
rs147889095
rs1553904159
rs1560294695
rs1560296615
rs1560310926
rs1560325547
rs1560342418
rs1560356225
rs1578287542
...
On cherche le premier
bcftools query -i 'ID="rs1052692"' database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM %POS %R
EF %ALT\n'
NC_000019.10 1619351 C A,T
Il est bien patho...
$ bcftools query -i 'POS=1619351' database/clinvar/clinvar.vcf.gz -f '%CHROM %POS %REF %ALT %INFO/CLNSIG\n'
19 1619351 C T Conflicting_interpretations_of_pathogenicity
On vérifie pour tous les autres
$ comm -23 ref.txt ol
d.txt > tocheck.txt
On génère les régions à vérifier (chromosome number:position)
$ bcftools query -i 'ID=@tocheck.txt' database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM\t%POS\n' > tocheck.pos
On génère le mapping inverse (chromosome number -> NC)
$ awk ' { t = $1; $1 = $2; $2 = t; print; } ' database/RefSeq/refseq_to_number_only_consensual.txt > mapping.txt
On remap clinvar
$ bcftools annotate --rename-chrs mapping.txt database/clinvar/clinvar.vcf.gz -o clinvar_remapped.vcf.gz
$ tabix clinvar_remapped.vcf.gz
Enfin, on cherche dans clinvar la classification
$ bcftools query -R tocheck.pos clinvar_remapped.vcf.gz -f '%CHROM %POS %INFO/CLNSIG\n'
$ bcftools query -R tocheck.pos database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM %POS %ID \n' | grep '^NC'
#+RESULTS:
**** DONE Comprendre pourquoi la nouvelle version donne un résultat différent
CLOSED: [2022-12-11 Sun 20:11]
***** DONE Même version dbsnp et clinvar ?
CLOSED: [2022-12-10 Sat 23:02]
Clinvar différent !
$ bcftools stats clinvar.gz
clinvar (Alexis)
SN 0 number of samples: 0
SN 0 number of records: 1492828
SN 0 number of no-ALTs: 965
SN 0 number of
SNPs: 1338007
SN 0 number of MNPs: 5562
SN 0 number of indels: 144580
SN 0 number of others: 3714
SN 0 number of multiallelic sites: 0
SN 0 number of multiallelic SNP sites: 0
clinvar (new)
SN 0 number of samples: 0
SN 0 number of records: 1493470
SN 0 number of no-ALTs: 965
SN 0 number of SNPs: 1338561
SN 0 number of MNPs: 5565
SN 0 number of indels: 144663
SN 0 number of others: 3716
SN 0 number of multiallelic sites: 0
SN 0 number of multiallelic SNP sites: 0
***** DONE Mettre à jour clinvar et dbnSNP pour travailler sur les mêm bases
CLOSED: [2022-12-11 Sun 12:10]
Problème persiste
***** DO
NE Supprimer la conversion en int du chromosome
CLOSED: [2022-12-10 Sat 19:29]
***** KILL Même NC ?
CLOSED: [2022-12-10 Sat 19:29]
$ zgrep "contig=<ID=NC_\(.*\)" clinvar/GRCh38/clinvar.vcf.gz > contig.clinvar
$ diff contig.txt contig.clinvar
< ##contig=<ID=NC_012920.1>
***** DONE Tester sur chromosome 19: ok
CLOSED: [2022-12-11 Sun 13:53]
On prépare les données
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
bcftools filter -i 'CHROM="NC_000019.10"'
/Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz -o dbSNP_common_19.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz -o clinvar_19.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data-alexis/dbSNP/dbSNP_common.vcf.gz -o dbSNP_common_19_old.vcf.gz
bcftools filter -i 'CHROM="19"' /Work/Groups/bisonex/data-alexis/clinvar/clinvar.vcf.gz -o clinvar_19_old.vcf.gz
#+end_src
On récupère les 2 versions du script
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
git checkout regression ../../script/pythonScript/clinvar_sbSNP.py
cp ../../script/pythonScript/clinvar_sbSNP.py clinvar_sbSNP_old.py
git checkout HEAD ../../script/pythonScript/clinvar_sbSNP.py
#+end_src
#+RESULTS:
On compare
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
python ../../script/pythonScript/clinvar_sbSNP.py clinvar_sbSNP.py --clinvar clinvar_19.vcf.gz --dbSNP dbSNP_common_19.vcf.gz --output tmp.txt
sort tmp.txt | uniq > new.txt
table=/Work/Groups/bisonex/data-alexis/RefSeq/refseq_to_number_only_consensual.txt
python clinvar_sbSNP_old.py --clinvar clinvar_19_old.vcf.gz --dbSNP dbSNP_common_19_old.vcf.gz --output tmp_old.txt --chrm_name_table $table
sort tmp_old.txt | uniq > old.txt
wc -l old.txt new.txt
#+end_src
#+RESULTS:
| 535155 | old.txt |
| 535194 | new.txt |
| 1070349 | total |
Si on prend le premier manquant dans new, il est conflicting patho donc il ne devrait pas y être...
$ bcftools query -i 'ID="rs10418277"' dbSNP
_common_19.vcf.gz -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 54939682 C G,T
$ bcftools query -i 'ID="rs10418277"' dbSNP_common_19_old.vcf.gz -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 54939682 C G,T
$ bcftools query -i 'POS=54939682' clinvar_19.vcf.gz -f '%POS %REF %ALT %INFO/CLNSIG\n'
54939682 C G Conflicting_interpretations_of_pathogenicity
54939682 C T Benign
$ bcftools query -i 'POS=54939682' clinvar_19_old.vcf.gz -f '%POS %REF %ALT %INFO/CLNSIG\n'
54939682 C G Conflicting_interpretations_of_pathogenicity
54939682 C T Benign
$ grep rs10418277 *.txt
new.txt:rs10418277
tmp.txt:rs10418277
Le problème venait de la POS qui n'était plus convertie en int (suppression de la ligne par erreur ??)
On vérifie
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
python ../../script/pythonScript/clinvar_sbSNP.py --clinvar clinvar_19.vcf.gz --dbSNP dbSNP_common_19.vcf.gz --output tmp.txt
sort tmp.txt | uniq > new.txt
table=/Work/Groups/bisonex/data-alexis/RefSeq/refseq_to_number_only_consensual.txt
python clinvar_sbSNP_old.py --clinvar clinvar_19_old.vcf.gz --dbSNP dbSNP_common_19_old.vcf.gz --output tmp_old.txt --chrm_name_table $table
sort tmp_old.txt | uniq > old.txt
wc -l old.txt new.txt
diff old.txt new.txt
#+end_src
#+RESULTS:
| 535155 | old.txt |
| 535155 | new.txt |
| 1070310 | total |
***** DONE Tester sur chromosome 19 et 20: ok
CLOSED: [2022-12-11 Sun 15:56]
On prépare les données
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
bcftools filter -i 'CHROM="NC_000019.10" | CHROM="NC_000020.11"' /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz -o dbSNP_common_19_20.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10" | CHROM="NC_000020.11"' /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz -o clinvar_19_20.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10" | CHROM="NC_000020.11"' /Work/Groups/bisonex/data-alexis/dbSNP/dbSNP_common.vcf.gz -o dbSNP_common_19_20_old.vcf.gz
bcftools filter -i 'CHROM="19" | CHROM="20"' /Work/Groups/bisonex/data-alexis/clinvar/clinvar.vcf.gz -o clinvar_19_20_old.vcf.gz
#+end_src
#+RESULTS:
On récupère les 2 versions du script
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
git checkout regression ../../script/pythonScript/clinvar_sbSNP.py
cp ../../script/pythonScript/clinvar_sbSNP.py clinvar_sbSNP_old.py
git checkout HEAD ../../script/pythonScript/clinvar_sbSNP.py
#+end_src
#+RESULTS:
On compare
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
python ../../script/pythonScript/clinvar_sbSNP.py clinvar_sbSNP.py --clinvar clinvar_19_20.vcf.gz --dbSNP dbSNP_common_19_20.vcf.gz --output tmp.txt
sort tmp.txt | uniq > new.txt
table=/Work/Groups/bisonex/data-alexis/RefSeq/refseq_to_number_only_consensual.txt
python clinvar_sbSNP_old.py --clinvar clinvar_19_20_old.vcf.gz --dbSNP dbSNP_common_19_20_old.vcf.gz --output tmp_old.txt --chrm_name_table $table
sort tmp_old.txt | uniq > old.txt
wc -l old.txt new.txt
#+end_src
***** DONE Regarder la répartition des différences
CLOSED: [2022-12-11 Sun 16:29]
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt | uniq > notpatho.new
sort /Work/Groups/bisonex/data-alexis/dbSNP/ID_of_common_snp_not_clinvar_patho.txt | uniq > notpatho.old
comm -23 notpatho.new notpatho.old > nopatho.diff
#+end_src
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
bcftools query -i 'ID=@nopatho.diff' /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz -f '%CHROM\n' | sort | uniq -c
#+end_src
On a principalement des coordonnées non consensuelles (non "NC_", voir notes)
#+RESULTS:
: 2 NC_000002.12
: 18 NC_000003.12
: 2 NC_000004.12
: 2 NC_000005.10
: 14 NC_000006.12
: 6 NC_000007.14
: 2 NC_000009.12
: 1 NC_000010.11
: 6 NC_000014.9
: 1 NC_000015.10
: 3 NC_000016.10
: 3 NC_000017.11
: 1 NC_000019.10
: 1 NC_000020.11
: 1 NC_000021.9
: 2 NC_000022.11
: 16018 NT_113793.3
: 17010 NT_113796.3
: 14 NT_113891.3
: 1 NT_167244.2
: 13 NT_167245.2
: 2 NT_167246.2
: 13 NT_167247.2
: 7 NT_167248.2
: 14 NT_167249.2
: 14857 NT_187361.1
: 92 NT_187367.1
: 1 NT_187369.1
: 13 NT_187381.1
: 54 NT_187383.1
: 6 NT_187499.1
: 46 NT_187502.1
: 13754 NT_187513.1
: 611 NT_187517.1
: 1 NT_187520.1
: 1 NT_187524.1
: 249 NT_187526.1
: 18 NT_187532.1
: 1 NT_187546.1
: 886 NT_187562.1
: 1 NT_187564.1
: 346 NT_187576.1
: 13 NT_187600.1
: 5 NT_187601.1
: 494 NT_187606.1
: 1 NT_187607.1
: 12 NT_187613.1
: 307 NT_187614.1
: 1 NT_187625.1
: 445 NT_187633.1
: 43 NT_187648.1
: 18 NT_187649.1
: 1 NT_187652.1
: 512 NT_187661.1
: 18 NT_187678.1
: 49 NT_187681.1
: 1 NT_187682.1
: 18 NT_187688.1
: 12 NT_187689.1
: 18 NT_187690.1
: 18 NT_187691.1
: 404 NT_187693.1
: 2 NW_003315952.3
: 1 NW_003315970.2
: 203 NW_003571054.1
: 322 NW_003571055.2
: 16 NW_003571056.2
: 16 NW_003571057.2
: 16 NW_003571058.2
: 16 NW_003571059.2
: 16 NW_003571060.1
: 213 NW_003571061.2
: 2 NW_009646201.1
: 322 NW_009646205.1
: 321 NW_009646206.1
: 371 NW_012132914.1
:
1 NW_012132915.1
: 13 NW_012132918.1
: 2 NW_013171801.1
: 1 NW_013171807.1
: 49 NW_015148966.1
: 14 NW_015495298.1
: 2 NW_015495299.1
: 1 NW_016107298.1
: 4 NW_017363813.1
: 2 NW_017852933.1
: 1 NW_018654722.1
: 38 NW_021160001.1
: 1 NW_021160003.1
: 1 NW_021160007.1
: 7 NW_021160017.1
***** DONE Regarder la différence avec la version sans les sites non consensuels: ok !
CLOSED: [2022-12-11 Sun 20:11]
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt | uniq > notpatho.new
sort /Work/Groups/bisonex/data-alexis/dbSNP/ID_of_common_snp_not_clinvar_patho.txt | uniq > notpatho.old
comm -13 notpatho.new notpatho.old > notpatho.diff
wc -l
#+end_src
#+RESULTS:
: 528 notpatho.diff
Il manque 528 variants
rs1057520103
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
bcftools query -i 'ID=@notpatho.diff' /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz -f '%CHROM\n' | sort | uniq -c
#+end_src
#+RESULTS:
: 528 NC_012920.1
Donc la nouvelle version fonctionne mieux !
ON vérifie bien qu'ils sont dans l'ancienne version et la nouvelle:
$ grep -w -f notpatho.diff /Work/Groups/bisonex/data-alexis/dbSNP/ID_of_common_snp_not_clinvar_patho.txt | wc -l
528
$ grep -w -f notpatho.diff /Work/Groups/bisonex/data/d
bSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt
#+end_src
**** DONE Supprimer les sites non consensuels
CLOSED: [2022-12-11 Sun 19:51]
**** DONE Rajouter les mitochondries (vu avec Paul)
CLOSED: [2022-12-13 Tue 17:26]
Ok avec notre version générée. Sur le J: 21155134
$ wc -l dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt
21155065 dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt
$ wc -l ../data-alexis/dbSNP/ID_of_common_snp_not_clinvar_patho.txt
21155065 ../data-alexis/dbSNP/ID_of_common_snp_not_clinvar_patho.txt
La différence vient probablement d'une vieille version de clinvar
**** TODO Comprendre la différence nouvelle version et prod
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
#sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt | uniq > notpatho.new
#sort /Work/Groups/bisonex/data-alexis/dbSNP/ID_of_common_snp_not_clinvar_patho.txt | uniq > notpatho.old
comm -13 notpatho.new notpatho.old > missing-from-old
comm -23 notpatho.new notpatho.old > missing-from-new
wc -l missing-from-old
wc -l missing-from-new
#+end_src
#+RESULTS:
| 75 | missing-from-old |
| 6 | missing-from-new |
Il manque 75 variants et on a 6 en trop
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
bcftools query -i 'ID=@missing-from-old' /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz -f '%CHROM\n' | sort | uniq -c
#+end_src
#+RESULTS:
| 16 | NC_000001.11 |
| 2 | NC_000002.12 |
| 18 | NC_000003.12 |
| 7 | NC_000004.12 |
| 1 | NC_000005.10 |
| 5 | NC_000006.12 |
| 3 | NC_000007.14 |
| 2 | NC_000009.12 |
| 1 | NC_000010.11 |
| 5 | NC_000011.10 |
| 3 | NC_000015.10 |
| 1 | NC_000016.10 |
| 4 | NC_000017.11 |
| 2 | NC_000019.10 |
| 1 | NC_000020.11 |
| 3 | NC_000022.11 |
| 1 | NC_000023.11 |
| 2 | NT_113891.3 |
| 2 | NT_167245.2 |
| 2 | NT_167247.2 |
| 1 | NT_167248.2 |
| 2 | NT_167249.2 |
| 18 | NT_187532.1 |
| 1 | NT_187562.1 |
| 1 | NT_187633.1 |
| 18 | NT_187649.1 |
| 18 | NT_187678.1 |
| 18 | NT_187688.1 |
| 12 | NT_187689.1 |
| 18 | NT_187690.1 |
| 18 | NT_187691.1 |
| 1 | NW_013171807.1 |
Donc la nouvelle version fonctionne mieux !
ON vérifie bien qu'ils sont dans l'ancienne version et la nouvelle:
$ grep -w -f notpatho.diff /Work/Groups/bisonex/data-alexis/dbSNP/ID_of_common_snp_not_clinvar_patho.txt | wc -l
528
$ grep -w -f notpatho.diff /Work/Groups/bisonex/data/d
bSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt
#+end_src
*** DONE Comparer les versions
CLOSED: [2023-01-26 jeu. 17:42]
**** DONE bases de données
CLOSED: [2023-01-26 jeu. 17:42]
***** DONE Genome de référénce:
CLOSED: [2023-01-06 Fri 00:00]
Version calculée
$ find . -type f -name "genomeRef.*" -exec sh -c 'echo {}; sha256sum {}' \;
e0761a7ba5d10de9e7e97fa331667963925531c0199575bcceafbb13c3147e3f ./genomeRef.fna
d121084c35037763ea58c59726545eaa1c11025a7bf2d75634677c72ddb72fd1 ./genomeRef.dict
0a6e215314659929dbcdffc1881714e311e3d149bdc33978a6f5e28206fcc675 ./genomeRef.fna.fai
45a4aa0d8dc1095d090b13e4df180763b2e24d133ec81f026beaead6d41ebafc ./bwa/genomeRef.ann
ac6e465f230da6d9f8339ebdf4cb05bcfc47d03f3a3889cae7256983c7809210 ./bwa/genomeRef.bwt
058ffaf8cd38e7bc33c31e86e54e99869a8a4fbabb6737d7420d6b89e8b5988e ./bwa/genomeRef.sa
f665b64275eb76111463966bcb8e91e550c63c9b58263d43e19bae8552be2815 ./bwa/genomeRef.amb
d0d3731d1203cb4a0d0dd1279c37c85480b85a91da2d1dc543ac391ff927c272 ./bwa/genomeRef.pac
Version de référence
$ find . -type f -exec sh -c 'echo {}; sha256sum {}' \;
e0761a7ba5d10de9e7e97fa331667963925531c0199575bcceafbb13c3147e3f ./GRCh38_latest_genomic.fna
dd87d628a8179fc1e37a33b99101eece8282bec88dc17b1998a07ff0b912d4a3 ./GRCh38_latest_genomic.fna.index
0a6e215314659929dbcdffc1881714e311e3d149bdc33978a6f5e28206fcc675 ./GRCh38_latest_genomic.fna.fai
974cc313146aedd9ec2ae3f86b382b1317180e078621c1d53e8a803d4ec0d3a9 ./GRCh38_latest_genomic.dict
ac6e465f230da6d9f8339ebdf4cb05bcfc47d03f3a3889cae7256983c7809210 ./GRCh38_latest_genomic.fna.bwt
45a4aa0d8dc1095d090b13e4df180763b2e24d133ec81f026beaead6d41ebafc ./GRCh38_latest_genomic.fna.ann
f665b64275eb76111463966bcb8e91e550c63c9b58263d43e19bae8552be2815 ./GRCh38_latest_genomic.fna.amb
058ffaf8cd38e7bc33c31e86e54e99869a8a4fbabb6737d7420d6b89e8b5988e ./GRCh38_latest_genomic.fna.sa
d0d3731d1203cb4a0d0dd1279c37c85480b85a91da2d1dc543ac391ff927c272 ./GRCh38_latest_genomic.fna.pac
Dict ok si on renome le ficdhier d'origine
#+begin_src sh :dir /ssh:meso:/Work/Groups/bisonex/
sed 's/UR:.*/UR:genomeRef.fna/' data-alexis-reference/genome/GRCh38_latest_genomic.dict > lol.dict
diff lol.dict data/genome/GRCh38.p13/genomeRef.dict
#+end_src
#+RESULTS:
***** DONE dbSNP et dbSNP common: ok
CLOSED: [2023-01-03 Tue 23:17]
sha256sum GCF_000001405.39.gz
452e1112b6339a9b19821c2a226a8a3ba946e92a47e03e6ae464ef8820ee130d GCF_000001405.39.gz
sha256sum data-alexis-reference/dbSNP/GCF_000001405.39.gz
452e1112b6339a9b19821c2a226a8a3ba946e92a47e03e6ae464ef8820ee130d data-alexis-reference/dbSNP/GCF_000001405.39.gz"
sha256sum dbSNP_common.vcf.gz
70dfd9be859c39916598d23b5744cc1fbda04add5840cd90a6d0cd005bd3075b dbSNP_common.vcf.gz
sha256sum data-alexis-reference/dbSNP/dbSNP_common.vcf.gz
70dfd9be859c39916598d23b5744cc1fbda04add5840cd90a6d0cd005bd3075b data-alexis-reference/dbSNP/dbSNP_common.vcf.gz
***** DONE Clinvar : version différente
CLOSED: [2023-01-06 Fri 22:51]
$ zgrep -v '^#' data-alexis-reference/clinvar/clinvar.vcf.gz | wc -l
1474547
$ zgrep -v '^#' data/clinvar/GRCh38/clinvar.vcf.gz | wc -l
1571404
***** DONE Revérifer checksum de GRCh38 et dbSNP
CLOSED: [2023-01-26 jeu. 17:42]
****** DONE GRCh38
CLOSED: [2023-01-26 jeu. 17:42]
/Work/Projects/bisonex/data-alexis-reference/genome>sha256sum GRCh38_latest_genomic.fna
e0761a7ba5d10de9e7e97fa331667963925531c0199575bcceafbb13c3147e3f GRCh38_latest_genomic.fna
PS J:\bases_de_donnees\genome> cat .\checksum.txt
e0761a7ba5d10de9e7e97fa331667963925531c0199575bcceafbb13c3147e3f GRCh38_latest_genomic.fna
****** DONE dbSNP
CLOSED: [2023-01-26 jeu. 17:42]
/Work/Projects/bisonex/data-alexis-reference/dbSNP>sha256sum GCF_000001405.39.gz 01/26/2023 04:40:48 PM
452e1112b6339a9b19821c2a226a8a3ba946e92a47e03e6ae464ef8820ee130d GCF_000001405.39.gz
PS J:\bases_de_donnees\dbSNP> cat .\checksums.txt
452e1112b6339a9b19821c2a226a8a3ba946e92a47e03e6ae464ef8820ee130d GCF_000001405.39.gz
**** DONE Outils
CLOSED: [2023-01-26 jeu. 17:42]
| | Prod | Test |
| VCFtools | 0.1.17 | 0.1.16 |
| bcftools | 1.14 | 1.16 |
| samtools | 1.14 | 1.13 |
| gatk | 4.2.4.1 | 4.3.0.0 |
| bwa | ? | 0.7.17-r1188 |
On a des versions plus vieilles sauf (le plus important) Gatk
*** HOLD 63003856_S135
**** KILL Notes : bonne reproductibilité sur le cluster mais diff avec la version "de prod"
CLOSED: [2023-01-09 Mon 23:03]
tester sequential puis version spécifique bwa mem
Note: clinvar est plus ancien dans la version d'Alexis, cela explique les 8884 de la version pseudo-prod (via ID not clinvar patho)
Attention : le nombre de lignes = celles sans commentaires (et pas just NC... car il devrait y avoir les mitochondries !)
$ find . -name filter-depth.vcf -exec sh -c 'echo {}; grep -c -v '^#' {}' \;
Attention, on a un vcf.gz pour la version de test !!! ne pas utiliser le vcf
| | prod | prod | prod | prod | prod | test | test | test |
|---------------------+------------+-----------+------------+------------+------------+------------+------------+------------|
| PC | Karine | helios | helios | helios | Helios | Helios | Helios | Helios |
| cores | 4 | 4 | 4 | 24 | 1 | 24 | 24 | 1 |
| gatk | 4.2.4.1 | 4.2.4.1 | 4.2.4.1 | 4.2.4.1 | 4.2.4.1 | 4.2.4.1 | 4.3.0 | 4.2.3.1 |
| samtools | 1.10 | 1.10 | 1.13 | 1.13 | 1.13 | 1.13 | | |
| clinvar | 2022-09-03 | | 2022-09-03 | 2022-09-03 | 2022-09-03 | 2022-12-03 | 2022-12-03 | 2022-12-03 |
|---------------------+------------+-----------+------------+------------+------------+------------+------------+------------|
| bwa mem | 128077207 | 128077207 | 128077207 | 128077211 | 128077210 | 128077211 | 128077211 | 128077210 |
| cleanSam | 128077207 | | | | 128077210 | 128077211 | 128077211 | 128077210 |
| applybqsr | | | | 128077211 | | | | |
| haplotypecaller | 1528059 | 1528066 | 1528066 | 1528093 | 1528089 | 1528093 | 1528093 | 1528093 |
| DP_over_30 | 84708 | 84716 | 84716 | 84724 | 84725 | 84724 | 84724 | 84725 |
| not_SNP | 11362 | 11366 | 11366 | 11377 | 11384 | NA | NA | NA |
| consensual_sequence | 8864 | 8868 | 8868 | 8884 | 8886 | 8898 | 8898 | 8900 |
| tsv | 1087 | | 1121 | | | | | |
***** Convention
post_cleanSam = _cleaned.bam
post_markDuplicate = _marked_dup.bam
post_BaseRecalibrator = _recal.table
post_ApplyBQSR = _recalibrated_hg38.bam
post_haplotypecaller = .vcf
post_depth_filter = _DP_over_30.vcf
post_exclude_SNP = _DP_over_30_not_SNP
post_consensual = _DP_over_30_not_SNP_consensual_sequence.vcf
post_technical = _DP_over_30_not_SNP_consensual_sequence_not_technical.vcf.gz
On a vérifié que grep -c et grep | wc -l donnent le même résultat
#
**** KILL Gatk 4.3.0
CLOSED: [2023-01-04 Wed 19:16]
***** KILL Alignement
CLOSED: [2023-01-04 Wed 19:16]
****** DONE Brut
CLOSED: [2022-12-26 Mon 22:03]
Bam alexis
$ samtools view -c /Work/Groups/bisonex/ref_63003856_S135/63003856_S135.bam
128077207
Notre
9f/26cf3d] Cached process > preprocess:BWA_MEM (63003856_S135)
$ samtools view -c work/9f/26cf3deb07b425a3e851be2a7bd782/63003856_S135.bam
On vérifie la sortie
$ samtools view -c out/63003856_S135/preprocessing/mapped/63003856_S135.bam
128077211
Petite différence (< 1e-8) mais selon Alexis, bwa mem est non reproductible. d'autant qu'on utilise une version parallélisée
128077211
****** On vérifie les arguments: ok
#+begin_src
bwa mem \
-R '@RG\tID:sample\tSM:sample\tPL:ILLUMINA\
iant annotation
Besoin de vep
*** TODO Variant calling
* Amélioration :amelioration:
* Documentation :doc:
** DONE Procédure d'installation nix + dependences pour VM CHU
CLOSED: [2023-04-22 Sat 15:27] SCHEDULED: <2023-04-13 Thu>
* Manuscript :manuscript:
* Tests :tests:
** KILL Non régression : version prod
CLOSED: [2023-05-23 Tue 08:46]
*** DONE ID common snp
CLOSED: [2022-11-19 Sat 21:36]
#+begin_src
$ wc -l ID_of_common_snp.txt
23194290 ID_of_common_snp.txt
$ wc -l /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp.txt
23194290 /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp.txt
#+end_src
*** DONE ID common snp not clinvar patho
CLOSED: [2022-12-11 Sun 20:11]
**** DONE Vérification du problème
CLOSED: [2022-12-11 Sun 16:30]
Sur le J:
21155134 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt.ref
Version de "non-régression"
21155076 database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt
Nouvelle version
23193391 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt
Si on enlève les doublons
$ sort database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt | uniq > old.txt
$ wc -l old.txt
21107097 old.txt
$ sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt | uniq > new.txt
$ wc -l new.txt
21174578 new.txt
$ sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt.ref | uniq > ref.txt
$ wc -l ref.txt
21107155 ref.txt
Si on regarde la différence
comm -23 ref.txt old.txt
rs1052692
rs1057518973
rs1057518973
rs11074121
rs112848754
rs12573787
rs145033890
rs147889095
rs1553904159
rs1560294695
rs1560296615
rs1560310926
rs1560325547
rs1560342418
rs1560356225
rs1578287542
...
On cherche le premier
bcftools query -i 'ID="rs1052692"' database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 1619351 C A,T
Il est bien patho...
$ bcftools query -i 'POS=1619351' database/clinvar/clinvar.vcf.gz -f '%CHROM %POS %REF %ALT %INFO/CLNSIG\n'
19 1619351 C T Conflicting_interpretations_of_pathogenicity
On vérifie pour tous les autres
$ comm -23 ref.txt old.txt > tocheck.txt
On génère les régions à vérifier (chromosome number:position)
$ bcftools query -i 'ID=@tocheck.txt' database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM\t%POS\n' > tocheck.pos
On génère le mapping inverse (chromosome number -> NC)
$ awk ' { t = $1; $1 = $2; $2 = t; print; } ' database/RefSeq/refseq_to_number_only_consensual.txt > mapping.txt
On remap clinvar
$ bcftools annotate --rename-chrs mapping.txt database/clinvar/clinvar.vcf.gz -o clinvar_remapped.vcf.gz
$ tabix clinvar_remapped.vcf.gz
Enfin, on cherche dans clinvar la classification
$ bcftools query -R tocheck.pos clinvar_remapped.vcf.gz -f '%CHROM %POS %INFO/CLNSIG\n'
$ bcftools query -R tocheck.pos database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM %POS %ID \n' | grep '^NC'
#+RESULTS:
**** DONE Comprendre pourquoi la nouvelle version donne un résultat différent
CLOSED: [2022-12-11 Sun 20:11]
***** DONE Même version dbsnp et clinvar ?
CLOSED: [2022-12-10 Sat 23:02]
Clinvar différent !
$ bcftools stats clinvar.gz
clinvar (Alexis)
SN 0 number of samples: 0
SN 0 number of records: 1492828
SN 0 number of no-ALTs: 965
SN 0 number of SNPs: 1338007
SN 0 number of MNPs: 5562
SN 0 number of indels: 144580
SN 0 number of others: 3714
SN 0 number of multiallelic sites: 0
SN 0 number of multiallelic SNP sites: 0
clinvar (new)
SN 0 number of samples: 0
SN 0 number of records: 1493470
SN 0 number of no-ALTs: 965
SN 0 number of SNPs: 1338561
SN 0 number of MNPs: 5565
SN 0 number of indels: 144663
SN 0 number of others: 3716
SN 0 number of multiallelic sites: 0
SN 0 number of multiallelic SNP sites: 0
***** DONE Mettre à jour clinvar et dbnSNP pour travailler sur les mêm bases
CLOSED: [2022-12-11 Sun 12:10]
Problème persiste
***** DONE Supprimer la conversion en int du chromosome
CLOSED: [2022-12-10 Sat 19:29]
***** KILL Même NC ?
CLOSED: [2022-12-10 Sat 19:29]
$ zgrep "contig=<ID=NC_\(.*\)" clinvar/GRCh38/clinvar.vcf.gz > contig.clinvar
$ diff contig.txt contig.clinvar
< ##contig=<ID=NC_012920.1>
***** DONE Tester sur chromosome 19: ok
CLOSED: [2022-12-11 Sun 13:53]
On prépare les données
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz -o dbSNP_common_19.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz -o clinvar_19.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data-alexis/dbSNP/dbSNP_common.vcf.gz -o dbSNP_common_19_old.vcf.gz
bcftools filter -i 'CHROM="19"' /Work/Groups/bisonex/data-alexis/clinvar/clinvar.vcf.gz -o clinvar_19_old.vcf.gz
#+end_src
On récupère les 2 versions du script
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
git checkout regression ../../script/pythonScript/clinvar_sbSNP.py
cp ../../script/pythonScript/clinvar_sbSNP.py clinvar_sbSNP_old.py
git checkout HEAD ../../script/pythonScript/clinvar_sbSNP.py
#+end_src
#+RESULTS:
On compare
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
python ../../script/pythonScript/clinvar_sbSNP.py clinvar_sbSNP.py --clinvar clinvar_19.vcf.gz --dbSNP dbSNP_common_19.vcf.gz --output tmp.txt
sort tmp.txt | uniq > new.txt
table=/Work/Groups/bisonex/data-alexis/RefSeq/refseq_to_number_only_consensual.txt
python clinvar_sbSNP_old.py --clinvar clinvar_19_old.vcf.gz --dbSNP dbSNP_common_19_old.vcf.gz --output tmp_old.txt --chrm_name_table $table
sort tmp_old.txt | uniq > old.txt
wc -l old.txt new.txt
#+end_src
#+RESULTS:
| 535155 | old.txt |
| 535194 | new.txt |
| 1070349 | total |
Si on prend le premier manquant dans new, il est conflicting patho donc il ne devrait pas y être...
$ bcftools query -i 'ID="rs10418277"' dbSNP
_common_19.vcf.gz -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 54939682 C G,T
$ bcftools query -i 'ID="rs10418277"' dbSNP_common_19_old.vcf.gz -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 54939682 C G,T
$ bcftools query -i 'POS=54939682' clinvar_19.vcf.gz -f '%POS %REF %ALT %INFO/CLNSIG\n'
54939682 C G Conflicting_interpretations_of_pathogenicity
54939682 C T Benign
$ bcftools query -i 'POS=54939682' clinvar_19_old.vcf.gz -f '%POS %REF %ALT %INFO/CLNSIG\n'
54939682 C G Conflicting_interpretations_of_pathogenicity
54939682 C T Benign
$ grep rs10418277 *.txt
new.txt:rs10418277
tmp.txt:rs10418277
Le problème venait de la POS qui n'était plus convertie en int (suppression de la ligne par erreur ??)
On vérifie
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
python ../../script/pythonScript/clinvar_sbSNP.py --clinvar clinvar_19.vcf.gz --dbSNP dbSNP_common_19.vcf.gz --output tmp.txt
sort tmp.txt | uniq > new.txt
table=/Work/Groups/bisonex/data-alexis/RefSeq/refseq_to_number_only_consensual.txt
python clinvar_sbSNP_old.py --clinvar clinvar_19_old.vcf.gz --dbSNP dbSNP_common_19_old.vcf.gz --output tmp_old.txt --chrm_name_table $table
sort tmp_old.txt | uniq > old.txt
wc -l old.txt new.txt
diff old.txt new.txt
#+end_src
#+RESULTS:
| 535155 | old.txt |
| 535155 | new.txt |
| 1070310 | total |
***** DONE Tester sur chromosome 19 et 20: ok
CLOSED: [2022-12-11 Sun 15:56]
On prépare les données
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
bcftools filter -i 'CHROM="NC_000019.10" | CHROM="NC_000020.11"' /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz -o dbSNP_common_19_20.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10" | CHROM="NC_000020.11"' /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz -o clinvar_19_20.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10" | CHROM="NC_000020.11"' /Work/Groups/bisonex/data-alexis/dbSNP/dbSNP_common.vcf.gz -o dbSNP_common_19_20_old.vcf.gz
bcftools filter -i 'CHROM="19" | CHROM="20"' /Work/Groups/bisonex/data-alexis/clinvar/clinvar.vcf.gz -o clinvar_19_20_old.vcf.gz
#+end_src
#+RESULTS:
On récupère les 2 versions du script
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
git checkout regression ../../script/pythonScript/clinvar_sbSNP.py
cp ../../script/pythonScript/clinvar_sbSNP.py clinvar_sbSNP_old.py
git checkout HEAD ../../script/pythonScript/clinvar_sbSNP.py
#+end_src
#+RESULTS:
On compare
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
python ../../script/pythonScript/clinvar_sbSNP.py clinvar_sbSNP.py --clinvar clinvar_19_20.vcf.gz --dbSNP dbSNP_common_19_20.vcf.gz --output tmp.txt
sort tmp.txt | uniq > new.txt
table=/Work/Groups/bisonex/data-alexis/RefSeq/refseq_to_number_only_consensual.txt
python clinvar_sbSNP_old.py --clinvar clinvar_19_20_old.vcf.gz --dbSNP dbSNP_common_19_20_old.vcf.gz --output tmp_old.txt --chrm_name_table $table
sort tmp_old.txt | uniq > old.txt
wc -l old.txt new.txt
#+end_src
***** DONE Regarder la répartition des différences
CLOSED: [2022-12-11 Sun 16:29]
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt | uniq > notpatho.new
sort /Work/Groups/bisonex/data-alexis/dbSNP/ID_of_common_snp_not_clinvar_patho.txt | uniq > notpatho.old
comm -23 notpatho.new notpatho.old > nopatho.diff
#+end_src
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
bcftools query -i 'ID=@nopatho.diff' /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz -f '%CHROM\n' | sort | uniq -c
#+end_src
On a principalement des coordonnées non consensuelles (non "NC_", voir notes)
#+RESULTS:
: 2 NC_000002.12
: 18 NC_000003.12
: 2 NC_000004.12
: 2 NC_000005.10
: 14 NC_000006.12
: 6 NC_000007.14
: 2 NC_000009.12
: 1 NC_000010.11
: 6 NC_000014.9
: 1 NC_000015.10
: 3 NC_000016.10
: 3 NC_000017.11
: 1 NC_000019.10
: 1 NC_000020.11
: 1 NC_000021.9
: 2 NC_000022.11
: 16018 NT_113793.3
: 17010 NT_113796.3
: 14 NT_113891.3
: 1 NT_167244.2
: 13 NT_167245.2
: 2 NT_167246.2
: 13 NT_167247.2
: 7 NT_167248.2
: 14 NT_167249.2
: 14857 NT_187361.1
: 92 NT_187367.1
: 1 NT_187369.1
: 13 NT_187381.1
: 54 NT_187383.1
: 6 NT_187499.1
: 46 NT_187502.1
: 13754 NT_187513.1
: 611 NT_187517.1
: 1 NT_187520.1
: 1 NT_187524.1
: 249 NT_187526.1
: 18 NT_187532.1
: 1 NT_187546.1
: 886 NT_187562.1
: 1 NT_187564.1
: 346 NT_187576.1
: 13 NT_187600.1
: 5 NT_187601.1
: 494 NT_187606.1
: 1 NT_187607.1
: 12 NT_187613.1
: 307 NT_187614.1
: 1 NT_187625.1
: 445 NT_187633.1
: 43 NT_187648.1
: 18 NT_187649.1
: 1 NT_187652.1
: 512 NT_187661.1
: 18 NT_187678.1
: 49 NT_187681.1
: 1 NT_187682.1
: 18 NT_187688.1
: 12 NT_187689.1
: 18 NT_187690.1
: 18 NT_187691.1
: 404 NT_187693.1
: 2 NW_003315952.3
: 1 NW_003315970.2
: 203 NW_003571054.1
: 322 NW_003571055.2
: 16 NW_003571056.2
: 16 NW_003571057.2
: 16 NW_003571058.2
: 16 NW_003571059.2
: 16 NW_003571060.1
: 213 NW_003571061.2
: 2 NW_009646201.1
: 322 NW_009646205.1
: 321 NW_009646206.1
: 371 NW_012132914.1
: 1 NW_012132915.1
: 13 NW_012132918.1
: 2 NW_013171801.1
: 1 NW_013171807.1
: 49 NW_015148966.1
: 14 NW_015495298.1
: 2 NW_015495299.1
: 1 NW_016107298.1
: 4 NW_017363813.1
: 2 NW_017852933.1
: 1 NW_018654722.1
: 38 NW_021160001.1
: 1 NW_021160003.1
: 1 NW_021160007.1
: 7 NW_021160017.1
***** DONE Regarder la différence avec la version sans les sites non consensuels: ok !
CLOSED: [2022-12-11 Sun 20:11]
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt | uniq > notpatho.new
sort /Work/Groups/bisonex/data-alexis/dbSNP/ID_of_common_snp_not_clinvar_patho.txt | uniq > notpatho.old
comm -13 notpatho.new notpatho.old > notpatho.diff
wc -l
#+end_src
#+RESULTS:
: 528 notpatho.diff
Il manque 528 variants
rs1057520103
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
bcftools query -i 'ID=@notpatho.diff' /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz -f '%CHROM\n' | sort | uniq -c
#+end_src
#+RESULTS:
: 528 NC_012920.1
Donc la nouvelle version fonctionne mieux !
ON vérifie bien qu'ils sont dans l'ancienne version et la nouvelle:
$ grep -w -f notpatho.diff /Work/Groups/bisonex/data-alexis/dbSNP/ID_of_common_snp_not_clinvar_patho.txt | wc -l
528
$ grep -w -f notpatho.diff /Work/Groups/bisonex/data/d
bSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt
#+end_src
**** DONE Supprimer les sites non consensuels
CLOSED: [2022-12-11 Sun 19:51]
**** DONE Rajouter les mitochondries (vu avec Paul)
CLOSED: [2022-12-13 Tue 17:26]
Ok avec notre version générée. Sur le J: 21155134
$ wc -l dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt
21155065 dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt
$ wc -l ../data-alexis/dbSNP/ID_of_common_snp_not_clinvar_patho.txt
21155065 ../data-alexis/dbSNP/ID_of_common_snp_not_clinvar_patho.txt
La différence vient probablement d'une vieille version de clinvar
**** KILL Comprendre la différence nouvelle version et prod
CLOSED: [2023-05-23 Tue 08:46]
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
#sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt | uniq > notpatho.new
#sort /Work/Groups/bisonex/data-alexis/dbSNP/ID_of_common_snp_not_clinvar_patho.txt | uniq > notpatho.old
comm -13 notpatho.new notpatho.old > missing-from-old
comm -23 notpatho.new notpatho.old > missing-from-new
wc -l missing-from-old
wc -l missing-from-new
#+end_src
#+RESULTS:
| 75 | missing-from-old |
| 6 | missing-from-new |
Il manque 75 variants et on a 6 en trop
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
bcftools query -i 'ID=@missing-from-old' /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz -f '%CHROM\n' | sort | uniq -c
#+end_src
#+RESULTS:
| 16 | NC_000001.11 |
| 2 | NC_000002.12 |
| 18 | NC_000003.12 |
| 7 | NC_000004.12 |
| 1 | NC_000005.10 |
| 5 | NC_000006.12 |
| 3 | NC_000007.14 |
| 2 | NC_000009.12 |
| 1 | NC_000010.11 |
| 5 | NC_000011.10 |
| 3 | NC_000015.10 |
| 1 | NC_000016.10 |
| 4 | NC_000017.11 |
| 2 | NC_000019.10 |
| 1 | NC_000020.11 |
| 3 | NC_000022.11 |
| 1 | NC_000023.11 |
| 2 | NT_113891.3 |
| 2 | NT_167245.2 |
| 2 | NT_167247.2 |
| 1 | NT_167248.2 |
| 2 | NT_167249.2 |
| 18 | NT_187532.1 |
| 1 | NT_187562.1 |
| 1 | NT_187633.1 |
| 18 | NT_187649.1 |
| 18 | NT_187678.1 |
| 18 | NT_187688.1 |
| 12 | NT_187689.1 |
| 18 | NT_187690.1 |
| 18 | NT_187691.1 |
| 1 | NW_013171807.1 |
Donc la nouvelle version fonctionne mieux !
ON vérifie bien qu'ils sont dans l'ancienne version et la nouvelle:
$ grep -w -f notpatho.diff /Work/Groups/bisonex/data-alexis/dbSNP/ID_of_common_snp_not_clinvar_patho.txt | wc -l
528
$ grep -w -f notpatho.diff /Work/Groups/bisonex/data/d
bSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt
#+end_src
*** DONE Comparer les versions
CLOSED: [2023-01-26 jeu. 17:42]
**** DONE bases de données
CLOSED: [2023-01-26 jeu. 17:42]
***** DONE Genome de référénce:
CLOSED: [2023-01-06 Fri 00:00]
Version calculée
$ find . -type f -name "genomeRef.*" -exec sh -c 'echo {}; sha256sum {}' \;
e0761a7ba5d10de9e7e97fa331667963925531c0199575bcceafbb13c3147e3f ./genomeRef.fna
d121084c35037763ea58c59726545eaa1c11025a7bf2d75634677c72ddb72fd1 ./genomeRef.dict
0a6e215314659929dbcdffc1881714e311e3d149bdc33978a6f5e28206fcc675 ./genomeRef.fna.fai
45a4aa0d8dc1095d090b13e4df180763b2e24d133ec81f026beaead6d41ebafc ./bwa/genomeRef.ann
ac6e465f230da6d9f8339ebdf4cb05bcfc47d03f3a3889cae7256983c7809210 ./bwa/genomeRef.bwt
058ffaf8cd38e7bc33c31e86e54e99869a8a4fbabb6737d7420d6b89e8b5988e ./bwa/genomeRef.sa
f665b64275eb76111463966bcb8e91e550c63c9b58263d43e19bae8552be2815 ./bwa/genomeRef.amb
d0d3731d1203cb4a0d0dd1279c37c85480b85a91da2d1dc543ac391ff927c272 ./bwa/genomeRef.pac
Version de référence
$ find . -type f -exec sh -c 'echo {}; sha256sum {}' \;
e0761a7ba5d10de9e7e97fa331667963925531c0199575bcceafbb13c3147e3f ./GRCh38_latest_genomic.fna
dd87d628a8179fc1e37a33b99101eece8282bec88dc17b1998a07ff0b912d4a3 ./GRCh38_latest_genomic.fna.index
0a6e215314659929dbcdffc1881714e311e3d149bdc33978a6f5e28206fcc675 ./GRCh38_latest_genomic.fna.fai
974cc313146aedd9ec2ae3f86b382b1317180e078621c1d53e8a803d4ec0d3a9 ./GRCh38_latest_genomic.dict
ac6e465f230da6d9f8339ebdf4cb05bcfc47d03f3a3889cae7256983c7809210 ./GRCh38_latest_genomic.fna.bwt
45a4aa0d8dc1095d090b13e4df180763b2e24d133ec81f026beaead6d41ebafc ./GRCh38_latest_genomic.fna.ann
f665b64275eb76111463966bcb8e91e550c63c9b58263d43e19bae8552be2815 ./GRCh38_latest_genomic.fna.amb
058ffaf8cd38e7bc33c31e86e54e99869a8a4fbabb6737d7420d6b89e8b5988e ./GRCh38_latest_genomic.fna.sa
d0d3731d1203cb4a0d0dd1279c37c85480b85a91da2d1dc543ac391ff927c272 ./GRCh38_latest_genomic.fna.pac
Dict ok si on renome le ficdhier d'origine
#+begin_src sh :dir /ssh:meso:/Work/Groups/bisonex/
sed 's/UR:.*/UR:genomeRef.fna/' data-alexis-reference/genome/GRCh38_latest_genomic.dict > lol.dict
diff lol.dict data/genome/GRCh38.p13/genomeRef.dict
#+end_src
#+RESULTS:
***** DONE dbSNP et dbSNP common: ok
CLOSED: [2023-01-03 Tue 23:17]
sha256sum GCF_000001405.39.gz
452e1112b6339a9b19821c2a226a8a3ba946e92a47e03e6ae464ef8820ee130d GCF_000001405.39.gz
sha256sum data-alexis-reference/dbSNP/GCF_000001405.39.gz
452e1112b6339a9b19821c2a226a8a3ba946e92a47e03e6ae464ef8820ee130d data-alexis-reference/dbSNP/GCF_000001405.39.gz"
sha256sum dbSNP_common.vcf.gz
70dfd9be859c39916598d23b5744cc1fbda04add5840cd90a6d0cd005bd3075b dbSNP_common.vcf.gz
sha256sum data-alexis-reference/dbSNP/dbSNP_common.vcf.gz
70dfd9be859c39916598d23b5744cc1fbda04add5840cd90a6d0cd005bd3075b data-alexis-reference/dbSNP/dbSNP_common.vcf.gz
***** DONE Clinvar : version différente
CLOSED: [2023-01-06 Fri 22:51]
$ zgrep -v '^#' data-alexis-reference/clinvar/clinvar.vcf.gz | wc -l
1474547
$ zgrep -v '^#' data/clinvar/GRCh38/clinvar.vcf.gz | wc -l
1571404
***** DONE Revérifer checksum de GRCh38 et dbSNP
CLOSED: [2023-01-26 jeu. 17:42]
****** DONE GRCh38
CLOSED: [2023-01-26 jeu. 17:42]
/Work/Projects/bisonex/data-alexis-reference/genome>sha256sum GRCh38_latest_genomic.fna
e0761a7ba5d10de9e7e97fa331667963925531c0199575bcceafbb13c3147e3f GRCh38_latest_genomic.fna
PS J:\bases_de_donnees\genome> cat .\checksum.txt
e0761a7ba5d10de9e7e97fa331667963925531c0199575bcceafbb13c3147e3f GRCh38_latest_genomic.fna
****** DONE dbSNP
CLOSED: [2023-01-26 jeu. 17:42]
/Work/Projects/bisonex/data-alexis-reference/dbSNP>sha256sum GCF_000001405.39.gz 01/26/2023 04:40:48 PM
452e1112b6339a9b19821c2a226a8a3ba946e92a47e03e6ae464ef8820ee130d GCF_000001405.39.gz
PS J:\bases_de_donnees\dbSNP> cat .\checksums.txt
452e1112b6339a9b19821c2a226a8a3ba946e92a47e03e6ae464ef8820ee130d GCF_000001405.39.gz
**** DONE Outils
CLOSED: [2023-01-26 jeu. 17:42]
| | Prod | Test |
| VCFtools | 0.1.17 | 0.1.16 |
| bcftools | 1.14 | 1.16 |
| samtools | 1.14 | 1.13 |
| gatk | 4.2.4.1 | 4.3.0.0 |
| bwa | ? | 0.7.17-r1188 |
On a des versions plus vieilles sauf (le plus important) Gatk
*** KILL 63003856_S135
CLOSED: [2023-05-23 Tue 08:45]
**** KILL Notes : bonne reproductibilité sur le cluster mais diff avec la version "de prod"
CLOSED: [2023-01-09 Mon 23:03]
tester sequential puis version spécifique bwa mem
Note: clinvar est plus ancien dans la version d'Alexis, cela explique les 8884 de la version pseudo-prod (via ID not clinvar patho)
Attention : le nombre de lignes = celles sans commentaires (et pas just NC... car il devrait y avoir les mitochondries !)
$ find . -name filter-depth.vcf -exec sh -c 'echo {}; grep -c -v '^#' {}' \;
Attention, on a un vcf.gz pour la version de test !!! ne pas utiliser le vcf
| | prod | prod | prod | prod | prod | test | test | test |
|---------------------+------------+-----------+------------+------------+------------+------------+------------+------------|
| PC | Karine | helios | helios | helios | Helios | Helios | Helios | Helios |
| cores | 4 | 4 | 4 | 24 | 1 | 24 | 24 | 1 |
| gatk | 4.2.4.1 | 4.2.4.1 | 4.2.4.1 | 4.2.4.1 | 4.2.4.1 | 4.2.4.1 | 4.3.0 | 4.2.3.1 |
| samtools | 1.10 | 1.10 | 1.13 | 1.13 | 1.13 | 1.13 | | |
| clinvar | 2022-09-03 | | 2022-09-03 | 2022-09-03 | 2022-09-03 | 2022-12-03 | 2022-12-03 | 2022-12-03 |
|---------------------+------------+-----------+------------+------------+------------+------------+------------+------------|
| bwa mem | 128077207 | 128077207 | 128077207 | 128077211 | 128077210 | 128077211 | 128077211 | 128077210 |
| cleanSam | 128077207 | | | | 128077210 | 128077211 | 128077211 | 128077210 |
| applybqsr | | | | 128077211 | | | | |
| haplotypecaller | 1528059 | 1528066 | 1528066 | 1528093 | 1528089 | 1528093 | 1528093 | 1528093 |
| DP_over_30 | 84708 | 84716 | 84716 | 84724 | 84725 | 84724 | 84724 | 84725 |
| not_SNP | 11362 | 11366 | 11366 | 11377 | 11384 | NA | NA | NA |
| consensual_sequence | 8864 | 8868 | 8868 | 8884 | 8886 | 8898 | 8898 | 8900 |
| tsv | 1087 | | 1121 | | | | | |
***** Convention
post_cleanSam = _cleaned.bam
post_markDuplicate = _marked_dup.bam
post_BaseRecalibrator = _recal.table
post_ApplyBQSR = _recalibrated_hg38.bam
post_haplotypecaller = .vcf
post_depth_filter = _DP_over_30.vcf
post_exclude_SNP = _DP_over_30_not_SNP
post_consensual = _DP_over_30_not_SNP_consensual_sequence.vcf
post_technical = _DP_over_30_not_SNP_consensual_sequence_not_technical.vcf.gz
On a vérifié que grep -c et grep | wc -l donnent le même résultat
#
**** KILL Gatk 4.3.0
CLOSED: [2023-01-04 Wed 19:16]
***** KILL Alignement
CLOSED: [2023-01-04 Wed 19:16]
****** DONE Brut
CLOSED: [2022-12-26 Mon 22:03]
Bam alexis
$ samtools view -c /Work/Groups/bisonex/ref_63003856_S135/63003856_S135.bam
128077207
Notre
9f/26cf3d] Cached process > preprocess:BWA_MEM (63003856_S135)
$ samtools view -c work/9f/26cf3deb07b425a3e851be2a7bd782/63003856_S135.bam
On vérifie la sortie
$ samtools view -c out/63003856_S135/preprocessing/mapped/63003856_S135.bam
128077211
Petite différence (< 1e-8) mais selon Alexis, bwa mem est non reproductible. d'autant qu'on utilise une version parallélisée
128077211
****** On vérifie les arguments: ok
#+begin_src
bwa mem \
-R '@RG\tID:sample\tSM:sample\tPL:ILLUMINA\
/DP<=30' 63003856_S135.vcf.gz | bcftools filter -e 'FORMAT/AD[0:1]<=10' -o two-filters.vcf
$ grep '^NC' two-filters.vcf | wc -l
82054
***** DONE Tester bwa en séquentiel
CLOSED: [2023-01-07 Sat 00:05]
**** DONE (save) Version d'Alexis 24 threads, sans télécharger les bases de données, gatk 4.2.4.1
CLOSED: [2023-01-07 Sat 00:06]
***** Bwa mem 24 threads: comme la version test...
$ cd /Work/Users/apraga/bisonex/script/files/tmp_63003856_S135
$ samtools view -c 63003856_S135.bam
128077211
$ samtools view -c /Work/Groups/bisonex/ref_63003856_S135/63003856_S135.bam
128077207
En ne conservant que les mapped reads, minime différence
$ samtools view -c -F 260 /Work/Groups/bisonex/ref_63003856_S135/63003856_S135.bam
127941051
$ samtools view -c -F 260 files/tmp_63003856_S135/63003856_S135.bam
127941054
**** DONE Version prod à la maison
CLOSED: [2023-01-15 Sun 23:22]
preprocessing + variant calling sans alignement
***** DONE GATK nix
CLOSED: [2023-01-15 Sun 23:22]
❯ samtools view -c 63003856_S135_marked_dup.bam
128077211
On a le même nombre
***** DONE GATK compilé: idem
CLOSED: [2023-01-15 Sun 23:22]
script/files-src/tmp_63003856_S135 on prod [$!?]
❯ samtools view -c 63003856_S135_marked_dup.bam
128077211
Idem...
**** TODO GATK : tests de non régression ??
**** DONE Compiler gatk: idem
CLOSED: [2023-01-19 Thu 22:03]
***** DONE Compilation
CLOSED: [2023-01-14 Sat 22:45]
Requirements :
- java8 avec jDK
- git lfs
#+begin_src sh
git clone https://github.com/broadinstitute/gatk
git checkout tags/4.2.4.0 -b 4.2.4.0
nix-shell -p jdk8 git-lfs
# We need the datasets for testing (otherwise it fails)
git lfs install
git lfs pull --include src/main/resources/large
./gradlew bundle
#+end_src
***** DONE Vérifier tests
CLOSED: [2023-01-19 Thu 22:03]
#+begin_src
./gradlew test
#+end_src
267064 tests completed, 444 failed, 1966 skipped
> There were failing tests. See the report at: file:///Home/Users/apraga/gatk/build/reports/tests/test/index.html
BUILD FAILED in 37m 1s
6 actionable tasks: 1 executed, 5 up-to-date
Ceux qui ont planté sont liés à spark + erreurs d'authentification kerberos
***** KILL Lancer calcul : maison
CLOSED: [2023-01-19 Thu 22:03]
JAVA_HOME=/nix/store/r1r5jr7gv6hcchpiggjmfqjkzbi8y5ja-openjdk-8u322-ga/lib/openjdk PATH=$PATH:$JAVA_HOME/bin ./gatk --version
**** TODO Comparer tsv à la sortie (Alexis)
- diff
- Nombre de lignes
**** TODO Version de prod + nix sur Helios
***** DONE 24 coeurs : idem
CLOSED: [2023-01-20 Fri 22:49]
****** DONE Preprocessing : idem
CLOSED: [2023-01-20 Fri 22:49]
******* DONE Alignement
CLOSED: [2023-01-19 Thu 22:28]
******** DONE Nombre lignes : idem
CLOSED: [2023-01-19 Thu 22:48]
/Work/Users/apraga/bisonex/script/files/tmp_63003856_S135〉samtools view -c 63003856_S135.bam 01/19/2023 10:16:19 PM
128077211
/Work/Projects/bisonex/ref-63003856_S135〉samtools view -c 63003856_S135.bam 01/19/2023 10:11:39 PM
128077207
******** DONE Bwa version ok, arguments ok
CLOSED: [2023-01-20 Fri 22:49]
/Work/Users/apraga/bisonex/script/files/tmp_63003856_S135〉samtools view -H 63003856_S135.bam | grep PN 01/19/2023 10:46:55 PM
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem -R @RG\tID:63003856_S135\tSM:63003856_S135\tPL:ILLUMINA\tPM:Miseq\tCN:CHU_Minjoz\tLB:definition_to_add -v 2 -t 24 /Work/Groups/bisonex/data-alexis-reference/genome/GRCh38_latest_genomic.fna files/fastq/63003856_S135_R1_001.fastq.gz files/fastq/63003856_S135_R2_001.fastq.gz
@PG ID:samtools PN:samtools PP:bwa VN:1.13 CL:samtools sort -@ 24 -O BAM -o files/tmp_63003856_S135/63003856_S135.bam
@PG ID:samtools.1 PN:samtools PP:samtools VN:1.13 CL:samtools view -H 63003856_S135.bam
/Work/Projects/bisonex/ref-63003856_S135〉samtools view -H 63003856_S135.bam | grep PN 01/19/2023 10:49:25 PM
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:/bin/bwa mem -R @RG\tID:63003856_S135\tSM:63003856_S135\tPL:ILLUMINA\tPM:Miseq\tCN:CHU_Minjoz\tLB:definition_to_add -t 4 /mnt/j/bases_de_donnees/genome/GRCh38_latest_genomic.fna /mnt/j/working_directory_pipeline_analyse_exome/fastq/2200467051_63003856/63003856_S135_R1_001.fastq.gz /mnt/j/working_directory_pipeline_analyse_exome/fastq/2200467051_63003856/63003856_S135_R2_001.fastq.gz
@PG ID:samtools PN:samtools PP:bwa VN:1.10 CL:/mnt/h/tools/samtools sort -@ 4 -O BAM -o /mnt/j/working_directory_pipeline_analyse_exome/tmp_63003856_S135/63003856_S135.bam
@PG ID:samtools.1 PN:samtools PP:samtools VN:1.13 CL:samtools view -H 63003856_S135.bam
******* KILL ApplyBQSR
CLOSED: [2023-01-20 Fri 22:48]
/Work/Users/apraga/bisonex/script/files/bam〉samtools view -c 63003856_S135_recalibrated_hg38.bam 01/19/2023 10:21:00 PM
128077211
??
****** DONE Variant calling
CLOSED: [2023-01-20 Fri 22:48]
******* DONE Re vérifier flags
CLOSED: [2023-01-19 Thu 22:44]
/Work/Projects/bisonex/ref-63003856_S135〉less 63003856_S135_DP_over_30.vcf.gz
##GATKCommandLine=<ID=HaplotypeCaller,CommandLine="HaplotypeCaller
| Ref | Prod helios |
|-------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------|
| --dbsnp /mnt/j/bases_de_donnees/dbSNP/GCF_000001405.39.gz | --dbsnp /Work/Groups/bisonex/data-alexis-reference/dbSNP/GCF_000001405.39.gz |
| --max-mnp-distance 2 | --max-mnp-distance 2 |
| --output /mnt/j/working_directory_pipeline_analyse_exome/vcf/63003856_S135.vcf | --output files/vcf/63003856_S135.vcf |
| --input /mnt/j/working_directory_pipeline_analyse_exome/bam/63003856_S135_recalibrated_hg38.bam | --input files/bam/63003856_S135_recalibrated_hg38.bam |
| --reference /mnt/j/bases_de_donnees/genome/GRCh38_latest_genomic.fna | --reference /Work/Groups/bisonex/data-alexis-reference/genome/GRCh38_latest_genomic.fna |
| --verbosity WARNING | --verbosity WARNING |
| --use-posteriors-to-calculate-qual false | --use-posteriors-to-calculate-qual false |
| --dont-use-dragstr-priors false | --dont-use-dragstr-priors false |
| --use-new-qual-calculator true | --use-new-qual-calculator true |
| --annotate-with-num-discovered-alleles false | --annotate-with-num-discovered-alleles false |
| --heterozygosity 0.001 | --heterozygosity 0.001 |
| --indel-heterozygosity 1.25E-4 | --indel-heterozygosity 1.25E-4 |
| --heterozygosity-stdev 0.01
/DP<=30' 63003856_S135.vcf.gz | bcftools filter -e 'FORMAT/AD[0:1]<=10' -o two-filters.vcf
$ grep '^NC' two-filters.vcf | wc -l
82054
***** DONE Tester bwa en séquentiel
CLOSED: [2023-01-07 Sat 00:05]
**** DONE (save) Version d'Alexis 24 threads, sans télécharger les bases de données, gatk 4.2.4.1
CLOSED: [2023-01-07 Sat 00:06]
***** Bwa mem 24 threads: comme la version test...
$ cd /Work/Users/apraga/bisonex/script/files/tmp_63003856_S135
$ samtools view -c 63003856_S135.bam
128077211
$ samtools view -c /Work/Groups/bisonex/ref_63003856_S135/63003856_S135.bam
128077207
En ne conservant que les mapped reads, minime différence
$ samtools view -c -F 260 /Work/Groups/bisonex/ref_63003856_S135/63003856_S135.bam
127941051
$ samtools view -c -F 260 files/tmp_63003856_S135/63003856_S135.bam
127941054
**** DONE Version prod à la maison
CLOSED: [2023-01-15 Sun 23:22]
preprocessing + variant calling sans alignement
***** DONE GATK nix
CLOSED: [2023-01-15 Sun 23:22]
❯ samtools view -c 63003856_S135_marked_dup.bam
128077211
On a le même nombre
***** DONE GATK compilé: idem
CLOSED: [2023-01-15 Sun 23:22]
script/files-src/tmp_63003856_S135 on prod [$!?]
❯ samtools view -c 63003856_S135_marked_dup.bam
128077211
Idem...
**** KILL GATK : tests de non régression ??
CLOSED: [2023-05-23 Tue 08:44]
**** DONE Compiler gatk: idem
CLOSED: [2023-01-19 Thu 22:03]
***** DONE Compilation
CLOSED: [2023-01-14 Sat 22:45]
Requirements :
- java8 avec jDK
- git lfs
#+begin_src sh
git clone https://github.com/broadinstitute/gatk
git checkout tags/4.2.4.0 -b 4.2.4.0
nix-shell -p jdk8 git-lfs
# We need the datasets for testing (otherwise it fails)
git lfs install
git lfs pull --include src/main/resources/large
./gradlew bundle
#+end_src
***** DONE Vérifier tests
CLOSED: [2023-01-19 Thu 22:03]
#+begin_src
./gradlew test
#+end_src
267064 tests completed, 444 failed, 1966 skipped
> There were failing tests. See the report at: file:///Home/Users/apraga/gatk/build/reports/tests/test/index.html
BUILD FAILED in 37m 1s
6 actionable tasks: 1 executed, 5 up-to-date
Ceux qui ont planté sont liés à spark + erreurs d'authentification kerberos
***** KILL Lancer calcul : maison
CLOSED: [2023-01-19 Thu 22:03]
JAVA_HOME=/nix/store/r1r5jr7gv6hcchpiggjmfqjkzbi8y5ja-openjdk-8u322-ga/lib/openjdk PATH=$PATH:$JAVA_HOME/bin ./gatk --version
**** KILL Comparer tsv à la sortie (Alexis)
CLOSED: [2023-05-23 Tue 08:44]
- diff
- Nombre de lignes
**** KILL Version de prod + nix sur Helios
CLOSED: [2023-05-23 Tue 08:45]
***** DONE 24 coeurs : idem
CLOSED: [2023-01-20 Fri 22:49]
****** DONE Preprocessing : idem
CLOSED: [2023-01-20 Fri 22:49]
******* DONE Alignement
CLOSED: [2023-01-19 Thu 22:28]
******** DONE Nombre lignes : idem
CLOSED: [2023-01-19 Thu 22:48]
/Work/Users/apraga/bisonex/script/files/tmp_63003856_S135〉samtools view -c 63003856_S135.bam 01/19/2023 10:16:19 PM
128077211
/Work/Projects/bisonex/ref-63003856_S135〉samtools view -c 63003856_S135.bam 01/19/2023 10:11:39 PM
128077207
******** DONE Bwa version ok, arguments ok
CLOSED: [2023-01-20 Fri 22:49]
/Work/Users/apraga/bisonex/script/files/tmp_63003856_S135〉samtools view -H 63003856_S135.bam | grep PN 01/19/2023 10:46:55 PM
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem -R @RG\tID:63003856_S135\tSM:63003856_S135\tPL:ILLUMINA\tPM:Miseq\tCN:CHU_Minjoz\tLB:definition_to_add -v 2 -t 24 /Work/Groups/bisonex/data-alexis-reference/genome/GRCh38_latest_genomic.fna files/fastq/63003856_S135_R1_001.fastq.gz files/fastq/63003856_S135_R2_001.fastq.gz
@PG ID:samtools PN:samtools PP:bwa VN:1.13 CL:samtools sort -@ 24 -O BAM -o files/tmp_63003856_S135/63003856_S135.bam
@PG ID:samtools.1 PN:samtools PP:samtools VN:1.13 CL:samtools view -H 63003856_S135.bam
/Work/Projects/bisonex/ref-63003856_S135〉samtools view -H 63003856_S135.bam | grep PN 01/19/2023 10:49:25 PM
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:/bin/bwa mem -R @RG\tID:63003856_S135\tSM:63003856_S135\tPL:ILLUMINA\tPM:Miseq\tCN:CHU_Minjoz\tLB:definition_to_add -t 4 /mnt/j/bases_de_donnees/genome/GRCh38_latest_genomic.fna /mnt/j/working_directory_pipeline_analyse_exome/fastq/2200467051_63003856/63003856_S135_R1_001.fastq.gz /mnt/j/working_directory_pipeline_analyse_exome/fastq/2200467051_63003856/63003856_S135_R2_001.fastq.gz
@PG ID:samtools PN:samtools PP:bwa VN:1.10 CL:/mnt/h/tools/samtools sort -@ 4 -O BAM -o /mnt/j/working_directory_pipeline_analyse_exome/tmp_63003856_S135/63003856_S135.bam
@PG ID:samtools.1 PN:samtools PP:samtools VN:1.13 CL:samtools view -H 63003856_S135.bam
******* KILL ApplyBQSR
CLOSED: [2023-01-20 Fri 22:48]
/Work/Users/apraga/bisonex/script/files/bam〉samtools view -c 63003856_S135_recalibrated_hg38.bam 01/19/2023 10:21:00 PM
128077211
??
****** DONE Variant calling
CLOSED: [2023-01-20 Fri 22:48]
******* DONE Re vérifier flags
CLOSED: [2023-01-19 Thu 22:44]
/Work/Projects/bisonex/ref-63003856_S135〉less 63003856_S135_DP_over_30.vcf.gz
##GATKCommandLine=<ID=HaplotypeCaller,CommandLine="HaplotypeCaller
| Ref | Prod helios |
|-------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------|
| --dbsnp /mnt/j/bases_de_donnees/dbSNP/GCF_000001405.39.gz | --dbsnp /Work/Groups/bisonex/data-alexis-reference/dbSNP/GCF_000001405.39.gz |
| --max-mnp-distance 2 | --max-mnp-distance 2 |
| --output /mnt/j/working_directory_pipeline_analyse_exome/vcf/63003856_S135.vcf | --output files/vcf/63003856_S135.vcf |
| --input /mnt/j/working_directory_pipeline_analyse_exome/bam/63003856_S135_recalibrated_hg38.bam | --input files/bam/63003856_S135_recalibrated_hg38.bam |
| --reference /mnt/j/bases_de_donnees/genome/GRCh38_latest_genomic.fna | --reference /Work/Groups/bisonex/data-alexis-reference/genome/GRCh38_latest_genomic.fna |
| --verbosity WARNING | --verbosity WARNING |
| --use-posteriors-to-calculate-qual false | --use-posteriors-to-calculate-qual false |
| --dont-use-dragstr-priors false | --dont-use-dragstr-priors false |
| --use-new-qual-calculator true | --use-new-qual-calculator true |
| --annotate-with-num-discovered-alleles false | --annotate-with-num-discovered-alleles false |
| --heterozygosity 0.001 | --heterozygosity 0.001 |
| --indel-heterozygosity 1.25E-4 | --indel-heterozygosity 1.25E-4 |
| --heterozygosity-stdev 0.01
|
| --create-output-variant-md5 false | --create-output-variant-md5 false |
| --max-variants-per-shard 0 | --max-variants-per-shard 0 |
| --lenient false | --lenient false |
| --add-output-sam-program-record true | --add-output-sam-program-record true |
| --add-output-vcf-command-line true | --add-output-vcf-command-line true |
| --cloud-prefetch-buffer 40 | --cloud-prefetch-buffer 40 |
| --cloud-index-prefetch-buffer -1 | --cloud-index-prefetch-buffer -1 |
| --disable-bam-index-caching false | --disable-bam-index-caching false |
| --sites-only-vcf-output false | --sites-only-vcf-output false |
| --help false | --help false |
| --version false | --version false |
| --showHidden false | --showHidden false |
| --QUIET false | --QUIET false |
| --use-jdk-deflater false | --use-jdk-deflater false |
| --use-jdk-inflater false | --use-jdk-inflater false |
| --gcs-max-retries 20 | --gcs-max-retries 20 |
| --gcs-project-for-requester-pays | --gcs-project-for-requester-pays |
| --disable-tool-default-read-filters false | --disable-tool-default-read-filters false |
| --minimum-mapping-quality 20 | --minimum-mapping-quality 20 |
| --disable-tool-default-annotations false | --disable-tool-default-annotations false |
| --enable-all-annotations false | --enable-all-annotations false |
| --allow-old-rms-mapping-quality-annotation-data false | --allow-old-rms-mapping-quality-annotation-data false" |
| ",Version="4.2.4.1" | Version="4.2.4.1", |
Prod helios
/Work/Users/apraga/bisonex/script/files/tmp_63003856_S135〉less 63003856_S135_DP_over_30.vcf
##GATKCommandLine=<ID=HaplotypeCaller,CommandLine="HaplotypeCaller
,Date="January 10, 2023 at 12:26:57 AM CET">
******* DONE VCF : même différence
CLOSED: [2023-01-20 Fri 22:48]
/Work/Projects/bisonex/ref-63003856_S135〉ls *.vcf* | insert nblines {|e| (^zgrep -v '^#' $e.name | wc -l)} | select name nblines 01/14/2023 08:04:02 PM
╭───┬───────────────────────────────────────────────────────────────────────────┬─────────╮
│ # │ name │ nblines │
├───┼───────────────────────────────────────────────────────────────────────────┼─────────┤
│ 0 │ 63003856_S135_DP_over_30.vcf.gz │ 84708 │
│ 1 │ 63003856_S135_DP_over_30.vcf.gz.tbi │ 1 │
│ 2 │ 63003856_S135_DP_over_30_not_SNP.recode.vcf │ 11362 │
│ 3 │ 63003856_S135_DP_over_30_not_SNP_consensual_sequence.vcf │ 8864 │
│ 4 │ 63003856_S135_DP_over_30_not_SNP_consensual_sequence_not_technical.vcf.gz │ 6478 │
╰───┴───────────────────────────────────────────────────────────────────────────┴─────────╯
/Work/Users/apraga/bisonex/script/files/tmp_63003856_S135〉ls *.vcf* | insert nblines {|e| (^zgrep -v '^#' $e.name | wc -l)} | select name nblines 01/14/2023 08:05:23 PM
╭───┬───────────────────────────────────────────────────────────────────────────┬─────────╮
│ # │ name │ nblines │
├───┼───────────────────────────────────────────────────────────────────────────┼─────────┤
│ 0 │ 63003856_S135_DP_over_30.vcf │ 84724 │
│ 1 │ 63003856_S135_DP_over_30_not_SNP.recode.vcf │ 11377 │
│ 2 │ 63003856_S135_DP_over_30_not_SNP_consensual_sequence.vcf │ 8884 │
│ 3 │ 63003856_S135_DP_over_30_not_SNP_consensual_sequence_not_technical.vcf.gz │ 6759 │
╰───┴───────────────────────────────────────────────────────────────────────────┴─────────╯
***** TODO Relancer avec 4 coeurs
-c 4
****** DONE Alignement ok !!
CLOSED: [2023-01-20 Fri 23:24]
$ samtools view -c files/tmp_63003856_S135/63003856_S135.bam
128077207
****** TODO Vérifier les flags de chaque étape
post_cleanSam = _cleaned.bam
post_markDuplicate = _marked_dup.ba
|
| --create-output-variant-md5 false | --create-output-variant-md5 false |
| --max-variants-per-shard 0 | --max-variants-per-shard 0 |
| --lenient false | --lenient false |
| --add-output-sam-program-record true | --add-output-sam-program-record true |
| --add-output-vcf-command-line true | --add-output-vcf-command-line true |
| --cloud-prefetch-buffer 40 | --cloud-prefetch-buffer 40 |
| --cloud-index-prefetch-buffer -1 | --cloud-index-prefetch-buffer -1 |
| --disable-bam-index-caching false | --disable-bam-index-caching false |
| --sites-only-vcf-output false | --sites-only-vcf-output false |
| --help false | --help false |
| --version false | --version false |
| --showHidden false | --showHidden false |
| --QUIET false | --QUIET false |
| --use-jdk-deflater false | --use-jdk-deflater false |
| --use-jdk-inflater false | --use-jdk-inflater false |
| --gcs-max-retries 20 | --gcs-max-retries 20 |
| --gcs-project-for-requester-pays | --gcs-project-for-requester-pays |
| --disable-tool-default-read-filters false | --disable-tool-default-read-filters false |
| --minimum-mapping-quality 20 | --minimum-mapping-quality 20 |
| --disable-tool-default-annotations false | --disable-tool-default-annotations false |
| --enable-all-annotations false | --enable-all-annotations false |
| --allow-old-rms-mapping-quality-annotation-data false | --allow-old-rms-mapping-quality-annotation-data false" |
| ",Version="4.2.4.1" | Version="4.2.4.1", |
Prod helios
/Work/Users/apraga/bisonex/script/files/tmp_63003856_S135〉less 63003856_S135_DP_over_30.vcf
##GATKCommandLine=<ID=HaplotypeCaller,CommandLine="HaplotypeCaller
,Date="January 10, 2023 at 12:26:57 AM CET">
******* DONE VCF : même différence
CLOSED: [2023-01-20 Fri 22:48]
/Work/Projects/bisonex/ref-63003856_S135〉ls *.vcf* | insert nblines {|e| (^zgrep -v '^#' $e.name | wc -l)} | select name nblines 01/14/2023 08:04:02 PM
╭───┬───────────────────────────────────────────────────────────────────────────┬─────────╮
│ # │ name │ nblines │
├───┼───────────────────────────────────────────────────────────────────────────┼─────────┤
│ 0 │ 63003856_S135_DP_over_30.vcf.gz │ 84708 │
│ 1 │ 63003856_S135_DP_over_30.vcf.gz.tbi │ 1 │
│ 2 │ 63003856_S135_DP_over_30_not_SNP.recode.vcf │ 11362 │
│ 3 │ 63003856_S135_DP_over_30_not_SNP_consensual_sequence.vcf │ 8864 │
│ 4 │ 63003856_S135_DP_over_30_not_SNP_consensual_sequence_not_technical.vcf.gz │ 6478 │
╰───┴───────────────────────────────────────────────────────────────────────────┴─────────╯
/Work/Users/apraga/bisonex/script/files/tmp_63003856_S135〉ls *.vcf* | insert nblines {|e| (^zgrep -v '^#' $e.name | wc -l)} | select name nblines 01/14/2023 08:05:23 PM
╭───┬───────────────────────────────────────────────────────────────────────────┬─────────╮
│ # │ name │ nblines │
├───┼───────────────────────────────────────────────────────────────────────────┼─────────┤
│ 0 │ 63003856_S135_DP_over_30.vcf │ 84724 │
│ 1 │ 63003856_S135_DP_over_30_not_SNP.recode.vcf │ 11377 │
│ 2 │ 63003856_S135_DP_over_30_not_SNP_consensual_sequence.vcf │ 8884 │
│ 3 │ 63003856_S135_DP_over_30_not_SNP_consensual_sequence_not_technical.vcf.gz │ 6759 │
╰───┴───────────────────────────────────────────────────────────────────────────┴─────────╯
***** KILL Relancer avec 4 coeurs
CLOSED: [2023-05-23 Tue 08:45]
-c 4
****** DONE Alignement ok !!
CLOSED: [2023-01-20 Fri 23:24]
$ samtools view -c files/tmp_63003856_S135/63003856_S135.bam
128077207
****** KILL Vérifier les flags de chaque étape
CLOSED: [2023-05-23 Tue 08:44]
post_cleanSam = _cleaned.bam
post_markDuplicate = _marked_dup.ba
:00]
alignment: différent
****** KILL Comparer bam
CLOSED: [2023-01-25 Wed 21:58]
/Work/Users/apraga/bisonex/script/files〉picard Compa
reSAMs LENIENT_LOW_MQ_ALIGNMENT=true LENIENT_DUP=true tmp_63003856_S135/63003856_S135.bam /Work/Groups/bisonex/ref/tmp_63003856_S135/63003856_S135.bam O=compare-bam.tsv
picard CompareSAMs -LENIENT_LOW_MQ_ALIGNMENT true -LENIENT_DUP true tmp_63003856_S135/63003856_S135.bam /Work/Groups/bisonex/ref/tmp_63003856_S135/63003856_S135.bam -O compare-bam.tsv
VN Program Record attribute differs.
File 1: 1.13
File 2: 1.10
SAM files differ.
[Tue Jan 24 23:12:50 CET 2023] picard.sam.CompareSAMs done. Elapsed time: 7.32 minutes.
***** DONE Relancer avec la même version de samtools
CLOSED: [2023-01-25 Wed 21:58]
Pas d'impact
***** TODO Comparer tsv de sortie
***** TODO Regarder où sont les variants différents
** TODO GIAB Validation :giab:
https://github.com/ga4gh/benchmarking-tools
Prérequis :
- [[*hap.py][hap.py]]
- [[*NA12878][NA12878]]
*** DONE GIAB : exome :giab:
CLOSED: [2023-04-16 Sun 16:33]
**** Notes
https://github.com/genome-in-a-bottle/giab_FAQ
**** Résultats résumés :resultats:
***** DONE HG001 :
CLOSED: [2023-04-06 Thu 21:41] SCHEDULED: <2023-04-02 Sun>
| Données | Algorithm | Type | Recall | Precision |
|---------+-----------+---------+--------+-----------|
| Bisonex | Happy | SNP | 0.8552 | 0.9708 |
| Bisonex | vcfeval | SNP | 0.8547 | 0.9727 |
| Bisonex | Happy | INDEL | 0.7105 | 0.6929 |
| Bisonex | vcfeval | Non-SNP | 0.7139 | 0.7136 |
|---------+-----------+---------+--------+-----------|
| GIAB | happy | INDEL | 0.7551 | 0.7415 |
| GIAB | vcfeval | INDEL | 0.7598 | 0.7445 |
| GIAB | happy | SNP | 0.8937 | 0.9621 |
| giab | vcfeval | SNP | 0.8937 | 0.9621 |
***** DONE HG002, HG003, HG004
CLOSED: [2023-04-14 Fri 11:36] SCHEDULED: <2023-04-14 Fri>
Capture Agilent
| Patient | Algorithm | Type | Recall | Precision |
| HG002 | happy | INDEL | 0.851495 | 0.923616 |
| HG002 | happy | SNP | 0.905926 | 0.992158 |
| HG002 | vcfeval | indel | 0.8523 | 0.9212 |
| HG002 | vcfeval | snp | 0.9054 | 0.9934 |
| HG003 | vcfeval | indel | 0.8363 | 0.9115 |
| HG003 | vcfeval | snp | 0.9069 | 0.9928 |
| HG003 | happy | INDEL | 0.838521 | 0.917296 |
| HG003 | happy | SNP | 0.907466 | 0.991204 |
| HG004 | happy | INDEL | 0.856835 | 0.925086 |
| HG004 | happy | SNP | 0.905067 | 0.992704 |
| HG004 | vcfeval | indel | 0.8568 | 0.9240 |
| HG004 | vcfeval | snp | 0.9048 | 0.9938 |
**** DONE télécharger données avec Nextflow
CLOSED: [2023-04-16 Sun 16:32]
***** DONE Renommer les chromosomes
CLOSED: [2023-02-17 Fri 19:30]
****** DONE Genome de reference NCBI
CLOSED: [2023-02-25 Sat 19:46]
****** DONE Bed avec les exons
CLOSED: [2023-03-29 Wed 23:04]
****** DONE hg19
CLOSED: [2023-02-26 Sun 22:37]
****** DONE hg38
CLOSED: [2023-03-29 Wed 23:04]
- [X] Télécharger hg19 : ok
- [X] convertir bed en interval list
picard BedToIntervalList -I exons_illumina.bed -O exons_illumina.list -SD ../../genome/GRCh19/genomeRef.dict
- [X] puis en hg38
picard LiftOverIntervalList -I exons_illumina.list -O exons_illumina_hg38.list --CHAIN hg19ToHg38.over.chain -SD ../../genome/GRCh38.p13/genomeRef.dict
- [X] puis en bed
***** KILL VCF de référence
CLOSED: [2023-04-16 Sun 16:32]
****** TODO NA12878 (HG001)
******* DONE Fastq HiSeq
CLOSED: [2023-02-25 Sat 19:46]
On prend le Hiseq, qui est probablement ce qu'utilise Centogène :
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/
On utilisé les données "trimmés" (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1069-7), i.e qui ont enlevé les fragments plus petits que la taille d'un read.
Informations:
- https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/Garvan_NA12878_HG001_HiSeq_Exome.README
- Sequencer: HiSeq2500
- kit: Nextera Rapid Capture Exome and Expanded Exome
Il y a 2 samples (NIST7035 et NIST7086), chacun sur 2 lanes -> à concaténer
NB : liste techno illumina https://www.illumina.com/systems/sequencing-platforms.html
Hiseq postérieur nextseq 550
******* TODO Fastq hiseq sans trimming
SCHEDULED: <2023-05-25 Thu>
******* DONE Capture : Exons (bed)
CLOSED: [2023-02-25 Sat 19:46]
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/nexterarapidcapture_expandedexome_targetedregions.bed.gz
******* DONE Bed, vcf
CLOSED: [2023-02-24 Fri 23:45]
****** DONE Ashkenazy trio HG002, HG003, HGQ004
CLOSED: [2023-04-06 Thu 21:43] SCHEDULED: <2023-04-01 Sat>
****** KILL Chinese trio HG005, 6, 7
CLOSED: [2023-04-16 Sun 16:32]
***** KILL Fastq :fastq:
CLOSED: [2023-04-16 Sun 16:32]
****** DONE NA12878 (HG001)
CLOSED: [2023-02-25 Sat 19:46]
******* DONE Fastq HiSeq
CLOSED: [2023-02-25 Sat 19:46]
On prend le Hiseq, qui est probablement ce qu'utilise Centogène :
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/
On utilisé les données "trimmés" (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1069-7), i.e qui ont enlevé les fragments plus petits que la taille d'un read.
Informations:
- https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/Garvan_NA12878_HG001_HiSeq_Exome.README
- Sequencer: HiSeq2500
- kit: Nextera Rapid Capture Exome and Expanded Exome
Il y a 2 samples (NIST7035 et NIST7086), chacun sur 2 lanes -> à concaténer
NB : liste techno illumina https://www.illumina.com/systems/sequencing-platforms.html
Hiseq postérieur nextseq 550
******* DONE Capture : Exons (bed)
CLOSED: [2023-02-25 Sat 19:46]
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/nexterarapidcapture_expandedexome_targetedregions.bed.gz
****** DONE Ashkenazy trio HG002, HG003, HG004
CLOSED: [2023-04-15 Sat 23:24] SCHEDULED: <2023-04-05 Wed>
******* DONE Capture
CLOSED: [2023-04-15 Sat 23:24]
https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/OsloUniversityHospital_Exome_GATK_jointVC_11242015/wex_Agilent_SureSelect_v05_b37.baits.slop50.merged.list
******* DONE Capture Agilent
CLOSED: [2023-04-15 Sat 23:24]
******* DONE Bam à partir des fastq
CLOSED: [2023-04-15 Sat 23:24]
Bam + index + checksum
https://raw.githubusercontent.com/genome-in-a-bottle/giab_data_indexes/master/AshkenazimTrio/alignment.index.AJtrio_OsloUniversityHospital_IlluminaExome_bwamem_GRCh37_11252015
****** KILL Chinese trio
CLOSED: [2023-04-16 Sun 16:32]
Whole exome pour HG005 seulement
******* KILL HG005
CLOSED: [2023-04-16 Sun 16:32]
https://raw.githubusercontent.com/genome-in-a-bottle/giab_data_indexes/master/ChineseTrio/alignment.index.Chinesetrio_HG005_OsloUniversityHospital_IlluminaExome_bwamem_GRCh37_11252015
**** DONE NA12878 / HG001 :na12878:
CLOSED: [2023-04-15 Sat 23:53]
***** DONE Discussion alexis : Mail
CLOSED: [2023-03-29 Wed 22:40]
Avec le patient NA12878 et comparaison avec hap.py du VCF de Genome In A Bottle ("gold" standard), on avait pour rappel
- sensibilité (=recall) 71% pour indel, 85% SNP
- précision (= VPP) 69 et 97% respectivement
| Type | TRUTH | TP | FN | QUERY | FP | UNK | FP.gt | FP.al | Recall | Precision |
| INDEL | 4871 | 3461 | 1410 | 7048 | 1554 | 1987 | 193 | 346 | 0.710532 | 0.692946 |
| SNP | 46032 | 39369 | 6663 | 44600 | 1186 | 4041 | 304 | 30 | 0.855253 | 0.970759 |
Les statistiques sur les génomes sont bien meilleurs (cf precisionFDA challenge).
Pour les exome, un article [1] a fait a des meilleures stats sur ce patient avec BWA et GATK mais ils ont moins de variant (on a presque un facteur 2 !).
Je soupçonne qu'on ne travaille pas sur les mêmes zones de capture (pas réussi à récupérer leur .bed)
| Exome | Type | TP | FP | FN | Sensitivity |
:00]
alignment: différent
****** KILL Comparer bam
CLOSED: [2023-01-25 Wed 21:58]
/Work/Users/apraga/bisonex/script/files〉picard CompareSAMs LENIENT_LOW_MQ_ALIGNMENT=true LENIENT_DUP=true tmp_63003856_S135/63003856_S135.bam /Work/Groups/bisonex/ref/tmp_63003856_S135/63003856_S135.bam O=compare-bam.tsv
picard CompareSAMs -LENIENT_LOW_MQ_ALIGNMENT true -LENIENT_DUP true tmp_63003856_S135/63003856_S135.bam /Work/Groups/bisonex/ref/tmp_63003856_S135/63003856_S135.bam -O compare-bam.tsv
VN Program Record attribute differs.
File 1: 1.13
File 2: 1.10
SAM files differ.
[Tue Jan 24 23:12:50 CET 2023] picard.sam.CompareSAMs done. Elapsed time: 7.32 minutes.
***** DONE Relancer avec la même version de samtools
CLOSED: [2023-01-25 Wed 21:58]
Pas d'impact
***** KILL Comparer tsv de sortie
CLOSED: [2023-05-23 Tue 08:45]
***** KILL Regarder où sont les variants différents
CLOSED: [2023-05-23 Tue 08:45]
** TODO GIAB Validation :giab:
https://github.com/ga4gh/benchmarking-tools
Prérequis :
- [[*hap.py][hap.py]]
- [[*NA12878][NA12878]]
*** DONE GIAB : exome :giab:
CLOSED: [2023-04-16 Sun 16:33]
**** Notes
https://github.com/genome-in-a-bottle/giab_FAQ
**** Résultats résumés :resultats:
***** DONE HG001 :
CLOSED: [2023-04-06 Thu 21:41] SCHEDULED: <2023-04-02 Sun>
| Données | Algorithm | Type | Recall | Precision |
|---------+-----------+---------+--------+-----------|
| Bisonex | Happy | SNP | 0.8552 | 0.9708 |
| Bisonex | vcfeval | SNP | 0.8547 | 0.9727 |
| Bisonex | Happy | INDEL | 0.7105 | 0.6929 |
| Bisonex | vcfeval | Non-SNP | 0.7139 | 0.7136 |
|---------+-----------+---------+--------+-----------|
| GIAB | happy | INDEL | 0.7551 | 0.7415 |
| GIAB | vcfeval | INDEL | 0.7598 | 0.7445 |
| GIAB | happy | SNP | 0.8937 | 0.9621 |
| giab | vcfeval | SNP | 0.8937 | 0.9621 |
***** DONE HG002, HG003, HG004
CLOSED: [2023-04-14 Fri 11:36] SCHEDULED: <2023-04-14 Fri>
Capture Agilent
| Patient | Algorithm | Type | Recall | Precision |
| HG002 | happy | INDEL | 0.851495 | 0.923616 |
| HG002 | happy | SNP | 0.905926 | 0.992158 |
| HG002 | vcfeval | indel | 0.8523 | 0.9212 |
| HG002 | vcfeval | snp | 0.9054 | 0.9934 |
| HG003 | vcfeval | indel | 0.8363 | 0.9115 |
| HG003 | vcfeval | snp | 0.9069 | 0.9928 |
| HG003 | happy | INDEL | 0.838521 | 0.917296 |
| HG003 | happy | SNP | 0.907466 | 0.991204 |
| HG004 | happy | INDEL | 0.856835 | 0.925086 |
| HG004 | happy | SNP | 0.905067 | 0.992704 |
| HG004 | vcfeval | indel | 0.8568 | 0.9240 |
| HG004 | vcfeval | snp | 0.9048 | 0.9938 |
**** DONE télécharger données avec Nextflow
CLOSED: [2023-04-16 Sun 16:32]
***** DONE Renommer les chromosomes
CLOSED: [2023-02-17 Fri 19:30]
****** DONE Genome de reference NCBI
CLOSED: [2023-02-25 Sat 19:46]
****** DONE Bed avec les exons
CLOSED: [2023-03-29 Wed 23:04]
****** DONE hg19
CLOSED: [2023-02-26 Sun 22:37]
****** DONE hg38
CLOSED: [2023-03-29 Wed 23:04]
- [X] Télécharger hg19 : ok
- [X] convertir bed en interval list
picard BedToIntervalList -I exons_illumina.bed -O exons_illumina.list -SD ../../genome/GRCh19/genomeRef.dict
- [X] puis en hg38
picard LiftOverIntervalList -I exons_illumina.list -O exons_illumina_hg38.list --CHAIN hg19ToHg38.over.chain -SD ../../genome/GRCh38.p13/genomeRef.dict
- [X] puis en bed
***** KILL VCF de référence
CLOSED: [2023-04-16 Sun 16:32]
****** TODO NA12878 (HG001)
******* DONE Fastq HiSeq
CLOSED: [2023-02-25 Sat 19:46]
On prend le Hiseq, qui est probablement ce qu'utilise Centogène :
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/
On utilisé les données "trimmés" (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1069-7), i.e qui ont enlevé les fragments plus petits que la taille d'un read.
Informations:
- https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/Garvan_NA12878_HG001_HiSeq_Exome.README
- Sequencer: HiSeq2500
- kit: Nextera Rapid Capture Exome and Expanded Exome
Il y a 2 samples (NIST7035 et NIST7086), chacun sur 2 lanes -> à concaténer
NB : liste techno illumina https://www.illumina.com/systems/sequencing-platforms.html
Hiseq postérieur nextseq 550
******* TODO Fastq hiseq sans trimming
SCHEDULED: <2023-05-25 Thu>
******* DONE Capture : Exons (bed)
CLOSED: [2023-02-25 Sat 19:46]
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/nexterarapidcapture_expandedexome_targetedregions.bed.gz
******* DONE Bed, vcf
CLOSED: [2023-02-24 Fri 23:45]
****** DONE Ashkenazy trio HG002, HG003, HGQ004
CLOSED: [2023-04-06 Thu 21:43] SCHEDULED: <2023-04-01 Sat>
****** KILL Chinese trio HG005, 6, 7
CLOSED: [2023-04-16 Sun 16:32]
***** KILL Fastq :fastq:
CLOSED: [2023-04-16 Sun 16:32]
****** DONE NA12878 (HG001)
CLOSED: [2023-02-25 Sat 19:46]
******* DONE Fastq HiSeq
CLOSED: [2023-02-25 Sat 19:46]
On prend le Hiseq, qui est probablement ce qu'utilise Centogène :
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/
On utilisé les données "trimmés" (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1069-7), i.e qui ont enlevé les fragments plus petits que la taille d'un read.
Informations:
- https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/Garvan_NA12878_HG001_HiSeq_Exome.README
- Sequencer: HiSeq2500
- kit: Nextera Rapid Capture Exome and Expanded Exome
Il y a 2 samples (NIST7035 et NIST7086), chacun sur 2 lanes -> à concaténer
NB : liste techno illumina https://www.illumina.com/systems/sequencing-platforms.html
Hiseq postérieur nextseq 550
******* DONE Capture : Exons (bed)
CLOSED: [2023-02-25 Sat 19:46]
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/nexterarapidcapture_expandedexome_targetedregions.bed.gz
****** DONE Ashkenazy trio HG002, HG003, HG004
CLOSED: [2023-04-15 Sat 23:24] SCHEDULED: <2023-04-05 Wed>
******* DONE Capture
CLOSED: [2023-04-15 Sat 23:24]
https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/OsloUniversityHospital_Exome_GATK_jointVC_11242015/wex_Agilent_SureSelect_v05_b37.baits.slop50.merged.list
******* DONE Capture Agilent
CLOSED: [2023-04-15 Sat 23:24]
******* DONE Bam à partir des fastq
CLOSED: [2023-04-15 Sat 23:24]
Bam + index + checksum
https://raw.githubusercontent.com/genome-in-a-bottle/giab_data_indexes/master/AshkenazimTrio/alignment.index.AJtrio_OsloUniversityHospital_IlluminaExome_bwamem_GRCh37_11252015
****** KILL Chinese trio
CLOSED: [2023-04-16 Sun 16:32]
Whole exome pour HG005 seulement
******* KILL HG005
CLOSED: [2023-04-16 Sun 16:32]
https://raw.githubusercontent.com/genome-in-a-bottle/giab_data_indexes/master/ChineseTrio/alignment.index.Chinesetrio_HG005_OsloUniversityHospital_IlluminaExome_bwamem_GRCh37_11252015
**** DONE NA12878 / HG001 :na12878:
CLOSED: [2023-04-15 Sat 23:53]
***** DONE Discussion alexis : Mail
CLOSED: [2023-03-29 Wed 22:40]
Avec le patient NA12878 et comparaison avec hap.py du VCF de Genome In A Bottle ("gold" standard), on avait pour rappel
- sensibilité (=recall) 71% pour indel, 85% SNP
- précision (= VPP) 69 et 97% respectivement
| Type | TRUTH | TP | FN | QUERY | FP | UNK | FP.gt | FP.al | Recall | Precision |
| INDEL | 4871 | 3461 | 1410 | 7048 | 1554 | 1987 | 193 | 346 | 0.710532 | 0.692946 |
| SNP | 46032 | 39369 | 6663 | 44600 | 1186 | 4041 | 304 | 30 | 0.855253 | 0.970759 |
Les statistiques sur les génomes sont bien meilleurs (cf precisionFDA challenge).
Pour les exome, un article [1] a fait a des meilleures stats sur ce patient avec BWA et GATK mais ils ont moins de variant (on a presque un facteur 2 !).
Je soupçonne qu'on ne travaille pas sur les mêmes zones de capture (pas réussi à récupérer leur .bed)
| Exome | Type | TP | FP | FN | Sensitivity |
r qu'on retrouve les variants
CLOSED: [2023-05-19 Fri 18:41] SCHEDULED: <2023-05-18 Thu>
****** TODO Corrigier position pour avoir une bonne VAF
SCHEDULED: <2023-05-19 Fri>
POS POS+1 VAF ALT
Attention, la base corrigée est à POS+1...
****** DONE Comparaison manuelle avec julia (VAF = 1...)
CLOSED: [2023-05-19 Fri 21:58]
552 found over 585
***** TODO del
***** TODO ins
*** TODO Julia : Bamscissors :bamscissors:
Modification de la séquence dans BAM.
*Pas de mise à jour de CIGAR*
On convertit en fastq et on lance le pipeline pour "corriger"
#+begin_src sh
cd /home/alex/code/bisonex/out/63003856/preprocessing/mapped
samtools view 63003856_S135.bam NC_000022.11 -o 63003856_S135_chr22.bam
cd /home/alex/recherche/bisonex/code/BamScissors.jl
cp ~/code/bisonex/out/63003856/preprocessing/mapped/63003856_S135_chr22.bam .
samtools index 63003856_chr22.bam
#+end_src
Le script va modifier le bam, le trier et générer le fastq. !!!
Attention: ne pas oublier l'option -n !!!
#+begin_src sh
time julia --project=.. insertVariant.jl
scp 63003856_S135_chr22_{1,2}.fq.gz meso:/Work/Users/apraga/bisonex/tests/bamscissors/
#+end_src
**** TODO Phase 1 : chr22, VAF=1
***** DONE 1 SNV : ok !
CLOSED: [2023-05-20 Sat 19:35] SCHEDULED: <2023-05-20 Sat>
#+begin_src
make run READS="tests/bamscissors/corrected_{1,2}.fq.gz"
Puis on lance le pipeline sur correct_1
- [X] Variant visible dans IGV
- [X] Variant visible après alignement
- [X] Variant visible après appel de variant
***** TODO Tous les SNV du chromosome
SCHEDULED: <2023-05-20 Sat>
**** TODO PHase 2 : chr22, VAF variable: de nombreux reads sont perdus
Problème dansr le BAM car sans les insertion
***** DONE Filtrer les reads sans pair : idem
CLOSED: [2023-05-23 Tue 00:01]
FOUND:Found 16662 unpaired mates
at htsjdk.samtools.SAMUtils.processValidationError(SAMUtils.java:470)
at picard.sam.SamToFastq.doWork(SamToFastq.java:224)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:308)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)
ex: chr22:42213078
On essaie
samtools view ~/code/bisonex/out/63003856/preprocessing/mapped/63003856_S135.bam NC_000022.11 -hf 0x2 -o 63003856_chr22.bam
Ne rale pas...
SCHEDULED: <2023-05-22 Mon>
***** TODO Problème de la conversion BAM -> fastq ?
****** TODO Test alignement bamtofastq sur le BAM originel: idem
Dans ~/recherche/code/bisonex/Bamscissors.jl
samtools view ~/code/bisonex/out/63003856/preprocessing/mapped/63003856_S135.bam NC_000022.11 -hf 0x2 -o 63003856_chr22.bam
On trie bien par nom de read
samtools sort 63003856_chr22.bam -o 63003856_chr22_sorted.bam
Conversion en fastq. Attention à bien prendre le *fichier trié* !
samtools bam2fq -1 63003856_chr22_init1.fq.gz -2 63003856_chr22_init2.fq.gz -n 63003856_chr22_sorted.bam
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 2592110 reads
On envoie sur le mésocentre
scp 63003856_chr22_init{1,2}.fq.gz meso:/Work/Users/apraga/bisonex/tests/bamscissors
On démarre un job avec 24 coeurs pour aller vite
srun -c 24 -p smp -t 1:00:00 --pty bash
bwa mem -t 24 /Work/Projects/bisonex/data/genome/GRCh38.p13/bwa/genomeRef 63003856_chr22_init1.fq.gz 63003856_chr22_init2.fq.gz | samtools sort -@24 -o output.bam -
Dans /home/alex/recherche/bisonex/code/BamScissors.jl/aligned
❯ scp meso:/Work/Users/apraga/bisonex/tests/bamscissors/output.bam*
****** TODO Test picard : idem
Dans bisonex/code/BamScissors.jl
❯ picard SamToFastq -I 63003856_chr22_sorted.bam -F testpicard1.fastq -F2 testpicard2.fastq
bwa mem -t 24 /Work/Projects/bisonex/data/genome/GRCh38.p13/bwa/genomeRef testpicard1.fastq.gz testpicard2.fastq.gz | samtools sort -@24 -o testpicard.bam -
*** Divers
**** DONE Vérifier nombre de reads fastq - bam
CLOSED: [2022-10-09 Sun 22:31]
r qu'on retrouve les variants
CLOSED: [2023-05-19 Fri 18:41] SCHEDULED: <2023-05-18 Thu>
****** TODO Corrigier position pour avoir une bonne VAF
SCHEDULED: <2023-05-19 Fri>
POS POS+1 VAF ALT
Attention, la base corrigée est à POS+1...
****** DONE Comparaison manuelle avec julia (VAF = 1...)
CLOSED: [2023-05-19 Fri 21:58]
552 found over 585
***** TODO del
***** TODO ins
*** TODO Julia : Bamscissors :bamscissors:
Modification de la séquence dans BAM.
*Pas de mise à jour de CIGAR*
On convertit en fastq et on lance le pipeline pour "corriger"
#+begin_src sh
cd /home/alex/code/bisonex/out/63003856/preprocessing/mapped
samtools view 63003856_S135.bam NC_000022.11 -o 63003856_S135_chr22.bam
cd /home/alex/recherche/bisonex/code/BamScissors.jl
cp ~/code/bisonex/out/63003856/preprocessing/mapped/63003856_S135_chr22.bam .
samtools index 63003856_chr22.bam
#+end_src
Le script va modifier le bam, le trier et générer le fastq. !!!
Attention: ne pas oublier l'option -n !!!
#+begin_src sh
time julia --project=.. insertVariant.jl
scp 63003856_S135_chr22_{1,2}.fq.gz meso:/Work/Users/apraga/bisonex/tests/bamscissors/
#+end_src
**** TODO Phase 1 : chr22, VAF=1
***** DONE 1 SNV : ok !
CLOSED: [2023-05-20 Sat 19:35] SCHEDULED: <2023-05-20 Sat>
#+begin_src
make run READS="tests/bamscissors/corrected_{1,2}.fq.gz"
Puis on lance le pipeline sur correct_1
- [X] Variant visible dans IGV
- [X] Variant visible après alignement
- [X] Variant visible après appel de variant
***** TODO Tous les SNV du chromosome
SCHEDULED: <2023-05-20 Sat>
**** TODO PHase 2 : chr22, VAF variable: de nombreux reads sont perdus
Problème dansr le BAM car sans les insertion
***** DONE Filtrer les reads sans pair : idem
CLOSED: [2023-05-23 Tue 00:01]
FOUND:Found 16662 unpaired mates
at htsjdk.samtools.SAMUtils.processValidationError(SAMUtils.java:470)
at picard.sam.SamToFastq.doWork(SamToFastq.java:224)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:308)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)
ex: chr22:42213078
On essaie
samtools view ~/code/bisonex/out/63003856/preprocessing/mapped/63003856_S135.bam NC_000022.11 -hf 0x2 -o 63003856_chr22.bam
Ne rale pas...
SCHEDULED: <2023-05-22 Mon>
***** TODO Problème de la conversion BAM -> fastq ?
****** DONE Test bamtofastq sur le BAM originel: idem
CLOSED: [2023-05-23 Tue 01:16]
Dans ~/recherche/code/bisonex/Bamscissors.jl
samtools view ~/code/bisonex/out/63003856/preprocessing/mapped/63003856_S135.bam NC_000022.11 -hf 0x2 -o 63003856_chr22.bam
On trie bien par nom de read
samtools sort -n 63003856_chr22.bam -o 63003856_chr22_sorted.bam
Conversion en fastq. Attention à bien prendre le *fichier trié* !
samtools bam2fq -1 63003856_chr22_init1.fq.gz -2 63003856_chr22_init2.fq.gz -n 63003856_chr22_sorted.bam
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 2592110 reads
On envoie sur le mésocentre
scp 63003856_chr22_init{1,2}.fq.gz meso:/Work/Users/apraga/bisonex/tests/bamscissors
On démarre un job avec 24 coeurs pour aller vite
srun -c 24 -p smp -t 1:00:00 --pty bash
bwa mem -t 24 /Work/Projects/bisonex/data/genome/GRCh38.p13/bwa/genomeRef 63003856_chr22_init1.fq.gz 63003856_chr22_init2.fq.gz | samtools sort -@24 -o output.bam -
Dans /home/alex/recherche/bisonex/code/BamScissors.jl/aligned
❯ scp meso:/Work/Users/apraga/bisonex/tests/bamscissors/output.bam*
****** DONE Avec samtools fastq
CLOSED: [2023-05-23 Tue 01:16]
samtools sort -n 63003856_chr22.bam -o 63003856_chr22_sorted.bam
samtools fastq 63003856_chr22_1.fq.gz -2 63003856_chr22_2.fq.gz -0 /dev/null -s /dev/null -o 63003856_chr22_sorted.bam
scp 63003856_chr22_{1,2}.fq.gz meso:/Work/Users/apraga/bisonex/tests/bamscissors/
mesocentre
bwa mem -t 24 /Work/Projects/bisonex/data/genome/GRCh38.p13/bwa/genomeRef 63003856_chr22_1.fq.gz 63003856_chr22_2.fq.gz | samtools sort -@24 -o output.bam
scp meso:/Work/Users/apraga/bisonex/tests/bamscissors/output.bam\* aligned/
****** DONE En rajoutant .fna dans le doserrie (+samtools fastq): idem
CLOSED: [2023-05-23 Tue 01:16]
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/bwa/genomeRef* .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef* .
bwa mem -t 24 genomeRef 63003856_chr22_1.fq.gz 63003856_chr22_2.fq.gz | samtools sort -@24 -o output.bam
****** DONE Test picard : idem
CLOSED: [2023-05-23 Tue 01:16]
Dans bisonex/code/BamScissors.jl
❯ picard SamToFastq -I 63003856_chr22_sorted.bam -F testpicard1.fastq -F2 testpicard2.fastq
bwa mem -t 24 /Work/Projects/bisonex/data/genome/GRCh38.p13/bwa/genomeRef testpicard1.fastq.gz testpicard2.fastq.gz | samtools sort -@24 -o testpicard.bam -
****** DONE Il ne manque pas des reads dans les fastq :
CLOSED: [2023-05-23 Tue 01:26]
bisonex/code/BamScissors.jl on bamscissors [!?] via ஃ v1.9.0
❯ samtools flagstat aligned/output.bam
1306273 + 0 in total (QC-passed reads + QC-failed reads)
1296055 + 0 primary
0 + 0 secondary
10218 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
1304871 + 0 mapped (99.89% : N/A)
1294653 + 0 primary mapped (99.89% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
bisonex/code/BamScissors.jl on bamscissors [!?] via ஃ v1.9.0
❯ samtools flagstat 63003856_chr22.bam
2609214 + 0 in total (QC-passed reads + QC-failed reads)
2592110 + 0 primary
0 + 0 secondary
17104 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
2609214 + 0 mapped (100.00% : N/A)
2592110 + 0 primary mapped (100.00% : N/A)
2592110 + 0 paired in sequencing
1296055 + 0 read1
1296055 + 0 read2
2592110 + 0 properly paired (100.00% : N/A)
2592110 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
❯ echo $(zcat 63003856_chr22_init2.fq.gz | wc -l)/4 | bc
1296055
bisonex/code/BamScissors.jl on bamscissors [!?] via ஃ v1.9.0 took 2s
❯ echo $(zcat 63003856_chr22_init1.fq.gz | wc -l)/4 | bc
1296055
❯ samtools flagstat 63003856_chr22_sorted.bam
2609214 + 0 in total (QC-passed reads + QC-failed reads)
2592110 + 0 primary
0 + 0 secondary
17104 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
2609214 + 0 mapped (100.00% : N/A)
2592110 + 0 primary mapped (100.00% : N/A)
2592110 + 0 paired in sequencing
1296055 + 0 read1
1296055 + 0 read2
2592110 + 0 properly paired (100.00% : N/A)
2592110 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
****** DONE Avec sort -O BAM idem
CLOSED: [2023-05-23 Tue 01:21]
****** DONE Singleton ou non mapped ? nonVirtPosition
CLOSED: [2023-05-23 Tue 01:22]
❯ samtools bam2fq -1 63003856_chr22_init1.fq.gz -2 63003856_chr22_init2.fq.gz -0 both -s single.fq.gz -n 63003856_chr22_sorted.bam
bisonex/code/BamScissors.jl on bamscissors [!?] via ஃ v1.9.0
❯ wc -l 63003856_chr22_init* both single.fq.gz
403540 63003856_chr22_init1.fq.gz
404788 63003856_chr22_init2.fq.gz
0 both
0 single.fq.gz
808328 total
****** Problème d'aligner car les reads sont bien dans le .fastq ?
Ex:
zgrep "A00853:477:HMLWYDSX3:2:2624:2826:18630" 63003856_chr22_{1,2}.fq.gz
63003856_chr22_1.fq.gz:@A00853:477:HMLWYDSX3:2:2624:2826:18630
63003856_chr22_2.fq.gz:@A00853:477:HMLWYDSX3:2:2624:2826:18630
***** Refaire la manip sur bam chr22 non modifié + mail Alexis
[1]_samtools view ~/code/bisonex/out/63003856/preprocessing/mapped/63003856_S135.bam NC_000022.11 -f 0x2 -o 63003856_chr22.bam
Les reads sont bien tous mappé
samtools flagstat 63003856_chr22.bam
2609214 + 0 in total (QC-passed reads + QC-failed reads)
2592110 + 0 primary
0 + 0 secondary
17104 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
2609214 + 0 mapped (100.00% : N/A)
[2] samtools sort -n 63003856_chr22.bam -o 63003856_chr22_sorted.bam
flagstat idem
[3] samtools fastq -1 63003856_chr22_1.fq.gz -2 63003856_chr22_2.fq.gz -0 /dev/null -s /dev/null -n 63003856_chr22_sorted.bam
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 2592110 reads
Nombre de reads ok (= la moitié) et les séquences ne sont pas identiques pour le premier read (= on n'a pas 2x la même chose)
echo $(zcat 63003856_chr22_1.fq.gz | wc -l)/4 | bc
1296055
echo $(zcat 63003856_chr22_2.fq.gz | wc -l)/4 | bc
1296055
[4]
bwa mem -t 24 /Work/Projects/bisonex/data/genome/GRCh38.p13/bwa/genomeRef 63003856_chr22_1.fq.gz 63003856_chr22_2.fq.gz -o wtf.bam
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 1752492 sequences (240000014 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (12, 702821, 0, 18)
[M::mem_pestat] analyzing insert size distribution for orientation FF...
[M::mem_pestat] (25, 50, 75) percentile: (54, 197, 268)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 696)
[M::mem_pestat] mean and std.dev: (163.92, 129.71)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 910)
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (129, 175, 233)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 441)
[M::mem_pestat] mean and std.dev: (185.56, 75.37)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 545)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation RR...
[M::mem_pestat] (25, 50, 75) percentile: (56, 192, 239)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 605)
[M::mem_pestat] mean and std.dev: (158.00, 110.30)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 788)
[M::mem_pestat] skip orientation FF
[M::mem_pestat] skip orientation RR
[M::process] read 839618 sequences (114952336 bp)...
[M::mem_process_seqs] Processed 1752492 reads in 375.714 CPU sec, 17.645 real sec
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 336379, 0, 5)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (128, 174, 232)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 440)
[M::mem_pestat] mean and std.dev: (184.73, 74.63)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 544)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 839618 reads in 183.039 CPU sec, 7.961 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem -t 24 -o wtf.bam /Work/Projects/bisonex/data/genome/GRCh38.p13/bwa/genomeRef 63003856_chr22_1.fq.gz 63003856_chr22_2.fq.gz
[main] Real time: 38.278 sec; CPU: 565.821 sec
Bon nombre de reads pourtant
samtools flagstat wtf.bam
2611059 + 0 in total (QC-passed reads + QC-failed reads)
2592110 + 0 primary
0 + 0 secondary
18949 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
2611058 + 0 mapped (100.00% : N/A)
2592109 + 0 primary mapped (100.00% : N/A)
2592110 + 0 paired in sequencing
1296055 + 0 read1
1296055 + 0 read2
2590970 + 0 properly paired (99.96% : N/A)
2592108 + 0 with itself and mate mapped
1 + 0 singletons (0.00% : N/A)
458 + 0 with mate mapped to a different chr
63 + 0 with mate mapped to a different chr (mapQ>=5)
$ samtools sort -@24 -o wtf_sorted.bam wtf.sam
[bam_sort_core] merging from 0 files and 24 in-memory blocks...
samtools flagstat wtf.bam
2611059 + 0 in total (QC-passed reads + QC-failed reads)
2592110 + 0 primary
0 + 0 secondary
18949 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
2611058 + 0 mapped (100.00% : N/A)
2592109 + 0 primary mapped (100.00% : N/A)
2592110 + 0 paired in sequencing
1296055 + 0 read1
1296055 + 0 read2
2590970 + 0 properly paired (99.96% : N/A)
2592108 + 0 with itself and mate mapped
1 + 0 singletons (0.00% : N/A)
458 + 0 with mate mapped to a different chr
63 + 0 with mate mapped to a different chr (mapQ>=5)
Effectivement, ce n'est pas un problème d'IGV
❯ samtools mpileup 63003856_chr22.bam -r NC_000022.11:42213078-42213078
[mpileup] 1 samples in 1 input files
NC_000022.11 42213078 N 19 TTtTTtTTTtTTtTtTttt kkk_kkkkFkFF_FkFkQk
bisonex/code/BamScissors.jl on bamscissors [!?] via ஃ v1.9.0
❯ samtools mpileup aligned/wtf_sorted.bam -r NC_000022.11:42213078-42213078
[mpileup] 1 samples in 1 input files
NC_000022.11 42213078 N 5 TTtTT _FkFF
*** Divers
**** DONE Vérifier nombre de reads fastq - bam
CLOSED: [2022-10-09 Sun 22:31]
- Gélose au sang = non sélectif. Pouvoir hémolytique
- Bouilon Schaedler = bactéries anaérobie et exigeantes (incubation 10-14jours en anaérobie) -> prélèvement profonds, IO
- Chromogène urinaire = non sélectif
- rose = activité \beta-galactosidase
- bleue = activité \beta-glucotosidase
- violet = les 2
- brun = tryptophane désaminase
** Autres
- Chocolat enrichi :bactéries exigeants (prélèvmente profond)
- HAE2 :inhibie bactéries gram + -> Haemophilus spp
- VCAT : Neisseria (Vancomycine, Colistine, Amphotéricine, Triméthoprime)
- Drigalski : inhibe Gram+, lactose ->
- jaunes = BGN fermentant lactose
- bleu = BGN ne fermentant pas le lactose
- Cétrimide (ammoniuum)) -> P. aeruginosa (production pycyanine)
- BCSA : inhibe gram+ ->_Burkholderia
- Chappman : NaCL, mannital
- jaune = fermente mannitol -> S. aureus
- CAP = inhibie Gram - () -> streptocoque+++, listeria, corynebcaterie, uropathogène émergents
- Granada (MTX) : inhibe Gram - -> S. agalactiae
- BCYE : Legionnella)
- Haja-Kligler : lactose en haut, glucose en bas
- rouge en haut = pas de fermentation de lactose. Sinon jaune
- rouge en bas = pas de fermentation du glucose. Sinon jaune
Exemple:
- Shigella = lactose-, glucose+
- Pseudomonas aeruginosa = lactose-, glucose - (tout rouge)
- Salmonella = tout noir (production H2S)
- E.coli = tout jaune (lactose+, glucose+)
** Coproculture = Salmonelle, Shigelles, Campylobacter
- Gélose Hektoen : inhibiteur de gram+, sucres (lactose, saccharose)
- centre noir = formation de H2S ->_Salmonelle
- vert/bleu = pas de fermetation -> Salmonelle/Shigelle
- jaune/orange = fermetation
- Bouillon sélénite (inhibe coliforme + entérocoqu, enrichi pour Salmonelle)
- Gélose Salmonelle-Shiguelle : inhibiteur gram+, _lactose_
- centre noir = H2S -> Salmonelle
- jaune = pas de fermentation -> Salmonelle/Shigelle
- rose/rouge = fermentation lactose
- Gélose sélective Campylobacter (inhibe Gram+, Gram-, champignon)
- Yersinia: mannitol, inhibiteur GRam+ (dont Proteus), Gram- (dont Pseudomonas aeruginosa)
- rouge = fermente mannitol -> Yersinia
** Gram
- positif si paroi épaisse de peptidoglycane-> coloration persiste
- négative si paroi plus fine qui permet à l'éthanol de laver la coloration
Aérobies
#+BEGIN_SRC dot :file aerobies.png :exports results
graph {
splines=false;
node [shape=box]
cocciPlus [label="Cocci"]
cocciMoins [label="Cocci"]
bacciPlus [label="Bacilles"]
bacciMoins [label="Bacilles"]
"Aérobies" -- {"Gram +" "Gram -"}
"Gram +" -- {cocciPlus bacciPlus}
cocciPlus -- {"Amas" "Chaînettes" "Entérocoques"}
bacciPlus -- {"Listeria\nCorynebacterium\nBacillus\nErysipelothrix\nNocardia"}
"Amas" -- "Staph. aureus\nStaph coagulase négative"
"Chaînettes" -- {
"Strepto hémolytiques"
"Strepto pneumonia"
"Autres strepto"
}
"Strepto hémolytiques" -- {"Strepto. pyogenes (groupe A)\nStrep. agalactiae (B)\nStrepto dysgalactiae"}
"Gram -" -- {cocciMoins bacciMoins}
cocciMoins -- "Neisseria\nmenigitidis/\ngonorrhoeae"
bacciMoins -- {
"Entérobactéries\nE. coli, Klebsiella\nEnterobacter\nSerratia\nPretous\nSalmonella\nShigella\nYersinia\nCitrobacter"
"Autres:\nPseudomonas\nStenotrophomonoas\nAcinetobacter\nCampolybacter\nVibrio\nBordetella\Haemophilius"
}
}
#+END_SRC
#+RESULTS:
[[file:aerobies.png]]
#+BEGIN_SRC dot :file anaerobies.png :exports results
graph {
node [shape=box]
"Anaérobies" -- {"Gram+" "Gram-"}
"Gram+" -- "Clostridium tetani, botulinum, perfringens, difficile\nPeptococcus\nPropionibacterium\nActinomyces"
"Gram-" -- "Bacteroides\nFusobacterium\nPrevotella\nPorphyromonas"
}
#+END_SRC
#+RESULTS:
[[file:anaerobies.png]]
#+BEGIN_SRC dot :file autres.png :exports results
graph {
node [shape=box]
"Autres bactéries" -- {"Atypiques" "Spirochètes" "Mycobactéries" "Autres"}
"Atypiques" -- {"Intracellulaire\nChlamydia\nRickettsiales\nBartonella\nCoxiella" "Sans paroi\nMycoplasma\nUreaplasma"}
"Spirochètes" -- "Treponema\nBorrelia\nLeptospira"
"Mycobactéries" -- "M. tuberculosis\nleprae\atypiques"
"Autres" -- "Tropheryma whipplei"
}
#+END_SRC
#+RESULTS:
[[file:autres.png]]
** Catalase
oxydoréductase qui intervient dans la résistance à la bactéricidie par le peroxyde d’oxygène
| | Catalase + | Catalase - |
|----------------+-------------------+---------------|
| | aérobies strictes | |
| Cocci gram+ | staph | strepto |
| Bacille Gram + | Listeria | Lactobacillus |
| | Corynebacterium | |
| | Propionobacterium | |
** Oxydase
détecte les cytochrome de type c (pigments hémo-protéiques dans la plupart des cellules vivante)
| Oxydase + | Oxydase - |
|----------------+-----------|
| Aeromonas | |
| Bordetella | |
| Branhamella | |
| Brucella | |
| Burkholderia | |
| Campylobacter | |
| Flavobacterium | |
| Moraxella | |
| Neisseria | staph |
| Pasteurella | |
| Plesiomonas | |
| Pseudomonas | |
| Vibrio | |
** Mobile :
- cocci gram + : non (sauf Enterococcus casseliflavus, gallinarum; Vagococcus, Planococcus)
- Baccile gram+ : Listeria (tournoyante) , Bacillus (ondulante)
- Bacille gram- :
- Vibior, Aeromonas
- Legionella
- Entérobactéries : sauf Shigella, Klebsiella
- Capsule =e défense contre phagocytose, adhésion à la cible
- Protéine M de surface = adhésion aux muqueuses (pharyng, cutané), protection contre la phagocytose
résistance à la phacogytose
- protéine M
- encapsulation
Facteurs de virulence
- enzyme :hyluronidase, stroplysine O et S (favorisent l’invasion tissulaire)
- exotexonie : activation et prolifération d’une sous population lymphocytes T -> cytokine proinflammatoires
- β-lactamne : sensbile
- macrosible : sauvae sensible, résistance par mécansime d’effluxe, modiifcation cible ARN23S
- β-lactamine : sensible
- macrolide : sauvage sensible, résistance par mécanisme d’efflux, modiifcation cible ARN23S
*** Pneumocoque
Gram+ diplocoque encapsulé à multiplication extracullaire.
Classification selon la capsule (vaccins)
**** Habitat
Voie respratoire supérieure
Transmission goutelette, interhumaine
**** Pathogénicité
- adhérence cellules épithéliase rhinopharunx
- facteurs de virulence non caplusaire
- évasion à la phagocytose
- actionation complément, cytokien inflammatoire
**** Résistance
- β-lactamine: Selon les PLP (!inutile d’ulitiser les inhibiteurs de betalactamase)
- sensbilité possiblement dimunée aux fluoroquinolones
**** Clinique
- Infection neuroméningée
- Infection voies respiratoire: pneumonie franche lobaire aigüe, bronchopneumonie, otite, mastoïdite, sinusite, exacerbation BPCO
- Rare : purpura fulminas, endocardite
- Bactérimié, souvent à partir d’un foyer pulmonaire
* Classification
Espèce majoritaire: E. faecalis = 80-90=, faecium = 5-10%
**** Habitat
Ubiquitaire. Surtout tube digestif (homme, animaux), milieu extérieur
Home sain : tube digestif, périnée, parfois vagin, oropharynx
Pulpart des infections = à partir de la flère du patient. Mais exogène possible.
**** Facteurs de virulence
- Pas d’exotoxine, ni de superantigène
- Protéine de surface -> adhère à l’endocarde et l’urothelium -> endocardite et infections urinaire
**** Résistance
- Nombreux antibio ...
- Naturelle : C3G
- β-lactamine/glycopeptide seul = seulement effect bactériostatique sur > 90%. Mais aminoside + inhibiteur de la paroi (β-lactamine, glycopepited, lipopetide) = synergie
Attention: en cas de résistance surajoutée, les aminosides sont inefficaces.
**** Clinique
- Infection urinaire
- Infection de la peau et des parties molles
- Endocardite
- Bactériémie
- Infection abdopelvienne
** Neisseria meningitidis
- Diplococque Gram - aérobie.
- Très gragile. Hautement variable
- Épidémio :
- 2 pics : nourisson < 1 an (système immunitaire immature), ado/jeune adulte (socialisation)
- ceinture de la ménigitde (Afrique sahel + subusaharanienne)
*** Habitat
réservoir 100% humaine. Transmission directement uniquement par goutelettes
Portage pharyngé, avec rarement invasion (sang +/- LCS) -> seulement souches
* Culture
Gram
- positif si paroi épaisse de peptidoglycane -> coloration persiste
- négative si paroi plus fine qui permet à l'éthanol de laver la coloration
Catalase = oxydoréductase qui intervient dasn la résistance à la bactéricidie par le peroxyde d’oxygène
- permet de différencier dans les cocci à Gram + les staphylocoque (catalase+) des streptocoque
Oxydase : détecte les cytochrome de type c (pigmen hémo-protéiques dans la plupart des cellules vivante)
- plupart des staph sont oxydase-négative
- Gélose au sang = non sélectif. Pouvoir hémolytique
- Bouilon Schaedler = bactéries anaérobie et exigeantes (incubation 10-14jours en anaérobie) -> prélèvement profonds, IO
- Chromogène urinaire = non sélectif
- rose = activité \beta-galactosidase
- bleue = activité \beta-glucotosidase
- violet = les 2
- brun = tryptophane désaminase
** Autres
- Chocolat enrichi :bactéries exigeants (prélèvmente profond)
- HAE2 :inhibie bactéries gram + -> Haemophilus spp
- VCAT : Neisseria (Vancomycine, Colistine, Amphotéricine, Triméthoprime)
- Drigalski : inhibe Gram+, lactose ->
- jaunes = BGN fermentant lactose
- bleu = BGN ne fermentant pas le lactose
- Cétrimide (ammoniuum)) -> P. aeruginosa (production pycyanine)
- BCSA : inhibe gram+ ->_Burkholderia
- Chappman : NaCL, mannital
- jaune = fermente mannitol -> S. aureus
- CAP = inhibie Gram - () -> streptocoque+++, listeria, corynebcaterie, uropathogène émergents
- Granada (MTX) : inhibe Gram - -> S. agalactiae
- BCYE : Legionnella)
- Haja-Kligler : lactose en haut, glucose en bas
- rouge en haut = pas de fermentation de lactose. Sinon jaune
- rouge en bas = pas de fermentation du glucose. Sinon jaune
Exemple:
- Shigella = lactose-, glucose+
- Pseudomonas aeruginosa = lactose-, glucose - (tout rouge)
- Salmonella = tout noir (production H2S)
- E.coli = tout jaune (lactose+, glucose+)
** Coproculture = Salmonelle, Shigelles, Campylobacter
- Gélose Hektoen : inhibiteur de gram+, sucres (lactose, saccharose)
- centre noir = formation de H2S ->_Salmonelle
- vert/bleu = pas de fermetation -> Salmonelle/Shigelle
- jaune/orange = fermetation
- Bouillon sélénite (inhibe coliforme + entérocoqu, enrichi pour Salmonelle)
- Gélose Salmonelle-Shiguelle : inhibiteur gram+, _lactose_
- centre noir = H2S -> Salmonelle
- jaune = pas de fermentation -> Salmonelle/Shigelle
- rose/rouge = fermentation lactose
- Gélose sélective Campylobacter (inhibe Gram+, Gram-, champignon)
- Yersinia: mannitol, inhibiteur GRam+ (dont Proteus), Gram- (dont Pseudomonas aeruginosa)
- rouge = fermente mannitol -> Yersinia
* IRC
Doom-emacs utilise circe, à démarrer avec ~=irc~ (et non circe)