apraga/org - Change CITR3HJPJK4B4MTFFONEPTMNY76SPTJ47ZU2J74S5LOBNPHKMSHQC

UTF8 update

Created by Alexis Praga on April 29, 2024

CITR3HJPJK4B4MTFFONEPTMNY76SPTJ47ZU2J74S5LOBNPHKMSHQC

Dependencies

In channels

main

Change contents

File deletion: utf8.org, utf_8.md, utf_8.md, 20230511180443-utf_8.org, utf_8.org, utf8.md, utf8.org

BFD:BFD[3.29] → [3.206:238]

BF:BFD[3.238] → [5.90622:90622]

BF:BFD[6.2] → [7.398:430]

BF:BFD[7.430] → [5.90622:90622]

BFD:BFD[3.29] → [8.5245:5277]

BF:BFD[8.5277] → [5.90622:90622]

BFD:BFD[3.29] → [4.5556:5604]

BF:BFD[4.5604] → [5.90622:90622]

BFD:BFD[3.29] → [9.47402:47435]

BF:BFD[9.47435] → [5.90622:90622]

BFD:BFD[6.2] → [5.91679:91710]

BF:BFD[5.91710] → [5.90622:90622]

BFD:BFD[6.2] → [10.112410:112442]

BF:BFD[10.112442] → [5.90622:90622]

∅:D[8.5340] → [5.90671:90672]

∅:D[10.112489] → [5.90671:90672]

B:BD[5.90671] → [5.90671:90672]

∅:D[8.5374] → [5.90734:91019]

∅:D[10.112591] → [5.90734:91019]

B:BD[5.90734] → [5.90734:91019]

∅:D[8.5446] → [5.91089:91678]

B:BD[5.91089] → [5.91089:91678]

B:BD[5.91019] → [8.5375:5446]

B:BD[5.90672] → [8.5341:5374]

∅:D[8.5288] → [4.5687:5702]

B:BD[4.5687] → [4.5687:5702]

B:BD[4.5702] → [8.5289:5340]

B:BD[5.90622] → [8.5278:5288]


# UTF8 and Bytestring in Haskell
```{=org}
#+filetags: cs
```
(From \@Melissa on discord #haskell-beginners)
Text (and String) represent a sequence of Unicode codepoints. These are
numbers in the range of 0 to 0x10FFFF which have been assigned certain
properties (including a name) by the Unicode standard, such as U+0065
LATIN SMALL LETTER E, which, besides the name I listed, is a Lowercase
Letter. However, you can't put codepoints in a file, you can only put
bytes. You need some way to translate codepoints into bytes. These ways
are encodings, and UTF-8 is the most popular one. UTF-8 has the neat
property that all the codepoints that correspond to the ASCII characters
(U+0000 to U+007F) are encoded as single bytes identical to their ASCII
representation. But any other codepoint is encoded as multiple bytes.
So, U+00E9 LATIN SMALL LETTER E WITH ACUTE (decimal 233) is encoded as
the two-byte sequence 0xC3 0xA9 (195 169 in decimal). ByteStrings
represent bytes, as found in the file. So to get back to the codepoint,
you have to decodeUtf8.

File deletion: utf8.md

BF:BFD[6.2] → [2.1002:1033]

BF:BFD[2.1033] → [2.781:781]

B:BD[2.781] → [2.782:1001]

To type Unicode characters under Linux, enable ibus and define a shortcut under "Emoji" -> "Unicode set point", e.g. Ctrl-Shift-U.
Then press the shortcut, type 2245 (for U+2245) and press space. This produces the approximately-equal sign (≅).

File addition: 202404291043 - UTF-8.md (----------)

[6.2]

---
title: UTF-8
date: 2024-04-29
tags: encoding haskell
---
# UTF-8 and ByteString in Haskell
(From \@Melissa on discord #haskell-beginners)
A `Text` (and a `String`) is a sequence of Unicode *code points*: numbers to which the Unicode standard assigns properties, such as a name.
For example, U+0065 is `e` (LATIN SMALL LETTER E).
The problem is that a file only contains bytes, so code points have to be translated into bytes. These translation schemes are called encodings.
UTF-8 is the most popular one. It has the property that every ASCII character (U+0000 to U+007F) fits in a single byte that is exactly its ASCII representation.
Every other code point takes several bytes (two to four). For example, `é` (U+00E9) is encoded as 195 169 (0xC3 0xA9), i.e. 2 bytes.
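
A minimal sketch of the round trip between code points and bytes, assuming the `text` and `bytestring` packages are available (both ship with GHC). The names `e` and `eAcute` are only illustrative.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8, decodeUtf8', encodeUtf8)

main :: IO ()
main = do
  let e      = T.pack "e"      -- U+0065
      eAcute = T.pack "\233"   -- U+00E9, 'é'
  -- An ASCII character becomes one byte, equal to its ASCII code.
  print (BS.unpack (encodeUtf8 e))        -- [101]
  -- A non-ASCII character becomes several bytes: 0xC3 0xA9.
  print (BS.unpack (encodeUtf8 eAcute))   -- [195,169]
  -- In Text it is still a single code point.
  print (T.length eAcute)                 -- 1
  -- Decoding turns the bytes back into the code point.
  print (decodeUtf8 (BS.pack [195, 169]) == eAcute)   -- True
  -- decodeUtf8' returns a Left for invalid input (here a truncated sequence).
  print (either (const False) (const True) (decodeUtf8' (BS.pack [195])))   -- False
```

`decodeUtf8` throws on invalid bytes, so `decodeUtf8'` is probably the safer choice when reading files of unknown origin.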
# Typing characters under Linux
Enable ibus and define a shortcut under "Emoji" -> "Unicode set point", e.g. Ctrl-Shift-U.
Then press the shortcut, type 2245 (for U+2245) and press space. This produces the approximately-equal sign (≅).
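
The digits typed after the shortcut are the code point in hexadecimal. A small sketch of that mapping in Haskell, using only `base`; `fromHexCodePoint` is a hypothetical helper, not part of ibus or any library.

```haskell
import Data.Char (chr)
import Numeric (readHex)

-- Hypothetical helper: interpret a string of hex digits as a code point.
-- Returns Nothing if the input is not pure hex or exceeds U+10FFFF.
fromHexCodePoint :: String -> Maybe Char
fromHexCodePoint s = case readHex s of
  [(n, "")] | n <= 0x10FFFF -> Just (chr n)
  _                         -> Nothing

main :: IO ()
main = do
  -- "2245" is hexadecimal for decimal 8773, i.e. U+2245.
  print (fromHexCodePoint "2245" == Just '\x2245')   -- True
  -- Non-hex input is rejected.
  print (fromHexCodePoint "zz" == Nothing)           -- True
```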