```{=org}
#+filetags: cs
```
# UTF8 and Bytestring in Haskell
(From \@Melissa on discord #haskell-beginners)

Text (and String) represent a sequence of Unicode codepoints. These are
numbers in the range of 0 to 0x10FFFF which have been assigned certain
properties (including a name) by the Unicode standard, such as U+0065
LATIN SMALL LETTER E, which, besides the name I listed, is a Lowercase
Letter. However, you can\'t put codepoints in a file, you can only put
bytes. You need some way to translate codepoints into bytes. These ways
are encodings, and UTF-8 is the most popular one. UTF-8 has the neat
property that all the codepoints that correspond to the ASCII characters
(U+0000 to U+007F) are encoded as single bytes identical to their ASCII
representation. But any other codepoint is encoded as multiple bytes.
So, U+00E9 LATIN SMALL LETTER E WITH ACUTE (decimal 233) is encoded as
the two-byte sequence 0xC3 0xA9 (195 169 in decimal). ByteStrings
represent bytes, as found in the file. So to get back to the codepoints,
you have to decodeUtf8.
---
title: UTF-8
date: 2024-04-29
tags: encoding haskell
---
# UTF8 and Bytestring in Haskell
(From \@Melissa on discord #haskell-beginners)
A `Text` (and a String) is a sequence of Unicode *codepoints*: numbers to which the standard assigns properties.
For example, U+0065 is `e` (LATIN SMALL LETTER E).
The problem is that a file contains only bytes, so codepoints have to be translated into bytes. The translation schemes are called encodings.
UTF-8 is the most popular one. It has the property that every ASCII character fits in a single byte that is exactly its ASCII representation.
Other characters take several bytes. For example, `é` is encoded as 195 169 (so 2 bytes).
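
The round trip between codepoints and bytes can be sketched in Haskell as follows. This is a minimal illustration, assuming the `text` and `bytestring` packages that ship with GHC; the helper names `codepoint`, `isLowercaseLetter`, `utf8Bytes` and `fromUtf8` are made up for this note and are not library functions.

```haskell
import qualified Data.ByteString as BS
import           Data.Char (GeneralCategory (..), generalCategory, ord)
import qualified Data.Text as T
import           Data.Text.Encoding (decodeUtf8', encodeUtf8)
import           Data.Word (Word8)

-- | The codepoint of a character as a plain number ('e' is 0x65, i.e. 101).
codepoint :: Char -> Int
codepoint = ord

-- | One of the properties Unicode assigns to a codepoint: its general category.
isLowercaseLetter :: Char -> Bool
isLowercaseLetter c = generalCategory c == LowercaseLetter

-- | Encode a Text as UTF-8 and expose the resulting bytes as numbers.
utf8Bytes :: T.Text -> [Word8]
utf8Bytes = BS.unpack . encodeUtf8

-- | Decode raw bytes back into Text; Nothing when the bytes are not valid UTF-8.
fromUtf8 :: [Word8] -> Maybe T.Text
fromUtf8 ws = either (const Nothing) Just (decodeUtf8' (BS.pack ws))
```

Loaded in GHCi, `utf8Bytes (T.pack "e")` gives `[101]` and `utf8Bytes (T.pack "é")` gives `[195,169]`, while `fromUtf8 [195,169]` recovers the original `é`. The sketch uses `decodeUtf8'` rather than `decodeUtf8` so that invalid input, such as the lone byte `[195]`, yields `Nothing` instead of throwing an exception.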
# Typing characters on Linux
Enable ibus and define a shortcut under "Emoji" -> "Unicode code point". E.g. Ctrl-Shift-U
Then press the shortcut, type 2245 (for U+2245) and press space. This produces the approximately-equal sign (≅).
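
The digits typed after the shortcut are the codepoint written in hexadecimal. As a rough sketch of that mapping (not a description of how ibus itself works), the hypothetical helper `fromHexCodepoint` below turns such digits into a Haskell `Char` using only `base`:

```haskell
import Data.Char (chr)
import Numeric (readHex)

-- | Turn hex digits such as "2245" into the character at that codepoint.
-- Returns Nothing for malformed input or values above 0x10FFFF.
fromHexCodepoint :: String -> Maybe Char
fromHexCodepoint s = case (readHex s :: [(Integer, String)]) of
  [(n, "")] | n <= 0x10FFFF -> Just (chr (fromInteger n))
  _                         -> Nothing
```

For instance, `fromHexCodepoint "2245"` yields the character U+2245 and `fromHexCodepoint "65"` yields `'e'`.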