How Unidecoder Transliterates UTF-8 to ASCII
DESCRIPTION
Slides of my talk at Paris.rb on 2014-11-07. How does UTF-8 work? How can you leverage it to convert Chinese, Russian or any other non-ASCII characters to ASCII? Here is what the Unidecoder gem does.
TRANSCRIPT
![Page 1: How Unidecoder Transliterates UTF-8 to ASCII](https://reader033.vdocuments.us/reader033/viewer/2022052906/558c2b62d8b42abb738b4591/html5/thumbnails/1.jpg)
Unidecoder
Simon Courtois - @happynoff
![Page 2: How Unidecoder Transliterates UTF-8 to ASCII](https://reader033.vdocuments.us/reader033/viewer/2022052906/558c2b62d8b42abb738b4591/html5/thumbnails/2.jpg)
Transliteration
![Page 3: How Unidecoder Transliterates UTF-8 to ASCII](https://reader033.vdocuments.us/reader033/viewer/2022052906/558c2b62d8b42abb738b4591/html5/thumbnails/3.jpg)
你好
Ni Hao
![Page 4: How Unidecoder Transliterates UTF-8 to ASCII](https://reader033.vdocuments.us/reader033/viewer/2022052906/558c2b62d8b42abb738b4591/html5/thumbnails/4.jpg)
ПРИВЕТ
PRIVIeT
![Page 5: How Unidecoder Transliterates UTF-8 to ASCII](https://reader033.vdocuments.us/reader033/viewer/2022052906/558c2b62d8b42abb738b4591/html5/thumbnails/5.jpg)
How does it work?
![Page 6: How Unidecoder Transliterates UTF-8 to ASCII](https://reader033.vdocuments.us/reader033/viewer/2022052906/558c2b62d8b42abb738b4591/html5/thumbnails/6.jpg)
At the beginning there was ASCII
![Page 7: How Unidecoder Transliterates UTF-8 to ASCII](https://reader033.vdocuments.us/reader033/viewer/2022052906/558c2b62d8b42abb738b4591/html5/thumbnails/7.jpg)
A 65   B 66   C 67
a 97   b 98   c 99
![Page 8: How Unidecoder Transliterates UTF-8 to ASCII](https://reader033.vdocuments.us/reader033/viewer/2022052906/558c2b62d8b42abb738b4591/html5/thumbnails/8.jpg)
a 97 1100001 (bit values: 64 32 16 8 4 2 1)
A 65 1000001 (bit values: 64 32 16 8 4 2 1)
![Page 9: How Unidecoder Transliterates UTF-8 to ASCII](https://reader033.vdocuments.us/reader033/viewer/2022052906/558c2b62d8b42abb738b4591/html5/thumbnails/9.jpg)
b 98 1100010 (bit values: 64 32 16 8 4 2 1)
B 66 1000010 (bit values: 64 32 16 8 4 2 1)
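The codes and bit patterns on these slides can be checked in Ruby. This is a minimal sketch using only core methods (`String#ord`, `Integer#to_s`); the helper name `bits7` is made up for illustration.

```ruby
# Hypothetical helper: the 7-bit binary form of a character's ASCII code.
def bits7(char)
  char.ord.to_s(2).rjust(7, "0")
end

%w[A a B b].each do |char|
  puts "#{char} #{char.ord} #{bits7(char)}"
end

# Upper and lower case differ only by the bit worth 32.
puts "a".ord - "A".ord        # 32
puts(("A".ord | 32).chr)      # a
```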
![Page 10: How Unidecoder Transliterates UTF-8 to ASCII](https://reader033.vdocuments.us/reader033/viewer/2022052906/558c2b62d8b42abb738b4591/html5/thumbnails/10.jpg)
Then… 8-bit computers!
![Page 11: How Unidecoder Transliterates UTF-8 to ASCII](https://reader033.vdocuments.us/reader033/viewer/2022052906/558c2b62d8b42abb738b4591/html5/thumbnails/11.jpg)
So every country had its own
encoding(s)!
![Page 12: How Unidecoder Transliterates UTF-8 to ASCII](https://reader033.vdocuments.us/reader033/viewer/2022052906/558c2b62d8b42abb738b4591/html5/thumbnails/12.jpg)
All was fine until…
![Page 13: How Unidecoder Transliterates UTF-8 to ASCII](https://reader033.vdocuments.us/reader033/viewer/2022052906/558c2b62d8b42abb738b4591/html5/thumbnails/13.jpg)
The World Wide Web
![Page 14: How Unidecoder Transliterates UTF-8 to ASCII](https://reader033.vdocuments.us/reader033/viewer/2022052906/558c2b62d8b42abb738b4591/html5/thumbnails/14.jpg)
UTF-8 to the rescue!
![Page 15: How Unidecoder Transliterates UTF-8 to ASCII](https://reader033.vdocuments.us/reader033/viewer/2022052906/558c2b62d8b42abb738b4591/html5/thumbnails/15.jpg)
Everything on 32 bits?
![Page 16: How Unidecoder Transliterates UTF-8 to ASCII](https://reader033.vdocuments.us/reader033/viewer/2022052906/558c2b62d8b42abb738b4591/html5/thumbnails/16.jpg)
Bad idea
c a f é
![Page 17: How Unidecoder Transliterates UTF-8 to ASCII](https://reader033.vdocuments.us/reader033/viewer/2022052906/558c2b62d8b42abb738b4591/html5/thumbnails/17.jpg)
Bad idea
f é c a
![Page 18: How Unidecoder Transliterates UTF-8 to ASCII](https://reader033.vdocuments.us/reader033/viewer/2022052906/558c2b62d8b42abb738b4591/html5/thumbnails/18.jpg)
Bad idea
\0
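One way to see the cost of a fixed 32-bit encoding is to compare it with UTF-8 in Ruby. This is only a sketch of my reading of the "Bad idea" slides (wasted space and zero bytes); it uses `String#encode` with the `UTF-32BE` encoding name, which Ruby ships with.

```ruby
word = "café"

utf32 = word.encode("UTF-32BE")
utf8  = word.encode("UTF-8")

puts utf32.bytesize         # 16: four bytes for each of the four characters
puts utf8.bytesize          # 5: one byte each for c, a, f and two for é

# Most of the UTF-32 bytes are zero, i.e. the "\0" from the slide.
puts utf32.bytes.count(0)   # 12
```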
![Page 19: How Unidecoder Transliterates UTF-8 to ASCII](https://reader033.vdocuments.us/reader033/viewer/2022052906/558c2b62d8b42abb738b4591/html5/thumbnails/19.jpg)
A better idea
A 65 01000001
110XXXXX 10XXXXXX
1110XXXX 10XXXXXX 10XXXXXX
![Page 20: How Unidecoder Transliterates UTF-8 to ASCII](https://reader033.vdocuments.us/reader033/viewer/2022052906/558c2b62d8b42abb738b4591/html5/thumbnails/20.jpg)
A better idea
110XXXXX 10XXXXXX
11010000 10011111
![Page 21: How Unidecoder Transliterates UTF-8 to ASCII](https://reader033.vdocuments.us/reader033/viewer/2022052906/558c2b62d8b42abb738b4591/html5/thumbnails/21.jpg)
A better idea
10000 011111 = 1055 = П
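The two-byte pattern can be reproduced by hand in Ruby. The sketch below builds the UTF-8 bytes of П from its code point 1055 and compares them with what Ruby itself produces; the variable names are mine, and only the two-byte case from the slide is handled.

```ruby
code_point = 1055  # П

# Split the 11 payload bits into a 5-bit group and a 6-bit group.
high = code_point >> 6         # 0b10000
low  = code_point & 0b111111   # 0b011111

# Prefix each group with the 110 and 10 markers from the slide.
byte1 = 0b11000000 | high
byte2 = 0b10000000 | low

puts [byte1, byte2].map { |byte| byte.to_s(2) }.join(" ")  # 11010000 10011111
puts "П".bytes == [byte1, byte2]                            # true
```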
![Page 22: How Unidecoder Transliterates UTF-8 to ASCII](https://reader033.vdocuments.us/reader033/viewer/2022052906/558c2b62d8b42abb738b4591/html5/thumbnails/22.jpg)
So, how does unidecoder work?
![Page 23: How Unidecoder Transliterates UTF-8 to ASCII](https://reader033.vdocuments.us/reader033/viewer/2022052906/558c2b62d8b42abb738b4591/html5/thumbnails/23.jpg)
How do we go from П to P ?
![Page 24: How Unidecoder Transliterates UTF-8 to ASCII](https://reader033.vdocuments.us/reader033/viewer/2022052906/558c2b62d8b42abb738b4591/html5/thumbnails/24.jpg)
Start from a string like “П”
![Page 25: How Unidecoder Transliterates UTF-8 to ASCII](https://reader033.vdocuments.us/reader033/viewer/2022052906/558c2b62d8b42abb738b4591/html5/thumbnails/25.jpg)
Unpack it
"П".unpack("U")
[1055] 00000100 00011111
4 31
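The unpack step can be tried directly in Ruby. `String#unpack` with the `"U"` directive returns Unicode code points; the split into a high and a low byte below just mirrors the slide, and the variable names are mine.

```ruby
code_point = "П".unpack("U").first
puts code_point                           # 1055

# The same number written as 16 bits, split into two bytes.
bits = code_point.to_s(2).rjust(16, "0")
puts bits[0, 8]                           # 00000100
puts bits[8, 8]                           # 00011111

# "U*" unpacks every character of a longer string.
puts "ПРИВЕТ".unpack("U*").inspect        # [1055, 1056, 1048, 1042, 1045, 1058]
```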
![Page 26: How Unidecoder Transliterates UTF-8 to ASCII](https://reader033.vdocuments.us/reader033/viewer/2022052906/558c2b62d8b42abb738b4591/html5/thumbnails/26.jpg)
4 → x04 → x04.yml
Index: 0 1 2 … 31
Value: Ie Io Dj … P
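The lookup on this slide can be sketched as a table of tables. Everything below is a hypothetical stand-in: the hash holds only the four entries visible on the slide rather than the real contents of `x04.yml`, and `lookup` is a made-up helper, not the gem's API.

```ruby
# Hypothetical helper: pick a table by number, then an entry by index.
def lookup(tables, table_number, index)
  tables.fetch(table_number).fetch(index)
end

# Stand-in for x04.yml with only the entries shown on the slide.
tables = {
  4 => { 0 => "Ie", 1 => "Io", 2 => "Dj", 31 => "P" }
}

puts lookup(tables, 4, 31)  # P
puts lookup(tables, 4, 0)   # Ie
```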
![Page 27: How Unidecoder Transliterates UTF-8 to ASCII](https://reader033.vdocuments.us/reader033/viewer/2022052906/558c2b62d8b42abb738b4591/html5/thumbnails/27.jpg)
How to obtain 4 and 31?
![Page 28: How Unidecoder Transliterates UTF-8 to ASCII](https://reader033.vdocuments.us/reader033/viewer/2022052906/558c2b62d8b42abb738b4591/html5/thumbnails/28.jpg)
unpacked = 1055    0000010000011111
unpacked >> 8      00000100
4
![Page 29: How Unidecoder Transliterates UTF-8 to ASCII](https://reader033.vdocuments.us/reader033/viewer/2022052906/558c2b62d8b42abb738b4591/html5/thumbnails/29.jpg)
How to obtain 4 and 31?
![Page 30: How Unidecoder Transliterates UTF-8 to ASCII](https://reader033.vdocuments.us/reader033/viewer/2022052906/558c2b62d8b42abb738b4591/html5/thumbnails/30.jpg)
unpacked = 1055    0000010000011111
255                0000000011111111
unpacked & 255     0000000000011111
31
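Both operations fit in a few lines of Ruby. This is a sketch of the arithmetic on the slides, not the gem's actual source; building the file name with a two-digit hexadecimal format is my assumption, based only on the `x04.yml` name shown earlier.

```ruby
unpacked = "П".unpack("U").first   # 1055

table = unpacked >> 8    # drop the low 8 bits, keep the high byte
index = unpacked & 255   # mask everything except the low 8 bits

puts table   # 4
puts index   # 31

# Assumed naming scheme: "x" followed by the table number as two hex digits.
puts format("x%02x.yml", table)   # x04.yml
```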
![Page 31: How Unidecoder Transliterates UTF-8 to ASCII](https://reader033.vdocuments.us/reader033/viewer/2022052906/558c2b62d8b42abb738b4591/html5/thumbnails/31.jpg)
Brain fried yet?
Advertising time!
![Page 32: How Unidecoder Transliterates UTF-8 to ASCII](https://reader033.vdocuments.us/reader033/viewer/2022052906/558c2b62d8b42abb738b4591/html5/thumbnails/32.jpg)
www.tinci.fr
Web Development
Software Development
Consulting & Support
@tincihq
![Page 33: How Unidecoder Transliterates UTF-8 to ASCII](https://reader033.vdocuments.us/reader033/viewer/2022052906/558c2b62d8b42abb738b4591/html5/thumbnails/33.jpg)
Resources
Characters, Symbols and the Unicode Miracle:
bit.ly/why-utf8
Slides: bit.ly/unidecoder
Unidecoder: github.com/norman/unidecoder
![Page 34: How Unidecoder Transliterates UTF-8 to ASCII](https://reader033.vdocuments.us/reader033/viewer/2022052906/558c2b62d8b42abb738b4591/html5/thumbnails/34.jpg)
Thank you!
Simon Courtois - @happynoff