Confused about micropython's utf-8 encoding

LabBaxMX
Posts: 3
Joined: Sat May 28, 2022 11:19 am

Confused about micropython's utf-8 encoding

Post by LabBaxMX » Sat May 28, 2022 11:56 am

Hello there, I'm new to this forum :D

I'm trying to make my LCD driver (ST7735S) support Chinese characters.

My solution is to encode a Chinese character to get its UTF-8 code, then use this code as a key to look up a dict that contains all the characters used in my project.

letter_dict = {
    0xe586af: [0x00, ..],  # the pixel bits of char '冯'
}

data_code = '冯'.encode('utf-8')

# A UTF-8 code is 1 to 4 bytes long; pack the bytes into one integer.
d = data_code[0]
for i in range(1, len(data_code)):
    d = d << 8 | data_code[i]

letter_bits = letter_dict[d]

...
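The byte-packing loop above amounts to a big-endian integer conversion. A minimal sketch, assuming the table stays keyed by the packed UTF-8 bytes as in the post; `utf8_key` is a hypothetical helper name and the glyph data is a placeholder, not a real bitmap:

```python
def utf8_key(ch):
    """Pack the UTF-8 bytes of one character into a single integer key."""
    # int.from_bytes exists in both CPython and MicroPython.
    return int.from_bytes(ch.encode('utf-8'), 'big')

# Placeholder glyph data (hypothetical); a real table would hold pixel rows.
letter_dict = {0xe586af: [0x00]}

key = utf8_key('\u51af')   # '\u51af' is the character 冯
letter_bits = letter_dict[key]
print(hex(key))
```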
But I am a bit confused, because in my MicroPython environment (ESP32) the char '冯' encodes to b'\xb7\xeb'.

=== 
b'\xb7\xeb'
��
>>>
But in my Python environment (PC) I get b'\xe5\x86\xaf'.

Python 3.8.0 (default, Nov  6 2019, 16:00:02) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> '冯'.encode('utf-8')
b'\xe5\x86\xaf'
>>>
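For what it's worth, the two bytes seen on the board are exactly what the GBK code page (the usual default for a Simplified Chinese Windows console) produces for this character. That would suggest the terminal sent GBK bytes rather than UTF-8, though this is an inference from the byte values and is not confirmed anywhere in the thread. A quick check, meant to be run with desktop Python on the PC:

```python
ch = '\u51af'  # the character 冯, written as an escape to stay encoding-safe

utf8_bytes = ch.encode('utf-8')
gbk_bytes = ch.encode('gbk')

print(utf8_bytes)  # b'\xe5\x86\xaf'
print(gbk_bytes)   # b'\xb7\xeb'

# Decoding the bytes seen on the board as GBK recovers the original character.
assert b'\xb7\xeb'.decode('gbk') == ch
```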
I am now stuck here, as I planned to generate all the characters on my PC (Python environment) and then download them to my ESP32 (MicroPython environment).

The UTF-8 code has to be the same in both environments; only then can I extract characters from the JSON stream (UTF-8 encoded) and show them on my LCD.

Another thing is that when I use the REPL through COM9 (the serial port for my board), I cannot type Chinese characters in cmd...

My setup details are as follows:

board : ESP32 DEVKIT V1 GPIO with 30 pins
firmware : esp32-20220117-v1.18.bin (latest firmware), downloaded from https://micropython.org/download/esp32/
IDE : VS Code with RT-Thread, Python 3.8.0 with esptool 4.0

Can anyone help me get out of this? Or is there another way to match JSON chars from a web API to my local characters?

Best regards!

Lab~

Lobo-T
Posts: 36
Joined: Tue Nov 16, 2021 2:36 pm

Re: Confused about micropython's utf-8 encoding

Post by Lobo-T » Sat May 28, 2022 4:54 pm

Are you sure? I get the correct UTF-8 running on an ESP32:

>>> data_code = '冯'.encode('utf-8')
>>> print (data_code)
b'\xe5\x86\xaf'
>>> 

Christian Walther
Posts: 169
Joined: Fri Aug 19, 2016 11:55 am

Re: Confused about micropython's utf-8 encoding

Post by Christian Walther » Sat May 28, 2022 6:54 pm

I suspect your source file (the first code block – you don’t show any >>> prompts so I guess this is a source file, not REPL input) isn’t actually encoded in UTF-8. Can you open it in a hex editor and see what bytes are there in the '冯' string literal?

With the amount of encoding-unaware tools around, in my opinion it’s generally safer to use escapes like '\u51af' instead of non-ASCII characters like '冯' in string literals in source files. (But as someone who writes in mostly-ASCII languages, it may be easier for me to say that than for you.)

But, taking a step back, do you even need to encode the character in UTF-8 and then assemble the bytes into a number? Why not just use ord('冯') as the dictionary key?
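As a rough illustration of both suggestions, the escape form and the literal are the same string, and `ord()` gives a key that does not depend on any byte encoding. The `glyphs` table and its bitmap below are placeholders, not data from the thread:

```python
ch = '\u51af'            # same string as the literal '冯'

code_point = ord(ch)     # Unicode code point, independent of any byte encoding
print(hex(code_point))   # 0x51af

# Hypothetical glyph table keyed by code point; the bitmap is a placeholder.
glyphs = {0x51af: [0x00]}
letter_bits = glyphs[ord(ch)]
```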

LabBaxMX
Posts: 3
Joined: Sat May 28, 2022 11:19 am

Re: Confused about micropython's utf-8 encoding

Post by LabBaxMX » Mon May 30, 2022 10:01 am

Lobo-T wrote:
Sat May 28, 2022 4:54 pm
Are you sure? I get the correct UTF-8 running on an ESP32:

>>> data_code = '冯'.encode('utf-8')
>>> print (data_code)
b'\xe5\x86\xaf'
>>> 
Thank you for your reply. I tried another IDE (Thonny) and the problem disappeared; the mismatch only occurred when I used VS Code XD. Now I just use VS Code to write code (I prefer its Python plugins) and debug in Thonny.

I think it's the RT-Thread plugin's problem, but I'm not sure...

LabBaxMX
Posts: 3
Joined: Sat May 28, 2022 11:19 am

Re: Confused about micropython's utf-8 encoding

Post by LabBaxMX » Mon May 30, 2022 10:23 am

Christian Walther wrote:
Sat May 28, 2022 6:54 pm
I suspect your source file (the first code block – you don’t show any >>> prompts so I guess this is a source file, not REPL input) isn’t actually encoded in UTF-8. Can you open it in a hex editor and see what bytes are there in the '冯' string literal?

With the amount of encoding-unaware tools around, in my opinion it’s generally safer to use escapes like '\u51af' instead of non-ASCII characters like '冯' in string literals in source files. (But as someone who writes in mostly-ASCII languages, it may be easier for me to say that than for you.)

But, taking a step back, do you even need to encode the character in UTF-8 and then assemble the bytes into a number? Why not just use ord('冯') as the dictionary key?
I double-checked my source file, and it was auto-saved with UTF-8 encoding. The first code block is pseudocode, as I wasn't sure whether I had made myself clear; this did occur in the REPL XD.
Christian Walther wrote: to use escapes like '\u51af' instead of non-ASCII characters like '冯' in string literals
I'm trying to get a JSON string from a web API, so I don't actually know in advance what the characters will be; '冯' is just a test to check that I got it right.
Christian Walther wrote: just use ord('冯') as the dictionary key
My original idea was to use the ord(char) function (the easiest way, I think), but the code it returns doesn't match the UTF-8 bytes in the JSON string, so in the end I had to switch to the UTF-8 code as the dict key.
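Note that `ord()` returns the Unicode code point (0x51af for 冯), which is a different number from the packed UTF-8 bytes (0xe586af); that difference may be the mismatch described here. If the JSON payload is decoded to a `str` first, each character can be looked up by code point. A minimal sketch, assuming the response body is UTF-8; the `"name"` field, the payload, and the `glyphs` table are all hypothetical:

```python
import json

# Hypothetical response body; real web APIs will differ.
payload = b'{"name": "\xe5\x86\xaf"}'

# Decode the UTF-8 bytes to text before parsing, so we work with characters.
text = json.loads(payload.decode('utf-8'))['name']

glyphs = {0x51af: [0x00]}   # placeholder bitmap keyed by code point

for ch in text:
    bits = glyphs.get(ord(ch))   # None if the glyph is missing from the table
    print(hex(ord(ch)), bits)
```

With this layout the table generated on the PC only needs `ord(ch)` for each character, so the byte encoding used by the editor or terminal never enters the key.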

Now the code runs correctly on my board and the result is just what I needed. Maybe it can still be optimized, but I've decided to leave it as it is for lack of time.

Anyway, thank you for your reply~
