Understand the encoding/decoding of Python strings (Unicode/UTF-8)
Demystify Python string encoding/decoding
Strings are a common data type in Python, and most of us use them every day. This article introduces the basics of string encoding and decoding, which should clear up the usual confusion in this area. At the end of the article, two methods for generating unique hashes from strings using encoding/decoding are introduced, which should be helpful in your work.
What is the data type of a string?
This may seem like a silly question. However, if you have been working with Python 2 and have just switched to Python 3, it can be very confusing.
In Python 2, a string literal is a byte string by default, and you need the u'' prefix to mark it as a Unicode string. In Python 3, however, a string literal is a Unicode string (type str) by default, and you need the b'' prefix to explicitly mark it as a byte string (type bytes).
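The difference between the two literal forms can be checked directly in a Python 3 interpreter. The following is a minimal sketch; the variable names (text, raw, encoded, decoded) and the sample word are chosen for illustration only.

```python
# A plain literal in Python 3 is a str: a sequence of Unicode code points.
text = "café"
# A b'' literal is a bytes object: a sequence of raw byte values.
# Bytes literals may only contain ASCII characters (or escape sequences).
raw = b"cafe"

print(type(text))  # <class 'str'>
print(type(raw))   # <class 'bytes'>

# The u'' prefix is still accepted in Python 3 but is redundant.
same_as_text = u"café" == text

# A str and a bytes object never compare equal, even for ASCII-only content.
mixed_equal = "cafe" == b"cafe"
print(mixed_equal)  # False

# Encoding converts str to bytes; decoding converts bytes back to str.
encoded = text.encode("utf-8")
decoded = encoded.decode("utf-8")
print(encoded)                   # b'caf\xc3\xa9'
print(len(text), len(encoded))   # 4 5
```

Note that the str has 4 characters while its UTF-8 encoding has 5 bytes, because the character é takes two bytes in UTF-8.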
This difference is a major source of problems when you upgrade a Python 2 codebase to Python 3. You need to manually fix bytes/Unicode string-related issues, especially where the code encodes or decodes data.
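As a rough illustration of the kind of problem that shows up during such a migration, Python 3 refuses to mix str and bytes implicitly, so code that concatenated the two in Python 2 now raises a TypeError. The sketch below is hypothetical: the greet helper, the payload value, and the assumption that the bytes are UTF-8 encoded are all made up for the example.

```python
def greet(name_bytes: bytes) -> str:
    """Hypothetical helper: decode the bytes explicitly before joining with a str."""
    # Assumes the incoming bytes are UTF-8 encoded.
    return "Hello, " + name_bytes.decode("utf-8")


payload = b"Ada"  # e.g. data read from a socket or a file opened in binary mode

# Python 3 raises TypeError here instead of silently mixing the two types.
try:
    message = "Hello, " + payload
except TypeError as exc:
    error = str(exc)
else:
    error = None

print(error is not None)  # True

# The fix is to decode (or encode) explicitly at the boundary.
message = greet(payload)
print(message)  # Hello, Ada
```

Decoding once at the point where bytes enter the program, and working with str everywhere else, tends to keep such fixes localized.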
Since Python 2 has reached end of life, the following sections use Python 3 only; the latest version at the time of writing is 3.10.