Understand the encoding/decoding of Python strings (Unicode/UTF-8)

Demystify Python string encoding/decoding

Lynn G. Kwong
5 min read · Dec 23, 2021

Strings are one of the most common data types in Python, and we use them every day. This article introduces the basics of string encoding and decoding, which should clear up common confusion in this area. At the end of the article, two methods for generating unique hashes from strings with encoding/decoding are introduced, which should be helpful in your work.

Image by geralt on Pixabay

What is the data type of a string?

This may seem like a silly question. However, if you have been working with Python 2 and have just switched to Python 3, it can be very confusing.

In Python 2, a string literal is a byte string by default, and you need the u'' prefix to mark it as a Unicode string. In Python 3, a string literal is a Unicode string (type str) by default, and you need the b'' prefix to explicitly create a byte string (type bytes).
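To make the distinction concrete, here is a minimal sketch of the two types in Python 3. The sample word and the variable names are only illustrative and are not taken from any particular codebase.

```python
text = "caf\u00e9"        # str: a sequence of Unicode code points ("café")
data = b"caf\xc3\xa9"     # bytes: the same word as raw UTF-8 byte values

print(type(text))  # <class 'str'>
print(type(data))  # <class 'bytes'>

# The u'' prefix is still accepted in Python 3, but it is redundant.
same_as_text = (u"caf\u00e9" == text)

# Encoding turns a str into bytes; decoding turns bytes back into a str.
encoded = text.encode("utf-8")
decoded = data.decode("utf-8")

# 4 characters, but 5 bytes: "é" takes two bytes in UTF-8.
print(len(text), len(data))  # 4 5
```

Note that the two objects never compare equal, even though they represent the same word: `text == data` is `False` in Python 3.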

This is in fact a major source of problems when you upgrade a Python 2 codebase to Python 3. You need to fix bytes/Unicode string problems manually, especially where the code encodes and decodes data.
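A typical migration problem looks like the hypothetical snippet below: code that freely mixed the two kinds of strings under Python 2 fails in Python 3, which refuses to combine `bytes` and `str` implicitly. The header value and the variable names are made up for illustration.

```python
header = b"Content-Length: "   # bytes, e.g. read from a socket or a binary file
length = "42"                  # str

error = None
try:
    line = header + length     # Python 3 raises TypeError for bytes + str
except TypeError as exc:
    error = exc
    print("TypeError:", exc)

# One possible fix: convert explicitly and choose the encoding yourself.
line = header + length.encode("ascii")
print(line)  # b'Content-Length: 42'
```

Whether to encode the `str` or decode the `bytes` depends on what the surrounding code expects; the point is that the conversion has to be explicit.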

Since Python 2 has reached its end of life, the following sections use Python 3 only, with 3.10 being the latest version at the time of writing.

Lynn G. Kwong

I’m a Software Developer (https://medium.com/@lynn-kwong) keen on sharing thoughts, tutorials, and solutions for the best practice of software development.