Understand the encoding/decoding of Python strings (Unicode/UTF-8)

Demystify Python string encoding/decoding

Lynn G. Kwong


Strings are one of the most common data types in Python, and we use them every day. This article introduces the basics of string encoding and decoding, which should clear up common confusion in this area. At the end of the article, two methods for generating unique hashes from strings using encoding/decoding are introduced, which should be helpful in your work.

Image by geralt on Pixabay

What is the data type of a string?

This may seem like a silly question. However, if you have been working with Python 2 and have just switched to Python 3, it can be very confusing.

In Python 2, a string literal is a byte (binary) string by default, and you need the u'' prefix to mark it as a Unicode string. In Python 3, a string literal is a Unicode string by default, and you need the b'' prefix to explicitly mark it as a byte string.

This is a major source of problems when you upgrade a Python 2 codebase to Python 3. You need to manually fix bytes/Unicode string-related issues, especially where your code encodes or decodes data.
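To make this kind of upgrade problem concrete, here is a minimal sketch of what typically goes wrong. The variable names and the sample data are hypothetical, chosen only for illustration: Python 3 refuses to mix bytes and str implicitly, so the bytes have to be decoded explicitly first.

```python
# Hypothetical example data: the UTF-8 bytes for "café",
# as they might arrive from a file or a network socket.
payload = b"caf\xc3\xa9"

mixing_error = None
try:
    # Python 3 does not convert between bytes and str implicitly,
    # so this concatenation raises a TypeError.
    greeting = "Order: " + payload
except TypeError as exc:
    mixing_error = exc
    print(f"TypeError: {exc}")

# The fix is to decode the bytes into a Unicode string explicitly.
text = payload.decode("utf-8")
greeting = "Order: " + text
print(greeting)

# Encoding the string again gives back the original bytes.
round_trip = text.encode("utf-8")
print(round_trip == payload)
```

Note that the payload is five bytes long while the decoded string has four characters, because "é" takes two bytes in UTF-8.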

Since Python 2 is deprecated now, the following sections use Python 3 only, with 3.10 being the latest version at the time of writing.

Let’s demonstrate the types of strings in Python 3:
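A minimal sketch of such a check, using only built-in functions (the variable names are illustrative and not taken from the original article):

```python
# A plain literal and a u''-prefixed literal are both str in Python 3.
plain = "hello"
prefixed = u"hello"

# A b''-prefixed literal is a bytes object.
raw_bytes = b"hello"

print(type(plain))      # <class 'str'>
print(type(prefixed))   # <class 'str'>
print(type(raw_bytes))  # <class 'bytes'>

# The u'' prefix makes no difference to the value.
print(plain == prefixed)   # True

# A str never compares equal to a bytes object, even with the same ASCII content.
print(plain == raw_bytes)  # False
```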

It’s obvious that in Python 3 a string is by default a Unicode string and the u'' prefix is optional.

What are ASCII, Unicode, and UTF-8?

The technical details of ASCII, Unicode, and UTF-8 can be difficult to fully digest if you don’t have a computer science background. However, most of the time you don’t need to care too much about the technical details as long as you know the meaning and basic usage of them.

ASCII is an abbreviation for American Standard Code for Information Interchange. The full name is even more elusive :). It is the first character set and…


Lynn G. Kwong

I’m a Software Developer (https://medium.com/@lynn-kwong) keen on sharing thoughts, tutorials, and solutions for the best practice of software development.