Understanding Character Encodings: UTF-8 vs ASCII
Why do broken characters (Mojibake) appear? We explain the fundamental differences between ASCII, UTF-8, and Unicode for developers.
Hello! This is Cheetset.
Have you ever seen weird characters like '�' or garbled text where readable words should be? This is called Mojibake.
This happens when the computer tries to read text using the wrong encoding. Today, we'll dive deep into the world of character encodings.
1. ASCII: The Beginning
In the early days, computers only needed to support English. ASCII uses 7 bits to represent 128 characters: the English letters A-Z and a-z, the digits 0-9, punctuation, and control characters.
'A' = 65 (0x41)
'a' = 97 (0x61)
'0' = 48 (0x30)
- Pros: Very simple and small.
- Cons: Cannot represent non-English characters such as Korean text or emoji.
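If you want to verify these values yourself, here is a minimal Python sketch (my own illustration, not from any particular library) that prints the code points above and shows ASCII rejecting a Korean syllable:

# A quick check of the code points above (Python 3 assumed).
for ch in ["A", "a", "0"]:
    print(ch, ord(ch), hex(ord(ch)))   # A 65 0x41 / a 97 0x61 / 0 48 0x30

# Anything outside the 7-bit range cannot be encoded as ASCII.
try:
    "한".encode("ascii")
except UnicodeEncodeError as err:
    print("Not representable in ASCII:", err)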
2. Unicode: One Standard for All
To support all languages, Unicode was created. It assigns a unique number, called a code point, to every character in the world. For example, 'A' is U+0041.
However, Unicode is just a standard map from characters to numbers; it is not the way text is stored on disk. That job belongs to an encoding.
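As a small illustration (Python assumed), ord() returns a character's Unicode code point, which lines up with the U+ notation, and chr() goes the other way:

# ord() returns the Unicode code point of a character; chr() is the inverse.
print(hex(ord("A")))     # 0x41    -> U+0041
print(hex(ord("한")))    # 0xd55c  -> U+D55C (a Korean syllable)
print(hex(ord("😀")))    # 0x1f600 -> U+1F600 (an emoji)
print(chr(0x41))         # 'A'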
3. UTF-8: The King of Encodings
UTF-8 is the most popular way to store Unicode text.
- Variable Length: Uses 1 byte for English letters (same as ASCII) and 2-4 bytes for other characters, such as Korean syllables (3 bytes) or emoji (4 bytes); see the example after this list.
- Efficiency: It saves space while supporting every language.
- Compatibility: It is backward compatible with ASCII.
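Here is a minimal sketch of the variable-length behavior (Python assumed): encoding a few characters to UTF-8 and counting the resulting bytes.

# The number of UTF-8 bytes depends on the character.
for ch in ["A", "é", "한", "😀"]:
    data = ch.encode("utf-8")
    print(ch, len(data), data)
# A  -> 1 byte : b'A'               (identical to ASCII)
# é  -> 2 bytes: b'\xc3\xa9'
# 한 -> 3 bytes: b'\xed\x95\x9c'
# 😀 -> 4 bytes: b'\xf0\x9f\x98\x80'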
4. Why Mojibake Happens
If you save a file as Windows-1252 (or EUC-KR) and open it as UTF-8, the characters will not map correctly.
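Here is a minimal Python sketch of that mismatch; the sample strings are just examples I picked for illustration:

# Bytes written in one encoding but decoded as another turn into Mojibake.
saved = "한글".encode("euc-kr")                     # what an EUC-KR editor writes to disk
print(saved.decode("utf-8", errors="replace"))      # gibberish with '�' instead of 한글

print("café".encode("cp1252").decode("utf-8", errors="replace"))  # 'caf�'
print("café".encode("utf-8").decode("cp1252"))                    # 'cafÃ©' (the reverse mistake)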
How to Fix
1. Always include this in your HTML head: <meta charset="UTF-8">
2. Ensure your text editor saves files in UTF-8 format.
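If you handle files in code, the same rule applies: state the encoding explicitly instead of relying on the OS default. A small Python sketch (the filename is just an example):

# Write and read text with an explicit UTF-8 encoding.
with open("example.txt", "w", encoding="utf-8") as f:
    f.write("Hello, 한국어, 😀\n")

with open("example.txt", "r", encoding="utf-8") as f:
    print(f.read())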
5. What is BOM?
BOM (Byte Order Mark) is an invisible character, U+FEFF, placed at the very start of a file to indicate its encoding.
- Recommendation: Use UTF-8 without BOM for web development to avoid unexpected errors in scripts.
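For reference, in UTF-8 the BOM becomes the three bytes EF BB BF. A short Python sketch (the filename is just an example) shows it, along with the 'utf-8-sig' codec that adds the BOM on write and strips it on read:

# The BOM character and its UTF-8 byte sequence.
print("\ufeff".encode("utf-8"))                            # b'\xef\xbb\xbf'

# 'utf-8-sig' writes a BOM and strips it when reading.
with open("with_bom.txt", "w", encoding="utf-8-sig") as f:
    f.write("hello")

print(open("with_bom.txt", "rb").read())                   # b'\xef\xbb\xbfhello'
print(open("with_bom.txt", encoding="utf-8-sig").read())   # 'hello' (BOM removed)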
Conclusion
Always use UTF-8 for your web projects to ensure your text looks correct on every device! It's the global standard for a reason.