Understanding Character Encodings: UTF-8 vs ASCII
Why do broken characters (Mojibake) appear? We explain the fundamental differences between ASCII, UTF-8, and Unicode for developers.
Hello! This is Cheetset.
Have you ever seen weird characters like '�' or garbled text where readable words should be? This is called Mojibake.
This happens when the computer tries to read text using the wrong encoding. Today, we'll dive deep into the world of character encodings.
1. ASCII: The Beginning
In the early days, computers only needed to support English. ASCII uses 7 bits to represent 128 characters: the English letters A-Z and a-z, the digits 0-9, punctuation, and control characters.
'A' = 65 (0x41)
'a' = 97 (0x61)
'0' = 48 (0x30)
- Pros: Very simple and small.
- Cons: Cannot represent non-English characters such as Korean text or emoji.
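If you want to verify these values yourself, here is a minimal Python sketch (my own illustration, not from any particular library) that prints the code points above and shows ASCII rejecting a Korean syllable:

# A quick check of the code points above (Python 3 assumed).
for ch in ["A", "a", "0"]:
    print(ch, ord(ch), hex(ord(ch)))   # A 65 0x41 / a 97 0x61 / 0 48 0x30

# Anything outside the 7-bit range cannot be encoded as ASCII.
try:
    "한".encode("ascii")
except UnicodeEncodeError as err:
    print("Not representable in ASCII:", err)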
2. Unicode: One Standard for All
To support all languages, Unicode was created. It assigns a unique number, called a code point, to every character in the world. For example, 'A' is U+0041.
However, Unicode is just a standard map from characters to numbers; it is not the way text is stored on disk. That job belongs to an encoding.
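As a small illustration (Python assumed), ord() returns a character's Unicode code point, which lines up with the U+ notation, and chr() goes the other way:

# ord() returns the Unicode code point of a character; chr() is the inverse.
print(hex(ord("A")))     # 0x41    -> U+0041
print(hex(ord("한")))    # 0xd55c  -> U+D55C (a Korean syllable)
print(hex(ord("😀")))    # 0x1f600 -> U+1F600 (an emoji)
print(chr(0x41))         # 'A'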
3. UTF-8: The King of Encodings
UTF-8 is the most popular way to store Unicode text.
- Variable Length: Uses 1 byte for English letters (same as ASCII) and 2-4 bytes for other characters, such as Korean syllables (3 bytes) or emoji (4 bytes); see the example after this list.
- Efficiency: It saves space while supporting every language.
- Compatibility: It is backward compatible with ASCII.
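Here is a minimal sketch of the variable-length behavior (Python assumed): encoding a few characters to UTF-8 and counting the resulting bytes.

# The number of UTF-8 bytes depends on the character.
for ch in ["A", "é", "한", "😀"]:
    data = ch.encode("utf-8")
    print(ch, len(data), data)
# A  -> 1 byte : b'A'               (identical to ASCII)
# é  -> 2 bytes: b'\xc3\xa9'
# 한 -> 3 bytes: b'\xed\x95\x9c'
# 😀 -> 4 bytes: b'\xf0\x9f\x98\x80'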
4. Why Mojibake Happens
If you save a file as Windows-1252 (or EUC-KR) and open it as UTF-8, the characters will not map correctly.
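Here is a minimal Python sketch of that mismatch; the sample strings are just examples I picked for illustration:

# Bytes written in one encoding but decoded as another turn into Mojibake.
saved = "한글".encode("euc-kr")                     # what an EUC-KR editor writes to disk
print(saved.decode("utf-8", errors="replace"))      # gibberish with '�' instead of 한글

print("café".encode("cp1252").decode("utf-8", errors="replace"))  # 'caf�'
print("café".encode("utf-8").decode("cp1252"))                    # 'cafÃ©' (the reverse mistake)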
How to Fix
1. Always include this in your HTML head: <meta charset="UTF-8">
2. Ensure your text editor saves files in UTF-8 format.
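If you handle files in code, the same rule applies: state the encoding explicitly instead of relying on the OS default. A small Python sketch (the filename is just an example):

# Write and read text with an explicit UTF-8 encoding.
with open("example.txt", "w", encoding="utf-8") as f:
    f.write("Hello, 한국어, 😀\n")

with open("example.txt", "r", encoding="utf-8") as f:
    print(f.read())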
5. What is BOM?
BOM (Byte Order Mark) is an invisible character, U+FEFF, placed at the very start of a file to indicate its encoding.
- Recommendation: Use UTF-8 without BOM for web development to avoid unexpected errors in scripts.
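For reference, in UTF-8 the BOM becomes the three bytes EF BB BF. A short Python sketch (the filename is just an example) shows it, along with the 'utf-8-sig' codec that adds the BOM on write and strips it on read:

# The BOM character and its UTF-8 byte sequence.
print("\ufeff".encode("utf-8"))                            # b'\xef\xbb\xbf'

# 'utf-8-sig' writes a BOM and strips it when reading.
with open("with_bom.txt", "w", encoding="utf-8-sig") as f:
    f.write("hello")

print(open("with_bom.txt", "rb").read())                   # b'\xef\xbb\xbfhello'
print(open("with_bom.txt", encoding="utf-8-sig").read())   # 'hello' (BOM removed)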
Conclusion
Always use UTF-8 for your web projects to ensure your text looks correct on every device! It's the global standard for a reason.