UTF-8 is a variable-width encoding: each Unicode code point is stored in 1 to 4 bytes, depending on the numeric range it falls in. Because the first 128 code points are encoded exactly as they are in ASCII, any valid ASCII text is also valid UTF-8, while the longer sequences cover the rest of Unicode's international character sets:
✓ 1-Byte (ASCII Range)
Code points U+0000 through U+007F take exactly 1 byte. This covers standard English letters (a-z, A-Z), digits (0-9), and basic punctuation marks.
✓ 2-Bytes (Latin Accents)
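Code points U+0080 through U+07FF take exactly 2 bytes. This covers the common precomposed accented Latin letters (e.g. é, ü) and the basic Greek, Cyrillic, Hebrew, and Arabic blocks. An accent stored as a separate combining mark adds its own 2 bytes to the base letter, and some extended blocks of these scripts fall in the 3-byte range.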
✓ 3-Bytes (CJK & Symbols)
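Code points U+0800 through U+FFFF take 3 bytes. This covers the Chinese (Hanzi) and Japanese (Kanji) characters in everyday use, Korean (Hangul), Devanagari (the script used for Hindi), the euro sign (€), and mathematical operators such as ∞ and √. Not every currency or math symbol lands here: $ takes 1 byte, while £ and π take 2.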
✓ 4-Bytes (Emojis & Scripts)
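Code points U+10000 through U+10FFFF take exactly 4 bytes. This covers most modern emojis (e.g., 😂, 🚀, 💻), historic scripts such as Gothic and Egyptian hieroglyphs, rare CJK ideographs, and the Mathematical Alphanumeric Symbols block (styled letters such as 𝒜). An emoji built from several code points (skin-tone variants, flags, joined sequences) takes more than 4 bytes in total, and a few older ones such as ☺ take only 3.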
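The four tiers above can be checked with a short Python sketch. The helper name `utf8_length` and the sample characters are illustrative choices, not part of any library; the function simply restates the code point ranges and is compared against Python's built-in `str.encode`.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 uses for a single Unicode code point.

    Hypothetical helper for illustration; it mirrors the four ranges
    described in the list above.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded in UTF-8")
    if code_point <= 0x7F:      # ASCII range
        return 1
    if code_point <= 0x7FF:     # accented Latin, Greek, Cyrillic, Hebrew, Arabic
        return 2
    if code_point <= 0xFFFF:    # rest of the Basic Multilingual Plane
        return 3
    return 4                    # supplementary planes (most emoji, historic scripts)


# One sample character per tier (plus a second one for tiers 2 and 3).
samples = ["A", "\u00e9", "\u044f", "\u20ac", "\u4e2d", "\U0001F602"]

for ch in samples:
    encoded = ch.encode("utf-8")
    # The range-based prediction should match the real encoder.
    assert len(encoded) == utf8_length(ord(ch))
    print(f"U+{ord(ch):04X} {ch}: {len(encoded)} byte(s), hex {encoded.hex()}")
```

Run as written, the loop should report 1, 2, 2, 3, 3, and 4 bytes for A, é, я, €, 中, and 😂 respectively. Note that the check works per code point: a string holding a multi-code-point emoji sequence would need its code points summed individually.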