Unicode Grapheme Clusters

In this post, I am going to talk about a topic that not many programmers know about: Unicode grapheme clusters. If you write internationalized software, this is a subject you should be aware of, because it affects how you count characters and how you break words.

User-perceived Characters

In English, a character is usually represented by a single Unicode code point. However, in many writing systems, a user-perceived character may be composed of several code points. For example, g̈ is represented by U+0067 (LATIN SMALL LETTER G) followed by U+0308 (COMBINING DIAERESIS). Other examples are Hangul syllables such as 각 (gag), which can be represented by U+1100 (HANGUL CHOSEONG KIYEOK), U+1161 (HANGUL JUNGSEONG A) and U+11A8 (HANGUL JONGSEONG KIYEOK). Unicode models such a user-perceived character as a grapheme cluster.
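To make the g̈ and 각 examples concrete, here is a minimal sketch that counts the code points in a UTF-8 string. The function name count_code_points is my own, and the sketch assumes the input is well-formed, NUL-terminated UTF-8; it does no validation.

```c
#include <stddef.h>

/* Count Unicode code points in a NUL-terminated UTF-8 string.
   Every code point starts with exactly one byte that is NOT a
   continuation byte (continuation bytes look like 10xxxxxx).
   Assumption: the input is well-formed UTF-8. */
size_t count_code_points(const char *utf8) {
    size_t count = 0;
    for (const unsigned char *p = (const unsigned char *)utf8; *p != '\0'; p++) {
        if ((*p & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}
```

Called with "g\xCC\x88" (the UTF-8 bytes of U+0067 U+0308), this returns 2. Called with "\xE1\x84\x80\xE1\x85\xA1\xE1\x86\xA8" (U+1100 U+1161 U+11A8), it returns 3. In both cases the user sees a single character.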

Grapheme Cluster Boundaries

Suppose you are writing a program that counts the number of characters in a string. One naive approach is to count the number of Unicode code points. However, the result may differ from what users expect: g̈ would count as two characters. A better approach is to count the number of grapheme clusters.

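#include <stdio.h>
#include <unicode/ubrk.h>
#include <unicode/ustring.h>

/* Print the offset (in UTF-16 code units) of every grapheme cluster
   boundary in a NUL-terminated UTF-16 string. */
void find(const UChar *string) {
    UErrorCode error = U_ZERO_ERROR;
    /* UBRK_CHARACTER selects grapheme cluster ("character") boundaries;
       a length of -1 means the string is NUL-terminated. */
    UBreakIterator *iterator = ubrk_open(UBRK_CHARACTER, NULL, string, -1, &error);
    if (U_FAILURE(error)) {
        fprintf(stderr, "ubrk_open failed: %s\n", u_errorName(error));
        return;
    }

    int32_t index = ubrk_first(iterator);
    while (index != UBRK_DONE) {
        printf("%d\n", (int)index);
        index = ubrk_next(iterator);
    }

    ubrk_close(iterator);
}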

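The program above shows how to find grapheme cluster boundaries using the ICU library. The printed indices are offsets into the UTF-16 string, and the first and last boundaries are the start and end of the text, so the number of grapheme clusters is one less than the number of boundaries printed. ICU is a mature set of C/C++ and Java libraries for handling Unicode.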
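To give a feel for the kind of rules a grapheme cluster iterator applies, here is a deliberately simplified sketch that needs no library. It is not the full Unicode segmentation algorithm and not how ICU is implemented: it only keeps combining diacritical marks (U+0300 to U+036F) attached to the preceding code point and joins conjoining Hangul jamo from the U+1100 to U+11FF block. All function names are my own, and the input is assumed to be well-formed UTF-8. For real software, use ICU.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence at s into *cp; return bytes consumed.
   Assumption: the input is well-formed UTF-8 (no validation). */
static size_t decode_utf8(const unsigned char *s, uint32_t *cp) {
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if (s[0] < 0xE0) {
        *cp = ((s[0] & 0x1Fu) << 6) | (s[1] & 0x3Fu);
        return 2;
    }
    if (s[0] < 0xF0) {
        *cp = ((s[0] & 0x0Fu) << 12) | ((s[1] & 0x3Fu) << 6) | (s[2] & 0x3Fu);
        return 3;
    }
    *cp = ((s[0] & 0x07u) << 18) | ((s[1] & 0x3Fu) << 12)
        | ((s[2] & 0x3Fu) << 6) | (s[3] & 0x3Fu);
    return 4;
}

/* Conjoining jamo classes: leading consonant, vowel, trailing consonant. */
enum jamo { NOT_JAMO, JAMO_L, JAMO_V, JAMO_T };

static enum jamo jamo_class(uint32_t cp) {
    if (cp >= 0x1100 && cp <= 0x115F) return JAMO_L;
    if (cp >= 0x1160 && cp <= 0x11A7) return JAMO_V;
    if (cp >= 0x11A8 && cp <= 0x11FF) return JAMO_T;
    return NOT_JAMO;
}

/* Return 1 if cur stays in the same cluster as prev (simplified rules). */
static int extends_cluster(uint32_t prev, uint32_t cur) {
    if (cur >= 0x0300 && cur <= 0x036F) return 1;   /* combining mark */
    enum jamo p = jamo_class(prev), c = jamo_class(cur);
    if (p == JAMO_L) return c == JAMO_L || c == JAMO_V;
    if (p == JAMO_V) return c == JAMO_V || c == JAMO_T;
    if (p == JAMO_T) return c == JAMO_T;
    return 0;
}

/* Count clusters under the simplified rules above. */
size_t count_clusters_sketch(const char *utf8) {
    const unsigned char *s = (const unsigned char *)utf8;
    size_t clusters = 0;
    uint32_t prev = 0;
    while (*s != '\0') {
        uint32_t cur;
        s += decode_utf8(s, &cur);
        if (clusters == 0 || !extends_cluster(prev, cur)) {
            clusters++;   /* cur starts a new cluster */
        }
        prev = cur;
    }
    return clusters;
}
```

With this sketch, both "g\xCC\x88" (U+0067 U+0308) and "\xE1\x84\x80\xE1\x85\xA1\xE1\x86\xA8" (U+1100 U+1161 U+11A8) count as one cluster, which matches what a user sees. Many other cases, such as line endings, emoji sequences and other scripts, are not handled, which is exactly why a library like ICU is worth using.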

Unicode is a complicated subject. For more information, I suggest reading Unicode Explained by Jukka K. Korpela.