Monday, March 24, 2008

Who knew characters were so complicated...

One of my current projects at work deals with a decent number of different character sets. I knew from reading articles that they were confusing, but I didn't know they were this bad. Here is a little excerpt from a Wikipedia entry:

Note that merely having different "meanings" is not sufficient grounds to split a grapheme into several characters: Thus, the acute accent may represent word accent in Welsh or Swedish, it may express vowel quality in French, and it may express vowel length in Hungarian, Icelandic or Irish. Since all these languages are written in the Latin alphabet, the acute accent in its various meanings is considered one and the same combining diacritic character (U+0301). Confusingly, there is however a separate "combining diacritic acute tone mark" at U+0341 for the romanization of tone languages.
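To see what that means in practice, here is a quick Python sketch using the standard `unicodedata` module. The constant and function names (`ACUTE_ACCENT`, `ACUTE_TONE_MARK`, `describe`) are just my own labels for illustration. It builds an "e" with each of the two combining marks, shows that they are different code points, and then checks whether Unicode normalization treats them as the same thing.

```python
import unicodedata

ACUTE_ACCENT = "\u0301"     # the general-purpose combining acute from the excerpt
ACUTE_TONE_MARK = "\u0341"  # the separate tone mark from the excerpt


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


with_accent = "e" + ACUTE_ACCENT
with_tone_mark = "e" + ACUTE_TONE_MARK

# The two strings look alike on screen but contain different code points,
# so a plain comparison says they differ.
same_raw = with_accent == with_tone_mark

# NFC normalization maps both sequences to the same precomposed character.
nfc_accent = unicodedata.normalize("NFC", with_accent)
nfc_tone_mark = unicodedata.normalize("NFC", with_tone_mark)
same_normalized = nfc_accent == nfc_tone_mark

print(describe(with_accent))
print(describe(with_tone_mark))
print("equal before normalizing:", same_raw)
print("equal after NFC:", same_normalized)
```

So if text from different sources has to be compared or searched, normalizing it first is probably a good idea.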


Well, back to trying to wrap my head around UTF-8, UTF-32, ASCII, UCS-2, UCS-4, and god knows what other character sets and encodings.
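For what it's worth, here is a small Python sketch of how a few of these differ in the number of bytes they need for the same character. The helper name `encoded_sizes` is my own invention. Python has no codecs named UCS-2 or UCS-4, so this uses UTF-16 and UTF-32 as stand-ins: UCS-4 is effectively the same as UTF-32, and UCS-2 is roughly UTF-16 without surrogate pairs, meaning it cannot represent characters above U+FFFF.

```python
def encoded_sizes(text, encodings=("ascii", "utf-8", "utf-16-le", "utf-32-le")):
    """Map each encoding name to the byte length of text, or None if it cannot encode it."""
    sizes = {}
    for name in encodings:
        try:
            sizes[name] = len(text.encode(name))
        except UnicodeEncodeError:
            # For example, ASCII cannot represent anything above U+007F.
            sizes[name] = None
    return sizes


samples = {
    "plain letter": "A",
    "accented letter": "\u00e9",        # é
    "outside the BMP": "\U0001F600",    # an emoji above U+FFFF
}

for label, text in samples.items():
    print(label, encoded_sizes(text))
```

The little-endian variants are used so that Python does not add a byte order mark, which would otherwise inflate the counts.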