Download or view graphemesTest.frink in plain text format
/** Test Frink's parsing of graphemes.
To quote the Unicode standard:
"It is important to recognize that what the user thinks of as a
'character'--a basic unit of a writing system for a language--may not be
just a single Unicode code point. Instead, that basic unit may be made up
of multiple Unicode code points. To avoid ambiguity with the computer use
of the term character, this is called a user-perceived character. For
example, 'G' + acute-accent is a user-perceived character: users think of
it as a single character, yet is actually represented by two Unicode code
points. These user-perceived characters are approximated by what is called
a grapheme cluster, which can be determined programmatically."
Samples are taken from:
http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
*/
printAll[""] // Test empty string.
// Grapheme clusters (both legacy and extended)
printAll["you \u0067\u0308o\u0308"] // g and o, each followed by a combining diaeresis
printAll["\uAC01\u1100\u1161\u11A8"] // Hangul gag
printAll["\u0E01"] // Thai ko
/** Extended grapheme clusters
An extended grapheme cluster is the same as a legacy grapheme cluster, with
the addition of some other characters. The continuing characters are
extended to include all spacing combining marks, such as the spacing (but
dependent) vowel signs in Indic scripts. For example, this includes U+093F
DEVANAGARI VOWEL SIGN I. The extended grapheme clusters should be
used in implementations in preference to legacy grapheme clusters, because
they provide better results for Indic scripts such as Tamil or Devanagari
in which editing by orthographic syllable is typically preferred. For
scripts such as Thai, Lao, and certain other Southeast Asian scripts,
editing by visual unit is typically preferred, so for those scripts the
behavior of extended grapheme clusters is similar to (but not identical to)
the behavior of legacy grapheme clusters.
*/
printAll["\u0BA8"] // Tamil na
printAll["\u0BA8\u0BBF"] // Tamil ni (hmmm... don't combine?)
printAll["\u0E40"] // Thai character sara e
printAll["\u0E01\u0E33"] // Thai "ko kai" + "sara am" = "kam"
// hmmm... don't combine?
printAll["\u0937\u093F"] // Devanagari SSA + Vowel sign I = ssi
/* Legacy grapheme clusters.
A legacy grapheme cluster is defined as a base (such as A or カ) followed
by zero or more continuing characters. One way to think of this is as a
sequence of characters that form a “stack”.
The base can be single characters, or be any sequence of Hangul Jamo
characters that form a Hangul Syllable, as defined by D133 in The Unicode
Standard, or be any sequence of Regional_Indicator (RI) characters. The RI
characters are used in pairs to denote Emoji national flag symbols
corresponding to ISO country codes. Sequences of more than two RI
characters should be separated by other characters, such as U+200B ZERO
WIDTH SPACE (ZWSP).
The continuing characters include nonspacing marks, the Join_Controls
(U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER) used in Indic
languages, and a few spacing combining marks to ensure canonical
equivalence. Additional cases need to be added for completeness, so that
any string of text can be divided up into a sequence of grapheme
clusters. Some of these may be degenerate cases, such as a control code, or
an isolated combining mark.
*/
printAll["\u0E33"] // Thai character sara am
printAll["\u0937"] // Devanagari letter ssa
printAll["\u093F"] // Devanagari vowel sign i (combining but alone?)
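// A possible extra case, not among the original samples: the comment above
// says that a pair of Regional_Indicator characters acts as a single base.
// This sketch assumes the \u{...} escape handles these supplementary code
// points the same way it does in the emoji sample further down.
printAll["\u{1F1FA}\u{1F1F8}"] // Regional indicator symbol letters U + S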
// Tailored grapheme clusters
printAll["\u0063\u0068"] // Slovak ch digraph
printAll["\u006B\u02B7"] // k^w (sequence with letter modifier) hmmm.. not combining?
printAll["\u0915\u094D\u0937\u093F"] // Devanagari letter ka + sign virama + letter ssa + vowel sign i = kshi
// Something from StackOverflow:
//
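// Emoji ZWJ sequence: man + ZWJ + heavy black heart + variation selector-16
// + ZWJ + kiss mark + ZWJ + man
printAll["\u{1F468}\u{200D}\u{2764}\u{FE0F}\u{200D}\u{1F48B}\u{200D}\u{1F468}"]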
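// Print the graphemes of a string as given, after Unicode normalization,
// and after reversal.
printAll[str] :=
{
   printGraphemes[str]
   printGraphemes[normalizeUnicode[str]]
   printGraphemes[reverse[str]]
   /* g = new graphics
      g.font["SansSerif", 10]
      g.text["$str\n" + reverse[str], 0, 0]
      g.show[] */
}
// Print the string, its length[] and graphemeLength[] in parentheses, and
// then the list of its graphemes in input form.
printGraphemes[str] :=
{
   graphemes = array[graphemeList[str]]
   print["$str (" + length[str] + "," + graphemeLength[str] + "):\t"]
   println[inputForm[graphemes]]
}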
This is a program written in the programming language Frink.