A Guide to Using Myanmar Unicode

Line Breaking and Tokenizing Myanmar Text

Myanmar text is often written without spaces between words, so line breaking cannot rely on whitespace alone. Ideally, a line-breaking algorithm uses dictionary lookup to find word boundaries; this is the approach used in the latest builds of Myanmar OpenOffice.

A Dual Weight Algorithm based on syllables has also been developed. It prefers to break at spaces and section marks, but also allows breaks between syllables when those stronger break points are insufficient. Where a syllable boundary falls before a stacked character, no break is allowed, because the stacked character is rendered beneath the preceding consonant and cannot be separated from it. The algorithm can also be adapted to tokenize Myanmar text for indexing and searching applications.
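
The idea of two break weights can be illustrated with a minimal sketch in Python. This is not the Java implementation referred to in this guide; the function names (break_weights, syllables, wrap) and the weight constants are hypothetical, and the syllable rules are deliberately simplified. It assumes Unicode 5.1 encoding order, treats only the consonants U+1000 to U+1021 as syllable starters, and measures line width in code points rather than rendered width.

```python
# Simplified, illustrative dual-weight breaker (not the MyanmarParser code).
CONSONANTS = range(0x1000, 0x1022)   # KA .. A; independent vowels are ignored here
DOT_BELOW = "\u1037"
VIRAMA = "\u1039"                    # invisible stacking sign
ASAT = "\u103A"                      # visible vowel killer
SECTION_MARKS = "\u104A\u104B"       # little section, section

NO_BREAK, SYLLABLE_BREAK, STRONG_BREAK = 0, 1, 2


def break_weights(text):
    """Return a list where weights[i] is the break weight before text[i]."""
    weights = [NO_BREAK] * len(text)
    for i in range(1, len(text)):
        prev, cur = text[i - 1], text[i]
        if prev.isspace() or prev in SECTION_MARKS:
            # Strong break after whitespace or a section mark.
            if not cur.isspace() and cur not in SECTION_MARKS:
                weights[i] = STRONG_BREAK
        elif ord(cur) in CONSONANTS:
            if prev == VIRAMA:
                continue             # stacked consonant: never break before it
            j = i + 1
            if j < len(text) and text[j] == DOT_BELOW:
                j += 1               # dot below may precede asat
            follower = text[j] if j < len(text) else ""
            if follower in (ASAT, VIRAMA):
                continue             # final consonant of the previous syllable
            weights[i] = SYLLABLE_BREAK
    return weights


def syllables(text):
    """Tokenize by cutting at every position with a non-zero weight."""
    weights = break_weights(text)
    tokens, start = [], 0
    for i in range(1, len(text)):
        if weights[i] != NO_BREAK:
            tokens.append(text[start:i])
            start = i
    if text:
        tokens.append(text[start:])
    return tokens


def wrap(text, width):
    """Greedy wrapping: prefer the last strong break, then a syllable break."""
    weights = break_weights(text)
    lines, start = [], 0
    while len(text) - start > width:
        limit = start + width
        candidates = range(limit, start, -1)
        cut = next((i for i in candidates if weights[i] == STRONG_BREAK), None)
        if cut is None:
            cut = next((i for i in candidates if weights[i] == SYLLABLE_BREAK), None)
        if cut is None:
            cut = limit              # emergency break: nothing better available
        lines.append(text[start:cut].rstrip())
        start = cut
    lines.append(text[start:])
    return lines
```

For example, syllables("\u1019\u103C\u1014\u103A\u1019\u102C\u1005\u102C") splits the text into three syllables, while syllables("\u1000\u1019\u1039\u1018\u102C") returns a single token, because the only syllable boundary falls before a stacked consonant. The same syllables function is the starting point for tokenizing text for indexing, although a real indexer would normally combine syllables into words.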

An example implementation in Java is available from the following Mercurial repository:

hg clone http://thanlwinsoft.co.uk/cgi-bin/hgwebdir.cgi/MyanmarParser/

The GDL rules in the Padauk font give similar line-breaking behaviour using Graphite's multi-weight line-breaking features, although the underlying algorithm is different.
