29  Text Analysis in “Hard” Languages

29.1 Objectives

  • Non-English characters and encoding issues
  • Segmentation in languages that do not use whitespace delimiters between words
  • Right-to-left languages
  • Emoji

29.2 Methods

Applicable methods for the objectives listed above.
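As one illustration of the encoding objective, the sketch below uses only the Python standard library. It shows two common pitfalls: decoding bytes with the wrong codec, and comparing strings that look identical but are stored in different Unicode normalization forms. The sample word and variable names are chosen for illustration only.

```python
import unicodedata

# UTF-8 bytes for "café", as they might arrive from a file or web request.
raw = "caf\u00e9".encode("utf-8")

# Decoding with the wrong codec raises no error; it silently yields mojibake.
wrong = raw.decode("latin-1")
right = raw.decode("utf-8")

# The same visible string can be stored precomposed (NFC: one code point
# for "é") or decomposed (NFD: "e" followed by a combining acute accent).
composed = unicodedata.normalize("NFC", right)
decomposed = unicodedata.normalize("NFD", right)

print(repr(wrong))                      # mojibake version
print(len(composed), len(decomposed))   # lengths differ
print(composed == decomposed)           # equality fails without normalization
```

Normalizing every input to one form (often NFC) before tokenizing or matching is one way to avoid treating the two spellings as different words.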

29.3 Examples

Chinese, Japanese, Korean.
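Because these languages are typically written without whitespace between words, a segmenter has to decide where the word boundaries fall. The sketch below shows forward maximum matching, one of the simplest dictionary-based approaches. The function name, the `max_len` parameter, and the six-entry lexicon are hypothetical and exist only for this example; practical segmenters rely on much larger dictionaries or statistical models.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation: take the longest lexicon match."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until the lexicon matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback token.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy lexicon (an assumption for illustration, not a real word list).
TOY_LEXICON = {"我", "喜欢", "自然", "语言", "自然语言", "处理"}

tokens = forward_max_match("我喜欢自然语言处理", TOY_LEXICON)
print(tokens)
```

With this lexicon the greedy rule prefers the four-character entry over its two shorter parts, which shows how the result depends on both the dictionary and the matching strategy.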

Hindi, Georgian.
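In Devanagari, the script used for Hindi, vowel signs and the virama are stored as separate combining code points attached to a base consonant. The sketch below uses the standard library's `unicodedata.category` to separate base characters from marks. The helper name `count_base_characters` is hypothetical, and counting non-mark code points is only a rough proxy for what a reader perceives as characters, not a full grapheme cluster segmentation.

```python
import unicodedata


def count_base_characters(text):
    """Count code points that are not combining marks (Mn, Mc, Me)."""
    return sum(
        1 for ch in text
        if not unicodedata.category(ch).startswith("M")
    )


# The word "Hindi" in Devanagari, written with escapes so that each
# code point is visible: HA, vowel sign I, NA, virama, DA, vowel sign II.
hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"

print(len(hindi))                    # number of code points
print(count_base_characters(hindi))  # number of base consonants
for ch in hindi:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
```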

Arabic and other RTL languages.
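Right-to-left text is stored in logical (reading) order and only reordered for display, so the first character of an Arabic string is the first one read, even though it appears at the right edge. One way to detect such text is to inspect each character's Unicode bidirectional class, as in the sketch below. The function `rtl_share` is a hypothetical helper, and restricting the count to the strong classes `L`, `R`, and `AL` is a simplifying assumption.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}          # Hebrew-type and Arabic-type letters
STRONG_CLASSES = {"L", "R", "AL"}  # characters with a strong direction


def rtl_share(text):
    """Return the fraction of strongly directional characters that are RTL."""
    classes = [unicodedata.bidirectional(ch) for ch in text]
    strong = [c for c in classes if c in STRONG_CLASSES]
    if not strong:
        return 0.0
    return sum(c in RTL_CLASSES for c in strong) / len(strong)


arabic = "\u0645\u0631\u062d\u0628\u0627"  # an Arabic greeting
mixed = "abc \u05e9\u05dc"                  # Latin letters plus two Hebrew letters

print(rtl_share("hello"))
print(rtl_share(arabic))
print(rtl_share(mixed))
```

Spaces, digits, and punctuation have weak or neutral classes, which is why they are left out of the ratio here.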

29.4 Issues

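Stemming, syllable counts, and character counts may be off when tools designed for English are applied to other scripts. A single character as the reader sees it may consist of several code points, so string lengths need not match perceived character counts.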
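To see why character counts can disagree, the sketch below measures the same strings in three units: Unicode code points, UTF-8 bytes, and UTF-16 code units. The helper `length_report` is hypothetical, and the two sample strings (a Georgian word and a thumbs-up emoji with a skin-tone modifier) are chosen only for illustration.

```python
def length_report(text):
    """Measure one string in three common units of 'length'."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        # Each UTF-16 code unit occupies two bytes.
        "utf16_units": len(text.encode("utf-16-le")) // 2,
    }


georgian = "ქართული"             # seven Georgian letters
thumbs = "\U0001F44D\U0001F3FD"  # thumbs-up sign + skin-tone modifier

print(length_report(georgian))
print(length_report(thumbs))
```

The emoji is displayed as a single symbol, yet none of the three measures reports a length of one, so any limit or statistic based on "characters" should state which unit it uses.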

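Font issues: characters missing from the display font are rendered as empty boxes or other placeholder glyphs, even when the underlying text is correctly encoded.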

29.5 Further Reading

Further reading here.

29.6 Exercises

Add some here.