29 Text analysis in “Hard” Languages
29.1 Objectives
- Non-English characters and encoding issues
- Segmentation in languages that do not use whitespace delimiters between words
- Right-to-left languages
- Emoji
29.2 Methods
Applicable methods for the objectives listed above.
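As one illustration of the encoding objective, the sketch below uses Python's standard `unicodedata` module to show two common pitfalls: the same visible word stored as different code point sequences, and text decoded with the wrong encoding. The example word and the helper name `nfc` are chosen for illustration only; this is a minimal sketch, not a complete cleaning pipeline.

```python
import unicodedata

# The same visible word, once with a precomposed "é" (U+00E9) and once
# with a plain "e" followed by a combining acute accent (U+0301).
composed = "caf\u00e9"
decomposed = "cafe\u0301"

# The two strings look identical on screen but compare as different.
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 4 5


def nfc(text: str) -> str:
    """Normalize text to Unicode Normalization Form C (composed)."""
    return unicodedata.normalize("NFC", text)


# After normalization, the two spellings compare as equal.
print(nfc(composed) == nfc(decomposed))  # True

# Decoding UTF-8 bytes as Latin-1 produces the familiar "mojibake".
raw = composed.encode("utf-8")
garbled = raw.decode("latin-1")
print(garbled)  # cafÃ©
```

Normalizing every document to a single form before tokenizing or matching is a cheap way to avoid counting one word as two different types.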
29.3 Examples
Chinese, Japanese, Korean.
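To make the segmentation problem concrete, here is a minimal sketch of dictionary-based forward maximum matching, one simple way to split unspaced text into words. The function name `max_match` and the five-entry lexicon are hypothetical; practical segmenters typically rely on far larger dictionaries or statistical models.

```python
def max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching over a set of known words."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a lexicon hit;
        # a single character is always accepted as a fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


# Toy lexicon, invented for this example.
lexicon = {"我", "喜欢", "文本", "分析", "文本分析"}

print(max_match("我喜欢文本分析", lexicon))
# ['我', '喜欢', '文本分析']
```

Because the method is greedy, it prefers the longer entry "文本分析" over the two shorter words "文本" and "分析". That preference is a design choice and can produce wrong splits when a long dictionary entry overlaps the true word boundaries.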
Hindi, Georgian.
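Scripts such as Devanagari attach vowel signs and other marks to base letters, so the number of code points in a word can differ from the number of letters a reader perceives. The sketch below, using only the standard `unicodedata` module, counts combining marks and guesses the script from each character's Unicode name. The helper names `mark_profile` and `scripts` are hypothetical, and taking the first word of the character name is a rough heuristic rather than the official Unicode Script property.

```python
import unicodedata


def mark_profile(word):
    """Count code points, combining marks, and remaining base characters."""
    marks = sum(1 for ch in word if unicodedata.category(ch).startswith("M"))
    return {"code_points": len(word), "marks": marks, "base": len(word) - marks}


def scripts(word):
    """Heuristic script guess: the first word of each character's Unicode name."""
    return {unicodedata.name(ch).split()[0] for ch in word}


hindi = "हिन्दी"       # the word "Hindi" in Devanagari
georgian = "ქართული"  # the word "Georgian" in Georgian script

print(mark_profile(hindi), scripts(hindi))
print(mark_profile(georgian), scripts(georgian))
```

In this example the Hindi word has six code points, three of which are marks, while the Georgian word has seven code points and no marks.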
Arabic and other RTL languages.
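A simple way to flag right-to-left text is to look at the bidirectional class of the first strongly directional character. The sketch below does this with the standard `unicodedata.bidirectional` function; the helper name `is_rtl` is hypothetical, and the rule is a heuristic for a single string rather than a full implementation of bidirectional text layout.

```python
import unicodedata


def is_rtl(text):
    """Return True if the first strongly directional character is right-to-left."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):  # Hebrew-style and Arabic-style letters
            return True
        if bidi == "L":          # left-to-right letters
            return False
    # No strong character found (empty string, digits, punctuation only).
    return False


print(is_rtl("مرحبا"))      # True  (Arabic)
print(is_rtl("שלום"))       # True  (Hebrew)
print(is_rtl("hello"))      # False
print(is_rtl("123 مرحبا"))  # True  (digits and spaces are not strong)
```

Strings are stored in reading order regardless of direction, so indexing and slicing work the same way as for left-to-right text. Direction mainly matters for display and for mixed-direction documents.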
29.4 Issues
Stemming rules, syllable counts, and character counts designed for English may give wrong results in these languages.
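The character count problem is easy to reproduce with emoji. The sketch below measures the same string in code points, UTF-8 bytes, and UTF-16 code units. The helper name `count_units` is hypothetical. Counting user-perceived characters would require grapheme cluster segmentation, which this sketch does not attempt.

```python
def count_units(text):
    """Measure a string in three units that tools commonly report as 'length'."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_units": len(text.encode("utf-16-le")) // 2,
    }


# A family emoji: man + zero-width joiner + woman + zero-width joiner + girl.
# Most readers see one symbol on screen.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

# A thumbs-up followed by a skin tone modifier, also displayed as one symbol.
thumbs = "\U0001F44D\U0001F3FD"

print(count_units("a"))     # 1 code point, 1 byte, 1 UTF-16 unit
print(count_units(family))  # 5 code points, 18 bytes, 8 UTF-16 units
print(count_units(thumbs))  # 2 code points, 8 bytes, 4 UTF-16 units
```

Which unit is correct depends on the task: storage limits care about bytes, while readability measures or length features usually intend the count a reader would give.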
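Font issues: a font may lack glyphs for some scripts, so text can display as empty boxes or placeholder symbols.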
29.5 Further Reading
Further reading here.
29.6 Exercises
Add some here.