The romanization of non-Latin scripts is a complex computational task that is highly language-dependent. This presentation will focus on three of the most challenging non-Latin scripts: Chinese, Japanese, and Arabic (CJA).
Much progress has been made in personal name machine-transliteration methodologies, as documented in the various NEWS reports over the last several years. Such techniques as phrase-based SMT, RNN-based LM and CRF have emerged, leading to gradual improvements in accuracy scores. But methodology is only one aspect of the problem. Equally important is the high level of ambiguity of the CJA scripts, which poses special challenges to named entity extraction and machine transliteration. These difficulties are exacerbated by the lack of comprehensive proper noun dictionaries, the multiplicity of ambiguous transcription schemes, and orthographic variation.
This presentation will clear up the differences between three basic concepts -- transliteration, transcription, and romanization -- that are a source of much confusion, even among computational linguists. It will focus on (1) the major linguistic issues, that is, the special characteristics of the CJA scripts that impact machine transliteration, and (2) the important role played by lexical resources such as personal name dictionaries.
Jack Halpern is a Japan-based lexicographer specializing in Chinese characters, or kanji. He is best known as editor-in-chief of the Kodansha Kanji Learner's Dictionary and as the inventor of the SKIP system for kanji lookup. Halpern is also an active unicyclist, having served as founder and president of the International Unicycling Federation. He currently resides in Saitama, Japan.
Halpern is CEO of the CJK Dictionary Institute (CJKI), which specializes in dictionary compilation for Chinese, Japanese, Korean, Arabic, and other languages. With CJKI, Halpern has published various lexicographical tools for language learners including the Kodansha Kanji Learner's Dictionary and the New Japanese-English Character Dictionary. CJKI has also produced a large number of technical dictionaries covering such topics as mechanical engineering, economics, and medicine. Aside from dictionary compilation, CJKI maintains and licenses large-scale lexical databases covering a total of approximately 24 million entries in Japanese, Chinese, Korean, and Arabic.
Halpern is also a noted polyglot with speaking ability in eleven languages: English, Japanese, Hebrew, Yiddish, Portuguese, Spanish, German, Chinese, Esperanto, Arabic, and Vietnamese. His reading ability extends to Ladino, Papiamento, and Aramaic.