CJKV Information Processing

Paperback

Author: Ken Lunde

ISBN-10: 0596514476

ISBN-13: 9780596514471

Category: Programming - General & Miscellaneous

CJKV Information Processing is the guide for tackling the difficult issues faced when dealing with complex Asian languages - Chinese, Japanese, Korean, and Vietnamese - in the context of computing or Internet services.


Eight years after its publication, CJKV Information Processing remains the ultimate English-language source of information on processing text in Chinese, Japanese, Korean, and Vietnamese. While its pre-eminence has not been challenged, its contents have aged. Unicode is becoming much more important, and the mix of technologies, encodings, and of course fonts continues to evolve. In this update, Ken Lunde re-examines the challenges of working with these languages, showing developers in a wide range of fields the latest tools for sharing information that can reach East Asia directly.

Chapter 9: Information Processing Techniques

Perl

Usually described as a scripting language, Perl, developed by Larry Wall, is much, much more than that. Perl's main strengths include rapid development, regular expressions (described later in this chapter), and hashes (associative arrays). It is not so much these individual features that provide Perl with extraordinary text-manipulation capabilities, but rather how these features are intertwined with one another. Other programming languages offer similar features, but there is often no convenient way for them to function together. In Perl, for example, a regular expression can be used to parse text, and at the same time used to store the resulting items into a hash for subsequent lookup.

Perl is the programming language of choice for those who write CGI programs or do other web-related programming (a topic that is discussed at the end of Chapter 13, The World Wide Web), because it is well suited for the task.

Although the current incarnation of Perl has no built-in support for internationalization (to the level that Java currently has), it is something that is being discussed by its developers. There are, however, clever ways to use Perl for handling multiple-byte data, most of which make use of regular expression tricks and techniques. The Perl code examples provided in Appendix W should be studied by any serious Perl programmer. Gisle Aas and Martin Schwartz have been diligently working on some extremely useful Unicode modules for Perl (such as Unicode::String, Unicode::Map8, and Unicode::Map), so you can expect some useful and interesting things to happen in the future.
The Unicode::Map module by Martin Schwartz, in particular, already supports code conversion between Unicode and a number of legacy CJKV encodings.

Kazumasa Utashiro has developed a useful Japanese-enabling Perl library called jcode.pl, which includes Japanese code conversion routines. Some may find the Japanese version of Perl, called JPerl, to be useful, although I suggest using programming techniques that avoid JPerl for optimal portability. JPerl adds Japanese support to the following features: regular expressions, formats, some built-in functions (chop and split), and the tr/// operator. The definitive guide to Perl is Programming Perl, Second Edition, by Larry Wall et al. (O'Reilly & Associates, 1996). Tom Christiansen and Nathan Torkington's Perl Cookbook (O'Reilly & Associates, 1998) is also highly recommended as a companion volume to Programming Perl. The comp.lang.perl.misc newsgroup should also be of interest. The best place to find Perl is at CPAN (Comprehensive Perl Archive Network).

Python

Like Perl, Python is also sometimes described as a scripting language. Python was developed by Guido van Rossum, and is a high-level programming language that provides valuable programming features such as hashes and regular expressions.

An excellent guide to Python is Mark Lutz's Programming Python (O'Reilly & Associates, 1996). The comp.lang.python newsgroup should also be of interest if you want to learn about recent Python developments and join discussions. There is also a Python web site from which Python itself is available.

Tcl

Tcl, which stands for Tool Command Language, is a programming language that was originally developed by John Ousterhout while a professor at UC Berkeley. Like Perl and Python, Tcl is considered a high-level scripting language that provides built-in facilities for hashes and regular expressions.
John later founded Scriptics Corporation where Tcl is now being advanced.

Some important milestones in Tcl's history include its byte-code compiler introduced for Version 8.0, and support for Unicode (in the form of UTF-8 encoding) that began with Version 8.1. Tcl will also have a regex package comparable to Perl's by the time you read this. The lack of a byte-code compiler has always kept Tcl slower than Perl.

Tcl is rarely used alone, but rather with its GUI (Graphical User Interface) component called Tk (standing for Tool Kit).

Other Programming Environments

While it is possible to write multiple-byte-enabled programs using all of the programming languages mentioned above, there are some programming environments that have done all this work for you, meaning that you need not worry about multiple-byte enabling your own source code because you depend on a module to do it for you. This may not sound terribly exciting for companies with sufficient resources and multiple-byte expertise, but may be a savior for smaller companies with limited resources.

One example of such a programming environment is Visix's Galaxy Global, a multilingual product based on their Galaxy product. (Visix Software has since gone out of business.)

Perhaps of greater interest is Basis Technology's "Rosette: C++ Library for Unicode," which is a compact, general-purpose Unicode-based source code library. Embedded into an application, this library adds Unicode text processing capabilities that are robust and efficient across a variety of platforms (MacOS, Unix, Windows, and so on). Its functions adhere to the latest Unicode specifications. Major functions include code conversion between major legacy encodings and Unicode encodings, character classification (identification of a character), and character property conversion (such as half- to full-width katakana conversion). Basis Technology also offers a general-purpose code conversion utility, called "Uniconv," built using this library.
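
The half- to full-width katakana conversion mentioned above can be sketched in Python. This is not Rosette's API: it is a minimal illustration that relies on the standard-library unicodedata module, and the names HALFWIDTH_KATAKANA and to_fullwidth_katakana are hypothetical. It assumes that NFKC normalization, applied only to runs in the half-width katakana range (U+FF61 to U+FF9F), is an acceptable stand-in for a dedicated width-conversion routine.

```python
import re
import unicodedata

# Runs of half-width katakana, including half-width CJK punctuation and
# the half-width voiced/semi-voiced sound marks (U+FF61..U+FF9F).
HALFWIDTH_KATAKANA = re.compile(r"[\uFF61-\uFF9F]+")


def to_fullwidth_katakana(text: str) -> str:
    """Convert half-width katakana to full-width, leaving other text alone.

    NFKC is applied only to the matched runs, so full-width Latin letters
    and other compatibility characters elsewhere in the string are not
    folded as a side effect.
    """
    return HALFWIDTH_KATAKANA.sub(
        lambda m: unicodedata.normalize("NFKC", m.group(0)), text
    )


if __name__ == "__main__":
    print(to_fullwidth_katakana("ｶﾞｲﾄﾞ ABC"))
```

Normalizing whole runs rather than single characters matters here: a half-width base letter followed by a half-width voiced sound mark composes into one full-width character only when both are normalized together.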
Also of interest are UniScape's Global C and Global Checker packages, Sybase's Unilib, and Alis Technologies' Batam (their own Tango web browser is an example of this library's usage in a real product).

Code Conversion Algorithms

It is very important to understand that only the encoding methods for the national character sets are mutually compatible, and work quite well for round-trip conversions. The vendor-defined character sets often include characters that do not map to anything meaningful in the national character set standards. When dealing with the Japanese ISO-2022-JP, Shift-JIS, and EUC-JP encodings, for example, algorithms are used to perform code conversion. This involves mathematical operations that are applied equally to every character represented under an encoding method, and is known as algorithmic conversion....
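
As an illustration of algorithmic conversion, the sketch below converts EUC-JP bytes to Shift-JIS in Python using only arithmetic on byte values, with no lookup table. It is a minimal sketch and not the book's code: the function names jis_to_sjis and euc_jp_to_shift_jis are hypothetical, and it assumes input limited to ASCII, half-width katakana, and JIS X 0208 two-byte characters. JIS X 0212 characters (EUC-JP prefix 0x8F) are rejected, since this sketch has no Shift-JIS form for them.

```python
def jis_to_sjis(j1: int, j2: int) -> tuple[int, int]:
    """Map a JIS X 0208 byte pair (each 0x21..0x7E) to a Shift-JIS pair."""
    if j1 % 2:                      # odd row: second byte lands in 0x40..0x9E
        s2 = j2 + 0x1F
        if s2 >= 0x7F:              # 0x7F is skipped in Shift-JIS
            s2 += 1
    else:                           # even row: second byte lands in 0x9F..0xFC
        s2 = j2 + 0x7E
    s1 = (j1 + 1) // 2 + 0x70       # two JIS rows share one first byte
    if s1 >= 0xA0:                  # jump over the half-width katakana range
        s1 += 0x40
    return s1, s2


def euc_jp_to_shift_jis(data: bytes) -> bytes:
    """Convert EUC-JP bytes to Shift-JIS by arithmetic alone (assumed subset)."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                # ASCII passes through unchanged
            out.append(b)
            i += 1
            continue
        if i + 1 >= len(data):
            raise ValueError("truncated multiple-byte character")
        if b == 0x8E:               # SS2: half-width katakana, one byte in Shift-JIS
            out.append(data[i + 1])
        elif 0xA1 <= b <= 0xFE:     # JIS X 0208: strip the high bits, then map
            out.extend(jis_to_sjis(b - 0x80, data[i + 1] - 0x80))
        else:
            raise ValueError(f"unsupported EUC-JP lead byte 0x{b:02X}")
        i += 2
    return bytes(out)


if __name__ == "__main__":
    sample = "CJKV 日本語かなｶﾅ"
    print(euc_jp_to_shift_jis(sample.encode("euc_jp")).hex())
```

The same operations apply to every two-byte character, which is what distinguishes this approach from table-driven conversion such as mapping a legacy encoding to Unicode.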

Foreword
Preface
1. CJKV Information Processing Overview
2. Writing Systems
3. Character Set Standards
4. Encoding Methods
5. Input Methods
6. Font Formats
7. Typography
8. Output Methods
9. Information Processing Techniques
10. Operating Systems, Text Editors, and Word Processors
11. Dictionaries and Dictionary Software
12. The Internet
13. The World Wide Web
A. Code Conversion Tables
B. Notation Conversion Table
C. Vendor Character Set Standards
D. Vendor Encoding Methods
E. GB 2312-80 Table
F. GB/T 12345-90 Table
G. CNS 11643-1992 Table
H. Big Five Table
I. Hong Kong GCCS Table
J. JIS X 0208:1997 Table
K. JIS X 0212-1990 Table
L. KS X 1001:1992 Table
M. KS X 1002:1991 Hanja Table
N. Hangul Reading Table
O. TCVN 6056:1995 Table
P. Code Table Indexes
Q. Character Lists and Mapping Tables
R. Chinese Character Lists
S. Single-Byte Code Tables
T. Software and Document Sources
U. Mailing Lists
V. Professional Organizations
W. Perl Code Examples
X. Glossary
Bibliography
Index