---------- Forwarded message ----------
Date: Sat, 2 Sep 1995 13:29:57 +0800 (CST)
From: David Chiou <b83050@cctwin.ee.ntu.edu.tw>
Subject: Chinese Characters FAQ (about Buddhism)
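Below are the more important of the documents on Chinese character
encodings that I have recently collected from various places.
At present the choice of encoding and the problem of user-defined
characters are the bottleneck in the input of Buddhist scriptures; the
following information is offered for everyone's reference.
ps. If any members of this mail alias work at a temple, or are very
interested in information about Chinese input (encodings, character
creation, etc.), please reply and let me know, so that I can add you to
the mail alias of Buddhist organizations working on scripture input.
Some technical questions about Chinese encodings will not be posted to
the current mail alias.
(At present that alias is only carbon-copied to the Center for Buddhist
Studies at National Taiwan University, Ven. Ziyan of Xiangguang Temple,
Ven. Guoguang of Nongchan Temple and a few other venerables, and the
accounts of a few especially enthusiastic members.)
Below is the important FAQ on Chinese encodings as they relate to
Buddhist scriptures:
(I previously forwarded several dozen related messages to the Buddhist
organizations above for reference. If anyone is especially interested,
you may ask me for more detailed documents, or join the list of
Buddhist organizations directly.)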
=========================================================================
Date: Sat, 13 May 1995 10:07:34 +0800
From: Shann Wei-Chang <shann@math.ncu.edu.tw>
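About the author: Professor Shann Wei-Chang of the Department of
Mathematics, National Central University, is deeply interested in
classical Chinese studies, is very familiar with UNIX systems, and has
taken part in online discussions of encodings for many years.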
Subject: internal code
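Dagang,
I have just read your report on the rare-character problems encountered
in the input of Buddhist scriptures. As you know, I have been on the
CCNET and CHPOEM mailing lists for a long time, and we often discuss
problems of this kind. There is in fact no agreed solution, and my own
views have changed over time (whether they are maturing as they change
remains to be seen).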
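Let me tell you my current thinking, for your reference. First, I do not
like Big-5 or the group of people who originally designed it; it is a
classic case of bad money driving out good. However, as I have come to
recognize the facts and compromise with them (this is probably a matter
of age), I have begun to admit that any Chinese electronic file meant to
circulate widely must be compatible with Big-5: directly compatible,
with no need for code conversion or special handling.
Outputting special characters is easier than inputting them (input for
search, for instance). However, an electronic document usually carries
only character codes and not the glyphs (the bitmap binary file, or
other formats). If the document circulates on floppy disk or CD-ROM the
problem is smaller, but we always hope that the same document can be
put on the network with minimal changes. At that point the document and
the reader software are two separate matters. This is where the most
effort is needed.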
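My current idea is to use Big-5 as the base code and, on meeting a rare
character, to set it off with an escape sequence, as is done in the HZ
code commonly used by students overseas, in the Japanese JIS standard,
and in the EUC supported by most UNIX workstations. If the reader
software on the user's side cannot recognize the escape sequence, or
has no corresponding font, the reader may see a string of garbage, but
such characters should usually be few and should not affect the content
of the article as a whole. As for which strings to use as the escape
sequence: our national CNS code has already been registered
internationally, and we should follow that standard as far as possible;
where we cannot, we should use the power of mass communication over the
network, together with political lobbying, to have our chosen escape
sequence made a standard. As for how rare characters should be encoded,
we should likewise first consult the Standard Interchange Code
published by the National Bureau of Standards in 1992. The layout of
this code conforms to international standards. It currently has seven
planes, with plenty of room for expansion, and each plane holds 94*94
code points as the international standard prescribes (two bytes, each
byte is between 33 and 126, decimal inclusive). The characters chosen
for planes 1 and 2 are essentially the same as those of Big-5, with a
few (perhaps all) of the errors corrected. Planes 3 through 7 define
more than thirty thousand rare characters, alternate forms, variant
forms, and some odd characters that appear only in the naming lore of
fortune-tellers: both their codes and their glyphs. Planes 8 through 16
are empty; plane 12 is user defined.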
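The plane layout described in the letter (two bytes per character, each
between 33 and 126 decimal) can be checked with a little arithmetic. The
following is only a sketch; the function names are invented for this
illustration and are not part of any CNS tool or library.

```python
# Sketch of the 94 x 94 plane layout described in the letter.
# All names here are illustrative assumptions.
LOW, HIGH = 33, 126            # each byte lies in 33..126 decimal, inclusive
CELLS = HIGH - LOW + 1         # 94 possible values per byte


def plane_capacity() -> int:
    """Number of code points in one plane (94 * 94)."""
    return CELLS * CELLS


def to_index(b1: int, b2: int) -> int:
    """Map a two-byte code to a 0-based position inside its plane."""
    if not (LOW <= b1 <= HIGH and LOW <= b2 <= HIGH):
        raise ValueError("both bytes must be in 33..126")
    return (b1 - LOW) * CELLS + (b2 - LOW)


def from_index(index: int) -> tuple[int, int]:
    """Inverse of to_index: recover the two bytes from a position."""
    if not 0 <= index < plane_capacity():
        raise ValueError("index outside the plane")
    row, col = divmod(index, CELLS)
    return row + LOW, col + LOW


if __name__ == "__main__":
    print("code points per plane:", plane_capacity())
    print("code points in 7 planes:", 7 * plane_capacity())
```

One plane therefore holds 8836 positions, which is why more than thirty
thousand rare characters need several planes.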
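My learning is not sufficient to judge whether the characters in planes
3 through 7 are complete or properly ordered, because they are all
characters I do not recognize. If there are still characters in the
Buddhist scriptures that cannot be found there, I suggest not using
plane 12, but rather using the political influence of Buddhist
organizations to secure a plane, for example plane 13, as a plane for
rare religious characters. Anything "user defined" is bound to end up
as a useless mess.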
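As for the input of rare characters, it is clear that the corresponding
Chinese input software and fonts must be developed. On X Window there is
already an established approach to follow, and there should be no
technical difficulty on other systems either.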
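I do not know what our government has been doing. For all Taiwan's
self-styled status as a computer kingdom, our national interchange code
was first published only in 1986, and it was poorly communicated, so
that nobody in the market paid attention to it (ignoring the government
seems to be a shared trait of Chinese people on both sides of the strait
in modern times). I think that even now many people in the field have
not heard of this standard, or have heard of it but never considered
using it. The Institute for Information Industry and some public
agencies, however, have begun (perhaps under compulsion) to use it, and
some foreign companies have begun to support it, since it is after all
an internationally registered national standard code.
I wrote this in haste and made some typos, but this editor does not
make corrections easy; please forgive me.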
-Shann
========================================================================
Date: Mon, 28 Aug 1995 22:57:15 +0800 (CST)
From: David Chiou <b83050@cctwin.ee.ntu.edu.tw>
Subject: Recommend Chinese Code -- CNS
The following is an introduction to the various encodings, taken from
the Hanazono University Zen studies WWW site:
http://www.iijnet.or.jp/iriz/irizhtml/irizhome.htm
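(For some of the important content I will add a Chinese translation as
I go, though I cannot guarantee that it is free of errors. The original
text is authoritative in all cases.)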
_________________________________________________________________
Chinese character codes: an update
¤¤¤å¤º½Xªº±´¯Á¡Gק睊
by Christian Wittern
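About the author: a senior staff member of the Zen studies center at
Hanazono University in Kyoto, Japan (the publisher of the "Electronic
Bodhidharma"). The center has been coordinating worldwide efforts to
digitize Buddhist texts since before 1992 and is arguably the largest
such liaison network in the world today.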
_________________________________________________________________
Summary
This article presents an update to Christian Wittern's and Urs App's
articles concerning Chinese character codes (Electronic Bodhidharma
No. 3). In those articles, Urs App argued that database creators must
make the most crucial distinction between master data and user data.
Master data should be of the highest quality, recording even minute
detail like studio recording equipment. User data, on the other hand,
must conform to what codes and equipment we presently have. Christian
Wittern's article compared different codes and concluded that CCCII, a
very large Taiwanese code that also includes Japanese and Korean
letters, seems to be the best choice for the master data set of
Chinese text databases.
ºKn
¥»¤å§ï¶i¤F Christian Wittern ¥ý¥Í©M Urs App Ãö©ó¤¤¤å¤º½XªºµûªR
¡]¥Z¸ü©ó¡u¹q¤l¹F¼¯¡v´Á¥Z²Ä¤T´Á¡^¡C¦b¸Ó¤å¤¤¡A Urs App ªí¥Ü¸ê®Æ®w
ªº«Ø¥ßªÌ¥²¶·¹ï©ó master data ¤Î user data §@¤U«D±`«D±`«nªº¨M©w¡C
Master data ¥²¶·¨ã¦³³Ì°ªªº«~½è¡A¦p¦P¿ý¼v¾¹§÷°O¿ý¤U¨C¤ÀÄÁªºµe±¤@¯ë¡F
¥t¤@¤è±¡A user data ¥²¶·¶¶±q©ó¨ººØ¤º½X¬O§Ú̲{¦³ªº¡C
Christian Wittern ¥ý¥Íªº¤å³¹¤ñ¸û¤F´XºØ¤£¦Pªº¤º½X¡Aµ²½×¬O¡G
¡u CCCII¡]¤@ºØ«D±`Ãe¤jªº¥xÆWªº¤º½X¡A¨Ã¥B¥]§t¤F¤é¥»¤ÎÁú°ê¦r¡^
¦ü¥G¬O¤¤¤å¤º½Xªº master data ªº³Ì¨Î¿ï¾Ü¡C¡v
We shelled out US $ 2000 for a CCCII board, only to discover that both
the code itself and its implementation are seriously flawed. We thus
had to continue using Big-5 for all practical purposes while looking
for better solutions. Finally, Christian decided that the only
practical approach at this time was to build on Big-5 (and other
national codes such as JIS) and extend them through code references
that are both stable and portable. His ingenious approach forms the
basis of the IRIZ KanjiBase and its encoding scheme -- a scheme which
will be as useful after the introduction of Unicode as it proves to be
right now. (U.A.)
§Ú̪á¤U¤F¬üª÷ 2000 ¤¸¡A¶R¤F¤@Ó CCCII ªºªO±¡Aµ²ªGµo²{¸Ó½X¥»¨¤Î
¥¦ªºªþÄݳ]³Æ¡A³£¨ã¦³ÄY«ªº·å²«¡C¦]¦¹¡A§Ú̦b¹ê»Úªºª¬ªp¤W¡A¥u¦nÄ~Äò
¨Ï¥Î BIG-5¤º½X¡Aµ¥µÛÄ~Äò´M§ä§ó¦nªº¸Ñ¨M¤è®×¡C³Ì«á¡A Christian ¥ý¥Í
¨M©w¤F¡A²{®É°ß¤@¹ê»Ú¥i¦æªº¤èªk¬O«Ø¥ß¦b BIG-5 ¡]¤Î¤é¥»°ê¤º´¶¹M¬y¦æªº
JIS ½X¡^¤W±¡A¨Ã¥BÂǥѬJéw¤S¨ã¥iÄâ©Êªº¡u¤º½X°Ñ·Óªí¡v¡]code references¡^
¨ÓÂX®i¥¦Ì¡C¥Lªº³o¶µÁo©ú´£Ä³²£¥Í¤F¡uIRIZ º~¦r®w¡vªº°ò¦¡A¥H¤Î¡uIRIZ
º~¦r®w¡vªº¡uÂàĶ¾¹¡v¢w¢w¤@ºØ¦b±N¨Ó Unicode ¤Þ¶i«á¡A¯à°÷¦p¦P²{¦b§ÚÌ
ÃÒ©ú¥¦¦³°÷¹ê¥ÎªºÂàĶ¾¹¡C
_________________________________________________________________
* Some kanji codes for computers
1. Japanese JIS Codes
2. Taiwanese Big5
3. Taiwanese CNS
4. CCCII and EACC
5. Unicode
¡¯¤@¨Ç¹q¸£¤Wªºº~¦r¤º½X¡G
1. ¤é¥» JIS ¤º½X
2. ¥xÆW BIG-5 ¤º½X
3. ¥xÆW¤¤¥¡¼Ð·Ç§½ CNS ¤º½X
4. CCCII¤º½X¤Î EACC µ{¦¡
5. Unicode
* More information is available at ifcss.org in Ross Patterson's
document CJK Codes and in Ken Lunde: Understanding Japanese
Information Processing p35ff.
¡¯¦b ifcss.org(.jp) ¤W¦³§ó¦h¦³¥Îªº¸ê°T¡A´N¬O Ross Patterson ¥ý¥Íªº
¡u CJK ¤º½X¡v¤@¤å¡A¤Î Ken Lunde¥ý¥Íªº¡G¡u¤F¸Ñ¤é¥»¦b³B²z p35ff ¤W
ªº¸ê°T¡v¤å¥ó¡C
_________________________________________________________________
Development of kanji codes for computers
¹q¸£º~¦r¤º½Xªºµo®i
Japanese JIS Codes
¤é¥» JIS ½X
The first character code designed to make the processing of
ideographic characters on computers possible was the JIS C 6226-1978.
It was developed according to the guidelines laid down in the ISO
standard 2022-1973 and became the model for most other code standards
used today in East Asia (the most notable exception is Big5). Covering
approximately 6500 characters, this standard has been revised twice,
in 1983 and 1990, when the assignments of some characters were
changed and a few characters were added. Revising a standard is about
the worst thing a standards body can do, and it has caused much grief
and headache among manufacturers and users alike. Today we finally have
fonts that bear the year of the standard they cover in their name, so
that users can know which version is encoded in a font and select it
accordingly.
Our texts and tools are based on the latest version.
The 1990 version has become known as JIS X 0208-1990 and, together
with an additional set of 5800 characters (JIS X 0212), has formed
the basis of the Japanese contribution to Unicode.
The JIS code is almost never used in computers as it was defined;
rather, some changes are made in the way the code numbers are
represented. This is necessary to allow JIS to be mixed with ASCII
characters and, as in the case of ShiftJis (or MS-Kanji, the most
popular encoding on personal computers) with earlier Japanese
encodings of half-width kana. East Asian text is thus most frequently
based on a multibyte encoding, a character stream that contains a
mixture of characters represented by one single byte and of characters
represented by two bytes.
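To make the idea of a mixed stream concrete, here is a minimal sketch that
splits a byte string into one-byte and two-byte characters. It assumes a
simplified EUC-style rule (a byte with the high bit set starts a two-byte
character); Shift-JIS uses different lead-byte ranges, and full EUC-JP also
has half-width kana and three-byte sequences that this sketch does not
handle. The function name is invented for this illustration.

```python
# Simplified EUC-style scanner: an assumption-laden sketch, not a full decoder.
def split_characters(data: bytes) -> list[bytes]:
    """Split a mixed stream into one-byte and two-byte characters."""
    chars = []
    i = 0
    while i < len(data):
        # High bit set: treat this byte as the first of a two-byte character.
        width = 2 if data[i] >= 0x80 else 1
        if i + width > len(data):
            raise ValueError(f"truncated two-byte character at offset {i}")
        chars.append(data[i:i + width])
        i += width
    return chars


if __name__ == "__main__":
    # Python's built-in "euc_jp" codec produces a suitable mixed stream.
    sample = "JIS 漢字".encode("euc_jp")
    for ch in split_characters(sample):
        print(len(ch), ch.hex())
```

A program that counts or indexes characters byte by byte, without such a
scan, will cut two-byte characters in half.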
In addition to the characters in the national standard, many Japanese
vendors have added their own private characters to JIS, making the
conversion between these different encodings difficult beyond belief.
Big5
¡]¤¤¤å BIG-5 ½X¡^
There are different legends about the beginnings of Big5; some say
that the code had been developed for an integrated application with 5
parts, and others say it was an agreement of five big vendors in the
computer industry. Whichever is true (and it might as well
be something else), the Taiwanese government did not recognize the need
for a practical encoding of Chinese characters in time.
Government agencies had apparently been involved also in the
development of Big5, but it was only in 1986 that an official code was
announced, a time by which Big5 was already a de facto standard with
numerous applications in daily use.
Ãö©ó BIG-5 ¤º½X¶}©lªº¶Ç»¡¡A¦³³\¦h¤£¦Pªºª©¥»¡G¦³¤H»¡¦¹¤º½X¬O¥Ñ¤@Ó
¾ã¦X¤Ó³¡¥÷ªºÀ³¥Î³nÅé©Ò²£¥Íªº¡A¤S¦³¤H»¡¥¦¬O¤Ó¤j«¬ªº¹q¸£¼t°Ó©Ò
¦@¦P¬ù©wªº¡C¤£ºÞþ¤@Ӷǻ¡¬O¯uªº¡A¥xÆW¬F©²¨Ã¥¼§Y®É¤F¸Ñ¤¤¤å¤º½X
ªº«n©Ê¤Î¶·¨D©Ê¡CÁöµM¬F©²¾÷Ãö«Ü©úÅã¦a¤]°Ñ»P¤F BIG-5 ªº¶}µo¤u§@¡A
¤£¹Lª½¨ì 1986 ¦~¡A©x¤èªº¤º½X¤~¥¿¦¡¹ï¥~«Å§G¡A³o®É BIG-5 ¤º½X¦¤w¬O
¬°¼Æ·¥¦hªº¤é±`À³¥Î³nÅé©Ò±Ä¥Îªº¼Ð·Ç¤F¡C
Big5 defines 13051 Chinese characters, arranged in two parts according
to their frequency of usage. The arrangement within these parts is by
number of strokes, then Kangxi radical. As Big5 was apparently
developed in a great hurry, some mistakes were made in the stroke
count (and thus placement) of characters, and two characters are
represented twice. On the other hand, some frequently used characters were
left out and were later implemented by individual companies.
All implementations agree on the core part of Big5, but different
extensions by individual vendors acquired much weight, most notably in
the case of the ETEN Chinese system that was very popular in the late
eighties and early nineties. As there is no document that defines Big5
apart from the documentation provided by the vendors with their
products, it is impossible to single out one standard Big5. This was
actually a big problem in the process of designing Unicode -- and it
remains one even today.
(This paragraph describes the major problem that Big5 cannot be pinned
down to a single standard, a problem that persists to this day and will
also cause trouble when Unicode is drawn up.)
CNS X-11643-1986 and CNS X-11643-1992
¡]¤¤¥¡¼Ð·Ç§½ CNS X-11643-1986 ¤Î CNS X-11643-1992¡^
This is the Chinese National Code for Taiwan. In the form published in
1992, it defines the glyph-shape, stroke count and radical heading for
48027 characters. For all these characters a reference font in a 40 by
40 grid ( and for most of them also in 24 by 24 grid ) is available
from the issuing body. These characters are assigned to 7 levels with
the more frequent at the lower levels and the variant forms at the two
top levels. The whole architecture reserves space for five more
standard levels, and four levels are reserved for non-standard, private
encoding, bringing the total to 16 levels, with a hypothetical space
for roughly 120 000 ideographs. On top of the currently defined ones,
one more level with about 7000 characters is currently under revision
and expected to be published in the course of 1995. This will bring
the total number of assigned characters to roughly 55000.
³o¬O¥xÆWªº¤¤¥¡¼Ð·Ç½X¡C¦b 1992 ¦~µo§Gªº®æ¦¡¤W¡A¥¦¬° 48027 Ó¤¤¤å¦r
©w¸q¤F glyph-shape¡Astroke count¡A¥H¤Î radical heading ¡C¹ï©ó³o¨Ç
©Ò¦³ªº¤¤¤å¦r¡A¨Ã¦³¬ÛÀ³ªº 40 x 40 ®æ¤lªº¦r«¬¡]¤j³¡¥÷ªº¥ç¦³24 x 24
¦r«¬¡^ªþ¦bµoªíªº¤º®e¤W¡C
³o¨Ç¤¤°ê¦r³Q¤À°t¦Ü¤CÓ¦r±¡A¥H³Ì±`¥Îªº¦rÂ\¦b¤U¼h¦r±¡A¥H¤ÎÅܲ§ªº
¦rÅéÂ\¦b¤W±¤G¼h¦r±¡C¤¤¥¡¼Ð·Ç½Xªº§Þ³N¡A¨Ï¥¦«O¯d¤F¤Ó¥H¤Wªº¼Ð·Ç¦r±
¥H¤Î¥|Ó«D¼Ð·Ç¡B¨p¤H¥Î¦r±¡A¨Ï±o¥¦Á`¦@¥i¥H¦³ 16 Ó¦r±¡A¨Ã¥B¹ï©ó²Ê²¤
ºâ¨Ó 120 000 Ó¦r¸¹¦³Ó°²³]ªºªÅ¶¡¡C
¦b¥Ø«e¤w©w¸qªº³Ì¤W¼h¦r±¡]²Ä¤C¼h¡^¡A¤@¼h¦hªº¦r±¡]¨ã¦³¬ù 7000 Ó¦r¡^
¥¿¦b¥[¥H«·s¼f®Ö¡A¨Ã¥B¥´ºâ¦b 1995 ¦~¤½§G¡C³o±N¨Ï±o¥¦©Ò«ü©wªº¤¤¤å¦r¤¸
¥i¹F¨ì±Nªñ 55000 Ó¦r¡C
The overall structure has already been outlined; but how does the CNS
code relate to other code sets in use in East Asia, e.g. the Korean
KSC, the Japanese JIS, and the mainland Chinese GB? And what about
Unicode?
³o¾ãÅ骺µ²ºc¤w¸g³Q¤Äµe¥X¨Ó¤F¡C¦ý¬O CNS ½X»P¨ä¥¦ªF¨È©Ò¥Îªº¤º½X
¡]¨Ò¦pÁú°ê KSC ½X¡B¤é¥» JIS ½X¡B¤¤°ê¤j³°Â²Åé GB ½Xµ¥¡^¦³¤°»ò
Ãö«Y©O? ©M Unicode ªºÃö«Y¤S¦p¦ó©O?
The answer to this is somewhat disappointing: Although CNS defines
roughly eight times the number of characters, more than three hundred
characters present in the Japanese JIS are still missing from the CNS.
In relation to GB, the CNS misses roughly 1800 simplified characters.
With this it is also clear that the CNS code will miss quite a number
of Unicode Han characters. Upon closer examination, the reason is soon
obvious: CNS in its higher levels occasionally defines some
abbreviated forms, but in general it does not include characters
created as a result of the modern character reforms. I consider this a
serious drawback and an obstacle to a true universal character set.
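But this seems to h [...]
[...] handle this requirement. Practical work has shown how important it
is to keep using a familiar working environment (with its fonts,
editors, and so on). I therefore now advocate using a code that is
currently in wide international use (Taiwanese Big5 or Japanese JIS)
together with the IRIZ KanjiBase, as a better solution than adopting
CCCII.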
_________________________________________________________________
1. Before launching large database projects, one ought to find out
what has already been done in the area and study its qualities and
defects. Often one learns much by asking programmers and database
designers what they would do differently if they could start all
over again. In the field of Buddhist studies, the Electronic
Buddhist Text Initiative tries to help in this coordination and
learning process.
This may sound trite, but it is a fact that even major projects in
the field are unaware of what is happening elsewhere -- and sometimes
even in their own institution. On the recent field trip organized by
the Electronic Buddhist Text Initiative, we found for example that
the people managing the Chinese University of Hong Kong concordance
project were not aware of the very similar effort in Oslo; and a
long-time resident scholar at the Academia Sinica found out through
us that important materials for a Chinese text he has been
translating are on his institute's computer. That electronic
versions of a text exist does not mean much in itself; one must
evaluate data quality, accessibility, and suitability for one's
project.
2. One must classify data input projects by the amount of data
involved and their destination. Thus one must distinguish between
small amounts of data and large amounts of data, data destined for
individual users or small groups and data destined for large user
groups and institutions, etc. The present guidelines apply to
large input projects that contain many full-form Chinese
characters and are aimed at a large and diverse group of users.
Failure to make such distinctions may lead to inadequate demands for
data quality, search strategies, etc. For example, certain automatic
or half-automatic methods of scanner input can be quite useful and
efficient for an individual user prepared to spend a substantial
amount of time for data correction; but the very same method may
prove totally inadequate for large-scale institutional data input
because of the high cost of error correction. Similarly, a
relatively high number of mistakes may not bother some users but is
unacceptable for data that are to be distributed to other users.
Again, the use of many self-defined characters can be acceptable for
individuals but not for institutions.
3. It is of the greatest importance to make basic decisions at the
beginning of a project and to discuss them with specialists. In
making these decisions, both present and future possibilities of
use must be kept in mind. This applies particularly to the choice
of source text, text editing, annotation, basic data characteristics
(character encoding, data format, non-standard character handling,
etc.), and hardware/software environments. Such questions must be
discussed by a team of specialists at the outset of a large
project, i.e. before the main input activity starts, and an action
plan should be approved by the whole team.
Failure to do this can result in a gigantic waste of money. Several
Chinese text databases I know of started out with little planning;
mostly they were designed to fit the hardware and software
environment of some years ago at a specific location. Later, when
trying to convert the data to present requirements and for use by
other institutions, they found that automatic conversion was not
possible or corrupted the data set. Prior planning and consultation
with specialists could have prevented this. Another example: tagging
data during the input or correction / editing process can improve
the value of a database enormously, for example in making it
possible to look for all plant names or place names in the whole
Pali canon. Doing something like this at a later point would be
another major enterprise that could have been avoided through
careful planning.
4. If the electronic text is (or may at a later point in time be)
destined for international users and a variety of hardware and
software environments, it is necessary to make a basic data set
(master data set) that can later be automatically converted into
any necessary code or format. It is important to treat this master
data set as a separate entity whose input conditions, character
code, hardware environment, etc. can be very different from that
of the eventual user, just as studio quality music recording and
editing equipment is different from the reproduction equipment of
the consumer.
With Chinese text, the difference shows particularly in the way rare
characters and different national standards are handled.
Institutions that do not separate master data and user data
invariably produce data that follow the low standards of character
codes now used on PCs (JIS, GB, BIG-5, etc.; see the article in this
number by C. Wittern). Of the institutions visited on the recent
field trip, those who did not distinguish between master and user
data all suffer from data quality problems which will become even
more serious as larger codes become available. Those who were wise
enough to make this distinction are: the libraries of Taiwan
National University and Hong Kong University of Science and
Technology (both use master data in CCCII code and user data in
BIG-5) and the Chinese Academy of Social Sciences (master data in
their own 45,000 character code, user data in various formats). Just
like master tapes in the music business, master data must be of such
quality that it can be used in many different environments, present
and future. Most of the Chinese text data so far input in Japan,
Korea, and mainland China will have about as much future as the
recording of a concert made on a Walkman.
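The separation of master data from user data in guideline 4 can be sketched
as a table-driven conversion. Everything in this example is hypothetical:
the (plane, code) master codes, the two table entries, and the "&M...;"
reference syntax are placeholders made up for illustration, not real CNS,
CCCII, or Big-5 assignments and not the KanjiBase notation.

```python
# Hypothetical master-to-user conversion table. The entries are placeholders,
# not real code assignments.
MASTER_TO_USER = {
    (1, 0x4421): b"\xa4\x40",
    (1, 0x4422): b"\xa4\x41",
}


def reference(plane: int, code: int) -> bytes:
    """Portable ASCII stand-in for a character the user code lacks."""
    return f"&M{plane}-{code:04X};".encode("ascii")


def to_user_data(master: list[tuple[int, int]]) -> bytes:
    """Convert master data to user data, keeping unmapped characters visible."""
    out = bytearray()
    for plane, code in master:
        user = MASTER_TO_USER.get((plane, code))
        # Fall back to a reference so that no information is silently lost.
        out += user if user is not None else reference(plane, code)
    return bytes(out)


if __name__ == "__main__":
    print(to_user_data([(1, 0x4421), (3, 0x2345), (1, 0x4422)]))
```

Because the master record is never altered, the same data can be converted
again later with a larger table when a richer user code becomes available.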
5. In order to assure such convertibility and adaptability, the
master data must contain the greatest possible amount of
information. This is an important factor of data quality. In the
case of Chinese, Korean, or Japanese data (or any other text set
that maip, we
met programmers who admitted that they have never actually used the
database they have been working on for years...
9. Databases are made for users; therefore the wishes, working
environment, and likely working habits of users must be carefully
studied and respected. For example, most users search while
writing a paper or book; therefore it must be possible to use the
database concurrently with a word processing program. Any large
text database should also let the user attach notes and tags to
the main text. Such notes should also be searchable, printable
(together with the text or separately), savable as separate files
with location tags, and portable to updated versions of the
electronic text. Search engines must also be adapted to many
users' needs. They must therefore be flexible and adaptable to a
variety of users' preferences (just like word processing programs)
rather than hard-coded. Search results should be viewable, printable,
and savable to file in a variety of formats according to the user's
wishes. Since the main aim of databases is the retrieval of
information, such retrieval should be carefully planned with many
options for the user.
In projects whose input takes many years of work, one must make
programmers produce multiple test versions of search software and
have scholars and other prospective users evaluate it even while
input is going on. If necessary, data structure decisions have to be
reevaluated. Users should have a say in all important software
decisions, and programmers should assist users to evaluate test
versions and to formulate their wishes by telling them about
alternative possibilities.
Author: Urs App
Last updated: 95/04/23
==========================================================================
Date: Mon, 24 Jul 1995 23:39:11 +0800
From: Shann Wei-Chang <s[...]
[...] judging from his documents, there seems to be no wholly optimistic
solution. It is indeed vexing.
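For the next week or two I will be devoting all my effort to writing a
user manual for Chinese TeX, after which I have to help the student
workers and the computer center write scripts for accounting. Sigh,
there is no use saying more; in short, I would very much like to help
but really am unable to.
> However, Mr. Liu Ming-wei, the engineer at ETen, said that he will
> only go ahead with modifying the program once a certain number of
> Buddhist organizations support this extension, so as not to end up
> having worked for nothing.
>
> Seen this way, is the improved version of Big-5 likely to be
> problematic? Impractical? So what ordinary users use is still the old
> Big-5 version... This version can neither provide the "full set" of
> characters as CCCII, Unicode, etc. can, nor is it as widely circulated
> as Big-5, so it seems it can only serve as a stopgap?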
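I do not quite understand this passage. Wittern has already explained
the problems with CCCII very clearly (I was not so clear about them
before; I only thought, on theoretical grounds, that it was not a good
idea, and now Wittern has given very definite technical data showing
that it is not a good idea). But I do not think Unicode can provide the
full set of characters. It is after all a code table of fixed size,
256*256, so the number of characters has an upper limit; moreover, this
code has to be shared by the whole world, so it cannot possibly give all
of its character space to us, can it? Also, when you speak of serving
as a stopgap, what are you referring to? The improved Big-5? But did you
not just say that ETen cannot release it for use at present?
I strongly agree with what Wittern's article (or someone else's; in any
case the one you attached) says: data should be divided into an
internal code (master data) and an external code (user data). If you
accept this idea, you can immediately choose the most suitable code for
producing the master data, without even having to heed any standard
code. My personal suggestion (someone who is not taking part in the work
feels rather sheepish giving so much advice) is the same as before: use
CNS as far as possible, define the missing characters yourselves, and
use escape codes to mark your special characters. On the PC there are
many character-creation programs available to you, and on UNIX everyone
can simply use the X Window bitmap or BDF format. Once the internal code
has been established, its correspondence to the external code is merely
a matter of a table.
> May I ask: can a document entered in the CNS standard be viewed under
> BIG-5?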
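Both bytes of standard CNS are low bytes (0 .. 127); this follows the
ISO standard. Different CNS planes are selected with escape codes, so it
is fundamentally quite different from Big-5. On the PC, however, ETen
offers a "CNS code" by which it means shift-CNS (like shift-JIS). It
uses only planes 1 and 2 of CNS: for plane 1 the first byte is shifted
into 128...255, and for plane 2 both bytes are shifted. So strictly
speaking the CNS code that ETen provides is not standard either.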
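The shift-CNS layout can be sketched as follows, taking the letter's
description literally: plane 1 sets the high bit of the first byte only,
plane 2 sets the high bit of both bytes. This is an illustration of that
description, not a verified account of ETen's implementation, and the
function names are invented here.

```python
# Sketch of shift-CNS exactly as described in the letter (an assumption,
# not a verified specification).
LOW, HIGH = 33, 126


def shift_cns_encode(plane: int, b1: int, b2: int) -> bytes:
    """Encode a plane 1 or plane 2 CNS code as two shifted bytes."""
    if not (LOW <= b1 <= HIGH and LOW <= b2 <= HIGH):
        raise ValueError("CNS bytes must be in 33..126")
    if plane == 1:
        return bytes([b1 | 0x80, b2])          # shift the first byte only
    if plane == 2:
        return bytes([b1 | 0x80, b2 | 0x80])   # shift both bytes
    raise ValueError("only planes 1 and 2 are representable")


def shift_cns_decode(pair: bytes) -> tuple[int, int, int]:
    """Recover (plane, b1, b2) from two shifted bytes."""
    if len(pair) != 2 or pair[0] < 0x80:
        raise ValueError("not a shift-CNS two-byte character")
    plane = 2 if pair[1] >= 0x80 else 1
    return plane, pair[0] & 0x7F, pair[1] & 0x7F


if __name__ == "__main__":
    for plane in (1, 2):
        encoded = shift_cns_encode(plane, 0x44, 0x21)
        print(plane, encoded.hex(), shift_cns_decode(encoded))
```

Since planes 3 and above have no representation here, rare characters
would still need an escape mechanism of the kind discussed in the letter.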
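Moreover, the first two planes of CNS and Big-5 are not related by an
order-preserving one-to-one mapping, so even shift-CNS is not equal to
Big-5. Last year I spent at least an afternoon working out the
differences between Big-5 and CNS planes 1 and 2 and pinning down the
errors in Big-5. I wrote a report and posted it to CCNET-L, but I have
no time now to dig out the old draft.
There is, however, a program called betty that can convert shift-CNS to
Big-5 (and vice versa) on the fly, but it runs only on UNIX.
> May I ask how one can obtain a Chinese system for CNS?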
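You have stumped me. Apart from the shift-CNS on ETen I have not seen
any other implementations, and that is of course not a PD program. I
would guess that the Institute for Information Industry and certain
government agencies must have such software, but it has no foothold in
the market, so ordinary users never see products of this kind. On UNIX
I think I know how to implement a shift-CNS Chinese environment in
conjunction with CXTERM; as for the escape-code handling of self-made
characters, I think the betty program could be modified to implement it.
Betty's author is at National Tsing Hua University (I hope he has not
graduated yet) and could be asked for guidance.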
-Shann
/End of lin