Board: BudaTech (Buddhist Text Digitization Discussion)   Moderator: HeavenChow
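From: b83050@cctwin.ee.ntu.edu.tw (Post Gateway), Board: BudaTech
Subject: Chinese Buddhist Text Character-Creation Discussion FAQ
Origin: received via the Lion's Roar station (Fri Mar 29 17:30:18 1996)

---------- Forwarded message ----------
Date: Sat, 2 Sep 1995 13:29:57 +0800 (CST)
From: David Chiou <b83050@cctwin.ee.ntu.edu.tw>
Subject: Chinese Characters FAQ (about Buddhism)

Below are the more important of the documents on Chinese character encodings that I have recently gathered from various places. The choice of encoding for Buddhist texts, together with the problem of user-created characters, is currently the bottleneck in Buddhist text input; the following information is offered for everyone's reference.

ps. If anyone on this mail alias works at a monastery, or is very interested in news about Chinese input (encodings, character creation, and so on), please reply and let me know, so that I can add you to the mail alias for Buddhist organizations doing text input. Some technical questions about Chinese encodings will not be posted to the present mail alias. (At the moment that alias only carbon-copies the NTU Center for Buddhist Studies, a few venerables such as Ven. Zi[...] of Xiangguang Temple and Ven. Guo Guang of Nung Chan Monastery, and the accounts of a few especially enthusiastic colleagues.)

What follows is the important FAQ on Chinese encodings as they relate to Buddhist texts. (Last time I forwarded several dozen related messages to the Buddhist organizations above for reference. Anyone with a special interest can ask me for more detailed documents or join the Buddhist organizations list directly.)

=========================================================================
Date: Sat, 13 May 1995 10:07:34 +0800
From: Shann Wei-Chang <shann@math.ncu.edu.tw>

About the author: Professor Shann Wei-Chang of the Department of Mathematics, National Central University. He has a strong interest in classical Chinese studies, knows UNIX systems very well, and has taken part in the online discussion of encodings for many years.

Subject: internal code

Da-gang,

I have just read your report on the rare-character problems met in Buddhist text input. As you know, I have been on the CCNET and CHPOEM mailing lists for a long time, and we often discuss this kind of problem. There is in fact no settled consensus on a solution, and my own views have changed over time (whether they have grown more mature remains to be seen). Let me tell you what I think now, for your reference.

First, I do not like Big-5 or the crowd that designed it; it is a classic case of bad money driving out good. However, as I have come to recognize the facts and compromise with them (this probably has to do with age), I have begun to admit that any Chinese electronic file meant to circulate widely must be compatible with Big-5: directly compatible, with no code conversion or special handling required.

Outputting special characters is simpler than inputting them (input for search, for instance). However, an electronic document usually carries only character codes and not the glyphs (the bitmap binary file or other formats).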
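If the document circulates on diskette or CD-ROM, this problem is fairly small, but we always hope that the same document can be put on the network with very few changes. At that point the document and the reader software are two separate matters, and this is where the most effort is needed.

My current idea is to use Big-5 as the base and, when a rare character is encountered, to set it off with an escape sequence, as is done in the HZ code commonly used by students overseas, in the Japanese JIS standard, and in the EUC supported by most UNIX workstations. If the reader software at the user's end cannot recognize the escape sequence, or has no corresponding glyph, the reader may see a string of garbage characters, but there should normally be few of these, not enough to affect the content of the whole text.

Which strings should serve as the escape sequence? Our CNS code is already registered internationally, so we should follow that standard as far as possible; where we cannot, we should use the power of mass communication on the network, plus political lobbying, to make the escape sequence we choose into a standard.

As for how rare characters should be encoded, we should likewise first consult the standard interchange code published by the National Bureau of Standards in 1992. The layout of this code conforms to international standards. It currently has seven planes, with plenty of room for expansion, and each plane holds 94*94 character codes according to the international standard (two bytes, each byte is between 33 and 126, decimal inclusive). The characters selected for planes 1 and 2 are basically the same as Big-5, with a few (perhaps all) of the errors corrected. Planes 3 to 7 define more than thirty thousand rare characters, alternate forms, variant forms, and some bizarre characters that appear only in fortune-tellers' naming manuals: both their codes and their glyphs. Planes 8 to 16 are empty, and plane 12 is user defined.

I do not know enough to judge whether the characters in planes 3 to 7 are complete or properly ordered, since I cannot read any of them. If there are still characters in the sutras that cannot be found there, I suggest not using plane 12, but rather using the political strength of Buddhist organizations to obtain a plane, say plane 13, as a plane for rare religious characters. Anything "user defined" is bound to end up as a useless mess.

As for inputting rare characters, it is obvious that corresponding Chinese input software and fonts must be developed. On X Window there is already an established way of doing this, and there should be no technical difficulty on other systems either.

I do not know what our government is doing. For all Taiwan's self-proclaimed status as a computer kingdom, our national interchange code was first published only in 1986, and it was so poorly communicated that nobody in the market paid any attention to it (ignoring the government seems to be a trait shared by modern Chinese on both sides of the strait). I think that even now many people in the field have not heard of this standard, or have heard of it but never considered using it. The Institute for Information Industry and some public agencies, on the other hand, have begun (perhaps under compulsion) to use it, and some foreign companies have begun to support it, because it is after all an internationally registered national standard code.

I wrote this in a hurry and made some typos, but this editor does not make corrections easy; please forgive me.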
-Shann

========================================================================
Date: Mon, 28 Aug 1995 22:57:15 +0800 (CST)
From: David Chiou <b83050@cctwin.ee.ntu.edu.tw>
Subject: Recommend Chinese Code -- CNS

What follows is an introduction to the various character codes, taken from the Zen studies WWW site of Hanazono University:
http://www.iijnet.or.jp/iriz/irizhtml/irizhome.htm
(I have appended my own Chinese translation to some of the important passages; I cannot guarantee that it is free of mistakes, so the original text takes precedence throughout.)
_________________________________________________________________

Chinese character codes: an update
by Christian Wittern

About the author: a senior staff member of the Zen studies center at Hanazono University in Kyoto, Japan (the publisher of the journal "Electronic Bodhidharma"). The center has been carrying out worldwide liaison work on the digitization of Buddhist texts since before 1992, and its network is arguably the largest of its kind in the world today.
_________________________________________________________________

Summary

This article presents an update to Christian Wittern's and Urs App's articles concerning Chinese character codes (Electronic Bodhidharma No. 3). In those articles, Urs App argued that database creators must make the most crucial distinction between master data and user data. Master data should be of the highest quality, recording even minute detail like studio recording equipment. User data, on the other hand, must conform to what codes and equipment we presently have. Christian Wittern's article compared different codes and concluded that CCCII, a very large Taiwanese code that also includes Japanese and Korean letters, seems to be the best choice for the master data set of Chinese text databases.
[Chinese translation:] Summary. This article improves on the reviews of Chinese character codes by Christian Wittern and Urs App (published in issue 3 of the journal "Electronic Bodhidharma"). In that article, Urs App states that database creators must make a very, very important decision regarding master data and user data. Master data must be of the highest quality, like video equipment recording every minute of footage; user data, on the other hand, must conform to whichever codes we currently have. Christian Wittern's article compared several different codes and concluded: "CCCII (a very large Taiwanese code that also includes Japanese and Korean characters) seems to be the best choice for the master data of Chinese encodings."

We shelled out US $ 2000 for a CCCII board, only to discover that both the code itself and its implementation are seriously flawed. We thus had to continue using Big-5 for all practical purposes while looking for better solutions. Finally, Christian decided that the only practical approach at this time was to build on Big-5 (and other national codes such as JIS) and extend them through code references that are both stable and portable. His ingenious approach forms the basis of the IRIZ KanjiBase and its encoding scheme -- a scheme which will be as useful after the introduction of Unicode as it proves to be right now. (U.A.)

[Chinese translation:] We spent US$ 2000 on a CCCII board, only to find that both the code itself and its accompanying equipment have serious flaws. In practice, therefore, we had to go on using Big-5 while continuing to look for a better solution. Finally, Christian decided that the only practical method at present is to build on Big-5 (and the JIS code widespread in Japan) and to extend them through "code references" that are both stable and portable. This clever proposal produced the basis of the IRIZ KanjiBase and of its encoding scheme, a scheme that will be as useful after Unicode is introduced as we are now finding it to be.
_________________________________________________________________

* Some kanji codes for computers
  1. Japanese JIS Codes
  2. Taiwanese Big5
  3. Taiwanese CNS
  4. CCCII and EACC
  5. Unicode

[Chinese translation:] Some kanji codes on computers:
  1. Japanese JIS code
  2. Taiwanese Big-5 code
  3. Taiwanese CNS code (National Bureau of Standards)
  4. CCCII code and the EACC program
  5.
     Unicode

* More information is available at ifcss.org in Ross Patterson's document CJK Codes and in Ken Lunde: Understanding Japanese Information Processing, p. 35ff.

[Chinese translation:] More useful information is available at ifcss.org(.jp): Ross Patterson's document "CJK Codes" and Ken Lunde's "Understanding Japanese Information Processing", p. 35ff.
_________________________________________________________________

Development of kanji codes for computers

Japanese JIS Codes

The first character code designed to make the processing of ideographic characters on computers possible was the JIS C 6226-1978. It was developed according to the guidelines laid down in the ISO standard 2022-1973 and became the model for most other code standards used today in East Asia (the most notable exception is Big5). Covering approximately 6500 characters, this standard has been revised two times, in 1983 and 1990, when the assignment of some characters was changed and a few were added. Revising a standard is about the worst thing a standards body can do, and it has caused much grief and headache among manufacturers and users alike. Today we finally have fonts that bear the year of the standard they cover in their name, so that users can know which version is encoded in that font and select it accordingly. Our texts and tools are based on the latest version.

The version of 1990 has become known under the name JIS X 0208-1990 and has been, together with an additional set of 5800 characters (JIS X 0212), the base of the Japanese contribution to Unicode.

The JIS code is almost never used in computers as it was defined; rather, some changes are made in the way the code numbers are represented. This is necessary to allow JIS to be mixed with ASCII characters and, as in the case of ShiftJis (or MS-Kanji, the most popular encoding on personal computers), with earlier Japanese encodings of half-width kana.
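The remark that the JIS code is rarely stored "as defined" can be illustrated with a minimal Python sketch. It relies only on Python's built-in `iso2022_jp`, `euc_jp` and `shift_jis` codecs; the helper names `jis_code` and `show` and the sample character are illustrative choices, not anything taken from Wittern's article.

```python
# Sketch: one JIS X 0208 code point under three byte-level representations.
# Helper names and the sample character are illustrative assumptions.

ESC_TO_KANJI = b"\x1b$B"   # ISO-2022-JP escape: switch to JIS X 0208
ESC_TO_ASCII = b"\x1b(B"   # ISO-2022-JP escape: switch back to ASCII


def jis_code(ch: str) -> bytes:
    """Return the raw two-byte JIS X 0208 code of a single kanji."""
    raw = ch.encode("iso2022_jp")
    if not (raw.startswith(ESC_TO_KANJI) and raw.endswith(ESC_TO_ASCII)):
        raise ValueError(f"{ch!r} is not encoded as a JIS X 0208 character")
    return raw[len(ESC_TO_KANJI):-len(ESC_TO_ASCII)]


def show(ch: str) -> dict:
    """Collect the different byte representations of one character as hex."""
    jis = jis_code(ch)
    return {
        "jis": jis.hex(),                                   # 7-bit code as defined
        "euc_jp": ch.encode("euc_jp").hex(),                # both bytes with high bit set
        "euc_from_jis": bytes(b | 0x80 for b in jis).hex(), # same thing, computed by hand
        "shift_jis": ch.encode("shift_jis").hex(),          # arithmetic re-mapping
    }


if __name__ == "__main__":
    print(show("漢"))
    # Mixed text: 3 one-byte ASCII characters plus 2 two-byte kanji.
    mixed = "JIS漢字".encode("shift_jis")
    print(len("JIS漢字"), "characters ->", len(mixed), "bytes")
```

The EUC form is simply the defined code with the high bit set on both bytes, while Shift-JIS rearranges the code so that it can coexist with ASCII and half-width kana; the escape sequences of ISO-2022-JP are the third way of mixing the same code with ASCII.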
East Asian text is thus most frequently based on a multibyte encoding, a character stream that contains a mixture of characters represented by one single byte and of characters represented by two bytes.

In addition to the characters in the national standard, many Japanese vendors have added their own private characters to JIS, making the conversion between these different encodings difficult beyond belief.

Big5

There are different legends about the beginnings of Big5; some say that the code had been developed for an integrated application with 5 parts, and others say it was an agreement of five big vendors in the computer industry. No matter which one is true (and it might as well be something else), the Taiwanese government did not realize the need for a practical encoding of Chinese characters in time. Government agencies had apparently also been involved in the development of Big5, but it was only in 1986 that an official code was announced, a time by which Big5 was already a de facto standard with numerous applications in daily use.

[Chinese translation:] There are many different versions of the legend of how Big-5 began: some say the code was produced by an application that integrated five parts, others that it was agreed on jointly by five large computer vendors. Whichever legend is true, the Taiwanese government did not grasp in time how important and how necessary a Chinese encoding was. Although government agencies evidently also took part in developing Big-5, the official code was not formally announced until 1986, by which time Big-5 had long been the standard adopted by a very large number of everyday applications.

Big5 defines 13051 Chinese characters, arranged in two parts according to their frequency of usage. The arrangement within these parts is by number of strokes, then Kangxi radical. As Big5 was apparently developed in a great hurry, some mistakes were made in the stroke count (and thus placement) of characters, and two characters are represented twice. On the other hand, some frequently used characters were left out and were later implemented by individual companies.
All implementations agree on the core part of Big5, but different extensions by individual vendors acquired much weight, most notably in the case of the ETEN Chinese system that was very popular in the late eighties and early nineties. As there is no document that defines Big5 apart from the documentation provided by the vendors with their products, it is impossible to single out one standard Big5. This was actually a big problem in the process of designing Unicode -- and it remains one even today.

(Translator's note: this passage describes the major problem that Big5 cannot be reduced to a single standard; this is still the case today and will also cause trouble when Unicode is drawn up.)

CNS X-11643-1986 and CNS X-11643-1992 (National Bureau of Standards)

This is the Chinese National Code for Taiwan. In the form published in 1992, it defines the glyph shape, stroke count and radical heading for 48027 characters. For all these characters a reference font in a 40 by 40 grid (and for most of them also in a 24 by 24 grid) is available from the issuing body.

These characters are assigned to 7 levels, with the more frequent at the lower levels and the variant forms at the two top levels. The whole architecture reserves space for five more standard levels, and four levels are reserved for non-standard, private encoding, bringing the total to 16 levels, with a hypothetical space for roughly 120 000 ideographs.

On top of the currently defined ones, one more level with about 7000 characters is currently under revision and expected to be published in the course of 1995. This will bring the total number of assigned characters to roughly 55000.
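The level (plane) layout described here can be put into numbers with a small Python sketch. It assumes the 94 by 94 grid with byte values 33 to 126 that Shann's message earlier in this post gives for each plane; the names `cns_bytes`, `POSITIONS_PER_PLANE` and `TOTAL_POSITIONS` are illustrative only, and the totals are raw grid positions, an upper bound rather than a count of usable ideograph slots.

```python
# Sketch of the plane arithmetic, assuming a 94 x 94 grid per plane with
# each byte in the range 33..126 (decimal), as stated in Shann's message.
# All names here are illustrative assumptions.

ROWS = CELLS = 94
FIRST_BYTE = 33           # lowest byte value used in either position
PLANES_TOTAL = 16         # levels foreseen by the architecture
PLANES_DEFINED = 7        # levels assigned in the 1992 edition

POSITIONS_PER_PLANE = ROWS * CELLS
TOTAL_POSITIONS = PLANES_TOTAL * POSITIONS_PER_PLANE


def cns_bytes(row: int, cell: int) -> bytes:
    """Map a 1-based (row, cell) position to its two 7-bit code bytes."""
    if not (1 <= row <= ROWS and 1 <= cell <= CELLS):
        raise ValueError("row and cell must both be in 1..94")
    return bytes([FIRST_BYTE + row - 1, FIRST_BYTE + cell - 1])


if __name__ == "__main__":
    print("positions per plane:", POSITIONS_PER_PLANE)
    print("raw positions in", PLANES_DEFINED, "planes:",
          PLANES_DEFINED * POSITIONS_PER_PLANE)
    print("raw positions in", PLANES_TOTAL, "planes:", TOTAL_POSITIONS)
    print("first and last code of a plane:",
          cns_bytes(1, 1).hex(), cns_bytes(94, 94).hex())
```

Seven planes of raw positions comfortably hold the 48027 characters of the 1992 edition, which matches the article's remark that there is room to grow.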
[Chinese translation:] This is Taiwan's national standard code. In the form issued in 1992, it defines the glyph shape, stroke count and radical heading for 48027 Chinese characters. For all of these characters a corresponding font on a 40 x 40 grid (and for most of them a 24 x 24 font as well) accompanies the publication.

These characters are assigned to seven planes, with the most common characters placed in the lower planes and the variant forms in the top two. The design of the code reserves five more standard planes and four non-standard, private-use planes, so that it can have 16 planes in all, with a hypothetical space for roughly 120 000 characters.

Above the top plane defined so far (the seventh), one more plane (with about 7000 characters) is being re-examined and is planned for publication in 1995. This will bring the number of assigned characters to nearly 55000.

The overall structure has already been outlined; but how does the CNS code relate to other code sets in use in East Asia, e.g. the Korean KSC, the Japanese JIS, and the mainland Chinese GB? And what about Unicode?

[Chinese translation:] The overall structure has now been sketched. But how does the CNS code relate to the other codes used in East Asia (for example the Korean KSC code, the Japanese JIS code, and the mainland Chinese simplified GB code)? And how does it relate to Unicode?

The answer to this is somewhat disappointing: Although CNS defines roughly eight times the number of characters, more than three hundred characters present in the Japanese JIS are still missing from the CNS. In relation to GB, the CNS misses roughly 1800 simplified characters. With this it is also clear that the CNS code will miss quite a number of Unicode Han characters. Upon closer examination, the reason is soon obvious: CNS in its higher levels occasionally defines some abbreviated forms, but in general it does not include characters created as a result of the modern character reforms. I consider this a serious drawback and an obstacle to a true universal character set. But this seems to h[...]

[Chinese translation:] [...] handle this requirement. Practical work has shown how important it is to keep using a familiar working environment (with its fonts, editors, and so on). I therefore now advocate using one of the currently widespread national codes (Taiwan's Big5 or Japan's JIS) together with the IRIZ KanjiBase, as a better solution than adopting CCCII.
_________________________________________________________________

1.
Before launching large database projects, one ought to find out what has already been done in the area and study its qualities and defaults. Often one learns much by asking programmers and database designers what they would do differently if they could start all over again. In the field of Buddhist studies, the Electronic Buddhist Text Initiative tries to help in this coordination and learning process. This may sound trite, but it is a fact that even major projects in the field are unaware of what is happening elsewhere, and sometimes even in their own institution. On the recent field trip organized by the Electronic Buddhist Text Initiative, we found for example that the people managing the Chinese University of Hong Kong concordance project were not aware of the very similar effort in Oslo; and a long-time resident scholar at the Academia Sinica found out through us that important materials for a Chinese text he has been translating are on his institute's computer. That electronic versions of a text exist does not mean much in itself; one must evaluate data quality, accessibility, and suitability for one's project.

2. One must classify data input projects by the amount of data involved and their destination. Thus one must distinguish between small amounts of data and large amounts of data, data destined for individual users or small groups and data destined for large user groups and institutions, etc. The present guidelines apply to large input projects that contain many full-form Chinese characters and are aimed at a large and diverse group of users. Failure to make such distinctions may lead to inadequate demands for data quality, search strategies, etc.
For example, certain automatic or half-automatic methods of scanner input can be quite useful and efficient for an individual user prepared to spend a substantial amount of time for data correction; but the very same method may prove totally inadequate for large-scale institutional data input because of the high cost of error correction. Similarly, a relatively high number of mistakes may not bother some users but is unacceptable for data that are to be distributed to other users. Again, the use of many self-defined characters can be acceptable for individuals but not for institutions. 3. It is of the greatest importance to make basic decisions at the beginning of a project and to discuss them with specialists. In making these decisions, both present and future possibilities of use must be kept in mind. This applies particularly to the choice of source text, text editing, annotation, basic data character (character encoding, data format, non-standard character handling, etc.), and hard/software environments. Such questions must be discussed by a team of specialists at the outset of a large project, i.e. before the main input activity starts, and an action plan should be approved by the whole team. Failure to do this can result in gigantic waste of money. Several Chinese text databases I know of started out with little planning; mostly they were designed to fit the hardware and software environment of some years ago at a specific location. Later, when trying to convert the data to present requirements and for use by other institutions, they found that automatic conversion was not possible or corrupted the data set. Prior planning and consultation with specialists could have prevented this. Another example: tagging data during the input or correction / editing process can improve the value of a database enormously, for example in making it possible to look for all plant names or place names in the whole Pali canon. 
Doing something like this at a later point would be another major enterprise that could have been avoided through careful planning. 4. If the electronic text is (or may at a later point in time be) destined for international users and a variety of hardware and software environments, it is necessary to make a basic data set (master data set) that can later be automatically converted into any necessary code or format. It is important to treat this master data set as a separate entity whose input conditions, character code, hardware environment, etc. can be very different from that of the eventual user, just as studio quality music recording and editing equipment is different from the reproduction equipment of the consumer. With Chinese text, the difference shows particularly in the way rare characters and different national standards are handled. Institutions that do not separate master data and user data invariably produce data that follow the low standards of character codes now used on PCs (JIS, GB, BIG-5, etc.; see the article in this number by C. Wittern). Of the institutions visited on the recent field trip, those who did not distinguish between master and user data all suffer from data quality problems which will become even more serious as larger codes become available. Those who were wise enough to make this distinction are: the libraries of Taiwan National University and Hong Kong University of Science and Technology (both use master data in CCCII code and user data in BIG-5) and the Chinese Academy of Social Sciences (master data in their own 45,000 character code, user data in various formats). Just like master tapes in the music business, master data must be of such quality that it can be used in many different environments, present and future. Most of the Chinese text data so far input in Japan, Korea, and mainland China will have about as much future as the recording of a concert made on a Walkman. 5. 
In order to assure such convertibility and adaptability, the master data must contain the greatest possible amount of information. This is an important factor of data quality. In the case of Chinese, Korean, or Japanese data (or any other text set that ma[...]

[...]p, we met programmers who admitted that they have never actually used the database they have been working on for years...

9. Databases are made for users; therefore the wishes, working environment, and likely working habits of users must be carefully studied and respected. For example, most users search while writing a paper or book; therefore it must be possible to use the database concurrently with a word processing program. Any large text database should also let the user attach notes and tags to the main text. Such notes should also be searchable, printable (together with the text or separately), savable as separate files with location tags, and portable to updated versions of the electronic text. Search engines must also be adapted to many users' needs. They must therefore be flexible and adaptable to a variety of users' preferences (just like word processing programs) rather than hard-coded. Search results should be viewable, printable, and savable to file in a variety of formats according to the user's wishes. Since the main aim of databases is the retrieval of information, such retrieval should be carefully planned with many options for the user. In projects whose input takes many years of work, one must make programmers produce multiple test versions of search software and have scholars and other prospective users evaluate it even while input is going on. If necessary, data structure decisions have to be reevaluated. Users should have a say in all important software decisions, and programmers should assist users to evaluate test versions and to formulate their wishes by telling them about alternative possibilities.
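The separation of master data from user data asked for in these guidelines can be sketched in Python. This is only an illustration under stated assumptions: the reference syntax `&X-0001;`, the lookup table, and the function name `to_user_data` are all hypothetical and are not the actual IRIZ KanjiBase format. The master text keeps a stable ASCII reference for every character outside the common codes, and each user data set is derived from it automatically.

```python
import re

# Hypothetical reference syntax for this sketch: "&" + source tag + "-" +
# four hex digits + ";". It is NOT the real KanjiBase notation.
REF = re.compile(r"&[A-Z][0-9A-Z]*-[0-9A-F]{4};")


def to_user_data(master: str, encoding: str, table: dict) -> bytes:
    """Derive user data in `encoding` from master text.

    A reference is replaced by its character only when the table knows it
    and the target encoding can represent it; otherwise the reference is
    left in place as plain ASCII, so no information is lost.
    """
    def resolve(match):
        ref = match.group(0)
        ch = table.get(ref)
        if ch is None:
            return ref
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            return ref
        return ch

    return REF.sub(resolve, master).encode(encoding)


if __name__ == "__main__":
    # Hypothetical table: one reference standing for a character that
    # GB 2312 contains but Big5 does not.
    table = {"&X-0001;": "汉"}
    master = "一&X-0001;"
    print(to_user_data(master, "big5", table))    # reference survives as ASCII
    print(to_user_data(master, "gb2312", table))  # reference resolved to the character
```

Because an unresolved reference stays in the output as ordinary ASCII, the same master file can be converted again later when a larger code becomes available, which is the point of keeping the master data richer than any one user format.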
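Author: Urs App
Last updated: 95/04/23

==========================================================================
Date: Mon, 24 Jul 1995 23:39:11 +0800
From: Shann Wei-Chang <s[...]

[...] judging from his documents, there seems to be no thoroughly optimistic solution. It really is discouraging.

For the next week or two I will be putting all my effort into writing a user manual for Chinese TeX, and after that I have to help the student assistants and the computer center write accounting scripts. Well, there is no point in saying more; in short, I would very much like to help but am really unable to.

> However, Mr. Liu Ming-wei, the engineer at ETen, says that he will
> only set about modifying the program once a certain number of
> Buddhist organizations support this idea of an extension, so as
> not to end up having worked for nothing.
>
> Seen this way, won't this improved version of Big-5 be problematic? Impractical?
> So ordinary users are still using the [...] Big-5 version...
> So this version neither provides "all" the created characters, as CCCII, Unicode etc. can,
> nor circulates as widely as Big-5; it seems it can only serve as a stopgap?

I do not quite understand what this passage means. Wittern has already explained the problems of CCCII very clearly (I was not so clear about them before; I merely thought, on theoretical grounds, that it was not a good idea, and now Wittern has given very definite technical data showing that it is not a good idea). But I do not think Unicode can provide all the created characters: it is after all a fixed-size 256*256 character table, so the number of created characters has an upper limit, and the code has to be shared by the whole world, so they can hardly give all the room for created characters to us, can they? Also, when you say "serve as a stopgap", what are you referring to? The improved Big-5? But didn't you just say that ETen cannot release it for use at present?

I very much agree with what Wittern's article (or someone else's; anyway, the one you attached) says: data should be divided into an internal code (master data) and an external code (user data). If you accept this idea, you can choose the most suitable character code right away to produce the master data, without even having to bother with any standard code. My personal suggestion (I feel rather sheepish offering so many suggestions as someone who takes no part in the work) is the same as before: use CNS as far as possible, define the missing characters yourselves, and use escape codes to mark your specially created characters. On the PC there are many character-creation programs you can use, and on UNIX everyone can simply use the X Window bitmap or BDF format. Once the internal code has been built, the correspondence with the external code is only a matter of a table.

> May I ask, can documents entered in the CNS standard be viewed under BIG-5?

In the CNS standard both bytes are low bytes (0 .. 127); this is the ISO standard. Different CNS planes are selected with escape codes, so it is basically completely different from Big-5. On the PC, however, ETen offers a "CNS code", by which they mean shift-CNS (like shift-JIS). They use only planes 1 and 2 of CNS: for plane 1 the first byte is shifted into 128...255, and for plane 2 both bytes are shifted.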
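The shift-CNS arrangement that Shann describes can be written out as a short Python sketch. It follows only the description in this message (plane 1: first byte moved into 128...255; plane 2: both bytes moved) and is not a verified reproduction of ETen's implementation; the function names `shift_cns` and `unshift_cns` are illustrative.

```python
# Sketch of "shift-CNS" exactly as described in the message above.
# Assumption: this is NOT checked against ETen's actual product.

def shift_cns(plane: int, code: bytes) -> bytes:
    """Turn a 7-bit two-byte CNS code from plane 1 or 2 into shifted form."""
    if len(code) != 2 or not all(0x21 <= b <= 0x7E for b in code):
        raise ValueError("expected two bytes, each in 33..126")
    first, second = code
    if plane == 1:
        return bytes([first | 0x80, second])          # only the first byte is shifted
    if plane == 2:
        return bytes([first | 0x80, second | 0x80])   # both bytes are shifted
    raise ValueError("only planes 1 and 2 are covered by this scheme")


def unshift_cns(data: bytes) -> tuple:
    """Recover (plane, 7-bit code) from a shifted two-byte sequence."""
    if len(data) != 2 or not data[0] & 0x80:
        raise ValueError("not a shifted two-byte sequence")
    plane = 2 if data[1] & 0x80 else 1   # the second byte's high bit tells the plane
    return plane, bytes([data[0] & 0x7F, data[1] & 0x7F])


if __name__ == "__main__":
    code = b"\x44\x21"   # an arbitrary position, chosen only for illustration
    for plane in (1, 2):
        shifted = shift_cns(plane, code)
        print(plane, shifted.hex(), unshift_cns(shifted))
```

Under this scheme the high bit of the second byte is what distinguishes the two planes, so no escape sequence is needed for them, which is also why the result is no longer the standard 7-bit CNS form.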
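Strictly speaking, then, the CNS code that ETen provides is not the standard one either. Moreover, the first two planes of CNS and Big-5 are not an order-preserving one-to-one mapping, so even shift-CNS is not equal to Big-5. Last year I spent at least an afternoon working out the differences between Big-5 and CNS planes 1 and 2 and pinning down the errors in Big-5; I wrote a report and posted it to CCNET-L, but I have no time now to dig out the draft. There is, however, a program called betty that can convert shift-CNS to Big-5 on the fly (and vice versa), but it runs only on UNIX.

> May I ask how one can obtain a CNS Chinese system?

You have stumped me. Apart from the shift-CNS on ETen I have not seen any other implementations, and that is of course not a PD program. I guess the Institute for Information Industry and certain government agencies must have such software, but it has no foothold at all in the market, so ordinary users never see this kind of product. On UNIX I think I know how to implement a shift-CNS Chinese environment together with CXTERM; as for handling the escape codes for self-created characters, I think the betty program could be modified to implement it. The author of betty is at National Tsing Hua University (I hope he has not graduated yet), and you could ask him for guidance.

-Shann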
NTU Lion's Roar Buddhist Studies Site   http://buddhaspace.org