FAQ: Discussion of Character Creation for Chinese Buddhist Texts

Board: BudaTech ◎ Buddhist Text Digitization Discussion   Moderator: HeavenChow
Posted by: b83050@cctwin.ee.ntu.edu.tw (Post Gateway), Board: BudaTech
Title: FAQ: Discussion of Character Creation for Chinese Buddhist Texts
Origin: received by the Lion's Roar BBS (獅子吼站) (Fri Mar 29 17:30:18 1996)


~---------- Forwarded message ----------
Date: Sat, 2 Sep 1995 13:29:57 +0800 (CST)
From: David Chiou <b83050@cctwin.ee.ntu.edu.tw>
Subject: Chinese Characters FAQ (about Buddhism)

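Below are the more important of the documents on Chinese internal codes
that I have recently collected from various places. The choice of
internal code and the problem of creating missing characters are
currently the bottleneck in Buddhist text input; the information below
is offered for your reference.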

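P.S. If anyone on this mail alias works at a monastery, or is very
     interested in news about Chinese input (internal codes, character
     creation, and so on), please reply and let me know, so that I can
     add you to the mail alias for Buddhist institutions working on
     text input. Some technical questions about Chinese internal codes
     will not be posted to the present mail alias.
     (At the moment it is only carbon-copied to the Center for Buddhist
      Studies at National Taiwan University (台大佛學研究中心), Ven. 自衍
      of 香光寺, Ven. 果光 of 農禪寺, a few other venerables, and the
      accounts of a few especially enthusiastic members.)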


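The important FAQ material on Chinese internal codes for Buddhist texts
follows:
(I previously forwarded several dozen related messages to the Buddhist
  institutions above for reference. If you are particularly interested,
  you can ask me for more detailed documents, or simply join the
  Buddhist institutions list.)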


=========================================================================
Date: Sat, 13 May 1995 10:07:34 +0800
From: Shann Wei-Chang <shann@math.ncu.edu.tw>
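About the author: Prof. Shann Wei-Chang (單維彰) of the Department of
          Mathematics, National Central University. He has a strong
          interest in Chinese classical studies, knows UNIX systems very
          well, and has taken part in online discussions of internal
          codes for many years.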
Subject: internal code
 
大剛,
 
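I have just read your report about the rare-character problem
encountered in inputting Buddhist texts. As you know, I have been on the
CCNET and CHPOEM mailing lists for a long time, and we often discuss
this kind of problem. There is in fact no agreed, settled solution, and
my own views have changed over time (whether they are maturing remains
to be seen).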
 
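Let me tell you my current thinking, for your reference. First, I do not
like Big-5 or the people who designed it; it is a textbook case of bad
money driving out good. But as I have come to recognize the facts and
compromise with them (this is probably related to age), I have begun to
accept that any Chinese electronic file meant for wide circulation must
be compatible with Big-5: directly compatible, with no transcoding or
special handling.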
 
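Outputting special characters is easier than inputting them (input for
search, for instance). However, an electronic document usually carries
only character codes, not the font (glyph, the bitmap binary file or in
other formats). If the document circulates on diskette or CD-ROM this is
a smaller problem, but we always hope that the same document can be put
on the network with minimal changes. At that point the document and the
reader software are two separate matters. This is where the most effort
is needed.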
 
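My current idea is to use Big-5 as the base and, when a rare character
occurs, set it off with an escape sequence, as in the HZ code commonly
used by overseas students, the Japanese JIS standard, and the EUC
supported by most UNIX workstations. If the reader software on the
user's side cannot recognize the escape sequence, or lacks the
corresponding font, the reader may see a string of garbage, but such
characters should normally be few and should not affect the content of
the text as a whole. Which strings should serve as escape sequences?
Our CNS code is already registered internationally, and we should follow
that standard as far as possible; where we cannot, we should use the
power of mass communication on the network, plus political lobbying, to
get our chosen escape sequences made standard. As for how rare
characters should be coded, we should likewise first consult the
standard interchange code published by the National Bureau of Standards
in 1992. Its layout conforms to international standards. It currently
has seven planes, with plenty of room for expansion, and each plane
holds 94*94 code points as the international standard prescribes (two
bytes, each byte is between 33 and 126, decimal inclusive). The
characters chosen for planes 1 and 2 are basically the same as Big-5,
with a few (perhaps all) of the errors corrected. Planes 3 to 7 define
more than thirty thousand rare characters, alternative forms, variant
forms, and some odd characters that appear only in fortune-tellers' name
lore, giving both their codes and their glyphs. Planes 8 to 16 are
empty, and plane 12 is user defined.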
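
To illustrate the escape-sequence idea with a scheme that already
exists: the HZ code mentioned above leaves ASCII unchanged and brackets
two-byte GB 2312 text between "~{" and "~}". The sketch below uses
Python's built-in "hz" codec (Python is an assumption of this example,
not something the message uses); the helper name split_segments is
hypothetical. A reader that does not understand the escapes would show
the bracketed bytes as odd-looking ASCII, which is the kind of fallback
the message accepts. The scan is simplified (it ignores HZ's "~~" and
line-continuation rules), and it is not the Big-5-based scheme proposed
here, whose escape sequences this message leaves open.

```python
# HZ keeps plain ASCII as-is and brackets two-byte GB 2312 text
# between the escape sequences "~{" and "~}".
text = "abc中文def"
encoded = text.encode("hz")


def split_segments(data: bytes):
    """Split an HZ byte stream into ('ascii' | 'gb', payload) segments.

    Simplified: does not handle the "~~" literal-tilde rule or the
    "~\\n" line continuation defined by HZ.
    """
    segments = []
    pos = 0
    while pos < len(data):
        start = data.find(b"~{", pos)
        if start == -1:
            # No further escape: the rest is plain ASCII.
            segments.append(("ascii", data[pos:]))
            break
        if start > pos:
            segments.append(("ascii", data[pos:start]))
        end = data.find(b"~}", start + 2)
        if end == -1:
            # Unterminated escape: keep the remainder as two-byte data.
            segments.append(("gb", data[start + 2:]))
            break
        segments.append(("gb", data[start + 2:end]))
        pos = end + 2
    return segments


for kind, payload in split_segments(encoded):
    print(kind, payload)
```

Because every byte stays below 128, the encoded stream passes through
7-bit channels; only the reader software needs to know the escapes.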
 
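I am not learned enough to judge whether the characters in planes 3 to 7
are complete or properly ordered, since I do not recognize any of them.
If Buddhist scriptures contain characters that cannot be found there, I
suggest not using plane 12, but using the political weight of Buddhist
organizations to win a plane, say plane 13, as a plane for rare
religious characters. Anything "user defined" will in the end
inevitably turn into useless mud.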
 
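As for inputting rare characters, it is obviously necessary to develop
corresponding Chinese input software and fonts. There is already an
established approach on X Window, and there should be no technical
difficulty on other systems either.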
 
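I do not know what our government is doing. For all Taiwan's self-image
as a computer kingdom, our national interchange code was first published
only in 1986, and it was so poorly communicated that nobody in the
market paid attention (ignoring the government seems to be a shared
trait of modern Chinese on both sides of the strait). I think that even
now many people in the field have never heard of this standard, or have
heard of it but never considered using it. The Institute for Information
Industry (資策會) and some public agencies, however, have started
(perhaps under compulsion) to use it, and some foreign companies have
begun to support it, because it is after all an internationally
registered national standard code.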
 
I wrote this in a hurry and made some typos, but they are hard to
correct in this editor; please forgive me.
 
-Shann
 
========================================================================
Date: Mon, 28 Aug 1995 22:57:15 +0800 (CST)
From: David Chiou <b83050@cctwin.ee.ntu.edu.tw>
Subject: Recommend Chinese Code -- CNS
 
 
 
The text below is an introduction to the various internal codes, taken
from the Zen studies WWW site of Hanazono University:
http://www.iijnet.or.jp/iriz/irizhtml/irizhome.htm
 
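(For some important passages I append a Chinese translation, though I
  cannot guarantee it is free of errors. The original text takes
  precedence in all cases.)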
 
     _________________________________________________________________
   
Chinese character codes: an update
中文內碼的探索：修改版
 
    by Christian Wittern
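    About the author: a senior staff member of the Zen studies center at
              Hanazono University in Kyoto, Japan (publisher of the
              periodical "Electronic Bodhidharma"). The center has been
              coordinating Buddhist text digitization efforts worldwide
              since before 1992 and is arguably the largest
              international contact network today.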
   
     _________________________________________________________________
   
    Summary
    
   This article presents an update to Christian Wittern's and Urs App's
   articles concerning Chinese character codes (Electronic Bodhidharma
   No. 3). In those articles, Urs App argued that database creators must
   make the most crucial distinction between master data and user data.
   Master data should be of the highest quality, recording even minute
   detail like studio recording equipment. User data, on the other hand,
   must conform to what codes and equipment we presently have. Christian
   Wittern's article compared different codes and concluded that CCCII, a
   very large Taiwanese code that also includes Japanese and Korean
   letters, seems to be the best choice for the master data set of
   Chinese text databases.
 
   摘要
   
   本文改進了 Christian Wittern 先生和 Urs App 關於中文內碼的評析
   （刊載於「電子達摩」期刊第三期）。在該文中， Urs App 表示資料庫
   的建立者必須對於 master data 及 user data 作下非常非常重要的決定。
   Master data 必須具有最高的品質，如同錄影器材記錄下每分鐘的畫面一般；
   另一方面， user data 必須順從於那種內碼是我們現有的。
 
   Christian Wittern 先生的文章比較了幾種不同的內碼，結論是：
   「 CCCII（一種非常龐大的台灣的內碼，並且包含了日本及韓國字）
      似乎是中文內碼的 master data 的最佳選擇。」
 
 
   We shelled out US $ 2000 for a CCCII board, only to discover that both
   the code itself and its implementation are seriously flawed. We thus
   had to continue using Big-5 for all practical purposes while looking
   for better solutions. Finally, Christian decided that the only
   practical approach at this time was to build on Big-5 (and other
   national codes such as JIS) and extend them through code references
   that are both stable and portable. His ingenious approach forms the
   basis of the IRIZ KanjiBase and its encoding scheme -- a scheme which
   will be as useful after the introduction of Unicode as it proves to be
   right now. (U.A.)
 
   我們花下了美金 2000 元，買了一個 CCCII 的板面，結果發現該碼本身及
   它的附屬設備，都具有嚴重的瑕疵。因此，我們在實際的狀況上，只好繼續 
   使用 BIG-5內碼，等著繼續尋找更好的解決方案。最後， Christian 先生
   決定了，現時唯一實際可行的方法是建立在 BIG-5 （及日本國內普遍流行的
   JIS 碼）上面，並且藉由既穩定又具可攜性的「內碼參照表」（code references）
   來擴展它們。他的這項聰明提議產生了「IRIZ 漢字庫」的基礎，以及「IRIZ
   漢字庫」的「轉譯器」──一種在將來 Unicode 引進後，能夠如同現在我們
   證明它有夠實用的轉譯器。
 
     _________________________________________________________________
   
     * Some kanji codes for computers
         1. Japanese JIS Codes
         2. Taiwanese Big5
         3. Taiwanese CNS
         4. CCCII and EACC
         5. Unicode
 
     ＊一些電腦上的漢字內碼：
         1. 日本 JIS 內碼
         2. 台灣 BIG-5 內碼
         3. 台灣中央標準局 CNS 內碼
         4. CCCII內碼及 EACC 程式
         5. Unicode
 
     * More information is available at ifcss.org in Ross Patterson's
       document CJK Codes and in Ken Lunde: Understanding Japanese
       Information Processing p35ff.
 
     ＊在 ifcss.org(.jp) 上有更多有用的資訊，就是 Ross Patterson 先生的  
       「 CJK 內碼」一文，及 Ken Lunde先生的：「了解日本在處理 p35ff 上
       的資訊」文件。
 
     _________________________________________________________________
   
Development of kanji codes for computers
電腦漢字內碼的發展
 
  Japanese JIS Codes
日本 JIS 碼
   
   The first character code designed to make the processing of
   ideographic characters on computers possible was the JIS C 6226-1978.
   It was developed according to the guidelines laid down in the ISO
   standard 2022-1973 and became the model for most other code standards
   used today in East Asia (the most notable exception is Big5). Covering
   approximately 6500 characters, this standard has been revised twice,
   in 1983 and 1990, when the assignments of some characters were
   changed and a few characters were added. Revising a standard is about
   the worst thing a standards body can do, and it has caused much grief
   and headache among manufacturers and users alike. Today we finally
   have fonts that bear the year of the standard they cover in their
   name, so that users can know which version is encoded in a font and
   select it accordingly.
   Our texts and tools are based on the latest version.
   
   The version of 1990 has become known under the name JIS X 0208-1990
   and, together with an additional set of 5800 characters (JIS X 0212),
   has been the basis of the Japanese contribution to Unicode.
   
   The JIS code is almost never used in computers as it was defined;
   rather, some changes are made in the way the code numbers are
   represented. This is necessary to allow JIS to be mixed with ASCII
   characters and, as in the case of Shift-JIS (or MS-Kanji, the most
   popular encoding on personal computers), with earlier Japanese
   encodings of half-width kana. East Asian text is thus most frequently
   based on a multibyte encoding, a character stream that contains a
   mixture of characters represented by one single byte and of characters
   represented by two bytes.
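
   The mixture of one-byte and two-byte characters described above can
   be observed directly. The following is a minimal sketch that assumes
   Python and its built-in "shift_jis" codec (neither is mentioned in
   the article); the function name byte_widths is hypothetical.

```python
def byte_widths(text: str, encoding: str = "shift_jis"):
    """Return (character, encoded byte count) pairs for each character."""
    return [(ch, len(ch.encode(encoding))) for ch in text]


# An ASCII letter, a half-width katakana (U+FF71), and two kanji.
sample = "A\uff71日本"

widths = byte_widths(sample)
total = len(sample.encode("shift_jis"))

for ch, n in widths:
    print(repr(ch), n)
print("total bytes:", total)
```

   The ASCII letter and the half-width kana each occupy one byte, while
   each kanji occupies two, so software cannot assume that the number
   of bytes equals the number of characters.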
   
   In addition to the characters in the national standard, many Japanese
   vendors have added their own private characters to JIS, making the
   conversion between these different encodings difficult beyond belief.
   
  Big5
（中文 BIG-5 碼）
   
   
   There are different legends about the beginnings of Big5; some say
   that the code had been developed for an integrated application with 5
   parts, and others say it was an agreement of five big vendors in the
   computer industry. No matter which one is true (and it might as well
   be something else), the Taiwanese government did not recognize the
   need for a practical encoding of Chinese characters in time.
   Government agencies had apparently been involved also in the
   development of Big5, but it was only in 1986 that an official code was
   announced, a time by which Big5 was already a de facto standard with
   numerous applications in daily use.
 
   關於 BIG-5 內碼開始的傳說，有許多不同的版本：有人說此內碼是由一個
   整合五個部份的應用軟體所產生的，又有人說它是五個大型的電腦廠商所
   共同約定的。不管哪一個傳說是真的，台灣政府並未即時了解中文內碼
   的重要性及須求性。雖然政府機關很明顯地也參與了 BIG-5 的開發工作，
   不過直到 1986 年，官方的內碼才正式對外宣佈，這時 BIG-5 內碼早已是
   為數極多的日常應用軟體所採用的標準了。
 
 
   Big5 defines 13051 Chinese characters, arranged in two parts according
   to their frequency of usage. The arrangement within these parts is by
   number of strokes, then Kangxi radical. As Big5 was apparently
   developed in a great hurry, some mistakes were made in the stroke
   count (and thus placement) of characters, and two characters are
   represented twice. On the other hand, some frequently used characters were
   left out and were later implemented by individual companies.
 
   All implementations agree on the core part of Big5, but different
   extensions by individual vendors acquired much weight, most notably in
   the case of the ETEN Chinese system that was very popular in the late
   eighties and early nineties. As there is no document that defines Big5
   apart from the documentation provided by the vendors with their
   products, it is impossible to single out one standard Big5. This was
   actually a big problem in the process of designing Unicode -- and it
   remains one even today.
 
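   (This paragraph describes the major problem that BIG5 cannot be
     reduced to one unified standard. That is still the case today, and
     it will also cause trouble as Unicode is drawn up.)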
 
   
 
  CNS X-11643-1986 and CNS X-11643-1992
（中央標準局 CNS X-11643-1986 及 CNS X-11643-1992）
   
   This is the Chinese National Code for Taiwan. In the form published in
   1992, it defines the glyph-shape, stroke count and radical heading for
   48027 characters. For all these characters a reference font in a 40 by
   40 grid ( and for most of them also in 24 by 24 grid ) is available
   from the issuing body. These characters are assigned to 7 levels with
   the more frequent at the lower levels and the variant forms at the two
   top levels. The whole architecture reserves space for five more
   standard levels, and four levels are reserved for non-standard, private
   encoding, bringing the total to 16 levels, with a hypothetical space
   for roughly 120 000 ideographs. On top of the currently defined ones,
   one more level with about 7000 characters is currently under revision
   and expected to be published in the course of 1995. This will bring
   the total number of assigned characters to roughly 55000.
 
   這是台灣的中央標準碼。在 1992 年發佈的格式上，它為 48027 個中文字
   定義了 glyph-shape，stroke count，以及 radical heading 。對於這些
   所有的中文字，並有相應的 40 x 40 格子的字型（大部份的亦有24 x 24
   字型）附在發表的內容上。
 
   這些中國字被分配至七個字面，以最常用的字擺在下層字面，以及變異的
   字體擺在上面二層字面。中央標準碼的技術，使它保留了五個以上的標準字面
   以及四個非標準、私人用字面，使得它總共可以有 16 個字面，並且對於粗略
   算來 120 000 個字號有個假設的空間。
 
   在目前已定義的最上層字面（第七層），一層多的字面（具有約 7000 個字）
   正在加以重新審核，並且打算在 1995 年公佈。這將使得它所指定的中文字元
   可達到將近 55000 個字。
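
   The figures in this section can be checked with simple arithmetic.
   The sketch below assumes Python, and it assumes the layout given in
   Shann's message earlier in this post (each plane has 94*94 positions,
   both bytes running from 33 to 126); the article itself does not state
   the plane size. The variable names are hypothetical.

```python
VALUES_PER_BYTE = 126 - 33 + 1        # 94 usable values per byte
PER_PLANE = VALUES_PER_BYTE ** 2      # positions in one plane

TOTAL_PLANES = 16                     # "bringing the total to 16 levels"
DEFINED_PLANES = 7                    # planes defined in the 1992 edition
ASSIGNED_1992 = 48027                 # characters defined in 1992
PLANNED_EXTRA = 7000                  # "about 7000", an approximate figure

upper_bound = PER_PLANE * TOTAL_PLANES          # raw positions, all planes
defined_capacity = PER_PLANE * DEFINED_PLANES   # raw positions, planes 1-7
after_extension = ASSIGNED_1992 + PLANNED_EXTRA

print("positions per plane:", PER_PLANE)
print("positions in 16 planes:", upper_bound)
print("positions in planes 1-7:", defined_capacity)
print("assigned after the planned level:", after_extension)
```

   Under these assumptions the raw upper bound is 141,376 positions,
   somewhat more than the "roughly 120 000" quoted above; the source
   does not say how its figure was derived. The 48,027 assigned
   characters fit within the 61,852 positions of seven planes, and the
   planned addition gives about 55,000, matching the article.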
 
 
   The overall structure has already been outlined; but how does the CNS
   code relate to other code sets in use in East Asia, e.g. the Korean
   KSC, the Japanese JIS, and the mainland Chinese GB? And what about
   Unicode?
 
   這整體的結構已經被勾畫出來了。但是 CNS 碼與其它東亞所用的內碼
   （例如韓國 KSC 碼、日本 JIS 碼、中國大陸簡體 GB 碼等）有什麼
   關係呢? 和 Unicode 的關係又如何呢?
   
 
   The answer to this is somewhat disappointing: Although CNS defines
   roughly eight times the number of characters, more than three hundred
   characters present in the Japanese JIS are still missing from the CNS.
   In relation to GB, the CNS misses roughly 1800 simplified characters.
   With this it is also clear that the CNS code will miss quite a number
   of Unicode Han characters. Upon closer examination, the reason is soon
   obvious: CNS in its higher levels occasionally defines some
   abbreviated forms, but in general it does not include characters
   created as a result of the modern character reforms. I consider this a
   serious drawback and an obstacle to a true universal character set.
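   But this seems to h [...]

   [...] to handle this requirement. Practical work has shown how
   important it is to keep using a familiar working environment (with
   its fonts, editors, and so on). I therefore now advocate using one of
   the codes currently in wide international use (Taiwan's BIG5 or
   Japan's JIS) together with the IRIZ KanjiBase as a better solution
   than adopting CCCII.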
 
 
     _________________________________________________________________
   
   
   
    1. Before launching large database projects, one ought to find out
       what has already been done in the area and study its qualities and
       defects. Often one learns much by asking programmers and database
       designers what they would do differently if they could start all
       over again. In the field of Buddhist studies, the Electronic
       Buddhist Text Initiative tries to help in this coordination and
       learning process.
       
     This may sound trite, but it is a fact that even major projects in
     the field are unaware of what is happening elsewhere, and sometimes
     even in their own institution. On the recent field trip organized by
     the Electronic Buddhist Text Initiative, we found for example that
     the people managing the Chinese University of Hong Kong concordance
     project were not aware of the very similar effort in Oslo; and a
     long-time resident scholar at the Academia Sinica found out through
     us that important materials for a Chinese text he has been
     translating are on his institute's computer. That electronic
     versions of a text exist does not mean much in itself; one must
     evaluate data quality, accessibility, and suitability for one's
     project.
    2. One must classify data input projects by the amount of data
       involved and their destination. Thus one must distinguish between
       small amounts of data and large amounts of data, data destined for
       individual users or small groups and data destined for large user
       groups and institutions, etc. The present guidelines apply to
       large input projects that contain many full-form Chinese
       characters and are aimed at a large and diverse group of users.
       
     Failure to make such distinctions may lead to inadequate demands for
     data quality, search strategies, etc. For example, certain automatic
     or half-automatic methods of scanner input can be quite useful and
     efficient for an individual user prepared to spend a substantial
     amount of time for data correction; but the very same method may
     prove totally inadequate for large-scale institutional data input
     because of the high cost of error correction. Similarly, a
     relatively high number of mistakes may not bother some users but is
     unacceptable for data that are to be distributed to other users.
     Again, the use of many self-defined characters can be acceptable for
     individuals but not for institutions.
    3. It is of the greatest importance to make basic decisions at the
       beginning of a project and to discuss them with specialists. In
       making these decisions, both present and future possibilities of
       use must be kept in mind. This applies particularly to the choice
       of source text, text editing, annotation, basic data character
       (character encoding, data format, non-standard character handling,
       etc.), and hard/software environments. Such questions must be
       discussed by a team of specialists at the outset of a large
       project, i.e. before the main input activity starts, and an action
       plan should be approved by the whole team.
       
     Failure to do this can result in gigantic waste of money. Several
     Chinese text databases I know of started out with little planning;
     mostly they were designed to fit the hardware and software
     environment of some years ago at a specific location. Later, when
     trying to convert the data to present requirements and for use by
     other institutions, they found that automatic conversion was not
     possible or corrupted the data set. Prior planning and consultation
     with specialists could have prevented this. Another example: tagging
     data during the input or correction / editing process can improve
     the value of a database enormously, for example in making it
     possible to look for all plant names or place names in the whole
     Pali canon. Doing something like this at a later point would be
     another major enterprise that could have been avoided through
     careful planning.
    4. If the electronic text is (or may at a later point in time be)
       destined for international users and a variety of hardware and
       software environments, it is necessary to make a basic data set
       (master data set) that can later be automatically converted into
       any necessary code or format. It is important to treat this master
       data set as a separate entity whose input conditions, character
       code, hardware environment, etc. can be very different from that
       of the eventual user, just as studio quality music recording and
       editing equipment is different from the reproduction equipment of
       the consumer.
       
     With Chinese text, the difference shows particularly in the way rare
     characters and different national standards are handled.
     Institutions that do not separate master data and user data
     invariably produce data that follow the low standards of character
     codes now used on PCs (JIS, GB, BIG-5, etc.; see the article in this
     number by C. Wittern). Of the institutions visited on the recent
     field trip, those who did not distinguish between master and user
     data all suffer from data quality problems which will become even
     more serious as larger codes become available. Those who were wise
     enough to make this distinction are: the libraries of Taiwan
     National University and Hong Kong University of Science and
     Technology (both use master data in CCCII code and user data in
     BIG-5) and the Chinese Academy of Social Sciences (master data in
     their own 45,000 character code, user data in various formats). Just
     like master tapes in the music business, master data must be of such
     quality that it can be used in many different environments, present
     and future. Most of the Chinese text data so far input in Japan,
     Korea, and mainland China will have about as much future as the
     recording of a concert made on a Walkman.
    5. In order to assure such convertibility and adaptability, the
       master data must contain the greatest possible amount of
       information. This is an important factor of data quality. In the
       case of Chinese, Korean, or Japanese data (or any other text set
       that maip, we
     met programmers who admitted that they have never actually used the
     database they have been working on for years...
    9. Databases are made for users; therefore the wishes, working
       environment, and likely working habits of users must be carefully
       studied and respected. For example, most users search while
       writing a paper or book; therefore it must be possible to use the
       database concurrently with a word processing program. Any large
       text database should also let the user attach notes and tags to
       the main text. Such notes should also be searchable, printable
       (together with the text or separately), savable as separate files
       with location tags, and portable to updated versions of the
       electronic text. Search engines must also be adapted to many
       users' needs. They must therefore be flexible and adaptable to a
       variety of users' preferences (just like word processing programs)
       rather than hard-coded. Search results should be viewable,
       printable, and savable to file in a variety of formats according
       to the user's wishes. Since the main aim of databases is the retrieval of
       information, such retrieval should be carefully planned with many
       options for the user.
       
     In projects whose input takes many years of work, one must make
     programmers produce multiple test versions of search software and
     have scholars and other prospective users evaluate it even while
     input is going on. If necessary, data structure decisions have to be
     reevaluated. Users should have a say in all important software
     decisions, and programmers should assist users to evaluate test
     versions and to formulate their wishes by telling them about
     alternative possibilities.
    Author:Urs App
    Last updated: 95/04/23
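
    Guideline 3 above notes that tagging data during input makes it
    possible to look up, for example, all plant names or place names in
    a corpus. A minimal sketch of that idea follows. It assumes Python;
    the tag syntax (<plant>...</plant>, <place>...</place>), the function
    names, and the two sample lines are all invented for illustration
    and are not taken from any project the article describes.

```python
import re
from collections import defaultdict

# Matches <kind>text</kind> where the closing tag repeats the opening name.
TAG_PATTERN = re.compile(r"<(?P<kind>\w+)>(?P<text>.*?)</(?P=kind)>")


def index_tags(lines):
    """Map each tag kind to a list of (line number, tagged text) pairs."""
    index = defaultdict(list)
    for line_no, line in enumerate(lines, start=1):
        for match in TAG_PATTERN.finditer(line):
            index[match.group("kind")].append((line_no, match.group("text")))
    return dict(index)


def strip_tags(line):
    """Return the line as a reader would see it, with the tags removed."""
    return TAG_PATTERN.sub(lambda m: m.group("text"), line)


# Invented sample lines, not quotations from any canon.
lines = [
    "The teacher stayed near <place>Savatthi</place>.",
    "A <plant>sal</plant> tree and a <plant>mango</plant> tree stood there.",
]

index = index_tags(lines)
print(index.get("plant"))
print(strip_tags(lines[0]))
```

    Because the tags live in the master text, the plain reading text and
    the index can both be regenerated from it at any time.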
    
 
 
==========================================================================
Date: Mon, 24 Jul 1995 23:39:11 +0800
From: Shann Wei-Chang <sq [...]

[...] judging from his documents, there seems to be no entirely
optimistic solution. It is indeed vexing.
 
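For the next week or two I will be putting all my effort into writing a
user manual for Chinese TeX, and after that I have to help the student
assistants and the computer center write accounting scripts. Well, there
is no point in saying more; in short, I would very much like to help but
really am unable to.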
 
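>     However, Mr. 劉明威, the ETen engineer, said that he will only
> start on the program modification work once a certain number of
> Buddhist organizations support this extension proposal, so as not to
> end up working for nothing.
> 
>     Seen this way, might this improved version of Big-5 be
> problematic? Impractical? Ordinary users would therefore still be
> using the old Big-5 version...
> So this version neither offers the "complete" set of created
> characters that CCCII, Unicode, etc. can, nor circulates as widely as
> Big-5; it seems it can only serve as a stopgap?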
 
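I do not quite understand this passage. Wittern has already made the
problems with CCCII very clear (I was not this clear about them before;
I merely thought, on theoretical grounds, that it was not a good idea,
and now Wittern has supplied very definite technical material showing
that it is not). But I do not think Unicode can provide the complete set
of created characters. It is, after all, a fixed-size 256*256 code
table, so the number of characters has an upper limit; moreover, the
code has to be shared by the whole world, and it can hardly give all of
its character space to us, can it? Also, when you say "serve as a
stopgap", what are you referring to? The improved Big-5? But didn't you
just say that ETen cannot release it for use yet?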
 
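I strongly agree with what Wittern's article (or the one by the other
author; anyway, the one you attached) says: data should be divided into
an internal code (master data) and an external code (user data). If you
accept this idea, you can right away choose the most suitable code for
producing the master data, without even having to heed any standard
code. My personal suggestion (I feel rather sheepish offering so many
suggestions as someone not doing the work) is the same as before: use
CNS as far as possible, define the missing characters yourselves, and
mark your special created characters with escape codes. On the PC there
are many character-creation programs available to you, and on UNIX
everyone can simply use the X Window bitmap or BDF format. Once the
master data has been built, mapping it to the user code is only a matter
of a table.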
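
The "only a matter of a table" step can be sketched as follows. This is
a minimal illustration assuming Python and its built-in "big5" codec.
The master code numbers, the table contents, the function name
to_user_data, and the "&M-XXXX;" marker for characters with no user-code
equivalent are all hypothetical; the messages in this post do not
specify any of them.

```python
from typing import Dict, List


def to_user_data(master: List[int],
                 table: Dict[int, str],
                 user_encoding: str = "big5") -> bytes:
    """Convert master-code numbers to user-code bytes via a lookup table.

    Codes missing from the table are written as an ASCII marker so that
    they survive in the user data and can be resolved later.
    """
    pieces = []
    for code in master:
        user_char = table.get(code)
        if user_char is None:
            pieces.append("&M-%04X;" % code)   # hypothetical escape marker
        else:
            pieces.append(user_char)
    return "".join(pieces).encode(user_encoding)


# Hypothetical master codes: two map to Big-5 characters, one does not.
table = {0x0001: "佛", 0x0002: "典"}
master = [0x0001, 0x0002, 0x7001]

user_bytes = to_user_data(master, table)
print(user_bytes.decode("big5"))
```

Since the master data keeps every character as a distinct code, the same
master file can be converted again with a different table when a larger
user code becomes available.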
 
 
>    May I ask: can a document input in the CNS standard be viewed under BIG-5?
 
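Both bytes of a standard CNS code are low bytes (0 .. 127); this follows
the ISO standard. The different CNS planes are selected with escape
codes, so it is fundamentally quite different from Big-5. On the PC,
however, ETen offers a "CNS code", by which it means shift-CNS (like
shift-JIS). It uses only planes 1 and 2 of CNS: for plane 1 the first
byte is shifted into 128...255, and for plane 2 both bytes are shifted.
Strictly speaking, then, the CNS code that ETen provides is not standard
either.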
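
The byte shifting described above can be sketched like this. It is a
reading of the description in this message only, not a verified account
of ETen's implementation: the sketch assumes Python, and it assumes that
"shifting" a byte means setting its high bit (adding 128). The function
names are hypothetical.

```python
def shift_cns(plane: int, b1: int, b2: int) -> bytes:
    """Encode a CNS plane 1 or plane 2 code point as two shifted bytes."""
    for b in (b1, b2):
        if not 33 <= b <= 126:
            raise ValueError("CNS bytes must lie between 33 and 126")
    if plane == 1:
        return bytes([b1 | 0x80, b2])          # shift the first byte only
    if plane == 2:
        return bytes([b1 | 0x80, b2 | 0x80])   # shift both bytes
    raise ValueError("only planes 1 and 2 are covered by this scheme")


def unshift_cns(pair: bytes):
    """Recover (plane, b1, b2) from a two-byte shifted code."""
    b1, b2 = pair
    if not b1 & 0x80:
        raise ValueError("first byte is not shifted")
    plane = 2 if b2 & 0x80 else 1
    return plane, b1 & 0x7F, b2 & 0x7F


encoded = shift_cns(1, 0x44, 0x21)
print(encoded.hex(), unshift_cns(encoded))
```

Under this reading, the high bit of the second byte is what tells the
two planes apart, so no escape code is needed between them.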
 
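Moreover, the first two planes of CNS and Big-5 are not related by an
order-preserving one-to-one mapping, so even shift-CNS is not equal to
Big-5. Last year I spent at least an afternoon working out the
differences between Big-5 and CNS planes 1 and 2 and pinning down
Big-5's errors. I wrote a report and posted it to CCNET-L, but I have no
time now to dig out the old draft.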
 
There is, however, a program called betty that can convert shift-CNS to
Big-5 on the fly (and vice versa), but it runs only on UNIX.
 
>    May I ask how one can obtain a CNS Chinese system?
 
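You have stumped me. Apart from the shift-CNS on ETen I have not seen
any other implementations, and that is of course not a PD program. I
would guess that the Institute for Information Industry and certain
government agencies have such software, but it has no foothold in the
market, so ordinary users never see such products. On UNIX I think I
know how to implement a shift-CNS Chinese environment together with
CXTERM; as for handling the escape codes for self-created characters, I
think betty could be modified to implement that. Betty's author is at
National Tsing Hua University (I hope he has not graduated yet) and
could be asked for guidance.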
 
-Shann
 
 
卍 NTU Lion's Roar Buddhist BBS (台大獅子吼佛學專站) http://buddhaspace.org