Points to Note When Building Large Buddhist Text Databases

~---------- Forwarded message ----------
Date: Mon, 28 Aug 1995 23:40:08 +0800 (CST)
From: David Chiou <b83050@cctwin.ee.ntu.edu.tw>
Subject: Guidelines for the Creation of Large Chinese Text Database


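The following are points to note when building a Buddhist studies database,
offered for reference to groups such as the Center for Buddhist Studies at
the National Taiwan University College of Liberal Arts.

(According to the text below, it seems to say that CCCII is not suitable?
  But I only skimmed it, so I may well have misread.)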


~---------- Forwarded message ----------
     _________________________________________________________________
   
Guidelines for the Creation of Large Chinese Text Databases

  by Urs App
  
   
     _________________________________________________________________
   
    Abstract
    
   This article (which was also published in the Electronic Bodhidharma
   No. 3) establishes some guidelines for the creation of large Chinese
   databases. Practical experience at our institute in trying to create
   master data in CCCII code has shown that CCCII is not a practical
   option for this purpose. Our IRIZ KanjiBase encoding, on the other
   hand, has quite admirably served this purpose. Actual work shows how
   important maintaining the habitual working environment (with front end
   processors etc.) is. Therefore I would now advocate using one of the
   national codes in combination with KanjiBase rather than CCCII.
     _________________________________________________________________
   
   
   
    1. Before launching large database projects, one ought to find out
       what has already been done in the area and study its qualities and
       shortcomings. Often one learns much by asking programmers and database
       designers what they would do differently if they could start all
       over again. In the field of Buddhist studies, the Electronic
       Buddhist Text Initiative tries to help in this coordination and
       learning process.
       
     This may sound trite, but it is a fact that even major projects in
     the field are unaware of what is happening elsewhere, and sometimes
     even in their own institution. On the recent field trip organized by
     the Electronic Buddhist Text Initiative, we found for example that
     the people managing the Chinese University of Hong Kong concordance
     project were not aware of the very similar effort in Oslo; and a
     long-time resident scholar at the Academia Sinica found out through
     us that important materials for a Chinese text he has been
     translating are on his institute's computer. That electronic
     versions of a text exist does not mean much in itself; one must
     evaluate data quality, accessibility, and suitability for one's
     project.
    2. One must classify data input projects by the amount of data
       involved and their destination. Thus one must distinguish between
       small amounts of data and large amounts of data, data destined for
       individual users or small groups and data destined for large user
       groups and institutions, etc. The present guidelines apply to
       large input projects that contain many full-form Chinese
       characters and are aimed at a large and diverse group of users.
       
     Failure to make such distinctions may lead to inadequate demands for
     data quality, search strategies, etc. For example, certain automatic
     or half-automatic methods of scanner input can be quite useful and
     efficient for an individual user prepared to spend a substantial
     amount of time for data correction; but the very same method may
     prove totally inadequate for large-scale institutional data input
     because of the high cost of error correction. Similarly, a
     relatively high number of mistakes may not bother some users but is
     unacceptable for data that are to be distributed to other users.
     Again, the use of many self-defined characters can be acceptable for
     individuals but not for institutions.
    3. It is of the greatest importance to make basic decisions at the
       beginning of a project and to discuss them with specialists. In
       making these decisions, both present and future possibilities of
       use must be kept in mind. This applies particularly to the choice
       of source text, text editing, annotation, basic data characteristics
       (character encoding, data format, non-standard character handling,
       etc.), and hard/software environments. Such questions must be
       discussed by a team of specialists at the outset of a large
       project, i.e. before the main input activity starts, and an action
       plan should be approved by the whole team.
       
     Failure to do this can result in a gigantic waste of money. Several
     Chinese text databases I know of started out with little planning;
     mostly they were designed to fit the hardware and software
     environment of some years ago at a specific location. Later, when
     trying to convert the data to present requirements and for use by
     other institutions, they found that automatic conversion was not
     possible or corrupted the data set. Prior planning and consultation
     with specialists could have prevented this. Another example: tagging
     data during the input or correction / editing process can improve
     the value of a database enormously, for example in making it
     possible to look for all plant names or place names in the whole
     Pali canon. Doing something like this at a later point would be
     another major enterprise that could have been avoided through
     careful planning.
    4. If the electronic text is (or may at a later point in time be)
       destined for international users and a variety of hardware and
       software environments, it is necessary to make a basic data set
       (master data set) that can later be automatically converted into
       any necessary code or format. It is important to treat this master
       data set as a separate entity whose input conditions, character
       code, hardware environment, etc. can be very different from that
       of the eventual user, just as studio quality music recording and
       editing equipment is different from the reproduction equipment of
       the consumer.
       
     With Chinese text, the difference shows particularly in the way rare
     characters and different national standards are handled.
     Institutions that do not separate master data and user data
     invariably produce data that follow the low standards of character
     codes now used on PCs (JIS, GB, BIG-5, etc.; see the article in this
     number by C. Wittern). Of the institutions visited on the recent
     field trip, those who did not distinguish between master and user
     data all suffer from data quality problems which will become even
     more serious as larger codes become available. Those who were wise
     enough to make this distinction are: the libraries of National
     Taiwan University and Hong Kong University of Science and
     Technology (both use master data in CCCII code and user data in
     BIG-5) and the Chinese Academy of Social Sciences (master data in
     their own 45,000 character code, user data in various formats). Just
     like master tapes in the music business, master data must be of such
     quality that it can be used in many different environments, present
     and future. Most of the Chinese text data so far input in Japan,
     Korea, and mainland China will have about as much future as the
     recording of a concert made on a Walkman.
    5. In order to assure such convertibility and adaptability, the
       master data must contain the greatest possible amount of
       information. This is an important factor of data quality. In the
       case of Chinese, Korean, or Japanese data (or any other text set
       that may include characters that are not standardized and where
       several competing standards exist), one must utilize the character
       standard with the best structure, greatest number of characters,
       and best convertibility. At present, the best standardized Chinese
       character code for master data is the Taiwanese CCCII code (see
       the article by C. Wittern below). In spite of its clumsy
       three-byte format, its elevated price (around US $ 4000 for a PC
       card, conversion routines into other codes, and exhaustive
       documentation), and its very small number of users, adoption of
       this code seems at present the most sensible approach for creating
       a master data set of large bodies of full-form Chinese character
       text.
       
        (Note May 1995: Though the principle of clearly distinguishing
       master data from user data stands, for various reasons I do not
       see the CCCII code as the best possible code any more. A national
       code in combination with the IRIZ KanjiBase is more practical.)
       
     Data that is input in character codes with a mixture of simplified
     and non-simplified characters and a small total number of characters
     (such as JIS or GB codes) cannot be automatically converted into
     more elaborate codes; for example, a simplified character shared by
     several long forms cannot be converted automatically because no
     machine will know which long form it stands for. The reverse conversion,
     however, is easy. The same is true for variant forms of characters:
     the objective is to preserve as much of this information in the
     electronic text as is possible. In Japan, characters not existing in
     the JIS code are often input by an empty box and other characters in
     simplified forms (for example in the monogatari CD-ROM of the
     Kokubungaku Shiryōkan in Tokyo). Such data has poor convertibility
     and is thus of deficient quality even if the input text is quite
     accurate.
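
     The asymmetry described above can be sketched in a few lines of
     Python. This is only an illustration under stated assumptions: the
     table TRAD_TO_SIMP and both function names are hypothetical, the
     characters are ordinary Unicode strings rather than CCCII, JIS, or
     GB code points, and the table holds just a handful of well-known
     pairs.

```python
# Hypothetical mini-table (an assumption for illustration, not a real code
# table): several long forms collapse into one simplified form.
TRAD_TO_SIMP = {
    "發": "发",  # to emit
    "髮": "发",  # hair
    "乾": "干",  # dry
    "幹": "干",  # trunk; to do
    "干": "干",  # shield (same shape in both sets)
}


def to_simplified(text):
    """Master data -> user data: each character has exactly one target."""
    return "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text)


def long_form_candidates(ch):
    """User data -> master data: list every long form that maps to ch."""
    found = sorted(t for t, s in TRAD_TO_SIMP.items() if s == ch)
    return found or [ch]


def is_ambiguous(ch):
    """True when a machine cannot pick the long form without a human."""
    return len(long_form_candidates(ch)) > 1
```

     Calling to_simplified on master text always succeeds, whereas
     long_form_candidates returns more than one answer for a character
     such as 发, which is why data keyed in the smaller code cannot be
     upgraded automatically.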
    6. If variant forms of characters exist in the printed form of a
       source text, one should strive to reproduce them as they are in
       the electronic text. This is not always possible or even wise;
       some variations (for example, different print shapes of radicals)
       are commonly accepted. However, all such
       decisions must be documented and strictly adhered to. Obviously, a
       master data code such as CCCII makes the management of such
       variations easier since it links variant characters to basic
       forms, making conversion from one into the other possible if the
       need arises. If the printed text contains several printed forms of
       a character, one must reproduce these features in the electronic
       text. If a good record is kept of such variations during the input
       and correction process, one will later be able to create search
       modules that automatically search all variants of a character or
       term concurrently if the user wishes this. The producers of
       electronic text should keep in mind that present and future users
       may have interests that cannot be imagined or foreseen and that
       it is not the job of the data producer to limit such interests.
       Rather, the basic data set should be as faithful as possible, just
       like the master recording in music.
       
     For the Chinese University of Hong Kong's concordance series (stored
     in Big-5 format without distinction of master and user data), the
     variant forms were reduced to standard forms and listed in printed
     form. The electronic text only features the standard forms. If in
     the future a larger code comes into common use that contains many
     variant forms, there will be no master data that can be converted in
     such a code, and much of the work that was done in reducing the
     information will have to be repeated in the other direction. In
     contrast, the Hong Kong University of Science and Technology inputs
     mainland book information in simplified forms and Taiwanese
     information in long forms, as they appear in the book. The search
     module then treats the simplified characters as variant forms which
     are also searched, allowing the user to find information regardless
     of the specific form of the printed character.
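
     A search module of this kind might look roughly like the following
     Python sketch. It is not the Hong Kong University of Science and
     Technology implementation; VARIANT_GROUPS, variant_pattern, and
     search are hypothetical names, and the two variant groups are
     sample data. The electronic text keeps whatever form was printed,
     and the query is expanded at search time only if the user asks for
     it.

```python
import re

# Hypothetical variant record kept during input and correction: each set
# lists printed forms that a search may treat as the same character.
VARIANT_GROUPS = [
    {"國", "国"},
    {"學", "学"},
]

# Look-up from any single form to its whole group.
VARIANTS = {ch: group for group in VARIANT_GROUPS for ch in group}


def variant_pattern(term):
    """Build a regex in which every character matches any of its variants."""
    parts = []
    for ch in term:
        group = VARIANTS.get(ch, {ch})
        parts.append("[" + "".join(re.escape(c) for c in sorted(group)) + "]")
    return re.compile("".join(parts))


def search(term, text, all_variants=True):
    """Return start offsets of term; variant expansion is optional."""
    if all_variants:
        pattern = variant_pattern(term)
    else:
        pattern = re.compile(re.escape(term))
    return [m.start() for m in pattern.finditer(text)]
```

     Because the stored text is never normalized, the same data set
     serves both a user who wants every variant and one who wants only
     the exact printed form (all_variants=False).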
    7. In electronic text, data accuracy is exceedingly important because
       machines search data with much more accuracy than humans. They
       find only what the data set contains, and browsing is usually not
       possible or feasible. Mistakes are in general only found by chance
       and not as a result of a search; one thus cannot expect users to
       correct data. With large data sets, users are often blinded by the
       amount of information that can be found. However, one must also be
       able to rely on accurate information on what is not in a text.
       Input mistakes prevent gaining such information. Data mistakes can
       be eliminated by adequate input methods and data correction
       procedures. Data accuracy depends on a variety of factors which
       are usually interdependent: quality and readability of the source
       material, choice of input method, education of personnel, quality
       of input guidelines (definition of identity / difference of
       characters), size of the character code, quality of reference
       materials, data correction procedures and personnel, consistency
       of the application of the guidelines, quality of input and
       correction documentation, honesty of personnel in admitting
       problems, etc.
       
     Master data of good quality must not only be in an adequate code
     which contains much information (and has thus good conversion
     characteristics) and does not distort the printed original: it must
     also be error-free. With alphabetical text, input of the same text
     by two typists and machine-comparison of the typed text yield quite
     good results. However, this method is not totally adequate because
     fast typists sometimes mistakenly hit the same wrong adjacent key.
     For Chinese data, this method has not proven successful because
     typists often make the same mistakes. Thus a good error-correction
     procedure must be applied and strict guidelines must be given to the
     input and correction personnel. They must be trained in strict
     quality control procedures; all individual decisions must be
     documented, approved, and consistently applied.
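
     The machine comparison of two independently typed versions, and its
     blind spot, can be illustrated with Python's standard difflib
     module. This is a minimal sketch, not a description of any
     institute's procedure; the function name keying_differences is an
     assumption made for the example.

```python
from difflib import SequenceMatcher


def keying_differences(first, second):
    """Compare two typists' versions of the same passage.

    Returns (offset in first, text in first, text in second) for every
    stretch where the two versions disagree. Stretches where both typists
    made the same mistake are identical and are therefore not reported.
    """
    matcher = SequenceMatcher(None, first, second, autojunk=False)
    return [
        (i1, first[i1:i2], second[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

     An empty result only shows that the two versions agree with each
     other, not that they agree with the printed source. A separate
     proofreading pass against the source therefore remains necessary.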
    8. The overall value of a database can often be substantially
       improved by teamwork and by team discussions of a variety of basic
       issues. For example: the choice of the printed text that serves as
       data source; the presence or absence of scholarly commentary or
       annotation; the references to printed sources; the user profile;
       the required search tools; the cost and quality of necessary hard-
       and software; future hardware and software environment prospects;
       the ease of use of hardware and software; the variety and quality
       of character conversion utilities; the cost of the data; the
       accuracy standard of data; the convertibility standard of the
       data; the structure of the data; the flexibility of the data
       structure (adaptability of format, etc.); the standardization
       level; etc.
       
     Having heard too many "If only we had thought about this before
     input started ..." I believe that in database planning and
     management, group decisions based on discussion are often better
     than individual decisions. Scholars must be careful not to leave
     such decisions to technicians and programmers. On the field trip, we
     met programmers who admitted that they have never actually used the
     database they have been working on for years...
    9. Databases are made for users; therefore the wishes, working
       environment, and likely working habits of users must be carefully
       studied and respected. For example, most users search while
       writing a paper or book; therefore it must be possible to use the
       database concurrently with a word processing program. Any large
       text database should also let the user attach notes and tags to
       the main text. Such notes should also be searchable, printable
       (together with the text or separately), savable as separate files
       with location tags, and portable to updated versions of the
       electronic text. Search engines must also be adapted to many
       users' needs. They must therefore be flexible and adaptable to a
       variety of users' preferences (just like word processing programs)
       rather than hard-coded. Search results should be viewable, printable,
       and savable as files in a variety of formats according to the user's
       wishes. Since the main aim of databases is the retrieval of
       information, such retrieval should be carefully planned with many
       options for the user.
       
     In projects whose input takes many years of work, one must make
     programmers produce multiple test versions of search software and
     have scholars and other prospective users evaluate it even while
     input is going on. If necessary, data structure decisions have to be
     reevaluated. Users should have a say in all important software
     decisions, and programmers should assist users to evaluate test
     versions and to formulate their wishes by telling them about
     alternative possibilities.
    Author: Urs App
    Last updated: 95/04/23
    