~---------- Forwarded message ----------
Date: Mon, 28 Aug 1995 23:40:08 +0800 (CST)
From: David Chiou <b83050@cctwin.ee.ntu.edu.tw>
Subject: Guidelines for the Creation of Large Chinese Text Database
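The following are points to note when building a Buddhist studies
database, offered for the reference of groups such as the Center for
Buddhist Studies at the College of Liberal Arts, National Taiwan
University. (According to the text below, it seems to say that CCCII
is not suitable? But I only skimmed it, so I may have misread.)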
~---------- Forwarded message ----------
_________________________________________________________________
Guidelines for the Creation of Large Chinese Text Databases
by Urs App
_________________________________________________________________
Abstract
This article (which was also published in the Electronic Bodhidharma
No. 3) establishes some guidelines for the creation of large Chinese
databases. Practical experience at our institute in trying to create
master data in CCCII code has shown that CCCII is not a practical
option for this purpose. Our IRIZ KanjiBase encoding, on the other
hand, has quite admirably served this purpose. Actual work shows how
important maintaining the habitual working environment (with front end
processors etc.) is. Therefore I would now advocate using one of the
national codes in combination with KanjiBase rather than CCCII.
_________________________________________________________________
1. Before launching large database projects, one ought to find out
what has already been done in the area and study its qualities and
defects. Often one learns much by asking programmers and database
designers what they would do differently if they could start all
over again. In the field of Buddhist studies, the Electronic
Buddhist Text Initiative tries to help in this coordination and
learning process.
This may sound trite, but it is a fact that even major projects in
the field are unaware of what is happening elsewhere and sometimes
even in their own institution. On the recent field trip organized by
the Electronic Buddhist Text Initiative, we found for example that
the people managing the Chinese University of Hong Kong concordance
project were not aware of the very similar effort in Oslo; and a
long-time resident scholar at the Academia Sinica found out through
us that important materials for a Chinese text he has been
translating are on his institute's computer. That electronic
versions of a text exist does not mean much in itself; one must
evaluate data quality, accessibility, and suitability for one's
project.
2. One must classify data input projects by the amount of data
involved and their destination. Thus one must distinguish between
small amounts of data and large amounts of data, data destined for
individual users or small groups and data destined for large user
groups and institutions, etc. The present guidelines apply to
large input projects that contain many full-form Chinese
characters and are aimed at a large and diverse group of users.
Failure to make such distinctions may lead to inappropriate
requirements for data quality, search strategies, etc. For example,
certain automatic or semi-automatic methods of scanner input can be
quite useful and efficient for an individual user prepared to spend a
substantial amount of time on data correction; but the very same
method may
prove totally inadequate for large-scale institutional data input
because of the high cost of error correction. Similarly, a
relatively high number of mistakes may not bother some users but is
unacceptable for data that are to be distributed to other users.
Again, the use of many self-defined characters can be acceptable for
individuals but not for institutions.
3. It is of the greatest importance to make basic decisions at the
beginning of a project and to discuss them with specialists. In
making these decisions, both present and future possibilities of
use must be kept in mind. This applies particularly to the choice
of source text, text editing, annotation, basic data characteristics
(character encoding, data format, non-standard character handling,
etc.), and hardware/software environments. Such questions must be
discussed by a team of specialists at the outset of a large
project, i.e. before the main input activity starts, and an action
plan should be approved by the whole team.
Failure to do this can result in a gigantic waste of money. Several
Chinese text databases I know of started out with little planning;
mostly they were designed to fit the hardware and software
environment of some years ago at a specific location. Later, when
trying to convert the data to present requirements and for use by
other institutions, they found that automatic conversion was not
possible or corrupted the data set. Prior planning and consultation
with specialists could have prevented this. Another example: tagging
data during the input or correction / editing process can improve
the value of a database enormously, for example in making it
possible to look for all plant names or place names in the whole
Pali canon. Doing something like this at a later point would be
another major enterprise that could have been avoided through
careful planning.
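The benefit of tagging during input can be illustrated with a minimal
sketch. The tag names ("place", "plant") and the sample sentence are
hypothetical and chosen only for illustration; a real project would
adopt an agreed markup scheme at the planning stage.

```python
import re

# Hypothetical sample: a line of text tagged during input.
# The tag names "place" and "plant" are assumptions made for this sketch.
TAGGED = (
    "The teacher stayed at <place>Savatthi</place> in the "
    "<place>Jeta Grove</place>, under a <plant>sal</plant> tree."
)

def find_tagged(text, tag):
    """Return every string enclosed in <tag>...</tag>, in order of appearance."""
    pattern = re.compile(r"<{0}>(.*?)</{0}>".format(re.escape(tag)))
    return pattern.findall(text)

def strip_tags(text):
    """Return the plain reading text with all tags removed."""
    return re.sub(r"</?[a-z]+>", "", text)

print(find_tagged(TAGGED, "place"))
print(find_tagged(TAGGED, "plant"))
print(strip_tags(TAGGED))
```

Once the tags are in the data, a search for all place names is a
one-line query; without them, the same result requires reading and
marking the whole text again.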
4. If the electronic text is (or may at a later point in time be)
destined for international users and a variety of hardware and
software environments, it is necessary to make a basic data set
(master data set) that can later be automatically converted into
any necessary code or format. It is important to treat this master
data set as a separate entity whose input conditions, character
code, hardware environment, etc. can be very different from those
of the eventual user, just as studio quality music recording and
editing equipment is different from the reproduction equipment of
the consumer.
With Chinese text, the difference shows particularly in the way rare
characters and different national standards are handled.
Institutions that do not separate master data and user data
invariably produce data that follow the low standards of character
codes now used on PCs (JIS, GB, BIG-5, etc.; see the article in this
number by C. Wittern). Of the institutions visited on the recent
field trip, those who did not distinguish between master and user
data all suffer from data quality problems which will become even
more serious as larger codes become available. Those who were wise
enough to make this distinction are: the libraries of Taiwan
National University and Hong Kong University of Science and
Technology (both use master data in CCCII code and user data in
BIG-5) and the Chinese Academy of Social Sciences (master data in
their own 45,000 character code, user data in various formats). Just
like master tapes in the music business, master data must be of such
quality that it can be used in many different environments, present
and future. Most of the Chinese text data so far input in Japan,
Korea, and mainland China will have about as much future as the
recording of a concert made on a Walkman.
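The separation of master data and user data can be sketched as
follows. This is only an illustration under stated assumptions:
Unicode stands in for the large master code (it is not the CCCII or
KanjiBase encoding discussed in this article), and the function name
make_user_copy is hypothetical. The point shown is that characters
missing from a smaller user code are kept as references rather than
lost.

```python
# Stand-in master text: Unicode plays the role of the large master code.
# U+20000 is a rare character that exists in neither Big5 nor GB 2312.
MASTER_TEXT = "禪 \U00020000"

def make_user_copy(master, codec):
    """Encode master text for a smaller user code.

    Characters missing from the target code become numeric references
    (&#NNNN;) instead of being dropped or replaced by an empty box.
    """
    return master.encode(codec, errors="xmlcharrefreplace")

for codec in ("big5", "gb2312"):
    user_bytes = make_user_copy(MASTER_TEXT, codec)
    # Decode again only to display what the user of that code would see.
    print(codec, user_bytes.decode(codec))
```

The master text is never modified; each user copy is derived from it
on demand, so a copy in a larger code can be produced later without
new input work.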
5. In order to assure such convertibility and adaptability, the
master data must contain the greatest possible amount of
information. This is an important factor of data quality. In the
case of Chinese, Korean, or Japanese data (or any other text set
that may include characters that are not standardized and where
several competing standards exist), one must utilize the character
standard with the best structure, greatest number of characters,
and best convertibility. At present, the best standardized Chinese
character code for master data is the Taiwanese CCCII code (see
the article by C. Wittern below). In spite of its clumsy
three-byte format, its elevated price (around US $ 4000 for a PC
card, conversion routines into other codes, and exhaustive
documentation), and its very small number of users, adoption of
this code seems at present the most sensible approach for creating
a master data set of large bodies of full-form Chinese character
text.
(Note May 1995: Though the principle of clearly distinguishing
master data from user data stands, for various reasons I do not
see the CCCII code as the best possible code any more. A national
code in combination with the IRIZ KanjiBase is more practical.)
Data that is input in character codes with a mixture of simplified
and non-simplified characters and a small total number of characters
(such as JIS or GB codes) cannot be automatically converted into
more elaborate codes; for example, a simplified character cannot be
converted automatically because no machine will know which of
several possible long forms it stands for. The reverse conversion,
however, is easy. The same is true for variant forms of characters:
the objective is to preserve as much of this information in the
electronic text as is possible. In Japan, characters not existing in
the JIS code are often input as an empty box and other characters in
simplified forms (for example in the monogatari CD-ROM of the
Kokubungaku shiryōkan in Tokyo). Such data has poor convertibility
and is thus of deficient quality even if the input text is quite
accurate.
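Why conversion works in one direction only can be shown with a small
sketch. The table below is illustrative and not exhaustive; the pairs
are well-known simplifications chosen for this example and are not
taken from the article.

```python
from collections import defaultdict

# A few well-known full-form -> simplified pairs (illustrative only).
FULL_TO_SIMPLIFIED = {
    "發": "发",
    "髮": "发",
    "乾": "干",   # in the sense "dry"
    "幹": "干",
    "干": "干",
    "禪": "禅",
}

def simplify(text):
    """Full form -> simplified: every character has exactly one target."""
    return "".join(FULL_TO_SIMPLIFIED.get(ch, ch) for ch in text)

def build_reverse(table):
    """Invert the table; one simplified form may map to several full forms."""
    reverse = defaultdict(list)
    for full, simplified in table.items():
        reverse[simplified].append(full)
    return dict(reverse)

SIMPLIFIED_TO_FULL = build_reverse(FULL_TO_SIMPLIFIED)

def ambiguous(reverse_table):
    """Return the simplified characters that cannot be expanded automatically."""
    return {s: f for s, f in reverse_table.items() if len(f) > 1}

print(simplify("頭髮"))
print(ambiguous(SIMPLIFIED_TO_FULL))
```

The forward table is a plain lookup, whereas the reverse table
contains entries with several candidates, and choosing among them
requires a human reader or knowledge of the context.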
6. If variant forms of characters exist in the printed form of a
source text, one should strive to reproduce them as they are in
the electronic text. This is not always possible or even wise;
some variations (for example, different print shapes of radicals,
as in 榳 and 潁) are commonly accepted. However, all such
decisions must be documented and strictly adhered to. Obviously, a
master data code such as CCCII makes the management of such
variations easier since it links variant characters to basic
forms, making conversion from one into the other possible if the
need arises. If the printed text contains several printed forms of
a character, one must reproduce these features in the electronic
text. If a good record is kept of such variations during the input
and correction process, one will later be able to create search
modules that automatically search all variants of a character or
term concurrently if the user wishes this. The producers of
electronic text should keep in mind that present and future users
may have interests that can not be imagined or foreseen and that
it is not the job of the data producer to limit such interests.
Rather, the basic data set should be as faithful as possible, just
like the master recording in music.
For the Chinese University of Hong Kong's concordance series (stored
in Big-5 format without distinction of master and user data), the
variant forms were reduced to standard forms and listed in printed
form. The electronic text only features the standard forms. If in
the future a larger code comes into common use that contains many
variant forms, there will be no master data that can be converted into
such a code, and much of the work that was done in reducing the
information will have to be repeated in the other direction. In
contrast, the Hong Kong University of Science and Technology inputs
mainland book information in simplified forms and Taiwanese
information in long forms, as they appear in the book. The search
module then treats the simplified characters as variant forms which
are also searched, allowing the user to find information regardless
of the specific form of the printed character.
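A search module of the kind described above can be sketched as
follows. The variant table and the sample lines are hypothetical; in a
real project the table would be built from the record of variants kept
during input and correction.

```python
import re

# Hypothetical variant table: each group lists forms treated as equivalent.
VARIANT_GROUPS = [
    ("禪", "禅"),
    ("佛", "仏"),
]

def build_index(groups):
    """Map every character to the full group of its variants."""
    index = {}
    for group in groups:
        for ch in group:
            index[ch] = group
    return index

VARIANTS = build_index(VARIANT_GROUPS)

def variant_pattern(query):
    """Build a pattern in which each character also matches its variants."""
    parts = []
    for ch in query:
        group = VARIANTS.get(ch)
        if group:
            parts.append("[" + "".join(group) + "]")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))

def search(lines, query, include_variants=True):
    """Return the lines containing the query, optionally across variants."""
    if include_variants:
        pattern = variant_pattern(query)
    else:
        pattern = re.compile(re.escape(query))
    return [line for line in lines if pattern.search(line)]

LINES = ["禪宗", "禅宗", "佛教"]
print(search(LINES, "禪宗"))
print(search(LINES, "禪宗", include_variants=False))
```

Because the stored text keeps the printed forms, the user decides at
search time whether variants count as the same character; that choice
is lost if the data producer normalizes the forms during input.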
7. In electronic text, data accuracy is exceedingly important because
machines search data with much more accuracy than humans. They
find only what the data set contains, and browsing is usually not
possible or feasible. Mistakes are in general only found by chance
and not as a result of a search; one thus cannot expect users to
correct data. With large data sets, users are often blinded by the
amount of information that can be found. However, one must also be
able to rely on accurate information on what is not in a text.
Input mistakes prevent gaining such information. Data mistakes can
be eliminated by adequate input methods and data correction
procedures. Data accuracy depends on a variety of factors which
are usually interdependent: quality and readability of the source
material, choice of input method, education of personnel, quality
of input guidelines (definition of identity / difference of
characters), size of the character code, quality of reference
materials, data correction procedures and personnel, consistency
of the application of the guidelines, quality of input and
correction documentation, honesty of personnel in admitting
problems, etc.
Master data of good quality must not only be in an adequate code
which contains much information (and thus has good conversion
characteristics) and does not distort the printed original: it must
also be error-free. With alphabetical text, input of the same text
by two typists and machine-comparison of the typed text yield quite
good results. However, this method is not totally adequate because
fast typists sometimes mistakenly hit the same wrong adjacent key.
For Chinese data, this method has not proven successful because
typists often make the same mistakes. Thus a good error-correction
procedure must be applied and strict guidelines must be given to the
input and correction personnel. They must be trained in strict
quality control procedures; all individual decisions must be
documented, approved, and consistently applied.
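The machine comparison of two independently typed versions can be
sketched with Python's standard difflib module. The sample strings
and the function name compare_keyings are hypothetical. The sketch
also shows the weakness noted above: a mistake made identically by
both typists produces no discrepancy at all.

```python
import difflib

def compare_keyings(first, second):
    """List disagreements between two keyings of the same text.

    Each entry is (position in first, text in first, text in second).
    """
    matcher = difflib.SequenceMatcher(None, first, second, autojunk=False)
    diffs = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            diffs.append((i1, first[i1:i2], second[j1:j2]))
    return diffs

typist_a = "the mind is the buddha"
typist_b = "the mimd is the buddha"

# One typist slipped: the comparison flags the position for a proofreader.
print(compare_keyings(typist_a, typist_b))

# Both typists made the same slip: nothing is flagged.
print(compare_keyings(typist_b, typist_b))
```

An empty result therefore means only that the two keyings agree, not
that the text is correct, which is why a separate correction
procedure against the printed source remains necessary.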
8. The overall value of a database can often be substantially
improved by teamwork and by team discussions of a variety of basic
issues. For example: the choice of the printed text that serves as
data source; the presence or absence of scholarly commentary or
annotation; the references to printed sources; the user profile;
the required search tools; the cost and quality of necessary hardware
and software; future hardware and software environment prospects;
the ease of use of hardware and software; the variety and quality
of character conversion utilities; the cost of the data; the
accuracy standard of data; the convertibility standard of the
data; the structure of the data; the flexibility of the data
structure (adaptability of format, etc.); the standardization
level; etc.
Having heard too many "If only we had thought about this before
input started ...", I believe that in database planning and
management, group decisions based on discussion are often better
than individual decisions. Scholars must be careful not to leave
such decisions to technicians and programmers. On the field trip, we
met programmers who admitted that they have never actually used the
database they have been working on for years...
9. Databases are made for users; therefore the wishes, working
environment, and likely working habits of users must be carefully
studied and respected. For example, most users search while
writing a paper or book; therefore it must be possible to use the
database concurrently with a word processing program. Any large
text database should also let the user attach notes and tags to
the main text. Such notes should also be searchable, printable
(together with the text or separately), savable as separate files
with location tags, and portable to updated versions of the
electronic text. Search engines must also be adapted to many
users' needs. Therefore they must be flexible and adaptable to a
variety of user preferences (just like word processing programs)
rather than hard-coded. Search results should be viewable, printable,
and savable to file in a variety of formats according to the user's
wishes. Since the main aim of databases is the retrieval of
information, such retrieval should be carefully planned with many
options for the user.
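User notes that are searchable and savable as a separate file with
location tags might be organized along the following lines. This is a
sketch under assumptions: the field names, the text identifier, and
the "page.line" form of the location tag are all hypothetical.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class Note:
    text_id: str    # hypothetical identifier of the annotated text
    location: str   # hypothetical location tag, here "page.line"
    body: str       # the user's note

def search_notes(notes, term):
    """Return the notes whose body contains the search term."""
    return [note for note in notes if term in note.body]

def save_notes(notes):
    """Serialize notes separately from the main text, location tags included."""
    return json.dumps([asdict(n) for n in notes], ensure_ascii=False, indent=2)

def load_notes(data):
    """Restore notes, for example after the main text has been updated."""
    return [Note(**item) for item in json.loads(data)]

notes = [
    Note("text-001", "12.3", "Compare the parallel passage in the commentary."),
    Note("text-001", "40.7", "Variant reading; check the printed edition."),
]

saved = save_notes(notes)
print(saved)
print(search_notes(load_notes(saved), "Variant"))
```

Keeping the notes outside the main text and tied to it only through
location tags is what allows them to be carried over to a corrected
version of the text, provided the tags remain stable across versions.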
In projects whose input takes many years of work, one must make
programmers produce multiple test versions of search software and
have scholars and other prospective users evaluate it even while
input is going on. If necessary, data structure decisions have to be
reevaluated. Users should have a say in all important software
decisions, and programmers should assist users to evaluate test
versions and to formulate their wishes by telling them about
alternative possibilities.
Author: Urs App
Last updated: 95/04/23