Some collections also include a browsable list of organizations. We use the Dublin Core as a base for metadata, and the document structure is used to define the searchable indexes. When a collection is updated, new material is imported, converting it into GML; old material for which GML files have previously been created is not re-imported. Then the build process is invoked to build the requisite indexes for the collection. Finally, the contents of the building directory are moved into the index directory, and the new version of the collection automatically becomes live.
This procedure may seem cumbersome, but all the steps are necessary for efficient operation with large collections. The import process could be performed on the fly during the building operation—but because building indexes is a multipass operation, the often lengthy importing would be repeated several times. The build process itself can take considerable time—a day or two, for very large collections. Consequently, the results are placed in the building directory so that, if the collection already exists, it continues to be served to users in its old form throughout the building operation.
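The overall cycle can be pictured with a short sketch. This is not Greenstone's actual implementation: the directory names follow the description above, while the Python code and the convert_to_gml and build_indexes placeholders are assumptions made purely for illustration.

```python
import shutil
from pathlib import Path

def convert_to_gml(source: Path, target: Path) -> None:
    # Placeholder for a format plugin: wrap the raw text in minimal markup.
    text = source.read_text(errors="ignore")
    target.write_text("<gml>\n<text>" + text + "</text>\n</gml>\n")

def build_indexes(archives_dir: Path, building_dir: Path) -> None:
    # Placeholder for the real index builder (MG in Greenstone's case);
    # here we just copy the GML files so the example runs end to end.
    building_dir.mkdir(parents=True, exist_ok=True)
    for gml in archives_dir.glob("*.gml"):
        shutil.copy(gml, building_dir / gml.name)

def update_collection(collection: Path) -> None:
    """Illustrative update cycle: import new material, build indexes in a
    scratch area, then switch the live collection over."""
    import_dir = collection / "import"      # new raw material, any format
    archives_dir = collection / "archives"  # GML files produced by importing
    building_dir = collection / "building"  # indexes under construction
    index_dir = collection / "index"        # indexes currently being served

    # Import: convert only material that has no GML file yet;
    # old material is not re-imported.
    archives_dir.mkdir(parents=True, exist_ok=True)
    for source in import_dir.iterdir():
        target = archives_dir / (source.stem + ".gml")
        if not target.exists():
            convert_to_gml(source, target)

    # Build: indexes are created in the building directory, so the old
    # index directory keeps serving users during this possibly lengthy step.
    build_indexes(archives_dir, building_dir)

    # Activate: move the finished indexes into place; the new version
    # of the collection becomes live.
    shutil.rmtree(index_dir, ignore_errors=True)
    building_dir.rename(index_dir)

# update_collection(Path("collections/demo"))
```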
Active users of the collection will not be disturbed when the new version becomes live; they will probably not even notice. The persistent OIDs ensure that interactions remain coherent: users who are examining the results of a query or browse operation will still retrieve the expected documents, and if a search is actually in progress when the change takes place, the program detects the resulting file-structure inconsistency and deals with it automatically.

Figure 4: Browsing titles in the HDL

When a collection is updated, the new raw material is placed in the import directory, in whatever form it is available. For example, the raw material for the HDL is supplied in the form of HTML files: for each book in the library there is a directory that contains a single HTML file representing the book, and separate files containing the associated images. Source material arrives in many different formats, and plugins are required to process each format type; when an existing collection is updated, all the plugins necessary to process its source documents are already specified, since the collection exists and its directory is already set up. Given the transitory nature of the imported data, to date we have not felt the need to use a standardized format such as XML for the imported files; once the collection has been built they are no longer required to serve it, though they would be needed if the collection were rebuilt.

The building step creates the indexes for both searching and browsing. The MG software is generally used to do the searching (Witten et al.). Indexes can be of text, or metadata, or any combination, and the searchable units can be whole documents or individual paragraphs, corresponding to the distinctions that GML makes—the hierarchical structure is flattened for the purposes of creating these indexes. The Humanity Development Library, for example, defines several such indexes in its configuration file, and subdirectories of the index directory are created for each of these indexes.
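To illustrate what flattening the hierarchy for indexing might mean in practice, the following sketch walks a nested document and emits one index unit for the whole document and one per section. The Section class, the field names and the two granularities shown are toy assumptions, not Greenstone's actual data structures.

```python
from dataclasses import dataclass, field

@dataclass
class Section:
    """Toy stand-in for one node of a GML document hierarchy."""
    title: str
    text: str = ""
    children: list["Section"] = field(default_factory=list)

def walk(node: Section):
    """Visit a section and all of its descendants."""
    yield node
    for child in node.children:
        yield from walk(child)

def flatten(doc: Section):
    """Yield index units at two granularities: one entry for the whole
    document and one per section, discarding the nesting."""
    whole_text = " ".join(n.text for n in walk(doc) if n.text)
    yield ("document", doc.title, whole_text)
    for n in walk(doc):
        yield ("section", n.title, n.text)

# A small two-level "book": indexes could be built over whole documents,
# over sections, or over titles alone (a metadata-only index).
book = Section("Example book",
               children=[Section("Preface", "why this book was written"),
                         Section("Chapter 1", "how to prepare the clay")])
for unit in flatten(book):
    print(unit)
```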
Building a new collection from scratch is only slightly different from updating an existing one. The new-collection utility creates the necessary directory structure and a generic configuration file, which can be reused from one collection to another. With suitable data placed in the import directory, the collection can then be imported and built as described above. In fact, the build process operates on whatever appears in the archives directory: if it already contains material in GML form, stored under the original document filenames, that material is built directly, so the import process is optional. The building process also creates a database containing an entry for each document, giving its OID, its internal MG document number, and metadata such as its title; this information is gathered during building and stored in the database.
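A minimal sketch of the kind of record such a database might hold, keyed by OID, is given below. The field names, the OID value and the use of Python's dbm and json modules are assumptions for illustration, not Greenstone's actual schema.

```python
import dbm
import json

def add_document(db_path: str, oid: str, docnum: int, title: str) -> None:
    # One entry per document, keyed by its persistent OID; the value holds
    # the internal search-engine document number and metadata such as title.
    with dbm.open(db_path, "c") as db:
        db[oid] = json.dumps({"docnum": docnum, "Title": title})

def lookup(db_path: str, oid: str) -> dict:
    with dbm.open(db_path, "r") as db:
        return json.loads(db[oid])

add_document("collection-info", "D0042", docnum=42, title="Example document")
print(lookup("collection-info", "D0042"))  # {'docnum': 42, 'Title': 'Example document'}
```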
To enhance the functionality and presentation—something anything but the most trivial collection will require—the configuration file must be edited. For a collection sourced from documents in an already supported data format, and presented in a similar fashion to an existing collection, little editing is needed. GML itself is designed to be fast and easy to parse, an important requirement when millions of documents are to be processed: something as simple as requiring tags to be lower-case, for example, yields a substantial speed-up.
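The parsing-speed point can be made concrete with a small sketch. The tag names and the single-pattern scanner below are assumptions for illustration rather than GML's real definition; the point is simply that rigid, lower-case-only markup lets a parser use one strict regular expression with no case folding.

```python
import re

# Tags are required to be lower-case (and attribute-free in this sketch),
# so one strict pattern suffices: no re.IGNORECASE and no case folding
# while scanning millions of documents.
TAG = re.compile(r"<(/?)([a-z]+)>")

def parse(gml: str):
    """Return (tag, text) pairs for the leaf elements of a GML-like string."""
    out, pos, stack = [], 0, []
    for m in TAG.finditer(gml):
        text = gml[pos:m.start()].strip()
        if text and stack:
            out.append((stack[-1], text))
        if m.group(1):              # closing tag
            stack.pop()
        else:                       # opening tag
            stack.append(m.group(2))
        pos = m.end()
    return out

doc = "<gml><title>Clay preparation</title><section>Dig and clean the clay</section></gml>"
print(parse(doc))  # [('title', 'Clay preparation'), ('section', 'Dig and clean the clay')]
```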
Plugins parse documents, extracting the text and metadata to be indexed. Classifiers control how metadata is brought together to form browsable data structures. Both are specified in an object-oriented framework using inheritance to minimize the amount of code written. A plugin must specify three things: what file formats it can handle, how they should be parsed, and whether the plugin is recursive. File formats are normally determined using regular expression matching on the filename.
For example, the HTML plugin accepts all files whose names end in .htm or .html. During importing, each file is offered to each plugin in order until one is found that can process it; if it can, the plugin parses the file and returns the number of documents processed. Recursive plugins feed the files they encounter back into this process, which is how directory hierarchies are traversed. Importing new data formats, and browsing metadata in ways not currently supported, are handled by writing new plugins and classifiers respectively. Plugins are small modules of code that are easy to write.
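A minimal sketch of such a plugin framework follows. The class and method names are hypothetical; the sketch shows the three responsibilities just listed (a filename pattern, a parsing routine, and optional recursion) together with the dispatch loop that offers each file to each plugin in turn.

```python
import re
from pathlib import Path

class Plugin:
    pattern = re.compile(r"$^")      # filenames this plugin handles (none by default)
    recursive = False                # does it expand into further files?

    def can_process(self, path: Path) -> bool:
        return bool(self.pattern.search(path.name))

    def process(self, path: Path, pipeline) -> int:
        raise NotImplementedError    # parse the file, return documents produced

class HTMLPlugin(Plugin):
    pattern = re.compile(r"\.html?$", re.IGNORECASE)

    def process(self, path: Path, pipeline) -> int:
        if not path.is_file():
            return 0
        text = re.sub(r"<[^>]+>", " ", path.read_text(errors="ignore"))
        print(f"extracted {len(text.split())} words from {path.name}")
        return 1                     # one document produced

class DirectoryPlugin(Plugin):
    pattern = re.compile(r"")        # matches any name; listed last
    recursive = True

    def process(self, path: Path, pipeline) -> int:
        if not path.is_dir():
            return 0
        # Feed every entry back through the pipeline: this is how
        # directory hierarchies are traversed.
        return sum(pipeline(child) for child in path.iterdir())

PLUGINS = [HTMLPlugin(), DirectoryPlugin()]

def import_file(path: Path) -> int:
    """Offer the file to each plugin in order until one can process it."""
    for plugin in PLUGINS:
        if plugin.can_process(path):
            return plugin.process(path, import_file)
    return 0

# import_file(Path("import"))   # would walk the import directory recursively
```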
We monitored the time it took to develop a new one that was different from any we had produced so far. Figure 5b shows simple alterations to the generic configuration file in Figure 5a, which was generated by the new-collection utility.
In Figure 5b, TEXTPlug is replaced with EMAILPlug (line 7), which reads email files and extracts From, To, Date, and Subject metadata from them. The default presentation of search results is overridden (line 17) to display both the title of each message (i.e. Dublin Core Title) and its sender (i.e. Dublin Core Author). The new plugin took under an hour to write and, ignoring blank lines and comments, was about the average length of existing plugins. Classifiers, like plugins, work on GML-format data; for example, any plugin that generates date metadata in accordance with the Dublin Core can request that the collection be made browsable by date. Classifiers are more elaborate than most plugins, but new ones are seldom required. Figure 6 shows a user searching for bookmarked pages about music.
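A plugin of the kind described here can be sketched in a few lines. The header names follow the metadata listed above, while the class name, the accepted filename suffixes and the Dublin Core mapping noted in the comments are assumptions rather than the real EMAILPlug.

```python
import re
from pathlib import Path

class EmailPlugin:
    """Sketch of a plugin that accepts raw email files and pulls out
    From, To, Date and Subject metadata for indexing."""
    pattern = re.compile(r"\.(eml|email|mbx)$", re.IGNORECASE)   # assumed suffixes
    HEADER = re.compile(r"^(From|To|Date|Subject):\s*(.*)$", re.MULTILINE)

    def can_process(self, path: Path) -> bool:
        return bool(self.pattern.search(path.name))

    def process(self, path: Path) -> dict:
        raw = path.read_text(errors="ignore")
        head, _, body = raw.partition("\n\n")        # headers end at the first blank line
        meta = {name: value.strip() for name, value in self.HEADER.findall(head)}
        # Subject would map to Dublin Core Title and From to Creator, so the
        # result display can show each message's title and sender.
        return {"metadata": meta, "text": body}

# EmailPlugin().process(Path("mail/0001.eml")) would yield, for example,
# {'metadata': {'From': '...', 'Subject': '...'}, 'text': '...'}
```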
In the format statement that controls how results are laid out, the built-in term [icon] produces a suitable image to represent the document associated with a particular entry; anything else in the format statement, which in this case is solely table-cell tags in HTML, is passed through to the page being displayed. Classifiers must specify three things: an initialization routine, how individual documents are classified, and how the completed browsing structure is returned. Documents are presented to the classifier one at a time; once all documents have been added, a request is made for the completed data structure. Some classifiers return the data structure directly; others transform it first, for example by dividing an alphabetically sorted list of metadata into ranges. Extensibility, then, is obtained through plugins and classifiers.
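A classifier with that shape might look like the following sketch. The class name and the choice of grouping titles into alphabetic ranges are assumptions used to illustrate the three parts (initialization, per-document classification, and returning the finished structure), not the code of an actual Greenstone classifier.

```python
from collections import defaultdict

class AlphabeticClassifier:
    """Sketch: build a browsable structure of documents grouped by the
    first letter of a chosen metadata element (here, Title)."""

    def __init__(self, metadata: str = "Title", ranges=("A-F", "G-M", "N-Z")):
        # Initialization routine: which metadata to classify on, and how
        # the alphabet is split into browsing ranges.
        self.metadata = metadata
        self.ranges = ranges
        self.buckets = defaultdict(list)

    def classify(self, oid: str, metadata: dict) -> None:
        # Called once per document as it is presented to the classifier.
        value = metadata.get(self.metadata, "")
        letter = value[:1].upper() or "?"
        for r in self.ranges:
            lo, hi = r.split("-")
            if lo <= letter <= hi:
                self.buckets[r].append((value, oid))
                return
        self.buckets["?"].append((value, oid))

    def get_structure(self) -> dict:
        # Once all documents have been added, return the completed, sorted
        # structure that drives the browsing interface.
        return {r: sorted(docs) for r, docs in self.buckets.items()}

clf = AlphabeticClassifier()
clf.classify("D1", {"Title": "Clay preparation"})
clf.classify("D2", {"Title": "Village brickmaking"})
print(clf.get_structure())  # {'A-F': [('Clay preparation', 'D1')], 'N-Z': [...]}
```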
A text version of the page is also available, on which a searching option is provided. Harvest (Bowman et al.) is also a long-running research project. It provides an efficient means of gathering source data from the Internet and distributing indexing information over the Internet. This is accomplished through five components: gatherer, broker, indexer, replicator and cache.
The first three are central to creating, updating and searching a collection; the last two help to improve performance over the Internet through transparent mirroring and caching techniques. The system is configurable and customizable: while searching is most commonly implemented using Glimpse, other search engines can be used, and it is possible to control what type of documents are gathered during creation and updating, and how the query interface looks and is laid out.

Figure 7: Browsing a newspaper collection by date
Sample collections cited by the developers include computer science technical reports and personal home pages, and Harvest is often used to index Web sites. Another widely known piece of digital library software is Dienst (Lagoze and Fielding). The term has come to represent three entities: a conceptual architecture for distributed digital libraries, an open protocol for service requests, and a software implementation. Comparing Greenstone with Dienst and Harvest, there are both similarities and differences.
We hope that this software will encourage the effective deployment of digital libraries to share information and place it in the public domain. Further information can be found in the book How to Build a Digital Library, authored by three of the group's members. The Greenstone project is the seventh recipient of the biennial Namur award, which is given for raising awareness internationally of the social implications of information and communication technologies.
The dissemination of educational, scientific and cultural information throughout the world, and particularly its availability in developing countries, is central to UNESCO's goals as pursued within its intergovernmental Information for All Programme, and appropriate, accessible information and communication technology is seen as an important tool in this context. This paper describes the Greenstone digital library software, a comprehensive, open-source system for the construction and presentation of information collections.
Collections built with Greenstone offer effective full-text searching and metadata-based browsing facilities that are attractive and easy to use. Moreover, they are easily maintainable and can be augmented and rebuilt entirely automatically.