Basic Problem

The main reason for creating a database is to organize the available information into a number of discrete, labelled fields, which can then be searched to provide the information required by the user (Figure 6.3). Human-readable text is not structured according to logical fields and is seen by the computer as a string of characters, which prevents the automation of complex queries. A database, on the other hand, is essentially a collection of separate fields, allowing the accurate specification of complex queries. An example (based on a palaeontological description) of the sort of query facilitated by a database would be to extract the subset of all taxa that had strongly biconvex shells with spines on the exterior surface, that occurred during the Ordovician period and that had been collected or recorded from the USA.

Taxonomic descriptions contain all this information and much more. However, computer-based understanding of natural language does not yet, and may never, allow the automation of such queries over text resources. Querying text using these criteria could locate places in the text where the words 'strongly biconvex', 'spines', 'Ordovician' and 'United States' were cited in close proximity, but it cannot capture the meaningful connections among the terms; even finding synonyms and misspellings of these terms is problematic. Thus, extensive human involvement and expertise would be required for each such query.

FIGURE 6.3 Diagram showing how the components of a taxonomic description have to be partitioned into labeled boxes (fields) when imported into the database.

Simple searches of text strings are in any event inadequate for the purpose because a certain amount of important information is implied rather than directly stated in standard text. The stated stratigraphic range of Cambrian to Devonian, for example, indicates that the taxon had a stratigraphical range that did indeed include the Ordovician (one of the intervening geological periods between the Cambrian and the Devonian). The fact that it is not directly mentioned in the text means that simple queries will miss this taxon when searching for Ordovician. Databases can handle this by having separate fields for the start and end of a stratigraphic range (e.g., filled by Cambrian and Devonian, respectively) and by using a look-up table to allow a complex search that will successfully extract taxa whose stratigraphic range includes the Ordovician, even though the term does not appear in the original description. This still leaves the problem of how to get the information into the fields of the database.
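The contrast between a string search and a range search over separate start and end fields can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the taxon names and records are hypothetical, the ordered list of periods stands in for the look-up table, and only a handful of Palaeozoic periods are included.

```python
# Geological periods in stratigraphic order, oldest first. This ordered list
# plays the role of the look-up table; only a few periods are shown.
PERIODS = ["Cambrian", "Ordovician", "Silurian", "Devonian", "Carboniferous", "Permian"]
RANK = {name: i for i, name in enumerate(PERIODS)}

# Hypothetical records: the free-text range plus separate start/end fields.
TAXA = [
    {"name": "Taxon A", "text": "Range: Cambrian to Devonian.",
     "start": "Cambrian", "end": "Devonian"},
    {"name": "Taxon B", "text": "Range: Ordovician.",
     "start": "Ordovician", "end": "Ordovician"},
    {"name": "Taxon C", "text": "Range: Silurian to Permian.",
     "start": "Silurian", "end": "Permian"},
]


def text_search(taxa, word):
    """Naive string search: finds only records whose text mentions the word."""
    return [t["name"] for t in taxa if word in t["text"]]


def range_search(taxa, period):
    """Structured search: finds records whose range spans the given period."""
    p = RANK[period]
    return [t["name"] for t in taxa if RANK[t["start"]] <= p <= RANK[t["end"]]]


by_text = text_search(TAXA, "Ordovician")    # misses Taxon A
by_range = range_search(TAXA, "Ordovician")  # includes Taxon A
print(by_text, by_range)
```

The string search returns only the record that names the Ordovician explicitly, whereas the range search also returns the taxon whose stated range merely brackets it.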

6.4.1 The Spectrum from Nonstructured to Structured Electronic Data

In computing terms, the fundamental difference between normal text and databases is described in terms of structure. Normal text is unstructured, while the labelled fields of a database represent terms that have a high degree of structure. Using the preceding example, by transforming the unstructured text of taxonomic descriptions into the structured terms of a database, it would be possible to successfully carry out a complex search of the latter that would return, in a single document, all the taxa from a particular superfamily that had spines and were found in the Ordovician of China. The question is whether or not it is possible to achieve the transformation from unstructured to highly structured text without having to depend on having the text retyped by a human operator.

Having highly accessible electronic data resources is the key for the future health of any big science and has been explicitly identified as the crucial future direction for systematics. However, the overall design of such resources requires careful thought because large digital data collections may be structured in a range of different ways. It is important to clearly identify which method is most suitable for a particular purpose and, indeed, which methods are feasible for particular types of information.

Recent developments in computing science have provided a much wider range of methods of obtaining structured digital data. In this context, it is useful to think in terms of a spectrum of data structuring, which ranges from totally unstructured data at one end, up to a highly structured database at the other. The two end points are well recognized amongst the users and creators of taxonomic literature and biodiversity databases; a single large text resource, such as a taxonomic monograph, is an example of an unstructured digital resource, and the database is a good example of a highly structured digital resource. However, there are various possible hybrids along this line also, one of which in particular will be advocated in this chapter as the most suitable for important collections of taxonomic data. There is great value for systematics in exploiting this spectrum of approaches, especially because it holds out the possibility of greatly speeding up the acquisition of a digital data resource for the subject.

The unstructured text resource is the simplest to acquire because text is the traditional format for all current and historical publication. Modern publications typically exist in electronic format already, provided that this is made available by its owners; it is also a relatively straightforward task to create electronic text resources for archival and indeed ancient material, owing to recent significant advances in document scanning and OCR technology (Figure 6.4). Together, these techniques make it eminently feasible for large volumes of archive text to be brought into the electronic domain at relatively small human cost.

Nor should unstructured text be dismissed as an information format; advances in automated information retrieval and information extraction, spurred by the need to improve the efficacy of Internet search engines, mean that pertinent information can easily and quickly be found in a very large document resource, even allowing for details such as differences in the textual form of terms or the use of synonyms. As any user of an Internet search engine will know, these techniques give excellent performance in precision and recall (terms meaning, roughly, specificity and sensitivity, respectively), even over a document collection on the order of 10^9 resources. However, in terms of extracting rigorously correct answers to precisely defined queries, the probabilistic essence of the approach leaves much to be desired; furthermore, the result of an information retrieval engine will be a set of resources requiring inspection by a human, rather than one that could be used as the input to further automated query processing. As mentioned previously, queries using text strings alone will not satisfy the majority of questions that users of systematics resources require from digital resources, but they may offer a useful facility for certain purposes, such as initial investigations of the data.
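For readers unfamiliar with the two measures, the following minimal Python sketch computes them under their standard definitions: precision is the fraction of retrieved resources that are relevant, and recall is the fraction of relevant resources that were retrieved. The document identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: four documents returned, three truly relevant overall.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(p, r)
```

Here two of the four returned documents are relevant, and two of the three relevant documents were found; neither figure guarantees that any individual answer is correct, which is the limitation noted above.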

At the other end of the spectrum is the database, which contains a highly regular subset of all the available information. Databases have the significant advantage of providing a framework where rigorously defined automated queries can be posed, and very high confidence may be vested in the answers. The reason for this high degree of confidence lies in the database schema, which is an a priori data description designed for the particular set of tasks in hand. This schema is used to ensure that queries are sound; that is, that they strongly correspond with the description of the data used. For example, a question of mean age may only be posed if every record has an age field and each one of these is a correct numeric value. The presence of the schema implies that the property of soundness may be determined irrespective of the actual data collection, by reference to the schema alone. If a query is not sound, the programmer will be immediately alerted and the query will not be allowed to execute.
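The idea that soundness can be decided from the schema alone, before any data exist, can be illustrated with Python's built-in sqlite3 module. This is a sketch rather than a description of any particular system: the table and column names are hypothetical, and SQLite's column typing is looser than that of many database systems, so only the check on field names is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The schema is declared a priori; the table is deliberately left empty.
conn.execute(
    "CREATE TABLE specimen ("
    " id INTEGER PRIMARY KEY,"
    " taxon TEXT NOT NULL,"
    " age REAL NOT NULL)"
)


def is_sound(connection, query):
    """Check a query against the schema without running it on any data.

    EXPLAIN makes SQLite compile the statement and return its plan; a
    reference to a field that is not in the schema fails at that stage.
    """
    try:
        connection.execute("EXPLAIN " + query)
        return True
    except sqlite3.OperationalError:
        return False


mean_age_ok = is_sound(conn, "SELECT AVG(age) FROM specimen")
mean_weight_ok = is_sound(conn, "SELECT AVG(weight) FROM specimen")
print(mean_age_ok, mean_weight_ok)
```

The second query is rejected because the schema has no weight field, even though the table contains no records at all.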

(a)

2. LIOTHYRIS ARCTICA, Friele, sp. (Plate I. figs. 17, 18.)

Terebratula arctica, Friele, Særskilt Aftryk af Nyt Magazin for Naturvidenskaberne, pl. i. fig. i., 1877.

Shell small, globose, broadly ovate, rather longer than wide. Valves smooth, glassy, semitransparent, whitish; dorsal valve convex, squarely circular, without fold or sinus; ventral valve very convex and deep; beak unusually short, slightly incurved and truncated by a very small foramen margined anteriorly by rudimentary deltidial plates; loop very small and simple. Length 7, breadth 6, depth 4 lines.

Hab. Dredged by Herman Friele some few miles south-west of Jan Mayen, in 263 fathoms depth. Shell abundant, but so brittle that most of the specimens were broken during the dredging-operation.

Obs. After having carefully compared a specimen of the shell under description, sent to me by Friele, with others of the var. minor to which it had been referred by Dr. Jeffreys, I could, as Friele had previously done, discover several differences which, although not very great, have induced me to follow its discoverer in considering it a distinct species. L. arctica is much more globose and squarely rounded than L. minor, which is more of an elongated oval. As stated by Friele, its form approaches most to L. minor of Philippi, but the deviation is shown in the shorter beak and by the position of the foramen, which, in L. arctica, is placed directly above the dorsal valve, the deltidium being almost hidden. The loop in L. arctica is very much weaker and thinner, and the crura processes are placed further apart than in L. minor. It is the first representative of the genus Liothyris that has been hitherto found in Arctic seas.

(b)

2. LIOTRYRIS ARCTICA, Friele, sp. (Plate I. figs. 17, 18.)

Terebratula arctica, Friele, Srerskilt Aftryk af Nyt Magazin for Naturvidenskaberne, pI. i. fig. i., 1877.

Shell small, globose, broadly ovate, rather longer than wide. Valves smooth, glassy, semitransparent, whitish; dorsal valve convex, squarely circular, without fold or sinus; ventral valve very convex and deep; beak unusually short, slightly incurved and truncated by a very small foramen margined anteriorly by rudimentary deltidial plates; loop very small and simple. Length 7, breadth 6, depth 4 lines.

Hab. Dredged by Herman Friele some few miles south-west of Jan Mayen, in fathoms depth. Shell abundant, but so brittle that most of the specimens were during the dredging-operation.

Obs. After having carefulIy compared a specimen of the shell under description, sent to me by Friele, with others of the var. minor to which it had been referred by Dr. Jeffreys, I could, as Friele had previously done, discover several differences which, although not very great, which have induced me to follow its discoverer in considering it a distinct species. L. arctica is much more globose and squarely rounded than L. minor, which is more of an elongated oval. As stated by Friele, its form approaches most to L. minor of Philippi, but the deviation is shown in the shorter beak and by the position of the foramen, which, in L. arctica is placed directly above the dorsal valve, the deltidium being almost hidden. The loop in L. arctica is very much weaker and thinner, and the crura processes are placed further apart than in L. minor. It is the first representative of the genus Liothyris that has been hitherto found in Arctic seas.

FIGURE 6.4 OCR-scanned text from taxonomic description in Davidson, 1886. (a) = original text; (b) = OCR processed text from original manuscript. (Davidson, T. D. Transactions of the Linnean Society of London, 4, Linnean Society, 1887)

Furthermore, the database is also populated with information according to the structure of this schema, giving an enforced quality filter at data entry time. If, for example, the field entered does not correspond to the expected value range or if a field fails to be entered through oversight, a notification will immediately occur and the error must be rectified at that point. Finally, database constraints may be defined so that any inconsistent changes to the database will cause an alert. All of these valuable properties are possible due to the presence of the database schema; this definition is the first requirement for any new database.
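The quality filter at data entry time can be sketched with sqlite3 constraints. The table, the field names and the permitted value range below are all hypothetical, chosen only to show an out-of-range value and a missing field being refused at the moment of entry.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE specimen (
        id INTEGER PRIMARY KEY,
        taxon TEXT NOT NULL,
        length_mm REAL NOT NULL CHECK (length_mm > 0 AND length_mm < 1000)
    )
    """
)


def try_insert(connection, taxon, length_mm):
    """Attempt one insert; report whether the schema accepted the record."""
    try:
        with connection:  # commits on success, rolls back on error
            connection.execute(
                "INSERT INTO specimen (taxon, length_mm) VALUES (?, ?)",
                (taxon, length_mm),
            )
        return "accepted"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"


ok = try_insert(conn, "Taxon A", 12.5)     # within the expected range
bad_range = try_insert(conn, "Taxon B", -3)  # violates the CHECK constraint
missing = try_insert(conn, None, 5.0)      # violates NOT NULL
stored = conn.execute("SELECT COUNT(*) FROM specimen").fetchone()[0]
print(ok, bad_range, missing, stored)
```

Only the first record is stored; the other two raise an error immediately, so the problem has to be corrected before the data can enter the collection.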

These are the good points of the database; the single worst aspect, however, is that the strong reliance upon the definition of a precisely defined schema necessarily allows only a subset of all the available information to be stored. This is acceptable for many purposes, notably when a fixed set of queries is required over information that is inherently regularly structured; however, in a field where data contain many inherent irregularities, the partial nature of the schema always causes problems with the loss of information. This aspect is aggravated by the fact that the schema, once in place, is extremely difficult to change, precluding smooth evolution as the demands of a system's users evolve over time.

A hybrid data model, known now as the semistructured model, appeared in the database research literature in the late 1990s. Originally proposed as an interchange format (i.e., a partial solution for sharing information between different databases), the realization of the value of its use as a data model in its own right quickly made it a mainstream research topic. The inherent value of semistructured data is that they are self-describing; that is, they contain structure, but this structure is an integral part of the data collection, rather than appearing as a separate structural definition, as in a database schema. Semistructured data are technically defined as an edge-labeled directed graph and are often portrayed using data diagrams such as that shown in Figure 6.5.
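A minimal sketch of an edge-labeled directed graph, written in Python, may help to make the notion of self-describing data concrete. The node identifiers, labels and values are hypothetical; the point is that the labels travel with the data and can be discovered from the collection itself, and that records need not share the same fields.

```python
# Each edge is (source node, label, target). Leaf values are stored as targets.
EDGES = [
    ("root", "taxon", "n1"),
    ("n1", "name", "Taxon A"),
    ("n1", "range", "n2"),
    ("n2", "start", "Cambrian"),
    ("n2", "end", "Devonian"),
    ("root", "taxon", "n3"),
    ("n3", "name", "Taxon B"),
    ("n3", "habitat", "dredged"),  # a field that Taxon A does not have
]


def follow(edges, node, path):
    """Return every target reached from node by following the labels in path."""
    frontier = [node]
    for label in path:
        frontier = [dst for src, lab, dst in edges
                    if src in frontier and lab == label]
    return frontier


names = follow(EDGES, "root", ["taxon", "name"])
starts = follow(EDGES, "root", ["taxon", "range", "start"])
labels = sorted({lab for _, lab, _ in EDGES})  # structure read from the data
print(names, starts, labels)
```

No separate schema is consulted at any point: the set of labels is recovered from the edges, and the second taxon simply contributes nothing to the query on stratigraphic range.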

The significance of the correspondence between these diagrams and those in common use by taxonomists will not be missed; one of the key advantages of this data model in the present context is that the user population is well used to thinking in terms of this as a natural data model, rather than in terms of relations that have a rather artificial mapping.

This research, details of which are well beyond the scope of this chapter, occurred by historical coincidence around the same time that XML was emerging as a replacement standard for HTML, with the primary purpose of separating the concepts of information and presentation within Internet resources. The XML standard, while containing much historical 'junk' from this origin, also happens to provide a public, open standard encoding for semistructured data.

Semistructured data are sometimes referred to as a potential replacement for databases; this is a mistake. They are purely a compromise model, giving a useful hybrid between free text and databases. While semistructured resources can have data descriptions (the current standards being DTD and XML Schema), these are very different concepts from database schemata: they are not necessarily defined a priori, queries are not necessarily checked for soundness with respect to them and there is no framework for enforcing data to correspond to the descriptions. While these are all ongoing research issues, the essential case is that semistructured data provide more flexibility but less safety than traditional databases.
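The trade-off between flexibility and safety can be seen in a short sketch using Python's standard xml.etree.ElementTree module. The element names and content are hypothetical. Two records with different fields sit side by side without complaint, and a query that refers to a field no record contains is not refused; it simply returns nothing.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML encoding; the two records do not share the same fields.
DOC = """
<taxa>
  <taxon>
    <name>Taxon A</name>
    <range><start>Cambrian</start><end>Devonian</end></range>
  </taxon>
  <taxon>
    <name>Taxon B</name>
    <habitat>dredged from deep water</habitat>
  </taxon>
</taxa>
"""

root = ET.fromstring(DOC)

# Flexibility: irregular records are stored and queried together.
names = [t.findtext("name") for t in root.findall("taxon")]
starts = [t.findtext("range/start") for t in root.findall("taxon")]

# Less safety: a query over a field that does not exist raises no error.
colours = root.findall("taxon/colour")
print(names, starts, colours)
```

In a database the last query would be rejected against the schema; here the absence of an answer cannot be distinguished from a mistaken question without further checks.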

We should stress again that the semistructured data model should be viewed as a hybrid, rather than as a new model to replace existing ones. In terms of the example queries given before, using a semistructured paradigm is quite good for all of them. It is always better than either the structured or the unstructured model at the tasks for which that model is poor; it is never as good as either at the tasks for which that model is good.

FIGURE 6.5 Tree picture of data.

In this context, the single greatest advantage we perceive is the ability to avoid any data loss. Furthermore, the flexibility of the format is enshrined in the ability to move in either direction from it; for those sets of tasks that do require the rigor of databases, the creation of those databases is made substantially easier by having the semistructured, rather than unstructured, information as its starting point. In the other direction, when free text is the requirement (as, for example, when people just need to read species descriptions as from a traditional treatise), this is just one of the possible views that may be provided by a semistructured collection.
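Both directions can be sketched briefly in Python. The records and field names are hypothetical, and nested dictionaries stand in for a semistructured collection: one function projects the records onto fixed columns, as a database load would, and the other renders a record as readable text.

```python
# Hypothetical semistructured records with differing fields.
RECORDS = [
    {"name": "Taxon A", "shell": "strongly biconvex",
     "range": {"start": "Cambrian", "end": "Devonian"}},
    {"name": "Taxon B", "shell": "globose",
     "habitat": "dredged from deep water"},
]


def to_rows(records, columns):
    """Database direction: keep only the chosen columns; other fields are dropped."""
    return [tuple(rec.get(col) for col in columns) for rec in records]


def to_text(rec, indent=0):
    """Free-text direction: render every field of a record, nested ones indented."""
    lines = []
    for key, value in rec.items():
        if isinstance(value, dict):
            lines.append(" " * indent + f"{key}:")
            lines.append(to_text(value, indent + 2))
        else:
            lines.append(" " * indent + f"{key}: {value}")
    return "\n".join(lines)


rows = to_rows(RECORDS, ["name", "shell"])
view = to_text(RECORDS[1])
print(rows)
print(view)
```

The projection onto columns is where information is lost, which is why the semistructured collection, rather than the derived table, is the safer master copy.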

Our novel observation in this context is that semistructured data may be gleaned, via a largely automated approach, from printed textual archives. This is due to the following:

• Ancient archival material may be successfully subjected to scanning and subsequent OCR to create an electronic text archive.

• This text archive is highly structured due to the use of rigid conventions that have developed within each taxonomic subdomain, and the inherent structure within the document can be gleaned via a further automated analysis of the scanned text.

• The quality of structure that can be gained through these almost totally automated processes is sufficient to perform a large class of automated query over the meaning, rather than just the textual form, of the information.
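The second of these points can be illustrated with a small Python sketch that exploits two conventions visible in Figure 6.4: paragraphs introduced by the markers "Hab." and "Obs.", and a closing line of measurements. The input is an abridged excerpt of that figure, and the field names and the parsing rules are assumptions made for illustration, not a description of any production system.

```python
import re

# Abridged excerpt of the OCR text in Figure 6.4.
OCR_TEXT = """Shell small, globose, broadly ovate, rather longer than wide. Length 7, breadth 6, depth 4 lines.

Hab. Dredged by Herman Friele some few miles south-west of Jan Mayen.

Obs. L. arctica is much more globose and squarely rounded than L. minor."""

# Conventional paragraph markers mapped to hypothetical field names.
SECTION_LABELS = {"Hab.": "habitat", "Obs.": "observations"}


def structure(text):
    """Split a description into labelled fields using its layout conventions."""
    record = {}
    for para in text.split("\n\n"):
        para = para.strip()
        head = para.split(" ", 1)[0]
        if head in SECTION_LABELS:
            record[SECTION_LABELS[head]] = para[len(head):].strip()
        else:
            record["description"] = para
    # The measurement convention: "Length L, breadth B, depth D lines".
    match = re.search(r"Length (\d+), breadth (\d+), depth (\d+) lines",
                      record.get("description", ""))
    if match:
        record["dimensions_lines"] = dict(
            zip(("length", "breadth", "depth"), map(int, match.groups()))
        )
    return record


record = structure(OCR_TEXT)
print(record["dimensions_lines"], record["habitat"])
```

The resulting record keeps the full text of each section while exposing the measurements as numbers, so nothing is discarded and queries over the values become possible.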
