Querying XML-Tagged Text

The once concise taxonomic description has been transformed by the XML parser into a much longer and more unwieldy document, but the important point for the automation of digital capture is that the tags allow complex searches to be carried out of the type

<GENUS confirmed="true">
  <NAME>Gloriana</NAME>
  <TYPE>[*Cryptiana superbia Nelson, 1856, p.413]</TYPE>
  <SYNONYM>[= Paradesia SMITH, 1871, p. 54 (type, P. excella JONES, 1869b, pl. 6)]</SYNONYM>
  <SYNONYM>[= Glorana DAVIES, 1906, p.18]</SYNONYM>
  <DESCRIPTION>
    <DETAILS>Shell medium to small, subcircular to elongate oval outline, strongly biconvex</DETAILS>
    <DETAILS>Beak erect to sub-erect</DETAILS>
    <DETAILS>delthyrium with conjunct delthyrial plates</DETAILS>
    <DETAILS>Anterior commisure rectimarginate</DETAILS>
    <DETAILS>Exterior with faint radial ribs, and concentric growth lamellae, each bearing small, regularly distributed, erect spines</DETAILS>
    <DETAILS>Dental plates absent</DETAILS>
    <DETAILS>cardinal process long, thin, bifurcating anteriorly</DETAILS>
  </DESCRIPTION>
  <STRATIGRAPHIC>
    <STRATIGRAPHICRANGE>
      <START confirmed="false">Lower Devonian</START>
      <DETAILPERIOD confirmed="false">Emsian</DETAILPERIOD>
      <START confirmed="true">Upper Devonian</START>
      <DETAILPERIOD confirmed="true">Lochkovian</DETAILPERIOD>
      <END confirmed="true">Upper Carboniferous</END>
      <DETAILPERIOD confirmed="true">Cantabrian</DETAILPERIOD>
    </STRATIGRAPHICRANGE>
  </STRATIGRAPHIC>
  <PLACES_AND_POSSIBLE_PERIODS>
    <GEOSTRATSETS>
      <PLACE confirmed="true">Australia</PLACE>
      <STRATIGRAPHICPERIOD confirmed="false">Emsian</STRATIGRAPHICPERIOD>
    </GEOSTRATSETS>
    <GEOSTRATSETS>
      <PLACE confirmed="false">England</PLACE>
      <PLACE confirmed="true">Belgium</PLACE>
      <PLACE confirmed="true">Germany</PLACE>
      <PLACE confirmed="true">Czechoslovakia</PLACE>
      <STRATIGRAPHICPERIOD>
        <START confirmed="true">Eifelian</START>
        <END confirmed="true">Givetian</END>
      </STRATIGRAPHICPERIOD>
    </GEOSTRATSETS>
    <GEOSTRATSETS>
      <PLACE confirmed="true">England</PLACE>
      <PLACE confirmed="true">Ireland</PLACE>
      <PLACE confirmed="true">Germany</PLACE>
      <STRATIGRAPHICPERIOD>
        <START confirmed="true">Frasnian</START>
        <END confirmed="true">Cantabrian</END>
      </STRATIGRAPHICPERIOD>
    </GEOSTRATSETS>
  </PLACES_AND_POSSIBLE_PERIODS>
</GENUS>

FIGURE 6.6 XML-parsed taxonomic description.

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:template match="/">
    <HTML>
      <HEAD><TITLE>Brachiopod view</TITLE></HEAD>
      <BODY>
        <P>There are <xsl:value-of select="count( FAMILY/SUBFAMILY/GENUS )"/> genera</P>
        <P><xsl:value-of select="count( FAMILY/SUBFAMILY/GENUS[ contains( DESCRIPTION,'spine') ] )"/> of these contain spines.</P>
        <TABLE BORDER="2" BGCOLOR="yellow">
          <TR BGCOLOR="orange"> <TH>Genus name</TH> <TH>Details</TH> </TR>
          <xsl:for-each select="FAMILY/SUBFAMILY/GENUS[ contains( DESCRIPTION,'spine') ]">
            <xsl:sort select="NAME"/>
            <TR>
              <TD><xsl:value-of select="NAME"/></TD>
              <TD>
                <xsl:for-each select="DESCRIPTION/DETAILS">
                  <xsl:value-of select="."/>;
                </xsl:for-each>
              </TD>
            </TR>
          </xsl:for-each>
        </TABLE>
      </BODY>
    </HTML>
  </xsl:template>
</xsl:stylesheet>

FIGURE 6.7 XSL query run over XML-tagged text shown in Figure 6.6.

that would be impossible with the text in its original state. Over the last year, a number of computational standards have been implemented that make it much easier to run such queries over XML-coded data. Figure 6.7 shows a simple query constructed as an XSL document. When run across an XML-coded document generated from taxonomic descriptions, this query will extract a list of all those taxa in which the morphological descriptions include the term 'spine', sort them by name, and present the result as a document in a WWW browser. More complex queries would be relatively simple to program.
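The same kind of query can also be sketched outside XSL. The following is a minimal illustration in Python using only the standard library; it is not the authors' implementation. The element names follow Figures 6.6 and 6.7, while the sample genera other than Gloriana, the function name `genera_mentioning` and the shortened descriptions are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: element names follow Figures 6.6 and 6.7,
# but the genera other than Gloriana are invented for illustration.
SAMPLE = """
<FAMILY><SUBFAMILY>
  <GENUS><NAME>Gloriana</NAME>
    <DESCRIPTION><DETAILS>Beak erect to sub-erect</DETAILS>
      <DETAILS>Exterior with small, erect spines</DETAILS></DESCRIPTION></GENUS>
  <GENUS><NAME>Exampla</NAME>
    <DESCRIPTION><DETAILS>Dental plates absent</DETAILS></DESCRIPTION></GENUS>
  <GENUS><NAME>Alpha</NAME>
    <DESCRIPTION><DETAILS>Shell with long spines</DETAILS></DESCRIPTION></GENUS>
</SUBFAMILY></FAMILY>
"""


def genera_mentioning(root, term):
    """Return (name, details) pairs for genera whose description contains term."""
    hits = []
    for genus in root.iterfind("SUBFAMILY/GENUS"):
        description = genus.find("DESCRIPTION")
        if description is None:
            continue
        # itertext() gathers the text of all nested DETAILS elements,
        # mirroring contains(DESCRIPTION, 'spine') in the XSL query.
        if term in "".join(description.itertext()):
            details = [d.text.strip() for d in description.iterfind("DETAILS") if d.text]
            hits.append((genus.findtext("NAME"), details))
    # Sort by genus name, as <xsl:sort select="NAME"/> does.
    return sorted(hits, key=lambda pair: pair[0])


root = ET.fromstring(SAMPLE)  # the root element is FAMILY
matches = genera_mentioning(root, "spine")
for name, details in matches:
    print(name, "-", "; ".join(details))
```

As in the XSL version, the match is a plain substring test, so 'spine' also matches 'spines'.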

The XML-tagged text allows complex queries of the kind possible with highly structured databases, but in this case the problems of manually entering the data into the database have been overcome by the automatic parsing of the original text. For the example given in Figure 6.3, it would be possible to run queries that involved any combination of the following:


<DESCRIPTION>
  <SHELL>Shell medium to small, subcircular to elongate oval outline, strongly biconvex</SHELL>
  <BEAK>Beak erect to sub-erect</BEAK>
  <DELTHYRIUM>delthyrium with conjunct delthyrial plates</DELTHYRIUM>
  <ANTERIOR_COMMISURE>Anterior commisure rectimarginate</ANTERIOR_COMMISURE>
  <EXTERIOR>Exterior with faint radial ribs, and concentric growth lamellae, each bearing small, regularly distributed, erect spines</EXTERIOR>
  <DENTAL_PLATES>Dental plates absent</DENTAL_PLATES>
  <CARDINAL_PROCESS>cardinal process long, thin, bifurcating anteriorly</CARDINAL_PROCESS>
</DESCRIPTION>

FIGURE 6.8 Showing how XML tags could be recoded to provide additional query options within morphological descriptions.

• date of description;

• journal or monograph;

• morphological feature or descriptive term applied to that feature;

• geological range; and

• combinations of geological age and biogeographical location.
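As an illustration of the last item in the list, the sketch below pairs each place with the stratigraphic period recorded in the same GEOSTRATSETS element. It is a minimal Python example under stated assumptions: the fragment is a shortened, hypothetical version of the structure in Figure 6.6, and the function name `occurrences` and its `confirmed_only` parameter are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment using the element names of Figure 6.6.
SAMPLE = """
<GENUS confirmed="true"><NAME>Gloriana</NAME>
  <PLACES_AND_POSSIBLE_PERIODS>
    <GEOSTRATSETS>
      <PLACE confirmed="true">Australia</PLACE>
      <STRATIGRAPHICPERIOD confirmed="false">Emsian</STRATIGRAPHICPERIOD>
    </GEOSTRATSETS>
    <GEOSTRATSETS>
      <PLACE confirmed="false">England</PLACE>
      <PLACE confirmed="true">Belgium</PLACE>
      <STRATIGRAPHICPERIOD>
        <START confirmed="true">Eifelian</START>
        <END confirmed="true">Givetian</END>
      </STRATIGRAPHICPERIOD>
    </GEOSTRATSETS>
  </PLACES_AND_POSSIBLE_PERIODS>
</GENUS>
"""


def occurrences(genus, confirmed_only=True):
    """Yield (place, period text) pairs, one per PLACE in each GEOSTRATSETS."""
    for geoset in genus.iter("GEOSTRATSETS"):
        period = geoset.find("STRATIGRAPHICPERIOD")
        # The period may be a single name or START/END children;
        # collect all of its text and normalise the whitespace.
        stages = ""
        if period is not None:
            stages = " ".join("".join(period.itertext()).split())
        for place in geoset.iterfind("PLACE"):
            if confirmed_only and place.get("confirmed") != "true":
                continue  # skip places the parser marked as unconfirmed
            yield place.text, stages


genus = ET.fromstring(SAMPLE)
found = list(occurrences(genus))
for place, stages in found:
    print(place, ":", stages)
```

Filtering on the `confirmed` attribute is one possible design choice; passing `confirmed_only=False` returns the tentative records as well.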

Many advanced analytical tools developed for the analysis of databases could also be applied if the tagged data were transferred to a suitable database. It would also be possible to name each of the separately tagged parts of the morphological descriptions to create, within an XML document, the equivalent of separate fields in databases. For example, using the first word or the first two words of each phrase in the hypothetical species description parsed in Figure 6.3 would generate the fields found in Figure 6.8 within the description.
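The renaming step might be sketched as follows. This is a minimal Python illustration, not the procedure used by the authors: the helper names `field_name` and `rename_details` are hypothetical, and the set of two-word feature names stands in for whatever rule or glossary decides when two words are needed. Words are joined with underscores because XML element names cannot contain spaces.

```python
import re
import xml.etree.ElementTree as ET

# Shortened, hypothetical description using the DETAILS tags of Figure 6.6.
SAMPLE = """
<DESCRIPTION>
  <DETAILS>Beak erect to sub-erect</DETAILS>
  <DETAILS>Dental plates absent</DETAILS>
  <DETAILS>cardinal process long, thin, bifurcating anteriorly</DETAILS>
</DESCRIPTION>
"""

# Assumed list of features whose names need two words.
TWO_WORD_FEATURES = {"DENTAL_PLATES", "CARDINAL_PROCESS"}


def field_name(phrase, words=1):
    """Build an upper-case tag name from the first word(s) of a phrase."""
    tokens = re.findall(r"[A-Za-z]+", phrase)[:words]
    # Fall back to the generic tag if the phrase contains no letters.
    return "_".join(tokens).upper() or "DETAILS"


def rename_details(description, two_word_features=TWO_WORD_FEATURES):
    """Rename each DETAILS element after the feature it describes."""
    for detail in list(description.iterfind("DETAILS")):
        text = detail.text or ""
        name = field_name(text, words=2)
        if name not in two_word_features:
            name = field_name(text, words=1)
        detail.tag = name  # the text content is left untouched


description = ET.fromstring(SAMPLE)
rename_details(description)
print([child.tag for child in description])
```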

This process would not always produce meaningful headings, most obviously when the name of the feature did not appear at the beginning of a phrase. However, it would allow more sophisticated queries to be run across the data and would be helpful if the ultimate aim was to use XML parsing as a step in populating biodiversity databases. During the renaming of the subsections of the description, interaction with an experienced taxonomist would allow useful names to be applied, perhaps from a glossary of appropriate terminology. An important aspect of the compilation of the brachiopod treatise was the circulation of a glossary that was agreed upon by all participants prior to the initiation of detailed taxonomic work (Curry, Connor and Simeoni 2001; Williams 2001). However, the flexibility of the procedure described means that rigid adherence to an established glossary is not essential; unusual terms will be retained whenever they are used and synonyms may be defined later.
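One way to picture synonyms being defined after parsing is to leave the tagged text untouched and expand the search term at query time. The sketch below is a hedged Python illustration only: the synonym table, its entries, the sample genera other than Gloriana and the function name `genera_with_term` are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical synonym table, defined after parsing; entries are illustrative.
SYNONYMS = {"spine": {"spine", "spinule"}}

# Hypothetical sample data using the element names of Figure 6.6.
SAMPLE = """
<SUBFAMILY>
  <GENUS><NAME>Gloriana</NAME>
    <DESCRIPTION><DETAILS>erect spines</DETAILS></DESCRIPTION></GENUS>
  <GENUS><NAME>Exampla</NAME>
    <DESCRIPTION><DETAILS>fine spinules</DETAILS></DESCRIPTION></GENUS>
  <GENUS><NAME>Alpha</NAME>
    <DESCRIPTION><DETAILS>smooth exterior</DETAILS></DESCRIPTION></GENUS>
</SUBFAMILY>
"""


def genera_with_term(root, term, synonyms=SYNONYMS):
    """Return names of genera whose description uses term or a listed synonym."""
    variants = synonyms.get(term, {term})
    names = []
    for genus in root.iterfind("GENUS"):
        description = genus.find("DESCRIPTION")
        if description is None:
            continue
        # The original wording is only read, never rewritten.
        text = "".join(description.itertext()).lower()
        if any(variant in text for variant in variants):
            names.append(genus.findtext("NAME"))
    return names


root = ET.fromstring(SAMPLE)
with_synonyms = genera_with_term(root, "spine")
without_synonyms = genera_with_term(root, "spine", synonyms={})
print(with_synonyms, without_synonyms)
```

Because the table sits outside the tagged document, it can be revised without reparsing any description.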

6.7.1 Advantage of Using XML Tagging to Extract Taxonomic Data

The major advantages of this approach over constructing a database are speed, accuracy, flexibility and ease of updating and revision. Several hundred samples can be processed in a short time by even a modestly powerful personal computer. XML is an extremely flexible, hierarchical language, and the parser can readily be tailored to deal with variations in the style of taxonomic writing. Updating is not a serious problem because new or revised taxonomic descriptions can simply be parsed and the resulting XML document appended to, or used to replace parts of, the existing resource. The text used is exactly that written down in the monograph and has not been modified or synthesized in any way, as happens when human operators are required to extract and enter the information into database fields. This means that information is never lost; instead, interpretive views, which may be changed, are layered over it. XML is a similar language to the HTML used to construct WWW sites, and all modern browsers display XML-tagged text and the products of XSL-programmed queries, which means that the original text is always one such readily available view.
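The append-or-replace step described above can be sketched in a few lines. This is a minimal Python illustration under assumptions, not a description of the authors' software: the function name `upsert_genus` is hypothetical, genera are matched on the text of their NAME element, and the second genus is invented for the example.

```python
import xml.etree.ElementTree as ET


def upsert_genus(subfamily, revised):
    """Replace the GENUS with the same NAME, or append it if it is new."""
    name = revised.findtext("NAME")
    for index, genus in enumerate(list(subfamily)):
        if genus.tag == "GENUS" and genus.findtext("NAME") == name:
            subfamily[index] = revised  # replace in place, keeping document order
            return "replaced"
    subfamily.append(revised)
    return "appended"


# Existing resource with one parsed description (hypothetical content).
resource = ET.fromstring(
    "<SUBFAMILY><GENUS><NAME>Gloriana</NAME>"
    "<DESCRIPTION><DETAILS>Dental plates absent</DETAILS></DESCRIPTION>"
    "</GENUS></SUBFAMILY>"
)

# A revised description of the same genus and a newly parsed genus.
revised = ET.fromstring(
    "<GENUS><NAME>Gloriana</NAME>"
    "<DESCRIPTION><DETAILS>Dental plates present</DETAILS></DESCRIPTION></GENUS>"
)
new = ET.fromstring("<GENUS><NAME>Exampla</NAME></GENUS>")

first = upsert_genus(resource, revised)
second = upsert_genus(resource, new)
print(first, second, [g.findtext("NAME") for g in resource])
```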

6.7.2 Applicability of the Technique

There is great inherent flexibility in the XML structure to allow for differences in the structure, protocols and layouts of taxonomic descriptions in different groups. The technique could therefore readily be applied to the automatic XML tagging of taxonomic descriptions for a wide range of different phyla. Provided the layout and structure are standardized, taxonomic descriptions in languages other than English can also be readily parsed. Foreign-language descriptions would have to be accommodated in the queries run across such data, but in many cases the technical terms used to describe species are very similar across languages.
