The Problem of N-ary Associations


With neuronal data, however, associations between objects are typically not binary but N-ary (where N is greater than 2). For example, consider the following information on the neurons of the nigrostriatal pathway (whose function is impaired in Parkinsonism):

Neurons: Nigrostriatal

Anatomical Origin (Location): Pars Compacta of Substantia Nigra

Projecting To: Corpus Striatum

Neurochemical Released: Dopamine

Receptor: D2

Electro-physiological Function: Inhibitory

Neurons that Provide Input (Afferents): Pars Reticulata of Substantia Nigra, Striato-Nigral

Neurons Projected To (Efferents): Striato-Pallidal neurons, Striato-Striatal neurons

The information in the example above may be regarded as multi-axial, where receptor, anatomical site, neuronal type etc. comprise the axes, and we must now consider how to represent it within a database schema. First, note that it is not advisable to represent the axes as attributes of a "Neuron" class. Such an approach might have been permissible in a database with the primary focus on neurons. However, in a database that stores information on a variety of objects, this approach introduces an asymmetry by implicitly making neurons first-class entities and others second-class. Each axis mentioned above refers to objects which, depending on the perspective of particular users, may be as or more important than neuronal types. Thus, some queries may not be directed at the Neuron class at all: e.g., a user may wish to retrieve a list of anatomical locations where D2 receptors are found. Such users might advocate creating a "Receptors" table, and storing this information with the neuronal type as one of the fields instead.

Therefore, a better way to represent this information is as associations. One method of representing multi-axial associations, widely deployed in data warehouse design for business applications, is the "Star" Schema (14). In a star schema, a central "Facts" table, which stores one or more quantitative columns plus several foreign key columns, is related many-to-one to multiple "Dimension" (class) tables. (The term "star" refers to the appearance of the schema diagram, with the many-to-one links radiating out from a central facts table.) Each class table stores information on entities in a single axis. This design has proved valuable in situations such as analyzing sales information by territory, salesman, product, product category, volume, and so on. For NS data, however, a star schema design is unlikely to be usable without considerable modification, for several reasons:

1. The axes describing neuronal data are not strictly orthogonal (i.e. independent). For example, receptors and neurotransmitters are inter-related; given knowledge of the receptor, a domain expert automatically knows the transmitter involved (though the converse is not true).

2. The number of axes relevant to a particular fact is variable: some attributes may not be known, and others may be irrelevant for certain instances of data.

3. The nature of axes is also unlikely to be static (i.e. unchanging) in a rapidly evolving field. An axis that could be added to the above list, for example, is expression during different phases of embryonic life.

4. The same object class may appear in more than one axis. In the example, the "Neuron" class appears in three different categories - Neurons, Efferents, and Afferents.

5. Certain axes may have multiple object instances. In the example, the nigrostriate neurons receive inputs from multiple neurons and generate outputs for multiple neurons as well. For the NS, in fact, most of the axes are likely to be multi-valued. (For example, a single neuron has multiple classes of receptors on its surface, and many kinds of neurons are known to release more than one neurochemical simultaneously.) In a normalized relational database design, columns of a table must be atomic and not multi-valued, and so multi-valued data must be factored out into separate tables.

6. Retrieval of data along some axes sometimes involves sophisticated algorithms rather than simple table lookups. The best-known examples in biological data are sequence similarity (determined by algorithms such as BLAST (15)) and the information-retrieval metric of Wilbur (16), which measures the similarity of two bibliographic citations based on textual content (see footnote 1).

7. Within a single axis, entities may be inter-related through recursive relationships of the parent-child type. This complicates the query process because of the need to "explode" a query object instance, retrieving all its children prior to scanning the association data. In the example above, for instance, the pars compacta is part of the substantia nigra, which is part of the mid-brain. To process a query that asks for the anatomical locations of various receptors in the mid-brain, one would first have to retrieve all "child" anatomical sites within the mid-brain and then search the association data against this set of child sites.
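The "explode" step in point 7 can be sketched as a simple traversal of a part-of hierarchy. This is only an illustration: the PART_OF mapping and the explode function are hypothetical names, and the hierarchy holds just the sites named in the example.

```python
from collections import deque

# Hypothetical part-of hierarchy (child site -> parent site), limited to
# the anatomical sites named in the example above.
PART_OF = {
    "Pars Compacta": "Substantia Nigra",
    "Pars Reticulata": "Substantia Nigra",
    "Substantia Nigra": "Mid-brain",
}


def explode(site, part_of=PART_OF):
    """Return the site together with all of its direct and indirect children."""
    # Invert the child -> parent mapping into parent -> [children].
    children = {}
    for child, parent in part_of.items():
        children.setdefault(parent, []).append(child)

    found = {site}
    queue = deque([site])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in found:
                found.add(child)
                queue.append(child)
    return found


# The exploded set is what the association data would then be searched against.
midbrain_sites = explode("Mid-brain")
```

A query about the mid-brain would thus be rewritten as a query against every site in midbrain_sites before the association data is scanned.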

A general representation of N-ary Associations


The sub-schema we propose to handle the case of N-ary associations (where N, the number of axes, varies) is shown in fig. 1. (In this figure, table names are Bold/Underlined, while primary keys are in italics. Arrows point from a foreign key to a primary key.) The Classes and Objects tables have been mentioned earlier in connection with the Object Dictionary approach. The Facts table stores a unique identifier, the Fact ID, and a textual narrative of the fact (for reasons described shortly). There is a one-to-many link between the Facts table and a Citations table (not shown in the figure).

Footnote 1: While some systems, such as NCBI's Entrez, store pre-computed sequence and citation similarity scores for efficiency, such pre-computation must be done each time new sequences or citations are added to the database. Such pre-computation is justifiable only if the primary purpose of the database is to assist similarity searching (as in Entrez).

The Associations table describes the object instances linked to the fact, with one row per object instance for the same fact ID. The Qualifier field is a descriptor from a controlled-vocabulary table that describes an aspect of an association. The set of permissible qualifiers is determined by the class of a particular object instance. Examples of qualifiers for neurons are "primary", "afferent" and "efferent". Examples of functional effects are "excitatory", "inhibitory", "autoreceptor negative feedback".
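The rule that the class of an object instance determines its permissible qualifiers might be checked along the following lines. This is a minimal sketch, not the authors' implementation: the Association record, the PERMITTED mapping and is_valid are hypothetical names, and the vocabulary lists only qualifiers that appear in this text.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed controlled vocabulary: class name -> permissible qualifiers.
# The entries are the qualifiers mentioned in this text, in lower case.
PERMITTED = {
    "Neuron": {"primary", "afferent", "efferent"},
    "Location": {"soma", "eff.axon"},
    "Neurochemical": {"transmitter", "modulator"},
}


@dataclass(frozen=True)
class Association:
    """One row of the Associations table (descriptions used instead of IDs)."""
    fact_id: int
    obj: str
    cls: str
    qualifier: Optional[str] = None


def is_valid(assoc, permitted=PERMITTED):
    """Accept a row if it has no qualifier or one permitted for its class."""
    if assoc.qualifier is None:
        return True
    return assoc.qualifier.lower() in permitted.get(assoc.cls, set())
```

In a relational implementation the same constraint would more likely live in a table relating classes to qualifiers, so that new qualifiers can be added without changing code.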

The example data would be represented in the Associations table as illustrated in Table 1. (We assume a fact ID of 100 and, for simplicity, use the Object / Class / Qualifier descriptions rather than IDs.)

| Fact ID | Object | Class | Qualifier |
|---|---|---|---|
| 100 | Nigrostriate | Neuron | Primary |
| 100 | Pars Compacta, S.Nigra | Location | Soma |
| 100 | Corpus Striatum | Location | Eff.Axon |
| 100 | Dopamine | Neurochemical | Transmitter |
| 100 | D2 | Receptor | Efferent |
| 100 | Inhibition | Electr.Function | |
| 100 | Striato-pallidal | Neuron | Efferent |
| 100 | Striato-striatal | Neuron | Efferent |
| 100 | Striato-nigral | Neuron | Afferent |
| 100 | Pars Reticulata, S.Nigra | Neuron | Afferent |

Table 1 : Capturing information on Nigrostriatal Neurons
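As an illustration, the Facts and Associations tables could be populated with the example data using Python's built-in sqlite3 module. The column names, the build_db helper and the narrative wording are assumptions made for this sketch, and descriptions stand in for the object, class and qualifier IDs, as in Table 1.

```python
import sqlite3

# Illustrative narrative composed from the example; the wording is assumed.
NARRATIVE = (
    "Nigrostriatal neurons arise in the pars compacta of the substantia "
    "nigra, project to the corpus striatum, and release dopamine, which "
    "acts on D2 receptors with an inhibitory effect."
)

# Rows of (fact_id, object_name, class_name, qualifier), following Table 1.
# D2 is treated as the object and Receptor as its class.
ROWS = [
    (100, "Nigrostriate", "Neuron", "Primary"),
    (100, "Pars Compacta, S.Nigra", "Location", "Soma"),
    (100, "Corpus Striatum", "Location", "Eff.Axon"),
    (100, "Dopamine", "Neurochemical", "Transmitter"),
    (100, "D2", "Receptor", "Efferent"),
    (100, "Inhibition", "Electr.Function", None),
    (100, "Striato-pallidal", "Neuron", "Efferent"),
    (100, "Striato-striatal", "Neuron", "Efferent"),
    (100, "Striato-nigral", "Neuron", "Afferent"),
    (100, "Pars Reticulata, S.Nigra", "Neuron", "Afferent"),
]


def build_db():
    """Create an in-memory database holding fact 100 and its associations."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE facts (
            fact_id   INTEGER PRIMARY KEY,
            narrative TEXT NOT NULL
        );
        CREATE TABLE associations (
            fact_id     INTEGER NOT NULL REFERENCES facts(fact_id),
            object_name TEXT NOT NULL,
            class_name  TEXT NOT NULL,
            qualifier   TEXT
        );
    """)
    conn.execute("INSERT INTO facts VALUES (?, ?)", (100, NARRATIVE))
    conn.executemany("INSERT INTO associations VALUES (?, ?, ?, ?)", ROWS)
    conn.commit()
    return conn


conn = build_db()
# Single-axis query: which neurons does fact 100 list as efferents?
efferents = [row[0] for row in conn.execute(
    "SELECT object_name FROM associations "
    "WHERE class_name = 'Neuron' AND qualifier = 'Efferent' "
    "ORDER BY object_name"
)]
```

Note that the multi-valued axes (two efferent and two afferent neurons) need no extra tables here: each value is simply another row for the same fact ID.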


The Qualifier field can readily capture the semantics of an association in the case of binary relationships. Thus, to record the assertion that the neurochemical "Substance P" is co-released as a modulator with the chemical serotonin (a neurotransmitter), we would need two rows. One row would have the Object ID/Qualifier values "serotonin" and "transmitter"; the other would have "Substance P" and "modulator". Reciprocal binary relationships are also readily captured without the need to store the same fact reciprocally. Thus, to record the assertion "structure X is contained within structure Y", one uses two rows with the Object/Qualifier pairs <"X", "contained"> and <"Y", "contains">.
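The point about reciprocal relationships can be sketched as follows: one stored fact with a "contained" row and a "contains" row answers questions in both directions. The fact IDs 200 and 201 and the function names are invented for illustration; the containment data reuses the anatomical sites from the earlier example.

```python
# Rows of (fact_id, object, qualifier). Each containment assertion is stored
# once, as a pair of rows sharing a fact ID. Fact IDs are invented.
CONTAINMENT = [
    (200, "Pars Compacta", "contained"),
    (200, "Substantia Nigra", "contains"),
    (201, "Substantia Nigra", "contained"),
    (201, "Mid-brain", "contains"),
]


def partners(obj, own_qualifier, other_qualifier, rows=CONTAINMENT):
    """Find objects that share a fact with obj and carry the other qualifier."""
    ids = {f for f, o, q in rows if o == obj and q == own_qualifier}
    return sorted(o for f, o, q in rows if f in ids and q == other_qualifier)


def containers_of(obj, rows=CONTAINMENT):
    """Structures that directly contain obj."""
    return partners(obj, "contained", "contains", rows)


def contents_of(obj, rows=CONTAINMENT):
    """Structures directly contained within obj."""
    return partners(obj, "contains", "contained", rows)
```

Both directions are served by the same two rows per fact, which is the saving the text describes.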

However, for the arbitrary N-ary fact, the objects are linked to each other conceptually like the nodes of a semantic network. The Qualifiers act somewhat like the edges in the network, but they are not fully satisfactory for this purpose given the current structure. Therefore, it is often hard to reconstruct the semantics of an N-ary fact given the data in the Associations table alone. (Even in cases where it is not hard, it generally requires several computational steps.) To avoid this (generally unneeded) computation, the "Narrative" field in the Facts table stores an explicit textual description of the N-ary fact for the user's perusal.

By analogy with Information-Retrieval methods, the Associations table may be regarded as an index (17) to the narrative text, for the purpose of rapid retrieval. The only difference between the Associations table and the inverted files created by free-text indexing engines (e.g., for Web-searchable document collections) is that the index-term vocabulary is more controlled with the Associations table. The similarity, however, is that, in both cases, complex Boolean retrieval (e.g., list all neurons where Dopamine has an inhibitory role) requires set operations, such as Union, Intersection, and Difference, on subsets (projections/selections) of the Associations table with each other. Relational set operations are computationally less efficient than the equivalent AND, OR and NOT operations that would have been needed with, say, a classical Binary-Relationship table, but the advantage is flexibility and a simple structure. (For example, multiple object instances on a single axis do not need to be managed through separate many-to-one related tables.) Also, in practice a significant proportion of queries tend to be based on a single axis rather than multiple ones. Such queries can be answered by locating the fact IDs corresponding to a particular class instance, and then simply returning the narrative for those IDs.
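The Boolean retrieval described above can be sketched with Python sets standing in for relational set operations on fact IDs. Fact 100 follows the example data; facts 101 and 102, with their placeholder neurons "Neuron A" and "Neuron B", are invented purely to make the set operations visible, and the function names are assumptions.

```python
# Rows of (fact_id, object, class, qualifier). Only fact 100 comes from the
# example; facts 101 and 102 are invented placeholders, not real findings.
ASSOCIATIONS = [
    (100, "Nigrostriate", "Neuron", "Primary"),
    (100, "Dopamine", "Neurochemical", "Transmitter"),
    (100, "Inhibition", "Electr.Function", None),
    (101, "Neuron A", "Neuron", "Primary"),
    (101, "Dopamine", "Neurochemical", "Transmitter"),
    (101, "Excitation", "Electr.Function", None),
    (102, "Neuron B", "Neuron", "Primary"),
    (102, "Serotonin", "Neurochemical", "Transmitter"),
    (102, "Inhibition", "Electr.Function", None),
]


def fact_ids(obj, cls, rows=ASSOCIATIONS):
    """Select the IDs of all facts that mention the given class instance."""
    return {f for f, o, c, _ in rows if o == obj and c == cls}


def primary_neurons(ids, rows=ASSOCIATIONS):
    """Return the primary neuron of each fact in ids."""
    return sorted(
        o for f, o, c, q in rows
        if f in ids and c == "Neuron" and q == "Primary"
    )


dopamine = fact_ids("Dopamine", "Neurochemical")
inhibitory = fact_ids("Inhibition", "Electr.Function")

both = dopamine & inhibitory            # Intersection (AND)
either = dopamine | inhibitory          # Union (OR)
dopamine_only = dopamine - inhibitory   # Difference (AND NOT)

# "List all neurons where Dopamine has an inhibitory role"
answer = primary_neurons(both)
```

A single-axis query needs only one call to fact_ids, after which the narratives for the returned IDs can be shown directly.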
