BioKleisli in Action: Querying Biomedical Databases

BioKleisli consists of a query execution engine and a set of type-specific data drivers. The figure below illustrates the installation that we use at the Center for Bioinformatics (PennCBI), which underlies the queries shown at the CPL website http://www.pcbi.upenn.edu (follow the links to research projects). Users interact with the PennCBI installation through parameterized HTML query forms, through programs written in Perl 5, Prolog, C and other languages, or by using CPL directly. The external data sources that we connect to include ASN.1, Sybase and AceDB, as well as the BLAST sequence analysis package, as the data drivers in the figure indicate. Note that the Sybase driver serves both the GDB Sybase server and our local Sybase database for Chromosome 22, Chr22DB.

Before an external server can be queried, the names and types of the "structures" that will be accessed in the data source must be registered as primitives in the BioKleisli library. For example, with a relational database one must register, as parameterless functions, the names of the relations or views that will be accessed; alternatively, one can register a single function per database that takes the name of a relation as input and returns a result of type set of records. These functions can then be used in CPL queries, or within other CPL function definitions to create "user views" of the underlying data sources. The query execution module within BioKleisli then generates the appropriate query in the host language of the external server to extract the value of the named structure; with a Sybase server, for example, an SQL query is generated from the CPL query. The host-language query is passed to the appropriate data driver, and from there to the external server. When the external server returns the result, the data driver translates it into the internal BioKleisli format and returns the translated result to the query execution module for further processing within the original CPL query.
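
The registration and driver round trip described above can be sketched roughly as follows. This is an illustration in Python, not CPL and not BioKleisli's actual interface: the names PRIMITIVES, SqlDriver and register_database are hypothetical, an in-memory SQLite database stands in for a Sybase server, and the locus relation and its rows are made up. It shows the "one function per database" variant, where the registered function takes a relation name and returns records.

```python
import sqlite3

# Registry of primitives, keyed by the name a query would use (hypothetical).
PRIMITIVES = {}


class SqlDriver:
    """Hypothetical data driver: runs host-language (SQL) text and translates rows."""

    def __init__(self, connection):
        self.connection = connection

    def run(self, sql):
        cursor = self.connection.execute(sql)
        columns = [col[0] for col in cursor.description]
        # Translate driver-native rows into an internal format: a list of records.
        return [dict(zip(columns, row)) for row in cursor.fetchall()]


def register_database(name, driver, relations):
    """Register one function per database that maps a relation name to its records."""
    known = frozenset(relations)

    def primitive(relation):
        if relation not in known:
            raise KeyError(f"relation {relation!r} is not registered for {name}")
        # Query generation step: build the host-language query for the named structure.
        sql = f'SELECT * FROM "{relation}"'
        return driver.run(sql)

    PRIMITIVES[name] = primitive
    return primitive


# Toy stand-in for an external server, with a made-up relation and made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE locus (symbol TEXT, chromosome TEXT)")
conn.executemany("INSERT INTO locus VALUES (?, ?)", [("TOY1", "22"), ("TOY2", "22")])

register_database("toydb", SqlDriver(conn), ["locus"])
records = PRIMITIVES["toydb"]("locus")
```

Restricting the generated query to registered relation names mirrors the requirement that structures be declared before they are accessed; a real driver would also carry type information for each relation, which this sketch omits.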

Non-Human Homolog Search

To illustrate how BioKleisli executes queries, we will walk through an example: "Find information on the known DNA sequences on human chromosome 22, as well as information on homologous sequences from other organisms." The strategy is to combine information from the relational GDB and ASN.1 GenBank. GDB is queried for the accession numbers of DNA sequences known to be within chromosome 22. The NA-Homolog-Summary function available in the Entrez interface to ASN.1 GenBank is then invoked to retrieve homologous sequences (i.e., sequences with significant similarity to the original). The homologous sequences are then filtered to retain only non-human entries. The final answer is printed as a nested relation.
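
The shape of this query can be sketched in Python, under heavy assumptions: both data sources are replaced by stub functions, the accession numbers and homolog entries are invented placeholders rather than real GDB or GenBank data, and the function names are hypothetical. Only the three steps (look up accessions, fetch homologs, drop human entries) and the nested result come from the description above.

```python
def gdb_chr22_accessions():
    """Stub for the GDB lookup of accession numbers on chromosome 22 (toy values)."""
    return ["ACC001", "ACC002"]


def na_homolog_summary(accession):
    """Stub standing in for the Entrez NA-Homolog-Summary function (toy values)."""
    toy_homologs = {
        "ACC001": [
            {"accession": "HOM101", "organism": "Homo sapiens"},
            {"accession": "HOM102", "organism": "Mus musculus"},
        ],
        "ACC002": [
            {"accession": "HOM201", "organism": "Homo sapiens"},
        ],
    }
    return toy_homologs.get(accession, [])


def non_human_homologs():
    """Combine both sources into a nested relation: one record per sequence."""
    result = []
    for accession in gdb_chr22_accessions():
        # Keep only homologs that come from organisms other than human.
        homologs = [
            h for h in na_homolog_summary(accession)
            if h["organism"] != "Homo sapiens"
        ]
        result.append({"accession": accession, "homologs": homologs})
    return result


nested = non_human_homologs()
```

Each outer record carries a nested list of homolog records, so a sequence with no non-human homologs still appears in the answer with an empty list.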
