Linking the Visual Module with the Language Module

Once the visual module is built, what good is it? By itself, not much. It only becomes useful when it is linked by knowledge with other cognitive modules. This subsection presents a brief sketch of an example of how, via instruction by a human educator, a vision module could be usefully linked with a language module.

A problem that has been widely considered is the automated text annotation of video: describing objects within video scenes and some of those objects' attributes. For example, such annotations might be useful for blind people if the images being annotated were taken by a camera mounted on a pair of glasses (and the annotations were synthesized into speech delivered by the glasses to the wearer's ears via small tubes issuing from the temples near the ears).

Figure 3.12 illustrates a simple concept for such a text annotation system. Video input from the eyeglasses-mounted camera is operated upon by the gaze controller, and the objects it selects are segmented and represented by the already-developed visual module, as described in the previous subsection. The objects used in the visual module's development were those that a blind person would want to be informed of (curbs, roads, cars, people, etc.). Thus, by virtue of its development, the visual module searches each new frame of video for an object of operational interest (because these were the objects sought out by the human educator whose examples were used to train the gaze controller perceptron); that object is then segmented and, after consensus building, represented by the module on all three of its layers.
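As a rough, purely illustrative sketch, the per-frame flow just described could be organized as follows (the module interfaces named here, such as gaze_controller.select and visual_module.represent, are hypothetical placeholders, not interfaces defined in this book):

# Hypothetical per-frame loop for the annotation system (illustrative only).
def process_frame(frame, gaze_controller, visual_module):
    """Select an object of operational interest and represent it on all three layers."""
    fixation = gaze_controller.select(frame)            # perceptron trained from educator examples
    if fixation is None:
        return None                                     # no object of operational interest in this frame
    segment = visual_module.segment(frame, fixation)    # isolate the fixated object
    representation = visual_module.represent(segment)   # consensus building settles symbols on all three layers
    return representation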

To build the knowledge links from the visual module to the text module, another human educator is used. This educator looks at each fixation-point object selected by the vision module (while it is being used out on the street in an operationally realistic manner) and, if it is indeed an object that would be of interest to a blind person, types in one to five sentences describing that object. These sentences are designed to convey to the blind person useful information about the nature of the object and its visual attributes (information that the human educator can extract just by looking at the visual representation of the object).
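As one way to picture the raw material this education step produces, each fixation the educator chooses to annotate might be logged as a record like the following (a minimal sketch; the field names and Python layout are assumptions, not part of the design described here):

from dataclasses import dataclass, field

@dataclass
class AnnotationExample:
    """One educator-supplied education example (hypothetical record layout)."""
    tertiary_symbols: list[str]          # symbols active on the visual module's tertiary lexicons
    sentences: list[str]                 # the one to five educator sentences describing the object
    subcomponent_windows: list[tuple] = field(default_factory=list)  # optional local windows in the eyeball image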

To train the links from the vision module to the language module (each visual lexicon is given a knowledge base linking it to every phrase lexicon), the educator's sentences are entered, in order, into the word lexicons of the sentence modules (each of which represents one sentence; see Figure 3.12); each sentence is parsed into phrases (see Section 3.4); and these phrases are represented on the sentence summary lexicon of each sentence. Counts are accumulated between the symbols active on the visual module's tertiary lexicons and those active on the summary lexicons. If the educator wishes to describe specific visual subcomponents of the object, they may designate a local window in the eyeball image for each subcomponent and supply the sentence(s) describing each such subcomponent. The secondary and tertiary lexicon symbols representing the subcomponents within each image are then linked to the summary lexicons of the associated sentences. Before being used in this application, all of the internal knowledge bases of the language module have already been trained using a huge text training corpus.
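A minimal sketch of this count accumulation might look like the following, assuming the language module exposes parsing and summary-lexicon lookup (the function names parse_into_phrases and summary_symbols are invented for illustration):

from collections import defaultdict

co_counts = defaultdict(int)       # co_counts[(visual_symbol, summary_symbol)]
summary_counts = defaultdict(int)  # how often each summary symbol was active during education

def accumulate(example, language_module):
    """Accumulate co-occurrence counts between visual tertiary symbols and sentence summary symbols."""
    for sentence in example.sentences:
        phrases = language_module.parse_into_phrases(sentence)   # parsing as in Section 3.4
        summary = language_module.summary_symbols(phrases)       # symbols on the sentence summary lexicon
        for s in summary:
            summary_counts[s] += 1
            for v in example.tertiary_symbols:
                co_counts[(v, s)] += 1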

After a sufficient number of education examples have been accumulated (as determined by final performance — described below), the link use counts are converted into p(C|1) probabilities and frozen. The knowledge bases from the visual module's lexicons to all of the sentence summary lexicons are then combined (so that the available long-range context can be exploited by a sentence in any position in the sequence of sentences to be generated). The annotation system is now ready for testing.
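Continuing the same sketch, the freezing step might be written as below. The exact normalization behind the p(C|1) probabilities follows the definition given earlier in the book; dividing each co-occurrence count by the corresponding summary-symbol count, as done here, is an assumption made only for illustration:

def finalize_links(co_counts, summary_counts, min_count=1):
    """Convert accumulated counts into frozen link probabilities (illustrative normalization)."""
    links = {}
    for (v, s), n in co_counts.items():
        if n >= min_count:                     # optionally discard links seen too rarely
            links[(v, s)] = n / summary_counts[s]
    return links                               # frozen: no further updates once testing begins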

The testing phase is carried out by having a sighted evaluator walk down the street wearing the system (yes, the idea is that the entire system is in the form of a pair of glasses!). As the visual module selects and describes each object, knowledge link inputs are sent to the language module. These inputs are used much as in the example of Section 3.3: as context that drives formation of a sentence (only now there is no starter). Using consensus building (and separate sentence starter generator and sentence terminator subsystems, not shown in Figure 3.12 and not discussed here, for starting and ending the sentence), the language module composes one or more grammatical sentences that describe the object and its attributes.

Figure 3.12 Image text annotation. A simple example of linking a visual module with a (text) language module. See text for description.

The number of sentences is determined by a meaning content critic subsystem (not shown in Figure 3.12), which stops sentence generation when all of the distinctive, excited sentence summary lexicon symbols have been "used" in one or more of the generated sentences.
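Putting the testing-phase behavior together, the generation loop with its meaning content critic might be sketched as follows (compose_sentence and all_symbols_used are hypothetical interfaces standing in for confabulation, consensus building, and the critic described above; the max_sentences guard is an added safety stop, not part of the design described here):

def generate_annotation(visual_symbols, language_module, critic, max_sentences=5):
    """Compose sentences until every distinctive, excited summary symbol has been 'used'."""
    sentences = []
    while not critic.all_symbols_used(visual_symbols, sentences):
        if len(sentences) >= max_sentences:
            break                                                            # safety stop only
        sentence = language_module.compose_sentence(context=visual_symbols)  # confabulation + consensus building
        sentences.append(sentence)
    return sentences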

This sketch illustrates the monkey-see/monkey-do principle of cognition: there is never any complicated algorithm or software; no deeply principled system of rules or mathematical constraints; just confabulation and consensus building. It is a lot like that famous cartoon in which scientists are working at a blackboard, attempting, unsuccessfully, to connect up a set of facts on the left with a desired conclusion on the right via a complicated scientific argument spanning the gap between them. In frustration, one of the scientists erases a band in the middle of the argument and puts in a box (equipped with input and output arrows) labeled "And Then a Miracle Occurs." THAT is the nature of cognition.
