Representation of Multi-Source Soundstreams

Figure 3.5 illustrates an "audio front end" for transduction of a soundstream into a string of "multisymbols," with the goal of carrying out ultra-high-accuracy speech transcription for a single speaker embedded in multiple interfering sound sources (often including other speakers). The description of this design does not concern itself with computational efficiency. Given a concrete design for such a system, there are many well-known signal processing techniques for implementing approximately the same function, often orders of magnitude more efficiently. For the purpose of this introductory treatment (which, again, is aimed at illustrating the universality of confabulation as the mechanization of cognition), this audio front-end design does not incorporate embellishments such as binaural audio imaging.

Referring to Figure 3.5, the first step in processing is analog speech lowpass filtering (say, with a flat, zero-phase-distortion response from DC to 4 kHz, with a steep rolloff thereafter) of the high-quality (say, over 110 dB dynamic range) analog microphone input. Following this lowpass filtering, the microphone signal is sampled with a (say, 24-bit) analog-to-digital converter operating at a 16 kHz sample rate. The combination of high-quality analog filtering, a sufficient sample rate (well above the 8 kHz Nyquist rate for a 4 kHz passband), and high dynamic range yields a digital output stream with almost no artifacts (and low information loss). Note that digitizing to 24 bits supports exploitation of the wide dynamic ranges of modern high-quality microphones. In other words, this dynamic range will make it possible to accurately understand the speech of the attended speaker, even if much higher-amplitude interferers are present in the soundstream.
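The arithmetic behind these figures can be checked with a short sketch. It assumes an ideal converter, for which the ratio of full scale to one quantization step is 20 log10(2^N) dB for N bits; real converters deliver fewer effective bits, so the result is an upper bound. The function and constant names are illustrative and do not come from the original design.

```python
import math

# Figures quoted in the text.
MIC_DYNAMIC_RANGE_DB = 110.0   # "over 110 dB" microphone
ADC_BITS = 24                  # converter word length
PASSBAND_HZ = 4_000            # lowpass filter passband edge
SAMPLE_RATE_HZ = 16_000        # converter sample rate


def ideal_dynamic_range_db(bits: int) -> float:
    """Full scale over one quantization step, in dB, for an ideal converter."""
    return 20.0 * math.log10(2 ** bits)


def nyquist_rate_hz(bandwidth_hz: float) -> float:
    """Minimum sample rate for a signal band-limited to bandwidth_hz."""
    return 2.0 * bandwidth_hz


adc_range_db = ideal_dynamic_range_db(ADC_BITS)
headroom_db = adc_range_db - MIC_DYNAMIC_RANGE_DB
oversampling = SAMPLE_RATE_HZ / nyquist_rate_hz(PASSBAND_HZ)

print(f"ideal {ADC_BITS}-bit range: {adc_range_db:.2f} dB")
print(f"margin over microphone:    {headroom_db:.2f} dB")
print(f"sample rate / Nyquist rate: {oversampling:.1f}")
```

Under this idealization a 24-bit word spans about 144.5 dB, comfortably more than the 110 dB attributed to the microphone, and the 16 kHz sample rate is twice the 8 kHz Nyquist rate.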

The 16 kHz stream of 24-bit signed integer samples generated by the above preprocessing (see Figure 3.5) is next converted to floating point numbers and blocked up in time sequence into 8000-sample windows (8000-dimensional floating point vectors, each spanning 500 ms of sound), at a rate of one window for every 10 ms. Each such sound sample vector X thus overlaps the previous such vector by 98% of its length (7840 samples). In other words, each X vector contains 160 new samples that were not in the previous X vector (and the "oldest" 160 samples in that previous vector have "dropped off the left end").
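The windowing described above can be sketched as follows. This is a minimal illustration, not the design's actual implementation: the scaling of integers to the range [-1, 1) is an assumption (the text says only that samples are converted to floating point), and the names `int24_to_float` and `sliding_windows` are hypothetical.

```python
from typing import Iterator, List, Sequence

SAMPLE_RATE_HZ = 16_000
WINDOW_LEN = 8_000   # samples per vector X (500 ms at 16 kHz)
HOP_LEN = 160        # new samples per vector (10 ms at 16 kHz)


def int24_to_float(sample: int) -> float:
    """Map a signed 24-bit integer to [-1, 1). The scaling is an assumption."""
    return sample / float(1 << 23)


def sliding_windows(samples: Sequence[int],
                    window_len: int = WINDOW_LEN,
                    hop_len: int = HOP_LEN) -> Iterator[List[float]]:
    """Yield successive overlapping sound sample vectors X."""
    floats = [int24_to_float(s) for s in samples]
    # Only full windows are produced; a trailing partial window is dropped.
    for start in range(0, len(floats) - window_len + 1, hop_len):
        yield floats[start:start + window_len]


# One second of synthetic integer samples (a repeating ramp, not real audio).
samples = [(i % 1000) - 500 for i in range(SAMPLE_RATE_HZ)]
windows = list(sliding_windows(samples))

overlap_len = WINDOW_LEN - HOP_LEN
overlap_fraction = overlap_len / WINDOW_LEN
print(len(windows), overlap_len, overlap_fraction)
```

Each vector shares its first 7840 samples with the last 7840 of its predecessor, which is the 98% overlap stated in the text. A production system would more likely use a ring buffer or strided array views than copy 8000 samples every 10 ms; the copying here is only for clarity.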

