Conceptual Framework for Carol's Song
The problem:
Amyotrophic lateral sclerosis (ALS) patients typically lose their ability to speak
while retaining the cognitive functions to be productive members of society. Their
needs in the area of speech communication aids are unique for two reasons:
The loss of speech is typically accompanied by a loss of other motor functions, thus
greatly reducing the effectiveness of traditional input devices such as keyboards.
The speed at which the disease progresses makes long learning curves impractical.
To be useful, a communication aid must be very easy to use and transfer seamlessly
from one input device to another.
Current communication aids for the speech impaired have three main components:
-
An input device for the speech impaired person to enter their communication.
-
A translation unit that converts the input to text.
-
A Text To Speech (TTS) device that converts the text to synthesized speech.
While improvements are possible in all three areas, it is the opinion of this group
that the greatest gains are possible in the second area.
In the simplest case, the translation unit is nothing more than a conduit, collecting
a string of characters from the input device and then passing the completed string
to the TTS device. Such units are woefully inadequate for ALS patients because the
input rate is too slow for real-time dialog.
More sophisticated units employ prediction algorithms which offer options for completing
words and sentences based on what has already been entered. Most of these algorithms
are based on matching the text entered to the beginning of strings held in a dictionary.
Such algorithms have the advantage of being intuitively obvious and fast to execute,
but they do not take full advantage of the redundancy inherent in written text. For
small words, the number of keystrokes is not significantly reduced. For long words,
the number of options does not become manageable until many characters have been entered.
Predicting sentences by matching the first few words falls prey to the same problems
of information theory. The object of a sentence is often the last word in the sentence.
It is difficult to predict the sentence until the main parts of speech are known.
A translation unit that requires words be entered in the order of the final sentence
mandates that words be entered that could have been predicted.
In an attempt to obviate these difficulties, some translation units also allow the
use of stored abbreviations for words and complete sentences. While these are effective,
the total number of abbreviations is necessarily small since a large number would
require extensive memorization and extra keystrokes.
Another area that is poorly addressed by most translation units is the correction
of typographical errors. This is particularly important for users who also suffer
from motor skill impairments.
Whats needed is a translation unit that allows the user to abbreviate freely and offer
potential words and sentences based on the relevance of the information rather than
the order. The translation unit also needs to recognize incongruous information (typos)
and attempt to match the remaining information. In essence, the translator needs to
employ common sense.
Proposal: Bayesian Adaptive Prediction Algorithm
Since it is impossible to know exactly what message is intended until the entire message
is delivered, a prediction algorithm must determine which possibilities are most likely
intended and then present them as options to be selected.
The probability that a user intends to send a given message, M, is represented as:
P(M)
The set of probabilities for all possible messages is called the Prior Distribution
(or simply, Prior) because we have not yet applied any observations to the calculation
of the probabilities. Once a string is entered, we have more information, and thus
can update the Prior. The probability of a message, M being intended given that a
user has entered a string, S, is expressed as:
P(M|S)
This set of these probabilities over then entire set of possible messages is called
the Posterior Distribution (or Post), because it is a distribution modified by the
presence of a known condition. This probability is extraordinarily difficult to compute
directly. Fortunately, Bayes Theorem gives us a mechanism to compute it indirectly:
P(M|S) = P(M) P(S|M) / P(S)
Two pieces of this equation are known, or at least can be reasonably estimated. P(M)
is just our Prior, while P(S|M) is the probability that a user would choose the string
S to represent M (as opposed to some other string).
Both these terms will vary from user to user, thus the need for an adaptive algorithm.
The initial distributions will be estimated and then the program will track the actual
usage and update the probabilities accordingly.
The final piece, P(S), is the probability of seeing a particular string. While this
is the most heavily user dependent and therefore the most difficult piece to estimate,
it is also completely unnecessary. Since the value of S is fixed, P(S) is merely a
constant that normalizes the distribution so all the probabilities sum to 1. The goal
here is to determine the most likely candidates, not assess their actual probability,
so the P(S) term can be left out of the computation.
The benefits of using a Posterior Distribution to rank candidate results are manifest.
Very little training is required as the user is free to abbreviate as they see fit.
Over time, the program adapts to their usage rather than vice-versa. However, to implement
such a system, two significant challenges need to be addressed:
-
The set of possible messages and the set of possible strings are both infinite. Exhaustive
evaluation of even a restricted list of each would require computing resources that
simply do not exist, much less be available to a speech impaired user.
-
Completely unanticipated abbreviations will never be recognized (and thus not adapted
to) since if P(S|M) = 0 then the Post will never rank M as a viable candidate for
string S.
The strategies for dealing with these two challenges are outlined below.
The problem of infinite domain:
The set of messages can be reduced to a finite number by restricting the vocabulary
and the length of sentences. However, even a tiny vocabulary of 1000 words and a limit
of 10 words per sentence yields approximately 1030 possible messages. Further
restricting the message domain to grammatically correct statements still results in
a set that is far beyond the storage capacity of any device for the foreseeable future.
Translating directly from the input string to the final message is apparently to great
a leap. If translation is done word by word, much of the power of the algorithm will
be lost.
An intermediate form is to use a tiered translation. The domain of P(M) will include
both individual words and the most frequent messages. Since individual words appear
with a much higher frequency than complete sentences, the prediction algorithm will
initially create a Post that sorts only words to the top. As words are recognized
and integrated back into the string, a grammatical context is generated and seeded
with the known parts of the message. This is used to generate new candidate messages
and include them in the next Prior and Post distributions. Eventually, complete messages
will find their way to the top of the list.
An example of how this might work follows:
|
Input
|
Message String
|
Candidates
|
|
R
|
R
|
1. ARE
2. RUN
3. RAIN
4. ROOM
|
|
M
|
RM
|
1. ROOM
2. RESTROOM
3. REMAIN
4. ROAM
|
|
<F2>
|
RESTROOM
|
1. I HAVE TO GO TO THE RESTROOM.
2. WHERE IS THE RESTROOM?
3. DO YOU HAVE A RESTROOM?
4. I NEED A HANDICAPPED RESTROOM.
|
|
C
|
RESTROOM C
|
1. CAR ;
2. CAN
3. SEE
4. CAT
|
|
L
|
RESTROOM CL
|
1. CLEAN
2. CLIMB
3. CLOSE
4. CLASS
|
|
G
|
RESTROOM CLG
|
1. CLEANING
2. CLING
3. CLOG
4. THE RESTROOM NEEDS CLEANING.
|
|
<F1>
|
RESTROOM CLEANING
|
1. THE RESTROOM NEEDS CLEANING.
2. I ENJOY CLEANING A RESTROOM. ;
3. CLEANING A RESTROOM IS SCARY.
4. MY JOB IS CLEANING A RESTROOM.
|
|
M
|
RESTROOM CLEANING M
|
1. ME
2. MY
3. EMPTY
4. MAN
|
|
<F4>
|
RESTROOM CLEANING MAN
|
1. A MAN IS CLEANING THE RESTROOM.
2. WHERE IS THE RESTROOM CLEANING MAN?
3. A MAN IS CLEANING IN THE RESTROOM.
4. THE RESTROOM IS CLEANING A MAN.
|
The complexity of the candidate generation increases as the input string gets longer.
As more letters are added to a word in progress, the candidate vocabulary is increased.
As more words are added to the string, more complicated grammatical structures are
added to the message generator. This allows for relatively quick identification of
simple and often repeated messages while still assisting with the creation of more
complicated messages.
At a certain level of complexity, the generator will cease to create grammatical structures
for the message. At that point, the prediction algorithm reverts to simple word prediction
and the occasional expansion of a phrase.
The problem of unpredictability:
Word abbreviations are highly predictable, since the abbreviation is almost always
a subset of the letters in the word. Even transpositions and typos are easily accounted
for by simply assigning a probability penalty to each form of correction.
Much more difficult is predicting the abbreviations of phrases or complete messages.
Since all but the most common phrases and messages are generated outside of the vocabulary
list, the algorithm needs a grammatical context to create the phrase and assign a
Prior. If the user picks an obscure (albeit obvious to them) abbreviation for the
phrase and no other message information exists, no match will be found. The user will
have to erase the input string and try again. The program will never learn to associate
the original string with the message.
To facilitate the linking of such input/message pairs, the program will offer a retry
command (triggered by a special input character) distinct from the normal correction
keys. Upon receiving a retry, the current token (contiguous string of characters)
is erased from the input buffer, but the token is saved. The user continues to build
the message in the normal manner.
For every selected candidate between the retry and the acceptance of the final message,
a provisional pair (original token, selected candidate) is stored. The next time the
original token is encountered, the candidates from the provisional pairs are displayed.
The one that is selected is added to the vocabulary list and associated with that
token.
To further facilitate the adaptation, anytime a candidate or full message is selected,
the message string at that point is added to the vocabulary list and associated with
all the partial sets of keystrokes that preceded the acceptance. Infrequently used
phases and messages will be sorted to the bottom of the vocabulary list and not affect
performance. Frequently used phrases and messages will become part of the active vocabulary
and will thus be offered as candidates sooner than grammatically generated messages.
|