
Since speech is an integral part of communication, people with various speech disorders risk becoming isolated

in society. Our work deals with articulation disorder, one of the most frequently occurring speech disorders. Difficulty in correctly pronouncing phonemes is referred to as articulation disorder. Most such pronunciation errors made by children or adults can be corrected with individual training given by speech therapists. With today's growing population, speech therapists need the help of signal processing techniques to attend to the needs of an increasing number of patients. Until now, signal processing techniques have not been implemented in Malayalam for the evaluation of speech disorders, which would reduce the effort of speech pathologists.

As preliminary work towards automation of the Malayalam articulation test, we investigate substitution-type articulation disorder with waveform analysis. We then introduce an objective evaluation of articulation disorder that builds on speech processing techniques such as dynamic time warping (DTW). To quantify articulation disorder, objective speech quality measures such as the Itakura-Saito (IS) distance, log likelihood ratio (LLR), log area ratio (LAR), weighted cepstrum distance (WCD) and signal-to-noise ratio (SNR) are computed between normal and disordered speech. Finally, these objective measures are combined using a majority voting method to obtain a score. The false rejection ratio calculated for these score values shows that the objective measures correlate well with the subjective evaluation results.

The basic form of communication, speech, has four dimensions: voice, resonance, articulation and fluency. Abnormality in any of these areas results in speech disorders such as voice disorders, fluency disorders and articulation disorders. A speaker is said to have an articulation disorder if he fails to perceive a significant contrast between a standard phoneme and the phoneme he produces.

 

As per World Health Organization (WHO) statistics, at least 3.5% of the human population are victims of speech disorder [2]. The Census of India 2011 shows that about 50,72,914 Indians have speech disorders; in Kerala the figure is 1,05,366 [3]. The WHO report on disability in the South-East Asia region (2013) indicates that in India, speech and hearing impairment is the third-ranked impairment [4]. Also, as is the case in several developing countries, in India people with speech disorders are much less likely to receive assistive devices than people with other impairments [5].

As one of the most commonly occurring speech disorders, articulation disorder can be corrected with proper training. The causes of articulation disorder can be biological or environmental. Both children and adults can suffer from this speech sound disorder. The difficulties with articulation disorder can be categorized as omission (pepa for pepper), substitution (thoda for soda) and distortion (shlip for ship). In this work we mainly concentrate on the substitution type of articulation disorder in adults.

 

Familial aggregation of speech sound disorders is a problem faced by today's nuclear families: children of affected parents are more likely to have articulation disorders [6]. Being around more people helps a child develop language, because he can hear more people talk, and this exposure is missing in today's nuclear families. In this scenario, analysing the speech disorders of adults and children has equal relevance.

Traditional speech therapy for resolving speech problems involves one-on-one or group lessons by speech-language pathologists (SLPs), which are time consuming and costly. For articulation disorder, therapy must be done frequently to remain effective, which is difficult to achieve for many reasons, including the shortage of SLPs and financial limitations. Nowadays we require the support of advanced technologies to meet the needs of an increasing number of patients and to reduce treatment cost. Existing systems make use of technology as an aid to identify and assess the degree of disorder. For example, Computerized Assessment of Phonological Processes in Malayalam (CAPP-M) is a software tool developed at AIISH, Mysore, for speech therapy [7]. In CAPP-M, the speech-language pathologist shows pictures to the client and grades the disability by listening to the corresponding utterances and marking them. This is a time-consuming process; it requires human intervention and is hence semi-automatic in nature. If instead a system could automatically recognize the speech uttered by the client and decide which phoneme exhibits the articulation disorder, the human effort and the associated errors could be reduced. Another tool is the Vagmi Therapy Picture-Word-Articulation Module, which also aims to provide computerized assessment of misarticulation. This module is currently available in English, Kannada, Telugu, Hindi, Oriya, and Arabic.

Until now, signal processing techniques have not been fully implemented in Malayalam for the evaluation of speech disorders, which would reduce the effort of speech pathologists.

Various signal processing techniques proposed in the literature suggest that automated evaluation using signal features such as the Teager energy operator (TEO), linear predictive coding (LPC), Mel frequency cepstral coefficients (MFCC), pitch, jitter, shimmer, and the first three formants together with the bandwidth of the first formant [8] is effective. Such a system requires training and testing on various speech recordings, including both normal and disordered speech, to produce a result. The evaluation techniques denoted as subjective evaluation techniques require skilled and trained personnel, as well as many man-hours of effort. Although subjective evaluation is efficient and accurate, the effort shouldered by these people may be reduced by using objective evaluation techniques that make use of objective measures such as the Itakura-Saito (IS) measure, log likelihood ratio (LLR), log area ratio (LAR), segmental SNR measure, and log cepstral distance (LCD) [1].

In order to automate the articulation disorder test, we first have to identify the phoneme at which the disorder occurs. SLPs do this manually by repeatedly listening to the patient. As preliminary work, we can perform waveform analysis of the disordered speech and identify the disordered phoneme by comparing it with a normal speech signal. After identifying the disordered phoneme, the degree of disorder can be calculated using objective measures. Since the duration of a speech signal, even for the same utterance by the same speaker, will not be the same at different times, we need non-linear matching of the disordered phoneme and the normal phoneme by dynamic time warping (DTW). After finding the optimum alignment between the two phonemes using DTW, we can calculate the distance between normal and disordered speech using objective quality measures such as LAR, LLR, IS, SNR and LCD. To obtain a final score for a particular phoneme, we combine all the objective measures using a majority voting method. Score validation is done by calculating the false rejection rate (FRR).

The speech database used in this experiment was collected locally, in a lab environment, from adult speakers. Misarticulations of three Malayalam letters, /bha/, /zha/ and /nja/, are studied in this work. For each case, a database of both normal and disordered speech was collected. For the first disorder (/pha/ for /bha/), eleven normal and eleven disordered speech samples were collected; for the second (/ra/ or /la/ for /zha/) it was six each, and for the third (/na/ for /nja/) five each. All speakers were requested to utter the corresponding test words three times. Recording was done using the free software WaveSurfer at a sampling rate of 16000 Hz in mono channel format.

 

Figure 1 shows a waveform comparison between normal and disordered speech for the letter /bha/. The misarticulation detected in this case is uttering /pha/ instead of /bha/. This problem is identified as a regional misarticulation found among certain natives of Kottayam district in Kerala. The figure shows waveforms for the correct word "bharatham" (top) and the mispronounced word "pharatham" (bottom). The phoneme /bh/ is a plosive, producing short puffs of air, and is easily identifiable in an audio waveform. The mispronounced phoneme /ph/, however, is fricative-like and hence not easy to isolate from the waveform; in that case we listen to the waveform and isolate the particular phoneme by ear.

The second articulation disorder is observed for the letter /zha/, which is more significant because of its usage and pronunciation. This phoneme, unique to Tamil and Malayalam, exists in the Vedic language that is the source of Sanskrit. Many people do not pronounce the letter /zha/ properly; words with this sound get converted into easier-to-pronounce sounds such as /ya/, /la/ and /ra/. One possible reason is outside influence: when non-resident Keralites or non-Keralites try to speak the letter, they simplify it by substituting other easily utterable letters. For example, non-natives will pronounce "vazhappazham" as "valappalam" or "varapparam", and "Kozhikode" as "Koyikode". Waveforms for this disorder are shown in Figure 2; the top one is for "vazhappazham" and the bottom one for "varapparam". In this case the correct and disordered sounds are not identifiable by visual inspection, so we have to listen to the waveform and isolate the corresponding letters.

Another misarticulation identified is pronouncing the letter /nja/ as /na/. This problem is found in some adults independent of regional or non-native background: they utter "oonjaal" as "oonaal". The waveform analysis for this case is shown in Figure 3; the top speech signal is the correct pronunciation and the bottom one the mispronunciation.

Suitable features (such as MFCC and LPC coefficients) are extracted from the normal as well as the disordered speech. Matching the features of corresponding phones in normal and disordered speech using DTW then gives a measure of similarity. Using this matching, the phone boundaries that contain the articulation disorder can be identified. Objective speech quality measures such as IS, LLR and LAR are then used to identify the articulation error and compute a score. The objective scores are compared with manual evaluation scores to validate their effectiveness.

1) Dynamic Time Warping: Since the normal speech and the disordered speech do not have exactly the same length, a simple one-to-one comparison of analysis windows from the two utterances is not possible. Therefore, in this work we use DTW, which is the most straightforward solution for aligning two time sequences of different lengths.

Given two speech patterns $X$ and $Y$, these patterns can be represented by the sequences $(x_1, x_2, \ldots, x_{T_x})$ and $(y_1, y_2, \ldots, y_{T_y})$, where the $x_i$ and $y_i$ are feature vectors. As we have noted, in general the sequence of $x_i$ will not have the same length as the sequence of $y_i$. In order to determine the distance between $X$ and $Y$, given that some distance function $d(x, y)$ exists, we need a meaningful way to align the vectors for the comparison. DTW is one way that such an alignment can be made. We define two warping functions, $\phi_x$ and $\phi_y$, which transform the indices of the vector sequences to a normalized time axis $k$. Thus we have

$$i_x = \phi_x(k), \qquad i_y = \phi_y(k), \qquad k = 1, 2, \ldots, T.$$

This gives us a mapping from $(x_1, x_2, \ldots, x_{T_x})$ to $(x_{\phi_x(1)}, x_{\phi_x(2)}, \ldots, x_{\phi_x(T)})$ and from $(y_1, y_2, \ldots, y_{T_y})$ to $(y_{\phi_y(1)}, y_{\phi_y(2)}, \ldots, y_{\phi_y(T)})$. With such a mapping, we are able to compute $d_\phi(X, Y)$ using these warping functions, giving us the total distance between the two patterns as

$$d_\phi(X, Y) = \frac{1}{M_\phi} \sum_{k=1}^{T} d\big(x_{\phi_x(k)}, y_{\phi_y(k)}\big)\, m(k),$$

where $m(k)$ is a path weight and $M_\phi$ is a normalization factor. Thus, all that remains is the specification of the path $\phi$ indicated in the above equation. The most common technique is to take $\phi$ as the minimum over all possible paths, subject to certain constraints:

$$d(X, Y) = \min_{\phi} d_\phi(X, Y).$$
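As an illustration, the alignment described above can be sketched with a minimal DTW implementation in Python. This is a sketch under stated assumptions, not the authors' code: it uses a Euclidean frame distance, unit path weights, and the sum of the sequence lengths as the normalization factor; all names are our own.

```python
import numpy as np

def dtw_distance(X, Y):
    """Align feature sequences X (Tx x d) and Y (Ty x d) with DTW.

    Returns the normalized total distance, assuming a Euclidean
    frame distance d(x, y), unit path weights m(k) = 1, and the
    normalization factor M = Tx + Ty.
    """
    X, Y = np.atleast_2d(X), np.atleast_2d(Y)
    Tx, Ty = len(X), len(Y)
    # local frame-to-frame distances d(x_i, y_j)
    d = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    # accumulated cost with the usual step pattern (diagonal, up, left)
    D = np.full((Tx + 1, Ty + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Tx + 1):
        for j in range(1, Ty + 1):
            D[i, j] = d[i - 1, j - 1] + min(D[i - 1, j - 1],
                                            D[i - 1, j],
                                            D[i, j - 1])
    return D[Tx, Ty] / (Tx + Ty)
```

For phoneme comparison, the rows of X and Y would be per-frame feature vectors such as MFCC or LPC coefficients.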

 

2) Objective Quality Measures: Objective speech quality measures are calculated from the normal and the disordered speech using a mathematical formula. They do not require human listeners, and so are less expensive and less time consuming. Objective measures are used to get a rough estimate of quality.

 

SNR Measures: The signal-to-noise ratio (SNR) is one of the oldest and most widely used objective measures. It is mathematically simple to calculate, but it requires both the distorted and the undistorted (clean) speech samples. The SNR can be calculated as follows:

$$\mathrm{SNR} = 10 \log_{10} \frac{\sum_{n=1}^{N} x^2(n)}{\sum_{n=1}^{N} \big(x(n) - y(n)\big)^2},$$

where $x(n)$ is the clean speech, $y(n)$ the distorted speech, and $N$ the number of samples.
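A minimal sketch of this computation (assuming time-aligned, equal-length signals; names are our own):

```python
import numpy as np

def snr_db(x, y):
    """Classical SNR in dB between clean speech x(n) and distorted
    speech y(n): 10 log10( sum x^2 / sum (x - y)^2 )."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    noise = x - y
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum(noise ** 2))
```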

 

LP-Based Measures: The speech production process can be modelled efficiently with a linear prediction (LP) model. The following objective measures use the distance between two sets of linear prediction coefficients (LPCs) calculated from the normal and the disordered speech.

 

1. The Itakura-Saito Distance Measure: The IS distortion measure is calculated based on the following equation:

$$d_{IS}(a_x, a_y) = \frac{\sigma_x^2}{\sigma_y^2} \cdot \frac{a_y^T R_x a_y}{a_x^T R_x a_x} + \log\frac{\sigma_y^2}{\sigma_x^2} - 1,$$

where $\sigma_x^2$ and $\sigma_y^2$ represent the all-pole gains for the standard healthy speech and the test patient's speech, $a_x$ and $a_y$ are the healthy-speech and patient-speech LPC coefficient vectors, respectively, and $R_x$ is the autocorrelation matrix of $x(n)$, the sampled speech of the healthy speaker.

 

2. The Log-Likelihood Ratio: The LLR is similar to the IS measure. While the IS measure incorporates the gain factor, the LLR considers only the difference between the general spectral envelopes. The following equation is used to compute the LLR:

$$d_{LLR}(a_x, a_y) = \log\frac{a_y^T R_x a_y}{a_x^T R_x a_x}.$$

3. The Log-Area Ratio: The LAR is a speech quality assessment measure based on the dissimilarity of the LPC coefficients of the normal and the disordered speech. The LAR uses the reflection coefficients to calculate the difference and is expressed by the following equation:

$$d_{LAR} = \sqrt{\frac{1}{p} \sum_{i=1}^{p} \left[\log\frac{1 + r_x(i)}{1 - r_x(i)} - \log\frac{1 + r_y(i)}{1 - r_y(i)}\right]^2},$$

where $p$ is the order of the LPC coefficients, and $r_x(i)$ and $r_y(i)$ are the $i$th reflection coefficients of the healthy and the patient's speech signals.
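The three LP-based distances above can be sketched as follows. This is an illustrative implementation, not the authors' code: the LPC coefficients, all-pole gains and reflection coefficients are obtained with the Levinson-Durbin recursion from estimated autocorrelations, and the IS, LLR and LAR formulas are then applied directly.

```python
import numpy as np

def autocorr(x, p):
    """Biased autocorrelation estimates r(0)..r(p)."""
    x = np.asarray(x, dtype=float)
    return np.array([np.dot(x[:len(x) - i], x[i:]) for i in range(p + 1)])

def levinson(r, p):
    """Levinson-Durbin recursion: returns the LPC vector a = [1, a1..ap],
    the prediction-error power (all-pole gain sigma^2) and the
    reflection coefficients k(1)..k(p)."""
    a = np.zeros(p + 1)
    a[0] = 1.0
    e = r[0]
    k = np.zeros(p)
    for i in range(1, p + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        ki = -acc / e
        k[i - 1] = ki
        a_new = a.copy()
        for j in range(1, i):
            a_new[j] = a[j] + ki * a[i - j]
        a_new[i] = ki
        a = a_new
        e *= (1.0 - ki * ki)
    return a, e, k

def is_llr_lar(x, y, p=6):
    """IS, LLR and LAR distances between a healthy frame x and a
    patient frame y, using LPC analysis of order p."""
    rx, ry = autocorr(x, p), autocorr(y, p)
    ax, ex, kx = levinson(rx, p)
    ay, ey, ky = levinson(ry, p)
    # (p+1) x (p+1) Toeplitz autocorrelation matrix of the healthy speech
    Rx = np.array([[rx[abs(i - j)] for j in range(p + 1)]
                   for i in range(p + 1)])
    num = ay @ Rx @ ay          # a_y^T R_x a_y
    den = ax @ Rx @ ax          # a_x^T R_x a_x
    d_llr = np.log(num / den)
    d_is = (ex / ey) * (num / den) + np.log(ey / ex) - 1.0
    d_lar = np.sqrt(np.mean((np.log((1 + kx) / (1 - kx))
                             - np.log((1 + ky) / (1 - ky))) ** 2))
    return d_is, d_llr, d_lar
```

Because $a_x$ minimizes the quadratic form $a^T R_x a$ over monic coefficient vectors, both the LLR and the IS distance are non-negative and vanish when the two frames share the same LP model.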

Log Cepstrum Distance: The LCD is an estimate of the log-spectrum distance between normal and disordered speech. The cepstrum is calculated by taking the logarithm of the spectrum and converting it back to the time domain. The LCD can be calculated as follows:

$$d_{LCD}(c_x, c_y) = \frac{10}{\ln 10} \sqrt{2 \sum_{i=1}^{P} \big(c_x(i) - c_y(i)\big)^2},$$

where $c_x$ and $c_y$ are the cepstrum vectors for normal and disordered speech, and $P$ is the order.
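A sketch of the LCD using an FFT-based real cepstrum (our assumption for illustration; the cepstrum could equally be derived from the LPC coefficients):

```python
import numpy as np

def real_cepstrum(x, n_fft=512):
    """Real cepstrum: inverse FFT of the log magnitude spectrum.
    A small floor avoids log(0) on silent frames."""
    spec = np.abs(np.fft.rfft(x, n_fft))
    return np.fft.irfft(np.log(spec + 1e-12))

def log_cepstrum_distance(x, y, P=12, n_fft=512):
    """LCD between signals x and y over cepstral orders 1..P."""
    cx = real_cepstrum(x, n_fft)[1:P + 1]
    cy = real_cepstrum(y, n_fft)[1:P + 1]
    return (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum((cx - cy) ** 2))
```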

 

C. Subjective Quality Measures

Subjective evaluation of the speech sound disorder is required to validate the scores obtained from the objective measures. Subjective evaluation is done by taking the opinion of a set of listeners, who are requested to mark a score of 0 or 1 corresponding to normal or disordered speech played to them. The definition of a good speech sample is left to the listener. The final score for each utterance is then obtained by taking the mean opinion score over all listeners; this final score simply says whether the speech is normal or disordered.
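The listener scoring above amounts to averaging the binary marks per utterance (a trivial sketch; names are our own):

```python
def mean_opinion_score(listener_marks):
    """Final score per utterance: the mean of the 0 (normal) / 1
    (disordered) marks given by all listeners."""
    return sum(listener_marks) / len(listener_marks)
```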

A. Speech Database

The normal and disordered phonemes corresponding to each identified disorder were selected using waveform analysis, and features were extracted. DTW was then applied to the corresponding phonemes to align them in the time domain. Figure 5 shows the optimal frame match path between the standard healthy speech and the disordered speech. Here the distance between the normal and disordered speech was measured at the identified phoneme boundaries, which helped reduce speaker dependency to some extent. The objective measures were evaluated using both Mel frequency cepstral coefficients (MFCC) and LPC coefficients; the results obtained from the LPC coefficients showed good correlation with the subjective scores. The poorer performance with MFCC features may be due to the mel-scaling done during MFCC computation. The LPC coefficients were extracted with an order of only six, in order to reduce speaker dependency. One correctly prompted phoneme from a healthy speaker was used as the standard phoneme for calculating the objective quality measures.

 

B. Performance Evaluation

The five distortion measures (IS, LLR, LAR, SNR and LCD) were calculated for each of the three identified disorders. Table 1 shows the classification of normal and disordered speech for the first speech disorder based on the five distance measures and the DTW distance. N1 to N11 denote speech samples from normal speakers and D1 to D11 denote disordered speech samples. The ideal classification is also given in the table. Irrespective of the ordering within N1 to N11 or within D1 to D11, what is required is that all eleven normal speech samples appear in the first eleven positions of Table 1, followed by the eleven disordered samples. None of the distance measures achieves this exactly, but the minimum error occurs with the DTW-based distance, where only one disordered sample is misplaced and the false rejection rate (FRR) is therefore only 9.09%. IS and LCD exhibit poor performance, with an FRR of 36.36%. The equation for evaluating the FRR is

$$\mathrm{FRR} = \frac{FR}{TA + FR} \times 100,$$

where $TA$ is the number of phones annotated and recognized as normal and $FR$ is the number of phones recognized as disordered when the actual pronunciation is correct.
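The FRR computation above can be sketched as follows (labels are assumed to be 0 for normal and 1 for disordered; names are our own):

```python
def false_rejection_rate(true_labels, predicted_labels):
    """FRR (%) as FR / (TA + FR) * 100, where TA counts phones annotated
    and recognized as normal, and FR counts phones recognized as
    disordered although the actual pronunciation is correct."""
    ta = sum(1 for t, p in zip(true_labels, predicted_labels)
             if t == 0 and p == 0)
    fr = sum(1 for t, p in zip(true_labels, predicted_labels)
             if t == 0 and p == 1)
    return 100.0 * fr / (ta + fr)
```

With eleven normal samples and one of them misclassified, this gives 100/11 = 9.09%, matching the DTW result reported above.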

The same analysis for the other two speech disorders is shown in Tables 2 and 3. For the second disorder, all the objective measures give the same FRR. In the case of the third disorder, DTW and SNR show poor performance, but all other measures provide good results, with only one misplaced speech sample. Since the phoneme boundaries for /bha/ are correctly isolated in the waveform for the first disorder, the DTW distance gives the minimum FRR there; for the other two disorders this is not so. Table 4 lists the FRR of all the distance measures for the three speech disorders along with their averages; the average FRR of the LLR distance measure is the minimum.

Finally, all the objective distance measures are combined to obtain a single score for each normal and disordered utterance. The combined score was obtained by a majority-rule voting method that classifies a speech signal as normal or disordered based on the classification given by the majority (more than half) of the objective measures. Distance measures with low FRR were given more weight in this procedure. Table 5 gives the FRR obtained for the three disorder types with the combined method.
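The combination step can be sketched as a weighted majority vote. This is a sketch: the exact weighting scheme is not specified in the text beyond giving low-FRR measures more weight, so the weights here are illustrative parameters.

```python
def majority_vote(votes, weights=None):
    """Classify an utterance as disordered (1) when the (optionally
    weighted) votes for 'disordered' exceed half the total weight,
    otherwise as normal (0)."""
    if weights is None:
        weights = [1.0] * len(votes)
    disordered = sum(w for v, w in zip(votes, weights) if v == 1)
    return 1 if disordered > sum(weights) / 2 else 0
```

For example, with unweighted votes from the five measures plus DTW, three or fewer "disordered" votes out of six leave the sample classified as normal.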

Automatic evaluation of speech disorders is not an easy task. In this work, the disordered phoneme is isolated from the speech signal using waveform analysis. Spectral features are then derived from the disordered speech and matched with those of the corresponding normal phoneme using dynamic time warping (DTW) to align them in the time domain. To quantify the degree of disorder, objective speech quality measures such as the Itakura-Saito (IS) distance, log likelihood ratio (LLR), log area ratio (LAR), signal-to-noise ratio (SNR) and log cepstrum distance (LCD) were computed between normal and disordered phonemes. The objective scores were then compared with subjective evaluation scores to confirm the effectiveness of the objective measures. The combined objective score gives better correlation with the subjective score.

 

A fully automated articulation test system for the Malayalam language can be developed in future by combining the proposed method with an automatic speech recognition (ASR) system. Instead of manually selecting the phoneme boundaries from the waveform, the disordered speech can be given as input to the ASR, which will provide a semantically meaningful text output along with timestamps of the phonemes present in the input speech.