# Speech coding

The need for data compression grows as technology expands. Compression makes it possible to store more data on the same media, so that the media can be used optimally.

Speech signals in particular contain a great deal of redundant information. Several techniques are available for finding this redundancy and thereby improving the compression.

To obtain any compression there must be a relationship between successive values. For speech signals this relationship is revealed by the auto-correlation.

Because speech carries relatively little information and the signal’s characteristics change slowly, very high compression rates can be achieved.

In this project, we will be doing the following:

• implement and simulate data compression for speech signals (speech coding)

• create a corresponding decoder

• compare the results of our different implementations in terms of compression rate and quality

# Theory

## Conversion and quantization of analog speech signals

Analog speech signals can be described as continuous functions in time, which means that an infinite number of amplitude levels exists within the signal’s finite amplitude range. Because of the limitations of the human ear, it is not necessary to use an infinite number of levels to represent the amplitude when converting the analog signal to digital samples.

The process of converting an analog sample amplitude at a specific time into a discrete amplitude of finite precision is called amplitude quantization. The conversion of analog signals can be described by the following flow chart: low-pass filtering is required to avoid aliasing of the source signal, and the sample frequency must be greater than twice the highest frequency component in the source signal (the Nyquist criterion). For speech signals a sample frequency of 8 kHz is usually used.

The quantizer’s job is to assign a value to each pulse generated by the sampler, so that the sequence of values represents the signal.

### Uniform quantization

A quantizer is uniform when the representation levels are uniformly spaced. As a result, the same number of bits is used to represent all amplitude levels.
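
The idea can be sketched in a few lines of code (Python here, although the project itself was carried out in Mathcad); the 3-bit mid-rise quantizer below is an illustrative assumption, not the exact quantizer used in the project:

```python
import numpy as np

def uniform_quantize(x, n_bits, x_max=1.0):
    """Mid-rise uniform quantizer: maps x in [-x_max, x_max] onto 2**n_bits levels."""
    n_levels = 2 ** n_bits
    step = 2 * x_max / n_levels                      # uniform step size
    # clip, map to a level index, then back to that level's mid-point
    idx = np.floor(np.clip(x, -x_max, x_max - 1e-12) / step)
    return (idx + 0.5) * step

x = np.linspace(-1, 1, 9)
y = uniform_quantize(x, n_bits=3)     # at most 8 distinct output values
```

Because the step size is constant, the quantization error is bounded by half a step everywhere, regardless of the signal level.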

### Error due to quantization

The quantizer will always introduce an error compared to the analog signal, because only a limited number of bits are available to represent the signal’s amplitude.

A way to minimize this error is to use a large number of levels to represent each sample, which of course means that transmission of the signal will require more bandwidth.

Linear quantizing can give a poor signal-to-noise ratio (SNR). Speech has a high dynamic range (40 dB), so the low-amplitude parts of the speech will have a poor signal-to-noise ratio due to the linearly distributed quantization levels.

Another way of delivering a better signal to the receiver is to use more levels for the amplitudes where the human ear is most sensitive. This can be done by using a non-linear quantizer or by companding the signal.

The quantization in most A/D converters is done by a uniform quantizer, but non-linear quantization can be applied in software afterwards. Using a non-uniform quantizer at the sender requires the receiver to use the inverse quantizer when regenerating the sound.

### u-Law and A-law companding

u-Law and A-law are the two companding standards used for audio signals in America and Europe, respectively.

Companding means compressing and expanding. u-Law and A-law companding compress the signal by using fewer bits to represent the loud passages and more bits for the weak passages. This is especially efficient for speech, because the weak passages carry most of the information. If fewer bits are used to represent each sample in the output signal, compression is obtained. The compression should have little impact on quality, because only information from the ear’s less sensitive range is removed.

u-Law companding with $\mu = 100$:

Compressor:

$$F(x) = \operatorname{sgn}(x)\,\frac{\ln(1 + \mu\,|x|)}{\ln(1 + \mu)}, \qquad |x| \le 1$$

Expander:

$$F^{-1}(y) = \operatorname{sgn}(y)\,\frac{(1 + \mu)^{|y|} - 1}{\mu}, \qquad |y| \le 1$$
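
As an illustration, the standard u-law compressor $F(x) = \operatorname{sgn}(x)\,\ln(1+\mu|x|)/\ln(1+\mu)$ and its inverse can be implemented directly; this is a minimal Python sketch (using $\mu = 255$, the value used in the experiments later in this report), not the project’s Mathcad implementation:

```python
import numpy as np

MU = 255.0  # the North American standard value, also used later in this report

def mu_compress(x, mu=MU):
    """mu-law compressor characteristic for x in [-1, 1]."""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_expand(y, mu=MU):
    """Inverse characteristic (expander): recovers x from the companded value y."""
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu

# Weak passages are boosted before quantization: an amplitude of 0.01 maps to
# roughly 0.23, so the uniform quantizer that follows spends more of its
# levels on the quiet parts of the signal.
boosted = mu_compress(0.01)
```

Applying the expander to the compressor’s output recovers the original signal, which is what the receiver does after inverse quantization.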

## LPC (Linear Predictive Coding)

LPC is a lossy way of compressing speech. Instead of sending data that relates directly to the original waveform, data describing the original signal is sent: the pitch, the gain, the filter coefficients, and information on whether the sound is voiced or unvoiced.

Linear Predictive Coding tries to predict the next sample in a signal from a set of parameters. Prediction is only possible on signals with a certain amount of redundancy. Speech is indeed such a signal, and the prediction can be quite accurate.

## Auto-Correlation

Autocorrelation can be used to analyze a signal for redundancy.

Autocorrelation gives an idea of how well the signal can be predicted. For unvoiced sounds (random noise) the prediction will be poor; for voiced signals it can be much better.

When the auto-correlation is not an impulse function, then the signal contains memory and can be compressed by a prediction-filter.

By analyzing the auto-correlation of the signal, the filter-coefficients for the prediction-filter can be determined.

Autocorrelation formula:

$$R_x(k) = \sum_{n} x(n)\,x(n+k)$$

The auto-correlation represents the dependency of a signal on the signal values that follow it, and is used as a measure for the memory, or redundancy, in a signal. The auto-correlation in the previous graphs represents the memory in every type of signal, independent of the different distributions.
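
The sample auto-correlation can be computed directly from its definition; the Python sketch below (with an illustrative white-noise signal) shows the “pulse at the origin” behaviour described next:

```python
import numpy as np

def autocorr(x, max_lag):
    """Biased sample autocorrelation R(k) = (1/N) * sum_n x[n] * x[n+k]."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    return np.array([np.dot(x[:N - k], x[k:]) / N for k in range(max_lag + 1)])

# White noise has no memory: R(k) is (approximately) a pulse at k = 0,
# whose height is the signal's energy per sample.
rng = np.random.default_rng(0)
noise = rng.standard_normal(4000)
R = autocorr(noise, 5)
```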

A signal with no memory, no matter the form of the PDF, is characterized by a pulse at the origin of the auto-correlation, whose height is the energy of the signal.

As for signals with memory, a non-constant PDF results in a very high degree of memory.

Thus it is possible to examine signals of any sort and determine their degree of memory from the auto-correlation; this, in turn, may lead to compression.

## Vocal Tract filter

A vocoder is a model of the speech generation mechanism. The vocal cords can generate periodic or turbulent sounds; the periodic sounds are the ones we call voiced.

In order for a vocoder to make a sound, it needs to know whether the signal is voiced or unvoiced. Given the pitch and the filter coefficients, a speech-like sound can be generated. The filter coefficients are obtained from a matrix equation:

$$A = R^{-1}P$$

where A is the vector of filter coefficients, R is a matrix built from the auto-correlation that describes the dependencies between samples in the signal, and P is the first part of the auto-correlation (lags 1 and upwards).

R is a square matrix of size (number of parameters) × (number of parameters).

R is a Toeplitz matrix, which means its values are constant along each diagonal; being an auto-correlation matrix, it is also symmetric, i.e. mirrored in the main diagonal.
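
A minimal sketch of this computation (Python; the project itself used Mathcad’s Yule-Walker function): build the Toeplitz matrix R and the vector P from the auto-correlation and solve for A. The AR(1) test signal is an illustrative assumption:

```python
import numpy as np

def lpc_coeffs(x, order):
    """Solve the normal equations A = R^{-1} P for the prediction coefficients.

    R is the (order x order) Toeplitz autocorrelation matrix, P the vector
    of autocorrelations at lags 1..order (the 'first part' in the text)."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    r = np.array([np.dot(x[:N - k], x[k:]) / N for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    P = r[1:order + 1]
    return np.linalg.solve(R, P)

# An AR(1) signal x[n] = 0.9*x[n-1] + e[n] should give a first coefficient
# near 0.9 and a second coefficient near 0.
rng = np.random.default_rng(1)
x = np.zeros(5000)
for n in range(1, 5000):
    x[n] = 0.9 * x[n - 1] + rng.standard_normal()
a = lpc_coeffs(x, order=2)
```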

## DPCM – Differential Pulse Code Modulation

DPCM is based on the fact that a signal taking just a few values carries less information than a signal with a large variance. Where quantizing reduces the number of occurring values directly, DPCM achieves the same with a smaller error signal as a result. First the filter coefficients are obtained; the predicted signal is then compared with the original signal, and the error signal is extracted. If the prediction is good, the error signal will be small and can be transmitted using much less bandwidth than the original signal.
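
A minimal sketch of the encoder/decoder pair (Python; a real DPCM system also quantizes the error signal before transmission, which is omitted here for clarity, and the 2-tap sinusoid predictor in the demo is an illustrative assumption):

```python
import numpy as np

def dpcm_encode(x, a):
    """Predict each sample from the previous ones with coefficients a
    and keep only the prediction error, which is what gets transmitted."""
    x = np.asarray(x, dtype=float)
    p = len(a)
    err = np.zeros_like(x)
    for n in range(len(x)):
        past = x[max(0, n - p):n][::-1]        # most recent sample first
        err[n] = x[n] - np.dot(a[:len(past)], past)
    return err

def dpcm_decode(err, a):
    """Reconstruct the signal at the receiver by re-running the predictor."""
    p = len(a)
    x = np.zeros(len(err))
    for n in range(len(err)):
        past = x[max(0, n - p):n][::-1]
        x[n] = err[n] + np.dot(a[:len(past)], past)
    return x

# For a pure sinusoid the 2-tap predictor [2*cos(w), -1] is exact, so the
# error signal is essentially zero after the first two samples.
sig = np.sin(0.1 * np.arange(200))
a = np.array([2.0 * np.cos(0.1), -1.0])
e = dpcm_encode(sig, a)
```

The error signal has a far smaller dynamic range than the original, which is exactly why it can be represented with fewer bits.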

# Test Signals

## Introduction

There exist two indications of redundancy in a source that raise the possibility of source compression:

– the characters of a source have different probabilities (PDF)

– the source contains memory

Memory and auto-correlation are closely linked, and the PDF is reflected in the auto-correlation as well.

To create such signal values, a transition probability matrix is used; the matrix specifies a Markov chain, as shown in the figure, and is used to create a set of N random numbers. The mean of all the signals is zero.

In this case we will only consider a signal with three characters (0, 1, -1).

P(b,a) means that the signal is in state a and that the probability of going from a to b is P(b,a).

The probabilities of the different letters from the alphabet (or state) can be calculated from this matrix.

### Random signal, no memory, constant PDF

P is the transition matrix and Prob the alphabet probability vector.

This signal has a transition matrix with identical elements throughout. The lack of memory is visible in the auto-correlation, which is a pulse at 0.
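
A sketch of such a generator (Python; note that here rows of the matrix index the current state, which is an assumption about the notation, and the uniform 3×3 matrix reproduces this memoryless, constant-PDF case):

```python
import numpy as np

def markov_signal(P, alphabet, N, seed=0):
    """Generate N samples from a Markov chain.

    P[a][b] is the probability of going from state a to state b
    (rows index the current state and must each sum to 1)."""
    rng = np.random.default_rng(seed)
    n_states = len(alphabet)
    state = rng.integers(n_states)
    out = np.empty(N)
    for i in range(N):
        out[i] = alphabet[state]
        state = rng.choice(n_states, p=P[state])
    return out

# Memoryless, constant PDF: every row of P is uniform over the three characters,
# so the next character never depends on the current one.
P = np.full((3, 3), 1.0 / 3.0)
sig = markov_signal(P, alphabet=[0.0, 1.0, -1.0], N=3000)
```

Adjusting the rows of P away from uniform introduces memory (or a non-constant PDF) and produces the other three test signals discussed below.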

### Random signal, memory, constant PDF

In this signal the PDF is kept constant, but the transition probabilities are adjusted to add memory to the characters. The resulting memory can be seen in the auto-correlation, in the position of the first zero crossing.

### Random signal, no memory, not constant PDF

This signal is constructed using only the alphabet probabilities; there is no interdependency between samples. Its auto-correlation is closely related to that of the signal with no memory and constant PDF; the only difference is the energy present.

### Random signal, memory, not constant PDF

This signal has an auto-correlation closely related to that of the signal with memory and constant PDF; the differences are the energy present and the degree of memory.

# Investigating different speech algorithms

## Process

In the process of LPC coding, we start off by sampling the signal in time windows.

The next step is to calculate the filter coefficients from the sampled signal. The predicted reconstruction can then be compared with the original signal to obtain an error signal.

To optimize the performance of the system, we have added a step of low-pass filtering, to test whether this could give an improvement. By low-pass filtering the reconstructed signal, we might obtain a smoother transition between the windows we are reconstructing.

The auto-correlation function is an indicator of the memory and can be used for predictive coding and thus for redundancy reduction. Linear Predictive Coding works by predicting each speech sample as a linear combination of previous samples. This is possible because of the coherence between adjacent samples.

The prediction is close to the original and the error signal should thus be small.

This “small error” can be represented by fewer bits than are needed to construct the original speech signal, thus reducing the transmission rate by removing the redundancy from the speech signal. This reduction can be calculated and optimized to approach the theoretical entropy.
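
As an illustration of how the entropy quantifies this (a Python sketch with made-up symbol streams):

```python
import numpy as np

def entropy_bits(symbols):
    """First-order entropy in bits/symbol, estimated from symbol frequencies."""
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# A source using all 4 symbols equally needs the full 2 bits/symbol...
uniform = [0, 1, 2, 3] * 100
# ...while a skewed source (like a small DPCM error signal, mostly near zero)
# carries redundancy that an entropy coder can remove.
skewed = [0] * 350 + [1] * 30 + [2] * 10 + [3] * 10
```

The closer the measured entropy is to the number of bits actually spent per sample, the less redundancy remains to be exploited.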

The LPC is performed on a speech signal; both an unvoiced and a voiced signal are examined.

The result is compared with the “Yule-Walker function” from the signal function pack of Mathcad.

The power from the original signal and from the LPC-error signal is compared to measure the quality of the prediction obtained.

### The unvoiced LPC experiment

An unvoiced signal is created by means of the Gaussian distribution. Memory is introduced in the signal by using the moving-average method.

The memory is then measured by the auto-correlation. From the auto-correlation the coefficients are calculated for a FIR filter, and the filter is excited to predict the following samples. The unvoiced signal seems to have a small group delay, and the predicted value does not follow the original signal well.

### The voiced LPC experiment

The voiced signal is constructed by adding two sinusoidal functions with different phases, each containing harmonics. From the auto-correlation it can be seen that the signal contains a high level of memory, and this is reflected in the better prediction of the voiced signal shown below. The coefficients are calculated as before to reconstruct the signal. The reconstruction appears to be very good and has no delay whatsoever.

This means a clean reconstruction of the voiced signal.

LPC is thus not very good at compressing noise. However, voiced signals are in most cases more interesting than unvoiced ones. This makes LPC, despite the poor unvoiced compression, very practical in many applications concerning speech coding. The unvoiced parts are a source of problems for the prediction. One way of dealing with this is to replace these parts with silence or white noise, as in PELPC, dealt with later in this report.

### Coefficients and Power calculations

The power calculations are used to obtain a measure of how much of the energy of the original signal is lost through compression, quantization or prediction.

The top coefficients are calculated by the Yulew function and the bottom coefficients manually. As can be seen, they match precisely.

## Comparison of a test and a speech signal

The signal created should have no memory, since memory makes it difficult to interpret the result. The PDF of a speech signal is therefore replicated by taking a Gaussian distribution and using Mathcad’s histogram function; the result is squared and normalized.

### Three different scaled signal levels

The probability density functions of the signals are displayed here.

All the signals have been scaled by a factor k.

The signal is tested with the compression parameter μ = 255. When using u-law companding, the signal is first passed through the compression characteristic:

The output of a linear uniform quantizer applied to the logarithmic signal is then passed through the inverse characteristic; the factor μ = 255 is used in this experiment. The amount of redundant information is calculated by means of the entropy, where the optimum entropy equals the number of bits used in the quantizer. Three signals with different magnitudes (signal levels) are computed and u-law companding is applied. The graph shows the entropy without compression, and the companded form above it, for the three signals with k equal to 1, 0.1 and 0.01 respectively.

The entropy obtained after compression increases, which means that redundant information has been removed.

### Signal to quantizing noise ratio (SQNR)

To quantize the signal there are 8 bits at our disposal. There is a clear difference between the companded and the non-companded signal, especially for small signal levels. The companded graph has a nearly constant signal-to-noise ratio, while the non-companded (linear) one starts from a poor signal-to-noise ratio and improves for larger signal magnitudes.
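
This behaviour can be reproduced with a short simulation (Python sketch; the 8-bit mid-rise quantizer and the uniformly distributed low-level test signal are illustrative assumptions):

```python
import numpy as np

def sqnr_db(x, x_hat):
    """Signal-to-quantizing-noise ratio in dB."""
    return 10 * np.log10(np.sum(x ** 2) / np.sum((x - x_hat) ** 2))

def quantize(x, n_bits):
    """Uniform mid-rise quantizer on [-1, 1]."""
    step = 2.0 / 2 ** n_bits
    return (np.floor(np.clip(x, -1, 1 - 1e-12) / step) + 0.5) * step

mu = 255.0
compress = lambda x: np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
expand = lambda y: np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

# A weak signal (k = 0.01): this is where companding makes the difference.
rng = np.random.default_rng(2)
small = 0.01 * rng.uniform(-1, 1, 10000)
linear_sqnr = sqnr_db(small, quantize(small, 8))
companded_sqnr = sqnr_db(small, expand(quantize(compress(small), 8)))
```

For the weak signal, almost all of the linear quantizer’s levels go unused, while the companded chain spends its levels where the signal actually is, so its SQNR is far higher.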

Companding is therefore used when signals contain a lot of low-level information, as is typical for speech signals.

The degradation of the companded signal only begins at about k = 0.05, whereas in a linear system the degradation has an immediate effect.

## Difference Pulse Code Modulation (DPCM)

### Introduction

DPCM is one of the basic methods of encoding a signal, in which the error of the prediction is transmitted in order to reduce the redundancy. The prediction gain is the ratio between the power of the signal and the power of its prediction error.

### Prediction gain (PG)

A way of reducing redundancy consists of windowing the signal and performing prediction on each window. As can be seen, the prediction gain is considerably larger for a certain window size. This is because prediction over a long stretch of speech is inaccurate, since the character of the speech can change substantially over such periods. A 20 ms window size is typical for speech analysis.
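
The prediction gain can be computed per window as follows (Python sketch; the predictor order and the two synthetic test windows, one voiced-like and one noise-like, are illustrative assumptions):

```python
import numpy as np

def prediction_gain_db(x, order=2):
    """PG = 10*log10(P_signal / P_error) using an LPC predictor fitted to x."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    r = np.array([np.dot(x[:N - k], x[k:]) / N for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])           # normal equations
    # prediction error over the part of the window where a full history exists
    err = np.array([x[n] - np.dot(a, x[n - order:n][::-1])
                    for n in range(order, N)])
    return 10 * np.log10(np.mean(x[order:] ** 2) / np.mean(err ** 2))

rng = np.random.default_rng(3)
voiced_like = np.sin(0.2 * np.arange(400)) + 0.01 * rng.standard_normal(400)
noise_like = rng.standard_normal(400)
pg_voiced = prediction_gain_db(voiced_like)
pg_noise = prediction_gain_db(noise_like)
```

The voiced-like window is highly predictable and yields a large gain, while white noise gives a gain near 0 dB, matching the observation that prediction buys little on unvoiced material.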

### Different windows with same length

A slow rate of change and a low amplitude range give a high prediction gain, while faster characteristics correspond to a smaller prediction gain. During speech, the power spectrum of the signal does not change enormously, so the most influential parameter for the PG is the error power: the higher the error power, the smaller the PG. In a signal with a fast rate of change and a high amplitude range, the coefficients grow in magnitude, and when inserted in the error power formula their influence becomes even bigger, making the FFT of the error signal larger and thus diminishing the prediction gain. The same is visible in the error signals. As for the auto-correlation, it is logical that a signal with a fast rate of change and a wide amplitude range will show a large memory in its auto-correlation, and the same holds for the auto-correlation of the error signal.

### DPCM and prediction gain

The prediction gain is an expression of how accurately a signal is predicted. Since DPCM involves transmitting the error, a high prediction gain implies a low error level, which needs few bits to be transmitted, while a low prediction gain implies a lot of uncertainty and therefore a larger prediction error.

While DPCM is successful at reproducing speech sequences if the length of the windows is correctly chosen, it requires far more bits to reproduce noise than ordinary speech. Nevertheless, DPCM is a clear improvement over classical linear prediction in the sense that the dynamic range of the error signal needs fewer bits for representation than the original.

## Pitch excited LPC (PELPC)

Pitch excited LPC is a method for coding speech. The signal is divided into windows of 20 ms length. For every window a set of parameters is calculated which is transmitted to the receiver. These parameters are the Vocal-tract filter coefficients, the gain and the pitch-frequency.

The all-pole filter coefficients can be found by prediction of the signal; the Yule-Walker algorithm is used for this. The pitch can be set to 0 in frames containing no speech, which already saves bits. The advantage of PELPC is the large compression obtained and the lower bit rate required. The sound quality is not as clear and sounds almost synthetic compared to classical Linear Predictive Coding.

### Principle

PELPC works by studying the signal in small periods and forming coefficients for these periods dynamically. The energy content is calculated to provide the gain of the window. Also, for each window, the pitch period is calculated to account for the periodicity of the pitched signal.

For each voiced window, pulses at the pitch frequency are placed; for the unvoiced windows, white noise is placed instead. This excitation is fed to the vocal-tract filter, which recreates the human voice. A frame of 20 ms can be used for the prediction of normal speech.

### Pitch detection

To find the pitch of the window, we use the auto-correlation. If we find in this process that the pitch length is zero, the signal must be unvoiced.

The gain for each window is found by the formula:

$$G = \sqrt{\frac{E_{\text{error}}}{E_{\text{source}}}}$$

where $E_{\text{error}}$, the energy of the prediction error, and $E_{\text{source}}$, the energy of the source, both need to be calculated.

Since the auto-correlation of some windows might be hard to analyze programmatically, the auto-correlation needs to be smoothed. This can be achieved by windowing the signal and/or by low-pass filtering it. The pitch period for a 20 ms window is assumed to lie in the region between 3 ms and 15 ms. In this region, the position of the first positive peak of sufficient amplitude, multiplied by the sampling period of the signal, is taken as the pitch period of the window. This determines the signal that will excite the all-pole filter to reconstruct the speech: for each voiced window a pulse train with period equal to the pitch length is created, while for unvoiced windows low-level white noise is placed.
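
The pitch detector described above can be sketched like this (Python; the 0.3 voicing threshold and the 100 Hz test tone are illustrative assumptions, and the 3 ms to 15 ms search range follows the text):

```python
import numpy as np

def detect_pitch(window, fs=8000, fmin=67, fmax=333):
    """Pitch period from the autocorrelation peak, searched between
    3 ms and 15 ms (roughly 67-333 Hz). Returns 0.0 for unvoiced windows."""
    x = np.asarray(window, dtype=float)
    x = x - x.mean()
    N = len(x)
    R = np.array([np.dot(x[:N - k], x[k:]) for k in range(N)])
    lo, hi = int(fs / fmax), int(fs / fmin)          # lag range: 3 ms .. 15 ms
    lag = lo + int(np.argmax(R[lo:hi]))
    if R[lag] < 0.3 * R[0]:                          # weak peak -> unvoiced
        return 0.0
    return lag / fs                                  # pitch period in seconds

fs = 8000
t = np.arange(int(0.02 * fs)) / fs                   # one 20 ms window
voiced = np.sin(2 * np.pi * 100 * t)                 # 100 Hz -> 10 ms period
```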

## Transmitting the data

One bit per time window decides whether the signal is voiced or unvoiced, and one value is transmitted for the pitch.

LPC-10 uses a 10th-order filter for voiced signals, and thus has to transmit 10 coefficient values; for unvoiced signals 4 values are used. Finally, a value representing the gain is transmitted.

By quantizing the different values transmitted, and adding an extra bit for synchronization, we end up with a total of 54 bits per time window, which gives a total rate of 2400 bits/second (in the LPC-10 standard the frame length is 22.5 ms, and 54 bits per frame correspond to 2400 bits/second).

### Using speech compression in a packet-switched network

The existing telephone network is built on the idea that a connection can be established between two nodes in the network. After the connection is made, the nodes are guaranteed a certain amount of bandwidth. This is convenient, because the nodes can design their compression algorithms for this fixed bandwidth. Unfortunately it also means that a lot of bandwidth is wasted if no (or little) information is sent through the channel.

In a packet-switched network (like the Internet), the bandwidth can be shared between several nodes, but the bandwidth cannot be guaranteed.

Because speech is usually transmitted as a continuous stream of data, problems can occur when the bandwidth cannot be guaranteed. One way to design a speech transport protocol for use in a packet-switched network would be to allow the nodes to use dynamic bandwidth.

One approach to dynamic bandwidth is to measure the bandwidth available and choose a compression algorithm that uses all of it. This means the nodes choose the best possible quality, but it requires a system for distributing the resources (bandwidth) on the network.

Another approach is to measure the quality of the signal to transmit, and use the compression algorithm that uses the least bandwidth while still giving a reasonable quality at the receiver. Block diagram of such a system: the error calculator must be able to detect when a compression technique gives a bad result, and then try another technique. The different compression algorithms might differ by only a few parameters, e.g. different numbers of filter coefficients. The protocol must state which compression technique is used at the sender, so the receiver is able to decode the signal correctly.

# Conclusion

There exist several techniques for compressing speech.

This can be seen in the entropy, or self-information, of the data source.

Speech data has a specific characteristic and can be modeled by creating a vocal tract filter.

The simplest technique involves μ-law companding of the data signal, where the entropy becomes visibly larger and the data can be better compressed. The other techniques depend on predicting values using filter coefficients (LPC), for example DPCM, where a compression ratio of four can be achieved. Because of the relatively low compression, the signal at the receiver is quite close to the original signal.

When using the pitch excited LPC algorithm, the filter coefficients are transmitted and a compression ratio of up to 26 can be achieved. Due to the high compression rate the signal at the receiver is somewhat distorted, but it is still easy to understand the message of the speech.