The short-time Fourier transform

The Fourier transform is a general method to represent time-domain data as frequency-domain data. For discrete data, such as audio sampled at a fixed rate, the Short-Time Fourier Transform (SFT) is used: the stream of sound data is split into blocks (also called frames) of fixed length (N samples), and each frame is analyzed separately to produce a spectrum.

The basic idea of the SFT is to take an array of N time-domain waveform samples (left) and process it to produce a new array of frequency-domain spectrum samples (right). The output contains exactly as many samples as the input.

inset_100517.jpg 
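
The "N samples in, N samples out" property can be illustrated with a minimal sketch. It assumes plain Python and a naive discrete Fourier transform, not the optimized transform UltraVox XT uses internally; the function name dft and the 8-sample test frame are hypothetical, chosen only for illustration.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform: N samples in, N complex values out."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


# Hypothetical frame: N = 8 samples of a cosine that completes 2 cycles per frame.
N = 8
frame = [math.cos(2 * math.pi * 2 * t / N) for t in range(N)]

spectrum = dft(frame)
magnitudes = [round(abs(c), 6) for c in spectrum]

print(len(frame), len(spectrum))  # both arrays have N entries
print(magnitudes)
```

For this frame the magnitude is concentrated at index 2 and, mirrored, at index 6 (that is, N - 2). The mirrored entry in the second half of the array carries no independent information for a real-valued signal.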

 

The length of the SFT frame (a time interval T containing N samples, at a sampling frequency fs = N/T) determines the spacing df of the frequency bins that can be represented in the spectrum: df = fs/N = 1/T. In practice, only the first half of the bandwidth B corresponds to frequencies that can be distinguished in the sampled signal. The second half consists of alias frequencies, that is, frequencies of waveforms that, when sampled, would produce exactly the same samples as the signal itself. For this reason, the frequency spectrum in UltraVox XT is limited to a maximum frequency fmax equal to fs/2 (generally called the Nyquist frequency). The spectrum itself is therefore made of N/2 lines spaced df apart: 0, df, 2*df, ... up to (N/2 - 1)*df, just below fmax = (N/2)*df (see the figure below).
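
As a worked example of these relations, the sketch below computes df, T and fmax for one combination of settings. The sampling frequency of 250 kHz and the frame length of 1024 samples are assumptions chosen for illustration, not UltraVox XT defaults, and spectrum_frequencies is a hypothetical helper name.

```python
def spectrum_frequencies(fs, n):
    """Return the bin spacing df and the frequencies of the first n/2 bins."""
    df = fs / n
    return df, [k * df for k in range(n // 2)]


fs = 250_000  # assumed sampling frequency, in Hz
n = 1024      # assumed SFT frame length, in samples

df, freqs = spectrum_frequencies(fs, n)
frame_duration = n / fs  # T, in seconds
nyquist = fs / 2         # fmax, in Hz

print(f"T = {frame_duration * 1000:.3f} ms")
print(f"df = {df:.3f} Hz (1/T = {1 / frame_duration:.3f} Hz)")
print(f"fmax = {nyquist:.0f} Hz, spectrum lines = {len(freqs)}")
```

With these assumed values, a frame lasts about 4.1 ms and the bins are about 244 Hz apart. Doubling the frame length halves df but doubles T, which is the trade-off between frequency resolution and time resolution.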

inset_000518.jpg 

 

From the SFT to a frequency spectrum, and on to the spectrogram in UltraVox XT. Each spectrum can be viewed as one “slice” of the spectrogram, and the value chosen for the SFT length is the width of one “pixel” of the spectrogram. Note that only the first half of the frequency array (compare this with the previous figure) is used to build the spectrum.
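
The way a spectrogram is assembled from slices can be sketched as follows. This is a simplified illustration under stated assumptions, not UltraVox XT's implementation: frames are consecutive and non-overlapping, no window function is applied, and a naive transform is used. The names half_spectrum and spectrogram, and the test signal, are hypothetical.

```python
import cmath
import math


def half_spectrum(frame):
    """Magnitudes of the first N/2 frequency bins of one frame (naive DFT)."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2)
    ]


def spectrogram(samples, frame_length):
    """Split samples into consecutive frames and compute one slice per frame."""
    n_frames = len(samples) // frame_length
    return [
        half_spectrum(samples[i * frame_length:(i + 1) * frame_length])
        for i in range(n_frames)
    ]


# Assumed test signal: 1 s sampled at 64 Hz, an 8 Hz tone followed by a 16 Hz tone.
fs = 64
frame_length = 16  # df = fs / frame_length = 4 Hz
samples = [
    math.sin(2 * math.pi * (8 if t < fs // 2 else 16) * t / fs)
    for t in range(fs)
]

slices = spectrogram(samples, frame_length)
# Index of the strongest bin in each slice; multiply by df to get a frequency.
peak_bins = [max(range(len(s)), key=s.__getitem__) for s in slices]
print(peak_bins)
```

Each slice holds frame_length / 2 magnitudes. With df = 4 Hz, the strongest bin moves from index 2 (8 Hz) in the first two slices to index 4 (16 Hz) in the last two, which is the change a spectrogram displays along its time axis.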

See also

• Frequency resolution and time resolution

• Overlap