A preprocessing layer to convert raw audio signals to Mel spectrograms.
Inherits From: Layer, Operation
tf.keras.layers.MelSpectrogram(
fft_length=2048,
sequence_stride=512,
sequence_length=None,
window='hann',
sampling_rate=16000,
num_mel_bins=128,
min_freq=20.0,
max_freq=None,
power_to_db=True,
top_db=80.0,
mag_exp=2.0,
min_power=1e-10,
ref_power=1.0,
**kwargs
)
This layer takes a float32/float64 single or batched audio signal as
input and computes the Mel spectrogram using the Short-Time Fourier Transform
and Mel scaling. The input should be a 1D (unbatched) or 2D (batched) tensor
representing audio signals. The output will be a 2D or 3D tensor
representing Mel spectrograms.
A spectrogram is an image-like representation that shows the frequency spectrum of a signal over time. It uses the x-axis to represent time, the y-axis to represent frequency, and each pixel to represent intensity. Mel spectrograms are a special type of spectrogram that use the mel scale, which approximates how humans perceive sound. They are commonly used in speech and music processing tasks like speech recognition, speaker identification, and music genre classification.
Examples:
Unbatched audio signal
layer = keras.layers.MelSpectrogram(num_mel_bins=64,
sampling_rate=8000,
sequence_stride=256,
fft_length=2048)
layer(keras.random.uniform(shape=(16000,))).shape
(64, 63)
Batched audio signal
layer = keras.layers.MelSpectrogram(num_mel_bins=80,
sampling_rate=8000,
sequence_stride=128,
fft_length=2048)
layer(keras.random.uniform(shape=(2, 16000))).shape
(2, 80, 125)
Input shape | |
---|---|
1D (unbatched) or 2D (batched) tensor with shape: (..., samples). |
Output shape | |
---|---|
2D (unbatched) or 3D (batched) tensor with shape: (..., num_mel_bins, time). |
Methods
from_config
@classmethod
from_config( config )
Creates a layer from its config.
This method is the reverse of get_config,
capable of instantiating the same layer from the config
dictionary. It does not handle layer connectivity
(handled by Network), nor weights (handled by set_weights).
Args | |
---|---|
config | A Python dictionary, typically the output of get_config. |
Returns | |
---|---|
A layer instance. |
linear_to_mel_weight_matrix
linear_to_mel_weight_matrix(
num_mel_bins=20,
num_spectrogram_bins=129,
sampling_rate=8000,
lower_edge_hertz=125.0,
upper_edge_hertz=3800.0,
dtype='float32'
)
Returns a matrix to warp linear scale spectrograms to the mel scale.
Returns a weight matrix that can be used to re-weight a tensor
containing num_spectrogram_bins linearly sampled frequency information
from [0, sampling_rate / 2] into num_mel_bins frequency information
from [lower_edge_hertz, upper_edge_hertz] on the mel scale.
This function follows the Hidden Markov Model Toolkit (HTK) convention, defining the mel scale in terms of a frequency in hertz according to the following formula:
mel(f) = 2595 * log10(1 + f / 700)
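As a small illustration of the formula above, the following sketch evaluates it in plain Python. The helper names hertz_to_mel and mel_to_hertz are hypothetical and are not part of the Keras API; the inverse is simply the formula solved for f.

```python
import math


def hertz_to_mel(frequency_hz):
    """HTK-style mel value for a frequency given in hertz (hypothetical helper)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


def mel_to_hertz(mel):
    """Inverse of hertz_to_mel, obtained by solving the formula for f."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# Band edges taken from the default arguments in the signature above.
low_mel = hertz_to_mel(125.0)
high_mel = hertz_to_mel(3800.0)
print(low_mel, high_mel)
```

On this scale 0 Hz maps to 0 mel and 1000 Hz maps to roughly 1000 mel, which is the usual anchor point of the HTK definition.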
In the returned matrix, all the triangles (filterbanks) have a peak value of 1.0.
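To make the triangle construction concrete, here is a minimal NumPy sketch of a mel filterbank under stated assumptions: band edges spaced evenly on the HTK mel scale, and each triangle rising linearly (in mel) to 1.0 at its center. The function name mel_filterbank_sketch is hypothetical, and the result is not guaranteed to match the library's matrix value for value (for example, in how the DC bin is treated).

```python
import numpy as np


def mel_filterbank_sketch(num_mel_bins=20, num_spectrogram_bins=129,
                          sampling_rate=8000, lower_edge_hertz=125.0,
                          upper_edge_hertz=3800.0):
    """Illustrative triangular filterbank; not the library implementation."""

    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    # Frequency of each linear spectrogram bin, from 0 Hz to Nyquist.
    bin_hz = np.linspace(0.0, sampling_rate / 2.0, num_spectrogram_bins)
    bin_mel = hz_to_mel(bin_hz)[:, np.newaxis]  # (num_spectrogram_bins, 1)

    # num_mel_bins triangles need num_mel_bins + 2 edges, evenly spaced in mel.
    edges = np.linspace(hz_to_mel(lower_edge_hertz),
                        hz_to_mel(upper_edge_hertz),
                        num_mel_bins + 2)
    lower, center, upper = edges[:-2], edges[1:-1], edges[2:]

    # Rising and falling slopes of each triangle; broadcasting yields
    # an array of shape (num_spectrogram_bins, num_mel_bins).
    rising = (bin_mel - lower) / (center - lower)
    falling = (upper - bin_mel) / (upper - center)

    # Inside a band the smaller slope is the triangle height; outside, clip to 0.
    return np.maximum(0.0, np.minimum(rising, falling))


weights = mel_filterbank_sketch()
print(weights.shape)
```

Because the spectrogram bins are sampled at discrete frequencies, a column reaches exactly 1.0 only when a bin falls on the band's center; otherwise its largest entry is slightly below the peak.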
For example, the returned matrix A can be used to right-multiply a
spectrogram S of shape [frames, num_spectrogram_bins] of linear
scale spectrum values (e.g. STFT magnitudes) to generate a
"mel spectrogram" M of shape [frames, num_mel_bins].
# `S` has shape [frames, num_spectrogram_bins]
# `M` has shape [frames, num_mel_bins]
M = keras.ops.matmul(S, A)
The matrix can be used with keras.ops.tensordot to convert an
arbitrary rank Tensor of linear-scale spectral bins into the
mel scale.
# S has shape [..., num_spectrogram_bins].
# M has shape [..., num_mel_bins].
M = keras.ops.tensordot(S, A, 1)
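The shape bookkeeping in the two snippets above can be checked with a short NumPy sketch. NumPy is used here as a stand-in, on the assumption that keras.ops.matmul and keras.ops.tensordot follow the same semantics for these calls; the matrix A below is random filler with the right shape, not a real mel weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
num_spectrogram_bins, num_mel_bins = 129, 20

# Placeholder for the weight matrix; only its shape matters for this check.
A = rng.random((num_spectrogram_bins, num_mel_bins))

# Rank-2 case: S has shape [frames, num_spectrogram_bins].
S = rng.random((50, num_spectrogram_bins))
M = np.matmul(S, A)  # [frames, num_mel_bins]

# Higher-rank case: S_batch has shape [batch, frames, num_spectrogram_bins].
S_batch = rng.random((4, 50, num_spectrogram_bins))
# axes=1 contracts the last axis of S_batch with the first axis of A.
M_batch = np.tensordot(S_batch, A, 1)  # [batch, frames, num_mel_bins]

print(M.shape, M_batch.shape)
```

For a rank-2 input the two calls give the same result, so tensordot can be used uniformly when the number of leading dimensions is not known in advance.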
Args | |
---|---|
num_mel_bins | Python int. How many bands in the resulting mel spectrum. |
num_spectrogram_bins | An integer Tensor. How many bins there are in the source spectrogram data, which is understood to be fft_size // 2 + 1, i.e. the spectrogram only contains the nonredundant FFT bins. |
sampling_rate | An integer or float Tensor. Samples per second of the input signal used to create the spectrogram. Used to figure out the frequencies corresponding to each spectrogram bin, which dictates how they are mapped into the mel scale. |
lower_edge_hertz | Python float. Lower bound on the frequencies to be included in the mel spectrum. This corresponds to the lower edge of the lowest triangular band. |
upper_edge_hertz | Python float. The desired top edge of the highest frequency band. |
dtype | The DType of the result matrix. Must be a floating point type. |
Returns | |
---|---|
A tensor of shape [num_spectrogram_bins, num_mel_bins]. |
symbolic_call
symbolic_call(
*args, **kwargs
)