Case study
Published: 20 March 2023

Efficient audio-based CNNs via filter pruning

Dr Arshdeep Singh, a machine learning researcher in sound working with Professor Mark D Plumbley as part of the "AI for Sound" (AI4S) project within the Centre for Vision, Speech and Signal Processing (CVSSP) at the University of Surrey, has been focusing on designing efficient and sustainable artificial intelligence and machine learning (AI-ML) models.

The issue

Recent trends in artificial intelligence (AI) employ convolutional neural networks (CNNs) [1, 2], which provide remarkable performance compared to other existing methods. However, the large size and high computational cost of CNNs are a bottleneck to deploying them on resource-constrained devices such as smartphones. Moreover, training CNNs for several hours emits a significant amount of CO2. For instance, a computing device (NVIDIA GPU RTX-2080 Ti) used to train CNNs for 48 hours generates CO2 equivalent to that emitted by an average car driven for 13 miles. To estimate CO2, we use an openly available tool [Link-1].
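
To make this concrete, the snippet below shows how such an estimate can be gathered around a training run. It uses the open-source codecarbon Python package purely as an illustration; this is an assumption, not necessarily the tool referenced as Link-1.

    # Illustrative only: codecarbon is one openly available CO2 estimator;
    # it may not be the exact tool referenced as Link-1.
    import time
    from codecarbon import EmissionsTracker

    tracker = EmissionsTracker()   # estimates CPU/GPU power draw and grid carbon intensity
    tracker.start()
    time.sleep(5)                  # stand-in for a long training run
    emissions_kg = tracker.stop()  # returns the estimated kg of CO2-equivalent
    print(f"Estimated emissions: {emissions_kg:.4f} kg CO2eq")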

Therefore, we aimed to compress CNNs:

  1. To reduce the computational complexity for faster inference.
  2. To reduce memory footprints for using underlying resources effectively.
  3. To reduce the number of computations during the training stage of CNNs, by analysing how many training examples are sufficient when fine-tuning the compressed CNNs to reach a performance similar to that of the uncompressed CNNs trained on all examples.

The solution

One direction for compressing CNNs is "pruning", where unimportant filters are explicitly removed from the original network to build a compact, or pruned, network. After pruning, the pruned network is fine-tuned to regain the lost performance. This study proposes a cosine distance-based greedy algorithm [3] that prunes similar filters in filter space, applied to openly available CNNs designed for audio scene classification [Link-2]. Further, we improve the efficiency of the proposed algorithm [3] by reducing the computational time of the pruning itself [4].
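
As a rough illustration of the idea (a minimal sketch under assumed PyTorch conventions, not the authors' published algorithm [3]), the snippet below greedily marks filters in one convolutional layer for removal by repeatedly finding the pair of filters with the smallest cosine distance and dropping one member of the pair:

    import torch

    def select_redundant_filters(weight: torch.Tensor, n_prune: int) -> list:
        """Greedily pick filters to prune by pairwise cosine distance.

        weight: conv weight of shape (out_channels, in_channels, kH, kW).
        n_prune: number of filters to remove.
        """
        # Flatten each filter into a vector and L2-normalise it.
        flat = weight.flatten(start_dim=1)
        flat = flat / (flat.norm(dim=1, keepdim=True) + 1e-12)

        # Pairwise cosine distance: 1 - cosine similarity.
        dist = 1.0 - flat @ flat.T

        keep = set(range(weight.shape[0]))
        pruned = []
        for _ in range(n_prune):
            idx = sorted(keep)
            sub = dist[idx][:, idx].clone()
            sub.fill_diagonal_(float("inf"))  # ignore self-distances
            # The closest remaining pair is the most redundant; drop one member.
            i, j = divmod(int(sub.argmin()), len(idx))
            keep.discard(idx[j])
            pruned.append(idx[j])
        return pruned

For example, for a first convolutional layer with weight shape (64, 3, 3, 3), select_redundant_filters(conv.weight.detach(), 16) returns 16 filter indices; removing those filters (together with the matching input channels of the next layer) yields the pruned network, which is then fine-tuned.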

The outcome

We find that the proposed pruning method reduces the number of computations per inference by 27% and the memory requirement by 25%, with less than a 1% drop in accuracy. During fine-tuning of the pruned CNNs, using 25% fewer training examples gives a performance similar to that obtained using all examples. We have made the proposed algorithm openly available [Link-3] for reproducibility, and provide a video presentation [Link-4] explaining the methodology and results of our published work [3].
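
A minimal sketch of that fine-tuning step, assuming a standard PyTorch setup (the dataset handling, optimiser, and hyperparameters below are illustrative, not the study's exact configuration):

    import torch
    from torch.utils.data import DataLoader, random_split

    def finetune_on_subset(pruned_model, train_dataset, frac=0.75, epochs=5):
        """Fine-tune a pruned model on a random subset of the training data."""
        # Keep only a fraction of the training examples, e.g. 75%.
        n_keep = int(frac * len(train_dataset))
        subset, _ = random_split(train_dataset, [n_keep, len(train_dataset) - n_keep])
        loader = DataLoader(subset, batch_size=32, shuffle=True)
        optimiser = torch.optim.Adam(pruned_model.parameters(), lr=1e-4)
        loss_fn = torch.nn.CrossEntropyLoss()
        pruned_model.train()
        for _ in range(epochs):
            for x, y in loader:
                optimiser.zero_grad()
                loss_fn(pruned_model(x), y).backward()
                optimiser.step()
        return pruned_model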

In addition, we reduce the computational time of the proposed pruning method by a factor of three without degrading performance [4, Link-5].

Open research practices/URL links

The proposed work uses the following Open Research practices:

  • Link-1: CO2 estimation tool:
  • Link-2: Pretrained CNNs for audio scene classification:
  • Link-3: Proposed pruning algorithm:
  • Link-4: Video presentation:
  • Link-5: Proposed efficient pruning algorithm:

See the corresponding poster

References

[1] Q Kong et al., "PANNs: Large-scale pretrained audio neural networks for audio pattern recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2880–2894, 2020.

[2] I Martín-Morató et al., "Low-complexity acoustic scene classification for multi-device audio: Analysis of DCASE 2021 challenge systems," in DCASE Workshop, pp. 85–89, 2021.

[3] A Singh and M D Plumbley, "," in INTERSPEECH, pp. 2433–2437, 2022.

[4] A Singh and M D Plumbley, "," accepted for ICASSP 2023.

Contact details

Arshdeep Singh (1) and Mark D. Plumbley (2)

1: Department of Computer Science and Electrical Engineering, University of Surrey, UK

2: EPSRC Fellow on the "AI for Sound" project, Professor of Signal Processing, University of Surrey, UK

Contact information (Arshdeep Singh)

Lead author job title: Research Fellow A

Lead author faculty: Faculty of Engineering and Physical Sciences

Lead author email: arshdeep.singh@surrey.ac.uk

Lead author ORCID: