Case study
Published: 20 March 2023

Efficient audio-based CNNs via filter pruning

Dr Arshdeep Singh, a machine learning researcher in sound working with Professor Mark D Plumbley as part of the "AI for Sound" (AI4S) project within the Centre for Vision, Speech and Signal Processing (CVSSP) at the University of Surrey, has been focusing on designing efficient and sustainable artificial intelligence and machine learning (AI-ML) models.

The issue

Recent trends in artificial intelligence (AI) employ convolutional neural networks (CNNs) [1, 2], which provide remarkable performance compared to other existing methods. However, the large size and high computational cost of CNNs are a bottleneck to deploying them on resource-constrained devices such as smartphones. Moreover, training CNNs for several hours emits a significant amount of CO2. For instance, a computing device (NVIDIA GPU RTX-2080 Ti) used to train CNNs for 48 hours generates CO2 equivalent to that emitted by an average car driven for 13 miles. To estimate CO2, we use an openly available tool [Link-1].
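
To make this concrete, the snippet below shows how such an estimate can be gathered around a training run. It uses the open-source codecarbon Python package purely as an illustration; this is an assumption, not necessarily the tool referenced as Link-1.

    # Illustrative only: codecarbon is one openly available CO2 estimator;
    # it may not be the exact tool referenced as Link-1.
    import time
    from codecarbon import EmissionsTracker

    tracker = EmissionsTracker()   # estimates CPU/GPU power draw and grid carbon intensity
    tracker.start()
    time.sleep(5)                  # stand-in for a long training run
    emissions_kg = tracker.stop()  # returns the estimated kg of CO2-equivalent
    print(f"Estimated emissions: {emissions_kg:.4f} kg CO2eq")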

Therefore, we aimed to compress CNNs:

  1. To reduce the computational complexity for faster inference.
  2. To reduce memory footprints for using underlying resources effectively.
  3. To reduce the number of computations during the training stage of CNNs, by analysing how many training examples are sufficient when fine-tuning the compressed CNNs to reach a performance similar to that of the uncompressed CNNs trained on all examples.

The solution

One direction for compressing CNNs is "pruning", where unimportant filters are explicitly removed from the original network to build a compact, or pruned, network. After pruning, the pruned network is fine-tuned to regain the lost performance. This study proposes a cosine distance-based greedy algorithm [3] that prunes similar filters in filter space, applied to openly available CNNs designed for audio scene classification [Link-2]. Further, we improve the efficiency of the proposed algorithm [3] by reducing the computational time of the pruning itself [4].
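
As a rough illustration of the idea (a minimal sketch under assumed PyTorch conventions, not the authors' published algorithm [3]), the snippet below greedily marks filters in one convolutional layer for removal by repeatedly finding the pair of filters with the smallest cosine distance and dropping one member of the pair:

    import torch

    def select_redundant_filters(weight: torch.Tensor, n_prune: int) -> list:
        """Greedily pick filters to prune by pairwise cosine distance.

        weight: conv weight of shape (out_channels, in_channels, kH, kW).
        n_prune: number of filters to remove.
        """
        # Flatten each filter into a vector and L2-normalise it.
        flat = weight.flatten(start_dim=1)
        flat = flat / (flat.norm(dim=1, keepdim=True) + 1e-12)

        # Pairwise cosine distance: 1 - cosine similarity.
        dist = 1.0 - flat @ flat.T

        keep = set(range(weight.shape[0]))
        pruned = []
        for _ in range(n_prune):
            idx = sorted(keep)
            sub = dist[idx][:, idx].clone()
            sub.fill_diagonal_(float("inf"))  # ignore self-distances
            # The closest remaining pair is the most redundant; drop one member.
            i, j = divmod(int(sub.argmin()), len(idx))
            keep.discard(idx[j])
            pruned.append(idx[j])
        return pruned

For example, for a first convolutional layer with weight shape (64, 3, 3, 3), select_redundant_filters(conv.weight.detach(), 16) returns 16 filter indices; removing those filters (together with the matching input channels of the next layer) yields the pruned network, which is then fine-tuned.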

The outcome

We find that the proposed pruning method reduces the number of computations per inference by 27% and the memory requirement by 25%, with less than a 1% drop in accuracy. During fine-tuning of the pruned CNNs, using 25% fewer training examples gives a performance similar to that obtained using all examples. We have made the proposed algorithm openly available [Link-3] for reproducibility, and provide a video presentation [Link-4] explaining the methodology and results of our published work [3].
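
A minimal sketch of that fine-tuning step, assuming a standard PyTorch setup (the dataset handling, optimiser, and hyperparameters below are illustrative, not the study's exact configuration):

    import torch
    from torch.utils.data import DataLoader, random_split

    def finetune_on_subset(pruned_model, train_dataset, frac=0.75, epochs=5):
        """Fine-tune a pruned model on a random subset of the training data."""
        # Keep only a fraction of the training examples, e.g. 75%.
        n_keep = int(frac * len(train_dataset))
        subset, _ = random_split(train_dataset, [n_keep, len(train_dataset) - n_keep])
        loader = DataLoader(subset, batch_size=32, shuffle=True)
        optimiser = torch.optim.Adam(pruned_model.parameters(), lr=1e-4)
        loss_fn = torch.nn.CrossEntropyLoss()
        pruned_model.train()
        for _ in range(epochs):
            for x, y in loader:
                optimiser.zero_grad()
                loss_fn(pruned_model(x), y).backward()
                optimiser.step()
        return pruned_model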

In addition, we reduce the computational time of the proposed pruning method by a factor of three without degrading performance [4, Link-5].

Open research practices/URL links

The proposed work uses the following Open Research practices:

  • Link-1: CO2 estimation tool:
  • Link-2: Pretrained CNNs for audio scene classification:
  • Link-3: Proposed pruning algorithm:
  • Link-4: Video presentation:
  • Link-5: Proposed efficient pruning algorithm:

See the corresponding poster

References

[1] Q Kong et al., "PANNs: Large-scale pretrained audio neural networks for audio pattern recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2880–2894, 2020.

[2] I Martín-Morató et al., "Low-complexity acoustic scene classification for multi-device audio: Analysis of DCASE 2021 challenge systems," in DCASE Workshop, pp. 85–89, 2021.

[3] A Singh and M D Plumbley, "," in INTERSPEECH, pp. 2433–2437, 2022.

[4] A Singh and M D Plumbley, "," accepted for ICASSP 2023.

Contact details

Arshdeep Singh (1) and Mark D. Plumbley (2)

1: Department of Computer Science and Electrical Engineering, University of Surrey, UK

2: EPSRC Fellow on the "AI for Sound" project, Professor of Signal Processing, University of Surrey, UK

Contact information (Arshdeep Singh)

Lead author job title: Research Fellow A

Lead author faculty: Faculty of Engineering and Physical Sciences

Lead author email: arshdeep.singh@surrey.ac.uk

Lead author ORCID: