Author
Abrol, V
Sharma, P
Journal title
IEEE/ACM Transactions on Audio, Speech and Language Processing
DOI
10.1109/TASLP.2020.3001969
Volume
28
Last updated
2021-09-27T15:09:36.323+01:00
Page
1964-1973
Abstract
Recent advancements in modelling speech and audio signals using deep neural networks have shown that systems learning both the features and the classifier can be built directly from the raw signal. However, the performance of such end-to-end systems on the acoustic scene classification (ASC) task is still not on par with conventional systems built using spectral features. In this work, we propose a raw-waveform-based end-to-end ASC system using a convolutional neural network. In contrast to existing studies that use a non-hierarchical model, our framework leverages the hierarchical relations between acoustic categories to improve classification performance. To this end, our multi-task model is trained with coarse and fine labels that correspond to different levels of abstraction. To ensure consistency in the information encoded via the label hierarchy, the proposed framework uses a prototypical model. Such a model ensures that the learned representations match at least one of the global, learned categorical prototypes. We also employ a statistical pooling layer to aggregate hidden representations over multiple frames of the input audio signal. The statistics (mean and standard deviation) are concatenated to form a fixed-length audio embedding. This aggregation is done via an attention module so as to guide the model's attention even in the presence of relatively short or transient acoustic events. Further, the proposed framework incorporates two parallel feature processing pipelines that operate at different resolutions to extract important acoustic cues. Various experiments on publicly available datasets are performed to demonstrate the effectiveness of the proposed framework for the ASC task. Additional transfer learning experiments show the proposed model's ability to adapt to unseen data. Network analysis and visualizations demonstrate the importance of the individual modules and their impact on overall representation learning for the ASC task.
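The attention-guided statistical pooling described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `attentive_stats_pooling`, the single-hidden-layer scoring function, and the parameters `w`, `b`, and `v` are assumptions made for illustration only. The sketch shows how attention weights over frames yield a weighted mean and standard deviation, which are concatenated into an embedding whose length does not depend on the number of frames.

```python
import numpy as np

def attentive_stats_pooling(frames, w, b, v, eps=1e-8):
    """Pool frame-level features (T, D) into a fixed-length vector (2 * D,).

    Hypothetical scoring function (an assumption, not taken from the paper):
    score_t = v . tanh(frames_t @ w + b), followed by a softmax over frames.
    """
    scores = np.tanh(frames @ w + b) @ v            # one score per frame, shape (T,)
    scores = scores - scores.max()                  # shift for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum() # attention weights, sum to 1
    mean = weights @ frames                         # weighted mean, shape (D,)
    var = weights @ (frames ** 2) - mean ** 2       # weighted variance, shape (D,)
    std = np.sqrt(np.clip(var, eps, None))          # clip guards against tiny negatives
    return np.concatenate([mean, std]), weights

# Illustrative sizes and random parameters; none of these values come from the paper.
rng = np.random.default_rng(0)
T, D, A = 50, 16, 8                                 # frames, feature dim, attention dim
frames = rng.standard_normal((T, D))
w = rng.standard_normal((D, A)) * 0.1
b = np.zeros(A)
v = rng.standard_normal(A)

embedding, weights = attentive_stats_pooling(frames, w, b, v)
```

With `v` set to all zeros, every frame receives the same weight and the first half of the embedding reduces to the plain temporal mean, which is a quick way to sanity-check the sketch.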
Symplectic ID
1111888
Publication type
Journal Article
Publication date
12 June 2020
Created on 12 Jun 2020 - 17:30.