Note-level singing melody transcription with transformers

Park, Jonggwon; Choi, Kyoyun; Oh, Seola; Kim, Leekyung; Park, Jonghun

doi:10.3233/IDA-227077

Article type: Research Article

Authors: Park, Jonggwon^a | Choi, Kyoyun^{b; *} | Oh, Seola^a | Kim, Leekyung^a | Park, Jonghun^a

Affiliations: [a] Department of Industrial Engineering and Institute for Industrial Systems Innovation, Seoul National University, Seoul, Korea | [b] Institute of Engineering Research, Seoul National University, Seoul, Korea

Correspondence: [*] Corresponding author: Kyoyun Choi, Institute of Engineering Research, Seoul National University, Seoul, Korea. E-mail: [email protected].

Abstract: Recognizing a singing melody from an audio signal in terms of the music notes’ pitch, onset, and offset, referred to as note-level singing melody transcription, has been studied as a critical task in the field of automatic music transcription. The task is challenging due to the different timbre and vibrato of each vocal and the ambiguity of onset and offset of the human voice compared with other instrumental sounds. This paper proposes a note-level singing melody transcription model using sequence-to-sequence Transformers. The singing melody annotation is expressed as a monophonic melody sequence and used as a decoder sequence. Overlapping decoding is introduced to solve the problem of the context between segments being broken. Applying pitch augmentation and adding a noisy dataset with data cleansing turn out to be effective in preventing overfitting and generalizing the model performance. Ablation studies demonstrate the effects of the proposed techniques in note-level singing melody transcription, both quantitatively and qualitatively. The proposed model outperforms other models in note-level singing melody transcription performance for all the metrics considered. For fundamental frequency metrics, the voice detection performance of the proposed model is comparable to that of a vocal melody extraction model. Finally, subjective human evaluation demonstrates that the results of the proposed models are perceived as more accurate than the results of a previous study.

Keywords: Automatic music transcription, deep learning, music information retrieval, sequence-to-sequence learning, singing melody transcription

DOI: 10.3233/IDA-227077

Journal: Intelligent Data Analysis, vol. 27, no. 6, pp. 1853-1871, 2023

Published: 20 November 2023