Abstract
Music plays an important role in our daily life. With the development of deep learning and modern generation techniques, researchers have done plenty of works on automatic music generation. However, due to the special requirements of both melody and arrangement, most of these methods have limitations when applying to multi-track music generation. Some critical factors related to the quality of music are not well addressed, such as chord progression, rhythm pattern, and musical style. In order to tackle the problems and ensure the harmony of multi-track music, in this article, we propose an end-to-end melody and arrangement generation framework to generate a melody track with several accompany tracks played by some different instruments. To be specific, we first develop a novel Chord based Rhythm and Melody Cross-Generation Model to generate melody with a chord progression. Then, we propose a Multi-Instrument Co-Arrangement Model based on multi-task learning for multi-track music arrangement. Furthermore, to control the musical style of arrangement, we design a Multi-Style Multi-Instrument Co-Arrangement Model to learn the musical style with adversarial training. Therefore, we can not only maintain the harmony of the generated music but also control the musical style for better utilization. Extensive experiments on a real-world dataset demonstrate the superiority and effectiveness of our proposed models.
- Howard Anton and Chris Rorres. 2013. Elementary Linear Algebra, Binder Ready Version: Applications Version. John Wiley 8 Sons.Google Scholar
- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations (ICLR'15).Google Scholar
- Pierre Baldi, Yves Chauvin, Tim Hunkapiller, and Marcella A McClure. 1994. Hidden Markov models of biological primary sequence information. Proceedings of the National Academy of Sciences 91, 3 (1994), 1059--1063.Google ScholarCross Ref
- Judith O. Becker. 2004. Deep Listeners: Music, Emotion, and Trancing. Vol. 2. Indiana University Press.Google Scholar
- Geoffray Bonnin and Dietmar Jannach. 2015. Automated generation of music playlists: Survey and experiments. ACM Computing Surveys 47, 2 (2015), 26.Google ScholarDigital Library
- Léon Bottou. 2010. Large-scale machine learning with stochastic gradient descent. In Proceedings of the 19th International Conference on Computational Statistics. Springer, 177--186.Google ScholarCross Ref
- Mason Bretan, Gil Weinberg, and Larry Heck. 2016. A unit selection methodology for music generation using deep neural networks. arXiv preprint arXiv:1612.03789 (2016).Google Scholar
- Jean-Pierre Briot, Gaëtan Hadjeres, and François Pachet. 2017. Deep learning techniques for music generation-a survey. arXiv preprint arXiv:1709.01620 (2017).Google Scholar
- Gino Brunner, Andres Konrad, Yuyi Wang, and Roger Wattenhofer. 2018. MIDI-VAE: Modeling dynamics and instrumentation of music with applications to style transfer. In 19th International Society for Music Information Retrieval Conference (ISMIR'18).Google Scholar
- Pietro Casella and Ana Paiva. 2001. Magenta: An architecture for real time automatic composition of background music. In Proceedings of theInternational Workshop on Intelligent Virtual Agents. Springer, 224--232.Google ScholarCross Ref
- Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder–Decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. 103–111.Google ScholarCross Ref
- Parag Chordia, Avinash Sastry, and Sertan Şentürk. 2011. Predictive tabla modelling using variable-length Markov and hidden Markov models. Journal of New Music Research 40, 2 (2011), 105--118.Google ScholarCross Ref
- Hang Chu, Raquel Urtasun, and Sanja Fidler. 2016. Song from pi: A musically plausible network for pop music generation. arXiv preprint arXiv:1611.03477 (2016).Google Scholar
- Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning. ACM, 160--167.Google ScholarDigital Library
- Darrell Conklin. 2003. Music generation from statistical models. In Proceedings of the AISB 2003 Symposium on Artificial Intelligence and Creativity in the Arts and Sciences. 30--35.Google Scholar
- Shuqi Dai, Zheng Zhang, and Gus Xia. 2018. Music style transfer issues: A position paper. arXiv preprint arXiv:1803.06841 (2018).Google Scholar
- Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. 2015. Multi-task learning for multiple language translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics. 1723--1732.Google ScholarCross Ref
- Hao-Wen Dong, Wen-Yi Hsiao, Li-Chia Yang, and Yi-Hsuan Yang. 2018. Musegan: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment. In Thirty-Second AAAI Conference on Artificial Intelligence.Google Scholar
- Sean R. Eddy. 1996. Hidden Markov models. Current Opinion in Structural Biology 6, 3 (1996), 361--365.Google ScholarCross Ref
- Franco Fabbri. 2007. Browsing music spaces: Categories and the musical mind. In Proceedings of the International Association for the Study of Popular Music.Google Scholar
- Yanjie Fu, Hui Xiong, Yong Ge, Zijun Yao, Yu Zheng, and Zhi-Hua Zhou. 2014. Exploiting geographic dependencies for real estate appraisal: A mutual perspective of ranking and clustering. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1047--1056.Google ScholarDigital Library
- Ross Girshick. 2015. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision. 1440--1448.Google ScholarDigital Library
- Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems. 2672--2680.Google ScholarDigital Library
- Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’13). IEEE, 6645--6649.Google ScholarCross Ref
- Gaëtan Hadjeres, François Pachet, and Frank Nielsen. 2017. Deepbach: a steerable model for bach chorales generation. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70. JMLR. org, 1362–1371.Google Scholar
- Christopher Harte, Mark Sandler, and Martin Gasser. 2006. Detecting harmonic change in musical audio. In Proceedings of the 1st ACM Workshop on Audio and Music Computing Multimedia. ACM, 21--26.Google ScholarDigital Library
- Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsuruoka, and Richard Socher. 2017. A joint many-task model: Growing a neural network for multiple NLP tasks. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 1923–1933.Google ScholarCross Ref
- Nanzhu Jiang, Peter Grosche, Verena Konz, and Meinard Müller. 2011. Analyzing chroma feature types for automated chord recognition. In Proceedings of the 42nd Audio Engineering Society Conference. Audio Engineering Society.Google Scholar
- Daniel Johnson. 2015. Composing music with recurrent neural networks.Google Scholar
- Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 655–665.Google ScholarCross Ref
- Alex Kendall, Yarin Gal, and Roberto Cipolla. 2018. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7482–7491.Google Scholar
- Diederik P. Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013).Google Scholar
- Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc’Aurelio Ranzato. 2017. Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043 (2017).Google Scholar
- Vladimir I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Doklady Akademii Nauk SSSR 163, 4 (1966), 707–710.Google Scholar
- Bei Liu, Jianlong Fu, Makoto P. Kato, and Masatoshi Yoshikawa. 2018. Beyond narrative description: Generating poetry from images by multi-adversarial training. In Proceedings of the 2018 ACM Multimedia Conference on Multimedia Conference. ACM, 783--791.Google ScholarDigital Library
- Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2016. Recurrent neural network for text classification with multi-task learning. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence. 2873–2879.Google Scholar
- Qi Liu, Zhenya Huang, Yu Yin, Enhong Chen, Hui Xiong, Yu Su, Guoping Hu. 2019. EKT: Exercise-aware knowledge tracing for student performance prediction. IEEE Transactions on Knowledge and Data Engineering (2019).Google Scholar
- Qi Liu, Guifeng Wang, Hongke Zhao, Chuanren Liu, Tong Xu, and Enhong Chen. 2017. Enhancing campaign design in crowdfunding: A product supply optimization perspective. In Proceedings of the 26th International Joint Conference on Artificial Intelligence. 695--702.Google ScholarCross Ref
- Mingsheng Long and Jianmin Wang. 2015. Learning multiple tasks with deep relationship networks. arXiv preprint arXiv:1506.02117 (2015).Google ScholarDigital Library
- Chien-Yu Lu, Min-Xin Xue, Chia-Che Chang, Che-Rung Lee, and Li Su. 2019. Play as you like: Timbre-enhanced multi-modal music style transfer. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 1061--1068.Google ScholarDigital Library
- Prasanta Chandra Mahalanobis. 1936. On the generalized distance in statistics. In Proceedings of the National Institute of Science of India.Google Scholar
- Ishan Misra, Abhinav Shrivastava, Abhinav Gupta, and Martial Hebert. 2016. Cross-stitch networks for multi-task learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3994--4003.Google ScholarCross Ref
- Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529.Google Scholar
- Olof Mogren. 2016. C-RNN-GAN: Continuous recurrent neural networks with adversarial training. arXiv preprint arXiv:1611.09904 (2016).Google Scholar
- François Pachet, Sony CSL Paris, Alexandre Papadopoulos, and Pierre Roy. 2017. Sampling variations of sequences for structured music generation. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR’17). 167--173.Google Scholar
- François Pachet and Pierre Roy. 2011. Markov constraints: Steerable generation of Markov sequences. Constraints 16, 2 (2011), 148--172.Google ScholarDigital Library
- Sebastian Ruder, Joachim Bingel, Isabelle Augenstein, and Anders Søgaard. 2017. Sluice networks: Learning what to share between loosely related tasks. arXiv preprint arXiv:1705.08142 (2017).Google Scholar
- Romain Sabathé, Eduardo Coutinho, and Björn Schuller. 2017. Deep recurrent music writer: Memory-enhanced variational autoencoder-based musical score composition and an objective measure. In Proceedings of the International Joint Conference on Neural Networks (IJCNN’17). IEEE, 3467--3474.Google ScholarCross Ref
- Paul Schmeling. 2011. Berklee Music Theory. Berklee Press.Google Scholar
- Heung-Yeung Shum, Xiao-dong He, and Di Li. 2018. From Eliza to XiaoIce: challenges and opportunities with social chatbots. Frontiers of Information Technology and Electronic Engineering 19, 1 (2018), 10--26.Google ScholarCross Ref
- Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the Advances in Neural Information Processing Systems. 3104--3112.Google Scholar
- Keiichi Tokuda, Takayoshi Yoshimura, Takashi Masuko, Takao Kobayashi, and Tadashi Kitamura. 2000. Speech parameter generation algorithms for HMM-based speech synthesis. In Proceedings of the 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Vol. 3. IEEE, 1315--1318.Google ScholarCross Ref
- Andries Van Der Merwe and Walter Schulze. 2011. Music generation with Markov models. IEEE MultiMedia 18, 3 (2011), 78--85.Google ScholarDigital Library
- Dominique T. Vuvan and Bryn Hughes. 2019. Musical style affects the strength of harmonic expectancy. Music 8 Science 2 (2019), 2059204318816066.Google Scholar
- Yanan Wang, Qi Liu, Chuan Qin, Tong Xu, Yijun Wang, Enhong Chen, and Hui Xiong. 2018. Exploiting topic-based adversarial neural network for cross-domain keyphrase extraction. In Proceedings of the 2018 IEEE International Conference on Data Mining (ICDM’18). IEEE, 597--606.Google ScholarCross Ref
- Li-Chia Yang, Szu-Yu Chou, and Yi-Hsuan Yang. 2017. MidiNet: A convolutional generative adversarial network for symbolic-domain music generation. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR’17).Google Scholar
- Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2017. Seqgan: Sequence generative adversarial nets with policy gradient. In Proceedings of the 31st AAAI Conference on Artificial Intelligence.Google Scholar
- Kun Zhang, Guangyi Lv, Le Wu, Enhong Chen, Qi Liu, Han Wu, and Fangzhao Wu. 2018. Image-enhanced multi-level sentence representation net for natural language inference. In Proceedings of the 2018 IEEE International Conference on Data Mining (ICDM’18). IEEE, 747--756.Google ScholarCross Ref
- Kai Zhang, Hefu Zhang, Qi Liu, Hongke Zhao, Hengshu Zhu, and Enhong Chen. 2019. Interactive attention transfer network for cross-domain sentiment classification. In Proceedings of the AAAI Conference on Artificial Intelligence.Google ScholarCross Ref
- Xiaofan Zhang, Feng Zhou, Yuanqing Lin, and Shaoting Zhang. 2016. Embedding label structures for fine-grained feature representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1114--1123.Google ScholarCross Ref
- Yu Zhang and Qiang Yang. 2017. A survey on multi-task learning. arXiv preprint arXiv:1707.08114 (2017).Google Scholar
- Hengshu Zhu, Enhong Chen, Kuifei Yu, Huanhuan Cao, Hui Xiong, and Jilei Tian. 2012. Mining personal context-aware preferences for mobile users. In Proceedings of the 2012 IEEE 12th International Conference on Data Mining. IEEE, 1212--1217.Google ScholarDigital Library
- Hongyuan Zhu, Qi Liu, Nicholas Jing Yuan, Chuan Qin, Jiawei Li, Kun Zhang, Guang Zhou, Furu Wei, Yuanchun Xu, and Enhong Chen. 2018. Xiaoice band: A melody and arrangement generation framework for pop music. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery 8 Data Mining. ACM, 2837--2846.Google ScholarDigital Library
Index Terms
- Pop Music Generation: From Melody to Multi-style Arrangement
Recommendations
XiaoIce Band: A Melody and Arrangement Generation Framework for Pop Music
KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data MiningWith the development of knowledge of music composition and the recent increase in demand, an increasing number of companies and research institutes have begun to study the automatic generation of music. However, previous models have limitations when ...
Pop Music Transformer: Beat-based Modeling and Generation of Expressive Pop Piano Compositions
MM '20: Proceedings of the 28th ACM International Conference on MultimediaA great number of deep learning based models have been recently proposed for automatic music composition. Among these models, the Transformer stands out as a prominent approach for generating expressive classical piano performance with a coherent ...
Structure-Enhanced Pop Music Generation via Harmony-Aware Learning
MM '22: Proceedings of the 30th ACM International Conference on MultimediaPop music generation has always been an attractive topic for both musicians and scientists for a long time. However, automatically composing pop music with a satisfactory structure is still a challenging issue. In this paper, we propose to leverage ...
Comments