Abstract
Presentations have been an effective means of delivering information to an audience for many years. Over the past few decades, technological advancements have revolutionized the way humans deliver presentations. Conventionally, the quality of a presentation is evaluated through painstaking manual analysis by experts. Although expert feedback is effective in helping users improve their presentation skills, manual evaluation is costly and often unavailable to most individuals. In this work, we propose a novel multi-sensor self-quantification system for presentations, designed around a newly proposed assessment rubric. We present our analytics model with conventional ambient sensors (i.e., static cameras and a Kinect sensor) and emerging wearable egocentric sensors (i.e., Google Glass). In addition, we performed a cross-correlation analysis of the speaker's vocal behavior and body language. The proposed framework is evaluated on a new presentation dataset, namely, the NUS Multi-Sensor Presentation dataset, which consists of 51 presentations covering a diverse range of topics. To validate the efficacy of the proposed system, we conducted a series of user studies with the speakers and an interview with an English communication expert, both of which yielded positive and promising feedback.
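The cross-correlation analysis mentioned above can be illustrated with a minimal sketch. This is not the paper's implementation: the feature names, dimensions, and synthetic data below are hypothetical, and it simply computes Pearson correlations between per-frame vocal features and body-language features to quantify speech-gesture coupling.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames = 500

# Hypothetical per-frame vocal features: e.g., pitch, loudness, speaking rate.
vocal = rng.normal(size=(n_frames, 3))
# Hypothetical body-language features, partly driven by the vocal signal
# to simulate speech-gesture coupling.
gesture = 0.7 * vocal[:, :2] + 0.3 * rng.normal(size=(n_frames, 2))

# Pearson cross-correlation between every vocal/gesture feature pair.
full = np.corrcoef(vocal, gesture, rowvar=False)  # (5, 5) joint correlation matrix
cross = full[:3, 3:]                              # rows: vocal, cols: gesture

# Entries near 1 indicate strongly coupled vocal and gestural behavior;
# entries near 0 indicate no linear relationship.
print(np.round(cross, 2))
```

In a real pipeline, the two feature streams would come from the audio analysis (e.g., PRAAT-style prosodic features) and the skeleton/pose tracking, time-aligned per frame before correlating.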
Index Terms
- A Multi-sensor Framework for Personal Presentation Analytics