Abstract
Presentations have been an effective means of delivering information to an audience for many years. Over the past few decades, technological advancements have revolutionized the way humans deliver presentations. Conventionally, the quality of a presentation is evaluated through painstaking manual analysis by experts. Although expert feedback is effective in helping users improve their presentation skills, manual evaluation is costly and often unavailable to most individuals. In this work, we propose a novel multi-sensor self-quantification system for presentations, designed around a newly proposed assessment rubric. We present our analytics model with conventional ambient sensors (i.e., static cameras and a Kinect sensor) and emerging wearable egocentric sensors (i.e., Google Glass). In addition, we performed a cross-correlation analysis of the speaker's vocal behavior and body language. The proposed framework is evaluated on a new presentation dataset, namely, the NUS Multi-Sensor Presentation dataset, which consists of 51 presentations covering a diverse range of topics. To validate the efficacy of the proposed system, we conducted a series of user studies with the speakers and an interview with an English communication expert, both of which yielded positive and promising feedback.
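The cross-correlation analysis mentioned above can be illustrated with a minimal sketch. This is not the paper's implementation: the feature names, dimensions, and synthetic data below are hypothetical, and it simply computes Pearson correlations between per-frame vocal features and body-language features to quantify speech-gesture coupling.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames = 500

# Hypothetical per-frame vocal features: e.g., pitch, loudness, speaking rate.
vocal = rng.normal(size=(n_frames, 3))
# Hypothetical body-language features, partly driven by the vocal signal
# to simulate speech-gesture coupling.
gesture = 0.7 * vocal[:, :2] + 0.3 * rng.normal(size=(n_frames, 2))

# Pearson cross-correlation between every vocal/gesture feature pair.
full = np.corrcoef(vocal, gesture, rowvar=False)  # (5, 5) joint correlation matrix
cross = full[:3, 3:]                              # rows: vocal, cols: gesture

# Entries near 1 indicate strongly coupled vocal and gestural behavior;
# entries near 0 indicate no linear relationship.
print(np.round(cross, 2))
```

In a real pipeline, the two feature streams would come from the audio analysis (e.g., PRAAT-style prosodic features) and the skeleton/pose tracking, time-aligned per frame before correlating.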
Index Terms
- A Multi-sensor Framework for Personal Presentation Analytics