
A Multi-sensor Framework for Personal Presentation Analytics

Published: 05 June 2019

Abstract

Presentations have long been an effective method for delivering information to an audience. Over the past few decades, technological advancements have revolutionized the way humans deliver presentations. Conventionally, the quality of a presentation is evaluated through painstaking manual analysis by experts. Although expert feedback can effectively help speakers improve their presentation skills, manual evaluation is costly and often unavailable to most individuals. In this work, we propose a novel multi-sensor self-quantification system for presentations, designed on the basis of a newly proposed assessment rubric. We present our analytics model with conventional ambient sensors (i.e., static cameras and a Kinect sensor) and emerging wearable egocentric sensors (i.e., Google Glass). In addition, we perform a cross-correlation analysis of the speaker's vocal behavior and body language. The proposed framework is evaluated on a new presentation dataset, namely the NUS Multi-Sensor Presentation dataset, which consists of 51 presentations covering a diverse range of topics. To validate the efficacy of the proposed system, we conducted a series of user studies with the speakers and an interview with an English communication expert, which revealed positive and promising feedback.
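The abstract mentions a cross-correlation analysis between the speaker's vocal behavior and body language but does not spell out the computation. As a rough, hypothetical sketch (not the authors' method), the Python snippet below computes a lagged, normalized cross-correlation between a per-frame vocal-energy signal and a body-motion signal; the signal names, frame rate, and synthetic data are all illustrative assumptions.

```python
# Hypothetical sketch, not the paper's implementation: lagged normalized
# cross-correlation between two per-frame behavioral signals. The 10 Hz frame
# rate and the synthetic signals below are assumptions for illustration.
import numpy as np

def normalized_cross_correlation(vocal, motion, max_lag):
    """Correlate two equal-length 1-D signals at lags in [-max_lag, max_lag].
    A positive lag means the motion signal trails the vocal signal."""
    # Standardize each signal (zero mean, unit variance) so the result
    # is a correlation coefficient rather than a raw inner product.
    vocal = (vocal - vocal.mean()) / (vocal.std() + 1e-8)
    motion = (motion - motion.mean()) / (motion.std() + 1e-8)
    n = len(vocal)
    lags = np.arange(-max_lag, max_lag + 1)
    corr = np.empty(len(lags))
    for i, lag in enumerate(lags):
        if lag >= 0:
            corr[i] = np.dot(vocal[:n - lag], motion[lag:]) / (n - lag)
        else:
            corr[i] = np.dot(vocal[-lag:], motion[:n + lag]) / (n + lag)
    return lags, corr

# Example: 10 Hz features over a 5-minute talk (3000 frames, synthetic data).
rng = np.random.default_rng(0)
voice = rng.random(3000)                              # per-frame vocal energy
motion = np.roll(voice, 5) + 0.5 * rng.random(3000)   # motion trails voice by 5 frames
lags, corr = normalized_cross_correlation(voice, motion, max_lag=20)
print("peak correlation at lag:", lags[np.argmax(corr)])  # expected: 5
```

Under these assumptions, a peak at a positive lag would suggest that body movement tends to trail the voice by that many frames; the synthetic example is constructed so the peak lands at lag 5.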



• Published in

  ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 15, Issue 2
  May 2019
  375 pages
  ISSN: 1551-6857
  EISSN: 1551-6865
  DOI: 10.1145/3339884

          Copyright © 2019 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 5 June 2019
          • Accepted: 1 December 2018
          • Revised: 1 May 2018
          • Received: 1 October 2017
Published in TOMM, Volume 15, Issue 2


          Qualifiers

          • research-article
          • Research
          • Refereed
