AlterEgo: A Personalized Wearable Silent Speech Interface

ABSTRACT
We present a wearable interface that allows a user to silently converse with a computing device without voice or discernible movement, thereby enabling the user to communicate with devices, AI assistants, applications, or other people in a silent, concealed, and seamless manner. A user's intention to speak, and the internal speech itself, are characterized by neuromuscular signals in the internal speech articulators; the AlterEgo system captures these signals to reconstruct the speech. We use this to build a natural language user interface in which users silently communicate in natural language and receive aural output (e.g., through bone conduction headphones), yielding a discreet, bi-directional interface with a computing device and a seamless form of intelligence augmentation. The paper describes the architecture, design, implementation, and operation of the entire system. We demonstrate the robustness of the system through user studies and report a median word accuracy of 92%.
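The reported figure is a median of per-user word accuracies. As a rough illustration only (not the authors' evaluation code), word accuracy for a recognized transcript is commonly derived from the word-level edit distance against a reference transcript, with the median then taken across users:

```python
from statistics import median

def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / len(ref)

def median_word_accuracy(pairs_per_user):
    """pairs_per_user: one list of (reference, hypothesis) pairs per user."""
    accuracies = []
    for pairs in pairs_per_user:
        errors = [word_error_rate(ref, hyp) for ref, hyp in pairs]
        accuracies.append(1.0 - sum(errors) / len(errors))
    return median(accuracies)
```

The function names and the averaging scheme here are assumptions for illustration; the paper's actual evaluation protocol may differ (e.g., in how insertions are counted or how utterances are weighted per user).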