GENEVAL: a proposal for shared-task evaluation in NLG

ABSTRACT
We propose to organise a series of shared-task NLG events, in which participants build systems with the same input/output functionality, and these systems are evaluated with a range of different evaluation techniques. The main purpose of these events is to allow us to compare the evaluation techniques themselves, by correlating the results of the different evaluations of the systems entered in the events.
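As a minimal illustration of the correlation study the abstract describes (this sketch is not from the paper itself; the per-system scores below are invented), one could measure how well an automatic metric such as BLEU agrees with human ratings over the systems entered in an event:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical results from one shared-task event: an automatic
# metric score and a mean human quality rating for each system.
bleu_scores   = [0.31, 0.42, 0.27, 0.38]  # invented values
human_ratings = [3.1,  4.0,  2.8,  3.6]   # invented values

# A high correlation would suggest the automatic metric tracks
# human judgements on these systems; a low one would not.
print(f"Pearson r = {pearson(bleu_scores, human_ratings):.3f}")
```

With several evaluation techniques applied to the same set of systems, pairwise correlations of this kind would indicate which techniques agree with one another and which diverge.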