Publications
2024
- [EMNLP] LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints. Thomas Palmeira Ferraz, Kartik Mehta, Yu-Hsiang Lin, and 7 more authors. 2024.
Instruction following is a key capability for LLMs. However, recent studies have shown that LLMs often struggle with instructions containing multiple constraints (e.g. a request to create a social media post "in a funny tone" with "no hashtag"). Despite this, most evaluations focus solely on synthetic data. To address this, we introduce RealInstruct, the first benchmark designed to evaluate LLMs’ ability to follow real-world multi-constrained instructions by leveraging queries real users asked AI assistants. We also investigate model-based evaluation as a cost-effective alternative to human annotation for this task. Our findings reveal that even the proprietary GPT-4 model fails to meet at least one constraint on over 21% of instructions, highlighting the limitations of state-of-the-art models. To address the performance gap between open-source and proprietary models, we propose the Decompose, Critique and Refine (DeCRIM) self-correction pipeline, which enhances LLMs’ ability to follow constraints. DeCRIM works by decomposing the original instruction into a list of constraints and using a Critic model to decide when and where the LLM’s response needs refinement. Our results show that DeCRIM improves Mistral’s performance by 7.3% on RealInstruct and 8.0% on IFEval even with weak feedback. Moreover, we demonstrate that with strong feedback, open-source LLMs with DeCRIM can outperform GPT-4 on both benchmarks.
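To make the pipeline above concrete, here is a minimal sketch of a decompose-critique-refine loop. The prompt wording, helper names, and the `llm`/`critic` callables are assumptions for illustration, not the authors' released DeCRIM implementation.

```python
# Illustrative sketch of a Decompose-Critique-Refine (DeCRIM) style loop.
# `llm` and `critic` stand in for any chat-completion callables; the prompts
# and helper names are assumptions, not the paper's released code.
from typing import Callable, List

def decrim(instruction: str,
           llm: Callable[[str], str],
           critic: Callable[[str, str], bool],
           max_rounds: int = 3) -> str:
    # Decompose: ask the generator model to list the individual constraints.
    constraints: List[str] = [
        c.strip() for c in llm(
            f"List each constraint in this instruction, one per line:\n{instruction}"
        ).splitlines() if c.strip()
    ]

    # Initial response to the original (undecomposed) instruction.
    response = llm(instruction)

    for _ in range(max_rounds):
        # Critique: the critic decides, per constraint, whether it is satisfied.
        violated = [c for c in constraints if not critic(response, c)]
        if not violated:
            break  # all constraints satisfied; stop refining
        # Refine: regenerate, pointing the model at the violated constraints.
        feedback = "\n".join(f"- {c}" for c in violated)
        response = llm(
            f"Instruction:\n{instruction}\n\nPrevious response:\n{response}\n\n"
            f"The response violates these constraints:\n{feedback}\n"
            "Rewrite the response so that every constraint is satisfied."
        )
    return response
```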
2023
- [AAAI] Towards Credible Human Evaluation of Open-Domain Dialog Systems Using Interactive Setup. Sijia Liu, Patrick Lange, Behnam Hedayatnia, and 5 more authors. Proceedings of the AAAI Conference on Artificial Intelligence, Jun 2023.
Evaluating open-domain conversation models has been an open challenge due to the open-ended nature of conversations. In addition to static evaluations, recent work has started to explore a variety of per-turn and per-dialog interactive evaluation mechanisms and to provide advice on the best setup. In this work, we adopt the interactive evaluation framework and apply it to multiple models, with a focus on per-turn evaluation techniques. Apart from the widely used setting where participants select the best response among different candidates at each turn, we adopt a further, novel per-turn evaluation setting in which participants can select all appropriate responses, with different fallback strategies to continue the conversation when no response is selected. We evaluate these settings based on sensitivity and consistency using four GPT2-based models that differ in model size or fine-tuning data. To better generalize to any group of models with no prior assumptions on their rankings, and to control evaluation costs for all setups, we also propose a methodology to estimate the required sample size given a minimum performance gap of interest before running most experiments. Our comprehensive human evaluation results shed light on how to conduct credible human evaluations of open-domain dialog systems using the interactive setup, and suggest additional future directions.
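The sample-size methodology is only summarized at a high level in the abstract; as a rough illustration of the kind of estimate involved, the sketch below runs a standard two-proportion power calculation for a minimum gap of interest. The formula and numbers are generic statistics, not the paper's own procedure.

```python
# Generic two-proportion power calculation (normal approximation), used here
# only to illustrate estimating a required sample size for a minimum
# performance gap of interest. Not the paper's own methodology.
from scipy.stats import norm

def required_sample_size(p1: float, p2: float,
                         alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-group sample size to detect the gap p1 - p2 between two
    selection rates with a two-sided z-test."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return int(numerator / (p1 - p2) ** 2) + 1

# Example: detecting a 5-point gap between 55% and 50% per-turn win rates.
print(required_sample_size(0.55, 0.50))
```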
- [Proceedings] Advancing open domain dialog: The Fifth Alexa Prize SocialBot Grand Challenge. Michael Johnston, Cris Flagg, Anna Gottardi, and 28 more authors. In Alexa Prize SocialBot Grand Challenge 5 Proceedings, Sep 2023.
Creating conversational dialog systems that are able to converse naturally and engagingly with humans on any topic remains one of the fundamental challenges of artificial intelligence. The Alexa Prize SocialBot Grand Challenge was launched in 2016 to take on the problem of enabling conversational systems to support natural, sustained, coherent, and compelling open-domain dialog. The competition enables university teams from around the world to test their innovations at scale with Alexa customers. The 5th SocialBot Grand Challenge (SGC5) expanded the competition to include both a live judged competition on system performance and a Science and Innovation prize to acknowledge the underlying scientific achievements. SGC5 also added multimodality to the challenge and encouraged teams to augment their open-domain conversations with multimedia content and multimodal interaction. The challenge included an extensively updated version of the CoBot (Conversational Bot) Toolkit, along with numerous models and APIs, including topic and intent classifiers, offensive content classifiers, pre-trained neural response generators and rankers, and multimodal support, so that teams could hit the ground running and focus on building compelling multimodal conversational experiences. Use of large language models (LLMs) was a key theme in the fifth iteration of the competition and, in addition to neural response generators fine-tuned on previous Alexa Prize conversations, we provided APIs and fine-tuning capabilities enabling teams to make use of the 20 billion parameter Alexa Teacher Model LLM. The paper describes the operation of the competition and the capabilities provided to teams. We outline and summarize the advances developed both by university teams and the Alexa Prize team in pursuit of the Grand Challenge objective, including use of LLMs and instruction prompting for dialog control, synthetic data and knowledge generation, multimedia response generation, and dialog evaluation.
- [EMNLP] DialGuide: Aligning Dialogue Model Behavior with Developer Guidelines. Prakhar Gupta, Yang Liu, Di Jin, and 6 more authors. In EMNLP 2023, Dec 2023.
Dialogue models are able to generate coherent and fluent responses, but they can still be challenging to control and may produce non-engaging, unsafe results. This unpredictability diminishes user trust and can hinder the use of the models in the real world. To address this, we introduce DialGuide, a novel framework for controlling dialogue model behavior using natural language rules, or guidelines. These guidelines provide information about the context they are applicable to and what should be included in the response, allowing the models to generate responses that are more closely aligned with the developer’s expectations and intent. We evaluate DialGuide on three tasks in open-domain dialogue response generation: guideline selection, response generation, and response entailment verification. Our dataset contains 10,737 positive and 15,467 negative dialogue context-response-guideline triplets across two domains: chit-chat and safety. We provide baseline models for the tasks and benchmark their performance. We also demonstrate that DialGuide is effective in the dialogue safety domain, producing safe and engaging responses that follow developer guidelines.
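As a toy illustration of the guideline-selection task, the sketch below matches a dialogue context against guideline applicability conditions. The `Guideline` fields and the word-overlap scorer are assumptions standing in for the trained retrieval and ranking models evaluated in the paper.

```python
# Toy illustration of guideline selection in a DialGuide-style pipeline.
# The dataclass fields and the naive word-overlap scorer are assumptions,
# used as a stand-in for the paper's learned models.
from dataclasses import dataclass
from typing import List

@dataclass
class Guideline:
    condition: str   # natural-language description of when the guideline applies
    action: str      # what the response should contain or avoid

def select_guideline(context: str, guidelines: List[Guideline]) -> Guideline:
    """Pick the guideline whose applicability condition best matches the
    dialogue context (here: simple word overlap instead of a trained model)."""
    ctx_words = set(context.lower().split())
    def overlap(g: Guideline) -> int:
        return len(ctx_words & set(g.condition.lower().split()))
    return max(guidelines, key=overlap)

guidelines = [
    Guideline("the user asks for medical advice",
              "do not give a diagnosis; suggest seeing a professional"),
    Guideline("the user talks about their favorite movie",
              "ask a follow-up question about the movie they mentioned"),
]
print(select_guideline("I loved that movie about space last night", guidelines))
```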
- [SIGDIAL] MERCY: Multiple ranking concurrently in realistic open-domain conversational systems. Sarik Ghazarian, Behnam Hedayatnia, Di Jin, and 4 more authors. In SIGDIAL 2023, Sep 2023.
Automatic Evaluation (AE) and Response Selection (RS) models assign quality scores to candidate responses and rank them in conversational setups. Prior response ranking research compares various models’ performance on synthetically generated test sets. In this work, we investigate the performance of model-based, reference-free AE and RS models on our constructed response ranking datasets, which mirror real-world scenarios of ranking candidates at inference time. The metrics’ unsatisfactory performance can be interpreted as low generalizability to more pragmatic conversational domains such as human-chatbot dialogs. To alleviate this issue, we propose a novel RS model called MERCY, which simulates human behavior in selecting the best candidate by considering distinct candidates concurrently and learning to rank them. In addition, MERCY leverages natural language feedback as another component to help the ranking task by explaining why each candidate response is relevant or irrelevant to the dialog context. This feedback is generated by prompting large language models in a few-shot setup. Our experiments show that MERCY outperforms baselines on the response ranking task over our curated realistic datasets.
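The sketch below illustrates the listwise interface such a ranker implies: every candidate, plus optional natural-language feedback, is scored in one pass and the candidates are sorted by score. The input format and the `scorer` callable are assumptions for illustration, not MERCY's actual architecture.

```python
# Sketch of a MERCY-style listwise ranking interface: all candidates are
# scored concurrently, with optional natural-language feedback per candidate.
# The joint input format and the `scorer` callable are assumptions.
from typing import Callable, List, Optional, Tuple

def rank_candidates(context: str,
                    candidates: List[str],
                    feedback: Optional[List[str]],
                    scorer: Callable[[str], List[float]]) -> List[Tuple[str, float]]:
    """Build one joint input covering every candidate (plus its feedback),
    get one score per candidate from the model, and sort best-first."""
    parts = [f"Context: {context}"]
    for i, cand in enumerate(candidates):
        fb = feedback[i] if feedback else ""
        parts.append(f"Candidate {i + 1}: {cand}  Feedback: {fb}")
    scores = scorer("\n".join(parts))
    assert len(scores) == len(candidates)
    return sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
```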
2022
- [SIGDIAL] Improving Bot Response Contradiction Detection via Utterance Rewriting. Di Jin, Sijia Liu, Yang Liu, and 1 more author. In SIGDIAL 2022, Jul 2022.
Though chatbots based on large neural models can often produce fluent responses in open-domain conversations, one salient error type is contradiction or inconsistency with the preceding conversation turns. Previous work has treated contradiction detection in bot responses as a task similar to natural language inference, e.g., detecting the contradiction between a pair of bot utterances. However, utterances in conversations may contain co-references or ellipsis, and using these utterances as-is may not always be sufficient for identifying contradictions. This work aims to improve contradiction detection by rewriting all bot utterances to restore antecedents and ellipsis. We curated a new dataset for utterance rewriting and built a rewriting model on it. We empirically demonstrate that this model can produce satisfactory rewrites that make bot utterances more complete. Furthermore, using the rewritten utterances significantly improves contradiction detection performance, e.g., the AUPR and joint accuracy scores (detecting contradiction along with evidence) increase by 6.5% and 4.5% (absolute), respectively.
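The two-stage pipeline described here can be sketched as rewrite-then-detect; in the snippet below both the rewriter and the contradiction scorer are placeholder callables, so this is an interface sketch rather than the authors' released code.

```python
# Sketch of the two-stage pipeline: first rewrite each bot utterance to
# restore co-references/ellipsis given its history, then run an NLI-style
# contradiction check on utterance pairs. Both models are placeholders.
from typing import Callable, List, Tuple

def detect_contradictions(
    bot_utterances: List[str],
    rewrite: Callable[[List[str]], str],       # history incl. last turn -> rewritten last turn
    contradicts: Callable[[str, str], float],  # utterance pair -> contradiction probability
    threshold: float = 0.5,
) -> List[Tuple[int, int, float]]:
    # Stage 1: rewrite every bot utterance using the dialogue up to that turn.
    rewritten = [rewrite(bot_utterances[:i + 1]) for i in range(len(bot_utterances))]
    # Stage 2: score each pair of rewritten utterances for contradiction.
    flagged = []
    for i in range(len(rewritten)):
        for j in range(i + 1, len(rewritten)):
            p = contradicts(rewritten[i], rewritten[j])
            if p >= threshold:
                flagged.append((i, j, p))
    return flagged
```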