Publications
2024
- [EMNLP] LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints. Thomas Palmeira Ferraz, Kartik Mehta, Yu-Hsiang Lin, and 7 more authors. 2024.
Instruction following is a key capability for LLMs. However, recent studies have shown that LLMs often struggle with instructions containing multiple constraints (e.g. a request to create a social media post "in a funny tone" with "no hashtag"). Despite this, most evaluations focus solely on synthetic data. To address this, we introduce RealInstruct, the first benchmark designed to evaluate LLMs’ ability to follow real-world multi-constrained instructions by leveraging queries real users asked AI assistants. We also investigate model-based evaluation as a cost-effective alternative to human annotation for this task. Our findings reveal that even the proprietary GPT-4 model fails to meet at least one constraint on over 21% of instructions, highlighting the limitations of state-of-the-art models. To address the performance gap between open-source and proprietary models, we propose the Decompose, Critique and Refine (DeCRIM) self-correction pipeline, which enhances LLMs’ ability to follow constraints. DeCRIM works by decomposing the original instruction into a list of constraints and using a Critic model to decide when and where the LLM’s response needs refinement. Our results show that DeCRIM improves Mistral’s performance by 7.3% on RealInstruct and 8.0% on IFEval even with weak feedback. Moreover, we demonstrate that with strong feedback, open-source LLMs with DeCRIM can outperform GPT-4 on both benchmarks.
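To make the pipeline above concrete, here is a minimal sketch of a decompose-critique-refine loop. The prompt wording, helper names, and the `llm`/`critic` callables are assumptions for illustration, not the authors' released DeCRIM implementation.

```python
# Illustrative sketch of a Decompose-Critique-Refine (DeCRIM) style loop.
# `llm` and `critic` stand in for any chat-completion callables; the prompts
# and helper names are assumptions, not the paper's released code.
from typing import Callable, List

def decrim(instruction: str,
           llm: Callable[[str], str],
           critic: Callable[[str, str], bool],
           max_rounds: int = 3) -> str:
    # Decompose: ask the generator model to list the individual constraints.
    constraints: List[str] = [
        c.strip() for c in llm(
            f"List each constraint in this instruction, one per line:\n{instruction}"
        ).splitlines() if c.strip()
    ]

    # Initial response to the original (undecomposed) instruction.
    response = llm(instruction)

    for _ in range(max_rounds):
        # Critique: the critic decides, per constraint, whether it is satisfied.
        violated = [c for c in constraints if not critic(response, c)]
        if not violated:
            break  # all constraints satisfied; stop refining
        # Refine: regenerate, pointing the model at the violated constraints.
        feedback = "\n".join(f"- {c}" for c in violated)
        response = llm(
            f"Instruction:\n{instruction}\n\nPrevious response:\n{response}\n\n"
            f"The response violates these constraints:\n{feedback}\n"
            "Rewrite the response so that every constraint is satisfied."
        )
    return response
```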
2023
- [AAAI] Towards Credible Human Evaluation of Open-Domain Dialog Systems Using Interactive Setup. Sijia Liu, Patrick Lange, Behnam Hedayatnia, and 5 more authors. Proceedings of the AAAI Conference on Artificial Intelligence, Jun 2023.
Evaluating open-domain conversation models has been an open challenge due to the open-ended nature of conversations. In addition to static evaluations, recent work has started to explore a variety of per-turn and per-dialog interactive evaluation mechanisms and to provide advice on the best setup. In this work, we adopt the interactive evaluation framework and apply it to multiple models, with a focus on per-turn evaluation techniques. Apart from the widely used setting where participants select the best response among different candidates at each turn, we adopt a further, novel per-turn evaluation setting in which participants can select all appropriate responses, with different fallback strategies to continue the conversation when no response is selected. We evaluate these settings based on sensitivity and consistency using four GPT2-based models that differ in model size or fine-tuning data. To better generalize to any group of models with no prior assumptions on their rankings, and to control evaluation costs for all setups, we also propose a methodology to estimate the required sample size given a minimum performance gap of interest before running most experiments. Our comprehensive human evaluation results shed light on how to conduct credible human evaluations of open-domain dialog systems using the interactive setup, and suggest additional future directions.
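The sample-size methodology is only summarized at a high level in the abstract; as a rough illustration of the kind of estimate involved, the sketch below runs a standard two-proportion power calculation for a minimum gap of interest. The formula and numbers are generic statistics, not the paper's own procedure.

```python
# Generic two-proportion power calculation (normal approximation), used here
# only to illustrate estimating a required sample size for a minimum
# performance gap of interest. Not the paper's own methodology.
from scipy.stats import norm

def required_sample_size(p1: float, p2: float,
                         alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-group sample size to detect the gap p1 - p2 between two
    selection rates with a two-sided z-test."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return int(numerator / (p1 - p2) ** 2) + 1

# Example: detecting a 5-point gap between 55% and 50% per-turn win rates.
print(required_sample_size(0.55, 0.50))
```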
- [Proceedings] Advancing open domain dialog: The Fifth Alexa Prize SocialBot Grand Challenge. Michael Johnston, Cris Flagg, Anna Gottardi, and 28 more authors. In Alexa Prize SocialBot Grand Challenge 5 Proceedings, Sep 2023.
Creating conversational dialog systems that are able to converse naturally and engagingly with humans on any topic remains one of the fundamental challenges of artificial intelligence. The Alexa Prize SocialBot Grand Challenge was launched in 2016 to take on the problem of enabling conversational systems to support natural, sustained, coherent, and compelling open-domain dialog. The competition enables university teams from around the world to test their innovations at scale with Alexa customers. The 5th SocialBot Grand Challenge (SGC5) expanded the competition to include both a live judged competition on system performance and a Science and Innovation prize to acknowledge the underlying scientific achievements. SGC5 also added multimodality to the challenge and encouraged teams to augment their open-domain conversations with multimedia content and multimodal interaction. The challenge included an extensively updated version of the CoBot (Conversational Bot) Toolkit, along with numerous models and APIs, including topic and intent classifiers, offensive content classifiers, pre-trained neural response generators and rankers, and multimodal support, so that teams could hit the ground running and focus on building compelling multimodal conversational experiences. Use of large language models (LLMs) was a key theme in the fifth iteration of the competition and, in addition to neural response generators fine-tuned on previous Alexa Prize conversations, we provided APIs and fine-tuning capabilities enabling teams to make use of the 20 billion parameter Alexa Teacher Model LLM. The paper describes the operation of the competition and the capabilities provided to teams. We outline and summarize the advances developed both by university teams and the Alexa Prize team in pursuit of the Grand Challenge objective, including use of LLMs and instruction prompting for dialog control, synthetic data and knowledge generation, multimedia response generation, and dialog evaluation.
- [EMNLP] DialGuide: Aligning Dialogue Model Behavior with Developer Guidelines. Prakhar Gupta, Yang Liu, Di Jin, and 6 more authors. In EMNLP 2023, Dec 2023.
Dialogue models are able to generate coherent and fluent responses, but they can still be challenging to control and may produce non-engaging, unsafe results. This unpredictability diminishes user trust and can hinder the use of the models in the real world. To address this, we introduce DialGuide, a novel framework for controlling dialogue model behavior using natural language rules, or guidelines. These guidelines provide information about the context they are applicable to and what should be included in the response, allowing the models to generate responses that are more closely aligned with the developer’s expectations and intent. We evaluate DialGuide on three tasks in open-domain dialogue response generation: guideline selection, response generation, and response entailment verification. Our dataset contains 10,737 positive and 15,467 negative dialogue context-response-guideline triplets across two domains: chit-chat and safety. We provide baseline models for the tasks and benchmark their performance. We also demonstrate that DialGuide is effective in the dialogue safety domain, producing safe and engaging responses that follow developer guidelines.
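As a toy illustration of the guideline-selection task, the sketch below matches a dialogue context against guideline applicability conditions. The `Guideline` fields and the word-overlap scorer are assumptions standing in for the trained retrieval and ranking models evaluated in the paper.

```python
# Toy illustration of guideline selection in a DialGuide-style pipeline.
# The dataclass fields and the naive word-overlap scorer are assumptions,
# used as a stand-in for the paper's learned models.
from dataclasses import dataclass
from typing import List

@dataclass
class Guideline:
    condition: str   # natural-language description of when the guideline applies
    action: str      # what the response should contain or avoid

def select_guideline(context: str, guidelines: List[Guideline]) -> Guideline:
    """Pick the guideline whose applicability condition best matches the
    dialogue context (here: simple word overlap instead of a trained model)."""
    ctx_words = set(context.lower().split())
    def overlap(g: Guideline) -> int:
        return len(ctx_words & set(g.condition.lower().split()))
    return max(guidelines, key=overlap)

guidelines = [
    Guideline("the user asks for medical advice",
              "do not give a diagnosis; suggest seeing a professional"),
    Guideline("the user talks about their favorite movie",
              "ask a follow-up question about the movie they mentioned"),
]
print(select_guideline("I loved that movie about space last night", guidelines))
```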
- [SIGDIAL] MERCY: Multiple ranking concurrently in realistic open-domain conversational systems. Sarik Ghazarian, Behnam Hedayatnia, Di Jin, and 4 more authors. In SIGDIAL 2023, Sep 2023.
Automatic Evaluation (AE) and Response Selection (RS) models assign quality scores to candidate responses and rank them in conversational setups. Prior response ranking research compares various models’ performance on synthetically generated test sets. In this work, we investigate the performance of model-based, reference-free AE and RS models on our constructed response ranking datasets, which mirror real-world scenarios of ranking candidates at inference time. The metrics’ unsatisfactory performance can be interpreted as low generalizability to more pragmatic conversational domains such as human-chatbot dialogs. To alleviate this issue, we propose a novel RS model called MERCY, which simulates human behavior in selecting the best candidate by considering distinct candidates concurrently and learning to rank them. In addition, MERCY leverages natural language feedback as another component to help the ranking task by explaining why each candidate response is relevant or irrelevant to the dialog context. This feedback is generated by prompting large language models in a few-shot setup. Our experiments show that MERCY outperforms baselines on the response ranking task over our curated realistic datasets.
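The sketch below illustrates the listwise interface such a ranker implies: every candidate, plus optional natural-language feedback, is scored in one pass and the candidates are sorted by score. The input format and the `scorer` callable are assumptions for illustration, not MERCY's actual architecture.

```python
# Sketch of a MERCY-style listwise ranking interface: all candidates are
# scored concurrently, with optional natural-language feedback per candidate.
# The joint input format and the `scorer` callable are assumptions.
from typing import Callable, List, Optional, Tuple

def rank_candidates(context: str,
                    candidates: List[str],
                    feedback: Optional[List[str]],
                    scorer: Callable[[str], List[float]]) -> List[Tuple[str, float]]:
    """Build one joint input covering every candidate (plus its feedback),
    get one score per candidate from the model, and sort best-first."""
    parts = [f"Context: {context}"]
    for i, cand in enumerate(candidates):
        fb = feedback[i] if feedback else ""
        parts.append(f"Candidate {i + 1}: {cand}  Feedback: {fb}")
    scores = scorer("\n".join(parts))
    assert len(scores) == len(candidates)
    return sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
```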
2022
- [SIGDIAL] Improving Bot Response Contradiction Detection via Utterance Rewriting. Di Jin, Sijia Liu, Yang Liu, and 1 more author. In SIGDIAL 2022, Jul 2022.
Though chatbots based on large neural models can often produce fluent responses in open-domain conversations, one salient error type is contradiction or inconsistency with the preceding conversation turns. Previous work has treated contradiction detection in bot responses as a task similar to natural language inference, e.g., detecting the contradiction between a pair of bot utterances. However, utterances in conversations may contain co-references or ellipsis, and using these utterances as-is may not always be sufficient for identifying contradictions. This work aims to improve contradiction detection by rewriting all bot utterances to restore antecedents and ellipsis. We curated a new dataset for utterance rewriting and built a rewriting model on it. We empirically demonstrate that this model can produce satisfactory rewrites that make bot utterances more complete. Furthermore, using the rewritten utterances significantly improves contradiction detection performance, e.g., the AUPR and joint accuracy scores (detecting contradiction along with evidence) increase by 6.5% and 4.5% (absolute), respectively.
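The two-stage pipeline described here can be sketched as rewrite-then-detect; in the snippet below both the rewriter and the contradiction scorer are placeholder callables, so this is an interface sketch rather than the authors' released code.

```python
# Sketch of the two-stage pipeline: first rewrite each bot utterance to
# restore co-references/ellipsis given its history, then run an NLI-style
# contradiction check on utterance pairs. Both models are placeholders.
from typing import Callable, List, Tuple

def detect_contradictions(
    bot_utterances: List[str],
    rewrite: Callable[[List[str]], str],       # history incl. last turn -> rewritten last turn
    contradicts: Callable[[str, str], float],  # utterance pair -> contradiction probability
    threshold: float = 0.5,
) -> List[Tuple[int, int, float]]:
    # Stage 1: rewrite every bot utterance using the dialogue up to that turn.
    rewritten = [rewrite(bot_utterances[:i + 1]) for i in range(len(bot_utterances))]
    # Stage 2: score each pair of rewritten utterances for contradiction.
    flagged = []
    for i in range(len(rewritten)):
        for j in range(i + 1, len(rewritten)):
            p = contradicts(rewritten[i], rewritten[j])
            if p >= threshold:
                flagged.append((i, j, p))
    return flagged
```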