Item type: Thesis, Open Access

Understanding Content Selection in Text Summarization and Simplification

Abstract

The goal of text summarization and simplification is to communicate ideas in concise and accessible language. A critical aspect of these tasks is content selection: given an input text, content selection requires picking the most important pieces of information to include in an output text. With the development of Large Language Models (LLMs), the field has seen substantial improvements in the quality of generated summaries and simplifications. However, a key challenge remains: the behavior of LLMs is highly opaque and difficult to anticipate or control. This thesis aims to (i) better understand and (ii) control the content selection behavior of text summarization and simplification models. We take four complementary angles. First, we make content selection observable. We introduce an interpretable representation of content units based on the theory of Questions Under Discussion (QUDs). With this content representation, we analyze what information models consider important when summarizing text. By tracking which questions are answerable with summaries of different lengths, we derive a proxy for how a model prioritizes information. We discover that the notion of content salience is highly consistent within and across current summarization models. However, this notion of salience aligns only weakly with human expectations, and models cannot directly rate the importance of information. Second, we develop methods to recover information lost in simplified text. We propose InfoLossQA, a task and dataset that address this problem. We show that omissions occur frequently in LLM-generated simplifications and that question-answer pairs following QUD theory are an effective tool to understand such information loss. We develop several methods to automatically detect and recover information loss and provide a comprehensive benchmark for future method development. Third, we consider how to control content selection. We develop a novel guidance signal based on variable-length extractive summaries. Intuitively, generating a longer summary requires more guidance than generating a shorter one. We demonstrate the utility of this guidance signal in the radiology domain, where it is competitive with earlier domain-specific guidance signals but easier to apply. Furthermore, we conduct an error analysis to determine current bottlenecks in radiology report summarization. We show that some content selection decisions are likely determined only by dataset-level artifacts or require awareness of latent factors in the clinical process. Finally, we study the content selection strategies of humans. We collaborate with domain experts to collect the first document-level text simplification dataset in the clinical domain. We find that simplifications are often longer than the original documents. Qualitatively, experts contextualize examinations in the clinical process and explain how findings motivate a particular diagnosis. Generating such elaborations requires both an awareness of the clinical workflow and grounding in the latest clinical research.

Metadata

Trienes, Jan: Understanding Content Selection in Text Summarization and Simplification. 2025-12-17.

License

Except where otherwise noted, this item's license is described as Attribution-ShareAlike 4.0 International