Learning to Refine with Fine-Grained Natural Language Feedback


Manya Wadhwa Xinyu Zhao Junyi Jessy Li Greg Durrett

The University of Texas at Austin

[email protected]

Abstract

Recent work has explored the capability of large language models (LLMs) to identify and correct errors in LLM-generated responses. These refinement approaches frequently evaluate what sizes of models are able to do refinement for what problems, but less attention is paid to what effective feedback for refinement looks like. In this work, we propose looking at refinement with feedback as a composition of three distinct LLM competencies: (1) identification of bad generations; (2) fine-grained natural language feedback generation; (3) refining with fine-grained feedback. The first step can be implemented with a high-performing discriminative model and steps 2 and 3 can be implemented either via prompted or fine-tuned LLMs. A key property of this approach is that the step 2 critique model can give fine-grained feedback about errors, made possible by offloading the discrimination to a separate model in step 1. We show that models of different capabilities benefit from refining with this approach on the task of improving factual consistency of document-grounded summaries. Overall, our proposed method consistently outperforms existing end-to-end refinement approaches and current trained models not fine-tuned for factuality critiquing.

1 Introduction

Large language models (LLMs) have been observed to display inconsistent behavior such as hallucinations, not following instructions, and unfaithful reasoning (Levy et al., 2021; Ye and Durrett, 2022; Zhang et al., 2024; Turpin et al., 2024; Shaikh et al., 2023; Zhuo et al., 2023). One recent strategy to fix these mistakes is to perform post-hoc refinement of the response with natural language feedback (Pan et al., 2024; Madaan et al., 2023). These methods either use human feedback (Saunders, 2023) or, more frequently, automated feedback, such as from self-critiquing (Madaan et al., 2023; Gero et al., 2023; Shinn et al., 2024; Raunak et al., 2023; Ye et al., 2023), from a trained model (Xu et al., 2023a; Akyurek et al., 2023; Paul et al., 2024; Chern et al., 2024), or from external tools (Jiang et al., 2023; Olausson et al., 2024; Gou et al., 2024; Chen et al., 2023; Stengel-Eskin et al., 2024).

Code and models available at: https://github.com/ManyaWadhwa/DCR

[Figure 1: a worked example. The instruction is “Give a summary of the document on the topic ‘California's high school exit exam’”. Step 1 (Detect) checks the initial response; Step 2 (Critique) produces sentence-level feedback of the form “The document does not specify the year … error span is: … To fix …”; Step 3 (Refine) outputs the edited summary.]

Figure 1: Overview of the proposed DETECT, CRITIQUE, REFINE method. For a document-grounded response, DCR first detects if the initial response should be refined. If so, we critique the response by generating fine-grained natural language feedback about errors, then refine by making targeted edits.

Critiques are straightforward to obtain in some of these settings: for example, verifying that an acronym starts with the correct letters (Madaan et al., 2023) or that source code passes test cases (Olausson et al., 2024). However, in the context of natural language generation, as opposed to code generation or math problem solving, notions of correctness are relatively less well defined. As a result, substantial prior work has taken a “one-size-fits-all” approach and either directly refined responses with an LLM, or generated feedback across a wide variety of aspects and then used those for refinement (Wang et al., 2023b; Ye et al., 2023). Prior work has not studied what happens when refining for a dimension like factual correctness, which has the property of not being easily verifiable with external tools while still being objective.

arXiv:2407.02397v1 [cs.CL] 2 Jul 2024

In this work, we propose a three-stage refinement framework suitable for tasks like factual correctness. We look at refinement as a composition of three distinct LLM competencies: DETECT, CRITIQUE and REFINE. Figure 1 shows an example of this pipeline. We first detect erroneous generations at a sentence level. Then, if any sentence is identified to have an error, we proceed to generate fine-grained natural language feedback describing the errors and how to fix them. Finally, we refine the original outputs with the generated feedback. We show that models of different abilities perform better when refining with our proposed decomposition compared to baselines where the response is either (a) refined with a general instruction (e.g. “improve factual consistency”) (Saunders et al., 2022), or (b) refined with natural language feedback where the model needs to do verification in the feedback step (Madaan et al., 2023). This process allows for two key differences from prior work: first, the use of a detector to focus the feedback, and second, the ability to fine-tune models on fine-grained feedback to enumerate specific errors. This sentence-level approach allows for individually enumerating many errors across an entire LLM output by handling them in a factored way.

We evaluate our approach on two datasets of

document-grounded LLM outputs: TofuEval (Tang

et al., 2024b) and a subset of UltraChat (Ding et al.,

2023) consisting of queries asking for summaries.

We believe these are representative tasks for a wider

range of such use cases. Across both tasks, we

show that our three-stage approach outperforms

ablations removing or simplifying these stages. In

addition, the form of feedback given by our models

leads to higher factual consistency post-reﬁnement

than feedback from Shepherd (Wang et al., 2023b)

or SelFee (Ye et al., 2023). Finally, we show that

ﬁne-tuning our critique model improves its capabil-

ities over prompting, and our model is able to give

feedback on a variety of factual inconsistencies.

Our main contributions are: (1) we introduce a

novel post-hoc reﬁnement method: DETECT, CRI-

TIQUE and REFINE (DCR), that reﬁnes with natural

language feedback to enhance factual consistency;

(2) we ﬁne-tune models to generate ﬁne-grained

factual inconsistency localization, reasoning about

the error, and a suggested ﬁx for the inconsistency;

(3) we show the importance of the DETECT and

CRITIQUE steps in enhancing the post-hoc reﬁne-

ment capabilities of models.

2 Background and Task Setup

We assume we are given an LLM output $r$, generated from a document $D$ by prompting a model $M$; this accommodates tasks like traditional summarization, query-focused summarization, document-grounded question answering, and more. Our goal is to generate a refinement $\hat{r} = M'(r)$, where $M'$ is the refinement model, which can be distinct from $M$ in our setting. We have two conditions for refinement to be successful. First, we want to improve the quality of the response along the desired refinement axis, which in our case is factual consistency. We define a function $E(D, r)$ to score responses. For the tasks we consider, there is not a firm binary notion of factual consistency; this follows from work in NLI showing that entailment judgments are inherently subjective (Pavlick and Kwiatkowski, 2019; Nie et al., 2020; Chen et al., 2020). Therefore, we will evaluate if $E(D, \hat{r}) > E(D, r)$; that is, did our refinement successfully improve factual consistency. Second, we want to edit the response such that the refinement preserves the style, structure and most of the content from the original response. We do not want the refinement process to simply replace the original response or delete large portions of it. We evaluate this by quantifying the number of edits at a word level and doing a qualitative analysis, but our first priority is to optimize for $E$.
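The two success conditions can be checked mechanically once a scorer is available. The sketch below is a minimal illustration, not the authors' evaluation code: `word_edit_count` is a hypothetical helper that approximates a word-level edit count with Python's `difflib` (the exact edit measure is not specified in this section), and `is_successful` takes two precomputed scores standing in for $E(D, r)$ and $E(D, \hat{r})$.

```python
import difflib


def word_edit_count(original: str, refined: str) -> int:
    """Approximate number of word-level edits between two responses.

    Assumption: a replaced span counts as the longer of its two sides;
    insertions and deletions count one edit per word.
    """
    a, b = original.split(), refined.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    edits = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            edits += max(i2 - i1, j2 - j1)
    return edits


def is_successful(score_before: float, score_after: float) -> bool:
    """First success condition: E(D, r_hat) > E(D, r)."""
    return score_after > score_before
```

A small edit count together with `is_successful(...) == True` corresponds to the targeted-edit behavior described above; a refinement that rewrites the whole response would show a large edit count even if its score improves.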

Relation to past formalizations. Prior work like

Self-Reﬁne (Madaan et al., 2023) leverages the

source model to critique and reﬁne its own out-

put, which assumes that the source model has the

capacity to follow a prompt and evaluate its own

generations. These methods evaluate their reﬁne-

ment methods on more structured tasks with auto-

matic metrics, such as solve rate for math reasoning

(Cobbe et al., 2021) and fraction of programs opti-

mized for code optimization (Shypula et al., 2024).

Our work does not constrain the reﬁnement model

to be the same as the source, which changes the

nature of the questions we investigate, and further-

more the factual consistency task has different prop-

erties than logical reasoning problems like math.

Model        Size     Factuality  Document grounded  “No Error” cases  Error localization  Eval. on refinement
UltraCM      13B      ✗           ✓                  ✗                 ✗                   ✗
Shepherd     13B      ✗           ✗                  ✗                 ✗                   ✓
SelFee       7B/13B   ✗           ✓                  ✓                 ✗                   ✓
DCR (Ours)   7B       ✓           ✓                  ✓                 ✓                   ✓

Table 1: Comparison between existing feedback models and our trained model. Our approach focuses on generating fine-grained feedback for improving factual consistency of document-grounded responses.

In domains like program synthesis, reﬁnement is

often compared with sampling more completions

from the original model (Olausson et al., 2024). In

this work, we assume as part of the problem definition that we are refining a base response $r$. This

task is useful when generating the base response

may be expensive, or if it may follow other con-

straints or instructions that make it challenging to

regenerate. Furthermore, this allows us to use past

datasets that annotate errors over responses (Tang

et al., 2024b), which enables us to perform more

ﬁne-grained analyses of ﬁxed and remaining errors.

Prior critiquing methods. Table 1 shows a comparison of our proposed critique model with prior approaches like UltraCM, Shepherd and SelFee. Feedback from these models mostly focuses on critiquing the overall quality of the response without necessarily verifying whether or not the response needs refinement. While UltraCM does not evaluate the effectiveness of its feedback via refinement, Shepherd and SelFee evaluate on tasks like multiple choice QA (Wang et al., 2023a; Mihaylov et al., 2018; Lin et al., 2021) where the evaluation objective is well-defined. Furthermore, current refinement methods often perform the CRITIQUE step directly on a response without knowing whether or not it needs to be refined. This approach places the burden of both verification and critiquing on the same model, which our pipeline improves upon.

3 Reﬁning with Fine-Grained Feedback

We propose decomposing the task of reﬁning tex-

tual responses using natural language feedback into

three steps: DETECT, CRITIQUE, REFINE. Algo-

rithm 1 concretely shows the cascade of these steps.

Step 1: DETECT with $M_{\text{detect}}$. For a response $r$ grounded in document $D$, we split the response into sentences $\mathbf{s} = \mathrm{split}(r)$ using NLTK, and for each sentence $s_i \in \mathbf{s}$, we determine if there is an error by computing $M_{\text{detect}}(s_i, r, D) \in \{0, 1\}$. If all sentences are correct, we do not modify the response. If any sentence is marked with an error, we generate feedback using the CRITIQUE step.

Algorithm 1 Proposed Refinement Algorithm
Input: Document $D$, Initial Response $r$, Models $M_{\text{detect}}$, $M_{\text{critique}}$, $M_{\text{refine}}$
Output: Refined response $\hat{r}$
$\mathbf{s} = \mathrm{split}(r)$
$F \leftarrow \emptyset$
for $s_i \in \mathbf{s}$ do
    if $M_{\text{detect}}(s_i, r, D)$ then        ▷ Detect
        $f_i = M_{\text{critique}}(s_i, r, D)$    ▷ Critique
        $F \leftarrow F \cup f_i$
    end if
end for
$\hat{r} = M_{\text{refine}}(F, r, D)$            ▷ Refine
return $\hat{r}$

Step 2: CRITIQUE with $M_{\text{critique}}$. Once we have determined $r$ to have an error, for each sentence $s_i$ with $M_{\text{detect}} = 1$, we generate natural language feedback $f_i = M_{\text{critique}}(D, r, s_i)$ that does span localization, reasons about why the span has an error and then suggests a natural language fix for the span. We combine the sentence-wise feedback to create $F$ and use it for the REFINE step.

Step 3: REFINE with $M_{\text{refine}}$. Given the document $D$, the response with errors $r$ and the natural language feedback $F$, we use $\hat{r} = M_{\text{refine}}(D, r, F)$ to generate a response which targets editing error spans mentioned in $F$. We refine with combined feedback as opposed to per-sentence to enable $M_{\text{refine}}$ to place edits in context and better produce a final coherent response.
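The cascade of the three steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released implementation: the detector, critic, and refiner are passed in as plain callables (hypothetical stand-ins for the three models), and `split_sentences` is a naive regex splitter used here in place of the NLTK splitter so that the sketch has no dependencies.

```python
import re
from typing import Callable, List

# Hypothetical interfaces for the three models.
Detector = Callable[[str, str, str], bool]      # (sentence, response, document) -> has error?
Critic = Callable[[str, str, str], str]         # (sentence, response, document) -> feedback
Refiner = Callable[[List[str], str, str], str]  # (feedback list, response, document) -> refinement


def split_sentences(response: str) -> List[str]:
    # Naive stand-in for the NLTK sentence splitter mentioned in Step 1.
    return [s for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s]


def dcr(document: str, response: str,
        detect: Detector, critique: Critic, refine: Refiner) -> str:
    feedback: List[str] = []
    for sentence in split_sentences(response):
        if detect(sentence, response, document):                      # Detect
            feedback.append(critique(sentence, response, document))   # Critique
    if not feedback:
        # No sentence was flagged, so the response is returned unmodified.
        return response
    # Refine once with the combined feedback rather than per sentence.
    return refine(feedback, response, document)
```

Keeping the three components as separate arguments mirrors the factored design: any of them can be swapped for a prompted or fine-tuned model without touching the control flow.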

3.1 Supervised Fine-Tuning

Section 5 shows how our proposed approach generalizes for models of different capabilities, ranging from LLAMA-2-7B-CHAT to GPT-4. For smaller models, we explore fine-tuning $M_{\text{critique}}$ and $M_{\text{refine}}$ to generate natural language outputs for our task at hand. We train based on distilled critiques and refinements from a stronger language model $M_{\text{teacher}}$, in our case GPT-4-0613.

Figure 2 gives an overview of the data generation and fine-tuning process. We first generate natural language feedback and refinements from $M_{\text{teacher}}$ using critique and refinement prompts. We then fine-tune $M_{\text{critique}}$ and $M_{\text{refine}}$. Notably, we first generate a structured feedback to induce specific aspects from $M_{\text{teacher}}$ that would allow us to obtain training data for different capabilities.

[Figure 2: two panels. Training Data Generation: from the inputs (document, response), $M_{\text{teacher}}$ produces structured feedback, a list of objects with “error_span”, “feedback” and “fix” fields, which yields the natural language feedback and the refined response. Fine-tuning: $M_{\text{critique}}$ takes the document, the response and a separate sentence as inputs and produces sentence-level feedback; $M_{\text{refine}}$ takes the document, the response and the feedback as inputs and produces the refined response.]

Figure 2: Training data generation pipeline for our proposed models. We first generate structured feedback from $M_{\text{teacher}}$, convert it to a natural language form ($F'$) and use that to generate a refinement ($r'$).

Training Data Creation. We construct our fine-tuning data over a collection of (document, response) pairs $\{(D_i, r_i)\}_{i=1}^{N}$. We discard any responses for which $M_{\text{detect}}(s, r, D) = 0$ for all sentences $s \in r$, such that $M_{\text{critique}}$ and $M_{\text{refine}}$ are only trained on responses containing errors. For each sentence $s$ detected to have an error, we prompt a teacher model $M_{\text{teacher}}$ using the critique prompt to give a structured feedback. This is a list of objects where each object has an error span, reasoning as to why the span is an error, and a suggested fix. We convert this structure to a natural language form $f'$. The critique prompt anchors the feedback in error categories derived from prior work (Tang et al., 2024b).

We then prompt $M_{\text{teacher}}$ using the refinement prompt to generate a refinement $r'$ using $F'$. As per our task setup, the refinement prompt contains a minimal editing instruction that guides the teacher model to make targeted edits. The critique and refinement prompts for training data generation are given in Appendix B.
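The conversion from structured feedback to a natural language form can be illustrated with a short sketch. The field names `error_span`, `feedback` and `fix` follow the structured example in Figure 2; the function name `feedback_to_text` and the sentence template are assumptions made for illustration, loosely modeled on the feedback snippet in Figure 1, and are not the authors' exact wording.

```python
from typing import Dict, List


def feedback_to_text(items: List[Dict[str, str]]) -> str:
    """Render a list of structured feedback objects as numbered prose.

    Assumption: each object carries 'error_span', 'feedback' (the
    reasoning) and 'fix' keys; the template below is illustrative only.
    """
    lines = []
    for n, item in enumerate(items, start=1):
        lines.append(
            '{n}. {reason} The error span is: "{span}". To fix this, {fix}'.format(
                n=n,
                reason=item["feedback"],
                span=item["error_span"],
                fix=item["fix"],
            )
        )
    return "\n".join(lines)
```

Asking the teacher for structured output first and rendering prose afterwards keeps each aspect (localization, reasoning, fix) explicit, which is what makes it possible to build training targets for different capabilities from the same teacher call.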

Fine Tuning. For $M_{\text{critique}}$, we use document $D$, response $r$ and a sentence from the response $s$ as input. The model is then optimized to generate natural language feedback $f'$ for sentence $s$. Note that we train the model to output feedback which has the error span, reasoning for the span being factually inconsistent, and the suggested fix.

We also fine-tune $M_{\text{refine}}$ in a similar manner. The input to this model is the document $D$, response $r$ and response-level feedback $F'$. The model is optimized to generate a refinement $r'$. Note that the critique prompts for data generation and fine-tuning are different. The prompts for fine-tuning are in Appendix 4.2. Appendix B describes the compute and the hyperparameters for fine-tuning.

Subset  Dataset    Size  Doc Len  Resp Len  % Correct
Train   MediaSum   1344  1189     43        46.9
Train   UltraChat  1072  486      225       50.0
Val     MediaSum   149   1186     45        47.0
Val     UltraChat  124   524      238       50.0
Test    TofuEval   267   778      52        56.6
Test    UltraChat  272   497      227       24.2

Table 2: Statistics of training/validation/test split for MediaSum/TofuEval and UltraChat.

4 Experimental Setup

4.1 Datasets

We consider two datasets for our task of post-hoc

reﬁnement: UltraChat (Ding et al., 2023) and Me-

diaSum (Zhu et al., 2021), with annotations from

TofuEval (Tang et al., 2024b). In both datasets, we

focus on reﬁning document-grounded summaries

to improve their factual consistency. With Ultra-

Chat, we create a train/val/test set by sampling

summarization instructions from the dataset. For

TofuEval, we use the MediaSum split as the test

set and sample from the original train set of Medi-

aSum to create the train/val set for our task. This

gives us a distinct set of summaries from those in

the TofuEval dataset (derived from MediaSum’s

test set). Appendix A discusses the data creation

process in more detail.

Initial Response Generation. We run UltraChat

and MediaSum prompts (source documents +

summarization instructions) through LLAMA2-7B-

CHAT (Touvron et al., 2023) and generate initial

responses. To be consistent with TofuEval’s sum-

mary generation process, we ensure that MediaSum

summaries are restricted to be less than 50 words

(prompt A.2). We combine the train/val splits from

UltraChat and MediaSum for ﬁne-tuning. For eval-

uation we use the MediaSum split of TofuEval, and

the responses generated from Llama2 for UltraChat.

Appendix A.3 shows examples of instructions and

initial responses. Table 2 shows the number of

responses in each set along with their statistics.

4.2 Reﬁnement Strategies: Baselines

One-Step: Direct Refinement (DR) (Welleck et al., 2023; Chen et al., 2022; Saunders et al., 2022). We generate a refinement $\hat{r} = M_{\text{refine}}(r)$ by directly prompting (or fine-tuning) the model with a general refinement instruction to improve the factual consistency of the initial response. For fine-tuning $M_{\text{refine}}$ for this baseline, we train with a balanced set of factually consistent and inconsistent responses (to prevent any copying behavior). The model is optimized to generate a GPT-4 distilled refinement if the initial response is inconsistent or simply copy the response if it is consistent.

                                                          | MediaSum/TofuEval                         | UltraChat
Strategy        M_detect   M_critique      M_refine       | ∆A ↑   ∆G ↑   W ↑    S     L     ∆MCS ↑  | ∆A ↑   ∆G ↑   W ↑    S     L     ∆MCS ↑
DR              -          -               GPT-4          | 0.03†  0.22†  0.13†  0.87  0.0   -       | 0.01†  0.08†  0.06†  0.94  0.0   -
Feed+Refine     -          GPT-4           GPT-4          | 0.10   0.49   0.25   0.73  0.01  -       | 0.09   0.28   0.21   0.74  0.04  -
DETECT+DR       MiniCheck  -               GPT-4          | 0.11   0.47   0.19   0.81  0.0   17.23†  | 0.04†  0.27   0.19   0.81  0.0   5.51†
DCR (proposed)  MiniCheck  GPT-4           GPT-4          | 0.10   0.53   0.21   0.78  0.02  19.85   | 0.07   0.21   0.18   0.80  0.02  22.79
DR              -          -               Refine-L3-FT   | 0.05†  0.20†  0.11†  0.88  0.01  -       | 0.02†  0.08   0.06†  0.94  0.0   -
Feed+Refine     -          Critique-L3-FT  Refine-L3-FT   | 0.01†  0.04†  0.01†  0.97  0.01  -       | 0.0†   0.02   0.00†  1     0.0   -
DETECT+DR       MiniCheck  -               Refine-L3-FT   | 0.05†  0.19†  0.09†  0.89  0.01  7.49†   | 0.00†  0.08   0.03†  0.94  0.0   4.41†
DCR (proposed)  MiniCheck  Critique-L3-FT  Refine-L3-FT   | 0.08   0.33   0.17   0.82  0.01  22.10   | 0.06   0.04   0.14   0.82  0.04  23.90
DR              -          -               Refine-L2-FT   | 0.01†  0.00†  0.01†  0.98  0.0   -       | 0.00†  0.03   0.0†   1.0   0.0   -
Feed+Refine     -          Critique-L2-FT  Refine-L2-FT   | -0.01† -0.04† 0.0†   0.99  0.00  -       | 0.0†   0.02   0.00†  1.00  0.00  -
DETECT+DR       MiniCheck  -               Refine-L2-FT   | 0.04†  0.22   0.07†  0.91  0.01  8.24†   | 0.01   0.01   0.02†  0.96  0.0   5.88†
DCR (proposed)  MiniCheck  Critique-L2-FT  Refine-L2-FT   | 0.09   0.15   0.13   0.82  0.04  19.10   | 0.03   -0.10  0.07   0.76  0.09  21.32

Table 3: Downstream evaluation of refinements as generated by our proposed method DCR and various refinement strategies. †: Statistically significant gains from DCR over these methods with p < 0.05 according to a paired bootstrap test. DCR generally achieves the strongest performance across all base LLMs compared to other approaches, particularly on LLAMA2-7B-CHAT and LLAMA3-8B-INSTRUCT, and particularly on MCS.
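The significance markers in Table 3 come from a paired bootstrap test. The exact resampling procedure is not described in this section, so the following is a sketch of one common variant under stated assumptions: per-example scores for two systems are resampled jointly with replacement, and the p-value is estimated as the fraction of resamples in which the first system fails to beat the second. The function name `paired_bootstrap_p` and its defaults are hypothetical.

```python
import random
from typing import Sequence


def paired_bootstrap_p(ours: Sequence[float], baseline: Sequence[float],
                       n_resamples: int = 10_000, seed: int = 0) -> float:
    """Estimate how often `ours` does not beat `baseline` under resampling.

    Both sequences must hold scores for the same examples in the same order.
    """
    if len(ours) != len(baseline) or len(ours) == 0:
        raise ValueError("need two non-empty, equally sized score lists")
    rng = random.Random(seed)
    n = len(ours)
    diffs = [a - b for a, b in zip(ours, baseline)]
    not_better = 0
    for _ in range(n_resamples):
        # Resample example indices with replacement; pairing is preserved
        # because each draw picks one per-example difference.
        total = sum(diffs[rng.randrange(n)] for _ in range(n))
        if total <= 0:
            not_better += 1
    return not_better / n_resamples
```

Under this variant, a returned value below 0.05 would correspond to a † entry, i.e., a gain of DCR over the compared method that is unlikely to be explained by the particular sample of test examples.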

Two-Step: Refinement with Natural Language Feedback (Feed+Refine). This approach follows prior work using a two-step version of our pipeline (Madaan et al., 2023; Saunders et al., 2022; Akyurek et al., 2023). We first prompt the model to generate natural language feedback for all sentences in the initial response, $F = \cup\, M_{\text{critique}}(s_i)\ \forall s_i \in r$. We then generate a refinement that is conditioned on the feedback, $\hat{r} = M_{\text{refine}}(F, r)$.

We train $M_{\text{critique}}$ with a balanced set of consistent and inconsistent sentences. The model is optimized to generate ‘no error’ if the sentence is factually consistent and the GPT-4 distilled feedback if the sentence is factually inconsistent. Similarly, $M_{\text{refine}}$ is trained with a balanced dataset of consistent and inconsistent summaries, and the refinement is conditioned on the feedback. $M_{\text{refine}}$ is optimized to give a GPT-4 distilled refinement if the initial response is inconsistent or simply copy the response if it is consistent.

Two-Step: Direct Refinement with DETECT (Detect+DR). We modify DR by first determining if the response is factually inconsistent or not, taking the union of the sentence-level decisions $M_{\text{detect}}(s_i)$ over all $s_i \in r$. If this union equals 1, i.e., the response is factually inconsistent, we perform refinement with DR, $\hat{r} = M_{\text{refine}}(r)$. Note that this baseline does not use any external or self-generated feedback. We fine-tune $M_{\text{refine}}$ for this baseline and train it to generate GPT-4 distilled refinements for inconsistent responses.

Three-Step: DETECT, CRITIQUE, REFINE (DCR, our method). We first filter out initial responses that are factually consistent using $M_{\text{detect}}$. We use $M_{\text{critique}}$ to generate sentence-wise feedback for any sentences that were detected to have an error. We combine this feedback and use $M_{\text{refine}}$ to make targeted changes. Models for this method are trained as described in Section 3.1.

Prompts associated with each of the baselines are listed in Appendix C. Note that all refinement models are prompted (and fine-tuned) with a minimum editing instruction.

4.3 Models

DETECT. We use MiniCheck (Tang et al., 2024a) as $M_{\text{detect}}$. MiniCheck performs on par with GPT-4 while being lightweight and more cost-friendly. This model is ideal for our approach since it does sentence-level verification of factual consistency against a source document.

CRITIQUE and REFINE. In order to test the effectiveness of our proposed method, we experiment with models of different capabilities: GPT-4, LLAMA-3-8B-INSTRUCT (Meta, 2024) and LLAMA-2-7B-CHAT (Touvron et al., 2023). We abbreviate non-fine-tuned versions of Llama as L2 and L3. We fine-tune our Llama models to serve as $M_{\text{critique}}$ and $M_{\text{refine}}$ separately. The resulting Llama 3 models are referred to as Critique-L3-FT for $M_{\text{critique}}$ and Refine-L3-FT for $M_{\text{refine}}$, and analogously for Llama 2.

Existing models as $M_{\text{critique}}$. We evaluate how existing feedback models SHEPHERD (Wang et al., 2023b), ULTRACM (Cui et al., 2023) and SELFEE (Ye et al., 2023) perform as $M_{\text{critique}}$. We run end-to-end refinements, varying $M_{\text{critique}}$, with GPT-4 and non-fine-tuned versions of LLAMA-3-8B-INSTRUCT and LLAMA-2-7B-CHAT as $M_{\text{refine}}$.

4.4 Evaluation Metrics

AlignScore (∆A). AlignScore (Zha et al., 2023) scores two texts in terms of general “information alignment” on a scale from 0-1 using RoBERTa (Liu et al., 2019) as the base model. We report the delta in AlignScore, which is the difference AlignScore(document, refined response) − AlignScore(document, initial response).

GPT-4 Factuality Likert Scale Score (∆G). Leveraging GPT-4’s ability to score generations when given a well-defined rubric (Li et al., 2024), we prompt GPT-4-0613 to score the factual consistency of a generation on a scale of 1-5 using a rubric. We score the initial and the refined response in independent GPT-4 calls and report the delta between them. The scoring prompt with the rubric is given in Appendix D.1.

Note that Wang et al. (2023b) did not open source their model weights, so we use the model from the community that has been trained on their data: https://huggingface.co/reciprocate/shepherd-13b

GPT-4 Win-Rate (W). We run pairwise scoring of the initial response and refinement using GPT-4-0613 (Chiang et al., 2024; Dubois et al., 2024) and prompt it to score generations on a scale of 1-5. We use the scores to determine the win rate. In each call we randomize the order of the two responses. We report the fractions of Wins (W), Same scores (S) and Losses (L). The scoring prompt with the rubric is given in Appendix D.2.
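Turning the pairwise 1-5 scores into the W, S and L fractions is a simple count. The sketch below assumes the judge scores have already been collected and mapped back to (initial, refined) order after the randomized presentation; the function name `win_same_loss` is hypothetical.

```python
from typing import Sequence, Tuple


def win_same_loss(initial_scores: Sequence[int],
                  refined_scores: Sequence[int]) -> Tuple[float, float, float]:
    """Return the fractions of wins, same scores, and losses for the refinement.

    A win means the refined response received a strictly higher score
    than the initial response for the same example.
    """
    if len(initial_scores) != len(refined_scores) or not initial_scores:
        raise ValueError("need two non-empty, equally sized score lists")
    n = len(initial_scores)
    pairs = list(zip(initial_scores, refined_scores))
    wins = sum(1 for before, after in pairs if after > before)
    same = sum(1 for before, after in pairs if after == before)
    losses = n - wins - same
    return wins / n, same / n, losses / n
```

Because the three fractions sum to one, a high S value such as those reported for direct refinement indicates that most responses were scored identically before and after refinement.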

MiniCheck score (∆% MCS). We calculate the difference in the percentage of factually correct summaries before and after refinement, as detected by $M_{\text{detect}}$, i.e., MiniCheck. For fairness, we only use this metric to compare among methods that use $M_{\text{detect}}$ as a part of the refinement process, i.e., DETECT+DR and our proposed method DCR.
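As an illustration of this metric, the sketch below computes the change in the percentage of summaries judged factually correct. It assumes that the detector's sentence-level decisions are already available as booleans (True meaning an error was flagged) and that a summary counts as correct only when no sentence is flagged; both the aggregation rule and the names `pct_correct` and `delta_mcs` are assumptions for illustration.

```python
from typing import List, Sequence


def pct_correct(flags_per_summary: Sequence[List[bool]]) -> float:
    """Percentage of summaries in which no sentence was flagged as an error."""
    if not flags_per_summary:
        raise ValueError("need at least one summary")
    correct = sum(1 for flags in flags_per_summary if not any(flags))
    return 100.0 * correct / len(flags_per_summary)


def delta_mcs(before: Sequence[List[bool]], after: Sequence[List[bool]]) -> float:
    """Difference in percentage of correct summaries, after minus before."""
    return pct_correct(after) - pct_correct(before)
```

For example, if two of four initial summaries are clean and three of four refined summaries are clean, `delta_mcs` returns 25.0.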

5 Results

Does the three-step reﬁnement help over stan-

dard reﬁnement strategies? Table 3 shows how

our proposed method compares against existing

reﬁnement baselines using the metrics deﬁned in

Section 4.4. DCR gives the largest improvement

∆A

when reﬁning with (Critique-L3-FT, Reﬁne-

L3-FT) and (Critique-L2-FT, Reﬁne-L2-FT). We

observe a similar trend for

. When reﬁning with

GPT-4, we observe DCR performing much bet-

ter than DR, and being on par with Feed+Reﬁne

and DETECT+DR. We attribute this to GPT-4 be-

ing a stronger model and achieving closer to ideal

reﬁnement already. DCR also leads to a larger im-

provement in the fraction of summaries improved

(∆MCS) compared to DETECT+DR. This points

to the importance of reﬁning with ﬁne-grained feed-

back beyond the DETECT step. Tables 10 and 11

show the average values for AlignScore, GPT-4

Score and pairwise GPT-4 score for the original

response and the reﬁnement.

Table 18 shows examples of reﬁnements gener-

ated by GPT-4, Reﬁne-L3-FT and Reﬁne-L2-FT

using DCR. The edits are localized to a sentence

or phrase. The changes made are sophisticated

and add the correct information instead of trivially

deleting factually inconsistent information. Fur-

thermore, even our smaller-scale models are mak-

ing similar edits to GPT-4.

How do existing feedback models compare to the proposed critic model? Table 4 shows the effectiveness of the DCR-generated feedback by comparing it against refining with feedback from existing critic models. Using DCR as the refinement strategy, we vary M_critique and use GPT-4 and LLAMA3-8B-INSTRUCT as M_refine (see Table 9 for LLAMA2-7B-CHAT).

The proposed feedback leads to the highest gains in refinements across all our metrics, datasets and models. This is expected since the existing models were trained to give “overall” summary-level feedback instead of the aspect-specific fine-grained feedback that our model learns to generate. Also, refining with a fine-tuned M_critique and a non-fine-tuned M_refine gives larger improvements than refining with a non-fine-tuned M_critique and M_refine. This shows that smaller models can be fine-tuned to give more effective feedback which is useful for refinement. Table 17 shows examples of feedback generated by different models. The proposed fine-grained feedback does error localization and also suggests a fix, inheriting such structure from our structured prompts on stronger models. In contrast, feedback from existing models focuses more on missing details than on factual consistency, even when prompted for the latter.

The first five score columns are for MediaSum/TofuEval (M/T); the last five are for UltraChat (UC).

| M_critique | M_refine | M/T ∆A ↑ | M/T ∆G ↑ | M/T W ↑ | M/T S | M/T L | UC ∆A ↑ | UC ∆G ↑ | UC W ↑ | UC S | UC L |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Shepherd | GPT-4 | 0.06† | 0.32† | 0.10† | 0.88 | 0.03 | 0.06† | 0.26† | 0.11† | 0.87 | 0.02 |
| UltraCM | GPT-4 | 0.05 | 0.21 | 0.08† | 0.91 | 0.01 | 0.03† | 0.13 | 0.10† | 0.88 | 0.02 |
| SelFee 7b | GPT-4 | 0.04 | 0.17 | 0.20 | 0.71 | 0.09 | 0.02 | 0.44† | 0.21 | 0.68 | 0.11 |
| SelFee 13b | GPT-4 | 0.03 | 0.24 | 0.19 | 0.71 | 0.10 | 0.01 | 0.40 | 0.19 | 0.71 | 0.10 |
| GPT-4 | GPT-4 | 0.10 | 0.53 | 0.21 | 0.78 | 0.02 | 0.07 | 0.21 | 0.18 | 0.80 | 0.02 |
| Shepherd | L3 | 0.05† | 0.09† | 0.06† | 0.90 | 0.04 | 0.05† | 0.00 | 0.10† | 0.85 | 0.05 |
| UltraCM | L3 | 0.03† | 0.09† | 0.05† | 0.90 | 0.05 | 0.05† | 0.04 | 0.08† | 0.84 | 0.07 |
| SelFee 7b | L3 | 0.04 | -0.03 | 0.15 | 0.64 | 0.21 | 0.00 | 0.04 | 0.11 | 0.74 | 0.15 |
| SelFee 13b | L3 | 0.02 | -0.15 | 0.12 | 0.57 | 0.31 | 0.00 | 0.05 | 0.12 | 0.71 | 0.16 |
| L3 | L3 | 0.07 | 0.27 | 0.16 | 0.80 | 0.03 | 0.05 | 0.08 | 0.08† | 0.88 | 0.04 |
| Critique-L3-FT | L3 | 0.10 | 0.39 | 0.16 | 0.82 | 0.02 | 0.08 | 0.19 | 0.18 | 0.79 | 0.04 |
| Critique-L3-FT | Refine-L3-FT | 0.08 | 0.33 | 0.17 | 0.82 | 0.01 | 0.06 | 0.04 | 0.14 | 0.82 | 0.04 |

Table 4: Results from DCR while varying M_critique to evaluate existing feedback models on our task. †: significant gains by DCR with p < 0.05 according to a paired bootstrap test. We see that our proposed M_critique achieves the largest gains across all metrics when compared to refining with feedback from existing critic models.
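The caption of Table 4 reports significance according to a paired bootstrap test, without spelling out the procedure. One common variant can be sketched as follows, assuming per-example metric values are available for both systems on the same examples; the function name, the number of resamples, and the one-sided decision rule are assumptions of this sketch rather than details confirmed by the paper.

```python
import random


def paired_bootstrap_p_value(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Estimate how often system A fails to beat system B under resampling.

    scores_a and scores_b hold per-example metric values for the same
    examples, in the same order.
    """
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    not_better = 0
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) <= 0:
            not_better += 1
    return not_better / n_resamples
```

Under this variant, a returned value below 0.05 would correspond to a dagger in the table, with system A being DCR and system B the critic model it is compared against.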

6 Understanding generated feedback

Is the proposed detailed feedback form helpful?

The feedback used in our proposed method has two

important parts: error localization and a reasoning

for why it is an error with a suggested fix. To evaluate the importance of this detailed feedback, we fine-tune M_critique to generate only the error localization as feedback and refine with a non-fine-tuned M_refine. Focusing on LLAMA3-8B-INSTRUCT, we show in Table 5 that refining with our proposed feedback form (rows 2 & 3) does significantly better than refining with less detailed feedback, i.e., only error localization. We can particularly see this in ∆G and W. This points to the effectiveness of

using a more detailed feedback for reﬁnement and

validates the usefulness of our proposed feedback.

| Feedback detail (critique w/) | TofuEval ∆A ↑ | TofuEval ∆G ↑ | TofuEval W ↑ | UltraChat ∆A ↑ | UltraChat ∆G ↑ | UltraChat W ↑ |
|---|---|---|---|---|---|---|
| localization | 0.08 | 0.19† | 0.13† | 0.05 | 0.04 | 0.08† |
| +feedback | 0.10 | 0.39 | 0.16 | 0.08 | 0.19 | 0.18 |
| +fine-tuned M_refine | 0.08 | 0.33 | 0.17 | 0.06 | 0.04 | 0.14 |

Table 5: Comparison of refining with our proposed feedback form versus refining with just error localization as feedback. †: significant with p < 0.05.

How does the generated feedback compare against human-written feedback? The MediaSum split of TofuEval has human-written explanations of why a sentence in the initial response is factually inconsistent with the source document. We leverage these to calculate sentence-level recall statistics by comparing them against the feedback from Feed+Refine and DCR. We divide the comparisons into the following categories: (1) “Error Match”: when both the human and model-generated feedback discuss the same error; (2) “Error, No Match”: when the human and model-generated feedback discuss different errors; (3) “No Error Detected, No Match”: when the human-written explanation describes an error but the model-generated feedback says there is no error. We prompt GPT-4 (GPT-4-0613) with Prompt F.1 to evaluate the above.

In Table 6 we see that the feedback generated by Critique-L3-FT and Critique-L2-FT using DCR has a significantly higher match rate than when using Feed+Refine. In the latter, we see the model only learns to say “no error”. We also see how beneficial fine-tuning is for DCR when using LLAMA2-7B-CHAT (L2) as M_critique; however, LLAMA3-8B-INSTRUCT (L3) benefits substantially less. Table 21 shows examples of human annotation and Critique-L3-FT feedback on TofuEval.

| M_critique | Refinement Strategy | Error Match ↑ | Error, No Match ↓ | No Error Detected, No Match ↓ |
|---|---|---|---|---|
| GPT-4 | Feed+Refine | 0.76 | 0.06 | 0.18 |
| GPT-4 | DCR | 0.61 | 0.01 | 0.38 |
| CRTQ-L3-FT | Feed+Refine | 0.01 | 0.0 | 0.99 |
| CRTQ-L3-FT | DCR | 0.58 | 0.06 | 0.36 |
| L3 | DCR | 0.56 | 0.08 | 0.36 |
| CRTQ-L2-FT | Feed+Refine | 0.01 | 0.0 | 0.99 |
| CRTQ-L2-FT | DCR | 0.58 | 0.06 | 0.36 |
| L2 | DCR | 0.1 | 0.02 | 0.87 |

Table 6: Sentence-wise comparison of generated feedback against human-written feedback in TofuEval. CRTQ is prefixed for models trained as M_critique.

What kind of edits does the feedback model suggest and the refinement make? We manually examine 50 feedback instances generated by GPT-4, Critique-L3-FT, and Critique-L2-FT when prompted to refine with DCR. Table 7 shows the distribution (in percentage) of suggested edits based on error span granularity and edit actions. We observe variation in granularity of error spans and the edit actions suggested across all models, with phrase substitution being the most common. This is further supported by the breakdown of edit types in Table 12.

Are the edits made by the reﬁnement model

faithful to the feedback? We manually examined 50 MediaSum/TofuEval refinements generated using DCR by GPT-4, Refine-L3-FT, and Refine-L2-FT. For each instance, we first look at the error span and the reasoning and then look for evidence in the refinement for whether or not the error span was fixed according to the feedback. If the error span was fixed as per the feedback, we mark the edit as faithful to the feedback. For each response, we calculate the percentage of feedback points that were incorporated in the refinement and then average that over the 50 examples for every model. We find that on average GPT-4 is able to incorporate 92% of the feedback, whereas Refine-L3-FT is able to incorporate 96% and Refine-L2-FT is able to incorporate 69%. Table 19 shows

examples of the original response, feedback and

reﬁnements as generated by Reﬁne-L3-FT when

reﬁning with DCR. We see that the feedback ex-

tracts error spans along with reasons for why the

span was an error and a suggested ﬁx. We also see

the reﬁnement incorporating these changes in the

ﬁnal generation.
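The faithfulness number described above is a per-response percentage averaged over responses. A minimal sketch of that computation, assuming the manual judgments are stored as one list of booleans per response (a hypothetical data layout, as is the function name):

```python
from statistics import mean


def faithfulness_rate(annotations):
    """Average, over responses, of the share of feedback points incorporated.

    annotations: one list per response; each entry is True if that feedback
    point was incorporated in the refinement, and False otherwise.
    """
    per_response = [sum(points) / len(points) for points in annotations if points]
    return 100 * mean(per_response)
```

Because the average is taken per response first, a response with a single feedback point counts as much as a response with several.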

7 Related Work

Several recent evaluation datasets in NLP have fol-

lowed the trend of collecting explanations along-

side evaluation scores. This ranges from model

based evaluation datasets (Jiang et al., 2024; Xu

et al., 2023b; Li et al., 2024; Kim et al., 2024b;

Cui et al., 2023; Kim et al., 2023, 2024a) to human

evaluation (Trienes et al., 2024; Wadhwa et al.,

2023; Saunders et al., 2022). In our work, we test how effective feedback is for refinement; feedback written for evaluation is not always optimal for this purpose.

For reﬁnement speciﬁcally, Liu et al. (2023) im-

plement a two step reﬁnement pipeline for the task

of improving factual consistency in summarization.

However, their use of the XSum dataset results

in several key differences: the high prevalence of

errors, simple summarization task, and short sum-

maries mean that no “detect” step is necessary. Fur-

thermore, our use of stronger LLMs and factuality

evaluators substantially changes the performance

regime of our base model and evaluator methods,

leading to different conclusions. Xu et al. (2023a)

also compare feedback modalities like in our work.

However, they focus on reﬁning with a general

instruction vs. using scalar feedback vs. using a bi-

nary feedback. Furthermore, they improve overall

quality rather than a speciﬁc aspect like factuality.

To reﬁne outputs for factuality, Mishra et al.

(2024) train a hallucination detector and editor by

ﬁne-tuning the model to localize ﬁne-grained hal-

lucination error types by tagging spans and then re-

moving the tagged spans in post-editing. Balachandran et al. (2022), Fabbri et al. (2022), and Thorne and Vlachos (2021) train post-editing models with techniques like infilling and sentence compression. These approaches

do not use natural language feedback, and are most

optimized for deletion or replacement of simple er-

rors rather than complex rewriting, where feedback

can more clearly articulate a subtle error.

8 Conclusion

In this work, we propose a new post-hoc reﬁnement

method: DETECT, CRITIQUE and REFINE (DCR).

We showed that our method performs better than

existing reﬁnement baselines on the task of im-

proving factual consistency in document-grounded

topic-focused summaries. We also showed that

smaller models can be ﬁne-tuned to perform ﬁne-

grained feedback generation for identifying and

reasoning about any factual inconsistency. When

reﬁning with our proposed method, smaller models

perform on par with GPT-4 on our task. We also

show that our critic model produces more effec-

tive feedback for the task of post-hoc reﬁning as

compared to existing models.

Limitations

Our work is scoped to focus on reﬁning LLM re-

sponses to improve factual consistency. Factual-

ity has two important properties as an evaluation

dimension: (1) it has a mostly objective notion of

correctness (annotators can largely agree on what is

hallucinated or not); (2) reﬁning a response may in-

volve many small corrections to different parts of a

response. Other aspects of LLM responses such as

completeness and stylistic consistency share these

problem features; however, we focus on factuality

due to the existence of models for automatic evalu-

ation. Nevertheless, we believe our approach can

be generalized to other evaluation dimensions as

well, potentially leveraging new models such as

Prometheus (Kim et al., 2024a).

Our work relies on an off-the-shelf and reliable

DETECT model for doing sentence-level factual

consistency detection. We understand that such

a discriminator might not always be available for

different aspects and also that subjective tasks cannot be classified as correct or incorrect. More exploration needs to be done on how to effectively choose and train M_detect for tasks other than

document-grounded factuality detection.

We also note that the ﬁne-tuning data comes

from a similar distribution of documents and in-

structions as the test data. It is also limited to En-

glish. It remains to be seen how our work extends

to other languages and general document-grounded

tasks. However, our approach is not fundamentally

restricted to English-language reﬁnement in these

domains.

Acknowledgments

This work was principally supported by a grant

from Open Philanthropy, as well as NSF CAREER

Awards IIS-2145280, IIS-2145479, and the NSF

AI Institute for Foundations of Machine Learning

(IFML). Thanks to Karim Villaescusa F., Kathryn

Kazanas, and Keziah Reina for human annotations

for the task of editing with feedback. Thanks to

Fangcong Yin for help with debugging ﬁne-tuning

code.

References

Afra Feyza Akyurek, Ekin Akyurek, Ashwin Kalyan,

Peter Clark, Derry Tanti Wijaya, and Niket Tandon.

2023. RL4F: Generating natural language feedback

with reinforcement learning for repairing model out-

puts. In Proceedings of the 61st Annual Meeting of

the Association for Computational Linguistics (Vol-

ume 1: Long Papers), pages 7716–7733, Toronto,

Canada. Association for Computational Linguistics.

Vidhisha Balachandran, Hannaneh Hajishirzi, William

Cohen, and Yulia Tsvetkov. 2022. Correcting diverse

factual errors in abstractive summarization via post-

editing and language model inﬁlling. In Proceedings

of the 2022 Conference on Empirical Methods in Nat-

ural Language Processing, pages 9818–9830, Abu

Dhabi, United Arab Emirates. Association for Com-

putational Linguistics.

Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang

Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen.

2022. CodeT: Code Generation with Generated Tests.

arXiv preprint arXiv:2207.10397.

Tongfei Chen, Zhengping Jiang, Adam Poliak, Keisuke

Sakaguchi, and Benjamin Van Durme. 2020. Un-

certain natural language inference. In Proceedings

of the 58th Annual Meeting of the Association for

Computational Linguistics, pages 8772–8779, On-

line. Association for Computational Linguistics.

Xinyun Chen, Maxwell Lin, Nathanael Schärli, and

Denny Zhou. 2023. Teaching Large Language Mod-

els to Self-Debug. arXiv preprint arXiv:2304.05128.

I-Chun Chern, Stefﬁ Chern, Shiqi Chen, Weizhe Yuan,

Kehua Feng, Chunting Zhou, Junxian He, Graham

Neubig, and Pengfei Liu. 2024. FacTool: Factual-

ity Detection in Generative AI - A Tool Augmented

Framework for Multi-Task and Multi-Domain Sce-

narios.

Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anasta-

sios Nikolas Angelopoulos, Tianle Li, Dacheng Li,

Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E

Gonzalez, et al. 2024. Chatbot Arena: An Open

Platform for Evaluating LLMs by Human Preference.

arXiv preprint arXiv:2403.04132.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian,

Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias

Plappert, Jerry Tworek, Jacob Hilton, Reiichiro

Nakano, Christopher Hesse, and John Schulman.

2021. Training veriﬁers to solve math word prob-

lems. ArXiv, abs/2110.14168.

Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao,

Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and

Maosong Sun. 2023. UltraFeedback: Boosting Lan-

guage Models with High-quality Feedback. arXiv

preprint arXiv:2310.01377.

Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin,

Shengding Hu, Zhiyuan Liu, Maosong Sun, and

Bowen Zhou. 2023. Enhancing chat language mod-

els by scaling high-quality instructional conversa-

tions. In Proceedings of the 2023 Conference on

Empirical Methods in Natural Language Processing,

pages 3029–3051, Singapore. Association for Com-

putational Linguistics.

Yann Dubois, Chen Xuechen Li, Rohan Taori, Tianyi

Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin,

Percy S Liang, and Tatsunori B Hashimoto. 2024.

Alpacafarm: A simulation framework for methods

that learn from human feedback. Advances in Neural

Information Processing Systems, 36.

Alex Fabbri, Prafulla Kumar Choubey, Jesse Vig, Chien-

Sheng Wu, and Caiming Xiong. 2022. Improving fac-

tual consistency in summarization with compression-

based post-editing. In Proceedings of the 2022 Con-

ference on Empirical Methods in Natural Language

Processing, pages 9149–9156, Abu Dhabi, United

Arab Emirates. Association for Computational Lin-

guistics.

Zelalem Gero, Chandan Singh, Hao Cheng, Tristan Nau-

mann, Michel Galley, Jianfeng Gao, and Hoifung

Poon. 2023. Self-veriﬁcation improves few-shot clin-

ical information extraction. In ICML 3rd Workshop

on Interpretable Machine Learning in Healthcare

(IMLH).

Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen,

Yujiu Yang, Nan Duan, and Weizhu Chen. 2024.

CRITIC: Large Language Models Can Self-Correct

with Tool-Interactive Critiquing.

Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-

Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu

Chen. 2022. LoRA: Low-rank adaptation of large

language models. In International Conference on

Learning Representations.

Dongfu Jiang, Yishan Li, Ge Zhang, Wenhao Huang,

Bill Yuchen Lin, and Wenhu Chen. 2024. TIGER-

Score: Building Explainable Metric for All Text Gen-

eration Task.

Shuyang Jiang, Yuhao Wang, and Yu Wang. 2023. Self-

Evolve: A Code Evolution Framework via Large

Language Models. arXiv preprint arXiv:2306.02907.

Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang,

Shayne Longpre, Hwaran Lee, Sangdoo Yun,

Seongjin Shin, Sungdong Kim, James Thorne, et al.

2023. Prometheus: Inducing evaluation capability

in language models. In NeurIPS 2023 Workshop on

Instruction Tuning and Instruction Following.

Seungone Kim, Juyoung Suk, Shayne Longpre,

Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham

Neubig, Moontae Lee, Kyungjae Lee, and Minjoon

Seo. 2024a. Prometheus 2: An Open Source Lan-

guage Model Specialized in Evaluating Other Lan-

guage Models. arXiv preprint arXiv:2405.01535.

Tae Soo Kim, Yoonjoo Lee, Jamin Shin, Young-Ho

Kim, and Juho Kim. 2024b. EvalLM: Interactive

Evaluation of Large Language Model Prompts on

User-Deﬁned Criteria. In Proceedings of the CHI

Conference on Human Factors in Computing Systems,

pages 1–21.

Leandro von Werra, Younes Belkada, Lewis Tunstall,

Edward Beeching, Tristan Thrush, Nathan Lambert,

and Shengyi Huang. 2020. TRL: Transformer Rein-

forcement Learning.

Vladimir I Levenshtein et al. 1966. Binary codes capa-

ble of correcting deletions, insertions, and reversals.

In Soviet physics doklady, volume 10, pages 707–710.

Soviet Union.

Sharon Levy, Michael Saxon, and William Yang Wang.

2021. Investigating memorization of conspiracy theo-

ries in text generation. In Findings of the Association

for Computational Linguistics: ACL-IJCNLP 2021,

pages 4718–4729, Online. Association for Computa-

tional Linguistics.

Zhen Li, Xiaohan Xu, Tao Shen, Can Xu, Jia-Chen

Gu, and Chongyang Tao. 2024. Leveraging Large

Language Models for NLG Evaluation: A Survey.

arXiv preprint arXiv:2401.07103.

Stephanie C. Lin, Jacob Hilton, and Owain Evans. 2021.

TruthfulQA: Measuring How Models Mimic Human

Falsehoods. In Annual Meeting of the Association

for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man-

dar Joshi, Danqi Chen, Omer Levy, Mike Lewis,

Luke Zettlemoyer, and Veselin Stoyanov. 2019.

RoBERTa: A Robustly Optimized BERT Pretrain-

ing Approach. arXiv preprint arXiv:1907.11692.

Yixin Liu, Budhaditya Deb, Milagro Teruel, Aaron Hal-

faker, Dragomir Radev, and Ahmed Hassan Awadal-

lah. 2023. On improving summarization factual con-

sistency from natural language feedback. In Proceed-

ings of the 61st Annual Meeting of the Association for

Computational Linguistics (Volume 1: Long Papers),

pages 15144–15161, Toronto, Canada. Association

for Computational Linguistics.

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler

Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon,

Nouha Dziri, Shrimai Prabhumoye, Yiming Yang,

et al. 2023. Self-reﬁne: Iterative reﬁnement with

self-feedback. Advances in Neural Information Pro-

cessing Systems, 36.

Meta. 2024. Introducing Meta Llama 3: The most

capable openly available LLM to date.

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish

Sabharwal. 2018. Can a suit of armor conduct elec-

tricity? a new dataset for open book question answer-

ing. In Conference on Empirical Methods in Natural

Language Processing.

Abhika Mishra, Akari Asai, Vidhisha Balachandran,

Yizhong Wang, Graham Neubig, Yulia Tsvetkov, and

Hannaneh Hajishirzi. 2024. Fine-grained Halluci-

nation Detection and Editing For Language Models.

arXiv preprint arXiv:2401.06855.

Yixin Nie, Xiang Zhou, and Mohit Bansal. 2020. What

can we learn from collective human opinions on nat-

ural language inference data? In Proceedings of the

2020 Conference on Empirical Methods in Natural

Language Processing (EMNLP), pages 9131–9143,

Online. Association for Computational Linguistics.

Theo X. Olausson, Jeevana Priya Inala, Chenglong

Wang, Jianfeng Gao, and Armando Solar-Lezama.

2024. Is self-repair a silver bullet for code genera-

tion? In The Twelfth International Conference on

Learning Representations.

Liangming Pan, Michael Saxon, Wenda Xu, Deepak

Nathani, Xinyi Wang, and William Yang Wang. 2024.

Automatically correcting large language models: Sur-

veying the landscape of diverse automated correction

strategies. Transactions of the Association for Com-

putational Linguistics, 12:484–506.

Sheena Panthaplackel, Miltiadis Allamanis, and Marc

Brockschmidt. 2021. Copy that! editing sequences

by copying spans. In Proceedings of the AAAI Con-

ference on Artiﬁcial Intelligence, volume 35, pages

13622–13630.

Debjit Paul, Mete Ismayilzada, Maxime Peyrard, Beat-

riz Borges, Antoine Bosselut, Robert West, and Boi

Faltings. 2024. REFINER: Reasoning feedback on

intermediate representations. In Proceedings of the

18th Conference of the European Chapter of the As-

sociation for Computational Linguistics (Volume 1:

Long Papers), pages 1100–1126, St. Julian’s, Malta.

Association for Computational Linguistics.

Ellie Pavlick and Tom Kwiatkowski. 2019. Inherent

disagreements in human textual inferences. Transac-

tions of the Association for Computational Linguis-

tics, 7:677–694.

Vikas Raunak, Amr Sharaf, Yiren Wang, Hany

Awadalla, and Arul Menezes. 2023. Leveraging GPT-

4 for automatic translation post-editing. In Find-

ings of the Association for Computational Linguis-

tics: EMNLP 2023, pages 12009–12024, Singapore.

Association for Computational Linguistics.

Jarem Saunders. 2023. Improving automated predic-

tion of English lexical blends through the use of ob-

servable linguistic features. In Proceedings of the

20th SIGMORPHON workshop on Computational

Research in Phonetics, Phonology, and Morphology,

pages 93–97, Toronto, Canada. Association for Com-

putational Linguistics.

William Saunders, Catherine Yeh, Jeff Wu, Steven Bills,

Long Ouyang, Jonathan Ward, and Jan Leike. 2022.

Self-critiquing models for assisting human evaluators.

arXiv preprint arXiv:2206.05802.

Omar Shaikh, Hongxin Zhang, William Held, Michael

Bernstein, and Diyi Yang. 2023. On second thought,

let’s not think step by step! bias and toxicity in zero-

shot reasoning. In Proceedings of the 61st Annual

Meeting of the Association for Computational Lin-

guistics (Volume 1: Long Papers), pages 4454–4470.

Noah Shinn, Federico Cassano, Ashwin Gopinath,

Karthik Narasimhan, and Shunyu Yao. 2024. Re-

ﬂexion: Language agents with verbal reinforcement

learning. Advances in Neural Information Process-

ing Systems, 36.

Alexander G Shypula, Aman Madaan, Yimeng Zeng,

Uri Alon, Jacob R. Gardner, Yiming Yang, Mi-

lad Hashemi, Graham Neubig, Parthasarathy Ran-

ganathan, Osbert Bastani, and Amir Yazdanbakhsh.

2024. Learning performance-improving code edits.

In The Twelfth International Conference on Learning

Representations.

Elias Stengel-Eskin, Archiki Prasad, and Mohit Bansal.

2024. ReGAL: Refactoring Programs to Dis-

cover Generalizable Abstractions. arXiv preprint

arXiv:2401.16467.

Liyan Tang, Philippe Laban, and Greg Durrett.

2024a. MiniCheck: Efﬁcient Fact-Checking of

LLMs on Grounding Documents. arXiv preprint

arXiv:2404.10774.

Liyan Tang, Igor Shalyminov, Amy Wong, Jon Burnsky,

Jake Vincent, Yu’an Yang, Sifﬁ Singh, Song Feng,

Hwanjun Song, Hang Su, Justin Sun, Yi Zhang, Saab

Mansour, and Kathleen McKeown. 2024b. TofuEval:

Evaluating hallucinations of LLMs on topic-focused

dialogue summarization. In NAACL 2024.

James Thorne and Andreas Vlachos. 2021. Evidence-

based factual error correction. In Proceedings of the

59th Annual Meeting of the Association for Compu-

tational Linguistics and the 11th International Joint

Conference on Natural Language Processing (Vol-

ume 1: Long Papers), pages 3298–3309, Online. As-

sociation for Computational Linguistics.

Yuan Tian, Nan Xu, Ruike Zhang, and Wenji Mao. 2023.

Dynamic routing transformer network for multimodal

sarcasm detection. In Proceedings of the 61st Annual

Meeting of the Association for Computational Lin-

guistics (Volume 1: Long Papers), pages 2468–2480,

Toronto, Canada. Association for Computational Lin-

guistics.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Al-

bert, Amjad Almahairi, Yasmine Babaei, Nikolay

Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti

Bhosale, et al. 2023. Llama 2: Open founda-

tion and ﬁne-tuned chat models. arXiv preprint

arXiv:2307.09288.

Jan Trienes, Sebastian Joseph, Jörg Schlötterer, Christin

Seifert, Kyle Lo, Wei Xu, Byron C. Wallace, and

Junyi Jessy Li. 2024. InfoLossQA: Characterizing

and recovering information loss in text simpliﬁcation.

In Proceedings of the 62nd Annual Meeting of the

Association for Computational Linguistics.

Miles Turpin, Julian Michael, Ethan Perez, and Samuel

Bowman. 2024. Language models don’t always say

what they think: unfaithful explanations in chain-of-

thought prompting. Advances in Neural Information

Processing Systems, 36.

Manya Wadhwa, Jifan Chen, Junyi Jessy Li, and Greg

Durrett. 2023. Using natural language explana-

tions to rescale human judgments. arXiv preprint

arXiv:2305.14770.

Peiyi Wang, Lei Li, Liang Chen, Dawei Zhu, Binghuai

Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang

Sui. 2023a. Large Language Models are not Fair

Evaluators. ArXiv, abs/2305.17926.

Tianlu Wang, Ping Yu, Xiaoqing Ellen Tan, Sean

O’Brien, Ramakanth Pasunuru, Jane Dwivedi-Yu,

Olga Golovneva, Luke Zettlemoyer, Maryam Fazel-

Zarandi, and Asli Celikyilmaz. 2023b. Shepherd: A

critic for language model generation. arXiv preprint

arXiv:2308.04592.

Sean Welleck, Ximing Lu, Peter West, Faeze Brah-

man, Tianxiao Shen, Daniel Khashabi, and Yejin

Choi. 2023. Generating sequences by learning to

self-correct. In The Eleventh International Confer-

ence on Learning Representations.

Wenda Xu, Daniel Deutsch, Mara Finkelstein, Juraj

Juraska, Biao Zhang, Zhongtao Liu, William Yang

Wang, Lei Li, and Markus Freitag. 2023a. LLMRe-

ﬁne: Pinpointing and Reﬁning Large Language Mod-

els via Fine-Grained Actionable Feedback. arXiv

preprint arXiv:2311.09336.

Wenda Xu, Danqing Wang, Liangming Pan, Zhenqiao

Song, Markus Freitag, William Wang, and Lei Li.

2023b. INSTRUCTSCORE: Towards explainable

text generation evaluation with automatic feedback.

In Proceedings of the 2023 Conference on Empiri-

cal Methods in Natural Language Processing, pages

5967–5994, Singapore. Association for Computa-

tional Linguistics.

Seonghyeon Ye, Yongrae Jo, Doyoung Kim, Sungdong

Kim, Hyeonbin Hwang, and Minjoon Seo. 2023.

SelFee: Iterative Self-Revising LLM Empowered

by Self-Feedback Generation. Blog post.

Xi Ye and Greg Durrett. 2022. The Unreliability of

Explanations in Few-shot Prompting for Textual Rea-

soning. In Advances in Neural Information Process-

ing Systems.

Yuheng Zha, Yichi Yang, Ruichen Li, and Zhiting Hu.

2023. AlignScore: Evaluating factual consistency

with a uniﬁed alignment function. In Proceedings

of the 61st Annual Meeting of the Association for

Computational Linguistics (Volume 1: Long Papers),

pages 11328–11348, Toronto, Canada. Association

for Computational Linguistics.

Muru Zhang, Oﬁr Press, William Merrill, Alisa Liu, and

Noah A. Smith. 2024. How language model hallu-

cinations can snowball. In Forty-ﬁrst International

Conference on Machine Learning.

Chenguang Zhu, Yang Liu, Jie Mei, and Michael Zeng.

2021. MediaSum: A large-scale media interview

dataset for dialogue summarization. In Proceedings

of the 2021 Conference of the North American Chap-

ter of the Association for Computational Linguistics:

Human Language Technologies, pages 5927–5934,

Online. Association for Computational Linguistics.

Terry Yue Zhuo, Yujin Huang, Chunyang Chen, and

Zhenchang Xing. 2023. Red teaming ChatGPT via

Jailbreaking: Bias, Robustness, Reliability and Toxi-

city. arXiv preprint arXiv:2301.12867.

A Dataset Creation

A.1 UltraChat

UltraChat is an open-source, large-scale, multi-round dialogue dataset released under the MIT License. The dataset was constructed with the aim of building powerful language models with general conversational capability.

As mentioned in Section 4.1, we use a sub-

set of document-grounded instructions from Ul-

traChat. To guarantee sufﬁcient context, we only

sample instances which are at least 1000 charac-

ters long. Each UltraChat instance has a source

document and the summarization instruction in one

prompt. To ensure that the instruction is summarization related, we check for the following list of

keywords: [“can you summarize”,“summarize the

following”,“give a summary”,“can you provide a

summary”,“provide a brief summary”,“summarize

the”,“can you give me a summary”]. This dataset

is in English only. Tables 13 and 14 show examples

of UltraChat instructions.
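The length and keyword filters described above can be sketched as follows. This is an illustration only: the function name is hypothetical, and case-insensitive matching and measuring length over the whole prompt are assumptions not stated in the text.

```python
SUMMARY_KEYWORDS = [
    "can you summarize",
    "summarize the following",
    "give a summary",
    "can you provide a summary",
    "provide a brief summary",
    "summarize the",
    "can you give me a summary",
]
MIN_CHARS = 1000  # minimum instance length stated in the text


def is_summarization_instance(prompt: str) -> bool:
    """Keep prompts that are long enough and contain a summarization keyword."""
    if len(prompt) < MIN_CHARS:
        return False
    lowered = prompt.lower()  # assumption: matching ignores case
    return any(keyword in lowered for keyword in SUMMARY_KEYWORDS)
```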

A.2 MediaSum/TofuEval

TofuEval is a benchmark evaluating factual con-

sistency of document grounded summaries. This

dataset is released under the MIT-0 license.

We use the MediaSum subset of TofuEval as our evaluation set, but sample from MediaSum’s original train set to gather more data for training. We

create a summary instruction process similar to To-

fuEval, where they create topic-focused summaries

by prompting GPT-4 to generate 3 topics being dis-

cussed in the dialogue to be summarized, and then

converting each of those topics to be a summariza-

tion instruction. We follow the same generation

strategy and prompt GPT-4 (Prompt A.1) to give 3

topics under discussion in the sampled dialogues

from MediaSum. We then use each topic as a separate summarization instruction and generate an initial response. Tables 15 and 16 show examples of MediaSum source documents along with the instruction. This dataset is in English only.

Prompt A.1: Zero-Shot prompt used with GPT-4 to

generate topics for MediaSum articles

Document:

Enumerate three main topics that people would like to

know from the provided document. Each topic should be

around 5 words.

A.3 Initial Response Generation

We prompt LLAMA2-7B-CHAT to get responses

for instructions from UltraChat and MediaSum.

Since UltraChat instructions have the source doc-

ument and the summarization instruction, we use

them as is. For MediaSum, we create a summariza-

tion prompt using the topics generated by GPT-4.

The initial response generation prompt for Medi-

aSum is given in prompt A.2. Tables 13 and 14

show examples of the UltraChat prompt along with

the initial response. Tables 15 and 16 show exam-

ples of the MediaSum prompt along with initial

response.

Once we get initial responses for UltraChat, we

ﬁlter and only keep instances where the initial re-

sponse length is shorter than the source document.

Prompt A.2: Prompt used with Llama2-7b-chat

to generate the initial summary for MediaSum

instances

Document:

Summarize the provided document focusing on {{ topic }}. The summary should be less than 50 words in length.
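Since each MediaSum dialogue yields three topics, Prompt A.2 is instantiated once per topic. A minimal sketch of that expansion, assuming the document text is placed directly after "Document:" (the placeholder for it is not visible in the prompt as printed) and using a hypothetical helper name:

```python
PROMPT_A2 = (
    "Document: {document}\n\n"
    "Summarize the provided document focusing on {topic}. "
    "The summary should be less than 50 words in length."
)


def build_summary_prompts(document: str, topics: list[str]) -> list[str]:
    """Create one topic-focused summarization prompt per topic."""
    return [PROMPT_A2.format(document=document, topic=topic) for topic in topics]
```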

B Supervised Fine-Tuning

In Section 3.1 we outline the creation of the training data and its use for fine-tuning. The structured feedback from GPT-4 has the following components: error localization in the form of a span, feedback reasoning for why the span is inconsistent, and a suggested fix. We map this structured feedback to a natural language form using the following template: “{feedback} The error span is: ‘{span}’. To fix this, consider changing the span to ‘{fix}’”
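The template mapping above amounts to a single string substitution. A minimal sketch, where the function name is hypothetical and the template text is taken from the paragraph above:

```python
FEEDBACK_TEMPLATE = (
    "{feedback} The error span is: ‘{span}’. "
    "To fix this, consider changing the span to ‘{fix}’"
)


def feedback_to_text(feedback: str, span: str, fix: str) -> str:
    """Render one structured feedback item as natural language."""
    return FEEDBACK_TEMPLATE.format(feedback=feedback, span=span, fix=fix)
```

For example, `feedback_to_text("The document says May.", "in June", "in May")` yields a single feedback string that can be passed to the refinement model.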

We fine-tune LLAMA-3-8B-INSTRUCT and LLAMA-2-7B-CHAT for our proposed method as well as for the baselines. We use SFTTrainer from TRL (von Werra et al., 2020) to fine-tune. All fine-tuning can be conducted on 3 × A6000 GPUs with 48GB of memory each. We use LoRA (Hu et al., 2022) with rank 8, a learning rate of 2e-4, and a warmup ratio of 0.05. We set the per-GPU batch size to 2 and the maximum sequence length to 2048.
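For reference, the reported hyperparameters can be collected in one place. This is only a bookkeeping sketch: the class and field names are hypothetical and are not TRL or PEFT arguments, and the effective batch size assumes data-parallel training on all 3 GPUs with no gradient accumulation, which the text does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    lora_rank: int = 8
    learning_rate: float = 2e-4
    warmup_ratio: float = 0.05
    per_gpu_batch_size: int = 2
    max_seq_length: int = 2048
    num_gpus: int = 3
    # Assumption: gradient accumulation is not mentioned in the text.
    gradient_accumulation_steps: int = 1

    @property
    def effective_batch_size(self) -> int:
        """Examples per optimizer step under the assumptions above."""
        return (
            self.per_gpu_batch_size
            * self.num_gpus
            * self.gradient_accumulation_steps
        )
```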

Prompt B.1: p_critique used to generate feedback for training data creation from GPT-4

Summarize the following document on the topic: {{ topic }}: {{ document }}

Summary on topic: {{ topic }}

—–

The provided summary is factually inconsistent with

the corresponding document. This implies that there is

information in the summary that is NOT substantiated

by the document. Factual inconsistencies can be of the

following types:

1. Mis-Referencing: a property or an event in the summary

can be found in the document, but are associated with the

wrong entity

2. Stating Opinion As Fact: the summary entails a

proposition that’s mentioned in the document not as a fact,

but as someone’s opinion

3. Reasoning Error: the summary makes one or more

wrong inferences from the information in the document

4. Tense/modality Error: the tense or modal (eg: can,

may, must) used in the summary does not match the

tense/modality of the document

5. Extrinsic Information: the summary contains new

information not grounded in the source document

6. Contradiction: the summary contradicts the document

7. Nuanced Meaning Shift: the summary twists informa-

tion from the document in a subtle way

Identify factually inconsistent information in the form of a

JSON and return a list with the following keys:

1. inconsistency: <span from the summary that is factually

inconsistent>

2. inconsistency type: <the inconsistency type from the

above list of types>

3. feedback: <explanation of the error and how it can be

ﬁxed>

4. ﬁx: <correct span that ﬁxes the inconsistency>
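A response to Prompt B.1 is expected to be a JSON list with the four keys above. The sketch below shows one way to parse and validate such a response; the function name and the choice to silently drop malformed items are assumptions, not part of the described pipeline.

```python
import json
from typing import Dict, List

REQUIRED_KEYS = ("inconsistency", "inconsistency type", "feedback", "fix")


def parse_critique(raw: str) -> List[Dict[str, str]]:
    """Parse a JSON critique and keep only items with all required keys."""
    items = json.loads(raw)
    if isinstance(items, dict):  # tolerate a single object instead of a list
        items = [items]
    valid = []
    for item in items:
        if isinstance(item, dict) and all(k in item for k in REQUIRED_KEYS):
            valid.append({k: str(item[k]) for k in REQUIRED_KEYS})
    return valid
```

Each validated item carries the span, the reasoning, and the suggested fix needed for the natural language template in Appendix B.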

Prompt B.2: p_refine to generate refinements with natural language feedback for training data curation

I summarized the following document on the topic: ’{{ topic }}: {{ document }}

Summary of the above document on topic: {{ topic }}:

Feedback for the above summary: {{ feedback }}

Edit the summary such that the reﬁnement doesn’t have

any errors mentioned in the feedback. Make the minimum

number of changes when doing the reﬁnement.

C Reﬁnement Strategies

We list all instructions for the different reﬁnement

strategies described in Section 4.2.

One-Step: Direct Reﬁnement (DR) We prompt

the model with a general reﬁnement instruction

to improve the factual consistency of the initial

response. This strategy does not use any external

or self-generated feedback.

Prompt Instruction:

Prompt C.1: DR Prompt for MediaSum

I summarized the following document on the topic: ‘{{ topic }}’:

Summary of the above document on topic ‘{{ topic }}’:

If there are any factual inconsistencies in the sum-

mary then edit the summary such that the reﬁnement

doesn’t have any inconsistencies. Consistency in this

context implies that all information presented in the

summary is substantiated by the document.If the summary

is consistent, then just the copy the same summary with no

changes. When reﬁning, make the minimum number of

changes.

Two-Step: Refinement with Natural Language Feedback (Feed+DR) We first prompt the model to generate feedback reasoning about any factual inconsistencies in the initial response, then pass the feedback to the refinement model. In this case, the feedback is generated sentence-wise, combined, and then used for refinement.

Feedback Prompt Instruction:

Prompt C.2: Feedback Prompt for MediaSum for Two-Step refinement

I summarized the following document on the topic: ‘{{ topic }}’:

Summary of the above document on topic ‘{{ topic }}’:

For the following sentence in the summary:

reason if there is any factually inconsistent span in the

sentence. A span is factually inconsistent if it cannot be

substantiated by the document. If there is no inconsistency,

then end your answer with “no error”. Otherwise if there

is a factual inconsistency, then give reasons for it, point

to the error span by stating “The error span: <span from

sentence>” and end your answer with a suggested ﬁx to

the summary

Reﬁnement Prompt Instruction:

Prompt C.3: Refine Prompt for MediaSum for Two-Step refinement

I summarized the following document on the topic: ‘{{ topic }}’:

Summary of the above document on topic ‘{{ topic }}’:

Feedback for the above summary: {{ feedback }}

Edit the user response such that the reﬁnement

doesn’t have any errors mentioned in the feedback.

Make the minimum number of changes when doing the

reﬁnement. Do not include a preamble.

Two-Step: Direct Refinement with DETECT (Detect+DR) We first filter out any initial responses that are factually consistent using our M_detect, and then refine the summaries that have an inconsistency with a general instruction. This baseline does not use any external or self-generated feedback.

Prompt Instruction:

Prompt C.4: Refinement prompt for MediaSum for Two-Step: Direct Refinement with Detect

I summarized the following document on the topic: ‘{{ topic }}’:

Summary of the above document on topic ‘{{ topic }}’:

Edit the response such that the reﬁnement doesn’t

have any factual inconsistencies. Consistency in this

context implies that all information presented in the

response is substantiated by the document. When reﬁning,

make the minimum number of changes.

Note: in this baseline we remove the need for the

model to do the detection and reﬁnement together.

Three-Step: DETECT, REASON, FIX: DCR (proposed) We first filter out any initial responses that are factually consistent using M_detect. We use M_critique to generate sentence-wise feedback for any sentences that were detected to have an error. We combine this feedback and generate a refinement using M_refine.
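The three-step procedure can be sketched as a small pipeline over three callables. This is an illustrative outline only: the callable signatures, the naive regex sentence splitter, and the newline-joined feedback are assumptions, and in practice each callable would wrap MiniCheck or a prompted or fine-tuned LLM.

```python
import re
from typing import Callable, List

Detector = Callable[[str, str], bool]      # (document, sentence) -> True if consistent
Critic = Callable[[str, str, str], str]    # (document, summary, sentence) -> feedback
Refiner = Callable[[str, str, str], str]   # (document, summary, feedback) -> refinement


def split_sentences(text: str) -> List[str]:
    """Naive splitter on sentence-final punctuation (an assumption)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def dcr(document: str, summary: str,
        detect: Detector, critique: Critic, refine: Refiner) -> str:
    """Detect inconsistent sentences, critique each one, then refine once."""
    flagged = [s for s in split_sentences(summary) if not detect(document, s)]
    if not flagged:
        return summary  # consistent summaries are left untouched
    feedback = "\n".join(critique(document, summary, s) for s in flagged)
    return refine(document, summary, feedback)
```

Because detection runs first, the critic is only called on sentences already flagged as inconsistent, so it never has to decide whether an error exists.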

Feedback Prompt:

Prompt C.5: Feedback prompt for MediaSum for Three-Step Refinement: DCR (proposed)

I summarized the following document on the topic: ‘{{ topic }}’:

Summary of the above document on topic ‘{{ topic }}’:

reason about the factually inconsistent span in the

sentence. A span is factually inconsistent if it cannot

be substantiated by the document. Give reasons for the

factual inconsistency, point to the error span by stating

“The error span: <span from sentence> and end your

answer with a suggested ﬁx to the summary.

Reﬁnement Instruction:

Prompt C.6: Refinement prompt for MediaSum for Three-Step Refinement: DCR (proposed)

I summarized the following document on the topic: ‘{{ topic }}’:

Summary of the above document on topic ‘{{ topic }}’:

Feedback for the above summary: {{ feedback }}

Edit the user response such that the reﬁnement

doesn’t have any errors mentioned in the feedback.

Make the minimum number of changes when doing the

reﬁnement. Do not include a preamble.

D Automatic Evaluation

D.1 GPT-4 Factuality Likert Scale

Prompt D.1: Zero-Shot prompt used with GPT-4 to

generate a factuality score on a scale of 1-5.

Determine whether the provided summary is consistent

with the corresponding document. Consistency in this

context implies that all information presented in the

response is substantiated by the document. If not, it should

be considered inconsistent.

The response can have one or more of the following errors:

1. Extrinsic Information: the response contains new

information not grounded in the source material

2. Mis-Referencing: a property or an event in the response

can be found in the source material, but are associated

with the wrong entity

3. Stating Opinion As Fact: the response entails a

proposition that’s mentioned in the source material not as

a fact, but as someone’s opinion

4. Reasoning Error: the response makes one or more

wrong inferences from the information in the source

material

5. Tense/modality Error: the tense or modal (eg: can, may,

must) used in the response sentence does not match the

tense/modality of the source material

6. Contradiction: the response contradicts the source

material

7. Nuanced Meaning Shift: the response twists information

from the source material in a subtle way

Given the error categories, rate the above response on a

scale of 1 to 5 based on extent of factual consistency:

5. completely consistent: the response is com-

pletely factually consistent with the source material.

4. insigniﬁcant inconsistencies: the response is mostly

factually consistent, with slight inconsistencies not

affecting main points.

3. partially inconsistent: overall factually consistent, with

a few inconsistencies with the source material.

2. severe inconsistencies: nearly half response is factually

inconsistent, with severe deviation from main points.

1. completely inconsistent: the entire response is factually

inconsistent with the source material.

First output a list of errors that the summary makes, then

conclude the response with a score in the following format:

"therefore, the score is:"

D.2 GPT-4 Pairwise Evaluation For Win-Rate

Prompt D.2: Zero-Shot prompt used with GPT-4 to

generate pairwise factuality scores on a scale

of 1-5.

Determine whether the provided summary is consistent

with the corresponding document. Consistency in this

context implies that all information presented in the

response is substantiated by the document. If not, it should

be considered inconsistent.

## Response 1 {{ response1 }}

## Response 2 {{ response2 }}

A response can have one or more of the following errors:

1. Extrinsic Information: the response contains new

information not grounded in the source material

2. Mis-Referencing: a property or an event in the response

can be found in the source material, but are associated

with the wrong entity

3. Stating Opinion As Fact: the response entails a

proposition that’s mentioned in the source material not as

a fact, but as someone’s opinion

4. Reasoning Error: the response makes one or more

wrong inferences from the information in the source

material

5. Tense/modality Error: the tense or modal (eg: can, may,

must) used in the response sentence does not match the

tense/modality of the source material

6. Contradiction: the response contradicts the source

material

7. Nuanced Meaning Shift: the response twists information

from the source material in a subtle way

Given the error categories, rate each response on a scale of

1 to 5 based on extent of factual consistency:

5. completely consistent: the response is completely

factually consistent with the source material.

4. insigniﬁcant inconsistencies: the response is mostly

factually consistent, with slight inconsistencies not

affecting main points.

3. partially inconsistent: overall factually consistent, with

a few inconsistencies with the source material.

2. severe inconsistencies: nearly half response is factually

inconsistent, with severe deviation from main points.

1. completely inconsistent: the entire response is factually

inconsistent with the source material

For each response, ﬁrst output a list of errors that the

summary makes, then conclude the response with a score

in the following format: "therefore, the score is:"

Output Format:

## Response 1

...

## Response 2

...
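Both Prompt D.1 and Prompt D.2 ask GPT-4 to end with "therefore, the score is:" followed by a 1-5 score. The sketch below shows one way to extract those scores and turn pairwise scores into win, tie, and loss rates. The regular expression, the function names, the treatment of Response 1 as the original and Response 2 as the refinement, and the definition of a win as a strictly higher refined score are all assumptions for illustration.

```python
import re
from typing import Dict, List, Optional, Tuple

SCORE_RE = re.compile(r"therefore,\s*the score is:?\s*([1-5])", re.IGNORECASE)


def parse_score(text: str) -> Optional[int]:
    """Return the last 1-5 score stated in the text, if any."""
    matches = SCORE_RE.findall(text)
    return int(matches[-1]) if matches else None


def parse_pairwise(output: str) -> Tuple[Optional[int], Optional[int]]:
    """Split on the '## Response 2' header and parse one score per part."""
    parts = re.split(r"##\s*Response\s*2", output, maxsplit=1)
    if len(parts) != 2:
        return None, None
    return parse_score(parts[0]), parse_score(parts[1])


def win_tie_loss(pairs: List[Tuple[Optional[int], Optional[int]]]) -> Dict[str, float]:
    """Fractions of pairs where the refined score is higher, equal, or lower."""
    valid = [(o, r) for o, r in pairs if o is not None and r is not None]
    if not valid:
        return {"win": 0.0, "tie": 0.0, "loss": 0.0}
    n = len(valid)
    return {
        "win": sum(r > o for o, r in valid) / n,
        "tie": sum(r == o for o, r in valid) / n,
        "loss": sum(r < o for o, r in valid) / n,
    }
```

Taking the last match guards against the phrase appearing earlier in the error list, and unparseable outputs are dropped rather than counted as ties.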

E DETECT Step

We use MiniCheck (Tang et al., 2024a) as M_detect. To check its performance against ground-truth binary factual consistency labels, we calculate precision, recall, F1, and balanced accuracy on the MediaSum split of TofuEval, which has sentence-level factual consistency labels. MiniCheck achieves a sentence-level balanced accuracy of 73.6%. It achieves a precision of 0.54 and a recall of 0.64 on the task of detecting factually inconsistent sentences.
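The metrics above follow the standard definitions, with "factually inconsistent" as the positive class. A minimal sketch, assuming labels are encoded as 1 for inconsistent and 0 for consistent (the encoding and function name are illustrative):

```python
from typing import Dict, Sequence


def detection_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Precision, recall, F1 for the positive class, plus balanced accuracy."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Balanced accuracy averages recall on each of the two classes.
        "balanced_accuracy": (recall + specificity) / 2,
    }
```

Balanced accuracy is the useful summary here because consistent and inconsistent sentences are unlikely to be equally frequent.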

F CRITIQUE Step

Prompt F.1: Zero-Shot prompt used with GPT-4

to compare ground truth human written feedback

against model generated feedback

Document:

Sentence: {{ sentence }}

For the above sentence, I received the following two

feedbacks:

Feedback 1:

Feedback 2:

Are feedback 1 and feedback 2 talking about the

same error in the sentence? Respond with one of the

following:

(1) same error or mostly the same error, one of them

covers a broader range of errors (2) totally different errors

(3) feedback 2 says there is no error but feedback 1 has an

error mentioned

G REFINE Step

As mentioned in Section 4.2, we train our baselines to be comparable to our proposed method. However, we note that fine-tuning DR baselines leads the model to optimize for the “easy” action: it learns to copy instead of making fine-grained edits. We observe similar behavior when training M_critique with a balanced set of sentences for Feed+Refine: the model learns the easier generation and predicts “no error”. Prior work (Tian et al., 2023; Panthaplackel et al., 2021) has observed this behavior with seq2seq models, and we leave further exploration of this to future work.

Due to this behavior, we observe that the fine-tuned baselines edit fewer summaries. Table 8 shows the percentage of summaries that remain unchanged in our test set. When refining with DR, with Refine-L3-FT as M_refine, the model only edits 20% of the responses, and Refine-L2-FT only edits 5% of responses. Similarly, when refining with Feed+DR, (Critique-L3-FT, Refine-L3-FT) edits 26% of the responses while (Critique-L2-FT, Refine-L2-FT) edits 19% of responses.

| Error Granularity | Edit Type | GPT-4 (%) | Refine-L3-FT (%) | Refine-L2-FT (%) |
|---|---|---|---|---|
| Word | Delete | 4.0 | 4.0 | 0.0 |
| Word | Insert | 2.0 | 2.0 | 4.0 |
| Word | Substitute | 0.0 | 6.0 | 2.0 |
| Phrase | Delete | 4.0 | 6.0 | 10.0 |
| Phrase | Insert | 2.0 | 12.0 | 8.0 |
| Phrase | Substitute | 24.0 | 40.0 | 28.0 |
| Sentence | Delete | 2.0 | 2.0 | 4.0 |
| Sentence | Substitute | 16.0 | 16.0 | 38.0 |
| Phrase/Phrase | Substitute/Insert | 10.0 | 6.0 | 0.0 |
| Phrase/Phrase | Substitute/Delete | 0.0 | 0.0 | 2.0 |
| Phrase/Sentence | Substitute + Insert | 14.0 | 4.0 | 0.0 |
| Phrase/Sentence | Delete/Insert | 2.0 | 0.0 | 0.0 |
| No change | | 20.0 | 2.0 | 4.0 |

Table 7: Distribution of different edit actions for model generated feedback. We see variation in the granularity at which the errors are detected, as well as in the type of edits that the model feedback suggests.

Using Levenshtein distance (Levenshtein et al., 1966), we calculate the number of edits between the refinement and the initial response. We break down the edit distance by the number of deletes, adds and substitutions. Table 12 shows the average number of edits made by different models when refining with DCR. Note that the number of edits is averaged over only edited summaries. For each model, we see that the largest edit is the “substitution (sub)” operation. While the word-level edits are sometimes 50% of the original length of the response, in Table 19 we can qualitatively see that the refinements preserve the semantics, style and structure of the initial response and follow the feedback.
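The edit breakdown can be illustrated with a word-level Levenshtein alignment that counts each operation type. This is a sketch under assumptions: whitespace tokenization and the tie-breaking order during backtracking are illustrative choices, and the exact implementation used for Table 12 is not specified.

```python
from typing import Dict


def edit_breakdown(original: str, refined: str) -> Dict[str, int]:
    """Count word-level adds, deletes and substitutions on one optimal alignment."""
    a, b = original.split(), refined.split()
    n, m = len(a), len(b)
    # dist[i][j] is the edit distance between a[:i] and b[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # delete a[i-1]
                             dist[i][j - 1] + 1,         # add b[j-1]
                             dist[i - 1][j - 1] + cost)  # match or substitute
    adds = deletes = subs = 0
    i, j = n, m
    while i > 0 or j > 0:  # backtrack along one optimal path
        if i > 0 and j > 0 and a[i - 1] == b[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            subs, i, j = subs + 1, i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            deletes, i = deletes + 1, i - 1
        else:
            adds, j = adds + 1, j - 1
    return {"adds": adds, "deletes": deletes, "subs": subs, "distance": dist[n][m]}
```

When several optimal alignments exist, the split between operation types can differ across implementations, although the total distance does not.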

H Data Release

We will release all our data, code and models under

the MIT License.

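| Dataset | Refinement Strategy | GPT-4 / GPT-4 | Critique-L3-FT / Refine-L3-FT | Critique-L2-FT / Refine-L2-FT |
|---|---|---|---|---|
| TofuEval | DR | 0.66 | 0.83 | 0.95 |
| TofuEval | Feed+DR | 0.08 | 0.74 | 0.81 |
| TofuEval | Detect+DR | 0.49 | 0.75 | 0.65 |
| TofuEval | DCR | 0.51 | 0.50 | 0.50 |
| UltraChat | DR | 0.87 | 0.91 | 0.99 |
| UltraChat | Feed+DR | 0.04 | 0.85 | 0.91 |
| UltraChat | Detect+DR | 0.24 | 0.84 | 0.86 |
| UltraChat | DCR | 0.25 | 0.25 | 0.24 |

Table 8: Fraction of responses that remain unchanged during the refinement process for the different refinement baselines and DCR. Model columns are given as M_critique / M_refine.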

| M_critique | M_refine | MediaSum/TofuEval ∆A ↑ | MediaSum/TofuEval ∆G ↑ | MediaSum/TofuEval W ↑ | MediaSum/TofuEval S | MediaSum/TofuEval L | UltraChat ∆A ↑ | UltraChat ∆G ↑ | UltraChat W ↑ | UltraChat S | UltraChat L |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Shepherd | L2 | -0.01† | -0.10† | 0.03† | 0.87 | 0.09 | -0.01† | -0.03 | 0.06 | 0.83 | 0.11 |
| UltraCM | L2 | -0.01† | -0.13† | 0.04† | 0.84 | 0.12 | -0.02† | 0.04 | 0.06 | 0.88 | 0.07 |
| SelFee 7b | L2 | 0.00 | -0.49 | 0.03 | 0.55 | 0.41 | -0.01 | -0.08 | 0.08 | 0.73 | 0.20 |
| SelFee 13b | L2 | -0.02 | -0.54 | 0.05 | 0.55 | 0.40 | -0.01 | 0.04 | 0.07 | 0.79 | 0.14 |
| L2 | L2 | -0.03† | -0.27† | 0.03† | 0.88 | 0.1 | -0.05† | -0.50† | 0.02 | 0.65 | 0.32 |
| Critique-L2-FT | L2 | 0.01† | -0.27† | 0.08† | 0.75 | 0.17 | 0.01 | -0.11 | 0.11 | 0.81 | 0.08 |
| Critique-L2-FT | Refine-L2-FT | 0.09 | 0.15 | 0.13 | 0.82 | 0.04 | 0.03 | -0.10 | 0.07 | 0.76 | 0.09 |

Table 9: Continued results with M_refine as LLAMA2-7B-CHAT for comparing feedback from our proposed method and feedback from existing critic models. †: significant gains by DCR with respect to these methods with p < 0.05 according to a paired bootstrap test. We see that our proposed M_critique achieves the largest gains across all metrics when compared to refining with feedback from existing critic models.

| Strategy | M_detect | M_critique | M_refine | A(r) | A(r̂) | G(r) | G(r̂) | G_pairwise(r) | G_pairwise(r̂) |
|---|---|---|---|---|---|---|---|---|---|
| DR | - | - | GPT-4 | 0.76 | 0.78 | 4.47 | 4.69 | 4.44 | 4.69 |
| Feed+DR | - | GPT-4 | GPT-4 | 0.76 | 0.86 | 4.46 | 4.95 | 4.33 | 4.82 |
| DETECT + DR | MiniCheck | - | GPT-4 | 0.76 | 0.87 | 4.48 | 4.94 | 4.41 | 4.88 |
| DCR | MiniCheck | GPT-4 | GPT-4 | 0.76 | 0.86 | 4.44 | 4.97 | 4.43 | 4.88 |
| DR | - | - | Refine-L3-FT | 0.76 | 0.80 | 4.47 | 4.67 | 4.45 | 4.65 |
| Feed+DR | - | Critique-L3-FT | Refine-L3-FT | 0.76 | 0.77 | 4.45 | 4.49 | 4.49 | 4.53 |
| DETECT + DR | MiniCheck | - | Refine-L3-FT | 0.76 | 0.81 | 4.47 | 4.66 | 4.45 | 4.64 |
| DCR | MiniCheck | Critique-L3-FT | Refine-L3-FT | 0.76 | 0.83 | 4.45 | 4.79 | 4.41 | 4.73 |
| DR | - | - | Refine-L2-FT | 0.76 | 0.77 | 4.46 | 4.46 | 4.48 | 4.50 |
| Feed+DR | - | Critique-L2-FT | Refine-L2-FT | 0.76 | 0.75 | 4.47 | 4.43 | 4.51 | 4.52 |
| DETECT + DR | MiniCheck | - | Refine-L2-FT | 0.76 | 0.8 | 4.48 | 4.70 | 4.49 | 4.62 |
| DCR | MiniCheck | Critique-L2-FT | Refine-L2-FT | 0.76 | 0.85 | 4.49 | 4.64 | 4.43 | 4.63 |

Table 10: Absolute metric values for different refinement strategies for MediaSum/TofuEval. r is the original response, r̂ is the refined response. A is the AlignScore between 0-1. G is the GPT4 likert score on 1-5. G_pairwise is the pairwise score of the original and refined response on a scale of 1-5.

| Strategy | M_detect | M_critique | M_refine | A(r) | A(r̂) | G(r) | G(r̂) | G_pairwise(r) | G_pairwise(r̂) |
|---|---|---|---|---|---|---|---|---|---|
| DR | - | - | GPT-4 | 0.70 | 0.71 | 4.51 | 4.59 | 4.30 | 4.39 |
| Feed+DR | - | GPT-4 | GPT-4 | 0.70 | 0.79 | 4.51 | 4.79 | 4.21 | 4.47 |
| DETECT + DR | MiniCheck | - | GPT-4 | 0.70 | 0.74 | 4.54 | 4.81 | 4.31 | 4.60 |
| DCR | MiniCheck | GPT-4 | GPT-4 | 0.70 | 0.76 | 4.53 | 4.74 | 4.33 | 4.58 |
| DR | - | - | Refine-L3-FT | 0.70 | 0.71 | 4.51 | 4.59 | 4.33 | 4.43 |
| Feed+DR | - | Critique-L3-FT | Refine-L3-FT | 0.70 | 0.70 | 4.50 | 4.52 | 4.33 | 4.33 |
| DETECT + DR | MiniCheck | - | Refine-L3-FT | 0.70 | 0.70 | 4.46 | 4.53 | 4.27 | 4.31 |
| DCR | MiniCheck | Critique-L3-FT | Refine-L3-FT | 0.70 | 0.75 | 4.50 | 4.54 | 4.23 | 4.32 |
| DR | - | - | Refine-L2-FT | 0.69 | 0.69 | 4.48 | 4.51 | 4.30 | 4.30 |
| Feed+DR | - | Critique-L2-FT | Refine-L2-FT | 0.7 | 0.7 | 4.51 | 4.53 | 4.32 | 4.31 |
| DETECT + DR | MiniCheck | - | Refine-L2-FT | 0.7 | 0.71 | 4.54 | 4.56 | 4.35 | 4.38 |
| DCR | MiniCheck | Critique-L2-FT | Refine-L2-FT | 0.7 | 0.73 | 4.51 | 4.42 | 4.27 | 4.21 |

Table 11: Absolute metric values for different refinement strategies for UltraChat. r is the original response, r̂ is the refined response. A is the AlignScore between 0-1. G is the GPT4 likert score on 1-5. G_pairwise is the pairwise score of the original and refined response on a scale of 1-5.

| M_critique / M_refine | MediaSum/TofuEval Adds | MediaSum/TofuEval Deletes | MediaSum/TofuEval Subs | MediaSum/TofuEval Len(r) | MediaSum/TofuEval Len(r̂) | UltraChat Adds | UltraChat Deletes | UltraChat Subs | UltraChat Len(r) | UltraChat Len(r̂) |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4 / GPT-4 | 11.8 | 5.0 | 15.9 | 53.3 | 60.1 | 11.6 | 44.1 | 55.4 | 234.7 | 202.2 |
| Critique-L3-FT / Refine-L3-FT | 5.0 | 7.4 | 9.3 | 53.3 | 50.9 | 6.6 | 42.4 | 13.3 | 233.5 | 197.8 |
| Critique-L2-FT / Refine-L2-FT | 18.6 | 8.4 | 17.2 | 53.2 | 63.3 | 25.6 | 42.4 | 65.0 | 234.2 | 216.9 |

Table 12: Average edit distance broken down by the average number of additions, deletions, and substitutions between the refinement (r̂) and original response (r) when refining with DCR. The values are averaged over edited summaries only. Table 8 shows the % of summaries that remain unchanged by different refinement strategies including DCR.

UltraChat Instruction - Example 1

Summarize the issues faced by Native American tribes on reservations in the mid-1800s, including lack of medical care

and access to food, and how Indian agents attempted to address these problems. Generate according to: When the tribes

got to the reservation in 1856, the federal Indian agents were then 100% responsible for feeding them and caring for their health

needs. I have documented in numerous essays that the federal government was slow to appropriate funds for the reservation,

even when they had treaties, and that hunger and starvation was a major issue on the reservations. As well, I have documented

that illnesses and diseases were also a major problem on the reservations. For the over 2000 Indians at Grand Ronde, there was

only one doctor, and while there was a hospital it was limited, because in the ﬁrst few years some 75% of the Indians were sick

in any 6 month period. Many people died on the reservation without any medical attention, either because they did not trust

the “Boston” doctor, or because the doctor was too busy to tour around and check every tent with a sick person. Many deaths

went unrecorded in these ﬁrst few years. In about 1857, the Indian agents were continually sending letters to the Commissioner

of Indian Affairs asking for more funding, for funds for building dwellings, schools, medical supplies, and food. It was very

apparent that the federal government would not easily change the way it does things, in order to save Indians some 3,000 miles

away from Washington, D.C. The agents began getting creative with their resources. They employed Indian labor in most

projects because they would not have to pay them much, then the Superintendent of Indian affairs for Oregon would apply the

funds from ratiﬁed treaties to other reservations where the tribes did not have treaties, like most of the Coast Reservation, just to

pay for the basic needs. But the problem of the need for more medical care, more employees, and more food did not go away, and

throughout the west Indians were starving on reservations. So in 1858, the Commissioner ordered that the tribes produce their

own food and asked for information about the medicines of the tribes. The following letter addresses this request for information

about medicinal plants, in a limited fashion. Its clear that the Native peoples were not trusting of the agent, or the doctor, and

would not reveal much about their plants, and that they were living in an unfamiliar land. about this last, its unclear why the

agent did not talk with the Kalapuyans as the Yamhills could have been more helpful for plants on the reservation, because

this was their original homeland. Those plants revealed are interesting as the agent includes their tribal names in his report. It

remains to be seen what these plants are today. Most of the Indians on this reservation having been brought from a considerable

distance and then medical wants supplied by the Dept. they pretend to but little knowledge of the medical virtues of the plants in

this vicinity. I have however with the assistance of the Physician attached to this Agency been able to obtain the names of a few

plants used by them. The bark of the root of what is known in this country as the Oregon grape is used by them, in the form of an

infusion, as a cathartic. The plant is a shrub with a woody, rough, erect, cylindrical stem, about 20 inches high, covered with a

grey epidermis, has a long horizontal root; ovate, glabrous, quinate radical, acuminate, dentate leaves, dark green above and light

green beneath. The ﬁrst grows in bunches or raccines, somewhat like the common frost grape, and resembles juniper berries in

size and I think in color. The following plant is considered a good narcotic, – it is known by the Rogue River Indian under the

name of “Yorks.” It has an erect, ﬁbrous, smooth, hollow, yellow stem, about 2 feet high; small, ovate, dentate, glabrous, green

leaves, brownish, knotty root, with numerous long ﬁbres. The roots of the following plants are used as emetics. The ﬁrst is called

in the Rogue River Language “Penn.” It has a root in shape and size precisely like a very small onion, and a ﬂax like stem, about

a foot high, with a bunch of small triangular seed sacks around the top. The second is known in that language as Es-quck. It has a

jointed, branching, smooth, green stem, somewhat trailing, cordate, pubescent leaves, dark green above and light green beneath;

root-stemlike, and gradually tapering. The above remedies are used in most complaints and without much judgment; they are

formed in latitude 45o [North Salem is at 45o ]North; and cannot be obtained in very large quantities; they can be gathered

within 25 miles of Steamboat navigation of the Yamhill River, a branch of the Willamette if not nearer, the facilities for sending

them to that point are good. Unfortunately, no persons are directly identiﬁed in this letter. Miller appears to have been taking

with strictly Rogue River informants. If anyone has any ideas what they plants are, I will update the blog with that information.

Llama2-7b-Chat Response: In the mid-1800s, Native American tribes on reservations faced numerous challenges, including

lack of access to medical care and food. Indian agents, who were responsible for providing for the basic needs of the tribes, were

often overwhelmed and underfunded in their efforts to address these problems.

One of the main issues was the lack of access to medical care. Many tribes did not have a dedicated doctor or medical supplies,

and those who were sick often received little to no medical attention. This led to high rates of illness and death, particularly

among children and the elderly.

Another major issue was the lack of access to food. Many tribes were forced to rely on scanty government rations, which often

did not provide enough food to sustain them. This led to widespread hunger and malnutrition, which further weakened the

already fragile health of the tribes.

In an effort to address these problems, Indian agents began to employ Indian labor to build dwellings, schools, and other

infrastructure. They also sought to obtain more funding from the federal government to provide for the basic needs of the

tribes. However, these efforts were often met with resistance and skepticism from the tribes, who had little trust in the federal

government and its agents.

In 1858, the Commissioner of Indian Affairs ordered the tribes to produce their own food, and the agent responsible for the

Grand Ronde reservation in Oregon sought information about the medicinal plants used by the tribes. However, the tribes were

reluctant to share their knowledge with the agent, who they saw as an outsider and a threat to their traditional ways of life.

The agent was able to obtain the names of a few plants used by the tribes, including the Oregon grape (which was used as a

cathartic), “Yorks" (a narcotic), and several emetics. However, the quantities of these plants were limited, and they could only be

gathered within 25 miles of Steamboat navigation of the Yamhill River, a branch of the Willamette River.

Overall, the mid-1800s were a difﬁcult time for Native American tribes on reservations, who faced numerous challenges to

their health, well-being, and way of life. Despite the efforts of Indian agents and the federal government, the tribes continued

to struggle to access basic necessities like food and medical care, and their traditional ways of life were threatened by the

encroachment of European-American settlers.

Table 13: Sample instruction from UltraChat and an initial response from LLAMA2-7B-CHAT.

UltraChat Instruction - Example 2

Given the text: Content options in all facets have its essence for each individuals within the society. Might seem to be a good

writing service till you start comparing it. For those who do, you probably ultius review want to discover a reliable company.

Ultius writing service – authentic researches at inexpensive prices out there right here will ﬂip your education into delight high

quality and reasonably priced essay to ease. As we said, the costs aren’t low cost. For example, a highschool scholar must pay

$19 per page for a paper they order inside the week, while a Master’s pupil is required to pay $30 per web page for a deadline

of 10 days. If you happen ultius reviews to go further and select one of the options like ‘prime 10% writer’ or wish to get a

previously assigned writer to work on your paper, you’d must pay an amazingly high 15% and 20% extra. Some websites provide

their own operational statistics. As an example, Papers Lead has an animated ﬁeld halfway down its front page that displays stay

stats.” These stats indicate that, on the essay rescue time of writing in mid-June 2016, more than 300 writers were engaged on

roughly 1,394 in-progress orders and that the company had successfully accomplished some 138,980 orders over the course of

recorded historical past. I might argue that ghostwriting websites have performed an unlimited function on this acceptance. The

sheer proliferation of such websites across the online underscores simply how mainstream and visible academic dishonest has

change into. Before the web, cheating actions were scattered invisibly beneath the ﬂoorboards of each academic institution,

known ultius reviews solely to those with a motive to know. As we speak, the net gives a central and international repository of

cheating companies that one could very easily come across accidentally while seeking honest support assets. Pulmonary edema

cxr descriptive essay waltraud wende dissertation which means revolt of the masses essays philipp kohorst dissertation proposal

hard work and dedication essays on abortion. Related Post: more bonuses P2y12 inhibitors comparability ultius reviews essay

paper point problem challenge analysis resolution solving. The disintegration of the persistence of memory critique essay the tip

of history essay childhood reminiscences essays research paper on registered nurse furoic acid synthesis essay what am i grateful

for essays on poverty an essay about power supply ultius reviews, being blindfolded essay essay about dubai tourism department

alexandre cabanel birth of venus analysis essay essay about barangay election 2016 results writing literature evaluate for analysis

paper. When creating portals for cell gadgets, we use only excessive-quality technologies (SP Online and SP16). These providers

are guarantee of high performance of portal, its pleasant appearance, comfy use and safety ultius of customer information. I

would name them shady, but not for reasons which are apparent to individuals exterior the academic ghostwriting world. You

couldn’t pay me an excessive amount of. After leaving lecturers, I used to jot down in business and made extra money. But, one

cannot put a price tag on the internal satisfaction of doing something ultius nicely that’s tough to do, actually helping folks in a

meaningful way, and being ﬁnancially self-sustaining for doing it, all from the comfort of house. To dispel all of the potential

doubts and considerations we’ve got ordered a trial paper and witnessed all the beneﬁts and disadvantages of cooperation with

To be goal, we have now chosen undergraduate degree and the best paper sort – essay. The topic was comparatively simple ultius

and the project was not urgent (ought to have been fulﬁlled in 10 days). The duty was submitted within 7 days, following the

closing dates, meeting all of the formatting necessities and other calls for. Nonetheless, the quality of the essay was not excellent,

in addition to its grammar. You’ll deﬁnitely ﬁnd it to be real that an best essay writing service could be very more likely to

offer you with the functions you may wish to make composing value vital grades. Continually assure that the writers or the

composing ﬁrm show a ton of professionalism of their occupation as signiﬁcantly as you’ll have pinned an entire lot of hope

in their job. College students and college students use investigation paper help supplied on the internet to have their publish

my paper wish fulﬁlled for some price. When something is completely ready, it’s time to start out creating next the outline or

blueprint. Piece of composing matters could assortment from something that could be really uncomplicated to some point that’s

extremely compound and wishes complete consciousness and impression. This entry was posted in essays rescue on 30/03/2019

by Rasyidah Kang. Can you summarize the pricing structure of Ultius writing service and compare it to other companies?

Llama2-7b-Chat Response: Sure, I’d be happy to provide you with an overview of Ultius’ pricing structure and compare it to

other writing services. Ultius offers a range of pricing options based on the type of paper, deadline, and level of study. Here

are the basic pricing tiers for each type of paper: * High School: $19 per page for a 10day deadline, $24 per page for a 5day

deadline. * Undergraduate: $24 per page for a 10day deadline, $30 per page for a 5day deadline. * Graduate: $30 per page

for a 10-day deadline, $36 per page for a 5day deadline. * Professional: $36 per page for a 10-day deadline, $42 per page for

a 5day deadline. In comparison to other writing services, Ultius’ prices are relatively high. For example: * PapersOwl offers

prices starting at $14.95 per page for a high school paper with a 14day deadline. * EssayPro offers prices starting at $19.99 per

page for a high school paper with a 10day deadline. * GradeMiners offers prices starting at $12.95 per page for a high school

paper with a 10day deadline. It’s worth noting that Ultius offers a range of additional features and services, such as a “Top 10%

Writer" option, which can increase the cost of the paper. Additionally, Ultius offers a satisfaction guarantee and a free revision

policy, which can provide peace of mind for students who are concerned about the quality of their paper. Overall, while Ultius’

prices may be higher than some other writing services, they offer a range of beneﬁts and guarantees that can make them a more

attractive option for students who are looking for a high-quality writing experience.

Table 14: Sample instruction from UltraChat and an initial response from LLAMA2-7B-CHAT.

MediaSum Instruction - Example 1

CUOMO: All right. If you look at it since the beginning of when Donald Trump, the citizen, became candidate and now president,

ﬁrst it was Trump is a solo act – HARLOW: Yes. CUOMO: – nobody can talk to him. He does this own thing. He’s very

instinctive. Then it was Bannon is the man. Now it is blood is thicker than water so Jared Kushner is the inside track to the

president’s fate. Who has his ear, if anyone? Joining us now, CNN political commentator and host of CNN’s ""SMERCONISH"",

Michael Smerconish. And, the CEO and president of ""NEWSMAX"", Chris Ruddy. Chris, you know the president well. You

spend time with him. What is your sense of what his universe is in terms of who he looks to most for counsel? CHRIS RUDDY,

CEO & PRESIDENT, NEWSMAX MEDIA INC.: Well, he has a very large universe. I would say he’s a feedback junkie. I

mean, he’s listen – when I see him – I saw him this weekend, he’ll have 10 questions and ask me things. He’s doing that with a

lot of friends and associates. He’s also using media channels to get information and absorb what people are thinking. It is a myth,

Chris, to believe that a family member alone can inﬂuence or make a decision for him. He’ll take into account what they’re

saying. They obviously have a lot of inﬂuence because they’re – just because of proximity. But at the end of the day, Donald

Trump always makes the decision, himself. That’s been my experience. HARLOW: So, Michael Smerconish, is this much ado

about nothing? I mean, ""SNL"" had the – you know, the skit this weekend. So many headlines are this is the fall of Bannon, the

rise of Kushner, and the rise of Gary Cohn. Listen to Chris, it sounds like none of that’s true...[....]...HARLOW: Yes. CUOMO: –

that Bannon knew, certainly better than anybody else who’s around the president right now. That’s why he wound up becoming

so essential. How does he hold on to the people who got him there if he loses the man with the actual connection? RUDDY:

Well, the polling data shows the president has held that base pretty strongly. I mean, I’m seeing numbers like 90 percent of his

base still supports him. Where I think he’s showing a lot of weakness is on the Independents, and I think there’s a feeling among

people at the White House they have to moderate a little bit and be a little less controversial. But as Michael says, and as you said

Chris, you really risk losing the base so I think the president has some wiggle room. I think the base is willing to give him a lot

more slack than they might give another person in that – in the Oval Ofﬁce, but I think we’re going to see – it’s all about results.

Always remember this. With Donald Trump, it doesn’t matter what you’re saying or the Twitter or the various controversies that

ﬂutter around. It’s all about will he bring jobs, will he get things done in Washington, will he clean the place up? If he does some

of those things he’ll easily win reelection. HARLOW: It is interesting, Michael, looking, though, at 2020 and this – what sounds

like an admission, at least, what Chris is pointing to, or an acknowledgment by this team – the Trump team – that they know they

have to win it differently than they did this time around and maybe it isn’t with Bannon the man to win in 2020. Where does

Gary Cohn ﬁt in all this because I ﬁnd him to be a fascinating character, a former president of Goldman Sachs? A guy who, you

know, has said in this meetings, according to ""The Washington Post"" I’m not a Republican, I’m not a Democrat, I just like to

get things done. He, like the president, has given money to Republicans and to Democrats. What does the rise of Gary Cohn

mean? SMERCONISH: I think it represents pragmatism on the part of the president. There was so much said, Poppy, and we

spent a lot of time on it here at NEW DAY last week about last Wednesday, in particular. Favorable comments about Janet Yellen

– HARLOW: Yes. SMERCONISH: – NATO no longer obsolete, China not a currency manipulator, and people said, oh, look at

the ﬂip-ﬂops of the president. It kind of reinforced to me where I thought he always was. I never bought into 2 Corinthians, if

you remember that – HARLOW: Yes. SMERCONISH: – reference during the course of the primary. I always thought he was

singing a particular tune to placate a base that he needed. And, frankly, this is the Donald Trump, the compromiser in chief, that

I always thought he might be. Now again, it’s still early, we’re not even at day 100, but I think Gary Cohn is more a reﬂection of

where the president’s mind is on issues than even Steve Bannon. CUOMO: Chris, a quick take. Do you foresee a reality where

Bannon is not in the White House? RUDDY: Oh, it’s so hard for me to predict. Really, the president makes that decision but it’s

very possible he could be a – anybody – you know, the president has a very taut threshold with people if they’re performing

– CUOMO: But only Bannon can hurt him on the outside. HARLOW: Yes, if he – CUOMO: Jared Kushner, I know he’s his

son-in-law. RUDDY: Well – CUOMO: I don’t mean to disrespect the young man but he doesn’t – he’s not a political player. He’s

never even been in anything at all, let alone something like this. Bannon, on the outside, is a dangerous man to President Trump.

RUDDY: Well, I would disagree with that. I haven’t seen him being a problem for the president. In the past, the president has

praised Bannon and – CUOMO: Sure. RUDDY: – Reince Priebus and Jared, so – but look, you know, he keeps – if he thinks that

the results are not there, even Jared’s going to be being watched. Everybody will have a performance mechanism. And I’m

seeing, you know, the jobs issues right now has been number one but if you look at Gary Cohn, I think Gary is an establishment

Wall Street guy and the president’s going to judge him by his results. HARLOW: Gentlemen, thank you very much. Nice to have

you both. Coming up for us, it is called the Super Bowl of White House social events, at least that’s the analogy. CUOMO: Sure.

HARLOW: We’re talking about the annual Easter Egg Roll. It is today. How will it be different, as you look at these live pictures

of the White House as they get ready for their Super Bowl? We’re going to take you there live, next.

Llama2-7b-Chat Response: Bannon and Kushner have different strategies in the Trump White House. Bannon prioritizes

ideological objectives, while Kushner focuses on pragmatism and results. Bannon’s inﬂuence may wane as Kushner rises in

prominence.

Table 15: Sample instruction from MediaSum and an initial response from LLAMA2-7B-CHAT.

MediaSum Instruction - Example 2

JOE JOHNS, CNN ANCHOR: Well, good morning. I saw a bunch of people stumbling around on the streets of New York

holding those little bags full of things for the holidays. I’m Joe Johns in for John Roberts. CAROL COSTELLO, CNN ANCHOR:

I’m Carol Costello in for Kiran Chetry. Happy day after Thanksgiving. JOHNS: Oh, yes, shop until you drop. COSTELLO:

That’s right. We’re full of turkey. We’ve got to work it up and we’re going shopping. Exactly. There’s a lot going on this morning,

so we want to get right to it. An uneasy peace on the peninsula as enemy armies face each other down this morning. North Korea

warning the neighboring nations are on the brink of war at South Korea. The United States gear up for a joint military exercise

this weekend. We’re live with what the world can do to ease tension. JOHNS: Got your wallet, got your coupons, and don’t

forget the Christmas list. It’s time for the mad dash to the mall for those extreme Black Friday deals. But is it really worth the

hassle? That’s probably up for debate. Nevertheless, we’ll show you where all the action is this morning. COSTELLO: It’s a

tradition, Joe. JOHNS: Absolutely. COSTELLO: An amazing story of survival at sea. Three teenage boys in a tiny boat found

alive after 50 days adrift in the South Paciﬁc. They’ve been given up for dead and actually eulogized in memorial service weeks

earlier. This morning, how they deﬁed death and the lucky break that led to their rescue. JOHNS: That’s really an incredible

story. But ﬁrst, tensions running dangerously high on the Korean Peninsula. This morning, reports of new explosions as U.S.

warships steam toward the region. Just days after it shelled South Korea, North Korea warns the peninsula is edging closer to the

brink of war. The North seeing red because of America’s joint military exercises with South Korea. COSTELLO: And, of course,

those exercises are nothing new. But the unpredictable North says it’s ready to unleash a shower of ﬁre in order to defend itself.

That has forced South Korea to ramp up security and change defense ministers. Our foreign affairs correspondent Jill Dougherty

is live in Washington. Jill, what in the world can Washington do about this? JILL DOUGHERTY, CNN FOREIGN AFFAIRS

CORRESPONDENT: Well, you know, Carol, Joe, it’s a difﬁcult situation because they’re going to go ahead with these military

exercises with South Korea regardless of what the North says, obviously. So what they have to do is be ﬁrm, but they can’t –

they also have to be prepared for any type of unpredictable behavior by the North. The North already has shown it. You know,

attacking this week, earlier this week the island without any type of warning whatsoever. Also, during a period where the South

Koreans by themselves were carrying out exercises. So they have to proceed, but they can’t overplay the hand. It’s a very difﬁcult

situation. The one thing about this recent ﬁring, the most – the latest one is that that appears to be live ﬁring exercises by the

North Koreans. They weren’t ﬁring into that area where the encounter took place earlier this week. So that is one good news.

But this starts on Sunday, and we’ll have to see what they’ll do. JOHNS: Jill, what’s the likely scenario here? Does it appear

that the North Koreans are just trying to draw the Chinese in to act as a fair broker? What’s the method behind, if you will, the

madness? DOUGHERTY: You know, sometimes we think it is madness. But if you talk to some experts, they say it’s not really

madness, it’s calculated. And what’s going on right now as we all know, the son of Kim Jong-il is being groomed to take over his

father’s job. His father is very ill. And he’s only 26 years old. So Kim Jong-un is the person in the hot seat. He has to prove that

he’s tough, and they’ve done this before. When his father was going to succeed his grandfather, they were doing the same thing.

Showing that they’re tough, bristling, and telling the world pay attention to us. Some analysts point out, you know, if North

Korea didn’t have nuclear weapons right now, a lot of countries wouldn’t be paying as much attention. So they feel they need that

attention. They want to be taken seriously by the U.S. and this, unfortunately, is the way they’re doing it. COSTELLO: And you

know, just – it’s a game to them then let’s say, but they’re killing people. They’re not just blowing stuff up, they’re killing people.

DOUGHERTY: Well, that’s the problem because this is a very serious incident that happened earlier this week. And you can say,

yes, they want to get a message across, but don’t forget they have a million men – more than a million men army. They are very,

very close to South Korea. If you look at the map where that encounter took place earlier this week, it is not actually that far

from the airport of Seoul, the capital of South Korea. So it’s very serious and very delicate in the way everybody has to play it.

But they do have to show the North that they are intent on going ahead with protecting South Korea and not kowtowing to what

the North Koreans are trying to do. COSTELLO: The military exercises will continue. Jill Dougherty, thanks so much. Stay with

us. Coming up in less than 30 minutes, we’ll go to our Stan Grant. He’s in South Korea. He saw ﬁrsthand the devastation from

Tuesday’s attack by North Korea. JOHNS: A developing story this morning. The rescue of three teenage boys after 50 days lost

at sea aboard a six-foot boat. And now, they’re back on dry land, dehydrated and sunburned, but OK. The trio set out in early

October from their tiny South Paciﬁc island for another nearby island. They got lost. The days turned to weeks. Their supply

of coconuts ran out. They survived on rain water and raw ﬁsh. Finally, a ﬁshing boat taking a shortcut off the shipping lane

spotted them 800 miles from home. This was two weeks after they’d already been declared dead...[..]...[...]...JOHNS: Well –

COSTELLO: Thanks, Karen. JOHNS: I know. COSTELLO: All we can say is wow. JOHNS: It’s getting started a little too soon

for me. It seems like we just got past last winter with all the blizzards and everything. Yes. COSTELLO: That’s because we’ve

been spoiled because it’s been so warm along the northeast coast. JOHNS: Yes. COSTELLO: Weird. JOHNS: All right. Well,

thanks, Karen. Given up for dead. A few minutes ago, we told you about the three teenage boys who beat the odds, surviving

close to two months at drifted sea. Well, coming up, we’ll hear ﬁrsthand how they were rescued from one of the men who found

them. COSTELLO: And start your engines. Hoards of holiday shoppers head to the mall to get their Black Friday hunt ahead.

And look at the early morning madness. JOHNS: How do they do that? So you want to be a sidewalk Santa? There’s more to it

than meets the eye. There’s a right way and a wrong way to wear the red suit, believe it or not. Coming up. It’s 10 minutes after

the hour.

Llama2-7b-Chat Response: North Korea has threatened to attack South Korea and the US, amidst joint military

exercises between the two countries. Tensions are running high on the Korean Peninsula, with North Korea warning of a "brutal

retaliation" and South Korea mobilizing its military. The situation has escalated in recent days, with North Korea ﬁring artillery

into the sea and South Korea responding with live ﬁre exercises. The US has also expressed concern and called for calm, while

China has urged restraint from all parties involved.

Table 16: Sample instruction from MediaSum and an initial response from LLAMA2-7B-CHAT.

Initial LLM Response: The document discusses the role of the Quds Force, a branch of Iran’s Revolutionary Guard, in Iraq.

The Quds Force is a highly trained and versatile force that conducts military operations and intelligence gathering. They have

been funding mosques and clinics and building social capital in Iraq to increase their inﬂuence. The Quds Force reports to the

Supreme Leader of Iran, but the decision-making process in Iran is dysfunctional and different institutions may act independently.

The Quds Force is also active in Lebanon and has been accused of training Hezbollah.

Critique Shepherd (summary-level): The answer’s summary lacks some key details mentioned in the context; the inﬂuence of

Iran in Iraq, and the Quds Force being a branch of the Revolutionary Guard. The correct answer should reﬂect these details.

Critique UltraCM (summary-level): 8/10 - Your response was quite comprehensive. You adequately summarized the main

points related to “Iran’s Inﬂuence in Iraq”. You captured important details such as the description of the Quds Force and

its functions, as well as mentioning its involvement in Lebanon among others. However, you could improve by providing a

more succinct summary while retaining crucial information. Try to avoid redundancy and condense relevant details into fewer

sentences. Also, ensure all key elements are covered, speciﬁcally those concerning the operation of Quds Force in Iraq given the

focus on that subject matter. Lastly, watch out for grammatical errors to maintain professionalism, e.g.,“has been accuse” instead

of “have been accuse”. Remember proofreading helps reduce such mistakes. Keep practicing!

Critique SelFee 7B (summary-level): None

Critique SelFee 13B (summary-level): The summary accurately captures the main points of the document, but could beneﬁt

from more contextual information. For example, it would be helpful to explain who President Bush is and why he referenced the

Quds Force in his press conference. Additionally, the summary could discuss the reasons behind Iran’s inﬂuence in Iraq and its

impact on the region. Overall, the summary is a 7 out of 10. Revision is needed.

Critique Llama3-FT-Instruct-FT (sentence-wise): The document does not explicitly state that the Quds Force reports to the

Supreme Leader of Iran. Instead, it discusses the hierarchical structure of the Iranian Revolutionary Guard and the role of the

Supreme Leader in the Iranian Constitution. The summary’s claim about the Quds Force’s reporting structure is inferred but not

directly stated in the document. The error span is: ‘The Quds Force reports to the Supreme Leader of Iran’. To ﬁx this, consider

changing the span to ‘The Supreme Leader of Iran has constitutional jurisdiction over the Revolutionary Guards, and the Quds

Forces are a branch of the Revolutionary Guards’

Initial LLM Response: NASA’s Mars rover is set to land on the planet in August, using parachutes and a rocket’s backpack to

slow down enough to make the landing. The rover, named Curiosity, is about the size of a small SUV and will study the planet’s

geology and climate.

Critique Shepherd (summary-level): The answer’s summary lacks some key details mentioned in the context; speciﬁcally the

fact that the landing is scheduled for August 6th. Additionally, the answer does not provide a complete summary of the context;

the answer should be longer than 50 words.

Critique UltraCM (summary-level): Your answer provides accurate information about NASA’s plan to land their Mars rover in

August. You included important details such as the use of parachutes, the rocket’s backpack, and the name of the rover, which

adds relevance to your response. However, you could have added a bit more excitement to the summarization to capture the

anticipation surrounding the event. Remember, the goal is not just to inform, but to engage as well. You might have brieﬂy

highlighted why this particular mission is signiﬁcant or interesting. Try to incorporate these elements next time to create a more

captivating summary while still staying within the word limit. Additionally, ensure to check grammar usage and syntax for

accuracy, there was a missing space before the period at the end - “The rover, named Curiosity,” could read better as “- Curiosity”.

Overall, good effort but remember to bring out emotions and implications whenever possible while maintaining brevity for clear

communication in future tasks of similar nature. Keep practicing!

Critique SelFee 7B (summary-level): The answer is concise and covers the main topics of the document. However, it could

beneﬁt from more information on the heated protests against the Boy Scouts’ ban on gays and the impact of the Arab uprisings on

Olympic athletes. Additionally, it would be helpful to provide more context on President Bill Clinton’s visit to Nelson Mandela.

Overall, the answer is a 7/10. Revision is needed.

Critique SelFee 13B (summary-level): None

Critique Llama3-FT-Instruct-FT (sentence-wise): 1. The document does not specify the purpose of the Curiosity rover’s

mission, stating only that it is scheduled to touch down on Mars on August 6th. The summary introduces information about

the rover’s mission that is not present in the document. The error span is:‘study the planet’s geology and climate’ To ﬁx this,

consider changing the span to ‘touch down on Mars on August 6th’

Table 17: Examples of feedback generated by Shepherd, UltraCM, SelFee 7B, SelFee 13B, and Critique-L3-FT for two ToFuEval instances when refining with DCR using Prompt C.5.

critique / refine: Critique-L3-FT / Refine-L3-FT

Original Response: The document discusses the role of the Quds Force, a branch of Iran’s Revolutionary Guard, in Iraq. The Quds Force is a highly trained and versatile force that conducts military operations and intelligence gathering. They have been funding mosques and clinics and building social capital in Iraq to increase their influence. The Quds Force reports to the Supreme Leader of Iran, but the decision-making process in Iran is dysfunctional and different institutions may act independently. The Quds Force is also active in Lebanon and has been accused of training Hezbollah.

Refinement: The document discusses the role of the Quds Force, a branch of Iran’s Revolutionary Guard, in Iraq. The Quds Force is a highly trained and versatile force that conducts military operations and intelligence gathering. They have been funding mosques and clinics and building social capital in Iraq to increase their influence. The Supreme Leader of Iran has constitutional jurisdiction over the Revolutionary Guards, and the Quds Forces are a branch of the Revolutionary Guards. The decision-making process in Iran is dysfunctional and different institutions may act independently. The Quds Force is also active in Lebanon and has been accused of training Hezbollah.

critique / refine: GPT-4 / GPT-4

Original Response: The document discusses the role of the Quds Force, a branch of Iran’s Revolutionary Guard, in Iraq. The Quds Force is a highly trained and versatile force that conducts military operations and intelligence gathering. They have been funding mosques and clinics and building social capital in Iraq to increase their influence. The Quds Force reports to the Supreme Leader of Iran, but the decision-making process in Iran is dysfunctional and different institutions may act independently. The Quds Force is also active in Lebanon and has been accused of training Hezbollah.

Refinement: The document discusses the role of the Quds Force, a branch of Iran’s Revolutionary Guard, in Iraq. The Quds Force is a highly trained and versatile force that conducts military operations and intelligence gathering. They have been funding mosques and clinics and building social capital in Iraq to increase their influence. The Quds Force operates under the constitutional jurisdiction of Iran’s Supreme Leader as part of the Revolutionary Guards, yet the decision-making process in Iran involves a complex and consensus-driven mechanism that reflects the decentralized and sometimes dysfunctional nature of the regime, allowing for a degree of independence among different institutions. The Quds Force is also active in Lebanon and has been accused of training Hezbollah.

critique / refine: Critique-L2-FT / Refine-L2-FT

Original Response: The document discusses the role of the Quds Force, a branch of Iran’s Revolutionary Guard, in Iraq. The Quds Force is a highly trained and versatile force that conducts military operations and intelligence gathering. They have been funding mosques and clinics and building social capital in Iraq to increase their influence. The Quds Force reports to the Supreme Leader of Iran, but the decision-making process in Iran is dysfunctional and different institutions may act independently. The Quds Force is also active in Lebanon and has been accused of training Hezbollah.

Refinement: The document discusses the role of the Quds Force, a branch of Iran’s Revolutionary Guard, in Iraq. The Quds Force is a highly trained and versatile force that conducts military operations and intelligence gathering. They have been funding mosques and clinics and building social capital in Iraq to increase their influence. The Quds Force reports to the Supreme Leader of Iran, but their influence and decision-making often contribute to the dysfunctionality of the Iranian regime. The Quds Force is also active in Lebanon and has been accused of training Hezbollah.

Table 18: Examples of refinements using DCR, generated by GPT-4 and by fine-tuned versions of LLAMA3-8B-INSTRUCT and LLAMA2-7B-CHAT. The refinements are generated using the critique prompt C.5 and the refinement prompt C.6. Text in red is either deleted or substituted during the refinement process. Text in blue is added in lieu of the text removed from the original response.

Original Response: The document discusses the role of the Quds Force, a branch of Iran’s Revolutionary Guard, in Iraq. The Quds Force is a highly trained and versatile force that conducts military operations and intelligence gathering. They have been funding mosques and clinics and building social capital in Iraq to increase their influence. The Quds Force reports to the Supreme Leader of Iran, but the decision-making process in Iran is dysfunctional and different institutions may act independently. The Quds Force is also active in Lebanon and has been accused of training Hezbollah.

Feedback: 1. The document does not explicitly state that the Quds Force reports to the Supreme Leader of Iran. Instead, it discusses the hierarchical structure of the Iranian Revolutionary Guard and the role of the Supreme Leader in the Iranian Constitution. The summary’s claim about the Quds Force’s reporting structure is inferred but not directly stated in the document. The error span is: ‘The Quds Force reports to the Supreme Leader of Iran’ To fix this, consider changing the span to ‘The Supreme Leader of Iran has constitutional jurisdiction over the Revolutionary Guards, and the Quds Forces are a branch of the Revolutionary Guards’

Refinement: The document discusses the role of the Quds Force, a branch of Iran’s Revolutionary Guard, in Iraq. The Quds Force is a highly trained and versatile force that conducts military operations and intelligence gathering. They have been funding mosques and clinics and building social capital in Iraq to increase their influence. The Supreme Leader of Iran has constitutional jurisdiction over the Revolutionary Guards, and the Quds Forces are a branch of the Revolutionary Guards. The decision-making process in Iran is dysfunctional and different institutions may act independently. The Quds Force is also active in Lebanon and has been accused of training Hezbollah.

Original Response: The gun industry follows the NRA’s extreme positions out of fear of consumer boycotts and because NRA hype benefits sales. But gun makers likely wouldn’t oppose background checks if not for the NRA. The NRA previously organized boycotts that nearly bankrupted companies. The industry stays silent to avoid fighting the NRA publicly.

Feedback: 1. The document does not explicitly state that the gun industry stays silent to avoid fighting the NRA publicly. This is inferred from the context but not directly stated as a reason for their silence. The error span is: ‘the industry stays silent to avoid fighting the NRA publicly’ To fix this, consider changing the span to ‘the gun industry is afraid to publicly oppose the NRA’s extreme positions’

Refinement: The gun industry follows the NRA’s extreme positions out of fear of consumer boycotts and because NRA hype benefits sales. However, gun makers likely wouldn’t oppose comprehensive background checks if not for the NRA’s opposition. The NRA previously organized boycotts that nearly bankrupted companies, causing the gun industry to be cautious in publicly opposing the NRA’s positions.

Original Response: An assistant to CBS anchor Dan Rather was diagnosed with anthrax. She is recovering on antibiotics. Although she handles mail for Rather, no suspicious envelopes were received. Officials believe this is an isolated case, though CBS employees are being tested. Separately, an NJ postal worker tested positive for anthrax exposure, one of over 40 cases nationwide.

Feedback: 1. The document does not state that no suspicious envelopes were received. Instead, it mentions that there is no evidence at this point that any of the mail or any of the envelopes received here in the past couple weeks was suspicious. The error span is:’no suspicious envelopes were received’ To fix this, consider changing the span to ’there is no evidence at this point that any of the mail or any of the envelopes received here in the past couple weeks was suspicious’ 2. The document does not provide a specific number of cases nationwide. This information is added in the summary without being grounded in the document. The error span is: ’one of over 40 cases nationwide’ To fix this, consider changing the span to ’over 40 cases nationwide’

Refinement: An assistant to CBS anchor Dan Rather was diagnosed with anthrax. She is recovering on antibiotics. Although she handles mail for Rather, there is no evidence at this point that any of the mail or any of the envelopes received here in the past couple weeks was suspicious. Officials believe this is an isolated case, though CBS employees are being tested. Separately, an NJ postal worker tested positive for anthrax exposure, over 40 cases nationwide.

Table 19: Examples of feedback and refinement generated by Critique-L3-FT and Refine-L3-FT. Prompt C.5 is used to generate the feedback and Prompt C.6 is used to generate the refinement. Text in red highlights the errors as flagged by the feedback and edited by critique. Text in blue highlights the fix suggested by the feedback and executed in the refinement.
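The feedback in Table 19 follows a recognizable template: each numbered item names an error span and a suggested replacement. The sketch below shows how that template could be parsed and applied literally. It is only an illustration of the feedback format, not the refinement step used in the paper, which is carried out by a prompted or fine-tuned LLM. The names `parse_feedback` and `apply_feedback` and the regular expression are assumptions made for this example; the template itself is inferred from the examples rather than from a documented schema.

```python
import re
import unicodedata

# Assumed template, inferred from the feedback examples in Table 19:
#   "... The error span is: '<span>' To fix this, consider changing the span to '<fix>'"
# Straight and curly quotes are both accepted because the examples mix them.
FEEDBACK_ITEM = re.compile(
    r"The error span is:\s*[‘’'](?P<span>.+?)[‘’']\.?\s*"
    r"To fix this, consider changing the span to\s*[‘’'](?P<fix>.+?)[‘’']"
    r"(?=\s*(?:\d+\.\s|$))",  # a fix ends at the next numbered item or at end of text
    re.DOTALL,
)


def _normalize(text: str) -> str:
    """Expand ligatures such as 'ﬁ' (NFKC) and collapse line wraps and extra spaces."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def parse_feedback(feedback: str) -> list[tuple[str, str]]:
    """Return (error_span, suggested_fix) pairs found in a critique string."""
    text = _normalize(feedback)
    return [(m["span"], m["fix"]) for m in FEEDBACK_ITEM.finditer(text)]


def apply_feedback(summary: str, feedback: str) -> str:
    """Literally substitute each error span that occurs verbatim in the summary.

    Spans that do not match exactly (for example, because of a case
    difference) are skipped rather than guessed at.
    """
    result = _normalize(summary)
    for span, fix in parse_feedback(feedback):
        if span in result:
            result = result.replace(span, fix, 1)
    return result
```

Applied to the first example in Table 19, a literal substitution of this kind gives the same sentence-level change as the listed refinement, except that the refinement also splits the sentence in two. In the second example the critique quotes the span in lowercase (‘the industry stays silent ...’) while the summary capitalizes it, so an exact match fails and the sketch leaves the summary unchanged. This is one reason a learned refinement model is more robust than string replacement.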

Original Response Feedback

Edit Granularity

and Type

The Quds Force, a branch of the Iranian Revolutionary Guard, conducts intelligence and military operations in Iraq, using social capital experiments and funding mosques and clinics to gain support. The Quds Forces report to the Supreme Leader Ayatollah Khamenei, but decision-making is a consensus-building process in Iran. The Quds Forces also operate in Lebanon and have been accused of training Hezbollah. They may have well-placed collaborators in Saudi Arabia’s Shiite-majority eastern province, and there are concerns they could make life difficult for Saudi Arabia in case of conflict with Iran.

(GPT-4) The error span in the sentence “The Quds Forces report to the Supreme Leader Ayatollah Khamenei” might be considered factually inconsistent. In the document, Karim Sadjadpour explains that while it makes sense that the Quds Forces would not operate without approval from top leaders in Tehran, and highlights that constitutionally, the Revolutionary Guards (and by extension, the Quds Force as a branch of the Guards) are under the jurisdiction of the Supreme Leader Ayatollah Khamenei, he also clarifies the overall decision-making process in Iran as being very much a consensus-building process among various personalities and institutions within the government. This explanation suggests that while the Supreme Leader has constitutional jurisdiction, the implied direct reporting and operational command might be misleading without adding the context of Iran’s complex and consensus-driven decision-making process. The simplification that the Quds Forces report directly to the Supreme Leader does not capture the nuance of how decisions are influenced by a broader network of power within Iran.

Suggested fix to the summary: The Quds Forces are under the constitutional jurisdiction of the Supreme Leader Ayatollah Khamenei; however, the decision-making process in Iran involves a consensus-building approach among various personalities and institutions.

Phrase Substitute

(Critique-L3-FT) The document describes the Iranian regime as ‘dysfunctional’ and mentions the consensus-building process in Iran, but it does not explicitly state that decision-making is a consensus-building process. This is inferred but not directly stated in the document. The error span is:‘decision-making is a consensus-building process in Iran’ To fix this, consider changing the span to ‘Iran’s decision-making process is dysfunctional and involves consensus building among different institutions’

Phrase Substitute

(Critique-L2-FT) The document states that the Quds Forces do not report to Supreme Leader Ayatollah Khamenei, which contradicts the summary’s statement that they do. The error span is: ‘The Quds Forces report to the Supreme Leader Ayatollah Khamenei’ To fix this, consider changing the span to ‘The Quds Forces do not report to Supreme Leader Ayatollah Khamenei’

Word Insert

The US and Russia have been engaging in a tit-for-tat diplomatic expulsion, with each side expelling 60 diplomats. Russia is closing the US consulate in Saint Petersburg and the US is shutting down the Russian consulate in Washington, D.C. The White House says it will deal with the retaliation, while Russian officials have warned they will take further action if the US acts against their interests.

(GPT-4) The error span in the sentence provided in the summary is: ‘and the US is shutting down the Russian consulate in Washington, D.C.’ This span cannot be substantiated by the original document. The document does mention the expulsion of diplomats by both the US and Russia and the closure of a consulate, but it specifically mentions the US booted 60 Russian diplomats and in retaliation, Russia is closing the US consulate in Saint Petersburg. There is no mention of the US shutting down the Russian consulate in Washington, D.C. in the provided document.

A suggested fix to the summary for accuracy based on the document would be: Russia is closing the US consulate in Saint Petersburg in retaliation to the US expelling 60 Russian diplomats.

Phrase Substitute

(Critique-L3-FT) The document does not specify that the US is shutting down the Russian consulate in Washington, D.C. This information is not present in the document and thus introduces new information not grounded in the source. The error span is: ‘the US is shutting down the Russian consulate in Washington, D.C.’ To fix this, consider changing the span to ‘Russia is closing the US consulate in Saint Petersburg’

Phrase Delete

(Critique-L2-FT) The document does not specify that the US is shutting down the Russian consulate in Washington, D.C. This information is not provided in the document and thus introduces new information not supported by the source. The error span is: ‘the US is shutting down the Russian consulate in Washington, D.C.’ To fix this, consider changing the span to ‘the US is shutting down the Russian consulate in Saint Petersburg’

Phrase Insert

Table 20: Examples of feedback generated by GPT-4, Critique-L3-FT and Critique-L2-FT for a sentence in the original response using DCR (Prompts C.5 and C.6). The feedback identifies error spans at both the sentence level and the phrase level. It also suggests fixes of various types: substitutions, insertions and deletions.

Original Response | Human-Written Feedback | Critique-L3-FT

A California judge has issued a temporary injunction against the state’s high school exit exam, which critics say is too difficult for some students to pass. The test, which was only based on eighth-grade math and ninth- and 10th-grade level English, has caused controversy since its implementation in 2005. In response to a lawsuit by 10 students who failed the exam, a judge has allowed thousands of students who did not pass to potentially graduate.

The source document does not mention the year the test was implemented or the fact that it caused controversy since then.

The document does not specify the year of the test’s implementation, making this information extrinsic. The error span is:‘since its implementation in 2005’ To fix this, consider changing the span to ‘since its implementation’

The US and Russia have been engaging in a tit-for-tat diplomatic expulsion, with each side expelling 60 diplomats. Russia is closing the US consulate in Saint Petersburg and the US is shutting down the Russian consulate in Washington, D.C. The White House says it will deal with the retaliation, while Russian officials have warned they will take further action if the US acts against their interests.

The source document does not mention which Russian consulate was closed in the US, so the claim that it is in “Washington” is not grounded in the source.

The document does not specify that the US is shutting down the Russian consulate in Washington, D.C. This information is not present in the document and thus introduces new information not grounded in the source. The error span is: ‘the US is shutting down the Russian consulate in Washington, D.C.’ To fix this, consider changing the span to ‘Russia is closing the US consulate in Saint Petersburg’

The document discusses efforts to address systemic inequality, including the importance of understanding the root causes of inequality and the need for systemic change to create a more equitable society. The author argues that addressing inequality requires a multi-faceted approach that includes improving access to education, increasing job opportunities, and providing resources to marginalized communities. The document emphasizes the importance of collective action and the need for individuals, organizations, and governments to work together to address systemic inequality.

The text does not discuss systemic inequality, including the the importance of understanding the root causes of inequality and the need for systemic change.

The document does not explicitly discuss efforts to address systemic inequality. The topic is introduced through a discussion about President Trump’s behavior and comments on race, which is not directly related to efforts to address systemic inequality. The error span is: ‘efforts to address systemic inequality’ To fix this, consider changing the span to ‘President Trump’s comments on race and his behavior towards LaVar Ball’

The U.S. Navy has identified three sailors who went missing after their plane crashed off the coast of Japan. They are Lt. Steven Combs from Florida, Airman Matthew Chialastri from Louisiana, and Airman Apprentice Bryan Grosso from Florida. All three are assigned to the Provider Fleet Logistic Support Squadron 30 and were among 11 crew and passengers on the C-2 Greyhound aircraft. The search for the missing sailors covered over 1,000 nautical miles before being called off.

One is assigned to provider’s fleet of logistical support squadron 30 and two are assigned to USS Ronald Reagan.

The document does not specify that the three sailors were assigned to the Provider Fleet Logistic Support Squadron 30. This information is not present in the document and thus introduces extrinsic information. The error span is: ‘Provider Fleet Logistic Support Squadron 30’ To fix this, consider changing the span to ‘the Provider Fleet Logistic Support Squadron 30’

Table 21: Examples of human-written feedback and feedback generated by Critique-L3-FT for a few examples in TofuEval. The feedback is for a sentence in the original response. The model-generated feedback is produced using Prompt C.6.