
Measuring Faithfulness in Chain-of-Thought Reasoning

Sep 19 2023 · 8 min read
#CoT #Faithfulness

Generated reasoning is faithful to the model's true reasoning if it "accurately represents the reasoning process behind the model's prediction". This is particularly important 1) in high-stakes settings, such as medical decision-making, and 2) for gaining a better understanding of how reasoning works in LLMs. This work provides a timely investigation into the faithfulness of CoT reasoning in LLMs, adding to previous research suggesting that LLM-generated reasoning may not be faithful.

Chinese version: Zhihu


Measuring Faithfulness

The authors consider several hypotheses for why (zero-shot) CoT reasoning may be unfaithful: 1) post-hoc reasoning, where the reasoning is produced after an answer has effectively been decided and does not actually drive the prediction; 2) unfaithful use of test-time computation, where the benefit of CoT comes from the extra tokens of computation rather than from the stated reasoning; and 3) encoded reasoning, where the model hides information in the wording of the CoT rather than in its stated meaning.

The proposed tests for measuring CoT faithfulness. Early Answering: Truncate the original CoT before answering. Adding Mistakes: Have a language model add a mistake somewhere in the original CoT and then regenerate the rest of the CoT. Paraphrasing: Reword the beginning of the original CoT and then regenerate the rest of the CoT. Filler Tokens: Replace the CoT with ellipses.

Experimental setup

An example of the CoT prompt used. Underlined text is produced by the model.

Basic statistics


Early Answering: Does Truncating the Chain of Thought Change the Predicted Answer?

The procedure of Early Answering is as follows: 1) truncate the original CoT at each intermediate step; 2) prompt the model to answer immediately from the truncated CoT; 3) check whether that answer matches the answer produced with the complete CoT. If the answer already matches early on, the remaining reasoning steps apparently did not influence the prediction, which is the signature of post-hoc reasoning.
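A minimal sketch of how this truncation loop could be implemented (my own illustration, not the authors' code; `ask_model` is a hypothetical callable mapping a prompt to a final answer, and the prompt format is assumed):

```python
from typing import Callable, List

def early_answering_matches(
    question: str,
    cot_steps: List[str],
    full_cot_answer: str,
    ask_model: Callable[[str], str],
) -> List[bool]:
    """For each truncation point k, ask the model to answer immediately from
    only the first k reasoning steps, and record whether that answer matches
    the answer obtained with the complete chain of thought."""
    matches = []
    for k in range(len(cot_steps) + 1):
        truncated = " ".join(cot_steps[:k])  # keep only the first k steps (k=0: no CoT at all)
        prompt = f"{question}\n{truncated}\nSo the answer is:"  # force an immediate answer
        matches.append(ask_model(prompt).strip() == full_cot_answer.strip())
    return matches
```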

We observe wide variation between tasks in how often the final answer changes as the CoT is truncated.


We also see little correlation between this measure of faithfulness and the performance gain from CoT. This suggests that faithfulness is not strongly tied to task performance.

AOC is the area over the curve for the early answering and adding mistakes experiments, weighted by the representation of each CoT length. A higher AOC indicates less post-hoc reasoning.
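The post does not spell out exactly how AOC is computed; a plausible reading of the caption above is one minus the average rate at which a truncated (or corrupted) CoT already yields the full-CoT answer, using per-sample match lists like those produced by the truncation sketch above. A hedged sketch:

```python
from typing import Sequence

def area_over_curve(per_sample_matches: Sequence[Sequence[bool]]) -> float:
    """Rough AOC estimate. per_sample_matches[i][k] is True if, for sample i,
    truncating (or corrupting) the CoT at position k already yields the same
    final answer as the complete CoT. Averaging per sample and then over
    samples implicitly weights by how common each CoT length is."""
    same_answer_area = sum(
        sum(m) / len(m) for m in per_sample_matches
    ) / len(per_sample_matches)
    return 1.0 - same_answer_area  # higher AOC = answers depend more on the full CoT
```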

Adding Mistakes: Does Editing the Chain of Thought Change the Predicted Answer?

Injecting mistakes is another approach to testing for post-hoc reasoning. The procedure is as follows: 1) have a language model introduce a mistake into one step of the original CoT; 2) regenerate the rest of the CoT from that point; 3) check whether the final answer still matches the answer produced from the original, unmodified CoT.
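A rough sketch of this loop under the same assumptions as before (`add_mistake`, `continue_cot`, and `ask_model` are hypothetical callables standing in for the actual prompting code):

```python
from typing import Callable, List

def adding_mistakes_matches(
    question: str,
    cot_steps: List[str],
    full_cot_answer: str,
    add_mistake: Callable[[str], str],    # rewrites one reasoning step to contain an error
    continue_cot: Callable[[str], str],   # regenerates the rest of the CoT from a prefix
    ask_model: Callable[[str], str],      # returns a final answer for a prompt
) -> List[bool]:
    """Corrupt the CoT at each position, regenerate the remainder, and record
    whether the final answer still matches the original answer. Frequent
    matches indicate the stated reasoning is post-hoc."""
    matches = []
    for k in range(len(cot_steps)):
        corrupted = cot_steps[:k] + [add_mistake(cot_steps[k])]
        prefix = " ".join(corrupted)
        rest = continue_cot(f"{question}\n{prefix}")
        answer = ask_model(f"{question}\n{prefix} {rest}\nSo the answer is:")
        matches.append(answer.strip() == full_cot_answer.strip())
    return matches
```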

An add-mistakes sample from AQuA. The introduced mistake is underlined.

The results largely agree with Early Answering. This reinforces our previous findings: the extent of post-hoc reasoning varies considerably between tasks, and it is not strongly correlated with the accuracy improvement conferred by CoT.

Besides, we note that AQuA and LogiQA show the most faithful reasoning. One possible explanation is that these tasks are difficult enough that the model cannot answer from immediate intuition alone and must actually rely on the generated reasoning.


Filler Tokens: Do Uninformative Chain of Thought Tokens Also Improve Performance?

The procedure of Filler Tokens is simple: replace the CoT with a number of filler tokens. The observation is also clear: there is no increase in accuracy when adding "…" tokens to the context. This suggests that extra test-time compute alone is not used to perform helpful but unstated reasoning.
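For concreteness, the filler-token condition could be constructed roughly like this (a sketch; the paper's exact prompt format and filler string may differ):

```python
def filler_token_prompt(question: str, n_filler: int) -> str:
    """Build the filler-token variant of a prompt: the chain of thought is
    replaced by uninformative '...' tokens, so the model gets extra forward
    passes but no stated reasoning."""
    return f"{question}\n{' '.join(['...'] * n_filler)}\nSo the answer is:"
```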


Paraphrasing: Does Rewording the Chain of Thought Change the Predicted Answer?

Recently, there have been studies on how RLHF steers LLMs to generate short answers, providing an implicit length penalty that may push models to disguise the content of their reasoning. Since the evaluated model is RLHF-finetuned, this might also affect the faithfulness of its CoT reasoning.

To test whether encoded reasoning occurs, the authors use another pretrained LLM to paraphrase the CoT that could contain encoded information. The procedure is as follows: 1) reword the beginning of the original CoT with the paraphrasing model; 2) regenerate the rest of the CoT from the paraphrased prefix; 3) compare the resulting accuracy with the accuracy of the original CoT.
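A sketch of the paraphrasing loop, again with hypothetical `paraphrase`, `continue_cot`, and `ask_model` callables in place of the real prompting code:

```python
from typing import Callable, List

def paraphrased_answer(
    question: str,
    cot_steps: List[str],
    n_prefix_steps: int,
    paraphrase: Callable[[str], str],     # a second LLM that rewords text
    continue_cot: Callable[[str], str],   # regenerates the rest of the CoT
    ask_model: Callable[[str], str],      # returns a final answer for a prompt
) -> str:
    """Reword the first n_prefix_steps of the CoT, regenerate the remainder,
    and return the final answer. If accuracy is unchanged relative to the
    original CoT, the exact wording is unlikely to carry encoded information."""
    reworded = paraphrase(" ".join(cot_steps[:n_prefix_steps]))
    rest = continue_cot(f"{question}\n{reworded}")
    return ask_model(f"{question}\n{reworded} {rest}\nSo the answer is:")
```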

A paraphrasing example. Differences from the original chain of thought are underlined in the paraphrased sample.

We observe no significant gap between the paraphrased accuracy and the accuracy of the original CoT. It is therefore unlikely that the LLM encodes hidden information responsible for the accuracy gains of CoT.


Scaling Laws

To answer a question correctly, the model can either 1) rely on immediate intuition or 2) use step-by-step reasoning to arrive at a final answer. The more a model can answer from intuition alone, the less its stated reasoning needs to drive the prediction; hence, there is an implicit trade-off between task performance and reasoning faithfulness.

In this section, the authors measure faithfulness across model sizes. They conduct two sets of experiments: 1) standard tasks, where the results are obtained by re-running the experiments above on all tasks; and 2) addition tasks, a set of synthetic addition problems with 2/4/8/16 operands of two or three digits.

Standard Tasks

We can observe two trends from the results (a U-shaped trend overall): 1) going from the smallest models to mid-sized models, the answers given with and without CoT diverge more, i.e. the reasoning becomes more faithful as the model becomes capable of actually using it; 2) going from mid-sized models to the largest models, the answers converge again, i.e. the reasoning becomes less faithful as the model can increasingly answer without relying on the CoT.

To conclude, there seems to be a dilemma for reasoning faithfulness: if the model is too small, it cannot even reason; if the model is too large, it ceases to rely on faithful reasoning to solve problems.

Note: the y-axis represents answer consistency, i.e. how often the model gives the same final answer with and without CoT. A lower value indicates higher faithfulness. The authors choose this metric since it is highly predictive of the overall early answering and adding mistakes results.
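Assuming final answers have been collected with and without CoT, this consistency metric takes only a few lines to compute (a sketch, not the authors' code):

```python
from typing import Sequence

def same_answer_rate(answers_with_cot: Sequence[str],
                     answers_without_cot: Sequence[str]) -> float:
    """Fraction of questions whose final answer is identical with and without
    CoT. A lower value means the CoT actually changes the prediction, which is
    taken here as a sign of more faithful reasoning."""
    pairs = list(zip(answers_with_cot, answers_without_cot))
    same = sum(a.strip() == b.strip() for a, b in pairs)
    return same / len(pairs)
```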

Addition Tasks

To validate the above conclusion, the authors design another set of synthetic experiments. The synthetic setup allows task difficulty to be controlled directly through the number and size of the operands, as sketched below.
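A small generator for such problems, written against the setup described above (my own illustration, not the authors' code):

```python
import random
from typing import Optional, Tuple

def make_addition_problem(n_operands: int, n_digits: int,
                          seed: Optional[int] = None) -> Tuple[str, int]:
    """Generate one synthetic addition problem with n_operands numbers of
    n_digits digits each (the post describes 2/4/8/16 operands of 2 or 3 digits)."""
    rng = random.Random(seed)
    lo, hi = 10 ** (n_digits - 1), 10 ** n_digits - 1
    operands = [rng.randint(lo, hi) for _ in range(n_operands)]
    return " + ".join(map(str, operands)) + " = ?", sum(operands)

# e.g. a 4-operand, 3-digit problem
question, answer = make_addition_problem(n_operands=4, n_digits=3, seed=0)
```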


The results again emphasize the inverse scaling trend of faithfulness with model size. To obtain faithful reasoning, it may be necessary to choose models that are less capable than the maximally capable model available, especially for easier tasks.


Conclusion

This study examines the faithfulness of CoT reasoning generated by LLMs. Three hypotheses about the cause of unfaithfulness are tested with comprehensive experiments: 1) post-hoc reasoning, 2) unfaithful use of test-time computation, and 3) encoded reasoning. The main takeaways are as follows: 1) the extent of post-hoc reasoning varies considerably between tasks and is not strongly correlated with the accuracy improvement from CoT; 2) extra test-time compute alone (filler tokens) does not improve accuracy; 3) there is no evidence that the model encodes hidden information in the wording of its CoT; and 4) reasoning faithfulness tends to decrease as models become larger and more capable.

One major drawback of this work (as well as most evaluation studies) is the lack of transparency into the model's internal reasoning process. Since no ground-truth information about the internal reasoning is available, all the results remain hypotheses. To address this issue, future work could explore more detailed analyses or mechanisms for probing the model's internal states.

Another limitation of the evaluation is its focus on RLHF models, which may have different levels of reasoning faithfulness compared to pretrained models. An interesting direction is to investigate how RLHF affects reasoning faithfulness.

Finally, the work does not provide any solutions for improving faithfulness. This is partially addressed by their follow-up work, which suggests breaking CoT reasoning down into distinct subproblems and subanswers.