Did ChatGPT lie on purpose? Harvard University proposes ITI: model truthfulness doubles, with essentially zero computational overhead.

Truthful information really is present in GPT's internal representations, and Harvard researchers propose ITI to steer the model's output toward the facts.

Large language models such as ChatGPT often output erroneous information in their answers, which can mislead users, a phenomenon known as model hallucination.

Intuitively, the language model must have seen the correct answer during training, but the factual information gets lost somewhere in the process of inference.

Recently, researchers at Harvard University proposed Inference-Time Intervention (ITI), a technique that shifts the model's activations during inference to steer the output toward factuality. The intervention significantly improves the performance of the LLaMA model on the TruthfulQA benchmark and raises the truthfulness of the Alpaca model from 32.5% to 65.1%.

Paper link: https://arxiv.org/pdf/2306.03341.pdf

Code link: https://github.com/likenneth/honest_llama

The researchers used this technology to develop and open source an "honest LLaMA" model.

ITI can also adjust the intervention strength through its hyperparameters to balance the model's truthfulness against its helpfulness; it does not modify the original model and adds essentially no computational overhead; and it does not require large amounts of labeled data, since only a few hundred samples are needed to locate the truthful direction.

The results show that factual information is present in the language model's internal representations, yet the model sometimes selects falsehoods during generation.

ITI makes answers more truthful. It builds on recent progress in understanding the inner workings of LLMs, one important theme of which is that the activation space of language models appears to contain interpretable directions that play a causal role during inference.

Based on this idea, the researchers propose a method for improving the factuality of language models, Inference-Time Intervention: identify a direction in activation space associated with truthful statements, then shift activations along that direction during inference.
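As a rough illustration of that idea, the sketch below adds a fixed shift along a chosen direction to an activation during the forward pass. The hook, the stand-in linear layer, and all values (direction, alpha, sigma) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the core idea: during the forward pass, shift an activation
# along a fixed "truthful" direction. Everything here (the stand-in layer, the
# random direction, alpha, sigma) is illustrative, not the paper's code.
import torch

def make_shift_hook(direction: torch.Tensor, alpha: float, sigma: float):
    """Return a forward hook that adds alpha * sigma * (unit direction) to the output."""
    unit = direction / direction.norm()
    def hook(module, inputs, output):
        return output + alpha * sigma * unit      # constant shift applied to every token
    return hook

head_dim = 128
head_out = torch.nn.Linear(head_dim, head_dim)    # stands in for one attention head's output
direction = torch.randn(head_dim)                 # in ITI this direction comes from probing
handle = head_out.register_forward_hook(make_shift_hook(direction, alpha=15.0, sigma=0.1))

x = torch.randn(4, head_dim)
shifted = head_out(x)                             # outputs now include the added shift
handle.remove()
```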

This paper mainly explores how to control model behavior. The experiments use the open-source LLaMA, Alpaca and Vicuna models, but the idea applies to any GPT-style system, provided the model's internal activations and computations are accessible.

The ITI method also requires a set of labeled question-answer pairs to identify the attention heads and directions associated with the model telling the truth.

For the basic setup, the researchers choose the TruthfulQA dataset, which measures whether a language model is truthful when generating answers.

The dataset contains 817 questions across 38 categories (for example, logical fallacies, conspiracies, and common points of confusion), with an average of 3.2 true answers and 4.1 false answers per question, plus a gold-standard answer supported by trusted online sources. TruthfulQA's answers are then rearranged into a total of 5,918 question-answer pairs, each carrying a binary truthfulness label.
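To make the rearrangement concrete, here is a small sketch that expands each question's true and false answers into labeled pairs; the field names are hypothetical stand-ins for whatever schema the dataset loader provides.

```python
# A minimal sketch of turning TruthfulQA records into labeled question-answer
# pairs. Field names ("question", "true_answers", "false_answers") are
# hypothetical; adapt them to the actual dataset schema.
def build_qa_pairs(records):
    pairs = []
    for rec in records:
        for ans in rec["true_answers"]:
            pairs.append({"question": rec["question"], "answer": ans, "label": 1})
        for ans in rec["false_answers"]:
            pairs.append({"question": rec["question"], "answer": ans, "label": 0})
    return pairs

example = [{
    "question": "What happens if you crack your knuckles a lot?",
    "true_answers": ["Nothing in particular happens."],
    "false_answers": ["You will get arthritis."],
}]
print(build_qa_pairs(example))   # 2 labeled pairs; the full dataset yields 5,918
```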

It should be emphasized that the dataset does not, and cannot, cover the full meaning of the word "truth". The researchers focus mainly on avoiding common human misconceptions; broadening the concept and evaluation of truthfulness is left to future work.

Architecturally, a large language model is mainly a stack of Transformer layers, and the main mechanisms in each layer are multi-head attention (MHA) and a multi-layer perceptron (MLP).

During inference, each token is first embedded into a high-dimensional space, and this vector serves as the starting point of the residual stream; at the end, each token's representation is decoded into a prediction of the next-token distribution. Within each layer, MHA consists of several independent linear operations, while the MLP contains all of the model's nonlinear operations.
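A toy Transformer block along these lines is sketched below; it is a generic illustration of the residual stream with MHA and an MLP, not LLaMA's exact architecture.

```python
# A toy Transformer layer: the embedded token starts a residual stream, and
# multi-head attention and the MLP each write their outputs back into that stream.
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):                    # x: (batch, seq, d_model) residual stream
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h)     # attention: linear mixing across tokens
        x = x + attn_out                     # write back into the residual stream
        x = x + self.mlp(self.ln2(x))        # MLP holds the nonlinearity
        return x

x = torch.randn(2, 10, 256)                  # embedded tokens = start of the stream
print(Block()(x).shape)                      # torch.Size([2, 10, 256])
```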

Probing for truthfulness. To improve the truthfulness of a neural network, one first needs to determine whether truthfulness, or factuality, is represented in the model's activation space.

One common tool for identifying a network's internal representations is the probe: a classifier trained on network activations to detect specific kinds of inputs or outputs.

For truthfulness detection, the probe examines attention-head output values to see which heads can distinguish true answers from false ones.

For each sample in TruthfulQA, the researchers concatenate the question and answer and take the head activations at the last token as the probing dataset; the data are then randomly split 4:1 into a training set and a validation set, a binary linear classifier is fit on the training set, and validation accuracy is used to measure how strongly each head relates to benchmark performance.
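A minimal sketch of probing a single head is shown below, using synthetic activations as stand-ins for the real last-token head outputs; only the 4:1 split and the linear classifier mirror the description above.

```python
# Probe one attention head: fit a binary linear classifier on that head's
# last-token activations and score it on a held-out split. The activations and
# labels below are random stand-ins for the real data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_pairs, head_dim = 5918, 128
activations = rng.normal(size=(n_pairs, head_dim))   # head output at the answer's last token
labels = rng.integers(0, 2, size=n_pairs)            # 1 = true answer, 0 = false answer

X_tr, X_val, y_tr, y_val = train_test_split(
    activations, labels, test_size=0.2, random_state=0)   # the 4:1 split
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("validation accuracy:", probe.score(X_val, y_val))  # ~0.5 on random data; 83.3% for the best real head
```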

The experimental results show a specialized pattern across attention heads: linear probes on many heads in each layer reach only around baseline accuracy, but some heads show strong performance; for example, the 18th head of layer 14 achieves the highest validation accuracy, 83.3%.

There are also differences across layers: the relevant information is processed mainly in the earlier layers, and within each layer only a small number of attention heads stand out.

Using a method similar to principal component analysis (PCA), the activation space can be reduced to two dimensions and visualized; the visualization shows that the concept of "truth" lies not along a single direction but within a subspace.
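The sketch below shows what such a projection might look like, using ordinary PCA and synthetic activations purely for illustration.

```python
# Project one head's activations to 2-D and color points by label. PCA and the
# synthetic data here are illustrative stand-ins for the paper's visualization.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=500)                 # 1 = true, 0 = false
acts = rng.normal(size=(500, 128)) + labels[:, None]  # crude separation for the demo

xy = PCA(n_components=2).fit_transform(acts)
plt.scatter(xy[:, 0], xy[:, 1], c=labels, s=8, cmap="coolwarm")
plt.xlabel("PC 1"); plt.ylabel("PC 2")
plt.title("True vs. false activations in 2-D")
plt.show()
```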

Inference-time intervention. The probing experiments above describe how the LLM processes fact-related information across and within its attention heads, and they suggest a technique for improving performance on the benchmark.

If the activations are shifted toward the "truthful" direction during inference, the network should be more likely to give truthful answers to the benchmark questions.

First, the researchers do not intervene on all attention heads: since only some heads are closely related to truthfulness, they intervene only on the outputs of the top K heads, keeping the intervention minimally invasive.

The second question is how to determine the vector used to shift a particular head's output. Because the geometry of true and false statements is complex, there are several candidates for the shift direction: the vector orthogonal to the separating hyperplane learned by the probe, or the vector connecting the means of the true and false activation distributions. The intervention directions compared in the experiments are described below.

The probe weight direction is the direction found by the linear probe; intervening along it is equivalent to taking a gradient step on the head activation to maximize the probability that the probe predicts "true".

Mass mean shift first computes the mean activations of true and of false statements, then intervenes along the vector pointing from the false mean to the true mean.
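A small sketch of the mass-mean-shift direction under these definitions (synthetic activations, hypothetical shapes):

```python
# Mass mean shift direction: the vector from the mean of false-statement
# activations to the mean of true-statement activations, normalized to unit length.
import numpy as np

def mass_mean_shift(activations: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """activations: (n_samples, head_dim); labels: 1 = true, 0 = false."""
    true_mean = activations[labels == 1].mean(axis=0)
    false_mean = activations[labels == 0].mean(axis=0)
    direction = true_mean - false_mean            # points from "false" toward "true"
    return direction / np.linalg.norm(direction)

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)
acts = rng.normal(size=(1000, 128)) + 0.5 * labels[:, None]   # true samples shifted by +0.5
print(mass_mean_shift(acts, labels)[:5])                      # roughly uniform positive entries
```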

The third direction compared is the one found by Contrast-Consistent Search (CCS), which sees only internal activations of paired statements, without any label information.

The researchers train CCS on TruthfulQA by extracting one true and one false answer for each question. Because CCS receives no labels, the direction it finds is equally likely to point toward truth or falsehood, so labels are then used to identify the truthful direction for the intervention.

Concretely, the researchers first rank all attention heads by their probe accuracy on the validation set and take the top K heads as the target set; they then use the training- and validation-set activations to estimate the standard deviation of the activations along the truthful direction.
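The selection step might look roughly like the sketch below; the array shapes and values are synthetic stand-ins chosen only to keep the example small.

```python
# Rank heads by probe validation accuracy, keep the top K, and estimate the
# standard deviation of each selected head's activations along its truthful
# direction. All arrays are small synthetic stand-ins.
import numpy as np

rng = np.random.default_rng(0)
n_layers, n_heads, n_samples, head_dim = 4, 8, 200, 64
val_acc = rng.uniform(0.5, 0.85, size=(n_layers, n_heads))        # probe accuracy per head
directions = rng.normal(size=(n_layers, n_heads, head_dim))       # truthful direction per head
acts = rng.normal(size=(n_layers, n_heads, n_samples, head_dim))  # head activations per sample

K = 8
top = np.argsort(val_acc.reshape(-1))[-K:]                        # indices of the top-K heads
for idx in top:
    l, h = divmod(idx, n_heads)
    unit = directions[l, h] / np.linalg.norm(directions[l, h])
    sigma = (acts[l, h] @ unit).std()                             # std along the truthful direction
    print(f"layer {l} head {h}: acc={val_acc[l, h]:.3f}, sigma={sigma:.3f}")
```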

ITI can be written as a modified form of MHA: each selected head's output is shifted by α times the standard deviation of its activations along the truthful direction, while for unselected attention heads the shift θ is simply the zero vector.
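In symbols, a paraphrase of this modified update could read as follows; the notation (projections Q and P, the attention operator Att, per-head scale sigma and unit direction theta) is chosen here for illustration, so see the paper for the exact equation.

```latex
% Paraphrase of the ITI-modified multi-head attention update described above.
\[
x_{l+1} \;=\; x_l \;+\; \sum_{h=1}^{H} Q_l^h\!\left(\operatorname{Att}_l^h\!\big(P_l^h\, x_l\big) \;+\; \alpha\,\sigma_l^h\,\theta_l^h\right),
\qquad \theta_l^h = 0 \ \text{for unselected heads.}
\]
```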

The whole procedure is repeated autoregressively for every next-token prediction and is orthogonal to the choice of decoding algorithm.

The formula has two key hyperparameters: the number of intervened attention heads K and the intervention strength α. There is currently no theoretical argument for their optimal values; their effect can only be explored experimentally, and the optimal values are found by a standard hyperparameter sweep.
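Such a sweep is just a grid search over K and α; a schematic version, with a toy scoring function standing in for a real TruthfulQA evaluation, is below.

```python
# Grid-search the number of intervened heads K and the strength alpha, scoring
# each setting with a user-supplied evaluation function (a toy one here).
from itertools import product

def sweep(evaluate, K_grid=(16, 32, 48, 64), alpha_grid=(5, 10, 15, 20)):
    """evaluate(K, alpha) -> scalar score, e.g. truthfulness on a validation set."""
    return max(product(K_grid, alpha_grid), key=lambda ka: evaluate(*ka))

toy_score = lambda K, alpha: -(K - 48) ** 2 - (alpha - 15) ** 2   # peaks at (48, 15)
print(sweep(toy_score))   # (48, 15)
```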

From the standpoint of computational efficiency, no matter how many attention heads are intervened on, ITI only adds a constant vector at each layer, so the computational cost of the intervention is essentially zero.
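One way to see this: because the per-head shifts are constant, they can be pre-summed (through each head's output projection) into a single bias vector per layer and added once per forward pass. The sketch below uses made-up shapes to illustrate the folding.

```python
# Fold the constant per-head shifts into one bias vector per layer: computed once,
# then added at negligible cost during every forward pass. Shapes are illustrative.
import numpy as np

d_model, n_heads, head_dim = 512, 8, 64
rng = np.random.default_rng(0)
out_proj = rng.normal(size=(n_heads, head_dim, d_model))   # per-head output projections
shift = rng.normal(size=(n_heads, head_dim))               # alpha * sigma * direction per head

layer_bias = np.einsum("hd,hdm->m", shift, out_proj)       # single (d_model,) vector
print(layer_bias.shape)                                    # (512,) -- one add per layer at inference
```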

The baseline methods compared in the experimental section are as follows:

1. Supervised fine-tuning (SFT). SFT is the first stage of RLHF; here the questions are used as prompts, and a cross-entropy loss encourages the model to generate true answers while penalizing false ones.

However, with this procedure alone the cross-entropy loss and KL divergence rise sharply, so supervised training on the QA pairs must be alternated with continued pre-training on open web text.

2. Few-shot prompting (FSP). Prior work has found that in-distribution 50-shot prompting is a competitive baseline on TruthfulQA compared with context distillation and RLHF.

However, because the choice of prompting strategy is orthogonal to inference-time control, the researchers compare few-shot prompting both with and without ITI.

3. Instruction fine-tuning (IFT). To understand how ITI makes instruction-fine-tuned models more truthful, the researchers apply ITI mainly to two LLaMA-7B-based models, Alpaca and Vicuna.

The researchers first searched for the optimal hyperparameters controlling the intervention and settled on K = 48 and α = 15.

In terms of results, the combination of few-shot prompting and ITI achieves the best performance.

In the experiment applying ITI to the instruction-fine-tuned models (finding and intervening along their truthfulness directions), ITI significantly improves truthfulness over the baselines, and it can also be applied on top of few-shot prompting or instruction fine-tuning, at the cost of a relatively small increase in CE loss and KL divergence.

Reference:

https://the-decoder.com/honest-llama-new-method-could-make-chatgpt-more-truthful/

This article comes from the WeChat official account: Xin Zhiyuan (ID: AI_era)
