
news 2025/7/14 3:15:50


This post follows up on the previous one: A Review of the Transformer Encoder-Decoder Architecture
There, I used the huggingface T5 transformer to briefly review encoder-decoder models.

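While working with decoder-only models recently, I found that the details of using them differ considerably from encoder-decoder models. Because huggingface unifies its interfaces, many of them are designed mainly around encoder-decoder use cases (such as T5), so running decoder-only models (such as GPT or LLaMA) with huggingface leads to many counterintuitive problems.

This post goes on to cover decoder-only models and briefly lists, at the level of technical details, some ways in which using them differs from using encoder-decoder models.

Everything below uses the huggingface transformers interface as the example.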

1. Input and output are merged during training

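For an encoder-decoder model, we feed the input and the output to the model's encoder and decoder respectively. In other words, a model like T5 has a separate encoder that encodes the context of the input, while the decoder decodes the output and is used to compute the loss. In short, with an encoder-decoder model we only feed the output to the decoder (for computing the loss with teacher forcing), which is intuitive for most of us.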

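A decoder-only model, however, requires you to manually concatenate the input and the output and use the result as the decoder's input. The reason is that a decoder-only model has no separate encoder to encode the context of the input, so the input must serve as the "preceding text", and the decoder predicts the "following text", the output, conditioned on it (autoregressively). During training, therefore, the input and the output are fed to a decoder-only model together (the input prefix must be present). For most people used to encoder-decoder models, this is quite counterintuitive.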

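Correspondingly, the "answer" (ground truth reference) used to compute the loss of a decoder-only model must also be the concatenation of the input and the output, because the ground truth reference has to line up position by position with the input token representations. As a result, the decoder's loss covers not only the output but also the prediction errors on the input. Since in most cases we do not want to penalize the decoder for errors on the input, the usual practice is to compute the loss only on the output during training, that is, to set the ground truth of every input token to -100 (the cross entropy ignore index).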
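The labeling scheme described above can be sketched in a few lines of plain Python. This is only a minimal sketch, not the huggingface implementation: `build_decoder_only_example`, `masked_mean`, the toy token ids, and the optional end-of-sequence id are all assumptions made for illustration. The labels are kept at the same length as the concatenated input; as far as I know, huggingface causal LM heads shift the labels internally, so no manual shifting is shown.

```python
IGNORE_INDEX = -100  # ignore index of the cross-entropy loss, as stated in the text


def build_decoder_only_example(prompt_ids, answer_ids, eos_id=None):
    """Concatenate prompt and answer; supervise only the answer tokens."""
    # Appending an end-of-sequence id is an assumption, not something the text requires.
    tail = list(answer_ids) + ([eos_id] if eos_id is not None else [])
    input_ids = list(prompt_ids) + tail
    # Prompt positions get IGNORE_INDEX so they contribute nothing to the loss.
    labels = [IGNORE_INDEX] * len(prompt_ids) + tail
    return {"input_ids": input_ids, "labels": labels}


def masked_mean(per_token_loss, labels):
    """Average per-token losses over the positions whose label is not ignored."""
    kept = [loss for loss, label in zip(per_token_loss, labels) if label != IGNORE_INDEX]
    return sum(kept) / len(kept) if kept else 0.0


example = build_decoder_only_example(prompt_ids=[11, 12, 13], answer_ids=[21, 22], eos_id=2)
print(example["input_ids"])  # [11, 12, 13, 21, 22, 2]
print(example["labels"])     # [-100, -100, -100, 21, 22, 2]
```

Only the last three positions carry a label here, so errors on the three prompt tokens never enter the average computed by `masked_mean`.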

2. Manually extracting the output at test time

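The output of an encoder-decoder model is the "pure" output (the model's prediction).

With a decoder-only model, however, the model's output at inference time contains both the output and the input (because the input was also fed to the decoder).

So for a decoder-only model we have to separate the output manually.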

As shown below:
[Figure: image not available in the source]
I also find it rather baffling that huggingface's model.generate() interface does not offer an extra parameter for decoder-only models that extracts the output automatically (the number of input tokens is enough to locate the output, so it would not be hard to implement).
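Since the generated sequence starts with the prompt, the output can be located with the prompt length alone. The following is a minimal sketch on plain lists of token ids; `strip_prompt` is a hypothetical helper, not part of huggingface, and the token ids are made up. With the tensor returned by `model.generate()`, the same idea amounts to slicing off the first `input_ids.shape[1]` positions of each row before decoding.

```python
def strip_prompt(generated_ids, prompt_len):
    """Drop the first prompt_len tokens of every generated row.

    Assumes every row starts with a prompt of the same (padded) length,
    as is the case for the rows of a single padded batch.
    """
    if prompt_len < 0:
        raise ValueError("prompt_len must be non-negative")
    return [list(row[prompt_len:]) for row in generated_ids]


prompt = [101, 7, 8, 9]
generated = [prompt + [40, 41, 42]]  # toy stand-in for the ids returned by generation
print(strip_prompt(generated, len(prompt)))  # [[40, 41, 42]]
```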

3. Speed and accuracy of batched inference

To run prediction in batches, the simple approach is to tokenize a batch of samples and then append pad tokens at the end (right side) of each sequence to make up for the length differences. This works for encoder-decoder models.

For a decoder-only model, however, you additionally need to set the tokenizer's padding side to the left during training:
[Figure: image not available in the source]
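The reason is that with the default right-side padding, all pad tokens of a batch sit at the end of the sequences during inference. A decoder-only model generates new tokens autoregressively, so the pad tokens on the far right can easily affect what the model generates.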
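To make the difference between the two padding sides concrete, here is a small self-contained sketch. `pad_batch` and `last_position_is_real` are hypothetical helpers written for illustration, and the pad id 0 is an arbitrary choice. In huggingface itself, the side is controlled by the tokenizer's `padding_side` attribute.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad token-id lists to a common length; return (input_ids, attention_mask)."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        pads = [pad_id] * (max_len - len(seq))
        zeros = [0] * len(pads)   # mask value 0 marks padding
        ones = [1] * len(seq)     # mask value 1 marks real tokens
        if side == "left":
            input_ids.append(pads + list(seq))
            attention_mask.append(zeros + ones)
        else:
            input_ids.append(list(seq) + pads)
            attention_mask.append(ones + zeros)
    return input_ids, attention_mask


def last_position_is_real(attention_mask):
    """True if every row ends with a real token rather than a pad token."""
    return all(row[-1] == 1 for row in attention_mask)


batch = [[5, 6, 7], [8]]
left_ids, left_mask = pad_batch(batch, pad_id=0, side="left")
right_ids, right_mask = pad_batch(batch, pad_id=0, side="right")
print(left_ids)   # [[5, 6, 7], [0, 0, 8]]
print(right_ids)  # [[5, 6, 7], [8, 0, 0]]
print(last_position_is_real(left_mask), last_position_is_real(right_mask))  # True False
```

With left padding, every row ends in a real token, so the first generated token directly follows the prompt instead of a run of pad tokens.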

One might ask: can't we mask out those pad tokens with an attention mask, as with an encoder-decoder model, so that they no longer affect the generated content?

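Unfortunately, as I read the official API description, the huggingface model.generate interface does not support passing in an attention mask for decoder-only models. (Strictly speaking, generate() forwards additional keyword arguments to the model, so the attention_mask returned by the tokenizer can be passed in as well; this masks the pad tokens but does not change where they sit in the sequence.)
[Figure: screenshot of the official API description; image not available in the source]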
So if you want batched inference, you have to set the tokenizer's padding to the left side during both training and testing to reduce the effect of pad tokens on the generated content, or simply set the batch size to 1.

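In my own experiments, using batch size == 1 at inference time noticeably improved inference accuracy.
Below is the performance ranking from my tests:

Batch size of 1 (no influence from pad tokens at all): best performance.
Batch size greater than 1, pad tokens on the left (their influence is reduced but still hurts inference): on some tasks the drop is fairly noticeable.
Batch size greater than 1, pad tokens on the right (pad tokens severely interfere with the predictions for a large share of samples, and the model's final output is mostly gibberish): performance close to 0.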

In short, huggingface's generation interface currently does not support decoder-only models very well: in inference speed, accuracy, and ease of use, decoder-only models fall short of encoder-decoder models.

Below is GPT-4's explanation of why decoder-only models do not need an attention mask at inference time (the last sentence is the highlight...):

The instruction from the Hugging Face Transformers documentation regarding the use of input_ids for decoder-only models during generation (like GPT-2, GPT-3) stems from the typical use case and architecture of these models.
Why input_ids?
Decoder-only models like GPT-2 or GPT-3 are designed to generate text sequentially. When generating text, the model uses its previously generated tokens as context to generate the next token. Since these models are autoregressive, they generate one token at a time, and the sequence of generated tokens grows until it reaches a stopping condition (like the max_length or eos_token_id). During this process, the only necessity is to know which tokens have been generated (encoded as input_ids), so the model knows the context based on which it should generate the subsequent token(s).
About attention_mask
While it’s true that attention masks are used to prevent the model from attending to certain positions within the input sequence (for example, padding tokens in a batched input scenario), during the generation phase, the need to specify an attention_mask explicitly is not as critical as during training or evaluation. This is because, during generation, the model is focusing on the tokens it has already generated and the positions it needs to fill next – all of which are relevant and none should be masked out.
That said, for certain generation scenarios or model configurations, you might want to control the attention mechanism explicitly. The Hugging Face Transformers library does allow for passing additional arguments like attention_mask in some contexts, but for the default use case of generating text with decoder-only models, including the attention_mask is not necessary.
Batched Inference
For batched inference, input_ids are necessary to understand the sequence (or sequences) from which to start generating text. If the sequences within a batch have different lengths, padding might be necessary to shape the input tensor appropriately. In such cases, an attention_mask becomes relevant because it allows the model to distinguish between the actual content and the padding. Therefore, while the generation call as you’ve shown doesn’t explicitly mention attention_mask, depending on the specifics of your use case and the model’s implementation, you might still need or want to provide it to ensure proper handling of batched inputs.

Other points to be added

...

Summary

All in all, I personally think huggingface's current model interface is not very friendly to decoder-only models. There are many details to watch out for, or you will run into many problems that simply do not arise with encoder-decoder models.

References:

Official API
alpaca-lora
