
news 2025/7/1 7:17:45


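Preface

Less than two weeks after Sora's debut, Google has released its own world model, and its capabilities appear even stronger: the virtual worlds it generates can be controlled interactively.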

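Part 1: Genie, the first foundation world model

1.1 What is Genie

Genie is a foundation world model: the first generative interactive environment trained in an unsupervised manner from unlabelled Internet video.

It is trained on a dataset of over 200,000 hours of publicly available Internet gaming videos and, despite training without action or text annotations (that is, with no action labels at all), is controllable on a frame-by-frame basis via a learned latent action space.

This is impressive. The Internet holds an enormous amount of video, and most of it carries no labels or descriptions: only actions and frames. If a model can predict the next frame from what it has already seen (past frames + an inferred latent action = next frame) and then optimise that prediction with a loss against the real next frame, then, to put it in extreme terms, what video could not serve as training data?

In short, Internet videos usually carry no labels saying which action is being performed or which part of the image should be controlled, yet Genie learns fine-grained control purely from such videos.

Although the data used is mostly 2D platformer game videos and robotics videos, the method is general and should scale to larger Internet datasets.

Genie not only learns which parts of an observation are generally controllable, it also infers diverse latent actions that are consistent across the generated environments.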

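1.2 What Genie can do

Ultimately, a single image is enough to create an entirely new interactive environment. For example, a state-of-the-art text-to-image model can generate the starting frame, which Genie then turns into a dynamic interactive environment.

In the animation below, Google generated images with Imagen2 and then used Genie to bring them to life:

Genie can do more than that: it can also be applied to human-designed creative content such as sketches.

Or to real-world images:

In addition, Google trained a smaller 2.5B-parameter model on action-free videos from RT1. As with the platformers, trajectories with the same latent action sequence typically display similar behaviour.

This suggests that Genie can learn a consistent action space, which may be suitable for training robots and building generalist embodied agents.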

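Part 2: Technical details: the paper "Genie: Generative Interactive Environments"

The paper behind Genie is "Genie: Generative Interactive Environments", and its project page is: https://sites.google.com/view/genie-2024/home

The paper has as many as 6 co-first authors, including the Chinese researcher Yuge (Jimmy) Shi (石宇歌), currently a research scientist at Google DeepMind, who received her PhD in machine learning from the University of Oxford in 2023.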

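2.1 ST-transformer architecture

A video can contain up to $O(10^4)$ tokens, and the quadratic memory cost of the Transformer is a heavy burden for video generation. Genie therefore adopts the memory-efficient ST-transformer architecture in all model components.

Unlike a traditional transformer, where every token attends to all others, the ST-transformer contains $L$ spatiotemporal blocks with interleaved spatial and temporal attention layers, followed by a feed-forward layer (FFW) as in standard attention blocks.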

The self-attention in the spatial layer attends over the $1 \times H \times W$ tokens within each time step, while the self-attention in the temporal layer attends over $T \times 1 \times 1$ tokens across the $T$ time steps.
Similar to sequence transformers, the temporal layer assumes a causal structure with a causal mask. Crucially, the dominating factor of computation complexity in this architecture (i.e. the spatial attention layer) scales linearly with the number of frames rather than quadratically, making it much more efficient for video generation with consistent dynamics over extended interactions.
In addition, each ST block includes only one FFW after both the spatial and temporal components, omitting the post-spatial FFW to allow for scaling up other components of the model.
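
The scaling argument above can be checked with simple counting. The sketch below counts query-key pairs for one attention layer of each kind; the function name and the example sizes are illustrative assumptions, not values from the paper.

```python
def attention_pairs(T, H, W):
    """Count query-key pairs for one layer of each attention type.

    T is the number of frames, H x W the token grid per frame.
    The temporal count ignores the causal mask, which roughly halves it.
    """
    n = T * H * W
    return {
        "full": n * n,                 # every token attends to every token
        "spatial": T * (H * W) ** 2,   # attention within each frame only
        "temporal": H * W * T * T,     # attention along time per grid cell
    }


small = attention_pairs(T=4, H=2, W=3)
doubled = attention_pairs(T=8, H=2, W=3)
print(small, doubled)
```

Doubling the number of frames doubles the spatial count but quadruples the full-attention count, which is the linear-versus-quadratic behaviour described above. The temporal term is still quadratic in $T$, but it is multiplied only by $H \times W$ rather than $(H \times W)^2$.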

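2.2 Three key components: video tokenizer, latent action model (LAM), dynamics model

Genie consists of three key components, as shown in the figure below

a video tokenizer, which converts a sequence of raw video frames into discrete tokens $\boldsymbol{z}$;
a latent action model (LAM), which infers the latent action $\boldsymbol{a}$ between each pair of frames;
an autoregressive dynamics model (a MaskGIT transformer), which predicts the next frame of the video given the latent actions and the past frame tokens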

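2.2.1 Video tokenizer: ST-ViViT encoding

Building on prior work, Google compresses videos into discrete tokens to reduce dimensionality and enable higher-quality video generation. This is implemented with a VQ-VAE, which takes the $T$ frames of a video $\boldsymbol{x}_{1: T}=\left(x_{1}, x_{2}, \cdots, x_{T}\right) \in \mathbb{R}^{T \times H \times W \times C}$ as input and produces a discrete representation for each frame: $\boldsymbol{z}_{1: T}=\left(z_{1}, z_{2}, \cdots, z_{T}\right) \in \mathbb{I}^{T \times D}$, where $D$ is the size of the discrete latent space. The tokenizer is trained on entire video sequences with a standard VQ-VAE objective.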

Unlike prior works that focus on spatial-only compression in the tokenization phase, Genie uses the ST-transformer in both the encoder and the decoder to incorporate temporal dynamics into the encodings, which improves video generation quality.
By the causal nature of the ST-transformer, each discrete encoding $z_{t}$ contains information from all previously seen frames of the video $\boldsymbol{x}_{1: t}$.
Phenaki (Villegas et al., 2023) also uses a temporal-aware tokenizer, C-ViViT, but that architecture is compute intensive, as its cost grows quadratically with the number of frames.
In comparison, the ST-transformer based tokenizer (ST-ViViT) is much more compute efficient, with the dominating factor in its cost increasing linearly.
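
The discretisation step of a VQ-VAE-style tokenizer can be sketched as a nearest-neighbour lookup in a codebook. The example below is a toy illustration in plain Python: the 2-dimensional codebook, the `quantize` helper and the sample encoder outputs are assumptions made for the sketch, and the ST-transformer encoder that would produce the continuous vectors is omitted.

```python
def quantize(vectors, codebook):
    """Map each continuous vector to the index of its nearest codebook entry."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(vec, codebook[k]))
        for vec in vectors
    ]


# Toy codebook with 4 entries (hypothetical values).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
# Stand-in for continuous encoder outputs of three patches.
encoder_out = [(0.1, -0.2), (0.9, 0.8), (0.2, 0.7)]
tokens = quantize(encoder_out, codebook)
print(tokens)  # [0, 3, 2]
```

Each index in `tokens` plays the role of one discrete token; decoding looks the index up in the same codebook and passes the result to the decoder.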

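2.2.2 Latent action model (LAM): also built on the ST-transformer

To achieve controllable video generation, Google conditions each future frame prediction on the action taken at the previous frame. However, such action labels are rarely available in Internet videos, and action annotation is costly to obtain. Instead, Google learns latent actions in a fully unsupervised manner.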

First, an encoder takes as inputs all previous frames $\boldsymbol{x}_{1: t}=\left(x_{1}, \cdots x_{t}\right)$ as well as the next frame $x_{t+1}$, and outputs a corresponding set of continuous latent actions $\tilde{\boldsymbol{a}}_{1: t}=\left(\tilde{a}_{1}, \cdots \tilde{a}_{t}\right)$.
A decoder then takes all previous frames $\boldsymbol{x}_{1: t}$ and the latent actions $\tilde{\boldsymbol{a}}_{1: t}$ as input and predicts the next frame $\hat{x}_{t+1}$.

To train the model, the authors use a VQ-VAE-based objective, which limits the number of predicted actions to a small discrete set of codes. The vocabulary size $|A|$ of the VQ codebook, i.e. the maximum number of possible latent actions, is kept small to permit human playability and further enforce controllability ($|A| = 8$ in the experiments). For background on VQ codebooks, see Section 1.2.3, "Improvements to the VAE: VQ-VAE/VQ-VAE2", of the related article.
As the decoder only has access to the history and the latent action, $\tilde{a}_{t}$ should encode the most meaningful changes between the past and the future for the decoder to successfully reconstruct the future frame.

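Concretely, Genie implements the latent action model with the ST-transformer architecture introduced above. The causal mask in the temporal layer makes it possible to take the entire video $\boldsymbol{x}_{1: T}$ as input and generate all latent actions between consecutive frames, $\tilde{\boldsymbol{a}}_{1: T-1}$.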
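
The role of the causal mask can be illustrated with a small sketch: each time step may attend only to itself and earlier steps, so all steps of a video can be processed in one pass without any output depending on later frames. The helper names below are hypothetical, and how the model arranges its inputs so that the action at step $t$ also sees frame $x_{t+1}$ is not covered here.

```python
def causal_mask(num_steps):
    """mask[i][j] is True when time step i may attend to time step j (j <= i)."""
    return [[j <= i for j in range(num_steps)] for i in range(num_steps)]


def visible_steps(mask, i):
    """List the time steps that step i is allowed to attend to."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


mask = causal_mask(4)
for i in range(4):
    print(i, visible_steps(mask, i))
```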

2.2.3 Dynamics model: a decoder-only MaskGIT transformer

The dynamics model is a decoder-only MaskGIT transformer.

At each time step $t \in[1, T]$, it takes in the tokenized video $\boldsymbol{z}_{1: t-1}$ and the stop-gradient latent actions $\tilde{\boldsymbol{a}}_{1: t-1}$, and predicts the next frame tokens $\hat{z}_{t}$.
The ST-transformer is used again here: its causal structure makes it possible to use the tokens from all $T-1$ frames $\boldsymbol{z}_{1: T-1}$ and the latent actions $\tilde{\boldsymbol{a}}_{1: T-1}$ as input and to generate predictions for all next frames $\hat{\boldsymbol{z}}_{2: T}$. The model is trained with a cross-entropy loss between the predicted tokens $\hat{\boldsymbol{z}}_{2: T}$ and the ground-truth tokens $\boldsymbol{z}_{2: T}$.
At train time, the input tokens $\boldsymbol{z}_{2: T-1}$ are randomly masked according to a Bernoulli distribution whose masking rate is sampled uniformly between 0.5 and 1.
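
The masking scheme described above can be sketched in two steps: draw a masking rate uniformly from [0.5, 1], then mask each input token independently with that probability. This is a minimal illustration on a flat token list; `MASK_ID`, `mask_tokens` and the sample tokens are assumptions for the sketch, and the restriction to $\boldsymbol{z}_{2: T-1}$ is left out.

```python
import random

MASK_ID = -1  # hypothetical id standing in for a learned mask token


def mask_tokens(tokens, rng):
    """Mask each token independently with a rate drawn uniformly from [0.5, 1]."""
    rate = rng.uniform(0.5, 1.0)
    masked = [MASK_ID if rng.random() < rate else tok for tok in tokens]
    return rate, masked


rng = random.Random(0)  # fixed seed so the sketch is repeatable
tokens = [3, 7, 1, 4, 0, 6, 2, 5]
rate, masked = mask_tokens(tokens, rng)
print(round(rate, 3), masked)
```

Because the rate is never below 0.5, the model always has to reconstruct at least half of the input tokens in expectation.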

Note that a common practice for training world models, including transformer-based models, is to concatenate the action at time $t$ to the corresponding frame (Micheli et al., 2023; Robine et al., 2023).
However, the authors found that treating the latent actions as additive embeddings, for both the latent action model and the dynamics model, helped to improve the controllability of the generations.
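
The difference between the two conditioning schemes can be sketched as follows. With concatenation the action embedding extends the frame embedding, whereas with additive embeddings it is summed element-wise and the width is unchanged. The function names and the toy 4-dimensional embeddings are assumptions; the paper's actual embedding sizes and projection layers are not reproduced.

```python
def concat_condition(frame_emb, action_emb):
    """Concatenation: the result is wider than the frame embedding."""
    return list(frame_emb) + list(action_emb)  # list '+' concatenates


def additive_condition(frame_emb, action_emb):
    """Additive embedding: element-wise sum, same width as the frame embedding."""
    if len(frame_emb) != len(action_emb):
        raise ValueError("additive conditioning needs equal widths")
    return [f + a for f, a in zip(frame_emb, action_emb)]


frame_emb = [0.5, -1.0, 2.0, 0.0]
action_emb = [1.0, 1.0, -2.0, 0.5]
print(concat_condition(frame_emb, action_emb))
print(additive_condition(frame_emb, action_emb))
```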

2.3 Genie's inference process

As shown in the figure below

The image is tokenized using the video encoder, yielding $z_{1}$.
The player then specifies a discrete latent action $a_{1}$ to take by choosing any integer value within $[0,|A|)$.
The dynamics model takes the frame tokens $z_{1}$ and the corresponding latent action $\tilde{a}_{1}$, which is obtained by indexing into the VQ codebook with the discrete input $a_{1}$, to predict the next frame tokens $z_{2}$. This process is repeated to generate the rest of the sequence $\hat{\boldsymbol{z}}_{2: T}$ in an autoregressive manner as actions continue to be passed to the model, while the tokens are decoded into video frames $\hat{\boldsymbol{x}}_{2: T}$ with the tokenizer's decoder.

Note that ground-truth videos from the dataset can be regenerated by passing the model the starting frame and the actions inferred from the video, and completely new videos (or trajectories) can be generated by changing the actions.
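
The inference loop above can be sketched as a simple autoregressive rollout. Everything model-specific is replaced by a stand-in: `toy_dynamics` is a hypothetical placeholder for the MaskGIT dynamics model, the codebook maps each action index to a toy integer "embedding", and the tokenizer's encoder and decoder are omitted. Only the control flow and the choice of $|A| = 8$ follow the text.

```python
NUM_ACTIONS = 8  # |A| = 8, as stated for the paper's experiments


def toy_dynamics(frames, latents):
    """Placeholder dynamics: shift the last frame's tokens by the last latent."""
    return [(tok + latents[-1]) % 16 for tok in frames[-1]]


def rollout(z1, actions, codebook, dynamics):
    """Generate frame tokens autoregressively from a start frame and actions."""
    frames, latents = [list(z1)], []
    for a in actions:
        if not 0 <= a < len(codebook):
            raise ValueError("action must be an integer in [0, |A|)")
        latents.append(codebook[a])          # index into the VQ codebook
        frames.append(dynamics(frames, latents))
    return frames


codebook = list(range(NUM_ACTIONS))          # toy latent-action embeddings
z1 = [1, 2, 3]                               # tokens of the prompt image
same = rollout(z1, [2, 0, 5], codebook, toy_dynamics)
other = rollout(z1, [2, 1, 5], codebook, toy_dynamics)
print(same[-1], other[-1])
```

Replaying the same action sequence from the same starting frame reproduces the same trajectory, while changing a single action yields a different one, which mirrors the regenerate-or-modify behaviour described above.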