
news 2025/7/11 10:47:58


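Abstract

1. Paper Overview

Background: Vision Transformers have shown great potential in computer vision. They capture long-range dependencies and are highly parallel, which benefits the training and inference of large models.
Existing problem: Although many studies have designed efficient attention patterns, the queries do not originate from the key-value pairs of semantic regions, and forcing all queries to attend to an insufficient set of tokens may not produce optimal results. Bi-level routing attention does process queries with semantic key-value pairs, but it may still fall short of the optimum in some cases.
Goal: Propose DeBiFormer, a vision Transformer with Deformable Bi-level Routing Attention (DBRA), which aims to optimize query-key-value interaction and adaptively select semantically relevant regions.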

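2. Contributions

Deformable Bi-level Routing Attention (DBRA): an attention-in-attention architecture that uses deformable points and a bi-level routing mechanism to allocate attention more efficiently and meaningfully.
Deformable-point-aware region partition: ensures that each deformable point interacts with only a small subset of key-value pairs, balancing attention between important and less important regions.
Region-to-region routing: builds a directed graph to establish attention relationships, and uses a topk operator and a routing index matrix to keep only the topk connections of each region.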
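The region-to-region routing step can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the official DeBiFormer code: the function name `topk_region_routing` and the toy tensor layout are hypothetical, and region-level queries and keys are simply taken as the mean of the token features in each region.

```python
import torch

def topk_region_routing(q, k, topk):
    """q, k: (batch, regions, tokens_per_region, channels) tensors (assumed layout).

    Returns a routing index matrix of shape (batch, regions, topk): for each
    region, the indices of the topk regions it will attend to.
    """
    q_region = q.mean(dim=2)                           # (b, r, c) region-level queries
    k_region = k.mean(dim=2)                           # (b, r, c) region-level keys
    affinity = q_region @ k_region.transpose(-2, -1)   # (b, r, r) edge weights of the directed graph
    _, routing_idx = torch.topk(affinity, k=topk, dim=-1)
    return routing_idx

torch.manual_seed(0)
q = torch.randn(2, 49, 64, 32)   # 7x7 regions, 64 tokens each, 32 channels
k = torch.randn(2, 49, 64, 32)
idx = topk_region_routing(q, k, topk=4)
print(idx.shape)
```

Each row of `idx` lists the regions whose key-value pairs would be gathered for fine-grained token-level attention; all other regions are skipped, which is where the savings come from.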

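3. Method

Deformable attention module: contains an offset network that generates offsets for reference points to create deformable points. These points move toward important regions flexibly and efficiently and capture more informative features.
Bi-level token to deformable-level token attention: using the region routing matrix, attention is applied for each deformable query token within a region over all key-value pairs located in the topk routed regions.
DeBiFormer architecture: a four-stage pyramid structure consisting of overlapped patch embedding, patch merging modules, and DeBiFormer blocks. The embedding and merging modules reduce the input spatial resolution and increase the number of channels, while the blocks perform cross-location relation modeling and per-location embedding.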
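To make the idea of deformable points more concrete, here is a minimal sketch of how reference points can be shifted by offsets and used to sample features. It is an illustration only: the function name `sample_deformable_points` is hypothetical, the offsets are passed in directly rather than produced by an offset network, and bilinear sampling through `F.grid_sample` is an assumption about how such sampling is commonly done.

```python
import torch
import torch.nn.functional as F

def sample_deformable_points(feat, offsets):
    """feat: (B, C, H, W); offsets: (B, Hs, Ws, 2) in normalized [-1, 1] units, (x, y) order.

    Builds a uniform reference grid, shifts it by the offsets, and bilinearly
    samples feat at the shifted (deformable) points -> (B, C, Hs, Ws).
    """
    _, hs, ws, _ = offsets.shape
    ys = torch.linspace(-1.0, 1.0, hs)
    xs = torch.linspace(-1.0, 1.0, ws)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    reference = torch.stack((grid_x, grid_y), dim=-1)            # (Hs, Ws, 2), x first
    points = (reference.unsqueeze(0) + offsets).clamp(-1.0, 1.0)  # keep points inside the map
    return F.grid_sample(feat, points, mode="bilinear", align_corners=True)

feat = torch.arange(16.0).view(1, 1, 4, 4)
zero_offsets = torch.zeros(1, 4, 4, 2)
sampled = sample_deformable_points(feat, zero_offsets)
```

With all offsets set to zero the deformable points coincide with the reference grid, so `sampled` reproduces `feat`; non-zero offsets move the sampling locations toward other positions of the feature map.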

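4. Role of Each Module

Deformable Bi-level Routing Attention (DBRA) module: optimizes query-key-value interaction and adaptively selects semantically relevant regions. Through deformable points and bi-level routing, it increases the attention paid to important regions and reduces the attention spent on less important ones.
3x3 depthwise convolution: used at the beginning of each DeBiFormer block to implicitly encode relative position information and strengthen local sensitivity.
2-ConvFFN module: used for per-location embedding, expanding the feature representation capacity of the model.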

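5. Experimental Results

Image classification: models are trained from scratch on ImageNet-1K, which verifies the effectiveness of DeBiFormer.
Semantic segmentation: the pretrained backbone is fine-tuned on ADE20K, where DeBiFormer performs well, demonstrating its strength on dense prediction tasks.
Object detection and instance segmentation: DeBiFormer is used as the backbone in the Mask RCNN and RetinaNet frameworks and evaluated on COCO 2017. Despite limited resources, DeBiFormer outperforms some of the most competitive existing methods on large objects.
Ablation study: verifies the effectiveness of DBRA and of the top-k choices in DeBiFormer, confirming the contribution of deformable bi-level routing attention to model performance.

Summary: DeBiFormer is a new hierarchical vision Transformer designed for image classification and dense prediction. By introducing Deformable Bi-level Routing Attention (DBRA), it optimizes query-key-value interaction and adaptively selects semantically relevant regions, giving more efficient and meaningful attention. The experiments show that DeBiFormer performs well across several computer vision tasks and offers insight into designing flexible, semantics-aware attention mechanisms.

This article uses DeBiFormer for an image classification task. The chosen model is debi_tiny, and it reaches an accuracy (ACC) above 82% on the plant seedlings classification task.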


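After reading this article you should be familiar with the following skills and concepts:

Data augmentation strategies: basic augmentation with PyTorch's transforms library, plus more advanced techniques such as CutOut, MixUp, and CutMix, which can noticeably improve generalization.
Training DeBiFormer: how to build and train DeBiFormer (or other deep learning models) from scratch, covering model definition, data loading, and the training loop.
Mixed-precision training: how to use PyTorch's built-in mixed-precision support to speed up training and reduce memory use.
Gradient clipping: how to apply gradient clipping to prevent exploding gradients and keep training stable.
Data parallel (DP) training: how to use PyTorch's data-parallel support in a multi-GPU environment to speed up training.
Visualizing training: how to plot loss and accuracy curves to monitor learning.
Evaluation and reporting: how to evaluate the model on the validation set and produce a detailed report including ACC and other metrics.
Writing a test script: how to write a script that predicts on the test set and assesses real-world performance.
Learning rate scheduling: how to apply cosine annealing to adjust the learning rate dynamically.
Custom statistics helpers: how to use an AverageMeter class or similar tools to track ACC, loss, and other key metrics during training.
ACC1 and ACC5: what ACC1 (Top-1 accuracy) and ACC5 (Top-5 accuracy) mean in image classification and how they are computed.
Exponential moving average (EMA): how to apply EMA during training to further improve test-set performance.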
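As a small illustration of the AverageMeter and ACC1/ACC5 items above, the following sketch shows one common way to compute top-k accuracy and keep a running average. It assumes PyTorch; the names `AverageMeter` and `accuracy` follow widespread convention, but this is not necessarily the exact code used in this article's train.py. The toy example uses top-2 instead of top-5 because it only has three classes.

```python
import torch

class AverageMeter:
    """Tracks the running average of a scalar such as loss or accuracy."""
    def __init__(self):
        self.sum = 0.0
        self.count = 0

    def update(self, val, n=1):
        self.sum += val * n
        self.count += n

    @property
    def avg(self):
        return self.sum / max(self.count, 1)

def accuracy(output, target, topk=(1,)):
    """Returns the top-k accuracy (in percent) for each k in topk."""
    maxk = max(topk)
    _, pred = output.topk(maxk, dim=1)        # (batch, maxk) predicted class ids
    correct = pred.eq(target.view(-1, 1))     # (batch, maxk) True where a guess hits
    return [correct[:, :k].any(dim=1).float().mean().item() * 100.0 for k in topk]

logits = torch.tensor([[0.1, 0.7, 0.2],
                       [0.8, 0.15, 0.05],
                       [0.2, 0.3, 0.5]])
labels = torch.tensor([1, 1, 0])
acc1, acc2 = accuracy(logits, labels, topk=(1, 2))
meter = AverageMeter()
meter.update(acc1, n=labels.size(0))
```

Here only the first sample is right at top-1, while the second sample's label appears among its two highest scores, so top-2 accuracy is higher than top-1.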

If any of these topics is new to you and hard to follow, I recommend my column "Classic Backbone Networks: Explanation and Practice", which covers all of the points above step by step from the ground up.

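Installing Packages

Install timm

Use pip:

pip install timm

timm is needed for the Mixup augmentation and for EMA.

Install einops:

pip install einops

Data Augmentation: Cutout and Mixup

To improve generalization, I added two augmentation techniques to the preprocessing stage: Cutout and Mixup. Cutout randomly masks part of the image, forcing the model to learn more robust features. Mixup blends two images and their labels into a new training sample, which increases the diversity of the data. The Cutout implementation used here comes from torchtoolbox. Install it with:

pip install torchtoolbox

Cutout is applied inside the transforms pipeline.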

from torchvision import transforms
from torchtoolbox.transform import Cutout

# Data preprocessing
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    Cutout(),
    transforms.ToTensor(),
    transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),
])
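For intuition about what Cutout does, here is a minimal tensor-based sketch. It is not the torchtoolbox implementation: the function name `cutout`, the square mask, and the zero fill value are all assumptions chosen for illustration.

```python
import torch

def cutout(img, size=16, generator=None):
    """Zero out one random square (at most size x size) of a (C, H, W) tensor; returns a copy."""
    _, h, w = img.shape
    cy = int(torch.randint(0, h, (1,), generator=generator))   # mask centre, row
    cx = int(torch.randint(0, w, (1,), generator=generator))   # mask centre, column
    y0, y1 = max(cy - size // 2, 0), min(cy + size // 2, h)    # clip the square to the image
    x0, x1 = max(cx - size // 2, 0), min(cx + size // 2, w)
    out = img.clone()
    out[:, y0:y1, x0:x1] = 0.0
    return out

g = torch.Generator().manual_seed(0)
image = torch.ones(3, 224, 224)
masked = cutout(image, size=32, generator=g)
```

The masked square is clipped at the image border, so a mask centred near an edge hides fewer pixels than one in the middle.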

Import the required classes: from timm.data.mixup import Mixup and from timm.loss import SoftTargetCrossEntropy.

Define Mixup and SoftTargetCrossEntropy:

mixup_fn = Mixup(
    mixup_alpha=0.8, cutmix_alpha=1.0, cutmix_minmax=None,
    prob=0.1, switch_prob=0.5, mode='batch',
    label_smoothing=0.1, num_classes=12)
criterion_train = SoftTargetCrossEntropy()

Mixup is a data augmentation technique commonly used in image classification. It linearly combines two images and their labels to produce a new image and a new (soft) label.
Parameters:

mixup_alpha (float): mixup alpha value; mixup is active if > 0.

cutmix_alpha (float): cutmix alpha value; cutmix is active if > 0.

cutmix_minmax (List[float]): cutmix min/max image ratio; if not None, cutmix is active and this range is used instead of alpha.

If cutmix_minmax is set, cutmix_alpha defaults to 1.0.

prob (float): probability of applying mixup or cutmix per batch or per element.

switch_prob (float): probability of switching to cutmix instead of mixup when both are active.

mode (str): how to apply the mixup/cutmix parameters: per 'batch', 'pair' (pair of elements), or 'elem' (element).

correct_lam (bool): apply lambda correction when the cutmix bbox is clipped by the image borders.

label_smoothing (float): label smoothing applied to the mixed target tensor.

num_classes (int): number of target classes.
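The linear combination that Mixup performs can be sketched in a few lines. This is a simplified illustration and not timm's `Mixup` class: it omits cutmix, label smoothing, and the `prob`/`switch_prob` logic, and the helper names `mixup_batch` and `soft_target_cross_entropy` are hypothetical. Pairing each sample with the reversed batch is an assumption made to keep the sketch short.

```python
import torch
import torch.nn.functional as F

def mixup_batch(x, y, num_classes, alpha=0.8):
    """Blend each sample with its mirror in the batch; returns mixed images and soft labels."""
    lam = float(torch.distributions.Beta(alpha, alpha).sample())  # mixing ratio in [0, 1]
    y_onehot = F.one_hot(y, num_classes).float()
    x_mixed = lam * x + (1.0 - lam) * x.flip(0)
    y_mixed = lam * y_onehot + (1.0 - lam) * y_onehot.flip(0)
    return x_mixed, y_mixed

def soft_target_cross_entropy(logits, soft_targets):
    """Cross-entropy against a probability distribution instead of a class index."""
    return torch.sum(-soft_targets * F.log_softmax(logits, dim=-1), dim=-1).mean()

torch.manual_seed(0)
images = torch.randn(4, 3, 8, 8)
labels = torch.tensor([0, 3, 7, 11])
mixed_images, mixed_labels = mixup_batch(images, labels, num_classes=12)
loss = soft_target_cross_entropy(torch.randn(4, 12), mixed_labels)
```

Because the mixed labels are no longer one-hot, an index-based loss such as `nn.CrossEntropyLoss` with integer targets does not apply directly, which is why a soft-target loss is used for training.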

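EMA

EMA (Exponential Moving Average) is a technique used in deep learning to smooth model parameters: it keeps an exponentially weighted moving average of the parameters over the course of training. This tends to improve stability and generalization, especially late in training. The main points are summarized below.

EMA Overview

EMA is a weighted moving average in which each new average is a weighted sum of the previous average and the current value. In deep learning it is applied to the model parameters to damp their rapid fluctuations during training, giving smoother and more stable model behavior.

How It Works

During training, a separate copy of EMA parameters is kept alongside the current model parameters. At every training step (or every few steps), the EMA parameters are updated from the current model parameters with exponential decay. The update rule is usually:

$\text{EMA}_{\text{new}} = \text{decay} \times \text{EMA}_{\text{old}} + (1 - \text{decay}) \times \text{model\_parameters}$
Here decay is a hyperparameter between 0 and 1 that controls the weighting between the old EMA value and the new model parameters. A larger decay makes the update rely more on the old value, so the smoothing is stronger.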
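A tiny numeric example may help make the update rule concrete. The sketch below applies the formula to a single scalar rather than to a model; the function name `ema_update` and the choice of decay=0.9 are illustrative assumptions (real training typically uses values much closer to 1, such as the 0.9999 default seen later).

```python
def ema_update(ema_value, new_value, decay):
    """One EMA step: decay * old + (1 - decay) * new."""
    return decay * ema_value + (1.0 - decay) * new_value

ema = 0.0
history = []
for step_value in [1.0, 1.0, 1.0, 1.0]:
    ema = ema_update(ema, step_value, decay=0.9)
    history.append(ema)
print(history)
```

Starting from 0 and repeatedly observing the value 1, the EMA after n steps equals 1 - 0.9^n, so it approaches the target slowly. This is the smoothing effect, and it also explains why a freshly initialized EMA lags behind the live model early in training.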

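Advantages

Stability: by smoothing parameter updates, EMA reduces fluctuations during training and makes the model more stable.
Generalization: the EMA parameters are a smoothed version of the parameter history and tend to capture the overall trend of training, so evaluating with EMA parameters often generalizes better.
Convergence: EMA does not directly speed up training, but by stabilizing the parameters it may indirectly help the model reach a better solution sooner.

Use Cases

EMA is widely used in deep learning, particularly in tasks that need high stability and good generalization, such as image classification and object detection. It is especially useful when training large models, where it can help reduce the risk of overfitting and improve performance on unseen data.

The implementation is as follows: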


import logging
from collections import OrderedDict
from copy import deepcopy

import torch
import torch.nn as nn

_logger = logging.getLogger(__name__)


class ModelEma:
    def __init__(self, model, decay=0.9999, device='', resume=''):
        # make a copy of the model for accumulating moving average of weights
        self.ema = deepcopy(model)
        self.ema.eval()
        self.decay = decay
        self.device = device  # perform ema on different device from model if set
        if device:
            self.ema.to(device=device)
        self.ema_has_module = hasattr(self.ema, 'module')
        if resume:
            self._load_checkpoint(resume)
        for p in self.ema.parameters():
            p.requires_grad_(False)

    def _load_checkpoint(self, checkpoint_path):
        checkpoint = torch.load(checkpoint_path, map_location='cpu')
        assert isinstance(checkpoint, dict)
        if 'state_dict_ema' in checkpoint:
            new_state_dict = OrderedDict()
            for k, v in checkpoint['state_dict_ema'].items():
                # ema model may have been wrapped by DataParallel, and need module prefix
                if self.ema_has_module:
                    name = 'module.' + k if not k.startswith('module') else k
                else:
                    name = k
                new_state_dict[name] = v
            self.ema.load_state_dict(new_state_dict)
            _logger.info("Loaded state_dict_ema")
        else:
            _logger.warning("Failed to find state_dict_ema, starting from loaded model weights")

    def update(self, model):
        # correct a mismatch in state dict keys
        needs_module = hasattr(model, 'module') and not self.ema_has_module
        with torch.no_grad():
            msd = model.state_dict()
            for k, ema_v in self.ema.state_dict().items():
                if needs_module:
                    k = 'module.' + k
                model_v = msd[k].detach()
                if self.device:
                    model_v = model_v.to(device=self.device)
                ema_v.copy_(ema_v * self.decay + (1. - self.decay) * model_v)

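Add it to the training code. The following is a fragment showing where each call goes, not a complete script:

# Initialization
if use_ema:
    model_ema = ModelEma(
        model_ft,
        decay=model_ema_decay,
        device='cpu',
        resume=resume)

# During training, update the shadow weights right after each optimizer step
def train():
    optimizer.step()
    if model_ema is not None:
        model_ema.update(model)

# Pass the EMA model to the validation function
val(model_ema.ema, DEVICE, test_loader)

Note that for models trained without pretrained weights, EMA often fails to improve accuracy, so keep this in mind.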

Project Structure

DeBiFormer_Demo
├─data1
│  ├─Black-grass
│  ├─Charlock
│  ├─Cleavers
│  ├─Common Chickweed
│  ├─Common wheat
│  ├─Fat Hen
│  ├─Loose Silky-bent
│  ├─Maize
│  ├─Scentless Mayweed
│  ├─Shepherds Purse
│  ├─Small-flowered Cranesbill
│  └─Sugar beet
├─models
│  └─debiformer.py
├─mean_std.py
├─makedata.py
├─train.py
└─test.py

mean_std.py: computes the mean and std values.
makedata.py: generates the dataset.
train.py: trains the DeBiFormer model found in the models folder.
models: taken from the official code.

Computing mean and std

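In deep learning, and with image data in particular, computing the mean and standard deviation (std) of the data and normalizing with them is one of the key steps for faster convergence and better performance. This section explains both concepts and how they help the model learn.

Mean

The mean is the sum of all values divided by the number of values. For images, the mean is usually computed separately for each color channel (for example, the three channels of an RGB image). If the dataset contains many images, we compute the mean of all pixel values in the R channel across all images, and likewise for the G and B channels.

Standard Deviation (Std)

The standard deviation measures how spread out the data is, that is, how far data points deviate from the mean. It is also computed per color channel. A channel with a large standard deviation has widely varying pixel values, while a channel with a small one is comparatively stable.

Normalization

Normalization rescales data into a small, well-defined range, typically [0, 1] or [-1, 1]. For images, we usually normalize with the computed mean and standard deviation, which gives each channel roughly zero mean and unit variance rather than a strictly bounded range:

$\text{Normalized Value} = \frac{\text{Original Value} - \text{Mean}}{\text{Std}}$

Note that in some cases, to simplify computation and keep the data non-negative, the data is instead scaled to [0, 1] using min-max normalization rather than mean/std normalization. Here we focus on normalization based on mean and standard deviation, because it preserves the shape of the data distribution.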
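The formula above can be checked numerically on a toy batch. This is a minimal sketch assuming a (N, C, H, W) tensor with values in [0, 1], as produced by `ToTensor`; the helper name `normalize` is hypothetical and simply mirrors what `transforms.Normalize` does per channel.

```python
import torch

def normalize(images, mean, std):
    """Per-channel standardization of a (N, C, H, W) batch: (x - mean) / std."""
    mean = torch.as_tensor(mean, dtype=images.dtype).view(1, -1, 1, 1)
    std = torch.as_tensor(std, dtype=images.dtype).view(1, -1, 1, 1)
    return (images - mean) / std

torch.manual_seed(0)
batch = torch.rand(8, 3, 32, 32)                    # random stand-in for real images
mean = batch.mean(dim=(0, 2, 3))                    # one mean per channel
std = batch.std(dim=(0, 2, 3), unbiased=False)      # one std per channel
normalized = normalize(batch, mean, std)
```

After this step every channel has approximately zero mean and unit standard deviation, but individual values can fall outside [-1, 1], which is the difference from min-max scaling.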

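Why normalize?

Faster convergence: normalized data has similar scales, which helps gradient descent find a good solution faster. Gradient updates for different features are on the same order of magnitude, which avoids slow training or vanishing/exploding gradients caused by features that are too large or too small.
Better accuracy: normalization can improve generalization, because the model can more easily learn the relative relationships between features instead of being dominated by their absolute magnitude.
Stability: normalized data reduces fluctuations during training and helps the model converge more steadily.

How to compute and use mean and std

Compute a global mean and std: compute them over the whole dataset, usually before training starts, and use the same values to normalize the training, validation, and test sets.
Use library functions: deep learning frameworks such as PyTorch and TensorFlow provide tensor operations for computing mean and std, as well as transforms that apply them to a dataset.
Dynamic updates: when the dataset is very large or keeps changing, the statistics may need to be computed dynamically, typically by updating them with a moving average (such as EMA) during training.

Normalizing with the data's mean and std is a basic but important preprocessing step that helps convergence speed, performance, and stability. Create mean_std.py and add the following code: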

from torchvision.datasets import ImageFolder
import torch
from torchvision import transforms


def get_mean_and_std(train_data):
    train_loader = torch.utils.data.DataLoader(
        train_data, batch_size=1, shuffle=False, num_workers=0,
        pin_memory=True)
    mean = torch.zeros(3)
    std = torch.zeros(3)
    # Accumulate per-image channel statistics, then average over all images
    for X, _ in train_loader:
        for d in range(3):
            mean[d] += X[:, d, :, :].mean()
            std[d] += X[:, d, :, :].std()
    mean.div_(len(train_data))
    std.div_(len(train_data))
    return list(mean.numpy()), list(std.numpy())


if __name__ == '__main__':
    train_dataset = ImageFolder(root=r'data1', transform=transforms.ToTensor())
    print(get_mean_and_std(train_dataset))

Output:

([0.3281186, 0.28937867, 0.20702125], [0.09407319, 0.09732835, 0.106712654])

Write this result down; it will be needed later.
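One detail worth knowing: the script above averages the standard deviation of each image, which is an approximation of the dataset-wide standard deviation and does not capture brightness differences between images. If an exact value is wanted, a possible alternative is to accumulate sums and squared sums over all pixels. The sketch below assumes same-sized (N, C, H, W) batches; the function name `dataset_mean_std` is hypothetical.

```python
import torch

def dataset_mean_std(batches):
    """Exact per-channel mean/std over all pixels of an iterable of (N, C, H, W) batches."""
    count = 0
    channel_sum = None
    channel_sq_sum = None
    for x in batches:
        x = x.double()                                   # extra precision for the running sums
        s = x.sum(dim=(0, 2, 3))
        sq = (x * x).sum(dim=(0, 2, 3))
        channel_sum = s if channel_sum is None else channel_sum + s
        channel_sq_sum = sq if channel_sq_sum is None else channel_sq_sum + sq
        count += x.shape[0] * x.shape[2] * x.shape[3]    # pixels per channel seen so far
    mean = channel_sum / count
    std = (channel_sq_sum / count - mean ** 2).clamp(min=0).sqrt()   # sqrt(E[x^2] - E[x]^2)
    return mean, std

torch.manual_seed(0)
data = [torch.rand(4, 3, 16, 16) for _ in range(5)]
mean, std = dataset_mean_std(data)
```

For images of different sizes the same idea works with batch_size=1, since the pixel count is tracked explicitly. Whether the difference matters in practice depends on the dataset; the values printed above are the ones used in the rest of this article.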

Generating the Dataset

The image classification dataset we have prepared is organized like this:

data
├─Black-grass
├─Charlock
├─Cleavers
├─Common Chickweed
├─Common wheat
├─Fat Hen
├─Loose Silky-bent
├─Maize
├─Scentless Mayweed
├─Shepherds Purse
├─Small-flowered Cranesbill
└─Sugar beet

The default loading convention in PyTorch and Keras follows the ImageNet dataset layout, with one folder per split and one subfolder per class:

├─data
│  ├─val
│  │   ├─Black-grass
│  │   ├─Charlock
│  │   ├─Cleavers
│  │   ├─Common Chickweed
│  │   ├─Common wheat
│  │   ├─Fat Hen
│  │   ├─Loose Silky-bent
│  │   ├─Maize
│  │   ├─Scentless Mayweed
│  │   ├─Shepherds Purse
│  │   ├─Small-flowered Cranesbill
│  │   └─Sugar beet
│  └─train
│      ├─Black-grass
│      ├─Charlock
│      ├─Cleavers
│      ├─Common Chickweed
│      ├─Common wheat
│      ├─Fat Hen
│      ├─Loose Silky-bent
│      ├─Maize
│      ├─Scentless Mayweed
│      ├─Shepherds Purse
│      ├─Small-flowered Cranesbill
│      └─Sugar beet

Create the format conversion script makedata.py and add the following code:

import glob
import os
import shutil

from sklearn.model_selection import train_test_split

image_list = glob.glob('data1/*/*.png')
print(image_list)
file_dir = 'data'
if os.path.exists(file_dir):
    print('true')
    # os.rmdir(file_dir)
    shutil.rmtree(file_dir)  # delete, then recreate
    os.makedirs(file_dir)
else:
    os.makedirs(file_dir)

trainval_files, val_files = train_test_split(image_list, test_size=0.3, random_state=42)
train_dir = 'train'
val_dir = 'val'
train_root = os.path.join(file_dir, train_dir)
val_root = os.path.join(file_dir, val_dir)
for file in trainval_files:
    file_class = file.replace("\\", "/").split('/')[-2]
    file_name = file.replace("\\", "/").split('/')[-1]
    file_class = os.path.join(train_root, file_class)
    if not os.path.isdir(file_class):
        os.makedirs(file_class)
    shutil.copy(file, file_class + '/' + file_name)

for file in val_files:
    file_class = file.replace("\\", "/").split('/')[-2]
    file_name = file.replace("\\", "/").split('/')[-1]
    file_class = os.path.join(val_root, file_class)
    if not os.path.isdir(file_class):
        os.makedirs(file_class)
    shutil.copy(file, file_class + '/' + file_name)
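The script above splits the full file list at random, so class proportions in train and val can drift slightly, especially for small classes. If that matters, a per-class (stratified) split is one option. The sketch below is a pure-Python illustration with a hypothetical helper `stratified_split`; scikit-learn's `train_test_split` also accepts a `stratify` argument that serves the same purpose.

```python
import random
from collections import defaultdict

def stratified_split(paths, val_ratio=0.3, seed=42):
    """Split file paths into train/val lists, applying val_ratio inside each class folder."""
    by_class = defaultdict(list)
    for p in paths:
        class_name = p.replace("\\", "/").split("/")[-2]   # parent folder is the label
        by_class[class_name].append(p)
    rng = random.Random(seed)
    train, val = [], []
    for class_name in sorted(by_class):
        files = sorted(by_class[class_name])
        rng.shuffle(files)
        n_val = max(1, round(len(files) * val_ratio))       # at least one val image per class
        val.extend(files[:n_val])
        train.extend(files[n_val:])
    return train, val

paths = [f"data1/{c}/{i}.png" for c in ("Maize", "Charlock") for i in range(10)]
train_files, val_files = stratified_split(paths)
```

With ten images per class and a ratio of 0.3, each class contributes three validation images and seven training images.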

Once the steps above are done, training and testing can begin.

DeBiFormer Code

import gc
import numbers
from collections import OrderedDict, defaultdict
from typing import Tuple

import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
from einops import rearrange
from einops.layers.torch import Rearrange
from fairscale.nn.checkpoint import checkpoint_wrapper
from timm.models import register_model
from timm.models.layers import DropPath, to_2tuple, trunc_normal_
from timm.models.vision_transformer import _cfg
from torch import Tensor
from torchvision import datasets, transforms


class LayerNorm2d(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ln = nn.LayerNorm(channels)

    def forward(self, x):
        x = rearrange(x, "N C H W -> N H W C")
        x = self.ln(x)
        x = rearrange(x, "N H W C -> N C H W")
        return x


def init_linear(m):
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(m.weight)
        if m.bias is not None: nn.init.zeros_(m.bias)
    elif isinstance(m, nn.LayerNorm):
        nn.init.constant_(m.bias, 0)
        nn.init.constant_(m.weight, 1.0)


def to_4d(x,h,w):
    return rearrange(x, 'b (h w) c -> b c h w',h=h,w=w)

#def to_4d(x,s,h,w):
#    return rearrange(x, 'b (s h w) c -> b c s h w',s=s,h=h,w=w)

def to_3d(x):
    return rearrange(x, 'b c h w -> b (h w) c')

#def to_3d(x):
#    return rearrange(x, 'b c s h w -> b (s h w) c')


class Partial:
    def __init__(self, module, *args, **kwargs):
        self.module = module
        self.args = args
        self.kwargs = kwargs

    def __call__(self, *args_c, **kwargs_c):
        return self.module(*args_c, *self.args, **kwargs_c, **self.kwargs)


class LayerNormChannels(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):
        x = x.transpose(1, -1)
        x = self.norm(x)
        x = x.transpose(-1, 1)
        return x


class LayerNormProxy(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        x = rearrange(x, 'b c h w -> b h w c')
        x = self.norm(x)
        return rearrange(x, 'b h w c -> b c h w')


class BiasFree_LayerNorm(nn.Module):
    def __init__(self, normalized_shape):
        super(BiasFree_LayerNorm, self).__init__()
        if isinstance(normalized_shape, numbers.Integral):
            normalized_shape = (normalized_shape,)
        normalized_shape = torch.Size(normalized_shape)

        assert len(normalized_shape) == 1

        self.weight = nn.Parameter(torch.ones(normalized_shape))
        self.normalized_shape = normalized_shape

    def forward(self, x):
        sigma = x.var(-1, keepdim=True, unbiased=False)
        return x / torch.sqrt(sigma+1e-5) * self.weight


class WithBias_LayerNorm(nn.Module):
    def __init__(self, normalized_shape):
        super(WithBias_LayerNorm, self).__init__()
        if isinstance(normalized_shape, numbers.Integral):
            normalized_shape = (normalized_shape,)
        normalized_shape = torch.Size(normalized_shape)

        assert len(normalized_shape) == 1

        self.weight = nn.Parameter(torch.ones(normalized_shape))
        self.bias = nn.Parameter(torch.zeros(normalized_shape))
        self.normalized_shape = normalized_shape

    def forward(self, x):
        mu = x.mean(-1, keepdim=True)
        sigma = x.var(-1, keepdim=True, unbiased=False)
        return (x - mu) / torch.sqrt(sigma+1e-5) * self.weight + self.bias


class LayerNorm(nn.Module):
    def __init__(self, dim, LayerNorm_type):
        super(LayerNorm, self).__init__()
        if LayerNorm_type =='BiasFree':
            self.body = BiasFree_LayerNorm(dim)
        else:
            self.body = WithBias_LayerNorm(dim)

    def forward(self, x):
        h, w = x.shape[-2:]
        return to_4d(self.body(to_3d(x)), h, w)

#class LayerNorm(nn.Module):
#    def __init__(self, dim, LayerNorm_type):
#        super(LayerNorm, self).__init__()
#        if LayerNorm_type =='BiasFree':
#            self.body = BiasFree_LayerNorm(dim)
#        else:
#            self.body = WithBias_LayerNorm(dim)
#    def forward(self, x):
#        s, h, w = x.shape[-3:]
#        return to_4d(self.body(to_3d(x)),s, h, w)


class DWConv(nn.Module):
    def __init__(self, dim=768):
        super(DWConv, self).__init__()
        self.dwconv = nn.Conv2d(dim, dim, 3, 1, 1, bias=True, groups=dim)

    def forward(self, x):
        """
        x: NHWC tensor
        """
        x = x.permute(0, 3, 1, 2) #NCHW
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1) #NHWC
        return x


class ConvFFN(nn.Module):
    def __init__(self, dim=768):
        # Fixed: the original called super(DWConv, self).__init__(), which raises a TypeError here
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, 1, 1, 0)

    def forward(self, x):
        """
        x: NHWC tensor
        """
        x = x.permute(0, 3, 1, 2) #NCHW
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1) #NHWC
        return x


class Attention(nn.Module):
    """
    vanilla attention
    """
    def __init__(self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0.):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        # NOTE scale factor was wrong in my original version, can set manually to be compat with prev weights
        self.scale = qk_scale or head_dim ** -0.5

        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)

    def forward(self, x):
        """
        args:
            x: NHWC tensor
        return:
            NHWC tensor
        """
        _, H, W, _ = x.size()
        x = rearrange(x, 'n h w c -> n (h w) c')

        #######################################
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]   # make torchscript happy (cannot use tensor as tuple)

        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)

        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        x = self.proj(x)
        x = self.proj_drop(x)
        #######################################

        x = rearrange(x, 'n (h w) c -> n h w c', h=H, w=W)
        return x


class AttentionLePE(nn.Module):
    """
    vanilla attention
    """
    def __init__(self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0., side_dwconv=5):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        # NOTE scale factor was wrong in my original version, can set manually to be compat with prev weights
        self.scale = qk_scale or head_dim ** -0.5

        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)
        self.lepe = nn.Conv2d(dim, dim, kernel_size=side_dwconv, stride=1, padding=side_dwconv//2, groups=dim) if side_dwconv > 0 else \
                    lambda x: torch.zeros_like(x)

    def forward(self, x):
        """
        args:
            x: NHWC tensor
        return:
            NHWC tensor
        """
        _, H, W, _ = x.size()
        x = rearrange(x, 'n h w c -> n (h w) c')

        #######################################
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]   # make torchscript happy (cannot use tensor as tuple)

        lepe = self.lepe(rearrange(x, 'n (h w) c -> n c h w', h=H, w=W))
        lepe = rearrange(lepe, 'n c h w -> n (h w) c')

        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)

        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        x = x + lepe

        x = self.proj(x)
        x = self.proj_drop(x)
        #######################################

        x = rearrange(x, 'n (h w) c -> n h w c', h=H, w=W)
        return x


class nchwAttentionLePE(nn.Module):
    """
    Attention with LePE, takes nchw input
    """
    def __init__(self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0., side_dwconv=5):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = qk_scale or self.head_dim ** -0.5

        self.qkv = nn.Conv2d(dim, dim*3, kernel_size=1, bias=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)
        self.proj_drop = nn.Dropout(proj_drop)
        self.lepe = nn.Conv2d(dim, dim, kernel_size=side_dwconv, stride=1, padding=side_dwconv//2, groups=dim) if side_dwconv > 0 else \
                    lambda x: torch.zeros_like(x)

    def forward(self, x:torch.Tensor):
        """
        args:
            x: NCHW tensor
        return:
            NCHW tensor
        """
        B, C, H, W = x.size()
        q, k, v = self.qkv.forward(x).chunk(3, dim=1) # B, C, H, W

        attn = q.view(B, self.num_heads, self.head_dim, H*W).transpose(-1, -2) @ \
               k.view(B, self.num_heads, self.head_dim, H*W)
        attn = torch.softmax(attn*self.scale, dim=-1)
        attn = self.attn_drop(attn)

        # (B, nhead, HW, HW) @ (B, nhead, HW, head_dim) -> (B, nhead, HW, head_dim)
        output:torch.Tensor = attn @ v.view(B, self.num_heads, self.head_dim, H*W).transpose(-1, -2)
        output = output.permute(0, 1, 3, 2).reshape(B, C, H, W)
        output = output + self.lepe(v)

        output = self.proj_drop(self.proj(output))

        return output


class TopkRouting(nn.Module):
    """
    differentiable topk routing with scaling
    Args:
        qk_dim: int, feature dimension of query and key
        topk: int, the 'topk'
        qk_scale: int or None, temperature (multiply) of softmax activation
        with_param: bool, whether to incorporate learnable params in routing unit
        diff_routing: bool, whether to make routing differentiable
        soft_routing: bool, whether to multiply the output value by routing weights
    """
    def __init__(self, qk_dim, topk=4, qk_scale=None, param_routing=False, diff_routing=False):
        super().__init__()
        self.topk = topk
        self.qk_dim = qk_dim
        self.scale = qk_scale or qk_dim ** -0.5
        self.diff_routing = diff_routing
        # TODO: norm layer before/after linear?
        self.emb = nn.Linear(qk_dim, qk_dim) if param_routing else nn.Identity()
        # routing activation
        self.routing_act = nn.Softmax(dim=-1)

    def forward(self, query:Tensor, key:Tensor)->Tuple[Tensor]:
        """
        Args:
            q, k: (n, p^2, c) tensor
        Return:
            r_weight, topk_index: (n, p^2, topk) tensor
        """
        if not self.diff_routing:
            query, key = query.detach(), key.detach()
        query_hat, key_hat = self.emb(query), self.emb(key) # per-window pooling -> (n, p^2, c)
        attn_logit = (query_hat*self.scale) @ key_hat.transpose(-2, -1) # (n, p^2, p^2)
        topk_attn_logit, topk_index = torch.topk(attn_logit, k=self.topk, dim=-1) # (n, p^2, k), (n, p^2, k)
        r_weight = self.routing_act(topk_attn_logit) # (n, p^2, k)

        return r_weight, topk_index


class KVGather(nn.Module):
    def __init__(self, mul_weight='none'):
        super().__init__()
        assert mul_weight in ['none', 'soft', 'hard']
        self.mul_weight = mul_weight

    def forward(self, r_idx:Tensor, r_weight:Tensor, kv:Tensor):
        """
        r_idx: (n, p^2, topk) tensor
        r_weight: (n, p^2, topk) tensor
        kv: (n, p^2, w^2, c_kq+c_v)
        Return:
            (n, p^2, topk, w^2, c_kq+c_v) tensor
        """
        # select kv according to routing index
        n, p2, w2, c_kv = kv.size()
        topk = r_idx.size(-1)
        # print(r_idx.size(), r_weight.size())
        # FIXME: gather consumes much memory (topk times redundancy), write cuda kernel?
        topk_kv = torch.gather(kv.view(n, 1, p2, w2, c_kv).expand(-1, p2, -1, -1, -1), # (n, p^2, p^2, w^2, c_kv) without mem cpy
                               dim=2,
                               index=r_idx.view(n, p2, topk, 1, 1).expand(-1, -1, -1, w2, c_kv) # (n, p^2, k, w^2, c_kv)
                               )

        if self.mul_weight == 'soft':
            topk_kv = r_weight.view(n, p2, topk, 1, 1) * topk_kv # (n, p^2, k, w^2, c_kv)
        elif self.mul_weight == 'hard':
            raise NotImplementedError('differentiable hard routing TBA')
        # else: #'none'
        #     topk_kv = topk_kv # do nothing

        return topk_kv


class QKVLinear(nn.Module):
    def __init__(self, dim, qk_dim, bias=True):
        super().__init__()
        self.dim = dim
        self.qk_dim = qk_dim
        self.qkv = nn.Linear(dim, qk_dim + qk_dim + dim, bias=bias)

    def forward(self, x):
        q, kv = self.qkv(x).split([self.qk_dim, self.qk_dim+self.dim], dim=-1)
        return q, kv
        # q, k, v = self.qkv(x).split([self.qk_dim, self.qk_dim, self.dim], dim=-1)
        # return q, k, v


class QKVConv(nn.Module):
    def __init__(self, dim, qk_dim, bias=True):
        super().__init__()
        self.dim = dim
        self.qk_dim = qk_dim
        self.qkv = nn.Conv2d(dim,  qk_dim + qk_dim + dim, 1, 1, 0)

    def forward(self, x):
        q, kv = self.qkv(x).split([self.qk_dim, self.qk_dim+self.dim], dim=1)
        return q, kv


class BiLevelRoutingAttention(nn.Module):
    """
    n_win: number of windows in one side (so the actual number of windows is n_win*n_win)
    kv_per_win: for kv_downsample_mode='ada_xxxpool' only, number of key/values per window. Similar to n_win, the actual number is kv_per_win*kv_per_win.
    topk: topk for window filtering
    param_attention: 'qkvo'-linear for q,k,v and o, 'none': param free attention
    param_routing: extra linear for routing
    diff_routing: whether to set routing differentiable
    soft_routing: whether to multiply soft routing weights
    """
    def __init__(self, dim, num_heads=8, n_win=7, qk_dim=None, qk_scale=None,
                 kv_per_win=4, kv_downsample_ratio=4, kv_downsample_kernel=None, kv_downsample_mode='identity',
                 topk=4, param_attention="qkvo", param_routing=False, diff_routing=False, soft_routing=False, side_dwconv=3,
                 auto_pad=False):
        super().__init__()
        # local attention setting
        self.dim = dim
        self.n_win = n_win  # Wh, Ww
        self.num_heads = num_heads
        self.qk_dim = qk_dim or dim
        assert self.qk_dim % num_heads == 0 and self.dim % num_heads==0, 'qk_dim and dim must be divisible by num_heads!'
        self.scale = qk_scale or self.qk_dim ** -0.5

        ################side_dwconv (i.e. LCE in ShuntedTransformer)###########
        self.lepe = nn.Conv2d(dim, dim, kernel_size=side_dwconv, stride=1, padding=side_dwconv//2, groups=dim) if side_dwconv > 0 else \
                    lambda x: torch.zeros_like(x)

        ################ global routing setting #################
        self.topk = topk
        self.param_routing = param_routing
        self.diff_routing = diff_routing
        self.soft_routing = soft_routing
        # router
        assert not (self.param_routing and not self.diff_routing) # cannot be with_param=True and diff_routing=False
        self.router = TopkRouting(qk_dim=self.qk_dim,
                                  qk_scale=self.scale,
                                  topk=self.topk,
                                  diff_routing=self.diff_routing,
                                  param_routing=self.param_routing)
        if self.soft_routing: # soft routing, always differentiable (if no detach)
            mul_weight = 'soft'
        elif self.diff_routing: # hard differentiable routing
            mul_weight = 'hard'
        else:  # hard non-differentiable routing
            mul_weight = 'none'
        self.kv_gather = KVGather(mul_weight=mul_weight)

        # qkv mapping (shared by both global routing and local attention)
        self.param_attention = param_attention
        if self.param_attention == 'qkvo':
            self.qkv = QKVLinear(self.dim, self.qk_dim)
            self.wo = nn.Linear(dim, dim)
        elif self.param_attention == 'qkv':
            self.qkv = QKVLinear(self.dim, self.qk_dim)
            self.wo = nn.Identity()
        else:
            raise ValueError(f'param_attention mode {self.param_attention} is not supported!')

        self.kv_downsample_mode = kv_downsample_mode
        self.kv_per_win = kv_per_win
        self.kv_downsample_ratio = kv_downsample_ratio
        self.kv_downsample_kenel = kv_downsample_kernel
        if self.kv_downsample_mode == 'ada_avgpool':
            assert self.kv_per_win is not None
            self.kv_down = nn.AdaptiveAvgPool2d(self.kv_per_win)
        elif self.kv_downsample_mode == 'ada_maxpool':
            assert self.kv_per_win is not None
            self.kv_down = nn.AdaptiveMaxPool2d(self.kv_per_win)
        elif self.kv_downsample_mode == 'maxpool':
            assert self.kv_downsample_ratio is not None
            self.kv_down = nn.MaxPool2d(self.kv_downsample_ratio) if self.kv_downsample_ratio > 1 else nn.Identity()
        elif self.kv_downsample_mode == 'avgpool':
            assert self.kv_downsample_ratio is not None
            self.kv_down = nn.AvgPool2d(self.kv_downsample_ratio) if self.kv_downsample_ratio > 1 else nn.Identity()
        elif self.kv_downsample_mode == 'identity': # no kv downsampling
            self.kv_down = nn.Identity()
        elif self.kv_downsample_mode == 'fracpool':
            # assert self.kv_downsample_ratio is not None
            # assert self.kv_downsample_kenel is not None
            # TODO: fracpool
            # 1. kernel size should be input size dependent
            # 2. there is a random factor, need to avoid independent sampling for k and v
            raise NotImplementedError('fracpool policy is not implemented yet!')
        elif kv_downsample_mode == 'conv':
            # TODO: need to consider the case where k != v so that need two downsample modules
            raise NotImplementedError('conv policy is not implemented yet!')
        else:
            # Fixed: the original referenced the misspelled attribute self.kv_downsaple_mode
            raise ValueError(f'kv_down_sample_mode {self.kv_downsample_mode} is not supported!')

        # softmax for local attention
        self.attn_act = nn.Softmax(dim=-1)

        self.auto_pad=auto_pad

    def forward(self, x, ret_attn_mask=False):
        """
        x: NHWC tensor
        Return:
            NHWC tensor
        """
        # NOTE: use padding for semantic segmentation
        ###################################################
        if self.auto_pad:
            N, H_in, W_in, C = x.size()

            pad_l = pad_t = 0
            pad_r = (self.n_win - W_in % self.n_win) % self.n_win
            pad_b = (self.n_win - H_in % self.n_win) % self.n_win
            x = F.pad(x, (0, 0, # dim=-1
                          pad_l, pad_r, # dim=-2
                          pad_t, pad_b)) # dim=-3
            _, H, W, _ = x.size() # padded size
        else:
            N, H, W, C = x.size()
            #assert H%self.n_win == 0 and W%self.n_win == 0 #
        ###################################################

        # patchify, (n, p^2, w, w, c), keep 2d window as we need 2d pooling to reduce kv size
        x = rearrange(x, "n (j h) (i w) c -> n (j i) h w c", j=self.n_win, i=self.n_win)

        #################qkv projection###################
        # q: (n, p^2, w, w, c_qk)
        # kv: (n, p^2, w, w, c_qk+c_v)
        # NOTE: separate kv if there were memory leak issue caused by gather
        q, kv = self.qkv(x)

        # pixel-wise qkv
        # q_pix: (n, p^2, w^2, c_qk)
        # kv_pix: (n, p^2, h_kv*w_kv, c_qk+c_v)
        q_pix = rearrange(q, 'n p2 h w c -> n p2 (h w) c')
        kv_pix = self.kv_down(rearrange(kv, 'n p2 h w c -> (n p2) c h w'))
        kv_pix = rearrange(kv_pix, '(n j i) c h w -> n (j i) (h w) c', j=self.n_win, i=self.n_win)

        q_win, k_win = q.mean([2, 3]), kv[..., 0:self.qk_dim].mean([2, 3]) # window-wise qk, (n, p^2, c_qk), (n, p^2, c_qk)

        ##################side_dwconv(lepe)##################
        # NOTE: call contiguous to avoid gradient warning when using ddp
        lepe = self.lepe(rearrange(kv[..., self.qk_dim:], 'n (j i) h w c -> n c (j h) (i w)', j=self.n_win, i=self.n_win).contiguous())
        lepe = rearrange(lepe, 'n c (j h) (i w) -> n (j h) (i w) c', j=self.n_win, i=self.n_win)

        ############ gather q dependent k/v #################
        r_weight, r_idx = self.router(q_win, k_win) # both are (n, p^2, topk) tensors

        kv_pix_sel = self.kv_gather(r_idx=r_idx, r_weight=r_weight, kv=kv_pix) #(n, p^2, topk, h_kv*w_kv, c_qk+c_v)
        k_pix_sel, v_pix_sel = kv_pix_sel.split([self.qk_dim, self.dim], dim=-1)
        # kv_pix_sel: (n, p^2, topk, h_kv*w_kv, c_qk)
        # v_pix_sel: (n, p^2, topk, h_kv*w_kv, c_v)

        ######### do attention as normal ####################
        k_pix_sel = rearrange(k_pix_sel, 'n p2 k w2 (m c) -> (n p2) m c (k w2)', m=self.num_heads) # flatten to BMLC, (n*p^2, m, topk*h_kv*w_kv, c_kq//m) transpose here?
        v_pix_sel = rearrange(v_pix_sel, 'n p2 k w2 (m c) -> (n p2) m (k w2) c', m=self.num_heads) # flatten to BMLC, (n*p^2, m, topk*h_kv*w_kv, c_v//m)
        q_pix = rearrange(q_pix, 'n p2 w2 (m c) -> (n p2) m w2 c', m=self.num_heads) # to BMLC tensor (n*p^2, m, w^2, c_qk//m)

        # param-free multihead attention
        attn_weight = (q_pix * self.scale) @ k_pix_sel # (n*p^2, m, w^2, c) @ (n*p^2, m, c, topk*h_kv*w_kv) -> (n*p^2, m, w^2, topk*h_kv*w_kv)
        attn_weight = self.attn_act(attn_weight)
        out = attn_weight @ v_pix_sel # (n*p^2, m, w^2, topk*h_kv*w_kv) @ (n*p^2, m, topk*h_kv*w_kv, c) -> (n*p^2, m, w^2, c)
        out = rearrange(out, '(n j i) m (h w) c -> n (j h) (i w) (m c)', j=self.n_win, i=self.n_win,
                        h=H//self.n_win, w=W//self.n_win)

        out = out + lepe
        # output linear
        out = self.wo(out)

        # NOTE: use padding for semantic segmentation
        # crop padded region
        if self.auto_pad and (pad_r > 0 or pad_b > 0):
            out = out[:, :H_in, :W_in, :].contiguous()

        if ret_attn_mask:
            return out, r_weight, r_idx, attn_weight
        else:
            return out


class TransformerMLPWithConv(nn.Module):
    def __init__(self, channels, expansion, drop):
        super().__init__()

        self.dim1 = channels
        self.dim2 = channels * expansion
        self.linear1 = nn.Sequential(
            nn.Conv2d(self.dim1, self.dim2, 1, 1, 0),
            # nn.GELU(),
            # nn.BatchNorm2d(self.dim2, eps=1e-5)
        )
        self.drop1 = nn.Dropout(drop, inplace=True)
        self.act = nn.GELU()
        # self.bn = nn.BatchNorm2d(self.dim2, eps=1e-5)
        self.linear2 = nn.Sequential(
            nn.Conv2d(self.dim2, self.dim1, 1, 1, 0),
            # nn.BatchNorm2d(self.dim1, eps=1e-5)
        )
        self.drop2 = nn.Dropout(drop, inplace=True)
        self.dwc = nn.Conv2d(self.dim2, self.dim2, 3, 1, 1, groups=self.dim2)

    def forward(self, x):
        x = self.linear1(x)
        x = self.drop1(x)
        x = x + self.dwc(x)
        x = self.act(x)
        # x = self.bn(x)
        x = self.linear2(x)
        x = self.drop2(x)

        return x


class DeBiLevelRoutingAttention(nn.Module):
    """
    n_win: number of windows in one side (so the actual number of windows is n_win*n_win)
    kv_per_win: for kv_downsample_mode='ada_xxxpool' only, number of key/values per window. Similar to n_win, the actual number is kv_per_win*kv_per_win.
    topk: topk for window filtering
    param_attention: 'qkvo'-linear for q,k,v and o, 'none': param free attention
    param_routing: extra linear for routing
    diff_routing: whether to set routing differentiable
    soft_routing: whether to multiply soft routing weights
    """
    def __init__(self, dim, num_heads=8, n_win=7, qk_dim=None, qk_scale=None,
                 kv_per_win=4, kv_downsample_ratio=4, kv_downsample_kernel=None, kv_downsample_mode='identity',
                 topk=4, param_attention="qkvo", param_routing=False, diff_routing=False, soft_routing=False, side_dwconv=3,
                 auto_pad=False, param_size='small'):
        super().__init__()
        # local attention setting
        self.dim = dim
        self.n_win = n_win  # Wh, Ww
        self.num_heads = num_heads
        self.qk_dim = qk_dim or dim

        #############################################################
        if param_size=='tiny':
            if self.dim == 64 :
                self.n_groups = 1
                self.top_k_def = 16   # 2    128
                self.kk = 9
                self.stride_def = 8
                self.expain_ratio = 3
                self.q_size=to_2tuple(56)

            if self.dim == 128 :
                self.n_groups = 2
                self.top_k_def = 16   # 4    256
                self.kk = 7
                self.stride_def = 4
                self.expain_ratio = 3
                self.q_size=to_2tuple(28)

            if self.dim == 256 :
                self.n_groups = 4
                self.top_k_def = 4   # 8    512
                self.kk = 5
                self.stride_def = 2
                self.expain_ratio = 3
                self.q_size=to_2tuple(14)

            if self.dim == 512 :
                self.n_groups = 8
                self.top_k_def = 49   # 8    512
                self.kk = 3
                self.stride_def = 1
                self.expain_ratio = 3
                self.q_size=to_2tuple(7)
        #############################################################
        if param_size == 'small':
            if self.dim == 64:
                self.n_groups = 1
                self.top_k_def = 16   # 2    128
                self.kk = 9
                self.stride_def = 8
                self.expain_ratio = 3
                self.q_size = to_2tuple(56)
            if self.dim == 128:
                self.n_groups = 2
                self.top_k_def = 16   # 4    256
                self.kk = 7
                self.stride_def = 4
                self.expain_ratio = 3
                self.q_size = to_2tuple(28)
            if self.dim == 256:
                self.n_groups = 4
                self.top_k_def = 4   # 8    512
                self.kk = 5
                self.stride_def = 2
                self.expain_ratio = 3
                self.q_size = to_2tuple(14)
            if self.dim == 512:
                self.n_groups = 8
                self.top_k_def = 49   # 8    512
                self.kk = 3
                self.stride_def = 1
                self.expain_ratio = 1
                self.q_size = to_2tuple(7)
        #############################################################
        if param_size == 'base':
            if self.dim == 96:
                self.n_groups = 1
                self.top_k_def = 16   # 2    128
                self.kk = 9
                self.stride_def = 8
                self.expain_ratio = 3
                self.q_size = to_2tuple(56)
            if self.dim == 192:
                self.n_groups = 2
                self.top_k_def = 16   # 4    256
                self.kk = 7
                self.stride_def = 4
                self.expain_ratio = 3
                self.q_size = to_2tuple(28)
            if self.dim == 384:
                self.n_groups = 3
                self.top_k_def = 4   # 8    512
                self.kk = 5
                self.stride_def = 2
                self.expain_ratio = 3
                self.q_size = to_2tuple(14)
            if self.dim == 768:
                self.n_groups = 6
                self.top_k_def = 49   # 8    512
                self.kk = 3
                self.stride_def = 1
                self.expain_ratio = 3
                self.q_size = to_2tuple(7)

        self.q_h, self.q_w = self.q_size
        self.kv_h, self.kv_w = self.q_h // self.stride_def, self.q_w // self.stride_def
        self.n_group_channels = self.dim // self.n_groups
        self.n_group_heads = self.num_heads // self.n_groups
        self.offset_range_factor = -1
        self.head_channels = dim // num_heads
        # assert self.qk_dim % num_heads == 0 and self.dim % num_heads == 0, 'qk_dim and dim must be divisible by num_heads!'
        self.scale = qk_scale or self.qk_dim ** -0.5
        self.rpe_table = nn.Parameter(
            torch.zeros(self.num_heads, self.q_h * 2 - 1, self.q_w * 2 - 1)
        )
        trunc_normal_(self.rpe_table, std=0.01)

        ################ side_dwconv (i.e. LCE in ShuntedTransformer) ###########
        self.lepe1 = nn.Conv2d(dim, dim, kernel_size=side_dwconv, stride=self.stride_def,
                               padding=side_dwconv // 2, groups=dim) if side_dwconv > 0 else \
            lambda x: torch.zeros_like(x)

        ################ global routing setting #################
        self.topk = topk
        self.param_routing = param_routing
        self.diff_routing = diff_routing
        self.soft_routing = soft_routing
        # router
        # assert not (self.param_routing and not self.diff_routing)  # cannot be with_param=True and diff_routing=False
        self.router = TopkRouting(qk_dim=self.qk_dim,
                                  qk_scale=self.scale,
                                  topk=self.topk,
                                  diff_routing=self.diff_routing,
                                  param_routing=self.param_routing)
        if self.soft_routing:  # soft routing, always differentiable (if no detach)
            mul_weight = 'soft'
        elif self.diff_routing:  # hard differentiable routing
            mul_weight = 'hard'
        else:  # hard non-differentiable routing
            mul_weight = 'none'
        self.kv_gather = KVGather(mul_weight=mul_weight)

        # qkv mapping (shared by both global routing and local attention)
        self.param_attention = param_attention
        if self.param_attention == 'qkvo':
            # self.qkv = QKVLinear(self.dim, self.qk_dim)
            self.qkv_conv = QKVConv(self.dim, self.qk_dim)
            # self.wo = nn.Linear(dim, dim)
        elif self.param_attention == 'qkv':
            # self.qkv = QKVLinear(self.dim, self.qk_dim)
            self.qkv_conv = QKVConv(self.dim, self.qk_dim)
            # self.wo = nn.Identity()
        else:
            raise ValueError(f'param_attention mode {self.param_attention} is not supported!')

        self.kv_downsample_mode = kv_downsample_mode
        self.kv_per_win = kv_per_win
        self.kv_downsample_ratio = kv_downsample_ratio
        self.kv_downsample_kernel = kv_downsample_kernel
        if self.kv_downsample_mode == 'ada_avgpool':
            assert self.kv_per_win is not None
            self.kv_down = nn.AdaptiveAvgPool2d(self.kv_per_win)
        elif self.kv_downsample_mode == 'ada_maxpool':
            assert self.kv_per_win is not None
            self.kv_down = nn.AdaptiveMaxPool2d(self.kv_per_win)
        elif self.kv_downsample_mode == 'maxpool':
            assert self.kv_downsample_ratio is not None
            self.kv_down = nn.MaxPool2d(self.kv_downsample_ratio) if self.kv_downsample_ratio > 1 else nn.Identity()
        elif self.kv_downsample_mode == 'avgpool':
            assert self.kv_downsample_ratio is not None
            self.kv_down = nn.AvgPool2d(self.kv_downsample_ratio) if self.kv_downsample_ratio > 1 else nn.Identity()
        elif self.kv_downsample_mode == 'identity':  # no kv downsampling
            self.kv_down = nn.Identity()
        elif self.kv_downsample_mode == 'fracpool':
            raise NotImplementedError('fracpool policy is not implemented yet!')
        elif kv_downsample_mode == 'conv':
            raise NotImplementedError('conv policy is not implemented yet!')
        else:
            # The original message referenced a misspelled attribute (kv_downsaple_mode),
            # which would have raised AttributeError instead of this ValueError.
            raise ValueError(f'kv_downsample_mode {self.kv_downsample_mode} is not supported!')

        self.attn_act = nn.Softmax(dim=-1)
        self.auto_pad = auto_pad

        ##########################################################################################
        self.proj_q = nn.Conv2d(dim, dim, kernel_size=1, stride=1, padding=0)
        self.proj_k = nn.Conv2d(dim, dim, kernel_size=1, stride=1, padding=0)
        self.proj_v = nn.Conv2d(dim, dim, kernel_size=1, stride=1, padding=0)
        self.proj_out = nn.Conv2d(dim, dim, kernel_size=1, stride=1, padding=0)
        self.unifyheads1 = nn.Conv2d(dim, dim, kernel_size=1, stride=1, padding=0)
        # Offset network: strided depthwise conv -> LayerNorm -> GELU -> 1x1 conv
        self.conv_offset_q = nn.Sequential(
            nn.Conv2d(self.n_group_channels, self.n_group_channels, (self.kk, self.kk),
                      (self.stride_def, self.stride_def), (self.kk // 2, self.kk // 2),
                      groups=self.n_group_channels, bias=False),
            LayerNormProxy(self.n_group_channels),
            nn.GELU(),
            nn.Conv2d(self.n_group_channels, 1, 1, 1, 0, bias=False),
        )

        ### FFN
        self.norm = nn.LayerNorm(dim, eps=1e-6)
        self.norm2 = nn.LayerNorm(dim, eps=1e-6)
        self.mlp = TransformerMLPWithConv(dim, self.expain_ratio, 0.)

    @torch.no_grad()
    def _get_ref_points(self, H_key, W_key, B, dtype, device):
        # Reference points sit at pixel centres and are normalised to [-1, 1].
        ref_y, ref_x = torch.meshgrid(
            torch.linspace(0.5, H_key - 0.5, H_key, dtype=dtype, device=device),
            torch.linspace(0.5, W_key - 0.5, W_key, dtype=dtype, device=device),
            indexing='ij'  # same as the legacy default; passed explicitly to avoid the warning
        )
        ref = torch.stack((ref_y, ref_x), -1)
        ref[..., 1].div_(W_key).mul_(2).sub_(1)
        ref[..., 0].div_(H_key).mul_(2).sub_(1)
        ref = ref[None, ...].expand(B * self.n_groups, -1, -1, -1)  # B * g H W 2
        return ref

    @torch.no_grad()
    def _get_q_grid(self, H, W, B, dtype, device):
        ref_y, ref_x = torch.meshgrid(
            torch.arange(0, H, dtype=dtype, device=device),
            torch.arange(0, W, dtype=dtype, device=device),
            indexing='ij'
        )
        ref = torch.stack((ref_y, ref_x), -1)
        ref[..., 1].div_(W - 1.0).mul_(2.0).sub_(1.0)
        ref[..., 0].div_(H - 1.0).mul_(2.0).sub_(1.0)
        ref = ref[None, ...].expand(B * self.n_groups, -1, -1, -1)  # B * g H W 2
        return ref

    def forward(self, x, ret_attn_mask=False):
        """
        x: NHWC tensor
        Return:
            NHWC tensor
        """
        dtype, device = x.dtype, x.device
        # NOTE: use padding for semantic segmentation
        ###################################################
        if self.auto_pad:
            N, H_in, W_in, C = x.size()
            pad_l = pad_t = 0
            pad_r = (self.n_win - W_in % self.n_win) % self.n_win
            pad_b = (self.n_win - H_in % self.n_win) % self.n_win
            x = F.pad(x, (0, 0,  # dim=-1
                          pad_l, pad_r,  # dim=-2
                          pad_t, pad_b))  # dim=-3
            _, H, W, _ = x.size()  # padded size
        else:
            N, H, W, C = x.size()
            assert H % self.n_win == 0 and W % self.n_win == 0
        # print("X_in")
        # print(x.shape)
        ###################################################
        # q = self.proj_q_def(x)
        x_res = rearrange(x, "n h w c -> n c h w")
        ################# qkv projection ###################
        q, kv = self.qkv_conv(x.permute(0, 3, 1, 2))
        q_bi = rearrange(q, "n c (j h) (i w) -> n (j i) h w c", j=self.n_win, i=self.n_win)
        kv = rearrange(kv, "n c (j h) (i w) -> n (j i) h w c", j=self.n_win, i=self.n_win)
        q_pix = rearrange(q_bi, 'n p2 h w c -> n p2 (h w) c')
        kv_pix = self.kv_down(rearrange(kv, 'n p2 h w c -> (n p2) c h w'))
        kv_pix = rearrange(kv_pix, '(n j i) c h w -> n (j i) (h w) c', j=self.n_win, i=self.n_win)

        ################## side_dwconv (lepe) ##################
        # NOTE: call contiguous to avoid gradient warning when using ddp
        lepe1 = self.lepe1(rearrange(kv[..., self.qk_dim:], 'n (j i) h w c -> n c (j h) (i w)',
                                     j=self.n_win, i=self.n_win).contiguous())

        ################## Offset Q ##################
        q_off = rearrange(q, 'b (g c) h w -> (b g) c h w', g=self.n_groups, c=self.n_group_channels)
        # (B*g, 1, Hk, Wk): conv_offset_q ends in a 1-channel conv, so the same
        # offset value is broadcast over the (y, x) coordinates when added below.
        offset_q = self.conv_offset_q(q_off).contiguous()
        Hk, Wk = offset_q.size(2), offset_q.size(3)
        n_sample = Hk * Wk
        if self.offset_range_factor > 0:
            offset_range = torch.tensor([1.0 / Hk, 1.0 / Wk], device=device).reshape(1, 2, 1, 1)
            offset_q = offset_q.tanh().mul(offset_range).mul(self.offset_range_factor)
        offset_q = rearrange(offset_q, 'b p h w -> b h w p')  # channels-last, to match the reference grid
        reference = self._get_ref_points(Hk, Wk, N, dtype, device)
        if self.offset_range_factor >= 0:
            pos_k = offset_q + reference
        else:
            pos_k = (offset_q + reference).clamp(-1., +1.)
        x_sampled_q = F.grid_sample(
            input=x_res.reshape(N * self.n_groups, self.n_group_channels, H, W),
            grid=pos_k[..., (1, 0)],  # y, x -> x, y
            mode='bilinear', align_corners=True)  # B * g, Cg, Hg, Wg
        q_sampled = x_sampled_q.reshape(N, C, Hk, Wk)

        ######## Bi-LEVEL Gathering
        # Default the deformable grid size to the sampled size so that Hg/Wg are
        # also defined when auto_pad is False (they are used further below).
        Hg, Wg = Hk, Wk
        if self.auto_pad:
            q_sampled = q_sampled.permute(0, 2, 3, 1)
            Ng, Hg, Wg, Cg = q_sampled.size()
            pad_l = pad_t = 0
            pad_rg = (self.n_win - Wg % self.n_win) % self.n_win
            pad_bg = (self.n_win - Hg % self.n_win) % self.n_win
            q_sampled = F.pad(q_sampled, (0, 0,  # dim=-1
                                          pad_l, pad_rg,  # dim=-2
                                          pad_t, pad_bg))  # dim=-3
            _, Hg, Wg, _ = q_sampled.size()  # padded size
            q_sampled = q_sampled.permute(0, 3, 1, 2)
            lepe1 = F.pad(lepe1.permute(0, 2, 3, 1), (0, 0,  # dim=-1
                                                      pad_l, pad_rg,  # dim=-2
                                                      pad_t, pad_bg))  # dim=-3
            lepe1 = lepe1.permute(0, 3, 1, 2)
            pos_k = F.pad(pos_k, (0, 0,  # dim=-1
                                  pad_l, pad_rg,  # dim=-2
                                  pad_t, pad_bg))  # dim=-3

        queries_def = self.proj_q(q_sampled)  # linear projection
        queries_def = rearrange(queries_def, "n c (j h) (i w) -> n (j i) h w c",
                                j=self.n_win, i=self.n_win).contiguous()
        # Region-level descriptors: mean over the pixels of each window.
        q_win, k_win = queries_def.mean([2, 3]), kv[..., 0:(self.qk_dim)].mean([2, 3])
        r_weight, r_idx = self.router(q_win, k_win)
        kv_gather = self.kv_gather(r_idx=r_idx, r_weight=r_weight, kv=kv_pix)  # (n, p^2, topk, h_kv*w_kv, c)
        k_gather, v_gather = kv_gather.split([self.qk_dim, self.dim], dim=-1)

        ### Bi-level Routing MHA
        k = rearrange(k_gather, 'n p2 k hw (m c) -> (n p2) m c (k hw)', m=self.num_heads)
        v = rearrange(v_gather, 'n p2 k hw (m c) -> (n p2) m (k hw) c', m=self.num_heads)
        q_def = rearrange(queries_def, 'n p2 h w (m c)-> (n p2) m (h w) c', m=self.num_heads)
        attn_weight = (q_def * self.scale) @ k
        attn_weight = self.attn_act(attn_weight)
        out = attn_weight @ v
        out_def = rearrange(out, '(n j i) m (h w) c -> n (m c) (j h) (i w)', j=self.n_win, i=self.n_win,
                            h=Hg // self.n_win, w=Wg // self.n_win).contiguous()
        out_def = out_def + lepe1
        out_def = self.unifyheads1(out_def)
        out_def = q_sampled + out_def
        out_def = out_def + self.mlp(self.norm2(out_def.permute(0, 2, 3, 1)).permute(0, 3, 1, 2))  # (N, C, H, W)

        #############################################################################################
        ######## Deformable Gathering
        #############################################################################################
        out_def = self.norm(out_def.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        k = self.proj_k(out_def)
        v = self.proj_v(out_def)
        k_pix_sel = rearrange(k, 'n (m c) h w -> (n m) c (h w)', m=self.num_heads)
        v_pix_sel = rearrange(v, 'n (m c) h w -> (n m) c (h w)', m=self.num_heads)
        q_pix = rearrange(q, 'n (m c) h w -> (n m) c (h w)', m=self.num_heads)
        attn = torch.einsum('b c m, b c n -> b m n', q_pix, k_pix_sel)  # B * h, HW, Ns
        attn = attn.mul(self.scale)

        ### Bias
        rpe_table = self.rpe_table
        rpe_bias = rpe_table[None, ...].expand(N, -1, -1, -1)
        q_grid = self._get_q_grid(H, W, N, dtype, device)
        displacement = (q_grid.reshape(N * self.n_groups, H * W, 2).unsqueeze(2)
                        - pos_k.reshape(N * self.n_groups, Hg * Wg, 2).unsqueeze(1)).mul(0.5)
        attn_bias = F.grid_sample(
            input=rearrange(rpe_bias, 'b (g c) h w -> (b g) c h w', c=self.n_group_heads, g=self.n_groups),
            grid=displacement[..., (1, 0)],
            mode='bilinear', align_corners=True)  # B * g, h_g, HW, Ns
        attn_bias = attn_bias.reshape(N * self.num_heads, H * W, Hg * Wg)
        attn = attn + attn_bias
        ###
        attn = F.softmax(attn, dim=2)
        out = torch.einsum('b m n, b c n -> b c m', attn, v_pix_sel)
        out = out.reshape(N, C, H, W).contiguous()
        out = self.proj_out(out).permute(0, 2, 3, 1)

        #############################################################################################
        # NOTE: use padding for semantic segmentation
        # crop padded region
        if self.auto_pad and (pad_r > 0 or pad_b > 0):
            out = out[:, :H_in, :W_in, :].contiguous()
        if ret_attn_mask:
            return out, r_weight, r_idx, attn_weight
        else:
            return out


def get_pe_layer(emb_dim, pe_dim=None, name='none'):
    if name == 'none':
        return nn.Identity()
    else:
        raise ValueError(f'PE name {name} is not supported!')


class Block(nn.Module):
    def __init__(self, dim, drop_path=0., layer_scale_init_value=-1,
                 num_heads=8, n_win=7, qk_dim=None, qk_scale=None,
                 kv_per_win=4, kv_downsample_ratio=4, kv_downsample_kernel=None, kv_downsample_mode='ada_avgpool',
                 topk=4, param_attention="qkvo", param_routing=False, diff_routing=False, soft_routing=False,
                 mlp_ratio=4, param_size='small', mlp_dwconv=False,
                 side_dwconv=5, before_attn_dwconv=3, pre_norm=True, auto_pad=False):
        super().__init__()
        qk_dim = qk_dim or dim

        # modules
        if before_attn_dwconv > 0:
            self.pos_embed1 = nn.Conv2d(dim, dim, kernel_size=before_attn_dwconv,
                                        padding=before_attn_dwconv // 2, groups=dim)
            self.pos_embed2 = nn.Conv2d(dim, dim, kernel_size=before_attn_dwconv,
                                        padding=before_attn_dwconv // 2, groups=dim)
        else:
            # forward() calls pos_embed1/pos_embed2, so define both (the original only set `pos_embed`).
            self.pos_embed1 = self.pos_embed2 = lambda x: torch.zeros_like(x)
        self.norm1 = nn.LayerNorm(dim, eps=1e-6)  # important to avoid attention collapsing

        # Arguments shared by every routing-attention variant below.
        attn_kwargs = dict(dim=dim, num_heads=num_heads, n_win=n_win, qk_dim=qk_dim,
                           qk_scale=qk_scale, kv_per_win=kv_per_win, kv_downsample_ratio=kv_downsample_ratio,
                           kv_downsample_kernel=kv_downsample_kernel, kv_downsample_mode=kv_downsample_mode,
                           param_attention=param_attention, param_routing=param_routing,
                           diff_routing=diff_routing, soft_routing=soft_routing, side_dwconv=side_dwconv,
                           auto_pad=auto_pad)
        # Stage topk -> topk used by the plain bi-level routing attention (attn1).
        bi_topk = {4: 1, 8: 4, 16: 16}
        if topk in bi_topk:
            self.attn1 = BiLevelRoutingAttention(topk=bi_topk[topk], **attn_kwargs)
            self.attn2 = DeBiLevelRoutingAttention(topk=topk, param_size=param_size, **attn_kwargs)
        elif topk == -1:
            self.attn = Attention(dim=dim)
        elif topk == -2:
            self.attn1 = DeBiLevelRoutingAttention(topk=49, param_size=param_size, **attn_kwargs)
            self.attn2 = DeBiLevelRoutingAttention(topk=49, param_size=param_size, **attn_kwargs)
        elif topk == 0:
            self.attn = nn.Sequential(Rearrange('n h w c -> n c h w'),  # compatibility
                                      nn.Conv2d(dim, dim, 1),  # pseudo qkv linear
                                      nn.Conv2d(dim, dim, 5, padding=2, groups=dim),  # pseudo attention
                                      nn.Conv2d(dim, dim, 1),  # pseudo out linear
                                      Rearrange('n c h w -> n h w c'))
        # NOTE: forward() uses attn1/attn2, so only topk in {4, 8, 16, -2} yields a usable block.

        self.norm2 = nn.LayerNorm(dim, eps=1e-6)
        self.mlp1 = TransformerMLPWithConv(dim, mlp_ratio, 0.)
        self.drop_path1 = DropPath(drop_path) if drop_path > 0. else nn.Identity()
        self.drop_path2 = DropPath(drop_path) if drop_path > 0. else nn.Identity()
        self.norm3 = nn.LayerNorm(dim, eps=1e-6)
        self.norm4 = nn.LayerNorm(dim, eps=1e-6)
        self.mlp2 = TransformerMLPWithConv(dim, mlp_ratio, 0.)

        # tricks: layer scale & pre_norm/post_norm
        if layer_scale_init_value > 0:
            self.use_layer_scale = True
            self.gamma1 = nn.Parameter(layer_scale_init_value * torch.ones((dim)), requires_grad=True)
            self.gamma2 = nn.Parameter(layer_scale_init_value * torch.ones((dim)), requires_grad=True)
            self.gamma3 = nn.Parameter(layer_scale_init_value * torch.ones((dim)), requires_grad=True)
            self.gamma4 = nn.Parameter(layer_scale_init_value * torch.ones((dim)), requires_grad=True)
        else:
            self.use_layer_scale = False
        self.pre_norm = pre_norm

    def _mlp_nhwc(self, mlp, x):
        # The conv-based MLP expects NCHW, while the block works in NHWC.
        return mlp(x.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)

    def forward(self, x):
        """
        x: NCHW tensor
        """
        # conv pos embedding
        x = x + self.pos_embed1(x)
        # permute to NHWC tensor for attention & mlp
        x = x.permute(0, 2, 3, 1)  # (N, C, H, W) -> (N, H, W, C)

        # attention & mlp
        if self.pre_norm:
            if self.use_layer_scale:
                # The original passed NHWC tensors straight into the conv MLP in the
                # layer-scale branches; they are permuted here like the other branches.
                x = x + self.drop_path1(self.gamma1 * self.attn1(self.norm1(x)))  # (N, H, W, C)
                x = x + self.drop_path1(self.gamma2 * self._mlp_nhwc(self.mlp1, self.norm2(x)))  # (N, H, W, C)
                # conv pos embedding
                x = x + self.pos_embed2(x.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
                x = x + self.drop_path2(self.gamma3 * self.attn2(self.norm3(x)))  # (N, H, W, C)
                x = x + self.drop_path2(self.gamma4 * self._mlp_nhwc(self.mlp2, self.norm4(x)))  # (N, H, W, C)
            else:
                x = x + self.drop_path1(self.attn1(self.norm1(x)))  # (N, H, W, C)
                x = x + self.drop_path1(self._mlp_nhwc(self.mlp1, self.norm2(x)))  # (N, H, W, C)
                # conv pos embedding
                x = x + self.pos_embed2(x.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
                x = x + self.drop_path2(self.attn2(self.norm3(x)))  # (N, H, W, C)
                x = x + self.drop_path2(self._mlp_nhwc(self.mlp2, self.norm4(x)))  # (N, H, W, C)
        else:  # https://kexue.fm/archives/9009
            if self.use_layer_scale:
                x = self.norm1(x + self.drop_path1(self.gamma1 * self.attn1(x)))  # (N, H, W, C)
                x = self.norm2(x + self.drop_path1(self.gamma2 * self._mlp_nhwc(self.mlp1, x)))  # (N, H, W, C)
                # conv pos embedding
                x = x + self.pos_embed2(x.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
                x = self.norm3(x + self.drop_path2(self.gamma3 * self.attn2(x)))  # (N, H, W, C)
                x = self.norm4(x + self.drop_path2(self.gamma4 * self._mlp_nhwc(self.mlp2, x)))  # (N, H, W, C)
            else:
                x = self.norm1(x + self.drop_path1(self.attn1(x)))  # (N, H, W, C)
                x = x + self.drop_path1(self._mlp_nhwc(self.mlp1, self.norm2(x)))  # (N, H, W, C)
                # conv pos embedding
                x = x + self.pos_embed2(x.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
                x = self.norm3(x + self.drop_path2(self.attn2(x)))  # (N, H, W, C)
                x = x + self.drop_path2(self._mlp_nhwc(self.mlp2, self.norm4(x)))  # (N, H, W, C)

        # permute back
        x = x.permute(0, 3, 1, 2)  # (N, H, W, C) -> (N, C, H, W)
        return x


class DeBiFormer(nn.Module):
    def __init__(self, depth=[3, 4, 8, 3], in_chans=3, num_classes=1000, embed_dim=[64, 128, 320, 512],
                 head_dim=64, qk_scale=None, representation_size=None,
                 drop_path_rate=0., drop_rate=0.,
                 use_checkpoint_stages=[],
                 ########
                 n_win=7,
                 kv_downsample_mode='ada_avgpool',
                 kv_per_wins=[2, 2, -1, -1],
                 topks=[8, 8, -1, -1],
                 side_dwconv=5,
                 layer_scale_init_value=-1,
                 qk_dims=[None, None, None, None],
                 param_routing=False, diff_routing=False, soft_routing=False,
                 pre_norm=True,
                 pe=None,
                 pe_stages=[0],
                 before_attn_dwconv=3,
                 auto_pad=False,
                 # -----------------------
                 kv_downsample_kernels=[4, 2, 1, 1],
                 kv_downsample_ratios=[4, 2, 1, 1],  # -> kv_per_win = [2, 2, 2, 1]
                 mlp_ratios=[4, 4, 4, 4],
                 param_attention='qkvo',
                 param_size='small',
                 mlp_dwconv=False):
        """
        Args:
            depth (list): depth of each stage
            in_chans (int): number of input channels
            num_classes (int): number of classes for classification head
            embed_dim (list): embedding dimension of each stage
            head_dim (int): head dimension
            mlp_ratios (list): ratio of mlp hidden dim to embedding dim, per stage
            qk_scale (float): override default qk scale of head_dim ** -0.5 if set
            representation_size (Optional[int]): enable and set representation layer (pre-logits) to this value if set
            drop_rate (float): dropout rate
            drop_path_rate (float): stochastic depth rate
        """
        super().__init__()
        self.num_classes = num_classes
        self.num_features = self.embed_dim = embed_dim  # num_features for consistency with other models

        ############ downsample layers (patch embeddings) ######################
        self.downsample_layers = nn.ModuleList()
        # NOTE: uniformer uses two 3*3 conv, while in many other transformers this is one 7*7 conv
        stem = nn.Sequential(
            nn.Conv2d(in_chans, embed_dim[0] // 2, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1)),
            nn.BatchNorm2d(embed_dim[0] // 2),
            nn.GELU(),
            nn.Conv2d(embed_dim[0] // 2, embed_dim[0], kernel_size=(3, 3), stride=(2, 2), padding=(1, 1)),
            nn.BatchNorm2d(embed_dim[0]),
        )
        if (pe is not None) and 0 in pe_stages:
            stem.append(get_pe_layer(emb_dim=embed_dim[0], name=pe))
        if use_checkpoint_stages:
            stem = checkpoint_wrapper(stem)
        self.downsample_layers.append(stem)

        for i in range(3):
            downsample_layer = nn.Sequential(
                nn.Conv2d(embed_dim[i], embed_dim[i + 1], kernel_size=(3, 3), stride=(2, 2), padding=(1, 1)),
                nn.BatchNorm2d(embed_dim[i + 1])
            )
            if (pe is not None) and i + 1 in pe_stages:
                downsample_layer.append(get_pe_layer(emb_dim=embed_dim[i + 1], name=pe))
            if use_checkpoint_stages:
                downsample_layer = checkpoint_wrapper(downsample_layer)
            self.downsample_layers.append(downsample_layer)

        ##########################################################################
        self.stages = nn.ModuleList()  # 4 feature resolution stages, each consisting of multiple residual blocks
        # NOTE: qk_dims must hold integers here; the default [None, ...] would fail on `//`.
        nheads = [dim // head_dim for dim in qk_dims]
        dp_rates = [x.item() for x in torch.linspace(0, drop_path_rate, sum(depth))]
        cur = 0
        for i in range(4):
            stage = nn.Sequential(
                *[Block(dim=embed_dim[i], drop_path=dp_rates[cur + j],
                        layer_scale_init_value=layer_scale_init_value,
                        topk=topks[i],
                        num_heads=nheads[i],
                        n_win=n_win,
                        qk_dim=qk_dims[i],
                        qk_scale=qk_scale,
                        kv_per_win=kv_per_wins[i],
                        kv_downsample_ratio=kv_downsample_ratios[i],
                        kv_downsample_kernel=kv_downsample_kernels[i],
                        kv_downsample_mode=kv_downsample_mode,
                        param_attention=param_attention,
                        param_size=param_size,
                        param_routing=param_routing,
                        diff_routing=diff_routing,
                        soft_routing=soft_routing,
                        mlp_ratio=mlp_ratios[i],
                        mlp_dwconv=mlp_dwconv,
                        side_dwconv=side_dwconv,
                        before_attn_dwconv=before_attn_dwconv,
                        pre_norm=pre_norm,
                        auto_pad=auto_pad) for j in range(depth[i])],
            )
            if i in use_checkpoint_stages:
                stage = checkpoint_wrapper(stage)
            self.stages.append(stage)
            cur += depth[i]

        ##########################################################################
        self.norm = nn.BatchNorm2d(embed_dim[-1])
        # Representation layer
        if representation_size:
            self.num_features = representation_size
            self.pre_logits = nn.Sequential(OrderedDict([
                ('fc', nn.Linear(embed_dim[-1], representation_size)),  # embed_dim is a list; use the last stage width
                ('act', nn.Tanh())
            ]))
        else:
            self.pre_logits = nn.Identity()

        # Classifier head
        self.head = nn.Linear(embed_dim[-1], num_classes) if num_classes > 0 else nn.Identity()
        self.reset_parameters()

    def reset_parameters(self):
        # NOTE: self.parameters() yields tensors, never nn.Linear/nn.Conv2d modules, so this
        # loop does nothing and the PyTorch default initialisation is kept. It is left
        # unchanged on purpose so that training behaviour stays the same.
        for m in self.parameters():
            if isinstance(m, (nn.Linear, nn.Conv2d)):
                nn.init.kaiming_normal_(m.weight)
                nn.init.zeros_(m.bias)

    @torch.jit.ignore
    def no_weight_decay(self):
        return {'pos_embed', 'cls_token'}

    def get_classifier(self):
        return self.head

    def reset_classifier(self, num_classes, global_pool=''):
        self.num_classes = num_classes
        # embed_dim is a list; the head takes the last stage width.
        self.head = nn.Linear(self.embed_dim[-1], num_classes) if num_classes > 0 else nn.Identity()

    def forward_features(self, x):
        for i in range(4):
            x = self.downsample_layers[i](x)  # res = (56, 28, 14, 7), wins = (64, 16, 4, 1)
            x = self.stages[i](x)
        x = self.norm(x)
        x = self.pre_logits(x)
        return x

    def forward(self, x):
        x = self.forward_features(x)
        x = x.flatten(2).mean(-1)  # global average pooling over H*W
        x = self.head(x)
        return x


@register_model
def debi_tiny(pretrained=False, pretrained_cfg=None, **kwargs):
    model = DeBiFormer(
        depth=[1, 1, 4, 1],
        embed_dim=[64, 128, 256, 512], mlp_ratios=[3, 3, 3, 3],
        param_size='tiny',
        drop_path_rate=0.,  # Drop rate
        # ------------------------------
        n_win=7,
        kv_downsample_mode='identity',
        kv_per_wins=[-1, -1, -1, -1],
        topks=[4, 8, 16, -2],
        side_dwconv=5,
        before_attn_dwconv=3,
        layer_scale_init_value=-1,
        qk_dims=[64, 128, 256, 512],
        head_dim=32,
        param_routing=False, diff_routing=False, soft_routing=False,
        pre_norm=True,
        pe=None)
    return model


@register_model
def debi_small(pretrained=False, pretrained_cfg=None, **kwargs):
    model = DeBiFormer(
        depth=[2, 2, 9, 3],
        embed_dim=[64, 128, 256, 512], mlp_ratios=[3, 3, 3, 2],
        param_size='small',
        drop_path_rate=0.3,  # Drop rate
        # ------------------------------
        n_win=7,
        kv_downsample_mode='identity',
        kv_per_wins=[-1, -1, -1, -1],
        topks=[4, 8, 16, -2],
        side_dwconv=5,
        before_attn_dwconv=3,
        layer_scale_init_value=-1,
        qk_dims=[64, 128, 256, 512],
        head_dim=32,
        param_routing=False, diff_routing=False, soft_routing=False,
        pre_norm=True,
        pe=None)
    return model


@register_model
def debi_base(pretrained=False, pretrained_cfg=None, **kwargs):
    model = DeBiFormer(
        depth=[2, 2, 9, 2],
        embed_dim=[96, 192, 384, 768], mlp_ratios=[3, 3, 3, 3],
        param_size='base',
        drop_path_rate=0.4,  # Drop rate
        # ------------------------------
        n_win=7,
        kv_downsample_mode='identity',
        kv_per_wins=[-1, -1, -1, -1],
        topks=[4, 8, 16, -2],
        side_dwconv=5,
        before_attn_dwconv=3,
        layer_scale_init_value=-1,
        qk_dims=[96, 192, 384, 768],
        head_dim=32,
        param_routing=False, diff_routing=False, soft_routing=False,
        pre_norm=True,
        pe=None)
    return model


if __name__ == '__main__':
    from mmcv.cnn.utils import flops_counter

    model = DeBiFormer(
        depth=[2, 2, 9, 1],
        embed_dim=[64, 128, 256, 512], mlp_ratios=[3, 3, 3, 2],
        # ------------------------------
        n_win=7,
        kv_downsample_mode='identity',
        kv_per_wins=[-1, -1, -1, -1],
        topks=[4, 8, 16, -2],
        side_dwconv=5,
        before_attn_dwconv=3,
        layer_scale_init_value=-1,
        qk_dims=[64, 128, 256, 512],
        head_dim=32,
        param_routing=False, diff_routing=False, soft_routing=False,
        pre_norm=True,
        pe=None)
    input_shape = (3, 224, 224)
    flops_counter.get_model_complexity_info(model, input_shape)
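
The listing calls `self.router(q_win, k_win)` on region-level descriptors (the mean of each window's queries and keys) and gets back routing weights and indices of shape `(n, p^2, topk)`. `TopkRouting` itself is defined elsewhere in the file, so the following is only a minimal pure-Python sketch of the idea, assuming that routing means "score every region pair with a scaled dot product, keep the top-k regions per query region, softmax over the kept scores". The function name `topk_route` and the list-based layout are hypothetical and chosen for illustration.

```python
import math


def topk_route(q_win, k_win, topk, scale=1.0):
    """Hypothetical sketch of region-to-region top-k routing.

    q_win, k_win: one descriptor vector (list of floats) per region.
    Returns (weights, indices); both hold `topk` entries per query region.
    """
    weights, indices = [], []
    for q in q_win:
        # Region-level affinity: scaled dot product against every key region.
        logits = [scale * sum(a * b for a, b in zip(q, k)) for k in k_win]
        # Indices of the topk most related regions, highest score first.
        order = sorted(range(len(logits)), key=lambda j: logits[j], reverse=True)[:topk]
        kept = [logits[j] for j in order]
        # Softmax over the kept scores only (numerically stabilised).
        peak = max(kept)
        exps = [math.exp(v - peak) for v in kept]
        total = sum(exps)
        weights.append([e / total for e in exps])
        indices.append(order)
    return weights, indices


if __name__ == "__main__":
    q_regions = [[1.0, 0.0], [0.0, 1.0]]
    k_regions = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
    r_weight, r_idx = topk_route(q_regions, k_regions, topk=2)
    print(r_idx)
```

In the model, the returned indices are what the key/value gather step uses to pull the pixels of the selected regions, so each query window attends to `topk` windows rather than to the whole feature map.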

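
`_get_ref_points` places one reference point at the centre of each cell of the downsampled grid and rescales it to `[-1, 1]`; with `offset_range_factor = -1`, the forward pass then adds the predicted offset and clamps the result to that range before sampling. The sketch below reproduces that arithmetic for a single axis in plain Python, as an illustration only: `ref_points_1d` and `deform` are hypothetical helper names, and the real code works on 2-D tensors.

```python
def ref_points_1d(n):
    """Cell centres 0.5, 1.5, ..., n-0.5, normalised from [0, n] to [-1, 1]."""
    return [(i + 0.5) / n * 2.0 - 1.0 for i in range(n)]


def deform(reference, offsets, lo=-1.0, hi=1.0):
    """Shift each reference coordinate by its offset and clamp to [lo, hi]."""
    return [min(hi, max(lo, r + o)) for r, o in zip(reference, offsets)]


if __name__ == "__main__":
    ref = ref_points_1d(4)
    print(ref)
    # Offsets that would push a point outside the feature map are clamped to the border.
    print(deform(ref, [-0.5, 0.0, 0.25, 0.5]))
```

The clamp keeps every deformable point inside the feature map, so the bilinear sampling step never reads outside the normalised coordinate range.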

