論文:https://arxiv.org/pdf/1512.03385.pdf
Deep Residual Learning for Image Recognition
This section is mainly my read-through of the paper, with a rough translation (at least good enough that I can follow it myself).
Introduction
As the figure above shows, looking at how training error and test error evolve for 20-layer and 56-layer networks, adding layers does not necessarily improve performance and can even make it worse. This raises the paper's (somewhat counter-intuitive) question: why isn't deeper always better? The explanation given: vanishing/exploding gradients hamper convergence from the start. That problem has largely been addressed by normalized initialization and intermediate normalization layers, which let networks with tens of layers converge, but as depth keeps increasing a "degradation" problem appears: as the network gets deeper, accuracy rises, saturates, and then degrades rapidly. The paper points out that this is not caused by overfitting (the model is not simply too complex). It also means we cannot expect to improve a model just by stacking more layers.
The paper addresses this by introducing a deep residual learning framework. Instead of simply stacking layers and hoping they fit the desired underlying mapping, these layers are explicitly made to fit a residual mapping.
Definitions:
H(x): the desired underlying mapping
F(x): the mapping produced by the stacked nonlinear layers, satisfying F(x) := H(x) - x
The original mapping is then recast as F(x) + x
The hypothesis is that the residual mapping is easier to optimize than the original one. In the extreme case of an identity mapping, pushing the residual to zero is easier than fitting an identity with a stack of nonlinear layers. (I remember hearing somewhere that, in theory, improving a model by adding layers feels very intuitive, because we imagine we can keep inserting y = x transforms and grow the depth without limit. But as the paper shows, instead of higher accuracy we get degradation, possibly because deep nets are so good at modelling nonlinearities that they struggle to represent something as simple as a linear identity. I find that an intuitively satisfying explanation, so I note it here.)
The form F(x) + x can be realized in a feedforward network with "shortcut connections", i.e. connections that skip one or more layers. In this paper the shortcuts simply perform identity mapping (in mathematics, a function that maps every element to itself, so input equals output). This adds no extra parameters and no extra computational complexity (y = x costs nothing and has nothing to update), so the whole network can still be trained end-to-end with SGD and backpropagation. The paper notes that its 152-layer network, although deeper than VGG, is actually less complex.
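As a quick illustration of the idea (my own minimal sketch, not code from the paper; the class and variable names are made up), this is what F(x) + x with a parameter-free identity shortcut looks like in PyTorch:

import torch
import torch.nn as nn

class TinyResidualBlock(nn.Module):
    """Minimal F(x) + x block: two stacked layers plus an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        # F(x): the residual mapping learned by the stacked layers
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        f = self.conv2(self.relu(self.conv1(x)))  # F(x)
        return self.relu(f + x)                   # H(x) = F(x) + x; the shortcut adds no parameters

x = torch.randn(1, 64, 56, 56)
print(TinyResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])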
Deep Residual Learning
Residual Learning
Using the earlier definition (H(x): the desired underlying mapping), first regard H(x) as an underlying mapping to be fit by a few stacked layers (not necessarily the whole network), with x denoting the input to the first of these layers. If one hypothesizes that multiple nonlinear layers can asymptotically approximate the residual function, i.e. H(x) - x (assuming the input and output have the same dimensions, which feels a bit like an autoencoder), then rather than expecting the stacked layers to approximate H(x), we explicitly let them fit the residual function F(x) := H(x) - x. Although both forms should be able to asymptotically approximate the desired function (as hypothesized), the ease of learning may differ.
In short: instead of fitting the target function directly, fit the residual function, the motivation being that the residual is easier to learn.
If the added layers can be constructed as identity mappings, a deeper model should have a training error no greater than its shallower counterpart. The degradation problem suggests that it may be difficult to approximate identity mappings with multiple nonlinear layers. With the residual formulation, if identity mappings are optimal, the solver can simply drive the weights of the multiple nonlinear layers toward zero, which approximates the identity mapping.
Identity Mapping by Shortcuts
Residual learning is adopted for every few stacked layers, forming a building block defined as:
y = F(x, {W_i}) + x
x, y: the input and output of the stacked layers under consideration.
F(x, {W_i}): the residual mapping to be learned. (What we ultimately want is H(x), but what is learned here is F(x, {W_i}), i.e. H(x) - x. That is fine, because the block outputs F(x, {W_i}) + x = H(x) - x + x. This is a bit like those sequence-sum puzzles from school: the sum is hard to compute directly, but becomes easy after adding an extra term, which you remove again at the end.)
In Figure 2 there are two layers, i.e. F = W_2 σ(W_1 x), where σ denotes ReLU and the biases are omitted for simplicity. The operation F + x is performed by a shortcut connection and element-wise addition (simply pull the input across and add it on, which requires the dimensions to match). This introduces neither extra parameters nor extra computational complexity. If the number of input/output channels changes, a linear projection W_s can be applied on the shortcut: y = F(x, {W_i}) + W_s x.
F(x, {W_i}) can also represent multiple convolutional layers.
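To make the linear projection concrete, here is a minimal sketch (my own, with made-up shapes) using a 1×1 convolution as W_s, which is how the projection form y = F(x, {W_i}) + W_s x (Eqn.(2) in the paper) is typically realized:

import torch
import torch.nn as nn

in_channels, out_channels = 64, 128
# W_s: a 1x1 convolution that matches the channel count (and the spatial size, via stride)
projection = nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=2)

x = torch.randn(1, in_channels, 56, 56)
f_x = torch.randn(1, out_channels, 28, 28)   # stand-in for the output of F(x, {W_i})
y = f_x + projection(x)                      # y = F(x, {W_i}) + W_s x
print(y.shape)                               # torch.Size([1, 128, 28, 28])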
Network Architectures
The convolutional layers mostly have 3×3 filters and follow two simple design rules: (i) for the same output feature map size, the layers have the same number of filters; and (ii) if the feature map size is halved, the number of filters is doubled to preserve the time complexity per layer. Downsampling is performed directly by convolutional layers with a stride of 2. The network ends with a global average pooling layer and a 1000-way fully connected layer with softmax. The total number of weighted layers in Figure 3 (middle) is 34.
Adding shortcut connections to the plain network in the middle turns it into the residual version. When the input and output have the same dimensions, the identity shortcut of Eqn.(1) (y = F(x, {W_i}) + x) can be used directly.
When the dimensions increase (the dotted shortcuts in Figure 3), two options are considered:
(A) The shortcut still performs identity mapping, padding extra zero entries for the increased dimensions. This option introduces no extra parameters (see the sketch after this list);
(B) A projection shortcut is used to match dimensions (done by 1×1 convolutions).
For both options, when the shortcuts go across feature maps of two sizes, they are performed with a stride of 2.
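Option (A) does not appear in any of the implementations quoted later, so here is a minimal sketch of how a zero-padding identity shortcut could look (my own construction; the function name is made up):

import torch
import torch.nn.functional as F

def option_a_shortcut(x, out_channels):
    """Identity shortcut for increased dimensions: stride-2 subsampling plus zero-padded channels."""
    x = x[:, :, ::2, ::2]                    # halve the spatial size (stride 2)
    extra = out_channels - x.shape[1]        # number of zero channels to append
    # pad order for a 4D tensor: (W_left, W_right, H_top, H_bottom, C_front, C_back)
    return F.pad(x, (0, 0, 0, 0, 0, extra))

x = torch.randn(1, 64, 56, 56)
print(option_a_shortcut(x, 128).shape)       # torch.Size([1, 128, 28, 28])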
Implementation
A 224×224 crop is randomly sampled from an image or its horizontal flip, with the per-pixel mean subtracted. Standard color augmentation is used. Batch normalization (BN) is applied right after each convolution and before activation. Weights are initialized as in the cited initialization scheme, and all plain/residual networks are trained from scratch. SGD is used with a mini-batch size of 256. The learning rate starts at 0.1 and is divided by 10 when the error plateaus, and the models are trained for up to 60 × 10^4 iterations.
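A rough sketch of that training recipe in PyTorch (my own, not the authors' code; `model` and `train_loader` are assumed to exist, and the momentum 0.9 / weight decay 1e-4 values are the ones reported in the paper):

import torch

# assumes `model` and `train_loader` (mini-batch size 256) are defined elsewhere
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
# divides the learning rate by 10 whenever the monitored error plateaus
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1)

criterion = torch.nn.CrossEntropyLoss()
for images, labels in train_loader:      # one epoch; the paper trains for up to 60 x 10^4 iterations in total
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
# scheduler.step(validation_error)  # call after each evaluation with the current error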
Experiments
Deeper Bottleneck Architectures
For each residual function F, a stack of 3 layers is used instead of 2 (Figure 5). The three layers are 1×1, 3×3, and 1×1 convolutions, where the 1×1 layers are responsible for reducing and then restoring the dimensions, leaving the 3×3 layer a bottleneck with smaller input/output dimensions. Figure 5 shows an example in which the two designs have similar time complexity. The parameter-free identity shortcuts are particularly important for the bottleneck architectures: if the identity shortcut in Figure 5 (right) were replaced with projection, the time complexity and model size would be doubled, since the shortcut connects to the two high-dimensional ends.
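For intuition, a minimal sketch of one such bottleneck stack with the conv2_x channel numbers (256 -> 64 -> 64 -> 256); this is just an illustration, the full implementation is in the code section below:

import torch
import torch.nn as nn

# 1x1 reduces 256 -> 64, the 3x3 works at the reduced width, 1x1 restores 64 -> 256
bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1), nn.BatchNorm2d(64), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
    nn.Conv2d(64, 256, kernel_size=1), nn.BatchNorm2d(256),
)
x = torch.randn(1, 256, 56, 56)
print((bottleneck(x) + x).shape)  # identity shortcut is parameter-free: torch.Size([1, 256, 56, 56])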
Notes on understanding the paper
Feature map size after a convolution: H_out = ⌊(H_in + 2P - K) / S⌋ + 1
Reposted: the ResNet-50 structure diagram at the end of 神經(jīng)網(wǎng)絡(luò)學(xué)習(xí)小記錄20——ResNet50模型的復(fù)現(xiàn)詳解 (CSDN blog).
One note on the architecture table in the paper: some details are stated neither in the table nor in the figure. For example, conv1 needs padding, otherwise (224 - 7)/2 + 1 cannot come out to 112; here padding = 3. In conv2_x the downsampling stride is 1, while the other stages use stride 2 (e.g. 56 -> 28, since (56 - 1)/2 + 1 = 28).
Note that inside each small block, wherever a direct size calculation does not come out right, padding has been added.
For example, the 3×3 convolution in conv2_x uses padding = 1.
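A quick sanity check of this size arithmetic, using the formula above (my own helper function):

def conv_out(h_in, kernel, stride, padding):
    return (h_in + 2 * padding - kernel) // stride + 1

print(conv_out(224, kernel=7, stride=2, padding=3))  # conv1: 112
print(conv_out(112, kernel=3, stride=2, padding=1))  # max pool before conv2_x: 56
print(conv_out(56,  kernel=3, stride=1, padding=1))  # 3x3 conv inside conv2_x: 56
print(conv_out(56,  kernel=3, stride=2, padding=1))  # stride-2 3x3 conv entering conv3_x: 28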
Code implementation
1
Reference: https://www.youtube.com/watch?v=DkNIBBBvcPs
import torch
import torch.nn as nn

class block(nn.Module):
    def __init__(self, in_channels, out_channels, identity_downsample=None, stride=1):
        super(block, self).__init__()
        # in every ResNet bottleneck block the output has 4x the channels of out_channels
        self.expansion = 4
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=1, padding=0)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=stride, padding=1)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.conv3 = nn.Conv2d(out_channels, out_channels * self.expansion, kernel_size=1, stride=1, padding=0)
        self.bn3 = nn.BatchNorm2d(out_channels * self.expansion)
        self.relu = nn.ReLU()
        self.identity_downsample = identity_downsample

    def forward(self, x):
        identity = x
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.conv2(x)
        x = self.bn2(x)
        x = self.relu(x)
        x = self.conv3(x)
        x = self.bn3(x)
        if self.identity_downsample is not None:
            # Option A or Option B (here a 1x1 projection, i.e. Option B)
            identity = self.identity_downsample(identity)
        x += identity
        x = self.relu(x)
        return x

class ResNet(nn.Module):
    def __init__(self, block, layers, image_channels, num_classes):
        super(ResNet, self).__init__()
        # for ResNet-50 the blocks are stacked 3, 4, 6, 3 times
        # conv1: the input has 3 channels; this convolution maps it to 64
        self.in_channels = 64
        self.conv1 = nn.Conv2d(image_channels, 64, kernel_size=7, stride=2, padding=3)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU()
        # conv2_x starts with a 3x3 max pool, stride 2
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        # in principle the stages could be written out one by one, e.g.
        # self.layer1 = ...
        # self.layer2 = ...
        # instead a helper function builds each stage for us
        self.layer1 = self._make_layer(block, layers[0], out_channels=64, stride=1)
        self.layer2 = self._make_layer(block, layers[1], out_channels=128, stride=2)
        self.layer3 = self._make_layer(block, layers[2], out_channels=256, stride=2)
        self.layer4 = self._make_layer(block, layers[3], out_channels=512, stride=2)
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * 4, num_classes)

    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        x = self.avgpool(x)
        x = x.reshape(x.shape[0], -1)
        x = self.fc(x)
        return x

    def _make_layer(self, block, num_residual_blocks, out_channels, stride):
        identity_downsample = None
        layers = []
        # if the shortcut cannot be added directly (the channel count or spatial size changes),
        # e.g. the first block of conv2_x maps 64 channels to 256, and 64 cannot simply be added onto 256
        if stride != 1 or self.in_channels != out_channels * 4:
            identity_downsample = nn.Sequential(
                nn.Conv2d(self.in_channels, out_channels * 4, kernel_size=1, stride=stride),
                nn.BatchNorm2d(out_channels * 4))
        layers.append(block(self.in_channels, out_channels, identity_downsample, stride))
        self.in_channels = out_channels * 4
        for i in range(num_residual_blocks - 1):
            layers.append(block(self.in_channels, out_channels))
        return nn.Sequential(*layers)

def ResNet50(img_channels, num_classes=1000):
    return ResNet(block, [3, 4, 6, 3], img_channels, num_classes)
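A quick shape check for the implementation above (my own test, not from the referenced video):

model = ResNet50(img_channels=3, num_classes=1000)
x = torch.randn(2, 3, 224, 224)
print(model(x).shape)  # expected: torch.Size([2, 1000])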
2
神經(jīng)網(wǎng)絡(luò)學(xué)習(xí)小記錄20——ResNet50模型的復(fù)現(xiàn)詳解_resnet50復(fù)現(xiàn)-CSDN博客
https://github.com/pytorch/vision/blob/main/torchvision/models/resnet.py
ResNet50 with PyTorch | Kaggle
GitHub - JayPatwardhan/ResNet-PyTorch: Basic implementation of ResNet 50, 101, 152 in PyTorch
3
Writing ResNet from Scratch in PyTorch
Similar to the code in 1; quoted below:
class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1, downsample=None):
        super(ResidualBlock, self).__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=stride, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU())
        self.conv2 = nn.Sequential(
            nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_channels))
        self.downsample = downsample
        self.relu = nn.ReLU()
        self.out_channels = out_channels

    def forward(self, x):
        residual = x
        out = self.conv1(x)
        out = self.conv2(out)
        if self.downsample:
            residual = self.downsample(x)
        out += residual
        out = self.relu(out)
        return out

class ResNet(nn.Module):
    def __init__(self, block, layers, num_classes=10):
        super(ResNet, self).__init__()
        self.inplanes = 64
        self.conv1 = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm2d(64),
            nn.ReLU())
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.layer0 = self._make_layer(block, 64, layers[0], stride=1)
        self.layer1 = self._make_layer(block, 128, layers[1], stride=2)
        self.layer2 = self._make_layer(block, 256, layers[2], stride=2)
        self.layer3 = self._make_layer(block, 512, layers[3], stride=2)
        self.avgpool = nn.AvgPool2d(7, stride=1)
        self.fc = nn.Linear(512, num_classes)

    def _make_layer(self, block, planes, blocks, stride=1):
        downsample = None
        if stride != 1 or self.inplanes != planes:
            downsample = nn.Sequential(
                nn.Conv2d(self.inplanes, planes, kernel_size=1, stride=stride),
                nn.BatchNorm2d(planes))
        layers = []
        layers.append(block(self.inplanes, planes, stride, downsample))
        self.inplanes = planes
        for i in range(1, blocks):
            layers.append(block(self.inplanes, planes))
        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.conv1(x)
        x = self.maxpool(x)
        x = self.layer0(x)
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.avgpool(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x

num_classes = 10
num_epochs = 20
batch_size = 16
learning_rate = 0.01

model = ResNet(ResidualBlock, [3, 4, 6, 3]).to(device)

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, weight_decay=0.001, momentum=0.9)

# Train the model
total_step = len(train_loader)
Layers in PyTorch
Now coming to the different types of layers available in PyTorch that are useful to us:
nn.Conv2d: These are the convolutional layers that accept the number of input and output channels as arguments, along with the kernel size for the filter. They also accept strides or padding if we want to apply those.
nn.BatchNorm2d: This applies batch normalization to the output from the convolutional layer.
nn.ReLU: This is a type of activation function applied to various outputs in the network.
nn.MaxPool2d: This applies max pooling to the output with the given kernel size.
nn.Dropout: This is used to apply dropout to the output with a given probability.
nn.Linear: This is basically a fully connected layer.
nn.Sequential: This is technically not a type of layer, but it helps in combining different operations that are part of the same step.
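A small illustration of how these pieces are typically combined with nn.Sequential (my own example, not from the tutorial):

import torch.nn as nn

# a ResNet-style stem: convolution -> batch norm -> activation -> max pooling
stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
    nn.BatchNorm2d(64),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

# a classification head: dropout -> fully connected layer
head = nn.Sequential(nn.Dropout(p=0.5), nn.Linear(512, 10))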
The reason this looks different from the earlier ResNet-50 is that this implementation is the 34-layer version.
So there is no 1×1, 3×3, 1×1 bottleneck structure here, but in essence it is the same idea.
————————————————————————————
A small aside: although ResNet is often described these days as an old model, compared with earlier CNN approaches it really was an impressive methodological innovation, even if CNNs are now jokingly treated as last-century technology orz. It seems transformers rule everything now...