How to Train a Vision Transformer (ViT) from Scratch

Original article: towardsdatascience.com/how-to-train-a-vision-transformer-vit-from-scratch-f26641f26af2

Hi everyone! For those who don't know me yet, my name is François and I am a research scientist at Meta. I am passionate about explaining advanced AI concepts and making them more accessible.

Today, let's dive into one of the most significant contributions to computer vision: the Vision Transformer (ViT). This article focuses on a state-of-the-art implementation of the Vision Transformer as it has evolved since its release. To fully understand how a ViT works, I strongly recommend reading my other post on the theoretical foundations, "The Ultimate Guide to Vision Transformers".

ViT architecture (image from the original paper):
https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/cb4b6029679c9aa4d031080f4ea9ad42.png

1. The Attention Layer

Attention layer (image by the author):
https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/ef0e7fbdbbec052c3f903192be6218f0.png

Let's start with the most famous building block of the Transformer encoder: the attention layer.

```python
import torch
from torch import nn
from einops import rearrange, repeat
from einops.layers.torch import Rearrange


class Attention(nn.Module):
    def __init__(self, dim, heads=8, dim_head=64, dropout=0.):
        super().__init__()
        # Total inner dimension based on the number of attention heads and the dimension per head
        inner_dim = dim_head * heads
        # A final projection layer is only needed if the concatenated heads do not already match dim
        project_out = not (heads == 1 and dim_head == dim)

        self.heads = heads                  # Number of attention heads
        self.scale = dim_head ** -0.5       # Scaling factor for the attention scores (1 / sqrt(dim_head))

        self.norm = nn.LayerNorm(dim)       # Layer normalization to stabilize training and improve convergence
        self.attend = nn.Softmax(dim=-1)    # Softmax along the last dimension to compute attention weights
        self.dropout = nn.Dropout(dropout)  # Dropout layer for regularization during training

        # Linear layer that projects the input tensor into queries, keys, and values in one shot
        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias=False)

        # Projection back to the original dimension after attention, if required
        self.to_out = nn.Sequential(
            nn.Linear(inner_dim, dim),       # Project the concatenated head outputs back to dim
            nn.Dropout(dropout)              # Dropout layer for regularization
        ) if project_out else nn.Identity()  # Identity (no change) if no projection is needed

    def forward(self, x):
        x = self.norm(x)  # Apply normalization to the input tensor

        # Compute queries, keys, and values, then split into 3 tensors along the last dimension
        qkv = self.to_qkv(x).chunk(3, dim=-1)

        # Reshape each chunk from (batch_size, num_patches, inner_dim)
        # to (batch_size, num_heads, num_patches, dim_head)
        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h=self.heads), qkv)

        # Dot products between queries and keys, scaled by 1 / sqrt(dim_head)
        # Shape: (batch_size, num_heads, num_patches, num_patches)
        dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale

        # Softmax to get the attention weights
        attn = self.attend(dots)  # Shape: (batch_size, num_heads, num_patches, num_patches)
        attn = self.dropout(attn)

        # Multiply the attention weights by the values
        out = torch.matmul(attn, v)  # Shape: (batch_size, num_heads, num_patches, dim_head)

        # Merge the heads back: (batch_size, num_patches, inner_dim)
        out = rearrange(out, 'b h n d -> b n (h d)')

        # Project back to the original input dimension if needed
        out = self.to_out(out)  # Shape: (batch_size, num_patches, dim)
        return out
```

Key points:

- inner_dim is the product of dim_head and the number of heads. To vectorize and speed up computation, we merge these two dimensions before the tensor products.
- For speed, we do not initialize Q, K, and V separately; we concatenate them into one large projection, self.to_qkv, so everything is computed in a single pass.
- einops is a very useful library that lets you rearrange tensors by naming their dimensions, and it is very intuitive. For example, if you have a tensor of shape (batch_size, n_tokens, number_heads × head_dim) and you want to split the last dimension into (batch_size, n_tokens, number_heads, head_dim), you can use rearrange(qkv, 'b n (h d) -> b n h d', h=num_heads). This is very useful for keeping track of the dimensions you are manipulating.
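To make the einops pattern concrete, here is a minimal, self-contained sketch. The sizes (a batch of 2, 197 tokens, 8 heads of 64 dimensions each) are made-up illustrative values, not anything prescribed by the code above:

```python
import torch
from einops import rearrange

# Hypothetical sizes for illustration only: batch of 2, 197 tokens, 8 heads of 64 dims each
x = torch.randn(2, 197, 8 * 64)

# Split the last dimension into (heads, head_dim) and move heads next to the batch dimension
y = rearrange(x, 'b n (h d) -> b h n d', h=8)
print(y.shape)  # torch.Size([2, 8, 197, 64])

# The inverse pattern merges the heads back into a single embedding dimension
z = rearrange(y, 'b h n d -> b n (h d)')
print(z.shape)  # torch.Size([2, 197, 512])

# rearrange only moves elements around; the round trip gives back the original tensor exactly
assert torch.equal(x, z)
```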
2. The Feed-Forward Network (FFN)

Feed-forward network (image by the author):
https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/fdb66c1f8fd915973cc8a08b90a4b45d.png

Next, we add the second Transformer building block: the feed-forward network.

```python
class FFN(nn.Module):
    def __init__(self, dim, hidden_dim, dropout=0.):
        super().__init__()
        # norm -> linear -> activation -> dropout -> linear -> dropout
        self.net = nn.Sequential(
            nn.LayerNorm(dim),           # We first normalize with a layer norm
            nn.Linear(dim, hidden_dim),  # Project into the higher dimension hidden_dim
            nn.GELU(),                   # Apply the GELU activation function
            nn.Dropout(dropout),         # Apply dropout
            nn.Linear(hidden_dim, dim),  # Project back to the original dimension dim
            nn.Dropout(dropout)          # Apply dropout
        )

    def forward(self, x):
        return self.net(x)
```

Nothing complicated here. You just need to understand that the FFN is two MLPs in a row: the first one usually projects the data into a higher dimension, and the second one projects it back to the input dimension, which is why we have dim and hidden_dim.

Key points:

- dim: the dimension of the input tokens.
- hidden_dim: the intermediate dimension of the FFN.
- GELU: an activation function. The original paper used ReLU, but GELU has become more popular thanks to its smoother transition.

3. The Transformer Encoder: a stack of L Transformer layers

Transformer encoder (image by the author):
https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/5a2259b75e0d5a76dd274c09ef65da7e.png

With the attention layer and the feed-forward network in place, we can assemble a Transformer layer. The Transformer encoder is essentially a stack of L Transformer layers.

Remember, Transformer layers are like Lego bricks: the input dimension is the same as the output dimension, so you can stack as many as you want (or as many as your memory allows). Don't forget the residual connections; they are essential for keeping gradients flowing and making optimization smoother.

```python
class Transformer(nn.Module):
    def __init__(self, dim, depth, heads, dim_head, mlp_dim_ratio, dropout):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.layers = nn.ModuleList([])
        mlp_dim = mlp_dim_ratio * dim

        for _ in range(depth):
            self.layers.append(nn.ModuleList([
                Attention(dim=dim, heads=heads, dim_head=dim_head, dropout=dropout),
                FFN(dim=dim, hidden_dim=mlp_dim, dropout=dropout)
            ]))

    def forward(self, x):
        for attn, ffn in self.layers:
            x = attn(x) + x  # Residual connection around the attention layer
            x = ffn(x) + x   # Residual connection around the FFN
        return self.norm(x)
```

Assembling the final ViT

We have already done the hardest part; now we can assemble the full Vision Transformer. We mainly need to add 3 components:

- Turn the image into patches, and then into vectors.
- Add the positional embeddings.
- Add the CLS token.

Patchifying the image (image by the author):
https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/793214ef1880c42e62377c2910750d7e.png

Turning the image into patches (image by the author):
https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/f9b65adf7b4694ec9e1c4d981b4df801.png

First, we define a small utility function that helps us convert a scalar into a tuple.

```python
def pair(t):
    """Converts a single value into a tuple of two values.

    If t is already a tuple, it is returned as is.

    Args:
        t: A single value or a tuple.

    Returns:
        A tuple where both elements are t if t is not a tuple.
    """
    return t if isinstance(t, tuple) else (t, t)
```

Now we are ready to write the ViT itself. Let's start with a few sanity checks: we need to verify that the image splits into a whole number of patches. In other words, we need to check that image_height and image_width are divisible by the patch dimensions.

```python
class ViT(nn.Module):
    def __init__(self, *, image_size, patch_size, num_classes, dim, depth, heads,
                 mlp_dim_ratio, pool='cls', channels=3, dim_head=64, dropout=0.):
        """Initializes a Vision Transformer (ViT) model.

        Args:
            image_size (int or tuple): Size of the input image (height, width).
            patch_size (int or tuple): Size of each patch (height, width).
            num_classes (int): Number of output classes.
            dim (int): Dimension of the embedding space.
            depth (int): Number of transformer layers.
            heads (int): Number of attention heads.
            mlp_dim_ratio (int): Ratio between the FFN hidden dimension and dim.
            pool (str): Pooling strategy ('cls' or 'mean').
            channels (int): Number of input channels (e.g., 3 for RGB images).
            dim_head (int): Dimension of each attention head.
            dropout (float): Dropout rate.
        """
        super().__init__()

        # Convert image size and patch size to tuples if they are single values
        image_height, image_width = pair(image_size)
        patch_height, patch_width = pair(patch_size)

        # Ensure that the image dimensions are divisible by the patch size
        assert image_height % patch_height == 0 and image_width % patch_width == 0, \
            'Image dimensions must be divisible by the patch size.'

        # Calculate the number of patches and the dimension of each patch
        num_patches = (image_height // patch_height) * (image_width // patch_width)
        patch_dim = channels * patch_height * patch_width
```
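Before moving on, it can help to plug concrete numbers into this arithmetic. The following sketch assumes the classic ViT-Base/16 setup (224×224 RGB images with 16×16 patches); these sizes are an assumption for illustration, not values fixed by the class above:

```python
# Patch arithmetic for an assumed 224x224 RGB image with 16x16 patches
image_height, image_width = 224, 224
patch_height, patch_width = 16, 16
channels = 3

# Same divisibility check as in the constructor above
assert image_height % patch_height == 0 and image_width % patch_width == 0

num_patches = (image_height // patch_height) * (image_width // patch_width)
patch_dim = channels * patch_height * patch_width

print(num_patches)  # 196 -> the encoder will see 196 patch tokens (plus 1 CLS token)
print(patch_dim)    # 768 -> each flattened patch is a 16 * 16 * 3 = 768-dimensional vector
```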
The next step is to turn the patches into embeddings. Remember that an image has C = 3 channels here; we need to flatten that dimension and squeeze each patch into a vector of dimension patch_size × patch_size × C.

```python
        # Define the patch embedding layer
        self.to_patch_embedding = nn.Sequential(
            # Rearrange the input tensor to (batch_size, num_patches, patch_dim)
            Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=patch_height, p2=patch_width),
            nn.LayerNorm(patch_dim),    # Normalize each patch
            nn.Linear(patch_dim, dim),  # Project patches to the embedding dimension
            nn.LayerNorm(dim)           # Normalize the embedding
        )
```

Then we need to define the CLS token and the positional embeddings. The CLS token helps represent the whole image as a single vector, while the positional embeddings give the model spatial awareness of the tokens. Both are learnable parameters, randomly initialized.

```python
        # Ensure the pooling strategy is valid
        assert pool in {'cls', 'mean'}, 'pool type must be either cls (cls token) or mean (mean pooling)'

        # Define the CLS token and the positional embeddings
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim))                    # Learnable class token
        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))  # Positional embeddings for patches and class token
        self.dropout = nn.Dropout(dropout)                                       # Dropout for regularization
```

Now we just need to instantiate the Transformer encoder we defined earlier and add a classification head:

```python
        # Define the transformer encoder
        self.transformer = Transformer(dim, depth, heads, dim_head, mlp_dim_ratio, dropout)

        # Pooling strategy (cls token or mean of patches)
        self.pool = pool

        # Identity layer (no change to the tensor)
        self.to_latent = nn.Identity()

        # Classification head
        self.mlp_head = nn.Linear(dim, num_classes)
```
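If you want to see the patchify step in isolation, here is a small sketch that applies the same Rearrange pattern to a dummy batch. The sizes are again the assumed 224×224 / 16×16 configuration, not values fixed by the class above:

```python
import torch
from einops.layers.torch import Rearrange

# The same patchify pattern as in to_patch_embedding, applied on its own
patchify = Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=16, p2=16)

img = torch.randn(2, 3, 224, 224)  # Dummy batch of 2 RGB images
patches = patchify(img)
print(patches.shape)  # torch.Size([2, 196, 768]) -> (batch_size, num_patches, patch_dim)
```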
The forward pass

We have initialized all the components of the ViT; now we just need to call them in the right order for the forward pass.

We first turn the input image into patches and flatten each patch into a vector. Then we repeat the CLS token along the batch dimension and concatenate it along dimension 1, the sequence length: we learn the parameters of a single vector, but it has to be attached to every image in the batch, which is why we need to expand that dimension. Finally, we add the positional embeddings to every token.

```python
    def forward(self, img):
        """Forward pass through the Vision Transformer model.

        Args:
            img (Tensor): Input image tensor of shape (batch_size, channels, height, width).

        Returns:
            dict: A dictionary containing the class token, feature map, and classification result.
        """
        # Convert the image to patch embeddings
        x = self.to_patch_embedding(img)  # Shape: (batch_size, num_patches, dim)
        b, n, _ = x.shape                 # Batch size, number of patches, embedding dimension

        # Repeat the class token for each image in the batch
        cls_tokens = repeat(self.cls_token, '1 1 d -> b 1 d', b=b)

        # Concatenate the class token with the patch embeddings
        x = torch.cat((cls_tokens, x), dim=1)

        # Add the positional embeddings to the input
        x += self.pos_embedding[:, :(n + 1)]

        # Apply dropout for regularization
        x = self.dropout(x)
```

Then we apply the Transformer encoder. We mainly use it to build an output containing 3 things:

- The CLS token: a single-vector representation of the image.
- The feature map: one vector per image patch.
- The classification head logits (optional): used for classification tasks. Note that Vision Transformers can be used for many different tasks, but classification was the one they were originally introduced for.

```python
        # Pass through the transformer encoder
        x = self.transformer(x)  # Shape: (batch_size, num_patches + 1, dim)

        # Extract the class token and the feature map
        cls_token = x[:, 0]      # Class token
        feature_map = x[:, 1:]   # Remaining tokens form the feature map

        # Pooling: cls token or mean of the patch tokens
        pooled_output = cls_token if self.pool == 'cls' else feature_map.mean(dim=1)

        # Identity transformation (no change to the tensor)
        pooled_output = self.to_latent(pooled_output)

        # Apply the classification head to the pooled output
        classification_result = self.mlp_head(pooled_output)

        # Return a dictionary with the required components
        return {
            'cls_token': cls_token,                               # Class token
            'feature_map': feature_map,                           # Feature map (patch embeddings)
            'classification_head_logits': classification_result  # Final classification result
        }
```
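Here is a quick smoke test of the full model. It assumes the ViT, Transformer, Attention, and FFN classes above (and the imports from the first code block) are in scope; the hyperparameters are small illustrative values, not the settings used in the paper:

```python
# Instantiate a small ViT (hypothetical hyperparameters, chosen only to keep the test fast)
model = ViT(
    image_size=224, patch_size=16, num_classes=10,
    dim=192, depth=6, heads=3, mlp_dim_ratio=4,
    pool='cls', channels=3, dim_head=64, dropout=0.1,
)

images = torch.randn(4, 3, 224, 224)  # Dummy batch of 4 RGB images
labels = torch.randint(0, 10, (4,))   # Dummy class labels

out = model(images)
print(out['cls_token'].shape)                   # torch.Size([4, 192])
print(out['feature_map'].shape)                 # torch.Size([4, 196, 192])
print(out['classification_head_logits'].shape)  # torch.Size([4, 10])

# For classification, one would typically train with cross-entropy on the logits
loss = nn.functional.cross_entropy(out['classification_head_logits'], labels)
loss.backward()
```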
To wrap up, here is the final code for the ViT. You can find an updated version in this GitHub repository: GitHub – FrancoisPorcher/awesome-ai-tutorials: The best collection of AI tutorials to make you a…

```python
class ViT(nn.Module):
    def __init__(self, *, image_size, patch_size, num_classes, dim, depth, heads,
                 mlp_dim_ratio, pool='cls', channels=3, dim_head=64, dropout=0.):
        """Initializes a Vision Transformer (ViT) model.

        Args:
            image_size (int or tuple): Size of the input image (height, width).
            patch_size (int or tuple): Size of each patch (height, width).
            num_classes (int): Number of output classes.
            dim (int): Dimension of the embedding space.
            depth (int): Number of transformer layers.
            heads (int): Number of attention heads.
            mlp_dim_ratio (int): Ratio between the FFN hidden dimension and dim.
            pool (str): Pooling strategy ('cls' or 'mean').
            channels (int): Number of input channels (e.g., 3 for RGB images).
            dim_head (int): Dimension of each attention head.
            dropout (float): Dropout rate.
        """
        super().__init__()

        # Convert image size and patch size to tuples if they are single values
        image_height, image_width = pair(image_size)
        patch_height, patch_width = pair(patch_size)

        # Ensure that the image dimensions are divisible by the patch size
        assert image_height % patch_height == 0 and image_width % patch_width == 0, \
            'Image dimensions must be divisible by the patch size.'

        # Calculate the number of patches and the dimension of each patch
        num_patches = (image_height // patch_height) * (image_width // patch_width)
        patch_dim = channels * patch_height * patch_width

        # Define the patch embedding layer
        self.to_patch_embedding = nn.Sequential(
            # Rearrange the input tensor to (batch_size, num_patches, patch_dim)
            Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=patch_height, p2=patch_width),
            nn.LayerNorm(patch_dim),    # Normalize each patch
            nn.Linear(patch_dim, dim),  # Project patches to the embedding dimension
            nn.LayerNorm(dim)           # Normalize the embedding
        )

        # Ensure the pooling strategy is valid
        assert pool in {'cls', 'mean'}, 'pool type must be either cls (cls token) or mean (mean pooling)'

        # Define the CLS token and the positional embeddings
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim))                    # Learnable class token
        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))  # Positional embeddings for patches and class token
        self.dropout = nn.Dropout(dropout)                                       # Dropout for regularization

        # Define the transformer encoder
        self.transformer = Transformer(dim, depth, heads, dim_head, mlp_dim_ratio, dropout)

        # Pooling strategy (cls token or mean of patches)
        self.pool = pool

        # Identity layer (no change to the tensor)
        self.to_latent = nn.Identity()

        # Classification head
        self.mlp_head = nn.Linear(dim, num_classes)

    def forward(self, img):
        """Forward pass through the Vision Transformer model.

        Args:
            img (Tensor): Input image tensor of shape (batch_size, channels, height, width).

        Returns:
            dict: A dictionary containing the class token, feature map, and classification result.
        """
        # Convert the image to patch embeddings
        x = self.to_patch_embedding(img)  # Shape: (batch_size, num_patches, dim)
        b, n, _ = x.shape                 # Batch size, number of patches, embedding dimension

        # Repeat the class token for each image in the batch
        cls_tokens = repeat(self.cls_token, '1 1 d -> b 1 d', b=b)

        # Concatenate the class token with the patch embeddings
        x = torch.cat((cls_tokens, x), dim=1)

        # Add the positional embeddings to the input
        x += self.pos_embedding[:, :(n + 1)]

        # Apply dropout for regularization
        x = self.dropout(x)

        # Pass through the transformer encoder
        x = self.transformer(x)  # Shape: (batch_size, num_patches + 1, dim)

        # Extract the class token and the feature map
        cls_token = x[:, 0]      # Class token
        feature_map = x[:, 1:]   # Remaining tokens form the feature map

        # Pooling: cls token or mean of the patch tokens
        pooled_output = cls_token if self.pool == 'cls' else feature_map.mean(dim=1)

        # Identity transformation (no change to the tensor)
        pooled_output = self.to_latent(pooled_output)

        # Apply the classification head to the pooled output
        classification_result = self.mlp_head(pooled_output)

        # Return a dictionary with the required components
        return {
            'cls_token': cls_token,                               # Class token
            'feature_map': feature_map,                           # Feature map (patch embeddings)
            'classification_head_logits': classification_result  # Final classification result
        }
```

Congratulations, you have built a Vision Transformer from scratch!

Thanks for reading! Before you go: for more great tutorials, check out my compilation of AI tutorials on GitHub: GitHub – FrancoisPorcher/awesome-ai-tutorials: The best collection of AI tutorials to make you a…

You should get my articles in your inbox. Subscribe here.

If you want access to premium articles on Medium, a membership costs only $5 a month. If you sign up through my link, part of your fee supports me at no extra cost to you.

If you found this article insightful and helpful, please consider following me and leaving a clap for more in-depth content. Your support helps me keep producing material that furthers our collective understanding.

References

- "An Image is Worth 16×16 Words", Alexey Dosovitskiy et al. (2021). The full paper is available on arXiv.