Using conda commands

Check the version

conda --version

conda -V

Install a package

conda install package-name[==version]

Uninstall a package

conda remove package-name

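conda uninstall package-name (removes the package together with the packages that depend on it)

conda uninstall package-name --force (removes only that package)

List installed packages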

conda list

List virtual environments

conda env list

Activate and deactivate an environment

conda activate env-name

conda deactivate

If activating an environment fails with the following error:

CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'. To initialize your shell, run…

Cause: the previous environment was not exited properly.

Fix:

source activate
conda deactivate

Create an environment

conda create -n your_env_name python=x.x

Delete a virtual environment

conda remove -n env_name --all
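After activating or deactivating an environment, it can help to confirm from inside Python which environment the interpreter actually belongs to. The following is a minimal sketch, assuming conda exports the `CONDA_DEFAULT_ENV` and `CONDA_PREFIX` variables in an activated shell; the helper name `describe_python_environment` is made up for this illustration.

```python
import os
import sys


def describe_python_environment(environ=None):
    """Report the interpreter prefix and the active conda environment, if any."""
    env = os.environ if environ is None else environ
    return {
        "prefix": sys.prefix,                        # where this interpreter lives
        "conda_env": env.get("CONDA_DEFAULT_ENV"),   # None when no conda env is active
        "conda_prefix": env.get("CONDA_PREFIX"),     # None when no conda env is active
    }


# Print the information for the current shell session.
print(describe_python_environment())
```

If `conda_env` is not the environment you expected, the shell probably did not activate it, which is worth checking before installing anything.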

Installing and using the transformers library

Installation

Environment:

OS: Ubuntu 18.04.6 LTS

Python 3.9.13

PyTorch: 2.0.1+cu117 (CUDA 11.7)

TensorFlow: 2.6.0

PyTorch official site: PyTorch

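Pitfall: I first installed transformers in a virtual environment that had only PyTorch and ran into many errors. Fixing an OpenSSL-related problem did not resolve them. Installing transformers in a virtual environment that had TensorFlow did work, but most of the example code I found uses transformers on top of PyTorch. Some posts reported that transformers installs cleanly when both PyTorch and TensorFlow are present, so I took that route and installed PyTorch into the existing TensorFlow environment first: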

pip3 install torch torchvision torchaudio

Or install with conda (check the PyTorch website for the CUDA version that matches your system and set pytorch-cuda accordingly):

conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

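Then install transformers, with either pip or conda:

pip install transformers  # install the latest version
pip install transformers==4.0  # install a specific version (no spaces around ==)
# If you use conda
conda install -c huggingface transformers  # only available for version 4.0 and later (this is the option I used)

Finally, test the installation (it works if the import raises no error):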

from transformers import pipeline
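Because the trouble described above came from which frameworks were present in the environment, it can be useful to list the installed versions in one go. This is a small sketch using only the standard library's `importlib.metadata`; the helper name `installed_versions` is an assumption for illustration, and the distribution names passed in are the ones pip reports.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Report the three libraries discussed in this section.
print(installed_versions(["torch", "tensorflow", "transformers"]))
```

A `None` entry means the package is not installed in the environment that is running the script.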

Usage

Article link

Huggingface 超详细介绍 - 知乎 (zhihu.com)

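Hugging Face open-sourced the Transformers library on GitHub. At the time of writing, the Hugging Face site shares more than 100,000 pretrained models and 10,000 datasets, and it has become the GitHub of the machine learning community.

Website: Models - Hugging Face

Contents

datasets: datasets and their download locations

models: the pretrained models

course: a free NLP course (in English)

docs: documentation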

BERT model

Loading

import torch
from transformers import BertModel, BertTokenizer, BertConfig
# Import the classes first, then load the tokenizer and configuration
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
config = BertConfig.from_pretrained('bert-base-chinese')
config.update({'output_hidden_states': True})  # change the model configuration directly
model = BertModel.from_pretrained("bert-base-chinese", config=config)

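The Hugging Face site is hosted outside China, so automatic downloads can be slow. The default download locations are:

1. Windows: models are saved under C:\Users\[username]\.cache\torch\transformers\ ; the files downloaded differ from model to model

2. Linux: models are saved under /home/zy/.cache/huggingface/hub/models--bert-base-chinese (zy is the user name on my machine)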
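The Linux path above follows a recognizable pattern: a `hub` directory under `~/.cache/huggingface`, then a folder named `models--` plus the model id. The sketch below rebuilds that path so you can check whether a model is already cached. It is an assumption based on the path shown here, not a call into the library: `model_cache_folder` is a made-up helper, and the rule that `/` in a model id becomes `--` is my understanding of the naming scheme.

```python
from pathlib import Path


def model_cache_folder(model_id, cache_root=None):
    """Guess the cache folder for a model id (assumed naming scheme, see lead-in)."""
    if cache_root is None:
        # Assumed default location on Linux, matching the path quoted above.
        root = Path.home() / ".cache" / "huggingface" / "hub"
    else:
        root = Path(cache_root)
    return root / ("models--" + model_id.replace("/", "--"))


folder = model_cache_folder("bert-base-chinese")
print(folder, "(already cached)" if folder.is_dir() else "(not found)")
```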

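If the automatic download keeps getting interrupted, consider a mirror inside China, or download the files manually and point the code at their location. On the Hugging Face site, open the Models menu, search for the model you want, and download its files. The large weight files come in TensorFlow and PyTorch variants, so download the one you need.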

import transformers
MODEL_PATH = r"D:\test\bert-base-chinese"
# Load the tokenizer from the downloaded vocabulary file
tokenizer = transformers.BertTokenizer.from_pretrained(r"D:\test\bert-base-chinese\bert-base-chinese-vocab.txt")
# Load the configuration file
model_config = transformers.BertConfig.from_pretrained(MODEL_PATH)
# Modify the configuration
model_config.output_hidden_states = True
model_config.output_attentions = True
# Load the model from the local path with the modified configuration
model = transformers.BertModel.from_pretrained(MODEL_PATH, config=model_config)
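When loading from a manually downloaded folder, a missing file is a common cause of confusing errors, so a quick check before calling `from_pretrained` can save time. The sketch below is only an illustration: `missing_model_files` is a hypothetical helper, and the default file names (`config.json`, `vocab.txt`, `pytorch_model.bin`) are assumptions that you should adjust to whatever you actually downloaded.

```python
import tempfile
from pathlib import Path


def missing_model_files(model_dir, required=("config.json", "vocab.txt", "pytorch_model.bin")):
    """Return the names in `required` that are not regular files inside model_dir."""
    directory = Path(model_dir)
    return [name for name in required if not (directory / name).is_file()]


# Demonstration on a throwaway folder that contains only config.json.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text("{}")
    print(missing_model_files(tmp))
```

In real use you would pass the download folder instead of the temporary one and only load the model when the returned list is empty.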

Usage

tokenizer

The tokenizer here is an instance of the BertTokenizer class, which is based on the WordPiece method. Its parameters are:

( vocab_file: path to the vocabulary file, do_lower_case = True: whether to lowercase the input (default True), do_basic_tokenize = True: whether to run basic tokenization before WordPiece, never_split = None, unk_token = '[UNK]', sep_token = '[SEP]', pad_token = '[PAD]', cls_token = '[CLS]', mask_token = '[MASK]', tokenize_chinese_chars = True, strip_accents = None, **kwargs )
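To make the WordPiece idea concrete, here is a simplified sketch of greedy longest-match-first splitting of a single word. It is not the library's implementation: the function `wordpiece_tokenize` and the toy vocabulary are invented for illustration, and details such as a maximum word length are left out. Pieces that continue a word carry the `##` prefix, and a word that cannot be fully covered falls back to the unknown token.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest vocabulary pieces, scanning left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces are prefixed
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


toy_vocab = {"un", "##aff", "##able", "play", "##ing"}  # hypothetical vocabulary
print(wordpiece_tokenize("unaffable", toy_vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("playing", toy_vocab))    # ['play', '##ing']
```

For Chinese text the picture is simpler, because with tokenize_chinese_chars = True each Chinese character is treated as its own word before this step.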

# The tokenizer was already instantiated in the example above, so that is not repeated here
print(tokenizer.encode("生活的真谛是美和爱"))  # encode a single sentence
print(tokenizer.encode_plus("生活的真谛是美和爱","说的太好了"))  # encode a sentence pair
# Output:
[101, 4495, 3833, 4638, 4696, 6465, 3221, 5401, 1469, 4263, 102]
{'input_ids': [101, 4495, 3833, 4638, 4696, 6465, 3221, 5401, 1469, 4263, 102, 6432, 4638, 1922, 1962, 749, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

# The tokenizer can also be called directly
sentences = ['网络安全开发分为三个层级',
             '车辆系统层级网络安全开发',
             '车辆功能层级网络安全开发',
             '车辆零部件层级网络安全开发',
             '测试团队根据车辆网络安全目标制定测试技术要求及测试计划',
             '测试团队在网络安全团队的支持下，完成确认测试并编制测试报告',
             '在车辆确认结果的基础上，基于合理的理由，确认在设计和开发阶段识别出的所有风险均已被接受',]
test1 = tokenizer(sentences)

print(test1)  # encode a list of sentences
# Output:
{'input_ids': [[101, 5381, 5317, 2128, 1059, 2458, 1355, 1146, 711, 676, 702, 2231, 5277, 102], [101, 6756, 6775, 5143, 5320, 2231, 5277, 5381, 5317, 2128, 1059, 2458, 1355, 102], [101, 6756, 6775, 1216, 5543, 2231, 5277, 5381, 5317, 2128, 1059, 2458, 1355, 102], [101, 6756, 6775, 7439, 6956, 816, 2231, 5277, 5381, 5317, 2128, 1059, 2458, 1355, 102], [101, 3844, 6407, 1730, 7339, 3418, 2945, 6756, 6775, 5381, 5317, 2128, 1059, 4680, 3403, 1169, 2137, 3844, 6407, 2825, 3318, 6206, 3724, 1350, 3844, 6407, 6369, 1153, 102], [101, 3844, 6407, 1730, 7339, 1762, 5381, 5317, 2128, 1059, 1730, 7339, 4638, 3118, 2898, 678, 8024, 2130, 2768, 4802, 6371, 3844, 6407, 2400, 5356, 1169, 3844, 6407, 2845, 1440, 102], [101, 1762, 6756, 6775, 4802, 6371, 5310, 3362, 4638, 1825, 4794, 677, 8024, 1825, 754, 1394, 4415, 4638, 4415, 4507, 8024, 4802, 6371, 1762, 6392, 6369, 1469, 2458, 1355, 7348, 3667, 6399, 1166, 1139, 4638, 2792, 3300, 7599, 7372, 1772, 2347, 6158, 2970, 1358, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
# 101 is [CLS], 102 is [SEP]
print(tokenizer("网络安全开发分为三个层级"))  # encode a single sentence
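The three fields in these outputs follow a simple layout, which the sketch below rebuilds by hand from token ids. It is an illustration, not the tokenizer's code: `build_pair_inputs` and `pad_batch` are made-up helpers, the ids 101 and 102 come from the output above, and the padding id 0 is an assumption about where `[PAD]` sits in the vocabulary. As far as I know, the real tokenizer can pad for you when called with `padding=True`.

```python
CLS_ID, SEP_ID = 101, 102  # taken from the output above
PAD_ID = 0                 # assumed id of [PAD]


def build_pair_inputs(ids_a, ids_b=None):
    """Lay out [CLS] A [SEP] (B [SEP]) with segment ids and an all-ones mask."""
    input_ids = [CLS_ID] + list(ids_a) + [SEP_ID]
    token_type_ids = [0] * len(input_ids)        # segment 0: first sentence
    if ids_b is not None:
        segment_b = list(ids_b) + [SEP_ID]
        input_ids += segment_b
        token_type_ids += [1] * len(segment_b)   # segment 1: second sentence
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": [1] * len(input_ids),
    }


def pad_batch(batch_input_ids, pad_id=PAD_ID):
    """Pad every sequence to the longest one; the mask is 0 on padding positions."""
    max_len = max(len(ids) for ids in batch_input_ids)
    padded, masks = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        padded.append(list(ids) + [pad_id] * n_pad)
        masks.append([1] * len(ids) + [0] * n_pad)
    return padded, masks


# Token ids of the two sentences from the encode_plus example, without special tokens.
sentence_a = [4495, 3833, 4638, 4696, 6465, 3221, 5401, 1469, 4263]
sentence_b = [6432, 4638, 1922, 1962, 749]
print(build_pair_inputs(sentence_a, sentence_b))
print(pad_batch([[101, 4495, 102], [101, 6432, 4638, 102]]))
```

Padding matters because the list output above has rows of different lengths, which cannot be stacked into a single tensor as they are.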

model

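The model here is an instance of the BertModel class. Beyond base models such as BERT and GPT, the library defines models for specific downstream tasks, including BertForQuestionAnswering, BertForMultipleChoice, BertForNextSentencePrediction, and BertForSequenceClassification. Saving a model produces two files: config.json, the configuration file, and pytorch_model.bin, the weights saved by PyTorch after training.

For example, using the bert-base-uncased model for the masked language modeling (MLM) task: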

from transformers import pipeline
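# Running this code requires internet access; it automatically downloads the pretrained model (about 420 MB)
unmasker = pipeline("fill-mask", model="bert-base-uncased")  # "fill-mask" is the task; it is run here with the base BERT model
unmasker("The goal of life is [MASK].", top_k=5)  # returns the 5 highest-scoring candidates for the mask; top_k can be set to other values
# Output (none of the candidates looks especially convincing):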
[{'score': 0.10933303833007812,
  'token': 2166,
  'token_str': 'life',
  'sequence': 'the goal of life is life.'},
 {'score': 0.03941883146762848,
  'token': 7691,
  'token_str': 'survival',
  'sequence': 'the goal of life is survival.'},
 {'score': 0.032930608838796616,
  'token': 2293,
  'token_str': 'love',
  'sequence': 'the goal of life is love.'},
 {'score': 0.030096106231212616,
  'token': 4071,
  'token_str': 'freedom',
  'sequence': 'the goal of life is freedom.'},
 {'score': 0.024967126548290253,
  'token': 17839,
  'token_str': 'simplicity',
  'sequence': 'the goal of life is simplicity.'}]
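The `score` values are probabilities: as I understand it, the model produces one raw score (logit) per vocabulary entry at the `[MASK]` position, a softmax turns those into probabilities, and the pipeline keeps the `top_k` largest. The sketch below shows that step in plain Python. The logits are invented numbers for five candidate words, not values from bert-base-uncased, and `top_k_probabilities` is a hypothetical helper.

```python
import math


def top_k_probabilities(logits, k):
    """Softmax over a {token: logit} mapping, then keep the k most probable tokens."""
    max_logit = max(logits.values())  # subtract the max for numerical stability
    exps = {token: math.exp(value - max_logit) for token, value in logits.items()}
    total = sum(exps.values())
    probs = {token: e / total for token, e in exps.items()}
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]


# Made-up logits for a tiny five-word vocabulary.
toy_logits = {"life": 2.0, "survival": 1.0, "love": 0.8, "freedom": 0.7, "banana": -3.0}
for token, prob in top_k_probabilities(toy_logits, k=3):
    print(f"{token}: {prob:.3f}")
```

With a real vocabulary of tens of thousands of entries the probability mass is spread thin, which is one reason even the best candidate above scores only about 0.11.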