Qwen-VL is a large vision language model (LVLM) developed by Alibaba Cloud. It accepts images, text, and bounding boxes as input, and produces text and bounding boxes as output. The Qwen-VL series performs strongly, supports multilingual dialogue and interleaved multi-image dialogue, and offers Chinese open-domain grounding as well as fine-grained image recognition and understanding.
Two models are currently released: Qwen-VL, the pretrained base model, and Qwen-VL-Chat, the chat-aligned model. For more information about the models, refer to the technical memo.
The following dependencies need to be installed first:
pip install matplotlib tiktoken
pip install transformers_stream_generator
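Before pulling the multi-shard Qwen-VL-Chat checkpoint, it can be worth confirming that the dependencies import cleanly and that the GPUs are visible. The snippet below is a minimal sanity check; it only assumes the packages installed above plus PyTorch and transformers, which Qwen-VL requires anyway.

# Sanity-check the environment before downloading the Qwen-VL-Chat checkpoint.
import torch
import transformers
import tiktoken  # imported only to confirm the package is installed

print("transformers version:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  cuda:{i} ->", torch.cuda.get_device_name(i))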
1. Loading Qwen-VL on a Single GPU
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import time

torch.manual_seed(1234)

MODEL_NAME = "Qwen/Qwen-VL-Chat"

# Detect the number of available GPUs
NUM_GPUS = torch.cuda.device_count()
print(f"NUM_GPUS: {NUM_GPUS}")

# Record the start timestamp
start_time = time.time()

# Load the tokenizer and model, specifying the device map and dtype
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
# use bf16
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, trust_remote_code=True, bf16=True, device_map="cuda:0")
model = model.eval()

end_time = time.time()
elapsed_time = end_time - start_time
print(f"Load Model Time: {elapsed_time} seconds")
print(model)

start_time2 = time.time()

# 1st dialogue turn
query = tokenizer.from_list_format([
    {'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'},
    {'text': '这是什么'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
# 图中是一名年轻女子在沙滩上和她的狗玩耍,狗的品种可能是拉布拉多。她们坐在沙滩上,狗的前腿抬起来,似乎在和人类击掌。两人之间充满了信任和爱。

# 2nd dialogue turn
response, history = model.chat(tokenizer, '输出"击掌"的检测框', history=history)
end_time2 = time.time()
elapsed_time2 = end_time2 - start_time2
print(response)
# <ref>击掌</ref><box>(517,508),(589,611)</box>

# Draw the returned bounding box onto the most recent image and save it
image = tokenizer.draw_bbox_on_latest_picture(response, history)
if image:
    image.save('1.jpg')
else:
    print("no box")

print(f"Total Generation Time: {elapsed_time2} seconds")
Output of the run:
time python test05-VL.py
NUM_GPUS: 8
Loading checkpoint shards: 100%|█████████████| 10/10 [19:32<00:00, 117.30s/it]
Load Model Time: 1177.1278638839722 seconds
QWenLMHeadModel(
  (transformer): QWenModel(
    (wte): Embedding(151936, 4096)
    (drop): Dropout(p=0.0, inplace=False)
    (rotary_emb): RotaryEmbedding()
    (h): ModuleList(
      (0-31): 32 x QWenBlock(
        (ln_1): RMSNorm()
        (attn): QWenAttention(
          (c_attn): Linear(in_features=4096, out_features=12288, bias=True)
          (c_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (attn_dropout): Dropout(p=0.0, inplace=False)
        )
        (ln_2): RMSNorm()
        (mlp): QWenMLP(
          (w1): Linear(in_features=4096, out_features=11008, bias=False)
          (w2): Linear(in_features=4096, out_features=11008, bias=False)
          (c_proj): Linear(in_features=11008, out_features=4096, bias=False)
        )
      )
    )
    (ln_f): RMSNorm()
    (visual): VisionTransformer(
      (conv1): Conv2d(3, 1664, kernel_size=(14, 14), stride=(14, 14), bias=False)
      (ln_pre): LayerNorm((1664,), eps=1e-06, elementwise_affine=True)
      (transformer): TransformerBlock(
        (resblocks): ModuleList(
          (0-47): 48 x VisualAttentionBlock(
            (ln_1): LayerNorm((1664,), eps=1e-06, elementwise_affine=True)
            (ln_2): LayerNorm((1664,), eps=1e-06, elementwise_affine=True)
            (attn): VisualAttention(
              (in_proj): Linear(in_features=1664, out_features=4992, bias=True)
              (out_proj): Linear(in_features=1664, out_features=1664, bias=True)
            )
            (mlp): Sequential(
              (c_fc): Linear(in_features=1664, out_features=8192, bias=True)
              (gelu): GELU(approximate='none')
              (c_proj): Linear(in_features=8192, out_features=1664, bias=True)
            )
          )
        )
      )
      (attn_pool): Resampler(
        (kv_proj): Linear(in_features=1664, out_features=4096, bias=False)
        (attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=4096, out_features=4096, bias=True)
        )
        (ln_q): LayerNorm((4096,), eps=1e-06, elementwise_affine=True)
        (ln_kv): LayerNorm((4096,), eps=1e-06, elementwise_affine=True)
      )
      (ln_post): LayerNorm((4096,), eps=1e-06, elementwise_affine=True)
    )
  )
  (lm_head): Linear(in_features=4096, out_features=151936, bias=False)
)
图中是一名女子在沙滩上和狗玩耍,旁边的狗是一只拉布拉多犬,它们处于沙滩上。
<ref>击掌</ref><box>(523,512),(588,605)</box>
Total Generation Time: 4.424967527389526 seconds

real    19m45.588s
user    0m12.085s
sys     0m29.553s
The image referenced in the code:
The high-five ("击掌") detection image:
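Note that the numbers inside the <box> tag are not pixel values: per the Qwen-VL documentation, box coordinates are reported on a 0-1000 normalized scale relative to the image. draw_bbox_on_latest_picture handles the conversion internally, but if you want the pixel-space box yourself (for cropping or custom drawing), a small parser along the following lines works. The helper names parse_boxes and to_pixels are illustrative, not part of the Qwen API.

import re
from io import BytesIO

import requests
from PIL import Image

def parse_boxes(response: str):
    """Extract (label, box) pairs from a Qwen-VL grounding response.
    Boxes stay in the model's 0-1000 normalized coordinate space."""
    pattern = r'<ref>(.*?)</ref><box>\((\d+),(\d+)\),\((\d+),(\d+)\)</box>'
    return [(label, tuple(map(int, coords)))
            for label, *coords in re.findall(pattern, response)]

def to_pixels(box, width, height):
    """Map a 0-1000 normalized box onto the original image's pixel grid."""
    x1, y1, x2, y2 = box
    return (x1 * width // 1000, y1 * height // 1000,
            x2 * width // 1000, y2 * height // 1000)

# Example using the box returned in the run above.
response = '<ref>击掌</ref><box>(523,512),(588,605)</box>'
url = 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'
img = Image.open(BytesIO(requests.get(url).content))
for label, box in parse_boxes(response):
    print(label, to_pixels(box, img.width, img.height))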
2. Loading Qwen-VL across Multiple GPUs
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import time

torch.manual_seed(1234)

MODEL_NAME = "Qwen/Qwen-VL-Chat"

# Define a function that maps the model's submodules onto multiple GPUs
def auto_configure_device_map(num_gpus: int):
    num_trans_layers = 32                          # number of transformer blocks in Qwen-VL
    per_gpu_layers = num_trans_layers / num_gpus   # how many blocks each GPU should hold

    # Initialize the device map with the modules that need a fixed placement
    device_map = {
        'transformer.wte': 0,                  # token embedding on the first GPU
        'transformer.visual': num_gpus - 1,    # vision encoder on the last GPU
        'transformer.ln_f': num_gpus - 1,      # final normalization layer on the last GPU
        'lm_head': num_gpus - 1                # language-model head (next-token prediction) on the last GPU
    }

    # Assign each transformer block to a GPU
    for i in range(num_trans_layers):
        device_map[f'transformer.h.{i}'] = int(i // per_gpu_layers)

    return device_map

# Detect the number of available GPUs
NUM_GPUS = torch.cuda.device_count()
print(f"NUM_GPUS: {NUM_GPUS}")

# If GPUs are available, build the device map from the GPU count; otherwise skip it
device_map = auto_configure_device_map(NUM_GPUS) if NUM_GPUS > 0 else None

# Record the start timestamp
start_time = time.time()

# Load the tokenizer and model, specifying the device map and dtype
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
# use bf16
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, trust_remote_code=True, bf16=True, device_map=device_map)
model = model.eval()

end_time = time.time()
elapsed_time = end_time - start_time
print(f"Load Model Time: {elapsed_time} seconds")
print(model)

start_time2 = time.time()

# 1st dialogue turn
query = tokenizer.from_list_format([
    {'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'},
    {'text': '这是什么'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
# 图中是一名年轻女子在沙滩上和她的狗玩耍,狗的品种可能是拉布拉多。她们坐在沙滩上,狗的前腿抬起来,似乎在和人类击掌。两人之间充满了信任和爱。

# 2nd dialogue turn
response, history = model.chat(tokenizer, '输出"击掌"的检测框', history=history)
end_time2 = time.time()
elapsed_time2 = end_time2 - start_time2
print(response)
# <ref>击掌</ref><box>(517,508),(589,611)</box>

# Draw the returned bounding box onto the most recent image and save it
image = tokenizer.draw_bbox_on_latest_picture(response, history)
if image:
    image.save('2.jpg')
else:
    print("no box")

print(f"Total Generation Time: {elapsed_time2} seconds")
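To see what this mapping looks like in practice, the function can be run on its own. On the 8-GPU machine used below, per_gpu_layers is 4, so transformer.h.0-3 land on GPU 0, h.4-7 on GPU 1, and so on, while the vision encoder, the final norm, and lm_head share the last GPU. The standalone sketch below simply reuses the function defined above and inspects its result; it loads no weights.

# Inspect the placement produced for 8 GPUs without loading any weights.
def auto_configure_device_map(num_gpus: int):
    num_trans_layers = 32
    per_gpu_layers = num_trans_layers / num_gpus
    device_map = {
        'transformer.wte': 0,
        'transformer.visual': num_gpus - 1,
        'transformer.ln_f': num_gpus - 1,
        'lm_head': num_gpus - 1,
    }
    for i in range(num_trans_layers):
        device_map[f'transformer.h.{i}'] = int(i // per_gpu_layers)
    return device_map

device_map = auto_configure_device_map(8)
print({k: v for k, v in device_map.items() if not k.startswith('transformer.h')})
# {'transformer.wte': 0, 'transformer.visual': 7, 'transformer.ln_f': 7, 'lm_head': 7}
print([device_map[f'transformer.h.{i}'] for i in range(32)])
# [0, 0, 0, 0, 1, 1, 1, 1, 2, ..., 7, 7, 7, 7]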
Running the full multi-GPU script gives:
time python test05-VL-2.py
NUM_GPUS: 8
Loading checkpoint shards: 100%|███████████| 10/10 [48:26<00:00, 290.68s/it]
Load Model Time: 2910.5123233795166 seconds
QWenLMHeadModel(
  (transformer): QWenModel(
    (wte): Embedding(151936, 4096)
    (drop): Dropout(p=0.0, inplace=False)
    (rotary_emb): RotaryEmbedding()
    (h): ModuleList(
      (0-31): 32 x QWenBlock(
        (ln_1): RMSNorm()
        (attn): QWenAttention(
          (c_attn): Linear(in_features=4096, out_features=12288, bias=True)
          (c_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (attn_dropout): Dropout(p=0.0, inplace=False)
        )
        (ln_2): RMSNorm()
        (mlp): QWenMLP(
          (w1): Linear(in_features=4096, out_features=11008, bias=False)
          (w2): Linear(in_features=4096, out_features=11008, bias=False)
          (c_proj): Linear(in_features=11008, out_features=4096, bias=False)
        )
      )
    )
    (ln_f): RMSNorm()
    (visual): VisionTransformer(
      (conv1): Conv2d(3, 1664, kernel_size=(14, 14), stride=(14, 14), bias=False)
      (ln_pre): LayerNorm((1664,), eps=1e-06, elementwise_affine=True)
      (transformer): TransformerBlock(
        (resblocks): ModuleList(
          (0-47): 48 x VisualAttentionBlock(
            (ln_1): LayerNorm((1664,), eps=1e-06, elementwise_affine=True)
            (ln_2): LayerNorm((1664,), eps=1e-06, elementwise_affine=True)
            (attn): VisualAttention(
              (in_proj): Linear(in_features=1664, out_features=4992, bias=True)
              (out_proj): Linear(in_features=1664, out_features=1664, bias=True)
            )
            (mlp): Sequential(
              (c_fc): Linear(in_features=1664, out_features=8192, bias=True)
              (gelu): GELU(approximate='none')
              (c_proj): Linear(in_features=8192, out_features=1664, bias=True)
            )
          )
        )
      )
      (attn_pool): Resampler(
        (kv_proj): Linear(in_features=1664, out_features=4096, bias=False)
        (attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=4096, out_features=4096, bias=True)
        )
        (ln_q): LayerNorm((4096,), eps=1e-06, elementwise_affine=True)
        (ln_kv): LayerNorm((4096,), eps=1e-06, elementwise_affine=True)
      )
      (ln_post): LayerNorm((4096,), eps=1e-06, elementwise_affine=True)
    )
  )
  (lm_head): Linear(in_features=4096, out_features=151936, bias=False)
)
图中是一名女子在沙滩上和狗玩耍,旁边的狗是一只拉布拉多犬,它们处于沙滩上。
<ref>击掌</ref><box>(523,512),(588,605)</box>
Total Generation Time: 38.64623808860779 seconds

real    49m14.595s
user    0m34.058s
sys     0m50.434s
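Comparing the two runs: the layer-split load took 2910 s versus 1177 s for the single-GPU run (the difference is likely dominated by checkpoint download and I/O rather than the device map), and generation slowed from about 4.4 s to about 38.6 s. This layer-wise split is model parallelism for fitting memory, not for speed: every forward pass has to hop across all eight GPUs in sequence, so keeping the whole model on one card is faster whenever it fits, and the multi-GPU layout is mainly useful when it does not.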