This tutorial uses Qwen1.5-0.5B-Chat as the small model for embeddings and Qwen1.5-1.8B-Chat as the inference model.
Of course, you can also use a single model for both roles, or substitute other models.
1. Embedding Models
There are many models for building text embeddings, and different models suit different tasks and requirements. When choosing one, consider your specific application scenario: the languages you need to handle, the complexity of the task, and your system resources. Below are some common categories of text-embedding models, with concrete examples:
1.1. Pre-trained Language Models
These models are typically pre-trained on large-scale text corpora and capture rich linguistic features.
- BERT (Bidirectional Encoder Representations from Transformers)
  - Suitable for a wide range of natural language processing tasks; captures bidirectional relationships in context.
  - Variants include RoBERTa, DistilBERT, ALBERT, and others.
- GPT (Generative Pre-trained Transformer)
  - A Transformer-based autoregressive model, well suited to generation tasks.
  - Released in several versions, such as GPT-2 and GPT-3.
- T5 (Text-To-Text Transfer Transformer)
  - Casts every text task as a text-to-text problem.
- XLNet
  - Combines autoregressive and autoencoding properties; outperforms BERT on some tasks.
1.2. Sentence Embedding Models
Designed specifically to output embeddings at the sentence or paragraph level.
- Sentence-BERT (SBERT)
  - A modified version of BERT optimized for sentence-level similarity comparison (see the sketch at the end of this section).
- Universal Sentence Encoder
  - Developed by Google; supports multiple languages and suits a broad range of tasks.
1.3. Specialized Embedding Models
Models optimized for particular applications or domains.
- FastText
  - Developed by Facebook; particularly good at handling text that includes rare words.
- ELMo (Embeddings from Language Models)
  - Uses a bidirectional LSTM architecture to generate context-dependent word embeddings.
1.4. Adaptive Embedding Models
Models that can be fine-tuned to fit a specific task or domain.
- AdaptBERT
  - Improves in-domain performance by fine-tuning on domain-specific data.
1.5. 多语言和跨语言模型
支持多种语言,适合跨语言的应用。
- mBERT(Multilingual BERT)
- 支持多种语言的 BERT 版本。
- XLM
- 针对跨语言理解优化的模型。
When choosing a model, besides performance, also weigh implementation complexity, runtime resource requirements, and whether you need support for specific languages or domains. In general, pre-trained models offer good generality and performance, but particular use cases may call for further tuning or optimization.
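To make the sentence-embedding idea concrete, here is a minimal sketch of a similarity comparison with the sentence-transformers library. The model name all-MiniLM-L6-v2 is just a commonly used lightweight SBERT-style checkpoint chosen for illustration; the tutorial itself uses a Qwen model below.

```python
from sentence_transformers import SentenceTransformer, util

# A lightweight SBERT-style checkpoint (illustrative choice)
model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = ["How do I bake bread?", "What is the recipe for bread?"]
embeddings = model.encode(sentences)

# Cosine similarity between the two sentence embeddings;
# values near 1.0 indicate semantically similar sentences
print(util.cos_sim(embeddings[0], embeddings[1]))
```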
2. Install the Required Libraries
```
!python -m pip install --upgrade pip
!pip install -U torch
!pip install -U transformers
!pip install -U sentence_transformers
!pip install -U numpy
!pip install -U faiss-cpu
#!pip install faiss-gpu
```
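After installing, a quick sanity check (optional, not part of the original steps) confirms that everything imports and reports the installed versions:

```python
import torch, transformers, sentence_transformers, numpy, faiss

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("sentence-transformers:", sentence_transformers.__version__)
print("numpy:", numpy.__version__)
# recent faiss wheels expose a version string; older ones may not
print("faiss:", getattr(faiss, "__version__", "unknown"))
```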
3. Import Libraries
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
```
4. Load the Models
```python
# Small model used to create the embeddings
embedder = SentenceTransformer('Qwen/Qwen1.5-0.5B-Chat')

# Alternative that supports multiple languages
# embedder = SentenceTransformer('bert-base-multilingual-cased')

# Larger model used for generation
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen1.5-1.8B-Chat')

device = "cpu"  # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-1.8B-Chat",
    torch_dtype="auto",
    device_map="cpu"
)
```
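Note that Qwen1.5-0.5B-Chat is a chat/causal language model rather than a dedicated embedding model. When sentence-transformers is given a plain Transformer checkpoint like this, it wraps it with a default mean-pooling layer to produce fixed-size vectors. A quick check (not in the original code) shows the embedding dimension, which must match the FAISS index built later:

```python
# Dimension of the vectors produced by the embedder;
# the FAISS index below is created with this same dimension.
print(embedder.get_sentence_embedding_dimension())
```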
5. Load the Document
This example uses the text of Paul Graham's essay "What I Worked On".
The easiest way to get it is to download it from this link and save it in the data
directory under the file name paul_graham_essay.txt.
```python
# Open the file
file_path = "data/paul_graham_essay.txt"

# Initialize empty lists to store the embeddings and text chunks
embeddings_list = []
documents = []

with open(file_path, "r", encoding="utf-8") as file:
    document = ""
    for line in file:
        # Process the accumulated text once it exceeds 1000 characters
        # or a blank line is encountered
        if len(document) + len(line) > 1000 or line.strip() == "":
            document = document.strip()
            if document:  # make sure the text is non-empty
                embeddings = embedder.encode(document)
                embeddings_list.append(embeddings)
                print("document:", document)
                print(f"embeddings:{len(embeddings)},", embeddings)
                documents.append(document)
            document = ""
        document += line.strip() + " "

document = document.strip()
if document:  # process whatever remains at the end of the file
    embeddings = embedder.encode(document)
    embeddings_list.append(embeddings)
    documents.append(document)
```
The 1000 in the code can be raised to a larger value, say 10000, so that each chunk carries more content and retrieval may be more precise; keep in mind, though, that chunk size also affects the length of the prompt passed to the inference model. A parameterized version is sketched below.
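If you plan to experiment with chunk sizes, it may be cleaner to pull the splitting logic into a helper. Below is a minimal sketch derived from the loop above; the function name chunk_and_embed and the max_chars parameter are illustrative, not part of the original code:

```python
def chunk_and_embed(file_path, max_chars=1000):
    """Split a text file into chunks of roughly max_chars characters
    (also breaking on blank lines) and embed each chunk."""
    documents, embeddings_list = [], []
    document = ""
    with open(file_path, "r", encoding="utf-8") as file:
        for line in file:
            # Flush the current chunk when it grows too large
            # or a blank line marks a paragraph boundary
            if len(document) + len(line) > max_chars or line.strip() == "":
                document = document.strip()
                if document:
                    documents.append(document)
                    embeddings_list.append(embedder.encode(document))
                document = ""
            document += line.strip() + " "
    document = document.strip()
    if document:  # flush whatever is left at the end of the file
        documents.append(document)
        embeddings_list.append(embedder.encode(document))
    return documents, embeddings_list

# Larger chunks give each retrieved segment more context:
# documents, embeddings_list = chunk_and_embed(file_path, max_chars=10000)
```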
6. Build the FAISS Index and Ask a Question
```python
# Build the FAISS index
if embeddings_list:
    embeddings_array = np.vstack(embeddings_list)
    index = faiss.IndexFlatL2(embeddings_array.shape[1])
    index.add(embeddings_array.astype('float32'))

    # Process the user question and run inference
    question = "What is the theme of the document? "
    #question = "这份文档的主题是什么?"
    query_embedding = embedder.encode([question])[0].astype('float32')

    # Retrieve the most relevant document segments
    combined_segments = ""
    k = 3  # the number of relevant segments you want to retrieve
    D, I = index.search(np.array([query_embedding]), k=k)
    print("Top", k, "most relevant document segments:")
    for idx, segment_index in enumerate(I[0]):
        most_relevant_segment = documents[segment_index]
        print(f"{idx+1}: {most_relevant_segment}\n")
        combined_segments += " " + most_relevant_segment

    prompt = combined_segments + "\n\n###\n\n" + question + "\n\n用中文回答"

    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(device)

    generated_ids = model.generate(
        model_inputs.input_ids,
        max_new_tokens=512
    )
    generated_ids = [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    print("Answer to the question:", response)
else:
    print("No embeddings found. Please check your data.")
```
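A note on the index choice: IndexFlatL2 ranks segments by Euclidean distance. If you prefer ranking by cosine similarity, a common variant (not what the code above does) is to L2-normalize the vectors and use an inner-product index instead:

```python
# Cosine-similarity variant: normalize the vectors, then use inner product.
embeddings_array = np.vstack(embeddings_list).astype('float32')
faiss.normalize_L2(embeddings_array)  # in-place L2 normalization
index = faiss.IndexFlatIP(embeddings_array.shape[1])
index.add(embeddings_array)

query = embedder.encode([question]).astype('float32')
faiss.normalize_L2(query)
D, I = index.search(query, k=3)  # D now holds cosine similarities
```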
For English documents, prompt in English; for Chinese documents, prompt in Chinese.
If you want to ask about an English document with a Chinese prompt, switch the embedder to a multilingual model:
```python
embedder = SentenceTransformer('bert-base-multilingual-cased')
```
For English documents, if you need the answer in Chinese, you can force it by appending the instruction at the end of the prompt (用中文回答 means "answer in Chinese"):
```python
prompt = combined_segments + "\n\n###\n\n" + question + "\n\n用中文回答"
```
The number of relevant segments to retrieve is controlled by k:
```python
k = 3  # the number of relevant segments you want to retrieve
```
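One caveat worth guarding against (not handled in the original code): if k exceeds the number of indexed vectors, an IndexFlat search pads the result with -1 indices, and documents[segment_index] would then silently return the last chunk. A defensive clamp avoids this:

```python
# Never ask for more neighbors than the index actually contains
k = min(3, index.ntotal)
```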
Here is the output:
```
Top 3 most relevant document segments:
1: Over the next several years I wrote lots of essays about all kinds of different topics. O'Reilly reprinted a collection of them as a book, called Hackers & Painters after one of the essays in it. I also worked on spam filters, and did some more painting. I used to have dinners for a group of friends every thursday night, which taught me how to cook for groups. And I bought another building in Cambridge, a former candy factory (and later, twas said, porn studio), to use as an office.

2: AI was in the air in the mid 1980s, but there were two things especially that made me want to work on it: a novel by Heinlein called The Moon is a Harsh Mistress, which featured an intelligent computer called Mike, and a PBS documentary that showed Terry Winograd using SHRDLU. I haven't tried rereading The Moon is a Harsh Mistress, so I don't know how well it has aged, but when I read it I was drawn entirely into its world. It seemed only a matter of time before we'd have Mike, and when I saw Winograd using SHRDLU, it seemed like that time would be a few years at most. All you had to do was teach SHRDLU more words.

3: We invited about 20 of the 225 groups to interview in person, and from those we picked 8 to fund. They were an impressive group. That first batch included reddit, Justin Kan and Emmett Shear, who went on to found Twitch, Aaron Swartz, who had already helped write the RSS spec and would a few years later become a martyr for open access, and Sam Altman, who would later become the second president of YC. I don't think it was entirely luck that the first batch was so good. You had to be pretty bold to sign up for a weird thing like the Summer Founders Program instead of a summer job at a legit place like Microsoft or Goldman Sachs.

Answer to the question: 这篇文档的主题是关于作者在不同时期撰写和研究不同主题,包括他在艾萨克·阿西莫夫的《月球与深渊》(The Moon is a Harsh Mistress)中创造的智能计算机迈克以及他的编程语言SHRDLU。此外,他还通过教授SHRDLU更多的单词来发展AI技术,并参与了第一届Summer Founders Program的发起人会议,其中包括Reddit、Justin Kan、埃米特·谢拉(Emmet Shear)、Twitch创始人Aaron Swartz、他之前帮助编写RSS协议并将在几年后成为开放获取的先驱,以及Sam Altman,后来成为YC的第二任校长。

尽管最初的集锦包括了一些成功者,如Reddit、Justin Kan和埃米特·谢拉等,但作者认为他们的成功并不是完全偶然的,而是源于他对AI技术的勇气和决心。他明确表示要参加一个像是微软或高盛这样的合法工作而非传统的暑期实习项目,这为他进入AI领域铺平了道路。最终,作者选择创建SHRDLU作为人工智能的研究工具,并邀请大约225个来自各种行业的小组进行面对面访谈,从中挑选出8个资助小组。这些小组包括知名的技术公司,如Twitch、AMD和雅虎,他们在随后的发展历程中扮演了重要的角色。因此,这篇文档的主题可以概括为"作者在AI领域的探索和推动,以及其创建和影响人工智能技术的策略"。
```