DeepSeek

DeepSeek可以做什么

对话、语言理解、计算推理、代码生成补全,文本生成,支持文件上传、读取文件及图片中的文字内容。

知识处理体系:通用/专业领域问答,图集推理,逻辑推理。

自然语言处理:文本分析,语义理解,实体识别、故事创作。

交互能力:上下文理解、多轮对话、任务执行

方案规划:从数据的可视化流程优化,建议生成

image-20250306162029150

推理模型:在传统大语言模型基础上强化推理、逻辑分析和决策能力,借助强化学习、神经符号推理、元学习等手段,思维链。例如:DeepSeek-R1, GPT-o3

非推理模型:侧重于语言成生成、上下文理解和自然语言处理。需要对大量文本数据训练以掌握语言规律,概率生成,没有思考过程(思维链拆解)。DeepSeek-V3, GPT-3,GPT-4,BERT

image-20250306164225321

提示语是用户输入AI系统的指令或信息,引导AI生成特定的输出或执行特定任务。

  • 指令:告诉AI执行什么任务
  • 上下文:提供背景信息,帮助AI理解任务和执行任务
  • 期望:明确或隐含的表达期望AI输出内容

image-20250306165330271

image-20250306165751552

架构

特点:混合专家模型(MoE),MLA

image-20250308134621177

算法:强化学习

状态S:输入prompt

动作A:输出response(即LLM输出下一个token)

奖励R:根据prompt+response进行奖励模型打分

整体目标:给定prompt,调整policy,生成符合人类喜好(RM偏序信号)的response

image-20250308153656529

量化版:模型压缩技术,例如:参数精度FP32压缩为INT8导致模型推理精度降低。
蒸馏版:大模型知识迁移到小模型:让小模型学习大模型的输出特征。image-20250306175349025
token/tokenizer
image-20250309162530739

image-20250309162749279

硬件配置

硬件配置

在线检测

DeepSeek接口指南

API文档

私有化业务领域

传统过程:问题-》大模型(原生)-》答案

  1. rag: 问题-》私有知识库(查询)-》prompt(查询结果+问题)-》大模型(原生)-》答案

    1)简单的RAG或者Agent使用dify
    2)复杂的RAG或者Agent使用LangGraph/LangChain

image-20250310154846077

2. fine tuning:尽量选较小的模型

1)100条高质量标注数据在1.5B的模型上有可以有较好效果
2) 高质量:a. 数据覆盖全业务场景,b. 数据均衡,c.数据和业务场景关联性强。

数据集

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
数据描述文件dataset_info.json
"dataset_name": {
"hf_hub_url": "the name of the dataset repository on the Hugging Face hub. (if specified, ignore script_url and file_name)",
"ms_hub_url": "the name of the dataset repository on the Model Scope hub. (if specified, ignore script_url and file_name)",
"script_url": "the name of the directory containing a dataset loading script. (if specified, ignore file_name)",
"file_name": "the name of the dataset folder or dataset file in this directory. (required if above are not specified)",
"formatting": "the format of the dataset. (optional, default: alpaca, can be chosen from {alpaca, sharegpt})",
"ranking": "whether the dataset is a preference dataset or not. (default: False)",
"subset": "the name of the subset. (optional, default: None)",
"split": "the name of dataset split to be used. (optional, default: train)",
"folder": "the name of the folder of the dataset repository on the Hugging Face hub. (optional, default: None)",
"num_samples": "the number of samples in the dataset to be used. (optional, default: None)",
"columns (optional)": {
"prompt": "the column name in the dataset containing the prompts. (default: instruction)",
"query": "the column name in the dataset containing the queries. (default: input)",
"response": "the column name in the dataset containing the responses. (default: output)",
"history": "the column name in the dataset containing the histories. (default: None)",
"messages": "the column name in the dataset containing the messages. (default: conversations)",
"system": "the column name in the dataset containing the system prompts. (default: None)",
"tools": "the column name in the dataset containing the tool description. (default: None)",
"images": "the column name in the dataset containing the image inputs. (default: None)",
"videos": "the column name in the dataset containing the videos inputs. (default: None)",
"audios": "the column name in the dataset containing the audios inputs. (default: None)",
"chosen": "the column name in the dataset containing the chosen answers. (default: None)",
"rejected": "the column name in the dataset containing the rejected answers. (default: None)",
"kto_tag": "the column name in the dataset containing the kto tags. (default: None)"
},
"tags (optional, used for the sharegpt format)": {
"role_tag": "the key in the message represents the identity. (default: from)",
"content_tag": "the key in the message represents the content. (default: value)",
"user_tag": "the value of the role_tag represents the user. (default: human)",
"assistant_tag": "the value of the role_tag represents the assistant. (default: gpt)",
"observation_tag": "the value of the role_tag represents the tool results. (default: observation)",
"function_tag": "the value of the role_tag represents the function call. (default: function_call)",
"system_tag": "the value of the role_tag represents the system prompt. (default: system, can override system column)"
}
}

Alpaca Format

  1. 监督式对话
    数据集:
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    [
    {
    "instruction": "Classify the following sentence as either an opinion or fact.",
    "input": "A hot dog is just one of many types of processed meat you can eat.",
    "output": "Fact."
    },
    {
    "instruction": "Format and organize the following unstructured text information into a table.",
    "input": "Name: Peter\nAge: 43\nGender: Male\nOccupation: Doctor",
    "output": "| Category | Information |\n|------------|-------------|\n| Name | Peter |\n| Age | 43 |\n| Gender | Male |\n| Occupation | Doctor |"
    },
    {
    "instruction": "Given the following array, print out maximum value in descending order. \n\n[4, 8, 0, 9, 15]",
    "input": "[4, 8, 0, 9, 15]",
    "output": "15\n9\n8\n4\n0",
    "history": [
    ["print out minimum value in increasing order", "0\n4\n8\n9\n15"]
    ]
    }
    {
    "instruction": "判断在频谱数据中是否存在故障,工况条件为:。。。,并给出详细故障内容"
    "input": "0Hz至4Hz振动有效值为:0.7313359070982054, 5Hz至9Hz振动有效值为:0.5516794943665811, 10Hz至14Hz振动有效值为:0.64766006489319, 15Hz至19Hz振动有效值为:3.0707501463193583, 20Hz至24Hz振动有效值为:0.6548395850417195, 25Hz至29Hz振动有效值为:0.26479291295984514, 30Hz至34Hz振动有效值为:0.4057607491630266, 35Hz至39Hz振动有效值为:0.3412169025400324, 40Hz至44Hz振动有效值为:0.19167337178728241, 45Hz至49Hz振动有效值为:0.20967801201261643,50Hz至54Hz振动有效值为:0.2543538684896362"
    "output": "存在故障,故障特征为:。。。",
    "history":[
    [
    "55Hz至59Hz振动有效值为:0.6280184736095086, 60Hz至64Hz振动有效值为:0.3123565814013977, 65Hz至69Hz振动有效值为:0.2982322812970321, 70Hz至74Hz振动有效值为:0.2523941314035457, 75Hz至79Hz振动有效值为:1.0019673960653408, 80Hz至84Hz振动有效值为:0.38334371478931645, 85Hz至89Hz振动有效值为:0.40893629328147063, 90Hz至94Hz振动有效值为:1.1188524730758498, 95Hz至99Hz振动有效值为:0.819641930908284, 100Hz至104Hz振动有效值为:0.5804811401893758, 105Hz至109Hz振动有效值为:0.21268042122054098, 110Hz至114Hz振动有效值为:0.2152073347988916, 115Hz至119Hz振动有效值为:0.3527403211644723", "不存在故障,原因为:。。。"
    ],
    ]
    }
    ]
    数据描述文件配置:
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    "dataset_name": {
    "file_name": "data.json",
    "columns": {
    "prompt": "instruction",
    "query": "input",
    "response": "output",
    "system": "system",
    "history": "history"
    }
    }
  2. 预训练式对话
    数据集:
    1
    2
    3
    4
    5
    6
    7
    8
    [
    {
    "text": "Don’t think you need all the bells and whistles? No problem. McKinley Heating Service Experts Heating & Air Conditioning offers basic air cleaners that work to improve the quality of the air in your home without breaking the bank. "
    },
    {
    "text": "To the apparent surprise of everyone, the Walt Disney Company has announced a deal to purchase Lucasfilm Ltd.
    }
    ]
    数据描述:
    1
    2
    3
    4
    5
    6
    "dataset_name": {
    "file_name": "data.json",
    "columns": {
    "prompt": "text"
    }
    }
    ShareGpt
  3. 监督式对话
    数据集:
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    51
    52
    53
    54
    55
    56
    57
    58
    59
    60
    61
    62
    63
    64
    65
    66
    67
    68
    69
    70
    71
    72
    73
    74
    75
    76
    77
    78
    79
    80
    81
    82
    83
    84
    [
    {
    "conversations": [
    {
    "from": "human",
    "value": "Hi, I have some ingredients and I want to cook something. Can you help me find a recipe?"
    },
    {
    "from": "gpt",
    "value": "Of course! I can help you with that. Please tell me what ingredients you have."
    },
    {
    "from": "human",
    "value": "I have chicken, bell peppers, and rice."
    },
    {
    "from": "function_call",
    "value": "{\"name\": \"search_recipes\", \"arguments\": {\"ingredients\": [\"chicken\", \"bell peppers\", \"rice\"]}}"
    },
    {
    "from": "observation",
    "value": "{\"recipes\": [{\"name\": \"Chicken and Bell Pepper Stir Fry\", \"ingredients\": [\"chicken\", \"bell peppers\", \"rice\"], \"instructions\": \"Cut the chicken into small pieces. Slice the bell peppers. Cook the rice. Stir fry the chicken and bell peppers. Serve over rice.\"}, {\"name\": \"Chicken and Rice Casserole\", \"ingredients\": [\"chicken\", \"bell peppers\", \"rice\"], \"instructions\": \"Cook the chicken and rice separately. Mix them together with the bell peppers in a casserole dish. Bake until golden brown.\"}]}"
    },
    {
    "from": "gpt",
    "value": "I found two recipes for you. The first one is \"Chicken and Bell Pepper Stir Fry\". The instructions are: Cut the chicken into small pieces. Slice the bell peppers. Cook the rice. Stir fry the chicken and bell peppers. Serve over rice. The second one is \"Chicken and Rice Casserole\". The instructions are: Cook the chicken and rice separately. Mix them together with the bell peppers in a casserole dish. Bake until golden brown. Which one would you like to try?"
    },
    {
    "from": "human",
    "value": "They both sound delicious, but I think I'll try the stir fry. Can you order the ingredients for me?"
    },
    {
    "from": "gpt",
    "value": "I'm sorry, but as an AI, I don't have the capability to perform external tasks such as ordering ingredients. However, I can help you find more recipes or provide cooking tips if you need."
    }
    ],
    "tools": "[{\"name\": \"search_recipes\", \"description\": \"Search for recipes based on ingredients\", \"parameters\": {\"type\": \"object\", \"properties\": {\"ingredients\": {\"type\": \"array\", \"items\": {\"type\": \"string\"}, \"description\": \"The ingredients to search for\"}}, \"required\": [\"ingredients\"]}}]"
    },
    {
    "conversations": [
    {
    "from": "human",
    "value": "Write a definition of \"photoshop\"."
    },
    {
    "from": "gpt",
    "value": "Photoshop is a software application developed by Adobe that enables users to manipulate digital images by providing a variety of tools and features to alter, enhance, and edit photos. It allows users to adjust the color balance, contrast, and brightness of images, remove backgrounds, add or remove elements from images, and perform numerous other image manipulation tasks. Photoshop is widely used by graphic designers, photographers, and digital artists for creating and enhancing images for a variety of purposes, including print and online media."
    },
    {
    "from": "human",
    "value": "Can you explain the difference between machine learning and deep learning?"
    },
    {
    "from": "gpt",
    "value": "Machine learning is a subset of artificial intelligence that involves the use of algorithms to analyze and learn from data, without being explicitly programmed. It enables computers to identify patterns in data, make decisions based on those patterns, and improve performance over time by learning from experience.\n\nDeep learning, on the other hand, is a subset of machine learning that involves the use of neural networks to learn from data. These neural networks are composed of multiple layers of interconnected nodes, and they can be trained on large datasets to recognize patterns and correlations in data. Deep learning has been particularly successful in image recognition, speech recognition, and natural language processing tasks.\n\nOne way to think about the difference between the two is that machine learning relies on statistical models and algorithms to make predictions, while deep learning uses neural networks to mimic the structure of the human brain and achieve higher levels of accuracy and complexity in its predictions."
    },
    {
    "from": "human",
    "value": "How does blockchain work?"
    },
    {
    "from": "gpt",
    "value": "Blockchain is a decentralized digital ledger technology that allows multiple parties to maintain a shared and tamper-proof record of transactions and information. Each block in the chain contains a cryptographic hash of the previous block, along with new data and a unique identifier, which creates an unbreakable chain of blocks that cannot be altered or deleted without consensus from the network.\n\nWhen a transaction is initiated on the blockchain, it is broadcast to all nodes on the network, which validate and verify the transaction using complex cryptographic algorithms. Once validated, the transaction is added to the blockchain and becomes a permanent part of the ledger.\n\nThe decentralized nature of blockchain means that there is no single point of failure or control, and the system is highly secure and transparent. This makes it ideal for use cases such as cryptocurrency transactions, supply chain management, and identity verification."
    },
    {
    "from": "human",
    "value": "Can you explain what quantum computing is?"
    },
    {
    "from": "gpt",
    "value": "Quantum computing is a type of computing technology that uses the principles of quantum mechanics to perform complex calculations and solve problems that are currently beyond the capabilities of classical computers. Unlike classical computers, which use bits to represent information as either 0 or 1, quantum computers use qubits, which can exist in multiple states simultaneously.\n\nThis allows quantum computers to perform certain types of calculations much faster than classical computers, such as factoring large numbers, simulating complex chemical reactions, and searching through large databases. Quantum computing also has the potential to revolutionize fields such as cryptography, machine learning, and drug discovery.\n\nHowever, quantum computing is still in its early stages of development, and there are many technical challenges that need to be overcome before it can become a practical technology. These include improving the stability and reliability of qubits, developing error correction algorithms, and scaling up the size and complexity of quantum systems."
    },
    {
    "from": "human",
    "value": "What is the difference between AR and VR?"
    },
    {
    "from": "gpt",
    "value": "Augmented reality (AR) and virtual reality (VR) are both technologies that provide immersive and interactive experiences, but they differ in how they create those experiences.\n\nAR overlays digital information onto the real world, typically using a camera or other sensors to track the user's environment and position. This allows users to interact with digital objects and information while still being aware of their physical surroundings. Examples of AR applications include games, navigation apps, and marketing campaigns.\n\nVR, on the other hand, creates a completely simulated environment that users can enter and interact with using specialized headsets and controllers. This allows users to experience completely new worlds and environments, and can be used for everything from gaming and entertainment to education and training.\n\nWhile both AR and VR have their own unique strengths and applications, they are often used together in what is known as mixed reality (MR), which combines the real and virtual worlds to create even more immersive and interactive experiences."
    }
    ],
    "tools": "[]"
    }
    ]
    数据描述:
    1
    2
    3
    4
    5
    6
    7
    8
    9
    "dataset_name": {
    "file_name": "data.json",
    "formatting": "sharegpt",
    "columns": {
    "messages": "conversations",
    "system": "system",
    "tools": "tools"
    }
    }
  4. 多模态图片
    数据集
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    [
    {
    "messages": [
    {
    "content": "<image>Who are they?",
    "role": "user"
    },
    {
    "content": "They're Kane and Gretzka from Bayern Munich.",
    "role": "assistant"
    },
    {
    "content": "What are they doing?<image>",
    "role": "user"
    },
    {
    "content": "They are celebrating on the soccer field.",
    "role": "assistant"
    }
    ],
    "images": [
    "mllm_demo_data/1.jpg",
    "mllm_demo_data/2.jpg"
    ]
    },
    {
    "messages": [
    {
    "content": "<image>根据转速功率图分析工况",
    "role": "user"
    },
    {
    "content": "图中有两条转速功率散点分布图,红色的散点轮廓涵盖了转速从。。。",
    "role": "assistant"
    },
    {
    "content": "<image>根据转速功率图分析是否有故障以及原因",
    "role": "user"
    },
    {
    "content": "有故障发生,原因是",
    "role": "assistant"
    }
    ],
    "images": [
    "mllm_demo_data/1.jpg"
    ]
    }
    ]
    数据描述
    1
    2
    3
    4
    5
    6
    7
    8
    "dataset_name": {
    "file_name": "data.json",
    "formatting": "sharegpt",
    "columns": {
    "messages": "conversations",
    "images": "images"
    }
    }
  5. 多模态语音
  6. 多模态视频

微调工具

LLaMA-Factory

image-20250310155012489

ms-swift

image-20250310155109619

部署流程

ollama部署

搭建应用服务LangChain

image-20250310161552667

特性

  • LLM和提示(prompt):统一LLM模型的访问API,提供prompt模版管理机制
  • chain: 将任务以工作流方式执行
  • LCEL: LangChain Expression Language自定义任务的处理流程
  • RAG:将新的信息导入LLM增强LLM生成内容质量
  • Agents:根据用户需求自动调用外部系统完成任务
  • 模型记忆:让LLM记住之前的对话内容,(1)内存存储:程序重启后记忆清零,(2)数据库存储例如MongoDB, redis

image-20250310162212153

image-20250310162815816

image-20250310163139416

核心概念

  1. LLMs:LangChain封装的模型,接收一个文本输入返回一个文本结果
  2. Chat Models:对话模型,转为对话场景设计,模型可接收一组对话消息,返回对话消息
  3. Message:消息类型包括:HumanMessage, AIMessage, SystemMessage, FunctionMessage和ToolMessage等
  4. Prompt: 提示词管理工具,方便格式化提示词
  5. OutputParse:对大模型返回内容格式化,例如解析为json,python对象等
  6. Retrievers:私有数据导入LLM, 可以加载文档,切割文档,存储和检索。
  7. Vector stores:支持向量数据库,用于检索
  8. Agents: 调用外部系统,实现例如文章写作,客服机器人

安装

pip install langchain-ollama langchain langchain-community Pillow faiss-cpu

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
#引入langchain聊天场景的提示词模版
from langchain_core.prompts import ChatPromptTemplate
#引入langchain ollama sdk
from langchain_ollama import ChatOllama
#引入langchain openai sdk
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutParser

#要求配置好key环境变量:export OPENAI_API_KEY="your_api_key"
#llm = ChatOpenAI()
llm = ChatOllama(model="llama3.1", temperature=0.7)
#根据Message定义提示词模版
prompt = ChatPromptTemplate.from_messages([
("system", "你是专家"),
("user", "{input}")
])
#创建输出解析器
output_parser = StrOutputParser()

#创建链式调用生成一个chain
chain = prompt | llm | output_parser
result = chain.invoke({"input":"帮我写一篇文章,100个字"})
#content='当今AI技术。。。'
#additional_kwargs={"refuse":None}
#response_metadata={'token_usage':{'completion_tokens':183,...}}
#加StrOutputParser只输出文本
print(result)

向量存储

使用向量数据便于做相似度匹配,可以用于向量数据库的查询。Retrievers接口将文档转化为向量并存储到向量数据库中,检索时就在向量数据库做相似性匹配(文档召回),将检索到文档拼装成提示词投给LLM。

1
2
vectorstore = MyVectorStore()
retriever = vectorstore.as_retriever()

向量数据库

  • Chroma: pip install langchain-chroma

    • in-memory 在python脚本
    • in-memory with persistance 脚本和加载到磁盘
    • in a docker container作为服务器
    • 操作:.add, .get, .update, .upsert, .delete, .peek, .query
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    #chroma_base.py:存到内存
    from langchain_chroma import Chroma
    from langchain_community.document_loaders import TextLoader
    #pip install -U langchain-huggingface
    from langchain_huggingface import HuggingFaceEmbeddings
    from langchain_text_splitters import CharacterTextSplitter

    # 加载文档并分割成片段
    loader = TextLoader("../../resource/knowledge.txt", encoding="UTF-8")
    documents = loader.load()
    #分片
    text_splitter = CharacterTextSplitter(chunk_size=1500, chunk_overlap=0)
    docs = text_splitter.split_documents(documents)
    #创建开源嵌入函数
    embedding_function = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
    #文档特征加载到Chroma内存中
    db = Chroma.from_documents(docs, embedding_function)
    #进行查询
    query = "Pixar公司是做什么的?"
    docs = db.similarity_search(query)
    #打印结果
    print(docs[0].page_content)
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    #chroma_client.py:存到内存和磁盘
    import chromadb
    from langchain_chroma import Chroma
    from langchain_huggingface import HuggingFaceEmbeddings

    persistant_client = chromadb.PersistantClient()
    #创建开源嵌入函数
    embedding_function = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
    collection = persistant_client.get_or_create_collection("collection_1")
    collection.add(ids=["1","2","3"], documents=["a","b","c"])
    langchain_chroma = Chroma(
    client=persistant_client,
    collection_name="collection_1",
    embedding_function=embedding_function,
    )
    print("在集合中有",langchain_chroma._collection.count(),"个文档")

    image-20250311095822863

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    #chroma_disk.py:存到指定磁盘
    from langchain_chroma import Chroma
    from langchain_community.document_loaders import TextLoader
    #from langchain_huggingface import HuggingFaceEmbeddings
    from langchain_openai import OpenAIEmbeddings
    from langchain_text_splitters import CharacterTextSplitter
    import chromadb

    #建立数据库对象
    #persistant_client = chromadb.PersistantClient()
    new_client = chromadb.EphemeralClient()

    # 加载文档并分割成片段
    loader = TextLoader("../../resource/knowledge.txt", encoding="UTF-8")
    documents = loader.load()
    #分片
    text_splitter = CharacterTextSplitter(chunk_size=1500, chunk_overlap=0)
    docs = text_splitter.split_documents(documents)
    #创建开源嵌入函数
    #embedding_function = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
    embedding_function = OpenAIEmbeddings()
    #文档特征保存到指定磁盘中
    db = Chroma.from_documents(docs, embedding_function, persist_directory="./chroma_db")
    #从磁盘加载
    db2 = Chroma(persist_directory="./chroma_db", embedding_function=embedding_function)
    #进行查询
    query = "Pixar公司是做什么的?"
    #按分数检索,余弦距离,向量夹角余弦值,值越大相似性越高
    docs = db2.similarity_search_with_score(query)
    #打印结果
    print(docs[0].page_content)
    #使用mmr检索:平衡相关性和多样性
    retriever = db.as_retriever(search_type="mmr")
    result = retriever.invoke(query)
    print(result[0])
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    51
    #chroma_update.py:增删改查
    from langchain_chroma import Chroma
    from langchain_community.document_loaders import TextLoader
    #from langchain_huggingface import HuggingFaceEmbeddings
    from langchain_openai import OpenAIEmbeddings
    from langchain_text_splitters import CharacterTextSplitter
    import chromadb

    #建立数据库对象
    #persistant_client = chromadb.PersistantClient()
    new_client = chromadb.EphemeralClient()

    # 加载文档并分割成片段
    loader = TextLoader("../../resource/knowledge.txt", encoding="UTF-8")
    documents = loader.load()
    #分片
    text_splitter = CharacterTextSplitter(chunk_size=1500, chunk_overlap=0)
    docs = text_splitter.split_documents(documents)
    #创建开源嵌入函数
    #embedding_function = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
    embedding_function = OpenAIEmbeddings()
    #文档特征保存到指定磁盘中
    db = Chroma.from_documents(docs, embedding_function, persist_directory="./chroma_db")
    #从磁盘加载
    db2 = Chroma(persist_directory="./chroma_db", embedding_function=embedding_function)
    #进行查询
    query = "Pixar公司是做什么的?"

    #创建简单的ids
    ids = [str(i) for i in range(1,len(docs)+1)]
    #添加数据
    example_db = chroma.from_documents(docs, embedding_function, ids=ids)
    #按分数检索
    docs = example_db.similarity_search_with_score(query)
    #更新文档的元数据
    docs[0].metadata = {
    "source":"./resource/knowledge.txt",
    "new_value":"hello world",
    }
    print("更新前内容")
    print(example_db.collection.get(ids=[ids[0]]))
    example_db.update_document(ids[0], docs[0])
    print("更新后内容")
    print(example_db.collection.get(ids=[ids[0]]))
    #删除最后一个分段
    print("删除前计数", example_db._collection.count())
    print(example_db._collection.get(ids=[ids[-1]]))
    example_db._collection.delete(ids=[ids[-1]])
    print("删除后计数", example_db._collection.count())
    print(example_db._collection.get(ids=[ids[-1]]))

  • FAISS: 聚类检索 pip install -U faiss-cpu tikoken

  • weaviate: 云数据库 pip install -Uqq lang chain-weaviate
    weaviate_client = weaviate.connection_to_local()

  • Qdrant: 本地模式和云模式 pip install —upgrade —quiet liangchain-qdrant

  • Milvus:云数据库,可视化词向量好

向量嵌入Vector Embeddings

向量嵌入把自然语言转成向量。传统搜索(B Tree,BM25,TF-IDF)基于文本匹配而非语义。例如小狗和金毛是不同文本但是语义上是有相关性的,传统上使用特征工程做标签。

非结构化数据例如图片特征量大可以把颜色、形状、纹理、边缘、对象、场景等作为特征;例如时序波形特征量可以有均值、标准差、均方根值、绝对均值、峰值、偏度、峭度、方差、峭度指数、峰值指数、波形指数和脉冲指数;频谱数据特征可以有频率均值、频率方差、频率偏度、频率峭度、频率重心、标准差、频率均方根、平均频率、正则度、变异参数、八阶矩、十六阶矩。人力标注有难度,向量嵌入可以实现自动化提取特征。

向量嵌入由LLM生成,不同算法生成不同的向量数据(数据特征),对于文本这些特征可能包括词汇、语法、语义、情感、主题、上下文等,音频可能包含音调、音节、音高、音色、音量等。

文本向量可以用text-embedding-ada-002模型生成,图像可以用clip-vit-base-patch32模型生成,音频可以通过wav2vec2-base-960模型生成。

检索方式

1)减少向量大小:降维

2)缩小搜索范围:聚类或向量组织成树、图结构,在最接近的簇中检索

  • k-meansimage-20250311110443758
  • Hierarchical Navigable Small Worlds (HNSW)image-20250311110706767

Callback组建

回调系统允许连接到LLM应用的各个阶段,做日志记录、监控、流式处理等。

回调事件

event event trigger assosiated method
chat model start when a chat model starts on_chat_model_start
LLM start when a llm starts on_llm_start
LLM new token when an llm OR chat model emits a new token on_llm_new_token
LLM ends when an llm OR chat model ends on_llm_end
Tool start When a tool starts running on_tool_start

会话管理和Retriever

image-20250311140714880

工作流编排

image-20250311142049607

image-20250311142227959

多模态输入和自定义输出

多模态

image-20250311144000155

自定义输出

image-20250311144500487

image-20250311144923925

Agent开发

LLM智能输出文本,无法采取行动。Agent的能力包括:会工具调用,会检索信息,会在线查找信息Tavily,有聊天记忆功能,跟踪应用程序,langsmith调试。

搜索工具

1
2
3
4
5
#tools_searcher.py
from langchain_community.tools.tavily_search import TavilySearchResults
search = TavilySearchResults()
search.invoke("what is the weather in SF")

检索工具

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
#tools_retriever.py
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.tools.retriever import create_retriever_tool

loader = WebBaseLoader("https://docs.smith.langchain.com/overview")
docs = loader.load()
documents = RecursiveCharacterTextSplitter(
chunk_size=1000, chunk_overlap=200
).split_documents(docs)
#生成向量库
vector = FAISS.from_documents(documents, OpenAIEmbeddings())
retriever = vector.as_retriever()
#检索信息
retriever.invoke("how to upload a dataset")[0]
#封装检索工具
retriever_tool = create_retriever_tool(
retriever,
"langsmith_search",
"Search for information about LangSmith. For any questions about LangSmith, you must use this tool!",
)

agent创建

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
#agent_tools_create.py
from langchain_openai import ChatOpenAI
from .tools_searcher import search
from .tools_retriever import retriever_tool

model = ChatOpenAI(model="gpt-4")
tools = [search, retriever_tool]
#获取提示
from langchain import hub
#为agent设计提示词模版
prompt = hub.pull("hwchase17/openai-functions-agent")
print(prompt.messages)
#创建代理, 大模型根据提示词语义决策调用哪个工具,由agent执行具体的工具
from langchain.agents import create_tool_calling_agent
agent = create_tool_calling_agent(model, tools, prompt)

from langchain.agents import AgentExecutor
agent_executor = AgentExecutor(agent=agent, tools=tools)

#消息存储
from langchain.community.chat_message_histories import ChatMessageHistroy
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistroy #回话隔离,每个回话窗口有自己的历史聊天

store={}
def get_session_history(session_id:str)->BaseChatMessageHistory:
if session_id not in store:
store[sesssion_id] = ChatMessageHistory()
return store[session_id]

agent_with_chat_history = RunnableWithMessageHistory(
agent_executor,
get_session_history,
input_message_key="input",
history_messages_key="chat_histroy",
)


运行代理

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
#agent_tools_run.py
from .agent_tools_create import agent_executor
from .agent_tools_create import agent_with_chat_history
#无历史记忆
print(agent_executor.invoke("input": "你好"))#agent不调用工具
print(agent_executor.invoke("input": "数据库功能"))#agent调用检索工具
print(agent_executor.invoke("input": "济南天气"))#agent调用在线查询工具
print(agent_executor.invoke("input": "济南天气,上传数据库"))#agent调用在线查询工具和检索工具,agent可以做任务拆分

#有历史记忆
print(agent_executor.invoke({"input": "济南天气","chat_history":[HumanMessage(content="Hi,我是Jack"),AIMessage(content="你好")]}))#agent调用在线查询工具
response = agent_with_chat_history.invoke(
{"input:Hi,我是Jack"},
config={"configurable":{"session_id":"123"}}
)
response = agent_with_chat_history.invoke(
{"input:我是谁"},
config={"configurable":{"session_id":"123"}}
)

其他多模态

JanusinternVL, LLaVA,VITA-MLLM ,TaskMatrix,

image-20250310154044374