安全对齐与护栏

在知识图谱中的位置：模块四 · 04_工程实践 · 第 4 节难度：⭐⭐⭐ | 前置知识：Agent 工程化

1. 概述

Agent 的安全是生产部署的最重要环节。Agent 拥有执行能力（工具调用），安全风险远超静态 LLM。

2. 安全维度

2.1 Agent 的 6 大安全风险

风险	说明	严重度
越狱攻击	诱导 Agent 执行危险操作	🔴 高
工具滥用	调用敏感工具（删除/发送）	🔴 高
信息泄露	暴露用户隐私数据	🟠 中高
注入攻击	Prompt 注入	🟠 中高
幻觉执行	基于错误信息执行操作	🟡 中
权限逃逸	超出 Agent 权限范围	🔴 高

2.2 安全护栏架构

输入 → [输入护栏] → Agent → [输出护栏] → 用户
                    ↓
              [工具护栏]
                    ↓
              [执行护栏]

3. 护栏实现

3.1 输入护栏

python

def input_guardrail(user_input: str) -> bool:
    """输入安全检查"""
    # 1. 检测越狱尝试
    jailbreak_patterns = ["ignore previous", "system prompt", "作为助手"]
    if any(p.lower() in user_input.lower() for p in jailbreak_patterns):
        return False, "检测到越狱尝试"
    
    # 2. 检测敏感信息
    if contains_pii(user_input):  # 个人身份信息
        return False, "检测到敏感信息"
    
    return True, ""

3.2 输出护栏

python

def output_guardrail(agent_output: str) -> str:
    """输出安全检查"""
    # 1. 过滤敏感信息
    output = filter_pii(agent_output)
    
    # 2. 检查是否有不安全建议
    if is_harmful_advice(output):
        return "抱歉，我无法提供此类建议。"
    
    return output

3.3 工具护栏

python

# 工具权限分级
TOOL_PERMISSIONS = {
    "read_file": "allow",
    "write_file": "audit",  # 需要人工确认
    "delete_file": "deny",   # 需要人工确认
    "send_email": "audit",
    "execute_command": "deny",
}

def check_tool_permission(tool_name: str) -> bool:
    perm = TOOL_PERMISSIONS.get(tool_name, "deny")
    if perm == "deny":
        return False, "该工具需要人工审批"
    elif perm == "audit":
        return True, f"⚠️ 执行 {tool_name} 需要记录审计日志"
    return True, ""

4. 最佳实践

默认 deny — 所有工具默认拒绝，明确允许的才放开
人工审批 — 敏感操作必须人工确认
审计日志 — 所有 Agent 行为可追溯
最小权限 — Agent 只拥有完成工作所需的最小权限
红队测试 — 定期对 Agent 进行攻击测试

05_折叠屏

安全对齐与护栏 ​

1. 概述 ​

2. 安全维度 ​

2.1 Agent 的 6 大安全风险 ​

2.2 安全护栏架构 ​

3. 护栏实现 ​

3.1 输入护栏 ​

3.2 输出护栏 ​

3.3 工具护栏 ​

4. 最佳实践 ​

5. 参考资料 ​

安全对齐与护栏

1. 概述

2. 安全维度

2.1 Agent 的 6 大安全风险

2.2 安全护栏架构

3. 护栏实现

3.1 输入护栏

3.2 输出护栏

3.3 工具护栏

4. 最佳实践

5. 参考资料