
Building a Web-Connected DeepSeek V4 Agent from Scratch: A Complete Decodo Integration Tutorial


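DeepSeek V4 is one of the most capable open-weight models available. Like every large language model, though, it has a training data cutoff: it knows nothing about anything that happened after its data was collected. For an agent that has to reason about current prices, breaking news, live product pages, or any other real-world data, that cutoff is a hard wall. This tutorial walks through giving a DeepSeek V4 agent live web access with the Decodo Web Scraping API.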

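What you'll build

A Python agent that can:

  1. Verify credentials and confirm the Decodo connection works before running any task
  2. Fetch any public URL through the Decodo Web Scraping API and get clean content back
  3. Run live Google searches through the Decodo SERP API and get structured results
  4. Pass that content straight to DeepSeek V4-Flash or V4-Pro for reasoning and output

The whole pipeline takes about 15 minutes to build from scratch.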

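Prerequisites

Before writing any code, you need two things:

A Decodo account and API token. Sign up for a Decodo dashboard account. After logging in, go to the Web Scraping API section, activate a subscription (a free plan is available), and copy your API token from the Basic authentication token tab.

A DeepSeek API key. Create a DeepSeek account and generate an API key in the console. DeepSeek V4-Flash is the cost-conscious default, and V4-Pro is the more capable variant.

Install the required dependencies in your terminal: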

pip install requests python-dotenv

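Store your credentials as environment variables instead of hardcoding tokens in source files. The code in this tutorial reads them from a .env file: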

DECODO_TOKEN="your_decodo_api_token"
DEEPSEEK_API_KEY="your_deepseek_api_key"

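Step 0: Verify your credentials first

Before building anything, confirm that your Decodo token is valid and the API is reachable. This surfaces authentication problems up front instead of halfway through a run.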

import os

import requests
from dotenv import load_dotenv, find_dotenv

# 1. Search up the directory tree for a .env file and load it
load_dotenv(find_dotenv())

# 2. Safely fetch the token
DECODO_TOKEN = os.getenv("DECODO_TOKEN")
if not DECODO_TOKEN:
    raise ValueError("DECODO_TOKEN is missing. Please check your .env file.")

# 3. Pass the raw token directly (no Base64 encoding needed)
DECODO_HEADERS = {
    "accept": "application/json",
    "content-type": "application/json",
    "authorization": f"Basic {DECODO_TOKEN}",
}


def verify_credentials() -> bool:
    """
    Scrape the Decodo IP endpoint to confirm the token is valid
    and the API is reachable. Returns True on success.
    """
    try:
        response = requests.post(
            "https://scraper-api.decodo.com/v2/scrape",
            json={"url": "https://ip.decodo.com/ip"},
            headers=DECODO_HEADERS,
            timeout=30,
        )
        response.raise_for_status()
        if response.status_code == 200:
            ip = response.json()["results"][0]["content"]
            print(f"Connection verified. Assigned IP: {ip.strip()}")
            return True
    except requests.exceptions.RequestException as e:
        print(f"Network or Auth error occurred: {e}")
    except (KeyError, IndexError):
        # Missing "results"/"content" keys or an empty results list
        print("Received an unexpected JSON structure from the API.")
    return False


if __name__ == "__main__":
    assert verify_credentials(), "Fix credentials before proceeding."

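The IP endpoint (https://ip.decodo.com/ip) is Decodo's own lightweight test target. It returns a single line showing the exit IP assigned to your request, which makes it the fastest way to confirm that a token works without scraping a real target site.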

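Step 1: Scrape a live URL with the Decodo Web Scraping API

Web scraping is inherently unpredictable. Sooner or later you will hit rate limits or target-site timeouts. To handle that gracefully, we first create a robust network helper that retries with exponential backoff, then use it to build the core URL scraper, which can also request JavaScript-rendered pages.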

import time


def _post_decodo(payload: dict, retries: int = 3, backoff: float = 2.0) -> dict:
    """Shared POST helper with exponential-backoff retry."""
    delay = backoff
    for attempt in range(retries):
        try:
            r = requests.post(
                "https://scraper-api.decodo.com/v2/scrape",
                json=payload,
                headers=DECODO_HEADERS,
                timeout=60,
            )
            r.raise_for_status()
            return r.json()
        except requests.HTTPError as exc:
            # Retry only on rate limiting (429) and target timeouts (524)
            if exc.response.status_code in (429, 524) and attempt < retries - 1:
                print(f"HTTP {exc.response.status_code}: retry {attempt + 1} in {delay}s")
                time.sleep(delay)
                delay *= 2
            else:
                raise
    # Only reachable when retries < 1, so fail loudly instead of returning None
    raise RuntimeError("retries must be at least 1")


def scrape_url(
    url: str,
    javascript: bool = False,
    geo: str | None = None,
) -> str:
    """
    Fetch a public URL via the Decodo Web Scraping API.

    Args:
        url: Target URL to scrape.
        javascript: Set True for JS-rendered pages (Taobao, SPAs, etc.).
        geo: Route through a specific country, e.g. "China", "United States".

    Returns:
        Raw HTML string of the target page.
    """
    payload: dict = {"url": url}
    if javascript:
        payload["headless"] = "html"
    if geo:
        payload["geo"] = geo
    return _post_decodo(payload)["results"][0]["content"]


# Quick smoke test
html = scrape_url("https://ip.decodo.com/ip")
print(html.strip())  # Prints the assigned IP
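
To make the retry timing concrete, here is a minimal sketch of the delay schedule the helper above follows: the wait doubles after each failed attempt, and the last attempt re-raises instead of sleeping. The function name backoff_delays and the optional jitter parameter are illustrative assumptions; neither is part of the helper above or of the Decodo API.

```python
import random


def backoff_delays(retries: int = 3, backoff: float = 2.0, jitter: float = 0.0) -> list[float]:
    """Return the sleep durations between attempts (hypothetical helper).

    With retries=3 there are at most two sleeps: one after the first
    failure and one after the second. The final attempt never sleeps.
    """
    delays = []
    delay = backoff
    for _ in range(max(retries - 1, 0)):
        # Optional jitter (an assumption, not in the original helper)
        # spreads out retries coming from concurrent workers.
        delays.append(delay + random.uniform(0, jitter))
        delay *= 2
    return delays


print(backoff_delays())  # [2.0, 4.0]
```

With the defaults, a request that keeps hitting 429 waits 2 seconds, then 4 seconds, and then surfaces the error to the caller.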

Response structure

A successful 200 response:

{
  "results": [
    {
      "content": "Your Ip is: 213.87.163.6",
      "status_code": 200,
      "url": "https://ip.decodo.com/ip",
      "task_id": "6971034977135771649",
      "created_at": "2026-04-24 09:24:14",
      "updated_at": "2026-04-24 09:24:17"
    }
  ]
}
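
The page body sits inside a results list, and each entry carries its own status_code. A minimal sketch of pulling the content out defensively might look like the following. extract_content is a hypothetical helper name, and treating a non-200 inner status_code as an error is an assumption rather than documented Decodo behavior.

```python
def extract_content(response: dict) -> str:
    """Return results[0]["content"], or raise ValueError with a clear message."""
    results = response.get("results")
    if not isinstance(results, list) or not results:
        raise ValueError("Response has no 'results' entries")
    first = results[0]
    # Assumption: the inner status_code reflects the target page's HTTP status.
    status = first.get("status_code")
    if status is not None and status != 200:
        raise ValueError(f"Target returned status {status} for {first.get('url')}")
    if "content" not in first:
        raise ValueError("First result has no 'content' field")
    return first["content"]


sample = {
    "results": [
        {
            "content": "Your Ip is: 213.87.163.6",
            "status_code": 200,
            "url": "https://ip.decodo.com/ip",
        }
    ]
}
print(extract_content(sample))  # Your Ip is: 213.87.163.6
```

Raising a single exception type keeps the calling code simple: one except clause covers every malformed-response case.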

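Key parameters

url: the only required parameter. It can be any publicly accessible URL.

headless: set it to "html" to enable full JavaScript rendering. This is required for Taobao product pages, JS-heavy dashboards, and any page that loads prices or content dynamically after the first render. Omit it for static HTML pages, because it adds latency.

geo: routes the request through an IP in the specified country. Accepts country names: "China", "United States", "Germany", and so on.

device_type: optional. Accepted values: "desktop" (default), "mobile", "desktop_chrome", "mobile_android".

Do not add proxy_pool or any other undocumented parameter to the payload. These are not valid fields for this endpoint: at best they are silently ignored, at worst they cause unexpected behavior. The Decodo API selects a suitable proxy pool automatically based on the target URL and your subscription.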
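
One way to enforce the advice above is to build every payload through a single function that only knows the four parameters listed in this section. The sketch below is illustrative: build_scrape_payload and ALLOWED_DEVICE_TYPES are hypothetical names, and the URL scheme check is an added assumption, not a Decodo requirement.

```python
from __future__ import annotations

ALLOWED_DEVICE_TYPES = {"desktop", "mobile", "desktop_chrome", "mobile_android"}


def build_scrape_payload(
    url: str,
    javascript: bool = False,
    geo: str | None = None,
    device_type: str | None = None,
) -> dict:
    """Build a payload from the parameters described above and nothing else."""
    # Assumption: reject anything that is not an absolute http(s) URL.
    if not url.startswith(("http://", "https://")):
        raise ValueError(f"Expected an absolute http(s) URL, got: {url!r}")
    payload: dict = {"url": url}
    if javascript:
        payload["headless"] = "html"  # full JavaScript rendering
    if geo:
        payload["geo"] = geo  # country name, e.g. "Germany"
    if device_type:
        if device_type not in ALLOWED_DEVICE_TYPES:
            raise ValueError(f"Unsupported device_type: {device_type!r}")
        payload["device_type"] = device_type
    return payload


print(build_scrape_payload("https://example.com", javascript=True, geo="Germany"))
# {'url': 'https://example.com', 'headless': 'html', 'geo': 'Germany'}
```

Because the function has no catch-all keyword arguments, a stray field such as proxy_pool fails immediately with a TypeError instead of reaching the API.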

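Step 2: Run live Google searches with the Decodo SERP API

Google Search gives the AI access to current events, but the layout of a results page changes from query to query. We have to parse the JSON response defensively so the agent does not crash when Google returns a Knowledge Panel instead of standard organic result links.

Google Search with defensive parsing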

import json


def google_search(
    query: str,
    geo: str = "China",
    num_pages: int = 1,
) -> list[dict]:
    """
    Run a Google Search via the Decodo SERP API.
    Returns a list of organic result dicts (title, url, desc).
    """
    payload = {
        "target": "google_search",
        "query": query,
        "parse": True,
        "num_pages": num_pages,
        "locale": "zh-CN",
        "geo": geo,
    }
    response_data = _post_decodo(payload)
    try:
        results_array = response_data.get("results", [{}])
        if not results_array:
            return []
        # Walk the nested structure with .get() so a missing level yields
        # an empty result instead of a KeyError
        content_dict = results_array[0].get("content", {})
        inner_results = content_dict.get("results", {})
        organic_results = inner_results.get("organic", [])
        if not organic_results:
            formatted_json = json.dumps(content_dict, indent=2)
            print(f"⚠️ Warning: No organic results found. API returned:\n{formatted_json}")
        return organic_results
    except Exception as e:
        # For example, "content" came back as an HTML string instead of a dict
        print(f"⚠️ Failed to parse Decodo Search JSON: {e}")
        return []

Parsed result structure

{
  "pos": 1,
  "url": "https://example.com/deepseek-v4",
  "title": "DeepSeek V4 benchmark results",
  "desc": "DeepSeek V4-Pro scores top marks on MMLU and coding evals...",
  "url_shown": "example.com",
  "pos_overall": 1
}

Always set parse: True when passing SERP results to an LLM. For the same information, structured JSON consumes only a fraction of the tokens that raw HTML does.
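
To show how little text the parsed form needs, here is a small sketch that turns organic results into a numbered plain-text list for the prompt. format_results is a hypothetical helper; it relies only on the title, url, and desc fields shown in the structure above and tolerates entries where a field is missing.

```python
def format_results(results: list[dict], limit: int = 5) -> str:
    """Render up to `limit` organic results as compact numbered text."""
    lines = []
    for i, res in enumerate(results[:limit], start=1):
        # .get() with a fallback avoids a KeyError on incomplete entries.
        title = res.get("title", "(no title)")
        url = res.get("url", "")
        desc = res.get("desc", "")
        lines.append(f"{i}. {title}\n   {url}\n   {desc}")
    return "\n".join(lines)


sample_results = [
    {
        "pos": 1,
        "url": "https://example.com/deepseek-v4",
        "title": "DeepSeek V4 benchmark results",
        "desc": "DeepSeek V4-Pro scores top marks on MMLU and coding evals...",
    },
    {"pos": 2, "url": "https://example.org"},  # entry with missing fields
]
print(format_results(sample_results))
```

Fields such as pos_overall and url_shown are dropped on purpose, since they add tokens without helping the model answer.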

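Other supported SERP targets

baidu_search: Baidu keyword search (returns HTML; parsing is not supported)

google_shopping_search: Google Shopping results (parseable)

google_ads: Google results including paid ads (parseable)

google_trends_explore: Google Trends data for a keyword (returns structured JSON)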
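
A lookup table is one way to keep the parse flag consistent with the list above. In this sketch, build_serp_payload and PARSEABLE_TARGETS are hypothetical names. It assumes the other targets accept the same target and query fields as google_search, which this article does not confirm, so check the Decodo documentation before relying on it.

```python
# Targets described above as parseable, plus google_search from Step 2.
PARSEABLE_TARGETS = {"google_search", "google_shopping_search", "google_ads"}
# Targets described above as returning HTML or their own structured JSON.
# Assumption: the parse flag is simply left out for these.
OTHER_TARGETS = {"baidu_search", "google_trends_explore"}


def build_serp_payload(target: str, query: str) -> dict:
    """Build a SERP payload, setting parse only where the list above allows it."""
    if target not in PARSEABLE_TARGETS | OTHER_TARGETS:
        raise ValueError(f"Unknown SERP target: {target!r}")
    payload = {"target": target, "query": query}
    if target in PARSEABLE_TARGETS:
        payload["parse"] = True
    return payload


print(build_serp_payload("google_ads", "web scraping api"))
# {'target': 'google_ads', 'query': 'web scraping api', 'parse': True}
print(build_serp_payload("baidu_search", "web scraping api"))
# {'target': 'baidu_search', 'query': 'web scraping api'}
```

For baidu_search the result is HTML, so it would still need the cleaning step described in Step 3 before it reaches the model.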

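Step 3: Clean the HTML before passing it to DeepSeek V4

Large language models (LLMs) are billed per token. Feeding them raw HTML full of structural tags, inline styles, and tracking scripts wastes context window and degrades reasoning quality. This cleaning step strips the noise and keeps only the meaningful plain text the AI actually needs.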

import re


def html_to_text(html: str, max_chars: int = 12_000) -> str:
    """
    Strip HTML tags, remove script/style blocks, collapse whitespace.
    Trims output to max_chars to control token consumption.
    """
    # Drop <script> and <style> blocks with their contents; IGNORECASE also
    # catches uppercase tags such as <SCRIPT>
    html = re.sub(r'<(script|style)[^>]*>.*?</(script|style)>',
                  '', html, flags=re.DOTALL | re.IGNORECASE)
    text = re.sub(r'<[^>]+>', ' ', html)
    return re.sub(r'\s+', ' ', text).strip()[:max_chars]


# For complex pages, BeautifulSoup gives better results:
# pip install beautifulsoup4
# from bs4 import BeautifulSoup
# def html_to_text(html, max_chars=12_000):
#     text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
#     return text[:max_chars]
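
If you would rather avoid an extra dependency, the standard library's html.parser can do the same job and also decodes entities such as &amp;amp;, which the regex version leaves untouched. The following is a sketch, not a drop-in replacement; the names _TextExtractor and html_to_text_stdlib are illustrative.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)  # decode &amp; and friends
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a script or style block.
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text_stdlib(html: str, max_chars: int = 12_000) -> str:
    """Return whitespace-collapsed visible text, trimmed to max_chars."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())[:max_chars]


page = (
    "<html><head><style>p{color:red}</style>"
    '<script>var a = "<b>";</script></head>'
    "<body><p>Price: 5 &amp; up</p><p>In stock</p></body></html>"
)
print(html_to_text_stdlib(page))  # Price: 5 & up In stock
```

Text from adjacent elements is joined with a space, so inline markup in the middle of a word would gain a stray space; for price and article pages that is usually acceptable.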

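Step 4: Call DeepSeek V4

DeepSeek V4 exposes an OpenAI-compatible chat completions endpoint, so integration is straightforward. Note that we deliberately set temperature to 0.2. Keeping this value low makes the output less random and keeps the model closer to the supplied facts, which matters for RAG (retrieval-augmented generation) pipelines.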

DEEPSEEK_API_KEY = os.environ["DEEPSEEK_API_KEY"]
DEEPSEEK_MODEL = "deepseek-v4-flash"  # swap to "deepseek-v4-pro" as needed
DEEPSEEK_HEADERS = {
    "authorization": f"Bearer {DEEPSEEK_API_KEY}",
    "content-type": "application/json",
}


def ask_deepseek(
    system_prompt: str,
    user_message: str,
    max_tokens: int = 1024,
    temperature: float = 0.2,
) -> str:
    """Send a prompt to DeepSeek V4 and return the response text."""
    response = requests.post(
        "https://api.deepseek.com/v1/chat/completions",
        json={
            "model": DEEPSEEK_MODEL,
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_message},
            ],
            "max_tokens": max_tokens,
            "temperature": temperature,
        },
        headers=DEEPSEEK_HEADERS,
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
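
The function above returns only the answer text. When you also want to know whether the answer was cut off by max_tokens, or how many tokens a call used, a small parser over the response body can help. This sketch assumes the body follows the usual OpenAI-compatible shape (choices[0].finish_reason and usage.total_tokens); parse_chat_response is a hypothetical helper, and the sample body below is made up for illustration.

```python
def parse_chat_response(body: dict) -> tuple[str, bool, int]:
    """Return (answer_text, was_truncated, total_tokens) from a response body."""
    choices = body.get("choices") or []
    if not choices:
        raise ValueError("Response contains no choices")
    first = choices[0]
    text = (first.get("message") or {}).get("content") or ""
    # Assumption: finish_reason == "length" means max_tokens cut the answer off.
    truncated = first.get("finish_reason") == "length"
    total_tokens = (body.get("usage") or {}).get("total_tokens", 0)
    return text, truncated, total_tokens


sample_body = {
    "choices": [
        {"message": {"role": "assistant", "content": "Plan A costs more."},
         "finish_reason": "stop"}
    ],
    "usage": {"prompt_tokens": 900, "completion_tokens": 12, "total_tokens": 912},
}
print(parse_chat_response(sample_body))  # ('Plan A costs more.', False, 912)
```

If was_truncated comes back True, raising max_tokens or shortening the scraped context are the two obvious levers.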

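Step 5: The complete agent

Now we wire the whole pipeline together. The main function takes the user's question, fetches live content through Decodo (from a specific URL, a Google search, or both), and assembles a tightly scoped context. The system prompt explicitly tells the LLM to answer only from the live facts we have just supplied.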

def web_agent(
    question: str,
    url: str | None = None,
    search_query: str | None = None,
    javascript: bool = False,
    geo: str = "China",
) -> str:
    """
    DeepSeek V4 agent with live web access via Decodo.
    Supply at least one of: url (scrape a specific page) or
    search_query (run a Google Search). Both can be used together.
    """
    if not url and not search_query:
        raise ValueError("Provide at least one of: url or search_query")
    context_parts: list[str] = []
    if url:
        print(f"Scraping {url} ...")
        raw = scrape_url(url, javascript=javascript, geo=geo)
        text = html_to_text(raw)
        context_parts.append(f"[Page: {url}]\n{text}")
    if search_query:
        print(f"Searching: {search_query} ...")
        results = google_search(search_query, geo=geo)
        # .get() keeps formatting from crashing when a result lacks a field
        formatted = "\n".join(
            f"{i + 1}. {res.get('title', '')}\n   {res.get('url', '')}\n   {res.get('desc', '')}"
            for i, res in enumerate(results[:5])
        )
        context_parts.append(f"[Search: {search_query}]\n{formatted}")
    context = "\n\n".join(context_parts)
    return ask_deepseek(
        system_prompt=(
            "You are a precise research assistant. You have been given live "
            "web content fetched right now. Answer the user's question using "
            "only the provided content. Be specific and cite facts directly."
        ),
        user_message=f"Content:\n{context}\n\nQuestion: {question}",
    )


# Usage examples

# 1. Scrape a specific page
print(web_agent(
    question="What scraping plans are available and what do they cost?",
    url="https://decodo.cn/scraping/web/pricing",
))

# 2. Search Google and reason over the top results
print(web_agent(
    question="What are the key differences between DeepSeek V4-Pro and V4-Flash?",
    search_query="DeepSeek V4-Pro vs V4-Flash benchmark 2026",
))

# 3. Scrape a JS-rendered page (e.g. Taobao)
print(web_agent(
    question="What is the current listed price for this product?",
    url="https://item.taobao.com/item.htm?id=YOUR_ITEM_ID",
    javascript=True,
))

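HTTP response codes

Handle these response codes explicitly in production:

200: success. The content is in results[0]["content"].

204: the request was accepted but has not finished yet. Wait, then retry.

400: malformed payload. Check the required fields and parameter names.

401: invalid or missing token. Recheck DECODO_TOKEN and regenerate it if necessary.

429: rate limit hit. Retry with exponential backoff.

524: the target site timed out. Retry; if the page needs JS rendering, enable headless: "html".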
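
The table above can be folded into a single decision function that a request loop consults. The sketch below uses hypothetical names (classify_status and the four action strings). Note that 204 needs its own branch: requests' raise_for_status() only raises for 4xx and 5xx codes, so a 204 would otherwise reach the JSON parsing step, where an empty body would fail.

```python
def classify_status(status_code: int) -> str:
    """Map a Decodo HTTP status to one of: 'ok', 'wait', 'retry', 'fail'."""
    if status_code == 200:
        return "ok"      # content is in results[0]["content"]
    if status_code == 204:
        return "wait"    # accepted but not finished; poll again later
    if status_code in (429, 524):
        return "retry"   # rate limit or target timeout; back off and retry
    # 400 (bad payload), 401 (bad token), and anything unlisted: retrying
    # will not help until the request itself is fixed.
    return "fail"


for code in (200, 204, 400, 401, 429, 524):
    print(code, classify_status(code))
```

Treating every unlisted code as "fail" is a conservative assumption; adjust it if the Decodo documentation describes other retryable codes.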

Complete production script

Everything in one file. Set your environment variables, then run it.

import json
import os
import re
import time

import requests
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

DECODO_TOKEN = os.getenv("DECODO_TOKEN")
DEEPSEEK_API_KEY = os.getenv("DEEPSEEK_API_KEY")
DEEPSEEK_MODEL = "deepseek-v4-flash"

if not DECODO_TOKEN:
    raise ValueError("DECODO_TOKEN is missing from your .env file.")
if not DEEPSEEK_API_KEY:
    raise ValueError("DEEPSEEK_API_KEY is missing from your .env file.")

DECODO_HEADERS = {
    "accept": "application/json",
    "content-type": "application/json",
    "authorization": f"Basic {DECODO_TOKEN}",
}
DEEPSEEK_HEADERS = {
    "authorization": f"Bearer {DEEPSEEK_API_KEY}",
    "content-type": "application/json",
}


def _post_decodo(payload: dict, retries: int = 3, backoff: float = 2.0) -> dict:
    """Shared POST helper with exponential-backoff retry."""
    delay = backoff
    for attempt in range(retries):
        try:
            r = requests.post(
                "https://scraper-api.decodo.com/v2/scrape",
                json=payload,
                headers=DECODO_HEADERS,
                timeout=60,
            )
            r.raise_for_status()
            return r.json()
        except requests.HTTPError as exc:
            # Retry only on rate limiting (429) and target timeouts (524)
            if exc.response.status_code in (429, 524) and attempt < retries - 1:
                print(f"HTTP {exc.response.status_code}: retry {attempt + 1} in {delay}s")
                time.sleep(delay)
                delay *= 2
            else:
                raise
    # Only reachable when retries < 1, so fail loudly instead of returning None
    raise RuntimeError("retries must be at least 1")


def verify_credentials() -> bool:
    """Scrape the Decodo IP endpoint; raises if the token or API call fails."""
    data = _post_decodo({"url": "https://ip.decodo.com/ip"})
    ip = data["results"][0]["content"].strip()
    print(f"Decodo OK, assigned IP: {ip}")
    return True


def scrape_url(
    url: str,
    javascript: bool = False,
    geo: str | None = None,
) -> str:
    """Fetch a public URL via Decodo and return the raw page content."""
    payload: dict = {"url": url}
    if javascript:
        payload["headless"] = "html"
    if geo:
        payload["geo"] = geo
    return _post_decodo(payload)["results"][0]["content"]


def google_search(
    query: str,
    geo: str = "China",
    num_pages: int = 1,
) -> list[dict]:
    """Run a Google Search via the Decodo SERP API; returns organic results."""
    payload = {
        "target": "google_search",
        "query": query,
        "parse": True,
        "num_pages": num_pages,
        "locale": "zh-CN",
        "geo": geo,
    }
    response_data = _post_decodo(payload)
    try:
        results_array = response_data.get("results", [{}])
        if not results_array:
            return []
        content_dict = results_array[0].get("content", {})
        inner_results = content_dict.get("results", {})
        organic_results = inner_results.get("organic", [])
        if not organic_results:
            formatted_json = json.dumps(content_dict, indent=2)
            print(f"⚠️ Warning: No organic results found. API returned:\n{formatted_json}")
        return organic_results
    except Exception as e:
        print(f"⚠️ Failed to parse Decodo Search JSON: {e}")
        return []


def html_to_text(html: str, max_chars: int = 12_000) -> str:
    """Strip script/style blocks and tags, collapse whitespace, trim length."""
    html = re.sub(r'<(script|style)[^>]*>.*?</(script|style)>',
                  '', html, flags=re.DOTALL | re.IGNORECASE)
    text = re.sub(r'<[^>]+>', ' ', html)
    return re.sub(r'\s+', ' ', text).strip()[:max_chars]


def ask_deepseek(
    system_prompt: str,
    user_message: str,
    max_tokens: int = 1024,
    temperature: float = 0.2,
) -> str:
    """Send a prompt to DeepSeek V4 and return the response text."""
    r = requests.post(
        "https://api.deepseek.com/v1/chat/completions",
        json={
            "model": DEEPSEEK_MODEL,
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_message},
            ],
            "max_tokens": max_tokens,
            "temperature": temperature,
        },
        headers=DEEPSEEK_HEADERS,
        timeout=60,
    )
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]


def web_agent(
    question: str,
    url: str | None = None,
    search_query: str | None = None,
    javascript: bool = False,
    geo: str = "China",
) -> str:
    """Answer a question from a scraped page, a Google search, or both."""
    if not url and not search_query:
        raise ValueError("Provide url or search_query")
    parts: list[str] = []
    if url:
        parts.append(f"[Page: {url}]\n{html_to_text(scrape_url(url, javascript, geo))}")
    if search_query:
        res = google_search(search_query, geo=geo)
        # .get() keeps formatting from crashing when a result lacks a field
        parts.append("[Search: {}]\n{}".format(
            search_query,
            "\n".join(
                f"{i + 1}. {x.get('title', '')}\n   {x.get('url', '')}\n   {x.get('desc', '')}"
                for i, x in enumerate(res[:5])
            ),
        ))
    context = "\n".join(parts)
    return ask_deepseek(
        "Answer using only the live web content below. Cite facts directly.",
        f"Content:\n{context}\n\nQuestion: {question}",
    )


if __name__ == "__main__":
    verify_credentials()
    print(web_agent(
        question="What are the latest DeepSeek V4 benchmark results?",
        search_query="DeepSeek V4 benchmark results May 2026",
    ))

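What to do next

Add memory. Cache scrape results in a dictionary or Redis so the same URL is not fetched twice within a session.

Add structured output. Prompt DeepSeek V4 to return JSON (price, name, date). Setting temperature: 0 makes the output format as consistent as possible.

Add multi-step reasoning. Let V4 choose the next URL to scrape based on the initial results, then loop. The million-token context window makes multi-turn chaining practical.

Scale with async. Swap requests for httpx and asyncio to run multiple Decodo scrapes concurrently, which is essential for price-monitoring pipelines that check many SKUs at once.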
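
As a starting point for the memory suggestion, here is a minimal in-process cache with a time-to-live. Everything in it is illustrative: ScrapeCache, its ttl_seconds default of five minutes, and the injected fetch function are assumptions, and a Redis-backed version would replace the dictionary with Redis reads and writes.

```python
import time
from typing import Callable


class ScrapeCache:
    """Remember fetched pages for ttl_seconds so repeat URLs cost nothing."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch          # e.g. the scrape_url function from Step 1
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so tests can control time
        self._store: dict[str, tuple[float, str]] = {}

    def get(self, url: str) -> str:
        now = self._clock()
        hit = self._store.get(url)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]            # fresh entry: skip the network call
        content = self._fetch(url)
        self._store[url] = (now, content)
        return content


# Stand-in fetcher so the example runs without network access.
calls = []


def fake_fetch(url: str) -> str:
    calls.append(url)
    return f"<html>{url}</html>"


cache = ScrapeCache(fake_fetch)
cache.get("https://example.com/a")
cache.get("https://example.com/a")  # served from the cache
print(len(calls))  # 1
```

In the agent, passing scrape_url as the fetch argument would make repeated questions about the same page reuse one Decodo request until the entry expires.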


About the author

Kristina Selivanovaite

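Decodo (德口多) expert column: Kristina Selivanovaite, brand protection expert

Kristina is a specialist in international relations and diplomacy who holds a master's degree and has a strong interest in building bridges for global digital access. Drawing on her academic background and global perspective, she writes content tailored to our Chinese readers on topics including web scraping, proxies, and ways to get around various network restrictions.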



© 2018-2026 decodo.cn (formerly smartproxy.com). All rights reserved. 津ICP备2022004334号-2