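Building a Web-Connected DeepSeek V4 Agent from Scratch: A Complete Decodo Integration Tutorial
DeepSeek V4 is one of the most capable open-weight models available today. Like every large language model, though, it has a training-data cutoff: it knows nothing about anything that happened after its data was collected. For an agent that needs to reason over current prices, breaking news, live product pages, or any other real-world data, that cutoff is a hard wall. This tutorial walks through how to give a DeepSeek V4 agent live web access using the Decodo Web Scraping API.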
Kristina Selivanovaite
Last updated: June 4, 2026
4 min read

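What you'll build
A Python agent that can:
- Verify credentials and confirm the Decodo connection works before running any task
- Scrape any public URL through the Decodo Web Scraping API and get clean content back
- Run live Google searches through the Decodo SERP API and get structured results
- Pass that content directly to DeepSeek V4-Flash or V4-Pro for reasoning and output
The whole pipeline takes about 15 minutes to build from scratch.
Prerequisites
Before writing the first line of code, you need two things:
Decodo account and API token. Sign up for a Decodo dashboard account. After logging in, go to the Web Scraping API section, activate a subscription (a free plan is available), and copy your API token from the Basic authentication token tab.
DeepSeek API key. Create a DeepSeek account and generate an API key in the console. DeepSeek V4-Flash is the cost-conscious default, while V4-Pro is the more capable version.
Install the required dependencies in your terminal: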
pip install requests python-dotenv
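Store your credentials as environment variables (for example in a .env file, which the scripts in this tutorial load with python-dotenv) rather than hard-coding tokens in your source files: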
DECODO_TOKEN="your_decodo_api_token"
DEEPSEEK_API_KEY="your_deepseek_api_key"
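As a minimal sketch of the fail-fast idea behind this setup, the helper below reads a required variable and raises a clear error when it is absent or blank. The name `require_env` is hypothetical and not part of any library; the `env` parameter is there only so the function can be exercised with a plain dictionary instead of the real process environment.

```python
import os
from collections.abc import Mapping


def require_env(name: str, env: Mapping[str, str] = os.environ) -> str:
    """Return the value of a required environment variable.

    Raises ValueError if the variable is unset or blank, so a missing
    credential is reported before any network request is made.
    """
    value = env.get(name, "").strip()
    if not value:
        raise ValueError(f"{name} is missing. Set it in your shell or .env file.")
    return value
```

Calling `require_env("DECODO_TOKEN")` at the top of a script would surface a missing token immediately rather than as an authentication error later.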
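Step 0: Verify your credentials first
Before building anything, confirm that your Decodo token is valid and the API is reachable. This surfaces authentication problems up front rather than halfway through the pipeline.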
import os

import requests
from dotenv import load_dotenv, find_dotenv

# 1. Search up the directory tree for a .env file and load it
load_dotenv(find_dotenv())

# 2. Fetch the token and fail early if it is missing
DECODO_TOKEN = os.getenv("DECODO_TOKEN")
if not DECODO_TOKEN:
    raise ValueError("DECODO_TOKEN is missing. Please check your .env file.")

# 3. Pass the raw token directly (no Base64 encoding needed)
DECODO_HEADERS = {
    "accept": "application/json",
    "content-type": "application/json",
    "authorization": f"Basic {DECODO_TOKEN}",
}


def verify_credentials() -> bool:
    """Hit the Decodo IP endpoint to confirm the token is valid
    and the API is reachable. Returns True on success."""
    try:
        response = requests.post(
            "https://scraper-api.decodo.com/v2/scrape",
            json={"url": "https://ip.decodo.com/ip"},
            headers=DECODO_HEADERS,
            timeout=30,
        )
        response.raise_for_status()
        if response.status_code == 200:
            ip = response.json()["results"][0]["content"]
            print(f"Connection verified. Assigned IP: {ip.strip()}")
            return True
    except requests.exceptions.RequestException as e:
        print(f"Network or Auth error occurred: {e}")
    except (KeyError, IndexError):
        # Missing key or empty "results" list
        print("Received an unexpected JSON structure from the API.")
    return False


if __name__ == "__main__":
    assert verify_credentials(), "Fix credentials before proceeding."
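The IP endpoint (https://ip.decodo.com/ip) is Decodo's own lightweight test target. It returns a single line showing the exit IP assigned to your request, which makes it the fastest way to confirm that your token works without scraping a real target site.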
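Step 1: Scrape live URLs with the Decodo Web Scraping API
Web scraping is inherently unpredictable. Sooner or later you will hit rate limits or target-site timeouts. To handle that gracefully, we first create a robust network helper that retries with exponential backoff, then use it to build the core URL scraper, which can also request JavaScript-rendered pages.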
import time


def _post_decodo(payload: dict, retries: int = 3, backoff: float = 2.0) -> dict:
    """Shared POST helper with exponential-backoff retry."""
    delay = backoff
    for attempt in range(retries):
        try:
            r = requests.post(
                "https://scraper-api.decodo.com/v2/scrape",
                json=payload,
                headers=DECODO_HEADERS,
                timeout=60,
            )
            r.raise_for_status()
            return r.json()
        except requests.HTTPError as exc:
            # Retry only on rate limiting (429) or target timeout (524)
            if exc.response.status_code in (429, 524) and attempt < retries - 1:
                print(f"HTTP {exc.response.status_code}: retry {attempt + 1} in {delay}s")
                time.sleep(delay)
                delay *= 2
            else:
                raise


def scrape_url(
    url: str,
    javascript: bool = False,
    geo: str | None = None,
) -> str:
    """Fetch a public URL via the Decodo Web Scraping API.

    Args:
        url: Target URL to scrape.
        javascript: Set True for JS-rendered pages (Taobao, SPAs, etc.).
        geo: Route through a specific country, e.g. "China", "United States".

    Returns:
        Raw HTML string of the target page.
    """
    payload: dict = {"url": url}
    if javascript:
        payload["headless"] = "html"
    if geo:
        payload["geo"] = geo
    return _post_decodo(payload)["results"][0]["content"]


# Quick smoke test
html = scrape_url("https://ip.decodo.com/ip")
print(html.strip())  # Prints the assigned IP
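To make the retry timing concrete, here is a small sketch that computes the sleep durations the helper would use in the worst case. `backoff_schedule` is a hypothetical name introduced only for illustration; it mirrors the logic of doubling the delay after each failed attempt and never sleeping after the final one.

```python
def backoff_schedule(retries: int = 3, backoff: float = 2.0) -> list[float]:
    """Return the sleep durations used between attempts.

    With `retries` total attempts there are at most `retries - 1` sleeps,
    and each sleep is double the previous one.
    """
    return [backoff * (2 ** i) for i in range(max(retries - 1, 0))]


print(backoff_schedule())        # [2.0, 4.0]
print(backoff_schedule(5, 1.0))  # [1.0, 2.0, 4.0, 8.0]
```

With the defaults, a request that keeps hitting 429 or 524 waits 6 seconds in total before the error is raised to the caller.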
Response structure
A successful 200 response:
{
  "results": [
    {
      "content": "Your Ip is: 213.87.163.6",
      "status_code": 200,
      "url": "https://ip.decodo.com/ip",
      "task_id": "6971034977135771649",
      "created_at": "2026-04-24 09:24:14",
      "updated_at": "2026-04-24 09:24:17"
    }
  ]
}
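Key parameters
url: The only required parameter. Any publicly accessible URL.
headless: Set to "html" to enable full JavaScript rendering. Required for Taobao product pages, JS-heavy dashboards, and any page that loads prices or content dynamically after the first render. Omit it for static HTML pages, since it adds latency.
geo: Routes the request through an IP in the specified country. Accepts country names: "China", "United States", "Germany", and so on.
device_type: Optional. Accepted values: "desktop" (default), "mobile", "desktop_chrome", "mobile_android".
Do not add proxy_pool or any other undocumented parameter to the payload. They are not valid fields for this endpoint; at best they are silently ignored, at worst they cause unexpected behavior. The Decodo API automatically selects the appropriate proxy pool based on the target URL and your subscription.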
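Step 2: Run live Google searches with the Decodo SERP API
Google Search gives the AI access to current events, but the layout of a results page varies from query to query. We need defensive programming to parse the JSON response safely, so that the agent does not crash when Google returns a Knowledge Panel instead of standard organic result links.
Google Search with defensive parsing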
import json


def google_search(
    query: str,
    geo: str = "China",
    num_pages: int = 1,
) -> list[dict]:
    """Run a Google Search via the Decodo SERP API.
    Returns a list of organic result dicts (title, url, desc)."""
    payload = {
        "target": "google_search",
        "query": query,
        "parse": True,
        "num_pages": num_pages,
        "locale": "zh-CN",
        "geo": geo,
    }
    response_data = _post_decodo(payload)
    try:
        results_array = response_data.get("results", [{}])
        if not results_array:
            return []
        content_dict = results_array[0].get("content", {})
        inner_results = content_dict.get("results", {})
        organic_results = inner_results.get("organic", [])
        if not organic_results:
            # Dump what came back so an unexpected layout is easy to inspect
            formatted_json = json.dumps(content_dict, indent=2)
            print(f"⚠️ Warning: No organic results found. API returned:\n{formatted_json}")
        return organic_results
    except Exception as e:
        print(f"⚠️ Failed to parse Decodo Search JSON: {e}")
        return []
Parsed result structure
{
  "pos": 1,
  "url": "https://example.com/deepseek-v4",
  "title": "DeepSeek V4 benchmark results",
  "desc": "DeepSeek V4-Pro scores top marks on MMLU and coding evals...",
  "url_shown": "example.com",
  "pos_overall": 1
}
When passing SERP results to an LLM, always set parse: True. For the same information, structured JSON consumes only a fraction of the tokens that raw HTML does.
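Parsed results can be shrunk further before they reach the model by keeping only the fields it needs. The sketch below assumes each organic result is a dict with optional `title`, `url`, and `desc` keys, as in the sample above; `format_results` is a hypothetical helper, and it uses `.get()` so that a result missing one field does not raise.

```python
def format_results(results: list[dict], limit: int = 5) -> str:
    """Render the top organic results as a compact numbered list."""
    lines = []
    for i, item in enumerate(results[:limit], start=1):
        title = item.get("title", "(no title)")
        url = item.get("url", "")
        desc = item.get("desc", "")
        lines.append(f"{i}. {title}\n   {url}\n   {desc}")
    return "\n".join(lines)


sample_results = [
    {
        "pos": 1,
        "url": "https://example.com/deepseek-v4",
        "title": "DeepSeek V4 benchmark results",
        "desc": "DeepSeek V4-Pro scores top marks on MMLU and coding evals...",
    }
]
print(format_results(sample_results))
```

Dropping positional fields such as `pos` and `pos_overall` is a design choice: the list order already conveys ranking.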
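Other supported SERP targets
baidu_search: Baidu keyword search (returns HTML; parsing is not supported)
google_shopping_search: Google Shopping results (parseable)
google_ads: Google results including paid ads (parseable)
google_trends_explore: Google Trends data for a keyword (returns structured JSON)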
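Since not every target supports parsing, it can help to decide the `parse` flag in one place. The following is a rough sketch under two assumptions that should be checked against the Decodo documentation: that every target listed here accepts a `query` field, and that `parse` should simply be omitted for targets not described as parseable. `build_serp_payload` and both constants are hypothetical names.

```python
# Targets this tutorial describes as parseable
PARSEABLE_TARGETS = {"google_search", "google_shopping_search", "google_ads"}
# All targets mentioned in this tutorial
KNOWN_TARGETS = PARSEABLE_TARGETS | {"baidu_search", "google_trends_explore"}


def build_serp_payload(target: str, query: str) -> dict:
    """Build a SERP payload, requesting parsing only where it is described as supported."""
    if target not in KNOWN_TARGETS:
        raise ValueError(f"Unknown SERP target: {target!r}")
    payload: dict = {"target": target, "query": query}
    if target in PARSEABLE_TARGETS:
        payload["parse"] = True
    return payload


print(build_serp_payload("baidu_search", "DeepSeek V4"))
# {'target': 'baidu_search', 'query': 'DeepSeek V4'}
```

For `baidu_search`, the returned HTML would then go through the cleaning step rather than the JSON parser.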
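Step 3: Clean the HTML before passing it to DeepSeek V4
Large language models (LLMs) are billed per token. Feeding them raw HTML full of structural tags, inline styles, and tracking scripts wastes the context window and degrades reasoning quality. This cleaning step strips out the noise and keeps only the meaningful plain text the AI actually needs.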
import re


def html_to_text(html: str, max_chars: int = 12_000) -> str:
    """Strip HTML tags, remove script/style blocks, collapse whitespace.
    Trims output to max_chars to control token consumption."""
    # Drop <script> and <style> blocks together with their contents
    html = re.sub(
        r'<(script|style)[^>]*>.*?</(script|style)>',
        '', html, flags=re.DOTALL
    )
    # Replace every remaining tag with a space
    text = re.sub(r'<[^>]+>', ' ', html)
    # Collapse whitespace runs and cap the length
    return re.sub(r'\s+', ' ', text).strip()[:max_chars]


# For complex pages, BeautifulSoup gives better results:
# pip install beautifulsoup4
# from bs4 import BeautifulSoup
# def html_to_text(html, max_chars=12_000):
#     text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
#     return text[:max_chars]
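If you would rather not add a dependency, the standard library's `html.parser` offers a middle ground between regular expressions and BeautifulSoup. This is a sketch of that alternative, not the tutorial's method; `html_to_text_stdlib` and `_TextExtractor` are hypothetical names. Unlike the regex version, it also decodes character references such as `&amp;`.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        # Join chunks with spaces, then collapse all whitespace runs
        return " ".join(" ".join(self._chunks).split())


def html_to_text_stdlib(html: str, max_chars: int = 12_000) -> str:
    """Extract visible text from HTML using only the standard library."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()[:max_chars]


page = "<html><head><style>p{color:red}</style><script>var a = 1;</script></head><body><p>Price: <b>42</b> USD</p></body></html>"
print(html_to_text_stdlib(page))  # Price: 42 USD
```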
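Step 4: Call DeepSeek V4
DeepSeek V4 exposes an OpenAI-compatible chat completions endpoint, so integration is straightforward. Note that we deliberately set temperature to 0.2: keeping this value low pushes the model toward more deterministic, fact-focused output, which matters for RAG (retrieval-augmented generation) pipelines.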
DEEPSEEK_API_KEY = os.environ["DEEPSEEK_API_KEY"]
DEEPSEEK_MODEL = "deepseek-v4-flash"  # swap to "deepseek-v4-pro" as needed

DEEPSEEK_HEADERS = {
    "authorization": f"Bearer {DEEPSEEK_API_KEY}",
    "content-type": "application/json",
}


def ask_deepseek(
    system_prompt: str,
    user_message: str,
    max_tokens: int = 1024,
    temperature: float = 0.2,
) -> str:
    """Send a prompt to DeepSeek V4 and return the response text."""
    response = requests.post(
        "https://api.deepseek.com/v1/chat/completions",
        json={
            "model": DEEPSEEK_MODEL,
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_message},
            ],
            "max_tokens": max_tokens,
            "temperature": temperature,
        },
        headers=DEEPSEEK_HEADERS,
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
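The chained lookup `["choices"][0]["message"]["content"]` raises a bare `KeyError` or `IndexError` if the response body has a different shape, for example when an error object is returned with a 200 status. A small wrapper, sketched here under the assumption that the body follows the OpenAI-style `choices` layout used above, turns that into a readable error. `extract_reply` is a hypothetical helper name.

```python
def extract_reply(response_json: dict) -> str:
    """Return the first choice's message text, or raise a readable error."""
    try:
        return response_json["choices"][0]["message"]["content"]
    except (KeyError, IndexError, TypeError) as exc:
        raise ValueError(
            f"Unexpected chat completion payload: {response_json!r}"
        ) from exc


mock_response = {
    "choices": [
        {"message": {"role": "assistant", "content": "Hello from the model."}}
    ]
}
print(extract_reply(mock_response))  # Hello from the model.
```

Including the raw payload in the exception message makes it easier to see what the API actually returned when something goes wrong.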
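Step 5: The complete agent
Now we wire the whole pipeline together. The main function takes the user's question, fetches live content through Decodo (either from a specific URL, from a Google search, or both), and builds a strict context window. The system prompt explicitly instructs the LLM to answer only from the live facts we have just supplied.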
def web_agent(
    question: str,
    url: str | None = None,
    search_query: str | None = None,
    javascript: bool = False,
    geo: str = "China",
) -> str:
    """DeepSeek V4 agent with live web access via Decodo.

    Supply at least one of: url (scrape a specific page) or
    search_query (run a Google Search). Both can be used together.
    """
    if not url and not search_query:
        raise ValueError("Provide at least one of: url or search_query")

    context_parts: list[str] = []

    if url:
        print(f"Scraping {url} ...")
        raw = scrape_url(url, javascript=javascript, geo=geo)
        text = html_to_text(raw)
        context_parts.append(f"[Page: {url}]\n{text}")

    if search_query:
        print(f"Searching: {search_query} ...")
        results = google_search(search_query, geo=geo)
        formatted = "\n".join(
            f"{i + 1}. {res['title']}\n {res['url']}\n {res['desc']}"
            for i, res in enumerate(results[:5])
        )
        context_parts.append(f"[Search: {search_query}]\n{formatted}")

    context = "\n\n".join(context_parts)

    return ask_deepseek(
        system_prompt=(
            "You are a precise research assistant. You have been given live "
            "web content fetched right now. Answer the user's question using "
            "only the provided content. Be specific and cite facts directly."
        ),
        user_message=f"Content:\n{context}\n\nQuestion: {question}",
    )


# Usage examples

# 1. Scrape a specific page
print(web_agent(
    question="What scraping plans are available and what do they cost?",
    url="https://decodo.cn/scraping/web/pricing",
))

# 2. Search Google and reason over the top results
print(web_agent(
    question="What are the key differences between DeepSeek V4-Pro and V4-Flash?",
    search_query="DeepSeek V4-Pro vs V4-Flash benchmark 2026",
))

# 3. Scrape a JS-rendered page (e.g. Taobao)
print(web_agent(
    question="What is the current listed price for this product?",
    url="https://item.taobao.com/item.htm?id=YOUR_ITEM_ID",
    javascript=True,
))
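When a page and a set of search results are combined, the context can grow past what you want to spend on a single call. One possible guard, sketched below, is to cap the total size while preserving the order of the parts. `build_context` is a hypothetical helper, and the 24,000-character default is an arbitrary illustrative budget, not a DeepSeek limit.

```python
def build_context(parts: list[str], max_chars: int = 24_000, sep: str = "\n\n") -> str:
    """Join context parts in order, truncating so the total stays within max_chars."""
    out: list[str] = []
    remaining = max_chars
    for part in parts:
        if out:
            remaining -= len(sep)  # account for the separator before this part
        if remaining <= 0:
            break
        out.append(part[:remaining])
        remaining -= len(out[-1])
    return sep.join(out)


print(repr(build_context(["aaaa", "bbbb"], max_chars=7)))  # 'aaaa\n\nb'
```

Earlier parts take priority under this scheme, so placing the most relevant source first keeps it intact when truncation kicks in.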
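HTTP response codes
Handle these response codes explicitly in production:
200: Success. The content is in results[0]["content"].
204: The request was accepted but has not finished yet. Wait, then retry.
400: Malformed payload. Check the required fields and parameter names.
401: Invalid or missing token. Double-check DECODO_TOKEN and regenerate it if necessary.
429: Rate limit hit. Retry with exponential backoff.
524: Target site timed out. Retry; if the page requires JS rendering, enable headless: "html".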
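The list above can be turned into a small dispatch function so that each code maps to exactly one action. This is a sketch with hypothetical names (`next_action` and its return labels). Note that `raise_for_status()` in requests only raises for 4xx and 5xx responses, so a 204 passes through silently and needs an explicit check like this one.

```python
def next_action(status_code: int) -> str:
    """Map a Decodo HTTP status code to the action described in this tutorial."""
    if status_code == 200:
        return "use_content"
    if status_code in (204, 429, 524):
        return "retry"          # wait, then try again (with backoff for 429)
    if status_code == 400:
        return "fix_payload"    # check required fields and parameter names
    if status_code == 401:
        return "fix_token"      # re-check or regenerate DECODO_TOKEN
    return "raise"              # anything else: surface the error


for code in (200, 204, 400, 401, 429, 524, 500):
    print(code, next_action(code))
```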
The complete production script
Everything in a single file. Set your environment variables, then run it.
import os
import re
import time
import json

import requests
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

DECODO_TOKEN = os.getenv("DECODO_TOKEN")
DEEPSEEK_API_KEY = os.getenv("DEEPSEEK_API_KEY")
DEEPSEEK_MODEL = "deepseek-v4-flash"

if not DECODO_TOKEN:
    raise ValueError("DECODO_TOKEN is missing from your .env file.")
if not DEEPSEEK_API_KEY:
    raise ValueError("DEEPSEEK_API_KEY is missing from your .env file.")

DECODO_HEADERS = {
    "accept": "application/json",
    "content-type": "application/json",
    "authorization": f"Basic {DECODO_TOKEN}",
}
DEEPSEEK_HEADERS = {
    "authorization": f"Bearer {DEEPSEEK_API_KEY}",
    "content-type": "application/json",
}


def _post_decodo(payload: dict, retries: int = 3, backoff: float = 2.0) -> dict:
    """Shared POST helper with exponential-backoff retry."""
    delay = backoff
    for attempt in range(retries):
        try:
            r = requests.post(
                "https://scraper-api.decodo.com/v2/scrape",
                json=payload,
                headers=DECODO_HEADERS,
                timeout=60,
            )
            r.raise_for_status()
            return r.json()
        except requests.HTTPError as exc:
            if exc.response.status_code in (429, 524) and attempt < retries - 1:
                print(f"HTTP {exc.response.status_code}: retry {attempt + 1} in {delay}s")
                time.sleep(delay)
                delay *= 2
            else:
                raise


def verify_credentials() -> bool:
    data = _post_decodo({"url": "https://ip.decodo.com/ip"})
    ip = data["results"][0]["content"].strip()
    print(f"Decodo OK. Assigned IP: {ip}")
    return True


def scrape_url(
    url: str,
    javascript: bool = False,
    geo: str | None = None,
) -> str:
    payload: dict = {"url": url}
    if javascript:
        payload["headless"] = "html"
    if geo:
        payload["geo"] = geo
    return _post_decodo(payload)["results"][0]["content"]


def google_search(
    query: str,
    geo: str = "China",
    num_pages: int = 1,
) -> list[dict]:
    payload = {
        "target": "google_search",
        "query": query,
        "parse": True,
        "num_pages": num_pages,
        "locale": "zh-CN",
        "geo": geo,
    }
    response_data = _post_decodo(payload)
    try:
        results_array = response_data.get("results", [{}])
        if not results_array:
            return []
        content_dict = results_array[0].get("content", {})
        inner_results = content_dict.get("results", {})
        organic_results = inner_results.get("organic", [])
        if not organic_results:
            formatted_json = json.dumps(content_dict, indent=2)
            print(f"⚠️ Warning: No organic results found. API returned:\n{formatted_json}")
        return organic_results
    except Exception as e:
        print(f"⚠️ Failed to parse Decodo Search JSON: {e}")
        return []


def html_to_text(html: str, max_chars: int = 12_000) -> str:
    html = re.sub(
        r'<(script|style)[^>]*>.*?</(script|style)>',
        '', html, flags=re.DOTALL
    )
    text = re.sub(r'<[^>]+>', ' ', html)
    return re.sub(r'\s+', ' ', text).strip()[:max_chars]


def ask_deepseek(
    system_prompt: str,
    user_message: str,
    max_tokens: int = 1024,
    temperature: float = 0.2,
) -> str:
    r = requests.post(
        "https://api.deepseek.com/v1/chat/completions",
        json={
            "model": DEEPSEEK_MODEL,
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_message},
            ],
            "max_tokens": max_tokens,
            "temperature": temperature,
        },
        headers=DEEPSEEK_HEADERS,
        timeout=60,
    )
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]


def web_agent(
    question: str,
    url: str | None = None,
    search_query: str | None = None,
    javascript: bool = False,
    geo: str = "China",
) -> str:
    if not url and not search_query:
        raise ValueError("Provide url or search_query")
    parts: list[str] = []
    if url:
        parts.append(f"[Page: {url}]\n{html_to_text(scrape_url(url, javascript, geo))}")
    if search_query:
        res = google_search(search_query, geo=geo)
        parts.append("[Search: {}]\n{}".format(
            search_query,
            "\n".join(
                f"{i+1}. {x['title']}\n {x['url']}\n {x['desc']}"
                for i, x in enumerate(res[:5])
            ),
        ))
    return ask_deepseek(
        "Answer using only the live web content below. Cite facts directly.",
        f"Content:\n{chr(10).join(parts)}\n\nQuestion: {question}",
    )


if __name__ == "__main__":
    verify_credentials()
    print(web_agent(
        question="What are the latest DeepSeek V4 benchmark results?",
        search_query="DeepSeek V4 benchmark results May 2026",
    ))
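What to do next
Add memory. Cache scrape results in a dictionary or in Redis to avoid fetching the same URL twice within a session.
Add structured output. Prompt DeepSeek V4 to return JSON (prices, names, dates). Set temperature: 0 for deterministic formatting.
Add multi-step reasoning. Let V4 decide which URL to scrape next based on the initial results, then loop. The million-token context window makes multi-round chaining practical.
Scale with async. Replace requests with httpx and asyncio to run multiple Decodo scrapes concurrently, which is essential for price-monitoring pipelines that check a large number of SKUs at once.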
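As a starting point for the memory idea, here is a minimal in-process cache with a time-to-live. It is a sketch rather than a production design: `ScrapeCache` is a hypothetical class, the 300-second default is arbitrary, and the fetch function is injected so that any scraper (such as the tutorial's `scrape_url`) can be plugged in.

```python
import time
from typing import Callable


class ScrapeCache:
    """Cache fetched pages per URL for a limited time."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: dict[str, tuple[float, str]] = {}

    def get(self, url: str) -> str:
        """Return cached content if still fresh, otherwise fetch and store it."""
        now = self._clock()
        hit = self._store.get(url)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        content = self._fetch(url)
        self._store[url] = (now, content)
        return content
```

In the agent, `cache = ScrapeCache(scrape_url)` followed by `cache.get(url)` would replace direct calls to `scrape_url(url)`. Swapping the dictionary for Redis would change only the storage lines, not the interface.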
About the author
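Kristina is a specialist in international relations and diplomacy. She holds a master's degree and has a strong interest in bridging global digital access. Drawing on her academic background and global perspective, Kristina creates insightful content tailored to our Chinese readers, covering topics such as web scraping, proxies, and ways of getting around various network restrictions.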