
Building a Web-Connected DeepSeek V4 Agent from Scratch: A Complete Decodo Integration Tutorial


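DeepSeek V4 is one of the most capable open-weight models available today. Like every large language model, though, it has a training data cutoff: it knows nothing about anything that happened after its data was collected. For an agent that has to reason about current prices, breaking news, live product pages, or any other real-world data, that cutoff is a hard wall. This tutorial walks through, end to end, how to give a DeepSeek V4 agent live web access using the Decodo Web Scraping API.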

Kristina Selivanovaite

Last updated: June 04, 2026

4 min read

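What you will build

A Python agent that can:

Verify credentials and confirm the Decodo connection works before running any task
Scrape any public URL through the Decodo Web Scraping API and get clean content back
Run live Google searches through the Decodo SERP API and get structured results
Pass that content straight to DeepSeek V4-Flash or V4-Pro for reasoning and output

Building the whole pipeline from scratch takes about 15 minutes.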

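Prerequisites

Before you write the first line of code, you need two things:

A Decodo account and API token. Sign up for a Decodo dashboard account. After logging in, go to the Web Scraping API section, activate a subscription (a free plan is available), and copy your API token from the Basic authentication token tab.

A DeepSeek API key. Create a DeepSeek account and generate an API key in the console. DeepSeek V4-Flash is the cost-conscious default, while V4-Pro is the more capable version.

Install the required dependencies in your terminal (the code below imports python-dotenv as well as requests):

pip install requests python-dotenv

Store your credentials in a .env file in the project directory rather than hardcoding tokens in source files. The scripts below load this file and read the values as environment variables: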

DECODO_TOKEN="your_decodo_api_token"
DEEPSEEK_API_KEY="your_deepseek_api_key"
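Before any network call, it can help to check that both variables are actually set. The following is a minimal sketch using only the standard library; the helper name `missing_credentials` and the `REQUIRED_VARS` tuple are illustrative and not part of the Decodo or DeepSeek tooling.

```python
import os
from typing import List, Mapping, Optional

# Variable names used throughout this tutorial
REQUIRED_VARS = ("DECODO_TOKEN", "DEEPSEEK_API_KEY")


def missing_credentials(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names of required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not (env.get(name) or "").strip()]
```

Calling `missing_credentials()` with no argument inspects the real environment, so it works whether the values come from a .env file loaded earlier or from the shell. An empty list means both credentials are present.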

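Step 0: Verify your credentials first

Before building anything, confirm that your Decodo token is valid and the API is reachable. This surfaces authentication problems up front instead of halfway through the pipeline.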

import os
import requests
from dotenv import load_dotenv, find_dotenv


# 1. Automatically search up the directory tree to find the .env file
load_dotenv(find_dotenv())


# 2. Safely fetch the token
DECODO_TOKEN = os.getenv("DECODO_TOKEN")


if not DECODO_TOKEN:
    raise ValueError("DECODO_TOKEN is missing. Please check your .env file.")


# 3. Pass the raw token directly (No Base64 encoding needed)
DECODO_HEADERS = {
    "accept": "application/json",
    "content-type": "application/json",
    "authorization": f"Basic {DECODO_TOKEN}",
}


def verify_credentials() -> bool:
    """
    Hit the Decodo IP endpoint to confirm the token is valid
    and the API is reachable. Returns True on success.
    """
    try:
        response = requests.post(
            "https://scraper-api.decodo.com/v2/scrape",
            json={"url": "https://ip.decodo.com/ip"},
            headers=DECODO_HEADERS,
            timeout=30,
        )
        
        response.raise_for_status() 
        
        if response.status_code == 200:
            ip = response.json()["results"][0]["content"]
            print(f"Connection verified. Assigned IP: {ip.strip()}")
            return True
            
    except requests.exceptions.RequestException as e:
        print(f"Network or Auth error occurred: {e}")
    except (KeyError, IndexError, TypeError):
        # Missing key, empty "results" list, or a body that is not a dict
        print("Received an unexpected JSON structure from the API.")


    return False


if __name__ == "__main__":
    # Exit explicitly: a bare assert is skipped when Python runs with -O
    if not verify_credentials():
        raise SystemExit("Fix credentials before proceeding.")

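The IP endpoint (https://ip.decodo.com/ip) is Decodo's own lightweight test target. It returns a single line showing the exit IP assigned to your request, which makes it the fastest way to confirm that your token works without scraping a real target site.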

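Step 1: Scrape live URLs with the Decodo Web Scraping API

Web scraping is inherently unpredictable. Sooner or later you will hit rate limits or target-site timeouts. To handle these reliably, we first create a robust network helper that retries with exponential backoff, then use it to build the core URL scraper, which can also request JavaScript-rendered pages.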

import time


def _post_decodo(payload: dict, retries: int = 3, backoff: float = 2.0) -> dict:
    """Shared POST helper with exponential-backoff retry."""
    delay = backoff
    for attempt in range(retries):
        try:
            r = requests.post(
                "https://scraper-api.decodo.com/v2/scrape",
                json=payload,
                headers=DECODO_HEADERS,
                timeout=60,
            )
            r.raise_for_status()
            return r.json()
        except requests.HTTPError as exc:
            if exc.response.status_code in (429, 524) and attempt < retries - 1:
                print(f"HTTP {exc.response.status_code} — retry {attempt+1} in {delay}s")
                time.sleep(delay)
                delay *= 2
            else:
                raise
    # Reached only when retries < 1: fail loudly instead of returning None
    raise ValueError("retries must be at least 1")


def scrape_url(
    url: str,
    javascript: bool = False,
    geo: str | None = None,
) -> str:
    """
    Fetch a public URL via the Decodo Web Scraping API.
    
    Args:
        url:        Target URL to scrape.
        javascript: Set True for JS-rendered pages (Taobao, SPAs, etc.).
        geo:        Route through a specific country, e.g. "China", "United States".
    Returns:
        Raw HTML string of the target page.
    """
    payload: dict = {"url": url}
    if javascript:
        payload["headless"] = "html"
    if geo:
        payload["geo"] = geo
    return _post_decodo(payload)["results"][0]["content"]


# Quick smoke test
html = scrape_url("https://ip.decodo.com/ip")
print(html.strip())  # Prints the assigned IP
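To make the retry behavior easier to reason about, here is a small sketch that computes the delay schedule the helper above follows: the first wait equals `backoff`, and each later wait doubles. The function name `backoff_delays` and the optional `jitter` parameter are illustrative additions; the helper in this tutorial does not add jitter.

```python
import random
from typing import Callable, List


def backoff_delays(
    retries: int = 3,
    base: float = 2.0,
    jitter: float = 0.0,
    rng: Callable[[], float] = random.random,
) -> List[float]:
    """
    Delays slept between attempts: base, 2*base, 4*base, ...
    With `retries` attempts there are at most `retries - 1` waits,
    because the final failure is raised instead of retried.
    """
    delays = []
    delay = base
    for _ in range(max(retries - 1, 0)):
        # Optional random jitter spreads out retries from parallel workers
        delays.append(delay + jitter * rng())
        delay *= 2
    return delays


# Defaults used by the helper above: 3 attempts, 2-second base
print(backoff_delays())        # [2.0, 4.0]
print(sum(backoff_delays()))   # 6.0 seconds of waiting in the worst case
```

With the default settings, a request that keeps hitting 429 or 524 spends at most 6 seconds sleeping before the error is raised to the caller.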

Response structure

A successful 200 response:

{
  "results": [
    {
      "content": "Your Ip is: 213.87.163.6",
      "status_code": 200,
      "url": "https://ip.decodo.com/ip",
      "task_id": "6971034977135771649",
      "created_at": "2026-04-24 09:24:14",
      "updated_at": "2026-04-24 09:24:17"
    }
  ]
}
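As an illustration of working with this envelope, the sketch below pulls out the fields an agent typically cares about. It assumes the timestamps always use the `YYYY-MM-DD HH:MM:SS` format shown in the sample and that `status_code` inside the result reflects the target page's status; both are read off the example above rather than taken from a specification. The function name `summarize_result` is hypothetical.

```python
import re
from datetime import datetime

# The sample response shown above
SAMPLE = {
    "results": [
        {
            "content": "Your Ip is: 213.87.163.6",
            "status_code": 200,
            "url": "https://ip.decodo.com/ip",
            "task_id": "6971034977135771649",
            "created_at": "2026-04-24 09:24:14",
            "updated_at": "2026-04-24 09:24:17",
        }
    ]
}


def summarize_result(envelope: dict) -> dict:
    """Extract status, elapsed time, and any IPv4 address from the first result."""
    results = envelope.get("results") or []
    if not results:
        raise ValueError("envelope contains no results")
    first = results[0]
    fmt = "%Y-%m-%d %H:%M:%S"  # assumed from the sample response
    created = datetime.strptime(first["created_at"], fmt)
    updated = datetime.strptime(first["updated_at"], fmt)
    ip_match = re.search(r"\b(?:\d{1,3}\.){3}\d{1,3}\b", str(first.get("content", "")))
    return {
        "target_status": first.get("status_code"),
        "elapsed_seconds": (updated - created).total_seconds(),
        "ip": ip_match.group(0) if ip_match else None,
    }


print(summarize_result(SAMPLE))
# {'target_status': 200, 'elapsed_seconds': 3.0, 'ip': '213.87.163.6'}
```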

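Key parameters

url: the only required parameter. Any publicly accessible URL works.

headless: set to "html" to enable full JavaScript rendering. This is required for Taobao product pages, JS-heavy dashboards, and any page that loads prices or content dynamically after the first render. Omit it for static HTML pages, because it adds latency.

geo: routes the request through an IP in the specified country. Accepts country names such as "China", "United States", and "Germany".

device_type: optional. Accepted values: "desktop" (default), "mobile", "desktop_chrome", "mobile_android".

Do not add proxy_pool or any other undocumented parameter to the payload. These are not valid fields for this endpoint; at best they are silently ignored, and at worst they cause unexpected behavior. The Decodo API automatically selects a suitable proxy pool based on the target URL and your subscription.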
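One way to enforce this advice in code is to build payloads through a function that rejects anything outside the parameters listed above. This is a sketch only: `build_scrape_payload` is a hypothetical helper, and its allow-list simply mirrors this section rather than the full Decodo API reference.

```python
# Options described in this section; anything else is rejected
ALLOWED_OPTIONS = {"headless", "geo", "device_type"}
DEVICE_TYPES = {"desktop", "mobile", "desktop_chrome", "mobile_android"}


def build_scrape_payload(url: str, **options) -> dict:
    """Build a scrape payload, refusing keys that this section does not list."""
    unknown = set(options) - ALLOWED_OPTIONS
    if unknown:
        raise ValueError(f"Unsupported payload keys: {sorted(unknown)}")
    device = options.get("device_type")
    if device is not None and device not in DEVICE_TYPES:
        raise ValueError(f"Unsupported device_type: {device!r}")
    payload = {"url": url}
    # Skip options explicitly passed as None so they never reach the API
    payload.update({k: v for k, v in options.items() if v is not None})
    return payload


print(build_scrape_payload("https://example.com", headless="html", geo="Germany"))
# {'url': 'https://example.com', 'headless': 'html', 'geo': 'Germany'}
```

Passing `proxy_pool="residential"` to this helper raises a `ValueError` locally, which is easier to debug than a parameter that the API quietly ignores.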

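Step 2: Run live Google searches with the Decodo SERP API

Google search gives the AI access to current events, but the layout of a results page changes from query to query. We parse the JSON response defensively so that the agent does not crash when Google returns a Knowledge Panel instead of standard organic result links.

Google search with defensive parsing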

import json


def google_search(
    query: str,
    geo: str = "China",
    num_pages: int = 1,
) -> list[dict]:
    """
    Run a Google Search via the Decodo SERP API.
    Returns a list of organic result dicts (title, url, desc).
    """
    payload = {
        "target":    "google_search",
        "query":     query,
        "parse":     True,
        "num_pages": num_pages,
        "locale":    "zh-CN",
        "geo":       geo,
    }
    
    response_data = _post_decodo(payload)
    
    try:
        results_array = response_data.get("results", [{}])
        if not results_array:
            return []
            
        content_dict = results_array[0].get("content", {})
        inner_results = content_dict.get("results", {})
        
        organic_results = inner_results.get("organic", [])
        
        if not organic_results:
            formatted_json = json.dumps(content_dict, indent=2)
            print(f"⚠️ Warning: No organic results found. API returned:\n{formatted_json}")
            
        return organic_results


    except Exception as e:
        print(f"⚠️ Failed to parse Decodo Search JSON: {e}")
        return []

Parsed result structure

{
  "pos": 1,
  "url": "https://example.com/deepseek-v4",
  "title": "DeepSeek V4 benchmark results",
  "desc": "DeepSeek V4-Pro scores top marks on MMLU and coding evals...",
  "url_shown": "example.com",
  "pos_overall": 1
}

When passing SERP results to an LLM, always set parse: True. For the same information, structured JSON consumes only a fraction of the tokens that raw HTML does.
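To keep the prompt compact, the parsed results can be reduced to the three fields the model needs. The sketch below assumes each organic result is a dict shaped like the example above (`title`, `url`, `desc`); `format_organic` is an illustrative name, and the placeholder strings for missing fields are arbitrary choices.

```python
from typing import Dict, List


def format_organic(results: List[Dict], limit: int = 5) -> str:
    """Render organic results as a short numbered list for an LLM prompt."""
    lines = []
    for rank, item in enumerate(results[:limit], start=1):
        # .get() with fallbacks tolerates results that lack a field
        title = item.get("title") or "(no title)"
        url = item.get("url") or "(no url)"
        desc = item.get("desc") or ""
        lines.append(f"{rank}. {title}\n   {url}\n   {desc}".rstrip())
    return "\n".join(lines)


sample = [
    {"pos": 1, "title": "DeepSeek V4 benchmark results",
     "url": "https://example.com/deepseek-v4", "desc": "Scores on MMLU and coding evals"},
    {"pos": 2, "url": "https://example.org"},  # no title or desc
]
print(format_organic(sample))
```

Fields such as `pos_overall` and `url_shown` are dropped on purpose, since they add tokens without helping the model answer.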

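Other supported SERP targets

baidu_search: Baidu keyword search (returns HTML; parsing is not supported)

google_shopping_search: Google Shopping results (parseable)

google_ads: Google results including paid ads (parseable)

google_trends_explore: Google Trends data for a keyword (returns structured JSON)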
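A payload builder can encode which of these targets accept `parse`. The sketch below takes its mapping directly from the list above and is not checked against the Decodo reference; `build_serp_payload` and `PARSE_SUPPORT` are hypothetical names. `google_trends_explore` is left out because the list does not say whether it takes a `parse` flag.

```python
from typing import Optional

# True means the list above marks the target as parseable
PARSE_SUPPORT = {
    "google_search": True,
    "google_shopping_search": True,
    "google_ads": True,
    "baidu_search": False,  # returns HTML only
}


def build_serp_payload(target: str, query: str, geo: Optional[str] = None) -> dict:
    """Build a SERP payload, adding parse=True only where it is supported."""
    if target not in PARSE_SUPPORT:
        raise ValueError(f"Unknown or unmapped target: {target!r}")
    payload = {"target": target, "query": query}
    if PARSE_SUPPORT[target]:
        payload["parse"] = True
    if geo:
        payload["geo"] = geo
    return payload


print(build_serp_payload("baidu_search", "deepseek v4"))
# {'target': 'baidu_search', 'query': 'deepseek v4'}
```

For `baidu_search`, the response content would then be HTML that still needs cleaning before it is passed to the model.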

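Step 3: Clean the HTML before passing it to DeepSeek V4

Large language models (LLMs) are billed per token. Feeding them raw HTML full of structural tags, inline styles, and tracking scripts wastes the context window and degrades reasoning quality. This cleaning step strips the noise and keeps only the meaningful plain text the model actually needs.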

import re


def html_to_text(html: str, max_chars: int = 12_000) -> str:
    """
    Strip HTML tags, remove script/style blocks, collapse whitespace.
    Trims output to max_chars to control token consumption.
    """
    # \1 forces the closing tag to match the opening one; IGNORECASE covers <SCRIPT>
    html = re.sub(r'<(script|style)\b[^>]*>.*?</\1\s*>',
                  '', html, flags=re.DOTALL | re.IGNORECASE)
    text = re.sub(r'<[^>]+>', ' ', html)
    return re.sub(r'\s+', ' ', text).strip()[:max_chars]


# For complex pages, BeautifulSoup gives better results:
# pip install beautifulsoup4
# from bs4 import BeautifulSoup
# def html_to_text(html, max_chars=12_000):
#     text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
#     return text[:max_chars]
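If you would rather avoid both regular expressions and an extra dependency, the standard library's `html.parser` can do the same job. The following is a sketch of that alternative, not the tutorial's own cleaner; the names `_TextExtractor` and `html_to_text_stdlib` are illustrative. Unlike the regex version, it also decodes HTML entities such as `&amp;`.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping anything inside script/style/noscript."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities in text
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text_stdlib(html: str, max_chars: int = 12_000) -> str:
    """Return whitespace-collapsed visible text, trimmed to max_chars."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())[:max_chars]


page = (
    "<html><head><style>p{color:red}</style>"
    "<script>var a = 1;</script></head>"
    "<body><h1>Price</h1><p>Now 99 &amp; free shipping</p></body></html>"
)
print(html_to_text_stdlib(page))  # Price Now 99 & free shipping
```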

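Step 4: Call DeepSeek V4

DeepSeek V4 exposes an OpenAI-compatible chat completions endpoint, which makes integration straightforward. Note that we deliberately set temperature to 0.2. Keeping this value low makes the model's output more deterministic and more closely tied to the supplied facts, which matters for RAG (retrieval-augmented generation) pipelines.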

DEEPSEEK_API_KEY = os.environ["DEEPSEEK_API_KEY"]
DEEPSEEK_MODEL   = "deepseek-v4-flash"  # swap to "deepseek-v4-pro" as needed


DEEPSEEK_HEADERS = {
    "authorization": f"Bearer {DEEPSEEK_API_KEY}",
    "content-type":  "application/json",
}


def ask_deepseek(
    system_prompt: str,
    user_message: str,
    max_tokens: int = 1024,
    temperature: float = 0.2,
) -> str:
    """Send a prompt to DeepSeek V4 and return the response text."""
    response = requests.post(
        "https://api.deepseek.com/v1/chat/completions",
        json={
            "model":       DEEPSEEK_MODEL,
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user",   "content": user_message},
            ],
            "max_tokens":  max_tokens,
            "temperature": temperature,
        },
        headers=DEEPSEEK_HEADERS,
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
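The function above indexes straight into the response body, which raises an unhelpful `KeyError` or `IndexError` if the shape is off. Here is a more defensive sketch; it assumes the OpenAI-style response layout (`choices[0].message.content`) and the convention that `finish_reason` equals `"length"` when the answer was cut off by `max_tokens`. The helper name `extract_answer` is hypothetical.

```python
from typing import Tuple


def extract_answer(body: dict) -> Tuple[str, bool]:
    """
    Return (answer_text, truncated) from a chat completions response body.
    `truncated` is True when the model stopped because it hit max_tokens.
    """
    choices = body.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    first = choices[0]
    content = (first.get("message") or {}).get("content")
    if not isinstance(content, str) or not content.strip():
        raise ValueError("first choice has no text content")
    truncated = first.get("finish_reason") == "length"
    return content.strip(), truncated


sample_body = {
    "choices": [
        {"message": {"role": "assistant", "content": " The plan costs 29 USD. "},
         "finish_reason": "stop"}
    ]
}
print(extract_answer(sample_body))  # ('The plan costs 29 USD.', False)
```

When `truncated` comes back True, raising `max_tokens` or shortening the context is usually the next thing to try.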

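Step 5: The complete agent

Now we wire the whole pipeline together. The main function takes the user's question, fetches live content through Decodo (from a specific URL, a Google search, or both), and assembles it into a tightly scoped context. The system prompt explicitly instructs the LLM to answer only from the live content we just supplied.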

def web_agent(
    question:     str,
    url:          str | None = None,
    search_query: str | None = None,
    javascript:   bool = False,
    geo:          str  = "China",
) -> str:
    """
    DeepSeek V4 agent with live web access via Decodo.


    Supply at least one of: url (scrape a specific page) or
    search_query (run a Google Search). Both can be used together.
    """
    if not url and not search_query:
        raise ValueError("Provide at least one of: url or search_query")


    context_parts: list[str] = []


    if url:
        print(f"Scraping {url} ...")
        raw = scrape_url(url, javascript=javascript, geo=geo)
        text = html_to_text(raw)
        context_parts.append(f"[Page: {url}]\n{text}")


    if search_query:
        print(f"Searching: {search_query} ...")
        results = google_search(search_query, geo=geo)
        formatted = "\n".join(
            # .get() keeps the agent alive when a result lacks a field
            f"{i + 1}. {res.get('title', '')}\n   {res.get('url', '')}\n   {res.get('desc', '')}"
            for i, res in enumerate(results[:5])
        )
        context_parts.append(f"[Search: {search_query}]\n{formatted}")


    context = "\n\n".join(context_parts)


    return ask_deepseek(
        system_prompt=(
            "You are a precise research assistant. You have been given live "
            "web content fetched right now. Answer the user's question using "
            "only the provided content. Be specific and cite facts directly."
        ),
        user_message=f"Content:\n{context}\n\nQuestion: {question}",
    )




# ── Usage examples ─────────────────────────────────────────────────────────


# 1. Scrape a specific page
print(web_agent(
    question="What scraping plans are available and what do they cost?",
    url="https://decodo.cn/scraping/web/pricing",
))


# 2. Search Google and reason over the top results
print(web_agent(
    question="What are the key differences between DeepSeek V4-Pro and V4-Flash?",
    search_query="DeepSeek V4-Pro vs V4-Flash benchmark 2026",
))


# 3. Scrape a JS-rendered page (e.g. Taobao)
print(web_agent(
    question="What is the current listed price for this product?",
    url="https://item.taobao.com/item.htm?id=YOUR_ITEM_ID",
    javascript=True,
))
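The agent trims each scraped page to 12,000 characters, but nothing caps the combined context when a page and search results are used together. The sketch below shows one way to enforce an overall character budget; `build_context` and the 16,000-character default are illustrative choices, not values from the DeepSeek or Decodo documentation.

```python
from typing import List


def build_context(parts: List[str], budget: int = 16_000, sep: str = "\n\n") -> str:
    """Join parts in order, truncating so the result never exceeds `budget` characters."""
    kept = []
    remaining = budget
    for part in parts:
        sep_cost = len(sep) if kept else 0
        if len(part) + sep_cost <= remaining:
            kept.append(part)
            remaining -= len(part) + sep_cost
        else:
            # Fit as much of this part as the budget allows, then stop
            room = remaining - sep_cost
            if room > 0:
                kept.append(part[:room])
            break
    return sep.join(kept)


print(build_context(["aaaa", "bbbb"], budget=7, sep="|"))  # aaaa|bb
```

Because earlier parts are kept whole and later ones are cut, the order of `parts` decides which source wins when space runs out.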

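HTTP response codes

Handle these response codes explicitly in production:

200: success. The content is in results[0]["content"].

204: the request was accepted but has not finished yet. Wait, then retry.

400: malformed payload. Check the required fields and parameter names.

401: invalid or missing token. Double-check DECODO_TOKEN and regenerate it if necessary.

429: rate limit hit. Retry with exponential backoff.

524: target site timed out. Retry; if the page needs JS rendering, enable headless: "html".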
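The table above maps naturally onto a small dispatch function. This sketch only encodes the codes listed here and treats everything else as a failure; `next_action` and the action labels are hypothetical names. Note that it treats 204 as retryable, which the retry helper earlier in this tutorial does not do.

```python
from typing import Tuple

RETRY_CODES = {204, 429, 524}  # not finished yet, rate limited, target timeout
FATAL_HINTS = {
    400: "malformed payload: check required fields and parameter names",
    401: "invalid or missing token: check DECODO_TOKEN",
}


def next_action(status_code: int) -> Tuple[str, str]:
    """Map an HTTP status code to (action, hint) following the table above."""
    if status_code == 200:
        return ("use_content", "content is in results[0]['content']")
    if status_code in RETRY_CODES:
        return ("retry", "wait with exponential backoff, then resend")
    return ("fail", FATAL_HINTS.get(status_code, "unexpected status"))


for code in (200, 204, 401, 524):
    print(code, next_action(code))
```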

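The complete production script

Everything lives in a single file. Set your environment variables (for example, in the .env file), then run it.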

import os
import re
import time
import requests
import json
from dotenv import load_dotenv, find_dotenv


load_dotenv(find_dotenv())


DECODO_TOKEN     = os.getenv("DECODO_TOKEN")
DEEPSEEK_API_KEY = os.getenv("DEEPSEEK_API_KEY")
DEEPSEEK_MODEL   = "deepseek-v4-flash"


if not DECODO_TOKEN:
    raise ValueError("DECODO_TOKEN is missing from your .env file.")
if not DEEPSEEK_API_KEY:
    raise ValueError("DEEPSEEK_API_KEY is missing from your .env file.")


DECODO_HEADERS = {
    "accept":        "application/json",
    "content-type":  "application/json",
    "authorization": f"Basic {DECODO_TOKEN}",
}


DEEPSEEK_HEADERS = {
    "authorization": f"Bearer {DEEPSEEK_API_KEY}",
    "content-type":  "application/json",
}


def _post_decodo(payload: dict, retries: int = 3, backoff: float = 2.0) -> dict:
    """Shared POST helper with exponential-backoff retry."""
    delay = backoff
    for attempt in range(retries):
        try:
            r = requests.post(
                "https://scraper-api.decodo.com/v2/scrape",
                json=payload,
                headers=DECODO_HEADERS,
                timeout=60,
            )
            r.raise_for_status()
            return r.json()
        except requests.HTTPError as exc:
            if exc.response.status_code in (429, 524) and attempt < retries - 1:
                print(f"HTTP {exc.response.status_code} — retry {attempt+1} in {delay}s")
                time.sleep(delay)
                delay *= 2
            else:
                raise
    # Reached only when retries < 1: fail loudly instead of returning None
    raise ValueError("retries must be at least 1")


def verify_credentials() -> bool:
    data = _post_decodo({"url": "https://ip.decodo.com/ip"})
    ip = data["results"][0]["content"].strip()
    print(f"Decodo OK — assigned IP: {ip}")
    return True


def scrape_url(
    url: str,
    javascript: bool = False,
    geo: str | None = None,
) -> str:
    payload: dict = {"url": url}
    if javascript:
        payload["headless"] = "html"
    if geo:
        payload["geo"] = geo
    return _post_decodo(payload)["results"][0]["content"]


def google_search(
    query: str,
    geo: str = "China",
    num_pages: int = 1,
) -> list[dict]:
    payload = {
        "target":    "google_search",
        "query":     query,
        "parse":     True,
        "num_pages": num_pages,
        "locale":    "zh-CN",
        "geo":       geo,
    }
    
    response_data = _post_decodo(payload)
    
    try:
        results_array = response_data.get("results", [{}])
        if not results_array:
            return []
            
        content_dict = results_array[0].get("content", {})
        inner_results = content_dict.get("results", {})
        
        organic_results = inner_results.get("organic", [])
        
        if not organic_results:
            formatted_json = json.dumps(content_dict, indent=2)
            print(f"⚠️ Warning: No organic results found. API returned:\n{formatted_json}")
            
        return organic_results


    except Exception as e:
        print(f"⚠️ Failed to parse Decodo Search JSON: {e}")
        return []


def html_to_text(html: str, max_chars: int = 12_000) -> str:
    # \1 forces the closing tag to match the opening one; IGNORECASE covers <SCRIPT>
    html = re.sub(r'<(script|style)\b[^>]*>.*?</\1\s*>',
                  '', html, flags=re.DOTALL | re.IGNORECASE)
    text = re.sub(r'<[^>]+>', ' ', html)
    return re.sub(r'\s+', ' ', text).strip()[:max_chars]


def ask_deepseek(
    system_prompt: str,
    user_message:  str,
    max_tokens:    int   = 1024,
    temperature:   float = 0.2,
) -> str:
    r = requests.post(
        "https://api.deepseek.com/v1/chat/completions",
        json={
            "model":       DEEPSEEK_MODEL,
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user",   "content": user_message},
            ],
            "max_tokens":  max_tokens,
            "temperature": temperature,
        },
        headers=DEEPSEEK_HEADERS,
        timeout=60,
    )
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]


def web_agent(
    question:     str,
    url:          str | None = None,
    search_query: str | None = None,
    javascript:   bool = False,
    geo:          str  = "China",
) -> str:
    if not url and not search_query:
        raise ValueError("Provide url or search_query")
    parts: list[str] = []
    if url:
        parts.append(f"[Page: {url}]\n{html_to_text(scrape_url(url, javascript, geo))}")
    if search_query:
        res = google_search(search_query, geo=geo)
        parts.append("[Search: {}]\n{}".format(
            search_query,
            "\n".join(
                # .get() keeps the agent alive when a result lacks a field
                f"{i+1}. {x.get('title', '')}\n   {x.get('url', '')}\n   {x.get('desc', '')}"
                for i, x in enumerate(res[:5])
            ),
        ))
    return ask_deepseek(
        "Answer using only the live web content below. Cite facts directly.",
        f"Content:\n{chr(10).join(parts)}\n\nQuestion: {question}",
    )


if __name__ == "__main__":
    verify_credentials()


    print(web_agent(
        question="What are the latest DeepSeek V4 benchmark results?",
        search_query="DeepSeek V4 benchmark results May 2026",
    ))

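What to build next

Add memory. Cache scrape results in a dictionary or in Redis so you do not re-scrape the same URL within a session.

Add structured output. Prompt DeepSeek V4 to return JSON (prices, names, dates). Set temperature: 0 for more consistent formatting.

Add multi-step reasoning. Let V4 decide which URL to scrape next based on the initial results, then loop. The million-token context window makes multi-round chaining practical.

Scale with async. Swap requests for httpx and asyncio to run multiple Decodo scrape jobs concurrently, which is essential for price-monitoring pipelines that check large numbers of SKUs at once.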
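As a starting point for the memory idea, here is a sketch of an in-process cache with a time-to-live. It wraps any fetch function with the signature `(url) -> str`, so in this tutorial you could pass `scrape_url`. The class name `ScrapeCache`, the 300-second default, and the injectable `clock` are all illustrative choices; a Redis-backed version would follow the same shape.

```python
import time
from typing import Callable, Dict, Tuple


class ScrapeCache:
    """Cache fetched pages per URL and refetch once an entry is older than ttl_seconds."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, url: str) -> str:
        now = self._clock()
        hit = self._store.get(url)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # still fresh: no network call
        content = self._fetch(url)
        self._store[url] = (now, content)
        return content


# Demo with a stand-in fetcher instead of a real scrape
calls = []

def fake_fetch(url: str) -> str:
    calls.append(url)
    return f"content of {url}"

demo = ScrapeCache(fake_fetch, ttl_seconds=60)
demo.get("https://example.com")
demo.get("https://example.com")
print(len(calls))  # 1: the second call was served from the cache
```

A short TTL matters here: the whole point of the agent is fresh data, so cached pages should expire well before the underlying prices or news are likely to change.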


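About the author

Kristina Selivanovaite

Decodo (德口多) expert column: brand protection expert Kristina Selivanovaite

Kristina is a specialist in international relations and diplomacy who holds a master's degree and has a keen interest in bridging global digital access. Drawing on her academic background and global perspective, she creates insightful content tailored to our Chinese readers, covering topics such as web scraping, proxies, and ways around various network restrictions.

Connect with Kristina on LinkedIn.

All information on the Decodo blog is provided as is and for informational purposes only. We make no representations and accept no liability for your use of any information on the Decodo blog or of any third-party websites that may be linked from it.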
