如何使用Python抓取网页表格：完整指南

HTML表格是网站组织数据最常见的方式之一，包括财务报告、产品列表、体育比分、人口统计等。但这些数据被锁定在网页布局中。要使用它，您需要提取它。本指南将向您展示如何使用Python做到这一点，从简单的静态表格开始，逐步处理复杂的动态表格。

Justinas Tamasevicius

最后更新: 12月 08日, 2025年

9 分钟阅读

理解HTML表格

在抓取表格之前，您需要理解其结构。HTML使用标签来组织内容。要查看这一点，访问任何带有表格的网页，右键单击它，然后选择"检查"（如果需要帮助，请查看我们关于如何检查元素的指南）。

您只需要知道4个标签:

<table> – 整个表格容器（想象成整个电子表格文件）
<tr> – “表格行” – 单行（就像电子表格中的一行）
<th> – “表格标题” – 标题单元格（列标题，如"名称"或"价格"）
<td> – “表格数据” – 数据单元格（单个值，如"$19.99"）

表格是嵌套的。<table>包含<tr>（行），后者包含<th>（标题）和<td>（数据）。理解这种结构对于稍后选择正确的CSS或XPath选择器很重要。

前提条件

您需要安装Python以及几个库:

Requests – 发送HTTP请求以下载网页
BeautifulSoup4 – 将HTML解析为可搜索对象
Pandas – 将抓取的数据组织成表格并导出为CSV/Excel
lxml – pandas.read_html需要的快速解析器
Selenium – 为JavaScript密集型网站自动化浏览器

要安装所有这些库，打开终端（或命令提示符）并运行以下命令:

pip install requests beautifulsoup4 pandas selenium lxml

我们将使用requests和BeautifulSoup处理静态网站，当您需要抓取动态内容时使用Selenium。

如何抓取静态HTML表格

静态表格在初始HTML中包含所有数据，不需要JavaScript加载。有两种方法可以抓取它们。

方法1. 最简单的方法（pandas.read_html）

对于简单的静态表格，pandas有一个完成繁重工作的函数: read_html()。

此函数扫描HTML，找到所有<table>标签，并将它们转换为pandas DataFrames列表。

import pandas as pd
import requests
from io import StringIO  # Needed to wrap the HTML string

# The URL of the Wikipedia page
url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

# Fetch the HTML content
response = requests.get(url, headers=headers)
html_content = response.text

# Parse all tables from the HTML content
tables = pd.read_html(StringIO(html_content))

print(f"Total tables found: {len(tables)}")

# In this case, the table we want is the 1st one
df = tables[0]

# Display the first 5 rows
print(df.head())

# Save the data to a CSV file
df.to_csv("population_data.csv", index=False)
print("Data saved to population_data.csv")

import pandas as pd
import requests
from io import StringIO  # Needed to wrap the HTML string

# The URL of the Wikipedia page
url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

# Fetch the HTML content
response = requests.get(url, headers=headers)
html_content = response.text

# Parse all tables from the HTML content
tables = pd.read_html(StringIO(html_content))

print(f"Total tables found: {len(tables)}")

# In this case, the table we want is the 1st one
df = tables[0]

# Display the first 5 rows
print(df.head())

# Save the data to a CSV file
df.to_csv("population_data.csv", index=False)
print("Data saved to population_data.csv")

运行脚本时，控制台输出如下所示:

方法2. 手动方法（BeautifulSoup）

有时pandas.read_html会失败，或者您需要更多控制。这是使用BeautifulSoup的手动方法。

我们将在5个清晰的步骤中抓取相同的Wikipedia页面。

步骤1. 获取网页

使用requests下载页面的HTML。

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
}
response = requests.get(url, headers=headers)

if response.status_code == 200:
    print("Successfully fetched the page")
    html_content = response.text
else:
    print(f"Error fetching page: Status code {response.status_code}")
    exit()

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
}
response = requests.get(url, headers=headers)

if response.status_code == 200:
    print("Successfully fetched the page")
    html_content = response.text
else:
    print(f"Error fetching page: Status code {response.status_code}")
    exit()

步骤2. 使用BeautifulSoup解析HTML

将原始HTML传递给BeautifulSoup以创建可搜索对象。

soup = BeautifulSoup(html_content, 'html.parser')

步骤3. 定位并提取表格

在浏览器中右键单击表格并选择"检查"。您会看到它有一个class属性"wikitable"。

使用此类查找表格，然后提取所有标题（<th>）和行（<tr>）。

table = soup.find('table', attrs={'class': 'wikitable'})

if table is None:
    print("Could not find the table. Check your selectors.")
    exit()

# --- Get Headers ---
headers_html = table.find_all('th')
headers = [th.text.strip() for th in headers_html]
print(f"Headers found: {headers}")

# --- Get All Rows ---
rows_html = table.find('tbody').find_all('tr')
all_rows = []

for row in rows_html:
    cells_html = row.find_all('td')
    row_data = [cell.text.strip() for cell in cells_html]
    if row_data:
        all_rows.append(row_data)

table = soup.find('table', attrs={'class': 'wikitable'})

if table is None:
    print("Could not find the table. Check your selectors.")
    exit()

# --- Get Headers ---
headers_html = table.find_all('th')
headers = [th.text.strip() for th in headers_html]
print(f"Headers found: {headers}")

# --- Get All Rows ---
rows_html = table.find('tbody').find_all('tr')
all_rows = []

for row in rows_html:
    cells_html = row.find_all('td')
    row_data = [cell.text.strip() for cell in cells_html]
    if row_data:
        all_rows.append(row_data)

步骤4. 将数据转换为pandas DataFrame

将您的列表转换为结构化DataFrame。

pandas DataFrame是一个强大的内存中表格，类似于电子表格。虽然我们的all_rows变量只是一个Python列表的列表，但将其转换为DataFrame允许我们轻松地使用标题标记数据、清理数据、分析数据，最重要的是，使用单个命令将其导出为CSV或Excel等格式。

df = pd.DataFrame(all_rows, columns=headers)
print(df.head())

步骤5. 保存数据

将DataFrame导出为CSV。

df.to_csv('scraped_data_manual.csv', index=False)

最终的CSV文件将如下所示:

此方法使用了Python网页抓取的所有基本原理，并为您提供完全控制。您可以进一步修改代码以仅提取特定行或列，或筛选相关数据。

快速提示 – 始终首先尝试方法1。它快速简单。如果失败或您需要更多控制，请使用方法2。

处理复杂的表格结构（colspan和rowspan）

某些表格使用colspan（跨越多列）或rowspan（跨越多行）合并单元格。

问题 – 简单的循环会将数据放在错误的列中。

修复方法 – 首先尝试pandas.read_html。它通常足够智能，可以自动正确解析带有colspan和rowspan的表格。

让我们以网页浏览器比较页面为例。提到"Arora"浏览器的行使用rowspan跨越6行。

"Latest release"标题使用colspan跨越2列（“Version"和"Date”）。

让我们看看pandas是否可以处理它。

import pandas as pd
import requests
from io import StringIO

# Configuration
WIKI_URL = 'https://en.wikipedia.org/wiki/Comparison_of_web_browsers'
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/142.0.0.0 Safari/537.36'
OUTPUT_FILE = 'browser_comparison.csv'

# Fetch page
response = requests.get(WIKI_URL, headers={'User-Agent': USER_AGENT})
html_content = response.text

# Extract all tables
all_tables = pd.read_html(StringIO(html_content))
print(f"Found {len(all_tables)} tables")

# Find the correct table by checking its columns
for index, table in enumerate(all_tables):
    if 'Browser' in str(table.columns):
        print(f"\nBrowser table found at index {index}")
        
        # Clean the multi-level headers from pandas
        table.columns = ['_'.join(col).strip() for col in table.columns.values]
        print(table.head(10))

        table.to_csv(OUTPUT_FILE, index=False)
        print(f"\nSaved to {OUTPUT_FILE}")
        break

import pandas as pd
import requests
from io import StringIO

# Configuration
WIKI_URL = 'https://en.wikipedia.org/wiki/Comparison_of_web_browsers'
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/142.0.0.0 Safari/537.36'
OUTPUT_FILE = 'browser_comparison.csv'

# Fetch page
response = requests.get(WIKI_URL, headers={'User-Agent': USER_AGENT})
html_content = response.text

# Extract all tables
all_tables = pd.read_html(StringIO(html_content))
print(f"Found {len(all_tables)} tables")

# Find the correct table by checking its columns
for index, table in enumerate(all_tables):
    if 'Browser' in str(table.columns):
        print(f"\nBrowser table found at index {index}")
        
        # Clean the multi-level headers from pandas
        table.columns = ['_'.join(col).strip() for col in table.columns.values]
        print(table.head(10))

        table.to_csv(OUTPUT_FILE, index=False)
        print(f"\nSaved to {OUTPUT_FILE}")
        break

然后您将获得成功扁平化复杂标题和行的CSV。

当pandas.read_html（方法1）在复杂表格上失败时，您可以使用专门的库，如html-table-extractor。其主要目的是正确解析复杂表格，自动扩展合并的单元格以构建干净的矩形数据网格。

首先，您需要安装它:

pip install html-table-extractor

以下是结合BeautifulSoup查找表格和html-table-extractor解析表格的代码:

import requests
import csv
from bs4 import BeautifulSoup
from html_table_extractor.extractor import Extractor

# Configuration
WIKI_URL = "https://en.wikipedia.org/wiki/Comparison_of_web_browsers"
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/142.0.0.0 Safari/537.36"
OUTPUT_FILE = "browser_comparison_extractor.csv"

print(f"Fetching page: {WIKI_URL}")
response = requests.get(WIKI_URL, headers={"User-Agent": USER_AGENT})
html_content = response.text

soup = BeautifulSoup(html_content, "html.parser")
all_tables = soup.find_all("table")

print(f"Found {len(all_tables)} tables. Searching for the correct one...")

for index, table in enumerate(all_tables):
    try:
        # We convert the 'table' object back to a string
        extractor = Extractor(str(table))
        extractor.parse()

        table_data = extractor.return_list()

        if table_data:
            # Clean header values (e.g., "Browser\n" becomes "Browser")
            headers = [str(header).strip() for header in table_data[0]]

            if "Browser" in headers:
                print(f"\nBrowser table found at index {index}")

                with open(OUTPUT_FILE, "w", newline="", encoding="utf-8") as f:
                    writer = csv.writer(f)
                    writer.writerows(table_data)

                print(f"Successfully saved to {OUTPUT_FILE}")

                # Break the loop since we found the first one
                break

    except Exception as e:
        print(f"Could not parse table at index {index}: {e}")

import requests
import csv
from bs4 import BeautifulSoup
from html_table_extractor.extractor import Extractor

# Configuration
WIKI_URL = "https://en.wikipedia.org/wiki/Comparison_of_web_browsers"
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/142.0.0.0 Safari/537.36"
OUTPUT_FILE = "browser_comparison_extractor.csv"

print(f"Fetching page: {WIKI_URL}")
response = requests.get(WIKI_URL, headers={"User-Agent": USER_AGENT})
html_content = response.text

soup = BeautifulSoup(html_content, "html.parser")
all_tables = soup.find_all("table")

print(f"Found {len(all_tables)} tables. Searching for the correct one...")

for index, table in enumerate(all_tables):
    try:
        # We convert the 'table' object back to a string
        extractor = Extractor(str(table))
        extractor.parse()

        table_data = extractor.return_list()

        if table_data:
            # Clean header values (e.g., "Browser\n" becomes "Browser")
            headers = [str(header).strip() for header in table_data[0]]

            if "Browser" in headers:
                print(f"\nBrowser table found at index {index}")

                with open(OUTPUT_FILE, "w", newline="", encoding="utf-8") as f:
                    writer = csv.writer(f)
                    writer.writerows(table_data)

                print(f"Successfully saved to {OUTPUT_FILE}")

                # Break the loop since we found the first one
                break

    except Exception as e:
        print(f"Could not parse table at index {index}: {e}")

清理和保存数据

原始抓取的数据通常很混乱。例如，在我们在方法2（BeautifulSoup）中创建的DataFrame中，Population列包含字符串，如"1,417,492,000"。因为这是一个字符串而不是数字，所以您无法使用它进行计算。让我们清理它。

您可以将以下行添加到方法2脚本中以清理数据。此代码直接从您之前创建的df变量继续:

# This code extends the Method 2 (BeautifulSoup) example.
# It assumes 'df' is the DataFrame we already created.

# The 'Population' column is a string: '1,417,492,000'

# Remove commas
df['Population_Clean'] = df['Population'].str.replace(',', '', regex=False)

# Convert to numeric
df['Population_Clean'] = pd.to_numeric(df['Population_Clean'])

# Now you can perform calculations
print(f"Total population of top 5: {df.head(5)['Population_Clean'].sum()}")

# This code extends the Method 2 (BeautifulSoup) example.
# It assumes 'df' is the DataFrame we already created.

# The 'Population' column is a string: '1,417,492,000'

# Remove commas
df['Population_Clean'] = df['Population'].str.replace(',', '', regex=False)

# Convert to numeric
df['Population_Clean'] = pd.to_numeric(df['Population_Clean'])

# Now you can perform calculations
print(f"Total population of top 5: {df.head(5)['Population_Clean'].sum()}")

清理后，保存您的数据。有关更多格式（JSON、数据库等），请查看我们关于如何保存抓取数据的指南。

# Save to CSV
df.to_csv('population_data_clean.csv', index=False)

# Save to JSON
df.to_json('population_data_clean.json', orient='records')

如何抓取动态（JavaScript）表格

您运行抓取器，但表格数据为空。您检查HTML，看到"Loading…"或空的<div>。发生这种情况是因为网站使用JavaScript动态加载数据。requests.get()仅检索初始HTML，它不执行JavaScript。如果表格数据在页面渲染后加载，您的抓取器将看不到它。

解决方案#1. 查找隐藏的API

许多网站通过后台API请求加载数据。您可以拦截此API并直接请求数据:

在浏览器中打开开发者工具（F12）
转到Network选项卡 – 按Fetch/XHR筛选
重新加载页面
查找返回JSON数据的请求（这是您的表格内容）

像Yahoo Finance和Google Finance这样的金融网站通常使用这种模式。在下图中，您可以看到Fetch/XHR选项卡捕获返回干净JSON数据的请求。

找到正确的请求后，右键单击它并选择Copy as cURL（或记下URL、标头和参数）。然后，您可以在Python中重新创建该请求。

将"Copy as cURL"命令转换为Python的专业提示是使用在线工具，如curl转换器。您可以粘贴整个cURL命令，它将自动使用requests库生成等效代码。

运行此代码将打印干净的JSON数据，看起来像这样:

{
    "marketCap": {
        "raw": 4.580887901397705E12,
        "fmt": "4.581T",
        "longFmt": "4,580,887,901,397"
    },
    "ticker": "NVDA",
    "avgDailyVol3m": {
        "raw": 1.818218125E8,
        "fmt": "181.822M",
        "longFmt": "181,821,812"
    },
...
}

重要说明: 如果您的请求失败或没有获得数据，网站可能正在阻止您的脚本。要解决此问题，您需要使您的请求看起来更像真实用户。我们在本指南后面的最佳实践和道德考虑部分中介绍了这些技术。

解决方案#2. 使用无头浏览器

如果没有API可以定位，您需要自动化真实浏览器来渲染JavaScript。在服务器上运行时，这称为无头浏览器。像Selenium这样的工具将:

打开真实浏览器（如Chrome）
等待JavaScript执行和表格加载
为您提供最终渲染的HTML进行解析

这也是处理动态分页的唯一方法，即当您需要单击触发JavaScript的Next按钮时。我们将在下一节中介绍这一点。

这种方法比直接API请求更慢且资源更密集，因此始终首先检查隐藏的API。

如何抓取分页表格

表格通常分布在多个页面上（“第1页”、"第2页"等）。关键是弄清楚"Next"按钮的工作原理。

情况#1. 静态分页

要查找的内容 – 单击"Next"并检查URL是否更改（例如，更改为/?page=2）。如果是，您正在处理静态网站。

解决方案 – 在循环中使用requests。以CoinMarketCap为例。它使用像/?page=2、/?page=3这样的URL。如果您检查"Next"按钮，您会看到它只是一个简单的链接（<a>标签），具有静态href属性。

因为URL简单且可预测，您可以使用requests和pandas循环抓取它:

import pandas as pd
import requests
from io import StringIO
from time import sleep

all_data = []
page = 90
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/141.0.0.0 Safari/537.36"
}

while True:
    print(f"Page {page}...", end=" ")
    response = requests.get(f"https://coinmarketcap.com/?page={page}", headers=headers)

    if response.status_code != 200:
        break

    try:
        all_data.append(pd.read_html(StringIO(response.text))[0])
    except:
        break

    if (
        'aria-disabled="true"' in response.text
        or "?page=" + str(page + 1) not in response.text
    ):
        print("Last page!")
        break

    page += 1
    sleep(2)

pd.concat(all_data, ignore_index=True).to_csv("coinmarketcap_data.csv", index=False)
print(f"\nDone! Saved {len(all_data)} pages")

import pandas as pd
import requests
from io import StringIO
from time import sleep

all_data = []
page = 90
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/141.0.0.0 Safari/537.36"
}

while True:
    print(f"Page {page}...", end=" ")
    response = requests.get(f"https://coinmarketcap.com/?page={page}", headers=headers)

    if response.status_code != 200:
        break

    try:
        all_data.append(pd.read_html(StringIO(response.text))[0])
    except:
        break

    if (
        'aria-disabled="true"' in response.text
        or "?page=" + str(page + 1) not in response.text
    ):
        print("Last page!")
        break

    page += 1
    sleep(2)

pd.concat(all_data, ignore_index=True).to_csv("coinmarketcap_data.csv", index=False)
print(f"\nDone! Saved {len(all_data)} pages")

情况#2. 动态分页

要查找的内容 – 单击"Next"并检查URL是否保持不变，但表格内容发生变化。这意味着网站使用动态分页。

解决方案 – 使用无头浏览器，如Selenium模拟单击"Next"按钮。

让我们以Datatables为例进行抓取。此页面使用JavaScript加载其数据和分页。

import pandas as pd
from io import StringIO
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from time import sleep

url = "https://datatables.net/examples/data_sources/ajax.html"
all_data = []

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get(url)
sleep(2)  # Wait for the page and initial JavaScript to load

page = 90
while True:
    print(f"Scraping page {page}...", end=" ")

    table_html = driver.find_element(By.ID, "example").get_attribute("outerHTML")
    if not table_html:
        break

    df = pd.read_html(StringIO(table_html))[0]
    all_data.append(df)
    print("Done.")

    # Check if the "Next" button is disabled
    try:
        next_btn = driver.find_element(By.CSS_SELECTOR, "button.dt-paging-button.next")
        btn_class = next_btn.get_attribute("class") or ""

        if "disabled" in btn_class:
            print("Last page reached!")
            break

        # If not disabled, click it and wait
        next_btn.click()
        sleep(1)  # Wait for the table to reload
        page += 1
    except:
        print("Could not find 'Next' button or it broke.")
        break

driver.quit()

final_df = pd.concat(all_data, ignore_index=True)
final_df.to_csv("datatables_data.csv", index=False)
print(f"\nDone! Saved {len(all_data)} pages to datatables_data.csv")

import pandas as pd
from io import StringIO
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from time import sleep

url = "https://datatables.net/examples/data_sources/ajax.html"
all_data = []

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get(url)
sleep(2)  # Wait for the page and initial JavaScript to load

page = 90
while True:
    print(f"Scraping page {page}...", end=" ")

    table_html = driver.find_element(By.ID, "example").get_attribute("outerHTML")
    if not table_html:
        break

    df = pd.read_html(StringIO(table_html))[0]
    all_data.append(df)
    print("Done.")

    # Check if the "Next" button is disabled
    try:
        next_btn = driver.find_element(By.CSS_SELECTOR, "button.dt-paging-button.next")
        btn_class = next_btn.get_attribute("class") or ""

        if "disabled" in btn_class:
            print("Last page reached!")
            break

        # If not disabled, click it and wait
        next_btn.click()
        sleep(1)  # Wait for the table to reload
        page += 1
    except:
        print("Could not find 'Next' button or it broke.")
        break

driver.quit()

final_df = pd.concat(all_data, ignore_index=True)
final_df.to_csv("datatables_data.csv", index=False)
print(f"\nDone! Saved {len(all_data)} pages to datatables_data.csv")

这种相同的逻辑可以使用其他现代工具（如Playwright进行网页抓取）应用。

现代方法. 人工智能（AI）可以抓取表格吗？

这是当今最常见的问题，简短的答案是"可以，但有注意事项"。

人工智能（AI）作为助手

人工智能（AI）聊天机器人非常适合生成代码。复制表格的HTML，粘贴到ChatGPT或Claude中，并要求它编写BeautifulSoup脚本。您将在几秒钟内获得可用代码，而不是手动编写它。您可以在我们的其他指南中了解有关使用ChatGPT进行网页抓取或Claude进行网页抓取的更多信息。

但聊天机器人有局限性。它们编写代码，但不为您运行代码。它们无法渲染JavaScript、管理代理或处理被阻止。当网站更改其HTML（这种情况经常发生）时，代码就会崩溃。

人工智能（AI）作为抓取器本身

真正的人工智能（AI）抓取解决方案不仅生成代码，它理解页面结构并适应变化。您不必编写脆弱的CSS选择器，如soup.find(class_="wikitable")，而是用简单的英语告诉它您想要什么数据。

这就是Decodo的人工智能（AI）Parser所做的。它是最好的人工智能（AI）数据收集工具之一，因为当网站发生变化时它保持弹性。有关此方法的更多信息，请参阅Decodo如何处理人工智能（AI）数据收集。

最佳实践和道德考虑

抓取很强大，但您需要尊重他人。每秒向服务器发送数百个请求会很快被阻止。以下是如何负责任地抓取（并避免被禁止）:

首先检查robots.txt. 大多数网站在website.com/robots.txt处都有一个文件，说明机器人的规则。我们有关于如何检查网站是否允许抓取的完整指南
尊重服务条款（ToS）. ToS是一份法律文件。查看它以确保您没有违反政策，特别是对于商业用途。
使用代理网络. 您的家庭IP地址是网站的明显标志。在10或100个请求后，您将被阻止。代理，尤其是来自动态住宅代理网络的代理，通过不同的真实设备路由您的请求，每次都使您看起来像一个普通的独特用户。这是大规模可靠抓取的关键
使用User-Agent. 如我们的示例所示，User-Agent字符串使您的请求看起来像来自浏览器，而不是脚本。这绕过了许多基本的反机器人系统
在请求之间添加延迟. 在循环内添加time.sleep(2)。这对服务器是礼貌的，使您看起来不像机器人。（您甚至可以构建Python requests retry系统来自动处理此问题。）

常见问题排查

最常见的错误是AttributeError: 'NoneType' object has no attribute 'find'。这可能是最常见的抓取错误。这意味着您试图在None上调用方法，当您的选择器找不到任何内容时会发生这种情况。

# Your selector has a typo or wrong class name
table = soup.find('table', attrs={'class': 'my-typo-class'})  # Returns None

# This next line is where it crashes
rows = table.find_all('tr')  # Can't call .find_all() on None!

修复方法 – 检查崩溃前的行。您的soup.find()返回None，因为它没有匹配任何内容。仔细检查您的选择器是否有拼写错误，或检查实际HTML以查看类名真正是什么。

其他常见问题

以下是一些常见的抓取错误和修复方法:

404 Not Found. 您的URL错误，或页面不存在。检查拼写错误
403 Forbidden. 服务器正在阻止您。添加User-Agent标头以使其看起来像浏览器。如果这不起作用，您需要代理来隐藏您的身份
429 Too Many Requests. 您抓取得太快。在请求之间添加time.sleep(2)以减慢速度。如果您仍然被阻止，请使用代理网络来轮换您的IP地址

空数据/缺少内容. 数据可能是用JavaScript加载的。使用浏览器的Network选项卡找到隐藏的API端点，或切换到无头浏览器自动化工具以等待页面完全加载

总结

您已经学会了如何在Python中抓取表格，从简单的pandas一行代码到复杂的Selenium设置。这些工具非常适合小型项目和学习。

但抓取器很脆弱，网站经常更改其HTML。扩展到数千页意味着处理代理、无头浏览器和验证码。如果这听起来很繁琐，还有另一种选择。

您可以使用Decodo网页抓取API，而不是自己构建和维护抓取器。发送URL，API处理混乱的事情（代理、JavaScript渲染、反机器人措施），您将得到干净的JSON。

此外，如果您正在构建人工智能（AI）代理或自动化工作流，您可以与LangChain和n8n集成。

免费试用网页抓取API

无需编写任何代码即可收集实时数据——立即开始您的7天免费试用。

开始使用

关于作者

Justinas Tamasevicius

工程主管

Justinas Tamaševičius 是工程主管，在软件开发领域拥有二十多年的专业经验。从学生时代自学成才的激情开始，他的职业生涯跨越了后端工程、系统架构和基础架构开发等领域。

Justinas 目前负责领导工程部门，推动创新，提供高性能的解决方案，同时保持对效率和质量的高度关注。

通过 LinkedIn 与 Justinas 联系。

Decodo 博客上的所有信息均按原样提供，仅供参考。对于您使用 Decodo 博客上的任何信息或其中可能链接的任何第三方网站，我们不作任何陈述，也不承担任何责任。

在本文中

快速收集实时数据

体验无限制的数据采集，使用我们的网页抓取API——提供7天免费试用。

立即免费试用

常见问题

哪些Python库最适合抓取HTML表格？

对于简单的静态表格，pandas.read_html()是最快的选择。要获得更多控制，请使用requests获取页面并使用BeautifulSoup解析它。对于JavaScript密集型网站，请使用Selenium或Playwright。

如何从使用JavaScript加载数据的网站抓取表格？

两种方法:

查找隐藏的API – 使用浏览器的Network选项卡（按Fetch/XHR筛选）查找返回JSON数据的API请求。这比浏览器自动化更快、更可靠
使用无头浏览器 – 如果没有API，请使用Selenium或Playwright加载页面、执行JavaScript并解析最终HTML

如果表格结构复杂（例如，合并单元格、多级标题），我该怎么办？

首先尝试pandas.read_html()，它在大多数情况下会自动处理colspan和rowspan。如果失败，您需要编写自定义BeautifulSoup解析器来手动处理单元格结构。

如何将抓取的表格数据导出到.csv或Excel文件？

一旦您有了pandas DataFrame（df），这是一行的工作:

CSV – df.to_csv('filename.csv', index=False)
Excel – df.to_excel('filename.xlsx', index=False)

数据收集

PYTHON

使用 Python 解析 XML —— 终极指南 2025

标准是明确和界定世界上人与人、人与物之间交流的一种手段。例如，人类语言、计算机 USB 插座或倒牛奶前必须先加麦片的事实。说到计算机应用程序和系统，有一种标准是最受开发人员欢迎的，它就是 XML（可扩展标记语言）。在本文中，我们将探讨如何使用 Python 的内置库从 XML 文件中解析数据，了解解析的最佳方法，并理解有效读取信息的重要性。

James Keenan

最后更新: 2月 13日, 2025年

13 分钟阅读

PYTHON

数据收集

如何使用 Python 从任何网站抓取图像

如果你需要大量图像,而一张一张保存的想法已经让你感到厌烦,那你并不孤单。在为机器学习项目准备数据集时,这种工作尤其令人疲惫。好消息是,网页抓取通过让你在几个步骤内收集大量图像,使整个过程更快、更易于管理。在这篇博文中,我们将指导你通过一种直接的方法从静态网站抓取图像。我们将使用 Python、几个便捷的库以及代理来保持一切顺利运行。

Dominykas Niaura

最后更新: 12月 05日, 2025年

10 分钟阅读

PYTHON

数据收集

如何抓取 GitHub：实用教程 2025

GitHub 是互联网上最重要的技术知识来源之一，对于构建复杂应用程序的开发人员来说尤其如此。跟随本指南学习如何提取这些宝贵的数据，毫不费力地紧跟最新技术趋势。

Zilvinas Tamulis

最后更新: 2月 13日, 2025年

10 分钟阅读

如何使用Python抓取网页表格：完整指南

理解HTML表格

前提条件

如何抓取静态HTML表格

方法1. 最简单的方法（pandas.read_html）

方法2. 手动方法（BeautifulSoup）

步骤1. 获取网页

步骤2. 使用BeautifulSoup解析HTML

步骤3. 定位并提取表格

步骤4. 将数据转换为pandas DataFrame

步骤5. 保存数据

处理复杂的表格结构（colspan和rowspan）

清理和保存数据

如何抓取动态（JavaScript）表格

解决方案#1. 查找隐藏的API

解决方案#2. 使用无头浏览器

如何抓取分页表格

情况#1. 静态分页

情况#2. 动态分页

现代方法. 人工智能（AI）可以抓取表格吗？

人工智能（AI）作为助手

人工智能（AI）作为抓取器本身

最佳实践和道德考虑

常见问题排查

其他常见问题

总结

常见问题

哪些Python库最适合抓取HTML表格？

如何从使用JavaScript加载数据的网站抓取表格？

如果表格结构复杂（例如，合并单元格、多级标题），我该怎么办？

如何将抓取的表格数据导出到.csv或Excel文件？

相关文章