Urllib throws HTTP Error 503 on one machine but not on another

Problem Description

I have a web scraper that retrieves information from the SEC website. It has always run fine on my local Windows machine, but it raises an HTTP Error 503 when I run it on a virtual Linux machine. What could be causing this?

For the headers, I also tried 'User-Agent': 'Mozilla/5.0', but it did not help.
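A minimal sketch of that attempt, assuming the header was passed to Request the same way as in the full code below (the ticker URL is only an example):

from urllib.request import Request, urlopen

link = "https://sec.report/Ticker/AAPL"  # AAPL is only an example ticker
req = Request(link, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()  # still raises HTTP Error 503 on the Linux VM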

The full code is below:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import pandas as pd


def sec_filings():
    data = pd.read_excel('research_files/tickers.xlsx', index_col=0, parse_dates=True)
    frames = []

    for ticker in data['Common_Ticker']:
        url = f"https://sec.report/Ticker/{ticker}"
        req = Request(url, headers={"User-Agent": 'Bot 91888'})
        webpage = urlopen(req).read()  # raises HTTP Error 503 on the Linux VM

        soup = BeautifulSoup(webpage, 'html5lib')

        sec_files = pd.DataFrame()
        form = []
        links = []
        name = []
        posted = []

        try:
            # The second table with class "table" holds the filings list.
            table = soup.find_all('table', attrs={'class': 'table'})[1]

            for item in table.find_all('td'):
                form.append(item.get_text())
                for i in item.find_all('small'):
                    posted.append(i.get_text())

            for itemb in table.find_all('a', href=True):
                links.append(f"https://sec.report/{itemb['href']}")
                name.append(itemb.get_text())
        except IndexError:
            # The page did not contain the expected tables; skip this ticker.
            continue

        try:
            sec_files['form'] = form[0::2]
            sec_files['link'] = links
            sec_files['name'] = name
            sec_files['posted'] = posted
            sec_files['ticker'] = ticker
        except ValueError:
            # Column lengths did not match; skip this ticker.
            continue

        frames.append(sec_files)
        print(ticker, end='\r')

    # DataFrame.append was removed in pandas 2.0; collect frames and concat once.
    if frames:
        pd.concat(frames, ignore_index=True).to_csv('research_files/sec_all_data.csv')

Tags: python, web-scraping, webserver

Solution
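
A 503 that appears on one machine but not another almost always means the server, or an anti-bot / rate-limiting layer in front of it, is rejecting the request based on where or how it arrives, not that the code itself is broken: datacenter and VM IP ranges are commonly blocked while residential IPs are allowed through. A useful first step is to inspect the 503 response itself, since its headers and body usually reveal who is rejecting the request. A minimal diagnostic sketch with urllib (the ticker URL is only an example):

from urllib.request import Request, urlopen
from urllib.error import HTTPError

link = "https://sec.report/Ticker/AAPL"  # AAPL is only an example ticker
req = Request(link, headers={"User-Agent": "Bot 91888"})
try:
    webpage = urlopen(req).read()
except HTTPError as e:
    # A 503 response still carries headers and a body; both usually show
    # whether an anti-bot page or a rate limiter produced the rejection.
    print(e.code, e.headers.get("Server"))
    print(e.read()[:500])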


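If the response points to rate limiting or header-based filtering rather than a hard IP block, sending a fuller browser-like header set and retrying 503s with backoff sometimes helps. A sketch using requests; the header values and retry parameters are illustrative assumptions, not a guaranteed fix:

import time
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Browser-like headers; the exact values are illustrative assumptions.
HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0 Safari/537.36"),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

session = requests.Session()
# Retry 503s a few times with exponential backoff instead of failing outright.
retry = Retry(total=3, backoff_factor=2, status_forcelist=[503])
session.mount("https://", HTTPAdapter(max_retries=retry))

resp = session.get("https://sec.report/Ticker/AAPL",  # AAPL is only an example
                   headers=HEADERS, timeout=30)
resp.raise_for_status()
html = resp.text
time.sleep(1)  # pause between tickers to stay under any rate limit

If the request is hitting a hard IP-range block, no header will help; running the scraper from a different network (or the working Windows machine) is the practical test. Note also that if the data ultimately comes from SEC EDGAR itself, sec.gov's fair-access guidelines ask clients to send a descriptive User-Agent with contact details and to keep the request rate low, and a third-party mirror such as sec.report may apply similar rules.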