Build a Python web scraper function with a user-defined URL and filename

Problem description

I want the user to enter the URL and the CSV filename for this scraper.

#Dependencies 
from lxml import html
import requests
import pandas as pd 


x =input('https://web.archive.org/web/20170111201527/https://www.yellowpages.com/nashville-tn/air-conditioning-service-repair')

def Scraper(x):
#URL 
    url = x
#Use Requests to retrieve html 
    resp = requests.get(url) 
#Create Tree from Request Response 
    tree = html.fromstring(resp.content) 
#Path to Website Link 
    elements = tree.xpath('//*[starts-with(@id,"lid-")]/div/div/div[2]/div[2]/div[2]/a[1]') 
    websites = []
    for element in elements:
        try:
            websites.append("http"+element.attrib['href'].split("http")[2])
        except:
            continue
#Create Pandas Dataframe
webdf= pd.DataFrame(websites,columns =['Links']).drop_duplicates()
print(webdf)

#Export as CSV
y=input()
webdf.to_csv(y+".csv")

My output returns "NameError: name 'websites' is not defined", even though it is clearly defined in the code. I even tried adding it as an empty list before the function, with no success.

Tags: python, pandas, xpath, web-scraping, lxml

Solution


You never actually call the Scraper function, and it doesn't return a value: `websites` is a local variable inside the function, so it does not exist at module level. First change the function so it returns the list, e.g.
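The NameError comes from Python's scoping rules: a name assigned inside a function is local to that function and vanishes when the function returns, so the caller must capture the return value. A minimal illustration (with a hypothetical `build_list` function, not from the original code):

```python
def build_list():
    items = ["a", "b"]  # 'items' exists only while build_list runs
    return items        # returning it is the only way the caller gets the value

result = build_list()   # without this call and assignment, neither
print(result)           # 'items' nor 'result' would be defined here
```

The same applies to `websites` in `Scraper`: return it, then assign the result of the call.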

#Dependencies 
from lxml import html
import requests
import pandas as pd 


x = input('Enter the URL: ')  # e.g. https://web.archive.org/web/20170111201527/https://www.yellowpages.com/nashville-tn/air-conditioning-service-repair

def Scraper(x):
#URL 
    url = x
#Use Requests to retrieve html 
    resp = requests.get(url) 
#Create Tree from Request Response 
    tree = html.fromstring(resp.content) 
#Path to Website Link 
    elements = tree.xpath('//*[starts-with(@id,"lid-")]/div/div/div[2]/div[2]/div[2]/a[1]') 
    websites = []
    for element in elements:
        try:
            # hrefs on the archived page embed the original URL after a second "http"
            websites.append("http" + element.attrib['href'].split("http")[2])
        except (KeyError, IndexError):
            continue
    return websites

and then call it:

websites = Scraper(x)
webdf = pd.DataFrame(websites,columns =['Links']).drop_duplicates()
print(webdf)

#Export as CSV
y = input('Enter the CSV filename: ')
webdf.to_csv(y+".csv")
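As a side note, `to_csv` writes the DataFrame's row index as an extra first column by default; passing `index=False` keeps the file to just the Links column. A small sketch with made-up dummy URLs (not from the scraped page):

```python
import pandas as pd

# Dummy data standing in for the scraped links
webdf = pd.DataFrame(["http://a.example", "http://a.example", "http://b.example"],
                     columns=["Links"]).drop_duplicates()

# index=False omits the row-number column from the CSV
webdf.to_csv("links.csv", index=False)

with open("links.csv") as f:
    print(f.read())
```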
