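python - How do I run a Python command that clicks every link on a page and extracts each link's title, content, and date?
Problem description
Using this link: https://1997-2001.state.gov/briefings/statements/2000/2000_index.html. I have a command that follows every link on the page and pulls out all the data, but I want to convert the result to a CSV file, so I need to extract three separate things for each article on the page: the title, the paragraphs, and the date (so that they can become columns in an Excel sheet). I'm stuck because this page has no 'class' or 'id' attributes. Any suggestions would be very helpful.
Here is my current code: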
import requests
from bs4 import BeautifulSoup

url = 'https://1997-2001.state.gov/briefings/statements/2000/2000_index.html'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Follow each statement link on the index page, skipping the first 400
for a in soup.select('td[width="580"] img + a')[400:]:
    u = 'https://1997-2001.state.gov/briefings/statements/2000/' + a['href']
    print(u)
    s = BeautifulSoup(requests.get(u).content, 'html.parser')
    t = s.select_one('td[width="580"], td[width="600"], table[width="580"]:has(td[colspan="2"])').get_text(strip=True, separator='\n')
    print(t.split('[end of document]')[0])
    print('-' * 80)
Solution
You can use this script to save the data to CSV:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://1997-2001.state.gov/briefings/statements/2000/2000_index.html'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []
for a in soup.select('td[width="580"] img + a'):
    # The link text is used as the date; the text node right after the link is the title
    date = a.text.strip(':')
    title = a.find_next_sibling(text=True).strip(': ')
    u = 'https://1997-2001.state.gov/briefings/statements/2000/' + a['href']
    print(u)
    s = BeautifulSoup(requests.get(u).content, 'html.parser')
    t = s.select_one('td[width="580"], td[width="600"], table[width="580"]:has(td[colspan="2"])').get_text(strip=True, separator='\n')
    # Keep only the text before the end-of-document marker
    content = t.split('[end of document]')[0]
    print(date, title, content)
    all_data.append({
        'url': u,
        'date': date,
        'title': title,
        'content': content
    })
    print('-' * 80)

# One row per article, one column per field
df = pd.DataFrame(all_data)
df.to_csv('data.csv', index=False)
print(df)
Prints:
...
url ... content
0 https://1997-2001.state.gov/briefings/statemen... ... Statement by Philip T. Reeker, Deputy Spokesma...
1 https://1997-2001.state.gov/briefings/statemen... ... Media Note\nDecember 26, 2000\nRenewal of the ...
2 https://1997-2001.state.gov/briefings/statemen... ... Statement by Philip T. Reeker, Deputy Spokesma...
3 https://1997-2001.state.gov/briefings/statemen... ... Notice to the Press\nDecember 21, 2000\nMeetin...
4 https://1997-2001.state.gov/briefings/statemen... ... Statement by Philip T. Reeker, Deputy Spokesma...
.. ... ... ...
761 https://1997-2001.state.gov/briefings/statemen... ... Press Statement by James P. Rubin, Deputy Spok...
762 https://1997-2001.state.gov/briefings/statemen... ... Press Statement by James P. Rubin, Spokesman\n...
763 https://1997-2001.state.gov/briefings/statemen... ... Notice to the Press\nJanuary 6, 2000\nAssistan...
764 https://1997-2001.state.gov/briefings/statemen... ... Press Statement by James P. Rubin, Spokesman\n...
765 https://1997-2001.state.gov/briefings/statemen... ... Press Statement by James P. Rubin, Spokesman\n...
[766 rows x 4 columns]
and saves data.csv.
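Because the index page has no class or id attributes, the script leans on structural selectors: `img + a` matches a link that immediately follows an image, and the title is taken from the text node right after that link. Here is a minimal sketch of that idea on a made-up HTML fragment. The markup, file names, and titles are hypothetical stand-ins rather than content copied from the real page, `parse_index` is an illustrative helper name, and `string=True` is the newer spelling of `text=True` in recent Beautiful Soup releases.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating an index page without class/id attributes
SAMPLE_HTML = """
<table><tr><td width="580">
<img src="bullet.gif"><a href="ps000102.html">01/02/00:</a> First sample statement<br>
<img src="bullet.gif"><a href="ps000103.html">01/03/00:</a> Second sample statement<br>
<a href="other.html">Unrelated link</a>
</td></tr></table>
"""


def parse_index(html):
    """Return one dict (href, date, title) per link that directly follows an <img>."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for a in soup.select('td[width="580"] img + a'):
        # The link text carries the date, with a trailing colon
        date = a.get_text().strip(': ')
        # The next text node after the link carries the title
        sibling = a.find_next_sibling(string=True)
        title = sibling.strip(': \n') if sibling else ''
        rows.append({'href': a['href'], 'date': date, 'title': title})
    return rows


rows = parse_index(SAMPLE_HTML)
for row in rows:
    print(row)
```

The last link in the fragment is skipped because the element just before it is a `<br>`, not an `<img>`, which is how the selector separates statement links from other links in the same cell.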
Edit: for the year 1998:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://1997-2001.state.gov/briefings/statements/1998/1998_index.html'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []
# The 1998 index also lists links inside <blockquote> elements
for a in soup.select('td[width="580"] img + a, blockquote img + a'):
    date = a.text.strip(':')
    title = a.find_next_sibling(text=True).strip(': ')
    u = 'https://1997-2001.state.gov/briefings/statements/1998/' + a['href']
    print(u)
    s = BeautifulSoup(requests.get(u).content, 'html.parser')
    # Skip pages that parse without a <body>
    if not s.body:
        continue
    # Extra fallbacks (blockquote, body) for pages without the usual table layout
    t = s.select_one('td[width="580"], td[width="600"], table[width="580"]:has(td[colspan="2"]), blockquote, body').get_text(strip=True, separator='\n')
    content = t.split('[end of document]')[0]
    print(date, title, content)
    all_data.append({
        'url': u,
        'date': date,
        'title': title,
        'content': content
    })
    print('-' * 80)

df = pd.DataFrame(all_data)
df.to_csv('data.csv', index=False)
print(df)
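The 2000 and 1998 scripts differ mainly in the year embedded in the URLs, plus a few extra fallback selectors. One way to avoid duplicating them is to build the URLs from the year, as in the sketch below. The helper names `index_url` and `article_url` and the sample `href` values are hypothetical, and the sketch assumes other years follow the same `<year>/<year>_index.html` pattern seen above, which has not been checked against the site.

```python
from urllib.parse import urljoin

BASE = 'https://1997-2001.state.gov/briefings/statements/'


def index_url(year):
    """Build the index page URL for a year (assumes the <year>/<year>_index.html pattern)."""
    return urljoin(BASE, f'{year}/{year}_index.html')


def article_url(year, href):
    """Resolve a link's href against that year's index page."""
    # urljoin handles relative hrefs and leaves absolute URLs unchanged
    return urljoin(index_url(year), href)


print(index_url(1998))
print(article_url(2000, 'ps001229.html'))  # hypothetical file name
```

Using `urljoin` instead of string concatenation also keeps the result correct if an index page happens to contain an absolute link.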