python - Saving BeautifulSoup output to a DataFrame
Question
I am new to web scraping. I am trying to scrape data from a news website.
I have this code:
from bs4 import BeautifulSoup as soup
import pandas as pd
import requests

detik_url = "https://news.detik.com/indeks/2"
html = requests.get(detik_url)
bsobj = soup(html.content, 'lxml')

for link in bsobj.findAll("h3"):
    print("Headline : {}".format(link.text.strip()))

links = []
for news in bsobj.findAll('article',{'class':'list-content__item'}):
    links.append(news.a['href'])

for link in links:
    page = requests.get(link)
    bsobj = soup(page.content)
    div = bsobj.findAll('div',{'class':'detail__body itp_bodycontent_wrapper'})
    for p in div:
        print(p.find('p').text.strip())
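How can I store the scraped content in a CSV file using a pandas DataFrame?
Answer
You can store the content in a pandas DataFrame and then write that DataFrame to a CSV file.
Suppose you want to save all of the text from p.find('p').text.strip() in a CSV file together with the headline. You can keep the headline in any variable (for example head).
So, starting from your code: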
for link in links:
    page = requests.get(link)
    bsobj = soup(page.content)
    div = bsobj.findAll('div',{'class':'detail__body itp_bodycontent_wrapper'})
    for p in div: # <----- Here we make the changes
        print(p.find('p').text.strip())
At the line marked above, we do the following:
import pandas as pd

# Column header for the CSV. 'content' is only a placeholder value;
# use whatever headline or label you stored in head.
head = 'content'

# Create an empty list to store all the data
generated_text = []

for link in links:
    page = requests.get(link)
    bsobj = soup(page.content)
    div = bsobj.findAll('div',{'class':'detail__body itp_bodycontent_wrapper'})
    for p in div:
        # add a print statement here if you want to see the output
        generated_text.append(p.find('p').text.strip()) # <---- save the data in the list

# then create a dataframe from the list and write it to a csv file
df = pd.DataFrame(generated_text, columns = [head])
df.to_csv('csv_name.csv', index = False)
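The snippet above keeps only the first paragraph of each article, in a single column. If the headline should sit next to the text, one possible approach is to collect one dictionary per article and let pandas turn the keys into columns. This is a minimal sketch under stated assumptions: extract_article is a hypothetical helper, the headline is assumed to live in an h1 tag (not confirmed for the real site), and the sample HTML is invented so the example runs without network access.

```python
from bs4 import BeautifulSoup
import pandas as pd


def extract_article(html):
    """Return {'headline': ..., 'text': ...} for one article page (hypothetical helper)."""
    page = BeautifulSoup(html, "html.parser")
    # Assumption: the headline is in the first <h1>; adjust this for the real page.
    h1 = page.find("h1")
    headline = h1.get_text(strip=True) if h1 else ""
    # Matches any <div> whose class list contains "detail__body".
    body = page.find("div", class_="detail__body")
    first_p = body.find("p") if body else None
    text = first_p.get_text(strip=True) if first_p else ""
    return {"headline": headline, "text": text}


# Invented sample standing in for requests.get(link).content
SAMPLE_HTML = (
    "<html><body><h1> Sample headline </h1>"
    '<div class="detail__body itp_bodycontent_wrapper">'
    "<p> First paragraph. </p><p>Second paragraph.</p>"
    "</div></body></html>"
)

rows = [extract_article(SAMPLE_HTML)]
df = pd.DataFrame(rows, columns=["headline", "text"])

# With no path, to_csv returns the CSV as a string;
# df.to_csv('csv_name.csv', index=False) would write a file instead.
csv_text = df.to_csv(index=False)
```

In the scraping loop, each downloaded page would be passed to extract_article and the result appended to rows before the DataFrame is built once at the end.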
Alternatively, you can use a list comprehension instead of the for loop and save the result to your CSV.
# instead of the above snippet, replace the whole `for p in div` loop.
# So from your code above:
.....
    bsobj = soup(page.content)
    div = bsobj.findAll('div',{'class':'detail__body itp_bodycontent_wrapper'})
    # Remove the whole `for p in div:` loop and use this instead.
    # Note: inside `for link in links:` this rebuilds df for every link,
    # so only the last page's text is left when the CSV is written.
    df = pd.DataFrame([p.find('p').text.strip() for p in div], columns = [head])
....
df.to_csv('csv_name.csv', index = False)
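Because the comprehension runs once per page, one way to keep the text from every page is to extend a single list inside the loop and build the DataFrame once afterwards. The sketch below also skips containers that have no p child, since find('p') returns None in that case and calling .text on it would raise AttributeError. Here first_paragraphs is a hypothetical helper, 'content' is a placeholder column name, and the HTML strings stand in for downloaded pages.

```python
from bs4 import BeautifulSoup
import pandas as pd


def first_paragraphs(html):
    """Return the first <p> text of each article body in one page (hypothetical helper)."""
    page = BeautifulSoup(html, "html.parser")
    divs = page.find_all("div", class_="detail__body")
    # Skip containers without a <p> child instead of failing on None.
    return [d.find("p").get_text(strip=True) for d in divs if d.find("p") is not None]


# Invented pages standing in for requests.get(link).content
PAGES = [
    '<div class="detail__body itp_bodycontent_wrapper"><p>Page one text.</p></div>',
    '<div class="detail__body itp_bodycontent_wrapper"></div>',  # no <p> at all
    '<div class="detail__body itp_bodycontent_wrapper"><p>Page three text.</p></div>',
]

generated_text = []
for html in PAGES:
    generated_text.extend(first_paragraphs(html))  # accumulate across pages

# Build the DataFrame once, after the loop
df = pd.DataFrame(generated_text, columns=["content"])
```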
You can also convert the list produced by the list comprehension to a NumPy array and write that to a CSV file:
import numpy as np
import pandas as pd

# On a side note:
# convert the plain list to a NumPy array (or build the array from a list comprehension);
# there are also faster ways to do this conversion, which you can explore.
# From there you can write to a csv.
nparray = np.array(generated_text)  # generated_text is the list built earlier
pd.DataFrame(nparray).to_csv('csv_name.csv')
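As a rough illustration of the NumPy route, the sketch below converts a list of strings to an array, wraps it in a DataFrame, and reads the CSV text back to check that nothing was lost. The sample strings are invented and 'content' is a placeholder column name; the CSV is kept in memory here so the example leaves no file behind.

```python
import io

import numpy as np
import pandas as pd

# Invented sample data standing in for the scraped paragraphs
texts = ["First article text.", "Second article, with a comma."]

nparray = np.array(texts)  # 1-D array of strings
df = pd.DataFrame(nparray, columns=["content"])

# to_csv quotes fields that contain commas, so the text survives intact
csv_text = df.to_csv(index=False)

# Read the CSV back to confirm the round trip
restored = pd.read_csv(io.StringIO(csv_text))
```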