Saving BeautifulSoup output to a DataFrame

Question

I am new to web scraping and I am trying to scrape data from a news website.

I have this code:

from bs4 import BeautifulSoup as soup
import pandas as pd
import requests

detik_url = "https://news.detik.com/indeks/2"
detik_url

html = requests.get(detik_url)

bsobj = soup(html.content, 'lxml')
bsobj

for link in bsobj.findAll("h3"):
  print("Headline : {}".format(link.text.strip()))

links = []
for news in bsobj.findAll('article',{'class':'list-content__item'}):
  links.append(news.a['href'])

for link in links:
  page = requests.get(link)
  bsobj = soup(page.content)
  div = bsobj.findAll('div',{'class':'detail__body itp_bodycontent_wrapper'})
  for p in div:
    print(p.find('p').text.strip())

How can I use a pandas DataFrame to store the scraped content in a CSV file?

Tags: python, html, pandas

Answer


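You can store the content in a pandas DataFrame and then write that DataFrame to a CSV file.

Assuming you want to save all the text from p.find('p').text.strip() in the CSV file together with a column header, you can store the header in any variable (for example, head).

So, from your code: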

for link in links:
  page = requests.get(link)
  bsobj = soup(page.content)
  div = bsobj.findAll('div',{'class':'detail__body itp_bodycontent_wrapper'})
  for p in div:                 # <----- Here we make the changes
    print(p.find('p').text.strip())

At the line marked above, we make the following change:

import pandas as pd

head = 'text'  # column header for the CSV; the name is only an example

generated_text = []  # list that collects the text from every page

for link in links:
    page = requests.get(link)
    bsobj = soup(page.content, 'lxml')
    div = bsobj.findAll('div', {'class': 'detail__body itp_bodycontent_wrapper'})
    for p in div:
        first_p = p.find('p')
        if first_p is not None:  # skip bodies that contain no <p> tag
            # add a print() here if you want to see the output
            generated_text.append(first_p.text.strip())  # <---- save the data in the list

# then write this to a csv file using pandas; first create a
# dataframe from the list
df = pd.DataFrame(generated_text, columns=[head])

# save this into a csv file
df.to_csv('csv_name.csv', index=False)
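As a variation on the loop above, here is a minimal sketch that keeps each link next to its text in a two-column DataFrame and skips pages where the article body or its first paragraph is missing. The SAMPLE_PAGES dictionary, the example.com URLs, the first_paragraph helper, and the column names are all assumptions made for illustration; in the real script the HTML would come from requests.get(link).content. The sketch matches on the single class detail__body and uses the built-in html.parser so that it runs without lxml.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-ins for the pages that requests.get(link).content would return.
SAMPLE_PAGES = {
    "https://example.com/news/1": (
        '<div class="detail__body itp_bodycontent_wrapper">'
        "<p> First article text. </p><p>Second paragraph.</p></div>"
    ),
    "https://example.com/news/2": (
        '<div class="detail__body itp_bodycontent_wrapper">'
        "<p>Second article text.</p></div>"
    ),
    # A page without the expected body, to exercise the None guard.
    "https://example.com/news/3": '<div class="other">No article here</div>',
}


def first_paragraph(html):
    """Return the stripped text of the first <p> in the article body, or None."""
    page = BeautifulSoup(html, "html.parser")
    body = page.find("div", class_="detail__body")
    if body is None:
        return None
    paragraph = body.find("p")
    return paragraph.get_text(strip=True) if paragraph else None


rows = []
for link, html in SAMPLE_PAGES.items():
    text = first_paragraph(html)
    if text is not None:
        rows.append({"link": link, "text": text})

df = pd.DataFrame(rows, columns=["link", "text"])

# With no path argument, to_csv returns the CSV content as a string.
csv_text = df.to_csv(index=False)
print(csv_text)
```

Collecting one dictionary per page and building the DataFrame once at the end keeps the link and text columns aligned, and passing a file name to to_csv instead writes the same content to disk.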

Alternatively, you can replace the inner for loop with a list comprehension and save the result to your CSV.


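# instead of the inner `for p in div` loop, collect the text with a comprehension.
# extend() lets the rows from every link accumulate; building the DataFrame
# inside the loop would keep only the last page.

generated_text = []
for link in links:
    page = requests.get(link)
    bsobj = soup(page.content, 'lxml')
    div = bsobj.findAll('div', {'class': 'detail__body itp_bodycontent_wrapper'})
    generated_text.extend(p.find('p').text.strip() for p in div)

# head is the variable that holds your column header
df = pd.DataFrame(generated_text, columns=[head])
df.to_csv('csv_name.csv', index=False)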


You can also convert the list of scraped text to a NumPy array and write that to a CSV file through a DataFrame:

import numpy as np
import pandas as pd

# On a side note:
# convert the plain list to a numpy array (or build the array from a list comprehension);
# there are other ways to build the array that you can explore.
# From there you can write it to a csv through a dataframe.

nparray = np.array(generated_text)  # generated_text is the list of scraped text
pd.DataFrame(nparray).to_csv('csv_name.csv')
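Scraped text often contains commas and quotation marks, so it can be worth checking that the CSV written this way reads back unchanged. The following is a small sketch of such a check; the sample strings and the text column name are made up for illustration, and an in-memory buffer stands in for the file on disk.

```python
import io

import numpy as np
import pandas as pd

# Made-up sample strings, including a comma and double quotes.
texts = [
    "Plain sentence.",
    "Sentence, with a comma.",
    'Sentence with "quotes".',
]
nparray = np.array(texts)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
pd.DataFrame(nparray, columns=["text"]).to_csv(buffer, index=False)

# Read the CSV back and compare it with the original strings.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored["text"].tolist() == texts)
```

pandas quotes fields that contain the delimiter or quote characters by default, which is why going through a DataFrame is a convenient way to write free text to CSV.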
