
Extracting data from a figure container with Python

Question

I'm new to Stack Overflow and new to Python. I'm trying to scrape some data from a website. I've managed to extract text from paragraphs, and I've downloaded a file from a link. I now want to extract the data from a figure container.

The HTML looks like this:

<figure class="chart-container"
        data-chart-type="stacked-level"
        data-anchor=""
        data-is-split-series="False"
        data-all-data='[["Date","Residential","Non-residential","Other construction"],["15",13419.0,5858.0,6000.0],["Jun-15",13536.0,5918.0,5962.0],["Sep-15",13750.0,5870.0,5942.0],["Dec-15",14003.0,5962.0,5957.0],["16",14368.0,6104.0,5873.0],["Jun-16",14868.0,6296.0,5657.0],["Sep-16",15234.0,6524.0,5534.0],["Dec-16",15514.0,6747.0,5456.0],["17",15587.0,6756.0,5408.0],["Jun-17",15496.0,6677.0,5508.0],["Sep-17",15561.0,6597.0,5815.0],["Dec-17",15653.0,6559.0,6130.0],["18",15750.0,6590.0,6356.0],["Jun-18",15893.0,6660.0,6523.0],["Sep-18",15953.0,6710.0,6413.0],["Dec-18",16063.0,6804.0,6294.0],["19",16321.0,7064.0,6111.0],["Jun-19",16526.0,7226.0,5927.0],["Sep-19",16848.0,7408.0,5819.0],["Dec-19",16972.0,7499.0,5743.0],["20",17008.0,7342.0,5753.0],["Jun-20",17148.0,7287.0,5775.0],["Sep-20",17150.0,7201.0,5887.0],["Dec-20",17118.0,7106.0,6005.0],["21",17134.0,7050.0,6102.0],["Jun-21",17108.0,6926.0,6159.0],["Sep-21",17128.0,6788.0,6285.0],["Dec-21",17131.0,6655.0,6389.0],["22",16954.0,6595.0,6490.0],["Jun-22",16742.0,6575.0,6541.0],["Sep-22",16444.0,6606.0,6636.0],["Dec-22",15987.0,6643.0,6726.0],["23",15470.0,6684.0,6815.0],["Jun-23",14956.0,6740.0,6831.0],["Sep-23",14417.0,6786.0,6931.0],["Dec-23",13982.0,6799.0,7029.0],["24",13504.0,6783.0,7127.0],["Jun-24",13035.0,6740.0,7150.0]]'
        data-show-text-every="4"
        data-color="#233657,#64971c,#869eac,#00cc7a,#c5cdd3"
        data-forecast-start="17"

        >
    Chart goes here.
</figure>

I want to extract the data held in the "data-all-data" attribute. Ideally I'd like to save it to a .csv file so that I can recreate the chart.

import requests
from bs4 import BeautifulSoup

# Create a dictionary for the login data
login_data = {
    'UserName': 'myUsername',
    'Password': 'myPassword',
    'RememberMe': 'true'
}

# Create a session
with requests.Session() as s:
    url = 'https://portal.infometrics.co.nz/Login'
    r = s.get(url)
    soup = BeautifulSoup(r.content, 'html5lib')
    # Add the unique login values to the dictionary
    login_data['ReturnUrl']= soup.find('input', attrs={'id': 'ReturnUrl'})['value']
    login_data['__RequestVerificationToken']= soup.find('input', attrs={'name': '__RequestVerificationToken'})['value']
    r = s.post(url, data=login_data)
    #soup = BeautifulSoup(r.content, 'html5lib')
    #print(soup.prettify())

    # 1. Find the latest 'Downloads' file link
    url = 'https://portal.infometrics.co.nz/Forecasts/Building%20forecasts'
    r = s.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    el_d = soup.find(string='Data download')
    url_2 = el_d.find_parent('a')['href']

    # Prepend the known base URL to the relative link
    url_1 = 'https://portal.infometrics.co.nz'
    url_d = url_1 + url_2
    #print(url_d)

    r = s.get(url_d)
    # Save the content into an .xlsx workbook
    with open('C:/Users/ZAGOOBR/Downloads/QBR_Data.xlsx', 'wb') as f:
        f.write(r.content)

    # 2. Find the latest 'Chart write up'
    r = s.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    el_cw = soup.find('p').getText()
    #print(el_cw)
    # Save the content into a .txt file
    with open('C:/Users/ZAGOOBR/Downloads/QBR_ChartText.txt', 'a') as f:
        f.write(el_cw)

    # 3. Download the chart data
    chart_rows = []
    el_cd = soup.find('figure', attrs={'class': 'data-all-data'})

Any help for a beginner would be much appreciated.

Tags: python, html, beautifulsoup

Solution


Assuming the rest of your code works as you intend, try the following. The problem in your step 3 is that `chart-container` is the element's class, while `data-all-data` is an attribute on that element, not a class:

import json
import csv

# your code here

el_cd = soup.find('figure', attrs={'class': 'chart-container'})
data = el_cd.get('data-all-data')

rows = json.loads(data)
with open('your-file-here.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(rows)
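As a self-contained check (no login required), the same extraction can be exercised against a trimmed copy of the `<figure>` snippet from your question. The attribute value is valid JSON, so `json.loads` turns it straight into a list of rows, header first; the in-memory `io.StringIO` buffer below just stands in for a real file path:

```python
import csv
import io
import json

from bs4 import BeautifulSoup

# A trimmed copy of the <figure> element from the question.
html = '''
<figure class="chart-container"
        data-chart-type="stacked-level"
        data-all-data='[["Date","Residential","Non-residential"],["15",13419.0,5858.0],["Jun-15",13536.0,5918.0]]'>
    Chart goes here.
</figure>
'''

soup = BeautifulSoup(html, 'html.parser')
el_cd = soup.find('figure', attrs={'class': 'chart-container'})

# The attribute string is JSON: a list of rows, with the header row first.
rows = json.loads(el_cd['data-all-data'])

# Write to an in-memory buffer here; swap io.StringIO() for
# open('chart_data.csv', 'w', newline='') to save a real file.
buf = io.StringIO()
csv.writer(buf).writerows(rows)
print(buf.getvalue().splitlines()[0])  # Date,Residential,Non-residential
```

Because the header row comes through intact, the resulting .csv opens directly in Excel with named columns, which should be enough to recreate the chart.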
