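python - Unable to scrape a web page with many tables using Python lxml
Problem description
I am trying to scrape this web page but I get no results, even though the same code works on other pages that contain only one simple table. Can you help me with the code?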
import lxml
from lxml import html
import requests
import numpy as np
import pandas as pd
import urllib

def scrape_table(url):
    # Fetch the page that we're going to parse
    page = requests.get(url)
    tree = html.fromstring(page.content)
    # Using XPATH, fetch all table elements on the page
    #df = tree.xpath('//div[@id="main content"]/div[@id="style-1"]/table[@class="table"]/tbody')
    df = tree.xpath('//tr')
    #assert len(table) == 1
    #df = pd.read_html(lxml.etree.tostring(table[0], method='html'))[0]
    return df

symbol = 'AMZN'
#balance_sheet_url = 'https://finance.yahoo.com/quote/' + symbol + '?p=' + symbol
#df_balance_sheet = scrape_table(balance_sheet_url)
#df_balance_sheet.info()
#print(df_balance_sheet)
url = "https://www.macrotrends.net/stocks/charts/"+ symbol + "/pe-ratio"
data = requests.request("GET", url)
url_completo = data.url
print(url_completo)
df_pe = scrape_table(url_completo)
This is the page I am trying to scrape, followed by the relevant part of its HTML: https://www.macrotrends.net/stocks/charts/TMO/thermo-fisher-scientific/pe-ratio
<div id="style-1" style="background-color:#fff; height: 500px; overflow:auto; margin: 0px 0px 30px 0px; padding:0px 30px 20px 0px; border:1px solid #dfdfdf;">
<table class="table">
<thead>
<tr>
<th colspan="4" style="text-align:center;">Thermo Fisher Scientific PE Ratio Historical Data</th>
</tr>
</thead>
<thead>
<tr>
<th style="text-align:center;">Date</th>
<th style="text-align:center;">Stock Price</th>
<th style="text-align:center;">TTM Net EPS</th>
<th style="text-align:center;">PE Ratio</th>
</tr>
</thead>
<tbody><tr>
<td style="text-align:center;">2019-04-12</td>
<td style="text-align:center;">280.65</td>
<td style="text-align:center;"></td>
<td style="text-align:center;">38.71</td>
</tr><tr>
<td style="text-align:center;">2018-12-31</td>
<td style="text-align:center;">223.79</td>
<td style="text-align:center;">$7.25</td>
<td style="text-align:center;">30.87</td>
</tr><tr>
<td style="text-align:center;">2018-09-30</td>
<td style="text-align:center;">243.90</td>
<td style="text-align:center;">$6.33</td>
<td style="text-align:center;">38.53</td>
</tr><tr>
<td style="text-align:center;">2018-06-30</td>
<td style="text-align:center;">206.84</td>
<td style="text-align:center;">$5.92</td>
<td style="text-align:center;">34.94</td>
</tr>
</table>
</div>
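Solution
You are not building your URL correctly. Note that the URLs in the code below include the company name after the ticker symbol (for example AMZN/amazon) rather than the ticker alone. This code fetches two tables, one for Amazon and one for Thermo Fisher Scientific.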
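from io import StringIO

from lxml import etree, html
import requests
import pandas as pd

# Print wide DataFrames on one line instead of wrapping the columns
pd.set_option('display.expand_frame_repr', False)

def scrape_table(url):
    # Fetch the page that we're going to parse
    page = requests.get(url)
    tree = html.fromstring(page.content)
    # Collect the <table> elements and convert the first one to a DataFrame
    tables = tree.findall('.//*/table')
    table_html = etree.tostring(tables[0], method='html', encoding='unicode')
    # Wrap the markup in StringIO: recent pandas versions deprecate literal HTML input
    df = pd.read_html(StringIO(table_html))[0]
    return df

for symbol in ['AMZN/amazon', 'TMO/thermo-fisher-scientific']:
    url = "https://www.macrotrends.net/stocks/charts/" + symbol + "/pe-ratio"
    # Request the page once to follow any redirect and get the final URL
    data = requests.request("GET", url)
    url_completo = data.url
    print(url_completo)
    df_pe = scrape_table(url_completo)
    print(df_pe)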
Output:
  Amazon PE Ratio Historical Data
         Date  Stock Price  TTM Net EPS  PE Ratio
0  2019-04-12      1843.06          NaN     91.56
1  2018-12-31      1501.97       $20.13     74.61
2  2018-09-30      2003.00       $17.84    112.28
...
  Thermo Fisher Scientific PE Ratio Historical Data
         Date  Stock Price  TTM Net EPS  PE Ratio
0  2019-04-12       280.65          NaN     38.71
1  2018-12-31       223.79        $7.25     30.87
2  2018-09-30       243.90        $6.33     38.53
...
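
The printed tables still carry a title row in the header and keep TTM Net EPS as text such as $7.25. As a rough sketch of one way to target the table directly and get numeric columns, the snippet below parses a trimmed copy of the HTML from the question without any network access. The function name parse_pe_table and the SAMPLE_HTML constant are made up for this illustration, and the XPath assumes the table sits inside the div with id "style-1" as in the question's markup; the live page may differ.

```python
from lxml import html
import pandas as pd

# Trimmed copy of the markup quoted in the question (illustrative sample only)
SAMPLE_HTML = """
<div id="style-1">
<table class="table">
<thead><tr><th colspan="4">Thermo Fisher Scientific PE Ratio Historical Data</th></tr></thead>
<thead><tr><th>Date</th><th>Stock Price</th><th>TTM Net EPS</th><th>PE Ratio</th></tr></thead>
<tbody>
<tr><td>2019-04-12</td><td>280.65</td><td></td><td>38.71</td></tr>
<tr><td>2018-12-31</td><td>223.79</td><td>$7.25</td><td>30.87</td></tr>
<tr><td>2018-09-30</td><td>243.90</td><td>$6.33</td><td>38.53</td></tr>
</tbody>
</table>
</div>
"""

def parse_pe_table(html_text):
    """Hypothetical helper: extract the PE ratio table into a typed DataFrame."""
    tree = html.fromstring(html_text)
    # Assumption: the table lives in <div id="style-1">, as in the question
    tables = tree.xpath('//div[@id="style-1"]/table[@class="table"]')
    if not tables:
        # Fail loudly instead of silently returning nothing
        raise ValueError("no matching table found in the HTML")
    table = tables[0]
    # The second <thead> holds the real column names; the first is a title row
    header = [th.text_content().strip()
              for th in table.xpath('./thead[last()]/tr/th')]
    # Keep only rows that contain data cells
    rows = [[td.text_content().strip() for td in tr.xpath('./td')]
            for tr in table.xpath('.//tr[td]')]
    df = pd.DataFrame(rows, columns=header)
    df["Date"] = pd.to_datetime(df["Date"])
    for col in ["Stock Price", "TTM Net EPS", "PE Ratio"]:
        cleaned = (df[col].str.replace("$", "", regex=False)
                          .str.replace(",", "", regex=False))
        # Empty cells become NaN rather than raising
        df[col] = pd.to_numeric(cleaned, errors="coerce")
    return df

df = parse_pe_table(SAMPLE_HTML)
print(df)
```

To use the same helper on a live page, pass it page.content from a requests.get call. Raising ValueError when nothing matches makes the "no results" case from the question visible immediately.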