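python - Combining several data frames by date in Python
Problem description
The purpose of this code is to scrape a set of URLs and extract a data table from each one.
Each table is converted to a pandas data frame: the dates are fixed, unnecessary columns are dropped, and the columns are renamed. Finally, all the frames are combined into a single data frame indexed by date, so that the data is sorted by date and events that happened at the same time end up in the same row. The raw data before concatenation: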
   Release Date               Argentina Economic Activity YoY
0  2018-10-25 21:00:00+02:00  -1.6%
1  2018-09-26 21:00:00+02:00  -2.7%
2  2018-08-23 21:00:00+02:00  -6.7%
3  2018-07-24 21:00:00+02:00  -5.8%
4  2018-06-26 21:00:00+02:00  -0.9%
   Release Date               Argentina Gross Domestic Product (GDP) YoY
0  2018-09-19 22:00:00+02:00  -4.2%
1  2018-06-19 21:00:00+02:00  3.6%
2  2018-03-21 21:00:00+02:00  3.9%
3  2017-12-20 22:00:00+02:00  4.2%
4  2017-09-21 21:00:00+02:00  2.7%
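After concatenation, however, different dates end up in the same row. With three tables, for example, I find three dates in the first row, three more in the second row, and so on.
Like this: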
2018-01-24 22:00:00+02:00, 2016-06-29 21:00:00... 3.9% 0.5%
(2018-02-28 22:00:00+02:00, 2016-09-22 21:00:00... 2.0% -3.4%
(2018-03-28 21:00:00+02:00, 2016-12-22 22:00:00... 4.1% -3.8%
(2018-04-24 21:00:00+02:00, 2017-03-21 21:00:00... 5.1% -2.1%
Here is the code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec
import pandas as pd
from datetime import datetime
from tzlocal import get_localzone
import time


class DataEngine:
    def __init__(self):
        self.urls = open(r"C:\Users\Sayed\Desktop\sample.txt").readlines()
        self.driver = webdriver.Chrome(r"D:\Projects\Tutorial\Driver\chromedriver.exe")
        self.wait = WebDriverWait(self.driver, 10)
        self.time = time.time()

    def title(self):
        names = []
        for url in self.urls:
            self.driver.get(url)
            title = self.driver.title
            names.append(title)
        return names

    def table(self):
        DataFrames = []
        for url in self.urls:
            self.driver.get(url)
            while True:
                try:
                    item = self.wait.until(
                        ec.visibility_of_element_located((By.XPATH, '//*[contains(@id,"showMoreHistory")]/a')))
                    self.driver.execute_script("arguments[0].click();", item)
                except Exception:
                    break
            df = pd.DataFrame(columns=['Release Date', 'Time', 'Actual', 'Forecast', 'Previous'])
            pos = 0
            for table in self.wait.until(
                    ec.visibility_of_all_elements_located((By.XPATH, '//*[contains(@id,"eventHistoryTable")]//tr'))):
                data = [item.text for item in table.find_elements_by_xpath(".//*[self::td]")]
                if data:
                    df.loc[pos] = data[0:5]
                    pos += 1
            df = df.head(10)
            DataFrames.append(df)
        return DataFrames

    def date(self):
        dfs = []
        tables = self.table()
        for df in tables:
            Dates = []
            df["Date"] = df["Release Date"].apply(lambda x: x[:12]) + " " + df["Time"]
            for date in df["Date"]:
                date = datetime.strptime(date.strip(), '%b %d, %Y %H:%M')
                Dates.append(date)
            df["Date"] = Dates
            df['Date'] = df['Date'].dt.tz_localize('US/Eastern').dt.tz_convert(get_localzone())
            df = df[['Date', 'Actual', 'Forecast', 'Previous', 'Release Date', 'Time']]
            df = df.drop(df.columns[-4:], axis=1).reset_index(drop=True)
            dfs.append(df)
        return dfs

    def rename(self):
        FinalDataFrames = []
        tables = self.date()
        names = self.title()
        for name, table in zip(names, tables):
            table.rename(columns={'Date': 'Release Date', 'Actual': name}, inplace=True)
            table['Release Date'] = pd.to_datetime(table['Release Date'])
            FinalDataFrames.append(table)
        return FinalDataFrames

    def update(self):
        dfs = self.rename()
        for df in dfs:
            last_read = df.iloc[0, 0]
            latest_release_date = self.driver.find_element_by_xpath('//*[@id="releaseInfo"]/span[1]/div').text
            latest_release_time = self.driver.find_elements_by_css_selector('td.left')[1].text
            latest = latest_release_date + ' ' + latest_release_time
            latest = pd.to_datetime(latest)
            latest_release = latest.tz_localize('US/Eastern').tz_convert(get_localzone())
            if last_read == latest_release:
                pass
            else:
                self.rename()

    def final_df(self):
        self.update()
        while True:
            dfs = self.rename()
            df = pd.concat(dfs, axis=1, join='outer')
            df = df.set_index('Release Date')
            df = df.sort_index(ascending=True)
            print('fin', time.time() - self.time)
            print(df)
            df.to_csv('FinalDF.csv')


if __name__ == "__main__":
    DataEngine().final_df()
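Solution
It looks like you are creating data frames with a numeric index that starts at 0. When you concatenate them along the columns (axis=1), pandas combines the records that share the same index value. You should set the date as the index before concatenating, which gives pandas the chance to combine the records that have the same date.
Here is a simplified example. Let's create two data frames with dates and some values: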
>>> df1 = pd.DataFrame([['2018-10-01', 3.1], ['2018-10-03', 5.5]],
...                    columns=['date', 'growth %'])
>>> df1
         date  growth %
0  2018-10-01       3.1
1  2018-10-03       5.5
>>> df2 = pd.DataFrame([['2018-10-01', 100], ['2018-10-02', 200]],
...                    columns=['date', 'items'])
>>> df2
         date  items
0  2018-10-01    100
1  2018-10-02    200
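If we concatenate them directly, pandas combines the records that have the same index value, which leaves us with two date columns and records that are not properly aligned on the time axis: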
>>> pd.concat([df1, df2], axis=1)
         date  growth %        date  items
0  2018-10-01       3.1  2018-10-01    100
1  2018-10-03       5.5  2018-10-02    200
This is not what you want.
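A short sketch of why this happens, rebuilding the two example frames so the block runs on its own. Both frames carry the same default integer index, so rows are paired by position rather than by date.

```python
import pandas as pd

df1 = pd.DataFrame([['2018-10-01', 3.1], ['2018-10-03', 5.5]],
                   columns=['date', 'growth %'])
df2 = pd.DataFrame([['2018-10-01', 100], ['2018-10-02', 200]],
                   columns=['date', 'items'])

# Both frames have the default integer index 0, 1, so row 1 of df1
# is paired with row 1 of df2 even though their dates differ.
same_index = df1.index.equals(df2.index)
print(same_index)

side_by_side = pd.concat([df1, df2], axis=1)
# The result has two columns labelled 'date'; selecting that label returns both.
both_dates = side_by_side['date']
print(both_dates)
```

The second row of `both_dates` holds two different dates, which is the same misalignment the question shows.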
The first step is to convert the date column of each data frame to datetime objects and set it as the index:
>>> df1['date'] = pd.to_datetime(df1['date'])
>>> df1 = df1.set_index('date')
>>> df1
            growth %
date
2018-10-01       3.1
2018-10-03       5.5
>>> df2['date'] = pd.to_datetime(df2['date'])
>>> df2 = df2.set_index('date')
>>> df2
            items
date
2018-10-01    100
2018-10-02    200
Concatenation now works as expected:
>>> pd.concat([df1, df2], axis=1)
            growth %  items
date
2018-10-01       3.1  100.0
2018-10-02       NaN  200.0
2018-10-03       5.5    NaN
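Applied to the question's code, the same idea might look like the sketch below. The two small frames are stand-ins for the list returned by `rename()`, with values copied from the sample data in the question, and the helper name `combine_by_date` is hypothetical rather than part of the original code.

```python
import pandas as pd


def combine_by_date(frames, date_col="Release Date"):
    """Index every frame by its date column, then concatenate column-wise."""
    indexed = [frame.set_index(date_col) for frame in frames]
    # With a date index, concat aligns rows on the dates instead of on 0, 1, 2, ...
    return pd.concat(indexed, axis=1, join="outer").sort_index()


# Stand-ins for the frames built by rename(); values copied from the question.
activity = pd.DataFrame({
    "Release Date": pd.to_datetime(["2018-10-25 21:00:00+02:00",
                                    "2018-09-26 21:00:00+02:00"]),
    "Argentina Economic Activity YoY": ["-1.6%", "-2.7%"],
})
gdp = pd.DataFrame({
    "Release Date": pd.to_datetime(["2018-09-19 22:00:00+02:00",
                                    "2018-06-19 21:00:00+02:00"]),
    "Argentina Gross Domestic Product (GDP) YoY": ["-4.2%", "3.6%"],
})

combined = combine_by_date([activity, gdp])
print(combined)
```

In `final_df`, this would take the place of concatenating first and calling `set_index('Release Date')` afterwards. Each release date then gets its own row, with NaN in the columns that have no value for that date.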
You don't actually need to convert the date column to datetime. It works with strings as well:
>>> df1 = pd.DataFrame(...)
>>> df2 = pd.DataFrame(...)
>>> pd.concat([df1.set_index('date'), df2.set_index('date')], axis=1)
            growth %  items
2018-10-01       3.1  100.0
2018-10-02       NaN  200.0
2018-10-03       5.5    NaN
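All that is needed is for each data frame to be indexed by date. A datetime index, however, also allows you to slice and resample the time series.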
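As a rough illustration of slicing and resampling on a datetime index, here is a minimal sketch that rebuilds `df1` from the example so that it runs on its own. The variable names `recent` and `daily` are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame([['2018-10-01', 3.1], ['2018-10-03', 5.5]],
                   columns=['date', 'growth %'])
df1['date'] = pd.to_datetime(df1['date'])
df1 = df1.set_index('date')

# Label-based slicing: rows from 2018-10-02 onward, given as a date string.
recent = df1.loc['2018-10-02':]

# Resampling: one row per calendar day, NaN where there was no record.
daily = df1.resample('D').mean()

print(recent)
print(daily)
```

Neither operation is available in this form when the index holds plain date strings, which is the practical reason to convert with `pd.to_datetime` first.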