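python-3.7 - Need help saving a web page to a file and then comparing the file with the web page to determine whether anything has changed
Problem description
Goal: in Python 3.7, I want to copy a web page to a file and then periodically compare that file (the copied web page) with the live web page to see whether anything has changed.
My code for creating the copy of the web page (SEC_old.txt) works. When I open the file in Notepad++, it shows a perfectly formatted HTML page. The Notepad++ "Encoding" menu lists the file as "Encode in UTF-8". Here is my code: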
# CopySEC.py
import urllib.request

## Read web page contents into webPageCopy variable.
url = 'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0001072379&owner=include&count=40'
response = urllib.request.urlopen(url)
webPageCopy = response.read()

## Open the output file and write the contents of the
## webPageCopy variable to it; the with block closes the file.
with open("SEC_old.txt", "wb") as SEC_copy_bytes:
    SEC_copy_bytes.write(webPageCopy)
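Next is my simple program, Compare_SEC, which (1) copies the web page into a variable (as in the example above), (2) opens SEC_old.txt (the copy of the web page) and reads it into another variable, and (3) compares the two to determine whether anything has changed. This program does not seem to work. The problem: the program does not evaluate the two variables as equal. They should be. Also, I can print the webPageCopy variable (using print()), but when I try to do the same with the copy variable (i.e. print(SEC_copy)), I get the error <_io.BufferedReader name='SEC_old.txt'> and the contents are not printed.
Here is the code of the comparison program: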
# Compare_SEC.py
import urllib.request
import pickle
## Read web page contents into webPageCopy variable.
url = 'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0001072379&owner=include&count=40'
#Place SEC website in variable, webPageCopy and print it to console
response = urllib.request.urlopen(url)
webPageCopy = response.read()
print(webPageCopy)
#open file and write contents (old web page) to variable, SEC-copy and print SEC-copy to console
SEC_copy = open("SEC_old.txt","rb")
print(SEC_copy)
#compare variables containing the old copy of the webpage (SEC-copy) to the current web page (webPageCopy) for differences
if (webPageCopy) != SEC_copy:
    mesg="SEC website not equal to old copy! New SEC filings!!!!!!!"
else:
    mesg="SEC website is equal to old copy. No new SEC filings"
print(mesg)
Here is the resulting output (minus the printout of the web page):
<_io.BufferedReader name='SEC_old.txt'>
SEC website not equal to old copy! New SEC filings!!!!!!!
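Any help solving this would be greatly appreciated. Once again, the problem is that the program does not evaluate the two variables, SEC_copy and webPageCopy, as equal. They should be. Also, I can print the webPageCopy variable (using print()), but when I try to do the same with the copy variable (i.e. print(SEC_copy)), I get the error <_io.BufferedReader name='SEC_old.txt'> and the contents are not printed.
Thanks in advance. I hope I have stated the problem clearly.
To explore the problem further, I created a program that simply reads the same file into two different variables and compares them. They were not equivalent! Why? Here's the code: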
# Readfile.py
import urllib.request
import pickle
SEC_copy = open("SEC_old3.txt","rb")
SEC_copy2 = open("SEC_old3.txt","rb")
if SEC_copy != SEC_copy2:
    print("files are not equivalent")
else:
    print("files are equal")
Here is the output:
RESTART: C:/Users/Office/AppData/Local/Programs/Python/Python37-32/readfile.py
files are not equivalent
So why are the two variables not equal when they should have the same contents?
Solution
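The behaviour described in the question can be reproduced without any network access. `open()` returns a file object, not the file's contents, so printing it shows its representation (`<_io.BufferedReader name=...>`), which is not an error, and comparing it with bytes, or with another file object, never tests the contents. The following is a minimal sketch of that point; the scratch file `demo.txt` and the variable names are hypothetical and exist only for this illustration.

```python
import os
import tempfile

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "wb") as f:
    f.write(b"<html>same contents</html>")

# Two separate open() calls give two distinct file objects.
a = open(path, "rb")
b = open(path, "rb")

# File objects are compared by identity, so this is False
# even though both handles point at the same file.
objects_equal = (a == b)

# read() turns each handle into bytes, and bytes compare by value.
contents_equal = (a.read() == b.read())

a.close()
b.close()

print(objects_equal, contents_equal)  # prints: False True
```

In other words, the comparison only becomes meaningful once `.read()` has been called on the file object.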
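OK. I found a workaround. I rewrote my program to write the new web page to a file (just like the old copy). Then I read both files back using a different method. It worked. Of course, I have to write the current web page to a file and read it back in (which is redundant... but I could not compare the file with the current web page contents unless I wrote them to a file and read them back). Here's the code: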
# SEC_CHK.py
import urllib.request
import datetime

now = datetime.datetime.now()
# todaysdate and timestamp were used but never defined in the original;
# the formats below are an assumption.
todaysdate = now.strftime("%Y-%m-%d")
timestamp = now.strftime("%H:%M:%S")
activity_msg = "Default Message"
programdesc = "SEC Check"
programname = "SEC_CHK.py"

## Read web page contents into webPageCopy variable.
url = 'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0001072379&owner=include&count=40'
response = urllib.request.urlopen(url)
webPageCopy = response.read()

## Write the new web page contents to the output file, SEC_new.txt.
## The with block closes the file (the original wrote close without
## parentheses, which never actually calls it).
with open("SEC_new.txt", "wb") as f:
    f.write(webPageCopy)

# Open the new web page file and read its contents into SEC_new.
# The encoding is an assumption based on what Notepad++ reported.
with open("SEC_new.txt", "r", encoding="utf-8") as f:
    SEC_new = f.read()

# Open the old web page file and read its contents into SEC_old.
with open("SEC_old.txt", "r", encoding="utf-8") as g:
    SEC_old = g.read()

# Compare the old copy of the web page (SEC_old) with the current web page (SEC_new).
with open("prlog.txt", "a") as logfile:
    if SEC_new != SEC_old:
        activity_msg = "!!!!!!!!!New NWBO SEC Filing(s) Found!!!!!!!"
        # Replace the old copy with the current page.
        with open("SEC_old.txt", "wb") as old_file:
            old_file.write(webPageCopy)
        logfile.write(todaysdate + " @ " + "SEC CHECK " + timestamp + ": " + "Old SEC file updated." "\n")
    else:
        activity_msg = "SEC website is equal to old copy. No new SEC filings"
    print(activity_msg)
    logfile.write(programdesc + "(" + programname + "): " + activity_msg + "\n")
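The round trip through SEC_new.txt may not be necessary. `response.read()` already returns bytes, and a file opened in "rb" mode also yields bytes once `.read()` is called, so the two can be compared directly. The following is a rough sketch under that assumption, not the author's method; `page_changed` and `update_copy` are hypothetical helper names introduced here.

```python
import os

def page_changed(new_bytes, old_path):
    """Return True if new_bytes differs from the bytes stored at old_path."""
    if not os.path.exists(old_path):
        return True  # no earlier copy yet: treat the page as changed
    with open(old_path, "rb") as f:
        # f.read() returns bytes, so this compares contents, not objects.
        return f.read() != new_bytes

def update_copy(new_bytes, old_path):
    """Overwrite the stored copy with the current page bytes."""
    with open(old_path, "wb") as f:
        f.write(new_bytes)

# Intended use with the variables from the script above (not run here):
#   webPageCopy = urllib.request.urlopen(url).read()
#   if page_changed(webPageCopy, "SEC_old.txt"):
#       update_copy(webPageCopy, "SEC_old.txt")
```

A byte-for-byte comparison reports any difference at all, so a page that embeds dynamic content may register as changed even when no new filing has appeared.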