Need help writing a web page to a file, then comparing the file with the live web page to determine whether anything has changed

Problem description

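Goal: In Python 3.7, I want to copy a web page to a file and then periodically compare that file (the copied web page) with the live web page to see whether anything has changed.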

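My code for creating the copy of the web page (SEC_old.txt) works. When I open the file in Notepad++, it shows a perfectly formatted HTML page, and the Notepad++ "Encoding" menu lists the file as "UTF-8". Here is my code: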

# CopySEC.py
import urllib.request
import pickle
## Read web page contents into webPageCopy variable.
url = 'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0001072379&owner=include&count=40'
response = urllib.request.urlopen(url)
webPageCopy = response.read()
## Open the output file and write the contents of the
## webPageCopy variable to it. The with block closes the file afterwards.
with open("SEC_old.txt","wb") as SEC_copy_bytes:
    SEC_copy_bytes.write(webPageCopy)

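Next is my simple program, Compare_SEC, which (1) copies the web page into a variable (as in the example above), (2) opens SEC_old.txt (the copy of the web page) and reads it into another variable, and (3) compares the two to determine whether anything has changed. This program does not seem to work. The problem: the program does not evaluate the two variables as equal, although they should be. Also, I can print the webPageCopy variable with print(), but when I try the same with the copy variable, print(SEC_copy), I get <_io.BufferedReader name='SEC_old.txt'> and the contents do not print.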

Here is the code for the comparison program:

# Compare_SEC.py
import urllib.request
import pickle
## Read web page contents into webPageCopy variable.
url = 'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0001072379&owner=include&count=40'

#Place SEC website in variable, webPageCopy and print it to console
response = urllib.request.urlopen(url)
webPageCopy = response.read()
print(webPageCopy)

#open file and write contents (old web page) to variable, SEC_copy, and print SEC_copy to console
SEC_copy = open("SEC_old.txt","rb")
print(SEC_copy)

#compare variables containing the old copy of the webpage (SEC_copy) to the current web page (webPageCopy) for differences
if (webPageCopy) != SEC_copy:
    mesg="SEC website not equal to old copy! New SEC filings!!!!!!!"
else:
    mesg="SEC website is equal to old copy.  No new SEC filings"
print(mesg)

Here is the resulting output (minus the printed web page):

<_io.BufferedReader name='SEC_old.txt'>
SEC website not equal to old copy! New SEC filings!!!!!!!

Any help solving this would be greatly appreciated. To restate the problem: the program does not evaluate the two variables, SEC_copy and webPageCopy, as equal, although they should be, and print(SEC_copy) shows <_io.BufferedReader name='SEC_old.txt'> rather than the file contents.

Thanks in advance. I hope I have stated the problem clearly.

To explore the problem further, I wrote a program that simply reads the same file into two different variables and compares the variables. They were not equivalent! Why? Here's the code:

# Readfile.py
import urllib.request
import pickle

SEC_copy = open("SEC_old3.txt","rb")
SEC_copy2 = open("SEC_old3.txt","rb")

if SEC_copy != SEC_copy2:
    print("files are not equivalent")
else:
    print("files are equal")

Here is the output:

 RESTART: C:/Users/Office/AppData/Local/Programs/Python/Python37-32/readfile.py 
files are not equivalent

So why are the two variables not equal when they should have the same contents?

Tags: python-3.7

Solution
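
A note on the likely cause, offered as a sketch rather than a definitive diagnosis: open() returns a file object, not the file's contents. Printing that object shows its representation (the <_io.BufferedReader ...> text, which is not an error). Two file objects compare equal only when they are the same object, so the results of two separate open() calls never compare equal, and a file object never equals a bytes value. Calling .read() returns the contents as bytes, which do compare by value. The file name demo_page.txt and its contents below are invented for illustration.

```python
import os
import tempfile

# Hypothetical stand-in for a saved web page; name and contents are invented.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo_page.txt")
page_bytes = b"<html><body>filings</body></html>"
with open(demo_path, "wb") as out:
    out.write(page_bytes)

with open(demo_path, "rb") as first, open(demo_path, "rb") as second:
    # Two separate file objects are compared by identity, so they differ.
    objects_equal = (first == second)
    # A file object is not equal to a bytes value either.
    object_equals_bytes = (first == page_bytes)
    # read() returns the contents as bytes, which compare by value.
    first_contents = first.read()
    second_contents = second.read()

contents_equal = (first_contents == second_contents)
contents_match_page = (first_contents == page_bytes)

print(objects_equal, object_equals_bytes, contents_equal, contents_match_page)
```

Only the last two comparisons look at the data itself, which is why comparing the results of .read() behaves as expected while comparing the file objects does not.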


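OK, I found a workaround. I rewrote my program to write the new web page to a file (just like the old copy), and then I used a different method to read both files. It works. Of course, this means writing the current web page to a file and reading it back again, which is redundant, but I could not get the file to compare with the current web page contents unless I wrote it to a file and read it back. Here is the code: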

# SEC_CHK.py
import urllib.request
import datetime

now = datetime.datetime.now()
# Date and time strings for the log. The formats are an assumption:
# the original code used these names without defining them.
todaysdate = now.strftime("%Y-%m-%d")
timestamp = now.strftime("%H:%M:%S")
activity_msg="Default Message"
programdesc="SEC Check"
programname="SEC_CHK.py"

## Read web page contents into webPageCopy variable.
url = 'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0001072379&owner=include&count=40'
response = urllib.request.urlopen(url)
webPageCopy = response.read()

## Write the new web page contents to the output file, SEC_new.txt.
## Each with block closes its file on exit (the original wrote .close
## without parentheses, which never actually closed the file).
with open("SEC_new.txt","wb") as f:
    f.write(webPageCopy)

# Open the new web page file and read its contents into the variable SEC_new.
with open("SEC_new.txt","r") as f:
    SEC_new = f.read()

# Open the old web page file and read its contents into the variable SEC_old.
with open("SEC_old.txt","r") as g:
    SEC_old = g.read()

# Compare the old copy of the web page (SEC_old) with the current web page (SEC_new).
with open("prlog.txt","a") as logfile:
    if SEC_new != SEC_old:
        activity_msg="!!!!!!!!!New NWBO SEC Filing(s) Found!!!!!!!"
        # Replace the old copy with the current page for the next check.
        with open("SEC_old.txt","wb") as f:
            f.write(webPageCopy)
        logfile.write(todaysdate + " @ " + "SEC CHECK  " + timestamp + ":   " + "Old SEC file updated." + "\n")
    else:
        activity_msg="SEC website is equal to old copy. No new SEC filings"
    print(activity_msg)
    logfile.write(programdesc + "(" + programname + "): " + activity_msg + "\n")
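
The round trip through SEC_new.txt may not be necessary. A minimal sketch, assuming the goal is only to detect byte-for-byte changes: response.read() already returns bytes, so it can be compared directly with the bytes read from the old file opened in "rb" mode. The helper name page_changed and the demo file name are hypothetical, and the download is replaced by literal bytes so that the sketch runs offline.

```python
import os
import tempfile


def page_changed(new_bytes, old_path):
    """Return True if new_bytes differs from the file at old_path.

    When a difference is found (or no old copy exists yet), the file
    is overwritten with new_bytes so the next check uses the new copy.
    """
    try:
        with open(old_path, "rb") as old_file:
            old_bytes = old_file.read()
    except FileNotFoundError:
        old_bytes = None  # no previous copy has been saved

    if new_bytes == old_bytes:
        return False

    with open(old_path, "wb") as old_file:
        old_file.write(new_bytes)
    return True


# Offline demonstration with invented contents; in the real program,
# new_bytes would be urllib.request.urlopen(url).read().
demo_path = os.path.join(tempfile.mkdtemp(), "SEC_old_demo.txt")
first_check = page_changed(b"<html>filing A</html>", demo_path)
second_check = page_changed(b"<html>filing A</html>", demo_path)
third_check = page_changed(b"<html>filing A, filing B</html>", demo_path)
print(first_check, second_check, third_check)
```

If the server embeds content that varies between requests, a byte-for-byte check would report a change on every run; whether this particular page does so is not established here.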
