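python - Getting text from my website's pages: Python script
Problem description
I have a script that used to work; it has been about a year since I last used it. The problem is that I now get an error that I don't know how to fix. I would also like a way to improve this code so that I no longer have to list every page and can instead just cover everything under the domain.
I previously tried installing Beautiful Soup, but for some reason it didn't work for me. I installed it, but could not get Spyder/Anaconda to recognize that the library was there.
This is the error I get: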
runfile('F:/CRM/CRM/translations/Python script for text from website pages.py', wdir='F:/CRM/CRM/translations')
Traceback (most recent call last):
  File "<ipython-input-13-2f567a94e1f6>", line 1, in <module>
    runfile('F:/CRM/CRM/translations/Python script for text from website pages.py', wdir='F:/CRM/CRM/translations')
  File "C:\Users\Gittel\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 786, in runfile
    execfile(filename, namespace)
  File "C:\Users\Gittel\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "F:/CRM/CRM/translations/Python script for text from website pages.py", line 30, in <module>
    file.write(text)
  File "C:\Users\Gittel\AppData\Local\Continuum\anaconda3\lib\encodings\cp1255.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncode
This is the script:
import urllib.request
from inscriptis import get_text
sitelist = ["https://grapaes.com",
"https://grapaes.com/events/past-events",
"https://grapaes.com/about-us-our-story",
"https://grapaes.com/about-us-our-story/story",
"https://grapaes.com/worldwide",
"https://grapaes.com/varieties/arra-branding",
"https://grapaes.com/press",
"https://grapaes.com/press/media",
"https://grapaes.com/press/newsletters",
"https://grapaes.com/about-us-our-story/team",
"https://grapaes.com/varieties",
"https://grapaes.com/events",
"https://grapaes.com/varieties/varieties-red-varieties",
"https://grapaes.com/varieties/varieties-black-varieties",
"https://grapaes.com/varieties/varieties-white-varieties",
"https://grapaes.com/partners",
]
for url in sitelist:
    # Download the page and decode the bytes as UTF-8
    html = urllib.request.urlopen(url).read().decode('utf-8')
    text = get_text(html)
    # Turn the URL into a file name, e.g. "site - events.past-events"
    name = url.replace("/", ".")
    name1 = name.replace("https:..grapaes.com.", "site - ")
    file = open(name1 + ".doc", "w")
    file.write(text)  # the traceback above points at this call
    file.close()
Solution
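The traceback stops inside encodings\cp1255.py while running file.write(text), and the cut-off last line looks like a UnicodeEncodeError. That pattern usually means the file was opened with the Windows default codec (cp1255 here) and the page text contains a character that codec cannot represent. Below is a minimal sketch, not a definitive implementation, that opens the output file with encoding="utf-8" and also follows links on the same host so the pages need not be listed by hand. It uses only the standard library for link discovery, so Beautiful Soup is not required. The names LinkCollector, same_site_links, file_name_for, save_text and crawl, the "home" file name for the start page, and the max_pages limit are all assumptions made for illustration.

```python
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value:
                    self.hrefs.append(value)


def same_site_links(html, page_url):
    """Return absolute, fragment-free links that stay on page_url's host."""
    parser = LinkCollector()
    parser.feed(html)
    host = urlparse(page_url).netloc
    links = set()
    for href in parser.hrefs:
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlparse(absolute)
        if parts.scheme in ("http", "https") and parts.netloc == host:
            links.add(absolute)
    return links


def file_name_for(url):
    """Hypothetical naming helper that mirrors the 'site - ...' scheme."""
    path = urlparse(url).path.strip("/")
    return "site - " + (path.replace("/", ".") or "home") + ".doc"


def save_text(path, text):
    # encoding="utf-8" replaces the locale default (cp1255 in the traceback)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def crawl(start_url, to_text, max_pages=50):
    """Breadth-first walk over same-host pages, saving each page's text."""
    seen = {start_url}
    queue = deque([start_url])
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            raw = urllib.request.urlopen(url).read()
        except OSError:  # URLError and HTTPError are OSError subclasses
            continue
        fetched += 1
        html = raw.decode("utf-8", errors="replace")
        save_text(file_name_for(url), to_text(html))
        for link in same_site_links(html, url):
            if link not in seen:
                seen.add(link)
                queue.append(link)


# Example call (performs network requests, so it is left commented out):
# from inscriptis import get_text
# crawl("https://grapaes.com", get_text)
```

If only the error needs fixing, the one-line change is to pass encoding="utf-8" to open() in the original loop. The crawl sketch only sees pages reachable through ordinary <a href> links, treats every response as HTML, and does not check robots.txt, so it may need filtering for links to PDFs or images.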