Getting text from my website pages: Python script

Problem description

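I have a script that used to work; it has been about a year since I last used it. Now it fails with an error that I don't know how to fix. I would also like to improve the code so that I no longer have to list every page by hand and can instead fetch everything under the domain.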

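I previously tried installing Beautiful Soup, but for some reason it did not work for me: the install completed, but I could not get Spyder/Anaconda to recognize that the library was there.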

This is the error I get:

runfile('F:/CRM/CRM/translations/Python script for text from website pages.py', wdir='F:/CRM/CRM/translations')
Traceback (most recent call last):

  File "<ipython-input-13-2f567a94e1f6>", line 1, in <module>
    runfile('F:/CRM/CRM/translations/Python script for text from website pages.py', wdir='F:/CRM/CRM/translations')

  File "C:\Users\Gittel\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 786, in runfile
    execfile(filename, namespace)

  File "C:\Users\Gittel\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "F:/CRM/CRM/translations/Python script for text from website pages.py", line 30, in <module>
    file.write(text)

  File "C:\Users\Gittel\AppData\Local\Continuum\anaconda3\lib\encodings\cp1255.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]

UnicodeEncode
This is the script:

import urllib.request
from inscriptis import get_text
sitelist = ["https://grapaes.com",  # pages to download, listed by hand
            "https://grapaes.com/events/past-events",
            "https://grapaes.com/about-us-our-story",
            "https://grapaes.com/about-us-our-story/story",
            "https://grapaes.com/worldwide",
            "https://grapaes.com/varieties/arra-branding",
            "https://grapaes.com/press",
            "https://grapaes.com/press/media",
            "https://grapaes.com/press/newsletters",
            "https://grapaes.com/about-us-our-story/team",
            "https://grapaes.com/varieties",
            "https://grapaes.com/events",
            "https://grapaes.com/varieties/varieties-red-varieties",
            "https://grapaes.com/varieties/varieties-black-varieties",
            "https://grapaes.com/varieties/varieties-white-varieties",
            "https://grapaes.com/partners",
            ]
i = 0  # overwritten by the loop variable below
n = 0  # counts the pages written
length = len(sitelist)  # not used afterwards
for i in sitelist:
    url = i
    html = urllib.request.urlopen(url).read().decode('utf-8')  # download the page
    text = get_text(html)  # convert the HTML to plain text
    name = i.replace("/", ".")  # slashes are not allowed in file names
    name1 = name.replace("https:..grapaes.com.", "site - ")
    file = open(name1 + ".doc", "w")  # no encoding given, so the locale default is used
    file.write(text)  # line 30 of the script, where the traceback points
    file.close()
    n = n + 1

Tags: python, web

Solution
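
The traceback ends inside encodings\cp1255.py, called from file.write(text). That suggests the file was opened with the Windows locale encoding (cp1255) and the page text contains a character that code page cannot represent. One likely fix, offered as a sketch rather than a confirmed diagnosis, is to pass an explicit encoding to open(). The helper name save_text, the sample string, and the output path below are made up for illustration.

```python
import os
import tempfile


def save_text(path, text):
    """Write text to path as UTF-8, independent of the system locale."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


# Hypothetical sample text; the accented and CJK characters are not in cp1255.
sample = "ARRA grapes \u2013 caf\u00e9 \u4e2d\u6587"

# Reproduce the failure from the traceback without writing any file.
try:
    sample.encode("cp1255")
    cp1255_ok = True
except UnicodeEncodeError:
    cp1255_ok = False

# Writing the same text as UTF-8 succeeds and reads back unchanged.
out_path = os.path.join(tempfile.gettempdir(), "site - demo.txt")
save_text(out_path, sample)
with open(out_path, encoding="utf-8") as f:
    round_trip = f.read()
```

In the script from the question, the equivalent change would be open(name1 + ".doc", "w", encoding="utf-8"). Note that the output is plain text, so a .txt extension may be a better fit than .doc.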

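For the second request, fetching everything under the domain without listing each page, one possible approach is a small breadth-first crawl that follows only links on the same host. The sketch below uses only the standard library, so it does not depend on Beautiful Soup. The names LinkParser, fetch, and crawl, and the max_pages limit, are assumptions introduced here; it also assumes every page is linked from another page and is served as UTF-8.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
import urllib.request


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page and decode it as UTF-8 (an assumption about the site)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=200):
    """Return {url: html} for pages reachable from start_url on the same host."""
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to download
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links, drop #fragments and trailing slashes.
            link = urldefrag(urljoin(url, href)).url.rstrip("/")
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# Example call (performs real network requests, so it is left commented out):
# pages = crawl("https://grapaes.com")
```

Each value in the returned dictionary could then be passed to get_text and written to a file in place of the hand-written sitelist loop. The crawl does not filter by content type, so links to PDFs or images on the same host would also be downloaded; filtering on the file extension would be a reasonable refinement.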

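On Beautiful Soup not being recognized: a common cause, though not confirmed for this setup, is that the package was installed into a different Python environment than the one the Spyder console runs. The package is installed as beautifulsoup4 but imported as bs4. The check below can be run inside the Spyder console to see which interpreter is active and whether it can find the module; the helper names module_available and pip_command are made up for illustration.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if the running interpreter can locate the module `name`."""
    return importlib.util.find_spec(name) is not None


def pip_command(package):
    """Build a pip command bound to the interpreter that is running right now."""
    return [sys.executable, "-m", "pip", "install", package]


print("Interpreter:", sys.executable)
print("bs4 importable:", module_available("bs4"))
print("Install with:", " ".join(pip_command("beautifulsoup4")))
```

If bs4 is reported as not importable, running the printed command from a terminal installs the package into that same interpreter. In an Anaconda setup, conda install beautifulsoup4 in the active environment is an alternative.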