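python - Decoding URL content with Python
Problem description
There is a website that contains Russian translations of the Greek and Hebrew Bible. Here is the Python 2 code used to fetch the content of a URL: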
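# -*- coding: utf-8 -*-
import io, urllib
# Output file, written as UTF-8 text
f1 = io.open('url.txt','w',encoding='utf8')
#link = "http://manuscript-bible.ru/OT/Gen1.htm"
link = "http://manuscript-bible.ru/S/H/psa117.htm"
# Python 2: urllib.urlopen returns a file-like object yielding raw bytes
f2 = urllib.urlopen(link)
# Decode the raw bytes to unicode before writing
myfile = f2.read().decode("utf-8")
f1.write(myfile)
f1.close()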
So for the page with Hebrew and Russian, f2.read().decode("utf-8") works, and the content of http://manuscript-bible.ru/S/H/psa117.htm can be retrieved, for example:
>1</strong>
<a target="_blank" href="../S/h19.htm#1984" title="hалэлу">הַֽלְלוּ</a>
<a target="_blank" href="../S/h08.htm#853" title="ʼэт">אֶת</a>
For Russian, myfile = f2.read().decode('cp1251') also works for getting the content of http://manuscript-bible.ru/RSV/22_116.htm, for example this sentence:
<b>1</b> Хвалите
<01984> Господа
<03068>, все народы
<01471>, прославляйте
<07623> Его, все племена
<0523>;
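The two working cases above differ only in the codec passed to decode. A minimal sketch of trying candidate encodings in order is shown here; it assumes Python 3, and the helper name decode_with_fallback and the default candidate list are hypothetical, not something the site or the original code defines. It takes bytes that have already been downloaded, so no network access is involved.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1251")):
    """Return (text, encoding) for the first candidate that decodes raw cleanly.

    utf-8 is tried first because it is strict: cp1251-encoded Cyrillic
    is almost never valid utf-8, so a wrong guess fails fast.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Simulated page bodies (hypothetical stand-ins for f2.read()).
russian_bytes = "Хвалите".encode("cp1251")
hebrew_bytes = "אֶת".encode("utf-8")

print(decode_with_fallback(russian_bytes))
print(decode_with_fallback(hebrew_bytes))
```

A successful decode only means the bytes are valid in that codec, not that the guess is right, so the result is still worth checking by eye.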
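The problem is with this URL, which cannot be decoded: http://manuscript-bible.ru/OT/Ps116.htm. Inspecting the page source, there appear to be two Gen1.htm files with the same .htm extension, one of which contains something like: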
anot=0;bn=22;cn=116;variants="";cr=new Array();parsing="";a=new Array("","","","","","","","
<a name=116>","","","","
<br>","","","","
<br>","CALMOI","calmoi","6163","\u041F\u0421\u0410\u041B\u041C\u042B","\u0413\u043B\u0430\u0432\u0430 116","","","","","","","
<br>","","","","
<br>","","","","","1","","","","Allhlouia.","allhlouia","6C61","\u0410\u043B\u043B\u0438\u043B\u0443\u0439\u044F.","A\u042Ene\u042Dte","a\u042Ene\u042Dte","6961","\u0425\u0432\u0430\u043B\u0438́\u0442\u0435","t\u0442n","t\u0441n","6F74","-","k\u0436rion,","k\u0436rion","756B","\u0413\u043É\u0441\u043F\u043E\u0434\u0430,"
This looks like utf-16 encoding, because pasting it into https://www.branah.com/unicode-converter gives
anot=0;bn=22;cn=116;variants="";cr=new Array();parsing="";a=new
Array("","","","","","","","
<a name=116>","","","","
<br>","","","","
<br>","CALMOI","calmoi","6163","ПСАЛМЫ","Глава 116","","","","","","","
<br>","","","","
<br>","","","","","1","","","","Allhlouia.","allhlouia","6C61","Аллилуйя.","AЮneЭte","aЮneЭte","6961","Хвали́те","tтn","tсn","6F74","-","kжrion,","kжrion","756B","Го́спода,"
The question is: how can Python code like the above be used to get the decoded content of the URL http://manuscript-bible.ru/OT/Ps116.htm?
Solution
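One possible approach, offered as a sketch rather than a confirmed answer: the sequences in the sample above look like JavaScript-style \uXXXX escapes embedded in otherwise plain text, so they can be replaced one by one after the page has been fetched and decoded to a string. The code below assumes Python 3 (Python 2 would need unichr instead of chr); the helper name decode_u_escapes is hypothetical. It only handles four-digit escapes for characters in the Basic Multilingual Plane, which covers the Cyrillic in the sample, and it does not combine surrogate pairs.

```python
import re

# Matches a literal backslash, the letter u, and exactly four hex digits.
_U_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def decode_u_escapes(text):
    """Replace each literal backslash-u escape with the character it names.

    Everything else, including text that is already non-ASCII,
    is left untouched.
    """
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


# Fragment copied from the sample in the question.
sample = r'"\u041F\u0421\u0410\u041B\u041C\u042B","\u0413\u043B\u0430\u0432\u0430 116"'
print(decode_u_escapes(sample))  # "ПСАЛМЫ","Глава 116"
```

A regular expression is used here instead of the unicode_escape codec because that codec treats the non-escaped bytes as Latin-1, which would corrupt any text that is already decoded Cyrillic.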