Python: scraping a .txt file from behind a login page using Twill

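Problem description

I am using Twill to retrieve pages containing the .txt data I need, so that I can store them as Excel files. The data is password-protected, so I log in from the /user/login page.

My code runs into a problem: when it tries to access the text page after the login screen, it hits a brick wall of HTML instead of the .txt itself.

When I run the login: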

path = "https://naturalgasintel.com/ext/resources/Data-Feed/Daily-GPI/"
end = "td.txt"

go("http://www.naturalgasintel.com/user/login")
showforms()
fv("2", "user[email]", user_email)
fv("2", "user[password]", user_password)
fv("2", "commit", "Login")

datafilelocation = path + year + "/" + month + "/" + date + end
go(datafilelocation)

When my code reaches go(datafilelocation), I get this:

==> at https://www.naturalgasintel.com/user/login?referer=%2Fext%2Fresources%2FData-Feed%2FDaily-GPI%2F2018%2F12%2F20181221td.txt
Out[18]: u'https://www.naturalgasintel.com/user/login?referer=%2Fext%2Fresources%2FData-Feed%2FDaily-GPI%2F2018%2F12%2F20181221td.txt'

So it ends up at the login page with the referer set, rather than at the actual text, when the page I really want is:

https://naturalgasintel.com/ext/resources/Data-Feed/Daily-GPI/2018/12/20181221td.txt

The reason I use fv("2", "commit", "Login") instead of submit() is that when I reach the page, showforms() gives me this:

showforms()

Form name=quick-search (#1)
## ## __Name__________________ __Type___ __ID________ __Value__________________
1     q                        text      q            Search 


Form #2
## ## __Name__________________ __Type___ __ID________ __Value__________________
1     utf8                     hidden    (None)       ✓ 
2     authenticity_token       hidden    (None)       pnFnPGhMomX2Lyh7/U8iGOZKsiQnyicj7BWT ... 
3     referer                  hidden    (None)       https://www.naturalgasintel.com/ext/ ... 
4     popup                    hidden    (None)       false 
5     user[email]              text      user_email    
6     user[password]           password  user_pas ... 
7     user[remember_me]        hidden    (None)       0 
8     user[remember_me]        checkbox  user_rem ... None 
9     commit                   submit    (None)       Login 

and then, after I call submit(), it tells me:

Note: submit is using submit button: name="commit", value="Login"

What is the best way to solve this?

Tags: python, python-2.7, login, twill

Solution

If you can use Mechanize instead of Twill, try this:

import mechanize

# Fill in your own credentials.
username = ""
password = ""
login_post_url = "http://www.naturalgasintel.com/user/login"
internal_url = "https://naturalgasintel.com/ext/resources/Data-Feed/Daily-GPI/2018/12/20181221td.txt"

browser = mechanize.Browser()
browser.open(login_post_url)
# Forms are zero-indexed: nr=1 selects the second form on the page (the login form).
browser.select_form(nr=1)
browser.form['user[email]'] = username
browser.form['user[password]'] = password
browser.submit()

# The browser object keeps the session cookies, so the protected file can be requested next.
response = browser.open(internal_url)
print(response.read())
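
The question builds the file URL from year, month, and date strings, and the symptom of a failed login is a redirect back to /user/login. As a rough sketch of how both could be handled, the helpers below build the URL from a date and check whether a final URL is the login page. The names data_file_url and landed_on_login are hypothetical, not part of Twill or Mechanize, and the URL layout is assumed from the single example in the question.

```python
from datetime import date

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2.7

# Base path and suffix copied from the question.
BASE_PATH = "https://naturalgasintel.com/ext/resources/Data-Feed/Daily-GPI/"
SUFFIX = "td.txt"


def data_file_url(day):
    """Build the daily file URL for a datetime.date (hypothetical helper).

    Assumes the layout <base>/YYYY/MM/YYYYMMDDtd.txt seen in the question.
    """
    return BASE_PATH + day.strftime("%Y/%m/%Y%m%d") + SUFFIX


def landed_on_login(final_url):
    """Return True if the URL's path is the login page (hypothetical helper)."""
    return urlparse(final_url).path.rstrip("/") == "/user/login"
```

With the Mechanize code above, one possible use is to open data_file_url(date(2018, 12, 21)) and pass response.geturl() to landed_on_login before reading the body, so that a failed login is caught early instead of the login page's HTML being saved as data.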

