Unable to fetch a dynamically populated number from a webpage after performing several steps

Problem description

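I've created a script using the requests module and the BeautifulSoup library to fetch some tabular content from a webpage. To generate the table, you have to manually follow the steps shown in the attached image. The code pasted below works, but the main problem I'm trying to solve is getting the title number programmatically. In this case it is 628086906, and it is appended to the table_link that I've hardcoded here.

After clicking the tool button in step 6, hovering the cursor over the map shows an option called Multiple. Clicking that option reveals the URL containing the title number.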


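These are exactly the steps the script follows.

0030278592 is the LINC number to enter in the input box in step 6.

I've tried the following (it works, because I've used the hardcoded title number in table_link):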

import requests
from bs4 import BeautifulSoup

link = 'https://alta.registries.gov.ab.ca/spinii/logon.aspx'
lnotice = 'https://alta.registries.gov.ab.ca/spinii/legalnotice.aspx'
search_page = 'https://alta.registries.gov.ab.ca/SpinII/SearchSelectType.aspx'
map_page = 'http://alta.registries.gov.ab.ca/SpinII/mapindex.aspx'
map_find = 'http://alta.registries.gov.ab.ca/SpinII/mapfinds.aspx'
table_link = 'https://alta.registries.gov.ab.ca/SpinII/popupTitleSearch.aspx?title=628086906'

def get_content(s,link):   
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    payload = {i['name']:i.get('value','') for i in soup.select('input[name]')}
    payload['uctrlLogon:cmdLogonGuest.x'] = '80'
    payload['uctrlLogon:cmdLogonGuest.y'] = '20'

    r = s.post(link,data=payload)
    soup = BeautifulSoup(r.text,"lxml")
    payload = {i['name']:i.get('value','') for i in soup.select('input[name]')}
    payload['cmdYES.x'] = '52'
    payload['cmdYES.y'] = '8'

    s.post(lnotice,data=payload)
    s.headers['Referer'] = 'https://alta.registries.gov.ab.ca/spinii/welcomeguest.aspx'
    
    s.get(search_page)
    s.headers['Referer'] = 'https://alta.registries.gov.ab.ca/SpinII/SearchSelectType.aspx'
    
    s.get(map_page)
    
    r = s.get(map_find)
    s.headers['Referer'] = 'http://alta.registries.gov.ab.ca/SpinII/mapfinds.aspx'
    soup = BeautifulSoup(r.text,"lxml")
    payload = {i['name']:i.get('value','') for i in soup.select('input[name]')}
    payload['__EVENTTARGET'] = 'Finds$lstFindTypes'
    payload['Finds:lstFindTypes'] = 'Linc'
    payload['Finds:ctlLincNumber:txtLincNumber'] = '0030278592'
    
    r = s.post(map_find,data=payload)
    
    r = s.get(table_link)
    print(r.text)


if __name__ == "__main__":
    with requests.Session() as s:
        s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'
        get_content(s,link)

How can I fetch the title number from the URL?

OR

How can I fetch all the LINC numbers from that site so that I don't need to use the map at all?

The only problem with this site is that it is unavailable during the daytime for maintenance.

Tags: python, python-3.x, web-scraping, python-requests

Solution


The data is requested from:

POST http://alta.registries.gov.ab.ca/SpinII/mapserver.aspx

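The content is encoded in a custom format before it is consumed by the OpenLayers library. All of the decoding lives in one JS file. If you beautify it, you can look for WayTo.Wtb.Format.WTB, the OpenLayers.Class that does the decoding. The binary is decoded byte by byte in JS, like this: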

switch(elementType){
    case 1:
        var lineColor = new WayTo.Wtb.Element.LineColor();
        byteOffset = lineColor.parse(dataReader, byteOffset);
        outputElement = lineColor;
        break;
    case 2:
        var lineStyle = new WayTo.Wtb.Element.LineStyle();
        byteOffset = lineStyle.parse(dataReader, byteOffset);
        outputElement = lineStyle;
        break;
    case 3:
        var ellipse = new WayTo.Wtb.Element.Ellipse();
        byteOffset = ellipse.parse(dataReader, byteOffset);
        outputElement = ellipse;
        break;
    ........
}

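We have to reproduce this decoding algorithm to get the raw data. We don't need to decode every object; we only need to advance the offsets correctly and extract the strings properly. Here is the decoding part, which reads the encoded data from a file (wtb.bin in the script):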

with open("wtb.bin", mode='rb') as file:
    encodedData = file.read()
    offset = 0
    objects = []

    while offset < len(encodedData):

        elementSize = encodedData[offset]
        offset+=1
        elementType = encodedData[offset]
        offset+=1

        if elementType == 0:
            break

        curElemSize = elementSize
        curElemType = elementType

        if elementType== 114:
            largeElementSize = int.from_bytes(encodedData[offset:offset + 4], "big")
            offset+=4
            largeElementType = int.from_bytes(encodedData[offset:offset+2], "little")
            offset+=2
            curElemSize = largeElementSize
            curElemType = largeElementType

        print(f"type {curElemType} | size {curElemSize}")
        offsetInit = offset

        if curElemType == 1:
            offset+=4
        elif curElemType == 2:
            offset+=2
        elif curElemType == 3:
            offset+=20
        elif curElemType == 4:
            offset+=28
        elif curElemType == 5:
            offset+=12
        elif curElemType == 6:
            textLength = curElemSize - 3
            objects.append({
                "type": "Text",
                "x_position": int.from_bytes(encodedData[offset:offset+2], "little"),
                "y_position": int.from_bytes(encodedData[offset+2:offset+4], "little"),
                "rotation": int.from_bytes(encodedData[offset+4:offset+6], "little"),
                "text": encodedData[offset+6:offset+6+(textLength*2)].decode("utf-8").replace('\x00','')
            })
            offset+=6+(textLength*2)
        elif curElemType == 7:
            numPoint = int(curElemSize / 2)
            offset+=4*numPoint
        elif curElemType == 27:
            numPoint = int(curElemSize / 4)
            offset+=8*numPoint
        elif curElemType == 8:
            numPoint = int(curElemSize / 2)
            offset+=4*numPoint
        elif curElemType == 28:
            numPoint = int(curElemSize / 4)
            offset+=8*numPoint
        elif curElemType == 13:
            offset+=4
        elif curElemType == 14:
            offset+=2
        elif curElemType == 15:
            offset+=2
        elif curElemType == 100:
            pass
        elif curElemType == 101:
            offset+=20
        elif curElemType == 102:
            offset+=2
        elif curElemType == 103:
            pass
        elif curElemType == 104:
            highShort = int.from_bytes(encodedData[offset+2:offset+4], "little")
            lowShort = int.from_bytes(encodedData[offset+4:offset+6], "little")
            objects.append({
                "type": "StartNumericCell",
                "entity": int.from_bytes(encodedData[offset:offset+2], "little"),
                "occurrence": (highShort << 16) + lowShort
            })
            offset+=6
        elif curElemType == 105:
            #end cell
            pass
        elif curElemType == 109:
            textLength = curElemSize - 1
            objects.append({
                "type": "StartAlphanumericCell",
                "entity": int.from_bytes(encodedData[offset:offset+2], "little"),
                "occurrence":encodedData[offset+2:offset+2+(textLength*2)].decode("utf-8").replace('\x00','')
            })
            offset+=2+(textLength*2)
        elif curElemType == 111:
            offset+=40
        elif curElemType == 112:
            objects.append({
                "type": "CoordinatePlane",
                "projection_code": encodedData[offset+48:offset+52].decode("utf-8").replace('\x00','')
            })
            offset+=52
        elif curElemType == 113:
            offset+=24
        elif curElemType == 256:
            nameLength = int.from_bytes(encodedData[offset+14:offset+16], "little")
            objects.append({
                "type": "LargePolygon",
                "name": encodedData[offset+16:offset+16+nameLength].decode("utf-8").replace('\x00',''),
                "occurence": int.from_bytes(encodedData[offset+2:offset+6], "little")
            })
            if nameLength > 0:
                offset+= 16 + nameLength
                if encodedData[offset] == 0:
                    offset+=1
            else:
                offset+= 16
            numberOfPoints = int.from_bytes(encodedData[offset:offset+2], "little")
            offset+=2
            offset+=numberOfPoints*8
        elif curElemType == 257:
            pass
        else:
            offset+= curElemSize*2
        print(f"offset diff {offset-offsetInit}")
        print("--------------------------------")

    print(objects)
    print(len(encodedData))
    print(offset)

(Side note: the large element size is big-endian; all other values are little-endian.)
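To make the endianness note concrete, here is a minimal sketch of reading a single element header, mirroring the first few lines of the loop above. The function name `read_element_header` and the sample bytes are made up for illustration. Only the layout (one size byte, one type byte, and for type 114 a 4-byte big-endian size followed by a 2-byte little-endian type) is taken from the decoder.

```python
def read_element_header(data: bytes, offset: int):
    """Return (element_type, element_size, new_offset) for the element at offset."""
    size = data[offset]
    elem_type = data[offset + 1]
    offset += 2
    if elem_type == 114:
        # Extended header: 4-byte big-endian size, then 2-byte little-endian type.
        size = int.from_bytes(data[offset:offset + 4], "big")
        elem_type = int.from_bytes(data[offset + 4:offset + 6], "little")
        offset += 6
    return elem_type, size, offset


# Hypothetical sample bytes, not taken from a real server response.
small = bytes([2, 1])  # size 2, type 1
large = bytes([0, 114]) + (300).to_bytes(4, "big") + (256).to_bytes(2, "little")

print(read_element_header(small, 0))  # (1, 2, 2)
print(read_element_header(large, 0))  # (256, 300, 8)
```

Keeping the header logic in one small function makes it easier to check the two byte orders in isolation before touching the per-type offset arithmetic.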

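Run this on repl.it to see how it decodes the file.

From there we can build the steps to scrape the data. For clarity I'll describe all of the steps (even the ones you have already figured out):

Login

Log in to the website using: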

GET https://alta.registries.gov.ab.ca/spinii/logon.aspx

Scrape the input names/values, add uctrlLogon:cmdLogonGuest.x and uctrlLogon:cmdLogonGuest.y, then call

POST https://alta.registries.gov.ab.ca/spinii/logon.aspx
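The "scrape the inputs, then add the two click fields" step can be sketched as below. This is only an illustration: it uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages, and `InputCollector`, `build_payload`, and the sample HTML are hypothetical names and data, not part of the site. Browsers submit the click position of an image button as `name.x` and `name.y`, which is presumably why those two fields are needed here.

```python
from html.parser import HTMLParser


class InputCollector(HTMLParser):
    """Collect name/value pairs from every <input> tag that has a name."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name:
            # Inputs without a value attribute are sent as empty strings.
            self.fields[name] = attributes.get("value") or ""


def build_payload(html: str, click_target: str, x: int = 1, y: int = 1) -> dict:
    """Copy all form inputs and add the .x/.y fields of the clicked button."""
    parser = InputCollector()
    parser.feed(html)
    payload = dict(parser.fields)
    payload[f"{click_target}.x"] = str(x)
    payload[f"{click_target}.y"] = str(y)
    return payload


# Made-up form, loosely shaped like an ASP.NET page.
sample_html = """
<form>
  <input type="hidden" name="__VIEWSTATE" value="abc123">
  <input type="text" name="user">
  <input type="image" name="uctrlLogon:cmdLogonGuest" src="guest.gif">
</form>
"""
payload = build_payload(sample_html, "uctrlLogon:cmdLogonGuest", 76, 25)
print(payload)
```

The same helper would cover the legal notice step by passing `cmdYES` as the click target.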

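Legal notice

The legal notice call is not required to get the map values, but it is required to get the item information (the last step in your post).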

GET https://alta.registries.gov.ab.ca/spinii/legalnotice.aspx

Scrape the input tag names/values, set cmdYES.x and cmdYES.y, then call

POST https://alta.registries.gov.ab.ca/spinii/legalnotice.aspx

Map data

Call the server map API:

POST http://alta.registries.gov.ab.ca/SpinII/mapserver.aspx

with the following data:

{
    "mt":"titleresults",
    "qt":"lincNo",
    "LINCNumber": lincNumber,
    "rights": "B", #not required
    "cx": 1920, #screen definition
    "cy": 1080,
}

cx/cy are the canvas size.

Decode the encoded data using the method above. You will get:
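As a small sketch of how that request body might be assembled, the helper below builds the same dictionary. `build_map_query` is a hypothetical name, and the digits-only check is an assumption based on the LINC numbers shown in this post. The main point is to keep the LINC number as a string, since converting it to an integer would drop the leading zeros.

```python
def build_map_query(linc_number: str, width: int = 1920, height: int = 1080) -> dict:
    """Build the form data for the mapserver.aspx POST described above."""
    # Assumption: LINC numbers consist of digits only, as in the examples here.
    if not linc_number.isdigit():
        raise ValueError("LINC number should contain digits only")
    return {
        "mt": "titleresults",
        "qt": "lincNo",
        "LINCNumber": linc_number,  # kept as a string so leading zeros survive
        "cx": width,                # canvas width
        "cy": height,               # canvas height
    }


query = build_map_query("0030278592")
print(query)
```

The optional "rights" field is left out because the answer marks it as not required.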

[{'type': 'LargePolygon', 'name': '0010495134 8722524;1;162', 'entity': 23, 'occurence': 628079167, 'line_color_green': 0, 'line_color_red': 129, 'line_color_blue': 129, 'fill_color_green': 255, 'fill_color_red': 255, 'fill_color_blue': 180}, {'type': 'LargePolygon', 'name': '0012170859 8022146;8;99', 'entity': 23, 'occurence': 628048595, 'line_color_green': 0, 'line_color_red': 129, 'line_color_blue': 129, 'fill_color_green': 255, 'fill_color_red': 255, 'fill_color_blue': 180}, {'type': 'LargePolygon', 'name': '0010691822 8722524;1;163', 'entity': 23, 'occurence': 628222354, 'line_color_green': 0, 'line_color_red': 129, 'line_color_blue': 129, 'fill_color_green': 255, 'fill_color_red': 255, 'fill_color_blue': 180}, {'type': 'LargePolygon', 'name': '0012169736 8022146;8;89', 'entity': 23, 'occurence': 628021327, 'line_color_green': 0, 'line_color_red': 129, 'line_color_blue': 129, 'fill_color_green': 255, 'fill_color_red': 255, 'fill_color_blue': 180}, {'type': 'LargePolygon', 'name': '0010694454 8722524;1;179', 'entity': 23, 'occurence': 628191678, 'line_color_green': 0, 'line_color_red': 129, 'line_color_blue': 129, 'fill_color_green': 255, 'fill_color_red': 255, 'fill_color_blue': 180}, {'type': 'LargePolygon', 'name': '0010694362 8722524;1;178', 'entity': 23, 'occurence': 628307403, 'line_color_green': 0, 'line_color_red': 129, 'line_color_blue': 129, 'fill_color_green': 255, 'fill_color_red': 255, 'fill_color_blue': 180}, {'type': 'LargePolygon', 'name': '0010433381 8722524;1;177', 'entity': 23, 'occurence': 628209696, 'line_color_green': 0, 'line_color_red': 129, 'line_color_blue': 129, 'fill_color_green': 255, 'fill_color_red': 255, 'fill_color_blue': 180}, {'type': 'LargePolygon', 'name': '0012169710 8022146;8;88A', 'entity': 23, 'occurence': 628021328, 'line_color_green': 0, 'line_color_red': 129, 'line_color_blue': 129, 'fill_color_green': 255, 'fill_color_red': 255, 'fill_color_blue': 180}, {'type': 'LargePolygon', 'name': '0010694355 8722524;1;176', 
'entity': 23, 'occurence': 628315826, 'line_color_green': 0, 'line_color_red': 129, 'line_color_blue': 129, 'fill_color_green': 255, 'fill_color_red': 255, 'fill_color_blue': 180}, {'type': 'LargePolygon', 'name': '0012170866 8022146;8;100', 'entity': 23, 'occurence': 628163431, 'line_color_green': 0, 'line_color_red': 129, 'line_color_blue': 129, 'fill_color_green': 255, 'fill_color_red': 255, 'fill_color_blue': 180}, {'type': 'LargePolygon', 'name': '0010694347 8722524;1;175', 'entity': 23, 'occurence': 628132810, 'line_color_green': 0, 'line_color_red': 129, 

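Extract information

If you want to target a specific lincNumber, you will need to look at the style of the polygon, because for "Multiple" values (i.e. parcels with several items) the response does not mention the lincNumber, only a link reference. The following gets the selected item: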

selectedZone = [
    t 
    for t in objects 
    if t.get("fill_color_green", 255) < 255 and t.get("line_color_red") == 255
][0]
print(selectedZone)
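The list comprehension above raises an IndexError when no polygon matches. A slightly more defensive variant, sketched here with made-up sample objects, returns None instead. `find_selected_zone` is a hypothetical helper name, and the colour test is copied unchanged from the snippet above.

```python
def find_selected_zone(objects):
    """Return the first polygon styled as selected, or None if there is none."""
    return next(
        (
            o for o in objects
            if o.get("fill_color_green", 255) < 255 and o.get("line_color_red") == 255
        ),
        None,
    )


# Made-up objects; only the keys mirror the decoder output.
sample = [
    {"type": "LargePolygon", "occurence": 111, "line_color_red": 129, "fill_color_green": 255},
    {"type": "LargePolygon", "occurence": 222, "line_color_red": 255, "fill_color_green": 128},
    {"type": "Text", "text": "label"},
]

zone = find_selected_zone(sample)
print(zone["occurence"] if zone else "no selected polygon found")  # 222
```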

Call the URL you mentioned in your post to get the data and extract the table:

GET https://alta.registries.gov.ab.ca/SpinII/popupTitleSearch.aspx?title={selectedZone["occurence"]}
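For completeness, here is a minimal sketch of building that URL from a decoded value, and of reading the title number back out of such a URL, which is what the question originally asked about. The helper names `title_search_url` and `title_from_url` are made up; the calls themselves are standard `urllib.parse` functions.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE = "https://alta.registries.gov.ab.ca/SpinII/popupTitleSearch.aspx"


def title_search_url(occurrence: int) -> str:
    """Build the popup URL for a decoded 'occurence' value."""
    return f"{BASE}?{urlencode({'title': occurrence})}"


def title_from_url(url: str) -> str:
    """Extract the title query parameter from a popup URL."""
    return parse_qs(urlsplit(url).query)["title"][0]


url = title_search_url(628086906)
print(url)
print(title_from_url(url))  # 628086906
```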

Full code:

import requests
from bs4 import BeautifulSoup
import pandas as pd

lincNumber = "0030278592"
#lincNumber = "0010661156"

s = requests.Session()

# 1) login
r = s.get("https://alta.registries.gov.ab.ca/spinii/logon.aspx")
soup = BeautifulSoup(r.text, "html.parser")

payload = {
    t["name"]: t.get("value", "")
    for t in soup.select("input[name]")  # skip inputs without a name attribute
}
payload["uctrlLogon:cmdLogonGuest.x"] = 76
payload["uctrlLogon:cmdLogonGuest.y"] = 25
s.post("https://alta.registries.gov.ab.ca/spinii/logon.aspx",data=payload)

# 2) legal notice
r = s.get("https://alta.registries.gov.ab.ca/spinii/legalnotice.aspx")
soup = BeautifulSoup(r.text, "html.parser")
payload = {
    t["name"]: t.get("value", "")
    for t in soup.select("input[name]")  # skip inputs without a name attribute
}
payload["cmdYES.x"] = 82
payload["cmdYES.y"] = 3
s.post("https://alta.registries.gov.ab.ca/spinii/legalnotice.aspx", data = payload)

# 3) map data
r = s.post("http://alta.registries.gov.ab.ca/SpinII/mapserver.aspx",
    data= {
        "mt":"titleresults",
        "qt":"lincNo",
        "LINCNumber": lincNumber,
        "rights": "B", #not required
        "cx": 1920, #screen definition
        "cy": 1080,
    })

def decodeWtb(encodedData):
    offset = 0

    objects = []
    iteration = 0

    while offset < len(encodedData):

        elementSize = encodedData[offset]
        offset+=1
        elementType = encodedData[offset]
        offset+=1

        if elementType == 0:
            break

        curElemSize = elementSize
        curElemType = elementType

        if elementType== 114:
            largeElementSize = int.from_bytes(encodedData[offset:offset + 4], "big")
            offset+=4
            largeElementType = int.from_bytes(encodedData[offset:offset+2], "little")
            offset+=2
            curElemSize = largeElementSize
            curElemType = largeElementType

        offsetInit = offset

        if curElemType == 1:
            offset+=4
        elif curElemType == 2:
            offset+=2
        elif curElemType == 3:
            offset+=20
        elif curElemType == 4:
            offset+=28
        elif curElemType == 5:
            offset+=12
        elif curElemType == 6:
            textLength = curElemSize - 3
            offset+=6+(textLength*2)
        elif curElemType == 7:
            numPoint = int(curElemSize / 2)
            offset+=4*numPoint
        elif curElemType == 27:
            numPoint = int(curElemSize / 4)
            offset+=8*numPoint
        elif curElemType == 8:
            numPoint = int(curElemSize / 2)
            offset+=4*numPoint
        elif curElemType == 28:
            numPoint = int(curElemSize / 4)
            offset+=8*numPoint
        elif curElemType == 13:
            offset+=4
        elif curElemType == 14:
            offset+=2
        elif curElemType == 15:
            offset+=2
        elif curElemType == 100:
            pass
        elif curElemType == 101:
            offset+=20
        elif curElemType == 102:
            offset+=2
        elif curElemType == 103:
            pass
        elif curElemType == 104:
            offset+=6
        elif curElemType == 105:
            pass
        elif curElemType == 109:
            textLength = curElemSize - 1
            offset+=2+(textLength*2)
        elif curElemType == 111:
            offset+=40
        elif curElemType == 112:
            offset+=52
        elif curElemType == 113:
            offset+=24
        elif curElemType == 256:
            nameLength = int.from_bytes(encodedData[offset+14:offset+16], "little")
            objects.append({
                "type": "LargePolygon",
                "name": encodedData[offset+16:offset+16+nameLength].decode("utf-8").replace('\x00',''),
                "entity": int.from_bytes(encodedData[offset:offset+2], "little"),
                "occurence": int.from_bytes(encodedData[offset+2:offset+6], "little"),
                "line_color_green": encodedData[offset + 8],
                "line_color_red": encodedData[offset + 7],
                "line_color_blue": encodedData[offset + 9],
                "fill_color_green": encodedData[offset + 10],
                "fill_color_red": encodedData[offset + 11],
                "fill_color_blue": encodedData[offset + 13]
            })
            if nameLength > 0:
                offset+= 16 + nameLength
                if encodedData[offset] == 0:
                    offset+=1
            else:
                offset+= 16
            numberOfPoints = int.from_bytes(encodedData[offset:offset+2], "little")
            offset+=2
            offset+=numberOfPoints*8
        elif curElemType == 257:
            pass
        else:
            offset+= curElemSize*2

    return objects

# 4) decode custom format
objects = decodeWtb(r.content)

# 5) get the selected area
selectedZone = [
    t 
    for t in objects 
    if t.get("fill_color_green", 255) < 255 and t.get("line_color_red") == 255
][0]
print(selectedZone)

# 6) get the info about item
r = s.get(f'https://alta.registries.gov.ab.ca/SpinII/popupTitleSearch.aspx?title={selectedZone["occurence"]}')
df = pd.read_html(r.content, attrs = {'class': 'bodyText'}, header =0)[0]
del df['Add to Cart']
del df['View']
print(df[:-1])

Run it on repl.it

Output

  Title Number           Type LINC Number Short Legal   Rights Registration Date Change/Cancel Date
0    052400228  Current Title  0030278592  0420091;16  Surface        19/09/2005         13/11/2019
1    072294084  Current Title  0030278551  0420091;12  Surface        22/05/2007         21/08/2007
2    072400529  Current Title  0030278469   0420091;3  Surface        05/07/2007         28/08/2007
3    072498228  Current Title  0030278501   0420091;7  Surface        18/08/2007         08/02/2008
4    072508699  Current Title  0030278535  0420091;10  Surface        23/08/2007         13/12/2007
5    072559500  Current Title  0030278477   0420091;4  Surface        17/09/2007         19/11/2007
6    072559508  Current Title  0030278576  0420091;14  Surface        17/09/2007         09/01/2009
7    072559521  Current Title  0030278519   0420091;8  Surface        17/09/2007         07/11/2007
8    072559530  Current Title  0030278493   0420091;6  Surface        17/09/2007         25/08/2008
9    072559605  Current Title  0030278485   0420091;5  Surface        17/09/2007         23/12/2008

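You can inspect the objects list if you want more entries. If you want more information about the items, such as coordinates, you can improve the decoder.

You can also match the other lincNumbers located around the target by looking at the name field, which contains the lincNumber, except for entries whose name is "Multiple".

Fun fact:

There is no need to set any HTTP headers in this flow.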
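A rough sketch of that matching is shown below. It assumes the name field has the form "LINC number, space, short legal description", which is only inferred from the sample output in this answer. `split_polygon_name` and `index_by_linc` are hypothetical helpers, and the third sample entry is invented to show how a "Multiple" name is skipped.

```python
def split_polygon_name(name: str):
    """Split 'LINC shortlegal' into its two parts; None if it does not fit."""
    parts = name.split(" ", 1)
    if len(parts) != 2 or not parts[0].isdigit():
        return None
    return {"linc": parts[0], "short_legal": parts[1]}


def index_by_linc(objects):
    """Map each LINC number found in a polygon name to its 'occurence' value."""
    index = {}
    for o in objects:
        parsed = split_polygon_name(o.get("name", ""))
        if parsed:
            index[parsed["linc"]] = o["occurence"]
    return index


sample = [
    # The first two entries are copied from the decoded output shown earlier.
    {"type": "LargePolygon", "name": "0010495134 8722524;1;162", "occurence": 628079167},
    {"type": "LargePolygon", "name": "0012170859 8022146;8;99", "occurence": 628048595},
    # Invented entry standing in for a polygon that covers several titles.
    {"type": "LargePolygon", "name": "Multiple", "occurence": 1},
]

lookup = index_by_linc(sample)
print(lookup)
```

Entries that do not parse are silently skipped, so polygons named "Multiple" still need the style-based selection described earlier.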

