Something strange in BeautifulSoup?

Question

I am trying to use BeautifulSoup to parse a web page like the following:

     <div id="block-info-advan" style="width: 760px;">
        <div class="content-0">
            <div class="left-col">
                <div class="number">
                    1</div>
                <div class="nq">
                    <p class="nqTitle" lawid="435943">
                        <a>...</a>
                    </p>
                </div>
            </div>
        </div>
        <div class="content-1">
            <div class="left-col">
                <div class="number">
                    2</div>
                <div class="nq">
                    <p class="nqTitle" lawid="435632">
                        <a <...</a>
                    </p>
                </div>
            </div>
        </div>
    </div>

If I use:

test = soup.select(".nqTitle")
result: 2

But why do I get a different count if I use:

test = soup.select("body .nqTitle")
result: 1

or

test = soup.select("body")
test2 = test[0].select(".nqTitle")
result: 1

In the second and third snippets, I expected the result to be 2 as well.

Can anyone explain this to me?

Thanks.

Tags: python, web-scraping, beautifulsoup

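Answer

I tried both approaches and could not reproduce the problem: each one returned both elements. Perhaps the difference is caused by data elsewhere in your HTML.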

html = '''
<body>
<div id="block-info-advan" style="width: 760px;">
        <div class="content-0">
            <div class="left-col">
                <div class="number">
                    1</div>
                <div class="nq">
                    <p class="nqTitle" lawid="435943">
                        <a>...</a>
                    </p>
                </div>
            </div>
        </div>
        <div class="content-1">
            <div class="left-col">
                <div class="number">
                    2</div>
                <div class="nq">
                    <p class="nqTitle" lawid="435632">
                        <a>...</a>
                    </p>
                </div>
            </div>
        </div>
</div></body>
'''

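BeautifulSoup:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# 1) Select by class anywhere in the document
test = soup.select(".nqTitle")
print(test)
# 2) Select by class among the descendants of <body>
test = soup.select("body .nqTitle")
print(test)
# 3) Select <body> first, then search inside it
test = soup.select("body")
test = test[0].select(".nqTitle")
print(test)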

Result:

[<p class="nqTitle" lawid="435943">
<a>...</a>
</p>, <p class="nqTitle" lawid="435632">
<a>...</a>
</p>]
[<p class="nqTitle" lawid="435943">
<a>...</a>
</p>, <p class="nqTitle" lawid="435632">
<a>...</a>
</p>]

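SimplifiedDoc:

# Assumes the simplified_scrapy package, which provides SimplifiedDoc
from simplified_scrapy import SimplifiedDoc
doc = SimplifiedDoc(html)
test = doc.selects('p.nqTitle')
print(test)
test = doc.selects('body>p.nqTitle')
print(test)
test = doc.select('body').selects('p.nqTitle')
print(test)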

Result:

[{'class': 'nqTitle', 'lawid': '435943', 'tag': 'p', 'html': '\n                        <a>...</a>\n                    '}, {'class': 'nqTitle', 'lawid': '435632', 'tag': 'p', 'html': '\n                        <a>...</a>\n                    '}]
[{'class': 'nqTitle', 'lawid': '435943', 'tag': 'p', 'html': '\n                        <a>...</a>\n                    '}, {'class': 'nqTitle', 'lawid': '435632', 'tag': 'p', 'html': '\n                        <a>...</a>\n                    '}]
[{'class': 'nqTitle', 'lawid': '435943', 'tag': 'p', 'html': '\n                        <a>...</a>\n                    '}, {'class': 'nqTitle', 'lawid': '435632', 'tag': 'p', 'html': '\n                        <a>...</a>\n                    '}]
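
One way to test the "data elsewhere in the HTML" guess is to scan the raw page for nqTitle elements that appear outside an open <body>, for example after a premature </body>. Depending on the parser, BeautifulSoup may keep such elements outside the body element, which would make "body .nqTitle" match fewer elements than ".nqTitle". The sketch below uses only the standard library. The class BodyScopeCounter, the function count_by_body_scope, and the sample markup are hypothetical names and data made up for illustration, and this is only one possible cause, not a confirmed diagnosis of the page in the question.

```python
from html.parser import HTMLParser


class BodyScopeCounter(HTMLParser):
    """Count elements with a given class inside vs. outside an open <body>."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.body_depth = 0  # greater than 0 while a <body> is open
        self.inside = 0
        self.outside = 0

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.body_depth += 1
            return
        # The class attribute may be missing or have no value
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            if self.body_depth > 0:
                self.inside += 1
            else:
                self.outside += 1

    def handle_endtag(self, tag):
        if tag == "body" and self.body_depth > 0:
            self.body_depth -= 1


def count_by_body_scope(html, css_class="nqTitle"):
    """Return (inside_body, outside_body) counts for css_class."""
    parser = BodyScopeCounter(css_class)
    parser.feed(html)
    parser.close()
    return parser.inside, parser.outside


# Hypothetical page: a premature </body> precedes the second block
sample = (
    '<html><body><p class="nqTitle" lawid="435943"><a>...</a></p></body>'
    '<p class="nqTitle" lawid="435632"><a>...</a></p></html>'
)
inside, outside = count_by_body_scope(sample)
print(inside, outside)
```

If the second number is non-zero for the real page, the markup itself closes <body> early (or contains more than one body), and comparing the output of different BeautifulSoup parsers on that page would be a reasonable next step.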
