How to skip unwanted elements that share the same tag when scraping with BeautifulSoup4

Problem description

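I want to scrape a video from a web page, but the page contains two kinds of iframe tag. One displays a Facebook page and the other embeds the video. I only want the video URL, but when I scrape the page I get every iframe.

Like this: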

import requests
from bs4 import BeautifulSoup

url_videos = requests.get(link_to_video)
video_link = BeautifulSoup(url_videos.text, 'html.parser')
video_on_iframe = video_link.find('iframe')
print(video_on_iframe)

When I run the code above, I get this result:

<iframe allow="encrypted-media" allowtransparency="true" frameborder="0" height="80" scrolling="no" src="https://www.facebook.com/plugins/page.php?href=https%3A%2F%2Fwww.facebook.com%2FAnimeindoFans%2F&amp;tabs&amp;width=280&amp;height=180&amp;small_header=true&amp;adapt_container_width=true&amp;hide_cover=true&amp;show_facepile=false&amp;appId=123434497681677" style="border:none;overflow:hidden" width="280"></iframe>
<iframe allow="encrypted-media" allowtransparency="true" frameborder="0" height="80" scrolling="no" src="https://www.facebook.com/plugins/page.php?href=https%3A%2F%2Fwww.facebook.com%2FAnimeindoFans%2F&amp;tabs&amp;width=280&amp;height=180&amp;small_header=true&amp;adapt_container_width=true&amp;hide_cover=true&amp;show_facepile=false&amp;appId=123434497681677" style="border:none;overflow:hidden" width="280"></iframe>
<iframe allow="encrypted-media" allowtransparency="true" frameborder="0" height="80" scrolling="no" src="https://www.facebook.com/plugins/page.php?href=https%3A%2F%2Fwww.facebook.com%2FAnimeindoFans%2F&amp;tabs&amp;width=280&amp;height=180&amp;small_header=true&amp;adapt_container_width=true&amp;hide_cover=true&amp;show_facepile=false&amp;appId=123434497681677" style="border:none;overflow:hidden" width="280"></iframe>
<iframe frameborder="0" height="380" scrolling="no" src="http://www.mp4upload.com/embed-q7xxgge1yu1c.html" type="text/html" width="640">
</iframe>
<iframe allow="encrypted-media" allowtransparency="true" frameborder="0" height="80" scrolling="no" src="https://www.facebook.com/plugins/page.php?href=https%3A%2F%2Fwww.facebook.com%2FAnimeindoFans%2F&amp;tabs&amp;width=280&amp;height=180&amp;small_header=true&amp;adapt_container_width=true&amp;hide_cover=true&amp;show_facepile=false&amp;appId=123434497681677" style="border:none;overflow:hidden" width="280"></iframe>
<iframe allow="encrypted-media" allowtransparency="true" frameborder="0" height="80" scrolling="no" src="https://www.facebook.com/plugins/page.php?href=https%3A%2F%2Fwww.facebook.com%2FAnimeindoFans%2F&amp;tabs&amp;width=280&amp;height=180&amp;small_header=true&amp;adapt_container_width=true&amp;hide_cover=true&amp;show_facepile=false&amp;appId=123434497681677" style="border:none;overflow:hidden" width="280"></iframe>

I don't need the Facebook iframe. I only need the video URL from the other iframe, the one with the attributes height="380" and width="640".

When I pass more detail to the find() method, like this:

video_on_iframe = video_link.find('iframe', width=640, height=380)

I get:

None
None
None
<iframe frameborder="0" height="380" scrolling="no" src="http://www.mp4upload.com/embed-q7xxgge1yu1c.html" type="text/html" width="640">
</iframe>
None
None

One iframe element, and None for everything else.

So my question is: how do I find all iframe elements with width=640 and height=380, and skip the None values?

Tags: html, python-3.x, iframe, web-scraping, beautifulsoup

Solution


# Match only the iframes whose height is 380. You can filter on width as well.
video_on_frame = video_link.find_all('iframe', height='380')
link_array = []
for link in video_on_frame:  # the sample HTML has one matching iframe
    try:
        link_array.append(link['src'])  # collect the iframe's src
    except KeyError:
        link_array.append('Error')  # this iframe has no src attribute

print(link_array)  # shows the URLs you want
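
As a self-contained illustration of the same idea, here is a minimal sketch that can be run without fetching anything. The two inline HTML strings and the helper name video_iframe_srcs are assumptions made up for this example; they stand in for the pages you download. It filters on width, height, and the presence of src in one find_all() call, and because find_all() returns an empty list when nothing matches, pages without a video simply contribute nothing instead of None.

```python
from bs4 import BeautifulSoup

# Hypothetical sample pages standing in for the downloaded HTML.
PAGE_WITH_VIDEO = """
<iframe src="https://www.facebook.com/plugins/page.php" width="280" height="80"></iframe>
<iframe src="http://www.mp4upload.com/embed-q7xxgge1yu1c.html" width="640" height="380"></iframe>
"""
PAGE_WITHOUT_VIDEO = """
<iframe src="https://www.facebook.com/plugins/page.php" width="280" height="80"></iframe>
"""


def video_iframe_srcs(html):
    """Return the src of every 640x380 iframe that has a src attribute."""
    soup = BeautifulSoup(html, "html.parser")
    # Parsed attribute values are strings, so string values are used here.
    # src=True keeps only iframes that actually carry a src attribute.
    frames = soup.find_all("iframe", width="640", height="380", src=True)
    return [frame["src"] for frame in frames]


video_urls = []
for html in (PAGE_WITHOUT_VIDEO, PAGE_WITH_VIDEO, PAGE_WITHOUT_VIDEO):
    # A page with no matching iframe yields an empty list, so no None
    # placeholder ever ends up in the result.
    video_urls.extend(video_iframe_srcs(html))

print(video_urls)
```

If you prefer to keep calling find() once per page, the equivalent guard is to check the result with "if video_on_iframe is not None:" before using it.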

