Get text from a webpage in Python

Problem description

I am trying to get URLs from a webpage.

I have tried wget, urllib and lynx (lynx returned the most organised results), but the tricky part is that the URLs are written on the page as plain text, and when a URL is long the rest of it is replaced by three dots (for example, exampppppppppppppple.com is shown as exampleppp...). To see the full URL you have to click the entry's id, which opens a new window, and in that window the complete URL is written out, again as text. I managed to get the URLs, but I don't know how to go to that other page and get the URL text when it is truncated, and I am not sure whether wget -r would work in my case (since the URLs are just text, not links).

Here is what I wrote:

import os

def get_urls():
    # Dump the page with lynx, drop links to the site itself, keep the
    # http(s) entries and write the cleaned-up result to urls.txt
    os.system("lynx -dump https://www.example.com/"
              " | grep -v https://www.example.com/*"
              " | grep https* | grep http*"
              " | cut -f5- -d' ' > urls.txt")

Output

http://www.another-example... 
https://example1.com
https://www.example.com

Tags: python, url, wget, lynx

Solution


Update, mid-2020

If I have understood the task correctly, it is to get a list of the URLs embedded in a webpage that differ from the base URL of the page itself. So if the page is https://example.com, list all URLs that are not 'example.com/..' URLs.
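Before reaching for an external program, the same idea can be sketched with the standard library alone. This is only a rough sketch, assuming Python 3 and a page that can be fetched with a plain urlopen call; the LinkCollector class and the hard-coded site value are illustrative, not part of the original code:

# A minimal standard-library sketch: collect every href attribute and
# print the ones whose host is not the site itself.
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

site = "stackoverflow.com"      # illustrative value
siteurl = "https://" + site

class LinkCollector(HTMLParser):
    # collects the href attribute of every <a> tag
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

html = urlopen(siteurl, timeout=10).read().decode("utf-8", errors="replace")
parser = LinkCollector()
parser.feed(html)

for link in parser.links:
    absolute = urljoin(siteurl, link)       # resolve relative links
    host = urlparse(absolute).netloc
    if host and site not in host:           # keep only off-site URLs
        print(absolute)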

Using the external Lynx program

Calling Lynx, Python 3.7 and later
# Python 3.7 and later (the capture_output and encoding arguments require it)
import subprocess

site = "stackoverflow.com"
siteurl = "https://" + site

# set encoding so that result is in strings rather than bytes
# set timeout for the case of a non-existent url
try:
    result = subprocess.run(
        ["lynx", "-listonly", "-dump", siteurl],
        capture_output=True,
        encoding='utf-8',
        timeout=3,
    )
    result.check_returncode()
except subprocess.TimeoutExpired as err:
    print("[Error] ", err)
    exit(err.timeout)
except subprocess.CalledProcessError as err:
    print("[Error] ", err.stderr)
    exit(err.returncode)
except Exception as err:
    print(err)
    exit(err.errno)

resultlist = result.stdout.splitlines()

for item in resultlist:
    item = item.strip()

    urlindicator = "://"
    if item.find(urlindicator) > 0:
        # example split line: ["1.", "https://example.com"]
        item_url = item.split()[1] 
        if item_url.find(site) == -1:
            print(item_url)
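One caveat about the filter above: the substring test item_url.find(site) == -1 also throws away off-site URLs that merely mention the site name somewhere in their path or query string. If that matters, a stricter comparison on the host part alone could look roughly like this (a sketch that reuses site and resultlist from the code above):

# Stricter filtering: compare only the host part of each URL with the site,
# instead of searching for the site name anywhere in the string.
from urllib.parse import urlparse

for item in resultlist:
    parts = item.strip().split()
    # lynx -listonly lines look like: "1. https://example.com/page"
    if len(parts) == 2 and "://" in parts[1]:
        host = urlparse(parts[1]).netloc
        if host and host != site and not host.endswith("." + site):
            print(parts[1])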
Calling Lynx, pre-Python 3.5
# Pre-Python 3.5
import subprocess

site = "stackoverflow.com"
siteurl = "https://" + site

# use universal_newlines so that result is in strings rather than bytes
# (the encoding argument to check_output only exists from Python 3.6 onward)
# set timeout for the case of a non-existent url
try:
    result = subprocess.check_output(
        ["lynx", "-listonly", "-dump", siteurl],
        stderr=subprocess.PIPE,
        universal_newlines=True,
        timeout=2
    )
except subprocess.TimeoutExpired as err:
    print("[Error] ", err)
    exit(err.timeout)
except subprocess.CalledProcessError as err:
    print("[Error] ", err.stderr)
    exit(err.returncode)
except Exception as err:
    print(err)
    exit(err.errno)

resultlist = result.splitlines()

for item in resultlist:
    item = item.strip()

    urlindicator = "://"
    if item.find(urlindicator) > 0:
        # example split line: ["1.", "https://example.com"]
        item_url = item.split()[1] 
        if item_url.find(site) == -1:
            print(item_url)

The output of these examples should look like this:

head list
https://stackexchange.com/sites
https://stackoverflow.blog/
https://www.g2.com/products/stack-overflow-for-teams/
https://www.g2.com/products/stack-overflow-for-teams/
https://www.fastcompany.com/most-innovative-companies/2019/sectors/enterprise
https://stackoverflowbusiness.com/
...
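Coming back to the truncated URLs in the question: the dotted text on the listing page cannot be reconstructed, so the usual approach is to follow each entry's own link and read the full URL from the detail page it opens. A very rough sketch of that idea, assuming the detail pages can be fetched directly and contain the full URL as plain text; the entry_links list and the regular expression are purely illustrative:

# Illustrative only: fetch each entry's detail page and pull out the first
# full http(s) URL that appears as text on it.
import re
from urllib.request import urlopen

# assumed to have been collected from the listing page, e.g. with lynx -listonly
entry_links = [
    "https://www.example.com/entry?id=1",
    "https://www.example.com/entry?id=2",
]

url_pattern = re.compile(r"https?://[^\s\"'<>]+")

for entry in entry_links:
    page = urlopen(entry, timeout=10).read().decode("utf-8", errors="replace")
    match = url_pattern.search(page)
    if match:
        print(match.group(0))       # the untruncated URL from the detail page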
