Why does my function for fixing malformed URLs return malformed URLs?

Problem description

I have 2 functions:

def getDomainFromLink(domain):
    if domain.startswith("https://") or domain.startswith("http://"):
        return domain.split("/")[3]
    if domain.startswith("//"):
        return domain.split("/")[2]

def fixLink(Link,LinkOriginalPage):
    '''Fixes link. ex. /f/d -> https://www.wtds.com/f/d
    LinkOriginalPage=page Link redirected from'''
    if Link.startswith("https://") or Link.startswith("http://"):
        return Link # , and exit
        #fix 329 links crawled! - Latest link: https://www.wikipedia.com/https://kl.wikipedia.org/
    if Link.startswith("//"):
        Link="https:"+Link # example, //www.pastebin.com/ -> http://www.pastebin.com/
        # print(Link)
        return Link # due to glitch
    # now link does not start with //
    # check if link is like a/b/c->site.com/a/b/c
    asciiLetters="abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
    linkStartsWithValidProtocol=not Link.startswith("http://") or Link.startswith("https://")
    linkDoesNotStartWithSlash=Link[0] in asciiLetters
    if linkStartsWithValidProtocol and linkDoesNotStartWithSlash:
        if LinkOriginalPage.endswith("/"):
            Link=LinkOriginalPage+Link
        else:
            Link=LinkOriginalPage+"/"+Link
        return Link
    # now link does not start with ascii letter
    # check if link is like /a/b/c
    if Link.startswith("/"):
        domainOfLink=getDomainFromLink(LinkOriginalPage)
        # print(domainOfLink)
        Link="http://"+domainOfLink+Link
        # print("startswith / "+Link)
        return Link # due to glitch
    # fix div links (widely used bad code practice)
    if Link.startswith("#"):
        #glitch, invalud url like *&YT -> invalud url schema
        #fix div
        domainOfLink=getDomainFromLink(LinkOriginalPage)
        Link=domainOfLink+Link
        return Link
    # return the output if not returned (nvm)
    return Link

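It tries to fix the links found in a[href] tags, using the href value and the site the tag came from. The problem is that it sometimes returns links like "https:///wiki/www". Does anyone know why this happens?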

Tags: python, http, python-requests

Solution
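
Judging only from the code posted in the question, a likely cause is the index used in getDomainFromLink. Splitting "https://www.site.com/" on "/" gives ['https:', '', 'www.site.com', ''], so the host sits at index 2, not 3. Index 3 is either an empty string, which produces an empty host such as "http:///wiki/www", or the first path segment, and it raises IndexError when the page URL has no slash after the host. The posted code prepends "http://" rather than "https://", so the exact scheme in the reported output may come from a different revision of the code. Two smaller things also look suspicious: the expression on the linkStartsWithValidProtocol line parses as (not A) or B, and the "#" branch returns a URL without a scheme.

Rather than patching each branch, one option is to let the standard library resolve the link against the page it was found on. The sketch below first reproduces the split behaviour and then uses urllib.parse.urljoin. The helper name fix_link and the example URLs are made up for illustration. Note that urljoin follows normal browser rules, so a relative link like "c/d" replaces the last path segment of the page URL instead of being appended to it as in the original fixLink.

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical page URL, used only to reproduce the problem.
page = "https://www.site.com/"

parts = page.split("/")
print(parts)     # ['https:', '', 'www.site.com', '']
print(parts[3])  # '' (empty), so "http://" + '' + "/wiki/www" == "http:///wiki/www"
print(parts[2])  # 'www.site.com' (the actual host)


def fix_link(link, page_url):
    """Resolve link against the page it was found on (hypothetical helper).

    Covers absolute URLs, //host/path, /path, relative paths and #fragments
    with a single call to urljoin.
    """
    return urljoin(page_url, link)


def get_domain(url):
    """Return the host part of an absolute URL using urlsplit."""
    return urlsplit(url).netloc


print(fix_link("/wiki/www", page))               # https://www.site.com/wiki/www
print(fix_link("//kl.wikipedia.org/", page))     # https://kl.wikipedia.org/
print(fix_link("#top", "https://www.site.com/a/b"))  # https://www.site.com/a/b#top
print(get_domain("https://www.site.com/wiki/www"))   # www.site.com
```

If the original behaviour of appending relative links to the page URL is really what you want, the minimal change is to use index 2 in getDomainFromLink and keep the rest of fixLink as it is.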

