Problem description

I wrote some Python code that takes a URL and extracts the href values that are not https links.

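from BeautifulSoup import BeautifulSoup
import urllib2
import re

# Fetch the page and parse it (Python 2, BeautifulSoup 3)
html_page = urllib2.urlopen("http://kteq.in/index")
soup = BeautifulSoup(html_page)
for link in soup.findAll('a'):
    href = link.get('href')
    # Skip <a> tags that have no href attribute
    if href is None:
        continue
    # Blank out anything starting with "http"; relative links pass through unchanged
    result = re.sub(r"http\S+", "", href)
    print result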

When I run the code above, I get every href link except the https ones. This is the output:

index
index
#
solutions#internet-of-things
solutions#online-billing-and-payment-solutions
solutions#customer-relationship-management
solutions#enterprise-mobility
solutions#enterprise-content-management
solutions#artificial-intelligence
solutions#b2b-and-b2c-web-portals
solutions#robotics
solutions#augement-reality-virtual-reality
solutions#azure
solutions#omnichannel-commerce
solutions#document-management
solutions#enterprise-extranets-and-intranets
solutions#business-intelligence
solutions#enterprise-resource-planning
services
clients
contact
#
#
#
#myCarousel
#myCarousel

#
#
#
#
#
#
#
#
#
#
#step1
#step2
AndroidAppDevelopment
contact
solutions
contact

index
services
#
contact
#
iOSDevelopmentServices
AndroidAppDevelopment
WindowsAppDevelopment
HybridSoftwareSolutions
CloudServices
HTML5Development
iPadAppDevelopment
services
services
services
services
services
services
contact
contact
contact
contact
contact

#
#
#
#

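Now I need to extract the href links found on the pages that these output links point to. For example, I need to extract the links inside "index" from the output above. Please suggest how I can get that output.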

Tags: python, python-2.7

Solution
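
One possible approach, given as a minimal sketch rather than a definitive answer: resolve each relative href against the starting URL, fetch that page, and collect its hrefs in the same way. The sketch assumes Python 3 and uses only the standard library (html.parser in place of BeautifulSoup, urllib.request in place of urllib2). The names HrefCollector, relative_hrefs, page_urls, fetch and links_one_level_down are illustrative and do not come from the question. It also assumes the pages decode as UTF-8, and it skips absolute links by checking the URL scheme instead of blanking them with re.sub.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse
from urllib.request import urlopen


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def relative_hrefs(html):
    """Return hrefs with no scheme and no host (skips http, https, mailto, ...)."""
    collector = HrefCollector()
    collector.feed(html)
    result = []
    for href in collector.hrefs:
        parts = urlparse(href)
        if not parts.scheme and not parts.netloc:
            result.append(href)
    return result


def page_urls(base_url, hrefs):
    """Resolve hrefs against base_url, drop #fragments, de-duplicate in order."""
    seen = []
    for href in hrefs:
        url, _fragment = urldefrag(urljoin(base_url, href))
        if url not in seen:
            seen.append(url)
    return seen


def fetch(url):
    """Download a page; the UTF-8 encoding is an assumption of this sketch."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def links_one_level_down(start_url, fetch_page=fetch):
    """Map each page linked from start_url to the relative hrefs found on it."""
    first_level = relative_hrefs(fetch_page(start_url))
    return {url: relative_hrefs(fetch_page(url))
            for url in page_urls(start_url, first_level)}
```

Calling links_one_level_down("http://kteq.in/index") would then return a dictionary keyed by page URL (for example the URL that "index" or "services" resolves to), each mapped to the relative hrefs on that page. Links that differ only by a #fragment, such as the solutions#... entries, resolve to the same page and are fetched once. What it actually returns depends on the live site.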

