Web scraping a table: filtering the results

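Question

I am using Python to scrape the data table found here. Specifically, I want to extract the business name, website URL, owner name, street, city, and phone number. After running the page through Beautiful Soup and splitting the markup so it can be filtered, the result looks like this: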

['\\\', \\\' href="?listingid=9758&profileid=217Y3Q544Y&action=uweb&url=http%3a%2f%2fwww.jpspa.com" target="_BLANK"', " Johnson Price Sprinkle PA ', '/a", "', '/b", "', '/td", "', '/tr", "', '/table", "', '/td", "', '/tr", '', 'tr class="GeneralBody"', '', 'td bgcolor="#808080" height="1"', '', 'img border="0" height="1" src="images/dot_clear.gif" width="1"/', "', '/td", "', '/tr", "', '/table", "', '/td", "', '/tr", '', 'tr class="GeneralBody"', '', 'td align="left" valign="top" width="90%"', ' Maria Pilos ', "', '', ' 79 Woodfin Place, Suite 300 ", "', '', ' Asheville, NC 28801 ", "', '', '", 'b', "Phone:', '/b", '**(828) 254-2374**', "', '', '", 'b', "Fax:', '/b", " (828) 252-9994', '\', \'", '\\\', \\\' href="DirectoryEmailForm.aspx?listingid=9758"', "Send Email', '/a", "', '/td", '', 'td align="right" rowspan="3" valign="top" width="10%"', '', 'span style="font-size: 8pt"', '\\\', \\\' href="?, '!--..End Listing--", '', "/td']

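I have bolded the values I want returned and worked out their positions in the list. The code I use to filter them is below. temp_array is the list shown above, temp_count is the current position in that list, and business_listings is the list I append each value to once it is found. Basically, when temp_count equals the position of a wanted value, that value is appended to business_listings.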

    temp_count = 0
    for i in temp_array:
        if temp_count == 0:
            business_listings.append(i)
            temp_count += 1
        elif temp_count == 2:
            business_listings.append(i)
            temp_count += 1
        elif temp_count == 19:
            business_listings.append(i)
            temp_count += 1
        elif temp_count == 19:
            business_listings.append(i)
            temp_count += 1
        elif temp_count == 20:
            business_listings.append(i)
            temp_count += 1
        elif temp_count == 23:
            business_listings.append(i)
            temp_count += 1
        elif temp_count == 27:
            business_listings.append(i)
            temp_count += 1
        elif temp_count == 42:
            business_listings.append(i)
            temp_count += 1
        else:
            count += 1

The output is: ['\\\', \\\' href="?listingid=9758&profileid=2B713K5Z48&action=uweb&url=http%3a%2f%2fwww.jpspa.com" target="_BLANK"'] so the loop filters only the first two values, or nothing at all.

Tags: python, web-scraping

Solution
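
A note on why the loop in the question stalls: temp_count is only incremented inside the branches that match, and the final else increments a different variable (count). After position 0 is appended, temp_count becomes 1, no branch handles 1, and it never changes again. A minimal sketch of index-based picking that avoids the manual counter is shown here; pick_positions and WANTED_POSITIONS are hypothetical names, and the placeholder list stands in for the real split HTML.

```python
# Positions taken from the if/elif chain in the question.
WANTED_POSITIONS = {0, 2, 19, 20, 23, 27, 42}


def pick_positions(items, wanted=WANTED_POSITIONS):
    """Return the items whose index is in `wanted`, in their original order."""
    # enumerate() supplies the index, so no counter has to be kept in sync.
    return [item for index, item in enumerate(items) if index in wanted]


# Stand-in data (an assumption): 50 placeholder strings, not the real markup.
temp_array = [f"item-{n}" for n in range(50)]
business_listings = pick_positions(temp_array)
print(business_listings)
```

Even with the counter fixed, fixed positions are fragile for this page, because listings differ in length (for example, some have an extra address line), so navigating the parsed tree is likely the more robust route.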


This script prints the information for each business:

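import requests
from bs4 import BeautifulSoup


url = 'https://web.ashevillechamber.org/cwt/external/wcpages/wcdirectory/Directory.aspx?CategoryID=1242&Title=Accounting++and++Bookkeeping&AdKeyword=Accounting++and++Bookkeeping'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')


# Each listing starts with a bold business name inside a grey header cell.
for b in soup.select('td[bgcolor="#E6E6E6"] b'):
    business_name = b.text
    # Some listings have no website link; fall back to '-'.
    business_url = b.a['href'] if b.a else '-'
    # The owner name is the first node of the following 90%-wide cell.
    owner = b.find_next('td', width="90%").contents[0]

    # Collect the address lines: every text node after the owner until the
    # first one inside a <b> tag (the phone label).
    # string=True matches any text node (older Beautiful Soup releases
    # spell this text=True).
    addr, current = [], owner.find_next(string=True)
    while not current.find_parent('b'):
        addr.append(current.strip())
        current = current.find_next(string=True)

    addr = '\n'.join(addr)
    # The phone number is the text node right after the bold label.
    phone = current.find_next(string=True).strip()

    print('Business Name :', business_name)
    print('Business URL  :', business_url)
    print('Owner         :', owner)
    print('Phone         :', phone)
    print('Address:')
    print(addr)
    print('-' * 80)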

Prints:

Business Name : Johnson Price Sprinkle PA
Business URL  : ?listingid=9758&profileid=2D7R3B5E4N&action=uweb&url=http%3a%2f%2fwww.jpspa.com
Owner         : Maria Pilos
Phone         : (828) 254-2374
Address:
79 Woodfin Place, Suite 300
Asheville, NC  28801
--------------------------------------------------------------------------------
Business Name : Leah B. Noel, CPA, PC
Business URL  : ?listingid=9656&profileid=549S620J3J&action=uweb&url=http%3a%2f%2fwww.lbnoelcpa.com%2f
Owner         : Ms. Leah Noel
Phone         : 828-333-4529
Address:
14 S. Pack Square #503
Asheville, NC  28801
--------------------------------------------------------------------------------
Business Name : Worley, Woodbery, & Associates, PA
Business URL  : ?listingid=9661&profileid=3L7R304J8X&action=uweb&url=http%3a%2f%2fwww.worleycpa.com%2f
Owner         : Mr. David Worley
Phone         : (828) 271-7997
Address:
7 Orchard Street, Ste. 202
Asheville, NC  28801
--------------------------------------------------------------------------------
Business Name : Peridot Consulting, Inc.
Business URL  : ?listingid=14005&profileid=7L724E5W7E&action=uweb&url=http%3a%2f%2fwww.PeridotConsultingInc.com
Owner         : John Michael  Kledis
Phone         : (828) 242-6971
Address:
PO Box 8904
Asheville, NC  28804
--------------------------------------------------------------------------------
Business Name : DHG
Business URL  : ?listingid=9579&profileid=25711D625I&action=uweb&url=http%3a%2f%2fwww.dhgllp.com%2f
Owner         : Adrienne Bernardi
Phone         : (828) 254-2254
Address:
PO Box 3049
Asheville, NC  28802
--------------------------------------------------------------------------------
Business Name : Gould Killian CPA Group, P.A.
Business URL  : ?listingid=9659&profileid=2P7X216Y66&action=uweb&url=http%3a%2f%2fwww.gk-cpa.com
Owner         : Ed Towson
Phone         : (828) 258-0363
Address:
100 Coxe Avenue
Asheville, NC  28801
--------------------------------------------------------------------------------
Business Name : Michelle Tracz CPA, CFE, PLLC
Business URL  : ?listingid=12921&profileid=610C8H3I7N&action=uweb&url=http%3a%2f%2fwww.michelletraczcpa.com
Owner         : Michelle Tracz
Phone         : (828) 280-2530
Address:
1238 Hendersonville Rd.
Asheville, NC  28803
--------------------------------------------------------------------------------
Business Name : Burleson & Earley, P.A.
Business URL  : ?listingid=10436&profileid=57132N5P9C&action=uweb&url=http%3a%2f%2fwww.burlesonearley.com%2f
Owner         : Bronwyn Burleson, CPA
Phone         : (828) 251-2846
Address:
902 Sand Hill Road
Asheville, NC  28806
--------------------------------------------------------------------------------
Business Name : Carol L. King & Associates, P.A.
Business URL  : ?listingid=10439&profileid=2Z8C7I0B4X&action=uweb&url=http%3a%2f%2fwww.clkcpa.com
Owner         : Carol King
Phone         : (828) 258-2323
Address:
40 North French Broad Avenue
Asheville, NC  28801
--------------------------------------------------------------------------------
Business Name : Goldsmith Molis & Gray
Business URL  : ?listingid=12638&profileid=6C8D2C7F55&action=uweb&url=http%3a%2f%2fwww.gmg-cpa.com
Owner         : Allen Gray
Phone         : (828) 281-3161
Address:
32 Orange St.
Asheville, NC  28801
--------------------------------------------------------------------------------
Business Name : Corliss & Solomon, PLLC
Business URL  : ?listingid=12407&profileid=6T7Y798S1R&action=uweb&url=http%3a%2f%2fwww.candspllc.com
Owner         : Slater Solomon
Phone         : (828) 236-0206
Address:
242 Charlotte St., Suite 1
Asheville, NC  28801
--------------------------------------------------------------------------------
Business Name : Mountain BizWorks
Business URL  : ?listingid=12733&profileid=2L9E9G6A1S&action=uweb&url=http%3a%2f%2fwww.mountainbizworks.org
Owner         : Matthew Raker
Phone         : (828) 253-2834
Address:
153 South Lexington Ave.
Asheville, NC  28801
--------------------------------------------------------------------------------
Business Name : LeBlanc CPA Limited
Business URL  : -
Owner         : Leslie LeBlanc
Phone         : (828) 225-4940
Address:
218 Broadway
Asheville, NC  28801-2347
--------------------------------------------------------------------------------
Business Name : Bolick & Associates, PA, CPA's
Business URL  : -
Owner         : Alan E Bolick, CPA
Phone         : (828) 253-4692
Address:
Central Office Park   Suite 104
56 Central Avenue
Asheville, NC  28801
--------------------------------------------------------------------------------
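
Since the question appends values to a list rather than printing them, one possible follow-up is to collect each listing as a dictionary and write the rows out as CSV. The sketch below uses only the standard library and two records copied from the output above; the field names and the records_to_csv helper are assumptions made for illustration, and the address lines are joined with ", " to keep each record on one line.

```python
import csv
import io

# Hypothetical column names for one scraped listing.
FIELDS = ["name", "url", "owner", "phone", "address"]

# Two records copied by hand from the printed output (not scraped here).
records = [
    {
        "name": "Johnson Price Sprinkle PA",
        "url": "?listingid=9758&profileid=2D7R3B5E4N&action=uweb&url=http%3a%2f%2fwww.jpspa.com",
        "owner": "Maria Pilos",
        "phone": "(828) 254-2374",
        "address": "79 Woodfin Place, Suite 300, Asheville, NC  28801",
    },
    {
        "name": "LeBlanc CPA Limited",
        "url": "-",
        "owner": "Leslie LeBlanc",
        "phone": "(828) 225-4940",
        "address": "218 Broadway, Asheville, NC  28801-2347",
    },
]


def records_to_csv(rows):
    """Serialize a list of listing dictionaries to CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = records_to_csv(records)
print(csv_text)
```

In the scraping loop, the print calls would be replaced by appending one such dictionary per listing.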

EDIT: To parse the URL:

import requests
from bs4 import BeautifulSoup
from urllib.parse import unquote


url = 'https://web.ashevillechamber.org/cwt/external/wcpages/wcdirectory/Directory.aspx?CategoryID=1242&Title=Accounting++and++Bookkeeping&AdKeyword=Accounting++and++Bookkeeping'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')


# Each listing starts with a bold business name inside a grey header cell.
for b in soup.select('td[bgcolor="#E6E6E6"] b'):
    business_name = b.text
    # Some listings have no website link; fall back to '-'.
    business_url = b.a['href'] if b.a else '-'
    owner = b.find_next('td', width="90%").contents[0]

    # Address lines are the text nodes between the owner and the bold phone
    # label (string=True was spelled text=True in older Beautiful Soup).
    addr, current = [], owner.find_next(string=True)
    while not current.find_parent('b'):
        addr.append(current.strip())
        current = current.find_next(string=True)

    addr = '\n'.join(addr)
    phone = current.find_next(string=True).strip()

    print('Business Name :', business_name)
    # Percent-decode the href and keep what follows the last '=' (the url
    # parameter); the '-' placeholder passes through unchanged.
    print('Business URL  :', unquote(business_url).rsplit('=', maxsplit=1)[-1])
    print('Owner         :', owner)
    print('Phone         :', phone)
    print('Address:')
    print(addr)
    print('-' * 80)

Prints:

Business Name : Johnson Price Sprinkle PA
Business URL  : http://www.jpspa.com
Owner         : Maria Pilos
Phone         : (828) 254-2374
Address:
79 Woodfin Place, Suite 300
Asheville, NC  28801
--------------------------------------------------------------------------------
Business Name : Leah B. Noel, CPA, PC
Business URL  : http://www.lbnoelcpa.com/
Owner         : Ms. Leah Noel
Phone         : 828-333-4529
Address:
14 S. Pack Square #503
Asheville, NC  28801
--------------------------------------------------------------------------------

...and so on.
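
Splitting on the last '=' works for the links shown here, but it would cut the result short if a target website address carried its own query string. A slightly more defensive variant, sketched with the standard library's urllib.parse, reads the url parameter by name instead; extract_target_url is a hypothetical helper, not part of the script above.

```python
from urllib.parse import parse_qs, urlparse


def extract_target_url(href, default="-"):
    """Return the decoded `url` query parameter of `href`, or `default`."""
    # urlparse() isolates the query string; parse_qs() splits it into
    # parameters and percent-decodes the values.
    query = urlparse(href).query
    values = parse_qs(query).get("url")
    return values[0] if values else default


print(extract_target_url(
    "?listingid=9758&profileid=2D7R3B5E4N&action=uweb&url=http%3a%2f%2fwww.jpspa.com"
))
print(extract_target_url("-"))
```

The second call covers listings without a website, where the script uses '-' as a placeholder.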
