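python - Filtering the results of a web-scraped table
Problem description
I'm using Python to web scrape a data table found here. Specifically, I want to extract the business name, URL, owner name, street, city, and phone. After running the page through Beautiful Soup and splitting the markup for filtering, it looks like this: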
['\\\', \\\' href="?listingid=9758&profileid=217Y3Q544Y&action=uweb&url=http%3a%2f%2fwww.jpspa.com" target="_BLANK"', 'Johnson Price Sprinkle PA', '/a', '', '/b', '', '/td', '', '/tr', '', '/table', '', '/td', '', '/tr', '', 'tr class="GeneralBody"', '', 'td bgcolor="#808080" height="1"', '', 'img border="0" height="1" src="images/dot_clear.gif" width="1"/', '', '/td', '', '/tr', '', '/table', '', '/td', '', '/tr', '', 'tr class="GeneralBody"', '', 'td align="left" valign="top" width="90%"', 'Maria Pilos', '', '', '79 Woodfin Place, Suite 300', '', '', 'Asheville, NC 28801', '', '', '', 'b', 'Phone:', '/b', '**(828) 254-2374**', '', '', '', 'b', 'Fax:', '/b', '(828) 252-9994', '\', \'', '\\\', \\\' href="DirectoryEmailForm.aspx?listingid=9758"', 'Send Email', '/a', '', '/td', '', 'td align="right" rowspan="3" valign="top" width="10%"', '', 'span style="font-size: 8pt"', '\\\', \\\' href="?, '!--..End Listing--', '', '/td']
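I bolded the values I want returned and worked out their positions in the list. The code I use to filter them is below. temp_array is the list above that needs filtering, temp_count is the current position in the list, and business_listings is the list I append each value to once it is found. Basically, when temp_count equals the position of a wanted value, that value is appended to business_listings.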
temp_count = 0
for i in temp_array:
    if temp_count == 0:
        business_listings.append(i)
        temp_count += 1
    elif temp_count == 2:
        business_listings.append(i)
        temp_count += 1
    elif temp_count == 19:
        business_listings.append(i)
        temp_count += 1
    elif temp_count == 19:
        business_listings.append(i)
        temp_count += 1
    elif temp_count == 20:
        business_listings.append(i)
        temp_count += 1
    elif temp_count == 23:
        business_listings.append(i)
        temp_count += 1
    elif temp_count == 27:
        business_listings.append(i)
        temp_count += 1
    elif temp_count == 42:
        business_listings.append(i)
        temp_count += 1
    else:
        count += 1
This outputs: ['\\\', \\\' href="?listingid=9758&profileid=2B713K5Z48&action=uweb&url=http%3a%2f%2fwww.jpspa.com" target="_BLANK"'] and filters only the first 2 values or nothing at all.
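The position-based selection described above can be sketched more compactly with enumerate and a set of wanted indices. This is only an illustration of the idea, not a fix for the scraping itself: the helper name pick_positions and the sample list are hypothetical, and the positions are the ones that appear in the loop above.

```python
def pick_positions(items, wanted):
    """Return the elements of items whose index is in wanted, in list order."""
    wanted = set(wanted)
    return [item for index, item in enumerate(items) if index in wanted]


# Hypothetical stand-in for temp_array; the real list comes from the split HTML.
temp_array = [f"item{n}" for n in range(45)]

# Positions taken from the loop in the question.
business_listings = pick_positions(temp_array, [0, 2, 19, 20, 23, 27, 42])
print(business_listings)
```

Because enumerate advances the index on every iteration, the counter cannot stall. In the loop above, the else branch increments count rather than temp_count, so temp_count stops advancing after the first match. Fixed positions remain fragile in any case, since listings with a different number of address lines shift every later index.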
Solution
This script prints the information for each business:
import requests
from bs4 import BeautifulSoup

url = 'https://web.ashevillechamber.org/cwt/external/wcpages/wcdirectory/Directory.aspx?CategoryID=1242&Title=Accounting++and++Bookkeeping&AdKeyword=Accounting++and++Bookkeeping'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Each business name is a <b> inside a grey header cell.
for b in soup.select('td[bgcolor="#E6E6E6"] b'):
    business_name = b.text
    business_url = b.a['href'] if b.a else '-'
    # The first node of the next 90%-wide cell is the owner's name.
    owner = b.find_next('td', width="90%").contents[0]
    # Collect the text nodes after the owner until one sits inside a <b> (the phone label).
    addr, current = [], owner.find_next(text=True)
    while not current.find_parent('b'):
        addr.append(current.strip())
        current = current.find_next(text=True)
    addr = '\n'.join(addr)
    # The text node right after the label is the phone number.
    phone = current.find_next(text=True).strip()

    print('Business Name :', business_name)
    print('Business URL :', business_url)
    print('Owner :', owner)
    print('Phone :', phone)
    print('Address:')
    print(addr)
    print('-' * 80)
Prints:
Business Name : Johnson Price Sprinkle PA
Business URL : ?listingid=9758&profileid=2D7R3B5E4N&action=uweb&url=http%3a%2f%2fwww.jpspa.com
Owner : Maria Pilos
Phone : (828) 254-2374
Address:
79 Woodfin Place, Suite 300
Asheville, NC 28801
--------------------------------------------------------------------------------
Business Name : Leah B. Noel, CPA, PC
Business URL : ?listingid=9656&profileid=549S620J3J&action=uweb&url=http%3a%2f%2fwww.lbnoelcpa.com%2f
Owner : Ms. Leah Noel
Phone : 828-333-4529
Address:
14 S. Pack Square #503
Asheville, NC 28801
--------------------------------------------------------------------------------
Business Name : Worley, Woodbery, & Associates, PA
Business URL : ?listingid=9661&profileid=3L7R304J8X&action=uweb&url=http%3a%2f%2fwww.worleycpa.com%2f
Owner : Mr. David Worley
Phone : (828) 271-7997
Address:
7 Orchard Street, Ste. 202
Asheville, NC 28801
--------------------------------------------------------------------------------
Business Name : Peridot Consulting, Inc.
Business URL : ?listingid=14005&profileid=7L724E5W7E&action=uweb&url=http%3a%2f%2fwww.PeridotConsultingInc.com
Owner : John Michael Kledis
Phone : (828) 242-6971
Address:
PO Box 8904
Asheville, NC 28804
--------------------------------------------------------------------------------
Business Name : DHG
Business URL : ?listingid=9579&profileid=25711D625I&action=uweb&url=http%3a%2f%2fwww.dhgllp.com%2f
Owner : Adrienne Bernardi
Phone : (828) 254-2254
Address:
PO Box 3049
Asheville, NC 28802
--------------------------------------------------------------------------------
Business Name : Gould Killian CPA Group, P.A.
Business URL : ?listingid=9659&profileid=2P7X216Y66&action=uweb&url=http%3a%2f%2fwww.gk-cpa.com
Owner : Ed Towson
Phone : (828) 258-0363
Address:
100 Coxe Avenue
Asheville, NC 28801
--------------------------------------------------------------------------------
Business Name : Michelle Tracz CPA, CFE, PLLC
Business URL : ?listingid=12921&profileid=610C8H3I7N&action=uweb&url=http%3a%2f%2fwww.michelletraczcpa.com
Owner : Michelle Tracz
Phone : (828) 280-2530
Address:
1238 Hendersonville Rd.
Asheville, NC 28803
--------------------------------------------------------------------------------
Business Name : Burleson & Earley, P.A.
Business URL : ?listingid=10436&profileid=57132N5P9C&action=uweb&url=http%3a%2f%2fwww.burlesonearley.com%2f
Owner : Bronwyn Burleson, CPA
Phone : (828) 251-2846
Address:
902 Sand Hill Road
Asheville, NC 28806
--------------------------------------------------------------------------------
Business Name : Carol L. King & Associates, P.A.
Business URL : ?listingid=10439&profileid=2Z8C7I0B4X&action=uweb&url=http%3a%2f%2fwww.clkcpa.com
Owner : Carol King
Phone : (828) 258-2323
Address:
40 North French Broad Avenue
Asheville, NC 28801
--------------------------------------------------------------------------------
Business Name : Goldsmith Molis & Gray
Business URL : ?listingid=12638&profileid=6C8D2C7F55&action=uweb&url=http%3a%2f%2fwww.gmg-cpa.com
Owner : Allen Gray
Phone : (828) 281-3161
Address:
32 Orange St.
Asheville, NC 28801
--------------------------------------------------------------------------------
Business Name : Corliss & Solomon, PLLC
Business URL : ?listingid=12407&profileid=6T7Y798S1R&action=uweb&url=http%3a%2f%2fwww.candspllc.com
Owner : Slater Solomon
Phone : (828) 236-0206
Address:
242 Charlotte St., Suite 1
Asheville, NC 28801
--------------------------------------------------------------------------------
Business Name : Mountain BizWorks
Business URL : ?listingid=12733&profileid=2L9E9G6A1S&action=uweb&url=http%3a%2f%2fwww.mountainbizworks.org
Owner : Matthew Raker
Phone : (828) 253-2834
Address:
153 South Lexington Ave.
Asheville, NC 28801
--------------------------------------------------------------------------------
Business Name : LeBlanc CPA Limited
Business URL : -
Owner : Leslie LeBlanc
Phone : (828) 225-4940
Address:
218 Broadway
Asheville, NC 28801-2347
--------------------------------------------------------------------------------
Business Name : Bolick & Associates, PA, CPA's
Business URL : -
Owner : Alan E Bolick, CPA
Phone : (828) 253-4692
Address:
Central Office Park Suite 104
56 Central Avenue
Asheville, NC 28801
--------------------------------------------------------------------------------
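If the values should end up in a list, as with business_listings in the question, rather than being printed, each listing can be stored as a dictionary and written out with the standard csv module. The sketch below is an assumption about how one might do that: the field names and the helper name to_csv are hypothetical, and the two rows are copied from the listings on this page.

```python
import csv
import io

# Hypothetical field names for one listing.
FIELDS = ["name", "url", "owner", "phone", "address"]


def to_csv(listings):
    """Serialize a list of listing dicts to CSV text, header row included."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(listings)
    return buffer.getvalue()


business_listings = [
    {
        "name": "Johnson Price Sprinkle PA",
        "url": "http://www.jpspa.com",
        "owner": "Maria Pilos",
        "phone": "(828) 254-2374",
        "address": "79 Woodfin Place, Suite 300\nAsheville, NC 28801",
    },
    {
        "name": "LeBlanc CPA Limited",
        "url": "-",
        "owner": "Leslie LeBlanc",
        "phone": "(828) 225-4940",
        "address": "218 Broadway\nAsheville, NC 28801-2347",
    },
]

print(to_csv(business_listings))
```

Inside the scraping loop, the print calls would be replaced by appending one such dictionary per business. The csv module quotes the multi-line address field, so it survives a round trip through csv.DictReader.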
EDIT: To parse the URL:
import requests
from bs4 import BeautifulSoup
from urllib.parse import unquote

url = 'https://web.ashevillechamber.org/cwt/external/wcpages/wcdirectory/Directory.aspx?CategoryID=1242&Title=Accounting++and++Bookkeeping&AdKeyword=Accounting++and++Bookkeeping'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for b in soup.select('td[bgcolor="#E6E6E6"] b'):
    business_name = b.text
    business_url = b.a['href'] if b.a else '-'
    owner = b.find_next('td', width="90%").contents[0]
    addr, current = [], owner.find_next(text=True)
    while not current.find_parent('b'):
        addr.append(current.strip())
        current = current.find_next(text=True)
    addr = '\n'.join(addr)
    phone = current.find_next(text=True).strip()

    print('Business Name :', business_name)
    print('Business URL :', unquote(business_url).rsplit('=', maxsplit=1)[-1])
    print('Owner :', owner)
    print('Phone :', phone)
    print('Address:')
    print(addr)
    print('-' * 80)
Prints:
Business Name : Johnson Price Sprinkle PA
Business URL : http://www.jpspa.com
Owner : Maria Pilos
Phone : (828) 254-2374
Address:
79 Woodfin Place, Suite 300
Asheville, NC 28801
--------------------------------------------------------------------------------
Business Name : Leah B. Noel, CPA, PC
Business URL : http://www.lbnoelcpa.com/
Owner : Ms. Leah Noel
Phone : 828-333-4529
Address:
14 S. Pack Square #503
Asheville, NC 28801
--------------------------------------------------------------------------------
...and so on.
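Splitting on the last '=' works for the hrefs shown here, but it would cut a target address short if that address itself contained an '='. A slightly more defensive sketch, assuming the link always carries the target in a query parameter named url (as the hrefs on this page do), reads that parameter with urllib.parse instead. The helper name extract_target is hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def extract_target(href, default="-"):
    """Return the decoded value of the 'url' query parameter, or default."""
    query = urlsplit(href).query
    values = parse_qs(query).get("url")
    return values[0] if values else default


href = "?listingid=9758&profileid=2D7R3B5E4N&action=uweb&url=http%3a%2f%2fwww.jpspa.com"
print(extract_target(href))
print(extract_target("-"))
```

parse_qs already percent-decodes the values, so no separate unquote call is needed, and listings without a link fall back to the same '-' placeholder the script uses.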