How to scrape content from an HTML tree without attribute values

Problem description

I'm having trouble scraping HTML data and extracting specific fields. Here is the HTML:

```

<li class="highlight">
                                                    Relationship Issues
                                            </li>
<li class="highlight">
                                                    Depression
                                            </li>
<li class="highlight">
                                                    Spirituality
                                            </li>
</ul>
</div>
</div>, <div class="spec-list attributes-issues">
<h5 class="spec-subcat">Issues</h5>
<div class="col-split-xs-1 col-split-md-2">
<ul class="attribute-list copy-small">
<li class="">
                                                            ADHD
                                                    </li>
<li class="">
                                                            Alcohol Use
                                                    </li>
<li class="">
                                                            Anger Management
                                                    </li>
<li class="">
                                                            Antisocial Personality
                                                    </li>
<li class="">
                                                            Anxiety
                                                    </li>
<li class="">
                                                            Behavioral Issues
                                                    </li>
<li class="">
                                                            Bipolar Disorder
                                                    </li>
<li class="">
                                                            Borderline Personality
                                                    </li>
<li class="">
                                                            Career Counseling
                                                    </li>
<li class="">
                                                            Child or Adolescent
                                                    </li>
<li class="">
                                                            Chronic Illness
                                                    </li>
<li class="">
                                                            Chronic Pain
                                                    </li>
<li class="">
                                                            Coping Skills
                                                    </li>
<li class="">
                                                            Divorce
                                                    </li>
<li class="">
                                                            Domestic Abuse
                                                    </li>
<li class="">
                                                            Domestic Violence
                                                    </li>
<li class="">
                                                            Eating Disorders
                                                    </li>
<li class="">
                                                            Emotional Disturbance
                                                    </li>
<li class="">
                                                            Family Conflict
                                                    </li>
<li class="">
                                                            Grief
                                                    </li>
<li class="">
                                                            Internet Addiction
                                                    </li>
<li class="">
                                                            Life Coaching
                                                    </li>
<li class="">
                                                            Life Transitions
                                                    </li>
<li class="">
                                                            Marital and Premarital
                                                    </li>
<li class="">
                                                            Men's Issues
                                                    </li>
<li class="">
                                                            Narcissistic Personality
                                                    </li>
<li class="">
                                                            Obsessive-Compulsive (OCD)
                                                    </li>
<li class="">
                                                            Parenting
                                                    </li>
<li class="">
                                                            School Issues
                                                    </li>
<li class="">
                                                            Self Esteem
                                                    </li>
<li class="">
                                                            Self-Harming
                                                    </li>
<li class="">
                                                            Stress
                                                    </li>
<li class="">
                                                            Suicidal Ideation
                                                    </li>
<li class="">
                                                            Transgender
                                                    </li>
<li class="">
                                                            Trauma and PTSD
                                                    </li>
<li class="">
                                                            Women's Issues
                                                    </li>
</ul>
</div>
</div>, <div class="spec-list attributes-mental-health">
<h5 class="spec-subcat">Mental Health</h5>
<div class="col-split-xs-1 col-split-md-2">
<ul class="attribute-list copy-small">
<li class="">
                                                            Dissociative Disorders
                                                    </li>
<li class="">
                                                            Elderly Persons Disorders
                                                    </li>
<li class="">
                                                            Impulse Control Disorders
                                                    </li>
<li class="">
                                                            Mood Disorders
                                                    </li>
<li class="">
                                                            Personality Disorders
                                                    </li>
<li class="">
                                                            Psychosis
                                                    </li>
<li class="">
                                                            Thinking Disorders
                                                    </li>
</ul>
</div>
</div>, <div class="spec-list attributes-sexuality">
<h5 class="spec-subcat">Sexuality</h5>
<div class="col-split-xs-1 col-split-md-2">
<ul class="attribute-list copy-small">
<li class="">
                                                            Bisexual
                                                    </li>
<li class="">
                                                            Lesbian
                                                    </li>
<li class="">
                                                            Gay
                                                    </li>
</ul>
</div>
</div>]

```

Here is my code:

```
import requests
from bs4 import BeautifulSoup
from lxml import html
import html5lib
import re
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0'}
URL = "https://www.psychologytoday.com/us/therapists/gary-l-phillips-northfield-il/43578"


page = requests.get(URL, headers=headers)

soup = BeautifulSoup(page.content, parser='html5lib', features="lxml")

specialties = soup.find_all('div', {'class': 'spec-list attributes-top'})
issues = soup.find_all('div', {'class': 'spec-list attributes-issues'})
mental_health = soup.find_all('div', {'class': 'spec-list attributes-mental-health'})
sexuality = soup.find_all('div', {'class': 'spec-list attributes-sexuality'})

```

The ideal result would be a CSV (or Excel) file with the following output:

Name: {name}
Location: {location}
Phone Number: {Phone_number}
Specialties: {Specialities_{count}}
Issues: {Issues_{count}}
Mental Health Care: {Mental_Health_{count}}

I'd like to feed it a general directory site and have the code scrape the HTML for these fields. The URL is: https://www.psychologytoday.com/us/therapists/gary-l-phillips-northfield-il/43578 Thanks!

Tags: python, html, web-scraping

Solution


To get the required information from the page, you can use the following example:

```
import requests
from bs4 import BeautifulSoup


url = 'https://www.psychologytoday.com/us/therapists/gary-l-phillips-northfield-il/43578'
# Send a browser-like User-Agent header with the request.
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:82.0) Gecko/20100101 Firefox/82.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

# Single-value fields.
name = soup.select_one('h1[itemprop="name"]').get_text(strip=True)
location = soup.select_one('.address-data').get_text(strip=True, separator=' ')
phone_number = soup.select_one('.phone-number').get_text(strip=True, separator=' ')

# The <li> items carry no useful attributes, so locate each list through the
# text of the <h5> heading that sits right before its <div>.
# ':-soup-contains' replaces the deprecated ':contains' (needs soupsieve 2.1+).
specialties = [li.get_text(strip=True) for li in soup.select('h5:-soup-contains("Specialties") + div li')]
issues = [li.get_text(strip=True) for li in soup.select('h5:-soup-contains("Issues") + div li')]
mental_health = [li.get_text(strip=True) for li in soup.select('h5:-soup-contains("Mental Health") + div li')]

print('Name:')
print(name)
print('Location:')
print(location)
print('Phone Number:')
print(phone_number)
print('Specialties:')
print(*specialties, sep=', ')
print('Issues:')
print(*issues, sep=', ')
print('Mental Health')
print(*mental_health, sep=', ')
```

Prints:

Name:
Gary L Phillips
Location:
550 Sunset Ridge Rd Northfield, IL 60093
Phone Number:
(847) 212-1496
Specialties:
Relationship Issues, Depression, Spirituality
Issues:
ADHD, Alcohol Use, Anger Management, Antisocial Personality, Anxiety, Behavioral Issues, Bipolar Disorder, Borderline Personality, Career Counseling, Child or Adolescent, Chronic Illness, Chronic Pain, Coping Skills, Divorce, Domestic Abuse, Domestic Violence, Eating Disorders, Emotional Disturbance, Family Conflict, Grief, Internet Addiction, Life Coaching, Life Transitions, Marital and Premarital, Men's Issues, Narcissistic Personality, Obsessive-Compulsive (OCD), Parenting, School Issues, Self Esteem, Self-Harming, Stress, Suicidal Ideation, Transgender, Trauma and PTSD, Women's Issues
Mental Health
Dissociative Disorders, Elderly Persons Disorders, Impulse Control Disorders, Mood Disorders, Personality Disorders, Psychosis, Thinking Disorders
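
The question asks for a CSV file rather than printed output, so here is a minimal sketch of one way to get there. It runs against a small inline snippet instead of the live page, so it can be tried offline. The snippet is a trimmed stand-in for the markup in the question; the heading markup of the Specialties block is an assumption (that part of the HTML is cut off in the question) and is taken to mirror the other sections. The helper names `extract_sections` and `to_csv`, the one-row-per-profile layout, and the `'; '` separator are illustrative choices, not something the original answer specifies.

```python
import csv
import io

from bs4 import BeautifulSoup

# Trimmed stand-in for the profile markup; the Specialties heading is assumed.
SAMPLE_HTML = """
<h1 itemprop="name">Gary L Phillips</h1>
<div class="spec-list attributes-top">
  <h5 class="spec-subcat">Specialties</h5>
  <div><ul>
    <li class="highlight"> Relationship Issues </li>
    <li class="highlight"> Depression </li>
  </ul></div>
</div>
<div class="spec-list attributes-issues">
  <h5 class="spec-subcat">Issues</h5>
  <div><ul>
    <li class=""> ADHD </li>
    <li class=""> Alcohol Use </li>
  </ul></div>
</div>
<div class="spec-list attributes-mental-health">
  <h5 class="spec-subcat">Mental Health</h5>
  <div><ul>
    <li class=""> Mood Disorders </li>
  </ul></div>
</div>
"""


def extract_sections(html):
    """Map each section heading (<h5> text) to the texts of its <li> items."""
    soup = BeautifulSoup(html, "html.parser")
    sections = {}
    for block in soup.select("div.spec-list"):
        heading = block.find("h5")
        if heading is None:
            continue  # skip blocks without a heading
        title = heading.get_text(strip=True)
        sections[title] = [li.get_text(strip=True) for li in block.select("li")]
    return sections


def to_csv(name, sections):
    """Return CSV text with one row per profile; list fields are joined with '; '."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["Name"] + list(sections))
    writer.writeheader()
    row = {"Name": name}
    row.update({title: "; ".join(items) for title, items in sections.items()})
    writer.writerow(row)
    return buffer.getvalue()


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
name = soup.select_one('h1[itemprop="name"]').get_text(strip=True)
sections = extract_sections(SAMPLE_HTML)
csv_text = to_csv(name, sections)
print(csv_text)
```

Looping over every `div.spec-list` block avoids hard-coding the heading names, which may help when the same code is pointed at other profiles whose sections differ. To write an actual file, the same `csv.DictWriter` calls can target `open('output.csv', 'w', newline='')` instead of the in-memory buffer.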
