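python - How to scrape content from an HTML tree that has no attribute values
Question
I'm having trouble scraping HTML data and extracting specific fields. Here is the HTML: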
```
<li class="highlight">
Relationship Issues
</li>
<li class="highlight">
Depression
</li>
<li class="highlight">
Spirituality
</li>
</ul>
</div>
</div>, <div class="spec-list attributes-issues">
<h5 class="spec-subcat">Issues</h5>
<div class="col-split-xs-1 col-split-md-2">
<ul class="attribute-list copy-small">
<li class="">
ADHD
</li>
<li class="">
Alcohol Use
</li>
<li class="">
Anger Management
</li>
<li class="">
Antisocial Personality
</li>
<li class="">
Anxiety
</li>
<li class="">
Behavioral Issues
</li>
<li class="">
Bipolar Disorder
</li>
<li class="">
Borderline Personality
</li>
<li class="">
Career Counseling
</li>
<li class="">
Child or Adolescent
</li>
<li class="">
Chronic Illness
</li>
<li class="">
Chronic Pain
</li>
<li class="">
Coping Skills
</li>
<li class="">
Divorce
</li>
<li class="">
Domestic Abuse
</li>
<li class="">
Domestic Violence
</li>
<li class="">
Eating Disorders
</li>
<li class="">
Emotional Disturbance
</li>
<li class="">
Family Conflict
</li>
<li class="">
Grief
</li>
<li class="">
Internet Addiction
</li>
<li class="">
Life Coaching
</li>
<li class="">
Life Transitions
</li>
<li class="">
Marital and Premarital
</li>
<li class="">
Men's Issues
</li>
<li class="">
Narcissistic Personality
</li>
<li class="">
Obsessive-Compulsive (OCD)
</li>
<li class="">
Parenting
</li>
<li class="">
School Issues
</li>
<li class="">
Self Esteem
</li>
<li class="">
Self-Harming
</li>
<li class="">
Stress
</li>
<li class="">
Suicidal Ideation
</li>
<li class="">
Transgender
</li>
<li class="">
Trauma and PTSD
</li>
<li class="">
Women's Issues
</li>
</ul>
</div>
</div>, <div class="spec-list attributes-mental-health">
<h5 class="spec-subcat">Mental Health</h5>
<div class="col-split-xs-1 col-split-md-2">
<ul class="attribute-list copy-small">
<li class="">
Dissociative Disorders
</li>
<li class="">
Elderly Persons Disorders
</li>
<li class="">
Impulse Control Disorders
</li>
<li class="">
Mood Disorders
</li>
<li class="">
Personality Disorders
</li>
<li class="">
Psychosis
</li>
<li class="">
Thinking Disorders
</li>
</ul>
</div>
</div>, <div class="spec-list attributes-sexuality">
<h5 class="spec-subcat">Sexuality</h5>
<div class="col-split-xs-1 col-split-md-2">
<ul class="attribute-list copy-small">
<li class="">
Bisexual
</li>
<li class="">
Lesbian
</li>
<li class="">
Gay
</li>
</ul>
</div>
</div>]
```
Here is my code:
```
import requests
from bs4 import BeautifulSoup
from lxml import html
import html5lib
import re
import pandas as pd
headers = {'User-Agent': 'Mozilla/5.0'}
URL = "https://www.psychologytoday.com/us/therapists/gary-l-phillips-northfield-il/43578"
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, parser='html5lib', features="lxml")
specialties = soup.find_all('div', {'class': 'spec-list attributes-top'})
issues = soup.find_all('div', {'class': 'spec-list attributes-issues'})
mental_health = soup.find_all('div', {'class': 'spec-list attributes-mental-health'})
sexuality = soup.find_all('div', {'class': 'spec-list attributes-sexuality'})
```
The ideal result is a CSV (or Excel) file with the following output:
Name: {name}
Location: {location}
Phone Number: {Phone_number}
Specialties: {Specialities_{count}}
Issues: {Issues_{count}}
Mental Health Care: {Mental_Health_{count}}
I'd like to feed it a general directory site and have the code scrape the HTML data for these fields. The URL is: https://www.psychologytoday.com/us/therapists/gary-l-phillips-northfield-il/43578 Thanks!
Solution
To get the required information from the page, you can use the following example:
```
import requests
from bs4 import BeautifulSoup

url = 'https://www.psychologytoday.com/us/therapists/gary-l-phillips-northfield-il/43578'
# Send a browser-like User-Agent header, as in the question.
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:82.0) Gecko/20100101 Firefox/82.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

name = soup.select_one('h1[itemprop="name"]').get_text(strip=True)
location = soup.select_one('.address-data').get_text(strip=True, separator=' ')
phone_number = soup.select_one('.phone-number').get_text(strip=True, separator=' ')

# The <li> items carry no useful attributes, so match each section by the text
# of its <h5> heading and take the <li> items inside the <div> right after it.
# :-soup-contains replaces :contains, which is deprecated since soupsieve 2.1
# (on older versions, use :contains instead).
specialties = [li.get_text(strip=True) for li in soup.select('h5:-soup-contains("Specialties") + div li')]
issues = [li.get_text(strip=True) for li in soup.select('h5:-soup-contains("Issues") + div li')]
mental_health = [li.get_text(strip=True) for li in soup.select('h5:-soup-contains("Mental Health") + div li')]

print('Name:')
print(name)
print('Location:')
print(location)
print('Phone Number:')
print(phone_number)
print('Specialties:')
print(*specialties, sep=', ')
print('Issues:')
print(*issues, sep=', ')
print('Mental Health')
print(*mental_health, sep=', ')
```
Prints:
```
Name:
Gary L Phillips
Location:
550 Sunset Ridge Rd Northfield, IL 60093
Phone Number:
(847) 212-1496
Specialties:
Relationship Issues, Depression, Spirituality
Issues:
ADHD, Alcohol Use, Anger Management, Antisocial Personality, Anxiety, Behavioral Issues, Bipolar Disorder, Borderline Personality, Career Counseling, Child or Adolescent, Chronic Illness, Chronic Pain, Coping Skills, Divorce, Domestic Abuse, Domestic Violence, Eating Disorders, Emotional Disturbance, Family Conflict, Grief, Internet Addiction, Life Coaching, Life Transitions, Marital and Premarital, Men's Issues, Narcissistic Personality, Obsessive-Compulsive (OCD), Parenting, School Issues, Self Esteem, Self-Harming, Stress, Suicidal Ideation, Transgender, Trauma and PTSD, Women's Issues
Mental Health
Dissociative Disorders, Elderly Persons Disorders, Impulse Control Disorders, Mood Disorders, Personality Disorders, Psychosis, Thinking Disorders
```
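The question asks for a CSV file with numbered columns such as `Specialties_{count}`, while the example only prints the fields. A minimal sketch of that last step, using only the standard library, might look like the following. The helper names `build_row` and `rows_to_csv` and the exact column naming are assumptions for illustration, not part of the original answer; the sample values are a shortened copy of the output above.

```python
import csv
import io


def build_row(name, location, phone_number, specialties, issues, mental_health):
    """Flatten one profile into a dict with numbered columns (hypothetical helper)."""
    row = {"Name": name, "Location": location, "Phone Number": phone_number}
    numbered = {
        "Specialties": specialties,
        "Issues": issues,
        "Mental_Health": mental_health,
    }
    for prefix, values in numbered.items():
        # Produces Specialties_1, Specialties_2, ... as requested in the question.
        for count, value in enumerate(values, start=1):
            row[f"{prefix}_{count}"] = value
    return row


def rows_to_csv(rows):
    """Serialise dict rows to CSV text; the header is the union of all keys."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    # restval fills columns that a profile with fewer items does not have.
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Sample values taken from the printed output (lists shortened).
row = build_row(
    name="Gary L Phillips",
    location="550 Sunset Ridge Rd Northfield, IL 60093",
    phone_number="(847) 212-1496",
    specialties=["Relationship Issues", "Depression", "Spirituality"],
    issues=["ADHD", "Alcohol Use", "Anger Management"],
    mental_health=["Dissociative Disorders", "Mood Disorders"],
)
csv_text = rows_to_csv([row])
print(csv_text)
```

To write a file instead of a string, the same `csv.DictWriter` can be given a file opened with `open(path, "w", newline="", encoding="utf-8")`. Collecting one row per profile URL in a list before calling `rows_to_csv` keeps the header consistent even when profiles list different numbers of items.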