I have a problem web scraping a university course page

Problem description

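Hi, I am trying to scrape this University of Reading course page: http://www.reading.ac.uk/ready-to-study/study/subject-area/modern-languages-and-european-studies-ug/ba-spanish-and-history.aspx but I cannot extract the course duration from it. Can anyone help me? I am using the code below.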

import re

# `soup` (a BeautifulSoup object) and `course_data` (a dict) are created earlier in the script.
# Set defaults first so the final print works even when nothing is found.
course_data['Duration'] = 'Not mentioned'
course_data['Duration_Time'] = 'Not mentioned'

duration_title = soup.find('li', text=re.compile(r'Course duration', re.IGNORECASE))
if duration_title:
    duration = duration_title.find_next_sibling('strong')
    if duration:
        duration_text = duration.get_text()
        # Match a decimal such as "0.5" or a whole number such as "4".
        duration_ = re.search(r"\d+(?:\.\d+)?", duration_text)
        if duration_ is not None:
            value = duration_.group()  # always a string, so compare with strings
            if value == '1':
                course_data['Duration'] = value
                course_data['Duration_Time'] = 'Year'
            elif value == '0.5':
                course_data['Duration'] = '6'
                course_data['Duration_Time'] = 'Months'
            else:
                course_data['Duration'] = value
                course_data['Duration_Time'] = 'Years'
print('Duration: ', str(course_data['Duration']) + ' / ' + course_data['Duration_Time'])

Tags: web, web-scraping, web-scraping-language

Solution

Try searching by text only, and drop the li tag name:

soup.find(text=re.compile(r'Course duration', re.IGNORECASE))
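A likely reason the text-only search helps: when a tag name is given, BeautifulSoup tests the pattern against each tag's .string, which is None for a tag with more than one child, so an li that holds both a label and a nested strong never matches. The sketch below illustrates this on a made-up HTML snippet; the markup in SAMPLE_HTML is an assumption, not the real page, and string= is the newer name for the text= argument used above.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real page may be structured differently.
SAMPLE_HTML = """
<ul>
  <li>Start date: <strong>September</strong></li>
  <li>Course duration: <strong>4 years</strong></li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
pattern = re.compile(r"Course duration", re.IGNORECASE)

# With a tag name, the pattern is tested against each <li>'s .string.
# That is None here because the <li> has two children (text and <strong>).
by_tag = soup.find("li", string=pattern)

# Without a tag name, the text nodes themselves are searched.
label = soup.find(string=pattern)

duration_text = None
if label is not None:
    # Walk forward from the matched text node to the next <strong> element.
    strong = label.find_next("strong")
    if strong is not None:
        duration_text = strong.get_text(strip=True)

print(by_tag)         # None
print(duration_text)  # 4 years
```

If the value on the real page is not in a strong element after the label, label.parent (the enclosing tag) is another place to look.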
