How to Use Python to Crawl a Basic Website to Create a List of URLs, Then Print the Text of Each URL

Problem Description

I want to use Python to scrape all of the links on the Montana Code Annotated Civil Procedure URL, along with all of the pages those links point to, and ultimately capture the substantive text at the final link. The problem is that the chapters the base URL links to also have URLs for their parts, and those part URLs hold the links to the text I want. So it is a "three layers deep" URL structure whose URL naming convention does not use sequential endings such as 1, 2, 3, 4, and so on.
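One way to handle a link hierarchy like this (a sketch, not part of the original scripts) is to collect the hrefs on each page and resolve them against that page's URL, one level at a time. The helper below uses only the standard library's `HTMLParser` and `urljoin`, so the link-resolution logic is easy to test on its own; in a real crawl you would fetch each page (e.g. with `requests.get`) and feed its HTML through the same helper at each level.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collects every href found in <a> tags."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def collect_links(html, base_url):
    """Return absolute URLs for every link in `html`, resolved against base_url."""
    parser = LinkCollector()
    parser.feed(html)
    return [urljoin(base_url, h) for h in parser.hrefs]

# To go one level deeper, fetch each collected URL and call collect_links
# again on its HTML; repeating this walks the title -> chapter -> part ->
# section hierarchy without relying on sequential URL endings.
```

Because `urljoin` handles relative paths, the same function works whether a page's links are written as `./section_0010/...`, `../part_0010/...`, or full URLs.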

I'm new to Python, so I broke the task down into steps.

First, I used this to pull the text from a single URL that has the substantive text (i.e., three layers deep):

import requests
from bs4 import BeautifulSoup
 
url = 'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0010/0250-0190-0010-0010.html'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

href_elem = soup.find('div', class_='mca-content mca-toc')

# write the section text to a file; the with-block closes it automatically
with open("Rsync_Test.txt", "w") as f:
    print(href_elem.text, "PAGE_END", file=f)

Second, I created a list of the URLs and exported it to a .txt file:

import os
from bs4 import BeautifulSoup
import urllib.request

html_page = urllib.request.urlopen("http://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/sections_index.html")
soup = BeautifulSoup(html_page, "html.parser")
url_base="https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/"

for link in soup.findAll('a'):
    print(url_base+link.get('href')[2:])

os.chdir("/home/rsync/Downloads/")
with open("All_URLs.txt", "w") as f:
    for link in soup.findAll('a'):
        print(url_base+link.get('href')[2:], file = f)

Third, I tried to scrape the text from the resulting list of URLs:

import os
import requests
from bs4 import BeautifulSoup

url_lst = [
    'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0010/0250-0190-0010-0010.html',
    'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0020/0250-0190-0010-0020.html',
    'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0030/0250-0190-0010-0030.html',
    'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0040/0250-0190-0010-0040.html',
    'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0050/0250-0190-0010-0050.html',
    'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0060/0250-0190-0010-0060.html',
    'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0070/0250-0190-0010-0070.html',
    'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0080/0250-0190-0010-0080.html',
    'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0090/0250-0190-0010-0090.html',
    'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0100/0250-0190-0010-0100.html',
    'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0110/0250-0190-0010-0110.html',
    'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0120/0250-0190-0010-0120.html',
    'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0130/0250-0190-0010-0130.html',
    'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0140/0250-0190-0010-0140.html',
    'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0150/0250-0190-0010-0150.html',
    'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0160/0250-0190-0010-0160.html'
]

for link in url_lst:
    page = requests.get(link)
    soup = BeautifulSoup(page.content, 'html.parser')
    
    href_elem = soup.find('div', class_='mca-content mca-toc')
    
    for link in url_lst:
        with open("Rsync_Test.txt", "w") as f:
            print(href_elem.text,"PAGE_END", file = f)
    f.close()

My plan is to put all of this into one script (after I figure out how to pull the three-layer-deep URLs from the base URL). But the third script iterates over itself without printing a separate page for each URL, and ends up producing only the text from the last URL.

Any tips on how to fix the third script so that it scrapes and prints the text from all 16 URLs produced by the second script would be welcome! So would ideas on how to "roll it all together" into something less convoluted.

Tags: python, web-scraping

Solution


You are iterating over url_lst twice.

Assuming you want the text of every page written to the file: remove the duplicated for loop, collect the scraped text into a list as you go, and then write that list to the file in its own for loop:

import os
import requests
from bs4 import BeautifulSoup

url_lst = [
    'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0010/0250-0190-0010-0010.html',
    'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0020/0250-0190-0010-0020.html',
    'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0030/0250-0190-0010-0030.html',
    'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0040/0250-0190-0010-0040.html',
    'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0050/0250-0190-0010-0050.html',
    'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0060/0250-0190-0010-0060.html',
    'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0070/0250-0190-0010-0070.html',
    'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0080/0250-0190-0010-0080.html',
    'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0090/0250-0190-0010-0090.html',
    'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0100/0250-0190-0010-0100.html',
    'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0110/0250-0190-0010-0110.html',
    'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0120/0250-0190-0010-0120.html',
    'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0130/0250-0190-0010-0130.html',
    'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0140/0250-0190-0010-0140.html',
    'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0150/0250-0190-0010-0150.html',
    'https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/section_0160/0250-0190-0010-0160.html'
]

new_url_list = []  # despite the name, this holds the scraped text of each page

for link in url_lst:
    page = requests.get(link)
    soup = BeautifulSoup(page.content, 'html.parser')
    
    href_elem = soup.find('div', class_='mca-content mca-toc')
    
    new_url_list.append(href_elem.text)
# open the file once and write every scraped page into it
with open('output.txt', 'w', encoding='utf-8') as f:
    for text in new_url_list:
        f.write(text)

This outputs a file like this:


Montana Code Annotated 2019



      TITLE 25. CIVIL PROCEDURE

    


      CHAPTER 19. UNIFORM DISTRICT COURT RULES

    


      Part 1. Rules

    


      Form Of Papers Presented For Filing

    




Rule 1 - Form of Papers Presented for Filing.



                (a) Papers Defined. The word "papers" as used in this Rule includes all documents and copies except exhibits and records on appeal from lower courts.

            


                (b) Pagination, Printing, Etc. All papers shall be:

            


                (1) Typewritten, printed or equivalent;

            


                (2) Clear and permanent;

            


                (3) Equally legible to printing;

            


                (4) Of type not smaller than pica;

            


                (5) Only on standard quality opaque, unglazed, recycled paper, 8 1/2" x 11" in size.

            


                (6) Printed one side only, except copies of briefs may be printed on both sides. The original brief shall be printed on one side.

            


                (7) Lines unnumbered or numbered consecutively from the top;

            


                (8) Spaced one and one-half or double;

            


                (9) Page numbered consecutively at the bottom; and

            


                (10) Bound firmly at the top. Matters such as property descriptions or direct quotes may be single spaced. Extraneous documents not in the above format and not readily conformable may be filed in their original form and length.

            


                (c) Format. The first page of all papers shall conform to the following:

And so on, through Rule 16 in the data.
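As for rolling the second and third scripts into one: here is a sketch, not a definitive implementation. It assumes the index page's hrefs begin with `./` (which is what the `[2:]` slice in the second script suggests), reuses the `Rsync_Test.txt` filename from the first script, and appends a `PAGE_END` marker after each page as the asker's original code did. The file is opened once, outside the loop, so earlier pages are not overwritten.

```python
import requests
from bs4 import BeautifulSoup

INDEX_URL = "https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/sections_index.html"
URL_BASE = "https://leg.mt.gov/bills/mca/title_0250/chapter_0190/part_0010/"

def absolutize(href, base=URL_BASE):
    """Turn a relative href like './section_0010/...' into a full URL,
    mirroring the [2:] slice used in the second script."""
    return base + href[2:] if href.startswith("./") else base + href

def main():
    # Step 2: build the list of section URLs from the index page.
    index = BeautifulSoup(requests.get(INDEX_URL).content, "html.parser")
    section_urls = [absolutize(a["href"]) for a in index.find_all("a") if a.get("href")]

    # Step 3: scrape each section and write everything to ONE file,
    # opened once so each page is appended rather than overwritten.
    with open("Rsync_Test.txt", "w", encoding="utf-8") as f:
        for url in section_urls:
            soup = BeautifulSoup(requests.get(url).content, "html.parser")
            div = soup.find("div", class_="mca-content mca-toc")
            if div is not None:
                print(div.text, "PAGE_END", file=f)

if __name__ == "__main__":
    main()
```

The `if div is not None` guard matters because the index page may also link to pages (e.g. navigation links) that lack the `mca-content mca-toc` div; without it, `div.text` would raise an AttributeError.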

