Need to extract HTML links in Jupyter

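Problem description

I have looked through as many repositories as I could find while writing code to extract the elements from a Wikipedia page, but the elements for each city are missing.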

    import pandas as pd
    url='https://en.wikipedia.org/wiki/List_of_cities_in_New_York'

    df=pd.read_html(url, header=0)[0]

    df.head()


    import pandas
    import requests
    from bs4 import BeautifulSoup
    website_text = requests.get('https://en.wikipedia.org/wiki/List_of_cities_in_New_York').text
    # The page is HTML, so parse it with an HTML parser rather than 'xml'
    soup = BeautifulSoup(website_text, 'html.parser')

    table = soup.find('table', {'class': 'wikitable sortable'})

    table_rows = table.find_all('tr')

    # Collect the text of every <td> cell, one list per table row
    data = []
    for row in table_rows:
        data.append([t.text.strip() for t in row.find_all('td')])

    # Build the DataFrame once, after the loop; the number of names here
    # must match the number of columns in the table
    df = pandas.DataFrame(data, columns=['City', 'PostalCode',
        'Population', 'IncorpDate', 'FIPS_Sub', 'FIPS_Place'])
    df = df[~df['PostalCode'].isnull()]  # to filter out bad rows
    df.head()

    df.to_csv('ny_cities22.csv', encoding='utf-8')

I know this is probably something I'm missing, but I can't figure out the code.

Thanks.

Tags: dataframe, jupyter-notebook, wikipedia

Solution
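
One possible approach, assuming the goal is to get each city's name together with the link to its Wikipedia article: take the first `<a>` tag in every row of the table and turn its `href` into an absolute URL. The helper name `extract_row_links`, the `BASE_URL` constant and the sample markup below are illustrative assumptions, not taken from the question. On the live page the first link in a row may not always be the city itself, so the result is worth checking by eye.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'https://en.wikipedia.org'  # assumed base for relative hrefs


def extract_row_links(html_text, base_url=BASE_URL):
    """Return (link text, absolute URL) for the first link in each table row."""
    soup = BeautifulSoup(html_text, 'html.parser')
    # class_='wikitable' matches any table whose class list contains it
    table = soup.find('table', class_='wikitable')
    if table is None:
        return []
    links = []
    for row in table.find_all('tr'):
        # Rows without any link (typically the header row) are skipped
        anchor = row.find('a', href=True)
        if anchor is not None:
            links.append((anchor.get_text(strip=True),
                          urljoin(base_url, anchor['href'])))
    return links


# Hypothetical sample markup standing in for the downloaded page
sample_html = """
<table class="wikitable sortable">
  <tr><th>City</th><th>County</th></tr>
  <tr><td><a href="/wiki/Albany,_New_York">Albany</a></td><td>Albany</td></tr>
  <tr><td><a href="/wiki/Buffalo,_New_York">Buffalo</a></td><td>Erie</td></tr>
</table>
"""

city_links = extract_row_links(sample_html)
print(city_links)
```

With the real page, `requests.get(url).text` from the question would be passed in place of `sample_html`, and the resulting list can be loaded with `pandas.DataFrame(city_links, columns=['City', 'Link'])` before calling `to_csv`.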

