How to use pd.DataFrame method to manually create a dataframe from info scraped using beautifulsoup4

Problem description

I got to the point where all the tr data has been scraped and I can get a clean printout. But when I try to build the frame with pd.DataFrame, as in df = pd.DataFrame({"A": a}) etc., I get a syntax error.

Here is a list of my imported libraries in the Jupyter Notebook:

import pandas as pd
import numpy as np
import bs4 as bs
import requests
import urllib.request
import csv
import html5lib
from pandas.io.html import read_html
import re

Here is my code:

source = urllib.request.urlopen('https://www.zipcodestogo.com/Texas/').read()
soup = bs.BeautifulSoup(source,'html.parser')

table_rows = soup.find_all('tr')
table_rows

for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)

texas_info = pd.DataFrame({
        "title": Texas 
        "Zip Code" : [Zip Code], 
        "City" :[City],
})

texas_info.head()

I expect to get a DataFrame with two columns, one being 'Zip Code' and the other 'City'.

Tags: pandas, web-scraping, beautifulsoup

Solution


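Try looping over the table rows with a for loop, collecting each row as a record, and building the DataFrame from the collected records. (Older answers call df.append inside the loop, but DataFrame.append was removed in pandas 2.0.)

    records = []
    for tr in table_rows:
        td = tr.find_all('td')
        row = [i.text.strip() for i in td]
        # Skip header or layout rows that have fewer than two cells
        if len(row) < 2:
            continue
        zipCode = row[0] # assuming first column
        city = row[1] # assuming second column
        records.append({"Zip Code": zipCode, "City": city})

    # Build the DataFrame once, from the list of row dictionaries
    df = pd.DataFrame(records)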

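If you only need these two columns, you should not include "title" in the DataFrame (it would create another column). That line is also where the syntax error occurs, because the comma after it is missing. Note as well that Texas, Zip Code, and City are not defined names in your code: each value has to be an existing variable or a quoted string, and [Zip Code] with a space in it is not valid Python.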
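If you would rather keep the dictionary form from the question, one option is to fill two lists while looping and pass them to pd.DataFrame. The following is only a minimal sketch: sample_html is a made-up stand-in for the downloaded page with placeholder values, the names zip_codes and cities are illustrative, and it assumes (as above) that the ZIP code is the first cell and the city the second.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the values are placeholders
sample_html = """
<table>
  <tr><th>Zip Code</th><th>City</th></tr>
  <tr><td>00001</td><td>City A</td></tr>
  <tr><td>00002</td><td>City B</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

zip_codes = []
cities = []
for tr in soup.find_all("tr"):
    cells = [td.text.strip() for td in tr.find_all("td")]
    if len(cells) < 2:
        continue  # the header row only has <th> cells, so it is skipped
    zip_codes.append(cells[0])  # assumption: first cell is the ZIP code
    cities.append(cells[1])     # assumption: second cell is the city

# Quoted keys, list values, and a comma after every entry
texas_info = pd.DataFrame({
    "Zip Code": zip_codes,
    "City": cities,
})
print(texas_info.head())
```

Against the real page, you would pass the downloaded source to BeautifulSoup instead of sample_html. The column positions may differ there, so check a few printed rows first.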

