Extract dictionaries from HTML scraped text

Problem description

I am scraping restaurant reviews from a website. The part of the HTML I need sits inside a script tag and has the form of a dictionary. With the code below I managed to get all the script tags, but I do not know how to filter them down to the one that contains the reviews, convert its contents into dictionaries, and eventually write them to a CSV file.

This is (most of) the script tag that I need to keep: [screenshot not included]

This is the code I used to download the HTML of all the review pages and store it in text files:

from selenium import webdriver
import codecs
import os
os.system('cls')  # clear the Windows console


PATH = "C:\\Users\\HCES\\Downloads\\chromedriver.exe"
driver = webdriver.Chrome(PATH)


for i in range(1,450):
    completeName = os.path.join('C:\\Users\\HCES\\Desktop\\jana\\scraped files', ("index{}.txt").format(i))
    driver.get("https://www.zomato.com/beirut/divvy-ashrafieh/reviews?page={}&sort=dd&filter=reviews-dd".format(i))
    # the with block closes each file once the page source has been written
    with codecs.open(completeName, "w", "utf-8") as file_object:
        file_object.write(driver.page_source)
    print("Page {} is written.".format(i))

driver.quit()

This is the code I used to print out only the script tags:

from bs4 import BeautifulSoup

for x in range(1,2):
    # the with block closes the file after reading
    with open("index{}.txt".format(x), "r", encoding="utf8") as revCode:
        content = revCode.read()
    soup = BeautifulSoup(content, 'lxml')
    for script_tag in soup.find_all('script'):
        print(script_tag.text, script_tag.next_sibling)

Your help is very much appreciated, as I need this for work.

Tags: python, json, selenium, web-scraping, beautifulsoup

Solution


You can select the script tag by its type and use the json library to parse its contents:

import json
...

data = soup.find('script', {"type": "application/ld+json"})
json_data = json.loads(data.string)

json_data now holds the parsed data (a dictionary, if the tag contains a JSON object), so you can access any value by its key.
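
To get from the parsed data to a CSV file, something along these lines might work. It is only a sketch that uses the standard library: it takes the text of each script tag (for example, the .string of every tag returned by soup.find_all), keeps the first one that parses as JSON and has a "review" list, and writes one row per review. The key names ("review", "author", "reviewRating", "ratingValue", "description") are assumptions based on the usual schema.org layout, and the helper names and sample data are made up for illustration, so check them against the actual tag on your pages.

```python
import csv
import io
import json


def find_review_block(script_texts):
    """Return the first parsed JSON object that has a 'review' list, else None."""
    for text in script_texts:
        if not text:  # script tags without inline content have no text
            continue
        try:
            obj = json.loads(text)
        except json.JSONDecodeError:
            continue  # ordinary JavaScript, not JSON: skip it
        if isinstance(obj, dict) and isinstance(obj.get("review"), list):
            return obj
    return None


def reviews_to_rows(block):
    """Flatten each review dictionary into a flat row (key names are assumed)."""
    rows = []
    for review in block.get("review", []):
        author = review.get("author")
        rating = review.get("reviewRating")
        rows.append({
            "author": author.get("name", "") if isinstance(author, dict) else (author or ""),
            "rating": rating.get("ratingValue", "") if isinstance(rating, dict) else (rating or ""),
            "text": review.get("description", ""),
        })
    return rows


def rows_to_csv(rows, out):
    """Write the rows to any file-like object, header first."""
    writer = csv.DictWriter(out, fieldnames=["author", "rating", "text"])
    writer.writeheader()
    writer.writerows(rows)


# Made-up sample standing in for the text of the scraped script tags.
sample_scripts = [
    "var x = 1;",
    json.dumps({
        "@type": "Restaurant",
        "name": "Example",
        "review": [
            {
                "author": "A. Reader",
                "description": "Great food",
                "reviewRating": {"ratingValue": 5},
            },
        ],
    }),
]

block = find_review_block(sample_scripts)
rows = reviews_to_rows(block)
buffer = io.StringIO()  # swap for open("reviews.csv", "w", newline="", encoding="utf8")
rows_to_csv(rows, buffer)
print(buffer.getvalue())
```

Writing to a file-like object keeps the example easy to try without touching the disk. For the real run you would loop over the saved index files, collect the rows from each page, and write them all to one CSV file.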

