How to extract text from inside a div

Question

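I am trying to extract this: [screenshot]

From this link: https://www.arabam.com/ilan/sahibinden-satilik-peugeot-407-2-0-hdi-comfort/sahibinden-peugeot-407-1-6-hdi-comfort-2008-model/12776039

I am using Scrapy to extract the information.

Edit: I tried to extract the text this way, but it returned nothing: response.xpath("/html/body/div[3]/div[6]/div[3]/div/div[1]/div[3]/div/div[3]/div/div/div[2]/dl[1]/dd/span")

If anyone wants to reproduce this, just copy, paste, and run the code below. You can fetch any listing page; I only need to extract that information.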

import scrapy
from scrapy.spiders import SitemapSpider
from scrapy.crawler import CrawlerProcess
import googletrans
# from googletrans import Translator
from translate import Translator

class Myspider(SitemapSpider):
    name = 'spidername'
    sitemap_urls = ['https://www.arabam.com/sitemap/otomobil_1.xml']
    sitemap_rules = [
        ('/otomobil/', 'parse'),
        # ('/category/', 'parse_category'),
    ]
    def parse(self, response):
        # Collect the listing links from the results table
        for td in response.xpath("/html/body/div[3]/div[6]/div[4]/div/div[2]/table/tbody/tr/td[4]/div/a/@href").extract():
            # Split one segment of the link into dash-separated parts
            checks = str(td.split("/")[3]).split("-")
            for items in checks:
                # Follow the link only if a numeric part is greater than 2001
                if items.isdigit():
                    if int(items) > 2001:
                        url = "https://www.arabam.com/" + td
                        yield scrapy.Request(url, callback=self.parse_dir_contents)


    def parse_dir_contents(self,response):
        # some other stuff I'm scraping

        overview1 = response.xpath("/html/body/div[3]/div[6]/div[3]/div/div[1]/div[3]/div/div[3]/div/div/div[2]/dl[1]/dd/span")
        print(response)
        print("s"+ str(overview1))



process = CrawlerProcess({

})

process.crawl(Myspider)
process.start()  # the script will block here until the crawling is finished

Edit: The expected output is exactly these key-value pairs.

Edit: Using the paths from the answer, I get this:

[......or Kaputu: ', ' Orijinal ', '  ', 'Sol Ön Çamurluk: ', ' Boyanmış ', '  ', 'Ön Tampon: ', ' Orijinal ', '  ', 'Arka Tampon: ', ' Orijinal ', '  ', 'Belirtilmemiş', 'Orijinal', 'Boyalı', 'Değişmiş', '   ', '  ', ' Tramer tutarı yok ', '  ', '  ', '  ', 'ARAÇ BİLGİLERİ', '  ', ' ', 'DONANIM', '\xa0', '  ', '\xa0', '  ', '\xa0', '  ', '\xa0', '  ', '\xa0', '  ', '\xa0', '  ', '  ', 'KREDİ', '  ', '  ', 'SPONSORLU BAĞLANTILAR', " googletag.cmd.push(function () { googletag.display('div-gpt-ad-1547030262883-0'); }); ", " googletag.cmd.push(function () { googletag.display('div-gpt-ad-1547030358839-0'); }); "]
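
The list above mixes labels, values, and whitespace-only strings. One way to get key-value pairs out of it is sketched below. This is only a sketch: `pair_labels` is a hypothetical helper, and it assumes that every label ends with a colon and is immediately followed by its value once the empty strings are dropped. Entries without a colon (such as the legend words or the ad scripts) are skipped.

```python
def pair_labels(texts):
    """Turn a flat list of text nodes into a {label: value} dict."""
    # Strip ordinary and non-breaking spaces, then drop empty strings.
    cleaned = [t.replace("\xa0", " ").strip() for t in texts]
    cleaned = [t for t in cleaned if t]

    pairs = {}
    i = 0
    while i < len(cleaned) - 1:
        if cleaned[i].endswith(":"):
            # Assumption: the item right after a label is its value.
            pairs[cleaned[i].rstrip(":").strip()] = cleaned[i + 1]
            i += 2
        else:
            i += 1
    return pairs


sample = ['Sol Ön Çamurluk: ', ' Boyanmış ', '  ', 'Ön Tampon: ', ' Orijinal ', '  ', 'Belirtilmemiş', '\xa0']
print(pair_labels(sample))
```

If a label on the page has no value, this pairing would pick up the next label instead, so the result is worth checking against a few real listings.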

Edit: I have also tried to get it with Selenium, still with no luck:

element = d.find_element_by_xpath("/html/body/div[3]/div[6]/div[3]/div/div[1]/div[3]/div/div[3]/div/div")
d.execute_script("arguments[0].scrollIntoView();", element)
element = d.find_element_by_xpath("/html/body/div[3]/div[6]/div[3]/div/div[1]/div[3]/div/div[3]/div/div")
print(element)
overview1 = element.text

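Edit: Because the element sits in the middle of the page, it does not come into view. Is there a way to scroll to the bottom and then back to the middle? I tried this code, but it does not work: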

element = d.find_element_by_xpath('/html/body/div[3]/div[6]/div[3]/div/div[1]/div[3]/div/div[3]/div/div')  # you can use ANY way to locate element
coordinates = element.location_once_scrolled_into_view  # returns dict of X, Y coordinates
d.execute_script('window.scrollTo({}, {});'.format(coordinates['x'], coordinates['y']))

Tags: python, selenium, web-scraping, scrapy

Answer


I wrote the following code with Selenium to test the XPath (I have not used Scrapy before):

from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep

url = 'https://www.arabam.com/ilan/sahibinden-satilik-peugeot-407-2-0-hdi-comfort/sahibinden-peugeot-407-1-6-hdi-comfort-2008-model/12776039'

driver = webdriver.Chrome()

driver.get(url)
# Scroll down the page (to y = 1080)
driver.execute_script("window.scrollTo(0, 1080);")

# Wait a moment before reading the sections
sleep(1)

# Select each section by its class and split its text into lines
overview_info = [data for section in driver.find_elements(By.XPATH, "//div[@class='col-md-6 genel-bakis']") for data in section.text.split("\n")]
engine_info = [data for section in driver.find_elements(By.XPATH, "//div[@class='col-md-6 motor-ve-performans']") for data in section.text.split("\n")]

# Print consecutive lines as "label: value" pairs
print("VEHICLE INFORMATION")
for i in range(0, len(overview_info) - 1, 2):
    print(overview_info[i] + ": " + overview_info[i + 1])
for i in range(0, len(engine_info) - 1, 2):
    print(engine_info[i] + ": " + engine_info[i + 1])

driver.quit()

This gave me output (posted as a screenshot) that matches what you highlighted in your picture, so I suggest using the following paths:

#Get the text in the general section
"//div[@class='col-md-6 genel-bakis']//text()"
#Get text in the engine and performance section
"//div[@class='col-md-6 motor-ve-performans']//text()"
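
To use paths like these from the Scrapy callback, something along the following lines might work. This is a minimal sketch that has not been run against the live page. `class_xpath` and `section_lines` are hypothetical helper names, and `response` is assumed to be the Scrapy response passed to `parse_dir_contents`. Instead of comparing the whole `class` attribute, the sketch matches a single class token, which is an alternative to the exact-match paths above and should tolerate changes to the other classes on the div.

```python
def class_xpath(css_class):
    """Build an XPath that selects all text under divs carrying css_class."""
    # Match one class token rather than the whole attribute string, so the
    # expression still works if the other classes (e.g. col-md-6) change.
    return (
        "//div[contains(concat(' ', normalize-space(@class), ' '), "
        "' {} ')]//text()".format(css_class)
    )


def section_lines(response, css_class):
    """Return the non-empty, stripped text nodes of one section."""
    # `response` only needs an .xpath() method whose result has .getall(),
    # which is what a Scrapy response provides.
    texts = response.xpath(class_xpath(css_class)).getall()
    return [t.strip() for t in texts if t.strip()]
```

Inside the spider this would be called as `section_lines(response, "genel-bakis")` and `section_lines(response, "motor-ve-performans")`. If the sections are filled in by JavaScript after the page loads, the text will not be present in the HTML that Scrapy downloads, and a browser-based approach like the Selenium code above would still be needed.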
