python - How to extract text from inside a div
Problem description
I am using scrapy to extract information.
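Edit: I tried extracting the text this way, but got nothing back: response.xpath("/html/body/div[3]/div[6]/div[3]/div/div[1]/div[3]/div/div[3]/div/div/div[2]/dl[1]/dd/span")
If anyone wants to reproduce this, just copy and paste the code below and run it. You can take any listing page; I only need to extract that information.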
import scrapy
from scrapy.spiders import SitemapSpider
from scrapy.crawler import CrawlerProcess
import googletrans
# from googletrans import Translator
from translate import Translator

class Myspider(SitemapSpider):
    name = 'spidername'
    sitemap_urls = ['https://www.arabam.com/sitemap/otomobil_1.xml']
    sitemap_rules = [
        ('/otomobil/', 'parse'),
        # ('/category/', 'parse_category'),
    ]

    def parse(self, response):
        for td in response.xpath("/html/body/div[3]/div[6]/div[4]/div/div[2]/table/tbody/tr/td[4]/div/a/@href").extract():
            # / html / body / div[3] / div[6] / div[4] / div / div[2] / table / tbody / tr / td[4] / div / a
            checks = str(td.split("/")[3]).split("-")
            for items in checks:
                if items.isdigit():
                    if int(items) > 2001:
                        url = "https://www.arabam.com/" + td
                        yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        ##some other stuff im scraping
        overview1 = response.xpath("/html/body/div[3]/div[6]/div[3]/div/div[1]/div[3]/div/div[3]/div/div/div[2]/dl[1]/dd/span")
        print(response)
        print("s" + str(overview1))

process = CrawlerProcess({
})
process.crawl(Myspider)
process.start()  # the script will block here until the crawling is finished
Edit: The expected output is exactly these key-value pairs.
Edit: Using the paths from the answer, I get this:
[......or Kaputu: ', ' Orijinal ', ' ', 'Sol Ön Çamurluk: ', ' Boyanmış ', ' ', 'Ön Tampon: ', ' Orijinal ', ' ', 'Arka Tampon: ', ' Orijinal ', ' ', 'Belirtilmemiş', 'Orijinal', 'Boyalı', 'Değişmiş', ' ', ' ', ' Tramer tutarı yok ', ' ', ' ', ' ', 'ARAÇ BİLGİLERİ', ' ', ' ', 'DONANIM', '\xa0', ' ', '\xa0', ' ', '\xa0', ' ', '\xa0', ' ', '\xa0', ' ', '\xa0', ' ', ' ', 'KREDİ', ' ', ' ', 'SPONSORLU BAĞLANTILAR', " googletag.cmd.push(function () { googletag.display('div-gpt-ad-1547030262883-0'); }); ", " googletag.cmd.push(function () { googletag.display('div-gpt-ad-1547030358839-0'); }); "]
Edit: I have also tried to get it with Selenium, still with no luck:
element = d.find_element_by_xpath("/html/body/div[3]/div[6]/div[3]/div/div[1]/div[3]/div/div[3]/div/div")
d.execute_script("arguments[0].scrollIntoView();", element)
element = d.find_element_by_xpath("/html/body/div[3]/div[6]/div[3]/div/div[1]/div[3]/div/div[3]/div/div")
print(element)
overview1 = element.text
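Edit: Since the element is in the middle of the page, it does not come into view. Is there a way to scroll to the bottom and then back to the middle? I tried this code, but it does not work: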
element = d.find_element_by_xpath('/html/body/div[3]/div[6]/div[3]/div/div[1]/div[3]/div/div[3]/div/div') # you can use ANY way to locate element
coordinates = element.location_once_scrolled_into_view # returns dict of X, Y coordinates
d.execute_script('window.scrollTo({}, {});'.format(coordinates['x'], coordinates['y']))
Solution
I wrote the following code with selenium to test the xpath (I have not used scrapy before):
from selenium import webdriver
from time import sleep
url = 'https://www.arabam.com/ilan/sahibinden-satilik-peugeot-407-2-0-hdi-comfort/sahibinden-peugeot-407-1-6-hdi-comfort-2008-model/12776039'
driver = webdriver.Chrome()
driver.get(url)
driver.execute_script("window.scrollTo(0, 1080);")
sleep(1)
overview_info = [ data for section in driver.find_elements_by_xpath("//div[@class='col-md-6 genel-bakis']") for data in section.text.split("\n")]
enguine_info = [ data for section in driver.find_elements_by_xpath("//div[@class='col-md-6 motor-ve-performans']") for data in section.text.split("\n")]
print("VEHICLE INFORMATION")
for i in range(0,len(overview_info)-1,2):
    print(overview_info[i] + ": " + overview_info[i+1])
for i in range(0,len(enguine_info)-1,2):
    print(enguine_info[i] + ": " + enguine_info[i+1])
driver.quit()
This gave me the following output:
The output matches what you highlighted in the picture, so I suggest you use the following paths:
#Get the text in the general section
"//div[@class='col-md-6 genel-bakis']//text()"
#Get text in the engine and performance section
"//div[@class='col-md-6 motor-ve-performans']//text()"
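As a rough follow-up sketch (not something I tested against the live site): in Scrapy, `//text()` returns every text node, including the whitespace-only ones, as the list in the question shows. Assuming labels and values alternate once the blanks are dropped, they could be paired up like this. The helper name `pair_text_nodes` is hypothetical, and the sample list is copied from the output in the question.

```python
def pair_text_nodes(nodes):
    """Turn a flat list of text nodes into (label, value) pairs."""
    # Strip padding and drop nodes that are only whitespace (this includes '\xa0').
    cleaned = [n.strip() for n in nodes if n.strip()]
    # Assumption: labels and values alternate, so pair item 0 with 1, 2 with 3, ...
    # zip() silently ignores a trailing label that has no value.
    return [(label.rstrip(":").strip(), value)
            for label, value in zip(cleaned[0::2], cleaned[1::2])]


# Sample text nodes copied from the output shown in the question.
sample = ['Sol Ön Çamurluk: ', ' Boyanmış ', ' ',
          'Ön Tampon: ', ' Orijinal ', ' ',
          'Arka Tampon: ', ' Orijinal ', ' ', '\xa0']

pairs = pair_text_nodes(sample)
for label, value in pairs:
    print(f"{label}: {value}")
```

In the spider, this could be called from `parse_dir_contents` on the result of `response.xpath("//div[@class='col-md-6 genel-bakis']//text()").getall()`. Whether the nodes on the real page alternate this cleanly is something you would need to check.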