xpath - 带有scrapy shell的Xpath
问题描述
我正在尝试使用以下 xpath 命令在此页面https://techmoran.com/category/startups/上使用 scrapy shell 抓取启动文章:
>>> n=response.xpath('//article[contains(@class, "jeg_post jeg_pl_md_box")]/div/div/a/@href').getall()
>>> len(n)
30
该命令仅返回 30 篇文章而不是 2880 篇文章,当我在 Chrome 中的检查器上尝试 //article[contains(@class, "jeg_post jeg_pl_md_box")]/div/div/a/@href' 时可以看到。我如何获得其余的文章?
解决方案
这是一个无限卷轴,每页有30篇文章。
找到 url 和请求:
scrapy shell
In [1]: url = 'https://techmoran.com/?ajax-request=jnews'
In [2]: page = 1
In [3]: data = {
"lang": "en_US",
"action": "jnews_module_ajax_jnews_block_34",
"module": "true",
"data[filter]": "0",
"data[filter_type]": "all",
"data[current_page]": f"{page}",
"data[attribute][header_icon]": "",
"data[attribute][first_title]": "",
"data[attribute][second_title]": "",
"data[attribute][url]": "",
"data[attribute][header_type]": "heading_6",
"data[attribute][header_background]": "",
"data[attribute][header_secondary_background]": "",
"data[attribute][header_text_color]": "",
"data[attribute][header_line_color]": "",
"data[attribute][header_accent_color]": "",
"data[attribute][header_filter_category]": "",
"data[attribute][header_filter_author]": "",
"data[attribute][header_filter_tag]": "",
"data[attribute][header_filter_text]": "All",
"data[attribute][post_type]": "post",
"data[attribute][content_type]": "all",
"data[attribute][number_post]": "30",
"data[attribute][post_offset]": "7",
"data[attribute][unique_content]": "disable",
"data[attribute][include_post]": "",
"data[attribute][exclude_post]": "",
"data[attribute][include_category]": "5",
"data[attribute][exclude_category]": "",
"data[attribute][include_author]": "",
"data[attribute][include_tag]": "",
"data[attribute][exclude_tag]": "",
"data[attribute][gallerycat]": "",
"data[attribute][sort_by]": "latest",
"data[attribute][date_format]": "default",
"data[attribute][date_format_custom]": "Y/m/d",
"data[attribute][pagination_mode]": "scrollload",
"data[attribute][pagination_nextprev_showtext]": "",
"data[attribute][pagination_number_post]": "30",
"data[attribute][pagination_scroll_limit]": "0",
"data[attribute][el_id]": "",
"data[attribute][el_class]": "",
"data[attribute][scheme]": "",
"data[attribute][column_width]": "auto",
"data[attribute][title_color]": "",
"data[attribute][accent_color]": "",
"data[attribute][alt_color]": "",
"data[attribute][excerpt_color]": "",
"data[attribute][css]": "",
"data[attribute][excerpt_length]": "20",
"data[attribute][paged]": "1",
"data[attribute][pagination_align]": "center",
"data[attribute][pagination_navtext]": "false",
"data[attribute][pagination_pageinfo]": "false",
"data[attribute][boxed]": "false",
"data[attribute][boxed_shadow]": "false",
"data[attribute][box_shadow]": "false",
"data[attribute][push_archive]": "true",
"data[attribute][ads_type]": "googleads",
"data[attribute][ads_position]": "5",
"data[attribute][ads_random]": "false",
"data[attribute][ads_image]": "",
"data[attribute][ads_image_link]": "",
"data[attribute][ads_image_alt]": "",
"data[attribute][ads_image_new_tab]": "true",
"data[attribute][google_publisher_id]": "<script+data-ad-client=\"ca-pub-4793005812421081\"+async+src=\"https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js\"></script>",
"data[attribute][google_slot_id]": "<script+async+src=\"https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js\"></script>+<ins+class=\"adsbygoogle\"++++++style=\"display:block\"++++++data-ad-format=\"fluid\"++++++data-ad-layout-key=\"-cs+6-1e-q8+12p\"++++++data-ad-client=\"ca-pub-4793005812421081\"++++++data-ad-slot=\"8266744001\"></ins>+<script>++++++(adsbygoogle+=+window.adsbygoogle+||+[]).push({});+</script>",
"data[attribute][google_desktop]": "auto",
"data[attribute][google_tab]": "auto",
"data[attribute][google_phone]": "auto",
"data[attribute][code]": "",
"data[attribute][ads_class]": "inline_module",
"data[attribute][column_class]": "jeg_col_3o3",
"data[attribute][class]": "jnews_block_34"
}
In [4]: req = scrapy.FormRequest(url=url, formdata=data)
In [5]: fetch(req)
[scrapy.core.engine] INFO: Spider opened
[scrapy.core.engine] DEBUG: Crawled (200) <POST https://techmoran.com/?ajax-request=jnews> (referer: None)
######## Here you'll get a json (content = response.json()['content']), scrape your articles from it. #######
In [6]: page += 1
........
........
........
推荐阅读
- android-studio - 运行 Android Studio 应用程序时出现 Dex 文件错误
- r - 重复列在另一列中的唯一值的次数
- android - 无法为 SoftwareComponentInternal 获取未知属性“发布”-Maven 发布插件项目 gradle
- python-3.x - 如何使用 iso-8859-1 对 pandas 进行编码?
- keil - 如何解决错误 - L6236E:没有部分匹配选择器 - 没有部分是 FIRST/LAST
- ios - 是否可以在对象数组中使用 UIImage?
- c# - SshKeyGenerator 生成的公钥/私钥不起作用
- asp.net-core - 在 Windows 中使用 CLI 在 .Net 5.0 中创建 WebAPI 时出错
- html - 如何根据视口纵横比实现 CSS 响应性?
- google-chrome - WebUSB API 无法找到兼容设备