web-scraping - wget downloads nofollow links

Problem description

I want to crawl/scrape a WordPress site with wget.
Problem: wget downloads documents/links even though they carry the rel=nofollow attribute. And yes, I have robots.txt handling enabled.

Example:

wget --mirror --page-requisites --adjust-extension --convert-links --restrict-file-names=windows --no-parent --span-hosts --domains=randomascii.wordpress.com,wp.com https://randomascii.wordpress.com/about/

Now open the about folder. Within a few seconds you will see dozens of HTML files that come from nofollow links: index.html@share=reddit.html, index.html@share=twitter.html, index.html@replytocom=74214.html...
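One possible workaround, independent of how wget treats rel=nofollow, is to filter the unwanted URLs by pattern with --reject-regex. The sketch below is only that: the pattern is an assumption inferred from the file names above (with --restrict-file-names=windows, wget writes the "?" of a query string as "@", so index.html@share=reddit.html presumably came from a ?share=reddit URL), and the names reject_pattern and mirror_about_page are made up for illustration. The wget call is wrapped in a function so nothing is downloaded until you call it; the last command only checks the pattern against sample URLs.

```shell
# Assumed pattern: matches the query parameters seen in the unwanted file names
# (share=... and replytocom=...). Adjust it if the site uses other parameters.
reject_pattern='[?&](share|replytocom)='

# Hypothetical wrapper around the command from the question, with --reject-regex added.
# Defining the function does not start a download; call mirror_about_page to run it.
mirror_about_page() {
  wget --mirror --page-requisites --adjust-extension --convert-links \
       --restrict-file-names=windows --no-parent --span-hosts \
       --domains=randomascii.wordpress.com,wp.com \
       --reject-regex "$reject_pattern" \
       https://randomascii.wordpress.com/about/
}

# Dry check: print only the sample URLs that the pattern would reject.
printf '%s\n' \
  'https://randomascii.wordpress.com/about/?share=reddit' \
  'https://randomascii.wordpress.com/about/?replytocom=74214' \
  'https://randomascii.wordpress.com/about/' |
  grep -E "$reject_pattern"
```

The dry check uses grep -E because wget's default regex type is POSIX; if your build treats the pattern differently, test it on a small crawl first.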

GNU Wget 1.20.3 built on msys.

-cares +digest +gpgme +https +ipv6 +iri +large-file +metalink +nls
+ntlm +opie +psl +ssl/openssl

Wgetrc:
    /etc/wgetrc (system)
Locale:
    /usr/share/locale
Compile:
    gcc -DHAVE_CONFIG_H -DSYSTEM_WGETRC="/etc/wgetrc"
    -DLOCALEDIR="/usr/share/locale" -I. -I../lib -I../lib -DHAVE_LIBSSL
    -DNDEBUG -march=x86-64 -mtune=generic -O2 -pipe
Link:
    gcc -DHAVE_LIBSSL -DNDEBUG -march=x86-64 -mtune=generic -O2 -pipe
    -pipe -lmetalink -lexpat -lpcre2-8 -luuid -lssl -lcrypto -lz -lz
    -lpsl -lidn2 -liconv -lunistring -lgpgme -lassuan -lgpg-error
    ftp-opie.o openssl.o http-ntlm.o ../lib/libgnu.a -liconv -lintl
    /usr/lib/libunistring.dll.a

Tags: web-scraping, web-crawler, wget
