Process Jekyll content to replace first occurrence of any post title with a hyperlink of the post with that title

Problem description

What I'm trying to do

I am building a Jekyll ruby plugin that will replace the first occurrence of any word in the post copy text content with a hyperlink linking to the URL of a post by the same name.

The problems I'm having

I've gotten this to work, but I can't figure out three questions about the process_words method:

  1. How do I search for a post title only in the main content copy text of the post, and not in the meta tags before the post or the table of contents (which is also generated before the main post copy text)? I can't get this to work with Nokogiri, even though that seems to be the tool of choice here.
  2. If a post's URL is not at post.data['url'], where is it?
  3. Also, is there a more efficient, cleaner way to do this?

The current code works but will replace the first occurrence even if it's the value of an HTML attribute, like an anchor or a meta tag.

Example result

We have a blog with 3 posts:

And in the "Hobbies" post body text, we have a sentence with each word appearing in it for the first time in the post, like so:

I love mountain biking and bicycles in general. 

The plugin would process that sentence and output it as:

I love mountain biking and <a href="https://example.com/link/to/bicycles/">bicycles</a> in general. 

My current code (UPDATED 1)

# _plugins/hyperlink_first_word_occurance.rb
require "jekyll"
require 'uri'


module Jekyll

    # Replace the first occurrence of each post title in the content with a hyperlink to that post
    module HyperlinkFirstWordOccurance
        POST_CONTENT_CLASS = "page__content"
        BODY_START_TAG = "<body"
        ASIDE_START_TAG = "<aside"
        # [^>]* stops at the tag's own closing ">" instead of the last ">" on the line
        OPENING_BODY_TAG_REGEX = %r!<body([^>]*)>\s*!
        CLOSING_ASIDE_TAG_REGEX = %r!</aside([^>]*)>\s*!

        class << self
            # Public: Processes the content and turns the
            # first occurrence of each word that also has a post
            # of the same title into a hyperlink.
            #
            # content - the document or page to be processed.
            def process(content)
                @title = content.data['title']
                @posts = content.site.posts

                content.output = if content.output.include? BODY_START_TAG
                                    process_html(content)
                                else
                                    process_words(content.output)
                                end
            end


            # Public: Determines if the content should be processed.
            #
            # doc - the document being processed.
            def processable?(doc)
                (doc.is_a?(Jekyll::Page) || doc.write?) &&
                    doc.output_ext == ".html" || (doc.permalink&.end_with?("/"))
            end


            private

            # Private: Processes html content which has a body opening tag.
            #
            # content - html to be processed.
            def process_html(content)
                # Split after the sidebar's closing </aside> tag if there is one,
                # otherwise at the post content class name.
                head, opener, tail = if content.output.include? ASIDE_START_TAG
                                         content.output.partition(CLOSING_ASIDE_TAG_REGEX)
                                     else
                                         content.output.partition(POST_CONTENT_CLASS)
                                     end
                body_content, *rest = tail.partition("</body>")

                processed_markup = process_words(body_content)

                content.output = String.new(head) << opener << processed_markup << rest.join
            end

            # Private: Processes each word of the content and turns
            # the first occurrence of each word that also has a post
            # of the same title into a hyperlink.
            #
            # html - the html which includes all the content.
            def process_words(html)
                page_content = html
                @posts.docs.each do |post|
                    post_title = post.data['title'] || post.name
                    post_title_lowercase = post_title.downcase
                    if post_title != @title
                        if page_content.include?(" " + post_title_lowercase + " ") ||
                            page_content.include?(post_title_lowercase + " ") ||
                            page_content.include?(post_title_lowercase + ",") ||
                            page_content.include?(post_title_lowercase + ".")
                            page_content = page_content.sub(post_title_lowercase, "<a href=\"#{ post.url }\">#{ post_title.downcase }</a>")
                        elsif page_content.include?(" " + post_title + " ") ||
                            page_content.include?(post_title + " ") ||
                            page_content.include?(post_title + ",") ||
                            page_content.include?(post_title + ".")
                            # post.url (Jekyll::Document#url), as in the lowercase branch above
                            page_content = page_content.sub(post_title, "<a href=\"#{ post.url }\">#{ post_title }</a>")
                        end
                    end
                end
                page_content
            end
        end
    end
end


Jekyll::Hooks.register %i[posts pages], :post_render do |doc|
  # code to call after Jekyll renders a post
  Jekyll::HyperlinkFirstWordOccurance.process(doc) if Jekyll::HyperlinkFirstWordOccurance.processable?(doc)
end

Update 1

Updated my code with @Keith Mifsud's advice. Now using either the sidebar's aside element or the page__content class to select body content to work on.

Also improved checking and replacing the correct term.

PS: The code base I started from when working on my plugin was @Keith Mifsud's jekyll-target-blank plugin.

Tags: ruby, plugins, jekyll, hook, nokogiri

Solution


This code looks very familiar :) I suggest you look at the RSpec test files to test your issues: https://github.com/keithmifsud/jekyll-target-blank

I'll try to answer your questions. Sorry, I wasn't able to test these myself at the time of writing.

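How to only search for a post title in the main content copy text of the post, and not the meta tags before the post or the table of contents (which is also generated before main post copy text)? I can't get this to work with Nokogiri, even though that seems to be the tool of choice here.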

Your requirements are:

1) Ignore content outside the <body></body> tags.

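This already seems to be implemented in the process_html() method. That method processes only body_content, and it should work as is. Do you have tests? How are you debugging it? The same string splitting works in my plugin, i.e. only the content inside the body is processed.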

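2) Ignore content inside the table of contents (TOC). I suggest you extend the process_html() method by splitting the body_content variable further. Search for the content between the TOC's opening and closing tags (by id, CSS class, etc.), exclude it, and then add it back in its original position, before or after the string returned by process_words.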
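
The TOC split described above could look roughly like the following. This is only a sketch: the TOC markup (<nav class="toc">) and the helper name process_outside_toc are assumptions for illustration, not something taken from the question's theme or from jekyll-target-blank.

```ruby
# Hypothetical TOC markup: <nav class="toc"> ... </nav>. Adjust to your theme.
TOC_REGEX = %r!<nav class="toc".*?</nav>!m

# Splits the body around the TOC, yields only the non-TOC parts to the
# block, and reassembles everything in the original order.
def process_outside_toc(body_content)
  before, toc, after = body_content.partition(TOC_REGEX)
  yield(before) + toc + yield(after)
end

body = '<nav class="toc"><a href="#bicycles">bicycles</a></nav><p>I love bicycles.</p>'
puts process_outside_toc(body) { |html| html.sub("bicycles", "<b>bicycles</b>") }
```

In the plugin, the block would call process_words instead of the placeholder sub. Note that the parts before and after the TOC are processed separately here, so a real version would need to remember which titles were already linked in the first part.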

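3) Whether to use the Nokogiri library or not? This library is great for parsing HTML. I think you are parsing strings and then creating HTML, so vanilla Ruby and the URI library should be enough. You can still use it if you want, but it will not be faster than splitting strings in Ruby.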
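
As a sketch of what the vanilla Ruby route might look like for the attribute problem mentioned in the question: split the HTML into tags and text, and substitute only in the text parts. The method name link_first_in_text is hypothetical, and the sketch assumes reasonably well-formed HTML in which a literal "<" in text is escaped.

```ruby
# Splits HTML into tags and text with a capture group, so the tags are kept
# in the result. Only text segments are searched; tags (and therefore
# attribute values) are copied through untouched.
def link_first_in_text(html, word, url)
  linked = false
  inside_anchor = false
  html.split(/(<[^>]+>)/).map do |segment|
    if segment.start_with?("<")
      # Track existing links so that linked text is not wrapped a second time.
      inside_anchor = true if segment.match?(/\A<a[\s>]/i)
      inside_anchor = false if segment.match?(%r!\A</a\s*>!i)
      segment
    elsif linked || inside_anchor
      segment
    else
      replaced = segment.sub(/\b#{Regexp.escape(word)}\b/i) { |m| %(<a href="#{url}">#{m}</a>) }
      linked = replaced != segment
      replaced
    end
  end.join
end

puts link_first_in_text('<p title="bicycles">I love bicycles.</p>', "bicycles", "/bicycles/")
# prints <p title="bicycles">I love <a href="/bicycles/">bicycles</a>.</p>
```

The title attribute is left alone because it lives inside a tag segment. Nokogiri would do the same job by walking text nodes, which is more robust against unusual markup at the cost of an extra dependency.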

If a post's URL is not at post.data['url'], where is it?

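I think you should have a method that gets all the post titles and then match the "words" against that array. You can get the whole posts collection from the document itself with doc.site.posts, then iterate over the posts and return each title. The process_words() method can then check each word to see whether it matches an item in the array. But what if a title is made up of multiple words?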
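
One possible way to handle the title lookup, including multi-word titles, is to build a single pattern from all titles instead of checking word by word. The sketch below uses a Struct as a stand-in for Jekyll's documents (which expose the title as data["title"] and the URL through the url method); the sample titles, the sample sentence, and the helper names title_pattern, url_index and link_titles are made up for illustration.

```ruby
# Stand-in for Jekyll::Document, which exposes the title through
# data["title"] and the rendered URL through #url.
Post = Struct.new(:data, :url)

# Builds one case-insensitive pattern from all titles, longest first, so
# that a multi-word title wins over a shorter title it contains.
def title_pattern(titles)
  sorted = titles.sort_by { |t| -t.length }
  /\b(?:#{sorted.map { |t| Regexp.escape(t) }.join("|")})\b/i
end

# Maps each downcased title to its URL, skipping the current post.
def url_index(posts, current_title)
  posts.reject { |p| p.data["title"] == current_title }
       .to_h { |p| [p.data["title"].downcase, p.url] }
end

# Links only the first occurrence of each title in plain text.
def link_titles(text, index)
  return text if index.empty?
  seen = {}
  text.gsub(title_pattern(index.keys)) do |match|
    key = match.downcase
    next match if seen[key]
    seen[key] = true
    %(<a href="#{index[key]}">#{match}</a>)
  end
end

posts = [
  Post.new({ "title" => "Bicycles" }, "/bicycles/"),
  Post.new({ "title" => "Road Cycling" }, "/road-cycling/"),
  Post.new({ "title" => "Hobbies" }, "/hobbies/")
]
index = url_index(posts, "Hobbies")
puts link_titles("I enjoy road cycling and bicycles in general.", index)
```

This works on plain text only, so it would still need to be applied to the text between tags rather than to the raw HTML.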

Also, is there a more efficient, cleaner way to do this?

So far, so good. I would start by getting the problem solved, and then refactor for speed and coding standards.

Again, I suggest you use tests to help you with this.

Let me know if I can help any further :)

