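arrays - How to merge sets of elements in an array in Ruby
Problem description
I have an array filled with US state names scraped from a website.
The problem is that every state with a two-word name ends up as separate elements in the array. "New York", for example, is in the array as ["New", "York"].
I need to fix this for every two-word state in the array.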
["Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado",
"Connecticut", "Florida", "Georgia", "Idaho", "Illinois", "Indiana",
"Iowa", "Kansas", "Kentucky", "Louisiana", "Maine", "Maryland",
"Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri",
"Montana", "Nebraska", "Nevada", "New", "Hampshire", "New", "Jersey",
"New", "Mexico", "New", "York", "North", "Carolina", "North", "Dakota",
"Ohio", "Oklahoma", "Oregon", "Pennsylvania", "South", "Carolina",
"South", "Dakota", "Tennessee", "Texas", "Utah", "Vermont", "Virginia",
"Washington", "West", "Virginia", "Wisconsin", "Wyoming"]
def scrape_koa_states
  doc = Nokogiri::HTML(open("https://koa.com/campgrounds/"))
  title = doc.search("h4").text
  title_array = title.split
  title_array = title_array.delete_if { |ele| ele == "in" }
  title_array.map { |ele| ele.gsub!("Campgrounds", "") }
  new_array = title_array[2, 56]
  binding.pry
end
title = doc.search("h4").text
=> "KOA NewsletterCampgrounds in AlabamaCampgrounds in AlaskaCampgrounds in ArizonaCampgrounds in ArkansasCampgrounds in CaliforniaCampgrounds in ColoradoCampgrounds in ConnecticutCampgrounds in FloridaCampgrounds in GeorgiaCampgrounds in IdahoCampgrounds in IllinoisCampgrounds in IndianaCampgrounds in IowaCampgrounds in KansasCampgrounds in KentuckyCampgrounds in LouisianaCampgrounds in MaineCampgrounds in MarylandCampgrounds in MassachusettsCampgrounds in MichiganCampgrounds in MinnesotaCampgrounds in MississippiCampgrounds in MissouriCampgrounds in MontanaCampgrounds in NebraskaCampgrounds in NevadaCampgrounds in New HampshireCampgrounds in New JerseyCampgrounds in New MexicoCampgrounds in New YorkCampgrounds in North CarolinaCampgrounds in North DakotaCampgrounds in OhioCampgrounds in OklahomaCampgrounds in OregonCampgrounds in PennsylvaniaCampgrounds in South CarolinaCampgrounds in South DakotaCampgrounds in TennesseeCampgrounds in TexasCampgrounds in UtahCampgrounds in VermontCampgrounds in VirginiaCampgrounds in WashingtonCampgrounds in West VirginiaCampgrounds in WisconsinCampgrounds in WyomingCampgrounds in AlbertaCampgrounds in British ColumbiaCampgrounds in ManitobaCampgrounds in New BrunswickCampgrounds in Newfoundland and LabradorCampgrounds in Nova ScotiaCampgrounds in OntarioCampgrounds in Prince Edward IslandCampgrounds in Quebec"
Solution
There are several problems with the code.
search returns a NodeSet:
doc.search('h4').class # => Nokogiri::XML::NodeSet
A NodeSet is a collection, much like an array, and text, when called on a NodeSet, concatenates the text of every node into a single string.
doc.search('h4').text[0..40] # => "KOA NewsletterCampgrounds in AlabamaCampg"
Recovering from that is a nightmare, and we see this question asked often because people didn't read the documentation.
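To illustrate why recovery is so fragile, here is a minimal sketch of re-joining the split words after the fact. The helper name merge_split_names and the PREFIXES list are hypothetical, invented for this illustration; the list is a hand-maintained assumption that only covers the US states in the question.

```ruby
# Hypothetical prefix list: an assumption for this sketch, not part of Nokogiri
# or of the original code. It must be maintained by hand.
PREFIXES = %w[New North South West].freeze

# Hypothetical helper: glues a word onto the previous element when that
# previous element is one of the known prefixes.
def merge_split_names(words)
  words.each_with_object([]) do |word, merged|
    if merged.last && PREFIXES.include?(merged.last)
      merged[-1] = "#{merged.last} #{word}"
    else
      merged << word
    end
  end
end

p merge_split_names(%w[Nevada New Hampshire New Jersey Virginia West Virginia])
# => ["Nevada", "New Hampshire", "New Jersey", "Virginia", "West Virginia"]
```

This works for the US states, but it quietly leaves names such as "British Columbia" or "Prince Edward Island" split, because their first words are not in the list. That is the kind of patching that avoiding the joined string makes unnecessary.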
This is from the documentation for NodeSet's text:
Get the inner text of all contained Node objects
Note: This joins the text of all Node objects in the NodeSet:
doc = Nokogiri::XML('<xml><a><d>foo</d><d>bar</d></a></xml>')
doc.css('d').text # => "foobar"
Instead, if you want to return the text of all nodes in the NodeSet:
doc.css('d').map(&:text) # => ["foo", "bar"]
So, as the example shows, use:
doc.search('h4').map(&:text)[0..4] # => ["KOA Newsletter", "Campgrounds in Alabama", "Campgrounds in Alaska", "Campgrounds in Arizona", "Campgrounds in Arkansas"]
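Note that I'm using the [0..4] slice only to shorten the output shown here. You don't want to do that in your own code, unless you really do want just the first five elements.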
Next, CSS and XPath have ways of looking inside nodes to see what they contain, so let them do that work. CSS is almost always more readable, so I use it most of the time:
doc.search('h4:contains("Campgrounds")').map(&:text)[0..4] # => ["Campgrounds in Alabama", "Campgrounds in Alaska", "Campgrounds in Arizona", "Campgrounds in Arkansas", "Campgrounds in California"]
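Now cleaning up the results is simple, because the "Campgrounds in " prefix is exactly 15 characters long:
doc.search('h4:contains("Campgrounds")').map { |h4|
  h4.text[15..-1]
}
Which results in: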
# => ["Alabama",
# "Alaska",
# "Arizona",
# "Arkansas",
# "California",
# "Colorado",
# "Connecticut",
# "Florida",
# "Georgia",
# "Idaho",
# "Illinois",
# "Indiana",
# "Iowa",
# "Kansas",
# "Kentucky",
# "Louisiana",
# "Maine",
# "Maryland",
# "Massachusetts",
# "Michigan",
# "Minnesota",
# "Mississippi",
# "Missouri",
# "Montana",
# "Nebraska",
# "Nevada",
# "New Hampshire",
# "New Jersey",
# "New Mexico",
# "New York",
# "North Carolina",
# "North Dakota",
# "Ohio",
# "Oklahoma",
# "Oregon",
# "Pennsylvania",
# "South Carolina",
# "South Dakota",
# "Tennessee",
# "Texas",
# "Utah",
# "Vermont",
# "Virginia",
# "Washington",
# "West Virginia",
# "Wisconsin",
# "Wyoming",
# "Alberta",
# "British Columbia",
# "Manitoba",
# "New Brunswick",
# "Newfoundland and Labrador",
# "Nova Scotia",
# "Ontario",
# "Prince Edward Island",
# "Quebec"]
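The fixed offset 15 in the slice above only works while every heading starts with exactly "Campgrounds in ". As a slightly more defensive variant, the prefix can be named and removed with String#delete_prefix (available since Ruby 2.5). This is only a sketch: the helper name strip_heading_prefix and the sample headings are made up for illustration, and it works on plain strings such as those returned by map(&:text).

```ruby
# The literal prefix used by the page headings shown in the question.
HEADING_PREFIX = "Campgrounds in ".freeze

# Hypothetical helper: keeps only headings that start with the prefix
# and strips it, so unrelated headings are dropped rather than mangled.
def strip_heading_prefix(headings)
  headings
    .select { |text| text.start_with?(HEADING_PREFIX) }
    .map { |text| text.delete_prefix(HEADING_PREFIX) }
end

# Sample input made up for this sketch.
sample = ["KOA Newsletter", "Campgrounds in Alabama",
          "Campgrounds in New York", "Campgrounds in Prince Edward Island"]

p strip_heading_prefix(sample)
# => ["Alabama", "New York", "Prince Edward Island"]
```

Multi-word names stay intact here because the text is never split on whitespace in the first place.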