ruby - Splitting an array of word strings while preserving the column order of the words
Problem description
I have a string
WP(PIL)/7/2013 PUBLIC AND PANCHAYAT MS PEMA BHUTIA MR. S.K. CHETTRI,\n KABI LUNGCHUK MS PANILA THEENGH ASST. GOVT.\n CONSTITUENCY, NORTH MS MON MAYA SUBBA ADVOCATE\n SIKKIM MS TASHI DOMA SHERPA MR. KARMA THINLAY,\n Vs MR SANGAY GURMEY CENTRAL GOVT.\n THE SECRETARY, MINISTRY BHUTIA COUNSEL\n OF SURFACE TRANSPORT MR. JORGAY NAMKA MR THINLAY DORJEE\n AND ORS. MR. ZANGPO SHERPA, BHUTIA\n AMICUS CURIAE MS POLLIN RAI, ASST.\n GOVT. ADVOCATE\n
I split it on the '\n' character. The result is
[" WP(PIL)/7/2013 PUBLIC AND PANCHAYAT MS PEMA BHUTIA MR. S.K. CHETTRI,",
" KABI LUNGCHUK MS PANILA THEENGH ASST. GOVT.",
" CONSTITUENCY, NORTH MS MON MAYA SUBBA ADVOCATE",
" SIKKIM MS TASHI DOMA SHERPA MR. KARMA THINLAY,",
" Vs MR SANGAY GURMEY CENTRAL GOVT.",
" THE SECRETARY, MINISTRY BHUTIA COUNSEL",
" OF SURFACE TRANSPORT MR. JORGAY NAMKA MR THINLAY DORJEE",
" AND ORS. MR. ZANGPO SHERPA, BHUTIA",
" AMICUS CURIAE MS POLLIN RAI, ASST.",
" GOVT. ADVOCATE"]
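I want to extract 4 columns from each line (that is, convert the array of strings into a matrix). In addition, each extracted string should end up in the column it belongs to. For example, 'GOVT. ADVOCATE' in the last string should be extracted as ['', '', '', 'GOVT. ADVOCATE'].
I am using the docsplit library to parse a pdf that contains tabular data. The problem is that each row in the pdf holds an inner table, similar to the array of strings shown above.
I tried taking the index of the first character of each column's words as a reference and using those values to process the strings, but I could not find an efficient solution.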
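The "index of the first character" idea can be sketched roughly as follows. This is only an illustration under assumptions: the helper names `column_starts` and `slice_at` and the sample rows are made up, the text is assumed to be padded with spaces (no tabs), and cells are assumed to be left-aligned and separated by at least two spaces. A cell that itself contains two consecutive spaces would produce a spurious column.

```ruby
# Hypothetical helper: index of the first character of every chunk of text
# that starts the line or follows a run of two or more whitespace characters.
def column_starts(line)
  starts = []
  line.scan(/(?:\A\s*|\s{2,})(\S)/) { starts << Regexp.last_match.begin(1) }
  starts
end

# Hypothetical helper: cut one line at the given start offsets.
def slice_at(line, starts)
  starts.each_with_index.map do |from, i|
    to = starts[i + 1] || line.length
    line[from...to].to_s.strip # to_s covers lines shorter than the offset
  end
end

# Made-up sample rows, padded with ljust so the columns line up.
lines = [
  "ID1".ljust(6) + "ALPHA BETA".ljust(14) + "MS ONE",
  "".ljust(6)    + "GAMMA".ljust(14)      + "MS TWO",
  "".ljust(20)   + "MS THREE"
]

# The union of the start offsets seen on any line gives the column grid.
starts = lines.flat_map { |l| column_starts(l) }.uniq.sort
matrix = lines.map { |l| slice_at(l, starts) }

p starts # => [0, 6, 20]
matrix.each { |row| p row }
# ["ID1", "ALPHA BETA", "MS ONE"]
# ["", "GAMMA", "MS TWO"]
# ["", "", "MS THREE"]
```

Taking the union over all lines is what keeps a cell such as "MS THREE" in the last column even though its own line contains only one chunk.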
Solution
Here is my solution, based on the comments above:
require 'pp'
test_array = [" WP(PIL)/7/2013 PUBLIC AND PANCHAYAT MS PEMA BHUTIA MR. S.K. CHETTRI,",
" KABI LUNGCHUK MS PANILA THEENGH ASST. GOVT.",
" CONSTITUENCY, NORTH MS MON MAYA SUBBA ADVOCATE",
" SIKKIM MS TASHI DOMA SHERPA MR. KARMA THINLAY,",
" Vs MR SANGAY GURMEY CENTRAL GOVT.",
" THE SECRETARY, MINISTRY BHUTIA COUNSEL",
" OF SURFACE TRANSPORT MR. JORGAY NAMKA MR THINLAY DORJEE",
" AND ORS. MR. ZANGPO SHERPA, BHUTIA",
" AMICUS CURIAE MS POLLIN RAI, ASST.",
" GOVT. ADVOCATE"]
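class ColumnAnalyzer
  attr_reader :columns
  attr_accessor :array

  def initialize(array)
    @array = array
    analyze
  end

  # Scans every line for column edges and stores, per column, the smallest
  # left offset (:l) and the largest right offset (:r) in @columns.
  def analyze
    lefts = Array.new
    rights = Array.new
    @array.each do |line|
      # Left edges: a non-whitespace character preceded by two whitespace characters.
      pos_left = Array.new
      deconstruct = line.dup
      col = 0
      while m = deconstruct.match(/\s\s[^\s]{1}/) do
        left = m.offset(0)[0]+1
        # The match offset is relative to the remaining substring, so the
        # previous column's position is added. NOTE: the sum ignores the extra
        # character skipped by the slice below, so later columns may come out
        # a character or two early.
        pos_left[col] = col == 0 ? left : left + pos_left[col-1]
        col += 1
        deconstruct = deconstruct[left+1..-1]
      end
      lefts.push pos_left

      # Right edges: a non-whitespace character followed by two whitespace characters.
      pos_right = Array.new
      deconstruct = line.dup
      col = 0
      while m = deconstruct.match(/[^\s]{1}\s\s/) do
        right = m.offset(0)[0]
        pos_right[col] = col == 0 ? right : right + pos_right[col-1]
        col += 1
        deconstruct = deconstruct[right+1..-1]
      end
      pos_right.push line.length # the last column runs to the end of the line
      rights.push pos_right
    end

    cols_l = lefts.collect { |a| a.size }.max
    cols_r = rights.collect { |a| a.size }.max
    cols = [cols_l,cols_r].max # no. of columns

    @columns = Array.new
    (0..cols-1).each do |col|
      @columns[col] = Hash.new
      @columns[col][:l] = lefts.map { |a| a[col] }.min
      # On every pass, rows with fewer entries than columns get a 0 prepended,
      # which shifts their entries toward the right-hand columns.
      lefts.select { |a| a.size < cols }.map! { |a| a.unshift 0 }
      rights.select { |a| a.size < cols }.map! { |a| a.unshift 0 }
    end
    (0..cols-1).each do |col|
      @columns[col][:r] = rights.map { |a| a[col] }.max
    end
  end

  # Slices every line at the stored column offsets and strips the whitespace.
  def extract
    data = Array.new
    @array.each do |line|
      line_array = Array.new
      @columns.each do |col|
        # strip! returns nil when nothing was stripped, and the slice itself is
        # nil on lines shorter than the offset, so use to_s.strip instead.
        line_array.push line[col[:l]..col[:r]].to_s.strip
      end
      data.push line_array
    end
    data
  end
end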
ca = ColumnAnalyzer.new(test_array)
data = ca.extract
pp ca.columns
pp data
=> [{:l=>7, :r=>21}, {:l=>28, :r=>54}, {:l=>62, :r=>85}, {:l=>87, :r=>113}]
[["WP(PIL)/7/2013",
"PUBLIC AND PANCHAYAT",
"MS PEMA BHUTIA",
"MR. S.K. CHETTRI,"],
["", "KABI LUNGCHUK", "MS PANILA THEENGH", "ASST. GOVT."],
["", "CONSTITUENCY, NORTH", "MS MON MAYA SUBBA", "ADVOCATE"],
["", "SIKKIM", "MS TASHI DOMA SHERP", "MR. KARMA THINLAY,"],
["", "Vs", "MR SANGAY GURMEY", "CENTRAL GOVT."],
["", "THE SECRETARY, MINISTRY", "BHUTIA", "COUNSEL"],
["", "OF SURFACE TRANSPORT", "MR. JORGAY NAMKA", "MR THINLAY DORJEE"],
["", "AND ORS.", "MR. ZANGPO SHERPA,", "BHUTIA"],
["", "", "AMICUS CURIAE", "MS POLLIN RAI, ASST."],
["", "", "", "GOVT. ADVOCATE"]]
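As a cross-check on the offsets, a different technique can be sketched: instead of tracking left and right edges per line, look for "gutters", meaning character positions that are blank in every line. This is not the approach used above. The helper name `split_by_gutters`, its `min_gap` parameter, and the sample rows are made up for illustration, and the sketch assumes space-padded text (no tabs) in which every pair of neighbouring columns is separated by at least `min_gap` positions that are blank on all lines.

```ruby
# Hypothetical helper: split fixed-width text lines into columns at runs of
# at least min_gap character positions that are blank in every line.
def split_by_gutters(lines, min_gap: 2)
  width = lines.map(&:length).max.to_i
  padded = lines.map { |l| l.ljust(width) }

  # used[i] is true when any line has a non-space character at position i.
  used = (0...width).map { |i| padded.any? { |l| l[i] != " " } }

  # Collect [start, stop) ranges of used positions, merging ranges that are
  # separated by fewer than min_gap blank positions (a single space between
  # words of the same cell).
  ranges = []
  i = 0
  while i < width
    unless used[i]
      i += 1
      next
    end
    start = i
    i += 1 while i < width && used[i]
    if ranges.any? && start - ranges.last[1] < min_gap
      ranges.last[1] = i
    else
      ranges << [start, i]
    end
  end

  padded.map { |l| ranges.map { |from, to| l[from...to].strip } }
end

# Made-up sample rows, padded with ljust so the columns line up.
rows = [
  "ID1".ljust(6) + "ALPHA BETA".ljust(14) + "MS ONE",
  "".ljust(6)    + "GAMMA".ljust(14)      + "MS TWO",
  "".ljust(20)   + "MS THREE"
]

result = split_by_gutters(rows)
result.each { |row| p row }
# ["ID1", "ALPHA BETA", "MS ONE"]
# ["", "GAMMA", "MS TWO"]
# ["", "", "MS THREE"]
```

Because each range ends at the last position used by any line, a long cell is not cut short. The trade-off is that two columns whose text happens to touch on some line are merged into one.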