java - 使用 Java 为 TextFile 内容中的每个单词建立索引
问题描述
我正在尝试使用 java 索引文本文件中的每个单词
索引意味着我在这里表示单词的索引..
这是我的示例文件https://pastebin.com/hxB8t56p (我要索引的实际文件要大得多)
这是我到目前为止尝试过的代码
ArrayList<String> ar = new ArrayList<String>();
ArrayList<String> sen = new ArrayList<String>();
ArrayList<String> fin = new ArrayList<String>();
ArrayList<String> word = new ArrayList<String>();
String content = new String(Files.readAllBytes(Paths.get("D:\\folder\\poem.txt")), StandardCharsets.UTF_8);
String[] split = content.split("\\s"); // Split text file content
for(String b:split) {
ar.add(b); // added into the ar arraylist //ar contains every line of poem
}
FileInputStream fstream = null;
String answer = "";fstream=new FileInputStream("D:\\folder\\poemt.txt");
BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
String strLine;
int count = 1;
int songnum = 0;
while((strLine=br.readLine())!=null) {
String text = strLine.replaceAll("[0-9]", ""); // Replace numbers from txt
String nums = strLine.split("(?=\\D)")[0]; // get digits from strLine
if (nums.matches(".*[0-9].*")) {
songnum = Integer.parseInt(nums); // Parse string to int
}
String regex = ".*\\d+.*";
boolean result = strLine.matches(regex);
if (result == true) { // check if strLine contain digit
count = 1;
}
answer = songnum + "." + count + "(" + text + ")";
count++;
sen.add(answer); // added songnum + line number and text to sen
}
for(int i = 0;i<sen.size();i++) { // loop to match and get word+poem number+line number
for (int j = 0; j < ar.size(); j++) {
if (sen.get(i).contains(ar.get(j))) {
if (!ar.get(j).isEmpty()) {
String x = ar.get(j) + " - " + sen.get(i);
x = x.replaceAll("\\(.*\\)", ""); // replace single line sentence
String[] sp = x.split("\\s+");
word.add(sp[0]); // each word in the poem is added to the word arraylist
fin.add(x); // word+poem number+line number
}
}
}
}
Set<String> listWithoutDuplicates = new LinkedHashSet<String>(fin); // Remove duplicates
fin.clear();fin.addAll(listWithoutDuplicates);
Locale lithuanian = new Locale("ta");
Collator lithuanianCollator = Collator.getInstance(lithuanian); // sort array
Collections.sort(fin,lithuanianCollator);
System.out.println(fin);
(change in blossom. - 0.2,1.2, & the - 0.1,1.2, & then - 0.1,1.2)
解决方案
我将首先为您粘贴的示例复制预期的输出,然后查看代码以了解如何更改它:
诗歌.txt
0.And then the day came,
to remain blossom.
1.more painful
then the blossom.
预期产出
[blossom. - 0.2,1.2, came, - 0.1, day - 0.1, painful - 1.1, remain - 0.2, the - 0.1,1.2, then - 0.1,1.2, to - 0.2]
正如@Pal Laden 在评论中指出的那样,一些词(the
, and
)没有被索引。出于索引目的,可能会忽略停用词。
代码的当前输出是
[blossom. - 0.2, blossom. - 1.2, came, - 0.1, day - 0.1, painful - 1.1, remain - 0.2, the - 0.1, the - 1.2, then - 0.1, then - 1.2, to - 0.2]
因此,假设您修复了停用词,那么您实际上已经非常接近了。您的fin
数组包含word+poem number+line number
,但它应该包含word+*list* of poem number+line number
。有几种方法可以解决这个问题。首先,我们需要删除停用词:
// build stopword-removal set "toIgnore"
String[] stopWords = new String[]{ "a", "the", "of", "more", /*others*/ };
Set<String> toIgnore = new HashSet<>();
for (String s: stopWords) toIgnore.add(s);
if ( ! toIgnore.contains(sp[0)) fin.add(x); // only process non-ignored words
// was: fin.add(x);
现在,让我们解决列表问题。最简单(但丑陋)的方法是在最后修复“fin”:
List<String> fixed = new ArrayList<>();
String prevWord = "";
String prevLocs = "";
for (String s : fin) {
String[] parts = s.split(" - ");
if (parts[0].equals(prevWord)) {
prevLocs += "," + parts[1];
} else {
if (! prevWord.isEmpty()) fixed.add(prevWord + " - " + prevLocs);
prevWord = parts[0];
prevLocs = parts[1];
}
}
// last iteration
if (! prevWord.isEmpty()) fixed.add(prevWord + " - " + prevLocs);
System.out.println(fixed);
如何以正确的方式做到这一点(TM)
您的代码可以大大改进。特别是,ArrayList
对所有东西都使用 flat s 并不总是最好的主意。地图非常适合构建索引:
// build stopwords
String[] stopWords = new String[]{ "and", "a", "the", "to", "of", "more", /*others*/ };
Set<String> toIgnore = new HashSet<>();
for (String s: stopWords) toIgnore.add(s);
// prepare always-sorted, quick-lookup set of terms
Collator lithuanianCollator = Collator.getInstance(new Locale("ta"));
Map<String, List<String>> terms = new TreeMap<>((o1, o2) -> lithuanianCollator.compare(o1, o2));
// read lines; if line starts with number, store separately
Pattern countPattern = Pattern.compile("([0-9]+)\\.(.*)");
String content = new String(Files.readAllBytes(Paths.get("/tmp/poem.txt")), StandardCharsets.UTF_8);
int poemCount = 0;
int lineCount = 1;
for (String line: content.split("[\n\r]+")) {
line = line.toLowerCase().trim(); // remove spaces on both sides
// update locations
Matcher m = countPattern.matcher(line);
if (m.matches()) {
poemCount = Integer.parseInt(m.group(1));
lineCount = 1;
line = m.group(2); // ignore number for word-finding purposes
} else {
lineCount ++;
}
// read words in line, with locations already taken care of
for (String word: line.split(" ")) {
if ( ! toIgnore.contains(word)) {
if ( ! terms.containsKey(word)) {
terms.put(word, new ArrayList<>());
}
terms.get(word).add(poemCount + "." + lineCount);
}
}
}
// output formatting to match that of your code
List<String> output = new ArrayList<>();
for (Map.Entry<String, List<String>> e: terms.entrySet()) {
output.add(e.getKey() + " - " + String.join(",", e.getValue()));
}
System.out.println(output);
这给了我[blossom. - 0.2,1.2, came, - 0.1, day - 0.1, painful - 1.1, remain - 0.2, to - 0.2]
。我没有修复停用词列表以获得完美匹配,但这应该很容易做到。
推荐阅读
- django - 在 Google Cloud Run 中运行 Django 的 createsuperuser
- php - CSV 在使用 PHP 显示 HTML 表之前检查列
- google-app-engine - Betfair Non-Interactive (Bot) 登录在 Google App Engine 区域 eu-west2(伦敦)中不起作用
- spring - Thymeleaf:在html标签上拆分字符串
- java - Spring MVC 控制器请求映射无法正常工作
- google-apps-script - 创建广告并分配创意和展示位置
- python - 使用 Pillow 将绘制的长文本分成多行
- javascript - 如何根据对象属性向渲染图像添加“活动”类?
- ruby - POST(不是 GET)wikidata 查询
- python - 如何在 python 中找到包含 30 只股票的数据集中的年化收益率