Indexing each word in a text file's content using Java

Question

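I am trying to index each word in a text file using Java.

By "index" I mean an index of the words here.

Here is my sample file: https://pastebin.com/hxB8t56p (the actual file I want to index is much larger).

Here is the code I have tried so far: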

ArrayList<String> ar = new ArrayList<String>();
ArrayList<String> sen = new ArrayList<String>();
ArrayList<String> fin = new ArrayList<String>();
ArrayList<String> word = new ArrayList<String>();
String content = new String(Files.readAllBytes(Paths.get("D:\\folder\\poem.txt")), StandardCharsets.UTF_8);

String[] split = content.split("\\s"); // Split text file content
for(String b:split) {
    ar.add(b); // ar now holds every whitespace-separated token of the poem
}
FileInputStream fstream = null;
String answer = "";
fstream = new FileInputStream("D:\\folder\\poemt.txt");
BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
String strLine;
int count = 1;
int songnum = 0;

while((strLine=br.readLine())!=null) {
    String text = strLine.replaceAll("[0-9]", ""); // Replace numbers from txt
    String nums = strLine.split("(?=\\D)")[0]; // get digits from strLine
    if (nums.matches(".*[0-9].*")) {
        songnum = Integer.parseInt(nums); // Parse string to int
    }
    String regex = ".*\\d+.*";
    boolean result = strLine.matches(regex);
    if (result == true) { // check if strLine contain digit
        count = 1;
    }
    answer = songnum + "." + count + "(" + text + ")";
    count++;
    sen.add(answer); // added songnum + line number and text to sen
}

for(int i = 0;i<sen.size();i++) { // loop to match and get word+poem number+line number
    for (int j = 0; j < ar.size(); j++) {
        if (sen.get(i).contains(ar.get(j))) {
            if (!ar.get(j).isEmpty()) {
                String x = ar.get(j) + " - " + sen.get(i);
                x = x.replaceAll("\\(.*\\)", ""); // replace single line sentence
                String[] sp = x.split("\\s+");
                word.add(sp[0]); // each word in the poem is added to the word arraylist
                fin.add(x); // word+poem number+line number
            }
        }
    }
}
Set<String> listWithoutDuplicates = new LinkedHashSet<String>(fin); // Remove duplicates
fin.clear();
fin.addAll(listWithoutDuplicates);
Locale lithuanian = new Locale("ta"); // note: "ta" is the Tamil language code, despite the variable name
Collator lithuanianCollator = Collator.getInstance(lithuanian);
Collections.sort(fin, lithuanianCollator); // sort using locale-aware collation
System.out.println(fin);


    (change in blossom. - 0.2,1.2, &  the - 0.1,1.2, & then - 0.1,1.2)

Tags: java, arrays, arraylist

Answer


I will first reproduce the expected output for the sample you pasted, and then go through the code to see how to change it:

poem.txt

0.And then the day came,
  to remain blossom.
1.more painful
  then the blossom.

Expected output

[blossom. - 0.2,1.2, came, - 0.1, day - 0.1, painful - 1.1, remain - 0.2, the - 0.1,1.2, then - 0.1,1.2, to - 0.2]

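As @Pal Laden pointed out in the comments, some words (the, and) are not indexed. Stop words are probably meant to be ignored for indexing purposes.

The current output of the code is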

[blossom. - 0.2, blossom. - 1.2, came, - 0.1, day - 0.1, painful - 1.1, remain - 0.2, the - 0.1, the - 1.2, then - 0.1, then - 1.2, to - 0.2]

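So, assuming you fix the stop words, you are actually very close. Your fin array contains word+poem number+line number, but it should contain word+*list* of poem number+line number. There are several ways to solve this. First, we need to remove the stop words: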

// build stopword-removal set "toIgnore"
String[] stopWords = new String[]{ "a", "the", "of", "more", /*others*/ };
Set<String> toIgnore = new HashSet<>();
for (String s: stopWords) toIgnore.add(s);

if (!toIgnore.contains(sp[0])) fin.add(x); // only process non-ignored words
// was: fin.add(x);
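
One detail worth checking in isolation: the set lookup is case-sensitive, so a token such as "And" will not match the stop word "and" unless it is lowercased first. The following is a minimal sketch of that filtering step, not part of the original code; the class name StopWordFilter, the method keepIndexable and the stop-word list are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopWordFilter {
    // Illustrative stop-word list (an assumption); extend it as needed.
    static final Set<String> TO_IGNORE =
            new HashSet<>(Arrays.asList("and", "a", "the", "to", "of", "more"));

    // Returns the lowercased tokens of a line that are not stop words.
    static List<String> keepIndexable(String line) {
        List<String> kept = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            String word = token.toLowerCase(Locale.ROOT);
            if (!word.isEmpty() && !TO_IGNORE.contains(word)) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // prints [then, day, came,]
        System.out.println(keepIndexable("And then the day came,"));
    }
}
```

Punctuation is left attached to the words ("came,") because the expected output keeps it that way.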

Now, let's solve the list problem. The easiest (but ugly) way is to fix up "fin" at the end:

List<String> fixed = new ArrayList<>();
String prevWord = "";
String prevLocs = "";
for (String s : fin) {
    String[] parts = s.split(" - ");
    if (parts[0].equals(prevWord)) {
        prevLocs += "," + parts[1];
    } else {
        if (! prevWord.isEmpty()) fixed.add(prevWord + " - " + prevLocs);
        prevWord = parts[0];
        prevLocs = parts[1];
    }
}
// last iteration
if (! prevWord.isEmpty()) fixed.add(prevWord + " - " + prevLocs);

System.out.println(fixed);
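
To try this merging step on its own, here is a self-contained sketch fed with the current output shown earlier. It assumes the entries are already sorted, so that equal words are adjacent; the class name MergeLocations and the method merge are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MergeLocations {
    // Merges consecutive "word - location" entries that share the same word.
    // Assumes the input is sorted so that equal words sit next to each other.
    static List<String> merge(List<String> sortedEntries) {
        List<String> merged = new ArrayList<>();
        String prevWord = null;
        StringBuilder prevLocs = new StringBuilder();
        for (String entry : sortedEntries) {
            String[] parts = entry.split(" - ", 2); // [word, location]
            if (parts[0].equals(prevWord)) {
                prevLocs.append(',').append(parts[1]);
            } else {
                if (prevWord != null) merged.add(prevWord + " - " + prevLocs);
                prevWord = parts[0];
                prevLocs = new StringBuilder(parts[1]);
            }
        }
        // flush the last pending word
        if (prevWord != null) merged.add(prevWord + " - " + prevLocs);
        return merged;
    }

    public static void main(String[] args) {
        List<String> fin = Arrays.asList(
                "blossom. - 0.2", "blossom. - 1.2", "came, - 0.1", "day - 0.1",
                "painful - 1.1", "remain - 0.2", "the - 0.1", "the - 1.2",
                "then - 0.1", "then - 1.2", "to - 0.2");
        System.out.println(merge(fin));
    }
}
```

With that input the sketch prints [blossom. - 0.2,1.2, came, - 0.1, day - 0.1, painful - 1.1, remain - 0.2, the - 0.1,1.2, then - 0.1,1.2, to - 0.2], which is the expected output.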

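How to do it the Right Way (TM)

Your code can be improved a lot. In particular, using flat ArrayLists for everything is not always the best idea. Maps are great for building indices: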

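// imports this snippet needs
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.text.Collator;
import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// build stopwords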
String[] stopWords = new String[]{ "and", "a", "the", "to", "of", "more", /*others*/ };
Set<String> toIgnore = new HashSet<>();
for (String s: stopWords) toIgnore.add(s);

// prepare always-sorted, quick-lookup set of terms
Collator lithuanianCollator = Collator.getInstance(new Locale("ta"));
Map<String, List<String>> terms = new TreeMap<>((o1, o2) -> lithuanianCollator.compare(o1, o2));

// read lines; if line starts with number, store separately
Pattern countPattern = Pattern.compile("([0-9]+)\\.(.*)");
String content = new String(Files.readAllBytes(Paths.get("/tmp/poem.txt")), StandardCharsets.UTF_8);
int poemCount = 0;
int lineCount = 1;
for (String line: content.split("[\n\r]+")) {
    line = line.toLowerCase().trim(); // remove spaces on both sides

    // update locations
    Matcher m = countPattern.matcher(line);
    if (m.matches()) {
        poemCount = Integer.parseInt(m.group(1));
        lineCount = 1;
        line = m.group(2); // ignore number for word-finding purposes
    } else {
        lineCount ++;
    }

    // read words in line, with locations already taken care of
    for (String word: line.split(" ")) {
        if ( ! toIgnore.contains(word)) {
            if ( ! terms.containsKey(word)) {
                terms.put(word, new ArrayList<>());
            }
            terms.get(word).add(poemCount + "." + lineCount);
        }
    }
}

// output formatting to match that of your code
List<String> output = new ArrayList<>();
for (Map.Entry<String, List<String>> e: terms.entrySet()) {
    output.add(e.getKey() + " - " + String.join(",", e.getValue()));
}
System.out.println(output);

This gives me [blossom. - 0.2,1.2, came, - 0.1, day - 0.1, painful - 1.1, remain - 0.2, to - 0.2]. I did not adjust the stop word list to get a perfect match, but that should be easy to do.
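
As a side note, the containsKey/put/get sequence used when adding a location can be written more compactly with Map.computeIfAbsent (available since Java 8). The sketch below is only an illustration of that alternative: the class and method names are hypothetical, and it uses a TreeMap with natural String ordering rather than a Collator to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class IndexWithComputeIfAbsent {
    // Records one occurrence of a word; creates the location list on first use.
    static void addOccurrence(Map<String, List<String>> terms, String word, String location) {
        terms.computeIfAbsent(word, k -> new ArrayList<>()).add(location);
    }

    public static void main(String[] args) {
        // Natural String ordering here (an assumption); pass a Collator for locale-aware sorting.
        Map<String, List<String>> terms = new TreeMap<>();
        addOccurrence(terms, "then", "0.1");
        addOccurrence(terms, "day", "0.1");
        addOccurrence(terms, "then", "1.2");
        // prints {day=[0.1], then=[0.1, 1.2]}
        System.out.println(terms);
    }
}
```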

