First, set up a Java project in Eclipse. The project structure is as follows:
common: shared classes such as constants, utility classes, and DTOs
demos: the controller classes, i.e. the program's entry points
service: the information-retrieval logic, including tokenization, term-frequency counting, term weighting, building the vector space model, and retrieval
test: for unit tests, naturally
Two configuration files: the Paoding analyzer's core configuration file and the project's properties file
Jar dependencies:
Integrating the Paoding analyzer:
Download Paoding from the official site, unpack it, copy its jar and paoding.dic.home.properties into the project, and put the dic folder on disk at E:/paoding/
Configure paoding.dic.home.properties:
paoding.dic.home=E:/paoding/dic
Custom dictionaries:
The Paoding analyzer supports custom dictionaries as well as stop-word dictionaries: open the dic directory and create a new .dic file, as in the example below.
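For example, a custom dictionary is just a plain-text word list, one term per line (the file name below is made up for illustration; Paoding's bundled dictionaries are UTF-8 encoded, so a custom one should be too):

E:/paoding/dic/my-domain.dic:

信息检索
向量空间模型
词频统计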
Check that the analyzer works:
Add the following to Paoding's AnalyzerTest (analyzer and sb are fields of that test class; Token and termText() belong to the old Lucene 2.x API that Paoding builds on):
protected String dissect(String input) {
    try {
        TokenStream ts = analyzer.tokenStream("", new StringReader(input));
        Token token;
        sb.setLength(0);
        // walk the token stream and join the terms with '/'
        while ((token = ts.next()) != null) {
            sb.append(token.termText()).append('/');
        }
        if (sb.length() > 0) {
            sb.setLength(sb.length() - 1);   // drop the trailing '/'
        }
        return sb.toString();
    } catch (Exception e) {
        e.printStackTrace();
        return "error";
    }
}

/**
 * Tokenization test.
 */
public void testDisect() {
    try {
        String content = "你问我爱你有多深";
        String result = dissect(content);
        System.out.println(result);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
Run output:
OK, the Paoding analyzer works fine. Now on to the main task.
The document collection is 30 Word documents, with DOC_PATH = "E:/Doc".
Utility method: convert a .doc file to a String, using the third-party jar tm-extractors-0.4.jar.
/**
 * Convert a Word document to plain text.
 * @param file the .doc file
 * @return the extracted text
 * @throws Exception
 */
public static String wordToStr(File file) throws Exception {
    FileInputStream in = new FileInputStream(file);
    // create the WordExtractor (from tm-extractors-0.4.jar)
    WordExtractor extractor = new WordExtractor();
    // extract the text from the .doc file
    String text = extractor.extractText(in);
    in.close();
    return text.trim();
}
The approach: open the collection directory, list all the .doc files, and loop over them, converting each document to a String, tokenizing it, and counting its term frequencies. Each document's intermediate result is saved to local disk, one record per line, structured as
(docId/term/term frequency)
Term-frequency counting:
/**
 * Count how many times each term occurs in the list (term-frequency counting).
 * @param list the token list for one document
 * @return a map from term to its frequency
 */
public static Map<String, Integer> count(List<String> list) {
    Map<String, Integer> map = new HashMap<>();
    for (String term : list) {
        Integer num = map.get(term);
        if (num == null) {
            map.put(term, 1);      // first occurrence
        } else {
            map.put(term, num + 1);
        }
    }
    return map;
}
The whole pass in code:
/**
 * Generate one term file per document (the intermediate result).
 * Each line has the format: docId/term/term frequency
 * @throws Exception
 */
public void docToTerms() throws Exception {
    long start = System.currentTimeMillis();
    prop = MyTools.getProp("application.properties");
    if (prop == null) {
        return;
    }
    String DOC_PATH = prop.getProperty("doc_path");

    Analyzer analyzer = new PaodingAnalyzer();
    File root = new File(DOC_PATH);
    File[] docs = root.listFiles();

    try {
        String maxStr = "";
        for (int i = 0; i < docs.length; i++) {
            String docStr = MyTools.wordToStr(docs[i]);
            List<String> tokenList = MyTools.dissectToList(analyzer, docStr);
            Map<String, Integer> tfmap = MyTools.count(tokenList);

            // strip the 3-character name prefix and the ".doc" extension to get the document id
            String docId = docs[i].getName().substring(3);
            docId = docId.substring(0, docId.length() - 4);

            StringBuilder sb = new StringBuilder();
            int max = 1;
            for (Map.Entry<String, Integer> entry : tfmap.entrySet()) {
                String word = entry.getKey();
                int wordNum = entry.getValue();
                if (wordNum > max) {
                    max = wordNum;   // track this document's maximum term frequency
                }
                sb.append(docId + "/" + word + "/" + wordNum + "\r\n");
            }
            maxStr += (max + "/");

            // write one term file per document
            File file = new File("E:/index2/termDoc/" + docId + ".txt");
            if (!file.exists()) {
                file.createNewFile();
            }
            FileOutputStream fo = new FileOutputStream(file);
            OutputStreamWriter os = new OutputStreamWriter(fo);
            BufferedWriter bw = new BufferedWriter(os);
            bw.write(sb.toString().trim());
            bw.close();
            os.close();
            fo.close();
        }

        // record every document's maximum term frequency
        File maxtffile = new File("E:/index2/maxtf.txt");
        if (!maxtffile.exists()) {
            maxtffile.createNewFile();
        }
        FileOutputStream fo1 = new FileOutputStream(maxtffile);
        OutputStreamWriter os1 = new OutputStreamWriter(fo1);
        BufferedWriter bw1 = new BufferedWriter(os1);
        bw1.write(maxStr);
        bw1.close();
        os1.close();
        fo1.close();
    } catch (Exception e) {
        e.printStackTrace();
    }

    long end = System.currentTimeMillis();
    double runTimes = (end - start) / 1000.0;
    System.out.print("Time to generate the term-frequency intermediate files: " + runTimes + "s");
}
The code also tracks each document's maximum term frequency (maxStr) and serializes it to disk, which makes term-frequency normalization easy: normalized tf = tf / maxtf.
It uses the old java.io streams, ignores thread safety, and the stream closing (no finally or try-with-resources) is sloppy; I didn't bother to clean that up.
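For completeness, a minimal sketch of that normalization (normalizeTf is a hypothetical helper, not part of the project's MyTools; tfmap is the per-document map from count() above and maxtf that document's maximum frequency):

/**
 * Hypothetical sketch: normalized tf = raw tf / the document's maximum tf.
 */
public static Map<String, Double> normalizeTf(Map<String, Integer> tfmap, int maxtf) {
    Map<String, Double> normalized = new HashMap<>();
    for (Map.Entry<String, Integer> e : tfmap.entrySet()) {
        normalized.put(e.getKey(), e.getValue() / (double) maxtf);  // e.g. 1/4 = 0.25
    }
    return normalized;
}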
The 30-document collection produces 30 intermediate files. Part of one of them (docId/term/tf):
1/浪费/1
1/法权/1
1/提高/2
1/重视/2
1/利用率/1
1/引起/1
1/保证/1
1/效益/1
1/利于/4
The term-frequency normalization and weight computations below are all based on these intermediate results: weight(docId, term) = tf(docId, term) * idf(term)
tf(docId, term): the term's frequency within one document (how many times it occurs there), normalized here as tf/maxtf. For example, given the line 1/浪费/1 and maxtf = 4, tf(1, 浪费) = 1/4.
idf(term): inverse document frequency. The document frequency df is the number of documents in the collection that contain the term (count 1 for each document where it appears, 0 otherwise, and sum); the inverse document frequency is its reciprocal, smoothed here by taking a logarithm: idf = log10(N/df), where N = 30 is the collection size. Note that idf depends only on the term, not on the document.
weight(docId, term): the term's weight in that document.
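A worked example with made-up numbers: suppose 浪费 occurs once in document 1, document 1's maxtf is 4, and 浪费 appears in 3 of the 30 documents. Then tf(1, 浪费) = 1/4 = 0.25, idf(浪费) = log10(30/3) = 1, and weight(1, 浪费) = 0.25 × 1 = 0.25.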
The code that computes all of this:
/**
 * Collect every term that needs to be indexed
 * (those whose tf*idf weight clears the threshold).
 * @return the list of index terms
 */
public List<String> getIndexTerms() {
    long start = System.currentTimeMillis();
    double threshold = Constant.threshold;
    Set<String> set = new HashSet<>();
    File file = new File("E:/index2/DocIndex(tfidf).txt");
    List<Inverted> list = MyTools.readDocIndex(file);
    for (Inverted invert : list) {
        // getFrequency() returns the third column of the file (here the tf*idf weight)
        if (invert.getFrequency() > threshold) {
            set.add(invert.getLexicalItem());
        }
    }
    List<String> result = new ArrayList<>(set);

    long end = System.currentTimeMillis();
    double runTimes = (end - start) / 1000.0;
    System.out.println("Time to collect the index terms: " + runTimes + "s");
    return result;
}

/*
 * Load every document into an array of hash maps: <term, frequency>.
 * Document id - 1 gives the array index.
 */
public Map<String, Double>[] getMapArray() {
    long start = System.currentTimeMillis();

    File file = new File(Constant.TERMDOC_PATH);
    File[] fileList = file.listFiles();
    Map<String, Double>[] map = new Map[30];
    for (int i = 0; i < Constant.DOC_NUMS; i++) {
        List<Inverted> list = MyTools.readDocIndex(fileList[i]);
        map[Constant.docnum[i] - 1] = new HashMap<String, Double>();
        for (Inverted inverted : list) {
            map[Constant.docnum[i] - 1].put(inverted.getLexicalItem(), inverted.getFrequency());
        }
    }

    long end = System.currentTimeMillis();
    double runTimes = (end - start) / 1000.0;
    System.out.println("Time to load the documents into hash maps: " + runTimes + "s");
    return map;
}

/**
 * Print, for every term, the number of documents containing it.
 */
public void countTermDocNum() {
    long start = System.currentTimeMillis();
    List<String> result = getAllTerms();
    Map<String, Double>[] map = getMapArray();
    System.out.println(result.size());
    for (String s : result) {
        int num = 0;
        for (int i = 0; i < Constant.DOC_NUMS; i++) {
            if (map[i].containsKey(s)) {
                num++;
            }
        }
        System.out.println(s + ":" + num);
    }
    long end = System.currentTimeMillis();
    double runTimes = (end - start) / 1000.0;
    System.out.println("Running time: " + runTimes + "s");
}

/**
 * Build the term/document-frequency file.
 * @throws Exception
 */
public void countTermDocNumCopy() throws Exception {
    long start = System.currentTimeMillis();
    Map<String, Double>[] map = getMapArray();
    File file = new File("E:/index2/DocIndex(df).txt");
    if (!file.exists()) {
        file.createNewFile();
    }
    FileOutputStream fo = new FileOutputStream(file);
    OutputStreamWriter os = new OutputStreamWriter(fo);
    BufferedWriter bw = new BufferedWriter(os);
    List<Inverted> list = MyTools.readDocIndex(new File("E:/index2/DocIndex(tf).txt"));
    for (Inverted inverted : list) {
        String term = inverted.getLexicalItem();
        int num = 0;
        for (int i = 0; i < Constant.DOC_NUMS; i++) {
            if (map[i].containsKey(term)) {
                num++;   // number of documents containing this term (df)
            }
        }
        bw.write(inverted.getDocId() + "/" + inverted.getLexicalItem() + "/" + num);
        bw.write("\r\n");
    }
    bw.close();
    os.close();
    fo.close();

    long end = System.currentTimeMillis();
    double runTimes = (end - start) / 1000.0;
    System.out.println("Time to build the document-frequency file: " + runTimes + "s");
}

/**
 * Generate the idf file.
 * @throws Exception
 */
public void getDocIdf() throws Exception {
    long start = System.currentTimeMillis();
    Map<String, Double>[] map = getMapArray();
    File file = new File("E:/index2/DocIndex(idf).txt");
    if (!file.exists()) {
        file.createNewFile();
    }
    FileOutputStream fo = new FileOutputStream(file);
    OutputStreamWriter os = new OutputStreamWriter(fo);
    BufferedWriter bw = new BufferedWriter(os);
    List<Inverted> list = MyTools.readDocIndex(new File("E:/index2/DocIndex(df).txt"));
    for (Inverted inverted : list) {
        String term = inverted.getLexicalItem();
        int num = 0;
        for (int i = 0; i < Constant.DOC_NUMS; i++) {
            if (map[i].containsKey(term)) {
                num++;
            }
        }
        // idf = log10(N / df); cast to double to avoid integer division
        bw.write(inverted.getDocId() + "/" + inverted.getLexicalItem() + "/"
                + Math.log((double) Constant.DOC_NUMS / num) / Math.log(10));
        bw.write("\r\n");
    }
    bw.close();
    os.close();
    fo.close();

    long end = System.currentTimeMillis();
    double runTimes = (end - start) / 1000.0;
    System.out.println("Time to generate the idf file: " + runTimes + "s");
}

/**
 * Generate the tf*idf file.
 * @throws Exception
 */
public void getTfIdf() throws Exception {
    long start = System.currentTimeMillis();
    File fileTfIdf = new File("E:/index2/DocIndex(tfidf).txt");
    if (!fileTfIdf.exists()) {
        fileTfIdf.createNewFile();
    }

    FileOutputStream fo = new FileOutputStream(fileTfIdf);
    OutputStreamWriter os = new OutputStreamWriter(fo);
    BufferedWriter bw = new BufferedWriter(os);

    List<Inverted> listTf = MyTools.readDocIndex(new File("E:/index2/DocIndex(tf).txt"));
    List<Inverted> listIdf = MyTools.readDocIndex(new File("E:/index2/DocIndex(idf).txt"));

    // the tf and idf files are line-aligned, so multiply entry by entry
    for (int i = 0; i < listTf.size(); i++) {
        double tf = listTf.get(i).getFrequency();
        double idf = listIdf.get(i).getFrequency();
        bw.write(listTf.get(i).getDocId() + "/" + listTf.get(i).getLexicalItem() + "/" + tf * idf);
        bw.write("\r\n");
    }
    bw.close();
    os.close();
    fo.close();
    long end = System.currentTimeMillis();
    double runTimes = (end - start) / 1000.0;
    System.out.println("Time to generate the tf*idf file: " + runTimes + "s");
}
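For orientation, a hypothetical driver showing the order these steps have to run in, since each method reads a file an earlier one wrote (Index is a made-up name for the class holding the methods above; the merge of the per-document term files into the normalized DocIndex(tf).txt is a step the post does not show):

Index idx = new Index();                        // hypothetical holder class
idx.docToTerms();                               // per-document tf files + maxtf.txt
// ... merge/normalize into DocIndex(tf).txt (not shown in the post) ...
idx.countTermDocNumCopy();                      // DocIndex(df).txt, read from DocIndex(tf).txt
idx.getDocIdf();                                // DocIndex(idf).txt, read from DocIndex(df).txt
idx.getTfIdf();                                 // DocIndex(tfidf).txt = tf * idf, line by line
List<String> indexTerms = idx.getIndexTerms();  // terms whose weight clears the threshold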
Pick a threshold and drop every term whose weight falls below it; the remaining terms go into a set (note the defining property of a set: element uniqueness). Then, over the terms of that set, build a term-to-document inverted index.
Building the inverted index:
/**
 * Build the inverted index.
 * @throws Exception
 */
public void invertedIndex() throws Exception {
    long start = System.currentTimeMillis();
    List<String> result = getAllTerms();
    Map<String, Double>[] map = getMapArray();
    System.out.println(result.size());

    File file = new File("E:/index2/inverted.txt");
    FileOutputStream fo = new FileOutputStream(file);
    OutputStreamWriter os = new OutputStreamWriter(fo);
    BufferedWriter bw = new BufferedWriter(os);

    for (String s : result) {
        // collect, in ascending order, the ids of the documents containing the term
        List<Integer> sortDoc = new ArrayList<>();
        for (int i = 0; i < Constant.DOC_NUMS; i++) {
            if (map[i].containsKey(s)) {
                sortDoc.add(i + 1);
            }
        }
        bw.write(s + ":");
        String str = "";
        for (int i : sortDoc) {
            str += i + "/";
        }
        bw.write(str.substring(0, str.length() - 1));
        bw.write("\r\n");
    }
    bw.close();
    os.close();
    fo.close();
    long end = System.currentTimeMillis();
    double runTimes = (end - start) / 1000.0;
    System.out.println("Time to build the inverted index: " + runTimes + "s");
}
Part of the resulting inverted index:
正式:16
老:2
仅是:4/11/24
退耕:24
者:12/27
最有:22
深远:23/27
公司:25
长方:25/27/28
内部:3/11
决策:2/9/20/29
为国:22
专家:3
准备:11
领域:3/21/22/24
法制:30
挥了:12/22
加产:17
倍:17
供求:23
知情权:29
为最:26
加人:22
登场:18
逐渐:5/27
耗费:27
确立:5/7/27
突击队:12
就是:9/10/12/16/22/24/27
The line 就是:9/10/12/16/22/24/27 means the term 就是 appears in documents 9, 10, 12, ..., and 27.
Retrieval:
Given an input query, tokenize it.
Approach 1 (Boolean): build a Boolean expression over the query terms, then look the terms up in the inverted index, repeatedly intersecting and unioning the posting lists to get the matching document set, as sketched below. The drawback is that it cannot score how relevant each document is to the query.
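A minimal sketch of the posting-list intersection that Boolean AND needs (intersect is a hypothetical helper; it assumes each posting list is an ascending list of document ids, which matches the inverted.txt output above):

/**
 * Hypothetical sketch: intersect two ascending posting lists in one linear pass.
 */
public static List<Integer> intersect(List<Integer> a, List<Integer> b) {
    List<Integer> result = new ArrayList<>();
    int i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        int da = a.get(i), db = b.get(j);
        if (da == db) {          // document contains both terms
            result.add(da);
            i++;
            j++;
        } else if (da < db) {
            i++;
        } else {
            j++;
        }
    }
    return result;
}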
Approach 2 (vector space): over the index terms from the previous section, build one vector per document; in the code below a component holds the term's stored frequency when the term occurs in the document, and 0 otherwise. Build a vector for the query the same way, then score document-query similarity with the vector space model's cosine formula.
Modeling the documents:
/**
 * Build the document vectors (vector space model).
 * Caveat: the vectors form a sparse matrix, so the storage could be optimized.
 * @throws Exception
 */
public void getVector() throws Exception {
    long start = System.currentTimeMillis();
    List<String> result = getIndexTerms();
    Map<String, Double>[] map = getMapArray();

    File file = new File("E:/index2/DocVector1.txt");
    FileOutputStream fo = new FileOutputStream(file);
    OutputStreamWriter os = new OutputStreamWriter(fo);
    BufferedWriter bw = new BufferedWriter(os);

    for (int i = 0; i < Constant.DOC_NUMS; i++) {
        // one component per index term: the stored frequency, or 0.0 if absent
        List<Double> list = new ArrayList<>();
        for (String s : result) {
            if (map[i].containsKey(s)) {
                list.add(map[i].get(s));
            } else {
                list.add(0.0);
            }
        }
        bw.write((i + 1) + ":");
        String str = "";
        for (double j : list) {
            str += j + "/";
        }
        bw.write(str.substring(0, str.length() - 1));
        bw.write("\r\n");
    }
    bw.close();
    os.close();
    fo.close();
    long end = System.currentTimeMillis();
    double runTimes = (end - start) / 1000.0;
    System.out.println("Time to build the vector space: " + runTimes + "s");
}
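For the similarity step the post leaves out, a minimal cosine-similarity sketch (cosine is a hypothetical helper working on plain double[] vectors, independent of the project's file formats; the query vector is built over the same index terms as the document vectors):

/**
 * Hypothetical sketch: cosine similarity = dot(q, d) / (|q| * |d|).
 */
public static double cosine(double[] q, double[] d) {
    double dot = 0, qNorm = 0, dNorm = 0;
    for (int k = 0; k < q.length; k++) {
        dot += q[k] * d[k];
        qNorm += q[k] * q[k];
        dNorm += d[k] * d[k];
    }
    if (qNorm == 0 || dNorm == 0) {
        return 0.0;   // an all-zero vector has no direction
    }
    return dot / (Math.sqrt(qNorm) * Math.sqrt(dNorm));
}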
That's it. The remaining pieces (tokenizing the query, building its vector, and ranking documents by similarity) are yours to implement~