Word Segmentation and Word Discovery

 

  Reference & Comment
1

Ogawa, Yasushi; Matsuda, Toru 
1999 
Overlapping statistical segmentation for effective indexing of Japanese text. Information Processing & Management, Volume 35, Issue 4, pp. 463-480.

2 Jens Kohlmorgen, Steven Lemm. 
2001. 
A Dynamic HMM for On-line Segmentation of Sequential Data. 
To appear in Proceedings of NIPS-2001. 

Not closely related to word segmentation.

3 Unsupervised Learning of Word Segmentation Rules with Genetic Algorithms and Inductive Logic Programming. 
2001. 
Dimitar Kazakov, Suresh Manandhar. 
Machine Learning, 43 (1/2):121-162, April 2001. (C) Kluwer Academic Publishers 

Good, but it targets morphology rather than word segmentation (recommended).

4 A Statistical Model for Word Discovery in Transcribed Speech 
2001. 
Anand Venkataraman. Computational Linguistics, Volume 27, Number 3, pages 351-379, 2001.
5 Sun Maosong, Shen Dayang, and Huang Changning, 
1997. 
Cseg & tag1.0: A Practical Word Segmenter and POS Tagger for Chinese Texts, 
Fifth Conference on Applied Natural Language Processing, Washington, DC, USA, pp. 119-126, March 31-April 3, 1997.

A survey of supervised methods.

6 Tom B.Y. Lai, Sun Maosong, Benjamin K. Tsou, S. Caesar Lun, 
1997. 
Chinese Word Segmentation and Part-of-Speech Tagging in One Step, 
Proceedings of Rocling X International Conference 1997 Research on Computational Linguistics, Taipei, Taiwan, China, August 22-24, pp.229-236, 1997. 

Divide-and-conquer strategy.

7 W. J. Teahan. 
Text Classification and Segmentation Using Minimum Cross-Entropy. 
In Proceedings of the International Conference on Content-based Multimedia Information Access (RIAO 2000), pages 943-961. C.I.D.-C.A.S.I.S, Paris,France, 2000. 
ISBN 2-905450-07-X. 

Same as the next entry.

8 W. J. Teahan, Y. Wen*, R. McNab*, and I. H. Witten*. 
A Compression-based Algorithm for Chinese Word Segmentation. 
Computational Linguistics, 26(3):375-393, 2000. 
ISSN 0891-2017. 

Supervised word segmentation; shortest-path algorithm framework.

9

A. Stolcke & E. Shriberg 
(1996), 
Automatic linguistic segmentation of conversational speech. 
Proc. Intl. Conf. on Spoken Language Processing, vol. 2, pp. 1005-1008, Philadelphia, PA.

10 A. Stolcke, E. Shriberg, R. Bates, M. Ostendorf, D. Hakkani, M. Plauche, G. Tur, & Y. Lu 
(1998). 
Automatic Detection of Sentence Boundaries and Disfluencies based on Recognized Words. 
Proc. Intl. Conf. on Spoken Language Processing, vol. 5, pp. 2247-2250, Sydney, Australia
11

Deb Roy 
2000. 
A Computational Model of Word Learning from Multimodal Sensory Input. 
International conference of Cognitive Modeling, Groningen, Netherlands, March 2000

12

Michael R. Brent and Xiaopeng Tao 
2001. 
Chinese Text Segmentation With MBDP-1: Making the Most of Training Corpora. ACL 2001.

I did not fully understand it; it does not seem very good.

13 Ando, R. K. and Lee, L. 
2000. 
Mostly-Unsupervised Statistical Segmentation of Japanese: Application to Kanji. 
ANLP-NAACL

Mutual-information framework; worth borrowing from (recommended).

14 Baker, D., Hofmann, T., McCallum, A. and Yang, Y. 
A Hierarchical Probabilistic Model for Novelty Detection in Text. 
Unpublished manuscript. 

Has little to do with word segmentation.

15 Brand, M. 
1999. 
Structure learning in conditional probability models via an entropic prior and parameter extinction. 
In Neural Computation, vol. 11, pages 1155-1182
 

Journal version of the next entry.

16 M. Brand, 
1998. 
An entropic estimator for structure discovery. 
To appear, NIPS98
 

Not closely related to word segmentation, but it is outstanding, almost beyond words (strongly recommended!).

17 M. Brand, 
1999, 
Pattern discovery via entropy minimization. 
To appear, Uncertainty99 (AI & Statistics) 

Same as the previous entry.

18 Brent, M. 
1999. 
An efficient, probabilistically sound algorithm for segmentation and word discovery. 
Machine Learning, 34, 71-106.
19 Brent, M.R. & T. A. Cartwright. 
1996. 
Distributional regularity and phonotactic constraints are useful for segmentation. 
In Computational Approaches to Language Acquisition, ed. Michael Brent. Cambridge, MA, MIT Press.
20 Brent, M. R. 
1999. 
Speech segmentation and word discovery: A computational perspective. 
Trends in Cognitive Science, 3, 294-301.
21 Dahan, D. and Brent, M. R. 
1999. 
On the discovery of novel word-like units from utterances: An artificial-language study with implications for native-language acquisition. 
In Journal of Experimental Psychology: General, Vol. 128, pp. 165-185
22 Brown, E. K., Miller, J. 
1991. 
Syntax: A Linguistic Introduction to Sentence Structure. 
Publisher: HarperCollins, London
23 Jing-Shin Chang and Keh-Yih Su, 
1997, 
An Unsupervised Iterative Method for Chinese New Lexicon Extraction, 
In International Journal of Computational Linguistics & Chinese Language Processing. 

Very poor and padded with filler; it is just EM, so why make it so complicated?

24 Chang, Jing-Shin, Yi-Chung Lin and Keh-Yih Su. 
1995. 
Automatic Construction of a Chinese Electronic Dictionary. 
Proceedings of the Third Workshop on Very Large Corpora, pp. 107-120, MIT, June, 1995. 

The same work as the previous entry.

25 Brian Clarkson and Alex Pentland. 
1999. 
Unsupervised clustering of ambulatory audio and video. 
In International Conference on Acoustics, Speech and Signal Processing, volume VI, pages 3037-3040. IEEE, 1999.
26 Deligne, S. and Bimbot, F. 
1995. 
Language Modeling by Variable Length Sequences: Theoretical Formulation and Evaluation of Multigrams. 
ICASSP, 1995
27 S. Deligne, F. Yvon, and F. Bimbot. 
1995. 
Variable-length sequence matching for phonetic transcription using joint multigrams. 
In EUROSPEECH.
28 Deligne, S.; Yvon, F.; and Bimbot, F. 
1996. 
Introducing statistical dependencies and structural constraints in variable-length sequence models. 
In Miclet, L., and de la Higuera, C., eds., Grammatical Inference: Learning Syntax from Sentences, Lecture Notes in Artificial Intelligence 1147. Springer. 156-167.
29 de Marcken, C. 
1995. 
The Unsupervised Acquisition of a Lexicon from Continuous Speech. 
Technical Report A.I. Memo No. 1558, AI Lab., MIT. Cambridge, Massachusetts.
30 Ge, X., Pratt, W. and Smyth, P. 
1999. 
Discovering Chinese Words from Unsegmented Text. 
SIGIR-99, pages 271-272.
 

EM framework. The experimental results reported in the paper are very good, but they still need to be verified in practice (recommended).

31 Goldsmith, J. 
2001. 
Unsupervised Learning of the Morphology of a Natural Language. 
To appear in Computational Linguistics, 2001.
32 A. Hanjalic, R.L. Lagendijk, J. Biemond. 
1999. 
Automatically Segmenting Movies into Logical Story Units. 
In D.P. Huijsmans, A.W.M. Smeulders (eds.): Lecture Notes in Computer Science 1614: Visual Information and Information Systems, ISBN 3-540-66079-8, pages 229-236, Springer Verlag 1999 (Proceedings of the Third International Conference VISUAL '99, Amsterdam (NL), June 1999)
33 Hua, Y. 
2000. 
Unsupervised word induction using MDL criterion. 
ISCSL2000, Beijing.
Fairly good; combines the EM framework with MDL (recommended).
34 Kit, C. and Wilks, Y. 
1999. 
Unsupervised Learning of Word Boundary with Description Length Gain. 
In Proceedings CoNLL99 ACL Workshop. Bergen.
 

Novel, but flawed. Could be used to initialize EM (recommended).

35 Kit, C. 
2000. 
Unsupervised Lexical Learning as Inductive Inference 
PhD thesis, University of Sheffield, UK, 2000.
36 Ponte, J. M. and Croft, W. B. 
1996. 
Useg: A retargetable word segmentation procedure for information retrieval. 
In Symposium on Document Analysis and Information Retrieval 96 (SDAIR).
37 Peng, Fuchun and Schuurmans, Dale 
2001. 
Self-supervised Chinese Word Segmentation. 
The 4th International Symposium on Intelligent Data Analysis (IDA2001), September, 2001, Lisbon, Portugal.
38 Peng, Fuchun and Schuurmans, Dale 
2001. 
A Hierarchical EM Approach to Word Segmentation, 
To appear in Proceedings of the Sixth Natural Language Processing Pacific Rim Symposium (NLPRS 2001), Nov. 2001, Tokyo, Japan.

EM framework, but the idea is rather convoluted.

39 Sproat, R. and Shih, C. 
1990. 
A statistical method for finding word boundaries in Chinese text. 
Computer Processing of Chinese and Oriental Languages, 4:336-351.
40 Zhao, J., Gao, J., Chang, E. and Li, M. 
2000. 
Lexicon optimization for Chinese language modeling. 
International Symposium Conference on Spoken Language Processing, Beijing.
41 Su, K., Wu, M., & Chang, J. 
1994. 
A Corpus-Based Approach to Automatic Compound Extraction. 
ACL Proceedings: 32nd Annual Meeting of the Association for Computational Linguistics, (Las Cruces, NM, June 1994), ACL, Morristown, NJ, pp.242-247.
42 Wu, M.-W. and K.-Y. Su, 
1993. 
Corpus-based Automatic Compound Extraction with Mutual Information and Relative Frequency Count. 
Proceedings of ROCLING VI, pp. 207-216, Nantou, Taiwan, ROC, Sep. 1993.
43 Chen, K., & Chen, H. 
1994. 
Extracting Noun Phrases from Large-Scale Texts: A Hybrid Approach and Its Automatic Evaluation. 
ACL Proceedings: 32nd Annual Meeting of the Association for Computational Linguistics, (Las Cruces, NM, June 1994), ACL, Morristown, NJ, pp. 234-241.
44 Jin, Wanying. 
1992. 
Chinese Segmentation and its Disambiguation. 
MCCS-92-227, Computing Research Laboratory, New Mexico State University, Las Cruces, New Mexico.
45 Kok-Wee Gan, Martha Palmer, Kim-Teng Lua 
1996. 
A Statistically Emergent Approach for Language Processing: Application to Modeling Context Effects in Ambiguous Chinese Word Boundary Perception. Computational Linguistics, Volume 22, pp. 531-553, 1996.
46 Sun Maosong, Shen Dayang, Benjamin K. Tsou 
1998. 
Chinese Word Segmentation without Using Lexicon and Hand-crafted Training Data.
COLING-ACL 1998: 1265-1271