How to know if your TF-IDF calculation is correct?

Problem description

Background: I'm just getting started with NLP. I have gone through the materials for a basic CS course, watched some videos, and read a bit...

My approach: For each technique I learn, write my own code (without using existing tools yet), test it against some texts, and see whether I get the results right or almost right.

My challenge: I don't know where it is appropriate to post questions about what I'm learning and get informative responses.

The first technique I'm learning is TF-IDF. The goal is to extract the most important and central ideas of a text.
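For reference, one common textbook variant of TF-IDF is written out below. This is an assumption for illustration, not necessarily the formula behind the numbers in this post: implementations differ in how they normalize tf and smooth idf, so absolute scores from two correct implementations can still disagree. Here N is the number of documents and df(t) is the number of documents that contain term t.

```latex
\begin{align}
\mathrm{tf}(t, d) &= \text{number of occurrences of term } t \text{ in document } d \\
\mathrm{idf}(t) &= \ln \frac{N}{\mathrm{df}(t)} \\
\mathrm{tfidf}(t, d) &= \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
\end{align}
```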

Text URL: https://news.yahoo.com/oxford-scientists-working-coronavirus-vaccine-100847254.html

Using TF, I get the following top "concepts" or "words" for this text (with respective scores):

oxford, 7
scientists, 6
coronavirus, 4
vaccine, 14
falling, 5
adam, 3 
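
A raw term-frequency count of this kind can be sketched as follows. This is only an illustration, not the code that produced the numbers above: the regex tokenizer, the tiny `STOP_WORDS` set, and the function names are all assumptions, and the sample is just two sentences from the article rather than the full text.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration only; a real list would be longer.
STOP_WORDS = {"a", "an", "the", "of", "on", "in", "to", "is", "are", "and", "that", "say"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def term_frequencies(text):
    """Return raw counts of all tokens that are not stop words."""
    return Counter(t for t in tokenize(text) if t not in STOP_WORDS)

# Two sentences taken from the article, used here as a small sample.
sample = (
    "Oxford scientists working on a coronavirus vaccine say the chances "
    "of success are now 50%. Scientists at Oxford are working with global "
    "pharmaceutical company AstraZeneca Plc to produce the vaccine."
)

tf = term_frequencies(sample)
for word, count in tf.most_common(5):
    print(word, count)
```

Counting by hand on a short sample like this is an easy way to check that the tokenizer and stop-word filter behave as expected before running on the full article.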

With a slightly different calculation formula, I get the following top "words":

scientists 
coronavirus 
vaccine 
uk 
falling 
bell 
might (this one is obviously a false positive, but I don't want to add it to the stop-word list, since its noun form would be meaningful in another context)

Using TF-IDF, I get the following top three "sentences":

(1) Oxford scientists working on a coronavirus vaccine say the chances of success are now 50%
(2) " Scientists at Oxford are working with global pharmaceutical company AstraZeneca Plc to produce the vaccine
(3) They say that's because the number of people with the virus in the UK is falling too quickly 

Their respective scores:
22.532130774077
17.891164215124
16.190527222538 
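
One way to score sentences with TF-IDF is sketched below, so that results can be compared step by step. It rests on assumptions that may not match the calculation used for the scores above: each sentence is treated as one document, tf is the raw count within the sentence, idf is ln(N / df), and a sentence's score is the sum of tf * idf over its terms. The function names are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def score_sentences(sentences):
    """Score each sentence as the sum of tf * idf over its terms.

    Assumptions: every sentence is one document, tf is the raw count
    inside the sentence, and idf = ln(N / df).
    """
    tokenized = [tokenize(s) for s in sentences]
    n = len(tokenized)

    # Document frequency: in how many sentences does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores.append(sum(count * math.log(n / df[term]) for term, count in tf.items()))
    return scores

sentences = [
    "Oxford scientists working on a coronavirus vaccine say the chances of success are now 50%",
    "Scientists at Oxford are working with global pharmaceutical company AstraZeneca Plc to produce the vaccine",
    "They say that's because the number of people with the virus in the UK is falling too quickly",
]

scores = score_sentences(sentences)
for score, sentence in sorted(zip(scores, sentences), reverse=True):
    print(f"{score:.3f}  {sentence}")
```

With only three sentences as the corpus, the printed scores will not match the ones listed above. Note also that a plain sum favors long sentences, and that a term occurring in every sentence (such as "the" here) gets an idf of 0 and contributes nothing.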

Could anyone run some tests against this text and see what results you get? By comparing my results with those of an expert, I would know how I'm doing.

Thanks in advance.

Tags: tf-idf
