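python - How to count word frequencies in a dictionary?
Problem description
I have a list of dictionaries, one per sentence, like the one below: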
[{'mississippi': 1, 'worth': 1, 'reading': 1}, {'commonplace': 1, 'river': 1, 'contrary': 1, 'ways': 1, 'remarkable': 1}, {'considering': 1, 'missouri': 1, 'main': 1, 'branch': 1, 'longest': 1, 'river': 1, 'world--four': 1}, {'seems': 1, 'safe': 1, 'crookedest': 1, 'river': 1, 'part': 1, 'journey': 1, 'uses': 1, 'cover': 1, 'ground': 1, 'crow': 1, 'fly': 1, 'six': 1, 'seventy-five': 1}, {'discharges': 1, 'water': 1, 'st': 1}, {'lawrence': 1, 'twenty-five': 1, 'rhine': 1, 'thirty-eight': 1, 'thames': 1}, {'river': 1, 'vast': 1, 'drainage-basin:': 1, 'draws': 1, 'water': 1, 'supply': 1, 'twenty-eight': 1, 'states': 1, 'territories': 1, 'delaware': 1, 'atlantic': 1, 'seaboard': 1, 'country': 1, 'idaho': 1, 'pacific': 1, 'slope--a': 1, 'spread': 1, 'forty-five': 1, 'degrees': 1, 'longitude': 1}, {'mississippi': 1, 'receives': 1, 'carries': 1, 'gulf': 1, 'water': 1, 'fifty-four': 1, 'subordinate': 1, 'rivers': 1, 'navigable': 1, 'steamboats': 1, 'hundreds': 1, 'flats': 1, 'keels': 1}, {'area': 1, 'drainage-basin': 1, 'combined': 1, 'areas': 1, 'england': 1, 'wales': 1, 'scotland': 1, 'ireland': 1, 'france': 1, 'spain': 1, 'portugal': 1, 'germany': 1, 'austria': 1, 'italy': 1, 'turkey': 1, 'almost': 1, 'wide': 1, 'region': 1, 'fertile': 1, 'mississippi': 1, 'valley': 1, 'proper': 1, 'exceptionally': 1}]
I want to turn it into the output shown below, so that I can compute a similarity score between two target words:
river 4
ground: 1
journey: 1
longitude: 1
main: 1
world--four: 1
contrary: 1
cover: 1
delaware: 1
remarkable: 1
vast: 1
forty-five: 1
crookedest: 1
territories: 1
spread: 1
country: 1
longest: 1
fly: 1
atlantic: 1
crow: 1
supply: 1
seems: 1
idaho: 1
seaboard: 1
states: 1
ways: 1
degrees: 1
part: 1
twenty-eight: 1
pacific: 1
branch: 1
water: 1
considering: 1
six: 1
safe: 1
commonplace: 1
draws: 1
drainage-basin: 1
uses: 1
seventy-five: 1
slope--a: 1
missouri: 1
mississippi 3
area: 1
steamboats: 1
germany: 1
reading: 1
france: 1
proper: 1
fifty-four: 1
turkey: 1
exceptionally: 1
areas: 1
carries: 1
combined: 1
flats: 1
receives: 1
england: 1
italy: 1
scotland: 1
wales: 1
almost: 1
navigable: 1
austria: 1
region: 1
wide: 1
spain: 1
subordinate: 1
drainage-basin: 1
hundreds: 1
keels: 1
portugal: 1
water: 1
gulf: 1
ireland: 1
rivers: 1
valley: 1
fertile: 1
worth: 1
water 3
steamboats: 1
spread: 1
country: 1
states: 1
longitude: 1
fifty-four: 1
pacific: 1
vast: 1
subordinate: 1
carries: 1
keels: 1
flats: 1
supply: 1
receives: 1
atlantic: 1
forty-five: 1
river: 1
rivers: 1
idaho: 1
mississippi: 1
seaboard: 1
navigable: 1
discharges: 1
degrees: 1
twenty-eight: 1
drainage-basin: 1
hundreds: 1
st: 1
gulf: 1
draws: 1
delaware: 1
territories: 1
slope--a: 1
drainage-basin 2
area: 1
spread: 1
country: 1
states: 1
mississippi: 1
longitude: 1
france: 1
proper: 1
vast: 1
turkey: 1
forty-five: 1
areas: 1
combined: 1
germany: 1
exceptionally: 1
valley: 1
supply: 1
fertile: 1
atlantic: 1
italy: 1
river: 1
idaho: 1
wales: 1
almost: 1
seaboard: 1
spain: 1
austria: 1
region: 1
degrees: 1
twenty-eight: 1
wide: 1
england: 1
portugal: 1
water: 1
ireland: 1
pacific: 1
draws: 1
delaware: 1
territories: 1
scotland: 1
slope--a: 1
area 1
germany: 1
austria: 1
mississippi: 1
france: 1
proper: 1
england: 1
turkey: 1
exceptionally: 1
areas: 1
combined: 1
scotland: 1
italy: 1
spain: 1
wales: 1
almost: 1
fertile: 1
region: 1
wide: 1
drainage-basin: 1
portugal: 1
ireland: 1
valley: 1
journey 1
ground: 1
seems: 1
part: 1
cover: 1
crow: 1
crookedest: 1
six: 1
safe: 1
uses: 1
seventy-five: 1
river: 1
fly: 1
seems 1
ground: 1
journey: 1
part: 1
cover: 1
crow: 1
crookedest: 1
six: 1
safe: 1
uses: 1
seventy-five: 1
river: 1
fly: 1
states 1
spread: 1
country: 1
degrees: 1
longitude: 1
twenty-eight: 1
drainage-basin: 1
vast: 1
forty-five: 1
water: 1
seaboard: 1
pacific: 1
draws: 1
delaware: 1
territories: 1
atlantic: 1
supply: 1
slope--a: 1
river: 1
idaho: 1
slope--a 1
spread: 1
country: 1
states: 1
degrees: 1
longitude: 1
drainage-basin: 1
vast: 1
forty-five: 1
water: 1
seaboard: 1
pacific: 1
draws: 1
delaware: 1
territories: 1
atlantic: 1
supply: 1
twenty-eight: 1
river: 1
idaho: 1
remarkable 1
contrary: 1
river: 1
commonplace: 1
ways: 1
vast 1
spread: 1
country: 1
states: 1
degrees: 1
longitude: 1
twenty-eight: 1
drainage-basin: 1
pacific: 1
forty-five: 1
water: 1
seaboard: 1
draws: 1
delaware: 1
territories: 1
atlantic: 1
supply: 1
slope--a: 1
river: 1
idaho: 1
forty-five 1
spread: 1
longitude: 1
country: 1
states: 1
degrees: 1
slope--a: 1
drainage-basin: 1
vast: 1
pacific: 1
water: 1
seaboard: 1
draws: 1
delaware: 1
territories: 1
atlantic: 1
supply: 1
twenty-eight: 1
river: 1
idaho: 1
crookedest 1
ground: 1
journey: 1
seems: 1
part: 1
cover: 1
crow: 1
six: 1
safe: 1
uses: 1
seventy-five: 1
river: 1
fly: 1
carries 1
mississippi: 1
steamboats: 1
navigable: 1
fifty-four: 1
keels: 1
hundreds: 1
subordinate: 1
water: 1
gulf: 1
flats: 1
rivers: 1
receives: 1
germany 1
area: 1
austria: 1
mississippi: 1
france: 1
proper: 1
exceptionally: 1
turkey: 1
england: 1
areas: 1
combined: 1
scotland: 1
italy: 1
spain: 1
wales: 1
almost: 1
fertile: 1
region: 1
wide: 1
drainage-basin: 1
portugal: 1
ireland: 1
valley: 1
longest 1
main: 1
river: 1
world--four: 1
branch: 1
missouri: 1
considering: 1
flats 1
mississippi: 1
steamboats: 1
navigable: 1
carries: 1
fifty-four: 1
keels: 1
hundreds: 1
subordinate: 1
water: 1
gulf: 1
rivers: 1
receives: 1
supply 1
spread: 1
longitude: 1
country: 1
states: 1
degrees: 1
slope--a: 1
drainage-basin: 1
vast: 1
forty-five: 1
water: 1
seaboard: 1
pacific: 1
draws: 1
delaware: 1
territories: 1
atlantic: 1
twenty-eight: 1
river: 1
idaho: 1
receives 1
mississippi: 1
steamboats: 1
navigable: 1
carries: 1
fifty-four: 1
keels: 1
hundreds: 1
subordinate: 1
water: 1
gulf: 1
flats: 1
rivers: 1
crow 1
ground: 1
journey: 1
seems: 1
part: 1
cover: 1
crookedest: 1
six: 1
safe: 1
uses: 1
seventy-five: 1
river: 1
fly: 1
scotland 1
area: 1
germany: 1
austria: 1
mississippi: 1
france: 1
proper: 1
exceptionally: 1
turkey: 1
england: 1
areas: 1
combined: 1
spain: 1
italy: 1
wales: 1
almost: 1
fertile: 1
region: 1
wide: 1
drainage-basin: 1
portugal: 1
ireland: 1
valley: 1
country 1
spread: 1
idaho: 1
states: 1
degrees: 1
longitude: 1
twenty-eight: 1
drainage-basin: 1
vast: 1
forty-five: 1
water: 1
seaboard: 1
pacific: 1
draws: 1
delaware: 1
territories: 1
atlantic: 1
supply: 1
slope--a: 1
river: 1
thames 1
thirty-eight: 1
rhine: 1
lawrence: 1
twenty-five: 1
england 1
area: 1
germany: 1
austria: 1
mississippi: 1
france: 1
proper: 1
exceptionally: 1
turkey: 1
region: 1
areas: 1
combined: 1
scotland: 1
italy: 1
spain: 1
wales: 1
almost: 1
fertile: 1
wide: 1
drainage-basin: 1
portugal: 1
ireland: 1
valley: 1
navigable 1
mississippi: 1
steamboats: 1
carries: 1
fifty-four: 1
keels: 1
hundreds: 1
subordinate: 1
water: 1
gulf: 1
flats: 1
rivers: 1
receives: 1
austria 1
area: 1
germany: 1
mississippi: 1
france: 1
proper: 1
region: 1
turkey: 1
england: 1
areas: 1
combined: 1
exceptionally: 1
scotland: 1
italy: 1
spain: 1
wales: 1
almost: 1
fertile: 1
wide: 1
drainage-basin: 1
portugal: 1
ireland: 1
valley: 1
rhine 1
thirty-eight: 1
thames: 1
lawrence: 1
twenty-five: 1
part 1
ground: 1
journey: 1
seems: 1
cover: 1
crow: 1
crookedest: 1
six: 1
safe: 1
uses: 1
seventy-five: 1
river: 1
fly: 1
twenty-eight 1
spread: 1
country: 1
states: 1
degrees: 1
longitude: 1
drainage-basin: 1
vast: 1
forty-five: 1
water: 1
seaboard: 1
pacific: 1
draws: 1
delaware: 1
territories: 1
atlantic: 1
supply: 1
slope--a: 1
river: 1
idaho: 1
branch 1
main: 1
longest: 1
river: 1
world--four: 1
missouri: 1
considering: 1
hundreds 1
mississippi: 1
steamboats: 1
navigable: 1
carries: 1
fifty-four: 1
keels: 1
subordinate: 1
water: 1
gulf: 1
flats: 1
rivers: 1
receives: 1
st 1
water: 1
discharges: 1
considering 1
main: 1
longest: 1
river: 1
world--four: 1
branch: 1
missouri: 1
six 1
ground: 1
journey: 1
seems: 1
part: 1
cover: 1
crow: 1
crookedest: 1
fly: 1
safe: 1
uses: 1
seventy-five: 1
river: 1
gulf 1
mississippi: 1
steamboats: 1
navigable: 1
carries: 1
fifty-four: 1
keels: 1
hundreds: 1
subordinate: 1
water: 1
flats: 1
rivers: 1
receives: 1
ireland 1
area: 1
germany: 1
austria: 1
mississippi: 1
france: 1
proper: 1
exceptionally: 1
turkey: 1
england: 1
areas: 1
combined: 1
scotland: 1
italy: 1
spain: 1
wales: 1
almost: 1
fertile: 1
region: 1
wide: 1
drainage-basin: 1
portugal: 1
valley: 1
safe 1
ground: 1
journey: 1
seems: 1
part: 1
cover: 1
crow: 1
crookedest: 1
six: 1
uses: 1
seventy-five: 1
river: 1
fly: 1
commonplace 1
contrary: 1
river: 1
remarkable: 1
ways: 1
draws 1
spread: 1
longitude: 1
country: 1
states: 1
degrees: 1
slope--a: 1
drainage-basin: 1
vast: 1
forty-five: 1
water: 1
seaboard: 1
pacific: 1
supply: 1
delaware: 1
territories: 1
atlantic: 1
twenty-eight: 1
river: 1
idaho: 1
delaware 1
spread: 1
longitude: 1
country: 1
states: 1
degrees: 1
slope--a: 1
drainage-basin: 1
vast: 1
forty-five: 1
water: 1
seaboard: 1
pacific: 1
draws: 1
territories: 1
atlantic: 1
supply: 1
twenty-eight: 1
river: 1
idaho: 1
thirty-eight 1
thames: 1
rhine: 1
lawrence: 1
twenty-five: 1
longitude 1
spread: 1
country: 1
states: 1
degrees: 1
slope--a: 1
drainage-basin: 1
vast: 1
forty-five: 1
water: 1
seaboard: 1
pacific: 1
draws: 1
delaware: 1
territories: 1
atlantic: 1
supply: 1
twenty-eight: 1
river: 1
idaho: 1
world--four 1
main: 1
longest: 1
river: 1
branch: 1
missouri: 1
considering: 1
lawrence 1
thirty-eight: 1
thames: 1
rhine: 1
twenty-five: 1
ground 1
journey: 1
seems: 1
part: 1
cover: 1
crow: 1
crookedest: 1
six: 1
safe: 1
uses: 1
seventy-five: 1
river: 1
fly: 1
steamboats 1
mississippi: 1
navigable: 1
carries: 1
fifty-four: 1
keels: 1
hundreds: 1
subordinate: 1
water: 1
gulf: 1
flats: 1
rivers: 1
receives: 1
spread 1
seaboard: 1
country: 1
states: 1
degrees: 1
longitude: 1
twenty-eight: 1
drainage-basin: 1
vast: 1
forty-five: 1
water: 1
pacific: 1
draws: 1
delaware: 1
territories: 1
atlantic: 1
supply: 1
slope--a: 1
river: 1
idaho: 1
idaho 1
spread: 1
country: 1
states: 1
degrees: 1
longitude: 1
twenty-eight: 1
drainage-basin: 1
vast: 1
forty-five: 1
water: 1
seaboard: 1
pacific: 1
draws: 1
delaware: 1
territories: 1
atlantic: 1
supply: 1
slope--a: 1
river: 1
reading 1
mississippi: 1
worth: 1
almost 1
area: 1
germany: 1
austria: 1
france: 1
proper: 1
england: 1
turkey: 1
exceptionally: 1
areas: 1
combined: 1
scotland: 1
italy: 1
spain: 1
wales: 1
mississippi: 1
fertile: 1
region: 1
wide: 1
drainage-basin: 1
portugal: 1
ireland: 1
valley: 1
contrary 1
river: 1
remarkable: 1
commonplace: 1
ways: 1
cover 1
ground: 1
journey: 1
seems: 1
part: 1
crow: 1
crookedest: 1
six: 1
safe: 1
uses: 1
seventy-five: 1
river: 1
fly: 1
france 1
area: 1
germany: 1
austria: 1
mississippi: 1
proper: 1
england: 1
turkey: 1
exceptionally: 1
areas: 1
combined: 1
scotland: 1
italy: 1
spain: 1
wales: 1
almost: 1
fertile: 1
region: 1
wide: 1
drainage-basin: 1
portugal: 1
ireland: 1
valley: 1
spain 1
area: 1
germany: 1
austria: 1
mississippi: 1
france: 1
proper: 1
exceptionally: 1
turkey: 1
england: 1
areas: 1
combined: 1
scotland: 1
italy: 1
wales: 1
almost: 1
fertile: 1
region: 1
wide: 1
drainage-basin: 1
portugal: 1
ireland: 1
valley: 1
pacific 1
spread: 1
longitude: 1
country: 1
states: 1
degrees: 1
slope--a: 1
drainage-basin: 1
vast: 1
forty-five: 1
water: 1
seaboard: 1
draws: 1
delaware: 1
territories: 1
atlantic: 1
supply: 1
twenty-eight: 1
river: 1
idaho: 1
turkey 1
area: 1
germany: 1
austria: 1
mississippi: 1
france: 1
proper: 1
exceptionally: 1
england: 1
areas: 1
combined: 1
scotland: 1
italy: 1
spain: 1
wales: 1
almost: 1
fertile: 1
region: 1
wide: 1
drainage-basin: 1
portugal: 1
ireland: 1
valley: 1
fifty-four 1
mississippi: 1
steamboats: 1
navigable: 1
carries: 1
hundreds: 1
keels: 1
subordinate: 1
water: 1
gulf: 1
flats: 1
rivers: 1
receives: 1
subordinate 1
mississippi: 1
steamboats: 1
navigable: 1
carries: 1
fifty-four: 1
keels: 1
hundreds: 1
water: 1
gulf: 1
flats: 1
rivers: 1
receives: 1
territories 1
spread: 1
idaho: 1
states: 1
degrees: 1
longitude: 1
twenty-eight: 1
drainage-basin: 1
vast: 1
forty-five: 1
water: 1
seaboard: 1
pacific: 1
draws: 1
delaware: 1
supply: 1
atlantic: 1
slope--a: 1
river: 1
country: 1
combined 1
area: 1
germany: 1
austria: 1
mississippi: 1
france: 1
proper: 1
exceptionally: 1
turkey: 1
england: 1
areas: 1
scotland: 1
italy: 1
spain: 1
wales: 1
almost: 1
fertile: 1
region: 1
wide: 1
drainage-basin: 1
portugal: 1
ireland: 1
valley: 1
exceptionally 1
area: 1
germany: 1
austria: 1
mississippi: 1
france: 1
proper: 1
england: 1
turkey: 1
region: 1
areas: 1
combined: 1
scotland: 1
italy: 1
spain: 1
wales: 1
almost: 1
fertile: 1
wide: 1
drainage-basin: 1
portugal: 1
ireland: 1
valley: 1
region 1
area: 1
germany: 1
austria: 1
mississippi: 1
france: 1
proper: 1
exceptionally: 1
turkey: 1
england: 1
areas: 1
combined: 1
scotland: 1
italy: 1
spain: 1
wales: 1
almost: 1
fertile: 1
wide: 1
drainage-basin: 1
portugal: 1
ireland: 1
valley: 1
twenty-five 1
thirty-eight: 1
thames: 1
lawrence: 1
rhine: 1
rivers 1
mississippi: 1
steamboats: 1
navigable: 1
carries: 1
fifty-four: 1
keels: 1
hundreds: 1
subordinate: 1
water: 1
gulf: 1
flats: 1
receives: 1
fly 1
ground: 1
journey: 1
seems: 1
part: 1
cover: 1
crow: 1
crookedest: 1
six: 1
safe: 1
uses: 1
seventy-five: 1
river: 1
atlantic 1
spread: 1
longitude: 1
country: 1
states: 1
degrees: 1
slope--a: 1
drainage-basin: 1
vast: 1
forty-five: 1
water: 1
seaboard: 1
pacific: 1
draws: 1
delaware: 1
territories: 1
river: 1
supply: 1
twenty-eight: 1
idaho: 1
italy 1
area: 1
germany: 1
austria: 1
mississippi: 1
france: 1
proper: 1
exceptionally: 1
turkey: 1
england: 1
areas: 1
combined: 1
scotland: 1
spain: 1
wales: 1
almost: 1
fertile: 1
region: 1
wide: 1
drainage-basin: 1
portugal: 1
ireland: 1
valley: 1
main 1
world--four: 1
longest: 1
river: 1
branch: 1
missouri: 1
considering: 1
areas 1
area: 1
germany: 1
austria: 1
mississippi: 1
france: 1
proper: 1
england: 1
turkey: 1
exceptionally: 1
combined: 1
scotland: 1
italy: 1
spain: 1
wales: 1
almost: 1
fertile: 1
region: 1
wide: 1
drainage-basin: 1
portugal: 1
ireland: 1
valley: 1
seaboard 1
spread: 1
country: 1
states: 1
degrees: 1
longitude: 1
twenty-eight: 1
drainage-basin: 1
vast: 1
forty-five: 1
water: 1
pacific: 1
draws: 1
delaware: 1
territories: 1
atlantic: 1
supply: 1
slope--a: 1
river: 1
idaho: 1
fertile 1
area: 1
germany: 1
austria: 1
mississippi: 1
france: 1
proper: 1
exceptionally: 1
turkey: 1
england: 1
areas: 1
combined: 1
scotland: 1
italy: 1
spain: 1
wales: 1
almost: 1
region: 1
wide: 1
drainage-basin: 1
portugal: 1
ireland: 1
valley: 1
ways 1
contrary: 1
river: 1
remarkable: 1
commonplace: 1
discharges 1
water: 1
st: 1
degrees 1
spread: 1
country: 1
states: 1
longitude: 1
twenty-eight: 1
drainage-basin: 1
vast: 1
forty-five: 1
water: 1
seaboard: 1
pacific: 1
draws: 1
delaware: 1
territories: 1
atlantic: 1
supply: 1
slope--a: 1
river: 1
idaho: 1
wide 1
area: 1
germany: 1
austria: 1
mississippi: 1
france: 1
proper: 1
exceptionally: 1
turkey: 1
england: 1
areas: 1
combined: 1
scotland: 1
italy: 1
spain: 1
wales: 1
almost: 1
fertile: 1
region: 1
drainage-basin: 1
portugal: 1
ireland: 1
valley: 1
proper 1
area: 1
germany: 1
austria: 1
mississippi: 1
france: 1
england: 1
turkey: 1
exceptionally: 1
areas: 1
combined: 1
scotland: 1
italy: 1
spain: 1
wales: 1
almost: 1
fertile: 1
region: 1
wide: 1
drainage-basin: 1
portugal: 1
ireland: 1
valley: 1
keels 1
mississippi: 1
steamboats: 1
navigable: 1
water: 1
fifty-four: 1
hundreds: 1
subordinate: 1
carries: 1
gulf: 1
flats: 1
rivers: 1
receives: 1
portugal 1
area: 1
germany: 1
austria: 1
mississippi: 1
france: 1
proper: 1
exceptionally: 1
turkey: 1
england: 1
areas: 1
combined: 1
scotland: 1
italy: 1
spain: 1
wales: 1
almost: 1
fertile: 1
region: 1
wide: 1
drainage-basin: 1
ireland: 1
valley: 1
worth 1
mississippi: 1
reading: 1
uses 1
ground: 1
journey: 1
seems: 1
part: 1
cover: 1
crow: 1
crookedest: 1
six: 1
safe: 1
fly: 1
seventy-five: 1
river: 1
seventy-five 1
ground: 1
journey: 1
seems: 1
part: 1
cover: 1
crow: 1
crookedest: 1
six: 1
safe: 1
uses: 1
river: 1
fly: 1
valley 1
area: 1
germany: 1
austria: 1
mississippi: 1
france: 1
proper: 1
exceptionally: 1
turkey: 1
england: 1
areas: 1
combined: 1
scotland: 1
italy: 1
spain: 1
wales: 1
almost: 1
fertile: 1
region: 1
wide: 1
drainage-basin: 1
portugal: 1
ireland: 1
missouri 1
main: 1
longest: 1
river: 1
branch: 1
world--four: 1
considering: 1
wales 1
area: 1
germany: 1
austria: 1
mississippi: 1
france: 1
proper: 1
exceptionally: 1
turkey: 1
england: 1
areas: 1
combined: 1
scotland: 1
italy: 1
spain: 1
almost: 1
fertile: 1
region: 1
wide: 1
drainage-basin: 1
portugal: 1
ireland: 1
valley: 1
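The first line of each group is the target word and its frequency across the whole list of dictionaries. The lines below it are the related words and their frequencies within the sentences that also contain the target word. Taking the first dictionary as an example, the profile for "mississippi" will contain references to "worth" and "reading", each with a frequency of 1 within the sentence, while the frequency of "mississippi" across the whole list is 3. I want to sort the target words by frequency in descending order. Can anyone help?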
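Solution
It is not clear from either your desired output or your code what exactly you are trying to achieve, but if it is just counting the words in individual sentences, the strategy should be:
- Read your common.txt into a set for fast lookup.
- Read your sample.txt and split it on . to get the individual sentences.
- Clear out all non-word characters (you have to define them yourself, or use the regex \b to capture word boundaries) and replace them with whitespace.
- Split on whitespace and count the words that are not present in the set from step 1.
So: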
import collections
with open("common.txt", "r") as f:  # open the `common.txt` for reading
    common_words = {l.strip().lower() for l in f}  # read each line and add it to a set
interpunction = ";,'\""  # define word separating characters and create a translation table
trans_table = str.maketrans(interpunction, " " * len(interpunction))
sentences_counter = []  # a list to hold a word count for each sentence
with open("sample.txt", "r") as f:  # open the `sample.txt` for reading
    # read the whole file to include linebreaks and split on `.` to get individual sentences
    sentences = [s for s in f.read().split(".") if s.strip()]  # ignore empty sentences
for sentence in sentences:  # iterate over each sentence
    sentence = sentence.translate(trans_table)  # replace the interpunction with spaces
    word_counter = collections.defaultdict(int)  # a string:int default dict for counting
    for word in sentence.split():  # split the sentence and iterate over the words
        if word.lower() not in common_words:  # count only words not in the common.txt
            word_counter[word.lower()] += 1
    sentences_counter.append(word_counter)  # add the current sentence word count
NOTE: On Python 2.x use string.maketrans() instead of str.maketrans().
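To make the translation-table step concrete, here is a minimal sketch of what str.maketrans() and str.translate() do with the separator characters used above. The sample string is made up for illustration and is not taken from sample.txt.

```python
# Characters treated as word separators (same set as in the answer above).
interpunction = ";,'\""

# str.maketrans(a, b) maps the i-th character of `a` to the i-th character of `b`,
# so both strings must have the same length.
trans_table = str.maketrans(interpunction, " " * len(interpunction))

# Hypothetical input, only for illustration.
sample = 'river; water, "gulf"'
cleaned = sample.translate(trans_table)  # every separator becomes a space
words = cleaned.split()                  # split() collapses runs of whitespace
print(words)
```

Because every separator is replaced by a space rather than deleted, words that were joined only by punctuation are still split apart.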
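This will produce sentences_counter, a list holding one word-count dictionary for each sentence in sample.txt, where the key is the actual word and the associated value is its count. You can print the result as: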
for i, v in enumerate(sentences_counter):
    print("Sentence #{}:".format(i+1))
    print("\n".join("\t{}: {}".format(w, c) for w, c in v.items()))
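This will produce (for your sample data):
Sentence #1:
    area: 1
    drainage-basin: 1
    great: 1
    combined: 1
    areas: 1
    england: 1
    wales: 1
    wide: 1
    region: 1
    fertile: 1
Sentence #2:
    mississippi: 1
    valley: 1
    proper: 1
    exceptionally: 1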
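Keep in mind that the (English) language is more complex than this. For example, "A cat wiggles its tail when it's angry, so stay away from it." will come out very differently depending on how you treat the apostrophe. Also, a dot does not necessarily mark the end of a sentence. You should look into NLP if you want to do serious linguistic analysis.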
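As a rough illustration of the apostrophe point, the sketch below tokenizes the same sentence twice: once treating the apostrophe as a separator, as the translation table does, and once with a regular expression that keeps an apostrophe sitting between letters. The pattern WORD_RE is an assumption chosen for this example, not a general-purpose English tokenizer.

```python
import re

sentence = "A cat wiggles its tail when it's angry, so stay away from it."

# Approach 1: treat the apostrophe as a separator (what the translation table does).
separators = ";,'\"."
table = str.maketrans(separators, " " * len(separators))
split_tokens = sentence.lower().translate(table).split()

# Approach 2: keep an apostrophe that sits between letters as part of the word.
# This pattern is an illustrative assumption and only handles ASCII letters.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")
regex_tokens = WORD_RE.findall(sentence.lower())

print(split_tokens)
print(regex_tokens)
```

With the first approach "it's" is counted as the two words "it" and "s"; with the second it stays a single word, so the counts for "it" differ between the two.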
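UPDATE: I don't see the usefulness of repeating the same data for every word (the counts won't change within a sentence), but if you want to print each word and nest all the other counts underneath it, you can add an inner loop when printing: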
for i, v in enumerate(sentences_counter):
    print("Sentence #{}:".format(i+1))
    for word, count in v.items():
        print("\t{} {}".format(word, count))
        print("\n".join("\t\t{}: {}".format(w, c) for w, c in v.items() if w != word))
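This will give you:
Sentence #1:
    area 1
        drainage-basin: 1
        great: 1
        combined: 1
        areas: 1
        england: 1
        wales: 1
        wide: 1
        region: 1
        fertile: 1
    drainage-basin 1
        area: 1
        great: 1
        combined: 1
        areas: 1
        england: 1
        wales: 1
        wide: 1
        region: 1
        fertile: 1
    great 1
        area: 1
        drainage-basin: 1
        combined: 1
        areas: 1
        england: 1
        wales: 1
        wide: 1
        region: 1
        fertile: 1
    combined 1
        area: 1
        drainage-basin: 1
        great: 1
        areas: 1
        england: 1
        wales: 1
        wide: 1
        region: 1
        fertile: 1
    areas 1
        area: 1
        drainage-basin: 1
        great: 1
        combined: 1
        england: 1
        wales: 1
        wide: 1
        region: 1
        fertile: 1
    england 1
        area: 1
        drainage-basin: 1
        great: 1
        combined: 1
        areas: 1
        wales: 1
        wide: 1
        region: 1
        fertile: 1
    wales 1
        area: 1
        drainage-basin: 1
        great: 1
        combined: 1
        areas: 1
        england: 1
        wide: 1
        region: 1
        fertile: 1
    wide 1
        area: 1
        drainage-basin: 1
        great: 1
        combined: 1
        areas: 1
        england: 1
        wales: 1
        region: 1
        fertile: 1
    region 1
        area: 1
        drainage-basin: 1
        great: 1
        combined: 1
        areas: 1
        england: 1
        wales: 1
        wide: 1
        fertile: 1
    fertile 1
        area: 1
        drainage-basin: 1
        great: 1
        combined: 1
        areas: 1
        england: 1
        wales: 1
        wide: 1
        region: 1
Sentence #2:
    mississippi 1
        valley: 1
        proper: 1
        exceptionally: 1
    valley 1
        mississippi: 1
        proper: 1
        exceptionally: 1
    proper 1
        mississippi: 1
        valley: 1
        exceptionally: 1
    exceptionally 1
        mississippi: 1
        valley: 1
        proper: 1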
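Feel free to remove the sentence number printing and drop one level of tab indentation to get something closer to the desired output from your question. You can also build a tree-like dictionary instead of printing everything to STDOUT, if that is more to your liking.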
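A tree-like dictionary of that kind could be built roughly as follows. This is only a sketch under assumptions: the helper name build_profiles is hypothetical, the input is a tiny made-up list in the same shape as the one in the question, and co-occurrence counts are summed across sentences, which may or may not be what the question intends.

```python
import collections

def build_profiles(sentences_counter):
    """Return (totals, profiles) from a list of per-sentence word-count dicts.

    totals[word] is the word's frequency over all sentences;
    profiles[word][other] sums the counts of `other` in sentences containing `word`.
    """
    totals = collections.Counter()
    profiles = collections.defaultdict(collections.Counter)
    for counts in sentences_counter:
        for word, count in counts.items():
            totals[word] += count
            for other, other_count in counts.items():
                if other != word:
                    profiles[word][other] += other_count
    return totals, profiles

# Tiny hypothetical input in the same shape as the list in the question.
sentences_counter = [
    {"mississippi": 1, "worth": 1, "reading": 1},
    {"river": 1, "mississippi": 1},
    {"river": 1, "water": 1},
]
totals, profiles = build_profiles(sentences_counter)

# most_common() orders the target words by total frequency, highest first.
for word, total in totals.most_common():
    print("{} {}".format(word, total))
    for other, count in profiles[word].items():
        print("\t{}: {}".format(other, count))
```

Keeping the totals and the per-word profiles in separate structures means the descending sort only has to look at the totals, while each profile can still be looked up by target word.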
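UPDATE 2: If you want to, you don't have to use a set for common_words. In this case it is pretty much interchangeable with a list, so you can use a list comprehension instead of a set comprehension (i.e. replace the curly brackets with square ones). However, a lookup through a list is an O(n) operation, whereas a set lookup is O(1) on average, so a set is preferred here. Not to mention the side benefit of automatic de-duplication in case common.txt contains duplicate words.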
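The difference between the two comprehensions can be seen on a small example. The lines list below is a hypothetical stand-in for the contents of common.txt, chosen to include a duplicate and a mixed-case entry.

```python
# Hypothetical contents of common.txt, including duplicates and mixed case.
lines = ["the\n", "The\n", "of\n", "and\n", "of\n"]

# List comprehension: keeps duplicates; `in` scans the elements one by one.
common_list = [l.strip().lower() for l in lines]

# Set comprehension: duplicates collapse; `in` is a hash lookup.
common_set = {l.strip().lower() for l in lines}

print(len(common_list), len(common_set))
print("river" in common_set, "of" in common_set)
```

Both containers answer the membership test identically; the set simply does so without scanning every entry, which matters once common.txt grows to thousands of words.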
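As for collections.defaultdict(), it is there only to save us some coding and checking: it automatically initializes a key in the dictionary the first time that key is requested. Without it you would have to do that manually: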
with open("common.txt", "r") as f:  # open the `common.txt` for reading
    common_words = {l.strip().lower() for l in f}  # read each line and add it to a set
interpunction = ";,'\""  # define word separating characters and create a translation table
trans_table = str.maketrans(interpunction, " " * len(interpunction))
sentences_counter = []  # a list to hold a word count for each sentence
with open("sample.txt", "r") as f:  # open the `sample.txt` for reading
    # read the whole file to include linebreaks and split on `.` to get individual sentences
    sentences = [s for s in f.read().split(".") if s.strip()]  # ignore empty sentences
for sentence in sentences:  # iterate over each sentence
    sentence = sentence.translate(trans_table)  # replace the interpunction with spaces
    word_counter = {}  # initialize a word counting dictionary
    for word in sentence.split():  # split the sentence and iterate over the words
        word = word.lower()  # turn the word to lowercase
        if word not in common_words:  # count only words not in the common.txt
            word_counter[word] = word_counter.get(word, 0) + 1  # increase the last count
    sentences_counter.append(word_counter)  # add the current sentence word count
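UPDATE 3: If you just want a raw word list, as your last update to the question suggests, you don't even need to consider the sentences themselves. Just add a dot to the interpunction list, read the file line by line, split on whitespace and count the words as before: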
import collections
with open("common.txt", "r") as f:  # open the `common.txt` for reading
    common_words = {l.strip().lower() for l in f}  # read each line and add it to a set
interpunction = ";,'\"."  # define word separating characters and create a translation table
trans_table = str.maketrans(interpunction, " " * len(interpunction))
word_counter = collections.defaultdict(int)  # a string:int default dict for counting
with open("sample.txt", "r") as f:  # open the `sample.txt` for reading
    for line in f:  # read the file line by line
        for word in line.translate(trans_table).split():  # remove interpunction and split
            if word.lower() not in common_words:  # count only words not in the common.txt
                word_counter[word.lower()] += 1  # increase the count
print("\n".join("{}: {}".format(w, c) for w, c in word_counter.items()))  # print the counts
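Since the question also asks for the words sorted by frequency in descending order, one possible variation is to count with collections.Counter and print via its most_common() method. The sketch below swaps the files for small in-memory stand-ins; common_words and text are hypothetical values, not the real contents of common.txt and sample.txt.

```python
import collections

# Hypothetical stand-ins for the contents of common.txt and sample.txt.
common_words = {"the", "is", "a", "of"}
text = "The river is vast. The water of the river draws a crow."

interpunction = ";,'\"."
trans_table = str.maketrans(interpunction, " " * len(interpunction))

# Counter counts every item produced by the generator expression.
word_counter = collections.Counter(
    word
    for word in text.lower().translate(trans_table).split()
    if word not in common_words
)

# most_common() returns (word, count) pairs sorted by count, highest first.
for word, count in word_counter.most_common():
    print("{}: {}".format(word, count))
```

Counter is a dict subclass, so the rest of the code that reads word_counter[word] keeps working unchanged.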