Compare two comma-separated lists and find the elements they share and the elements unique to each

Problem description

I have the following two ingredient lists:

  1. Calcium Carbonate, Aqua, Sorbitol, Aroma, Poloxamer 407, Sodium  Monofluorophosphate (1450 ppm F), Cocamidopropyl Betaine, Zinc Oxide, Benzyl Alcohol, Cellulose Gum, Zinc Citrate, Sodium Bicarbonate, Tetrasodium Pyrophosphate, Xanthan Gum, Sodium Saccharin, Sucralose, Limonene, CI 77891.
    
  2. Calcium Carbonate, Aqua, Sorbitol, Sodium Lauryl Sulfate, Aroma, Sodium Monofluorophosphate (1450 ppm F), Cellulose Gum, Sodium Bicarbonate, Tetrasodium Pyrophosphate, Sodium Saccharin, Benzyl Alcohol, Xanthan Gum, Limonene, CI 77891.
    

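What I want to know is:

  1. Which elements the two lists have in common
  2. Which elements are present in one list but not in the other

I wrote something in Python that works, but I would like a simpler bash implementation.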

# Python 2 script (it uses print statements): compares the ingredient lists
# stored on the first line of each file named on the command line.
import os
import sys
from collections import OrderedDict
from itertools import combinations

my_ingredients_dict = OrderedDict()

# Read the first line of each file into a set of whitespace-stripped names.
for f in sys.argv[1:]:
    with open(f, 'r') as myfile:
        as_a_set = set(s.strip() for s in myfile.readline().split(','))
        my_ingredients_dict[os.path.basename(f)] = as_a_set

common_ingredients = OrderedDict()
divergent_ingredients = OrderedDict()

# Compare every pair of files: intersection, then difference in each direction.
for agent1, agent2 in combinations(my_ingredients_dict, 2):
    agent_key = str(agent1)+"___AND___"+str(agent2)
    agent_common = my_ingredients_dict[agent1] & my_ingredients_dict[agent2]
    if agent_common:
        common_ingredients[agent_key] = agent_common

    agent_1_but_not_in_agent_2_key = "STUFF_IN__"+str(agent1)+"__BUT_NOT_IN__"+str(agent2)
    agent1_vs_agent2 = my_ingredients_dict[agent1] - my_ingredients_dict[agent2]
    if agent1_vs_agent2:
        divergent_ingredients[agent_1_but_not_in_agent_2_key] = agent1_vs_agent2

    agent_2_but_not_in_agent_1_key = "STUFF_IN__"+str(agent2)+"__BUT_NOT_IN__"+str(agent1)
    agent2_vs_agent1 = my_ingredients_dict[agent2] - my_ingredients_dict[agent1]
    if agent2_vs_agent1:
        divergent_ingredients[agent_2_but_not_in_agent_1_key] = agent2_vs_agent1

print "========= COMMON ==============\n"
for key, val in common_ingredients.items():
    print key, val
print "=========================================\n"

print "============== DIVERGENT =========== \n"
for key, val in divergent_ingredients.items():
    print key, val
print "======================================\n"

Regarding the gawk solution: if I give it the following lists, the code produces wrong results:

(a)

Arginine 8%,Calcium Carbonate, Aqua, Sorbitol, Bicarbonate, Sodium Lauryl Sulfate, Sodium Monofluorophosphate (1450 ppm F), Aroma, Cellulose Gum, Sodium Bicarbonate, Tetrasodium Pyrophosphate, Titanium Dioxide, Benzyl Alcohol, Sodium Saccharin, Xanthan Gum, Limonene

(b)

Arginine 8%,Aqua , Calcium Carbonate, Sorbitol, Hydrated Silica, Sodium Lauryl Sulfate, Aroma, Sodium Monofluorophosphate (1450 ppm F), Cellulose Gum, Tricalcium Phosphate, Sodium Bicarbonate, Tetrasodium Pyrophosphate, Sodium Saccharin, Benzyl Alcohol,Xanthan Gum, Limonene, Titanium Dioxide

Result from gawk:

Common:
Cellulose Gum
Sodium Bicarbonate
Sorbitol
Sodium Monofluorophosphate (1450 ppm F)
Sodium Saccharin
Calcium Carbonate, Aqua, Sorbitol, Aroma, Poloxamer 407, Sodium Monofluorophosphate (1450 ppm F), Cocamidopropyl Betaine, Zinc Oxide, Benzyl Alcohol, Cellulose Gum, Zinc Citrate, Sodium Bicarbonate, Tetrasodium Pyrophosphate, Xanthan Gum,Sodium Lauryl Sulfate
Aroma
Titanium Dioxide
Calcium Carbonate, Aqua, Sorbitol, Sodium Lauryl Sulfate, Aroma, Sodium Monofluorophosphate (1450 ppm F), Cellulose Gum, Sodium Bicarbonate, Tetrasodium Pyrophosphate, Sodium Saccharin, Benzyl Alcohol, Xanthan Gum, Limonene, CI 77891.
Tetrasodium Pyrophosphate
Limonene

a:
Xanthan Gum
Benzyl Alcohol
Aqua
Arginine 8%,Calcium Carbonate
Bicarbonate

b:
Tricalcium Phosphate
Hydrated Silica
Calcium Carbonate
Arginine 8%,Aqua
Benzyl Alcohol,Xanthan Gum

Result from my Python script:

========= COMMON ==============

a.txt___AND___b.txt set(['Sorbitol', 'Xanthan Gum', 'Tetrasodium Pyrophosphate', 'Sodium Saccharin', 'Aqua', 'Titanium Dioxide', 'Sodium Bicarbonate', 'Arginine 8%', 'Calcium Carbonate', 'Sodium Monofluorophosphate (1450 ppm F)', 'Sodium Lauryl Sulfate', 'Benzyl Alcohol', 'Limonene', 'Cellulose Gum', 'Aroma'])
=========================================

============== DIVERGENT ===========

STUFF_IN__a.txt__BUT_NOT_IN__b.txt set(['Bicarbonate'])
STUFF_IN__b.txt__BUT_NOT_IN__a.txt set(['Hydrated Silica', 'Tricalcium Phosphate'])
======================================

Tags: bash

Solution


Using sort, uniq, and bash process substitution:

Which elements the two lists have in common:

sort <(sed 's/, /\n/g' file1) <(sed 's/, /\n/g' file2) | uniq -d

Which elements are present in one list but not in the other:

sort <(sed 's/, /\n/g' file1) <(sed 's/, /\n/g' file2) | uniq -u
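
The commands above split only on the exact sequence ", ", so an entry such as "Arginine 8%,Calcium Carbonate" from the lists (a) and (b) in the question stays glued together as one item, and uniq -u does not say which list an item came from. The sketch below is one possible way to handle both points: it normalizes each list first and then uses comm to separate the three groups. The names normalize and compare_lists are hypothetical helpers invented for this example, and dropping a trailing period and collapsing repeated spaces are assumptions about how the lists should be cleaned, not something the original answer states. Temporary files are used instead of process substitution so that the functions also work in a plain POSIX shell.

```bash
#!/usr/bin/env bash
# Sketch only: normalize and compare_lists are illustrative names, not part of
# the original answer.

# Use byte-wise collation so that sort and comm agree on the ordering.
export LC_ALL=C

# Print one ingredient per line: split on commas, collapse runs of whitespace,
# trim both ends, drop a trailing period (an assumption), and remove empty
# lines and duplicates.
normalize() {
    tr ',' '\n' < "$1" |
        sed -e 's/[[:space:]][[:space:]]*/ /g' \
            -e 's/^ //' -e 's/ $//' -e 's/\.$//' -e '/^$/d' |
        sort -u
}

# Report the shared items and the items unique to each of the two files.
compare_lists() {
    local a b
    a=$(mktemp) || return 1
    b=$(mktemp) || { rm -f "$a"; return 1; }
    normalize "$1" > "$a"
    normalize "$2" > "$b"
    echo "Common:";     comm -12 "$a" "$b"   # lines present in both files
    echo "Only in $1:"; comm -23 "$a" "$b"   # lines present only in the first
    echo "Only in $2:"; comm -13 "$a" "$b"   # lines present only in the second
    rm -f "$a" "$b"
}

# Run the comparison when the script is called with two file names.
if [ "$#" -eq 2 ]; then
    compare_lists "$1" "$2"
fi
```

Saved under a name such as compare.sh (a placeholder), it could be run as bash compare.sh a.txt b.txt. With the lists (a) and (b) from the question it should report Bicarbonate as present only in a.txt, and Hydrated Silica and Tricalcium Phosphate as present only in b.txt, which agrees with the Python output shown in the question.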
