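bash - Compare two comma-separated lists and find the common and differing elements of each
Problem description
I have the following two ingredient lists: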
Calcium Carbonate, Aqua, Sorbitol, Aroma, Poloxamer 407, Sodium Monofluorophosphate (1450 ppm F), Cocamidopropyl Betaine, Zinc Oxide, Benzyl Alcohol, Cellulose Gum, Zinc Citrate, Sodium Bicarbonate, Tetrasodium Pyrophosphate, Xanthan Gum, Sodium Saccharin, Sucralose, Limonene, CI 77891.
Calcium Carbonate, Aqua, Sorbitol, Sodium Lauryl Sulfate, Aroma, Sodium Monofluorophosphate (1450 ppm F), Cellulose Gum, Sodium Bicarbonate, Tetrasodium Pyrophosphate, Sodium Saccharin, Benzyl Alcohol, Xanthan Gum, Limonene, CI 77891.
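What I want to know is:
- which elements they have in common
- which elements are present in one list but not in the other
I wrote something in python that works, but I would like a simpler bash implementation.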
# Python 2 script: each file given on the command line holds one comma-separated list.
import sys
from collections import OrderedDict
import os
from copy import deepcopy
from itertools import combinations

# Map each file name to the set of ingredients found on its first line.
my_ingredients_dict = OrderedDict()
for f in sys.argv[1:]:
    with open(f, 'r') as myfile:
        as_a_set = set([ s.strip() for s in myfile.readlines()[0].split(',')])
    my_ingredients_dict[os.path.basename(f)] = as_a_set

all_ing_list = my_ingredients_dict.values()
common_ingredients = OrderedDict()
divergent_ingredients = OrderedDict()

# Compare every pair of files: intersection, then the set difference in both directions.
for agent1, agent2 in combinations(my_ingredients_dict, 2):
    agent_key = str(agent1)+"___AND___"+str(agent2)
    agent_common = my_ingredients_dict[agent1] & my_ingredients_dict[agent2]
    if agent_common:
        common_ingredients[agent_key] = agent_common
    agent_1_but_not_in_agent_2_key = "STUFF_IN__"+str(agent1)+"__BUT_NOT_IN__"+str(agent2)
    agent1_vs_agent2 = my_ingredients_dict[agent1] - my_ingredients_dict[agent2]
    if agent1_vs_agent2:
        divergent_ingredients[agent_1_but_not_in_agent_2_key] = agent1_vs_agent2
    agent_2_but_not_in_agent_1_key = "STUFF_IN__"+str(agent2)+"__BUT_NOT_IN__"+str(agent1)
    agent2_vs_agent1 = my_ingredients_dict[agent2] - my_ingredients_dict[agent1]
    if agent2_vs_agent1:
        divergent_ingredients[agent_2_but_not_in_agent_1_key] = agent2_vs_agent1

print "========= COMMON ==============\n"
for key,val in common_ingredients.items():
    print key, val
print "=========================================\n"
print "============== DIVERGENT =========== \n"
for key, val in divergent_ingredients.items():
    print key,val
print "======================================\n"
Regarding the gawk solution: if I give it the following lists, the code produces wrong results:
(a)
Arginine 8%,Calcium Carbonate, Aqua, Sorbitol, Bicarbonate, Sodium Lauryl Sulfate, Sodium Monofluorophosphate (1450 ppm F), Aroma, Cellulose Gum, Sodium Bicarbonate, Tetrasodium Pyrophosphate, Titanium Dioxide, Benzyl Alcohol, Sodium Saccharin, Xanthan Gum, Limonene
(b)
Arginine 8%,Aqua , Calcium Carbonate, Sorbitol, Hydrated Silica, Sodium Lauryl Sulfate, Aroma, Sodium Monofluorophosphate (1450 ppm F), Cellulose Gum, Tricalcium Phosphate, Sodium Bicarbonate, Tetrasodium Pyrophosphate, Sodium Saccharin, Benzyl Alcohol,Xanthan Gum, Limonene, Titanium Dioxide
Result from gawk:
Common:
Cellulose Gum
Sodium Bicarbonate
Sorbitol
Sodium Monofluorophosphate (1450 ppm F)
Sodium Saccharin
Calcium Carbonate, Aqua, Sorbitol, Aroma, Poloxamer 407, Sodium Monofluorophosphate (1450 ppm F), Cocamidopropyl Betaine, Zinc Oxide, Benzyl Alcohol, Cellulose Gum, Zinc Citrate, Sodium Bicarbonate, Tetrasodium Pyrophosphate, Xanthan Gum,Sodium Lauryl Sulfate
Aroma
Titanium Dioxide
Calcium Carbonate, Aqua, Sorbitol, Sodium Lauryl Sulfate, Aroma, Sodium Monofluorophosphate (1450 ppm F), Cellulose Gum, Sodium Bicarbonate, Tetrasodium Pyrophosphate, Sodium Saccharin, Benzyl Alcohol, Xanthan Gum, Limonene, CI 77891.
Tetrasodium Pyrophosphate
Limonene
a:
Xanthan Gum
Benzyl Alcohol
Aqua
Arginine 8%,Calcium Carbonate
Bicarbonate
b:
Tricalcium Phosphate
Hydrated Silica
Calcium Carbonate
Arginine 8%,Aqua
Benzyl Alcohol,Xanthan Gum
Result from my python script:
========= COMMON ==============
a.txt___AND___b.txt set(['Sorbitol', 'Xanthan Gum', 'Tetrasodium Pyrophosphate', 'Sodium Saccharin', 'Aqua', 'Titanium Dioxide', 'Sodium Bicarbonate', 'Arginine 8%', 'Calcium Carbonate', 'Sodium Monofluorophosphate (1450 ppm F)', 'Sodium Lauryl Sulfate', 'Benzyl Alcohol', 'Limonene', 'Cellulose Gum', 'Aroma'])
=========================================
============== DIVERGENT ===========
STUFF_IN__a.txt__BUT_NOT_IN__b.txt set(['Bicarbonate'])
STUFF_IN__b.txt__BUT_NOT_IN__a.txt set(['Hydrated Silica', 'Tricalcium Phosphate'])
======================================
Solution
Using sort, bash and uniq:
Which elements are in common
sort <(sed 's/, /\n/g' file1) <(sed 's/, /\n/g' file2) | uniq -d
Which elements are present in one list but not in the other
sort <(sed 's/, /\n/g' file1) <(sed 's/, /\n/g' file2) | uniq -u
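Two caveats with the commands above are worth noting. Splitting only on ", " leaves entries such as "Arginine 8%,Calcium Carbonate" joined and keeps the trailing space in "Aqua ", both of which occur in lists (a) and (b) from the question. Also, uniq -u does not say which file a differing element came from. The following is a minimal sketch of one way around both, not the answerer's method: it splits on every comma, trims whitespace, and uses comm to separate the three groups. The function names normalize and compare_lists are made up for illustration, and stripping a trailing period from each entry is an assumption based on the first pair of lists ending in "CI 77891.".

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one trimmed, de-duplicated ingredient per line.
# Assumption: a period at the end of an entry is list punctuation, not part of the name.
normalize() {
    tr ',' '\n' < "$1" |
        sed -e 's/^[[:space:]]*//' \
            -e 's/[[:space:]]*$//' \
            -e 's/[[:space:]]*\.$//' \
            -e '/^$/d' |
        LC_ALL=C sort -u
}

# Hypothetical helper: report common and file-specific entries for two list files.
compare_lists() {
    local a=$1 b=$2 ta tb
    ta=$(mktemp) || return 1
    tb=$(mktemp) || return 1
    normalize "$a" > "$ta"
    normalize "$b" > "$tb"
    # comm needs sorted input; LC_ALL=C keeps its ordering consistent with sort above.
    echo "Common:"
    LC_ALL=C comm -12 "$ta" "$tb"
    echo "Only in $a:"
    LC_ALL=C comm -23 "$ta" "$tb"
    echo "Only in $b:"
    LC_ALL=C comm -13 "$ta" "$tb"
    rm -f "$ta" "$tb"
}

# Run only when two file names are passed on the command line.
if [ "$#" -eq 2 ]; then
    compare_lists "$1" "$2"
fi
```

Saved as a script and called with a.txt and b.txt holding lists (a) and (b), this should report Bicarbonate as only in a.txt, and Hydrated Silica and Tricalcium Phosphate as only in b.txt, which agrees with the python result in the question. Temporary files are used instead of process substitution so that each list is normalized only once.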