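python - How to enhance a script to look up values from multiple CSV files
Problem description
I need to enhance the script below, which takes an input file of almost a million lines. For each line, there are different values in 3 lookup files, and I intend to append them to my output as comma-separated values.
The script below works fine, but it takes hours to finish the job. I am looking for a really fast solution that also puts less load on the system.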
#!/bin/bash
while read -r ONT
do
{
ONSTATUS=$(grep "$ONT," lookupfile1.csv | cut -d" " -f2)
CID=$(grep "$ONT." lookupfile3.csv | head -1 | cut -d, -f2)
line1=$(grep "$ONT.C2.P1," lookupfile2.csv | head -1 | cut -d"," -f2,7 | sed 's/ //')
line2=$(grep "$ONT.C2.P2," lookupfile2.csv | head -1 | cut -d"," -f2,7 | sed 's/ //')
echo "$ONT,$ONSTATUS,$CID,$line1,$line2" >> BUwithPO.csv
} &
done < inputfile.csv
inputfile.csv contains lines like these:
343OL5:LT1.PN1.ONT1
343OL5:LT1.PN1.ONT10
225OL0:LT1.PN1.ONT34
225OL0:LT1.PN1.ONT39
343OL5:LT1.PN1.ONT100
225OL0:LT1.PN1.ONT57
lookupfile1.csv contains:
343OL5:LT1.PN1.ONT100, Down,Locked,No
225OL0:LT1.PN1.ONT57, Up,Unlocked,Yes
343OL5:LT1.PN1.ONT1, Down,Unlocked,No
225OL0:LT1.PN1.ONT34, Up,Unlocked,Yes
225OL0:LT1.PN1.ONT39, Up,Unlocked,Yes
lookupfile2.csv contains:
225OL0:LT1.PN1.ONT34.C2.P1, +123125302766,REG,DigitMap,Unlocked,_media_BNT,FD_BSFU.xml,
225OL0:LT1.PN1.ONT57.C2.P1, +123125334019,REG,DigitMap,Unlocked,_media_BNT,FD_BSFU.xml,
225OL0:LT1.PN1.ONT57.C2.P2, +123125334819,REG,DigitMap,Unlocked,_media_BNT,FD_BSFU.xml,
343OL5:LT1.PN1.ONT100.C2.P11, +123128994019,REG,DigitMap,Unlocked,_media_ANT,FD_BSFU.xml,
lookupfile3.csv contains:
343OL5:LT1.PON1.ONT100.SERV1,12-654-0330
343OL5:LT1.PON1.ONT100.C1.P1,12-654-0330
343OL5:LT7.PON8.ONT75.SERV1,12-664-1186
225OL0:LT1.PN1.ONT34.C1.P1.FLOW1,12-530-2766
225OL0:LT1.PN1.ONT57.C1.P1.FLOW1,12-533-4019
The output is:
225OL0:LT1.PN1.ONT57, Up,Unlocked,Yes,12-533-4019,+123125334019,FD_BSFU.xml,+123125334819,FD_BSFU.xml
225OL0:LT1.PN1.ONT34, Up,Unlocked,Yes,12-530-2766,+123125302766,FD_BSFU.xml,
343OL5:LT1.PN1.ONT1, Down,Unlocked,No,,,
343OL5:LT1.PN1.ONT100, Down,Locked,No,,,
343OL5:LT1.PN1.ONT10,,,,
225OL0:LT1.PN1.ONT39, Up,Unlocked,Yes,,,
Solution
As you'll see, the bottleneck is executing grep several times inside the loop, so every input line triggers several full scans of the lookup files. You can make this far more efficient by reading each lookup file once into a lookup table built with associative arrays.
If awk is available, please try the following:
[Update]
#!/bin/bash
awk '
# lookupfile1.csv: field 1 is "ONT," and field 2 is the status string
FILENAME=="lookupfile1.csv" {
    sub(",$", "", $1)
    onstatus[$1] = $2
}
# lookupfile2.csv: a[1] and a[6] are CSV columns 2 and 7 of the line
FILENAME=="lookupfile2.csv" {
    split($2, a, ",")
    if (sub("\\.C2\\.P1,$", "", $1)) line1[$1] = a[1]","a[6]
    else if (sub("\\.C2\\.P2,$", "", $1)) line2[$1] = a[1]","a[6]
}
# lookupfile3.csv: the key is everything up to and including .ONT<number>
FILENAME=="lookupfile3.csv" {
    split($0, a, ",")
    if (match(a[1], ".+\\.ONT[0-9]+")) {
        ont = substr(a[1], RSTART, RLENGTH)
        cid[ont] = a[2]
    }
}
# inputfile.csv is read last, once all lookup tables are filled
FILENAME=="inputfile.csv" {
    print $0","onstatus[$0]","cid[$0]","line1[$0]","line2[$0]
}
' lookupfile1.csv lookupfile2.csv lookupfile3.csv inputfile.csv > BUwithPO.csv
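To see the lookup-table idea in isolation, here is a minimal sketch on made-up data. The file names status.csv and ids.txt, the shortened keys, and the variable names are assumptions for illustration only. It uses the common NR==FNR idiom to tell the first file from the second, rather than the FILENAME tests above, and it assumes the first file is not empty.

```shell
#!/bin/bash
# Minimal sketch of the lookup-table technique (hypothetical sample data).
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT

# Lookup file: key, a comma, a space, then the value (same shape as lookupfile1.csv).
printf '%s\n' 'ONT1, Down,Unlocked,No' 'ONT34, Up,Unlocked,Yes' > "$tmpdir/status.csv"
# Input file: one key per line; ONT10 has no entry in the lookup file.
printf '%s\n' 'ONT1' 'ONT10' 'ONT34' > "$tmpdir/ids.txt"

result=$(awk '
    # NR == FNR is true only while the first file is being read: fill the table.
    NR == FNR {
        sub(/,$/, "", $1)       # drop the trailing comma from the key
        status[$1] = $2
        next
    }
    # Second file: one array lookup per line instead of one grep per line.
    { print $0 "," status[$0] }
' "$tmpdir/status.csv" "$tmpdir/ids.txt")

printf '%s\n' "$result"
# Prints:
# ONT1,Down,Unlocked,No
# ONT10,
# ONT34,Up,Unlocked,Yes
```

Each file is read exactly once, and the array index is compared as a whole string, so ONT1 does not accidentally pick up the row for ONT10.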
[EDIT]
If you need to specify absolute paths to the files, please try:
#!/bin/bash
awk '
# Match on the end of the path; the dot is escaped so it only matches a literal dot
FILENAME ~ /lookupfile1\.csv$/ {
    sub(",$", "", $1)
    onstatus[$1] = $2
}
FILENAME ~ /lookupfile2\.csv$/ {
    split($2, a, ",")
    if (sub("\\.C2\\.P1,$", "", $1)) line1[$1] = a[1]","a[6]
    else if (sub("\\.C2\\.P2,$", "", $1)) line2[$1] = a[1]","a[6]
}
FILENAME ~ /lookupfile3\.csv$/ {
    split($0, a, ",")
    if (match(a[1], ".+\\.ONT[0-9]+")) {
        ont = substr(a[1], RSTART, RLENGTH)
        cid[ont] = a[2]
    }
}
# The input file must come last on the command line
FILENAME ~ /inputfile\.csv$/ {
    print $0","onstatus[$0]","cid[$0]","line1[$0]","line2[$0]
}
' /path/to/lookupfile1.csv /path/to/lookupfile2.csv /path/to/lookupfile3.csv /path/to/inputfile.csv > /path/to/BUwithPO.csv
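One difference from the original loop may matter for your data. The grep pipelines use head -1, so the first matching line wins, whereas a plain assignment such as cid[ont] = a[2] keeps the last value when a key occurs more than once. If the first match should win, guarding the assignment with the in operator is one option. The sketch below uses made-up data; the file name dup.csv and the values first and second are hypothetical.

```shell
#!/bin/bash
# Sketch: keep only the first value seen for each key (hypothetical sample data).
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT

# Both rows reduce to the same key, A:LT1.PN1.ONT34.
printf '%s\n' 'A:LT1.PN1.ONT34.C1.P1,first' 'A:LT1.PN1.ONT34.SERV1,second' > "$tmpdir/dup.csv"

first_wins=$(awk -F, '
    match($1, /.+\.ONT[0-9]+/) {
        ont = substr($1, RSTART, RLENGTH)
        if (!(ont in cid)) cid[ont] = $2    # skip keys that are already stored
    }
    END { for (k in cid) print k "," cid[k] }
' "$tmpdir/dup.csv")

printf '%s\n' "$first_wins"
# Prints: A:LT1.PN1.ONT34,first
```

The same guard could be applied to line1 and line2 if lookupfile2.csv can contain repeated keys.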
Hope this helps.