unix - Finding the contents of one file in another file while keeping the same order in UNIX
Problem description
I have a file, for example:
Colcht.WP_006104309.1
Moopro.WP_070396948.1
Mastes.WP_027845098.1
Phowil.WP_068791039.1
Cyaapo.WP_015218744.1
Gemher.WP_017295578.1
Lyncon.WP_039726659.1
Glokil.WP_023171831.1
Noscya.WP_087539356.1
Photen.WP_073607805.1
Hydriv.WP_073598454.1
Lepoha.WP_088893428.1
Nodnod.WP_017300904.1
Noscya.WP_087540001.1
Spisub.WP_017307136.1
Scy0HK.WP_073635112.1
PlaSR0.WP_054467905.1
I used grep -A 1 -F -f to pull, from another file, the information related to each line of the text file above. A subset of the result looks like this:
>Cyaapo.WP_015218744.1
MNIVVVGLSHKTAAVEVREKLSIPEAKIEDSIRHLLTYPHIEEVAIISTCNRLEIYAVVKETEQGVKEITQFLAEIGNLALLELRRHLFILLHQDAIRHLMRVAAGLESLVLGEGQILAQVKTTHKLGMKYNGMSRLLDRLFKQAISAGKRVRTETNIGTGAVSISSAAVELVDTKIEDLSSQKVTIIGAGKMSRLLVQHLLAKGVEDIIIVNRSHNRSQELAKQFPQANLKLNLLEDMMTMVAQSDIVFTSTGATQPILDKNNLSSLSINHSLMLVDISVPRNVASDVTELEFIKSYNVDDLKAVVAQNHASRREMAREAENLLEEEIEAFELWWQSLETVPTISCLRSKIEQIREQELEKALSRLGSEFAEKHQEVIEALTRGIVNKILHDPMVQLRTQKDIEARRKALHTLQTLFDLEVSEQLI
>Gemher.WP_017295578.1
MNIIVVGLSHKTAAVEVREKLSIPEAKIEESIKHLLSYPHIEEVGIISTCNRLEIYAVVKETEQGVKEITQFLAEIGHLSLHSLRRHLFILLHQDAIRHLMRVSAGLESLVLGEGQILAQVKNTHKLSTKYQGMGRLLDRLFKQAMSAGKRVRTETNIGTGAVSISSAAVELVDMKLDDLSRQKVSIIGAGKMSRLLVQHLISKGVSDITIVNRSVSRSKELAKQFPQIELKLNLLEEMMEIVRDSDIVFTSTGATQPILDKNNLCSIECHHSIMLVDISVPRNVASDVEELDFIVAYNVDDLKAVVAQNQASRREMAREAELLLEEEIEAFELWWQSLETVPTISCLRSKIEEIREQELEKALSRLGSEFAEKHQEVIEALTRGIVNKILHDPMVQLRSQQDIEARRKALQTLQTLFNLEISEQFG
>Glokil.WP_023171831.1
MQIAVIGLSHRTAPVEIREKVSIPEQQVAEYVSRLRSCSQIAECAILSTCNRLEIYAVLRDSEHGLREVTQFLAESKGVALPMLQRHLFTLLHQDAVMHLMRVAAGLDSIVLGEGQILAQIKVTHKLAQQGKGVDRILNQLFKAAITGGKRVREETDIGKGAISVSSAAVEMAMRKKNRRSLQDQRCLVVGAGKMGELVLRHLISKGARQIIVLNRSLEKAAQMVEQFGLMLPVATIDELGNHLGAADLIFTCTSASEPLINYERLSQVRREQPLMIFDIAVPRNVAVDVEELSNVHLFNVDHLKQVVEENRAYRQLMVQQCEDILLQQLDEFLDWWRNLEAVPTINSLRQKVETIREQELEKALSRLGTEFGEKHQGIIDSLTRAIVNKILHDPMVQLRAQRDVEARRRALQTLQTLFNLEPLGSNPEPPVL
>Hydriv.WP_073598454.1
MNIAVVGLSHKTAPVEIREKLSIQEAKLESALAHLRSYPHIIEVAIISTCNRLEIYAIATETDQGVREISQFLSEIGHIPLDRLRRYLFILLHQDAVRHLMRVAAGLESLVLGEGQILAQVKNTHKLAQKYQSLGQILDRLFKQAMTAGKRVRSETNIGTGAVSISSAAVELAHMKAENLAARRVCIIGAGKMSRLLVQHLLAKGTQQICIVNRSHRRAEELASQFPEVQLKLYPLTEMMSAVAASDIVFTSTAATEPIINRSQLEASLTRDRELMLFDISVPRNVHADVGGMESVQSYNVDDLKAVVAQNYESRRKMAQEAEALLEEEIAAFELWWRSLETVPTISCLRSKVETIREQELEKALSRLGTEFAEKHQEVIEALTRGIVNKILHEPMVQLRAQQDIEARRRCLQSLQMLFNLEIEKQVI
>Lepoha.WP_088893428.1
MNIVVVGLSHRTAPVEVREKLSIPTPQMEAAIAHLRSFPHIEEATILSTCNRLEVYVVTSETEQGVREVTQFLSEYGKISVSQLRPYLFILLHDDAVMHLMRVSAGLDSLVLGEGQILAQVKHTHKVGQQYNGIGRILNRLFNQAITAGKRVRTETSIGTGAVSISSAAVELAQLKVQHLPACRVAILGAGKMSRLVVQHLISKGATQICIVNRSLDSARELAQQFKEAEIRLHLLDEMMHVICNSDLVFTATAATEPLIDRAKLESTIDPLHSLKLFDISVPRNVHADVNELDHVQLFNVDDLKAVVAQNQESRRQMALEAENILDEEVAAFDLWWRSLETVATISELRDKVEAIRAQELEKALSRLGSEFAEKHQEVIEALTRGIVNKILHDPMVQLRAQQDIEARRRAMQTLRSLFNLEEPASNKA
However, as you can see, the order of the first file (1st: Colcht.WP_006104309.1, 2nd: Moopro.WP_070396948.1, 3rd: Mastes.WP_027845098.1) is no longer respected.
A subset of the expected output would be:
>Colcht.WP_006104309.1
MENYNTSNIDNVLLLKGDDIINLFKNREQDILDLVKLTYKIHGRGDSTLPHSSFLRFPDKNKERIIALPAYLGGEINTAGIKWIASFPGNLARGMERASAILIINSTETGRPQAIMEGSVISAKRTAASAALAAHFLRDRQSLVTVGLIGCGLINFETVRFLLKVRPEIETLFLYDVSLEKSDQFKRKCQQLSQNRELVILDNPDDVFKHSSVIALATTASQPHIVDISACQSDSIILHTSLRDLSPEIILSVDNIVDDIDHVCRAQTSIHLAEQKTGNRDFIRCPLSDILNGVAAPRQNNSQIAVFSPFGLGVLDLALGQLAYQLADETNVGTRLTSFFPVSWLQREDE
>Moopro.WP_070396948.1
METAYQGFAQQQPGDVIVLSASDILSLLAGREKELIEVVRQTYIAHARGESALPPSPFLRFANHPKNRIIAKPAYLGESFETAGIKWISSFPDNYQFGLLRASAVIILNSVKTGFAEAILEGSVISAKRTAASAALAARLLQSETQPESIGIIACGVINFEITRFLLAEFPTVKNLVIFDIDHERAVQYKSRCETNFETPNITIANDINTVLSSTSIISIATTETTPHIFEISACQPGSNILHISLRDFSPEVILSCDNIVDDVEHICSAQTSVHLAEQKINHRHFIRGSIGDILCGKILAKPTPSAITIFSPFGLGILDLAVAKLVHEWGIARNLGTVIPSFGCLPHE
>Mastes.WP_027845098.1
MSNKHHLSFTYLSQEDLLDAGCFDIRMVMDIAEKAMLEFERHHVIFPEKIVQIFNQATQERINCLPATLLDEKVCGVKWVSVFPMNPIEHDQQNLSAIFILSEIETGFPICVMEGTLASNMRVAAIGGLAAKYLARQDSEVIGFIGAGEQAKMHLIAMKAVCPSLKQCRVAAHVVKHEEQFIAELSRLYPEMEFVSTNTNLQKAIEDADILVTATSAQAPLLKATWVKPGTFYSHIGGWEDEFEVALQADKIVCDDWETVTHRTQTLSRMYQEGLINANNIHADLHELVSGKKAGRESQTERIYFNAVGLAYIDIAIAMAMFNRAREKQKGTQLDLQQSMVFEHLGLKSKVKL
>Phowil.WP_068791039.1
MRVISAAEVQAALDFETLVGRLRDIFRRGGEAPARQQYDIAITGEPAQTLLLAPAWQAGRHVGVQIATVTPGNGARGLPAGMGAYLLLDGRSGAPAALIDGPMLTLRRTAAASALASAYLSRPDSARLLMVGTGALAPHLIAAHAAVRPIREVLVWGRTPAKAARLAKAVKLPRVRLAWTEDLEGAVRGADIVACATLSQQPLLRGAWLRPGQHLDLVGAYRPEMRESDGEVFRRARVFVDTRAGALAEAGDLIQALAEGALSAADVAADLFELARGEKAGRRFYDQITLFKATGSALEDLAAAQLTVERA
>Cyaapo.WP_015218744.1
MNIVVVGLSHKTAAVEVREKLSIPEAKIEDSIRHLLTYPHIEEVAIISTCNRLEIYAVVKETEQGVKEITQFLAEIGNLALLELRRHLFILLHQDAIRHLMRVAAGLESLVLGEGQILAQVKTTHKLGMKYNGMSRLLDRLFKQAISAGKRVRTETNIGTGAVSISSAAVELVDTKIEDLSSQKVTIIGAGKMSRLLVQHLLAKGVEDIIIVNRSHNRSQELAKQFPQANLKLNLLEDMMTMVAQSDIVFTSTGATQPILDKNNLSSLSINHSLMLVDISVPRNVASDVTELEFIKSYNVDDLKAVVAQNHASRREMAREAENLLEEEIEAFELWWQSLETVPTISCLRSKIEQIREQELEKALSRLGSEFAEKHQEVIEALTRGIVNKILHDPMVQLRTQKDIEARRKALHTLQTLFDLEVSEQLI
Does anyone know how I can grep as above while preserving the order of the first file?
Any help is much appreciated :)
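Solution
If you want to keep the order of the first file, the best approach is to parse the second file first, storing the line that follows each header, and then parse the first file.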
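> cat tst.awk
# First file on the command line (file2): the line after a header is its sequence.
FNR==NR && p {
    a[prev]=$0          # store the sequence under the ID of the previous header
    p=0
    next
}
# First file on the command line (file2): a header line starting with ">".
FNR==NR && $0~/^>/ {
    prev=substr($0,2)   # ID without the leading ">"
    p=1                 # the next line is this ID's sequence
    next
}
# Lines of file1 end up here: print each ID that has a stored sequence.
$0 in a {
    print ">" $0 RS a[$0]
}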
Usage:
awk -f tst.awk file2 file1
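To see the effect on a small input, here is a minimal sketch that inlines the same awk program in a shell function. The function name order_lookup, the file names ids.txt and seqs.fa, and the short IDs and sequences are all made up for illustration; it also assumes, like the script itself, that every sequence sits on a single line right after its header.

```shell
# Hypothetical demo: order_lookup, ids.txt, seqs.fa and their contents are invented.
order_lookup() {
    # $1 = FASTA-like file (header line, then one sequence line), $2 = ID list
    awk '
        FNR==NR && p       { a[prev]=$0; p=0; next }         # sequence line: store under previous header
        FNR==NR && $0~/^>/ { prev=substr($0,2); p=1; next }  # header line: keep the ID without ">"
        $0 in a            { print ">" $0 RS a[$0] }         # ID list: print hits in list order
    ' "$1" "$2"
}

demo=$(mktemp -d)
trap 'rm -rf "$demo"' EXIT                         # remove the sample files when the shell exits
printf '%s\n' 'B.2' 'A.1' 'C.3' > "$demo/ids.txt"  # C.3 has no sequence and is skipped
printf '%s\n' '>A.1' 'MAAA' '>B.2' 'MBBB' '>D.4' 'MDDD' > "$demo/seqs.fa"
order_lookup "$demo/seqs.fa" "$demo/ids.txt"
```

With this sample the records come out as >B.2 / MBBB followed by >A.1 / MAAA, that is, in the order of ids.txt rather than the order of seqs.fa.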
If file2 is huge and you do not have enough memory, you can replace file2 with the output of your grep command (only the interesting part of file2):
awk -f tst.awk <(grep -A 1 -f file1 file2) file1
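The <( ) process substitution is a bash/ksh/zsh feature. As a sketch of the same idea for a plain POSIX shell, grep's output can be piped into awk, with - naming standard input as the first file operand. The version below also passes -F, as in the question's own command, so that the dots in the IDs are matched literally instead of as regex wildcards. The function name prefiltered_lookup and the sample data are hypothetical, and -A is assumed to be available (it is in GNU and BSD grep).

```shell
# Hypothetical demo: prefiltered_lookup and the sample files are invented.
prefiltered_lookup() {
    # $1 = ID list, $2 = large FASTA-like file
    grep -A 1 -F -f "$1" "$2" |
    awk '
        FNR==NR && p       { a[prev]=$0; p=0; next }         # sequence line from grep output
        FNR==NR && $0~/^>/ { prev=substr($0,2); p=1; next }  # header line from grep output
        $0 in a            { print ">" $0 RS a[$0] }         # ID list: print hits in list order
    ' - "$1"                                                 # "-" = grep output on stdin
}

demo=$(mktemp -d)
trap 'rm -rf "$demo"' EXIT
printf '%s\n' 'B.2' 'A.1' > "$demo/ids.txt"
printf '%s\n' '>A.1' 'MAAA' '>D.4' 'MDDD' '>B.2' 'MBBB' > "$demo/seqs.fa"
prefiltered_lookup "$demo/ids.txt" "$demo/seqs.fa"
```

grep -A prints a "--" separator between non-adjacent groups of matches; the awk rules simply ignore such lines, because they are neither headers nor keys of the array.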
Otherwise, you can still pass file1 file2, but you have to save the order of the lines and do the work in the END section.
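> cat tst.awk
# First file (file1): remember each ID and the position at which it appeared.
FNR==NR {
    row[NR]=$0
    a[$0]
    next
}
# Second file (file2): the line after a wanted header is its sequence.
p {
    next_row[x]=$0
    p=0
    next
}
# Second file (file2): a header whose ID (the text after ">") is listed in file1.
substr($0,2) in a {
    x=substr($0,2)
    p=1
}
# Print the collected sequences in the order of file1.
END {
    for (i=1;i in row;i++)
        if (next_row[row[i]])
            print ">" row[i] RS next_row[row[i]]
}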
Usage:
awk -f tst.awk file1 file2
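A small run of this END-based variant, again as a sketch with invented names (ordered_from_ids, ids.txt, seqs.fa) and toy data. Since only the IDs from file1 and the sequences that match them are kept in memory, this form should cope better with a very large file2.

```shell
# Hypothetical demo: ordered_from_ids and the sample files are invented.
ordered_from_ids() {
    # $1 = ID list (read first), $2 = FASTA-like file
    awk '
        FNR==NR           { row[NR]=$0; a[$0]; next }    # ID list: record ID and its position
        p                 { next_row[x]=$0; p=0; next }  # sequence line of a wanted header
        substr($0,2) in a { x=substr($0,2); p=1 }        # wanted header: expect its sequence next
        END {
            for (i=1; i in row; i++)                     # walk the IDs in their original order
                if (next_row[row[i]])
                    print ">" row[i] RS next_row[row[i]]
        }
    ' "$1" "$2"
}

demo=$(mktemp -d)
trap 'rm -rf "$demo"' EXIT
printf '%s\n' 'B.2' 'A.1' 'C.3' > "$demo/ids.txt"  # C.3 is absent from seqs.fa
printf '%s\n' '>A.1' 'MAAA' '>D.4' 'MDDD' '>B.2' 'MBBB' > "$demo/seqs.fa"
ordered_from_ids "$demo/ids.txt" "$demo/seqs.fa"
```

For this input the B.2 record is printed before the A.1 record, and C.3 produces no output because no sequence was found for it.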