Find the content of one file in another file in UNIX, keeping the same order

Problem description

I have a file, for example:

Colcht.WP_006104309.1
Moopro.WP_070396948.1
Mastes.WP_027845098.1
Phowil.WP_068791039.1
Cyaapo.WP_015218744.1
Gemher.WP_017295578.1
Lyncon.WP_039726659.1
Glokil.WP_023171831.1
Noscya.WP_087539356.1
Photen.WP_073607805.1
Hydriv.WP_073598454.1
Lepoha.WP_088893428.1
Nodnod.WP_017300904.1
Noscya.WP_087540001.1
Spisub.WP_017307136.1
Scy0HK.WP_073635112.1
PlaSR0.WP_054467905.1

I used grep -A 1 -F -f to pull, from another file, the information associated with each line of the text file above. A subset of the result looks like this:

>Cyaapo.WP_015218744.1
MNIVVVGLSHKTAAVEVREKLSIPEAKIEDSIRHLLTYPHIEEVAIISTCNRLEIYAVVKETEQGVKEITQFLAEIGNLALLELRRHLFILLHQDAIRHLMRVAAGLESLVLGEGQILAQVKTTHKLGMKYNGMSRLLDRLFKQAISAGKRVRTETNIGTGAVSISSAAVELVDTKIEDLSSQKVTIIGAGKMSRLLVQHLLAKGVEDIIIVNRSHNRSQELAKQFPQANLKLNLLEDMMTMVAQSDIVFTSTGATQPILDKNNLSSLSINHSLMLVDISVPRNVASDVTELEFIKSYNVDDLKAVVAQNHASRREMAREAENLLEEEIEAFELWWQSLETVPTISCLRSKIEQIREQELEKALSRLGSEFAEKHQEVIEALTRGIVNKILHDPMVQLRTQKDIEARRKALHTLQTLFDLEVSEQLI
>Gemher.WP_017295578.1
MNIIVVGLSHKTAAVEVREKLSIPEAKIEESIKHLLSYPHIEEVGIISTCNRLEIYAVVKETEQGVKEITQFLAEIGHLSLHSLRRHLFILLHQDAIRHLMRVSAGLESLVLGEGQILAQVKNTHKLSTKYQGMGRLLDRLFKQAMSAGKRVRTETNIGTGAVSISSAAVELVDMKLDDLSRQKVSIIGAGKMSRLLVQHLISKGVSDITIVNRSVSRSKELAKQFPQIELKLNLLEEMMEIVRDSDIVFTSTGATQPILDKNNLCSIECHHSIMLVDISVPRNVASDVEELDFIVAYNVDDLKAVVAQNQASRREMAREAELLLEEEIEAFELWWQSLETVPTISCLRSKIEEIREQELEKALSRLGSEFAEKHQEVIEALTRGIVNKILHDPMVQLRSQQDIEARRKALQTLQTLFNLEISEQFG
>Glokil.WP_023171831.1
MQIAVIGLSHRTAPVEIREKVSIPEQQVAEYVSRLRSCSQIAECAILSTCNRLEIYAVLRDSEHGLREVTQFLAESKGVALPMLQRHLFTLLHQDAVMHLMRVAAGLDSIVLGEGQILAQIKVTHKLAQQGKGVDRILNQLFKAAITGGKRVREETDIGKGAISVSSAAVEMAMRKKNRRSLQDQRCLVVGAGKMGELVLRHLISKGARQIIVLNRSLEKAAQMVEQFGLMLPVATIDELGNHLGAADLIFTCTSASEPLINYERLSQVRREQPLMIFDIAVPRNVAVDVEELSNVHLFNVDHLKQVVEENRAYRQLMVQQCEDILLQQLDEFLDWWRNLEAVPTINSLRQKVETIREQELEKALSRLGTEFGEKHQGIIDSLTRAIVNKILHDPMVQLRAQRDVEARRRALQTLQTLFNLEPLGSNPEPPVL
>Hydriv.WP_073598454.1
MNIAVVGLSHKTAPVEIREKLSIQEAKLESALAHLRSYPHIIEVAIISTCNRLEIYAIATETDQGVREISQFLSEIGHIPLDRLRRYLFILLHQDAVRHLMRVAAGLESLVLGEGQILAQVKNTHKLAQKYQSLGQILDRLFKQAMTAGKRVRSETNIGTGAVSISSAAVELAHMKAENLAARRVCIIGAGKMSRLLVQHLLAKGTQQICIVNRSHRRAEELASQFPEVQLKLYPLTEMMSAVAASDIVFTSTAATEPIINRSQLEASLTRDRELMLFDISVPRNVHADVGGMESVQSYNVDDLKAVVAQNYESRRKMAQEAEALLEEEIAAFELWWRSLETVPTISCLRSKVETIREQELEKALSRLGTEFAEKHQEVIEALTRGIVNKILHEPMVQLRAQQDIEARRRCLQSLQMLFNLEIEKQVI
>Lepoha.WP_088893428.1
MNIVVVGLSHRTAPVEVREKLSIPTPQMEAAIAHLRSFPHIEEATILSTCNRLEVYVVTSETEQGVREVTQFLSEYGKISVSQLRPYLFILLHDDAVMHLMRVSAGLDSLVLGEGQILAQVKHTHKVGQQYNGIGRILNRLFNQAITAGKRVRTETSIGTGAVSISSAAVELAQLKVQHLPACRVAILGAGKMSRLVVQHLISKGATQICIVNRSLDSARELAQQFKEAEIRLHLLDEMMHVICNSDLVFTATAATEPLIDRAKLESTIDPLHSLKLFDISVPRNVHADVNELDHVQLFNVDDLKAVVAQNQESRRQMALEAENILDEEVAAFDLWWRSLETVATISELRDKVEAIRAQELEKALSRLGSEFAEKHQEVIEALTRGIVNKILHDPMVQLRAQQDIEARRRAMQTLRSLFNLEEPASNKA

However, as you can see, the order of the first file (1st: Colcht.WP_006104309.1, 2nd: Moopro.WP_070396948.1, 3rd: Mastes.WP_027845098.1) is no longer respected.

A subset of the expected output would be:

>Colcht.WP_006104309.1
MENYNTSNIDNVLLLKGDDIINLFKNREQDILDLVKLTYKIHGRGDSTLPHSSFLRFPDKNKERIIALPAYLGGEINTAGIKWIASFPGNLARGMERASAILIINSTETGRPQAIMEGSVISAKRTAASAALAAHFLRDRQSLVTVGLIGCGLINFETVRFLLKVRPEIETLFLYDVSLEKSDQFKRKCQQLSQNRELVILDNPDDVFKHSSVIALATTASQPHIVDISACQSDSIILHTSLRDLSPEIILSVDNIVDDIDHVCRAQTSIHLAEQKTGNRDFIRCPLSDILNGVAAPRQNNSQIAVFSPFGLGVLDLALGQLAYQLADETNVGTRLTSFFPVSWLQREDE
>Moopro.WP_070396948.1
METAYQGFAQQQPGDVIVLSASDILSLLAGREKELIEVVRQTYIAHARGESALPPSPFLRFANHPKNRIIAKPAYLGESFETAGIKWISSFPDNYQFGLLRASAVIILNSVKTGFAEAILEGSVISAKRTAASAALAARLLQSETQPESIGIIACGVINFEITRFLLAEFPTVKNLVIFDIDHERAVQYKSRCETNFETPNITIANDINTVLSSTSIISIATTETTPHIFEISACQPGSNILHISLRDFSPEVILSCDNIVDDVEHICSAQTSVHLAEQKINHRHFIRGSIGDILCGKILAKPTPSAITIFSPFGLGILDLAVAKLVHEWGIARNLGTVIPSFGCLPHE
>Mastes.WP_027845098.1
MSNKHHLSFTYLSQEDLLDAGCFDIRMVMDIAEKAMLEFERHHVIFPEKIVQIFNQATQERINCLPATLLDEKVCGVKWVSVFPMNPIEHDQQNLSAIFILSEIETGFPICVMEGTLASNMRVAAIGGLAAKYLARQDSEVIGFIGAGEQAKMHLIAMKAVCPSLKQCRVAAHVVKHEEQFIAELSRLYPEMEFVSTNTNLQKAIEDADILVTATSAQAPLLKATWVKPGTFYSHIGGWEDEFEVALQADKIVCDDWETVTHRTQTLSRMYQEGLINANNIHADLHELVSGKKAGRESQTERIYFNAVGLAYIDIAIAMAMFNRAREKQKGTQLDLQQSMVFEHLGLKSKVKL
>Phowil.WP_068791039.1
MRVISAAEVQAALDFETLVGRLRDIFRRGGEAPARQQYDIAITGEPAQTLLLAPAWQAGRHVGVQIATVTPGNGARGLPAGMGAYLLLDGRSGAPAALIDGPMLTLRRTAAASALASAYLSRPDSARLLMVGTGALAPHLIAAHAAVRPIREVLVWGRTPAKAARLAKAVKLPRVRLAWTEDLEGAVRGADIVACATLSQQPLLRGAWLRPGQHLDLVGAYRPEMRESDGEVFRRARVFVDTRAGALAEAGDLIQALAEGALSAADVAADLFELARGEKAGRRFYDQITLFKATGSALEDLAAAQLTVERA
>Cyaapo.WP_015218744.1
MNIVVVGLSHKTAAVEVREKLSIPEAKIEDSIRHLLTYPHIEEVAIISTCNRLEIYAVVKETEQGVKEITQFLAEIGNLALLELRRHLFILLHQDAIRHLMRVAAGLESLVLGEGQILAQVKTTHKLGMKYNGMSRLLDRLFKQAISAGKRVRTETNIGTGAVSISSAAVELVDTKIEDLSSQKVTIIGAGKMSRLLVQHLLAKGVEDIIIVNRSHNRSQELAKQFPQANLKLNLLEDMMTMVAQSDIVFTSTGATQPILDKNNLSSLSINHSLMLVDISVPRNVASDVTELEFIKSYNVDDLKAVVAQNHASRREMAREAENLLEEEIEAFELWWQSLETVPTISCLRSKIEQIREQELEKALSRLGSEFAEKHQEVIEALTRGIVNKILHDPMVQLRTQKDIEARRKALHTLQTLFDLEVSEQLI

Does anyone know how I can grep as above while preserving the order of the first file?

Any help is much appreciated :)

Tags: unix, awk, command-line, grep

Solution


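If you want to keep the order of the first file, it is better to parse the second file first, storing the line that follows each ">" header, and then parse the first file.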

> cat tst.awk
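# First file on the command line (file2): FNR==NR holds.
# The previous line was a header, so this line is its sequence.
FNR==NR && p {
    a[prev]=$0
    p=0
    next
}

# Header line in file2: remember the ID without the leading ">".
FNR==NR && /^>/ {
    prev=substr($0,2)
    p=1
    next
}

# Second file (file1, the ID list): print matches in file1 order.
$0 in a {
    print ">" $0 RS a[$0]
}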

Usage:

awk -f tst.awk file2 file1
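To see the ordering effect on something small, here is a minimal sketch that runs the same logic as tst.awk inline. The two sample files and their IDs and sequences are made up for illustration; they are not from the question.

```shell
#!/bin/sh
# Hypothetical sample data: IDs and sequences are made up for illustration.
tmp=$(mktemp -d)
printf '%s\n' 'B.2' 'A.1' 'C.3' > "$tmp/file1"
printf '%s\n' '>A.1' 'MAAA' '>B.2' 'MBBB' '>D.4' 'MDDD' > "$tmp/file2"

# Same rules as tst.awk, inlined so the sketch is self-contained.
out=$(awk '
FNR==NR && p    { a[prev]=$0; p=0; next }         # line after a header: store it
FNR==NR && /^>/ { prev=substr($0,2); p=1; next }  # header: keep ID without ">"
$0 in a         { print ">" $0 RS a[$0] }         # file1 pass: emit in file1 order
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$out"
rm -rf "$tmp"
```

The output lists B.2 before A.1, following file1 rather than file2. C.3 is skipped because it has no entry in file2, and D.4 is skipped because file1 does not ask for it.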

If file2 is huge and you do not have enough memory, you can replace file2 with the output of your grep command (only the interesting part of file2).

awk -f tst.awk <(grep -A 1 -F -f file1 file2) file1
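The <( ... ) process substitution is a bash/ksh/zsh feature. In a plain POSIX sh, one possible equivalent is to pipe grep into awk and name standard input with "-" as the first file operand. The sketch below assumes a grep that supports -A (GNU and BSD grep do), and the sample data is made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample data, made up for illustration.
tmp=$(mktemp -d)
printf '%s\n' 'B.2' 'A.1' 'C.3' > "$tmp/file1"
printf '%s\n' '>A.1' 'MAAA' '>D.4' 'MDDD' '>B.2' 'MBBB' > "$tmp/file2"

# grep keeps only the wanted headers plus one following line each;
# awk reads that stream as its first input ("-") and file1 as its second.
out=$(grep -A 1 -F -f "$tmp/file1" "$tmp/file2" | awk '
FNR==NR && p    { a[prev]=$0; p=0; next }
FNR==NR && /^>/ { prev=substr($0,2); p=1; next }
$0 in a         { print ">" $0 RS a[$0] }
' - "$tmp/file1")

printf '%s\n' "$out"
rm -rf "$tmp"
```

The "--" group separators that grep -A prints between non-adjacent matches do no harm here: such a line is neither a header nor the line right after one, so the awk rules ignore it.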

Otherwise, you can still pass file1 file2, but you have to save the order of the lines and do the work in the END section.

> cat tst.awk
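# First file (file1, the ID list): remember each ID and its position.
FNR==NR {
    row[NR]=$0
    a[$0]
    next
}

# Second file (file2): the previous line was a wanted header,
# so store this line under that ID.
p {
    next_row[x]=$0
    p=0
    next
}

# Line whose text after the first character (the ">") is an ID from file1.
substr($0,2) in a {
    x=substr($0,2)
    p=1
}

# Print the stored records in file1 order.
END {
    for (i=1;i in row;i++)
        if (next_row[row[i]])
            print ">" row[i] RS next_row[row[i]]
}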

Usage:

awk -f tst.awk file1 file2
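As a quick check of this END-based variant, the sketch below runs the same rules inline with file1 given first. The sample files are hypothetical and made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample data, made up for illustration.
tmp=$(mktemp -d)
printf '%s\n' 'B.2' 'A.1' 'C.3' > "$tmp/file1"
printf '%s\n' '>A.1' 'MAAA' '>D.4' 'MDDD' '>B.2' 'MBBB' > "$tmp/file2"

out=$(awk '
FNR==NR           { row[NR]=$0; a[$0]; next }     # file1: record ID and position
p                 { next_row[x]=$0; p=0; next }   # line after a wanted header
substr($0,2) in a { x=substr($0,2); p=1 }         # header whose ID is in file1
END {
    # Walk the IDs in file1 order and print those that were found in file2.
    for (i=1; i in row; i++)
        if (next_row[row[i]])
            print ">" row[i] RS next_row[row[i]]
}
' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$out"
rm -rf "$tmp"
```

With this argument order, awk holds only the IDs from file1 and the sequences that matched them, not every record of file2. That is why it can stand in for the grep prefilter when file2 is large.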
