Machine Learning & Static Malware Detection

Analysis Tools
Method Overview
References:

Analysis Tools

readelf

elfparser

ninja

GDB

IDA Pro

Strings

Python libraries: pyelftools, lief

Method Overview

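Data/Features	Algorithm/Model	Advantages	Disadvantages
Raw binary	byte n-gram [7], MalConv [8][9]	No format parsing required	Sequences are extremely long; MalConv's convolution is computationally expensive
Raw binary	Image processing [1]	No format parsing required	Files vary in size, so image sizes are inconsistent; packing scrambles the data distribution
Raw binary	Byte (entropy) histogram [2]	No format parsing required
String information	NLP	Easy to obtain	Much information is missing; the data format is messy
ELF structural information	ML [3] [6]		Format parsing is complex; heavy feature engineering
Disassembly (asm)	Source code analysis, opcode [4,5]	Close to human-readable information	Requires disassembly
Disassembly (asm)	FCG	Exploits program execution logic	Requires disassembly; difficult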

How do we extract features from raw ELF samples? The methods are described below.

Binary Grayscale Image

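See [1]. Each byte of the file is read as an 8-bit unsigned integer (0-255), and the bytes are arranged row by row into a 2D array that is viewed as a grayscale image.

The images, which differ in size from file to file, are then normalized to a common size and used as input to the downstream model.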
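A minimal sketch of this conversion, using only the Python standard library. The helper names (bytes_to_image, resize_nearest), the fixed width of 256, and the 64x64 target size are assumptions made for illustration; [1] chooses the image width from the file size, and a real pipeline would more likely resize with an imaging library.

```python
def bytes_to_image(data: bytes, width: int = 256) -> list:
    """Arrange raw bytes into rows of `width` pixels (values 0-255)."""
    if not data:
        raise ValueError("empty input")
    rows = []
    for start in range(0, len(data), width):
        row = list(data[start:start + width])
        row += [0] * (width - len(row))  # zero-pad the last row
        rows.append(row)
    return rows


def resize_nearest(image: list, out_h: int = 64, out_w: int = 64) -> list:
    """Resize to a fixed size with nearest-neighbour sampling."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


# Hypothetical stand-in for the contents of an ELF file.
sample = bytes(range(256)) * 10 + b"\x7fELF"
img = bytes_to_image(sample)
fixed = resize_nearest(img)
print(len(img), len(img[0]), len(fixed), len(fixed[0]))
```

Nearest-neighbour sampling keeps the sketch dependency-free; it discards detail on large files, which is one reason the table lists inconsistent image sizes as a drawback of this approach.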

Byte (Entropy) Histogram

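Count the occurrences of each byte value (0-255) to form a 256-bin histogram.

For the byte-entropy variant, slide a 1024-byte window over the file with a step of 256 bytes. For each window, compute its entropy and pair that value with each of the window's 1024 bytes, giving 1024 (byte value, entropy) pairs. These pairs lie in a space of entropy (0 to 8 bits) by byte value (0 to 255); bin them into a 16x16 two-dimensional histogram, then flatten it into a 256-dimensional vector [2].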
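The two features above can be sketched as follows with the Python standard library. The function names and the exact bin boundaries (entropy bins of width 0.5, byte bins of width 16) are assumptions made for illustration; the window size of 1024, the step of 256, and the 16x16 layout follow the description above.

```python
import math
from collections import Counter


def byte_histogram(data: bytes) -> list:
    """256-bin count of each byte value."""
    counts = Counter(data)
    return [counts.get(v, 0) for v in range(256)]


def window_entropy(window: bytes) -> float:
    """Shannon entropy of a window in bits per byte (0.0 to 8.0)."""
    total = len(window)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(window).values()
    )


def byte_entropy_histogram(data: bytes, window: int = 1024, step: int = 256) -> list:
    """16x16 (entropy bin, byte bin) histogram, flattened to 256 values."""
    hist = [[0] * 16 for _ in range(16)]
    last_start = max(len(data) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = data[start:start + window]
        e_bin = min(int(window_entropy(chunk) * 2), 15)  # 0..8 bits -> 16 bins
        for b in chunk:
            hist[e_bin][b >> 4] += 1  # 0..255 -> 16 bins of width 16
    return [v for row in hist for v in row]


sample = bytes(range(256)) * 8  # hypothetical 2048-byte input
vec = byte_entropy_histogram(sample)
print(len(vec), sum(vec))
```

The output length is always 256 regardless of file size, which is what makes this representation convenient as a fixed-size model input.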

String Information

Scan the file with the strings command, which extracts runs of printable ASCII characters. Example output:

"__lseek64",
"__strndup",
"__gconv_modules_db",
", version ",
"expand_dynamic_string_token",
"pvalloc",
"_L_lock_4841",
"confstr",
"free_category",
"/etc/suid-debug",
"_IO_mem_sync",
"__pthread_rwlock_rdlock",
"__DTOR_LIST__",
"__strchrnul",
"__argz_stringify",
"pthread_cancel",
"__exit_funcs",
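A rough Python approximation of this extraction, plus a few simple features computed from the result. The regular expression, the minimum length of 4, and the feature names are assumptions made for illustration, not the exact behaviour of the strings command or of any cited paper.

```python
import re
from collections import Counter

# Runs of at least 4 printable ASCII characters (0x20-0x7e).
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def extract_strings(data: bytes) -> list:
    """Return every printable ASCII run found in the raw bytes."""
    return [m.decode("ascii") for m in PRINTABLE_RUN.findall(data)]


def string_features(strings: list) -> dict:
    """Summarise the extracted strings as simple numeric and token features."""
    lengths = [len(s) for s in strings]
    return {
        "num_strings": len(strings),
        "avg_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "num_paths": sum(s.startswith("/") for s in strings),
        # Bag of lowercase alphabetic tokens, usable as input to an NLP-style model.
        "tokens": Counter(
            t for s in strings for t in re.findall(r"[a-z]+", s.lower())
        ),
    }


# Hypothetical byte blob containing three of the strings listed above.
blob = b"\x00\x01__lseek64\x00\xff/etc/suid-debug\x00ab\x00pvalloc\x00"
found = extract_strings(blob)
feats = string_features(found)
print(found)
print(feats["num_strings"], feats["num_paths"])
```

The token counter is one way to turn the messy string output into something a text model can consume; short runs such as "ab" are dropped by the length threshold.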

ELF Structural Information

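These features are built from the structural information of the ELF file; [3] uses 383 such features.

EMBER [6] describes further options, such as features built from the import and export tables.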
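As a small illustration of structural features, the sketch below reads the fixed fields of the ELF file header with the struct module. It covers only the file header, so it is far from the 383 features of [3]; the function name and the two derived flags (is_64bit, is_little_endian) are assumptions made for illustration.

```python
import struct

FIELDS = ("e_type", "e_machine", "e_version", "e_entry", "e_phoff", "e_shoff",
          "e_flags", "e_ehsize", "e_phentsize", "e_phnum", "e_shentsize",
          "e_shnum", "e_shstrndx")


def elf_header_features(data: bytes) -> dict:
    """Parse the ELF file header into a flat dict of numeric features."""
    if len(data) < 16 or data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    ei_class, ei_data = data[4], data[5]  # 1/2 = 32/64-bit, 1/2 = little/big endian
    if ei_class not in (1, 2) or ei_data not in (1, 2):
        raise ValueError("unsupported ELF class or data encoding")
    endian = "<" if ei_data == 1 else ">"
    # Entry point and table offsets are 4 bytes in ELF32 and 8 bytes in ELF64.
    fmt = endian + ("HHIIIIIHHHHHH" if ei_class == 1 else "HHIQQQIHHHHHH")
    features = dict(zip(FIELDS, struct.unpack_from(fmt, data, 16)))
    features["is_64bit"] = int(ei_class == 2)
    features["is_little_endian"] = int(ei_data == 1)
    return features


# Synthetic 64-bit little-endian header, built here only to exercise the parser.
ident = b"\x7fELF" + bytes([2, 1, 1, 0]) + bytes(8)
body = struct.pack("<HHIQQQIHHHHHH", 2, 62, 1, 0x401000, 64, 0, 0, 64, 56, 3, 64, 0, 0)
header_feats = elf_header_features(ident + body)
print(header_feats["e_machine"], header_feats["e_phnum"])
```

For section, segment, and symbol tables, the libraries listed under Analysis Tools (pyelftools, lief) are the more practical route than hand-written parsing.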

Source Code Analysis and Opcode

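Source code analysis requires disassembling the binary into assembly code first; a pre-trained assembly language model such as PalmTree can then be applied [4].

Opcode-based methods use patterns of opcodes (opcode n-grams) taken from the disassembly as features; see the example figure in [5].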
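A minimal sketch of opcode n-gram features, assuming the disassembler has already produced a list of opcode mnemonics. The function names, the sample opcode sequence, and the choice of scaling counts by the most frequent n-gram are assumptions made for illustration, not a reproduction of the pipeline in [5].

```python
from collections import Counter


def opcode_ngrams(opcodes: list, n: int = 2) -> Counter:
    """Count each run of n consecutive opcodes."""
    return Counter(
        tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)
    )


def normalized_tf(counts: Counter) -> dict:
    """Scale each count by the count of the most frequent n-gram."""
    if not counts:
        return {}
    top = max(counts.values())
    return {gram: c / top for gram, c in counts.items()}


# Hypothetical opcode sequence for one function.
opcodes = ["push", "mov", "sub", "mov", "mov", "call",
           "mov", "mov", "call", "pop", "ret"]
counts = opcode_ngrams(opcodes, n=2)
tf = normalized_tf(counts)
print(counts.most_common(2))
```

Operands are dropped on purpose: keeping only the mnemonic makes the features less sensitive to addresses and register allocation, at the cost of losing some semantics.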

FCG (Function Call Graph)

DeepCG, Asm2vec

References:

[1] Malware Images: Visualization and Automatic Classification. https://vision.ece.ucsb.edu/sites/vision.ece.ucsb.edu/files/publications/nataraj_vizsec_2011_paper.pdf

[2] Deep Neural Network Based Malware Detection Using Two Dimensional Binary Program Features. https://www.cse.fau.edu/~xqzhu/courses/cap6619/deep.neural.network.based.malware.detection.pdf

[3] ELF-Miner: using structural knowledge and data mining methods to detect new (Linux) malicious executables. https://link.springer.com/content/pdf/10.1007/s10115-011-0393-5.pdf

[4] PalmTree: Learning an Assembly Language Model for Instruction Embedding. https://dl.acm.org/doi/pdf/10.1145/3460120.3484587

[5] Detecting unknown malicious code by applying classification techniques on OpCode patterns. https://security-informatics.springeropen.com/track/pdf/10.1186/2190-8532-1-1.pdf

https://xz.aliyun.com/t/6705

[6] EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models. https://arxiv.org/pdf/1804.04637.pdf. https://github.com/elastic/ember.

[7] An Investigation of Byte N-Gram Features for Malware Classification. http://www.edwardraff.com/publications/investigation_byte_ngrams.pdf

[8] MalConv: Malware Detection by Eating a Whole EXE. https://aaai.org/ocs/index.php/WS/AAAIW18/paper/viewFile/16422/15577

[9] Learning the PE Header, Malware Detection with Minimal Domain Knowledge. https://arxiv.org/ftp/arxiv/papers/1709/1709.01471.pdf. https://github.com/jaketae/deep-malware-detection