C lexer, understanding documentation, preprocessing tokens


My goal is to build a parser for a reasonable subset of C, and right now I'm at the start, implementing the lexer. Answers to a similar question on the same topic pointed towards the International Standard for C (700 pages of documentation) and the Yacc grammar webpage.

I would welcome any help with understanding the documentation: Is it true that the following picture from the documentation represents grammar rules, where the notation C -> (A, B) means that all occurrences of AB in that order get replaced by C?

identifier -> identifier-nondigit | (identifier,identifier-nondigit) | (identifier,digit)
identifier-nondigit -> nondigit | universal-character-name | other
digit -> 0 | 1 | 2 | ... | 9
nondigit -> _ | a | b | ... | z | A | ... | Z


I think I am confused because the documentation introduces 'preprocessing tokens' which I thought would be just labels of sequences of characters in the source produced without backtracking.

I.e. something like:

"15647  \n  \t abdsfg8rg \t" -> "DWLDLW"
// D .. digits, W ... whitespace, L ... letters

It seems like the lexer is doing the same thing as the parser (just building a tree). What is the reason for introducing the preprocessing tokens and tokens?

Does it mean that the processing should be done 'in two waves'? I was expecting the lexer to just use some regular expressions and maybe a few rules. But it seems like the result of lexing is a sequence of trees that can have the roots keyword, identifier, constant, string-literal, punctuator.

Thank you for any clarifications.

Tags: c, parsing, token, lexer



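Preprocessing tokens are the input to the C preprocessor. In the process of translating C source code to an executable, the stream of preprocessing tokens and intervening whitespace is first manipulated by the preprocessor; the resulting stream of preprocessing tokens and whitespace is then converted to (the standard's word; perhaps "reinterpreted as" conveys the idea better) a stream of tokens. An overview of all of this is given in the language standard.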


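The mapping from preprocessing tokens to tokens is as follows:

  • identifier --> keyword, identifier, or enumeration-constant (the choice between identifier and enumeration-constant is context-sensitive, but in practice that can be worked around by deferring the distinction until semantic analysis)
  • pp-number --> integer-constant or floating-constant, as appropriate (two of the alternatives for constant)
  • character-constant --> character-constant (one of the alternatives for constant)
  • string-literal --> string-literal
  • punctuator --> punctuator
  • anything else remaining after preprocessing directives are deleted --> one or more single-character tokens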
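
The conversion step described by this list can be sketched in a few lines of C. This is only an illustration under stated assumptions, not how any particular compiler does it: the enum names, `convert`, `is_keyword`, and `classify_number` are all made up for this example, the keyword table is deliberately partial, and constant suffixes such as `u` or `L` are not handled. The keyword check reflects that the standard's token categories include keyword, which has no preprocessing-token counterpart.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names: the standard defines the categories, not these enums. */
enum pp_kind  { PP_IDENTIFIER, PP_NUMBER, PP_CHAR_CONST, PP_STRING,
                PP_PUNCTUATOR, PP_OTHER };
enum tok_kind { TOK_KEYWORD, TOK_IDENTIFIER, TOK_INT_CONST, TOK_FLOAT_CONST,
                TOK_CHAR_CONST, TOK_STRING, TOK_PUNCTUATOR, TOK_OTHER_CHAR,
                TOK_INVALID };

/* Deliberately partial keyword table, enough for a small subset of C. */
static int is_keyword(const char *s) {
    static const char *const kw[] = { "int", "char", "void", "if", "else",
                                      "while", "for", "return" };
    for (size_t i = 0; i < sizeof kw / sizeof kw[0]; i++)
        if (strcmp(s, kw[i]) == 0)
            return 1;
    return 0;
}

/* Rough classification of a pp-number spelling. Assumption: no suffixes
 * (u, l, f, ...), so "10u" is reported as invalid by this sketch. */
static enum tok_kind classify_number(const char *s) {
    char *end;
    (void)strtoll(s, &end, 0);            /* base 0: decimal, octal, or hex */
    if (end != s && *end == '\0')
        return TOK_INT_CONST;
    if (strpbrk(s, ".eEpP") == NULL)      /* e.g. "08": bad octal, not a float */
        return TOK_INVALID;
    (void)strtod(s, &end);
    if (end != s && *end == '\0')
        return TOK_FLOAT_CONST;
    return TOK_INVALID;                   /* e.g. "1.2.3" */
}

/* Convert one preprocessing token (kind plus spelling) to a token kind. */
enum tok_kind convert(enum pp_kind kind, const char *spelling) {
    assert(spelling != NULL);
    switch (kind) {
    case PP_IDENTIFIER:
        /* identifier vs. enumeration-constant is left to semantic analysis */
        return is_keyword(spelling) ? TOK_KEYWORD : TOK_IDENTIFIER;
    case PP_NUMBER:     return classify_number(spelling);
    case PP_CHAR_CONST: return TOK_CHAR_CONST;
    case PP_STRING:     return TOK_STRING;
    case PP_PUNCTUATOR: return TOK_PUNCTUATOR;
    default:            return TOK_OTHER_CHAR;  /* passed on for the parser to reject */
    }
}
```

With these definitions, `convert(PP_IDENTIFIER, "while")` yields `TOK_KEYWORD`, while `convert(PP_NUMBER, "1.2.3")` yields `TOK_INVALID`: the spelling is a perfectly good preprocessing token but does not convert to any constant.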


More details of lexical analysis are provided in section 6.4 of the standard and its subsections.
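
The productions in that section are written recursively, but a rule such as identifier describes a regular language, so a lexer can match it with a plain loop and emit one flat token rather than a tree. A minimal sketch, assuming ASCII input in the default "C" locale and ignoring universal character names and implementation-defined characters; `identifier_length` and `is_nondigit` are hypothetical names chosen for this example:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* nondigit: underscore or a letter (ASCII letters in the "C" locale). */
static int is_nondigit(unsigned char c) {
    return c == '_' || isalpha(c);
}

/* Return the length of the identifier at the start of s, or 0 if s does
 * not start with one. The recursive rule
 *     identifier: identifier-nondigit
 *                 identifier identifier-nondigit
 *                 identifier digit
 * is equivalent to "one nondigit, then any run of nondigits and digits". */
size_t identifier_length(const char *s) {
    assert(s != NULL);
    if (!is_nondigit((unsigned char)s[0]))
        return 0;
    size_t n = 1;
    while (is_nondigit((unsigned char)s[n]) || isdigit((unsigned char)s[n]))
        n++;
    return n;
}
```

Applied to the example in the question, `identifier_length("abdsfg8rg \t")` consumes all nine characters of `abdsfg8rg` as a single identifier, instead of splitting it into letters, a digit, and more letters.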




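Basically, standard C is really two languages in one: a preprocessing language and the C language proper. Early on, they were processed by entirely separate programs, and preprocessing was optional. The preprocessor has its own view of the units it operates on, which does not line up exactly with the categories of the C grammar. Preprocessing tokens are the units of syntactic analysis and data for the preprocessor, whereas tokens are the units of syntactic analysis for C.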
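
One place where the two views visibly differ is pp-number: the preprocessor's rule is looser than the grammar for C constants, so spellings such as `1.2.3` or `0x1e+1` are each a single preprocessing token even though neither is a valid constant. A minimal sketch of a pp-number scanner, assuming the C99/C11 grammar, ASCII input, and no universal character names; the function name `pp_number_length` is made up for illustration:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Return the length of the pp-number at the start of s, or 0 if none.
 * A pp-number starts with a digit, or with '.' followed by a digit, and
 * then continues with digits, letters, '_', '.', or an exponent letter
 * (e, E, p, P) immediately followed by a sign. */
size_t pp_number_length(const char *s) {
    assert(s != NULL);
    size_t i;
    if (isdigit((unsigned char)s[0]))
        i = 1;
    else if (s[0] == '.' && isdigit((unsigned char)s[1]))
        i = 2;
    else
        return 0;

    for (;;) {
        unsigned char c = (unsigned char)s[i];
        if ((c == 'e' || c == 'E' || c == 'p' || c == 'P') &&
            (s[i + 1] == '+' || s[i + 1] == '-'))
            i += 2;                       /* exponent letter plus sign */
        else if (isalnum(c) || c == '_' || c == '.')
            i += 1;                       /* digit, identifier-nondigit, or '.' */
        else
            break;
    }
    return i;
}
```

Here `pp_number_length("0x1e+1")` covers all six characters, which is why `0x1e+1` in C source is one (invalid) token rather than the sum `0x1e + 1`. Deciding whether a pp-number is actually an integer or floating constant only happens later, when preprocessing tokens are converted to tokens.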


