语法分析(一)

语法分析主要用于处理上下文无关(CFG:contex free grammar)文法。

1. 术语

1.1. Formal Language

Def: let $\sum$ be a set of characters (an alphabet). An language over $\sum$ is a set of strings of characters drawn from $\sum$.

意味着语言L是一个映射,将语法映射到语义。

\[L(reg) = a\;set\;of\;strings.\]

例如,L(G)表示由G生成的语句集合,也即由G生成的语言。

1.2. context free grammar

Def: For a language L, its CFG defines the sets of strings of symbols that are valid sentences in L.

CFG由(T,NT,S,PR)共同构成。

1.3. sentence

Def: a string of symbols that can be derived from the rules of a grammar.

1.4. production

Def: Each rule in a CFG is called a production

1.5. Nonterminal symbol

Def: a syntactic variable used in a grammar’s productions

1.6. Terminal symbol

Def: a word that can occur in a sentence

1.7. Word

Def: A word consists of a lexeme and its syntactic category. Words are represented in a grammar by their syntactic category.

2. if-then-else的匹配

文法出现二义性通常由于两种原因导致:

  1. precedence:运算符优先级没有指明
  2. associativity:运算符关联性没有指明

例如else的匹配问题,通常我们希望else匹配到最近的前一个未匹配的if语句,可用的几种文法如下:

1
2
3
stmt -> if Expr then stmt
     |  if Expr then WithElse else stmt
withelse -> if Expr then withelse else withelse

或者等价的(下方unmatched提到stmt中,matched和withelse相同):

1
2
3
4
5
stmt -> matchedstmt
     | unmatchedstmt
matchedstmt -> if expr then matchedstmt else matchedstmt
unmatchedstmt -> if expr then stmt
              |  if expr then matchedstmt else unmatchedstmt

通过以上文法可以观察到,parse的关键在于if-then-else结构的then部分,需要限制如果当前能够使用else匹配,则要保证then部分必须要么(I)不存在if相关(if-then或者if-then-else)结构;要么(II)then部分含有的if结构均为if-then-else类型。

但是下面的文法不行

1
2
3
stmt -> if expr then stmt
     |  matchedstmt
matchedstmt -> if expr then matchedstmt else stmt

问题在于matchedstmt的else部分不应当为stmt,导致matchedstmt可以匹配if-then-else-if-then。反例如下

1
2
3
4
5
6
7
8
9
if then if then else if then else
// 解析方式1
if then 
    | if then else
                | if then else
// 解析方式2
if then else
     | if then else
                | if then