Download Presentation
## CYK Parser

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -

**CYK Parser**Von Carla und Cornelia Kempa**Example grammar**• Number(s) Integer | Real • Integer Digit | Integer Digit • Real Integer Fraction Scale • Fraction .Integer • Scale e Sign Integer | Empty • Digit 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 • Empty ɛ • Sign + | -**Example Sentence: 32.5e+1**• 1. concentrate on the substrings of the input sentence**32.5e +1 is in the language**• What problems can we already see in this example?**Another complication: Ɛ- rules**Input : 43.1**The ɛ- Problem**Shortest substrings of any input sentence : ɛ-substrings We must compute Rɛ the set of non-terminals that derive ɛ Rɛ = { Empty, Scale }**Non- empty substrings of the input sentence**• Input : z = z1 z2 z3 z4 ….zn • Compute the set of Non-Terminals that derive the substring of z starting at position i, of length l.**Terminology (also on the handout)**• i index we are starting at • l length of this substring • R s i,l set of Non-Terminals deriving the substring s i, l • S i, 0 = ɛ • Set of Non- Terminals that derive ɛ : R s i,0 = R ɛ**The set of Non- Terminals deriving the substring s i, l :**R si, l 1.) substrings of length 0 S i, 0 = ɛ and R si, l = Rɛ 2.) short substrings 3.) longer substrings (say l = j ) All the information on substrings with l < j is available**Check each RH-side (Right-Hand -side) in the grammar to see**if it derives s i, l • L A1 ….Am S i, l ( divided into m segments (= possibly empty)) A1 first segment of s i, l A2 second segment of s i, l …. ….**A 1 ….Am s i,l**• So A1 first part of s i,l (let´s say A1 has to derive a first part of s i, l of length k) A1 s i, k A1 is in the set R s i,k**A 1 ….Am s i,l**• Assuming this A2…Am has to derive the rest: A2 … Am Si+k, l-k This is attempted for every k**Problems with this Approach**1) Consider A2…Am m could be 1 and A1 a Non-terminal We are Dealing with a unit- rule A1 must derive the whole substring s i, l and thus be a member of R s i, l But that´s the set we are computing right now …**Solution to this problem**• A1 s i, l • Somewhere along the derivation there must be a first step not using a unit rule A1 B … C * s i, l C is the first Non-Terminal using a non-unit-rule in the derivation**Solution cont.**At some stage C is added to Rs i, l If we repeat the process again and again At some point B will be added and in the next step A1 will be added We have to repeat the process again and again until no new Non-Terminals are added to R s i,l**Problem 2**Ɛ-rules Consider all but one of the At derive Ɛ B A1 A2 A3 A4 A5 …. At B and A1 - t are Non-Terminals A2 – At derive Ɛ So what stays is : B A1 A unit-rule**We have computed all the Rs i,l**• If S is a member of Rs 1, n the start symbol derives z (=s 1, n) (the input string)**CYK recognition with a grammar in ****- form:**• What are the Restrictions we want to have on our grammar ?**Useful Restrictions**• No ɛ- rules • No unit-rules • Limit the length of the right- hand side of each rule, say to two • What we get out of this: • A a • A BC • Where a is a terminal and ABC are Non- Terminals**Chomsky-Normal-Form…**(… not only to annoy students ) • Perfect grammar for CYK**How CYK works for a grammar in CNF**• Rɛ is empty • R s i, 1 can be read directly from the rules (A a) A rule A BC can never derive a single terminal**Procedure**• Iteratively (as before) : • 1) Fill the sets R s, 1 directly • 2) Process all substrings of length 1 • 3) Process all substrings of length 2 • 4) Process all substrings of length l • For the first step we use the rules of the form A a • For all the following steps we have to use the rules of the form: A BC**CYK and CNF**Question the CYK-Parser has to answear is: Does such a k exist?**Answearing this question is easy:**• Just try all possibilities • no problem since you are a computer ;-) • Range : from 1 to (l-1) • All the sets R s i,k and R s i+k , l-1 have already been computed at this point**Transform our sample CF-grammar into Chomsky Normal Form**• Overview • 1) eliminate ɛ-rules • 2) eliminate unit-rules • 3) remove non-productive non-terminals • 4) remove non –reachable non-terminals • 5) modify the rest until all grammar rules are of the form A a , A BC**Our number grammar in CNF**• Number(s) 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 • Number(s) Integer Digit • Number(s) N1 Scale´ | Integer Fraction • N1 Integer Fraction • Integer 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 • Integer Integer Digit • Fraction T1 Integer • T1 . • Scale ´ N2 Integer • N2 T2 Sign • T2 e • Digit 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 • Sign + | -**Building the recognition table**• Input : Our example grammar in CNF input sentence: 32.5 e + 1**Building the recognition table**• 1) bottom-row : read directly from the grammar (rules of the form A a ) • 2) Check each RHS in the grammar**Check each RHS of the grammar**• Two Ways: Example: 2.5 e ( = s 2, 4) • 1) check each RHS e.g N1 Scale´ • 2) compute possible RH-Sides from the recognition table**How this is done**1) N1 not in R s 2, 1 or R s 2, 2 N1 is a member of R s 2, 3 But Scale´ is not a member of R s 5, 1 2) R s 2, 4 is the set of Non- Terminals that have a RHS AB where either: A in R s 2, 1 and B in R s 3, 3 A in R s 2, 2 and B in R s 4, 2 A in R s 2, 3 and B in R s 5, 1 Possible combinations: N1 T2 or Number T2 In our grammar we do not have such a RHS, so nothing is added to R s 2, 4.**Computing R s i, l:follow the arrows V and W simultaneously**A BC , B a member of a set on the V arrow , C a member of a set on the W arrow**Comparison**• This process is much less complicated than the one we saw before • Why?**Conclusion**• This process is much less complicated • Reasons: 1) We do not have to repeat the process again and again until no new Non-Terminals are added to R s i,l (The substrings we are dealing with are really substrings and cannot be equal to the string we start with)**Reasons cont.**2) We only have to find one place where the substring must be split into two A B C Here !