Statistické metody zpracování přirozených jazyků II
Written Exam
2005/2006
16 May 2006, in place of the lecture (the exam takes more than 2 hours)
Topics for the exam questions:
- shift-reduce parsing – example: state tables, parsing a sample input
- tagging (HMM) – exercise on the Viterbi algorithm: for a given output, find the most probable sequence of states
- probabilistic parsing – derive a grammar + probabilities from a derivation tree, ...
- a surprise, so that we learn something at the exam
2006/2007
2008/2009
Preliminary information about the exam -- what you need to know
Morphology -- usually not required in such detail
- The principle of transducers
- Definition of an HMM, distinguishing the tagging methods
- Max. entropy
- Log-linear model
- Overtraining
- Trellis, HMM, Baum-Welch
Parsing -- the most important part
- Basics -- the Chomsky hierarchy of grammars and the position of CFGs in it
- Chomsky Normal Form, normalization
- Dependency and constituency trees, conversion between them
- Linearization (bracketing)
- Constructing shift-reduce tables (watch out for mistakes, forgotten rules, etc.)
- Probabilistic parsing
- Being able to extract a grammar from a tree (each rule only once!) together with probabilities
- Computing the probability of a parse (derivation tree) in a PCFG, and the probability of a string (which may have several possible parses)
- How is parsing evaluated?
- Lexicalization of a grammar
- Smoothing (MLE estimate of a probability distribution) -- remember the methods from the winter term
- We handle the occurrence of events about which we have no information from the training data,
- and at the same time compensate for the overly high probability of the seen events
Written Exam
21 May 2009
- entropy, cross-entropy (compute them on a single sentence)
- max. entropy (formula, features -- compute the values on a single sentence)
- theory: approaches to tagging
- construct shift-reduce tables and parse a string
- turn linearized trees into regular ones, extract a grammar from them and write down the probabilities; theory of chart parsing
- very similar to the exams from previous years
Credit Assignments
- link to the Brill tagger (it should compile without major modifications to the code)
To-Remember
Morphology & Tagging
- Task, formally: $ A^{+} \to T $ (simplified), split morphology & tagging (disambiguation): $ A^{+} \to 2^{(L,C_1,C_2,\dots,C_n)}\to T $. Tagging must look at context.
- Tagset: influenced by linguistics as well as technical decisions.
- Tag ~ n-tuple of grammatical information, often thought of as a flat list.
- Eng. tagsets ~ 45 (PTB), 87 (Brown), presentation: short names.
- Other: bigger tagsets -- only positional tags, size: up to 10k.
- Tagging inside morphology: first, find the right tag, then the morphological categories: $ A^{+}\to T\to (L,C_1,\dots,C_n) $.
- Doable only for languages with poor inflection.
- Possibly only decrease the ambiguity for the purposes of tagging (i.e. morphology doesn't have to be so precise).
- Lemmatization: normally a part of morphology, sometimes (for searching) done separately.
- Stemming -- simple code for Eng., no dictionary needed, now outdated.
- Possible methods for morphology:
- Word lists: lists of possible tags for each word form in a language
- Works well for Eng. (avg. ca. 2.5 tags/word), not so good for Cze. (avg. ca. 12-15 tags/word).
- Direct coding: splitting into morphemes (problem: split and find possible combinations)
- Finite State Machinery (FSM)
- CFG, DATR, Unification: better for generation than analysis
Finite State Machinery
Finite-State-Automata
- Smarter word form lists: compression of a long word list into a compact automaton.
- Trie + Grammar information, minimize the automaton (automaton reduction)
- Need to minimize the automaton & not overgenerate
Two-Level-Morphology
- phonology + morphotactics, two-level rules, converted to FSA
- solves e.g. Eng. doubling (stopping), Ger. umlaut, Cze. "ský" -->pl. "čtí" etc.
- Finite State Transducer: an automaton where the input symbols are tuples (see the sketch after this list)
- run modes: check (for a sequence of pairs, gives out Y/N) + analysis (computes the "resolved" (upper) member of the pair for a sequence of "surface" symbols) + synthesis (the other way round)
- used mostly for analysis
- usually, one writes small independent rules (watch out for collisions); they are then run in parallel and all must hold
- zero symbols (only on one side of a pair); determine the maximum number of consecutive zeroes for a language
- we may eliminate zero-symbols using ordinary FSA on lexicon entries (upper layer alphabet; prefixes: according to them, the possible endings are treated specially)
- FSTs + FSAs may be combined, concatenated; the result is always an FST
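- A minimal sketch of the "check" run mode on a toy transducer, using the Eng. doubling example above (all states, transitions and the stop+ing alignment here are made up for illustration, not taken from the lecture):

```python
# A tiny finite-state transducer over (lexical, surface) symbol pairs,
# run in "check" mode: does a given pair sequence pass the machine?
TRANSITIONS = {                       # (state, (lexical, surface)) -> next state
    (0, ("s", "s")): 1,
    (1, ("t", "t")): 2,
    (2, ("o", "o")): 3,
    (3, ("p", "p")): 4,
    (4, ("0", "p")): 5,               # zero on the lexical side: doubled surface p
    (5, ("i", "i")): 6,
    (6, ("n", "n")): 7,
    (7, ("g", "g")): 8,
}
FINAL = {4, 8}

def check(pairs):
    """Return True if the sequence of (lexical, surface) pairs is accepted."""
    state = 0
    for pair in pairs:
        if (state, pair) not in TRANSITIONS:
            return False
        state = TRANSITIONS[(state, pair)]
    return state in FINAL

# toy alignment of lexical "stop(0)ing" with surface "stopping"
print(check([("s", "s"), ("t", "t"), ("o", "o"), ("p", "p"),
             ("0", "p"), ("i", "i"), ("n", "n"), ("g", "g")]))
```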
Two-level morphology: analysis
- initialize paths $ P := \{\} $
- read input surface symbols, one at a time, repeat (this loop consumes one char of the input):
- at each symbol, until max. consecutive zeroes for given language reached, repeat (this loop just adds one possible zero):
- generate all lexical (upper-level) symbols possibly corresponding to zero (+all possible zeroes on surface)
- prolong all paths in $ P\;\! $ by all such possible $ (x:0)\;\! $ pairs (big expansion)
- check each new path extension against the phonological FST and lexical FSA (lexical symbols only), delete impossible path prefixes.
- generate all possible lexical symbols (get from all FSTs) for the current input symbol, create pairs
- extend all paths with the pairs
- check all paths (next step in FST/FSA) and delete the impossible ones
- collect lexical "glosses" from all surviving paths
Rule-Based Tagging
- Rules using: word forms/tags/tag sets for context and current position, combinations
- If-then / regular expression style (blocking negative)
- Implementation: FSA, for all rules -- intersection
- algorithm ~ similar to Viterbi (dynamic programming: if the FSA rejects a path, throw it away)
- May even work, sometimes does not disambiguate enough
- Tagging by parsing: write rules for syntactic analysis and get morphological analysis as a by-product
- in a way, this is the linguistically correct approach
- better, cleaner rules
- more difficult than tagging itself, nobody has ever done it right
HMM Tagging
- probabilistic methods, also applies to feature-based
- noisy channel: input (tags) -> output (words), goal: discover channel input given the output.
- $ p(T|W) = p(W|T)\cdot\frac{p(T)}{p(W)} $, $ \mathrm{argmax}_T p(T|W) = \mathrm{argmax}_T p(W|T)p(T) $.
- two models -- simplification:
- tags depend on limited history (4-5 grams)
- word depends on tag only (1 gram!)
- almost the general HMM
- output words emitted by states (not arcs), states are $ (n-1) $-tuples of tags for an $ n $-gram tag model
- $ (S, s_0, Y, P_S, P_Y) $ -- set of states, initial state, output alphabet (words), set of prob. distributions of transitions, set of prob. distributions for emissions
- supervised learning: use MLE, smoothing:
- p(w|t) -- "add 1" for all possible tag+word pairs using a predefined dictionary (i.e. some 0's kept: p(word|impossible-tag) = 0)
- p(t|context) -- linear interpolation, up to uniform (as for language model)
- old and simple, but still accurate enough (only slower than e.g. neural networks)
- may be trained even with unsupervised data: unambiguous words help get the disambiguation for the others (improvement depends on language and tagset)
- Baum-Welch algorithm, minimizing the entropy; use heldout data
- training always decreases the entropy, smoothing increases it again (in case of no bigger tagged corpus available, it's a good step to try; supervised is always better)
- Out-of-Vocabulary
- no lists of possible tags
- try all / open class tags (good for non-flective languages), or:
- try to guess possible tags based on suffix/ending or both ends of the word (e.g. for Cze. -- first 3 and last 8 letters) -- train the classifier using rare words from the training data only!
- Running the tagger
- Viterbi (see the sketch after this list); remember to handle unknown words, or:
- always assign the best tag at each word, but consider all possibilities for the previous tags; this introduces some errors, but might get better accuracy
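- A minimal Viterbi sketch for a bigram HMM tagger (the tags, toy probabilities and the 1e-6 floor for unknown words are made-up placeholders, not values from the lecture):

```python
# Viterbi for a bigram tag model: states ~ single tags, words emitted by states.
def viterbi(words, tags, trans, emit, start):
    # delta[t] = probability of the best tag sequence ending in tag t
    delta = {t: start[t] * emit[t].get(words[0], 1e-6) for t in tags}
    back = []
    for w in words[1:]:
        new_delta, pointers = {}, {}
        for t in tags:
            prev, p = max(((s, delta[s] * trans[s][t]) for s in tags),
                          key=lambda x: x[1])
            new_delta[t] = p * emit[t].get(w, 1e-6)   # 1e-6 ~ unknown-word floor
            pointers[t] = prev
        delta, back = new_delta, back + [pointers]
    best = max(delta, key=delta.get)                  # follow back-pointers
    path = [best]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))

TAGS = ["D", "N", "V"]
START = {"D": 0.6, "N": 0.3, "V": 0.1}
TRANS = {"D": {"D": 0.05, "N": 0.9, "V": 0.05},
         "N": {"D": 0.1, "N": 0.3, "V": 0.6},
         "V": {"D": 0.5, "N": 0.4, "V": 0.1}}
EMIT = {"D": {"the": 0.9}, "N": {"dog": 0.5, "walk": 0.2},
        "V": {"walk": 0.5, "barks": 0.3}}
print(viterbi("the dog barks".split(), TAGS, TRANS, EMIT, START))
```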
Transformation-Based Tagging
- Not a noisy channel, not probabilistic, but statistical -- uses training data (a combination of supervised/unsupervised), learns rules
- criterion: accuracy -- "objective function"
- training: stepwise greedy-select
- iterate: pre-annotate using the current rules (intermediate data), then select the rule from the pool of possible ones (instantiated from templates) that contributes most to reducing the error rate (see the sketch after this list)
- stopping criterion: no or too small improvement possible; prone to overtraining!
- heldout possible (afterwards, remove rules that degrade performance on heldout data)
- rule types: context, lexical (looks at parts of the word)
- application of the rules -- left-to-right (a rule may be applied on part of its output) / delayed
- improved version: Fast-TBL (Transformation-Based Learning); there is no parallelized version
- old method (90's -- was the best one), faster than HMM
- tested for Cze. in the late 90's, not very good results, too many rules -- uncomputable (no way to parallelize it, the rules are weird in the beginning)
- may be used e.g. for named entity recognition (fewer rules, more effective)
- tagger: input = untagged data + rules from the learner
- applies the rules one-by-one to all the data --> creates $ n $ iterations of intermediate data, the last one of which is the output
- n-best modification (criterion: accuracy + number of tags per word), unsupervised modification (use only unambiguous words for evaluation)
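- A toy sketch of the greedy TBL training loop under one hypothetical rule template ("change tag A to B when the previous tag is C"); the data structures, the template and the stopping threshold are assumptions made for this illustration:

```python
# Greedy transformation-based learning: repeatedly pick the candidate rule
# that most reduces the error rate on the currently (re-)annotated data.
def apply_rule(rule, tags):
    a, b, prev = rule
    out = list(tags)
    for i in range(1, len(tags)):
        # left-to-right application: later positions may see the rule's output
        if out[i] == a and out[i - 1] == prev:
            out[i] = b
    return out

def errors(tags, gold):
    return sum(t != g for t, g in zip(tags, gold))

def train_tbl(sentences, gold_tags, max_rules=20, min_gain=1):
    """sentences: current tag sequences (e.g. a most-frequent-tag baseline),
    gold_tags: correct tag sequences; returns an ordered list of rules."""
    rules = []
    for _ in range(max_rules):
        base_err = sum(errors(s, g) for s, g in zip(sentences, gold_tags))
        candidates = set()               # instantiate the template at error sites
        for s, g in zip(sentences, gold_tags):
            for i in range(1, len(s)):
                if s[i] != g[i]:
                    candidates.add((s[i], g[i], s[i - 1]))
        best, best_gain = None, 0
        for rule in candidates:
            new_err = sum(errors(apply_rule(rule, s), g)
                          for s, g in zip(sentences, gold_tags))
            if base_err - new_err > best_gain:
                best, best_gain = rule, base_err - new_err
        if best is None or best_gain < min_gain:
            break                        # stopping criterion: too small improvement
        rules.append(best)
        sentences = [apply_rule(best, s) for s in sentences]  # re-annotate
    return rules

base = [["D", "V", "N"], ["D", "N"]]     # baseline tagging
gold = [["D", "N", "V"], ["D", "N"]]
print(train_tbl(base, gold))
```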
Maximum-Entropy Tagging
- using Max. Entropy model
- feature functions -- take the data and say Y/N (or give out a real number), weighted ($ \sum \lambda = 1 $)
- maximize entropy ~ train the weights
- respect the proportions of counts of features in the data (1's)
- objective function optimization under certain constraints: Lagrange multipliers
- the only freely adjustable part is the set of features (which is much more than what we may tune in an HMM -- there's only $ n $ for $ n $-grams)
- features may use tags only at positions to the left (which allows dynamic programming ~ Viterbi) + words at all positions; they don't have to do so at all (it's just very useful)
- the model in general: $ p(y,x) = \frac{1}{Z}e^{\sum_{i=1}^{N}\lambda_i f_i(y,x)} $, find $ \lambda_i $ satisfying the model and constraints ($ E_p(f_i(y,x)) = d_i $ where $ d_i = E'(f_i(y,x)) $ -- empirical expectation of the feature frequency)
- for tagging: $ p(t,x) = \frac{1}{Z}e^{\sum_{i=1}^{N}\lambda_i f_i(t,x)} $, where $ t\in $ tagset, $ x $ ~ context (words and tags, e.g. up to 3 positions left/right, tags only to the left); see the sketch at the end of this section
- features may ask any information from this window
- be careful about too many details -- it slows things down and there's an imminent danger of overtraining
- best way: select features automatically -- we then have a combination of HMM & Transformation-Based Learning
- rules involve tags at previous places and are selected by machine
- we still have to define the templates for the features
- features are independent --> parallelizable; they may even capture interesting linguistic phenomena, e.g. final punctuation -> mood of the verb etc.
- feature selection
- manually: impossible --> greedy selection: iterative scaling algorithm ($ O(n^2) $ -- still too much)
- limiting the number of features: be reasonable in creating the templates
- use contexts which appear in the training data
- use only rarer features (max. ca. 10 times in the training data) -- too frequent features are probably too unspecific -- get rid of them, e.g. if they have a distribution close to uniform (use relative entropy)
- do not use all combinations of context
- feature types
- any, even lexical features -- words with most errors etc.
- the best way of tagging, the only problem is that we don't know what the optimal features are (the neural networks/perceptron are the best for Cze.)
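- A minimal sketch of evaluating the conditional log-linear model $ p(t|x) $ for one position, with $ Z(x) $ summed over the tagset; the feature functions, weights and tagset are made-up placeholders (the weights would in practice come from training, e.g. iterative scaling):

```python
import math

# p(t|x) = (1/Z(x)) * exp(sum_i lambda_i * f_i(t, x)) with toy features/weights.
def f_capitalized_nnp(tag, context):
    return 1.0 if tag == "NNP" and context["word"][0].isupper() else 0.0

def f_prev_det_noun(tag, context):
    return 1.0 if tag == "NN" and context["prev_tag"] == "DT" else 0.0

FEATURES = [f_capitalized_nnp, f_prev_det_noun]
LAMBDAS = [1.2, 0.8]                   # hypothetical trained weights
TAGSET = ["NN", "NNP", "VB", "DT"]

def p_tag(tag, context):
    def score(t):
        return math.exp(sum(l * f(t, context)
                            for l, f in zip(LAMBDAS, FEATURES)))
    z = sum(score(t) for t in TAGSET)  # normalization Z(x) over the tagset
    return score(tag) / z

ctx = {"word": "Prague", "prev_tag": "DT"}
print({t: round(p_tag(t, ctx), 3) for t in TAGSET})
```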
Feature-Based Tagging
This section is incomplete and needs to be expanded. This was probably not covered much, so it is not that important.
- idea: save on computing feature weights, select a few but good features
- training criterion: error rate
- model form: same as for maximum entropy $ p(y|x) = \frac{1}{Z(x)}e^{\sum_{i=1}^N \lambda_i f_i (y,x)} $ -- exponential or loglinear model
Tagger Evaluation
- A must: Test data (S), previously unseen, change frequently if possible
- Formally: Out(w) / True(w) -- for a given word
- Errors(S) = $ \sum_{i=1}^{|S|} \delta(\mathrm{Out}(w_i)\neq \mathrm{True}(w_i)) $
- Correct(S) = $ \sum_{i=1}^{|S|} \delta(\mathrm{True}(w_i)\in \mathrm{Out}(w_i)) $
- Generated(S) = $ \sum_{i=1}^{|S|} |\mathrm{Out}(w_i)| $ -- how many outputs the tagger produced (sum over all data)
Metrics
- Error rate: Err(S) = Errors(S) / |S|
- Accuracy: Acc(S) = 1 - Err(S)
- Multiple / no output:
- Recall: R(S) = Correct(S) / |S| -- must select the right one (possibly among others)
- Precision: P(S) = Correct(S) / Generated(S) -- against too much noise
- P and R trade off against each other, and there is also no way to tell which matters more (depends on the application: Google -- P, a medical test -- R)
- systems with a big difference in P/R are (empirically) worse
- F-Measure: $ F = \frac{1}{\frac{\alpha}{P}+\frac{1-\alpha}{R}} $, usually $ F = \frac{2PR}{R+P} $ (for $ \alpha = 0.5 $)
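- A small sketch computing the metrics above; `out` holds the set of tags output for each word and `gold` the true tag (the names are assumptions for this illustration; Err/Acc are meaningful when the tagger outputs exactly one tag per word):

```python
# Error rate, accuracy, recall, precision and F-measure over one test set.
def evaluate(out, gold):
    n = len(gold)
    errs = sum(1 for o, g in zip(out, gold) if o != {g})   # single-output errors
    correct = sum(1 for o, g in zip(out, gold) if g in o)
    generated = sum(len(o) for o in out)                   # all produced tags
    recall = correct / n
    precision = correct / generated
    f1 = 2 * precision * recall / (precision + recall)
    return {"Err": errs / n, "Acc": 1 - errs / n,
            "R": recall, "P": precision, "F": f1}

out = [{"DT"}, {"NN", "VB"}, {"VB"}]    # tagger output (may be ambiguous)
gold = ["DT", "NN", "NNS"]
print(evaluate(out, gold))
```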
Parsing
Chomsky Hierarchy
- Type 0 Grammars/Languages -- $ \alpha\to\beta $, where $ \alpha,\beta $ are any strings of terminals and nonterminals ($ \alpha $ containing at least one nonterminal)
- Context-Sensitive Grammars/Languages -- $ \alpha X\beta\to\alpha\gamma\beta $, where $ X $ is a nonterminal, $ \alpha,\beta,\gamma $ any string of terminals and nonterminals (and $ \gamma $ must not be empty)
- Context-Free Grammars/Languages -- $ X\to\gamma $, where $ X $ is a nonterminal and $ \gamma $ is any string of terminals and nonterminals
- Regular Grammars/Languages -- $ X\to\alpha Y $, where $ X,Y $ are nonterminals and $ \alpha $ is a string of terminals; $ Y $ may be missing
Chomsky's Normal Form
- for CFGs -- each CFG is convertible to the normal form
- rules only of the form $ A\to B\ C $ (two nonterminals), $ A\to a $ (a single terminal), $ S\to\varepsilon $ (empty string)
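- A hedged illustration of the conversion (example rule made up for this note): $ A\to B\ c\ D $ is not in the normal form; it can be replaced by $ A\to B\ X $, $ X\to C_c\ D $ and $ C_c\to c $, where $ X, C_c $ are newly introduced nonterminals -- the binarization and the lifting of terminals preserve the generated language.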
Parsing grammars
- Regular Grammars -- FSA, constant space, linear time
- CFG -- widely used for surface syntax description of natural languages, needed: stack space, $ O(n^3) $ time -- CKY/CYK algorithm
Shift-Reduce CFG Parsing
- CFG with no empty rules ($ N\to\varepsilon $) -- any CFG may be converted; recursion is OK
- Bottom-up, construction of a push-down automaton (non-deterministic); delay rule acceptance until all of it is parsed
- Asymptotically slower than CKY, but fast for usual grammars
- Builds upon a state parsing table ~ graph, edges = transitions (defined by one terminal or nonterminal symbol)
- each state has a special function telling whether we output a rule number (possibly even several rules -- ambiguity) -- this is what separates a shift-reduce parser from an FSA
- lex/yacc / flex/bison -- shift-reduce parser generators
Table construction -- dot mechanism
- dots = remember where we are in all the rules which could possibly go through this state; used only for table construction
- take the starting rule and add it to the 1st state (put all rules with the starting symbol on left-hand side into the 1st state, mark the dot at the beginning of the right-hand side of all of them)
- state expansion: for all nonterminals right after the dot in any rule in this state, add the rewriting rules (in which the given nonterminal is on the left-hand side) into this state (and do this recursively until there are no more nonterminals that have not been expanded)
- construction of following states: for each terminal / nonterminal after the dot, create a new state (if there are several rules for the same symbol, create only one state over the transition for this symbol)
- into the new state: add all the rules with the transition symbol just after the dot and move the dot after it + perform state expansion
- reduction states: if there is a rule with the dot at the end in the state, the state is a so-called reduction state -- in it, the completed rule (the one whose right-hand side has just been recognized) is output (such a rule has no further expansions --> this leads towards finishing the parse); see the construction sketch after this list
- merge identical states: if the created state has the same rules (with dots at the same positions) as another state, merge the two (otherwise this would never finish for a recursive grammar; merge only after the whole state has been created!)
- problems:
- shift-reduce conflict -- a state is ambiguous (there is a rule with the dot at the end + some rules with dots in the middle ~ state may be reductional, but doesn't have to) -- this leads to backtracking, ambiguous parses
- reduce-reduce conflict -- another kind of ambiguity (more different rules with dot at the end)
- the ambiguity does not occur for a special kind of grammars -- $ LR(n) $: for bottom-up parsing, we only need to look $ n $ symbols ahead to prevent backtracking; an $ LR(0) $ grammar never gets a conflict in the table
- there's no simple algorithm for obtaining the $ n $ for which a given grammar is $ LR(n) $, but we may try for all $ n $'s
- the algorithm complexity copies the complexity of the grammar -- it's only expensive at the points of ambiguity
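- A compact sketch of the dot mechanism (LR(0) item sets) for a toy grammar; the grammar, the augmented starting rule and all identifiers are made up for this illustration:

```python
# Items are (rule_index, dot_position); states are closed sets of items.
GRAMMAR = [            # hypothetical toy grammar, augmented with S' -> S
    ("S'", ("S",)),
    ("S",  ("(", "S", ")")),
    ("S",  ("a",)),
]
NONTERMINALS = {"S'", "S"}

def closure(items):
    """State expansion: for every nonterminal right after a dot, add all
    rules rewriting it, with the dot at the beginning, until nothing changes."""
    items, changed = set(items), True
    while changed:
        changed = False
        for (ri, dot) in list(items):
            rhs = GRAMMAR[ri][1]
            if dot < len(rhs) and rhs[dot] in NONTERMINALS:
                for rj, (lhs, _) in enumerate(GRAMMAR):
                    if lhs == rhs[dot] and (rj, 0) not in items:
                        items.add((rj, 0))
                        changed = True
    return frozenset(items)

def goto(state, symbol):
    """Move the dot over `symbol` in every item that allows it, then close."""
    moved = {(ri, dot + 1) for (ri, dot) in state
             if dot < len(GRAMMAR[ri][1]) and GRAMMAR[ri][1][dot] == symbol}
    return closure(moved)

def build_states():
    """Canonical collection: start from the augmented rule, create successor
    states over each symbol after a dot, merge identical states."""
    start = closure({(0, 0)})
    states, transitions, todo = [start], {}, [start]
    while todo:
        state = todo.pop()
        symbols = {GRAMMAR[ri][1][dot] for (ri, dot) in state
                   if dot < len(GRAMMAR[ri][1])}
        for sym in symbols:
            nxt = goto(state, sym)
            if nxt not in states:          # merge identical states
                states.append(nxt)
                todo.append(nxt)
            transitions[(states.index(state), sym)] = states.index(nxt)
    return states, transitions

states, transitions = build_states()
for i, st in enumerate(states):
    done = [ri for (ri, dot) in st if dot == len(GRAMMAR[ri][1])]
    if done:
        print(f"state {i} is a reduction state for rules {done}")
```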
Parsing
- using parsing stack for states and backtrack stack for whole parser configurations at the points of ambiguity
- backtracking may be implemented in such a way that only the position in the parsing stack and the input need to be stored on the backtrack stack
- we have an empty backtrack stack, the 1st state on the parsing stack and the original string at the input
- from a shift state, follow the transition using the input symbol:
- if there is no such transition and there's nothing on the backtrack stack, FAIL; otherwise take a saved configuration from the backtrack stack (keep it saved if there still are more possibilities to follow!)
- if we find the transition, eat one symbol from the input, follow it to the new state and put the state on the parsing stack
- if we are in a reduce state:
- remove as many elements from the parsing stack as there are on the right-hand side of the rule we're reducing over
- put the nonterminal from the left-hand side of the rule on the input
- for conflicts: follow the first path + save the current configuration to the backtrack stack
- PASS condition: empty parsing stack and end of input (possibly continue looking for some other parses if there's something on the backtrack stack)
Treebanks + Evaluation
Treebanks
- Phrase structure (parse) trees ~ bracketing
- Dependency trees
- Data for treebank
- task dependent -- newspapers, journals, novels, technical texts, dialogues
- size: the more, the better
- train + development test + evaluation test (never look at it, it's not fair!!!)
- multilevel annotation
- Representation
- Parse / Dependency -- task dependent, 1:n mapping from dependency to parse-tree (in general)
- attributes: word, morphological, syntactic ... at tree nodes / arcs?, different for leaves?
- reference + bookkeeping attributes
- SGML/XML
Evaluation
- test against evaluation test data -- comparing the output of my parser to manually corrected data, done by someone else and in advance, independent of my algorithms
- rules:
- should be automatic (if possible) -- avoid subjective evaluation (but in e.g. SMT this is inevitable)
- never tune the system using test data (use a small part of training data for this)
- use standard metrics (if possible)
- dependency parser metrics
- dependency recall: $ R_D = \mathrm{Correct}(D) / |S| $, where Correct(D) is the number of correct dependencies (correct head / marked root), |S| is the size of the test data in words
- dependency precision: if output is not a tree -- $ P_D = \mathrm{Correct}(D) / \mathrm{Generated}(D) $, where Generated(D) is the number of output dependencies
- parse tree metrics
- number of nonterminals may not be the same as in the "truth" --> more complicated
- crossing brackets measure: number of crossing brackets between the truth and the result
- labeled precision/recall -- usual computation using bracket labels (phrase markers)
- the bigger label coverage, the better -- recall
- the less brackets, the better -- precision
Probabilistic CFG
- relations among mother & daughter nodes in terms of derivation probability
- probability of a parse tree: $ p(T) = \prod_{i=1}^n p(r(i)) $, where $ p(r(i)) $ is the probability of the $ i $-th rule used to generate the sentence of which the tree $ T $ is a parse
- probability of a string is not as trivial, as there may be many trees resulting from parsing the string: $ p(W) = \sum_{j=1}^n p(T_j) $, where $ T_j $'s are all possible parses of the string W.
- assumptions (very strong!):
- independence of context (neighboring subtrees)
- independence of ancestors (upper levels)
- place independence (regardless where the rule appears in the tree) ~ similar to time invariance in HMM
- probability of a rule -- distribution $ r(i): A\to \alpha $ ~ $ 0\leq p(r)\leq 1;\ \ \sum_{r\in\{A\to\dots\}} p(r) = 1 $
- may be estimated by MLE from a treebank following a CFG grammar: counting rules & counting non-terminals
- inside probability: $ \beta_N(p,q) = p(N\to^{\star}\ w_{pq}) $ (probability that the nonterminal $ N $ generates the part of the sentence $ w_p \dots w_q $)
- for CFG in Normal Form may be computed recursively: $ \beta_N(p,q) = \sum_{A,B}\sum_{d=p}^{q-1} p(N\to A\ B)\beta_A(p,d)\beta_B(d+1,q) $
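- A small worked example (toy grammar and numbers, not from the lecture): take the PCFG $ S\to S\ S $ with $ p=0.3 $ and $ S\to a $ with $ p=0.7 $; the string $ W = a\ a\ a $ has two parses, $ T_1 = [[a\ a]\ a] $ and $ T_2 = [a\ [a\ a]] $, each using $ S\to S\ S $ twice and $ S\to a $ three times, so $ p(T_1) = p(T_2) = 0.3^2\cdot 0.7^3 \approx 0.031 $ and $ p(W) = p(T_1) + p(T_2) \approx 0.062 $.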
Computing string probability -- Chart Parsing
- create a table $ n \times n $, where $ n $ is the length of the string
- initialize on the diagonal, using $ N\to\alpha $ rules (tuples: nonterminal + probability), compute along the diagonal towards the upper right corner
- fill the cell with a nonterminal + probability that the given part of the string, i.e. row = from, col = to, is generated by the particular rule
- consider more probabilities that lead to the same results & sum them (here: for obtaining the probability of a string, not parsing)
- for parsing: need to store what was the best (most probable) tree -- everything is computable from the table, but it's slow: usually, you want a list of cells / rules that produced this one
- if the CFG is in Chomsky Normal Form, the reverse computation is much more effective
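- A CKY-style sketch of the chart computation above for a PCFG in Chomsky Normal Form; the toy lexical and binary rules are made-up placeholders (replacing the sum by a max would give the most probable parse instead of the string probability):

```python
from collections import defaultdict

# Inside/chart computation: chart[i][j][N] = beta_N(i, j), the probability
# that nonterminal N generates words i..j.
LEX = {("N", "dogs"): 0.4, ("V", "bark"): 0.3, ("N", "bark"): 0.1}
BIN = {("S", "N", "V"): 1.0}           # S -> N V with probability 1.0
START = "S"

def string_probability(words):
    n = len(words)
    chart = [[defaultdict(float) for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):                      # diagonal: lexical rules
        for (nt, word), p in LEX.items():
            if word == w:
                chart[i][i][nt] += p
    for span in range(2, n + 1):                       # towards the upper-right corner
        for i in range(n - span + 1):
            j = i + span - 1
            for d in range(i, j):                      # split point
                for (a, b, c), p in BIN.items():
                    if chart[i][d][b] and chart[d + 1][j][c]:
                        # sum over all derivations leading to the same cell
                        chart[i][j][a] += p * chart[i][d][b] * chart[d + 1][j][c]
    return chart[0][n - 1][START]

print(string_probability(["dogs", "bark"]))   # 0.4 * 0.3 * 1.0 = 0.12
```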
Statistical Parsing
- parsing model: $ p(s|W) = \frac{p(W,s)}{p(W)} = \frac{p(s)}{p(W)} $ since $ p(W,s) = p(s) $ (where $ s $ is a parse and $ W $ the corresponding string -- a parse defines a sentence!)
- therefore: $ \mathrm{argmax}_s p(s|W) = \mathrm{argmax}_s p(s) $ -- we just select the most probable parse
- similar to language model; we don't consider trees, but all the possible parses
Parser Creation
- extract (all used) rules from a treebank
- convert the grammar to the normal form
- apply this back to the treebank (keep track of which rules were affected by the conversion)
- get the counts of the rules --> probability
- smooth
Smoothing
- the extracted rules cover an infinite number of sentences, but certainly not the whole language
- add poor (missing) rules -- get all possible combinations on the right side
- but ensure their probability is really very low!
- there are many ways of smoothing
- e.g. tie several probabilities to one: $ B\to N\ V $ --> split to first and second nonterminal: $ p(B\to N\ \star) $, then: $ p(B\to N\ V) = p(B\to N\ \star)\cdot p(V|N) $ (only V following N on the right-hand side of any rule, regardless of the left-hand side)
- or, use some linear combination of similar tied probabilities
- may be combined with the original ones before smoothing
- the smoothing is often tuned on data, the best way is selected according to the performance
- if we don't use Chomsky's Normal Form, we may have much more sophisticated ways of smoothing, perhaps even reflecting the linguistic properties of the language, but it will slow down the process
Lexicalization
- obtain more distinct nonterminals: use lexicalized parse tree (~ dependency tree + phrase labels; no lexicalization needed for dep. trees)
- process of filling in words that were originally only on the terminal nodes -- so that the word is taken from the head of the phrase
- pre-terminals (right above the leaves): assign the word below
- recursive step (up one level -- bottom-up): select one node and copy it up (the "more important" one, e.g. the preposition for PP, the noun for NP; there are no clean rules, it's a linguistic problem)
- increases the number of rules (up to 100k), but helps -- the smoothing must be very precise (e.g. using the non-lexicalized distribution)
- particular words are important for the parsing of the sentence --> CFG development paradigm
- it's possible to use POS-tags with the words, or POS-tags only
- conditional probabilities: there are too many rules, the data are sparse --> we need to simplify -- assumptions:
- total independence ($ p(\alpha B(head_B) \gamma \dots | A(head_A)) = p(\alpha|A(head_A))\cdot p(B(head_B)|A(head_A))\dots $) is too strong, too inaccurate
- best known heuristics -- decomposition: split the right side of the rule into head + left-of-head + right-of-head
- technical terminal STOP at both sides of the head (?)
- $ p_H(H(head_A)|A(head_A)) $, $ p_L(L_i(l_i)|A(head_A),H) $, $ p_R(R_i(r_i)|A(head_A),H) $
- more conditioning -- distance: absolute is non-zero ? path goes over verb ? over commas ?
- other: complement/adjunct, subcategorization (?)
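- A hedged sketch of the head-outward decomposition for one hypothetical lexicalized rule (words and symbols made up, following the $ p_H, p_L, p_R $ scheme above): $ p(VP(saw)\to V(saw)\ NP(dog)\ PP(with)) \approx p_H(V(saw)|VP(saw))\cdot p_L(\mathrm{STOP}|VP(saw),V)\cdot p_R(NP(dog)|VP(saw),V)\cdot p_R(PP(with)|VP(saw),V)\cdot p_R(\mathrm{STOP}|VP(saw),V) $.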
Remarks
- parsing is still not solved properly, the results are not sufficient
Dependency parsing
- until 2005, done via phrase-tree parsing, the trees were then converted
- McDonald's Parser
- result: a tree -- each word has its parent (or is root)
- initialize: make a total graph, where each edge is rated with a probability (using a perceptron) + find the maximum spanning tree
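- A toy sketch of the graph-based idea: score every possible (head, dependent) edge and keep the best tree. The scorer below is a placeholder (a real parser uses a trained perceptron) and the head selection is a greedy approximation; the full algorithm (Chu-Liu/Edmonds maximum spanning tree) additionally repairs any cycles:

```python
# Graph-based dependency parsing sketch: node 0 is an artificial ROOT,
# every other word picks its best-scoring head.
def score(head, dep, sent):
    # hypothetical feature-based scorer; here a toy preference for short edges
    return -abs(head - dep)

def parse(sent):
    words = ["<ROOT>"] + sent
    heads = [None]                     # ROOT has no head
    for d in range(1, len(words)):
        # greedy approximation of the maximum spanning tree: best head per word;
        # Chu-Liu/Edmonds would additionally break any cycles this may create
        best = max((h for h in range(len(words)) if h != d),
                   key=lambda h: score(h, d, words))
        heads.append(best)
    return heads

print(parse("the dog barked".split()))
```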