Statistical Models

  1. Generative Models and Discrimative Models

Generative models try to maximum the joint probability of the token sequence A and label sequence B, while discrimitive models try to maximum the conditional probability of label sequence B conditional on token sequence A.
NB and HMM are generative models; CRF is a discrimitive model.

  1. Posterior probability
    Posterior probability is the conditional probability itself: P(B/A), in contrast to the prior probability P(B), posterior indicates that this probability is the likelihood of event B happens given that the occuring of event A is already known.

  2. Assumptions
    HMM: a. n-order markov assumption b. conditional independent assumption (apply chain rule in conditional paradigm)
    Maximum Likelihood Estimation.

Naive Bayes: a. naive bayes assumption b. conditional independent assumption (apply chain rule in conditional paradigm)
Maximum a posterior estimation.

Generative model can be identified when the model needs to calculate the marginal distribution of the predicted label P(Y)

Miscellaneous Notes

  1. The objective of natural language processing:
    The processing power of computers is huge, we need to take advantage of this processing power. There is a gap between natural language and computer language (formal language). NLP is designed to fill this gap.

  2. A sentence can be syntatically correct while semantically meaningless.
    Semantic grammar should be replaced by syntactic grammar in semantic parsing.

  3. Dependency grammar (dependency relations) are used to generate subject-verb-predict relationship, then this s-v-p relation is used to generate a semantic representation of the sentence (first-order-logic). To interoperate with other semantic systems. this FOL representation may need to be aligned with other representations.

  4. IN pycharm setting, saving theme settings (on your own name) should be performed on the "font" section instead of the "Colors&Fonts" section to change the background color of editor.

  5. In java, a package name must be a fully qualified name start from the top of the package. An import can be performed at the package level or the class level. In contrast, a golang package import can only be performed at the package level, also, the package declaration are at the current package only, fully qualified name is not needed.

  6. If a method has side-effect, then it will modify the value of memory outside of the local space.

  7. C++ function without overload can accept the specified type value as argument, such as sleep() in time.h.

  8. In C/C++ context, the programming API is mostly provided by the OS instead of the language spec/std library itself. Notably, glibc, posix standards, etc. However, C++11 provides standard libraries that are quite convinent. Most c and c++ "standard" libraries are provided by GNU, with a few useful third-party libraries such as boost being popular as well.

NLP Terminologies

  1. Lexical Information
    Information relating to the word itself (big, small, large, etc) in addition to the word classes (part-of-speech tags).

  2. Endocentric and Exocentric
    A grammatical construction (e.g. a phrase or compound word) is said to be endocentric if it fulfills the same linguistic function as one of its parts, and exocentric if it does not.

  3. Isomorphic
    corresponding or similar in form and relations.

  4. Context-Free Grammar
    The grammar described in a phrase structure.

Word Embedding

Two ways of modeling sentences

  1. s = [x, y, z];
    x, y, z represents three slots in the sentence.
    Three dimensions

  2. s = [x0, x1, x2, ... xn]
    xn represents any words in the vocabulary, value represents its existence in the sentence.

Grammar Study Note, Part One

  1. Pos tagging indicates the property of the word itself.
  2. Dependency tagging indicates the function (or relative information) of the word.
  3. Both N-gram model and PCFG model assumes the a Markov property and then applies the conditional probability.

Dependency Grammar Notions

  1. Subject-Predicate relationship
    subject: The person or thing about whom the statement is made
    predicate: The purpose of the predicate is to complete an idea about the subject

Part of Speech Tagging

  1. Ambiguity:
    Local Preference -> The probability of a part-of-speech tag for a specific word in the vocabulary
    Contextual Preference -> The probability of a part-of-speech tag for a specific word in given context.

Context-Free Grammar Study Note, Part Two

Challenges of Parsing CFG Correctly

Syntactic Ambiguity

  1. Part-of-speech Ambiguity
    : The grammar can assign multiple part-of-speech tags (word categories) to a lexicon
  2. Structural Ambiguity
    : The grammar can assign more than one possible parse to a sentence

    • Attachment Ambiguity
      : A particular constituent can be attached to the parse tree at more than one place
    • Coordination Ambiguity
      : The reference of a lexicon has multiple choices
    • Local Ambiguity
      : Some part of the sentence has more that one parse while the whole sentence is not ambiguous
    • etc
      : ...

Phrasal Nodes

  1. a phrasal node is a non-terminal node

CFG Tree v.s. Dependency Grammar Tree

Strucutrally cfg tree has little difference with dp tree.
1. In cfg tree, the internal nodes are phrasal nodes (non-lexicalized). In dp tree, the internal nodes are by definition lexicalized.
2. Both cfg and dp tree has pos tags as the next-to-leaf node. For dp tree this pos tag can be replaced by synthetic parts-of-speech (subject, object, predicate, etc).

图像文法(Image Grammar)


文法在文本领域的应用(比如Context-Free Grammar)可以看成是对自然语言(字符串)在结构上的分解和抽象(分解+抽象=解析),I want to eat a burger => NP VP. 这种分解和抽象后形成的数据结构包含里文本字符串的结构信息(和Metadata相似),文法解析的过程是将自然语言字符串这种非结构化数据结构化的过程,这些结构信息以Tree这种数据结构在计算机里被表示,处理和存储。(非结构化数据=无法抽象出数据模型或者没有整理成预定义的数据模型的数据)



Context-Free Grammar Study Note, Part One


The grammar formalism applies to many other natural languages despite the fact that all discussions are focused on modern English.

  1. Dependency Grammar (Subject, Predicate, Object)
  2. Constituency Grammar (NP, VP)
  3. Others - sub-categorization, etc (Verb followed by infinitive)

Among them, constituency grammar is universally modeled by context-free grammar (CFG), which is also called phrase-structure grammars. The formalism of CFG is equivalent to Backus-Naur Form (BNF).

Relationships that a Parser Models

  1. Constituency Relationships (Constituent Relation)
  2. Grammatically Relationships (Lexical Relation)

Parsing With CFG (Context-Free Grammars)

  1. Top-down (goal-directed search, rationalist tradition)
  2. Bottom-up (data directed search, empiricist tradition)

Data Structure of Parse Tree (CFG tree)

Implemented as a multi-way tree via OOP mechanism. Each internal node represents a non-terminal of CFG;each leaf node represents a terminal of CFG;The value of each node is the symbol of the non-terminal/terminal it represented. The value of each node is a symbol (NP, VP, prefer, etc), the pointers of each node points to its descendants until the reach of leaf nodes (non-terminal terminal pair). The leaf node is a 2-tuple that represents the production rule of a non-terminal to a terminal.

Serialization of a Parse Tree

parse tree = (ROOT (S (NP (PRP I)) (VP (VBP love) (NP (PRP you)))))

A JSON string that contains only list structure with all comma replaced by white space and double quotation replaced by empty string, and optionally squared bracket replaced by parentheses.

In any internal node (within a bracket), only one symbol ("a non-terminal symbol, without any bracket enclosed") can be presented and it is the value of this node. The child node of any internal node is defined to be all the bracketed blocks that lays in the same level with this node. All bracketed blocks is itself a parse tree (may not origin from root)

Parse Tree and Part-of-Speech Tagging

The non-terminal in the leaf node is the part-of-speech tagging of the terminal.

Why Not Context-sensitive Grammar In NLP Application?

Context-sensitive grammar is non-deterministic, which means the production process may not converge.

CFG Data Structure

  1. Search Space Forest
    Search Space -> Search Space Forest (Search Tree in Manning's book)

Constructed by all possible partial parse trees and complete parse trees, among them, partial parse trees act as the internal node and the complete parse trees act as the leaf node (of the tree data structure -> search space forest).

Note: Search Space Forest is actually a and-or tree

  1. Parse Tree
    Parse Tree -> Syntax Parse Tree (Complete Syntax Tree)
    Partial Parse Tree (Incomplete Sytax Tree)

The function of Part-of-Speech Tagging

PoS has both syntactic and semantic meanings. Syntactically, a noun is usually followed by a verb; semantically, a noun typically refer to people, animals, concepts and things.