Microsoft Malmo Note

Agent Command

  1. The host.sendCommand() method signals the beginning of a continuous action. In most cases, stopping a continuous action requires an explicit terminating command (such as move 0).

Absolute vs. Continuous Commands



NLP Terminologies

  1. Lexical Information
    Information relating to the word itself (big, small, large, etc.), in addition to its word class (part-of-speech tag).

  2. Endocentric and Exocentric
    A grammatical construction (e.g. a phrase or compound word) is said to be endocentric if it fulfills the same linguistic function as one of its parts, and exocentric if it does not.

  3. Isomorphic
    Corresponding or similar in form and relations.

  4. Context-Free Grammar
    A phrase-structure grammar in which every production rule rewrites a single nonterminal symbol, regardless of the context surrounding it.

Math Notation Conventions

Superscript and Subscript

Superscript refers to the index of training examples.
Subscript refers to the index of vector elements.

Sequence Notation

A sequence may be named or referred to by an upper-case letter such as "A" or "S". The terms of a sequence are usually named something like "a_i" or "a_n", with the subscripted letter "i" or "n" being the "index" or the counter. So the second term of a sequence might be named "a_2" (pronounced "ay-sub-two"), and "a_12" would designate the twelfth term.

The sequence can also be written in terms of its terms. For instance, the sequence of terms a_i, with the index running from i = 1 to i = n, can be written as: a_1, a_2, ..., a_n.

Hat Operator

In statistics, the hat is used to denote an estimator or an estimated value, as opposed to its theoretical counterpart. For example, in the context of errors and residuals, the "hat" over the letter ε indicates an observable estimate (the residuals) of an unobservable quantity called ε (the statistical errors).

Prime Notation

Derivative Operator

Linux Tool Note

  1. TCP/UDP Connection Test
    nc -zv {host} 25331

  2. Show all processes
    ps aux | less

  3. Extract a gzip-compressed tar archive
    tar -zxvf {file.tar.gz}

  4. List all hardware information

  5. Grep Multiple Patterns
    grep -E '123|abc' filename

  6. Show CPU Information
    cat /proc/cpuinfo

  7. Check Whether a Process Exists
    ps -ef | grep deplearning

  8. Server Benchmark
    ab -c 1000 -n 50000 http://localhost:8080/

  9. View the system log
    tail -f /var/log/syslog

  10. Show directory size
    du -hs /path/to/directory

Word Embedding

Two ways of modeling sentences

  1. s = [x, y, z]
    x, y, and z represent three slots (positions) in the sentence.
    The vector has three dimensions, one per slot.

  2. s = [x0, x1, x2, ... xn]
    Each xi corresponds to one word in the vocabulary; its value indicates whether that word occurs in the sentence.
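
The two representations can be compared on a toy example. This is only a sketch: the five-word vocabulary, the function names, and the choice to store a vocabulary index in each slot are assumptions made for illustration, not something the note specifies.

```python
# Hypothetical toy vocabulary; the order of this list fixes the vector dimensions.
vocabulary = ["the", "cat", "dog", "chases", "sees"]
word_index = {word: i for i, word in enumerate(vocabulary)}


def slot_vector(words):
    """Way 1: one dimension per slot; each value is the vocabulary index of the word in that slot."""
    return [word_index[word] for word in words]


def presence_vector(words):
    """Way 2: one dimension per vocabulary word; 1 if the word occurs in the sentence, else 0."""
    present = set(words)
    return [1 if word in present else 0 for word in vocabulary]


sentence = ["dog", "chases", "cat"]
slots = slot_vector(sentence)         # [2, 3, 1]
presence = presence_vector(sentence)  # [0, 1, 1, 1, 0]
```

The first form keeps word order but has a fixed number of slots; the second has one dimension per vocabulary word and discards order, so "cat chases dog" would give the same presence vector.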

Probability and Statistics Reviews, Part Two

"Or" Rule

The probability that event A or event B happens is the sum of P(A) and P(B), minus the probability that A and B happen at the same time.

P(A or B) = P(A) + P(B) - P(A and B)

"Multiplication Rule"

The probability that event A and event B happen at the same time is the product of the conditional probability of A given B and the probability of B.
P(A and B) = P(A | B) * P(B)

The Law of Total Probability

The law of total probability is the proposition that if {Bn: n = 1, 2, 3, ...} is a finite or countably infinite partition of a sample space and each event Bn is measurable, then for any event A of the same probability space:
P(A) = SUM( P(A and Bn) ), n = 1, 2, 3, ...

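The law of total probability follows from the "Or" rule (applied to the mutually exclusive events "A and Bn") combined with the multiplication rule: P(A) = SUM( P(A | Bn) * P(Bn) ).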

Independence Test

Two events A and B are independent if and only if P(A and B) = P(A) * P(B).
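
The "Or" rule, the multiplication rule, the law of total probability, and the independence test can all be checked on a single roll of a fair die. The events and the partition below are arbitrary choices for illustration; exact fractions are used so that the equalities hold without rounding.

```python
from fractions import Fraction

sample_space = set(range(1, 7))  # outcomes of one roll of a fair die


def prob(event):
    """Probability of an event (a set of outcomes) under equally likely outcomes."""
    return Fraction(len(event & sample_space), len(sample_space))


A = {2, 4, 6}     # the roll is even
B = {1, 2, 3, 4}  # the roll is at most 4

# "Or" rule: P(A or B) = P(A) + P(B) - P(A and B)
or_rule_holds = prob(A | B) == prob(A) + prob(B) - prob(A & B)

# Multiplication rule: P(A and B) = P(A | B) * P(B)
p_a_given_b = Fraction(len(A & B), len(B))
multiplication_rule_holds = prob(A & B) == p_a_given_b * prob(B)

# Law of total probability over the partition {1,2}, {3,4}, {5,6}
partition = [{1, 2}, {3, 4}, {5, 6}]
total_probability = sum(prob(A & part) for part in partition)

# Independence test: P(A and B) == P(A) * P(B)
independent = prob(A & B) == prob(A) * prob(B)
```

With these particular events, A and B happen to be independent: P(A and B) = 1/3 and P(A) * P(B) = 1/2 * 2/3 = 1/3.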

EM Algorithm

In the E-step, the missing (hidden) data are estimated using conditional expectation. In the M-step, the non-hidden parameters are estimated through maximum likelihood estimation (MLE).

  1. E-step
    P(W0 | xi) = a, P(W1 | xi) = b
    E(W) = aW0 + bW1
    if |E(W) - W0| < |E(W) - W1|, then W = W0; otherwise, W = W1.
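
The E-step above, followed by an M-step, can be sketched as a "hard" EM loop. Everything concrete here is an assumption made for illustration: the hidden variable W takes the values W0 = 0 and W1 = 1, each value selects a one-dimensional Gaussian with unit variance, and the function names, data, and starting means are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std=1.0):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def hard_em(data, mean0, mean1, prior1=0.5, iterations=10):
    """Alternate a hard E-step and an MLE M-step for two unit-variance Gaussians."""
    labels = []
    for _ in range(iterations):
        # E-step: a = P(W0 | x), b = P(W1 | x); with W0 = 0 and W1 = 1, E(W) = a*0 + b*1 = b.
        labels = []
        for x in data:
            p0 = (1 - prior1) * gaussian_pdf(x, mean0)
            p1 = prior1 * gaussian_pdf(x, mean1)
            b = p1 / (p0 + p1) if p0 + p1 > 0 else 0.5  # guard against underflow
            expected_w = b
            # Assign W to whichever of W0 = 0, W1 = 1 is closer to E(W).
            labels.append(0 if abs(expected_w - 0) < abs(expected_w - 1) else 1)

        # M-step: maximum likelihood estimates given the filled-in labels.
        group0 = [x for x, w in zip(data, labels) if w == 0]
        group1 = [x for x, w in zip(data, labels) if w == 1]
        if group0:
            mean0 = sum(group0) / len(group0)
        if group1:
            mean1 = sum(group1) / len(group1)
        prior1 = len(group1) / len(data)
    return mean0, mean1, prior1, labels


data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
mean0, mean1, prior1, labels = hard_em(data, mean0=0.0, mean1=6.0)
```

A standard ("soft") EM would keep the posteriors a and b as weights instead of rounding them to a single value of W; the hard assignment follows the rule written in the note.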

Conditional Expectation

E(Y | X = x) = SUM over y of ( y * P(Y = y | X = x) )
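
As a worked instance of this formula, the sketch below computes E(Y | X = x) from a small joint distribution. The joint table is made up for illustration, and the function name is hypothetical.

```python
from fractions import Fraction

# Hypothetical joint distribution P(X = x, Y = y); the four entries sum to 1.
joint = {
    (0, 1): Fraction(1, 8),
    (0, 2): Fraction(3, 8),
    (1, 1): Fraction(1, 4),
    (1, 2): Fraction(1, 4),
}


def conditional_expectation(joint, x):
    """E(Y | X = x) = sum over y of y * P(Y = y | X = x)."""
    p_x = sum(p for (xi, _), p in joint.items() if xi == x)  # marginal P(X = x)
    return sum(y * p / p_x for (xi, y), p in joint.items() if xi == x)


e_given_0 = conditional_expectation(joint, 0)  # (1 * 1/8 + 2 * 3/8) / (1/2) = 7/4
e_given_1 = conditional_expectation(joint, 1)  # (1 * 1/4 + 2 * 1/4) / (1/2) = 3/2
```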

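Covariance vs. Correlation

The relationship is similar to that of variance vs. standard deviation: correlation is covariance rescaled by the standard deviations of the two variables, which makes it unitless and bounded between -1 and 1.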
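
A small numeric sketch of the difference, using made-up height and weight values and population (divide by n) formulas: changing the unit of one variable changes the covariance but leaves the correlation untouched.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Population covariance of two equally long lists."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def std(xs):
    """Population standard deviation: the square root of the variance."""
    return math.sqrt(covariance(xs, xs))


def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(xs, ys) / (std(xs) * std(ys))


# Hypothetical measurements, for illustration only.
heights_cm = [150.0, 160.0, 170.0, 180.0]
weights_kg = [50.0, 58.0, 63.0, 75.0]
heights_m = [h / 100 for h in heights_cm]

cov_cm = covariance(heights_cm, weights_kg)   # depends on the unit of height
cov_m = covariance(heights_m, weights_kg)     # 100 times smaller
corr_cm = correlation(heights_cm, weights_kg)
corr_m = correlation(heights_m, weights_kg)   # same as corr_cm up to rounding
```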

How to Estimate the Parameters of a Statistical Model

A statistical model can take the form of an explicit algebraic expression with parameters.
Alternatively, a model can contain no algebraic expressions but only conditional, joint, or other probability measurements (called free parameters). These probability measurements can be thought of as a sampling of the subject population.

The true value of the probability of event A, PA(X=a), can be estimated by repeating the random experiment (the random process): PA(X=a) ~= (number of trials in which A occurs) / (total number of trials). As the number of trials grows, this ratio converges to the true probability of event A.
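
The frequency estimate can be tried out with a simulated experiment. In this sketch the experiment (rolling a fair die), the event (rolling a six), the fixed seed, and the function names are all illustrative assumptions.

```python
import random


def estimate_probability(event, experiment, trials, seed=0):
    """Estimate P(event) as (trials where the event occurs) / (total trials)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = sum(1 for _ in range(trials) if event(experiment(rng)))
    return hits / trials


def roll_die(rng):
    """One run of the random experiment: roll a fair six-sided die."""
    return rng.randint(1, 6)


def is_six(outcome):
    return outcome == 6


# The estimate should move toward the true value 1/6 as the number of trials grows.
estimates = {n: estimate_probability(is_six, roll_die, n) for n in (100, 10_000, 100_000)}
```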

Distribution and Set

A binomial random experiment consists of several independent Bernoulli random experiments (trials).
A randomly chosen subset of a sample drawn from a distribution follows the same distribution.

Random Variable, Probability and Distribution

A random variable, and the probability that it takes a certain value (an event), refer to a specific random experiment.

The distribution, in contrast, describes the overall trend of the subject (the population, the sample set).

P(X) and P(X=x0)

P(X) is the PDF or PMF of a distribution. P(X=x0) is the probability that the random variable X takes the value x0.

Hidden Markov Model

The assumption made by HMM

The assumption is that every random variable in the (conditional) probability chain is conditioned only on the previous n variables in the sequence. In other words, given the previous n variables, each variable is conditionally independent of all earlier variables.

The tagging problem can be abstracted as modeling the joint probability of two sequences: the sentence (word) sequence and the tag sequence. In the HMM approach, the tag sequence is modeled (approximated) as a Markov sequence, and each word in the sentence is modeled as an independent event conditioned only on the tag at the corresponding position.

Generative or Discriminative

An HMM is by definition a generative model because it models the sequences with a joint probability rather than a conditional probability.

Interpretation from ML Perspective

Training an HMM yields a probabilistic model that cannot output the target labeling directly. Instead, a labeling function has to be defined in addition to the HMM probabilistic model. The training set consists of samples, each a pair of a word sequence and its POS tag sequence.

The training process is essentially a counting process in which the statistical properties of the label sequences (POS tags) are estimated, along with the conditional probability of each word/tag pair. These estimates are then used as the parameters of the HMM (transition probabilities and emission probabilities).

In prediction, the labeling function (output function) uses the parameters of the HMM to predict the labeling of new word sequences.
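
The counting-based training and a separate labeling function can be sketched as follows. This is a minimal illustration under stated assumptions: a first-order (bigram) tag model, no smoothing, a three-sentence toy corpus, and Viterbi decoding as one possible choice of labeling function; the names train_hmm and viterbi and the boundary markers are hypothetical.

```python
from collections import Counter, defaultdict

START, STOP = "<s>", "</s>"  # assumed sentence-boundary markers


def train_hmm(tagged_sentences):
    """Estimate transition and emission probabilities by counting."""
    transition_counts = defaultdict(Counter)  # previous tag -> next tag -> count
    emission_counts = defaultdict(Counter)    # tag -> word -> count
    for sentence in tagged_sentences:
        previous = START
        for word, tag in sentence:
            transition_counts[previous][tag] += 1
            emission_counts[tag][word] += 1
            previous = tag
        transition_counts[previous][STOP] += 1

    def normalize(counts):
        # Turn each row of counts into relative frequencies.
        return {ctx: {k: v / sum(row.values()) for k, v in row.items()}
                for ctx, row in counts.items()}

    return normalize(transition_counts), normalize(emission_counts)


def viterbi(words, transitions, emissions):
    """Labeling function: most probable tag sequence under the trained HMM."""
    tags = list(emissions)
    # best[i][tag] = (probability of the best path ending in tag at position i, previous tag)
    best = [{tag: (transitions.get(START, {}).get(tag, 0.0)
                   * emissions[tag].get(words[0], 0.0), None) for tag in tags}]
    for i in range(1, len(words)):
        column = {}
        for tag in tags:
            emit = emissions[tag].get(words[i], 0.0)
            column[tag] = max(
                (best[i - 1][prev][0] * transitions.get(prev, {}).get(tag, 0.0) * emit, prev)
                for prev in tags
            )
        best.append(column)
    # Choose the final tag (including the transition to STOP), then follow back-pointers.
    last = max(tags, key=lambda t: best[-1][t][0] * transitions.get(t, {}).get(STOP, 0.0))
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))


training = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VB")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VB")],
    [("dogs", "NN"), ("bark", "VB")],
]
transitions, emissions = train_hmm(training)
predicted = viterbi(["the", "cat", "barks"], transitions, emissions)  # ['DT', 'NN', 'VB']
```

Without smoothing, any word or tag transition that never appeared in training gets probability zero, so a practical tagger would need some way of handling unseen events.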

Origin of Name: Probability Distribution

"Distribution" indicates that the total probability of 1 is divided and distributed among the possible values of the random variable.

Distributed Cluster Study Note

  1. Application dependency vs. software dependency
    An application dependency means that an application in one container depends on another application in another container.
    A software dependency means that one piece of software depends on other software in the same container.

Grammar Study Note, Part One

  1. POS tagging indicates a property of the word itself.
  2. Dependency tagging indicates the function of the word (its relation to other words).
  3. Both the N-gram model and the PCFG model assume a Markov-style independence property and then apply conditional probabilities.

Dependency Grammar Notions

  1. Subject-Predicate relationship
    subject: The person or thing about which the statement is made.
    predicate: The part of the sentence that completes an idea about the subject.

Part of Speech Tagging

  1. Ambiguity:
    Local Preference -> The probability of a part-of-speech tag for a specific word in the vocabulary
    Contextual Preference -> The probability of a part-of-speech tag for a specific word in a given context.
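
The two preferences can be estimated from a tagged corpus by counting. In this sketch the tiny corpus is invented, and "context" is assumed to mean only the tag of the preceding word; richer definitions of context are possible.

```python
from collections import Counter, defaultdict

# Hypothetical tagged corpus in which "book" is ambiguous between verb and noun.
tagged_corpus = [
    [("I", "PRP"), ("book", "VB"), ("a", "DT"), ("flight", "NN")],
    [("the", "DT"), ("book", "NN"), ("is", "VBZ"), ("long", "JJ")],
    [("a", "DT"), ("book", "NN")],
]

local_counts = defaultdict(Counter)       # word -> tag -> count
contextual_counts = defaultdict(Counter)  # (previous tag, word) -> tag -> count

for sentence in tagged_corpus:
    previous_tag = "<s>"  # assumed sentence-start marker
    for word, tag in sentence:
        local_counts[word][tag] += 1
        contextual_counts[(previous_tag, word)][tag] += 1
        previous_tag = tag


def probabilities(counter):
    """Convert a counter of tags into relative frequencies."""
    total = sum(counter.values())
    return {tag: count / total for tag, count in counter.items()}


# Local preference: P(tag | word = "book"), ignoring context.
local_book = probabilities(local_counts["book"])
# Contextual preference: P(tag | word = "book", previous tag = PRP).
contextual_book = probabilities(contextual_counts[("PRP", "book")])
```

Here the local preference for "book" favors NN (2 of 3 occurrences), while the contextual preference after a pronoun selects VB.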