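Information Theory and Entropy

Entropy of a Random Variable

Entropy is usually denoted by the symbol H; for example, H(P) denotes the entropy of a random variable P.

From the perspective of probability theory, entropy describes the uncertainty in the value a random variable takes. The entropy of a random variable increases with the uniformity of its probability distribution: under a uniform distribution every value is about equally likely, so the random variable is highly uncertain and the distribution carries little useful information.

For a random variable X, entropy is defined as H(X) = -SUM(Pi*log2(Pi)), where Pi is the probability of the i-th possible value of X.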
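A minimal sketch of the definition above, assuming the distribution is given as a plain list of probabilities (the function name entropy and the two example distributions are made up for illustration). It compares a uniform distribution with a heavily skewed one.

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -SUM(Pi * log2(Pi))."""
    # Terms with Pi == 0 are skipped; they contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]  # every value equally likely
skewed = [0.97, 0.01, 0.01, 0.01]   # one value dominates

print(entropy(uniform))  # 2.0
print(entropy(skewed))   # much smaller than 2.0
```

The uniform distribution gives the larger entropy, which matches the claim that uniformity means more uncertainty.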

 

Cross-Entropy Loss Function

Cross entropy describes the difference between the probability distributions of two random variables.
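The notes do not give the formula. The sketch below assumes the standard definition H(p, q) = -SUM(p_i * log2(q_i)), with p the true distribution and q the predicted one. The function name and the example distributions are hypothetical.

```python
import math

def cross_entropy(p, q):
    """Cross entropy in bits between a true distribution p and a prediction q."""
    # Assumes q_i > 0 wherever p_i > 0; otherwise log2(0) raises ValueError.
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]          # true distribution
q_good = [0.5, 0.5]     # prediction identical to p
q_bad = [0.25, 0.75]    # prediction that differs from p

print(cross_entropy(p, q_good))  # equals the entropy of p
print(cross_entropy(p, q_bad))   # larger, because q_bad differs from p
```

Used as a loss, the value is smallest when the predicted distribution matches the true one, so minimizing it pulls q toward p.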

Math Notation Conventions

Superscript and Subscript

A superscript refers to the index of a training example.
A subscript refers to the index of a vector element.

Sequence Notation

A sequence may be named or referred to by an upper-case letter such as "A" or "S". The terms of a sequence are usually named something like "ai" or "an", with the subscripted letter "i" or "n" being the "index" or the counter. So the second term of a sequence might be named "a2" (pronounced "ay-sub-two"), and "a12" would designate the twelfth term.

The sequence can also be written in terms of its terms. For instance, the sequence of terms ai, with the index running from i = 1 to i = n, can be written as {a1, a2, ..., an}.
Source: http://www.purplemath.com/modules/series.htm

Hat Operator

In statistics, the hat is used to denote an estimator or an estimated value, as opposed to its theoretical counterpart. For example, in the context of errors and residuals, the "hat" over the letter ε indicates an observable estimate (the residuals) of an unobservable quantity called ε (the statistical errors).

Prime Notation

Derivative Operator

Probability and Statistics Reviews, Part Two

"Or" Rule

The probability that Event A or Event B happens is the sum of P(A) and P(B) minus the probability that Event A and Event B both happen.

P(A or B) = P(A) + P(B) - P(A and B)
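A small check of the rule by enumeration, assuming one roll of a fair six-sided die (the events A and B are chosen only for illustration):

```python
from fractions import Fraction

sample_space = set(range(1, 7))                  # one roll of a fair die
A = {x for x in sample_space if x % 2 == 0}      # the roll is even
B = {x for x in sample_space if x > 3}           # the roll is greater than 3

def prob(event):
    """Probability of an event (a subset of equally likely outcomes)."""
    return Fraction(len(event), len(sample_space))

lhs = prob(A | B)                       # P(A or B) by direct counting
rhs = prob(A) + prob(B) - prob(A & B)   # P(A) + P(B) - P(A and B)
print(lhs, rhs)  # 2/3 2/3
```

Without the subtraction, the outcomes 4 and 6, which belong to both events, would be counted twice.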

"Multiplication Rule"

The probability that Event A and Event B both happen is the product of the conditional probability of A given B and the probability of B.
P(A and B) = P(A | B) * P(B)

The Law of Total Probability

The law of total probability is the proposition that if {Bn: n = 1, 2, 3, ...} is a finite or countably infinite partition of a sample space and each event Bn is measurable, then for any event A of the same probability space:
P(A) = SUM( P(A and Bn) ), summed over n = 1, 2, 3, ...

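The law of total probability = "Or" Rule + "Multiplication Rule": the events (A and Bn) are mutually exclusive, so the "Or" Rule reduces to a plain sum, and the "Multiplication Rule" expands each term, giving P(A) = SUM( P(A | Bn) * P(Bn) ).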
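A worked sketch with made-up numbers: three events B1, B2, B3 partition the sample space, and the conditional probability of A is known within each part. All values are hypothetical.

```python
from fractions import Fraction

# Hypothetical partition of the sample space: P(Bn) must sum to 1.
p_b = {"B1": Fraction(1, 2), "B2": Fraction(3, 10), "B3": Fraction(1, 5)}
# Hypothetical conditional probabilities P(A | Bn).
p_a_given_b = {"B1": Fraction(1, 100), "B2": Fraction(2, 100), "B3": Fraction(5, 100)}

# Multiplication rule gives each P(A and Bn); the disjoint pieces are then summed.
p_a_and_b = {n: p_a_given_b[n] * p_b[n] for n in p_b}
p_a = sum(p_a_and_b.values())
print(p_a)  # 21/1000
```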

Independence Test

Two events A and B are independent if and only if P(A and B) = P(A) * P(B).
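The test can be run by enumeration. The sketch below assumes one roll of a fair die; the three events are chosen for illustration only.

```python
from fractions import Fraction

sample_space = set(range(1, 7))  # one roll of a fair die

def prob(event):
    return Fraction(len(event), len(sample_space))

def independent(a, b):
    """Independence test: P(A and B) == P(A) * P(B)."""
    return prob(a & b) == prob(a) * prob(b)

even = {2, 4, 6}
at_most_four = {1, 2, 3, 4}
greater_than_three = {4, 5, 6}

print(independent(even, at_most_four))        # True:  1/3 == 1/2 * 2/3
print(independent(even, greater_than_three))  # False: 1/3 != 1/2 * 1/2
```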

EM Algorithm

In the E-step, the missing data are estimated through conditional expectation. In the M-step, the non-hidden parameters are estimated through maximum likelihood estimation (MLE).

  1. E-step
    P(W0 | xi) = a, P(W1 | xi) = b
    E(W) = aW0 + bW1
    if |E(W) - W0| < |E(W) - W1|, then W = W0, else, W = W1.
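The E-step rule above can be sketched as a small function. This is a literal reading of the three lines, assuming W0 and W1 are two numeric candidate values for the hidden variable and a, b are their posterior probabilities given xi. The function name is hypothetical and this is not a full EM implementation.

```python
def e_step_assign(a, b, w0, w1):
    """Pick the candidate value closest to the conditional expectation E(W).

    a = P(W0 | xi), b = P(W1 | xi); w0 and w1 are the candidate values.
    """
    expected = a * w0 + b * w1  # E(W) = a*W0 + b*W1
    return w0 if abs(expected - w0) < abs(expected - w1) else w1

print(e_step_assign(0.8, 0.2, 0.0, 1.0))  # 0.0, since E(W) = 0.2 is nearer W0
print(e_step_assign(0.3, 0.7, 0.0, 1.0))  # 1.0, since E(W) = 0.7 is nearer W1
```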

Conditional Expectation

E(Y | X = x) = SUM( y * P(Y = y | X = x) ), summed over all values y of Y
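A sketch of the formula on a small joint distribution. The joint table is made up, and P(y | x) is obtained as P(x, y) / P(x).

```python
from fractions import Fraction

# Hypothetical joint distribution P(X = x, Y = y), keyed by (x, y).
joint = {
    (0, 1): Fraction(1, 8), (0, 2): Fraction(3, 8),
    (1, 1): Fraction(1, 4), (1, 2): Fraction(1, 4),
}

def conditional_expectation(joint, x):
    """E(Y | X = x) = SUM(y * P(y | x)) with P(y | x) = P(x, y) / P(x)."""
    p_x = sum(p for (xi, _), p in joint.items() if xi == x)  # marginal P(X = x)
    return sum(y * p / p_x for (xi, y), p in joint.items() if xi == x)

print(conditional_expectation(joint, 0))  # 7/4
print(conditional_expectation(joint, 1))  # 3/2
```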

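Covariance vs. Correlation

The relationship is similar to that of variance vs. standard deviation: correlation is covariance rescaled by the standard deviations, Corr(X, Y) = Cov(X, Y) / (SD(X) * SD(Y)), so it does not depend on the units of X and Y.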
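A sketch of the difference, assuming population (divide-by-n) covariance and made-up data. The same measurements are expressed in two different units: the covariance changes with the unit, the correlation does not.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Population covariance (divides by n, not n - 1)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

xs = [1.0, 2.0, 3.0, 4.0]
ys_cm = [10.0, 20.0, 30.0, 40.0]   # hypothetical lengths in centimeters
ys_m = [y / 100 for y in ys_cm]    # the same lengths in meters

print(covariance(xs, ys_cm), covariance(xs, ys_m))    # differ by a factor of 100
print(correlation(xs, ys_cm), correlation(xs, ys_m))  # both close to 1
```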

How to Estimate the Parameters of a Statistical Model

A statistical model can take the form of an explicit algebraic expression with parameters.
Alternatively, a model can contain no algebraic expression at all, only conditional, joint, or other probability measurements (called free parameters). These probability measurements can be thought of as a sampling of the subject population.

The true value of the probability of event A, PA(X=a), can be estimated by repeating the random experiment (the random process): PA(X=a) ~= ratio(A/all), the number of trials in which A occurs divided by the total number of trials. The limit of this ratio as the number of trials grows is the true value of the probability of event A.
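A simulation sketch of the ratio estimate, assuming the event is "a fair die shows 6" (true probability 1/6). The function name, the seed, and the trial counts are arbitrary choices for illustration.

```python
import random

def relative_frequency(trials, seed=0):
    """Estimate P(die shows 6) as (number of sixes) / (number of rolls)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = sum(1 for _ in range(trials) if rng.randint(1, 6) == 6)
    return hits / trials

for n in (100, 10_000, 200_000):
    print(n, relative_frequency(n))  # the ratio approaches 1/6 as n grows
```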

Distribution and Set

A binomial random experiment consists of several independent Bernoulli random experiments (trials).
A randomly chosen subset of a sample drawn from a certain distribution follows the same distribution.
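A sketch of the first statement: a binomial draw built as the sum of n independent Bernoulli draws. The parameters n = 10, p = 0.3 and the seed are arbitrary.

```python
import random

def bernoulli(p, rng):
    """One Bernoulli trial: 1 (success) with probability p, else 0."""
    return 1 if rng.random() < p else 0

def binomial(n, p, rng):
    """One binomial draw: the number of successes in n Bernoulli trials."""
    return sum(bernoulli(p, rng) for _ in range(n))

rng = random.Random(42)
draws = [binomial(10, 0.3, rng) for _ in range(20_000)]
sample_mean = sum(draws) / len(draws)
print(sample_mean)  # close to n * p = 3
```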

Random Variable, Probability and Distribution

A random variable, and the probability that it takes a certain value (an event), refer to a specific random experiment.

The distribution, in contrast, describes the overall trend of the subject (the population, the sample set).

P(X) and P(X=x0)

P(X) is the PDF or PMF of a distribution. P(X=x0) is the probability that the random variable X takes the value x0.

Hidden Markov Model

The assumption made by HMM

The assumption is that, for every random variable in the (conditional) probability chain, the conditioning is only on the previous n variables in the sequence. In other words, given the previous n variables, each variable is conditionally independent of all earlier variables.

HMM on NLP

The tagging problem can be abstracted as modeling the joint probability of two sequences: the sentence (word) sequence and the tag sequence. In the HMM approach, the tag sequence is modeled (approximated) as a Markov sequence, and each word of the sentence is modeled as an independent event conditioned only on the tag at the corresponding position.

Generative or Discriminative

HMM is by definition a generative model because it models the sequences with a joint probability rather than a conditional probability.

Interpretation from ML Perspective

The training objective of the HMM is a probabilistic model that cannot output the target labeling directly. Instead, a labeling function has to be defined in addition to the HMM probabilistic model. The training set consists of samples, each a pair of a word sequence and its POS tag sequence.

The training process is essentially a counting process in which the statistical properties of the label sequences (POS tags) are estimated, along with the conditional probability of each word/tag pair. These estimates are then used as the parameters of the HMM (transition probabilities and emission probabilities).
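The counting process can be sketched as follows. It assumes a first-order (bigram) HMM, a hypothetical start symbol "<s>", no smoothing, and a tiny made-up corpus; the function name train_hmm is not from the notes.

```python
from collections import Counter, defaultdict

def train_hmm(tagged_sentences, start="<s>"):
    """Estimate transition and emission probabilities by relative counts."""
    transition_counts = defaultdict(Counter)  # previous tag -> next tag counts
    emission_counts = defaultdict(Counter)    # tag -> word counts
    for sentence in tagged_sentences:
        prev = start
        for word, tag in sentence:
            transition_counts[prev][tag] += 1
            emission_counts[tag][word] += 1
            prev = tag

    def normalize(counts):
        # Turn each row of counts into a probability distribution.
        return {key: {item: c / sum(row.values()) for item, c in row.items()}
                for key, row in counts.items()}

    return normalize(transition_counts), normalize(emission_counts)

corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("dogs", "NOUN"), ("bark", "VERB")],
]
transition, emission = train_hmm(corpus)
print(transition["<s>"])  # DET starts 2 of 3 sentences, NOUN starts 1 of 3
print(emission["NOUN"])   # dog, cat, dogs each seen once under NOUN
```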

In prediction, the labeling function (output function) uses the parameters of the HMM to predict the labeling of new word sequences.
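The notes do not say which labeling function is used. One common choice is Viterbi decoding, sketched here for a first-order HMM. The parameter tables below are hypothetical, the input sentence is assumed non-empty, and plain probabilities are used rather than log-probabilities, so long sentences could underflow.

```python
def viterbi(words, tags, transition, emission, start="<s>"):
    """Return the most probable tag sequence for a non-empty word list."""
    def t_prob(prev, tag):
        return transition.get(prev, {}).get(tag, 0.0)

    def e_prob(tag, word):
        return emission.get(tag, {}).get(word, 0.0)

    # best[i][tag] = (probability of the best path ending in tag, previous tag)
    best = [{t: (t_prob(start, t) * e_prob(t, words[0]), None) for t in tags}]
    for i in range(1, len(words)):
        best.append({
            t: max((best[i - 1][prev][0] * t_prob(prev, t) * e_prob(t, words[i]), prev)
                   for prev in tags)
            for t in tags
        })

    # Backtrack from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]

# Hypothetical parameters, for illustration only.
transition = {"<s>": {"DET": 0.7, "NOUN": 0.3}, "DET": {"NOUN": 1.0},
              "NOUN": {"VERB": 0.8, "NOUN": 0.2}, "VERB": {"DET": 1.0}}
emission = {"DET": {"the": 1.0}, "NOUN": {"dog": 0.6, "barks": 0.4},
            "VERB": {"barks": 1.0}}
tags = ["DET", "NOUN", "VERB"]

print(viterbi(["the", "dog", "barks"], tags, transition, emission))
# ['DET', 'NOUN', 'VERB']
```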

Origin of Name: Probability Distribution

"Distribution" indicates that the total probability "1" is divided and distributed among the possible values of the random variable.

The Meaning of Arithmetic Operations and Mathematical Terms

  1. Addition
    The combined amount of the two addends.

  2. Multiplication
    A repeated addition

  3. Exponentiation
    A repeated multiplication

  4. Multinomial
    An algebraic expression formed as the sum of multiple terms (each term may be a power of degree 1, 2, 3, etc., with any base).

  5. Polynomial
    A special case of a multinomial expression: a sum of powers of one or more variables, each multiplied by a coefficient.

  6. Algebraic Expression
    An expression in which a finite number of symbols is combined using only the operations of addition, subtraction, multiplication, division and exponentiation with constant rational exponent.

  7. Term
    The components of an expression that are connected by addition or subtraction.

  8. Factor
    The elements of a term, which are connected by multiplication.

  9. Coefficient
    The constant in a term, which is also a factor.

Probability and Statistics Reviews, Part One

Classic Probability Problem - Standard Procedures

  1. Define simple events (basic events) based on the given information
  2. Abstract the problem to be solved into a compound event
  3. Solve the problem based on classic probability equations

Random Experiment

The concept of a random experiment arises from the assumption that the event to be observed is described on an occurs-or-not basis. Many real-world probability problems, however, are not naturally stated that way. These problems can be conceptually transformed into the occurrence-and-observation pattern.

Random Variable

A random variable is a mapping from the outcomes of a random experiment to real values.
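A sketch of the mapping, assuming the experiment is two tosses of a fair coin and the random variable X counts the heads (both chosen for illustration):

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space of two coin tosses: ('H','H'), ('H','T'), ('T','H'), ('T','T').
sample_space = list(product("HT", repeat=2))

def X(outcome):
    """Random variable: maps an outcome to a real value (number of heads)."""
    return outcome.count("H")

# The distribution of X follows from counting equally likely outcomes.
counts = Counter(X(outcome) for outcome in sample_space)
pmf = {value: Fraction(c, len(sample_space)) for value, c in counts.items()}
print(pmf[0], pmf[1], pmf[2])  # 1/4 1/2 1/4
```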

Statistics and Probability

Statistics is the means by which a probability (when the population is known) or a probability estimate (when only a sample is available) is obtained.

Relative Frequency

The word "relative" here refers to the fact that in any experiment that is infinitely repeatable, the absolute frequency (the exact proportion) cannot be obtained.

Simple Event (Elementary Event)

A random experiment with a single outcome (for example, a coin toss that lands heads up).

In contrast, compound events have multiple outcomes. An elementary event can itself be a process whose outcome is "complex", for example, tossing 3 dice and getting 1, 3, 2 on top. In this case, the elementary event A is constructed from three elementary events in a related probability space (tossing only one die and observing the outcome).

A coin toss that lands heads up and a coin toss that lands tails up are two simple events.
Simple and compound events are relative concepts: when no lower-level events are defined, a complex event may be treated as a simple event when analyzing certain problems.
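A sketch of the three-dice example by enumeration, assuming fair dice. The compound event "the three dice sum to 6" is an extra illustration, not from the notes.

```python
from fractions import Fraction
from itertools import product

single_die = range(1, 7)
three_dice = list(product(single_die, repeat=3))  # 6 * 6 * 6 = 216 outcomes

def prob(event):
    """Probability of an event given as a collection of equally likely outcomes."""
    return Fraction(len(event), len(three_dice))

# Elementary in the three-dice space, yet built from three single-die events.
elementary = [(1, 3, 2)]
print(prob(elementary), Fraction(1, 6) ** 3)  # 1/216 1/216

# A compound event covers several outcomes of the same sample space.
sum_is_six = [outcome for outcome in three_dice if sum(outcome) == 6]
print(len(sum_is_six), prob(sum_is_six))  # 10 5/108
```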

Sample Space (One Element of Probability Space)

A sample space is a set, denoted S, that enumerates every possible outcome, that is, all possible simple events (the event condition together with its outcomes).

Any event (to be observed) covers part of the sample space (the event is a subset of the sample space).

The conditional probability equation should only be applied when P(A), P(B), and P(A and B) all refer to probabilities of events in the same sample space.

Distribution Mixture

N1 = The population size of distribution 1
N2 = The population size of distribution 2
P3(x) = (P1(x)N1 + P2(x)N2)/(N1+N2)
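A sketch of the mixture formula, assuming P1 and P2 are discrete distributions given as dictionaries and that P3 is the distribution of the two populations pooled together. The function name and all numbers are hypothetical.

```python
from fractions import Fraction

def mixture_pmf(p1, n1, p2, n2):
    """P3(x) = (P1(x)*N1 + P2(x)*N2) / (N1 + N2) for every value x."""
    support = set(p1) | set(p2)  # a value missing from one pmf has probability 0 there
    return {x: (p1.get(x, 0) * n1 + p2.get(x, 0) * n2) / (n1 + n2) for x in support}

p1 = {"a": Fraction(1, 2), "b": Fraction(1, 2)}  # distribution 1
p2 = {"a": Fraction(1, 5), "b": Fraction(4, 5)}  # distribution 2
p3 = mixture_pmf(p1, 100, p2, 300)               # N1 = 100, N2 = 300

print(p3["a"], p3["b"])  # 11/40 29/40
```

The result is a weighted average of the two distributions, with the population sizes as weights, so it still sums to 1.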