Entropy of a Random Variable
Entropy is usually denoted by the symbol H; for example, H(P) denotes the entropy of the random variable P.
From the viewpoint of probability theory, entropy describes the uncertainty in a random variable's value. The entropy of a random variable increases with the uniformity of its probability distribution: under a uniform distribution all values are nearly equally likely, so the random variable it describes is highly uncertain and the distribution carries little useful information.
For a random variable X, entropy is defined as H(X) = -SUM( Pi * log2(Pi) )
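As a quick sketch of this definition (the example distributions below are made up for illustration), the entropy of a discrete distribution can be computed directly from its probability vector:

```python
import math

def entropy(probs):
    """Shannon entropy H = -sum(p_i * log2(p_i)) of a discrete distribution."""
    # Terms with p = 0 contribute nothing (lim p*log p = 0), so skip them.
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]   # fair 4-sided die: maximum uncertainty
skewed  = [0.97, 0.01, 0.01, 0.01]   # nearly deterministic: little uncertainty

print(entropy(uniform))  # 2.0 bits
print(entropy(skewed))   # well below 2.0 bits
```

The uniform distribution attains the maximum entropy (2 bits for 4 outcomes), matching the statement above that more uniform distributions are more uncertain.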
Cross-Entropy Loss Function
Cross entropy describes the difference between the probability distributions of two random variables.
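A minimal sketch of the cross-entropy computation H(P, Q) = -SUM( Pi * log2(Qi) ), with made-up distributions:

```python
import math

def cross_entropy(p, q):
    """H(P, Q) = -sum(p_i * log2(q_i)): the cost of coding samples from P using Q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]   # true distribution (fair coin)
q = [0.9, 0.1]   # mismatched model distribution

# Cross entropy equals the entropy of P when Q == P, and grows as Q diverges from P.
print(cross_entropy(p, p))  # 1.0
print(cross_entropy(p, q))  # > 1.0
```

This is why cross entropy works as a loss: minimizing it over Q pulls the model distribution toward the true distribution P.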
Superscript refers to the index of training examples
Subscript refers to the index of vector elements.
A sequence may be named or referred to by an upper-case letter such as "A" or "S". The terms of a sequence are usually named something like "ai" or "an", with the subscripted letter "i" or "n" being the "index" or the counter. So the second term of a sequence might be named "a2" (pronounced "ay-sub-two"), and "a12" would designate the twelfth term.
The sequence can also be written in terms of its terms. For instance, the sequence of terms ai, with the index running from i = 1 to i = n, can be written as {ai}, i = 1 to n (that is, a1, a2, ..., an).
http://www.purplemath.com/modules/series.htm
In statistics, the hat is used to denote an estimator or an estimated value, as opposed to its theoretical counterpart. For example, in the context of errors and residuals, the "hat" over the letter ε (i.e., ε̂) indicates an observable estimate (the residuals) of the unobservable quantity ε (the statistical errors).
Derivative Operator
The probability that event A or event B happens is the sum of P(A) and P(B), minus the probability that events A and B happen at the same time.
P(A or B) = P(A) + P(B) - P(A and B)
The probability that events A and B happen at the same time is the product of the conditional probability of A given B and the probability of B.
P(A and B) = P(A|B) * P(B)
The law of total probability is the proposition that if {Bn: n = 1, 2, 3, ...} is a finite or countably infinite partition of a sample space and each event Bn is measurable, then for any event A of the same probability space:
P(A) = SUM( P(A and Bn) ) = SUM( P(A|Bn) * P(Bn) ), summed over n = 1, 2, 3, ...
The law of total probability = "Or" Rule + "Multiplication Rule"
Two events are independent if P(A and B) = P(A)*P(B)
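The rules above can be checked numerically on a toy sample space (a fair six-sided die; the events A and B below are chosen just for illustration):

```python
from fractions import Fraction

# One fair six-sided die. A = "even number", B = "greater than 3".
omega = set(range(1, 7))
A = {2, 4, 6}
B = {4, 5, 6}

def prob(event):
    # Classical probability: favorable outcomes / all outcomes.
    return Fraction(len(event), len(omega))

p_a_and_b = prob(A & B)                      # P({4, 6}) = 1/3

# Addition ("or") rule: P(A or B) = P(A) + P(B) - P(A and B)
p_a_or_b = prob(A) + prob(B) - p_a_and_b
print(p_a_or_b == prob(A | B))               # True

# Multiplication rule: P(A and B) = P(A|B) * P(B)
p_a_given_b = p_a_and_b / prob(B)
print(p_a_given_b * prob(B) == p_a_and_b)    # True

# Independence check: here P(A and B) != P(A)*P(B), so A and B are dependent.
print(p_a_and_b == prob(A) * prob(B))        # False
```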
In the E-step, the missing data are estimated via conditional expectation. In the M-step, the model parameters are estimated through MLE, treating the E-step estimates as if they were observed.
E(Y | X = x) = SUM( y * P(y | x) )
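A small sketch of this conditional-expectation formula on a made-up joint distribution table (the probabilities are chosen only for illustration):

```python
from fractions import Fraction

# Joint distribution P(X, Y) over a small table (illustrative values).
joint = {
    (0, 0): Fraction(1, 4), (0, 1): Fraction(1, 4),
    (1, 0): Fraction(1, 8), (1, 1): Fraction(3, 8),
}

def cond_expectation(x):
    """E(Y | X = x) = sum_y y * P(Y=y | X=x), with P(y|x) = P(x,y) / P(x)."""
    p_x = sum(p for (xi, _), p in joint.items() if xi == x)
    return sum(y * p / p_x for (xi, y), p in joint.items() if xi == x)

print(cond_expectation(0))  # 1/2
print(cond_expectation(1))  # 3/4
```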
Similar to : Variance v.s. Standard Deviation
A statistical model can take the form of explicit algebraic expressions with parameters.
Alternatively, a model may contain no algebraic expressions but only conditional/joint or other probability measurements (called free parameters). These probability measurements can be thought of as samplings of the subject population.
The true value of the probability of event A, PA(X=a), can be estimated by repeating the random experiment (repeating the random process): PA(X=a) ≈ ratio(occurrences of A / all trials). The limit of this ratio as the number of trials grows is the true value of the probability of event A (happening).
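A quick simulation sketch of this frequency estimate, using a fair die whose true P(X = 6) is 1/6 (the trial counts are arbitrary):

```python
import random

random.seed(0)

def estimate(trials):
    """Relative-frequency estimate of P(X = 6) for a fair six-sided die."""
    hits = sum(1 for _ in range(trials) if random.randint(1, 6) == 6)
    return hits / trials

# The relative frequency approaches the true probability 1/6 as trials grow.
print(estimate(100))      # rough estimate
print(estimate(100_000))  # close to 0.1667
```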
A binomial random experiment consists of several Bernoulli random experiments (trials).
A subset of a sample drawn from a certain distribution follows that same distribution.
A random variable, and the probability of a random variable taking a certain value (an event), refer to a specific random experiment.
The distribution, in contrast, describes the subject as an overall trend (the population, the sample set).
P(X) is the PDF or PMF of a distribution. P(X=x0) is the probability that the random variable X takes the value x0.
The assumption is that for every random variable in the (conditional) probability chain, the conditioning is only on the previous n variables in the sequence. Put another way, the conditional probability of each variable is independent of all variables other than the previous n.
The tagging problem can be abstracted as modeling the joint probability of two sequences: the sentence sequence and the tag sequence. In an HMM approach to this joint probability, the tag sequence is modeled (approximated) as a Markov sequence, and the sentence sequence is modeled as independently occurring events, each conditioned only on the tag at the corresponding position.
An HMM is by definition a generative model because it models the sequences with a joint probability rather than a conditional probability.
The result of training an HMM is a probabilistic model that cannot output the target labeling directly. Instead, a labeling (decoding) function has to be defined on top of the HMM probabilistic model. The training set of an HMM consists of training samples, each a pair of a word sequence and a POS-tag sequence.
The training process is essentially a counting process in which the statistical properties of the labeling sequences (POS tags) are estimated, along with the conditional probability of each word/tag pair. These estimates are then used as the parameters of the HMM (transition probabilities and emission probabilities).
In prediction, the labeling (decoding) function uses the parameters of the HMM to predict the labeling of new word sequences.
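The counting-based training and the decoding (labeling) function described above can be sketched as follows. This is a toy illustration, not a production tagger: the two-sentence corpus, the tag names, and the unsmoothed probabilities are all made up, and the decoder is a bare-bones Viterbi search.

```python
from collections import defaultdict

# Toy training set: pairs of word sequence and POS-tag sequence.
training_data = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]

# Training = counting: accumulate transition counts P(tag_i | tag_{i-1})
# and emission counts P(word | tag) from the labeled pairs.
trans = defaultdict(lambda: defaultdict(int))
emit = defaultdict(lambda: defaultdict(int))
for sentence in training_data:
    prev = "<s>"                      # sentence-start pseudo-tag
    for word, tag in sentence:
        trans[prev][tag] += 1
        emit[tag][word] += 1
        prev = tag

def normalize(counts):
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

trans_p = {t: normalize(c) for t, c in trans.items()}  # transition probabilities
emit_p = {t: normalize(c) for t, c in emit.items()}    # emission probabilities

def viterbi(words):
    """Labeling function: most probable tag sequence under the trained HMM."""
    tags = list(emit_p)
    best = {t: (trans_p["<s>"].get(t, 0) * emit_p[t].get(words[0], 0), [t])
            for t in tags}
    for word in words[1:]:
        new_best = {}
        for t in tags:
            score, path = max(
                ((p * trans_p.get(prev, {}).get(t, 0) * emit_p[t].get(word, 0),
                  path + [t])
                 for prev, (p, path) in best.items()),
                key=lambda s: s[0])
            new_best[t] = (score, path)
        best = new_best
    return max(best.values(), key=lambda s: s[0])[1]

print(viterbi(["the", "dog", "sleeps"]))  # ['DET', 'NOUN', 'VERB']
```

Note that training never runs an optimizer: the parameters are pure relative-frequency counts, and all the search effort happens at prediction time in the decoding function.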
"Distribution" indicates that the sum of probability "1" is divided and distributed into the probability of each random variable.
Addition
Combining two quantities (the augend and the addend) into their total.
Multiplication
A repeated addition
Exponentiation
A repeated multiplication
Multinomial
An algebraic expression constructed as a sum of multiple mathematical terms (each term may be a power of degree 1, 2, 3, etc., with any base).
Polynomial
A special case of a multinomial expression, constructed as a sum of powers of one or more variables multiplied by coefficients.
Algebraic Expression
An expression in which a finite number of symbols is combined using only the operations of addition, subtraction, multiplication, division and exponentiation with constant rational exponent.
Term
The components of an expression, connected by addition or subtraction.
Factor
The elements within a term, connected by multiplication.
Coefficient
The constant in a term; it is also a factor.
The concept of a random experiment arises from the assumption that the event to be observed is described on an occurring-or-not basis. Many real-world probability problems, however, do not naturally describe themselves this way. These types of problems can be conceptually transformed into the occurring-and-observing pattern.
A random variable is a mapping from the result of a random experiment to a real value.
Statistics is the necessary measure taken to reach a probability (when the population is known) or a probability estimate (when only a sample is available).
The phrase "relative" here refers to the fact that in any experiment that is infinitely repeatable, the absolute frequency (proportion ratio) cannot be obtained.
A random experiment with a single outcome (e.g., a coin toss with heads on top).
In contrast, compound events have multiple outcomes. An elementary event can be a process whose outcomes are "complex"; for example, tossing 3 dice and getting 1, 3, 2 on top. In this case, the elementary event A is constructed from three elementary events in a related probability space (tossing only one die and observing the outcome).
A coin toss with heads on top and a coin toss with tails on top are two simple events.
Simple and compound events are relative concepts: when no lower-level events are defined, a complex event may be treated as a simple event when analyzing certain problems.
A sample space is a set, denoted S, which enumerates each and every possible outcome, i.e., every simple event.
All possible simple events (event condition together with its outcomes).
Any event (to be observed) covers part of the sample space (the event is a subset of the sample space).
N1 = The population size of distribution 1
N2 = The population size of distribution 2
P3(x) = (P1(x)*N1 + P2(x)*N2) / (N1 + N2)
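This pooled-distribution formula is a size-weighted average of the two component distributions; it can be sketched directly (the population sizes and probabilities below are made up):

```python
def pooled(p1, p2, n1, n2):
    """P3(x) = (P1(x)*N1 + P2(x)*N2) / (N1 + N2) for every value x."""
    return {x: (p1.get(x, 0) * n1 + p2.get(x, 0) * n2) / (n1 + n2)
            for x in set(p1) | set(p2)}

p1 = {"a": 0.5, "b": 0.5}   # distribution of population 1
p2 = {"a": 0.2, "b": 0.8}   # distribution of population 2
p3 = pooled(p1, p2, n1=100, n2=300)

print(p3["a"])           # (0.5*100 + 0.2*300) / 400 = 0.275
print(sum(p3.values()))  # the pooled distribution still sums to 1
```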