Ubuntu 14.04.2 LTS Kernel: Linux 3.13.0-52-generic
GCC version 4.8.2 (Ubuntu 4.8.2-19ubuntu1)
Version: 1.5.2 with Java 1.8.0_40
Java 1.8.0_40, JRE(Build 1.8.0_40-b25), JVM (64bit, build 25.40-b25, mixed mode)
- Git add
Once a file has been added to the staging area it is tracked, but later modifications to it still have to be staged again (with another git add) before they are included in a commit.
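A minimal shell sketch of this behavior, using a throwaway repository (file names and messages are illustrative; the git config lines are only needed in a fresh environment):

```shell
# Create a throwaway repo and track a file.
git init demo && cd demo
git config user.email "you@example.com" && git config user.name "you"  # setup for a fresh environment
echo "v1" > notes.txt
git add notes.txt              # file enters the staging area and is now tracked
git commit -m "add notes"

echo "v2" >> notes.txt         # later modification: tracked, but NOT staged
git status --short             # shows " M notes.txt" (modified, unstaged)

git add notes.txt              # the new change must be staged again
git commit -m "update notes"   # now the modified version is committed
```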
How to read/learn source code
- Read Comments
- Read Function Signature
- Run the source code and observe the results, variable values, etc.
The variables defined within a context can be accessed outside of the context.
The context is used to apply automatic operations.
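A minimal Python sketch of both points: a name bound inside a with block remains visible after the block exits, and the context manager's job is only to run setup/teardown automatically (demo_context is a made-up example, not a library API):

```python
from contextlib import contextmanager

@contextmanager
def demo_context():
    print("setup")      # runs automatically on entry
    yield 42
    print("teardown")   # runs automatically on exit

with demo_context() as value:
    result = value + 1  # variable defined inside the context...

print(result)  # 43 -- ...is still accessible outside it
```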
- Interactive Session
tf.InteractiveSession() creates a default session that can be used without being explicitly referenced, which is convenient in an IPython environment
sess = tf.InteractiveSession()
a = tf.constant(1)
print(a.eval())  # the default session is picked up implicitly
- Regular Session
A regular session needs to be run within a Python context manager or invoked explicitly through a session object
a = tf.constant(1)
sess = tf.Session()
print(sess.run(a))          # explicit session object
# or, equivalently, within a context manager:
with tf.Session() as sess:
    print(sess.run(a))
This function is used to add a bias to the input tensor (element-wise addition between the "bias" vector and the feature vector)
It should be noted that the bias added here is completely different from the usual concept of adding a bias to a hidden unit (added to the weighted sum before activation)
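A NumPy sketch of the element-wise addition described above (the same broadcasting that an op like tf.nn.bias_add applies to the last dimension of its input; the numbers are made up):

```python
import numpy as np

# A length-d bias vector is broadcast across the batch dimension and
# added element-wise to each feature vector.
features = np.array([[1.0, 2.0, 3.0],
                     [4.0, 5.0, 6.0]])   # shape [batch, d]
bias = np.array([0.5, -0.5, 1.0])        # shape [d]
out = features + bias
print(out)  # each row gets the same bias vector added
```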
The run function of tf.Session provides an interface to execute the provided TF operations and evaluate Tensors.
Continuous features can be fed into the first hidden layer of the neural network directly. Discrete features are recommended to go through an embedding layer first.
This API accepts a [batch_size, doc_length] tensor of type "int32" or "int64"
This API is used to look up an embedding by ID.
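A conceptual NumPy sketch of what such a lookup does: each integer ID in the [batch_size, doc_length] tensor selects one row of the embedding matrix (the sizes here are made up for illustration):

```python
import numpy as np

# Embedding matrix: one row per vocabulary entry.
vocab_size, embed_dim = 5, 3
embeddings = np.arange(vocab_size * embed_dim,
                       dtype=np.float32).reshape(vocab_size, embed_dim)

ids = np.array([[0, 2],
                [4, 1]])            # [batch_size=2, doc_length=2], integer IDs
looked_up = embeddings[ids]         # shape (2, 2, 3): one row per ID
print(looked_up.shape)
```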
Categorical to Embedding Pipeline
Categorical Label ---> Ordinal ID ---> One-hot Embedding ---> Dense Embedding
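An end-to-end sketch of this pipeline in NumPy. The labels and the embedding matrix W are made up here; in practice W is learned during training:

```python
import numpy as np

labels = ["red", "green", "blue", "green"]                  # categorical labels
vocab = {c: i for i, c in enumerate(sorted(set(labels)))}   # label -> ordinal ID
ids = np.array([vocab[c] for c in labels])                  # ordinal IDs
one_hot = np.eye(len(vocab))[ids]                           # one-hot embedding
W = np.random.randn(len(vocab), 2)                          # [vocab, embed_dim]
dense = one_hot @ W                                         # dense embedding
print(np.allclose(dense, W[ids]))  # True: one-hot x W is just a row lookup
```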
The Cause of Immigration
The quality of life in Western countries is far better than that in China in economic, environmental, and social terms. I find myself increasingly believing in Western cultures and values, and that is the primary driving force behind my thoughts of immigration. I need to improve my language skills as well as my professional experience to be ready for the opportunity. Meanwhile, I need to examine this thought very carefully to confirm it is a good fit for me and my family.
I need careful planning to merge my immigration effort with the long-term paradigm shift in my family's earnings, that is, the shift from work-based earning to asset-based earning. (An asset here is something that can generate revenue without extra mental/physical labor.)
- The embedding of a word is the hidden-layer output (a vector) obtained when feeding the one-hot vector (input) of the word.
- The target value in the training data carries sequential information (word sequence in a sentence)
- The hidden layer weight matrix is the word vector lookup table
- Word embedding is a by-product of language modelling
- The embedding layer is just a linear layer that maps a one-hot input to a row of the embedding matrix (the look-up table)
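The points above can be sketched in NumPy: the hidden layer has no activation, so feeding word i's one-hot vector through the weight matrix W simply returns row i of W; the hidden-layer output is the word's embedding, and W is the lookup table (sizes made up for illustration):

```python
import numpy as np

vocab_size, embed_dim = 4, 3
W = np.arange(vocab_size * embed_dim,
              dtype=np.float32).reshape(vocab_size, embed_dim)  # hidden weights

word_id = 2
one_hot = np.zeros(vocab_size, dtype=np.float32)
one_hot[word_id] = 1.0                    # one-hot input for word 2

hidden = one_hot @ W                      # linear hidden-layer output
print(np.array_equal(hidden, W[word_id])) # True: it is just row word_id of W
```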
- The embedding of categorical data is obtained during model training, just as word embeddings are.
- The target value carries prediction information (the predicted value), as opposed to the sequential information in word embedding
- Pre-trained embeddings can be used when training a new model to improve performance (both accuracy and speed)
- Ordinal values that carry information but take discrete values (by definition) should be normalized to a 0-1 (or -1 to 1) scale as input for a DNN
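A minimal min-max scaling sketch for such an ordinal feature (the values are illustrative):

```python
# Min-max scaling of an ordinal feature to the [0, 1] range before
# feeding it to a DNN.
values = [1, 2, 3, 4, 5]
lo, hi = min(values), max(values)
scaled = [(v - lo) / (hi - lo) for v in values]
print(scaled)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```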
Quandl (/ˈkwɑːndəl/) is a platform for financial, economic, and alternative data that serves investment professionals. Quandl sources data from over 500 publishers.
Measurement of Dispersion
Coefficient of Variation
Stock Price High-Low Range COV
AAPL (2015-1-1 to 2017-1-1): 0.582
XNET (2015-1-1 to 2017-1-1): 0.803
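The coefficient of variation is the standard deviation divided by the mean, a scale-free dispersion measure, which is why series at different price levels (like the two tickers above) can be compared. A sketch with made-up daily high-low ranges:

```python
import statistics

# CoV = (population) standard deviation / mean.
daily_ranges = [1.2, 0.8, 1.5, 1.0, 0.9]  # illustrative high-low ranges
cov = statistics.pstdev(daily_ranges) / statistics.mean(daily_ranges)
print(round(cov, 3))  # 0.23
```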
The dataset of the Two Sigma competition is an HDF5 (.h5) file. It can be read into a 1,710,756 rows x 111 columns pandas DataFrame.
Every data point is identified by the combination of two attributes, id + timestamp, neither of which is unique on its own. id identifies a financial security, and timestamp indicates the time of the quote (Y).
The dataset contains 1424 distinct ids and 1813 distinct timestamps. In general, data points are sampled at a fixed timestamp interval of 750.
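A tiny synthetic stand-in for this structure (all values made up; the real file would be loaded with something like pd.read_hdf, path and key assumed): neither id nor timestamp is unique by itself, but the (id, timestamp) pair is.

```python
import pandas as pd

# df = pd.read_hdf("train.h5")  # how the real file would be loaded (path assumed)
df = pd.DataFrame({
    "id":        [10, 10, 11, 11],     # security IDs repeat across time
    "timestamp": [0, 750, 0, 750],     # timestamps repeat across securities
    "y":         [0.1, -0.2, 0.3, 0.0],
})
print(df["id"].is_unique)                               # False
print(df["timestamp"].is_unique)                        # False
print(df.duplicated(subset=["id", "timestamp"]).any())  # False: the pair is unique
```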