Development Environment Version Control

  • OS

    Ubuntu 14.04.2 LTS Kernel: Linux 3.13.0-52-generic

  • Python

    Python Distribution:

        Anaconda3-4.2.0
    
  • Compiler

    GCC version 4.8.2 (Ubuntu 4.8.2-19ubuntu1)

  • Scala

    Scala-2.10.5

  • Apache Spark

    Version: 1.5.2 with Java 1.8.0_40

  • Java
    Java 1.8.0_40, JRE(Build 1.8.0_40-b25), JVM (64bit, build 25.40-b25, mixed mode)

  • Gogs
    0.9.22.0425

Git Study Note

  1. Git add
    Once a file has been added to the staging area, Git will track its later modifications, but a modified file must be added again to stage the new changes.

Python Programming Notes

Language Features

  1. Context
    Variables defined within a context (a with block) remain accessible outside of the context.
    The context manager is used to apply automatic setup and cleanup operations (see the sketch below).
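
A minimal sketch of both points, using the built-in file context manager (the file name is arbitrary):

    with open("notes.txt", "w") as f:   # cleanup (closing the file) runs automatically
        text = "hello"
        f.write(text)

    print(text)      # names bound inside the with block are still visible here
    print(f.closed)  # True: the automatic close has already run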

Escape Depth

  1. The Python interpreter interprets escape sequences in a string literal only once!
    "\n" ---> a single newline character; r"\n" ---> a literal backslash followed by "n"

  2. Unicode Escape Characters
    Written as a string literal, "\x0343" is interpreted as a unicode escape (the character \x03 followed by "43").

Read from data, the text \x0343 is treated as plain ASCII characters when printing to the console or writing to a text file (due to the escape-only-once principle above).

  3. Unnecessary escapes (e.g. a doubled backslash) can be removed with line = line.replace("\\\\", "\\") (see the sketch below)
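
A minimal sketch of the escape-once behaviour and of stripping an unneeded escape (the strings are illustrative):

    s = "a\nb"                    # the interpreter processes the escape once: a real newline
    r = r"a\nb"                   # raw string: a literal backslash followed by "n"
    print(len(s), len(r))         # 3 4
    print(r.replace("\\n", "\n") == s)   # True: the extra escape has been removed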

Python Json String "()"

(1,2,3) ---> String Repr: "(1, 2, 3)"
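
A quick check of the representation, with the JSON form for contrast (JSON has no tuple type, so json serializes a tuple as an array):

    import json

    t = (1, 2, 3)
    print(str(t))         # (1, 2, 3)
    print(json.dumps(t))  # [1, 2, 3]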

Output Redirection Not Working under nohup

Comment: It looks like you need to flush stdout periodically (e.g. sys.stdout.flush()). In my testing Python doesn't automatically do this even with print until the program exits.
You can run Python with the -u flag to avoid output buffering
1. Solution 1: run Python with the -u flag
2. Solution 2: call sys.stdout.flush() (see the sketch below)
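
A minimal sketch of the second fix; the loop and the sleep are only for illustration:

    import sys
    import time

    for i in range(10):
        print(i)
        sys.stdout.flush()   # push the buffered line to the redirected log file right away
        time.sleep(1)

Alternatively, start the script with python -u so stdout is unbuffered from the beginning.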

Tensorflow Study Note, Part Three

Session

  1. Interactive Session

tf.InteractiveSession() installs itself as the default session, so Tensors can be evaluated without explicitly referencing a session object; this is convenient in an IPython environment

Example:

    import tensorflow as tf

    tf.InteractiveSession()   # becomes the default session
    a = tf.constant(1)
    a.eval()                  # evaluated in the default session

  2. Regular Session

A regular session needs to be run either through the session object explicitly or within a Python context (a with block)

Example 1:

    import tensorflow as tf

    a = tf.constant(1)
    sess = tf.Session()
    sess.run(a)

Example 2:

    import tensorflow as tf

    a = tf.constant(1)
    with tf.Session():
        a.eval()   # the session opened by the with block is the default session

tf.nn.bias_add

This function is used to add a bias to the input tensor (an element-wise addition of the "bias" vector to each feature vector, broadcast over the last dimension)

It should be noted that the bias added here is completely different from the usual concept of adding a bias to a hidden unit (the weighted sum before activation)
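
A minimal sketch of the call, assuming the TF 1.x API used elsewhere in these notes:

    import tensorflow as tf

    x = tf.ones([8, 4])                      # a batch of 8 feature vectors
    W = tf.ones([4, 3])
    b = tf.constant([0.1, 0.2, 0.3])
    z = tf.nn.bias_add(tf.matmul(x, W), b)   # b is added along the last dimension of every row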

tf.Session

The run function of tf.Session provides an interface to execute the provided TF operations and evaluate Tensors.

Feature Preprocessing

Continuous features can be fed into the first hidden layer of the neural network directly. Discrete features are recommended to go through an embedding layer first.

embed_sequence

This API accepts a [batch_size, doc_length] tensor of IDs of type "int32" or "int64".
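
A minimal sketch, assuming the TF 1.x contrib API (the ID values, vocab_size and embed_dim are arbitrary):

    import tensorflow as tf

    ids = tf.constant([[1, 4, 2], [3, 0, 5]], dtype=tf.int32)   # [batch_size, doc_length]
    embedded = tf.contrib.layers.embed_sequence(ids, vocab_size=10, embed_dim=8)
    # embedded has shape [batch_size, doc_length, embed_dim] = [2, 3, 8]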

embedding_lookup

This API is used to look up embeddings by ID.
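
A minimal sketch (the table shape and IDs are arbitrary):

    import tensorflow as tf

    table = tf.get_variable("embedding_table", shape=[10, 4])   # 10 IDs, 4-dim vectors
    ids = tf.constant([2, 7, 2])
    vectors = tf.nn.embedding_lookup(table, ids)                 # shape [3, 4]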

Categorical to Embedding Pipeline

Categorical Label ---> Ordinal ID ---> One-hot Vector ---> Dense Embedding (see the sketch below)
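
A minimal numpy sketch of the whole pipeline (the labels, the 8-dim size and the random table are illustrative):

    import numpy as np

    labels = ["red", "green", "blue", "green"]                 # categorical labels
    vocab = {c: i for i, c in enumerate(sorted(set(labels)))}  # label -> ordinal ID
    ids = np.array([vocab[c] for c in labels])
    one_hot = np.eye(len(vocab))[ids]                          # one-hot vectors
    table = np.random.randn(len(vocab), 8)                     # dense embedding table
    dense = one_hot @ table                                    # identical to table[ids]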

About Immigration

The Cause of Immigration

The quality of life in western countries is far better than that in China in terms of economic, environmental and social conditions. I find myself increasingly believing in western cultures and values, and that is the primary driving force behind my thoughts of immigration. I need to improve my language skills as well as my professional experience to get ready for the opportunity. Meanwhile, I need to examine this thought very carefully to confirm it is a good fit for me and my family.

I need careful planning to merge my immigration effort with the long-term paradigm shift of my family's earnings, that is, the shift from work-based earning to asset-based earning. (An asset here is something that can generate revenue without extra mental or physical labor.)

Feature Embedding of Categorical Values

Word Embedding

  1. The embedding of a word is the hidden-layer output (a vector) obtained when the one-hot vector of the word is fed as input.
  2. The target value in the training data carries sequential information (word sequence in a sentence)
  3. The hidden layer weight matrix is the word vector lookup table
  4. Word embedding is a by-product of language modelling

Embedding Layer

  • An embedding layer is just a linear layer whose weight matrix is the embedding (look-up) table; it maps a one-hot input to the corresponding embedding vector

Categorical Embedding

  1. The embedding of categorical data is obtained during model training, just like a word embedding.
  2. The target value carries prediction information (the value to predict), as opposed to the sequential information used for word embeddings

Pre-trained Embedding

  • Pre-trained embeddings can be used in new model training to improve performance (both accuracy and training speed); see the sketch below
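
A minimal TF 1.x sketch of initializing an embedding variable from a pre-trained matrix; the file name embeddings.npy is hypothetical:

    import numpy as np
    import tensorflow as tf

    pretrained = np.load("embeddings.npy")        # hypothetical file, shape [vocab_size, embed_dim]
    embeddings = tf.get_variable(
        "embeddings",
        initializer=tf.constant(pretrained, dtype=tf.float32),
        trainable=True)                           # set False to keep the pre-trained vectors fixed
    vectors = tf.nn.embedding_lookup(embeddings, tf.constant([0, 5, 7]))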

Ordinal Values

  1. Ordinal values carry information but are (by definition) discrete; they should be normalized to a 0-1 (or -1 to 1) scale before being used as DNN input (see the sketch below)
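
A minimal min-max scaling sketch (the values are illustrative):

    import numpy as np

    x = np.array([1, 2, 3, 4, 5], dtype=np.float32)   # ordinal values
    x01 = (x - x.min()) / (x.max() - x.min())          # scaled to [0, 1]
    xpm1 = 2 * x01 - 1                                 # scaled to [-1, 1]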

Finance Research Resources

Financial Data

Quandl
Quandl (/ˈkwɑːndəl/) is a platform for financial, economic, and alternative data that serves investment professionals. Quandl sources data from over 500 publishers.

Accounting Codification

FASB Codification
https://asc.fasb.org/viewpage

Two Sigma Competition

Dataset

The dataset of the Two Sigma Competition is an HDF5 (.h5) file. It can be read into a pandas DataFrame of 1,710,756 rows × 111 columns.

All data points are identified by the combination of two attributes, id + timestamp, neither of which is unique on its own. id identifies a financial security and timestamp indicates the time of the quote (Y).

The dataset contains 1424 different ids and 1813 different timestamps. In general, data points are sampled at a fixed timestamp interval of 750.
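
A minimal sketch of loading and checking the dataset with pandas; the file name train.h5 and the store key "train" are assumptions:

    import pandas as pd

    with pd.HDFStore("train.h5", "r") as store:   # assumed file name
        df = store["train"]                        # assumed key inside the HDF5 store

    print(df.shape)                                # expected: (1710756, 111)
    print(df["id"].nunique(), df["timestamp"].nunique())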