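Image Grammar

Text Grammar

Applying a grammar to text (for example, a Context-Free Grammar) can be viewed as decomposing and abstracting the structure of natural language (a string), where decomposition + abstraction = parsing: "I want to eat a burger" => NP VP. The data structure produced this way contains the structural information of the text string (similar to metadata). Parsing is therefore the process of turning an unstructured natural-language string into structured data, and the structural information is represented, processed, and stored in the computer as a tree. (Unstructured data = data from which no data model can be abstracted, or which has not been organized into a predefined data model.)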
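
The idea of storing a parse as a tree can be sketched as follows. The tree below is a hand-written, illustrative parse of the example sentence (not the output of any parser), and the tag names other than NP and VP are assumptions made for the sketch.

```python
# Each node is a tuple: (label, child, child, ...); a leaf child is a plain word.
tree = (
    "S",
    ("NP", "I"),
    ("VP",
        ("V", "want"),
        ("VP",
            ("TO", "to"),
            ("VP",
                ("V", "eat"),
                ("NP", ("DET", "a"), ("N", "burger"))))),
)

def leaves(node):
    """Collect the words of the sentence by walking the tree left to right."""
    _label, *children = node
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words

sentence = " ".join(leaves(tree))
top_level = [child[0] for child in tree[1:]]  # labels directly under the root
```

The flat string is recoverable from the tree, while the tree additionally records which words group together.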

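Image Grammar

Similarly, the information in an image can be structured with a method similar to text grammar parsing. Leaf nodes describe the parts of each concrete object, intermediate nodes describe the concrete objects and the categories they belong to, and the root node describes the scene (the whole image). The tree structure describes the inter-relationships and the hierarchy of all nodes. Two directly connected nodes are the most closely related; nodes that are not adjacent are related through other nodes.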
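
A minimal sketch of such a tree, assuming a made-up scene: the Node class, the helper functions, and the object names are all hypothetical. It shows how two parts that are not directly connected are related through the nodes between them.

```python
class Node:
    """A node of the image parse tree; registers itself with its parent."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

def ancestors(node):
    """Return the chain from the node up to the root, inclusive."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain

def path(a, b):
    """Names of the nodes that connect a to b through their closest common ancestor."""
    up_a, up_b = ancestors(a), ancestors(b)
    common = next(n for n in up_a if n in up_b)
    left = up_a[: up_a.index(common) + 1]
    right = up_b[: up_b.index(common)]
    return [n.name for n in left + right[::-1]]

scene = Node("street scene")      # root: the whole image
car = Node("car", scene)          # intermediate: an object
wheel = Node("wheel", car)        # leaf: a part of the object
door = Node("door", car)
person = Node("person", scene)
head = Node("head", person)

siblings = path(wheel, door)
distant = path(wheel, head)
```

Parts of the same object meet at the object node, while parts of different objects are only related through the scene.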

File Management Guidelines

Programming Language Test Files

  1. All programming language test files go into folders named according to the following convention: "python-test1", "python-test2", except for the case covered by rule 2 below.
  2. The folder name of a specialized test should reflect the objective of the test, for example: "stdnlp-test".

Example files and Temp files

  1. Temp files and example files should go into the waffle, apollo, and zeus project source trees when they need to be versioned.
  2. Temp files should be periodically moved into one target folder or removed.
  3. Example files should be periodically moved into the example project source tree: coding-examples.

Tutorial Folder and Examples Folder

  1. Repositories hosting tutorial projects dedicated to a specific topic should be named xxxx-tutorial.
  2. Repositories hosting loosely organized example files should be named xxxx-examples.

Naming

No upper-case characters should appear in project names or file names. Dashes are allowed in project names; underscores are allowed in file names.

  1. Python
    modulename/module_name, packagename, ClassName, method_name, ExceptionName, function_name, GLOBAL_CONSTANT_NAME, global_var_name, instance_var_name, function_parameter_name, local_var_name

  2. Java
    FileName, packagename, ClassName, methodName, CONSTANT_NAME, localVarName, fieldName, parameterName, TypeVarName (T, single char)

  3. C/C++
    file_name, foldername, local_var_name, ClassName, FunctionName, class_data_member_, struct_data_member
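
As an illustration of the Python row only, the sketch below uses the convention names themselves as identifiers. The logic is arbitrary and exists only so that each kind of name appears in a runnable context.

```python
GLOBAL_CONSTANT_NAME = 3          # module-level constant
global_var_name = []              # module-level variable

class ExceptionName(Exception):
    """Exception classes use CapWords, like other classes."""

class ClassName:
    def __init__(self, function_parameter_name):
        self.instance_var_name = function_parameter_name

    def method_name(self):
        local_var_name = self.instance_var_name * GLOBAL_CONSTANT_NAME
        return local_var_name

def function_name(function_parameter_name):
    if function_parameter_name < 0:
        raise ExceptionName("negative input")
    return ClassName(function_parameter_name).method_name()
```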

Start a Docker Container with Multiple Bash Command Execution

Starting a Docker container and launching multiple background processes (services) directly through docker run does not work. The reason is that docker run accepts only a single command as its argument. Running several commands is only possible through a shell, as in /bin/bash -c "command1; command2". However, a container by design exits as soon as its main process, here /bin/bash, finishes. Because of this, service-launching commands such as "python web.py" must not all end with the background operator "&", as in /bin/bash -c "python web1.py & python web2.py &". In that case the /bin/bash process finishes immediately, so the container exits.

In theory, a command such as /bin/bash -c "python web1.py & python web2.py" should avoid this problem, because the /bin/bash process stays alive while web2.py runs in the foreground inside the container. In practice this did not work, for reasons that were not identified.

It is worth mentioning that a command such as /bin/bash -c "python web1.py; python web2.py" keeps the container running, but web2.py will not start until web1.py exits, because ";" runs commands sequentially. For a long-running service, that means web2.py never launches.

Conclusions:
1. /bin/bash -c "command1; command2" is rarely useful for launching services.
2. docker run xxx /bin/bash -c "command" can be replaced by docker run xxx command.
3. Multiple services should be split into different containers/images (containers and services should maintain a one-to-one mapping).
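
The shell behavior behind these conclusions can be checked outside Docker. The sketch below is an illustration under two assumptions: sh -c stands in for /bin/bash -c, and sleep 1 stands in for a service such as web1.py. It times how long the shell process itself lives, since that lifetime is what decides when a container would exit.

```python
import subprocess
import time

def shell_lifetime(command):
    """Return how many seconds the `sh -c` process itself stays alive."""
    start = time.monotonic()
    subprocess.run(
        ["sh", "-c", command],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=True,
    )
    return time.monotonic() - start

# Both jobs in the background: the shell returns at once (a container would exit).
all_background = shell_lifetime("sleep 1 & sleep 1 &")
# Last job in the foreground: the shell lives as long as that job.
last_foreground = shell_lifetime("sleep 1 & sleep 1")
# Sequential: the second job starts only after the first one exits.
sequential = shell_lifetime("sleep 1; sleep 1")
```

With real services that never exit, the sequential form would therefore never reach the second command.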

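TensorFlow Installation Guide

The official Python 3 packages of the TensorFlow (TF) project are built only for Python 3.4. To install TF on Python 3.5, do not use the official pip installation method.

The workaround is as follows:

  1. Download the official wheel package locally: wget https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.8.0-cp34-cp34m-linux_x86_64.whl
  2. Rename the wheel file: mv ~/username/tensorflow-0.8.0-cp34-cp34m-linux_x86_64.whl ~/username/tensorflow-0.8.0-cp35-cp35m-linux_x86_64.whl
  3. Install the wheel manually with pip from the command line: pip install ~/username/tensorflow-0.8.0-cp35-cp35m-linux_x86_64.whl

Installing the GPU version of TF:

  1. In addition to the steps above, installing the GPU version of TF requires confirming in advance that the GPU (graphics card) used for computation has a CUDA Compute Capability of 3.0 or higher. For the mapping between capability levels and card models, see the following link: CUDA Compute Capability (official link)
  2. Installing the CUDA Driver: the CUDA Driver is included in the CUDA Toolkit, so install the CUDA Toolkit directly, following the official NVIDIA documentation. If an unofficial NVIDIA graphics driver was previously installed on the machine, it must be uninstalled manually. CUDA Driver Instructions (official link). Install the CUDA Toolkit (driver and tools included) using the standalone method (that is, an offline install from the runfile).
  3. Installing cuDNN (the GPU computation library): the official download requires registration and approval, so a temporary download location is given here: cuDNN Library (Baidu Netdisk link). After downloading, extract and install the cuDNN files with the following commands: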
tar xvzf cudnn-7.0-linux-x64-v4.0-prod.tgz
sudo cp cuda/include/cudnn.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo cp cuda/lib64/libcudnn* /usr/lib/x86_64-linux-gnu/
sudo chmod a+r /usr/local/cuda/lib64/libcudnn*
sudo chmod a+r /usr/lib/x86_64-linux-gnu/libcudnn*

GPU-Related Problems

  1. Miscellaneous Problems:
    a. Cannot find CUDA cuda.xx.so.4.5 (for example): sudo ldconfig /usr/local/cuda/lib64
    (Updates the linker cache so that the linker can find the shared libraries)
    b. Cannot open CUDNN cudnn.xx (for example): sudo ldconfig /usr/local/cuda/lib64

  2. Common commands:
    lspci | grep -i nvidia
    nvidia-smi -L shows GPU hardware information
    nvidia-smi -a shows GPU usage
    nvidia-smi shows an overall GPU summary (including currently running GPU processes)

  3. GPU numbering
    The GPU numbers shown by nvidia-smi -L are not necessarily the numbers required by the configuration file of a GPU computation library. The configuration file sometimes needs GPUs numbered by count, such as 0, 1, 2,
    whereas the numbers shown by nvidia-smi -L may start from values such as 4, 5, 6.

Work Journal, May 17 – May 27

May 17

  1. Unable to register the docker service inside a docker container.
  2. All development headers provided by the Python-dev package (CentOS 6, possibly Ubuntu as well) should already be included in the Anaconda2/3 distribution.
  3. The floating-point precision-loss error was caused by a missing Python package: nose_parameterized.

May 18

  1. $PATH and $LD_LIBRARY_PATH are different shell environment variables: $PATH is searched for executables, $LD_LIBRARY_PATH for shared libraries.
  2. The "hostname" can be viewed in the file /etc/hostname.

May 19

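  1. A bash login shell reads /etc/profile and then ~/.profile (when neither ~/.bash_profile nor ~/.bash_login exists). ~/.bashrc is read by interactive non-login shells and is commonly sourced from ~/.profile, so in practice the three are executed in that order.
  2. An "alias" in bash is purely a textual replacement for a longer command at the time it is executed.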

May 27

  1. Theano runtime error:
    Initialisation of device gpu failed! Reason=CNMEM_STATUS_OUT_OF_MEMORY
    Solution: disable CNMeM by deleting the following lines from .theanorc:
    [lib]
    cnmem = 1

Python, Java and C++, Part One

Entry Point

Java and C++ have main as the sole entry point (a method in Java, a function in C++). Python has no official definition of, or requirement for, an entry point. In Python, it is common to use a script called main.py as the entry point of the program (the interpreter starts by running main.py).
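
A common convention, assumed here rather than required by the language, is to guard the start-up code of such a script with a __name__ check so that the file can be both run and imported. The function name main and its return value are hypothetical.

```python
import sys

def main(argv):
    """Hypothetical program logic; returns an exit code."""
    args = argv[1:]
    print("arguments:", args)
    return 0

# __name__ is "__main__" only when this file is the script being run,
# not when it is imported as a module.
if __name__ == "__main__":
    main(sys.argv)
```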

Arithmetic Evaluation (Division)

Python 3 by default evaluates division of two integers as float division (1/2 = 0.5). Java, C++, and Scala evaluate it as integer division (1/2 = 0).
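
A short sketch of the Python side. Note that the // operator floors toward negative infinity, whereas integer division in Java and C++ truncates toward zero, so the two only agree for non-negative operands.

```python
true_division = 1 / 2         # float division, the Python 3 default
floor_division = 1 // 2       # integer (floor) division
negative_floor = -7 // 2      # floors toward negative infinity
negative_trunc = int(-7 / 2)  # truncates toward zero, as Java and C++ do
```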

Import System

  1. Python's real equivalent of a Java package is the module (which is actually a .py file). A Python package is equivalent to a Java package of packages (a nested package). Unlike a Python package, a Java package (a namespace) needs to be explicitly declared. Python allows programming at the package level; in Java, a package is just a namespace.

  2. By default, the Java import statement applies to class names in the package namespace: import is followed by the fully qualified name of a class (package.package.....class). The import statement in Python applies not only to class names (from package.module import class), but also to other names defined in the module (functions, constants, etc.). In addition, Python allows importing only the package name (in which case the modules cannot be referenced via the package name unless the corresponding imports are present in the package's "__init__.py" file).

  3. After a module is imported by its full name in Python, it has to be referenced by that full name. In Java, only the class name is used for reference after the import. To alleviate this, Python supports aliases via the "as" keyword in the import statement.

  4. Python supports defining variables and introducing names at the package level, which makes package-level API design possible.
    Example: import tensorflow as tf (all APIs are introduced into the tensorflow namespace by imports in the "__init__.py" file of the tensorflow package)

In contrast, a Java version of the tensorflow API would have to be imported with statements such as: import tensorflow.api.xxx.
(In Java, a package cannot have methods/functions. Instead, all methods are defined in classes and must be introduced by importing classes.)

Python's import mechanism has many inconsistencies, partly due to its nature as an interpreted language.

  1. A wildcard import from a package imports only the names listed in the "__all__" magic variable when it is defined in the package's "__init__.py" file. Without "__all__", it imports every public name (no leading underscore) already bound in the package namespace and does not import submodules automatically. In Java, no such definition of wildcard names is required.

  2. Avoid using the module attribute "__file__" in all cases (due to inconsistency and ambiguity); all module/package path issues should be resolved through the entry point: sys.path[0] or cwd() (one of the two).
    Note: Since Python 3.4, "__file__" is the absolute path of the module, except for "__main__" when the script is run directly with a relative path. Using "__file__" therefore results in different behavior between running a script directly under cwd (using a relative path) and running it indirectly or from an absolute path. Because "__file__" is a module attribute, this difference cannot be mitigated by a wrapper patch.

  3. Good Python API design should collect all important APIs in the package namespace by programming the "__init__.py" file. This avoids explicit imports of all sub-modules and class/function names.
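
The package-level API style described in these points can be sketched with a throwaway package written to a temporary directory. The package name demo_pkg, its module shapes, and every name inside them are hypothetical and exist only for this illustration.

```python
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk: demo_pkg/{__init__.py, shapes.py}
root = Path(tempfile.mkdtemp())
pkg = root / "demo_pkg"
pkg.mkdir()
(pkg / "shapes.py").write_text("def area(w, h):\n    return w * h\n")
(pkg / "__init__.py").write_text(
    "from .shapes import area  # re-export the function at package level\n"
    "VERSION = '0.1'           # a package-level variable\n"
    "__all__ = ['area']        # names exposed by a wildcard import\n"
)

sys.path.insert(0, str(root))
import demo_pkg as dp  # alias, in the style of "import tensorflow as tf"

# Run a wildcard import in a separate namespace to see what it brings in.
namespace = {}
exec("from demo_pkg import *", namespace)
exported = sorted(name for name in namespace if not name.startswith("__"))
```

The caller reaches area through the package alias without importing demo_pkg.shapes explicitly, and the wildcard import is limited to what __all__ lists.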

Class Instantiation

Class instantiation in python does not require the use of new keyword.

Local Scope

An if or loop "block" in Python does not create a new scope: names bound inside it belong to the enclosing function or module scope.
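
A small sketch of this behavior; the function name and the sample input are made up for the illustration.

```python
def first_even(numbers):
    for n in numbers:
        if n % 2 == 0:
            found = n      # bound inside nested for/if blocks
            break
    else:
        found = None       # the loop finished without a break
    # `found` and `n` are still visible here: the blocks opened no new scope.
    return found, n

result = first_even([3, 5, 8, 9])
```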

Identity Test

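  1. In Python, the "is" operator is the identity test: "a is b" is equivalent to id(a) == id(b). The "==" operator is the value equality test. Replacing "==" with "is" is safe only when the right operand is a singleton such as None, True, or False; for other values, including numbers and strings, identity depends on interpreter caching and must not be relied on.

  2. Data types in Java fall into two categories:
    Primitive data types: byte, short, char, int, long, float, double, boolean.
    Comparing them with the double equals sign (==) compares their values.
    Composite data types (classes).
    Comparing them with (==) compares their addresses in memory, so the result is true only when both operands refer to the same object created by new; otherwise it is false. All Java classes inherit from the base class Object, which defines an equals method whose default behavior is to compare object memory addresses. Some library classes, such as String, Integer, and Date, override this method with their own implementation, so equals no longer compares addresses on the heap.
    When composite data types are compared with equals and equals has not been overridden, the comparison is still based on their memory addresses, because Object's equals method itself uses the double equals sign (==); the result is therefore the same as that of (==).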
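
For the Python side of the comparison, a minimal sketch of identity versus value equality; the describe helper is hypothetical.

```python
a = [1, 2, 3]
b = [1, 2, 3]   # equal value, separate object
c = a           # a second name for the same object

same_value = a == b
same_object = a is b
alias_is_same = c is a
ids_agree = (a is c) == (id(a) == id(c))

def describe(value=None):
    # `is None` is the idiomatic test because None is a singleton.
    return "missing" if value is None else "present"
```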

Pass by value vs. pass by reference vs. pass by object reference

Python, Java, and C++ each use their own terminology to describe their object model and argument-passing mechanism.

Note: Object reference means that an object is bound to an identifier (called a variable); changes made through the variable apply to the referenced object as well. (Java: pass the object reference by value.)
In C++, a reference to a variable is treated exactly the same as the variable itself; any changes made to the reference are passed through to the argument.

Python is designed to be "pass by object reference": for primitive and non-primitive types alike, argument passing is done by passing the object reference (which is the id of the object, i.e., its memory address in CPython).

Java is designed to be "pass by value": for primitive types the value is copied, and for non-primitive types the object's reference is copied and passed by value.

C++ has both pass-by-value and pass-by-reference mechanisms. Pass by value applies to arithmetic values, object pointers (a pointer is itself an object), and class instances; it is essentially a copy of the argument. Pass by reference applies to all types, in that all changes to the parameter affect the argument as well.

Note: The effect of C++ pass by reference cannot be reproduced in Java or Python (for both primitive and non-primitive types); that is, a function cannot change what the variable passed as its argument refers to, i.e., it cannot rebind the caller's variable through its parameter.

This is the justification for the description "passing the reference by copying its value".
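
A minimal Python sketch of the distinction; the function names are made up. Mutating the object through the parameter is visible to the caller, while rebinding the parameter is not.

```python
def mutate(items):
    items.append(4)     # changes the object shared with the caller

def rebind(items):
    items = [0]         # rebinds the local name only; the caller is unaffected
    return items

data = [1, 2, 3]
mutate(data)
returned = rebind(data)
```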

Variables

In C++, a variable is a storage location. "int a = 1" stores the value 1 at the address of a.
In Python, a variable is purely an identifier (bound to some object). "a = 1" binds a to the object 1, changing a's object reference (not a reference in the C++ sense).
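
The Python half can be observed directly: assignment moves the name to another object and leaves the previous object untouched. The values below are arbitrary.

```python
a = 1000
old = a          # a second name bound to the same int object
a = a + 1        # binds `a` to a different object; the old object is unchanged

rebound = a is not old
```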

Program Constructions

Java requires that all functions be implemented as methods of a class. Variables take two forms: class members (fields, called properties in JS and attributes in Python), and temporary variables that reside in the local scope of methods.

C++ can have variables and functions defined/used in the global scope.
Python technically can only have variables and functions defined/used in the module scope and in the various local scopes. "Global scope" in Python usually refers to the module scope of the "__main__" module.

Pointer Declaration

In C++, a pointer type is constructed by adding the * suffix to a non-pointer type: int* a = &b;
In golang, a pointer type is constructed by adding the * prefix to a non-pointer type: var a *int = &b

Concurrency and Network IO

Single Threaded Concurrency

Because the CPython implementation has a built-in GIL, the mainstream Python ecosystem is built on the single-threading assumption. All major IO libraries are consequently written in a synchronous manner.

In modern Python, concurrency is achieved through coroutines, which are implemented via generators and the yield statement (previously iterators).

A function that contains a yield statement becomes a generator-based coroutine. Control switches from the master routine (the caller) to the coroutine each time the caller resumes the generator (with next() or send()), and back to the caller at each yield. The yield statement allows the local variables, state, and execution progress of the coroutine to be retained, which makes single-threaded concurrency possible.
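
The switching described here can be sketched with a tiny round-robin scheduler. The worker and run_round_robin names are hypothetical, and the scheduler is a toy stand-in for a real event loop.

```python
from collections import deque

def worker(name, steps, log):
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield  # hand control back to the scheduler; local state is kept

def run_round_robin(tasks):
    queue = deque(tasks)
    while queue:
        task = queue.popleft()
        try:
            next(task)          # resume the coroutine until its next yield
        except StopIteration:
            continue            # the coroutine finished; drop it
        queue.append(task)

log = []
run_round_robin([worker("a", 2, log), worker("b", 3, log)])
```

The two workers interleave on a single thread, each resuming exactly where its loop counter left off.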

Single-threaded languages have an urgent need for asynchronous IO libraries: in a multi-threaded environment, the overall program stays responsive even when the IO of a specific thread is blocked, whereas in a single-threaded environment one blocked IO call stalls the whole program. The extra burden for single-threaded languages such as Python and JS is therefore the requirement of an asynchronous implementation of the IO library. The lack of a standard IO library is the reason Ryan Dahl chose JS as the implementation language of NodeJS, largely so that asynchronous IO could be built in from the bottom up. Note that NodeJS is a runtime environment packaged with the JavaScript V8 engine rather than a new language. In NodeJS, all IO is performed in the default event loop (no event loop needs to be defined or declared explicitly).

The Limitation of Single-threaded Concurrency

Single-threaded concurrency can only solve one type of blocking: IO blocking. This is the emphasis of NodeJS and Python tornado. Even in an environment where all IO is non-blocking, a computationally intensive operation will block the entire server. For this reason, a single-threaded non-blocking IO server must, and can only, handle computationally intensive operations through RESTful HTTP/RPC calls to other service providers (which should themselves be non-blocking). In this chain of service requests and providers, all services after the first call from the client should be non-blocking; otherwise the overall service is blocked.

Single-threaded non-blocking IO concurrency can only save CPU resources for the HTTP server (the nodejs/tornado server that only handles HTTP requests), in the sense of one thread vs. multiple threads (some of which may be suspended). Such a server can only handle tasks that, besides IO operations, are not CPU-intensive enough to block the whole IO chain (light tasks).

Blocking/Non-blocking v.s. Synchronous/Asynchronous

Synchronous/asynchronous describes which thread processes the IO. In synchronous mode, the main thread processes the IO itself, so all IO must be processed one after another (referred to as "blocking") and the main thread cannot do anything until the IO request returns. In asynchronous mode, the main thread hands the IO request to a child thread (or coroutine), so multiple IO requests can be in progress simultaneously (non-blocking). Synchronous/asynchronous describes the mechanism of the request-processing logic; the terminology is commonly applied from a service provider's viewpoint.

Blocking/non-blocking describes how requests are processed; the terminology is commonly applied from an IO request caller's viewpoint. Blocking means the caller is suspended until the service returns.

All kinds of Blocking

Blocking in a single-threaded language includes every execution (computational or non-computational) that suspends the current (main and only) thread. Even a blocking sleep(1) call in different coroutines (Python) will block each other, because there is only one real thread running at any given time. In this sense, single-threaded languages tend to be very sensitive to all "blocking" operations.

Single-threaded concurrency cannot handle CPU blocking (running a deployed deep-learning network, syntactic parsing, etc.).
When dealing with blocking from other sources (network IO, disk IO, etc.), an event loop is more efficient in terms of CPU resources. An event loop is most suitable for a lightweight HTTP server.
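
The sleep example can be reproduced with the standard asyncio event loop, used here as a stand-in for tornado's loop (an assumption of this sketch; the function names are also made up). Two coroutines that call the blocking time.sleep run back to back, while two that await the non-blocking asyncio.sleep overlap.

```python
import asyncio
import time

async def blocking_task():
    time.sleep(0.2)            # blocks the only thread; nothing else can run

async def cooperative_task():
    await asyncio.sleep(0.2)   # yields to the event loop while waiting

async def timed(task_factory):
    """Run two copies of the task on one event loop and time them."""
    start = time.monotonic()
    await asyncio.gather(task_factory(), task_factory())
    return time.monotonic() - start

blocking_elapsed = asyncio.run(timed(blocking_task))
cooperative_elapsed = asyncio.run(timed(cooperative_task))
```

The blocking pair takes roughly the sum of the two sleeps, and the cooperative pair roughly the length of one.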

Solution

  1. Blocking/Non-blocking Http Client -> Async Tornado Http Server (Thrift: AsyncClient) -> (Thrift) TThreadPoolServer
  2. Blocking/Non-blocking Http Client -> Actor-based Concurrency Playframework Http Server (Thrift: SyncClient) -> (Thrift) TThreadPoolServer

Note: a service requester can use blocking IO; only service providers need to handle concurrent requests (through multi-threading or an event loop).

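Building a Simple Image Recognition Web Application with TensorFlow

TensorFlow officially provides a local demo application for CNN image recognition (original link). Here that application is lightly modified and wrapped into a web application to demonstrate image recognition.

Modifying the official classify_image script

The official demo script is not well modularized. To wrap the business logic it implements (image recognition) into a web server, the script needs to be modified.

  1. Configuration module
    First, because the tf.app.flags.FLAGS module is not compatible with the tornado framework, it is replaced with a simple config class.

  2. Model download module
    Because of the particular network environment in mainland China, downloading the model file at startup serves no practical purpose (and would block the program), so it is removed. All files the application needs are loaded directly from local paths by default.

  3. References to classes and objects
    All code that directly references module-level global objects is changed to reference them through function parameters, moving the code style toward a functional one where possible.

  4. The entry point for local testing is changed from tf.app.run() to run_inference_on_image(). The path of the test image is changed under if __name__ == "__main__", which avoids modifying the module code.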
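
Steps 1 and 3 can be sketched as follows. This is not the code from the repository: the Config fields are assumed to mirror the flags of the demo script, and describe_request is a hypothetical helper that only shows configuration being passed as a parameter instead of read from a global.

```python
class Config:
    """Plain attribute container standing in for tf.app.flags.FLAGS."""

    def __init__(self, model_dir="model", image_file="", num_top_predictions=5):
        self.model_dir = model_dir                      # local directory holding the model files
        self.image_file = image_file                    # default path of the image to classify
        self.num_top_predictions = num_top_predictions  # how many labels to report

def describe_request(config, image_path):
    """Hypothetical helper: receives the config explicitly, no module-level global."""
    return {
        "model_dir": config.model_dir,
        "image": image_path or config.image_file,
        "top_k": config.num_top_predictions,
    }

config = Config(model_dir="/tmp/imagenet")
request = describe_request(config, "cat.jpg")
```

Because nothing is parsed from the command line at import time, such a class can be created inside a web handler without interfering with the server's own argument handling.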

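Building the Tornado Image Upload Application

  1. Image upload
    class UploadHandler(tornado.web.RequestHandler) implements the naming and local storage of uploaded files.

  2. Image recognition
    The run_inference_on_image function performs the recognition of the image.

  3. Translating the recognized labels
    Because the image category names built into the tensorflow model are in English, the Baidu Translate API is called to translate them into Chinese automatically.

Without further ado, here is the code:
GitHub link

Launch command:
python upload.py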