Apache Spark Note

  1. Submitting a Python script with third-party dependencies via spark-submit
    First off, I'll assume that your dependencies are listed in requirements.txt. To package and zip the dependencies, run the following at the command line:

pip install -t dependencies -r requirements.txt
cd dependencies
zip -r ../dependencies.zip .
Above, the cd dependencies command is crucial to ensure that the modules are in the top level of the zip file. Thanks to Dan Corin's post for the heads-up.

Next, submit the job via:

spark-submit --py-files dependencies.zip spark_job.py
The --py-files directive sends the zip file to the Spark workers but does not add it to the PYTHONPATH (source of confusion for me). To add the dependencies to the PYTHONPATH to fix the ImportError, add the following line to the Spark job, spark_job.py:
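
A likely candidate for that line, assuming the job keeps its SparkContext in a variable named `sc`, is `sc.addPyFile("dependencies.zip")`, which ships the archive and places it on the import path. The sketch below shows only the underlying mechanism, without Spark: Python can import modules straight from a zip archive once the archive itself is on `sys.path`, provided the modules sit at the top level of the archive. The module name `demo_dependency` and both helper functions are hypothetical and exist only for illustration.

```python
import importlib
import os
import sys
import tempfile
import zipfile


def build_dependency_zip(zip_path, module_name, source):
    """Write one module at the top level of the archive,
    which is what `cd dependencies && zip -r ../dependencies.zip .` produces."""
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr(module_name + ".py", source)


def import_from_zip(zip_path, module_name):
    """Put the archive on sys.path, then import the module from it."""
    if zip_path not in sys.path:
        sys.path.insert(0, zip_path)
    importlib.invalidate_caches()
    return importlib.import_module(module_name)


# Hypothetical stand-in for a real third-party package.
workdir = tempfile.mkdtemp()
zip_path = os.path.join(workdir, "dependencies.zip")
build_dependency_zip(
    zip_path,
    "demo_dependency",
    "def greet():\n    return 'hello from zip'\n",
)

demo = import_from_zip(zip_path, "demo_dependency")
print(demo.greet())
```

If the modules were nested one directory down inside the archive (the result of zipping the dependencies folder itself rather than its contents), the import above would fail, which is why the cd step matters.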


  2. A JOIN clause is used to combine rows from two or more tables, based on a related column between them.
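
A minimal sketch of what a JOIN does, using Python's built-in sqlite3 module so that it runs without a Spark cluster. The tables `customers` and `orders` and their rows are made up for illustration; `customer_id` plays the role of the related column.

```python
import sqlite3

# In-memory database with two small, hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, item TEXT);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Linus');
    INSERT INTO orders VALUES (10, 1, 'keyboard'), (11, 1, 'mouse'), (12, 3, 'monitor');
""")

# Inner join: keep only the row pairs whose customer_id matches in both tables.
joined_rows = conn.execute("""
    SELECT customers.name, orders.item
    FROM customers
    JOIN orders ON orders.customer_id = customers.customer_id
    ORDER BY orders.order_id
""").fetchall()

print(joined_rows)
conn.close()
```

Only Ada's two orders appear in the result: Linus has no orders, and order 12 refers to a customer_id that does not exist in customers, so a plain (inner) JOIN drops both.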

Dgraph Note

  1. dgraph live -c
    The default value of -c (the number of concurrent transactions) is 100, which significantly slows down data import.

Development Environment Version Control

  • OS

    Ubuntu 14.04.2 LTS Kernel: Linux 3.13.0-52-generic

  • Python

    Python Distribution:

  • Compiler

    GCC version 4.8.2 (Ubuntu 4.8.2-19ubuntu1)

  • Scala


  • Apache Spark

    Version: 1.5.2 with Java 1.8.0_40

  • Java
    Java 1.8.0_40, JRE (build 1.8.0_40-b25), JVM (64-bit, build 25.40-b25, mixed mode)

  • Gogs

Git Study Note

  1. git add
    Files that have been added to the staging area once are tracked from then on, but later modifications to them still have to be staged again with git add before they can be committed.

MySQL Administration Memo

Database Backup

  1. Command used to back up a database on Linux

mysqldump -u root -p edusoho > edusoho.sql

  2. MySQL Directory Layout

If MySQL is installed from an RPM package, rpm places the .lib, .h and all other types of files into their corresponding system directories.

If MySQL is installed by unpacking a tar archive, the MySQL directory layout is preserved.

  3. MySQL Commands

mysql> show processlist;

mysql> show status;

Software Engineering Study Notes

Productivity vs. Complexity

Over-engineering happens when the productivity gained is not worth the effort spent and the complexity added to the system.

Memory Safety

Memory safety is a concern in software development that aims to avoid bugs in random-access memory (RAM) access, such as buffer overflows and dangling pointers, which cause security vulnerabilities.

Type Safety

Type safety is the extent to which a programming language discourages or prevents type errors.

Linux Tool Note

  1. TCP/UDP Connection Test
    nc -zv {host} 25331

  2. Show all processes
    ps aux | less

  3. Extract a gzip-compressed tar archive
    tar -zxvf {file.tar.gz}

  4. List all hardware information

  5. Grep Multiple Patterns
    grep -E '123|abc' filename

  6. Show CPU Information
    cat /proc/cpuinfo

  7. Check if a Process Exists
    ps -ef | grep deplearning

  8. Server Benchmark
    ab -c 1000 -n 50000 http://localhost:8080/

  9. View the system log
    tail -f /var/log/syslog

  10. Show the size of a directory
    du -hs /path/to/directory

Java Multi-threading

How to Write Multi-threaded Code?

Multi-threaded programming in Java is built on the Thread object. There are two ways to define the code that a thread runs:
1. Declare a class as a subclass of Thread and override its run method.
2. Declare a class that implements the Runnable interface and implement its run method. (Recommended)

To create a new thread with the second approach, construct a Thread object and pass the Runnable object as the first argument. Calling thread.start() starts the new thread, and thread.join() blocks the caller until that thread finishes, which lets threads coordinate with each other.


Static class members are shared by all threads. Individual reads and writes are atomic for most types, but compound operations such as i++ are not. The synchronized keyword ensures that only one thread at a time executes a given method or block, so threads run it sequentially.

Parameters and local variables inside a method definition are independent for each thread.

Start a Docker Container with Multiple Bash Command Execution

Starting a Docker container and launching multiple background processes (services) directly through docker run is impossible, because docker run does not accept multiple commands as its argument. Running multiple commands is only viable through /bin/bash -c "command1; command2". However, a container by design exits as soon as the /bin/bash process finishes. Because of this, service-launching commands such as "python web.py" must not all end with the background operator "&", as in /bin/bash -c "python web1.py & python web2.py &". In that case the /bin/bash process finishes immediately and the container exits.

Theoretically, a command such as /bin/bash -c "python web1.py & python web2.py" should avoid this problem, since the /bin/bash process is held open by web2.py running in the foreground within the container, but this proved to be wrong in practice for unknown reasons.

It is worth mentioning that a command such as /bin/bash -c "python web1.py; python web2.py" keeps the container running, but web2.py never launches, because the ";" separator makes bash wait for web1.py to exit first.

1. /bin/bash -c "command1; command2" is rarely useful
2. docker run xxx /bin/bash -c "command" can be replaced by docker run xxx command
3. Multiple services should be split into different containers/images (containers and services should maintain a one-to-one mapping)
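
The shell behaviour behind these notes can be observed outside Docker. The sketch below is only an illustration under stated assumptions: it uses /bin/sh instead of /bin/bash, `sleep 1` as a stand-in for a long-running service such as web1.py, and a hypothetical helper `run_shell` that measures how long the shell process itself stays alive. Inside a container that shell would be the main process, so its lifetime is the container's lifetime.

```python
import subprocess
import time


def run_shell(command):
    """Run `command` via /bin/sh -c and return how long the shell itself lived."""
    started = time.monotonic()
    subprocess.run(["/bin/sh", "-c", command], check=True)
    return time.monotonic() - started


# Every command backgrounded with "&": the shell has nothing to wait for and exits at once.
all_background = run_shell("sleep 1 >/dev/null 2>&1 & sleep 1 >/dev/null 2>&1 &")

# Last command in the foreground: the shell lives as long as that command does.
last_foreground = run_shell("sleep 1 >/dev/null 2>&1 & sleep 1")

# Commands separated by ";": the second starts only after the first has exited.
sequential = run_shell("sleep 1; sleep 1")

print(f"all background:  {all_background:.2f}s")
print(f"last foreground: {last_foreground:.2f}s")
print(f"sequential:      {sequential:.2f}s")
```

The first case corresponds to the container that exits immediately, and the third shows why a service that never returns blocks everything after the ";". The second case matches the theoretical expectation in the notes and does not explain why that variant failed inside the container.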