M41 Highway

Data science and software engineering blog


Install RabbitMQ 3.3 on CentOS 7

This is a step-by-step guide to installing RabbitMQ for the series of posts about AMQP messaging. RabbitMQ supports most Linux distributions, Mac OS, and MS Windows. I will demonstrate the installation on CentOS 7 as an example.

1. Install the compiler toolchain and related packages, if necessary

# sudo yum install gcc glibc-devel make ncurses-devel openssl-devel autoconf

2. Install the latest EPEL and Remi repositories

# wget http://dl.fedoraproject.org/pub/epel/7/x86_64/e/epel-release-7-1.noarch.rpm
# wget http://rpms.famillecollet.com/enterprise/remi-release-7.rpm
# sudo rpm -Uvh remi-release-7*.rpm epel-release-7*.rpm

3. Install Erlang

# wget http://packages.erlang-solutions.com/erlang-solutions-1.0-1.noarch.rpm
# sudo rpm -Uvh erlang-solutions-1.0-1.noarch.rpm
# sudo yum install -y erlang

Type “erl” to verify that Erlang is installed correctly.

4. Install RabbitMQ

# wget http://www.rabbitmq.com/releases/rabbitmq-server/v3.3.5/rabbitmq-server-3.3.5-1.noarch.rpm

Import the RabbitMQ signing key for package verification:

# rpm --import http://www.rabbitmq.com/rabbitmq-signing-key-public.asc

Install the server:

# yum install rabbitmq-server-3.3.5-1.noarch.rpm

Enable the web management UI plugin:

# sudo rabbitmq-plugins enable rabbitmq_management

Change the ownership of the data directory:

# chown -R rabbitmq:rabbitmq /var/lib/rabbitmq/

Start the server (*1):

# /usr/sbin/rabbitmq-server

5. Set up an admin user account
In /usr/sbin, create a new user “mqadmin” (with password “mqadmin”):

# rabbitmqctl add_user mqadmin mqadmin

Assign the administrator role:

# rabbitmqctl set_user_tags mqadmin administrator

Grant full permissions on the default vhost:

# rabbitmqctl set_permissions -p / mqadmin ".*" ".*" ".*"

Now you can access the web admin UI at http://host:15672/ (*2).

(*1) Starting the server with “service rabbitmq-server start”, as documented in the official manual, may fail; see https://groups.google.com/forum/#!topic/rabbitmq-users/iK3q4GLpHXY for a workaround.

(*2) The default guest/guest user account can only log in via localhost.


Version Information

Software    Version    Reference
CentOS      7          http://mirror.centos.org/centos/
Erlang      OTP 17     http://www.erlang.org/download.html
RabbitMQ    3.3.5      http://www.rabbitmq.com/install-rpm.html


Apply Dependency Control on AngularJS

ng require



Thanks to the JS community for the minify and uglify tools, which save much bandwidth for web apps and mobile apps. Uglifying and concatenating JavaScript files makes downloads more efficient. Meanwhile, modularization and dependency control become important as application complexity grows. I hesitated to apply dependency control to an AngularJS application, since the two ideas seem to come from opposite ends, until I saw the talk by Thomas Burleson about Angular and RequireJS. As a frontend developer, I like DI because it really makes code cleaner. Why not take one more step forward and incorporate dependency control, just like the way we code in backend applications? Some people argue that it is not worth it, because a single concatenated file in production means you never have to manage the dependencies at all. But I see its value when we need to incorporate quite a few JavaScript plugins into an AngularJS webapp at development time.

If you are reading up to this point, I assume you have some knowledge of AngularJS and/or RequireJS and intend to integrate the two, just like me, but there is not much practical information on the web. RequireJS is easy to use. Rather than applying it directly to an existing project of fair complexity, I will take the approach of integrating it with a seed project from scratch, and I also take this chance to share how to get it done.

First of all, I use the popular Phonecat tutorial as an example to work with RequireJS. If you haven’t tried it before, please take a look at AngularJS.org. You can get all the source code on my GitHub for reference.

Modular design is good, but the lack of dependency control is bad, particularly when an Angular webapp becomes large and complex. The program breaks if the modules are not loaded in the expected order. The following diagram shows the dependencies of the phonecat app. The phonecat module runs successfully only if all the other modules have been downloaded.



So you would prefer screen B to screen A, as it loads the modules in the order you expect.


Screen A: JavaScript files are requested at the same time, with no guarantee of when each completes.



Screen B: JavaScript files are downloaded in the sequence of the dependency order.


Okay, the idea is simple, so let’s make the magic happen. Have a look at the project structure shown in the following image. First, the js folder is a typical AngularJS project template with modules named animations, app, controllers, directives, filters, and services. Second, main.js is the config file that defines the dependencies between the modules. In the lib folder, you need the RequireJS library. Third, I need to modify index.html to add a bootstrap point for the RequireJS config.


In main.js, you need to 1) define each RequireJS module with an id and its file path in the “paths” block — this includes external JavaScript libraries such as jQuery and AngularJS, as well as customized modules, in this case the Angular modules; 2) define the inter-dependencies between the RequireJS modules in the “shim” block; and 3) instead of including the scripts directly, define a callback to trigger the Angular bootstrap once the app module is loaded.
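As a rough sketch (the module ids and file paths here are illustrative assumptions, not the exact phonecat layout), main.js might look like this:

```javascript
// main.js - RequireJS configuration (illustrative sketch)
require.config({
  paths: {
    // external libraries
    'jquery': 'lib/jquery/jquery.min',
    'angular': 'lib/angular/angular.min',
    // the app's own Angular modules
    'app': 'js/app',
    'controllers': 'js/controllers',
    'services': 'js/services'
  },
  shim: {
    // angular is not an AMD module, so declare its dependency explicitly
    'angular': { deps: ['jquery'], exports: 'angular' },
    'app': { deps: ['angular', 'controllers', 'services'] }
  }
});

// bootstrap Angular manually once the app module (and its deps) are loaded
require(['angular', 'app'], function (angular) {
  angular.element(document).ready(function () {
    angular.bootstrap(document, ['phonecatApp']);
  });
});
```

This is a config fragment only; the actual paths, shims, and module name must match your own project.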


Fourth, the ng-app attribute is removed so that angular.bootstrap initializes the webapp. Fifth, you don’t need script includes for all the JavaScript files anymore; instead you only need one script include that defines the path of the config file and the RequireJS library.



The setup is done. Now we can inject the dependencies in the program (in a similar way to importing classes in Java) by wrapping the code in a define function with a callback. The following five screenshots show how easy it is, but be careful not to mix up the RequireJS module ids and the Angular module ids.






That’s it. I hope you enjoyed it and find it helpful.


Customize your own favicon in Kraken JS

Kraken JS is built on top of Express JS and Connect JS. It inherits the Connect JS favicon by default, and you may spend some time getting rid of it, just like me. Here is the workaround; this is expected to become a built-in feature in a later release. The author suggested a solution in https://github.com/paypal/kraken-js/pull/106, but I would put it in app.requestStart instead.

In index.js, remove the favicon middleware and add your own favicon:

app.requestStart = function requestStart(server) {
    server.stack.some(function (middleware, idx, stack) {
        if (middleware.handle.name === 'favicon') {
            // remove the inherited Connect favicon and register our own
            stack.splice(idx, 1);
            server.use(express.favicon(__dirname + '/public/favicon.ico'));
            return true;
        }
        return false;
    });
};


Synchronous control of iteration containing callback execution

You love Node.js for its non-blocking programming model, which gives your software better throughput. Sometimes, though, you need step-by-step execution — for example a member registration that includes some database lookups and processing, followed by persisting a new document of data and firing an activation email. You may chain up the execution recursively in callback methods, or, if you don’t feel good about that style, you may chain them up with the node async series function. That works well with pre-defined execution blocks. But it doesn’t make you happy when an iteration contains a series of asynchronous executions and you need to make sure the results arrive in the same order as the invoking sequence.

The code in the first image fires the asynchronous function (for example, some disk I/O) in the sequence of the queue (simply governed by the for-loop). The mockIO function simulates an I/O process that finishes in a random time (up to 2 seconds).


However, the results do not come back in the order you expect.


A recursive approach to triggering the callback function helps in this situation. We use the same mockIO function and queue; it passes control to the callback function, and the results arrive in the invoking sequence.


This result is what you want!


Most people argue that we should not write blocking code in Node.js, because it is a single-threaded process and blocking degrades the performance of the whole process. I agree with this point without doubt. This is a demonstration of one way to program a series of asynchronous functions whose parameters may depend on the result of the previous execution. Happy coding!
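Since the code above lives in screenshots, here is the recursive pattern sketched in Python for clarity (the original post uses Node.js; mock_io is a stand-in that calls back immediately rather than after a random delay):

```python
results = []

def mock_io(item, callback):
    # stand-in for an asynchronous I/O call that would normally finish at a
    # random time; here it calls back immediately so the ordering logic is
    # easy to follow
    callback(item * 10)

def process_queue(queue, index=0):
    # fire the next mock_io only after the previous callback has returned,
    # so results always arrive in the invoking sequence
    if index == len(queue):
        return
    def on_done(result):
        results.append(result)
        process_queue(queue, index + 1)
    mock_io(queue[index], on_done)

process_queue([1, 2, 3, 4])
print(results)  # → [10, 20, 30, 40]
```

The key point is the same as in the Node version: each invocation is chained from the previous callback, not started from a loop.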



The influence behind a stock market using social network analysis



This analysis aims to reveal the influence of enterprises in the Hong Kong economy. Hong Kong is one of the remarkable financial centers in the world, and the Hong Kong economy is highly coherent with the stock market, in which the Hang Seng Index (HSI) is the most representative because its 50 constituent companies account for about 60% of the capitalization of the Hong Kong Stock Exchange. This study uses the 50 constituent companies as of 29th November 2013, published on the Hang Seng Indexes website (http://www.hsi.com.hk/HSI-Net/HSI-Net). Each publicly listed company typically has several major shareholders who own a significant portion of the shares, and these major shareholders influence the strategies of the listed company. If an organization is a major shareholder in several listed companies simultaneously, it creates a sufficient condition to exert its influence on the stock market and even the economy. This study will find out whether there is any such super power influencing the Hong Kong stock market, and to what degree it exerts that influence.


Source code

All the source code for the analysis is available in my GitHub repository (https://github.com/m41highway/sna). The dataset is merged and cleaned with a piece of Python script using NumPy and Pandas, which generates a GML format file. The GML file is then analyzed using Gephi.

File                           Purpose
hsindex.csv                    Dataset
create_network.py              Data merge, clean, GML generation
hsi.gml                        Network file in GML format
EmpiricalNetworkAnalysis.pdf   Analysis results
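To give a flavor of what create_network.py produces, here is a minimal sketch of writing a shareholder–company edge list as GML. The two edges below are hypothetical placeholders; the real script derives nodes and edges from hsindex.csv with Pandas:

```python
# hypothetical shareholder -> company edges; the real data comes from hsindex.csv
edges = [
    ("HSBC Holdings", "JPMorgan Chase & Co."),
    ("Hang Seng Bank", "HSBC Holdings"),
]

def to_gml(edges):
    # assign a numeric id to every distinct node
    nodes = {}
    for src, dst in edges:
        for name in (src, dst):
            nodes.setdefault(name, len(nodes))
    lines = ["graph ["]
    for name, nid in nodes.items():
        lines += ["  node [", "    id %d" % nid, '    label "%s"' % name, "  ]"]
    for src, dst in edges:
        lines += ["  edge [", "    source %d" % nodes[src],
                  "    target %d" % nodes[dst], "  ]"]
    lines.append("]")
    return "\n".join(lines)

print(to_gml(edges))
```

A file written this way can be opened directly in Gephi for the network analysis step.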





Classification using Random Forest Approach

Kaggle is really a great place to learn data science in a practical way. Today I joined a competition (tutorial) and submitted my first prediction using a Random Forest classifier. I scored 0.77512 on my initial try and was really surprised by the efficiency of the scikit-learn library.

Here’s my Python code to make the prediction. The training data and the testing data are massaged so that string values have been converted to integers, and the output is a single list of survived values. (I removed the passenger id from all the files because it is not used in the analysis, so remember to add the passenger id back when making a submission.) There are some discrete features in the data, e.g. sex, pclass, sibsp, parch, etc., and some continuous features, e.g. age, fare. I think a key improvement will be fine-tuning the boundaries of the continuous features. Anyway, it is a beginning; I need more insight into the data set to get a higher score. Finally, note that the submission validation has changed lately too.

from sklearn.ensemble import RandomForestClassifier
import csv as csv, numpy as np

# read the massaged training data; the first column is the 'survived' label
csv_file_object = csv.reader(open('./train_sk.csv', 'rb'))
train_header = csv_file_object.next()  # skip the header
train_data = []
for row in csv_file_object:
    train_data.append(row)
train_data = np.array(train_data)

# train on all columns except the first, with the first column as the label
Forest = RandomForestClassifier(n_estimators=100)
Forest = Forest.fit(train_data[0::, 1::], train_data[0::, 0])

# read the massaged test data
test_file_object = csv.reader(open('./test_sk.csv', 'rb'))
test_header = test_file_object.next()  # skip header row
test_data = []
for row in test_file_object:
    test_data.append(row)
test_data = np.array(test_data)

# predict and write the results
output = Forest.predict(test_data)
output = output.astype(int)
np.savetxt('./output.csv', output, delimiter=',', fmt='%d')
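The “massaging” step mentioned above (converting string values to integers) can be sketched as follows. The column names and integer codes here are illustrative assumptions, not the exact preprocessing used for the submission:

```python
# Illustrative sketch: map string-valued Titanic features to numbers so the
# classifier can consume them. Column names and codes are assumptions.
SEX_CODES = {'male': 0, 'female': 1}
EMBARKED_CODES = {'S': 0, 'C': 1, 'Q': 2}

def massage_row(row):
    # row: dict of raw string fields from one CSV line
    return [
        int(row['pclass']),
        SEX_CODES[row['sex']],
        float(row['age']) if row['age'] else -1.0,   # flag missing ages
        EMBARKED_CODES.get(row['embarked'], -1),     # flag unknown ports
    ]

print(massage_row({'pclass': '3', 'sex': 'female', 'age': '', 'embarked': 'S'}))
# → [3, 1, -1.0, 0]
```

How missing values are flagged (here −1) is one of the tuning knobs that affects the final score.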


Canopy is a very well-equipped IDE for Python programmers. Even with its free version, you are already offered NumPy, SciPy, and many more useful libraries. Today I needed to use the Random Forest classifier from scikit-learn, but that requires an upgrade to the Canopy basic version, so I had to install it on top of the Canopy IDE myself. Here are the steps.

I am using the free version of Canopy with Python 2.7.3 installed on CentOS 6.4. First, install the gcc compiler using the “yum” command if you don’t have it installed.

# yum install gcc-c++

Second, if you don’t have pip (the Python package installer) installed, download and install it.

# wget http://python-distribute.org/distribute_setup.py 
# sudo python distribute_setup.py 
# wget https://github.com/pypa/pip/raw/master/contrib/get-pip.py 
# sudo python get-pip.py

Verify that it is installed successfully by running the “pip” command.

Third, install scikit-learn as follows.

# pip install -U scikit-learn

Verify that it is installed correctly.

# python -c "import sklearn; sklearn.test()"

Finally, enter your Canopy interactive prompt and you can use this great library. For details of installation on other OSes, please refer to the scikit-learn docs.


List of good books in Big Data

Data science is an emerging field comprising expertise across different domains. Here’s a list of awesome books I highly recommend to individuals at different levels.



“Software Performance and Scalability – A Quantitative Approach” (Book information)

Author: Henry H. Liu

Performance tuning is sometimes heuristic, particularly in large-scale Internet systems. If you wish to plan better and get more insight into the performance characteristics of complicated systems, here’s the way to go.



“Algorithms of the Intelligent Web” (Book information)

Authors: Haralambos Marmanis and Dmitry Babenko

This is a very practical book for learning machine learning, data clustering, and other data science topics in a Java programming way. It is especially good for software engineers with a Java background as introductory material for getting involved in Big Data.




“Python for Data Analysis” (Book Information)

Author: Wes McKinney

There are some very good libraries for mathematics and statistics in the Python family. If you have a programming background, you will love Python for its efficiency. This is a very useful book for mastering Python from an analysis perspective.


(I will keep updating the list)



A movie recommender based on the similarity of text contents

Recommendation functions are gaining popularity on many websites, and they are useful for increasing site traffic. There are many different implementations of a recommendation engine, from item-based, user-based, item-user-based, and content-based collaborative filtering, to naive Bayes algorithms. One of the core concepts in collaborative filtering is the measurement of similarity between entities. The degree of similarity is usually computed mathematically as the distance between two vectors. Here I will show you the basic idea (it is also at the core of the scoring implemented in Lucene) of using the cosine similarity to find the most similar movie in a pool.

In the demonstration, each movie contains attributes such as title, description, category (genre), director, actors, etc. (refer to the data). First of all, we have to extract the important information that best represents the specific movie. We break the whole passage/short sentences down into keywords by tokenization (with the help of IKAnalyzer in this case, because of the Chinese language; refer to the tokenizer) so that we can analyze and extract the most important information from the entity.

TF/IDF is a standard way to calculate the importance of each keyword, by counting the frequency of a keyword and normalizing it over the whole data set. For example, movie “A” has keywords “children”, “zoo”, “elephant” and movie “B” has keywords “comedy”, “children”, “zoo”. The comparison of every two movies proceeds one pair at a time. All the unique keywords from the two movies form an array, i.e. [“comedy”, “children”, “zoo”, “elephant”]. We initialize a vector [0, 0, 0, 0] for each of movies “A” and “B”, corresponding to the array. We check the keywords of movie “A” against the array; if a keyword appears in the array, we assign the value 1 to the corresponding position in the vector. We do the same for movie “B”, and we get [0, 1, 1, 1] for movie “A” and [1, 1, 1, 0] for movie “B”. The similarity between “A” and “B” is then the cosine of the angle between the vectors [0, 1, 1, 1] and [1, 1, 1, 0] in a four-dimensional space, which is the dot product of the two vectors divided by the product of their Euclidean norms:


(0 + 1 + 1 + 0) / (√3 × √3) = 2/3 ≈ 0.66667

The value lies between 0 (completely different) and 1 (identical).
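The same calculation can be sketched in a few lines of Python (the demo itself is Java/Lucene-based; the keyword sets are the toy example from the text, and the dimension order may differ from the text but the similarity is the same):

```python
import math

def keyword_vectors(keywords_a, keywords_b):
    # the unique keywords of the two movies form the vector dimensions
    dims = sorted(set(keywords_a) | set(keywords_b))
    vec_a = [1 if k in keywords_a else 0 for k in dims]
    vec_b = [1 if k in keywords_b else 0 for k in dims]
    return vec_a, vec_b

def cosine_similarity(vec_a, vec_b):
    # dot product divided by the product of the Euclidean norms
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (norm_a * norm_b)

a = ["children", "zoo", "elephant"]
b = ["comedy", "children", "zoo"]
vec_a, vec_b = keyword_vectors(a, b)
print(round(cosine_similarity(vec_a, vec_b), 5))  # → 0.66667
```

In a real system the 0/1 entries would be replaced by TF/IDF weights, but the cosine formula stays the same.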


You may get the source to explore the details. Here are some images to illustrate the process. The first image shows the words after tokenization.


This image shows the scores from the TF/IDF process.


The last image shows the resulting similarity of the movie “Lord of the Rings” to the other movies in the pool.


The code discussed above is mainly for demonstration purposes for the development team. There are a few shortcomings if it is applied directly in a production system. First, the similarity calculation is O(n^2), which does not scale if you have, say, 100,000 movies, so you should consider a parallel processing approach like Hadoop. Second, the vector calculation is implemented in a for-loop, which can be optimized by using a specialized vector library, or implemented efficiently with Python NumPy, say. Third, we selected only a few attributes (dimensions) in the demonstration. In reality there is no limit to the number of attributes used in the calculation, but high dimensionality suffers from the curse of dimensionality.


To achieve High Availability with two commodity computers

This is the English version of my post “用兩台家用電腦實現高可用性部署” (“Achieving a high availability deployment with two home computers”), dated 3rd September 2013.

High availability has become an implicit non-functional requirement as enterprises pay rising attention to their data. High availability is the ability of the whole system to resist failures; it is not achieved by just a piece of software or by deploying a load-balancing device. There is no perfect solution for high availability because, with a limited budget, you can’t solve all the problems which may arise from all sorts of sources, including networks, hardware, software, and even the business logic. Take for example a B2B platform whose application servers are deployed redundantly. If one of the application servers fails, user requests can be routed to the healthy servers, and thus the service can be regarded as highly available. Another example is a master-slave database deployment that ensures minimal data loss during failures. A UPS plus a backup power supply is another good example, securing the electricity supply. Using two broadband suppliers is also a reasonable way to prevent a single point of failure at the network level. Still, there is no perfect solution for high availability! Suppose you have done all four of these deployments on your platform. Guess what happens if the payment service provider suffers a blackout for a couple of hours. Guess what happens if a fire at your neighbor’s luckily spares all lives, but the flooding damages both your main and backup power supplies. It is a tragedy, obviously, because either failure is a single point of failure that is not in your budget book! With a large enough budget, you could minimize most of the risks. With a limited budget, you should eliminate the riskiest problems and develop an emergency plan for the rest together with your stakeholders.

High availability is a mature feature of both hardware and networking devices nowadays: redundant components are ready to come online once a failure is detected. In database systems, the complexity of handling data synchronization, data integrity, and data recovery leads to a more complicated high availability mechanism. It makes sense to copy data in real time to achieve high availability. One implementation copies the transaction log files from the primary machine to the slave machine: if the primary database fails, the slave database takes over and replays the transaction logs to roll forward to the latest image. This approach is easy, but there are three problems. Firstly, it needs human intervention to change the configuration when a failure occurs. Secondly, it takes time (proportional to the volume of data) to replay the logs; the allowed downtime is only 52.56 minutes a year if the system claims 99.99% availability, so this may not be a very nice solution for a large-scale system. Thirdly, it is not cost-effective to buy a standby machine which is idle most of the time. So what about MySQL Cluster? It is a good option if you can pay the license fee and adapt to a more complicated replication mechanism; the semi-synchronous replication causes additional synchronization overhead, thus requiring higher-performance hardware and hence higher cost. Another solution builds the cluster on a SAN, for example Oracle RAC. This offers high synchronization performance and high scalability, but the SAN is itself a single point of failure. Today I want to introduce a high availability solution which is open source and offers real-time data synchronization, robust data protection, replication, and automation.
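As a quick sanity check of downtime budgets (the figures follow directly from the availability percentage):

```python
# allowed downtime per year for a given availability level
MINUTES_PER_YEAR = 365 * 24 * 60  # 525600

def downtime_minutes_per_year(availability):
    return (1 - availability) * MINUTES_PER_YEAR

print(round(downtime_minutes_per_year(0.9999), 2))    # 99.99%   → 52.56 minutes/year
print(round(downtime_minutes_per_year(0.999999), 4))  # 99.9999% → 0.5256 minutes/year
```

So 52.56 minutes of downtime per year corresponds to 99.99% availability; "six nines" would allow only about half a minute.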

First of all, please familiarize yourself with DRBD from LINBIT. I would like to share a successful case of deploying a high availability solution based on DRBD technology. I am not responsible or liable for this demonstration; please test it thoroughly before applying it to a production environment.

1. DRBD User Guide (8.3.x)

2. Linux HA User Guide

3. Combine GFS2 with DRBD

4. MySQL Availability with Red Hat Enterprise Linux

5. Dual Primary (Think Twice)


In general, it is not possible to keep two computers’ file systems synchronized when both issue disk write operations concurrently. GFS2 is a cluster file system developed by Red Hat to solve this file-system-level synchronization. Here I will demonstrate how to use LINBIT’s DRBD, CMAN, and GFS2 to implement a dual-primary system and achieve high availability.

Table of Contents

0. Environment

1. Installation of operating system

2. Network configuration

3. Installation of software

4. DRBD initialization and first synchronization

5. GFS2 formatting and mounting

6. Testing

0. Environment

Hardware environment

Node 1 (sony.localdomain)
  • Intel Pentium (R) Dual CPU T2330 @ 1.60 GHz
  • 2GB RAM
Node 2 (dell2.localdomain)
  • Intel Core2Duo CPU E7200 @ 2.53 GHz
  • 4GB RAM

Software environment (identical on both computers)

1. Operating system: CentOS 6.4 (Final) kernel 2.6.32-358.el6.i686
2. cman
3. gfs-utils
4. kmod-dlm
5. modcluster
6. ricci
7. luci
8. cluster-snmp
9. iscsi-initiator-utils
10. openais
11. oddjob
12. rgmanager

1. Installation of operating system

GFS2 is developed by Red Hat, and you normally have to pay for the Cluster Suite (the software listed above), so I will demonstrate on CentOS 6.4 and install the software manually. Firstly, prepare two PCs for installing CentOS 6.4. I would stress using two physical machines connected with a crossover Ethernet cable, because the performance will be better. If you have two PCs with Windows installed, you can install VMware Player to host CentOS. It is not recommended to use a single Windows PC with two VMware Player instances hosting CentOS; it is possible theoretically, but it is less realistic.

1.1 Install CentOS 6.4 (non VM)

1.1.1 Download the CentOS 6.4 ISO image and make it into a bootable DVD.

1.1.2 Install CentOS on both computers. Click the “Next” button until you see “Which type of installation would you like?”. Because we need a standalone partition for DRBD, but the default offers only one boot partition and one partition for the OS, we choose “Create custom layout”, as shown in the following image.


1.1.3 We reorganize the partition layout by deleting the existing partitions and creating new ones: sda1 for the boot partition, sda2 for LVM, and sda3 for the GFS2 partition. It is advisable to leave some unused space for future expansion. Please note: don’t set the size of the GFS2 partition too large, as it takes a long time to synchronize; 10 GB or 20 GB is a good choice for testing purposes.


After finishing the partition layout, follow the instructions to complete the installation.

1.2 Install CentOS 6.4 (VM)

For a Windows installation, first download and install VMware Player 5.0.2, along with the CentOS 6.4 ISO image. After creating a VM Player instance, choose “Edit virtual machine settings” on the VM instance, then click “create new virtual disk” and follow the instructions to complete the installation; you will then see the new partition sdb.



2 Network configuration

2.1 To minimize network latency, use a crossover Ethernet cable to connect the two computers’ NICs, and configure them as follows.

root@dell2# cat /etc/sysconfig/network-scripts/ifcfg-eth0
root@sony# cat /etc/sysconfig/network-scripts/ifcfg-eth0

2.2 The DRBD configuration uses hostnames, so configure the hostnames as follows.

root@your_machine# cat /etc/hosts
<ip-of-sony>    sony.localdomain
<ip-of-dell2>   dell2.localdomain

Restart the network service and test the connectivity by pinging each other.
3. Installation of software

3.1 Use the yum command to install the software.

root@your_machine# yum install -y cman gfs2-utils kmod-gfs kmod-dlm \
  modcluster ricci luci cluster-snmp iscsi-initiator-utils openais oddjob rgmanager
3.2 Because DRBD is not available in the default yum repositories, you need a little more effort. Download the ELRepo release package on both computers.

root@your_machine# wget http://elrepo.org/elrepo-release-6-4.el6.elrepo.noarch.rpm

Install it on both computers.

root@your_machine# rpm -ivUh elrepo-release-6-4.el6.elrepo.noarch.rpm

Change the ELRepo repository configuration on both computers: set the 8th line to ‘enabled=0’.

root@your_machine# gedit /etc/yum.repos.d/elrepo.repo

3.3 Now you can use yum to install DRBD.

root@your_machine# yum --enablerepo=elrepo install drbd83-utils kmod-drbd83
3.4 Create the cluster configuration file on both computers as follows.

root@your_machine# gedit /etc/cluster/cluster.conf

<?xml version="1.0"?>
<cluster alias="cluster-setup" config_version="1" name="cluster-setup">
  <rm log_level="4"/>
  <fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="sony.localdomain" nodeid="1" votes="1">
      <fence>
        <method name="2">
          <device name="LastResortNode01"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="dell2.localdomain" nodeid="2" votes="1">
      <fence>
        <method name="2">
          <device name="LastResortNode02"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman expected_votes="1" two_node="1"/>
  <fencedevices>
    <fencedevice agent="fence_manual" name="LastResortNode01" nodename="sony.localdomain"/>
    <fencedevice agent="fence_manual" name="LastResortNode02" nodename="dell2.localdomain"/>
  </fencedevices>
  <totem consensus="4800" join="60" token="10000" token_retransmits_before_loss_const="20"/>
</cluster>
3.5 Configure the DRBD settings as follows. If the lines include "drbd.d/global_common.conf"; and include "drbd.d/*.res"; appear at the top of the file, just comment them out.

root@your_machine# gedit /etc/drbd.conf

global { usage-count yes; }
common { syncer { rate 100M; } }
resource res2 {
  protocol C;
  startup {
    wfc-timeout 20;
    degr-wfc-timeout 10;
    # we will keep this commented until tested successfully:
    # become-primary-on both;
  }
  net {
    # the encryption part can be omitted when using a dedicated link for DRBD only:
    # cram-hmac-alg sha1;
    # shared-secret anysecrethere123;
  }
  on sony.localdomain {
    device /dev/drbd2;
    disk /dev/sda3;
    address <ip-of-sony>:7789;
    meta-disk internal;
  }
  on dell2.localdomain {
    device /dev/drbd2;
    disk /dev/sda3;
    address <ip-of-dell2>:7789;
    meta-disk internal;
  }
  disk {
    fencing resource-and-stonith;
  }
  handlers {
    #outdate-peer "/sbin/handler";
  }
}

  • ‘resource’ is the reference name in the DRBD configuration. I suggest using ‘res2’ to point to ‘/dev/drbd2’ and ‘res0’ to point to ‘/dev/drbd0’ for ease of management. The order of the devices is not important.
  • ‘device’ is the default path of the DRBD device. After DRBD is installed, the default devices will be shown as /dev/drbd0, /dev/drbd1, …, /dev/drbd9.
  • ‘disk’ is the hard disk partition for synchronization, which we prepared in section 1 and will format as GFS2 a little later.
  • ‘address’ is the IP address of the computer (shown above as placeholders — use your own addresses). The default port number is 7789.
3.6 To start DRBD automatically on every reboot, change the configuration as follows.

root@your_machine# gedit /etc/init.d/drbd

Change # chkconfig: 345 70 08 to # chkconfig: 345 22 78. Note that the ‘#’ here is not a comment marker; please don’t remove it.
3.7 Configure the firewall on both computers.
root@your_machine# iptables -I OUTPUT -o eth0 -j ACCEPT
root@your_machine# iptables -I INPUT -i eth0 -j ACCEPT
root@your_machine# service iptables save

4. DRBD initialization and first synchronization

4.1 Start the DRBD service on both computers

root@your_machine# service drbd start
4.2 Create the metadata on both computers

root@your_machine# drbdadm create-md res2

If you get an error like “exited with code 40”:

Device size would be truncated, which would corrupt data and result in
'access beyond end of device' errors.
You need to either
 * use external meta data (recommended)
 * shrink that filesystem first
 * zero out the device (destroy the filesystem)
Operation refused.
Command 'drbdmeta 0 v08 /dev/hdb1 internal create-md' terminated with exit code 40
drbdadm create-md ha: exited with code 40

use the dd command to zero out some data at the start of the disk partition (replace /dev/hdb1 with your own partition, e.g. /dev/sda3), and then re-execute drbdadm create-md res2.

root@your_machine# dd if=/dev/zero of=/dev/hdb1 bs=1M count=100
4.3 Bring the resource up so the two nodes connect to each other

root@your_machine#  drbdadm up res2
4.4 Check the status on each computer.

root@your_machine# drbd-overview

The status will look like this:

1:res2  Connected Secondary/Secondary Inconsistent/Inconsistent C

4.5 On one (and only one) computer, issue the synchronization command. It took about 10 minutes to finish a 10 GB partition in my case. You are free to check the status during the process.

root@your_machine# drbdadm -- --overwrite-data-of-peer primary res2

The status will look like this after the synchronization completes:

1:res2  Connected Primary/Secondary UpToDate/UpToDate C r----
4.6 We want to achieve “dual primary”, so now remove the comment from the line ‘become-primary-on both;’.

root@your_machine# gedit /etc/drbd.conf
5. GFS2 formatting and mounting

5.1 Format the partition as GFS2. Since DRBD replicates the device, this only needs to be run on one computer.

root@your_machine# mkfs.gfs2 -p lock_dlm -t cluster-setup:res2 /dev/drbd2 -j 2

5.2 Stop the NetworkManager service, and start CMAN (which runs the fencing daemon).

root@your_machine# /etc/init.d/NetworkManager stop
root@your_machine# service cman start

5.3 Mount the formatted partition to a directory. You must start CMAN before mounting.

root@your_machine# mkdir /mnt/ha
root@your_machine# mount -t gfs2 -o noatime /dev/drbd2 /mnt/ha
6. Testing

Create, modify, and delete files in the /mnt/ha directory to verify whether the synchronization succeeds. CMAN monitors the connectivity of both sides until the computer shuts down. If there is a network failure, a computer shutdown (accidental or planned), or a hardware malfunction, DRBD enters protection mode until the problem is rectified, and then it synchronizes both computers (including the data written during the downtime) to resume the service.

This demonstration shows how to achieve disk-level high availability. It satisfies the requirements of real-time synchronization, zero downtime, relatively short replication time, and automation.