M41 Highway

Data science and software engineering blog


Classification using Random Forest Approach

Kaggle is a great place to learn data science in a practical way. Today I joined a tutorial competition and submitted my first prediction using a Random Forest classifier. I scored 0.77512 on my first try and was really surprised by the efficiency of the scikit-learn library.

Here’s my Python code to make the prediction. The training and test data have been massaged so that string values are converted to integers, and the output is a single list of survived values. (I removed the passenger id from all the files because it is not used in the analysis, so remember to add the passenger id back when making a submission.) There are some discrete features in the data, e.g. sex, pclass, sibling, parch, etc., and some non-discrete features, e.g. age and fare. I think fine-tuning the boundaries of the non-discrete features will be a key to improving the score. Anyway, this is just a beginning; I need more insight into the data set to get a higher score. Finally, note that the submission validation has changed lately too.

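from sklearn.ensemble import RandomForestClassifier
import csv
import numpy as np

# Load the training data: column 0 is the survived label, the rest are features.
with open('./train_sk.csv') as train_file:
    csv_file_object = csv.reader(train_file)
    train_header = next(csv_file_object)  # skip the header
    train_data = []
    for row in csv_file_object:
        train_data.append(row)
train_data = np.array(train_data, dtype=float)  # numeric strings -> floats

Forest = RandomForestClassifier(n_estimators=100)
Forest = Forest.fit(train_data[:, 1:], train_data[:, 0])

# Load the test data: every column is a feature.
with open('./test_sk.csv') as test_file:
    test_file_object = csv.reader(test_file)
    test_header = next(test_file_object)  # skip header row
    test_data = []
    for row in test_file_object:
        test_data.append(row)
test_data = np.array(test_data, dtype=float)

output = Forest.predict(test_data)
output = output.astype(int)
# fmt='%d' writes plain 0/1 values instead of floats in scientific notation
np.savetxt('./output.csv', output, fmt='%d', delimiter=',')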
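
Two things mentioned above are not shown in the script: bucketing a non-discrete feature such as age at chosen boundaries, and putting the passenger ids back next to the predictions before submitting. Here is a rough sketch of both using only the standard library. The boundary values, the header names and the helper names (`bin_feature`, `write_submission`) are my own assumptions for illustration, so check them against the data and the current submission rules.

```python
import bisect
import csv

# Hypothetical age boundaries; tune them against the training data.
AGE_BOUNDARIES = [12, 18, 40, 60]


def bin_feature(value, boundaries):
    """Return the index of the bucket that value falls into.

    boundaries must be sorted in ascending order; a value equal to a
    boundary goes into the bucket above it.
    """
    return bisect.bisect_right(boundaries, value)


def write_submission(path, passenger_ids, predictions):
    """Write one 'id,prediction' row per passenger, after a header row."""
    if len(passenger_ids) != len(predictions):
        raise ValueError("need exactly one prediction per passenger id")
    with open(path, "w") as f:
        writer = csv.writer(f, lineterminator="\n")
        writer.writerow(["PassengerId", "Survived"])  # assumed header names
        for pid, pred in zip(passenger_ids, predictions):
            writer.writerow([pid, int(pred)])
```

With these helpers, the age column could be replaced by `bin_feature(age, AGE_BOUNDARIES)` before training, and the ids set aside earlier could be passed to `write_submission` together with `output`.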


Canopy is a very well equipped IDE for Python programmers. Even the free version already comes with NumPy, SciPy, and many more useful libraries. Today I needed to use the Random Forest classifier from scikit-learn, but getting it through Canopy requires upgrading to the Canopy Basic version. So I had to install it on top of the Canopy IDE myself. Here are the steps.

I am using the free version of Canopy with Python 2.7.3 installed on CentOS 6.4. First, install the gcc compiler using the “yum” command if you don’t have it installed.

# yum install gcc-c++

Second, if you don’t have pip (the Python package installer) installed, download and install it.

# wget http://python-distribute.org/distribute_setup.py 
# sudo python distribute_setup.py 
# wget https://github.com/pypa/pip/raw/master/contrib/get-pip.py 
# sudo python get-pip.py

Verify that it installed successfully by running the “pip” command.

Third, install scikit-learn as follows.

# pip install -U scikit-learn

Verify that it is installed correctly.

# python -c "import sklearn; sklearn.test()"

Finally, open your Canopy interactive prompt and you can use this great library. For installation details on other operating systems, please refer to the scikit-learn documentation.
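
Once you are at the interactive prompt, a quick check like the one below shows whether the libraries can actually be imported from there. It is only a sketch: `check_package` is a helper name I made up, and it simply tries the import and reads the `__version__` attribute if the module has one.

```python
def check_package(name):
    """Return the package version string, 'unknown', or None if not importable."""
    try:
        module = __import__(name)
    except ImportError:
        return None
    return getattr(module, "__version__", "unknown")


for name in ("numpy", "scipy", "sklearn"):
    version = check_package(name)
    if version is None:
        print("%s is NOT installed" % name)
    else:
        print("%s is installed (version %s)" % (name, version))
```

If `sklearn` shows up as not installed here even though pip reported success, pip has probably installed into a different Python than the one Canopy uses.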


List of good books in Big Data

Data science is an emerging field that draws on expertise across different domains. Here’s a list of awesome books I highly recommend to readers at different levels.



“Software Performance and Scalability – A Quantitative Approach” (Book information)

Author: Henry H. Liu

Performance tuning is sometimes heuristic, particularly in large-scale Internet systems. If you wish to plan better and gain more insight into the performance characteristics of a complicated system, this is the way to go.



“Algorithms of the Intelligent Web” (Book information)

Authors: Haralambos Marmanis and Dmitry Babenko

This is a very practical book for learning machine learning, data clustering, and other data science topics through Java programming. It is especially good introductory material for software engineers with a Java background who want to get involved in Big Data.




“Python for Data Analysis” (Book Information)

Author: Wes McKinney

There are some very good libraries for mathematics and statistics in the Python family. If you have a programming background, you will love it for its efficiency. This is a very useful book for mastering Python from an analysis perspective.


(I will keep updating the list)