M41 Highway

Data science and software engineering blog



Classification using Random Forest Approach

Kaggle is really a great place to learn data science in a practical way. Today I joined a competition (a tutorial one) and submitted my first prediction using a Random Forest Classifier. I scored 0.77512 on my initial try and was really surprised by the efficiency of the Scikit-learn library.

Here’s my Python code to make the prediction. The training data and the testing data have been massaged so that string values are converted to integers, and the output is a single list of survived values. (I removed the passenger id from all the files because it is not used in the analysis, so remember to add the passenger id back when doing a submission.) There are some discrete features in the data, e.g. sex, pclass, sibsp, parch, etc., and some non-discrete features, e.g. age, fare. I think fine-tuning the boundaries of the non-discrete features will be a key to improving the score. Anyway, it is a beginning; I need more insight into the data set to get a higher score. Finally, note that the submission validation has changed lately too.

from sklearn.ensemble import RandomForestClassifier
import csv
import numpy as np

# Read the training data (Python 2: open in 'rb' mode for the csv module)
csv_file_object = csv.reader(open('./train_sk.csv', 'rb'))
train_header = csv_file_object.next()  # skip the header row
train_data = np.array([row for row in csv_file_object], dtype=float)

# First column is the survived label; the remaining columns are the features
forest = RandomForestClassifier(n_estimators=100)
forest = forest.fit(train_data[:, 1:], train_data[:, 0])

# Read the test data the same way
test_file_object = csv.reader(open('./test_sk.csv', 'rb'))
test_header = test_file_object.next()  # skip the header row
test_data = np.array([row for row in test_file_object], dtype=float)

# Predict and write the results as integers (fmt='%d' avoids float output)
output = forest.predict(test_data).astype(int)
np.savetxt('./output.csv', output, fmt='%d', delimiter=',')
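
As a first stab at tuning the non-discrete features, one option is to bucket them into coarse bins before training, so the forest splits on a few boundaries instead of raw values. Here is a minimal sketch using NumPy's digitize; the fare values and bin edges below are made-up illustrations, not tuned choices:

```python
import numpy as np

# Made-up fare values and bin boundaries, purely for illustration
fares = np.array([7.25, 71.28, 8.05, 512.33, 13.0])
bin_edges = np.array([10.0, 50.0, 100.0])

# Each fare is replaced by the index of the bin it falls into:
# 0 for < 10, 1 for [10, 50), 2 for [50, 100), 3 for >= 100
binned_fares = np.digitize(fares, bin_edges)
print(binned_fares)  # [0 2 0 3 1]
```

The binned column could then replace the raw fare column in the feature matrix before calling fit, and the bin edges themselves become something to tune.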
 


Install Scikit-learn on Canopy Free Version

Canopy is a very well-equipped IDE for Python programmers. With its free version, you are already offered NumPy, SciPy, and many more useful libraries. Today I need to use the Random Forest Classifier from Scikit-learn, but that requires an upgrade to the Canopy Basic version. So I have to install it on top of the Canopy IDE myself. Here are the steps.

I am using the free version of Canopy with Python 2.7.3 installed on CentOS 6.4. First, install the gcc compiler using the “yum” command if you don’t have it installed.

# yum install gcc-c++

Second, if you don’t have pip (the Python package installer) installed, download and install it.

# wget http://python-distribute.org/distribute_setup.py 
# sudo python distribute_setup.py 
# wget https://github.com/pypa/pip/raw/master/contrib/get-pip.py 
# sudo python get-pip.py

Verify that it is installed successfully by running the “pip” command.

Third, install Scikit-learn as follows.

# pip install -U scikit-learn

Verify that it is installed correctly.

# python -c "import sklearn; sklearn.test()"

Finally, enter your Canopy interactive prompt and you can use this great library. For details of installation on other operating systems, please refer to the scikit-learn docs.
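
As a quick smoke test from the interactive prompt, the following fits a Random Forest on a tiny made-up data set (the numbers are arbitrary, just to confirm that the import, fit, and predict calls all work):

```python
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Tiny made-up data set: the label equals the first feature
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y = np.array([0, 1, 0, 1])

# Fit a small forest and predict back on the training points
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, y)
print(clf.predict(X))
```

If this runs without an ImportError, the installation is good to go.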