Bayes’ theorem implementation in python
Machine learning is a method of data analysis that automates analytical model building of data set. Using the implemented algorithms that iteratively learn from data, machine learning allows computers to find hidden insights without being explicitly programmed where to look. Naive bayes algorithm is one of the most popular machine learning technique. In this article we will look how to implement Naive bayes algorithm using python.
Before someone can understand Bayes’ theorem, they need to know a couple of related concepts first, namely, the idea of Conditional Probability, and Bayes’ Rule.
Conditional Probability is just What is the probability that something will happen, given that something else has already happened.
Let say we have a collection of people. Some of them are singers. They are either male or female. If we select a random sample, what is the probability that this person is a male? what is the probability that this person is a male and singer? Conditional Probability is the best option here. We can calculate probability like,
P(Singer & Male) = P(Male) x P(Singer / Male)
What is Bayes rule ?
We can simply define Bayes rule like this. Let A1, A2, … , An be a set of mutually exclusive events that together form the sample space S. Let B be any event from the same sample space, such that P(B) > 0. Then, P( Ak | B ) = P( Ak ∩ B ) / P( A1 ∩ B ) + P( A2 ∩ B ) + . . . + P( An ∩ B )
What is Bayes classifier ?
Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes’ theorem with strong (naive) independence assumptions between the features in machine learning. Basically we can use above theories and equations for classification problem.
Bayes classifier implementation in python
Now we have to implement this great theorem in python. Fortunately we have amazing library called scikit-learn in python.In this example we are going to create some random points in three dimensional space. We classified these points onto RED and BLUE. Our task is classify new points in this three dimensional space into either BLUE or RED. Lets start with importing required modules.
import warnings
warnings.filterwarnings(‘ignore’)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from IPython.display import Image
Now we are going to create sample three dimensional data for training
x_blue = np.array([1,2,1,5,1.5,2.4,4.9,4.5])
y_blue = np.array([5,6.3,6.1,4,3.5,2,4.1,3])
z_blue = np.array([5,1.3,1.1,1,3.5,2,4.1,3])
x_red = np.array([5,7,7,8,5.5,6,6.1,7.7])
y_red = np.array([5,7.7,7,9,5,4,8.5,5.5])
z_red = np.array([5,6.7,7,9,1,4,6.5,5.5])
We have to format this data to train with sklearn
red_points = np.array(zip(x_red,y_red,z_red))
blue_points = np.array(zip(x_blue,y_blue,z_blue))
points = np.concatenate([red_points,blue_points])
output = np.concatenate([np.ones(x_red.size),np.zeros(x_blue.size)])
Now we want to classify following points
predictor = np.array([5.3,4.2,3.3])
We are going to apply Bays classification theorem
classifier = GaussianNB()
classifier.fit(points,output)
print classifier.predict([predictor])
Lets move into more real world example. Suppose we have a list of name. We want to classify this names into Male and Female categories . Our classification process is as show in below.
Image(filename=‘classification.png’)
So first step is feature extraction. I am going to observe (extract) following evidance from a name last letter, last two letter and last_is_vowel
Next step is machine leraning algorithm. of cource we are going to use Naive Bayes classification.
Lets start implimentation in python. As it is more a NLP problem, We could use NLTK module from python. We have two csv files traing file names.txt and predict.txt which contains name to be predicted.
Lets import required module,
import numpy as np
import pandas as pd
import nltk
Define a function that parse csv file and return feature sets. We are using panda for parsing csv file.
def get_data(name, result=“gender”):
df = pd.read_csv(name)
df[‘last_letter’] = df.apply (lambda row: row[0][-1],axis=1)
df[‘last_two_letter’] = df.apply (lambda row: row[0][-2:],axis=1)
df[‘last_is_vowel’] = df.apply (lambda row: int(row[0][-1] in “aeiouy”),axis=1)
train = df.loc[:,[‘last_letter’,’last_two_letter’,’last_is_vowel’]]
train_dicts = train.T.to_dict().values()
genders = df.loc[:,[result]][result]
return [(train_dict, gender) for train_dict,gender in zip(train_dicts,genders)]
our names.txt is looks like,
df = pd.read_csv(“names.txt”)
print df
name gender
0 ebin M
1 leekas M
2 jinesh M
3 neethu F
4 mary F
5 neenu F
6 sanitha F
7 lekha F
df[‘last_letter’] = df.apply (lambda row: row[0][-1],axis=1)
df[‘last_two_letter’] = df.apply (lambda row: row[0][-2:],axis=1)
df[‘last_is_vowel’] = df.apply (lambda row: int(row[0][-1] in “aeiouy”),axis=1)
The extracted features are like
print df
name | gender | last_letter | last_two_letter | last_is_vowel | |
0 | ebin | M | n | in | 0 |
1 | leekas | M | s | as | 0 |
2 | jinesh | M | h | sh | 0 |
3 | neethu | F | u | hu | 1 |
4 | mary | F | y | ry | 1 |
5 | neenu | F | u | nu | 1 |
6 | sanitha | F | a | ha | 1 |
7 | lekha | F | a | ha | 1 |
Now we want to train with data from names.txt
train_set = get_data(“names.txt”)
classifier = nltk.NaiveBayesClassifier.train(train_set)
Finally we want to test our model. We can use names from predict.txt file to test the created model
for name_and_feature in get_data(“predict.txt”,”name”):
print name_and_feature[1],“==”, classifier.classify(name_and_feature[0])
sukesh == M
jithil == M
sijith == M
maria == F
soumya == F
neethu == F