Replacing missing values using numpy and pandas

When working with datasets, you will very often find that some data fields are empty. You could simply drop the rows (tuples) that contain missing values, but you lose data that way. So missing values are generally filled in with the mean or the median (in some rare cases the mode) of the corresponding column (feature).

If you’re working with pandas, this task is straightforward. The only piece of code we need to add is:-

df = df.fillna(df.median())   # assuming df is the pandas dataframe generated from the dataset

This automatically fills each missing data field with the median of its respective column. We could also have used the mean or something else here. The point is that the fillna() function does the replacement for us.

Note: To fill the missing values of one particular column only, we have to write:-

df["loc"] = df["loc"].fillna(df["loc"].median())
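Here is a minimal sketch of both variants on a toy dataframe. The column names and values are made up for illustration, and missing fields are assumed to be stored as NaN. Passing numeric_only=True is a precaution for dataframes that also contain non-numeric columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; NaN marks a missing field.
df = pd.DataFrame({
    "loc": [10.0, np.nan, 30.0, 50.0],
    "age": [21.0, 35.0, np.nan, 29.0],
})

# Fill every numeric column with its own median.
filled_all = df.fillna(df.median(numeric_only=True))

# Fill a single column only, leaving the other columns untouched.
filled_loc = df.copy()
filled_loc["loc"] = filled_loc["loc"].fillna(filled_loc["loc"].median())

print(filled_all)
print(filled_loc)
```

In filled_loc the "age" column still has its missing value, which is the difference between the two calls.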

Now let us turn towards numpy. If you have to do the same, i.e. replace missing values in a numpy array, you do something like this:-

age[np.isnan(age)] = np.nanmedian(age)

Here the numpy array age is a float array in which np.nan represents a missing value, and np.nanmedian computes the median while ignoring those entries. (A blank string such as ' ' cannot be used as the marker directly, because an array that contains strings has no numeric median.) The concept is simple, yet rarely seen. I have seen people write loops that iterate over the whole array and replace the missing values one by one, when the job can be done with a single statement. Such is the power of a library like numpy!
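A minimal sketch of the single-statement replacement, assuming the missing entries are stored as np.nan in a float array (raw data that uses blank strings would have to be converted first). The values are made up for illustration.

```python
import numpy as np

# Hypothetical ages; np.nan marks the missing entries.
age = np.array([22.0, np.nan, 35.0, 41.0, np.nan, 28.0])

# Boolean mask of the missing positions.
missing = np.isnan(age)

# Median of the known values only, written into every missing slot at once.
age[missing] = np.nanmedian(age)

print(age)
```

The boolean mask does the job that an explicit loop over the array would otherwise do.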

All for today in Python tips and tricks.


Learning Flask microframework in Python

I really wanted to learn to create websites and web applications, but I only have some basic knowledge of HTML. Many websites use PHP or JavaScript on the backend, but I felt too lazy to learn those (and was also short of time). So I started learning how to make web apps using Python web frameworks first.
I was initially looking at Django, but found that Flask would be easier to learn, understand and implement quickly.

To start with the Flask microframework I installed all the dependencies and started learning through Miguel Grinberg’s blog.

This will help me learn the basics of flask by making a simple microblog first, and help me implement the various ideas I have in my mind for creating web apps.

We will be using SQLite as our database, accessed through the SQLAlchemy library. We’ll also need sqlalchemy-migrate for migrating our database. We will use OpenID to authenticate user logins, and WTForms to take various kinds of input from users.

Up till now, I have completed around 7 steps of the tutorial (it took me 2 days) and have learnt how to use HTML templates in Flask, generate and update web forms and databases, make different pages, redirect users to pages, and let users follow other users. A user can also add an avatar (using Gravatar) as the profile picture. I still haven’t understood exactly how unit testing was done in the tutorial, so I’ll go over that again.

I will push the code I’ve written to GitHub soon (mostly copied, but it includes a lot of explanatory comments for my own understanding). I am also looking to add some CSS to this project to make it look more appealing to a user. Hopefully, by the end of this mega-tutorial, I will have understood how to go about creating web apps with Flask. I’ll keep posting new progress here from time to time.


Comparing PCA and LDA techniques

I find a lot of people confused over whether they should use PCA or LDA for their application. Also, many don’t understand the fundamental difference between PCA and LDA. Hopefully by the end of this blog post, the difference between the two will be clear to the reader. (I also learnt the exact differences while trying to implement both of them.)

Both PCA and LDA are widely used as dimensionality reduction techniques in the pre-processing step of Machine Learning and Pattern Recognition problems. Our desired outcome with both techniques is to reduce the dimension of the dataset with minimal loss of information. This reduces the computational cost, speeds up computation, and most importantly reduces overfitting by projecting the dataset onto a lower-dimensional space that describes our data best.

The main difference between the two is that PCA is an “unsupervised learning” algorithm, since it “ignores” the class labels and finds the directions (the so-called principal components) that maximize the variance in a dataset. In contrast to PCA, LDA is a “supervised learning” technique and computes the directions (“linear discriminants”) that represent the axes that maximize the separation between multiple classes.
In the case of LDA, rather than just finding the axes (eigenvectors) that maximise the variance of our data, we are additionally interested in the axes that maximise the separation between multiple classes. This ensures good class separability in our dataset, which PCA does not take into account.


Another difference is that PCA makes no assumption that the data points are normally distributed. However, if the data points come from other distributions, PCA is only really approximating their features via their first two moments, so it’s not really optimal unless the data points are normally distributed. LDA, on the other hand, explicitly assumes that the data points of each class come from a separate multivariate normal distribution, with different means but the same covariance matrix.
This makes LDA a less general method than PCA.

Visualizing PCA and LDA plots:-
The plots were generated using the scikit-learn ML library on the Iris dataset. The Iris dataset consists of 150 samples from 3 classes (species) of flowers, with 4 features measured for each flower.

The above two images make it clear that while PCA accounts for the most variance in the whole dataset, LDA gives us the axes that account for the most variance between the individual classes.
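As a rough sketch of how such projections can be computed with scikit-learn (this is not the exact script behind the plots, and the choice of two components is an assumption made for 2-D plotting):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = load_iris()
X, y = iris.data, iris.target   # 150 samples, 4 features, 3 classes

# PCA is unsupervised: the class labels y are never passed in.
X_pca = PCA(n_components=2).fit_transform(X)

# LDA is supervised: the class labels are part of the fit.
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(X_pca.shape, X_lda.shape)
```

Each result has two columns, so it can be drawn as a scatter plot coloured by y. With 3 classes, LDA can produce at most 2 discriminant axes.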

When to use which technique?

It might seem like LDA is always the better technique to go with, but that is not the case. Comparisons show that PCA can outperform LDA if the number of samples per class is relatively small (as in this Iris dataset). However, when you have a large dataset with multiple classes, it’s better to use LDA, because class separability will be an important factor while reducing dimensionality in that case.

NOTE: PCA can also be used together with LDA. I will leave it to the reader to explore that scenario.



Rank 25 obtained on Kaggle leaderboard for Facial Keypoint Detection and Improvement ideas

After my previous work on the Facial Keypoint Detection problem, I made a submission to Kaggle with some changes.
Please note that the earlier implementation did not make use of the whole dataset: I had dropped the rows having missing values for any of the features. This time I replaced the missing values with the median of the feature (column) they belong to, making use of the whole dataset.

With 400 epochs, this gave me a rank of 25 on the Kaggle leaderboard.
Although this is largely the approach followed by Daniel Nouri, I have thought of some improvements and experiments to try on this model before moving on to implementing Convolutional Neural Networks.

1. We can use PCA (Principal Component Analysis) to reduce the dimensionality of the problem, reduce overfitting and try to improve the accuracy.
2. I also want to implement histogram stretching for pre-processing the data. This can be used to improve image contrast. It basically refers to stretching the range of pixel intensity values of each image. I don’t think we would need normalization after this.
3. Currently, my implementation of neural nets in Lasagne uses hold-out validation. I want to replace it with K-fold cross validation. K-fold cross validation should give a more reliable estimate of how well the network generalizes, although it will take significantly more computational time.
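The histogram stretching idea from point 2 can be sketched in a few lines of numpy. This is a simple min-max contrast stretch, one common way of doing it; the function name and the toy image are made up for illustration.

```python
import numpy as np

def stretch_contrast(image, low=0.0, high=1.0):
    """Linearly rescale pixel intensities so that they span [low, high]."""
    image = np.asarray(image, dtype=np.float32)
    lo, hi = image.min(), image.max()
    if hi == lo:
        # Flat image: there is no contrast to stretch.
        return np.full_like(image, low)
    return (image - lo) / (hi - lo) * (high - low) + low

# Hypothetical low-contrast 2x2 image with intensities between 100 and 150.
dull = np.array([[100, 110], [120, 150]], dtype=np.float32)
stretched = stretch_contrast(dull)
print(stretched)
```

Since the output already lies in [0, 1], a separate division by 255 would not be needed afterwards, which matches the remark about normalization.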

My code for neural net in lasagne :-

from lasagne import layers
from lasagne.updates import nesterov_momentum
from nolearn.lasagne import NeuralNet

net1 = NeuralNet(
    layers=[  # 3 layers, including 1 hidden layer
        ('input', layers.InputLayer),
        ('hidden', layers.DenseLayer),
        ('output', layers.DenseLayer),
    ],

    # layer parameters for every layer
    # refer to the layer using its prefix

    # 96x96 = 9216 input pixels per image
    # None gives us variable batch sizes
    input_shape=(None, 9216),

    # number of hidden units in a layer
    hidden_num_units=100,

    # Output layer uses the identity function.
    # Using this, the output unit's activation becomes a linear combination
    # of the activations in the hidden layer.
    # Since we haven't chosen anything for the hidden layer, the default
    # non-linearity (rectifier) is used as its activation function.
    output_nonlinearity=None,

    # 30 target values
    output_num_units=30,

    # Optimization method:
    # the following parameterize the update function, which updates the
    # weights of our network after each batch has been processed.
    update=nesterov_momentum,
    update_learning_rate=0.01,  # step size of the gradient descent
    update_momentum=0.9,

    # Regression flag set to True as this is a regression
    # problem and not a classification problem
    regression=True,

    max_epochs=400,

    # specifies that we wish to output information during training
    verbose=1,

    # NOTE: the validation set is automatically chosen as 20% (or 0.2) of
    # the training samples. Thus by default eval_size=0.2, which the user
    # can change accordingly.
)

4. Also, I am using a rectifier (rectified linear unit) as the activation function of the hidden layer in this script. I want to try other activation functions like the sigmoid, softmax and tanh functions. I used a rectifier here because rectified linear activations have often been found to perform better than other activation functions. Currently there is no clear proof of why that happens; the evidence is experimental.

5. Like Daniel Nouri, I am using Nesterov’s Accelerated Gradient Descent (NAG) right now. I hope to try other techniques like plain stochastic gradient descent (although it converges pretty slowly).
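For reference, the activation functions mentioned in point 4 can be written out in numpy as below. This is only an illustration of what each function computes, not how Lasagne implements them. Softmax is left out because it normalizes over a whole layer rather than acting element by element.

```python
import numpy as np

def rectifier(x):
    """Rectified linear unit: passes positive values, zeroes the rest."""
    return np.maximum(0.0, x)

def sigmoid(x):
    """Logistic function: squashes values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 2.0])
relu_out = rectifier(x)
sigmoid_out = sigmoid(x)
tanh_out = np.tanh(x)   # squashes values into (-1, 1)

print(relu_out, sigmoid_out, tanh_out)
```

In Lasagne itself these are available in the lasagne.nonlinearities module (rectify, sigmoid, tanh, softmax), so swapping the hidden layer's activation should only need a hidden_nonlinearity parameter in the network definition.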

When I apply convolutional nets to this problem, every epoch will take a lot of time on my machine. (Currently every epoch takes around 2-3 seconds.) I don’t have a CUDA-capable GPU on my own machine, so I will need to use cloud servers. So I’ll have to set up Theano and Lasagne on EC2 now.

I’ll also make a GitHub repo for this project so that everyone can see the code. I make it a habit to use a lot of comments in my code, so that no one has a problem understanding it.


Difference between pipes and named pipes

When I was studying pipes in Operating Systems, I couldn’t find a single source enumerating all the differences between the two kinds. So I thought I’d blog about the differences between pipes and named pipes (which are also known as FIFOs).

Pipes are basically an IPC mechanism used for message passing between processes in a system. They signify information flow between sender and receiver processes.
The major differences between named and unnamed pipes are:-

1. As suggested by their names, a named pipe has a specific name, which is given to it by the user. A named pipe is referred to only through this name by the reader and the writer. All instances of a named pipe share the same pipe name.
On the other hand, an unnamed pipe is not given a name. It is accessible through two file descriptors that are created by the function pipe(fd), where fd is an array of two ints: fd[1] is the write file descriptor and fd[0] is the read file descriptor.

2. An unnamed pipe can only be used for communication between related processes, typically a parent and its child, because the file descriptors have to be inherited. A named pipe can be used for communication between two unrelated processes as well: processes of different ancestry can share data through a named pipe.

3. A named pipe exists in the file system. After input-output has been performed by the sharing processes, the pipe still exists in the file system independently of the processes, and can be used for communication between some other processes.
On the other hand, an unnamed pipe vanishes as soon as every process holding its file descriptors has closed them or finished execution.

4. On some systems (Windows, for example), named pipes can provide communication between processes on different computers across a network, as in a distributed system, as well as between processes on the same computer. A POSIX FIFO only connects processes on the same machine. Unnamed pipes are always local; they cannot be used for communication over a network.

5. A named pipe can have multiple processes communicating through it, like multiple clients connected to one server. On the other hand, an unnamed pipe is a one-way pipe that typically transfers data between a parent process and a child process.

Then there are differences in their usage and syntax. An unnamed pipe is simple to use and incurs less overhead, whereas a named pipe provides greater functionality.
Examples and further reading:-
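A minimal sketch of both kinds of pipe in Python on a POSIX system. The function names and the FIFO file name are made up for illustration. In the named pipe demo, a thread stands in for the second process to keep the example short; any unrelated process that knows the path could open the FIFO in the same way.

```python
import os
import tempfile
import threading

def unnamed_pipe_demo():
    """Parent reads a message written by its forked child (POSIX only)."""
    read_fd, write_fd = os.pipe()        # read end, write end
    pid = os.fork()
    if pid == 0:                         # child: write the message and exit
        os.close(read_fd)
        os.write(write_fd, b"hello from child")
        os.close(write_fd)
        os._exit(0)
    os.close(write_fd)                   # parent keeps only the read end
    with os.fdopen(read_fd, "rb") as reader:
        data = reader.read()             # returns once the child closes its end
    os.waitpid(pid, 0)
    return data

def named_pipe_demo():
    """Writer and reader meet through a FIFO path in the file system."""
    workdir = tempfile.mkdtemp()
    fifo_path = os.path.join(workdir, "demo_fifo")   # hypothetical name
    os.mkfifo(fifo_path)

    def writer():
        # Opening for writing blocks until a reader opens the same path.
        with open(fifo_path, "wb") as fifo:
            fifo.write(b"hello via fifo")

    thread = threading.Thread(target=writer)
    thread.start()
    with open(fifo_path, "rb") as fifo:
        data = fifo.read()
    thread.join()
    still_exists = os.path.exists(fifo_path)   # the FIFO outlives the I/O
    os.remove(fifo_path)
    os.rmdir(workdir)
    return data, still_exists

if __name__ == "__main__":
    print(unnamed_pipe_demo())
    print(named_pipe_demo())
```

The unnamed pipe only works here because the child inherits the file descriptors across fork, while the FIFO is reached purely through its path and has to be removed explicitly.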

~ jigsaw

ICPC 2015 Amritapuri Online Round| Failed

I have been practicing competitive programming a lot for the past few months, and had been improving a lot lately. Hence, I was really looking forward to this year’s ICPC along with my team, which took place today, a few hours ago.

But sadly, we couldn’t do well enough to qualify. We could only solve 2 out of the 5 questions, and we couldn’t solve the 3rd one no matter how hard we tried. I suddenly don’t know what to do tonight, and therefore am visiting this blog after months. I really prepared hard for this. I practiced a lot of Graph and Dynamic Programming (DP) questions from Codeforces and I was confident that I’d crack it this year. But alas, when have things gone as planned for me! Even though this is a setback and I feel mildly depressed, I don’t want to stop coding. I don’t want to turn complacent. I want to work harder and code more, to crack some good companies in my 8th semester.

As of now, I have done around 170-180 problems on Codeforces, which are mostly A and B problems from the a20j ladder, but at least I don’t mess up the easy problems now. And I am getting better at C level problems too. In the coming week (which leads up to APAC Round C), I want to focus on more graph algorithms like Dijkstra, Floyd-Warshall etc., and try to complete as many problems of the B ladder as I can.

Becoming a good competitive programmer is a slow process and often takes years. I need to be patient with myself. But, I also need to grow every day, bit by bit.
Even though my team couldn’t do well today in the ICPC Amritapuri Regionals, I want to take this as a learning step. And hopefully I’ll get as good as the Div. 1 coders some day.

P.S: There’s still the Chennai Online Round to look forward to. Even though the chances for that are extremely dim, I’m not losing hope.


Progress on Neural Network project | Generating Data and Training first model

I made a little progress on the Neural Networks project. After reading a bit about Convolutional Neural Networks, I read a research paper on Face Detection by Stanford University, and made a short summary of it for my understanding.
After that, I set out to do the real technical part. I first read in detail about Theano, which is a Python library used for processing expressions involving multi-dimensional arrays, and Lasagne, which is a Machine Learning library used for building and training neural networks in Theano, along with its documentation.
I installed and checked all the prerequisites for Lasagne and Theano, like scipy, numpy, scikit-learn, matplotlib, etc. Thereafter, I installed Theano’s latest version, and Lasagne from GitHub, as it is still waiting for its first official release.

One problem I am running into is that, if I had a CUDA-capable GPU, I could have configured it for Theano, and that would have made my computations much faster. Only IF. Sadly, I have a Radeon graphics card in my machine, and CUDA is supported only by Nvidia graphics cards.

After I was done installing all the dependencies, I set out analysing the dataset. The dataset consists of 7000 grayscale images in the form of pixel values, with attributes associated with each image. However, the dataset wasn’t perfect (as if one ever is): it had values missing for some of the features corresponding to the images. To resolve that, for my first prototype I have decided to consider only the rows which have all the features for an image. Later on, I plan to replace the missing fields with the medians of the features. So, after reading the data from the csv file (using the pandas library, which will prove to be extremely helpful in all data generation related tasks), I started segregating the data into the pixel values (X) and the target co-ordinates (y). Also, the pixel values and the target co-ordinates had to be properly scaled for my algorithm to work on them. Since pixel values lie between 0 and 255, I simply divided them by 255 to scale them to [0, 1]. The target variables (the features) all lie between 0 and 96, and I scaled them to lie in [-1, 1] by doing y = (y-48)/48.
I put both my X and y into numpy arrays and cast them to the float32 type. Finally, I shuffled X and y randomly (in a corresponding manner, obviously) to generate my final data set.
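The scaling and shuffling steps can be sketched as follows. Random stand-in data replaces the real csv here, and shuffling with a single numpy permutation is an assumption about how the rows are kept aligned, not necessarily the exact call used in the project.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 5 flattened 96x96 images with 30 targets each.
pixels = rng.integers(0, 256, size=(5, 9216))
targets = rng.uniform(0, 96, size=(5, 30))

# Pixel values 0..255 scaled to [0, 1].
X = (pixels / 255.0).astype(np.float32)

# Target co-ordinates 0..96 scaled to [-1, 1].
y = ((targets - 48) / 48).astype(np.float32)

# Shuffle both arrays with the same permutation so the rows stay paired.
order = rng.permutation(len(X))
X, y = X[order], y[order]

print(X.shape, y.shape, X.dtype, y.dtype)
```

Indexing both arrays with the same permutation is what keeps each image matched to its own target co-ordinates after the shuffle.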

Next, I’ll run a simulation on this data set by creating one hidden layer (initially) and training my neural network on the data generated.

To build my first neural network, I had to start making use of Lasagne now. Since I don’t have a CUDA-capable GPU, the training of my neural network was a bit simpler.
The layers module and nesterov_momentum were imported from lasagne, and NeuralNet was imported from nolearn.lasagne. nesterov_momentum is the gradient descent optimization method I am using. It works well for a large number of problems.
After specifying the layers and the layer types, we list the layer parameters like the shape of the layers, the number of units/neurons, the type of activation function of the layer, etc. Specifying these parameters gives us great flexibility and compatibility with other modules and programs. The non-linearity of a layer defines the kind of activation function used. The default activation function is a rectifier; by specifying None, we get activation values that depend linearly on the previous hidden layer. The validation set is automatically chosen as 20% (or 0.2) of the training samples. Thus by default eval_size=0.2, which the user can change. I decided to run the training of the neural network for 400 epochs. I also have to specify that I am performing a regression task, by setting a flag to True. Now, I just pass the pixel values and the vector of target values to fit (train) the neural network.

When I train my data using this neural network I have built, I see the step by step reduction of the training error and the error on the validation set. After 400 epochs, I get the training loss as 0.00226 and validation loss as 0.00306, which is quite good.

Till now I have made a simple, regular neural network with some default parameters. I will start venturing into convolutional neural networks in the coming days. I shall get into experimentation after I am done with the tutorial implementation, but I am collecting ideas to improve the accuracy of my network side by side.

I shall analyse the results obtained in the next blog post, and brainstorm methods to make this algorithm better.


Neural Network project on Facial Keypoint Detection

I started work on my minor project for this semester, for which I have chosen the topic Facial Keypoint Detection using Convolutional Neural Networks. I have worked with Neural Networks in the past and they are an extremely useful algorithm in Machine Learning problems. They generally make use of several hidden layers to learn complex, non-linear functions. But Convolutional Neural Networks are a different breed from the usual Neural Networks.

Regular neural networks do not scale well to images. Convolutional neural networks are an architecturally different way of processing dimensioned and ordered data. Instead of assuming that the location of the data in the input is irrelevant (as fully connected layers do), convolutional and max pooling layers enforce weight sharing translationally.


I have worked on image recognition before, and it is a fascinating and novel field to be working in. Earlier I had dabbled with image recognition software and classifiers.

Facial keypoint detection has far reaching applications in various fields of research like :-

  • tracking faces in images and video
  • analysing facial expressions
  • detecting dysmorphic facial signs for medical diagnosis
  • biometrics / facial recognition
  • age/ gender detection

In this project, I’ll be working on a dataset by the University of Montreal, which consists of about 7000 grayscale images in the form of their pixel values, with the associated parameters. Every image in the training set has 30 features associated with it, such as the centres of the left and right eyes, the centre of the mouth, etc. Using these parameters, we will train our neural network so that it can identify the keypoints in an arbitrary image whose pixel values are given (in the test data).

I have begun reading blogs, ebooks and research papers on using Neural Networks and Facial detection and recognition.
Deep Convolutional Network Cascade for Facial Point Detection
Facial Keypoints Detection

I will also be taking help from Daniel Nouri’s blog on the same topic, which is an excellent tutorial and will help me during the initial stages of the project.
Daniel Nouri’s blog on Facial Keypoint Detection

The various hardware and software specifications I will be working on are as follows:-
OS – Linux Based Operating System

CPU – Core i5/Core i7

RAM – 4 GB/8 GB

Programming Language – Python 2.7

Scikit-learn – Python based Machine Learning library

Theano – Python library used to optimize and evaluate mathematical expressions involving multi-dimensional arrays efficiently

Lasagne – Library used to build and train Neural Networks in Theano

I hope to continuously update my progress/roadblocks on this project, so that this blog can be of help to anyone who wants to experiment with Neural Networks, or in the field of facial detection and facial feature recognition.


A note about Compilers and their speed

I learnt an interesting new thing today about compilers which I want to share on this blog. I had heard most people say, and also believed, that Python is a slow language, meaning its compiler is slow. But is it really so?

Are compilers slow and fast? What really affects the speed of execution of code in a particular language?

The answer is that the process of compilation has nothing to do with the running time of the code. Compilation merely involves generating machine code, which the computer understands, from the source code. Running time is the time taken to execute that machine code. No matter how slow the generation of the machine code is, the speed of execution depends on how well optimised that machine code is.

By that logic, a compiler written in Python might outperform a compiler written in C. That’s because the output of the compiler (or compilation) doesn’t depend on the language it’s written in, but on the algorithms and the optimizations used by it. You could write a really slow, inefficient compiler that produces very efficient code. There’s nothing very special about a compiler, except that it’s made of lexical analysers and other stuff, and its work in a nutshell is to take some input and produce some output.

Source: How can a language whose compiler is written in C ever be faster than C?

Can a dynamic language like Ruby/Python reach C/C++ like performance?


Bitwise n&-n

I was solving a problem on Codeforces that required me to find the value 2^k, where k is the position of the lowest set bit (the first 1 counting from the right, starting at 0) in the binary representation of n.
I was confused initially, but then I found out that this exact quantity can be found by ANDing n with its negative.
But how?

-n can be written as ~n + 1, where ~n denotes the bitwise complement of n. If we add 1 to the complement of n, we get a number that matches n from the rightmost 1 downwards and is inverted in every bit above it. For example,

If we take n=1001110, then
-n= 0110010         (as ~n = 0110001, and ~n+1 = 0110010)
n&-n = 0000010

That’s exactly what I’m saying here. If n=4, then n&-n = 4, which is 2 raised to the power of the position of the lowest set bit.
Let’s take one more example, for n = 12:

n = 1100
-n= 0100
12 & -12 = 0100  = 4.

This can also be used to check whether a number is a power of two or not.

If the condition n == (n & -n) is satisfied for a positive n, then n is a power of two, because only one bit is set in the binary representation of n, making n&-n yield n itself. (The parentheses matter in C and C++, where == binds tighter than &.)
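A small Python sketch of the trick, using the same example values. The function names are made up for illustration. A number that passes the n == n & -n check has exactly one set bit, so it is a power of two.

```python
def lowest_set_bit(n):
    """Return 2**k, where k is the position of the lowest set bit of n."""
    return n & -n

def is_power_of_two(n):
    """True when n is positive and has exactly one bit set."""
    return n > 0 and n == (n & -n)

for value in (0b1001110, 12, 4):
    print(bin(value), "->", bin(lowest_set_bit(value)))

print(is_power_of_two(16), is_power_of_two(12))
```

The n > 0 guard is needed because 0 & -0 is 0, which would otherwise make 0 look like a power of two.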

Just wanted to write this, so I remember this. And also because I couldn’t find too many links explaining it.