Replacing missing values using numpy and pandas

When working with datasets, it is very common for some data fields to be empty. You could drop every row that has a missing value, but then you lose data. So missing values are generally filled in with the mean or the median (or, in some rare cases, the mode) of the corresponding column (feature).

If you’re working with pandas, I found this task to be straightforward. The only line of code we need to add is:

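df = df.fillna(df.median(numeric_only=True))

Here df is assumed to be the pandas DataFrame generated from the dataset. Passing numeric_only=True keeps median() from failing on non-numeric columns in recent pandas versions.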

This will automatically fill each missing data field with the median of its respective column. We could’ve also used the mean or something else here. But the point is, the fillna() function does the work for us.
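
To make the behaviour concrete, here is a minimal sketch on a toy DataFrame. The column names and values are made up for illustration and np.nan is assumed to mark the missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; np.nan marks the missing entries.
df = pd.DataFrame({
    "loc": [10.0, np.nan, 30.0, 50.0],
    "complexity": [1.0, 2.0, np.nan, 6.0],
})

# Per-column medians, computed while ignoring the missing values.
medians = df.median(numeric_only=True)

# fillna() matches each median to its column by name.
filled = df.fillna(medians)
print(filled)
```

Each gap is replaced by the median of its own column, and the cells that already had values are left untouched.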

Note: To fill the missing values of one particular column only, we have to write:

df["loc"] = df["loc"].fillna(df["loc"].median())
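
If several columns need different strategies (say the median for a numeric column and the mode for a categorical one), fillna() also accepts a dict that maps column names to fill values. The sketch below assumes a hypothetical "lang" column alongside "loc"; both names and the data are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one numeric and one categorical column.
df = pd.DataFrame({
    "loc": [10.0, np.nan, 30.0, 50.0],
    "lang": ["py", "py", None, "c"],
})

# Choose a fill value per column: median for numbers, mode for categories.
fill_values = {
    "loc": df["loc"].median(),
    "lang": df["lang"].mode()[0],  # mode() returns a Series; take the first entry
}

filled = df.fillna(fill_values)
print(filled)
```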

Now let us turn towards numpy. If you have to do the same, i.e. replace missing values in a numpy array, you can do something like this:

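age[np.isnan(age)] = np.nanmedian(age)

Here the numpy array is numeric and uses np.nan to represent a missing value; np.nanmedian() computes the median while ignoring those entries. If your array instead holds strings, with an empty element ' ' marking the gaps, np.median() cannot be applied to it directly, so convert the array to floats first. The above concept is self-explanatory, yet rarely found. I have seen people write solutions that iterate over the whole array to replace the missing values, while the job can be done with a single statement. Such is the power of a library like numpy!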
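
As a rough sketch of the full round trip, suppose the column was read as text with ' ' marking the missing entries. The array name raw_age and its values are made up for illustration.

```python
import numpy as np

# Hypothetical raw column read as text, with ' ' marking missing entries.
raw_age = np.array(["22", " ", "35", "41", " "])

# Build a float array, leaving np.nan wherever the placeholder appeared.
missing = raw_age == " "
age = np.full(raw_age.shape, np.nan)
age[~missing] = raw_age[~missing].astype(float)

# Replace every NaN with the median of the values that are present.
age[np.isnan(age)] = np.nanmedian(age)
print(age)
```

The boolean mask does the job that an explicit loop over the array would otherwise do.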

That’s all for today in Python tips and tricks.
~jigsaw
