I found this blog post on pandas while browsing Kaggle. This post illustrates a simple implementation of the pandas library.
Pandas is a data analysis tool whose various functions help the user build a well-formed data set in a format suitable for a machine learning algorithm to work on. It can also read from and write to a variety of file formats.
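To make the reading and writing part concrete, here is a minimal sketch. The tiny table and its column names are made up for illustration and are not from the tutorial's data; it uses an in-memory string instead of a real file so it runs anywhere.

```python
import io
import pandas as pd

# A made-up CSV, held in memory so no file on disk is needed.
csv_text = "school,students\nA,400\nB,350\n"

# Read the CSV text into a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# Write it back out as CSV (without the row index) ...
out = io.StringIO()
df.to_csv(out, index=False)

# ... or as JSON, one object per row.
json_text = df.to_json(orient="records")

print(df)
print(json_text)
```

With real files, the same calls take a path instead, for example pd.read_csv("some_file.csv").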
In this tutorial, I worked on a dataset of schools in New York City. The data was extracted from https://data.cityofnewyork.us/
The data came as 5 files in .csv format. To build the final data set from those files, the procedure was basically:
1. Cleaning the data in the individual files, making it suitable to be joined with the data from the other files.
2. Joining the data tables to form one complete data set.
3. Cleaning that complete data set.
4. Dropping non-numeric values from the final data set.
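The four steps above can be sketched roughly like this. This is not the tutorial's code: the two small tables, the column names (DBN, sat_score and so on), the "s" placeholder and the choice to fill gaps with the column mean are all assumptions I made up to keep the example short. I also read "dropping non-numeric values" as dropping non-numeric columns.

```python
import pandas as pd

# Two tiny made-up tables standing in for the CSV files.
scores = pd.DataFrame({
    "DBN": ["01M001", "01M002", "01M003"],
    "sat_score": ["1200", "s", "1350"],   # "s" is a hypothetical placeholder for a missing score
})
demographics = pd.DataFrame({
    "dbn": ["01M001", "01M002", "01M003"],
    "school_name": ["School A", "School B", "School C"],
    "total_enrollment": [400, 350, None],
})

# 1. Clean each table: give the key column the same name in both,
#    and turn text scores into numbers (unparseable values become NaN).
demographics = demographics.rename(columns={"dbn": "DBN"})
scores["sat_score"] = pd.to_numeric(scores["sat_score"], errors="coerce")

# 2. Join the tables on the shared key.
combined = scores.merge(demographics, on="DBN", how="inner")

# 3. Clean the combined table: fill missing numbers with the column mean.
numeric_cols = combined.select_dtypes(include="number").columns
combined[numeric_cols] = combined[numeric_cols].fillna(combined[numeric_cols].mean())

# 4. Keep only the numeric columns.
final = combined.select_dtypes(include="number")

print(final)
```

Printing combined.dtypes and combined.isna().sum() after each step is a cheap way to spot where a column went wrong.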
As of now, even though I have completed the tutorial, I am not getting the desired data set as the output, meaning there is some error in my code. I still need to debug it. I have run into a few other roadblocks as well, as I’m not able to grasp the functioning of pandas fully (P.S. This is the first time I have used it).
I also ran into a problem towards the end of the tutorial, where we were dropping non-numeric data from our final data set but had to take special care of grades. The tutorial talked about a method to explode these grades into boolean values or something, but I haven’t been able to intuitively grasp that idea. I plan to give time to pandas again in a few days when I’m done with my practicals and everything.
Apart from everything, I’m beginning to understand the importance of data cleaning in the job of a data scientist, and how valuable a tool pandas is for that. I just hope I don’t lose touch with pandas after this tutorial.
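My best guess at what "exploding grades into boolean values" means is something like the sketch below: each distinct grade value becomes its own True/False column, so the information survives when the text column is dropped. I'm not sure this is the tutorial's exact method, and the grade_span column and its values are made up for illustration.

```python
import pandas as pd

# Made-up data: each school has a text label for the grades it covers.
schools = pd.DataFrame({
    "DBN": ["01M001", "01M002", "01M003"],
    "grade_span": ["K-5", "6-8", "K-5"],
})

# One boolean column per distinct grade_span value.
grade_flags = pd.get_dummies(schools["grade_span"], prefix="grade", dtype=bool)

# Swap the text column for the boolean columns.
schools = pd.concat([schools.drop(columns="grade_span"), grade_flags], axis=1)

print(schools)
```

The text column is gone, but a model can still see which grade range each school covers through the new True/False columns.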
I will make another post as and when I’m able to fully debug my code, generate the data set and possibly even try to implement an ML algorithm on the generated data set.