Building a web scraper in Python using BeautifulSoup

I recently built a web scraper in Python to extract useful information from a set of predefined webpages. I think of it as a mini project. I enjoyed working on it a lot, and along the way I learnt a great deal about Python and about the tools and methods for scraping and cleaning data. It also gave me a lot more confidence in Python.

The parser can still be optimised, but since I learnt how to build it from scratch, I’m pretty happy with how this mini project turned out.

I used a Python library called BeautifulSoup4 to build this web parser. BeautifulSoup parses HTML in Python and gives the user an easy way to navigate and search through the HTML of a webpage.

http://www.crummy.com/software/BeautifulSoup/bs4/doc/
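To give a feel for what "navigate and search" means in practice, here is a minimal sketch. The HTML snippet and the names SAMPLE_HTML, page_title and links are made up for illustration and are not taken from my script; the parser used is Python's built-in "html.parser".

```python
from bs4 import BeautifulSoup

# Made-up HTML, used only for illustration (not the real page markup).
SAMPLE_HTML = """
<html>
  <head><title>Sample talks</title></head>
  <body>
    <div class="video"><a href="/video/1/intro-talk">Intro talk</a></div>
    <div class="video"><a href="/video/2/advanced-talk">Advanced talk</a></div>
  </body>
</html>
"""

# Parse the HTML with the parser that ships with Python.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Navigating: step straight to the <title> tag and read its text.
page_title = soup.title.string

# Searching: find every <a> tag and collect its text and href attribute.
links = [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a")]

print(page_title)
for text, href in links:
    print(text, "->", href)
```

Navigating means walking the tag tree by attribute access (soup.title), while searching means asking for every tag that matches a filter (find_all).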

The webpage that I’m scraping is http://pyvideo.org/category/50/pycon-us-2014

This page lists a large number of videos on Python. My script reads the basic information on this page, then selects the first 10 video links and follows each one to scrape data about that video (e.g. speaker, date of upload, language) from there.
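The flow described above could be sketched roughly as follows. This is not the code from my script: the function names (fetch_html, first_video_links, scrape_video_details, main) are hypothetical, and two things about the page markup are assumptions, namely that video links contain "/video/" in their path and that each video page lists its details as dt/dd pairs. Adjust both to whatever the real HTML looks like.

```python
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup

SOURCE_URL = "http://pyvideo.org/category/50/pycon-us-2014"


def fetch_html(url):
    """Download a page and return its HTML as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def first_video_links(html, base_url, limit=10):
    """Return up to `limit` absolute video URLs, in page order, no duplicates."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        # Assumption: video pages are the links whose path contains "/video/".
        if "/video/" not in anchor["href"]:
            continue
        url = urljoin(base_url, anchor["href"])
        if url not in links:
            links.append(url)
        if len(links) == limit:
            break
    return links


def scrape_video_details(html):
    """Collect label/value pairs (e.g. speaker, language) from a video page."""
    soup = BeautifulSoup(html, "html.parser")
    details = {}
    # Assumption: details are laid out as <dt>label</dt><dd>value</dd>.
    for label in soup.find_all("dt"):
        value = label.find_next_sibling("dd")
        if value is not None:
            details[label.get_text(strip=True)] = value.get_text(strip=True)
    return details


def main():
    """Read the source page, then branch off to the first 10 video pages."""
    source_html = fetch_html(SOURCE_URL)
    for url in first_video_links(source_html, SOURCE_URL, limit=10):
        print(url, scrape_video_details(fetch_html(url)))
```

Calling main() runs the whole flow against the live site. The two helper functions take plain HTML strings, so they can be tried out on a saved copy of a page without touching the network.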

I have just uploaded the script to GitHub. I encourage you to go through it and try applying it yourself.
For everyone’s perusal, I have written detailed documentation along with the script itself.

Link to the GitHub repository: https://github.com/jigsaw2212/Web-Scraper-in-Python-using-BeautifulSoup


The scraper for this page is in the file scraper_new.py, and the extracted data is stored in the files data.txt and link_data.txt. The file data.txt contains the general information from my source page, and the file link_data.txt contains the information extracted from the 10 video links chosen from the source page.
Note: The file scraper_new.py includes the documentation too.
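As a rough idea of how scraped records can end up in a text file such as link_data.txt, here is a small sketch. The helper write_records and the "key: value" layout are assumptions made for illustration; the actual files in the repository may be formatted differently.

```python
def write_records(path, records):
    """Write each record as 'key: value' lines, with a blank line after it."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            for key, value in record.items():
                handle.write(f"{key}: {value}\n")
            # A blank line separates one record from the next.
            handle.write("\n")
```

For example, write_records("link_data.txt", [{"speaker": "Jane Doe", "language": "English"}]) would produce a file with one block of two lines. The record shown here is invented sample data.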

You can also download the whole parser, along with the extracted data, from the zip file Divyansh_NSIT_scraper1_python.py.

I still have a lot to explore in the field of web parsing and data munging, and many other libraries and tools to try out. This is a very small step towards gaining acumen in that field.

-jigsaw
