Understanding Regular Expressions

Although I had studied regular expressions in my coursework, I have only now understood their significance and how they are used in a programming environment.

I happened to find this Google tutorial on regular expressions, and I remembered feeling uneasy whenever I used them earlier, so I set out to understand the concepts once and for all.

Link to Google tutorial – https://developers.google.com/edu/python/regular-expressions

Regular expressions (also known as RegEx) are text strings that serve as search patterns. They are generally used in pattern matching and find-and-replace operations.

I have implemented the basics of RegEx in the form of a Python script. Since I cannot upload a .py file here, I’ll create a GitHub repo in case anyone wants to use my Python scripts as a reference to understand regular expressions. I have written concise documentation along with the code.
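To make the idea of a search pattern concrete, here is a minimal sketch using Python's re module. The sample string and the variable names are made up for illustration; the pattern looks for the literal text "word:" followed by exactly three word characters.

```python
import re

# Sample text invented for this illustration
text = "an example word:cat!!"

# r"..." is a raw string, so backslashes reach the regex engine unchanged.
# \w matches one word character (letter, digit or underscore).
match = re.search(r"word:\w\w\w", text)

# re.search returns a match object on success and None on failure
found = match.group() if match else None
print(found)  # word:cat
```

Checking for None before calling group() matters, because a failed search does not raise an error by itself.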

Some other things about Regular expressions which you may not find in the scripts:-

1. Parentheses () in RegEx are used to form groups, and every group is normally captured and returned as part of the match result. Sometimes we need parentheses only to group part of a pattern and don’t want to extract that group from our string. For that we start the group with ‘?:’, as in r'(?:...)', and that left parenthesis will not be counted as a group result. (To match a literal parenthesis character, escape it as \( or \).)
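A small sketch of the difference between a capturing group and a non-capturing one. The sample strings and variable names are assumptions chosen for illustration; re.findall is used because it makes the effect of a capturing group easy to see.

```python
import re

# Sample text invented for this illustration
text = "ababab xyz ab"

# Capturing group: findall returns what the group captured
# (the last repetition), not the whole match.
capturing = re.findall(r"(ab)+", text)
print(capturing)  # ['ab', 'ab']

# Non-capturing group: the parentheses only group "ab" for the +,
# so findall returns the whole match.
non_capturing = re.findall(r"(?:ab)+", text)
print(non_capturing)  # ['ababab', 'ab']

# Literal parentheses are escaped; the inner unescaped pair still captures.
literal = re.search(r"\((\w+)\)", "call(foo)")
print(literal.group())   # (foo)
print(literal.group(1))  # foo
```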

2. We can pass the re functions an extra flags argument while calling them to modify their behaviour. The 3 most common flags are:-

  • IGNORECASE: ignores upper/lower case differences between characters while matching, so that ‘a’ matches both ‘a’ and ‘A’.
    re.search(pat, string, re.IGNORECASE)
  • DOTALL: By default, the dot (.) matches any character except a newline, but using re.search(pat, string, re.DOTALL), we can allow . to match newlines as well.
  • MULTILINE: Allows ^ and $ to match at the start and end of every line in a multiline string, not just at the start and end of the whole string. re.search(pat, string, re.MULTILINE)
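The three flags above can be tried out with a short sketch like the following. The three-line sample string and the variable names are assumptions made up for illustration.

```python
import re

# Sample three-line string invented for this illustration
text = "Hello\nworld\nHELLO again"

# IGNORECASE: 'hello' matches regardless of case
ignorecase_hits = re.findall(r"hello", text, re.IGNORECASE)
print(ignorecase_hits)  # ['Hello', 'HELLO']

# DOTALL: without the flag, '.' refuses to match the newline
without_dotall = re.search(r"Hello.world", text)
with_dotall = re.search(r"Hello.world", text, re.DOTALL)
print(without_dotall)              # None
print(repr(with_dotall.group()))   # 'Hello\nworld'

# MULTILINE: '^' matches at the start of every line, not just the string
default_starts = re.findall(r"^\w+", text)
multiline_starts = re.findall(r"^\w+", text, re.MULTILINE)
print(default_starts)    # ['Hello']
print(multiline_starts)  # ['Hello', 'world', 'HELLO']
```

Flags can also be combined with the | operator, for example re.IGNORECASE | re.MULTILINE.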

3. re.sub(pattern, replacement, string) searches for the given pattern in the string and replaces it with the given replacement string. This is done for all occurrences of that pattern in the string. Read more about it in the tutorial link I have shared.
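A minimal sketch of re.sub, assuming a made-up sample string. It collapses every run of whitespace into a single space. The optional count argument, which the post does not cover, is shown only to contrast with the default replace-everything behaviour.

```python
import re

# Sample text invented for this illustration
text = "one  two   three"

# Default: every run of whitespace is replaced
collapsed = re.sub(r"\s+", " ", text)
print(collapsed)  # one two three

# count=1 limits the replacement to the first occurrence only
first_only = re.sub(r"\s+", " ", text, count=1)
print(first_only)  # one two   three
```

Note that re.sub returns a new string; the original string is left unchanged.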

4. I especially wished to highlight the difference between the re.match() and re.search() functions.
The difference, simply put, is that re.match() can’t search, so it only matches at the beginning of the string, while re.search() scans the whole string for the pattern.
Even the MULTILINE flag doesn’t change this: re.match() still matches only at the very start of the string, not at the start of each line.
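The difference can be seen in a short sketch like this one. The two-line sample string and the variable names are assumptions for illustration.

```python
import re

# Sample two-line string invented for this illustration
text = "first line\nsecond line"

# re.match only tries the pattern at the very beginning of the string
match_result = re.match(r"second", text)
print(match_result)  # None

# re.search scans forward until it finds a match
search_result = re.search(r"second", text)
print(search_result.group())  # second

# MULTILINE does not make re.match try the start of later lines
match_multiline = re.match(r"second", text, re.MULTILINE)
print(match_multiline)  # None

# With re.search, MULTILINE lets '^' anchor at the start of line two
search_multiline = re.search(r"^second", text, re.MULTILINE)
print(search_multiline.group())  # second
```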

We can also specify the position at which matching should begin, but only through a compiled pattern object, as re.compile(pattern).match(string, pos). The module-level re.match() function does not accept a pos argument.

Some links to make the difference clearer:-
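A minimal sketch of starting a match at a given index. In the standard re module the pos argument belongs to the match() method of a compiled pattern rather than to the module-level re.match() function, so the sketch compiles the pattern first. The sample string and variable names are made up for illustration.

```python
import re

# Sample text invented for this illustration
text = "abc123"

# Compile once so that the pattern object's match(string, pos) is available
digits = re.compile(r"\d+")

# At index 0 the string starts with letters, so there is no match
at_start = digits.match(text)
print(at_start)  # None

# Starting at index 3, the pattern matches the digits
from_index = digits.match(text, 3)
print(from_index.group())  # 123
print(from_index.start())  # 3
```

The start() value is reported relative to the original string, not relative to pos.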
http://stackoverflow.com/questions/180986/what-is-the-difference-between-pythons-re-search-and-re-match

http://stackoverflow.com/questions/3346076/python-re-match-vs-re-search

The re module in Python supports a variety of functions. These are just the basic few I learnt, and they are good enough to give anyone a basic idea of how regular expressions work.

Regular expressions can come in really handy at times. I have used them myself in web scraping tasks, and they give the user a very easy way of matching patterns along with a lot of flexibility.

I’ll keep adding more stuff about regular expressions to the GitHub repo in future.

-jigsaw
