Using Perl and Curl for automating Web Tasks

 

The World Wide Web is a vast network that puts information on almost every topic at your fingertips. Nowadays, almost every service provides information on the web in addition to conventional channels. For example, trading can be done online and students can check their marks online; the list is endless. The only problem is that to do any of this, we have to be online. What if we want some information while on the move? What if you don't have regular Internet access (you would be surprised to find out how many of us in India don't actually have 24/7 Internet access, despite the hype) and would still like to get regular online updates?

This was a problem my dad actually faced. He is a lawyer, and every day he has to scan a cause list book of about 40 pages just to see whether his case has come up for hearing. The Indian high courts were good enough to provide a web interface where you can search for your case quickly. However, being a good old-fashioned fellow, my dad did not want anything to do with computers, and it was left to me to check the case listings every day and inform him.

After some days I realized that I was doing a repetitive task. Being a good programmer, I decided to write a script and let the computer do the task for me. After all, that’s what computers are good at! In this article, I present a general method for gathering information from the web, with particular emphasis on how I gathered the information I needed. You will be surprised to see how little code we have to write!

Step I: Identify the site and organization of content

Obviously, there has to be some content provider offering the content online. In our case it was http://hc.ap.nic.in/, the web site of the Andhra Pradesh High Court. The high court web site is very big, with lots of information regarding cases, but we are concerned only with the cause lists. For this purpose, the site address is http://causelists.ap.nic.in/. Here again there are two links, one for the weekly list and another for the daily supplementary list. After talking with my dad and going through the site, I found that the weekly list page is displayed at http://causelists.ap.nic.in/apnew/weekly/cl.html while the daily supplementary list is displayed at http://causelists.ap.nic.in/apnew/{day}/cl.html, where day is omon, otues, owed, othurs or ofri depending on the day of the week.

Step II: Download the Web Page

Once we have decided which web pages contain the data we need, we have to download them so that we can parse them. We are going to use Perl's libwww-perl collection, or LWP, a useful collection of modules for accessing data and pages on the Web. With LWP, your Perl script acts as a user agent, much like a browser would. Its object-oriented design provides an easy-to-use interface to Web services, such as fetching web pages over protocols such as HTTP, HTTPS, FTP and more. In this case, we just need to fetch a web page, which takes the following steps:

1. Use LWP::Simple - we should include the LWP::Simple package to make use of the functions it provides;

2. Call the function getstore(url,file) - the function getstore() takes two parameters url and file, where ‘url’ is the URL of the page to be downloaded and ‘file’ is the name of the file where the downloaded page will be stored; and

3. Work out the day of the week - since the URL changes with the day of the week, we need to know what day it is. For this we use Perl's localtime() function, which converts a time as returned by the time function into a nine-element list, with the time analyzed for the local time zone. It is typically used as follows:

# 0    1    2     3     4    5     6     7     8
($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = localtime(time);

All list elements are numeric. $sec, $min, and $hour are the seconds, minutes and hours of the specified time, respectively. $mday is the day of the month, while $mon is the month itself, in the range 0 to 11, with 0 indicating January and 11 indicating December. $year is the number of years since 1900; i.e. $year is 120 in the year 2020. $wday is the day of the week, with 0 indicating Sunday, 3 indicating Wednesday and so on. $yday is the day of the year, in the range 0 to 364 (or 0 to 365 in leap years). $isdst is true if the specified time occurs during daylight saving time, false otherwise. Our concern here is only $wday. The complete code to download the cause list page based on the weekday is given in code 1.
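As a small aside before the full listing, here is a minimal sketch of how the nine elements can be turned into something readable. The helper name describe_time and the output format are my own choices for illustration. The first call uses gmtime(0), which returns the same kind of nine-element list for a fixed moment (the Unix epoch), so that its result does not depend on when or where the script is run.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# $wday is an index into this list: 0 is Sunday, 6 is Saturday
my @DAY_NAMES = qw(Sunday Monday Tuesday Wednesday Thursday Friday Saturday);

# Turn the nine-element list into a readable summary.
# describe_time is a hypothetical helper, not part of Perl itself.
sub describe_time {
    my ($sec, $min, $hour, $mday, $mon, $year, $wday, $yday, $isdst) = @_;
    # $year counts from 1900, while $mon and $yday count from 0
    return sprintf('%s %04d-%02d-%02d (day %d of the year)',
        $DAY_NAMES[$wday], $year + 1900, $mon + 1, $mday, $yday + 1);
}

# a fixed moment: the Unix epoch in UTC
print describe_time(gmtime(0)), "\n";

# the current moment in the local time zone
print describe_time(localtime(time)), "\n";
```

The first line printed describes 1 January 1970, which was a Thursday; the second line depends on the current date.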

Code 1

#!/usr/bin/perl -w
use LWP::Simple;
# setting file name of page to be downloaded
my $homepage = 'cl.html';
# getting week day
my ($Second, $Minute, $Hour, $Day, $Month, $Year, $WeekDay, $DayOfYear, $IsDST) = localtime(time);
# downloading page based on weekday; $status stays undefined if nothing is downloaded
my $status;
# sunday: the weekly list
if ($WeekDay == 0)
{
    $status = getstore('http://causelists.ap.nic.in/apnew/weekly/cl.html', $homepage);
}
# monday
elsif ($WeekDay == 1)
{
    $status = getstore('http://causelists.ap.nic.in/apnew/omon/cl.html', $homepage);
}
# tuesday
elsif ($WeekDay == 2)
{
    $status = getstore('http://causelists.ap.nic.in/apnew/otues/cl.html', $homepage);
}
# wednesday
elsif ($WeekDay == 3)
{
    $status = getstore('http://causelists.ap.nic.in/apnew/owed/cl.html', $homepage);
}
# thursday
elsif ($WeekDay == 4)
{
    $status = getstore('http://causelists.ap.nic.in/apnew/othurs/cl.html', $homepage);
}
# friday
elsif ($WeekDay == 5)
{
    $status = getstore('http://causelists.ap.nic.in/apnew/ofri/cl.html', $homepage);
}
# saturday is not needed as it is not a court working day, so commented out
#elsif ($WeekDay == 6)
#{
#    $status = getstore('http://causelists.ap.nic.in/apnew/osat/cl.html', $homepage);
#}

# printing result; on saturday nothing was downloaded, so $status is undefined
print("hooray done downloading\n") if defined($status) && is_success($status);
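The chain of if and elsif branches can also be written as a lookup table. The following is a sketch of that alternative, not the version I actually ran; the names %DIR_FOR_WDAY and cause_list_url are made up for this example. It only builds the URL, so it can be tried without a network connection.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $BASE = 'http://causelists.ap.nic.in/apnew';

# directory name for each value of $wday; Saturday (6) is left out
# because no list is fetched on that day
my %DIR_FOR_WDAY = (
    0 => 'weekly',
    1 => 'omon',
    2 => 'otues',
    3 => 'owed',
    4 => 'othurs',
    5 => 'ofri',
);

# Return the cause list URL for a weekday number,
# or undef when there is nothing to download.
sub cause_list_url {
    my ($wday) = @_;
    my $dir = $DIR_FOR_WDAY{$wday};
    return defined $dir ? "$BASE/$dir/cl.html" : undef;
}

# show the URL chosen for every day of the week
for my $wday (0 .. 6) {
    my $url = cause_list_url($wday);
    printf "%d: %s\n", $wday, defined $url ? $url : '(nothing to download)';
}
```

In the real script, the URL returned for today's $wday would be handed to getstore() together with the file name, and the returned status checked with is_success() as in code 1.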



Step III: Search for data in the downloaded page

Once the required web page has been downloaded, we have to go through it line by line until we find the lines containing the data we require. In this particular case, I am looking for lines that have my father's name (CH.SIVA REDDY) in them. The code for parsing the downloaded page is given in code 2:

Code 2

# open the downloaded page for reading, with $in as the file handle
open(my $in, '<', 'cl.html') or die "cannot open cl.html: $!";
# initialize the message to be sent if cases are found
my $message = ':';
# cleaned-up copy of the previous line
my $temp = '';
# loop once for each line. Once we've opened the file, we can read from it using the
# line input operator (also known as the angle operator). In normal usage, the line
# input operator returns the next line from our file:
while (my $line = <$in>)
{
    # if the line contains CH.SIVA REDDY, ignoring case;
    # we are using the Perl regular expression features, with the dot
    # escaped so that it matches only a literal "."
    if ($line =~ m/CH\.SIVA REDDY/i)
    {
        # here we do a lot of substitutions to remove unnecessary characters
        # putting a space on both sides of every non-word character
        $line =~ s/(\W)/\ $1\ /g;
        # replacing two or more whitespace characters with one space
        $line =~ s/\s{2,}/\ /g;
        # replacing space and . with .
        $line =~ s/\s\././g;
        # replacing . with space
        $line =~ s/\./\ /g;
        # keeping 8 characters of the previous line, starting at offset 4
        $temp = length($temp) > 4 ? substr($temp, 4, 8) : '';
        $message = $message." ".$line.$temp;
    }
    else
    {
        # same clean-up, so that this line is ready if the next line matches
        # putting a space on both sides of every non-word character
        $line =~ s/(\W)/\ $1\ /g;
        # replacing two or more whitespace characters with one space
        $line =~ s/\s{2,}/\ /g;
        # replacing space and . with .
        $line =~ s/\s\././g;
        $temp = $line;
    }
}
close($in);
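To see what the substitutions do, here is a simplified sketch that runs them on a few made-up lines held in a string instead of the downloaded file. The sample text and the helper name clean_line are assumptions for illustration only (the real cause list is laid out differently), and the sketch leaves out the part of code 2 that remembers the previous line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Apply the same four substitutions as the matching branch of code 2.
sub clean_line {
    my ($line) = @_;
    $line =~ s/(\W)/ $1 /g;    # pad every non-word character with spaces
    $line =~ s/\s{2,}/ /g;     # collapse runs of whitespace, newline included
    $line =~ s/\s\././g;       # drop the space in front of a dot
    $line =~ s/\./ /g;         # turn each dot into a space
    return $line;
}

# made-up sample text standing in for the downloaded page
my $page = "WP 1111   A.N.OTHER\nWP 1234   CH.SIVA REDDY\nWP 2222   SOMEONE ELSE\n";

# read the string line by line through an in-memory file handle
open(my $in, '<', \$page) or die "cannot open in-memory page: $!";
my @found;
while (my $line = <$in>) {
    # \Q...\E makes the dot in the name match literally
    push @found, clean_line($line) if $line =~ /\QCH.SIVA REDDY\E/i;
}
close($in);

# brackets make leading and trailing spaces visible
print "[$_]\n" for @found;
```

Only the second sample line matches. Note that the cleaned line still has two spaces where the dot used to be, and a trailing space; one more pass of the whitespace substitution at the end would tidy that up if the extra characters matter.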

Step IV: Identify the delivery mechanism

Once the message has been formed by parsing the downloaded page, we need to deliver it to the customer, in this case my dad. I could have emailed the message to him, but then he would have to log on to the Internet, in which case he could just as well have gone to the high court web site in the first place. So I decided that the best option would be to send him an SMS message. Once again I scoured the Internet, this time for a free Internet-based SMS service provider. I found one at www.indiamobiles.com. The exact URL of the form to send an SMS was http://www.indiamobiles.com/sms/. The next step, after identifying the site, was to send the SMS programmatically. For this I opted for curl.

Step V: Identify data needed for curl

Curl is a command-line tool for doing all sorts of URL manipulations and transfers. More information can be found at http://curl.haxx.se. In this article we shall see how to use curl to send HTTP POST requests to the http://www.indiamobiles.com/sms/ web site. A screenshot of the web page is shown in figure 1. From the web page, we see that there are five input fields: the first four digits of the mobile number, the remaining digits of the mobile number, the name of the sender, the subject of the SMS and the message. (Normally, the message should be limited to less than 150 characters, which is why we removed unnecessary characters from it.) We have to enter values for these five fields and click on submit.

Fig 1



Forms are the usual way in which web sites provide fields for users to enter data. Once the user has filled in the form and submits it, the data is transferred to the server using either the “post” or the “get” method. The major difference is that with the “get” method the data is appended to the URL and is visible there, whereas with the “post” method the data is sent as part of the body of the request and hence does not appear in the URL. Also, there is a limit on how much data you can submit using the “get” method.
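The difference is easiest to see by writing both kinds of request out as text. The sketch below builds them as plain strings and sends nothing; the helper form_data is a made-up name, the two field values need no encoding, and the “get” variant is purely hypothetical, since a site normally accepts only the method its form declares.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Join name/value pairs as name=value&name=value.
# The values used below contain no characters that need encoding.
sub form_data {
    my @pairs = @_;
    my @parts;
    while (my ($name, $value) = splice(@pairs, 0, 2)) {
        push @parts, "$name=$value";
    }
    return join '&', @parts;
}

my $data = form_data(NAME => 'chetan', SUB => 'cases');

# get: the data is appended to the URL after a question mark
my $get_request = "GET /sms/sendsms.php?$data HTTP/1.0\r\n\r\n";

# post: the URL stays clean and the data travels in the request body
my $post_request = "POST /sms/sendsms.php HTTP/1.0\r\n"
                 . "Content-Type: application/x-www-form-urlencoded\r\n"
                 . "Content-Length: " . length($data) . "\r\n\r\n"
                 . $data;

print $get_request, "----\n", $post_request, "\n";
```

In the first request the field values are part of the request line, which is what ends up in the browser's address bar and in server logs; in the second they appear only after the blank line that ends the headers.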

So, to submit data programmatically we need to see which method our SMS web server is using. To see that, right-click on the web page and click on view source; the HTML source will be displayed. Search for a <form> tag in the source. The part of the source that we need is shown in code 3:


Code 3

<!-- SMS POST START -->
<form method=post action="sendsms.php">
<table>
<tr>
<td height="22" width="151"><font face="Tahoma" color="#006699" size="2"><b>Mobile Number :</td>
<td height="22" width="349">
<select size="1" name="MOBFNO">
<option value="9422">9422</option>
<option value="9810">9810</option>
<option value="9811">9811</option>
<option value="9812">9812</option>
<option value="9815">9815</option>
<option value="9816">9816</option>
<option value="9819">9819</option>
<option value="9820">9820</option>
<option value="9821">9821</option>
<option value="9822">9822</option>
<option value="9823">9823</option>
<option value="9824">9824</option>
<option value="9825">9825</option>
<option value="9831">9831</option>
<option value="9837">9837</option>
<option value="9840">9840</option>
<option value="9841">9841</option>
<option value="9842">9842</option>
<option value="9843">9843</option>
<option value="9845">9845</option>
<option value="9846">9846</option>
<option value="9847">9847</option>
<option value="9848">9848</option>
<option value="9849">9849</option>
<option value="9890">9890</option>
<option value="9892">9892</option>
<option value="9893">9893</option>
<option value="9894">9894</option>
<option value="9895">9895</option>
<option value="9896">9896</option>
<option value="9898">9898</option>
</select><font face="Tahoma" color="#006699" size="2">+
<input type=text name=MOBSECNO size="10" MAXLENGTH="10"></td>
</tr>
<tr>
<td height="22" width="151"><font face="Tahoma" color="#006699" size="2"><b>Your Name : </td>
<td height="22" width="349">
<input type=text name=NAME size="15" MAXLENGTH="20"></td>
</tr>
<tr>
<td height="22" width="151"><font face="Tahoma" color="#006699" size="2"><b>Subject : </td>
<td height="22" width="349"><input type=text name="SUB" size="50" MAXLENGTH="30"></td>
</tr>
<tr>
<td height="31" width="151"><font face="Tahoma" color="#006699" size="2"><b>Message : </td>
<td height="31" width="349">
<textarea rows="5" name="MESSAGE" cols="45" WRAP="PHYSICAL" MAXLENGTH="120"></textarea></td>
</tr>
<tr>
<td colspan=2 valign="top"><input type=submit value="Send SMS"></td>
</tr>
</table>
</form>
<!-- SMS POST END -->

The rest of the HTML code is not required. From code 3, it is evident that we have to use the “post” method to send data, and that the program that receives the data is sendsms.php, which is specified in the action attribute of the form. We also identify the names of the data elements we need to submit: MOBFNO, MOBSECNO, NAME, SUB and MESSAGE.
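For a longer form, the same details can be picked out by a few lines of Perl instead of by eye. This is a rough sketch that works on a cut-down copy of code 3 using regular expressions; it assumes simple markup like the form here and is not a general HTML parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# a cut-down copy of the form from code 3
my $html = <<'HTML';
<form method=post action="sendsms.php">
<select size="1" name="MOBFNO"><option value="9848">9848</option></select>
<input type=text name=MOBSECNO size="10" MAXLENGTH="10">
<input type=text name=NAME size="15" MAXLENGTH="20">
<input type=text name="SUB" size="50" MAXLENGTH="30">
<textarea rows="5" name="MESSAGE" cols="45"></textarea>
<input type=submit value="Send SMS">
</form>
HTML

# method and action come from the <form> tag itself
my ($method) = $html =~ /<form\b[^>]*\bmethod\s*=\s*"?(\w+)/i;
my ($action) = $html =~ /<form\b[^>]*\baction\s*=\s*"?([^"\s>]+)/i;

# every name attribute, quoted or not, in document order
my @fields = $html =~ /\bname\s*=\s*"?(\w+)/gi;

print "method: $method\n";
print "action: $action\n";
print "fields: @fields\n";
```

The submit button has no name attribute, so it does not show up in the list of fields.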

To use curl to post this form with the data we wish to send, the following line of code is necessary:

curl -d "name=value&name=value.." http://www.indiamobiles.com/sms/sendsms.php

where the name=value pairs are the data we are posting. The form data should be URL-encoded, meaning that you have to replace all spaces with %20 and any other special characters with their encoded values. For example, if you want to search for C.Thomas Reed you have to encode it as C.Thomas%20Reed. For my dad, I put the command

curl -d "MOBFNO=9848&MOBSECNO=403968&NAME=chetan&SUB=cases&MESSAGE=$1#" http://www.indiamobiles.com/sms/sendsms.php

in a shell script and saved it as sms.sh. Now whenever I want to send a message to my dad, all I have to do is type ./sms.sh {message} at the prompt, where message is the message that I want to send.
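URL encoding can be done in a couple of lines of Perl. The function below is a minimal sketch written for this article (url_encode is a made-up name, not a library call); it percent-encodes everything except letters, digits and the characters - _ . ~ and assumes plain ASCII text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Percent-encode every character except letters, digits and - _ . ~
# Assumes plain ASCII input; other text would first have to be
# converted to bytes.
sub url_encode {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9\-_.~])/sprintf('%%%02X', ord($1))/ge;
    return $text;
}

print url_encode('C.Thomas Reed'), "\n";
print url_encode('a&b=c #1'), "\n";
```

The first line printed is C.Thomas%20Reed, as in the example above. The second input shows why spaces are not the only concern: an unencoded & or = inside a value would be read as a separator between fields or between a name and its value.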

Step VI: Link up code

Now that we have a means of getting the message as well as a means of sending it, let us link the two together. Code 4 below links these two components:


Code 4

# check if message exists
if ($message eq ':')
{
    print "no messages\n";
    exit 0;
}
else
{
    print "case present, sending SMS\n";
    # URL encoding by replacing space with %20
    $message =~ s/\ /%20/g;
    # executing the shell script containing the sms sending command; the list
    # form of exec passes the message as one argument without involving the shell
    exec('./sms.sh', $message) or print STDERR "could not exec sms.sh: $!\n";
    # reached only if exec failed
    exit 1;
}
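An alternative to the separate shell script is to let curl do the URL encoding itself through its --data-urlencode option and to call curl directly from Perl. The sketch below only builds the argument list and prints it; the helper name curl_command, the phone number and the message are placeholders, and I have not run this variant against the SMS site.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the argument list for curl; nothing is executed here.
# curl_command is a hypothetical helper for this example.
sub curl_command {
    my ($url, @pairs) = @_;
    my @cmd = ('curl', '--silent');
    while (my ($name, $value) = splice(@pairs, 0, 2)) {
        # with --data-urlencode, curl encodes the part after the first "="
        push @cmd, '--data-urlencode', "$name=$value";
    }
    return (@cmd, $url);
}

my @cmd = curl_command(
    'http://www.indiamobiles.com/sms/sendsms.php',
    MOBFNO   => '9848',
    MOBSECNO => '000000',                  # placeholder number
    NAME     => 'chetan',
    SUB      => 'cases',
    MESSAGE  => 'WP 1234 CH SIVA REDDY',   # made-up message
);

# show each argument in brackets so the spaces inside it are visible
print join(' ', map { "[$_]" } @cmd), "\n";

# To actually send, the list form of system() bypasses the shell:
# system(@cmd) == 0 or die "curl failed: $?";
```

Because every argument is passed to curl as is, the message can contain spaces or characters such as & without any manual replacement in the Perl script.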

Voila, your program is ready to run. Of course, it is a different story that I now have to teach my dad how to read SMS messages! If you are interested, you can set up cron so that your Perl script runs every morning and sends an SMS if a case exists.

In this article we saw how Perl and curl, which come bundled with most Linux distributions, can be used to perform some automated web tasks. The possibilities are limitless. We can set up scripts for different kinds of alerts: scripts that watch the stock market for us and inform us of any alerts, or scripts that watch cricket scores and inform us of updates. The list is endless. All you need to do is follow the six-step procedure:

1. Identify the site and organization of content;
2. Download the web page;
3. Search for data in the downloaded page;
4. Identify a delivery mechanism;
5. Identify data needed for curl; and
6. Link up the scripts.

The author is an Assistant Professor at Karshak Engineering College, Hyderabad, Andhra Pradesh. He can be contacted at: nutanc@yahoo.com.




Added on July 16, 2007

Comments

#1

Indra commented, on June 5, 2008 at 10:11 p.m.:

Hi Sneha,
That was a cool approach. I am looking for a similar piece of information but I want to own the sendsms.php or any server side script so that I have complete control. I would like to know if you have more information on how to build an SMS gateway or use any open source SMS gateway that use Perl/PHP/Python/Ruby or any other platform independent scripting language that can also assist in web programming. I am looking for an approach where I do not need to strike a contract with mobile service provider to communicate with their smsc servers.

Thanks,
Indra
