Agile Agents for Smart Crawling

This article explores Agile agents, web crawlers, and the emerging interfaces between these two generic fields. The challenge before us is not information overload but information overlook. Agile agents may be a fruitful answer to this problem and could provide a new roadmap for an intelligent, smart web crawling process. This article also introduces emerging tools for the development of Agile agents.

Introduction

Agility describes the behavior of the participants and their ability to move or adjust in new and possibly unforeseen situations. According to the Software Engineering Institute (SEI), a methodology must possess certain attributes in order to meet the requirements of being called a methodology [R1]. Another framework for methodologies is the Software Engineering Body of Knowledge (SWEBOK), which contains other development methods to be used by any professional software engineer. Agile processes emphasize rapid and flexible adaptation to changes in the process, the product, and the development environment. This is a very general definition and therefore not very useful without some specific context.

Before establishing this context, note that Agile processes have three major attributes:
· Incremental and Evolutionary – allowing adaptation to both internal and external events.
· Modular and Lean – allowing components of the process to come and go depending on the specific needs of the participants and stakeholders.
· Time Based – built on iterative and concurrent work cycles, which contain feedback loops and progress checkpoints.

The introduction of an Agile process should only be undertaken by organizations that are risk aware, if not risk averse.

An autonomous agent is a system situated within and a part of an environment that senses that environment and acts on it, over time, in pursuit of its own agenda and so as to effect what it senses in the future [R3,R4]. Agile agents combine agility with the autonomy of such agents. They have the following properties (Table-1):

(Table-1)

Property | Meaning
reactive | responds in a timely fashion to changes in the environment
autonomous | exercises control over its own actions
goal-oriented | does not simply act in response to the environment
temporally continuous | is a continuously running process
communicative | communicates with other agents, perhaps including people
learning | changes its behavior based on its previous experience
mobile | able to transport itself from one machine to another
flexible | actions are not scripted
character | believable "personality" and emotional state

In this article, the emphasis is on the application of Agile agents to the well-known web crawling process.

Web Crawling

Web crawlers are programs that automatically find and download web pages (Figure-1). The web crawling process can be summarized as follows:
· Download a set of web pages
· The set typically consists of all pages reachable by following links from a root set
· Crawling is performed periodically
· Goals:
– Find new pages
– Keep pages fresh
– Select “good” pages
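The idea of downloading every page reachable from a root set can be sketched as a breadth-first traversal. This is only an illustrative sketch: the class name `ReachableSetCrawl` and the in-memory `linkGraph` (a map from a URL to the links found on that page) are assumptions made so the example runs without network access, and they stand in for real page downloads and link extraction.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: the "web" is an in-memory map from URL to outgoing links.
final class ReachableSetCrawl {

    /** Returns every page reachable from the root set, in breadth-first discovery order. */
    static Set<String> crawl(List<String> rootSet, Map<String, List<String>> linkGraph) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> frontier = new ArrayDeque<>(rootSet);
        while (!frontier.isEmpty()) {
            String url = frontier.removeFirst();
            if (!visited.add(url)) {
                continue; // already downloaded, skip the duplicate
            }
            // A real crawler would download the page here and parse its links.
            for (String link : linkGraph.getOrDefault(url, List.of())) {
                if (!visited.contains(link)) {
                    frontier.addLast(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("c"),
                "c", List.of("a"),
                "d", List.of("a")); // "d" links in, but nothing links to it
        System.out.println(crawl(List.of("a"), web));
    }
}
```

Page "d" is never visited because no page in the reachable set links to it, which is why the choice of root set matters. Periodic crawling would amount to re-running the traversal and comparing the results with the previous run.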



(Figure-1)

Crawling Issues

The size of the web's content doubles approximately every 18 months. Approximately 8% new pages are added and 25% new links are created every week (http://research.compaq.com). The following chart shows the weekly rate of change (Figure-2):

(Figure-2)



From the above discussion, we can concentrate on the following crawling issues:

· How to crawl? 
* Quality: “Best” pages first
* Efficiency: Avoid duplication (or near duplication)
* Etiquette: Robots.txt, Server load concerns
· How much to crawl? How much to index?
* Coverage: How big is the Web? How much do we cover? 
* Relative Coverage: How much do competitors have?
· How often to crawl?
* Freshness: How much has changed? 
* How much has really changed? 
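The etiquette item above can be made concrete with a small robots.txt check. The sketch below is deliberately simplified and is not a complete implementation of the robots exclusion rules: it only honours `Disallow` lines in groups addressed to `User-agent: *`, treats each rule as a plain path prefix, and ignores `Allow` lines and wildcards. The class name `RobotsRules` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical, simplified robots.txt handling: wildcard user-agent groups and prefix rules only.
final class RobotsRules {
    private final List<String> disallowedPrefixes = new ArrayList<>();

    /** Collects the Disallow prefixes that apply to all crawlers ("User-agent: *"). */
    static RobotsRules parse(String robotsTxt) {
        RobotsRules rules = new RobotsRules();
        boolean inWildcardGroup = false;
        boolean lastWasUserAgent = false;
        for (String rawLine : robotsTxt.split("\\R")) {
            String line = rawLine.replaceFirst("#.*$", "").trim(); // strip comments
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                // Consecutive User-agent lines belong to the same group.
                boolean isWildcard = value.equals("*");
                inWildcardGroup = lastWasUserAgent ? (inWildcardGroup || isWildcard) : isWildcard;
                lastWasUserAgent = true;
            } else {
                if (field.equals("disallow") && inWildcardGroup && !value.isEmpty()) {
                    rules.disallowedPrefixes.add(value);
                }
                lastWasUserAgent = false;
            }
        }
        return rules;
    }

    /** Returns false when the path starts with any disallowed prefix. */
    boolean isAllowed(String path) {
        for (String prefix : disallowedPrefixes) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

A crawler would typically fetch `/robots.txt` once per host, call `parse` on its text, and consult `isAllowed` before every request to that host. Server load concerns would additionally call for a delay between requests to the same host.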

The crawling issues can be summarized as follows (Table-2):

(Table-2: Academic and non-academic crawling issues)

User/use | Issues
Academic computing research developing crawlers or search engines. (Full-scale search engines now seem to be the exclusive domain of commercial companies, but crawlers can still be developed as test beds for new technologies.) | High use of network resources. No direct benefits to owners of web sites crawled. Indirect social benefits.
Academic research using crawlers to measure or track the web (e.g., webometrics, web dynamics). | Medium use of network resources. Indirect social benefits.
Academic research using crawlers as components of bigger systems (e.g., Davies, 2001). | Variable use of network resources. No direct benefits to owners of web sites crawled. Indirect social benefits.
Social scientists using crawlers to gather data in order to research an aspect of web use or web publishing. | Variable use of network resources. No direct benefits to owners of web sites crawled. Indirect social benefits. Potential privacy issues from aggregated data.
Education, for example the computing topic of web crawlers and the information science topic of webometrics. | Medium use of network resources from many small-scale uses. No direct benefits to owners of web sites crawled. Indirect social benefits.
Commercial search engine companies. | Very high use of network resources. Privacy and social accountability issues.
Competitive intelligence using crawlers to learn from competitors’ web sites and web positioning. | No direct benefits to owners of web sites crawled, and possible commercial disadvantages.
Commercial product development using crawlers as components of bigger systems, perhaps as a spin-off from academic research. | Variable use of network resources. No direct benefits to owners of web sites crawled.
Individuals using downloaders to copy favorite sites. | Medium use of network resources from many small-scale uses. No form of social contract or informal mechanism to protect against abuses.
Individuals using downloaders to create spam email lists. | Privacy invasion from subsequent unwanted email messages. No form of social contract or informal mechanism to protect against abuses. Criminal law may not be enforceable internationally.

The above issues have driven the evolution of a new, smart and intelligent framework: Agile agent-based crawling.

Agile agents for smart crawling

Agile Agents

An intelligent information agent is a computational software entity (agent) that
• may access one or multiple, distributed, and heterogeneous information sources available, and
• pro-actively acquires, mediates, and maintains relevant information on behalf of its user(s) or other agents preferably just-in-time.





Searching the web is a complex process of crawling web pages and finding keywords in them, which requires a lot of resources. Traditionally, search engines use dedicated systems in a client-server architecture. This makes it very expensive for an organization or a community to have its own search engine. The Agile Agent model is an effective alternative that makes efficient use of the participating systems, with agents carrying out the crawling process. The network avoids the use of dedicated systems; instead, the agents use whichever system has resources readily available.
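One way to picture agents using whichever system has spare resources is a selection step that picks the least-loaded participating machine. The sketch below is purely illustrative: the article does not say how availability is measured, so the `Peer` class, the `cpuLoad` figure, and the `maxLoad` threshold are all assumed names and criteria.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch: Peer, cpuLoad and maxLoad are illustrative, not part of the article's system.
final class PeerSelector {

    /** A participating system and its current load, from 0.0 (idle) to 1.0 (fully busy). */
    static final class Peer {
        final String host;
        final double cpuLoad;

        Peer(String host, double cpuLoad) {
            this.host = host;
            this.cpuLoad = cpuLoad;
        }
    }

    /** Picks the least-loaded peer whose load does not exceed maxLoad, if there is one. */
    static Optional<Peer> pickHost(List<Peer> peers, double maxLoad) {
        return peers.stream()
                .filter(p -> p.cpuLoad <= maxLoad)
                .min(Comparator.comparingDouble((Peer p) -> p.cpuLoad));
    }
}
```

An empty result means that no participating system currently has capacity, in which case a crawling agent could simply wait and try again later rather than overload a busy machine.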

Web Crawling Agile Agents

The Web Crawling Agents work in one of three modes, depending on the mode in which each agent was created. An agent's basic job is to crawl web pages, starting from the URL it receives as a parameter from the MA (multiagents). In the Searching mode it receives a URL and a keyword. It gets the HTML contents of the page and retains a reference to the page if it finds the keyword in it. It extracts all the links from the page and adds them to a list. It repeats this process recursively until the total number of pages crawled or the number of links collected exceeds the specified target. At the end of the process it sends the keyword and the URLs it found to the MA [R5].
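The Searching mode described above can be sketched roughly as follows. This is a hedged illustration rather than the agents' actual code: `KeywordSearchCrawl`, the `fetcher` function (a stand-in for whatever downloads a page), and the two limits `maxPages` and `maxLinks` are assumed names. The sketch uses a queue instead of recursion, and its regular expression is a naive link extractor that does not resolve relative URLs.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the Searching mode; all names here are illustrative.
final class KeywordSearchCrawl {
    // Naive href extractor; a real crawler would use an HTML parser.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    /** Returns the URLs of crawled pages that contain the keyword (case-insensitive). */
    static List<String> search(String startUrl, String keyword,
                               Function<String, String> fetcher,
                               int maxPages, int maxLinks) {
        List<String> matches = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.add(startUrl);
        seen.add(startUrl);
        String needle = keyword.toLowerCase(Locale.ROOT);
        int pagesCrawled = 0;
        int linksCollected = 0;

        // Stop when either target is reached or there is nothing left to crawl.
        while (!pending.isEmpty() && pagesCrawled < maxPages && linksCollected < maxLinks) {
            String url = pending.removeFirst();
            String html = fetcher.apply(url);
            if (html == null) {
                continue; // fetch failed, skip this page
            }
            pagesCrawled++;
            if (html.toLowerCase(Locale.ROOT).contains(needle)) {
                matches.add(url); // retain a reference to the matching page
            }
            Matcher m = HREF.matcher(html);
            while (m.find()) {
                String link = m.group(1);
                if (seen.add(link)) { // count and queue only links not seen before
                    pending.addLast(link);
                    linksCollected++;
                }
            }
        }
        return matches;
    }
}
```

The returned list corresponds to the URLs an agent would report back to the MA together with the keyword. Passing the fetcher in as a function keeps the crawl logic testable with canned pages.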

Implementation

Java, the language that changed the Web overnight, offers some unique capabilities that are fueling the development of agent systems. This section outlines what makes Java such a powerful tool for mobile agent development.

Java Agent Development Tools:


· The IBM Agent Building and Learning Environment (ABLE) is a Java-based framework for developing and deploying hybrid intelligent agents and agent applications. 
· AgentBuilder, from Reticular Systems, Inc., is an integrated software development toolkit for constructing agents in Java.
· The Java Agent Development Framework (JADE), developed at CSELT in Torino, Italy, is a FIPA-compliant toolkit for creating multi-agent system applications.
· Aglets are Java-based autonomous and mobile agents developed by IBM. Aglets communicate using a white board that enables agents to collaborate and share information asynchronously.
· FIPA Open Source (FIPA-OS) is an open-agent platform that supports communication by using the FIPA agent communication standards.
· Gossip is a demonstration application of Tryllian Inc.’s mobile agent software. The Gossip agents use learning technology to create user profiles and to perform automated actions on their users’ behalf.
· Java Agent Template Lite (JATLite) is a set of lightweight Java packages that was developed at Stanford University.
· Java Expert System Shell (JESS) is a Java application that can be run from the command line. JESS is a Java implementation of the standard CLIPS rule-based environment developed by NASA. JESS was developed at the Sandia National Laboratories. 

The Java code segment given below shows a major portion of the Web Crawling Agile Agents:

The <code>URLReaderAgileAgent</code> class implements an Agile agent that reads web pages and optionally passes parameters to the page.

<code>URLReaderAgileAgent</code>

package infofilter;

import java.awt.*;
import java.awt.event.*;
import javax.swing.*;
import java.io.*;
import java.net.*;
import java.util.*;
import SmartAgent.*;

public class URLReaderAgileAgent extends SmartAgent
{
    URL url = null;             // URL specification
    String paramString = null;  // optional param string for queries
    String contents = "";       // the contents of the URL or URL response

    public URLReaderAgileAgent() {
        this("URLReaderAgileAgent");
    }

    public URLReaderAgileAgent(String name) {
        super(name);
    }

    public void setURL(URL url) {
        this.url = url;
    }

    public URL getURL() {
        return url;
    }

    public void setParamString(String paramString) {
        this.paramString = paramString;
    }

    public String getParamString() {
        return paramString;
    }

    public String getContents() {
        return contents;
    }

    public String getTaskDescription() {
        return "Read a URL";
    }

    public void initialize() {
        setSleepTime(5 * 1000); // every 5 seconds
        setState(SmartAgentState.INITIATED);
    }

    /**
     * Does nothing.
     */
    public void process() {}

    /**
     * Does nothing.
     */
    public void processTimerPop() {
    }

    /**
     * Handles "trace" and "getURLText" actions sent by other agents.
     */
    public void processSmartAgentEvent(SmartAgentEvent event) {
        Object source = event.getSource();
        Object arg = event.getArgObject();
        Object action = event.getAction();

        trace("\n" + name + ": SmartAgentEvent received by " + name + " from " + source.getClass());
        if (action != null) {
            if (action.equals("trace")) {
                // instanceof is false for null, so no separate null check is needed
                if (arg instanceof String) {
                    trace((String) arg); // display the msg
                }
            } else if (action.equals("getURLText")) {
                // read the URL here
                String text = getURLText();
                if (text != null) {
                    // send back in event
                    NewsArticle article = new NewsArticle("URL");
                    article.setSubject("URL: " + url.toString());
                    article.setBody(text);
                    sendArticleToListeners(article);
                }
            }
        }
    }

    /**
     * Sends the URL text to anyone listening for it.
     */
    protected void sendArticleToListeners(NewsArticle article) {
        System.out.println("URLReaderAgileAgent -- sending URL text to listeners ");
        SmartAgentEvent event = new SmartAgentEvent(this, "addArticle", article);

        notifySmartAgentEventListeners(event);
    }

    /**
     * Reads the configured URL and returns its text, or null on failure.
     * If a parameter string is set, it is written to the request body first.
     */
    protected String getURLText() {
        if (url == null) {
            trace("Error: URLReaderAgileAgent has no URL set");
            return null;
        }
        StringBuilder body = new StringBuilder();

        System.out.println("URLReaderAgileAgent ... starting to read URL ");
        try {
            HttpURLConnection connection = (HttpURLConnection) url.openConnection();
            System.out.println("Opened connection");

            // process params if any (enabling output turns the request into a POST)
            if ((paramString != null) && (paramString.length() > 0)) {
                connection.setDoOutput(true);
                try (PrintWriter out = new PrintWriter(connection.getOutputStream())) {
                    out.println(paramString);
                }
            }

            // try-with-resources closes the reader even if reading fails part-way
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(connection.getInputStream()))) {
                String inputLine;
                while ((inputLine = in.readLine()) != null) {
                    body.append(inputLine);
                    body.append("\n");
                }
            }
        } catch (Exception e) {
            trace("Error: URLReaderAgileAgent could not connect to URL : " + e.toString());
            return null;
        }
        contents = body.toString(); // save as string in agent
        return contents;
    }
}

Conclusion 

Since there is a wide range of different web hosting packages and constantly changing technological capabilities, a deontological list of absolute rights and wrongs would quickly become outdated, even if it were desirable. Utilitarianism, however, can provide the necessary framework to help researchers make judgments with regard to web crawling. It is important that decisions about crawl parameters are made on a site-by-site, crawl-by-crawl basis rather than with a blanket code of conduct. Web crawling involves a number of different participants whose needs must be estimated. These are likely to include not only the owner of the web site, but also the hosting company, the crawler operator’s institution, and users of the resulting data.

References

[R1]: “Software Development Taxonomy,” www.sei.cmu.edu/legacy/kit/taxonomy.html

[R2]: “New Age of Software Development: How Component Based Software Engineering Changes the Way of Software Development,” Mikio Aoyama, 1998 International Workshop on Component-Based Software Engineering, 1998.

[R3]: Brooks, Rodney A. (1990), "Elephants Don't Play Chess," In Pattie Maes, ed., Designing Autonomous Agents, Cambridge, MA: MIT Press

[R4]: Hayes-Roth, B. (1995). "An Architecture for Adaptive Intelligent Systems," Artificial Intelligence: Special Issue on Agents and Interactivity, 72, 329-365.

[R5]: Kautz, H., B. Selman, and M. Coen (1994), "Bottom-up Design of Software Agents." Communications of the ACM, 37, 7, 143-146

[R6]: Wooldridge, Michael and Nicholas R. Jennings (1995), "Agent Theories, Architectures, and Languages: a Survey," in Wooldridge and Jennings Eds., Intelligent Agents, Berlin: Springer-Verlag, 1-22

[R7]: Franklin, Stan (1995), Artificial Minds, Cambridge, MA: MIT Press

[R8]: “Agile Modeling,” Scott Ambler, www.agilemodeling.com. Process Patterns: Building Large-Scale Systems Using Object Technology, Scott Ambler, Cambridge University Press, 1998. More Process Patterns: Delivering Large-Scale Systems Using Object Technology, Scott Ambler, Cambridge University Press, 1999.
[R9]: Brin, S., & Page, L. (1998). The anatomy of a large scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1-7), 107-117.
[R10]: Arasu, A., Cho, J., Garcia-Molina, H., Paepcke, A., & Raghavan, S. (2001). Searching the Web. ACM Transactions on Internet Technology, 1(1), 2-43.

[R11]: Koster, M. (1993). Guidelines for robot writers. Retrieved February 23, 2005 from http://www.robotstxt.org/wc/guidelines.html

[R12]: Krogh, C. (1996). The rights of agents. In M. Wooldridge, J. P. Muller & M. Tambe (Eds.), Intelligent Agents II, Agent Theories, Architectures and Languages (pp. 1-16): Springer Verlag.
[R13]: http://research.compaq.com
[R14]: www.robotstxt.org/wc/norobots.html


About the Authors

Sunil Kr. Pandey
Asst. Professor
Department of Computer Science,
School of Management Sciences (SMS),
Varanasi (Uttar Pradesh)
India.
Contact No: +91-9415817109
Fax: +91-542-2271773
E-mail: sunilmca5@rediffmail.com

R.B. Mishra
Reader
Department of Computer Engineering
Institute of Technology (IT),
Banaras Hindu University (BHU),
Varanasi (Uttar Pradesh)
India.
Contact No: +91-9415817109
Fax: +91-542-2271773
E-mail: rbmbhu@yahoo.com



Added on August 8, 2007