Seven Trends That Influence Search Technologies

The Semantic Web is a project that intends to create a universal medium for information exchange by giving meaning (semantics), in a manner understandable by machines, to the content of documents on the Web. Currently under the direction of the Web's creator, Tim Berners-Lee of the World Wide Web Consortium, the Semantic Web extends the World Wide Web through the use of standards, markup languages and related processing tools.

Most people are capable of using the web to, say, find the Swedish word for "car", renew a library book, or find the cheapest DVD and buy it. But if you asked a computer to do the same thing, it wouldn't know where to start. That is because web pages are designed to be read by humans, not machines. The Semantic Web is a project that aims to make web pages understandable by computers, so that they can search websites and perform actions in a standardized way.

The potential benefits are that computers can harness the enormous network of information and services on the web. Your computer could, for example, automatically find the nearest dentist to where you live and book an appointment for you that fits in with your schedule.
A lot of the things that could be done with the Semantic Web could also be done without it, and indeed already are done in some cases. But the Semantic Web provides a standard which makes such services far easier to implement.

The Semantic Web is comprised of the standards and tools of XML, XML Schema, RDF, RDF Schema and OWL. The OWL Web Ontology Language Overview describes the function and relationship of each of these components of the Semantic Web:

· XML provides a surface syntax for structured documents, but imposes no semantic constraints on the meaning of these documents.
· XML Schema is a language for restricting the structure of XML documents.
· RDF is a simple data model for referring to objects ("resources") and how they are related. An RDF-based model can be represented in XML syntax.
· RDF Schema is a vocabulary for describing properties and classes of RDF resources, with a semantics for generalization-hierarchies of such properties and classes.
· OWL adds more vocabulary for describing properties and classes: among others, relations between classes (e.g. disjointness), cardinality (e.g. "exactly one"), equality, richer typing of properties, and characteristics of properties (e.g. symmetry), and enumerated classes.
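As a rough illustration of the RDF data model described above, the sketch below stores statements as (subject, predicate, object) triples and looks them up by pattern. It is plain Python rather than a real RDF library, and the "ex:" resources, the sample statements and the match helper are hypothetical names chosen only for this example.

```python
# A minimal sketch of RDF-style statements, assuming simple string identifiers.
# The "ex:" names and the sample data are made up for illustration.
triples = {
    ("ex:page1", "dc:creator", "ex:alice"),
    ("ex:page1", "dc:title", "Search Trends"),
    ("ex:alice", "rdf:type", "ex:Person"),
}


def match(store, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


if __name__ == "__main__":
    # Who created which resource?
    for subject, _, creator in match(triples, p="dc:creator"):
        print(subject, "was created by", creator)
```

Because every statement has the same three-part shape, an agent can ask the same kind of question of any data set without knowing its layout in advance.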

The intent is to enhance the usability and usefulness of the Web and its interconnected resources through:

· documents "marked up" with semantic information (an extension of the HTML <meta> tags used in today's Web pages to supply information for Web search engines using web crawlers). This could be machine-readable information about the human-readable content of the document (such as the creator, title, description, etc., of the document) or it could be purely metadata representing a set of facts (such as resources and services elsewhere in the site). (Note that anything that can be identified with a Uniform Resource Identifier (URI) can be described, so the semantic web can reason about people, places, ideas, cats etc.)
· common metadata vocabularies (ontologies) and maps between vocabularies that allow document creators to know how to mark up their documents so that agents can use the information in the supplied metadata (so that Author in the sense of 'the Author of the page' won't be confused with Author in the sense of a book that is the subject of a book review).
· automated agents to perform tasks for users of the Semantic Web using this metadata
· web-based services (often with agents of their own) to supply information specifically to agents (for example, a Trust service that an agent could ask if some online store has a history of poor service or spamming).

Most current projects in web searching revolve around Semantic Web technologies. The idea is to look at the entire web as a book: the technologies behind the Semantic Web will help you read and find information just the way you would browse a book.

Clustering


Clustering is fast becoming a popular topic on forums and blogs, as both Google and MSN now purport to use it for their search results. The truth is that search software, web mining software and (especially) computational language software have been using this fairly basic theory for a very long time, in fact since the 1970s.

It is a statistical technique used to identify groups in a multi-dimensional space. The idea is simple: to organise or discover a set of clusters for a given document set. Document similarity between clusters must be minimized, and similarity within clusters must be maximized. In a partition, the documents are divided into non-overlapping groups.

Partitioning methods yield a set of X clusters, with each object belonging to exactly one cluster. Each cluster is represented by a centroid, which summarizes the objects in that cluster. The algorithms below include both partitioning and hierarchical methods:

1) The single pass method, where the first object becomes the centroid of the first cluster. For each subsequent object, the similarity S to each existing centroid is calculated. If S is greater than a specified threshold value, the object is added to that cluster and the centroid is recalculated; otherwise the object starts a new cluster. The method goes through the data set only once, hence its name.

2) The hierarchical agglomerative clustering method is the most commonly used. The two closest objects are merged into a cluster, then the next two closest points are found and merged, a point being either an object or a cluster. This continues until only a single cluster remains. Within this method there are variants such as the second matrix approach and the NN matrix.

3) The Single Link Method (SLINK) works by joining the 2 most similar objects that are not yet in the same cluster. 

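4) The Complete Link Method (CLINK) compares inter-cluster similarity: the similarity between two clusters is taken to be that of their least similar pair of objects, and the two clusters with the highest such similarity are merged.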

5) The Group Average Method uses the average similarity between all pairs of objects in two clusters as the similarity between those clusters. Every query K is compared against every document in the relevant clusters. If a large set of documents can be divided up into N coherent clusters, then the queries K can be compared with the representations of each of the N clusters.
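The methods listed above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not a reference implementation: cosine similarity is assumed as the similarity measure S (the text does not name one), the linkage rules follow the usual textbook definitions of single link, complete link and group average, and all function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (assumed measure S)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return tuple(sum(col) / n for col in zip(*vectors))


def single_pass(vectors, threshold):
    """One pass over the data: join the most similar cluster or start a new one."""
    clusters = []  # each cluster is a list of vectors
    for v in vectors:
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(v, centroid(c))  # centroid is recomputed from members
            if sim > best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append([v])
        else:
            best.append(v)
    return clusters


# How the pairwise similarities between two clusters are combined.
LINKAGE = {
    "single": max,                                  # most similar pair
    "complete": min,                                # least similar pair
    "average": lambda sims: sum(sims) / len(sims),  # group average
}


def agglomerate(vectors, k, linkage="single"):
    """Repeatedly merge the two most similar clusters until k clusters remain."""
    combine = LINKAGE[linkage]
    clusters = [[v] for v in vectors]
    while len(clusters) > k:
        def score(pair):
            i, j = pair
            return combine([cosine(a, b) for a in clusters[i] for b in clusters[j]])
        i, j = max(combinations(range(len(clusters)), 2), key=score)
        clusters[i].extend(clusters[j])  # i < j, so deleting j keeps i valid
        del clusters[j]
    return clusters


if __name__ == "__main__":
    docs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
    print(single_pass(docs, threshold=0.8))
    print(agglomerate(docs, k=2, linkage="complete"))
```

The agglomerative version recomputes every pairwise similarity on each round, which keeps the sketch short but would be too slow for large document sets.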

Clustering is widely used in bioinformatics and in other scientific work. Even Windows NT uses a clustering algorithm, and has done so for some time now. Research in the field of classification (clustering) is still ongoing.

Caching

Caching is the process by which web pages are stored closer to the client (browser). Technologies have been developed to cache both static and dynamic content. Akamai is one of the trend setters. EdgeComputing for Java is built on Akamai EdgeSuite, a content delivery solution that leverages the globally distributed Akamai Platform to deliver Web content and applications via more than 15,000 servers in over 1,000 networks in 65+ countries.

EdgeComputing for Java supports the execution of JavaServer Pages (JSP), Servlets, and JavaBeans on the edge of the Internet, thus avoiding network latency and the need for costly infrastructure overprovisioning, while improving the performance and reliability of mission-critical enterprise applications. To adapt an application for EdgeComputing for Java, it is separated into two layers: a centralized origin layer and a distributed edge layer. The edge layer is deployed onto the Akamai network and is composed of presentation and business components optimized for the edge.
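The split between an edge layer and an origin layer can be illustrated with a small cache that answers from local storage when it can and falls back to the origin otherwise. This is only a sketch of the general idea and does not reflect Akamai's actual interfaces; the EdgeCache class, its parameters and the time-to-live policy are all assumptions made for the example.

```python
import time


class EdgeCache:
    """Serve cached responses near the client; fall back to the origin on a miss."""

    def __init__(self, fetch_from_origin, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch_from_origin  # callable taking a URL, returning a body
        self._ttl = ttl_seconds          # how long a cached copy stays fresh
        self._clock = clock              # injectable so the cache can be tested
        self._store = {}                 # url -> (expiry_time, body)

    def get(self, url):
        now = self._clock()
        entry = self._store.get(url)
        if entry is not None and entry[0] > now:
            return entry[1]              # cache hit: no trip to the origin
        body = self._fetch(url)          # miss or expired: ask the origin
        self._store[url] = (now + self._ttl, body)
        return body


if __name__ == "__main__":
    def origin(url):
        print("origin hit for", url)
        return "<html>content of " + url + "</html>"

    cache = EdgeCache(origin, ttl_seconds=30)
    cache.get("/index.html")  # goes to the origin
    cache.get("/index.html")  # served from the edge copy
```

A fixed time-to-live is the simplest freshness rule; dynamic content would usually need a finer-grained invalidation scheme.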

With over 4 billion indexed and meticulously sorted web files, images and messages dating back to the 1980s, Google, a four-and-a-half-year-old California-based company, has indisputably become the world's largest information powerhouse. Wielding a mixture of superior technology and purposeful business marketing, the one-time university student project utterly transformed the perception of Internet searching, with ramifications never seen before. One of the strengths of Google is its caching technology, which keeps copies of web pages accessible even when the original pages are made unavailable.

There are a number of projects to improve caching technologies. Some of these projects are being promoted by the search engine vendors.

Distributed Computing

Distributed computing is an aspect of computer science that deals with the coordination of multiple computers in remote physical locations in order to accomplish a common objective or task. In distributed computing, the type of each computer, its hardware, programming languages, operating system and resources may vary drastically. Clustering has much in common with distributed computing, but the main difference is the practical physical accessibility of the machines that are working together.

Organising the interaction between the computers is of prime importance. In order to be able to use the widest possible range and types of computers, the protocol or communication channel should not contain or use any information that may not be understood by certain machines. Special care must also be taken that messages are indeed delivered correctly and that invalid messages, which would otherwise bring down the system and perhaps the rest of the network, are rejected.
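One way to picture these two requirements is a message envelope that uses a platform-neutral text format and carries a checksum, so that a receiver can reject anything damaged or malformed. The sketch below assumes JSON as the wire format and SHA-256 as the checksum; the envelope layout and the encode and decode names are hypothetical choices made for illustration.

```python
import hashlib
import json


def encode(payload):
    """Wrap a payload in a JSON envelope that carries a SHA-256 checksum."""
    body = json.dumps(payload, sort_keys=True)
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return json.dumps({"body": body, "sha256": digest}).encode("utf-8")


def decode(raw):
    """Return the payload, or raise ValueError if the message is invalid."""
    try:
        envelope = json.loads(raw.decode("utf-8"))
        body = envelope["body"]
        expected = envelope["sha256"]
        actual = hashlib.sha256(body.encode("utf-8")).hexdigest()
    except (UnicodeDecodeError, json.JSONDecodeError,
            KeyError, TypeError, AttributeError) as exc:
        raise ValueError("malformed message") from exc
    if actual != expected:
        raise ValueError("checksum mismatch")
    return json.loads(body)


if __name__ == "__main__":
    wire = encode({"task": "index", "shard": 3})
    print(decode(wire))
    try:
        decode(wire.replace(b"index", b"xndex"))  # simulate corruption in transit
    except ValueError as err:
        print("rejected:", err)
```

A checksum only detects accidental damage; protecting against deliberate tampering would need a keyed signature instead.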

Another important factor is the ability to send software over to another computer in a portable way so that it may execute and interact with the existing network. Obviously, this may not always be possible when using differing hardware and resources, so other methods must be used such as cross-compiling or manually porting this software.

Distributed computing remains one of the biggest challenges for most search engine vendors. While there has been great progress, the scope for improvement is still large. Distributed computing will influence the way large search engines work.

Non Textual Searches 

The biggest challenge for search engines is to search and identify non-textual data. Today most search engines use crude but effective algorithms based on the HTML tags and content of a specific page. The fact remains that most search engines will identify an apple as an orange if the surrounding information, such as the file name or the content of the page that links to it, says so. This is hardly a solution that strives for maximum accuracy. Finally, looking into the future, how many of these ideas can be extended to video retrieval? Combining the audio track from videos with the images that are being displayed may not only provide additional sources of information on how to index the video, but also provide a tremendous amount of (noisy) training data for training object recognition algorithms en masse.

Even with the variety of research topics discussed previously, we are still only scratching the surface of the myriad issues that AI technologies can address with respect to web search. One of the most interesting aspects of working with web data is the insight and appreciation that one can get for large data sets.

In contrast to traditional approaches, which rely solely on standard term lexicons to make spelling corrections, the Google spelling corrector takes a machine learning approach that leverages an enormous volume of text to build a very fine-grained, probabilistic, context-sensitive model for spelling correction.

This allows the system to recognize far more terms than a standard spelling correction system, especially proper names, which commonly appear in web queries but not in standard lexicons. For example, many standard spelling systems would suggest that the text “Beeg Shahi” be corrected to “Big Shah”, being completely ignorant of the proper name and simply suggesting common terms with a small edit distance to the original text. In contrast, the Google spelling corrector does not attempt to correct the text “Beeg Shahi”, since this term combination is recognized by its highly granular model.
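The difference between the two approaches can be shown with a toy corrector. This is a deliberately simplified sketch and not Google's method: it uses plain Levenshtein edit distance and raw term counts instead of a probabilistic context-sensitive model, and the counts, the max_distance parameter and the function names are made up for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete a character from a
                curr[j - 1] + 1,           # insert a character into a
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]


def correct(word, counts, max_distance=2):
    """Keep words already seen in the text; else pick the most frequent close term."""
    if word in counts:
        return word
    close = [w for w in counts if edit_distance(word, w) <= max_distance]
    return max(close, key=counts.get) if close else word


if __name__ == "__main__":
    # Hypothetical term counts: a small lexicon versus one built from web text.
    lexicon = {"big": 1000, "shah": 200}
    web_counts = {"big": 1000, "shah": 200, "beeg": 50, "shahi": 40}
    for source in (lexicon, web_counts):
        print([correct(w, source) for w in ("beeg", "shahi")])
```

With the small lexicon both words are pulled toward common terms, while the counts drawn from a larger body of text leave the proper name untouched.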

Context Sensitivity

More interesting, however, is the fact that by employing a context-sensitive model, the system will correct the text in a more intelligent way. Such fine-grained context sensitivity can only be achieved through analyzing very large quantities of text.

The Open Directory Project (ODP) (http://dmoz.org/) is a large open source topic hierarchy into which web pages have been manually classified. The hierarchy contains roughly 500,000 classes/topics. Since this is a useful source of hand-classified information, we sought to build a query classifier that would identify and suggest categories in the ODP that would be relevant to a user query. At first blush, this would appear to be a standard text classification task. It becomes more challenging when we consider that the “documents” to be classified are user queries, which have an average length of just over two words. Moreover, the set of classes from the ODP is much larger than in any previously studied classification task, and the classes are not mutually exclusive, which can create additional confusion between topics. Despite these challenges, we have available roughly four million pre-classified documents, giving us quite a substantial training set.

Hence context-sensitive models are another area that influences search technologies.
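The query classification task described above can be pictured with a very small example. The text does not say which classifier was used, so the sketch below assumes a simple smoothed term-frequency score per category (in the spirit of naive Bayes without priors); the training snippets, category names and function names are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train(labelled_docs):
    """Count how often each term occurs in the documents of each category."""
    term_counts = defaultdict(Counter)
    for text, category in labelled_docs:
        term_counts[category].update(text.lower().split())
    return term_counts


def suggest(query, term_counts, top_n=2):
    """Rank categories by the smoothed log-likelihood of the query terms."""
    vocab = {term for counts in term_counts.values() for term in counts}
    scores = {}
    for category, counts in term_counts.items():
        total = sum(counts.values())
        scores[category] = sum(
            math.log((counts[term] + 1) / (total + len(vocab)))  # add-one smoothing
            for term in query.lower().split()
        )
    # Several categories are returned because the classes are not exclusive.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


if __name__ == "__main__":
    docs = [
        ("jaguar car dealer engine", "Autos"),
        ("jaguar habitat rainforest cat", "Animals"),
        ("stock market shares", "Business"),
    ]
    model = train(docs)
    print(suggest("jaguar engine", model))
```

Returning a short ranked list instead of a single label is one way to cope with two-word queries that fit more than one topic.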

Responsiveness to Spam

Believe it or not, spam is the biggest issue that most web-based searching techniques face. The plethora of search optimization tricks has resulted in many methods that come close to spamming. Search engines and searching techniques need to combat this effectively. Otherwise the effectiveness of web querying, which depends on the accuracy of the results returned, will be questioned. Google and others are trying their best to understand what is relevant to them and their users. Research in this direction is also quite important.



Added on September 27, 2007