Creating a Site Search Engine - Part I
Posted On July 25, 2007 by Priyadarshan Roy filed under Programming
Introduction
This search engine module searches the entire content of each page, including dynamic pages, for the keyword(s) or phrase entered. It counts the number of times the keyword(s) or phrase occurs on each page and displays the results with the highest match counts first. The module searches every file whose extension is listed in the web.config file where indicated. Files or folders that you do not want searched can also be listed in the web.config file where indicated. You can also choose the encoding.
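The ranking idea described above, counting how often a phrase occurs on each page and listing the highest count first, can be sketched as follows. This is a minimal illustration and not the module's actual code: the module name `MatchCountSketch`, the function `CountMatches`, and the sample page names and text are all assumptions made for the example.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module MatchCountSketch
    ' Counts case-insensitive occurrences of a literal phrase in page text.
    ' (Hypothetical helper; the article's own counting code is not shown here.)
    Function CountMatches(ByVal pageText As String, ByVal phrase As String) As Integer
        If String.IsNullOrEmpty(pageText) OrElse String.IsNullOrEmpty(phrase) Then Return 0
        ' Regex.Escape makes the phrase match as literal text, not as a pattern.
        Return Regex.Matches(pageText, Regex.Escape(phrase), RegexOptions.IgnoreCase).Count
    End Function

    Sub Main()
        ' Sample data standing in for the text read from each page.
        Dim pages As New Dictionary(Of String, String)
        pages.Add("about.aspx", "Search tips: search by keyword or phrase.")
        pages.Add("home.aspx", "Welcome to the site.")
        pages.Add("help.aspx", "Use search. Search again. SEARCH more.")

        Dim results As New List(Of KeyValuePair(Of String, Integer))
        For Each entry As KeyValuePair(Of String, String) In pages
            Dim hits As Integer = CountMatches(entry.Value, "search")
            ' Pages without any match are left out of the result list.
            If hits > 0 Then results.Add(New KeyValuePair(Of String, Integer)(entry.Key, hits))
        Next

        ' Highest match count first.
        results.Sort(Function(a, b) b.Value.CompareTo(a.Value))

        ' Prints "help.aspx: 3" and then "about.aspx: 2".
        For Each r As KeyValuePair(Of String, Integer) In results
            Console.WriteLine("{0}: {1}", r.Key, r.Value)
        Next
    End Sub
End Module
```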
This article contains tips to globalize and enhance the code.
Note: It is best suited for small sites. Also, you can modify this code to crawl pages internally by using regular expressions. For larger sites, you will need to write to the XML file periodically and then read from the XML file. Tips have been offered at the end of the section for this purpose.
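For the larger-site approach mentioned in the note, one possible shape is to save the page data as XML from a periodic job and load it again when a search runs. The sketch below uses the standard `DataSet.WriteXml` and `DataSet.ReadXml` methods. The module name, the column names, the sample row and the file name `SiteIndex.xml` are assumptions for illustration only.

```vbnet
Imports System
Imports System.Data
Imports System.IO

Module XmlCacheSketch
    ' Builds a small dataset standing in for the crawled page data.
    ' (Table layout is hypothetical; only the "Pages" table name comes from the article.)
    Function BuildIndex() As DataSet
        Dim ds As New DataSet("PagesDataset")
        Dim pages As DataTable = ds.Tables.Add("Pages")
        pages.Columns.Add("Path", GetType(String))
        pages.Columns.Add("Title", GetType(String))
        pages.Columns.Add("Contents", GetType(String))
        pages.Rows.Add("default.aspx", "Home", "Welcome to the site")
        Return ds
    End Function

    Sub Main()
        Dim cacheFile As String = Path.Combine(Path.GetTempPath(), "SiteIndex.xml")

        ' Periodic job: write the data together with its schema.
        BuildIndex().WriteXml(cacheFile, XmlWriteMode.WriteSchema)

        ' Search request: read the saved data instead of re-reading every file.
        Dim cached As New DataSet()
        cached.ReadXml(cacheFile)
        Console.WriteLine(cached.Tables("Pages").Rows.Count)
    End Sub
End Module
```

Writing the schema along with the data keeps the column types intact when the file is read back.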
Background
Our Site Search Engine (SSE) helps users find the pages that interest them. When I was working on an ASP.NET project, I had to add a site search module. I had one in ASP but not in .NET; hence the birth of this site search engine. My first version was just a single web form, and it did not exploit the object-oriented features of the .NET languages. In my spare time, I reworked the code to make fuller use of them. For this article, I further enhanced the design on the basis of experience and good practices suggested by different authors.
Mr. Song Tao from Beijing, China, approached me with queries on how to convert the module into Chinese. With his help, I enhanced the code to support other languages. Also, a few users encountered article errors when the SiteSearch.aspx was placed in the root. I modified the code to rectify this error.
Source Code Overview
The structure of SSE is as shown in figure 1.

Classes
The ability to define a class and create instances of classes is one of the most important capabilities of any object-oriented language. In the coming section, we shall see the classes used in our search module. Refer to figure 2 and table 1.

Table 1
| Class Name | Description |
| SiteSearch | Class for a web form where the user can search a site for certain words. |
| Searches.CleanHtml | Class to clean the HTML content. |
| Searches.FileContent | Class to get content from the HTML file. |
| Searches.Page | Class to store data of the pages. |
| Searches.PagesDataset | Class to create and store results in dataset. |
| Searches.Site | Class to read the site configuration. |
| Searches.UserSearch | Class to store search information per user. |
SiteSearch.aspx
Web Forms are a new and exciting feature in Microsoft's .NET initiative. SiteSearch.aspx is a web form that is also the start page for the search module.
A Web Forms page consists of a page (ASPX file) and a code-behind file (either an .aspx.cs or an .aspx.vb file). Our web form comprises SiteSearch.aspx and SiteSearch.aspx.vb. We will treat them together, touching on the main elements of the web form.
ASP.NET is an event-driven programming environment. We will see some event handlers and methods in the coming section.
Page_Load
The server controls are loaded on the Page object and the view state information is available at this point. The Page_Load event checks if sSite is nothing and assigns the Session("Site") variable to it. See code 1.
Code 1
Private Sub Page_Load(ByVal sender As System.Object, _
        ByVal e As System.EventArgs) Handles MyBase.Load
    If IsNothing(sSite) Then
        sSite = Session("Site")
    End If
End Sub
srchbtn_Click
The search button event is fired when the search button is clicked. Here, we place the code to change control settings or display text on the page. We also check whether any search text was entered and, if so, call the SearchSite method. DisplayContent() is then called to assign values to the different controls on the web page. Refer to code 2.
Code 2
'*********************************************************
DisplayContent
DisplayContent() is called to assign values to different controls in the web page. The DataGrid content is set by calling the BindDataGrid method. ViewState("SortExpression") is used to store the sort expression. Check out code 3.
Code 3
'*********************************************************************
Search
The main call to the search takes place in this method. The UserSearch class, which we will cover shortly, stores the entire search information and the results of the search. A UserSearch object, srchSite, is created and its properties, such as SearchWords and SearchCriteria, are assigned. The srchSite.Search method is then called. Refer to code 4.
Code 4
'************************************************************
srchSite.Search(Server.MapPath("./"))
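Server.MapPath("./") resolves the current virtual directory to a physical folder, which is the starting point handed to the Search method. As a rough idea of what a search could do with that folder, the sketch below walks it and keeps only files with allowed extensions outside excluded folders. It is not the article's implementation: `FileWalkSketch`, `FindSearchableFiles` and its parameters are hypothetical names, and in the real module the extension and exclusion lists come from web.config.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO

Public Module FileWalkSketch
    ' Returns the files under rootPath whose extension is allowed and whose
    ' folder path does not contain an excluded folder name.
    ' extensions: lower-case, with leading dot, e.g. ".aspx" (assumed format).
    ' excludedFolders: lower-case folder names, e.g. "admin" (assumed format).
    Public Function FindSearchableFiles(ByVal rootPath As String, _
                                        ByVal extensions As String(), _
                                        ByVal excludedFolders As String()) As List(Of String)
        Dim root As String = rootPath.TrimEnd(Path.DirectorySeparatorChar)
        Dim found As New List(Of String)

        For Each filePath As String In Directory.GetFiles(root, "*", SearchOption.AllDirectories)
            ' Skip files whose extension is not in the allowed list.
            Dim ext As String = Path.GetExtension(filePath).ToLowerInvariant()
            If Array.IndexOf(extensions, ext) < 0 Then Continue For

            ' Folder path relative to the root, e.g. "\admin\reports".
            Dim folder As String = Path.GetDirectoryName(filePath).Substring(root.Length)
            Dim excluded As Boolean = False
            For Each part As String In folder.Split(Path.DirectorySeparatorChar)
                If Array.IndexOf(excludedFolders, part.ToLowerInvariant()) >= 0 Then excluded = True
            Next

            If Not excluded Then found.Add(filePath)
        Next

        Return found
    End Function
End Module
```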
DataGrid
The DataGrid control renders a multi-column, fully templated grid and is by far the most versatile of all data bound controls. Moreover, DataGrid control is the ASP.NET control of choice for data reporting. Hence, it has been used to display the search results. Since the focus of the article is on the internal search engine, a brief overview is provided of the DataGrid used here.
Databinding:
Data binding is the process of retrieving data from a source and dynamically associating it with a property of a visual element. Since a DataGrid handles (or at least holds in memory) multiple items at once, you should associate the DataGrid explicitly with a collection of data, i.e. the data source.
The content of a DataGrid is set by using its DataSource property. The entire search result is stored in sSite.PageDataset.Tables("Pages"). Hence, the content of the DataGrid is set to dvwPages, i.e. sSite.PageDataset.Tables("Pages").DefaultView. The BindDataGrid method is called every time the page loads. Check out code 5.
Code 5
'************************************************************
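The binding step described above comes down to assigning a DataView to DataSource and calling DataBind. A minimal sketch, assuming a project that references System.Web on the .NET Framework: the module and method names are hypothetical, while dvwPages and the "Pages" table name follow the article.

```vbnet
Imports System.Data
Imports System.Web.UI.WebControls

Public Module GridBindingSketch
    ' Binds the default view of the "Pages" table to the given grid.
    ' (Hypothetical helper; the article's BindDataGrid method may differ.)
    Public Sub BindPages(ByVal grid As DataGrid, ByVal results As DataSet)
        Dim dvwPages As DataView = results.Tables("Pages").DefaultView
        grid.DataSource = dvwPages
        ' DataBind reads the data source and creates the grid rows.
        grid.DataBind()
    End Sub
End Module
```

In a code-behind class, a call such as `BindPages(dgrdPages, sSite.PageDataset)` would stand in for the body of a binding method.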
The control has the ability to automatically generate columns that are based on the structure of the data source. Auto-generation is the default behavior of DataGrid, but you can manipulate that behavior using a Boolean property named AutoGenerateColumns. Set the property to False when you want the control to display only the columns you explicitly add to the Columns collection. Set it to True (the default) when you want the control to add as many columns as required by the data source. Auto-generation does not let you specify the header text, nor does it provide text formatting. Here, we set it to False. You typically bind columns using the <columns> tag in the body of the <asp:datagrid> server control. See code 6.
Code 6
<Columns>
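The same column setup can also be expressed in code rather than in markup. The sketch below is an alternative to the declarative <Columns> approach the article uses, shown only to make the AutoGenerateColumns behavior concrete. The column names "Title" and "MatchCount", the header texts and the module name are assumptions.

```vbnet
Imports System.Web.UI.WebControls

Public Module GridColumnsSketch
    ' Turns off auto-generation and adds two explicit bound columns.
    ' ("Title" and "MatchCount" are hypothetical data field names.)
    Public Sub ConfigureColumns(ByVal grid As DataGrid)
        grid.AutoGenerateColumns = False

        Dim titleColumn As New BoundColumn()
        titleColumn.DataField = "Title"
        titleColumn.HeaderText = "Page"
        grid.Columns.Add(titleColumn)

        Dim matchColumn As New BoundColumn()
        matchColumn.DataField = "MatchCount"
        matchColumn.HeaderText = "Matches"
        ' Explicit columns allow formatting, which auto-generation does not.
        matchColumn.DataFormatString = "{0:N0}"
        grid.Columns.Add(matchColumn)
    End Sub
End Module
```

Columns added in code generally have to be added again on each request, which is one reason the declarative form is the usual choice.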
The DisplayTitle and DisplayPath methods are used to display customized information in the columns of the DataGrid. Refer to code 7.
Code 7
'****************************************
Pagination:
Unlike DataList control, the DataGrid control supports data pagination, i.e. the ability to divide the displayed data source rows into pages. The size of our data source easily exceeds the page real estate. So, to preserve scalability on the server and provide a more accessible page to users, we display only a few rows at a time. To enable pagination of the DataGrid control, you need to tell the control about it. You do this through the AllowPaging property.
The pager bar is an interesting and complementary feature offered by the DataGrid control to let users move easily from page to page. The pager bar is a row displayed at the bottom of the DataGrid control that contains links to the available pages. When you click on any of these links, the control automatically fires the PageIndexChanged event and updates the page index accordingly. dgrdPages_PageIndexChanged is called when the page index changes. Check out code 8.
Code 8
'*****************************************************************
The pager bar is controlled using the PagerStyle property’s Mode attribute. Values for the Mode attribute come from PagerMode enumeration. Here, we have chosen a detailed series of numeric buttons, each of which points to a particular page.
<PagerStyle CssClass="GridPager" Mode="NumericPages"></PagerStyle>
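The paging pieces described in this section (AllowPaging, the numeric pager mode, and moving to the page reported by PageIndexChanged) can be sketched in code as follows. This assumes a project that references System.Web on the .NET Framework. The module name, the method names and the page size passed by the caller are hypothetical.

```vbnet
Imports System.Data
Imports System.Web.UI.WebControls

Public Module GridPagingSketch
    ' Switches paging on and selects the numeric pager bar.
    Public Sub EnablePaging(ByVal grid As DataGrid, ByVal rowsPerPage As Integer)
        grid.AllowPaging = True
        grid.PageSize = rowsPerPage
        grid.PagerStyle.Mode = PagerMode.NumericPages
    End Sub

    ' Moves the grid to the requested page and rebinds it.
    ' The grid does not refresh itself, so the data must be bound again.
    Public Sub ShowPage(ByVal grid As DataGrid, ByVal view As DataView, ByVal newPageIndex As Integer)
        grid.CurrentPageIndex = newPageIndex
        grid.DataSource = view
        grid.DataBind()
    End Sub
End Module
```

Inside a PageIndexChanged handler, the new index is available as `e.NewPageIndex`, so a call such as `ShowPage(dgrdPages, dvwPages, e.NewPageIndex)` would cover the handler body.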
Sorting:
The DataGrid control does not actually sort rows, but it provides good support for sorting as long as the sorting capabilities of the underlying data source are adequate. The data source is always responsible for returning a sorted set of records based on the sort expression selected by the user through the DataGrid control’s user interface. The built-in sorting mechanism is triggered by setting the AllowSorting property to True.
dgrdPages_SortCommand is called to sort the DataGrid (code 9). The SortCommand event handler knows about the sort expression through the SortExpression property, which is provided by the DataGridSortCommandEventArgs class. In our code, the sort information is persisted because it is stored in a slot in the page’s ViewState collection.
Note: In my pages, I have disabled the header but if the header is shown, you can use it to sort the DataGrid.
Code 9
'*****************************************************************
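Because the data source does the sorting, the core of a sort handler is setting the Sort property of the DataView before rebinding. The sketch below shows only that step, with sample data. The module name, the `SortPages` function and the "Title" and "MatchCount" columns are assumptions for illustration.

```vbnet
Imports System
Imports System.Data

Public Module GridSortingSketch
    ' Applies a sort expression such as "MatchCount DESC" to the table's view.
    Public Function SortPages(ByVal pages As DataTable, ByVal sortExpression As String) As DataView
        Dim view As DataView = pages.DefaultView
        view.Sort = sortExpression
        Return view
    End Function

    Public Sub Main()
        ' Sample rows standing in for the search results.
        Dim pages As New DataTable("Pages")
        pages.Columns.Add("Title", GetType(String))
        pages.Columns.Add("MatchCount", GetType(Integer))
        pages.Rows.Add("Home", 1)
        pages.Rows.Add("Help", 5)
        pages.Rows.Add("About", 3)

        Dim sorted As DataView = SortPages(pages, "MatchCount DESC")

        ' Prints Help: 5, About: 3, Home: 1 (one per line).
        For Each row As DataRowView In sorted
            Console.WriteLine("{0}: {1}", row("Title"), row("MatchCount"))
        Next
    End Sub
End Module
```

In a SortCommand handler, `e.SortExpression` would be saved in ViewState("SortExpression") and passed to a function like this one before the grid is bound again.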
The next section will delve into the Page object, the Search method and the UserSearch components.
Language: ASP.NET
Platform: Windows
