Creating a Site Search Engine - Part I

Introduction
This search engine module explores an entire page, including dynamic pages, to match one or more keywords or a phrase. It also counts the number of times the keywords or phrase occur on each page and displays the results with the highest match counts first. The module searches every file whose extension is listed in the web.config file where indicated. Files or folders that you do not want searched can also be listed in the web.config file where indicated. You can also choose the encoding.
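
To make the match counting concrete, here is a minimal sketch of how occurrences of a phrase could be counted in the text of a page. The module and function names are assumptions made for this illustration and are not taken from the SSE source, which may count matches differently.

```vb
Imports System
Imports System.Text.RegularExpressions

Module MatchCountSketch
    ' Counts case-insensitive occurrences of a phrase in the page text.
    ' Regex.Escape keeps characters such as "." or "+" literal.
    Function CountMatches(ByVal content As String, ByVal phrase As String) As Integer
        If String.IsNullOrEmpty(content) OrElse String.IsNullOrEmpty(phrase) Then
            Return 0
        End If
        Return Regex.Matches(content, Regex.Escape(phrase), RegexOptions.IgnoreCase).Count
    End Function

    Sub Main()
        Dim sample As String = "ASP.NET search: a search module for searching a site."
        ' "search" occurs three times, once as part of "searching"
        Console.WriteLine(CountMatches(sample, "search"))
    End Sub
End Module
```

Pages can then be ranked by sorting on this count in descending order. Note that this sketch counts substring matches, so "search" is also found inside "searching".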

This article contains tips to globalize and enhance the code.

Note: It is best suited for small sites. Also, you can modify this code to crawl pages internally by using regular expressions. For larger sites, you will need to write to the XML file periodically and then read from the XML file. Tips have been offered at the end of the section for this purpose.
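
As a rough illustration of the regular-expression idea in the note above, the following sketch pulls href values out of anchor tags so that linked pages could be queued for searching. The module name, function name and pattern are assumptions made for this example and are not part of the SSE source. A regular expression is also only an approximation of real HTML parsing.

```vb
Imports System
Imports System.Text.RegularExpressions

Module LinkExtractSketch
    ' Returns the href values found in the anchor tags of an HTML string.
    Function ExtractLinks(ByVal html As String) As String()
        ' Hypothetical pattern: an <a> tag, any attributes, then a quoted href
        Dim pattern As String = "<a\s[^>]*?href\s*=\s*[""']([^""']+)[""']"
        Dim found As MatchCollection = Regex.Matches(html, pattern, RegexOptions.IgnoreCase)
        Dim links(found.Count - 1) As String
        For i As Integer = 0 To found.Count - 1
            links(i) = found(i).Groups(1).Value
        Next
        Return links
    End Function

    Sub Main()
        Dim html As String = "<a href=""Default.aspx"">Home</a> " & _
            "<A class='nav' href='Products/List.aspx?id=3'>Products</A>"
        For Each link As String In ExtractLinks(html)
            Console.WriteLine(link)
        Next
    End Sub
End Module
```

A crawler built this way would still need to resolve relative paths, skip external hosts and remember which pages it has already visited.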

Background
Our Site Search Engine (SSE) helps users find the pages that interest them. When I was working on an ASP.NET project, I had to add a site search module. I had one in ASP but not in .NET. Hence the birth of this site search engine. My first version was just a single web form, and I had not exploited the full features of an object-oriented .NET language. In my spare time, I reworked my code to make better use of object orientation. For this article, I further enhanced my design on the basis of experience and good practices suggested by different authors.

Mr. Song Tao from Beijing, China, approached me with queries on how to convert the module into Chinese. With his help, I enhanced the code to support other languages. Also, a few users encountered errors when SiteSearch.aspx was placed in the root. I modified the code to rectify this.

Source Code Overview

The structure of SSE is as shown in figure 1.



Classes

The ability to define a class and create instances of classes is one of the most important capabilities of any object-oriented language. In the coming section, we shall see the classes that we have used in our search module. Refer to figure 2 and table 1.



Table 1

Class Name              Description
SiteSearch              Class for a web form where the user can search a site for certain words.
Searches.CleanHtml      Class to clean the HTML content.
Searches.FileContent    Class to get content from the HTML file.
Searches.Page           Class to store data of the pages.
Searches.PagesDataset   Class to create and store results in a dataset.
Searches.Site           Class to read the site configuration.
Searches.UserSearch     Class to store search information per user.


SiteSearch.aspx

Web Forms are a new and exciting feature in Microsoft's .NET initiative. SiteSearch.aspx is a web form that is also the start page for the search module.

 

A Web Forms page consists of a page (ASPX file) and a code behind file (either .aspx.cs or .aspx.vb file). Our web form comprises SiteSearch.aspx and SiteSearch.aspx.vb. We will be treating them simultaneously, touching on the main elements of the web form.

 

ASP.NET is an event-driven programming environment. We will see some event handlers and methods in the coming section.

 

Page_Load

The server controls are loaded on the Page object and the view state information is available at this point. The Page_Load event checks if sSite is nothing and assigns the Session("Site") variable to it. See code 1.

Code 1

Private Sub Page_Load(ByVal sender As System.Object, _
    ByVal e As System.EventArgs) Handles MyBase.Load
  If IsNothing(sSite) Then
    sSite = Session("Site")
  End If
End Sub


srchbtn_Click


The search button event is fired when the search button is clicked. Here, we place the code to change control settings or display text on the page. We check whether the search box contains text and, if so, call the SearchSite method. DisplayContent() is then called to assign values to the controls on the web page. Refer to code 2.

Code 2

 

'*********************************************************
'
' srchbtn_Click event
'
' Runs the search when the user clicks the search button.
'
'*********************************************************
Private Sub srchbtn_Click(ByVal sender As System.Object, _
    ByVal e As System.EventArgs) Handles srchbtn.Click
  Dim strSearchWords As String
  'If the user entered no words to search for,
  'then don't carry out the file search routine
  pnlSearchResults.Visible = False
  strSearchWords = Trim(Request.Params("search"))

  If Not strSearchWords.Equals("") Then
    Searchs.Site.ApplicationPath = String.Format("http://{0}{1}", _
        Request.ServerVariables("HTTP_HOST"), Request.ApplicationPath)
    sSite = SearchSite(strSearchWords)
    'Keep the results in session state for paging and sorting
    Session("Site") = sSite
    dgrdPages.CurrentPageIndex = 0
    DisplayContent()
  End If
End Sub


DisplayContent

DisplayContent() is called to assign values to different controls in the web page. The DataGrid content is set by calling the BindDataGrid method. ViewState("SortExpression") is used to store the sort expression. Check out code 3.

 

Code 3

'*********************************************************************
'
' DisplayContent method
'
' The data is bound to the respective fields.
'
'*********************************************************************
Private Sub DisplayContent()
  If Not IsNothing(sSite.PageDataset) Then
    pnlSearchResults.Visible = True
    lblSearchWords.Text = sSite.SearchWords
    If ViewState("SortExpression") Is Nothing Then
      ViewState("SortExpression") = "MatchCount Desc"
    End If
    BindDataGrid(ViewState("SortExpression"))
    lblTotalFiles.Text = sSite.TotalFilesSearched
    lblFilesFound.Text = sSite.TotalFilesFound
  End If
End Sub


SearchSite

The main call to the search takes place in this method. The UserSearch class, which we will cover shortly, stores the entire search information and the results of the search. A UserSearch object, srchSite, is created and its properties, such as SearchWords and SearchCriteria, are assigned. The srchSite.Search method is then called. Refer to code 4.

 

Code 4

'************************************************************
'
' SearchSite method
'
' Creates a UserSearch object, sets its search words and
' criteria, runs the search and returns the object.
'
'************************************************************
Private Function SearchSite(ByVal strSearch _
    As String) As Searchs.UserSearch
  Dim srchSite As Searchs.UserSearch
  srchSite = New Searchs.UserSearch()
  'Read in all the search words into one variable
  srchSite.SearchWords = strSearch

  'Pick the criteria from the option selected on the form
  If Phrase.Checked Then
    srchSite.SearchCriteria = Searchs.SearchCriteria.Phrase
  ElseIf AllWords.Checked Then
    srchSite.SearchCriteria = Searchs.SearchCriteria.AllWords
  ElseIf AnyWords.Checked Then
    srchSite.SearchCriteria = Searchs.SearchCriteria.AnyWords
  End If
  srchSite.Search(Server.MapPath("./"))
  Return srchSite
End Function
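
The three search criteria can be pictured with the following sketch. It is only an assumption about how Phrase, AllWords and AnyWords might be evaluated against the text of a page; the enum here is a hypothetical stand-in for Searchs.SearchCriteria, and the IsMatch function is not part of the SSE source.

```vb
Imports System

Module CriteriaSketch
    ' Hypothetical stand-in for Searchs.SearchCriteria
    Enum SearchCriteria
        Phrase
        AllWords
        AnyWords
    End Enum

    ' Returns True when the content satisfies the chosen criteria.
    Function IsMatch(ByVal content As String, ByVal searchWords As String, _
                     ByVal criteria As SearchCriteria) As Boolean
        Dim haystack As String = content.ToLower()
        Dim needle As String = searchWords.Trim().ToLower()
        If needle.Length = 0 Then Return False

        ' Phrase: the words must appear together, in the given order
        If criteria = SearchCriteria.Phrase Then
            Return haystack.IndexOf(needle) >= 0
        End If

        Dim words() As String = needle.Split(New Char() {" "c}, _
            StringSplitOptions.RemoveEmptyEntries)
        For Each word As String In words
            Dim found As Boolean = haystack.IndexOf(word) >= 0
            If criteria = SearchCriteria.AnyWords AndAlso found Then Return True
            If criteria = SearchCriteria.AllWords AndAlso Not found Then Return False
        Next
        ' Loop finished: AllWords found every word, AnyWords found none
        Return criteria = SearchCriteria.AllWords
    End Function

    Sub Main()
        Dim page As String = "Creating a site search engine in ASP.NET"
        Console.WriteLine(IsMatch(page, "search engine", SearchCriteria.Phrase))   ' True
        Console.WriteLine(IsMatch(page, "engine site", SearchCriteria.Phrase))     ' False
        Console.WriteLine(IsMatch(page, "engine site", SearchCriteria.AllWords))   ' True
        Console.WriteLine(IsMatch(page, "engine forum", SearchCriteria.AnyWords))  ' True
    End Sub
End Module
```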



DataGrid


The DataGrid control renders a multi-column, fully templated grid and is by far the most versatile of all data bound controls. Moreover, DataGrid control is the ASP.NET control of choice for data reporting. Hence, it has been used to display the search results. Since the focus of the article is on the internal search engine, a brief overview is provided of the DataGrid used here.

 
Databinding:

Data binding is the process of retrieving data from a source and dynamically associating it with a property of a visual element. Since a DataGrid handles (or at least holds in memory) multiple items simultaneously, you should associate the DataGrid explicitly with a collection of data, i.e. the data source.

 

The content of a DataGrid is set by using its DataSource property. The entire search result is stored in sSite.PageDataset.Tables("Pages"). Hence, the DataGrid is bound to dvwPages, i.e. sSite.PageDataset.Tables("Pages").DefaultView. The BindDataGrid method is called from DisplayContent each time the results are displayed. Check out code 5.

 

Code 5

'************************************************************
'
' BindDataGrid method
'
' The sSite.PageDataset is used to populate the datagrid.
'
'************************************************************
Private Sub BindDataGrid(ByVal strSortField As String)
  Dim dvwPages As DataView
  dvwPages = sSite.PageDataset.Tables("Pages").DefaultView
  dvwPages.Sort = strSortField
  dgrdPages.DataSource = dvwPages
  dgrdPages.DataBind()
End Sub
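
To see what the Sort assignment does outside of a web page, here is a small console sketch that sorts a DataView with the same "MatchCount Desc" expression. The table and its rows are made up for this illustration; only the "Title" and "MatchCount" column names come from the code in this article.

```vb
Imports System
Imports System.Data

Module DataViewSortSketch
    Sub Main()
        ' Hypothetical stand-in for sSite.PageDataset.Tables("Pages")
        Dim pages As New DataTable("Pages")
        pages.Columns.Add("Title", GetType(String))
        pages.Columns.Add("MatchCount", GetType(Integer))
        pages.Rows.Add(New Object() {"Home", 2})
        pages.Rows.Add(New Object() {"Products", 7})
        pages.Rows.Add(New Object() {"Contact", 4})

        ' The view, not the table, carries the sort order
        Dim view As DataView = pages.DefaultView
        view.Sort = "MatchCount Desc"

        ' Rows come back as Products (7), Contact (4), Home (2)
        For Each row As DataRowView In view
            Console.WriteLine("{0}: {1}", row("Title"), row("MatchCount"))
        Next
    End Sub
End Module
```

Because the sort lives on the DataView, the underlying table keeps its original row order and can be viewed with a different sort expression later.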


The control has the ability to automatically generate columns that are based on the structure of the data source. Auto-generation is the default behavior of DataGrid, but you can manipulate that behavior using a Boolean property named AutoGenerateColumns. Set the property to False when you want the control to display only the columns you explicitly add to the Columns collection. Set it to True (the default) when you want the control to add as many columns as required by the data source. Auto-generation does not let you specify the header text, nor does it provide text formatting. Here, we set it to False. You typically bind columns using the <columns> tag in the body of the <asp:datagrid> server control. See code 6.

 

Code 6

<Columns>
<asp:TemplateColumn>
  <ItemTemplate>
   <%# DisplayTitle(Container.DataItem( "Title" ), _
        Container.DataItem( "Path" )) %>
   <br>
   <%# Container.DataItem( "Description" ) %>
   <br>
   <span class="Path">
    <%# String.Format("{0} - {1}kb", DisplayPath( _
         Container.DataItem( "Path" )) , _
         Container.DataItem( "Size" ))%>
   </span>
   <br>
   <br>
  </ItemTemplate>
</asp:TemplateColumn>
</Columns>


The DisplayTitle and DisplayPath methods are used to display customized information in the columns of the DataGrid. Refer to code 7.

 

Code 7

'****************************************
   '
   ' DisplayTitle method
   '
   ' Display title of searched pages
   '
   '****************************************
   Protected Function DisplayTitle(ByVal Title _
       As String, ByVal Path As String) As String
      Return String.Format("<A href=""{1}"">{0}</A>", Title, Path)
   End Function
   '****************************************
   '
   ' DisplayPath method
   '
   ' Path of the file is returned
   '
   '****************************************
   Protected Function DisplayPath(ByVal Path As String) As String
      Return String.Format("{0}{1}/{2}", _
       Request.ServerVariables("HTTP_HOST"), _
       Request.ApplicationPath, Path)
   End Function

Pagination:

Unlike the DataList control, the DataGrid control supports data pagination, i.e. the ability to divide the displayed data source rows into pages. The size of our data source can easily exceed the page real estate. So, to preserve scalability on the server and provide a more accessible page to users, we display only a few rows at a time. To enable pagination of the DataGrid control, you need to tell the control about it. You do this through the AllowPaging property.

 

The pager bar is an interesting and complimentary feature offered by the DataGrid control to let users move easily from page to page. The pager bar is a row displayed at the bottom of the DataGrid control that contains links to available pages. When you click on any of these links, the control automatically fires the PageIndexChanged event and updates the page index accordingly. dgrdPages_PageIndexChanged is called when the page index changes. Check out code 8.

 

Code 8

'*****************************************************************
'
' dgrdPages_PageIndexChanged event
'
' The CurrentPageIndex is assigned the page index value.
' The datagrid is then populated using the BindDataGrid function.
'
'*****************************************************************
Protected Sub dgrdPages_PageIndexChanged(ByVal s As Object, _
    ByVal e As DataGridPageChangedEventArgs) _
    Handles dgrdPages.PageIndexChanged
  dgrdPages.CurrentPageIndex = e.NewPageIndex
  DisplayContent()
End Sub


The pager bar is controlled using the PagerStyle property’s Mode attribute. Values for the Mode attribute come from the PagerMode enumeration. Here, we have chosen a detailed series of numeric buttons, each of which points to a particular page.

 

<PagerStyle CssClass="GridPager" 
Mode="NumericPages"></PagerStyle>
 
Sorting:

The DataGrid control does not actually sort rows, but provides good support for sorting as long as the sorting capabilities of the underlying data source are adequate. The data source is always responsible for returning a sorted set of records based on the sort expression selected by the user through the DataGrid control’s user interface. The built-in sorting mechanism is triggered by setting the AllowSorting property to True.

 

dgrdPages_SortCommand is called to sort the DataGrid (code 9). The SortCommand event handler knows about the sort expression through the SortExpression property, which is provided by the DataGridSortCommandEventArgs class. In our code, the sort information is persisted because it is stored in a slot in the page’s ViewState collection.

 

Note: In my pages, I have disabled the header but if the header is shown, you can use it to sort the DataGrid.

 

Code 9


'*****************************************************************
'
' dgrdPages_SortCommand event
'
' The ViewState("SortExpression") is assigned
' the sort expression value.
' The datagrid is then populated using the BindDataGrid function.
'
'*****************************************************************
Protected Sub dgrdPages_SortCommand(ByVal s As Object, _
    ByVal e As DataGridSortCommandEventArgs) _
    Handles dgrdPages.SortCommand
  ViewState("SortExpression") = e.SortExpression
  DisplayContent()
End Sub


The next section will delve into the Page object, the search method and the UserSearch components.


Language: ASP.NET
Platform: Windows
 




Added on July 25, 2007