Creating a Site Search Engine - Part IV

GetMetaContent Method

The GetMetaContent method uses regular expressions to strip the tags and extract the required information. See code 1.

 

Code 1

    '************************************************
    '
    ' GetMetaContent Method
    '
    ' Metacontent is stripped in this method
    '
    '************************************************
    ' Requires: Imports System.Text.RegularExpressions
    Private Shared Function GetMetaContent(ByVal strFile As String, _
     ByVal strMetaStart As String, ByVal strMetaEnd As String) As String
      'Extract the text between strMetaStart and strMetaEnd
      Dim regexp As Regex
      Dim strMeta As String
      Dim strPattern As String
      Dim strInPattern As String

      'If no description or keywords are found then you may be
      'using http-equiv= instead of name= in your meta tags
      If InStr(1, LCase(strFile), strMetaStart, 1) = 0 _
       AndAlso InStr(strMetaStart, "name=") > 0 Then
        'Swap name= for http-equiv=
        strMetaStart = Replace(strMetaStart, "name=", "http-equiv=")
      End If

      'Build a pattern that matches the whole tag, even across line breaks
      strInPattern = "((.|\n)*?)"
      strPattern = String.Format("{0}{1}{2}", _
       strMetaStart, strInPattern, strMetaEnd)
      regexp = New Regex(strPattern, RegexOptions.IgnoreCase)
      'Match Pattern
      strMeta = regexp.Match(strFile).ToString

      'Build a pattern that captures only the content of the tag
      strInPattern = "(.*?)"
      strPattern = String.Format("{0}{1}{2}", _
       strMetaStart, strInPattern, strMetaEnd)
      'Get Pattern content (this Replace overload is a shared method of Regex)
      strMeta = Regex.Replace(strMeta, strPattern, _
                "$1", RegexOptions.IgnoreCase)
      Return strMeta
    End Function
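
To make the two-step extraction concrete, here is a minimal, self-contained sketch of the same idea applied to a description meta tag. The module name, the ExtractBetween helper, the sample HTML, and the start and end patterns are all assumptions made for illustration; the article does not show the values the search engine actually passes to GetMetaContent.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module MetaContentSketch
    ' Hypothetical helper that mirrors the two steps of GetMetaContent:
    ' first match the whole tag, then keep only the captured content.
    Function ExtractBetween(ByVal html As String, _
        ByVal startPattern As String, ByVal endPattern As String) As String
        ' Step 1: find the first start...end occurrence, across line breaks
        Dim whole As String = Regex.Match(html, _
            startPattern & "((.|\n)*?)" & endPattern, _
            RegexOptions.IgnoreCase).Value
        ' Step 2: replace the matched tag with its first capture group
        Return Regex.Replace(whole, _
            startPattern & "(.*?)" & endPattern, "$1", _
            RegexOptions.IgnoreCase)
    End Function

    Sub Main()
        ' Sample input and patterns are assumptions made for this sketch
        Dim html As String = "<head><meta name=""description"" " & _
            "content=""A small site search engine""></head>"
        Dim description As String = ExtractBetween(html, _
            "<meta name=""description"" content=""", """>")
        Console.WriteLine(description)
    End Sub
End Module
```

Run as a console program, the sketch prints A small site search engine. Note that the second pattern does not match across line breaks, so a content attribute that spans several lines would come back with its surrounding tag still attached.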



This class (figure 1) is used to create and build the DataSet. It consists of two methods, Create and StoreFile. The Create method creates a DataSet to store the search results, and StoreFile is responsible for adding records to the DataTable in the DataSet. See code 2.

Code 2

    '*******************************************************
    '
    ' Create Method - Shared method
    '
    ' Creates a dataset for the pages and returns the result
    '
    '********************************************************
    ' Requires: Imports System.Data
    Public Shared Function Create() As DataSet
      'Objects are defined
      Dim pgDataSet As New DataSet()
      'One-element array (upper bound 0) that holds the primary key column
      Dim keys(0) As DataColumn
      'Table is created and added to table collection
      pgDataSet.Tables.Add(New DataTable("Pages"))
      'Schema of table is defined
      pgDataSet.Tables("Pages").Columns.Add("PageId", _
         System.Type.GetType("System.Int32"))
      pgDataSet.Tables("Pages").Columns.Add("Title", _
         System.Type.GetType("System.String"))
      pgDataSet.Tables("Pages").Columns.Add("Description", _
         System.Type.GetType("System.String"))
      pgDataSet.Tables("Pages").Columns.Add("Path", _
         System.Type.GetType("System.String"))
      pgDataSet.Tables("Pages").Columns.Add("MatchCount", _
         System.Type.GetType("System.Int32"))
      pgDataSet.Tables("Pages").Columns.Add("Size", _
         System.Type.GetType("System.Decimal"))

      'PageId is defined as identity
      pgDataSet.Tables("Pages").Columns("PageId").AutoIncrement = True
      pgDataSet.Tables("Pages").Columns("PageId").AutoIncrementSeed = 1

      'PageId is defined as the primary key
      keys(0) = pgDataSet.Tables("Pages").Columns("PageId")
      pgDataSet.Tables("Pages").PrimaryKey = keys

      Return pgDataSet
    End Function


    '********************************************************
    '
    ' StoreFile Method - Shared method
    '
    ' Adds a searched page as a new row to the Pages table
    '
    '********************************************************
    Public Shared Sub StoreFile(ByVal dstPgs As DataSet, _
                  ByVal srchPg As Searchs.Page)
      'Objects are defined
      Dim pageRow As DataRow
      'New row is created
      pageRow = dstPgs.Tables("Pages").NewRow()
      'Data is added (PageId is assigned automatically)
      pageRow("Title") = srchPg.Title
      pageRow("Description") = srchPg.Description
      pageRow("Path") = srchPg.Path
      pageRow("MatchCount") = srchPg.MatchCount
      pageRow("Size") = srchPg.Size
      'Row is added to the dataset
      dstPgs.Tables("Pages").Rows.Add(pageRow)
    End Sub
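
The interplay between the auto-increment key and the rows added by StoreFile can be tried out in isolation. The following is a minimal sketch, not the author's class: it builds a cut-down table with only three of the columns, and the module name and sample values are made up for the example.

```vbnet
Imports System
Imports System.Data

Module PagesTableSketch
    Sub Main()
        ' Build a cut-down table shaped like the "Pages" table
        Dim pages As New DataTable("Pages")
        Dim idColumn As DataColumn = _
            pages.Columns.Add("PageId", GetType(Integer))
        idColumn.AutoIncrement = True
        idColumn.AutoIncrementSeed = 1
        pages.Columns.Add("Title", GetType(String))
        pages.Columns.Add("MatchCount", GetType(Integer))
        pages.PrimaryKey = New DataColumn() {idColumn}

        ' Add rows without setting PageId; the table assigns it
        Dim firstRow As DataRow = pages.NewRow()
        firstRow("Title") = "Home"
        firstRow("MatchCount") = 3
        pages.Rows.Add(firstRow)

        Dim secondRow As DataRow = pages.NewRow()
        secondRow("Title") = "Contact"
        secondRow("MatchCount") = 1
        pages.Rows.Add(secondRow)

        For Each row As DataRow In pages.Rows
            Console.WriteLine("{0}: {1} ({2} matches)", _
                row("PageId"), row("Title"), row("MatchCount"))
        Next
    End Sub
End Module
```

The two rows are printed with the PageId values 1 and 2, which shows why StoreFile never has to set the key itself.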



The CleanHtml class (figure 2) contains a single public shared method that uses regular expressions to clean the HTML content. See code 3.

Code 3

    '*****************************************************
    '
    ' Clean Method
    '
    ' Function to clean the file of HTML content
    '
    '*****************************************************
    ' Requires: Imports System.Text.RegularExpressions
    Public Shared Function Clean(ByVal Contents As String) As String
      'Remove elements whose content should not be searched
      Contents = Regex.Replace(Contents, _
       "<(select|option|script|style|title)(.*?)" & _
       ">((.|\n)*?)</(SELECT|OPTION|SCRIPT|STYLE|TITLE)>", _
       " ", RegexOptions.IgnoreCase)

      'Remove common character entities
      Contents = Regex.Replace(Contents, "&(nbsp|quot|copy);", "")

      'Contents = Regex.Replace(Contents, "<[^>]*>", "")

      'Replace the remaining tags with spaces
      Contents = Regex.Replace(Contents, "<([\s\S])+?>", _
       " ", RegexOptions.IgnoreCase).Replace(" ", " ")

      'Contents = Regex.Replace(Contents, "<[^<>]+>", _
      ' " ", RegexOptions.IgnoreCase)

      'Contents = Regex.Replace("(<(\w+)[^>]*?>(.*?)</\1>", "$1")
      'Replace non-word characters with spaces
      Contents = Regex.Replace(Contents, "\W", " ")

      'Trace.Warn("File Contents", Contents)
      Return Contents
    End Function
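
The effect of this kind of cleaning is easiest to see on a small input. The sketch below is a simplified variant rather than the Clean method itself: the StripTags name and the sample HTML are assumptions, and collapsing runs of non-word characters with \W+ and trimming the result are additions made only to keep the output tidy.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module CleanHtmlSketch
    ' Hypothetical, simplified counterpart of the Clean method
    Function StripTags(ByVal contents As String) As String
        ' Drop elements whose text should not be indexed
        contents = Regex.Replace(contents, _
            "<(script|style|title)(.*?)>((.|\n)*?)</(script|style|title)>", _
            " ", RegexOptions.IgnoreCase)
        ' Replace every remaining tag with a space
        contents = Regex.Replace(contents, "<[^>]*>", " ")
        ' Turn runs of non-word characters into a single space
        contents = Regex.Replace(contents, "\W+", " ")
        Return contents.Trim()
    End Function

    Sub Main()
        Dim html As String = "<html><title>Home</title><body>" & _
            "<p>Hello, <b>world</b>!</p></body></html>"
        Console.WriteLine(StripTags(html))
    End Sub
End Module
```

For this input the sketch prints Hello world: the title element disappears together with its text, while the text inside the remaining tags is kept.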



The Site class (figure 3) consists of shared properties that store the configuration of the entire site. These properties get their values from the web.config file through ConfigurationSettings.AppSettings. See table 1 and code 4.

The properties of the Site class are as follows:

Table 1

FilesTypesToSearch: Returns the file types you want to search.

DynamicFilesTypesToSearch: Returns the dynamic file types to search.

BarredFolders: Returns the barred folders.

EnglishLanguage: Returns a value (True or False) that indicates whether the language of the site is English.

Encoding: Returns the encoding for the site.

BarredFiles: Returns the barred files.

ApplicationPath: Assigns and returns the path of the application.

Code 4

    ' Requires: Imports System.Configuration
    '*************************************************
    '
    ' FilesTypesToSearch ReadOnly Property
    '
    ' Retrieve FilesTypesToSearch of the site
    '
    '*************************************************
    Public Shared ReadOnly Property FilesTypesToSearch() As String
      Get
        Return ConfigurationSettings.AppSettings("FilesTypesToSearch")
      End Get
    End Property

    '*************************************************
    '
    ' DynamicFilesTypesToSearch ReadOnly Property
    '
    ' Retrieve DynamicFilesTypesToSearch of the site
    '
    '*************************************************
    Public Shared ReadOnly Property DynamicFilesTypesToSearch() As String
      Get
        Return ConfigurationSettings.AppSettings( _
         "DynamicFilesTypesToSearch")
      End Get
    End Property

    '*************************************************
    '
    ' BarredFolders ReadOnly Property
    '
    ' Retrieve BarredFolders of the site
    '
    '*************************************************
    Public Shared ReadOnly Property BarredFolders() As String
      Get
        Return ConfigurationSettings.AppSettings("BarredFolders")
      End Get
    End Property

    '*************************************************
    '
    ' BarredFiles ReadOnly Property
    '
    ' Retrieve BarredFiles of the site
    '
    '*************************************************
    Public Shared ReadOnly Property BarredFiles() As String
      Get
        Return ConfigurationSettings.AppSettings("BarredFiles")
      End Get
    End Property

    '*************************************************
    '
    ' EnglishLanguage ReadOnly Property
    '
    ' Retrieve EnglishLanguage of the site
    '
    '*************************************************
    Public Shared ReadOnly Property EnglishLanguage() As String
      Get
        Return ConfigurationSettings.AppSettings("EnglishLanguage")
      End Get
    End Property

    '*************************************************
    '
    ' Encoding ReadOnly Property
    '
    ' Retrieve Encoding of the site
    '
    '*************************************************
    Public Shared ReadOnly Property Encoding() As String
      Get
        Return ConfigurationSettings.AppSettings("Encoding")
      End Get
    End Property

    '*************************************************
    '
    ' ApplicationPath Property
    '
    ' Assign and retrieve ApplicationPath of the site
    ' (m_ApplicationPath is a private String field of
    ' the class; its declaration is not shown here)
    '
    '*************************************************
    Public Property ApplicationPath() As String
      Get
        Return m_ApplicationPath
      End Get
      Set(ByVal Value As String)
        m_ApplicationPath = Value
      End Set
    End Property


Web.config

The ASP.NET configuration system features an extensible infrastructure that enables you to define the configuration settings at the time your ASP.NET applications are first deployed, so that you can add or revise configuration settings at any time with minimal impact on operational web applications and servers. Multiple configuration files, all named Web.config, can appear in multiple directories on an ASP.NET web application server. Each Web.config file applies configuration settings to its own directory and all its child directories. As mentioned earlier, the site configurations can be assigned in the web.config file. See code 5.

 

Code 5

<appSettings>
  <!-- Place the names of the file types you want searched
       in the following line, separated by commas -->
  <add key="FilesTypesToSearch" value=".htm,.html,.asp,.shtml,.aspx" />

  <!-- Place the names of the dynamic file types you want
       searched in the following line, separated by commas -->
  <add key="DynamicFilesTypesToSearch" value=".asp,.shtml,.aspx" />

  <!-- Place the names of the folders you don't want searched
       in the following line, separated by commas -->
  <add key="BarredFolders" value="aspnet_client,_private,_vti_cnf,_vti_log,_vti_pvt,_vti_script,_vti_txt,cgi_bin,_bin,bin,_notes,images,scripts" />

  <!-- Place the names of the files you don't want searched in the
       following line, separated by commas; include the file extension -->
  <add key="BarredFiles" value="localstart.asp,iisstart.asp,AssemblyInfo.vb,Global.asax,Global.asax.vb,SiteSearch.aspx" />

  <!-- Set this boolean to False if you are not using an English language web site -->
  <add key="EnglishLanguage" value="True" />

  <!-- Set this to the encoding of the web site -->
  <add key="Encoding" value="utf-8" />
</appSettings>
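
Since every list in appSettings is a plain comma-separated string, the code that consumes it has to split the value and compare the entries. The article does not show that code at this point, so the following is only a sketch of one way it could be done; the IsInList helper and the shortened sample values are assumptions.

```vbnet
Imports System
Imports System.IO

Module BarredListSketch
    ' Hypothetical helper: True when name appears in a comma-separated list
    Function IsInList(ByVal name As String, _
        ByVal commaList As String) As Boolean
        For Each entry As String In commaList.Split(","c)
            ' Ignore case and stray spaces around the entries
            If String.Compare(entry.Trim(), name, True) = 0 Then
                Return True
            End If
        Next
        Return False
    End Function

    Sub Main()
        ' Values taken from the appSettings shown above (folders shortened)
        Dim fileTypes As String = ".htm,.html,.asp,.shtml,.aspx"
        Dim barredFolders As String = _
            "aspnet_client,_private,bin,images,scripts"

        ' Is the extension of this file one of the searchable types?
        Console.WriteLine( _
            IsInList(Path.GetExtension("default.aspx"), fileTypes))
        ' Is this folder barred from the search?
        Console.WriteLine(IsInList("Images", barredFolders))
        Console.WriteLine(IsInList("articles", barredFolders))
    End Sub
End Module
```

The sketch prints True, True and False. Trimming each entry makes the comparison tolerant of spaces after the commas, which is also why the values above are best kept on a single line.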


How to integrate?

The application has been tested with the web form SiteSearch.aspx in the root directory. So, it is suggested that you do the same. Later on, you can try moving it to any subfolder. I have placed all my classes in the components folder. You can move them to any folder of your choice.

 

Note:

  1. For those users who do not have Visual Studio .Net:
    1. Download from the link 'Download latest version of demo project (Visual studio.net not required)';
    2. Place SearchDotnet.dll in the bin folder in the root; and
    3. Place the SiteSearch.aspx and web.config in the root.
  2. To use the XML version:
    1. Download from the link 'Download demo project which reads and writes to XML(VB.net)'.
    2. The project contains the following files:
      1. AdminSearch.aspx is used to write xml to file.
      2. SiteSearch.aspx is used to search files.
      3. All the classes have been placed in components folder.

Errors

When the application is placed in the root, you may get the following errors:

The remote server returned an error: (401) Unauthorized.

 

OR

 

The remote server returned an error: (500) Internal Server Error.

 

These errors have the following causes:

  1. (401) Unauthorized means that the application is unable to read the file owing to access rights issues; and
  2. (500) Internal Server Error means that the page the application was trying to read returned an error, either because the page itself has an error or because it requires parameters that were not supplied.

 

The following steps will help rectify these errors:

  1. In the Web.config file, ensure that the BarredFolders list is comprehensive and contains
    aspnet_client,_private,_vti_cnf,_vti_log,_vti_pvt,_vti_script,_vti_txt,cgi_bin,_bin,bin,_notes,images,scripts; and
  2. Ensure that the BarredFiles list is comprehensive and contains localstart.asp,iisstart.asp.
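
Both messages are the texts that System.Net.WebException carries when an HTTP request fails, so while you track down the offending pages it can help to report the status code for each page that could not be read. The sketch below is not taken from the application; the DescribeFailure helper and the placeholder URL are assumptions.

```vbnet
Imports System
Imports System.Net

Module FetchErrorSketch
    ' Hypothetical helper: describe why a page could not be read
    Function DescribeFailure(ByVal ex As WebException) As String
        Dim response As HttpWebResponse = _
            TryCast(ex.Response, HttpWebResponse)
        If response Is Nothing Then
            ' No HTTP answer at all, e.g. the server could not be reached
            Return "No HTTP response: " & ex.Status.ToString()
        End If
        Select Case response.StatusCode
            Case HttpStatusCode.Unauthorized
                Return "401: check access rights or bar the file or folder"
            Case HttpStatusCode.InternalServerError
                Return "500: the page itself failed; consider barring it"
            Case Else
                Return CInt(response.StatusCode).ToString() & ": page skipped"
        End Select
    End Function

    Sub Main()
        Try
            ' Placeholder URL; replace it with a page of your own site
            Dim request As WebRequest = _
                WebRequest.Create("http://localhost/SiteSearch.aspx")
            Using response As WebResponse = request.GetResponse()
                Console.WriteLine("Page read successfully")
            End Using
        Catch ex As WebException
            Console.WriteLine(DescribeFailure(ex))
        End Try
    End Sub
End Module
```

Logging the failing address together with this description makes it easier to decide which entries to add to BarredFolders or BarredFiles.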

 

Globalization

The search engine module can be globalized easily. As an example, we will see how to convert it to Chinese.

 

Web.config

The XML declaration must appear as the first line of the document without any other content, including white space, in front of the start <.

 

The XML declaration takes the following form: <?xml version="1.0" encoding="Your Encoding" ?>. By default, Visual Studio uses the utf-8 encoding; change this to the encoding that you want to use. Here, we will change it to gb2312, so the XML declaration needs to be modified as follows:

 

English

<?xml version="1.0" encoding="utf-8" ?>

Chinese

<?xml version="1.0" encoding="gb2312" ?>

The requestEncoding and responseEncoding specify the assumed encoding of each incoming request and outgoing response. The default encoding is UTF-8, specified in the <globalization> tag included in the Machine.config file created when the .NET Framework is installed. If encoding is not specified in the Machine.config or Web.config file, encoding defaults to the computer's Regional Options locale setting. We will need to change requestEncoding and responseEncoding to reflect the change in encoding.

English

<globalization requestEncoding="utf-8" responseEncoding="utf-8" />

Chinese

<globalization requestEncoding="gb2312" responseEncoding="gb2312" />

To avoid rebuilding the code when the encoding changes, we need to add the Encoding key to appSettings.
<!-- Set this to the encoding of the web site -->
<add key="Encoding" value="gb2312" />
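
At run time the value of this key is just a name; System.Text.Encoding.GetEncoding turns it into an encoding object that can be used when reading files. The following sketch illustrates that step under stated assumptions: the name is hard-coded instead of being read from web.config, and it uses utf-8 because gb2312 is available out of the box on the .NET Framework but has to be registered separately on newer .NET versions.

```vbnet
Imports System
Imports System.Text

Module EncodingSketch
    Sub Main()
        ' In the application this name would come from the "Encoding" key
        Dim encodingName As String = "utf-8"
        Dim siteEncoding As Encoding = Encoding.GetEncoding(encodingName)

        ' The same text occupies a different number of bytes per encoding
        Dim bytes() As Byte = siteEncoding.GetBytes("搜索")
        Console.WriteLine("{0} uses {1} bytes", _
            siteEncoding.WebName, bytes.Length)
    End Sub
End Module
```

With utf-8 the two Chinese characters take six bytes. Reading a file with the wrong encoding would therefore garble the text, which is why the key has to match the encoding of the site.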


Also, change the EnglishLanguage key to False.
<!-- Set this boolean to False if you are not using
an English language web site -->
<add key="EnglishLanguage" value="False" />

 

SiteSearch.aspx

Last but not least, the codePage attribute has to be added to the Page directive.

English
<%@ Page Language="vb" Trace="False" AutoEventWireup="false"
Codebehind="SiteSearch.aspx.vb"
 Inherits="SearchDotnet.SiteSearch" debug="false" %>
Chinese
<%@ Page Language="vb" Trace="False" AutoEventWireup="false" 
 Codebehind="SiteSearch.aspx.vb" Inherits="SearchDotnet.SiteSearch"
 debug="false" codePage="936" %>


Enhancing the code

The application is meant for small sites. For bigger sites, the code can be further enhanced. In fact, you will need to write the page data to a data store, say an XML file, periodically and then read from it. Here are a few tips to do so.

 

1. In my code, I search and filter data using regular expressions. Instead of this, you will have to write the entire data (not the filtered data) to an XML file. See code 6.

 

Code 6

    'XMLFile is a class-level String that holds the path of the XML file
    Private Shared Sub WriteXmlToFile(ByVal thisDataSet As DataSet)
      If thisDataSet Is Nothing Then
        Return
      End If
      thisDataSet.WriteXml(XMLFile)
    End Sub


2. Later on, you will need to read the XML from the file and save it to the shared dataset, say Searchs.Site.PageDataset.Tables("Pages"). See code 7.

 

Code 7

    Private Shared Function ReadXmlFromFile() As DataSet
      ' Create a new DataSet.
      Dim newDataSet As New DataSet("New DataSet")

      ' Create a new FileStream to read the XML document with.
      Dim fsReadXml As New System.IO.FileStream(XMLFile, _
        System.IO.FileMode.Open)

      ' Create an XmlTextReader to read the file.
      Dim myXmlReader As New System.Xml.XmlTextReader(fsReadXml)

      Try
        ' Read the XML document into the DataSet.
        newDataSet.ReadXml(myXmlReader)
      Finally
        ' Close the XmlTextReader; this also closes the underlying stream.
        myXmlReader.Close()
      End Try
      Return newDataSet
    End Function


3. For each search, use the Select method of PageDataset.Tables("Pages") to filter the stored pages according to the search criteria. The FillDataset method contains the logic to create a dataset and add the search results (an array of DataRow) to it. See code 8.

 

Code 8

    Private Sub FiterPagesDatset()
      Dim strExpr As String
      Dim foundRows As DataRow()
      'Columns that the filter expression should cover
      Dim Field() As String = { _
       "Title", "Description", "Keywords", "Contents"}
      strExpr = SomeFunction 'Your function to build the query.
      foundRows = Searchs.Site.PageDataset.Tables( _
        "Pages").Select(strExpr)
      FillDataset(foundRows)
    End Sub

4. Store the filtered result in another dataset and use it to display the results.
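
The query-building function is left open in code 8. One possible shape for it is sketched below as a self-contained console program: the BuildFilter helper, the in-memory sample table and its rows are assumptions made for illustration. The helper joins one LIKE condition per column with OR, which is a form of filter expression that DataTable.Select accepts.

```vbnet
Imports System
Imports System.Data

Module SelectSketch
    ' Hypothetical stand-in for the query-building function:
    ' matches the word in any of the given columns
    Function BuildFilter(ByVal word As String, _
        ByVal fields() As String) As String
        ' Double single quotes so the word is safe inside a quoted literal
        Dim safeWord As String = word.Replace("'", "''")
        Dim parts(fields.Length - 1) As String
        For i As Integer = 0 To fields.Length - 1
            parts(i) = String.Format("{0} LIKE '%{1}%'", fields(i), safeWord)
        Next
        Return String.Join(" OR ", parts)
    End Function

    Sub Main()
        ' Small in-memory table standing in for the shared Pages table
        Dim pages As New DataTable("Pages")
        pages.Columns.Add("Title", GetType(String))
        pages.Columns.Add("Description", GetType(String))
        pages.Rows.Add("Home", "Welcome to the site")
        pages.Rows.Add("Search", "Find pages on this site")
        pages.Rows.Add("Contact", "How to reach the author")

        Dim filter As String = BuildFilter("search", _
            New String() {"Title", "Description"})
        Dim foundRows As DataRow() = pages.Select(filter)
        For Each row As DataRow In foundRows
            Console.WriteLine(row("Title"))
        Next
    End Sub
End Module
```

Only the row titled Search is printed, because string comparisons in a DataTable ignore case by default. The sketch escapes single quotes only; wildcard characters such as * and % or square brackets in the search word would need additional escaping.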



The author can be reached at: stevanin@hotmail.com.




Added on July 27, 2007