Creating a Site Search Engine - Part III

Search Method

Processing of the search begins in the Search method. It encodes the search words, creates the DataSet that stores the search results if one does not already exist, and then calls the ProcessDirectory method. Check out code 1.

 

Code 1

'********************************************
    '
    ' Search Method
    '
    ' Search the entire site
    '
    '********************************************
    Public Function Search(ByVal targetDirectory As String) As DataSet
      'If the site is in English then use the server HTML encode method
      If Searchs.Site.EnglishLanguage = True Then
        'Replace any HTML tags with the HTML codes
        'for the same characters (stops people entering HTML tags)
        m_searchWords = m_page.Server.HtmlEncode(m_searchWords)
      Else
        'If the site is not English, just replace the angle brackets
        '< and > with their HTML encoded forms
        m_searchWords = Replace(m_searchWords, "<", "&lt;", 1, -1, 1)
        m_searchWords = Replace(m_searchWords, ">", "&gt;", 1, -1, 1)
      End If
      If m_dstPages Is Nothing Then
        m_dstPages = Searchs.PagesDataset.Create()
      End If
      ProcessDirectory(targetDirectory)
      Return m_dstPages
    End Function
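
The two encoding branches of Search can be tried outside a web page with the sketch below. It is only an illustration: EncodeSearchWords is a hypothetical helper, and System.Net.WebUtility.HtmlEncode is used as a stand-in for m_page.Server.HtmlEncode so that the code runs as a console program.

```vbnet
Imports System
Imports System.Net

Module EncodeSearchWordsSketch
    ' Hypothetical helper: mirrors the two branches of Search without
    ' depending on m_page or Searchs.Site.
    Function EncodeSearchWords(ByVal words As String, _
                               ByVal fullEncode As Boolean) As String
        If fullEncode Then
            ' Encodes <, >, & and quotes (stand-in for Server.HtmlEncode)
            Return WebUtility.HtmlEncode(words)
        End If
        ' Only neutralise the angle brackets
        Return words.Replace("<", "&lt;").Replace(">", "&gt;")
    End Function

    Sub Main()
        ' Prints: &lt;b&gt;asp &amp; vb&lt;/b&gt;
        Console.WriteLine(EncodeSearchWords("<b>asp & vb</b>", True))
        ' Prints: &lt;b&gt;asp & vb&lt;/b&gt;
        Console.WriteLine(EncodeSearchWords("<b>asp & vb</b>", False))
    End Sub
End Module
```

In both branches the angle brackets are neutralised, so a tag typed into the search box can no longer be rendered as markup.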


ProcessDirectory Method

The ProcessDirectory method loops through all the files in the target directory and calls the ProcessFile method for each one. It then loops through the subdirectories and calls itself recursively for every folder that is not barred. See code 2.

 

Code 2

'*********************************************
    '
    ' ProcessDirectory Method
    '
    ' Files in the directories are searched
    '
    '********************************************

   
    Private Sub ProcessDirectory(ByVal targetDirectory As String)
      Dim fileEntries As String()
      Dim subdirectoryEntries As String()
      Dim filePath As String
      Dim subdirectory As String
      fileEntries = Directory.GetFiles(targetDirectory)
      ' Process the list of files found in the directory
      For Each filePath In fileEntries
        m_totalFilesSearched += 1
        ProcessFile(filePath)
      Next filePath

      subdirectoryEntries = Directory.GetDirectories(targetDirectory)
      ' Recurse into subdirectories of this directory
      For Each subdirectory In subdirectoryEntries
        'Check to make sure the folder about to be searched
        'is not a barred folder; if it is, don't search it
        If Not InStr(1, Searchs.Site.BarredFolders, _
         Path.GetFileName(subdirectory), vbTextCompare) > 0 Then
          'Call the sub procedure recursively to search the folder
          ProcessDirectory(subdirectory)
        End If
      Next subdirectory
    End Sub 'ProcessDirectory
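
Note that the InStr test in ProcessDirectory is a substring test, so a folder whose name appears inside a longer barred name is skipped as well. The sketch below contrasts it with an exact match. It assumes that BarredFolders is a comma-separated list, which this part of the article does not state; both function names and the sample value are hypothetical.

```vbnet
Imports System
Imports Microsoft.VisualBasic

Module BarredFolderSketch
    ' Substring test, as used in ProcessDirectory
    Function IsBarredBySubstring(ByVal barredList As String, _
                                 ByVal folderName As String) As Boolean
        Return InStr(1, barredList, folderName, CompareMethod.Text) > 0
    End Function

    ' Exact, case-insensitive test; assumes a comma-separated list
    Function IsBarredExact(ByVal barredList As String, _
                           ByVal folderName As String) As Boolean
        For Each entry As String In barredList.Split(","c)
            If String.Equals(entry.Trim(), folderName, _
                             StringComparison.OrdinalIgnoreCase) Then
                Return True
            End If
        Next
        Return False
    End Function

    Sub Main()
        Dim barred As String = "cgi-bin,_private"   ' hypothetical setting value
        Console.WriteLine(IsBarredBySubstring(barred, "bin"))  ' True
        Console.WriteLine(IsBarredExact(barred, "bin"))        ' False
        Console.WriteLine(IsBarredExact(barred, "CGI-BIN"))    ' True
    End Sub
End Module
```

Whether the stricter test is preferable depends on how the barred list is maintained; the substring form is shorter but can exclude more folders than intended.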

ProcessFile Method

The ProcessFile method calls GetInfo, which returns a Searchs.Page object that contains all the information about the particular file, or Nothing if the file should not be searched. It then runs the search on the page and, if MatchCount is greater than 0, calls CheckFileInfo to clean up the information stored in the Page object and stores the page in the PagesDataset. Refer to code 3.

Code 3

'*******************************************************
    '
    ' ProcessFile Method
    '
    ' Real logic for processing found files would go here.
    '
    '*******************************************************

   
    Private Sub ProcessFile(ByVal FPath As String)
      Dim srchFile As Searchs.Page
      srchFile = GetInfo(FPath)
      If Not IsNothing(srchFile) Then
        srchFile.Search(m_searchWords, m_searchCriteria)
        If srchFile.MatchCount > 0 Then
          m_totalFilesFound += 1
          'Response.Write(srchFile.Contents)
          srchFile.CheckFileInfo()
          Searchs.PagesDataset.StoreFile(m_dstPages, srchFile)
        End If
      End If
    End Sub 'ProcessFile

GetInfo Method

The GetInfo method's main task is to get data from the file. It calls the shared method Searchs.FileContent.GetFileInfo where much of the work is done. See code 4.

Code 4

'*****************************************************************
    '
    ' GetInfo Method
    '
    ' File data is picked in this method
    '
    '*****************************************************************

   
    Private Function GetInfo(ByVal FPath As String) As Searchs.Page
      Dim fileInform As New FileInfo(FPath)
      Dim srchFile As New Searchs.Page()
      'Check the file extension to make sure the file
      'is of the extension type to be searched
      If InStr(1, Searchs.Site.FilesTypesToSearch, _
       fileInform.Extension, vbTextCompare) > 0 Then
        'm_page.Trace.Warn("File ext.", fileInform.Extension)
        'Check to make sure the file about to be searched
        'is not a barred file; if it is, don't search the file
        If Not InStr(1, Searchs.Site.BarredFiles, _
         Path.GetFileName(FPath), vbTextCompare) > 0 Then
          'm_page.Trace.Warn("File", FPath)
          If Not File.Exists(FPath) Then
            'm_page.Trace.Warn("Error", _
            'String.Format("{0} does not exist.", FPath))
            'Add throw exception here
            Return Nothing
          End If
          Searchs.FileContent.GetFileInfo(FPath, srchFile)
          Return srchFile
        End If
      End If
      Return Nothing
    End Function

FileContent.vb



Figure 1

The contents of FileContent.vb are as depicted in figure 1.

 

GetFileInfo Method

Here, the bulk of the page data is retrieved. If the file is static, its content is read from disk using the GetStaticFileContent method. If the file is dynamic, the page content is retrieved from the server using the GetDynamicFileContent method. The title is retrieved from the title tags, while the description and keywords are recovered from the meta tags by calling the GetMetaContent method. The text of the page is then stripped of its HTML markup by calling the Searchs.CleanHtml.Clean method. Refer to code 5.

 

Code 5

'**********************************************
    '
    ' GetFileInfo Method
    '
    ' File data is picked in this method
    '
    '**********************************************
    Public Shared Sub GetFileInfo(ByVal FPath As String, _
         ByVal srchFile As Searchs.Page)
      Dim fileInform As New FileInfo(FPath)
      Dim strBldFile As New StringBuilder()
      'File size in kilobytes (integer division)
      Dim fileSize As Decimal = fileInform.Length \ 1024

      srchFile.Size = fileSize
      GetFilePath(FPath, srchFile)

      If InStr(1, Searchs.Site.DynamicFilesTypesToSearch, _
        fileInform.Extension, vbTextCompare) > 0 Then
        m_page.Trace.Warn("Path", String.Format("{0}/{1}", "", _
         srchFile.Path))
        GetDynamicFileContent(srchFile)
      Else
        GetStaticFileContent(FPath, srchFile)
      End If

      'The contents were already read by one of the two methods above
      If Not srchFile.Contents.Equals("") Then
        'Read in the title of the file
        srchFile.Title = GetMetaContent(srchFile.Contents, _
             "<title>", "</title>")
        'm_page.Trace.Warn("Page Title", strPageTitle)

        'Read in the description meta tag of the file
        srchFile.Description = GetMetaContent(srchFile.Contents, _
           "<meta name=""description"" content=""", """>")
        'm_page.Trace.Warn("Page Desc", strPageDescription)

        'Read in the keywords of the file
        srchFile.Keywords = GetMetaContent(srchFile.Contents, _
           "<meta name=""keywords"" content=""", """>")
        'm_page.Trace.Warn("Page Keywords", strPageKeywords)

        'Strip the HTML markup from the page text
        srchFile.Contents = _
          Searchs.CleanHtml.Clean(srchFile.Contents)

        'Append description, keywords and title so they are searched too
        srchFile.Contents = _
          strBldFile.AppendFormat("{0} {1} {2} {3}", _
          srchFile.Contents, srchFile.Description, _
          srchFile.Keywords, srchFile.Title).ToString.Trim()

        'm_page.Trace.Warn("File Info", strBldFile.ToString)
      End If
    End Sub

     '******************************************************
    '
    ' GetStaticFileContent Method
    '
    ' File Content is picked in this method
    '
    '*******************************************************

   
    Private Shared Sub GetStaticFileContent( _
         ByVal FPath As String, ByVal srchFile As Searchs.Page)
      Dim sr As StreamReader = Nothing

      Try
        If Searchs.Site.Encoding.Equals("utf-8") Then
          sr = File.OpenText(FPath)
        Else
          sr = New StreamReader(FPath, _
           Encoding.GetEncoding(Searchs.Site.Encoding))
        End If
        srchFile.Contents = sr.ReadToEnd()
      Catch ex As Exception
        m_page.Trace.Warn("Error", ex.Message)
        srchFile.Contents = ex.Message
      Finally
        'Close the reader even if opening or reading fails
        If Not sr Is Nothing Then sr.Close()
      End Try
    End Sub
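
GetFileInfo relies on GetMetaContent, which is not listed in this part. As a rough idea of what such a helper might do, the sketch below returns the text between a start marker and the next end marker. ExtractBetween is a hypothetical name, and the real GetMetaContent may well be implemented differently.

```vbnet
Imports System

Module MetaContentSketch
    ' Hypothetical stand-in for GetMetaContent: returns the text between
    ' the first startTag and the next endTag, or "" when either is missing.
    Function ExtractBetween(ByVal html As String, ByVal startTag As String, _
                            ByVal endTag As String) As String
        Dim startPos As Integer = _
            html.IndexOf(startTag, StringComparison.OrdinalIgnoreCase)
        If startPos < 0 Then Return ""
        startPos += startTag.Length
        Dim endPos As Integer = _
            html.IndexOf(endTag, startPos, StringComparison.OrdinalIgnoreCase)
        If endPos < 0 Then Return ""
        Return html.Substring(startPos, endPos - startPos).Trim()
    End Function

    Sub Main()
        Dim html As String = "<html><head><title>Site Search</title>" & _
            "<meta name=""description"" content=""Searches the site"">" & _
            "</head></html>"
        ' Prints: Site Search
        Console.WriteLine(ExtractBetween(html, "<title>", "</title>"))
        ' Prints: Searches the site
        Console.WriteLine(ExtractBetween(html, _
            "<meta name=""description"" content=""", """>"))
    End Sub
End Module
```

A marker-based approach like this only works when the page writes the meta tag with exactly this attribute order and quoting, which is worth keeping in mind when results come back with empty descriptions.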


GetDynamicFileContent

GetDynamicFileContent dispatches to one of two methods, GetDynamicFileContentUTF or GetDynamicFileContentOther, depending on the site encoding. Check out code 6.

 

Code 6

'*********************************************************************
    '
    ' GetDynamicFileContent Method
    '
    ' File Content is picked in this method
    '
    '*********************************************************************

   
    Private Shared Sub GetDynamicFileContent(ByVal srchFile As Searchs.Page)
      If Searchs.Site.Encoding.Equals("utf-8") Then
        GetDynamicFileContentUTF(srchFile)
      Else
        GetDynamicFileContentOther(srchFile)
      End If
    End Sub


System.Net.WebClient provides common methods for sending data to and receiving data from a resource identified by a URI. We make use of DownloadData, which downloads data from a resource and returns a byte array. See code 7.

 

Applications that target the common language runtime use encoding to map character representations from the native character scheme (Unicode) to other schemes. Applications use decoding to map characters from non-native schemes (non-Unicode) to the native scheme. The System.Text namespace provides classes that allow you to encode and decode characters.
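
As a small illustration of encoding and decoding with System.Text, the sketch below round-trips a string through a non-Unicode scheme. The choice of iso-8859-1 and the sample text are assumptions made for the example; the site itself uses whatever Searchs.Site.Encoding names.

```vbnet
Imports System
Imports System.Text
Imports Microsoft.VisualBasic

Module EncodingSketch
    Sub Main()
        Dim latin1 As Encoding = Encoding.GetEncoding("iso-8859-1")
        ' "caf" followed by e-acute (U+00E9), built with ChrW so the
        ' source file itself stays plain ASCII
        Dim text As String = "caf" & ChrW(&HE9)

        ' Encoding: Unicode string -> bytes in the target scheme
        Dim bytes As Byte() = latin1.GetBytes(text)
        Console.WriteLine(bytes.Length)                          ' 4

        ' Decoding: bytes -> Unicode string
        Console.WriteLine(latin1.GetString(bytes) = text)        ' True

        ' The same text needs 5 bytes in UTF-8 (e-acute takes two)
        Console.WriteLine(Encoding.UTF8.GetBytes(text).Length)   ' 5
    End Sub
End Module
```

Decoding the downloaded bytes with the wrong Encoding object does not raise an error; it simply yields wrong characters, which is why the code picks the decoder from the site setting.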

 

Code 7
'****************************************************************
    '
    ' GetDynamicFileContentOther Method
    '
    ' File Content is picked in this method
    ' according to the encoding provided
    '
    '****************************************************************
    Private Shared Sub GetDynamicFileContentOther( _
           ByVal srchFile As Searchs.Page)
      Dim wcMicrosoft As System.Net.WebClient
      Dim fileEncoding As System.Text.Encoding
      Try
        'Create the client before using it
        wcMicrosoft = New System.Net.WebClient()
        fileEncoding = System.Text.Encoding.GetEncoding( _
         Searchs.Site.Encoding)
        srchFile.Contents = fileEncoding.GetString( _
         wcMicrosoft.DownloadData(String.Format("{0}/{1}", _
         Searchs.Site.ApplicationPath, srchFile.Path)))
      Catch ex As System.Net.WebException
        m_page.Trace.Warn("Error", ex.Message)
        srchFile.Contents = ex.Message
      Catch ex As System.Exception
        m_page.Trace.Warn("Error", ex.Message)
        srchFile.Contents = ex.Message
      End Try
    End Sub


The UTF8Encoding class encodes Unicode characters using the UCS Transformation Format, 8-bit form (UTF-8). This encoding supports all Unicode character values and surrogates. Refer to code 8.

 

Code 8

'*********************************************************************
    '
    ' GetDynamicFileContentUTF Method
    '
    ' File Content is picked in this method according to the utf-8 encoding
    '
    '*********************************************************************

   
    Private Shared Sub GetDynamicFileContentUTF( _
           ByVal srchFile As Searchs.Page)
      Dim wcMicrosoft As System.Net.WebClient
      Dim objUTF8Encoding As UTF8Encoding

      Try
        wcMicrosoft = New System.Net.WebClient()
        objUTF8Encoding = New UTF8Encoding()
        srchFile.Contents = objUTF8Encoding.GetString( _
         wcMicrosoft.DownloadData(String.Format("{0}/{1}", _
         Searchs.Site.ApplicationPath, srchFile.Path)))
      Catch ex As System.Net.WebException
        m_page.Trace.Warn("Error", ex.Message)
        srchFile.Contents = ex.Message
      Catch ex As System.Exception
        m_page.Trace.Warn("Error", ex.Message)
        srchFile.Contents = ex.Message
      End Try
    End Sub
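
To make the remark about surrogates concrete, here is a minimal sketch that encodes a character outside the Basic Multilingual Plane with UTF8Encoding and decodes it again. The code point U+1F600 is just an arbitrary example.

```vbnet
Imports System
Imports System.Text

Module Utf8SurrogateSketch
    Sub Main()
        ' U+1F600 lies outside the Basic Multilingual Plane, so a .NET
        ' string stores it as a surrogate pair (two Char values).
        Dim s As String = Char.ConvertFromUtf32(&H1F600)
        Dim utf8 As New UTF8Encoding()
        Dim bytes As Byte() = utf8.GetBytes(s)

        Console.WriteLine(s.Length)                       ' 2
        Console.WriteLine(bytes.Length)                   ' 4
        Console.WriteLine(BitConverter.ToString(bytes))   ' F0-9F-98-80
        Console.WriteLine(utf8.GetString(bytes) = s)      ' True
    End Sub
End Module
```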


GetFilePath Method

This method converts the local folder path to reflect the URL of the web site. See code 9.

 

Code 9

'*****************************************
    '
    ' GetFilePath Method
    '
    ' File path is modfied to be displayed
    ' as hyperlink in this method
    '
    '*****************************************
    Private Shared Sub GetFilePath(ByVal strFileURL As String, _
                ByVal srchFile As Searchs.Page)
      'Turn the server path to the file into a URL path to the file
      strFileURL = Replace(strFileURL, m_page.Server.MapPath("./"), "")

      'Replace the NT backslash with the internet
      'forward slash in the URL to the file
      strFileURL = Replace(strFileURL, "\", "/")

      'Encode the file name and path into the URL code method
      strFileURL = m_page.Server.UrlEncode(strFileURL)

      'Just in case it has encoded any forward slashes, restore them
      '(Start and Count must be passed before the compare method)
      strFileURL = Replace(strFileURL.Trim(), _
               "%2f", "/", 1, -1, vbTextCompare)
      srchFile.Path = strFileURL
      m_page.Trace.Warn("Url", srchFile.Path)
    End Sub
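
The same conversion can be tried outside ASP.NET with the sketch below. It is an approximation under stated assumptions: ToSitePath and the sample paths are hypothetical, the siteRoot argument replaces m_page.Server.MapPath("./"), and System.Net.WebUtility.UrlEncode stands in for Server.UrlEncode, so the exact escaping may differ from what the page produces.

```vbnet
Imports System
Imports System.Net
Imports Microsoft.VisualBasic

Module FilePathSketch
    ' Hypothetical stand-in for GetFilePath
    Function ToSitePath(ByVal filePath As String, _
                        ByVal siteRoot As String) As String
        ' Strip the physical root so only the site-relative part remains
        Dim url As String = _
            Replace(filePath, siteRoot, "", 1, -1, CompareMethod.Text)
        ' Backslashes become forward slashes
        url = url.Replace("\", "/")
        ' Escape characters that are not safe in a URL
        url = WebUtility.UrlEncode(url)
        ' Restore the slashes the encoder escaped; the text compare
        ' covers both %2f and %2F
        Return Replace(url.Trim(), "%2f", "/", 1, -1, CompareMethod.Text)
    End Function

    Sub Main()
        Console.WriteLine(ToSitePath( _
            "C:\inetpub\wwwroot\site\docs\my page.htm", _
            "C:\inetpub\wwwroot\site\"))
    End Sub
End Module
```

The Visual Basic Replace function takes Start and Count before Compare, so the compare method only takes effect when it is passed as the sixth argument.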


The next section is the concluding part.




Added on July 27, 2007