Creating a Site Search Engine - Part III
Posted on July 27, 2007 by Priyadarshan Roy, filed under Programming
Search Method
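The actual search starts in the Search method. It first encodes the search words: on English-language sites the whole string is HTML-encoded, otherwise only the angle brackets are escaped. It then creates the DataSet that stores the search results, if one does not exist yet, and calls the ProcessDirectory method. See code 1.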
Code 1
Public Function Search(ByVal targetDirectory As String) As DataSet
    ' Neutralise markup in the search words before they are used.
    If Searchs.Site.EnglishLanguage = True Then
        m_searchWords = m_page.Server.HtmlEncode(m_searchWords)
    Else
        m_searchWords = Replace(m_searchWords, "<", "&lt;", 1, -1, 1)
        m_searchWords = Replace(m_searchWords, ">", "&gt;", 1, -1, 1)
    End If

    ' Create the results DataSet on first use.
    If m_dstPages Is Nothing Then
        m_dstPages = Searchs.PagesDataset.Create()
    End If

    ProcessDirectory(targetDirectory)
    Return m_dstPages
End Function
ProcessDirectory Method
ProcessDirectory loops through all the files in a directory and calls the ProcessFile method for each one. It then loops through the subdirectories, skips any barred folders, and calls itself recursively on the rest. See code 2.
Code 2
Private Sub ProcessDirectory(ByVal targetDirectory As String)
    Dim fileEntries As String()
    Dim subdirectoryEntries As String()
    Dim filePath As String
    Dim subdirectory As String

    ' Process every file in this directory.
    fileEntries = Directory.GetFiles(targetDirectory)
    For Each filePath In fileEntries
        m_totalFilesSearched += 1
        ProcessFile(filePath)
    Next filePath

    ' Recurse into each subdirectory that is not barred.
    subdirectoryEntries = Directory.GetDirectories(targetDirectory)
    For Each subdirectory In subdirectoryEntries
        If Not InStr(1, Searchs.Site.BarredFolders, _
                Path.GetFileName(subdirectory), vbTextCompare) > 0 Then
            ProcessDirectory(subdirectory)
        End If
    Next subdirectory
End Sub
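The files-first, then-subfolders recursion described above can be tried outside the web application with a small console program. This is only a sketch: the module name, CollectFiles, IsBarred and the barred array are made up for illustration, with the array standing in for Searchs.Site.BarredFolders. It also compares folder names exactly instead of using the InStr substring test from code 2.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO

Module DirectoryWalkSketch

    ' True when folderName exactly matches a barred name (case-insensitive).
    ' This replaces the InStr substring test of the article with an exact match.
    Private Function IsBarred(ByVal folderName As String, _
                              ByVal barredFolders As String()) As Boolean
        For Each barred As String In barredFolders
            If String.Equals(barred, folderName, StringComparison.OrdinalIgnoreCase) Then
                Return True
            End If
        Next
        Return False
    End Function

    ' Adds every file under folder to found: files first, then subfolders.
    Private Sub Walk(ByVal folder As String, ByVal barredFolders As String(), _
                     ByVal found As List(Of String))
        found.AddRange(Directory.GetFiles(folder))
        For Each subFolder As String In Directory.GetDirectories(folder)
            If Not IsBarred(Path.GetFileName(subFolder), barredFolders) Then
                Walk(subFolder, barredFolders, found)
            End If
        Next
    End Sub

    ' Hypothetical entry point: returns all files below root, skipping barred folders.
    Public Function CollectFiles(ByVal root As String, _
                                 ByVal barredFolders As String()) As List(Of String)
        Dim found As New List(Of String)()
        Walk(root, barredFolders, found)
        Return found
    End Function

    Sub Main()
        ' Assumed values: walk the current directory and skip "bin" and "obj".
        Dim barred As String() = {"bin", "obj"}
        For Each filePath As String In CollectFiles(Directory.GetCurrentDirectory(), barred)
            Console.WriteLine(filePath)
        Next
    End Sub

End Module
```

Collecting the paths into a list keeps the sketch easy to inspect; the article's version calls ProcessFile on each file as it goes instead.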
ProcessFile Method
ProcessFile calls GetInfo, which returns a Searchs.Page object containing all the information about the file, or Nothing if the file should be skipped. It then runs the search on that page. If MatchCount is greater than 0, it calls CheckFileInfo to clean up the information stored in the Page object and stores the page in the PagesDataset. See code 3.
Code 3
Private Sub ProcessFile(ByVal FPath As String)
    Dim srchFile As Searchs.Page

    srchFile = GetInfo(FPath)
    If Not IsNothing(srchFile) Then
        srchFile.Search(m_searchWords, m_searchCriteria)
        If srchFile.MatchCount > 0 Then
            m_totalFilesFound += 1
            srchFile.CheckFileInfo()
            Searchs.PagesDataset.StoreFile(m_dstPages, srchFile)
        End If
    End If
End Sub
GetInfo Method
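The GetInfo method's main task is to get data from the file. It first checks that the file extension is among the file types to search and that the file is not barred, then calls the shared method Searchs.FileContent.GetFileInfo, where most of the work is done. For any file that fails these checks it returns Nothing. See code 4.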
Code 4
Private Function GetInfo(ByVal FPath As String) As Searchs.Page
    Dim fileInform As New FileInfo(FPath)
    Dim sr As StreamReader
    Dim srchFile As New Searchs.Page()
    Dim strBldFile As New StringBuilder()
    Dim strFileURL As String

    If InStr(1, Searchs.Site.FilesTypesToSearch, _
            fileInform.Extension, vbTextCompare) > 0 Then
        If Not InStr(1, Searchs.Site.BarredFiles, _
                Path.GetFileName(FPath), vbTextCompare) > 0 Then
            If Not File.Exists(FPath) Then
                Return Nothing
            End If

            Searchs.FileContent.GetFileInfo(FPath, srchFile)
            Return srchFile
        End If
    End If
    Return Nothing
End Function
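One thing to keep in mind about the filter in code 4 is that InStr performs a substring test, so an extension passes whenever it occurs anywhere inside the settings string. The sketch below compares that test with an exact match. It assumes the setting is a comma-separated list such as ".html,.aspx"; the real format of Searchs.Site.FilesTypesToSearch is not shown in this part, and both function names are made up for illustration.

```vbnet
Imports System
Imports Microsoft.VisualBasic

Module ExtensionFilterSketch

    ' Substring test in the style of GetInfo: True if ext occurs anywhere in allowedList.
    Function MatchesBySubstring(ByVal allowedList As String, ByVal ext As String) As Boolean
        Return InStr(1, allowedList, ext, CompareMethod.Text) > 0
    End Function

    ' Exact, case-insensitive match against each entry of a comma-separated list.
    Function MatchesExactly(ByVal allowedList As String, ByVal ext As String) As Boolean
        For Each candidate As String In allowedList.Split(","c)
            If String.Equals(candidate.Trim(), ext, StringComparison.OrdinalIgnoreCase) Then
                Return True
            End If
        Next
        Return False
    End Function

    Sub Main()
        Dim allowed As String = ".html,.aspx"   ' assumed format of the setting

        ' ".htm" is a substring of ".html", so the substring test accepts it.
        Console.WriteLine(MatchesBySubstring(allowed, ".htm"))   ' True
        Console.WriteLine(MatchesExactly(allowed, ".htm"))       ' False

        ' InStr returns the start position for an empty search string,
        ' so files without an extension also pass the substring test.
        Console.WriteLine(MatchesBySubstring(allowed, ""))       ' True
        Console.WriteLine(MatchesExactly(allowed, ""))           ' False
    End Sub

End Module
```

Whether the looser behaviour matters depends on the file types present on the site; the same consideration applies to the barred files and barred folders checks.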
FileContent.vb

Figure 1
The contents of FileContent.vb are as depicted in figure 1.
GetFileInfo Method
This is where the page data is retrieved. For static files, the content is read from disk using the GetStaticFileContent method. For dynamic files, the page content is requested from the server using the GetDynamicFileContent method. The title is taken from the title tags, while the description and keywords are taken from the meta tags by calling the GetMetaContent method. The HTML markup is then stripped from the content by calling the Searchs.CleanHtml.Clean method, and the description, keywords and title are appended to the cleaned text so that they are searched as well. See code 5.
Code 5
Public Shared Sub GetFileInfo(ByVal FPath As String, _
        ByVal srchFile As Searchs.Page)
    Dim fileInform As New FileInfo(FPath)
    Dim strBldFile As New StringBuilder()
    ' File size in whole kilobytes (integer division).
    Dim fileSize As Decimal = fileInform.Length \ 1024

    srchFile.Size = fileSize
    GetFilePath(FPath, srchFile)

    ' Dynamic pages are requested from the server; static files are read from disk.
    If InStr(1, Searchs.Site.DynamicFilesTypesToSearch, _
            fileInform.Extension, vbTextCompare) > 0 Then
        m_page.Trace.Warn("Path", String.Format("{0}/{1}", "", _
            srchFile.Path))
        GetDynamicFileContent(srchFile)
    Else
        GetStaticFileContent(FPath, srchFile)
    End If

    If Not srchFile.Contents.Equals("") Then
        ' The contents were already loaded above, so they are not read again
        ' here (no StreamReader is in scope in this method).
        srchFile.Title = GetMetaContent(srchFile.Contents, _
            "<title>", "</title>")
        srchFile.Description = GetMetaContent(srchFile.Contents, _
            "<meta name=""description"" content=""", """>")
        'm_page.Trace.Warn("Page Desc", strPageDescription)
        'Read in the keywords of the file
        srchFile.Keywords = GetMetaContent(srchFile.Contents, _
            "<meta name=""keywords"" content=""", """>")
        srchFile.Contents = _
            Searchs.CleanHtml.Clean(srchFile.Contents)

        ' Append description, keywords and title so they are searched too.
        srchFile.Contents = _
            strBldFile.AppendFormat("{0} {1} {2} {3}", _
            srchFile.Contents, srchFile.Description, _
            srchFile.Keywords, srchFile.Title).ToString.Trim()
    End If
End Sub

Private Shared Sub GetStaticFileContent( _
        ByVal FPath As String, ByVal srchFile As Searchs.Page)
    Dim sr As StreamReader

    If Searchs.Site.Encoding.Equals("utf-8") Then
        sr = File.OpenText(FPath)
    Else
        sr = New StreamReader(FPath, _
            Encoding.GetEncoding(Searchs.Site.Encoding))
    End If

    Try
        srchFile.Contents = sr.ReadToEnd()
        sr.Close()
    Catch ex As Exception
        m_page.Trace.Warn("Error", ex.Message)
        srchFile.Contents = ex.Message
    End Try
End Sub
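GetMetaContent itself is not listed in this part. Judging only from how it is called, with the page contents, a start marker and an end marker, it returns the text between the two markers. The sketch below shows one way such a helper could work. ExtractBetween is a made-up name and the logic is an assumption, not the author's implementation.

```vbnet
Imports System

Module MetaContentSketch

    ' Returns the trimmed text between startTag and endTag, or "" when either
    ' marker is missing. Markers are matched case-insensitively.
    Function ExtractBetween(ByVal html As String, ByVal startTag As String, _
                            ByVal endTag As String) As String
        Dim startPos As Integer = html.IndexOf(startTag, StringComparison.OrdinalIgnoreCase)
        If startPos < 0 Then
            Return ""
        End If
        startPos += startTag.Length

        ' Look for the end marker only after the start marker.
        Dim endPos As Integer = html.IndexOf(endTag, startPos, StringComparison.OrdinalIgnoreCase)
        If endPos < 0 Then
            Return ""
        End If
        Return html.Substring(startPos, endPos - startPos).Trim()
    End Function

    Sub Main()
        ' A made-up page used only to exercise the helper.
        Dim html As String = "<html><head><title>Home</title>" & _
            "<meta name=""description"" content=""Site search demo"">" & _
            "</head><body>Hello</body></html>"

        Console.WriteLine(ExtractBetween(html, "<title>", "</title>"))
        ' prints: Home
        Console.WriteLine(ExtractBetween(html, _
            "<meta name=""description"" content=""", """>"))
        ' prints: Site search demo
    End Sub

End Module
```

Plain marker matching like this depends on the exact attribute order and quoting used in the meta tag, so pages that write the tag differently would return an empty string.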
GetDynamicFileContent
GetDynamicFileContent branches to one of two methods, GetDynamicFileContentUTF or GetDynamicFileContentOther, depending on the site encoding. See code 6.
Code 6
Private Shared Sub GetDynamicFileContent(ByVal srchFile As Searchs.Page)
    Dim wcMicrosoft As System.Net.WebClient

    If Searchs.Site.Encoding.Equals("utf-8") Then
        GetDynamicFileContentUTF(srchFile)
    Else
        GetDynamicFileContentOther(srchFile)
    End If
End Sub
System.Net.WebClient provides common methods for sending data to and receiving data from a resource identified by a URI. We make use of DownloadData, which downloads data from a resource and returns a byte array. See code 7.
Applications that target the common language runtime use encoding to map character representations from the native character scheme (Unicode) to other schemes. Applications use decoding to map characters from non-native schemes (non-Unicode) to the native scheme. The System.Text namespace provides classes that allow you to encode and decode characters.
Code 7
Private Shared Sub GetDynamicFileContentOther( _
        ByVal srchFile As Searchs.Page)
    Dim wcMicrosoft As System.Net.WebClient
    Dim fileEncoding As System.Text.Encoding

    Try
        ' The WebClient must be created before DownloadData is called.
        wcMicrosoft = New System.Net.WebClient()
        fileEncoding = System.Text.Encoding.GetEncoding( _
            Searchs.Site.Encoding)
        srchFile.Contents = fileEncoding.GetString( _
            wcMicrosoft.DownloadData(String.Format("{0}/{1}", _
            Searchs.Site.ApplicationPath, srchFile.Path)))
    Catch ex As System.Net.WebException
        m_page.Trace.Warn("Error", ex.Message)
        srchFile.Contents = ex.Message
    Catch ex As System.Exception
        m_page.Trace.Warn("Error", ex.Message)
        srchFile.Contents = ex.Message
    End Try
End Sub
The UTF8Encoding class encodes Unicode characters using UCS Transformation Format, 8-bit form (UTF-8). This encoding supports all Unicode character values and surrogates. See code 8.
Code 8
Private Shared Sub GetDynamicFileContentUTF( _
        ByVal srchFile As Searchs.Page)
    Dim wcMicrosoft As System.Net.WebClient
    Dim objUTF8Encoding As UTF8Encoding

    Try
        wcMicrosoft = New System.Net.WebClient()
        objUTF8Encoding = New UTF8Encoding()
        srchFile.Contents = objUTF8Encoding.GetString( _
            wcMicrosoft.DownloadData(String.Format("{0}/{1}", _
            Searchs.Site.ApplicationPath, srchFile.Path)))
    Catch ex As System.Net.WebException
        m_page.Trace.Warn("Error", ex.Message)
        srchFile.Contents = ex.Message
    Catch ex As System.Exception
        m_page.Trace.Warn("Error", ex.Message)
        srchFile.Contents = ex.Message
    End Try
End Sub
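The reason for branching on the encoding is that DownloadData returns raw bytes, and the same bytes produce different text depending on which encoding decodes them. The following sketch shows this without any network access. The sample word and the choice of iso-8859-1 as the "other" encoding are assumptions made only for the demonstration.

```vbnet
Imports System
Imports System.Text
Imports Microsoft.VisualBasic

Module EncodingSketch

    Sub Main()
        ' "café", built with ChrW so the source file encoding does not matter.
        Dim original As String = "caf" & ChrW(233)

        ' In UTF-8 the accented letter takes two bytes, so 4 characters become 5 bytes.
        Dim utf8Bytes As Byte() = Encoding.UTF8.GetBytes(original)
        Console.WriteLine(utf8Bytes.Length)                       ' 5

        ' Decoding with the matching encoding restores the text.
        Dim decoded As String = Encoding.UTF8.GetString(utf8Bytes)
        Console.WriteLine(String.Equals(decoded, original))       ' True

        ' Decoding the same bytes as iso-8859-1 turns each byte into one
        ' character, so the result is a different, 5-character string.
        Dim misread As String = Encoding.GetEncoding("iso-8859-1").GetString(utf8Bytes)
        Console.WriteLine(String.Equals(misread, original))       ' False
        Console.WriteLine(misread.Length)                         ' 5
    End Sub

End Module
```

This is why the site encoding setting needs to match what the server actually sends: a mismatch does not raise an error, it silently yields text that the search words will not match.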
GetFilePath Method
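This method converts the local file path into the corresponding path on the web site. It strips the application's physical root, turns backslashes into forward slashes, and URL-encodes the result while restoring the slashes that the encoding escaped. See code 9.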
Code 9
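Private Shared Sub GetFilePath(ByVal strFileURL As String, _
        ByVal srchFile As Searchs.Page)
    ' Strip the physical root of the application from the path.
    strFileURL = Replace(strFileURL, m_page.Server.MapPath("./"), "")
    strFileURL = Replace(strFileURL, "\", "/")
    strFileURL = m_page.Server.UrlEncode(strFileURL)
    ' Restore the slashes escaped by UrlEncode. The compare method is the
    ' sixth argument of Replace, so Start and Count are passed explicitly.
    strFileURL = Replace(strFileURL.Trim(), _
        "%2f", "/", 1, -1, CompareMethod.Text)
    srchFile.Path = strFileURL
    m_page.Trace.Warn("Url", srchFile.Path)
End Sub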
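The same conversion can be tried outside ASP.NET with the sketch below. It is a variation on the technique, not the article's method: siteRoot is a hypothetical stand-in for m_page.Server.MapPath("./"), and Uri.EscapeDataString is applied to each path segment in place of Server.UrlEncode, so the slashes never need to be restored. Note that EscapeDataString writes a space as %20, whereas Server.UrlEncode writes it as a plus sign.

```vbnet
Imports System

Module FilePathSketch

    ' Converts a physical file path below siteRoot into a relative URL path.
    Function ToRelativeUrl(ByVal filePath As String, ByVal siteRoot As String) As String
        Dim relative As String = filePath

        ' Drop the physical root when the path starts with it.
        If relative.StartsWith(siteRoot, StringComparison.OrdinalIgnoreCase) Then
            relative = relative.Substring(siteRoot.Length)
        End If

        ' Encode each segment separately so the separators stay intact.
        Dim segments As String() = relative.Replace("\"c, "/"c).Split("/"c)
        For i As Integer = 0 To segments.Length - 1
            segments(i) = Uri.EscapeDataString(segments(i))
        Next
        Return String.Join("/", segments)
    End Function

    Sub Main()
        ' Made-up paths used only for the demonstration.
        Console.WriteLine(ToRelativeUrl( _
            "C:\Inetpub\wwwroot\site\docs\my page.htm", _
            "C:\Inetpub\wwwroot\site\"))
        ' prints: docs/my%20page.htm
    End Sub

End Module
```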
The next section is the concluding part.
Added on July 27, 2007