Data Mining using Fuzzy SQL Queries
Posted On May 31, 2007 by Ramdas S filed under
Introduction
Data mining can be defined as "The non-trivial extraction of implicit, previously unknown and potentially useful information from data". It can also be described as "The science of extracting useful information from large data sets or databases". Although it is usually used in relation to analysis of data, data mining (like artificial intelligence) is an umbrella term that is used with varied meaning in a wide range of contexts. It is normally associated with a business or an organization's need to identify trends.
A simple example of data mining is its use in a retail sales department. If a store tracks the purchases of a customer and notices that a customer buys a lot of silk sarees, the data mining system will make a correlation between that customer and silk sarees. The sales department can use this information to direct mail marketing of silk sarees to that customer, or it may alternatively attempt to get the customer to buy a wider range of products.
In the aforementioned case, the data mining system used by the retail store discovered new information about the customer that was previously unknown to the company. Another widely used example (hypothetical, yet very popular) is that of the ‘beer-diaper paradigm’. Through intensive analysis of transactions and goods bought over a period of time at a supermarket, analysts found that beers and diapers were often bought together. Though explaining this interrelation would be difficult, taking advantage of it, on the other hand, should not be hard (e.g. placing the high-profit diapers next to the high-profit beers). This technique is often referred to as Market Basket Analysis.
Activities involved in Data Mining
Data mining domain can be classified broadly into the following categories:
§ Locate unknown trends (relationships among the variables/data items);
§ Understand the logic behind a certain trend; and
§ Search the required data under uncertain circumstances.
This categorization is not complete in itself, but it should prove sufficient in helping you understand this article.
In order to locate unknown trends in a database, one requires very high level skills in statistics to analyze the results for their significance. Understanding the logic behind a certain trend may prove to be next to impossible in most cases due to the complex relationships among the data items.
In light of this fact, searching the required data under uncertain circumstances appears to be the manageable activity under data mining, which can be performed by a novice user using common off-the-shelf (COTS) tools. The uncertainty involved in this domain is represented as Fuzzy factor. So, in this domain, we are interested in searching for data based on uncertain necessities, i.e. fuzzy requirements. We convey our requirement to a database (i.e. we make a query) using a query language. In this case, we are going to do the same, but with some extra pizzas added to encompass the fuzziness of the query.
Tools of the trade:
Getting results for our uncertain queries (fuzzy queries) is a tricky exercise, but we can perform it using commonly available software tools. Relational databases are the preferred options we generally use while SQL is the most popular query language employed to search through them. We are going to use the same to realize our fuzzy queries.
Central idea behind Fuzzy Queries required in Data Mining
The staple of SQL querying activity is SELECT statement with WHERE clause. When we use an SQL command line, we use SELECT statements with WHERE clauses that leaves us scrolling through rows of data. We can use wild cards in our WHERE clauses to create queries that return records based on "fuzzy" text matches.
Making Fuzzy Queries (a novice way)
Let us now have a look at the key words that can help you perform "fuzzy searches" on your database tables stored in an RDBMS.
Make the perfect matches (non-fuzzy search):
In its simplest form, a SELECT statement with a WHERE clause tells the SQL engine to show all the records where a certain condition is true about at least one field. We can use ‘=’ operator to make it return a perfect match.
We shall examine a sample statement that illustrates how to view records that contain exact matches with a particular value. In the WHERE clause, we add a single "is equal to" condition:
SELECT FName, LName, Phone FROM Customers WHERE Zip_code="10010"
This statement returns three columns (FName, LName and Phone) from the Customers table. But how many rows does the SQL engine return? The answer depends on the number of records with the entry 10010 in the Zip_code field. In this case, only exact matches count; records that contain anything but 10010 in the Zip_code field will be ignored.
Make close matches (search with a little bit fuzzy characteristics):
SQL syntax supports basic math operators such as less than lesser than (<), greater than (>), greater than or equal to (>=) and less than or equal to (<=). We can use these operators while looking for records with values that fall above or below a certain threshold.
Suppose we would like to locate records where the value in a field equals or exceeds 100. The statement can be written as follows:
SELECT FName, LName, Phone FROM Customers WHERE Last_purchase >=100
The condition for the WHERE clause, Last_purchase >=100, tells the SQL engine to ignore any records in which the value stored in the Last_purchase field is less than 100.
Combining fuzzy conditions with AND and OR:
The next level of fuzzy queries may combine various fuzzy clauses to select records based on two or more fuzzy conditions. The key words AND and OR let us craft queries that precisely zoom in on records.
Let’s suppose that we want to check for records of customers who have spent a certain amount of money. To do so, we can use the WHERE clause as follows:
SELECT FName, LName, Phone FROM Customers WHERE Last_purchase >=100 AND Last_purchase <=900
Before the SQL engine displays a record in your result set, both conditions, Last_purchase >=100 AND Last_purchase <=900, must be true. If we use the OR key word instead of AND, we will get records for which either or both of those conditions are true.
Making Fuzzy Queries (the expert way)
Using wild cards to generate fuzzy queries:
In addition to mathematical operators, most implementations of SQL support two wildcards, the percent sign and underscore (% and _). In SQL syntax, the percent sign is a special character, a wild card that can represent any number of characters. The underscore character is a wild card that can represent any single character.
By using these wild cards along with the LIKE key word, we can perform fuzzy queries on the records in our database. Table 1 lists the wild cards we can use in SQL.
Match any single character Match any string of characters
Match any single character | Match any string of characters | |
| SQL | _ | % |
| DOS | ? | * |
Note: SQL wild card characters allow you to customize your searches for records in the same way that DOS wild cards help you customize searches for files.
In a WHERE clause, the LIKE key word triggers a fuzzy lookup. For instance, suppose we are in charge of auditing customer records whose last names begin with P. Assuming that there is a field name called Lname in our table, our SELECT statement would look something like this:
SELECT * FROM Customer WHERE Lname LIKE P%
Records in the record set resulting from this query will only be for customers whose last names begin with the letter P.
Wild cards come in handy when we are not sure about the spelling of a customer's name. If we think the last name could be spelled ‘Patil’ or ‘Pateel’, we might issue a command in the form:
SELECT * FROM Customer WHERE Lname LIKE Pat%
This command will pick up records where the last name is Patil, Pateel, Patel and any other last name that begins with the letters Pat.
Sometimes, the only thing that is variable about a string is a single character. In such cases, we use the underscore (_) wild card character. For instance, the following command will display records that contain entries in the Lname field like La, Le, Li, Lo, Lu and Ly.
SELECT * FROM Customer WHERE Lname LIKE L_
Using IN to generate fuzzy queries:
The IN key word provides another great tool for data mining through fuzzy queries. To understand its use, let us consider the mathematical operators =, > and <=, which allow us to select records based on broad-based rules such as exactly equal to this, greater than that, or less than or equal to some other value.
Instead of matching records based on simple value comparisons, the IN key word allows us specify a custom list of values to be used when SQL queries your table. For example, suppose you are asked to generate a list of records for customers who live in the following ZIP codes: 48012, 49014 and 49088.
We could write our SELECT command as follows:
SELECT * FROM Customer WHERE Zip="48012" OR Zip="49014" OR Zip="49088"
However, the above command requires 70 characters (including spaces). In case we want to add additional ZIP codes to this custom list, we would have to add additional OR clauses. In order to save on the number of keystrokes, we can use the IN key word to write the above command with only 61 characters:
SELECT * FROM Customer WHERE Zip IN ("48012","49014","49088")
As such, if we want to add additional ZIP codes to our custom query, we can simply append them in the options listed within the parentheses.
NOTE: We used quotation marks to delimit the ZIP codes because they are stored in text (character) fields in our table. If we use the IN key word to list values that may appear in a numeric field, enter those values without the quotation marks.
Using BETWEEN to generate fuzzy queries:
The BETWEEN key word works in much the same way as IN. If we are looking for records based on values that fall within a certain range, we can save on keystrokes by using BETWEEN. For example, if we want to locate records with values in the Current_balance field between 1000 and 2000, inclusive, we could use a statement in the form:
SELECT * FROM Customer WHERE Current_balance >=1000 AND Current_balance <=20000
However, the above statement will require 79 keystrokes. Using the BETWEEN key word, the same statement can be written using just 64 keystrokes.
SELECT * FROM Customer WHERE Current_balance BETWEEN (1000,2000)
Conclusion
From the above discussion, it should have become clear by now that one can perform simple data mining tasks involving data search under uncertain circumstances, using fuzzy search facilities provided in SQL language. One can generate a proper fuzzy query by performing appropriate permutations and combinations of the related key words, which reflect the fuzzy characteristic of SQL language. It will be very interesting to experiment with them to get our hands wet in the niche domain of Data Mining. Happy Experimenting!!!
The author can be reached at: sachin_a_kadam@rediffmail.com.
