Abstract – The first part of this paper gives a general overview of Knowledge Discovery in Databases. An information system may contain redundant attributes that will not aid Knowledge Discovery and may in fact mislead the process. Two tools, the Rough Set and the Soft Set, deal with the uncertainty of the information system in data mining. The second part presents the basic definitions of Rough and Soft Sets, and finally a decision-making problem is addressed. The problem is represented in the form of an information table, and the reduction of the knowledge representation system from Rough Set theory is then employed to define the Reduct-Soft-Set of a Soft Set. An algorithm is included to select an optimal choice object. The algorithm uses fewer parameters to select the optimal objects for a decision-making problem.

 

Keywords: Data mining, Soft Set, Reduct Soft Set, Rough Set, Knowledge
Discovery in Databases


 

1. INTRODUCTION

 

Data mining refers to extracting or "mining" knowledge (hidden information) from large amounts of data. Data mining is treated as a synonym for another popularly used term, Knowledge Discovery in Databases, or KDD [1]. KDD is the nontrivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data. The goal of KDD is to extract meaningful knowledge from the database. The key steps in the Knowledge Discovery cycle are data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation and knowledge presentation. Data mining is not a single technique; other commonly used techniques include statistical methods, case-based reasoning (CBR) and Bayesian Belief Networks (BBN).

 

1.1. SOFT SET

 

In practice, much of the data involved in economics, engineering, the environment, social science and medicine is not always clear-cut, and these data include all kinds of uncertainty. Most tools for modeling, reasoning and computation, however, are crisp and precise, and deal only with certain problems in economics, engineering, the environment and so on. At present there are three theories considered as mathematical tools for dealing with uncertainties: the theory of probability, the theory of fuzzy sets, and the theory of interval mathematics. However, each of these theories has its own shortcomings.

 

The theory of probability, being based on large samples, requires a great many experiments in order to obtain reliable samples. In practice, experiments in economics are difficult to carry out, unlike engineering experiments. The theory of interval mathematics takes calculation error into account and sets up an interval estimate where precise solutions are needed, but it cannot describe well the smooth change of information, uncertain information or contradictory information. The theory of fuzzy sets, first put forward by L. A. Zadeh, is the most suitable for describing fuzzy information. Fuzzy sets have a great many applications in many fields, but there remains the difficulty of how to establish an appropriate membership function. Since the character of the membership function is highly individual, there is no uniform way to set it up; for example, everyone interprets a membership value of 0.7 in their own way.

 

The reason for the deficiencies described above is the lack of a theory for expressing parameters. The tools for specifying parameters are so poor that the uncertainty of parameters becomes the bottleneck in applying these theories. To address this problem, in 1999 D. Molodtsov set up the basic theory of soft sets, which can deal with uncertain, fuzzy and unclear information.

 

      1.2. ROUGH SET

 

The method was first introduced by Zdzislaw Pawlak in 1982. Rough set analysis is a first step in analyzing incomplete or uncertain information. Rough Set analysis uses only internal information and does not rely on additional model assumptions as Fuzzy Set methods or probabilistic methods do. The Rough Set methodology provides definitions and methods for finding which attributes separate one class or classification from another. Since inconsistencies are allowed and membership in a set does not have to be absolute, the potential for handling noise gracefully is considerable. The result of a training phase with the Rough Set approach will usually be a set of propositional rules, which may be said to have syntactic and semantic simplicity for a human. How the problems of dynamic databases and of time and memory constraints are dealt with differs from system to system, but the time complexity will typically be high, perhaps polynomial.

 

Rough Sets [7] are one of the techniques for the identification and recognition of common patterns in data, especially in the case of uncertain and incomplete data. The mathematical foundations of this method are based on the set approximation of the classification space.

 

 

Rough set theory was initially developed [7] for a finite universe of discourse in which the knowledge base is a partition, obtained by an equivalence relation defined on the universe of discourse. In Rough Set theory, the data are organized in a table called a decision table. Rows of the decision table correspond to objects, and columns correspond to attributes. In the data set, a class label indicates the class to which each row belongs. The class label is called the decision attribute; the remaining attributes are the condition attributes. Here, C denotes the condition attributes, D the decision attributes, where C ∩ D = ∅, and tj denotes the jth tuple of the data table. Rough Set theory defines three regions based on the equivalence classes induced by the attribute values: the lower approximation, the upper approximation and the boundary. The lower approximation contains all the objects that can be classified with certainty on the basis of the collected data, the upper approximation contains all the objects that can possibly be so classified, while the boundary is the difference between the upper approximation and the lower approximation [8].
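As a concrete illustration (not taken from the paper), the sketch below computes the three regions for a small, invented decision table; the table, attribute names and target class are assumptions of this example.

```python
# A minimal sketch of the three Rough Set regions, assuming a decision table
# stored as a list of dictionaries (one dictionary per object).
from collections import defaultdict

def equivalence_classes(table, attributes):
    """Group object indices by their values on the given condition attributes."""
    classes = defaultdict(set)
    for i, row in enumerate(table):
        classes[tuple(row[a] for a in attributes)].add(i)
    return list(classes.values())

def approximations(table, attributes, target):
    """Return (lower, upper, boundary) approximations of the target set of objects."""
    lower, upper = set(), set()
    for eq in equivalence_classes(table, attributes):
        if eq <= target:      # wholly contained: certainly in the target
            lower |= eq
        if eq & target:       # overlaps the target: possibly in the target
            upper |= eq
    return lower, upper, upper - lower

# Example: four objects described by one condition attribute 'a'; the target set
# is the set of objects whose decision attribute 'd' equals 'yes'.
table = [{'a': 1, 'd': 'yes'}, {'a': 1, 'd': 'no'},
         {'a': 2, 'd': 'yes'}, {'a': 3, 'd': 'no'}]
target = {i for i, r in enumerate(table) if r['d'] == 'yes'}
print(approximations(table, ['a'], target))   # lower = {2}, upper = {0, 1, 2}, boundary = {0, 1}
```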

    

Within the framework of Rough Sets, the term "classification" describes the subdivision of the universal set of all possible categories into a number of distinguishable categories called elementary sets. Each elementary set can be regarded as a rule describing one object of the classification. Each object is then classified using the elementary set of features which cannot be split up any further, although other elementary sets of features may exist (Reduct system, 1993). In the Rough Set model the classification knowledge is represented by an equivalence relation IND defined on a certain universe of objects U. The pair formed by the universe of objects U and the associated equivalence relation IND is called an approximation space. The approximation space gives an approximate description of any subset X of U.

 

 

2. PRELIMINARIES

 

Definition of a soft set

Let
U be an initial universal set and let E be a set of parameters.

 

DEFINITION 2.1 (See [6]). A pair (F, E) is called a soft set over U if and only if F is a mapping of E into the set of all subsets of the set U, i.e. F : E → P(U), where P(U) is the power set of U.

 

In other words, the soft set is a parameterized family of subsets of the set U. Every set F(ε), for ε ∈ E, from this family may be considered as the set of ε-elements of the soft set (F, E), or as the set of ε-approximate elements of the soft set.
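In code, a soft set can be stored as nothing more than a mapping from parameters to subsets of the universe. The sketch below is a hedged illustration; the universe, parameter names and dictionary encoding are assumptions of this example, not notation from the paper.

```python
# A soft set (F, E) over a universe U, stored as a dictionary that sends each
# parameter to the subset of U it approximately describes.
U = {'o1', 'o2', 'o3', 'o4'}
E = {'cheap', 'fast'}

F = {
    'cheap': {'o2', 'o3'},   # F(cheap): the cheap objects
    'fast':  {'o1', 'o4'},   # F(fast): the fast objects
}

# Every F(e) is a subset of U: the set of e-approximate elements of (F, E).
for e in E:
    assert F[e] <= U
```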

 

 

As an illustration, let us consider the following examples.

(1) A soft set (F, E) describes the features of the computer which Mr. X is going to buy.

U = the set of different computer models under consideration.

E = the set of parameters. Each parameter is a word or a sentence.

E = {expensive; cheap; high speed processor; normal speed processor; multimedia keyboard; LCD; CRT; free annual maintenance contract}.

In this case, to define a soft set means to point out the cheap computers, the computers with a high speed processor, and so on. It is worth noting that the set F(ε) may be empty for some ε ∈ E.

(2) Zadeh's fuzzy set may be considered as a special case of the soft set. Let A be a fuzzy set of U with membership function μ_A, i.e., μ_A is a mapping of U into [0, 1].

 

Let us consider the family of α-level sets for the function μ_A, given by

F(α) = {x ∈ U : μ_A(x) ≥ α},  α ∈ [0, 1].

 

If we know the family F, we can find the function μ_A(x) by means of the following formula:

μ_A(x) = sup {α ∈ [0, 1] : x ∈ F(α)}.

 

     
Thus, every Zadeh fuzzy set A may be considered as the soft set (F, [0, 1]).
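A small numerical check of this correspondence is sketched below; the membership values and the finite grid of α-levels standing in for [0, 1] are assumptions of the example.

```python
# View a fuzzy set as a soft set over the parameter set [0, 1]: F(alpha) is the
# alpha-level set, and the membership degree of x is recovered as the largest
# level whose level set still contains x.
mu = {'x1': 0.2, 'x2': 0.7, 'x3': 1.0}        # an illustrative fuzzy set on U
levels = [i / 10 for i in range(11)]          # a finite grid standing in for [0, 1]

F = {a: {x for x, m in mu.items() if m >= a} for a in levels}

recovered = {x: max(a for a in levels if x in F[a]) for x in mu}
assert recovered == mu   # holds here because mu only takes values on the grid
```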

 

(3) Let (X, τ) be a topological space; that is, X is a set and τ is a topology, in other words, τ is a family of subsets of X, called the open sets of X.

Then the family of open neighbourhoods T(x) of a point x, where T(x) = {V ∈ τ : x ∈ V}, may be considered as the soft set (T(x), τ).

 

 

The way of setting (or describing) any object in soft set theory differs in principle from the way we use classical mathematics. In classical mathematics, we construct a mathematical model of an object and define the notion of the exact solution of this model. Usually the mathematical model is so complicated that we cannot find the exact solution. So, in a second step, we introduce the notion of an approximate solution and calculate that solution.

 

 

In soft set theory we take the opposite approach to this problem. The initial description of the object is of an approximate nature, and we do not need to introduce the notion of an exact solution.

 

The absence of any restrictions on the approximate description in soft set theory makes this theory very convenient and easily applicable in practice. We can use any parameterization we prefer: with the help of words and sentences, real numbers, functions, mappings, and so on. This means that the problem of setting the membership function, or any similar problem, does not arise in soft set theory.

 

 

DEFINITION 2.2 (See [7]). A knowledge representation system can be formulated as follows.

A knowledge representation system is a pair S = (U, A), where

 

U = a nonempty, finite set called the universe;

A = a nonempty, finite set of primitive attributes.

 

Every primitive attribute a ∈ A is a total function a : U → Va, where Va is the set of values of a, called the domain of a.

 

DEFINITION 2.3 (See [7]). With every subset of attributes B ⊆ A, we associate a binary relation IND(B), called an indiscernibility relation, defined by

IND(B) = {(x, y) ∈ U × U : a(x) = a(y) for every a ∈ B}.

 

Obviously, IND(B) is an equivalence relation, and

IND(B) = ∩ {IND({a}) : a ∈ B}.
 

Every subset B ⊆ A will be called an attribute. If B is a single-element set, then B is called primitive; otherwise the attribute is said to be compound. The attribute B may be considered as a name of the relation IND(B), or in other words, as a name of the knowledge represented by the equivalence relation IND(B).
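The indiscernibility relation can be computed directly from a knowledge representation system. The sketch below is an assumed illustration (table contents and attribute names are invented); it also checks the identity stated above, namely that IND(B) is the intersection of the relations IND({a}) for a in B.

```python
# IND(B): two objects are indiscernible when they agree on every attribute in B.
def ind(table, B):
    """Return IND(B) as a set of ordered pairs of object indices."""
    return {(i, j)
            for i, x in enumerate(table)
            for j, y in enumerate(table)
            if all(x[a] == y[a] for a in B)}

table = [{'colour': 'red',  'size': 'big'},
         {'colour': 'red',  'size': 'small'},
         {'colour': 'blue', 'size': 'big'}]

# IND({colour, size}) coincides with IND({colour}) ∩ IND({size}).
assert ind(table, ['colour', 'size']) == ind(table, ['colour']) & ind(table, ['size'])
```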

 

 

 

DEFINITION 2.4 (See [7]). Let R be a family of equivalence relations and let A ∈ R. We will say that A is dispensable in R if IND(R) = IND(R − {A}); otherwise A is indispensable in R. The family R is independent if each A ∈ R is indispensable in R; otherwise R is dependent.

 

If R is independent and P ⊆ R, then P is also independent.

Q ⊆ P is a reduct of P if Q is independent and IND(Q) = IND(P).

 

 

Intuitively, a reduct of knowledge is its essential part, which suffices to define all the basic concepts occurring in the considered knowledge, whereas the core is, in a certain sense, its most important part. The set of all indispensable relations in P will be called the core of P, and will be denoted CORE(P). Clearly CORE(P) = ∩ RED(P), where RED(P) is the family of all reducts of P.

 

The use of the concept of the core is twofold. First, it can be used as a basis for the computation of all reducts, for the core is included in every reduct and its computation is straightforward. Secondly, the core can be interpreted as the set of the most characteristic parts of knowledge, which cannot be eliminated when reducing the knowledge.
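A brute-force way to obtain the reducts and the core of a small attribute family is sketched below; it is an assumed illustration (the enumeration strategy and the toy table are our own), practical only for a few attributes.

```python
# Enumerate all minimal attribute subsets that preserve the indiscernibility
# relation (the reducts), and take their intersection (the core).
from itertools import combinations

def ind(table, B):
    """IND(B) as a set of ordered pairs of indices of indiscernible objects."""
    return {(i, j) for i, x in enumerate(table) for j, y in enumerate(table)
            if all(x[a] == y[a] for a in B)}

def reducts(table, P):
    """All minimal subsets Q of P with IND(Q) = IND(P)."""
    full = ind(table, P)
    candidates = [set(Q) for r in range(1, len(P) + 1)
                  for Q in combinations(P, r) if ind(table, Q) == full]
    return [Q for Q in candidates if not any(R < Q for R in candidates)]

def core(table, P):
    """CORE(P): the intersection of all reducts of P."""
    return set.intersection(*reducts(table, P))

rows = [{'a': 1, 'b': 1}, {'a': 2, 'b': 1}, {'a': 3, 'b': 2}]
print(reducts(rows, ['a', 'b']))   # [{'a'}] -- attribute 'b' is dispensable here
print(core(rows, ['a', 'b']))      # {'a'}
```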

 

 

DEFINITION 2.5. DEPENDENCY OF KNOWLEDGE (See [7]). Formally, the dependency can be defined as shown below. Let K = (U, R) be a knowledge base and let P, Q ⊆ R.

1. Knowledge Q depends on knowledge P, denoted P ⇒ Q, if IND(P) ⊆ IND(Q).

2. Knowledge P and Q are equivalent, denoted P ≡ Q, if P ⇒ Q and Q ⇒ P.

3. Knowledge P and Q are independent, denoted P ≢ Q, if neither P ⇒ Q nor Q ⇒ P holds.

Obviously, P ≡ Q if IND(P) = IND(Q).

 

 

DEFINITION 2.6 (See [5]). For two soft sets (F, A) and (G, B) over a common universe U, we say that (F, A) is a soft subset of (G, B) if

(i) A ⊆ B, and

(ii) for every ε ∈ A, F(ε) and G(ε) are identical approximations.

We write (F, A) ⊆ (G, B).

(F, A) is said to be a soft superset of (G, B) if (G, B) is a soft subset of (F, A). We denote this by (F, A) ⊇ (G, B).
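With soft sets stored as parameter-to-subset dictionaries, as in the earlier sketches, this definition can be checked in one line; the function name and encoding are assumptions of the illustration.

```python
def is_soft_subset(F, G):
    """(F, A) is a soft subset of (G, B): A is contained in B and F(e) = G(e) for every e in A."""
    return set(F) <= set(G) and all(F[e] == G[e] for e in F)

F = {'cheap': {'o2', 'o3'}}
G = {'cheap': {'o2', 'o3'}, 'fast': {'o1'}}
assert is_soft_subset(F, G) and not is_soft_subset(G, F)
```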

 

 

3. AN ILLUSTRATION FOR THE TOOLS DEALING WITH
UNCERTAINTY

 

Molodtsov [6] presented some applications of soft set theory in several directions, viz., the study of smoothness of functions, game theory, operations research, Riemann integration, Perron integration, probability, the theory of measurement, etc. In this section we present an application of soft set theory to a decision-making problem with the help of the rough approach [7]. The problem we consider is as follows.

 

Let U = {m1, m2, m3, m4, m5, m6} be a set of six models of computers, and let E = {expensive; cheap; high speed processor; normal speed processor; multimedia keyboard; LCD; CRT; free annual maintenance contract} be a set of parameters.

 

Consider the soft set (F, E) which describes the 'features of the computers', given by

(F, E) = { expensive computers = {m1, m4},
cheap = {m2, m3, m5, m6},
high speed processor = {m1, m4, m5, m6},
normal speed processor = {m2, m3},
multimedia keyboard = {m1, m4, m5, m6},
LCD = {m1, m2, m4, m5, m6},
CRT = {m3, m6},
free annual maintenance contract = {m1, m2, m4, m6} }.

 

Suppose that Mr. X is interested in buying a computer on the basis of his choice parameters 'cheap; high speed processor; multimedia keyboard; LCD; free annual maintenance contract', which constitute the subset P = {cheap; high speed processor; multimedia keyboard; LCD; free annual maintenance contract} of the set E. That means that, out of the models available in U, he is to select the computer model which qualifies with all (or with the maximum number of) parameters of the set P.

 

Suppose that another customer, Mr. Y, wants to buy a model on the basis of a set of choice parameters Q ⊆ E, and Mr. Z wants to buy a model on the basis of another set of parameters R ⊆ E.

 

The problem is to select the model which is most suitable with respect to the choice parameters of Mr. X. The model which is most suitable for Mr. X need not be the most suitable for Mr. Y or Mr. Z, as the selection depends upon the set of choice parameters of each buyer.

 

To solve the problem, we use some theoretical characterizations of the soft set theory of Molodtsov, which we present below.

 

3.1      Tabular representation of a soft set

Tabular representations of soft sets were given earlier by Lin [3] and Yao [9]. We present an almost analogous representation in the form of a binary table. For this, consider the soft set (F, P) above, based on the set P of choice parameters of Mr. X. We can represent this soft set in tabular form as shown below. This style of representation is useful for storing a soft set in computer memory. If mi ∈ F(ej), then mij = 1; otherwise mij = 0, where mij are the entries in Table 1.

 

Table 1.

U      e1   e2   e3   e4   e5
m1      0    1    1    1    1
m2      1    0    0    1    1
m3      1    0    0    0    0
m4      0    1    1    1    1
m5      1    1    1    1    0
m6      1    1    1    1    1

 

Thus a soft set can now be viewed as a knowledge representation system (Definition 2.2) in which the set of attributes is replaced by a set of parameters.
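The binary table above can be produced mechanically from the soft set. The sketch below is a hedged illustration; the dictionary encoding of (F, P) is our own, while the parameter labels e1 to e5 follow the order cheap, high speed processor, multimedia keyboard, LCD, free annual maintenance contract used in Table 1.

```python
# Build the binary table of the soft set (F, P): the entry for model m and
# parameter e is 1 exactly when m belongs to F(e).
U = ['m1', 'm2', 'm3', 'm4', 'm5', 'm6']
P = ['e1', 'e2', 'e3', 'e4', 'e5']
F = {'e1': {'m2', 'm3', 'm5', 'm6'},          # cheap
     'e2': {'m1', 'm4', 'm5', 'm6'},          # high speed processor
     'e3': {'m1', 'm4', 'm5', 'm6'},          # multimedia keyboard
     'e4': {'m1', 'm2', 'm4', 'm5', 'm6'},    # LCD
     'e5': {'m1', 'm2', 'm4', 'm6'}}          # free annual maintenance contract

table = {m: [1 if m in F[e] else 0 for e in P] for m in U}
for m, row in table.items():
    print(m, row)            # reproduces the rows of Table 1
```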

 

 

3.2      Reduct – Table of a Soft Set

         

Consider the soft set (F, E). Clearly, for any P ⊆ E, (F, P) is a soft subset of (F, E). We will now define the reduct-soft-set of the soft set (F, P).

 

 

Consider the tabular representation of the soft set (F, P). If Q is a reduct of P, then the soft set (F, Q) is called the reduct-soft-set of the soft set (F, P).

 

Intuitively, a reduct-soft-set (F, Q) of the soft set (F, P) is its essential part, which suffices to describe all the basic approximate descriptions of the soft set (F, P).

 

The core soft set of (F, P) is the soft set (F, C), where C = CORE(P).

 

3.3      Choice value of an object mi

The choice value of an object mi ∈ U is ci, given by

ci = Σj hij,

where the hij are the entries in the table of the reduct-soft-set.

 

 

3.4      Algorithm for Selection of the Computer

 

The following algorithm may be followed by Mr. X to select the model of computer he wishes to buy; a runnable sketch of the final steps is given after the list.

1. Input the soft set (F, E).

2. Input the set P of choice parameters of Mr. X, which is a subset of E.

3. Find all reduct-soft-sets of (F, P).

4. Choose one reduct-soft-set, say (F, Q), of (F, P).

5. Find k for which ck = max ci.

Then mk is the optimal choice object. If k has more than one value, then any one of the corresponding models may be chosen by Mr. X at his option.
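The sketch below applies steps 4 and 5 to the worked example, taking the reduct Q = {e1, e2, e4, e5} chosen in the text as given; the dictionary encoding of the soft set is the same assumption as in the earlier table sketch.

```python
# Steps 4-5: compute the choice value of each model over the chosen reduct Q
# and pick a model with the maximum choice value.
U = ['m1', 'm2', 'm3', 'm4', 'm5', 'm6']
F = {'e1': {'m2', 'm3', 'm5', 'm6'},          # cheap
     'e2': {'m1', 'm4', 'm5', 'm6'},          # high speed processor
     'e4': {'m1', 'm2', 'm4', 'm5', 'm6'},    # LCD
     'e5': {'m1', 'm2', 'm4', 'm6'}}          # free annual maintenance contract

Q = ['e1', 'e2', 'e4', 'e5']                  # the reduct chosen in the text

# Choice value c_i = number of parameters in Q that model m_i satisfies.
choice = {m: sum(1 for e in Q if m in F[e]) for m in U}
best = max(choice, key=choice.get)
print(choice)   # {'m1': 3, 'm2': 3, 'm3': 1, 'm4': 3, 'm5': 3, 'm6': 4}
print(best)     # m6 -- matching the decision reached in the text
```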

 

Now we use the algorithm to solve our original problem.

Clearly, from the table we see that {e1, e3, e4, e5}, {e1, e2, e4, e5} and {e1, e2, e3, e5} are the three reducts of P = {e1, e2, e3, e4, e5}. Choose any one, say Q = {e1, e2, e4, e5}.

 

Incorporating the choice values, the reduct-soft-set can be represented as in Table 2 below.

Table 2.

U      e1   e2   e4   e5   Choice value
m1      0    1    1    1    c1 = 3
m2      1    0    1    1    c2 = 3
m3      1    0    0    0    c3 = 1
m4      0    1    1    1    c4 = 3
m5      1    1    1    0    c5 = 3
m6      1    1    1    1    c6 = 4
 

Here max ci = c6.

Decision: Mr. X should buy the model m6.

 

4. CONCLUSION

 

The Soft Set theory of Molodtsov [6] and the Rough Set technique of Pawlak [8] offer a general tool for dealing with uncertain information. In this paper, we gave an application of Soft Set theory to a decision-making problem using the rough technique of Pawlak [8].