Abstract – The first part of this paper give a general view of

Knowledge Discovery in Database. The information system contains redundant attributes that will not aid Knowledge Discovery and may

in fact mislead the process. The tools -Rough Set and the Soft Set, deals with

uncertainty of the information

system in data mining. In the

second part the basic definitions in Rough and Soft Sets are included and at

last a decision making problem is

addressed. The problem is represented in the form of an information table and

then the reduction of the knowledge representation system in Rough Set theory

to define the Reduct-Soft Set of a Soft Set is employed. An algorithm is

included to select an optimal choice object. The algorithm uses fewer

parameters to select the optimal objects for a decision making problem.

Keywords: Data mining, Soft Set, Reduct Soft Set, Rough Set, Knowledge

Discovery in Databases

1. INTRODUCTION

Data

mining refers to extracting or “mining” knowledge (hidden information) from

large amounts of data. Data mining treats as synonym for another popularly used

term, Knowledge Discovery in Databases or KDD 1. KDD is the nontrivial

process of identifying valid, novel, potentially useful and ultimately

understandable patterns in data. The

goal of KDD is to extract meaningful knowledge from the database. The key steps

in Knowledge Discovery cycle are Data cleaning, Data integration, Data

selection, Data transformation, Data mining, Pattern Evaluation, Knowledge

presentation. Data mining is not a

single technique, some other commonly used techniques are : statistical

methods, case-based reasoning (CBR), Bayesian Belief Networks (BBN ) etc.,.

1.1. SOFT SET

In

practice, the great deal of data involved in economy, engineering, environment,

social science and medicine is not always vivid and these data include all

kinds of uncertainty. But the most tools for modeling, reasoning and

calculation are certain or precise, which deal with certain problems in

economy, engineering, environment and so on. At present there are three

theories considered as mathematical tools for dealing with uncertainties:

theory of probability, theory of fuzzy sets, and theory of interval

mathematics. However, the three theories have respectively shortage.

Theory

of probability based on full samples need to make a great lot of experiments in

order to gain check samples. In practice, experiments on economy are difficult

to do, not like engineering experiments. Theory of interval mathematics takes

into account calculation error and sets up an interval estimation when need

precise solutions. But theory of interval mathematics can’t well describe the

smooth change of information, uncertain information and contradiction

information. Theory of fuzzy sets is fittest to describe fuzzy information,

which is firstly put forward by L. A. Zadeh. Fuzzy sets have great deal of an

application in many fields, but there is a piece of difficulty that how to

establish appropriate subject function. Since the character of subject function

is extreme individuation, there is no uniform ways to set up subject function.

For example every one understands F (x) = 0.7 by own way.

The

reason why there exists deficient as above is lack of the theory of expressing

parameters. The tools for making sure parameters are so poor that uncertainty

of parameters becomes the bottleneck of using these theories. To solve this

problem, 1999 D. Molodtsov set up the basic theory of soft sets which can deal

with uncertain fuzzy, unclear information.

1.2. ROUGH SET

The

method was first introduced by Zdzislaw Pawlak in 1982. Rough set analysis is a

first step in analyzing incomplete or uncertain information. Rough Set analysis

uses only internal information and does not rely on additional model

assumptions as Fuzzy Set methods or probabilistic methods do. The Rough Set methodology provides

definitions and methods for finding which attributes separates one class or

classification from another. Since inconsistencies are allowed and membership

in a set does not have to be absolute, the potential for handling noise

gracefully is big. The results from a training phase when using the Rough Sets

approach will usually be a set of propositional rules which may be said to have

syntactic and semantic simplicity for a human. How the problems of dynamic

databases and time and memory constraints are dealt with will be different for

each system using the Rough Set approach, but typically the time complexity

will be high and perhaps polynomial.

The

Rough Sets 7 is one of the techniques for the identification and recognition

of common patterns in data, especially in the case of uncertain and incomplete

data. The mathematical foundations of this method are based on the set

approximation of the classification space.

Rough set theory was initially developed 7

for a finite universe of discourse in which the knowledge base is a partition,

which is obtained by any equivalence relation defined on the universe of discourse.

In Rough Sets theory, the data is organized in a table called decision table.

Rows of the decision table correspond to objects, and columns correspond to

attributes. In the data set, a class label to indicate the class to which each

row belongs. The class label is called as decision attribute, the rest of the

attributes are the condition attributes. Here, C is used to denote the condition

attributes, D for decision attributes, where C Ç D = ?, and tj denotes the jth

tuple of the data table. Rough Sets

theory defines three regions based on the equivalent classes induced by the

attribute values: lower approximation, upper approximation, and boundary. Lower

approximation contains all the objects, which are classified surely based on

the data collected, and upper approximation contains all the objects, which can

be classified probably, while the boundary is the difference between the upper approximation

and the lower approximation 8.

Within

the frame work of Rough Sets the term “classification ” describes the

subdivision of the universal set of all possible categories into a number of

distinguishable categories called elementary sets. Each elementary set can be

regarded as a rule describing the object of the classification. Each object is

then classified using the elementary set of features which cannot be split up

any further, although other elementary sets of features may exist(Reduct

system, 1993). In the Rough Set model the classification knowledge is

represented by an equivalence relation IND defined on a certain universe of

objects U. The pair of the universe objects U and the associated equivalence

relation IND forms an approximation space. The approximation space gives an

approximation description on any subset x of U.

2. PRELIMINARIES

Definition

of Soft set

Let

U be an initial universal set and let E be a set of parameters.

DEFINITION

2.1 ( See 6.) A pair (F, E) is called a soft set over U if and only if F is a

mapping of E into the set of all subsets of the set U .i.e. F: E , where P(U) is the

power set of U.

In

other words, the soft set

is a parameterized

family of subsets of the set U. Every set F(), for E.from the family

may be considered as the set of elements of the soft

set (F,E) , or as the set of approximate elements of the soft set.

As

an illustration, let us consider the following examples

(1) A soft set (F, E) describes the features of the

computer which Mr. X is going to buy.

U = the set of different computer models under

construction

E = the set of parameters .Each parameter is a word or

a sentence.

E = {{expensive; cheap; high speed processor; normal

speed processor; multimedia keyboard; LCD ; CRT; free annual maintenance

contract}.

In this case to define

a soft set means point out cheap ,high speed processor and so on . Its

worth noting that the sets F()may be empty for some E.

(2) Zadeh ‘s fuzzy

set may be considered as a special case of the soft set. Let A be

A fuzzy set of U with membership ,i.e., is a mapping of U into 0,1.

Let us consider the family of -level sets for the function given by

F() = { () }, 0,1.

If we know the family F, we can find the functions () by means of the

following formulae:

() = sup ().

0,1. F()

Thus , every Zadeh ‘s

Fuzzy set A may be considered as the soft set (F, 0,1).

(3)

Let (X,) be a Topological space, that is ,X is a set and is a Topology , in

other

words, is a family of subsets of X , called the open sets of X.

Then, the family

of open neighbourhoods T of point, where T = {V: V}, may be considered as the soft set (T,).

The way of

setting ( or describing ) any object in the soft set theory principally differs

from the way in which we use classical mathematics . In classical mathematics, we

construct a mathematical model of a object and define the notion of the exact

solution of this model . Usually the mathematical model is so complicated and

we cannot find the exact solution . So, in the second step, we introduce the

notion of approximate solution and calculate the solution

In the soft

set theory we have the opposite approach to this problem. The initial

description of the object has an approximate nature, and we do not need to

introduce the notion of exact solution

The absence

of any restrictions on the appropriate description in soft set theory makes

this theory very convenient and easily applicable in practice. We can use any

parameterization we prefer: with the help of words and sentences , real

numbers, functions, mappings, and so on .It means that the problem of setting

the membership function or any similar problem does nit arise in the soft set

theory

DEFINITION

2.2(See 7) A knowledge representation

system can be formulated as follows :

Knowledge

representation system is a pair S = (U,A), where

U = a nonempty, finite set

called the universe.

A = a nonempty ,finite set of

primitive attributes.

Every

primitive attribute a A is a total function U, Where is the set of values of

, called the domain of

.

DEFINITION

2.3 (see 7)With every subset of attributes B A, we associate binary

relation IND(B) called an indiscernibility relation , defined by

IND(B) = { every }.

Obviously

, IND(B) is an equivalent relation and

Every

subset B A will be called an

attribute . If B is a single element set, then B is called primitive ,

otherwise the attribute is said to be compound. Attribute B may be considered

as a name of the relation IND(B) , or in other words – as a name of knowledge

represented by an equivalence relation IND(B).

DEFINITION

2.4 (See 7) Let be a family of

equivalence relations and let A R. We will say that A

is dispensable in R if IND(R) = IND(R – {A}); otherwise A is indispensable in

P. The family R is independent if each A R is indispensable in

R; otherwise R is dependent.

If

R is independent and P R. then P is also

independent.

Q P is a reduct of P if

Q is independent and IND(Q) = IND(P).

Intuitively

, a reduct of knowledge is essential

part, which suffices to define all basic concepts occurring in the

considered knowledge ,whereas the core is in a certain sense its most important

part . The set of all indispensable relations in P will be called the core of P , and will be denoted

CORE(P) . Clearly CORE(P) = RED (P). where RED (P)

is the family of all reducts of P.

Then

use of the concept of the core is twofold. First it can be used as a basis for

computation of all reducts, for the core is included in every reduct , and its

computations straightforward. Secondly ,the core can be interpreted as the set

of the most characteristic part of

knowledge ,which cannot be

eliminated when reducing the knowledge.

DEFINITION

2.5 . DEPENDENCY OF KNOWLEDGE (see 7.)

Formally the dependency can be defined as shown below: Let K = (U, R) be a knowledge base and let P,Q R.

1. Knowledge Q depends on knowledge P if IND(P) IND(Q).

2. Knowledge P and Q are equivalent denoted as P Q, if P Q and Q P.

3. Knowledge P and Q are independent ,denoted as P ? Q , if

Neither P Q nor Q P hold.

Obviously , P Q, if IND(P)

= IND(Q).

DEFINITION 2.6 (See 5 .) For two soft sets (F,A) and (G, B) over a common universe U, we

say that (F,A) is a soft subset of

(G,B)if

(i)

A B, and

(ii)

A, F(). And G() are identical approximations.

We write (F,A) (G,B) .

(F,A) is said to be a soft super set of (G,B) , if

(G,B) is a soft subset of (F,A). We denote it by (F,A) (G,B).

3. AN ILLUSTRATION FOR THE TOOLS DEALING WITH

UNCERTAINTY

Molodstov 6 presented some applications of the soft

set theory in several directions viz.,

study of smoothness of functions game theory,

operations research, Riemann-integration, Perron integration, probability ,

theory of measurement etc. In this section we present an application of

soft set theory in a decision making

problem with the help of rough approach7.The problem we consider as

below.

Let U = {m1,m2,m3,m4,m5,m6

} , be a set of six models of computers ,E = {expensive; cheap; high

speed processor; normal speed processor; multimedia keyboard; LCD ; CRT; free

annual maintenance contract}, be a set of parameters.

Consider the soft set

(F,E) which describes the ‘features of the computers’ ,given by (F,E)= {expensive computers={m1,m4},

cheap = {m2,m3,,m5,m6 }, high

speed processor ={m1,,m4,m5,m6}normal

speed processor = {m2,m3, }. multimedia

key board= {m1,m4,m5,m6 }, LCD=

{m1,m2,,m4,m5,m6 },

CRT= {m3,,m6 }, free annual maintenance

contract= { m1,m2,,m4,,m6 }}.

Suppose that , Mr. X is interested to buy a computer

on the basis of his choice parameters ‘cheap; high speed processor; multimedia

keyboard; LCD ;free annual maintenance contract’ etc., which constitute the

subset P = { cheap; high speed

processor; multimedia keyboard; LCD; free annual maintenance contract } of the

set E . That means, out of available models in U , he is to select the computer

model which qualifies with all (or with maximum number of) parameters of the

soft set P.

Suppose that, another customer Mr. Y wants to buy a

model on the basis of the sets of choice parameters Q E and Mr. Z wants to

buy a model on the basis of another set of parameters R E.

The problem is to select model which is most suitable

with choice parameters of Mr. X. The model which is most suitable for Mr. X , need

not be most suitable for Mr. Y or MR. Z as the selection is dependent upon the set of choice parameters of each

buyer.

To solve the problem , we do some theoretical

characterizations of the soft set theory of Molodtsov, which we present below.

3.1 Tabular representation of a soft set

Tabular representation of soft sets were done by Lin3

and Yao 9 earlier. We present an almost analogous representation in the form

of binary table . For this consider the soft set (F,P) above on the basis of

the set P of choice parameters of Mr. .X .We can represent this soft set in a

tabular form as shown below . This style of representation will be useful for

storing a soft set in a computer memory . If

mi F() then mij

= 1. otherwise mij = 0 where mij are

the entries in Table 1

Table

1.

U

e1

e2 e3 e4 e5

m1

m2

m3

m4

m5

m6

0 1 1

1

1

1 0

0 1

1

1 0

0 0 0

0 1

1

1 1

1 1 1

1 0

1 1 1

1 1

Thus soft set now can be viewed as a knowledge

representation system (Definition 2.2) where the set of attributes is to be

replaced by a set of parameters.

3.2 Reduct – Table of a Soft Set

Consider the

Soft set (F,E). Clearly for any P E. (F, P ) is a soft

subset of (F,E) . We will now define reduct-soft –set of the soft set (F,P).

Consider the tabular representation of the soft set (F,P). If Q is a reduct of P

, then the soft set (F,Q) is called the reuduct-soft-set of the soft set (F,P).

Intuitively , a reduct –soft-set (F,Q) of the soft set

(F,P) is the essential part, which suffices to describe al basic approximate

descriptions of the soft set (F,P).

The core soft set of (F,P) is the soft set (F,C) .

where C is the CORE(P).

3.3 Choice value of an object mi

The choice value of an object mi U is Ci given by

Ci

=ij;

Where hij are the entries in the table of

the reduct-soft-set.

3.4 Algorithm for Selection of the Computer

The following algorithm may be followed by MR. X to

select the model of computer he wishes to buy

1. input the soft set (F,E),

2. input the set P of choice parameters of Mr. X which is

a subset of E.

3. find all reduct –soft-sets of (F,P).

4. Choose one reduct-soft-set say (F,Q) of (F,P).

5. find K, for which Ck= max ci .

Then mk is the optimal choice object

. If k has more than one value, then any

one of them could be chosen by Mr. X by using his option.

Now we use the algorithm to solve our original

problem.

Clearly from the table we see that (e1,e3,e4,e5},

{ e1,e2,e4,e5},{ e1,e2,e3,,e5

}. are the three reducts of

P = { e1,e2,e3,e4,e5

}. Choose any one say Q = { e1,e2,,e4,e5}.

Incorporating the choice values, the reduct –soft-set

can be represented in Table 2 below.

Table

2.

U

e1

e2 e4 e5

Choice value

m1

m2

m3

m4

m5

m6

0 1 1

1

1 0

1 1

1 0 0 0

0 1 1 1

1 1 1 0

1 1 1 1

C1 = 3

C2

= 3

C3

= 1

C4

= 3

C5

= 3

C6

= 4

Here max Ci = C6.

Decision :Mr x is better to buy the model m6

4.CONCLUSION

The Soft Set

theory of Molodtsov 6 and Rough Technique of Pawlak 8 offer a a general

tool for dealing with uncertain

information. In this paper, we gave an

application of Soft Set theory in a decision making problem by the rough

technique of Pawlak8.