Abstract

The inside data in the organization may increase rapidly

with time. In order to reduce cost of organization, they may chose third-party

storage provider to store entire data. There is a leakage crisis when provider

cannot be trusted. Another scenario, a dealer collects all transaction data and

publishes to the data analysis company for marketing purpose. It may reveal

privacy when the company is malicious. For these reason, preserving privacy in

database becomes very important issue. This paper concerns prediction

disclosure risk in numerical database. We present an efficient noise generation

which relies on Huffman coding algorithm. We also build a noise matrix that can

add intuitively noise to original value. Moreover, we adopt clustering

technique before generating noise. The result shows the running time of noise

generation of clustering scheme is faster than unclustering scheme.

Keywords:

Huffman coding, noise matrix, disclosure control,

microaggregation, database privacy

1.Introduction

Technology brings convenience, and the cloud

computing technique rises in recent years. The inside data in organization may

increase rapidly. In spite of organization may build storage space by himself,

but they may publish these data into the data analysis company for some

marketing purposes. Hence, the data mining techniques plays an important role

in the Knowledge Discovery in Databases (KDD). But the malicious data analysis

company may record personal data when organization publishes statistical

database to the company end. If the company is not trusted, there is a leakage

crisis. For these reasons, it leads to privacy research becomes more popular in

these years. Statistical Data Bases (SDBs) are used to produce result of

statistical aggregates, such as sum, average, max and min. The results of

statistical aggregates do not reveal the content about any single individual

tuple. However, the user may ask many legal queries to infer confidential

information from the gaining of database responses.

In

recent years enhancing the security of statistical database has gotten a lot of

attention. The problem of security in classical statistical database involves

three different roles 17: statistician, who interest is to gain aggregate

data; data owner, who wishes individual records are security; database

administrator, who needs to satisfy both of above roles. The privacy challenges

in statistical database are classified to two aspects 15: for data owner, it

should avoid data theft by hacker, avoid data abuse by service provider, and

restrict user access right; for user, it should hide query content, and database

does not reveal query detail. There are many approaches have been proposed. Navarro-Arribas and Torra

organize four categories approaches as following 16: 1) Perturbative methods, which modify the original data to reach a

degree of privacy. They usually called noise; 2) Non-perturbative methods, that technique masks the data without

introducing error. In contrast to perturbative methods, data is not distortion;

3) Cryptographic methods, which use

classical cryptography system; 4) Synthetic

data generation, which generates random data while retaining relationship

with the original data. In order to protect confidential information in database,

Statistical Disclosure Control (SDC) is most used for privacy preserving

solution on statistical database. Micro-Aggregation Techniques (MATs) are

considered to the family of SDC, and belong to perturbative methods. The method

of microaggregation has many attractive features including robust performance,

consistent responses, and ease to implementation 6. User is able to get

useful information since this method would not reduce the information in the

content. In other words, there is minimum information loss through this method.

Furthermore, we review

some approaches for preserving privacy 1-5,8.12-14,17. In particular, the

microaggregation scheme is attracted to be use on statistical databases in

these years, because it replaces the original value, lower distortion, to

prevent the identity and prediction disclosure. And the replaced data has not lead

to the problem for data analysis or data mining applications. All of records in

database can be represented to a data point in coordinate systems.

This

paper considers a combination of two or more nonconfidental attributes, such as

age, weight, can be used to link an individual. Such set of attributes is

collectively called a quasi identifier. A popular approach for replacing

original data is to use clustering-based technique to prevent identity

disclosure. Hence, the adversary may be confused when the original data is

replaced by grouping measure. Although the data in the dataset is homogeneous

by clustering-based technique, but there is a problem of prediction disclosure.

2.Proposed scheme

The

paper concerns the problem of prediction disclosure that quasi-identifier is

generalized by homogenous of microaggregation method. The quasi-identifier has

one or more attributes may link to an individual. For briefly, we only consider

a quasi-identifier with two attributes. First, all values of quasi-identifier

are converted to data point on coordinate system. To address prediction

disclosure, the homogenous values after the process of original

microaggregation method do cluster first. Then we generate noise based on

centroid of these groups. In order to enhance speed of noise injection, all

noise values are formed to a set, which is called noise matrix in this paper.

Each original value corresponds to a noise value. In this section, we introduce

concept of microaggregation, and next illustrate the clustering technique which

is based on Prim’s MST. The paper mainly idea are generating noise and noise

injection procedure. These of two will describe in the rest of this section.

2.1Preliminary

Microaggregation technique is the family of statistical disclosure

control and is applied to numerical data, categorical data, sequences, and heterogeneous

data 16. It computes a value to represent a group and replace original value

to confuse the adversary. All of records are formed a group with its nearest records. is a constant value, threshold, preset by data

protector. If is higher, the degree of privacy is higher,

but data quality is lower. In the contrast, is lower, the degree of privacy is lower, but

data quality is higher. It is a trade-off between data disclosure risk and less

information loss. Although this method may damages the original data and may

leads to data distortion. But it just makes sure that the low levels of data

distortion. It has not affected to database operation. Therefore, minimizing

the information loss is a major challenge of this method. There are two main

operations for microaggregation, that are partition and aggregation which we

describe in detail as follows:

l

Partition: records are partition to

several disjoint groups, and each group is included at least records.

l Aggregation: each

record in group is replaced by the centroid of group, which is a computed value

to represent the group..

2.2MST Clustering

We

adopt Prim’s minimum-cost spanning tree clustering technique which is proposed

by Lazlo and Mukherjee in 2005 11.

First

step, the proposed clustering technique relies on Prim’s minimum-cost spanning

tree, which is constructed based on all records in the dataset. Prim’s

algorithm is a greedy algorithm that finds a minimum-cost spanning tree for a

connected edges undirected graph. It finds a subset of edges to form a

minimum-cost spanning tree that connects to all nodes, where the total weight

of all edges is minimized. Some notation is defined to facilitate the

discussion. Each record with more attributes in the dataset D can be converted

to data point on the coordinate systems and is considered a node u in the

minimum-cost spanning tree. The node u can be connected to the other node v in

the dataset D and forms an edge e(u,v), u,v?D. All

of edges can be computed to a value by random two nodes in the dataset. That

computed value can be used to as a weight w for each edge. According to Prim’s algorithm, it first selects a single node u?D and builds a minimum-cost spanning tree F={u}, no edges. The

next step of Prim’s

algorithm selects another node v? F-D,

where v is closest to the set F and is closest to the node u. There is a new

edge e(u,v) be formed by two nodes u,v?D, and

node v points to parent node u, and add v to the set F, F={u,v}. Each node

points to its parent node in the tree, but the initial node points null. In

this case, the node u points null. This is an iterative process until F=D.

Prim’s algorithm selects a single node, regarded as a root of tree, in the

graph to grow to a minimum-cost spanning tree. The total weight in all selected

edges is minimized. The result of Prim’s MST algorithm is shown in Fig 1, where

nodes of tree are connected by red lines and the number of weight is near to

each edge.

Fig 1. Minimum-cost spanning tree

Second step, in order to partition all nodes to form cluster in

the MST, we should consider that how many edges in the MST are removable. The

idea is visits all edges in the MST from longest to shortest, and determines

the edge cutting while retaining the remaining edges. After edge cutting, the

MST partitions to several subtrees and these can be formed cluster. All of

edges are assigned to a priority queue in descending order. Then, we obtain an edge in sequence from the priority queue , and

consider each edge whether is removable,

where is visiting node and is parent node of . We consider

the two subtrees size from visiting node and parent node respectively, and determine each size is

greater than which preset by protector. The edge is removable when both of two subtrees sizes

are greater than ,

respectively. In the contrast, the edge is not removable. First, we obtain a subtree

size from visiting node by , where is used to obtain subtree size from node . Second, we

consider the root node from visiting node toward its parent node . Then we

obtain another subtree size by . For briefly

illustrate, we suppose these two subtrees size are greater than , that edge is removable. We remove the edge from priority queue , , and replace

the parent pointer of to to represent it is a root node of subtree.

Final

step is a simple processing for all nodes partition to disjoint cluster. Each

root of subtree can be formed cluster by traversal its descendant nodes. We

find out all node which parent pointer is , and assign to a set of root . The parent pointer with which represents root of subtree, and each subtree

can be formed cluster , , where is a set of clusters, , where . We obtain a root node from front of root set , and traverse all descendant

nodes by following the root node of subtree. After traversing the subtree, the

root node and its all descendant nodes can be form a new cluster . And then remove root node from , . We can find next cluster

follows above procedure. This is an iterative process until . Finally, all nodes are

partitioned into disjoint clusters.

2.3Noise Generation

After

clustering all data points, the next step is generating noise based on centroid

of these groups. Our scheme is based on Huffman coding which proposed by

Huffman in 1952 9. Huffman coding algorithm is popular on data compression

technique 710. We can identify distinct data point by building Huffman

coding tree. Because Huffman coding has some features, such as 1) each

character has a corresponding Huffman code; 2) the character with higher

probability has shorter Huffman code. In the contrast, the character with lower

probability has longer Huffman code. These features can be used on generating

noise to preserve privacy in database. There is longer noise to be injected to

original data with lower probability, easy reveal privacy, to confuse

adversary. In the other words, the data with high probability means not easy

reveal privacy for personal.

2.4Noise Injection procedure

As

mentioned above, the noise is built by Huffman coding tree based on the

probability of original value, then it is converted to a set, we called the

noise matrix in this paper. Each data point is the original value v may

correspond to a noise value r in the noise matrix. This method can simplify the

process of the original value of the disturbance, and the noise injection

process easier and more intuitive. After the building noise matrix, we describe

the process of noise injection. We put the sting of noise into the queue, and

add ‘1’ sequentially to original data by the function of least significant bit

(LSB) until the queue is empty. Due to the use of the LSB function disturb the

original value, the data distortion may be significantly reduced.

3.Results

We consider running

time of noise generation that calculates per unit time in milliseconds. In

order to estimate the precise time, we obtain the average of 61 times of running

time of noise generation. Our experiments were conducted to explore time

changes between unclustering and clustering. Which the unclustering scheme has

not include MST clustering technique. Moreover, the clustering scheme has

various k which are group size preset by data protector. We also discuss the

time changes of instances from 10 to 1,000.

The

experimental results show the running time of noise generation will be slower

when the records are increased. The running time of noise generation of

clustering scheme is faster than unclustering scheme. In the experiments, we

also find a noise in the running time in clustering scheme, but overall the

growing of time is very smooth. In addition to running time examining, we also

explore the data quality after noise injection procedure. The measure of data

quality is follow Domingo-Ferrer proposed in 2002. The experimental results

show less information loss in unclustering scheme. But all of the information

loss results are not exceed to 50 percentages. However, it is a trade-off

between minimum information loss and value disclosure risk. Summary, our

proposed scheme is an efficient to generate noise to preserve privacy in

database.

4.Conclusion

Due to

cloud computing technique becomes very popular in these years, and technology

brings convenience. It leads to the inside data in the organization increase

rapidly. The concept of database as a service has been proposed in 2002.

However, it may reveal personal privacy when all the data publish to the

third-party service provider, but it cannot be trusted. Another scenario is

when dealer collects transaction data about personal and publishes to the data

analysis company for some research or marketing purposes, but the company is

malicious. It also has leakage crisis. For these reason, how to preserve

privacy in database becomes more important in recently years. Although security

issues in database are a huge problem, this paper only concerns prediction

disclosure issue that adversary is able to predict the confidential value of an

individual. We present an efficient noise generation scheme which relies on

Huffman coding algorithm. We also build a noise matrix that can add intuitively

noise to original value. Moreover, records in the dataset are partitioned to

disjoint cluster before generating noise. Our scheme can only be used in

numerical database or statistical database. In the future, we will consider

non-numeric values and propose a conversion mechanism. The mechanism is

adaptive to our scheme and can be converted between non-numeric value and

numeric value. When all non-numeric can be converted to numeric value, it can

adapt to our scheme and extend this study.