
Fuzzy clustering-based microaggregation to achieve probabilistic k-anonymity for data with constraints

Abstract

Microaggregation is an effective data-driven protection method that permits us to achieve a good trade-off between disclosure risk and information loss. In this work we propose a method for microaggregation based on fuzzy c-means which is appropriate when there are constraints (linear constraints) on the variables that describe the data. Our method leads to results that satisfy these constraints even when the data to be masked do not satisfy them.

1. Introduction

Microaggregation is an effective masking method for statistical disclosure control. Given a dataset X, it permits us to build a masked dataset X′ = ρ(X) with a good compromise between disclosure risk and information loss. Roughly speaking, microaggregation builds small clusters and replaces the original values by the centers of the clusters. In order to avoid disclosure, all clusters must have at least a predefined number of records. Then, in order to achieve an acceptable information loss, it is usual to define a partition of the variables and microaggregate each subset of variables independently.

The literature on microaggregation is vast. Microaggregation was proposed in [4]. Then, [5] defined a heuristic method, and [30] defined an approach for categorical data. Hansen and Mukherjee [9] defined a polynomial algorithm for univariate microaggregation. [21] proved that multivariate microaggregation is an NP-hard problem. Since then, different methods have been proposed to improve the effectiveness. See e.g. [13, 17, 18, 22, 24, 26]. The problem of selecting a good partition of the variables is discussed in [19]. The use of microaggregation to achieve k-anonymity [27] is discussed in [7].

The transparency principle (see e.g., [35]) establishes that when data is published we need to include information on all processes applied to the data. This naturally includes how data has been protected. When users know this information, they can use it to compute statistics from the data with better accuracy. Nevertheless, intruders can use this information to attack the data. Transparency attacks for microaggregation have been studied in [20, 36, 37]. It has been proven that very effective attacks can be built for multivariate microaggregation.

In order to define a microaggregation method resistant to transparency attacks, fuzzy microaggregation was introduced in [6]. The method first builds a fuzzy partition where all clusters have at least k elements, and then data is replaced by cluster centers. As elements can belong to several clusters in fuzzy clustering, the assignment to a cluster center was done at random. In this way, given a record, intruders do not know for sure which cluster has been used in the replacement. This increases intruders' uncertainty and decreases disclosure risk. Another fuzzy clustering based approach was proposed in [12] in terms of an optimization problem in line with [8]. In recent papers [33, 34] a simpler algorithm for fuzzy microaggregation was introduced.

Edit constraints correspond to restrictions established on the metadata, or the schema, of the database. They establish how some variables relate to each other. For example, we may have three variables in the database, net, tax, and gross, and a constraint that specifies net + tax = gross. It is usual that data is edited before publication so that it satisfies the edit constraints. Nevertheless, when data is protected by means of a masking method it is possible that the resulting masked file violates the constraints. In this case the file has to be edited again, but this may not be so easy. Note that if the masked file is such that means are the same as in the original file, and we have added random noise that has caused some ages to be negative, replacing these negative values by zero would change the mean. The combination of edit constraints and masking methods has been studied by Shlomo and De Waal [28, 29], by Torra et al. [2, 3, 31], and later by Kim et al. [10, 11].

In this paper we study the problem of defining a method for microaggregation that, following [34], is resistant to transparency attacks and at the same time is able to deal with edit constraints. The approach is based on fuzzy c-means. We introduce two variations of this algorithm to deal with linear constraints. The algorithms are able to build a masked data set that satisfies the linear constraints even in the case that the original data does not satisfy them. Note that this is an important property, as we can then effectively combine data editing and data masking in a single step.

The structure of the paper is as follows. In Section 2 we review clustering and fuzzy c-means. In Section 3, we introduce two methods for achieving constrained fuzzy c-means and use them to define constrained microaggregation. We study the properties of the approaches. In Section 4 we give an example. The paper finishes with some conclusions and lines for future research.

2. Preliminaries

This section is divided into three parts and describes three topics that are needed later. We begin with a review of fuzzy clustering, which is used in our approach. Then we review how fuzzy clustering is used in microaggregation. Finally, we give a short summary of edit constraints and, more particularly, of linear constraints.

2.1. Fuzzy clustering

Clustering is an approach in statistics and machine learning (more specifically, in unsupervised machine learning) to extract relevant structures and patterns from data. There is a large number of methods for clustering that depend on different assumptions on the underlying model of the data and the type of structure built from the data (i.e., dendrograms, partitions, fuzzy partitions,...). In this paper, following [34] we will use a fuzzy clustering algorithm.

Fuzzy clustering algorithms [1, 14, 15] typically build a fuzzy partition from the data. Recall that in a fuzzy set, membership to the set is partial instead of Boolean. This is usually represented by a membership function over the reference set u : X → [0, 1], where u(x) = 0 means no membership and u(x) = 1 means total membership. Then, fuzzy clustering builds partitions in which elements may belong to more than one cluster with non-null membership. We will use fuzzy partitions in the sense of Ruspini [1, 25], which can be interpreted in terms of probability distributions because the memberships of an element to all clusters add to one (this is not necessarily the case in other types of fuzzy partitions).

Fuzzy c-means [1] is one of the most well known algorithms of this type. This method is defined in terms of an objective function to be minimized. It is a generalization of k-means resulting in a fuzzy partition instead of a crisp (standard) one. This is achieved by means of replacing the objective function of k-means by another one in which memberships are involved, and including a parameter m that controls the fuzziness of the solution. Entropy-based fuzzy c-means is another fuzzy clustering method. Fuzziness is achieved by means of requiring membership functions to optimize the entropy. In this case, a parameter λ is used to control the degree of fuzziness.

The formulation of fuzzy c-means [1] follows. In this formulation, $x_k$ for k = 1, …, N represent the records to be clustered, $u_{ki}$ the membership degree of record k in the ith cluster, and $v_i$ the cluster center of the ith cluster. Fuzzy c-means finds $u_{ki}$ and $v_i$ given $x_k$ and the parameter m by solving the following optimization problem:

$$\min J(u,v) = \sum_{i=1}^{c}\sum_{k=1}^{N}(u_{ki})^m \|x_k - v_i\|^2$$

subject to $\sum_{i=1}^{c} u_{ki} = 1$ for all k = 1, …, N, and $u_{ki} \geq 0$ for all i = 1, …, c and all k = 1, …, N.

In this definition, m is such that m ≥ 1. When m = 1 the problem is equivalent to the one of k-means and solutions are crisp (i.e., elements $x_k$ are only assigned to one cluster and $u_{ki} \in \{0, 1\}$ instead of being in [0,1]). In contrast, when m is large (e.g., m > 2.5), solutions tend to be extremely fuzzy and membership values $u_{ki}$ tend to be equal to 1/c.

Fuzzy c-means is usually solved using an iterative algorithm that interleaves two optimization problems. One that optimizes u given cluster centers vi, and another that optimizes v given memberships uki. That is, the following algorithm is used to find the membership functions uki and the cluster centers v:

1. Generate c initial values for the cluster centers v = (v1, …, vc).

2. Calculate

   $$\hat{u} = \arg\min_u J(u, v)$$

3. Calculate

   $$\hat{v} = \arg\min_v J(u, v)$$

4. Iterate the last two steps until convergence.

The expressions for computing $\hat{u}$ and $\hat{v}$ in Steps 2 and 3 above are given in the next proposition. See e.g. [1] for details.

Proposition 1. The alternating optimization algorithm for fuzzy c-means uses the following expressions for computing the new $u_{ki}$ and $v_{is}$.

  • The solution of $\hat{u} = \arg\min_u J(u,v)$ given v is:

    $$u_{ki} = \left(\sum_{j=1}^{c}\left(\frac{\|x_k - v_i\|^2}{\|x_k - v_j\|^2}\right)^{\frac{1}{m-1}}\right)^{-1}$$

  • The solution of $\hat{v} = \arg\min_v J(u,v)$ given u is:

    $$v_{is} = \frac{\sum_{k=1}^{N}(u_{ki})^m x_{ks}}{\sum_{k=1}^{N}(u_{ki})^m} \tag{1}$$
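For concreteness, the following is a minimal sketch of this alternating optimization in Python (the function and variable names are ours, not from [1]); it implements Steps 1 to 4 above with the update expressions of Proposition 1.

    import numpy as np

    def fcm(X, c, m=2.0, iters=100, tol=1e-6, rng=None):
        # Plain fuzzy c-means by alternating optimization.
        # X: (N, t) data matrix; c: number of clusters; m > 1: fuzziness.
        rng = np.random.default_rng(rng)
        N = X.shape[0]
        # Step 1: initialize the centers with c randomly chosen records.
        v = X[rng.choice(N, size=c, replace=False)].astype(float)
        for _ in range(iters):
            # Step 2: membership update; d2[k, i] = ||x_k - v_i||^2.
            d2 = ((X[:, None, :] - v[None, :, :]) ** 2).sum(axis=2)
            d2 = np.maximum(d2, 1e-12)           # avoid division by zero
            u = 1.0 / d2 ** (1.0 / (m - 1.0))    # proportional memberships
            u /= u.sum(axis=1, keepdims=True)    # rows now add to one
            # Step 3: center update (Equation 1).
            w = u ** m
            v_new = (w.T @ X) / w.sum(axis=0)[:, None]
            # Step 4: iterate until convergence.
            if np.abs(v_new - v).max() < tol:
                return u, v_new
            v = v_new
        return u, v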

This iterative approach converges to a local optimum. It can be proven that the local optimum obtained when m is large satisfies the following properties.

Proposition 2. For large values of m, the iterative algorithm for fuzzy c-means with the equations in Proposition 1 leads to

  • memberships $u_{ki} = 1/c$ for all objects $x_k \in X$ and clusters i, and

  • cluster centers $v_i = \bar{X}$ for all clusters i (i.e., all centers coincide at the data mean)

There have been several extensions of this method as well as alternative approaches for defining fuzzy clustering. One of them is entropy-based fuzzy c-means (EFCM) [16], which is defined similarly to fuzzy c-means. The function to be minimized is a regularized version of $\sum\sum u_{ki}\|x_k - v_i\|^2$. Instead of adding m as in fuzzy c-means, a term based on the entropy is introduced in the optimization problem. This term is combined using a parameter λ. This parameter λ ≥ 0 is used to control the fuzziness of the solution. Formally, entropy-based fuzzy c-means is defined in terms of the following optimization problem.

$$J_{EFCM}(u,v) = \sum_{k=1}^{N}\sum_{i=1}^{c}\left\{u_{ki}\|x_k - v_i\|^2 + \lambda^{-1} u_{ki}\log u_{ki}\right\} \tag{2}$$

This objective function is subject to the constraints $u_{ki} \in [0, 1]$ and $\sum_{i=1}^{c} u_{ki} = 1$ for all k, as in the case of fuzzy c-means.

EFCM is also solved using an iterative algorithm. The expressions for u and v are as follows (see [16] for details):

$$u_{ki} = \frac{e^{-\lambda\|x_k - v_i\|^2}}{\sum_{j=1}^{c} e^{-\lambda\|x_k - v_j\|^2}} \tag{3}$$

$$v_i = \frac{\sum_{k=1}^{N} u_{ki} x_k}{\sum_{k=1}^{N} u_{ki}} \tag{4}$$

The parameter λ plays a role similar to m in fuzzy c-means. Here, the smaller the λ, the fuzzier the solution.
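A sketch of one EFCM iteration in the same style and with the same assumptions as the FCM sketch above (the underflow guard is our addition; it does not change the normalized memberships):

    def efcm_step(X, v, lam):
        # One EFCM iteration: membership update (Eq. 3), center update (Eq. 4).
        d2 = ((X[:, None, :] - v[None, :, :]) ** 2).sum(axis=2)
        # Subtracting the row minimum keeps exp() from underflowing for
        # large lambda; the factor cancels in the normalization below.
        e = np.exp(-lam * (d2 - d2.min(axis=1, keepdims=True)))
        u = e / e.sum(axis=1, keepdims=True)           # Equation 3
        v_new = (u.T @ X) / u.sum(axis=0)[:, None]     # Equation 4
        return u, v_new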

Proposition 3. When λ tends to zero, the iterative algorithm for EFCM using Equations 3 and 4 leads to

  • memberships $u_{ki} = 1/c$ for all objects $x_k \in X$ and clusters i, and

  • cluster centers $v_i = \bar{X}$ for all clusters i

When λ tends to infinity, the second term becomes negligible and the algorithm yields a crisp solution.

FCM and EFCM lead to different clusters for the same data. An important difference concerns the membership values. For example, a record that coincides with a centroid has membership equal to one in FCM, while in EFCM a lower membership is possible.

2.2. Fuzzy microaggregation for data privacy

Microaggregation is about building small clusters (with at least k records) and replacing each record in the set by its cluster center. When the whole set is microaggregated at once (considering all the variables at the same time), the resulting file is such that there are sets of at least k indistinguishable records. Note that all records that belong to the same cluster are replaced by the same cluster center.

As processing the file in this way causes a high information loss, it is usual to consider a partition of the variables, and apply microaggregation to each subset of variables. Proceeding in this way, records that are indistinguishable with respect to one set of variables (because they belong to the same cluster for these variables) may be distinguishable with respect to another set (because they belong to different clusters for these other variables). As a consequence, the protected file no longer contains sets of k indistinguishable records.

Nevertheless, this approach can be attacked effectively (see e.g. [20]). An intruder can exploit the fact that records are usually masked by replacing values with nearby cluster centers. Attacks are especially effective for optimal univariate microaggregation, as in this case we know with certainty which clusters can be used for replacement.

To avoid this type of (transparency) attacks, we proposed in [34] an approach for microaggregation based on fuzzy microaggregation. Algorithm 1 reproduces this method. The approach is based on computing fuzzy clusters and then replacing values by cluster centers using the membership values as a probability distribution. In this way, intruders cannot know which clusters have been used. The algorithm uses two parameters m1 and m2 in relation to fuzzy c-means. The first one is used to compute the fuzzy clusters, and the second one to build the probability distribution.

It can be proven that the larger the m1, the larger the information loss. Similarly, the larger the m2, the larger the information loss. In addition, we can also prove that for large values of m2, and with c = |X|/k, the expected size of all clusters is k. Therefore, for large values of m2 when microaggregation is applied to the whole file, we have probabilistic k-anonymity.

Algorithm 1. Fuzzy Microaggregation with parameters c, m1, and m2

Step 1: Apply fuzzy c-means with a given c and a given m := m1.

Step 2: For each xj in X, compute memberships u to all clusters i = 1, …, c for a given m2, using:

$$u_{ij} = \left(\sum_{r=1}^{c}\left(\frac{\|x_j - v_i\|^2}{\|x_j - v_r\|^2}\right)^{\frac{1}{m_2-1}}\right)^{-1}$$

Step 3: For each xj, draw a random value χ in [0, 1] and use it to assign xj to a cluster according to the probability distribution u1j, …, ucj (i.e., the membership values computed in Step 2).
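As a compact sketch, Algorithm 1 can be wired up as follows, reusing the fcm function from Section 2.1 (the wiring and names are ours):

    def fuzzy_microaggregation(X, c, m1, m2, rng=None):
        # Algorithm 1: mask X by fuzzy c-means plus randomized replacement.
        rng = np.random.default_rng(rng)
        # Step 1: fuzzy c-means with m = m1 gives the cluster centers.
        _, v = fcm(X, c, m=m1, rng=rng)
        # Step 2: recompute memberships with fuzziness parameter m2.
        d2 = ((X[:, None, :] - v[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)
        u = 1.0 / d2 ** (1.0 / (m2 - 1.0))
        u /= u.sum(axis=1, keepdims=True)
        # Step 3: draw one cluster per record, using its memberships as a
        # probability distribution, and publish the corresponding center.
        idx = [rng.choice(c, p=u[k]) for k in range(X.shape[0])]
        return v[idx]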

2.3. Constraints and linear constraints

A problem so far not much considered in the data privacy literature is that of data with constraints. To the best of our knowledge, only the three groups mentioned in the introduction (Shlomo and De Waal, Kim et al., and ourselves) have considered this problem.

By data with constraints we refer to the fact that some databases include metadata expressing that some of the variables in the data set should satisfy certain constraints. Some of these constraints are enforced when data is introduced, and others are checked once all the data are available.

When data are processed to ensure privacy, they are modified, and if such constraints are not taken into consideration, inconsistent data are generated. The usual approach is to proceed as follows: (i) protect the dataset ignoring the constraints, then (ii) check the constraints, and finally (iii) if constraints are violated, modify the data again so that it is compliant with the constraints.

Such an approach can lead to inconsistencies or to a larger information loss than necessary. Recall the example in the introduction. Protection with random noise is known to preserve means. However, if variables are required to be positive, simply setting all negative values to zero can introduce a relevant positive bias into the data. In addition, when data is first edited and then protected, we modify the original data twice. This adds extra noise to the data.

The results introduced in this paper allow us to edit and protect in a single step. In this way, the noise introduced into the data for data protection is used at the same time to resolve the inconsistencies of the data with respect to the constraints.

Constraints on the variables are known as edit constraints in the field of data privacy (in particular, in the subfield of statistical disclosure control). There are different types of constraints: e.g., constraints on the possible values, linear constraints, or one variable that governs the values of another (e.g., sex = male implies number of pregnancies = 0). See [23, 28, 31] for details on the classification of the constraints. In this paper, we will consider linear constraints. Note that some other types of equality constraints can be transformed into linear ones. Naturally, we have a linear constraint when a variable can be expressed as a linear combination of a set of other variables. For example, the following relation between family income, person income, and other persons' income should hold:

EC-LC: person income + other person income = family income

In their most general form, linear constraints can be expressed as $V = \sum_{s}\alpha_s V_s + A$, for some values $\alpha_s$, variables $V_s$, a dependent variable V, and a constant A. In this paper we will rewrite them, equivalently, as $\sum_{s}\alpha_s V_s = A$ or, in vector form with $v = (V_1, \ldots, V_t)$ and $\alpha = (\alpha_1, \ldots, \alpha_t)$, as $\alpha \cdot v = A$. Here · represents the inner product.
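For instance, with v = (person income, other person income, family income), the constraint EC-LC above becomes $\alpha \cdot v = A$ with $\alpha = (1, 1, -1)$ and $A = 0$, since person income + other person income − family income = 0.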

3. Constrained fuzzy clustering for data masking

As our goal is that the masked data satisfies the linear constraints (independently of whether the original data satisfies them or not), we investigate in this section clustering algorithms that lead to cluster centers that satisfy linear constraints.

We use again $x_k$ for k = 1, …, N to represent the set of elements to be clustered. Each element $x_k$ belongs to a t-dimensional space (i.e., $x_k \in \mathbb{R}^t$). Elements x will be clustered into c different clusters. Then, we need to consider cluster centers $v_i$ for i = 1, …, c, with $v_i$ also represented in a t-dimensional space. For both $x_k$ and $v_i$ we will use a second subindex to denote the sth component. In other words, $x_{ks}$ represents the sth component of the kth object and $v_{is}$ represents the sth component of the ith cluster center.

As stated above, linear constraints on the cluster centers are represented in terms of a vector $\alpha = (\alpha_1, \ldots, \alpha_t)$ and the inner product ·, by means of the equation $\alpha \cdot v_i = A$ for all clusters i = 1, …, c. Of course, this means that $\sum_{s=1}^{t}\alpha_s v_{is} = A$.

3.1. Constrained FCM

We start by considering the formalization of fuzzy c-means when we require the cluster centers to satisfy linear constraints. Using the notation above, constrained fuzzy c-means with linear constraints corresponds to the following problem:

$$\min CF(u,v) = \sum_{i=1}^{c}\sum_{k=1}^{N}(u_{ki})^m\|x_k - v_i\|^2$$

subject to $\sum_{i=1}^{c}u_{ki} = 1$ for all k = 1, …, N; $\alpha \cdot v_i = A$ for all i = 1, …, c; and $u_{ki} \geq 0$ for all i = 1, …, c and all k = 1, …, N.

Note that this problem is the same we saw before for fuzzy c-means, adding a linear constraint for each cluster center. To solve this optimization problem we can apply the same alternating optimization algorithm using new expressions for u and v. The following proposition gives these expressions.

Proposition 4. The alternating optimization algorithm for fuzzy c-means with linear constraints uses the following expressions for computing $u_{ki}$ and $v_{is}$.

  • The solution of $\hat{u} = \arg\min_u CF(u,v)$ given v is:

    $$u_{ki} = \left(\sum_{j=1}^{c}\left(\frac{\|x_k - v_i\|^2}{\|x_k - v_j\|^2}\right)^{\frac{1}{m-1}}\right)^{-1}$$

  • The solution of $\hat{v} = \arg\min_v CF(u,v)$ given u is:

    $$v_{is} = \frac{\sum_{k=1}^{N}(u_{ki})^m x_{ks} - \alpha_s \frac{\sum_{k=1}^{N}(u_{ki})^m\left[\sum_{r=1}^{t}\alpha_r x_{kr} - A\right]}{\sum_{r=1}^{t}\alpha_r\alpha_r}}{\sum_{k=1}^{N}(u_{ki})^m} \tag{5}$$

We now present the proof of this proposition. We considered a simpler version of this problem in [32].

Proof. The solution of this problem is based on Lagrange multipliers. We consider two sets of multipliers: one set for the constraints $\sum_{i=1}^{c}u_{ki} = 1$ and the other for the constraints $\alpha \cdot v_i = A$. These multipliers are called, respectively, $\lambda_k$ and $\nu_i$. Then, the expression to be minimized is:

$$L = \sum_{i=1}^{c}\sum_{k=1}^{N}(u_{ki})^m\|x_k - v_i\|^2 + \sum_{k=1}^{N}\lambda_k\left(\sum_{i=1}^{c}u_{ki} - 1\right) + \sum_{i=1}^{c}\nu_i\left(\alpha \cdot v_i - A\right).$$

In order to compute an expression for uki, let us now consider the derivative of L with respect to uki. We obtain

$$\frac{\partial L}{\partial u_{ki}} = m(u_{ki})^{m-1}\|x_k - v_i\|^2 + \lambda_k.$$

Setting this expression equal to zero, and assuming that there is no $x_k = v_j$, we obtain the following expression for $u_{kj}$ (for convenience we use j instead of i):

$$u_{kj} = \left(\frac{-\lambda_k}{m\|x_k - v_j\|^2}\right)^{\frac{1}{m-1}}. \tag{6}$$

In order to eliminate $\lambda_k$ from this expression, let us consider the constraint $\sum_{i=1}^{c}u_{ki} = 1$. Replacing $u_{ki}$ by Equation 6 we obtain:

$$1 = \sum_{i=1}^{c}u_{ki} = \sum_{i=1}^{c}\left(\frac{-\lambda_k}{m\|x_k - v_i\|^2}\right)^{\frac{1}{m-1}} = (-\lambda_k)^{\frac{1}{m-1}}\sum_{i=1}^{c}\left(\frac{1}{m\|x_k - v_i\|^2}\right)^{\frac{1}{m-1}}. \tag{7}$$

Now, if we compute the quotient between Equation 6 and Equation 7 we obtain the following:

$$u_{kj} = \frac{\left(\frac{-\lambda_k}{m\|x_k - v_j\|^2}\right)^{\frac{1}{m-1}}}{(-\lambda_k)^{\frac{1}{m-1}}\sum_{i=1}^{c}\left(\frac{1}{m\|x_k - v_i\|^2}\right)^{\frac{1}{m-1}}} = \left(\sum_{i=1}^{c}\left(\frac{\|x_k - v_j\|^2}{\|x_k - v_i\|^2}\right)^{\frac{1}{m-1}}\right)^{-1}. \tag{8}$$

This is a valid solution because it satisfies $u_{ki} \geq 0$. As the objective function is convex, this is the unique solution of the problem.

We now consider the derivative of L with respect to $v_i$ in order to compute an expression for the latter. As $v_i$ is a vector, we consider one of its components, $v_{is}$. We decompose L above into $L_1 + L_2 + L_3$ and make $v_{is}$ explicit in these components:

$$L_1 = \sum_{i=1}^{c}\sum_{k=1}^{N}(u_{ki})^m\|x_k - v_i\|^2 = \sum_{i=1}^{c}\sum_{k=1}^{N}\sum_{s=1}^{t}(u_{ki})^m(x_{ks} - v_{is})^2$$

$$L_2 = \sum_{k=1}^{N}\lambda_k\left(\sum_{i=1}^{c}u_{ki} - 1\right)$$

and

$$L_3 = \sum_{i=1}^{c}\nu_i(\alpha \cdot v_i - A) = \sum_{i=1}^{c}\nu_i\sum_{s=1}^{t}\alpha_s v_{is} - \sum_{i=1}^{c}\nu_i A$$

It is easy to see that the derivative of $L_1$ with respect to $v_{is}$ is $\sum_{k=1}^{N}(u_{ki})^m 2(x_{ks} - v_{is})(-1)$, the derivative of $L_2$ with respect to $v_{is}$ is zero, and that of $L_3$ with respect to $v_{is}$ is $\nu_i\alpha_s$. Therefore, it follows that

$$\frac{\partial L}{\partial v_{is}} = \sum_{k=1}^{N}(u_{ki})^m 2(x_{ks} - v_{is})(-1) + 0 + \nu_i\alpha_s.$$

Setting this expression equal to zero we obtain

$$0 = \sum_{k=1}^{N}(u_{ki})^m 2(x_{ks} - v_{is})(-1) + \nu_i\alpha_s.$$

Therefore, we can say that vis should be as follows.

$$v_{is} = \frac{\sum_{k=1}^{N}(u_{ki})^m 2x_{ks} - \nu_i\alpha_s}{\sum_{k=1}^{N}(u_{ki})^m 2}. \tag{9}$$

We study the case of αs ≠ 0 (otherwise the variable s is not in the linear constraint). In this case, let us consider the equation A = α · vi, which corresponds to

$$A = \sum_{s=1}^{t}\alpha_s v_{is} = \sum_{s=1}^{t}\alpha_s\frac{\sum_{k=1}^{N}(u_{ki})^m 2x_{ks} - \nu_i\alpha_s}{\sum_{k=1}^{N}(u_{ki})^m 2}$$

If the ith cluster contains at least one element with non-null membership, we can obtain by algebraic transformation the following expression for $\nu_i$:

$$\nu_i = \frac{\sum_{s=1}^{t}\alpha_s\left(\sum_{k=1}^{N}(u_{ki})^m 2x_{ks}\right) - 2\sum_{k=1}^{N}(u_{ki})^m A}{\sum_{s=1}^{t}\alpha_s\alpha_s}$$

We use this expression for $\nu_i$ in Equation 9 and, with further algebraic manipulation, we obtain Equation 5 in the proposition. □
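Since the membership update is unchanged, only the center update needs new code. The following is a sketch of Equation 5 in the style of the previous sections (the names are ours): it computes the unconstrained centers of Equation 1 and then applies the per-cluster correction along α.

    def constrained_fcm_centers(X, u, m, alpha, A):
        # Center update of Proposition 4 (Equation 5): the returned centers
        # satisfy alpha . v_i = A for every cluster i.
        w = u ** m                                 # (N, c) weights
        denom = w.sum(axis=0)[:, None]             # (c, 1)
        v = (w.T @ X) / denom                      # unconstrained update (Eq. 1)
        resid = X @ alpha - A                      # per-record constraint residuals
        corr = (w * resid[:, None]).sum(axis=0) / (alpha @ alpha)
        return v - np.outer(corr / denom[:, 0], alpha)

Alternating this update with the (unchanged) membership update gives the constrained FCM iteration.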

Note that for variables not involved in the linear constraint (i.e., variables s such that $\alpha_s = 0$), $v_{is}$ will be computed in the same way as in standard fuzzy c-means. This is so because Equation 5 reduces to $v_{is} = \sum_{k=1}^{N}(u_{ki})^m x_{ks} / \sum_{k=1}^{N}(u_{ki})^m$ (Equation 1). Because of this result, we do not need to treat separately those variables in a database that take part in linear constraints and those that do not.

This algorithm will produce cluster centers that satisfy the linear constraints. We can also prove that when the data satisfy these constraints, the expression for $v_{is}$ above reduces to the one of standard fuzzy c-means (Equation 1).

These two properties are established in the following proposition.

Proposition 5. The solution of Proposition 4 is such that

1. When $\alpha_s = 0$ (i.e., variable s is not involved in the linear constraint), Equation 5 reduces to Equation 1 for all i = 1, …, c.

2. When the data satisfy the linear constraints, Equation 5 reduces to Equation 1 for all i = 1, …, c and all s = 1, …, t.

Proof. To prove 1 it is enough to replace $\alpha_s$ by zero in Equation 5. To prove 2, observe that when data satisfy the linear constraints, then for all k = 1, …, N it holds that

$$\sum_{r=1}^{t}\alpha_r x_{kr} - A = 0,$$

so Equation 5 reduces to Equation 1. □
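Both properties can also be checked numerically; the following quick sanity check is a sketch assuming the constrained_fcm_centers function above and hypothetical random data:

    rng = np.random.default_rng(1)
    alpha = np.array([1.16, 1.07, -1.0]); A = 0.0
    u = rng.random((20, 3)); u /= u.sum(axis=1, keepdims=True)

    X = rng.random((20, 3)) * 100              # arbitrary data
    v = constrained_fcm_centers(X, u, m=2.0, alpha=alpha, A=A)
    print(np.allclose(v @ alpha, A))           # True: centers satisfy the constraint

    X[:, 2] = 1.16 * X[:, 0] + 1.07 * X[:, 1]  # now the data satisfy it too
    w = u ** 2.0
    v_plain = (w.T @ X) / w.sum(axis=0)[:, None]   # Equation 1
    v_con = constrained_fcm_centers(X, u, m=2.0, alpha=alpha, A=A)
    print(np.allclose(v_plain, v_con))         # True: Eq. 5 reduces to Eq. 1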

3.2. Constrained EFCM

We focus now on entropy-based fuzzy c-means, for which we obtain similar results. Considering the linear constraints for all cluster centers, we define the following optimization problem:

$$\min CE(u,v) = \sum_{i=1}^{c}\sum_{k=1}^{N}u_{ki}\|x_k - v_i\|^2 + \lambda^{-1}\sum_{i=1}^{c}\sum_{k=1}^{N}u_{ki}\log u_{ki}$$

subject to $\sum_{i=1}^{c}u_{ki} = 1$ for all k = 1, …, N; $\alpha \cdot v_i = A$ for all i = 1, …, c; and $u_{ki} \geq 0$ for all i = 1, …, c and all k = 1, …, N.

For this problem, the following result is obtained.

Proposition 6. The alternating optimization algorithm for this optimization problem leads to the following two expressions:

$$u_{ki} = \frac{e^{-\lambda\|x_k - v_i\|^2}}{\sum_{j=1}^{c}e^{-\lambda\|x_k - v_j\|^2}} \tag{10}$$

$$v_{is} = \frac{\sum_{k=1}^{N}u_{ki}x_{ks} - \alpha_s\frac{\sum_{k=1}^{N}u_{ki}\left(\sum_{r=1}^{t}\alpha_r x_{kr} - A\right)}{\sum_{r=1}^{t}\alpha_r\alpha_r}}{\sum_{k=1}^{N}u_{ki}} \tag{11}$$

Proof. The proof of this proposition follows the same schema as the one for fuzzy c-means. That is, we use Lagrange multipliers to define an objective function that includes subexpressions for each constraint:

$$L = \sum_{i=1}^{c}\sum_{k=1}^{N}u_{ki}\|x_k - v_i\|^2 + \lambda^{-1}\sum_{i=1}^{c}\sum_{k=1}^{N}u_{ki}\log u_{ki} + \sum_{k=1}^{N}\lambda_k\left(\sum_{i=1}^{c}u_{ki} - 1\right) + \sum_{i=1}^{c}\nu_i(\alpha \cdot v_i - A).$$

Differentiating L with respect to $u_{ki}$ and setting the result equal to zero, we obtain the following expression for $u_{ki}$:

$$u_{ki} = e^{-1-\lambda\lambda_k-\lambda\|x_k - v_i\|^2}.$$

Then, as $\sum_j u_{kj} = 1$, we can compute

$$u_{ki} = \frac{u_{ki}}{\sum_j u_{kj}} = \frac{e^{-1-\lambda\lambda_k-\lambda\|x_k - v_i\|^2}}{\sum_{j=1}^{c}e^{-1-\lambda\lambda_k-\lambda\|x_k - v_j\|^2}} = \frac{e^{-\lambda\|x_k - v_i\|^2}}{\sum_{j=1}^{c}e^{-\lambda\|x_k - v_j\|^2}},$$

which corresponds to the expression for $u_{ki}$ in Equation 10.

The proof of the expression for $v_{is}$ starts by differentiating L with respect to $v_{is}$. Setting this derivative equal to zero, we obtain the following expression for $v_{is}$:

$$v_{is} = \frac{2\sum_{k=1}^{N}u_{ki}x_{ks} - \nu_i\alpha_s}{2\sum_{k=1}^{N}u_{ki}}.$$

On the other hand, replacing this expression for $v_{is}$ in the equation $\sum_{s=1}^{t}\alpha_s v_{is} = A$, we obtain

$$\sum_{s=1}^{t}\alpha_s\frac{2\sum_{k=1}^{N}u_{ki}x_{ks} - \nu_i\alpha_s}{2\sum_{k=1}^{N}u_{ki}} = A,$$

from which we get the following expression for $\nu_i$:

$$\nu_i = \frac{2\sum_{r=1}^{t}\alpha_r\sum_{k=1}^{N}u_{ki}x_{kr} - 2A\sum_{k=1}^{N}u_{ki}}{\sum_{r=1}^{t}\alpha_r\alpha_r} = \frac{2\sum_{k=1}^{N}u_{ki}\left(\sum_{r=1}^{t}\alpha_r x_{kr} - A\right)}{\sum_{r=1}^{t}\alpha_r\alpha_r}. \tag{12}$$

Replacing this expression for $\nu_i$ in the expression for $v_{is}$ proves Equation 11. □
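A sketch of the corresponding center update, mirroring the constrained FCM code of Section 3.1 (Equation 11 is Equation 5 with the weights $(u_{ki})^m$ replaced by $u_{ki}$; names are ours):

    def constrained_efcm_centers(X, u, alpha, A):
        # Center update of Proposition 6 (Equation 11).
        denom = u.sum(axis=0)[:, None]
        v = (u.T @ X) / denom                      # unconstrained update (Eq. 4)
        resid = X @ alpha - A
        corr = (u * resid[:, None]).sum(axis=0) / (alpha @ alpha)
        return v - np.outer(corr / denom[:, 0], alpha)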

Note that the expression for $u_{ki}$ is the same as the one we had when no linear constraints were considered. In contrast, the expression for $v_{is}$ is different, as it includes terms corresponding to the constraints.

The next proposition establishes that the solution given in Proposition 6 corresponds to the standard solution in EFCM when data already satisfies the constraints.

Proposition 7. When the original data satisfy the linear constraints (i.e., $\sum_{r=1}^{t}\alpha_r x_{kr} = A$ for all k), the alternating optimization algorithm for constrained entropy-based FCM reduces to the one of standard entropy-based fuzzy c-means.

3.3. Constrained microaggregation

The application of these algorithms to microaggregation consists of replacing the fuzzy c-means in Step 1 of Algorithm 1 by one of the two clustering methods discussed in this paper (constrained FCM and constrained EFCM). They lead to solutions that satisfy the linear constraints. Algorithm 1 then has two parameters m1 and m2 for constrained FCM, and two parameters λ1 and λ2 for constrained EFCM.

Proposition 8. Constrained FCM applied according to Algorithm 1 satisfies the following properties:

  • The larger the m1, the larger the information loss. This follows from Proposition 2. Clusters will collapse, cluster centers will be $v_i = \bar{X}$, and all protected data will be $\bar{X}$.

  • The larger the m2, the larger the information loss. Proposition 2 shows that in this case all memberships tend to 1/c. Therefore, any cluster center can be used to replace a given value.

  • The smaller the number of clusters c, the larger the information loss. Naturally, when c = 1, all data are replaced by a single cluster center. Minimum loss is achieved when c = |X| and each record has its own cluster (with m1 and m2 near one).

  • For large m2, and c = |X|/k for a given k, the expected size of all clusters is k. That is, we have probabilistic k-anonymity. This follows from the uniform distribution with membership/probability 1/c and the number of clusters.

These properties show that with an appropriate selection of m1 and m2 information loss ranges from no information loss (i.e., c = |X| and m1 = m2 = 1, with the masked data equal to the original data) to maximum loss (i.e., c = 1). We can also obtain data that probabilistically satisfies k-anonymity (i.e., c = |X|/k and m2 large). In addition to these properties, the resulting file satisfies the given linear constraints (linear edit constraints) even in the case that the original file does not satisfy them.

We have detailed the properties for the case of the constrained FCM. These properties can also be inferred for constrained EFCM. In that case, values of λ1 and λ2 near to zero play the role of large m1 and m2.

Table 1

Original data set

Exp 16% (v1)    Exp 7% (v2)    Total (v3)
15              23             42.01
12              43             59.93
64              229            319.27
12              45             62.07
28              39             74.21
71              102            191.50
23              64             95.16
25              102            138.14
48              230            301.78
32              50             90.62
90              200            318.40
13              100            125.56
Table 2

Original data set with noise following N(0, 1.5)

Exp 16% (v1)    Exp 7% (v2)    Total (v3)
16.91695        26.67021       41.37696
15.48220        42.61481       60.60212
65.86964        228.47892      318.70371
12.97750        45.84617       60.80475
25.93508        38.55444       75.96227
72.14286        103.34332      191.54478
24.43550        65.84895       96.49401
24.56774        101.54299      137.49281
47.97780        226.75840      302.78913
28.43727        48.02995       89.97244
91.86226        197.98087      318.96431
11.64466        100.13359      127.12980

4. Experiments

For illustration, we have applied our approach to a small data set. We have considered the example in [31], where a linear constraint involving three variables was used. The data is reproduced in Tables 1 and 2. Variables v1, v2 and v3 stand for Expenditure at 16%, Expenditure at 7%, and Total Expenditure, respectively. These variables are subject to the following linear constraint: $V_3 = \alpha_1 V_1 + \alpha_2 V_2$; that is, $v_3 = 1.16 v_1 + 1.07 v_2$.

Table 1 contains the original data. In addition, we have also considered the same data set with Gaussian noise added (Table 2). We have considered three cases, with noise following N(0, 0.5), N(0, 1.0) and N(0, 1.5), and we have microaggregated the files and computed the cluster centers. We used c = 4, which corresponds to a microaggregation parameter of k = |X|/4 = 12/4 = 3. Table 2 shows the data with noise N(0, 1.5).
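As an illustration of how such an experiment can be wired up (a sketch assuming the fcm and constrained_fcm_centers functions above; the fuzziness m = 2 and the number of iterations are our assumptions, not stated in the text):

    # Data of Table 1; the constraint 1.16*v1 + 1.07*v2 = v3 is written as
    # alpha . v = A with alpha = (1.16, 1.07, -1) and A = 0.
    X_original = np.array([
        [15, 23, 42.01], [12, 43, 59.93], [64, 229, 319.27], [12, 45, 62.07],
        [28, 39, 74.21], [71, 102, 191.50], [23, 64, 95.16], [25, 102, 138.14],
        [48, 230, 301.78], [32, 50, 90.62], [90, 200, 318.40], [13, 100, 125.56]])
    alpha = np.array([1.16, 1.07, -1.0]); A = 0.0
    m = 2.0                                       # assumed fuzziness parameter
    rng = np.random.default_rng(0)
    X_noisy = X_original + rng.normal(0.0, 1.5, size=X_original.shape)

    _, v = fcm(X_noisy, c=4, m=m)                 # initialize from standard FCM
    for _ in range(100):                          # constrained FCM iterations
        d2 = np.maximum(((X_noisy[:, None] - v[None]) ** 2).sum(axis=2), 1e-12)
        u = 1.0 / d2 ** (1.0 / (m - 1.0))
        u /= u.sum(axis=1, keepdims=True)
        v = constrained_fcm_centers(X_noisy, u, m=m, alpha=alpha, A=A)

    print(v @ alpha)                              # entries equal A = 0 up to rounding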

Table 3

Centroids obtained for the original and noisy data

Data                      Cluster    Center
Original data             v1         13.44075    37.16236     55.35500
                          v2         67.32890    219.64071    313.11708
                          v3         27.59963    52.64698     88.34783
                          v4         37.11288    101.71213    151.88292
N(0, 0.5), no const.      v1         14.77033    37.46462     55.53745
                          v2         67.66607    219.68178    313.78662
                          v3         28.25678    52.81140     88.66240
                          v4         37.37631    101.05854    153.25147
N(0, 0.5), with const.    v1         14.21093    36.94862     56.01970
                          v2         67.74399    219.75366    313.71945
                          v3         28.04952    52.62022     88.84108
                          v4         37.96198    101.59877    152.74659
N(0, 1.0), no const.      v1         13.35356    37.43389     55.51220
                          v2         68.10518    219.87425    313.35174
                          v3         28.32082    52.62483     87.69683
                          v4         36.05573    102.03179    152.27787
N(0, 1.0), with const.    v1         13.34286    37.42402     55.52142
                          v2         67.80086    219.59354    313.61408
                          v3         27.83433    52.17608     88.11623
                          v4         36.48085    102.42393    151.91139
N(0, 1.5), no const.      v1         15.38410    38.46154     54.85374
                          v2         68.56108    217.71406    313.44706
                          v3         26.27168    52.15607     88.81411
                          v4         36.43006    101.70231    152.30233
N(0, 1.5), with const.    v1         14.00637    37.19070     56.04144
                          v2         68.88084    218.00900    313.17140
                          v3         27.11313    52.93224     88.08872
                          v4         36.83617    102.07690    151.95224
Table 4

Distance between original centroids and centroids of noisy data. For each noise level, the first row gives the distances to the cluster centers obtained using standard fuzzy c-means, and the second row the distances to the cluster centers obtained using constrained fuzzy c-means

Noise        d1           d2           d3           d4
N(0, 0.5)    1.3756566    0.75078064   0.7468825    1.5393139
N(0, 0.5)    1.0395613    0.74021727   0.6681398    1.2164527
N(0, 1.0)    0.3256416    0.8439299    0.97180617   1.1729174
N(0, 1.0)    0.3251812    0.6870117    0.57486784   0.95232975
N(0, 1.5)    2.3907604    2.3106573    1.4905896    0.8013973
N(0, 1.5)    0.8899032    2.25254      0.6206389    0.4630664

The centroids using standard fuzzy c-means applied to the original data set satisfy the constraints, but this is not so when the data set with noise is considered. In this case, however, the alternative expressions developed in this paper lead to appropriate cluster centers (i.e., cluster centers satisfying the constraints).

Table 3 displays the cluster centers obtained for the original data set without noise, and those obtained for the noisy data. For the three noisy data sets we include both the cluster centers obtained by standard fuzzy c-means and those obtained by the new method.

Comparing the cluster centers of the noisy data with those of the original data shows that the cluster centers are more similar to the original ones when constraints are considered. This is shown in Table 4, which gives the Euclidean distances between the four cluster centers obtained from each noisy data set (with each clustering algorithm) and the cluster centers obtained using fuzzy c-means on the original (noise-free) data. It can be seen that, for each noisy data set, the nearest cluster centers are the ones obtained when constraints are considered (the second row for each noise level).

These results permit us to show, on the one hand, the suitability of the method for combining data editing and data protection in a single step and, on the other hand, that our approach leads to protected data with less information loss. This is so because the cluster centers obtained with constrained FCM are more similar to the original cluster centers than those obtained using standard FCM.

5. Conclusions and future work

We have presented two variations of fuzzy c-means to deal with linear constraints and shown their use for data masking. We have given the solution of the optimization problem when linear constraints are imposed on the data. The solution leads to cluster centers that satisfy the constraints even when the data does not. This permits us to combine data editing and data masking in a single step. This approach is one of the first to combine these two steps. We have also given an example that shows how the method can be applied and that, for noisy data, we obtain cluster centers that are nearer to those of the noise-free data.

We have also discussed the effects of the parameters on information loss. Future work includes the appropriate selection of the parameters of the method (that is, m1 and m2 for constrained FCM, and λ1 and λ2 for constrained EFCM), and a comparison with other approaches, e.g. [10, 11]. We have seen, however, that with an appropriate selection we can range from no information loss (and, thus, maximum disclosure risk) to maximum loss (and, thus, no risk).

Acknowledgments

Partial support by the Swedish Research Council (Vetenskapsrådet) (grant number VR 2016-03346) is acknowledged.

References

[1] Bezdek J.C., Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981.

[2] Cano I., Navarro-Arribas G. and Torra V., A new framework to automate constrained microaggregation, Proc. CIKM-PAVLAD 2009 (2009), 1–8.

[3] Cano I. and Torra V., Edit constraints on microaggregation and additive noise, Proc. PSDML 2010 (2010), 1–14.

[4] Defays D. and Nanopoulos P., Panels of enterprises and confidentiality: The small aggregates method, Proc. of the 1992 Symposium on Design and Analysis of Longitudinal Surveys, Statistics Canada (1993), 195–204.

[5] Domingo-Ferrer J. and Mateo-Sanz J.M., Practical data-oriented microaggregation for statistical disclosure control, IEEE Trans. on Knowledge and Data Engineering 14(1) (2002), 189–201.

[6] Domingo-Ferrer J. and Torra V., Towards fuzzy c-means based microaggregation, in Grzegorzewski P., Hryniewicz O., Gil M.A. (Eds.), Soft Methods in Probability and Statistics (2002), 289–294.

[7] Domingo-Ferrer J. and Torra V., Ordinal, continuous and heterogeneous k-anonymity through microaggregation, Data Mining and Knowledge Discovery 11(2) (2005), 195–212.

[8] Endo Y., Hamasuna Y., Hirano T. and Kinoshita N., Even-sized clustering based on optimization and its variants, JACIII 22 (2018), 62–69.

[9] Hansen S. and Mukherjee S., A polynomial algorithm for optimal univariate microaggregation, IEEE Trans. on Knowledge and Data Engineering 15(4) (2003), 1043–1044.

[10] Kim H.J., Karr A.F. and Reiter J.P., Statistical disclosure limitation in the presence of edit rules, Journal of Official Statistics 31(1) (2015), 121–138.

[11] Kim H.J., Cox L.H., Karr A.F., Reiter J.P. and Wang Q., Simultaneous editing and imputation for continuous data, Journal of the American Statistical Association 110 (2015), 987–999.

[12] Kitajima K., Endo Y. and Hamasuna Y., Fuzzified even-sized clustering based on optimization, JACIII 22 (2018), 537–543.

[13] Laszlo M. and Mukherjee S., Iterated local search for microaggregation, Journal of Systems and Software 100 (2015), 15–26.

[14] Miyamoto S., Introduction to Fuzzy Clustering (in Japanese), Morikita, Japan, 1999.

[15] Miyamoto S., Ichihashi H. and Honda K., Algorithms for Fuzzy Clustering, Springer, 2008.

[16] Miyamoto S. and Mukaidono M., Fuzzy c-means as a regularization and maximum entropy approach, Proc. of the 7th International Fuzzy Systems Association World Congress (IFSA'97), June 25–30, Prague, Czech Republic, Vol. II (1997), 86–92.

[17] Mortazavi R. and Jalili S., Fast data-oriented microaggregation algorithm for large numerical datasets, Knowledge-Based Systems 67 (2014), 195–205.

[18] Mortazavi R. and Jalili S., Preference-based anonymization of numerical datasets by multi-objective microaggregation, Information Fusion 25 (2015), 85–104.

[19] Nin J., Herranz J. and Torra V., How to group attributes in multivariate microaggregation, Intl. J. of Uncertainty, Fuzziness and Knowledge-Based Systems 16(1) (2008), 121–138.

[20] Nin J., Herranz J. and Torra V., On the disclosure risk of multivariate microaggregation, Data and Knowledge Engineering 67 (2008), 399–412.

[21] Oganian A. and Domingo-Ferrer J., On the complexity of optimal microaggregation for statistical disclosure control, Statistical J. United Nations Economic Commission for Europe 18(4) (2000), 345–354.

[22] Oommen B.J. and Fayyoumi E., On utilizing dependence-based information to enhance microaggregation for secure statistical databases, Pattern Analysis and Applications 16 (2013), 99–116.

[23] Pierzchala M., A review of the state of the art in automated data editing and imputation, in Statistical Data Editing, Vol. 1, Conference of European Statisticians Statistical Standards and Studies (1994).

[24] Rebollo-Monedero D., Mezher A.M., Casanova-Colomé X., Forné J. and Soriano M., Efficient k-anonymous microaggregation of multivariate numerical data via principal component analysis, Information Sciences 503 (2019), 417–443.

[25] Ruspini E.H., A new approach to clustering, Information and Control 15 (1969), 22–32.

[26] Salari M., Jalili S. and Mortazavi R., TBM, a transformation based method for microaggregation of large volume mixed data, Data Mining and Knowledge Discovery 31 (2017), 65–91.

[27] Samarati P. and Sweeney L., Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression, SRI Intl. Tech. Rep. (1998).

[28] Shlomo N. and De Waal T., Preserving edits when perturbing microdata for statistical disclosure control, Conference of European Statisticians WP11 (2005).

[29] Shlomo N. and De Waal T., Protection of micro-data subject to edit constraints against statistical disclosure, Journal of Official Statistics 24(2) (2008), 229–253.

[30] Torra V., Microaggregation for categorical variables: a median based approach, Proc. PSD 2004, Lecture Notes in Computer Science 3050 (2004), 162–174.

[31] Torra V., Constrained microaggregation: adding constraints for data editing, Transactions on Data Privacy 1(2) (2008), 86–104.

[32] Torra V., On the definition of linear constrained fuzzy c-means, Proc. of EUROFUSE 2009 (2009).

[33] Torra V., A fuzzy microaggregation algorithm using fuzzy c-means, Proc. CCIA 2015, IOS Press (2015), 214–223.

[34] Torra V., Fuzzy microaggregation for the transparency principle, Journal of Applied Logic 23 (2017), 70–80.

[35] Torra V., Data Privacy: Foundations, New Developments and the Big Data Challenge, Springer, 2017.

[36] Torra V. and Miyamoto S., Evaluating fuzzy clustering algorithms for microdata protection, Proc. PSD 2004, Lecture Notes in Computer Science 3050 (2004), 175–186.

[37] Winkler W.E., Single ranking micro-aggregation and re-identification, Statistical Research Division report RR 2002/08 (2002).