Classification from positive and unlabeled data based on likelihood invariance for measurement

Yoshida, Takeshi; Washio, Takashi; Ohshiro, Takahito; Taniguchi, Masateru

doi:10.3233/IDA-194980

Classification from positive and unlabeled data based on likelihood invariance for measurement

Article type: Research Article

Authors: Yoshida, Takeshi | Washio, Takashi^* | Ohshiro, Takahito | Taniguchi, Masateru

Affiliations: The Institute of Scientific and Industrial Research, Osaka University, Osaka, Japan

Correspondence: [*] Corresponding author: Takashi Washio, The Institute of Scientific and Industrial Research, Osaka University, %****␣ida-25-ida194980_temp.tex␣Line␣50␣**** 8-1, Mihogaoka, Ibarakishi, Osaka, 567-0047, Japan. Tel.: +81 6 6879 8540; Fax: +81 6 6879 8544; E-mail: [email protected].

Keywords: Semi-supervised learning, classification from positive and unlabeled data, maximum likelihood estimation, likelihood invariance, application to measurement

DOI: 10.3233/IDA-194980

Journal: Intelligent Data Analysis, vol. 25, no. 1, pp. 57-79, 2021

Published: 26 January 2021

Get PDF

Abstract

We propose novel approaches for classification from positive and unlabeled data (PUC) based on maximum likelihood principle. These are particularly suited to measurement tasks in which the class prior of the target object in each measurement is unknown and significantly different from the class prior used for training, while the likelihood function representing the observation process is invariant over the training and measurement stages. Our PUCs effectively work without estimating the class priors of the unlabeled objects. First, we present a PUC approach called Naive Likelihood PUC (NL-PUC) using the maximum likelihood principle in a nontrivial but rather straightforward manner. The extended version called Enhanced Likelihood PUC (EL-PUC) employs an algorithm iteratively improving the likelihood estimation of the positive class. This is advantageous when the availability of the labeled positive data is limited. These characteristics are demonstrated both theoretically and experimentally. Moreover, the practicality of our PUCs is demonstrated in a real application to single molecule measurement.

1.Introduction

Classification from positive and unlabeled data (PUC) [7, 27] has been actively researched in recent years. PUC is required in many problems where large amounts of unlabeled data are easily acquired but only limited quantities of one-class-labeled data are available. Many measurement tasks in scientific, industrial, and social sensing face such problems.

For example, the reinforced concrete in various real-world environments is classified as damaged or undamaged by ultrasonic testing [12]. The measurements of actually damaged samples are acquired from a limited number of past incidents, while many non-incident samples, which are either damaged or undamaged, are acquired without their labels. Similar situations are common in the field of industrial nondestructive inspection. Another example is the problem of classifying the risks of an individual suffering heart failure within the next 10 years based on cardiac function indices such as stroke volume (SV) and fractional shortening (%FS) which are geometrically measured by ultrasonic imaging [16]. Occurrences of heart failure are only recorded for persons who are actually hospitalized within the 10 years after their heart inspections, whereas most of those who are inspected but did not experience the heart failure remain unlabeled. Similar problems appear in other medical fields. As described later, some advanced scientific measurement such as single molecule measurement also includes the similar problem.

The problem setting underlying many measurement tasks, including these examples, is that we have a large data set DU of unlabeled samples X and a small data set DP of other samples X having positive labels Y=P such as “damaged” and “heart failure.” X∈DP is drawn from the positive marginal density p(X|Y=P), and X∈DU is from the marginal density:

pD(X)=πDp(X|Y=P)+(1-πD)p(X|Y=N),

where πD is the positive class prior probability pD(Y=P). PUC learns a classifier from DP and DU, and this classifier can be used to obtain the measurement consequence.

A distinguishing feature of the measurement tasks is that p(X|Y=P) and p(X|Y=N) are defined by the sensing devices as likelihood functions of Y; these functions are mostly invariant because the sensing devices are designed to preserve their characteristics for long periods to ensure high accuracy. In the aforementioned two examples, the ultrasonic sensors provide invariant likelihood functions on the concrete and the heart motion as long as they are well maintained. Another distinguishing feature of the measurement tasks is that the class prior probability of an individual target object or a set of target objects is unknown in each measurement and can be significantly different from the class prior probabilities of the objects targeted in other measurements. For instance, the prior probability of concrete damage deviates according to the real-world environment surrounding the individual concrete object. The prior probability of the heart failure is heavily dependent on the individual’s living environment. Thus, the problem setting underlying the measurement tasks can be more precisely stated as being that a newly measured sample follows the new marginal density

pM(X)=πMp(X|Y=P)+(1-πM)p(X|Y=N),

where an unknown prior probability πM(=pM(Y=P)) can be very different from πD of DU used for the PUC training.

These arguments reveal difficulties to apply conventional PUCs to measurement problems [7, 27].

First, all past PUCs learn maximum a posteriori (MAP) classifiers in the standard setting of machine learning. A MAP classifier estimates the label Y assuming that p(Y|X)∝p(X|Y)p(Y)=p(X,Y) is unchanged over the training and test (measurement) stages, and thus does not anticipate a large difference between πD and πM. This nature does not fit the aforementioned problem setting of the measurement tasks.

Second, all past PUCs require the availability of the class prior probabilities πD and πM before the training and test (measurement) stages [27] or a special labeling mechanism in the training data acquisition [7]. These requirements also do not meet the conditions of the measurement tasks.

In measurement tasks having the invariant p(X|Y) and the large discrepancies between πD and unknown πM, the maximum likelihood estimation of Y from X relying on the invariance of p(X|Y) has been widely employed to avoid the above difficulty of the MAP approach, when both positive and negative labeled data sets: DP and DN are available for estimating p(X|Y) [28]. However, to the best of our knowledge, there is no principle for accurately estimating p(X|Y) from positive and unlabeled data sets: DP and DU without knowing πD. If maximum likelihood PUCs based on such a principle could be established, they would enable various new measurement tasks.

The contributions of the present study are as follows.

1. We propose the novel principle of maximum likelihood PUCs to address the aforementioned problems. These are the first PUCs that learn from the independently acquired two sample datasets: DP and DU and that do not require the estimation of the class prior probability.
2. The proposed PUCs are accurate even when a small labeled data set DP and a massive unlabeled data set DU are given, and thus have a practical advantage.
3. These characteristics and the statistical accuracies of the proposed PUCs are theoretically demonstrated.
4. We further show a feasible performance measure of the PUCs requiring positive and unlabeled data sets only. This measure can be used to select appropriate parameters for the PUCs in cross validation.
5. The performance of our proposed PUCs is confirmed through numerical experiments using artificial and benchmark data and a real-world application to noise reduction in advanced single molecule measurement.

2.Problem setting

Here, we define our problem setting more precisely. Let X∈𝒳⊆ℝd(d∈ℕ) and Y∈{P,N} be a sample in its domain 𝒳 and a positive (P) or negative (N) label, respectively. Given a training data set DU of unlabeled samples X and a training data set DP of other samples X having positive labels Y=P, let MU be an unlabeled test data set given by measuring new objects. MU may include only one sample, when the target object is unique. Samples in DP are i.i.d. sampled from the positive marginal density pD(X|Y=P), and samples in DU and MU are i.i.d. sampled from marginal densities pD⁢(X) and pM⁢(X), respectively. No negative data are given.

We now introduce two assumptions on p(X|Y).

.

pD(X|Y=P), pD⁢(X) and pM⁢(X) share an invariant p(X|Y). □

.

p(X|Y) is positive and has bounded and continuous second derivatives on 𝒳. □

These are not special assumptions. All past PUCs assumed a common p(X|Y) over all data sets [7, 27]. All past classifiers using a non-parametric estimation of the probability densities in a continuous space rely on Assumption 2. This assumption is used in a later section. From the first assumption, pD(X|Y=P)=p(X|Y=P) holds, and pD⁢(X), pM⁢(X) consist of the common p(X|Y) and the positive class prior probabilities πD=pD(Y=P) and πM=pM(Y=P), respectively, as follows.

(1)

pD(X)=πDp(X|Y=P)+(1-πD)p(X|Y=N),

(2)

pM(X)=πMp(X|Y=P)+(1-πM)p(X|Y=N).

πD,πM∈(0,1) are unknown and independently given.

Our problem is to learn an accurate non-parametric classifier of unseen samples in the unlabeled test data set MU using only DP and DU. This is a two sample setting of PUC [27], whereas DU and MU follow mutually different unknown class prior probabilities.

3.Related work

There are two issues in tackling this problem. The first relates to the distinguishing feature of the measurement tasks such that no information about MU is known before classifying a sample in MU while its class prior πM can be significantly different from πD of DU. In the field of traditional classifiers that learn from both positive and negative labeled data, many approaches for adapting the classifiers to slight or gradual concept drift [8] and larger differences between the training and test data sets [19, 20] have been proposed. Classical semi-supervised classifiers that learn from small positive and negative labeled data sets and a large unlabeled data set are based on the transductive learning framework [29, 1], and the adaptation of the classifiers to different class prior distributions has been studied. Saerens et al. proposed a covariate shift adaptation for the difference in the class prior in the semi-supervised setting [22]. More recent study has proposed a transductive transfer learning method to estimate the change in the class prior under the same setting [5]. In the PUC setting, most of the conventional PUCs are unsuitable because they assumes an invariant p⁢(X,Y) as mentioned earlier. An exception is the PUC proposed by Li et al. which adapts to gradual changes in micro-cluster labels in a data stream [15]. However, all these approaches rely on the availability of a certain size of MU before classifying its samples or a gradual change in the class prior distribution of MU, and they are therefore not applicable to our problem.

The second issue comes from the fact that the values of πD and πM are unknown, while the training processes of many PUCs need these values. The past PUCs use one of the following two options to overcome this difficulty. The PUC of Elkan and Noto [7] uses an option called the one sample setting which assumes that πD and πM are identical, positive samples drawn from pD⁢(X) are labeled with a constant probability and included in DP, and the remaining unlabeled samples are included in DU [4]. This PUC implicitly estimates the class priors by partially matching the distribution of DP to that of DU. However, these assumptions are not always valid in measurement tasks where DP and DU are independently given and πD and πM are largely different. Some PUCs use the other option called the two sample setting which accepts independently acquired DP and DU [27, 6, 18]. This setting assumes that πD and πMare correctly estimated from DP, DU and MU by relying on irreducibility of p(X|Y) that p(X|Y=P) is not represented by a linear combination of p(X|Y=N) and another probability density function of X [2, 6, 23, 21]. However, this irreducibility is not generally guaranteed in measurement tasks. Moreover, MU is not usually available for estimating πM before the new measurement, and so cannot be applied to our problem setting.

In the next section, we propose two maximum likelihood PUCs not requiring MU in advance nor the estimation of πD and πM. Thus, they do not suffer the difficulties for the measurement tasks.

4.Proposed method and theoretical analysis

4.1Naive Likelihood PUC (NL-PUC)

To design a PUC that is robust to differences between πD and πM, we employ the following maximum likelihood measure that does not depend on the class prior probabilities. According to Assumption 1, the maximum likelihood estimation of Y under x∈MU is given as follows:

(3)

y={Pifp(x|Y=P)⩾p(x|Y=N),Nifp(x|Y=P)<p(x|Y=N).

Associated with this measure, the following lemma holds.

.

Given a marginal density

pπ(X)=πp(X|Y=P)+(1-π)p(X|Y=N),

the following inequalities are equivalent for any π∈(0,1) and X.

p(X|Y=P)⩾pπ(X)⇔p(X|Y=P)⩾p(X|Y=N).

□

.

We obtain the following relation by substituting the definition of pπ⁢(X) to p(X|Y=P)⩾pπ(X).

p(X|Y=P)⩾πp(X|Y=P)+(1-π)p(X|Y=N)

⇔(1-π)p(X|Y=P)⩾(1-π)p(X|Y=N)

⇔p(X|Y=P)⩾p(X|Y=N) s.t. π∈(0,1).

□

From this lemma and Assumption 1, we immediately obtain the following, which is our core theorem.

.

For any combination of πD and πM in (0,1), the following formula is the maximum likelihood measure equivalent to Eq. (3) for all x∈MU following pM⁢(X).

(4)

y={Pifp(x|Y=P)⩾pD(x),Nifp(x|Y=P)<pD(x).

□

.

From Eqs (1) and (2) and Lemma 1, the following holds:

p(X|Y=P)⩾pD(X)⇔p(X|Y=P)⩾pM(X)

⇔p(X|Y=P)⩾p(X|Y=N).

Because the last inequality is independent of πD and πM according to Assumption 1, these three inequalities equivalently and invariantly hold for any πD and πM in (0,1). Equation (4) is therefore equivalent to Eq. (3), and the theorem holds. □

We construct a maximum likelihood PUC by substituting the non-parametric estimations p^(x|Y=P):=p^D(x|Y=P) and p^D⁢(x) derived from DP and DU, respectively, into this measure. They are provided by

p^D(x|Y=P)=1|DP|∑x∈DPpK(X|x)andp^D(x)=1|DU|∑x∈DUpK(X|x),

where we use kernel densities pK(X|x) such as Gaussian kernel which well approximates p(x|Y=P) and pD⁢(x) under the condition of Assumption 2. We refer to this classification as Naive Likelihood PUC (NL-PUC).

4.2Enhanced Likelihood PUC (EL-PUC)

p^(k)(X|Y=P)

(5)

:=(1-r)p^D(X|Y=P)+rp~(k-1)(X|Y=P).

where r∈(0,1) and k=2,3,…. p~(k-1)(X|Y=P) is given by the following non-parametric approximation of p(X|Y=P) using kernel densities pK(X|x) and their weights w~(k-1)⁢(x). We use pK(X|x) such as Gaussian kernel as well as in NL-PUC.

(6)

p~(k-1)(X|Y=P)=∑x∈DUw~(k-1)(x)pK(X|x).

(7)

w~(k-1)⁢(x)=p^(k-1)(x|Y=P)/p^D(x)∑x′∈DUp^(k-1)(x′|Y=P)/p^D(x′).

Their initial values are set as follows.

p~(1)(X|Y=P)=∑x∈DUw~(x)pK(X|x)

(8)

s.t. ⁢w~⁢(x)=p^D(x|Y=P)/p^D(x)∑x′∈DUp^D(x′|Y=P)/p^D(x′).

Equation (4.2) is a weighted non-parametric approximation of p(X|Y=P) using the samples in DU and the normalized weights w~⁢(x) where DU follows pD⁢(X) and w~⁢(x) approximates w⁢(x) defined by pD(x|Y=P)∝w(x)pD(x) [26, 10]. Equations (6) and (7) further repeat the weighted non-parametric approximation of p(X|Y=P) using its estimation from the previous step.

In the next subsection, we show that, if DU is of sufficient size and we choose an appropriate r, the mean square error of p^(k)(X|Y=P) can be no more than that of the non-parametric approximation p^D(X|Y=P) directly derived from DP. Moreover, the mean square error of p^D(X|Y=P) is limited under Assumption 2, because p(X|Y=P) is positive and has bounded and continuous second derivatives on 𝒳 [24, 10]. Accordingly, p^(k)(X|Y=P) converges to a value having better accuracy than p^D(X|Y=P). In practice, our numerical experiments show that p^(k)(X|Y=P) using almost any r∈(0,1) widely gives a better estimation of p(X|Y=P) than p^D(X|Y=P).

When the rate of change of w~(k-1)⁢(x) converges within a small tolerance ε=10-4 for all x∈DU, we obtain a more accurate p^(k)(X|Y=P) than p^D(X|Y=P) given by NL-PUC, and apply p^(k)(X|Y=P) to Eq. (4) with p^D⁢(X). This is called Enhanced Likelihood PUC (EL-PUC) and expected to have better accuracy particularly when |DP| is small.

4.3Theoretical accuracy of EL-PUC

pD⁢(X) is positive and has bounded and continuous second derivatives on 𝒳 from Assumption 2, Eq. (1) and πD∈(0,1). Moreover, the size of DU is sufficiently large in most practical cases. Accordingly, the non-parametric estimation p^D⁢(X) given by DU and used on the r.h.s. of Eq. (4) has a limited mean square error [24, 10]. Thus, we focus on the error of p^(k)(X|Y=P) which is used on the l.h.s. of Eq. (4).

Let q^⁢(X) be the estimation of q⁢(X), B⁢[q^⁢(X)] be the absolute value of the bias |E⁢[q^⁢(X)-q⁢(X)]|, V⁢[q^⁢(X)] be the variance E⁢[{q^⁢(X)-E⁢[q^⁢(X)]}2], and MSE⁢[q^⁢(X)] be the mean square error B⁢[q^⁢(X)]2+V⁢[q^⁢(X)]. Here, E⁢[⋅]denotes the expectation over the distribution p⁢(DP∪DU).

.

Let Δ(k)(X)=|E[p~(k)(X|Y=P)]-E[p^(k)(X|Y=P)]|, then Δ𝑠𝑢𝑝⁢(X)=supk⩾1⁡Δ(k)⁢(X) is bounded, and the following holds for k→∞.

B[p^(k)(X|Y=P)]⩽r1-rΔ𝑠𝑢𝑝(X)+B[p^D(X|Y=P)].

□

.

From Eqs (6) and (7), p~(k)(X|Y=P) is a weighted non-parametric approximation of p^(k)(X|Y=P). As p^(k)(X|Y=P) in Eq. (4.2) is a linear combination of p^D(X|Y=P) and p~(k-1)(X|Y=P) which are positive and have bounded and continuous second derivatives by the use of the kernel density pK(X|x) satisfying the condition of Assumption 2, the absolute expected error of p~(k)(X|Y=P) against p^(k)(X|Y=P) is bounded [24, 10]. Hence, Δ(k)⁢(X) is bounded for any k, i.e., Δ𝑠𝑢𝑝⁢(X) is bounded. Accordingly, by subtracting p^D(X|Y=P) from both sides of Eq. (4.2) and taking their absolute expectations, the following holds.

|E[p^(k)(X|Y=P)-p^D(X|Y=P)]|

=r|E[p~(k-1)(X|Y=P)-p^D(X|Y=P)]|

=r|E[p^(k-1)(X|Y=P)]±Δ(k-1)(X)-E[p^D(X|Y=P)]|

⩽r|E[p^(k-1)(X|Y=P)-p^D(X|Y=P)]|+rΔ(k-1)(X).

By repeatedly applying this formula and applying p^(1)(x|Y=P):=p^D(x|Y=P) in Eq. (4.2) when i=k-1, we derive:

|E[p^(k)(X|Y=P)-p^D(X|Y=P)]|⩽

∑i=1k-1ri⁢Δ(k-i)⁢(X)⩽Δ𝑠𝑢𝑝⁢(X)⁢∑i=1k-1ri→k→∞r1-r⁢Δ𝑠𝑢𝑝⁢(X).

Moreover, the following holds.

⩽|E[p^(k)(X|Y=P)-p^D(X|Y=P)]|

+|E[p^D(X|Y=P)-p(X|Y=P)]|

=|E[p^(k)(X|Y=P)-p^D(X|Y=P)]|+B[p^D(X|Y=P)].

By substituting the former inequality into this last line, we obtain our following result for k→∞.

B[p^(k)(X|Y=P)]⩽r1-rΔsup(X)+B[p^D(X|Y=P)].

□

This lemma shows that B[p^(k)(X|Y=P)] can be less than B[p^D(X|Y=P)], if r is sufficiently small.

.

The following holds for all k⩾2.

V[p^(k)(X|Y=P)]⩽{V[p^D(X|Y=P)]1/2

+r(V[p~(k-1)(X|Y=P)]1/2-V[p^D(X|Y=P)]1/2)}2.

□

.

By taking the expectation of Eq. (4.2) and subtracting it from Eq. (4.2) itself, we obtain the following for k⩾2.

p^(k)(X|Y=P)-E[p^(k)(X|Y=P)]

=(1-r){p^D(X|Y=P)-E[p^D(X|Y=P)]}

+r{p~(k-1)(X|Y=P)-E[p~(k-1)(X|Y=P)]}.

By taking the mean square of both sides, the following formula is derived.

V[p^(k)(X|Y=P)]=(1-r)2V[p^D(X|Y=P)]

+2⁢(1-r)⁢r⁢E[(p^D(X|Y=P)-E[p^D(X|Y=P)])¯

×(p~(k-1)(X|Y=P)-E[p~(k-1)(X|Y=P)])]¯

+r2V[p~(k-1)(X|Y=P)].

According to the Cauchy-Schwarz inequality and the positivity of the variances, the underlined term is no more than V[p^D(X|Y=P)]1/2V[p~(k-1)(X|Y=P)]1/2. Thus the following inequality is obtained, and the lemma holds.

V[p^(k)(X|Y=P)]⩽(1-r)2V[p^D(X|Y=P)]

+2(1-r)rV[p^D(X|Y=P)]1/2V[p~(k-1)(X|Y=P)]1/2

+r2V[p~(k-1)(X|Y=P)]

={(1-r)V[p^D(X|Y=P)]1/2+rV[p~(k-1)(X|Y=P)]1/2}2

={V[p^D(X|Y=P)]1/2

+r(V[p~(k-1)(X|Y=P)]1/2-V[p^D(X|Y=P)]1/2)}2.

□

This lemma indicates that V[p^(k)(X|Y=P)] can be less than V[p^D(X|Y=P)], if r is sufficiently small. The following theorem is easily derived from these two lemmas.

.

If the following holds, there exists some r∈(0,1) such that MSE[p^(k)(X|Y=P)]<MSE[p^D(X|Y=P)].

B[p^D(X|Y=P)]Δs⁢u⁢p(X)<V[p^D(X|Y=P)]

(9)

-V[p~(k-1)(X|Y=P)]1/2V[p^D(X|Y=P)]1/2.

□

.

MSE[p^(k)(X|Y=P)] is bounded above by

MSE𝑠𝑢𝑝[p^(k)(X|Y=P)]

={B[p^D(X|Y=P)]+r1-rΔ𝑠𝑢𝑝(X)}2

+{V[p^D(X|Y=P)]1/2

+r(V[p~(k-1)(X|Y=P)]1/2-V[p^D(X|Y=P)]1/2)}2

according to Lemmas 2 and 3. Therefore,

MSE𝑠𝑢𝑝[p^(k)(X|Y=P)]→r→0+MSE[p^D(X|Y=P)],

and MSE𝑠𝑢𝑝[p^(k)(X|Y=P)]→r→1-∞

hold. These facts imply that there exists some r∈(0,1) such that MSE[p^(k)(X|Y=P)]<MSE[p^D(X|Y=P)], if the differential of MSEs⁢u⁢p[p^(k)(X|Y=P)] for r is negative in a vicinity of 0+, i.e.,

∀ϵ>0,∃δ∈(0,1),c<0, s.t. ⁢∀r∈(0,δ),

⇒|dMSE𝑠𝑢𝑝[p^(k)(X|Y=P)]d⁢r-c|⩽ε.

The above condition is equivalent to the following, which proves the lemma.

limr→0+⁡dMSEs⁢u⁢p[p^(k)(X|Y=P)]d⁢r

=2{B[p^D(X|Y=P)]+r1-rΔ𝑠𝑢𝑝(X)}1(1-r)2Δ𝑠𝑢𝑝(X)

+2{V[p^D(X|Y=P)]1/2

+r(V[p~(k-1)(X|Y=P)]1/2-V[p^D(X|Y=P)]1/2)}

×(V[p~(k-1)(X|Y=P)]1/2-V[p^D(X|Y=P)]1/2) s.t. r→0+

=2B[p^D(X|Y=P)]Δ𝑠𝑢𝑝(X)

+2V[p~(k-1)(X|Y=P)]1/2V[p^D(X|Y=P)]1/2

-2V[p^D(X|Y=P)]<0.

⇔B[p^D(X|Y=P)]Δ𝑠𝑢𝑝(X)<V[p^D(X|Y=P)]

-V[p~(k-1)(X|Y=P)]1/2V[p^D(X|Y=P)]1/2.

□

(10)

V[p^D(X|Y=P)]≃C|DP|p(X|Y=P),and

(11)

V[p~(k-1)(X|Y=P)]≃C|DU|eff(k-1)p(X|Y=P),

respectively [26, 10, 14]. C is a positive constant given by the integrated square of pK(X|x), and |DU|eff(k-1) is the effective size of DU with weights w~⁢(x)(k-1) defined as

|DU|eff(k-1)=1∑x∈DU(w~⁢(x)(k-1))2⁢ s.t.⁢∑x∈DUw~⁢(x)(k-1)=1.

From Eqs (1) and (7), x in a certain fraction α of DU have w~⁢(x)(k-1)>1/|DU|, and thus ∑x∈DU(w~⁢(x)(k-1))2⩾α⁢|DU|⁢(1/|DU|)2=α/|DU|⇒|DU|eff(k-1)=O⁢(|DU|) holds. According to this fact, Eqs (10) and (11), the r.h.s. of Eq. (2) is large, i.e., Eq. (2) may hold, when |DP| is small and |DU| is large. Therefore, when there are few samples in DP and many samples in DU, MSE[p^(k)(X|Y=P)] is generally less than MSE[p^D(X|Y=P)] if using an appropriate r∈(0,1), and EL-PUC is expected to attain higher accuracy than NL-PUC.

4.4Performance measure and parameter selection

When a certain amount of test data is available, the performance of our PUCs can be empirically evaluated for the purpose of comparison with other candidate PUCs and to tune the PUC parameters. However, standard performance indices such as AUC and F-measure are not applicable, because the test data sets in practical problems targeted by PUCs usually do not include labeled negative samples.

For such circumstances, a number of studies have proposed performance indices similar to the traditional AUC that assess the goodness of the classifier classes, but not particular parameter selections of the classifiers [9, 17, 11]. Only a few studies have proposed indices that are directly applicable to parameter selection. Lee and Liu proposed an index equivalent to the geometric mean of precision and recall in the one sample setting [13]. Calvo et al. proposed a generic pseudo F-measure that matches the traditional F-measure if we substitute the correct πM [3]. We employ the latter index because it is applicable to the two sample setting.

The unavailability of πM in our problems presents an issue. Given a labeled positive test data set MP and a unlabeled test data set MU with Assumption 1, assuming that both MP and MU have certain sizes for statistically stable evaluations, we derive the following lemma for our likelihood based PUC which uses Eq. (3) or equivalent Eq. (4). It ensures that the pseudo F-measure using an arbitrary π~ in place of πM is a maximum if and only if the PUC correctly classifies all samples in MU. In the case where MP is not given, we use a subset DP′⊂DP as MP under Assumption 1 and use DP∖DP′ to train the PUC. In this study, we use a moderate π~= 0.5.

.

Let MUP be the set of all positive samples in MU, and let M^P and M^UP be the sets of all samples classified as positive in MP and MU, respectively, by using Eq. (3) or equivalent Eq. (4). The following pseudo F-measure [3] is expected to be a maximum for any π~>0, if and only if M^UP=MUP.1 1

F~=2⁢π~⁢|M^P|/|MP||M^UP|/|MU|+π~

□

.

By Assumption 1, the positive samples in MUP and MP have identical p(X|Y=P). Thus, the probability of MUP that a positive sample in MUP is correctly classified as positive by applying Eq. (3) or equivalent Eq. (4) is identical with that of MP. In addition, M^P⊆MP always holds. Accordingly, E⁢[|MUP∩M^UP|/|MUP|]=E⁢[|MP∩M^P|/|MP|]=E⁢[|M^P|/|MP|], and the following holds.

F~≃2⁢π~⁢|MUP∩M^UP|/|MUP||M^UP|/|MU|+π~.

When |M^UP|⩾|MUP|, the denominator is positive under π~>0 and attains a minimum when |M^UP|=|MUP|, because the denominator monotonically increases on |M^UP|∈[|MUP|,|MU|]. The numerator with π~>0 attains a maximum when MUP⊆M^UP where |MUP∩M^UP| is a maximum for any |M^UP|(⩾|MUP|). Thus, F~ is a maximum when M^UP=MUP.

Similarly, when |M^UP|<|MUP|, the numerator with π~>0 is a maximum for any |M^UP|(<|MUP|) if M^UP⊆MUP where F~ is represented as

F~≃2⁢π~/|MUP|1/|MU|+π~/|M^UP|,

which is monotonically increasing on |M^UP|∈[0,|MUP|] under π~>0. Thus, F~ is maximized when M^UP=MUP.

By combining these two cases, we conclude that, for any π~>0, F~ is maximized if and only if M^UP=MUP. □

The analysis in Subsection 4.3 indicated that the parameter r in Eq. (4.2) affects the accuracy of the EL-PUC. We set up EL-PUC using two candidate values of r. EL-PUCdf uses arbitrary default r= 0.5 which is a moderate value in the interval (0,1). Another EL-PUCcv uses r selected by cross validation using F~. In the first stage of EL-PUCcv, we apply 10CV to select the best r by randomly splitting DP and DU into 10% fractions: DP′ and DU′ for testing, i.e., MP and MU, and remaining 90% fractions: DP∖DP′ and DU∖DU′ for training in each fold, respectively.2 2 Finally, EL-PUCcv is trained using the entire DP and DU and the selected parameter.

We use a Gaussian kernel density with kernel width h for the non-parametric estimation of the probability density functions required in NL-PUC and EL-PUC. Wang empirically showed that the theoretical value of h that minimizes the mean integrated square error of the un-weighted non-parametric estimator is also optimal in case of a weighted non-parametric estimator, and called this a plug-in estimator [26]. We use h of this plug-in estimator.

5.Experimental evaluation

Using both artificial and UCI data sets, we evaluated the accuracy of NL-PUC, EL-PUCdf and EL-PUCcv and their robustness to differences between πM and πD. The results were compared with those given by Elkan & Noto’s PUCs [7] using Gaussian Naïve Bayes based (NB-E&N) and Gaussian kernel density based (KD-E&N) estimators of p^D(X|Y=P) and p^D⁢(X). As with our proposed PUCs, we applied the kernel width h of the plug-in estimator to KD-E&N. We did not include other PUCs for the two sample setting in comparison [27, 15, 6, 18], because they require MU to estimate πM, and this is not available in advance in our problem setting.

The codes of the five PUCs were implemented using MATLAB and run in Windows 10 PC equipped with Intel Core i7-4790-3.6GHz-8 cores CPU and 32GB RAM.

5.1Performance evaluation using artificial data

We used artificial data containing p(X|Y=P) and p(X|Y=N) which are two Gaussians having a common covariance Σ=[1001] and distinct means μP=(0, 0), μN=(2, 0). Given πD and πM, the three data sets DP, DU and MU were generated by i.i.d. sampling from pD(X|Y=P), pD⁢(X) in Eq. (1) and pM⁢(X) in Eq. (2), respectively.

Figure 1.

Performance for artificial data having three different πD= 0.1, 0.5 and 0.9 (|DP|= 1000, |DU|= 1000, |MU|= 1000 and πM∈ {0.1, 0.5, 0.9}).

Figure 1 shows comparisons of precision, recall and F-measure between the five PUCs in terms of the three class prior probabilities of DU: πD= 0.1, 0.5 and 0.9. Under given πD with fixed data sizes |DP|= 1000, |DU|= 1000 and |MU|= 1000, we generated 100 cases of {DP,DU,MU} for each class prior probability of MU: πM∈ {0.1, 0.5, 0.9},i.e., 300 cases in total. The respective PUC trained using DP and DU is tested by using MU to measure its performance indecies in each case. A bar, a whisker and a diamond shape of a PUC in the figures represent the mean, the standard deviation and the minimum/maximum of the performance index over the 300 cases covering the three πM∈ {0.1, 0.5, 0.9}, respectively. The mean represents the accuracy of a PUC over the difference of the class prior probabilities between DU and MU. The less standard deviation and the less interval between the minimum and the maximum show the higher robustness of a PUC to the difference.

In these figures, the accuracy and robustness of all PUCs mostly decrease with increasing πD, because the negative samples in DU become less while they are the unique information source on p(X|Y=N). NB-E&N and KD-E&N particularly lose their recall in this condition, since they use classifiers of labeled and unlabeled samples which do not capture the difference between the labeled positive samples and the unlabeled negative samples when almost all samples are positive. Nevertheless, all the three indecies show high accuracy and higher robustness of our three PUCs than NB-E&N and KD-E&N in almost all conditions. This comes from the fact that the classification measures Eqs (3) and (4) used in our PUCs are independent of the class prior probabilities. The accuracy of EL-PUCdf and EL-PUCcv is particularly higher than that of NL-PUC under πD= 0.9. This may be because V[p~(k-1)(X|Y=P)] is kept small by estimating p~(k-1)(X|Y=P) from both DP and DU, while B[p^D(X|Y=P)] is small as long as Assumption 2 holds on a continuous space 𝒳 [24, 10]. These effects make Eq. (2) in Theorem 2 to widely hold even under the large πD where the performance of NL-PUC is degraded by the shortage of the information on p(X|Y=N).

Figure 2.

Performance for artificial data having various |DP|= 10, 100, 1000 and 10000 (πD= 0.5, |DU|= 1000, |MU|= 1000 and πM∈ {0.1, 0.5, 0.9}).

Figure 3.

Performance for artificial data having various |DU|= 100, 1000 and 10000 (πD= 0.5, |DP|= 1000, |MU|= 1000 and πM∈ {0.1, 0.5, 0.9}).

Figure 4.

Computation times of the five PUCs.

Figure 2 shows comparison of the three indices between the five PUCs over |DP|= 10, 100, 1000 and 10000. As in Fig. 1, we generated 300 cases over πM= 0.1, 0.5 and 0.9 under given |DP|, |DU|= 1000 with fixed πD= 0.5 and |MU|= 1000 in an experiment, and measured the mean, the standard deviation and the minimum/maximum of the three indices of each PUC over the 300 cases covering the three πM∈ {0.1, 0.5, 0.9}, respectively. Figure 3 was created for the comparison over |DU|= 100, 1000 and 10000 under fixed πD= 0.5, |DP|= 1000 and |MU|= 1000 in similar manner. In these figures, our three PUCs again show higher accuracy and robustness than NB-E&N and KD-E&N in the most conditions including the case of DP= 10 in Fig. 2. Elkan & Noto’s PUCs show larger standard deviations and intervals between the minimum and the maximum of the accuracy. Particularly, they show significant decreases of the accuracy in the imbalanced conditions between |DP| and |DU| such as the cases of |DP|= 10 and 10000 in Fig. 2, and DU= 100 in Fig. 3. This is because they use classifiers of labeled and unlabeled samples which are statistically degraded under the highly imbalanced conditions. The accuracy of EL-PUCdf and EL-PUCcv is particularly higher than that of NL-PUC when a small size |DP|= 10 and a large size |DU|= 1000 are provided in Fig. 2. This character is consistent with the suggestion of Theorem 2 and (10) and (11).

Figures 4a and b represent the dependency of the computation times required to train the five PUCs using the artificial data in log-log plots. The computation times under given |DP| were averaged over all combinations of |DU| and πD in the former, and those under given |DU| were averaged over all combinations of |DP| and πD in the latter. The line of “EL-PUCcv” shows its computation times after the tuning of the parameter r, and the line of “r tuning time” indicates its times required for the tuning by the 10 fold cross validation. Since NL-PUC simply learns Eq. (4), its complexity for training is O⁢(|DP|+|DU|). Both EL-PUCdf and EL-PUCcv have their complexity of O⁢(K⁢(r)⁢|DU|⁢(|DP|+|DU|)) where K⁢(r) is the number of the iteration of Eqs (4.2)–(7). In each iteration step, the computation of w~(k-1)⁢(x) is repeated |DU| times in Eq. (6) where each w~(k-1)⁢(x) is computed by Eqs (4.2) and (7) using |DP|+|DU| samples. In the r tuning, a test computes the pseudo F-measure using the samples proportional to |DP|+|DU| where each sample is classified by using the other |DP|+|DU| samples. This is repeated 10 times with the training of EL-PUCcv. Thus, the total complexity of the r tuning is O⁢(10⁢(|DP|+|DU|)2+10⁢K⁢(r)⁢|DU|⁢(|DP|+|DU|)). NB-E&N and KD-E&N process |DP|+|DU| samples for classifying a sample if it is labeled or unlabeled. They repeat this process |DP| times to evaluate the probability that a positive sample is labeled. Thus, their total complexity is O⁢(|DP|⁢(|DP|+|DU|)). These facts are well reflected to Fig. 4a and b where the dependency of their computation times is linear or quadratic. Particularly, KD-E&N needs computation times comparable with or more than our proposed PUCs. Note that these computation times may be positively biased by some overhead process under small |DP|= 10.

Figure 5.

Performance for UCI concrete compressive strength data having various πD= 0.1, 0.5 and 0.9 (|DU|= 200, |DP|∈ {10, 30, 90, 270} and πM∈ [0.1, 0.5, 0.9]).

5.2Performance evaluation using UCI data

In the UCI repository, there are only a few data sets having known measurement processes. We used “Concrete Compressive Strength" (#samples: 1030, #observed variables [X]: 7, target variable [Y]: compressive strength), “Airfoil Self-Noise" (#samples: 1503, #observed variables [X]: 5, target variable [Y]: angle of attack), and “Steel Plates Faults" (#samples: 1941, #observed variables [X]: 26, target variable [Y]: sigmoid of areas) from the physical sciences. We removed their categorical variables, binarized the target variables by applying their median values as a threshold, and chose one of the binary values as positive. Thus, they became balanced data sets in terms of their positive and negative samples. In each data set, we drew 20% of the total to produce three DU having the probabilities πD= 0.1, 0.5 and 0.9 to include positive samples, respectively. We also took extra 1%, 3%, 9% and 27% of the total to create four DP, respectively, by choosing positive samples only. Furthermore, we drew extra 10% of the total to produce three MU having πM= 0.1, 0.5 and 0.9, respectively. For each of the three DU, we computed the mean, the standard deviation, and the minimum/maximum of the accuracy indices of the five PUCs by marginalizing over all combinations of DP and MU.

Figures 5–7 compare the results using these UCI data sets. Because the figures of precision and recall of the last two data sets show similar behaviors with those of the first one, they are omitted. The figures mostly show a similar tendency with those of the artificial data. Under the most conditions of πD, our three PUCs are more accurate and robust than NB-E&N and KD-E&N to the varieties of the size |DP| and the differences of the class prior πM. Though the difference between our three PUCs are not very significant, EL-PUCcv provides the top average accuracy in many cases.

Figure 4c indicates the computation times to train the five PUCs using Concrete Compressive Strength data set. Because the size of the data set is small, the computation times of their overhead processes became dominant. Nevertheless, NL-PUC having lower complexity in essence was faster than NB-E&N, because the load of the kernel density based estimation was not significant for the small data set. We observed similar behaviors for the other two data sets.

Figure 6.

Performance for UCI airfoil self-noise data having various πD (|DU|= 300, |DP|∈ {15, 45, 135, 405} and πM∈ [0.1, 0.5, 0.9]).

Figure 7.

Performance for UCI steel plates faults data having various πD (|DU|= 400, |DP|∈ {20, 60, 180, 540} and πM∈ [0.1, 0.5, 0.9]).

Figure 8.

Comparison between F-measure: F and pseudo F-measure: F~ of EL-PUCdf and NB-E&N for various data (|DP|= 10, |DU|= 1000 and πD,πM= {0.1, 0.5, 0.9} for artificial data, and the conditions of Fig. 5–7 for UCI data).

6.Discussion

All results in Section 5 indicate that EL-PUCcv, which tunes r by using the pseudo F-measure: F~, achieves the highest or comparable accuracy and robustness among our three PUCs. This reflects Lemma 4 that we can appropriately tune the parameters of the classifiers by using F~. The traditional F-measure: F and F~ of EL-PUCdf depicted in Fig, 8 more directly represents their consistency. Note that F~ can be larger than unity according to its definition in Lemma 4. Though the standard deviations and the intervals between the minimum and the maximum of F~ are larger than those of F, i.e., F~ is less robust to the deviations of πM from πD under various conditions, the means of F and F~ show consistent behaviors over the four data sets. This supports the use of F~ to evaluate the accuracy of our likelihood based PUCs. Figure 8 also shows the comparisons of F and F~ for NB-E&N which uses the aforementioned one sample setting. Although this setting does not match with the assumptions required to the consistency of F~, F and F~ of NB-E&N indicates their consistency comparable to those of EL-PUCcv. This may be because the probability to correctly classify a positive sample as positive by the Gaussian Naïve Bayes classifier used in NB-E&N is moderatly affected by the variation of the class priors π. This shows that the pseudo F-measure F~ is widely applicable to qualitatively assess the performance of the PUCs.

Figure 9.

Dependency of EL-PUC on parameter r for artificial data having various πD (|DP|∈ {15, 45, 135, 405}, |DU|∈ {100, 1000, 10000} and πD∈ [0.1, 0.5, 0.9]).

According to the dependency of EL-PUC’s accuracy on the parameter r as indicated by the lemmas and the theorems in Subsection 4.3, we proposed EL-PUCcv which selects an optimal r using the cross validation to achieve the highest accuracy. Meanwhile, Fig. 9 shows F-measure of EL-PUC over various r for the artificial data having three πD= 0.1, 0.5 and 0.9 where all other conditions were marginalized. In the marginalized view, the accuracy of EL-PUC does not have any clear dependency on r. This supports the experimental results in Section 5 indicating the small differences of the accuracy and the robustness between EL-PUCdf and EL-PUCcv in most cases. On the other hand, EL-PUCcv requires extremely large computation times for the parameter tuning as shown in Section 5. These facts suggest the limited applicability of EL-PUCcv to practical measurement tasks particularly when the available time for training is limited.

Besides, EL-PUCdf shows higher accuracy and robustness than NL-PUC under large πD and small |DP| while the computation time of EL-PUCdf is larger than that of NL-PUC. Therefore, we need to consider the trade-off between the classification performance and the computation time for the selection of the two approaches.

7.Real-world application

We applied our PUCs to noise reduction in a real-world single molecule measurement [25]. Figure 10 depicts the core part of the sensor called a “nano-gap.” The passage of an object between two nanoscopic electrodes is observed as an electric pulse induced by the quantum tunneling effect. Our task is to classify each pulse into a target molecule and a contaminant in the solvent based on the pulse outline in real time. A technical difficulty here is that the solvent containing no contaminants is hardly produced. Therefore, we always obtain a data set containing the target molecule’s pulses with the contaminant’s noise pulses, whereas we can get a data set containing the noise pulses of the contaminants only by observing the solvent without the target molecules. Moreover, the nano-gap sensor is disposable in every measurement task. Because characters of the nano-gaps are mutually different depending on the fluctuations of their manufacturing processes while a nano-gap is stable during a measurement task, we need to train the classifier in a short on-line period for the quick start at the beginning of every measurement task. Accordingly, the time for acquiring the noise pulses is very limited, while the target pulses with the noise pulses are acquired for a long period in their measurement. These required specifications meet with the conditions addressed by our proposed PUCs.

Table 1

Performance comparison of noise reduction

	Time for	F~
Method	Training	πD≃πM	πD<πM	πD≪πM
NB-E&N	1.47 msec	0.572	0.582	0.592
KD-E&N	4.90 msec	0.318	0.322	0.282
NL-PUC	7.29 msec	0.715	0.671	0.652
EL-PUCdf	736.19 msec	0.726¯	0.695¯	0.672¯

Figure 10.

A Single molecule sensor.

In our application, a pulse signal time series between the start and the end of the pulse is partitioned into 10 equi-width time intervals, and the signal points in each interval are averaged. The pulse outline is a 10 dimensional vector consisting of the 10 averages. We acquired DP consisting of the noise pulse outlines, which were labeled as positive, in a short initial on-line period for the quick start. Successively, the unlabeled pulse outlines of the target molecules with the contaminants are acquired for a certain limited period to create DU where their labels are estimated by a PUC during its training and the pulses classified into the target molecules become the initial part of the measurement output. After the PUC is trained by using DP and DU, the on-line noise reduction of new incoming pulses starts. The ratio of the targets and the contaminants in the solvent, i.e., the class prior, rapidly varies in the on-line measurement.

We used the pseudo F-measure F~ to evaluate the performance of the noise reduction, because the ground truth of the classification was unknown. We acquired an extra labeled positive test set MP together with DP in the initial short period, and further acquired some unlabeled test sets MU for computing F~ of the five PUCs after the on-line measurement started. Their sizes were defined as |DP|=|MP|= 20, |DU|= 800 and |MU|= 100 by following the aforementioned specifications and the convenience to compute F~.

Table 1 shows the performances of the four PUCs applied to three MU. We excluded EL-PUCcv because its large computation time is not suitable for on-line processing. The same codes and the same PC with Section 5 were used. The class prior πM of the first MU acquired immediately after the start of the on-line measurement is almost same with πD. Then, contaminants in the solvent rapidly increased over the periods acquiring the second and third MU which made the second MU as πD<πM and the third MU as πD≪πM. Note that F~ is not normalized into [0,1], and its value does not represent the absolute accuracy, while the larger F~ show a better performance. The best numbers are underlined in each column. Particularly, EL-PUCdf showed the best F~, while it takes relatively large computation time for its training. NL-PUC is better for the instant start of the on-line measurement, and EL-PUCdf is better if the dead time for nearly 1sec is allowed. The entire performance of NB-E&N and KD-E&N were significantly worse than those of NL-PUC and EL-PUCdf becuase of the small |DP|= 20 available for the training. This is consistent with the result for the artificial data depicted in Fig. 2.

8.Conclusion

Our proposed PUCs show high accuracy and robustness under wide conditions. They are particularly advantageous when the numbers of the labeled positive samples and the unlabeled samples are imbalanced and when the positive class prior probability is dominant. Moreover, our proposed EL-PUC provides better accuracy than our proposed simple NL-PUC when the number of the labeled positive samples is limited and when the positive class prior probability is dominant, while EL-PUC needs more computation time.

In various practical problems, the labeled positive dataset DP may be contaminated by the falsely labeled negative instances. Some instances in DP and DU may have missing values. The PUCs including our proposed approaches and the past approaches are not applicable to such incomplete datasets. These issues remain for future work. Moreover, we proposed our PUCs using the nonparametric estimation. The idea may be extended to discriminative frameworks.

Notes

1 MUP is unknown since MU is unlabeled.

2 We use a line search to seek the best r.

References

[1]	Y. Bengio, O. Delalleau and N.L. Roux, Efficient non-parametric function induction in semi-supervised learning, in Proc. AISTATS05: the 10th International Workshop on Artificial Intelligence and Statistics, (2005) , pp. 96–103.
[2]	G. Blanchard, G. Lee and C. Scott, Semi-supervised novelty detection, J. Machine Learning Research 11: ((2010) ), 2973–3009.
[3]	B. Calvo, I. Inza, P. Larranaga and J.A. Lozano, Wrapper positive bayesian network classifiers, Knowledge and Information Systems 33: ((2010) ), 631–654.
[4]	M.C. du Plessis, G. Niu and M. Sugiyama, Class-prior estimation for learning from positive and unlabeled data, in Proc. ACML15: the 7th Asian Conf. on Machine Learning, vol. 45, (2015) , pp. 221–236.
[5]	M.C. du Plessis and M. Sugiyama, Semi-supervised learning of class balance under class-prior change by distribution matching, Neural Networks 50: ((2014) ), 110–119.
[6]	M.C. du Plessiss, G. Niu and M. Sugiyama, Analysis of learning from positive and unlabeled data, in Proc. NIPS14: Advances in Neural Information Processing Systems, vol. 27, (2014) , pp. 703–711.
[7]	C. Elkan and K. Noto, Learning classifiers from only positive and unlabeled data, in Proc. KDD08: the 14th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, (2008) , pp. 213–220.
[8]	J. Gama, I. Žliobait e, A. Bifet, M. Pechenizkiy and A. Bouchachia, A survey on concept drift adaptation, ACM Computing Surveys (CSUR) 46: ((2014) ), 44:1–44:37.
[9]	S. Hajizadeh, Z. Li, R.P.B.J. Dollevoet and D.M.J. Tax, Evaluating classification performance with only positive and unlabeled samples, in Proc. S+SSPR14: Structural, Syntactic, and Statistical Pattern Recognition, vol. LNCS 8621, (2014) , pp. 233–242.
[10]	N.W. Hengartner and E. Matzner-Lober, Asymptotic unbiased density estimators, ESAIM: Probability and Statistics 13: ((2009) ), 1–14.
[11]	S. Jain, M. White and P. Radivojac, Recovering true classifier performance in positive-unlabeled learning, in Proc. AAAI17: the 31st AAAI Conf. on Artificial Intelligence, (2017) , p. 3060.
[12]	K. Komlos, S. Popovics, T. Nurnbergerova, B. Babal and J.S. Popovics, Ultrasonic pulse velocity test of concrete properties as specified in various standards, Cement and Concrete Composites 18: ((1996) ), 357–364.
[13]	W.S. Lee and B. Liu, Learning with positive and unlabeled examples using weighted logistic regression, in Proc. ICML03: the 20th Int. Conf. on Machine Learning, (2003) .
[14]	A. Lewis, Getdist: Kernel density estimation, github: getdist document, University of Sussex, (2015) . http://cosmologist.info/notes/GetDist.pdf.
[15]	X.-L. Li, P.S. Yu, B. Liu and S.-K. Ng, Positive unlabeled learning for data stream classification, in Proc. SDM09: the 2009 SIAM Int. Conf. on Data Mining, (2009) , pp. 259–270.
[16]	E.A. Marina De Marco, Influence of left ventricular stroke volume on incident heart failure in a population with preserved ejection fraction (from the strong heart study), American Journal of Cardiology 119: ((2017) ), 1047–1052.
[17]	A. Menon, B.V. Rooyen, C.S. Ong and B. Williamson, Learning from corrupted binary labels via class-probability estimation, in Proc. ICML15: the 32nd Int. Conf. on Machine Learning, vol. 37, (2015) , 125–134.
[18]	G. Niu, M.C. du Plessis, T. Sakai, Y. Ma and M. Sugiyama, Theoretical comparisons of positive-unlabeled learning against positive-negative learning, in Proc. NIPS16: Advances in Neural Information Processing Systems, Vol. 29, (2016) , pp. 1199–1207.
[19]	S.J. Pan and Q. Yang, A survey on transfer learning, IEEE Trans. on Knowledge and Data Engineering 22: ((2010) ), 1345–1359.
[20]	D. Pfeffermann, C. Skinner, D.J. Holmes, H. Goldstein and J. Rasbash, Weighting for unequal selection probabilities in multilevel models, J. the Royal Statistical Society. Series B (Statistical Methodology) 60: ((1998) ), 23–40.
[21]	H.G. Ramaswamy, C. Scott and A. Tewari, Mixture proportion estimation via kernel embedding of distributions, in Proc. ICML16: the 33rd Int. Conf. on Machine Learning, vol. 5, (2016) , pp. 2996–3004.
[22]	M. Saerens, P. Latinne and C. Decaestecker, Adjusting the outputs of a classifier to new a priori probabilities: A simple procedure, Neural Computation 14: ((2002) ), 21–41.
[23]	C. Scott, A rate of convergence for mixture proportion estimation, with application to learning from noisy labels, in Proc. AISTATS15: the 18th Int. Conf. on Artificial Intelligence and Statistics, vol. 38, (2015) , pp. 838–846.
[24]	B.W. Silverman, Density Estimation for Statistics and Data Analysis, Chapman & Hall/CRC, (1985) , ch. 3.3 and 43.
[25]	M. Tsutsui, M. Taniguchi, K. Yokota and T. Kawai, Identifying single nucleotides by tunneling current, Nature Nanotechnology 5: ((2010) ), 286–290.
[26]	B. Wang and X. Wang, Bandwidth selection for weighted kernel density estimation, Electronic J. of Statistics ((2007) ). doi: 10.1214/154957804100000000.
[27]	G. Ward, T. Hastie, S. Barry, J. Elith and J.R. Leathwick, Presence-only data and the em algorithm, Biometrics 65: ((2009) ), 554–563.
[28]	T. Washio, G. Imamura and G. Yoshikawa, Machine learning independent of population distributions for measurement, in Proc. DSAA17: the 4th IEEE Int. Conf. on Data Science and Advanced Analytics, (2017) , pp. 212–221.
[29]	X. Zhu, Z. Ghahramani and J. Laffer, Semisupervised learning using gaussian elds and harmonic functions, in Proc. ICML03: the 20th Int. Conf. on Machine Learning, (2003) .