Unions of Intervals

Problem Setup

Let the domain be $X = \mathbb{R}$, and let $C = C_s$ be the class of concepts defined by unions of $s$ intervals. That is, each concept $c$ is defined by real numbers $a_1 \leq b_1 \leq a_2 \leq b_2 \leq \dots \leq a_s \leq b_s$, where $c(x) = 1$ iff $x \in [a_1, b_1] \cup [a_2, b_2] \cup \dots \cup [a_s, b_s]$.

Question

Describe an efficient algorithm that learns the class $C_s$ for every $s$. Assume that $s$ is known ahead of time to the learner; you should describe a single algorithm that works for all $C_s$. Since $s$ is known, the learner can choose the number of examples as a function of $\epsilon$, $\delta$, and $s$. You may use any hypothesis space you wish.

Learning Algorithm

Given a training set $S$, the learning algorithm returns the union of the smallest intervals that contain the positive examples. The algorithm sorts the training examples and scans them in ascending order. When it encounters the first positive label, that point becomes the start of the first interval; the end of the first interval is the last positive example in the ensuing run of consecutive positives. The algorithm then finds the start and end of the second interval in the same way, and so forth.
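Below is a minimal Python sketch of this procedure, assuming the training set is a list of `(x, label)` pairs; the names `learn_intervals` and `predict` are mine, not part of the original description.

```python
from typing import List, Tuple

def learn_intervals(examples: List[Tuple[float, int]]) -> List[Tuple[float, float]]:
    """Return the tightest union of closed intervals covering the positives."""
    intervals = []
    start = None  # left endpoint of the run of positives being built
    end = None    # rightmost positive seen so far in the current run
    for x, label in sorted(examples):  # scan in ascending order of x
        if label == 1:
            if start is None:
                start = x              # first positive opens a new interval
            end = x                    # each further positive extends it
        elif start is not None:
            intervals.append((start, end))  # a negative closes the interval
            start = None
    if start is not None:
        intervals.append((start, end))      # close a trailing interval
    return intervals

def predict(intervals: List[Tuple[float, float]], x: float) -> int:
    """The hypothesis h: label 1 iff x lies in one of the learned intervals."""
    return int(any(a <= x <= b for a, b in intervals))
```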

$h$ is consistent with the training data, but $h$ does not necessarily have $s$ intervals: two adjacent intervals $I_1, I_2$ can merge when $I_1$'s right endpoint and $I_2$'s left endpoint coincide, or when no negative example falls between them. Therefore, the hypothesis space $H$ is not the same as the concept class $C$.
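As a quick illustration with the sketch above (the data values are made up): when no negative example lands between two target intervals, the learner returns a single merged interval.

```python
# Target concept: [0, 1] ∪ [2, 3]. No negative sample falls in the gap (1, 2),
# so the scan sees one unbroken run of positives and merges the two intervals.
data = [(-0.5, 0), (0.1, 1), (0.9, 1), (2.2, 1), (2.8, 1), (3.5, 0)]
print(learn_intervals(data))  # [(0.1, 2.8)]
```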

Proof of PAC Learnability

The generalization error of $h$ is the probability mass of the region where $h$ and $c$ disagree.

Let $c_{l_i}$ be the left boundary and $c_{r_i}$ be the right boundary of the $i$-th interval in $c$, and let $h_{l_i}$, $h_{r_i}$ be the corresponding boundaries in $h$.

For each interval, we have 4 possible bad events, each saying a boundary of $h$ is more than $\frac{\epsilon}{2s}$ of probability mass away from the corresponding boundary of $c$:

1. $L_-$: $h_{l_i}$ is more than $\frac{\epsilon}{2s}$ away from $c_{l_i}$ (to the left)
2. $L_+$: $h_{l_i}$ is more than $\frac{\epsilon}{2s}$ away from $c_{l_i}$ (to the right)
3. $R_-$: $h_{r_i}$ is more than $\frac{\epsilon}{2s}$ away from $c_{r_i}$ (to the left)
4. $R_+$: $h_{r_i}$ is more than $\frac{\epsilon}{2s}$ away from $c_{r_i}$ (to the right)

$$
\begin{aligned}
\Pr[\mathrm{err}_D > \epsilon] & \leq \sum_{i=1}^{s} \Pr[L_- \lor L_+ \lor R_- \lor R_+] \\
& \leq s \cdot \Pr[L_- \lor L_+ \lor R_- \lor R_+] \\
& \leq s \cdot \left(\Pr[L_-] + \Pr[L_+] + \Pr[R_-] + \Pr[R_+]\right) \quad \text{(union bound)} \\
& \leq 4s \left(1 - \frac{\epsilon}{2s}\right)^m \quad \text{(see explanation)} \\
& \leq 4s\, \mathrm{e}^{-\frac{\epsilon m}{2s}} \quad \text{(since } 1 - x \leq \mathrm{e}^{-x}\text{)}
\end{aligned}
$$

Explanation: fix one of the four events, say $L_-$, and consider the region of probability mass $\frac{\epsilon}{2s}$ just beyond the boundary $c_{l_i}$. The event occurs only if all $m$ examples miss this region; since each example is drawn independently at random, this happens with probability at most $(1 - \frac{\epsilon}{2s})^m$.
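In symbols, for a single event such as $L_-$ (writing $B$ for that boundary region of mass $\frac{\epsilon}{2s}$; the notation $B$ is mine):

$$
\Pr[L_-] \leq \Pr[\forall j \leq m : x_j \notin B] = \prod_{j=1}^{m} \left(1 - \Pr[x_j \in B]\right) = \left(1 - \frac{\epsilon}{2s}\right)^m
$$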

Now we set $4s\, \mathrm{e}^{-\frac{\epsilon m}{2s}} \leq \delta$ (because our confidence in the good event must be at least $1 - \delta$) and solve for $m$.

$$
\begin{aligned}
4s\, \mathrm{e}^{-\frac{\epsilon m}{2s}} &\leq \delta \\
\mathrm{e}^{-\frac{\epsilon m}{2s}} &\leq \frac{\delta}{4s} \\
-\frac{\epsilon m}{2s} &\leq \ln\left(\frac{\delta}{4s}\right) \\
\frac{\epsilon m}{2s} &\geq \ln\left(\left(\frac{\delta}{4s}\right)^{-1}\right) = \ln\left(\frac{4s}{\delta}\right) \\
m &\geq \frac{2s}{\epsilon} \ln\left(\frac{4s}{\delta}\right)
\end{aligned}
$$
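As a sanity check, here is a small helper that evaluates this bound (the function name `sample_complexity` is mine):

```python
import math

def sample_complexity(epsilon: float, delta: float, s: int) -> int:
    """Smallest integer m with m >= (2s/epsilon) * ln(4s/delta)."""
    return math.ceil((2 * s / epsilon) * math.log(4 * s / delta))

# Example: s = 3 intervals, epsilon = 0.1, delta = 0.05
print(sample_complexity(0.1, 0.05, 3))  # 329
```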

Thus, this learning algorithm needs at least $m \geq \frac{2s}{\epsilon} \ln(\frac{4s}{\delta})$ examples to ensure that, with probability at least $1 - \delta$, the hypothesis $h$ has error at most $\epsilon$. We have proved that $C$ is PAC-learnable by $H$. QED.

Edge Cases

  • Overlapping intervals: if two or more intervals "stick" together, the algorithm detects fewer than $s$ intervals. That is still fine, because each error region has probability mass at most $\frac{\epsilon}{2s}$.

  • All negative labels: the algorithm then returns the empty hypothesis. If the positive region of $c$ has probability mass greater than $\epsilon$, an all-negative sample occurs with probability at most $(1 - \epsilon)^m < \delta$; otherwise the empty hypothesis already has error at most $\epsilon$.

  • All positive labels: the algorithm detects only one interval in this case. But, again, such a case has very low probability, and we can make it even less likely with a larger sample size.

There are more edge cases not covered here, but these degenerate cases all share the property shown above: they occur only with low probability.

Run-time

Argue briefly that your algorithm runs in time polynomial in $\frac{1}{\epsilon}$, $\frac{1}{\delta}$, and $s$.

Sorting the sample costs $O(m \log m)$, and processing each example during the scan costs constant time (a label check and a comparison), so the running time is polynomial in $m$. We know $m$ is polynomial in $\frac{1}{\epsilon}$, $\frac{1}{\delta}$, and $s$. Therefore, the algorithm runs in time polynomial in $\frac{1}{\epsilon}$, $\frac{1}{\delta}$, and $s$, as required.
