Unions of Intervals

Problem Setup

Let the domain be $X = \mathbb{R}$, and let $C = C_s$ be the class of concepts defined by unions of $s$ intervals. That is, each concept $c$ is defined by real numbers $a_1 \leq b_1 \leq a_2 \leq b_2 \leq \dots \leq a_s \leq b_s$, where $c(x) = 1$ iff $x \in [a_1, b_1] \cup [a_2, b_2] \cup \dots \cup [a_s, b_s]$.

Question

Describe an efficient algorithm that learns the class $C_s$ for every $s$. Assume that $s$ is known ahead of time to the learner; you should describe a single algorithm that works for all $C_s$. Since $s$ is known, the learner can choose the number of examples as a function of $\epsilon$, $\delta$, and $s$. You may use any hypothesis space you wish.

Learning Algorithm

Given a training set $S$, the learning algorithm returns the union of the smallest intervals that contain the positive examples. The algorithm sorts the training examples and scans them in ascending order. When it encounters the first positive label, that point becomes the start of the first interval; the end of the first interval is the last positive example in the ensuing run of consecutive positives. The algorithm then finds the start and end of the second interval in the same way, and so forth.
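Below is a minimal Python sketch of this procedure, assuming the training set is a list of `(x, label)` pairs; the names `learn_intervals` and `predict` are mine, not part of the original description.

```python
from typing import List, Tuple

def learn_intervals(examples: List[Tuple[float, int]]) -> List[Tuple[float, float]]:
    """Return the tightest union of closed intervals covering the positives."""
    intervals = []
    start = None  # left endpoint of the run of positives being built
    end = None    # rightmost positive seen so far in the current run
    for x, label in sorted(examples):  # scan in ascending order of x
        if label == 1:
            if start is None:
                start = x              # first positive opens a new interval
            end = x                    # each further positive extends it
        elif start is not None:
            intervals.append((start, end))  # a negative closes the interval
            start = None
    if start is not None:
        intervals.append((start, end))      # close a trailing interval
    return intervals

def predict(intervals: List[Tuple[float, float]], x: float) -> int:
    """The hypothesis h: label 1 iff x lies in one of the learned intervals."""
    return int(any(a <= x <= b for a, b in intervals))
```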

$h$ is consistent with the training data, but $h$ does not necessarily have $s$ intervals: two adjacent intervals $I_1, I_2$ can merge when $I_1$'s right endpoint and $I_2$'s left endpoint coincide, or when no negative example falls between them. Therefore, the hypothesis space $H$ is not the same as the concept class $C$.
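As a quick illustration with the sketch above (the data values are made up): when no negative example lands between two target intervals, the learner returns a single merged interval.

```python
# Target concept: [0, 1] ∪ [2, 3]. No negative sample falls in the gap (1, 2),
# so the scan sees one unbroken run of positives and merges the two intervals.
data = [(-0.5, 0), (0.1, 1), (0.9, 1), (2.2, 1), (2.8, 1), (3.5, 0)]
print(learn_intervals(data))  # [(0.1, 2.8)]
```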

Proof of PAC Learnability

The generalization error of $h$ is the probability mass of the region where $h$ and $c$ disagree.

Let $c_{l_i}$ be the left boundary and $c_{r_i}$ be the right boundary of the $i$-th interval in $c$, and let $h_{l_i}$, $h_{r_i}$ be the corresponding boundaries in $h$.

For each interval, we have 4 possible bad events, each saying a boundary of $h$ is more than $\frac{\epsilon}{2s}$ of probability mass away from the corresponding boundary of $c$:

1. $L_-$: $h_{l_i}$ is more than $\frac{\epsilon}{2s}$ away from $c_{l_i}$ (to the left)
2. $L_+$: $h_{l_i}$ is more than $\frac{\epsilon}{2s}$ away from $c_{l_i}$ (to the right)
3. $R_-$: $h_{r_i}$ is more than $\frac{\epsilon}{2s}$ away from $c_{r_i}$ (to the left)
4. $R_+$: $h_{r_i}$ is more than $\frac{\epsilon}{2s}$ away from $c_{r_i}$ (to the right)

$$
\begin{aligned}
\Pr[\mathrm{err}_D > \epsilon] & \leq \sum_{i=1}^{s} \Pr[L_- \lor L_+ \lor R_- \lor R_+] \\
& \leq s \cdot \Pr[L_- \lor L_+ \lor R_- \lor R_+] \\
& \leq s \cdot \left(\Pr[L_-] + \Pr[L_+] + \Pr[R_-] + \Pr[R_+]\right) \quad \text{(union bound)} \\
& \leq 4s \left(1 - \frac{\epsilon}{2s}\right)^m \quad \text{(see explanation)} \\
& \leq 4s\, \mathrm{e}^{-\frac{\epsilon m}{2s}} \quad \text{(since } 1 - x \leq \mathrm{e}^{-x}\text{)}
\end{aligned}
$$

Explanation: fix one of the four events, say $L_-$, and consider the region of probability mass $\frac{\epsilon}{2s}$ just beyond the boundary $c_{l_i}$. The event occurs only if all $m$ examples miss this region; since each example is drawn independently at random, this happens with probability at most $(1 - \frac{\epsilon}{2s})^m$.
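In symbols, for a single event such as $L_-$ (writing $B$ for that boundary region of mass $\frac{\epsilon}{2s}$; the notation $B$ is mine):

$$
\Pr[L_-] \leq \Pr[\forall j \leq m : x_j \notin B] = \prod_{j=1}^{m} \left(1 - \Pr[x_j \in B]\right) = \left(1 - \frac{\epsilon}{2s}\right)^m
$$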

Now we set $4s\, \mathrm{e}^{-\frac{\epsilon m}{2s}} \leq \delta$ (because our confidence in the good event must be at least $1 - \delta$) and solve for $m$.

$$
\begin{aligned}
4s\, \mathrm{e}^{-\frac{\epsilon m}{2s}} &\leq \delta \\
\mathrm{e}^{-\frac{\epsilon m}{2s}} &\leq \frac{\delta}{4s} \\
-\frac{\epsilon m}{2s} &\leq \ln\left(\frac{\delta}{4s}\right) \\
\frac{\epsilon m}{2s} &\geq \ln\left(\left(\frac{\delta}{4s}\right)^{-1}\right) = \ln\left(\frac{4s}{\delta}\right) \\
m &\geq \frac{2s}{\epsilon} \ln\left(\frac{4s}{\delta}\right)
\end{aligned}
$$
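As a sanity check, here is a small helper that evaluates this bound (the function name `sample_complexity` is mine):

```python
import math

def sample_complexity(epsilon: float, delta: float, s: int) -> int:
    """Smallest integer m with m >= (2s/epsilon) * ln(4s/delta)."""
    return math.ceil((2 * s / epsilon) * math.log(4 * s / delta))

# Example: s = 3 intervals, epsilon = 0.1, delta = 0.05
print(sample_complexity(0.1, 0.05, 3))  # 329
```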

Thus, this learning algorithm needs at least $m \geq \frac{2s}{\epsilon} \ln(\frac{4s}{\delta})$ examples to ensure that, with probability at least $1 - \delta$, the hypothesis $h$ has error at most $\epsilon$. We have proved that $C$ is PAC-learnable by $H$. QED.

Edge Cases

  • Overlapping intervals: if two or more intervals "stick" together, the algorithm detects fewer than $s$ intervals. That is still fine, because each error region has probability mass at most $\frac{\epsilon}{2s}$.

  • All negative labels: the algorithm then returns the empty hypothesis. If the positive region of $c$ has probability mass greater than $\epsilon$, an all-negative sample occurs with probability at most $(1 - \epsilon)^m < \delta$; otherwise the empty hypothesis already has error at most $\epsilon$.

  • All positive labels: the algorithm detects only one interval in this case. But, again, such a case has very low probability, and we can make it even less likely with a larger sample size.

There are more edge cases not covered here, but these degenerate cases all share the property shown above: they occur only with low probability.

Run-time

Argue briefly that your algorithm runs in time polynomial in $\frac{1}{\epsilon}$, $\frac{1}{\delta}$, and $s$.

Sorting the sample costs $O(m \log m)$, and processing each example during the scan costs constant time (a label check and a comparison), so the running time is polynomial in $m$. We know $m$ is polynomial in $\frac{1}{\epsilon}$, $\frac{1}{\delta}$, and $s$. Therefore, the algorithm runs in time polynomial in $\frac{1}{\epsilon}$, $\frac{1}{\delta}$, and $s$, as required.
