# Find Median from Data Stream

## Question ([LC.295](https://leetcode.com/problems/find-median-from-data-stream/description/))

> Design a data structure that supports the following two operations&#x20;
>
> * void addNum(int num) - Add a integer number from the data stream to the data structure
> * double findMedian() - Return the median of all elements so far

The median is defined as the middle element of an ordered list. If the list is even, the median is the mean of the two middle values.&#x20;

## Example

```
median_finder = MedianFinder()
median_finder.addNum(1)
median_finder.findMedian()  # 1.0
median_finder.addNum(2)
median_finder.findMedian()  # 1.5
median_finder.addNum(3)
median_finder.findMedian()  # 2.0
median_finder.addNum(4)
median_finder.findMedian()  # 2.5
median_finder.addNum(5)
median_finder.findMedian()  # 3.0 

```

## Analysis&#x20;

Insertion sort is good choice to start with&#x20;

* addNum is O(n)&#x20;
* findMedian O(1)

`findMedian()` is already optimal. We can figure out ways to optimize `addNum(num)`.&#x20;

A valuable insight is that we don't have to keep the entire list sorted as long as we know where the median value is.&#x20;

A classic approach is to use 2 heaps. One min heap to keep all the bigger elements. One max heap to keep the smaller elements. As long as we can guarantee, at all times, the smallest element in the min heap is bigger than the biggest element in the max heap and the sizes are about the same (no more than 1), then we can compute the median in O(1).&#x20;

Both `pop()` and `push()` for heaps are O(logn). This will bring the time complexity down to&#x20;

* addNum is O(logn)&#x20;
* findMedian O(1)

A visual of two heaps will help with the understanding.&#x20;

```
self.min_heap: [1]
self.max_heap: []
self.min_heap: [2]
self.max_heap: [-1]
self.min_heap: [2, 3]
self.max_heap: [-1]
self.min_heap: [3, 4]
self.max_heap: [-2, -1]
self.min_heap: [3, 4, 5]
self.max_heap: [-2, -1]
```

## Code&#x20;

```python
import heapq 



class MedianFinder:

    def __init__(self):
        self.min_heap = []
        self.max_heap = []
        

    def addNum(self, num: int) -> None:
        
        # determine which heap to push 
        
        if len(self.min_heap) == 0 or num > self.min_heap[0]:
            heapq.heappush(self.min_heap, num)
        else:
            heapq.heappush(self.max_heap, -num)
        
        # balance if necessary 
        
        if len(self.min_heap) - len(self.max_heap) == 2:
            # pop from min and give to max 
            min_pop = heapq.heappop(self.min_heap)
            heapq.heappush(self.max_heap, -min_pop)
        elif len(self.max_heap) - len(self.min_heap) == 2:
            max_pop = -heapq.heappop(self.max_heap)
            heapq.heappush(self.min_heap, max_pop)
        

    def findMedian(self) -> float:
        
        if len(self.min_heap) < len(self.max_heap):
            return float(-self.max_heap[0])
            
        elif len(self.min_heap) > len(self.max_heap):
            return float(self.min_heap[0])
        else:
            return (self.min_heap[0] + -self.max_heap[0]) / 2.0 
```

## Alternative Solution&#x20;

An alternative data structure to use is a balanced binary search tree with time complexity&#x20;

* addNum is O(logn)&#x20;
* findMedian O(logn)

## Code&#x20;

```python
import sortedcontainers

class MedianFinder:

    def __init__(self):
        self.sorted_list = sortedcontainers.SortedList()
        

    def addNum(self, num: int) -> None:
        self.sorted_list.add(num)
    
    # [1, 2, 3, 4]
    def findMedian(self) -> float:
        n = len(self.sorted_list) 
        
        if n == 0:
            return 0
        
        if n % 2 == 1:
            return float(self.sorted_list[n // 2])
        else:
            return (self.sorted_list[(n // 2) - 1] + self.sorted_list[n // 2]) / 2.0 

```

## Follow Up&#x20;

1. If all integer numbers from the stream are between 0 and 100, how would you optimize it?&#x20;
   1. Use a hashmap to the bucket sort thing&#x20;
   2. Then a total count to see where the median falls&#x20;
   3. Iterating over a constant count is O(1)&#x20;
2. If 99% of all integer numbers from the stream are between 0 and 100, how would you optimize it?
   1. I think the exceptions are what making this interesting&#x20;
   2. We can do a ranged count strategy (i.e. `(-inf, -1], [0, 100], [101, inf)`)
   3. When numbers are large, the median will be between 0 and 100 anyway


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://zedive.gitbook.io/project-l/part-1/basic_data_structure/heap/find-median-from-data-stream.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
