Find Median from Data Stream

Question (LC.295)

Design a data structure that supports the following two operations

  • void addNum(int num) - Add a integer number from the data stream to the data structure

  • double findMedian() - Return the median of all elements so far

The median is defined as the middle element of an ordered list. If the list is even, the median is the mean of the two middle values.

Example

median_finder = MedianFinder()
median_finder.addNum(1)
median_finder.findMedian()  # 1.0
median_finder.addNum(2)
median_finder.findMedian()  # 1.5
median_finder.addNum(3)
median_finder.findMedian()  # 2.0
median_finder.addNum(4)
median_finder.findMedian()  # 2.5
median_finder.addNum(5)
median_finder.findMedian()  # 3.0 

Analysis

Insertion sort is good choice to start with

  • addNum is O(n)

  • findMedian O(1)

findMedian() is already optimal. We can figure out ways to optimize addNum(num).

A valuable insight is that we don't have to keep the entire list sorted as long as we know where the median value is.

A classic approach is to use 2 heaps. One min heap to keep all the bigger elements. One max heap to keep the smaller elements. As long as we can guarantee, at all times, the smallest element in the min heap is bigger than the biggest element in the max heap and the sizes are about the same (no more than 1), then we can compute the median in O(1).

Both pop() and push() for heaps are O(logn). This will bring the time complexity down to

  • addNum is O(logn)

  • findMedian O(1)

A visual of two heaps will help with the understanding.

self.min_heap: [1]
self.max_heap: []
self.min_heap: [2]
self.max_heap: [-1]
self.min_heap: [2, 3]
self.max_heap: [-1]
self.min_heap: [3, 4]
self.max_heap: [-2, -1]
self.min_heap: [3, 4, 5]
self.max_heap: [-2, -1]

Code

import heapq 



class MedianFinder:

    def __init__(self):
        self.min_heap = []
        self.max_heap = []
        

    def addNum(self, num: int) -> None:
        
        # determine which heap to push 
        
        if len(self.min_heap) == 0 or num > self.min_heap[0]:
            heapq.heappush(self.min_heap, num)
        else:
            heapq.heappush(self.max_heap, -num)
        
        # balance if necessary 
        
        if len(self.min_heap) - len(self.max_heap) == 2:
            # pop from min and give to max 
            min_pop = heapq.heappop(self.min_heap)
            heapq.heappush(self.max_heap, -min_pop)
        elif len(self.max_heap) - len(self.min_heap) == 2:
            max_pop = -heapq.heappop(self.max_heap)
            heapq.heappush(self.min_heap, max_pop)
        

    def findMedian(self) -> float:
        
        if len(self.min_heap) < len(self.max_heap):
            return float(-self.max_heap[0])
            
        elif len(self.min_heap) > len(self.max_heap):
            return float(self.min_heap[0])
        else:
            return (self.min_heap[0] + -self.max_heap[0]) / 2.0 

Alternative Solution

An alternative data structure to use is a balanced binary search tree with time complexity

  • addNum is O(logn)

  • findMedian O(logn)

Code

import sortedcontainers

class MedianFinder:

    def __init__(self):
        self.sorted_list = sortedcontainers.SortedList()
        

    def addNum(self, num: int) -> None:
        self.sorted_list.add(num)
    
    # [1, 2, 3, 4]
    def findMedian(self) -> float:
        n = len(self.sorted_list) 
        
        if n == 0:
            return 0
        
        if n % 2 == 1:
            return float(self.sorted_list[n // 2])
        else:
            return (self.sorted_list[(n // 2) - 1] + self.sorted_list[n // 2]) / 2.0 

Follow Up

  1. If all integer numbers from the stream are between 0 and 100, how would you optimize it?

    1. Use a hashmap to the bucket sort thing

    2. Then a total count to see where the median falls

    3. Iterating over a constant count is O(1)

  2. If 99% of all integer numbers from the stream are between 0 and 100, how would you optimize it?

    1. I think the exceptions are what making this interesting

    2. We can do a ranged count strategy (i.e. (-inf, -1], [0, 100], [101, inf))

    3. When numbers are large, the median will be between 0 and 100 anyway

Last updated