LeetCode - The Hard Way - Part 1

Introduction

Leetcode style interview is the most popular interview strategy nowadays, so although engineers rarely need to solve problems like Leetcode in realy work, we need to practice them a lot.

I felt tired and lost my motivation after 200 problems. Thus, I decide to start a fresh approach to solve leetcode problems.

Goals

This series will focus a lot on how we solve the problem in the most optimized solution for each programminng languages. It is not just time or space complexity but also how we optimize for each line of code.

We may also discuss the best practice for each programming language while implementing the solution.

For easier to keep track and more friendly to software engineers, I will use the famous list curated 75 leetcode problems from Blind with extenssion.

Problems

1. Contains Duplicate

Level: Easy

Overall Solution Idea:

To solve this problem, we can start with an obvious solution which uses a hashmap to track and count the numbers with for loop. If a number is counted to 2, we return false otherwise we return true after completing the loop.

Not optimized C++ solution


class Solution {
public:
    bool containsDuplicate(vector<int>& nums) {
        unordered_set<int> dict;
        for(int num: nums) {
            if (dict.find(num) != dict.end()) {
                return true;
            }
            dict.insert(num);
        }
        return false;
    }
};

Few things may cause to the unexpected latency issues:

std::unordered_set starts with a default number of buckets in the hashtable:
- hash table collisions.
- bucket_count - minimal number of buckets to use on initialization. If it is not specified, implementation-defined default value is used. Thus, when the number of distinguished items reachs the limit of container, it will cause to rehash which has the complexity linear O(n) to container case in average or quadratic O(n^2)
- https://cplusplus.com/reference/unordered_set/unordered_set/reserve/
- https://en.cppreference.com/w/cpp/container/unordered_set/unordered_set
- hash key calculation when accessing items.
cost of branching with if statement (https://en.algorithmica.org/hpc/pipelining/branching/)
- https://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-processing-an-unsorted-array

From the above perspectives, our approach should solve the following questions:

How can we improve the performance of std::unordered_set?
How can we optimize branching cost?

During the trip to optimize the unordered_set, there are alternative hash_map libraries used by companies like google, robin hood or hft trading. They have better benchmark performance to std::unordered_set. For more detail, we can follow the talk

Some obvious approach we can have:

reserve a number of items, so we will reduce the chance of rehash
dont use the branch prediction, just push the item into the set.

Optimized C++ solution

class Solution {
public:
    bool containsDuplicate(vector<int>& nums) {
        unordered_set<int> dict;
        dict.reserve(nums.size());
        for(int num: nums) {
            // unlikely to go to this branch for worst case scenario, if it goes to this case we can fail fast.
            if (dict.find(num) != dict.end())  [[unlikely]] {
                return true;
            }
            dict.insert(num);
        }
        return false;
    }
};

For fun, this problem is really short if we solve it in Python but it is hard to optimize python.

class Solution:
    def containsDuplicate(self, nums: List[int]) -> bool:
        return len(set(nums)) != len(nums)

2. Missing Number

Level: Easy

Overall Solution Idea: The first approach is that the missing number is the difference between sum of the array and sum from zero -> size of the array

Not optimized solution

class Solution {
public:
    int missingNumber(vector<int>& nums) {
        int sum = 0;
        int size = nums.size();

        for (int num: nums) {
            sum += num;
        }
        return size * (size + 1)/ 2 - sum;
    }
};

Few things may cause to problems:

if n is at the limit of int, floating point exception with sum calculation

We can use bitwise operation XOR to have the same result. XOR and ADD has the same cost in term of latency instruction_table

Optmized Solution

class Solution {
public:
    int missingNumber(vector<int>& nums) {
        int missing = nums.size();

        for (int i = 0; i < nums.size(); i++) {
            missing ^= (i^nums[i]);
        }
        return missing;
    }
};

Python for fun

class Solution(object):
    def missingNumber(self, nums):
        """
        :type nums: List[int]
        :rtype: int
        """
        missing = len(nums)
        for i in range(len(nums)):
            missing ^= i ^ nums[i]
        return missing

Note

There is really good talk for low latency programming in a HFT firm.

Some cool keywords: Profile Guide Optimization, GCC versions: performance impact and some other cool stuffs
Measurement of low latency systems:
- Profiling: examining what your code is doing (bottlenecks in paticular)
- Benchmarking: timing the speed of your system
- Perfect measurement using low latency hardware
- Software Solution: Google benchmark (but not support for the real environment)