Other FB onsite coding/SDI questions

— Q: design type-ahead i.e. search suggestion. Scalability is key.

— Q: innerProduct2SparseArray (Table aa, Table bb). First you need to design the undefined “Table” class to represent a sparse array.
Then you need to write real code for innerProduct2SparseArray(), assuming the two Tables are already populated according to your definition.

— Q: add2DecimalsSameLength (string decimal1, string decimal2) to return a string version of the sum. Can’t use the python integer to hold the sum, as in java/c++ integers have limited size. You must use string instead.

Aha — carry can only be 0 or 1
I forget to add the last carry as an additional digit beyond the length of the input strings 😦

— Q: add2longBinary(string binaryNum1, string binaryNum2). I told interviewer that my add2BigDecimal solution should work, so we skipped this problem.

— Q: checkRisingFalling(listOfInts) to return 1 for rising, -1 for falling and 0 for neither. Rising means every new num is no lower than previous

— Q: checkDiameter(rootOfBinTree)

Advertisements

anagramIndexOf(): frqTable+slidingWindow

Q: Similar to java indexOf(string pattern, string haystack), determine the earliest index where a permutation of pattern starts.

====analysis

https://github.com/tiger40490/repo1/blob/py1/py/algo_str/anagramIndexOf.py is my tested solution featuring

  • O(H+P) where H and P are the lengths
  • elegant sliding window with frequency

Aha — worst input involves only 2 letters in haystack and pattern. I used to waste time on the “average” input.

On the spot I had no idea, so I told interviewer (pregnant) about brute force solution to generate all P! permutations and look for each one in the haystack

Insight into the structure of the problem — the pattern can be represented by a frequency table, therefore, this string search is easier than regex!

Then I came up with frequency table constructed for the pattern. Then I explained checking each position in haystack. Interviewer was fine with O(H*P) so I implemented it fully, with only 5 minutes left. Basically implementation was presumably slower than other candidates.

A few hours later, I realized there’s an obvious linear-time sliding window solution, but it would have taken more than the available time to implement. Interviewer didn’t hint at all there was a linear time solution.

— difficulty and intensity

I remember feeling extremely intense and extreme pressure after the interview, because of the initial panic. So on the spot I did my best. I improved over the brute force solution after calming down.

Many string search problems require DP or recursion-in-loop, and would be too challenging for me. This problem was not obvious and not easy for me, so I did my best.

I didn’t know how to deep clone a dict. The Indeed.com interviewers would probably consider that a serious weakness.

iterate K pre-sorted uneven immutable lists #FB

Interviewer (Richard) was kind enough to offer me a tip early enough. so we didn’t waste time (which could easily result in out-of-time)

Q: given K pre-sorted immutable lists, each up to N items, return an iterator that on demand yields each of the (up to K*N) items in sorted sequence.

Estimate time and space complexities.

====analysis

— I first proposed pair-wise merge. Since there are logK passes, Time complexity == O(NK logK)

Space complexity is tricky. Very first pass i have to create a single list up to NK items. Then I can reuse this list in each merge. so space complexity == NK [1], but I said NK*logK. Interviewer wasn’t familiar with this solution and didn’t correct me.

[1] See https://www.geeksforgeeks.org/sort-array-two-halves-sorted/. Suppose 8 lists to merge. I will merge A+B into first quarter of the big array (size NK), then C+D into 2nd quarter… In next pass, I will merge AB + CD in-place using the first half of my big array.

The advantage of this solution — once I create a single merged array, each iteration is O(1). This can be valuable if run-time iteration speed is crucial but initial warm-up speed is unimportant.

bigO insight — merging N pre-sorted arrays is O(N logN), same as merge sort?

— Then interviewer suggested iterating over the K lists so I implemented the solution in https://github.com/tiger40490/repo1/blob/py1/py/88miscIVQ/itr_K_presortedLists_FB.py

  • Space complexity = K
  • Time complexity:
    • next() O(logK) due to popping. I said Fib heap has O(1) insert
    • init() O(K)
    • hasNext() O(1)

How about a RBTree instead of heap? Space complexity is unchanged.  Time complexity:

  • next() O(logK) for both remove and insert
  • init()  O(K logK), worse than priorityQ
  • hasNext() O(1)

— FB interviewer asked why I prefer to keep state in global var rather than a per-record field

%%A: Runtime heap allocation is slower if the record is bigger. In contrast, the global dictionary is tiny and likely lives in L1-cache

max rectangle formed us`any 4points#Google#Rahul 50%

Q (Google onsite): given N Points defined by x,y coordinates, find the largest possible rectangle formed with four corner points drawn from the N points.

====analysis

worst input? many points line up along the north and west of a rectangle?

From N points, we get (roughly) NN line segments, actual count is up N(N-1)/2.

Any 3 points forming a right-angle corner is represented as a Corner object with 5 fields (or more)

  • 3 Points: clockwise start point, corner point, clockwise end point
  • 2 line segments

A FloatingLineSegment object has 2 fields {length, slope}

Any 2 line segments of equal length and slope forming a parallelogram is represented as a Para object having fields (or more):

  • 4 Points
  • 4 line segments
  • 2 FLSs
  • a single point representing the central intersection

Each line segment has one FLS and one slope. So we scan the NN line segments to populate some hashmaps.

A hashmap of {FLS -> set of lineSegments}

A hashmap of {slope -> set of FLS }. From any one slope value, we can look up the FLSs of that slope + the FLSs of the opposite slope.

Scan the NN line segments. For each line segment L1, compute the opposite slope. Use slope hashmap to get the FLSs. Use these FLSs to get the line segments perpendicular to L1. Discard any of those line segments not joining L1. This is how to populate .. A hashmap of {lineSegment -> set of Corners}

Scan the NN line segments. For each line segment L1, compute FLS. Use the FLS hashmap to get the other line segments. This is how to populate … A hashmap of {lineSegment -> set of Paras}

Are there more Corners or more Paras? I hope at least one of them is O(NN)

(From NN line segments, we could get a large number of Corners objects. I think the count of Corners is too big if all points line up along some big rectangle. I feel we should start from one corner and try all possible rectangles rooted there, then remove this corner from the search space.)

Solution 1: At any point C1, there are N-1 relevant line segments. Using a hashmap {slope ->?}, it takes up to O(NN) time to identify all Corners at C1. If N-1 is 8, then at most 4×4 such Corners, so worst case O(NN) such corners. For each Corner, check if the opposite corner point exists. This check is O(1). This entire process at C1 takes O(NN).

Overall O(NNN) since there are N points like C1. Average case is much better.

I feel O(NN logN) is possible.

— O(NNN) by Rahul — pick any 3 points, check if they form a corner. If yes, compute the opposite corner and check its existence.

–Idea 3: for the N(N-1)/2 line segments, sort by slope, length, lowest x-coordinate, lowest y-coordinate…?
sorting takes O(NN logN)

–idea 4: group N(N-1)/2 line segments by slope and length.

Within each group, there are up to N points and O(NN) line segments.

— O(NN) solution … wrong!

Pick any 2 points as a pair-of-opposite-corners. Work out the locations of the other 2 corners and check if they exist. This check is O(1).

There are O(NN) such pairs, so overall O(NN).

 

hash table using linear probing

Linear probing is less popular than separate chaining but has comparable O().

More important, it is easier to implement in speed coding.

https://github.com/tiger40490/repo1/blob/py1/py/88miscIVQ/re_hash_table_LinearProbing.py is my implementation

— vacated bucket

In the absence of delete(), linear probing can stop when hitting an empty slot. In put(), this means lucky. In get() this means input key is invalid.

In the presence of delete(), an empty slot could be a vacated bucket. In put(), this means lucky. In get(), this is ambiguous, so we may need to put a sentinel into such a bucket. A bucket with sentinel is treated as occupied.

spreadsheet concretize #Junli Part2

Note the java algo is event-queue based — every newly concretized cell is an event added to the event queue. When we encounter this cell again after a dequeue, All registered dependents of this cell are checked. If the check results in a new cell concretized, this cell is enqueued.

In contrast, my c++ algo is a modified BFT. Key idea is, whenever a branch node can’t be concretized (due to an unresolved upstream reference) we basically ignore that node’s subtree. The other root nodes’s BFT would eventually visit this node, unless there’s a cycle.

I believe both algorithms are relatively easy to visualize at a high level. Which algo implementation is more tricky and error-prone? I guess the BFT but not really sure.

— Topo-sort — “topological sorting” is the reusable general technique for similar problems like even scheduling. As I described to Kyle, the idea is “Given a directed graph, assign (artificial) integer ranks to all nodes so that every arrow is from a low rank to a high rank”

There are linear time algorithms to assign the ranks. I think some form of BFT may work… need more thinking.

I think it depends on what’s more natural — start from leaf nodes or start from root nodes. The start level would use lower integers.

For a typical spreadsheet, I feel it’s natural to start from nodes that have no downstream.

My c++ implementation was similar to Kahn’s algorithm.

[[Algorithms]] P 550 presents an elegant DFT  algo but not so intuitive to me yet. I think it DFT can’t solve this spreadsheet.

–Some additional notes on the c++ implementation

  • 🙂 I encountered much few seg-faults than in other projects. I think it’s because very few arrays (including vectors) are used.
  • Before I look up a map/set, I always use count() to verify.
  • 🙂 I didn’t need to check against end() as lower_bound() and find() functions would require.
  • no smart ptr needed. container of raw ptr worked well. See source code comments
  • In fact, container of cell names (as strings) is even “safer”.

min-cost partitioning #c++Flex #rare

Q: You will be given a natural number array and a threshold value T. The threshold represents the maximum length of subarrays that may be created for the challenge. Each sub-array you create has a cost equal to maximum integer within the sub-array. Your challenge is to partition the entire array into sub-arrays no longer than the threshold, and do it at minimum cost.

Function Description
Complete the function calculateCost in the editor below. The function must return an integer denoting the minimum cost of partitioning the array.

calculateCost has the following parameter(s):
a[a[0],…a[n-1]]: the integer array to be divided into sub-arrays
k: the threshold value, i.e the maximum size of any sub-array

Constraints
• 1 ≤ n ≤ 5000
• 1 ≤ k ≤ 500
• 1 ≤ a[i] ≤ 100000

For example, for T=2 and original array {1,5,2}, you have two ways to partition it:

  • {1} {5,2} total cost = 1 + 5 = 6 (this is lowest cost)
  • {1,5} {2} total cost = 5 + 2 = 7

— My greedy AlgoAA:

Update: thanks to XR here is an edge case to break AlgoAA: {49,50,99,0,98}

I will use the terms “group” and “subarray” interchangeably. A lone wolf is a group of one node.

I would first identify the global peak value, like 99. Should this node be a lone wolf? No. I can prove that it should “absorb” a neighbor node and become a subarray of two [1]. Should it absorb a 3rd node? I think I can again prove that it should. Therefore my greedy algorithm would first create a subarray of size K around the peak, leaving behind a left segment (and also a right segment), where we apply the same greedy algorithm.

[1] my informal proof — suppose the left neighbor has value 6 and is a loan wolf in the final grouping. We can improve this final grouping by merging this node with the peak. Total cost would reduce by 6. In another scenario suppose this node (value 6) is within subarray #12. Again, we can break up subarray #12, move out this “6” and merge it with the peak, without breaking any rule or increasing total cost.

So what algorithm to create the first subarray around the peak? Let’s assume k=3. There are up to 3 candidate groups, since the peak can be the first node, 2nd node or last node in its group. We can use a sliding window (of width 3) to identify the best among the candidates.

Q: why start from the peak not start from end of the array?
A: If you do, you may separate 2nd highest node from the peak, when they are adjacent. My AlgoAA would identify this situation early on, and put them in the same group.

— My greedy AlgoBB:

Each time after the window slide, we will compare the new window with the best window so far. The comparison is first based on the 2nd highest value in the window. If tied, then compare 3rd highest value in the window..

I think this is not hard to implement — convert each window to a heap then compare top to bottom.

https://github.com/tiger40490/repo1/blob/cpp1/cpp/array/minCostPartition_Flex.cpp is a briefly tested implementation .. 60% confident.

movie query #SIG

Like the other coding question from the same employer, this is mostly about nested data structure.

Q1: Given sample csv file below (unsorted), implement an efficient query function get_movies(genre, Start_year, End_year). Optimize for query, not initial loading. Bonus points if you make it work within 75 minutes.

  #movieId,title,date,genres
  1,Toy Story 3,2010,Adventure|Animation|Children|Comedy|Fantasy
  2,Toy Story,1995,Adventure|Animation|Children|Comedy|Fantasy
  ...... 20,000 more lines

https://github.com/tiger40490/repo1/blob/py1/py/movieQuery_SIG.py is my working code, good enough for a small data set like movies, but inefficient for bigger data set like newspaper articles.

Q2: what if the number of genres could be thousands, like a tag cloud.

— Q1 analysis
My initial thought was on binary search, with O(logN) + O(E-S) which is often equivalent to O(N). Locate first and last item in a sorted list, then iterate over enclosed items.

After some brief discussion with interviewer, I came up with a design of O(1) + O(E-S), arguably the fastest, featuring a jagged 2D array:

  • Assumption: genres are shared and shows up repeatedly in the file, rather than random strings that may show up only once.
  • use contiguous array to hold all movies, array indexed by year.
  • Within each “year bucket”, store multiple movies in a contiguous array indexed by genreId.
    • Note there is no gap in the genreId /numbers/. I initially said hardcoded genre enum but interview (Ken) challenged me with frequently changing genres. I now feel we can generate genreId as we load the csv file. This is comparable to some perfect-hashing described in Q2 analysis below.
  • each “cell” holds a list-of-shared_ptr-to-movie
  • each movie object is on heap, and owned by shared_ptr’s

— Q2 analysis:
If tags are user-generated, random and uncoordinated, then the total population of tags could be unlimited. Our genreId would go up to millions. Solution? Perhaps use a separate hash table within each “year bucket” instead.

However, since the query is “by tag” it implies that the most important tags are shared.

At this point, I feel design should depend on the real, not imaginary, context, so it’s kind of pointless to speculate what type of tags we will get.

max palindrome substring

https://leetcode.com/problems/longest-palindromic-substring/ seems to be the same problem.

Deepak CM received this problem in a real IV.

https://github.com/tiger40490/repo1/blob/py1/py/str/longestPalindrome.py is one solution, not O(N) at all.

— linear time solutions are on wikipedia, but probably not so intuitive so I give up.

— my simple O(NN) solution 1. first {O(N)} identify each “core” which is defined as

  • “ABA”
  • at least 2 count of the same char like AA

Then {O(N)} for each core, scan both ways in lock steps until we see a mismatch. Then we know the length of this palindrome.

https://leetcode.com/problems/longest-palindromic-substring/solution/ shows a O(NN) DP solution

— my one-scan (original) idea 2, but now I feel unsure.

We stop at every character on our forward scan. When we encounter any seed, we need to keep growing it, as one of these seeds will surely grow into the longest palindrome. However, how do we simultaneously grow so many seeds? We won’t due to efficiency.

Instead, I grow the earliest (oldest) seed only. Any seed encountered afterwards will be shorter , that is until the oldest seed stops growing and gets archived. After the archiving, I start a “manhunt’ — I look at the next oldest [1] seed and determine if it can grow to the current position. If it can’t then it is surely inferior to the just-archived oldest. If it can, then we end the manhunt right there and keep scanning forward

Guarantee to find it — every seed is queued. The only way for a seed to get archived is if it stops growing i.e. we have computed it’s full length

[1] FIFO container is needed

One of the worst test cases is a long black/white string containing only length-1 palindromes. My algo would archive many short seeds… and achieves O(N)

milePerGallon #DeepakCM SIG

Similar to GS Tick Query question but without optimization requirements.

–Requirement: Your Choice of Editor, OS, Compiler & Programming Language. Total time allotted for this  test is 75 mins, The further rounds of Interview will be based on this programming test.

A group of N people go for driving. When ever they refill the fuel in their vehicles, they update a text file with the following fields

<Person>, <Car>, <No of Miles driven on that day, before the fill>, <NO of Gallons filled i.e. gallons used for those miles>,<Date of fill>

All the fields are comma separated. A sample such file is given below

John, Ford, 350, 20, 20160921
John, Ford, 220, 13, 20160920
John, Ford, 230, 35, 20160918
John, Ford, 300, 22.5, 20161112
Jonathan, Mercedes GLA, 220, 13, 20160920
Mishal, Ford Escort, 230, 35, 20160919

Write a function with the following parameters
GetMPG(name_of_person, Start_date, End_date)
such that it should return the following information for each User/driver
1) Name of Car
2) Miles per Gallon for the mentioned period for each of the Car

A person can have more than one car but never share his cars. If there is no record for the mentioned date range, the function should not return anything for that specific car.

–analysis

https://github.com/tiger40490/repo1/blob/py1/py/milePerGallon.py is my python solution.

LRU cache #Part 1

I believe my friend got this question in a bbg phone interview.

Design and implement a data structure for Least Recently Used (LRU) cache. It should support the following operations:

get(key) – Get the value (will always be positive) of the key if the key exists in the cache, otherwise return -1.
put(key, value) – Set or insert the value if the key is not already present. When the cache reached its capacity, it should invalidate the least recently used item before inserting a new item.

Follow up:
Could you do both operations in O(1) time complexity?

==Analysis==

Hash table to support lookup. The “value” is a pointer to a link node. Link node also has key/value — 2-way linkage.

slist is a FIFO and grows at tail for every new key/value pair, so head is the earliest pair. Every time a key/value is accessed via hash table, we move the node to the tail.

When capacity is reached, we would remove the tail node. Using the key in that node, we also remove from hash table.

generate paths in ternary matrix #Promethean TomJerry

–Requirement:

Similar to binary-matrix cluster count #DeepakM #DST+fullScan but the connection is modeled differently.

–Original requirements: Input matrix has an object in each cell

  • 1 means wall or barrier
  • 0 means passage
  • 2 means cheese station

Jerry is [x][y]. Tom starts at [0][0] and must collect all cheese and deliver to Jerry. Find shortest path. Tom can only move one (U/D/L/R) step at a time.

https://github.com/tiger40490/repo1/blob/cpp1/cpp1/2d/TomJerry.cpp is my half-finished code. It builds a graph connecting all the cells (each a Node object). It’s able to check feasibility. Todo:

  • breadth-first walk from Tom to Jerry, without revisiting any node. Hopefully we can discover all possible paths (i.e. collecting all cheese)
  • determine the shortest.

 

tokens shared among friends #Promethean

See also NxN matrix for graph of N nodes

Requirement: There are N friends numbered from 1 to N. There are M pairs of links, where each (x , y ) pair is connected by a shared integer token described by tokenId. Any two friends, x and y , can be connected directly by multiple tokens, or indirectly (without directly shared token) because if friends x and y share token t and friends y and z also share token t , then x and z are also said to share token t.

Note if x/y shares t and u/v share t, then x and u may be unconnected!

Find the maximal product of x and y for any directly or indirectly connected (x , y ) pair such that x and y share the maximal number of tokens with each other. If x/y have 3 tokens connecting them, and u/v also have 3 tokens, then we compare x*y vs u*v.

https://github.com/tiger40490/repo1/blob/cpp1/cpp1/miscIVQ/tokenLinked_Friend.cpp showcasing nested containers like map<int, list<set<int>>>. I believe STL is harder than java, python etc.

construct graph from list of connections #BGC java

Given an input file showing a list of {string, string} pairs, build a connection graph.

If you have a correct connection graph, then you can easily determine the connectedness (bool) of any 2 nodes. In a social-network, this bool flag indicates whether 2 individuals are completely unconnected or somehow connected.

—-analysis:
I see this as a social-network. Any pair represents an edge connecting 2 nodes.  At any time there are a number of disconnected islands. The next pair could 1) merge 2 islands or 2) add a node to an existing island or 3) create a new island 4) do nothing, if the 2 nodes are already in some existing island

  • Any known node appears exactly once in the entire graph, in exactly one of the islands.
  • All nodes are contained in a lookup table or hashmap  {node -> island}
  • Each island can be a implemented as a hashset of nodes.

So here’s a proposed algo to process a new pair {A, B}. Look for A and B in the  graph. 3 scenarios + a dummy scenario:

  • (Scenario 3) If both A an B are new comers, then they form a new island.
  • if both A and B are already in the graph,
    • (Scenario 4) if they are in the same island, then exit. Nothing to do
    • (Scenario 1) else we can merge the 2 islands
  • (Scenario 2) If A is in island 3 but B is new comer, then B joins island 3

The merge operation is expensive. The big lookup table needs update but here’s an alternative:

  • At merge time, the smaller island would have all the nodes moved to the bigger island. When the island is empty, it gets a pointer “this.redirect” to the bigger island.
  • lookup table needs no update, avoiding locking a global object.
  • At query time, we look up the table to get the original island, then we follow its pointer (defaults to null) until the island is non-empty.
  • endless loop? would only be a programming error.

c# cod`IV – struct/disconnected

–req#0–
O(1): Insert(Employee e)
O(1): Delete(int id)
O(1): LookupById(int id)
O(1): LookupByEmail(string email)
–req#1–
Must not corrupt:
EmployeeLookup el = new EmployeeLookup();
Employee e = new Employee(1, “one@example.com“);
el.Insert(e);
e.Id = 5;
MUST BE TRUE: el.LookupById(1).Id == 1
MUST BE FALSE: e.Id = el.LookupbyId(1).Id
e = el.LookupById(1);
e.Id = 10;
MUST BE TRUE: el.LookupById(1).Id == 1
MUST BE FALSE: e.Id = el.LookupbyId(1).Id
—– my solution:
public struct Employee{
public int id;
public string em;
}
public class EL where T: struct {//EmployeeLookup
Dictionary d1 = new…;
Dictionary d2 = new…;
 //d1 add and d2 add should both fail or both succeed
bool Insert(T emp){//insert into d1/d2
if (!d1.TryAdd(emp.id, emp)) return false;
if (!d2.TryAdd(emp.email, emp)) {
d1.Remove(emp.id); //rollback
return false;
}
return true;
}
Employee //struct
LookupById(int id){
Employee found;
if (!d1.TryGetValue(id, out found)){//log and throw….
}
return found; //struct
}
}

spreadsheet concretize #Junli Part1

Q: A spreadsheet consists of a two-dimensional array of cells, labelled A1, A2, etc. Rows are identified using letters, columns by numbers. Each cell contains either a numeric value or an expression. Expressions contain numbers, cell references, and any combination of ‘+’, ‘-‘, ‘*’, ‘/’ (4 basic operators). Task: Compute all RPN expressions, or point out one of the Cyclic dependency paths.

——— Here is “our” design ———-
I feel it’s unwise to start by detecting circles. Better concretize as many cells as possible in Phase 1.

* First pass — construct all the RPN objects and corresponding cell objects. An RPN holds all the concrete or symbolic tokens. A cell has an rpn and also a cell name. If a cell is completely concrete, then calculate the result, and add the cell to a FIFO queue.

Also construct a p2d or precedent2dependent<Name, Set > map. It’s a look-up map of <Name, set > This will help us fire update events. If you wonder why use Name. In this context, name is a unique identifier for a cell. I use a simple hash map.

* 2nd pass — process the queue of concrete cells. For every cell removed from the queue, get its name and concrete value into a pair (call it ppair since it’s a Precedent). Look up p2d to get all dependents. Fire update events by feeding the ppair to each dependent cell, which will use the ppair to concretize (part of) its expression. If any dependent cell gets fully concretized, add it to the queue.

Remember to remove the ppair cell from p2d.

Only 2 data structures needed — queue and p2d.

* Phase 1 over when queue depletes. If p2d is still non-empty, we have cyclic dependency (Phase 2). All concrete cells have been “applied” on the spreadsheet yet some cells still reference other cells.

All remaining cells are guaranteed to be involved in some circle(s). To print out one circle, just start from any cell and follow first link and you are bound to hit the starting point.

implement a thread-safe queue using 2 stacks

Your idea is basically

void enqueue(Element e){
    this->stack1.push(e);
}

Element dequeue(){
    if (this->stack2.size()){
      return this->stack2.pop();
    }
    this->transferFromStack1to2();
    if ( ! this->stack2.size()){
        //throw exception or return a special value indicating QUEUE_EMPTY;
    }
    return this->stack2.pop();
}
void transferFromStack1to2(){
    assert this->stack2.size() == 0;
    //get locks on Stack1 and Stack2
    while(this->stack1.size()){
       this->stack2.push(this->stack1.pop()); 
    }
    //release locks
}

We can improve performance a bit with a proactively transfer — as soon as stack2 becomes empty. However, this must be done on another thread.