how many ways to decode #60%

Q(Leetcode 91): A message containing letters from A-Z is being encoded to numbers using the following mapping:

‘A’ -> 1, ‘B’ -> 2, … ‘Z’ -> 26
Given a non-empty string containing only digits, determine the total number of ways to decode it.

====analysis

I think this is similar to the punctuation problem.

–my botup solution

At each position in the string, keep a “score” number that represents “how many ways to decode a left-subtring ending here”

Useful — define my convenient jargon: we will say the encoding 10 to 26 are “high letters”, and the encoding 1 to 9 are “low letters”. If there are 95 ways to decode a string, i will call them 95 “formulas”.

At Position 33, i will look at score[31] (say equal to 95) and score[32] (say, equal to 97). if the two-char substring str[32:34] is between 10 and 26, then score[33] should include the 95 ways to decode str[:32]. Those 95 “formulas” can grow one high letter.

If str[33] is not ‘0’, then score[33] should also include the 97 ways to decode str[:33], because those 97 “formulas” can grow one low letter.

Aha — The 95 and the 97 formulas are all distinct because of the ending letter

I think we only need two variables to hold the previous two scores, but it’s easier to code with a score[] array.

K-palindrome #careercup 20%

Q (Facebook India, careercup): A k-palindrome is a string which transforms into a palindrome on removing at most k characters. Implement check(str, k). str has at most 20,000 characters. 0<=k<=30

Sample Test Case: Input : abdxa 1 -> false
====analysis
It’s crucial to zoom into the worst input early on — binary string with the correct “kills” well-dispersed.

Insight — once we group the chars by ascii code, I can see a successful palindrome is really an A-palindrome interleaved with a B-palindrome and C-palindrome…

This problem is related to the (tougher) problem of “max palindrome sub-sequence”

Eg k=33.
–idea 3: per-club distribution table
each char gets a dedicated club/distro, where we record all positions occupied by this char.
Since K is small, then every distro should be almost “balanced” around the midpoint.
— idea 7
There can be only up to 33 possible midpoints, including those in-between. (If all kills hit the upper end, then midpoint drops by 33/2.) Let’s try each possible midpoint.

Insight — once we pick a midpoint, we know how many we can kill above/below. For example, if we pick the highest midpoint, then all kills must hit lower half. The extreme midpoints are easier due to stringent constraints.

If we pick a central midpoint, then we can kill about 33/2 in each half, but the kills must match — nice constraint.

For each midpoint picked, we run Algo B:
At init-time, we compute the defects in each club/distro using Algo B1:
given the midpoint and the positions of all J’s, the defect count is the number of kills (killing a J or non-J) to balance J’s distro…

We will compile all 26 defect counts.  Now we start growing our seed at the chose midpoint. We expand both ways. If next 2 positions belong to different clubs J vs L, we evaluate the two kill options using Algo B2
One of the two choices will reduce overall defects. We first look at J’s distro. The J kill would improve or worsen J club’s defect.

If there’s an efficient way to compare evaluate the two kills or update the default counts, then we have a decent solution.

–idea 9: frq table. If there are 35 odd clubs, then 33 of them can become even but still 2 odd clubs –> immediately failure. However, This idea fails with the binary string.

wildcard matching #more tractable than regex_FB

Q (Leetcode44): Given an input string (s) and a pattern (p), implement wildcard pattern matching with support for ‘?’ and ‘*’. The matching should cover the entire input string (not partial).

‘?’ Matches any single character.
‘*’ Matches any sequence of characters (including the empty sequence). The star is not a quantifier. I think a single star is like the “.*” in perl.

no space allowed. I think a wildcard can even sit at end of string. I think we could see two wildcards in a row. Two stars can be treated as one star.

=====analysis

My topdn-memoization solution (on github) was accepted at Leetcode.

I feel this is simpler than the regex_FB.py problem, therefore easier to absorb. I think either DP or my original recursive solution should work.

suffix array of a haystack #learning notes

For an important usage, see find All hiding places of needle]haystack #suffixArray

The enhanced suffix array come with additional tables that reproduce the full functionality of suffix trees. The basic/standard suffix array is a construct simplified from a suffix tree.

— relationship between

  • Suffix array — saves the lexicographic rank of each suffix of a haystack.
  • LCP array —- Contains the maximum length of prefix match between two consecutive suffixes, after they are sorted (as in a dictionary) and stored in the suffix array.

Each array element describes one suffix.

Both are integer arrays of length N (i.e. haystack length). LCP saves some lengths; Suffix array saves head positions within haystack

By definition, the LCP is always “helper” for the suffix array. In contrast, the suffix array can be useful by itself.

find All hid`places@needle]haystack #suffixArray

Now I think suffix array is probably the best solution. Suffix array can be built in O(N) and can help any needle1, needle2, needle3 etc. This O(N) cost of pre-processing is small if searched repeatedly.

The Aha — this array is sorted by the string content. Therefore, all the “similar” suffixes club together.

Using the example in https://en.wikipedia.org/wiki/Suffix_array#Applications, if I want all matches for needle “AN” in haystack BANANA, using the suffix array, even a simple ( inefficient ) linear scan (over the array) would find AN, ANANA in a “cluster”. In general, O(logN) binary search is good.

This usage requires only suffix array, but many other usage scenarios require the helper LCP array.

Note the preprocessing is on the haystack not the needle.

Note building the suffix array is an one-time investment , for repeated usage. If used only once, then total cost is O(N) to build and O(logN) to use this array. I think it is comparable to, not worse than, brute-force.

concretize asterisk among brackets #Okao

https://leetcode.com/problems/valid-parenthesis-string/

Q (Leetcode 678): Given a string containing only three types of characters: ‘(‘, ‘)’ and ‘*’, write a function to check whether this string is valid.  An asterisk can concretize to empty string or a left or right bracket

====analysis

Kyle reviewed an O(N) solution, but advanced. Not really a “Medium” question if you want optimal. However in an interview perhaps a less optimized solution is good enough?

Worst data set will point out the constraint and structure, as explained in CIV: we wish all corner cases+worst data sets r given and clarify_requirement means

Need to use spreadsheet to work out a good sample data.

I feel confident to crack it if I get a good sample including some worst data.

— published O(N) solution

One scan. At each position ask the question “How many extra openers can there be up to this char” When the answer is negative, we give up and return false, but usually the answer is s a range like 0-2.

Use two global variables to record the lowest answer, and highest possible answer.

  • When we hit a ‘(‘, lo++ and hi++
  • when we hit a ‘)’, lo– and hi–
  • when we hit a ‘*’, then range expands,
    • lo—. Lowest possible extra opener count is now one lower , since the asterisk can be a closer
    • high++. Highest possible extra opener count is now one higher , since the asterisk can be an opener

I think at end of the scan, lo should be zero, i.e. zero-extra-opener is at least a possibility.

(Possibly reusable) Insight — The Range of possible values as a simple data structure. Might be reusable whenever we are asked “Can this be ….?”

given any string, generate splits into palindromes

https://leetcode.com/problems/palindrome-partitioning/

Q: Given a string s, partition s such that every substring of the partition is a palindrome. Return all possible palindrome partitioning of s.

====analysis

I found this question quite hard and had no confidence, .. overwhelmed, like manyDP problems. But then I came up with two DP solutions .. https://github.com/tiger40490/repo1/blob/py1/py/algo_str/genPalindromSplits.py

I don’t bother to run the leetcode tests as they tend to deflate my burning joy, absorbency… precious stuff.

insight — the word “partition” is horribly confusing. It can mean two very important entities in the problem domain — either a palindrome substring or a complete family of substrings that form the original word. It’s crucial to avoid this word

insight — challenge is not O() but a working solution that generates all splits without repetition

–idea 1: recursive top-down DP

memoization is not easy since I used generator.

–idea 2: iterative bottom-up DP

invalid/unbalanced brackets: kernel #62%

Q: given a string of N left+right brackets, you can find out how many invalid chars to remove to make it balanced. Say it’s R. There are up to N-choose-R ways to make it balanced. Print all unique balanced strings in compact form.

====analysis

subproblem: minimum how many invalid chars to remove

Useful preprocessing technique — on the left end, any closers must be removed. Same on the right end. No choice 🙂 We would end up with a valid char on each end. This is nice but optional in my Idea3 below.

— my idea 3 to count minimum cuts

Aha — There will surely be some “kernels” i.e. opener-then-closer in a row. First scan I will remove them, then remove more kernels. This is always optimal if we aim to minimize the cuts

  • [] ] [ [] ] [ [] ] becomes ]
  • []][[] becomes ][
  • [] [ [] [] [] [] becomes [
  • [[[][]][[]] becomes [
  • [[][]]][[]]][[] becomes ]][

What remain are the positions of bad chars. I need to remember these positions.

Case: closers only. Let’s look at one position like #55. We can cut #55 or another closer at an earlier position.

Case: openers only. Similar to the above.

Case: closers-openers. The original string is partitioned into exactly two sections, each similar to the above cases.

palindrome problems: patterns

Palindromes are … kinda rare but popular in coding interviews because they are easy to understand and unambiguous.

Most of the Palindrome problems are tough, and often require AuxDS.

You can move two pointers inward, or
You can move two pointers outward from the center if you are confident about the center ( or lower+upper bounds for the center’s location)

find longest full path #Rahul

Q: A single string encodes a directory tree. Find the longest full path to a leaf node. For example, “a/222/DDDD” is the longest path in “a\n\t33\n\t\tBB\n\t\tCC\n\t\t\tII\n\t222\n\t\tDDDD”

a
  33
    BBB
    CC
      II
  222
    DDDD

==== analysis

— solution 1: two-scan. O(Number of nodes)

Aha — looks like a pre-order output, so we can build the tree easily.

Each node takes a whole line; indent count indicates d2root. In 2nd scan, we walk the tree in either BFT or pre-order DFT. Each node will use a field to remember pathLengthTillHere.

— solution 2: one-pass. Same O(). Read the input string as a pre-order output.

Aha — No need to build any tree. Just use use a stack to keep the current nodes’ ancestor tree nodes. (Rahul said recursion might be easier).

Same pathLengthTillHere.

longest substring+!repeating chars #60%#peek

Q(leetcode #3): Given a string, find the longest substring without repeating characters.

–Sol1 O(N):
keep a never-shrinking sliding window + a “hashmap” of chars in it. Actually this HM can be a 26-element integer array of frequencies.

Every time the lagging edge of the windows moves by one, by definition one char drops out, so we remove that char from the HM, by decrementing its frequency. If hitting 0 then we also decrement a global var uniqCnt := sizeof the HM.

IFF uniqCnt == windowSz then window is a clean.

Every time we see a clean window and it’s longer than the longest clean window, we update our record.

edit distance

The DP idea — compare matrix-path-counter, which is more visual and easier than This one.

Q72 on Leetcode: Given two words word1 and word2, find the minimum number of operations required to convert word1 to word2. You have the following 3 operations permitted on a word:

  1. Insert a character
  2. Delete a character
  3. Replace a character

Comment — Top 100, naturally occurring. I won’t bother to pass all Leetcode tests esp. the load tests. If I pass all non-load tests I would consider my solution decent.

https://github.com/tiger40490/repo1/tree/py1/py/str has my implementation based on a DP idea online, and a spreadsheet illustration. The idea is elegant once you wrap your mind around it.

===analysis===
Starting with the small string (length S), The challenge is to project as many of the S chars to the large string (length L). If we can project 5 chars at most, then … (wrong — the remaining S-5 chars need replacement, and the other L-S chars need insertion.)

–idea2: draw all the projection arrows from S to L. In a good-projection, every arrow on the right should be more slanted than every arrow on the left. We want the largest good-projection. In the opening example, the largest would have 5 arrows, …

—-
None of these ideas has proven effective.

find min substr contain`all my fav chars

Update — a similar sliding window is used in longest substring without repeating chars

Q (leetcode): Given a string Haystack and a string T, find the minimum window in Haystack which contains (at least) all the characters in T according to the frequencies. Time complexity O(n). Eg: minWindow(ccbabccbabcb, bbc)==bcb

If there is such a window, you are guaranteed that there will always be only one unique minimum window in Haystack. <– I thought this guarantee means something but it doesn’t.

Without loss of generality, I will assume the chars are a-z. I believe those Leetcode corner cases will use only 3 chars

—analysis—

For single-string problem, use array indexed by ascii code. I can convert T to such an array to store the required frequencies (reqFrq)

I can construct a shadow array, same length as Haystack with these payloads:

  • if the hay is not in reqFrq, then payload is a special value like nullptr
  • if the hay is in reqFrq, then….?

–SolSW: sliding-window based

  1. Scan Haystack from left and keep count of actual frequency (check against reqFrq each time). I will inevitably find the earliest good window. By construction, both ends of this window are in reqFrq.
    • Note the entire haystack is more than a good window.
  2. Now I slide the fixed-sized window. If I find another good window, with extra chars on the left, then I have found a shorter window, so I truncate my window on the left
  3. continue Step 2

Longest Parentheses run with multiple hierarchies

Q (Leetcode): Given a string containing nothing but the characters ‘(‘ and ‘)’, find the length of the longest valid (well-formed) parentheses substring.

https://github.com/tiger40490/repo1/blob/cpp1/cpp/str/maxParensRun.cpp is my solution 100% tested on Leetcode

–My Single-iteration solution:

Challenge is data structure. I ended up with 2 data structures to be updated during the iteration

  1. A stack (holding openers’ index values) to locate the matching openers
  2. an array to save “scores”

For each closer, I will record the position of the matching opener, then compute the distance (minimum two).

 

 

case-insensitive string search #KMP #Ahbinav

[[c++cookbook]] P174 suggests to prepare the needle by creating an all-upper-case needle, but leave the (bigger) haystack unchanged.

I discussed with my friend Abhinav. With std::search, we should probably upper-case the haystack as well. The std::search implementation can run M*N char-comparisons.

Even with the efficient KMP algorithm, typical complexity is M+N char-comparisons. So my solution is no worse than the authors’ solution.

longest palindrome subsequence !! substr #30%

This problem is probably tough but not so common. Let’s not spend too much time. If there’s an elegant idea then I should try then read it. To keep things simple just assume there are only 3 unique chars

I feel there should be some DP solution as a shorter haystack is clearly easier than a longer haystack

====DP idea 2 (efficiency is decent but clarity is outstanding — satisfaction)
At each position i in the haystack string, we keep a Club of …. “Members” each a palindrome subseq ending at i.
Requirement: Every eligible Member must be in the Club.

For the next position after incrementing i, this DP algo builds the new Club using all earlier ClubS. We look at each [4] Member in each earlier Club. All of these Members are unique by construction 🙂 Member is a COW class, having
* an immutable length
* an immutable “seq” array of subscripts representing the subseq — no two Members have identical contents in this->seq. The last array element is always i
* an immutable hashtable (or array) of “unused chars on my left”. Each entry is {unused char -> stack of positions}. Note this stack is naturally sorted.
This local optimization eliminates the expensive backscan.

The Club class is immutable. Conceptually, there’s no reason to modify a Club once constructed.

What do we do when we examine a Member? If a Member can “grow” (rightward or both ways) to use the new char, then this Member clones itself, grows and joins the new Club. The “growth” shall update the hashtable IFF growing both ways.

The new char itself is automatically a (trivial) Member of this new Club. Crucially, this procedure is how a new palindrome subsequence gets created.

At any time, before i++, the latest constructed Club includes all the palindrome subseq ending at i.
Similar invariants hold for earlier Clubs.
This level of complete control is kinda remarkable, thanks to the elegance of this DP algorithm.

That’s a fairly complete solution, but there might exist efficiency improvements.

— pruning the ever-growing Clubs
Each Club is immutable but at every position we create a new Club.

Say Club-5 (for i=5) has 44 Members, Club-6 has 22 Members, Club-7 has 53 Members. At position 8, We need to examine each Club. However, Some of the Members across the Clubs may be hopelessly short. This can be seen when the unseen section has only 2 chars.

If we can prune the Clubs (treating them as mutable) we can hopefully reduce the total computation cost.

For this purpose, we can keep “Groups” of Members. Group-3 are all the known Members of length 3. This feature doesn’t increase run time cost.

We only need two simple variables to start pruning
* R := remaining chars = len(haystack) – i
* LeadingPack.len
LeadingPack.len – R is the pruning criteria. Any pack whose len falls below are hopeless and pruned

Because pruning has the potential to drastically reduce computation, every record breaker is good news, giving us a chance to prune many existing Members.

— An experimental, optimistic technique — The default implementation would examine every known Member but we can be lazier. As we increment i, we will focus on Group-6 i.e. the leading pack of longest Members, all length 6, across all Clubs. At a given position, when there are only 15 chars on the forward scan, One Group-6 Member might keep growing 15 steps. In that case, this Member can safely be declared the winner, without looking at the “other” Groups.

We would need to remember the original IDs of the Group-6

This is an optimistic algo.

In fact, I might pick one length-6 Member. But if this pick stops growing, then I pick another from the leading pack… backtracking? IFF all of Group-6 stop growing, then we look at Group-5.

I feel this may not work so well, since there’s a high chance that some group-1 Member has the best potential. The current length of a Member is not a good predictor of ultimate length.

If this algo works it is again using auxDS, my strength

punctuate ContinuousSentence

I believe https://leetcode.com/problems/word-break/description/ is identical.

Q (bbg): Given a dictionary of English words (containing no numbers no underscores) and a very long sentence, someone has removed all the spaces. Please write a program to restore it by adding a space after each word. If there’s no way to parse the sentence just return False, otherwise return True to indicate you found at least one way to parse the sentence.

The sentence can have repeated words.

I was allowed to use either white-board or computer.

Brute force? I struggled for a long time with some vague idea. Interviewer asked me to implement any solution I have, however inefficient , but I was unable to.

Key observation — This is a typical QQ type of algo question. If you remember one key point, then you have a advantage.

Key observation — syntax and ECT is not really hard. Only simple string operations. Data structure is simple. Challenge is an elegant (precious) pure algo.

Key observation — the recursive backtracking solution is working and clever-looking but actually not the very best. Still, recursive thinking is valuable and gave me a first solution. I’m not ashamed of my recursive solution.

Key observation — This problem is considered not too hard because …. the solution is short! But most of us can’t conceive such a solution! I won’t trust the difficulty level in leetcode.com

https://github.com/tiger40490/repo1/blob/py1/py/str/continuousSentenceBBG.py shows my

recursive backtracking  solution. Interviewers basically said it works.

There’s an inherent tree structure in this greedy backtracking algo. This tree has the power to meet the tougher requirement in Q2. As such, this algorithm is more powerful than then algo below.

— More efficient solution: non-recursive, no backtracking

Suppose the sentence has 99 chars ch[0], ch[1] … ch[98]. We maintain an “aboveWater[99]” bool array. A ch[33] is above water if there exists a 4-char bridge word from aboveWater character ch[29] ….

The bool array is assumed underwater initially. We scan it forward once to check and mark selected char as aboveWater.

Once the left section of the array is analyzed, we don’t need to revisit it by backtracking.

The simplest algo — forward iterate every position. At each position i,
look at the last 2 chars j-1~j
look at the last 3 chars j-2~j
look at the last 4 chars j-3~j
… If any is a bridge word, then mark j as aboveWater and move on.

— Q2(Leetcode): Given a non-empty string s and a dictionary wordDict containing a list of non-empty words, add spaces in s to construct a sentence where each word is a valid dictionary word. Return ALL such possible sentences.

%%A: My github solution should solve it even if not optimal. Need to test.

FB3: edit distance==1 #untested

This trivial problem surprised me
  1. surprise — python string/list techniques are often easer than c++, so I need to adapt to the language
  2. surprise — took 40 min on white board, mostly because of c++ style thinking
A popular coding question for phone or white-board screening.

Q: Write a function that returns whether two words are exactly “one edit” away using the following signature:

bool OneEditApart(string s1, string s2);
An edit is:
  • Inserting one character anywhere in the word (including at the beginning and end)
  • Removing one character
  • Replacing one character

Examples:

OneEditApart("cat", "dog") = false
OneEditApart("cat", "cats") = true
OneEditApart("cat", "cut") = true
OneEditApart("cat", "cast") = true
OneEditApart("cat", "at") = true
OneEditApart("cat", "act") = false
https://github.com/tiger40490/repo1/blob/py1/py/str/diffBy1.py is my tested code. I first spent 40 minutes to write on white board but it is basically bug-free.
I was writing python in c++ mode and spent too much time on the len-off-by-1 case until I decided to insert into the shorter string. Then I found a much simpler solution …
.. #1 lesson learned in python string matching:
After finding a mismatch char, comparing two slices is much much simpler than insertion technique or iterator adjustment. No performance hit. If you want to avoid string copy, you can even extract a function to take string references.

bbg-Eq: longest abbreviation #easier

Given a long word with N non-unique characters, we know there can be 2^N abbreviations (including the original word, and including the empty string).

Requirement: I give you a list of strings and you suspect some of them could be an abbreviation of your word. Find the longest string among them that’s a valid abbreviation. Optimize for time complexity.

I feel this problem appears simple but not easy to complete quickly

https://github.com/tiger40490/repo1/blob/py1/py/str/testAbbr_ez.py is my python solution, different from the c++ solution below. Not sure which has more bugs.


I started with string::find() and hit the problem of non-unique characters. Interviewer hinted char-matching — the key point I needed for this type of problem.

#include <iostream>
#include <algorithm>
#include <vector>
using namespace std;

typedef string candidate;

vector<candidate> v{"monkey","aaaaaaaaaaaaaaaaaa", "mamalon", "monk"};
string const haystack{"mamalonkey"};
size_t hsize=haystack.size();

//Tip: can use typdef alias in argument:)
bool lenComp(candidate const & a, candidate const & b){
  return a.size()<=b.size();
}
ostream & operator<<(ostream & os, vector<candidate> const & c){
  size_t const sz = c.size();
  for (int i=0;i<sz; ++i){
        os<<c[i]<<" ";
  }
  os<<endl; return os; } //single-pass 🙂 
} 

bool check1candate(candidate const & c){ 
  if (c.size() > hsize) return false;

  for(int i=0, h=0; h<hsize; ++h){
        if (c[i] == haystack[h]){
          if (i==c.size()-1) return true;
          ++i;
        }
  }
  return false;
}
int main(){
  sort(v.begin(), v.end(), lenComp);
  cout<<v; for (int i=v.size()-1; i>=0; --i){
    if (check1candate(v[i])){
        cout<<"winner is "<<v[i];
        return 0;
    }
  }
}

 

bbg-FX: in-place array shuffle #splice^evict

Update: In-Place usually requires 1) splice 2) swap.

Requirement: given a list like

{1,2,3,4,  25,29,44,33,  159,139,155,150,  177,176,169,180} which describes four Persons
{id1,id2,..age1,age2, weight1,weight2,…height1,height2,..}

Write a program to output each person’s full attributes like

{1, 25, 139, 177,   2, 29, 129, 176,   3, 24, 135, 169,   4, 33, 140, 180}

Note the number of attributes (in this case 4) is an input parameter to your program. You don’t know (or care about) the name of each attribute. If you are told there are 37 attributes in each person, just output attr1 .. attr37.

Note the input and output sequence are important.

Q (Bonus question): do it in-place without creating a new collection. You can still have a small number of variables in your code.
A: one way to “cheat” is to append each output person to end of the original vector. It will double in size, so we will truncate the first half.
A: I believe I can use the “homeless/displacement” technique.
A: I now feel many in-place array reshuffle can be done in place with a single temp variable but this may not be one of them.
A: how about starting with a linked list?

https://github.com/tiger40490/repo1/blob/cpp1/cpp/array/insitu_reshuffle_attrib_bbg.cpp has ..

check if a word can reshuffle to palindrome

requirement: https://codecon.bloomberg.com/challenger-series/29. With minimal time and space complexity, the corner cases are tricky. I decided to add a null terminator:)

int main() {
 string S;
  cin>>S;
  vector<char> v(S.begin(), S.end());
  sort(v.begin(), v.end());
  v.push_back(0); //null char
  char last=v[0];
  size_t occur=0;
  bool isOddFound = false;

  for(int i=0; i<v.size(); ++i) {
    bool isNew = v[i] != last;
    //cout<<"--> "<<v[i]<<" isNew = "<<isNew<<endl;
    if (!isNew){
        ++occur;
        //cout<<last<<" occured "<<occur<<" times"<<endl;
        continue;
    }
    //cout<<last<<" occured total "<<occur<<" times"<<endl;
    if (occur % 2){
      if(isOddFound){
        cout<<"no"<<endl;
        return 0 ;
      }
      isOddFound = true;
    }
    last = v[i];
    occur = 1;
  }
  cout<<"yes"<<endl;
}

reverse words in a string containing a sentence

Q: A sentence can contain 2 or more words, spaces and punctuation marks but no hyphen. You are given a sentence as a c-string. Reverse the words in O(1) space O(N) time. “a1 b2 c3” becomes “c3 b2 a1”.

%%A: first pass to reverse the c-string character-wise by swapping. 2nd pass — scan left to right and keep the current-word-start and current-word-end positions. Once we have a word, reverse the chars in it by swapping.

substring search – Boyer-Moore

I think single-string coding questions are evergreen. The insights and techniques might be …reusable?

Brute force search — shift by one, then attempt to match the entire substring. My code below is more efficient than brute-force. Mostly it shifts by one each time, but checks just one char.

Q: write a python function match(string haystack, string nonRegexNeedle), following the simplified Boyer-Moore algo:

Try matching at the left end. If mismatch, then identify the LEFT-most offending char (J) in haystack. Now locate the RIGHT-most position of J in needle (if not found then simply shift all the way past the J.) and right-shift the needle to align the two J.

Q: So when would we right-shift  by one? Not sure. So I don’t know if my code below is 100% correct in all cases.

def visualize(haystack, substring, offset):
  print '\t--' + '0123456789 123456789 12345-----'
  print '\t--' + haystack
  print '\t--' + ' '*offset + substring
  print '\t--' + '-------------------------------'

def match(haystack, substring):
  offset = 0 #offset from start of haystack, where substring is now
  while(True):
    #now locate first offender in haytack
    offenderOffset, offender = -9999, ''
    for i, char in enumerate(substring):
       offenderOffset = offset+i
       if substring[i] != haystack[offenderOffset]:
         offender = haystack[offenderOffset]
         print offender + ' <-- spotted at offenderOffset = ', offenderOffset
         visualize(haystack, substring, offset)       
         break;
         
    if offender == '': return True
    
    # now back-scan substring to locate offender, and then shift to align
    # not really forward-scan haystack
    while True:
      offset += 1
      if offset + len(substring) > len(haystack):
           print 'Gave up complately:'
           visualize(haystack, substring, offset)       
           return False
      if offenderOffset - offset == -1: 
        print 'gave up on aligning: '
        visualize(haystack, substring, offset)
        break
      if substring[offenderOffset - offset] == offender:
        print 'aligned: '
        visualize(haystack, substring, offset)
        break # go back to start of outer while loop

print match('aborb bdb', 'abors') 

 

3c++London bbg algo IV #all about array

Q: given a string of characters, find the longest “run” of any repeating character.

Q: given an array of integers, and a target, determine if there’s any pair that add up to the target. Return true/false.

Q: maximum profit problem. Given a time series of historical spot FX rates, find the maximum possible trading profit by exactly one buy one sell.
A: I have seen this many times. First write pseudo-code.

FB 2questions #regex..

—Q1: write a stateless utility function to take in any string consisting of a-z, and return a string with all the original characters appearing once only. All 2nd and subsequent occurrences should be removed. eg: abcbdc → abcd

I don’t want to new up a string in my utility, so I decided to take in an output char* buffer which should be large enough.

I guess this is all about efficient speed coding.
—Q2: Suppose there’s a regex grammar that can accept exactly 3 constructs
1) literals — any combination of abcdefghijkljmnopqrstuvwxyz,
2) dot (.) that matches any one character
3) asterisk — a standard quantifier

For example,

“a*” can match empty string, a single a, or aa, or aaa etc.
“.*” matches any string including the empty string.
“a*b*a” should match “a” and also match “aa”, or “ba”, or “aaa”

—-
Now, write a utility function bool match(char const * regex, char const * S);

Note entire string S must match entire regex. We don’t want to get a false positive when only a substring of S matches the regex.
—– some tips—–

  1. Easy part — if regex ends in literals, then they must “eat up” the tail of S exactly.
  2. After this step, if regex ends with a dot, it will just eat up the last char of S. After this step, regex definitely ends in a star.
  3. If 2nd char in regex is not a star, then first char must eat up head of S.
  4. Repeat previous step until we see star as 2nd char.
  5. Now we can basically split the regex into segments. Each segment is either a single literal, a single dot, or a non-star followed by a star.

Interviewer said — First assume dot is not allowed in regex. We will deal with the dot later.

—– analysis
EPI Question 6.23 is very similar to my Facebook interview. I told FB interviewer that we don’t NEED such problem solving at work. He said this is one method (standard method?) they assess a candidate’s coding abilities.

I feel it (and google questions) is like IQ test or probability test or abstract reasoning test — irrelevant to work, but used to assess learning capacity. They don’t want a dumb person on their team.

The truth is, practice is key. There are patterns. A very very smart person will still fail if without enough practice. A mediocre guy can speed up significantly if practised enough. I said the same thing to My Hwa Chong math teacher Tony Tan. This factor is also the reason I became so strong in physics.

Q: A simple regex (SRE) can use alphanumerics, dot and star. It is enclosed between implicit ^ and $. Write a match() function taking a SRE and a word. No white space allowed.

https://github.com/tiger40490/repo1/blob/py1/py/regexFB.py is my implementation. Here are a few test cases:

assert not match(”, ‘.’)
assert match(‘xyab-abc’, ‘.*ab.*c.*x*’)
assert match(‘xyab-abc’, ‘.*ab.*c.*.*’)
assert match(‘xyab-abc’, ‘.*ab.*c’)
assert match(‘aaaaaaab-abc’, ‘a*aab.*c’)
assert match(‘aab-abc’, ‘a*aab.*c’)
assert match(‘aabc’, ‘a*aab.*..*x*’)
assert not match(‘abc’, ‘a*aab.*c’)
assert match(‘a-aab’, ‘a*.*a*aab’)
assert not match(‘a-aab’, ‘a*a*aab’)

[12] locate]sentence all permutations@99words #dragonSearch

https://leetcode.com/problems/substring-with-concatenation-of-all-words/description/ is very similar

Q1: you are given a list of (possibly repeating) words denoted L that are all the same length , and a long string denoted S. Find a substring of S that is a concatenation of each word in the list exactly once and without any intervening characters. Such a substring is known as a dragon.

To simplify the problem, you can assume it occurs at least once in the sentence.

Example:.
L: “fooo”, “barr”, “wing”, “ding”, “wing”.
S: “lingmindraboofooowingdingbarrwingmonkeypoundcake”.
Answer — fooowingdingbarrwing.

For O(), let’s assume there are H characters in haystack, W characters in each word, and N words (possibly repeating)

Q1b: what if word length is not uniform?

Solution-A — See https://github.com/tiger40490/repo1/blob/py1/py/str/dragonSearch.py with O(HN)

I hope this can be adapted for Q1b — see source comments

Solution-B1 (Brute force) — For each permutation, scan by indexOf() or strstr() or string::find(). Guaranteed to find a solution in finite time. Doable if N is small, since there’s O(N!) element.

Solution-B2 (Brute force) — WN-long sliding window over haystack. Each window is examined to determine if it’s a dragon.

This is reasonable if H is small relative to N.

Solution-C — a potentially faster solution if the list is large — Permutation of 5 would be 120 😦

Phase 1: For each word in the list, i’d count how many times it occurs. The word with lowest occurrence would be a “candidate”.

If every word has the same occurrence (say 3), I think this is still faster than Solution-B, because all but one of the occurrences is a false positive. It’s relatively easy to screen them out.

As soon as I find a word with occurrence=1 exactly, I exit Phase 1 with a super candidate. For this discussion, I assume we have a super candidate.

Updae — In such a case, I can also use the Solution A approach. Even if the best candidate has occurrence = 2 I can consider this approach…

Phase 2a: remove the super candidate from the list. Blindly read the next 4 character in S after our super candidate. See if any word starts with that…..

Phase 2b: do the same leftward. Perhaps by reversing all the strings.

[12]locate]long sentence a permutation of 3 words #src

/* Requirement: you are given a list of words denoted L that are all the same length , 
and a long string denoted S. Find a substring of S that is a concatenation of each 
word in the list exactly once and without any intervening characters. 
This substring is guaranteed to occur exactly once in S.

Showcase how to populate a map containing list,
* without excessive list cloing, and
* without explicit new
Showcase auto keyword
Showcase unordered_multiset
Showcase rbegin()
*/
#include <iostream>
#include <vector>
#include <list>
#include <unordered_map>
#include <unordered_set>
#include <memory> // make_shared
#include <string>
using namespace std;

string const S = "barrfooolingmindraboofooowingdingbarrwingmonkeypounddingdingbarrwingfooocake";
vector<string> L{"fooo", "barr", "wing", "ding", "wing"};
unordered_map<string, list<size_t> > lookup; //list of positions where the word appears
unordered_multiset<string> pool; //global variable

template<typename T> ostream & operator<<(ostream & os, list<T> const & l){
  for(auto it = l.begin(); it != l.end(); ++it) os<<*it<<" ";
  os<<endl;
}

shared_ptr<list<size_t> > countOccurrence(string const & word){
  shared_ptr<list<size_t> > ret= make_shared<list<size_t> >();

  for(size_t pos=0, hit=0;;pos=hit+4){
        hit = S.find(word, pos);
        if (hit == string::npos) break;
        ret->push_back(hit);
  }
  return ret;
}
int step( bool isForward){
        return isForward? 4 : -4;
}
char search1way(unsigned int pos2try, bool isForward){
    for(int marker = pos2try+step(isForward); ; marker+=step(isForward)){
        string const & substr4 = S.substr(marker,4);
        //cout<<"substr4 = "<<substr4<<endl;

        //now lets see if the substr is one of our words
        auto itr = pool.find(substr4);
        if (pool.end() == itr) break;
        pool.erase(itr);
        cout<<"found   "<<substr4<<"  at position "<<marker<<". words remain unfound = "<<pool.size()<<endl;
        if (pool.empty())  return 0;
    }
    return 'r' ; //has words remaining in pool;
}
int main(){
  pair<string, int> winner = make_pair("", 9999);
  for(int i=L.size()-1; i>=0; --i){
        string const & word = L[i];
        if (lookup.count(word)) continue;
        shared_ptr<list<size_t> > occ = countOccurrence(word);
        lookup.insert(make_pair(word, *occ));
        if (occ->size() < winner.second) winner = make_pair(word, occ->size());
        cout<<word<<" maps to "<<lookup[word];
  }
  cout<<winner.first<<" has the lowest occurrence = "<<winner.second<<endl;

  list<size_t> const & winnerPos = lookup[winner.first];
  for(auto it = winnerPos.rbegin(); it!=winnerPos.rend(); ++it){
    int pos2try=*it;
    pool = unordered_multiset<string>(L.begin(), L.end());
    pool.erase(pool.find(winner.first)); //guaranteed
    cout<<"trying  "<<winner.first<<"  at position "<<pos2try<<" as the anchor word. Words remain unfound = "<<pool.size()<<endl;
    if (0 == search1way(pos2try, true)) return 0;
    cout<<"searching backward..\n";
    if (0 == search1way(pos2try, false)) return 0;
    cout<<pos2try<<" trial failed :(\n\n";
  }
}

KMP algorithm for string search

Q: Why is std::search() and string::find() use simple O(N*M) solution and not using kmp?
A: kmp requires extra memory to store the pre-computed data structure. Allocating the storage is likely more costly than the big-O savings,
A: in the common use cases, the strings are not super long and there are not many repetitions in the pattern, so kmp will not help a lot
A: therefore, overall, the simple implementation is faster, even though big-O is inferior. I believe big-O comparison really compares large values of N
–prefix-function in one sentence: “longest border of a partial-match needle[1..j-1] that is not followed by the character needle[j] “. In other words, we are looking for the largest integer k such that

* needle[1..k-1] is a border of needle[1..j-1] and

* needle[1..k] is not a border of needle[1..j].

needle[1..j-1] is a “partial-match” and needle[1..k-1] is “border” of the partial-match. The border occurs at both ends of the partial-match.

(http://www.cs.sjsu.edu/~beeson/courses/cs255/LectureNotes/3-KnuthMorrisPratt.html)