simple diff class – in pure java, easily embeddable (src)

<![CDATA[ //Based on work by other authors import java.io.DataInputStream; import java.io.FileInputStream; import java.io.IOException; import java.util.Formatter; /** * This single-file self-contained utility consists of 1 public class and 2 * non-public classes (non-CamelCase names). *

* USAGE: diff oldfile newfile *

* This program assumes that “oldfile” and “newfile” are text files. The program * writes to stdout a description of the changes which would transform “oldfile” * into “newfile”. *

* The printout is in the form of commands, each followed by a block of text. * The text is delimited by the commands, which are: *

* DELETE AT n ..deleted lines *

* INSERT BEFORE n ..inserted lines *

* n MOVED TO BEFORE n ..moved lines *

* n CHANGED FROM ..old lines CHANGED TO ..newer lines *

* The line numbers all refer to the lines of the oldfile, as they are numbered * before any commands are applied. The text lines are printed as-is, without * indentation or prefixing. The commands are printed in upper case, with a * prefix of “>>>>”, so that they will stand out. Other schemes may be * preferred. *

* Files which contain more than MAXLINECOUNT lines cannot be processed. This * can be fixed by changing “symbol” to a Vector. The algorithm is taken from * Communications of the ACM, Apr78 (21, 4, 264-), * “A Technique for Isolating Differences Between Files.” Ignoring I/O, and * ignoring the symbol table, it should take O(N) time. This implementation * takes fixed space, plus O(U) space for the symbol table (where U is the * number of unique lines). Methods exist to change the fixed space to O(N) * space. *

* Note that this is not the only interesting file-difference algorithm. In * general, different algorithms draw different conclusions about the changes * that have been made to the oldfile. This algorithm is sometimes “more right”, * particularly since it does not consider a block move to be an insertion and a * (separate) deletion. However, on some files it will be “less right”. This is * a consequence of the fact that files may contain many identical lines * (particularly if they are program source). Each algorithm resolves the * ambiguity in its own way, and the resolution is never guaranteed to be * “right”. However, it is often excellent. This program is intended to be * pedagogic. Specifically, this program was the basis of the Literate * Programming column which appeared in the Communications of the ACM (CACM), in * the June 1989 issue (32, 6, 740-755). *

* By “pedagogic”, I do not mean that the program is gracefully worded, or that * it showcases language features or its algorithm. I also do not mean that it * is highly accessible to beginners, or that it is intended to be read in full, * or in a particular order. Rather, this program is an example of one * professional’s style of keeping things organized and maintainable. *

* The program would be better if the “print” variables were wrapped into a * struct. In general, grouping related variables in this way improves * documentation, and adds the ability to pass the group in argument lists. This * program is a de-engineered version of a program which uses less memory and * less time. The article points out that the “symbol” arrays can be implemented * as arrays of pointers to arrays, with dynamic allocation of the subarrays. * (In C, macros are very useful for hiding the two-level accesses.) In Java, a * Vector would be used. This allows an extremely large value for MAXLINECOUNT, * without dedicating fixed arrays. (The “other” array can be allocated after * the input phase, when the exact sizes are known.) The only slow piece of code * is the “strcmp” in the tree descent: it can be speeded up by keeping a hash * in the tree node, and only using “strcmp” when two hashes happen to be equal. * * @author Ian F. Darwin, Java version * @author D. C. Lindsay, C version (1982-1987) * @author minor adjustment by Bin Tan */ public class Diff { /* The following are global to printout’s subsidiary routines */ // enum{ idle, delete, insert, movenew, moveold, // same, change } printstatus; public static final int idle = 0, delete = 1, insert = 2, movenew = 3, moveold = 4, same = 5, change = 6; private static final String LINE = “————————–“; /** * block len > any possible real block len */ final static int UNREAL = Integer.MAX_VALUE; /** * main – entry point when used standalone. NOTE: no routines return error * codes or throw any local exceptions. Instead, any routine may complain to * stderr and then quit with error to the system. */ public static void main(String argstrings[]) { argstrings = new String[] { “C:\\0\\Copy of c#IV_question.txt”, “C:\\0\\c#IV_question.txt” }; if (argstrings.length != 2) { throw new RuntimeException(“Usage: diff oldfile newfile”); } Diff d = new Diff(); Formatter f = d.doDiff(argstrings[0], argstrings[1]); System.out.println(f); return; } boolean anyprinted; /** * blocklen is the info about found blocks. It will be set to 0, except at * the line#s where blocks start in the old file. At these places it will be * set to the # of lines in the block. During printout , this # will be * reset to -1 if the block is printed as a MOVE block (because the printout * phase will encounter the block twice, but must only print it once.) The * array declarations are to MAXLINECOUNT+2 so that we can have two extra * lines (pseudolines) at line# 0 and line# MAXLINECOUNT+1 (or less). */ int blocklen[]; final public Formatter formatter = new Formatter(new StringBuilder()); /** * Keeps track of information about file1 and file2 */ fileInfo oldinfo, newinfo; int printstatus, printoldline, printnewline; // line numbers in old & new file public Diff() { } /** * Do one file comparison. Called with both filenames. */ public Formatter doDiff(String oldFile, String newFile) { println(“>>>> Difference of file \”” + oldFile + “\” and file \”” + newFile + “\”.\n”); oldinfo = new fileInfo(oldFile); newinfo = new fileInfo(newFile); /* we don’t process until we know both files really do exist. */ try { inputscan(oldinfo); inputscan(newinfo); } catch (IOException e) { System.err.println(“Read error: ” + e); } /* * Now that we’ve read all the lines, allocate some arrays. */ blocklen = new int[(oldinfo.maxLine > newinfo.maxLine ? oldinfo.maxLine : newinfo.maxLine) + 2]; oldinfo.alloc(); newinfo.alloc(); /* Now do the work, and print the results. */ transform(); return printout(); } /** * inputscan Reads the file specified by pinfo.file. ——— Places the * lines of that file in the symbol table. Sets pinfo.maxLine to the number * of lines found. */ @SuppressWarnings(“deprecation”) void inputscan(fileInfo pinfo) throws IOException { String linebuffer; pinfo.maxLine = 0; while ((linebuffer = pinfo.file.readLine()) != null) { storeline(linebuffer, pinfo); // System.out.println(linebuffer); } } /* * newconsume Part of printout. Have run out of old file. Print the rest of * the new file, as inserts and/or moves. */ void newconsume() { for (;;) { if (printnewline > newinfo.maxLine) break; /* end of file */ if (newinfo.other[printnewline] oldinfo.maxLine) break; /* end of file */ printnewline = oldinfo.other[printoldline]; if (printnewline < 0) showdelete(); else if (blocklen[printoldline] oldinfo.maxLine) { newconsume(); break; } if (printnewline > newinfo.maxLine) { oldconsume(); break; } if (newinfo.other[printnewline] < 0) { if (oldinfo.other[printoldline] < 0) this.showchange(); else showinsert(); } else if (oldinfo.other[printoldline] < 0) showdelete(); else if (blocklen[printoldline] >>> End of differences.”); else println(“>>>> Files are identical.”); return formatter; } /* * scanafter Expects both files in symtab, and oldinfo and newinfo valid. * Expects the “other” arrays contain positive #s to indicate lines that are * unique in both files. For each such pair of places, scans past in each * file. Contiguous groups of lines that match non-uniquely are taken to be * good-enough matches, and so marked in “other”. Assumes each other[0] is * 0. */ void scanafter() { int oldline, newline; for (newline = 0; newline = 0) { /* is unique in old & new */ for (;;) { /* scan after there in both files */ if (++oldline > oldinfo.maxLine) break; if (oldinfo.other[oldline] >= 0) break; if (++newline > newinfo.maxLine) break; if (newinfo.other[newline] >= 0) break; // oldline & newline exist, and aren’t already matched if (newinfo.symbol[newline] != oldinfo.symbol[oldline]) break; // not same newinfo.other[newline] = oldline; // record a match oldinfo.other[oldline] = newline; } } } } /** * scanbefore As scanafter, except scans towards file fronts. Assumes the * off-end lines have been marked as a match. */ void scanbefore() { int oldline, newline; for (newline = newinfo.maxLine + 1; newline > 0; newline–) { oldline = newinfo.other[newline]; if (oldline >= 0) { /* unique in each */ for (;;) { if (–oldline = 0) break; if (–newline = 0) break; // oldline and newline exist, and aren’t marked yet if (newinfo.symbol[newline] != oldinfo.symbol[oldline]) break; // not same newinfo.other[newline] = oldline; // record a match oldinfo.other[oldline] = newline; } } } } /** * scanblocks – Finds the beginnings and lengths of blocks of matches. Sets * the blocklen array (see definition). Expects oldinfo valid. */ void scanblocks() { int oldline, newline; int oldfront = 0; // line# of front of a block in old, or 0 int newlast = -1; // newline’s value during prev. iteration for (oldline = 1; oldline <= oldinfo.maxLine; oldline++) blocklen[oldline] = 0; blocklen[oldinfo.maxLine + 1] = UNREAL; // starts a mythical blk for (oldline = 1; oldline <= oldinfo.maxLine; oldline++) { newline = oldinfo.other[oldline]; if (newline < 0) oldfront = 0; /* no match: not in block */ else { /* match. */ if (oldfront == 0) oldfront = oldline; if (newline != (newlast + 1)) oldfront = oldline; ++blocklen[oldfront]; } newlast = newline; } } /* * scanunique Scans for lines which are used exactly once in each file. * Expects both files in symtab, and oldinfo and newinfo valid. The * appropriate "other" array entries are set to the line# in the other file. * Claims pseudo-lines at 0 and XXXinfo.maxLine+1 are unique. */ void scanunique() { int oldline, newline; node psymbol; for (newline = 1; newline >>> ” + printoldline + ” CHANGED FROM”); printstatus = change; oldinfo.symbol[printoldline].showSymbol(); anyprinted = true; printoldline++; } /** * showdelete Part of printout. Expects printoldline is at a deletion. */ void showdelete() { if (printstatus != delete) println(“\n>>>> DELETE AT ” + printoldline); printstatus = delete; oldinfo.symbol[printoldline].showSymbol(); anyprinted = true; printoldline++; } /* * showinsert Part of printout. Expects printnewline is at an insertion. */ void showinsert() { if (printstatus == change) println(LINE + “\n>>>> CHANGED TO”); else if (printstatus != insert) println(“\n>>>> INSERT BEFORE ” + printoldline); printstatus = insert; newinfo.symbol[printnewline].showSymbol(); anyprinted = true; printnewline++; } /** * showmove Part of printout. Expects printoldline, printnewline at start of * two different blocks ( a move was done). */ void showmove() { int oldblock = blocklen[printoldline]; int newother = newinfo.other[printnewline]; int newblock = blocklen[newother]; if (newblock = newblock) { // assume new’s blk moved. blocklen[newother] = -1; // stamp block as “printed”. println(“>>>> ” + newother + ” THRU ” + (newother + newblock – 1) + ” MOVED TO BEFORE ” + printoldline); for (; newblock > 0; newblock–, printnewline++) newinfo.symbol[printnewline].showSymbol(); anyprinted = true; printstatus = idle; } else /* assume old’s block moved */ skipold(); /* target line# not known, display later */ } /** * showsame Part of printout. Expects printnewline and printoldline at start * of two blocks that aren’t to be displayed. */ void showsame() { int count; printstatus = idle; if (newinfo.other[printnewline] != printoldline) { throw new RuntimeException(“BUG IN LINE REFERENCING”); } count = blocklen[printoldline]; printoldline += count; printnewline += count; } /** * skipnew Part of printout. Expects printnewline is at start of a new block * that has already been announced as a move. Skips over the new block. */ void skipnew() { int oldline; printstatus = idle; for (;;) { if (++printnewline > newinfo.maxLine) break; /* end of file */ oldline = newinfo.other[printnewline]; if (oldline oldinfo.maxLine) break; /* end of file */ if (oldinfo.other[printoldline] fileInfo.MAXLINECOUNT) { throw new RuntimeException(“MAXLINECOUNT exceeded, must stop.”); } pinfo.symbol[linenum] = node.addSymbol(linebuffer, pinfo == oldinfo, linenum, this.formatter); } /* * transform Analyzes the file differences and leaves its findings in the * global arrays oldinfo.other, newinfo.other, and blocklen. Expects both * files in symtab. Expects valid “maxLine” and “symbol” in oldinfo and * newinfo. */ void transform() { int oldline, newline; int oldmax = oldinfo.maxLine + 2; /* Count pseudolines at */ int newmax = newinfo.maxLine + 2; /* ..front and rear of file */ for (oldline = 0; oldline < oldmax; oldline++) oldinfo.other[oldline] = -1; for (newline = 0; newline < newmax; newline++) newinfo.other[newline] = -1; scanunique(); /* scan for lines used once in both files */ scanafter(); /* scan past sure-matches for non-unique blocks */ scanbefore(); /* scan backwards from sure-matches */ scanblocks(); /* find the fronts and lengths of blocks */ } }; // end of main class! /** * This is the info kept per-file. */ class fileInfo { static final int MAXLINECOUNT = 20000; DataInputStream file; /* File handle that is open for read. */ public int maxLine; /* After input done, # lines in file. */ int other[]; /* Map of line# to line# in other file */ node symbol[]; /* The symtab handle of each line. */ /* ( -1 means don't-know ). */ /* Allocated AFTER the lines are read. */ /** * Normal constructor with one filename; file is opened and saved. */ fileInfo(String filename) { symbol = new node[MAXLINECOUNT + 2]; other = null; // allocated later! try { file = new DataInputStream(new FileInputStream(filename)); } catch (IOException e) { System.err.println("Diff can't read file " + filename); System.err.println("Error Exception was:" + e); throw new RuntimeException(e); } } // This is done late, to be same size as # lines in input file. void alloc() { other = new int[symbol.length + 2]; } } /** * Class "node". The symbol table routines in this class all understand the * symbol table format, which is a binary tree. The methods are: addSymbol, * symbolIsUnique, showSymbol. */ class node { /* the tree is made up of these nodes */ static final int freshnode = 0, oldonce = 1, newonce = 2, bothonce = 3, other = 4; static node panchor = null; /* symtab is a tree hung from this */ /** * addSymbol(String pline) – Saves line into the symbol table. Returns a * handle to the symtab entry for that unique line. If inoldfile nonzero, * then linenum is remembered. */ static node addSymbol(String pline, boolean inoldfile, int linenum, Formatter formatter) { node pnode; pnode = matchsymbol(pline, formatter); /* find the node in the tree */ if (pnode.linestate == freshnode) { pnode.linestate = inoldfile ? oldonce : newonce; } else { if ((pnode.linestate == oldonce && !inoldfile) || (pnode.linestate == newonce && inoldfile)) pnode.linestate = bothonce; else pnode.linestate = other; } if (inoldfile) pnode.linenum = linenum; return pnode; } /** * matchsymbol Searches tree for a match to the line. * * @param pline * pline, a line of text If node's linestate == freshnode, then * created the node. * @param formatter */ static node matchsymbol(String pline, Formatter formatter) { int comparison; node pnode = panchor; if (panchor == null) return panchor = new node(pline, formatter); for (;;) { comparison = pnode.line.compareTo(pline); if (comparison == 0) return pnode; /* found */ if (comparison 0) { if (pnode.pright == null) { pnode.pright = new node(pline, formatter); return pnode.pright; } pnode = pnode.pright; } } /* NOTE: There are return stmts, so control does not get here. */ } final private Formatter formatter; String line; int linenum; int /* enum linestates */linestate; node pleft, pright; /** * Construct a new symbol table node and fill in its fields. * * @param pline * A line of the text file */ node(String pline, Formatter formatter) { this.formatter = formatter; pleft = pright = null; linestate = freshnode; /* linenum field is not always valid */ line = pline; } /** * showSymbol Prints the line to stdout. */ void showSymbol() { this.formatter.format(line+”\n”); // System.out.println(line); } /** * symbolIsUnique Arg is a ptr previously returned by addSymbol. * ————– Returns true if the line was added to the symbol table * exactly once with inoldfile true, and exactly once with inoldfile false. */ boolean symbolIsUnique() { return (linestate == bothonce); } } ]]>

a simple (tricky) sabotage on java debugger

If you rely heavily on a java debugger, beware of this insidious sabotage.

You could use finally blocks. You could surround everything with try/catch(Throwable). If all of these get skipped (rendering your debugger useless) and system silently terminates at inconsistent moments, as if by a divine intervention, then perhaps ….

perhaps you have a silent System.exit() in an obscure thread.

Let me make these pointers clear —
– System.exit() will override/ignore any finally block. What you put in finally blocks will not run in the face of System.exit()
– System.exit() will not trigger catch(Throwable) since no exception is thrown.
– System.exit() in any thread kills the entire JVM.

Q: Is JNI crash similar to System.exit()?
%%A: i think so.

Actually, in any context a silent System.exit() can be hard to track down when you look at the log.

non-dummy-type template parameters

(Note this topic is not related to template Partial specialization)

First, let’s distinguish a template parameter (like “T”) vs a template argument (like a class “Account”). For a NDTTP, the template parameter has a) concrete type and b) a parameter name, typically “size_t value” vs the template argument like “31”.

Simplest example: std::array template has 2 template parameters —

  1. a dummy type T
  2. a non-dummy type “size_t length”

std::array<Account, 31> concretizes the template with a concrete type Account and a value of 31.

Majority of class templates in practice have a dummy type (or multiple, but let’s stay focused) to signify an Unknown-type. For example, the STL vector can be “concretized” in memory to become a real CLASS when given a real type like “Account”. The real type replaces the dummy type. [[C++ primer]] is the first book I read that mentions a very powerful type of class template. I call it NDTTP i.e. non-dummy-type template parameter.

Background — What appears inside angle brackets after the keyword “template” is always a list of “tokens” or “thingies”. In most cases in practice, each thingy is a “typename T” or “class C”.

Now, the thingy can also be something like “int size”. This is known as a non-type template parameter. I call it a non-dummy-type template parameter, or NDT template parameter. a NDT is not a dummy type, but a real type with a parameter name. NDT declaration syntax is like a regular function parameter. NDT represents a dummy type pinned to a real type.  But how is NDT used when the push comes to the shove i.e. template instantiation (I prefer “template concretization”)?

Remember a real type argument like “AccountAllocator” in STL vector is a piece of information at the class level, not Instance level. When the vector is concretized in memory to become a real/concrete Class, the real class uses AccountAllocator Class. In the same vein, the NDT value is at class-level, not class-instance level nor at template level. That implies the “int size” value of 55 is a constant for the class, across all class instances.

In other words, when we convert a GENERIC “unconcretized” template matrix_double into a real class, the “int size” template parameter (NDT template parameter) is replaced by a value like 55, and treated as a class-level static constant. If we construct 9999 instances of the concrete matrix_double class, all of them share the same size=55.

A class template can have all its “tokens” be dummy types (standard practice), or all of them NDT (P861 [[c++primer]]), or a mixture.

See other posts about When the NDT arg is a specific func ptr like “revalue”. In contrast, given a functor class, you use a regular type param (more common), not a NDTTP.

heap is used less in C than C++/Java/C#

At the application level (as opposed to libraries), I personally feel C apps tend to do most of their work on the stack whereas c++ apps uses more heap.

Q: what are the classic C usages for heap? I feel most requirements are met by “auto” and global variables, including large N-dimensional arrays of structures. A big structure can in turn hold arrays/pointers (well, array and pointers are almost indistinguishable.)

A: linked graph data structures.

C++ added a lot of support for heap — including all the new-expressions and various operator-new (not to mention deletes).

– C++ new-expression ties together heap allocation and class constructor.
– C++ delete-expression ties together heap de-allocation and class destructor.

In C++, class instances are commonly allocated either on stack OR on heap. Java/C# is even more heap-oriented. Why?

add a python script to context menu (windows explorer)

based on http://zephyrfalcon.org/weblog/arch_d7_2003_08_09.html#e306

Look for HKEY_CLASSES_ROOT.Folder to add to folder-level right-click context menu. (HKEY_CLASSES_ROOT.* i.e. the _asterisk_ for file-level context menu)

In your python script,

       dirName, baseName = os.path.split(sys.argv[1])
       print sys.argv
       print dirName
       print baseName

simple script to count classes defined in a python project

Any time you have a sizeable python project with many *.py source files, you can use this script to count how many classes defined.

import re, sys
from os import walk

printed={}
for (path, dirs, files) in walk(“c:\py”) :
       for filename in files :
               if not re.search(“.py$”,filename) : continue
               if not printed.has_key(path):
                       print ” path = ” + path
                       printed[path] = True

               for line in open (path+'\'+filename) :
                       #if re.search('^s*def ', line) : print line,
                       #if re.search('^s*classs', line) : print filename + ':t' + line,
                       if re.search('^s*trys', line) : print filename + ':t' + line,
                       #if re.search('@', line) : print filename + ':t' + line,

stack frame has a pointer to the caller stack frame

Q: what data occupy the real estate of a single stack frame? Remember a stack frame is a chunk of memory of x bytes and each byte has a purpose.

A: (obviously) any “auto” variable declared (therefore allocated) in the function

A: (obviously) any formal parameter. If a 44-byte Account parameter is passed-by-value, then the 44-bytes are allocated in the stack frame. If by-reference, then only a 4-byte pointer allocated.

A: 4-byte pointer to caller's stack frame. Note that frame also contains a pointer to its own caller. Therefore, the stack frames form a linked list. This pointer is known as a “previous stack top”.

A: 4-byte ReturnAddress. When a function f4() returns, control passes back to the caller function f3(). Now at assembly level the caller function may be a stream of 20 instructions. Our f4() may be invoked on Instruction #8 or whatever. This information is saved in f4() stack frame under “ReturnAddress”. Upon return, this information is put into the “instruction pointer” register inside the CPU.

operands to assembly instructions

I feel most operands are registers. See first operand in example below. That means we must load our 32 bits into the EAX register before the operation.

However, an operand can also refer directly to a memory location.

SUB EAX [0x10050D49]

A third type of operand is a constant. You pass that constant from source code to compiler and it is embedded in the “object file”