simple diff class – in pure java, easily embeddable (src)

<![CDATA[ //Based on work by other authors import java.io.DataInputStream; import java.io.FileInputStream; import java.io.IOException; import java.util.Formatter; /** * This single-file self-contained utility consists of 1 public class and 2 * non-public classes (non-CamelCase names). *

* USAGE: diff oldfile newfile *

* This program assumes that “oldfile” and “newfile” are text files. The program * writes to stdout a description of the changes which would transform “oldfile” * into “newfile”. *

* The printout is in the form of commands, each followed by a block of text. * The text is delimited by the commands, which are: *

* DELETE AT n ..deleted lines *

* INSERT BEFORE n ..inserted lines *

* n MOVED TO BEFORE n ..moved lines *

* n CHANGED FROM ..old lines CHANGED TO ..newer lines *

* The line numbers all refer to the lines of the oldfile, as they are numbered * before any commands are applied. The text lines are printed as-is, without * indentation or prefixing. The commands are printed in upper case, with a * prefix of “>>>>”, so that they will stand out. Other schemes may be * preferred. *

* Files which contain more than MAXLINECOUNT lines cannot be processed. This * can be fixed by changing “symbol” to a Vector. The algorithm is taken from * Communications of the ACM, Apr78 (21, 4, 264-), * “A Technique for Isolating Differences Between Files.” Ignoring I/O, and * ignoring the symbol table, it should take O(N) time. This implementation * takes fixed space, plus O(U) space for the symbol table (where U is the * number of unique lines). Methods exist to change the fixed space to O(N) * space. *

* Note that this is not the only interesting file-difference algorithm. In * general, different algorithms draw different conclusions about the changes * that have been made to the oldfile. This algorithm is sometimes “more right”, * particularly since it does not consider a block move to be an insertion and a * (separate) deletion. However, on some files it will be “less right”. This is * a consequence of the fact that files may contain many identical lines * (particularly if they are program source). Each algorithm resolves the * ambiguity in its own way, and the resolution is never guaranteed to be * “right”. However, it is often excellent. This program is intended to be * pedagogic. Specifically, this program was the basis of the Literate * Programming column which appeared in the Communications of the ACM (CACM), in * the June 1989 issue (32, 6, 740-755). *

* By “pedagogic”, I do not mean that the program is gracefully worded, or that * it showcases language features or its algorithm. I also do not mean that it * is highly accessible to beginners, or that it is intended to be read in full, * or in a particular order. Rather, this program is an example of one * professional’s style of keeping things organized and maintainable. *

* The program would be better if the “print” variables were wrapped into a * struct. In general, grouping related variables in this way improves * documentation, and adds the ability to pass the group in argument lists. This * program is a de-engineered version of a program which uses less memory and * less time. The article points out that the “symbol” arrays can be implemented * as arrays of pointers to arrays, with dynamic allocation of the subarrays. * (In C, macros are very useful for hiding the two-level accesses.) In Java, a * Vector would be used. This allows an extremely large value for MAXLINECOUNT, * without dedicating fixed arrays. (The “other” array can be allocated after * the input phase, when the exact sizes are known.) The only slow piece of code * is the “strcmp” in the tree descent: it can be speeded up by keeping a hash * in the tree node, and only using “strcmp” when two hashes happen to be equal. * * @author Ian F. Darwin, Java version * @author D. C. Lindsay, C version (1982-1987) * @author minor adjustment by Bin Tan */ public class Diff { /* The following are global to printout’s subsidiary routines */ // enum{ idle, delete, insert, movenew, moveold, // same, change } printstatus; public static final int idle = 0, delete = 1, insert = 2, movenew = 3, moveold = 4, same = 5, change = 6; private static final String LINE = “————————–“; /** * block len > any possible real block len */ final static int UNREAL = Integer.MAX_VALUE; /** * main – entry point when used standalone. NOTE: no routines return error * codes or throw any local exceptions. Instead, any routine may complain to * stderr and then quit with error to the system. */ public static void main(String argstrings[]) { argstrings = new String[] { “C:\\0\\Copy of c#IV_question.txt”, “C:\\0\\c#IV_question.txt” }; if (argstrings.length != 2) { throw new RuntimeException(“Usage: diff oldfile newfile”); } Diff d = new Diff(); Formatter f = d.doDiff(argstrings[0], argstrings[1]); System.out.println(f); return; } boolean anyprinted; /** * blocklen is the info about found blocks. It will be set to 0, except at * the line#s where blocks start in the old file. At these places it will be * set to the # of lines in the block. During printout , this # will be * reset to -1 if the block is printed as a MOVE block (because the printout * phase will encounter the block twice, but must only print it once.) The * array declarations are to MAXLINECOUNT+2 so that we can have two extra * lines (pseudolines) at line# 0 and line# MAXLINECOUNT+1 (or less). */ int blocklen[]; final public Formatter formatter = new Formatter(new StringBuilder()); /** * Keeps track of information about file1 and file2 */ fileInfo oldinfo, newinfo; int printstatus, printoldline, printnewline; // line numbers in old & new file public Diff() { } /** * Do one file comparison. Called with both filenames. */ public Formatter doDiff(String oldFile, String newFile) { println(“>>>> Difference of file \”” + oldFile + “\” and file \”” + newFile + “\”.\n”); oldinfo = new fileInfo(oldFile); newinfo = new fileInfo(newFile); /* we don’t process until we know both files really do exist. */ try { inputscan(oldinfo); inputscan(newinfo); } catch (IOException e) { System.err.println(“Read error: ” + e); } /* * Now that we’ve read all the lines, allocate some arrays. */ blocklen = new int[(oldinfo.maxLine > newinfo.maxLine ? oldinfo.maxLine : newinfo.maxLine) + 2]; oldinfo.alloc(); newinfo.alloc(); /* Now do the work, and print the results. */ transform(); return printout(); } /** * inputscan Reads the file specified by pinfo.file. ——— Places the * lines of that file in the symbol table. Sets pinfo.maxLine to the number * of lines found. */ @SuppressWarnings(“deprecation”) void inputscan(fileInfo pinfo) throws IOException { String linebuffer; pinfo.maxLine = 0; while ((linebuffer = pinfo.file.readLine()) != null) { storeline(linebuffer, pinfo); // System.out.println(linebuffer); } } /* * newconsume Part of printout. Have run out of old file. Print the rest of * the new file, as inserts and/or moves. */ void newconsume() { for (;;) { if (printnewline > newinfo.maxLine) break; /* end of file */ if (newinfo.other[printnewline] oldinfo.maxLine) break; /* end of file */ printnewline = oldinfo.other[printoldline]; if (printnewline < 0) showdelete(); else if (blocklen[printoldline] oldinfo.maxLine) { newconsume(); break; } if (printnewline > newinfo.maxLine) { oldconsume(); break; } if (newinfo.other[printnewline] < 0) { if (oldinfo.other[printoldline] < 0) this.showchange(); else showinsert(); } else if (oldinfo.other[printoldline] < 0) showdelete(); else if (blocklen[printoldline] >>> End of differences.”); else println(“>>>> Files are identical.”); return formatter; } /* * scanafter Expects both files in symtab, and oldinfo and newinfo valid. * Expects the “other” arrays contain positive #s to indicate lines that are * unique in both files. For each such pair of places, scans past in each * file. Contiguous groups of lines that match non-uniquely are taken to be * good-enough matches, and so marked in “other”. Assumes each other[0] is * 0. */ void scanafter() { int oldline, newline; for (newline = 0; newline = 0) { /* is unique in old & new */ for (;;) { /* scan after there in both files */ if (++oldline > oldinfo.maxLine) break; if (oldinfo.other[oldline] >= 0) break; if (++newline > newinfo.maxLine) break; if (newinfo.other[newline] >= 0) break; // oldline & newline exist, and aren’t already matched if (newinfo.symbol[newline] != oldinfo.symbol[oldline]) break; // not same newinfo.other[newline] = oldline; // record a match oldinfo.other[oldline] = newline; } } } } /** * scanbefore As scanafter, except scans towards file fronts. Assumes the * off-end lines have been marked as a match. */ void scanbefore() { int oldline, newline; for (newline = newinfo.maxLine + 1; newline > 0; newline–) { oldline = newinfo.other[newline]; if (oldline >= 0) { /* unique in each */ for (;;) { if (–oldline = 0) break; if (–newline = 0) break; // oldline and newline exist, and aren’t marked yet if (newinfo.symbol[newline] != oldinfo.symbol[oldline]) break; // not same newinfo.other[newline] = oldline; // record a match oldinfo.other[oldline] = newline; } } } } /** * scanblocks – Finds the beginnings and lengths of blocks of matches. Sets * the blocklen array (see definition). Expects oldinfo valid. */ void scanblocks() { int oldline, newline; int oldfront = 0; // line# of front of a block in old, or 0 int newlast = -1; // newline’s value during prev. iteration for (oldline = 1; oldline <= oldinfo.maxLine; oldline++) blocklen[oldline] = 0; blocklen[oldinfo.maxLine + 1] = UNREAL; // starts a mythical blk for (oldline = 1; oldline <= oldinfo.maxLine; oldline++) { newline = oldinfo.other[oldline]; if (newline < 0) oldfront = 0; /* no match: not in block */ else { /* match. */ if (oldfront == 0) oldfront = oldline; if (newline != (newlast + 1)) oldfront = oldline; ++blocklen[oldfront]; } newlast = newline; } } /* * scanunique Scans for lines which are used exactly once in each file. * Expects both files in symtab, and oldinfo and newinfo valid. The * appropriate "other" array entries are set to the line# in the other file. * Claims pseudo-lines at 0 and XXXinfo.maxLine+1 are unique. */ void scanunique() { int oldline, newline; node psymbol; for (newline = 1; newline >>> ” + printoldline + ” CHANGED FROM”); printstatus = change; oldinfo.symbol[printoldline].showSymbol(); anyprinted = true; printoldline++; } /** * showdelete Part of printout. Expects printoldline is at a deletion. */ void showdelete() { if (printstatus != delete) println(“\n>>>> DELETE AT ” + printoldline); printstatus = delete; oldinfo.symbol[printoldline].showSymbol(); anyprinted = true; printoldline++; } /* * showinsert Part of printout. Expects printnewline is at an insertion. */ void showinsert() { if (printstatus == change) println(LINE + “\n>>>> CHANGED TO”); else if (printstatus != insert) println(“\n>>>> INSERT BEFORE ” + printoldline); printstatus = insert; newinfo.symbol[printnewline].showSymbol(); anyprinted = true; printnewline++; } /** * showmove Part of printout. Expects printoldline, printnewline at start of * two different blocks ( a move was done). */ void showmove() { int oldblock = blocklen[printoldline]; int newother = newinfo.other[printnewline]; int newblock = blocklen[newother]; if (newblock = newblock) { // assume new’s blk moved. blocklen[newother] = -1; // stamp block as “printed”. println(“>>>> ” + newother + ” THRU ” + (newother + newblock – 1) + ” MOVED TO BEFORE ” + printoldline); for (; newblock > 0; newblock–, printnewline++) newinfo.symbol[printnewline].showSymbol(); anyprinted = true; printstatus = idle; } else /* assume old’s block moved */ skipold(); /* target line# not known, display later */ } /** * showsame Part of printout. Expects printnewline and printoldline at start * of two blocks that aren’t to be displayed. */ void showsame() { int count; printstatus = idle; if (newinfo.other[printnewline] != printoldline) { throw new RuntimeException(“BUG IN LINE REFERENCING”); } count = blocklen[printoldline]; printoldline += count; printnewline += count; } /** * skipnew Part of printout. Expects printnewline is at start of a new block * that has already been announced as a move. Skips over the new block. */ void skipnew() { int oldline; printstatus = idle; for (;;) { if (++printnewline > newinfo.maxLine) break; /* end of file */ oldline = newinfo.other[printnewline]; if (oldline oldinfo.maxLine) break; /* end of file */ if (oldinfo.other[printoldline] fileInfo.MAXLINECOUNT) { throw new RuntimeException(“MAXLINECOUNT exceeded, must stop.”); } pinfo.symbol[linenum] = node.addSymbol(linebuffer, pinfo == oldinfo, linenum, this.formatter); } /* * transform Analyzes the file differences and leaves its findings in the * global arrays oldinfo.other, newinfo.other, and blocklen. Expects both * files in symtab. Expects valid “maxLine” and “symbol” in oldinfo and * newinfo. */ void transform() { int oldline, newline; int oldmax = oldinfo.maxLine + 2; /* Count pseudolines at */ int newmax = newinfo.maxLine + 2; /* ..front and rear of file */ for (oldline = 0; oldline < oldmax; oldline++) oldinfo.other[oldline] = -1; for (newline = 0; newline < newmax; newline++) newinfo.other[newline] = -1; scanunique(); /* scan for lines used once in both files */ scanafter(); /* scan past sure-matches for non-unique blocks */ scanbefore(); /* scan backwards from sure-matches */ scanblocks(); /* find the fronts and lengths of blocks */ } }; // end of main class! /** * This is the info kept per-file. */ class fileInfo { static final int MAXLINECOUNT = 20000; DataInputStream file; /* File handle that is open for read. */ public int maxLine; /* After input done, # lines in file. */ int other[]; /* Map of line# to line# in other file */ node symbol[]; /* The symtab handle of each line. */ /* ( -1 means don't-know ). */ /* Allocated AFTER the lines are read. */ /** * Normal constructor with one filename; file is opened and saved. */ fileInfo(String filename) { symbol = new node[MAXLINECOUNT + 2]; other = null; // allocated later! try { file = new DataInputStream(new FileInputStream(filename)); } catch (IOException e) { System.err.println("Diff can't read file " + filename); System.err.println("Error Exception was:" + e); throw new RuntimeException(e); } } // This is done late, to be same size as # lines in input file. void alloc() { other = new int[symbol.length + 2]; } } /** * Class "node". The symbol table routines in this class all understand the * symbol table format, which is a binary tree. The methods are: addSymbol, * symbolIsUnique, showSymbol. */ class node { /* the tree is made up of these nodes */ static final int freshnode = 0, oldonce = 1, newonce = 2, bothonce = 3, other = 4; static node panchor = null; /* symtab is a tree hung from this */ /** * addSymbol(String pline) – Saves line into the symbol table. Returns a * handle to the symtab entry for that unique line. If inoldfile nonzero, * then linenum is remembered. */ static node addSymbol(String pline, boolean inoldfile, int linenum, Formatter formatter) { node pnode; pnode = matchsymbol(pline, formatter); /* find the node in the tree */ if (pnode.linestate == freshnode) { pnode.linestate = inoldfile ? oldonce : newonce; } else { if ((pnode.linestate == oldonce && !inoldfile) || (pnode.linestate == newonce && inoldfile)) pnode.linestate = bothonce; else pnode.linestate = other; } if (inoldfile) pnode.linenum = linenum; return pnode; } /** * matchsymbol Searches tree for a match to the line. * * @param pline * pline, a line of text If node's linestate == freshnode, then * created the node. * @param formatter */ static node matchsymbol(String pline, Formatter formatter) { int comparison; node pnode = panchor; if (panchor == null) return panchor = new node(pline, formatter); for (;;) { comparison = pnode.line.compareTo(pline); if (comparison == 0) return pnode; /* found */ if (comparison 0) { if (pnode.pright == null) { pnode.pright = new node(pline, formatter); return pnode.pright; } pnode = pnode.pright; } } /* NOTE: There are return stmts, so control does not get here. */ } final private Formatter formatter; String line; int linenum; int /* enum linestates */linestate; node pleft, pright; /** * Construct a new symbol table node and fill in its fields. * * @param pline * A line of the text file */ node(String pline, Formatter formatter) { this.formatter = formatter; pleft = pright = null; linestate = freshnode; /* linenum field is not always valid */ line = pline; } /** * showSymbol Prints the line to stdout. */ void showSymbol() { this.formatter.format(line+”\n”); // System.out.println(line); } /** * symbolIsUnique Arg is a ptr previously returned by addSymbol. * ————– Returns true if the line was added to the symbol table * exactly once with inoldfile true, and exactly once with inoldfile false. */ boolean symbolIsUnique() { return (linestate == bothonce); } } ]]>

a simple (tricky) sabotage on java debugger

If you rely heavily on a java debugger, beware of this insidious sabotage.

You could use finally blocks. You could surround everything with try/catch(Throwable). If all of these get skipped (rendering your debugger useless) and system silently terminates at inconsistent moments, as if by a divine intervention, then perhaps ….

perhaps you have a silent System.exit() in an obscure thread.

Let me make these pointers clear —
– System.exit() will override/ignore any finally block. What you put in finally blocks will not run in the face of System.exit()
– System.exit() will not trigger catch(Throwable) since no exception is thrown.
– System.exit() in any thread kills the entire JVM.

Q: Is JNI crash similar to System.exit()?
%%A: i think so.

Actually, in any context a silent System.exit() can be hard to track down when you look at the log.

non-dummy-type template parameters

(Note this topic is not related to template Partial specialization)

Majority of class templates in practice have a dummy type (or multiple, but let’s stay focused) to signify an Unknown-type. For example, the STL vector can be “Instantiated” in memory to become a real CLASS when given a real type like “Account”. The real type replaces the dummy type. [[C++ primer]] is the first book I read that mentions a very powerful type of class template. I call it NDTP…

Background — What appears inside angle brackets after the keyword “template” is always a list of “tokens” or “thingies”. In most cases in practice, each thingy is a “typename T” or “class C”.

Now, the thingy can also be something like “int size”. This is known as a non-type template parameter. I call it a non-dummy-type parameter, or NDT parameter. a NDT is not a dummy type, but a real type with a variable name. NDT declaration syntax is like a regular function parameter. NDT represents a dummy type pinned to a real type.  But how is NDT used when the push comes to the shove i.e. template instantiation?

Remember a real type argument like “AccountAllocator” in STL vector is a piece of information at the class level, not Instance level. When the vector is instantiated in memory to become a real/concrete Class, the real class uses AccountAllocator Class. In the same vein, the NDT is at class-level, not class-instance level nor at template level. That implies the “int size” value of 55 is a constant for the class, across all class instances.

In other words, when we convert a GENERIC template matrix_double into a real class, the “int size” template parameter (NDT parameter) is replaced by a value like 55, and treated as a class-level static constant. If we construct 9999 instances of the concrete matrix_double class, all of them share the same size=55.

A class template can have all its “tokens” be dummy types (standard practice), or all of them NDT (P861 [[c++primer]]), or a mixture.

See other posts about When the NDT is a func ptr.

heap is used less in C than C++/Java/C#

At the application level (as opposed to libraries), I personally feel C apps tend to do most of their work on the stack whereas c++ apps uses more heap.

Q: what are the classic C usages for heap? I feel most requirements are met by “auto” and global variables, including large N-dimensional arrays of structures. A big structure can in turn hold arrays/pointers (well, array and pointers are almost indistinguishable.)

A: linked graph data structures.

C++ added a lot of support for heap — including all the new-expressions and various operator-new (not to mention deletes).

– C++ new-expression ties together heap allocation and class constructor.
– C++ delete-expression ties together heap de-allocation and class destructor.

In C++, class instances are commonly allocated either on stack OR on heap. Java/C# is even more heap-oriented. Why?

add a python script to context menu (windows explorer)

based on http://zephyrfalcon.org/weblog/arch_d7_2003_08_09.html#e306

Look for HKEY_CLASSES_ROOT.Folder to add to folder-level right-click context menu. (HKEY_CLASSES_ROOT.* i.e. the _asterisk_ for file-level context menu)

In your python script,

       dirName, baseName = os.path.split(sys.argv[1])
       print sys.argv
       print dirName
       print baseName

simple script to count classes defined in a python project

Any time you have a sizeable python project with many *.py source files, you can use this script to count how many classes defined.

import re, sys
from os import walk

printed={}
for (path, dirs, files) in walk(“c:\py”) :
       for filename in files :
               if not re.search(“.py$”,filename) : continue
               if not printed.has_key(path):
                       print ” path = ” + path
                       printed[path] = True

               for line in open (path+'\'+filename) :
                       #if re.search('^s*def ', line) : print line,
                       #if re.search('^s*classs', line) : print filename + ':t' + line,
                       if re.search('^s*trys', line) : print filename + ':t' + line,
                       #if re.search('@', line) : print filename + ':t' + line,