A few years back my boss asked me how I would implement code duplication detection. Real duplication detection would ignore variable renames, comments, and indentation (for languages other than Python). A naive solution would just use string comparisons and would not be very useful. I thought about it for a minute and said I'd build an AST for the code and compare the trees. Not that it's a particularly novel idea; we were just discussing the value of the IP of some companies...
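To make the AST idea concrete, here is a minimal sketch using Python's standard `ast` module. It is not how Clone Digger works internally; the names `fingerprint` and `is_clone` are made up for illustration. It assumes each snippet is a complete, parseable piece of Python, and it only catches exact structural matches after renaming.

```python
import ast


class _Normalizer(ast.NodeTransformer):
    """Replace identifiers with positional placeholders (v0, v1, ...)."""

    def __init__(self):
        self.names = {}

    def _placeholder(self, name):
        # The first distinct name seen becomes v0, the second v1, and so on.
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_Name(self, node):
        node.id = self._placeholder(node.id)
        return node

    def visit_arg(self, node):
        node.arg = self._placeholder(node.arg)
        return node

    def visit_FunctionDef(self, node):
        node.name = self._placeholder(node.name)
        self.generic_visit(node)
        return node


def fingerprint(source):
    """Hypothetical helper: a string describing the renamed tree."""
    tree = ast.parse(source)  # comments and indentation width vanish here
    # ast.dump omits line and column numbers by default.
    return ast.dump(_Normalizer().visit(tree))


def is_clone(a, b):
    """Hypothetical helper: True if both snippets share one tree shape."""
    return fingerprint(a) == fingerprint(b)


original = (
    "def total(items):\n"
    "    # add everything up\n"
    "    result = 0\n"
    "    for item in items:\n"
    "        result += item\n"
    "    return result\n"
)
renamed = (
    "def add_all(xs):\n"
    "  acc = 0\n"
    "  for x in xs:\n"
    "      acc += x\n"
    "  return acc\n"
)
print(is_clone(original, renamed))
```

Because builtins are renamed along with everything else, this sketch would also treat `len(x)` and `sum(x)` as the same code, so a real tool needs something smarter. It also says nothing about near-clones that differ by one operation, which would need a tree similarity measure rather than an equality check.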
Well, there's been code floating around the intertubes for the past year or so that does just this for Python code. The project is called Clone Digger and works on Python and Java. The output is an HTML file showing chunks of code that either differ by variables or operation changes (in red) or code that is the same (in blue).
I think Clone Digger is another useful tool to use for code reviews, when inheriting a bit of code, or after developing a chunk of code. This is going to be very useful for some refactoring we are doing at work. Now if it only worked on JavaScript....
I gave Clone Digger a try just about a year ago. At the time, it wasn't
very impressive. It took days to run over the code base I was interested
in. Really - days. The results it produced weren't extremely exciting.
It showed me some useful information about duplication, but it also showed
lots of really irrelevant things that didn't represent any meaningful
duplication. Sifting through the bogus output would likely have taken long
enough to negate the time saved by using Clone Digger in the first place.
Jean-Paul-
There doesn't appear to be much activity on the project, but apparently
Clone Digger was in GSOC last year and a student did improve the speed. I
haven't tried it on larger codebases (such as twisted), so I can't comment
on speed there. Yes, some of the potential refactorings are
impossible/meaningless, but I thought it was interesting to pore over the
results and think about how one would refactor them. Some are just like
someone hit you over the head with a cluestick. YMMV.
I tried it within the last few months. It was quite slow, and the HTML
result page was really slow to redraw in Firefox too. ISTR finding a bit of
interesting stuff and a lot of sad cases of
I'll-make-a-new-tool-via-cp-and-some-minor-editing.
I follow the Clone Digger mailing list, but am mostly a keen tourist. There
have been stabs at getting ANTLR 3-based JavaScript clone-digging
functionality in there. There is also a Lua folder among the project's
files.