About

Welcome to Panela, Matt Harrison's take on mostly Open Source, Linux, Python, innovation in those areas, other buzzwords and Dick Proenneke. It comes complete with illustrations as needed. Note that the opinions expressed here are merely my own and not those of my employer.

Clone Digger - another tool to add to your python belt

posted 2009.03.19 Thu

A few years back my boss asked me how I would implement code duplication detection. Real duplication detection would ignore variable renames, comments, and indenting (for non-Python code). A naive solution would just use string comparisons and not be too useful. I thought about it for a minute and said I'd make an AST for the code and compare the trees. Not that it's a particularly novel idea, we were just discussing the value of the IP of some companies...
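A minimal sketch of that tree-comparison idea, using Python's standard `ast` module. This is an illustration under assumptions, not any particular tool's algorithm: the names `_Normalizer`, `fingerprint`, and `is_clone` are made up here, and it only catches whole-snippet matches rather than finding similar fragments inside a larger code base.

```python
import ast


class _Normalizer(ast.NodeTransformer):
    """Rename every variable and argument to a positional placeholder."""

    def __init__(self):
        self.names = {}

    def _placeholder(self, name):
        # The first distinct name seen becomes v0, the next v1, and so on.
        return self.names.setdefault(name, "v%d" % len(self.names))

    def visit_Name(self, node):
        node.id = self._placeholder(node.id)
        return node

    def visit_arg(self, node):
        node.arg = self._placeholder(node.arg)
        self.generic_visit(node)  # also normalize any annotation
        return node


def fingerprint(source):
    """Return a string that is equal for structurally identical code."""
    # Parsing already throws away comments and formatting differences.
    tree = ast.parse(source)
    # ast.dump omits line and column numbers by default.
    return ast.dump(_Normalizer().visit(tree))


def is_clone(a, b):
    """True if the two snippets differ only by names, comments, or layout."""
    return fingerprint(a) == fingerprint(b)
```

With this, `def add(x, y): total = x + y; return total` and the same function written with `a`, `b`, and `result` compare as clones, while changing `+` to `-` does not. Function and attribute names are deliberately left alone in this sketch, so two functions with different names will not match; a real detector would need to decide how much of that to normalize.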

Well, there's been code floating around the intertubes for the past year or so that does just this for Python code. The project is called Clone Digger and works on Python and Java. The output is an HTML file showing chunks of code that either differ by variables or operation changes (in red) or code that is the same (in blue).

I think Clone Digger is another useful tool to use for code reviews, when inheriting a bit of code, or after developing a chunk of code. This is going to be very useful for some refactoring we are doing at work. Now if it only worked on JavaScript....





1. Jean-Paul Calderone left...
2009.03.20 Fri 7:44 am :: http://jcalderone.livejournal.com/

I gave Clone Digger a try just about a year ago. At the time, it wasn't very impressive. It took days to run over the code base I was interested in. Really - days. The results it produced weren't extremely exciting. It showed me some useful information about duplication, but it also showed lots of really irrelevant things that didn't represent any meaningful duplication. Sifting through the bogus output would likely have taken long enough to negate the time saved by using Clone Digger in the first place.

This was immediately after the initial announcement, though. I'd be interested to hear if things have advanced since then. Has anyone used Clone Digger on a non-trivial project and gotten good results?


2. Matt left...
2009.03.20 Fri 7:55 am

Jean-Paul- There doesn't appear to be much activity on the project, but apparently Clone Digger was in GSOC last year and a student did improve the speed. I haven't tried it on larger codebases (such as Twisted), so I can't comment on speed there. Yes, some of the potential refactorings are impossible/meaningless, but I thought it was interesting to pore over the results and think about how one would refactor them. Some are just like someone hit you over the head with a cluestick. YMMV.


3. drewp left...
2009.03.21 Sat 11:53 pm

I tried it within the last few months. It was quite slow, and the html result page was really slow to redraw in firefox too. ISTR finding a bit of interesting stuff and a lot of sad cases of I'll-make-a-new-tool-via-cp-and-some-minor-editing.


4. Olle Jonsson left...
2009.06.08 Mon 6:22 am :: http://ollehost.dk/blog/

I follow the Clone Digger mailing list, but am mostly a keen tourist. There have been stabs at getting Antlr 3-based JavaScript clone-digging functionality in there. There is also a Lua folder among the project's files.

If you have the Antlr book, interest, and copious free time, this might be a hacking jewel for you.


5. sakthi left...
2009.07.02 Thu 3:30 am :: http://www.macrotesting.com

It showed me some useful information about duplication, but it also showed lots of really irrelevant things that didn't represent any meaningful duplication. I got some needy tools from macrotesting its a best place for the programmers/testers. Your post is really useful for me in the Code Duplication Detection... Thank you...

Cheers sakthi