Issue1923

Title Improve performance of --similarity 100
Priority wish Status resolved
Superseder Nosy List WilliamRoberts_hg, mg, tonfa
Assigned To Topics

Created on 2009-11-24.14:36:28 by WilliamRoberts_hg, last changed 2009-11-26.16:00:46 by tonfa.

Messages
msg11093 (view) Author: tonfa Date: 2009-11-26.16:00:46
in main
msg11060 (view) Author: tonfa Date: 2009-11-24.21:05:18
The new *code* (sorry, I hit submit too soon).
msg11059 (view) Author: tonfa Date: 2009-11-24.21:04:49
@William: let us know what the experience with the new is for you.
msg11058 (view) Author: tonfa Date: 2009-11-24.21:03:45
In crew.

http://hg.intevation.org/mercurial/crew/rev/9dfe34bf42c7
changeset:   9924:9dfe34bf42c7
parent:      9912:3d718761157b
user:        Benoit Boissinot <benoit.boissinot@ens-lyon.org>
date:        Tue Nov 24 17:26:42 2009 +0100
summary:     findrenames: first loop over the removed files, it's faster

http://hg.intevation.org/mercurial/crew/rev/4b044b81cb54
changeset:   9925:4b044b81cb54
user:        Benoit Boissinot <benoit.boissinot@ens-lyon.org>
date:        Tue Nov 24 17:39:42 2009 +0100
summary:     findrenames: refactor the score computation

http://hg.intevation.org/mercurial/crew/rev/2ae4d0865629
changeset:   9926:2ae4d0865629
user:        Benoit Boissinot <benoit.boissinot@ens-lyon.org>
date:        Tue Nov 24 18:21:47 2009 +0100
summary:     findrenames: speedup exact match

http://hg.intevation.org/mercurial/crew/rev/a92539567ef3
changeset:   9927:a92539567ef3
user:        Benoit Boissinot <benoit.boissinot@ens-lyon.org>
date:        Tue Nov 24 20:40:04 2009 +0100
summary:     findrenames: improve coding-style
msg11057 (view) Author: tonfa Date: 2009-11-24.18:45:34
See the following csets:
changeset:   9913:9dfe34bf42c7
user:        Benoit Boissinot <benoit.boissinot@ens-lyon.org>
date:        Tue Nov 24 17:26:42 2009 +0100
summary:     findrenames: first loop over the removed files, it's faster

changeset:   9914:4b044b81cb54
user:        Benoit Boissinot <benoit.boissinot@ens-lyon.org>
date:        Tue Nov 24 17:39:42 2009 +0100
summary:     findrenames: refactor the score computation

changeset:   9915:2ae4d0865629
tag:         tip
user:        Benoit Boissinot <benoit.boissinot@ens-lyon.org>
date:        Tue Nov 24 18:21:47 2009 +0100
summary:     findrenames: speedup exact match

here: http://bitbucket.org/bboissin/mercurial-crew-tonfa/
msg11053 (view) Author: tonfa Date: 2009-11-24.15:38:09
If someone want to play with it, here's preliminary patch:

diff --git a/mercurial/cmdutil.py b/mercurial/cmdutil.py
--- a/mercurial/cmdutil.py
+++ b/mercurial/cmdutil.py
@@ -270,31 +270,49 @@
 
 def findrenames(repo, added, removed, threshold):
     '''find renamed files -- yields (before, after, score) tuples'''
+    copies = {}
     ctx = repo['.']
-    for a in added:
-        aa = repo.wread(a)
-        bestname, bestscore = None, threshold
-        for r in removed:
-            if r not in ctx:
-                continue
-            rr = ctx.filectx(r).data()
+    for r in removed:
+        if r not in ctx:
+            continue
+        fctx = ctx.filectx(r)
+        if threshold == 100:
+            n = fctx.node()
+            fparents = fctx.filelog().parents(n)
+            fparents.sort()
+            partialhash = _sha(fparents[0]).update(fparents[1])
+            def score(t):
+                s = partialhash.copy()
+                s.update(t)
+                h = s.digest()
+                if h == n:
+                    return 100
+                return 0
+        else:
+            rr = fctx.data()
+            def score(t):
+                # bdiff.blocks() returns blocks of matching lines
+                # count the number of bytes in each
+                equal = 0
+                alines = mdiff.splitnewlines(aa)
+                matches = bdiff.blocks(aa, rr)
+                for x1, x2, y1, y2 in matches:
+                    for line in alines[x1:x2]:
+                        equal += len(line)
 
-            # bdiff.blocks() returns blocks of matching lines
-            # count the number of bytes in each
-            equal = 0
-            alines = mdiff.splitnewlines(aa)
-            matches = bdiff.blocks(aa, rr)
-            for x1,x2,y1,y2 in matches:
-                for line in alines[x1:x2]:
-                    equal += len(line)
+                lengths = len(aa) + len(rr)
+                if lengths:
+                    return equal*2.0 / lengths
+                return 0
 
-            lengths = len(aa) + len(rr)
-            if lengths:
-                myscore = equal*2.0 / lengths
-                if myscore >= bestscore:
-                    bestname, bestscore = r, myscore
-        if bestname:
-            yield bestname, a, bestscore
+        for a in added:
+            bestscore = copies.get(a, (None, threshold))[1]
+            aa = repo.wread(a)
+            myscore = score(aa)
+            if myscore >= bestscore:
+                copies[a] = (r, myscore)
+    for k, v in copies.iteritems():
+        yield v[0], k, v[1]
 
 def addremove(repo, pats=[], opts={}, dry_run=None, similarity=None):
     if dry_run is None:
msg11052 (view) Author: mg Date: 2009-11-24.14:51:13
Yeah, it would be silly to read the file content in that case (if the sizes
are different).

On the other hand, addremove is not a common command, at least not for me.
So I guess that's why there has been heavily optimized yet.

If you want to play with this, then I think you should take a peek in the
findrenames function here:

  http://selenic.com/hg/file/tip/mercurial/cmdutil.py#l270
msg11051 (view) Author: WilliamRoberts_hg Date: 2009-11-24.14:36:28
Could the "hg addremove --similarity 100" performance be improved by noting
that the checksums of the two files must be identical for 100% similarity,
and so there's no need to extract the file content and compare. It's a
special case optimisation, but "identical" is surely the main use of this
feature?
History
Date User Action Args
2009-11-26 16:00:46tonfasetstatus: testing -> resolved
messages: + msg11093
2009-11-24 21:05:19tonfasetmessages: + msg11060
2009-11-24 21:04:49tonfasetmessages: + msg11059
2009-11-24 21:03:45tonfasetstatus: chatting -> testing
messages: + msg11058
2009-11-24 18:45:35tonfasetmessages: + msg11057
2009-11-24 15:38:15tonfasetnosy: + tonfa
messages: + msg11053
2009-11-24 14:51:13mgsetstatus: unread -> chatting
nosy: + mg
messages: + msg11052
2009-11-24 14:36:28WilliamRoberts_hgcreate