Edgewall Software
Modify

Opened 5 years ago

Last modified 2 days ago

#10267 new defect

Version controls diffs of large text files kill the system

Reported by: anonymous Owned by:
Priority: high Milestone: next-stable-1.0.x
Component: version control/changeset view Version: 0.12.2
Severity: major Keywords: diff performance genshi
Cc:
Release Notes:
API Changes:

Description (last modified by Ryan J Ollos)

We have some large text files (they are some data files we deal with) and when you try to view "last revision" or a revision with them it bombs (i.e. the service's cpu usage goes to 100%, and things come to a screeching halt).

Looks like the issue is in: versioncontrol/web_ui/changeset.py

The _content_changes method tries to do a diff of anything that isn't a binary. This is fine for most text files, unless they are big (like even 5 to 10 megs seems a bit much for the diffing method used). Anyway, I simply added the following to _content_changes:

if (len(old_content) > 100000 or len(new_content) > 100000):
    return None

(Sorry if it isn't proper pep8)

It looks like you have a max_diff_bytes but that only effects the diff results. Ideally you would want a threshold before you have the diff results.

Attachments (0)

Change History (11)

comment:1 Changed 5 years ago by Christian Boos

Component: generalversion control/changeset view
Keywords: diff added
Milestone: next-major-0.1X
Priority: normalhigh
Severity: normalmajor

Incidentally I faced the very same problem this morning…

When I didn't get any response in the browser, I logged on the server and saw the httpd process stuck at 100% CPU. With gdb, I couldn't figure out the problem… because there was no error, just intensive processing in order to generate a diff for an XML file that grew from 70k lines to 100k lines, as I found out after waiting long enough ;-)

comment:2 Changed 5 years ago by anonymous

Same problem here. Changeset view fails for an update to a huge sql dump. The suggested patch works very well for me. I suggest to include the fix in the next minor update.

comment:3 Changed 4 years ago by Christian Boos

Milestone: next-major-releasesnext-stable-1.0.x

We indeed need another threshold, like max_bytes_for_diff. And it probably should take into account the cumulative size, otherwise you'd still have a problem with 10 1MB files if your threshold is at 2MB …

comment:4 Changed 21 months ago by Jun Omae

Small improvement for diff_div.html in [537484b1d/jomae.git]. Before the changes, check of that diff.style is sidebyside/inline is executed for each diff line. We could reduce times of the check to just once.

comment:5 Changed 20 months ago by Ryan J Ollos

Description: modified (diff)

comment:6 Changed 6 months ago by Christian Boos

Keywords: performance genshi added

comment:7 Changed 6 weeks ago by walty <walty8@…>

Cc: walty8@… added

comment:8 Changed 9 days ago by walty <walty8@…>

hi,

Please find my patch regarding to the diff page performance in b548cb57.

Two major enhancements are done:

  1. I cached the API calls for the git ls-tree. Now for a diff page of 100 files, the number of API calls reduced from 200 calls to 2 calls (1 for old revision, and 1 for new revision)
  1. I used difflib.SequenceMatcher.quick_ratio for the diff_bytes calculation of content (In def _estimate_changes()). Previously, the system only used old_content_length + new_content_length as a metric. So if the user only changed one line in jquery.js, the diff_bytes would be (total size of jquery.js) x 2.

Here is the performance comparison in my own low-config mini PC (Celeron N2840, 2GB memory, no other tasks), while viewing one changeset including around 100 files:

  1. before any change, the total time is about 23 seconds (exclude time for template rendering)
  1. after enhancement 1 is done, the total time is reduced to 1 second
  1. after enhancement 2 is done, the total time increased back to 8 seconds

comment:9 in reply to:  8 ; Changed 9 days ago by Jun Omae

Replying to walty <walty8@…>:

Please find my patch regarding to the diff page performance in b548cb57.

This ticket is focused on rendering diff with Genshi templates. Your changes is not in scope of this ticket. Please create new ticket focused on git.

One thing: the following changes would lead performance down if the node is a large file.

-            old_size = old_node.get_content_length()
-            new_size = new_node.get_content_length()
-            return old_size + new_size
+            old_content = old_node.get_content().read()
+            new_content = new_node.get_content().read()
+            ratio = difflib.SequenceMatcher(None,
+                                            old_content,
+                                            new_content).quick_ratio()
Last edited 9 days ago by Jun Omae (previous) (diff)

comment:10 in reply to:  9 Changed 2 days ago by walty <walty8@…>

Replying to Jun Omae:

Replying to walty <walty8@…>:

Please find my patch regarding to the diff page performance in b548cb57.

This ticket is focused on rendering diff with Genshi templates. Your changes is not in scope of this ticket. Please create new ticket focused on git.

Please find the new ticket in #12545.

comment:11 Changed 2 days ago by walty <walty8@…>

Cc: walty8@… removed

Modify Ticket

Change Properties
Set your email in Preferences
Action
as new The ticket will remain with no owner.
The ticket will be disowned. Next status will be 'new'.
as The resolution will be set. Next status will be 'closed'.
The owner will be changed from (none) to anonymous. Next status will be 'assigned'.

Add Comment


E-mail address and name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.