Opened 13 years ago
Last modified 10 months ago
#10267 new defect
Version controls diffs of large text files kill the system
Reported by: | anonymous | Owned by: | |
---|---|---|---|
Priority: | high | Milestone: | next-dev-1.7.x |
Component: | version control/changeset view | Version: | 0.12.2 |
Severity: | major | Keywords: | diff performance genshi |
Cc: | Branch: | ||
Release Notes: | |||
API Changes: | |||
Internal Changes: |
Description (last modified by )
We have some large text files (they are some data files we deal with) and when you try to view "last revision" or a revision with them it bombs (i.e. the service's cpu usage goes to 100%, and things come to a screeching halt).
Looks like the issue is in: versioncontrol/web_ui/changeset.py
The _content_changes
method tries to do a diff of anything that isn't a binary. This is fine for most text files, unless they are big (like even 5 to 10 megs seems a bit much for the diffing method used). Anyway, I simply added the following to _content_changes
:
if (len(old_content) > 100000 or len(new_content) > 100000): return None
(Sorry if it isn't proper pep8)
It looks like you have a max_diff_bytes but that only effects the diff results. Ideally you would want a threshold before you have the diff results.
Attachments (0)
Change History (15)
comment:1 by , 13 years ago
Component: | general → version control/changeset view |
---|---|
Keywords: | diff added |
Milestone: | → next-major-0.1X |
Priority: | normal → high |
Severity: | normal → major |
comment:2 by , 13 years ago
Same problem here. Changeset view fails for an update to a huge sql dump. The suggested patch works very well for me. I suggest to include the fix in the next minor update.
comment:3 by , 12 years ago
Milestone: | next-major-releases → next-stable-1.0.x |
---|
We indeed need another threshold, like max_bytes_for_diff
. And it probably should take into account the cumulative size, otherwise you'd still have a problem with 10 1MB files if your threshold is at 2MB …
follow-up: 13 comment:4 by , 10 years ago
Small improvement for diff_div.html
in [537484b1d/jomae.git]. Before the changes, check of that diff.style
is sidebyside
/inline
is executed for each diff line. We could reduce times of the check to just once.
comment:5 by , 10 years ago
Description: | modified (diff) |
---|
comment:6 by , 9 years ago
Keywords: | performance genshi added |
---|
comment:7 by , 8 years ago
Cc: | added |
---|
follow-up: 9 comment:8 by , 8 years ago
hi,
Please find my patch regarding to the diff page performance in b548cb57.
Two major enhancements are done:
- I cached the API calls for the
git ls-tree
. Now for a diff page of 100 files, the number of API calls reduced from 200 calls to 2 calls (1 for old revision, and 1 for new revision)
- I used
difflib.SequenceMatcher.quick_ratio
for the diff_bytes calculation of content (Indef _estimate_changes()
). Previously, the system only usedold_content_length + new_content_length
as a metric. So if the user only changed one line injquery.js
, the diff_bytes would be (total size ofjquery.js
) x 2.
Here is the performance comparison in my own low-config mini PC (Celeron N2840, 2GB memory, no other tasks), while viewing one changeset including around 100 files:
- before any change, the total time is about 23 seconds (exclude time for template rendering)
- after
enhancement 1
is done, the total time is reduced to 1 second
- after
enhancement 2
is done, the total time increased back to 8 seconds
follow-up: 10 comment:9 by , 8 years ago
Replying to walty <walty8@…>:
Please find my patch regarding to the diff page performance in b548cb57.
This ticket is focused on rendering diff with Genshi templates. Your changes is not in scope of this ticket. Please create new ticket focused on git.
One thing: the following changes would lead performance down if the node is a large file.
- old_size = old_node.get_content_length() - new_size = new_node.get_content_length() - return old_size + new_size + old_content = old_node.get_content().read() + new_content = new_node.get_content().read() + ratio = difflib.SequenceMatcher(None, + old_content, + new_content).quick_ratio()
comment:10 by , 8 years ago
Replying to Jun Omae:
Replying to walty <walty8@…>:
Please find my patch regarding to the diff page performance in b548cb57.
This ticket is focused on rendering diff with Genshi templates. Your changes is not in scope of this ticket. Please create new ticket focused on git.
Please find the new ticket in #12545.
comment:11 by , 8 years ago
Cc: | removed |
---|
comment:12 by , 8 years ago
Milestone: | next-stable-1.0.x → next-stable-1.2.x |
---|
Moved ticket assigned to next-stable-1.0.x since maintenance of 1.0.x is coming to a close. Please move the ticket back if it's critical to fix on 1.0.x.
comment:13 by , 8 years ago
Milestone: | next-stable-1.2.x → next-dev-1.3.x |
---|
Replying to Jun Omae:
Small improvement for
diff_div.html
in [537484b1d/jomae.git]. Before the changes, check of thatdiff.style
issidebyside
/inline
is executed for each diff line. We could reduce times of the check to just once.
We could port that one to the Jinja2 template?
Incidentally I faced the very same problem this morning…
When I didn't get any response in the browser, I logged on the server and saw the httpd process stuck at 100% CPU. With gdb, I couldn't figure out the problem… because there was no error, just intensive processing in order to generate a diff for an XML file that grew from 70k lines to 100k lines, as I found out after waiting long enough ;-)