Edgewall Software
Modify

Opened 5 years ago

Last modified 6 months ago

#13169 new defect

GitNode.get_content() read entire of a file into memory even if it is huge

Reported by: Jun Omae Owned by:
Priority: normal Milestone: next-stable-1.6.x
Component: plugin/git Version:
Severity: minor Keywords:
Cc: Branch:
Release Notes:
API Changes:
Internal Changes:

Description

I think we should use TemporaryFile/SpooledTemporaryFile rather than cStringIO.

See also:

Attachments (0)

Change History (10)

comment:1 by Jun Omae, 5 years ago

Severity: normalminor

Reconsidering, GitNode.get_content() cannot return until the entire file is written to a temporary file. If the file is large, I think the behavior is slow. In the following changes, the method uses stdout of git cat-file blob {sha} without reading the content when size of the content is large than 32 MB.

Proposed changes in [8ec7797e6/jomae.git] (jomae.git@t13169+1.0).

comment:2 by Ryan J Ollos, 5 years ago

Two minor suggestions in [fd8e27b47/rjollos.git]. We can skip setting default parameter stdout=PIPE, like in r15074. Also, I noticed that changes will need some rebasing on trunk due removal of _close_proc_pipes in r15077. The function could be restored to avoid duplicate code.

Looking at introduction of GitProcStdout and thinking about r15077, would it make sense to introduce a wrapper for Popen that cleans up on __exit__ and __del__? GitProcStdout does that while providing a file-like interface for stdout.

I experimented with changes in [f3cab6f42/rjollos.git]. I had to keep GitProcStdout because we need to hold a reference to the process when returning from get_file to avoid I/O on a closed file. After the changes, do we need to keep reusing the same process in Storage.cat_file, or could we also implement [38381686f/rjollos.git]?

in reply to:  2 comment:3 by Jun Omae, 5 years ago

Replying to Ryan J Ollos:

Two minor suggestions in [fd8e27b47/rjollos.git]. We can skip setting default parameter stdout=PIPE, like in r15074. Also, I noticed that changes will need some rebasing on trunk due removal of _close_proc_pipes in r15077. The function could be restored to avoid duplicate code.

The changes looks good to me. In addition, GitProcStdout.close method is restored in [632e91a44/jomae.git]. The close method is invoked by content_closing() after r16591 (#12977).

Looking at introduction of GitProcStdout and thinking about r15077, would it make sense to introduce a wrapper for Popen that cleans up on __exit__ and __del__? GitProcStdout does that while providing a file-like interface for stdout.

Backporting r15077 to 1.0-stable sounds good, however it is not recommended if we will release 1.0.18 soon.

I experimented with changes in [f3cab6f42/rjollos.git]. I had to keep GitProcStdout because we need to hold a reference to the process when returning from get_file to avoid I/O on a closed file. After the changes, do we need to keep reusing the same process in Storage.cat_file, or could we also implement [38381686f/rjollos.git]?

After the changes, git process is executed when each GitNode.get_content_length() is invoked, for example, browser view would be very slow when a directory has many files….

comment:4 by Ryan J Ollos, 5 years ago

Thanks for giving feedback on those changes. I agree with deferring my additional changes. I think it would be good to commit for 1.0.18: you original changes with [fd8e27b47/rjollos.git] and the class rename in [632e91a44/jomae.git]. I'll make a new ticket for the GitCmd class introduction in milestone 1.0.19.

Version 0, edited 5 years ago by Ryan J Ollos (next)

comment:5 by Ryan J Ollos, 5 years ago

It occurred to me that we could just implement a file interface for GitCmd: [22f5cf676/rjollos.git].

Looking at GitCore.__execute, we could have GitCmd log errors: [ee852cfea/rjollos.git]. The documentation suggests it is better to use communicate rather than stdout.read, stderr.read, stdin.write. However, I'm unsure of the best way to log stderr when size is not None.

in reply to:  5 comment:6 by Jun Omae, 5 years ago

Looking at GitCore.__execute, we could have GitCmd log errors: [ee852cfea/rjollos.git]. The documentation suggests it is better to use communicate rather than stdout.read, stderr.read, stdin.write. However, I'm unsure of the best way to log stderr when size is not None.

Two solutions:

  1. Use non-blocking read both stdout and stderr to avoid deadlocks. However, the programming is too complex.
  2. Use temporary file for stderr to write output without read operations. Log content of the temporary file when the command exits if needed: [d18870314/jomae.git]

comment:7 by Ryan J Ollos, 5 years ago

Milestone: 1.0.181.0.19

comment:8 by Ryan J Ollos, 5 years ago

Milestone: 1.0.191.0.20

comment:9 by Ryan J Ollos, 4 years ago

Milestone: 1.0.20next-stable-1.4.x

Please move back to 1.0.20 if you wish to fix for release 1.0.20 (#13294).

comment:10 by Ryan J Ollos, 6 months ago

Milestone: next-stable-1.4.xnext-stable-1.6.x

Milestone renamed

Modify Ticket

Change Properties
Set your email in Preferences
Action
as new The ticket will remain with no owner.
The ticket will be disowned.
as The resolution will be set. Next status will be 'closed'.
The owner will be changed from (none) to anonymous. Next status will be 'assigned'.

Add Comment


E-mail address and name can be saved in the Preferences .
 
Note: See TracTickets for help on using tickets.