Edgewall Software
Modify

Opened 21 months ago

Last modified 10 months ago

#13169 new defect

GitNode.get_content() read entire of a file into memory even if it is huge

Reported by: Jun Omae Owned by:
Priority: normal Milestone: next-stable-1.4.x
Component: plugin/git Version:
Severity: minor Keywords:
Cc: Branch:
Release Notes:
API Changes:
Internal Changes:

Description

I think we should use TemporaryFile/SpooledTemporaryFile rather than cStringIO.

See also:

Attachments (0)

Change History (9)

comment:1 by Jun Omae, 21 months ago

Severity: normalminor

Reconsidering, GitNode.get_content() cannot return until the entire file is written to a temporary file. If the file is large, I think the behavior is slow. In the following changes, the method uses stdout of git cat-file blob {sha} without reading the content when size of the content is large than 32 MB.

Proposed changes in [8ec7797e6/jomae.git] (jomae.git@t13169+1.0).

comment:2 by Ryan J Ollos, 20 months ago

Two minor suggestions in [fd8e27b47/rjollos.git]. We can skip setting default parameter stdout=PIPE, like in r15074. Also, I noticed that changes will need some rebasing on trunk due removal of _close_proc_pipes in r15077. The function could be restored to avoid duplicate code.

Looking at introduction of GitProcStdout and thinking about r15077, would it make sense to introduce a wrapper for Popen that cleans up on __exit__ and __del__? GitProcStdout does that while providing a file-like interface for stdout.

I experimented with changes in [f3cab6f42/rjollos.git]. I had to keep GitProcStdout because we need to hold a reference to the process when returning from get_file to avoid I/O on a closed file. After the changes, do we need to keep reusing the same process in Storage.cat_file, or could we also implement [38381686f/rjollos.git]?

in reply to:  2 comment:3 by Jun Omae, 20 months ago

Replying to Ryan J Ollos:

Two minor suggestions in [fd8e27b47/rjollos.git]. We can skip setting default parameter stdout=PIPE, like in r15074. Also, I noticed that changes will need some rebasing on trunk due removal of _close_proc_pipes in r15077. The function could be restored to avoid duplicate code.

The changes looks good to me. In addition, GitProcStdout.close method is restored in [632e91a44/jomae.git]. The close method is invoked by content_closing() after r16591 (#12977).

Looking at introduction of GitProcStdout and thinking about r15077, would it make sense to introduce a wrapper for Popen that cleans up on __exit__ and __del__? GitProcStdout does that while providing a file-like interface for stdout.

Backporting r15077 to 1.0-stable sounds good, however it is not recommended if we will release 1.0.18 soon.

I experimented with changes in [f3cab6f42/rjollos.git]. I had to keep GitProcStdout because we need to hold a reference to the process when returning from get_file to avoid I/O on a closed file. After the changes, do we need to keep reusing the same process in Storage.cat_file, or could we also implement [38381686f/rjollos.git]?

After the changes, git process is executed when each GitNode.get_content_length() is invoked, for example, browser view would be very slow when a directory has many files….

comment:4 by Ryan J Ollos, 20 months ago

Thanks for giving feedback on those changes. I agree with deferring my additional changes. I think it would be good to commit for 1.0.18: your original changes with [fd8e27b47/rjollos.git] and the class rename in [632e91a44/jomae.git]. I'll make a new ticket for the GitCmd class introduction in milestone 1.0.19.

Last edited 20 months ago by Ryan J Ollos (previous) (diff)

comment:5 by Ryan J Ollos, 20 months ago

It occurred to me that we could just implement a file interface for GitCmd: [22f5cf676/rjollos.git].

Looking at GitCore.__execute, we could have GitCmd log errors: [ee852cfea/rjollos.git]. The documentation suggests it is better to use communicate rather than stdout.read, stderr.read, stdin.write. However, I'm unsure of the best way to log stderr when size is not None.

in reply to:  5 comment:6 by Jun Omae, 20 months ago

Looking at GitCore.__execute, we could have GitCmd log errors: [ee852cfea/rjollos.git]. The documentation suggests it is better to use communicate rather than stdout.read, stderr.read, stdin.write. However, I'm unsure of the best way to log stderr when size is not None.

Two solutions:

  1. Use non-blocking read both stdout and stderr to avoid deadlocks. However, the programming is too complex.
  2. Use temporary file for stderr to write output without read operations. Log content of the temporary file when the command exits if needed: [d18870314/jomae.git]

comment:7 by Ryan J Ollos, 20 months ago

Milestone: 1.0.181.0.19

comment:8 by Ryan J Ollos, 19 months ago

Milestone: 1.0.191.0.20

comment:9 by Ryan J Ollos, 10 months ago

Milestone: 1.0.20next-stable-1.4.x

Please move back to 1.0.20 if you wish to fix for release 1.0.20 (#13294).

Modify Ticket

Change Properties
Set your email in Preferences
Action
as new The ticket will remain with no owner.
The ticket will be disowned.
as The resolution will be set. Next status will be 'closed'.
The owner will be changed from (none) to anonymous. Next status will be 'assigned'.

Add Comment


E-mail address and name can be saved in the Preferences .
 
Note: See TracTickets for help on using tickets.