Opened 6 years ago
Last modified 2 years ago
#13169 new defect
GitNode.get_content() read entire of a file into memory even if it is huge
| Reported by: | Jun Omae | Owned by: | |
|---|---|---|---|
| Priority: | normal | Milestone: | next-stable-1.6.x |
| Component: | plugin/git | Version: | |
| Severity: | minor | Keywords: | |
| Cc: | Branch: | ||
| Release Notes: | |||
| API Changes: | |||
| Internal Changes: | |||
Description
I think we should use TemporaryFile/SpooledTemporaryFile rather than cStringIO.
See also:
- GitNode.get_content()
- get_file() -
cStringIOis used - cat_file() - entire of the content into memory by
Popen.stdout.read(size)
Attachments (0)
Change History (10)
comment:1 by , 6 years ago
| Severity: | normal → minor |
|---|
follow-up: 3 comment:2 by , 6 years ago
Two minor suggestions in [fd8e27b47/rjollos.git]. We can skip setting default parameter stdout=PIPE, like in r15074. Also, I noticed that changes will need some rebasing on trunk due removal of _close_proc_pipes in r15077. The function could be restored to avoid duplicate code.
Looking at introduction of GitProcStdout and thinking about r15077, would it make sense to introduce a wrapper for Popen that cleans up on __exit__ and __del__? GitProcStdout does that while providing a file-like interface for stdout.
I experimented with changes in [f3cab6f42/rjollos.git]. I had to keep GitProcStdout because we need to hold a reference to the process when returning from get_file to avoid I/O on a closed file. After the changes, do we need to keep reusing the same process in Storage.cat_file, or could we also implement [38381686f/rjollos.git]?
comment:3 by , 6 years ago
Replying to Ryan J Ollos:
Two minor suggestions in [fd8e27b47/rjollos.git]. We can skip setting default parameter
stdout=PIPE, like in r15074. Also, I noticed that changes will need some rebasing on trunk due removal of_close_proc_pipesin r15077. The function could be restored to avoid duplicate code.
The changes looks good to me. In addition, GitProcStdout.close method is restored in [632e91a44/jomae.git]. The close method is invoked by content_closing() after r16591 (#12977).
Looking at introduction of
GitProcStdoutand thinking about r15077, would it make sense to introduce a wrapper forPopenthat cleans up on__exit__and__del__?GitProcStdoutdoes that while providing a file-like interface forstdout.
Backporting r15077 to 1.0-stable sounds good, however it is not recommended if we will release 1.0.18 soon.
I experimented with changes in [f3cab6f42/rjollos.git]. I had to keep
GitProcStdoutbecause we need to hold a reference to the process when returning fromget_fileto avoid I/O on a closed file. After the changes, do we need to keep reusing the same process inStorage.cat_file, or could we also implement [38381686f/rjollos.git]?
After the changes, git process is executed when each GitNode.get_content_length() is invoked, for example, browser view would be very slow when a directory has many files….
comment:4 by , 6 years ago
Thanks for giving feedback on those changes. I agree with deferring my additional changes. I think it would be good to commit for 1.0.18: your original changes with [fd8e27b47/rjollos.git] and the class rename in [632e91a44/jomae.git]. I'll make a new ticket for the GitCmd class introduction in milestone 1.0.19.
follow-up: 6 comment:5 by , 6 years ago
It occurred to me that we could just implement a file interface for GitCmd: [22f5cf676/rjollos.git].
Looking at GitCore.__execute, we could have GitCmd log errors: [ee852cfea/rjollos.git]. The documentation suggests it is better to use communicate rather than stdout.read, stderr.read, stdin.write. However, I'm unsure of the best way to log stderr when size is not None.
comment:6 by , 6 years ago
Looking at
GitCore.__execute, we could haveGitCmdlog errors: [ee852cfea/rjollos.git]. The documentation suggests it is better to usecommunicaterather thanstdout.read,stderr.read,stdin.write. However, I'm unsure of the best way to logstderrwhensizeis notNone.
Two solutions:
- Use non-blocking read both stdout and stderr to avoid deadlocks. However, the programming is too complex.
- Use temporary file for stderr to write output without read operations. Log content of the temporary file when the command exits if needed: [d18870314/jomae.git]
comment:7 by , 6 years ago
| Milestone: | 1.0.18 → 1.0.19 |
|---|
comment:8 by , 6 years ago
| Milestone: | 1.0.19 → 1.0.20 |
|---|
comment:9 by , 6 years ago
| Milestone: | 1.0.20 → next-stable-1.4.x |
|---|



Reconsidering,
GitNode.get_content()cannot return until the entire file is written to a temporary file. If the file is large, I think the behavior is slow. In the following changes, the method uses stdout ofgit cat-file blob {sha}without reading the content when size of the content is large than 32 MB.Proposed changes in [8ec7797e6/jomae.git] (jomae.git@t13169+1.0).