Opened 16 years ago

Closed 11 years ago

Last modified 11 years ago

#7397 closed enhancement (duplicate)

Too many attachments (directories) in ENV/attachments/ticket/[ID] folder

Reported by: netilesik@… Owned by:
Priority: normal Milestone:
Component: attachment Version: 0.11-stable
Severity: major Keywords:
Cc: pkou@… Branch:
Release Notes:
API Changes:
Internal Changes:

Description

Hi all. I have a problem saving attachments for more than 32765 tickets.

How to reproduce: Create 32765 tickets, each with an attachment. The attachment for the next ticket will fail to save because of a file-system limitation: ENV/attachments/ticket/ already contains 32765 directories, and this is the limit for several popular file systems (ext3, ufs, …)

I think a good solution would look like this. Let [ID] be the ticket ID and [F_ID] be the first 1-3 digits of the ticket [ID].

Attachments would then be saved as follows: ENV/attachments/ticket/[F_ID]/[ID]
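
A minimal sketch of the prefix layout proposed above, in Python. The function name `prefix_path` and the `digits` parameter are hypothetical, chosen only for illustration; this is not Trac code.

```python
import os


def prefix_path(env, ticket_id, digits=3):
    """Return ENV/attachments/ticket/[F_ID]/[ID].

    F_ID is the first `digits` characters of the ticket ID
    (fewer when the ID itself is shorter).
    """
    ticket = str(ticket_id)
    f_id = ticket[:digits]
    return os.path.join(env, "attachments", "ticket", f_id, ticket)
```

For example, `prefix_path("/srv/trac", 32766)` places the attachment directory under `ticket/327/32766`, so the top-level `ticket` directory holds at most 999 entries when three digits are used.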

Thanks.

Attachments (1)

many-attachments.patch (6.9 KB) - added by pkou@… 16 years ago.
Preliminary patch - just to explain the idea


Change History (15)

comment:1 by Jonas Borgström, 16 years ago

What operating system and filesystem are you using that has this limitation?

comment:2 by Jonas Borgström, 16 years ago

Sorry, didn't see that you mention ext3.

comment:3 by netilesik@…, 16 years ago

Currently I am using ufs (FreeBSD), but ext3 has the same limit on "links". This is really easy to reproduce.

Run this script (for example) in /tmp/test/:

perl -e 'for (1..32765) {print "$_\n"; mkdir $_;};'

After this, try: mkdir "blabla"; → Error (13) too many links.

comment:4 by netilesik@…, 16 years ago

Milestone: 0.11.2

If possible, please implement this feature in Trac 0.11.2, because we can't use Trac in some of our projects. Thanks.

comment:5 by pkou@…, 16 years ago

Cc: pkou@… added

Suggestion how to solve it:

  • For any attachment's parent ID, define a hash code:
    hashcode = long(sha.new(str(attachment_parent_id)).hexdigest(), 16)
  • For the hash code, define a bucket number:
    bucket = int(hashcode % 31991)
  • Target directory for an attachment:
    ENV/attachments/att-realm/bucket/att-parent-id

So, basically, we split all parents into 31991 classes and store each class in a separate directory.
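
The three steps above use the Python 2 `sha` module and `long`. A rough Python 3 equivalent, assuming `hashlib` in place of `sha` and UTF-8 encoding of the parent ID, could look like this. The names `bucket_for` and `attachment_dir` are hypothetical and are not taken from the patch.

```python
import hashlib
import os

BUCKETS = 31991  # bucket count proposed in this comment


def bucket_for(parent_id):
    """Map an attachment parent ID (ticket number, wiki page name, ...) to a bucket."""
    digest = hashlib.sha1(str(parent_id).encode("utf-8")).hexdigest()
    hashcode = int(digest, 16)  # Python 3 ints are unbounded, so long() is not needed
    return hashcode % BUCKETS


def attachment_dir(env, realm, parent_id):
    """Build ENV/attachments/att-realm/bucket/att-parent-id."""
    return os.path.join(env, "attachments", realm,
                        str(bucket_for(parent_id)), str(parent_id))
```

The same parent ID always maps to the same bucket, so all attachments of one ticket or wiki page stay together in a single directory.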

The number 31991 is chosen using the following criteria:

  • It shall be less than the maximum number of nodes in a directory, i.e. less than 32000, which is the limit of the ext3 file system. Other file systems seem to have higher limits, or no limit at all. (Verified for ext2, ext3, ufs, zfs, ntfs, fat16, fat32)
  • It shall be as close as possible to SQRT(MAX-TICKET-NUMBER), i.e. close to SQRT(2^31)=46340
  • It shall be a prime number, in order to take into account all digits from a hash code

So, 31991 is the biggest prime number that is less than 32000, see http://primes.utm.edu/lists/small/10000.txt for the reference.

Potentially, it allows up to 31991*32000=1,023,712,000 attachments on an ext3 file system, or up to 31991*32765=1,048,185,115 attachments on a ufs/ext2 file system.
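
The figures above can be checked with a short script. This is only an arithmetic sketch; the helper names `is_prime` and `largest_prime_below` are hypothetical.

```python
def is_prime(n):
    """Trial division, sufficient for numbers of this size."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def largest_prime_below(limit):
    """Return the largest prime strictly less than `limit`."""
    for n in range(limit - 1, 1, -1):
        if is_prime(n):
            return n
    raise ValueError("no prime below %d" % limit)


buckets = largest_prime_below(32000)
print(buckets)          # 31991
print(buckets * 32000)  # 1023712000
print(buckets * 32765)  # 1048185115
```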


Question: If the proposal is okay, shall it be implemented for all projects (e.g. by developing an environment upgrade script), or only for specific projects (e.g. keep the current approach by default and allow the new approach to be enabled for some projects)?

My vision is that it shall be a default for all projects, and current attachments shall be moved to new structure during an environment upgrade.

comment:6 by Christian Boos, 16 years ago

Component: ticket system → attachment
Milestone: 0.11.2 → 2.0
Severity: normal → major
Type: defect → enhancement

I have mixed feelings about this issue. My first reaction was to say "simply choose an appropriate filesystem" which matches the requirements.

But then, if there are existing installations which effectively have > 32000 subdirectories below ./attachments/ticket, the problem is real and should probably be addressed. However, this shouldn't be done at the price of excessive complexity, nor by reducing the intuitiveness and usability of the current $TRAC_ENV/attachments layout.

Therefore, a good compromise would be a sharding scheme, similar to the one used for Subversion 1.5 fsfs repositories. See http://www.farside.org.uk/200704/tree_structured_fsfs for details about the why and the how.

The additional complexity for Trac is that it has (theoretically) to handle sharding over alphanumeric names. Therefore, svn's scheme of 0/ → 0-999, 1/ → 1000-1999 doesn't seem to be appropriate here. We should find something else, with the following constraints:

  • predictable: given an entity name, finding its location should be immediate and unambiguous
  • unique: all the attachments for a given entity should be grouped in a single folder

Also, like in the svn case, this new scheme should apply to newly created environments only, with a script for converting existing environments for those who need it.

For the milestone, I think it can be done as soon as 0.12, but I'm setting 2.0 for now (meaning nice to have but not yet scheduled for a short term release). Definitely not for a minor bugfix release.

by pkou@…, 16 years ago

Attachment: many-attachments.patch added

Preliminary patch - just to explain the idea

comment:7 by pkou@…, 16 years ago

Please review the attached many-attachments.patch, which explains the idea. If it is okay, then I'll clean it up.

It moves files from $ENV/attachments/type/id to $ENV/attachments2/type/hash/id, where the hash is used for dividing all attachments into groups.

Numeric sharding as in svn cannot be used for Trac because it has to operate on alphanumeric names, like wiki page names.

The following hash functions have been tested: SHA1, MD5, Python's internal hash, Knuth's string hash. On one billion entries, MD5 showed the best distribution: it kept every directory below 32000 entries across 1,000,000,000 tickets with attachments.
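
The test script itself is not part of this ticket. A minimal sketch of how such a distribution check could be run, assuming sequential ticket IDs, the 31991-bucket modulus from comment:5, and a far smaller sample than one billion entries, might look like this. The function names are hypothetical.

```python
import hashlib
from collections import Counter


def bucket_counts(ids, hash_name="md5", buckets=31991):
    """Count how many parent IDs land in each bucket for a given hash."""
    counts = Counter()
    for parent_id in ids:
        data = str(parent_id).encode("utf-8")
        digest = hashlib.new(hash_name, data).hexdigest()
        counts[int(digest, 16) % buckets] += 1
    return counts


def fullest_bucket(ids, hash_name="md5", buckets=31991):
    """Return the size of the most heavily loaded bucket."""
    return max(bucket_counts(ids, hash_name, buckets).values())


if __name__ == "__main__":
    sample = range(1, 200001)  # small sample, for illustration only
    for name in ("md5", "sha1"):
        print(name, fullest_bucket(sample, name))
```

The fullest bucket is the number that matters here, since it is the first directory to hit the file-system limit.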

comment:8 by pkou@…, 16 years ago

Another note: The proposed algorithm allows up to 1,000,000,000 attachments in tickets on an ext3 file system. If the maximum number of attachments can be limited to 100,000,000, then there is a much simpler hashing function:

md5.new(str(name)).hexdigest()[0:3]

If the simple hash function is used, then the hash code can also be calculated easily in scripts:

echo -n NAME|md5sum|cut -c1-3

The testing shows that it will be possible to have up to 128,000,000 attachments on ext3.
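
The `md5.new` call above is Python 2. A Python 3 sketch of the same three-character prefix, assuming `hashlib` and UTF-8 encoding of the name, is shown below. The function name `md5_prefix` is hypothetical.

```python
import hashlib


def md5_prefix(name, length=3):
    """Return the first `length` hex digits of the MD5 digest of `name`."""
    return hashlib.md5(str(name).encode("utf-8")).hexdigest()[:length]


print(md5_prefix("abc"))  # 900
```

Three hex digits give 16^3 = 4096 possible buckets, and the result matches the `echo -n NAME|md5sum|cut -c1-3` pipeline for the same name.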

comment:9 by szybalski@…, 15 years ago

http://groups.google.com/group/trac-users/browse_thread/thread/4fbc0dda6bafce88

I think keeping the simplicity of having the ticket# folder with the attachments in it is worth switching to an xfs or ext4 file system. We reached the ext3 limit in just 3 months. If this had been documented, we would have installed on an xfs file system from the start. Backing up and moving over is not such a big deal, so I think the solution would be a better description of the limits in the production deployment documentation.

Our limit was 31999 on Debian with ext3. Now running on xfs with no limit.

Thanks, Lucas

comment:10 by Christian Boos, 15 years ago

Milestone: 2.0 → unscheduled

Milestone 2.0 deleted

comment:11 by Remy Blank, 14 years ago

Milestone: triaging → next-major-0.1X

I guess we'll tackle this when t.e.o hits 32000 tickets :)

comment:12 by Remy Blank, 14 years ago

This is related to #7554.

comment:13 by Jun Omae, 11 years ago

Resolution: duplicate
Status: new → closed

Since Trac 1.0 (see #10313), the attachments directory structure has been migrated to $ENV/files/attachments/realm/sha1(id)[:3]/sha1(id)/sha1(filename).ext.
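
A sketch of how a path in that layout could be computed, based only on the pattern quoted above. This is not Trac's implementation: the function names are hypothetical, and both the UTF-8 encoding and the use of `os.path.splitext` to obtain the extension are assumptions.

```python
import hashlib
import os


def _sha1_hex(text):
    """Hex SHA-1 digest of a string (UTF-8 encoding is an assumption)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def attachment_path(env, realm, parent_id, filename):
    """Build $ENV/files/attachments/realm/sha1(id)[:3]/sha1(id)/sha1(filename).ext"""
    id_hash = _sha1_hex(str(parent_id))
    _, ext = os.path.splitext(filename)  # keeps the leading dot, e.g. ".patch"
    return os.path.join(env, "files", "attachments", realm,
                        id_hash[:3], id_hash, _sha1_hex(filename) + ext)
```

The three-character prefix yields at most 4096 first-level directories per realm, which keeps every level well below the ext3 subdirectory limit discussed in this ticket.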

comment:14 by Peter Suter, 11 years ago

Milestone: next-major-releases
