Opened 15 years ago

Last modified 16 months ago

#8645 new defect

More readable header IDs

Reported by: Mitar
Owned by:
Priority: high
Milestone: next-major-releases
Component: wiki system
Version: 0.11.4
Severity: normal
Keywords: unicode verify bitesized
Cc: mmitar@…
Branch:
Release Notes:
API Changes:
Internal Changes:

Description (last modified by Christian Boos)

The initial report was about invalid header IDs; they're actually not invalid in XHTML 1.0, but their readability can nevertheless be greatly improved. Re-targeting the ticket that way, see comment:5 for details.

Original Report

Header IDs generated from headers containing non-ASCII characters keep those characters, which is invalid:

An ID attribute does not conform with the HTML document type specification. ID tokens must begin with a letter, and may contain only letters, digits, hyphens, underscores, colons, and periods.

We could do something like Text::Unidecode.

Attachments (0)

Change History (14)

comment:1 by Christian Boos, 15 years ago

Keywords: unicode verify added
Milestone: 0.12.1

comment:2 by Remy Blank, 15 years ago

Keywords: bitesized added

comment:3 by mark.m.mcmahon@…, 15 years ago

Some more information on this.

Tested on Trac 0.12dev-r9325 (Windows, sqlite, Firefox 3.6 and IE 7.0.7530)

creating a header like

=== éäïçêڪڻڦۆڭک鮎扱唖丶丨丼ナチスヌ荸鉑駁ä ===

the generated anchor link is (i.e. copy link location)…

FIREFOX:

http://localhost/wiki/WikiStart#%C3%A9%C3%A4%C3%AF%C3%A7%C3%AA%DA%AA%DA%BB%DA%A6%DB%86%DA%AD%DA%A9%E9%AE%8E%E6%89%B1%E5%94%96%E4%B8%B6%E4%B8%A8%E4%B8%BC%EF%BE%85%EF%BE%81%EF%BD%BD%EF%BE%87%E8%8D%B8%E9%89%91%E9%A7%81%C3%A4

IE 7:

http://localhost/wiki/WikiStart#éäïçêڪڻڦۆڭک鮎扱唖丶丨丼ナチスヌ荸鉑駁ä
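The two links above carry the same fragment: Firefox copies it with the UTF-8 bytes percent-encoded, while IE 7 copies the raw characters. A minimal sketch using Python's standard `urllib.parse` (the `heading` value is just the first five characters of the test header) shows the correspondence:

```python
from urllib.parse import quote, unquote

heading = "éäïçê"

# quote() percent-encodes the UTF-8 bytes of each non-ASCII character.
encoded = quote(heading)
print(encoded)  # %C3%A9%C3%A4%C3%AF%C3%A7%C3%AA

# Decoding gives back the original characters, so both links name the same anchor.
assert unquote(encoded) == heading
```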

Is the bug saying that the header ID should be encoded?

References to the HTML 4.01 spec where this is mentioned…

http://www.w3.org/TR/html401/struct/global.html#adef-id
http://www.w3.org/TR/html401/types.html#type-name

in reply to:  3 comment:4 by Remy Blank, 15 years ago

Replying to mark.m.mcmahon@…:

Is the bug saying that the header ID should be encoded?

Yes, that's the general idea. The only characters that should be present in an id attribute are [-a-zA-Z0-9_:.], and an additional constraint is that the first character must be a letter. So we should create an injective encoding that leaves most of the admissible characters as-is, and encodes the others so that two different inputs always generate different outputs.

We won't be able to use URL encoding for that, as % is not admissible. One idea could be to hex-encode the utf-8 encoded string using : as the escape character (making sure that a : is encoded as well). And to ensure that the id always starts with a letter, prepend an "a" if a string doesn't start with a letter, and append a : at the end (to make it unique against the same input with a prepended "a"). Maybe also encode the underscore, and convert spaces to underscores for better readability.

A few examples:

"Valid-42."     -> Valid-42.
"space char"    -> space_char
"an_underscore" -> an:5funderscore
"c'est l'été"   -> c:27est_l:27:c3:a9t:c3:a9
"this: is_it"   -> this:3a_is:5fit
"1 or 2"        -> a1_or_2:
"a1 or 2"       -> a1_or_2

That's just an idea. Feel free to come up with your own encoding.
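For anyone who wants to experiment, here is a rough Python sketch of the scheme described above. It is only one reading of the proposal, not code from Trac: the name `encode_id` is made up for illustration, hex digits are lowercase, and the underscore escape is applied everywhere so that a literal underscore can never collide with a converted space.

```python
import string

# Characters copied through unchanged. Underscore and colon are deliberately
# absent: "_" stands for a space and ":" is the escape character.
_SAFE = set(string.ascii_letters + string.digits + "-.")


def encode_id(text):
    """Hypothetical helper: encode a heading as an ASCII-only id."""
    parts = []
    for ch in text:
        if ch in _SAFE:
            parts.append(ch)
        elif ch == " ":
            parts.append("_")
        else:
            # Escape every UTF-8 byte of the character as ":xx".
            parts.extend(":%02x" % b for b in ch.encode("utf-8"))
    out = "".join(parts)
    if not out or out[0] not in string.ascii_letters:
        # Force a leading letter; the trailing ":" keeps the result distinct
        # from the same input with a literal "a" in front.
        out = "a" + out + ":"
    return out


print(encode_id("c'est l'été"))  # c:27est_l:27:c3:a9t:c3:a9
print(encode_id("1 or 2"))       # a1_or_2:
```

An escape is always a colon followed by two hex digits, so an ordinary result can never end in a bare colon, which is what makes the trailing-colon marker unambiguous.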

in reply to:  3 comment:5 by Christian Boos, 15 years ago

Oops, concurrent edit ;-) Sorry Remy, I beg to differ…

According to the HTML 4.01 Specification - B.2.1 Non-ASCII characters in URI attribute values, yes, the header IDs should be ASCII only.

But we're producing XHTML 1.0, and there it's legal to have Unicode characters. More precisely, according to C.8. Fragment Identifiers in the XHTML 1.0 spec, both name and id attributes need to conform to XML 1.0 Section 2.3, production 5, and its Letter class covers a lot of ground. Except for the constraint of having no digit as the first character (which we already cope with), I see no need to be overly strict about this or to try to cover exactly the above set, i.e. any Unicode letter should do.

It would be much more valuable to take some greater care about producing readable anchors, by replacing space characters with - or _ (for example Google Code and Wikipedia use _, BitBucket uses -) instead of squashing words together like we do now, as that only "works" for a few heading styles. The only problem is that we would need to support the old style in order to support already existing links, so we would have to stick an extra <a> after the heading.
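A minimal sketch of what such a readable anchor could look like, assuming Python 3 (where `\w` matches Unicode letters by default). The function name `readable_anchor` and its `sep` parameter are hypothetical and not part of Trac:

```python
import re

_SPACES = re.compile(r"\s+")
# Anything that is not a word character, colon, period or hyphen is dropped.
_DISALLOWED = re.compile(r"[^\w:.-]+")


def readable_anchor(heading, sep="_"):
    """Hypothetical helper: keep Unicode letters, join words with `sep`."""
    anchor = _SPACES.sub(sep, heading.strip())
    anchor = _DISALLOWED.sub("", anchor)
    if not anchor or not anchor[0].isalpha():
        # An id must not start with a digit or punctuation.
        anchor = "a" + anchor
    return anchor


print(readable_anchor("More readable header IDs"))  # More_readable_header_IDs
print(readable_anchor("c'est l'été", sep="-"))      # cest-lété
```

Passing `sep="-"` gives the BitBucket-style variant, and the default follows the Google Code and Wikipedia convention.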

comment:6 by Remy Blank, 15 years ago

You're right, I failed to check the XML specification. But it also says:

When defining fragment identifiers to be backward-compatible, only strings matching the pattern [A-Za-z][A-Za-z0-9:_.-]* should be used.

So I wonder if we shouldn't try to be conservative (to avoid browser compatibility issues) and stick to that anyway.
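The conservative pattern is easy to test against with the standard `re` module. A small sketch (the helper name `is_backward_compatible` is only for illustration):

```python
import re

# The backward-compatible pattern quoted above.
BACKWARD_COMPATIBLE = re.compile(r"[A-Za-z][A-Za-z0-9:_.-]*")


def is_backward_compatible(fragment):
    """Return True if the whole fragment matches the conservative pattern."""
    return BACKWARD_COMPATIBLE.fullmatch(fragment) is not None


for candidate in ("Valid-42.", "space_char", "1_or_2", "été"):
    print(candidate, is_backward_compatible(candidate))
```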

I suppose the \w pattern in the _anchor_re regexp matches only admissible characters? That would mean the current algorithm does generate only valid ids. Why do we remove underscores?

in reply to:  6 comment:7 by Christian Boos, 15 years ago

Replying to rblank:

So I wonder if we shouldn't try to be conservative (to avoid browser compatibility issues) and stick to that anyway.

In my tests, only Opera browsers couldn't handle Unicode anchors; IE7, IE8, FF 3.6, Safari and Chrome all worked as expected. Since, to me, readability of the links is the most important criterion, I'd say the compatibility problem is tolerable.

I suppose the \w pattern in the _anchor_re regexp matches only admissible characters? That would mean the current algorithm does generate only valid ids. Why do we remove underscores?

I don't know, we shouldn't. But as said above, we should keep the existing transcoding, however limited or buggy it is, so that old links keep working, and add a better scheme that is used by default alongside the old one (i.e. the visible anchors shown by the paragraph marks will be the new ones).
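To illustrate the dual-anchor idea, here is a rough sketch that emits the new id on the heading and an extra empty `<a>` carrying the old one. Everything here is an assumption made for illustration: `legacy_anchor` is only a stand-in for the current "squash the words together" behaviour, `new_anchor` is one possible readable scheme, and `render_heading` is not Trac's formatter.

```python
import re
from html import escape

_DISALLOWED = re.compile(r"[^\w:.-]+")


def legacy_anchor(heading):
    # Stand-in for the existing scheme: disallowed characters, spaces
    # included, are simply dropped.
    return _DISALLOWED.sub("", heading)


def new_anchor(heading):
    # Readable variant: whitespace becomes "_" before the cleanup.
    return _DISALLOWED.sub("", re.sub(r"\s+", "_", heading.strip()))


def render_heading(heading, level=2):
    """Hypothetical renderer: new id on the heading, old id on an extra <a>."""
    new_id, old_id = new_anchor(heading), legacy_anchor(heading)
    html = '<h%d id="%s">%s</h%d>' % (level, escape(new_id), escape(heading), level)
    if old_id and old_id != new_id:
        # Keeps links that were created with the old scheme working.
        html += '<a id="%s"></a>' % escape(old_id)
    return html


print(render_heading("More readable"))
# <h2 id="More_readable">More readable</h2><a id="Morereadable"></a>
```

Single-word headings produce the same id under both schemes, so no extra element is emitted for them.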

in reply to:  7 comment:8 by Christian Boos, 15 years ago

Replying to cboos:

Replying to rblank:

I suppose the \w pattern in the _anchor_re regexp matches only admissible characters? That would mean the current algorithm does generate only valid ids. Why do we remove underscores?

I don't know, we shouldn't.

Oh, and since '_' is part of the identifier spec '\w' (which matches the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database), the regexp is correct.
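A quick way to check this behaviour of `\w` in Python. The sketch assumes Python 3, where `str` patterns are Unicode-aware by default; the explicit `re.UNICODE` flag is redundant there and only mirrors what a Python 2 code base would need. The helper name `is_word_char` is made up for the example.

```python
import re

WORD = re.compile(r"\w", re.UNICODE)


def is_word_char(ch):
    """Return True if the single character `ch` is matched by \\w."""
    return WORD.fullmatch(ch) is not None


# Underscore, digits and Unicode letters are all word characters...
assert all(is_word_char(ch) for ch in ("_", "7", "é", "鮎"))
# ...while spaces and most punctuation are not.
assert not any(is_word_char(ch) for ch in (" ", "'", "-", ":"))
```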

comment:9 by Remy Blank, 15 years ago

Ok, so what do you suggest for this ticket? I suspect Mark was looking for some easy tickets to get started in Trac development, so we should at least state clearly what is expected to close the ticket (or close as wontfix).

comment:10 by Christian Boos, 15 years ago

Component: rendering → wiki system
Description: modified (diff)
Milestone: next-minor-0.12.x → next-major-0.1X
Summary: Header IDs contain invalid characters → More readable header IDs

Hijacking the ticket to focus on improving the header IDs (comment:5), as we don't have such ticket yet and the discussion here is relevant to that topic.

in reply to:  10 comment:11 by Christian Boos, 15 years ago

Owner: set to Christian Boos
Priority: normal → high

Replying to cboos:

Hijacking the ticket to focus on improving the header IDs (comment:5), as we don't have such ticket yet and the discussion here is relevant to that topic.

We had ticket:8499#comment:1, but that was also a hijacked one ;-)

in reply to:  5 comment:12 by Mitar, 14 years ago

Replying to cboos:

But we're producing XHTML 1.0, and there, it's legal to have unicode characters.

Still, my HTML Validator produces this link as explanation.

So probably it would be best if we would normalize anchors to contain only [A-Za-z][A-Za-z0-9:_.-]*.
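One possible way to get there with the standard library alone is to fold accented letters to their ASCII base via NFKD decomposition and drop whatever remains. This is only a sketch: `ascii_anchor` is a hypothetical name, and unlike a real transliteration library in the style of Text::Unidecode it simply discards characters that have no ASCII decomposition (CJK or Arabic text, for example), so different headings can collapse to the same id.

```python
import re
import unicodedata


def ascii_anchor(heading):
    """Hypothetical helper: reduce a heading to [A-Za-z][A-Za-z0-9:_.-]*."""
    # Split accented letters into base letter + combining mark, then drop
    # every non-ASCII code point (the marks and anything untransliterable).
    decomposed = unicodedata.normalize("NFKD", heading)
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    anchor = re.sub(r"\s+", "_", ascii_text.strip())
    anchor = re.sub(r"[^A-Za-z0-9:_.-]+", "", anchor)
    if not anchor or not anchor[0].isalpha():
        # Guarantee a leading letter.
        anchor = "a" + anchor
    return anchor


print(ascii_anchor("c'est l'été"))  # cest_lete
print(ascii_anchor("1 or 2"))       # a1_or_2
```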

comment:13 by Christian Boos, 13 years ago

And we should make sure those generated ids won't interfere with ids used in CSS rules (see ticket:645#comment:5).

comment:14 by Ryan J Ollos, 10 years ago

Owner: Christian Boos removed
