Opened 15 years ago
Last modified 15 months ago
#8645 new defect
More readable header IDs
Reported by: | Mitar | Owned by: | |
---|---|---|---|
Priority: | high | Milestone: | next-major-releases |
Component: | wiki system | Version: | 0.11.4 |
Severity: | normal | Keywords: | unicode verify bitesized |
Cc: | mmitar@… | Branch: | |
Release Notes: | |||
API Changes: | |||
Internal Changes: |
Description (last modified by )
The initial report was about invalid header IDs; they're actually not invalid in XHTML 1.0, but their readability can nevertheless be greatly improved. Re-targeting the ticket that way, see comment:5 for details.
Original Report
Header IDs generated from headers with non-ASCII characters keep those characters, which is invalid:
An ID attribute does not conform with the HTML document type specification. ID tokens must begin with a letter, and may contain only letters, digits, hyphens, underscores, colons, and periods.
We could do something like Text::Unidecode.
Attachments (0)
Change History (14)
comment:1 by , 15 years ago
Keywords: | unicode verify added |
---|---|
Milestone: | → 0.12.1 |
comment:2 by , 15 years ago
Keywords: | bitesized added |
---|
follow-ups: 4 5 comment:3 by , 15 years ago
comment:4 by , 15 years ago
Replying to mark.m.mcmahon@…:
Is the bug saying that the header ID should be encoded?
Yes, that's the general idea. The only characters that should be present in an id attribute are [-a-zA-Z0-9_:.], and an additional constraint is that the first character must be a letter. So we should create an injective encoding that leaves most of the admissible characters as-is, and encodes the others so that two different inputs always generate different outputs.
We won't be able to use URL encoding for that, as % is not admissible. One idea could be to hex-encode the utf-8 encoded string using : as the escape character (making sure that a : is encoded as well). And to ensure that the id always starts with a letter, prepend an "a" if a string doesn't start with a letter, and append a : at the end (to make it unique against the same input with a prepended "a"). Maybe also encode the underscore, and convert spaces to underscores for better readability.
A few examples:
"Valid-42." -> Valid-42.
"space char" -> space_char
"an_underscore" -> an:5funderscore
"c'est l'été" -> c:27est_l:27:c3:a9t:c3:a9
"this: is_it" -> this:3a_is_it
"1 or 2" -> a1_or_2:
"a1 or 2" -> a1_or_2
That's just an idea. Feel free to come up with your own encoding.
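The encoding proposed in this comment can be sketched in a few lines of Python. This is only an illustration of the idea, not code from Trac: the name encode_id is hypothetical, and the sketch takes the "maybe also encode the underscore" option, which is what keeps "a b" and "a_b" distinct once spaces become underscores. With that choice it yields this:3a_is:5fit for "this: is_it", rather than the this:3a_is_it shown in the examples.

```python
import string

# Characters kept as-is. ':' and '_' are deliberately left out:
# ':' is the escape character and '_' stands in for a space.
SAFE = frozenset(string.ascii_letters + string.digits + "-.")


def encode_id(text):
    """Hypothetical injective encoding of a heading into an ASCII id."""
    out = []
    for byte in text.encode("utf-8"):
        ch = chr(byte)
        if ch in SAFE:
            out.append(ch)
        elif ch == " ":
            out.append("_")
        else:
            # Hex-escape every other byte, including ':' and '_'.
            out.append(":%02x" % byte)
    encoded = "".join(out)
    # An id must start with a letter. The trailing ':' marks the
    # prepended "a"; it cannot occur otherwise, because an escape is
    # always followed by two hex digits.
    if not encoded or encoded[0] not in string.ascii_letters:
        encoded = "a" + encoded + ":"
    return encoded


print(encode_id("1 or 2"))
print(encode_id("a1 or 2"))
```

The output alphabet stays within [-a-zA-Z0-9_:.], at the price of ids that become hard to read as soon as the heading contains many non-ASCII characters.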
follow-up: 12 comment:5 by , 15 years ago
Oops, concurrent edit ;-) Sorry Remy, I beg to differ…
According to the HTML 4.01 Specification - B.2.1 Non-ASCII characters in URI attribute values, yes, the header IDs should be ASCII only.
But we're producing XHTML 1.0, and there, it's legal to have unicode characters. More precisely, according to C.8. Fragment Identifiers in the XHTML 1.0 spec, both name and id attributes need to conform to XML 1.0 Section 2.3, production 5, and the Letter covers some ground. Except for the constraint of having no digit as the first character (which we already cope with), I see no need to be overly strict about this, and to try to cover exactly the above set, i.e. any unicode letter should do.
It would be much more valuable to take some greater care about producing readable anchors, by replacing space characters with - or _ (for example Google Code and Wikipedia use _, BitBucket uses -) instead of squashing words together like we do now, as that only "works" for a few heading styles. The only problem is that we would need to support the old style in order to support already existing links, so we would have to stick an extra <a> after the heading.
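A minimal sketch of that two-anchor approach, in Python. The function names are hypothetical, and legacy_anchor is only an approximation of the current word-squashing behaviour as described in this ticket, not a copy of the Trac source.

```python
import re

# In Python 3, \w on str patterns is Unicode-aware by default.
_inadmissible_re = re.compile(r"[^\w:.-]+")
_space_re = re.compile(r"\s+")


def _fix_start(anchor):
    # Avoid an empty id or one starting with a digit, '.' or '-'.
    if not anchor or anchor[0].isdigit() or anchor[0] in ".-":
        anchor = "a" + anchor
    return anchor


def legacy_anchor(heading):
    """Approximation of the old style: words squashed together."""
    return _fix_start(_inadmissible_re.sub("", heading))


def readable_anchor(heading, sep="_"):
    """New style: runs of whitespace become a separator ('_' or '-')."""
    joined = _space_re.sub(sep, heading.strip())
    return _fix_start(_inadmissible_re.sub("", joined))


def heading_anchors(heading):
    """Return (visible id, extra legacy id or None) for a heading."""
    new, old = readable_anchor(heading), legacy_anchor(heading)
    return new, (old if old != new else None)


print(heading_anchors("Getting Started"))
```

The first id would go on the heading itself; the second, when present, would go on the extra <a> so that existing links keep working.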
follow-up: 7 comment:6 by , 15 years ago
You're right, I failed to check the XML specification. But it also says:
When defining fragment identifiers to be backward-compatible, only strings matching the pattern [A-Za-z][A-Za-z0-9:_.-]* should be used.
So I wonder if we shouldn't try to be conservative (to avoid browser compatibility issues) and stick to that anyway.
I suppose the \w pattern in the _anchor_re regexp matches only admissible characters? That would mean the current algorithm does generate only valid ids. Why do we remove underscores?
follow-up: 8 comment:7 by , 15 years ago
Replying to rblank:
So I wonder if we shouldn't try to be conservative (to avoid browser compatibility issues) and stick to that anyway.
In my tests, only Opera browsers couldn't handle unicode anchors, while IE7, IE8, FF 3.6, Safari and Chrome all worked as expected. As for me, readability of the links is the most important criterion, so I'd say the compatibility problem is tolerable.
I suppose the \w pattern in the _anchor_re regexp matches only admissible characters? That would mean the current algorithm does generate only valid ids. Why do we remove underscores?
I don't know, we shouldn't. But as said above, we should keep the existing transcoding, however limited or buggy it is, in order to keep old links working, and add a better scheme that will be used by default alongside the old one (i.e. the visible anchors, as shown by the paragraph marks, will be the new ones).
comment:8 by , 15 years ago
Replying to cboos:
Replying to rblank:
I suppose the \w pattern in the _anchor_re regexp matches only admissible characters? That would mean the current algorithm does generate only valid ids. Why do we remove underscores?
I don't know, we shouldn't.
Oh, and '_' being part of the identifier spec '\w' (which matches the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database), the regexp is correct.
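The behaviour of \w described here is easy to check. The exact definition of _anchor_re is not quoted in this ticket, so the pattern below is an assumed stand-in that removes everything except \w, ':', '.' and '-'. In Python 3, str patterns are Unicode-aware by default; the re.ASCII flag shows what an ASCII-only variant would do instead.

```python
import re

# Assumed stand-in for _anchor_re: strip runs of inadmissible characters.
anchor_re = re.compile(r"[^\w:.-]+")
ascii_anchor_re = re.compile(r"[^\w:.-]+", re.ASCII)


def strip_inadmissible(text, pattern=anchor_re):
    return pattern.sub("", text)


# '_' and Unicode letters survive; the space and '!' are removed.
print(strip_inadmissible("snake_case \u00e9t\u00e9!"))

# With re.ASCII, the accented letters are removed as well.
print(strip_inadmissible("snake_case \u00e9t\u00e9!", ascii_anchor_re))
```

So the underscore is kept, and so is any Unicode letter, which is why the generated ids are not restricted to ASCII.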
comment:9 by , 15 years ago
Ok, so what do you suggest for this ticket? I suspect Mark was looking for some easy tickets to get started in Trac development, so we should at least state clearly what is expected to close the ticket (or close as wontfix).
follow-up: 11 comment:10 by , 15 years ago
Component: | rendering → wiki system |
---|---|
Description: | modified (diff) |
Milestone: | next-minor-0.12.x → next-major-0.1X |
Summary: | Header IDs contain invalid characters → More readable header IDs |
Hijacking the ticket to focus on improving the header IDs (comment:5), as we don't have such ticket yet and the discussion here is relevant to that topic.
comment:11 by , 15 years ago
Owner: | set to |
---|---|
Priority: | normal → high |
Replying to cboos:
Hijacking the ticket to focus on improving the header IDs (comment:5), as we don't have such ticket yet and the discussion here is relevant to that topic.
We had ticket:8499#comment:1, but that was also a hijacked one ;-)
comment:12 by , 14 years ago
Replying to cboos:
But we're producing XHTML 1.0, and there, it's legal to have unicode characters.
Still, my HTML Validator produces this link as explanation.
So it would probably be best to normalize anchors so that they match [A-Za-z][A-Za-z0-9:_.-]*.
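One way such a normalization could look, as a rough Python sketch. It uses NFKD decomposition from the standard unicodedata module as a crude stand-in for the Text::Unidecode idea from the original report: accented Latin letters lose their accents, but characters without an ASCII decomposition are simply dropped, so the result is readable rather than injective. The name ascii_anchor is hypothetical.

```python
import re
import unicodedata

VALID_ID = re.compile(r"[A-Za-z][A-Za-z0-9:_.-]*\Z")


def ascii_anchor(heading):
    """Hypothetical normalizer producing a backward-compatible id."""
    # Split accented letters into base letter + combining mark, then
    # drop everything that is not ASCII (including the marks).
    decomposed = unicodedata.normalize("NFKD", heading)
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    anchor = re.sub(r"\s+", "_", ascii_text.strip())
    anchor = re.sub(r"[^A-Za-z0-9:_.-]+", "", anchor)
    # The anchor is pure ASCII here, so isalpha() means [A-Za-z].
    if not anchor or not anchor[0].isalpha():
        anchor = "a" + anchor
    return anchor


print(ascii_anchor("c'est l'\u00e9t\u00e9"))
```

Different headings can collide under this scheme (any heading made only of non-decomposable characters ends up as "a"), so some de-duplication would still be needed on top.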
comment:13 by , 13 years ago
And we should make sure those generated ids won't interfere with ids used in CSS rules (see ticket:645#comment:5).
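A possible guard, sketched in Python. The set of ids used by the stylesheets is not listed in this ticket, so the reserved values in the usage line are purely hypothetical placeholders, as is the function name.

```python
def avoid_reserved(anchor, reserved, suffix="_"):
    """Return anchor, extended with suffix until it is not reserved."""
    candidate = anchor
    while candidate in reserved:
        candidate += suffix
    return candidate


# Hypothetical ids that CSS rules might target.
print(avoid_reserved("content", {"content", "footer"}))
```

An alternative would be to put all generated ids in their own namespace with a fixed prefix, at the cost of less readable links.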
comment:14 by , 10 years ago
Owner: | removed |
---|
Some more information on this.
Tested on Trac 0.12dev-r9325 (Windows, sqlite, Firefox 3.6 and IE 7.0.7530)
creating a header like
the generated anchor link is (i.e. copy link location)…
FIREFOX:
IE 7:
Is the bug saying that the header ID should be encoded?
References to HTML 4.01 spec where this is mentioned…
http://www.w3.org/TR/html401/struct/global.html#adef-id
http://www.w3.org/TR/html401/types.html#type-name