There are many different invisible characters. Many of them have the property WSpace=Y
("whitespace") in Unicode. But some special characters are not considered "whitespace" and still have no visible representation. The excellent Wikipedia articles about space (punctuation) and whitespace characters should give you an idea.
<rant>Unicode sucks in this regard: introducing lots of exotic characters that mainly serve to confuse people.</rant>
The standard SQL trim()
function by default only trims the basic Latin space character (Unicode: U+0020 / ASCII 32). Same with the rtrim()
and ltrim()
variants. Your call also only targets that particular character.
Use regular expressions with regexp_replace()
instead.
Trailing
To remove all trailing white space (but not white space inside the string):
SELECT regexp_replace(eventdate, 's+$', '') FROM eventdates;
The regular expression explained:
s
... regular expression class shorthand for [[:space:]]
????- which is the set of white-space characters - see limitations below
+
... 1 or more consecutive matches
$
... end of string
Demo:
SELECT regexp_replace('inner white ', 's+$', '') || '|'
Returns:
inner white|
Yes, that's a single backslash (
). Details in this related answer:
Leading
To remove all leading white space (but not white space inside the string):
regexp_replace(eventdate, '^s+', '')
^
.. start of string
Both
To remove both, you can chain above function calls:
regexp_replace(regexp_replace(eventdate, '^s+', ''), 's+$', '')
Or you can combine both in a single call with two branches.
Add 'g'
as 4th parameter to replace all matches, not just the first:
regexp_replace(eventdate, '^s+|s+$', '', 'g')
But that should typically be faster with substring()
:
substring(eventdate, 'S(?:.*S)*')
S
... everything but white space
(?:
re
)
... non-capturing set of parentheses
.*
... any string of 0-n characters
Or one of these:
substring(eventdate, '^s*(.*S)')
substring(eventdate, '(S.*S)') -- only works for 2+ printing characters
(
re
)
... Capturing set of parentheses
Effectively takes the first non-whitespace character and everything up to the last non-whitespace character if available.
Whitespace?
There are a few more related characters which are not classified as "whitespace" in Unicode - so not contained in the character class [[:space:]]
.
These print as invisible glyphs in pgAdmin for me: "mongolian vowel", "zero width space", "zero width non-joiner", "zero width joiner":
SELECT E'u180e', E'u200B', E'u200C', E'u200D';
'?' | '?' | '?' | '?'
Two more, printing as visible glyphs in pgAdmin, but invisible in my browser: "word joiner", "zero width non-breaking space":
SELECT E'u2060', E'uFEFF';
'?' | ''
Ultimately, whether characters are rendered invisible or not also depends on the font used for display.
To remove all of these as well, replace 's'
with '[su180eu200Bu200Cu200Du2060uFEFF]'
or '[s?????]'
(note trailing invisible characters!).
Example, instead of:
regexp_replace(eventdate, 's+$', '')
use:
regexp_replace(eventdate, '[su180eu200Bu200Cu200Du2060uFEFF]+$', '')
or:
regexp_replace(eventdate, '[s?????]+$', '') -- note invisible characters
Limitations
There is also the Posix character class [[:graph:]]
supposed to represent "visible characters". Example:
substring(eventdate, '([[:graph:]].*[[:graph:]])')
It works reliably for ASCII characters in every setup (where it boils down to [x21-x7E]
), but beyond that you currently (incl. pg 10) depend on information provided by the underlying OS (to define ctype
) and possibly locale settings.
Strictly speaking, that's the case for every reference to a character class, but there seems to be more disagreement with the less commonly used ones like graph. But you may have to add more characters to the character class [[:space:]]
(shorthand s
) to catch all whitespace characters. Like: u2007
, u202f
and u00a0
seem to also be missing for @XiCoN JFS.
The manual:
Within a bracket expression, the name of a character class enclosed in
[:
and :]
stands for the list of all characters belonging to that
class. Standard character class names are: alnum
, alpha
, blank
, cntrl
,
digit
, graph
, lower
, print
, punct
, space
, upper
, xdigit
.
These stand for the character classes defined in ctype.
A locale can provide others.
Bold emphasis mine.
Also note this limitation that was fixed with Postgres 10:
Fix regular expressions' character class handling for large character
codes, particularly Unicode characters above U+7FF
(Tom Lane)
Previously, such characters were never recognized as belonging to
locale-dependent character classes such as [[:alpha:]]
.