String processing functions.
This module provides functions for string processing.
A string in this module is represented by
unicode:chardata()
, that is, a list of codepoints,
binaries with UTF-8-encoded codepoints
(UTF-8 binaries), or a mix of the two.
"abcd" is a valid string
<<"abcd">> is a valid string
["abcd"] is a valid string
<<"abc..åäö"/utf8>> is a valid string
<<"abc..åäö">> is NOT a valid string,
but a binary with Latin-1-encoded codepoints
[<<"abc">>, "..åäö"] is a valid string
[atom] is NOT a valid string
This module operates on grapheme clusters. A grapheme cluster is a user-perceived character, which can be represented by several codepoints.
"å" [229] or [97, 778]
"e̊" [101, 778]
The string length of "ß↑e̊" is 3, even though it is represented by the
codepoints [223,8593,101,778]
or the UTF-8 binary
<<195,159,226,134,145,101,204,138>>
.
Grapheme clusters for codepoints of class prepend
and non-modern (or decomposed) Hangul is not handled for performance
reasons in
find/3
,
replace/3
,
split/2
,
split/2
and
trim/3
.
Splitting and appending strings is to be done on grapheme clusters borders. There is no verification that the results of appending strings are valid or normalized.
Most of the functions expect all input to be normalized to one form,
see for example
unicode:characters_to_nfc_list/1
.
Language or locale specific handling of input is not considered in any function.
The functions can crash for non-valid input strings. For example, the functions expect UTF-8 binaries but not all functions verify that all binaries are encoded correctly.
Unless otherwise specified the return value type is the same as the input type. That is, binary input returns binary output, list input returns a list output, and mixed input can return a mixed output.
1> string:trim(" sarah ").
"sarah"
2> string:trim(<<" sarah ">>).
<<"sarah">>
3> string:lexemes("foo bar", " ").
["foo","bar"]
4> string:lexemes(<<"foo bar">>, " ").
[<<"foo">>,<<"bar">>]
This module has been reworked in Erlang/OTP 20 to
handle
unicode:chardata()
and operate on grapheme
clusters. The old
functions
that only work on Latin-1 lists as input
are still available but should not be used, they will be
deprecated in a future release.
Types
direction() = leading | trailing
grapheme_cluster() = char() | [char()]
A user-perceived character, consisting of one or more codepoints.
Functions
casefold(String :: unicode:chardata()) -> unicode:chardata()
Converts
to a case-agnostic
comparable string. Function casefold/1
is preferred
over lowercase/1
when two strings are to be compared
for equality. See also equal/4
.
Example:
1> string:casefold("Ω and ẞ SHARP S").
"ω and ss sharp s"
chomp(String :: unicode:chardata()) -> unicode:chardata()
Returns a string where any trailing \n
or
\r\n
have been removed from
.
Example:
182>string:chomp(<<"\nHello\n\n">>).
<<"\nHello">> 183>string:chomp("\nHello\r\r\n").
"\nHello\r"
equal(A, B) -> boolean()
A = B = unicode:chardata()
equal(A, B, IgnoreCase) -> boolean()
A = B = unicode:chardata()
IgnoreCase = boolean()
equal(A, B, IgnoreCase, Norm) -> boolean()
A = B = unicode:chardata()
IgnoreCase = boolean()
Norm = none | nfc | nfd | nfkc | nfkd
Returns true
if
and
are equal, otherwise false
.
If
is true
the function does
casefold
ing on the fly before the equality test.
If
is not none
the function applies normalization on the fly before the equality test.
There are four available normalization forms:
nfc
,
nfd
,
nfkc
, and
nfkd
.
By default,
is false
and
is none
.
Example:
1>string:equal("åäö", <<"åäö"/utf8>>).
true 2>string:equal("åäö", unicode:characters_to_nfd_binary("åäö")).
false 3>string:equal("åäö", unicode:characters_to_nfd_binary("ÅÄÖ"), true, nfc).
true
find(String, SearchPattern) -> unicode:chardata() | nomatch
String = SearchPattern = unicode:chardata()
find(String, SearchPattern, Dir) -> unicode:chardata() | nomatch
String = SearchPattern = unicode:chardata()
Dir = direction()
Removes anything before
in
and returns the remainder of the string or nomatch
if
is not
found.
, which can be leading
or
trailing
, indicates from which direction characters
are to be searched.
By default,
is leading
.
Example:
1>string:find("ab..cd..ef", ".").
"..cd..ef" 2>string:find(<<"ab..cd..ef">>, "..", trailing).
<<"..ef">> 3>string:find(<<"ab..cd..ef">>, "x", leading).
nomatch 4>string:find("ab..cd..ef", "x", trailing).
nomatch
is_empty(String :: unicode:chardata()) -> boolean()
Returns true
if
is the
empty string, otherwise false
.
Example:
1>string:is_empty("foo").
false 2>string:is_empty(["",<<>>]).
true
length(String :: unicode:chardata()) -> integer() >= 0
Returns the number of grapheme clusters in
.
Example:
1>string:length("ß↑e̊").
3 2>string:length(<<195,159,226,134,145,101,204,138>>).
3
lexemes(String :: unicode:chardata(),
SeparatorList :: [grapheme_cluster()]) ->
[unicode:chardata()]
Returns a list of lexemes in
, separated
by the grapheme clusters in
.
Notice that, as shown in this example, two or more
adjacent separator graphemes clusters in
are treated as one. That is, there are no empty
strings in the resulting list of lexemes.
See also split/3
which returns
empty strings.
Notice that [$\r,$\n]
is one grapheme cluster.
Example:
1>string:lexemes("abc de̊fxxghix jkl\r\nfoo", "x e" ++ [[$\r,$\n]]).
["abc","de̊f","ghi","jkl","foo"] 2>string:lexemes(<<"abc de̊fxxghix jkl\r\nfoo"/utf8>>, "x e" ++ [$\r,$\n]).
[<<"abc">>,<<"de̊f"/utf8>>,<<"ghi">>,<<"jkl\r\nfoo">>]
lowercase(String :: unicode:chardata()) -> unicode:chardata()
Converts
to lowercase.
Notice that function casefold/1
should be used when converting a string to
be tested for equality.
Example:
2> string:lowercase(string:uppercase("Michał")).
"michał"
next_codepoint(String :: unicode:chardata()) ->
maybe_improper_list(char(), unicode:chardata()) |
{error, unicode:chardata()}
Returns the first codepoint in
and the rest of
in the tail. Returns
an empty list if
is empty or an
{error, String}
tuple if the next byte is invalid.
Example:
1> string:next_codepoint(unicode:characters_to_binary("e̊fg")).
[101|<<"̊fg"/utf8>>]
next_grapheme(String :: unicode:chardata()) ->
maybe_improper_list(grapheme_cluster(),
unicode:chardata()) |
{error, unicode:chardata()}
Returns the first grapheme cluster in
and the rest of
in the tail. Returns
an empty list if
is empty or an
{error, String}
tuple if the next byte is invalid.
Example:
1> string:next_grapheme(unicode:characters_to_binary("e̊fg")).
["e̊"|<<"fg">>]
nth_lexeme(String, N, SeparatorList) -> unicode:chardata()
String = unicode:chardata()
N = integer() >= 0
SeparatorList = [grapheme_cluster()]
Returns lexeme number
in
, where lexemes are separated by
the grapheme clusters in
.
Example:
1> string:nth_lexeme("abc.de̊f.ghiejkl", 3, ".e").
"ghi"
pad(String, Length) -> unicode:charlist()
String = unicode:chardata()
Length = integer()
pad(String, Length, Dir) -> unicode:charlist()
String = unicode:chardata()
Length = integer()
Dir = direction() | both
pad(String, Length, Dir, Char) -> unicode:charlist()
String = unicode:chardata()
Length = integer()
Dir = direction() | both
Char = grapheme_cluster()
Pads
to
with
grapheme cluster
.
, which can be leading
, trailing
,
or both
, indicates where the padding should be added.
By default,
is $\s
and
is trailing
.
Example:
1>string:pad(<<"He̊llö"/utf8>>, 8).
[<<72,101,204,138,108,108,195,182>>,32,32,32] 2>io:format("'~ts'~n",[string:pad("He̊llö", 8, leading)]).
' He̊llö' 3>io:format("'~ts'~n",[string:pad("He̊llö", 8, both)]).
' He̊llö '
prefix(String :: unicode:chardata(), Prefix :: unicode:chardata()) ->
nomatch | unicode:chardata()
If
is the prefix of
, removes it and returns the
remainder of
, otherwise returns
nomatch
.
Example:
1>string:prefix(<<"prefix of string">>, "pre").
<<"fix of string">> 2>string:prefix("pre", "prefix").
nomatch
replace(String, SearchPattern, Replacement) ->
[unicode:chardata()]
String = SearchPattern = Replacement = unicode:chardata()
replace(String, SearchPattern, Replacement, Where) ->
[unicode:chardata()]
String = SearchPattern = Replacement = unicode:chardata()
Where = direction() | all
Replaces
in
with
.
, default leading
, indicates whether
the leading
, the trailing
or all
encounters of
are to be replaced.
Can be implemented as:
lists:join(Replacement, split(String, SearchPattern, Where)).
Example:
1>string:replace(<<"ab..cd..ef">>, "..", "*").
[<<"ab">>,"*",<<"cd..ef">>] 2>string:replace(<<"ab..cd..ef">>, "..", "*", all).
[<<"ab">>,"*",<<"cd">>,"*",<<"ef">>]
reverse(String :: unicode:chardata()) -> [grapheme_cluster()]
Returns the reverse list of the grapheme clusters in
.
Example:
1> Reverse =string:reverse(unicode:characters_to_nfd_binary("ÅÄÖ")).
[[79,776],[65,776],[65,778]] 2>io:format("~ts~n",[Reverse]).
ÖÄÅ
slice(String, Start) -> Slice
String = unicode:chardata()
Start = integer() >= 0
Slice = unicode:chardata()
slice(String, Start, Length) -> Slice
String = unicode:chardata()
Start = integer() >= 0
Length = infinity | integer() >= 0
Slice = unicode:chardata()
Returns a substring of
of
at most
grapheme clusters, starting at position
.
By default,
is infinity
.
Example:
1>string:slice(<<"He̊llö Wörld"/utf8>>, 4).
<<"ö Wörld"/utf8>> 2>string:slice(["He̊llö ", <<"Wörld"/utf8>>], 4,4).
"ö Wö" 3>string:slice(["He̊llö ", <<"Wörld"/utf8>>], 4,50).
"ö Wörld"
split(String, SearchPattern) -> [unicode:chardata()]
String = SearchPattern = unicode:chardata()
split(String, SearchPattern, Where) -> [unicode:chardata()]
String = SearchPattern = unicode:chardata()
Where = direction() | all
Splits
where
is encountered and return the remaining parts.
, default leading
, indicates whether
the leading
, the trailing
or all
encounters of
will split
.
Example:
0>string:split("ab..bc..cd", "..").
["ab","bc..cd"] 1>string:split(<<"ab..bc..cd">>, "..", trailing).
[<<"ab..bc">>,<<"cd">>] 2>string:split(<<"ab..bc....cd">>, "..", all).
[<<"ab">>,<<"bc">>,<<>>,<<"cd">>]
take(String, Characters) -> {Leading, Trailing}
String = unicode:chardata()
Characters = [grapheme_cluster()]
Leading = Trailing = unicode:chardata()
take(String, Characters, Complement) -> {Leading, Trailing}
String = unicode:chardata()
Characters = [grapheme_cluster()]
Complement = boolean()
Leading = Trailing = unicode:chardata()
take(String, Characters, Complement, Dir) -> {Leading, Trailing}
String = unicode:chardata()
Characters = [grapheme_cluster()]
Complement = boolean()
Dir = direction()
Leading = Trailing = unicode:chardata()
Takes characters from
as long as
the characters are members of set
or the complement of set
.
,
which can be leading
or trailing
, indicates from
which direction characters are to be taken.
Example:
5>string:take("abc0z123", lists:seq($a,$z)).
{"abc","0z123"} 6>string:take(<<"abc0z123">>, lists:seq($0,$9), true, leading).
{<<"abc">>,<<"0z123">>} 7>string:take("abc0z123", lists:seq($0,$9), false, trailing).
{"abc0z","123"} 8>string:take(<<"abc0z123">>, lists:seq($a,$z), true, trailing).
{<<"abc0z">>,<<"123">>}
titlecase(String :: unicode:chardata()) -> unicode:chardata()
Converts
to titlecase.
Example:
1> string:titlecase("ß is a SHARP s").
"Ss is a SHARP s"
to_float(String) -> {Float, Rest} | {error, Reason}
String = unicode:chardata()
Float = float()
Rest = unicode:chardata()
Reason = no_float | badarg
Argument
is expected to start with a
valid text represented float (the digits are ASCII values).
Remaining characters in the string after the float are returned in
.
Example:
>{F1,Fs} = string:to_float("1.0-1.0e-1"),
>{F2,[]} = string:to_float(Fs),
>F1+F2.
0.9 >string:to_float("3/2=1.5").
{error,no_float} >string:to_float("-1.5eX").
{-1.5,"eX"}
to_integer(String) -> {Int, Rest} | {error, Reason}
String = unicode:chardata()
Int = integer()
Rest = unicode:chardata()
Reason = no_integer | badarg
Argument
is expected to start with a
valid text represented integer (the digits are ASCII values).
Remaining characters in the string after the integer are returned in
.
Example:
>{I1,Is} = string:to_integer("33+22"),
>{I2,[]} = string:to_integer(Is),
>I1-I2.
11 >string:to_integer("0.5").
{0,".5"} >string:to_integer("x=2").
{error,no_integer}
to_graphemes(String :: unicode:chardata()) -> [grapheme_cluster()]
Converts
to a list of grapheme clusters.
Example:
1>string:to_graphemes("ß↑e̊").
[223,8593,[101,778]] 2>string:to_graphemes(<<"ß↑e̊"/utf8>>).
[223,8593,[101,778]]
trim(String) -> unicode:chardata()
String = unicode:chardata()
trim(String, Dir) -> unicode:chardata()
String = unicode:chardata()
Dir = direction() | both
trim(String, Dir, Characters) -> unicode:chardata()
String = unicode:chardata()
Dir = direction() | both
Characters = [grapheme_cluster()]
Returns a string, where leading or trailing, or both,
have been removed.
which can be leading
, trailing
,
or both
, indicates from which direction characters
are to be removed.
Default
is the set of
nonbreakable whitespace codepoints, defined as
Pattern_White_Space in
By default,
is both
.
Notice that [$\r,$\n]
is one grapheme cluster according
to the Unicode Standard.
Example:
1>string:trim("\t Hello \n").
"Hello" 2>string:trim(<<"\t Hello \n">>, leading).
<<"Hello \n">> 3>string:trim(<<".Hello.\n">>, trailing, "\n.").
<<".Hello">>
uppercase(String :: unicode:chardata()) -> unicode:chardata()
Converts
to uppercase.
See also titlecase/1
.
Example:
1> string:uppercase("Michał").
"MICHAŁ"
Obsolete API functions
Here follows the function of the old API. These functions only work on a list of Latin-1 characters.
Note!
The functions are kept for backward compatibility, but are not recommended. They will be deprecated in a future release.
Any undocumented functions in string
are not to be used.
Functions
centre(String, Number) -> Centered
String = Centered = string()
Number = integer() >= 0
centre(String, Number, Character) -> Centered
String = Centered = string()
Number = integer() >= 0
Character = char()
chars(Character, Number) -> String
Character = char()
Number = integer() >= 0
String = string()
chars(Character, Number, Tail) -> String
Character = char()
Number = integer() >= 0
Tail = String = string()
Returns a string consisting of
characters
. Optionally, the string can end with
string
.
This function is obsolete.
Use
lists:duplicate/2
.
chr(String, Character) -> Index
String = string()
Character = char()
Index = integer() >= 0
concat(String1, String2) -> String3
String1 = String2 = String3 = string()
Concatenates
and
to form a new string
, which is returned.
This function is obsolete.
Use [
as
Data
argument, and call
unicode:characters_to_list/2
or
unicode:characters_to_binary/2
to flatten the output.
copies(String, Number) -> Copies
String = Copies = string()
Number = integer() >= 0
Returns a string containing
repeated
times.
This function is obsolete.
Use
lists:duplicate/2
.
cspan(String, Chars) -> Length
String = Chars = string()
Length = integer() >= 0
join(StringList, Separator) -> String
StringList = [string()]
Separator = String = string()
Returns a string with the elements of
separated by the string in
.
This function is obsolete.
Use
lists:join/2
.
Example:
> join(["one", "two", "three"], ", "). "one, two, three"
left(String, Number) -> Left
String = Left = string()
Number = integer() >= 0
left(String, Number, Character) -> Left
String = Left = string()
Number = integer() >= 0
Character = char()
len(String) -> Length
String = string()
Length = integer() >= 0
rchr(String, Character) -> Index
String = string()
Character = char()
Index = integer() >= 0
right(String, Number) -> Right
String = Right = string()
Number = integer() >= 0
right(String, Number, Character) -> Right
String = Right = string()
Number = integer() >= 0
Character = char()
rstr(String, SubString) -> Index
String = SubString = string()
Index = integer() >= 0
span(String, Chars) -> Length
String = Chars = string()
Length = integer() >= 0
str(String, SubString) -> Index
String = SubString = string()
Index = integer() >= 0
strip(String :: string()) -> string()
strip(String, Direction) -> Stripped
String = Stripped = string()
Direction = left | right | both
strip(String, Direction, Character) -> Stripped
String = Stripped = string()
Direction = left | right | both
Character = char()
Returns a string, where leading or trailing, or both, blanks or a
number of
have been removed.
, which can be left
, right
,
or both
, indicates from which direction blanks are to be
removed. strip/1
is equivalent to
strip(String, both)
.
This function is obsolete.
Use
trim/3
.
Example:
> string:strip("...Hello.....", both, $.). "Hello"
sub_string(String, Start) -> SubString
String = SubString = string()
Start = integer() >= 1
sub_string(String, Start, Stop) -> SubString
String = SubString = string()
Start = Stop = integer() >= 1
substr(String, Start) -> SubString
String = SubString = string()
Start = integer() >= 1
substr(String, Start, Length) -> SubString
String = SubString = string()
Start = integer() >= 1
Length = integer() >= 0
sub_word(String, Number) -> Word
String = Word = string()
Number = integer()
sub_word(String, Number, Character) -> Word
String = Word = string()
Number = integer()
Character = char()
Returns the word in position
of
. Words are separated by blanks or
s.
This function is obsolete.
Use
nth_lexeme/3
.
Example:
> string:sub_word(" Hello old boy !",3,$o). "ld b"
The specified string or character is case-converted. Notice that the supported character set is ISO/IEC 8859-1 (also called Latin 1); all values outside this set are unchanged
This function is obsolete use
lowercase/1
,
uppercase/1
,
titlecase/1
or
casefold/1
.
tokens(String, SeparatorList) -> Tokens
String = SeparatorList = string()
Tokens = [Token :: nonempty_string()]
Returns a list of tokens in
, separated
by the characters in
.
Example:
> tokens("abc defxxghix jkl", "x "). ["abc", "def", "ghi", "jkl"]
Notice that, as shown in this example, two or more
adjacent separator characters in
are treated as one. That is, there are no empty
strings in the resulting list of tokens.
words(String) -> Count
String = string()
Count = integer() >= 1
words(String, Character) -> Count
String = string()
Character = char()
Count = integer() >= 1
Notes
Some of the general string functions can seem to overlap each other. The reason is that this string package is the combination of two earlier packages and all functions of both packages have been retained.