Because they're not there! Why character classes at all? The whole point of RE languages is to make pattern matching simpler than writing your own matching algorithms. Standardized RE languages are also easier to read, write, maintain and share than the alternatives. Character class syntax exists to express, in simple notation, the shared properties of a group of characters.
So what properties do Ethiopic characters have that RE languages do not detect? Each Ethiopic letter carries two properties that should be matched independently. Each letter is a syllable, a "CV" pattern, so a means to detect either the "C" part or the "V" part is highly desirable. There are 7 basic "V" forms shared by 37 "C" bases. This gives us 7 "V" classes of 37 members each and, inversely, 37 "C" classes of 7 members each. We need a simple way to specify these groups without typing out every member of the group in a bracketed list.
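The two groupings can be sketched with plain code-point arithmetic. This is only an illustration and not how the package works internally: it assumes the layout of the Unicode Ethiopic block, where each family starts on a multiple of 8 and the seven basic forms sit at offsets 0 to 6. The names family_members and form_members are hypothetical.

```perl
use strict;
use warnings;
use utf8;
binmode(STDOUT, ':encoding(UTF-8)');

# Assumption: only the seven basic forms are handled here.

# The "C" class: the seven basic forms of the family that contains $char.
sub family_members {
    my ($char) = @_;
    my $base = ord($char) - ord($char) % 8;   # first code point of the row
    return map { chr($base + $_) } 0 .. 6;
}

# The "V" class: one form (1..7) taken across a list of family heads.
sub form_members {
    my ($form, @heads) = @_;
    return map { chr(ord($_) + $form - 1) } @heads;
}

print join(' ', family_members('ሊ')), "\n";            # the ለ family
print join(' ', form_members(2, qw(ለ መ ረ ሰ))), "\n";   # second forms
```

Writing either group by hand as a bracketed list is what the class notation is meant to avoid.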
But isn't this just a matter of convenience? Yes, but we like convenience; that's why we have character classes and REs to begin with. Since Perl 5.6, Unicode-derived character classes have been available in the form of \p{IsDigit}. While this is great for working across scripts in a multilingual document or archive, it is less nice when working with documents you know are English only. Really, would you use \p{IsDigit} when you could more easily use \d (or [0-9] or [:digit:])?
This package offers overloading of the Perl regular expressions mechanism to provide syllabic-style character class definitions with the convenience and ease of use of POSIX notation. See the examples/ directory and the module documentation for details.
Family | [#1#] | [#2#] | [#3#] | [#4#] | [#5#] | [#6#] | [#7#] | [#8#] | [#9#] | [#10#] | [#11#] | [#12#] |
---|---|---|---|---|---|---|---|---|---|---|---|---|
[#ሀ#] | ሀ | ሁ | ሂ | ሃ | ሄ | ህ | ሆ | |||||
[#ለ#] | ለ | ሉ | ሊ | ላ | ሌ | ል | ሎ | ሏ | ||||
[#ሐ#] | ሐ | ሑ | ሒ | ሓ | ሔ | ሕ | ሖ | ሗ | ||||
[#መ#] | መ | ሙ | ሚ | ማ | ሜ | ም | ሞ | ሟ | ||||
[#ሠ#] | ሠ | ሡ | ሢ | ሣ | ሤ | ሥ | ሦ | ሧ | ||||
[#ረ#] | ረ | ሩ | ሪ | ራ | ሬ | ር | ሮ | ሯ | ||||
[#ሰ#] | ሰ | ሱ | ሲ | ሳ | ሴ | ስ | ሶ | ሷ | ||||
[#ሸ#] | ሸ | ሹ | ሺ | ሻ | ሼ | ሽ | ሾ | ሿ | ||||
[#ቀ#] | ቀ | ቁ | ቂ | ቃ | ቄ | ቅ | ቆ | ቈ | ቍ | ቊ | ቋ | ቌ |
[#ቐ#] | ቐ | ቑ | ቒ | ቓ | ቔ | ቕ | ቖ | ቘ | ቝ | ቚ | ቛ | ቜ |
[#በ#] | በ | ቡ | ቢ | ባ | ቤ | ብ | ቦ | ቧ | ||||
[#ቨ#] | ቨ | ቩ | ቪ | ቫ | ቬ | ቭ | ቮ | ቯ | ||||
[#ተ#] | ተ | ቱ | ቲ | ታ | ቴ | ት | ቶ | ቷ | ||||
[#ቸ#] | ቸ | ቹ | ቺ | ቻ | ቼ | ች | ቾ | ቿ | ||||
[#ኀ#] | ኀ | ኁ | ኂ | ኃ | ኄ | ኅ | ኆ | ኈ | ኍ | ኊ | ኋ | ኌ |
[#ነ#] | ነ | ኑ | ኒ | ና | ኔ | ን | ኖ | ኗ | ||||
[#ኘ#] | ኘ | ኙ | ኚ | ኛ | ኜ | ኝ | ኞ | ኟ | ||||
[#አ#] | አ | ኡ | ኢ | ኣ | ኤ | እ | ኦ | ኧ | ||||
[#ከ#] | ከ | ኩ | ኪ | ካ | ኬ | ክ | ኮ | ኰ | ኵ | ኲ | ኳ | ኴ |
[#ኸ#] | ኸ | ኹ | ኺ | ኻ | ኼ | ኽ | ኾ | ዀ | ዅ | ዂ | ዃ | ዄ |
[#ወ#] | ወ | ዉ | ዊ | ዋ | ዌ | ው | ዎ | |||||
[#ዐ#] | ዐ | ዑ | ዒ | ዓ | ዔ | ዕ | ዖ | |||||
[#ዘ#] | ዘ | ዙ | ዚ | ዛ | ዜ | ዝ | ዞ | ዟ | ||||
[#ዠ#] | ዠ | ዡ | ዢ | ዣ | ዤ | ዥ | ዦ | ዧ | ||||
[#የ#] | የ | ዩ | ዪ | ያ | ዬ | ይ | ዮ | |||||
[#ደ#] | ደ | ዱ | ዲ | ዳ | ዴ | ድ | ዶ | ዷ | ||||
[#ዸ#] | ዸ | ዹ | ዺ | ዻ | ዼ | ዽ | ዾ | ዿ | ||||
[#ጀ#] | ጀ | ጁ | ጂ | ጃ | ጄ | ጅ | ጆ | ጇ | ||||
[#ገ#] | ገ | ጉ | ጊ | ጋ | ጌ | ግ | ጎ | ጐ | ጕ | ጒ | ጓ | ጔ |
[#ጘ#] | ጘ | ጙ | ጚ | ጛ | ጜ | ጝ | ጞ | |||||
[#ጠ#] | ጠ | ጡ | ጢ | ጣ | ጤ | ጥ | ጦ | ጧ | ||||
[#ጨ#] | ጨ | ጩ | ጪ | ጫ | ጬ | ጭ | ጮ | ጯ | ||||
[#ጰ#] | ጰ | ጱ | ጲ | ጳ | ጴ | ጵ | ጶ | ጷ | ||||
[#ጸ#] | ጸ | ጹ | ጺ | ጻ | ጼ | ጽ | ጾ | ጿ | ||||
[#ፀ#] | ፀ | ፁ | ፂ | ፃ | ፄ | ፅ | ፆ | |||||
[#ፈ#] | ፈ | ፉ | ፊ | ፋ | ፌ | ፍ | ፎ | ፏ | ||||
[#ፐ#] | ፐ | ፑ | ፒ | ፓ | ፔ | ፕ | ፖ | ፗ |

[Figures not reproduced: three panels, each pairing "Equivalence in Phono-Orthography" with "Equivalence of Families"; in the third panel the Equivalence of Families is marked "none occurring".]

Category | Amharic | Tigrigna-ER | Tigrigna-ET | Ge'ez |
---|---|---|---|---|
Valid and Probable | ዓለምፀሐይ ዓለምጸሐይ ዓለምጸሃይ ዓለምፀሃይ አለምፀሐይ አለምጸሐይ አለምጸሃይ አለምፀሃይ | ዓለምጸሓይ | ዓለምፀሐይ ዓለምጸሐይ | ዓለምፀሐይ |
Intermediate Probability | አለምጸሀይ አለምፀሀይ ዓለምጸሀይ ዓለምፀሀይ ዐለምፀሐይ ዐለምጸሐይ ዐለምጸሃይ ዐለምፀሃይ ዐለምጸሀይ ዐለምፀሀይ | ዓለምጸሐይ ዐለምጸሓይ ዐለምጸሐይ ዓለምፀሐይ | ዓለምጸሓይ ዓለምፀሓይ | ዐለምፀሐይ |
Valid but Improbable | ዓለምጸኃይ ዓለምጸኀይ ዓለምፀኃይ ዓለምፀኀይ ዓለምጸሓይ ዓለምፀሓይ ዓለምጸኻይ ዓለምፀኻይ አለምጸኃይ አለምጸኀይ አለምፀኃይ አለምፀኀይ አለምጸሓይ አለምፀሓይ አለምጸኻይ አለምፀኻይ ዐለምጸኃይ ዐለምጸኀይ ዐለምፀኃይ ዐለምፀኀይ ዐለምጸሓይ ዐለምፀሓይ ዐለምጸኻይ ዐለምፀኻይ ኣለምፀሐይ ኣለምጸሐይ ኣለምጸሃይ ኣለምፀሃይ ኣለምጸሀይ ኣለምጸኃይ ኣለምጸኀይ ኣለምፀሀይ ኣለምፀኃይ ኣለምፀኀይ ኣለምጸሓይ ኣለምፀሓይ ኣለምጸኻይ ኣለምፀኻይ | ዓለምፀሓይ ዐለምፀሐይ ዐለምፀሓይ | ዐለምፀሐይ ዐለምጸሐይ ዐለምጸሓይ ዐለምፀሓይ | ዓለምፀሓይ ዐለምፀሓይ |
Impossible (Invalid Phonemes) | ዓለምጸሃይ ዓለምፀሃይ ዓለምጸሀይ ዓለምጸኃይ ዓለምጸኀይ ዓለምፀሀይ ዓለምፀኃይ ዓለምፀኀይ ዓለምጸኻይ ዓለምፀኻይ አለምፀሐይ አለምጸሐይ አለምጸሃይ አለምፀሃይ አለምጸሀይ አለምጸኃይ አለምጸኀይ አለምፀሀይ አለምፀኃይ አለምፀኀይ አለምጸሓይ አለምፀሓይ አለምጸኻይ አለምፀኻይ ዐለምጸሃይ ዐለምፀሃይ ዐለምጸሀይ ዐለምጸኃይ ዐለምጸኀይ ዐለምፀሀይ ዐለምፀኃይ ዐለምፀኀይ ዐለምጸኻይ ዐለምፀኻይ ኣለምፀሐይ ኣለምጸሐይ ኣለምጸሃይ ኣለምፀሃይ ኣለምጸሀይ ኣለምጸኃይ ኣለምጸኀይ ኣለምፀሀይ ኣለምፀኃይ ኣለምፀኀይ ኣለምጸሓይ ኣለምፀሓይ ኣለምጸኻይ ኣለምፀኻይ | ዓለምጸሃይ ዓለምፀሃይ ዓለምጸሀይ ዓለምጸኃይ ዓለምጸኀይ ዓለምፀሀይ ዓለምፀኃይ ዓለምፀኀይ ዓለምጸኻይ ዓለምፀኻይ አለምፀሐይ አለምጸሐይ አለምጸሃይ አለምፀሃይ አለምጸሀይ አለምጸኃይ አለምጸኀይ አለምፀሀይ አለምፀኃይ አለምፀኀይ አለምጸሓይ አለምፀሓይ አለምጸኻይ አለምፀኻይ ዐለምጸሃይ ዐለምፀሃይ ዐለምጸሀይ ዐለምጸኃይ ዐለምጸኀይ ዐለምፀሀይ ዐለምፀኃይ ዐለምፀኀይ ዐለምጸኻይ ዐለምፀኻይ ኣለምፀሐይ ኣለምጸሐይ ኣለምጸሃይ ኣለምፀሃይ ኣለምጸሀይ ኣለምጸኃይ ኣለምጸኀይ ኣለምፀሀይ ኣለምፀኃይ ኣለምፀኀይ ኣለምጸሓይ ኣለምፀሓይ ኣለምጸኻይ ኣለምፀኻይ | ዓለምጸሐይ ዓለምጸሃይ ዓለምፀሃይ ዓለምጸሀይ ዓለምጸኃይ ዓለምጸኀይ ዓለምፀሀይ ዓለምፀኃይ ዓለምፀኀይ ዓለምጸሓይ ዓለምጸኻይ ዓለምፀኻይ አለምፀሐይ አለምጸሐይ አለምጸሃይ አለምፀሃይ አለምጸሀይ አለምጸኃይ አለምጸኀይ አለምፀሀይ አለምፀኃይ አለምፀኀይ አለምጸሓይ አለምፀሓይ አለምጸኻይ አለምፀኻይ ዐለምጸሐይ ዐለምጸሃይ ዐለምፀሃይ ዐለምጸሀይ ዐለምጸኃይ ዐለምጸኀይ ዐለምፀሀይ ዐለምፀኃይ ዐለምፀኀይ ዐለምጸሓይ ዐለምጸኻይ ዐለምፀኻይ ኣለምፀሐይ ኣለምጸሐይ ኣለምጸሃይ ኣለምፀሃይ ኣለምጸሀይ ኣለምጸኃይ ኣለምጸኀይ ኣለምፀሀይ ኣለምፀኃይ ኣለምፀኀይ ኣለምጸሓይ ኣለምፀሓይ ኣለምጸኻይ ኣለምፀኻይ |
The following shows by example how the character classes may be applied to the ዓለምፀሐይ example. The majority of the examples also appear in the overload.pl script found in the examples directory.
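As a baseline for comparison, the spelling variants in the table can be matched with ordinary bracketed lists and no package at all. The sketch below uses only core Perl. The three bracketed lists were read off the table entries, so treat them as an assumption rather than an authoritative inventory.

```perl
use strict;
use warnings;
use utf8;

# Spelling variants written as explicit bracketed lists: 4 x 2 x 7 = 56
# combinations around the fixed letters ለም and ይ.
my $variants = qr/\A[ዓአዐኣ]ለም[ፀጸ][ሐሃሀኃኀሓኻ]ይ\z/;

for my $word (qw(ዓለምፀሐይ አለምጸሃይ ዐለምፀኻይ)) {
    print(($word =~ $variants) ? "match\n" : "no match\n");
}
```

Each bracketed list here is the kind of group that a syllabic character class is meant to abbreviate.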
The Regexp::Ethiopic package (as well as its subclasses) exports a number of utility functions that operate on Ethiopic characters and strings. Specify ":utils" as an import option to bring the functions into the local namespace.
A utility function to query the "form" of an Ethiopic syllable. It returns an integer between 1 and 12 corresponding to the [#\d+#] classes. The function used in the following example:
print getForm ( "አ" ), "\n";
will print the number 1.
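For readers curious how such a lookup can work, here is a minimal sketch that derives the form from the code point. It is not the package's implementation: basic_form is a hypothetical name, it relies on the Unicode Ethiopic block layout (rows of 8, basic forms at offsets 0 to 6), and it deliberately leaves out forms 8 through 12.

```perl
use strict;
use warnings;
use utf8;

# Rows of the Ethiopic block that hold the labiovelar forms; this sketch
# skips them instead of assigning them form numbers.
my %labiovelar_row = map { $_ => 1 }
    (0x1248, 0x1258, 0x1288, 0x12B0, 0x12C0, 0x1310);

# Hypothetical stand-in for a form query, covering forms 1..7 only.
sub basic_form {
    my ($char) = @_;
    my $cp = ord $char;
    return if $cp < 0x1200 || $cp > 0x1357;   # outside the syllable rows
    my $offset = $cp % 8;
    my $row    = $cp - $offset;
    return if $labiovelar_row{$row} || $offset == 7;
    return $offset + 1;
}

print basic_form("አ"), "\n";   # 1, agreeing with the example above
print basic_form("ሚ"), "\n";   # 3
```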
A utility function to set the form number of a syllable. The form number must be an integer between 1 and 12 corresponding to the [#\d+#] classes. Useful in substitutions as per:
s/(.)/setForm($1, 1)/eg;
where every matching character found will be converted into the first form.
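The effect of such a substitution can be reproduced in a self-contained way. The sketch below is an assumption-laden stand-in, not the package's code: set_basic_form is a hypothetical helper that handles forms 1 to 7 only, using the row-of-8 layout of the Unicode Ethiopic block, and it ignores the labiovelar rows.

```perl
use strict;
use warnings;
use utf8;
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical stand-in: move a syllable to another basic form (1..7).
sub set_basic_form {
    my ($char, $form) = @_;
    die "this sketch only handles forms 1..7\n" unless $form =~ /\A[1-7]\z/;
    my $cp = ord $char;
    return $char if $cp < 0x1200 || $cp > 0x1357;   # leave other characters alone
    return chr($cp - $cp % 8 + $form - 1);
}

# Convert every character of the word to its first form.
(my $text = "ሰላም") =~ s/(.)/set_basic_form($1, 1)/eg;
print "$text\n";   # ሰለመ
```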
A utility function to set the form number of a syllable based on the form of another syllable. Useful in substitutions as per:
s/(\w+)([#ፀ#])/$1.subForm('ጸ', $2)/eg;
where every character in the class [#ፀ#] will be converted to the 'ጸ' family in the form number of the matched character. That is, if 'ፂ' is matched it will be converted to 'ጺ'.
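The same conversion can be sketched without the package by copying the form offset from the matched syllable onto the target family. The helper sub_basic_form is hypothetical and covers the seven basic forms only. The explicit range [ፀ-ፆ] stands in for the [#ፀ#] class, on the assumption that the class covers that family's basic forms.

```perl
use strict;
use warnings;
use utf8;
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical stand-in: return the member of $family's row that has the
# same form as the $model syllable.
sub sub_basic_form {
    my ($family, $model) = @_;
    my $offset = ord($model) % 8;                    # 0-based form of the model
    my $base   = ord($family) - ord($family) % 8;    # start of the target row
    return chr($base + $offset);
}

# Replace members of the ፀ family with the matching member of the ጸ family.
(my $word = "ፂም") =~ s/([ፀ-ፆ])/sub_basic_form('ጸ', $1)/eg;
print "$word\n";   # ጺም
```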
A utility function somewhat analogous to sprintf for a sequence of syllables. The first argument is the format where the desired symbol sequence is provided. The second argument is the string to format.
For example, the format string "%1%2%3%4" indicates that the first character of the argument should be in the first form, %1, the second character in the second form, %2, the third character in the third form, %3, and the fourth character in the fourth form, %4. For the argument "አበገደ":
print formatForms ( "%1%2%3%4", "አበገደ" ), "\n";
the output would be "አቡጊዳ".
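A rough sketch of how a format of this kind can be applied position by position follows. It is not the package's implementation: format_basic_forms is a hypothetical name, only %1 through %7 are understood, and characters without a usable form number are passed through unchanged.

```perl
use strict;
use warnings;
use utf8;
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical stand-in: apply the n-th form number in $format to the
# n-th character of $string (basic forms 1..7 only).
sub format_basic_forms {
    my ($format, $string) = @_;
    my @forms = $format =~ /%(\d+)/g;
    my @chars = split //, $string;
    my $out   = '';
    for my $i (0 .. $#chars) {
        my $form = $forms[$i];
        if (defined $form && $form >= 1 && $form <= 7) {
            my $cp = ord $chars[$i];
            $out .= chr($cp - $cp % 8 + $form - 1);
        }
        else {
            $out .= $chars[$i];   # no usable form number: keep as is
        }
    }
    return $out;
}

print format_basic_forms("%1%2%3%4", "አበገደ"), "\n";   # አቡጊዳ
```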
Alias strings are also exported by the Regexp::Ethiopic package (as well as its subclasses); they are assigned the values 1-12 corresponding to their form. Specify ":forms" as an import option to bring the strings into the local namespace. The names and assignments are given by:
($ግዕዝ, $ካዕብ, $ሣልስ, $ራብዕ, $ኃምስ, $ሳድስ, $ሳብዕ,
$ዘመደ_ግዕዝ, $ዘመደ_ካዕብ, $ዘመደ_ሣልስ, $ዘመደ_ራብዕ, $ዘመደ_ኃምስ) = (1 .. 12);
Example use:
if ( getForm ( $x ) == $ካዕብ ) {
:
:
}
The overloading of Perl's regular expressions mechanism is the preferred usage for the Regexp::Ethiopic package. However, the overloading mechanism only applies to the constant part of the RE. The following would not be handled by the Regexp::Ethiopic package as expected:
use Regexp::Ethiopic 'overload';
my $x = "ከ";
:
:
if ( /[#$x#]/ ) {  # $x is a variable, /[#ከ#]/ is constant
:
:
}
The above expression is not identical to /[#ከ#]/ because 'ከ' is constant whereas $x is a variable. The package never gets to see the variable $x to then perform the RE expansion. The workaround is to use the package as per:
use Regexp::Ethiopic 'overload';
my $x = "ከ";
:
:
my $re = Regexp::Ethiopic::getRe ( "[#$x#]" );
if ( /$re/ ) {
:
:
}
This works as expected at the cost of one extra step. The overloading and functional modes of the Regexp::Ethiopic package may be used together without conflict.
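To see why a run-time step helps, consider a minimal sketch that expands a family class from a string assembled at run time, after the variable has been interpolated. This is not the package's getRe. The helper expand_family_classes is hypothetical, it handles only a single Ethiopic letter between the #s, and it expands to the seven basic forms only.

```perl
use strict;
use warnings;
use utf8;

# Build an ordinary bracketed list from the family of one syllable.
sub family_class {
    my ($char) = @_;
    my $base = ord($char) - ord($char) % 8;
    return '[' . join('', map { chr($base + $_) } 0 .. 6) . ']';
}

# Hypothetical stand-in: rewrite each [#X#] in a pattern string into a
# bracketed list, then compile the result.
sub expand_family_classes {
    my ($pattern) = @_;
    $pattern =~ s/\[#(\p{Ethiopic})#\]/family_class($1)/ge;
    return qr/$pattern/;
}

my $x  = "ከ";
my $re = expand_family_classes("[#$x#]");   # $x is already interpolated here
print "matched\n" if "ኪ" =~ /$re/;
```

Because the expansion runs on the finished string, it sees the value of $x, which a compile-time handler for RE constants cannot.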
The initial philosophy applied to syllabic character class development was to stick with existing POSIX definitions and notation ([=x=], [:x:], etc.) and simply apply them in the context of a syllabary. Shoe-horning syllabic classes into POSIX norms has proven at times to be both awkward and confusing. After a lengthy experimentation period a clean break was made from the POSIX class symbols, and the class symbols now applied appear to be intuitive and easy to type.
In large part, what has complicated working with Ethiopic character classes is the mismatch between the greater number of Ethiopic classes and the available (if somewhat applicable) POSIX abstractions. There are four types of character equivalence that are of interest in Ethiopic regular expressions:
The syllable x is:
The choice of # has been made at this time for no other reason than that the symbol itself looks like the grid that the syllables are invariably presented in. The character between the #s is interpreted according to its context as either a letter or a numeral. This notation has been stable for some time and has proven to be a good mnemonic device.