RegexKitLite

Lightweight Objective-C Regular Expressions for Mac OS X using the ICU Library

Introduction to RegexKitLite

This document introduces RegexKitLite for Mac OS X. RegexKitLite enables easy access to regular expressions by providing a number of additions to the standard Foundation NSString class. RegexKitLite acts as a bridge between the NSString class and the regular expression engine in the International Components for Unicode, or ICU, dynamic shared library that is shipped with Mac OS X.

Highlights

Documentation Overview

RegexKitLite Guide

While RegexKitLite is not a descendant of the RegexKit.framework source code, it does provide a small subset of RegexKit's NSString methods for performing various regular expression tasks. These include determining the range that a regular expression matches within a string, and easily creating a new string from a match.

RegexKitLite uses the regular expression engine provided by the ICU library that ships with Mac OS X. All that is required are the two files, RegexKitLite.h and RegexKitLite.m, and linking against the /usr/lib/libicucore.dylib ICU shared library. Adding RegexKitLite to your project only adds a few kilobytes of overhead to your application's size and typically only requires a few kilobytes of memory at runtime. Since a regular expression must first be compiled by the ICU library before it can be used, RegexKitLite keeps a small pseudo Least Recently Used cache of the compiled regular expressions.


Compiled Regular Expression Cache

The NSString that contains the regular expression must be compiled into an ICU URegularExpression. This can be an expensive, time-consuming step, but the compiled regular expression can be reused in another search, even if the strings to be searched are different. Therefore RegexKitLite keeps a small cache of recently compiled regular expressions.

This cache is a simple hash table, the size of which can be tuned with the pre-processor define RKL_CACHE_SIZE. The default cache size, which should always be a prime number, is set to 23. The NSString regexString is mapped to a cache slot using modular arithmetic: Cache slot ≡ [regexString hash] mod RKL_CACHE_SIZE, i.e. cacheSlot = [regexString hash] % 23;. Since RegexKitLite uses Core Foundation, this is actually coded as cacheSlot = CFHash(regexString) % RKL_CACHE_SIZE;.

If the cache slot currently contains a compiled URegularExpression, checks are made to ensure that the current regexString is identical to the regular expression used to create the compiled URegularExpression. If they are a match, the cached compiled regular expression is used. If they are not a match, the current compiled regular expression for the selected cache slot is ejected and all of its resources are freed. Then the regexString that caused the ejection is compiled and fills the cache slot. Only one compiled regular expression can reside in a cache slot at a time.
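The lookup described above can be modeled in a few lines of plain C. This is a minimal sketch, not RegexKitLite's actual code: CacheSlot, hash_string, lookup, and compile_count are hypothetical names, the hash is an illustrative djb2 function standing in for CFHash(), and "compiling" is reduced to a counter because no ICU calls are made.

```c
#include <stdlib.h>
#include <string.h>

#define CACHE_SIZE 23u  /* stands in for RKL_CACHE_SIZE (default 23) */

/* Hypothetical cache slot. The real slot would hold an ICU
   URegularExpression; here "compiled" is only a placeholder. */
typedef struct {
  char *pattern;   /* private copy of the regex string, NULL when empty */
  int   compiled;  /* placeholder for the compiled regular expression   */
} CacheSlot;

static CacheSlot cache[CACHE_SIZE];
static int compile_count = 0;  /* how many times a "compile" happened */

/* Illustrative string hash (djb2); an assumption, not CFHash(). */
static unsigned long hash_string(const char *s) {
  unsigned long h = 5381;
  while (*s) { h = h * 33u + (unsigned char)*s++; }
  return h;
}

/* Returns the slot for pattern. The slot is reused on an exact match;
   otherwise the previous occupant is ejected and pattern is "compiled".
   Returns NULL only if memory for the private copy cannot be allocated. */
static CacheSlot *lookup(const char *pattern) {
  CacheSlot *slot = &cache[hash_string(pattern) % CACHE_SIZE];
  if (slot->pattern != NULL && strcmp(slot->pattern, pattern) == 0) {
    return slot;  /* cache hit: same regex string, nothing to compile */
  }
  char *copy = malloc(strlen(pattern) + 1);
  if (copy == NULL) { return NULL; }
  strcpy(copy, pattern);
  free(slot->pattern);              /* eject the old occupant, if any */
  slot->pattern  = copy;
  slot->compiled = ++compile_count; /* stands in for compiling the regex */
  return slot;
}
```

Calling lookup twice with the same pattern compiles once. A different pattern that happens to hash to the same slot ejects the first one, which is why only one compiled regular expression lives in a slot at a time.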

Regular Expressions in Mutable Strings

When a regular expression is compiled, an immutable copy of the string is kept. For immutable NSString objects, the copy is usually the same object with its reference count increased by one. Only NSMutableString objects will cause a new, immutable NSString to be created.

If the regular expression being used is stored in a NSMutableString, the cached regular expression will continue to be used as long as the NSMutableString remains unchanged. Once mutated, the changed NSMutableString will no longer be a match for the cached compiled regular expression that was being used by it previously. Even if the newly mutated string's hash is congruent to the previous unmutated string's hash modulo RKL_CACHE_SIZE, that is to say they share the same cache slot (i.e., ([mutatedString hash] % RKL_CACHE_SIZE) == ([unmutatedString hash] % RKL_CACHE_SIZE)), the immutable copy of the regular expression string used to create the compiled regular expression is used to ensure true equality. The newly mutated string will have to go through the whole cache slot entry creation process and be compiled into a URegularExpression.

This means that NSMutableString objects can be safely used as regular expressions, and any mutations to those objects will immediately be detected and reflected in the regular expression used for matching.

Searching Mutable Strings

Unfortunately, the ICU regular expression API requires that the compiled regular expression be "set" to the string to be searched. To search a different string, the compiled regular expression must be "set" to the new string. Therefore, RegexKitLite tracks the last NSString that each compiled regular expression was set to, recording the pointer to the NSString object, its hash, and its length. If any of these parameters are different from the last parameters used for a compiled regular expression, the compiled regular expression is "set" to the new string. Since mutating a string will likely change its hash value, it's generally safe to search NSMutableString objects, and in most cases the mutation will reset the compiled regular expression to the updated contents of the NSMutableString.

Caution:

Care must be taken when mutable strings are searched and there exists the possibility that the string has mutated between searches. See NSString RegexKitLite Additions Reference - Cached Information and Mutable Strings for more information.
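As a rough illustration of the bookkeeping described in this section, the sketch below records the three tracked values and decides whether the compiled regular expression has to be "set" again. The names LastSetText, needs_reset, and remember are hypothetical, and the hash values are supplied by the caller rather than computed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of the last string a compiled regex was "set" to. */
typedef struct {
  const void   *string;  /* pointer to the string object */
  unsigned long hash;    /* its hash at the time         */
  size_t        length;  /* its length at the time       */
} LastSetText;

/* True if any tracked parameter differs, meaning the compiled regular
   expression must be "set" to the string again before searching. */
static bool needs_reset(const LastSetText *last, const void *string,
                        unsigned long hash, size_t length) {
  return last->string != string || last->hash != hash || last->length != length;
}

/* Record the parameters of the string that was just "set". */
static void remember(LastSetText *last, const void *string,
                     unsigned long hash, size_t length) {
  last->string = string;
  last->hash   = hash;
  last->length = length;
}
```

The check also shows where the caution comes from: a string mutated in place keeps its pointer, so if its length is unchanged and its new hash happens to equal the old one, needs_reset reports false and stale contents would be searched.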

Last Match Information

When performing a match, the arguments used to perform the match are kept. If those same arguments are used again, the actual matching operation is skipped because the compiled regular expression already contains the results for the given arguments. This is mostly useful when a regular expression contains multiple capture groups, and the results for different capture groups for the same match are needed. This means that there is only a small penalty for iterating over all the capture groups in a regular expression for a match, and essentially becomes the direct ICU regular expression API equivalent of uregex_start() and uregex_end().
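The idea can be sketched in plain C with a toy "engine" in place of ICU. Everything here is an assumption made for illustration: LastMatch, run_engine, and range_of_capture are hypothetical names, and the toy engine simply finds the first three runs of alphanumeric characters, loosely imitating a pattern with three capture groups. Capture numbers are assumed to be in the range 0 to 3.

```c
#include <ctype.h>
#include <stddef.h>

#define NOT_FOUND ((size_t)-1)

typedef struct { size_t location, length; } Range;

/* Hypothetical record of the last match: the arguments and the results. */
typedef struct {
  const char *text;
  size_t      start, length;
  Range       captures[4];  /* 0 = whole match, 1..3 = the three words */
  int         valid;
} LastMatch;

static int engine_runs = 0;  /* counts how often real matching work is done */

/* Toy stand-in for the regex engine: captures the first three runs of
   alphanumeric characters inside text[start, start + length). */
static void run_engine(LastMatch *m, const char *text, size_t start, size_t length) {
  size_t i = start, end = start + length;
  int found = 0;
  Range words[3];
  engine_runs++;
  for (int c = 0; c < 4; c++) { m->captures[c].location = NOT_FOUND; m->captures[c].length = 0; }
  while (i < end && found < 3) {
    while (i < end && !isalnum((unsigned char)text[i])) { i++; }
    size_t wordStart = i;
    while (i < end && isalnum((unsigned char)text[i])) { i++; }
    if (i > wordStart) { words[found].location = wordStart; words[found].length = i - wordStart; found++; }
  }
  if (found == 3) {
    m->captures[0].location = words[0].location;
    m->captures[0].length   = words[2].location + words[2].length - words[0].location;
    for (int c = 1; c < 4; c++) { m->captures[c] = words[c - 1]; }
  }
  m->text = text; m->start = start; m->length = length; m->valid = 1;
}

/* Runs the engine only when the arguments differ from the last match;
   otherwise the stored capture range is returned directly. */
static Range range_of_capture(LastMatch *m, const char *text, size_t start,
                              size_t length, int capture) {
  if (!m->valid || m->text != text || m->start != start || m->length != length) {
    run_engine(m, text, start, length);
  }
  return m->captures[capture];
}
```

Asking for captures 0 through 3 of the same string and range runs the toy engine once; only a change in the arguments triggers another run.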


UTF-16 Conversion Cache

RegexKitLite is ideal when the string being matched is a non-ASCII, Unicode string. This is because the regular expression engine used, ICU, can only operate on UTF-16 encoded strings. Since Cocoa keeps essentially all non-ASCII strings encoded in UTF-16 form internally, this means that RegexKitLite can operate directly on the string's buffer without having to make a temporary copy and transcode the string into ICU's required format.

As in all object-oriented programming, the internal representation of an object's information is private. However, the ICU regular expression engine requires that the text to be searched be encoded as a UTF-16 string. For pragmatic purposes, Core Foundation has several public functions that can provide direct access to the buffer used to hold the contents of the string, but such direct access is only available if the private buffer is already encoded in the requested direct access format. As a rough rule of thumb, simple 8-bit strings, such as ASCII, are kept in their 8-bit format, which is essentially UTF-8. Strings that are not simple 8-bit strings are stored as UTF-16. Of course, this is a private implementation detail, so the precise behavior should never be relied upon. It is mentioned because of the tremendous impact it can have on matching performance and efficiency.

For strings in which direct access to the UTF-16 string is available, RegexKitLite uses that buffer. This is the ideal case as no extra work needs to be performed, such as converting the string into a UTF-16 string, and allocating memory to hold the temporary conversion. Of course, direct access is not always available, and occasionally the string to be searched will need to be converted into a UTF-16 string.

RegexKitLite has two conversion buffer caches. Each buffer can only hold the contents of a single NSString at a time. If the selected buffer does not contain the contents of the NSString that is currently being searched, the previous occupant is ejected from the buffer and the current NSString takes its place. The first conversion buffer is fixed in size and set by the C pre-processor define RKL_FIXED_LENGTH, which defaults to 2048. Any string whose length is less than RKL_FIXED_LENGTH will use the fixed size conversion buffer. Strings whose length is longer than RKL_FIXED_LENGTH will use the second, dynamically sized conversion buffer. The memory allocation for the dynamically sized conversion buffer is resized for each conversion with realloc() to the size needed to hold the entire contents of the UTF-16 converted string.

This strategy was chosen for its relative simplicity. Keeping track of dynamically created resources is required to prevent memory leaks. As designed, there is only a single pointer to dynamically allocated memory: the pointer to hold the conversion contents of strings whose length is larger than RKL_FIXED_LENGTH. However, since realloc() is used to manage that memory allocation, it becomes very difficult to accidentally leak the buffer. Having the fixed sized buffer means that the memory allocation system isn't bothered with many small requests, most of which are transient in nature to begin with. The current strategy tries to strike the best balance between performance and simplicity.
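A minimal C sketch of this two-buffer strategy follows. The names FIXED_LENGTH, fixed_buffer, dynamic_buffer, and conversion_buffer are hypothetical, and sending a string of exactly FIXED_LENGTH code units to the dynamic buffer is an assumption, since the text only describes the shorter and longer cases.

```c
#include <stdint.h>
#include <stdlib.h>

#define FIXED_LENGTH 2048u  /* stands in for RKL_FIXED_LENGTH */

static uint16_t  fixed_buffer[FIXED_LENGTH];  /* never allocated or freed     */
static uint16_t *dynamic_buffer = NULL;       /* the single dynamic allocation */

/* Returns a buffer that can hold length UTF-16 code units,
   or NULL if the dynamic buffer could not be resized. */
static uint16_t *conversion_buffer(size_t length) {
  if (length < FIXED_LENGTH) {
    return fixed_buffer;  /* short strings never touch the allocator */
  }
  /* realloc(NULL, n) behaves like malloc(n), so the first use and every
     later resize go through the same call and the same single pointer. */
  uint16_t *resized = realloc(dynamic_buffer, length * sizeof(uint16_t));
  if (resized == NULL) {
    return NULL;          /* the old block is still owned by dynamic_buffer */
  }
  dynamic_buffer = resized;
  return dynamic_buffer;
}
```

Because there is only one pointer and it is only ever passed back to realloc(), there is no second code path on which the allocation could be forgotten.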

Mutable Strings

When a string is converted into a UTF-16 string, the hash of the NSString is recorded, along with the pointer to the NSString object and the string's length. In order for RegexKitLite to use the cached conversion, all of these parameters must be equal to the corresponding values of the NSString to be searched. If there is any difference, the cached conversion is discarded and the current NSString, or NSMutableString as the case may be, is reconverted into a UTF-16 string.

Caution:

Care must be taken when mutable strings are searched and there exists the possibility that the string has mutated between searches. See NSString RegexKitLite Additions Reference - Cached Information and Mutable Strings for more information.

Multithreading Safety

RegexKitLite is also multithreading safe. Access to the compiled regular expression cache and the conversion cache is protected by a single OSSpinLock to ensure that only one thread has access at a time. The lock remains held while the regular expression match is performed since the compiled regular expression returned by the ICU library is not safe to use from multiple threads. Once the match has completed, the lock is released, and another thread is free to lock the cache and perform a match.

Important:

While it is safe to use the same regular expression from any thread at any time, the usual multithreading caveats apply. For example, it is not safe to mutate a NSMutableString in one thread while performing a match in another.
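The locking pattern can be sketched with a C11 atomic_flag used as a spin lock. This is an illustration only: RegexKitLite uses OSSpinLock, while lock_cache, unlock_cache, perform_match, and matches_performed are hypothetical names, and the counter stands in for the cache lookup and the match itself.

```c
#include <stdatomic.h>

static atomic_flag cache_lock = ATOMIC_FLAG_INIT;  /* stands in for the OSSpinLock */
static int matches_performed = 0;                  /* stands in for the shared caches */

/* Spin until the flag was previously clear, which means we now hold the lock. */
static void lock_cache(void) {
  while (atomic_flag_test_and_set_explicit(&cache_lock, memory_order_acquire)) {
    /* busy-wait: another thread is using the caches */
  }
}

static void unlock_cache(void) {
  atomic_flag_clear_explicit(&cache_lock, memory_order_release);
}

/* The lock is held for the whole operation, cache lookup and match alike,
   because the compiled regular expression must not be shared mid-match. */
static int perform_match(void) {
  lock_cache();
  int result = ++matches_performed;  /* cache lookup and matching would go here */
  unlock_cache();
  return result;
}
```

Holding one lock across the entire match keeps the design simple at the cost of serializing all matches, which is the trade-off the text describes.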

Using RegexKitLite

The goal of RegexKitLite is not to be a comprehensive Objective-C regular expression framework, but to provide a set of easy to use primitives from which additional functionality can be created. To this end, RegexKitLite provides the following two core primitives from which everything else is built:

There are no additional classes that supply the regular expression matching functionality; everything is accomplished with the two methods above. These methods are added to the existing NSString class via an Objective-C category extension. See NSString RegexKitLite Additions Reference for a complete list of methods.

The real workhorse is the rangeOfRegex:options:inRange:capture:error: method. The receiver of the message is an ordinary NSString class member that you wish to perform a regular expression match on. The parameters of the method are a NSString containing the regular expression regexString, any RKLRegexOptions match options, the NSRange range of the receiver that is to be searched, the capture number from the regular expression regexString that you would like the result for, and an optional error parameter that, if a problem occurs, will contain a NSError object with the details of the error.

Important:

The C language assigns special meaning to the \ character when inside a quoted " " string in your source code. The \ character is the escape character, and the character that follows has a different meaning than normal. The most common example of this is \n, which translates into the new-line character. Because of this, you are required to 'escape' any use of \ by prefixing it with another \. In practical terms this means doubling every \ that appears in a regular expression inside a quoted " " string in your source code, which unfortunately is quite common. Failure to do so will result in numerous warnings from the compiler about unknown escape sequences.
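The effect of the doubling can be checked with a plain C string, which follows the same escape rules as the contents of an Objective-C @" " literal. In this small sketch, word_regex and count_backslashes are hypothetical names used only for illustration.

```c
#include <stddef.h>

/* Written with doubled backslashes in source code; at run time the string
   holds the 13 characters (\w+)\s+(\w+), each with a single backslash. */
static const char *const word_regex = "(\\w+)\\s+(\\w+)";

/* Counts the backslash characters actually stored in the string. */
static size_t count_backslashes(const char *s) {
  size_t n = 0;
  for (; *s != '\0'; s++) {
    if (*s == '\\') { n++; }
  }
  return n;
}
```

Here count_backslashes(word_regex) is 3, not 6: the compiler consumes one backslash of each pair, and the regular expression engine receives \w, \s, and \w as intended.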

A simple example:

NSString *searchString = @"This is neat.";
NSString *regexString  = @"(\\w+)\\s+(\\w+)\\s+(\\w+)";
NSRange   searchRange  = NSMakeRange(0, [searchString length]); // Search the entire string.
NSRange   matchedRange = NSMakeRange(NSNotFound, 0);
NSError  *error        = NULL;

matchedRange = [searchString rangeOfRegex:regexString options:RKLNoOptions inRange:searchRange capture:2 error:&error];

NSLog(@"matchedRange : %@", NSStringFromRange(matchedRange));
// 2008-03-18 03:51:16.530 test[51583:813] matchedRange : {5, 2}
Continues…

In the previous example, the NSRange that capture number 2 matched is {5, 2}, which corresponds to the word is in searchString. Once the NSRange is known, you can create a new string containing just the matching text:

…example
NSString *matchedString = [searchString substringWithRange:matchedRange];

NSLog(@"matchedString: '%@'", matchedString);
// 2008-03-18 03:51:16.532 test[51583:813] matchedString: 'is'

Creating A Match Enumerator

As a practical example of how to use the simple primitives provided by RegexKitLite, consider the common need of having to enumerate all the matches of a regular expression in a target string. The following example creates a simple NSEnumerator based enumerator for all the matches of a regular expression in a target string, returning a NSString of the text matched by the regular expression (capture 0) for each call to nextObject until the end of the string is reached. Each match begins searching where the last match ended.

The match enumerator is divided into two parts. The public part is defined in the header RKLMatchEnumerator.h, below. The second part is a private subclass of NSEnumerator whose interface resides only in the file RKLMatchEnumerator.m. Match enumerators are instantiated by sending a NSString class member the message matchEnumeratorWithRegex:. A NSString with the regular expression is passed as the only argument, and a NSEnumerator is returned.

File name: RKLMatchEnumerator.h
#import <Foundation/NSEnumerator.h>
#import <Foundation/NSString.h>
#import <stddef.h>

@interface NSString (RegexKitLiteEnumeratorAdditions)
- (NSEnumerator *)matchEnumeratorWithRegex:(NSString *)regexString;
@end

Next, in RKLMatchEnumerator.m, we define our private sub-class of NSEnumerator. In it we declare three instance variables, string, regex, and location. The string ivar holds the string to search, while regex holds the regular expression string. To guard against mutations to either, an immutable copy is made. The location ivar is used to keep track of the current location from which to begin matching. Finally, we declare our designated initializer which initializes the instantiated RKLMatchEnumerator object with the string to search and the regular expression to use.

File name: RKLMatchEnumerator.m
#import <Foundation/NSArray.h>
#import <Foundation/NSRange.h>
#import "RegexKitLite.h"
#import "RKLMatchEnumerator.h"

@interface RKLMatchEnumerator : NSEnumerator {
  NSString   *string;
  NSString   *regex;
  NSUInteger  location;
}

- (id)initWithString:(NSString *)initString regex:(NSString *)initRegex;

@end
Continues…

The following begins the implementation section of RKLMatchEnumerator and a fairly standard initialization method, initWithString:regex:.

RKLMatchEnumerator.m
@implementation RKLMatchEnumerator

- (id)initWithString:(NSString *)initString regex:(NSString *)initRegex
{
  if((self = [self init]) == NULL) { return(NULL); }
  string = [initString copy];
  regex  = [initRegex copy];
  return(self);
}
Continues…

The following implements the heart of any NSEnumerator, the nextObject method. If all of the matches have been enumerated, location will be set to NSNotFound, and the body of the if statement won't be evaluated and NULL will be returned.

If there are still matches to be found, searchRange is created to begin at the value of the location ivar, with the NSRange length set to the remaining length of the string to be searched, or [string length] - location.

Then, the match is performed using the RegexKitLite method rangeOfRegex:inRange: and the result stored in the variable matchedRange.

Next, the location ivar is updated to point to the location at the end of the matchedRange. Since it is possible to have a match with a length of zero, it must handle that special case by adding one, otherwise it will loop endlessly, always matching the same location of zero length. If there was no match, matchedRange.location will be NSNotFound and matchedRange.length will be 0, and the location ivar will be set to NSNotFound.

If the matched range location is not NSNotFound, then a substring of the matched range will be returned. Otherwise, we will exit the if body and return NULL, indicating that the NSEnumerator has no more matches to enumerate.

RKLMatchEnumerator.m
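- (id)nextObject
{
  if(location != NSNotFound) {
    NSRange searchRange  = NSMakeRange(location, [string length] - location);
    NSRange matchedRange = [string rangeOfRegex:regex inRange:searchRange];

    if(matchedRange.location == NSNotFound) {
      // No match. Set location explicitly: NSMaxRange() plus the zero-length
      // adjustment would otherwise yield NSNotFound + 1, not NSNotFound.
      location = NSNotFound;
    } else {
      // Resume after this match, stepping past a zero-length match.
      location = NSMaxRange(matchedRange) + ((matchedRange.length == 0) ? 1 : 0);
      // A zero-length match at the very end would push location past the string.
      if(location > [string length]) { location = NSNotFound; }
      return([string substringWithRange:matchedRange]);
    }
  }
  return(NULL);
}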
Continues…

A standard dealloc, releasing the string and regex ivar objects created during initialization.

RKLMatchEnumerator.m
- (void) dealloc
{
  [string release];
  [regex release];
  [super dealloc];
}

@end
Continues…

And finally, the NSString category addition that returns our match enumerator. This simply creates an instance of our private NSEnumerator sub-class RKLMatchEnumerator, initializes it with the string to match, self, using the regular expression regexString, then sends the instantiated object autorelease, which is finally returned. Since this is a NSString category addition, this message will be sent to an instance of an object that is a member of the NSString class, which includes any objects whose super class is ultimately NSString. Therefore, the string to match is the instance receiving the message, self.

RKLMatchEnumerator.m
@implementation NSString (RegexKitLiteEnumeratorAdditions)

- (NSEnumerator *)matchEnumeratorWithRegex:(NSString *)regexString
{
  return([[[RKLMatchEnumerator alloc] initWithString:self regex:regexString] autorelease]);
}

@end

The following piece of code is a simple demonstration of the match enumerator which will use a regular expression to enumerate all the lines in the string to be searched.

The variable searchString contains the string to search. The example string includes several embedded \n, or new-line characters. There are a total of four lines of text, with the third line containing no characters.

The variable regexString contains the regular expression to be used for matching. This regular expression begins with the sequence (?m) which is used to enable the RKLMultiline regular expression option from the text of the regular expression itself. This enables the metacharacters ^ and $ to match the start of and end of a line, respectively. The remaining characters .* will match any character '.' zero or more times '*'. The prose translation would be:

Enable the RKLMultiline option and match all of the characters from the beginning of a line until the end of a line.

The match enumerator is then instantiated and the results are enumerated with a standard while loop, setting matchedString to the object returned by nextObject. For each line that is returned, the current line number, length of the matched string, and the matched string are printed.

File name: main.m
#import <Foundation/NSAutoreleasePool.h>
#import "RegexKitLite.h"
#import "RKLMatchEnumerator.h"

int main(int argc, char *argv[]) {
  NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

  NSString     *searchString    = @"one\ntwo\n\nfour\n";
  NSEnumerator *matchEnumerator = NULL;
  NSString     *regexString     = @"(?m)^.*$";

  NSLog(@"searchString: '%@'", searchString);
  NSLog(@"regexString : '%@'", regexString);

  matchEnumerator = [searchString matchEnumeratorWithRegex:regexString];

  NSUInteger  line          = 0;
  NSString   *matchedString = NULL;

  while((matchedString = [matchEnumerator nextObject]) != NULL) {
    // NSUInteger values are cast to unsigned long so that %lu is correct on 32-bit and 64-bit builds.
    NSLog(@"%lu: %lu '%@'", (unsigned long)++line, (unsigned long)[matchedString length], matchedString);
  }

  [pool release];
  return(0);
}

The following shell transcript demonstrates compiling the example and executing it. Line number three clearly demonstrates that matches of zero length are possible. Without the additional logic in nextObject to handle this special case, the enumerator would never advance past the match.

Note:

In the shell transcript below, the NSLog() line that prints searchString has been annotated with the '⏎' character to help visually identify the corresponding \n new-line characters in searchString.

shell% cd examples
shell% gcc -I.. -g -o main main.m RKLMatchEnumerator.m ../RegexKitLite.m -framework Foundation -licucore
shell% ./main
2008-03-21 15:56:17.469 main[44050:807] searchString: 'one⏎
two⏎
⏎
four⏎
'
2008-03-21 15:56:17.520 main[44050:807] regexString : '(?m)^.*$'
2008-03-21 15:56:17.575 main[44050:807] 1: 3 'one'
2008-03-21 15:56:17.580 main[44050:807] 2: 3 'two'
2008-03-21 15:56:17.584 main[44050:807] 3: 0 ''
2008-03-21 15:56:17.590 main[44050:807] 4: 4 'four'
shell%
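The role of the zero-length adjustment can also be seen in a plain C model of the enumerator. This is a sketch under stated assumptions: next_line is a toy stand-in for matching (?m)^.*$ that treats only '\n' as a line terminator, and Range, LineEnumerator, and next_match are hypothetical names rather than part of RegexKitLite.

```c
#include <stddef.h>

#define NOT_FOUND ((size_t)-1)

typedef struct { size_t location, length; } Range;

/* Toy stand-in for matching (?m)^.*$ at or after position from: finds the
   next start of a line and extends the match up to, not including, '\n'. */
static Range next_line(const char *text, size_t textLength, size_t from) {
  Range r = { NOT_FOUND, 0 };
  for (size_t p = from; p < textLength; p++) {
    if (p == 0 || text[p - 1] == '\n') {
      size_t end = p;
      while (end < textLength && text[end] != '\n') { end++; }
      r.location = p;
      r.length   = end - p;
      return r;
    }
  }
  return r;
}

typedef struct { const char *text; size_t textLength; size_t location; } LineEnumerator;

/* Stores the next match in *out and returns 1, or returns 0 when finished. */
static int next_match(LineEnumerator *e, Range *out) {
  if (e->location == NOT_FOUND) { return 0; }
  Range m = next_line(e->text, e->textLength, e->location);
  if (m.location == NOT_FOUND) { e->location = NOT_FOUND; return 0; }
  /* Resume after the match; add one after a zero-length match, otherwise
     the same empty line would be found again on every call. */
  e->location = m.location + m.length + ((m.length == 0) ? 1 : 0);
  if (e->location > e->textLength) { e->location = NOT_FOUND; }
  *out = m;
  return 1;
}
```

Run over the 14 characters of "one\ntwo\n\nfour\n", the model yields four matches of lengths 3, 3, 0, and 4, mirroring the transcript. Removing the adjustment would leave location at 8 after the empty third line, and every later call would return that same empty match.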

ICU Regular Expression Syntax

For your convenience, the regular expression syntax from the ICU documentation is included below. When in doubt, you should refer to the official ICU User Guide - Regular Expressions documentation page.

Metacharacters
Character                   Description
\a                          Match a BELL, \u0007.
\A                          Match at the beginning of the input. Differs from ^ in that \A will not match after a new-line within the input.
\b, outside of a [Set]      Match if the current position is a word boundary. Boundaries occur at the transitions between word \w and non-word \W characters, with combining marks ignored. See also: RKLUnicodeWordBoundaries
\b, within a [Set]          Match a BACKSPACE, \u0008.
\B                          Match if the current position is not a word boundary.
\cx                         Match a Control-x character.
\d                          Match any character with the Unicode General Category of Nd (Number, Decimal Digit).
\D                          Match any character that is not a decimal digit.
\e                          Match an ESCAPE, \u001B.
\E                          Terminates a \Q…\E quoted sequence.
\f                          Match a FORM FEED, \u000C.
\G                          Match if the current position is at the end of the previous match.
\n                          Match a LINE FEED, \u000A.
\N{Unicode Character Name}  Match the named Unicode Character.
\p{Unicode Property Name}   Match any character with the specified Unicode Property.
\P{Unicode Property Name}   Match any character not having the specified Unicode Property.
\Q                          Quotes all following characters until \E.
\r                          Match a CARRIAGE RETURN, \u000D.
\s                          Match a white space character. White space is defined as [\t\n\f\r\p{Z}].
\S                          Match a non-white space character.
\t                          Match a HORIZONTAL TABULATION, \u0009.
\uhhhh                      Match the character with the hex value hhhh.
\Uhhhhhhhh                  Match the character with the hex value hhhhhhhh. Exactly eight hex digits must be provided, even though the largest Unicode code point is \U0010ffff.
\w                          Match a word character. Word characters are [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}].
\W                          Match a non-word character.
\x{hhhh}                    Match the character with hex value hhhh. From one to six hex digits may be supplied.
\xhh                        Match the character with two digit hex value hh.
\X                          Match a Grapheme Cluster.
\Z                          Match if the current position is at the end of input, but before the final line terminator, if one exists.
\z                          Match if the current position is at the end of input.
\n (n is a number)          Back Reference. Match whatever the nth capturing group matched. n must be a number ≥ 1 and ≤ total number of capture groups in the pattern. Note: Octal escapes, such as \012, are not supported.
[pattern]                   Match any one character from the set. See UnicodeSet for a full description of what may appear in the pattern.
.                           Match any character.
^                           Match at the beginning of a line.
$                           Match at the end of a line.
\                           Quotes the following character. Characters that must be quoted to be treated as literals are * ? + [ ( ) { } ^ $ | \ . /
Operators
Operator          Description
|                 Alternation. A|B matches either A or B.
*                 Match zero or more times. Match as many times as possible.
+                 Match one or more times. Match as many times as possible.
?                 Match zero or one times. Prefer one.
{n}               Match exactly n times.
{n,}              Match at least n times. Match as many times as possible.
{n,m}             Match between n and m times. Match as many times as possible, but not more than m.
*?                Match zero or more times. Match as few times as possible.
+?                Match one or more times. Match as few times as possible.
??                Match zero or one times. Prefer zero.
{n}?              Match exactly n times.
{n,}?             Match at least n times, but no more than required for an overall pattern match.
{n,m}?            Match between n and m times. Match as few times as possible, but not less than n.
*+                Match zero or more times. Match as many times as possible when first encountered, do not retry with fewer even if overall match fails. Possessive match.
++                Match one or more times. Possessive match.
?+                Match zero or one times. Possessive match.
{n}+              Match exactly n times. Possessive match.
{n,}+             Match at least n times. Possessive match.
{n,m}+            Match between n and m times. Possessive match.
()                Capturing parentheses. Range of input that matched the parenthesized subexpression is available after the match.
(?:)              Non-capturing parentheses. Groups the included pattern, but does not provide capturing of matching text. Somewhat more efficient than capturing parentheses.
(?>)              Atomic-match parentheses. First match of the parenthesized subexpression is the only one tried; if it does not lead to an overall pattern match, back up the search for a match to a position before the (?> .
(?#)              Free-format comment (?#comment).
(?=)              Look-ahead assertion. True if the parenthesized pattern matches at the current input position, but does not advance the input position.
(?!)              Negative look-ahead assertion. True if the parenthesized pattern does not match at the current input position. Does not advance the input position.
(?<=)             Look-behind assertion. True if the parenthesized pattern matches text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no * or + operators).
(?<!)             Negative Look-behind assertion. True if the parenthesized pattern does not match text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no * or + operators).
(?ismwx-ismwx:)   Flag settings. Evaluate the parenthesized expression with the specified flags enabled or -disabled.
(?ismwx-ismwx)    Flag settings. Change the flag settings. Changes apply to the portion of the pattern following the setting. For example, (?i) changes to a case insensitive match. See also: Regular Expression Options

Adding RegexKitLite to your Project

Note:

The following outlines a typical set of steps that one would perform. This is not the only way, nor the required way to add RegexKitLite to your application. They may not be correct for your project as each project is unique. They are an overview for those unfamiliar with adding additional shared libraries to the list of libraries your application links against.

Outline of Required Steps

The following outlines the steps required to use RegexKitLite in your project.


Adding RegexKitLite using Xcode

Important:
These instructions apply to Xcode versions 2.4.1 and 3.0. Other versions should be similar, but may vary for specific details.
  1. First, add the ICU dynamic shared library to your Xcode project. You may choose to add the library to any group in your project, and which groups are created by default is dependent on the template type you chose when you created your project. For a typical Cocoa application project, a good choice is the Frameworks group. To add the ICU dynamic shared library, control/right-click on the Frameworks group and choose Add > Existing Files…

  2. Next, you will need to choose the ICU dynamic shared library file to add. Exactly which file to choose depends on your project, but a fairly safe choice is to select /Developer/SDKs/MacOSX10.5.sdk/usr/lib/libicucore.dylib. You may have installed your developer tools in a different location than the default /Developer directory, and the Mac OS X SDK version should be the one your project is targeting, typically the latest one available.

  3. Then, in the dialog that follows, make sure that Copy items into… is unselected. Select the targets you will be using RegexKitLite in and then click Add to add the ICU dynamic shared library to your project.

  4. Once the ICU dynamic shared library is added to your project, you will need to add it to the libraries that your executable is linked with. To do so, expand the Targets group, and then expand the executable targets you will be using RegexKitLite in. You will then need to select the libicucore.dylib file that you added in the previous step and drag it into the Link Binary With Libraries group for each executable target that you will be using RegexKitLite in. The order of the files within the Link Binary With Libraries group is not important, and for a typical Cocoa application the group will contain the Cocoa.framework file.

  5. Next, add the RegexKitLite source files to your Xcode project. In the Groups & Files outline view on the left, control/right-click on the group that you would like to add the files to, then select Add > Existing Files…

    Note:

    You can perform the following steps once for each file (RegexKitLite.h and RegexKitLite.m), or once by selecting both files from the file dialog.

  6. Select the RegexKitLite.h and / or RegexKitLite.m file from the file chooser dialog.

  7. The next dialog will present you with several options. If you have not already copied the RegexKitLite files into your project's directory, you may want to click on the Copy items into… option. Select the targets that you would like to add the RegexKitLite functionality to.

  8. Finally, you will need to include the RegexKitLite.h header file. The best way to do this is very dependent on your project. If your project consists of only half a dozen source files, you can add:

    #import "RegexKitLite.h"

    manually to each source file that makes use of RegexKitLite's features. If your project has grown beyond this, you've probably already organized a common "master" header that collects the headers required by nearly all source files.

Adding RegexKitLite using the Shell

Using RegexKitLite from the shell is also easy. Again, you need to add the header #import to the appropriate source files. Then, to link to the ICU library, you typically only need to add -licucore, just as you would any other library. Consider the following example:

File name: link_example.m
#import <Foundation/NSObjCRuntime.h>
#import <Foundation/NSAutoreleasePool.h>
#import "RegexKitLite.h"

int main(int argc, char *argv[]) {
  NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

  // Copyright COPYRIGHT_SIGN APPROXIMATELY_EQUAL_TO 2008
  // Copyright \u00a9 \u2245 2008

  char     *utf8CString   = "Copyright \xC2\xA9 \xE2\x89\x85 2008";
  NSString *regexString   = @"Copyright (.*) (\\d+)";

  NSString *subjectString = [NSString stringWithUTF8String:utf8CString];
  NSString *matchedString = [subjectString stringByMatching:regexString capture:1];

  NSLog(@"subject: \"%@\"", subjectString);
  NSLog(@"matched: \"%@\"", matchedString);

  [pool release];
  return(0);
}

Compiled and run from the shell:

shell% cd examples
shell% gcc -g -I.. -o link_example link_example.m ../RegexKitLite.m -framework Foundation -licucore
shell% ./link_example
2008-03-14 03:52:51.187 test[15283:807] subject: "Copyright © ≅ 2008"
2008-03-14 03:52:51.269 test[15283:807] matched: "© ≅"
shell%

NSString RegexKitLite Additions Reference

Extends by category NSString
RegexKitLite 1.2
Declared in
  • RegexKitLite.h

Overview

RegexKitLite is not meant to be a full featured regular expression framework. Because of this, it provides only the basic primitives needed to create additional functionality. It is ideal for developers who:

RegexKitLite consists of only two files, the header file RegexKitLite.h and the implementation file RegexKitLite.m. The only other requirement is to link with the ICU library that comes with Mac OS X. No new classes are created; all functionality is provided as a category extension to the NSString class.

Xcode 3 Integrated Documentation

This documentation is available in the Xcode DocSet format. To add this documentation to Xcode, select Help > Documentation. Then, in the lower left-hand corner of the documentation window, click the gear icon with the drop-down menu indicator, choose New Subscription…, and enter the following URL:

feed://regexkit.sourceforge.net/RegexKitLiteDocSets.atom

Once you have added the URL, a new group should appear, inside which will be the RegexKitLite documentation with a Get button. Click on the Get button and follow the prompts. Xcode will ask you to enter an administrator's password to install the documentation for the first time, which is explained here.

Cached Information and Mutable Strings

While RegexKitLite takes steps to ensure that the information it has cached is valid for the strings it searches, there exists the possibility that out of date cached information may be used when searching mutable strings. For each compiled regular expression, RegexKitLite caches the following information about the last NSString that was searched:

  • The pointer to the NSString object that was searched.
  • The hash value of the string.
  • The length of the string.
  • The pointer to the UTF-16 buffer of the string.

An ICU compiled regular expression must be "set" to the text to be searched. Before a compiled regular expression is used, the pointer to the string object to search, its hash, its length, and the pointer to the UTF-16 buffer are compared with the values that the compiled regular expression was last "set" to. If any of these values are different, the compiled regular expression is reset and "set" to the new string.

If an NSMutableString is mutated between two uses of the same compiled regular expression and its hash, length, or UTF-16 buffer changes between uses, RegexKitLite will automatically reset the compiled regular expression with the new values of the mutated string. The results returned will correctly reflect the mutations that have taken place between searches.

Mutations to a string can go undetected, however. If a mutation leaves the length unchanged, the only way the change can be detected is through a change in the string's hash value. For most mutations the hash value will change, but two different strings can share the same hash; this is known as a hash collision. Should this happen, the results returned by RegexKitLite may not be correct.
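
The comparison described above can be modeled in a few lines of plain C, which also compiles as part of an Objective-C source file. This is only a sketch under stated assumptions, not RegexKitLite's actual code: LastSetInfo, toy_hash, snapshot, and needs_reset are hypothetical names, and the additive hash is chosen deliberately because it collides easily (the hash used for real Foundation strings is different). It illustrates how an in-place mutation that keeps the pointer, buffer, length, and hash unchanged would go unnoticed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of what a compiled regex was last "set" to. */
typedef struct {
    const void     *string;  /* pointer to the string object   */
    uint32_t        hash;    /* hash of the string's contents  */
    size_t          length;  /* length in UTF-16 code units    */
    const uint16_t *buffer;  /* pointer to the UTF-16 buffer   */
} LastSetInfo;

/* Toy additive hash (an assumption for this demo only).
   It ignores character order, so "ab" and "ba" collide. */
static uint32_t toy_hash(const uint16_t *buf, size_t len) {
    uint32_t h = 0;
    for (size_t i = 0; i < len; i++) {
        h += buf[i];
    }
    return h;
}

/* Records the four values for the string that is being searched. */
static LastSetInfo snapshot(const void *string, const uint16_t *buf, size_t len) {
    assert(buf != NULL);
    LastSetInfo info = { string, toy_hash(buf, len), len, buf };
    return info;
}

/* Returns true if any recorded value differs from the current one,
   meaning the compiled regex would have to be reset and "set" again. */
static bool needs_reset(const LastSetInfo *cached, const void *string,
                        const uint16_t *buf, size_t len) {
    return cached->string != string
        || cached->buffer != buf
        || cached->length != len
        || cached->hash   != toy_hash(buf, len);
}
```

With a two-unit buffer holding "ab", changing it in place to "ac" makes needs_reset return true, while changing it in place to "ba" leaves all four values identical under this toy hash, so the stale information would be reused.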

Therefore, if you are using RegexKitLite to search NSMutableString objects, and those strings may have mutated in such a way that RegexKitLite is unable to detect that the string has changed, you must manually clear the internal cache to ensure that the results accurately reflect the mutations. You can clear the cache by calling the following class method:

[NSString clearStringCache];
Warning:

When searching NSMutableString objects that have mutated between searches, failure to clear the cache may result in undefined behavior.

Exceptions Raised

Methods will raise an exception if their arguments are invalid, such as passing NULL for a required parameter. An invalid regular expression or RKLRegexOptions parameter will not raise an exception. Instead, an NSError object with information about the error will be created and returned via the address given with the optional error argument. If information about the problem is not required, error may be NULL. Convenience methods that do not have an error argument behave as if error were set to NULL when invoking the primary method.

Important:
Methods raise NSInvalidArgumentException if regexString is NULL, or if capture < 0 or is not valid for regexString.
Important:
Methods raise NSRangeException if range exceeds the bounds of the receiver.
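
As an illustration of what "exceeds the bounds of the receiver" means, the following plain C sketch checks a range against a string length. It is not RegexKitLite's implementation: DemoRange is a hypothetical stand-in for NSRange, range_is_within and range_end are made-up helper names, and the sketch assumes the usual Foundation convention that a range may end exactly at the string's length.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for NSRange: a start offset and a length. */
typedef struct {
    size_t location;
    size_t length;
} DemoRange;

/* True if [location, location + length) lies inside a string of
   string_length units. The sum location + length is never computed
   here, so a huge length cannot wrap around and slip through. */
static bool range_is_within(DemoRange range, size_t string_length) {
    return range.location <= string_length
        && range.length <= string_length - range.location;
}

/* Index one past the last unit of a valid range. This demo asserts
   where the documented methods would raise NSRangeException. */
static size_t range_end(DemoRange range, size_t string_length) {
    assert(range_is_within(range, string_length));
    return range.location + range.length;
}
```

For a five-unit string, the ranges {2, 3} and {5, 0} pass the check, while {4, 2} does not. A naive test such as location + length > string_length can wrap around for very large lengths, which is why the comparison is written as a subtraction.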

Tasks

Clearing Cached Information
Determining the Number of Captures
Identifying Matches
Determining the Range of a Match
Creating Temporary Strings from a Match

Class Methods

Clears the cached information about strings.
+ (void)clearStringCache;
Discussion

This method should be used when performing searches on NSMutableString objects and there is the possibility that the string has mutated in between calls to RegexKitLite.

Warning:

When searching NSMutableString objects that have mutated between searches, failure to clear the cache may result in undefined behavior.

Returns the number of captures that regexString contains.
+ (NSInteger)captureCountForRegex:(NSString *)regexString;
Return Value
Returns -1 if an error occurs. Otherwise the number of captures in regexString is returned, or 0 if regexString does not contain any captures.
Returns the number of captures that regexString contains.
+ (NSInteger)captureCountForRegex:(NSString *)regexString options:(RKLRegexOptions)options error:(NSError **)error;
Discussion
If the optional error parameter is set and an error occurs, it will contain an NSError object that describes the problem. This may be set to NULL if information about any errors is not required.
Return Value
Returns -1 if an error occurs. Otherwise the number of captures in regexString is returned, or 0 if regexString does not contain any captures.

Instance Methods

Returns a Boolean value that indicates whether the receiver is matched by regexString.
- (BOOL)isMatchedByRegex:(NSString *)regexString;
Returns a Boolean value that indicates whether the receiver is matched by regexString within range.
- (BOOL)isMatchedByRegex:(NSString *)regexString options:(RKLRegexOptions)options inRange:(NSRange)range error:(NSError **)error;
Discussion
If the optional error parameter is set and an error occurs, it will contain an NSError object that describes the problem. This may be set to NULL if information about any errors is not required.
Returns the range for the first match of regexString in the receiver.
- (NSRange)rangeOfRegex:(NSString *)regexString;
Return Value
An NSRange structure giving the location and length of the first match of regexString in the receiver. Returns {NSNotFound, 0} if the receiver is not matched by regexString or an error occurs.
Returns the range of capture number capture for the first match of regexString in the receiver.
- (NSRange)rangeOfRegex:(NSString *)regexString capture:(NSInteger)capture;
Return Value
An NSRange structure giving the location and length of capture number capture for the first match of regexString in the receiver. Returns {NSNotFound, 0} if the receiver is not matched by regexString or an error occurs.
Returns the range for the first match of regexString within range of the receiver.
- (NSRange)rangeOfRegex:(NSString *)regexString inRange:(NSRange)range;
Return Value
An NSRange structure giving the location and length of the first match of regexString within range of the receiver. Returns {NSNotFound, 0} if the receiver is not matched by regexString within range or an error occurs.
Returns the range of capture number capture for the first match of regexString within range of the receiver.
- (NSRange)rangeOfRegex:(NSString *)regexString options:(RKLRegexOptions)options inRange:(NSRange)range capture:(NSInteger)capture error:(NSError **)error;
Parameters
  • regexString
    An NSString containing a regular expression.
  • options
    A mask of options specified by combining RKLRegexOptions flags with the C bitwise OR operator. Either 0 or RKLNoOptions may be used if no options are required.
  • range
    The range of the receiver to search.
  • capture
    The matching range of the capture number from regexString to return. Use 0 for the entire range that regexString matched.
  • error
    An optional parameter that, if set and an error occurs, will contain an NSError object that describes the problem. This may be set to NULL if information about any errors is not required.
Return Value
An NSRange structure giving the location and length of capture number capture for the first match of regexString within range of the receiver. Returns {NSNotFound, 0} if the receiver is not matched by regexString within range or an error occurs.
Returns a string created from the characters of the receiver that are in the range of the first match of regexString.
- (NSString *)stringByMatching:(NSString *)regexString;
Return Value
An NSString containing the substring of the receiver matched by regexString. Returns NULL if the receiver is not matched by regexString or an error occurs.
Returns a string created from the characters of the receiver that are in the range of the first match of regexString for capture.
- (NSString *)stringByMatching:(NSString *)regexString capture:(NSInteger)capture;
Return Value
An NSString containing the substring of the receiver matched by capture number capture of regexString. Returns NULL if the receiver is not matched by regexString or an error occurs.
Returns a string created from the characters of the receiver that are in the range of the first match of regexString within range of the receiver.
- (NSString *)stringByMatching:(NSString *)regexString inRange:(NSRange)range;
Return Value
An NSString containing the substring of the receiver matched by regexString within range of the receiver. Returns NULL if the receiver is not matched by regexString within range or an error occurs.
Returns a string created from the characters of the receiver that are in the range of the first match of regexString using options within range of the receiver for capture.
- (NSString *)stringByMatching:(NSString *)regexString options:(RKLRegexOptions)options inRange:(NSRange)range capture:(NSInteger)capture error:(NSError **)error;
Parameters
  • regexString
    An NSString containing a regular expression.
  • options
    A mask of options specified by combining RKLRegexOptions flags with the C bitwise OR operator. Either 0 or RKLNoOptions may be used if no options are required.
  • range
    The range of the receiver to search.
  • capture
    The capture number from regexString to return. Use 0 for the entire range that regexString matched.
  • error
    An optional parameter that, if set and an error occurs, will contain an NSError object that describes the problem. This may be set to NULL if information about any errors is not required.
Return Value
An NSString containing the substring of the receiver matched by capture number capture of regexString within range of the receiver. Returns NULL if the receiver is not matched by regexString within range or an error occurs.

Constants

RKLRegexOptions
Type for regular expression options.
typedef uint32_t RKLRegexOptions;
Discussion

See Regular Expression Options for possible values.

Declared In
RegexKitLite.h

Regular Expression Options

The following flags control various aspects of regular expression matching. The flag values may be specified at the time that a regular expression is used, or they may be specified within the pattern itself using the (?ismwx-ismwx) pattern options.
Constants
RKLNoOptions
No regular expression options specified.
RKLCaseless
If set, matching will take place in a case-insensitive manner.
RKLComments
If set, allow use of white space and #comments within patterns.
RKLDotAll
If set, a . in a pattern will match a line terminator in the input text. By default, it will not. Note that a carriage-return / line-feed pair in text behave as a single line terminator, and will match a single . in a regular expression pattern.
RKLMultiline
Control the behavior of ^ and $ in a pattern. By default these will only match at the start and end, respectively, of the input text. If this flag is set, ^ and $ will also match at the start and end of each line within the input text.
RKLUnicodeWordBoundaries
Controls the behavior of \b in a pattern. If set, word boundaries are found according to the definitions of word found in Unicode UAX 29 - Text Boundaries. By default, word boundaries are identified by means of a simple classification of characters as either word or non-word, which approximates traditional regular expression behavior. The results obtained with the two options can be quite different in runs of spaces and other non-word characters.
Discussion

The behavior of a regular expression pattern can be controlled in two ways. When the method supports it, options may be specified by combining RKLRegexOptions flags with the C bitwise OR operator. For example:

matchedRange = [aString rangeOfRegex:@"^(blue|red)$" options:(RKLCaseless | RKLMultiline) inRange:range capture:0 error:NULL];
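
For readers less familiar with option masks, here is a minimal plain C sketch of how flags combine with the bitwise OR operator and how a single flag can be tested afterwards. The names DemoRegexOptions, DemoCaseless, DemoMultiline, DemoDotAll, and has_option are hypothetical, and the bit values are placeholders for illustration only; real code should use the RKLRegexOptions constants declared in RegexKitLite.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the documented typedef: typedef uint32_t RKLRegexOptions; */
typedef uint32_t DemoRegexOptions;

/* Placeholder bit values (assumptions, not the library's values). */
enum {
    DemoNoOptions = 0,
    DemoCaseless  = 1u << 0,
    DemoMultiline = 1u << 1,
    DemoDotAll    = 1u << 2
};

/* True if the single-bit flag is present in mask. */
static bool has_option(DemoRegexOptions mask, DemoRegexOptions flag) {
    assert(flag != 0 && (flag & (flag - 1)) == 0);  /* exactly one bit set */
    return (mask & flag) != 0;
}
```

Combining DemoCaseless | DemoMultiline yields a mask in which has_option reports both of those flags as set and DemoDotAll as clear. Because each flag occupies its own bit, the order in which the flags are OR-ed together does not matter.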

The other way is to specify the options within the regular expression itself, which can be done in two forms. The first form sets the options for everything that follows it; the second sets them only for the group it encloses. Options are either enabled or, when they follow a -, disabled. The syntax for both is nearly identical:

Option             Example     Description
(?ixsmw-ixsmw)     (?i)        Enables the RKLCaseless option for everything that follows it. Useful at the beginning of a regular expression to set the desired options.
(?ixsmw-ixsmw:)    (?iw-m:)    Enables the RKLCaseless and RKLUnicodeWordBoundaries options and disables RKLMultiline for the group enclosed by the parentheses.

The following table lists the regular expression pattern option character and its corresponding RKLRegexOptions flag:

Character    Option
i            RKLCaseless
x            RKLComments
s            RKLDotAll
m            RKLMultiline
w            RKLUnicodeWordBoundaries
Declared In
RegexKitLite.h

RegexKitLite NSError Error Domains

The following NSError error domains are defined.
extern NSString * const RKLICURegexErrorDomain;
Constants
RKLICURegexErrorDomain
ICU Regular Expression Errors.
Declared In
RegexKitLite.h

RegexKitLite NSError User Info Dictionary Keys

extern NSString * const RKLICURegexErrorNameErrorKey;
extern NSString * const RKLICURegexLineErrorKey;
extern NSString * const RKLICURegexOffsetErrorKey;
extern NSString * const RKLICURegexPreContextErrorKey;
extern NSString * const RKLICURegexPostContextErrorKey;
extern NSString * const RKLICURegexRegexErrorKey;
extern NSString * const RKLICURegexRegexOptionsErrorKey;
Constants
RKLICURegexErrorNameErrorKey
The string returned by invoking u_errorName with the error code returned when attempting to compile the regular expression. Example: U_REGEX_RULE_SYNTAX.
RKLICURegexLineErrorKey
The line number where the error occurred in the regular expression.
RKLICURegexOffsetErrorKey
The offset from the beginning of the line where the error occurred in the regular expression.
RKLICURegexPreContextErrorKey
Up to 16 characters leading up to the cause of the error in the regular expression.
RKLICURegexPostContextErrorKey
Up to 16 characters after the cause of the error in the regular expression.
RKLICURegexRegexErrorKey
The regular expression that caused the error.
RKLICURegexRegexOptionsErrorKey
The RKLRegexOptions regular expression options specified.
Discussion

The RKLICURegexLineErrorKey, RKLICURegexOffsetErrorKey, RKLICURegexPreContextErrorKey, and RKLICURegexPostContextErrorKey error keys may not be present for all errors. For example, errors returned by passing invalid RKLRegexOptions flags will not have the listed keys set.

Declared In
RegexKitLite.h

Release Information

Changes:

  • Updated and clarified the documentation regarding adding RegexKitLite to an Xcode project.
  • Created Xcode 3 DocSet documentation.

Changes:

Bug fixes:

  • Fixed a bug where, for strings that required UTF-16 conversion, the conversion from the previous string that required conversion may have been reused for the current string, even though the two strings are different and the new string requires its own conversion.
  • Updated the internal inconsistency exception macro to correctly handle non-ASCII file names.

Initial release.

License Information

RegexKitLite is distributed under the terms of the BSD License, as specified below.

License

Copyright © 2008, John Engelhart

All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.