SAS extract substring from string with prxchange or prxposn(prxmatch(prxparse()))

2 SOLUTIONS POSTED AT BOTTOM

My code

 data test;
   extract_string = "<some string here>";
   my_result1 = prxchange(cat("s/^.*", extract_string, ".*$/$1/"), -1, "A1M_PRE");
   my_result2 = prxchange(cat("s/^.*", extract_string, ".*$/$1/"), -1, "AC2_0M");
   my_result3 = prxchange(cat("s/^.*", extract_string, ".*$/$1/"), -1, "GA3_30M");
   my_result4 = prxchange(cat("s/^.*", extract_string, ".*$/$1/"), -1, "DE3_1H30M");
 run;

Desired results

Extract the number after _ but preceding M in strings that have M at the end. The result set should be:

 my_result1 = ""
 my_result2 = "0"
 my_result3 = "30"
 my_result4 = "30"

The following extract_string values fail

"\.*(\d*)M\b\"
"\.*(\d*?)M\b\"
"\.*(\d{*})M\b\"
"\.*(\d{*?})M\b\"
"\.*(\d){*}M\b\"
"\.*(\d){*?}M\b\"
"\.*(\d+)M\b\"
"\.*(\d+?)M\b\"
"\.*(\d{+})M\b\"
"\.*(\d{+?})M\b\"
"\.*(\d){+}M\b\"
"\.*(\d){+?}M\b\"
"\.*(\d+\d+)M\b\" 

Potential solutions which I would request help with

  • Perhaps I just haven't tested the correct extract_string yet. Ideas?
  • Perhaps my cat("s/^.*", extract_string, ".*$/$1/") needs to be modified. Ideas?
  • Perhaps I need to use prxposn(prxmatch(prxparse())) instead of prxchange. How would that be formulated?

Links I've looked at but have not been able to successfully implement

SAS PRX to extract substring please

extracting substring using regex in sas

Extract substring from a string in SAS

SOLUTIONS

Solution 1

The suffix in the cat function and the extract_string were modified.

 data test;
   extract_string = "?(?:_[^_\r\n]*?(\d+)M)?$";
   my_result1 = prxchange(cat("s/^.*", extract_string, "/$1/"), -1, "A1M_PRE");
   my_result2 = prxchange(cat("s/^.*", extract_string, "/$1/"), -1, "AC2_0M");
   my_result3 = prxchange(cat("s/^.*", extract_string, "/$1/"), -1, "GA3_30M");
   my_result4 = prxchange(cat("s/^.*", extract_string, "/$1/"), -1, "DE3_1H30M");
 run;

Solution 2

This solution uses the other prx-family functions: prxparse, prxmatch, and prxposn.

data have;
  length string $10;
  input string;
  datalines;
A1M_PRE
AC2_0M
GA3_30M
DE3_1H30M
;
data want;
  set have;
  rxid = prxparse('/_.*?(\d+)M\s*$/');
  length digit_string $8;
  if prxmatch(rxid, string) then digit_string = prxposn(rxid, 1, string);
  number_extracted = input(digit_string, ? 12.);
run;

3 Answers

I understand that SAS can use Perl's regex engine. The latter supports \K, which directs the engine to discard everything matched so far and reset the starting point of the match to the current location. The following regular expression should therefore match the substring's digits that are of interest.

_.*?\K\d+(?=M$)


A failure to match would be interpreted as an empty string having been matched.


If you want to replace the whole line and keep only the digits preceding M at the end of the line, you could use a capturing group. In the replacement, keep the value of group 1 with $1.

^.*?(?:_[^_\r\n]*?(\d+)M)?$

Explanation

  • ^ Start of string
  • .*? Match any character, as few times as possible
  • (?: Non-capturing group
    • _[^_\r\n]*? Match _ followed by any characters except an underscore or a newline, as few as possible
    • (\d+)M Capture group 1: match 1+ digits, followed by M
  • )? Close the group and make it optional
  • $ End of string



You could make the extract_string the full pattern:

extract_string = "^.*?(?:_[^_\r\n]*?(\d+)M)?$";
my_result1 = prxchange(cat("s/", extract_string, "/$1/"), -1, "A1M_PRE");

Or if you must keep the leading ^.* use

extract_string = "?(?:_[^_\r\n]*?(\d+)M)?$";
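
One caveat if the source is a variable rather than a string literal: SAS pads character values with trailing blanks, so the $ anchor may no longer sit right after the final M. Here is a minimal sketch of the full-pattern approach that strips the value first. The data sets have and want and the variables string and my_result are made-up names for illustration, not something from the question.

```sas
* Hypothetical input data, names are illustrative only;
data have;
  length string $10;
  input string;
  datalines;
A1M_PRE
AC2_0M
GA3_30M
DE3_1H30M
;
run;

data want;
  set have;
  length my_result $10;
  /* strip() drops the trailing blanks so that $ can match right after M. */
  /* The whole value is replaced by capture group 1, which is empty when  */
  /* the optional group did not take part in the match.                   */
  my_result = prxchange('s/^.*?(?:_[^_\r\n]*?(\d+)M)?$/$1/', 1, strip(string));
run;
```

With these four values, my_result should be blank for A1M_PRE and 0, 30 and 30 for the other three.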

Use PRXPOSN to extract a match group.

Example:

Use pattern /_.*?(\d+)M\s*$/ to locate the last run of digits before a terminating M character.

Regex:

  • _ literal underscore
  • .*? non-greedy any characters
  • (\d+) capture one or more digits
  • M literal M
  • \s*$ any number of trailing spaces before the end of the string; needed because SAS character values are right-padded with spaces to the variable's declared length

data have;
  length string $10;
  input string;
  datalines;
A1M_PRE
AC2_0M
GA3_30M
DE3_1H30M
;
data want;
  set have;
  rxid = prxparse('/_.*?(\d+)M\s*$/');
  length digit_string $8;
  if prxmatch(rxid, string) then digit_string = prxposn(rxid, 1, string);
  number_extracted = input(digit_string, ? 12.);
run;
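
To see the effect of the trailing-space handling in isolation, here is a small sketch that runs the pattern with and without \s* on a padded value. The data set name pad_demo and the match_* variable names are made up for illustration.

```sas
data pad_demo;
  length string $10;
  string = "AC2_0M";   /* stored as AC2_0M followed by four blanks */

  /* No allowance for padding: the blanks sit between M and the end of the string */
  match_without = prxmatch('/_.*?(\d+)M$/', string);

  /* Allow trailing whitespace before the end of the string */
  match_with = prxmatch('/_.*?(\d+)M\s*$/', string);

  /* Alternative: remove the padding before matching */
  match_trimmed = prxmatch('/_.*?(\d+)M$/', strip(string));

  put match_without= match_with= match_trimmed=;
run;
```

match_without should come back as 0 (no match), while the other two should return the position of the underscore.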
