Regex punctuation split [Python]

Can anyone help me a bit with regexes? I currently have this: re.split(" +", line.rstrip()), which separates by spaces.

How could I expand this to cover punctuation, too?

4 Answers

The official Python documentation has a good example for this one. It splits on every run of non-alphanumeric characters (whitespace and punctuation). \W is the character class for all non-word characters. Note: the underscore "_" is considered a "word" character, so the string will not be split on it here.

re.split(r'\W+', 'Words, words, words.')

For more examples, see the documentation for the re module and search the page for "re.split".
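To make the underscore note concrete, here is a minimal sketch. The sample string is made up for illustration; it shows that \W+ keeps "snake_case" in one piece and leaves a trailing empty string when the input ends with punctuation.

```python
import re

# Hypothetical sample input; underscores count as word characters for \W.
sample = "snake_case, kebab-case and words."

parts = re.split(r"\W+", sample)
print(parts)
# ['snake_case', 'kebab', 'case', 'and', 'words', '']
```

The final '' appears because the string ends with a separator ("."), so there is an empty piece after it.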

Using string.punctuation and character class:

>>> import re
>>> from string import punctuation
>>> r = re.compile(r'[\s{}]+'.format(re.escape(punctuation)))
>>> r.split('dss!dfs^ #$% jjj^')
['dss', 'dfs', 'jjj', '']
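If the trailing empty string is unwanted, one option is to filter it out after splitting. This is only a sketch: split_tokens and SEPARATORS are hypothetical names introduced here, not part of the answer above. Note that string.punctuation covers ASCII punctuation only and includes the underscore, so unlike \W+ this also splits on "_".

```python
import re
from string import punctuation

# Same character class as above: whitespace plus every ASCII punctuation mark.
SEPARATORS = re.compile(r"[\s{}]+".format(re.escape(punctuation)))

def split_tokens(line):
    """Split on whitespace/punctuation and discard empty strings (hypothetical helper)."""
    return [tok for tok in SEPARATORS.split(line) if tok]

print(split_tokens("dss!dfs^ #$% jjj^"))
# ['dss', 'dfs', 'jjj']
```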

import re
st = 'one two,three; four-five, six'
# Split on runs of whitespace, or on one of , ; . - followed by optional whitespace
print(re.split(r'\s+|[,;.-]\s*', st))
# ['one', 'two', 'three', 'four', 'five', 'six']

When you consider using a regex to split with any punctuation, you should bear in mind that the \W pattern does not match an underscore (which is a punctuation char, too).

Thus, you can use

import re
tokens = re.split(r'[\W_]+', text)

where [\W_] matches any character that is not a Unicode alphanumeric character.

Since re.split may return empty strings when a match occurs at the start or end of the string, it is often better to match the tokens themselves rather than the separators:

import re
tokens = re.findall(r'[^\W_]+', text)

where [^\W_] matches any Unicode alphanumeric character.

See the Python demo:

import re
text = "!Hello, world!"
print( re.split(r'[\W_]+', text) )
# => ['', 'Hello', 'world', '']
print( re.findall(r'[^\W_]+', text) )
# => ['Hello', 'world']
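To illustrate the "Unicode" part of the [^\W_] description, here is a small sketch with a made-up sample string. It assumes Python 3, where str patterns are Unicode-aware by default, and contrasts that with the re.ASCII flag.

```python
import re

# Hypothetical sample with accented letters, underscores and digits.
text = "naïve café_au_lait, 42 times"

words = re.findall(r"[^\W_]+", text)
print(words)
# ['naïve', 'café', 'au', 'lait', '42', 'times']

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so accented letters break words apart.
ascii_words = re.findall(r"[^\W_]+", text, flags=re.ASCII)
print(ascii_words)
# ['na', 've', 'caf', 'au', 'lait', '42', 'times']
```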
