Python: regex, concurrency
Regex:
Python built in re
module is really handy.
re.match
always matches whole stringre.search
you almost always want to use this, as here you will search the matching substirng for given pattern.
Here are some usefull patterns and function which will come in handy:
# Usefull patterns:
URL_PATTERN = r"(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?"
MAN_PAGE_PATTERN = r"man\d/.+.html$"
GIT_REPO_NAME_PATTERN = r"[^/]+.git$"
# Useful wrapper around `re.search`
def get_match(pattern: str, string: str) -> str or None:
"""Scan through string looking for a match to the pattern, returning
a Match sub string, or None if no match was found."""
match = re.search(URL_PATTERN, tldr_summary)
return match.group() if match else None
Concurrency:
In python with in built modules multiprocessing
and concurrent
converting your single process single threaded program has been pretty easy.
For example: Simple Example to convert any list comprehension to execute “concurrently”
# test_concurrency.py
import time
import timeit
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
def parallelize(func, iterable, use_thread=True, max_workers=5):
pool_executor = ThreadPoolExecutor if use_thread else ProcessPoolExecutor
with pool_executor(max_workers=max_workers) as executor:
data = list(executor.map(func, iterable))
return data
def square(x: int) -> int:
time.sleep(1e-3)
return x ** 2
# For invoking multi processing we need __main__ module defined !! :)
if __name__ == "__main__":
input_iter = list(range(100))
number = 10
repeat = 5
print("1 prcess 1 thread", timeit.repeat(lambda: [square(x) for x in input_iter], repeat=repeat, number=number))
print("1 prcess 5 thread", timeit.repeat(lambda: parallelize(square, input_iter), repeat=repeat, number=number))
print("5 prcess 1 thread each", timeit.repeat(lambda: parallelize(square, input_iter, use_thread=False), repeat=repeat, number=number))
Execution:
$ python test_concurrency.py
1 prcess 1 thread [1.2861979760000002, 1.298199268, 1.2918409180000001, 1.281806194, 1.295781883]
1 prcess 5 thread [0.28324162100000017, 0.2827319930000005, 0.2845571439999999, 0.2883606469999993, 0.2812394300000003]
5 prcess 1 thread each [1.5489112939999998, 1.5780382430000017, 1.5727520679999998, 1.5113713640000004, 1.524860704]
Important Note:
- We do have to know that with threading and multi-processing it’s strength is it’s weakness.
- Hence restricting
max_workers
is keypool_executor(max_workers=5)
. - If
max_workers=None
then it will spawn many threads and many processes which causes huge delay in thread/process management. So it will be slower compared to single process single thread. - You can play around
max_workers
count. But never set it toNone
Very usefull references:
- Python3.x moduel: https://docs.python.org/3/library/concurrent.futures.html#module-concurrent.futures
- Tiemit: https://docs.python.org/3/library/timeit.html#python-interface
- RealPython site: https://realpython.com/intro-to-python-threading/#using-a-threadpoolexecutor
- Raymond Hettiger: https://pybay.com/site_media/slides/raymond2017-keynote/index.html
- David Beazley - A Curious Course on Coroutines and Concurrency: http://www.dabeaz.com/coroutines/