Python: regex, concurrency
Regex:
Python's built-in `re` module is really handy.
`re.match` only matches at the beginning of the string; `re.search` is the one you almost always want, since it scans the whole string for the first substring that matches the given pattern.
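For example, here is a minimal illustration of the difference (plus `re.fullmatch`, which requires the whole string to match):
import re

print(re.match(r"\d+", "abc 123"))            # None -- pattern must match at position 0
print(re.search(r"\d+", "abc 123"))           # <re.Match object; span=(4, 7), match='123'>
print(re.match(r"\d+", "123 abc").group())    # '123' -- matching at the start works
print(re.fullmatch(r"\d+", "123 abc"))        # None -- fullmatch needs the entire string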
Here are some useful patterns and a function that will come in handy:
import re
from typing import Optional

# Useful patterns:
URL_PATTERN = r"(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?"
MAN_PAGE_PATTERN = r"man\d/.+\.html$"
GIT_REPO_NAME_PATTERN = r"[^/]+\.git$"

# Useful wrapper around `re.search`
def get_match(pattern: str, string: str) -> Optional[str]:
    """Scan through string looking for a match to the pattern, returning
    the matching substring, or None if no match was found."""
    match = re.search(pattern, string)
    return match.group() if match else None
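A quick usage sketch (the input strings here are made up for illustration):
summary = "See https://docs.python.org/3/library/re.html for details"
print(get_match(URL_PATTERN, summary))          # https://docs.python.org/3/library/re.html
print(get_match(GIT_REPO_NAME_PATTERN, "git@github.com:psf/requests.git"))  # requests.git
print(get_match(URL_PATTERN, "no links here"))  # None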
Concurrency:
With the built-in `concurrent.futures` module (which wraps `threading` and `multiprocessing`), converting a single-process, single-threaded program into a concurrent one is pretty easy.
For example, here is a simple helper that turns any list comprehension into a map that executes "concurrently":
# test_concurrency.py
import time
import timeit
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def parallelize(func, iterable, use_thread=True, max_workers=5):
    pool_executor = ThreadPoolExecutor if use_thread else ProcessPoolExecutor
    with pool_executor(max_workers=max_workers) as executor:
        data = list(executor.map(func, iterable))
    return data


def square(x: int) -> int:
    time.sleep(1e-3)
    return x ** 2


# For multiprocessing the entry point must be guarded by __main__ !! :)
if __name__ == "__main__":
    input_iter = list(range(100))
    number = 10
    repeat = 5
    print("1 process 1 thread", timeit.repeat(lambda: [square(x) for x in input_iter], repeat=repeat, number=number))
    print("1 process 5 threads", timeit.repeat(lambda: parallelize(square, input_iter), repeat=repeat, number=number))
    print("5 processes 1 thread each", timeit.repeat(lambda: parallelize(square, input_iter, use_thread=False), repeat=repeat, number=number))
Execution:
$ python test_concurrency.py
1 process 1 thread [1.2861979760000002, 1.298199268, 1.2918409180000001, 1.281806194, 1.295781883]
1 process 5 threads [0.28324162100000017, 0.2827319930000005, 0.2845571439999999, 0.2883606469999993, 0.2812394300000003]
5 processes 1 thread each [1.5489112939999998, 1.5780382430000017, 1.5727520679999998, 1.5113713640000004, 1.524860704]
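Interestingly, the process pool is slower than the plain loop here: the work per item is only a 1 ms sleep, so process start-up and the pickling of arguments/results outweighs any parallelism (exact numbers will vary by machine). For CPU-bound work the picture usually flips, since threads are serialized by the GIL while processes can use multiple cores. A rough sketch, using a made-up `busy` function appended to test_concurrency.py above (results depend on machine and work size):
# (uses parallelize, timeit and the imports from test_concurrency.py above)
# Hypothetical CPU-bound variant: the thread pool gains little because of the
# GIL, while the process pool can spread the work across cores.
def busy(x: int) -> int:
    return sum(i * i for i in range(100_000)) + x

if __name__ == "__main__":
    items = list(range(100))
    print("threads  ", timeit.timeit(lambda: parallelize(busy, items), number=5))
    print("processes", timeit.timeit(lambda: parallelize(busy, items, use_thread=False), number=5))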
Important Note:
- With threading and multiprocessing, the strength is also the weakness: every extra worker costs time to create, schedule, and tear down.
- Hence restricting `max_workers` is key: `pool_executor(max_workers=5)`.
- If `max_workers=None`, the pool will spawn many threads/processes, and the overhead of managing them can make it slower than the single-process, single-threaded version.
- You can play around with the `max_workers` count, but never set it to `None`. One way to pick a sensible cap is sketched below.
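A minimal sketch of choosing an explicit cap, assuming the worker count should track `os.cpu_count()` (the multipliers here are illustrative, not prescriptive):
import os
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

cpu_count = os.cpu_count() or 1

# I/O-bound work (e.g. the sleep in `square`): a modest multiple of the
# core count keeps thread-management overhead bounded.
thread_pool = ThreadPoolExecutor(max_workers=min(32, cpu_count * 5))

# CPU-bound work: more processes than cores mostly adds scheduling overhead.
process_pool = ProcessPoolExecutor(max_workers=cpu_count)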
Very useful references:
- Python 3.x module: https://docs.python.org/3/library/concurrent.futures.html#module-concurrent.futures
- Timeit: https://docs.python.org/3/library/timeit.html#python-interface
- RealPython site: https://realpython.com/intro-to-python-threading/#using-a-threadpoolexecutor
- Raymond Hettiger: https://pybay.com/site_media/slides/raymond2017-keynote/index.html
- David Beazley - A Curious Course on Coroutines and Concurrency: http://www.dabeaz.com/coroutines/