Therefore speak I to them in parables, because seeing, they see not, and hearing, they hear not, neither do they understand.
Matthew 13:13
Ever wondered how miserable some “prestigious” businesses are, and how they get their employees to make up for poor project management? Me too! A classic situation that contributes to a crisis is miscommunication with subcontractors or employees. Let’s see how UML can be used to study such antipatterns. They happen unintentionally, don’t they? 🤔
This is a real-world use case from a prestigious legal office located in Warsaw, Poland. I was asked, as an external observer and modeller, to capture its project management antipatterns.
One use case: an expert subcontractor proactively asked, several times in fact, to be put in the communication loop with the client. But the office executives didn’t find it necessary (why would they, huh?). Until… guess when? The deadline! The subcontractor was caught by surprise: please deliver to the customer by today! But wait, what customer…? 🤔
Another use case: the office rushed to promise the client something it couldn’t deliver, and reached out to its experts for help pretty late. Guess when? On the deadline day!
Here is the UML model that I promised, a good illustration of this poor management practice! I will use a sequence diagram, a powerful tool for exploring interactions 💪
Modelling poor management in a legal office. See the source code: here and here.
You certainly agree this is not professional, but you would probably argue that it doesn’t happen at Ernst & Young, PwC and other big companies… Would you?
Visualizing code as a syntax tree is both fun and useful, as seen in impressive applications such as deriving the lineage of SQL queries, which helps in understanding complex queries in business settings. Abstract syntax trees are not only widely used in industry but are still a subject of top academic research [1,2].
This post demonstrates how to work with ASTs in Python, by parsing C code with Clang/LLVM [3] and visualizing the result with graphviz.
Parsing is relatively simple, particularly for users who already have experience with similar abstract trees, for example from parsing XML. My advice for beginners is to avoid heavy refactoring and instead leverage Python’s functional features. The example below shows how to extract function declarations and the details of their arguments:
from clang.cindex import Index, Config, CursorKind, TypeKind

SCRIPT_PATH = "./tcpdump/print-ppp.c"

# C99 is a proper C code standard for tcpdump, as per their docs
index = Index.create()
translation_unit = index.parse(SCRIPT_PATH, args=["-std=c99"])

# filter to nodes in the root script (ignore imported!)
script_node = translation_unit.cursor
all_nodes = script_node.get_children()
all_nodes = filter(lambda c: c.location.file.name == SCRIPT_PATH, all_nodes)

# filter to function nodes (materialized as a list so it can be iterated again later)
func_nodes = list(filter(lambda c: c.kind == CursorKind.FUNCTION_DECL, all_nodes))

# print attributes and their types for each function
for fn in func_nodes:
    print(fn.spelling)
    for arg in fn.get_arguments():
        t = arg.type
        # handle pointers by describing their pointees
        if t.kind == TypeKind.POINTER:
            declr = t.get_pointee().get_declaration()
        else:
            declr = t.get_declaration()
        print(
            '\t', t.get_canonical().spelling, t.kind,
            f'arg declared in {arg.location.file}:L{arg.extent.start.line},C{arg.extent.start.column}-L{arg.extent.end.line},C{arg.extent.end.column}',
            f'{declr.spelling} declared in {declr.location.file}:L{declr.location.line}'
        )
This gives the following output when tested on the tcpdump project.
The fun part, however, comes with the visualization. This is easy with graphviz:
from graphviz import Digraph

dot = Digraph(strict=True)
dot.attr(rankdir="LR", size="20,100", fontsize="6")
node_args = {"fontsize": "8pt", "edgefontsize": "6pt"}

for fn in func_nodes:
    fn_node_name = f"{fn.spelling}\nL{fn.location.line}"
    dot.node(fn_node_name, **node_args)
    for i, arg in enumerate(fn.get_arguments(), start=1):
        arg_node_name = arg.type.get_canonical().spelling
        dot.node(arg_node_name, **node_args)
        dot.edge(fn_node_name, arg_node_name)
        t = arg.type
        # handle pointers by describing their pointees
        if t.kind == TypeKind.POINTER:
            declr = t.get_pointee().get_declaration()
        else:
            declr = t.get_declaration()
        declr_file = f"{declr.location.file}"
        dot.node(declr_file, **node_args)
        dot.edge(arg_node_name, declr_file, label=f"L{declr.location.line}", fontsize="6pt")

from IPython.display import display_svg
display_svg(dot)
We can now enjoy a pretty informative graph. It shows that multiple functions share only a few argument types, and gives precise information about where each type originates.
1. Grafberger S, Groth P, Stoyanovich J, Schelter S. Data distribution debugging in machine learning pipelines. The VLDB Journal. Published online January 31, 2022:1103-1126. doi:10.1007/s00778-021-00726-w
2. Fu H, Liu C, Wu B, Li F, Tan J, Sun J. CatSQL: Towards Real World Natural Language to SQL Applications. Proc VLDB Endow. Published online February 2023:1534-1547. doi:10.14778/3583140.3583165
3. Lattner C, Adve V. LLVM: A compilation framework for lifelong program analysis & transformation. International Symposium on Code Generation and Optimization, 2004 (CGO 2004). doi:10.1109/cgo.2004.1281665
Kaggle Docker images come with a huge list of pre-installed packages for machine learning, including support for GPU computing. They run inside a container as a Jupyter application that users access through its web interface. Running a custom image boils down to the steps sketched below.
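As a rough sketch (the image name below is the public CPU image; paths and ports are illustrative, not necessarily the exact setup described here):

# pull the image (use gcr.io/kaggle-gpu-images/python for the GPU variant)
docker pull gcr.io/kaggle-images/python
# start Jupyter inside the container, mounting the current project and exposing the web interface
docker run --rm -it -p 8888:8888 -v "$PWD":/home/project -w /home/project \
    gcr.io/kaggle-images/python \
    jupyter notebook --ip=0.0.0.0 --port=8888 --no-browser --allow-root
# then open the printed http://127.0.0.1:8888/?token=... link in a browser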
Teams that collaborate on data-science tasks using cloud platforms often choose to share a preconfigured ML environment, such as the Kaggle Python Docker image. This resolves reproducibility and dependency issues, while individual team members can still add custom packages on top in local virtual environments, for example less common computer-vision packages.
This robust setup requires exposing the base environment via --system-site-packages when configuring the local virtual environment. Below is an example of a local environment with the package DeepForest (not present in the Kaggle image).
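A minimal sketch of such an environment (deepforest is, to my knowledge, the PyPI name of DeepForest; adjust names and paths as needed):

# create a virtual environment that can also see the base (Kaggle) site-packages
python -m venv --system-site-packages .venv
source .venv/bin/activate
# add a package missing from the base image, e.g. DeepForest for computer vision
pip install deepforest
# base packages such as pandas still resolve from the shared system site-packages
python -c "import deepforest, pandas"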
Finally, it is worth mentioning the Dev Containers extension, which connects the IDE to a running container. Then we can enjoy all the VS Code features.
Pre-commit is a great tool for running various sanity checks (formatting, linting) on the code base. However, such scanning may be time-consuming (particularly on certain content, like notebooks), which hurts both the user experience and CI/CD billing (minutes are usually paid for, except for public repos or very small projects).
Below, I demonstrate how to effectively optimize running pre-commit on GitHub Actions. The key is to cache both the pre-commit package and the dependent hooks. Note that, as of now (April 2023), pre-commit’s native caching covers only the second part. Fortunately, managing the cache is as simple as calling the GitHub Cache Action on ~/.cache/pre-commit.
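As a minimal workflow sketch (the action versions, Python version and cache key below are illustrative):

name: pre-commit
on: [push, pull_request]
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      # one cache covers both the pip download of pre-commit and the hook environments
      - uses: actions/cache@v3
        with:
          path: |
            ~/.cache/pip
            ~/.cache/pre-commit
          key: pre-commit-${{ hashFiles('.pre-commit-config.yaml') }}
      - run: pip install pre-commit
      - run: pre-commit run --all-files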
In this post I am sharing my recipe for building and publishing a Docker image using GitHub Actions. It concisely wraps up a few steps that beginners often find problematic. In particular:
use GitHub secrets to securely store credentials for your Docker registry (such as Docker Hub or the GitHub Container Registry), e.g. $DOCKER_USER and $DOCKER_PASSWORD
I recommend logging in to the registry via the CLI rather than through a less transparent GitHub Action; it is as simple as docker login -u $DOCKER_USER -p $DOCKER_PASSWORD
use the correct tag pattern when pushing your image to the registry (see the sketch below)
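A minimal sketch of the corresponding shell commands (the image name my-image is a placeholder; in a workflow, the two variables would be injected from GitHub secrets):

# log in via the CLI (Docker Hub by default; prepend the registry host, e.g. ghcr.io, otherwise)
docker login -u "$DOCKER_USER" -p "$DOCKER_PASSWORD"
# tag as <user>/<image>:<version> so the push lands in your namespace
docker build -t "$DOCKER_USER/my-image:latest" .
docker push "$DOCKER_USER/my-image:latest"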
GitHub Actions is a great CI/CD platform. However, to be really efficient, workflows need to leverage optimization techniques such as caching or running tasks in parallel. In this note, I share some thoughts on how to use the cache effectively with multiple paths and with sudo-installed APT packages. The discussion touches on a few non-trivial aspects that, in my opinion, are not well explained in other web materials.
My use case was simple: speed up building a university course in Sphinx, hosted on GitHub Pages. Installing the Sphinx dependencies required multiple Python and Linux APT downloads, which took quite a long time. Caching indeed fixed the problem; the sketch below captures the key takeaways.
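Here is the kind of workflow fragment I mean (the action version, paths, package names and cache key are illustrative, and the APT part in particular may need tweaking for your runner):

# fragment of the jobs.<job>.steps section
- uses: actions/cache@v3
  with:
    path: |
      ~/.cache/pip
      /var/cache/apt/archives
    key: sphinx-deps-${{ hashFiles('requirements.txt') }}
- run: |
    # sudo is required for APT; previously downloaded .deb files are reused on a cache hit
    sudo apt-get update
    sudo apt-get install -y latexmk texlive-latex-extra
    # make the root-owned archives readable for the (non-root) cache save step
    sudo chown -R "$USER" /var/cache/apt/archives
- run: pip install -r requirements.txt
- run: sphinx-build -b html docs docs/_build/html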
It is a popular misbelief that hiding encrypted connections (SSH) behind a proxy is a dark art reserved for criminal activity. You may simply need a Russian or Iranian proxy to get your coding job done, when the firewall of your favourite coffee place, or the wifi on your travels, blocks SSH.
As this happens to me regularly (I travel a lot), I would like to share a solution here. The most effective approach is to proxy the traffic through port 443 (the default for HTTPS, typically enabled). Testing the list of free proxy servers [1], we find that the proxy 185.82.139.1 (Iranian) works well. It remains to add a proxy instruction to the SSH configuration, as shown below. That’s it, and I can work with GitHub from my favourite coffee place in France. Enjoy!
# content of .ssh/config
Host github.com
HostName github.com
User git
Port 22
IdentityFile ~/.ssh/id_rsa
StrictHostKeyChecking no
ProxyCommand ncat --proxy 185.82.139.1:443 github.com 22
As anticipated by many, Twitter stopped offering its (limited!) API for free [1].
Now, what options do you have to programmatically access public content for free? In this context, it is worth mentioning snscrape, a library (well-maintained as of now) for extracting content from social media services such as Facebook, Instagram or Twitter [2]. I have just given it a go as part of a research project I am working on, and would love to share some thoughts and code.
The basic usage is pretty simple, but I added multithreading to improve speed by executing queries in parallel (an established way of handling I/O-bound operations). I also prefer a functional/pipeline style of composing Python commands, using generators, filter and map. The code snippet below (see also the Colab notebook) shows how to extract tweets from top futurists. Enjoy!
# install the social media scraper: !pip3 install snscrape
import snscrape.modules.twitter as sntwitter
import itertools
import multiprocessing.dummy as mp # for multithreading
import datetime
import pandas as pd
start_date = datetime.datetime(2018,1,1,tzinfo=datetime.timezone.utc) # from when
attributes = ('date','url','rawContent') # what attributes to keep
def get_tweets(username, n_tweets=5000, attributes=attributes):
    # invoke the scraper and cap the number of tweets retrieved
    tweets = itertools.islice(sntwitter.TwitterSearchScraper(f'from:{username}').get_items(), n_tweets)
    # keep only tweets newer than start_date
    tweets = filter(lambda t: t.date >= start_date, tweets)
    # keep only the attributes needed, prefixed with the username
    tweets = map(lambda t: (username,) + tuple(getattr(t, a) for a in attributes), tweets)
    # the result has to be pickle'able
    tweets = list(tweets)
    return tweets
# a list of accounts to scrape
user_names = ['kevin2kelly','briansolis','PeterDiamandis','michiokaku']
# parallelise queries for speed !
with mp.Pool(4) as p:
    results = p.map(get_tweets, user_names)
# combine
results = list(itertools.chain(*results))