Working with Abstract Syntax Trees

Visualizing code as a syntax tree is both fun and useful, as seen in impressive applications such as tracing the lineage of SQL, which helps to understand complex queries in business. Abstract syntax trees are not only widely used in industry but also remain a subject of top academic research [1,2].

This post demonstrates how to work with ASTs in Python by parsing C code with Clang/LLVM [3] and visualizing the result with graphviz.

Parsing is relatively simple, particularly for users who already have experience with similar tree structures, such as parsed XML. My advice for beginners is to avoid heavy code factoring and instead leverage the functional features of Python. The example below shows how to extract function declarations and the details of their arguments:

from clang.cindex import Index, Config, CursorKind, TypeKind

SCRIPT_PATH = "./tcpdump/print-ppp.c"

# C99 is a proper C code standard for tcpdump, as per their docs
index = Index.create()
translation_unit = index.parse(SCRIPT_PATH, args=["-std=c99"])

# filter to nodes in the root script (ignore imported!)
script_node = translation_unit.cursor
all_nodes = script_node.get_children()
all_nodes = filter(lambda c: c.location.file.name == SCRIPT_PATH, all_nodes)

# filter to function nodes (kept as a list, because the nodes are iterated more than once)
func_nodes = list(filter(lambda c: c.kind == CursorKind.FUNCTION_DECL, all_nodes))

# print arguments and their types for each function
for fn in func_nodes:
    for arg in fn.get_arguments():
        t = arg.type
        # handle pointers by describing their pointees
        if t.kind == TypeKind.POINTER:
            declr = t.get_pointee().get_declaration()
        else:
            declr = t.get_declaration()
        print(
            "\t",
            t.get_canonical().spelling,
            t.kind,
            f'arg declared in {arg.location.file}:L{arg.extent.start.line},C{arg.extent.start.column}-L{arg.extent.end.line},C{arg.extent.end.column}',
            f'{declr.spelling} declared in {declr.location.file}:L{declr.location.line}',
        )

This gives the following output when run on the tcpdump project:

     struct netdissect_options * TypeKind.POINTER arg declared in ./tcpdump/print-ppp.c:L403,C39-L403,C59 netdissect_options declared in ./tcpdump/netdissect.h:L161
     const unsigned char TypeKind.ELABORATED arg declared in ./tcpdump/print-ppp.c:L403,C61-L403,C73 u_char declared in /Library/Developer/CommandLineTools/SDKs/MacOSX13.sdk/usr/include/sys/_types/_u_char.h:L30
     const unsigned int TypeKind.ELABORATED arg declared in ./tcpdump/print-ppp.c:L403,C75-L403,C86 u_int declared in /Library/Developer/CommandLineTools/SDKs/MacOSX13.sdk/usr/include/sys/_types/_u_int.h:L30
     struct netdissect_options * TypeKind.POINTER arg declared in ./tcpdump/print-ppp.c:L1359,C10-L1359,C33 netdissect_options declared in ./tcpdump/netdissect.h:L161
     const unsigned char * TypeKind.POINTER arg declared in ./tcpdump/print-ppp.c:L1360,C10-L1360,C25 u_char declared in /Library/Developer/CommandLineTools/SDKs/MacOSX13.sdk/usr/include/sys/_types/_u_char.h:L30
     unsigned int TypeKind.ELABORATED arg declared in ./tcpdump/print-ppp.c:L1360,C27-L1360,C39 u_int declared in /Library/Developer/CommandLineTools/SDKs/MacOSX13.sdk/usr/include/sys/_types/_u_int.h:L30
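
The same filter-then-describe pattern can be tried without installing libclang. The sketch below is an analogue rather than the method used above: it relies on Python's built-in ast module and a made-up source snippet (the function names and annotations are hypothetical) to list the functions declared at the top level together with their arguments.

```python
import ast

# Hypothetical source to analyse; it stands in for the C file parsed above.
SOURCE = """
def ppp_print(ndo: dict, p: bytes, length: int) -> int:
    return length

counter = 0

def handle_ctrl(ndo: dict, verbose: bool = False):
    pass
"""

tree = ast.parse(SOURCE)

# Filter to function nodes declared at the top level of the module.
func_nodes = filter(lambda n: isinstance(n, ast.FunctionDef), tree.body)


def describe(fn):
    """Return (function, argument, annotation, line) for every argument."""
    return [
        (fn.name, a.arg, getattr(a.annotation, "id", None), a.lineno)
        for a in fn.args.args
    ]


rows = [row for fn in func_nodes for row in describe(fn)]
for row in rows:
    print(*row)
```

As with the Clang cursors, each node carries its own source position, which is what makes it possible to report where an argument was declared.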

The fun part, however, is visualization, which is easy with graphviz:

from graphviz import Digraph

dot = Digraph(strict=True)
dot.attr(rankdir="LR", size="20,100", fontsize="6")

node_args = {"fontsize": "8pt", "edgefontsize": "6pt"}

for fn in func_nodes:
    fn_node_name = f"{fn.spelling}\nL{fn.location.line}"
    dot.node(fn_node_name, **node_args)
    for i, arg in enumerate(fn.get_arguments(), start=1):
        arg_node_name = arg.type.get_canonical().spelling
        dot.node(arg_node_name, **node_args)
        dot.edge(fn_node_name, arg_node_name)
        t = arg.type
        # handle pointers by describing their pointees
        if t.kind == TypeKind.POINTER:
            declr = t.get_pointee().get_declaration()
        else:
            declr = t.get_declaration()
        declr_file = f"{declr.location.file}"
        dot.node(declr_file, **node_args)
        # link the argument type to the file that declares it
        dot.edge(
            arg_node_name, declr_file, label=f"L{declr.location.line}", fontsize="6pt"
        )

from IPython.display import display_svg

display_svg(dot)

We can now enjoy a pretty informative graph 😎 It shows that multiple functions share only a few types of arguments, and it gives precise information about where those types are declared.
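
To see what ends up in the graph, it can help to write the DOT source by hand. The sketch below uses hypothetical (function, argument type, declaring file) triples instead of real cursors, and plain string formatting instead of the graphviz package; like a strict graph, it keeps each edge only once.

```python
# Hypothetical triples: (function, canonical argument type, declaring file).
TRIPLES = [
    ("ppp_print", "struct netdissect_options *", "netdissect.h"),
    ("ppp_print", "const unsigned char *", "_u_char.h"),
    ("handle_ctrl", "struct netdissect_options *", "netdissect.h"),
]


def to_dot(triples):
    """Render function -> type -> file edges as DOT text, without duplicates."""
    edges = []
    for fn, arg_type, declr_file in triples:
        for edge in ((fn, arg_type), (arg_type, declr_file)):
            if edge not in edges:  # one edge per pair, as in a strict graph
                edges.append(edge)
    lines = ["strict digraph {", "  rankdir=LR;"]
    lines += [f'  "{src}" -> "{dst}";' for src, dst in edges]
    lines.append("}")
    return "\n".join(lines)


dot_source = to_dot(TRIPLES)
print(dot_source)
```

Because nodes are keyed by the type spelling, two functions taking the same type point at the same node, which is why shared argument types stand out in the picture.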

The fully working example is shared here as a Colab notebook.

  1. Grafberger S, Groth P, Stoyanovich J, Schelter S. Data distribution debugging in machine learning pipelines. The VLDB Journal. Published online January 31, 2022:1103-1126. doi:10.1007/s00778-021-00726-w
  2. Fu H, Liu C, Wu B, Li F, Tan J, Sun J. CatSQL: Towards Real World Natural Language to SQL Applications. Proc VLDB Endow. Published online February 2023:1534-1547. doi:10.14778/3583140.3583165
  3. Lattner C, Adve V. LLVM: A compilation framework for lifelong program analysis & transformation. International Symposium on Code Generation and Optimization, 2004 CGO 2004. doi:10.1109/cgo.2004.1281665

Customized Jupyter environments on Google Cloud

Kaggle Docker images come with a huge list of pre-installed packages for machine learning, including support for GPU computing. They run within a container as a Jupyter application that users access through its web interface. Running a custom image boils down to these steps:

  • 💡 pulling the right version from the container registry
  • ❗ publishing with appropriate parameters (--runtime flag important for GPU support)

Below we can see what it looks like:

(base) maciej.skorski@shared-notebooks:~$ docker pull
v128: Pulling from kaggle-gpu-images/python
d5fd17ec1767: Pulling fs layer 
(base) maciej.skorski@shared-notebooks:~$ sudo docker run \
>    --name "/payload-container" \
>    --runtime "nvidia" \
>    --volume "/home/jupyter:/home/jupyter" \
>    --mount type=bind,source=/opt/deeplearning/jupyter/,destination=/opt/jupyter/.jupyter/,readonly \
>    --log-driver "json-file" \
>    --restart "always" \
>    --publish "" \
>    --network "bridge" \
>    --expose "8080/tcp" \
>    --label "kaggle-lang"="python" \
>    --detach \
>    --tty \
>    --entrypoint "/" \
>    "" \
>    "/" 
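
When the command has to be issued for several images, it can be convenient to assemble it programmatically. The following is a minimal sketch under stated assumptions: the helper name docker_run_args, the image reference and the port mapping are hypothetical placeholders (the actual values are not reproduced here), while the flags are among those shown in the transcript above.

```python
import shlex


def docker_run_args(image, publish, gpu=True, name="payload-container"):
    """Build the argument list for `docker run`; nothing is executed here."""
    args = ["docker", "run", "--name", name]
    if gpu:
        # the NVIDIA runtime is what makes the GPU visible inside the container
        args += ["--runtime", "nvidia"]
    args += [
        "--volume", "/home/jupyter:/home/jupyter",
        "--restart", "always",
        "--publish", publish,
        "--detach",
        "--tty",
        image,
    ]
    return args


# Hypothetical image and port mapping, for illustration only.
cmd = docker_run_args("example.registry/python-gpu:v1", "127.0.0.1:8080:8080")
print(shlex.join(cmd))
```

Keeping the arguments as a list (rather than one long string) avoids quoting mistakes if the command is later passed to subprocess.run.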

The following test in Python shell shows that we can indeed use GPU 🙂

root@cf1b6f63d729:/# ipython
Python 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.33.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import torch

In [2]: torch.cuda.is_available()
Out[2]: True

In [3]: torch.Tensor([1,2,3]).to(0)
Out[3]: tensor([1., 2., 3.], device='cuda:0')

Repairing user-managed notebooks on Google Cloud

In this note, I am sharing a case study on debugging and fixing jupyter-lab access issues.

The diagnostic script can be run on a VM instance as shown below:

(base) maciej.skorski@shared-notebooks:~$ sudo /opt/deeplearning/bin/

Vertex Workbench Diagnostic Tool

Running system diagnostics...

Checking Docker service status...               [OK]
Checking Proxy Agent status...                  [OK]
Checking Jupyter service status in container...         [ERROR] Jupyter service is not running
Checking internal Jupyter API status...         [ERROR] Jupyter API is not active
Checking boot disk (/dev/sda1) space...         [OK]
Checking data disk (/dev/sdb) space...          [OK]
Checking DNS        [OK]
Checking DNS      [OK]

System's health status is degraded

Diagnostic tool will collect the following information: 

  System information
  System Log /var/log/
  Docker information
  Jupyter service status
  Network information
  Proxy configuration: /opt/deeplearning/proxy-agent-config.json
  Conda environment information
  pip environment information
  GCP instance information

Do you want to continue (y/n)? n

Jupyter service runs from a container, but it somehow stopped in this case 😳

(base) maciej.skorski@shared-notebooks:~$ docker container ls

Not a problem! We can restart the container, carefully choosing the parameters so that it is exposed properly (ports, mounted folders, etc.). The appropriate docker command can be retrieved from a running container on a similar healthy instance using docker inspect:

(base) maciej.skorski@kaggle-test-shared:~$ docker inspect \
>   --format "$(curl -s" 3f5b6d709ccc

docker run \
  --name "/payload-container" \
  --runtime "runc" \
  --volume "/home/jupyter:/home/jupyter" \
  --mount type=bind,source=/opt/deeplearning/jupyter/,destination=/opt/jupyter/.jupyter/,readonly \
  --log-driver "json-file" \
  --restart "always" \
  --publish "" \
  --network "bridge" \
  --hostname "3f5b6d709ccc" \
  --expose "8080/tcp" \
  --env "TENSORBOARD_PROXY_URL=/proxy/%PORT%/" \
  --env "LIT_PROXY_URL=/proxy/%PORT%/" \
  --env "PATH=/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" \
  --env "LC_ALL=C.UTF-8" \
  --env "LANG=C.UTF-8" \
  --env "DL_ANACONDA_HOME=/opt/conda" \
  --env "SHELL=/bin/bash" \
  --env "LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda/lib:/usr/local/lib/x86_64-linux-gnu:/usr/local/nvidia/lib:/usr/local/nvidia/lib64::/opt/conda/lib" \
  --env "CONTAINER_NAME=tf2-cpu/2-11" \
  --env "KMP_BLOCKTIME=0" \
  --env "KMP_AFFINITY=granularity=fine,verbose,compact,1,0" \
  --env "KMP_SETTINGS=false" \
  --env "NODE_OPTIONS=--max-old-space-size=4096" \
  --env "ENABLE_MULTI_ENV=false" \
  --env "LIBRARY_PATH=:/opt/conda/lib" \
  --env "TENSORFLOW_VERSION=2.11.0" \
  --env "KMP_WARNINGS=0" \
  --env "PROJ_LIB=/opt/conda/share/proj" \
  --env "TESSERACT_PATH=/usr/bin/tesseract" \
  --env "PYTHONPATH=:/opt/facets/facets_overview/python/" \
  --env "PYTHONUSERBASE=/root/.local" \
  --env "MPLBACKEND=agg" \
  --env "GIT_COMMIT=7e2b36e4a2ac3ef3df74db56b1fd132d56620e8a" \
  --env "BUILD_DATE=20230419-235653" \
  --label "build-date"="20230419-235653" \
  --label ""="Container: TensorFlow 2-11" \
  --label "git-commit"="7e2b36e4a2ac3ef3df74db56b1fd132d56620e8a" \
  --label "kaggle-lang"="python" \
  --label ""="ubuntu" \
  --label "org.opencontainers.image.version"="20.04" \
  --label "tensorflow-version"="2.11.0" \
  --detach \
  --tty \
  --entrypoint "/" \
  "" \
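
The idea behind this step, reading the settings of a healthy container and replaying them as docker run flags, can be sketched in Python as well. This is not the template used above: it assumes the JSON printed by docker inspect for one container has been loaded into a dictionary, reads a handful of its fields (Name, HostConfig.Runtime, HostConfig.Binds, HostConfig.RestartPolicy, Config.Env, Config.Image), and ignores everything else. The sample record is made up.

```python
def run_flags(info):
    """Translate a subset of one `docker inspect` record into `docker run` flags."""
    host, config = info["HostConfig"], info["Config"]
    flags = ["--name", info["Name"].lstrip("/")]  # inspect reports "/name"
    flags += ["--runtime", host["Runtime"]]
    for bind in host.get("Binds") or []:
        flags += ["--volume", bind]
    flags += ["--restart", host["RestartPolicy"]["Name"]]
    for env in config.get("Env") or []:
        flags += ["--env", env]
    return ["docker", "run", *flags, "--detach", config["Image"]]


# Made-up record with the same shape as one element of `docker inspect` output.
sample = {
    "Name": "/payload-container",
    "HostConfig": {
        "Runtime": "runc",
        "Binds": ["/home/jupyter:/home/jupyter"],
        "RestartPolicy": {"Name": "always", "MaximumRetryCount": 0},
    },
    "Config": {"Env": ["LANG=C.UTF-8"], "Image": "example/image:tag"},
}
print(" ".join(run_flags(sample)))
```

On a GPU instance the runtime copied from a CPU-only container (runc) would need to be changed by hand, so the output is a starting point rather than a command to paste blindly.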

Now the check goes OK 🙂

(base) maciej.skorski@shared-notebooks:~$ sudo /opt/deeplearning/bin/

Vertex Workbench Diagnostic Tool

Running system diagnostics...

Checking Docker service status...               [OK]
Checking Proxy Agent status...                  [OK]
Checking Jupyter service status in container... [OK]
Checking internal Jupyter API status...         [OK]
Checking boot disk (/dev/sda1) space...         [OK]
Checking data disk (/dev/sdb) space...          [OK]
Checking DNS        [OK]
Checking DNS      [OK]

ML Prototyping Environment on Cloud

Teams that collaborate on data-science tasks using cloud platforms often choose to share a preconfigured ML environment, such as the Kaggle Docker Python image. This resolves reproducibility and dependency issues, while individual team members can still add custom packages on top with local virtual environments, for example less common packages for computer vision.

This robust setup requires exposing the base environment through the --system-site-packages flag when configuring the local virtual environment. Below, we see an example of a local environment with the package DeepForest (not present in the Kaggle image).

root@cf1b6f63d729:/home/jupyter/src/tree_counting# python -m venv .deepforest --system-site-packages
root@cf1b6f63d729:/home/jupyter/src/tree_counting# pip install --upgrade pip --quiet
root@cf1b6f63d729:/home/jupyter/src/tree_counting# pip install deepforest --quiet

The local environment can be further exposed to jupyter as a custom kernel.

root@cf1b6f63d729:/home/jupyter/src/tree_counting# source .deepforest/bin/activate
(.deepforest) root@cf1b6f63d729:/home/jupyter/src/tree_counting# python -m ipykernel install --user --name .deepforest --display-name "Kaggle+DeepForest"
Installed kernelspec .deepforest in /root/.local/share/jupyter/kernels/.deepforest

The architecture is shown below.

Dev Environment Architecture, generated with plantuml.

The session below demonstrates the difference between system-level and local packages.

(.deepforest) root@cf1b6f63d729:/home/jupyter/src/tree_counting# python
Python 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import tensorflow
>>> import deepforest
>>> tensorflow.__file__
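
The same check can be automated. The helper below is a sketch, not part of the original setup: package_origin is a hypothetical name, and the classification simply compares the location of a module with the site-packages directory of the active interpreter.

```python
import importlib.util
import sys
import sysconfig


def package_origin(name):
    """Classify a top-level module as 'local', 'system' or 'missing'."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return "missing"
    in_venv = sys.prefix != sys.base_prefix  # True inside a virtual environment
    local_site = sysconfig.get_paths()["purelib"]  # site-packages of this interpreter
    origin = spec.origin or ""
    if in_venv and origin.startswith(local_site):
        return "local"
    return "system"


for name in ("json", "torch", "deepforest"):
    print(name, package_origin(name))
```

Run from the activated environment, packages installed only in the virtual environment should be reported as local, while those inherited through --system-site-packages should be reported as system.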

Finally, it is worth mentioning the Dev Containers extension, which connects the IDE to a running container. Then we can enjoy all the VS Code features 🙂

IDE connected to a container.

Efficient Pre-Commit Hooks with GitHub Actions

Pre-commit is a great tool for running various sanity checks (formatting, linting) on a code base. However, such scanning may be time-consuming (particularly on certain content like notebooks), which hurts both the user experience and the CI/CD bill (minutes are usually paid, except for public repos or very small projects).

Below, I demonstrate how to optimize running pre-commit on GitHub Actions. The key is to cache both the pre-commit package and the hooks it depends on. Note that, as of now (April 2023), pre-commit's native caching handles only the second part. Fortunately, managing its cache is as simple as pointing the GitHub Cache Action at ~/.cache/pre-commit.

name: pre-commit

# event names and the job name are assumptions; adapt them to your repository
on:
  pull_request:
    branches: [experiments]
  push:
    branches: [experiments, main]

jobs:
  pre-commit:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    - uses: actions/setup-python@v4
      with:
        python-version: 3.7
    - name: cache pre-commit deps
      id: cache_pre_commit
      uses: actions/cache@v3
      env:
          cache-name: cache-pre-commit
      with:
        # cache both the pre-commit package (venv) and the hook environments
        path: |
          .pre_commit_venv
          ~/.cache/pre-commit
        key: ${{ env.cache-name }}-${{ hashFiles('.pre-commit-config.yaml','~/.cache/pre-commit/*') }}
    - name: install pre-commit
      if: steps.cache_pre_commit.outputs.cache-hit != 'true'
      run: |
        python -m venv .pre_commit_venv
        . .pre_commit_venv/bin/activate
        pip install --upgrade pip
        pip install pre-commit
        pre-commit install --install-hooks
        pre-commit gc
    - name: run pre-commit hooks
      run: |
        . .pre_commit_venv/bin/activate
        pre-commit run --color=always --all-files
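
The reason for hashing the configuration file into the cache key is that the cache is then refreshed whenever the hooks change. The sketch below mimics that intent in plain Python; it is not the algorithm behind hashFiles, and the helper name cache_key is hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def cache_key(prefix, *files):
    """Combine a prefix with a SHA-256 digest of the given files' contents."""
    digest = hashlib.sha256()
    for f in files:
        digest.update(Path(f).read_bytes())
    return f"{prefix}-{digest.hexdigest()}"


with tempfile.TemporaryDirectory() as tmp:
    config = Path(tmp) / ".pre-commit-config.yaml"
    config.write_text("repos: []\n")
    key_before = cache_key("cache-pre-commit", config)
    # editing the hook configuration yields a different key, hence a cache miss
    config.write_text("repos: []\n# a hook was added\n")
    key_after = cache_key("cache-pre-commit", config)

print(key_before)
print(key_after)
```

A cache miss triggers the install step, which rebuilds the virtual environment and the hook environments before they are saved under the new key.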

Building and Publishing Docker with GitHub Actions

In this post I am sharing my recipe for building and publishing Docker images using GitHub Actions. It concisely wraps up a few steps that beginners often find problematic. In particular:

  • use GitHub secrets to securely store credentials, such as $DOCKER_USER and $DOCKER_PASSWORD, for your docker registry (such as DockerHub or GitHub Container Registry)
  • I recommend logging in to the docker registry via the CLI, rather than using a less transparent GitHub Action; it is as simple as docker login -u $DOCKER_USER -p $DOCKER_PASSWORD
  • use the correct tag pattern when pushing your image to a registry

The sample code is shown below. See it in action in production here and in this template.

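name: docker-image

# the trigger event and the job name are assumptions
on:
  push:
    branches: [ "main" ]
    paths: ["Dockerfile",".github/workflows/docker-image.yaml"]

jobs:
  build:
    runs-on: ubuntu-latest
    # Docker tags and credentials for DockerHub/GitHub Containers, customize!
    env:
      IMAGE_NAME: plantuml-docker
      IMAGE_VERSION: latest
      DOCKER_USER: ${{ secrets.DOCKER_USER }}
      DOCKER_PASSWORD: ${{ secrets.DOCKER_PASSWORD }}
      GITHUB_TOKEN: ${{ secrets.PAT }}
      GITHUB_USER: ${{ github.actor }}  # assumed value
    steps:
    - uses: actions/checkout@v3
    - name: Build and tag the image
      run: |
        # tag patterns are assumed: <user>/<image>:<tag> and ghcr.io/<user>/<image>:<tag>
        docker build . \
          --tag $DOCKER_USER/$IMAGE_NAME:$IMAGE_VERSION \
          --tag ghcr.io/$GITHUB_USER/$IMAGE_NAME:$IMAGE_VERSION
    - name: Publish to DockerHub
      if: env.DOCKER_PASSWORD != ''
      run: |
        docker login -u $DOCKER_USER -p $DOCKER_PASSWORD
        docker push $DOCKER_USER/$IMAGE_NAME:$IMAGE_VERSION
    - name: Publish to GitHub Container registry
      if: env.GITHUB_TOKEN != ''
      run: |
        docker login ghcr.io -u $GITHUB_USER -p $GITHUB_TOKEN
        docker push ghcr.io/$GITHUB_USER/$IMAGE_NAME:$IMAGE_VERSION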
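
The tag pattern is the step that most often goes wrong, so here is a small sketch of how the references are composed. The helper image_refs and the account names are hypothetical; the patterns themselves are the usual ones, user/image:tag for DockerHub and ghcr.io/owner/image:tag for the GitHub Container Registry.

```python
def image_refs(user, image, version="latest", github_user=None):
    """Return the references to tag and push for DockerHub and, optionally, GHCR."""
    refs = [f"{user}/{image}:{version}"]  # DockerHub: <user>/<image>:<tag>
    if github_user:
        # GHCR: ghcr.io/<owner>/<image>:<tag>; repository names must be lowercase
        refs.append(f"ghcr.io/{github_user.lower()}/{image}:{version}")
    return refs


# Hypothetical account names, for illustration only.
refs = image_refs("alice", "plantuml-docker", github_user="Alice-GH")
for ref in refs:
    print(ref)
```

Each reference is used twice: once as a --tag when building and once as the argument of docker push after logging in to the matching registry.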

Effective Caching with GitHub Actions

GitHub Actions is great as a CI/CD platform. However, to be really efficient, workflows need to leverage some optimization techniques, such as caching or running tasks in parallel. In this note, I am sharing some thoughts on how to use cache effectively, with respect to multiple paths and sudo-installed APT packages. The discussion will touch on a few non-trivial aspects that, in my opinion, are not well-explained in other web materials.

My use case was simple: speed up building a university course in Sphinx, to be hosted on GitHub Pages. Installing the Sphinx dependencies required multiple Python and Linux APT downloads, which took quite long. The caching solution indeed fixed the problem, and here are the key takeaways:

This repository demonstrates the working solution. The cache size (Python and APT packages) is about 120MB. And this is what the job looks like:

name: docs

on: [push, pull_request, workflow_dispatch]

jobs:
  build:  # job name is an assumption
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
      - name: prepare virtual environment
        run: |
          python -m venv .venv
          mkdir .apt
      - name: cache dependencies
        id: cache_deps
        uses: actions/cache@v3
        env:
            cache-name: cache-dependencies
        with:
          # both the Python environment and the APT cache directory are cached
          path: |
            .venv
            .apt
          key: ${{ runner.os }}-build-${{ env.cache-name }}-${{ hashFiles('.github/workflows/*') }}
      - name: Install python dependencies
        if: ${{ steps.cache_deps.outputs.cache-hit != 'true' }}
        run: |
          source .venv/bin/activate
          pip install jupyter-book
          pip install sphinxcontrib-plantuml
      - name: Install sudo dependencies
        run: |
          sudo apt-get -o Dir::Cache=".apt" update
          sudo apt-get -o Dir::Cache=".apt" install plantuml
      - run: |
          apt-config dump | grep Dir::Cache
      - name: Compile Docs
        run: |
          source .venv/bin/activate
          jupyter-book build docs
      - name: Deploy to gh-pages
        uses: peaceiris/actions-gh-pages@v3
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_branch: gh-pages
          publish_dir: ./docs/_build/html

Making SSH work by proxy

It is a popular misconception that hiding encrypted connections (SSH) behind a proxy is a dark domain reserved for criminal activity. You may need a Russian or Iranian proxy simply to get your coding job done when the firewall of your favourite coffee place, or the wifi you use while travelling, forbids SSH.

As this happens to me regularly (I travel a lot), I would like to share a solution here. The most effective approach is to proxy the traffic through port 443 (the default for HTTPS, typically enabled). Testing the list of free proxy servers [1], we find that the proxy (Iranian) is working well. It remains to add a proxy instruction to the ssh configuration as shown below. That's it, and I can work with GitHub in my favourite coffee place in France. Enjoy!

# content of .ssh/config
    User git
    Port 22
    IdentityFile ~/.ssh/id_rsa
    StrictHostKeyChecking no
    ProxyCommand ncat --proxy 22
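
Before reaching for a proxy, it is worth confirming that the firewall really blocks port 22 while leaving 443 open. The helper below is a minimal sketch using only the standard library; the name port_open is hypothetical and the check is a plain TCP connection attempt, nothing SSH-specific.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers refused connections, timeouts and DNS failures
        return False
```

For example, comparing port_open("github.com", 22) with port_open("github.com", 443) on the restricted network shows whether only SSH is filtered; the same call can be used to test a candidate proxy from the list.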

  1. List of free proxies. Proxy list for port 443.

Free and robust Tweets extraction

As anticipated by many, Twitter stopped offering its (limited!) API for free [1].

Now, what options do you have to programmatically access the public content for free?
In this context, it is worth mentioning the library snscrape, a tool (well-maintained as of now) for extracting content from social media services such as Facebook, Instagram or Twitter [2]. I have just given it a go, in the scope of the research project I am working on, and would love to share some thoughts and code.

The basic usage is pretty simple, but I added multithreading to improve speed by executing queries in parallel (an established way of handling I/O bound operations). I also prefer a functional/pipeline style of composing Python commands, using generators, filter and map features. The code snippet below (see also the Colab notebook) shows how to extract tweets of top futurists. Enjoy!

# install the social media scraper: !pip3 install snscrape
import snscrape.modules.twitter as sntwitter
import itertools
import multiprocessing.dummy as mp # for multithreading 
import datetime
import pandas as pd

start_date = datetime.datetime(2018,1,1,tzinfo=datetime.timezone.utc) # from when
attributes = ('date','url','rawContent') # what attributes to keep

def get_tweets(username,n_tweets=5000,attributes=attributes):
    tweets = itertools.islice(sntwitter.TwitterSearchScraper(f'from:{username}').get_items(),n_tweets) # invoke the scraper
    tweets = filter(lambda t: t.date>=start_date, tweets) # keep only tweets since start_date
    tweets = map(lambda t: (username,)+tuple(getattr(t,a) for a in attributes),tweets) # keep only attributes needed
    tweets = list(tweets) # the result has to be pickle'able
    return tweets

# a list of accounts to scrape
user_names = ['kevin2kelly','briansolis','PeterDiamandis','michiokaku']

# parallelise queries for speed!
with mp.Pool(4) as p:
    results = p.map(get_tweets, user_names)
    # combine the per-user lists into one
    results = list(itertools.chain(*results))
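
The pipeline can be tried without network access by swapping the scraper for a stub. In the sketch below, fake_scraper is a hypothetical stand-in that yields one (date, text) pair per year; everything else follows the same islice, filter, map and thread-pool pattern as above.

```python
import datetime
import itertools
import multiprocessing.dummy as mp  # thread-based Pool with the multiprocessing API

START = datetime.datetime(2018, 1, 1, tzinfo=datetime.timezone.utc)


def fake_scraper(username):
    """Hypothetical stand-in for a scraper: lazily yields (date, text) pairs."""
    for year in itertools.count(2016):
        date = datetime.datetime(year, 6, 1, tzinfo=datetime.timezone.utc)
        yield date, f"{username} in {year}"


def get_items(username, n_items=5):
    items = itertools.islice(fake_scraper(username), n_items)  # cap the stream
    items = filter(lambda t: t[0] >= START, items)  # keep recent items only
    items = map(lambda t: (username, *t), items)  # attach the account name
    return list(items)  # materialise inside the worker thread


user_names = ["alice", "bob", "carol"]
with mp.Pool(3) as pool:
    results = pool.map(get_items, user_names)  # one query per thread
results = list(itertools.chain(*results))
print(len(results))
```

Because the generators are lazy, nothing is fetched until list() is called inside the worker, so each thread does its own I/O and only the finished lists are passed back.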

  1. @TwitterDev. Twitter announces stopping free access to its API. Twitter Dev Team. Published February 3, 2023. Accessed February 15, 2023.
  2. snscrape. snscrape. Github Repository. Accessed February 15, 2023.

Fourier integrals vanishing on large circles

When evaluating contour integrals, it is often of interest to prove that Fourier-type integrals vanish on large enough semicircles (see the figure). This holds under the following condition:

Theorem. Suppose that $$f(z)=O(|z|^{-a}), \quad a>0$$ for \(z\) in the upper half-plane. Then for any \(\lambda > 0\) we have $$\int_{\gamma_R} f(z)\mathrm{e}^{i\lambda z}\,\mathrm{d}z \rightarrow 0, \quad R\to+\infty,$$ where \(\gamma_R\) is the upper half-circle of radius \(R\).

This result is stronger than other criteria for integrals vanishing over contours in the upper half-plane; compare, for instance, with the MIT lecture notes by Jeremy Orloff [1]. The version above can be found in advanced books on Fourier transforms, for example [2].

To prove this, parametrize the upper half-circle \(\gamma_R\) by \(z=R\mathrm{e}^{i\theta} = R(\cos\theta + i\sin\theta)\) where \(0<\theta<\pi\). Under this parametrization, the Fourier multiplier becomes \(\mathrm{e}^{i\lambda z} = \mathrm{e}^{-\lambda R \sin \theta}\mathrm{e}^{i R \lambda \cos\theta}\). By assumption, there is a constant \(C\) such that \(|f(z)|\leqslant C R^{-a}\) on \(\gamma_R\) for all sufficiently large \(R\). Thus, the integral can be bounded by
$$\begin{aligned}
\left|\int_{\gamma_R} f(z)\mathrm{e}^{i\lambda z}\,\mathrm{d}z\right| &\leqslant \int_{0}^{\pi} |f(R\mathrm{e}^{i\theta})| R \mathrm{e}^{-R\lambda \sin\theta}\,\mathrm{d}\theta \\
&\leqslant C\int_{0}^{\pi} R^{1-a} \mathrm{e}^{-R\lambda \sin\theta}\,\mathrm{d}\theta \\
&= 2C\int_{0}^{\frac{\pi}{2}} R^{1-a} \mathrm{e}^{-R\lambda \sin\theta}\,\mathrm{d}\theta \\
&\leqslant 2C\int_{0}^{\frac{\pi}{2}} R^{1-a} \mathrm{e}^{-2 R\lambda \theta / \pi}\,\mathrm{d}\theta \\
&= C\cdot \frac{\pi R^{-a} \left(1 - \mathrm{e}^{-R \lambda}\right)}{\lambda},
\end{aligned}$$
where the third line uses the symmetry of \(\sin\theta\) about \(\theta=\frac{\pi}{2}\) and the fourth uses the inequality \(\sin\theta \geqslant 2\theta/\pi\) for \(0\leqslant\theta\leqslant\frac{\pi}{2}\). The last expression tends to zero as \(R\to \infty\) because \(a>0\).
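
The bound can be sanity-checked numerically. The sketch below is an illustration under stated assumptions, not part of the proof: it takes the example \(f(z)=z^{-1/2}\) (so \(a=\frac12\) and \(C=1\)), approximates the contour integral with a midpoint rule, and compares its modulus with \(C\pi R^{-a}(1-\mathrm{e}^{-R\lambda})/\lambda\). The function names are hypothetical.

```python
import cmath
import math


def semicircle_integral(f, lam, R, n=20000):
    """Midpoint-rule approximation of the integral of f(z) e^{i lam z} dz
    over the upper half-circle |z| = R."""
    h = math.pi / n
    total = 0j
    for k in range(n):
        theta = (k + 0.5) * h
        z = R * cmath.exp(1j * theta)
        dz = 1j * z * h  # dz = i R e^{i theta} d(theta)
        total += f(z) * cmath.exp(1j * lam * z) * dz
    return total


def bound(C, a, lam, R):
    """Right-hand side of the estimate derived above."""
    return C * math.pi * R ** (-a) * (1 - math.exp(-R * lam)) / lam


def f(z):
    # example integrand with |f(z)| = |z|^(-1/2), i.e. a = 1/2 and C = 1
    return z ** -0.5


for R in (10.0, 100.0, 1000.0):
    value = abs(semicircle_integral(f, 1.0, R))
    print(R, value, bound(1.0, 0.5, 1.0, R))
```

In each row the computed modulus should stay below the bound, and both columns should shrink as \(R\) grows.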

  1. Orloff J. Definite integrals using the residue theorem. Lecture Notes. Accessed 2023.
  2. Spiegel MR. Laplace Transforms. McGraw Hill; 1965.