Debug CI/CD with SSH

CircleCI is a popular platform for continuous integration (CI) and continuous delivery (CD). While its job status reports are already useful, one can gain much more insight by debugging a job in real time. Here I am sharing a real use-case of debugging a failing job that deploys an app 🙂

Failing jobs are reported in red, and most of the errors caught during execution appear in the terminal. In this case, the environment is unable to locate Python:

The failed job and caught errors.

The error is not very informative. 😮 For more insights, let’s re-run the job in SSH mode:

Re-running the failed CircleCI with SSH.

You will be welcomed with instructions on how to connect via SSH:


CircleCI showing SSH connection string.

Use this command to connect and inspect the environment at the stage where it failed:

mskorski@SHPLC-L0JH Documents % ssh -p 64535 aa.bbb.cc.dd
The authenticity of host '[aa.bbb.cc.dd]:64535 ([aa.bbb.cc.dd]:64535)' can't be established.
ED25519 key fingerprint is SHA256:LsMhHb5fUPLHI9dFdyig4VKw44GTqrA2dkEWT0sZx4k.
Are you sure you want to continue connecting (yes/no)? yes

circleci@bc95bb40fff3:~$ ls project/venv/bin -l
total 300
...
lrwxrwxrwx 1 circleci circleci    7 Aug  1 12:37 python -> python3
lrwxrwxrwx 1 circleci circleci   49 Aug  1 12:37 python3 -> /home/circleci/.pyenv/versions/3.8.13/bin/python3
lrwxrwxrwx 1 circleci circleci    7 Aug  1 12:38 python3.9 -> python3

Bingo! The terminal warns about broken symbolic links:

Terminal highlights broken symbolic links.

The solution in this case was to update the cache. Issues may be far more complex than that, but being able to debug them live comes to the rescue. 😎
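
The same check can be automated instead of relying on terminal colouring. Below is a minimal sketch using only the standard library; the function name find_broken_symlinks is my own, and the directory you pass in (for example project/venv/bin from the session above) is up to you.

```python
import os


def find_broken_symlinks(directory):
    """Return sorted names of symlinks in `directory` whose target cannot be resolved."""
    broken = []
    for entry in os.scandir(directory):
        # os.path.exists follows links, so it is False for a dangling link,
        # including a link that points at another dangling link.
        if entry.is_symlink() and not os.path.exists(entry.path):
            broken.append(entry.name)
    return sorted(broken)
```

Calling find_broken_symlinks("project/venv/bin") inside the SSH session would list the link names whose chain ends in a missing file, which is handy as an early sanity step in the job itself.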

Prototype in Jupyter on Multiple Kernels

For data scientists, it is a must to prototype in multiple virtual environments which isolate different (and often very divergent) sets of Python packages. This can be achieved by linking one Jupyter installation with multiple Python environments.

Use the command which jupyter to show the Jupyter location and jupyter kernelspec list to show available kernels, as shown below:

ubuntu@ip-172-31-36-77:~/projects/eye-processing$ which jupyter
/usr/local/bin/jupyter
ubuntu@ip-172-31-36-77:~/projects/eye-processing$ jupyter kernelspec list
Available kernels:
  .eyelinkparser    /home/ubuntu/.local/share/jupyter/kernels/.eyelinkparser
  eye-processing    /home/ubuntu/.local/share/jupyter/kernels/eye-processing
  pypupilenv        /home/ubuntu/.local/share/jupyter/kernels/pypupilenv
  python3           /usr/local/share/jupyter/kernels/python3

To make an environment a Jupyter kernel, first activate it and install ipykernel inside.

ubuntu@ip-172-31-36-77:~/projects/eye-processing$ source .venv/bin/activate
(.venv) ubuntu@ip-172-31-36-77:~/projects/eye-processing$ pip install ipykernel
Collecting ipykernel
...
Successfully installed...

Then use ipykernel to register the active environment as a Jupyter kernel (choose a name and the destination, e.g. user space; consult python -m ipykernel install --help for more options).

(.venv) ubuntu@ip-172-31-36-77:~/projects/eye-processing$ python -m ipykernel install --name eye-processing --user
Installed kernelspec eye-processing in /home/ubuntu/.local/share/jupyter/kernels/eye-processing

The kernel should now appear in the list of Jupyter kernels. Notebooks may need restarting to notice it.

The environment appears in Jupyter kernels (VS Code).

Open the kernel.json file under the kernel’s path to inspect the configuration details:

{
 "argv": [
  "/home/ubuntu/projects/eye-processing/.venv/bin/python",
  "-m",
  "ipykernel_launcher",
  "-f",
  "{connection_file}"
 ],
 "display_name": "eye-processing",
 "language": "python",
 "metadata": {
  "debugger": true
 }
}
"~/.local/share/jupyter/kernels/eye-processing/kernel.json" [noeol] 14L, 234B

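A kernel stops working when the interpreter listed in its kernel.json disappears, for instance after the virtual environment is deleted or moved. The following is a small sketch that reads the file and checks the first argv entry; the helper names kernel_interpreter and interpreter_exists are illustrative and not part of Jupyter.

```python
import json
import os
import shutil


def kernel_interpreter(kernel_json_path):
    """Return the interpreter (first argv entry) declared in a kernel.json file."""
    with open(kernel_json_path, encoding="utf-8") as fh:
        spec = json.load(fh)
    return spec["argv"][0]


def interpreter_exists(kernel_json_path):
    """Check that the declared interpreter can actually be found."""
    interpreter = kernel_interpreter(kernel_json_path)
    if os.path.isabs(interpreter):
        # Kernels registered from a virtual environment use an absolute path.
        return os.path.isfile(interpreter)
    # A bare command such as "python" is looked up on PATH instead.
    return shutil.which(interpreter) is not None
```

Running interpreter_exists on each path printed by jupyter kernelspec list (with kernel.json appended) gives a quick way to spot stale kernels.
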
Modern Bibliography Management

A solid bibliography database is vital for every research project, yet building it is considered an ugly manual task by many, particularly by old-school researchers. But this does not have to be painful if we use a modern toolkit. The following features appear particularly important:

  • collaborative work (sharing etc)
  • extraction magic (online search, automated record population)
  • tag-annotation support (organize records with keywords or comments)
  • multi-format support (export and import using BibTeX or other formats)
  • plugins for popular editors (Office, Google Docs)

I have had a particularly nice experience with Zotero (many thanks to my work colleagues from SensyneHealth for recommending this!). Let the pictures below demonstrate it in action!

tag and search…

organize notes…

automatically extract from databases…

format and export…

National Bank of Poland leads on helping Ukraine’s currency

Ukrainian refugees find that their currency can hardly be exchanged at fair rates, or is not accepted at all. The lack of liquidity will also hit the Ukrainian government’s urgent expenses (medical supplies or weapons).

The National Bank of Poland (NBP) offered a comprehensive financial package to address these problems. Firstly, it enabled the refugees to exchange the currency at a nearly official rate, through the agreement with the National Bank of Ukraine (NBU) signed on March 18.

Secondly, in a follow-up agreement the NBP offered an FX swap for USD 1 billion, which will provide the Ukrainian central bank with more liquidity.

As Reuters reported on March 24, EU countries are close to following suit and agreeing on a scheme for exchanging Ukrainian cash. The rules disclosed in the draft are close to the NBP scheme (exchange capped at around 10,000 UAH per individual).

Evidence-based quick wins in reducing oil demand

10-point action plan on cutting the oil demand by IEA.

Following Russia’s invasion of Ukraine, an oil supply shock is expected and, what is worse, there may be no supply increase from OPEC+. A way out is to cut demand, as advised by the IEA. The proposed steps are quick wins: easy to implement and to reverse, while having a measurable and significant impact. Their overall impact should balance the supply gap.

Quantitative estimates can be backed with empirical or even statistical evidence. A good example is speed reduction: recent research demonstrates quantitatively what drivers learn by experience, namely that fuel consumption increases considerably with speed beyond a turning point. Turning this statement around, you can likely save a lot of petrol by driving at a lower speed. The study of He et al. finds that cubic approximations fit well, which enables several optimisation considerations for various groups of car users.

Source: „Study on a Prediction Model of Superhighway Fuel Consumption Based on the Test of Easy Car Platform”, He et al.

Robust Azure ETLs with Python

The Microsoft Azure cloud computing platform faces criticism because its Python API is hard to customise and poorly documented. Still, it is a popular choice for many companies, so data scientists need to squeeze the maximum of features and performance out of it rather than complain. Below I am sharing thoughts on creating robust Extract-Transform-Load processes in this setup.

Extract from database: consumer-producer!

The key trick is to enable the consumer-producer pattern: you want to process your data in batches, for the sake of robustness and efficiency. The popular pandas library for tabular data is not enough for this task, as it hides useful features. Gain more control using dedicated database drivers, e.g. the psycopg2 driver for the popular Postgres database.

import psycopg2

dsn = f'user={db_user} password={db_password} dbname={db_name} host={db_host}'
conn = psycopg2.connect(dsn)

query = """
    SELECT *
    FROM sales
    WHERE date > '2021'
    """


def process_rows(row_group):
    '''do your stuff, e.g. filter and append to a file'''
    pass


n_prefetch = 10000


# mind the server-side cursor! 
with conn.cursor(name='server_side_cursor') as cur:
    cur.itersize = n_prefetch
    cur.execute(query)
    while True:
        row_group = cur.fetchmany(n_prefetch)
        if len(row_group) > 0:
            process_rows(row_group)
        else:
            break
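
The batching loop can also be wrapped in a generator, which keeps the producer (the cursor) separate from the consumer (your processing code). This is only a sketch: iter_batches is a name I made up, and the demo uses an in-memory sqlite3 database as a stand-in for Postgres so that it runs without a server. With psycopg2 you would pass the named server-side cursor instead.

```python
import sqlite3


def iter_batches(cursor, batch_size):
    """Yield lists of rows from a DB-API cursor until it is exhausted."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            return
        yield rows


# Demo on an in-memory SQLite table (stand-in for the Postgres sales table).
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
demo_conn.executemany(
    "INSERT INTO sales VALUES (?, ?)", [(i, i * 1.5) for i in range(25)]
)

demo_cur = demo_conn.execute("SELECT id, amount FROM sales ORDER BY id")
batch_sizes = [len(batch) for batch in iter_batches(demo_cur, 10)]
print(batch_sizes)  # [10, 10, 5]
demo_conn.close()
```

The consumer then becomes a plain for loop over iter_batches(cur, n_prefetch), with no explicit break logic.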

Cast datatypes

Cast datatypes early and explicitly. It is best to do this casting upstream, at the database driver level, as in this example:

import psycopg2

# cast some data types upon receiving from database
datetype_casted = psycopg2.extensions.new_type(
    psycopg2.extensions.DATE.values, "date", psycopg2.DATETIME
)
psycopg2.extensions.register_type(datetype_casted)
decimal_casted = psycopg2.extensions.new_type(
    psycopg2.extensions.DECIMAL.values, "decimal", psycopg2.extensions.FLOAT
)
psycopg2.extensions.register_type(decimal_casted)
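
If adapting the driver is not an option, the same two casts (date to datetime, decimal to float) can be applied to each fetched batch in plain Python. This is a fallback sketch under that assumption, not a replacement for driver-level casting; cast_value and cast_rows are hypothetical helper names.

```python
import datetime
from decimal import Decimal


def cast_value(value):
    """Cast Decimal to float and date to datetime; leave other values untouched."""
    if isinstance(value, Decimal):
        return float(value)
    # datetime is a subclass of date, so it has to be excluded explicitly.
    if isinstance(value, datetime.date) and not isinstance(value, datetime.datetime):
        return datetime.datetime.combine(value, datetime.time.min)
    return value


def cast_rows(rows):
    """Apply cast_value to every field of every row in a fetched batch."""
    return [tuple(cast_value(v) for v in row) for row in rows]
```

This costs a Python-level pass over every value, which is one reason why casting inside the driver is preferable for large extracts.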

Use Parquet to store tabular data

Apache Parquet is invaluable for working efficiently with large data in tabular format. There are two major drivers for Python: pyarrow and fastparquet. The first one can be integrated with Azure data flows (e.g. you can stream and filter data), although it has been limited in supporting more sophisticated data types such as timedelta. Remember that writing Parquet incurs memory overhead, so it is better to do this in batches. The relevant code may look as below:

import os

import pandas as pd


def sql_to_parquet(
    conn, query, column_names, target_dir, n_prefetch=1000000, **parquet_kwargs
):
    """Writes the result of a SQL query to a Parquet file (in chunks).

    Args:
        conn: Psycopg connection object (must be open)
        query: SQL query of "select" type
        column_names: column names given to the resulting SQL table
        target_dir: local directory where Parquet is written to; must exist, data is overwritten
        n_prefetch: chunk of SQL data processed (read from SQL and dumped to Parquet) at a time. Defaults to 1000000.
    """
    with conn.cursor(name="server_side_cursor") as cur:
        # start query
        cur.itersize = n_prefetch
        cur.execute(query)
        # set up consumer
        chunk = 0
        # consume until stream is empty
        while True:
            # get and process one batch
            row_group = cur.fetchmany(n_prefetch)
            chunk += 1
            if len(row_group) > 0:
                out = pd.DataFrame(data=row_group, columns=column_names)
                fname = os.path.join(target_dir, f"part_{chunk:04d}.parquet")
                out.to_parquet(fname, engine="pyarrow", **parquet_kwargs)
            else:
                break


def df_to_parquet(df, target_dir, chunk_size=100000, **parquet_kwargs):
    """Writes pandas DataFrame to parquet format with pyarrow.

    Args:
        df: pandas DataFrame
        target_dir: local directory where parquet files are written to
        chunk_size: number of rows stored in one chunk of parquet file. Defaults to 100000.
    """
    for i in range(0, len(df), chunk_size):
        slc = df.iloc[i : i + chunk_size]
        chunk = i // chunk_size
        fname = os.path.join(target_dir, f"part_{chunk:04d}.parquet")
        slc.to_parquet(fname, engine="pyarrow", **parquet_kwargs)
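
To see how rows map to part files before writing anything, the slicing logic can be isolated into a small planning helper. The sketch below mirrors the numbering used in df_to_parquet (parts start at 0000); plan_chunks is an illustrative name, and no Parquet library is needed to run it.

```python
import os


def plan_chunks(n_rows, chunk_size, target_dir):
    """Return (start, stop, file name) for each part a chunked export would write."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    plan = []
    for start in range(0, n_rows, chunk_size):
        # The last chunk may be shorter than chunk_size.
        stop = min(start + chunk_size, n_rows)
        fname = os.path.join(target_dir, f"part_{start // chunk_size:04d}.parquet")
        plan.append((start, stop, fname))
    return plan
```

For example, plan_chunks(250, 100, "out") describes three parts covering rows 0-100, 100-200 and 200-250, which makes it easy to estimate the number and size of files for a given chunk_size.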

Leverage Azure API properly!

The Azure API is both cryptic in documentation and under-explained in data science blogs. Below I am sharing a comprehensive recipe for uploading and registering Parquet data with minimum effort. While uploading makes the data persist, registering enables further Azure features (such as pipelining). It is best to dump your Parquet files to a temporary directory created with tempfile (large Parquet files should be chunked), then use the Azure Dataset class both to upload to the storage (use a universally unique identifier for reproducibility and to avoid path collisions) and to register to the workspace (use the Tabular subclass). Use the Workspace and Datastore classes to facilitate interaction with Azure.

from azureml.core import Workspace, Dataset
import tempfile
from uuid import uuid4
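import psycopg2

# assumed to be defined earlier: conn (open psycopg2 connection), query,
# query_cols (column names) and tab (table name used in paths and dataset names)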

ws = Workspace.from_config()
dstore = ws.datastores.get("my_datastore")

with tempfile.TemporaryDirectory() as tempdir:
    sql_to_parquet(conn, query, query_cols, tempdir)
    target = (dstore, f"raw/{tab}/{str(uuid4())}")
    Dataset.File.upload_directory(tempdir, target)
    ds = Dataset.Tabular.from_parquet_files(target)
    ds.register(
        ws,
        name=f"dataset_{tab}",
        description="created from Sales table",
        tags={"JIRA": "ticket-01"},
        create_new_version=True,
    )

Quantitative overview of Nord Stream 2 issues

The briefing document of the European Parliament offers a quantitative and comprehensive view on the issues with Nord Stream 2. This excellent figure alone explains a lot:

European Parliament Briefings: Russian pipelines to Europe and Turkey.

It may be interesting to realise that the points this document makes are underrepresented in the popular press, such as the BBC. These concerns are discussed in measurable terms:

  • The existing capacity is enough
  • Nord Stream 2 harms the diversification of routes (Polish/Ukrainian paths reduced, Germany becoming the main hub)
  • There is a documented history of using the gas infrastructure for political purposes
  • EU bodies warned about political and security risks

Iron logic defeats strawman

The straw man rhetorical technique is widely used: from scientific reviews (academia), through job interviews (industry), to political discourse (international relations). The idea is simple: you distort arguments or opposing views to refute them more easily. Not only is this trick popular, it has also been proven statistically successful.

The best weapon against it is logically strict reasoning. Do not rush into quick answers and judgments; rather, play the claim back and evaluate or falsify it in measurable terms as far as possible. This note aims to spread awareness of this logical fallacy and give illustrative examples, which appears particularly worthwhile given the rise of misinformation.

  • Academia: a popular „straw man” variation is when reviewers exaggerate English mistakes in scientific papers, concluding that the math content may be equally incorrect. While scientists should strive for perfection at the communication level, in life and computer sciences the technical content is more important. Fix the English but don’t let your „this paper presents findings of merit” claim get distorted by this. Remember that a) researchers compete with each other and b) it’s easier to complain about misspelled words than to evaluate complex math.
  • Industrial career: you believe you are a good fit, but the interviewers exaggerate the necessity of some skill X which you don’t have at the moment. We speak of a fallacy when X is not essential for the role, or is trivial to learn. Don’t let your „I am a good fit” get distorted!
  • International relations: instead of my own examples, I will refer to the excellent article written by the UK Defence Secretary. He attempts to falsify several claims coming from the government circles of the Russian Federation against the NATO alliance. One is „Russia being encircled by NATO”, to which he replies by asking to measure the shared border.

Can Russia defend its economy from sanctions?

While Western leaders are proud of imposing the largest sanctions ever, it seems that Russia can withstand the crisis longer than some expect. The following two points, underrepresented in popular media, support this claim:

Russia seems both experienced and equipped to contain this crisis in the short term. After the crisis in 2014 it has used oil-generated income to build up more reserves and keep funding its military.

Russian reserves in mln USD
Russian military spending in bln USD

But time will tell what the long-term financial consequences for the West and Russia will be.

Banks in Austria block financial help amid war on Ukraine

Amid Russia’s invasion, some banks block money transfers to Ukraine. This effectively undermines the help efforts of both the international community and expats. Here comes the wall of shame; a picture is worth a thousand words:

Poland does the right thing: Polish banks offered free transfers to Ukraine.