Jan Janssen — July 30, 2025
With the rise of machine-learned interatomic potentials in the atomistic simulation community in materials science, simulation workflows have grown in complexity: from directed acyclic graph (DAG) based workflows that typically use a single simulation code to workflows that couple simulation codes at different length and time scales and comprise thousands of individual simulations. To orchestrate these workflows, a number of simulation frameworks were developed. Still, as part of the Exascale Computing Project, we realized that these were limited in their flexibility and scalability.
So we developed executorlib[1], based on the concurrent.futures Executor interface in the Python standard library, with the goal of distributing Python functions over hundreds of compute nodes. Internally, executorlib leverages the Simple Linux Utility for Resource Management (SLURM) and the flux framework from Lawrence Livermore National Laboratory[2] to scale up simulation workflows from a workstation or traditional high-performance computers (HPC) to the latest generation of Exascale machines. In contrast to previous solutions, it does not require any daemon process or database but instead directly interfaces with the job manager to maximize computational efficiency. At the same time, it is designed with a focus on debugging capabilities to minimize the overhead of migrating workflows to the Exascale machines.
In this presentation, I introduce executorlib, highlighting the lessons learned from the development of the pyiron atomistic simulation suite, which led to the development of executorlib as a minimalistic workflow manager. Finally, I highlight the general applicability of executorlib to distribute Python functions from any scientific domain on HPC clusters of all sizes.
[1]: Janssen et al., JOSS, 10(108), 7782, (2025).
[2]: Dong H. Ahn et al., Fut. Gen. Comp. Sys., 110, (2020).
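The Executor interface from the Python standard library that the abstract refers to can be sketched as follows. This is only an illustration with the standard library's ThreadPoolExecutor; it does not use executorlib itself, and the function `energy` is a hypothetical placeholder for a simulation task.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def energy(volume: float) -> float:
    """Hypothetical stand-in for an expensive simulation call."""
    return (volume - 1.0) ** 2


volumes = [0.9, 1.0, 1.1]

# submit() returns a Future immediately; result() blocks until the task is done.
with ThreadPoolExecutor(max_workers=2) as exe:
    futures = {exe.submit(energy, v): v for v in volumes}
    results = {futures[f]: f.result() for f in as_completed(futures)}

# Pick the volume with the lowest (placeholder) energy.
best_volume = min(results, key=results.get)
print(best_volume)
```

Any class that implements the same submit/Future contract can be swapped in for ThreadPoolExecutor without changing the calling code, which is the property an Executor-based design relies on.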
Gijs Burghoorn — June 25, 2025
Parquet is an important file format used in the data science world. It provides many opportunities for query optimization and data pruning. Based on our experience optimizing the Polars Parquet reader, we look at how Parquet stores data and how it is used, and rediscover many of the reader optimizations.
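One form of data pruning that Parquet makes possible is skipping row groups whose stored min/max statistics cannot match a filter. A minimal pure-Python sketch of that idea follows. It assumes a hypothetical `RowGroup` record instead of any real reader API, and it is not the Polars implementation.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Hypothetical stand-in for a Parquet row group with column statistics."""
    min_value: int
    max_value: int
    values: list


def scan_greater_than(row_groups, threshold):
    """Return values > threshold, reading only row groups that can contain matches."""
    matches, groups_read = [], 0
    for rg in row_groups:
        if rg.max_value <= threshold:
            continue  # statistics prove no value qualifies: skip without decoding
        groups_read += 1
        matches.extend(v for v in rg.values if v > threshold)
    return matches, groups_read


row_groups = [
    RowGroup(1, 10, [1, 5, 10]),
    RowGroup(11, 20, [11, 15, 20]),
    RowGroup(21, 30, [21, 25, 30]),
]
matches, groups_read = scan_greater_than(row_groups, 18)
print(matches, groups_read)
```

Here the first row group is never decoded, because its maximum is below the threshold.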
Watch on Youtube
model.fit(X, y).predict(X)
Guillaume Lemaitre — April 30, 2025
Scikit-learn is one of the de facto libraries when it comes to predictive modeling with tabular data. For over a decade, it has provided traditional and reliable algorithms to address data science problems. While it excels at model fitting and prediction, these stages represent only a small portion of a data science project and are relatively well-defined. Many data scientists are familiar with the notion that 90% of their time is spent on preprocessing, while the modeling stage takes up only 10% of their efforts. Additionally, tracking and organizing experiments, as well as transitioning from experimentation to production, can be challenging.
This exchange aims to shed light on recent developments and efforts within the scikit-learn ecosystem. We will provide an overview of the following tools through a series of short notebook demos.
Watch on Youtube
Marco Gorelli — March 25, 2025
Marco will discuss Narwhals, a lightweight compatibility layer between dataframes, and how it's silently changing many of the data science tools you’re already using.
Watch on Youtube
Gaël Varoquaux — February 26, 2025
While tabular data is central to all organizations, it seems left out of the AI discussion, which has focused on images, text and sound. Indeed, for data science, most of the excitement is in machine learning, but most of the work happens before. Tables often require extensive manual transformation or "data wrangling". I will discuss how we progressively rethought this process, building machine learning tools that require less wrangling. We are building a new library, skrub (https://skrub-data.org), that facilitates complex tabular-learning pipelines, expressing as much wrangling as possible as high-level operations and automating them. A few lines of skrub can spare you dozens of wrangling lines!
Watch on Youtube
Juan Luis — January 29, 2025
Combining Python with compiled languages for speed is far from novel. And yet, Rust has proven to be a particularly solid companion for Python, thanks in part to the great tooling available. In this exchange we will give a walkthrough of how to create your first Rust extension for Python.
Watch on Youtube
Sylvain Corlay — December 04, 2024
We will discuss the future of the Jupyter project from a features and technical standpoint, presenting ongoing and future work on the project (collaborative editing, CAD, GIS, etc.).
Watch on Youtube
Wes McKinney — October 30, 2024
In this discussion, I'll give an overview of some different projects and development work happening in the Python ecosystem that I'm excited about. This includes data frame libraries, data wrangling frameworks like Ibis, core processing libraries like Apache Arrow and DuckDB, and Python core work in tools like ruff, uv, pixi, and rye.
Watch on Youtube
Hugo Shi — August 28, 2024
In this discussion, we will explore how Kubernetes, an open-source container orchestration platform, is used in data science and machine learning. As the demand for scalable and reproducible data science workflows grows, Kubernetes offers a robust solution for managing the complexities of deployment, scaling, and maintenance. We will begin by discussing the key challenges faced by data scientists and machine learning engineers, such as environment inconsistencies and infrastructure scalability, and how Kubernetes effectively addresses these issues.
The discussion will cover the basics of Kubernetes, including its core components and architecture, tailored specifically for a data science audience. We will delve into setting up data science environments on Kubernetes, from containerizing workflows to managing dependencies. Furthermore, the talk will highlight advanced topics like scaling data science workloads, orchestrating machine learning pipelines with tools like Kubeflow, and deploying machine learning models in production.
By the end of this session, attendees will have a clear understanding of how Kubernetes can enhance their data science and machine learning workflows, providing a scalable, efficient, and reproducible infrastructure. No Kubernetes knowledge is required for this presentation.
Watch on Youtube
Kyle Chard — July 31, 2024
Globus Compute is a distributed Function as a Service (FaaS) platform that enables flexible, secure, scalable, and high performance remote Python function execution. Unlike centralized FaaS platforms, Globus Compute allows users to execute functions on heterogeneous remote computers, from laptops to leadership computing facilities. In this talk, Kyle will describe how Globus Compute can enable researchers to easily scale and execute their Python programs on remote computers. Examples showing how Globus Compute is being used in various applications across the national labs will be presented.
Watch on Youtube
Dimo Angelov — June 26, 2024
In my talk, I will present Top2Vec, an open source topic modeling project. I will begin by discussing the inspiration and motivations behind creating the algorithm and project. Following that, I will explain how Top2Vec differs from traditional topic modeling techniques, highlighting its distinctive features and advantages. To conclude, I will demonstrate some practical applications of Top2Vec.
Watch on Youtube
James Colliander — May 29, 2024
The digital revolution that started three-quarters of a century ago changed us. We, the people of Earth, are connected at light speed to each other, and to devices all around our planet. We receive signals from as far away as Voyager 1. We are virtually omnipresent. This talk will explore the challenges and opportunities for science in our era of ubiquitous cloud computing.
Watch on Youtube
Jørgen Dokken — April 24, 2024
In this exchange, I will present the FEniCS project (https://fenicsproject.org), an open-source toolkit for solving partial differential equations. The FEniCS Project is composed of several packages, written in either C++ or Python. I will delve into the structure of the project and its evolution over time. Specifically, I will explain our use of Python for generating C-code and how we interact with C++.
Watch on Youtube
Will Barnes — March 27, 2024
The field of solar physics is primarily concerned with understanding the physics of the complex outer atmosphere of the Sun, the solar corona. Though visible to the naked eye only during a solar eclipse, the corona is observed across the electromagnetic spectrum, including the visible, extreme ultraviolet and x-ray wavelengths, by both space- and ground-based observatories. We are currently entering a so-called "golden age" of solar physics in which an increasing number of future solar observatories promise to deepen our understanding of this complex plasma environment. However, effectively using these data to understand the fundamental physical processes in the solar atmosphere requires software capable of searching, ingesting, combining, and analyzing data from these many different data sources, all of which are growing both in size and complexity. The SunPy Project aims to solve this challenge by building and maintaining an ecosystem of interoperable, community-developed, open-source Python packages for analyzing solar physics data.
In this talk, I'll provide a brief history of the SunPy Project, the current state of the Project as well as the software landscape in solar physics more broadly, and discuss how the Project is moving forward. Throughout my talk, I will illustrate these points using my own journey into the world of open-source scientific Python software development and how that has informed my career as a solar physicist.
Watch on Youtube
David Nicholson, Ph.D. — February 28, 2024
In this exchange, I will present VocalPy, the core package of a software community for researchers who study how animals communicate with sound. For this audience, I will focus on the domain-driven design approach I am taking to develop the package. This approach has resulted in features that are (arguably) not common in scientific Python packages. Examples include domain-specific data types and pipelines consisting of classes that use callbacks. I will contrast these features with the design of many core scientific Python packages, which typically place an emphasis on a purely functional approach operating mainly on NumPy arrays. I will discuss the trade-offs involved; for example, the potential for increased readability and reproducibility, but along with that a potential increase in maintenance burden. You will walk away with a better understanding of bioacoustics and animal communication, and some food for thought about how we design scientific software.
Watch on Youtube
Dr. Hans Dembinski — January 31, 2024
I will start with a general introduction to the theory of fitting and why you may want to use iminuit. I will show a few difficult nuts to crack with iminuit and, more generally, how to get the best performance for a particular fit. Fitting speed in research matters: if an analysis runs quickly, one can do more checks and variations, and it makes computationally intensive methods like the bootstrap for error estimation feasible.
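To see why the bootstrap is computationally intensive, note that it repeats the estimate once per resampled data set. The sketch below is a minimal standard-library illustration, not iminuit's API: the sample mean stands in for a full fit, and the function name `bootstrap_std_error` and the data values are hypothetical.

```python
import random
import statistics


def bootstrap_std_error(data, estimator, n_resamples=1000, seed=0):
    """Estimate the standard error of `estimator` by resampling with replacement."""
    rng = random.Random(seed)
    estimates = [
        estimator(rng.choices(data, k=len(data)))  # one "fit" per resample
        for _ in range(n_resamples)
    ]
    return statistics.stdev(estimates)


data = [4.9, 5.1, 5.0, 5.3, 4.7, 5.2, 4.8, 5.0]
error = bootstrap_std_error(data, statistics.mean)
print(f"mean = {statistics.mean(data):.3f} +/- {error:.3f}")
```

With a real fit in place of the mean, the estimator runs `n_resamples` times, so the cost of the whole procedure scales directly with the time of a single fit.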
Watch on Youtube
Dr. Katrina Riehl — November 29, 2023
We have seen innumerable advancements in the scientific community due to the shift toward open science. We've learned, as a community, we must work together in order to build the next generation of scientific innovation. The history of the scientific computing ecosystem is intricately tied to its open source initiatives. One cannot succeed without the other. In this talk, we will walk through where we've been, where we are going, and the lessons we've learned along the way.
Watch on Youtube
Stuart Campbell — November 01, 2023
I start by giving a brief background on my career path and my journey in working at many world-class, large-scale experimental user facilities in both Europe and the United States. I’ve been involved in a number of cross-facility and cross-institution software projects. I present my thoughts based on my experiences working on these projects, highlighting some of the challenges that face such collaborations.
Watch on Youtube
Dr. Titus Brown — September 27, 2023
Dr. Brown discusses his experience in developing two scientific Python packages, khmer and sourmash, designed for dealing with really large sequencing data sets. Topics include switching from C++ to Rust for the performance layer, integrating tests and documentation into your daily life, and the interplay between algorithmic novelty and effective implementations.
Watch on Youtube
Dr. Giordon Stark — August 30, 2023
Searches for new physics at the Large Hadron Collider have constrained many models of physics beyond the Standard Model. Many searches also provide resources that allow them to be reinterpreted in the context of other models. We describe a reinterpretation pipeline that examines previously untested models of new physics using supplementary information from ATLAS SUSY searches, such as public analysis routines and serialized likelihoods, in a way that provides accurate limits even in models that differ meaningfully from the benchmark models of the original analysis. These resources are combined with common event generation and simulation toolkits MadGraph, Pythia, and Delphes into workflows steered by TOML configuration files, and bundled into the mapyde python package.
Watch on Youtube
Leah Wasser — June 28, 2023
Our core program is a peer review of open source software that seeks to improve the quality and usability of Python tools that scientists depend on to process, visualize and analyze their data. We have also been working on guides to help scientists better understand how to package code so others can use it.
In this session, Leah provides an overview of our peer review process and packaging guide. She will also discuss how pyOpenSci provides support and visibility for maintainers who are creating scientific Python tools.
Watch on Youtube
Rafael Ferreira da Silva — May 31, 2023
Scientific workflows are critical tools in modern scientific computing, enabling the orchestration of large and complex experiments that span multiple facilities and computational resources. The Workflows Community Initiative (https://workflows.community) is a collaborative effort aimed at advancing the state-of-the-art in workflow systems and related technologies.
The initiative brings together researchers, developers, and practitioners to foster communication and collaboration on important issues related to workflow management, including workflow design, execution, monitoring, optimization, and interoperability. As part of the DOE ECP’s ExaWorks project, we have defined PSI/J (Portable Submission Interface for Jobs), a simple and portable interface for submitting jobs to distributed computing resources, regardless of their underlying architecture or middleware. Our Python-based reference PSI/J implementation supports a range of job submission types, including batch jobs, parallel jobs, and interactive jobs.
Watch on Youtube
Talley Lambert — April 26, 2023
Talley presented the motivation and general design of magicgui, a Python library that facilitates the autogeneration of GUIs based on type annotations. In its high-level interface, magicgui attempts to solve this by mapping Python type annotations to widgets (for those familiar with ipywidgets, the goal is similar to ipywidgets.interact).
At a lower level, magicgui provides a widget abstraction on top of GUI frameworks like Qt or ipywidgets; this allows magicgui code to (mostly) work in either a Jupyter notebook environment or a Qt desktop environment (or anything for which a backend UI adapter exists: a textual UI adapter is in the works).
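The idea of mapping type annotations to widgets can be sketched with the standard library alone. The following does not use magicgui and does not reproduce its API. The `WIDGET_FOR_TYPE` table, the `widgets_for` helper, and the widget names (plain strings here) are illustrative assumptions.

```python
import inspect

# Hypothetical mapping from annotation to a widget label; magicgui's real
# mapping and widget classes are not reproduced here.
WIDGET_FOR_TYPE = {
    int: "SpinBox",
    float: "FloatSpinBox",
    str: "LineEdit",
    bool: "CheckBox",
}


def widgets_for(func):
    """Pick a widget label for each parameter from its type annotation."""
    params = inspect.signature(func).parameters
    return {
        name: WIDGET_FOR_TYPE.get(p.annotation, "LineEdit")  # fall back to text input
        for name, p in params.items()
    }


def blur(sigma: float, iterations: int = 1, preserve_range: bool = False):
    """Example function whose signature drives the generated form."""


layout = widgets_for(blur)
print(layout)
```

A real GUI layer would instantiate backend widgets from these labels and call the function with their current values. The sketch stops at the lookup step.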
Watch on Youtube
Eric Roberts and Peter Zwart — March 29, 2023
We introduce a new Python software library DLSIA, or Deep Learning for Scientific Image Analysis, which offers users a friendly API for flexible deployment of fully customizable, well-established deep convolutional neural networks – Autoencoders, U-Nets, U-Net3+ (U-Net augmented with dense skip connections), and mixed-scale dense networks (MSDNets) – in addition to more novel choices of networks, such as sparse mixed-scale networks (SMSNets), a lean, lower-parameter, and sparsely-connected MSDNet variant based on stochastically-generated graph networks. Overall, with the release of DLSIA, we aim to provide flexible deep learning routines and end-to-end frameworks to aid National Laboratory scientists in many of their image analysis tasks and challenges, including image segmentation, denoising, unsupervised clustering, and uncertainty quantification. In this talk, we showcase the ease-of-use and flexibility of the DLSIA library by highlighting use-cases involving:
- pixel-by-pixel segmentation of volumetric, in-situ, sub-nuclear biological structures,
- inpainting/estimating missing pixel data behind detector gaps in experimental X-ray scattering images,
- image clustering via autoencoder latent space, and
- multi-network aggregation in tomographic imaging.
Watch on Youtube
Leland McInnes & John Healy — February 22, 2023
John and Leland discuss how UMAP, as a Python open source project, became the way it is. They trek through the journey of developing UMAP, going all the way back to PyCon 2016 with the HDBSCAN package, and the rest is, as they say, history.
Watch on Youtube
Pablo Galindo Salgado — January 26, 2023
Python 3.11 is faster than previous Python versions. This is the result of the effort of the Faster CPython collaboration, a team that Guido van Rossum started at Microsoft and that some other contributors and core devs (including myself) later joined as collaborators. In this discussion, I will go into detail on how we are making Python faster, what techniques we are using, what challenges we are facing, and what may be in store for future versions.
Watch on Youtube
Todd Gamblin — November 30, 2022
Spack is an open source package management tool, written in Python, that simplifies the process of building and installing scientific software. It is used widely in the HPC community by end users, HPC facility staff, and software developers who need to manage dependencies. Spack is very general; it is designed to allow packages to be built with many different versions, configurations, build options, and compiler flags, for CPU and GPU machines.
This discussion will give an overview of Spack, its community, and how it enables users to be more productive.
Watch on Youtube
Emanuele Laface — October 26, 2022
Python is a daily instrument in science for data analysis, modeling and computing in general. In this talk, I will discuss the role of Python at the European Spallation Source, a particle accelerator facility for neutron production. I will briefly describe the science that a neutron source can achieve and then my discussion will be focused on the use of Python in our laboratory. In particular, I will talk about Python and Jupyter in the control system of the particle accelerator as a front-end to access the multiple systems used to archive data, query the accelerator devices, and simulate online the dynamics of the particle beam.
Watch on Youtube
Draga Doncila Pop & Juan Nunez-Iglesias — September 28, 2022
Juan and Draga will discuss napari, a Python package for fast array visualization that is equally comfortable working with 2D, 3D, and higher-dimensional image data. Napari can overlay images with the results of downstream processing steps, enabling quality control as well as manual intervention at critical stages — a workflow that previously involved shuttling of data between disparate tools.
They will also discuss napari's plugin interface, which can be used to extend its functionality, and to distribute new tools and methods to collaborators (who may have less Python experience) and the broader scientific community.
Watch on Youtube
Maxwell Grover and Zachary Sherman — August 31, 2022
The Python ARM Radar Toolkit, Py-ART, is a Python module containing a collection of weather radar algorithms and utilities. Py-ART is used by the Atmospheric Radiation Measurement (ARM) Climate Research Facility for working with data from a number of its precipitation and cloud radars, but has been designed so that it can be used by others in the radar and atmospheric communities to examine, process, and analyze data from many types of weather radars.
We discuss the growth of this toolkit, where it fits into the general open radar science community, and future directions of the package!
Watch on Youtube
Jan Janssen — July 27, 2022
With the first Exascale computers becoming available and with machine learning becoming a fundamental building block of many research projects, we have to rethink the way we manage our simulations and analysis. Moving away from shell scripts and a zoo of different utilities and simulation codes, pyiron[1,2], in analogy to an integrated development environment (IDE), provides a central interface for rapid prototyping and up-scaling simulation protocols. The whole simulation lifecycle is represented in pyiron by a class of generic objects – the pyiron objects – which connect to the job management, the data storage interface and the user interface. As a result, the individual objects can be combined like building blocks to construct complex simulation protocols and enable the automation of routine tasks.
For more information, visit the pyiron website: www.pyiron.org
Watch on Youtube
Matthew Feickert, Gordon Watts, and Jim Pivarski — June 29, 2022
The landscape for analysis tools in experimental particle physics has changed drastically over the last decade, with a growing community movement from C++ frameworks to an ecosystem of interoperable Pythonic data analysis tooling. This movement has been spearheaded and supported by the Scikit-HEP organization and the Institute for Research and Innovation in Software for High Energy Physics (IRIS-HEP), which seek to provide the computational and data science open source tools that will enable physicists to have extensible toolkits for approaching the data-intensive challenges of the High Luminosity Large Hadron Collider (LHC) and beyond.
Watch on Youtube
Matthias Bussonnier — May 25, 2022
The more widespread Scientific Python becomes, the more the distributed nature and multiple projects will make it hard for newcomers to understand the landscape. In particular, documentation for "All in one" proprietary and closed source solutions is easier to find. Now that the world is hyper-connected, computing is ubiquitous and CI virtually free, I believe we can have a much better documentation experience.
Watch on Youtube
Wolfgang Kerzendorf — April 27, 2022
Wolfgang discusses fundamental questions of Astrophysics and how we can use AI & Physics to better understand the universe at large.
Watch on Youtube
Laurie Stephey and Daniel Margala — March 30, 2022
We discussed scaling Python on large-scale GPU systems, as well as how to manage those systems and promote the usage of those systems.
Watch on Youtube
Thomas Caswell and Juliane Reinhardt — January 26, 2022
During this session, we introduced our newest committee members and took a look ahead to what 2022 might hold for Python and the PyData community.
We also discussed some of the projects and initiatives our host panelists are working on and they shared their views on the direction Python and PyData are headed.
Watch on Youtube
Ross Barnowski and Stéfan van der Walt — December 01, 2021
Aric Hagberg — October 27, 2021
NetworkX is a software tool for network science. I'll tell the previously untold story of how the software project started at Los Alamos and describe the original design goals. The software scope was driven by research applications such as disease spread, cybersecurity, and measuring scholarly impact. I'll describe these applications and the algorithms and analyses that were developed to support them.
Watch on Youtube