Python Pipreqs API + Github Bot🔗
tl;dr: repo link
Turned my #Python requirements API into a @github #bot that reminds me when I forget to update dependencies! 🤖
— @gar@gerardbentley.com (@GarsBar35Plus) April 29, 2022
I used httpx and gidgethub, deployed with the existing @FastAPI server on @heroku
Followed a great guide by @mariatta -> https://t.co/3GthR5urKe
Powered by FastAPI, pipreqs, and gidgethub. Specifically, FastAPI runs the API server for receiving requests, pipreqs analyzes the code to work out its requirements, and gidgethub handles routing requests from Github Webhooks.
Basic API to power bots / helpers to rid the world of Python projects without correct requirements!
Simple bot to open an issue if requirements.txt doesn't match the output of pipreqs.
Attempts to shallow git clone a given repo then run pipreqs on the codebase.
Returns the resulting requirements.txt contents!
How to Use🔗
- Use the pyscript frontend on github pages
- Use the Streamlit frontend 🎈
- Use the fastapi generated swagger docs
- Use curl or other http client
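For example (the host is a placeholder for wherever the API is deployed):

```bash
curl "https://<your-deployment>/pipreqs?code_url=https://github.com/<user>/<repo>"
```

Response (packages and versions will depend on the repo; these are illustrative):

```text
cachetools==5.0.0
fastapi==0.75.2
gidgethub==5.1.0
httpx==0.22.0
```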
Background🔗
Python is a popular language, but it lacks a standard build & dependency management tool. The official docs recommend the third party pipenv, and others such as poetry and pants are available for other preferences. Even conda can provide environment and dependency management!
Slowing things down, the basic way of installing Python packages is with pip and a file called requirements.txt (which is a list of packages like the one above!)
Common guides will recommend running something like:
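```bash
pip install -r requirements.txt
```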
Or maybe this to make sure the active Python environment's pip gets used:
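```bash
python -m pip install -r requirements.txt
```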
Deployments🔗
This project had the goal of reducing headaches from the following loop:
- Write code for a Python / Streamlit app
- Push the code to a github repo
- Connect the repo to Streamlit Cloud
- Wait for deploy on Streamlit Cloud
- Get a "No module named" or other error from not having up-to-date requirements
So why is this the way it is?
I could say it's because "Python lacks a standard build & dependency management tool", but I know that a simple build script / Dockerfile and something like pre-commit can prevent these headaches before pushing code.
So then the answer is a combination of laziness and a desire to not bloat every repo with many tools and scripts. Part of the beauty of managed application hosts such as Heroku and Streamlit Cloud is the minimal amount of setup needed to launch your app.
A truly minimal Streamlit Cloud repo just needs the .py file that holds the streamlit calls.
Once your app needs more third party Python packages than streamlit, some kind of dependency file is needed for the platform to know how to install and run your app.
Manual Method🔗
The essence of this API and bot can be boiled down to 2 CLI calls:
- git clone --depth 1
- pipreqs --print
The clone command will download a copy of the repository that we want to check for requirements.
The --depth 1 flag is important to limit how many files get downloaded (we don't need the whole repo history, just the latest).
pipreqs is a tool that analyzes a directory containing Python files and aims to produce a requirements.txt file with every third party package that is imported in the project directory.
The --print flag will tell pipreqs not to produce a file, but just to print out the file contents to stdout.
You can try this on your own!
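Something like this, substituting any public repo URL:

```bash
# Shallow clone: only the latest commit, no history
git clone --depth 1 https://github.com/<user>/<repo> repo_to_check
# Print the detected requirements to stdout instead of writing a file
pipreqs --print repo_to_check
```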
Python Method🔗
From Python we can either use a library such as gitpython, or make the system call to git with the subprocess standard library module.
Here is a basic example (using .split() to avoid using shell execution):
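```python
import subprocess

code_url = "https://github.com/<user>/<repo>"  # any public repo URL
# Build the argument list without invoking a shell
command = f"git clone --depth 1 {code_url} repo_to_check".split()
completed_process = subprocess.run(command, capture_output=True, text=True)
print(completed_process.returncode, completed_process.stderr)
```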
Since pipreqs uses docopt to parse CLI arguments, I figured it would be easy enough to repeat the subprocess pattern and be sure to capture the stdout output (the requirements!)
API Endpoint🔗
For users to access our function over the internet (and for bots 🤖 to access it!) we'll serve it up in a REST API. Python has plenty of web framework libraries, pretty much any of them would handle this fine.
For version 1 of this system I've chosen to have a single endpoint that responds to HTTP GET requests and requires a query argument called code_url.
All it will respond with is plain text containing the requirements.txt contents and a 200 status code.
Here's what that looks like in my FastAPI app:
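Roughly like this, where pipreqs_from_url is the helper we'll flesh out below:

```python
from fastapi import FastAPI
from fastapi.responses import PlainTextResponse

app = FastAPI()

@app.get("/pipreqs", response_class=PlainTextResponse)
async def pipreqs_endpoint(code_url: str) -> str:
    # Clone the repo and run pipreqs on it; covered in "Primary Function" below
    old_requirements, requirements = await pipreqs_from_url(code_url)
    return requirements
```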
Async Python🔗
This method so far works fine for a single user, but imagine the system with multiple users / bots making requests at once.
Each git clone and pipreqs call would have to complete for a single user before the next in the queue can be processed.
git clone relies on the I/O speed of our server to write files, the remote git server to read the files, and the network speed to get them to our server.
pipreqs includes synchronous requests calls to PyPI, so it relies on the response speed of the PyPI server and the I/O speed of our server to loop over project files and run ast.parse (from the standard library) on the .py files.
WSGI frameworks such as Flask and Bottle rely on concurrency tricks such as gevent greenlets (read more from Bottle) to handle many connections in a single process. ASGI frameworks such as FastAPI (built on Starlette) instead rely on a single-threaded event loop and async coroutines to handle many connections.
Above we defined the /pipreqs route handler with an async def function.
In FastAPI our function will already be running in the event loop, so we can use await in our code to utilize other async functions!
Speaking of other async functions, we can swap out the synchronous subprocess.run() with another standard library function, asyncio.create_subprocess_exec() (see the asyncio subprocess docs).
This will ensure that our program can let git clone and pipreqs run without hanging up our server process when there's no work to check on.
A general asynchronous run function using exec needs a try block (using create_subprocess_shell will return the errors in stderr without raising).
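A sketch of what that can look like (I'll call it _run here, which is also the name the tests below refer to; the exact error handling is up to you):

```python
import asyncio

async def _run(command: str) -> str:
    """Run a command without blocking the event loop; raise if it fails."""
    program, *args = command.split()
    try:
        process = await asyncio.create_subprocess_exec(
            program,
            *args,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
    except OSError as error:
        # exec raises right away for a missing program; shell would only report via stderr
        raise RuntimeError(f"Could not start {command!r}: {error}") from error
    stdout, stderr = await process.communicate()
    if process.returncode != 0:
        raise RuntimeError(stderr.decode())
    return stdout.decode()
```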
Primary Function🔗
Above in the endpoint code we glossed over the most important details:
- Clone the repo from code_url
- Run pipreqs on the cloned repo
When we clone a repo we don't need to store its contents forever.
That would take a lot of storage space and each repo will get new commits that invalidate the old downloads anyway.
Making a temporary directory, cloning into it, running pipreqs on it, then deleting it will work just fine.
PRE-NOTE: We'll continue to use async functions.
NOTE: We're adding a bit of complexity to prepare for a bot use-case, which is comparing an existing requirements.txt to the pipreqs result.
To clone the repo and check the contents of requirements.txt, we can do something like this:
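Sketched out (raising HTTPException on a failed clone, as the POST-NOTE below discusses):

```python
from pathlib import Path
from tempfile import TemporaryDirectory

from fastapi import HTTPException

async def pipreqs_from_url(code_url: str) -> tuple[str, str]:
    """Return (existing requirements.txt contents, pipreqs-generated requirements)."""
    with TemporaryDirectory() as temp_dir:
        try:
            await _run(f"git clone --depth 1 {code_url} {temp_dir}")
        except RuntimeError:
            raise HTTPException(status_code=400, detail=f"Could not clone repo: {code_url}")
        requirements_path = Path(temp_dir) / "requirements.txt"
        old_requirements = requirements_path.read_text() if requirements_path.exists() else ""
        # ... run pipreqs on temp_dir next (continued below)
```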
And to run pipreqs on the directory we just cloned:
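Continuing inside the same temporary directory:

```python
        # Still inside the TemporaryDirectory block of pipreqs_from_url
        try:
            requirements = await _run(f"pipreqs --print {temp_dir}")
        except RuntimeError:
            raise HTTPException(status_code=400, detail=f"pipreqs failed for: {code_url}")
        return old_requirements, requirements
```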
POST-NOTE: using HTTPException from FastAPI couples the logic of our program to the API component. I'm fine with this for now, but a more specific error to the function would allow it to be re-used more easily in another app.
Caching🔗
If you've written a bot that makes requests to an external API then you might see making unlimited git clone and pipreqs calls as a potential problem.
One strategy to prevent (some) repeated requests to external APIs is by caching results for your server to reference when the same user request comes in.
The Python standard library has functools.lru_cache, but in this case I want the results to expire if the cache becomes too large and also if more than 5 minutes have passed.
I used the cachetools library for a tested TTLCache implementation, but to get it to work with an asynchronous function we have to get a little creative.
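In sketch form:

```python
import asyncio

from cachetools import TTLCache

class RequirementsCache(TTLCache):
    """TTLCache that schedules pipreqs_from_url the first time a code_url is requested."""

    def __missing__(self, code_url: str) -> asyncio.Task:
        # Kick off the expensive work on the running event loop
        future = asyncio.create_task(pipreqs_from_url(code_url))
        # Store the task so later lookups for this code_url reuse the same future
        self[code_url] = future
        return future
```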
Since cachetools caches act much like a dictionary, we can overwrite the dunder method that gets called when we try to access a key that doesn't exist!
Here we create a future / task, which is to run the pipreqs_from_url function.
Then we assign that future as the value for the code_url key (so that future checks will access that instead of going to __missing__).
Finally we return the bare future so that we don't lock up the program while executing pipreqs_from_url immediately and synchronously.
Singular Cache🔗
functools.lru_cache does come in handy for creating an object that serves as a singleton in Python.
We can treat the RequirementsCache for our program this way and make a small, cached wrapper function that returns it once and only once.
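Something like this (the maxsize value is illustrative; the ttl matches the 5 minute expiry mentioned above):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def get_requirements_cache() -> RequirementsCache:
    # Built exactly once; every later call returns this same cache instance
    return RequirementsCache(maxsize=128, ttl=60 * 5)
```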
Using the cache🔗
Now we can fill in the API endpoint code!
The wrapper function to fetch the cache is synchronous, so we don't need to await get_requirements_cache().
Then we use dictionary bracket notation to ask the cache for the future, which we must await to get the actual returned value from pipreqs_from_url.
The __missing__ method we wrote above will get called when a given code_url isn't in the cache.
Otherwise the cache can return the same future that was stored earlier in __missing__!
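Filled in, the endpoint looks roughly like this:

```python
from fastapi.responses import PlainTextResponse

@app.get("/pipreqs", response_class=PlainTextResponse)
async def pipreqs_endpoint(code_url: str) -> str:
    requirements_cache = get_requirements_cache()  # synchronous wrapper, no await needed
    # A cache miss triggers __missing__, which schedules pipreqs_from_url and stores the task
    old_requirements, requirements = await requirements_cache[code_url]
    return requirements
```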
For this endpoint the old_requirements isn't useful, so we just return the pipreqs generated requirements.
Github Bot🔗
I followed this guide by Python Core developer and Github Bot builder Mariatta, but adapted it to fit in FastAPI.
The main concept of our bot interaction goes as follows:
- Whenever there is a push to my repository the bot should be notified with at least the url and whoever made the push
- The bot should then run our pipreqs_from_url function on the repository url and this time compare the generated requirements.txt to any existing requirements.txt
- If there's a difference between the files then the bot should open an Issue on the repository to note the differences
Maybe it seems simple in concept, but there are a few technical and non-technical aspects to highlight.
Webhooks🔗
Github (among other git hosting services) allows you to set up webhooks on your repositories. This is the technology that will notify our bot whenever someone pushes a commit.
Webhooks prevent our bot service from having to constantly ping the repository to check for changes. So long as we trust Github's servers are running correctly, we can be confident that our bot will be notified whenever a push happens.
We can set one up manually in a repo with a secret key to ensure no attacker sends bogus data. To expand this idea to other users, a Github App or an adaptation to PubSub might be necessary to ease the creation and management of webhooks.
Check out the webhook events and payloads documentation for more on what Github will send to our bot. Which brings us to the next step, what exactly is the bot?
Gidgethub Endpoint (the bot)🔗
Our bot is actually another FastAPI endpoint! It's barely a robot at all!
Since we already have a FastAPI server that will be running (to host the main endpoint), it makes sense to me to include our bot as an additional endpoint. Another strategy would be to utilize a new web server hosted somewhere else to listen for Github's webhook events.
The guide linked above utilizes aiohttp as both the server and the http client for interacting with Github after receiving a webhook event (since we'll open an Issue in some cases). Another option is to use a "serverless" technology such as AWS Lambda as your bot host, which might reduce your server costs if it is idling a lot.
Here's what the endpoint looks like in FastAPI / Starlette. Gidgethub is doing the heavy lifting; we simply pass along the request headers and body to the library in order to validate and parse them.
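A sketch of that endpoint (the route path, requester string, and environment variable names are placeholders for whatever your deployment uses):

```python
import os

import httpx
from fastapi import Request
from gidgethub import routing, sansio
from gidgethub.httpx import GitHubAPI

router = routing.Router()

@app.post("/webhook")
async def webhook(request: Request):
    body = await request.body()
    # gidgethub validates the signature against our webhook secret and parses the payload
    event = sansio.Event.from_http(request.headers, body, secret=os.environ["GH_SECRET"])
    async with httpx.AsyncClient() as client:
        gh = GitHubAPI(client, "pipreqs-bot", oauth_token=os.environ["GH_AUTH"])
        # Route the event to whichever handler registered for it (the push handler below)
        await router.dispatch(event, gh)
    return {}
```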
Automatically Opening an Issue🔗
I considered 3 ways to alert the repo owner (myself in this case) of a potential dependency mismatch:
- Send an email
- Open a Pull Request with new requirements.txt
- Open an Issue with new and old requirements.txt
Reasons in favor of opening an Issue:
- There may be Python modules that aren't used in deployment that trigger extra dependencies in pipreqs
- A specific older version of a package might be required (pipreqs grabs the latest version)
- Allows the user to set email preferences for alerts
The main reasons against opening a Pull Request:
- Thinking about scaling this bot into a scraper of sorts for other people's projects, it's not always courteous to open a Pull Request without alerting the maintainer in an Issue
- Some projects prefer environment.yml or pyproject.toml for dependencies
Utilizing gidgethub, we can write a function that responds to any push events to the repository:
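Sketched out, the handler starts by pulling what it needs from the push payload:

```python
@router.register("push")
async def check_requirements_on_push(event: sansio.Event, gh: GitHubAPI):
    # The webhook payload includes the repo URL and who made the push
    repo_url = event.data["repository"]["html_url"]
    pusher = event.data["pusher"]["name"]
    # ... compare requirements and maybe open an Issue (continued below)
```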
Next, we can utilize our requirements_cache the same way as in the API endpoint to fetch the requirements and old_requirements (if present) for the repo.
If our pipreqs requirements match what is in the repo, then there's nothing to do for the bot.
Otherwise, we'll grab the provided url for interacting with Issues in the repo:
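Continuing inside the push handler:

```python
    # Continuing inside check_requirements_on_push
    requirements_cache = get_requirements_cache()
    old_requirements, requirements = await requirements_cache[repo_url]
    if old_requirements == requirements:
        return  # requirements.txt already matches; nothing for the bot to do
    # The push payload provides a templated URL for the repo's Issues API
    issues_url = event.data["repository"]["issues_url"].replace("{/number}", "")
```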
Finally, we'll write a short message and use the gh http client to make the POST request to alert the repository owner and whoever submitted the push!
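Still in the push handler (the Issue title and message wording are just examples):

```python
    # Open the Issue via the Github API
    message = (
        f"@{pusher} pushed code whose imports don't match requirements.txt.\n\n"
        f"pipreqs suggests:\n\n{requirements}\n\n"
        f"Existing requirements.txt:\n\n{old_requirements}"
    )
    await gh.post(
        issues_url,
        data={"title": "requirements.txt may be out of date", "body": message},
    )
```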
Testing🔗
We can use pytest and a handful of its helpers to make testing the API and subprocess calls a bit easier:
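The test dependencies are probably along these lines (pytest-asyncio for the async tests, pytest-mock / monkeypatch for faking the subprocess calls):

```text
pytest
pytest-asyncio
pytest-mock
```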
By using a conftest.py we can prepare a Starlette test client for testing the API endpoints:
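Something like this (the app import path depends on your project layout):

```python
# conftest.py
import pytest
from starlette.testclient import TestClient

from app.main import app  # adjust to your project's module layout

@pytest.fixture
def test_app():
    # Runs the ASGI app in-process so tests can call endpoints like a normal HTTP client
    with TestClient(app) as client:
        yield client
```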
Then in our test_pipreqsapi.py we need to make sure to cue pytest-asyncio and clear the cache after each test in order to validate call counts:
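Roughly (each async test also gets the pytest.mark.asyncio marker so pytest-asyncio picks it up):

```python
# test_pipreqsapi.py
import pytest

from app import main  # adjust to your project's module layout

@pytest.fixture(autouse=True)
def clear_cache():
    yield
    # Empty the shared TTLCache between tests so cached futures don't skew call counts
    main.get_requirements_cache().clear()
```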
From there we can test each function, building up to the main endpoint. We'll include general tests on whether things succeed when given good inputs or mocks and whether they fail with the intended errors when bad things happen.
Testing our async subprocess run is a good example.
Here we use the known command ls and the broken command lz to verify our function responds as we expect.
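Assuming the _run sketch from earlier, which raises when the command fails:

```python
@pytest.mark.asyncio
async def test_run_success():
    output = await main._run("ls")
    # ls on the working directory should list at least something
    assert output != ""

@pytest.mark.asyncio
async def test_run_unknown_command():
    # lz isn't a real command, so starting the subprocess should fail
    with pytest.raises(RuntimeError):
        await main._run("lz")
```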
(I don't include git clone here since it needs to go out to an external server without other setup / guarantees. It's definitely worth an end to end test though at some point.)
We can use monkeypatching to fake an expected result now that we've verified our _run function works.
This next test assumes we cloned the repo and that it didn't have an existing requirements.txt:
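For example, faking _run so no real clone happens:

```python
@pytest.mark.asyncio
async def test_pipreqs_from_url_without_existing_requirements(monkeypatch):
    async def fake_run(command: str) -> str:
        # Pretend the clone succeeds and pipreqs prints two packages
        return "fastapi\npipreqs\n"

    monkeypatch.setattr(main, "_run", fake_run)
    old_requirements, requirements = await main.pipreqs_from_url("https://github.com/fake/repo")
    # The freshly created temp dir has no requirements.txt in it
    assert old_requirements == ""
    assert requirements == "fastapi\npipreqs\n"
```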
To simulate the case where there is a requirements.txt, we could set up a dummy repo with one or use mocking to pretend that any Path from pathlib can find it and read it:
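The mocking route might look something like this:

```python
from pathlib import Path

@pytest.mark.asyncio
async def test_pipreqs_from_url_with_existing_requirements(monkeypatch):
    async def fake_run(command: str) -> str:
        return "fastapi\npipreqs\n"

    monkeypatch.setattr(main, "_run", fake_run)
    # Pretend any Path has a readable requirements.txt with an outdated pin
    monkeypatch.setattr(Path, "exists", lambda self: True)
    monkeypatch.setattr(Path, "read_text", lambda self: "fastapi==0.1.0\n")
    old_requirements, requirements = await main.pipreqs_from_url("https://github.com/fake/repo")
    assert old_requirements == "fastapi==0.1.0\n"
    assert old_requirements != requirements
```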
We can also use mocking to keep track of how many times a function gets called. This is useful for validating that our cache will not call the same function twice when it has the value cached already.
Finally, for the endpoints we'll need to rely on the test_app established in conftest.py.
We can assert things about the status and any expected errors:
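For example, with pipreqs_from_url mocked out:

```python
def test_pipreqs_endpoint(test_app, monkeypatch):
    async def fake_pipreqs_from_url(code_url: str):
        return "", "fastapi\npipreqs\n"

    monkeypatch.setattr(main, "pipreqs_from_url", fake_pipreqs_from_url)
    response = test_app.get("/pipreqs", params={"code_url": "https://github.com/fake/repo"})
    assert response.status_code == 200
    assert "pipreqs" in response.text

def test_pipreqs_endpoint_missing_code_url(test_app):
    # FastAPI rejects the request when the required query param is absent
    response = test_app.get("/pipreqs")
    assert response.status_code == 422
```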
Next Steps🔗
Overall this was a fun project for creating a FastAPI server with caching and async distributed transactions.
It also got me to think about github bots a bit and how we might clean up Python repos going forward.
Bot🔗
This server works for exposing the pipreqs functionality via API, but the bot leaves something to be desired.
As of now the gidgethub handling relies on my own github account's access token, and the webhooks are validated based on the secret established for my particular repo.
The bot might be better applied as a Github App to allow users to more easily click a button and get the functionality.
API🔗
Adding features already available in pipreqs would be straightforward with further optional query params:
- min / compatible version options
- using non-pypi package host
Adding features not already in pipreqs would be more involved or hacky:
- Convert from requirements.txt to pyproject.toml / environment.yml
- Async parsing and fetching package info
Another idea is on-demand Issues / Pull Requests on a given repo (for helping out others without putting too much effort in yourself)
Created: June 7, 2023