FastAPI notes

Notes on FastAPI
Programming
Published

September 1, 2024

Basics

  • Build APIs based on standard Python type hints
  • Automatically generate interactive documentation
  • Fast to code, fewer bugs
#! pip install fastapi
#! pip install "uvicorn[standard]"
  • If the following contents are in main.py, can run via uvicorn main:app --reload.
    • main refers to main.py and app refers to the object inside main.py.
    • --reload restarts the server when the code changes; use during development, not in production.
  • Can see documentation conforming to OpenAPI standard in http://127.0.0.1:8000/docs, from which you can use the endpoints!
  • http://127.0.0.1:8000/redoc returns documentation in alternative format.
  • Use async def to make the functions non-blocking, enabling other tasks to run concurrently. Useful when the function performs I/O-bound operations, such as database queries, file I/O, or network requests, and when you need to handle a large number of concurrent requests efficiently.
  • Type hints will be validated with Pydantic, so if you pass a non-integer for /items/{item_id}, you will get a validation error.
  • Order matters: If read_user_current is placed after read_user, requests to /users/me will fail, since FastAPI matches routes top-down and would try to validate "me" as an integer.
  • Use Enums if a path parameter must come from a certain list of values. If an invalid value is passed, FastAPI will list the available values!
  • To have paths be read correctly, use the :path converter, allowing the parameter to capture the entire path, including slashes.
  • read_animal without additional parameters will read off animals 0-10. With additional parameters, can specify which ones we want via query parameters, as in http://127.0.0.1:8000/animals/?skip=0&limit=2. Here, ? denotes start of query parameters and & separates them. Can also pass optional parameter as http://127.0.0.1:8000/animals/?skip=0&limit=2&optional_param=3, just make sure to specify it as typing.Optional.
  • Can pass and use optional parameters as in read_user_item.
  • Request body is data sent by client to the API and response body is data sent from API to client. Use Pydantic to specify request body with POST request type.
    • To send a post request, could test it out in /docs or with curl -X POST "http://127.0.0.1:8000/books/" -H "Content-Type: application/json" -d '{ "name": "The Great Gatsby", "author": "F. Scott Fitzgerald", "description": "A novel set in the 1920s", "price": 10.99 }'
    • Then can go to /books endpoint to see the books printed.
from fastapi import FastAPI
from enum import Enum
import typing as t
from pydantic import BaseModel

app = FastAPI()

@app.get("/") #route/endpoint
def home_page():
    return {"message":"Hello World!"}

@app.get("/items/{item_id}") #item_id is the path parameter
async def read_item(item_id: int):
    return {"item_id":item_id}

@app.get("/users/me") # will not work if placed after, must be before to be valid
async def read_user_current():
    return {"user_id":"Current user"}

@app.get("/users/{user_id}") 
async def read_user(user_id: int):
    return {"user_id":user_id}

class ModelName(str,Enum):
    ALEXNET = 'ALEXNET'
    RESNET = 'RESNET'
    LENET = 'LENET'

@app.get("/models/{model_name}")
async def get_model(model_name: ModelName):
    if model_name == ModelName.ALEXNET:
        return {'model_name':model_name}
    elif model_name.value == "LENET":
        return {'model_name': model_name}
    else:
        return {'model_name':f"You have selected {model_name.value}"}
    
@app.get("files/{file_path:path}")
async def read_file(file_path:str):
    return {"file_path":file_path}

animal_db = [{"animal_name":'cat'},{"animal_name":'llama'},{"animal_name":'alpaca'}]

@app.get("/animals/")
async def read_animal(skip: int=0, limit: int=10, optional_param: t.Optional[int]=None):
    return {"animals": animal_db[skip:skip+limit], "optional_parameter":optional_param}

@app.get("/users/{user_id}/items/{item_id}")
async def read_user_item(
    user_id: int, item_id: int, q: t.Optional[str]=None, short:bool=False
):
    item = {"item_id":item_id, "owner_id":user_id}
    if q:
        item.update({"q":q})
    if not short:
        item.update({'description':'great item with long description'})
    return item

books_db = []
class Book(BaseModel):
    name:str
    author:str
    description:t.Optional[str]
    price:float

@app.post("/books/")
async def create_item(book:Book):
    books_db.append(book)
    return book 
@app.get("/books/")
async def get_books():
    return books_db

Notes following “Building Data Science Applications with FastAPI” by François Voron Chapter 2: Python specificities -> asyncio

Q: What’s the difference between WSGI and ASGI gateways as it pertains to Django and FastAPI? WSGI (Web Server Gateway Interface) and ASGI (Asynchronous Server Gateway Interface) are two different specifications for Python web servers and applications. They serve as interfaces between web servers and web applications or frameworks. Here’s a detailed comparison of WSGI and ASGI, particularly in the context of Django and FastAPI:

  • WSGI (Web Server Gateway Interface)
    • Synchronous: WSGI is designed for synchronous web applications. It handles one request at a time per worker, which can lead to inefficiencies when dealing with I/O-bound operations like database queries or external API calls.
    • Django: Django is traditionally a WSGI-based framework. It works well for most web applications but can struggle with real-time features like WebSockets or long-polling due to its synchronous nature. Common WSGI servers for Django include Gunicorn and uWSGI.
    • Concurrency: WSGI applications handle concurrency by using multiple worker processes or threads. Each worker handles one request at a time.
    • Deployment: WSGI applications are typically deployed using WSGI servers like Gunicorn, uWSGI, or mod_wsgi (for Apache).
  • ASGI (Asynchronous Server Gateway Interface)
    • Asynchronous: ASGI is designed for asynchronous web applications. It supports both synchronous and asynchronous code, allowing for more efficient handling of I/O-bound operations and real-time features.
    • FastAPI: FastAPI is an ASGI-based framework. It is built from the ground up to support asynchronous programming, making it ideal for applications that require high concurrency, real-time communication, or WebSockets. Common ASGI servers for FastAPI include Uvicorn and Daphne.
    • Concurrency: ASGI applications can handle many requests concurrently using asynchronous I/O. This allows for more efficient use of resources, especially for I/O-bound tasks.
    • Deployment: ASGI applications are typically deployed using ASGI servers like Uvicorn, Daphne, or Hypercorn.
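
To make the two interfaces concrete, here is a minimal, framework-free sketch of the callables each specification expects (function names are illustrative). A WSGI server such as Gunicorn would serve wsgi_app; an ASGI server such as Uvicorn would serve asgi_app:

# Raw WSGI application: a synchronous callable taking environ and start_response.
def wsgi_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello from WSGI"]

# Raw ASGI application: an async callable taking scope, receive, and send.
async def asgi_app(scope, receive, send):
    assert scope["type"] == "http"
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"text/plain")],
    })
    await send({
        "type": "http.response.body",
        "body": b"Hello from ASGI",
    })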

#!pip install nest_asyncio # run asyncio within Jupyter's already running event loop
# import asyncio
# async def printer(name: str, times: int)->None:
#     for i in range(times):
#         print(name)
#         await asyncio.sleep(1)
# async def main():
#     await asyncio.gather(
#         printer("A",3),
#         printer("B",3)
#     )
# asyncio.run(main())

# adapting the code since Jupyter has its own event loop
import asyncio
import nest_asyncio

# Apply nest_asyncio to allow nested event loops
nest_asyncio.apply()

async def printer(name: str, times: int) -> None:
    for i in range(times):
        print(name)
        await asyncio.sleep(1)

async def main():
    await asyncio.gather(
        printer("A", 3),
        printer("B", 3)
    )

# Await the main coroutine directly
await main()
A
B
A
B
A
B
  • asyncio.sleep(1) was added because writing code in a coroutine doesn't necessarily mean it will not block. Computations are blocking! I/O operations will not block; for CPU-bound work we could use multiprocessing.
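
As a minimal sketch (names are illustrative): a CPU-bound loop blocks the event loop even inside a coroutine, while offloading it to a process pool keeps the loop free for other tasks.

import asyncio
from concurrent.futures import ProcessPoolExecutor

def cpu_bound(n: int) -> int:
    # Pure computation: blocks whichever thread/process runs it.
    return sum(i * i for i in range(n))

async def naive(n: int) -> int:
    # Still blocks the event loop, despite being defined with async def.
    return cpu_bound(n)

async def offloaded(n: int) -> int:
    # Run the computation in a separate process so the event loop stays responsive.
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        return await loop.run_in_executor(pool, cpu_bound, n)

# asyncio.run(offloaded(10_000_000))  # guard with `if __name__ == "__main__":` in a script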

  • Path parameters and their validation

from fastapi import FastAPI, Path
app = FastAPI()

@app.get('/license-plates/{license}')
async def get_license_plate(license: str = Path(..., pattern=r"^\w{2}-\d{3}-\w{2}$")):
    return {"license": license}
  • In FastAPI, the ... (Ellipsis) above indicates that we don't want a default value, i.e., the parameter is required. The regular expression validates French license plates like AB-123-CD. Note that the regex argument has been deprecated in favor of pattern.

Notes by Key Topic

Installation, virtual environment (conda), running, first app

%%bash
conda create --name fastapi-env python=3.11
conda activate fastapi-env
pip install "fastapi[all]"
  • If FastAPI app is called app in main file, run as follows: uvicorn main:app --reload
  • Access interactive documentation using http://127.0.0.1:8000/docs (using Swagger UI) or http://127.0.0.1:8000/redoc (using ReDoc)

Defining routes with path and query parameters (for user input) and validating requests.

  • Defining path parameters: user_id is a path parameter FastAPI will convert to an integer.
from fastapi import FastAPI
app = FastAPI() 

@app.get("/users/{user_id}")
def read_user(user_id:int):
    return {"user_id":user_id}
  • Defining query parameters via function parameters with default values:
@app.get("/users/")
def read_user(skip:int=0,limit:int=10):
    return {"skip":skip, "limit":limit}

If a user accesses /users/?skip=5&limit=15, FastAPI will return {"skip":5, "limit":15}

  • Request validation with Pydantic below. Make the request as follows:
  • curl -X POST "http://127.0.0.1:8000/users/" -H "Content-Type: application/json" -d '{"id":1, "name":"John Smith", "email":"john@example.com"}'
    • -X POST: use the POST HTTP method to send data
    • -H "Content-Type: application/json": add an HTTP header to the request specifying that the data being sent is in JSON format
    • -d '{"id":1, "name":"John Smith", "email":"john@example.com"}': send the specified data in the request body
from pydantic import BaseModel
class User(BaseModel):
    id: int
    name: str
    email: str

@app.post("/users/")
def create_user(user:User):
    return {"id": user.id, "name": user.name, "email":user.email}
  • Combining path and query parameters: user_id is a path parameter and details is a query parameter that modifies the response.
  • Read simply as curl "http://127.0.0.1:8000/users/1"
@app.get("/users/{user_id}")
def read_user(user_id: int, details: bool=False):
    if details:
        return {"user_id":user_id, "details":"Detailed info"}
    return {"user_id":user_id}

Request and response models

  • Request models define the structure of the data that your API expects to receive in the request body. They are used to validate and parse the incoming data.
  • Response models define the structure of the data that your API returns in response. They ensure that the response data is correctly formatted and validated.
  • curl -X POST "http://127.0.0.1:8000/users/" -H "Content-Type: application/json" -d '{"id":1, "name":"John Smith", "email":"john@example.com", "age":30}'
from fastapi import FastAPI
from pydantic import BaseModel
from typing import Optional

app = FastAPI()

class UserCreate(BaseModel):
    id: int
    name: str
    email: str
    age: Optional[int] = None

class UserResponse(BaseModel):
    id: int
    name: str
    email: str
    age: Optional[int] = None
    is_active:  bool

@app.post("/users/", response_model=UserResponse)
async def create_user(user:UserCreate): #validate incoming data
    user_response = UserResponse(       #validate outgoing data
        id= user.id,
        name=user.name,
        email=user.email,
        age=user.age,
        is_active=True
    )
    return user_response

Dependency Injection

  • Inject dependencies (database connections, configuration settings, other shared resources) into your functions or classes.

  • Separate concerns between the logic of the endpoint and the more generic logic for the pagination parameters.

  • Ideal for utility logic to retrieve or validate data, make security checks, or call external logic that will be needed several times across the application.

  • Notes following “Building Data Science Applications with FastAPI” by François Voron (dependency injection chapter).

from fastapi import Depends, FastAPI

app = FastAPI()

async def pagination(skip:int=0,limit:int=10)->tuple[int,int]:
    return (skip,limit)

@app.get("/items")
async def list_items(p:tuple[int,int]=Depends(pagination)):
    skip,limit = p
    return {"skip":skip, "limit":limit}

@app.get("/things")
async def list_things(p:tuple[int,int]=Depends(pagination)):
    skip,limit = p
    return {"skip":skip, "limit":limit}
  • FastAPI limitation: the Depends function is not able to forward the type of the dependency function, so we have to annotate the return type manually above (but see the Annotated sketch below).
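Newer FastAPI versions (0.95+) also accept typing.Annotated, which bundles the parameter type and its dependency in one place; a minimal sketch reusing the pagination dependency above:

from typing import Annotated

from fastapi import Depends, FastAPI

app = FastAPI()

async def pagination(skip: int = 0, limit: int = 10) -> tuple[int, int]:
    return (skip, limit)

# Annotated carries both the type and the Depends marker, so the tuple[int, int]
# annotation no longer has to be repeated by hand in every endpoint.
@app.get("/items")
async def list_items(p: Annotated[tuple[int, int], Depends(pagination)]):
    skip, limit = p
    return {"skip": skip, "limit": limit}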
  • Raising a 404 error:
from fastapi import Depends, FastAPI, HTTPException, status
from pydantic import BaseModel


class Post(BaseModel):
    id: int
    title: str
    content: str


class PostUpdate(BaseModel):
    title: str | None = None
    content: str | None = None


class DummyDatabase:
    posts: dict[int, Post] = {}


db = DummyDatabase()
db.posts = {
    1: Post(id=1, title="Post 1", content="Content 1"),
    2: Post(id=2, title="Post 2", content="Content 2"),
    3: Post(id=3, title="Post 3", content="Content 3"),
}


app = FastAPI()


async def get_post_or_404(id: int) -> Post:
    try:
        return db.posts[id]
    except KeyError:
        raise HTTPException(status_code=status.HTTP_404_NOT_FOUND)


@app.get("/posts/{id}")
async def get(post: Post = Depends(get_post_or_404)):
    return post


@app.patch("/posts/{id}")
async def update(post_update: PostUpdate, post: Post = Depends(get_post_or_404)):
    updated_post = post.copy(update=post_update.dict())
    db.posts[post.id] = updated_post
    return updated_post


@app.delete("/posts/{id}", status_code=status.HTTP_204_NO_CONTENT)
async def delete(post: Post = Depends(get_post_or_404)):
    db.posts.pop(post.id)

Creating and using a parametrized dependency with a class

  • Suppose we wanted to dynamically cap the limit value in the pagination example; we would need to do this with a class!
from fastapi import Depends, FastAPI, Query

app = FastAPI()

class Pagination:
    def __init__(self, maximum_limit:int = 100):
        self.maximum_limit = maximum_limit
    async def __call__(
            self,
            skip: int = Query(0, ge=0),
            limit: int = Query(10,ge=0)
    ) -> tuple[int,int]:
        capped_limit = min(self.maximum_limit, limit)
        return (skip, capped_limit)
# hardcoded below, but could come from config file or env variable
pagination = Pagination(maximum_limit=50)

@app.get("/items")
async def list_items(p: tuple[int, int] = Depends(pagination)):
    skip, limit = p
    return {"skip": skip, "limit": limit}


@app.get("/things")
async def list_things(p: tuple[int, int] = Depends(pagination)):
    skip, limit = p
    return {"skip": skip, "limit": limit}
  • Note: In FastAPI, Query is used to define and validate query parameters for your API endpoints. Query parameters are the key-value pairs that appear after the ? in a URL. They are typically used to filter, sort, or paginate data.
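A small illustrative sketch (the endpoint and parameter names are made up) of Query carrying validation rules and documentation metadata:

from fastapi import FastAPI, Query

app = FastAPI()

@app.get("/search")
async def search(
    q: str = Query(..., min_length=3, max_length=50, description="Search term"),
    limit: int = Query(10, ge=1, le=100),
):
    # Invalid values (e.g. q shorter than 3 characters, limit > 100) return a 422 error.
    return {"q": q, "limit": limit}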

  • Depends simply expects a callable: it can be __call__ or another method, as below. Note that the pattern below could be used to apply different preprocessing steps, depending on the data, in the ML context:

from fastapi import Depends, FastAPI, Query

app = FastAPI()


class Pagination:
    def __init__(self, maximum_limit: int = 100):
        self.maximum_limit = maximum_limit

    async def skip_limit(
        self,
        skip: int = Query(0, ge=0),
        limit: int = Query(10, ge=0),
    ) -> tuple[int, int]:
        capped_limit = min(self.maximum_limit, limit)
        return (skip, capped_limit)

    async def page_size(
        self,
        page: int = Query(1, ge=1),
        size: int = Query(10, ge=0),
    ) -> tuple[int, int]:
        capped_size = min(self.maximum_limit, size)
        return (page, capped_size)


pagination = Pagination(maximum_limit=50)


@app.get("/items")
async def list_items(p: tuple[int, int] = Depends(pagination.skip_limit)):
    skip, limit = p
    return {"skip": skip, "limit": limit}


@app.get("/things")
async def list_things(p: tuple[int, int] = Depends(pagination.page_size)):
    page, size = p
    return {"page": page, "size": size}
  • Using dependency injection to manage db connection, ensuring that each request gets a fresh connection and that connections are properly closed after use:
from fastapi import FastAPI, Depends
from sqlalchemy.orm import Session
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

DATABASE_URL = "sqlite:///./test.db"
engine = create_engine(DATABASE_URL)
SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)
Base = declarative_base()

# Define a User model
class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True, index=True)
    name = Column(String, index=True)
    email = Column(String, unique=True, index=True)

app = FastAPI()

def get_db():
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()
# Endpoint to create a new user
@app.post("/users/")
async def create_user(name: str, email: str, db: Session = Depends(get_db)):
    user = User(name=name, email=email)
    db.add(user)
    db.commit()
    db.refresh(user)
    return user

@app.get("/users/")
async def read_users(db: Session=Depends(get_db)):
    users = db.query(User).all()
    return users

# Run the application
if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="127.0.0.1", port=8000)
  • An example in LLM context:
from fastapi import FastAPI, Depends
from pydantic import BaseModel
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

app = FastAPI()

class LLM:
    def __init__(self, model_name: str = "distilbert-base-uncased-finetuned-sst-2-english"):
        # Load the pre-trained model and tokenizer from Hugging Face
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        self.model.to("cuda" if torch.cuda.is_available() else "cpu")

    def predict(self, text: str):
        # (a text preprocessing step could be inserted here)
        # Tokenize the input text
        inputs = self.tokenizer(text, return_tensors="pt")
        inputs = {k: v.to("cuda" if torch.cuda.is_available() else "cpu") for k, v in inputs.items()}
        # Perform prediction using the model
        with torch.no_grad():
            outputs = self.model(**inputs)
        # Get the predicted class
        predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
        predicted_class = torch.argmax(predictions, dim=1).item()
        return predicted_class

# Create a global LLM instance
llm_instance = LLM()

# Dependency to get the global LLM instance
def get_llm():
    return llm_instance

# Request model for prediction
class PredictionRequest(BaseModel):
    text: str

# Response model for prediction
class PredictionResponse(BaseModel):
    prediction: int

# Use the LLM dependency in an endpoint
@app.post("/predict/", response_model=PredictionResponse)
async def predict(request: PredictionRequest, llm: LLM = Depends(get_llm)):
    prediction = llm.predict(request.text)
    return {"prediction": prediction}

# Run the application
if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="127.0.0.1", port=8000)
  • Run as follows: curl -X POST "http://127.0.0.1:8000/predict/" -H "Content-Type: application/json" -d '{"text": "I love FastAPI!"}'
  • Will return something like the following: { "prediction": 1 }

async/await syntax for LLM calls that are I/O-bound

from fastapi import FastAPI, HTTPException, Depends
from pydantic import BaseModel
import httpx
import os
from dotenv import load_dotenv

load_dotenv()

app = FastAPI()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
HUGGINGFACE_API_KEY = os.getenv("HUGGINGFACE_API_KEY")

class OpenAIRequest(BaseModel):
    prompt: str

class HuggingFaceRequest(BaseModel):
    prompt: str

class OpenAIResponse(BaseModel):
    id: str
    object: str
    created: int
    model: str
    choices: list

class HuggingFaceResponse(BaseModel):
    generated_text: str

@app.post("/openai-generate/", response_model=OpenAIResponse) #respone should conform to the given Pydantic model
async def openai_generate(request: OpenAIRequest):
    url = "https://api.openai.com/v1/completions"
    headers = {
        "Authorization": f"Bearer {OPENAI_API_KEY}",
        "Content-Type": "application/json",
    }
    data = {
        "model": "text-davinci-003",
        "prompt": request.prompt,
        "max_tokens": 100,
    }
    async with httpx.AsyncClient() as client: #always async with context managers
        try:
            response = await client.post(url, headers=headers, json=data)
            response.raise_for_status()  # Raises an error for 4xx/5xx responses
            return response.json()
        except httpx.HTTPStatusError as exc:
            raise HTTPException(status_code=exc.response.status_code, detail=exc.response.text)
        
@app.post("/huggingface-generate", response_model=HuggingFaceResponse)
async def huggingface_generate(request: HuggingFaceRequest):
    url = "https://api-inference.huggingface.co/models/gpt2"
    headers = {
        "Authorization": f"Bearer {HUGGINGFACE_API_KEY}",
    }
    data = {
        "inputs": request.prompt,
        "options": {"use_cache": False},
    }    
    async with httpx.AsyncClient() as client:
        try:
            response = await client.post(url, headers = headers, json=data)
            response.raise_for_status()
            return {"generated_text": response.json()[0]['generated_text']}
        except httpx.HTTPStatusError as exc:
            raise HTTPException(status_code=exc.response.status_code, detail=exc.response.text)
# To run the app: uvicorn your_file_name:app --reload

Custom error handling

from fastapi import FastAPI, HTTPException, Depends, Request
from fastapi.responses import JSONResponse
from pydantic import BaseModel
import httpx
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

app = FastAPI()

LLM_API_KEY = os.getenv("LLM_API_KEY")

class LLMError(Exception):
    def __init__(self, message: str):
        self.message = message

@app.exception_handler(LLMError)
async def llm_exception_handler(request: Request, exc: LLMError):
    return JSONResponse(
        status_code=500,
        content={"message":f"LLM error occurred: {exc.message}"}
    )
class GenerateRequest(BaseModel):
    prompt: str

class GenerateResponse(BaseModel):
    generated_text: str

@app.post("/generate", response_model=GenerateResponse)
async def generate(request: GenerateRequest):
    if not request.prompt:
        raise HTTPException(status_code=400, detail="Prompt cannot be empty!")
    url = "https://api.llm.example/generate"
    headers = {
        "Authorization": f"Bearer {LLM_API_KEY}",
        "Content-Type": "application/json",
    }
    data = {"prompt": request.prompt}

    async with httpx.AsyncClient() as client:
        try:
            response = await client.post(url, headers=headers, json=data)
            response.raise_for_status()  # Raise an error for 4xx/5xx responses
            response_data = response.json()
            return {"generated_text": response_data.get("generated_text", "")}
        except httpx.HTTPStatusError as exc:
            raise LLMError(message = exc.response.text)
        except Exception as exc:
            raise LLMError(message=str(exc))

# To run the app: uvicorn your_file_name:app --reload

Serving static files (like HTML, CSS, JS) with FastAPI: Jinja2+JS

# Given the following file structure
# /my_fastapi_app
# ├── app.py               # Your FastAPI application
# ├── static               # Directory for static files
# │   ├── css
# │   │   └── styles.css
# │   ├── js
# │   │   └── script.js
# │   └── index.html
# └── requirements.txt
# index.html
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>{{ title }}</title>
    <link rel="stylesheet" href="/static/css/styles.css">
</head>
<body>
    <h1>{{ title }}</h1>
    <p>{{ description }}</p>
    <form id="generate-form">
        <label for="prompt">Enter your prompt:</label>
        <input type="text" id="prompt" name="prompt" required>
        <button type="submit">Generate</button>
    </form>
    <div id="result"></div>
    <script src="/static/js/scripts.js"></script>
</body>
</html>
# script.js
document.getElementById('generate-form').addEventListener('submit', async (event) => {
    event.preventDefault();
    const prompt = document.getElementById('prompt').value;
    const response = await fetch('/generate', {
        method: 'POST',
        headers: {
            'Content-Type': 'application/json'
        },
        body: JSON.stringify({ prompt })
    });
    const data = await response.json();
    document.getElementById('result').innerText = data.generated_text;
});
from fastapi import FastAPI, Request
from fastapi.responses import HTMLResponse
from fastapi.staticfiles import StaticFiles
from fastapi.templating import Jinja2Templates

app = FastAPI()

#mount the static directory
app.mount("/static",StaticFiles(directory="static"),name='static')

# @app.get("/",response_class=HTMLResponse)
# async def read_index():
#     with open("static/index.html") as f:
#         return f.read()
    
# Or better, serve HTML templates more dynamically (with index.html moved into a templates/ directory):
templates = Jinja2Templates(directory="templates")

@app.get("/", response_class=HTMLResponse)
async def read_root(request: Request):
    context = {
        "request": request,
        "title": "LLM Text Generator",
        "description": "Generate text using a large language model."
    }
    return templates.TemplateResponse("index.html", context)

# To run the app: uvicorn main:app --reload
  • When you navigate to http://127.0.0.1:8000/, it will serve your index.html file.
  • app.mount: specify URL path+directory for static files, allowing FastAPI to handle requests to those files automatically.

Serving static files (like HTML, CSS, JS) with FastAPI: Jinja2+HTMX

from fastapi import FastAPI, Request, Form
from fastapi.responses import HTMLResponse
from fastapi.templating import Jinja2Templates
from fastapi.staticfiles import StaticFiles
import httpx
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

app = FastAPI()

# Mount the static directory
app.mount("/static", StaticFiles(directory="static"), name="static")

# Serve HTML templates
templates = Jinja2Templates(directory="templates")

@app.get('/', response_class=HTMLResponse)
async def read_root(request:Request):
    context = {
        "request":request,
        "title": "LLM Text Generator",
        "description": "Generate text using a language model."
    }
    return templates.TemplateResponse("index.html", context)

@app.post("/generate",response_class=HTMLResponse)
async def generate_text(request: Request, prompt: str = Form(...)):
    if not prompt:
        return HTMLResponse("<div id='result'><p>Error: Prompt cannot be empty!</p></div>")
    url = "https://api.llm.example/generate"
    headers = {
        "Authorization": f"Bearer {os.getenv('LLM_API_KEY')}",
        "Content-Type": "application/json",
    }
    data = {"prompt": prompt}

    async with httpx.AsyncClient() as client:
        try:
            response = await client.post(url, headers=headers, json=data)
            response.raise_for_status()  # Raise an error for 4xx/5xx responses
            response_data = response.json()
            generated_text = response_data.get("generated_text", "")
            return HTMLResponse(f"<div id='result'><h2>Generated Text:</h2><p>{generated_text}</p></div>") # update the #result element
        except httpx.HTTPStatusError as exc:
            return HTMLResponse(f"<div id='result'><p>Error: {exc.response.text}</p></div>")
        except Exception as exc:
            return HTMLResponse(f"<div id='result'><p>Error: {str(exc)}</p></div>")
  • There is no longer any JavaScript file in the directory; HTMX takes care of the dynamic behavior as follows:

    • HTML Template with HTMX:
      • The index.html template includes the HTMX library by adding a <script> tag that loads HTMX from a CDN.
      • The form uses HTMX attributes (hx-post, hx-target, hx-swap) to handle form submission and update the result dynamically:
      • hx-post="/generate": Sends a POST request to the /generate endpoint when the form is submitted.
      • hx-target="#result": Specifies the element (#result) to update with the server's response.
      • hx-swap="innerHTML": Replaces the inner HTML of the target element with the server's response.
    • FastAPI Endpoint:
      • The /generate endpoint processes the form submission, interacts with the LLM API, and returns an HTML response with the generated text or an error message.
      • The response is an HTMLResponse that updates the #result element in the HTML template.
# .
# ├── main.py
# ├── static
# │   ├── css
# │   │   └── styles.css
# └── templates
#     └── index.html
# templates/index.html
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>{{ title }}</title>
    <link rel="stylesheet" href="/static/css/styles.css">
    <script src="https://unpkg.com/htmx.org@1.6.1"></script>
</head>
<body>
    <h1>{{ title }}</h1>
    <p>{{ description }}</p>
    <form hx-post="/generate" hx-target="#result" hx-swap="innerHTML">
        <label for="prompt">Enter your prompt:</label>
        <input type="text" id="prompt" name="prompt" required>
        <button type="submit">Generate</button>
    </form>
    <div id="result"></div>
</body>
</html>

Packaging with Docker

.
├── Dockerfile
├── main.py
├── requirements.txt
├── static
│   ├── css
│   │   └── styles.css
└── templates
    └── index.html
  • templates/index.html remains the same file compatible with HTMX
# requirements.txt
fastapi
uvicorn
torch==1.9.0+cu111  # Ensure this matches the CUDA version in the Docker image
transformers
python-dotenv
google-cloud-storage
jinja2              # needed by Jinja2Templates
python-multipart    # needed for Form(...) parameters
  • For GPU support…

Install NVIDIA Docker: Install the NVIDIA Docker runtime on your host machine.

Modify the Dockerfile: Use a base image that includes CUDA and cuDNN libraries.

Update the Docker Run Command: Use the --gpus flag to allocate GPU resources to the container.

Ensure PyTorch is Installed with CUDA Support: Make sure the PyTorch version installed in the container supports CUDA.

# Use the official NVIDIA CUDA runtime as a parent image
FROM nvidia/cuda:11.3.1-cudnn8-runtime-ubuntu20.04

# Set the working directory in the container
WORKDIR /app

# Copy the current directory contents into the container at /app
COPY . /app

# Install any needed packages specified in requirements.txt
RUN apt-get update && apt-get install -y \
    python3-pip \
    && rm -rf /var/lib/apt/lists/*

RUN pip3 install --no-cache-dir -r requirements.txt

# Expose port 80 to the outside world
EXPOSE 80

# Run the FastAPI application
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "80"]
from fastapi import FastAPI, Request, Form
from fastapi.responses import HTMLResponse
from fastapi.templating import Jinja2Templates
from fastapi.staticfiles import StaticFiles
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

app = FastAPI()

# Mount the static directory
app.mount("/static", StaticFiles(directory="static"), name="static")

# Serve HTML templates
templates = Jinja2Templates(directory="templates")

# Load the model and tokenizer from a shared storage location or model server
model_name = os.getenv("MODEL_NAME", "gpt2")  # Replace with your model path or name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.to("cuda" if torch.cuda.is_available() else "cpu")

@app.get("/", response_class=HTMLResponse)
async def read_root(request: Request):
    context = {
        "request": request,
        "title": "LLM Text Generator",
        "description": "Generate text using a large language model."
    }
    return templates.TemplateResponse("index.html", context)

@app.post("/generate", response_class=HTMLResponse)
async def generate_text(request: Request, prompt: str = Form(...)):
    if not prompt:
        return HTMLResponse("<div id='result'><p>Error: Prompt cannot be empty!</p></div>")

    inputs = tokenizer(prompt, return_tensors="pt").to("cuda" if torch.cuda.is_available() else "cpu")
    outputs = model.generate(inputs["input_ids"], max_length=100)
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return HTMLResponse(f"<div id='result'><h2>Generated Text:</h2><p>{generated_text}</p></div>")
%%bash
docker build -t my-fastapi-app .
docker run --gpus all --env-file .env -d -p 80:80 --name my-fastapi-container my-fastapi-app

Deploying on GCP

Step 1: Set Up GCP Project: Create a GCP project and enable necessary APIs.

Step 2: Install and Configure Google Cloud SDK: Install the Google Cloud SDK and authenticate.

Step 3: Build and Push Docker Image to GCR: Build the Docker image and push it to Google Container Registry.

Step 4: Create a GKE Cluster: Create a Kubernetes cluster on GKE.

Step 5: Deploy the Application on GKE: Deploy the FastAPI application on the GKE cluster.

Step 6: Set Up Google Cloud Storage: Store the model in GCS and access it from the application.

Step 1: Set Up GCP Project: Create a GCP project and enable necessary APIs.

Create a GCP Project: Go to the Google Cloud Console.

Create a new project.

Enable the following APIs: Kubernetes Engine API, Container Registry API, Cloud Storage API
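
These APIs can also be enabled from the command line once the Cloud SDK from Step 2 is configured (service names assumed here): gcloud services enable container.googleapis.com containerregistry.googleapis.com storage.googleapis.com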

Step 2: Install and Configure Google Cloud SDK

Install Google Cloud SDK: https://cloud.google.com/sdk/docs/install

Authenticate with GCP: gcloud init; gcloud auth login

Set the Project: gcloud config set project YOUR_PROJECT_ID

Step 3: Build and Push Docker Image to GCR

docker build -t gcr.io/YOUR_PROJECT_ID/my-fastapi-app .

docker push gcr.io/YOUR_PROJECT_ID/my-fastapi-app

Step 4: Create a GKE Cluster.

gcloud container clusters create my-cluster --num-nodes=3

gcloud container clusters get-credentials my-cluster

Step 5: Deploy the Application on GKE: Deploy the FastAPI application on the GKE cluster.

  • Create a Kubernetes Deployment deployment.yaml
## deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-fastapi-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-fastapi-app
  template:
    metadata:
      labels:
        app: my-fastapi-app
    spec:
      containers:
      - name: my-fastapi-app
        image: gcr.io/YOUR_PROJECT_ID/my-fastapi-app
        ports:
        - containerPort: 80
        env:
        - name: MODEL_NAME
          value: "gs://YOUR_BUCKET_NAME/model"
  • Create a Kubernetes Service service.yaml
apiVersion: v1
kind: Service
metadata:
  name: my-fastapi-app
spec:
  type: LoadBalancer
  selector:
    app: my-fastapi-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
  • Deploy to GKE

kubectl apply -f deployment.yaml

kubectl apply -f service.yaml

Step 6: Set up Google Cloud Storage, upload model to bucket, update main.py to load model from GCS

gsutil mb gs://YOUR_BUCKET_NAME

gsutil cp model/* gs://YOUR_BUCKET_NAME/model/

from fastapi import FastAPI, Request, Form
from fastapi.responses import HTMLResponse
from fastapi.templating import Jinja2Templates
from fastapi.staticfiles import StaticFiles
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import os
from google.cloud import storage
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

app = FastAPI()

# Mount the static directory
app.mount("/static", StaticFiles(directory="static"), name="static")

# Serve HTML templates
templates = Jinja2Templates(directory="templates")

# Load the model and tokenizer from GCS
def download_blob(bucket_name, source_blob_name, destination_file_name):
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)

model_path = "/app/model"
os.makedirs(model_path, exist_ok=True)
download_blob("YOUR_BUCKET_NAME", "model/pytorch_model.bin", f"{model_path}/pytorch_model.bin")
download_blob("YOUR_BUCKET_NAME", "model/config.json", f"{model_path}/config.json")
download_blob("YOUR_BUCKET_NAME", "model/tokenizer_config.json", f"{model_path}/tokenizer_config.json")

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)
model.to("cuda" if torch.cuda.is_available() else "cpu")

@app.get("/", response_class=HTMLResponse)
async def read_root(request: Request):
    context = {
        "request": request,
        "title": "LLM Text Generator",
        "description": "Generate text using a large language model."
    }
    return templates.TemplateResponse("index.html", context)

@app.post("/generate", response_class=HTMLResponse)
async def generate_text(request: Request, prompt: str = Form(...)):
    if not prompt:
        return HTMLResponse("<div id='result'><p>Error: Prompt cannot be empty!</p></div>")

    inputs = tokenizer(prompt, return_tensors="pt").to("cuda" if torch.cuda.is_available() else "cpu")
    outputs = model.generate(inputs["input_ids"], max_length=100)
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return HTMLResponse(f"<div id='result'><h2>Generated Text:</h2><p>{generated_text}</p></div>")

# To run the app: uvicorn main:app --reload

Loading the model and making predictions with FastAPI app

from fastapi import FastAPI, Request, Form, Body
from fastapi.responses import HTMLResponse, JSONResponse
from fastapi.templating import Jinja2Templates
from fastapi.staticfiles import StaticFiles
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

app = FastAPI()

# Mount the static directory
app.mount("/static", StaticFiles(directory="static"), name="static")

# Serve HTML templates
templates = Jinja2Templates(directory="templates")

# Load the model and tokenizer
model_name = os.getenv("MODEL_NAME", "gpt2")  # Replace with your model path or name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

@app.get("/", response_class=HTMLResponse)
async def read_root(request: Request):
    context = {
        "request": request,
        "title": "LLM Text Generator",
        "description": "Generate text using a large language model."
    }
    return templates.TemplateResponse("index.html", context)

@app.post("/generate", response_class=HTMLResponse)
async def generate_text(request: Request, prompt: str = Form(...)):
    if not prompt:
        return HTMLResponse("<div id='result'><p>Error: Prompt cannot be empty!</p></div>")

    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    outputs = model.generate(inputs["input_ids"], max_length=100)
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return HTMLResponse(f"<div id='result'><h2>Generated Text:</h2><p>{generated_text}</p></div>")

@app.post("/api/generate", response_class=JSONResponse)
async def api_generate_text(prompt: str = Body(..., embed=True)):  # expects a JSON body like {"prompt": "..."}
    if not prompt:
        return JSONResponse(status_code=400, content={"error": "Prompt cannot be empty!"})

    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    outputs = model.generate(inputs["input_ids"], max_length=100)
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return JSONResponse(content={"generated_text": generated_text})

# To run the app: uvicorn main:app --reload
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>LLM Text Generator</title>
    <link rel="stylesheet" href="/static/css/styles.css">
    <script src="https://unpkg.com/htmx.org@1.6.1"></script>
</head>
<body>
    <h1>LLM Text Generator</h1>
    <p>Generate text using a large language model.</p>
    <form hx-post="/generate" hx-target="#result" hx-swap="innerHTML">
        <label for="prompt">Enter your prompt:</label>
        <input type="text" id="prompt" name="prompt" required>
        <button type="submit">Generate</button>
    </form>
    <div id="result"></div>
</body>
</html>

From the user’s perspective, the interaction with the FastAPI application will be straightforward and intuitive. Here’s how each endpoint will be experienced by the user:

User Experience

  1. Accessing the Application (/ Endpoint):
    • When the user navigates to the root URL (http://127.0.0.1:8000/), they will see an HTML page with a form where they can input a prompt.
    • The form will have a text input field for the prompt and a submit button.
  2. Submitting the Form (/generate Endpoint):
    • When the user types a prompt into the form and clicks the submit button, the form data is sent to the /generate endpoint.
    • The /generate endpoint processes the form submission, queries the LLM, and returns the generated text as part of the HTML response.
    • The user will see the generated text displayed on the same page below the form.
  3. API Interaction (/api/generate Endpoint):
    • The /api/generate endpoint is designed for programmatic access, such as from a frontend application or another service.
    • Users or developers can send a JSON request to this endpoint with the prompt, and it will return the generated text as a JSON response.
    • This endpoint is useful for integrating the LLM functionality into other applications or services.
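    • For example, assuming the app is running locally on port 8000: curl -X POST "http://127.0.0.1:8000/api/generate" -H "Content-Type: application/json" -d '{"prompt": "Hello, world!"}'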

Writing tests

Unit tests

import os
import pytest
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@pytest.fixture
def model_and_tokenizer():
    model_name = os.getenv("MODEL_NAME", "gpt2")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    return model, tokenizer

def test_model_loading(model_and_tokenizer):
    model, tokenizer = model_and_tokenizer
    assert model is not None, "Model should be loaded"
    assert tokenizer is not None, "Tokenizer should be loaded"
    assert model.device.type in ["cuda", "cpu"], "Model should be on CUDA or CPU"

Integration tests

from fastapi.testclient import TestClient
from main import app

client = TestClient(app)

def test_read_root():
    response = client.get("/")
    assert response.status_code == 200
    assert "LLM Text Generator" in response.text

def test_generate_text():
    response = client.post("/generate", data={"prompt": "Hello, world!"})
    assert response.status_code == 200
    assert "Generated Text:" in response.text

def test_api_generate_text():
    response = client.post("/api/generate", json={"prompt": "Hello, world!"})
    assert response.status_code == 200
    json_response = response.json()
    assert "generated_text" in json_response
    assert isinstance(json_response["generated_text"], str)

Working with a database (PostgreSQL)

# requirements.txt
fastapi
uvicorn
torch
transformers
python-dotenv
passlib[bcrypt]
python-jose[cryptography]  # auth.py uses the jose JWT implementation
python-multipart           # needed for Form(...) and OAuth2PasswordRequestForm
jinja2                     # needed by Jinja2Templates
sqlalchemy
databases
asyncpg
psycopg2-binary
# models.py
from sqlalchemy import Column, Integer, String, Boolean
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class User(Base):
    __tablename__ = 'users'
    id = Column(Integer, primary_key=True, index=True)
    username = Column(String, unique=True, index=True)
    full_name = Column(String)
    email = Column(String, unique=True, index=True)
    hashed_password = Column(String)
    disabled = Column(Boolean, default=False)


class UserInDB(User):
    hashed_password: str
# database.py
from sqlalchemy import create_engine
from sqlalchemy.ext.asyncio import AsyncSession, create_async_engine
from sqlalchemy.orm import sessionmaker
from .models import Base
from passlib.context import CryptContext

pwd_context = CryptContext(schemes=["bcrypt"], deprecated="auto")

DATABASE_URL = "postgresql+asyncpg://user:password@localhost/dbname"

engine = create_async_engine(DATABASE_URL, echo=True)  # echo to see SQL in terminal
SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine, class_=AsyncSession)

async def init_db():
    async with engine.begin() as conn:
        await conn.run_sync(Base.metadata.create_all)

# Dependency to get the database session
async def get_db():
    async with SessionLocal() as session:
        yield session
# crud.py
# Note: these helpers use a synchronous Session; with the AsyncSession configured in
# database.py they would need async/await equivalents (e.g., awaited select() queries).
from sqlalchemy.orm import Session
from .models import User
from .schemas import UserCreate, UserUpdate
from .auth import get_password_hash

def get_user(db: Session, user_id: int):
    return db.query(User).filter(User.id == user_id).first()

def get_user_by_username(db: Session, username: str):
    return db.query(User).filter(User.username == username).first()

def create_user(db: Session, user: UserCreate):
    hashed_password = get_password_hash(user.password)
    db_user = User(
        username=user.username,
        full_name=user.full_name,
        email=user.email,
        hashed_password=hashed_password,
        disabled=user.disabled,
    )
    db.add(db_user)
    db.commit()
    db.refresh(db_user)
    return db_user

def update_user(db: Session, user_id: int, user: UserUpdate):
    db_user = get_user(db, user_id)
    if db_user:
        db_user.username = user.username
        db_user.full_name = user.full_name
        db_user.email = user.email
        if user.password:
            db_user.hashed_password = get_password_hash(user.password)
        db_user.disabled = user.disabled
        db.commit()
        db.refresh(db_user)
    return db_user

def delete_user(db: Session, user_id: int):
    db_user = get_user(db, user_id)
    if db_user:
        db.delete(db_user)
        db.commit()
    return db_user
# auth.py
from datetime import datetime, timedelta
from typing import Optional
from fastapi import Depends, HTTPException, status
from fastapi.security import OAuth2PasswordBearer, OAuth2PasswordRequestForm
from jose import JWTError, jwt
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy.future import select
from .models import User
from .database import get_db, pwd_context

# Secret key to encode and decode JWT tokens
SECRET_KEY = "your_secret_key"
ALGORITHM = "HS256"
ACCESS_TOKEN_EXPIRE_MINUTES = 30

oauth2_scheme = OAuth2PasswordBearer(tokenUrl="token")

def verify_password(plain_password: str, hashed_password: str) -> bool:
    return pwd_context.verify(plain_password, hashed_password)

def get_password_hash(password: str) -> str:
    return pwd_context.hash(password)

async def authenticate_user(db: AsyncSession, username: str, password: str):
    result = await db.execute(select(User).filter(User.username == username))
    user = result.scalars().first()
    if not user:
        return False
    if not verify_password(password, user.hashed_password):
        return False
    return user

def create_access_token(data: dict, expires_delta: Optional[timedelta] = None):
    to_encode = data.copy()
    if expires_delta:
        expire = datetime.now() + expires_delta
    else:
        expire = datetime.now() + timedelta(minutes=15)
    to_encode.update({"exp": expire})
    encoded_jwt = jwt.encode(to_encode, SECRET_KEY, algorithm=ALGORITHM)
    return encoded_jwt

async def get_current_user(token: str = Depends(oauth2_scheme), db: AsyncSession = Depends(get_db)):
    credentials_exception = HTTPException(
        status_code=status.HTTP_401_UNAUTHORIZED,
        detail="Could not validate credentials",
        headers={"WWW-Authenticate": "Bearer"},
    )
    try:
        payload = jwt.decode(token, SECRET_KEY, algorithms=[ALGORITHM])
        username: str = payload.get("sub")
        if username is None:
            raise credentials_exception
    except JWTError:
        raise credentials_exception
    result = await db.execute(select(User).filter(User.username == username))
    user = result.scalars().first()
    if user is None:
        raise credentials_exception
    return user

async def get_current_active_user(current_user: User = Depends(get_current_user)):
    if current_user.disabled:
        raise HTTPException(status_code=400, detail="Inactive user")
    return current_user
#schemas.py
from pydantic import BaseModel
from typing import Optional

class Token(BaseModel):
    access_token: str
    token_type: str

class TokenData(BaseModel):
    username: Optional[str] = None

class UserCreate(BaseModel):  # request schema used by crud.py/main.py; fields inferred from crud.create_user
    username: str
    full_name: Optional[str] = None
    email: str
    password: str
    disabled: bool = False

class UserUpdate(UserCreate):  # partial update; password optional
    password: Optional[str] = None
# main.py
from fastapi import FastAPI, Depends, HTTPException, Request, Form, Body
from fastapi.responses import HTMLResponse, JSONResponse
from fastapi.security import OAuth2PasswordRequestForm
from fastapi.templating import Jinja2Templates
from fastapi.staticfiles import StaticFiles
from sqlalchemy.ext.asyncio import AsyncSession
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import os
from dotenv import load_dotenv
from datetime import timedelta
from .auth import authenticate_user, create_access_token, get_current_active_user, ACCESS_TOKEN_EXPIRE_MINUTES
from .models import User
from .database import get_db, init_db
from .crud import get_user_by_username, create_user
from .schemas import UserCreate, Token
from contextlib import asynccontextmanager

# Load environment variables from .env file
load_dotenv()

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: create the database tables
    await init_db()
    yield
    # Shutdown: perform any necessary cleanup here

app = FastAPI(lifespan=lifespan)

# Mount the static directory
app.mount("/static", StaticFiles(directory="static"), name="static")

# Serve HTML templates
templates = Jinja2Templates(directory="templates")

# Load the model and tokenizer
model_name = os.getenv("MODEL_NAME", "gpt2")  # Replace with your model path or name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

@app.post("/token", response_model=Token)
async def login_for_access_token(form_data: OAuth2PasswordRequestForm = Depends(), db: AsyncSession = Depends(get_db)):
    user = await authenticate_user(db, form_data.username, form_data.password)
    if not user:
        raise HTTPException(
            status_code=401,
            detail="Incorrect username or password",
            headers={"WWW-Authenticate": "Bearer"},
        )
    access_token_expires = timedelta(minutes=ACCESS_TOKEN_EXPIRE_MINUTES)
    access_token = create_access_token(
        data={"sub": user.username}, expires_delta=access_token_expires
    )
    return {"access_token": access_token, "token_type": "bearer"}

@app.post("/users/", response_model=User)
async def create_new_user(user: UserCreate, db: AsyncSession = Depends(get_db)):
    db_user = await get_user_by_username(db, user.username)
    if db_user:
        raise HTTPException(status_code=400, detail="Username already registered")
    return await create_user(db, user)

@app.get("/", response_class=HTMLResponse)
async def read_root(request: Request):
    context = {
        "request": request,
        "title": "LLM Text Generator",
        "description": "Generate text using a large language model."
    }
    return templates.TemplateResponse("index.html", context)

@app.post("/generate", response_class=HTMLResponse)
async def generate_text(request: Request, prompt: str = Form(...), current_user: User = Depends(get_current_active_user)):
    if not prompt:
        return HTMLResponse("<div id='result'><p>Error: Prompt cannot be empty!</p></div>")

    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    outputs = model.generate(inputs["input_ids"], max_length=100)
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return HTMLResponse(f"<div id='result'><h2>Generated Text:</h2><p>{generated_text}</p></div>")

@app.post("/api/generate", response_class=JSONResponse)
async def api_generate_text(prompt: str = Body(..., embed=True), current_user: User = Depends(get_current_active_user)):
    if not prompt:
        return JSONResponse(status_code=400, content={"error": "Prompt cannot be empty!"})

    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    outputs = model.generate(inputs["input_ids"], max_length=100)
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return JSONResponse(content={"generated_text": generated_text})

# To run the app: uvicorn main:app --reload
# index.html
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>LLM Text Generator</title>
    <link rel="stylesheet" href="/static/css/styles.css">
    <script src="https://unpkg.com/htmx.org@1.6.1"></script>
</head>
<body>
    <h1>LLM Text Generator</h1>
    <p>Generate text using a large language model.</p>

    <!-- User Registration Form -->
    <div id="registration">
        <h2>Register</h2>
        <form hx-post="/users/" hx-target="#registration-result" hx-swap="innerHTML">
            <label for="username">Username:</label>
            <input type="text" id="username" name="username" required>
            <label for="full_name">Full Name:</label>
            <input type="text" id="full_name" name="full_name">
            <label for="email">Email:</label>
            <input type="email" id="email" name="email" required>
            <label for="password">Password:</label>
            <input type="password" id="password" name="password" required>
            <button type="submit">Register</button>
        </form>
        <div id="registration-result"></div>
    </div>

    <!-- User Login Form -->
    <div id="login">
        <h2>Login</h2>
        <form hx-post="/token" hx-target="#login-result" hx-swap="innerHTML">
            <label for="login-username">Username:</label>
            <input type="text" id="login-username" name="username" required>
            <label for="login-password">Password:</label>
            <input type="password" id="login-password" name="password" required>
            <button type="submit">Login</button>
        </form>
        <div id="login-result"></div>
    </div>

    <!-- Text Generation Form -->
    <div id="text-generation">
        <h2>Generate Text</h2>
        <form hx-post="/generate" hx-target="#result" hx-swap="innerHTML">
            <label for="prompt">Enter your prompt:</label>
            <input type="text" id="prompt" name="prompt" required>
            <button type="submit">Generate</button>
        </form>
        <div id="result"></div>
    </div>
</body>
</html>

Note: the section above is still a work in progress.