How to use the GPT-5 API in 2026: a technical guide for developers | Blog Prompti

The OpenAI GPT-5 API is the standard way to integrate generative AI into custom applications: per-token pricing, sub-second time to first token on the mini models, streaming support, mature function calling, and schema-constrained JSON output. In 2026 the basic workflow (setup → first request → cost optimization) is covered by official SDKs for Python, Node.js, .NET, Go, and Java. This is an update of our original guide, which targeted GPT-4o, with 2026 code and best practices for GPT-5.

API key setup and authentication

Step 1: create a business account on platform.openai.com (NOT your consumer ChatGPT account; the two are separate).

Step 2: go to Settings → API Keys → Create new secret key. Store the key somewhere safe: it cannot be displayed again after creation.

Step 3: set the environment variables:

# .env file
OPENAI_API_KEY="sk-proj-..."
OPENAI_ORG_ID="org-..."  # optional, for multi-org accounts
OPENAI_PROJECT_ID="proj_..."  # optional, for project-scoped keys

Step 4: install the official SDK:

# Python
pip install openai

# Node.js
npm install openai

# .NET
dotnet add package OpenAI

Security best practices:

Never commit API keys to a Git repo (use .env + .gitignore)
Use project-scoped keys with minimal permissions
Implement rate limiting on the application side
Log API calls for auditing (without logging sensitive prompts)
Rotate keys every 6-12 months
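Application-side rate limiting from the list above can be sketched as a token bucket. This is a minimal, single-threaded illustration using only the standard library; the class name, the `rate` and `capacity` parameters, and the injectable `clock` are assumptions made for this example, not part of the OpenAI SDK.

```python
import time


class TokenBucket:
    """Allow about `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate              # tokens refilled per second
        self.capacity = capacity      # maximum burst size
        self.tokens = float(capacity)
        self.clock = clock            # injectable so the limiter can be tested
        self.updated = clock()

    def try_acquire(self):
        """Consume one token and return True if a request may go out now."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill proportionally to the time elapsed, never above capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Call `try_acquire()` before each API request and queue or reject the request when it returns False. A multi-threaded service would also need a lock around the method.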

First request

Python

from openai import OpenAI

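client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-5",
    messages=[
        {"role": "system", "content": "You are an Italian B2B assistant."},
        {"role": "user", "content": "What is prompt design?"}
    ],
    # GPT-5 is a reasoning model: it takes max_completion_tokens (not max_tokens),
    # and the cap also covers hidden reasoning tokens, so a low value can truncate the answer.
    # temperature is omitted: at the time of writing GPT-5 accepts only the default value.
    max_completion_tokens=300
)

print(response.choices[0].message.content)
print(f"Tokens: {response.usage.total_tokens}")
# Rough flat-rate estimate ($10/M tokens); input and output are billed at different rates
print(f"Cost: ~${response.usage.total_tokens * 0.00001:.4f}")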

Node.js

import OpenAI from 'openai';

const client = new OpenAI();

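const response = await client.chat.completions.create({
  model: 'gpt-5',
  messages: [
    { role: 'system', content: 'You are an Italian B2B assistant.' },
    { role: 'user', content: 'What is prompt design?' }
  ],
  // GPT-5 takes max_completion_tokens rather than max_tokens;
  // temperature is left at its default, the only value GPT-5 accepts at the time of writing
  max_completion_tokens: 300
});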

console.log(response.choices[0].message.content);
console.log(`Tokens: ${response.usage.total_tokens}`);

Models available in 2026

gpt-5            # frontier, $5-15/M tokens, full reasoning
gpt-5-mini       # cost-optimized, $0.25-1.25/M tokens
gpt-5-nano       # ultra-low latency, edge-optimized
o3               # extended reasoning, $20-60/M tokens, math/coding/audit
o4-mini          # compact reasoning, lower cost
text-embedding-3-large  # embeddings, 3072 dimensions, RAG
text-embedding-3-small  # embeddings, 1536 dimensions, lightweight
whisper-1        # speech-to-text
tts-1            # text-to-speech
gpt-image-1      # image generation (successor to DALL-E 3)

Choose the model by task: gpt-5-mini for classification and extraction, gpt-5 for drafting and analysis, o3 for deep reasoning.

Streaming responses

For real-time chat with a responsive UX (characters appear as the model generates them):

Python streaming

stream = client.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content": "Explain prompt design"}],
    stream=True,
)

for chunk in stream:
    # Some chunks (for example a trailing usage chunk) carry no choices
    if chunk.choices and chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)

Node.js streaming

const stream = await client.chat.completions.create({
  model: 'gpt-5',
  messages: [{ role: 'user', content: 'Explain prompt design' }],
  stream: true
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) process.stdout.write(content);
}

Streaming is essential for custom chatbots, code assistants, and any UI where perceived latency matters.
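A UI usually needs both the incremental deltas (to render) and the full text (to store in the conversation history). A minimal sketch of a helper that does both is shown here; `collect_stream`, its `on_delta` callback, and `fake_chunk` are names invented for this example, and `fake_chunk` only mimics the shape of an SDK chunk so the helper can be tried offline.

```python
from types import SimpleNamespace


def collect_stream(stream, on_delta=None):
    """Consume a chat-completions stream and return the concatenated text."""
    parts = []
    for chunk in stream:
        if not chunk.choices:          # e.g. a trailing usage-only chunk
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
            if on_delta is not None:
                on_delta(delta)        # e.g. push the delta to the browser
    return "".join(parts)


def fake_chunk(text):
    """Hypothetical stand-in for a streamed chunk, for offline testing only."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])
```

With a real stream you would call `collect_stream(stream, on_delta=send_to_client)`, where `send_to_client` is whatever your web framework uses to push data to the browser.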

Function calling (tool use)

Function calling lets the model request calls to functions in your code. Use cases: database lookups, external APIs, state-changing actions (create/update/delete).

Python example: order lookup tool

import json

# Assumes `client` from the first example; lookup_order_in_db is your own function
tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Get current status of an order by ID",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string", "description": "Order ID like 'ORD-12345'"}
            },
            "required": ["order_id"]
        }
    }
}]

response = client.chat.completions.create(
    model="gpt-5",
    messages=[
        {"role": "user", "content": "What is the status of order ORD-12345?"}
    ],
    tools=tools,
    tool_choice="auto"
)

# The model returns a tool call instead of a text answer
message = response.choices[0].message
if not message.tool_calls:
    raise RuntimeError("The model answered directly without calling the tool")
tool_call = message.tool_calls[0]
order_id = json.loads(tool_call.function.arguments)["order_id"]

# Call our own function
status = lookup_order_in_db(order_id)

# Pass the result back to the model for the final answer
final_response = client.chat.completions.create(
    model="gpt-5",
    messages=[
        {"role": "user", "content": "What is the status of order ORD-12345?"},
        message,
        {"role": "tool", "tool_call_id": tool_call.id, "content": str(status)}
    ]
)

print(final_response.choices[0].message.content)

Function calling is the foundation of agent flows: the model decides which tools to call, you execute them, and the model synthesizes the answer. For complex orchestration, see our AI agent development service.
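The "you execute them" step generalizes to several tools and several calls per turn. The sketch below shows one way to dispatch them; `TOOL_REGISTRY`, `run_tool_calls`, and the stubbed `get_order_status` are assumptions for illustration. The only thing taken from the API is the shape of a tool call (`id`, `function.name`, `function.arguments`) and of the `tool` message sent back.

```python
import json
from types import SimpleNamespace


def get_order_status(order_id):
    """Hypothetical stand-in for a real database lookup."""
    return {"order_id": order_id, "status": "shipped"}


# Maps the tool names declared to the model onto local Python callables
TOOL_REGISTRY = {"get_order_status": get_order_status}


def run_tool_calls(tool_calls, registry=TOOL_REGISTRY):
    """Execute each requested tool and return the `tool` messages to send back."""
    messages = []
    for call in tool_calls or []:
        try:
            func = registry.get(call.function.name)
            if func is None:
                raise ValueError(f"unknown tool: {call.function.name}")
            args = json.loads(call.function.arguments)
            result = func(**args)
        except Exception as exc:
            # Report the failure to the model instead of crashing the request
            result = {"error": str(exc)}
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result),
        })
    return messages
```

Append the assistant message and the returned `tool` messages to the conversation, then call the model again for the final answer. Returning errors as tool output lets the model explain the problem or retry with different arguments.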

Structured JSON output

When you need output that conforms to a fixed schema (data extraction, classification), use structured output:

from pydantic import BaseModel

class OrderExtraction(BaseModel):
    order_id: str
    customer_name: str
    total_amount: float
    items_count: int

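# Assumes `client` from the first example and `email_text` holding the raw email body.
# Older SDK versions expose this method as client.beta.chat.completions.parse.
response = client.chat.completions.parse(
    model="gpt-5",
    messages=[
        {"role": "system", "content": "Extract the order data as JSON."},
        {"role": "user", "content": email_text}
    ],
    response_format=OrderExtraction
)

order = response.choices[0].message.parsed  # None if the model refused
if order is not None:
    print(order.order_id, order.total_amount)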

With response_format set to a Pydantic model, OpenAI guarantees valid JSON that conforms to the schema. Parsing errors drop from the 5-10% seen in 2024 to under 1% in 2026. This suits ETL, data extraction, and B2B integrations.
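If you prefer not to depend on Pydantic, the same constraint can be expressed as a raw JSON Schema passed in `response_format`. The sketch below shows roughly what that looks like for the order example, plus a small local re-check for defense in depth. `ORDER_SCHEMA`, `RESPONSE_FORMAT`, and `order_errors` are names chosen for this example, and the checker is a deliberately minimal stand-in for a real JSON Schema validator.

```python
import json

ORDER_SCHEMA = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "customer_name": {"type": "string"},
        "total_amount": {"type": "number"},
        "items_count": {"type": "integer"},
    },
    # Strict mode expects every field to be required and no extra fields
    "required": ["order_id", "customer_name", "total_amount", "items_count"],
    "additionalProperties": False,
}

# Value for the response_format parameter of chat.completions.create
RESPONSE_FORMAT = {
    "type": "json_schema",
    "json_schema": {"name": "order_extraction", "strict": True, "schema": ORDER_SCHEMA},
}

_PY_TYPES = {"string": str, "number": (int, float), "integer": int}


def order_errors(raw_json):
    """Return a list of problems found in a response; empty means valid."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return ["invalid JSON"]
    if not isinstance(data, dict) or set(data) != set(ORDER_SCHEMA["required"]):
        return ["unexpected or missing fields"]
    errors = []
    for key, spec in ORDER_SCHEMA["properties"].items():
        value = data[key]
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, _PY_TYPES[spec["type"]]):
            errors.append(f"wrong type for {key}")
    return errors
```

Running `order_errors(response.choices[0].message.content)` before writing to a database is cheap insurance when the result feeds an ETL pipeline.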

Cost optimization

Five concrete techniques to cut costs by 30-70%:

1. Automatic model routing

def choose_model(score):
    if score < 3:
        return "gpt-5-mini"  # simple tasks
    elif score < 7:
        return "gpt-5"  # standard
    else:
        return "o3"  # deep reasoning

# estimate_complexity is a placeholder for your own 0-10 scoring function
response = client.chat.completions.create(
    model=choose_model(estimate_complexity(prompt)),
    messages=messages
)
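The routing snippet leaves open how the complexity score is computed. One possible approach is a cheap keyword-and-length heuristic, sketched below. The function name `estimate_complexity`, the hint lists, and the weights are all assumptions for illustration and should be tuned against your own traffic; the thresholds in `choose_model` are the ones from the routing example.

```python
REASONING_HINTS = ("prove", "derive", "audit", "step by step", "debug")
CODE_HINTS = ("def ", "class ", "select ", "traceback")


def estimate_complexity(prompt):
    """Hypothetical heuristic: score a prompt from 0 (trivial) to 10 (hard)."""
    text = prompt.lower()
    score = min(len(text.split()) // 50, 4)      # up to 4 points for length
    if any(hint in text for hint in REASONING_HINTS):
        score += 3                               # explicit reasoning request
    if any(hint in text for hint in CODE_HINTS):
        score += 3                               # prompt contains code
    return score


def choose_model(score):
    """Map a 0-10 complexity score to a model tier."""
    if score < 3:
        return "gpt-5-mini"
    elif score < 7:
        return "gpt-5"
    return "o3"
```

A heuristic like this costs nothing per request. A common alternative is to let gpt-5-mini classify the request first, which is more accurate but adds one cheap call to every routed request.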

2. Prompt caching

OpenAI automatically caches the prefix of frequently reused prompts (discount of up to 50%). Caching applies to prompts of at least 1,024 tokens, so structure prompts as a long, stable prefix followed by the dynamic variables:

SYSTEM_PROMPT = """Long system prompt with detailed instructions,
few-shot examples, brand voice guidelines, etc...""" # 2K-5K stable tokens

# The stable prefix is cached and the discount is applied automatically
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": user_message}  # varies per request
]
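To confirm that the prefix really is being cached, you can read the cached-token count the API reports in `usage.prompt_tokens_details.cached_tokens`. The helper below is a small sketch; the function name `cached_fraction` is an assumption, and the `SimpleNamespace` import is only there so the helper can be tried offline with stand-in objects shaped like the SDK's `usage`.

```python
from types import SimpleNamespace


def cached_fraction(usage):
    """Share of prompt tokens served from the prompt cache (0.0 to 1.0)."""
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) or 0
    if not usage.prompt_tokens:
        return 0.0
    return cached / usage.prompt_tokens
```

Log `cached_fraction(response.usage)` per request. If it stays near zero for a prompt you expected to be cached, something dynamic (a timestamp, a user name) has probably crept into the prefix.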

3. Batch API for non-urgent workloads

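import time

# Upload a JSONL file of requests (up to 50,000 per batch);
# "requests.jsonl" is a placeholder path
file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")

batch_response = client.batches.create(
    input_file_id=file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"  # 50% discount vs real-time
)

# Poll until the batch reaches a terminal state
while batch_response.status not in ("completed", "failed", "expired", "cancelled"):
    time.sleep(60)
    batch_response = client.batches.retrieve(batch_response.id)

# Download the results
if batch_response.status == "completed":
    results = client.files.content(batch_response.output_file_id)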
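The batch input is a JSONL file with one request per line, each tagged with a `custom_id` so results can be matched back (output lines are not guaranteed to come back in input order). A minimal sketch of building that file and indexing the output follows; the helper names and the `req-N` id scheme are assumptions for this example.

```python
import json


def build_batch_lines(prompts, model="gpt-5-mini"):
    """Return JSONL text with one /v1/chat/completions request per prompt."""
    lines = []
    for i, prompt in enumerate(prompts):
        request = {
            "custom_id": f"req-{i}",          # used to match results to inputs
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        lines.append(json.dumps(request, ensure_ascii=False))
    return "\n".join(lines) + "\n"


def index_results(output_jsonl):
    """Map custom_id to the response body, or None when the request failed."""
    results = {}
    for line in output_jsonl.splitlines():
        if not line.strip():
            continue
        row = json.loads(line)
        response = row.get("response") or {}
        results[row["custom_id"]] = response.get("body")
    return results
```

Write the output of `build_batch_lines` to disk, upload it with `purpose="batch"`, and pass the text of the downloaded output file to `index_results`.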

4. Token monitoring

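import tiktoken

try:
    encoder = tiktoken.encoding_for_model("gpt-5")
except KeyError:
    # Older tiktoken releases do not know gpt-5; fall back to o200k_base,
    # assumed here to be the closest available encoding
    encoder = tiktoken.get_encoding("o200k_base")
token_count = len(encoder.encode(prompt))

if token_count > 4000:
    # log a warning, optimize the prompt
    pass

cost_estimate = token_count * 0.00001  # rough flat rate of $10/M tokens
log_to_metrics(model="gpt-5", tokens=token_count, cost=cost_estimate)  # your own metrics hook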

5. Application-level caching

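import hashlib
import json
from redis import Redis

redis = Redis()

def cached_completion(messages, model):
    # Key on model + messages so different models never share an entry
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    cache_key = hashlib.md5(payload.encode()).hexdigest()
    cached = redis.get(cache_key)
    if cached:
        return json.loads(cached)

    response = client.chat.completions.create(model=model, messages=messages)
    data = response.model_dump()  # plain dict, same shape on hit and miss
    redis.setex(cache_key, 86400, json.dumps(data))  # expire after 24 hours
    return data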

For frequently repeated queries (FAQs, recurring classification), an application-level cache saves 50-90% of token costs.

2026 pricing: GPT-5 vs GPT-5-mini vs o3

Model	Input ($/M tokens)	Output ($/M tokens)	Use cases
gpt-5-mini	~$0.25	~$1.25	Classification, extraction, formatting, high-volume chatbots
gpt-5	~$5	~$15	Drafting, analysis, customer support, content production
o3	~$20	~$60	Math, deep coding, due diligence, critical reasoning

Typical workload for a 50-employee SMB with average use cases:

70% of calls on gpt-5-mini → $50-200/month
25% of calls on gpt-5 → $150-700/month
5% of calls on o3 → $50-300/month
Typical total: $250-1,200/month

Actual figures depend on volume and should be benchmarked against your real workload.
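Benchmarking a workload comes down to multiplying token counts by separate input and output prices. The sketch below uses the approximate prices from the table in this section; the function names and the example volumes in the usage note are hypothetical, and the prices should be replaced with the current list prices before relying on the result.

```python
# (input, output) price in USD per million tokens, approximate figures from this guide
PRICES_PER_M = {
    "gpt-5-mini": (0.25, 1.25),
    "gpt-5": (5.00, 15.00),
    "o3": (20.00, 60.00),
}


def request_cost(model, input_tokens, output_tokens):
    """Cost in USD of one request, pricing input and output separately."""
    price_in, price_out = PRICES_PER_M[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


def monthly_cost(workload):
    """Total USD for (model, requests, avg_input_tokens, avg_output_tokens) rows."""
    return sum(
        requests * request_cost(model, avg_in, avg_out)
        for model, requests, avg_in, avg_out in workload
    )
```

For example, `monthly_cost([("gpt-5-mini", 100_000, 800, 200), ("gpt-5", 30_000, 1500, 500), ("o3", 2_000, 3000, 1000)])` prices a made-up month of traffic. Feed it the `usage.prompt_tokens` and `usage.completion_tokens` averages from your own logs to get a realistic figure.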

Frequently asked questions

Which official SDK is best?

All the official SDKs (Python, Node, .NET, Go, Java) are actively maintained and offer feature parity. Choose based on your application's stack. Python is the most widely adopted for ML and data engineering, Node.js for full-stack web applications.

What is the typical latency?

GPT-5 mini: 200-500 ms to first token, 50-100 tokens/second when streaming. GPT-5 full: 500-1500 ms to first token, 30-80 tokens/second. o3: 5-30 seconds (reasoning), with variable latency.

For a real-time chatbot UX, streaming is essential.
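These ranges are worth verifying on your own prompts and network. A minimal sketch for measuring time to first token and streaming throughput is shown below; `measure_stream` and `fake_stream` are names invented for this example. It counts streamed chunks, which is only a rough proxy for tokens, and `fake_stream` merely imitates the shape of SDK chunks for offline testing.

```python
import time
from types import SimpleNamespace


def measure_stream(stream, clock=time.perf_counter):
    """Return (seconds_to_first_token, chunks_per_second, text) for one stream."""
    start = clock()
    first_at = None
    pieces = []
    for chunk in stream:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if not delta:
            continue
        if first_at is None:
            first_at = clock()         # first visible content arrived
        pieces.append(delta)
    if first_at is None:
        return None, 0.0, ""           # the stream produced no text
    elapsed = clock() - first_at
    rate = len(pieces) / elapsed if elapsed > 0 else 0.0
    return first_at - start, rate, "".join(pieces)


def fake_stream(texts):
    """Hypothetical offline stand-in for an SDK stream, used only for testing."""
    for text in texts:
        delta = SimpleNamespace(content=text)
        yield SimpleNamespace(choices=[SimpleNamespace(delta=delta)])
```

Create the request with `stream=True` immediately before calling `measure_stream`, so that the measured time covers the request round trip, and repeat the measurement several times, since latency varies with load.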

Can I use the OpenAI API for personal data?

Yes, with precautions. Business accounts have no training on your data by default and 30-day retention for abuse monitoring (zero retention can be requested for approved workloads). EU Data Residency is available on Enterprise/API Platform on request. For sensitive data (health, biometric, judicial), consider Azure OpenAI Service for dedicated Microsoft tenancy.

How do I handle rate limits and errors?

Implement retries with exponential backoff, application-side rate limiting, and a fallback to a cheaper model when you hit capacity limits. Pattern:

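from openai import APIConnectionError, InternalServerError, RateLimitError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

# Retry only transient failures; bad requests and auth errors fail immediately
@retry(retry=retry_if_exception_type((RateLimitError, APIConnectionError, InternalServerError)),
       wait=wait_exponential(min=1, max=60), stop=stop_after_attempt(5))
def call_openai(messages):
    return client.chat.completions.create(model="gpt-5", messages=messages)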
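The fallback to a cheaper model mentioned in the answer can be layered on top of the retries. The sketch below uses only the standard library so the logic is visible. `RetryableError`, `call_with_backoff`, and their parameters are assumptions for illustration; in real code you would raise or map to `RetryableError` from the SDK's transient errors (rate limits, timeouts, 5xx responses).

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical marker for transient failures worth retrying."""


def call_with_backoff(call, models, max_attempts=4, base_delay=1.0,
                      max_delay=60.0, sleep=time.sleep):
    """Try each model in order, retrying transient failures with backoff and jitter."""
    last_error = None
    for model in models:
        for attempt in range(max_attempts):
            try:
                return call(model)
            except RetryableError as exc:
                last_error = exc
                if attempt < max_attempts - 1:
                    # Exponential backoff: base, 2x base, 4x base... capped at max_delay
                    delay = min(max_delay, base_delay * 2 ** attempt)
                    sleep(delay * random.uniform(0.5, 1.0))   # jitter avoids retry bursts
        # All attempts on this model failed: fall through to the next model
    if last_error is None:
        raise ValueError("models must not be empty")
    raise last_error
```

A call such as `call_with_backoff(lambda m: client.chat.completions.create(model=m, messages=messages), ["gpt-5", "gpt-5-mini"])` would first exhaust its retries on gpt-5 and then fall back to gpt-5-mini, provided the lambda translates the SDK's transient errors into `RetryableError`.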

Do the SDKs support Azure OpenAI Service?

Yes, the OpenAI Python and Node SDKs support both OpenAI directly and Azure OpenAI, with minimal configuration differences. See Microsoft Azure AI for details.
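In the Python SDK the configuration difference comes down to constructing `AzureOpenAI` instead of `OpenAI`. The sketch below reads the settings from the environment variables the SDK itself looks for. The helper names are assumptions, and the import is done lazily so that the configuration logic can be checked without the SDK installed.

```python
import os


def azure_client_kwargs(env=os.environ):
    """Collect the settings AzureOpenAI needs; raises KeyError if one is missing."""
    return {
        "api_key": env["AZURE_OPENAI_API_KEY"],
        "azure_endpoint": env["AZURE_OPENAI_ENDPOINT"],  # your Azure resource URL
        "api_version": env["OPENAI_API_VERSION"],        # Azure API version string
    }


def make_azure_client(env=os.environ):
    """Build an Azure-backed client with the same call surface as OpenAI()."""
    from openai import AzureOpenAI  # imported lazily; requires the openai package
    return AzureOpenAI(**azure_client_kwargs(env))
```

The rest of the code in this guide stays the same, with one caveat: on Azure the `model` argument is the name of your deployment, which may differ from the underlying model name.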

Further reading

OpenAI GPT: complete guide to models and plans
Prompt Engineering service: enterprise implementation
AI Agent Development service: custom agent flows
Microsoft Azure AI: Azure OpenAI Service (alternative)
Request a consultation: one-hour initial call, €240