Skip to main content
APIYI fully supports the official Gemini native format (the /v1beta generateContent endpoint): point the base_url at https://api.apiyi.com and your existing Gemini code and official SDKs migrate seamlessly — no format conversion needed. This page is based on the official Google documentation (ai.google.dev/gemini-api/docs, as of June 2026). All examples are copy-paste ready.

Why the Native Format

The OpenAI-compatible format can call Gemini too, but the following are native-only:
  • Full thinking controls: thinking_level (Gemini 3 series) / thinking_budget (2.5 series), thought summaries, thought signatures
  • Native multimodal Parts: inline images / audio / video, with media_resolution cost control — see Multimodal & Code Execution
  • Code execution tool: code_execution runs Python in a sandbox
  • Fine-grained usage fields: thoughts_token_count, cached_content_token_count, and more
For plain text chat, or one codebase across multiple vendors, use OpenAI Compatible Mode instead.

Quick Start

Use Google’s official unified SDK google-genai (the legacy google-generative-ai was sunset on November 30, 2025 (UTC)):
pip install google-genai
from google import genai

client = genai.Client(
    api_key="YOUR_API_KEY",  # your APIYI key
    http_options={"base_url": "https://api.apiyi.com"}
)

response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents="Introduce yourself in one sentence"
)
print(response.text)
The base_url is https://api.apiyi.com (without /v1) — different from the OpenAI-compatible format’s https://api.apiyi.com/v1. Use your APIYI key, not a Google AI Studio key.

Streaming

stream = client.models.generate_content_stream(
    model="gemini-3.5-flash",
    contents="Write a short essay on quantum computing"
)

for chunk in stream:
    print(chunk.text, end="", flush=True)

Thinking Controls

Gemini models think by default, and the two generations use different parameters — mixing them raises an error:
Model seriesParameterValues
Gemini 3 / 3.1 / 3.5thinking_levelminimal (Flash family only) / low / high (default)
Gemini 2.5thinking_budgettoken cap (e.g. 0–8192); model auto-controls if unset
Passing both thinking_level and thinking_budget to a Gemini 3 series model returns an error — pick one (use thinking_level for the 3 series).
from google.genai import types

# Gemini 3 series: level-based control
response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents="Prove that the square root of 2 is irrational",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_level="high")
    )
)

# Gemini 2.5 series: token-budget control
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Simple question, be quick",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=0)  # thinking off
    )
)
Choosing a level: minimal for low-latency simple tasks (classification, extraction); low for routine chat; high for complex reasoning and code. Thinking tokens bill at the output rate — higher levels cost more.

Thought summaries and thought signatures

  • Thought summaries: include_thoughts=True returns a summary of the reasoning (parts where part.thought is True)
  • Thought signatures: encrypted reasoning state introduced with Gemini 3. In multi-turn conversations (especially function calling), pass the thought_signature from the response back unchanged so the model can continue its reasoning chain. Official SDKs handle this automatically; don’t strip the field in hand-written REST calls — see Function Calling
response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents="Analyze the time complexity of: def fib(n): return n if n <= 1 else fib(n-1) + fib(n-2)",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_level="high", include_thoughts=True)
    )
)

for part in response.candidates[0].content.parts:
    if getattr(part, "thought", False):
        print(f"[Thought summary] {part.text}")
    else:
        print(f"[Final answer] {part.text}")

Common Config Parameters

Passed via config (GenerateContentConfig):
ParameterDescription
system_instructionSystem prompt
temperature0–2. Google recommends keeping the default 1.0 for Gemini 3 series — lowering it can hurt reasoning quality
max_output_tokensOutput cap (includes thinking tokens)
thinking_configThinking controls, see above
response_mime_typeSet application/json to force JSON output
response_schemaSchema constraint for structured JSON output
toolsFunction declarations / code_execution and other tools
media_resolutionMultimodal input cost control, see Multimodal page

Usage Fields (usage_metadata)

usage = response.usage_metadata
print(f"Input: {usage.prompt_token_count}")
print(f"Output: {usage.candidates_token_count}")
print(f"Thinking: {usage.thoughts_token_count}")
print(f"Cache hits: {usage.cached_content_token_count}")
FieldDescriptionBilling
prompt_token_countInput tokensInput rate
candidates_token_countOutput tokensOutput rate
thoughts_token_countThinking tokensOutput rate — tune the level to save
cached_content_token_countCached tokensOfficial discount, see Cache Billing
total_token_countTotal

Supported Models and Pricing

ModelInput (per 1M tokens)Output (per 1M tokens)Notes
gemini-3.5-flash$1.50$9.00Current workhorse — beats 3.1 Pro on several benchmarks, 1M context
gemini-3.1-pro-preview$1.80$10.80Pro flagship
gemini-3-pro-preview$1.80$10.80Previous Pro
gemini-3-flash-preview$0.44$2.64Light and fast
gemini-3.1-flash-lite$0.25$1.50Ultra-budget
gemini-2.5-pro$1.25$10.002.5-series Pro
gemini-2.5-flash$0.30$2.402.5-series workhorse
gemini-2.5-flash-lite$0.10$0.40Cheapest
Some models have -thinking / -nothinking alias variants (e.g. gemini-3-flash-preview-nothinking) that pin thinking on/off — handy for clients where you can’t change request parameters. Full list: Models & Pricing.

Native vs OpenAI-Compatible

FeatureGemini nativeOpenAI-compatible
base_urlhttps://api.apiyi.comhttps://api.apiyi.com/v1
SDKgoogle-genaiopenai
Thinking controlthinking_level / thinking_budgetreasoning_effort
Thought summaries / signatures
Code execution tool
Media inputNative inline Parts (PIL / bytes)Base64 image_url
Cache hit fieldcached_content_token_countcached_tokens

Notes

  • The Files API is not supported (client.files.upload()); media must be passed inline and each file must stay under 20MB — see Multimodal & Code Execution
  • Cache discounts and hit-rate expectations: Cache Billing