Model Usage & Cost
Across Langfuse, usage and cost are tracked for LLM generations:
- Usage: token/character counts
- Cost: USD cost of the generation
Both usage and cost can be either
- ingested via API, SDKs, or integrations
- or inferred based on the model parameter of the generation
Langfuse comes with a list of predefined popular models and their tokenizers, including OpenAI, Anthropic, and Google models. You can also add your own custom model definitions or request official support for new models via GitHub. Inferred costs are calculated at the time of ingestion.
Ingested usage and cost are prioritized over inferred usage and cost.
Via the Daily Metrics API, you can retrieve aggregated daily usage and cost metrics from Langfuse for downstream use in analytics, billing, and rate-limiting. The API allows you to filter by application type, user, or tags.
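For example, a minimal sketch of querying the Daily Metrics API with Python's requests library; the userId and tags filters and the totalCost response field follow the public API reference, so verify them against your Langfuse version:

import requests

response = requests.get(
    "https://cloud.langfuse.com/api/public/metrics/daily",
    auth=("pk-lf-...", "sk-lf-..."),  # Langfuse public and secret key
    params={
        "userId": "user-123",    # optional: restrict to a single user
        "tags": ["production"],  # optional: restrict to tagged traces
    },
)
for day in response.json()["data"]:
    print(day["date"], day["totalCost"])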
Ingest usage and/or cost
If available in the LLM response, ingesting usage and/or cost is the most accurate and robust way to track usage in Langfuse.
Many of the Langfuse integrations automatically capture usage and cost data from the LLM response. If this does not work as expected, please create an issue on GitHub.
from langfuse.decorators import langfuse_context, observe
from anthropic import Anthropic

anthropic_client = Anthropic()

@observe(as_type="generation")
def anthropic_completion(**kwargs):
    # optional, extract some fields from kwargs
    kwargs_clone = kwargs.copy()
    input = kwargs_clone.pop('messages', None)
    model = kwargs_clone.pop('model', None)
    langfuse_context.update_current_observation(
        input=input,
        model=model,
        metadata=kwargs_clone
    )

    response = anthropic_client.messages.create(**kwargs)

    langfuse_context.update_current_observation(
        usage={
            "input": response.usage.input_tokens,
            "output": response.usage.output_tokens,
            # "total": int,  # if not set, it is derived from input + output
            "unit": "TOKENS",  # any of: "TOKENS", "CHARACTERS", "MILLISECONDS", "SECONDS", "IMAGES"
            # Optionally, also ingest USD cost. Alternatively, you can infer it via a model definition in Langfuse.
            # Here we assume the input and output cost are 1 USD each.
            "input_cost": 1,
            "output_cost": 1,
            # "total_cost": float,  # if not set, it is derived from input_cost + output_cost
        }
    )

    # return the text of the first content block
    return response.content[0].text

@observe()
def main():
    return anthropic_completion(
        model="claude-3-opus-20240229",
        max_tokens=1024,
        messages=[
            {"role": "user", "content": "Hello, Claude"}
        ]
    )

main()
You can also update the usage and cost via generation.update() and generation.end().
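A minimal sketch using the low-level Python SDK; the token counts below are placeholders for values from your LLM response:

from langfuse import Langfuse

langfuse = Langfuse()

generation = langfuse.generation(
    name="anthropic-completion",
    model="claude-3-opus-20240229",
)

# ... call the LLM here ...

generation.end(
    usage={
        "input": 512,   # placeholder: input tokens from the response
        "output": 128,  # placeholder: output tokens from the response
        "unit": "TOKENS",
    }
)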
Compatibility with OpenAI
For increased compatibility with OpenAI, you can also use the following attributes to ingest usage:
generation = langfuse.generation(
    # ...
    usage={
        # usage
        "prompt_tokens": int,
        "completion_tokens": int,
        "total_tokens": int,  # optional, it is derived from prompt + completion
    },
    # ...
)
You can also ingest OpenAI-style usage via generation.update() and generation.end().
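For illustration, a hedged sketch that forwards the usage object of an OpenAI chat completion, reusing the generation object from the snippet above:

from openai import OpenAI

openai_client = OpenAI()

completion = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)

generation.end(
    output=completion.choices[0].message.content,
    usage={
        "prompt_tokens": completion.usage.prompt_tokens,
        "completion_tokens": completion.usage.completion_tokens,
    },
)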
Infer usage and/or cost
If either usage or cost is not ingested, Langfuse attempts to infer the missing values based on the model parameter of the generation at the time of ingestion. This is especially useful for model providers or self-hosted models that do not include usage or cost in the response.
Langfuse comes with a list of predefined popular models and their tokenizers, including OpenAI, Anthropic, and Google models. Check out the full list in Langfuse (sign-in required).
You can also add your own custom model definitions (see below) or request official support for new models via GitHub.
Usage
If a tokenizer is specified for the model, Langfuse automatically calculates token amounts for ingested generations.
The following tokenizers are currently supported:
Model | Tokenizer | Used package | Comment |
---|---|---|---|
gpt-4o | o200k_base | tiktoken | |
gpt* | cl100k_base | tiktoken | |
claude* | claude | @anthropic-ai/tokenizer | According to Anthropic, their tokenizer is not accurate for Claude 3 models. If possible, ingest the token counts from the API response instead. |
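As a point of reference, you can reproduce these token counts locally, e.g. with the cl100k_base encoding used for gpt-* models:

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
print(len(encoding.encode("Hello, Claude")))  # number of tokens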
Cost
Model definitions include prices per unit (input, output, total).
Langfuse automatically calculates cost for ingested generations at the time of ingestion if (1) usage is ingested or inferred, and (2) a matching model definition includes prices.
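Conceptually, the calculation is a simple multiplication of usage by the per-unit prices; a sketch with illustrative numbers:

# Illustrative prices and token counts, not real model pricing.
input_price = 15 / 1_000_000   # USD per input token (i.e. 15 USD per 1M tokens)
output_price = 75 / 1_000_000  # USD per output token

input_tokens, output_tokens = 512, 128
total_cost = input_tokens * input_price + output_tokens * output_price
print(f"{total_cost:.6f} USD")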
Custom model definitions
You can flexibly add your own model definitions to Langfuse. This is especially useful for self-hosted or fine-tuned models which are not included in the list of Langfuse maintained models.
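A hedged sketch of creating a custom model definition via the public Models API; the endpoint path and field names follow the public API reference, so verify them against your Langfuse version:

import requests

requests.post(
    "https://cloud.langfuse.com/api/public/models",
    auth=("pk-lf-...", "sk-lf-..."),  # Langfuse public and secret key
    json={
        "modelName": "my-finetuned-model",  # hypothetical model name
        "matchPattern": "(?i)^(my-finetuned-model)$",
        "unit": "TOKENS",
        "inputPrice": 0.000001,   # USD per input token
        "outputPrice": 0.000002,  # USD per output token
    },
)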
Models are matched to generations based on:
Generation Attribute | Model Attribute | Notes |
---|---|---|
model | match_pattern | Uses regular expressions, e.g. (?i)^(gpt-4-0125-preview)$ matches gpt-4-0125-preview. |
unit | unit | Unit on the usage object of the generation (e.g. TOKENS or CHARACTERS) needs to match. |
start_time | start_time | Optional, can be used to update the price of a model without affecting past generations. If multiple models match, the model with the most recent model.start_time that is earlier than generation.start_time is used. |
User-defined models take priority over models maintained by Langfuse.
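This is not Langfuse's actual implementation, but a short sketch of the precedence rules described above:

import re
from datetime import datetime

def find_model(definitions, model, unit, generation_start):
    # keep definitions whose regex matches the model name, whose unit
    # matches, and whose start_time (if any) precedes the generation's
    candidates = [
        d for d in definitions
        if re.match(d["match_pattern"], model)
        and d["unit"] == unit
        and (d.get("start_time") is None or d["start_time"] <= generation_start)
    ]
    # user-defined models take priority over Langfuse-maintained ones,
    # then the most recent start_time wins
    candidates.sort(
        key=lambda d: (d["user_defined"], d.get("start_time") or datetime.min),
        reverse=True,
    )
    return candidates[0] if candidates else None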
Further details
When using the openai tokenizer, you need to specify the following tokenization config. You can also copy the config from the list of predefined OpenAI models. See the OpenAI documentation for further details. tokensPerName and tokensPerMessage are required for chat models.
{
  "tokenizerModel": "gpt-3.5-turbo", // tiktoken model name
  "tokensPerName": -1, // OpenAI Chatmessage tokenization config
  "tokensPerMessage": 4 // OpenAI Chatmessage tokenization config
}
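For intuition, this is roughly how the two parameters enter OpenAI's published recipe for counting chat message tokens:

import tiktoken

def count_chat_tokens(messages, tokenizer_model="gpt-3.5-turbo",
                      tokens_per_message=4, tokens_per_name=-1):
    encoding = tiktoken.encoding_for_model(tokenizer_model)
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message  # fixed overhead per message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name  # adjustment when a name is set
    num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>
    return num_tokens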
Troubleshooting
Usage and cost are missing for historical generations: Except for price changes, Langfuse does not retroactively infer usage and cost for existing generations when model definitions are changed. You can request a batch job (Langfuse Cloud) or run a script (self-hosting) to apply new model definitions to existing generations.