Cid/attachments #581

base: main
Conversation
```python
if name:
    processed_msg["name"] = name
    processed_messages.append(processed_msg)
else:
```
Will this cause issues when the content type is unknown?
When we encounter an unknown content type, we preserve the original message unchanged. If OpenAI introduces a new content type we don't recognize, we don't break the trace; we just pass it through as-is. The trace will still work, we just won't have special handling for that new type.
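A minimal sketch of this pass-through fallback (the function name, the set of known types, and the warning log are illustrative assumptions, not the SDK's actual code; the warning also reflects the suggestion below to log unknown types):

```python
# Hypothetical sketch: preserve messages with unknown content types
# unchanged so new OpenAI content types never break the trace.
import logging

logger = logging.getLogger(__name__)

KNOWN_TYPES = {"text", "image_url", "input_audio", "file"}  # assumed set

def normalize_content_item(item: dict) -> dict:
    item_type = item.get("type")
    if item_type in KNOWN_TYPES:
        # ... special per-type handling would go here ...
        return item
    # Unknown type: log for visibility, then pass the item through as-is.
    logger.warning("Unknown content type %r; passing through unchanged", item_type)
    return item
```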
Should it log a warning or emit an alert so we can learn about this new type and add it later?
```python
elif item_type == "file":
    file_data = item.get("file", {})
    file_id = file_data.get("file_id")
    file_data_b64 = file_data.get("file_data")
```
What if the base64 is too big? I don't know if it will cause issues while sending, for example if the request size limit is exceeded.
Good question! The base64 data is stored temporarily in the Attachment object, but it's not sent in the trace payload. The flow is:
- Base64 data is extracted and stored in `Attachment.data_base64`
- Before sending the trace, `upload_trace_attachments()` uploads the data to our blob storage and gets a `storage_uri`
- After upload, `data_base64` is cleared (set to `None`) in `attachment_uploader.py`
- Only the `storage_uri` is serialized into the trace payload
So the actual trace request only contains the URI reference, not the heavy base64 data.
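The flow above can be sketched as follows; this is an illustrative stand-in (the `Attachment` fields and `upload_trace_attachments` signature here are assumptions), showing how the base64 payload lives only in memory and is dropped before serialization:

```python
# Sketch of the described lifecycle: extract base64 -> upload to blob
# storage -> clear base64 -> serialize only the URI reference.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Attachment:
    name: str
    data_base64: Optional[str] = None
    storage_uri: Optional[str] = None
    metadata: dict = field(default_factory=dict)

def upload_trace_attachments(attachments, upload_fn):
    """Upload pending base64 payloads, then clear them from memory."""
    count = 0
    for att in attachments:
        if att.data_base64 is not None:
            att.storage_uri = upload_fn(att.name, att.data_base64)
            att.data_base64 = None  # never reaches the trace payload
            count += 1
    return count

def serialize_attachment(att: Attachment) -> dict:
    # Only the URI reference goes into the trace payload.
    return {"name": att.name, "storage_uri": att.storage_uri}
```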
```python
elif file_id:
    # Just reference the file ID (can't download without API call)
    attachment = Attachment(name=filename)
    attachment.metadata["openai_file_id"] = file_id
```
If we just have `openai_file_id`, what will we show in the UI?
When we only have openai_file_id, we create an attachment with just the filename and store the file_id in metadata. This is a design decision and in the UI, I thought we could show:
- The `filename` (if provided, otherwise "file")
- The `openai_file_id` reference in metadata
- No preview/content (since we don't have access to the actual data)
This is a limitation. We could potentially add an option to fetch the file content using the client, but that adds latency and complexity. For now, I thought we could just capture the reference.
Let's add this as a comment somewhere so @eluzzi5 can come up with a good design.
Maybe a Linear issue?
```python
        The attachment with storage_uri populated (if upload was needed).
    """
    # Skip if already uploaded
    if attachment.is_uploaded():
```
Should we still upload data from OpenAI (or anywhere else) to our storage? They might remove that data after some time.
Currently, if an attachment has an external URL (like a hosted image URL), we keep that URL and don't re-upload to our storage.
Pros of current approach:
- Faster (no download + re-upload)
- Less storage cost
Cons:
- External URLs may expire
- Data could become unavailable
I think this is worth discussing as a follow-up improvement. We could add an option like persist_external_urls=True that would download and re-upload external content to ensure long-term availability. What do you think?
I think storage is cheap. We can do it async on our backend later, but we should surely store it with us. I am sure that particular file is important, and if it expires we miss critical information. We can do it later as a follow-up task, but make it the default.
Thoughts, @gustavocidornelas @whoseoyster?
Is there a compliance risk if we copy data to Openlayer?
Good point! The attachment upload feature is opt-in: the user needs to explicitly set `attachment_upload_enabled=True` when they configure the tracer.
However, we should document this clearly. Worth discussing further.
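A sketch of this opt-in pattern from the configuration side; the flag name `attachment_upload_enabled` comes from the discussion above, but the config object and helper here are illustrative assumptions, not the SDK's API:

```python
# Hypothetical sketch: attachment upload is off by default and only
# happens when the user explicitly opts in at tracer configuration time.
from dataclasses import dataclass

@dataclass
class TracerConfig:
    api_key: str = ""
    # Off by default: attachments are only copied to Openlayer storage
    # when the user explicitly opts in.
    attachment_upload_enabled: bool = False

def should_upload(config: TracerConfig, has_attachments: bool) -> bool:
    return config.attachment_upload_enabled and has_attachments
```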
Agree with us copying files over to our storage wherever possible, but allowing opt-out.
```python
try:
    from .attachment_uploader import upload_trace_attachments

    upload_count = upload_trace_attachments(trace)
```
This `_upload_and_publish_trace` will be called when we do have an attachment to upload, right? Because if there's nothing to upload, then `upload_count` will be 0.
`_upload_and_publish_trace` is called for every trace, regardless of whether there are attachments. It handles both uploading attachments (if any exist and `attachment_upload_enabled=True`) and publishing the trace data to Openlayer.
If there are no attachments, `upload_count` will be 0, but the function still proceeds to publish the trace. The attachment upload is just one step in the trace completion process. The naming could be clearer: it's really "process and publish trace", with attachment upload being an optional first step.
matheusportela left a comment:
@gustavocidornelas I left a bunch of comments but none is blocking (some are literally just because I'm curious)! Let me know if you have any questions
```python
upload_fields = dict(fields) if fields else {}
upload_fields["file"] = (object_name, data, content_type)
```
If you change the `fields` type signature to `fields: Dict = {}`, you can simplify this:

```python
upload_fields = {
    "file": (object_name, data, content_type),
    **fields,
}
```

```python
headers = {"Content-Type": content_type}
if extra_headers:
    headers.update(extra_headers)
```
Similar to the previous comment, this can be simplified by changing `extra_headers` to a default `{}`:

```python
extra_headers: Dict[str, str] = {}

headers = {
    "Content-Type": content_type,
    **extra_headers,
}
```

```python
collected_function_call["arguments"] += delta.tool_calls[
    0
].function.arguments
```
Weird formatting choice by the linter lol
```python
has_audio = (
    hasattr(response.choices[0].message, "audio")
    and response.choices[0].message.audio is not None
)
```
Slightly simpler check:

```python
has_audio = bool(getattr(response.choices[0].message, "audio", None))
```
```python
for item in content:
    content_item = _normalize_content_item(item)
    normalized_content.append(content_item)
```
This could be a list comprehension:

```python
normalized_content = [_normalize_content_item(item) for item in content]
```

```python
        return Path(self.file_path).exists()
    return False

def is_valid(self) -> bool:
```
What should we do if an attachment is not valid?
```python
    decoded = base64.b64decode(data_base64)
    size_bytes = len(decoded)
    checksum_md5 = hashlib.md5(decoded).hexdigest()
except Exception:
```
When would this happen?
```python
# TODO: Implement metadata extraction from file
pass
```
Is this still relevant?
```python
# File-like object
file_bytes = data.read()
```
When would this happen? Can we explicitly check for the classes (e.g. `BinaryIO`) and have an else clause that raises an error if an unexpected type is passed in? IMO it would be a better developer experience than breaking on `data.read()`, for instance.
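One possible shape for the suggested explicit check; the accepted types and the helper name are assumptions for illustration, not the SDK's actual implementation:

```python
# Hypothetical sketch: accept bytes, path strings/Path objects, or binary
# file-like objects, and fail fast with a clear error on anything else.
import io
from pathlib import Path

def read_file_bytes(data) -> bytes:
    if isinstance(data, bytes):
        return data
    if isinstance(data, (str, Path)):
        return Path(data).read_bytes()
    if isinstance(data, io.IOBase) or hasattr(data, "read"):
        return data.read()  # file-like object
    raise TypeError(
        f"Unsupported attachment data type: {type(data).__name__!r}; "
        "expected bytes, a path, or a binary file-like object"
    )
```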
```python
# Reset attachment uploader
from .attachment_uploader import reset_uploader

reset_uploader()
```
Why do we need to reset the uploader here?
The `_uploader` holds a reference to the Openlayer client. When `configure()` is called with new credentials/settings, we need to reset the uploader so it creates a new client with the updated configuration. Otherwise, it would continue using the old client.
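This module-level singleton pattern can be sketched as follows; the names `_uploader` and `reset_uploader` follow the thread, while `get_uploader` and the client factory are illustrative stand-ins:

```python
# Sketch of a lazily created, resettable module-level uploader: reset()
# drops the cached instance so the next access picks up new configuration.
_uploader = None

def get_uploader(client_factory):
    """Lazily create the uploader with the current client configuration."""
    global _uploader
    if _uploader is None:
        _uploader = client_factory()
    return _uploader

def reset_uploader():
    """Drop the cached uploader so the next call builds a fresh client."""
    global _uploader
    _uploader = None
```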
Pull Request

Summary

Introduces:

Changes

Attachments

- `Attachment` abstraction, which represents media data uploaded to a blob store. The `Attachment` object stores metadata and the `storageUri`, which can be used by the Openlayer platform to fetch the media.
- `attachments` field, which is an array of `Attachment` objects. This allows users to log arbitrary media to a step. For example:

OpenAI multimodal

In addition to attachments, this PR also instruments the `trace_openai` wrapper to parse images/audio/files in the input or output of OpenAI LLM calls. Note that if the `type` is one of `image`, `audio`, or `file`, the other object field is an `attachment`, which is a serialized `Attachment` object.

Context
OPEN-8683: Multimodal attachment support for the Python SDK and OPEN-8684: Enhance OpenAI tracer to support multimodal inputs/outputs
Testing