Computed features vs PC with new AnnData format #342
Soorya19Pradeep wants to merge 2 commits into main
Conversation
srivarra
left a comment
My main question is: do we want to make full use of AnnData, or is it necessary to save the intermediate features as a .csv?
We can work directly with AnnData.obs and use AnnData.to_df to convert X to a DataFrame.
https://anndata.readthedocs.io/en/latest/generated/anndata.AnnData.to_df.html
The part I am still struggling with is matching the annotations and embedding features based on the combination of 'track_id', 'fov_name', and 't'. I have to work with a dataframe for this step, as I sort and remove rows based on the match.
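That matching step can be expressed as a single inner merge on the three keys; the sketch below uses hypothetical stand-in frames (only the key names 'fov_name', 'track_id', 't' come from the discussion):

```python
import pandas as pd

# Hypothetical annotation table: one label per (fov_name, track_id, t).
annotations = pd.DataFrame({
    "fov_name": ["/A/1", "/A/1", "/B/2"],
    "track_id": [0, 0, 1],
    "t": [0, 1, 0],
    "label": ["infected", "infected", "uninfected"],
})

# Hypothetical embedding table with one extra timepoint not in the annotations.
embeddings = pd.DataFrame({
    "fov_name": ["/A/1", "/A/1", "/B/2", "/B/2"],
    "track_id": [0, 0, 1, 1],
    "t": [0, 1, 0, 1],
    "feature_0": [0.1, 0.2, 0.3, 0.4],
})

# Inner merge keeps only rows present in both frames,
# replacing the manual sort-and-drop bookkeeping.
matched = embeddings.merge(annotations, on=["fov_name", "track_id", "t"], how="inner")
print(len(matched))  # 3
```

The same pattern works on `adata.obs` joined against `adata.to_df()`, so the merge could still be done without persisting a .csv.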
```python
output_file = '/hpc/projects/intracellular_dashboard/organelle_dynamics/2025_07_22_A549_SEC61_TOMM20_G3BP1_ZIKV/4-phenotyping/predictions/quantify_remodeling/G3BP1/feature_list_G3BP1_2025_07_22_192patch.csv'

# Write to CSV - append if file exists, create new if it doesn't
position_df.to_csv(output_file, mode='a', header=not os.path.exists(output_file), index=False)
```
```python
feature_values = features.filter(like="feature_")

# compute the PCA features
pca = PCA(n_components=10)
```
Are we recomputing the PCA features here? Should we instead put these AnnData-oriented ones over in dimensionality_reduction.py?
Also, is there any functionality from feature.py that we could reuse?
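One way the PCA step could be factored out for reuse, as the comment suggests. This is a sketch only: the helper name, its signature, and its proposed home in dimensionality_reduction.py are assumptions, not the repo's existing API.

```python
import numpy as np
from sklearn.decomposition import PCA


def compute_pca_features(features: np.ndarray, n_components: int = 10) -> np.ndarray:
    """Project feature rows onto their top principal components.

    A stand-in for a shared helper; clamps n_components so small
    feature tables do not raise in sklearn's PCA.
    """
    n_components = min(n_components, *features.shape)
    return PCA(n_components=n_components).fit_transform(features)


# Illustrative input: 20 cells x 8 features of random data.
rng = np.random.default_rng(0)
pca_features = compute_pca_features(rng.normal(size=(20, 8)), n_components=10)
print(pca_features.shape)  # (20, 8)
```

Centralizing this would let both the .csv-based and AnnData-based paths call the same function instead of re-instantiating `PCA` in each script.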
```python
correlation_df = pd.DataFrame(pca_features, columns=[f"PCA{i+1}" for i in range(pca_features.shape[1])], index=features.index)

# get the computed features like 'contrast', 'homogeneity', 'energy', 'correlation', 'edge_density', 'organelle_volume', 'organelle_intensity', 'no_organelles', 'size_organelles'
image_features_df = features.filter(regex="contrast|homogeneity|energy|correlation|edge_density|organelle_volume|organelle_intensity|no_organelles|size_organelles").copy()
```
```python
# Rename columns to avoid conflicts during merge
# Rename 't' in features_df_filtered to 'time_point' to match computed_features
features_df_filtered = features_df_filtered.rename(columns={"t": "time_point"})
```
This was an issue from the older computations, when we created headers of our choice. We have converged on using 't' from now on. I can redo the computed features to have a 't' column to solve this.
Instead of recomputing the computed features, could you just rename the column?
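A sketch of that suggestion: rename the legacy header in place rather than regenerating the features. The example frame is illustrative; only the column names come from the thread.

```python
import pandas as pd

# Hypothetical stand-in for the older computed-features table,
# which used 'time_point' instead of the converged-on 't'.
computed_features = pd.DataFrame({
    "time_point": [0, 1, 2],
    "contrast": [0.5, 0.6, 0.7],
})

# One-line fix: rename the legacy column so it matches the new schema.
computed_features = computed_features.rename(columns={"time_point": "t"})
print(list(computed_features.columns))  # ['t', 'contrast']
```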
```python
cell_features = {
    'fov_name': '/' + well_id + '/' + pos_id,
    'track_id': row['track_id'],
    'time_point': timepoint,
```
```python
# Select columns (features) in the desired order
feature_order = ["edge_density", "correlation", "energy", "homogeneity", "contrast", "no_organelles", "organelle_volume", "organelle_intensity"]
# Filter to only include features that actually exist in the dataframe
feature_order_filtered = [f for f in feature_order if f in correlation_selected.columns]
```
Do we need to check their existence if they are already fixed features from earlier in the script?
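If the feature set is guaranteed upstream, the existence check can be dropped in favor of direct column selection; `reindex` is an alternative that inserts NaN columns instead of raising when a name is missing. A sketch with an illustrative frame (only the feature names come from the snippet above):

```python
import pandas as pd

# Stand-in for the selected-features frame, columns in arbitrary order.
correlation_selected = pd.DataFrame(
    {name: [0.0] for name in
     ["contrast", "homogeneity", "energy", "correlation", "edge_density",
      "no_organelles", "organelle_volume", "organelle_intensity"]}
)

feature_order = ["edge_density", "correlation", "energy", "homogeneity",
                 "contrast", "no_organelles", "organelle_volume", "organelle_intensity"]

# Direct selection: raises KeyError if a column is missing,
# which surfaces schema drift instead of silently filtering it out.
ordered = correlation_selected[feature_order]
print(list(ordered.columns) == feature_order)  # True
```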
@srivarra, the computed feature set depends on the organelle data. At this point I keep changing the set as I work with different organelles. I think a tool like CellProfiler can compute a 1000-feature list, which can then be filtered for the most significant features. I haven't implemented anything like that yet. Once the feature list is more stable, it will be ready to be added to AnnData.
Ah gotcha, so keep the
The code is modified to read from the zarrs with the new anndata format. @srivarra, can you let me know if there is a better way to do this?
@edyoshikun, I have computed the image features outside the library, as I used the segmentations of G3BP1 for the image feature computation.