Explicit Title Format Performs Worse Than Simple Format (Contradicts Docs)

#35

by Farhadtehrani - opened 12 days ago

The documentation states:
"providing a title, if available, will improve model performance"
But my testing shows the opposite: a simple format scores up to 18.7% higher cosine similarity against the query.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("google/embeddinggemma-300m")

title = "Marketing Manager"
content = "Lead marketing team and drive brand growth."
query = "Marketing Manager"

# Compare formats

doc_official = f"title: {title} | text: {content}" # Per docs
doc_simple = f"{title} | {content}" # Simple format

query_prompt = f"task: search result | query: {query}"
q_emb = model.encode(query_prompt)

sim_official = util.cos_sim(q_emb, model.encode(doc_official)).item()
sim_simple = util.cos_sim(q_emb, model.encode(doc_simple)).item()

print(f"Official format: {sim_official:.4f}") # 0.5867
print(f"Simple format: {sim_simple:.4f}") # 0.6961 (+18.7%!)

Official format: 0.5867
Simple format: 0.6961 (+18.7% better!)
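
For reference, the percentage is the relative difference between the two cosine similarities. A minimal sketch of that arithmetic, using the rounded scores printed above (the helper name `relative_gain` is only illustrative):

```python
def relative_gain(baseline: float, candidate: float) -> float:
    """Relative change of `candidate` over `baseline`, in percent."""
    return (candidate - baseline) / baseline * 100.0


sim_official = 0.5867  # rounded score from the run above
sim_simple = 0.6961    # rounded score from the run above

gain = relative_gain(sim_official, sim_simple)
print(f"Relative gain: {gain:.2f}%")
```

With the four-decimal scores this lands just under 18.7%; the small difference comes from rounding in the printed values. This is a gap in cosine similarity for one query/document pair, not a retrieval metric.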

Additional Findings
Large-scale eval (500 queries, 80k docs): the official format scores 9.3% lower on NDCG@100
Title-only documents: the official format loses 17.5% performance when the text: field is empty
Longer titles suffer more: the performance gap is bigger with 60-character titles
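
For context on the NDCG@100 figure: NDCG@k divides the discounted gain of a ranking by the discounted gain of an ideal ordering. The following is a minimal sketch of the common linear-gain formulation, not the exact evaluation script; the function names and the toy relevance labels are assumptions for illustration.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k items (index 0 is rank 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: Sequence[float],
              all_relevances: Sequence[float],
              k: int) -> float:
    """DCG of the produced ranking divided by the DCG of the ideal ranking.

    `ranked_relevances` holds the labels of the retrieved documents in rank order;
    `all_relevances` holds the labels of every judged document for the query.
    """
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    if ideal == 0.0:
        return 0.0  # no relevant documents for this query
    return dcg_at_k(ranked_relevances, k) / ideal


# Toy labels (made up, not taken from the eval above): 1 = relevant, 0 = not.
ranked = [1, 0, 1]
judged = [1, 1, 0]
score = ndcg_at_k(ranked, judged, k=3)
print(f"NDCG@3: {score:.4f}")
```

Averaging this score over all queries, once per document format, would give the kind of comparison summarized above.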

Question
Should I use simple concatenation ({title} | {content}) instead of the documented format (title: {title} | text: {content}) for retrieval tasks?
The simple format consistently outperforms the documented one in my tests. Is this expected behavior, or should the documentation be updated?
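
A single query/document pair is a small sample, so a comparison like this is easier to judge when averaged over many examples. Below is a minimal sketch of such a harness. `mean_similarity`, `toy_encode`, and the two formatter names are hypothetical helpers, and `toy_encode` is only a stand-in so the snippet runs without downloading the model; passing `model.encode` in its place should work, assuming it returns one vector per input string.

```python
import math
from typing import Callable, Iterable, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def format_official(title: str, content: str) -> str:
    return f"title: {title} | text: {content}"  # documented format


def format_simple(title: str, content: str) -> str:
    return f"{title} | {content}"  # plain concatenation


def mean_similarity(encode: Callable[[str], Vector],
                    examples: Iterable[Tuple[str, str, str]],
                    format_doc: Callable[[str, str], str]) -> float:
    """Average query/document cosine similarity over (query, title, content) triples."""
    scores = [
        cosine(encode(f"task: search result | query: {query}"),
               encode(format_doc(title, content)))
        for query, title, content in examples
    ]
    if not scores:
        raise ValueError("examples must not be empty")
    return sum(scores) / len(scores)


def toy_encode(text: str) -> Vector:
    """Stand-in encoder (letter counts), NOT a real embedding model."""
    lowered = text.lower()
    return [float(lowered.count(ch)) for ch in "abcdefghijklmnopqrstuvwxyz"]


examples = [
    ("Marketing Manager", "Marketing Manager",
     "Lead marketing team and drive brand growth."),
]
results = {
    "official": mean_similarity(toy_encode, examples, format_official),
    "simple": mean_similarity(toy_encode, examples, format_simple),
}
for name, value in results.items():
    print(f"{name}: {value:.4f}")
```

The numbers printed with `toy_encode` say nothing about the real model; the point is only the structure of the comparison.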
