Explicit Title Format Performs Worse Than Simple Format (Contradicts Docs)
The documentation states:
"providing a title, if available, will improve model performance"
But my testing shows the opposite: the simple format scores up to 18.7% higher on query-document cosine similarity.
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("google/embeddinggemma-300m")
title = "Marketing Manager"
content = "Lead marketing team and drive brand growth."
query = "Marketing Manager"
# Compare formats
doc_official = f"title: {title} | text: {content}" # Per docs
doc_simple = f"{title} | {content}" # Simple format
query_prompt = f"task: search result | query: {query}"
q_emb = model.encode(query_prompt)
sim_official = util.cos_sim(q_emb, model.encode(doc_official)).item()
sim_simple = util.cos_sim(q_emb, model.encode(doc_simple)).item()
print(f"Official format: {sim_official:.4f}") # 0.5867
print(f"Simple format: {sim_simple:.4f}") # 0.6961 (+18.7%!)
Output:
Official format: 0.5867
Simple format: 0.6961 (+18.7% better!)
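For reference, the relative gain can be recomputed from the two printed scores. This is only an arithmetic check on the rounded values shown here; `relative_gain_pct` is a helper name introduced for illustration, not part of any library.

```python
def relative_gain_pct(baseline: float, candidate: float) -> float:
    """Percentage by which `candidate` exceeds `baseline`."""
    return (candidate - baseline) / baseline * 100.0

# Scores copied from the printed output (rounded to 4 decimals).
sim_official = 0.5867
sim_simple = 0.6961

gain = relative_gain_pct(sim_official, sim_simple)
print(f"Relative gain: {gain:.2f}%")
```

From the rounded scores this comes out at about 18.65%, in line with the 18.7% figure, which was presumably computed from the unrounded similarities.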
Additional Findings
Large-scale eval (500 queries, 80k docs): the official format scores 9.3% lower on NDCG@100.
Title-only documents: the official format loses 17.5% performance when the text: field is empty.
Longer titles suffer more: the performance gap is larger for 60-character titles.
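For anyone who wants to check the NDCG@100 comparison on their own data, here is a minimal sketch of how the metric can be computed per format. It is not the exact evaluation script behind the numbers above. The function names, the `qrels`/`run_*` structures, and the toy data are all hypothetical, and it assumes the standard log2 discount with the ideal ranking taken from all judged documents for a query.

```python
import math

def dcg(gains):
    """Discounted cumulative gain for gains listed in rank order (best first)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

def ndcg_at_k(ranked_doc_ids, relevance, k=100):
    """NDCG@k for one query.

    ranked_doc_ids: doc ids in retrieved order (best first).
    relevance: dict mapping doc id -> graded relevance (missing ids count as 0).
    """
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

def mean_ndcg(runs, qrels, k=100):
    """Average NDCG@k over all queries in `runs` (query id -> ranked doc ids)."""
    scores = [ndcg_at_k(runs[q], qrels.get(q, {}), k) for q in runs]
    return sum(scores) / len(scores) if scores else 0.0

# Toy data (hypothetical): one relevant document per query.
qrels = {"q1": {"d1": 1}, "q2": {"d5": 1}}
run_official = {"q1": ["d2", "d1", "d3"], "q2": ["d5", "d4", "d6"]}
run_simple = {"q1": ["d1", "d2", "d3"], "q2": ["d5", "d4", "d6"]}

print(f"official: {mean_ndcg(run_official, qrels):.4f}")
print(f"simple:   {mean_ndcg(run_simple, qrels):.4f}")
```

In a real run, `run_official` and `run_simple` would hold the top-100 document ids retrieved for each query when the corpus is embedded with the respective format.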
Question
Should I use simple concatenation ({title} | {content}) instead of the documented format (title: {title} | text: {content}) for retrieval tasks?
The simple format consistently outperforms the documented one in my tests. Is this expected behavior, or should the documentation be updated?
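To make the two options concrete, here is a small sketch of the formatting helpers being compared. The helper names are made up for illustration. The `none` fallback for a missing title reflects my reading of the model card's prompt template and should be checked against it, and skipping empty parts in the simple format is my own assumption rather than anything documented.

```python
from typing import Optional

def format_official(title: Optional[str], content: str) -> str:
    """Documented document prompt: 'title: ... | text: ...'.

    Uses 'none' when no title is available (assumption, see the note above).
    """
    return f"title: {title or 'none'} | text: {content}"

def format_simple(title: Optional[str], content: str) -> str:
    """Plain concatenation, skipping whichever part is empty (assumption)."""
    parts = [part for part in (title, content) if part]
    return " | ".join(parts)

def format_query(query: str) -> str:
    """Query prompt used in the comparison above."""
    return f"task: search result | query: {query}"

title = "Marketing Manager"
content = "Lead marketing team and drive brand growth."

print(format_official(title, content))
print(format_simple(title, content))
print(format_simple(title, ""))  # title-only document
print(format_query(title))
```

Keeping the formatting in one place like this makes it easier to swap formats in an evaluation without touching the embedding or retrieval code.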