15 KiB
Citation Management & Hallucination Prevention
This reference provides a complete workflow for managing citations programmatically, preventing AI-generated citation hallucinations, and maintaining clean bibliographies.
Contents
- Why Citation Verification Matters
- Citation APIs Overview
- Verified Citation Workflow
- Python Implementation
- BibTeX Management
- Common Citation Formats
- Troubleshooting
Why Citation Verification Matters
The Hallucination Problem
Research has documented significant issues with AI-generated citations:
- ~40% error rate in AI-generated citations (Enago Academy research)
- NeurIPS 2025 found 100+ hallucinated citations slipped through review
- Common errors include:
- Fabricated paper titles with real author names
- Wrong publication venues or years
- Non-existent papers with plausible metadata
- Incorrect DOIs or arXiv IDs
Consequences
- Desk rejection at some venues
- Loss of credibility with reviewers
- Potential retraction if published
- Wasted time chasing non-existent sources
Solution
Never generate citations from memory—always verify programmatically.
Citation APIs Overview
Primary APIs
| API | Coverage | Rate Limits | Best For |
|---|---|---|---|
| Semantic Scholar | 214M papers | 1 RPS (free key) | ML/AI papers, citation graphs |
| CrossRef | 140M+ DOIs | Polite pool with mailto | DOI lookup, BibTeX retrieval |
| arXiv | Preprints | 3-second delays | ML preprints, PDF access |
| OpenAlex | 240M+ works | 100K/day, 10 RPS | Open alternative to MAG |
API Selection Guide
Need ML paper search? → Semantic Scholar
Have DOI, need BibTeX? → CrossRef content negotiation
Looking for preprint? → arXiv API
Need open data, bulk access? → OpenAlex
No Official Google Scholar API
Google Scholar has no official API. Scraping violates ToS. Use SerpApi ($75-275/month) only if Semantic Scholar coverage is insufficient.
Verified Citation Workflow
5-Step Process
1. SEARCH → Query Semantic Scholar with specific keywords
↓
2. VERIFY → Confirm paper exists in 2+ sources
↓
3. RETRIEVE → Get BibTeX via DOI content negotiation
↓
4. VALIDATE → Confirm the claim appears in source
↓
5. ADD → Add verified entry to .bib file
Step 1: Search
Use Semantic Scholar for ML/AI papers:
from semanticscholar import SemanticScholar
sch = SemanticScholar()
results = sch.search_paper("transformer attention mechanism", limit=10)
for paper in results:
print(f"Title: {paper.title}")
print(f"Year: {paper.year}")
print(f"DOI: {paper.externalIds.get('DOI', 'N/A')}")
print(f"arXiv: {paper.externalIds.get('ArXiv', 'N/A')}")
print(f"Citation count: {paper.citationCount}")
print("---")
Step 2: Verify Existence
Confirm paper exists in at least two sources:
import requests
def verify_paper(doi=None, arxiv_id=None, title=None):
"""Verify paper exists in multiple sources."""
sources_found = []
# Check Semantic Scholar
sch = SemanticScholar()
if doi:
paper = sch.get_paper(f"DOI:{doi}")
if paper:
sources_found.append("Semantic Scholar")
# Check CrossRef (via DOI)
if doi:
resp = requests.get(f"https://api.crossref.org/works/{doi}")
if resp.status_code == 200:
sources_found.append("CrossRef")
# Check arXiv
if arxiv_id:
resp = requests.get(
f"http://export.arxiv.org/api/query?id_list={arxiv_id}"
)
if "<entry>" in resp.text:
sources_found.append("arXiv")
return len(sources_found) >= 2, sources_found
Step 3: Retrieve BibTeX
Use DOI content negotiation for guaranteed accuracy:
import requests
def doi_to_bibtex(doi: str) -> str:
"""Get verified BibTeX from DOI via CrossRef content negotiation."""
response = requests.get(
f"https://doi.org/{doi}",
headers={"Accept": "application/x-bibtex"},
allow_redirects=True
)
response.raise_for_status()
return response.text
# Example: "Attention Is All You Need"
bibtex = doi_to_bibtex("10.48550/arXiv.1706.03762")
print(bibtex)
Step 4: Validate Claims
Before citing a paper for a specific claim, verify the claim exists:
def get_paper_abstract(doi):
"""Get abstract to verify claims."""
sch = SemanticScholar()
paper = sch.get_paper(f"DOI:{doi}")
return paper.abstract if paper else None
# Verify claim appears in abstract
abstract = get_paper_abstract("10.48550/arXiv.1706.03762")
claim = "attention mechanism"
if claim.lower() in abstract.lower():
print("Claim appears in paper")
Step 5: Add to Bibliography
Add verified entry to your .bib file with consistent key format:
def generate_citation_key(bibtex: str) -> str:
"""Generate consistent citation key: author_year_firstword."""
import re
# Extract author
author_match = re.search(r'author\s*=\s*\{([^}]+)\}', bibtex, re.I)
if author_match:
first_author = author_match.group(1).split(',')[0].split()[-1]
else:
first_author = "unknown"
# Extract year
year_match = re.search(r'year\s*=\s*\{?(\d{4})\}?', bibtex, re.I)
year = year_match.group(1) if year_match else "0000"
# Extract title first word
title_match = re.search(r'title\s*=\s*\{([^}]+)\}', bibtex, re.I)
if title_match:
first_word = title_match.group(1).split()[0].lower()
first_word = re.sub(r'[^a-z]', '', first_word)
else:
first_word = "paper"
return f"{first_author.lower()}_{year}_{first_word}"
Python Implementation
Complete Citation Manager Class
{% raw %}
"""
Citation Manager - Verified citation workflow for ML papers.
"""
import requests
import time
from typing import Optional, List, Dict, Tuple
from dataclasses import dataclass
try:
from semanticscholar import SemanticScholar
except ImportError:
print("Install: pip install semanticscholar")
SemanticScholar = None
@dataclass
class Paper:
title: str
authors: List[str]
year: int
doi: Optional[str]
arxiv_id: Optional[str]
venue: Optional[str]
citation_count: int
abstract: Optional[str]
class CitationManager:
"""Manage citations with verification."""
def __init__(self, api_key: Optional[str] = None):
self.sch = SemanticScholar(api_key=api_key) if SemanticScholar else None
self.verified_papers: Dict[str, Paper] = {}
def search(self, query: str, limit: int = 10) -> List[Paper]:
"""Search for papers using Semantic Scholar."""
if not self.sch:
raise RuntimeError("Semantic Scholar not available")
results = self.sch.search_paper(query, limit=limit)
papers = []
for r in results:
paper = Paper(
title=r.title,
authors=[a.name for a in (r.authors or [])],
year=r.year or 0,
doi=r.externalIds.get('DOI') if r.externalIds else None,
arxiv_id=r.externalIds.get('ArXiv') if r.externalIds else None,
venue=r.venue,
citation_count=r.citationCount or 0,
abstract=r.abstract
)
papers.append(paper)
return papers
def verify(self, paper: Paper) -> Tuple[bool, List[str]]:
"""Verify paper exists in multiple sources."""
sources = []
# Already found in Semantic Scholar via search
sources.append("Semantic Scholar")
# Check CrossRef if DOI available
if paper.doi:
try:
resp = requests.get(
f"https://api.crossref.org/works/{paper.doi}",
timeout=10
)
if resp.status_code == 200:
sources.append("CrossRef")
except Exception:
pass
# Check arXiv if ID available
if paper.arxiv_id:
try:
resp = requests.get(
f"http://export.arxiv.org/api/query?id_list={paper.arxiv_id}",
timeout=10
)
if "<entry>" in resp.text and "<title>" in resp.text:
sources.append("arXiv")
except Exception:
pass
return len(sources) >= 2, sources
def get_bibtex(self, paper: Paper) -> Optional[str]:
"""Get BibTeX for verified paper."""
if paper.doi:
try:
resp = requests.get(
f"https://doi.org/{paper.doi}",
headers={"Accept": "application/x-bibtex"},
timeout=10,
allow_redirects=True
)
if resp.status_code == 200:
return resp.text
except Exception:
pass
# Fallback: generate from paper data
return self._generate_bibtex(paper)
def _generate_bibtex(self, paper: Paper) -> str:
"""Generate BibTeX from paper metadata."""
# Generate citation key
first_author = paper.authors[0].split()[-1] if paper.authors else "unknown"
first_word = paper.title.split()[0].lower().replace(',', '').replace(':', '')
key = f"{first_author.lower()}_{paper.year}_{first_word}"
# Format authors
authors = " and ".join(paper.authors) if paper.authors else "Unknown"
bibtex = f"""@article{{{key},
title = {{{paper.title}}},
author = {{{authors}}},
year = {{{paper.year}}},
{'doi = {' + paper.doi + '},' if paper.doi else ''}
{'eprint = {' + paper.arxiv_id + '},' if paper.arxiv_id else ''}
{'journal = {' + paper.venue + '},' if paper.venue else ''}
}}"""
return bibtex
def cite(self, query: str) -> Optional[str]:
"""Full workflow: search, verify, return BibTeX."""
# Search
papers = self.search(query, limit=5)
if not papers:
return None
# Take top result
paper = papers[0]
# Verify
verified, sources = self.verify(paper)
if not verified:
print(f"Warning: Could only verify in {sources}")
# Get BibTeX
bibtex = self.get_bibtex(paper)
# Cache
if bibtex:
self.verified_papers[paper.title] = paper
return bibtex
# Usage example
if __name__ == "__main__":
cm = CitationManager()
# Search and cite
bibtex = cm.cite("attention is all you need transformer")
if bibtex:
print(bibtex)
{% endraw %}
Quick Functions
def quick_cite(query: str) -> str:
"""One-liner citation."""
cm = CitationManager()
return cm.cite(query)
def batch_cite(queries: List[str], output_file: str = "references.bib"):
"""Cite multiple papers and save to file."""
cm = CitationManager()
bibtex_entries = []
for query in queries:
print(f"Processing: {query}")
bibtex = cm.cite(query)
if bibtex:
bibtex_entries.append(bibtex)
time.sleep(1) # Rate limiting
with open(output_file, 'w') as f:
f.write("\n\n".join(bibtex_entries))
print(f"Saved {len(bibtex_entries)} citations to {output_file}")
BibTeX Management
BibTeX vs BibLaTeX
| Feature | BibTeX | BibLaTeX |
|---|---|---|
| Unicode support | Limited | Full |
| Entry types | Standard | Extended (@online, @dataset) |
| Customization | Limited | Highly flexible |
| Backend | bibtex | Biber (recommended) |
Recommendation: Use natbib with BibTeX for conference submissions — all major venue templates (NeurIPS, ICML, ICLR, ACL, AAAI, COLM) ship with natbib and .bst files. BibLaTeX with Biber is an option for journals or personal projects where you control the template.
LaTeX Setup
% In preamble
\usepackage[
backend=biber,
style=numeric,
sorting=none
]{biblatex}
\addbibresource{references.bib}
% In document
\cite{vaswani_2017_attention}
% At end
\printbibliography
Citation Commands
\cite{key} % Numeric: [1]
\citep{key} % Parenthetical: (Author, 2020)
\citet{key} % Textual: Author (2020)
\citeauthor{key} % Just author name
\citeyear{key} % Just year
Consistent Citation Keys
Use format: author_year_firstword
vaswani_2017_attention
devlin_2019_bert
brown_2020_language
Common Citation Formats
Conference Paper
@inproceedings{vaswani_2017_attention,
title = {Attention Is All You Need},
author = {Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and
Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and
Kaiser, Lukasz and Polosukhin, Illia},
booktitle = {Advances in Neural Information Processing Systems},
volume = {30},
year = {2017},
publisher = {Curran Associates, Inc.}
}
Journal Article
@article{hochreiter_1997_long,
title = {Long Short-Term Memory},
author = {Hochreiter, Sepp and Schmidhuber, J{\"u}rgen},
journal = {Neural Computation},
volume = {9},
number = {8},
pages = {1735--1780},
year = {1997},
publisher = {MIT Press}
}
arXiv Preprint
@misc{brown_2020_language,
title = {Language Models are Few-Shot Learners},
author = {Brown, Tom and Mann, Benjamin and Ryder, Nick and others},
year = {2020},
eprint = {2005.14165},
archiveprefix = {arXiv},
primaryclass = {cs.CL}
}
Troubleshooting
Common Issues
Issue: Semantic Scholar returns no results
- Try more specific keywords
- Check spelling of author names
- Use quotation marks for exact phrases
Issue: DOI doesn't resolve to BibTeX
- DOI may be registered but not linked to CrossRef
- Try arXiv ID instead if available
- Generate BibTeX from metadata manually
Issue: Rate limiting errors
- Add delays between requests (1-3 seconds)
- Use API key if available
- Cache results to avoid repeat queries
Issue: Encoding problems in BibTeX
- Use proper LaTeX escaping:
{\"u}for ü - Ensure file is UTF-8 encoded
- Use BibLaTeX with Biber for better Unicode
Verification Checklist
Before adding a citation:
- Paper found in at least 2 sources
- DOI or arXiv ID verified
- BibTeX retrieved (not generated from memory)
- Entry type correct (@inproceedings vs @article)
- Author names complete and correctly formatted
- Year and venue verified
- Citation key follows consistent format
Additional Resources
APIs:
- Semantic Scholar: https://api.semanticscholar.org/api-docs/
- CrossRef: https://www.crossref.org/documentation/retrieve-metadata/rest-api/
- arXiv: https://info.arxiv.org/help/api/basics.html
- OpenAlex: https://docs.openalex.org/
Python Libraries:
semanticscholar: https://pypi.org/project/semanticscholar/arxiv: https://pypi.org/project/arxiv/habanero(CrossRef): https://github.com/sckott/habanero
Verification Tools:
- Citely: https://citely.ai/citation-checker
- ReciteWorks: https://reciteworks.com/