
Part 1: Cutting-Edge Large Language Models
GPT Series includes the papers on GPT1, GPT2, GPT3, Codex, InstructGPT, and GPT4, all of which are straightforward and clear. GPT3.5, 4o, o1, and o3 are documented mainly through launch events and system cards rather than papers.
GPT1 https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
GPT2 https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
GPT3 https://arxiv.org/pdf/2005.14165
Codex https://arxiv.org/abs/2107.03374
InstructGPT https://arxiv.org/pdf/2203.02155
GPT4 https://arxiv.org/abs/2303.08774
Claude and Gemini Series: to understand the competition, read the papers on Claude 3 and Gemini 1. The latest iterations are Claude 3.5 Sonnet and Gemini 2.0 Flash/Flash Thinking, along with Gemma 2.
Claude3 https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf
Gemini1 https://arxiv.org/abs/2312.11805
Claude 3.5 Sonnet https://www.latent.space/p/claude-sonnet
Gemini 2.0 Flash https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/#gemini-2-0-flash
Flash Thinking https://ai.google.dev/gemini-api/docs/thinking-mode
Gemma2 https://arxiv.org/abs/2408.00118
Llama Series includes the papers on Llama 1, Llama 2, and Llama 3, which cover the leading open-source models. Mistral 7B, Mixtral, and Pixtral can be viewed as branches of the Llama family tree.
Llama 1 https://arxiv.org/abs/2302.13971
Llama 2 https://arxiv.org/abs/2307.09288
Llama 3 https://arxiv.org/abs/2407.21783
Mistral 7B https://arxiv.org/abs/2310.06825
Mixtral https://arxiv.org/abs/2401.04088
Pixtral https://arxiv.org/abs/2410.07073
DeepSeek Series includes the papers on DeepSeek V1, Coder, MoE, V2, and V3, the work of a leading (relatively) open-source model lab.
DeepSeek V1 https://arxiv.org/abs/2401.02954
Coder https://arxiv.org/abs/2401.14196
MoE https://arxiv.org/abs/2401.06066
V2 https://arxiv.org/abs/2405.04434
V3 https://github.com/deepseek-ai/DeepSeek-V3
Apple Intelligence Papers: this is the research behind Apple's AI, which runs on every Mac and iPhone.
https://arxiv.org/abs/2407.21075
You can also use and learn from many LLMs that are not at the frontier. In particular, BERTs are underrated as workhorse classification models; see ModernBERT for the state of the art. Honorable mentions go to AI2 (Olmo, Molmo, OlmOE, Tülu 3, Olmo 2), Grok, Amazon Nova, Yi, Reka, Jamba, Cohere, Nemotron, Microsoft Phi, and HuggingFace SmolLM, most of which rank lower or lack papers. Alpaca and Vicuna have historical significance, while Mamba 1/2 and RWKV may become significant in the future. If time permits, read the scaling law literature: Kaplan, Chinchilla, Emergence/Mirage, and the post-Chinchilla laws.
ModernBERT https://buttondown.com/ainews/archive/ainews-modernbert-small-new-retrieverclassifier/
Grok https://github.com/xai-org/grok-1
Part 2: Benchmarking and Evaluation
MMLU Paper covers the main knowledge benchmark, alongside GPQA and BIG-Bench. As of 2025, leading labs mainly report MMLU Pro, GPQA Diamond, and BIG-Bench Hard.
MMLU https://arxiv.org/abs/2009.03300
GPQA https://arxiv.org/abs/2311.12022
BIG-Bench https://arxiv.org/abs/2206.04615
MMLU Pro https://arxiv.org/abs/2406.01574
GPQA Diamond https://arxiv.org/abs/2311.12022
BIG-Bench Hard https://arxiv.org/abs/2210.09261
MuSR Paper evaluates long-context capability, alongside LongBench, BABILong, and RULER. These benchmarks target the “Lost in The Middle” problem and related long-context retrieval tests such as “Needle in a Haystack”.
MuSR https://arxiv.org/abs/2310.16049
LongBench https://arxiv.org/abs/2412.15204
BABILong https://arxiv.org/abs/2406.10149
RULER https://www.latent.space/p/gradient
Lost in The Middle https://arxiv.org/abs/2307.03172
Needle in a Haystack https://github.com/gkamradt/LLMTest_NeedleInAHaystack
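To make the “Needle in a Haystack” idea concrete, here is a minimal sketch of how such a test prompt can be assembled: a single fact (the needle) is placed at a chosen relative depth inside filler text, and the model is then asked to retrieve it. The function names, the filler text, and the passphrase are all hypothetical and chosen for illustration; this is not the code from the linked repository, and the call to an actual model is left out.

```python
def build_haystack(filler_sentences, needle, depth):
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) of the filler."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences)


def score_answer(answer, expected):
    """Simplistic scoring (an assumption): case-insensitive substring match."""
    return expected.lower() in answer.lower()


# Hypothetical filler and needle; real tests use long documents as filler.
filler = [f"Filler sentence number {i}." for i in range(100)]
needle = "The secret passphrase is 'blue-heron-42'."
question = "What is the secret passphrase?"

# One prompt per insertion depth; each would be sent to the model under test.
prompts = {
    depth: build_haystack(filler, needle, depth) + "\n\n" + question
    for depth in (0.0, 0.5, 1.0)
}
```

Sweeping both the depth and the total context length, then scoring each answer, produces the kind of grid in which a “Lost in The Middle” effect would show up as lower scores at mid-range depths.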
MATH Paper compiles math competition problems. Leading labs focus on FrontierMath and on the hardest competition-style problem sets: MATH level 5, AIME, and AMC10/AMC12.
MATH https://arxiv.org/abs/2103.03874
AIME https://www.kaggle.com/datasets/hemishveeraboina/aime-problem-set-1983-2024
FrontierMath https://arxiv.org/abs/2411.04872
AMC10/AMC12 https://github.com/ryanrudes/amc
IFEval Paper describes the leading instruction-following evaluation, which is also the only external benchmark adopted by Apple. MT-Bench can also be viewed as a form of instruction-following evaluation.
IFEval https://arxiv.org/abs/2311.07911
Adopted by Apple https://machinelearning.apple.com/research/introducing-apple-foundation-models
MT-Bench https://arxiv.org/abs/2306.05685
ARC AGI Challenge is a well-known abstract reasoning “IQ test” benchmark that has lasted far longer than many benchmarks that saturate quickly.
ARC AGI https://arcprize.org/arc
Related Courses and Content: we cover many of these benchmarks in Benchmarks 101 and Benchmarks 201, while the Carlini, LMArena, and Braintrust episodes explore private, arena, and product evaluations respectively (see also LLM-as-Judge and Applied LLMs). Benchmarks are also closely tied to datasets.
Benchmarks 101 https://www.latent.space/p/benchmarks-101
Benchmarks 201 https://www.latent.space/p/benchmarks-201
Carlini https://www.latent.space/p/carlini
LMArena https://www.latent.space/p/lmarena
LLM-as-Judge https://hamel.dev/blog/posts/llm-judge/
Part 3: Prompts, ICL, and Chain of Thought
Note: The GPT-3 paper (“Language Models are Few-Shot Learners”) introduced In-Context Learning (ICL), which is closely related to prompting. We also consider prompt injection required background knowledge; see the writing of Lilian Weng and Simon W.
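As a rough illustration of In-Context Learning, the sketch below assembles a few-shot prompt: the labeled examples are placed directly in the prompt, and the model is expected to continue the pattern for a new input without any weight updates. The `Input:`/`Output:` template, the function name, and the sentiment examples are assumptions made for illustration; they are not a format prescribed by the GPT-3 paper.

```python
def build_few_shot_prompt(examples, query, instruction=""):
    """Build a few-shot prompt from (input, output) pairs plus an unanswered query."""
    parts = [instruction] if instruction else []
    for text, label in examples:
        parts.append(f"Input: {text}\nOutput: {label}")
    # The final block is left open so the model completes the answer.
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


# Hypothetical demonstrations for a sentiment task.
examples = [
    ("The plot was dull.", "negative"),
    ("A joy from start to finish.", "positive"),
]
prompt = build_few_shot_prompt(
    examples,
    "I would watch it again.",
    instruction="Classify the sentiment of each input.",
)
print(prompt)
```

With an empty `examples` list the same function yields a zero-shot prompt, which makes it easy to compare zero-shot and few-shot behavior on the same query.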
Prompt Report https://arxiv.org/abs/2406.06608 https://www.latent.space/p/learn-prompting
Chain of Thought https://arxiv.org/abs/2201.11903
Scratchpads https://arxiv.org/abs/2112.00114
Let’s Think Step By Step https://arxiv.org/abs/2205.11916
https://arxiv.org/abs/2305.10601 https://www.latent.space/p/shunyu
https://aclanthology.org/2021.emnlp-main.243 https://arxiv.org/abs/2101.00190 https://arxiv.org/abs/2402.10200 https://vgel.me/posts/representation-engineering https://github.com/xjdr-alt/entropix
https://arxiv.org/abs/2211.01910 https://arxiv.org/abs/2310.03714
Part 3 is one area where reading scattered papers may be less useful than working through practical guides. We recommend the writing of Lilian Weng and Eugene Yan, along with Anthropic’s prompt engineering tutorials and AI engineer workshops.
https://nlp.stanford.edu/IR-book/information-retrieval-book.html https://en.wikipedia.org/wiki/Information_retrieval#History https://en.wikipedia.org/wiki/Tf%E2%80%93idf https://en.wikipedia.org/wiki/Okapi_BM25 https://github.com/facebookresearch/faiss https://arxiv.org/abs/1603.09320
https://arxiv.org/abs/2005.11401 https://contextual.ai/introducing-rag2/ https://docs.llamaindex.ai/en/stable/optimizing/advanced_retrieval/query_transformations/ https://research.trychroma.com/evaluating-chunking https://cohere.com/blog/rerank-3pt5 https://www.youtube.com/watch?v=i2vBaFzCEJw https://www.youtube.com/watch?v=TRjq7t2Ms5I&t=152s https://www.youtube.com/watch?v=FDEmbYPgG-s https://www.youtube.com/watch?v=DId2KP8Ykz4
https://arxiv.org/abs/2210.07316 https://news.ycombinator.com/item?id=42504379 https://www.youtube.com/watch?v=VIqXNRsRRQo https://huggingface.co/blog/matryoshka
https://arxiv.org/pdf/2404.16130 https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/ https://buttondown.com/ainews/archive/ainews-graphrag/ https://www.youtube.com/watch?v=knDDGYHnnSI https://github.com/stanford-futuredata/ColBERT
https://arxiv.org/abs/2309.15217 https://x.com/swyx/status/1724490887147978793 https://arxiv.org/abs/2407.07858v1 https://lilianweng.github.io/posts/2024-07-07-hallucination/ https://x.com/_jasonwei/status/1871285864690815053
https://arxiv.org/abs/2310.06770 https://www.latent.space/p/iclr-2024-benchmarks-agents?utm_source=publication-search#%C2%A7section-b-benchmarks https://www.latent.space/p/claude-sonnet https://openai.com/index/introducing-swe-bench-verified/ https://x.com/jiayi_pirate/status/1871249410128322856 https://arxiv.org/abs/2405.15793 https://arxiv.org/abs/2410.03859 https://kprize.ai/
https://arxiv.org/abs/2210.03629 https://www.latent.space/p/shunyu https://gorilla.cs.berkeley.edu/ https://gorilla.cs.berkeley.edu/leaderboard.html https://arxiv.org/abs/2302.04761 https://arxiv.org/abs/2303.17580
https://arxiv.org/abs/2310.08560 https://openai.com/index/memory-and-new-controls-for-chatgpt/ https://langchain-ai.github.io/langgraph/concepts/memory/#episodic-memory https://arxiv.org/abs/2308.00352 https://arxiv.org/abs/2308.08155 https://github.com/joonspk-research/generative_agents
https://arxiv.org/abs/2305.16291 https://arxiv.org/abs/2309.02427 https://arxiv.org/abs/2409.07429
https://www.anthropic.com/research/building-effective-agents https://github.com/openai/swarm
https://www.latent.space/p/2024-agents https://www.youtube.com/watch?v=wnsZ7DuqYp0
https://arxiv.org/abs/2211.15533 https://huggingface.co/datasets/bigcode/the-stack-v2 https://arxiv.org/abs/2402.19173
https://arxiv.org/abs/2401.14196 https://arxiv.org/abs/2409.12186 https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/ https://www.latent.space/p/claude-sonnet
https://arxiv.org/abs/2107.03374 https://aider.chat/docs/leaderboards/ https://arxiv.org/abs/2312.02143 https://huggingface.co/spaces/bigcode/bigcodebench-leaderboard https://livecodebench.github.io/ https://buttondown.com/ainews/archive/ainews-to-be-named-5745/
https://arxiv.org/abs/2401.08500 https://news.ycombinator.com/item?id=34020025 https://x.com/RemiLeblond/status/1732419456272318614
https://criticgpt.org/criticgpt-openai/ https://arxiv.org/abs/2412.15004v1 https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html#safety-relevant-code
https://www.youtube.com/watch?v=Ve-akpov78Q https://www.youtube.com/watch?v=T7NWjoD_OuY&t=8s
https://arxiv.org/abs/1506.02640 https://github.com/ultralytics/ultralytics https://news.ycombinator.com/item?id=42352342 https://arxiv.org/abs/2304.08069
https://arxiv.org/abs/2103.00020 https://arxiv.org/abs/2010.11929 https://arxiv.org/abs/2201.12086 https://arxiv.org/abs/2301.12597 https://www.latent.space/i/152857207/part-vision
https://arxiv.org/abs/2401.06209 https://www.latent.space/p/2024-vision https://arxiv.org/abs/2311.16502 https://arxiv.org/abs/2410.03859
https://arxiv.org/abs/2304.02643 https://arxiv.org/abs/2408.00714 https://latent.space/p/sam2 https://github.com/IDEA-Research/GroundingDINO
https://arxiv.org/abs/2304.08485 https://www.latent.space/p/neurips-2023-papers https://huyenchip.com/2023/10/10/multimodal.html https://arxiv.org/abs/2405.09818 https://arxiv.org/abs/2411.14402 https://lilianweng.github.io/posts/2022-06-09-vlm/
https://cdn.openai.com/papers/GPTV_System_Card.pdf https://arxiv.org/abs/2309.17421 https://blog.roboflow.com/gpt-4o-object-detection/ https://buttondown.com/ainews/archive/ainews-llama-32-on-device-1b3b-and-multimodal/ https://www.youtube.com/watch?v=T7sxvrJLJ14
https://arxiv.org/abs/2212.04356 https://news.ycombinator.com/item?id=33884716 https://news.ycombinator.com/item?id=38166965 https://github.com/huggingface/distil-whisper https://amgadhasan.substack.com/p/demystifying-openais-new-whisper
http://audiopalm https://arxiv.org/abs/2407.21783
https://arxiv.org/abs/2205.04421 https://arxiv.org/abs/2403.03100
http://moshi/ https://www.youtube.com/watch?v=hm2IJSKcYvo https://www.hume.ai/blog/introducing-octave
https://www.latent.space/p/realtime-api
https://arxiv.org/abs/2112.10752 https://stability.ai/news/stable-diffusion-v2-release https://arxiv.org/abs/2307.01952 https://arxiv.org/abs/2403.03206 https://github.com/black-forest-labs/flux
https://arxiv.org/abs/2102.12092 https://arxiv.org/abs/2204.06125 https://cdn.openai.com/papers/dall-e-3.pdf
https://arxiv.org/abs/2205.11487 https://deepmind.google/technologies/imagen-2/ https://arxiv.org/abs/2408.07009 https://www.reddit.com/r/singularity/comments/1exsq4d/introducing_ideogram_20_our_most_advanced/
https://arxiv.org/abs/2303.01469 https://arxiv.org/abs/2310.04378 https://www.latent.space/p/tldraw https://arxiv.org/abs/2410.11081
https://openai.com/index/sora/ https://arxiv.org/abs/2212.09748 https://artificialanalysis.ai/text-to-video/arena?tab=Leaderboard https://arxiv.org/abs/2412.00131 https://lilianweng.github.io/posts/2024-04-12-diffusion-video/
https://arxiv.org/abs/2106.09685 http://arxiv.org/abs/2305.14314 https://www.latent.space/p/cosine https://www.answer.ai/posts/2024-03-06-fsdp-qlora.html
https://arxiv.org/abs/2305.18290 https://arxiv.org/abs/1707.06347 https://platform.openai.com/docs/guides/fine-tuning#preference
https://arxiv.org/abs/2404.03592
https://www.microsoft.com/en-us/research/blog/orca-agentinstruct-agentic-flows-can-be-effective-synthetic-data-generators/ https://www.latent.space/p/2024-syndata-smolmodels
https://www.interconnects.ai/p/openais-reinforcement-finetuning https://arxiv.org/abs/2305.20050 https://x.com/swyx/status/1867990396762243324
https://github.com/unslothai/unsloth https://www.philschmid.de/fine-tune-llms-in-2025