Source | The Robot Brains Podcast
Translation | Xu Jiayu, Jia Chuan, Yang Ting

In 2017, Google released the paper “Attention Is All You Need,” which proposed the Transformer architecture. This has become one of the most influential technological innovations in the field of neural networks over the past decade and has been widely applied in fields such as NLP, computer vision, and protein folding. More importantly, it has become the cornerstone of many large models, including ChatGPT.
However, all eight authors of the Transformer paper have since left Google. Among them, Lukasz Kaiser went to OpenAI and Llion Jones recently left to start his own company, while the other six have gone on to co-found Adept, Cohere, Character.ai, Inceptive, and NEAR.AI, all prominent companies in the industry.
As one of the researchers behind the Transformer, Aidan Gomez recognized the potential of language in AI early on and co-founded Cohere in 2019 with Nick Frosst and Ivan Zhang, building commercial large language models for enterprises and developers. Recently, Cohere secured $270 million in Series C funding from backers including NVIDIA, Oracle, and Salesforce, becoming a unicorn valued at $2 billion. Notably, AI scholars such as Geoffrey Hinton, Fei-Fei Li, and Pieter Abbeel are also among its investors.
In Aidan Gomez’s view, organizations like Google Brain are a paradise for researchers during the exploratory research phase, but when it comes to turning technology into real products and experiences, large companies like Google are far less free, flexible, and fast than startups.
However, building large models is not easy. He likens Cohere’s current work to building a crewed rocket: it is made up of many different components and sensors, and if any one part fails, the rocket explodes. Building large models is similar: each stage depends on the quality of the work done in the previous stage, and the whole process is intricate and complex. Recently, in a conversation with reinforcement learning expert Pieter Abbeel, Aidan shared his views on large language models, the thinking behind Cohere, and the story of how the Transformer paper came to be.
(Original text: https://www.youtube.com/watch?v=zBK2CPka5jo)
Basic Principles of Large Language Models
Pieter Abbeel: As the co-founder and CEO of Cohere, you are training large language models (LLMs). What are LLMs? What do they do?
Aidan Gomez: With Transformers and this generation of language models, we have seen many powerful applications, such as GPT-3, Cohere, and ChatGPT. Their basic principle is to model ever more complex datasets by scaling up the model. Clearly, the most complex dataset of all is internet data, which has accumulated over decades; today roughly 60-70% of the global population is online, doing all kinds of things, from taking programming and language courses to discussing every sort of event and issue.
If we want to model this large and highly diverse dataset, we need an extremely complex model, and that is where the Transformer comes in. The Transformer is a neural network architecture that scales exceptionally well and parallelizes effectively, which is crucial for training on large supercomputers equipped with thousands of GPU accelerators. Scaling up both models and datasets yields excellent results; as OpenAI has put it, Transformer models have become masters of multitasking.
This means that the same model, with the same set of weights, can perform many tasks, including translation, entity extraction, writing blogs and articles, and so on. We have now built models that can complete tasks through instructions (Cohere calls these command models, while OpenAI calls them instruction models), and large language model technology has entered people’s lives, becoming more intuitive and usable. For most people today, you simply give the model natural-language instructions, and it generates the corresponding results.
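For readers who want to see where the parallelism Gomez mentions comes from, the heart of the Transformer is scaled dot-product attention, which computes every token-to-token interaction as a handful of matrix products. The following is only a minimal single-head NumPy sketch with toy dimensions (no masking, no multi-head projections), not a production implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V -- every position attends to every other one."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (seq, seq) pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                   # weighted sum of value vectors

# Toy example: 4 tokens with 8-dimensional embeddings and a single head.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))  # learned in practice
out = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
print(out.shape)  # (4, 8): all positions are handled in one matrix product, easy to parallelize
```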
Pieter Abbeel: Essentially, the original model is trained on internet text, and in this way it learns to generate responses and so on. But the problem is that the quality of text on the internet varies enormously, and people’s needs differ as well. How can simply scaling up training on internet data produce a model that exceeds expectations?
Aidan Gomez: As you said, there is a lot of noise in internet data, and the impact of this content on machine learning is more harmful than beneficial, so filtering is an important part of the data processing pipeline. Through filtering, we can eliminate noise, empty strings, or highly repetitive string content, thereby keeping the dataset as clean as possible. Clearly, across the entire internet, no matter how advanced and sophisticated the filter used, noise is always unavoidable, and there will always be a certain proportion of noise in the dataset. What we can do is minimize that proportion as much as possible.
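To give a concrete (and deliberately simplified) picture of what such filtering can look like, the sketch below applies two generic heuristics of the kind mentioned here, dropping empty or highly repetitive documents; the thresholds and rules are illustrative only, not Cohere’s actual pipeline:

```python
def repetition_ratio(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that are duplicates; high values suggest spam or boilerplate."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

def keep_document(text: str, min_words: int = 20, max_repetition: float = 0.3) -> bool:
    """Illustrative filter: drop empty, very short, or highly repetitive documents."""
    text = text.strip()
    if not text or len(text.split()) < min_words:
        return False
    return repetition_ratio(text) <= max_repetition

# Usage on a raw corpus (a list of strings):
# cleaned = [doc for doc in raw_corpus if keep_document(doc)]
```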
But I believe that what happens after training is key, because that determines the model’s capabilities and the user experience. Initially, we crawl a huge amount of web data (after filtering) to train the giant model, but next we need to fine-tune it on a smaller, manually curated dataset. The initial large model has absorbed vast amounts of knowledge from the web; what we need to do is steer it in the direction we want.
If the initial raw model is trained on trillions of words from web pages, the amount of data needed in the fine-tuning phase is far smaller; what matters is that this data reflects how we expect the model to behave. For example, with command models (similar to OpenAI’s instruction models), we want to give the model a natural-language instruction and have it respond in an intuitive, correct way, such as asking it to rewrite a blog post in an excited tone. To achieve that, we collect blog posts written in an excited tone, include them in the dataset, and fine-tune the model on this data, thereby steering a knowledgeable but unguided model toward one that can be controlled intuitively.
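As a toy sketch of what supervised fine-tuning on such instruction-response pairs can look like (assuming the Hugging Face transformers library, with gpt2 standing in for a large pretrained model; this is an illustration, not Cohere’s training stack), the usual approach is simply to apply the next-token objective to formatted examples:

```python
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# The curated dataset is tiny compared with the pretraining corpus: each example
# pairs a natural-language instruction with the behaviour we want the model to show.
examples = [
    {"instruction": "Edit this blog post so it sounds excited: We launched a product.",
     "response": "Big news: we just launched our product, and we could not be more thrilled!"},
]

optimizer = AdamW(model.parameters(), lr=1e-5)
model.train()
for ex in examples:
    text = f"Instruction: {ex['instruction']}\nResponse: {ex['response']}{tokenizer.eos_token}"
    batch = tokenizer(text, return_tensors="pt", truncation=True)
    # Standard causal language-modelling loss: predict each next token of the example.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```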
The same applies to ChatGPT-style conversational models: starting from a knowledgeable, powerful base model, we fine-tune it on a small curated dataset to steer it toward the behavior we want. If we want a conversational model, we show it a large number of conversations, and through several training phases (from the initial, messy large dataset to successive rounds of fine-tuning) it gradually becomes what we hope it will be.
Pieter Abbeel: Interestingly, fine-tuning can be done on such small datasets, which saves a lot of trouble related to large dataset processing. Moreover, language models have the most natural interactive interface, capable of engaging in long conversations based on what the user says. However, they all require hidden prompts to generate expected responses. How do you view this prompt engineering? How will it develop in the future?
Aidan Gomez: Prompt engineering is essentially giving the model instructions and guiding its behavior. Fine-tuning on a small curated dataset is one way to steer the model with data, while prompting refers to the instructions we give the model at inference time, where we can specify behaviors or tones and provide demonstration examples. We can say, “I want you to write a blog about X in the tone of the blog below,” and the model starts writing. This is few-shot prompting: demonstrating the desired behavior so the model can imitate it in its response.
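In practice, few-shot prompting is just careful prompt construction. The hypothetical sketch below builds such a prompt; generate() is a placeholder for whatever hosted model endpoint is being used, not a real API:

```python
def build_prompt(demonstration: str, topic: str) -> str:
    """Pack the instruction, a demonstration, and the new task into one prompt."""
    return (
        "I want you to write a blog post in the tone of the example below.\n\n"
        f"Example blog post:\n{demonstration}\n\n"
        f"Now write a blog post about {topic}:\n"
    )

def generate(prompt: str) -> str:
    # Placeholder for a real large language model call (e.g. a hosted endpoint).
    raise NotImplementedError

prompt = build_prompt(
    demonstration="We can't wait to tell you about our trip -- it was incredible!",
    topic="the release of our new multilingual model",
)
# completion = generate(prompt)
```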
I hope that in the future prompting will no longer be necessary. I really liked an early analogy that compared large language models to alien technology we have to learn to communicate with. But doing that is very hard: you have to reverse-engineer the language the model has learned from the internet, which takes a lot of effort. As models become more powerful and fine-tuning becomes more precise, I hope that mastering the quirks of a specific language model will no longer be a burden, or an advantage.
Different models have different personalities, and by “personality” I mean that if you want a model to act on your intent, you must learn how to talk to that particular model. Because training datasets differ, we may have to learn different languages, which means that with every model update we have to readjust our internal picture of how to converse with it. Most large language model builders in the industry, myself included, hope to gradually reduce and eventually overcome this problem.
Pieter Abbeel: Ideally, interacting with a model should feel as natural as working with a person who is good at teamwork; the model should reach the level of an excellent collaborator.
Aidan Gomez: If we can achieve that, it will be an amazing accomplishment. Unfortunately, at a technical level we are still far from the fluency of human information sharing and conversation. I believe large language models are an interface technology that unlocks dialogue and fundamentally changes the scope of what can be built.
Pieter Abbeel: It is often said that training large language models (the larger, the better) requires a lot of computing power, which means significant financial investment. Is the cost of training SOTA large language models still that high?
Aidan Gomez: Indeed, significant financial resources are needed for everything from purchasing equipment to attracting talent. One of the reasons I co-founded Cohere with Ivan Zhang was to lower this barrier. Using models to develop products should not be limited to companies that have raised large amounts of seed or Series A funding. Currently, this is indeed a huge obstacle, but we are trying to address this issue. The solution is to bear the costs of computing, talent, and data collection ourselves, and spread the costs among a broader user base, making the models affordable and easy to use.
Opportunities for Large Language Models
Pieter Abbeel: Cohere was founded in 2019 and was the first startup focused on large language models. OpenAI was already working on language models at the time, but its work was spread across many different areas. Since Cohere was established, many large language model startups have emerged, such as Anthropic (founded by former OpenAI employees), Character.ai (founded by former Google employees), and Adept (founded by former Google employees), and many have raised over $100 million. Of course, large companies with abundant resources, such as Google, Meta, Amazon, and DeepMind, are also in the race. It is fair to say that the large language model field is now in a highly competitive stage. How do you view this situation?
Aidan Gomez: Large language models are a foundational technology, which is why so many smart people want to work on them, and we welcome that. General language has extremely broad applications across many kinds of products. Because language is so diverse, it can deliver very rich value, so the prospects for this field are vast.
Currently, many companies, from startups to the largest enterprises in the world, are building large language models, because the products and services they enable are almost limitless. This is exactly what Cohere advocates: to provide a platform on which anyone can build new products without needing to raise huge amounts of funding. I believe more competitors and new products will keep emerging, and I look forward to the technology’s coverage expanding rapidly.
Pieter Abbeel: This reminds me of a tweet by Andrej Karpathy (OpenAI scientist): “English is now the hottest new programming language.”
Aidan Gomez: This statement indeed summarizes the current situation well. A recent paper proposed some interesting programming strategies, such as Loop Transformers, which achieve universality by recursively feeding the output of the Transformer back into itself.
Additionally, tools like LangChain let you compose calls and add loops and logic. These strategies together form an ecosystem whose utility grows through continual reuse. I saw a fantastic project at the Sky Hackathon where the backend was built entirely out of large language model calls, with only a single UI component on top. That idea is genuinely enlightening: many traditional backend architectures could be replaced with large language model calls.
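A minimal, library-free sketch of that pattern (an illustration of the idea, not LangChain’s actual API): ordinary control flow wrapped around a model call, where llm can be any function that maps a prompt string to a completion string.

```python
from typing import Callable

def refine_until_short(llm: Callable[[str], str], draft: str,
                       max_words: int = 100, max_rounds: int = 3) -> str:
    """Repeatedly ask the model to shorten a draft until it fits the word limit."""
    text = draft
    for _ in range(max_rounds):
        if len(text.split()) <= max_words:
            break  # the loop and the stopping logic live in plain code
        text = llm(f"Rewrite the following in under {max_words} words:\n{text}")
    return text

# `llm` could be a thin wrapper around any hosted model endpoint; chaining several
# such functions together is enough to build a simple LLM-driven backend.
```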
Pieter Abbeel: There are many languages in the world, and Cohere recently launched a multilingual text understanding model.
Aidan Gomez: The multilingual text understanding model is different from generative models like GPT or ChatGPT; it is a representation model, similar to BERT. It covers 109 languages and is mainly used for classification and semantic retrieval.
When training translation systems or multilingual systems, the traditional approach is to prepare paired examples. These examples need to have the same semantic content, one represented in language A and the other in language B. Some languages have a lot of paired examples, such as English and French, so the model is very good at translating these languages; however, some languages have fewer paired examples, such as Swahili and Korean.
For such languages, the team first scrapes an extremely large and diverse dataset. This dataset has no paired data, only a large amount of unsupervised text. Then, whatever paired data can be found is collected, and this smaller paired dataset is used to align the representations of different languages. In this way, we obtain a model that works across languages.
Cohere currently supports 109 languages, which opens up more possibilities when building classifiers. For example, I might be a developer who only speaks English, but I can give a classifier built on Cohere a few training examples and it generalizes across all 109 languages. So no matter what language a user speaks to me in, I can offer strong classification. The same goes for information retrieval: even if I only understand English, I can search a dataset containing many languages and retrieve all the relevant documents.
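To make that concrete, here is a hypothetical sketch of a few-shot, cross-lingual classifier built on multilingual embeddings; embed() is a placeholder for a real multilingual representation model that maps text in any supported language into one shared vector space, and the nearest-centroid scheme is purely illustrative, not Cohere’s API:

```python
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    # Placeholder: a real multilingual embedding model would return one vector per text.
    raise NotImplementedError

def train_centroids(examples: dict[str, list[str]]) -> dict[str, np.ndarray]:
    """Average the embeddings of the (e.g. English-only) training examples per label."""
    return {label: embed(texts).mean(axis=0) for label, texts in examples.items()}

def classify(text: str, centroids: dict[str, np.ndarray]) -> str:
    """Assign the label whose centroid is closest in cosine similarity."""
    v = embed([text])[0]
    v = v / np.linalg.norm(v)
    return max(centroids,
               key=lambda lbl: centroids[lbl] @ v / np.linalg.norm(centroids[lbl]))

# Because all 109 languages share the same embedding space, a classifier trained on
# English examples can label inputs written in any of the other languages.
```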
Pieter Abbeel: It really is astonishing that these technologies work. Twenty or thirty years ago, machine translation was still a distant dream with few substantial results. Now people can translate between 109 languages and do well even for languages without paired data.
Pieter Abbeel: What is Cohere’s user base like? Are there any commercial use cases?
Aidan Gomez: From a technical application perspective, we are still in the early stages of the adoption curve. Although ChatGPT has achieved some breakthroughs due to widespread public recognition, from the application curve perspective, we are still in the early stages. Typically, the first adopters of emerging technologies are students, engineers, and entrepreneurs, and Cohere’s initial use cases and most users are precisely these groups.
Now, executives at every large enterprise are asking, “What is ChatGPT? What are large language models? How can we leverage them?” These discussions began a few months ago, and I expect that by 2024 the conversation will have shifted to how to apply this technology at scale within existing large enterprises and products. I believe this trend will reach every industry.
Pieter Abbeel: Suppose you have a large company with thousands of employees who communicate over Slack and email. Could we give the Cohere model access to all of that communication so it can answer any work-related question?
Aidan Gomez: Our direction is to create an assistant that interacts with large knowledge bases, and that is one of Cohere’s goals. We have not fully achieved it yet, but we are very close, and the prospects are bright. Academia has been working hard on this for a while, for example on Retrieval Augmented Generation, which connects external knowledge bases to models so their answers stay current. In principle, we only need to let the model access that database, retrieve documents, and return the results to users as text. More and more applications now support this functionality.
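A minimal sketch of that retrieval-augmented pattern, under two assumed helpers (embed for an embedding model and generate for a generative model, both hypothetical here); the knowledge base is just a list of documents:

```python
import numpy as np

def retrieve(query: str, documents: list[str], embed, top_k: int = 3) -> list[str]:
    """Rank documents by cosine similarity to the query in embedding space."""
    vecs = embed([query] + documents)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    order = np.argsort(-(vecs[1:] @ vecs[0]))
    return [documents[i] for i in order[:top_k]]

def answer(question: str, documents: list[str], embed, generate) -> str:
    """Build a prompt from the retrieved context and let the generative model answer."""
    context = "\n\n".join(retrieve(question, documents, embed))
    prompt = ("Answer the question using only the context below.\n\n"
              f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
    return generate(prompt)
```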
Imagine a system like this: Users can log in and connect to applications like Twitter, Slack, Discord, Gmail, etc., to access all their content and search through it. Additionally, users can ask the system questions related to their information in these programs and request it to take actions (such as sending emails or purchasing items). We are very close to such a system now, and we can definitely build it. In the near future, the system can access a large-scale, well-maintained, and personalized knowledge base and take action using tools, thereby producing substantial impacts and transformations.
Pieter Abbeel: I find this very interesting. I work in the field of reinforcement learning, where agents make decisions and take actions, etc. Do you think that in the near future, AI assistants will be able to achieve these functions?
Aidan Gomez: Such systems have great value in communication, especially in customer support. For example, an agent that can resolve user requests like resetting a password may be one of the most valuable applications, and that is the kind of functionality we hope to support first.
We have already built all the tools. For example, video chat runs in the browser; if the model can control the browser, or use the browser as a tool, then it can carry out everything the browser can do, which would be very useful. Controlling a browser and navigating the web is more challenging than calling discrete APIs, because the web environment is more irregular and variable. But I think this is one of the most anticipated and challenging projects right now.
Pieter Abbeel: I agree with you. If a technology can assist people in their lives, work, and entertainment, then the field where that technology resides will be very active.
Creation of the Non-Profit Organization Cohere for AI
Pieter Abbeel: I understand that Cohere also established a non-profit research lab.
Aidan Gomez: Cohere for AI is an independent non-profit organization led by Sara Hooker. Before founding Cohere, Ivan Zhang and I, along with a few others, formed a small research team. I had just returned to the University of Toronto from Google, and I had many research ideas I wanted to pursue but lacked like-minded people. So I posted a message in the University of Toronto CS Slack saying, “I am ready to do machine learning research; does anyone want to join?” and got responses from Ivan and some Google Brain researchers. We spent the whole summer on the project and published our results at ICLR (the International Conference on Learning Representations).
Later, we decided to expand the team. We learned that many undergraduates, graduate students, and even recent graduates wanted to explore machine learning but lacked collaborators, so we built a small website to share our work. The community gradually grew to about 80 members from 18 countries.
Later, however, Ivan and I founded Cohere, which left no time to maintain the site, and the community’s growth gradually stalled. In a chat with Sara, I discovered that she was passionate about advancing machine learning, eager to help others, and a strong advocate of mentorship. We hit it off and decided to have Cohere support an independent non-profit focused on publishing high-quality research and bringing in mentors to help people finish projects and papers, giving them support and guidance on their way into machine learning.
So, when this came to fruition, I was overjoyed. Now, Cohere for AI also has scholar programs and AI research residencies.
Pieter Abbeel: As a startup, why is the board willing to spend so much money to support a non-profit organization?
Aidan Gomez: I am fortunate to have excellent board members, such as Mike Volpi (a partner at Index Ventures) and Jordan Jacobs (managing partner at Radical Ventures). They support bringing this technology to market and care more about the company’s long-term development than short-term gains. Even when we make decisions like funding a non-profit, they evaluate them from that broader perspective.
The Birth of the Transformer
Pieter Abbeel: You co-authored the paper “Attention Is All You Need” with Google researchers, which introduced the Transformer architecture. This architecture has taken over much of AI in recent years and is still evolving. What were you thinking when you wrote that paper? Did you foresee the impact it would have, or were you surprised by how it all turned out?
Aidan Gomez: I had joined Google as an intern and thought I would be doing something entirely different. We met in Mountain View, and by coincidence I was sitting next to Noam Shazeer, while Lukasz Kaiser was my mentor. We were preparing to build a platform for training large autoregressive models on distributed compute. At Lukasz’s urging, Noam joined our team and began working on Tensor2Tensor.
In the translation field, the team led by Jakob Uszkoreit focused on ideas no one had tried before, such as pure attention models. During my 12-week internship, our team’s research centered on Tensor2Tensor, optimizing architectures and hyperparameters, and we submitted the paper to NeurIPS, where it was published.
During my internship, I exchanged ideas with other authors about “the potential impact in the future,” among whom at least two had the foresight to predict what could happen, but I was not one of them, which is normal because it was my first paper. When we submitted the paper to NeurIPS, it was already 2 a.m., and only Ashish Vaswani and I were left in the office. I was lying on the sofa, and Ashish was sitting next to me, exhausted. He said to me, “We are doing something big.” I just looked at him and asked, “Why do you say that?” He replied, “I don’t know how to say it, but it’s important.” I just casually said, “Maybe,” and then went to sleep.
Months later, the academic community reached a consensus on the Transformer architecture. I think most authors felt similarly: if we didn’t do this, we would miss the opportunity. In research at DeepMind, there had already been similar works like PixelCNN and WaveNet. Therefore, the adoption of the Transformer architecture was a very unexpected result for most people, and its popularity surprised us.
Pieter Abbeel: It’s quite something that this was your first paper; the funny thing is, it will be hard to ever write another one like it.
Aidan Gomez: Yes, that’s why I want to exit the research field as soon as possible.
Pieter Abbeel: The Transformer architecture has completely changed the field and made things that were previously impossible possible. Do you expect to see another architectural revolution like the Transformer in the future, or are there other factors that will determine the direction of AI development?
Aidan Gomez: I sincerely hope the Transformer is not the final architecture. Many improvements have been made since the vanilla Transformer, but I am not sure how big a breakthrough the eventual final architecture will be relative to the original. Perhaps new features will emerge, such as mixture-of-experts layers and large-scale sparse, distributed systems, but those components may still look a lot like the Transformer.
New questions have also arisen, such as: is the attention mechanism really that important? How much attention do we actually need? Can we use less attention without hurting performance, to save compute? I really hope we find something better, more efficient, and more scalable, so that people keep working in this area and exploring more architectural possibilities.
Pieter Abbeel: From another angle, if we could understand the structure and function of the human brain more deeply, we could compare current model architectures against it and improve them.
Aidan Gomez: If you have ever talked to Geoffrey Hinton, you know that this is, to some extent, his source of inspiration. His recent forward-forward algorithm is very interesting: it is not only biologically plausible but potentially more efficient, because the forward passes are easier to pipeline than a backward pass. Following biology too rigidly from the bottom up can be overly restrictive, but Geoffrey’s strategy is flexible; he is not bound by the details and is good at drawing inspiration from systems known to work. I find his approach very effective.
Pieter Abbeel: Geoffrey is very effective at that. Reading his work and talking with him is always inspiring. He thinks the assumption that we must be able to read out a model’s weights is too rigid, and that we should focus simply on training the neural network. Of course, in a production environment it would be awkward if we could not copy the weights to another machine, but implicit assumptions like these limit our thinking.
The Path to Becoming an AI Researcher
Pieter Abbeel: What caught your attention the most when you were a child? How did these help you become an AI researcher and entrepreneur?
Aidan Gomez: I grew up in a forest in Ontario, where my family ran a maple syrup farm, so my upbringing was very Canadian. It was a very comfortable life, but it had one drawback: no internet. In fact, there is still no internet there today, so I later bought my parents Starlink, which is the best option available. Growing up, I couldn’t access nearly as much as my friends could, but that only made the internet more attractive to me.
I had a computer with a modem at home, which was the only thing I could access at that time. So, I knew that I had to make the most of it and maximize its utility, which forced me to learn coding and understand how to make computers work better for me. I told myself to play with computers better than anyone else and to maximize their performance as much as possible.
Later, I gradually picked up some programming skills and got interested in it. Around that time, everyone was starting to get online, even in rural areas, so many stores needed to build websites and put their locations on Google Maps.
So I started a small company to help businesses in my town (about 10,000 people) get online, and of course I charged a fee for it. Looking back, the fee felt high to me at the time, but it was actually very reasonable, below the market rate. This was my first venture and my first exposure to customers. I really enjoyed it, because I could help people realize their business ambitions, grow their businesses, and make them visible to more people.
Pieter Abbeel: Can you give some advice to kids who want to enter the AI industry? What should they do? What is the best path to achieve it?
Aidan Gomez: It depends on personal interests, the technical aspects one wants to engage in, and how much time one wants to invest in it. If one wants to become a researcher, it is necessary to delve into multivariable calculus, linear algebra, optimization theory, etc., gradually deepening into neural networks and machine learning, and of course, reading relevant papers is also essential.
I remember when I was at the University of Toronto, I often brought a stack of papers to the gym, reading papers and taking notes during breaks. Mike Volpi, a partner at Index Ventures, often teased me about these things, and looking back, it was quite embarrassing. But this approach was very practical; instead of wasting time during breaks or while exercising, it was better to read and write papers to enrich myself. I believe one should be obsessed with research materials and truly invest time in studying.
When you first start reading papers, you may not understand a single sentence, and at that point, you need to stop and learn the relevant knowledge before continuing. Therefore, it may take you several days or even weeks to read a paper. But you will eventually learn this knowledge and become very familiar with it, and even contribute new insights in that field.
On the product side, don’t waste time on low-level details; focus on meaningful products. Keep track of technology trends, identify what products the market currently lacks, and think about what technologies are needed to build them. Make bold guesses and talk to potential customers often to validate your ideas. In short, different directions require different paths.
