What else can LLMs do?
From words to everything!
The famous scene from Close Encounters of the Third Kind begins with John Williams’ unforgettable motif:
The sequence starts on D, rises one whole step to E, then jumps down to C (still in a middle register).
The fourth note is C again but an octave below.
Finally, it ascends to G (which sits between those two Cs).
John Williams chose these five notes as an iconic “contact” motif, a greeting between humans and extraterrestrials. The intervals create a memorable pattern that’s neither too simple nor overly dissonant. Five notes drawn from a tiny vocabulary of pitches, and look how far they travel: that same principle is what lets Transformers reach beyond words.
1. Introduction: From Words to Everything
When Transformers were first introduced in the now-famous paper “Attention Is All You Need” (Vaswani et al., 2017), they revolutionised Natural Language Processing (NLP). The core idea was that language tokens (words, subwords, etc.) could be processed in parallel using self-attention to capture dependencies regardless of distance in the sequence.
But here’s the broader insight:
If you can represent your data as a sequence of symbols (or “tokens”), and if you can train a model to learn patterns of relationships among those tokens, then you can harness the Transformer’s power.
This realisation means we can apply Transformers to just about any domain if we figure out how to “tokenise” it.
2. The Mechanics of Tokenisation
In NLP, tokenisation involves splitting text into words, subwords, or characters. But more generally:
Define a Vocabulary: A vocabulary is the set of atomic symbols in your domain. For text, that might be words or subword units (e.g., Byte-Pair Encoding). For DNA, it’s {A, C, G, T}. For music, it might be note events or MIDI tokens.
Map Raw Data to Tokens: Convert raw data into a sequence of those symbols. In images, for instance, you might treat each patch of pixels as a “token.” In time-series data, each time step (or aggregated chunk) can be a token.
Embed the Tokens: Each token is then mapped to a vector embedding that the Transformer can process.
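To make the three steps concrete, here is a minimal sketch for DNA (the padding symbol, the 8-dimensional embedding size, and the random embedding table are illustrative choices; in a real model the table is learned jointly with the network):

```python
import numpy as np

# Step 1 -- define a vocabulary: the four nucleotides plus padding.
vocab = {"<pad>": 0, "A": 1, "C": 2, "G": 3, "T": 4}

# Step 2 -- map raw data to a sequence of token IDs.
def tokenise(sequence: str) -> list[int]:
    return [vocab[base] for base in sequence]

# Step 3 -- embed the tokens: one vector per vocabulary entry
# (random here; learned in practice).
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 8))

ids = tokenise("GATTACA")
embeddings = embedding_table[ids]
print(embeddings.shape)  # (7, 8): seven tokens, each an 8-dim vector
```

The resulting `(sequence length, embedding dimension)` array is exactly what a Transformer block expects as input, regardless of the domain the tokens came from.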
2.1 Why This Matters
The self-attention mechanism inside a Transformer compares each token’s embedding to every other token’s embedding, figuring out which tokens should pay attention to which. This ability to learn pairwise relationships at scale is what powers so many breakthroughs across data domains.
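A single attention head is compact enough to sketch directly. The projection matrices below are random stand-ins for learned weights; the example runs one head over five 8-dimensional token embeddings:

```python
import numpy as np

def self_attention(X: np.ndarray) -> np.ndarray:
    """X has shape (n, d): n token embeddings of dimension d."""
    n, d = X.shape
    rng = np.random.default_rng(0)
    # Random stand-ins for the learned query/key/value projections.
    W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Every token scores every other token: an (n, n) matrix of
    # pairwise interactions, regardless of distance in the sequence.
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V  # each output is a weighted mix of all values

X = np.random.default_rng(1).normal(size=(5, 8))  # 5 tokens, d = 8
out = self_attention(X)
print(out.shape)  # (5, 8)
```

The `(n, n)` score matrix is the whole story: it is where "which tokens should pay attention to which" is computed.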
3. Transformer Expansion Across Domains
3.1 Genomics & Proteomics
DNA as Text
In genomics, DNA is literally a string composed of four characters (A, C, G, T). Early work like DNABERT (Ji et al., 2021) applies the same masked language modeling concept from NLP to DNA sequences. It masks certain nucleotides and trains the model to predict them.
This approach has helped discover motifs and regulatory elements that are key to gene expression.
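The masking step itself is easy to illustrate. Below is a toy version that masks individual bases at a BERT-style 15% rate; note that DNABERT actually operates on k-mer tokens, so this is a deliberate simplification:

```python
import random

def mask_bases(seq: str, mask_rate: float = 0.15, seed: int = 0):
    """Replace ~mask_rate of bases with [MASK]; return the masked
    token list and the positions the model must predict."""
    rng = random.Random(seed)
    tokens, targets = [], {}
    for i, base in enumerate(seq):
        if rng.random() < mask_rate:
            tokens.append("[MASK]")
            targets[i] = base  # the training target at position i
        else:
            tokens.append(base)
    return tokens, targets

tokens, targets = mask_bases("ACGTACGTACGTACGT")
print(tokens)
print(targets)
```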
Protein Sequences
Proteins are sequences of amino acids, each represented by a 1-letter code (20 standard amino acids). Models like ProtBert (Elnaggar et al., 2020) or Facebook AI’s ESM (Evolutionary Scale Modeling) (Rives et al., 2021) treat proteins like text—learning to predict masked amino acids. They then use these pretrained embeddings for downstream tasks like structure prediction, function annotation, or variant effect prediction.
AlphaFold (Jumper et al., 2021) integrates attention mechanisms (though it’s more specialized) and made a huge leap in protein structure prediction by effectively capturing long-range dependencies in amino acid sequences.
3.2 Computer Vision
Vision Transformers (ViT)
Proposed by Dosovitskiy et al. (2021), ViT splits an image into patches (e.g., 16×16 pixels); each patch becomes a “token,” and the Transformer processes these tokens in parallel.
It showed that pure self-attention (with minimal convolution) could rival or exceed traditional convolutional neural networks (CNNs).
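The patch-to-token step is a pure reshaping operation. Here is a sketch with a toy 32×32 image and 8×8 patches (ViT itself uses 16×16 patches and follows this with a learned linear projection plus position embeddings):

```python
import numpy as np

def patchify(image: np.ndarray, P: int) -> np.ndarray:
    """image: (H, W, C) with H and W divisible by P.
    Returns (num_patches, P*P*C): one flattened vector per patch."""
    H, W, C = image.shape
    patches = image.reshape(H // P, P, W // P, P, C)
    patches = patches.transpose(0, 2, 1, 3, 4)  # group by patch grid
    return patches.reshape(-1, P * P * C)

img = np.arange(32 * 32 * 3).reshape(32, 32, 3).astype(float)
tokens = patchify(img, P=8)
print(tokens.shape)  # (16, 192): a 4x4 grid of patches, 8*8*3 values each
```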
Hybrid Approaches
Some models combine convolutional layers for early feature extraction with Transformer blocks for higher-level processing (e.g., Swin Transformer, Liu et al., 2021).
This synergy captures local image features (convs) plus global context (self-attention).
3.3 Audio and Speech
wav2vec 2.0
Baevski et al. (2020) introduced a self-supervised Transformer model for raw audio waveforms. The model learns latent speech representations by masking portions of the audio and predicting them.
This approach parallels masked language modeling and has become a state-of-the-art technique in automatic speech recognition.
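A key detail is that wav2vec 2.0 masks contiguous spans of latent frames rather than isolated positions. A sketch of that span-masking step follows; the span length and start probability are in the spirit of the paper's setup, but treat the exact values as illustrative:

```python
import numpy as np

def span_mask(num_frames: int, span: int = 10,
              p_start: float = 0.065, seed: int = 0) -> np.ndarray:
    """Pick random start frames, then mask a contiguous span from each."""
    rng = np.random.default_rng(seed)
    mask = np.zeros(num_frames, dtype=bool)
    starts = rng.random(num_frames) < p_start
    for i in np.flatnonzero(starts):
        mask[i : i + span] = True  # spans may overlap or hit the end
    return mask

mask = span_mask(200)
print(mask.sum(), "of 200 frames masked")
```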
Music Generation
Music Transformer (Huang et al., 2019) from Google processes sequences of musical events (like MIDI notes) using self-attention. It can generate coherent melodies and multi-instrument pieces.
OpenAI’s Jukebox (Dhariwal et al., 2020) uses a VQ-VAE + Transformer approach to generate raw audio in various genres, though classical/choral fidelity is still challenging.
3.4 Time-Series and Forecasting
Temporal Fusion Transformer
Lim et al. (2019) introduced a Transformer variant designed for multivariate time-series forecasting. Each time step becomes a token, and the model learns complex dependencies (seasonality, external variables, etc.).
Transformers often outperform classic RNN-based approaches (LSTMs) when the sequence length is large or when you need to capture intricate patterns over time.
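Tokenising a multivariate series is mostly a reshaping exercise. In this sketch each token aggregates k consecutive time steps, with one token per step as the k=1 special case; the sizes are arbitrary examples:

```python
import numpy as np

def series_to_tokens(series: np.ndarray, k: int = 1) -> np.ndarray:
    """series: (T, F) array of T time steps with F features.
    Returns T // k tokens, each the concatenation of k steps."""
    T, F = series.shape
    usable = (T // k) * k  # drop any ragged tail
    return series[:usable].reshape(T // k, k * F)

series = np.random.default_rng(0).normal(size=(100, 4))  # 100 steps, 4 features
tokens = series_to_tokens(series, k=5)
print(tokens.shape)  # (20, 20): 20 tokens, each 5 steps x 4 features
```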
3.5 Graph Data
Graph Transformers
While graphs aren’t inherently linear, researchers have found ways to linearize adjacency or define specialized attention that factors in graph structure. For instance, Graph-BERT (Zhang et al., 2020) and Graphormer (Ying et al., 2021) adapt self-attention to handle nodes and edges.
This approach has proven useful in molecular property prediction (molecules are graphs of atoms), social network analysis, and knowledge graph tasks.
3.6 Code and Programming
Codex / GitHub Copilot
Code is text—just in a more structured form. OpenAI’s Codex (Chen et al., 2021) was fine-tuned on GitHub repositories. By treating code lines and tokens like regular text, the Transformer learns to generate or complete code.
This has led to tools like GitHub Copilot, which can assist developers by suggesting entire functions or boilerplate code.
3.7 Structured Tabular Data
Tabular Transformers
Tabular data is typically arranged in rows and columns, but each row can become a “sequence” of feature embeddings.
Researchers have shown that Transformers can capture interactions among features better than some gradient boosting methods, especially when many features (columns) have complex relationships (see Huang et al., 2020; Arik & Pfister, 2021).
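One simple version of that row-to-sequence step is sketched below. The column names, the 8-dimensional embeddings, and the scale-the-column-embedding trick are all illustrative stand-ins for what these models actually learn:

```python
import numpy as np

columns = ["age", "income", "tenure"]
rng = np.random.default_rng(0)
# One embedding per column (random here; learned in a real model).
col_embed = {c: rng.normal(size=8) for c in columns}

def row_to_tokens(row: dict) -> np.ndarray:
    """One table row -> a sequence of feature tokens, one per column.
    Toy scheme: scale the column embedding by the normalised value."""
    return np.stack([col_embed[c] * row[c] for c in columns])

tokens = row_to_tokens({"age": 0.35, "income": 0.7, "tenure": 0.1})
print(tokens.shape)  # (3, 8): three feature tokens
```

Self-attention then models feature-feature interactions the same way it models word-word interactions in text.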
3.8 Multimodal Data
CLIP
Radford et al. (2021) introduced CLIP, which trains two encoders: one for text and one for images. It uses a contrastive objective to align the two modalities in a shared embedding space.
This concept extends to audio, video, and more. The Transformer can process diverse token types—image patches, text tokens, audio frames, etc.—and unify them in a single model.
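The contrastive objective is compact enough to sketch. The batch size, embedding dimension, and fixed 0.07 temperature below are illustrative (CLIP actually learns the temperature as a parameter):

```python
import numpy as np

def clip_loss(img_emb: np.ndarray, txt_emb: np.ndarray,
              temperature: float = 0.07) -> float:
    """img_emb, txt_emb: (N, d) embeddings of N matched pairs."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N) similarity matrix

    def xent(l: np.ndarray) -> float:
        # Cross-entropy with the diagonal (the true pairs) as targets.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # Symmetric: image-to-text and text-to-image directions.
    return (xent(logits) + xent(logits.T)) / 2

rng = np.random.default_rng(0)
loss = clip_loss(rng.normal(size=(4, 16)), rng.normal(size=(4, 16)))
print("contrastive loss:", float(loss))
```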
4. Why It Works: The Transformer’s Power
Self-Attention: Each token compares itself with every other token, learning a rich set of pairwise interactions. This enables the model to capture long-range dependencies—crucial for tasks like DNA analysis (where distant nucleotides may influence gene expression) or music composition (where motifs recur across measures).
Parallelization: Transformers process tokens in parallel rather than sequentially (like RNNs). This drastically speeds up training on GPUs/TPUs.
Flexible Architecture: Transformers don’t rely on domain-specific operations (like convolution for images). They rely on attention over “tokens” in a generalizable framework.
5. Common Tokenisation Schemes Across Domains
Text: Subwords via Byte-Pair Encoding (BPE) or SentencePiece.
DNA: Single nucleotides or k-mer tokens (e.g., chunks of length k=3).
Proteins: Single amino-acid tokens or grouped tokens for related subsequences.
Images: Square patches (e.g., 16×16 pixels).
Audio: Waveform frames or spectrogram patches.
Time-Series: Each time slice or a chunk of consecutive measurements as a token.
Graphs: Node labels or sequences of edges/nodes (with adjacency encoded in positional embeddings or specialized attention layers).
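As a concrete example, the DNA k-mer scheme from this list is a one-liner: slide a window of length k over the sequence so each overlapping chunk becomes a token.

```python
def kmers(seq: str, k: int = 3) -> list[str]:
    """Overlapping k-mer tokens of a DNA string (k=3 by default)."""
    return [seq[i : i + k] for i in range(len(seq) - k + 1)]

print(kmers("ACGTAC"))  # ['ACG', 'CGT', 'GTA', 'TAC']
```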
6. Key Challenges and Ongoing Research
Tokenisation Efficiency: For high-resolution images or long DNA strands, naive tokenisation can lead to enormous sequence lengths (thousands, even millions of tokens). There’s ongoing research into hierarchical tokenisation or using efficient attention mechanisms (e.g., Performer, Linformer, Reformer) to handle longer sequences.
Positional Encoding: Transformers rely on positional encodings (sine/cosine or learned) to provide sequence order. Different data types (2D images, 3D proteins) might need more sophisticated positional representations.
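For reference, the original sine/cosine scheme from Vaswani et al. (2017) takes only a few lines (this sketch assumes an even model dimension d):

```python
import numpy as np

def positional_encoding(n_positions: int, d: int) -> np.ndarray:
    """Sinusoidal encodings: position pos, channel pair i gets
    sin(pos / 10000^(2i/d)) on even channels, cos on odd ones."""
    pos = np.arange(n_positions)[:, None]
    i = np.arange(d // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / d)
    pe = np.zeros((n_positions, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(50, 16)
print(pe.shape)  # (50, 16): one d-dim encoding per position
```

Richer data types replace `pos` with 2D patch coordinates, 3D residue positions, or learned alternatives.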
Domain-Specific Constraints: Some domains (like music theory or structural biology) have strong prior knowledge about valid transitions or structural constraints. Incorporating these constraints explicitly can improve performance and interpretability.
Computational Cost: Self-attention scales with O(n²) complexity in sequence length. For large data (e.g., entire genomes, high-resolution videos), this becomes a bottleneck.
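A quick back-of-the-envelope calculation makes the bottleneck vivid: the attention score matrix alone stores one value per token pair, so doubling the sequence length quadruples its memory (float32 and per-head figures shown; real models multiply this across heads and layers).

```python
def attention_matrix_bytes(n: int, bytes_per_value: int = 4) -> int:
    """Memory for one (n x n) float32 attention score matrix."""
    return n * n * bytes_per_value

for n in (1_024, 16_384, 262_144):
    gib = attention_matrix_bytes(n) / 2**30
    print(f"n={n:>7}: {gib:.2f} GiB per head")
```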
7. References
NLP:
Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS 2017.
Vision:
Dosovitskiy, A., et al. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021.
Liu, Z., et al. (2021). Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. ICCV 2021.
Audio:
Baevski, A., et al. (2020). wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. NeurIPS 2020.
Music:
Huang, C. A., et al. (2019). Music Transformer. ICLR 2019.
Dhariwal, P., et al. (2020). Jukebox: A Generative Model for Music. arXiv preprint.
Genomics/Proteomics:
Ji, Y., et al. (2021). DNABERT: Pre-trained Bidirectional Encoder Representations from Transformers for DNA-language in genome. Bioinformatics, 37(15).
Elnaggar, A., et al. (2020). ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing. arXiv preprint.
Jumper, J., et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873).
Time-Series:
Lim, B., et al. (2019). Temporal Fusion Transformers for interpretable multi-horizon time series forecasting. arXiv preprint.
Graphs:
Zhang, Z., et al. (2020). Graph-BERT: Only Attention is Needed for Learning Graph Representations. arXiv preprint.
Ying, C., et al. (2021). Do Transformers Really Perform Badly for Graph Representation? NeurIPS 2021.
Code:
Chen, M., et al. (2021). Evaluating Large Language Models Trained on Code. arXiv preprint.
Multimodal:
Radford, A., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML 2021.
8. Closing Thoughts
Transformers have shown remarkable adaptability because the concept of “tokens + attention” is general. As long as we can sensibly represent data as sequences, we can leverage powerful pretrained models and fine-tune them for specialized tasks—be it discovering gene-regulatory mechanisms, forecasting the stock market, generating music in the style of Bach, or classifying images of distant galaxies.
Looking Forward
We’re seeing new research on efficient attention mechanisms (Performer, Linformer, BigBird, etc.) to handle extremely long sequences.
Multimodal “foundation models” like DALL·E, CLIP, and the next generation of large language models are pushing the boundaries of how Transformers handle cross-domain data (images + text + audio + more).
Expect more specialized architectures that integrate domain knowledge (like protein folding constraints, music theory, or graph connectivity) with the raw power of attention-based representations.
Further Reading
“Attention Is All You Need” (Vaswani et al., 2017) remains the foundational text for understanding Transformers.
“Transformers for Biological Sequences”: see DNABERT, ProtBert.
“Vision Transformer” (Dosovitskiy et al., 2021) for how patch-based tokenization opened a new era in computer vision.

