01 / task
Pick a task
Before any code, the most important question: what is the kid going to learn to do? We picked the simplest task that still produces something interesting:
Given some text, predict the next character.
That's it. No grammar rules. No dictionary. Just — here are some letters; what comes next? It turns out that if you get very good at this one task, on enough text, you accidentally learn an enormous amount about language along the way.
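To make that concrete, here is a tiny pure-Python sketch (the string is illustrative, not from the training data): every position in a piece of text is one training example, with the characters so far as the input and the very next character as the target.

```python
# Every position in a string gives one training example:
# the characters so far -> the next character.
s = "predict"
examples = [(s[:i], s[i]) for i in range(1, len(s))]
for ctx, nxt in examples:
    print(f"{ctx!r} -> {nxt!r}")
# 'p' -> 'r', 'pr' -> 'e', 'pre' -> 'd', ...
```

One seven-character word already yields six examples; a million characters of text yields roughly a million.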
Step 1: get some text
We use the same dataset Andrej Karpathy uses in his classic char-rnn tutorial: a 1.1 MB plaintext file containing the complete works of William Shakespeare, 1,115,394 characters in all. Here's the first chunk of it:
import os
import urllib.request

URL = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
if not os.path.exists("input.txt"):
    print("Downloading tinyshakespeare...")
    urllib.request.urlretrieve(URL, "input.txt")
text = open("input.txt").read()

Step 2: turn it into (input, target) pairs
To train the kid to predict the next character, we need lots of examples of "here's some text → here's what came next." The trick is: we don't need to label anything by hand. The text labels itself. The target is just the input shifted by one character.
def get_batch(split):
    """Sample BATCH_SIZE random chunks of length BLOCK_SIZE.
    x is the input, y is x shifted by one (the next-char target)."""
    d = train_data if split == "train" else val_data
    ix = torch.randint(len(d) - BLOCK_SIZE - 1, (BATCH_SIZE,))
    x = torch.stack([d[i:i + BLOCK_SIZE] for i in ix])
    y = torch.stack([d[i + 1:i + BLOCK_SIZE + 1] for i in ix])
    return x.to(DEVICE), y.to(DEVICE)
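get_batch assumes train_data and val_data already exist. Here is a minimal sketch of how they might be built; the character vocabulary and the 90/10 split are assumptions for illustration, not taken from the snippet above, and the real pipeline would wrap the integer lists in torch.tensor:

```python
text = "First Citizen: Before we proceed any further, hear me speak."

# Build a character vocabulary and a reversible char<->int mapping.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> int
itos = {i: ch for ch, i in stoi.items()}       # int -> char
encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

data = encode(text)              # the whole corpus as integers
n = int(0.9 * len(data))         # hold out the last 10% for validation
train_data, val_data = data[:n], data[n:]

print(decode(encode("hear me")))   # round-trips back to 'hear me'
```

Holding out a contiguous tail of the text (rather than random characters) keeps the validation set from overlapping the chunks sampled for training.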