Generating Markov Chains¶
A Markov chain is a mathematical system that transitions from one state to another within a finite set of states. It is characterized by the memoryless property, which means that the next state depends only on the current state and not on the sequence of events that preceded it.
https://simple.wikipedia.org/wiki/Markov_chain
Generating Text with Markov Chains¶
One fun application of Markov chains is to generate text that sounds like it was written by a particular author. If we have lots of text written by someone with a distinct voice and vocabulary, we can use it to calculate probabilities that one word will follow the next, then we can use those probabilities to "autocomple" next words.
Here's how it works:
States: The words in the text are treated as states.
Transitions: The transitions between states are the occurrences of words following each other in the text. Each word has a probability of being followed by other words, based on how often that transition occurs in the corpus.
Transition Probabilities: By analyzing the text, we calculate the probabilities of each word following another word. This is done by counting how many times each word follows another and then converting these counts into probabilities.
Text Generation: Starting with an initial word (state), the Markov chain uses the transition probabilities to randomly choose the next word.
This process is repeated to generate a sequence of words, creating a new piece of text that mimics the style and structure of the original corpus.
Example:
Here's a very simple example using simple python that you should be able to read and recognize.
import random
# Step 1: Initialize a list of words
word_list = [
"the", "cat", "in", "the", "hat", "the", "cat", "on", "the", "mat",
"the", "dog", "and", "the", "cat", "the", "cat", "and", "the", "hat"
]
# Step 2: Calculate the transition probabilities
transitions = {}
# Populate the transitions dictionary with counts
for i in range(len(word_list) - 1):
current_word = word_list[i]
next_word = word_list[i + 1]
if current_word not in transitions:
transitions[current_word] = {}
if next_word not in transitions[current_word]:
transitions[current_word][next_word] = 0
transitions[current_word][next_word] += 1
# Convert counts to probabilities
for current_word, next_words in transitions.items():
total = sum(next_words.values())
for next_word in next_words:
transitions[current_word][next_word] /= total
# Step 3: Implement the function to generate the next word
def next_word(current_word):
if current_word not in transitions:
return random.choice(list(transitions.keys()))
probabilities = transitions[current_word]
next_words = list(probabilities.keys())
probabilities = list(prbabilities.values())
return random.choices(next_words, probabilities)[0]
# Step 4: Generate a sequence of words
def generate_sequence(start_word, length):
sequence = [start_word]
current_word = start_word
for _ in range(length - 1):
current_word = next_word(current_word)
sequence.append(current_word)
return sequence
# Example usage
start_word = "the"
length = 20
sequence = generate_sequence(start_word, length)
print("Generated sequence:", sequence)
Assignment¶
Copy and paste the above code into a code block and test it out. Read it and try and understand what's happening.
Replace the
word_list
with one you'll create on your own.Load the
songs.csv
dataset from the shared folder.from google.colab import drive import pandas as pd drive.mount('/content/gdrive') df = pd.read_csv('/content/gdrive/My Drive/datasets/songs.csv')
This dataset is a collection of songs and lyrics by different artists.
Filter the dataset down to a single artist and extract all of the text into a list of words.
Here's an example of filtering the dataset to only taylor swift.
taylor_df = df[df["Artist"] == "Taylor Swift"] taylor_df.head()
And here's an example of extracting her lyrics into one long
word_list
.taylor_words = taylor_df["Lyrics"].str.cat(sep=" ").split()
Use the above code, dataset, and your own modifications to generate a song written by the artist of your choice from this dataset.
import random
# Step 1: Initialize a list of words
word_list = [
"the", "cat", "in", "the", "hat", "the", "cat", "on", "the", "mat",
"the", "dog", "and", "the", "cat", "the", "cat", "and", "the", "hat"
]
# Step 2: Calculate the transition probabilities
transitions = {}
# Populate the transitions dictionary with counts
for i in range(len(word_list) - 1):
current_word = word_list[i]
next_word = word_list[i + 1]
if current_word not in transitions:
transitions[current_word] = {}
if next_word not in transitions[current_word]:
transitions[current_word][next_word] = 0
transitions[current_word][next_word] += 1
# Convert counts to probabilities
for current_word, next_words in transitions.items():
total = sum(next_words.values())
for next_word in next_words:
transitions[current_word][next_word] /= total
# Step 3: Implement the function to generate the next word
def next_word(current_word):
if current_word not in transitions:
return random.choice(list(transitions.keys()))
probabilities = transitions[current_word]
next_words = list(probabilities.keys())
probabilities = list(probabilities.values())
return random.choices(next_words, probabilities)[0]
# Step 4: Generate a sequence of words
def generate_sequence(start_word, length):
sequence = [start_word]
current_word = start_word
for _ in range(length - 1):
current_word = next_word(current_word)
sequence.append(current_word)
return sequence
# Example usage
start_word = "the"
length = 20
sequence = generate_sequence(start_word, length)
print("Generated sequence:", sequence)
Generated sequence: ['the', 'cat', 'and', 'the', 'cat', 'on', 'the', 'hat', 'the', 'cat', 'in', 'the', 'dog', 'and', 'the', 'cat', 'the', 'cat', 'and', 'the']
from google.colab import drive
import pandas as pd
#drive.mount('/content/gdrive')
#df = pd.read_csv('/content/gdrive/My Drive/datasets/songs.csv')
taylor_df = df[df["Artist"] == "Taylor Swift"]
all_word_list = df["Lyrics"].str.cat(sep=" ").split()
all_word_list += taylor_df["Lyrics"].str.cat(sep=" ").split()
all_word_list += taylor_df["Lyrics"].str.cat(sep=" ").split()
all_word_list += taylor_df["Lyrics"].str.cat(sep=" ").split()
transitions = {}
for i in range(len(word_list) - 1):
current_word = word_list[i]
next_word = word_list[i + 1]
if current_word not in transitions:
transitions[current_word] = {}
if next_word not in transitions[current_word]:
transitions[current_word][next_word] = 0
transitions[current_word][next_word] += 1
# Populate the transitions dictionary with counts
for i in range(len(word_list) - 1):
current_word = word_list[i]
next_word = word_list[i + 1]
if current_word not in transitions:
transitions[current_word] = {}
if next_word not in transitions[current_word]:
transitions[current_word][next_word] = 0
transitions[current_word][next_word] += 1
# Convert counts to probabilities
for current_word, next_words in transitions.items():
total = sum(next_words.values())
for next_word in next_words:
transitions[current_word][next_word] /= total
# Step 3: Implement the function to generate the next word
def next_word(current_word):
if current_word not in transitions:
return random.choice(list(transitions.keys()))
probabilities = transitions[current_word]
next_words = list(probabilities.keys())
probabilities = list(probabilities.values())
return random.choices(next_words, probabilities)[0]
# Step 4: Generate a sequence of words
def generate_sequence(start_word, length):
sequence = [start_word]
current_word = start_word
for _ in range(length - 1):
current_word = next_word(current_word)
sequence.append(current_word)
return sequence
# Example usage
start_word = "When"
length = 100
sequence = generate_sequence(start_word, length)
song = " ".join(sequence)
print("Generated sequence:", song)
Generated sequence: When you three This ain't my heart warm If your brother as hard time 'Cause I can) Wondering if you're the rest of his voice is me you it what you've seen53EmbedShare URLCopyEmbedCopy You like I dreamed of us, I was lost in breakable heaven But the day I just made me My smoking gun was just get just so many signs You started in my whole scene You hold on home on your goodbye And you'd call?) (Back when it's all the back as fast as fast as hard As long time you love the well, yeah Call it
df["Artist"].unique()
array(['Taylor Swift', 'Billie Eilish', 'The Beatles', 'David Bowie', 'Billy Joel', 'Ed Sheeran', 'Eric Clapton', 'Bruce Springsteen', 'Vance Joy', 'Lana Del Rey', 'Bryan Adams', 'Leonard Cohen', 'Nat King Cole', '\u200btwenty one pilots', 'Ray LaMontagne', 'Bob Dylan', 'John Denver', 'Frank Sinatra', 'Queen', 'Elton John', 'George Michael'], dtype=object)