With the current LLM wave, which I find fascinating, I've begun reading about the basics of these models. I have a draft or two of posts about small experiments that replicate tiny pieces of such systems (text processing sparks my curiosity, though I'm not sure why). That reminded me that not long ago I wrote a simple Markov model: a Markov chain that generates variants of the sentences found in a text.
It reads a .txt file line by line, assuming each line is a sentence, and fills a Python dictionary with the "chain" of words that forms each sentence (plus the start-of-sentence and end-of-sentence delimiters).
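A minimal sketch of how such a dictionary could be built. This is not the code from the repository: the `build_chain` name and the `START`/`END` marker strings are assumptions made for illustration, and the sketch takes any iterable of lines (an open file works too).

```python
from collections import defaultdict

# Hypothetical delimiter tokens; the original script may use different markers.
START, END = "<s>", "</s>"


def build_chain(lines):
    """Map each word to the list of words seen immediately after it."""
    chain = defaultdict(list)
    for line in lines:
        words = line.split()
        if not words:
            continue  # skip blank lines
        tokens = [START] + words + [END]
        # Record every (word, next word) pair, duplicates included.
        for current, following in zip(tokens, tokens[1:]):
            chain[current].append(following)
    return dict(chain)


sample = [
    "Be the change you want to see in this world",
    "Be the person your dog thinks you are",
    "Everything a person can imagine, others will do",
]
chain = build_chain(sample)
print(chain["the"])     # ['change', 'person']
print(chain["person"])  # ['your', 'can']
```

Keeping duplicates in each list is a deliberate choice in this sketch: a follower that was seen more often is more likely to be picked later by a uniform random choice.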
Take a few sentences from my sample file:
Be the change you want to see in this world
Be the person your dog thinks you are
Everything a person can imagine, others will do
When it finishes reading, it knows it can begin a sentence with "Be" or "Everything". If it (randomly) chooses "Be", it has only ever seen the word "the" after it, so "the" must follow. The third word can be either "change" or "person". If it chooses "person", the next word can be either "your" or "can". And so it goes until it either picks an end-of-sentence delimiter or reaches the maximum number of words per sentence we've set up.
It could generate the sentence "Be the person can imagine, others will do .". That one is ungrammatical, but the larger the input text you feed it, the greater the variety and the better the chance of generating something that makes sense.
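The random walk described above could look roughly like the following. It is a sketch under assumptions, not the code from the repository: `build_chain`, `generate`, the `START`/`END` markers, and the `max_words` and `rng` parameters are all hypothetical names chosen for this example.

```python
import random
from collections import defaultdict

# Hypothetical delimiter tokens; the original script may use different markers.
START, END = "<s>", "</s>"


def build_chain(lines):
    """Map each word to the list of words seen immediately after it."""
    chain = defaultdict(list)
    for line in lines:
        words = line.split()
        if not words:
            continue
        tokens = [START] + words + [END]
        for current, following in zip(tokens, tokens[1:]):
            chain[current].append(following)
    return dict(chain)


def generate(chain, max_words=20, rng=random):
    """Walk the chain from START until END is picked or max_words is reached."""
    words = []
    current = START
    while len(words) < max_words:
        followers = chain.get(current)
        if not followers:
            break  # empty chain or a word with no recorded follower
        current = rng.choice(followers)
        if current == END:
            break
        words.append(current)
    return " ".join(words)


sample = [
    "Be the change you want to see in this world",
    "Be the person your dog thinks you are",
    "Everything a person can imagine, others will do",
]
chain = build_chain(sample)
for _ in range(3):
    print(generate(chain, max_words=12))
```

Each step only looks at the current word, which is why a sentence can switch from one source sentence to another at any shared word such as "person".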
As an example, running it a few times on the quotes file sometimes produces funny philosopher-style quotes:
Experience is easy .
Don’t teach them like a professional is right not improving .
Be the worst .
He who seek the things will never have to control complexity not a shorter letter .
Spend your own happiness and go home .
Boy Scout Rule Always leave the happiness and go home .
Learn the best and distribute the rules like a priority .
Work hard and practice something you’ve never have written a marvellous thing that you don't feel like doing them to ...
He who thinks you want to think .
Try to create it wrong because nobody sees it again .
I've also included a transcript of a TED talk, which again mostly produces gibberish, though at times the output almost looks correct.
Not exactly groundbreaking, but fun, and illustrative of a basic "brute-force" way of creating new text.
You can find the Python code on my GitHub.