# Language modelling requires sequence input data: given a sequence of words/tokens, the aim is to predict the next word/token.
#
# The next step is tokenization. Tokenization is the process of extracting tokens (terms/words) from a corpus. Keras provides a built-in Tokenizer class that can be used to obtain the tokens and their indices in the corpus. After this step, every text document in the dataset is converted into a sequence of tokens.
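#
# A minimal sketch of this step using Keras's Tokenizer; the toy `corpus` below
# is a hypothetical stand-in for the actual dataset.

from tensorflow.keras.preprocessing.text import Tokenizer

# Hypothetical toy corpus standing in for the real dataset.
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)       # build the word -> index vocabulary
print(tokenizer.word_index)          # {'the': 1, 'cat': 2, 'sat': 3, ...}

# Convert each document into a sequence of token indices.
sequences = tokenizer.texts_to_sequences(corpus)
print(sequences)                     # [[1, 2, 3, 4, 1, 5], [1, 2, 6, 1, 7]]

# Note: word_index assigns indices starting at 1, ordered by frequency;
# index 0 is reserved (e.g. for padding), which matters later when sizing
# the embedding layer's vocabulary.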