
If you have just started working with the babyGPT module, run the scripts in the
Examples directory in the order shown below:


    1.    run_gatherer.py

            Before you run this script, check the URLs that are specified in the list

                       urls

            near the top of the script.  For the very first run, you might want to
            have just a single URL in that list to make sure that no network issues
            (caused by, say, a firewall) would keep you from downloading the
            articles.

            By default, the downloaded articles are saved in the following directory

                      saved_articles_dir

            The Examples directory also contains the following shell script

                      total_chars.sh

            Execute this script inside the directory "saved_articles_dir" to see the
            total number of characters in all of the text data you downloaded.
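
            As a rough cross-check on that count, here is a minimal Python sketch of
            the same idea.  It is not the contents of total_chars.sh; the function
            name, the "*.txt" file pattern, and the UTF-8 encoding are assumptions
            made only for illustration.

```python
from pathlib import Path


def total_chars(directory, pattern="*.txt"):
    """Return the total number of characters in all matching files under directory.

    Assumption: the articles are plain-text files that decode as UTF-8.
    """
    total = 0
    for path in sorted(Path(directory).rglob(pattern)):
        if path.is_file():
            # errors="ignore" drops any bytes that do not decode as UTF-8
            total += len(path.read_text(encoding="utf-8", errors="ignore"))
    return total
```

            Calling total_chars("saved_articles_dir") from the Examples directory
            would then report the character count for the downloaded corpus.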



    2.    train_tokenizer.py

            If the text corpus you have collected is for a specialized domain (such
            as movies, sports, healthcare, etc.), you are likely to get better
            results from babyGPT if you first train a tokenizer for that domain using
            this script.

            Note that the module comes with a pretrained tokenizer with a vocab size
            of around 50,000 tokens.  I trained this tokenizer using the babyGPT
            module on the athlete news dataset created by Adrien Dubois. The name of
            the tokenizer JSON in the Examples directory is:

                            109_babygpt_tokenizer_49275.json


    3.    apply_tokenizer.py

            If you have created a new JSON file for the tokenizer, use this script
            to test the tokenizer on a small text file.  To get started, try it out
            with the following command line:

               python3  apply_tokenizer.py   text_sample_for_testing.txt   111_babygpt_tokenizer_49270.json

            where the sample file "text_sample_for_testing.txt" should already be in
            the Examples directory of the distribution and where the last argument
            is the tokenizer JSON you are testing.




    4.    extend_previously_trained_tokenizer.py

            You need to run this script only if you wish to extend a previously trained
            tokenizer with a larger target vocabulary.

            Pay attention to the call syntax for this script since it expects command-line
            arguments.  Here is an example:

              python3   extend_previously_trained_tokenizer.py   tokenizer_outputs/111_babygpt_tokenizer_20025.json    30000          

            which says you want to extend the tokenizer JSON named in the first
            argument to a new target vocab size of 30000.
          


    5.    create_base_model_with_buffered_context.py

            This script trains a model on your corpus in an unsupervised manner
            through next-token prediction.

            This script hangs on to the last few tokens in each instance from the
            previous batch to provide a context for the first token in the
            corresponding instance in the new batch.

            When you train a model using the "create_" script, a checkpoint is
            automatically saved every 10,000 training iterations.  The checkpoints
            are saved in the directory

                        checkpoint_dir
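
            The buffering of context described for this script can be pictured with
            the sketch below.  It is only an illustration on plain Python lists of
            token IDs and is not taken from the script; the function name and the
            "num_context_tokens" parameter are hypothetical.

```python
def prepend_buffered_context(prev_batch, new_batch, num_context_tokens):
    """Prefix each instance in new_batch with the tail of its counterpart in prev_batch.

    prev_batch and new_batch are lists of token-ID lists, one list per instance.
    Pass prev_batch=None for the very first batch, which has no earlier context.
    """
    if prev_batch is None or num_context_tokens <= 0:
        return [list(seq) for seq in new_batch]
    if len(prev_batch) != len(new_batch):
        raise ValueError("both batches must contain the same number of instances")
    # The last few tokens of instance i in the previous batch become the
    # context for the first token of instance i in the new batch.
    return [
        list(prev[-num_context_tokens:]) + list(new)
        for prev, new in zip(prev_batch, new_batch)
    ]
```

            With a buffer of two tokens, an instance that ended in [..., 3, 4] in the
            previous batch would contribute [3, 4] in front of the corresponding
            instance in the new batch.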



    6.    interact_with_prompts.py

            This is the script for interacting with a trained babyGPT model through
            prompts.  The idea is that you supply a small number of words (as, say,
            the beginning of a new thought) as a prompt and the model supplies the
            rest of the words to complete the thought.  At this time, the model
            extends your prompt until it reaches a period (or the end dictated by the
            size of the "max_seq_length" parameter).
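
            The stopping rule just described can be sketched as a simple loop.  This
            is a hedged illustration, not the logic of the script itself: the
            function "complete_prompt" and the callable "next_token_fn" (a stand-in
            for the trained model) are hypothetical, and treating "max_seq_length"
            as a cap on prompt plus generated tokens is an assumption.

```python
def complete_prompt(prompt_tokens, next_token_fn, max_seq_length, stop_token="."):
    """Extend prompt_tokens until stop_token is produced or max_seq_length is reached.

    next_token_fn is a placeholder for the model: it takes the tokens so far
    and returns the next token.
    """
    tokens = list(prompt_tokens)
    while len(tokens) < max_seq_length:
        token = next_token_fn(tokens)
        tokens.append(token)
        if token == stop_token:
            # A period ends the completed thought.
            break
    return tokens
```

            Any function that maps the tokens seen so far to a next token can be
            passed in as "next_token_fn" to try out the loop.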




