Large Language Models (LLMs) are quite capable of generating a variety of academic materials, but when it comes to question creation, there are some constraints to be addressed. This article identifies the underlying problems and provides a step-by-step guide to building a full-fledged question generator application. It uses ‘Langchain’ to handle the LLM-based application development, ‘Pydantic’ to structure the output, ‘Streamlit’ to design the user interface, and, of course, ‘Python’ to tie everything together.
Table of Contents
- Problems with generating questions with LLMs
- How does QuestGen solve the problem?
- Description of the tools used
- Step by Step Hands-on guide: Create a GenAI Application
Problems with generating questions with LLMs
One of the sectors set to witness a paradigm shift in the coming years is education, thanks to rapidly emerging generative AI. A majority of students and teachers already leverage the capabilities of OpenAI models in day-to-day academic and non-academic activities, but a few problems remain unaddressed.
The first problem is that there isn’t an easy way to provide the context of notes or a book to an LLM, so it is difficult to control the scope of the questions. Secondly, handling prompts for different models is not a trivial task, and figuring out the right prompt for a custom requirement can be tricky.
Third, the output is unstructured, which makes it hard to extract the desired fields and hinders automation, since the output cannot be chained into an LMS (Learning Management System).
How does QuestGen solve the problem?
QuestGen provides an easy-to-use interface where the necessary parameters can be selected without any need to ponder over prompts. It also supports file uploads, which provide a valid scope for question generation. The output is a well-structured pandas DataFrame that can be downloaded as a CSV file.
Description of the tools used
QuestGen uses the Langchain framework to handle most of the procedural steps that are involved in a typical GenAI project. Langchain enables seamless integration with APIs, memory, agents, and data sources.
BaseModel from Pydantic has been utilized to define the structure of the output. BaseModel is a foundational class for data validation and serialization of models; it ensures type safety, automatic validation, and easy conversion of data structures. Streamlit has been used to design the user interface. It is an open-source Python library that makes it easy to build interactive web apps for data science and machine learning with minimal code.
‘Annotated’ from ‘typing’ has been used to attach extra context to the fields of the output. It provides context for type checkers, validators, or frameworks, and is commonly used with libraries like Pydantic for data validation.
Step by Step Hands-on guide: Create a GenAI Application
Step 1. Creating a streamlit page
Streamlit is an open-source Python library for building interactive web applications with minimal code. It is widely used for creating data-driven apps, especially in machine learning, data visualization, and AI applications. This article is itself a testament to the ease with which a web application can be designed with the help of this library.
Output:
The user can load multiple PDF files at a time to provide the context and scope of question creation. The user can decide on the number of questions, inclusion of solution, and also the question type out of the following five options:
- Short Descriptive
- Long Descriptive
- Multiple Choice
- True-False
- Word-Puzzle
Step 2. Loading the relevant file
From the uploaded PDF files, the user can select a particular file to generate questions from. The following snippet of code adds a select box, fetches the selected file from the uploaded files, and stores it as a temporary object to be used further by the LLM.
Output:
Step 3. PyPDFLoader to extract text
PyPDFLoader is a document loader in LangChain that extracts text from PDF files. It loads a PDF, splits it into pages, and converts them into Document objects that LangChain can process. A page-wise split is more useful for semantic search operations, but since here the requirement is to pass the whole chapter or book as a reference to the LLM, the pages have been stitched together with a string-join operation.
Step 4. Initiating model from OpenAI
At the core of QuestGen lies the OpenAI model. Langchain provides a seamless way to initialize and operate the model. For the demonstration, the ‘gpt-4o-mini’ model has been used, but that can easily be changed as per user preference.
The ‘temperature’ parameter controls how deterministic the model is. A value of zero yields a highly deterministic model whose output barely changes across re-runs, while a higher temperature results in more creative, varied outputs.
Step 5. Question-type-based output schema
‘BaseModel’ from ‘Pydantic’ helps in creating a structure for the output of the LLM. Different question types follow different schemas: a long descriptive question has only three fields (Serial Number, Question, and Answer), while a multiple-choice question has eight (Serial Number, Question, Option A, Option B, Option C, Option D, Correct Option, and Explanation).
‘Annotated’ from ‘typing’ helps in annotating the fields of the schema, so that apart from the structure, the LLM also receives field-specific instructions to follow. Here is an example of creating an MCQ schema.
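A sketch of such a schema with the eight MCQ fields described above; the class and field names are illustrative assumptions.

```python
from typing import Annotated, List
from pydantic import BaseModel, Field

class MCQ(BaseModel):
    serial_number: Annotated[int, Field(description="Question number, starting from 1")]
    question: Annotated[str, Field(description="The question, drawn only from the given context")]
    option_a: Annotated[str, Field(description="First option")]
    option_b: Annotated[str, Field(description="Second option")]
    option_c: Annotated[str, Field(description="Third option")]
    option_d: Annotated[str, Field(description="Fourth option")]
    correct_option: Annotated[str, Field(description="One of 'A', 'B', 'C' or 'D'")]
    explanation: Annotated[str, Field(description="Brief justification of the correct option")]

class MCQSet(BaseModel):
    """Container so the model returns all questions in one structured object."""
    questions: List[MCQ]
```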
Step 6. Generating results with prompt
This is the crucial step where the text generation happens. Using the schema created in the previous step and the ‘gpt-4o-mini’ model, a detailed prompt is passed. The prompt interpolates the ‘full_text’ string, created in the third step, inside an f-string; this text provides the context and scope for the model. It also includes the number of questions, which is a user input. The prompt should be straightforward, and it may require some tweaks depending on the model being used.
The ‘result’ variable stores all the generated text, structured as per the schema set in the fifth step. Individual fields can be extracted from it to construct a DataFrame.
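The step can be sketched as below; the prompt wording is a hypothetical example, and `llm`, `MCQSet`, `full_text`, and `num_questions` are assumed from the earlier steps.

```python
def build_prompt(full_text: str, num_questions: int) -> str:
    """Hypothetical prompt; tweak the wording for the model you use."""
    return (
        f"Based strictly on the context below, create {num_questions} "
        "multiple-choice questions, each with four options, the correct "
        "option, and a short explanation.\n\n"
        f"Context:\n{full_text}"
    )

# structured_llm = llm.with_structured_output(MCQSet)  # llm from Step 4, MCQSet from Step 5
# result = structured_llm.invoke(build_prompt(full_text, num_questions))
```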
Step 7. Constructing dataframe for output
The result is generated irrespective of whether ‘Include Answer’ is toggled on or off. Depending on the toggle state, the answer and explanation columns can be included in or excluded from the DataFrame.
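A sketch of that construction, assuming the `result` object from Step 6 with the MCQ schema fields; the function name is an assumption.

```python
import pandas as pd

def to_dataframe(result, include_answer: bool) -> pd.DataFrame:
    """Build a DataFrame from the structured result, dropping the
    answer columns when the 'Include Answer' toggle is off."""
    df = pd.DataFrame([q.model_dump() for q in result.questions])
    if not include_answer:
        df = df.drop(columns=["correct_option", "explanation"], errors="ignore")
    return df

# st.dataframe(to_dataframe(result, include_answer))  # render in the Streamlit page
```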
Output (include_answer : Off):
Output (include_answer : On):
Note that overflowing text can be viewed by double-clicking a cell, and the table can be downloaded with one click using the button at the top-right corner of the table.
Step 8. Some creative question types
With a bit of manipulation, the power of LLMs can be harnessed to create some very interesting question types, e.g. word-cue puzzles, word scrambles, matching questions, etc. For example, to create a word-cue puzzle, a list of questions with one-word answers can be generated with the following code:
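One way to sketch this is a dedicated schema for one-word answers, reusing the structured-output pattern from Steps 5 and 6; the class names and prompt are illustrative assumptions.

```python
from typing import Annotated, List
from pydantic import BaseModel, Field

class WordCue(BaseModel):
    question: Annotated[str, Field(description="A clue whose answer is a single word from the context")]
    answer: Annotated[str, Field(description="The one-word answer, letters only")]

class WordCueSet(BaseModel):
    questions: List[WordCue]

# result = llm.with_structured_output(WordCueSet).invoke(  # llm from Step 4
#     f"From the context below, write {num_questions} clues with one-word answers.\n\n{full_text}"
# )
```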
Next, by replacing some random characters of each answer with ‘_’, puzzle cues can be generated. A simple transformation function applied to the pandas Series of answers does the job.
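Such a transformation might look like this; the function name and the `keep` parameter (how many characters stay visible) are assumptions.

```python
import random

def mask_word(word: str, keep: int = 2) -> str:
    """Replace all but `keep` randomly chosen characters with '_'."""
    if len(word) <= keep:
        return word
    visible = set(random.sample(range(len(word)), keep))
    return "".join(ch if i in visible else "_" for i, ch in enumerate(word))

# df["Cue"] = df["answer"].apply(mask_word)  # 'answer' column from the word-cue schema
```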
Output (include_answer : On)
Final Words
Though LLMs have some constraints today, pairing them with the right tools can do wonders in any domain, and education is no exception. Apart from the illustrations in this article, numerous other possibilities can be explored. Our creativity, combined with the resourcefulness of LLMs, can achieve marvelous creations and unconventional automations.
Link to codes:
GitHub Repository: https://github.com/abhishekmishra2903/QuestGen.git