Project Requirements
The peer-reviewed project will include five major sections, with relevant sub-sections to organize your work using the CGScholar structure tool.
BUT! Please don’t use these boilerplate headings. Make them specific to your chosen topic, for instance: “Introduction: Addressing the Challenge of Learner Differences”; “The Theory of Differentiated Instruction”; “Lessons from the Research: Differentiated Instruction in Practice”; “Analyzing the Future of Differentiated Instruction in the Era of Artificial Intelligence”; “Conclusions: Challenges and Prospects for Differentiated Instruction.”
Include a publishable title, an Abstract, Keywords, and Work Icon (About this Work => Info => Title/Work Icon/Abstract/Keywords).
Overall Project Word Length – At least 3,500 words (the concentration of words should be on theory/concepts and educational practice)
Part 1: Introduction/Background
Introduce your topic. Why is this topic important? What are the main dimensions of the topic? Where in the research literature and other sources do you need to go to address this topic?
Part 2: Educational Theory/Concepts
What is the educational theory that addresses your topic? Who are the main writers or advocates? Who are their critics, and what do they say?
Your work must be in the form of an exegesis of the relevant scholarly literature that addresses and cites at least 6 scholarly sources (peer-reviewed journal articles or scholarly books).
Media: Include at least 7 media elements, such as images, diagrams, infographics, tables, embedded videos (either uploaded into CGScholar or embedded from other sites), web links, PDFs, datasets, or other digital media. Be sure these are well integrated into your work. Explain or discuss each media item in the text of your work. If a video is more than a few minutes long, you should refer to specific points with time codes or the particular aspects of the media object that you want your readers to focus on. Caption each item sourced from the web with a link. You don’t need to include media in the references list – this should be mainly for formal publications such as peer-reviewed journal articles and scholarly monographs.
Part 3: Educational Practice Exegesis
You will present an educational practice example, or an ensemble of practices, as applied in clearly specified learning contexts. This could be a practice in which you have been involved and on which you reflect, one you have read about in the scholarly literature, or a new or unfamiliar practice which you would like to explore. While not as detailed as in the Educational Theory section of your work, this section should be supported by scholarly sources. There is no set minimum number of scholarly sources for this section; six more scholarly sources in addition to those used for Part 2 is a reasonable target.
This section should include the following elements:
Articulate the purpose of the practice. What problem were they trying to solve, if any? What were the implementers or researchers hoping to achieve and/or learn from implementing this practice?
Provide detailed context of the educational practice applications – what, who, when, where, etc.
Describe the findings or outcomes of the implementation. What occurred? What were the impacts? What were the conclusions?
Part 4: Analysis/Discussion
Connect the practice to the theory. How does the practice that you have analyzed in this section of your work connect with the theory that you analyzed in the previous section? Does the practice fulfill the promise of the theory? What are its limitations? What are its unrealized potentials? What is your overall interpretation of your selected topic? What do the critics say about the concept and its theory, and what are the possible rebuttals of their arguments? Are its ideals and purposes easy, hard, or too hard to realize? What does the research say? What would you recommend as a way forward? What needs more thinking in the theory and research of practice?
Part 5: References (as a part of and subset of the main References Section at the end of the full work)
Include citations for all media and other curated content throughout the work (below each image and video)
Include a references section of all sources and media used throughout the work, differentiated between your Learning Module-specific content and your literature review sources.
Include a References “element” or section using APA 7th edition with at least 10 scholarly sources and media sources that you have used and referred to in the text.
Be sure to follow APA guidelines, including sentence-case article titles, journal titles with the first letter of each major word capitalized, and italicized journal titles and volume numbers.
The explosion of Artificial Intelligence (AI) tools into the public sphere has sparked innumerable discussions and debates, especially within the corridors, offices, and classrooms of higher education. Much of the discourse has ranged from uncritical optimism to apocalyptic warnings of the end of education. Then there are those who seek solace in the fallacy of the golden mean, arguing that the solution to the AI problem is to be found somewhere in the middle of these two extremes.
Recently, a colleague started a petition to ban the use of AI tools, specifically ChatGPT, in the grading of students. He expressed concerns about the overreliance on generative AI, specifically Large Language Models (LLMs), in evaluating student writing. He does a good job of highlighting what he thinks are tasks AI can do, such as creating a logo, and tasks that AI cannot reliably perform, like grading essays. He shared an experiment he conducted, in which he had AI generate an essay and evaluate it multiple times, yielding inconsistent grades (ranging from 86 to 94). He further argued that this inconsistency demonstrates AI's unreliability for grading. He went on to contend that relying on AI for grading could lead to student disengagement, as students themselves would realize how arbitrary AI evaluations are. He criticized claims of grading efficiency through automation, suggesting it ultimately wastes time, especially when AI results must be checked for consistency. Instead, he advocated investing in human evaluators. To be fair, my colleague does support using AI as a writing aid for students, but he firmly opposes using it for grading, emphasizing that grading requires human discernment, not generation.
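His experiment is also easy to replicate. The sketch below is a minimal Python version of the repeated-grading test he described, assuming the OpenAI client library; the model name, rubric, and essay text are placeholders rather than his actual materials:

    # Grade the same essay several times and inspect the spread of scores.
    from statistics import mean, pstdev
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    ESSAY = "(student essay text here)"
    PROMPT = (
        "Grade the following essay on a 0-100 scale using this rubric:\n"
        "Thesis (25), Evidence (25), Organization (25), Mechanics (25).\n"
        "Reply with the numeric total only.\n\nESSAY:\n" + ESSAY
    )

    scores = []
    for _ in range(10):  # ten independent grading runs of the same essay
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": PROMPT}],
        )
        scores.append(float(response.choices[0].message.content.strip()))

    # A spread like 86-94 across identical inputs is the inconsistency at issue.
    print(f"min={min(scores)} max={max(scores)} mean={mean(scores):.1f} sd={pstdev(scores):.1f}")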
Does my colleague have valid points? Are tools like ChatGPT and Claude so unreliable that they should not be used to grade student assignments? Is the trust being extended to these tools unjustifiable? It is difficult to determine from where potential harms will come. Even Floridi and Cowls (2019) warned that “it is not entirely clear whether it is the people developing AI, or the technology itself, which should be encouraged not to do harm; in other words, whether it is Frankenstein or his monster against whose maleficence we should be guarding” (p. 539). Ultimately, are the cons and potential harms outweighed by the pros? These are a few of the questions that guide this work. What follows is an engagement with the inquiries that have been at the heart of our discussion on AI integration.
Many educators simply do not understand the basic components and functions of Artificial Intelligence. The primary factor contributing to this misunderstanding is the nomenclature itself. Since John McCarthy coined the term in 1955, artificial intelligence has been viewed as a rival to human intelligence, differing only in that it is artificial. This view inappropriately frames AI and obscures the real and significant differences that help us define and understand it. Cope, Kalantzis, and Searsmith (2020) point out that a crucial hurdle that needs to be overcome “in defining artificial intelligence, then, is to specify the parameters of artificiality, or the ways in which computers are unlike human intelligence. They are much less than human intelligence - they can only calculate. And they are much more - they can calculate larger numbers and faster than humans” (p. 2). Another limitation of the term “artificial intelligence” is that it does not signify the back-and-forth dynamic that exists between machines and human beings. To highlight and reinforce this dynamic, Cope and Kalantzis (2024) prefer to replace “artificial” with “cyber” – a term taken from Norbert Wiener’s use of the Greek kubernētēs – because it connotes the “human-machine feedback” system that is at the heart of what is being discussed (pp. 18-19). These insights and nuances place AI within its proper frame and prevent us from sliding down unwarranted slippery slopes. Possessing a general understanding of how these tools work also contributes to this prevention.
In “A Jargon-Free Explanation of How AI Large Language Models Work,” Lee and Trott (2023) explain that large language models (LLMs), such as ChatGPT, are trained on massive amounts of text data to predict the next word in a sequence. Unlike traditional software that follows explicit human-written instructions, LLMs learn through pattern recognition using neural networks, resulting in systems that even their creators do not fully understand. Central to the functioning of these tools are vectors: numerical representations that place similar words, like “cat” and “dog,” closer together in a multidimensional space. This approach allows LLMs to perform complex operations, such as analogies, through vector arithmetic (Lee & Trott, 2023). The model’s understanding of words’ meanings is then refined layer by layer by transformers, the neural network architecture behind LLMs (Lee & Trott, 2023).
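A toy example can make the vector arithmetic concrete. The three-dimensional vectors below are invented purely for illustration; real embedding spaces have hundreds or thousands of learned dimensions:

    # The classic analogy: king - man + woman should land nearest to queen.
    import numpy as np

    vectors = {
        "king":  np.array([0.9, 0.8, 0.1]),
        "man":   np.array([0.5, 0.1, 0.1]),
        "woman": np.array([0.5, 0.1, 0.9]),
        "queen": np.array([0.9, 0.8, 0.9]),
        "cat":   np.array([0.1, 0.9, 0.5]),
    }

    def cosine(a, b):
        # Similarity of direction: the standard closeness measure for embeddings.
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    target = vectors["king"] - vectors["man"] + vectors["woman"]
    print(max(vectors, key=lambda w: cosine(vectors[w], target)))  # queen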
Cope and Kalantzis (2020) identify four primary transpositions within a “transpositional grammar” that mediates between meaning and binary calculation (p. 2). Two are central here. The first is nameability: while the naming of things via binary notation does not capture every aspect of the thing named, there is an enormous benefit in the “number of things that can be named, hugely more than natively possible in personal memory or spontaneous speech” (Cope & Kalantzis, 2020, p. 3). Naming is the basis for the second, calculability, which consists “of countable empirical instances, classification by concept, and then analysis by calculation” (Cope & Kalantzis, 2020, p. 7). But it must be maintained that “the processes of transposition of meaning through binary number in artificial intelligence are nothing like human intelligence, and can never be” (Cope & Kalantzis, 2020, p. 3). Despite their impressive capabilities, much about how these models work remains unclear, and they often reflect the biases present in the data they are trained on (Lee & Trott, 2023).
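Before turning to the empirical record, it is worth seeing what this “transposition through binary number” looks like at the smallest scale; a brief Python illustration (the word chosen is arbitrary):

    word = "cat"
    as_bytes = word.encode("utf-8")                   # byte values: 99 97 116
    as_bits = " ".join(f"{b:08b}" for b in as_bytes)  # 01100011 01100001 01110100
    print(list(as_bytes), as_bits)

Everything the machine does with “cat” is calculation over such digits; whatever the word means to a human reader survives only insofar as it can be named and counted.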
Almasri (2024), using the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, examined 74 records and discovered that AI tools were beneficial in the areas of learning context enhancement, formative assessment creation, general student assessment, and predicting student learning outcomes. Most educators would agree that these are areas wherein benefits can be easily identified. Nevertheless, Almasri also pointed out that “issues arose from AI’s limited ability to understand particular subject matter, its inability to adjust to various educational contexts, and the variation in performance between various AI models” (Almasri, 2024, p. 993), along with the ethical issues that accompany the use of AI in education. Karan and Angadi (2023) highlighted the risks and challenges of incorporating AI into K-12 classrooms, identifying six risk areas that can frame any discussion of AI educational integration: privacy and autonomy risks, AI biases, accuracy and functional risks, deep fakes and FATE risks, social knowledge and skill-building risks, and risks regarding teachers’ roles (Karan & Angadi, 2023). These represent some of the more readily apparent risks associated with AI integration into education.
Blodgett and Madaio (2021) raised some less obvious but no less important issues. First, they point out that arguments that integrating these technologies would improve logistical efficiency have fallen flat, given that the technologies have been used primarily to surveil students (Blodgett & Madaio, 2021, pp. 1-2). They further cautioned that “foundation models may reproduce harmful ideologies about what is valuable for students to know and how students should be taught, including ideologies about the legitimacy and appropriateness of minoritized language varieties” (Blodgett & Madaio, 2021, p. 3). Because these tools often do not allow for truly learner-centered or stakeholder-involved approaches, Blodgett and Madaio (2021) warned: “This narrow scope for involvement of key stakeholders such as teachers and students is at odds with participatory, learner-centered paradigms from educational philosophy and the learning sciences, where learners’ interests and needs shape teachers’ choices about what and how to teach. In addition, this may have the effect of further disempowering teachers from having meaningful agency over choices about content or pedagogy, further contributing to the deskilling of the teaching profession, in ways seen in earlier technologies of scale in education” (p. 3). This is especially important for educators and institutions that are committed to equity and prefer a more constructivist approach to education.
Fuller and Bixby (2024) conducted a study driven by three questions: 1) How consistently can ChatGPT, Claude, and Bard grade written assignments using a rubric? 2) How consistently can these systems provide feedback? and 3) How much grading and feedback variance exists on the same assignments across systems? These questions were close to the concerns raised by my colleague. Examining two undergraduate and two graduate writing assignments using an assessment rubric, Fuller and Bixby (2024) found that ChatGPT and Claude produced unreliable and inconsistent grading and feedback results. These were the systems tested by my colleague, and his results were the same. What about the grading performance of these tools in other disciplines?
Kortemeyer, Nöhl, and Onishchuk (2024) assessed the exam responses of 252 physics students, who had to complete four multipart thermodynamics problems. The researchers found that tools like ChatGPT are “useful for assisting in free-form exam grading; they can deal with fuzzy data in a probabilistic manner” (Kortemeyer, Nöhl, & Onishchuk, 2024, p. 19). They also discovered that a “fine-grained grading rubric, applied to a whole problem at a time, leads to frequent bookkeeping errors and failed attempts. Grading the whole problems by a handful of parts, using the full sample solution, turned out to be more reliable, but misses some of the nuanced weightings of a rubric” (Kortemeyer, Nöhl, & Onishchuk, 2024, p. 19). The results here are a bit complex: on the one hand, they support the idea that ChatGPT is unreliable when used to grade on its own; on the other hand, the tool has genuine strengths. Is there evidence that these tools can benefit assessment and grading efforts when compared with humans and other AI systems?
Moore, Bier, and Stamper (2024) compared the effectiveness and reliability of human-based crowdsourcing and the use of large language models (LLMs) in the assessment of multiple-choice questions (MCQs) and short-answer questions (SAQs). One set of crowdworkers were novices (MTurk), and another set possessed at least a bachelor’s degree in the domain associated with their assigned questions (Prolific). Three LLMs were used: GPT-4, Gemini, and Claude. The questions spanned six subject areas, including philosophy, statistics, chemistry, team collaboration, and calculus. A 19-criteria Item-Writing Flaws (IWF) rubric was used by both crowdworkers and LLMs to evaluate the MCQs; a 9-item rubric was used for the SAQs. For the MCQs, the MTurk group had the highest exact match ratio and the lowest Hamming Loss (i.e., the fraction of incorrectly applied labels out of the total number of labels), making it the best method for identifying IWFs. For the SAQs, Prolific had considerably high precision and recall and the lowest Hamming Loss; ChatGPT did, however, have the highest exact match measurement and the highest uniformity across the different categories. The outperformance of MTurk was attributed to the “detailed and less subjective nature of the IWF rubric,” which “likely aided MTurk workers by providing sufficient guidance, despite their varied knowledge levels” (Moore, Bier, & Stamper, 2024, p. 121). The superior performance of Prolific with SAQs was attributed to the crowdworkers possessing relevant background knowledge. Of the LLM-based tools, Gemini (1.5 Pro) and Claude (3 Opus) were the least effective and least efficient. While the crowd surpassed ChatGPT in the application of the IWF rubric overall, ChatGPT did outperform the crowd on the “implausible distractor” criterion (Moore, Bier, & Stamper, 2024, p. 122). This supports the conclusion that “while human input remains crucial in the question quality evaluation process, automated methods could effectively handle specific criteria where their performance is comparable to that of humans” (Moore, Bier, & Stamper, 2024, p. 122). This study adds considerable complexity to our discourse, especially regarding the conditions under which AI tools can benefit grading, particularly when combined with human input.
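Since Hamming Loss carries much of the weight in this comparison, a brief worked example may help; the flaw labels and judgments below are invented for illustration:

    # Hamming Loss: the fraction of labels an evaluator gets wrong.
    def hamming_loss(true_labels, predicted_labels):
        wrong = sum(t != p for t, p in zip(true_labels, predicted_labels))
        return wrong / len(true_labels)

    # Hypothetical judgments on 5 of the 19 Item-Writing Flaw criteria for one
    # MCQ (1 = flaw present, 0 = absent), compared against expert judgments.
    expert    = [1, 0, 0, 1, 0]
    evaluator = [1, 0, 1, 1, 1]  # two of the five labels disagree

    print(hamming_loss(expert, evaluator))  # 0.4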
Along similar lines, Jukiewicz (2023) explored the effectiveness of ChatGPT (gpt-3.5-turbo) in grading Python (3.10) programming assignments. In this study, 67 students from a Cognitive Science program completed a total of 1,579 tasks over the course of 15 weeks. A significant relationship was found between the grades assigned by teachers and the grades assigned by ChatGPT. This is an important finding because it is almost categorically stated that AI-assigned grades are far more unreliable than human-assigned grades; here is evidence that AI-assigned grades are not too far from human-assigned ones. Jukiewicz suggests that the study supports the idea that “AI, like a human, assesses whether a student’s work is ‘correct,’ ‘almost correct’ or ‘incorrect.’ However, there are differences in the point allocations. ChatGPT assigns fewer points, suggesting it may be perceived as a stricter grader” (p. 3). This may be because the rubrics or commands given to the AI tool do not account for all the nuanced meanings of possible student responses. Support for this view can be found in the same study, where the AI tool mislabeled a response as ‘almost correct,’ which was “an example of ChatGPT not always fully understanding the task description it needs to accomplish” (Jukiewicz, 2023, p. 4). This can likely be improved by providing highly detailed and carefully emphasized task descriptions.
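The relationship Jukiewicz reports could be checked on one’s own course data in a few lines; the grade pairs below are invented simply to mimic the reported pattern of a strong correlation with a slightly stricter AI grader:

    # Correlate teacher-assigned and ChatGPT-assigned points on the same tasks.
    from statistics import correlation, mean  # correlation requires Python 3.10+

    teacher = [10, 8, 9, 5, 7, 10, 6, 9]
    chatgpt = [9, 7, 9, 4, 6, 9, 5, 8]  # consistently about a point lower

    r = correlation(teacher, chatgpt)  # Pearson's r
    print(f"r = {r:.2f}; mean gap = {mean(teacher) - mean(chatgpt):.2f} points")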
How effective might AI tools be in assessing more elusive learning outcomes, like communication and collaboration? Tomić, Kijevčanin, Ševarac, and Jovanović (2023) used students from an extracurricular undergraduate Java programming course and a soft-skill assessment rubric to assess collaboration among the course participants. Notably, to assist with the assessment, the researchers created an open-source Java application that was fine-tuned to assess collaboration. The results showed that, among fuzzy rules, neural networks, decision trees, and random forests, fuzzy rules performed best; these make decisions using degrees or approximations of truth rather than more precise, hard-threshold approaches. The researchers found that “the overall good performance of the examined AI methods is, at least partially, due to well defined and explicit collaboration rubric” (Tomić, Kijevčanin, Ševarac, & Jovanović, 2023, p. 300). Optimal performance can be achieved when teachers intervene and check for errors. Furthermore, it was suggested that these approaches were beneficial if teachers “were struggling to allocate enough time to assess and grade each project manually” and “were struggling not to make errors due to the complexity of the projects, software tools, and collaboration assessment procedure” (Tomić, Kijevčanin, Ševarac, & Jovanović, 2023, p. 301).
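A minimal sketch can convey the fuzzy-rule idea, although the researchers’ actual system is an open-source Java application with a far richer rule base; the inputs, membership function, and rule below are invented for illustration:

    # Fuzzy rules map inputs to degrees of truth rather than hard thresholds.
    def high(x):
        # Degree (0 to 1) to which a 0-1 input counts as "high".
        return max(0.0, min(1.0, x))

    def collaboration_score(commit_share, review_participation):
        # Rule: collaboration is high to the degree that BOTH the student's
        # share of commits AND their review participation are high; the fuzzy
        # AND is taken as the minimum of the two membership degrees.
        return min(high(commit_share), high(review_participation))

    print(collaboration_score(0.45, 0.80))  # 0.45, limited by the weaker input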
It is clear that the issue of AI tools grading student coursework is more complex than my colleague and I may initially have been aware of or willing to admit. The issues with using AI are clear, including privacy and autonomy risks, AI biases, accuracy and functional risks, deep fakes and FATE risks, social knowledge and skill-building risks, and risks regarding teachers’ roles (Karan & Angadi, 2023). Specific to my colleague’s point, there is evidence that AI tools like ChatGPT and Claude produce unreliable and inconsistent grading and feedback results (Fuller & Bixby, 2024). Then there is evidence showing that, while there are valid issues with using these AI tools, there are benefits that cannot be ignored or overlooked (Kortemeyer, Nöhl, & Onishchuk, 2024). Moore, Bier, and Stamper (2024) demonstrated that not all generative AI tools are created equal, and there are some instances where certain tools can outperform human grading. Jukiewicz (2023) demonstrated the strength of AI grading in certain areas and how grading can be optimized when coupled with human input and correction. Where does this leave us? A false dichotomy of human vs. AI will not only cause us to overlook the complexity of this issue but will also obscure genuine opportunities.
What happens when these tools are provided with the foundational data that makes them more efficient and reliable at assessing writing assignments? What happens when LLM-based tools are trained to become more proficient and accurate at analyzing artifacts and providing feedback? ChatGPT has proven quite capable of providing valuable feedback when given well-defined rubrics. The rubric assists the AI tool by providing the structures (Cope & Kalantzis, 2020, 13:44) needed for the transposition and construction of meaning:
“Grammatically, digital databases and digital environments operate on this concept/instance distinction. So digital text then is driven by ontologies, which among other things are continually aligning instances and concepts. This is transposition with mechanical support” (Cope & Kalantzis, 2020, 14:00).
This is what happens when tools like ChatGPT are used to analyze student work and provide feedback based on the associated rubrics. I believe that the use of these tools can complement the assessment feedback of educators. The purpose of assessment and grading is “to provide feedback about the learners' progress and achievement of learning outcomes” and “support learners' self-regulation and metacognition” (Oklahoma State University Institute of Teaching and Learning Excellence, 2024, 1:23). It therefore makes sense for AI tools to be used in conjunction with human insight and expertise to provide quality, timely feedback.
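In practice, this might look like the following sketch, in which the rubric supplies the concept/instance structure against which the student’s work is aligned and the educator reviews the output before it reaches the student. The client usage follows the same assumptions as the earlier sketch, and the rubric criteria are placeholders:

    from openai import OpenAI

    client = OpenAI()
    RUBRIC = (
        "Criterion 1 - Thesis clarity (0-5): ...\n"
        "Criterion 2 - Use of evidence (0-5): ...\n"
        "Criterion 3 - Organization (0-5): ...\n"
    )

    def rubric_feedback(student_text):
        # Ask for criterion-by-criterion feedback, not an overall grade: the
        # structure of the rubric, not the model, anchors the evaluation.
        prompt = (
            "For EACH rubric criterion below, quote one passage from the "
            "student's work, explain how it meets or misses the criterion, "
            "and suggest one concrete revision. Do not assign a grade.\n\n"
            "RUBRIC:\n" + RUBRIC + "\nSTUDENT WORK:\n" + student_text
        )
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

The educator then reviews, amends, and delivers the feedback, keeping the human evaluator in the loop.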
The ethical imperatives, however, must not be ignored. It will be important that ethical guardrails are put in place to provide a framework from which decisions can be made. At the heart of this framework need to be the core ethical principles of beneficence, non-maleficence, human autonomy, justice, and explicability (Floridi & Cowls, 2019). Of course, these are just foundational principles that can assist in our discussion of privacy and autonomy risks, AI biases, accuracy and functional risks, deep fakes and FATE risks, social knowledge and skill-building risks, and risks regarding teachers’ roles (Karan & Angadi, 2023).
In “On Cyber-Social Learning,” Cope and Kalantzis (2024) highlighted the etymology of “cyber” as an alternative nomenclature to “artificial.” Norbert Wiener coined the term “cybernetics” from the Greek kubernētēs, which means “steersman” (Cope & Kalantzis, 2024, p. 18). This steering role is the key component of the ideal interaction and relationship: while machines can accomplish extraordinary calculative feats, it is appropriate that they be seen as an extension of human beings. Instead of advocating their ban, as my colleague has done, it might be best to facilitate the proper relation to these tools, which is captured in the “cyber-social arrangement,” wherein it is incumbent upon the steersman to manage the voyage (Cope & Kalantzis, 2024, p. 19).
Almasri, F. (2024). Exploring the impact of artificial intelligence in teaching and learning of science: A systematic review of empirical research. Research in Science Education, 54, 977–997. https://doi.org/10.1007/s11165-024-10176-3
Blodgett, S. L., & Madaio, M. (2021). Risks of AI foundation models in education. arXiv. https://arxiv.org/abs/2110.10024
Cope, B., Kalantzis, M., & Searsmith, D. (2020). Artificial intelligence for education: Knowledge and its assessment in AI-enabled learning ecologies. Educational Philosophy and Theory, 53(12), 1229-1245. https://doi.org/10.1080/00131857.2020.1728732
Cope, B., & Kalantzis, M. (2020, May 26). What's this about? Specification [Video]. Education at Illinois. YouTube. https://www.youtube.com/watch?v=tDPCL9I9FP4
Cope, B., & Kalantzis, M. (2024). On cyber-social learning: A critique of artificial intelligence in education. In D. Kourkoulou, A. O. Tzirides, B. Cope, & M. Kalantzis (Eds.), Trust and inclusion in AI-mediated education. Postdigital Science and Education. Springer. https://doi.org/10.1007/978-3-031-64487-0_1
Floridi, L., & Cowls, J. (2019). A unified framework of five principles for AI in society. Harvard Data Science Review, 1(1). https://doi.org/10.1162/99608f92.8cd550d1
Fuller, L., & Bixby, C. (2024). The theoretical and practical implications of OpenAI system rubric assessment and feedback on higher education written assignments. American Journal of Educational Research, 12(4), 147-158. https://doi.org/10.12691/education-12-4-4
Jukiewicz, M. (2023). The future of grading programming assignments in education: The role of ChatGPT in automating the assessment and feedback process [Preprint]. ResearchGate. https://doi.org/10.13140/RG.2.2.22103.85924
Karan, B., & Angadi, G. R. (2023). Potential risks of artificial intelligence integration into school education: A systematic review. Bulletin of Science, Technology & Society, 43(3-4), 67–85. https://doi.org/10.1177/02704676231162742
Kortemeyer, G., Nöhl, J., & Onishchuk, D. (2024). Grading assistance for a handwritten thermodynamics exam using artificial intelligence: An exploratory study. arXiv. https://doi.org/10.48550/arXiv.2406.17859
Lee, T. B., & Trott, S. (2023, July 31). A jargon-free explanation of how AI large language models work. Ars Technica. https://arstechnica.com/science/2023/07/a-jargon-free-explanation-of-how-ai-large-language-models-work/
Moore, S., Bier, N., & Stamper, J. (2024). Assessing educational quality: Comparative analysis of crowdsourced, expert, and AI-driven rubric applications. Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, 12(1), 115-125. https://doi.org/10.1609/hcomp.v12i1.31606
Oklahoma State University Institute of Teaching and Learning Excellence. (2024, February 12). Assessment in higher education: A history and purpose [Video]. https://video.okstate.edu/media/Assessment+in+Higher+EducationA+History+and+Purpose/1_2y0u0zb9
Rishworth, J. (2024, June 23). AI for teachers: Empowering educators with the power of language models [Image]. LinkedIn. https://www.linkedin.com/pulse/ai-teachers-empowering-educators-power-language-models-jon-rishworth-t12de
Tomić, B. B., Kijevčanin, A. D., Ševarac, Z. V., & Jovanović, J. M. (2023). An AI-based approach for grading students’ collaboration. IEEE Transactions on Learning Technologies, 16(3), 292-305. https://doi.org/10.1109/TLT.2022.3225432
UltraTech Cement. (n.d.). Building foundations for a house [Image]. https://www.ultratechcement.com/for-homebuilders/home-building-explained-single/descriptive-articles/building-foundations-for-a-house