Kappa Fleiss Analysis: Evidence Of Content Validity for Formative Assessment Literacy Test for Teachers of Fundamental Subjects in Vocational Special School/ Analisis Fleiss Kappa: Evidens Kesahan Kandungan Ujian Literasi Pentaksiran Formatif Guru-Guru Subjek Asas Sekolah Menengah Pendidikan Khas Vokasional


  • Ibnatul Jalilah Yusof School of Education in Measurement and Evaluation, Faculty of Social Sciences and Humanities, Universiti Teknologi Malaysia, 81310 UTM Johor Bahru, Johor, Malaysia
  • Fathin Edora Abdul Rahim School of Education in Measurement and Evaluation, Faculty of Social Sciences and Humanities, Universiti Teknologi Malaysia, 81310 UTM Johor Bahru, Johor, Malaysia




Formative assessment test, formative assessment, expert validity, Fleiss Kappa analysis


The Formative Assessment Literacy Test (FAT) is a new instrument designed to identify and assess the formative assessment literacy (FAL) of fundamental subject teachers in Vocational Special Schools. Determining teachers’ level of FAL requires a measurement tool whose content validity has been established through expert review. The purpose of this study is to validate the FAT instrument through expert validation and an analysis of expert agreement using Fleiss’ kappa. Three experts with more than ten years of experience in the field of measurement and evaluation in education, as well as one expert from the field of Special Education, were chosen to verify the FAT content. Overall, agreement on the FAT content is rated as excellent, with a kappa (k) value of 0.84. For the concept and principle aspects of the knowledge component and the method aspect of the skill component, k = 0.81, 0.81, and 0.89, respectively. Changes and alterations were made in response to the experts’ advice and suggestions. In conclusion, based on expert evaluation, the FAT has excellent content validity and may be used to detect and measure teachers’ knowledge and abilities in formative assessment.
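As an illustration of the agreement statistic named in the abstract, the following is a minimal sketch of how Fleiss' kappa can be computed for a fixed panel of raters. The function name `fleiss_kappa`, the two rating categories, and the four-item data set are hypothetical and chosen only for illustration. They are not the study's rating scale, its data, or its software.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels (one per rater).

    Assumes every item is rated by the same number of raters (at least two).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    # Count how many raters assigned each category to each item.
    counts = [Counter(item) for item in ratings]
    categories = sorted({label for item in ratings for label in item})

    # Observed agreement: proportion of agreeing rater pairs per item, averaged.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions.
    proportions = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_chance = sum(p ** 2 for p in proportions)

    if p_chance == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical ratings: four experts judging four test items (not study data).
hypothetical_ratings = [
    ["relevant", "relevant", "relevant", "relevant"],
    ["relevant", "relevant", "relevant", "relevant"],
    ["not relevant", "not relevant", "not relevant", "not relevant"],
    ["relevant", "relevant", "relevant", "not relevant"],
]

print(round(fleiss_kappa(hypothetical_ratings), 2))
```

In this sketch, a value close to 1 indicates that the raters agree far more often than would be expected by chance, while a value near 0 indicates agreement no better than chance.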






How to Cite

Yusof, I. J., & Abdul Rahim, F. E. (2024). Kappa Fleiss Analysis: Evidence Of Content Validity for Formative Assessment Literacy Test for Teachers of Fundamental Subjects in Vocational Special School/ Analisis Fleiss Kappa: Evidens Kesahan Kandungan Ujian Literasi Pentaksiran Formatif Guru-Guru Subjek Asas Sekolah Menengah Pendidikan Khas Vokasional. Sains Humanika, 16(2), 9–16. https://doi.org/10.11113/sh.v16n2.2009