DETEL Identificación de textos elaborados por LLM

Ardila Barbosa, David Camilo; Carrillo Aranda, Dairo Javier; Ladino Perdomo, Vladimir

dc.contributor.advisor	Mejía Delgadillo, Gonzalo Enrique
dc.contributor.author	Ardila Barbosa, David Camilo
dc.contributor.author	Carrillo Aranda, Dairo Javier
dc.contributor.author	Ladino Perdomo, Vladimir
dc.date.accessioned	2024-06-06T20:32:08Z
dc.date.available	2024-06-06T20:32:08Z
dc.date.issued	2023-10-23
dc.identifier.uri	http://hdl.handle.net/10818/60271
dc.description	87 páginas	es_CO
dc.description.abstract	Generative language models have brought a disruptive shift across many sectors (OpenAI, 2022). This shift also challenges the study of authorship, since generative models cannot hold copyright, for two reasons: first, they are not human beings who can assume responsibility, and second, because of the nature of their training corpus (OpenAI, 2022). This is especially relevant in the academic context. In this study, we explore two experimental approaches to the binary classification of text as generated by a large language model (LLM) or written by a human. Both approaches draw on stylometry and on the feature extraction techniques used in Natural Language Processing (NLP). To this end, a silver-standard corpus was compiled from various sources with balanced classes. The dataset comprises documents with distinct linguistic structures (fables, stories, essays, news reports, tweets, and poems) to diversify vocabulary and grammatical structure. The experimental approaches classify text represented through TF-IDF, embeddings, and extracted features, and a taxonomy is proposed for the linguistic features used in classification. The results corroborate findings in the existing literature (Fröhling and Zubiaga, 2021; Dou et al., 2021): classification models such as decision trees, random forests, AdaBoost, and support vector classifiers (SVC), applied to LLM text and taking lexicogrammatical features as input, tend to outperform those based on statistical distributions such as TF-IDF and on vectorization approaches such as embeddings, which are prone to overfitting given the presence of exclusionary vocabulary in the corpus.	en
dc.description.abstract	Los modelos generativos de lenguaje han planteado un cambio disruptivo en áreas que abarcan diferentes sectores (OpenAI, 2022). Estos cambios a su vez suponen un reto en el estudio de la autoría, pues los modelos de generación no tienen derechos de autor: primero, porque no son seres humanos que puedan asumir la responsabilidad y, segundo, por la naturaleza del corpus de su entrenamiento (OpenAI, 2022), lo que supone una especial relevancia en el contexto académico. En este trabajo se abordan dos líneas experimentales para la clasificación binaria de texto generado por un LLM y un humano, líneas que son abordadas desde el área de la estilometría y la extracción de características utilizadas en NLP. Para esto se recopila un corpus o data set silver standard de diferentes fuentes y clases balanceadas. Este data set está compuesto por documentos con estructuras lingüísticas distintas (fábulas, cuentos, ensayos, noticias, tweets y poemas) para diversificar el vocabulario y la estructura gramatical de los mismos. Como líneas experimentales se toma la clasificación por parametrización del texto con tf-idf, embedding y extracción de características, proponiendo una taxonomía para la clasificación de las características lingüísticas usadas en la categorización. Estas líneas experimentales corroboran resultados de la literatura (Fröhling y Zubiaga, 2021) (Dou, Forbes, Koncel-Kedziorski, Smith, y Choi, 2021), en los cuales modelos de clasificación como decision tree, random forest, adaboost y svc, usados en llm, y cuyo input son características lexicogramaticales, funcionan mejor que los basados en distribuciones estadísticas como tf-idf y de vectorización, como el embedding, pues estos últimos son propensos a un sobreajuste, dada la presencia de vocabulario excluyente en el corpus.	es_CO
dc.format	application/pdf	es_CO
dc.language.iso	spa	es_CO
dc.publisher	Universidad de La Sabana	es_CO
dc.rights	Attribution-NonCommercial-NoDerivatives 4.0 Internacional	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/4.0/	*
dc.subject.other	Procesamiento de Lenguaje Natural
dc.subject.other	Algoritmos clasificación
dc.subject.other	LLM, Taxonomía de textos
dc.subject.other	Estilometría
dc.subject.other	Atribución de Autoría
dc.subject.other	ChatGPT
dc.title	DETEL Identificación de textos elaborados por LLM	es_CO
dc.type	master thesis	es_CO
dc.type.hasVersion	publishedVersion	es_CO
dc.rights.accessRights	openAccess	es_CO
dcterms.references	Ahmed, H. (2018). The role of linguistic feature categories in authorship verification. Procedia Computer Science, 142, 214-221. doi: 10.1016/j.procs.2018.10.478
dcterms.references	Al-Khatib, M. A., y Al-qaoud, J. K. (2021). Authorship verification of opinion articles in online newspapers using the idiolect of author: a comparative study. Information Communication and Society, 24. doi: 10.1080/1369118X.2020.1716039
dcterms.references	Antici, F., Bolognini, L., Inajetovic, M. A., Ivasiuk, B., Galassi, A., y Ruggeri, F. (2021). Subjectivita: An italian corpus for subjectivity detection in newspapers. , 40-52. doi: 10.1007/978-3-030-85251-1_4
dcterms.references	Anwar, W., Bajwa, I. S., y Ramzan, S. (2019). Design and implementation of a machine learning-based authorship identification model. Scientific Programming, 2019. doi: 10.1155/2019/9431073
dcterms.references	Bartz, D. (2023, 2). As ChatGPT’s popularity explodes, U.S. lawmakers take an interest. Descargado de https://www.reuters.com/technology/chatgpts-popularity-explodes-us-lawmakers-take-an-interest-2023-02-13/
dcterms.references	Bender, E. M., Gebru, T., McMillan-Major, A., y Shmitchell, S. (2021, 3). On the dangers of stochastic parrots. En (p. 610-623). ACM. doi: 10.1145/3442188.3445922
dcterms.references	Das, A., y Verma, R. M. (2020). Can machines tell stories? a comparative study of deep neural language models and metrics. IEEE Access, 8, 181258-181292. doi: 10.1109/ACCESS.2020.3023421
dcterms.references	de Villa, G. R. (2018). Introducción a word2vec (skip gram model). Descargado de https://gruizdevilla.medium.com/introducci%C3%B3n-a-word2vec-skip-gram-model-4800f72c871f
dcterms.references	Dou, Y., Forbes, M., Koncel-Kedziorski, R., Smith, N. A., y Choi, Y. (2021, 7). Is gpt-3 text indistinguishable from human text? scarecrow: A framework for scrutinizing machine text.
dcterms.references	Fröhling, L., y Zubiaga, A. (2021). Feature-based detection of automated language models: Tackling gpt-2, gpt-3 and grover. PeerJ Computer Science, 7. doi: 10.7717/PEERJ-CS.443
dcterms.references	Gaur, V., y Saunshi, N. (2022, 9). Symbolic math reasoning with language models. En (p. 1-5). IEEE. doi: 10.1109/URTC56832.2022.10002218
dcterms.references	Holmes, D. I. (1998, 9). The evolution of stylometry in humanities scholarship. Literary and Linguistic Computing, 13, 111-117. doi: 10.1093/llc/13.3.111
dcterms.references	Hutson, M. (2022, 11). Could ai help you to write your next paper? Nature, 611, 192-193. doi: 10.1038/d41586-022-03479-w
dcterms.references	Jafariakinabad, F., y Hua, K. A. (2021, 11). Unifying lexical, syntactic, and structural representations of written language for authorship attribution. SN Computer Science, 2, 481. doi: 10.1007/s42979-021-00911-2
dcterms.references	Karani, D. (2018, 9). Introduction to word embedding and word2vec.
dcterms.references	Lagutina, K., Lagutina, N., Boychuk, E., Larionov, V., y Paramonov, I. (2021). Authorship verification of literary texts with rhythm features. En (Vol. 2021-January). doi: 10.23919/FRUCT50888.2021.9347649
dcterms.references	Lagutina, K., Lagutina, N., Boychuk, E., Vorontsova, I., Shliakhtina, E., Belyaeva, O., ... Demidov, P. G. (2019). A survey on stylometric text features. doi: 10.23919/FRUCT48121.2019.8981504
dcterms.references	Lee, P., Fyffe, S., Son, M., Jia, Z., y Yao, Z. (2023, 2). A paradigm shift from “human writing” to “machine generation” in personality test development: an application of state-of-the-art natural language processing. Journal of Business and Psychology, 38, 163-190. doi: 10.1007/s10869-022-09864-6
dcterms.references	Li, C., y Xing, W. (2021, 6). Natural language generation using deep learning to support mooc learners. International Journal of Artificial Intelligence in Education, 31, 186-214. doi: 10.1007/s40593-020-00235
dcterms.references	Manodnya, K. H., y Giri, A. (2022, 10). Gpt-k: A gpt-based model for generation of text in kannada. En (p. 534-539). IEEE. doi: 10.1109/ICCCMLA56841.2022.9989289
dcterms.references	Maxime. (2019, 1). What is a transformer? Inside Machine learning. Descargado de https://medium.com/inside-machine-learning/what-is-a-transformer-d07dd1fbec04
dcterms.references	Misini, A., Kadriu, A., y Canhasi, E. (2022, 12). A survey on authorship analysis tasks and techniques. SEEU Review, 17, 153-167. doi: 10.2478/seeur-2022-0100
dcterms.references	Ni, J., Young, T., Pandelea, V., Xue, F., y Cambria, E. (2022). Recent advances in deep learning based dialogue systems: a systematic survey. Artificial Intelligence Review. doi: 10.1007/s10462-022-10248-8
dcterms.references	OpenAI. (2022, 11). Introducing chatgpt. Descargado de https://openai.com/blog/chatgpt
dcterms.references	Raafat, M. A., El-Wakil, R. A. F., y Atia, A. (2021, 5). Comparative study for stylometric analysis techniques for authorship attribution. En (p. 176-181). Institute of Electrical and Electronics Engineers Inc. doi: 10.1109/MIUCC52538.2021.9447600
dcterms.references	Rathod, S. (2022). Exploring author profiling for fake news detection. En (p. 1614-1619). Institute of Electrical and Electronics Engineers Inc. doi: 10.1109/COMPSAC54236.2022.00256
dcterms.references	Saini, A., Sri, M. R., y Thakur, M. (2021, 2). Intrinsic plagiarism detection system using stylometric features and dbscan. En (p. 13-18). IEEE. doi: 10.1109/ICCCIS51004.2021.9397187
dcterms.references	Segura-Bedmar, I., Ruz, L., y Guerrero-Aspizua, S. (2021, 3). Evaluation of a transformer model applied to the task of text summarization in different domains. Procesamiento del Lenguaje Natural, 66, 27-39. doi: 10.26342/2021-66-2
dcterms.references	Stamatatos, E., Rangel, F., Tschuggnall, M., Stein, B., Kestemont, M., Rosso, P., y Potthast, M. (2018). Overview of pan 2018. , 267-285. doi: 10.1007/978-3-319-98932-7_25
dcterms.references	Stokel-Walker, C. (2022, 12). Ai bot chatgpt writes smart essays — should professors worry? Nature. doi: 10.1038/d41586-022-04397-7
dcterms.references	Valenzuela, G. U. (2023, 3). Desafío del uso de inteligencia artificial para la elaboración de la literatura científica: el caso de chatgpt, un debate abierto. Cuadernos Médico Sociales, 63, 27-31. doi: 10.56116/cms.v63.n1.2023.1140
dcterms.references	van Dis, E. A. M., Bollen, J., Zuidema, W., van Rooij, R., y Bockting, C. L. (2023, 2). Chatgpt: five priorities for research. Nature, 614, 224-226. doi: 10.1038/d41586-023-00288-7
dcterms.references	Wolf, T. (2018, 5). The current best of universal word embeddings and sentence embeddings. HuggingFace. Descargado de https://medium.com/huggingface/universal-word-sentence-embeddings-ce48ddc8fc3a
thesis.degree.discipline	Facultad de Ingeniería	es_CO
thesis.degree.level	Maestría en Analítica Aplicada	es_CO
thesis.degree.name	Magíster en Analítica Aplicada	es_CO