dc.contributor.advisor: Mejía Delgadillo, Gonzalo Enrique
dc.contributor.author: Ardila Barbosa, David Camilo
dc.contributor.author: Carrillo Aranda, Dairo Javier
dc.contributor.author: Ladino Perdomo, Vladimir
dc.date.accessioned: 2024-06-06T20:32:08Z
dc.date.available: 2024-06-06T20:32:08Z
dc.date.issued: 2023-10-23
dc.identifier.uri: http://hdl.handle.net/10818/60271
dc.description: 87 pages
dc.description.abstract: Generative language models have driven a disruptive shift across many sectors (OpenAI, 2022). This shift also challenges the study of authorship, because generative models cannot hold copyright, for two reasons: first, a model is not a human being who can assume responsibility, and second, the nature of its training corpus (OpenAI, 2022). The issue is especially relevant in the academic context. In this work we explore two experimental approaches to the binary classification of text as generated by a large language model (LLM) or written by a human, drawing on stylometry and on the feature extraction techniques used in natural language processing (NLP). To this end, a silver-standard corpus was compiled from various sources, with balanced classes. The dataset comprises documents with distinct linguistic structures (fables, stories, essays, news reports, tweets, and poems) to diversify both their vocabulary and their grammatical structure. The experimental approaches classify text parameterized with TF-IDF, embeddings, and extracted features, and we propose a taxonomy of the linguistic features used in the classification. The results corroborate findings in the literature (Fröhling and Zubiaga, 2021; Dou, Forbes, Koncel-Kedziorski, Smith, and Choi, 2021): classification models such as decision trees, random forests, AdaBoost, and support vector classifiers (SVC), used for LLM text detection and taking lexicogrammatical features as input, perform better than models based on statistical distributions such as TF-IDF or on vectorization approaches such as embeddings, which are prone to overfitting given the presence of exclusionary vocabulary in the corpus. [en]
dc.format: application/pdf [es_CO]
dc.language.iso: spa [es_CO]
dc.publisher: Universidad de La Sabana [es_CO]
dc.rights: Attribution-NonCommercial-NoDerivatives 4.0 International
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/4.0/
dc.subject.other: Natural Language Processing
dc.subject.other: Classification algorithms
dc.subject.other: LLM, Text taxonomy
dc.subject.other: Stylometry
dc.subject.other: Authorship attribution
dc.subject.other: ChatGPT
dc.title: DETEL Identificación de textos elaborados por LLM [es_CO]
dc.type: master thesis [es_CO]
dc.type.hasVersion: publishedVersion [es_CO]
dc.rights.accessRights: openAccess [es_CO]
dcterms.references: Ahmed, H. (2018). The role of linguistic feature categories in authorship verification. Procedia Computer Science, 142, 214-221. doi: 10.1016/j.procs.2018.10.478
dcterms.references: Al-Khatib, M. A., and Al-qaoud, J. K. (2021). Authorship verification of opinion articles in online newspapers using the idiolect of author: a comparative study. Information Communication and Society, 24. doi: 10.1080/1369118X.2020.1716039
dcterms.references: Antici, F., Bolognini, L., Inajetovic, M. A., Ivasiuk, B., Galassi, A., and Ruggeri, F. (2021). Subjectivita: An italian corpus for subjectivity detection in newspapers. , 40-52. doi: 10.1007/978-3-030-85251-1_4
dcterms.references: Anwar, W., Bajwa, I. S., and Ramzan, S. (2019). Design and implementation of a machine learning-based authorship identification model. Scientific Programming, 2019. doi: 10.1155/2019/9431073
dcterms.references: Bartz, D. (2023, 2). As chatgpt’s popularity explodes, u.s. lawmakers take an interest. Retrieved from https://www.reuters.com/technology/chatgpts-popularity-explodes-us-lawmakers-take-an-interest-2023-02-13/
dcterms.references: Bender, E. M., Gebru, T., McMillan-Major, A., and Shmitchell, S. (2021, 3). On the dangers of stochastic parrots. In (pp. 610-623). ACM. doi: 10.1145/3442188.3445922
dcterms.references: Das, A., and Verma, R. M. (2020). Can machines tell stories? a comparative study of deep neural language models and metrics. IEEE Access, 8, 181258-181292. doi: 10.1109/ACCESS.2020.3023421
dcterms.references: de Villa, G. R. (2018). Introducción a word2vec (skip gram model). Retrieved from https://gruizdevilla.medium.com/introducci%C3%B3n-a-word2vec-skip-gram-model-4800f72c871f
dcterms.references: Dou, Y., Forbes, M., Koncel-Kedziorski, R., Smith, N. A., and Choi, Y. (2021, 7). Is gpt-3 text indistinguishable from human text? scarecrow: A framework for scrutinizing machine text.
dcterms.references: Fröhling, L., and Zubiaga, A. (2021). Feature-based detection of automated language models: Tackling gpt-2, gpt-3 and grover. PeerJ Computer Science, 7. doi: 10.7717/PEERJ-CS.443
dcterms.references: Gaur, V., and Saunshi, N. (2022, 9). Symbolic math reasoning with language models. In (pp. 1-5). IEEE. doi: 10.1109/URTC56832.2022.10002218
dcterms.references: Holmes, D. I. (1998, 9). The evolution of stylometry in humanities scholarship. Literary and Linguistic Computing, 13, 111-117. doi: 10.1093/llc/13.3.111
dcterms.references: Hutson, M. (2022, 11). Could ai help you to write your next paper? Nature, 611, 192-193. doi: 10.1038/d41586-022-03479-w
dcterms.references: Jafariakinabad, F., and Hua, K. A. (2021, 11). Unifying lexical, syntactic, and structural representations of written language for authorship attribution. SN Computer Science, 2, 481. doi: 10.1007/s42979-021-00911-2
dcterms.references: Karani, D. (2018, 9). Introduction to word embedding and word2vec.
dcterms.references: Lagutina, K., Lagutina, N., Boychuk, E., Larionov, V., and Paramonov, I. (2021). Authorship verification of literary texts with rhythm features. In (Vol. 2021-January). doi: 10.23919/FRUCT50888.2021.9347649
dcterms.references: Lagutina, K., Lagutina, N., Boychuk, E., Vorontsova, I., Shliakhtina, E., Belyaeva, O., ... Demidov, P. G. (2019). A survey on stylometric text features. doi: 10.23919/FRUCT48121.2019.8981504
dcterms.references: Lee, P., Fyffe, S., Son, M., Jia, Z., and Yao, Z. (2023, 2). A paradigm shift from “human writing” to “machine generation” in personality test development: an application of state-of-the-art natural language processing. Journal of Business and Psychology, 38, 163-190. doi: 10.1007/s10869-022-09864-6
dcterms.references: Li, C., and Xing, W. (2021, 6). Natural language generation using deep learning to support mooc learners. International Journal of Artificial Intelligence in Education, 31, 186-214. doi: 10.1007/s40593-020-00235
dcterms.references: Manodnya, K. H., and Giri, A. (2022, 10). Gpt-k: A gpt-based model for generation of text in kannada. In (pp. 534-539). IEEE. doi: 10.1109/ICCCMLA56841.2022.9989289
dcterms.references: Maxime. (2019, 1). What is a transformer? Inside Machine learning. Retrieved from https://medium.com/inside-machine-learning/what-is-a-transformer-d07dd1fbec04
dcterms.references: Misini, A., Kadriu, A., and Canhasi, E. (2022, 12). A survey on authorship analysis tasks and techniques. SEEU Review, 17, 153-167. doi: 10.2478/seeur-2022-0100
dcterms.references: Ni, J., Young, T., Pandelea, V., Xue, F., and Cambria, E. (2022). Recent advances in deep learning based dialogue systems: a systematic survey. Artificial Intelligence Review. doi: 10.1007/s10462-022-10248-8
dcterms.references: OpenAI. (2022, 11). Introducing chatgpt. Retrieved from https://openai.com/blog/chatgpt
dcterms.references: Raafat, M. A., El-Wakil, R. A. F., and Atia, A. (2021, 5). Comparative study for stylometric analysis techniques for authorship attribution. In (pp. 176-181). Institute of Electrical and Electronics Engineers Inc. doi: 10.1109/MIUCC52538.2021.9447600
dcterms.references: Rathod, S. (2022). Exploring author profiling for fake news detection. In (pp. 1614-1619). Institute of Electrical and Electronics Engineers Inc. doi: 10.1109/COMPSAC54236.2022.00256
dcterms.references: Saini, A., Sri, M. R., and Thakur, M. (2021, 2). Intrinsic plagiarism detection system using stylometric features and dbscan. In (pp. 13-18). IEEE. doi: 10.1109/ICCCIS51004.2021.9397187
dcterms.references: Segura-Bedmar, I., Ruz, L., and Guerrero-Aspizua, S. (2021, 3). Evaluation of a transformer model applied to the task of text summarization in different domains. Procesamiento del Lenguaje Natural, 66, 27-39. doi: 10.26342/2021-66-2
dcterms.references: Stamatatos, E., Rangel, F., Tschuggnall, M., Stein, B., Kestemont, M., Rosso, P., and Potthast, M. (2018). Overview of pan 2018. , 267-285. doi: 10.1007/978-3-319-98932-7_25
dcterms.references: Stokel-Walker, C. (2022, 12). Ai bot chatgpt writes smart essays: should professors worry? Nature. doi: 10.1038/d41586-022-04397-7
dcterms.references: Valenzuela, G. U. (2023, 3). desafío del uso de inteligencia artificial para la elaboración de la literatura científica: el caso de chatgpt, un debate abierto. Cuadernos Médico Sociales, 63, 27-31. doi: 10.56116/cms.v63.n1.2023.1140
dcterms.references: van Dis, E. A. M., Bollen, J., Zuidema, W., van Rooij, R., and Bockting, C. L. (2023, 2). Chatgpt: five priorities for research. Nature, 614, 224-226. doi: 10.1038/d41586-023-00288-7
dcterms.references: Wolf, T. (2018, 5). The current best of universal word embeddings and sentence embeddings. HuggingFace. Retrieved from https://medium.com/huggingface/universal-word-sentence-embeddings-ce48ddc8fc3a
thesis.degree.discipline: Facultad de Ingeniería [es_CO]
thesis.degree.level: Maestría en Analítica Aplicada [es_CO]
thesis.degree.name: Magíster en Analítica Aplicada [es_CO]
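
The abstract in this record contrasts classifiers fed with lexicogrammatical (stylometric) features against classifiers fed with TF-IDF or embedding vectors. As a rough illustration of what the first kind of input can look like, the sketch below computes a handful of simple features from raw text using only the Python standard library. The function name and the four features are hypothetical choices made for this example; they are not the feature taxonomy proposed in the thesis, which this record does not list.

```python
import re
import string


def stylometric_features(text: str) -> dict[str, float]:
    """Return a few illustrative lexical and structural features for one document.

    The feature set is an assumption for demonstration purposes only.
    """
    # Split into sentences on ., ! or ? and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Lowercased word tokens (Unicode-aware word characters).
    words = re.findall(r"\w+", text.lower())
    n_words = len(words)
    n_chars = len(text)
    # Count ASCII punctuation characters only.
    punctuation = sum(1 for ch in text if ch in string.punctuation)
    return {
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "type_token_ratio": len(set(words)) / n_words if n_words else 0.0,
        "avg_sentence_length": n_words / len(sentences) if sentences else 0.0,
        "punctuation_ratio": punctuation / n_chars if n_chars else 0.0,
    }


features = stylometric_features("The fox ran. The fox hid!")
```

Because features of this kind summarize how a text is written rather than which words it contains, the resulting vector stays small and does not depend on the corpus vocabulary. That is consistent with the abstract's remark that vocabulary-based representations are prone to overfitting when the corpus contains exclusionary vocabulary. In practice, one such vector per document would be passed to a classifier such as a decision tree or random forest.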



Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivatives 4.0 International.