DETEL Identificación de textos elaborados por LLM
dc.contributor.advisor | Mejía Delgadillo, Gonzalo Enrique | |
dc.contributor.author | Ardila Barbosa, David Camilo | |
dc.contributor.author | Carrillo Aranda, Dairo Javier | |
dc.contributor.author | Ladino Perdomo, Vladimir | |
dc.date.accessioned | 2024-06-06T20:32:08Z | |
dc.date.available | 2024-06-06T20:32:08Z | |
dc.date.issued | 2023-10-23 | |
dc.identifier.uri | http://hdl.handle.net/10818/60271 | |
dc.description | 87 pages | es_CO |
dc.description.abstract | Generative language models have brought disruptive change across many sectors (OpenAI, 2022). These changes also challenge the study of authorship, because generative models cannot hold copyright for two reasons: first, they are not human beings who can assume responsibility, and second, because of the nature of their training corpus (OpenAI, 2022). This issue is especially relevant in the academic context. In this study, we explore two experimental approaches to the binary classification of text as generated by a large language model (LLM) or written by a human, drawing on stylometry and on the feature extraction techniques used in Natural Language Processing (NLP). To this end, a silver-standard corpus was compiled from various sources with balanced classes. The dataset comprises documents with distinct linguistic structures (fables, stories, essays, news reports, tweets, and poems) to diversify vocabulary and grammatical structure. The experimental approaches classify text parameterized with TF-IDF, with embeddings, and with extracted features, and we propose a taxonomy of the linguistic features used in classification. The results corroborate the existing literature (Fröhling and Zubiaga, 2021; Dou et al., 2021): classification models such as decision trees, random forests, AdaBoost, and support vector classifiers (SVC), as used in LLM studies and taking lexico-grammatical features as input, tend to outperform models based on statistical distributions such as TF-IDF or on vectorization approaches such as embeddings, likely because the latter are prone to overfitting given the presence of exclusionary vocabulary in the corpus. | en |
dc.description.abstract | Los modelos generativos de lenguaje han planteado un cambio disruptivo en áreas que abarcan diferentes sectores (OpenAI, 2022), estos cambios a su vez suponen un reto en el estudio de la autoría, pues los modelos de generación no tienen derechos de autor, ya que, no es un ser humano para asumir la responsabilidad y segundo por la naturaleza del corpus de su entrenamiento (OpenAI, 2022), lo que supone una especial relevancia en el contexto académico. En este trabajo se abordan dos líneas experimentales para la clasificación binaria de texto generado por un LLM y un humano, líneas que son abordadas desde el área de la estilometría y la extracción de características utilizadas en NLP. Para esto se recopila un corpus o data set silver standard de diferentes fuentes y clases balanceadas. Este data set es compuesto por documentos con estructuras lingüísticas distintas (fábulas, cuentos, ensayos, noticias, tweets y poemas) para diversificar el vocabulario, y la estructura gramatical de los mismos. Como líneas experimentales se toma la clasificación por parametrización del texto con tf-idf, embedding y extracción de características, proponiendo una taxonomía para la clasificación de las características lingüísticas usadas en la categorización. Estas líneas experimentales corroboran resultados de la literatura (Fröhling y Zubiaga, 2021) (Dou, Forbes, Koncel-Kedziorski, Smith, y Choi, 2021), en los cuales modelos de clasificación como decision tree, random forest, adaboost, svc, usados en llm, y cuyo input son características lexo gramaticales, funcionan mejor que los basados en distribuciones estadísticas como tf-idf y de vectorización, como el embedding, pues son propensos a un sobre ajuste, dada la presencia de vocabulario excluyente en el corpus. | es_CO |
dc.format | application/pdf | es_CO |
dc.language.iso | spa | es_CO |
dc.publisher | Universidad de La Sabana | es_CO |
dc.rights | Attribution-NonCommercial-NoDerivatives 4.0 International | * |
dc.rights.uri | http://creativecommons.org/licenses/by-nc-nd/4.0/ | * |
dc.subject.other | Natural Language Processing | |
dc.subject.other | Classification algorithms | |
dc.subject.other | LLM, Text taxonomy | |
dc.subject.other | Stylometry | |
dc.subject.other | Authorship attribution | |
dc.subject.other | ChatGPT | |
dc.title | DETEL Identificación de textos elaborados por LLM | es_CO |
dc.type | master thesis | es_CO |
dc.type.hasVersion | publishedVersion | es_CO |
dc.rights.accessRights | openAccess | es_CO |
dcterms.references | Ahmed, H. (2018). The role of linguistic feature categories in authorship verification. Procedia Computer Science, 142, 214-221. doi: 10.1016/j.procs.2018.10.478 | |
dcterms.references | Al-Khatib, M. A., y Al-qaoud, J. K. (2021). Authorship verification of opinion articles in online newspapers using the idiolect of author: a comparative study. Information Communication and Society, 24. doi: 10.1080/1369118X.2020.1716039 | |
dcterms.references | Antici, F., Bolognini, L., Inajetovic, M. A., Ivasiuk, B., Galassi, A., y Ruggeri, F. (2021). Subjectivita: An italian corpus for subjectivity detection in newspapers. , 40-52. doi: 10.1007/978-3-030-85251-1_4 | |
dcterms.references | Anwar, W., Bajwa, I. S., y Ramzan, S. (2019). Design and implementation of a machine learning-based authorship identification model. Scientific Programming, 2019. doi: 10.1155/2019/9431073 | |
dcterms.references | Bartz, D. (2023, 2). As chatgpt’s popularity explodes, u.s. lawmakers take an interest. Descargado de https://www.reuters.com/technology/chatgpts-popularity-explodes-us-lawmakers-take-an-interest-2023-02-13/ | |
dcterms.references | Bender, E. M., Gebru, T., McMillan-Major, A., y Shmitchell, S. (2021, 3). On the dangers of stochastic parrots. En (p. 610-623). ACM. doi: 10.1145/3442188.3445922 | |
dcterms.references | Das, A., y Verma, R. M. (2020). Can machines tell stories? a comparative study of deep neural language models and metrics. IEEE Access, 8, 181258-181292. doi: 10.1109/ACCESS.2020.3023421 | |
dcterms.references | de Villa, G. R. (2018). Introducción a word2vec (skip gram model). Descargado de https://gruizdevilla.medium.com/introducci%C3%B3n-a-word2vec-skip-gram-model-4800f72c871f | |
dcterms.references | Dou, Y., Forbes, M., Koncel-Kedziorski, R., Smith, N. A., y Choi, Y. (2021, 7). Is gpt-3 text indistinguishable from human text? scarecrow: A framework for scrutinizing machine text. | |
dcterms.references | Fröhling, L., y Zubiaga, A. (2021). Feature-based detection of automated language models: Tackling gpt-2, gpt-3 and grover. PeerJ Computer Science, 7. doi: 10.7717/PEERJ-CS.443 | |
dcterms.references | Gaur, V., y Saunshi, N. (2022, 9). Symbolic math reasoning with language models. En (p. 1-5). IEEE. doi: 10.1109/URTC56832.2022.10002218 | |
dcterms.references | Holmes, D. I. (1998, 9). The evolution of stylometry in humanities scholarship. Literary and Linguistic Computing, 13, 111-117. doi: 10.1093/llc/13.3.111 | |
dcterms.references | Hutson, M. (2022, 11). Could ai help you to write your next paper? Nature, 611, 192-193. doi: 10.1038/d41586-022-03479-w | |
dcterms.references | Jafariakinabad, F., y Hua, K. A. (2021, 11). Unifying lexical, syntactic, and structural representations of written language for authorship attribution. SN Computer Science, 2, 481. doi: 10.1007/s42979-021-00911-2 | |
dcterms.references | Karani, D. (2018, 9). Introduction to word embedding and word2vec. | |
dcterms.references | Lagutina, K., Lagutina, N., Boychuk, E., Larionov, V., y Paramonov, I. (2021). Authorship verification of literary texts with rhythm features. En (Vol. 2021-January). doi: 10.23919/FRUCT50888.2021.9347649 | |
dcterms.references | Lagutina, K., Lagutina, N., Boychuk, E., Vorontsova, I., Shliakhtina, E., Belyaeva, O., . . . Demidov, P. G. (2019). A survey on stylometric text features. doi: 10.23919/FRUCT48121.2019.8981504 | |
dcterms.references | Lee, P., Fyffe, S., Son, M., Jia, Z., y Yao, Z. (2023, 2). A paradigm shift from “human writing” to “machine generation” in personality test development: an application of state-of-the-art natural language processing. Journal of Business and Psychology, 38, 163-190. doi: 10.1007/s10869-022-09864-6 | |
dcterms.references | Li, C., y Xing, W. (2021, 6). Natural language generation using deep learning to support mooc learners. International Journal of Artificial Intelligence in Education, 31, 186-214. doi: 10.1007/s40593-020-00235 | |
dcterms.references | Manodnya, K. H., y Giri, A. (2022, 10). Gpt-k: A gpt-based model for generation of text in kannada. En (p. 534-539). IEEE. doi: 10.1109/ICCCMLA56841.2022.9989289 | |
dcterms.references | Maxime. (2019, 1). What is a transformer? Inside Machine learning. Descargado de https://medium.com/inside-machine-learning/what-is-a-transformer-d07dd1fbec04 | |
dcterms.references | Misini, A., Kadriu, A., y Canhasi, E. (2022, 12). A survey on authorship analysis tasks and techniques. SEEU Review, 17, 153-167. doi: 10.2478/seeur-2022-0100 | |
dcterms.references | Ni, J., Young, T., Pandelea, V., Xue, F., y Cambria, E. (2022). Recent advances in deep learning based dialogue systems: a systematic survey. Artificial Intelligence Review. doi: 10.1007/s10462-022-10248-8 | |
dcterms.references | OpenAI. (2022, 11). Introducing chatgpt. Descargado de https://openai.com/blog/chatgpt | |
dcterms.references | Raafat, M. A., El-Wakil, R. A. F., y Atia, A. (2021, 5). Comparative study for stylometric analysis techniques for authorship attribution. En (p. 176-181). Institute of Electrical and Electronics Engineers Inc. doi: 10.1109/MIUCC52538.2021.9447600 | |
dcterms.references | Rathod, S. (2022). Exploring author profiling for fake news detection. En (p. 1614-1619). Institute of Electrical and Electronics Engineers Inc. doi: 10.1109/COMPSAC54236.2022.00256 | |
dcterms.references | Saini, A., Sri, M. R., y Thakur, M. (2021, 2). Intrinsic plagiarism detection system using stylometric features and dbscan. En (p. 13-18). IEEE. doi: 10.1109/ICCCIS51004.2021.9397187 | |
dcterms.references | Segura-Bedmar, I., Ruz, L., y Guerrero-Aspizua, S. (2021, 3). Evaluation of a transformer model applied to the task of text summarization in different domains. Procesamiento del Lenguaje Natural, 66, 27-39. doi: 10.26342/2021-66-2 | |
dcterms.references | Stamatatos, E., Rangel, F., Tschuggnall, M., Stein, B., Kestemont, M., Rosso, P., y Potthast, M. (2018). Overview of pan 2018. , 267-285. doi: 10.1007/978-3-319-98932-7_25 | |
dcterms.references | Stokel-Walker, C. (2022, 12). Ai bot chatgpt writes smart essays — should professors worry? Nature. doi: 10.1038/d41586-022-04397-7 | |
dcterms.references | Valenzuela, G. U. (2023, 3). desafío del uso de inteligencia artificial para la elaboración de la literatura científica: el caso de chatgpt, un debate abierto. Cuadernos Médico Sociales, 63, 27-31. doi: 10.56116/cms.v63.n1.2023.1140 | |
dcterms.references | van Dis, E. A. M., Bollen, J., Zuidema, W., van Rooij, R., y Bockting, C. L. (2023, 2). Chatgpt: five priorities for research. Nature, 614, 224-226. doi: 10.1038/d41586-023-00288-7 | |
dcterms.references | Wolf, T. (2018, 5). The current best of universal word embeddings and sentence embeddings. HuggingFace. Descargado de https://medium.com/huggingface/universal-word-sentence-embeddings-ce48ddc8fc3a | |
thesis.degree.discipline | Facultad de Ingeniería | es_CO |
thesis.degree.level | Maestría en Analítica Aplicada | es_CO |
thesis.degree.name | Magíster en Analítica Aplicada | es_CO |