CORPORA-BASED ANALYSIS OF SPECIALISED TEXTS FOR TRANSLATION TRAINING: TERMS AND NEOLOGISMS
DOI: https://doi.org/10.32782/2522-4077-2025-213-23
Keywords: corpus-based research, corpus, professional text, core vocabulary, term, neologism, Sketch Engine, CQL universal query language
Abstract
Language corpora are among the most effective tools of applied linguistics and are actively used in many fields. The automated selection, compilation and analysis of text corpora of virtually unlimited size open up new perspectives not only for linguistic research, but also for professionals who use such data to solve practical problems. Corpus-based methods have great potential for improving language teaching, including translation training, as they allow for the accurate and targeted selection of the specialised linguistic material needed to master the lexical minimum and the usage and translation of key language units, as well as to identify current language trends in a particular field.
Among the tools for working with corpora, Sketch Engine stands out as one of the most powerful, as it allows users not only to analyse existing corpora but also to build their own, including multilingual ones. This makes it possible to research professional texts quickly and efficiently, identify key terminology and common phrases, analyse translation strategies, and create training materials for future translators. The CQL query language improves search accuracy and yields more relevant linguistic data.
This article, which is part of a larger study, discusses an important Sketch Engine function for searching, analysing and selecting lexical material: term recognition and extraction using the built-in Keywords tool. This tool not only identifies terms and term combinations in professional texts with high accuracy, but also compares the frequency of such words and combinations in the studied corpus and in a reference corpus, which significantly increases the efficiency of the search in general and of the linguistic analysis of the selected units in particular. Another aspect of this study is the methodology of corpus search for neologisms and rarely used words. The latter is a challenge for corpus-based text analysis, as there are no universal search formulas or even principles for finding such vocabulary, which is nevertheless an important component of professional texts.
The study is based on a corpus of English-language legal texts related to the IT sector, including licence agreements and contracts.
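The comparison of word frequencies between the studied corpus and a reference corpus can be illustrated with a small sketch. It assumes the "simple maths" keyness ratio that Sketch Engine documents for its Keywords tool, that is, smoothed frequency per million in the focus corpus divided by smoothed frequency per million in the reference corpus. The function names, the regex tokenizer, the smoothing value of 1 and the two toy texts are illustrative assumptions, not material from the study.

```python
import re
from collections import Counter


def tokenize(text):
    """Naive lowercase word tokenizer (an assumption; real corpora are
    tokenized and lemmatized by the corpus tool itself)."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def keyness(focus_tokens, ref_tokens, smoothing=1.0):
    """Rank focus-corpus words by a 'simple maths' style ratio:
    (focus freq per million + n) / (reference freq per million + n)."""
    focus_counts = Counter(focus_tokens)
    ref_counts = Counter(ref_tokens)
    focus_total = len(focus_tokens)
    ref_total = len(ref_tokens)

    scores = {}
    for word, count in focus_counts.items():
        focus_fpm = count * 1_000_000 / focus_total
        # Words absent from the reference corpus get a frequency of 0 there.
        ref_fpm = ref_counts.get(word, 0) * 1_000_000 / ref_total
        scores[word] = (focus_fpm + smoothing) / (ref_fpm + smoothing)

    # Highest ratio first: words most typical of the focus corpus.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy texts standing in for a specialised corpus and a general reference corpus.
focus_text = (
    "The licensee shall not sublicense the software. "
    "The licensor grants the licensee a licence to use the software."
)
reference_text = (
    "The cat sat on the mat. "
    "The dog chased the cat around the garden and the house."
)

ranked = keyness(tokenize(focus_text), tokenize(reference_text))
for word, score in ranked[:5]:
    print(f"{word}\t{score:.1f}")
```

On these toy texts, domain words such as "licensee" and "software" rank at the top, while "the", which is frequent in both texts, scores below 1. Words that never occur in the reference corpus also receive high scores, so such a list can at most suggest candidates for rare words or neologisms that still need manual checking.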






