CILC2016: Papers with Abstracts

Papers
The Effects of Structure and Proficiency on Determiner Number (Dis)agreement in the Nigerian English Noun Phrase Mayowa Akinlotan Abstract. The paper investigates the extent to which the grammatical number (dis)agreement hypothesis in the New Englishes (Platt et al 1984, Gorlach 1998, Mesthrie et al 2008) manifests in the determiner system of the Nigerian English, and how the variables of proficiency, text type, register, structural complexity, and syntactic form influence scenarios found. Applying principle of accountability (Labov 1972, Tagliamonte 2012), together with test statistic on data drawn from the Nigeria-ICE, we showed that in Nigerian determiner system, grammatical number is likely to agree (98%) with the head noun of the noun phrase than to disagree (2%). Also, the disagreement is mainly influenced by complexity and proficiency. This number irregularity is more likely to occur with the use of quantifier or demonstrative than with indefinite article. We argue that this scenario suggests a manifestation of fossilisation by transferring from the syntactically unique determiner systems of the local Nigerian languages to Nigerian English.
'A lo mejor and igual'. Epistemic and Non-Epistemic Meanings María José Barrios Sabador Abstract. This paper looks into the use of a lo mejor and igual (‘maybe’, ‘perhaps’) in oral texts of the Corpus de Referencia del Español Actual where the speaker doesn’t show any lack of commitment to the proposition. In some of these texts, the speaker gives an example to substantiate his position; in others, he talks about everyday actions – characterized by showing an effective state of affairs and, as such, not subject to doubt –. To account for these uses – not described in grammar – we provide some explanatory hypotheses based on the primary meaning of these exponents, their stage of grammaticalization and their pragmatic motivation.
How Many People in the Making of Sloane 770? A Corpus-Based Approach Jessica Carmona-Cejudo Abstract. During the transcription of a late Middle English manuscript on Medicine (London, British Library, MS Sloane 770), a series of orthographic variations appeared, several of which seemed to be arranged following a predictable pattern. Should this prove correct, it may be a clue to posit the existence of two or more scribes who were involved in the copying of the codex, or else of the dialect of the MS being an example of Mischsprache that combines the dialects of the exemplar MS and that of the scribe. To ascertain whether the MS was written by more than one copyist or whether it is an example of the coexistence of two different dialects, a morphologically lemmatized corpus was built. This paper will present the results obtained after studying that corpus in order to verify either the original hypothesis is linguistically and scientifically based or not.
A Pilot Study on the Use of Pronouns in Oral Productions of Spanish as a Second Language Kim Collewaert and An Vande Casteele Abstract. In this paper, some preliminary results on the use of pronouns in oral discourse of language learners of Spanish will be discussed. The article mainly focuses on the use of different kinds of personal pronouns and the pro-drop phenomenon, namely the existence of a null subject, typical of the Spanish language. The absence of an explicit subject due to a rich verbal conjugation opposes Spanish to other languages, such as French, English and Dutch, where an explicit subject pronoun is obligatory. As to investigate the use of the pronouns by language learners of Spanish, we compiled a corpus of oral productions of second language learners of Spanish who are all native speakers of Dutch and also learned French and English, which means that for them the pro-drop phenomenon is new. We will investigate which kinds of pronouns are used in which syntactic contexts and indicate in what contexts the use of a pronoun is not required. Next to this, we observe in our learners’ corpus an unnecessary repetition of proper names and an over-use of personal pronouns as subjects. This can be related to the concept of "over-explicitation" or "overspecification", whereby learners of a second language tend to use more explicit forms than necessary.
The Meaning of Hashtags Matters: Detecting Hashtags Such as #JeSuisKouachi Daniel Couto Vale and Adjan Hansen-Ampah Abstract. In this paper, we model how the contexts of situation and of discourse restrict meaning potential and apply the resulting model in automatic linguistic analysis of hashtags for tasks of sentiment analysis. With our model, we are able to assign different meanings to \#JeSuisCharlie and \#JeSuisKouachi and variations of them.
Graded Acceptance in Corpus-Based English-to-Spanish Machine Translation Evaluation Mario Crespo Miguel and Marta Sanchez-Saus Laserna Abstract. Traditionally, texts provided by machine translation have been evaluated with a binary criterion: right or wrong. However, in certain cases it is difficult to provide a clear-cut division between fully acceptable and fully unacceptable texts. In this paper we have selected group of different bilingual, human-translated English-to-Spanish pairs of sentences from parallel corpora (Linguee) and a group of machine translated texts with problematic linguistic phenomena in English-to-Spanish translation: polysemy, semantic equivalents, passive, anaphora, etc. We presented the translations to a group of native speakers that evaluated them in different levels of acceptability. Results show the degree of applicability of this approach.
'They are going tomorrow, isn't it?' On the Use of Tag Questions in Indian English and Hong Kong English Miriam Criado Peña Abstract. Tag questions in standard British English (BrE) follow a standard pattern consisting of an operator and a subject. This operator generally coincides with the preceding statement while the auxiliary “do” is the choice when the operator is absent. More importantly, a negative tag is generally attached to a positive statement and vice versa (i.e. you know her, don’t you?) (Quirk et al. 1985: 810). The Asian varieties of English are an exception insofar as apparently no standard rule is observed. The present paper investigates the use and distribution of regular and irregular tag questions in Indian English and Hong Kong English with the following objectives: a) to analyze the distribution of the construction of regular and irregular tag questions across these varieties; b) to assess their frequency across speech and writing, text types included; and c) to evaluate the sociolinguistic variation, if any. For the purpose, the Indian and Hong Kong components of the International Corpus of English (ICE-Ind and ICE-HK) will be used as sources of analysis.
An Introduction to Using Corpora with EFL Learners Mark Donnellan Abstract. The use of corpora in the classroom represents an innovative way to enable English language learners to undertake independent study of lexical and grammatical patterns; however, only a limited amount of investigation into the use of corpora with students exists. This paper will first briefly introduce pertinent literature, which will give a basic overview of corpus linguistics. The paper will report on the use of the British National Corpus (BNC) and other corpus tools with students in a semester long course in a Japanese university with advanced EFL learners. These students undertook a series of tasks and projects, which allowed them to achieve the overall course goal of being able to conduct independent research into lexical and grammatical patterns. In order to assess student progress and to gather student opinions about the course and the use of corpora, data was collected in two ways: pre and post CEFR style student self-assessments, and a course reflection and evaluation survey. The results indicated that the students had progressed throughout the semester and that they had a largely positive opinion of the course. In concluding the paper, suggestions for teachers wishing to use corpora and corpus tools with their students will be offered.
Bilingual Parallel Corpora for Linguistic Research Irene Doval Reixa Abstract. In this paper it will reflect on the specific needs of the linguistic research regarding the construction of bilingual parallel corpora and primarily on the conclusions to be drawn for their design, compilation and domains. A research group of the university in Santiago is currently building a bilingual parallel corpus (Corpus PaGeS) consisting of original texts in German and Spanish together with their translations into the other language, as well as German and Spanish translations from a third language. This corpus was originally intended for linguistic research purposes, specifically, the analysis of the expression of the spatial relations. Initially a brief survey of some significant existing related corpora is performed, and their limitations for linguistic studies are outlined. The different issues that were taken into account for the design of the corpus will be explained, such as type of texts, domains, regional language variety or quality and direction of translations. After describing the manual preparation process of the texts to make the documents suitable for further processing it is explained the manual and automatic annotation procedure: the metadata, and the automatically linguistic annotation. Then the process of sentence alignment and the manual review of the alignment are described and finally the next steps of future work are outlined
An Interaction Approach Between Services for Extracting Relevant Data From Tweets Corpora Mehdy Dref and Anna Pappa Abstract. We present a system based on the need of special infrastructure adequate to software agents to operate, to compose and make sense from the contents of the Web resources through the development of a multi-agent system oriented services interactions. Our method follows the different construction ontology techniques and updates them by extracting new terms and integrate them to the ontology.It is based on the detection phrases via the ontological database DBPedia. The system treats each syntagme extracted from the corpus of messages and verifies whether it is possible to associate them directly to a DBPedia knowledge. In case of failure, these service agents interact with each other in order to find the best possible answer to the problem, by operating directly in the phrase, trying to semantically modify it, until the association with ontological knowledge becomes possible. The advantage of our approach is its modularity : it is both possible to add / modify / delete a service or define a new one, and then influence the outcome product. We could compare the results extracted from a heterogeneous body of messages from the Twitter social network with Tagme method, based mainly on storage and annotation of encyclopaedic corpus.
CORALSE: Design of a Corpus of Spanish Sign Language Ana-María Fernández Soneira, Inmaculada C. Báez Montero and Eva Freijeiro Ocampo Abstract. The approval of the law for the recognition of Sign Languages and its subsequent development (together with the laws enacted by the regional governments and the work of universities and institutions such as CNLSE) has changed the landscape of the research activity carried out in the field of SL in Spain. In spite of these social advances, a corpus of Spanish Sign Language (LSE) has not yet been compiled. The average Sign Language corpus is traditionally composed of collections of annotated or tagged videos that contain written material aligned with the main Sign Language data. The compiling project presented here, CORALSE, proposes: 1) to collect a representative number of samples of language use; 2) to tag and transcribe the collected samples and build an online corpus; 3) to advance in the description of the grammar of LSE; 4) to provide the scientific background needed for the development of materials for educational purposes; and 5) to advance in the development of different types of LSE.
On the Way To the Relevant Grammatical Tagset for Tatar National Corpus Alfiia Galieva, Bulat Khakimov and Airat Gatiatullin Abstract. The development of the metalanguage for annotation is one of the topical issues in modern corpus linguistics. One of the main problems in the development of a grammatical tagset for the Tatar National Corpus is to identify the inventory level of inflectional categories and to create an optimal metalanguage of description. We discuss the factors that complicate the process of grammatical annotation for Turkic corpora in general, including the need to overcome the influence of the Indo-European grammatical tradition in the description of the phenomena of Turkic languages, the lack of generally accepted standards for corpus annotation, the lack of a common metalanguage used to describe grammatical categories of Turkic languages, poor differentiation of word-building and form-building in Turkic languages, etc. In the course of work on the system of grammatical annotation of the Tatar Corpus, we made an inventory of grammatical categories of the Tatar language and developed a metalanguage for describing them. Currently, the developed grammatical tagset contains 93 tags. Tags for parts of speech and grammatical categories were created to meet the worldwide standards, primarily the Leipzig glossing rules.
The Structure of Spanish Verbless Sentences Oscar Garcia-Marchena Abstract. Spanish verbless utterances in the CORDE corpus have been classified in a taxonomy and annotated for distribution frequency and syntactic properties (part of speech of the head, structure and syntactic type). This work has allowed to note that Spanish verbless utterances are a non-negligible part of oral utterances: they amount to around 19% of the 63,000 utterances from the corpus, both in root and subordinate contexts. Among these verbless utterances, fragments are significantly more frequent as roots than verbless clauses, but they are both equally rare in subordination.
A French Weblog Corpus for New Insights on Blog Post Tagging Ivan Garrido-Marquez, Laurent Audibert, Jorge García-Flores, François Lévy and Adeline Nazarenko Abstract. The rapid evolution and informational growth of blogs requires enhanced functionality for searching, navigating and linking content. This paper presents the French Blog Annotation Corpus \textsc{FLOG}, intended to provide a research testbed for the study of annotation practices, and specifically tagging and categorizing blog posts. The corpus covers a ten year time span of blog posts on cooking, law, video games and technology. Statistical analysis of the corpus suggests that tag annotation of posts is more frequent than category attribution, but on the other hand categories provide a richer semantic structure for post classification and search. The review of the state of the art on automatic tag suggestion shows that tag suggestion tools are not of widespread use yet between bloggers, which might be a consequence of methods that do not take into account the past tagging history of the blog, the context of the post within the blog and the tagging pattern of each blog author.
A Corpus Analysis of Sentence-Initial "Lo que". A Constructionist Approach Anja Hennemann Abstract. The present paper is concerned with sentence-initial Lo que ‘What’ – a so-called “copulativa enfática de relativo” (NGLE 2009: 3024). The focus marker Lo que is considered a construction because its form, function and even meaning are not strictly predictable from its component parts. The work with the CREA and CORDE shows that the construction’s frequency has increased over time so that the construction can nowadays be said to occur “with sufficient frequency” (Goldberg 2006: 5; cf. also Hilpert’s concept of constructional change, 2013).
The Poetic Word of Fernando de Herrera: An Approach Through Corpus And Computational Linguistics Laura Hernández Lorenzo Abstract. Great advances in Corpus Linguistics have led to new approaches in Literary Studies. This paper applies these new tools to the analysis of Golden Age Spanish poetry written by Fernando de Herrera, the author of Anotaciones a Garcilaso de la Vega (1580) and one of the greatest poets of his time. Through a keyword method combined with lexical concordances, we will try to overview principal characteristics and differences between subgenres in Herrera’s poetry, dealing with the poems he published in life (known as H) and getting results which help in the academic debate about this poet’s works and style.
Sentence Length and NP Complexity of General and Medical Written Academic and Media Texts. An Analysis Using a Trained Syntactic Parser. Carlos Herrero Zorita and Antonio Moreno-Sandoval Abstract. The main objective of this work is to perform a comparative analysis of sentence and main noun phrases complexity in two different types of discourses, written media and academic prose, using a trained syntactic parser (Stanford PCFG Parser). For this purpose, we have selected three written sources: a general media corpus, a medical media subcorpus and a medical academic prose subcorpus. From a total of more than 160000 sentences, we have carefully selected as the study sample a total of 300, which have been morphologically and syntactically annotated. Influenced by other studies related to syntax and statistics, our hypothesis is that NPs from academic prose and written media will contain four or more words, and those belonging to academic prose will be larger than the latter. The NPs studied are those that perform the main functions of the clause: subject, object (direct and indirect), attribute and time expressions. The results show a confirmation of our hypothesis. The academic subcorpus has the longest sentences and more complex NPs than the other texts. On the other hand, written media corpora achieve smaller NPs but their results are quite similar.
SEPAME2: The First Longitudinal Corpus for Greek as an L2 Maria Iakovou, Olga Dima, Irianna Vasiliadi-Linardaki, Sofia-Nefeli Kitrou, Flora Vlachou, Marina Koutsoubou, Tatiana Katsina, Stavrialena Perrea, Froso Pappa, Xristina Kostakou and Maria Kavvadia Abstract. SEPAME2 is the first attempt to design and implement a longitudinal corpus of different L1 learners of Greek as an L2. It supports the idea that the best way to learn a language is by being “pushed” to use it in different circumstances/registers and by taking advantage of personalized feedback modes, so that the language becomes not only the result of the learning process, but also the source of further metalinguistic reflection. In this preliminary presentation, main design principles as well as future implications of the SEPAME2 project are discussed.
Use of That-Clauses After Reporting Verbs in Asian Learners’ Speech and Writing: Frequency, Verb Type, and That-Omission Shin'Ichiro Ishikawa Abstract. That-clauses after reporting verbs (VTHAT) are widely used in L2 and L1 English. Previous studies have examined their frequency, common reporting verbs, and omission of the complementizer, but how varied learners use VTHAT in their writing and speech and how they differ from native speakers in usage of VTHAT has not been wholly elucidated. Therefore, using the International Corpus Network of Asian Learners of English (ICNALE), we compared the uses of VTHAT by six groups of Asian learners of English (ALE) at different L2 proficiency levels and English native speakers (ENS). Our analyses have revealed that ALE use VTHAT less often than ENS, omit the complementizer more often both in speech and writing, and tend to use reporting verbs such as “think,” “believe,” “agree,” and “know.”
Corpus Analysis of Discourse Markers in Corporate Reports Involving Climate Change Oleksandr Kapranov Abstract. This article involves a corpus-assisted quantitative analysis of discourse markers (further in the article - DMs) identified in the climate change sections of corporate annual reports by British Petroleum and the Royal Dutch Shell corporations. Corporate discourse involving climate change has been amply elucidated from the linguistic macro-perspective (Koteyko, 2012; Livesey, 2002), whilst the discursive micro-perspective still receives little attention. The present corpus-assisted study seeks to elucidate corporate discourse on climate change from the micro-perspective by identifying DMs in climate change sections of annual reports by British Petroleum and the Royal Dutch Shell corporations. Additionally, the novel aspect of the present study involves a juxtaposition of the to-be-identified DMs in annual reporting by these two corporations. The corpus of the study involves climate change sections of the annual reports by British Petroleum and the Royal Dutch Shell Group within the time frame from 2010 until 2015. The corpus has been analysed in WordSmith (Scott, 2012). Results of the data analysis indicate that the most frequent DMs used in climate change discourse by British Petroleum involve and (M = 4,2%), as (0,9%), also (M = 0,4%), likely (0,3%), and but (M = 0,15%), while DMs identified in the Royal Dutch Shell Group’s climate change discourse comprise and (M = 2%), but (M = 0,15%), also (M = 0,6%), such (M = 0,2%), however (M = 0,2%), accordingly (M = 0,1%), furthermore (M = 0,16%), further ( M = 0,1%), and therefore (M = 0,1%). These findings are further presented and discussed in detail in the article.
Grammatical Disambiguation in the Tatar National Corpus Bulat Khakimov, Ramil Gataullin and Rinat Gilmullin Abstract. This paper concerns the issues of grammatical ambiguity in the Tatar National Corpus and the possiblities for automation of the disambiguation process in the corpus. Grammatical ambiguity is widely represented in agglutinative languages like Turkic or Finno-Ugric. In order to build the grammatically disambiguated subcorpus, wе have developed a special software module which searches for ambiguous tokens in the corpus, collects statistical information and allows creating and implementing the formal disambiguation rules for different ambiguity types. Disambiguation in the corpus is based on the context-oriented classification of ambiguity types which has been carried out on statistical corpus data in the Tatar language for the first time. We can say that we use the corpus as a source of our research and at the same time as a destination for implementing the results. Estimated cumulative effect of disambiguation of the identified frequent ambiguity types in the Tatar National Corpus can be up to 50%.
Supracorpora Databases as Corpus-Based Superstructure for Manual Annotation of Parallel Corpora Mikhail Kruzhkov Abstract. This paper presents a new type on corpus-based information resource: supracorpora databases (SCDBs). SCDBs are designed to enhance functionality of linguistic corpora by supporting customizable manual annotation of linguistic items, including multi-word items. This is similar to query result categorization functions available in some corpora and to functions provided by some of the standalone corpus annotation tools, although many features supported by SCDBs are more sophisticated (e.g. they allow for detailed annotation of multi-word linguistic items, including specification of main words and immediate context). More importantly still, SCDBs allow researchers to create annotated translation correspondences (TCs) in parallel corpora. Aggregation of searchable TCs in a SCDB represents a unique information resource that facilitates creation of new explicit knowledge about cross-linguistic correspondences and translation models. An overview of four SCDBs developed up to date is also included in this paper.
Designing and Validating an Annotation Model of Dynamic Modality for English and Spanish: Issues and Problems Julia Lavid, Marta Carretero and Juan Rafael Zamorano Abstract. In this paper we set forth an annotation model for dynamic modality in English and Spanish, given its relevance not only for contrastive linguistic purposes, but also for its impact on practical annotation tasks in the Natural Language Processing (NLP) community. An annotation scheme is proposed, which captures both the functional-semantic meanings and the language-specific realisations of dynamic meanings in both languages. The scheme is validated through a reliability study performed on a randomly selected set of one hundred and twenty sentences from the MULTINOT corpus, resulting in a high degree of inter-annotator agreement. We discuss our main findings and give attention to the difficult cases as they are currently being used to develop detailed guidelines for the large-scale annotation of dynamic modality in English and Spanish.
The World War I Poets as a Brand and the Corpus-Derived Empiricism of Their Subtext William Louw and Marija Milojkovic Abstract. Scientific approaches to the explication of poetry have been around since antiquity. However, whilst the apparent deeds of mankind at war are often full of pagan sentiments such as ‘Dulce et decorum est pro patria mori…’, protests at the very notion of war are often brushed aside by society. But can we rely upon the logical empiricism of the whole language, as represented by a reference corpus, to be the latent apologist for our deepest sentiments and to express them by the probability of induction alone? In common with the business theme of our conference, this paper will examine the brand of the War Poets, and in particular of some of Wilfred Owen’s ‘first lines’, as we recall that in business, the ethos of a ‘… brand of something such as a way of thinking or behaving is a particular kind of it (Cobuild English Language Dictionary, Second Edition, 1995). In this paper, the collected subtexts of the War Poets will be used to create their brand, beyond what used to be called analogue induction, beyond the information given (Bruner, 1974).
Review Web Pages Collector Tool for Thematic Corpus Creation Lisa Medrouk, Anna Pappa and Jugurtha Hallou Abstract. We present a method of automaticaly extracting and gathering specific data text from web pages, creating a thematic corpus of reviews for opinion mining and sentiment analysis. The internet is an immense source of machine-readable texts \cite{mcenery1996} suitable for linguistic corpus studies\cite{Fletcher04}\cite{Kilgarriff2003}. Though, specific tools of web information extraction research domain as well as from the NLP do not include an open source system able to provide a thematic corpus according to an end-user request\cite{Sharoff2006}.\\ The need of use natural texts as databank for opinion mining and sentiment analysis is increased since the expansion of the digital interaction between users and blogs, forums and social networks.\\ The RevScrap system is designed to provide an intuitive, easy-to-use interface able to extract specific information from accurate web pages returned by search engine's request and create a corpus composed by comments, reviews, opinions, as expressed by users' experience and feedback. The corpus is well structured in xml documents, reflected Singler's design criteria\cite{sinclair01}..
A Corpus-Attested View of Business English Metaphors Marija Milojkovic Abstract. Louw’s (2009) idea to create a corpus-attested dictionary of literary terms may initially involve analysing uncontested examples of irony, antithesis and the like, in corpus terms. The paper analyses non-literal expressions in a Guardian business text against the background of the corpus-attested definition of metaphor, arrived at through the detailed analysis of two metaphors in Yeats’s ‘The Circus Animals’ Desertion’ by means of Louw’s Contextual Prosodic Theory (CPT) (Milojkovic 2016). Given that the delexical expressions in the Guardian text are not meant to convey any meanings other than explicit, and that their relexicalisation may be achieved only through other delexical expressions, the paper suggests that they be called delexical rather than metaphorical.
Extracting Domain-Specific Features for Sentiment Analysis Using Simple NLP Techniques: Running Shoes Reviews Antonio Moreno-Ortiz, Chantal Pérez-Hernández and Cristian Gómez-Pascual Abstract. This paper is a first attempt at designing a procedure to derive a domain-specific lexicon (both single words and multiword expressions) from an opinion corpus of specialized language. We use a corpus of reviews of running shoes as case study, compiled for this particular purpose. The main goal is to obtain a first approximation to the task of automatically extracting domain-specific expressions of sentiment to be used by our sentiment analysis software, Lingmotif.
Categories and Genres in CHET and CECHeT Isabel Moskowich and Begoña Crespo Abstract. This paper describes one of the concerns of corpus compilers when gathering samples of texts. In particular, it explores how to classify such samples in wider categories in the case of the Corpus of English Chemistry Texts (CECheT), one of the subcorpus of the Coruña Corpus of English Scientific Writing. To this end, authors have revised the literature to find (and try to solve) the terminological mess that includes laves such as genre, text-type and textual category. These laves have been widely related either to the form or the function of the text. In this paper the idea of “communicative format” is used to bring together form and function as they are seen as intermingled in texts at all levels.
Building Corpus-Based Semantic Classifications of Some Tatar Affixes Olga Nevzorova, Alfiia Galieva and Dzhavdet Suleymanov Abstract. This study is aimed at exploring the semantic properties of Tatar affixes. Turkic languages have complicated morphology and syntax, which is a challenge for language processing. The fundamental principle of inflection and derivation in Tatar, as well as in other Turkic languages, is agglutination, when the stem joins postpositive affixes in a strictly determined order. The Tatar language has affixes of different types: a) derivational affixes expressing only lexical meaning and forming new words; b) inflectional affixes changing the word form (for example, case affixes); c) affixes serving as means of derivation as well as inflection. The current study is devoted to the ambiguous Tatar –lık polyfunctional affix which may be joined to nominal, adjectival and verbal stems and form derivatives of different types depending on contextual environment, the meaning of the stem and the composition of the affixal chain of a derivative. -Lık affix is a productive affix in modern Tatar which builds nominal, adjectival and verbal derivatives. The answer to the question of the number of the types of derivatives and word forms produced with -lık affix is not trivial, and different researchers distinguish different types of derivatives. Based on a thorough analysis of Tatar derivatives containing - lık affix we identified some empirical features of these constructs and then performed their manual and automatic classification. Four classes were distinguished. For our experiments we used data from the Tatar National Corpus “Tugan Tel” (http://corpus.antat.ru). The results obtained may be used for disambiguation in Tatar National Corpus and for analyzing other Tatar ambiguous affixes.
The Potential of Corpus Based Collocation Instruction on the Awareness Levels of the Turkish EFL Students in Terms of Reading Performances Ali Sukru Ozbay Abstract. It is an old consensus by now that languages all throughout the world consist of prefabricated chunks or multi-word combinations which are important for EFL learners in their efforts to perceive and produce language of native speakers in the form of combinations or chunks. The combinative nature of English language lends itself in various ways and sometimes they are called as “collocations” which constitute the biggest part of these chunks. In this respect, it is understandable that collocation learning plays a significant role for EFL learners. Thus, the primary purpose of this research is to explore whether corpus based explicit collocation instruction will help the EFL students gain awareness of collocations. Another purpose is to reveal the extent to which EFL learners recognize collocations in different contexts. The final purpose is to observe whether this informed exposure will result in better reading performances in English. The research reported on an experimental study regarding the effect of corpus-based explicit collocation instruction on EFL students' reading performance. The data for the study were obtained through pre-test and post- test scores and interview which included open-ended questions. Tertiary level EFL students (n=50) from the English department of a middle sized university in the Eastern Black Sea region in Turkey participated in the study and the study lasted for eight weeks (spring term). The control group (n = 25) received in-class reading instruction and the experimental group (n = 25) integrated collocations into their reading processes. The study investigated whether there were any differences between the experimental and the control groups in terms of gaining awareness of collocations and exhibiting better reading performances after corpus-based explicit collocation instruction is delivered on a scheduled-order. Based on the analyses of students' reading scores, the main findings showed that the experimental group showed significant improvement when compared to the control group. Both post-test scores and the answers of the students participated in the interview proved that corpus-based explicit collocation instruction had positive effect on the awareness level and reading performances of EFL students. The study, therefore, concludes that English as a Foreign Language learners' use of collocations or word combinations has potential to create more effective reading performance.
Computerized Corpus Based Investigation of the Use of Multi-Word Combinations and the Developmental Stages by Tertiary Level EFL Learners Ali Sukru Ozbay and Mustafa Naci Kayaoğlu Abstract. It is an old consensus by now that languages all throughout the world consist of prefabricated chunks or multi-word combinations which are important for EFL learners in their efforts to perceive and produce language of native speakers in the form of combinations or chunks. The combinative nature of English language lends itself in various ways and sometimes they are called as “collocations” which constitute the biggest part of these chunks. In this respect, it is understandable that collocation learning plays a significant role for EFL learners. Thus, the primary purpose of this research is to explore whether corpus based explicit collocation instruction will help the EFL students gain awareness of collocations. Another purpose is to reveal the extent to which EFL learners recognize collocations in different contexts. The final purpose is to observe whether this informed exposure will result in better reading performances in English. The research reported on an experimental study regarding the effect of corpus-based explicit collocation instruction on EFL students' reading performance. The data for the study were obtained through pre-test and post- test scores and interview which included open-ended questions. Tertiary level EFL students (n=50) from the English department of a middle sized university in the Eastern Black Sea region in Turkey participated in the study and the study lasted for eight weeks (spring term). The control group (n = 25) received in-class reading instruction and the experimental group (n = 25) integrated collocations into their reading processes. The study investigated whether there were any differences between the experimental and the control groups in terms of gaining awareness of collocations and exhibiting better reading performances after corpus-based explicit collocation instruction is delivered on a scheduled-order. Based on the analyses of students' reading scores, the main findings showed that the experimental group showed significant improvement when compared to the control group. Both post-test scores and the answers of the students participated in the interview proved that corpus-based explicit collocation instruction had positive effect on the awareness level and reading performances of EFL students. The study, therefore, concludes that English as a Foreign Language learners' use of collocations or word combinations has potential to create more effective reading performance.
'it is proper ſubſerviently, to inquire into the nature of experimental chemiſtry': Difficulties in Reconciling Discipline-Based Characteristics and Compilation Criteria in the Selection of Samples for CECheT. Luis Miguel Puente Castelo and Leida Maria Monaco Abstract. This paper discusses the compilation of the beta version of CECheT, the subcorpus devoted to Chemistry in the Coruña Corpus of English Scientific Writing, and reflects on the difficulties faced during this process and how they were overcome. The historical context of science will be examined, particularly that of chemistry, and how this affects the process of compilation. Attention will also be paid to the compilation criteria used in the whole Coruña Corpus, including those regarding the appropriateness of authors and text samples (Moskowich, 2012), and how these criteria have been applied to the compilation of CECheT in order to make it representative of the practices of the discipline at the time. Finally, the corpus will be described briefly, looking at a series of parameters: the topics of the texts, the size of samples, their chronological distribution, as well as the geographical origin and sex of the authors represented.
Bi-Texting Your Food: Helping the Gastro Industry Reach the Global Market Rosa Rabadan, Hugo Sanjurjo-González and Veronica Colwell Abstract. The technologization of cross-linguistic communication and the expansion of the learning of for- eign languages has helped create new, non- linguist users. Corpus-based applications offer a way of responding to these new challenges. This presentation focuses on the design and building process of the BiTeX app, designed to write recipes in both En and Es through controlled language choices. The starting point is a custom-made, rhetorically and POS annotated, En-Es comparable corpus of recipes containing 135,912 words in the En and 145,449 in the Es subcorpus respectively. The BiText prototype has been developed using MongoDB 1, Express 2, Node.js 3 and jQuery 4, which allow for multiple concurrent connections to be handled without I/O blocks. BiTeX aims at helping improve international communication in the restaurant and catering community, as well as boosting collateral business niches in including recipe books, tourist-oriented websites, etc.
The Emerging Parties' Manifestos for the 2015 Spanish General Elections: a Comparative Analysis of Lexical Choices Hanna Skorczynska Abstract. The study looks into the lexical choices made in the political manifestos of two new political parties participating in the 2015 Spanish general election: Ciudadanos and Podemos, which share similar programmatic goals and aim to reach voters dissatisfied with the dominating two big parties: the conservative PP and the socialist PSOE. The two manifestos are compared using the ‘compare corpora’ tool of Sketch Engine, and the keywords and their collocational patterns are analyzed with a special focus on evaluative adjectives. Finally, the two manifestos are also compared to the Spanish presidents’ inaugural speeches. The results suggest that the manifestos rely on distinct lexical choices, pointing to differences in the ideological stance. The manifesto of Podemos clearly breaks away from the traditional political discourse in Spain, while Ciudadanos is more conventional in this sense.
A Corpus-Driven Approach to Sentiment Analysis of Patient Narratives Keith Stuart, Ana Botella and Imma Ferri-Miralles Abstract. This paper describes the linguistic analysis of a corpus of patient narratives that was used to develop and test software to carry out sentiment analysis on the aforementioned corpus. There is a growing body of research on the relationship between sentiment analysis, social media (for example, Twitter) and health care, but less research on sentiment analysis of patient narratives (being longer and more complex texts). The motivation for this research is that patient narratives of experiences of the National Health Service (NHS) in the UK provide rich data of the treatment received. The corpus threw up some unexpected results that may be of benefit for researchers of sentiment analysis. The linguistic problems encountered have been divided into three sections: the noisy nature of large corpora; the idiomatic nature of language; the nature of language in the clinical domain. This article gives an overview of the project and describes the linguistic problems that arose out of the project, which tried to find a means to automate the analysis of patient feedback on health services.
Seeking the Portuguese Vocabulary Profile Shintaro Torigoe Abstract. This paper reports on a pilot study from the Portuguese Vocabulary Profile project. In this pilot study, a vocabulary list for learners of Portuguese was developed by analysing learner corpora, an approach inspired by CEFR-based wordlists, such as the English Vocabulary Profile. A draft wordlist was constructed from two learner corpora of L2 Portuguese, the Corpora do PLE and the Corpus de PEAPL2. The draft wordlist was then compared to the LMCPC, a wordlist derived from a million-word native speaker corpus, in order to investigate differences between learners and native speakers and to identify aspects of the wordlist needing improvement. The pilot study indicated that the use of Portuguese by the Intermediate and Advanced learner is quite different from that of native speakers and that learner’s language use was affected by data collection tasks and learning environments.
Corpus Based Lexical Semantic Analysis of Minimal Pairs of Deverbal Adjectives with the Spanish '-dor/-nte' Suffixes Ryo Tsutahara Abstract. This paper shows the semantic differences and similarities between Spanish active deverbal adjectives with the -dor and -nte suffixes. Minimal pairs of derivatives with the suffixes will be quantitatively analyzed, as the patterns of modification are the center of the interest. This study concludes that the derivatives’ modification patterns are parallel with the denotation patterns of nominal derivatives with the same suffixes.
Japanese L1 Speakers Blogging in Spanish: Motivation, Topics and Linguistic Properties M. Pilar Valverde Ibañez Abstract. Web blogs are a particular type of text because of their ambivalent nature: they are private and public, speech and writing, monologue and dialogue at the same time. They can be a good source of information about learners’ motivation to write in a foreign language and their favorite topics, and they have particular linguistic properties. To investigate these questions, we constructed a blog corpus made up of 2,125 texts coming from 48 Spanish blogs written by 43 Japanese L1 speakers.
Methodological Considerations on the Design and Compilation of a Spanish Learner Corpus An Vande Casteele and Kim Collewaert Abstract. As a corpus is a representation of the linguistic reality, it is important to have homogeneous, quantifiable and valid data. This article aims at discussing the issue of elaborating a corpus of oral data from language learners of Spanish. We hereby do not merely focus on the data collection, but also on the difficulties that arise regarding the experimental design, the selection of the participants, the elaboration of a transcription model and the analysis of the data. The discussion will be based upon our own research project, for which oral samples from Spanish language learners of different proficiency levels have been collected in order to be analysed cross-sectionally. Furthermore, this article focuses on the oral experiment specifically designed for this project, similar to those of previous studies on similar subjects. Next to this, we will also discuss the procedure used for the transcription of the data and finally, a codification system will be elaborated.
Semantic Structure, Speech Units and Facial Movements: Multimodal Corpus Analysis of English Public Speaking Miharu Fuyuno, Yuko Yamashita, Takeshi Saitoh and Yoshitaka Nakajima Abstract. This study examines connections between the semantic structure and speech units, and characteristics of facial movements in EFL learners’ public speech. The data were obtained from a multimodal corpus of English public speaking constructed from digital audio and video data of an official English speech contest held in a Japanese high school. Evaluation data of contest judges were also included. For the audio data, speech pauses were extracted with an acoustic analysis software, and the spoken content (text) of each speech unit embedded between two pauses was then annotated. The semantic structures of the speech units were analysed based on segmental chunks of clauses. Motion capturing was applied on video data; forty-two tracking points were set on each speaker's eyes, nose, mouths and face lines. The results indicated: (1) Speakers with higher evaluations showed a similar semantic structure pattern in their speech units. It was also confirmed as similar to that for NSE samples. (2) Horizontal facial movements and the angles of face rotations were extracted from motion capturing. The result is expected to be useful for defining a facial movement model that effectively describes good eye contacts in public speaking.