language testing 2007


Selected Readings on Language Testing

Edited by:
Professor Muftah S. Lataiwish
University of Garyounis
Faculty of Arts
Department of English
2007

Table of Contents

Muftah Lataiwish: Testing and Assessment
Carol A. Puhl: Continuous Assessment in the ESL Classroom
Augustin Simo Bobda: Testing Pronunciation
S. Kathleen Kitao & Kenji Kitao: Testing Communicative Competence
John M. Norris: Purposeful Language Assessment: Selecting the Right Alternative Test
Kassim Shaaban: Assessment of Young Learners
Joel Murray: Creating Placement Tests
Muftah Lataiwish (ed.): Formative vs. Summative Evaluation






Testing and Assessment
Dr. Muftah S. Lataiwish
University of Garyounis.
March 2005




Many students will always remember the shock of receiving their course results when they were twelve or thirteen years old. Some knew their mark was not going to be high, but to come bottom of the class was very upsetting. It was all made worse by the fact that the English teacher read out the names and results to the whole class, from first to last place. The humiliation was complete. Students can have very negative reactions towards tests, and this is no surprise when many of them have had experiences like this. The purpose of this article is to discuss a number of issues, namely testing and assessment, test writing, and types of test questions.
Why testing doesn't work
There are many arguments against using tests as a form of assessment. Some students become so nervous that they cannot perform and do not give a true account of their knowledge or ability. Other students can do well with last-minute cramming despite not having worked throughout the course. Once the test has finished, students can simply forget all that they had learned. Students become focused on passing tests rather than on learning to improve their language skills.
Testing is certainly not the only way to assess students, but there are many good reasons for including a test in your language course. A test can give the teacher valuable information about where the students are in their learning and can affect what the teacher will cover next. Tests help a teacher to decide whether his/her teaching has been effective and highlight what needs to be reviewed. Testing can be as much an assessment of the teaching as of the learning. Tests can give students a sense of accomplishment as well as information about what they know and what they need to review.
Fulcher (1996, 45) illustrated in his article "Testing Tasks: Issues in Task Design and the Group Oral" that in the 1970s and 1980s students in intensive English EFL programs at West Virginia University (USA) were taught in unstructured conversation courses. They complained that even though they had a lot of time to practice communicating, they felt as if they had not learned anything. Not long afterwards a testing system was introduced, and it helped to give them a sense of satisfaction that they were accomplishing things.
Tests can be extremely motivating and give students a sense of progress. They can highlight areas for students to work on and tell them what has and has not been effective in their learning. Tests can also have a positive effect in that they encourage students to review material covered on the course. At university, most of us experienced this first hand: we always learned the most before an exam. Tests can encourage students to consolidate and extend their knowledge. Tests are also a learning opportunity after they have been taken. The feedback after a test can be invaluable in helping a student to understand something he/she could not do during the test. Thus the test is a review in itself.
Making testing more productive
Despite all of these strong arguments for testing, it is very important to bear in mind the negative aspects we looked at first and to try to minimize their effects. First, teachers should try to make the test a less intimidating experience by explaining the purpose of the test to the students and stressing the positive effects it will have. Many students may have very negative feelings left over from previous bad experiences. Second, give the students plenty of notice and teach some revision classes beforehand. Third, tell the students that you will take into account their work on the course as well as the test result. Next, teachers should be sensitive when they hand out the results. I usually go through the answers fairly quickly, highlight any specific areas of difficulty and give the students their results on slips of paper. Moreover, I emphasize that individuals should compare their results with their own previous scores, not with those of others in the class. Finally, it is very important to remember that tests also give teachers valuable information on how to improve the process of evaluation. Teachers can ask themselves questions such as:
"Were the instructions clear?"
"Are the test results consistent with the work that the students have done on the course? Why or why not?"
"Did I manage to create a non-threatening atmosphere?"
Answering such questions will help the teacher to improve the evaluative process for next time.
Alternatives to testing
Using only tests as a basis for assessment has obvious drawbacks. They are 'one-off' events that do not necessarily give an entirely fair account of a student's proficiency. As we have already mentioned, some people are more suited to them than others. There are alternatives that can be used instead of or alongside tests. Ahmann and Glock (1981, 74), in Evaluating Student Progress: Principles of Tests and Measurements, illustrate the following:
• Continuous assessment
Teachers give grades for a number of assignments over a period of time. A final grade is decided on a combination of assignments.
• Portfolio
A student collects a number of assignments and projects and presents them in a file. The file is then used as a basis for evaluation.
• Self-assessment
The students evaluate themselves. The criteria must be carefully decided upon beforehand.
• Teacher's assessment
The teacher gives an assessment of the learner for work done throughout the course including classroom contributions.
Overall, I think that all the above methods have strengths and limitations and that tests have an important function for both students and teachers. By trying to limit the negative effects of tests we can ensure that they are as effective as possible. I do not think that tests should be the only criterion for assessment, but they are one of many tools that we can use. I feel that choosing a combination of methods of assessment is the fairest and most logical approach.

Test writing
If you think taking tests is difficult then you should try writing them! Writing a good test is indeed quite a challenge and one that takes patience, experience and a degree of trial and error. There are many steps you can take to ensure that your test is more effective and that test writing becomes a learning experience.
The elements of a good test
Harris (1969), Scannell and Tracy (1975), Heaton (1990), and Madsen (2001) all agree that the elements of a good test include test content that covers the basic language skills and their components, mainly grammar and vocabulary. Simple, comprehensible directions and ease of administration and scoring are also significant factors in testing. A good test will give us a more reliable indication of our students' skills, and it ensures that they do not suffer unfairly because of a poor question. How can we be sure that we have produced a good test? One way is simply to think about how we feel about it afterwards. Do the results reflect what we had previously thought about the skills of the students? Another simple way is to ask the students for some feedback. They will soon tell you if they felt a question was unfair or if a task type was unfamiliar.
Validity and Reliability of a test
A good test also needs to be valid. It must test what it is meant to test. A listening test that is followed by very complicated questions can be as much a test of reading as of listening. Likewise, a test that relies on cultural knowledge cannot fairly measure a student's ability to read and comprehend a passage. A test should also be reliable. This means that it should produce consistent results at different times. If the test conditions stay the same, different groups of students at a particular level of ability should get the same result each time. A writing test may not be reliable, as the marking may be inconsistent and extremely subjective, especially if there are a number of different markers. Thus, to make the test more reliable, it is essential to have clear descriptors of what constitutes each grade. In an oral interview it is important to ensure that the examiner maintains the same attitude with all the candidates. The test will be less reliable if the examiner is friendly with some candidates but stern with others. You should try to ensure that the test conditions are as consistent as possible.
The affect and other features of a good test
We must also bear in mind the affect of our tests. Has the test caused too much anxiety in the students? Are the students familiar with the test types in the exam? If a student has never seen a cloze passage before, she/he may not be able to produce a performance that reflects her/his true ability. The solution is to try to reduce the negative effects by using familiar test types and making the test as non-threatening as possible. Other features of a good test are that it contains a variety of test types and that it is as interesting as possible. A variety of test types will ensure that the students have to stay focused and will minimize the tiredness and boredom they can feel during a repetitive test. Finding reading passages that are actually interesting to read can also help to maintain motivation during a test. A test should also be as objective as possible; providing a marking key and descriptors can help with this.
Assessing difficulty
Another important feature of a good test is that it is set at an appropriate level. You can only really find this out by giving the test and studying the results. If everyone gets above 90%, you know the test is too easy; if everyone gets less than 10%, it is obviously too difficult. For tests that are not so extreme you will need to do some analysis, which you can do by examining the individual items for difficulty. First, mark all of the tests and divide them into three equal groups: high, middle and low. For each item, note how many candidates from the high group and from the low group got the answer correct (leave aside the middle group). To find the level of difficulty you need to do a quick calculation. Take one question and add the number of students from the high group who have the correct answer to the number from the low group. Then divide this by the total number of people in both groups (high and low). It is thought that if over 90% of candidates get the answer right, the item is too easy. If fewer than 30% get it right, it is too difficult. Also bear in mind that if most of the items fall in the 30s and 40s it would be best to rewrite the test. The same applies if most of the items fall in the 80s and 90s. The final step is to reject the items that are too easy or too difficult.
Always bear in mind though that the difficulty of an item may relate to whether it has been covered in class or it may give an indication of how well it was understood. Such test analysis can give us information about how effective our teaching has been as well as actually evaluating the test. Evaluating tests carefully can ensure that the test improves after it is taken and can give us feedback on improving our test writing.
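The item analysis described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name item_difficulty, the marking of each answer as 1 (correct) or 0 (incorrect), and the sample marks are hypothetical and do not come from the article; the 90% and 30% cut-offs are the ones given in the text.

```python
def item_difficulty(papers):
    """Rate each item from the high and low thirds of the marked papers.

    papers: one list per candidate, holding 1 (correct) or 0 (incorrect)
    for every item. This layout is an assumption made for the sketch.
    """
    # Rank the papers by total score, best first.
    ranked = sorted(papers, key=sum, reverse=True)
    third = len(ranked) // 3
    if third == 0:
        raise ValueError("need at least three marked papers")
    # Keep the high and low groups; the middle group is left aside.
    high, low = ranked[:third], ranked[-third:]

    results = []
    for item in range(len(ranked[0])):
        correct = sum(p[item] for p in high) + sum(p[item] for p in low)
        proportion = correct / (len(high) + len(low))
        if proportion > 0.90:
            verdict = "too easy"
        elif proportion < 0.30:
            verdict = "too difficult"
        else:
            verdict = "keep"
        results.append((proportion, verdict))
    return results


# Hypothetical marks for six candidates on a three-item test.
sample_papers = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 0],
    [1, 0, 0],
]
for number, (proportion, verdict) in enumerate(item_difficulty(sample_papers), 1):
    print(f"Item {number}: {proportion:.0%} correct, {verdict}")
```

With these made-up marks the first item is answered correctly by everyone in both groups and would be rejected as too easy, while the third would be rejected as too difficult. As the text notes, such figures should still be read against what was actually covered in class.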
Below is a suggested procedure for writing a test.
1. Decide what kind of test it is going to be (achievement, proficiency)
2. Write a list of what the test is going to cover
3. Think about the length, layout and the format
4. Find appropriate texts
5. Weight the sections according to importance/time spent etc.
6. Write the questions
7. Write the instructions and examples
8. Decide on the marks
9. Make a key
10. Write a marking scheme for less objective questions
11. Pilot the test
12. Review and revise the test and key
13. After the test has been taken, analyze the results and decide what can be kept / rejected.
Test question types
In the section on test writing above, I looked at some of the difficulties of writing good tests and how to make tests more reliable and useful. I will now go on to look at testing and elicitation, and in particular at some different question types and their functions, advantages and disadvantages.

• Types of test
• Types of task
o Multiple choice
o Transformation
o Gap-filling
o Matching
o Cloze
o True / False
o Open questions
o Error correction and recognition
• Other techniques
Types of test
Before writing a test it is vital to think about what it is you want to test and what its purpose is. We must make a distinction here between proficiency tests, achievement tests, diagnostic tests and prognostic tests. Harris (1969), Heaton (1990), and Madsen (2001) all agree that:
• A proficiency test is one that measures a candidate's overall ability in a language; it is not related to a specific course.
• An achievement test on the other hand tests the students' knowledge of the material that has been taught on a course.
• A diagnostic test highlights the strong and weak points that a learner may have in a particular area.
• A prognostic test attempts to predict how a student will perform on a course.
There are of course many other types of tests. It is important to choose elicitation techniques carefully when you prepare one of the aforementioned tests.
Types of task
There are many elicitation techniques that can be used when writing a test. Below are some widely used types with some guidance on their strengths and weaknesses. Using the right kind of question at the right time can be enormously important in giving us a clear understanding of our students' abilities, but we must also be aware of the limitations of each of these task or question types so that we use each one appropriately.
Multiple choice
Choose the correct word to complete the sentence.
Manchester United is ________________ today for being one of Britain's most famous football clubs.
a) recommended b) reminded c) recognized d) remembered
In this question type there is a stem and various options to choose from. The advantages of this question type are that it is easy to mark and that having several distractors reduces the effect of guesswork. The disadvantage is that it can be very time-consuming to create. Effective multiple choice items are surprisingly difficult to write. It also takes time for the candidate to process the information, which leads to problems with the validity of the exam. If a low-level candidate has to read through a lot of complicated information before answering the question, you may find you are testing reading skills more than lexical knowledge. Multiple choice can be used to test most things, such as grammar, vocabulary, reading and listening, but you must remember that it is still possible for students to just 'guess' without knowing the correct answer.
Transformation
Complete the second sentence so that it has the same meaning as the first.
'Do you know what the time is, Ali?' asked Adam.
Adam asked Ali __________ (what) _______________ it was.
This time a candidate has to rewrite a sentence based on an instruction or a key word given. This type of task is fairly easy to mark, but the problem is that it does not test understanding. A candidate may simply be able to rewrite sentences to a formula. In the example above, however, the candidate has to paraphrase the whole meaning of the sentence, which minimizes this drawback. Transformations are particularly effective for testing grammar and understanding of form. This would not be an appropriate question type if you wanted to test skills such as reading or listening.
Gap-filling
Complete the sentence.
Check the exchange ______________ to see how much your money is worth.
The candidate fills the gap to complete the sentence. A hint may sometimes be included, such as a root verb that needs to be changed or the first letter of the word. This usually tests grammar or vocabulary. Again this type of task is easy to mark and relatively easy to write. The teacher must bear in mind, though, that in some cases there may be many possible correct answers. Gap-fills can be used to test a variety of areas such as vocabulary and grammar, and they are very effective in testing listening for specific words.
Matching
Match the word on the left to the word with the opposite meaning.
heavy        old
young        tall
dangerous    light
short        ugly
beautiful    safe
With this question type, the candidate must link items from the first column to items in the second. These could be individual words, words and definitions, parts of sentences, pictures and words, etc. Whilst it is easy to mark, candidates can get the right answers without knowing the words: a candidate who has most of the answers correct knows that the last one left must be right. To avoid this, include more words in the second column than are needed. Matching exercises are most often used to test vocabulary.
Cloze
Complete the text by adding a word to each gap.
This is the kind _____ test where a word _____ omitted from a passage every so often. The candidate must _____ the gaps, usually the first two lines are without gaps.
This kind of task type is much more integrative as candidates have to process the components of the language simultaneously. It has also been proved to be a good indicator of overall language proficiency. The teacher must be careful about multiple correct answers and students may need some practice of this type of task. Cloze tests can be very effective for testing grammar, vocabulary and intensive reading.
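A cloze passage of the kind described above can be generated mechanically. The following Python sketch is an illustration only: the function name make_cloze, the default interval of every seventh word, and the use of a fixed number of lead-in words to stand in for "the first two lines without gaps" are all assumptions, not part of the original article.

```python
def make_cloze(text, every=7, lead_in_words=20, gap="_____"):
    """Return (cloze_passage, answers) for a plain-text passage.

    every:         delete every n-th word after the lead-in (assumed value)
    lead_in_words: number of opening words left intact (assumed value)
    """
    passage, answers = [], []
    for index, word in enumerate(text.split()):
        position = index - lead_in_words  # negative while in the lead-in
        core = word.rstrip(".,;:!?")      # keep trailing punctuation visible
        if position >= 0 and (position + 1) % every == 0 and core:
            answers.append(core)
            passage.append(gap + word[len(core):])
        else:
            passage.append(word)
    return " ".join(passage), answers


cloze, key = make_cloze(
    "one two three four five six seven eight nine ten.",
    every=3,
    lead_in_words=2,
)
print(cloze)  # one two three four _____ six seven _____ nine ten.
print(key)    # ['five', 'eight']
```

The answer key produced this way lists only the original words. As the text warns, other answers may also be acceptable, so the teacher still needs to check each gap by hand.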
True / False
Decide if the statement is true or false.
Germany won the World Cup in 1966. T/F
Here the candidate must decide if a statement is true or false. Again this type is easy to mark but guessing can result in many correct answers. The best way to counteract this effect is to have a lot of items. This question type is mostly used to test listening and reading comprehension.
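The advice to use a lot of items can be checked with a little arithmetic. The sketch below is a hedged illustration, not part of the original article: it assumes that a candidate guesses every item independently with a 50% chance of being right, and the function name chance_by_guessing and the 60% pass mark are hypothetical choices made for the example.

```python
from math import comb


def chance_by_guessing(n_items, pass_mark, p_correct=0.5):
    """Probability of getting at least pass_mark items right by pure guessing.

    Assumes independent guesses, each correct with probability p_correct.
    """
    return sum(
        comb(n_items, k) * p_correct**k * (1 - p_correct) ** (n_items - k)
        for k in range(pass_mark, n_items + 1)
    )


# A 60% pass mark on a short test and on a longer test.
short_test = chance_by_guessing(5, 3)    # 3 out of 5
long_test = chance_by_guessing(20, 12)   # 12 out of 20
print(f"5 items:  {short_test:.2f}")     # 0.50
print(f"20 items: {long_test:.2f}")      # 0.25
```

Under these assumptions a pure guesser reaches the pass mark half the time on five items but only about a quarter of the time on twenty, which is consistent with the recommendation to include many items.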
Open questions
Answer the questions.
Why did Nada and Dina borrow the book?
Here the candidate must answer a simple question after a reading or listening passage or as part of an oral interview. If the answer is open-ended it will be more difficult and time-consuming to mark, and there may also be an element of subjectivity involved in judging how 'complete' the answer is, but it may also be a more accurate test. These question types are very useful for testing any of the four skills, but less useful for testing grammar or vocabulary.
Error correction
Find the mistakes in the sentence and correct them.
All if we students must have an identification card in order to sit for the test.
Errors must be found and corrected in a sentence or passage. The error could be an extra word, a mistake with a verb form, a missing word, etc. One problem with this question type is that some errors can be corrected in more than one way. Error correction is useful for testing grammar and vocabulary as well as reading and listening.
Error Recognition
Circle the incorrect word or phrase in the following statement:
The problem that our instructor gave us was different than  
                              A                           B                                 C
that given to the other class.
  D  
The examinee is required to indicate which of several underlined parts of a sentence is unacceptable in formal written English, or to indicate that the sentence contains no "error". This type is useful for testing writing and grammar.
In short, there are of course many other elicitation techniques such as translation, essays, dictations, ordering words/phrases into a sequence and sentence construction (He/go/school/yesterday). It is important to ask yourself as a teacher what exactly you are trying to test, which techniques suit this purpose best and to bear in mind the drawbacks of each technique. Awareness of this will help you to minimize the problems and produce a more effective test.
References:
Ahmann, J.S. & Glock, M. (1981). Evaluating Student Progress: Principles of Tests and Measurements, 6th ed. Boston: Allyn and Bacon.
Fulcher, Glenn. (1996). "Testing Tasks: Issues in Task Design and the Group Oral." Language Testing 13(1), 23-51.
Gronlund, N.E. (1981). Measurement and Evaluation in Teaching. New York: Macmillan.
Harris, David. (1969). Testing English as a Second Language. New York: McGraw-Hill Book Company.
Heaton, J.B. (1990). Classroom Testing. New York: Longman.
Madsen, Harold. (2001). Techniques in Testing. Oxford: OUP.
Ornstein, A.C. & Lasley, T.J. II. (2000). Strategies for Effective Teaching, 3rd ed. Boston: McGraw Hill.
Scannell, Dale & Tracy, D.B. (1975). Testing and Measurement in the Classroom. Boston: Houghton Mifflin Company.

---------------------------------------------------------------------------------------

Prof. Muftah S. Lataiwish, Department of English, Faculty of Arts, University of Garyounis. Email: lataiwish@garyounis.edu


Testing Communicative Competence
S. Kathleen Kitao, Doshisha; Kenji Kitao, Doshisha
Testing language has traditionally taken the form of testing knowledge about language, usually knowledge of vocabulary and grammar. However, there is much more to being able to use language than knowledge about it. Dell Hymes proposed the concept of communicative competence. He argued that a speaker may be able to produce grammatical sentences that are completely inappropriate. In communicative competence, he included not only the ability to form correct sentences but also the ability to use them at appropriate times. Since Hymes proposed the idea in the early 1970s, it has been expanded considerably, and various types of competencies have been proposed. However, the basic idea of communicative competence remains the ability to use language appropriately, both receptively and productively, in real situations.
The Communicative Approach to Testing
What Communicative Language Tests Measure
Communicative language tests are intended to be a measure of how the testees are able to use language in real life situations. In testing productive skills, emphasis is placed on appropriateness rather than on ability to form grammatically correct sentences. In testing receptive skills, emphasis is placed on understanding the communicative intent of the speaker or writer rather than on picking out specific details. And, in fact, the two are often combined in communicative testing, so that the testee must both comprehend and respond in real time. In real life, the different skills are not often used entirely in isolation. Students in a class may listen to a lecture, but they later need to use information from the lecture in a paper. In taking part in a group discussion, they need to use both listening and speaking skills. Even reading a book for pleasure may be followed by recommending it to a friend and telling the friend why you liked it.
The "communicativeness" of a test might be seen as being on a continuum. Few tests are completely communicative; many tests have some element of communicativeness. For example, a test in which testees listen to an utterance on a tape and then choose from among three choices the most appropriate response is more communicative than one in which the testees answer a question about the meaning of the utterance. However, it is less communicative than one in which the testees are face-to-face with the interlocutor (rather than listening to a tape) and are required to produce an appropriate response.
Tasks
Communicative tests are often very context-specific. A test for testees who are going to British universities as students would be very different from one for testees who are going to their company's branch office in the United States. If at all possible, a communicative language test should be based on a description of the language that the testees need to use. Though communicative testing is not limited to English for Specific Purposes situations, the test should reflect the communicative situation in which the testees are likely to find themselves. In cases where the testees do not have a specific purpose, the language that they are tested on can be directed toward general social situations where they might be in a position to use English.
This basic assumption influences the tasks chosen to test language in communicative situations. A communicative test of listening, then, would test not whether the testee could understand what the utterance, "Would you mind putting the groceries away before you leave" means, but place it in a context and see if the testee can respond appropriately to it.
If students are going to be tested on communicative tasks in an achievement test situation, they must be prepared for that kind of test; that is, the course material must cover the sorts of tasks they are being asked to perform. For example, you cannot expect testees to perform such functions as requests and apologies appropriately, and evaluate them on it, if they have been studying from a structural syllabus. Similarly, if they have not been studying how to write business letters, you cannot expect them to write a business letter for a test.
Tests intended to test communicative language are judged, then, on the extent to which they simulate real life communicative situations rather than on how reliable the results are. In fact, there is an almost inevitable loss of reliability as a result of the loss of control in a communicative testing situation. If, for example, a test is intended to test the ability to participate in a group discussion for students who are going to a British university, it is impossible to control what the other participants in the discussion will say, so the testees cannot all be observed in the same situation, as would be ideal for test reliability. However, according to the basic assumptions of communicative language testing, this is compensated for by the realism of the situation.
Evaluation
There is necessarily a subjective element to the evaluation of communicative tests. Real life situations don't always have objectively right or wrong answers, and so band scales need to be developed to evaluate the results. Each band has a description of the quality (and sometimes quantity) of the receptive or productive performance of the testee.
Examples of Communicative Test Tasks
Speaking/Listening
Information gap. An information gap activity is one in which two or more testees work together, though it is possible for a confederate of the examiner rather than a testee to take one of the parts. Each testee is given certain information but also lacks some necessary information. The task requires the testees to ask for and give information. The task should provide a context in which it is logical for the testees to be sharing information.
The following is an example of an information gap activity.
Student A
You are planning to buy a tape recorder. You don't want to spend more than about 80 pounds, but you think that a tape recorder that costs less than 50 pounds is probably not of good quality. You definitely want a tape recorder with auto reverse, and one with a radio built in would be nice. You have investigated three models of tape recorder and your friend has investigated three models. Get the information from him/her and share your information. You should start the conversation and make the final decision, but you must get his/her opinion, too.
(information about three kinds of tape recorders)
Student B
Your friend is planning to buy a tape recorder, and each of you investigated three types of tape recorder. You think it is best to get a small, light tape recorder. Share your information with your friend, and find out about the three tape recorders that your friend investigated. Let him/her begin the conversation and make the final decision, but don't hesitate to express your opinion.
(information about three kinds of tape recorders)
This kind of task would be evaluated using a system of band scales. The band scales would emphasize the testee's ability to give and receive information, express and elicit opinions, etc. If its intention were communicative, it would probably not emphasize pronunciation, grammatical correctness, etc., except to the extent that these might interfere with communication. The examiner should be an observer and not take part in the activity, since it is difficult to both take part in the activity and evaluate it. Also, the activity should be tape recorded, if possible, so that it can be evaluated later rather than in real time.
Role Play. In a role play, the testee is given a situation to play out with another person. The testee is given in advance information about what his/her role is, what specific functions he/she needs to carry out, etc. A role play task would be similar to the above information gap activity, except that it would not involve an information gap. Usually the examiner or a confederate takes one part of the role play.
The following is an example of a role play activity.
Student
You missed class yesterday. Go to the teacher's office and apologize for having missed the class. Ask for the handout from the class. Find out what the homework was.
Examiner
You are a teacher. A student who missed your class yesterday comes to your office. Accept her/his apology, but emphasize the importance of attending classes. You do not have any extra handouts from the class, so suggest that she/he copy one from a friend. Tell her/him what the homework was.
Again, if the intention of this test were to test communicative language, the testee would be assessed on his/her ability to carry out the functions (apologizing, requesting, asking for information, responding to a suggestion, etc.) required by the role.
Testing Reading and Writing
Some tests combine reading and writing in communicative situations. Testees can be given a task in which they are presented with instructions to write a letter, memo, summary, etc., answering certain questions, based on information that they are given.
Letter writing. In many situations, testees might have to write business letters, letters asking for information, etc.
The following is an example of such a task.
Your boss has received a letter from a customer complaining about problems with a coffee maker that he bought six months ago. Your boss has instructed you to check the company policy on returns and repairs and reply to the letter. Read the letter from the customer and the statement of the company policy about returns and repairs below and write a formal business letter to the customer.
(the customer's complaint letter; the company policy)
The letter would be evaluated using a band scale, based on compliance with formal letter writing layout, the content of the letter, inclusion of correct and relevant information, etc.
Summarizing. Testees might be given a long passage--for example, 400 words--and be asked to summarize the main points in fewer than 100 words. To make this task communicative, the testees should be given realistic reasons for doing such a task. For example, the longer text might be an article that their boss would like to have summarized so that he/she can incorporate the main points into a talk.
The summary would be evaluated on the basis of its inclusion of the main points of the longer text.
Testing Listening and Writing/Note Taking
Listening and writing may also be tested in combination. In this case, testees are given a listening text and they are instructed to write down certain information from the text. Again, although this is not interactive, it should somehow simulate a situation where information would be written down from a spoken text.
An example of such a test is as follows.
You and two friends would like to see a movie. You call the local multiplex theater. Listen to their recording and fill in the missing information in the chart so that you can discuss it with your friends later.
Theater Number    Movie        Starting Times
1                 Air Head
2                              4:00, 6:00, 8:00
3                              4:35, 6:45, 8:55
4                 Off Track
Summary
Communicative language tests are those which make an effort to test language in a way that reflects the way that language is used in real communication. It is, of course, not always possible to make language tests communicative, but it may often be possible to give them communicative elements. This can have beneficial backwash effects. If students are encouraged to study for more communicative tasks, this can only have a positive effect on their language learning.
________________________________________


Creating Placement Tests
by Joel Murray
Many educators around the world find themselves in growing schools or programs with more and more students needing placement into appropriate classes. Many schools use tests that are intended for other purposes--e.g., the Secondary Level English Proficiency (SLEP) Test, the Test of English as a Foreign Language (TOEFL), or the Michigan. However, these can be inaccurate for placing students in classes. Some schools have no placement procedures, and students are assigned to classes haphazardly.

    Many ESL/EFL teachers face the task of creating placement tests despite their lack of experience with testing beyond the tests they give in their own classrooms. Many educators don't know how to begin or proceed in creating placement tests that yield results that are reliable (i.e., that are consistent from one administration of the test to another), valid (i.e., that measure what you want them to measure and not something else) and accurate (i.e., that place students into the appropriate levels with little or no error). There are several important steps to creating good placement tests.
Assemble an Assessment Team
The first step in creating a placement test is assembling an assessment team composed of parties interested in the placement process. The concerns of individual stakeholders should be addressed to help define the purpose of the test and make decisions about it. The assessment team should include administrators responsible for students and the curriculum, coordinators responsible for implementing the curriculum, and teachers representing a cross-section of classes. The team should also include one element that is often missing in the design of placement tests: the test-takers themselves. As Bradshaw (1990, 27) notes, "there seems to be no reason why some degree of collection of test-takers' and test-users' reactions cannot be included as part of the design of any new test."
Focus on the Test-Takers
Once the assessment team has been assembled, they should describe or characterize the test-takers who will be taking the placement test. Characteristics taken into account should be those relevant to the test-takers' identity and cultural background: age, gender, first language, country of origin, place of residence, languages already learned, current stage of learning, reasons for taking the test or for entering the school, personal and professional interests, and amount of background knowledge. Describing the test-takers can help with the choice of appropriate test material and test techniques.
Define the Test Objectives
After the test-takers have been characterized, the objectives for the placement test should be outlined using the following questions: What is the aim of the test? What should be tested? Why? How should it be tested? These questions can be answered by looking at the curriculum, the students, the type of classes, and so on. Hughes (1989, 48) has stated that this stage is essential "to make oneself perfectly clear about what it is one wants to know and for what purpose."
Choose a Test Type
The objectives having been defined, the team should decide the type of placement test to be used and what it should contain. First, should the test be direct or indirect or a combination of the two? Direct testing requires the test-taker to perform the skill to be measured (e.g., having test-takers write something in order to see how well they write) while indirect testing measures something dependent upon an underlying skill (e.g., having test-takers answer multiple-choice comprehension questions to see how well they have understood something to which they have listened).
    Second, should the test be discrete point or integrative? Discrete point testing involves testing one thing at a time, item by item, (e.g., a multiple-choice grammar test on the present perfect) while integrative testing involves testing the combination of many elements in the completion of a task (e.g., a test involving the writing of a grammatically accurate, well-planned, unified, cohesive paragraph). The distinction between discrete point and integrative testing is, according to Hughes (1989, 17), "not unrelated to that between indirect and direct testing," with discrete point tests being almost always indirect, and integrative tests direct.
    Finally, should the test be norm- or criterion-referenced? A norm-referenced test compares the amount of knowledge or material known by each test-taker with that known by other test-takers, with the aim of spreading the test-takers out along a continuum of general abilities or proficiencies so that differences among them are reflected in the scores (Brown 1995). A criterion-referenced test compares the amount of knowledge or material known by each test-taker with a level of achievement or a set of criteria.
    Which type of test should one choose? Unfortunately, the answer is not always straightforward and depends upon what is to be included in the test. First, for placement tests involving speaking and writing, direct testing is best because these two productive skills provide something that can be directly observed and measured--utterances and written material. For placement tests involving listening and reading, indirect testing should be used as these two receptive skills yield something that can neither be directly observed nor measured but can only be inferred--comprehension of an utterance or a reading passage.
    Second, because of the relationship between direct/indirect testing and discrete/integrative testing, for tests involving grammar or listening, discrete point testing should be used. The rationale for this recommendation is that the underlying ability--grammatical structure or listening comprehension--is difficult to test efficiently by other means. The ability is thus best assessed by breaking it down into its individual elements, such as knowledge of the past progressive or the understanding of the gist of a lecture, and by assessing enough of these elements to indicate the test-taker's underlying ability. For speaking or writing, integrative testing is best because the successful completion of the speaking or writing task--making a comprehensible utterance or writing a sentence--does not lend itself well to being broken down and the sum of a number of individual elements may not represent the whole of the ability. For tests involving reading, a combination of discrete point and integrative testing is best, as the underlying ability--understanding a written passage--is difficult to assess accurately by using exclusively one or the other form of measurement. This ability is thus best evaluated both by separating it into components (by, for example, having multiple-choice questions that focus on reference words, vocabulary in context, and the like) and by using a combination of elements (by, for example, requiring the test-taker to write a summary or a paraphrase of the passage).
    Finally, the placement test should be a norm-referenced test, for norm-referenced tests, as stated earlier, compare what is known by each test-taker with that known by other test-takers. Because, according to theory, most scores should be close to the average with fewer scores out near the periphery (think of the bell curve here), norm-referenced tests will allow you to define distinct levels of performance and to make distinctions between individual performances (McNamara 2000, 63)--exactly what a placement test should do. In order for a placement test to be norm-referenced, the test has to be normed; that is, it must be trialed with a representative population of test-takers in order to obtain the scores representative of each level of placement at an institution or in an educational situation. Once that is done, the results can be used to set cut-off points (the lower and upper end of ranges of scores) that will allow students to be placed at different levels.
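Once cut-off points have been set, applying them is mechanical. The following is a minimal sketch in Python, not a procedure taken from this article: the cut-off values, the level names, and the function name place are all hypothetical, and real cut-offs would come from norming the test with a representative group.

```python
from bisect import bisect_right

# Hypothetical cut-off points: the lowest score that admits a
# test-taker to each level above the first. Real values would be
# set from the results of the norming trial.
CUTOFFS = [40, 60, 80]
LEVELS = ["Level 1", "Level 2", "Level 3", "Level 4"]


def place(score, cutoffs=CUTOFFS, levels=LEVELS):
    """Return the level whose score range contains `score`."""
    # bisect_right counts how many cut-offs the score has reached,
    # which is also the index of the matching level.
    return levels[bisect_right(cutoffs, score)]


if __name__ == "__main__":
    for score in (39, 40, 79, 80):
        print(score, place(score))
```

With these assumed values, a score of 39 falls in Level 1 and a score of 40 in Level 2, so each cut-off is the lower end of a range.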
Choose Test Content
Concerning what the placement test should contain, it seems logical that the test should reflect the curriculum of the school, yet in practice this has not always been the case. As Brown (1989, 66) remarked, "we decided to develop a placement battery that would be related in content to the curriculum of our institute--a proposal that struck us as strangely novel." Nonetheless, the answers to a number of questions can help determine what should be included in a placement test:
* How many sections should the test contain?
* How long should the sections be?
* How should the sections be differentiated?
* How many items should there be in each section?
* What is the target situation for the test and how could it be simulated?
* What text types (written and/or spoken) should be chosen?
* What language skills should be tested?
* What aspects of the language should be tested?
* What tasks should be required?
* What test methods should be used?
* What is the curriculum of the classes into which the placement test is placing test- takers?
Create the Test
Once the test type and content have been chosen, the test itself can be created. The test questions should be carefully crafted: the questions should be as easy as possible for the test-takers to understand and process. Unnecessarily complex questions only increase the cognitive load on the test-takers and do nothing to contribute to measuring what you seek to measure. In addition, tests should be relatively short--not too long for the test-takers to answer nor too long for the scorers to mark.
Develop Scoring Guides
After the test is created, scoring guides should be developed. These guides are intended for the markers of the placement test to help them assess test-takers' performance reliably and consistently. Of course, scoring guides are not necessary for multiple-choice questions, which simply require an answer key. Rather, scoring guides are applicable to subjective or open-ended questions--for example, those requiring short essay answers or oral interaction. Scoring guides can be either holistic or analytical. Holistic means that one score is assigned to an answer (e.g., a mark out of 10), while analytical means that separate scores for a number of different aspects are assigned to an answer (e.g., a mark out of 10 for grammar, 10 for content, and 10 for structure, for a total of 30 marks).
    Each type of scoring guide has a number of advantages and disadvantages. One advantage of holistic scoring is that it can be quick for the markers. Two disadvantages are the fact that the scoring scale must be well conceived and that there should be more than one scorer in order to ensure reliability. There are at least two advantages of analytical scoring: markers have to consider certain aspects of the test-taker's performance that they might otherwise overlook and the results can be used for diagnostic purposes. One disadvantage is that analytical marking is time-consuming. The type of scoring guide that should be used depends on each testing situation. If time is not at a premium, analytical scoring is the better choice.
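The analytical example above (10 marks each for grammar, content, and structure) can be sketched as a small Python routine. This is only an illustration: the dictionary MAX_MARKS and the function analytic_total are hypothetical names, and the checks simply guard against a marker skipping an aspect or exceeding its maximum.

```python
# Hypothetical analytical scoring guide: each aspect is marked out of 10,
# following the example of grammar, content, and structure.
MAX_MARKS = {"grammar": 10, "content": 10, "structure": 10}


def analytic_total(marks, max_marks=MAX_MARKS):
    """Check the per-aspect marks and return (total, maximum possible)."""
    for aspect, maximum in max_marks.items():
        if aspect not in marks:
            raise ValueError(f"missing mark for {aspect}")
        if not 0 <= marks[aspect] <= maximum:
            raise ValueError(f"{aspect} mark must be between 0 and {maximum}")
    total = sum(marks[aspect] for aspect in max_marks)
    return total, sum(max_marks.values())


if __name__ == "__main__":
    total, maximum = analytic_total({"grammar": 7, "content": 8, "structure": 6})
    print(f"{total} out of {maximum}")
```

Keeping the separate marks, rather than only the total, is what makes the results usable for diagnostic purposes.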
Test the Test
Having created the scoring guide, the placement test itself should be tested. In other words, before the test is implemented, it should be trialed with a set of students who are representative of the "real" test-takers. The trial will probably reveal problems with the test that could not be identified during its creation. For example, too many items on the test may be too difficult or too easy, open-ended test items may confuse test-takers, writing tasks may result in less response than expected because of poor or somehow insufficient wording, or multiple-choice items may be ambiguous or open to disagreement.
    The results of the trial should be analyzed. For multiple-choice test items, you should calculate the facility value (the level of difficulty of each test item) and the discrimination index (the ability of each item to discriminate between test-takers who did well and those who did not). For advice on how to do so, see Alderson et al. (1995, 80-86), Hughes (1989, 161-162) or Harrison (1983, 127-133). For open-ended or subjective items, you should decide whether the items resulted in what you intended and whether the scoring guide, if applicable, is serviceable. Analyzing the results should yield important information about the test items and the test itself. For instance, if certain test items are too easy, they should perhaps be discarded or moved toward the beginning of the test. If others are too difficult, perhaps they should be re-worded or moved toward the end of the test.
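The two statistics for multiple-choice items can be sketched in Python as follows. The formulas are the common textbook versions and are an assumption here, since this article does not give them: the facility value is taken as the proportion of test-takers who answered the item correctly, and the discrimination index as the difference between the proportions correct in the highest- and lowest-scoring groups. The choice of the top and bottom thirds, the function names, and the sample data are all illustrative; the sources cited above should be consulted for the exact procedures they recommend.

```python
def facility_value(item_responses):
    """Proportion of test-takers who got the item right (1 = right, 0 = wrong)."""
    return sum(item_responses) / len(item_responses)


def discrimination_index(item_responses, total_scores, groups=3):
    """Proportion correct in the top group minus that in the bottom group.

    Test-takers are ranked by total test score; the top and bottom
    1/groups of them (thirds by default, an assumption) are compared.
    """
    n = max(1, len(total_scores) // groups)
    # Indices of test-takers, highest total score first.
    order = sorted(range(len(total_scores)),
                   key=lambda i: total_scores[i], reverse=True)
    upper, lower = order[:n], order[-n:]
    upper_correct = sum(item_responses[i] for i in upper)
    lower_correct = sum(item_responses[i] for i in lower)
    return (upper_correct - lower_correct) / n


if __name__ == "__main__":
    # Made-up trial data for one item and six test-takers.
    item = [1, 1, 1, 0, 1, 0]        # right/wrong on this item
    totals = [18, 16, 15, 9, 8, 5]   # total scores on the whole test
    print(round(facility_value(item), 2))        # 0.67
    print(discrimination_index(item, totals))    # 0.5
```

On this reading, a facility value near 1 marks an item that almost everyone answered correctly, and a discrimination index near zero or below marks an item that does not separate stronger from weaker test-takers.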
Train the Testers
Once the results of the trial have been analyzed and the test has been altered accordingly, the people who are going to administer and score the placement test should be trained. The test administrators, who may be support people but not educators, should receive instruction in delivering the test consistently and correctly with little or no variation from administration to administration. Alderson, Clapham, and Wall (1995, 115) maintain that it is important that "[test] administrators understand the nature of the test they will be conducting, the importance of their own role and the possible consequences for candidates if the administration is not carried out correctly." Scorers should be provided with scoring guides and opportunities to calibrate their assessments of subjective test items. Weir (1990, 86) believes that this step is important, noting that "considerable attention should...be paid to the development of relevant and adequate scoring criteria and examiners must be trained and standardized in the use of these."
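One simple way to support calibration is to have two scorers mark the same set of trial answers and then compare their marks. The sketch below is an illustration only, not a method described in this article: the function agreement_report, the one-mark tolerance, and the sample marks are hypothetical.

```python
def agreement_report(marks_a, marks_b, tolerance=1):
    """Compare two scorers' marks for the same answers.

    Returns the share of answers marked identically, the share marked
    within `tolerance` marks of each other, and the positions of answers
    whose marks differ by more than `tolerance`.
    """
    if len(marks_a) != len(marks_b):
        raise ValueError("both scorers must mark the same answers")
    diffs = [abs(a - b) for a, b in zip(marks_a, marks_b)]
    return {
        "exact": sum(d == 0 for d in diffs) / len(diffs),
        "within_tolerance": sum(d <= tolerance for d in diffs) / len(diffs),
        "flagged": [i for i, d in enumerate(diffs) if d > tolerance],
    }


if __name__ == "__main__":
    # Made-up marks out of 10 from two scorers on four trial answers.
    report = agreement_report([6, 7, 4, 8], [6, 5, 4, 7])
    print(report)
```

The flagged answers are the ones the scorers would discuss against the scoring guide before marking real placement tests.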
    In conclusion, creating a placement test does not have to be an intimidating or overwhelming task for those instructors who have little or no experience in testing. By following the suggestions listed here, you should be able to create a placement test that is reliable, valid and accurate.
References
Alderson, J.C., C. Clapham and D. Wall. 1995. Language test construction and evaluation. Cambridge: Cambridge University Press.
Bradshaw, J. 1990. Test-takers' reactions to a placement test. Language Testing 7(1): 13-30.
Brown, J. D. 1989. Improving ESL placement tests using two perspectives. TESOL Quarterly 23(1): 65-83.
Brown, J. D. 1995. The elements of language curriculum: A systematic approach to program development. New York: Heinle & Heinle.
Harrison, A. 1983. A language testing handbook. London: Macmillan Press.
Hughes, A. 1989. Testing for language teachers. Cambridge: Cambridge University Press.
McNamara, T. 2000. Language testing. Oxford: Oxford University Press.
-----------------------------------

Assessing interactive oral skills in EFL contexts
by Jason Beale, MEd (TESOL)
Introduction
1. The purpose of assessment
  1.1   Testing general proficiency
  1.2   Educational placement & diagnosis
  1.3   Formative and summative assessment
  1.4   Testing for special purposes
2. Establishing assessment criteria
  2.1   The importance of validity
  2.2   The components of language use
  2.3   Specifying performance criteria
  2.4   Global rating scales
  2.5   Analytic rating scales
3. Choosing the best test format
  3.1   Interview tasks
      3.1.1   Structured interviews
      3.1.2   Unstructured interviews
  3.2   Role play tasks
      3.2.1   Structured role plays (information gap)
      3.2.2   Unstructured role plays
4. Special issues
  4.1   Practicality
  4.2   Bias for best
  4.3   Marking
Conclusion
Bibliography
Introduction
There are many English language teachers working in EFL contexts overseas. Their work often requires the quick assessment of a student's oral ability, usually during a brief initial interview or even in the very first class. This can help determine the choice of class material and the overall aims of a course of instruction. Informal assessment also continues throughout any teaching program, as a way of ensuring that desired outcomes are being achieved and students' needs are being met.
Such informal assessment is clearly a central part of language teaching. It is no less important than the formal testing of achievement, or the testing of employment and academic-related proficiency. It follows that all teachers in EFL contexts, whatever their positions and duties, ought to have a basic understanding of the principles underlying assessment of oral language skills.
1. The purpose of assessment
Before designing oral assessment tasks there needs to be a clear idea of the purpose of assessment. This is essential because the same degree of detail is not required in every testing situation. The purpose of the test will determine the overall shape of the assessment criteria to be used.
1.1 Testing general proficiency
The assessment of general proficiency is independent of a particular syllabus, and provides a broad view of a person's language ability. Ideally it focuses on fundamental oral skills, as well as on common communicative functions. Tasks such as summarising technical data, or describing statistics, require a grasp of fundamental skills of course, but they are clearly limited to academic or employment settings. They are formal tasks requiring particular language and presentation skills.
The term 'proficiency' refers to the practical use of language as a whole. It is therefore best assessed directly by eliciting extended samples of interactive language use in realistic contexts. The indirect assessment of oral language, through controlled response to single test items, has limited value as an indicator of real-life oral proficiency.
Unfortunately there is no such thing as a definitive test of general oral ability that can be applied in any situation. The standard of 'native-like' proficiency is only a convenient abstraction - one that ignores the personal and cultural differences that make communication real and complex. In EFL contexts, such as Japan, testees are often quite unfamiliar with Western cultural references and modes of behaviour, and so the design of test items needs to be as culturally neutral as possible without being too vague.
1.2 Educational placement and diagnosis
Assessment for educational placement is used to place the student in a suitable level for learning. It does not require the same degree of detail as a general proficiency test, but involves matching a student against fairly broad criteria in a band scale. Each 'band' describes the minimum level of ability needed for each stage of instruction. The most basic band scale would consist of only three levels: beginner, intermediate, and advanced.
Diagnostic assessment provides more detailed information on a learner's strengths and weaknesses. It requires descriptive analysis, both by impressionistic description and by rating specific aspects of language use. Such information is valuable for tailoring lessons more closely to learners' needs, and as a standard for evaluating progress at a later stage.
1.3 Formative and summative assessment
Formative assessment indicates a learner's ongoing progress during a course. It need not involve testing under formal conditions, but may simply consist of various impressions and notes that the teacher takes while observing students engage in communicative tasks. Summative assessment on the other hand is the formal measurement of a learner's achievement at the end of a unit or course of instruction. This involves matching student achievement with the stated objectives of the course. Summative assessment is different from general proficiency testing in that the assessment tasks of the former are based on the representative sampling of a syllabus.
1.4 Testing for special purposes
Outside of a specific syllabus of instruction, many people sit self-contained language tests that are recognized by higher education and employment bodies (e.g., TOEIC and TOEFL). The design of assessment tasks and choice of language is intended to reflect the skills and knowledge needed in special contexts of work or study.
These kinds of tests are used as high-grade filters that discriminate between learners and rank them against a sliding scale. The main purpose of assessment here is to identify candidates for access to limited opportunities such as scholarships and promotions. As such they are not particularly suitable for assessing an individual's particular level of proficiency in detail.
2. Establishing assessment criteria
According to the Australian Oxford Mini Dictionary a criterion is a "principle or standard by which (a) thing is judged." To test oral language skills there need to be such criteria to act as guidelines for judgement. These should describe the various levels of performance in a way that can be tested both logically and consistently. The last two points are often called 'validity' and 'reliability' in the literature on assessment.
2.1 The importance of validity
Validity has been described as "the single most critical element in constructing foreign language tests" (Nakamura 1995: 126). A valid test has a recognizable logic to it that makes the test a meaningful tool of assessment. The most fundamental kind of validity relates to the underlying theory of language on which the test is constructed (construct validity). This influences the sampling of language material and tasks (content validity), which in turn has an effect on the appearance of the test to the teachers and learners who use it (face validity).
Construct validity requires a set of principles that can adequately describe real-life language use. In the case of oral language skills this is not such a simple matter. Speaking may seem to be a general-purpose ability, but it occurs under many contexts and conditions, and for many reasons. Each has its own characteristics and demands, especially when seen as an interactive skill. In the last few decades a great deal of effort has been made to describe language use as an interactive or communicative system. Canale and Swain's (1980) model of 'communicative competence' is certainly the best known example in the literature on applied linguistics.
2.2 The components of language use
Grammar, vocabulary, and pronunciation all fall under the general category of grammatical (or linguistic) competence in Canale and Swain's influential model. These are the basic skills, traditionally taught and tested in isolation from a communicative context. Yet in order to predict real language use successfully, higher level skills and knowledge also need to be considered.
A second category called discourse competence concerns the way language is conventionally shaped in different communicative contexts. Describing a suspect during a police interview, for example, requires more than basic grammatical skills - it involves selecting, organising and linking elements together to create a structured and coherent whole. Canale and Swain distinguish a third category called sociocultural competence, which covers the cultural forms of speech deemed appropriate in a particular community.
Weir (1993), drawing from Bygate, conveniently includes both discourse and sociocultural aspects of language use under the single heading "routine skills". These are "frequently recurring ways of structuring speech, such as descriptions, comparisons, instructions, telling stories", and include the patterns of interactional language use seen in such things as "buying goods in a shop, or telephone conversations, interviews, meetings, discussions, decision making, etc" (Weir 1993: 32).
Canale and Swain's fourth category is strategic competence, which covers the various techniques people use to manage and enhance communication. This category is covered by Weir under the heading "improvisation skills" (1993: 32-4). Communication is a faulty and chaotic process and speakers need to be able to improvise when their conventional language routines fail. This includes both the "negotiation of meaning" in various ways to enhance understanding, as well as the "management of interaction" to establish "who is going to speak next and what the topic is going to be" (turn taking and topic initiation).
2.3 Specifying performance criteria
As the preceding section has shown, interactive oral skills involve different categories of practical knowledge (or know-how), each one effectively building on the one before: first, basic grammatical and linguistic knowledge (core skills); then discourse and sociocultural knowledge (routine skills); and finally strategic knowledge (improvisational skills). Having this general framework is helpful in identifying the various components of oral ability that can be assessed. Yet deciding what weighting to give each category of skill is still not a straightforward matter.
Improvisational skills are useful in every general context. For example, "Excuse me, what did you say?", or its equivalent, is an essential phrase. In particular contexts, such as business negotiation, there is a greater need for highly developed improvisational skills. In choosing or designing specific performance criteria for an oral test it is important to decide which of these categories are important and to what extent at each level of a candidate's ability. Different criteria will produce different results. As noted by Brown, "if each group were to develop its own assessment framework..., they may, in fact, through the inclusion or weighting of specific criteria, produce schemes which lead to quite different evaluations of candidates ability." (cited in Turner 1998: 198) The assessment criteria need to be related to the actual purpose of the test. This is sometimes called systemic validity. It requires close consultation with the relevant educational and employment bodies to help determine in detail what they intend the assessment instrument to achieve.
2.4 Global rating scales
Performance criteria are usually displayed in a rating scale. A global or holistic scale provides a general description of ability, in which the various components of language use are grouped together in a single 'band' descriptor:
Band 6: Competent Speaker.
Is able to maintain the theme of dialogue, to follow topic switches and to use and appreciate main attitude markers. Stumbles and hesitates at times but is reasonably fluent otherwise. Some errors and inappropriate language but these will not impede exchange of views. Shows some independence in discussion with ability to initiate. (Carroll cited in Weir 1993: 44)
Global descriptors are not always so brief as this. The Australian Second Language Proficiency Ratings (ASLPR) scale, developed by Ingram and Wylie in 1982, uses an A4 page to present each band descriptor in considerable detail. This allows for increased accuracy of identification, but at the cost of flexibility of assessment. Detailed global scales effectively dictate what combination of skills is to be recognized at each level, although in practice the particular features "may not co-occur in actual student performance" (Turner 1998: 200).
2.5 Analytic rating scales
The term analysis strictly refers to the breaking down of an object into its constituent parts or aspects. This is the opposite of synthesis or the putting together of parts to make a whole. Although the general components of oral language use are those discussed above in 2.2, there are various ways in which this "cake" of abilities can be sliced for assessment. Following are examples of assessment categories from four different analytic rating scales:
FLUENCY, PRONUNCIATION, GRAMMAR, COMPREHENSIBILITY
- Speaking Proficiency English Assessment Kit (SPEAK), Educational Testing Service, USA (Clankie 1995: 124)
FLUENCY, ACCURACY, COMPREHENSION, COMMUNICATIVE ABILITY
- Placement rating scale, Nova conversation school, Japan (unpublished)
ATTITUDE & CONFIDENCE, EXPRESSIVENESS (pronunciation, intonation & volume), BODY LANGUAGE, UNDERSTANDABILITY (for the listener, is the message delivered clearly?), COMMUNICATIVE ABILITY (can the speaker say what he/she wants to say?)
- Negotiated performance profile, Tokyo Denki University, Japan (McClean 1995: 142-3)
FLUENCY, GRAMMATICAL ACCURACY, INTELLIGIBILITY, APPROPRIATENESS, ADEQUACY OF VOCABULARY FOR PURPOSE, RELEVANCE AND ADEQUACY OF CONTENT
- Test in English for Educational Purposes (TEEP), Associated Examining Board, England (Weir 1993: 43-44)
Within each category, different levels of ability need to be distinguished clearly using descriptive language that can be matched against test results. With clear criteria determined by the overall purpose of assessment (systemic validity) and founded on a clear theory of language use (construct validity), it is possible to choose relevant assessment tasks. The choice of relevant tasks is an important step in itself, for as shown in one study of interview-format discourse (cited in Turner 1998: 195), "some of the supposed characteristics of intermediate versus advanced learners represented in the rating scales were not substantiated in the actual performance of intermediate and advanced learners."
3. Choosing the best test format
Since Canale and Swain presented their model of communicative competence twenty years ago (see 2.1-2.2), the communicative approach has spread into both teaching and testing methodology. According to Weir (1988: 82) communicative testing is purposive, interesting, motivating, interactive, unpredictable and realistic.
Assessing interactive language means by definition that there is someone else actively taking part. The person being tested is not only producing language, but is also interacting in a communicative way with an interlocutor. This is quite different from non-interactive stimulus-response tasks. Techniques that use written or visual prompts to elicit language samples are very straightforward and time-efficient to administer, and can also help to gauge the general educational level of the student. The SPEAK test of oral proficiency is one example of a test composed mostly of non-interactive tasks (Clankie 1995). Unfortunately, such tasks show very few of the qualities of communicative testing listed above.
There are many kinds of oral assessment task that can be used - one writer listing over sixty variations (Underhill 1987). In essence there are two general approaches that meet the criteria for interactive assessment. These are interview and role play.
3.1 Interview tasks
Interview tasks are a direct test of language use; that is, "they measure oral skills by having the examinees actually speak" (Turner 1998: 194). Even so, the ostensible context remains that of a language test. Beyond making the candidate feel at ease, there is no attempt to simulate a non-test setting. Interview tasks thus represent a compromise solution to the problem of how to control something that is inherently unpredictable.
3.1.1 Structured interviews
A structured interview composed of set questions has many advantages. It can be reliably used to determine someone's general level in terms of grammatical knowledge, vocabulary, pronunciation and fluency. It can also be used to find out how well the candidate can structure a short narrative, and to what degree they can express more complex points of view. It is relatively cost and time efficient to administer, and if the interview is recorded properly then marking can be a fairly reliable standardised procedure.
A common interview structure has four stages (Nagata 1995):
1. a friendly warm up
2. a level check to determine the candidate's overall ability in terms of the criteria
3. challenging probes to find where performance drops
4. a final wind down at a less challenging level
In EFL contexts, such as Japan, the structured interview is readily accepted by test users since it mirrors the often formal social relationship that exists between teacher and student. This high face validity makes it a popular method of oral assessment, despite its limitations as a measure of real-life oral ability.
The st