Blind test of Loudspeaker Q113

The blind test of the Q113 speakers is in full swing. Here we describe in more detail how the test is actually performed. It is also a peek into a world that is not normally accessible.
One of my friends told me the other day about a special restaurant in Berlin where the food is served in complete darkness and the waiters are blind. One of the ideas of the restaurant is to give the eyes a break and make the guests more thoughtful and aware of their other senses. For example, you should be able to easily hear the texture of the soup, and your fingertips should be really good at feeling what's in the salad. One of the menus consists of surprises where guests have to guess what they are eating. My friend's group tried one of these menus. They guessed anything from fish to veal or chicken, but could not be sure. Nothing in the taste pointed with certainty to anything in particular. The interesting thing about this unscientific example is that a taste experience seems to get overwhelming support from sight, and that you can be in doubt about what you're eating when your eyes aren't helping. Does the same apply to sound and speakers?
Four speakers in the Q113 project were judged some time ago by a large listening panel to be reasonably different. The speakers' sound quality was rated as ranging from "nicely above average" to "truly top class", but test participants always knew which speakers they were rating. Now the test results are being verified in a blind test at DELTA SenseLab, and I'm the first to try it myself. When you have no idea which speakers are playing, there's nothing to do but judge the sound quality by your honest opinion. Now, few would claim that they deliberately distort reality for themselves. But equally few would claim to have no preference for certain types of construction or, conversely, no scepticism about others. There are undoubtedly many factors that come into play when picking a favourite, but how much of it really has to do with the performance of the favourite? Are there things we can hear that we can't measure? Do our ears perhaps prefer something that we are not aware of?
The blind test provides a pure and unvarnished assessment of the sonic differences between two speakers. You are presented with nothing other than the sound, which takes attention away from things like price, design and construction. I won't share my own listening impressions while the test is still running, but will instead hand over to Torben from DELTA SenseLab for a behind-the-scenes look at the exciting test.



Instruction. Prior to the blind test, the test participant reads a three-page manual explaining what the test is about and how to operate the equipment's interface.


Torben Holm Pedersen from DELTA SenseLab now speaks

When conducting a listening test, a number of choices have to be made to strike an appropriate compromise between providing relevant, representative and significant results on the one hand and keeping the test from becoming an unwieldy and overly time-consuming affair on the other. The many variables in a test multiply, and their combinations can easily make it balloon. Ideally, for each of the four loudspeakers, one would test, for example, five programme materials (e.g. three different music genres plus male and female speech), three different volume levels (background level, medium volume and high volume) and two different positions (against the wall and free-standing in the room). Furthermore, one could easily find ten characteristics or attributes that one wanted assessed.


A full-factorial test would require an unrealistic 10 hours of listening

In a so-called full-factorial test, this would mean that each listener would have to make 4 x 5 x 3 x 2 x 10 ratings, i.e. a total of 1200 ratings. At typically half a minute each, that is about 10 hours of listening per person. If one also wanted to compare all 12 combinations of two speakers (counting order), the test would become quite unmanageable. In order not to make the test too extensive, we have opted against pairwise comparison, which is probably the test type that provides the best resolution and could thus reveal the smallest differences between the speakers. Another option is what can be called "intensive monadic listening", where you listen intensively to each speaker for, say, half an hour and rate each of the selected attributes without comparison to other speakers. However, this requires a lot of listening and programming and probably gives less resolution than pairwise comparison.
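The arithmetic above can be checked with a few lines of Python. This is only a sketch of the count given in the text; the dictionary keys and the 30-second constant are labels chosen here for illustration.

```python
from itertools import permutations
from math import prod

# Number of levels for each factor, as listed in the text.
factors = {
    "loudspeakers": 4,
    "programme materials": 5,
    "volume levels": 3,
    "positions": 2,
    "attributes": 10,
}
SECONDS_PER_RATING = 30  # "typically half a minute" per rating

# Full factorial: every combination of every level is rated once.
ratings = prod(factors.values())
hours = ratings * SECONDS_PER_RATING / 3600

# Ordered pairs of two different speakers (A then B is not the same as B then A).
ordered_pairs = list(permutations(range(factors["loudspeakers"]), 2))

print(ratings, hours, len(ordered_pairs))
```

Dropping or halving any single factor shrinks the total proportionally, which is why the choice of test type matters so much for the listening time.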



Screenshot. On the left, the test taker selects which attribute is being assessed. The test taker then places an arrow on the scale to reflect the rating.


The speakers are compared to a reference

Since a comparison of the different Q113 speakers is relevant, we have chosen a middle ground between the two test types mentioned, namely comparative listening with a reference. Here each stereo set (placed optimally in relation to the listening position and directed towards it according to the ITU-R BS.1116-1 standard) is compared with a set of reference loudspeakers that have a fixed position in the room. The reference loudspeakers act as "anchors" for the ratings and are ultimately excluded from the ratings.


The test participant controls the test via a keyboard

Switching between the reference and test speakers is controlled by the listener, with no audible effects other than the differences caused by the speakers themselves (fade-out plus fade-in sequence <100 ms). There are keyboard shortcuts, so the listener does not need to look at the screen when, for example, switching between the test speaker and the reference speaker. For each of the seven attributes shown on the left, a comparison is made with the reference speaker: the listener indicates how much more or less of the current attribute the test speaker has than the reference speaker. The selected attribute is defined in more detail in the box on the right (see the computer screenshot). If a particular part of the music sample seems especially suited to judging the current attribute, the listener can zoom in on that part. Each sound sample is about 15 seconds long. Once all attributes have been rated for a given speaker type, the listener takes a short break outside the listening room while a new speaker is placed in the same position as the previous one. The order of the speakers differs between test subjects.
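The text does not say how the speaker order is varied from one listener to the next. One common way to do it, shown here purely as an assumed sketch, is a cyclic rotation, so that over every four listeners each speaker appears once in each position. The function name `presentation_order` and the speaker labels are hypothetical.

```python
def presentation_order(speakers, listener_index):
    """Rotate the speaker list so consecutive listeners start with different speakers."""
    shift = listener_index % len(speakers)
    return speakers[shift:] + speakers[:shift]

speakers = ["A", "B", "C", "D"]  # hypothetical labels for the four Q113 speakers

# Orders for 12 listeners: the rotation repeats every four listeners.
orders = [presentation_order(speakers, i) for i in range(12)]
for number, order in enumerate(orders, start=1):
    print(f"listener {number}: {order}")
```

A rotation like this balances position but not which speaker follows which; a fully randomised or carry-over-balanced design would be needed for that.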



One at a time. Here, the participants of the specially trained listening panel sit one at a time and compare the sound from the speakers.


The tested speakers play equally loud

In listening tests, the speaker that plays slightly louder than the others is often preferred. Therefore, before the test, the speakers were adjusted to the same linear sound pressure level at the listening position using band-pass filtered (80-14,000 Hz) pink noise.
The previously mentioned sound examples were chosen for the following reasons:
Artist: Jennifer Warnes, Album: Famous Blue Raincoat, Track: Bird on a Wire. This has been called the best recording ever. In particular, it provides an opportunity to judge attack, source separation, spatial reproduction and localisation.
Artist: Karina Gauvin, Luc Beausejour, Album: Bach: Little Notebook for Anna Magdalena, Track: Willst du dein Herz mir schenken. The voice must be pure, with a crisp rendition of the harpsichord and a good rendering of space.
Artist: Paula Cole, Album: This Fire, Track: Tiger. The track is special in having a powerful five-string bass guitar that goes down to 31 Hz, which makes it especially suited for judging the bass rendition.
The Paula Cole track was turned up to the highest level the speakers could handle. The other two tracks were adjusted to a lower volume to give variation in level. The resulting levels at the listening position with the reference speaker were as follows:




Blind test. The speakers hide behind the thin fabric and the test participant can only assess the sound.
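The level matching described in this section comes down to simple decibel arithmetic. The sketch below assumes that each speaker's pink-noise level has been measured at the listening position and that the reference speaker sets the target. The readings are made-up numbers, and the helper names `trim_gain_db` and `db_to_amplitude` are hypothetical.

```python
def trim_gain_db(measured_db, target_db):
    """Gain change in dB needed to bring a speaker to the target level."""
    return target_db - measured_db

def db_to_amplitude(gain_db):
    """Convert a gain in dB to a linear amplitude factor."""
    return 10 ** (gain_db / 20)

# Hypothetical pink-noise readings at the listening position, in dB SPL.
measured = {"reference": 75.0, "speaker_1": 76.5, "speaker_2": 73.8}
target = measured["reference"]

# A negative trim means the speaker must be turned down to match the reference.
trims = {name: trim_gain_db(level, target) for name, level in measured.items()}
for name, trim in trims.items():
    print(f"{name}: {trim:+.1f} dB (amplitude x {db_to_amplitude(trim):.3f})")
```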


Working to select relevant attributes

Sound can be described in many ways. The report "The Semantic Space of Sound" (available for download from www.madebydelta.com) collects more than 600 words describing sound. Only a small proportion of these are suitable for describing reproduced sound, where it is not the sound itself that is being described but only the change to the original sound caused by the speakers. We are currently working on selecting, defining and categorising relevant attributes for this purpose from a subset of about 90 words, some of which overlap.


Seven attributes are assessed on Q113

After informal trials, we have selected the following seven attributes that we believe are relevant to elucidating differences between the Q113 loudspeakers:
Treble strength: The relative strength of the treble, i.e. the bright tones (high frequencies) from e.g. cymbals, triangle and s sounds (sibilants).
Bass strength: The relative strength of the bass, i.e. the low notes (frequencies) from e.g. male voices, bass and bass drum. Not to be confused with bass range, which indicates the extension of the bass.
Presence: The impression that an instrument or vocal is emerging from the soundscape and is present in the room.
Localisation: Can instruments and voices be clearly located and separated in the spatial soundscape? How precisely is each sound source located in space?
Attack: How good is the speaker's ability to reproduce transients/impulse sounds? Are the strikes from bass drum and bass accurately and distinctly reproduced? Is the bass tight and punchy or soft and thumping? Less accurate reproduction can be heard as the impact spreading out in time and its peak fading.
Distortion: Are there impure sounds or noises such as hissing, scratching, clipping or distortion coming from the speaker? If no distortion is audible, place the response on the centerline corresponding to the Reference.
Envelopment: Does the reproduced sound image surround you and give a sense of space around you? The sense of more Envelopment than the Reference is judged to the right of the centerline.



Control room. Here, DELTA specialists manage the test process and save the participant's answers for later processing.


Test takers must be able to repeat their own assessments

All attributes are rated separately for each speaker set and each music sample. The test is well underway and will run over the next few weeks. So far 12 listeners have been through the test, and a picture of the results is beginning to emerge. It is important to consider the qualifications of the listeners when processing the results. Based on the results, for each attribute we will rate the listeners' ability:
* to discriminate between speakers
* to be consistent with the rest of the panel
* to repeat their own judgements
We have not yet decided whether to base the final statistics on all the results except for a small number of unsure listeners or whether to go for the judgements of, say, the 10 "sharpest" listeners.
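The listener checks mentioned above could be quantified in many ways, and the text does not say which statistics DELTA SenseLab uses. As an assumed sketch, repeatability can be measured as the mean absolute difference between two rating passes, and agreement with the panel as the correlation between one listener's ratings and the average of everyone else's. All ratings and function names below are invented for illustration.

```python
from statistics import mean

def repeatability(first, repeat):
    """Mean absolute difference between two rating passes (0 = perfectly repeatable)."""
    return mean(abs(a - b) for a, b in zip(first, repeat))

def pearson(x, y):
    """Pearson correlation between two equally long rating sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def panel_agreement(ratings, listener):
    """Correlate one listener's ratings with the mean of all other listeners."""
    others = [r for name, r in ratings.items() if name != listener]
    rest_mean = [mean(values) for values in zip(*others)]
    return pearson(ratings[listener], rest_mean)

# Made-up ratings of one attribute for four speakers, on a scale centred on
# the reference (0 = same as the reference speaker).
ratings = {
    "listener_1": [-2.0, 0.5, 1.0, 3.0],
    "listener_2": [-1.5, 0.0, 1.5, 2.5],
    "listener_3": [1.0, -0.5, 0.5, -1.0],
}
repeat_1 = [-1.5, 0.5, 1.5, 2.5]  # listener_1's hypothetical second pass

print(repeatability(ratings["listener_1"], repeat_1))
for name in ratings:
    print(name, round(panel_agreement(ratings, name), 2))
```

In this invented data, listener_3 rates against the trend of the other two and gets a negative agreement score, which is the kind of pattern that would flag an unsure listener.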
