PR# 14518 Routine {THREAD_CONTROL}.join not thread-safe
Problem Report Summary
Submitter: prestoat2000
Category: Runtime
Priority: Medium
Date: 2008/06/24
Class: Bug
Severity: Serious
Number: 14518
Release: 6.2.73753
Confidential: No
Status: Analyzed
Responsible:
Environment: Mozilla/5.0 (X11; U; SunOS sun4u; en-US; rv:1.8.1.13) Gecko/20080328 Firefox/2.0.0.13
Solaris 10 on SPARC
Synopsis: Routine {THREAD_CONTROL}.join not thread-safe
Description
Based on code inspection only, it looks like routine {THREAD_CONTROL}.join is not thread-safe. There seem to be two problems. Below is my reasoning (apologies in advance if I'm wrong). 1. First problem: {THREAD_CONTROL}.join calls `thread_wait', which is a blocking C external that is run-time routine `eif_thr_wait. At entry to `eif_thr_wait', status of GC status of calling thread is EIF_THREAD_BLOCKED. Routine `eif_thr_wait' calls RT_GC_PROTECT(thread_object), which translates to a call to `epush'. Epush may have to call `st_extend', which calls `eif_rt_xmalloc' with GC_OFF. Eif_rt_xmalloc calls `allocate_free_list', which locks `eif_free_list_mutex' and then calls `allocate_free_list_helper'. This latter routine looks at the free list and may modify it as well. Since the thread status is still EIF_THREAD_BLOCKED, a call to `eif_synchronize_gc' by another thread could complete and that thread could start GC. GC can call `rel_core' which can call `free_chunk' which can call `chunk_coalesc' which can access and modify the free list. So we could have two threads modifying the free list at the same time, one holding `eif_free_list_mutex' and the other under GC synchronization. The problem seems to be that `eif_thr_wait' can do things with Eiffel-controlled memory while its status is EIF_THREAD_BLOCKED, which is a no-no. 2. Second problem: Even if the call to RT_GC_PROTECT(thread_object) does not have to call `eif_rt_xmalloc', there is a problem. Since the thread status is EIF_THREAD_BLOCKED, a GC may be in progress while `eif_thr_wait' executes. So `eif_access(Current)' may compute a value, then GC may move the object, then the old location could be stored int `thread_object'. There are similar race conditions farther down in the routine.
To Reproduce
Problem Report Interactions
I don't have any other ideas. If the thread object cannot move (if it is frozen early on), then there is no problem. If it can move, then it seems like the solution is to have the run-time think the thread is in Eiffel code (via EIF_EXIT_C) whenever the thread can try to access `thread_object'. However, it seems to me that after each EIF_EXIT_C call, you must call RTGC or equivalent. Otherwise, a GC could be in progress in another thread and then there is a race condition where `thread_object' is evaluated, then the thread object is moved by the thread doing the GC cycle and `thread_object' is updated, then the thread executing `eif_thr_wait' evaluates the rest of the expression *(EIF_BOOLEAN *) (thread_object + offset) == EIF_FALSE and incorrectly uses the old location.
The safe solution ignoring the RT_GC_PROTECT call which we could do otherwise to avoid the first issue you mentioned (i.e. not marking the external blocking but inserting ourself the calls to EIF_ENTER_C and EIF_EXIT_C in the `eif_thr_wait' routine directly) is to ensure that a thread object cannot move. Do you have any other idea?