| View Issue Details [ Jump to Notes ] | [ Print ] | ||||||||
| ID | Project | Category | View Status | Date Submitted | Last Update | ||||
| 0008056 | CMake | CTest | public | 2008-11-07 10:45 | 2009-09-14 21:22 | ||||
| Reporter | Roscoe A. Bartlett | ||||||||
| Assigned To | Bill Hoffman | ||||||||
| Priority | normal | Severity | major | Reproducibility | always | ||||
| Status | closed | Resolution | fixed | ||||||
| Platform | OS | OS Version | |||||||
| Product Version | |||||||||
| Target Version | Fixed in Version | ||||||||
| Summary | 0008056: ctest -j N does not work correctly with expected failing tests | ||||||||
| Description | I just checkout, built and installed the most recent version of CMake on 2008/11/07. I tested the -j N option with ctest and it mostly works correctly but it does not work correctly for tests that are expected to fail (i.e. they return non-zero value) but there is a REGEX given to determine success. Here is one example. I one test defined with a helper macro as: =========================================================== TRILINOS_ADD_TEST( UnitTest_BadUnitTest NAME UnitTest_BadUnitTest_end_result_failed DIRECTORY UnitTest ARGS "--teuchos-suppress-startup-banner" NUM_MPI_PROCS 1 COMM serial mpi PASS_REGULAR_EXPRESSION "End Result: TEST FAILED" ) =========================================================== In the macro TRILINOS_ADD_TEST(...), the argument PASS_REGULAR_EXPRESSION "End Result: TEST FAILED" is added as test property as you would expect. Here is the output from the test: =========================================================== Start testing: Nov 07 07:52 MST ---------------------------------------------------------- Changing directory into: /home/rabartl/PROJECTS/Trilinos.base/BUILDS/CMAKE/GCC-4.1.2/SERIAL_OPT/packages/teuchos/test 7/109 Testing: Teuchos_UnitTest_BadUnitTest_end_result_failed 7/109 Test: Teuchos_UnitTest_BadUnitTest_end_result_failed Command: "/home/rabartl/PROJECTS/Trilinos.base/BUILDS/CMAKE/GCC-4.1.2/SERIAL_OPT/packages/teuchos/test/UnitTest/UnitTest_BadUnitTest.exe" "--teuchos-suppress-startup-banner" Directory: /home/rabartl/PROJECTS/Trilinos.base/BUILDS/CMAKE/GCC-4.1.2/SERIAL_OPT/packages/teuchos/test "Teuchos_UnitTest_BadUnitTest_end_result_failed" start time: Nov 07 07:52 MST Output: ---------------------------------------------------------- *** *** Unit test suite ... *** Sorting tests by group name then by test name ... Running unit tests ... 0. Int_Assignment_UnitTest ... [Passed] 1. Int_BadAssignment_UnitTest ... i2 = 5 == i1 = 4 : failed [FAILED] Location: /home/rabartl/PROJECTS/Trilinos.base/Trilinos/packages/teuchos/test/UnitTest/Failing_UnitTest.cpp:7 2. Int_Basic_UnitTest ... [Passed] 3. VectorInt_OutOfRangeAt_UnitTest ... p=0: *** Caught standard std::exception of type 'std::out_of_range' : vector::_M_range_check [FAILED] Location: /home/rabartl/PROJECTS/Trilinos.base/Trilinos/packages/teuchos/test/UnitTest/Failing_UnitTest.cpp:15 4. someFunc_ComplexDouble_test1_UnitTest ... [Passed] 5. someFunc_ComplexFloat_test1_UnitTest ... [Passed] 6. someFunc_double_test1_UnitTest ... [Passed] 7. someFunc_float_test1_UnitTest ... [Passed] Summary: total = 8, run = 8, passed = 6, failed = 2 End Result: TEST FAILED ---------------------------------------------------------- "Teuchos_UnitTest_BadUnitTest_end_result_failed" end time: Nov 07 07:52 MST "Teuchos_UnitTest_BadUnitTest_end_result_failed" time elapsed: 00:00:00 ---------------------------------------------------------- End testing: Nov 07 07:52 MST =========================================================== Note that the above test gives the correct string, but 'ctest -j 8' returned that it failed. Another problem with 'ctest -j N' is that it does not respect the '-W' option to widen the output as shown by: =========================================================== [rabartl@godel SERIAL_OPT]$ ctest -W 100 -j 8 -E TrilinosDep -R Teuchos_UnitTest Start processing tests Test project /home/rabartl/PROJECTS/Trilinos.base/BUILDS/CMAKE/GCC-4.1.2/SERIAL_OPT 20: 20/120 Testing Teuchos_UnitTest_BadUnitTest_I***Failed 21: 21/120 Testing Teuchos_UnitTest_BadUnitTest_I***Failed 19: 19/120 Testing Teuchos_UnitTest_BadUnitTest_e***Failed 16: 16/120 Testing Teuchos_UnitTest_UnitTests ... Passed 18: 18/120 Testing Teuchos_UnitTest_BadUnitTest_e***Failed 17: 17/120 Testing Teuchos_UnitTest_BadUnitTest_f***Failed 17% tests passed, 5 tests failed out of 6 =========================================================== The simultaneous serial testing feature is an important feature for Trilinos in order to speed up pre-checking testing, to give faster continuous integration feedback, and to better utilize sparse testing machines. | ||||||||
| Tags | No tags attached. | ||||||||
| Attached Files | |||||||||
| Relationships | |
| Relationships |
| Notes | |
|
(0014063) Roscoe A. Bartlett (reporter) 2008-11-07 17:54 |
The issue of not respecting the -W option and having different behavior with respect to pass/fail tests would seem to suggest that CTest has some duplication issues. Does CTest need to be refactored to remove some duplication (that is not being consistently maintained as is the problem with all duplication)? |
|
(0014404) Bill Hoffman (manager) 2008-12-18 22:03 |
I have fixed the pass/fail regex stuff. I have not yet fixed the -W stuff. It is not so much a duplication issue. The -j stuff runs a ctest process for each test. To do this efficiently, it parses all the tests, then creates a temporary cache file to store the test information so that the sub-ctests can load it quick. It was not saving all of the test information, it now is. (also, I put a comment in place to make sure it stays up-to-date.) For the -W stuff, ctest should pass the -W to the sub ctest and all should be good. |
|
(0014437) Roscoe A. Bartlett (reporter) 2008-12-24 14:03 |
Bill, I have updated my CVS copy of CMake and 'ctest -jN' does now seem to run all of the tests correctly. However, the parallel performance does not look very good. Here is a set of experiments that I ran: Here, I want to look at the runtime savings that running with -jN on godel for different values of N (1 to 8). 1) Running serial debug tests without -jN: $ time ctest -W 100 ... real 3m48.493s user 3m42.992s sys 0m8.495s 2) Running serial debug tests with -j2: $ time ctest -W 100 -j2 ... 100% tests passed, 0 tests failed out of 248 real 2m25.877s user 3m50.630s sys 0m34.104s 3) Running serial debug tests with -j4: $ time ctest -W 100 -j4 ... 100% tests passed, 0 tests failed out of 248 real 1m47.034s user 3m51.593s sys 0m41.900s 4) Running serial debug tests with -j8: $ time ctest -W 100 -j8 ... 100% tests passed, 0 tests failed out of 248 real 1m40.348s user 3m52.744s sys 0m57.647s These are not good speedups by even parallel MPI standards. I suspect that the algorithm used to implement parallel runs is not very good or something. Here is the table of times and speedups for using 'ctest -jN' with the entire serial debug test suite of 248 tests: # Processors (-jN) Wallclock time (sec) Speedup Efficiency 1 228 - 1.00 2 145 1.57 0.79 4 107 2.13 0.53 8 100 2.28 0.29 This is terrible parallel efficiency, even by MPI standards. Moving from 1 to 2 processors is worthwile but moving to 4 processers is not and moving to 8 is a waste of processors. I can only assume that CTest is using a bad algorithm for scheduling the jobs. How does parallel compiling with -jN with GNU Make compare on the same machine? Let's run an experiment of just compiling the Teuchos library in parallel. There are 60 *.o files get get built by this. Here are the raw timing experiments: 1) Compiling Teuchos_libs with -j1: $ make clean $ time make Teuchos_libs -j1 ... real 0m37.736s user 0m29.996s sys 0m5.852s 2) Compiling Teuchos_libs with -j2: $ make clean $ time make Teuchos_libs -j2 ... real 0m18.684s user 0m30.148s sys 0m5.917s 3) Compiling Teuchos_libs with -j4: $ make clean $ time make Teuchos_libs -j4 ... real 0m9.977s user 0m30.417s sys 0m6.146s 4) Compiling Teuchos_libs with -j8: $ make clean $ time make Teuchos_libs -j8 ... real 0m5.534s user 0m30.774s sys 0m6.688s Here is the table of times and speedups for using 'make Teuchos_libs -jN': # Processors (-jN) Wallclock time (sec) Speedup Efficiency 1 37.7 - 1.00 2 18.7 2.02 1.01 4 10.0 3.77 0.94 8 5.5 6.82 0.85 As you can see, we are getting almost perfect speedup and parallel efficiency on four processors and very good speedup on 8 processors. This shows that GNU Make has a reasonably algorithm for controlling the parallel builds. What this shows to me is that CTest is not using a very good parallel algorithm for running the tests. What algorithm is CTest using the schedule the jobs and what can we do to improve this? |
|
(0014438) Roscoe A. Bartlett (reporter) 2008-12-24 14:08 |
Also, I see that -jN and -T memcheck do not work together. Since memory testing takes a lot more time than just testing, we would save a lot of our testing resources if we could get good parallel scalability in running memory tests. |
|
(0014439) Bill Hoffman (manager) 2008-12-24 15:35 |
I will have to take a look. The algorithm is pretty simple. I can think of a few things that might be the problem... 1. I run a new ctest process for each job, that might be adding extra overhead. 2. I am using process execution and no threads so I have to keep checking for output or the end of the process in a loop. When a process ends, it starts a new one. If you want to have a look the code is all in CTest/cmCTestMultiProcessHandler.cxx and CTest/cmProcess.cxx This code is pretty new. I don't even think it will work with dashboards yet. Right now it runs the tests and that is about it. |
|
(0014440) Bill Hoffman (manager) 2008-12-24 15:46 |
You might want to try changing this parameter: int pipe = p->CheckOutput(.1, out, err); That is the timeout for each check for output on the running tests. |
|
(0014452) Bill Hoffman (manager) 2008-12-29 18:11 |
I am having a hard time reproducing this. I created a test in CMake, Tests/JCTest. It does something like this: foreach(f 1 2 3 4 5 6 7 8 9 10 11 12 12 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30) add_test(TestTime${f} TestTime 50000000) endforeach(f) #include <stdio.h> #include <stdlib.h> TestTime.cxx int main(int ac, char** av) { float d = 10.0; for(int i =0; i < atoi(av[1]); i++) { d *= .2; } printf("%f", d); } If I run this on an 8 core machine, I get these results: # time TestTime 50000000 0m3.411s (running just the test program) # no -j 1m38.422 # -j1 1m40.538s # -j2 0m53.607s # -j4 0m28.833s # -j6 0m20.268s # -j8 0m14.677s It seems to be scaling very well. 3.4 * 29 98.6 98.6 / 8 12 Only 2 seconds slower than optimal. Can you try running my test on your machine? Just checkout a recent CVS CMAke, then do this: cd Tests/JCTest mkdir b cd b cmake .. ctest -j 8 You may have to modify CMakeLists.txt number of iterations passed into TestTime based on your machine speed. I did not want to use a sleep, I wanted to make sure the machine was really busy. |
|
(0014886) Bill Hoffman (manager) 2009-02-10 14:24 |
-W is now working with -j |
|
(0014921) Roscoe A. Bartlett (reporter) 2009-02-13 00:52 |
Sorry I did not respond sooner but these got sent to my Junk Mail folder and I just happened to see them :-) I will try this out again the next time I update my CVS version. |
|
(0017514) Bill Hoffman (manager) 2009-09-14 21:22 |
This should be fixed now, and all the code is the same for j1 and jN so it should not happen again... |
| Notes |
| Issue History | |||
| Date Modified | Username | Field | Change |
| 2008-11-07 10:45 | Roscoe A. Bartlett | New Issue | |
| 2008-11-07 17:54 | Roscoe A. Bartlett | Note Added: 0014063 | |
| 2008-12-16 16:59 | Bill Hoffman | Status | new => assigned |
| 2008-12-16 16:59 | Bill Hoffman | Assigned To | => Bill Hoffman |
| 2008-12-18 22:03 | Bill Hoffman | Note Added: 0014404 | |
| 2008-12-24 14:03 | Roscoe A. Bartlett | Note Added: 0014437 | |
| 2008-12-24 14:08 | Roscoe A. Bartlett | Note Added: 0014438 | |
| 2008-12-24 15:35 | Bill Hoffman | Note Added: 0014439 | |
| 2008-12-24 15:46 | Bill Hoffman | Note Added: 0014440 | |
| 2008-12-29 18:11 | Bill Hoffman | Note Added: 0014452 | |
| 2009-02-10 14:24 | Bill Hoffman | Note Added: 0014886 | |
| 2009-02-13 00:52 | Roscoe A. Bartlett | Note Added: 0014921 | |
| 2009-09-14 21:22 | Bill Hoffman | Note Added: 0017514 | |
| 2009-09-14 21:22 | Bill Hoffman | Status | assigned => closed |
| 2009-09-14 21:22 | Bill Hoffman | Resolution | open => fixed |
| Issue History |
| Copyright © 2000 - 2018 MantisBT Team |