[NMLUG] Python tight loop causing massive CPU barfage



Sorry, make that 4000 directories and 735,000 files in 9 
minutes.

On Wed, 09 Feb 2005 18:15:04 -0500
  Peter Espen <peter@espen.net> wrote:
> 
> On a slower (800MHz Pentium III) Linux system with a
> flatter directory hierarchy, the previous Python code
> searched 4000 directories and 175,000 files in 9 minutes:
> 
> {'dirs': 4000, 'files': 735000, 'matches': 1878000}
> 
> real    8m59.230s
> user    8m14.010s
> sys     0m38.450s
> 
> You can tell from the large number of matches that it's a
> disk with nothing but mail files.
> 
> 
> On Wed, 09 Feb 2005 17:29:48 -0500
>  Peter Espen <peter@espen.net> wrote:
>> 
>> The following Python code searched a hierarchy of 28,517
>> directories and 222,766 files.  For each "regular" file
>> found, it opened the file and searched roughly the first
>> 1024 bytes' worth of lines (readlines(1024) takes a size
>> hint in bytes, not a line count) for any line beginning
>> with 'Subject'.
>> 
>> It ran in 7 minutes 13 seconds on a fast Linux system.
>> 
>> import os
>> import re
>> import stat
>> 
>> # Python 2 code: os.path.walk and the print statement
>> # were removed in Python 3.
>> 
>> # Compile the pattern once, not once per file.
>> myre = re.compile('^Subject')
>> 
>> def foo(total, dirname, filelist):
>>     # Callback for os.path.walk: called once per directory.
>>     total['dirs'] += 1
>> 
>>     for mf in filelist:
>>         fn = os.path.join(dirname, mf)
>>         try:
>>             mode = os.stat(fn)[stat.ST_MODE]
>>         except OSError:
>>             continue    # entry vanished or cannot be stat'ed
>> 
>>         if stat.S_ISREG(mode):
>>             total['files'] += 1
>> 
>>             the_file = open(fn, 'r', 1024)
>>             try:
>>                 # 1024 is an approximate size hint in bytes
>>                 for line in the_file.readlines(1024):
>>                     if myre.match(line):
>>                         total['matches'] += 1
>>             finally:
>>                 the_file.close()
>> 
>> total = {'dirs': 0, 'files': 0, 'matches': 0}
>> 
>> os.path.walk('.', foo, total)
>> 
>> print total
>> 
>> Output:
>> 
>> {'dirs': 28517, 'files': 222766, 'matches': 402}
>> 
>> real    7m13.328s
>> user    0m26.100s
>> sys     0m10.020s
>> 
>>> On Tue, 08 Feb 2005 20:44:17 -0700
>>>  Paul Tietjens <paul.tietjens@moriarty.k12.nm.us> wrote:
>>>> I have a Python script that essentially opens a large
>>>> number of files (between 70,000 and 230,000 or so), reads
>>>> the first 1024 bytes of each, and looks for a string match.
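
The per-file step described here might look roughly like the sketch below. The original script was not posted, so the function name head_contains and its parameters are hypothetical. It reads at most the first 1024 bytes in binary mode and does a plain substring test.

```python
def head_contains(path, needle, nbytes=1024):
    """Return True if `needle` (bytes) occurs in the first `nbytes` bytes of `path`."""
    try:
        with open(path, 'rb') as fh:
            return needle in fh.read(nbytes)
    except OSError:
        # Unreadable or vanished file: treat as no match.
        return False
```

A substring test with `in` avoids regular-expression overhead when the search string is fixed. Something like head_contains(path, b'Subject: hello') would be called once per file.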
>>>> 
>>>> The goal is to search an entire partition full of 
>>>>Maildirs for specific emails.
>>>> 
>>>> I want the process to happen as fast as possible.  So
>>>> far, it takes around 21 minutes, but there's a snag:
>>>> while this script is running, every other process on the
>>>> machine becomes sluggish to the point of
>>>> unresponsiveness.
>>>> 
>>>> No amount of playing with nice and priority levels seems 
>>>>to help.
>>>> 
>>>> What has helped is a small sleep() in the loop, but
>>>> that increases the time taken to complete the task
>>>> considerably (from 21 minutes to over an hour).
>>>> 
>>>> In the end, I set up a goofy sort of throttling that
>>>> adjusts the sleep() time according to the load average.
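
A load-based throttle of the kind described could be sketched as follows. This is only a guess at the general shape, since the actual throttling code was not posted. The names throttle_delay and throttled, and every threshold value, are made up for illustration. It relies on os.getloadavg(), which is available on Unix-like systems.

```python
import os
import time

def throttle_delay(load, target=1.0, step=0.05, max_delay=1.0):
    """Seconds to sleep: zero at or below `target` load, growing linearly above it."""
    if load <= target:
        return 0.0
    return min(max_delay, (load - target) * step)

def throttled(items, check_every=100):
    """Yield items, pausing after every `check_every` items if the load is high."""
    for i, item in enumerate(items, 1):
        yield item
        if i % check_every == 0:
            # First element of getloadavg() is the 1-minute load average.
            delay = throttle_delay(os.getloadavg()[0])
            if delay:
                time.sleep(delay)
```

Wrapping the list of files as `for path in throttled(paths):` checks the load only once every hundred files, so a lightly loaded machine pays almost nothing for the throttle.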
>>>> 
>>>> Is there a better way to do this?  I'm not much of a 
>>>>coder, and I know there are a couple on this list - so 
>>>>any tips offered, no matter how nebulous, would be great.
>>>> 
>>>> Thanks in advance!
>> _______________________________________________
>> NMLUG mailing list
>> NMLUG@nmlug.org
>> http://www.nmlug.org/mailman/listinfo/nmlug