Project Kashmir #2 : WAC I485 Case Status

It's great!!

As I wrote before, my program was a prototype thrown together quickly,
and features were bolted on later,
so it has become difficult to maintain.
Also, some manual work is required to create the reports.

If you run into any difficulties with Project Anaconda, let me know.
-kashmir
Originally posted by jokerpoker_us
Actually, Project Anaconda is very much alive. It's running right now and doing a complete scan, so a Project Anaconda report
is due this week. I had a lot of technical glitches to figure out, but
I have the whole thing working right now. As I iron out the
creases and get the whole automation into place, I believe I
can run a report every week.

It has not been a very good month for me. I have spent days on Project Anaconda, and like most things in my life I expect
it to be successful. So I will continue to work on it very hard,
ignoring all criticism. It's easy to say stupid things and very
difficult to actually do things.

Please bear with me as I work on fully automating Project
Anaconda. It gets better every day. And once I finish it off completely,
you won't have any reason to make stupid-ass comments.
 
I believe a lot in automation. I know I am so close on Project
Anaconda and then we have this shutdown at my company.

I am working but the load will be light. And I swear I am going
to nail down the minor issues in Anaconda over this week.

Meanwhile I am running a full scan for wac02 right now. So
there will be a report this week for sure.
 
Kashmir,
How much time does a full scan by your program take?

Here are some of the things I did for Project Anaconda:

1. Set up my own Fedora Linux server which can run all
the time. I don't trust Windoze, and I won't be using my
office hardware.
2. Project Anaconda is stored in a CVS repository on this
Linux server. In the future I could give guest access to this
CVS repository.
3. I develop Anaconda on my laptop, but I check all code
into the repository. I then have scripts on my Linux server
which run multiple ant targets to fetch data for the various
series. I run about 10 processes which handle about
30 case series each.
4. Project Anaconda relies heavily on free proxy servers.
It automatically switches proxy servers when a proxy's request
limit is exceeded or the proxy is down (see the sketch after this list).
5. Project Anaconda is written as an API and an open-source
project with proper Javadoc, so that it can be handed over
to people when the time comes. It is being managed like
any other Jakarta project, with its own CVS repository, and I will
add Bugzilla later on.
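A minimal sketch of the rotation in item 4, assuming "host:port" strings and the 1000-lookups-per-proxy limit discussed later in this thread; the class and method names are mine, not Anaconda's:

    import java.util.ArrayList;
    import java.util.List;

    public class ProxyRotator {
        private static final int MAX_REQUESTS_PER_PROXY = 1000; // per-proxy lookup limit

        private final List proxies;      // "host:port" strings
        private int current = 0;
        private int requestCount = 0;

        public ProxyRotator(List proxies) {
            this.proxies = new ArrayList(proxies);
        }

        // Returns the proxy to use for the next request, rolling over
        // to the next entry once the per-proxy limit is exhausted.
        public synchronized String nextProxy() {
            if (requestCount >= MAX_REQUESTS_PER_PROXY) {
                current = (current + 1) % proxies.size();
                requestCount = 0;
            }
            requestCount++;
            return (String) proxies.get(current);
        }

        // Called when a proxy is down or rejects us: drop it and move on.
        public synchronized void dropCurrent() {
            proxies.remove(current);
            if (proxies.isEmpty()) {
                throw new IllegalStateException("no proxies left");
            }
            current = current % proxies.size();
            requestCount = 0;
        }
    }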

Technical difficulties.

Proxies!! How to find good proxies? Sheesh, that is hard.
Free proxies are hard to find and unreliable. Project Anaconda
switches to a proxy and tests it first. It also keeps stats
on the proxies used so that we can identify the better ones.

Making proxies work for https sucks big time in Java. I feel like
kicking the Sun engineers in their balls. I tried all kinds of
libraries, like HttpUnit and HttpClient, but they suck big time at
using proxies for secure http.

Some proxies can just hang. You can specify a timeout starting
with JDK 1.4.2, but that does not work for https connections. It
sucks!!
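For reference, this is roughly how you point the stock JDK at a proxy and set the 1.4.2 timeout knobs via system properties; the host, port, and URL below are placeholders. As complained above, the timeout properties help for plain http but not reliably for https:

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ProxyDemo {
        public static void main(String[] args) throws Exception {
            // Route https traffic through a proxy (placeholder host/port).
            System.setProperty("https.proxyHost", "proxy.example.com");
            System.setProperty("https.proxyPort", "8080");

            // JDK 1.4.2 timeout knobs, in milliseconds. Per the complaint
            // above, these are honored for http but not for https.
            System.setProperty("sun.net.client.defaultConnectTimeout", "10000");
            System.setProperty("sun.net.client.defaultReadTimeout", "10000");

            URL url = new URL("https://www.example.com/"); // placeholder URL
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            System.out.println("HTTP " + conn.getResponseCode());
            conn.disconnect();
        }
    }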

Finally, the stuff Kashmir gave me is what works. It is simple
but functional, and it is still at my core, though I would love
to have HttpUnit or something like that in its place.

The proxy stuff kills me. To scan the 36,000 cases in wac02 will
take about 36 good proxies. Assuming that some proxies will
already be in use by other people, we need more like twice as
many. Proxy speed also affects your run time. Assuming
about 10 seconds per case, it will take about 100 hours to do
a full linear scan, which is like four days!!! You can do stuff in
parallel and bring this down to 1 day. I am just quoting these numbers
so smart-ass people can realize that this is not a simple problem
to solve.
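To spell out the arithmetic: 36,000 cases / 1,000 lookups per proxy = 36 proxies, and 36,000 cases x 10 seconds = 360,000 seconds = 100 hours, a bit over 4 days. Split across 4 parallel processes that is about 25 hours, roughly the 1 day quoted.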

Just the script to identify the best proxies can run for days, and it
can hang, because connection timeouts in Java do not work for
https.

Plus I have to keep the API organized and also maintain my Linux
machine. It will get better as I set it up.

And then INS can always change their site, as they did a few weeks
back. That kinda screws you badly.

Trust me, this is not a simple problem to solve, and automating
it also requires that a lot of additional stuff be done. I am very
close to achieving nirvana on this thing. Project Anaconda will
run automatically and produce a report on my machine weekly.
That is the aim, and I will get there. I need your support, people,
not smart-ass comments. That is all I ask.
 
Hi, jokerpoker_us;
> How much time does a full scan by your program take?

It usually takes 5-8 hours, but it really depends on the proxy servers.
Recently, the list of case numbers has been divided into 3-4 parts,
and one process per part is launched manually from the command line, rather than using threads, due to the unreliability of proxy servers.
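For illustration, the split is just a chunking of the case-number list; this helper is hypothetical, not Project Kashmir's actual code:

    import java.util.ArrayList;
    import java.util.List;

    public class CaseSplitter {
        // Split the full case-number list into n roughly equal chunks,
        // one chunk per command-line process.
        public static List split(List cases, int n) {
            List chunks = new ArrayList();
            int size = (cases.size() + n - 1) / n; // ceiling division
            for (int i = 0; i < cases.size(); i += size) {
                chunks.add(cases.subList(i, Math.min(i + size, cases.size())));
            }
            return chunks;
        }
    }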

> Technical difficulties.
> Proxies!! How to find good proxies?

Your difficulty is exactly what I have been facing,
and it is still my difficulty, too.

I don't have much time to reply right now,
but I have one important suggestion.

Please keep all scanned data, including the full description of each case status.
It will become very helpful for your future projects.
-kashmir
 
So you divide the task into four processes launched separately
from the command line, which is what I do. I actually run about
10 processes. But I remember you saying that only 1000
lookups are allowed per proxy. Does your command-line
program take a list of proxies or just one proxy?

Assuming 36,000 cases in year 2002, if you use four processes
then each will require 9 proxies, no? Because one proxy
handles only 1000 cases. Answer at your own convenience.

My organization of data is slightly different. I will explain here.

The valid numbers are stored in the directory structure shown
below.

Workspace\numbers\wac\I485\02\001\validnumbers.txt
Workspace\numbers\wac\I485\02\002\validnumbers.txt
Workspace\numbers\wac\I485\02\003\validnumbers.txt
...
Workspace\numbers\wac\I485\02\290\validnumbers.txt

The reports are stored in a similar fashion:

Workspace\reports\wac\I485\02\001\firstreport.csv
Workspace\reports\wac\I485\02\002\firstreport.csv
Workspace\reports\wac\I485\02\003\firstreport.csv
...
Workspace\reports\wac\I485\02\290\firstreport.csv

The workspace directory becomes a database directory with
all the reports; each run creates one report per series.
Then you can do some cool shit with the database you have
built, because Anaconda has an API to read this database.

However, the report.csv does not include the complete text
message, which I am sure I can add; I just thought it might
not be needed. As I make runs I name them first, second,
or third, which creates reports with the proper names
in the right directories. So this gives my API the ability to read
a single series, add series up, or do all kinds of cool things.
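Reading that database back is then just a directory walk. A minimal sketch assuming the layout above; the parser hook is hypothetical:

    import java.io.File;

    public class WorkspaceReader {
        // Walk reports/wac/I485/02/<series>/<runName>.csv and hand
        // each existing report file to a parser.
        public static void readRun(File workspace, String runName) {
            File base = new File(workspace, "reports/wac/I485/02");
            File[] seriesDirs = base.listFiles();
            if (seriesDirs == null) {
                return; // nothing scanned yet
            }
            for (int i = 0; i < seriesDirs.length; i++) {
                File report = new File(seriesDirs[i], runName + ".csv");
                if (report.exists()) {
                    System.out.println("reading " + report.getPath());
                    // parseCsv(report); // hypothetical parser hook
                }
            }
        }
    }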

I will keep your suggestion in mind and make the change.
 
Originally posted by jokerpoker_us

Trust me, this is not a simple problem to solve. ... I need your support, people, not smart-ass comments. That is all I ask.

Hi Joker_Poker,

First of all, thanks for your efforts and time in doing this, and thanks to Kashmir for starting this.

Now, regarding the problems you are facing, I would like to throw out a couple of suggestions that popped into my mind. They may not be right; I've not thought much about them yet.

1. Is it possible to form a grid with other volunteers who have high-speed connections at home? This solves the proxy problem: if you have sufficient volunteers you don't need to use proxies, and the total time also goes down.

2. Can we fake IP addresses? I think it's possible, though I don't yet know how.

Maybe we can discuss these suggestions in detail if you think they are valid.

Keep it up.
 
Originally posted by rk4gc
1. Is it possible to form a grid with other volunteers who have high-speed connections at home? This solves the proxy problem: if you have sufficient volunteers you don't need to use proxies, and the total time also goes down.


Count me in...
 
WAC01, 02 & 03 I485 status table as of 12/15/2003 (CSV)

Now it also includes WAC-03-001 thru 024 (ND October 2002).
-kashmir
 
Originally posted by rk4gc
...
1. Is it possible to form a grid with other volunteers who have high-speed connections at home? This solves the proxy problem: if you have sufficient volunteers you don't need to use proxies, and the total time also goes down.
...
Hi, rk4gc;
Thanks for your suggestion.
Actually, the next version of Project Kashmir will have this kind of architecture.

TO: lareds, askgc, and other future participants;
I will let you know once it's ready.

Anyway, the current priority should be getting the first version of Project Anaconda up and running stably.
-kashmir
 
Hi, jokerpoker_us;
> ... But I remember you saying that only 1000 lookups are allowed per proxy.
> Does your command line program take a list of proxies or just one proxy?
The list of proxies is provided in a properties file and loaded at startup.
Once the number of accesses reaches the limit or an exception is caught, my program automatically switches to the next proxy.
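For illustration, loading such a list might look like this; the "proxies" key and the comma-separated host:port format are my guesses, not the actual Project Kashmir properties format:

    import java.io.FileInputStream;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Properties;
    import java.util.StringTokenizer;

    public class ProxyConfig {
        // Load a comma-separated proxy list from a properties file.
        public static List loadProxies(String file) throws Exception {
            Properties props = new Properties();
            FileInputStream in = new FileInputStream(file);
            try {
                props.load(in);
            } finally {
                in.close();
            }
            List proxies = new ArrayList();
            StringTokenizer tok =
                new StringTokenizer(props.getProperty("proxies", ""), ",");
            while (tok.hasMoreTokens()) {
                proxies.add(tok.nextToken().trim()); // "host:port" entries
            }
            return proxies;
        }
    }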

> Assuming 36,000 cases in year 2002, if you use four processes then each will require 9 proxies, no?
> Because one proxy handles only 1000 cases.
> Answer at your own convenience.
Actually, due to the instability of proxy servers, at least 60-80 proxies must be prepared.
The list of proxies is managed offline manually for now... I hate it.

> My organization of data is slightly different. I will explain here.
> ...
> The workspace directory becomes a database directory with all the reports; each run creates one report per series.
> Then you can do some cool shit with the database you have built, because Anaconda has an API to read this database.
It looks cool!!

> However, the report.csv does not include the complete text message, which I am sure I can add; I just thought it might not be needed. ...
> I will keep your suggestion in mind and make the change.
I strongly recommend implementing this feature.
It will provide rich information later.
-kashmir
 
Hi, jokerpoker_us;
> ...
> Technical difficulties.
> ...
> Trust me, this is not a simple problem to solve.
> And automating it also requires that a lot of additional stuff be done.
> I am very close to achieving nirvana on this thing.
> Project Anaconda will run automatically and produce a report on my machine weekly.
> That is the aim, and I will get there. I need your support, people, not smart-ass comments. That is all I ask.

I have experienced most of your difficulties, too.
I am pleased to hear that you are so close to achieving it.
Good luck,
-kashmir
 
rk4gc,
The way to kind of fake an IP is to go through proxies. That,
unfortunately, is the most viable solution, and that is where
the problem lies. Kashmir and I address this problem by going
through a list of proxies. And finding good proxies is just
a little bit hard. But I don't see why a lot of us cannot help in
finding good proxies. There are thousands on the net; the
trick is to find the ones which are consistently available and
speedy enough. I don't think we need a bunch of people to
run the same program. I can throw sufficient hardware at it;
in fact, I have a lot of hardware to throw at this problem.
Now the only thing is to find good, solid, reliable, speedy
proxies.

If you have not noticed, this is a big problem in itself. We have
to get a group effort going to find reliable and fast proxies. It's
not hard: we just need 100 good proxies, a number which seems
daunting but is easy to achieve. You can search the net for
hundreds of proxies; the trick is to find ones which are consistently available.

Kashmir, I see that you are doing the same thing I am: swapping
proxies. That's cool. I do one more thing: I have a program which
tests the proxies by making 3-5 calls to a known server. It then
ranks the proxies based on their speed, and this list is stored
for later use.

When Anaconda runs, it takes a proxy from the list and checks it
for speed too. If the proxy is slower than a set limit, it skips to the
next proxy. This cuts down a lot of time and helps Anaconda
run faster. It still has to test every proxy, and testing proxies
can consume a lot of time. If my list had solid, reliable proxies,
Anaconda could do a sweet job, but right now I don't have a good
list of proxies. I want to build some intelligence into Anaconda
whereby it starts recommending good proxies over time, but
that is going to be hard.
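The test-and-rank step is simple in principle. A minimal sketch of the idea, with a placeholder test URL and an illustrative call count (note that proxy system properties are JVM-global, so a real tester would check proxies one at a time):

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ProxyTester {
        // Time a few requests through the given proxy to a known
        // server and return the average; proxies that fail rank last.
        public static long averageMillis(String host, String port) {
            System.setProperty("http.proxyHost", host);
            System.setProperty("http.proxyPort", port);
            final int calls = 3;
            long total = 0;
            for (int i = 0; i < calls; i++) {
                long start = System.currentTimeMillis();
                try {
                    URL url = new URL("http://www.example.com/"); // known test server
                    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                    conn.getResponseCode();
                    conn.disconnect();
                } catch (Exception e) {
                    return Long.MAX_VALUE; // dead or rejecting proxy
                }
                total += System.currentTimeMillis() - start;
            }
            return total / calls;
        }
    }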

Kashmir, can we also have some sort of collaboration? If you
run your script and send me your data file, I can read it into
Project Anaconda and create a Project Anaconda file;
Project Anaconda is already set up to handle Kashmir files.
This way we can save each other a run: if you make a run, I
simply take your data and create my CSV, which takes 2 minutes.
If we can have that, then we can save ourselves some proxies
too. And if and when I come up with good proxies, I will share them
with you.

Kashmir, one more question. Does your process hang at times
while fetching data? I assume that this is a network hang. JDK
1.4.2 provides an option to time out hung network connections,
but I have noticed that the solution they offer does not work
for https connections, which is what the USCIS site uses. I am
wondering if you have the same problem. I am now writing a Project
Anaconda module which works kind of like Apache: it will treat
scans as threads, and any scan that takes a really long time will be
terminated. Each scan thread will handle at most 5 series,
and the number of scan threads will be configurable. This, I think,
will fix the hang issues. But I'm just wondering if you face these
too.
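On Java 5 and later, java.util.concurrent gives you that termination policy almost for free; on JDK 1.4 you would roll the pool by hand. A sketch of the idea, with the scan tasks assumed to be Callables:

    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class ScanPool {
        // Run the scans on a fixed-size pool; invokeAll cancels any
        // scan still running when the deadline expires, which is the
        // "terminate a hung scan" behavior described above.
        public static void run(List<Callable<Void>> scans,
                               int threads, long timeoutMinutes)
                throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            pool.invokeAll(scans, timeoutMinutes, TimeUnit.MINUTES);
            pool.shutdown();
        }
    }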


So, people, think seriously about testing and identifying proxies.
 
Originally posted by jokerpoker_us
Kashmir, one more question. Does your process hang at times while fetching data? ...
It happens sometimes, maybe once or twice per month.
I simply kill the process manually when it happens.

Actually, I planned to implement a similar thing, but I had no time because of Project Ocean.
I think it's a good idea.
-kashmir
 
Originally posted by jokerpoker_us
... We just need 100 good proxies, a number which seems daunting but is easy to achieve. The trick is to find ones which are consistently available.
This part is not so simple.
My feeling is that it is impossible to have 100 good, reliable proxies always available.
Even a very good proxy might be unavailable sometimes.
Also, a proxy server will reject your requests if you use it repeatedly;
if that happens, you have to remove it from the list.
Actually, Project Kashmir had prepared a list of 300+ proxies, but now it has fewer than 100.

I recommend managing the list of proxies flexibly.
-kashmir
 
Hi Kashmir,
rk4gc set me thinking. I think we should involve more people.
I think my inspiration for this is Michael Crichton's new novel
"Prey".

There is a way to do that, and then this can become a group
effort. I got this idea just now in the shower, so it must
be good.

Create a program which scans only one series, so there will
be one program per series: for wac 02 001 there will be one
program, for wac 02 002 there will be another, and so on.

Each program, or predator, will have only one goal: to scan
one series. The valid numbers for each series will be
encrypted and hardcoded in a class inside the program. This
way it is simple: each predator tracks only one prey. Once it
finishes its task, it will e-mail/ftp the data to a common email address/site.

See the beauty of this? We just post 290 programs for the 290
days of 02. Nobody can find the valid numbers, so the data
is safe. Every person who wishes to see his series in the
weekly report will volunteer to run the program on a particular
day. The data will be automatically mailed to a site where it
will be assembled into a report. People will have no reason to
complain: if they do not do their job, there will be no report for
their series.

And we don't need proxies. Such a program can be written with the Kashmir core that I have in a couple of days. I can write a perl script which creates the encrypted valid-number classes for all 290
days. If 290 people run it, then we will all be done in, like, half
an hour? :) Each program will be a jar file named wac02xxx.jar
and so on.
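For illustration only, here is the shape of one predator's entry point; the class name, the embedded blob, and the decrypt/fetch/upload stubs are all placeholders, not real Kashmir or Anaconda code:

    public class Wac02001Predator {
        // Encrypted valid numbers, baked in by the generator script
        // (placeholder value).
        private static final String EMBEDDED_NUMBERS = "...";

        public static void main(String[] args) throws Exception {
            String[] cases = decrypt(EMBEDDED_NUMBERS);
            StringBuffer out = new StringBuffer();
            for (int i = 0; i < cases.length; i++) {
                out.append(cases[i]).append('\t')
                   .append(fetchStatus(cases[i])).append('\n');
            }
            upload(out.toString()); // e-mail/ftp home to the common site
        }

        private static String[] decrypt(String blob) { return new String[0]; } // stub
        private static String fetchStatus(String c) { return ""; }             // stub
        private static void upload(String data) { }                            // stub
    }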

Advantages:
1. Every program, or predator, is a very simple program.
2. The program does not need configuration.
3. The program does not need proxies.
4. Each program is a single jar file to be run under the JDK.
5. The data is encrypted, so the people running the
program do not see it.
6. Think about it: even if USCIS were to reduce the number of hits
from 1000 to 250, this might still work.

One thing is clear: the people do not get the data; the admin does.
What do you think of this idea? Any more suggestions?
 
Originally posted by jokerpoker_us
Kashmir, can we also have some sort of collaboration? If you run your script and send me your data file, I can read it into Project Anaconda and create a Project Anaconda file. ...
As a short-term solution, I may send you my scanned data.
However, what I really want you to take over is the scanning part, with the proxies.
That may be what you don't want to do.

My proposal is:
divide the list of I485 case numbers into two parts.
Each project scans one part, then we send the scanned data to each other.
I have one requirement for the scanned data:
each case status must include the full description and the scanned timestamp, like my TSV file.
I don't care about the format.
For the ratio, I may start with 10:0 this week, then 9:1 next week, and so on, depending on your progress.
-kashmir
 
Sure, Kashmir.
I think I can do better than that. I can provide you with a
Kashmir project file for, like, 10,000 cases. You just have to
tell me which series to scan, and I will provide you the data
for those series within 24 hours max.

On the side, I will also try the distributed-computing idea.
It has worked for many a program where it was impossible
to do things on one computer. I will see what comes out of
it.
 
Hi, jokerpoker_us;
I think it is a great idea, and,
as I posted before, the next version of Project Kashmir has a similar architecture with some modifications.
You will see a prototype soon.

However, I'd like you to focus on completing Project Anaconda first.
Once it's completed, you can replace the data-input module later.
-kashmir
Originally posted by jokerpoker_us
Hi Kashmir,
rk4gc set me thinking. I think we should involve more people. ...
What do you think of this idea? Any more suggestions?
 
Originally posted by jokerpoker_us
Sure, Kashmir.
I think I can do better than that. I can provide you with a
Kashmir project file for, like, 10,000 cases. You just have to
tell me which series to scan, and I will provide you the data
for those series within 24 hours max.
...
Hi, jokerpoker_us;
That's great.
Can you scan WAC-02-222 thru 289 next weekend?
It includes 8,467 I-485 cases.
Thanks,
-kashmir
 