Grateful for your explanation. I was checking the chart from xarthisius.xyz, but I am not very good at interpreting it, though I can see it is very informative for somebody who is. I more or less understand the first chart with percentages and case numbers, but I do not know why the "holes" part gets wider around case numbers 13k-14k. Could you please explain what that implies?
Also, is there a specific number of visas allocated to each region or country? Thank you so much.
So, in the initial entry, let's say 1 million entries participate for Asia. About 80% of them are from Nepal and Iran. The other 20% are from the rest of Asia. In a (pseudo-)random process, they use an algorithm to assign every entry a number from 1 to 1 million. They interpret the word random as having a uniform distribution, which means every entry has the same probability of getting any given number, i.e. 1/(1 million). So, in every block of 1000 consecutive numbers, about 800 will go to entries from Nepal and Iran and about 200 to the rest of Asia. They will probably also check that the statistical features (e.g. age, gender, degrees, etc.) are uniformly distributed across the blocks. So, if 20% of the Nepalese and Iranian entries are above the age of 50, then of the 800, about 160 should be above 50. And so on: this uniformity must hold in every block of 1000, across all the blocks (1000 blocks of 1000 people). Otherwise, the lottery draw is flawed.
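To make the uniformity idea concrete, here is a rough sketch of such a draw. It is only an illustration, not the actual algorithm they use: the pool size (100,000 instead of 1 million), the labels "NP_IR" and "OTHER", and the function names are all my own assumptions.

```python
import random

def assign_case_numbers(entries, seed=0):
    """Shuffle entries uniformly; the entry at index i gets case number i + 1."""
    rng = random.Random(seed)
    ranked = list(entries)
    rng.shuffle(ranked)
    return ranked

def share_per_block(ranked, label, block=1000):
    """Fraction of entries carrying `label` in each block of consecutive case numbers."""
    shares = []
    for start in range(0, len(ranked), block):
        chunk = ranked[start:start + block]
        shares.append(sum(1 for c in chunk if c == label) / len(chunk))
    return shares

# Hypothetical pool: 80% Nepal/Iran, 20% rest of Asia (scaled down for speed).
entries = ["NP_IR"] * 80_000 + ["OTHER"] * 20_000
ranked = assign_case_numbers(entries)
shares = share_per_block(ranked, "NP_IR")
print(min(shares), max(shares))  # every block should sit close to 0.80
```

Every block of 1000 ends up with roughly 800 Nepal/Iran entries, which is the "same mix in every block" property the rest of the explanation relies on.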
Then they process the cases in order: 1, 2, 3, ... If a case is not genuine (e.g. the picture is a tree), it becomes a hole, and if it is genuine by their initial standards, it becomes a selectee. They keep going until they hit their target number of selectees. So, if they wanted 24000 winners from Asia, they kept going until case 31xxx, which was the 24000th selectee.
However, as explained in paragraph 1, the Nepalese and Iranian cases make up 80% of the cases they encounter, so they have to cap the number of cases they select from those two countries. So they say: for the 3850 visas available per country (the 7% limit), based on our statistical data from previous years, we need to select 3800 people from Nepal (cases grow and the response rate is high) and 5739 from Iran. So, once they select the 3800th Nepalese case, they make all the remaining Nepalese cases holes. Say case number 12xxx is the Nepalese case that is the 3800th selectee. Then 12xxx+1 could also be Nepalese, but it will be a hole. Same with Iran, in the 14k range. This means that at the two points where they put the cutoffs, the number of holes must increase.
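The selection walk with its three ways of becoming a hole can be sketched like this. Again, this is just my reading of the process, not anything official: the function name, the status labels and the tiny example data are made up for illustration.

```python
def process_cases(ranked, fraud_flags, target, country_caps):
    """Walk case numbers in order and label each case.

    ranked       -- country code per case, in case-number order
    fraud_flags  -- True where the entry is not genuine
    target       -- total selectees wanted for the region
    country_caps -- max selectees for capped countries (hypothetical values)
    """
    statuses = []
    selected = 0
    per_country = {c: 0 for c in country_caps}
    for country, is_fraud in zip(ranked, fraud_flags):
        if selected >= target:
            statuses.append("hole_region")   # regional target already reached
        elif is_fraud:
            statuses.append("hole_fraud")    # e.g. the picture is a tree
        elif country in country_caps and per_country[country] >= country_caps[country]:
            statuses.append("hole_country")  # country cutoff already reached
        else:
            statuses.append("selected")
            selected += 1
            if country in per_country:
                per_country[country] += 1
    return statuses

# Toy example: 8 cases, caps of 1 selectee each for NP and IR, 4 selectees wanted.
ranked = ["NP", "NP", "IR", "JP", "NP", "IR", "JP", "JP"]
fraud = [False, True, False, False, False, False, False, False]
print(process_cases(ranked, fraud, target=4, country_caps={"NP": 1, "IR": 1}))
```

In the toy run, case 2 is a fraud hole, cases 5 and 6 are country-cutoff holes, and case 8 is a hole only because the regional target was already reached at case 7.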
Back to xarthisius' data and why the holes increase:
Given the uniformity described in paragraph 1, the 40% holes from the beginning until 12k are all cases that were deemed fraudulent (e.g. picture of a tree). In fact, from 1 to 31k, about 40% of every thousand cases are fraudulent entries that were made holes. This is due to the uniformity described. (Reason 1 for becoming a blue hole) ----> making 40% holes
From 12xxx, where Nepalese cases start to be cut, there's an increase in holes. These are Nepalese cases that were made holes because enough Nepalese had already been selected before their case number to fill the 3850 visas available.
Then from 14k, there's a further increase in holes, on top of the fraudulent cases and the Nepalese holes. This is due to the early cutoff for Iranians. So, the additional holes are actually Iranian entries that were made holes because enough Iranians had already been selected to fill the 3850 visas available.
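The three hole levels can be worked out with a bit of arithmetic. This is only a sketch: the 40% fraud rate is from the chart discussion, but the 40%/40% split between Nepal and Iran is a number I picked for illustration (only the combined 80% was assumed earlier), and it also assumes fraud is equally common in every country.

```python
def expected_hole_rate(fraud_rate, cut_off_shares):
    """Expected fraction of holes per block of case numbers.

    fraud_rate     -- fraction of entries disqualified as not genuine
    cut_off_shares -- shares of the pool belonging to countries already past their cutoff
    Assumes fraud is independent of country (a simplification).
    """
    genuine = 1.0 - fraud_rate
    return fraud_rate + genuine * sum(cut_off_shares)

nepal_share = 0.40  # hypothetical
iran_share = 0.40   # hypothetical

print(expected_hole_rate(0.40, []))                         # before any cutoff
print(expected_hole_rate(0.40, [nepal_share]))              # after the Nepal cutoff
print(expected_hole_rate(0.40, [nepal_share, iran_share]))  # after the Iran cutoff too
```

With these made-up shares the hole rate steps from about 40% to about 64% and then to about 88%, which is the staircase shape described above; the real step sizes depend on the real country shares.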
After 31k, where they have selected 24000 people from all of Asia, all higher case numbers become holes. So, all these holes get the "you're not selected" message on the website, and the selectees get the 1NL letter and an invitation to proceed with their process.
Sorry for the long answer and my bad grammar. I'm not re-reading it. But the main key to understanding this data is the concept of a uniform distribution (Google it). You need to get that the initial random draw in the lottery is about producing a uniform distribution when allocating case rank numbers to all entries, with respect to the statistical features of the data. This seems to be their interpretation of the word random. Note that this uniformity carries through the rest of the process. Now you can understand the lottery draw and the selection of winners. The rest of the process is about issuing 55k visas in case-number order without violating the 7% per-country limit by the end of the fiscal year. That's the whole process.
You could also use the data to infer the percentage of cases that are from Nepal, Iran and the rest of Asia. Then you can get the density (e.g. the number of Nepalese/Iranian/Yemeni/Japanese/... selectees in every block of 1000 case numbers), since we know the number of selectees from each country as published in the August 2021 bulletin, and again using the uniformity that I described.
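As a rough sketch of that inference: if the hole rate jumps at a country's cutoff, the size of the jump divided by the genuine fraction gives that country's share of the pool, and dividing a country's selectee count by the range of case numbers it was drawn from gives its density. The function names and the example numbers below are hypothetical, and the first function assumes fraud is equally common in every country.

```python
def infer_country_share(hole_before, hole_after, fraud_rate):
    """Share of the pool belonging to a country, from the hole-rate jump at its cutoff."""
    return (hole_after - hole_before) / (1.0 - fraud_rate)

def selectees_per_thousand(total_selectees, last_case_number):
    """Average number of a country's selectees per 1000 case numbers up to its cutoff."""
    return 1000.0 * total_selectees / last_case_number

# Made-up readings off a chart: holes go from 40% to 64% at a cutoff, fraud rate 40%.
print(infer_country_share(0.40, 0.64, 0.40))

# Made-up example: 500 selectees spread over the first 2000 case numbers.
print(selectees_per_thousand(500, 2000))
```

To use it on the real chart you would plug in the hole percentages read just before and just after each cutoff, and the published selectee counts per country.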