Jump to content

Wikipedia:Reference desk/Archives/Computing/2016 June 7

From Wikipedia, the free encyclopedia
Computing desk
< June 6 << May | June | Jul >> June 8 >
Welcome to the Wikipedia Computing Reference Desk Archives
The page you are currently viewing is an archive page. While you can leave answers for any questions shown below, please ask new questions on one of the current reference desk pages.


June 7

[edit]

The fastest software to convert words to Unicode numbers

[edit]

Hello. I would like to ask what is the fastest software to convert words to Unicode numbers. I have a text file, which contains over one million lines in it. The text file looks like:

  1. digeneous
  2. jubate
  3. parramatta
  4. puppyfish
  5. sizableness

...and more.

I would like to convert all lines to Unicode numbers, and then the file will look like:

  1. 6400 6900 6700 6500 6E00 6500 6F00 7500 7300
  2. 6A00 7500 6200 6100 7400 6500
  3. 7000 6100 7200 7200 6100 6D00 6100 7400 7400 6100
  4. 7000 7500 7000 7000 7900 6600 6900 7300 6800
  5. 7300 6900 7A00 6100 6200 6C00 6500 6E00 6500 7300 7300

...and more.

Because there are over one million lines in the text file, I need software to do this conversion and hope it as fast as possible.

Thank you in advance! --Capim Dourado (talk) 08:30, 7 June 2016 (UTC)[reply]

Are you knowledgeable enough with HTML ? You could perhaps adapt the code from a page like the following [1] ( you still would have to do some copy-paste from the file and to the output file , of course. ) --Askedonty (talk) 13:21, 7 June 2016 (UTC)[reply]
This perl one-liner assumes an ascii encoded file, as with the example lines you gave.
perl -lne 'print join " ", map uc."00", (unpack("H*") =~ m/../g)' file_name
-- ToE 13:36, 7 June 2016 (UTC)[reply]
As pointed out below, you might not be intentionally seeking a little-endian representation, in which case you might prefer:
perl -lne 'print join " ", map sprintf("%04X",ord), split //' file_name
-- ToE 14:29, 7 June 2016 (UTC)[reply]
In Python 3:
#!/usr/bin/python3

with open("words.txt", "rb") as f:
    for line in f:
        print(" ".join(["%04x" % c for c in line]))
Note that unicode for 'd' is U+0064, not 6400. As to speed - this is a computationally trivial task, so any sensible implementation is going to be I/O bound. -- Finlay McWalter··–·Talk 13:59, 7 June 2016 (UTC)[reply]
The CPU performance is probably irrelevant. A faster disk will make it faster. But a million lines is a trivial task - my example processed a million lines, reading and writing to a cheap old SSD, in 3.7 seconds. @Thinking of England:'s example, being similarly I/O bound, runs in just the same time. Don't worry about making something fast until you have concrete evidence that it is slow; in this case it's not. -- Finlay McWalter··–·Talk 16:41, 7 June 2016 (UTC)[reply]
Your last sentence is sage advice. But, ... if we really wanted to, I believe that we can do a bit better. On my system, this C filter is about five times faster.
#include <stdio.h>
#define LF 0x0a
#define CR 0x0d
int main() {
  int c, prev = LF;
  while ((c = getchar()) != EOF) {
    if (c == LF || c == CR) putchar(c);
    else {
      if (prev != LF && prev != CR) putchar(' ');
      printf("%04X",c);
    }
    prev = c;
  }
  return 0;
}
(Edited to work with either DOS or Unix style newlines, passing both 0x0A and 0x0D, while %04X hex dumping all other characters.) -- ToE 17:22, 7 June 2016 (UTC)[reply]
It would be better to use getwchar; otherwise, this code will only work if the file is ASCII or the character encoding is Latin-1. Even getwchar is not guaranteed to return Unicode. Python 3 has an advantage here because its strings are guaranteed to be Unicode.
Checking for CR has no effect on Windows, because line endings are replaced with '\n' (== 10) by the C library. This check would only be useful if you're processing files with CRLF line endings on a non-CRLF system. -- BenRG (talk) 17:42, 8 June 2016 (UTC)[reply]
To underline what others have said, unless you're running this on lower-end embedded hardware (something like an Arduino) processor speed is not the limiting factor here. Fetching the data from storage is. Modern CPUs are outrageously fast. The problem is everything else is so much slower that the CPU has to wait on reads and writes. If you don't believe us, load the file into memory first (into a ramdisk, such as tmpfs on BSD or Linux), run the program, and then compare the time versus running it "normally". Clear the operating system disk caches first for more accuracy. --71.110.8.102 (talk) 19:33, 7 June 2016 (UTC)[reply]
And the more I think about this, the more I suspect there's some X-Y problem here. What are you trying to accomplish? Why do you want to convert your text to a textual representation of the bytes? And even assuming this is the right solution, why do you care about the speed? If you're just doing this as a one-off thing, who cares about speed? Just run it in the background. If something is constantly generating your input text, then the way to make things faster is to grab the output while it's in memory and do the conversion then. On Unix you could do this with a pipe. What makes it slow is storing the whole file on disk/SSD/whatever and then reading it back into memory to do the conversion. --71.110.8.102 (talk) 21:38, 8 June 2016 (UTC)[reply]

Icon sought

[edit]

Peeps, I rinsed my bytes again, it’s the 06 of June and I already lost 500mb. I need help…

I’m looking for a icon to cover a ‘’globe’’ icon entitled “Internet” or “WWW” as a Folder, and two more icons covering “Download” and “Upload” as Folders – hope you guys understand what I mean…

1) Give me something futuristic. No grandpa style stuff.

2) Before helping with point one, help me understand something i.e., if WWW or Internet is based on the world, what would be for the Space/Universe Wide Web? – Give me an icon based on Space/Universe Wide Web, that ‘won’t die out’ in other words.

Apostle (talk) 18:39, 7 June 2016 (UTC)[reply]

2)Before the internet as we know it existed, some people planned on calling it the Intergalactic_Computer_Network. I don't know what you like in terms of icons, but here are some "modern" sets you might like [2]. Here is another community site for sharing icons [3]. Here [4] is a directory of icon sites. SemanticMantis (talk) 19:31, 7 June 2016 (UTC)[reply]
Were they already spelling phrases with underscores in those days? —Tamfang (talk) 08:37, 8 June 2016 (UTC)[reply]
It has a certain appeal to it, IMO ;) SemanticMantis (talk) 15:15, 8 June 2016 (UTC)[reply]

Okay, I saved'em; have to check it some other time... Thanks. -- Apostle (talk) 19:00, 8 June 2016 (UTC)[reply]

Solving this problem using less brute force.

[edit]

The problem: given the ordered series of numbers from 1 to 109, and the possibility of inserting "+","-" or nothing between them, how many possible combinations are there to reach 100?

For example, 1+2+3-4+5+6+78+9 would be one. And there are other 10. However, to discover this, I went through all 6561 combinations. Is there a more intelligent way of finding the matching answers? Specially for longer series, it might be difficult to computer all combinations. --Llaanngg (talk) 21:32, 7 June 2016 (UTC)[reply]

Just to clarify, you want to consider longer sequences, but when you get to double digits you still consider joining them by digits so that a sequence 1..20 could include terms like 1213 and 141516? Also, are you only interested in sums to 100, or do you want to consider the general case? Dragons flight (talk) 08:32, 8 June 2016 (UTC)[reply]
Yes, 1213 would be included. And the sum does not have to be 100, that's just an example. --Llaanngg (talk) 10:44, 8 June 2016 (UTC)[reply]
The only thing that is immediately obvious to me is that the contribution of the remaining values in a partial sum is necessarily no more than the joiner of those values, which could allow you to prune branches from your search if you are already farther than that from your target value. Dragons flight (talk) 09:41, 8 June 2016 (UTC)[reply]
A related problem you might want to look into is Subset sum, which is NP complete. For example, it hints that if you keep the target at 100, dynamic programming might be faster than brute force at some point. It also sort of hints that if the target grows as well, this problem may be NP-complete, unless there is some elegant way to use the specifics to achieve a significant speedup.192.176.1.95 (talk) 10:13, 8 June 2016 (UTC)[reply]
@Llaanngg: I suppose there's more than 6561 possible expressions. You can insert any of the three symbols: 'plus' or 'minus' or 'nothing' between any two consecutive digits, which actually makes possibilities. However, you can also insert any of them in front of 1. This doubles the previous result (as 'plus' and 'nothing' mean the same in this case, so you have effectively two possibilities), so you have 13,122 expressions to check. --CiaPan (talk) 14:13, 8 June 2016 (UTC)[reply]
Putting something before 1 would not be literally "inserting between" as the problem states. Llaanngg (talk) 14:28, 8 June 2016 (UTC)[reply]
And allowing leading negatives doesn't really enrich the problem because that just doubles the existing space by mapping an original sum to one with all signs reversed for a negative of the original total. This leads to the interesting observation that there are 11 combinations which sum to 100, but only 1 which sums to -100. I'm not sure why this should be when of the 6561 sums 3598 are positive, 2952 are negative, and 11 are zero. Of course, if leading negatives are allowed, then there are 12 combinations each which sum to 100 or -100. So in a way, allowing for leading negatives hides some internal structure. -- ToE 16:09, 8 June 2016 (UTC)[reply]
1 .. N Number of possible combinations Number of combination to reach 100 Percentage
1 1 0 0
2 3 0 0
3 9 0 0
4 27 0 0
5 81 0 0
6 243 0 0
7 729 2 0.27
8 2187 5 0.23
9 6561 11 0.17
10 19683 24 0.12
11 59049 43 0.073
12 177147 88 0.050
13 531441 208 0.039
14 1594323 399 0.025
15 4782969 852 0.017
16 14348907 2051 0.014
17 43046721 4450 0.010
18 129140163 10082 0.008

As the number of possible sequences triples with each increase in the number of terms, it seems like the number of ways to reach the fixed target of 100 is roughly doubling (perhaps a bit more than doubling). This makes intuitive sense since the more terms, the more likely one is to diffuse away from the origin. Dragons flight (talk) 07:46, 9 June 2016 (UTC)[reply]