Wikipedia:Reference desk/Archives/Computing/2015 July 1
Welcome to the Wikipedia Computing Reference Desk Archives
The page you are currently viewing is an archive page. While you can leave answers for any questions shown below, please ask new questions on one of the current reference desk pages.
July 1
RapidMiner XPath query help
I am trying to extract tabular information from pages on a website, using RapidMiner. I have:
- page links stored in an Excel file (for the code snippet below I have just used a single page)
- a RapidMiner process that accesses these links one at a time and extracts the tabular data
The problem is that this table can have any number of rows (the count varies across the tables on different pages). The process I have created can get the table's data rows, but how can I modify it to iterate over n table rows dynamically?
The XML for the Rapidminer process is below:
<?xml version="1.0" encoding="UTF-8" standalone="no"?> <process version="6.4.000">
<context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="6.4.000" expanded="true" name="Process"> <process expanded="true"> <operator activated="true" class="text:create_document" compatibility="6.4.001" expanded="true" height="60" name="Create Document" width="90" x="45" y="75"> <parameter key="text" value="http://www.advancepierre.com/categories/Foodservice/Fully-Cooked-Burgers-Chopped-Steaks-and-Patties/The-PUB-Steak-Burger-Products.aspx?category=Foodservice"/> <parameter key="add label" value="true"/> <parameter key="label_type" value="text"/> <parameter key="label_value" value="_link"/> </operator> <operator activated="true" class="text:documents_to_data" compatibility="6.4.001" expanded="true" height="76" name="Documents to Data" width="90" x="179" y="75"> <parameter key="text_attribute" value="Link"/> <parameter key="add_meta_information" value="false"/> </operator> <operator activated="true" class="web:retrieve_webpages" compatibility="5.3.002" expanded="true" height="60" name="Get Pages" width="90" x="45" y="210"> <parameter key="link_attribute" value="Link"/> <parameter key="page_attribute" value="myPage"/> <parameter key="random_user_agent" value="true"/> </operator> <operator activated="true" class="text:data_to_documents" compatibility="6.4.001" expanded="true" height="60" name="Data to Documents" width="90" x="179" y="210"> <parameter key="select_attributes_and_weights" value="true"/> <list key="specify_weights"> <parameter key="myPage" value="1.0"/> </list> </operator> <operator activated="true" class="text:process_documents" compatibility="6.4.001" expanded="true" height="94" name="Process Documents" width="90" x="313" y="210"> <parameter key="create_word_vector" value="false"/> <parameter key="keep_text" value="true"/> <process expanded="true"> <operator activated="true" class="multiply" compatibility="6.4.000" expanded="true" height="76" name="Multiply" width="90" x="45" y="75"/> <operator 
activated="false" class="loop" compatibility="6.4.000" expanded="true" height="76" name="Loop" width="90" x="246" y="210"> <parameter key="set_iteration_macro" value="true"/> <parameter key="macro_name" value="itr"/> <parameter key="iterations" value="20"/> <process expanded="true"> <operator activated="true" class="text:extract_information" compatibility="6.4.001" expanded="true" height="60" name="Extract Information (2)" width="90" x="179" y="75"> <parameter key="query_type" value="XPath"/> <list key="string_machting_queries"/> <list key="regular_expression_queries"/> <list key="regular_region_queries"/> <list key="xpath_queries"> <parameter key="Rw" value="//*[@id='body_body_tbody']/h:tr[${itr}]/h:td/h:strong/text()"/> </list> <list key="namespaces"/> <list key="index_queries"/> <list key="jsonpath_queries"/> </operator> <connect from_port="input 1" to_op="Extract Information (2)" to_port="document"/> <connect from_op="Extract Information (2)" from_port="document" to_port="output 1"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="source_input 2" spacing="0"/> <portSpacing port="sink_output 1" spacing="0"/> <portSpacing port="sink_output 2" spacing="0"/> </process> </operator> <operator activated="true" class="text:extract_information" compatibility="6.4.001" expanded="true" height="60" name="Extract Information" width="90" x="246" y="75"> <parameter key="query_type" value="XPath"/> <list key="string_machting_queries"/> <list key="regular_expression_queries"/> <list key="regular_region_queries"/> <list key="xpath_queries"> <parameter key="Hierarchy" value="//*[@id='form1']/h:div[4]/h:div[2]/h:p[@class='breadcrumb']/text()"/> <parameter key="Hierarchy_L1" value="//*[@id='form1']/h:div[4]/h:div[2]/h:h2/text()"/> <parameter key="Tbl_Rw_Angus" value="//*[@id='body_body_tbody']/h:tr[1]/h:td/h:strong/text()"/> </list> <list key="namespaces"/> <list key="index_queries"/> <list key="jsonpath_queries"/> </operator> <connect from_port="document" 
to_op="Multiply" to_port="input"/> <connect from_op="Multiply" from_port="output 1" to_op="Extract Information" to_port="document"/> <connect from_op="Extract Information" from_port="document" to_port="document 1"/> <portSpacing port="source_document" spacing="0"/> <portSpacing port="sink_document 1" spacing="0"/> <portSpacing port="sink_document 2" spacing="0"/> </process> </operator> <connect from_op="Create Document" from_port="output" to_op="Documents to Data" to_port="documents 1"/> <connect from_op="Documents to Data" from_port="example set" to_op="Get Pages" to_port="Example Set"/> <connect from_op="Get Pages" from_port="Example Set" to_op="Data to Documents" to_port="example set"/> <connect from_op="Data to Documents" from_port="documents" to_op="Process Documents" to_port="documents 1"/> <connect from_op="Process Documents" from_port="example set" to_port="result 1"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> </process> </operator>
</process>
The ideal output would be like:
The PUB® Steak Burger Products | Angus | 215-960 | Flamebroiled USDA Choice Angus Beef Chuck Steak Burger | 6 | 28 | 10.50 |
The PUB® Steak Burger Products | Angus | 215-940 | Flamebroiled USDA Choice Angus Beef Chuck Steak Burger | 4 | 40 | 10 |
The PUB® Steak Burger Products | Angus | 215-930 | Flamebroiled USDA Choice Angus Beef Chuck Steak Burger | 3 | 56 | 10.50 |
The PUB® Steak Burger Products | Choice | 15-960 | Flamebroiled USDA Choice Angus Beef Chuck Steak Burger | 6 | 27 | 10.13 |
The PUB® Steak Burger Products | Choice | 15-940 | Flamebroiled USDA Choice Angus Beef Chuck Steak Burger | 4 | 40 | 10 |
The PUB® Steak Burger Products | Choice | 15-930 | Flamebroiled USDA Choice Angus Beef Chuck Steak Burger | 3 | 53 | 9.94 |
- This is a case where I'd use 'lynx -dump' to pull the whole page as preformatted text. Hitting that particular page returns:
Item # Product Name Portion Size (oz.) Portions Per Case Case Weight(lb.) Angus [192]215-960 Flamebroiled USDA Choice Angus Beef Chuck Steak Burger 6.000000 28 10.50 [193]215-940 Flamebroiled USDA Choice Angus Beef Chuck Steak Burger 4.000000 40 10.00 [194]215-930 Flamebroiled USDA Choice Angus Beef Chuck Steak Burger 3.000000 56 10.50 Choice [195]15-960 Flamebroiled USDA Choice Beef Chuck Steak Burger 6.000000 27 10.13 [196]15-940 Flamebroiled USDA Choice Beef Chuck Steak Burger 4.000000 40 10.00 [197]15-930 Flamebroiled USDA Choice Beef Chuck Steak Burger 3.000000 53 9.94 [198]22801-761 Flamebroiled USDA Choice Beef Chuck Steak Burger 4.000000 40 10.00 [199]22800-761 Flamebroiled USDA Choice Beef Chuck Steak Burger 3.000000 56 10.50 Original [200]15-260 Flamebroiled Beef Steak Burger 6.000000 27 10.12 [201]15-250 Flamebroiled Beef Steak Burger 5.000000 32 10.00 [202]15-250-40 Flamebroiled Beef Steak Burger 5.000000 128 40.00 [203]15-245 Flamebroiled Beef Steak Burger 4.500000 36 10.12 [204]15-240 Flamebroiled Beef Steak Burger 4.000000 40 10.00 [205]15-230 Flamebroiled Beef Steak Burger 3.000000 53 9.94 [206]15-330-20 Flamebroiled Beef Steak Burger 3.000000 81 15.19 [207]15-230-2 Flamebroiled Beef Steak Burger with Foil Bags 3.000000 160 30.00 [208]15-275 Flamebroiled Beef Steak Burger 2.750000 58 9.96 [209]15-224 Flamebroiled Beef Steak Burger 2.400000 68 10.20 [210]10712 Flamebroiled Mini Beef Steak Burger with Bun 2.200000 72 9.90 [211]22985-330 Flamebroiled Beef Steak Burger, Strip Steak Shape CN 3.000000 56 10.50 [212]15-338-9 Flamebroiled Beef Steak Burger CN 3.800000 67 15.91 [213]15-330-09 Flamebroiled Beef Steak Burger CN 3.000000 81 15.19 [214]3-15-327-09 Flamebroiled Beef Steak Burger CN 2.700000 175 29.53 [215]15-327-09 Flamebroiled Beef Steak Burger CN 2.700000 88 14.85 [216]3-15-324-09 Flamebroiled Beef Steak Burger CN 2.400000 200 30.00 [217]15-324-09 Flamebroiled Beef Steak Burger CN 2.400000 90 13.50 [218]15-320-09 Flamebroiled Beef Steak Burger 
CN 2.010000 114 14.32 [219]15-312-9 Flamebroiled Mini Beef Steak Burger CN 1.200000 135 10.12 Smart Picks™ Beef Steak Burgers [220]68050 Smart Picks™ Flamebroiled Beef Steak Burger CN 2.000000 170 21.25 [221]68001 Smart Picks™ Flamebroiled Beef Steak Burger CN 1.600000 210 21.00
- As you can see, the data is formatted reasonably nicely. It wouldn't be hard to strip off the last three numbers of each line and the item number from the beginning. The [###] fields are symbols for the links listed at the bottom of the dump (which I did not show). 199.15.144.250 (talk) 12:55, 1 July 2015 (UTC)
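To make the stripping step concrete, here is a minimal sketch (in Python rather than lynx's own tooling) of parsing one line of the dump shown above. The field layout — a bracketed link number, an item number, a free-text product name, then three numeric columns — is assumed from the dump; the regular expression is illustrative, not part of anyone's pipeline.

```python
import re

# One line taken from the lynx -dump output above.
line = "[192]215-960 Flamebroiled USDA Choice Angus Beef Chuck Steak Burger 6.000000 28 10.50"

# [link#] item-number  product name  portion-size  portions  case-weight
pattern = r"\[\d+\](\S+)\s+(.*?)\s+([\d.]+)\s+(\d+)\s+([\d.]+)$"

m = re.match(pattern, line)
if m:
    item, name, size, portions, weight = m.groups()
    # Rebuild the pipe-separated layout the questioner asked for.
    print(" | ".join([item, name, size, portions, weight]))
    # → 215-960 | Flamebroiled USDA Choice Angus Beef Chuck Steak Burger | 6.000000 | 28 | 10.50
```

The lazy `(.*?)` lets the product name absorb everything up to the first point where three trailing numeric fields can still match, so names of any length work without further tuning.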
The problem here is that I need to accomplish this task using RapidMiner only - how can it be done using RapidMiner? — Preceding unsigned comment added by 156.107.90.66 (talk) 04:23, 3 July 2015 (UTC)
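One general XPath remedy for a variable row count is to drop the positional index entirely: instead of querying `tr[${itr}]` once per loop iteration, a single query such as `//*[@id='body_body_tbody']/h:tr/h:td/h:strong/text()` selects every matching row at once, however many there are (whether a given RapidMiner operator returns all matches or only the first depends on the operator's settings, so this would need checking in the process itself). The sketch below illustrates the principle with Python's standard library on a trimmed stand-in for the page's table; the table id and row structure are assumed from the XPath in the process XML above.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the page's table (structure assumed from the
# query //*[@id='body_body_tbody']/h:tr/h:td/h:strong/text()).
html = """
<table id="body_body_tbody">
  <tr><td><strong>215-960</strong></td></tr>
  <tr><td><strong>215-940</strong></td></tr>
  <tr><td><strong>215-930</strong></td></tr>
</table>
"""

root = ET.fromstring(html)
# "./tr/td/strong" matches every row in one query, so the row count
# never has to be known (or looped over) in advance.
items = [el.text for el in root.findall("./tr/td/strong")]
print(items)  # → ['215-960', '215-940', '215-930']
```

The same idea carries over to any XPath engine: an unindexed step returns a node-set sized by the document, which is exactly what a table with n rows needs.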
NFC
Is there any evidence that NFC is likely to be more successful than previous short-range contactless technologies such as RFID and Bluetooth? — Preceding unsigned comment added by 90.201.184.38 (talk) 05:59, 1 July 2015 (UTC)
- It is mainly about range. If you want communication to have a very short range, NFC is better. If you want a medium range, Bluetooth is better. If you want a long range, WiFi is better. If you want to reach anywhere in the world, you pass it off to the cell tower. 199.15.144.250 (talk) 12:35, 1 July 2015 (UTC)
- See Near field communication for our article. Tevildo (talk) 17:57, 2 July 2015 (UTC)
How to start windows in safe mode, then restart other services
Hi, I have a problem with my Vista computer from c. 2010, which takes forever (~40 mins) to boot, giving me the grey screen of death at the start. So I start in safe mode, and that's fine, but there's no internet and no sound. Presumably no video either. Can I start Windows in safe mode, *then* load up all the other things, one at a time, as needed? I only need a few things - as I say, sound, video, internet. Networking per se is present, and it recognises the ethernet connection, but it won't let me on the web, won't connect to my ISP, etc. It just says, "Can't create this connection." IBE (talk) 18:42, 1 July 2015 (UTC)
- For clarity, are you running safe mode with networking, or safe mode? Nil Einne (talk) 19:03, 1 July 2015 (UTC)
- I'm pretty sure I did it with networking - that was what I ticked intentionally, so unless I'm doing something wrong, it's with networking. IBE (talk) 19:41, 1 July 2015 (UTC)
- I do not think you can "load" missing services. In safe mode you can only save your data to an external drive and then reset your laptop to its original state. Ruslik_Zero 20:12, 1 July 2015 (UTC)
- The top two results in this Google search seem relevant, BUT if you are not completely comfortable with making registry edits, I would not go any further. (for the record, I am seeing a result for krisdavidson.org and one for majorgeeks.com) --LarryMac | Talk 20:55, 1 July 2015 (UTC)
- How hard can it be? ;) well, I'm going to reinstall windows if I can't fix it, so it doesn't matter much, thanks I'll give it a try. IBE (talk) 16:49, 2 July 2015 (UTC)
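For context on the registry-edit route mentioned above: on Vista-era Windows, the services that safe mode is allowed to start are listed as subkeys under `HKLM\SYSTEM\CurrentControlSet\Control\SafeBoot` (`Minimal` for plain safe mode, `Network` for safe mode with networking), each subkey's default value being `"Service"` or `"Driver"`. The sketch below shows the idea; `AudioSrv` (the Windows Audio service) is used as an example service name, it requires administrator rights, and it only attempts the write when actually running on Windows.

```python
import sys

# Registry branch whitelisting services for Safe Mode with Networking
# (Vista-era layout; the Minimal branch covers plain safe mode).
SAFEBOOT_NETWORK = r"SYSTEM\CurrentControlSet\Control\SafeBoot\Network"

def safeboot_subkey(service_name: str) -> str:
    # Build the subkey path that whitelists one service for safe mode.
    return SAFEBOOT_NETWORK + "\\" + service_name

if sys.platform == "win32":  # winreg exists only on Windows
    import winreg
    # Example: allow the Windows Audio service ("AudioSrv") to start
    # in Safe Mode with Networking. Needs administrator rights.
    with winreg.CreateKey(winreg.HKEY_LOCAL_MACHINE,
                          safeboot_subkey("AudioSrv")) as key:
        winreg.SetValueEx(key, "", 0, winreg.REG_SZ, "Service")
```

Deleting the same subkey undoes the change, so it is easy to revert once the machine is repaired or reinstalled.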